aligning text with space-normalized text

I have a string with a bunch of whitespace in it, and a series of chunks 
of that string whose indices I need to find.  However, the chunks have 
been whitespace-normalized, so that multiple spaces and newlines have 
been converted to single spaces as if by ' '.join(chunk.split()).  Some 
example data to clarify my problem:

py> text = """\
....    aaa  bb ccc
.... dd eee.  fff gggg
.... hh   i.
....    jjj kk.
.... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

Note that the original "text" has a variety of whitespace between words, 
but the corresponding "chunks" have only single space characters between 
"words".  I'm looking for the indices of each chunk, so for this 
example, I'd like:

py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

Note that the indices correspond to the *original* text so that the 
substrings in the given spans include the irregular whitespace:

py> for s, e in result:
....     print repr(text[s:e])
....
'aaa  bb'
'ccc\ndd eee.'
'fff gggg\nhh   i.'
'jjj'
'kk.'
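The relationship between the spans and the chunks can be stated as an invariant: whitespace-normalizing each recovered substring reproduces the corresponding chunk. A quick sanity check of the example data (a sketch, using the same text and expected result as above):

```python
text = "   aaa  bb ccc\ndd eee.  fff gggg\nhh   i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

# Each span's substring, once whitespace-normalized, is exactly the chunk.
for (s, e), chunk in zip(result, chunks):
    assert ' '.join(text[s:e].split()) == chunk
print('all spans check out')
```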

I'm trying to write code to produce the indices.  Here's what I have:

py> def get_indices(text, chunks):
....     chunks = iter(chunks)
....     chunk = None
....     for text_index, c in enumerate(text):
....         if c.isspace():
....             continue
....         if chunk is None:
....             chunk = chunks.next().replace(' ', '')
....             chunk_start = text_index
....             chunk_index = 0
....         if c != chunk[chunk_index]:
....             raise Exception('unmatched: %r %r' %
....                           (c, chunk[chunk_index]))
....         else:
....             chunk_index += 1
....             if chunk_index == len(chunk):
....                 yield chunk_start, text_index + 1
....                 chunk = None
....

And it appears to work:

py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True

But it seems somewhat inelegant.  Can anyone see an easier/cleaner/more 
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms.  I'm looking for 
subjectively "better" solutions. ;)
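For readers following along in Python 3, the generator above needs only `next(chunks)` in place of `chunks.next()` and a concrete exception type; a minimal port, otherwise faithful to the original:

```python
def get_indices(text, chunks):
    """Yield (start, end) spans in text for each whitespace-normalized chunk."""
    chunks = iter(chunks)
    chunk = None
    for text_index, c in enumerate(text):
        if c.isspace():
            continue
        if chunk is None:
            # Drop the single spaces so the chunk is one run of characters to match.
            chunk = next(chunks).replace(' ', '')
            chunk_start = text_index
            chunk_index = 0
        if c != chunk[chunk_index]:
            raise ValueError('unmatched: %r %r' % (c, chunk[chunk_index]))
        chunk_index += 1
        if chunk_index == len(chunk):
            yield chunk_start, text_index + 1
            chunk = None

text = "   aaa  bb ccc\ndd eee.  fff gggg\nhh   i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
print(list(get_indices(text, chunks)))
# [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
```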
6/30/2005 12:50:40 AM
comp.lang.python

Steven Bethard wrote:
[snip]
> And it appears to work:
[snip]
> But it seems somewhat inelegant.  Can anyone see an easier/cleaner/more 
> Pythonic way[1] of writing this code?
> 
> Thanks in advance,
> 
> STeVe
> 
> [1] Yes, I'm aware that these are subjective terms.  I'm looking for 
> subjectively "better" solutions. ;)

Perhaps you should define "work" before you worry about """subjectively 
"better" solutions""".

If "work" is meant to detect *all* possibilities of 'chunks' not having 
been derived from 'text' in the described manner, then it doesn't work 
-- all information about the positions of the whitespace is thrown away 
by your code.

For example, text = 'foo bar', chunks = ['foobar']
sjmachin (3097)
6/30/2005 2:08:56 AM
John Machin wrote:
> If "work" is meant to detect *all* possibilities of 'chunks' not having 
> been derived from 'text' in the described manner, then it doesn't work 
> -- all information about the positions of the whitespace is thrown away 
> by your code.
> 
> For example, text = 'foo bar', chunks = ['foobar']

This doesn't match the (admittedly vague) spec which said that chunks 
are created "as if by ' '.join(chunk.split())".  For the text:
     'foo bar'
the possible chunk lists should be something like:
     ['foo bar']
     ['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words 
have been ' '.join()ed.

STeVe
6/30/2005 4:02:22 AM
Steven Bethard wrote:

> I have a string with a bunch of whitespace in it, and a series of chunks
> of that string whose indices I need to find.  However, the chunks have
> been whitespace-normalized, so that multiple spaces and newlines have
> been converted to single spaces as if by ' '.join(chunk.split()).  Some

If you are willing to get your hands dirty with regexps:

import re
_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()


def main():
    text = """\
   aaa  bb ccc
dd eee.  fff gggg
hh   i.
   jjj kk.
"""
    chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
    assert list(indices(text, chunks)) == [(3, 10), (11, 22), (24, 40),
                                           (44, 47), (48, 51)]

if __name__ == "__main__":
    main()

Not tested beyond what you see.

Peter

__peter__ (4031)
6/30/2005 7:07:04 AM
Steven Bethard wrote:
> John Machin wrote:
> 
>> If "work" is meant to detect *all* possibilities of 'chunks' not 
>> having been derived from 'text' in the described manner, then it 
>> doesn't work -- all information about the positions of the whitespace 
>> is thrown away by your code.
>>
>> For example, text = 'foo bar', chunks = ['foobar']
> 
> 
> This doesn't match the (admittedly vague) spec

That is *exactly* my point -- it is not valid input, and you are not 
reporting all cases of invalid input; you have an exception where the 
non-spaces are impossible, but no exception where whitespaces are 
impossible.


> which said that chunks
> are created "as if by ' '.join(chunk.split())".  For the text:
>     'foo bar'
> the possible chunk lists should be something like:
>     ['foo bar']
>     ['foo', 'bar']
> If it helps, you can think of chunks as lists of words, where the words 
> have been ' '.join()ed.

If it helps, you can re-read my message.

> 
> STeVe
sjmachin (3097)
6/30/2005 11:49:20 AM
John Machin wrote:
> Steven Bethard wrote:
> 
>> John Machin wrote:
>>
>>> For example, text = 'foo bar', chunks = ['foobar']
>>
>> This doesn't match the (admittedly vague) spec
> 
> That is *exactly* my point -- it is not valid input, and you are not 
> reporting all cases of invalid input; you have an exception where the 
> non-spaces are impossible, but no exception where whitespaces are 
> impossible.

Well, the input should never look like the above.  But if for some 
reason it did, I wouldn't want the error; I'd want the indices.  So:
     text = 'foo bar'
     chunks = ['foobar']
should produce:
     [(0, 7)]
not an exception.
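One way to see why the span, not an error, is the natural answer: the original generator deletes spaces from the chunk and skips whitespace in the text, so after stripping whitespace from both sides, 'foobar' matches the non-space characters of 'foo bar' exactly. A tiny illustration (`strip_ws` is a hypothetical helper, not from the thread):

```python
def strip_ws(s):
    # Remove all whitespace, keeping only the characters that must match.
    return ''.join(s.split())

text = 'foo bar'
chunk = 'foobar'
# The chunk is consistent with the text once whitespace is ignored.
print(strip_ws(text) == strip_ws(chunk))
# True
```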

STeVe
7/1/2005 4:52:42 AM
Peter Otten wrote:
> import re
> _reLump = re.compile(r"\S+")
> 
> def indices(text, chunks):
>     lumps = _reLump.finditer(text)
>     for chunk in chunks:
>         lump = [lumps.next() for _ in chunk.split()]
>         yield lump[0].start(), lump[-1].end()

Thanks, that's a really nice, clean solution!
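On Python 3, where the iterator method `.next()` became the built-in `next()`, Peter's solution needs only that one change; a minimal port with the example data:

```python
import re

_reLump = re.compile(r"\S+")  # one match per run of non-whitespace

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        # Consume one lump per word in the normalized chunk.
        lump = [next(lumps) for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

text = "   aaa  bb ccc\ndd eee.  fff gggg\nhh   i.\n   jjj kk.\n"
chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
print(list(indices(text, chunks)))
# [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
```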

STeVe
7/1/2005 5:38:49 PM