duplicate words - vim

I have duplicate words on different lines


input
border,(SI)  border, border,(V)  border
rite,(SA) rite
edit,(TA) edit,(B)

I would like to get rid of duplicate words and keep the annotation.

output
border,(SI) (V)  
rite,(SA) 
edit,(TA) (B)

Thanks
florine
7/18/2016 2:51:21 PM
comp.editors

On Monday, July 18, 2016 at 7:51:32 AM UTC-7, florine wrote:
> I have duplicate words on different lines
> 
> input
> border,(SI)  border, border,(V)  border
> rite,(SA) rite
> edit,(TA) edit,(B)
> 
> I would like to get rid of duplicate words and keep the annotation.
> 
> output
> border,(SI) (V)  
> rite,(SA) 
> edit,(TA) (B)

$ LC_ALL=C vi file
:%j
:s/,//g
:s/^ */ /
:s/ *$/ /
:s/ \([a-z][a-z]*\)\(([A-Z][A-Z]*)\)\(.*\) \1\(([A-Z][A-Z]*)\)/ \1\2\4 \3 /g
:s/ \([a-z][a-z]*\)\(([A-Z][A-Z]*)\(([A-Z][A-Z]*)\)*\)\(.*\) \1 / \1\2\4 /g
:s/\([a-z]\)(/\1,(/g
:s/  */^M/g
:g/^$/d
:wq

For each of the two long substitution commands above (the two that start
with ":s/ \("): after running the command, if it made any substitution(s),
run the same command again, and keep doing so until it makes no
substitution(s). Only after that proceed to the next command in the
sequence.
Where ^M is shown above, use the literal <control-M> character, typically
entered via the sequence <control-V><control-M>.

The sequence given basically does the following:

join all the lines, putting space(s) between where they're joined

get rid of the comma (,) characters

have line start with exactly one space character

have line end with exactly one space character

where word and tag(s) is later followed by same word and tag(s),
combine them to just word and those tag(s) in sequence,
repeat until no more to combine

where word and tag(s) is later followed by same word with no
tag(s) following it,
combine them to just word and tag(s),
repeat until no more to combine

put commas (,) back between word and ( for all on line

break line at all occurrences of one or more spaces

delete empty lines

write out our buffer and quit
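
For anyone who wants to check the logic outside the editor, here is a
rough Python sketch of the same sequence of steps. It is not part of the
vi recipe above: the names dedupe and repeat_sub are made up for
illustration, and Python's re module stands in for the ex substitutions.
The inner tag group is non-capturing here, so the second pattern uses \3
where the vi command uses \4.

```python
import re

TAG = r"\([A-Z]+\)"  # one tag, e.g. (SI); mirrors ([A-Z][A-Z]*) in the vi patterns

# Word with one tag, later followed by the same word with a tag.
MERGE_TAGGED = re.compile(rf" ([a-z]+)({TAG})(.*) \1({TAG})")
# Word with tag(s), later followed by the same word with no tag.
MERGE_UNTAGGED = re.compile(rf" ([a-z]+)((?:{TAG})+)(.*) \1 ")


def repeat_sub(pattern, repl, line):
    """Apply one substitution pass repeatedly until a pass changes nothing."""
    while True:
        line, count = pattern.subn(repl, line)
        if count == 0:
            return line


def dedupe(text):
    line = " ".join(text.splitlines())             # :%j
    line = line.replace(",", "")                   # :s/,//g
    line = " " + line.strip(" ") + " "             # :s/^ */ /  and  :s/ *$/ /
    line = repeat_sub(MERGE_TAGGED, r" \1\2\4 \3 ", line)
    line = repeat_sub(MERGE_UNTAGGED, r" \1\2\3 ", line)
    line = re.sub(r"([a-z])\(", r"\1,(", line)     # put the commas back
    fields = [f for f in line.split(" ") if f]     # break at spaces, drop empties
    return "\n".join(fields)


sample = (
    "border,(SI)  border, border,(V)  border\n"
    "rite,(SA) rite\n"
    "edit,(TA) edit,(B)"
)
print(dedupe(sample))
```

On the three sample lines this prints border,(SI)(V), rite,(SA) and
edit,(TA)(B), one per line, with the tags adjacent rather than separated
by a space.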


Left as an exercise (-:

If the first occurrence of a word has no tags immediately after it,
but the word is later repeated with or without tags, and one wants to
likewise combine those, add the appropriate step(s) at the appropriate
point in the sequence to handle that as well.

Better specify exactly what constitutes a "word" and a "tag", and fully
describe the data format (field and field separators/terminators,
sub-field separators/terminators, record separators/terminators,
what is/isn't legal data structure/format, allowed character set,
etc.).
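
One possible take on both points, again as a Python sketch rather than
vi commands, and using a different technique from the substitutions
above: tokenize the text and collect the tags per word. The definitions
of "word" and "tag" in the comments are assumptions read off the sample
data, not a confirmed specification, and the names TOKEN, collect_tags
and format_words are hypothetical.

```python
import re

# Assumed format (a guess from the sample data, not a confirmed spec):
#   word: one or more ASCII lowercase letters, optionally followed by a comma
#   tag:  one or more ASCII uppercase letters enclosed in parentheses
#   tokens may be separated by any amount of whitespace, including newlines
#   a tag belongs to the most recent word before it
TOKEN = re.compile(r"\s*(?:([a-z]+),?|(\([A-Z]+\)))")


def collect_tags(text):
    """Map each word to its tags, in order of first appearance."""
    tags_by_word = {}  # dicts preserve insertion order
    current = None
    text = text.strip()
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {text[pos:pos + 20]!r}")
        word, tag = m.groups()
        if word is not None:
            current = word
            tags_by_word.setdefault(word, [])  # an untagged first occurrence is fine
        elif current is None:
            raise ValueError("tag found before any word")
        else:
            tags_by_word[current].append(tag)
        pos = m.end()
    return tags_by_word


def format_words(tags_by_word):
    """One word per line, with a comma before its tags if it has any."""
    return "\n".join(
        f"{word},{''.join(tags)}" if tags else word
        for word, tags in tags_by_word.items()
    )


print(format_words(collect_tags("border border,(V) rite,(SA) rite")))
```

Anything that does not fit the assumed format raises ValueError instead
of being silently passed through, which is one way of making the
"specification" explicit.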

The example works as tested against the supplied data and the
(incomplete/ambiguous) earlier "specification". It was tested with
vi(1) (nvi(1)) and vim(1); your mileage (and data) may vary.
Michael
9/24/2016 1:55:25 PM