Truncating text from a string with beginning text from another string

From a line of arbitrary text, possibly followed by some amount of
text from the beginning of the string ' Reference #\d+', where \d+
represents one or more digit characters, I want to output the line
without the ending ' Reference...' string.  For example, the input line
'some arbitrary text Refer' would become 'some arbitrary text'.

Here are two programs that seem to do what I want, but they seem
overly complicated for this task.  I'm looking for a simpler solution,
possibly by using a better regular expression than I have chosen in my
first sample code.

First sample:
use strict ;
use warnings ;

my $re = qr'^(.*)\ ( (R$)|
                     (Re$)|
                     (Ref$)|
                     (Refe$)|
                     (Refer$)|
                     (Refere$)|
                     (Referen$)|
                     (Referenc$)|
                     (Reference\ {0,1}$)|
                     (Reference\ \#\d{0,}$)
                   )'x ;

while(<DATA>) {
    chomp ;
    print "in : >$_<\n" ;
    if (my($result) = /$re/g) {
        print "out: >$result<\n" ;
    }
    else {
        print "out: >$_<\n" ;
    }
}

__DATA__
Refer
One Referenc
two three Reference
xx yy Reference Reference
def Refere Reference #xx
abc the def Refere Reference #
abc the def Refere Reference #12


Second sample:
use strict ;
use warnings ;

my $PATTERN = 'Reference #000000' ;

my $pos ;
while (<DATA>) {
    chomp ;
    $pos = -1 ;
    while ((my $ind = index($_,' R',$pos)) != -1) {
        $pos = $ind + 1 ;
    }
    print "in : >$_<\n" ;
    my $result = $_ ;

    if ($pos > 0) {
        my $re = substr($_,$pos) ;
        $re =~ s/\d+$/\\d+/ ;
        $re = qr/^$re/ ;
        if ($PATTERN =~ /$re/) {
            $result = substr($_,0,$pos-1) ;
        }
    }
    print "out: >$result<\n" ;
}

__DATA__
Refer
One Referenc
two three Reference
xx yy Reference Reference
def Refere Reference #xx
abc the def Refere Reference #
abc the def Refere Reference #12

0
google132 (5)
3/23/2007 5:44:12 PM
comp.lang.perl.misc

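Mark's second sample walks forward with index() to find the last ' R' in the line. A minimal sketch of the same idea using rindex(), which returns that position directly, is shown here. The helper name strip_by_rindex is made up for illustration, and the rule that digits only count directly after '#' is an assumption based on the problem statement.

```perl
use strict;
use warnings;

my $fix = ' Reference #';            # literal part of the suffix, leading space included

sub strip_by_rindex {                # hypothetical helper name
    my ($line) = @_;
    my $pos = rindex $line, ' R';    # position of the last ' R', or -1 if absent
    return $line if $pos < 0;

    my $tail = substr $line, $pos;   # candidate suffix, e.g. ' Referenc'
    $tail =~ s/(?<=#)\d+\z//;        # drop trailing digits only when they follow '#'

    # Keep the head only if what is left is a prefix of ' Reference #'.
    return $tail eq substr($fix, 0, length $tail)
         ? substr($line, 0, $pos)
         : $line;
}

print "in : >$_<\nout: >", strip_by_rindex($_), "<\n"
    for 'One Referenc', 'xx yy Reference Reference', 'def Refere Reference #xx';
```

No regex has to be built at run time here; the only pattern left is the one that trims the digits.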
Mark wrote:
> Here are two programs that seem to do what I want, but they seem
> overly complicated for this task.  I'm looking for a simpler solution,
> possibly by using a better regular expression than I have chosen in my
> first sample code.
> First sample:
> [...]
> Second sample:
> [...]

I don't really know what all this
should give, but why wouldn't
a simple:

    while(<DATA>) {
       chomp && print "$1  ==> from [$_]\n" if /(.+?)Refer/
    }


do all you want? In your explanation you
mentioned you'd truncate all subsequent
occurrences of 'refer', 'reference' and all
following stuff.

Regards

M.
0
Mirco
3/23/2007 6:23:43 PM
On Mar 23, 5:44 pm, "Mark" <goo...@markginsburg.com> wrote:

[ An interesting problem ]

>  I'm looking for a simpler solution,
> possibly by using a better regular expression than I have chosen in my
> first sample code.

Wow! What a brilliant post. Clear, well thought out, interesting.

Just wish I had an answer. I'll think about that one tonight. I'll
probably be up all night thinking about it!

0
Brian
3/23/2007 6:34:36 PM
On Mar 23, 5:44 pm, "Mark" <goo...@markginsburg.com> wrote:
> use strict ;
> use warnings ;
>
> my $re = qr'^(.*)\ ( (R$)|
>                      (Re$)|
>                      (Ref$)|
>                      (Refe$)|
>                      (Refer$)|
>                      (Refere$)|
>                      (Referenc$)|
>                      (Reference\ {0,1}$)|
>                      (Reference\ \#\d{0,}$)
>                    )'x ;
>
> while(<DATA>) {
>     chomp ;
>     print "in : >$_<\n" ;
>     if (my($result) = /$re/g) {
>         print "out: >$result<\n" ;
>     }
>     else {
>         print "out: >$_<\n" ;
>     }
>
> }

Just being picky but...

As far as I can see the /g in the match does nothing useful.

Nor do most of the (...) in the regex.

{0,1} and {0,} in regex are so commonly used that they have
one-character shorthands: ? and * respectively.

BTW are you perhaps trying to implement something like File::Stream?

0
Brian
3/23/2007 6:44:06 PM
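Applying those three remarks to the first sample might look like the sketch below: one non-capturing group, no /g, and ? and * in place of {0,1} and {0,}. The helper name strip_ref is made up for illustration, and the 'Referen' alternative is included on the assumption that every prefix of 'Reference' is wanted.

```perl
use strict;
use warnings;

# Same alternation as the first sample, minus the unused captures.
# Assumption: the 'Referen' prefix is meant to be covered as well.
my $re = qr{^(.*)\ (?: R
                     | Re
                     | Ref
                     | Refe
                     | Refer
                     | Refere
                     | Referen
                     | Referenc
                     | Reference\ ?
                     | Reference\ \#\d*
                   )$}x;

sub strip_ref {                      # hypothetical helper name
    my ($line) = @_;
    return $line =~ $re ? $1 : $line;    # no /g needed for a single match
}

print "in : >$_<\nout: >", strip_ref($_), "<\n"
    for 'One Referenc', 'two three Reference', 'def Refere Reference #xx';
```

This is no shorter than the original, but only $1 is set and the intent of each alternative stays visible.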
On Mar 23, 10:44 am, "Mark" <goo...@markginsburg.com> wrote:

> my $re = qr'^(.*)\ ( (R$)|
>                      (Re$)|
>                      (Ref$)|
>                      (Refe$)|
>                      (Refer$)|
>                      (Refere$)|
>                      (Referenc$)|
>                      (Reference\ {0,1}$)|
>                      (Reference\ \#\d{0,}$)
>                    )'x ;

Try this instead; results are identical to your regex except what
happens to $2, which you don't use anyway (and you could avoid setting
$2, but extra complexity for no real gain):

   $re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;


--
The best way to get a good answer is to ask a good question.
David Filmer (http://DavidFilmer.com)

0
usenet
3/23/2007 6:44:42 PM
On Mar 23, 6:44 pm, "Brian McCauley" <nobul...@gmail.com> wrote:
>
> BTW are you perhaps trying to implement something like
> File::Stream?

I thought I had déjà-vu

http://groups.google.com/group/comp.lang.perl.misc/browse_frm/thread/6b7d06f61ea9f640

0
Brian
3/23/2007 6:46:56 PM
On Mar 23, 11:44 am, use...@DavidFilmer.com wrote:
>    $re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;

Then again, it would be possible to "fool" this regex where your
original would not be fooled (for example, by dropping a middle
character).  Needs more thought....

--
The best way to get a good answer is to ask a good question.
David Filmer (http://DavidFilmer.com)

0
usenet
3/23/2007 6:49:07 PM
On Mar 23, 6:44 pm, use...@DavidFilmer.com wrote:
> On Mar 23, 10:44 am, "Mark" <goo...@markginsburg.com> wrote:
>
> > my $re = qr'^(.*)\ ( (R$)|
> >                      (Re$)|
> >                      (Ref$)|
> >                      (Refe$)|
> >                      (Refer$)|
> >                      (Refere$)|
> >                      (Referenc$)|
> >                      (Reference\ {0,1}$)|
> >                      (Reference\ \#\d{0,}$)
> >                    )'x ;
>
> Try this instead; results are identical to your regex except what
> happens to $2, which you don't use anyway (and you could avoid setting
> $2, but extra complexity for no real gain):
>
>    $re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;

No, that matches "Rernc 10" etc too.

0
Brian
3/23/2007 6:51:32 PM
Brian McCauley <nobull67@gmail.com> wrote in comp.lang.perl.misc:
> On Mar 23, 5:44 pm, "Mark" <goo...@markginsburg.com> wrote:
> 
> [ An interesting problem ]
> 
> >  I'm looking for a simpler solution,
> > possibly by using a better regular expression than I have chosen in my
> > first sample code.
> 
> Wow! What a brilliant post. Clear, well thought out, interesting.

....plus runnable code, including a convincing set of test data.
I quite agree.

> Just wish I had an answer. I'll think about that one tonight. I'll
> probably be up all night thinking about it!

Ah, it won't take all night.  Here is my take:

    {
        my $fix = ' Reference #';
        my $pat = "$fix\\d+";
        my @parts = map substr( $fix, 0, $_), 1 .. length $fix;

        sub rem_ref {
            my $str = shift;
            $str =~ s/$pat$// and return $str;
            $str =~ s/$_$// and return $str for @parts;
            return $str;
        }
    }

    while ( <DATA> ) {
        chomp;
        print "in : >$_<\n";
        print "out: >", rem_ref( $_), "<\n";
    }

Anno
0
anno4000
3/23/2007 6:59:30 PM
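One edge case worth noting in the prefix loop: the parts are tried from shortest to longest, so a line ending in ' Reference ' (with a trailing space) would only lose that final space. A hedged variant that tries the longest prefix first is sketched below. The name rem_ref_longest is made up, and starting at length 2 (at least ' R') is an assumption taken from the behaviour of Mark's first sample.

```perl
use strict;
use warnings;

{
    my $fix = ' Reference #';

    # Prefixes of $fix from longest to shortest, quoted so that they are
    # matched literally. Assumption: at least ' R' must be present.
    my @parts = map { quotemeta(substr($fix, 0, $_)) } reverse 2 .. length $fix;

    sub rem_ref_longest {            # hypothetical name, not from the thread
        my $str = shift;
        $str =~ s/\Q$fix\E\d+\z// and return $str;    # full suffix with digits
        $str =~ s/$_\z//          and return $str for @parts;
        return $str;
    }
}

print "in : >$_<\nout: >", rem_ref_longest($_), "<\n"
    for 'two three Reference ', 'abc the def Refere Reference #12';
```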
 <usenet@DavidFilmer.com> wrote in comp.lang.perl.misc:
> On Mar 23, 10:44 am, "Mark" <goo...@markginsburg.com> wrote:
> 
> > my $re = qr'^(.*)\ ( (R$)|
> >                      (Re$)|
> >                      (Ref$)|
> >                      (Refe$)|
> >                      (Refer$)|
> >                      (Refere$)|
> >                      (Referenc$)|
> >                      (Reference\ {0,1}$)|
> >                      (Reference\ \#\d{0,}$)
> >                    )'x ;
> 
> Try this instead; results are identical to your regex except what
> happens to $2, which you don't use anyway (and you could avoid setting
> $2, but extra complexity for no real gain):
> 
>    $re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;

No, that would also match things like "gaga Refe #12".

Anno
0
anno4000
3/23/2007 7:08:27 PM
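The objection is easy to check. The sketch below writes the optional-letter pattern without the /x flag, on the assumption that its spaces were meant literally (under /x they would be ignored). The helper name loose_match is made up for illustration.

```perl
use strict;
use warnings;

# The optional-letter pattern, without /x so that the spaces stay literal
# (an assumption about the intended reading).
my $loose = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$};

sub loose_match {                    # hypothetical helper name
    my ($line) = @_;
    return $line =~ $loose ? 1 : 0;
}

for my $line ('gaga Refe #12', 'gaga Rernc #10', 'One Referenc') {
    printf "%-16s %s\n", $line, loose_match($line) ? 'matches' : 'no match';
}
```

Both malformed lines match because every letter after the R is independently optional. 'One Referenc' does not match at all, since the '#' in the last group is mandatory.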
Mark wrote:
> From a line of arbitrary text, possibly followed by some amount of
> text from the beginning of the string ' Reference #\d+', where \d+
> represents one or more digit characters, I want to output the line
> without the ending ' Reference...' string.  For example, the input line
> 'some arbitrary text Refer' would become 'some arbitrary text'.
> 
> Here are two programs that seem to do what I want, but they seem
> overly complicated for this task.  I'm looking for a simpler solution,
> possibly by using a better regular expression than I have chosen in my
> first sample code.

After making the wrong turn first,
I think this can't be solved very
much differently from your solution.

The Regex can be an incremental one
(as was shown already by others) or a
sequence of alternations (as you tried).

One could rewrite it somehow 'different',
as a "split", like:

    use strict;
    use warnings;
    no warnings 'qw';

    my @end = qw{R e f e r e n c e \\s # \\d+};
    my $reg = '('.(join '|',map join('',@$_),map[@end[0..$_]],0..$#end).')$';

    while( <DATA> ) {
       chomp;
       print "[$_->[0]]\n\t[$_->[1]]\n"  for
          map [$_->[0]||'undef', $_->[1]||'undef'],
            [split /$reg/]
    }

__DATA__
....

Aside from the regex construction (which can be commented
properly ;-), this should be quite readable.


Regards

M.
0
Mirco
3/23/2007 7:22:59 PM
Mark wrote:
> Here are two programs that seem to do what I want, but they seem
> overly complicated for this task.  I'm looking for a simpler solution,
> possibly by using a better regular expression than I have chosen in my
> first sample code.

After making the wrong turn first,
I think this can't be solved very
much differently from your solution.

Of course, one can write it somehow 'different', like:

    ...
    my @end = split //, 'Reference #000000';
    my $key = '('.(join '|', map join('',,@$_), map[@end[0..$_]], 0..$#end).')';
    ...

    while(<DATA>) {
       print "$1\t\t$2\n"
          if /^(.+?)($key)$/
    }

__DATA__
....

Regards

M.
0
Mirco
3/23/2007 7:23:44 PM
Mirco Wahab wrote:
> One could rewrite it somehow 'different',
> as a "split", like:
> 
>    use strict;
>    use warnings;
>    ...
>    [split /$reg/]
>    ...

....
reg and output slightly modified to match yours:


    ...
    no warnings 'qw';

    my @end = qw{R e f e r e n c e \\s # \\d+};
    my $reg = '\s+('.(join '|',map join('',@$_),map[@end[0..$_]],0..$#end).')$';

    while( <DATA> ) {
       chomp;
       print "in : >$_<\n";
       print "out: >", (split /$reg/)[0], "<\n"
    }
    ...

Regards

M.
0
Mirco
3/23/2007 7:28:37 PM
<anno4000@radom.zrz.tu-berlin.de> wrote:
> <usenet@DavidFilmer.com> wrote in comp.lang.perl.misc:
>> On Mar 23, 10:44 am, "Mark" <goo...@markginsburg.com> wrote:
>> 
>> > my $re = qr'^(.*)\ ( (R$)|
>> >                      (Re$)|
>> >                      (Ref$)|
>> >                      (Refe$)|
>> >                      (Refer$)|
>> >                      (Refere$)|
>> >                      (Referenc$)|
>> >                      (Reference\ {0,1}$)|
>> >                      (Reference\ \#\d{0,}$)
>> >                    )'x ;
>> 
>> Try this instead; results are identical to your regex except what
>> happens to $2, which you don't use anyway (and you could avoid setting
>> $2, but extra complexity for no real gain):
>> 
>>    $re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;
>
>No, that would also match things like "gaga Refe #12".

You could write something like this

$re = qr{^(.*)\ (R(?:e(?:f(?:e(?:r(?:e(?:n(?:c(?:e(?:\ (?:\#\d*)
               ?)?)?)?)?)?)?)?)?))$}x;

but that's not clear at all to the human reader, and I don't think
adding more whitespace would help much in this case.

Depending on your needs, it might be more clear to use a simpler regex like
$re = qr{^(.*) ((R[a-z #]+) \d*)$};

and then test ($3 eq substr('Reference #', 0, length $3))

Gary Ansok
-- 
3M suggests that to obtain the best results, one should make the bond
"while the adhesive is wet, aggressively tacky."  I did not know what
"aggressively tacky" meant until I saw a recent notice in the Bboard.
0
ansok
3/23/2007 7:29:15 PM
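The "simple regex plus substr test" idea could be fleshed out roughly as follows. This is a sketch, not Gary's exact code: the pattern is adjusted so the digits land in their own capture, the helper name strip_by_substr is made up, and digits are accepted only after the complete 'Reference #'.

```perl
use strict;
use warnings;

my $tail = 'Reference #';            # literal part of the suffix

sub strip_by_substr {                # hypothetical helper name
    my ($line) = @_;

    # Loose candidate: a space, 'R' plus lowercase letters, spaces or '#',
    # then optional digits at the end of the line.
    return $line unless $line =~ /^(.*) (R[a-z #]*)(\d*)$/;
    my ($head, $word, $digits) = ($1, $2, $3);

    # The candidate must be a prefix of 'Reference #'.
    return $line unless $word eq substr($tail, 0, length $word);

    # Digits are only valid after the complete 'Reference #'.
    return $line if length $digits and $word ne $tail;

    return $head;
}

print "in : >$_<\nout: >", strip_by_substr($_), "<\n"
    for 'One Referenc', 'gaga Refe #12', 'abc the def Refere Reference #12';
```

The regex only finds a plausible candidate; the exact prefix rule lives in the substr() comparison, which is easier to read than a deeply nested pattern.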
At 2007-03-23 02:51PM, "Brian McCauley" wrote:
>  On Mar 23, 6:44 pm, use...@DavidFilmer.com wrote:
> > On Mar 23, 10:44 am, "Mark" <goo...@markginsburg.com> wrote:
> >
> > > my $re = qr'^(.*)\ ( (R$)|
> > >                      (Re$)|
> > >                      (Ref$)|
> > >                      (Refe$)|
> > >                      (Refer$)|
> > >                      (Refere$)|
> > >                      (Referenc$)|
> > >                      (Reference\ {0,1}$)|
> > >                      (Reference\ \#\d{0,}$)
> > >                    )'x ;
> >
> > Try this instead; results are identical to your regex except what
> > happens to $2, which you don't use anyway (and you could avoid setting
> > $2, but extra complexity for no real gain):
> >
> >    $re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;
>  
>  No, that matches "Rernc 10" etc too.

So instead you'd want...
    
    $re = qr{^(.*) R(e(f(e(r(e(n(c(e( (#\d*)?)?)?)?)?)?)?)?)?)?$}

or

    $re = qr{^(.*) R(?:e(?:f(?:e(?:r(?:e(?:n(?:c(?:e(?: (?:#\d*)?)?)?)?)?)?)?)?)?)?$}


-- 
Glenn Jackman
"You can only be young once. But you can always be immature." -- Dave Barry
0
Glenn
3/23/2007 7:29:30 PM
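The nested form does not have to be typed by hand. A minimal sketch that folds the suffix into nested optional groups from right to left is shown below; the helper names nested_prefix_re and strip_nested are made up for illustration.

```perl
use strict;
use warnings;

# Build e.g. (?:e(?:f(?:...(?:\#\d*)?...)?)?)? from 'eference #' and '\d*'.
sub nested_prefix_re {               # hypothetical helper name
    my ($literal, $last) = @_;       # $last: regex source allowed after the literal
    my $re = $last;
    for my $ch (reverse split //, $literal) {
        $re = '(?:' . quotemeta($ch) . $re . ')?';
    }
    return $re;
}

my $inner = nested_prefix_re('eference #', '\d*');
my $re    = qr/^(.*) R$inner\z/;

sub strip_nested {                   # hypothetical helper name
    my ($line) = @_;
    return $line =~ $re ? $1 : $line;
}

print "in : >$_<\nout: >", strip_nested($_), "<\n"
    for 'One Referenc', 'gaga Refe #12', 'abc the def Refere Reference #12';
```

Because each group wraps everything that follows it, a later letter can only match if all earlier ones did, which is what rules out inputs such as 'Refe #12'.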
Mirco Wahab wrote:
>    ...
>    [split /$reg/]
>    ...


regex/output simplified and slightly modified
to match yours:

    ...
    no warnings 'qw';
    my @end = qw{R e f e r e n c e \\s # \\d+};

    my $reg = '\s+('.(join'|',map join('',@end[0..$_]),0..$#end).')$';

    while( <DATA> ) {
       chomp;
       print "in : >$_<\n";
       print "out: >", (split /$reg/)[0], "<\n"
    }
    ...

Regards

M.
0
Mirco
3/23/2007 7:31:07 PM
On Mar 23, 10:44 am, "Mark" <goo...@markginsburg.com> wrote:
> From a line of arbitrary text, possibly followed by some amount of
>
> text from the beginning of the string ' Reference #\d+', where \d+
> represents one or more digit characters, I want to output the line
> without the ending ' Reference...' string.  For example, the input line
> 'some arbitrary text Refer' would become 'some arbitrary text'.
>

Thanks to all who responded and offered ideas.  Anno's post was
especially interesting.

- M

0
Mark
3/23/2007 8:15:46 PM
Mark <google@markginsburg.com> wrote in comp.lang.perl.misc:
> On Mar 23, 10:44 am, "Mark" <goo...@markginsburg.com> wrote:
> > From a line of arbitrary text, possibly followed by some amount of
> >
> > text from the beginning of the string ' Reference #\d+', where \d+
> > represents one or more digit characters, I want to output the line
> > without the ending ' Reference...' string.  For example, the input line
> > 'some arbitrary text Refer' would become 'some arbitrary text'.
> >
> 
> Thanks to all who responded and offered ideas.  Anno's post was
> especially interesting.

Thanks.  Since you mention it, the sub definition can be slightly
simplified:

    {
        my $fix = ' Reference #';
        my @parts = map substr( $fix, 0, $_), 1 .. length $fix;

        sub rem_ref {
            my $str = shift;
            $str =~ s/$_$// and return $str for @parts, "$fix\\d+";
            return $str;
        }
    }

Anno
0
anno4000
3/23/2007 9:12:43 PM
On Fri, 23 Mar 2007 20:23:44 +0100, Mirco Wahab <wahab-mail@gmx.de>
wrote:

>    my @end = split //, 'Reference #000000';
>    my $key = '('.(join '|', map join('',,@$_), map[@end[0..$_]], 0..$#end).')';

Isn't that an awkward way to reimplement substr()?
;-)


Michele
-- 
{$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
(($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
0
Michele
3/24/2007 9:39:50 AM
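For comparison, the list of prefixes can indeed be produced with substr() alone. The sketch below builds an alternation that way and adds the digits case as one more alternative; it is an illustration under the problem statement's rules, not a repaired version of the quoted code, and strip_alt is a made-up name.

```perl
use strict;
use warnings;

my $fix = 'Reference #';

# Every prefix of $fix via substr(), quoted for literal matching.
my @prefixes = map { quotemeta(substr($fix, 0, $_)) } 1 .. length $fix;

# One more alternative for the complete suffix followed by digits.
my $alt = join '|', @prefixes, quotemeta($fix) . '\d+';
my $re  = qr/^(.*) (?:$alt)\z/;

sub strip_alt {                      # hypothetical helper name
    my ($line) = @_;
    return $line =~ $re ? $1 : $line;
}

print "in : >$_<\nout: >", strip_alt($_), "<\n"
    for 'One Referenc', 'xx yy Reference Reference', 'def Refere Reference #xx';
```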
Michele Dondi wrote:
> On Fri, 23 Mar 2007 20:23:44 +0100, Mirco Wahab <wahab-mail@gmx.de>
> wrote:
> 
>>    my @end = split //, 'Reference #000000';
>>    my $key = '('.(join '|', map join('',,@$_), map[@end[0..$_]], 0..$#end).')';
> 
> Isn't that an awkward way to reimplement substr()?

First, this - and the whole approach shown above - also
will not work (to solve the said problem). I tried to
cancel the message (and post a working solution) after
thinking again - but your news server didn't honor my
cancel attempts. This way, it all came to light ...

Regards

Mirco

0
Mirco
3/25/2007 10:03:12 PM