if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just 'C'

To repeat the title, in case it is munged by Google Groups:

  if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just 'C'

I've been developing with perl for years; but even simple things in it
still sometimes throw up surprises.

The regexp /:(.*?)$/ is anchored on the right by $, then comes a
non-greedy match which, AIUI, is the "shortest string it can get away
with", preceded by a colon. So I would expect this to pick up just the
"C", as it does with
  /([^:]*)$/.

Am I assuming/doing something silly? It is Friday afternoon after all.


Cheers

John R Ramsden
OwlHoot
11/12/2010 2:38:08 PM
comp.lang.perl.misc

On 12.11.2010 15:38, OwlHoot wrote:
> To repeat the title, in case it is munged by Google Groups:
> 
>   if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just
> 'C'
> 
> I've been developing with perl for years; but even simple things in it
> still
> sometimes throw up surprises.
> 
> The regexp /:(.*?)$/ is anchored on the right by $, then comes a non-
> greedy
> match which, AIUI, is the "shortest string it can get away with",
> preceded
> by a colon. So I would expect this to pick up just the "C", as it does
> with
>   /([^:]*)$/.

The regexp matches from the left to the right, even if there is an
anchor on the right side of the string.

Thus the : first tries to match the first : in your string, i.e. the one
between A and B. Then .*? tries to match any number of chars, starting
from zero because of the ?. But if zero chars are matched, the $ fails.
So the regexp makes the number of characters matched by the .*?
longer and longer until finally the $ matches. The regexp does not need
to go back and select the next : in this case.

.*? means: take as few chars as possible _at this position_.
It does not mean: do backtracking and try to find out whether it could
match fewer chars at some other place in the string.

So if you add .* to the beginning, you will get the last : in your string.
/.*:(.*?)$/
In this case the .* would try to eat as many chars as possible, then
search for a :. So this would try the last : first.

Anyway, you could also use (split /:/, 'A:B:C')[-1] here.
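
A minimal sketch of the three approaches above, run against the same
string. The variable names are made up for illustration only.

```perl
use strict;
use warnings;

my $str = 'A:B:C';

# Original pattern: the first ':' matches, then .*? grows until $ succeeds.
my ($lazy) = $str =~ /:(.*?)$/;

# A leading greedy .* backtracks from the end, so the last ':' is tried first.
my ($last) = $str =~ /.*:(.*?)$/;

# split returns all fields; index -1 picks the final one.
my $field = (split /:/, $str)[-1];

print "lazy=$lazy last=$last field=$field\n";   # lazy=B:C last=C field=C
```

Only the first form captures 'B:C'; the other two give 'C'.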

Cheers, Wolf
Wolf
11/12/2010 2:56:11 PM
On Fri, 12 Nov 2010 06:38:08 -0800 (PST), OwlHoot <ravensdean@googlemail.com> wrote:

>To repeat the title, in case it is munged by Google Groups:
>
>  if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just
>'C'
>
>I've been developing with perl for years; but even simple things in it
>still
>sometimes throw up surprises.
>
>The regexp /:(.*?)$/ is anchored on the right by $, then comes a non-
>greedy
>match which, AIUI, is the "shortest string it can get away with",
>preceded
>by a colon. So I would expect this to pick up just the "C", as it does
>with
>  /([^:]*)$/.
>

It's not the shortest, it's the first to satisfy it.
The match starts at the leftmost ':' and is anchored on the right by $.
The regex allows another ':' inside the capture as it traverses the
string from the left.
/:(.*)$/ has the same result, but gets there without testing $ after
each char between the first ':' and the end of string.

Notice that /:(.*?):/ does the same thing, it says get all between
the first ':' and the next ':'. However,
  'A:B:C:D' =~ /:(.*):/
greedily grabs all between the first and last ':', but
  'A:B:C:D' =~ /:(.*?):/
grabs only that between the first 2 ':'s.

Since there is only one end of line, it gets all between the first ':'
and end of line regardless of ?.
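
A quick sketch to check the four cases above on 'A:B:C:D'. The variable
names are hypothetical; the patterns are the ones from this post.

```perl
use strict;
use warnings;

my $s = 'A:B:C:D';

my ($greedy_between) = $s =~ /:(.*):/;    # first ':' up to the last ':'
my ($lazy_between)   = $s =~ /:(.*?):/;   # first ':' up to the next ':'

# With $ as the only possible end point, greedy and lazy capture the same text.
my ($greedy_to_end) = $s =~ /:(.*)$/;
my ($lazy_to_end)   = $s =~ /:(.*?)$/;

print "$greedy_between | $lazy_between | $greedy_to_end | $lazy_to_end\n";
# B:C | B | B:C:D | B:C:D
```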

-sln
sln
11/12/2010 3:28:28 PM
Wolf Behrenhoff <NoSpamPleaseButThisIsValid3@gmx.net> writes:
> On 12.11.2010 15:38, OwlHoot wrote:
>> To repeat the title, in case it is munged by Google Groups:
>> 
>>   if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just
>> 'C'

You should ask your question in the body of your message anyway.
Newsreaders vary in how they display subject lines.

>> I've been developing with perl for years; but even simple things in
>> it still sometimes throw up surprises.
>> 
>> The regexp /:(.*?)$/ is anchored on the right by $, then comes a non-
>> greedy match which, AIUI, is the "shortest string it can get away
>> with", preceded by a colon. So I would expect this to pick up just
>> the "C", as it does with
>>   /([^:]*)$/.
>
> The regexp matches from the left to the right, even if there is an
> anchor on the right side of the string.
>
[more explanation snipped]
>
> Anyway, you could also use (split /:/, 'A:B:C')[-1] here.

Another possibility is
    if ('A:B:C' =~ /:([^:]*)$/)

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Nokia
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
Keith
11/12/2010 4:44:36 PM
On Nov 12, 8:44 am, Keith Thompson <ks...@mib.org> wrote:

...

>
> > Anyway, you could also use (split /:/, 'A:B:C')[-1] here.
>
> Another possibility is
>     if ('A:B:C' =~ /:([^:]*)$/)
>

Yet another:

      'A:B:C' =~ /.*:(.*)/;



--
Charles DeRykus


C
11/12/2010 8:55:28 PM
>>>>> "O" == OwlHoot  <ravensdean@googlemail.com> writes:

  O> The regexp /:(.*?)$/ is anchored on the right by $, then comes a non-
  O> greedy
  O> match which, AIUI, is the "shortest string it can get away with",
  O> preceded
  O> by a colon. So I would expect this to pick up just the "C", as it does
  O> with
  O>   /([^:]*)$/.

as others have said, you didn't get what ? does for quantifiers. perl
will match the leftmost working match. with a greedy quantifier, it will
continue to match chars until it fails and then stop. with the
non-greedy modifier ? it will stop after the first (and locally
shortest) match. it will not globally find the shortest possible match
anywhere in the string. so the key is remembering leftmost correct match
first and then short or greedy based on the modifier.
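
a small sketch of the "leftmost match first" rule, assuming nothing
beyond core perl. it uses $-[0] (the offset where the last successful
match began) to show where each pattern actually starts. the variable
names are only for illustration.

```perl
use strict;
use warnings;

my $str = 'A:B:C';

# Lazy pattern: the match begins at the first ':' (offset 1).
$str =~ /:(.*?)$/ or die "no match";
my ($lazy_start, $lazy_cap) = ($-[0], $1);

# Character class: the first ':' cannot reach $, so the engine moves on
# and the match begins at the second ':' (offset 3).
$str =~ /:([^:]*)$/ or die "no match";
my ($class_start, $class_cap) = ($-[0], $1);

print "lazy:  start=$lazy_start capture=$lazy_cap\n";     # start=1 capture=B:C
print "class: start=$class_start capture=$class_cap\n";   # start=3 capture=C
```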

uri

-- 
Uri Guttman  ------  uri@stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------
Uri
11/12/2010 9:19:55 PM
OwlHoot wrote:
> To repeat the title, in case it is munged by Google Groups:
> 
>   if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just
> 'C'
> 
> I've been developing with perl for years; but even simple things in it
> still
> sometimes throw up surprises.
> 
> The regexp /:(.*?)$/ is anchored on the right by $, then 

There is no "then".  Being anchored at the end does not change the order 
of evaluation (or at least, does not do so in a way that affects the 
outcome--the optimized engine can do things in whatever order it wants, 
as long as it behaves as if it were done left to right.)


> comes a non-
> greedy

Really it is not non-greedy.  It is still greedy, it is just greedy for 
less, rather than greedy for more.  It is still greedy because it 
satisfies itself, without looking around at the "wants" of others.

> match which, AIUI, is the "shortest string it can get away with",
> preceded
> by a colon. 

The colon is also greedy.  It is greedy to match as far left as it can 
get away with.  And because it comes before the .*? does, its greed wins.

Xho
Xho
11/13/2010 3:37:11 AM
On 2010-11-13 03:37, Xho Jingleheimerschmidt <xhoster@gmail.com> wrote:
> Really it is not non-greedy.  It is still greedy, it is just greedy for 
> less, rather than greedy for more.  It is still greedy because it 
> satisfies itself, without looking around at the "wants" of others.

> The colon is also greedy.  It is greedy to match as far left as it can 
> get away with.  And because it comes before the .*? does, its greed wins.

Please. "Greedy" in the context of regular expressions is a technical
term with a precisely defined meaning. You are not helping by inventing
a different meaning for the word based on its meaning in common English.

	hp

Peter
11/14/2010 10:49:28 AM
Peter J. Holzer <hjp-usenet2@hjp.at> wrote:
> On 2010-11-13 03:37, Xho Jingleheimerschmidt <xhoster@gmail.com> wrote:
>> Really it is not non-greedy.  It is still greedy, it is just greedy for 
>> less, rather than greedy for more.  It is still greedy because it 
>> satisfies itself, without looking around at the "wants" of others.


Patterns match left-to-right, this is the essential thing that
the OP was missing.

There are no "wants" of others, unless the others are to the left
of the part being considered.


>> The colon is also greedy.  It is greedy to match as far left as it can 
>> get away with.  And because it comes before the .*? does, its greed wins.


It is not the colon's greed that makes it want to match as far left
as possible, it is the left-to-right nature of regular expressions
that makes it match at the leftmost opportunity.


> Please. "Greedy" in the context of regular expressions is a technical
> term with a precisely defined meaning. You are not helping by inventing
> a different meaning for the word based on its meaning in common English.


An important concept to grokking regex greed is that "greedy" never
changes whether a pattern will match or not.

If it matches greedy, it will match non-greedy.

If it fails to match greedy, it will fail to match non-greedy.

Greed does not affect "match or fail".

Greed only comes into play when there is more than one way for
the match to succeed. If greedy, then it chooses the longest way,
if non-greedy it chooses the shortest way. There is no way to get
it to match any of the intermediate-length ways.
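
A rough sketch of the "greed does not affect match or fail" point,
assuming plain quantifiers only (no possessive quantifiers or atomic
groups, which are a separate feature). The sample inputs are invented
for illustration.

```perl
use strict;
use warnings;

# Sample inputs chosen for illustration: some contain ':' and some do not.
my @inputs = ('A:B:C', 'A:B', 'ABC', ':', '');

my @rows;
for my $in (@inputs) {
    my ($greedy) = $in =~ /:(.*)$/;    # undef when the pattern fails
    my ($lazy)   = $in =~ /:(.*?)$/;   # undef when the pattern fails
    push @rows, [ $in, defined($greedy) ? 1 : 0, defined($lazy) ? 1 : 0 ];
}

# Print one line per input: did the greedy and the lazy form succeed?
for my $row (@rows) {
    my ($in, $g_ok, $l_ok) = @$row;
    printf "%-6s greedy=%d lazy=%d\n", $in, $g_ok, $l_ok;
}
```

For every input the two columns agree: both forms match, or both fail.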


-- 
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.
Tad
11/14/2010 3:48:41 PM
Peter J. Holzer wrote:
> On 2010-11-13 03:37, Xho Jingleheimerschmidt <xhoster@gmail.com> wrote:
>> Really it is not non-greedy.  It is still greedy, it is just greedy for 
>> less, rather than greedy for more.  It is still greedy because it 
>> satisfies itself, without looking around at the "wants" of others.
> 
>> The colon is also greedy.  It is greedy to match as far left as it can 
>> get away with.  And because it comes before the .*? does, its greed wins.
> 
> Please. "Greedy" in the context of regular expressions is a technical
> term with a precisely defined meaning. You are not helping by inventing
> a different meaning for the word based on its meaning in common English.

Greedy is well defined in the field of computer science, and I am not 
the one inventing new meanings for it.

Xho
Xho
11/15/2010 12:33:20 AM