a regexp riddle: re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and c') =? ('a', 'bbb', 'c')

HypoNt:

I need to turn a human-readable list into a list():

   print re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and c').groups()

That currently returns ('c',). I'm trying to match "any word \w+
followed by a comma, or a final word preceded by and."

The match returns 'a, bbb, and c', but the groups return ('bbb', 'c').
What do I type for .groups() to also get the 'a'?

Please go easy on me (and no RTFM!), because I have only been using
regular expressions for about 20 years...

--
  Phlip
  http://bit.ly/ZeekLand
phlip2005 (2215)
11/25/2010 4:46:18 AM
comp.lang.python

9 Replies


--- On Thu, 11/25/10, Phlip <phlip2005@gmail.com> wrote:
> From: Phlip <phlip2005@gmail.com>
> Subject: a regexp riddle: re.search(r'
> To: python-list@python.org
> Date: Thursday, November 25, 2010, 8:46 AM
> HypoNt:
>
> I need to turn a human-readable list into a list():
>
>    print re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and c').groups()
>
> That currently returns ('c',). I'm trying to match "any word \w+
> followed by a comma, or a final word preceded by and."
>
> The match returns 'a, bbb, and c', but the groups return ('bbb', 'c').
> What do I type for .groups() to also get the 'a'?
>

First of all, the 'bbb' corresponds to the first capturing
group and 'c' to the second. But 'a' is forgotten because
it was the first match of the first group, and it was
overwritten by the second match, 'bbb'.

Generally, a capturing group only remembers the last match.

It also seems that your re may match this: 'and c',
which does not seem to be your intention.
So it may be more intuitively written as:

r'(?:(\w+), )+and (\w+)'

I'm not sure how to get it done in one step,
but it would be easy to first get the whole
match, then process it with:

re.findall(r'(\w+)(?:,|$)', the_whole_match)

cheers,

Yingjie

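The two-step approach described above can be sketched as follows. This is only an illustration, written for Python 3 (so print is a function); the variable names text, whole and items are made up for the example and do not come from the post.

```python
import re

text = 'whatever a, bbb, and c'

# Step 1: match the whole list. Each repeated capturing group keeps
# only the last thing it matched, so 'a' is not in groups().
whole = re.search(r'(?:(\w+), )+and (\w+)', text)
print(whole.group(0))   # a, bbb, and c
print(whole.groups())   # ('bbb', 'c')

# Step 2: pull every item out of the matched text.
items = re.findall(r'(\w+)(?:,|$)', whole.group(0))
print(items)            # ['a', 'bbb', 'c']
```

The second pattern skips the word 'and' because it is followed by a space rather than by a comma or the end of the string.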
Yingjie
11/25/2010 9:44:44 AM
Accepting input from a human is fraught with dangers and edge cases.  ;)

Some time ago I wrote a regular expression generator that creates 
regexen that can parse arbitrarily delimited text, supports quoting (to 
avoid accidentally separating two elements that should be treated as 
one), and works in both directions (text<->native).

The code that generates the regex is heavily commented:

	https://github.com/pulp/marrow.util/blob/master/marrow/util/convert.py#L123-234

You should be able to use this as-is and simply handle the optional 'and' 
on the last element yourself.  You can even create an instance of the 
class with the options you want, then get the generated regular 
expression by running print(parser.pattern).

Note that I have friends who use 'and' multiple times when describing 
lists of things.  :P

	— Alice.

utf
11/25/2010 10:00:11 AM
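Now that I think about it, 'and' can be stripped using a callback 
function as the 'normalize' argument to my KeywordProcessor class:

# KeywordProcessor is the class from marrow.util linked in my previous post.
def normalize(value):
    value = value.strip()

    # Match "and " with its trailing space so that words which merely
    # start with "and" (such as "andrew") are left alone.
    if value.startswith("and "):
        value = value[4:].lstrip()

    return value

parser = KeywordProcessor(',', normalize=normalize, result=list)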

	— Alice.

utf
11/25/2010 10:04:56 AM
Phlip,

> I'm trying to match "any word \w+ followed by a comma, or a final word preceded by and."

Here's a non-regex solution that handles multi-word values and multiple
instances of 'and' (as pointed out by Alice). The posted code could be
simplified via list comprehension - I chose the more verbose method to
illustrate the logic.

def to_list( text ):
    
    text = text.replace( ' and ', ',' )
    output = list()
    for item in text.split( ',' ):
        if item:
            output.append( item.strip() )
    return output

test = 'cat, dog, big fish, goat and puppy and horse'

print to_list( test )

Outputs:

['cat', 'dog', 'big fish', 'goat', 'puppy', 'horse']

Malcolm
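
The list-comprehension simplification mentioned above might look like the sketch below. It is written for Python 3 and is meant to behave the same as the verbose to_list() on the test string; the name to_list_short is hypothetical and is used here only to keep the two versions apart.

```python
def to_list_short(text):
    """Split on commas and on ' and ', dropping empty items."""
    parts = text.replace(' and ', ',').split(',')
    return [part.strip() for part in parts if part.strip()]

test = 'cat, dog, big fish, goat and puppy and horse'
print(to_list_short(test))
# ['cat', 'dog', 'big fish', 'goat', 'puppy', 'horse']
```

One small difference from the loop version: this sketch also drops items that consist only of whitespace, which the original would have kept as empty strings.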
python
11/25/2010 2:21:16 PM
On 11/24/2010 10:46 PM, Phlip wrote:
> HypoNt:
> 
> I need to turn a human-readable list into a list():
> 
>    print re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and
> c').groups()
> 
> That currently returns ('c',). I'm trying to match "any word \w+
> followed by a comma, or a final word preceded by and."
> 
> The match returns 'a, bbb, and c', but the groups return ('bbb', 'c').
> What do I type for .groups() to also get the 'a'?
> 
> Please go easy on me (and no RTFM!), because I have only been using
> regular expressions for about 20 years...

A kind of lazy way just uses a pattern for the separators to fuel a call
to re.split(). I assume that " and " and " , " are both acceptable in
any position:

The best I've been able to do so far (due to split's annoying habit of
including the matches of any groups in the pattern I have to throw away
every second element) is:

>>> re.split("\s*(,|and)?\s*", 'whatever a, bbb, and c')[::2]
['whatever', 'a', 'bbb', '', 'c']

That empty string is because of the ", and" which isn't recognised as a
single delimiter.
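
One way to avoid both the empty string and the [::2] slice is to treat ", and" as a single delimiter and to use a non-capturing group, so that re.split() returns only the items. The sketch below assumes Python 3 and assumes that the delimiters are a comma (optionally followed by "and") or a bare "and" between spaces; the names DELIMITER and split_list are invented for the example. Unlike the expression above, it does not split on plain whitespace, so a leading word such as 'whatever' would stay attached to the first item.

```python
import re

# Assumed delimiter rules: a comma with an optional following "and",
# or a bare "and" surrounded by whitespace.
DELIMITER = re.compile(r'\s*,\s*(?:and\s+)?|\s+and\s+')

def split_list(text):
    """Split a human-readable list on the delimiters defined above."""
    return [item for item in DELIMITER.split(text) if item]

print(split_list('a, bbb, and c'))             # ['a', 'bbb', 'c']
print(split_list('goat and puppy and horse'))  # ['goat', 'puppy', 'horse']
```

Because this pattern can never match an empty string, it should not be affected by the Python 3.7 change that made re.split() split on empty matches.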

A parsing package might give you better results.

regards
 Steve
-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119
PyCon 2011 Atlanta March 9-17       http://us.pycon.org/
See Python Video!       http://python.mirocommunity.org/
Holden Web LLC                 http://www.holdenweb.com/

Steve
11/25/2010 3:25:52 PM
On 25/11/2010 04:46, Phlip wrote:
> HypoNt:
>
> I need to turn a human-readable list into a list():
>
>     print re.search(r'(?:(\w+), |and (\w+))+', 'whatever a, bbb, and
> c').groups()
>
> That currently returns ('c',). I'm trying to match "any word \w+
> followed by a comma, or a final word preceded by and."
>
> The match returns 'a, bbb, and c', but the groups return ('bbb', 'c').
> What do I type for .groups() to also get the 'a'?
>
> Please go easy on me (and no RTFM!), because I have only been using
> regular expressions for about 20 years...
>
Try re.findall:

     >>> re.findall(r'(\w+), |and (\w+)', 'whatever a, bbb, and c')
     [('a', ''), ('bbb', ''), ('', 'c')]

You can get a list of strings like this:

     >>> [x or y for x, y in re.findall(r'(\w+), |and (\w+)', 'whatever a, bbb, and c')]
     ['a', 'bbb', 'c']
MRAB
11/25/2010 4:16:15 PM
> Accepting input from a human is fraught with dangers and edge cases.

> Here's a non-regex solution

Thanks all for playing! And as usual I forgot a critical detail:

I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
matcher must be a single regexp.

  http://c2.com/cgi/wiki?MoreliaViridis

I'm avoiding the current situation, where Morelia pulls out (.*), and
the step handler "manually" splits that up with:

  flags = re.split(r', (?:and )?', flags)

That means I already had a brute-force version. A regexp version is
always better because, especially in Morelia, it validates input. (.*)
is less specific than (\w+).

So if the step says:

  Alice has crypto keys apple, barley, and flax

Then the step handler could say (if this worked):

  def step_user_has_crypto_keys_(self, user, *keys):
      r'(\w+) has crypto keys (?:(\w+), )+and (\w+)'

      # assert that user with those keys here

That does not work because "a capturing group only remembers the last
match". This would appear to be an irritating 'feature' in Regexp. The
total match is 'apple, barley, and flax', but the stored groups behave
as if each () were a slot, so (\w+)+ would not store "more than one
group". Unless there's a Regexp workaround to mean "arbitrary number
of slots for each ()", then I /might/ go with this:

   got = re.findall(r'(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), and )?(\w+)$', 'whatever a, bbb, and c')
   print got  #  [('a', '', '', '', 'bbb', 'c')]

The trick is to simply paste in a high number of (?:(\w+), )?
segments, assuming that nobody should plug in too many. Behavior
Driven Development scenarios should be readable and not run-on.
(Morelia has a table feature for when you actually need lots of
arguments.)

Next question: Does re.search() return a match object that I can get
('a', '', '', '', 'bbb', 'c') out of? The calls to groups() and such
always return this crazy ('a', 2, 'bbb', 'c') thing that would disturb
my user-programmers.
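
A possible answer to the re.search() question, as a sketch: the match object's groups() method accepts a default value that is substituted for every group that did not take part in the match, so groups('') should give the same tuple that findall() produced. The example assumes Python 3, and the names PATTERN, padded and keys are made up for illustration.

```python
import re

PATTERN = (r'(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?(?:(\w+), )?'
           r'(?:(\w+), and )?(\w+)$')

match = re.search(PATTERN, 'whatever a, bbb, and c')

# Groups that did not participate are None unless a default is given.
padded = match.groups('')
print(padded)   # ('a', '', '', '', 'bbb', 'c')

# Dropping the unused slots leaves just the items.
keys = [group for group in match.groups() if group is not None]
print(keys)     # ['a', 'bbb', 'c']
```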

--
  Phlip
Phlip
11/25/2010 7:57:33 PM
On 25/11/2010 19:57, Phlip wrote:
>> Accepting input from a human is fraught with dangers and edge cases.
>
>> Here's a non-regex solution
>
> Thanks all for playing! And as usual I forgot a critical detail:
>
> I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
> matcher must be a single regexp.
>
>    http://c2.com/cgi/wiki?MoreliaViridis
>
> I'm avoiding the current situation, where Morelia pulls out (.*), and
> the step handler "manually" splits that up with:
>
>    flags = re.split(r', (?:and )?', flags)
>
> That means I already had a brute-force version. A regexp version is
> always better because, especially in Morelia, it validates input. (.*)
> is less specific than (\w+).
>
> So if the step says:
>
>    Alice has crypto keys apple, barley, and flax
>
> Then the step handler could say (if this worked):
>
>    def step_user_has_crypto_keys_(self, user, *keys):
>        r'(\w+) has crypto keys (?:(\w+), )+and (\w+)'
>
>        # assert that user with those keys here
>
[snip]
You could do:

     def step_user_has_crypto_keys_(self, user, keys):
         r'(\w+) has crypto keys ((?:\w+, )+and \w+)'

to validate and capture, and then split the keys string.
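
Outside Morelia, the validate-then-split idea can be sketched as a plain function. This is only an assumption about how a step handler might use it: the function name parse_crypto_keys_step is hypothetical, the use of re.match() is a guess at how the step text would be matched, and the split pattern is the one from Phlip's existing handler. Written for Python 3.

```python
import re

STEP = re.compile(r'(\w+) has crypto keys ((?:\w+, )+and \w+)')

def parse_crypto_keys_step(line):
    """Validate the step text, then split the captured key list."""
    match = STEP.match(line)
    if match is None:
        raise ValueError('step text does not match: %r' % line)
    user, keys = match.groups()
    # The whole list arrives as one string; split it into items.
    return user, re.split(r', (?:and )?', keys)

print(parse_crypto_keys_step('Alice has crypto keys apple, barley, and flax'))
# ('Alice', ['apple', 'barley', 'flax'])
```

The regexp still rejects malformed input, which keeps the validation benefit, while the split recovers every key rather than only the last one captured.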
MRAB
11/25/2010 8:45:02 PM
In article <ad755502-bf79-46ea-b7ee-57ac6f7ee1cf@z26g2000prf.googlegroups.com>,
Phlip  <phlip2005@gmail.com> wrote:
>
>Thanks all for playing! And as usual I forgot a critical detail:
>
>I'm writing a matcher for a Morelia /viridis/ Scenario step, so the
>matcher must be a single regexp.

Why?  (You're apparently the author of Morelia, but I don't really
understand it.)
-- 
Aahz (aahz@pythoncraft.com)           <*>         http://www.pythoncraft.com/

"Think of it as evolution in action."  --Tony Rand
aahz
11/26/2010 7:57:03 PM