parsing alpha and numeric characters out of string

im not a very good awk user, but i use it a lot to parse out data in
graphic filenames.  usually the filenames have _ separators (e.g.,
ef_001_comp_01.0001.tif), so something like:

set VAR = `echo ef_001_comp_01.0001.tif | awk -F_ '{print $1}'`

gives me:

echo $VAR

ef

however now i have a bunch of filenames in this form:

ef0001.0001.tif
sb0001.0001.tif
rs0004.0001.tif

etc.

i can isolate the first section and get:

ef0001
sb0001
rs0004

which gives me the two pieces of information i want, but stuck together
(the first two alphas are one thing, the next four numbers are another).
how do i parse them apart?  there's no consistent separator character
for awk to split on (at least within my limited ability).

tia

christopher

Reply deepstructure (3) 4/19/2007 10:35:33 PM

o.k., reading my awk/sed o'reilly book i realize that the -F option
just changes what the separator is, and since nothing separates these
things in the string i probably can't use that.  at least the split
happens between alpha and numeric characters.  i keep seeing things
like

/[A-Za-z]+/

in the book, but so far haven't been able to figure out how to
implement that.  again, any help is appreciated!

cheers
christopher


Reply deepstructure 4/19/2007 11:34:27 PM


deepstructure@gmail.com wrote:
> im not a very good awk user, but i use it a lot to parse out data in
> graphic filenames.  usually the filenames have _ separators (e.g.,
> ef_001_comp_01.0001.tif), so something like:
>
> set VAR = `echo ef_001_comp_01.0001.tif | awk -F_ '{print $1}'`

(What syntax is that; C shell? Completely off-topic here.)

> 
> gives me:
> 
> echo $VAR
> 
> ef
> 
> however now i have a bunch of filenames in this form:
> 
> ef0001.0001.tif
> sb0001.0001.tif
> rs0004.0001.tif
> 
> etc.
> 
> i can isolate the first section and get:
> 
> ef0001
> sb0001
> rs0004

awk '{sub(/\..*/,""); print $0}'

Replace everything in $0 from the first literal dot (\.) through the
end of the line (.* matches any number of arbitrary characters) with
the empty string "", then print the result.

Janis

> 
> which gives me the two pieces of information i want, but stuck together
> (the first two alphas are one thing, the next four numbers are another).
> how do i parse them apart?  there's no consistent separator character
> for awk to split on (at least within my limited ability).
> 
> tia
> 
> christopher
> 
Reply Janis 4/20/2007 12:31:00 AM
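
For reference, here is a minimal sketch of what the sub() call above does to one of the full filenames. It assumes a POSIX shell rather than the csh used in the original post, and the function name strip_ext is made up for illustration.

```shell
# strip_ext: hypothetical helper; deletes everything from the first dot onward
strip_ext() {
    printf '%s\n' "$1" | awk '{ sub(/\..*/, ""); print }'
}

strip_ext ef0001.0001.tif    # prints: ef0001
```

Note that this only removes the frame number and extension. It leaves the letters and digits of "ef0001" joined together, which is the part the original question is really about.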

deepstructure@gmail.com wrote:
> o.k., reading my awk/sed o'reilly book i realize that the -F option
> just changes what the separator is, and since nothing separates these
> things in the string i probably can't use that.  at least the split
> happens between alpha and numeric characters.  i keep seeing things
> like
> 
> /[A-Za-z]+/
> 
> in the book, but so far haven't been able to figure out how to
> implement that.  again, any help is appreciated!

What is your _concrete_ question in *this* posting?

You want to know the meaning of the above regexp?
It matches an arbitrarily long sequence (at least one character) of
alphabetic characters.

Janis

> 
> cheers
> christopher
> 
> 
Reply Janis 4/20/2007 12:34:05 AM
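
One way to put a regexp like the one above to work is as an awk pattern: a rule whose pattern is a regexp runs only for input lines that match it, and a pattern with no action prints the line. The sketch below assumes a POSIX shell; the helper name keep_prefixed and the anchored letters-then-digits regexp are illustrative choices, not something from the book or the thread.

```shell
# keep_prefixed: hypothetical filter; passes through only the lines made up
# of one or more letters followed by one or more digits
keep_prefixed() {
    awk '/^[A-Za-z]+[0-9]+$/'
}

printf '%s\n' ef0001 0002 sb0001 | keep_prefixed
# prints:
# ef0001
# sb0001
```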

On Apr 19, 5:31 pm, Janis Papanagnou <Janis_Papanag...@hotmail.com>
wrote:
>
> (What syntax is that; C shell? Completely off-topic here.)
>

yep.  yes, i know im not supposed to be using c shell but
unfortunately that's what i learned in and am currently stuck with.

my second post was just me realizing that using -F wasn't going to
help since there isn't a separator.

unfortunately i don't see this statement working:

awk '{sub(/\..*/,""); print $0}'

i think you misunderstood what i was trying to parse.  im trying to
separate the "ef" and "0001" in "ef0001".  i tried your code and got
this:

% set VAR = `echo ef0015 | awk '{sub(/\..*/,""); print $0}'`
% echo $VAR
ef0015

so ignore the first part of what i wrote above - that was just
context.  basically i have a list of filenames:

ef0001
ef0002
rb0001
sr0004

etc., etc., and i need to separate the two-character alpha prefix from
the numeric portion.  make sense?

thanks for your help!

cheers
christopher


Reply deepstructure 4/20/2007 1:55:33 AM

* deepstructure@gmail.com [2007.04.20 01:55]:
> so ignore the first part of what i wrote above - that was just
> context.  basically i have a list of filenames:
>
> ef0001
> ef0002
> rb0001
> sr0004
>
> etc., etc., and i need to separate the two-character alpha prefix from
> the numeric portion.  make sense?

If this is fixed width, you can use substr():

awk '{ print substr($0,1,2), substr($0,3,4) }'

-- 
JR
Reply Jean 4/20/2007 2:48:41 AM

On Apr 20, 9:55 am, deepstruct...@gmail.com wrote:
> so ignore the first part of what i wrote above - that was just
> context.  basically i have a list of filenames:
>
> ef0001
> ef0002
> rb0001
> sr0004
>
> etc., etc., and i need to separate the two-character alpha prefix from
> the numeric portion.  make sense?
>
> thanks for your help!
>
> cheers
> christopher

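# gensub() is a gawk extension: it returns the modified copy and leaves $0 untouched
awk '{ num   = gensub(/[a-z]+/, "", "g"); print num      # digits only
       alpha = gensub(/[0-9]+/, "", "g"); print alpha    # letters only
     }' "file"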

Reply mik3l3374 4/20/2007 2:55:51 AM

On 20 Apr., 04:48, Jean-Rene David <jrda...@magma.ca.INVALID> wrote:
> * deepstruct...@gmail.com [2007.04.20 01:55]:
>
> > so ignore the first part of what i wrote above - that was just
> > context.  basically i have a list of filenames:
>
> > ef0001
> > ef0002
> > rb0001
> > sr0004
>
> > etc., etc., and i need to separate the two-character alpha prefix from
> > the numeric portion.  make sense?
>
> If this is fixed width, you can use substr():
>
> awk '{ print substr($0,1,2), substr($0,3,4) }'

And if it's not fixed width you can use match() first to obtain the
required indices.

Janis

>
> --
> JR


Reply Janis 4/20/2007 10:16:52 AM
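
The match() suggestion above could look roughly like this. match() sets RSTART to the position where the regexp first matches and RLENGTH to the length of the match, and those two values can be fed to substr(). This is a sketch assuming a POSIX shell and names of the form letters-then-digits; split_name is a made-up helper name.

```shell
# split_name: hypothetical helper; splits the leading letters from the run
# of digits that follows, whatever their lengths
split_name() {
    printf '%s\n' "$1" | awk '
        match($0, /[0-9]+/) {
            # RSTART = index of the first digit, RLENGTH = length of the digit run
            print substr($0, 1, RSTART - 1), substr($0, RSTART, RLENGTH)
        }'
}

split_name ef0001    # prints: ef 0001
split_name shot12    # prints: shot 12
```

Because match() is used as the rule's pattern, names that contain no digits at all produce no output rather than a half-empty line.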

mik3l3374@gmail.com wrote:
> awk '{ num = gensub(/[a-z]+/,"","g"); print num
>       alpha = gensub(/[0-9]+/,"","g") ; print alpha
> 	}
> ' "file"
> 

To separate the 2 parts by a newline:

	awk 'sub(/[a-z]+/,"&\n")'

To get just the alpha part:

	awk 'sub(/[0-9]+/,"")'

To get just the numeric part:

	awk 'sub(/[a-z]+/,"")'

but there's a much better way to do it in shell without using awk. If 
you're interested, post to comp.unix.shell.

	Ed.
Reply Ed 4/20/2007 1:10:14 PM
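
Ed does not say which shell technique he has in mind. One possibility, offered here purely as an assumption, is POSIX parameter expansion, which splits the name without starting awk at all. It needs sh, ksh, bash or similar, so it would not work as written in the csh used in the original post. The function name split_sh is hypothetical.

```shell
# split_sh: hypothetical helper using only POSIX parameter expansion
split_sh() {
    name=$1
    alpha=${name%%[0-9]*}    # drop the longest suffix that begins with a digit
    num=${name#"$alpha"}     # drop the alpha prefix, leaving the digits
    printf '%s %s\n' "$alpha" "$num"
}

split_sh ef0001    # prints: ef 0001
```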

On Apr 20, 6:10 am, Ed Morton <mor...@lsupcaemnt.com> wrote:
> To separate the 2 parts by a newline:
>
>         awk 'sub(/[a-z]+/,"&\n")'
>
> To get just the alpha part:
>
>         awk 'sub(/[0-9]+/,"")'
>
> To get just the numeric part:
>
>         awk 'sub(/[a-z]+/,"")'
>
> but there's a much better way to do it in shell without using awk. If
> you're interested, post to comp.unix.shell.
>
>         Ed.


hey ed, that works really well.  thank you!  i took your advice and
posted to the unix.shell group also, but your solution will do the
trick.

and thanks to everyone else who contributed - tho some of those
solutions were more technical than i could handle!

Reply deepstructure 4/20/2007 5:44:28 PM
