0-padding 4 digit numbers with sed

Greetings,

Given this input file:
tst.txt:
ABCDEFG%8921%9251%0003,201004,201004
ABCDEFG%9351%2951%0004,201004,201004
ABCDEFG%6951%3951%0005,201004,201004
ABCDEFG%4951%9941%0006,201004,201004
ABCDEFG%9954%8954%0007,201004,201004
ABCDEFG%4951%1951%0008,201004,201004

My desired output would be:

ABCDEFG%08921%09251%0003,201004,201004
ABCDEFG%09351%02951%0004,201004,201004
ABCDEFG%06951%03951%0005,201004,201004
ABCDEFG%04951%09941%0006,201004,201004
ABCDEFG%09954%08954%0007,201004,201004
ABCDEFG%04951%01951%0008,201004,201004

which would 0-pad any non-zero four digit number.

However, this command:
sed -e 's:%\([1-9][0-9][0-9][0-9]\)%:%0\1%:g' tst.txt

will correctly 0-pad ONLY the first set of non-zero numbers, resulting
in this output:

ABCDEFG%08921%9251%0003,201004,201004
ABCDEFG%09351%2951%0004,201004,201004
ABCDEFG%06951%3951%0005,201004,201004
ABCDEFG%04951%9941%0006,201004,201004
ABCDEFG%09954%8954%0007,201004,201004
ABCDEFG%04951%1951%0008,201004,201004

The question is, even with the "g" option for global, why does the
above command only change the first occurrence?
Moreover, what is a better way to get my desired result using sed?

Thanks in advance.
Reply jaredsubman (13) 2/24/2010 5:55:17 PM

* jaredsubman@yahoo.com [2010.02.24 17:55]:
> Given this input file:
> tst.txt:
> ABCDEFG%8921%9251%0003,201004,201004

[...]

> My desired output would be:
>
> ABCDEFG%08921%09251%0003,201004,201004

[...]

> However, this command:
> sed -e 's:%\([1-9][0-9][0-9][0-9]\)%:%0\1%:g' tst.txt
>
> will correctly 0-pad ONLY the first set of non-zero numbers, resulting
> in this output:
>
> ABCDEFG%08921%9251%0003,201004,201004

[...]

> The question is, even with the "g" option for global, why does the
> above command only change the first occurrence?

Because you included both '%' signs in your pattern. If the
second one matches, the search for the next occurrence of the
pattern starts *after* it. Removing the second '%' in your
pattern fixes that problem, but I don't know if the
resulting command (below) is robust enough to process your
data correctly:

sed -e 's:%\([1-9][0-9][0-9][0-9]\):%0\1:g' tst.txt
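
To see the non-overlapping behaviour in isolation, here is a minimal sketch that feeds one line of the sample input to both commands. The variable names (line, both, lead) are made up for illustration, and printf stands in for tst.txt.

```shell
# One line of the sample input (stands in for tst.txt).
line='ABCDEFG%8921%9251%0003,201004,201004'

# Original pattern: the trailing % is consumed by the first match,
# so the second number has no leading % left to match against.
both=$(printf '%s\n' "$line" | sed -e 's:%\([1-9][0-9][0-9][0-9]\)%:%0\1%:g')

# Pattern without the trailing %: each match ends at the last digit,
# so the next % is still available for the following match.
lead=$(printf '%s\n' "$line" | sed -e 's:%\([1-9][0-9][0-9][0-9]\):%0\1:g')

printf '%s\n%s\n' "$both" "$lead"
```

The robustness concern is that, without the trailing '%', the pattern no longer checks what follows the fourth digit, so a longer number such as %12345 would also get a 0 prepended.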

-- 
JR
Reply Jean 2/24/2010 6:11:58 PM


jaredsubman@yahoo.com wrote:
> Greetings,
> 
> Given this input file:
> tst.txt:
> ABCDEFG%8921%9251%0003,201004,201004
> ABCDEFG%9351%2951%0004,201004,201004
> ABCDEFG%6951%3951%0005,201004,201004
> ABCDEFG%4951%9941%0006,201004,201004
> ABCDEFG%9954%8954%0007,201004,201004
> ABCDEFG%4951%1951%0008,201004,201004
> 
> My desired output would be:
> 
> ABCDEFG%08921%09251%0003,201004,201004
> ABCDEFG%09351%02951%0004,201004,201004
> ABCDEFG%06951%03951%0005,201004,201004
> ABCDEFG%04951%09941%0006,201004,201004
> ABCDEFG%09954%08954%0007,201004,201004
> ABCDEFG%04951%01951%0008,201004,201004
> 
> which would 0-pad any non-zero four digit number.
> 
> However, this command:
> sed -e 's:%\([1-9][0-9][0-9][0-9]\)%:%0\1%:g' tst.txt
> 
> will correctly 0-pad ONLY the first set of non-zero numbers, resulting
> in this output:
> 
> ABCDEFG%08921%9251%0003,201004,201004
> ABCDEFG%09351%2951%0004,201004,201004
> ABCDEFG%06951%3951%0005,201004,201004
> ABCDEFG%04951%9941%0006,201004,201004
> ABCDEFG%09954%8954%0007,201004,201004
> ABCDEFG%04951%1951%0008,201004,201004
> 
> The question is, even with the "g" option for global, why does the
> above command only change the first occurrence?
> Moreover, what is a better way to get my desired result using sed?
> 
> Thanks in advance.

Your command works for me. Using GNU sed version 3.02.80

Janis
Reply Janis 2/24/2010 6:16:31 PM

On Feb 24, 11:55 am, "jaredsub...@yahoo.com" <jaredsub...@yahoo.com>
wrote:
> Greetings,
>
> Given this input file:
> tst.txt:
> ABCDEFG%8921%9251%0003,201004,201004
> ABCDEFG%9351%2951%0004,201004,201004
> ABCDEFG%6951%3951%0005,201004,201004
> ABCDEFG%4951%9941%0006,201004,201004
> ABCDEFG%9954%8954%0007,201004,201004
> ABCDEFG%4951%1951%0008,201004,201004
>
> My desired output would be:
>
> ABCDEFG%08921%09251%0003,201004,201004
> ABCDEFG%09351%02951%0004,201004,201004
> ABCDEFG%06951%03951%0005,201004,201004
> ABCDEFG%04951%09941%0006,201004,201004
> ABCDEFG%09954%08954%0007,201004,201004
> ABCDEFG%04951%01951%0008,201004,201004
>
> which would 0-pad any non-zero four digit number.
>
> However, this command:
> sed -e 's:%\([1-9][0-9][0-9][0-9]\)%:%0\1%:g' tst.txt
>
> will correctly 0-pad ONLY the first set of non-zero numbers, resulting
> in this output:
>
> ABCDEFG%08921%9251%0003,201004,201004
> ABCDEFG%09351%2951%0004,201004,201004
> ABCDEFG%06951%3951%0005,201004,201004
> ABCDEFG%04951%9941%0006,201004,201004
> ABCDEFG%09954%8954%0007,201004,201004
> ABCDEFG%04951%1951%0008,201004,201004
>
> The question is, even with the "g" option for global, why does the
> above command only change the first occurrence?
> Moreover, what is a better way to get my desired result using sed?
>
> Thanks in advance.

You got the answers to your specific questions but I don't think
anyone's given you a robust sed solution yet so in the meantime you
might want to try this:

awk 'BEGIN{ FS=OFS="%"; fmt="%05s" }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        fld = sep = ""
        numSubFlds = split($fldNr,fldArr,",")
        for (subNr=1; subNr<=numSubFlds; subNr++) {
            fld = fld sep sprintf(fmt,fldArr[subNr])
            sep = ","
        }
        $fldNr = fld
    }
    print
}' tst.txt

        Ed
Reply Ed 2/24/2010 7:34:14 PM

On Feb 24, 1:34 pm, Ed Morton <mortons...@gmail.com> wrote:
> You got the answers to your specific questions but I don't think
> anyone's given you a robust sed solution yet so in the meantime you
> might want to try this:
>
> awk 'BEGIN{ FS=OFS="%"; fmt="%05s" }
> {
>     for (fldNr=1; fldNr<=NF; fldNr++) {
>         fld = sep = ""
>         numSubFlds = split($fldNr,fldArr,",")
>         for (subNr=1; subNr<=numSubFlds; subNr++) {
>             fld = fld sep sprintf(fmt,fldArr[subNr])
>             sep = ","
>         }
>         $fldNr = fld
>     }
>     print
> }' tst.txt
>
>         Ed

Hang on, I just noticed that you DON'T want the 4-digit strings at the
end of your input padded with leading zeros. That makes things much
simpler:

awk 'BEGIN{ FS=OFS="%" } {
    for (i=1; i<=NF; i++)
        $i=sprintf("%05s",$i)
}1' tst.txt

or to make sure you only operate on strings of all-digits:

awk 'BEGIN{ FS=OFS="%" } {
    for (i=1; i<=NF; i++)
        if ($i ~ /^[0-9]+$/)
            $i = sprintf("%05s",$i)
}1' tst.txt

or to ONLY pad 4-digit numbers:

awk 'BEGIN{ FS=OFS="%" } {
    for (i=1; i<=NF; i++)
        if ($i ~ /^[0-9][0-9][0-9][0-9]$/)
            $i = "0"$i
}1' tst.txt

The "sprintf()" solutions will pad any run of 4 _or fewer_ digits.
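
If only four-digit fields that do not already start with a zero should be touched (the "non-zero" requirement in the original post), the last variant can be narrowed a little. This is only a sketch: the pad_fields wrapper and the extra %0042% field in the test line are assumptions for illustration, not part of the original data. It prepends the "0" by string concatenation, so it does not depend on how a given awk treats the 0 flag with %s.

```shell
# Hypothetical wrapper: reads lines on stdin, pads %-separated fields
# that consist of exactly four digits with a non-zero first digit.
pad_fields() {
    awk 'BEGIN { FS = OFS = "%" }
    {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[1-9][0-9][0-9][0-9]$/)
                $i = "0" $i      # plain concatenation, no sprintf
        print
    }'
}

# Sample line with an extra already-padded field (0042) added for illustration.
printf '%s\n' 'ABCDEFG%8921%0042%9251%0003,201004,201004' | pad_fields
```

The last field is left alone because it contains commas and so is not a bare four-digit string.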

Regards,

    Ed.
Reply Ed 2/24/2010 7:47:17 PM

On Feb 24, 10:55 pm, "jaredsub...@yahoo.com" <jaredsub...@yahoo.com>
wrote:
> Greetings,
>
> Given this input file:
> tst.txt:
> ABCDEFG%8921%9251%0003,201004,201004
> ABCDEFG%9351%2951%0004,201004,201004
> ABCDEFG%6951%3951%0005,201004,201004
> ABCDEFG%4951%9941%0006,201004,201004
> ABCDEFG%9954%8954%0007,201004,201004
> ABCDEFG%4951%1951%0008,201004,201004
>
> My desired output would be:
>
> ABCDEFG%08921%09251%0003,201004,201004
> ABCDEFG%09351%02951%0004,201004,201004
> ABCDEFG%06951%03951%0005,201004,201004
> ABCDEFG%04951%09941%0006,201004,201004
> ABCDEFG%09954%08954%0007,201004,201004
> ABCDEFG%04951%01951%0008,201004,201004
>


sed -e '
# requirements:
# 1. only 4-digit numbers are considered, so any 3/2/1 digit numbers
#    must be excluded
# 2. the 4-digit number must not begin with a zero, i.e., must not
#    already be padded
# 3. cases: ^num$ ^num| |num| |num$, where num is a 4-digit number
#    and | means a nondigit

:loop
## ^num$
/^[1-9][0-9][0-9][0-9]$/{
   s/^/0/;b
}
## ^num|
/^[1-9][0-9][0-9][0-9][^0-9]/{
   s/^/0/;bloop
}
## |num|
/\([^0-9]\)\([1-9][0-9][0-9][0-9]\)\([^0-9]\)/{
   s//\10\2\3/;bloop
}
## |num$
/\([^0-9]\)\([1-9][0-9][0-9][0-9]\)$/{
   s//\10\2/;bloop
}
' yourfile

Reply Rakesh 2/26/2010 8:54:19 AM

2010-02-24, 09:55(-08), jaredsubman@yahoo.com:
[...]
> Given this input file:
> tst.txt:
> ABCDEFG%8921%9251%0003,201004,201004
[...]
> sed -e 's:%\([1-9][0-9][0-9][0-9]\)%:%0\1%:g' tst.txt
>
> will correctly 0-pad ONLY the first set of non-zero numbers, resulting
> in this output:
>
> ABCDEFG%08921%9251%0003,201004,201004
[...]

That's what sed's "t" command is for:

sed -e :1 -e 's/%\([1-9][0-9]\{3\}%\)/%0\1/g;t1' tst.txt

This repeats the substitution until "s" no longer finds anything to replace.
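
As a quick check of the looping form, here is a small sketch that runs it over two lines of the sample input. The pad4 wrapper and the inline here-document (used instead of tst.txt) are assumptions for illustration.

```shell
# Hypothetical wrapper around the looping command; reads stdin.
pad4() {
    sed -e :1 -e 's/%\([1-9][0-9]\{3\}%\)/%0\1/g;t1'
}

# Two lines from the sample input, supplied inline instead of tst.txt.
padded=$(pad4 <<'EOF'
ABCDEFG%8921%9251%0003,201004,201004
ABCDEFG%4951%1951%0008,201004,201004
EOF
)

printf '%s\n' "$padded"
```

On each line the first pass pads the first number, "t" sees that a substitution succeeded and branches back to label 1, the second pass pads the second number, and the third pass changes nothing, so the line is printed.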

-- 
Stéphane
Reply Stephane 2/27/2010 7:54:34 PM

Stephane CHAZELAS <stephane_chazelas@yahoo.fr> writes:

> 2010-02-24, 09:55(-08), jaredsubman@yahoo.com:
> [...]
>> Given this input file:
>> tst.txt:
>> ABCDEFG%8921%9251%0003,201004,201004
> [...]
>> sed -e 's:%\([1-9][0-9][0-9][0-9]\)%:%0\1%:g' tst.txt
>>
>> will correctly 0-pad ONLY the first set of non-zero numbers, resulting
>> in this output:

The 'g' causes all non-overlapping occurrences to be replaced, but
because the pattern includes both the initial % and the trailing %, the
effect would be to replace every other non-initial-zero four-digit
number.

>> ABCDEFG%08921%9251%0003,201004,201004
> [...]
>
> That's what sed's "t" command is for:
>
> sed -e :1 -e 's/%\([1-9][0-9]\{3\}%\)/%0\1/g;t1' tst.txt
>
> repeat the operation until "s" no longer succeeds to replace.

It might be a little clearer to drop the 'g' and rely solely on the
looping, but it is harmless.
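
A line with four consecutive numbers makes both points visible. The test line and variable names below are made up for illustration; this is a sketch, not part of the original data.

```shell
# Made-up line with four adjacent four-digit numbers.
line='X%1111%2222%3333%4444%end'

# g alone: matches cannot overlap, so every other number is padded.
g_only=$(printf '%s\n' "$line" | sed -e 's/%\([1-9][0-9]\{3\}%\)/%0\1/g')

# Loop alone (no g): one substitution per pass, repeated until none succeeds.
loop_only=$(printf '%s\n' "$line" | sed -e :1 -e 's/%\([1-9][0-9]\{3\}%\)/%0\1/;t1')

printf '%s\n%s\n' "$g_only" "$loop_only"
```

With 'g' and no loop, only 1111 and 3333 are padded. With the loop and no 'g', all four are padded, at the cost of one extra pass per number.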

-- 
Ben.
Reply Ben 2/27/2010 9:32:39 PM
