Script to extract portions of text from a text file

I had a need to extract portions of text (delimited by fixed patterns)
from a large text file.  Here's a shell script that I wrote for this
purpose.  Kindly suggest a more "elegant" way to do the same:

Thanks,
Bhat

#!/bin/ksh
# Script to extract portions of text from a text file
#
#
#set -x

if [ "$#" -lt "3" ]
then
echo "usage $0: <input-file> <begin-pattern> <end-pattern>"
exit 1
fi

if [ ! -f "$1" ]
then
echo "File $1 does not exist"
exit 2
fi

if [ ! -r "$1" ]
then
echo "Error reading file $1"
exit 3
fi


printFlag=false

function printLines
{
if [ "$printFlag" == "true" ]
then
echo "$*"
fi
}



while read line
do
  val=`echo $line | grep "$2"`
  if [ "$val" != "" ]
  then
    printFlag=true
  else
    val=`echo $line | grep "$3"`
    if [ "$val" != "" ]
    then
      printFlag=false
      echo "\n\n\n"
    fi
  fi
  printLines "$line"
done < $1

exit 0

KP
12/15/2004 12:10:04 AM
comp.unix.shell

2004-12-14, 16:10(-08), KP Bhat:
> I had a need to extract portions of text (delimited by fixed patterns)
> from a large text file.  Here's a shell script that I wrote for this
> purpose.  Kindly suggest a more "elegant" way to do the same:
[...]
> while read line

It's generally a bad idea to write a while read loop. It is best to
use text processing tools to process text. Shells are not meant
to be text stream processing tools.

> do
> val=`echo $line | grep "$2"`
> if [ "$val" != "" ]
> then
> printFlag=true
> else
> val=`echo $line | grep "$3"`
> if [ "$val" != "" ]
> then
> printFlag=false
> echo "\n\n\n"
>     fi
>   fi
>   printLines "$line"
> done < $1
[...]

awk '
  BEGIN {
    beginPattern = ARGV[2]
    endPattern = ARGV[3]
    ARGC = 2
  }
  {
     if ($0 ~ beginPattern)
       doPrint = 1
     else
       if ($0 ~ endPattern)
         doPrint = 0
  }
  doPrint != 0 {print $0}' "$1" "$2" "$3"
       
Beware that awk patterns are extended regular expressions (use
/usr/xpg4/bin/awk on Solaris). GNU awk is the exception: you
need to pass it -W re-interval or -W posix to have it recognize
the braces.

-- 
Stephane
Stephane
12/15/2004 8:24:01 AM
"KP Bhat" <kbhat@sta.samsung.com> writes:

>I had a need to extract portions of text (delimited by fixed patterns)
>from a large text file.  Here's a shell script that I wrote for this
>purpose.  Kindly suggest a more "elegant" way to do the same:

You could do that in 'sed'.
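
For instance, a sed address range prints everything from a line matching
the first pattern through the next line matching the second. This is only
a sketch: the function name extract_with_sed is made up for illustration,
the patterns are basic regular expressions that must not contain a slash,
and unlike the original script the closing line is printed too.

```shell
#!/bin/sh
# Hypothetical helper: print each block of lines running from a line that
# matches the begin pattern through the next line that matches the end pattern.
# Usage: extract_with_sed <input-file> <begin-pattern> <end-pattern>
extract_with_sed() {
    # -n suppresses the default output; p prints only lines inside the range.
    # Assumption: neither pattern contains a "/" (the address delimiter).
    sed -n "/$2/,/$3/p" "$1"
}
```

Called as extract_with_sed notes.txt BEGIN END, it would print every
BEGIN...END block in notes.txt (a hypothetical file name).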

Nick.
-- 
http://www.nick-andrew.net/			http://aus.news-admin.org/
I prefer USENET replies. Don't send email copies. Drop the spamtrap to reply.
Nick
12/15/2004 8:39:47 AM
In article <slrncrvt51.4eq.stephane.chazelas@spam.is.invalid>,
Stephane CHAZELAS  <this.address@is.invalid> wrote:
>It's generally a bad idea to write a while read loop. Best is to
>use text processing tools to process text. shells are not meant
>to be text stream processing tools.

Truer words have never been spoken!

[...]

>awk '
>  BEGIN {
>    beginPattern = ARGV[2]
>    endPattern = ARGV[3]
>    ARGC = 2
>  }
.... etc ...

Better:

    /beginPattern/,/endPattern/
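
Spelled out as a complete command, with the patterns taken from the
script's arguments rather than hard-coded, the range form might look like
the sketch below. The function name and the awk variable names b and e are
illustrative only. Note that a range prints the closing line as well, and
that awk -v processes backslash escapes in the values it is given.

```shell
#!/bin/sh
# Hypothetical wrapper around an awk range pattern.
# Usage: extract_with_awk <input-file> <begin-pattern> <end-pattern>
extract_with_awk() {
    # "$0 ~ b, $0 ~ e" is a range: it switches on at a line matching b and
    # off after the next line matching e. With no action, awk prints the line.
    awk -v b="$2" -v e="$3" '$0 ~ b, $0 ~ e' "$1"
}
```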

gazelle
12/15/2004 1:39:23 PM
Stephane CHAZELAS wrote:
> 2004-12-14, 16:10(-08), KP Bhat:
> > I had a need to extract portions of text (delimited by fixed patterns)
> > from a large text file.  Here's a shell script that I wrote for this
> > purpose.  Kindly suggest a more "elegant" way to do the same:
> [...]
> > while read line
>
> It's generally a bad idea to write a while read loop. Best is to
> use text processing tools to process text. shells are not meant
> to be text stream processing tools.

Not meant to be a text stream processing tool????  That's the
definition of a shell...It's what a shell does.

Along the same thought then, I suppose it's a bad idea to use a while
loop in perl to read from a file....


-- 
Dana French

dfrench
12/15/2004 3:25:57 PM
2004-12-15, 07:25(-08), dfrench@mtxia.com:
[...]
> Not meant to be a text stream processing tool????  That's the
> definition of a shell...It's what a shell does.

???

A shell is a tool to run other commands. Those other commands
can be text processing tools, web browsers, text editors...

Additionally, a shell can take the list of commands from a
file instead of from the user. That's what we call a script.

> Along the same thought then, I suppose it's a bad idea to use a while
> loop in perl to read from a file....

It's different, perl is intended to be a programming language,
not a shell.

-- 
Stephane
Stephane
12/15/2004 3:42:06 PM
Stephane CHAZELAS wrote:
> 2004-12-15, 07:25(-08), dfrench@mtxia.com:
> [...]
> > Not meant to be a text stream processing tool????  That's the
> > definition of a shell...It's what a shell does.
>
> ???
>
> A shell is a tool to run other commands. Those other commands
> can be text processing tools, web browsers, text editors...


A shell is a TEXT PROCESSING PROGRAM that interprets a data stream and
constructs commands, then processes those commands from that data
stream, or submits them to the operating system for processing.


>
> Additionally, a shell can take the list of commands from a
> file instead of from the user. That's what we call a script.


There is no difference between commands processed from the command
line or those stored in a file.


>
> > Along the same thought then, I suppose it's a bad idea to use a while
> > loop in perl to read from a file....
>
> It's different, perl is intended to be a programming language,
> not a shell.


So you seem to be saying the only difference between whether or not it
is  a good idea to use a "while read" loop in a shell or perl, is
whether or not one of the uses of the scripting language is to process
text into commands, one at a time from the keyboard?


-- 
Dana French

dfrench
12/15/2004 4:04:11 PM
On Wed, 15 Dec 2004 at 15:42 GMT, Stephane CHAZELAS wrote:
> 2004-12-15, 07:25(-08), dfrench@mtxia.com:
> [...]
>> Not meant to be a text stream processing tool????  That's the
>> definition of a shell...It's what a shell does.
> 
> ???
> 
> A shell is a tool to run other commands. Those other commands
> can be text processing tools, web browsers, text editors...
> 
> Additionally, a shell can take the list of commands from a
> file instead of from the user. That's what we call a script.
> 
>> Along the same thought then, I suppose it's a bad idea to use a while
>> loop in perl to read from a file....
> 
> It's different, perl is intended to be a programming language,
> not a shell.

    From its inception, the shell has also been designed as a
    programming language.

    As Stephen Bourne said in his introduction to the shell (1978?),
    "The shell is both a command language and a programming language
    that provides an interface to the UNIX operating system."

    As the Korn shell info page says, "The KornShell language is also
    a complete, powerful, high-level programming language for writing
    applications, often more easily and quickly than with other
    high-level languages."

    As the bash manual states, "A Unix shell is both a command
    interpreter, which provides the user interface to the rich set of
    GNU utilities, and a programming language, allowing these
    utilities to be combined."


    The shell is a very powerful programming language. It has its
    quirks, and there are places where it is not the best language to
    use (when dealing with binary data, for example). For some tasks
    it is too slow (e.g., very large files). But for a very wide range
    of applications it is more than adequate, and often the most
    efficient language for the task at hand.

-- 
    Chris F.A. Johnson                  http://cfaj.freeshell.org/shell
    ===================================================================
    My code (if any) in this post is copyright 2004, Chris F.A. Johnson
    and may be copied under the terms of the GNU General Public License
Chris
12/15/2004 4:28:42 PM
2004-12-15, 08:04(-08), dfrench@mtxia.com:
[...]
> A shell is a TEXT PROCESSING PROGRAM that interprets a data stream and
> constructs commands, then processes those commands from that data
> stream, or submits them to the operating system for processing.

I call sed, awk, cut text processing programs. I call a shell a
command line interpreter, a shell.

>> Additionally, a shell can take the list of commands from a
>> file instead of from the user. That's what we call a script.
>
>
> There is no difference between commands processed from the command
> line or those stored in a file.

There are many differences, but that was not my point.

The processing done by the shell is different whether the input
is a file, a tty or neither of them, and depending on the
interactive mode being on or off.

Among the differences are how much is read at a time, the alias,
history expansion, the key handling, how some special builtins
behave...

[...]
> So you seem to be saying the only difference between whether or not it
> is  a good idea to use a "while read" loop in a shell or perl, is
> whether or not one of the uses of the scripting language is to process
> text into commands, one at a time from the keyboard?
[...]

The reason why one shouldn't use "while read" loop is because
shells were not designed to do that, while tools like awk are
designed to do that.

Look at what ugly things ksh93 had to come to to work around the
fact that it is a shell in order to pretend being a programming
language (reading chunks and lseeking back, random optimizations
changing even the meaning of things, building in all the
utilities)...

If read was supposed to be used that way, it wouldn't strip
backslashes and leading/trailing blanks by default. Because it's
a shell, it has to read one character at a time.

-- 
Stephane
Stephane
12/15/2004 4:42:23 PM
In article <32b71qF3ko4joU1@individual.net>,
Chris F.A. Johnson <cfajohnson@gmail.com> wrote:
>On Wed, 15 Dec 2004 at 15:42 GMT, Stephane CHAZELAS wrote:
>> 2004-12-15, 07:25(-08), dfrench@mtxia.com:
>> [...]
>>> Not meant to be a text stream processing tool????  That's the
>>> definition of a shell...It's what a shell does.
>> 
>> ???
>> 
>> A shell is a tool to run other commands. Those other commands
>> can be text processing tools, web browsers, text editors...
>> 
>> Additionally, a shell can take the list of commands from a
>> file instead of from the user. That's what we call a script.
>> 
>>> Along the same thought then, I suppose it's a bad idea to use a while
>>> loop in perl to read from a file....
>> 
>> It's different, perl is intended to be a programming language,
>> not a shell.
>
>    From its inception, the shell has also been designed as a
>    programming language.

Stephane is right about shells and their place in the scheme of things.
The rest of you are just blowing smoke.

>    As Stephen Bourne said in his introduction to the shell (1978?),
>    "The shell is both a command language and a programming language
>    that provides an interface to the UNIX operating system."

And the documentation for Microsoft Windows says things like "Windows
version XYZ is a modern, stable, secure, useful operating system."

Saying stuff don't make it so.  The world would be a much nicer place if it
did...

gazelle
12/15/2004 4:57:59 PM
Stephane CHAZELAS wrote:
> 2004-12-15, 08:04(-08), dfrench@mtxia.com:
> [...]
> > A shell is a TEXT PROCESSING PROGRAM that interprets a data stream and
> > constructs commands, then processes those commands from that data
> > stream, or submits them to the operating system for processing.
>
> I call sed, awk, cut text processing programs. I call a shell a
> command line interpreter, a shell.


The point of a shell is to perform text processing.

SNIP

> > There is no difference between commands processed from the command
> > line or those stored in a file.
>
> There are many differences, but that was not my point.
>
> The processing done by the shell is different whether the input
> is a file a tty or neither of them and depending on the
> interactive mode being on or off.
>
> Among the differences are how much is read at a time, the alias,
> history expansion, the key handling, how some special builtins
> behave...
>

There is no difference between a command processed at the command line
or from a file.  There may be some additional activities related to the
opening/closing/manipulating the file, but the text processing of the
commands is the same.


> [...]
> > So you seem to be saying the only difference between whether or not it
> > is  a good idea to use a "while read" loop in a shell or perl, is
> > whether or not one of the uses of the scripting language is to process
> > text into commands, one at a time from the keyboard?
> [...]
>
> The reason why one shouldn't use "while read" loop is because
> shells were not designed to do that, while tools like awk are
> designed to do that.


"awk" is an interpreted scripting language that performs its
instructions one line at a time.  Those instructions come from the
command line, or from a file, just like a shell.
The difference here is you typically use a shell to perform text
processing on a data stream before sending that data stream to "awk".

Is your argument that with "awk" there is an implied "while read", as
opposed to a shell where there must be an explicit "while read"?


>
> Look at what ugly things ksh93 had to come to to work around the
> fact that it is a shell in order to pretend being a programming
> language (reading chunks and lseeking back, random optimizations
> changing even the meaning of things, building in all the
> utilities)...
>
> If read was supposed to be used that way, it wouldn't strip
> backslashes and leading/trailing blanks by default. Because it's
> a shell, it has to read one character at a time.


This makes no sense.  You'll have to explain this further before I
could even attempt to respond.


-- 
Dana French

dfrench
12/15/2004 5:18:42 PM
On Wed, 15 Dec 2004 at 16:57 GMT, Kenny McCormack wrote:
> In article <32b71qF3ko4joU1@individual.net>,
> Chris F.A. Johnson <cfajohnson@gmail.com> wrote:
>>On Wed, 15 Dec 2004 at 15:42 GMT, Stephane CHAZELAS wrote:
>>> 2004-12-15, 07:25(-08), dfrench@mtxia.com:
>>> [...]
>>>> Not meant to be a text stream processing tool????  That's the
>>>> definition of a shell...It's what a shell does.
>>> 
>>> ???
>>> 
>>> A shell is a tool to run other commands. Those other commands
>>> can be text processing tools, web browsers, text editors...
>>> 
>>> Additionally, a shell can take the list of commands from a
>>> file instead of from the user. That's what we call a script.
>>> 
>>>> Along the same thought then, I suppose it's a bad idea to use a while
>>>> loop in perl to read from a file....
>>> 
>>> It's different, perl is intended to be a programming language,
>>> not a shell.
>>
>>    From its inception, the shell has also been designed as a
>>    programming language.
> 
> Stephane is right about shells and their place in the scheme of things.

    There are thousands of shell applications that prove otherwise.

> The rest of you are just blowing smoke.

    No, but as soon as I find a match and light my pipe I will be. ;)

>>    As Stephen Bourne said in his introduction to the shell (1978?),
>>    "The shell is both a command language and a programming language
>>    that provides an interface to the UNIX operating system."
> 
> And the documentation for Microsoft Windows says things like "Windows
> version XYZ is a modern, stable, secure, useful operating system."
> 
> Saying stuff don't make it so.  The world would be a much nicer place if it
> did...


-- 
    Chris F.A. Johnson                  http://cfaj.freeshell.org/shell
    ===================================================================
    My code (if any) in this post is copyright 2004, Chris F.A. Johnson
    and may be copied under the terms of the GNU General Public License
Chris
12/15/2004 5:21:45 PM
On Wed, 15 Dec 2004 at 16:42 GMT, Stephane CHAZELAS wrote:
> 2004-12-15, 08:04(-08), dfrench@mtxia.com:
> [...]
>> A shell is a TEXT PROCESSING PROGRAM that interprets a data stream and
>> constructs commands, then processes those commands from that data
>> stream, or submits them to the operating system for processing.
> 
> I call sed, awk, cut text processing programs. I call a shell a
> command line interpreter, a shell.
> 
>>> Additionally, a shell can take the list of commands from a
>>> file instead of from the user. That's what we call a script.
>>
>>
>> There is no difference between commands processed from the command
>> line or those stored in a file.
> 
> There are many differences, but that was not my point.
> 
> The processing done by the shell is different whether the input
> is a file a tty or neither of them and depending on the
> interactive mode being on or off.
> 
> Among the differences are how much is read at a time, the alias,
> history expansion, the key handling, how some special builtins
> behave...
> 
> [...]
>> So you seem to be saying the only difference between whether or not it
>> is  a good idea to use a "while read" loop in a shell or perl, is
>> whether or not one of the uses of the scripting language is to process
>> text into commands, one at a time from the keyboard?
> [...]
> 
> The reason why one shouldn't use "while read" loop is because
> shells were not designed to do that, while tools like awk are
> designed to do that.

    Of course shells were designed to do that. The only difference
    between a shell and awk is that it is explicit (and therefore
    slower) in the shell.

> Look at what ugly things ksh93 had to come to to work around the
> fact that it is a shell in order to pretend being a programming
> language (reading chunks and lseeking back, random optimizations
> changing even the meaning of things, building in all the
> utilities)...
> 
> If read was supposed to be used that way, it wouldn't strip
> backslashes and leading/trailing blanks by default.

    On the contrary; stripping leading and trailing whitespace is a
    useful feature that can save a lot of work. On older systems there
    was a 'line' command if you needed absolutely everything on the
    line. It has been superseded by 'IFS= read -r'.

    The ability to continue a line by ending it with a backslash
    simplifies writing data files that would otherwise require very
    long lines.
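
    The difference between the two forms can be seen with a short
    sketch like the one below. The sample line and the variable names
    are made up for illustration.

```shell
#!/bin/sh
# One input line with leading/trailing blanks and a backslash in the middle.
sample='  C:\temp  '

# Plain read: IFS whitespace is trimmed and the backslash is consumed.
plain=$(printf '%s\n' "$sample" | { read line; printf '%s' "$line"; })

# IFS= read -r: the line is kept exactly as it was written.
raw=$(printf '%s\n' "$sample" | { IFS= read -r line; printf '%s' "$line"; })

printf 'plain: [%s]\n' "$plain"
printf 'raw:   [%s]\n' "$raw"
```

    The first line of output shows the trimmed value with the backslash
    gone; the second shows the untouched line, blanks and all.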

> Because it's a shell, it has to read one character at a time.

    That looks like a _non sequitur_.

-- 
    Chris F.A. Johnson                  http://cfaj.freeshell.org/shell
    ===================================================================
    My code (if any) in this post is copyright 2004, Chris F.A. Johnson
    and may be copied under the terms of the GNU General Public License
Chris
12/15/2004 5:34:14 PM
2004-12-15, 09:18(-08), dfrench@mtxia.com:
[...]
> There is no difference between a command processed at the command line
> or from a file.  There may be some additional activities related to the
> opening/closing/manipulating the file, but the text processing of the
> commands is the same.

Some examples:

set -o emacs
echi^Ho foo

(where ^H is a BS character) is read differently from the tty
and from a script

exec 3> /
behaves differently in a script and at the prompt

Same for

echo > *

[...]
> "awk" is an interpreted scripting language that performs its
> instructions one line at a time.  Those instructions come from the
> command line, or from a file, just like a shell.
> The difference here is you typically use a shell to perform text
> processing on a data stream before sending that data stream to "awk".
>
> Is your argument that with "awk" there is an implied "while read", as
> opposed to a shell where there must be an explicit "while read"?

I don't follow, could you please come with an example?

[...]
>> Look at what ugly things ksh93 had to come to to work around the
>> fact that it is a shell in order to pretend being a programming
>> language (reading chunks and lseeking back, random optimizations
>> changing even the meaning of things, building in all the
>> utilities)...
>>
>> If read was supposed to be used that way, it wouldn't strip
>> backslashes and leading/trailing blanks by default. Because it's
>> a shell, it has to read one character at a time.
>
>
> This makes no sense.  You'll have to explain this further before I
> could even attempt to respond.
[...]

cmd1 | while IFS= read -r line
do
  expr -- "$line" : ...
done

"read" has to read one character at a time until it finds a \n,
because otherwise expr would not be able to get the second
line of input if it wanted to. (It doesn't want to, but read
does not know that.)

Moreover, the fact that the above code is illegible is a strong
indication that I'm not doing it the right way.

The right way being:

cmd1 | awk '{ print substr($1, ...) }'

awk reads a whole buffer at a time, does the processing
internally.
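
The consequence of read stopping exactly at the newline can be observed
from the shell itself. The sketch below (the function name is invented for
illustration) lets cat inherit the same pipe after a single read; because
read took one line and not a byte more, cat still receives the rest. An
awk program that exits after its first record would typically have pulled
a whole buffer from the pipe already, so a following cat would likely find
nothing left; that depends on the implementation and on timing, so it is
not shown as a guaranteed result here.

```shell
#!/bin/sh
# read takes exactly one line from the pipe; the next command still sees the rest.
rest_after_read() {
    printf '%s\n' first second third | {
        IFS= read -r line   # consumes "first" and its newline, nothing more
        cat                 # inherits the same stdin: gets the remaining lines
    }
}
rest_after_read
```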

-- 
Stephane
Stephane
12/15/2004 5:41:31 PM
2004-12-15, 17:34(+00), Chris F.A. Johnson:
[...]
>     On the contrary; stripping leading and trailing whitespace is a
>     useful feature that can save a lot of work. On older systems there
>     was a 'line' command if you needed absolutely everything on the
>     line. It has been superseded by 'IFS= read -r'.

The obvious way, if read was meant to read input for text
processing would have been:

read line # read a line without extra processing
read -s line # read line AND strip spaces (but best would have
             # been to provide an additional command/operator to
             # do so).
read -l line # read logical line continued with backslashes.

>
>     The ability to continue a line by ending it with a backslash
>     simplifies writing data files that would otherwise require very
>     long lines.

But it's broken. It also strips backslashes in the middle of the
line, while the obvious way is to treat only the last one
specially, as is done in most other tools and in most text
formats that follow that convention (which means you can't
process those formats (C, awk, rc...) with read).

And that feature is not useful for text processing but for
reading from the user.

-- 
Stephane
Stephane
12/15/2004 6:04:23 PM
Kenny McCormack <gazelle@yin.interaccess.com> wrote:
> >    As Stephen Bourne said in his introduction to the shell (1978?),
> >    "The shell is both a command language and a programming language
> >    that provides an interface to the UNIX operating system."
> 
> And the documentation for Microsoft Windows says things like "Windows
> version XYZ is a modern, stable, secure, useful operating system."
> 
> Saying stuff don't make it so.  The world would be a much nicer place
> if it did...

Just because you can't afford Windows and quality hardware to run it on,
that doesn't mean Microsoft has to spoon feed you and wipe your arse.  
Lay off on Windows.  It's good for what it was designed for.  

Just because you can't program in shell, that doesn't mean Bash isn't
good for programming.
William
12/15/2004 6:19:55 PM
On Wed, 15 Dec 2004 at 18:04 GMT, Stephane CHAZELAS wrote:
> 2004-12-15, 17:34(+00), Chris F.A. Johnson:
> [...]
>>     On the contrary; stripping leading and trailing whitespace is a
>>     useful feature that can save a lot of work. On older systems there
>>     was a 'line' command if you needed absolutely everything on the
>>     line. It has been superseded by 'IFS= read -r'.
> 
> The obvious way, if read was meant to read input for text
> processing would have been:
> 
> read line # read a line without extra processing

    That's what the "line" command was for.

> read -s line # read line AND strip spaces (but best would have
>              # been to provide an additional command/operator to
> 	     # do so.
> read -l line # read logical line continued with backslashes.
> 
>>
>>     The ability to continue a line by ending it with a backslash
>>     simplifies writing data files that would otherwise require very
>>     long lines.
> 
> But it's broken.

   It is not broken; its behaviour is just not what YOU want. I have
   no problem with it.

> It also strips backslashes in the middle of the
> line, while the obvious way is to take only the last one
> specially as is done in most other tools or as in most text
> formats that follow that convention (which means you can't
> process those formats (C, awk, rc...) with read.

    Most text files do NOT have backslashes; that's why it was chosen
    as the escape character. There are specialized text formats that
    do, and they must be handled differently.

> And that feature is not useful for text processing but for
> reading from the user.

    In my experience, it is exactly the opposite.

    I have never used it for reading from the user; I have used it
    many times for reading from a file.


-- 
    Chris F.A. Johnson                  http://cfaj.freeshell.org/shell
    ===================================================================
    My code (if any) in this post is copyright 2004, Chris F.A. Johnson
    and may be copied under the terms of the GNU General Public License
Chris
12/15/2004 6:22:00 PM
Stephane CHAZELAS wrote:
> 2004-12-15, 09:18(-08), dfrench@mtxia.com:
> [...]
> > There is no difference between a command processed at the command line
> > or from a file.  There may be some additional activities related to the
> > opening/closing/manipulating the file, but the text processing of the
> > commands is the same.
>
> Some examples:
>
> set -o emacs
> echi^Ho foo
>
> (where ^H is a BS character) is read differently from the tty
> and from a script


But your example is not comparing the same command.  At the tty the ^H
is interpreted by the tty driver, not the shell, whereas if you embed a
literal ^H into a command in a script, it fails because the command
"echi^Ho" does not exist.  In fact, if you escape the ^H at the command
line, it is processed exactly the same as it would be in a script: with
an embedded ^H, they both fail.


>
> exec 3> /
> behaves differently in a script and at the prompt

I saw no difference in behavior between the command line and script in
ksh93.  Others should respond regarding other shells.


>
> Same for
>
> echo > *

I'll concede this one; however, it is extremely unlikely this command
would actually be used from the command line or from a script.


>
> [...]
> > "awk" is an interpreted scripting language that performs its
> > instructions one line at a time.  Those instructions come from the
> > command line, or from a file, just like a shell.
> > The difference here is you typically use a shell to perform text
> > processing on a data stream before sending that data stream to "awk".
> >
> > Is your argument that with "awk" there is an implied "while read", as
> > opposed to a shell where there must be an explicit "while read"?
>

# explicit "while read" in shell followed
# by an implied "while read" in awk
while IFS=":" read U J J J J H J
do
print "${U} ${H}"
done < /etc/passwd  | awk '{ print $2 " is the home dir of " $1 }'
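
For comparison, here is a sketch of the same report produced by awk alone,
with its implicit read loop doing the field splitting. The function name
and the optional file argument are additions for illustration; the field
positions assume the usual seven-field passwd layout.

```shell
#!/bin/sh
# Hypothetical helper: same report as the pipeline above, without the shell loop.
# Usage: home_dirs [passwd-style-file]   (defaults to /etc/passwd)
home_dirs() {
    # -F: splits each record on colons; $1 is the login name, $6 the home dir.
    awk -F: '{ print $6 " is the home dir of " $1 }' "${1:-/etc/passwd}"
}
```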

> I don't follow, could you please come with an example?
>
> [...]
> > This makes no sense.  You'll have to explain this further before I
> > could even attempt to respond.
> [...]
>
> cmd1 | while IFS= read -r line
> do
>   expr -- "$line" : ...
> done
>
> "read" has to read one character at a time until it finds a \n ,
> because otherwise, expr would not be able to get the second
> line of input (if it wanted to. It doesn't want but read
> does not know that).


How is the shell using a \n to designate the end of a line different
from awk using the same character to designate the end of a line?


>
> Moreover, the fact that the above code is illegible is a strong
> indication that I'm not doing it the right way.


Most code is illegible if you don't know the syntax, that is why it's
called code.


>
> The right way being:
>
> cmd1 | awk '{ print substr($1, ...) }'


Why is this code more legible than the previous code?

> awk reads a whole buffer at a time, does the processing
> internally.

The shell generated the buffer for awk to read from.  Sounds like an
extra step to me.

The original argument was that a "while read" loop is a bad idea, as
though there was some fundamental problem with it.  If you are
processing large text files, it is generally faster to use "awk", but
you have not presented any reason why performing a shell "while read"
is a bad idea.


-- 
Dana French

dfrench
12/15/2004 7:27:16 PM
2004-12-15, 18:22(+00), Chris F.A. Johnson:
[...]
>> read line # read a line without extra processing
>
>     That's what the "line" command was for.

Yes. And what was the "read" command for, then?


>> But it's broken.
>
>    It is not broken; its behaviour is just not what YOU want. I have
>    no problem with it.
>
>> It also strips backslashes in the middle of the
>> line, while the obvious way is to take only the last one
>> specially as is done in most other tools or as in most text
>> formats that follow that convention (which means you can't
>> process those formats (C, awk, rc...) with read.

Given that its behavior assumes a format that I have never seen
implemented (I mean, I have never seen any language/text format
where "foo\\bar\
baz" actually meant "foo\barbaz"), that
makes a good reason to think it is broken or unusable (not to
mention that this behavior, broken to my mind if you like, is
activated by default).

>     Most text files do NOT have backslashes; that's why it was chosen
>     as the escape character. There are specialized text formats that
>     do, and they must be handled differently.

That's not a valid argument. Most text files don't have #
character, however read doesn't strip what's after "#", even
though that makes sense as very often, you don't want what is
after the "#".

I guess \ behaves that way because then you can have almost the
same token recognition as in the shell.

At the prompt:
cmd 1\ 2 3\
4

and cmd gets the two arguments "1 2" and "34". Similarly:

read a b
on an input like:
1\ 2 3\
4

$a gets "1 2", $b gets "34". So that is somewhat intuitive for a
user who is asked for several space-separated words. That
behavior makes no sense from a text processing perspective, except
in the very unlikely case where the data has that very
particular format.
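
The read case above can be reproduced with a few lines. This is a sketch
only: the function name is invented, and printf is used to generate the
two physical input lines so that the backslashes reach read unchanged.

```shell
#!/bin/sh
# Feed read the two physical lines from the example above.
split_fields() {
    printf '%s\n' '1\ 2 3\' '4' | {
        read a b            # no -r: backslashes escape the blank and the newline
        printf 'a=[%s] b=[%s]\n' "$a" "$b"
    }
}
split_fields
# prints: a=[1 2] b=[34]
```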

-- 
Stephane
Stephane
12/15/2004 8:30:53 PM
In article <32bdm8F3acmt9U1@individual.net>, Chris F.A. Johnson wrote:
> On Wed, 15 Dec 2004 at 18:04 GMT, Stephane CHAZELAS wrote:
>> 2004-12-15, 17:34(+00), Chris F.A. Johnson:
>> [...]
>>>     On the contrary; stripping leading and trailing whitespace is a
>>>     useful feature that can save a lot of work. On older systems there
>>>     was a 'line' command if you needed absolutely everything on the
>>>     line. It has been superseded by 'IFS= read -r'.
>> 
>> The obvious way, if read was meant to read input for text
>> processing would have been:
>> 
>> read line # read a line without extra processing
> 
>     That's what the "line" command was for.
> 
>> read -s line # read line AND strip spaces (but best would have
>>              # been to provide an additional command/operator to
>> 	     # do so.
>> read -l line # read logical line continued with backslashes.
>> 
>>>
>>>     The ability to continue a line by ending it with a backslash
>>>     simplifies writing data files that would otherwise require very
>>>     long lines.
>> 
>> But it's broken.
> 
>    It is not broken; its behaviour is just not what YOU want. I have
>    no problem with it.
> 
>> It also strips backslashes in the middle of the
>> line, while the obvious way is to take only the last one
>> specially as is done in most other tools or as in most text
>> formats that follow that convention (which means you can't
>> process those formats (C, awk, rc...) with read.
> 
>     Most text files do NOT have backslashes; that's why it was chosen
>     as the escape character. There are specialized text formats that
>     do, and they must be handled differently.
> 
>> And that feature is not useful for text processing but for
>> reading from the user.
> 
>     In my experience, it is exactly the opposite.
> 
>     I have never used it for reading from the user; I have used it
>     many times for reading from a file.

Chris - I find it hard to believe that you have never used 'read' to read input
from the user. Do you not write any interactive scripts? Maybe I'm misreading
your reply...

Kevin
Kevin
12/15/2004 11:51:23 PM
2004-12-15, 11:27(-08), dfrench@mtxia.com:
[...]
>> set -o emacs
>> echi^Ho foo
>>
>> (where ^H is a BS character) is read differently from the tty
>> and from a script
>
>
> But your example is not comparing the same command.  At the tty the ^H
> is interpreted by the tty driver

No, not in emacs mode, of course (the tty is put in raw and
noecho mode, otherwise ^B wouldn't move the cursor backward as
it wouldn't be sent to the shell until you press enter).


[...]
>> exec 3> /
>> behaves differently in a script and at the prompt
>
> I saw no difference in behavior between the command line and script in
> ksh93.  Others should respond regarding other shells.

That's documented. If exec fails, it exits the shell in a
script. Not at the prompt.

[...]
>> echo > *
>
> I'll concede this one; however, it is extremely unlikely that this
> command would actually be used from the command line or from a script.

Same for:

echo > $file

[...]
>> "read" has to read one character at a time until it finds a \n ,
>> because otherwise, expr would not be able to get the second
>> line of input (if it wanted to. It doesn't want but read
>> does not know that).
>
>
> How is the shell using a \n to designate the end of a line different
> from awk using the same character to designate the end of a line?

The shell does:

while (read(0, &c, 1) && c != '\n')...
  (one read system call per byte).

While awk does:

read(0, buf, 4096); processBufferToGetRecords()
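
One observable consequence of that byte-at-a-time reading, as a small
sketch (assuming a POSIX-like shell reading from a pipe; the variable
names are illustrative): read consumes exactly one line, so the next
command sharing the same input still sees everything after it.

```sh
# read takes only "first\n" from the pipe; cat, which runs next on the
# same standard input, receives whatever is left.
rest=$(printf 'first\nsecond\n' | { read -r first; cat; })
printf 'left for cat: %s\n' "$rest"
```

A tool that reads a whole buffer at once gives no such guarantee: whatever
it has already buffered is gone for the commands that follow it.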


> Most code is illegible if you don't know the syntax, that is why it's
> called code.

Sure, some are more legible than others.

I know few people who know that the way to read a line of input in
the shell is IFS=<non-blanks> read -r line. It looked like even
you didn't, as you used to forget the "IFS=".
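
To make the difference concrete, here is a hedged sketch (assuming a
POSIX-like shell; the sample string and variable names are made up for
illustration, and an empty IFS is used as one way of avoiding the
whitespace stripping) that reads the same line three ways:

```sh
# One line with leading/trailing blanks and a backslash in the middle.
sample='  a\b  '

# Plain read: blanks are stripped and the backslash is removed.
plain=$(printf '%s\n' "$sample" | { read line; printf '[%s]' "$line"; })

# read -r: the backslash survives, but the blanks are still stripped.
raw=$(printf '%s\n' "$sample" | { read -r line; printf '[%s]' "$line"; })

# IFS= read -r: the line comes through unchanged.
exact=$(printf '%s\n' "$sample" | { IFS= read -r line; printf '[%s]' "$line"; })

printf '%s\n' "$plain" "$raw" "$exact"
```

Only the last form hands back the line exactly as it appeared in the input.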


>> The right way being:
>>
>> cmd1 | awk '{ print substr($1, ...) }'
>
>
> Why is this code more legible than the previous code?

The shell is doing its job: running two commands piped together
cmd1 is doing its job: output what it has to output
awk is doing its job: text processing.

How could it be neater?

>
>> awk reads a whole buffer at a time, does the processing
>> internally.
>
> The shell generated the buffer for awk to read from.  Sounds like an
> extra step to me.

No, cmd1 did; the shell didn't do anything to the data, it just
connected two applications together and let them do their job.

-- 
Stephane
Stephane
12/16/2004 9:58:25 AM
On Wed, 15 Dec 2004 at 20:30 GMT, Stephane CHAZELAS wrote:
> 2004-12-15, 18:22(+00), Chris F.A. Johnson:
> [...]
>>> read line # read a line without extra processing
>>
>>     That's what the "line" command was for.
> 
> Yes. And what was the "read" command for, then?

    To read normal text. Which it does without any problem.

>>> But it's broken.
>>
>>    It is not broken; its behaviour is just not what YOU want. I have
>>    no problem with it.
>>
>>> It also strips backslashes in the middle of the
>>> line, while the obvious way is to take only the last one
>>> specially as is done in most other tools or as in most text
>>> formats that follow that convention (which means you can't
>>> process those formats (C, awk, rc...) with read).
> 
> Given that its behavior assumes a format that I have never seen
> implemented (I mean, I have never seen any language/text format
> where "foo\\bar\
> baz" actually meant "foo\barbaz"), that
> makes a good reason to think it is broken or unusable (not to
> mention that this behavior, broken to my mind if you like, is
> activated by default).
> 
>>     Most text files do NOT have backslashes; that's why it was chosen
>>     as the escape character. There are specialized text formats that
>>     do, and they must be handled differently.
> 
> That's not a valid argument. Most text files don't have #
> character, however read doesn't strip what's after "#", even
> though that makes sense as very often, you don't want what is
> after the "#".

    The octothorpe is a character with a long history, and several
    uses in every-day text; the backslash is not.

> I guess \ behaves that way because then you can have almost the
> same token recognition as in the shell.
> 
> At the prompt:
> cmd 1\ 2 3\
> 4
> 
> and cmd gets the two arguments "1 2" and "34". Similarly:
> 
> read a b
> on an input like:
> 1\ 2 3\
> 4
> 
> $a gets "1 2", $b gets "34". So that's somewhat intuitive for a
> user who is asked for several space-separated words. That
> behavior makes no sense from a text processing perspective, except
> in the very unlikely case where the data has that very
> particular format.

    A file with backslashes is just as much a "very particular
    format". The backslash is not a character anyone uses (except in
    some specific computer-related scenarios); it is a non-issue.

    Besides, there have always been ways to get whatever behaviour one
    wants, whether it is "read -r" or "line".
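
For reference, the two behaviours can be compared on the "foo\\bar\" /
"baz" input quoted above. This is only a sketch, assuming a POSIX-like
shell; the variable names are illustrative:

```sh
# Default read: "\\" collapses to one backslash and the trailing
# backslash joins the next physical line onto this one.
cooked=$(printf '%s\n' 'foo\\bar\' 'baz' | { read line; printf '%s' "$line"; })

# read -r: the first physical line is returned with its backslashes intact.
raw=$(printf '%s\n' 'foo\\bar\' 'baz' | { read -r line; printf '%s' "$line"; })

printf 'default: %s\n' "$cooked"
printf 'read -r: %s\n' "$raw"
```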

-- 
    Chris F.A. Johnson                  http://cfaj.freeshell.org/shell
    ===================================================================
    My code (if any) in this post is copyright 2004, Chris F.A. Johnson
    and may be copied under the terms of the GNU General Public License
Chris
12/16/2004 5:10:50 PM
2004-12-16, 17:10(+00), Chris F.A. Johnson:
[...]
>     A file with backslashes is just as much a "very particular
>     format". The backslash is not a character anyone uses (except in
>     some specific computer-related scenarios); it is a non-issue.
[...]

As long as you use shells in non-computer-related scenarii, I'm
happy with that ;).

-- 
Stephane
Stephane
12/16/2004 5:32:45 PM
On Thu, 16 Dec 2004 at 17:32 GMT, Stephane CHAZELAS wrote:
> 2004-12-16, 17:10(+00), Chris F.A. Johnson:
> [...]
>>     A file with backslashes is just as much a "very particular
>>     format". The backslash is not a character anyone uses (except in
>>     some specific computer-related scenarios); it is a non-issue.
> [...]
> 
> As long as you use shells in non-computer-related scenarii, I'm
> happy with that ;).

   I have no issues with read in any scenario. 

-- 
    Chris F.A. Johnson                  http://cfaj.freeshell.org/shell
    ===================================================================
    My code (if any) in this post is copyright 2004, Chris F.A. Johnson
    and may be copied under the terms of the GNU General Public License
Chris
12/16/2004 6:03:02 PM