


Newbie: how to transform text into lines of text

Hello,

I've read a text file into variable "a"

     a=open('FicheroTexto.txt','r')
     a.read()

"a" contains all the lines of the text separated by '\n' characters.

Now, I want to work with each line separately, without the '\n'
character.

How can I get variable "b" as a list of such lines?

Thank you for your help
1/25/2009 1:32:38 PM

vsoler schrieb:
> Hello,
> 
> I've read a text file into variable "a"
> 
>      a=open('FicheroTexto.txt','r')
>      a.read()
> 
> "a" contains all the lines of the text separated by '\n' characters.

No, it doesn't. "a.read()" *returns* the contents, but you don't assign 
it, so it is discarded.

> Now, I want to work with each line separately, without the '\n'
> character.
> 
> How can I get variable "b" as a list of such lines?


The idiomatic way would be iterating over the file-object itself - which 
will get you the lines:

with open("foo.txt") as inf:
     for line in inf:
         print line


The advantage is that this works even for large files that otherwise 
won't fit into memory. Your approach of reading the full contents can be 
used like this:

content = a.read()
for line in content.split("\n"):
     print line
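
And if what you want is the list "b" from your question rather than
printed output, a minimal sketch along the same lines (reusing your
file name) would be:

a = open('FicheroTexto.txt', 'r')
content = a.read()
b = content.split('\n')   # list of lines, '\n' removed by the split
a.close()
# note: if the file ends with '\n', b gets an empty string at the end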


Diez
deets (3351)
1/25/2009 1:36:15 PM
> The idiomatic way would be iterating over the file-object itself - which 
> will get you the lines:
> 
> with open("foo.txt") as inf:
>      for line in inf:
>          print line

In versions of Python before the "with" statement was introduced (as in 
the 2.4 installations I've got at both home and work), this can simply be

   for line in open("foo.txt"):
     print line

If you are processing lots of files and want each one closed promptly, you can use

   f = open("foo.txt")
   for line in f:
     print line
   f.close()

One other caveat here, "line" contains the newline at the end, so 
you might have

  print line.rstrip('\r\n')

to remove them.


> content = a.read()
> for line in content.split("\n"):
>      print line

Strings have a "splitlines()" method for this purpose:

   content = a.read()
   for line in content.splitlines():
     print line
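
splitlines() also splits on '\r\n' and '\r', and unlike split('\n') it 
doesn't leave a trailing empty string when the file ends with a newline, 
so a one-line sketch for the "b" the original post asked about could be:

   b = open('FicheroTexto.txt').read().splitlines()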

-tkc



python.list (1515)
1/25/2009 1:54:13 PM
On 25 ene, 14:36, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
> vsoler schrieb:
>
> > Hello,
>
> > I've read a text file into variable "a"
>
> >      a=open('FicheroTexto.txt','r')
> >      a.read()
>
> > "a" contains all the lines of the text separated by '\n' characters.
>
> No, it doesn't. "a.read()" *returns* the contents, but you don't assign
> it, so it is discarded.
>
> > Now, I want to work with each line separately, without the '\n'
> > character.
>
> > How can I get variable "b" as a list of such lines?
>
> The idiomatic way would be iterating over the file-object itself - which
> will get you the lines:
>
> with open("foo.txt") as inf:
>      for line in inf:
>          print line
>
> The advantage is that this works even for large files that otherwise
> won't fit into memory. Your approach of reading the full contents can be
> used like this:
>
> content = a.read()
> for line in content.split("\n"):
>      print line
>
> Diez

Thanks a lot. Very quick and clear
1/25/2009 2:04:41 PM
On Jan 26, 12:54 am, Tim Chase <python.l...@tim.thechases.com> wrote:

> One other caveat here, "line" contains the newline at the end, so
> you might have
>
>   print line.rstrip('\r\n')
>
> to remove them.

I don't understand the presence of the '\r' there. Any '\x0d' that
remains after reading the file in text mode and is removed by that
rstrip would be a strange occurrence in the data which the OP may
prefer to find out about and deal with; it is not part of "the
newline". Why suppress one particular data character in preference to
others?

The same applies in any case to the use of rstrip('\n'); if that finds
more than one occurrence of '\x0a' to remove, it has exceeded the
mandate of removing the newline (if any).

So, we are left with the unfortunately awkward
    if line.endswith('\n'):
        line = line[:-1]
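
To make the difference concrete, a small illustration with made-up data
containing two trailing '\x0a' characters:

    s = 'data\n\n'
    print repr(s.rstrip('\n'))   # 'data'   -- *both* trailing newlines stripped
    if s.endswith('\n'):
        s = s[:-1]
    print repr(s)                # 'data\n' -- exactly one newline stripped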

Cheers,
John
sjmachin (3097)
1/25/2009 8:06:48 PM
>> One other caveat here, "line" contains the newline at the end, so
>> you might have
>>
>>   print line.rstrip('\r\n')
>>
>> to remove them.
> 
> I don't understand the presence of the '\r' there. Any '\x0d' that
> remains after reading the file in text mode and is removed by that
> rstrip would be a strange occurrence in the data which the OP may
> prefer to find out about and deal with; it is not part of "the
> newline". Why suppress one particular data character in preference to
> others?

In an ideal world where everybody knew how to make a proper 
text-file, it wouldn't be an issue.  Recreating the form of some 
of the data I get from customers/providers:

  >>> f = file('tmp/x.txt', 'wb')
  >>> f.write('headers\n')  # headers in Unix format
  >>> f.write('data1\r\n')  # data in Dos format
  >>> f.write('data2\r\n')
  >>> f.write('data3')   # no trailing newline of any sort
  >>> f.close()

Then reading it back in:

  >>> for line in file('tmp/x.txt'): print repr(line)
  ...
  'headers\n'
  'data1\r\n'
  'data2\r\n'
  'data3'

As for wanting to know about stray '\r' characters, I only want 
the data -- I don't particularly like to be reminded of the 
incompetence of those who send me malformed text-files ;-)

> The same applies in any case to the use of rstrip('\n'); if that finds
> more than one occurrence of '\x0a' to remove, it has exceeded the
> mandate of removing the newline (if any).

I believe that using the formulaic "for line in file(FILENAME)" 
iteration guarantees that each "line" will have at most only one 
'\n' and it will be at the end (again, a malformed text-file with 
no terminal '\n' may cause it to be absent from the last line)

> So, we are left with the unfortunately awkward
>     if line.endswith('\n'):
>         line = line[:-1]

You're welcome to it, but I'll stick with my more DWIM solution 
of "get rid of anything that resembles an attempt at a CR/LF".

Thank goodness I haven't found any of my data-sources using 
"\n\r" instead, which would require me to left-strip '\r' 
characters as well.  Sigh.  My kingdom for competency. :-/

-tkc





python.list (1515)
1/25/2009 11:34:18 PM
On 26/01/2009 10:34 AM, Tim Chase wrote:

> I believe that using the formulaic "for line in file(FILENAME)" 
> iteration guarantees that each "line" will have at most only one '\n' 
> and it will be at the end (again, a malformed text-file with no terminal 
> '\n' may cause it to be absent from the last line)

It seems that you are right -- not that I can find such a guarantee 
written anywhere. I had armchair-philosophised that writing 
"foo\n\r\nbar\r\n" to a file in binary mode and reading it on Windows in 
text mode would be strict and report the first line as "foo\n\n"; I was 
wrong.

> 
>> So, we are left with the unfortunately awkward
>>     if line.endswith('\n'):
>>         line = line[:-1]
> 
> You're welcome to it, but I'll stick with my more DWIM solution of "get 
> rid of anything that resembles an attempt at a CR/LF".

Thanks, but I don't want it. My point was that you didn't TTOPEWYM (tell 
the OP exactly what you meant).

My approach to DWIM with data is, given
    norm_space = lambda s: u' '.join(s.split())
to break up the line into fields first (just in case the field delimiter 
== '\t') then apply norm_space to each field. This gets rid of your '\r' 
at end (or start!) of line, and multiple whitespace characters are 
replaced by a single space. Whitespace includes NBSP (U+00A0) as an 
added bonus for being righteous and using Unicode :-)
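
A rough sketch of that approach, using a made-up tab-delimited unicode
line (with an NBSP thrown in to show it gets normalised too):

    norm_space = lambda s: u' '.join(s.split())

    line = u'foo \r\tbar\u00a0  baz\r\n'     # hypothetical input line
    fields = [norm_space(f) for f in line.split(u'\t')]
    print fields                             # [u'foo', u'bar baz']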

> Thank goodness I haven't found any of my data-sources using "\n\r" 
> instead, which would require me to left-strip '\r' characters as well.  
> Sigh.  My kingdom for competency. :-/

Indeed. I actually got data in that format once from a *x programmer who 
was so kind as to do it that way just for me because he knew that I use 
Windows and he thought that's what Windows text files looked like. No 
kidding.

Cheers,
John
sjmachin (3097)
1/26/2009 12:44:33 AM
John Machin wrote:
> On 26/01/2009 10:34 AM, Tim Chase wrote:
> 
>> I believe that using the formulaic "for line in file(FILENAME)" 
>> iteration guarantees that each "line" will have at most only one '\n' 
>> and it will be at the end (again, a malformed text-file with no 
>> terminal '\n' may cause it to be absent from the last line)
> 
> It seems that you are right -- not that I can find such a guarantee 
> written anywhere. I had armchair-philosophised that writing 
> "foo\n\r\nbar\r\n" to a file in binary mode and reading it on Windows in 
> text mode would be strict and report the first line as "foo\n\n"; I was 
> wrong.

Here's how I'd do it:
     with open('deheap/deheap.py', 'rU') as source:
         for line in source:
             print line.rstrip()  # Avoid trailing spaces as well.

This should handle \n, \r\n, and \n\r lines.
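
For what it's worth, a quick check of what 'rU' does with mixed endings,
reusing the sample file Tim wrote above:

     for line in open('tmp/x.txt', 'rU'):
         print repr(line)
     # 'headers\n'
     # 'data1\n'
     # 'data2\n'
     # 'data3'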

--Scott David Daniels
Scott.Daniels@Acm.Org
scott.daniels (1683)
1/26/2009 12:57:27 AM
On Sun, 25 Jan 2009 17:34:18 -0600, Tim Chase wrote:

> Thank goodness I haven't found any of my data-sources using "\n\r"
> instead, which would require me to left-strip '\r' characters as well. 
> Sigh.  My kingdom for competency. :-/

If I recall correctly, one of the accounting systems I used eight years 
ago gave you the option of exporting text files with either \r\n or \n\r 
as the end-of-line mark. Neither \n nor \r (POSIX or classic Mac) line 
endings were supported, as that would have been useful.

(It may have been Arrow Accounting, but don't quote me on that.)

I can only imagine the developer couldn't remember which order the 
characters were supposed to go, so rather than look it up, he made it 
optional.



-- 
Steven
steve9679 (1985)
1/26/2009 12:59:05 AM
Scott David Daniels wrote:
> Here's how I'd do it:
>      with open('deheap/deheap.py', 'rU') as source:
>          for line in source:
>              print line.rstrip()  # Avoid trailing spaces as well.
> 
> This should handle \n, \r\n, and \n\r lines.


Unfortunately, a raw rstrip() eats other whitespace that may be 
important.  I frequently get tab-delimited files, using the 
following pseudo-code:

   def clean_line(line):
     return line.rstrip('\r\n').split('\t')

   f = file('customer_x.txt')
   headers = clean_line(f.next())
   for line in f:
     field1, field2, field3 = clean_line(line)
     do_stuff()

if field3 is empty in the source-file, using rstrip(None) as you 
suggest triggers errors on the tuple assignment because it eats 
the tab that defined it.

I suppose if I were really smart, I'd dig a little deeper in the 
CSV module to sniff out the "right" way to parse tab-delimited files.

-tkc



python.list (1515)
1/26/2009 1:30:33 AM
En Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase  
<python.list@tim.thechases.com> escribió:

> Unfortunately, a raw rstrip() eats other whitespace that may be  
> important.  I frequently get tab-delimited files, using the following  
> pseudo-code:
>
>    def clean_line(line):
>      return line.rstrip('\r\n').split('\t')
>
>    f = file('customer_x.txt')
>    headers = clean_line(f.next())
>    for line in f:
>      field1, field2, field3 = clean_line(line)
>      do_stuff()
>
> if field3 is empty in the source-file, using rstrip(None) as you suggest  
> triggers errors on the tuple assignment because it eats the tab that  
> defined it.
>
> I suppose if I were really smart, I'd dig a little deeper in the CSV  
> module to sniff out the "right" way to parse tab-delimited files.

It's so easy that not doing it is just inexcusable laziness :)
Your own example, written using the csv module:

import csv

f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
headers = f.next()
for line in f:
     field1, field2, field3 = line
     do_stuff()

-- 
Gabriel Genellina

gagsl-py2 (3707)
1/26/2009 2:03:54 AM
On Jan 26, 1:03 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:
> En Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase
> <python.l...@tim.thechases.com> escribió:
>
> > Unfortunately, a raw rstrip() eats other whitespace that may be
> > important.  I frequently get tab-delimited files, using the following
> > pseudo-code:
>
> >    def clean_line(line):
> >      return line.rstrip('\r\n').split('\t')
>
> >    f = file('customer_x.txt')
> >    headers = clean_line(f.next())
> >    for line in f:
> >      field1, field2, field3 = clean_line(line)
> >      do_stuff()
>
> > if field3 is empty in the source-file, using rstrip(None) as you suggest
> > triggers errors on the tuple assignment because it eats the tab that
> > defined it.
>
> > I suppose if I were really smart, I'd dig a little deeper in the CSV
> > module to sniff out the "right" way to parse tab-delimited files.
>
> It's so easy that not doing it is just inexcusable laziness :)
> Your own example, written using the csv module:
>
> import csv
>
> f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
> headers = f.next()
> for line in f:
>      field1, field2, field3 = line
>      do_stuff()
>

And where in all of that do you recommend that .decode(some_encoding)
be inserted?

sjmachin (3097)
1/26/2009 2:23:30 AM
En Mon, 26 Jan 2009 00:23:30 -0200, John Machin <sjmachin@lexicon.net>  
escribió:
> On Jan 26, 1:03 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
> wrote:

>> It's so easy that not doing it is just inexcusable laziness :)
>> Your own example, written using the csv module:
>>
>> import csv
>>
>> f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
>> headers = f.next()
>> for line in f:
>>      field1, field2, field3 = line
>>      do_stuff()
>
> And where in all of that do you recommend that .decode(some_encoding)
> be inserted?

For encodings that don't use embedded NUL bytes (latin1, utf8) I'd decode  
the fields right when extracting them:

     field1, field2, field3 = (field.decode('utf8') for field in line)

For encodings that allow NUL bytes, I'd use any of the recipes in the csv  
module documentation.

(That is, if I care about the encoding at all. Perhaps the file contains  
only numbers. Perhaps it contains only ASCII characters. Perhaps I'm only  
interested in some fields for which the encoding is irrelevant. Perhaps it  
is an internally generated file and it doesn't matter as long as I use the  
same encoding on output)
But I admit that in general, "decode input early when reading, work in  
unicode, encode output late when writing" is the best practice.
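
As a rough sketch of that "decode early" pattern with the csv module
(assuming tab-delimited utf8 data with no embedded NUL bytes):

import csv

reader = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
for row in reader:
     fields = [cell.decode('utf8') for cell in row]
     # work only with unicode from here on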

-- 
Gabriel Genellina

gagsl-py2 (3707)
1/26/2009 4:57:55 AM
2009/1/25 Tim Chase <python.list@tim.thechases.com>:

> (again, a malformed text-file with no terminal '\n' may cause it
> to be absent from the last line)

Ahem. That may be "malformed" for some specific file specification,
but it is only "malformed" in general if you are using an operating
system that treats '\n' as a terminator (eg, Linux) rather than as a
separator (eg, MS DOS/Windows).

Perhaps what you don't /really/ want to be reminded of is the
existence of operating systems other than your preferred one?

-- 
Tim Rowe
digitig (154)
1/26/2009 11:23:18 AM
Diez B. Roggisch <deets@nospam.web.de> wrote:
> [ ... ] Your approach of reading the full contents can be 
>used like this:
>
>content = a.read()
>for line in content.split("\n"):
>     print line
>

Or if you want the full content in memory but only ever access it on a
line-by-line basis:

content = a.readlines()

(Just because we can now write "for line in file" doesn't mean that
readlines() is *totally* redundant.)

-- 
\S -- siona@chiark.greenend.org.uk -- http://www.chaos.org.uk/~sion/
   "Frankly I have no feelings towards penguins one way or the other"
        -- Arthur C. Clarke
   her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
siona (310)
1/26/2009 12:22:18 PM
On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:

> content = a.readlines()
> 
> (Just because we can now write "for line in file" doesn't mean that
> readlines() is *totally* redundant.)

But ``content = list(a)`` is shorter.  :-)

Ciao,
	Marc 'BlackJack' Rintsch
bj_666 (1523)
1/26/2009 2:51:33 PM
On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch <bj_666@gmx.net>
wrote:

> On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
> 
> > content = a.readlines()
> > 
> > (Just because we can now write "for line in file" doesn't mean that
> > readlines() is *totally* redundant.)
> 
> But ``content = list(a)`` is shorter.  :-)
> 
But much less clear, wouldn't you say?

content is now what? A list of lines? Characters? Bytes? I-Nodes?
Dates? Granted, it can be inferred from the fact that a file is its
own iterator over its lines, but that is a mental step that readlines()
frees you from doing.

My ~0.0154 €.

/W

-- 
My real email address is constructed by swapping the domain with the
recipient (local part).

geekmail (58)
1/26/2009 3:10:11 PM
On Sun, 2009-01-25 at 18:23 -0800, John Machin wrote:
> On Jan 26, 1:03 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
> wrote:
> > En Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase  
> > <python.l...@tim.thechases.com> escribió:
> >
> >
> >
> > > Unfortunately, a raw rstrip() eats other whitespace that may be  
> > > important.  I frequently get tab-delimited files, using the following  
> > > pseudo-code:
> >
> > >    def clean_line(line):
> > >      return line.rstrip('\r\n').split('\t')
> >
> > >    f = file('customer_x.txt')
> > >    headers = clean_line(f.next())
> > >    for line in f:
> > >      field1, field2, field3 = clean_line(line)
> > >      do_stuff()
> >
> > > if field3 is empty in the source-file, using rstrip(None) as you suggest  
> > > triggers errors on the tuple assignment because it eats the tab that  
> > > defined it.
> >
> > > I suppose if I were really smart, I'd dig a little deeper in the CSV  
> > > module to sniff out the "right" way to parse tab-delimited files.
> >
> > It's so easy that not doing it is just inexcusable laziness :)
> > Your own example, written using the csv module:
> >
> > import csv
> >
> > f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
> > headers = f.next()
> > for line in f:
> >      field1, field2, field3 = line
> >      do_stuff()
> >
> 
> And where in all of that do you recommend that .decode(some_encoding)
> be inserted?
> 

If encoding is an issue for your application, then I'd recommend you use
codecs.open('customer_x.txt', 'rb', encoding='ebcdic') instead of open()
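
For plain line-by-line reading (no csv involved), a minimal sketch of
that, with an illustrative encoding, would be:

import codecs

f = codecs.open('customer_x.txt', 'r', encoding='latin-1')
for line in f:
    print repr(line)    # each line comes back as a unicode object
f.close()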

> --
> http://mail.python.org/mailman/listinfo/python-list
> 

jcd1 (296)
1/26/2009 3:35:39 PM
En Mon, 26 Jan 2009 13:35:39 -0200, J. Cliff Dyer <jcd@sdf.lonestar.org>  
escribió:
> On Sun, 2009-01-25 at 18:23 -0800, John Machin wrote:
>> On Jan 26, 1:03 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
>> wrote:
>> > En Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase
>> > <python.l...@tim.thechases.com> escribió:

>> > > I suppose if I were really smart, I'd dig a little deeper in the CSV
>> > > module to sniff out the "right" way to parse tab-delimited files.
>> >
>> > It's so easy that not doing it is just inexcusable laziness :)
>> > Your own example, written using the csv module:
>> >
>> > import csv
>> >
>> > f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
>> > headers = f.next()
>> > for line in f:
>> >      field1, field2, field3 = line
>> >      do_stuff()
>> >
>>
>> And where in all of that do you recommend that .decode(some_encoding)
>> be inserted?
>
> If encoding is an issue for your application, then I'd recommend you use
> codecs.open('customer_x.txt', 'rb', encoding='ebcdic') instead of open()

This would be the best way *if* the csv module could handle Unicode input,  
but unfortunately this is not the case. See my other reply.

-- 
Gabriel Genellina

gagsl-py2 (3707)
1/26/2009 5:36:35 PM
On Mon, 26 Jan 2009 16:10:11 +0100, Andreas Waldenburger wrote:

> On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch <bj_666@gmx.net>
> wrote:
> 
>> On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
>> 
>> > content = a.readlines()
>> > 
>> > (Just because we can now write "for line in file" doesn't mean that
>> > readlines() is *totally* redundant.)
>> 
>> But ``content = list(a)`` is shorter.  :-)
>> 
> But much less clear, wouldn't you say?

Okay, so let's make it clearer and even shorter: ``lines = list(a)``.  :-)

Ciao,
	Marc 'BlackJack' Rintsch
bj_666 (1523)
1/26/2009 10:12:43 PM
On 26 Jan 2009 22:12:43 GMT Marc 'BlackJack' Rintsch <bj_666@gmx.net>
wrote:

> On Mon, 26 Jan 2009 16:10:11 +0100, Andreas Waldenburger wrote:
> 
> > On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch
> > <bj_666@gmx.net> wrote:
> > 
> >> On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
> >> 
> >> > content = a.readlines()
> >> > 
> >> > (Just because we can now write "for line in file" doesn't mean
> >> > that readlines() is *totally* redundant.)
> >> 
> >> But ``content = list(a)`` is shorter.  :-)
> >> 
> > But much less clear, wouldn't you say?
> 
> Okay, so let's make it clearer and even shorter: ``lines =
> list(a)``.  :-)
> 
OK, you win. :)

/W

-- 
My real email address is constructed by swapping the domain with the
recipient (local part).
geekmail (58)
1/26/2009 10:38:14 PM