pdf to text

I know this question comes up a lot, so here goes again. I want to read 
text from a PDF file, run re searches on the text, etc. I do not care 
about layout, fonts, borders, etc. I just want the text. I've been 
reading Adobe's PDF Reference Guide and I'm beginning to develop a 
better understanding of PDF in general, but I need a bit of help... this 
seems like it should be easier than it is. Here's some code:

import zlib

fp = open('test.pdf', 'rb')
bytes = []
while 1:
     byte = fp.read(1)
     #print byte
     bytes.append(byte)
     if not byte:
         break

for byte in bytes:

     op = open('pdf.txt', 'a')

     dco = zlib.decompressobj()

     try:
         s = dco.decompress(byte)
         #print >> op, s
         print s
     except Exception, e:
         print e

     op.close()

fp.close()

I know the text is compressed... that it would have stream and endstream
markers and BT (Begin Text) and ET (End Text) and that the uncompressed
text is enclosed in parentheses (this is my text). Has anyone here done
this in a simple fashion? I've played with the pyPdf library some, but
it seems overly complex for my needs (merge PDFs, write PDFs, etc). I
just want a simple PDF text extractor.
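
As a rough illustration of the structure described above (not a general extractor), the sketch below finds each stream...endstream span, tries to inflate it with zlib, and collects the parenthesised strings between BT and ET. It is written for Python 3, and the names STREAM_RE, TEXT_BLOCK_RE, STRING_RE, iter_inflated_streams and extract_strings are made up for this example. It assumes Flate-compressed content streams and simple literal strings; PDFs that use other filters, object streams, hex strings, nested parentheses or custom font encodings will probably not come out as readable text.

```python
import re
import zlib

# Raw bytes between a "stream" keyword (plus its end-of-line) and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A text object: everything between the BT and ET operators.
TEXT_BLOCK_RE = re.compile(rb"BT(.*?)ET", re.DOTALL)
# A literal string such as (this is my text); backslash escapes are allowed,
# nested unescaped parentheses are not handled.
STRING_RE = re.compile(rb"\((?:\\.|[^\\()])*\)")


def iter_inflated_streams(pdf_bytes):
    """Yield the decompressed contents of every stream that zlib can inflate."""
    for match in STREAM_RE.finditer(pdf_bytes):
        # A fresh decompressobj per stream; it ignores the end-of-line bytes
        # that trail the compressed data before "endstream".
        inflater = zlib.decompressobj()
        try:
            yield inflater.decompress(match.group(1))
        except zlib.error:
            # Not Flate data (an image, another filter, plain text): skip it.
            continue


def extract_strings(pdf_bytes):
    """Return the literal strings found inside BT ... ET blocks, as bytes."""
    strings = []
    for content in iter_inflated_streams(pdf_bytes):
        for block in TEXT_BLOCK_RE.finditer(content):
            for literal in STRING_RE.finditer(block.group(1)):
                strings.append(literal.group(0)[1:-1])  # drop the parentheses
    return strings
```

Reading the whole file at once, for example extract_strings(open('test.pdf', 'rb').read()), avoids feeding zlib one byte at a time with a new decompressor for each byte.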

Thanks
tubby (16)
1/25/2007 9:05:11 PM
comp.lang.python

Have a look at PDFlib (www.pdflib.com). Their Text Extraction
Toolkit might be what you are looking for, though I'm not sure whether
you can use it separately from PDFlib itself.

hth

Nils

NO_Kroeger (13)
1/25/2007 9:40:57 PM
On Thursday 25 January 2007 22:05, tubby wrote:

> I know this question comes up a lot, so here goes again. I want to read
> text from a PDF file, run re searches on the text, etc. I do not care
> about layout, fonts, borders, etc. I just want the text. I've been
> reading Adobe's PDF Reference Guide and I'm beginning to develop a
> better understanding of PDF in general, but I need a bit of help... this
> seems like it should be easier than it is.

It _seems_ that way. ;-)

One of the more promising suggestions for a way to solve this came
up in a comp.lang.python thread last year:

http://groups.google.com/group/comp.lang.python/msg/cb6c97a44ce4cbe9?dmode=source

Basically, if you have access to the pdftotext command on a system
that supports xpdf, you should be able to get something reasonable
out of a PDF file.
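
A minimal sketch of that approach in Python 3, assuming a pdftotext executable is on the PATH. The helper names pdftotext_command, pdf_to_text and find_matches are invented for this example. Passing "-" as the output file sends the text to standard output, and passing an argument list instead of a shell string sidesteps quoting problems with file names.

```python
import re
import subprocess


def pdftotext_command(pdf_path):
    """Build the argument list; "-" asks pdftotext to write to stdout."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def pdf_to_text(pdf_path):
    """Run pdftotext on pdf_path and return its output as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # UTF-8 was requested above; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")


def find_matches(pattern, text):
    """Return every non-overlapping match of a regular expression in text."""
    return re.findall(pattern, text)
```

With these in place, something like find_matches(r"\d{4}", pdf_to_text("test.pdf")) would run a regular expression over the extracted text.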

> I know the text is compressed... that it would have stream and endstream
> markers and BT (Begin Text) and ET (End Text) and that the uncompressed
> text is enclosed in parentheses (this is my text). Has anyone here done
> this in a simple fashion? I've played with the pyPdf library some, but
> it seems overly complex for my needs (merge PDFs, write PDFs, etc). I
> just want a simple PDF text extractor.

The pdftotext tool may do what you want:

  http://www.foolabs.com/xpdf/download.html

Let us know how you get on with it.

David
david2510 (172)
1/25/2007 9:46:51 PM
David Boddie wrote:
> The pdftotext tool may do what you want:
> 
>   http://www.foolabs.com/xpdf/download.html
> 
> Let us know how you get on with it.

I have used this tool. However, I need to read PDFs on Windows and
Linux and, in the future, Macs. pdftotext works great on Linux, but poorly
on Windows (100% sustained CPU usage, etc).

Thank you for the suggestion. I'll keep hammering away at a simple
Python solution to this. Over the years, I have come to loathe Adobe's
Portable Document Format!
tubby (16)
1/25/2007 9:54:15 PM
David Boddie wrote:

> The pdftotext tool may do what you want:
> 
>   http://www.foolabs.com/xpdf/download.html
> 
> Let us know how you get on with it.
> 
> David

Perhaps I'm just using pdftotext wrong? Here's how I was using it:

f = filename

try:
     sout = os.popen('pdftotext "%s" - ' %f)
     data = sout.read().strip()
     print data
     sout.close()

except Exception, e:
     print e
tubby (16)
1/25/2007 10:09:42 PM
> Perhaps I'm just using pdftotext wrong? Here's how I was using it:
>

>      sout = os.popen('pdftotext "%s" - ' %f)


If you are having trouble with popen (not unlikely),
how about just writing to a temporary file and
reading the text from there?

I've used pdftotext several times in the past few
weeks (but not on Windows). It was a major
time saver for me.
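
One way the temporary-file idea might look in Python 3, again assuming pdftotext is on the PATH. The function name pdf_to_text_via_tempfile and its run parameter are hypothetical; run exists only so the external call can be swapped out, for instance when pdftotext is not installed. The output is assumed to decode as UTF-8, which may not hold for every pdftotext build.

```python
import os
import subprocess
import tempfile


def pdf_to_text_via_tempfile(pdf_path, run=subprocess.check_call):
    """Have pdftotext write to a temporary file, then read the text back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        out_path = os.path.join(tmpdir, "out.txt")
        # Second positional argument is the text file pdftotext should write.
        run(["pdftotext", str(pdf_path), out_path])
        with open(out_path, "rb") as fh:
            raw = fh.read()
    # The temporary directory and its contents are removed at this point.
    return raw.decode("utf-8", errors="replace").strip()
```

No pipe is involved here, so any trouble specific to popen on a given platform is taken out of the picture.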
missive (110)
1/25/2007 10:41:09 PM
tubby writes:

> David Boddie wrote:
>
>> The pdftotext tool may do what you want:
>>
>>   http://www.foolabs.com/xpdf/download.html
>>
>> Let us know how you get on with it.
>>
>> David
>
> Perhaps I'm just using pdftotext wrong? Here's how I was using it:
>
> f = filename
>
> try:
>     sout = os.popen('pdftotext "%s" - ' %f)
>     data = sout.read().strip()
>     print data
>     sout.close()
>
> except Exception, e:
>     print e

I am using pdftotext on Windows with cygwin on a regular basis without
any problem.

Your program above should read:

   sout = os.popen('pdftotext "%s" - ' % (f,))

--
Dieter Deyke

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
1/25/2007 10:54:54 PM
Dieter Deyke wrote:
>>     sout = os.popen('pdftotext "%s" - ' %f)

> Your program above should read:
> 
>    sout = os.popen('pdftotext "%s" - ' % (f,))

What is the significance of doing it this way?
tubby (16)
1/29/2007 2:06:27 PM
tubby wrote:
> Dieter Deyke wrote:
>>>     sout = os.popen('pdftotext "%s" - ' %f)
> 
>> Your program above should read:
>>
>>    sout = os.popen('pdftotext "%s" - ' % (f,))
> 
> What is the significance of doing it this way?

It's actually just nit-picking - as long as you know f is never going to
be a tuple then it's perfectly acceptable to use a single value as the
right-hand operand.

Of course, if f ever *is* a tuple (with more than one element) then you 
will get an error:

  >>> for f in ['string', ('one-element tuple', ), ("two-element", "tuple")]:
  ...   print 'Nit: pdftotext "%s" - ' % (f,)
  ...   print 'You: pdftotext "%s" - ' %f
  ...
Nit: pdftotext "string" -
You: pdftotext "string" -
Nit: pdftotext "('one-element tuple',)" -
You: pdftotext "one-element tuple" -
Nit: pdftotext "('two-element', 'tuple')" -
Traceback (most recent call last):
   File "<stdin>", line 3, in <module>
TypeError: not all arguments converted during string formatting
  >>>

So there is potentially some value to it. But we often don't bother.

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Skype: holdenweb     http://del.icio.us/steve.holden
Blog of Note:          http://holdenweb.blogspot.com
See you at PyCon?         http://us.pycon.org/TX2007

steve73 (4801)
1/30/2007 5:22:45 AM