Extract Text from PDF

Hi,

Does anyone know a way to extract plain text from a PDF using Ruby?

Many Thanks,

~ Mark

-- 
Posted via http://www.ruby-forum.com/.

seo (16)
4/13/2007 12:06:14 PM
comp.lang.ruby


On 13.04.2007 14:06, Mark Dodwell wrote:
> Does anyone know a way to extract plain text from a PDF using Ruby?

IIRC there is a project under way to extend PDFWriter with reading 
capabilities.  I don't know the current status of that.  HTH

	robert
shortcutter (5830)
4/13/2007 12:16:52 PM
Robert Klemme wrote:
> On 13.04.2007 14:06, Mark Dodwell wrote:
>> Does anyone know a way to extract plain text from a PDF using Ruby?
> 
> IIRC there is a project under way to extend PDFWriter with reading
> capabilities.  I don't know the current status of that.  HTH

In the meantime, you could use the command-line tools pdf2ps and ps2ascii
(I think they use Ghostscript as a backend), and read the resulting
ASCII file with Ruby in the usual way.
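
A minimal sketch of that two-step approach, assuming pdf2ps and ps2ascii are installed and on the PATH. The method name pdf_to_ascii and the runner: parameter are made up for illustration (the latter only so the external commands can be swapped out in a test); this variant reads ps2ascii's standard output instead of going through a separate ASCII file.

```ruby
require "open3"
require "tmpdir"

# Hypothetical helper: convert a PDF to plain text by running pdf2ps,
# then ps2ascii, and returning whatever ps2ascii prints.
# `runner` defaults to Open3.capture3 and must return [stdout, stderr, status].
def pdf_to_ascii(pdf_path, runner: Open3.method(:capture3))
  Dir.mktmpdir do |dir|
    ps_path = File.join(dir, "out.ps")

    # Step 1: PDF -> PostScript, written to a temporary file.
    _out, err, status = runner.call("pdf2ps", pdf_path, ps_path)
    raise "pdf2ps failed: #{err}" unless status.success?

    # Step 2: PostScript -> ASCII text on standard output.
    text, err, status = runner.call("ps2ascii", ps_path)
    raise "ps2ascii failed: #{err}" unless status.success?

    text
  end
end
```

Something like `puts pdf_to_ascii("report.pdf")` would then print the text, where report.pdf is a placeholder file name.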

Regards,


Chris


4/13/2007 12:26:15 PM
Hi,

2007/4/13, Mark Dodwell <seo@mkdynamic.co.uk>:

> Does anyone know a way to extract plain text from a PDF using Ruby?

You can use Ruby/Poppler:
  http://ruby-gnome2.sourceforge.jp/hiki.cgi?Ruby%2FPoppler

Here is an example to do that:
  http://ruby-gnome2.cvs.sourceforge.net/ruby-gnome2/ruby-gnome2/poppler/sample/pdf2text.rb?revision=HEAD&view=markup


Thanks,
--
kou

kou (147)
4/13/2007 1:19:12 PM
Robert Klemme wrote:
> On 13.04.2007 14:06, Mark Dodwell wrote:
>> Does anyone know a way to extract plain text from a PDF using Ruby?
>
> IIRC there is a project under way to extend PDFWriter with reading 
> capabilities.  I don't know the current status of that.  HTH
>
>     robert
At least on Linux, there is "pdftotext", which is part of the "poppler" 
package. So you can simply shell out to it if it's installed. If you're 
more ambitious, you could write an extension to use the underlying 
libraries in poppler.
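
Shelling out could look roughly like the sketch below, assuming pdftotext is installed and on the PATH. The method name pdf_text_via_pdftotext and the runner: parameter are invented for this example; passing "-" as the output file asks pdftotext to write to standard output.

```ruby
require "open3"

# Hypothetical helper: return the text of a PDF by running pdftotext.
# `runner` defaults to Open3.capture3 and must return [stdout, stderr, status].
def pdf_text_via_pdftotext(pdf_path, runner: Open3.method(:capture3))
  # "-" as the output file sends the extracted text to stdout.
  text, err, status = runner.call("pdftotext", pdf_path, "-")
  raise "pdftotext failed: #{err}" unless status.success?
  text
rescue Errno::ENOENT
  # Raised by Open3 when the executable cannot be found.
  raise "pdftotext is not installed or not on PATH"
end
```

Passing the command as separate arguments, rather than one interpolated string, avoids going through the shell, so file names with spaces or quotes need no escaping.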


-- 
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.net/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.


znmeb (1667)
4/13/2007 1:20:42 PM
The trouble is that not all PDFs are alike. Sometimes there is
no text at all in a PDF: it can be all vector-art outlines or even
all raster images. There is never a guarantee that you will
get any or all of the text that is otherwise human-readable in a PDF.
PDF has really become a kitchen-sink format, so it is wise to
anticipate trouble when parsing PDF files.
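
One way to anticipate that trouble is to sanity-check whatever an extractor returns before relying on it. The sketch below is only a rough heuristic: the method name text_extracted? and the 20-character threshold are arbitrary choices for illustration, not a reliable test for a missing text layer.

```ruby
# Hypothetical check: treat output with very few letters or digits as
# "no usable text", which is what an image-only or outline-only PDF
# tends to produce. The threshold is an arbitrary assumption.
def text_extracted?(text, min_chars: 20)
  text.to_s.scan(/[[:alnum:]]/).length >= min_chars
end
```

A caller could fall back to another tool, or report the file as unreadable, whenever this returns false.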

4/13/2007 11:52:34 PM