Extract Text out of PDF file

Does anyone know how to extract text out of a PDF file so that it can
be easily imported into a database?

Example: Books.
I would need a separate field for the title, author, publisher, date,
description, image name, etc...

I know all of this information is stored in the PDF; however, I can't
seem to get it out correctly without doing it manually. Maybe an
AppleScript to pull based on font(?) or something...


Any help will be greatly appreciated. If there is a program out there,
or if anyone can build this for me, that would rock.

Matt
0
KowSki213
11/25/2003 5:28:44 PM
comp.text.pdf


PDFBox from http://www.pdfbox.org will do the trick for you. It is a
Java library; take a look at the PrintDocumentMetaData and ExtractText
classes to get the information that you want.

Ben


0
ben
11/25/2003 10:05:11 PM
Extracting text from books is not easy in the general case, because
presumably you want to preserve some semantics from the original layout.

However, in the case of simple prose there is ps2ascii, a simple
Ghostscript wrapper script: PS/PDF in, text out.
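
For anyone who wants to drive ps2ascii from a script, here is a minimal
Python sketch. It assumes ps2ascii is installed and on the PATH, and that
giving it a single file argument makes it print the text to standard
output. The function names build_command and extract_text are made up for
this illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Return the argument list for a ps2ascii run that writes to stdout."""
    # Assumption: with one file argument, ps2ascii prints the text to stdout.
    return ["ps2ascii", str(pdf_path)]


def extract_text(pdf_path):
    """Run ps2ascii on pdf_path and return whatever text it produced."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii not found on PATH (it ships with Ghostscript)")
    result = subprocess.run(
        build_command(pdf_path),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling extract_text("book.pdf") would then give one long string of plain
text. The layout is gone at that point, so splitting it into fields is a
separate job.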

The alternative is to build an application that scrapes the data. I
have seen applications like this, but they need a lot of local scripting
before they understand the particular book you have in front of you. Once
that is done, you can parse complex illustrated catalogues: text,
illustrations, tables, whatever.
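
To give a rough idea of what that local scripting looks like, here is a
small Python sketch using only the standard library. It assumes, purely
for illustration, that the extracted text has one "Label: value" line per
field. Real books will not be this tidy, and the field labels, the books
table, and the function names are all hypothetical.

```python
import re
import sqlite3

# Hypothetical field labels; a real book or catalogue needs its own rules.
FIELDS = ("title", "author", "publisher", "date", "description")
LABEL = re.compile(r"^(%s)\s*:\s*(.*)$" % "|".join(FIELDS), re.IGNORECASE)


def parse_record(text):
    """Turn 'Label: value' lines into a dict of fields.

    Lines without a known label are appended to the most recent field.
    """
    record = dict.fromkeys(FIELDS, "")
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LABEL.match(line)
        if match:
            current = match.group(1).lower()
            record[current] = match.group(2).strip()
        elif current:
            record[current] = (record[current] + " " + line).strip()
    return record


def store(conn, record):
    """Insert one parsed record into a simple books table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books "
        "(title, author, publisher, date, description)"
    )
    conn.execute(
        "INSERT INTO books VALUES "
        "(:title, :author, :publisher, :date, :description)",
        record,
    )
    conn.commit()
```

The per-book work is almost entirely in the parsing rules. Keying on font
or position, as the original question suggests, would need a tool that
exposes that information, which plain text extraction does not.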

Choose whichever tool does the job for you.

Eric.
0
Eric
11/26/2003 10:09:57 AM