How to retrieve text content from a PDF file with iText?

Hi guys,

  I am currently working on a project that requires retrieving text 
streams from PDF files. I have read through some threads about the 
iText library. I am quite new to the topic of converting PDF text to 
objects. So, first off, can anyone tell me whether it is possible to reach my 
goal with iText? (Namely with the class PdfReader; are there other classes 
I need for this?) Secondly, could anyone give me a detailed code example 
that illustrates how to make the simplest PDF->text parser 
with the PdfReader class in iText?

Thanks a lot!!

Regards

Rui
Reply Rui 4/28/2005 9:31:39 AM


Hi,
  Here is my question in a bit more detail:

// create a PdfReader
      PdfReader PDFreader = new PdfReader("somefile.pdf");

// retrieve page 2, for example
      text = PDFreader.FlateDecode(PDFreader.getPageContent(2), true);

Is that all there is to parsing? (Obviously I am missing something here, 
but what?)

  Thanks

  Rui
Reply Rui 4/28/2005 9:44:21 AM


With iText you can extract dictionaries, streams, and other objects from a PDF file.
These are PDF objects as described in the PDF Reference Manual.
If you decode a stream, you get PDF syntax.
This doesn't mean you get the text that is shown in Acrobat Reader.
iText doesn't parse the Graphics State or Text State operators.
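
To make the point about "PDF syntax" concrete, here is a toy sketch in plain Java (no iText or PDFBox involved). It assumes you already have a decoded content stream as a string and simply collects the literal strings between parentheses. The class name ContentStreamDemo and the sample stream are made up for illustration. This is not how a real extractor works: it ignores fonts, encodings, hex strings and positioning, which is exactly why a dedicated text extraction library is needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of iText or PDFBox.
class ContentStreamDemo {

    /**
     * Collects the literal strings "( ... )" found in a decoded content stream.
     * Deliberately naive: fonts, encodings, hex strings and text positioning
     * are ignored, and escape sequences such as \n or octal codes are not
     * decoded (the escaped character is copied as-is).
     */
    static List<String> literalStrings(String contentStream) {
        List<String> result = new ArrayList<String>();
        StringBuilder current = null; // non-null while inside a literal string
        int depth = 0;                // nesting depth of unescaped parentheses
        for (int i = 0; i < contentStream.length(); i++) {
            char c = contentStream.charAt(i);
            if (current == null) {
                if (c == '(') {
                    current = new StringBuilder();
                    depth = 1;
                }
                continue;
            }
            if (c == '\\' && i + 1 < contentStream.length()) {
                // Backslash escape: copy the next character literally.
                current.append(contentStream.charAt(++i));
            } else if (c == '(') {
                depth++;
                current.append(c);
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    result.add(current.toString());
                    current = null;
                } else {
                    current.append(c);
                }
            } else {
                current.append(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A made-up content stream: operators (BT, Tf, Td, Tj, TJ, ET) mixed with strings.
        String stream = "BT /F1 12 Tf 72 720 Td (Hello) Tj "
                + "0 -14 Td [(Wor) -20 (ld \\(PDF\\))] TJ ET";
        System.out.println(literalStrings(stream));
    }
}
```

Note that even in this tiny sample the word "World" arrives in two pieces because of the kerning adjustment inside the TJ array. Reassembling words, lines and reading order from such fragments is the part iText leaves to other libraries.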

I could explain more about the internals of iText,
but I will keep it short:
If you want to use iText to manipulate existing PDFs,
read http://itext.sourceforge.net/tutorial/general/copystamp/
If you need to extract text from a PDF,
you will need another library.

br,
Bruno
Reply bruno 4/28/2005 10:47:11 AM

http://www.pdfbox.org is an open source Java PDF Library that does text
extraction.

See the command line tool org.pdfbox.ExtractText and utility class
org.pdfbox.util.PDFTextStripper to see how to extract text from a PDF
document.

Ben

Reply ben 4/28/2005 1:16:47 PM

Thanks for your suggestion, Ben. PDFBox is a great library. I have 
already tested it, and it works very well! I will post any follow-up 
questions about using PDFBox here.

Regards to everyone who replied,
Rui
Reply Rui 4/28/2005 3:06:40 PM

I have tried to extract the text of a PDF document on Android using only the PDFBox library:

    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public static void read(String[] args) throws IOException {
        PDDocument doc = null;
        try {
            // load the document and extract all of its text
            doc = PDDocument.load("C:\\Android.pdf");
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(doc);
        } finally {
            // always release the document, even if extraction fails
            if (doc != null) {
                doc.close();
            }
        }
    }

But I get this error in logcat: Could not find method org.apache.pdfbox.pdmodel.PDDocument.load, referenced from method com.packagename.classname.method...
Moreover, I have set the classpath and path system variables and added the jar file too.
Why does this error occur?
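
One way to narrow down an error like this is to check at runtime whether the class and method are visible to the app at all. The CLASSPATH variable on the development machine does not decide what ends up inside an Android app, so the jar may be missing from the installed package even though the project compiles. Below is a minimal sketch in plain Java using reflection. The class name ClasspathCheck and the helper hasMethod are made up for illustration, and the load(String) signature is taken from the call in the post above (assuming a PDFBox 1.x jar). It only tells you whether the lookup succeeds, not why it fails.

```java
import java.lang.reflect.Method;

// Hypothetical diagnostic helper, not part of PDFBox or the Android SDK.
class ClasspathCheck {

    /**
     * Returns true if the named class can be loaded and exposes a public
     * method with the given name and parameter types, false otherwise.
     */
    static boolean hasMethod(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class<?> cls = Class.forName(className);
            Method m = cls.getMethod(methodName, paramTypes);
            return m != null;
        } catch (ClassNotFoundException e) {
            return false; // the class itself is not visible at runtime
        } catch (NoSuchMethodException e) {
            return false; // the class exists, but not with this method signature
        } catch (LinkageError e) {
            return false; // the class was found but could not be linked or initialized
        }
    }

    public static void main(String[] args) {
        // Assumption: this is the method the post's code calls.
        System.out.println(hasMethod(
                "org.apache.pdfbox.pdmodel.PDDocument", "load", String.class));
    }
}
```

If this prints false inside the app, the library (or that version of the method) is probably not packaged with it, and that would be the first thing to look at.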

Reply deepandroid (2) 6/8/2012 8:12:24 AM
