How to retrieve text content from a PDF file with iText?
Hi guys,
I am currently working on a project that requires retrieving text
streams from PDF files. I have read through some threads about the
iText library. I am quite new to the topic of converting PDF text to
objects. So, first off, can anyone tell me whether it is possible to
achieve my goal with iText? (Namely with the class PdfReader; is there
any other class I need for this?) Secondly, could anyone give me a
detailed code example illustrating how to make the simplest PDF->text
parser with the PdfReader class in iText?
Thanks a lot!!
Regards
Rui
Rui
4/28/2005 9:31:39 AM

Hi,
Here is my question in a bit more detail:
    // create a PdfReader
    PdfReader PDFreader = new PdfReader("somefile.pdf");
    // retrieve page 2, for example
    text = PDFreader.FlateDecode(PDFreader.getPageContent(2), true);
Is that all that is needed for parsing? (Obviously I am missing something
here, but what?)
Thanks
Rui
Rui
4/28/2005 9:44:21 AM

With iText you can extract dictionaries, streams, etc. from a PDF file.
These are PDF objects as described in the PDF Reference Manual.
If you decode a stream, you get PDF syntax.
This doesn't mean you get the text that is shown in Acrobat Reader.
iText doesn't parse the Graphics State or Text State operators.
I could explain more about the internals of iText,
but I will keep it short:
If you want to use iText to manipulate existing PDFs,
read http://itext.sourceforge.net/tutorial/general/copystamp/
If you need to extract text from a PDF,
you will need another library.
br,
Bruno
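
To illustrate the point that a decoded stream is PDF syntax rather than readable text, here is a minimal sketch that uses only the Java standard library, not iText. The sample content stream, the class name ContentStreamDemo and its helper methods are all assumptions made up for this illustration. It compresses a hand-written content stream, inflates it again (the operation a FlateDecode filter performs), and then pulls out the literal strings that precede a Tj operator with a deliberately naive regular expression.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class ContentStreamDemo {

    // Hypothetical page content: PDF operators and operands, not plain text.
    static final String SAMPLE =
            "BT\n/F1 12 Tf\n72 720 Td\n(Hello) Tj\n(World) Tj\nET\n";

    // Compress bytes the way a Flate-encoded stream would be stored.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Undo the Flate compression; the result is still PDF syntax.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported input
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Naive: handles only simple "(...) Tj" literals without escapes or
    // nested parentheses. It ignores TJ arrays, font encodings and text
    // positioning, which a real text extractor has to interpret.
    static List<String> naiveShowTextStrings(String content) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj").matcher(content);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] stored = deflate(SAMPLE.getBytes(StandardCharsets.ISO_8859_1));
        String decoded = new String(inflate(stored), StandardCharsets.ISO_8859_1);
        System.out.println(decoded);                       // operators such as BT, Tf, Tj, ET
        System.out.println(naiveShowTextStrings(decoded)); // the two string literals
    }
}
```

The regular expression only works because the sample is trivial. Real pages can split words across several operators, use custom font encodings, or place text in an order unrelated to reading order, which is why a dedicated text extraction library is the practical choice.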
bruno
4/28/2005 10:47:11 AM

PDFBox (http://www.pdfbox.org) is an open-source Java PDF library that
does text extraction.
See the command-line tool org.pdfbox.ExtractText and the utility class
org.pdfbox.util.PDFTextStripper for how to extract text from a PDF
document.
Ben
ben
4/28/2005 1:16:47 PM

Thanks for your suggestion, Ben. PDFBox is a great library. I have
already tested it, and it works very well! I will post any follow-up
questions I have while using PDFBox.
Regards to everyone who replied,
Rui
Rui
4/28/2005 3:06:40 PM

I have tried to extract text from a PDF document on Android using only the PDFBox library:
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public static void read(String[] args) throws IOException {
        PDDocument doc = null;
        try {
            // open the document and extract all of its text
            doc = PDDocument.load("C:\\Android.pdf");
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(doc);
        } finally {
            // always release the document, even if extraction fails
            if (doc != null) {
                doc.close();
            }
        }
    }
But I get this error in logcat: Could not find method org.apache.pdfbox.pdmodel.PDDocument.load, referenced from method com.packagename.classname.method...
Moreover, I have set the classpath and path in the system variables, and I have the jar file too.
Why am I getting this error?
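
One way to narrow down an error like this is to check at run time whether the class and method are actually visible to the running program. The following is only a diagnostic sketch using plain Java reflection. The class name RuntimeLookupCheck and the helper describe are hypothetical, and it assumes the method being looked up is public. Environment variables set on a development machine do not necessarily describe what is packaged with the app, so a lookup from inside the app itself can be more telling.

```java
import java.lang.reflect.Method;

class RuntimeLookupCheck {

    // Reports whether className.methodName(paramTypes) can be resolved
    // by the class loader of the running program.
    static String describe(String className, String methodName, Class<?>... paramTypes) {
        try {
            Class<?> cls = Class.forName(className);
            Method m = cls.getMethod(methodName, paramTypes);
            return "found: " + m;
        } catch (ClassNotFoundException e) {
            return "class not found: " + className;
        } catch (NoSuchMethodException e) {
            return "method not found: " + className + "." + methodName;
        } catch (LinkageError e) {
            // The class exists but one of its own dependencies is missing or broken.
            return "class could not be linked: " + e;
        }
    }

    public static void main(String[] args) {
        // The lookup from the question: PDDocument.load(String).
        System.out.println(
                describe("org.apache.pdfbox.pdmodel.PDDocument", "load", String.class));
        // A lookup that is expected to succeed on any Java runtime.
        System.out.println(describe("java.lang.String", "valueOf", int.class));
    }
}
```

If the first lookup reports that the class is not found, the library is probably not packaged with the app. If it reports a linkage problem, the library may depend on classes that the platform does not provide.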
deepandroid (2)
6/8/2012 8:12:24 AM