Getting the Text from Image and PDF

Hi friends,

This is Jan, I am new to this Group.

I have a requirement here. 

      Is there any Java API for getting the text data from image and PDF formats? Please let me know. If you find anything, please suggest it to me.


Thanks && Regards..
Jan
Jan
2/14/2014 9:09:12 AM
comp.lang.java.programmer

Jan <janreddy.sr@gmail.com> wrote:
> Hi friends,
> This is Jan, I am new to this Group.
> I have a requirement here. 
>       Is there any Java API for getting the Text data from an Image
>       and PDF formats.

For reading characters from graphical data, google "ocr" (and "java")
(the acronym means "optical character recognition")

PDFs may contain the text directly (non-graphically), which would make
extraction much easier (and not require ocr).

Andreas
2/14/2014 2:46:53 PM
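Since some PDFs carry their text directly and others are only page images, one common pattern is to try direct extraction first and fall back to OCR only when nothing comes back. The sketch below illustrates just that control flow. The class name `TextExtractionFallback` and the two extractor parameters are hypothetical placeholders, not part of any library mentioned in this thread; real code would plug in a PDF library and an OCR engine of your choice.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Function;

class TextExtractionFallback {

    /**
     * Runs the cheap direct extractor first and only calls the OCR
     * extractor when the direct result is null or whitespace-only.
     * Both extractors are assumptions supplied by the caller.
     */
    static String extractText(Path file,
                              Function<Path, String> directExtractor,
                              Function<Path, String> ocrExtractor) {
        String direct = directExtractor.apply(file);
        if (direct != null && !direct.trim().isEmpty()) {
            return direct;
        }
        return ocrExtractor.apply(file);
    }

    public static void main(String[] args) {
        // Stand-in lambdas; a real program would call actual libraries here.
        Function<Path, String> direct = p -> "";
        Function<Path, String> ocr = p -> "text recognised from page image";
        System.out.println(extractText(Paths.get("scan.pdf"), direct, ocr));
    }
}
```

Keeping the two extractors as parameters means the decision logic can be tested without any PDF or OCR library installed.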
On 02/14/2014 04:09 AM, Jan wrote:
> Hi friends,
>
> This is Jan, I am new to this Group.
>
> I have a requirement here.
>
>        Is there any Java API for getting the Text data from an Image and PDF formats. Please let me know the same. If anything found, please suggest me regarding them.
>

<http://www.catb.org/~esr/faqs/smart-questions.html#before>

Jeff
2/14/2014 6:23:19 PM
On Fri, 14 Feb 2014 01:09:12 -0800 (PST), Jan <janreddy.sr@gmail.com>
wrote, quoted or indirectly quoted someone who said :

> Is there any Java API for getting the Text data from an Image and PDF formats. 
> Please let me know the same. If anything found, please suggest me regarding them

You could spawn a copy of Nuance Omnipage. It can OCR pdfs.  You might
just look for PDF --> X converters.

See http://mindprod.com/pdf.html

-- 
Roedy Green Canadian Mind Products http://mindprod.com
The future has already happened, it just isn’t evenly distributed.
~ William Gibson (born: 1948-03-17 age: 65)
Roedy
2/16/2014 3:24:08 PM
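Spawning an external converter from Java can be done with `ProcessBuilder`. The post names Omnipage; this sketch swaps in the `pdftotext` command-line tool (shipped with Xpdf and Poppler) only because its command line is simple, and it assumes that tool is installed and on the PATH. Note that `pdftotext` extracts text already stored in the PDF and does not perform OCR. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class ExternalPdfConverter {

    /** Builds the argument list for: pdftotext -layout -enc UTF-8 <pdf> <txt>. */
    static List<String> buildCommand(Path pdf, Path txt) {
        return Arrays.asList("pdftotext", "-layout", "-enc", "UTF-8",
                pdf.toString(), txt.toString());
    }

    /** Runs the converter (assumed to be on the PATH) and returns the text it wrote. */
    static String convert(Path pdf) throws IOException, InterruptedException {
        Path txt = Files.createTempFile("extracted", ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand(pdf, txt))
                    .inheritIO() // let the tool's own messages reach the console
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IOException("pdftotext exited with code " + exit);
            }
            return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(txt);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ExternalPdfConverter <file.pdf>");
            return;
        }
        System.out.println(convert(Paths.get(args[0])));
    }
}
```

The same pattern should carry over to other command-line converters by changing only `buildCommand`.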
On Fri, 14 Feb 2014 14:46:53 +0000, Andreas Leitgeb wrote:

> Jan <janreddy.sr@gmail.com> wrote:
>> Hi friends,
>> This is Jan, I am new to this Group.
>> I have a requirement here.
>>       Is there any Java API for getting the Text data from an Image and
>>       PDF formats.
> 
> For reading characters from graphical data, google "ocr" (and "java")
> (the acronym means "optical character recognition")
>
I wonder what the OP means by "the Text data from an Image" because there 
are two possibilities: images of text, which would require OCR 
capabilities, or the text held in the EXIF tags found in many image types.

It might help if this was clarified by the OP.
 
> PDFs may contain the text directly (non-graphically), which would make
> extraction much easier (and not require ocr).
>
I think something similar applies to PDFs: some are little more than an 
encapsulated page image, in which case you need OCR capabilities to 
extract it, but other PDFs contain blocks of text, possibly interspersed 
with embedded images. These are the ones that can be opened and modified 
by LibreOffice and similar programs.

Again, it would help if the OP could say what he is trying to do.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
Martin
2/16/2014 4:11:23 PM
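For the second possibility, text stored in an image's metadata rather than in its pixels, the JDK's own `javax.imageio` package can list the text entries it understands, such as PNG text chunks and JPEG comments. This is only a rough sketch: as far as I know the JPEG plugin bundled with the JDK does not decode EXIF tags, so real EXIF data would likely need a dedicated metadata library. The class name `ImageTextEntries` is made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.metadata.IIOMetadataFormatImpl;
import javax.imageio.stream.ImageInputStream;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ImageTextEntries {

    /** Returns "keyword=value" strings for each text entry ImageIO exposes. */
    static List<String> read(File imageFile) throws IOException {
        List<String> entries = new ArrayList<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(imageFile)) {
            if (in == null) {
                return entries;
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return entries; // no installed reader recognises this format
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                IIOMetadata metadata = reader.getImageMetadata(0);
                if (metadata == null || !metadata.isStandardMetadataFormatSupported()) {
                    return entries;
                }
                // The plugin-neutral tree puts text under Text/TextEntry nodes.
                Element root = (Element) metadata.getAsTree(
                        IIOMetadataFormatImpl.standardMetadataFormatName);
                NodeList nodes = root.getElementsByTagName("TextEntry");
                for (int i = 0; i < nodes.getLength(); i++) {
                    Element entry = (Element) nodes.item(i);
                    entries.add(entry.getAttribute("keyword") + "="
                            + entry.getAttribute("value"));
                }
            } finally {
                reader.dispose();
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ImageTextEntries <image file>");
            return;
        }
        read(new File(args[0])).forEach(System.out::println);
    }
}
```

An empty result only means no metadata text was found; an image of text would still need OCR.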