convert PDF images to PDF text (OCR)

Hi,

I have a book which was entirely scanned into a PDF; there are
approximately 300 pages. The problem is that each page is an image of
the paper page, not selectable text.
How can I convert this PDF made of images to a text-selectable PDF? Or
directly convert it to Word? I suppose I'll need some form of OCR.

Thanks in advance,
Raphael
4/18/2005 1:42:11 PM
comp.text.pdf

Yes, since you have an image, you need to OCR it into data.  You will not
have much luck with any formatting info, but most OCR programs today do a
pretty good job if the image is fairly clean.

Larry T.
larrynospam
4/18/2005 2:04:40 PM
larrynospam@nospamjbmsystems.com wrote in message news:<Y7P8e.20618$nH4.109@trndny05>...
> Yes, since you have an image, you need to OCR it into data.  You will not
> have much luck with any formatting info, but most OCR programs today do a
> pretty good job if the image is fairly clean.
> 
> Larry T.

Ok. I knew that. But which program would you recommend to OCR a pdf
made of images like mine?
Most work with scanners. Which ones work with pdf images?

Thanks,
Raphael
Raphael
4/18/2005 6:36:03 PM
"Ok. I knew that. But which program would you recommend to OCR a pdf
made of images like mine?
Most work with scanners. Which ones work with pdf images?"

Save the PDF pages as TIFF images and then OCR those images.
Please note: if the image is black and white, as scanned text usually is,
it's best to do a contrast adjustment on the created TIFF before running OCR.
The OCR results will be better.
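
As a rough illustration of what a contrast adjustment does before OCR, here is a minimal Python sketch. It works on a plain list of 0-255 grayscale values rather than on a real TIFF, and the function names (stretch_contrast, binarize) and the threshold of 128 are made up for this example; a real image editor or OCR package will have its own controls for this.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values so the darkest becomes 0 and the lightest 255."""
    lo, hi = min(pixels), max(pixels)
    if lo == hi:
        return list(pixels)  # flat image, nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def binarize(pixels, threshold=128):
    """Map every value below the threshold to black (0) and the rest to white (255)."""
    return [0 if p < threshold else 255 for p in pixels]


# A washed-out scan line: the "ink" (130-140) is only slightly darker than the paper.
faded = [130, 140, 180, 200, 135, 190]

print(binarize(faded))                    # ink is lost, everything turns white
print(binarize(stretch_contrast(faded)))  # ink and paper separate cleanly
```

Without the stretch, every value in the faded line sits above the threshold and the text disappears; after the stretch, the darker values fall below it and survive as black.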


lostinspace
4/18/2005 7:11:09 PM
Hi Again,

Go to www.pdfzone.com for a good comprehensive buyer's guide on PDF-related
software.  Set your search criteria as you require (search on OCR for
example) and you should find all the info you'd want on the applications
available.  Good Luck,

Larry T
larrynospam
4/18/2005 7:22:27 PM
Raph wrote:
> larrynospam@nospamjbmsystems.com wrote in message news:<Y7P8e.20618$nH4.109@trndny05>...
> 
>>Yes, since you have an image, you need to OCR it into data.  You will not
>>have much luck with any formatting info, but most OCR programs today do a
>>pretty good job if the image is fairly clean.
>>
>>Larry T.
> 
> 
> Ok. I knew that. But which program would you recommend to OCR a pdf
> made of images like mine?
> Most work with scanners. Which ones work with pdf images?

In general, OCR programs do work with either scanned images or image 
files. You can easily extract the raster images from the PDF with 
"pdfimages".

http://www.foolabs.com/xpdf/

Then convert them to TIFF with G4 encoding, using "convert" from ImageMagick. 
You can get a zip file with all the images from your file here [2.2 MB, 
smaller than the original PDF due to better lossless compression]:

http://www-user.tu-chemnitz.de/~ralk/pdf/images.zip

Then feed them into whatever OCR program you like best. I quickly used 
the OCR program I have installed here. Here is the result:

http://www-user.tu-chemnitz.de/~ralk/pdf/text.txt

Formatting isn't perfect, but acceptable.
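
For anyone who wants to script the extraction and conversion steps, the following Python sketch may help. It assumes "pdfimages" (from xpdf) and ImageMagick's "convert" are installed and on the PATH, that the pages are black-and-white scans suitable for G4, and that pdfimages writes files named <root>-NNN.pbm/.pgm/.ppm. The function names and the "page" file prefix are made up for this example, and newer ImageMagick releases may expect "magick" instead of "convert".

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf, image_root):
    """Command line that extracts the raster images of a PDF to <image_root>-NNN.*"""
    return ["pdfimages", str(pdf), str(image_root)]


def g4_tiff_command(image):
    """Command line that converts one extracted image to a G4-compressed TIFF."""
    image = Path(image)
    return ["convert", str(image), "-compress", "Group4",
            str(image.with_suffix(".tif"))]


def extract_and_convert(pdf, workdir):
    """Run both steps and return the paths of the TIFF files, in page order."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf, workdir / "page"), check=True)

    tiffs = []
    # Assumption: extracted files end in .pbm, .pgm or .ppm.
    for image in sorted(workdir.glob("page-*.p[bgp]m")):
        command = g4_tiff_command(image)
        subprocess.run(command, check=True)
        tiffs.append(Path(command[-1]))
    return tiffs
```

Calling extract_and_convert("book.pdf", "out") should leave one TIFF per extracted image in the "out" directory, ready to load into an OCR program.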

Ralf
Ralf
4/18/2005 8:52:12 PM
On 18 Apr 2005 06:42:11 -0700, Raphael.Benedet@gmail.com (Raph) wrote:

>Hi,
>
>I have a book which was entirely scanned into a PDF; there are
>approximately 300 pages. The problem is that each page is an image of
>the paper page, not selectable text.
>How can I convert this PDF made of images to a text-selectable PDF? Or
>directly convert it to Word? I suppose I'll need some form of OCR.
>
>Thanks in advance,
>Raphael


The ABBYY FineReader OCR application will read directly from the PDF file:
Open > myfile.pdf

jbl
jbl
4/18/2005 11:53:53 PM