scanned image pdf to searchable text pdf

We have a lot of PDF files that are just scanned images of documents. How
easy is it to turn these into PDFs that we can search for specific words?

I believe Adobe Acrobat Capture will do this when the document is originally
scanned, but can it work on PDFs that have already been created? If it can, can
this process be automated to convert hundreds of PDF files?

Are there alternatives to Adobe Acrobat Capture, as it is not cheap for
a large number of documents?

I am new to this, so please keep it simple.

Thanks for any help.

Adrian


Adrian
3/10/2005 4:34:06 PM
comp.text.pdf


Hi,

As you indicated, your PDFs are 'scanned images', so they contain no
words (only pictures of words). To create text you'd need to run OCR
software (there are some very good packages out there), then, depending on
the results, do a bit of editing to correct errors, and finally take the
resulting document and create searchable PDFs. This is not conducive to an
automated volume process.

There are a number of PDF packages, including Adobe's and my own
OctoTools, that will create searchable PDFs in the first place. In some
cases, older PDFs and non-text-searchable PDFs were created with indexes and
keywords (meta tags), and these files, which are essentially text, can be
searched by other third-party packages. I do not think this is what you are
dealing with.

You may want to look at ScanSoft software and see if their
higher-end packages could handle the OCR/conversion for you. Your project,
if it is to be automated and handle hundreds of PDFs, could become quite
costly, depending on the level of OCR accuracy and the processing throughput
required. If you'd like additional info or would like to discuss
the issues further, please feel free to call me at 978 535-7676 (Boston, MA).
Thanks,

Larry T.
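
For anyone wondering what the "automated volume process" part could look like, here is a minimal sketch of a batch loop. It assumes a command-line OCR tool is installed; the example uses the open-source `ocrmypdf` program in its basic `ocrmypdf input.pdf output.pdf` form, which is not mentioned anywhere in this thread and stands in for whatever package you end up buying. The function names (`plan_batch`, `ocr_command`, `run_batch`) and the folder layout are made up for illustration. It does not do the manual correction step described above, so OCR mistakes stay in the hidden text layer.

```python
from pathlib import Path
import shutil
import subprocess


def plan_batch(src_dir: Path, dst_dir: Path) -> list[tuple[Path, Path]]:
    """Pair every *.pdf under src_dir with a mirrored output path under dst_dir."""
    return [
        (src, dst_dir / src.relative_to(src_dir))
        for src in sorted(src_dir.rglob("*.pdf"))
    ]


def ocr_command(src: Path, dst: Path) -> list[str]:
    """Build the command line for one file (assumed tool: ocrmypdf)."""
    return ["ocrmypdf", str(src), str(dst)]


def run_batch(src_dir: Path, dst_dir: Path) -> list[Path]:
    """OCR every planned file and return the inputs that failed."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    failed = []
    for src, dst in plan_batch(src_dir, dst_dir):
        dst.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(ocr_command(src, dst))
        if result.returncode != 0:
            failed.append(src)  # keep going; review these by hand later
    return failed
```

Calling `run_batch(Path("scans"), Path("searchable"))` would walk a hypothetical `scans` folder and write the results to `searchable`, leaving the originals untouched. The pattern `*.pdf` is case-sensitive on some systems, so files named `.PDF` may need a second pass.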
larrynospam
3/10/2005 7:49:29 PM
Larry,

Thanks for your reply.

It looks like OmniPage Pro 14 Office may do this. It says "........supports
the creation of searchable PDF files from any paper or electronic document.
OmniPage also allows users to turn PDF Files (even image-only files)
directly into their favorite editable document format..........."

Maybe there is someone in this forum who has experience of doing this.

Regards,

Adrian


Adrian
3/11/2005 8:36:16 AM
OCR is intended to recognize reasonably crisp, clear text and turn it
into editable text. Abbyy will recognize some fairly rough text.

While OmniPage has been around for a while and is a good program, it,
like all of the others, is still not as good as Abbyy FineReader.
I use version 7.0.

Quite a lot will depend on how the PDF documents were created in
the first place.

If they used a .gif, a low-res .jpg, or a .png image to create the page,
you are probably out of luck. While the page may be readable to the
eye, OCR will not recognize it as text. If they used a high-res .tiff
or comparable image, then there may be hope.
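
A rough way to put a number on "low res" versus "high res" is to work out the effective resolution of the scan: pixels divided by the page size in inches. The sketch below does that arithmetic, using the fact that PDF page sizes are measured in points (72 to the inch). The 300 DPI threshold is only a commonly quoted rule of thumb for OCR, not a figure from this thread, and the function names are made up for illustration.

```python
POINTS_PER_INCH = 72.0  # PDF page dimensions are expressed in points


def effective_dpi(pixels_wide: int, pixels_high: int,
                  page_width_pt: float, page_height_pt: float) -> tuple[float, float]:
    """Return (horizontal, vertical) dots per inch of a full-page scan image."""
    width_in = page_width_pt / POINTS_PER_INCH
    height_in = page_height_pt / POINTS_PER_INCH
    return pixels_wide / width_in, pixels_high / height_in


def likely_ocr_ready(pixels_wide: int, pixels_high: int,
                     page_width_pt: float, page_height_pt: float,
                     min_dpi: float = 300.0) -> bool:
    """True if both axes meet min_dpi (assumed rule-of-thumb threshold)."""
    return min(effective_dpi(pixels_wide, pixels_high,
                             page_width_pt, page_height_pt)) >= min_dpi


# A US Letter page is 612 x 792 points (8.5 x 11 inches).
print(effective_dpi(2550, 3300, 612, 792))    # a 2550 x 3300 pixel scan
print(likely_ocr_ready(850, 1100, 612, 792))  # an 850 x 1100 pixel scan
```

The first scan works out to 300 DPI on both axes, the second to 100 DPI. A low figure does not prove OCR will fail, but it is a cheap way to sort a large pile of files into "worth trying" and "probably needs rescanning".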

If you have a sample page that is not a company secret or
proprietary (something I can see), I will see what measures, if any,
will get this done for you.

If you want me to take a look at it and give it a try, email a sample
to me at (REMOVE THE "nospam") jbl02gno@spamgmail.com

If a document can be converted from one format to another, I can do
it. If I cannot convert the document, then you can be pretty sure
that it cannot be converted as desired.

jbl
jbl
3/11/2005 12:30:15 PM
Adrian,

Indeed, OmniPage Pro 14 does the trick very well. I use it to
convert PDF patent texts into RTF text (for translation purposes).
Yes, it does need some post-editing, but it is still very good.

Regards,

Jean
Jean
3/13/2005 12:59:45 PM