Extract text from .pdf

I have Acrobat Pro. Is it possible to extract text from a .pdf? I see the
"Save As" options, including Word Doc, but the result still seems to be an
image.

The OCR software that came with my Canon LiDE 200 scanner is as useless as
tits on a boar hog.

The
4/27/2009 10:03:17 PM
comp.sys.mac.apps

In article <C61B79B5.40FEF%elvisp@compuserve.com>, The Wolf
<elvisp@compuserve.com> wrote:

> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
> "save as" options including Word Doc but it still seems to be an image?

Acrobat has its own OCR built in. I've found it to be very accurate,
even on a scanned document.
Dave
4/27/2009 10:23:05 PM
In article <C61B79B5.40FEF%elvisp@compuserve.com>,
 The Wolf <elvisp@compuserve.com> wrote:

> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
> "save as" options including Word Doc but it still seems to be an image?
> 
> The ocr software with my cannon lide 200 scanner is as useless as tits on a
> boar hog..........

Using either Acrobat Reader or Preview, I just select the text, copy and 
paste.

-- 
Tom Stiller

PGP fingerprint =  5108 DDB2 9761 EDE5 E7E3  7BDA 71ED 6496 99C0 C7CF
Tom
4/27/2009 10:26:53 PM
The Wolf wrote:
> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
> "save as" options including Word Doc but it still seems to be an image?
> 
> The ocr software with my cannon lide 200 scanner is as useless as tits on a
> boar hog..........
> 

Maybe I'm misunderstanding what you want to do, but can't you just
select, copy, and paste? I do this with text from a .pdf all the time,
then paste it into NeoOffice, AppleWorks, or iText Express for
formatting and printing.

Karl
KaPOP
4/27/2009 10:31:15 PM
On 4/27/09 3:31 PM, in article 4qqJl.15633$pr6.5847@flpi149.ffdc.sbc.com,
"KaPOP!" <kapop@att.net> wrote:

> The Wolf wrote:
>> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
>> "save as" options including Word Doc but it still seems to be an image?
>> 
>> The ocr software with my cannon lide 200 scanner is as useless as tits on a
>> boar hog..........
>> 
> 
> Maybe I'm misunderstanding what you want to do -  but can't you just
> select, copy, and paste? I do this with text from a .pdf and then paste
> it into Neo Office, Apple Works, or iText Express for formatting and
> printing all the time.
> 
> Karl

If I paste the selected text, it is still an image and cannot be edited. I'm
missing something simple here.

The
4/27/2009 10:48:18 PM
In article <C61B8442.40FF8%elvisp@compuserve.com>,
 The Wolf <elvisp@compuserve.com> wrote:

> If I paste the selected text it is still an image and cannot be edited, I'm
> missing something simple here.

What tool are you using to select text and are you sure it is text?

-- 
dorayme
dorayme
4/27/2009 10:52:17 PM
In article <C61B8442.40FF8%elvisp@compuserve.com>, The Wolf
<elvisp@compuserve.com> wrote:

> If I paste the selected text it is still an image and cannot be edited, I'm
> missing something simple here.

Use AcroPro's OCR function.
Dave
4/27/2009 11:21:47 PM
The Wolf wrote:
> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
> "save as" options including Word Doc but it still seems to be an image?

If you magnify the document 2000% (or whatever the limit is), do the
Acrobat/Preview readers still show well-formed text characters, or do
you see pixelated characters?

If the PDF contains an image (for instance, someone scanned a document
and embedded the .tiff into a PDF file), then you absolutely need OCR
to parse the image.

If the PDF contains real text, then the text selection tool will let you
select the text in the document from either Preview or Acrobat Reader
(unless the PDF document has a "can't copy text" attribute).
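
The image-versus-text distinction above can also be checked roughly from a script. The following is a minimal sketch, not a robust detector: it assumes the PDF's objects are stored uncompressed, so that font and image dictionaries are visible as plain bytes, and the name `classify_pdf_bytes` is made up for illustration. PDFs that pack their objects into compressed streams will fool it.

```python
import re

# Crude byte-level heuristic (an assumption, not how Acrobat or Preview decide):
# real text needs a font dictionary, while a scan is stored as an image XObject.
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def classify_pdf_bytes(data: bytes) -> str:
    """Guess whether raw PDF bytes hold real text, only images, or neither."""
    has_font = FONT_RE.search(data) is not None
    has_image = IMAGE_RE.search(data) is not None
    if has_font:
        return "text"        # selectable text is at least possible
    if has_image:
        return "image-only"  # likely a scan; OCR would be needed
    return "unknown"         # nothing visible, e.g. compressed object streams
```

To try it on a file, read the file in binary mode and pass the bytes in, for example `classify_pdf_bytes(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder name.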

Applications such as Illustrator or Freehand can import/parse PDF
documents. You then get text blocks that you can edit.
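However, consider the following PostScript code:

% font setup and example coordinates added so the fragment runs on its own
/Helvetica findfont 12 scalefont setfont
72 720 moveto
(I) show
80 720 moveto
(love) show
108 720 moveto
(chocolate) show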

In Illustrator/Freehand, you will have three separate text blocks. So if
you change the "I" to "You", the "ou" will land on top of the "love",
because "love" is in a separate text block which won't move when you edit
the text before it.

It all depends on how the original application generated the PostScript code.

The advantage of Acrobat Reader and Preview is that they have an
intelligent text selection tool which "sees" how the text is laid out
instead of seeing the original text blocks created by the underlying
PostScript.
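
How such a selection tool might rebuild running text from independent blocks can be sketched as below. This is only an illustration of the idea, not how Acrobat Reader or Preview actually work: the `(x, y, text)` fragment tuples, the `line_tolerance` parameter, and the function name are all assumptions, and the coordinates are invented. The y axis is taken to point up, as in PostScript, so a larger y means higher on the page.

```python
def fragments_to_text(fragments, line_tolerance=2.0):
    """Join positioned text fragments into lines in reading order.

    fragments: iterable of (x, y, text) tuples, y growing upward.
    line_tolerance: fragments whose y differs by at most this much share a line.
    """
    # Top of the page first, then left to right.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order by x so slightly uneven baselines do not scramble words.
    return "\n".join(
        " ".join(text for _, text in sorted(parts)) for _, parts in lines
    )


# Three independent blocks, deliberately out of order.
blocks = [(108, 720, "chocolate"), (72, 720, "I"), (80, 720, "love")]
print(fragments_to_text(blocks))
```

The blocks stay separate objects in the file; only the extracted text is stitched together, which is why selecting and copying works even when editing the blocks in place does not.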
JF
4/28/2009 12:20:16 AM
On 4/27/09 5:20 PM, in article 003d88bc$0$20311$c3e8da3@news.astraweb.com,
"JF Mezei" <jfmezei.spamnot@vaxination.ca> wrote:

> The Wolf wrote:
>> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
>> "save as" options including Word Doc but it still seems to be an image?
> 
> If you magnify the document 2000% (or whatever the limit is), does the
> Acrobat/preview readers still show well formed text characters, or do
> you see pixelised characters ?
> 
> If the PDF contains an image (for instance, someome scanned a document
> and embedded the .tiff into a PDF file.), then you absolutely need OCR
> to parse the image.

Yes, it is scanned, and the .pdf was made from a .tif.

I'm not seeing OCR in the help section. I have v 6.02.

The
4/28/2009 7:32:42 AM
In article <003d88bc$0$20311$c3e8da3@news.astraweb.com>,
 JF Mezei <jfmezei.spamnot@vaxination.ca> wrote:

> 
> If the PDF contains an image (for instance, someome scanned a document
> and embedded the .tiff into a PDF file.), then you absolutely need OCR
> to parse the image.
> 
> If the PDF contains real text, then the text selection tool will let you
> select the text in the document from either preview or acrobatreader
> (unless the PDF document has a "con't copy text" attribute.
> 

And, there's a third case.  In some scanned documents, the organization 
that did the scanning uses some high-end software to OCR the text parts 
of the scanned image and, in some fashion, embeds the character 
information under the scanned image, more or less character for 
character.

What you see on screen is just the scanned image (which preserves any 
graphics images that were part of the original image scan); but you can 
also pull a Text Selection tool across the text parts of the image and 
copy some or all of that text.

I've encountered documents from scientific journal archives that were 
apparently processed that way.
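
One common way such a hidden layer is stored (an assumption here; nothing above says how those archives did it) is text drawn in PDF text rendering mode 3, which paints nothing on the page but remains selectable. A rough sketch for spotting that in a page content stream that has already been decompressed; the function name is hypothetical, and the patterns can be fooled by operators appearing inside string data:

```python
import re

# "3 Tr" sets text rendering mode 3: neither fill nor stroke, i.e. invisible.
INVISIBLE_TR = re.compile(rb"(?<![\d.])3\s+Tr\b")
# Tj and TJ are the operators that actually show text.
SHOWS_TEXT = re.compile(rb"\bT[jJ]\b")


def has_hidden_text_layer(content_stream: bytes) -> bool:
    """True if the stream switches to invisible text and also shows some text."""
    return bool(INVISIBLE_TR.search(content_stream)) and bool(
        SHOWS_TEXT.search(content_stream)
    )
```

A page that only places a scanned image has neither operator, so it would come back False and still need OCR.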
AES
4/28/2009 2:19:23 PM
On Tue, 28 Apr 2009 03:32:42 -0400, The Wolf wrote
(in article <C61BFF2A.41013%elvisp@compuserve.com>):

> On 4/27/09 5:20 PM, in article 003d88bc$0$20311$c3e8da3@news.astraweb.com,
> "JF Mezei" <jfmezei.spamnot@vaxination.ca> wrote:
> 
>> The Wolf wrote:
>>> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
>>> "save as" options including Word Doc but it still seems to be an image?
>> 
>> If you magnify the document 2000% (or whatever the limit is), does the
>> Acrobat/preview readers still show well formed text characters, or do
>> you see pixelised characters ?
>> 
>> If the PDF contains an image (for instance, someome scanned a document
>> and embedded the .tiff into a PDF file.), then you absolutely need OCR
>> to parse the image.
> 
> Yes, it is scanned and .pdf was made from .tif
> 
> I'm not seeing ocr in the help section. I have v 6.02
> 

Look under 'Document' if you have the Pro or Standard versions.

-- 
email to oshea dot j dot j at gmail dot com.

J
4/28/2009 2:50:50 PM
On 4/28/09 7:50 AM, in article gt754911s2r@news7.newsguy.com, "J.J. O'Shea"
<try.not.to@but.see.sig> wrote:

> On Tue, 28 Apr 2009 03:32:42 -0400, The Wolf wrote
> (in article <C61BFF2A.41013%elvisp@compuserve.com>):
> 
>> On 4/27/09 5:20 PM, in article 003d88bc$0$20311$c3e8da3@news.astraweb.com,
>> "JF Mezei" <jfmezei.spamnot@vaxination.ca> wrote:
>> 
>>> The Wolf wrote:
>>>> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
>>>> "save as" options including Word Doc but it still seems to be an image?
>>> 
>>> If you magnify the document 2000% (or whatever the limit is), does the
>>> Acrobat/preview readers still show well formed text characters, or do
>>> you see pixelised characters ?
>>> 
>>> If the PDF contains an image (for instance, someome scanned a document
>>> and embedded the .tiff into a PDF file.), then you absolutely need OCR
>>> to parse the image.
>> 
>> Yes, it is scanned and .pdf was made from .tif
>> 
>> I'm not seeing ocr in the help section. I have v 6.02
>> 
> 
> Look under 'Document' if you have the Pro or Standard versions.

Pro.

I go to Paper Capture and it does its thing. What do I do once Paper
Capture finishes? How can I edit the text?

The
4/28/2009 3:58:02 PM
On Tue, 28 Apr 2009 11:58:02 -0400, The Wolf wrote
(in article <C61C759A.41028%elvisp@compuserve.com>):

> On 4/28/09 7:50 AM, in article gt754911s2r@news7.newsguy.com, "J.J. O'Shea"
> <try.not.to@but.see.sig> wrote:
> 
>> On Tue, 28 Apr 2009 03:32:42 -0400, The Wolf wrote
>> (in article <C61BFF2A.41013%elvisp@compuserve.com>):
>> 
>>> On 4/27/09 5:20 PM, in article 003d88bc$0$20311$c3e8da3@news.astraweb.com,
>>> "JF Mezei" <jfmezei.spamnot@vaxination.ca> wrote:
>>> 
>>>> The Wolf wrote:
>>>>> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
>>>>> "save as" options including Word Doc but it still seems to be an image?
>>>> 
>>>> If you magnify the document 2000% (or whatever the limit is), does the
>>>> Acrobat/preview readers still show well formed text characters, or do
>>>> you see pixelised characters ?
>>>> 
>>>> If the PDF contains an image (for instance, someome scanned a document
>>>> and embedded the .tiff into a PDF file.), then you absolutely need OCR
>>>> to parse the image.
>>> 
>>> Yes, it is scanned and .pdf was made from .tif
>>> 
>>> I'm not seeing ocr in the help section. I have v 6.02
>>> 
>> 
>> Look under 'Document' if you have the Pro or Standard versions.
> Pro
> 
> I go to paper capture and it does it's thing. What do I do once paper
> capture finishes? How can I edit the text?
> 

I just open the document, select 'OCR' from the Documents menu, and go off 
and do something else while it runs.

You might want to get a newer version of Acrobat, though.

-- 
email to oshea dot j dot j at gmail dot com.

J
4/28/2009 4:16:26 PM
On Apr 27, 12:31 pm, "KaPOP!" <ka...@att.net> wrote:
> The Wolf wrote:
> > I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
> > "save as" options including Word Doc but it still seems to be an image?
>
> > The ocr software with my cannon lide 200 scanner is as useless as tits on a
> > boar hog..........
>
> Maybe I'm misunderstanding what you want to do -  but can't you just
> select, copy, and paste? I do this with text from a .pdf and then paste
> it into Neo Office, Apple Works, or iText Express for formatting and
> printing all the time.

I'm no expert on PDF files, but it seems there are two types. The
first type, scanned pages saved in PDF format, makes for much larger
files that cannot be edited in Preview and cannot have text selected.
The second type allows text to be selected and copied and pasted.
This is the type of PDF file that is used in fillable forms. I don't
know why there are two types, but that has been my experience.

Note: Most of the PDF files I use are journal articles downloaded
from our library. This is very handy, and there is a great Mac program
called Papers for storing, reading, and organizing PDF files. I have
found that most journals have only their newer articles available as
PDF files, and these tend to be of the second type. When a journal, such
as the Journal of Biological Chemistry, puts up older papers (and all
of their issues are now available as PDF files), the PDF file is more
like an image file and does not allow cutting and pasting of text.

Dave
4/28/2009 5:27:42 PM