extract text from PDF file

Hello,

How can I extract text from a (MS Word) PDF file?

I've tried pdftotext, but it only produces crap, not one readable
cleartext sentence. :)

Are there other utilities to convert a PDF to a text file or to extract text?

I think it must be possible, because I can also copy and paste text from
PDF documents.

greetings

Fabian
0
fho (16)
7/31/2006 7:59:28 AM
comp.lang.postscript


In article <44cdb91b$0$7874$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
says...
> Hello,
> 
> How can I extract text from a (MS Word) PDF file?

This isn't really a PostScript question....

 
> I've tryed pdftotext but it only produce crap, not one readable
> cleartext sentence. :)

Most likely your PDF file contains a re-encoded font, or possibly a 
CIDFont. In both cases it is quite tricky to go from the number 
representing the glyph to an ASCII-encoded character.

 
> Exists other utilties to convert pdf to a text file or extract text?
> 
> I think it must possible, because I also can copy and paste text from
> PDF documents.

Have you tried it with this file? This is a common complaint: you may 
think it is easy to extract the 'text' from a PDF file, but in fact it 
can be quite hard, and in some cases impossible.

PDF is not intended as an editable format. While there are several 
possible solutions to your problem, the easiest may simply be to print 
the file, scan it and then OCR it.


			Ken
0
ken161 (742)
7/31/2006 10:57:43 AM
Ken Sharp wrote:
> 
> PDF is not intended as an editable format, while htere are several 
> possible solutions to your problem, the easiest may simply be to print 
> the file, scan it and then OCR it.

Some OCR packages will take PDF as an input; this
gives some benefits over print-and-scan.

   BugBear
0
bugbear (609)
7/31/2006 11:34:53 AM
Ken Sharp wrote:
> In article <44cdb91b$0$7874$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
> says...
>> Hello,
>> 
>> How can I extract text from a (MS Word) PDF file?
> 
> This isn't really a PostScript question....

Change PDF to PS, is it possible with PostScript?
0
fho (16)
7/31/2006 3:05:12 PM
Fabian Holler wrote:
> Ken Sharp wrote:
> 
>>In article <44cdb91b$0$7874$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
>>says...
>>
>>>Hello,
>>>
>>>How can I extract text from a (MS Word) PDF file?
>>
>>This isn't really a PostScript question....
> 
> 
> Change PDF to PS, is it possible with PostScript?


Even less so.

   BugBear
0
bugbear (609)
7/31/2006 3:29:10 PM
>>>> How can I extract text from a (MS Word) PDF file?

>>> This isn't really a PostScript question....

>> Change PDF to PS, is it possible with PostScript?

> Even less so.

That answer is pedantically correct, but almost worthless.
Yes, the writer of the PostScript can make it arbitrarily hard
to extract text and things that look like text.
But in most practical cases it is easy to get an answer
that is at least moderately correct and useful.

Re-define (or "wrap") all the operators that nominally display text
(namely: show, ashow, awidthshow, widthshow, kshow, cshow, xshow,
yshow, xyshow) so that they write the string argument to a file.
Then process the file with ghostscript.
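
As a rough companion to the idea above, here is a minimal sketch in
Python. Note that it is a swapped-in technique: instead of redefining
the operators inside the interpreter, it statically scans the PostScript
source for string literals that are directly followed by one of the nine
show operators. The function names (read_ps_string, extract_shown_text)
are made up for this illustration. It assumes plain "(...)" strings and
will miss hex strings, strings built at run time, and anything hidden
behind procedures.

```python
import re

# The nine text-showing operators named in the post above.
SHOW_OPS = {"show", "ashow", "awidthshow", "widthshow", "kshow",
            "cshow", "xshow", "yshow", "xyshow"}

ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f"}

# After a string: optional whitespace, an optional [...] array
# (xshow, yshow and xyshow take one), then the operator name.
FOLLOWING_OP = re.compile(r"\s*(?:\[[^\]]*\]\s*)?([A-Za-z]+)")


def read_ps_string(src, start):
    """Decode the (...) literal that begins at src[start].

    Returns (text, index just past the closing parenthesis).
    """
    out, depth, i = [], 0, start
    while i < len(src):
        ch = src[i]
        if ch == "\\":
            octal = re.match(r"[0-7]{1,3}", src[i + 1:i + 4])
            if octal:  # \ddd octal escape
                out.append(chr(int(octal.group(), 8) & 0xFF))
                i += 1 + len(octal.group())
                continue
            nxt = src[i + 1:i + 2]
            if nxt != "\n":  # backslash-newline is a line continuation
                out.append(ESCAPES.get(nxt, nxt))
            i += 2
            continue
        if ch == "(":
            depth += 1
            if depth > 1:  # nested parentheses are part of the text
                out.append(ch)
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return "".join(out), i + 1
            out.append(ch)
        else:
            out.append(ch)
        i += 1
    raise ValueError("unterminated PostScript string")


def extract_shown_text(src):
    """Collect every string literal that is handed to a show operator."""
    found, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch == "%":  # skip comments to the end of the line
            newline = src.find("\n", i)
            i = len(src) if newline == -1 else newline + 1
        elif ch == "(":
            text, i = read_ps_string(src, i)
            match = FOLLOWING_OP.match(src, i)
            if match and match.group(1) in SHOW_OPS:
                found.append(text)
        else:
            i += 1
    return found


if __name__ == "__main__":
    sample = "72 720 moveto (Hello \\(PDF\\) world) show\n(abc) [10 10 10] xshow\n"
    print(extract_shown_text(sample))
```

Either way, the strings come out exactly as they are stored in the file,
so a re-encoded font still yields unreadable character codes.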

-- 
0
jreiser (115)
7/31/2006 9:26:33 PM
John Reiser wrote:
>>>>> How can I extract text from a (MS Word) PDF file?
> 
>>>> This isn't really a PostScript question....
> 
>>> Change PDF to PS, is it possible with PostScript?
> 
> Re-define (or "wrap") all the operators that nominally display text
> (namely: show, ashow, awidthshow, widthshow, kshow, cshow, xshow,
> yshow, xyshow) so that they write the string argument to a file.
> Then process the file with ghostscript.

Thank you John for your answer.
I will try this :)

0
fho (16)
8/1/2006 7:29:55 AM
In article <44cf03ac$0$7870$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
says...
> John Reiser wrote:
> >>>>> How can I extract text from a (MS Word) PDF file?
> > 
> >>>> This isn't really a PostScript question....
> > 
> >>> Change PDF to PS, is it possible with PostScript?
> > 
> > Re-define (or "wrap") all the operators that nominally display text
> > (namely: show, ashow, awidthshow, widthshow, kshow, cshow, xshow,
> > yshow, xyshow) so that they write the string argument to a file.
> > Then process the file with ghostscript.
> 
> Thank you John for your answer.
> I will try this :)

If the problem is a re-encoded font, you will simply get more garbage. 
If the problem is a CIDFont, then you will get bigger garbage I'm 
afraid.

I'm with BugBear on this one: if you can't extract the 'text' from a PDF 
file, you are not likely to be able to extract it from the PostScript 
resulting from converting a PDF file. In general the act of conversion 
will only serve to obfuscate the problem still further.


			Ken
0
ken161 (742)
8/1/2006 10:29:19 AM
Ken Sharp wrote:
> In article <44cf03ac$0$7870$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
> says...
>> John Reiser wrote:
>> >>>>> How can I extract text from a (MS Word) PDF file?
>> > 
>> >>>> This isn't really a PostScript question....
>> > 
>> >>> Change PDF to PS, is it possible with PostScript?

> I'm with Bugbear on this one, if you can't extract the 'text' from a PDF 
> file, you are not likely to be able to extract it from the PostScript 
> resulting from converting a PDF file. In general the act of conversion 
> will only serve to obfuscate the problem still further.

I don't want to convert the PDF to PS.
I want to generate a PostScript file from MS Office with the Adobe
PostScript driver.
0
fho (16)
8/1/2006 10:38:25 AM
Ken Sharp wrote:
> In article <44cdb91b$0$7874$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
>> How can I extract text from a (MS Word) PDF file?
[...]
>> I think it must possible, because I also can copy and paste text from
>> PDF documents.
> 
> Have you tried it with this file ? This is a common complaint, you make 
> think it is easy to extract the 'text' from a PDF file, but in fact it 
> can be quite hard, and in some cases impossible.

How does copy & paste work differently from converting it to text?
0
fho (16)
8/1/2006 10:39:53 AM
Fabian Holler wrote:
> Ken Sharp wrote:
> 
>> In article <44cf03ac$0$7870$6e1ede2f@read.cnntp.org>,
>> fho@mailinator.com says...
>> 
>>> John Reiser wrote:
>>> 
>>>> How can I extract text from a (MS Word) PDF file?
>>>> This isn't really a PostScript question....
>>>> Change PDF to PS, is it possible with PostScript? 
>>
>> I'm with Bugbear on this one, if you can't extract the 'text' from
>> a PDF file, you are not likely to be able to extract it from the
>> PostScript resulting from converting a PDF file. In general the act
>> of conversion will only serve to obfuscate the problem still
>> further.
> 
> I don't want to convert the PDF to PS. I want to generate a
> Postscript from MS Office+Adobe Postscript driver.

A carefully chosen PS driver (i.e. PPD) might make the text easier to 
extract, but why the intermediate format?

If you have the MS-Word DOC file why not just use Word to "Save As" text?
0
scobloke2 (557)
8/1/2006 12:49:01 PM
In article <44cf3032$0$7873$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
says...
> Ken Sharp wrote:
> > In article <44cdb91b$0$7874$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
> >> How can I extract text from a (MS Word) PDF file?
> [...]
> >> I think it must possible, because I also can copy and paste text from
> >> PDF documents.
> > 
> > Have you tried it with this file ? This is a common complaint, you make 
> > think it is easy to extract the 'text' from a PDF file, but in fact it 
> > can be quite hard, and in some cases impossible.
> 
> How did copy&paste work different than convert it to text?

Well, you haven't answered my question, which is whether copy and paste 
worked with this particular file. However, the simple answer is that it 
isn't different, except that a clever text extractor can do more to 
try to reverse any odd encodings.


				Ken
0
ken161 (742)
8/1/2006 1:57:28 PM
In article <v_6dnTf9OJIi0FLZRVny2w@bt.com>, scobloke2@infotop.co.uk 
says...
> Fabian Holler wrote:
> > Ken Sharp wrote:
> > 
> >> In article <44cf03ac$0$7870$6e1ede2f@read.cnntp.org>,
> >> fho@mailinator.com says...
> >> 
> >>> John Reiser wrote:
> >>> 
> >>>> How can I extract text from a (MS Word) PDF file?
> >>>> This isn't really a PostScript question....
> >>>> Change PDF to PS, is it possible with PostScript? 
> >>
> >> I'm with Bugbear on this one, if you can't extract the 'text' from
> >> a PDF file, you are not likely to be able to extract it from the
> >> PostScript resulting from converting a PDF file. In general the act
> >> of conversion will only serve to obfuscate the problem still
> >> further.
> > 
> > I don't want to convert the PDF to PS. I want to generate a
> > Postscript from MS Office+Adobe Postscript driver.
> 
> A carfully chosen PS driver (i.e PPD) might make the text easier to 
> extract, but why the intermediate format?
> 
> If you have the MS-Word DOC file why not just use Word to "Save As" text?

Exactly! I assumed the OP had a PDF file, which was why I was pointing 
out that a conversion from PDF to PS wasn't likely to be any better than 
starting with a PDF file.

Converting a Word file to PDF or PostScript and *then* trying to extract 
text from it, as opposed to getting it out of the original document, is 
just bizarre.

Ken
0
ken161 (742)
8/1/2006 2:00:29 PM
>>If you have the MS-Word DOC file why not just use Word to "Save As" text?
> 
> 
> Exactly! I assumed the OP had a PDF file, which was why I was pointing 
> out that a conversion form PDF to PS wasn't likely to be any better than 
> starting with a PDF file.
> 
> Converting a Word file to PDF or PostScript and *then* trying to extract 
> text form it, as opposed to getting it out of the original document is 
> just bizarre.

No, it is not bizarre.  Probably the .pdf was written by somebody else
who is unreachable, disinterested, uncooperative, or even antagonistic
towards making available the original text in a readily-processable format.
Think "government agency" or "faceless corporation."

-- 
0
jreiser (115)
8/1/2006 2:35:04 PM
In article <eanoq802vr8@enews3.newsguy.com>, jreiser@BitWagon.com 
says...
> >>If you have the MS-Word DOC file why not just use Word to "Save As" text?
> > 
> > 
> > Exactly! I assumed the OP had a PDF file, which was why I was pointing 
> > out that a conversion form PDF to PS wasn't likely to be any better than 
> > starting with a PDF file.
> > 
> > Converting a Word file to PDF or PostScript and *then* trying to extract 
> > text form it, as opposed to getting it out of the original document is 
> > just bizarre.
> 
> No, it is not bizarre.  Probably the .pdf was written by somebody else
> who is unreachable, disinterested, uncooperative, or even antagonistic
> towards making available the original text in a readily-processable format.
> Think "government agency" or "faceless corporation."

If you read the thread, this was my original assumption. This was 
destroyed when the poster said that he didn't want to 'convert the PDF 
to PS'; he wanted to 'generate a PostScript file from MS Office'.

I now don't know what he's starting from. If he's got a PDF file, then 
his question has been answered. If he has a Word document, then (as I 
said) converting it to PDF and *then* trying to extract the text *is* 
bizarre.


			Ken
0
ken161 (742)
8/1/2006 3:56:31 PM
Ken Sharp wrote:
> In article <eanoq802vr8@enews3.newsguy.com>, jreiser@BitWagon.com 
> says...
>> >>If you have the MS-Word DOC file why not just use Word to "Save As" text?
>> > 
>> > 
>> > Exactly! I assumed the OP had a PDF file, which was why I was pointing 
>> > out that a conversion form PDF to PS wasn't likely to be any better than 
>> > starting with a PDF file.
>> > 
>> > Converting a Word file to PDF or PostScript and *then* trying to extract 
>> > text form it, as opposed to getting it out of the original document is 
>> > just bizarre.
>> 
>> No, it is not bizarre.  Probably the .pdf was written by somebody else
>> who is unreachable, disinterested, uncooperative, or even antagonistic
>> towards making available the original text in a readily-processable format.
>> Think "government agency" or "faceless corporation."
> 
> If you read the thread, this was my original assumption. This was 
> destroyed when the poster said that he didn't want to 'convert the PDF 
> to PS', He wanted to 'generate a PostScript file from MS Office'.
> 
> I now don't know what he's starting from, if he's got a PDF file then 
> his question has been asnwered. If he has a Word document, then (as I 
> said) converting it to PDF and *then* trying to extract the text *is* 
> bizarre.

I want to extract text from a printer spool file (PS) in a CUPS backend
(for certain reasons...).
The user prints from Windows (MS Office) to a CUPS printer.

The other way would be exactly that: send the file RAW to the printer
and try to extract the text from the .doc file.
0
fho (16)
8/2/2006 7:20:55 AM
Ken Sharp wrote:
> In article <44cf3032$0$7873$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
> says...
>> Ken Sharp wrote:
>> > In article <44cdb91b$0$7874$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
>> >> How can I extract text from a (MS Word) PDF file?
>> [...]
>> >> I think it must possible, because I also can copy and paste text from
>> >> PDF documents.
>> > 
>> > Have you tried it with this file ? This is a common complaint, you make 
>> > think it is easy to extract the 'text' from a PDF file, but in fact it 
>> > can be quite hard, and in some cases impossible.
>> 
>> How did copy&paste work different than convert it to text?
> 
> Well you haven't answered my question, which is whether copy and paste 
> worked with this particular file. However the simple answer is that it 
> isn't different. Excepting that a clever text extractor can do more to 
> try and reverse any odd encodings.

Yes, sorry. I've tested a little: pdftotext works whenever I can
copy & paste, and when I can't convert the PDF to text, copy & paste
doesn't work either.
0
fho (16)
8/2/2006 7:22:02 AM
Fabian Holler wrote:
> I want to extract text from a printer spool file(PS) in a CUPS backend.
> (for some reasons...)
> The user prints in windows(MS office) on a CUPS printer.
> 
> Another way is exactly, send the file RAW to the printer and try to
> extract the text from .doc file.

I googled for 'word doc to text' and found
   http://www.winfield.demon.nl/

I googled for 'convert word document to plain text' and found
   Use "AbiWord --to=txt"
   http://www.linux.com/article.pl?sid=06/02/22/201247
0
scobloke2 (557)
8/2/2006 10:16:49 AM
>>>>> "John" == John Reiser <jreiser@BitWagon.com> writes:

    John> Re-define (or "wrap") all the operators that nominally
    John> display text (namely: show, ashow, awidthshow, widthshow,
    John> kshow, cshow, xshow, yshow, xyshow) so that they write the
    John> string argument to a file.  Then process the file with
    John> ghostscript.

This technique won't work if the PS/PDF generating program has
deliberately obfuscated the file by recoding the font.


-- 
Lee Sau Dan                     李守敦                          ~{@nJX6X~}

E-mail: danlee@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee
0
danlee (1495)
8/2/2006 3:19:45 PM
In article <44d05352$0$7872$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
says...
 
> > Well you haven't answered my question, which is whether copy and paste 
> > worked with this particular file. However the simple answer is that it 
> > isn't different. Excepting that a clever text extractor can do more to 
> > try and reverse any odd encodings.
> 
> Yes sry, I've tested a little bit and pdftotext works if I can
> copy&paste, if I can't convert the pdf to text copy&paste also didn't work.

No surprise, I guess; it sounds like you have a re-encoded font. The only 
way to deal with this is to go from the encoding back to the glyph name, 
then try to figure out whether the glyph name has a matching ASCII character.

Sadly, some subset fonts will use unhelpful glyph names; for example 
instead of /adieresis, it might be /G00. There isn't really any way to 
go from names like that back to ASCII.
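
The two-step lookup described above (character code to glyph name, then
glyph name to character) can be sketched in Python as follows. This is
only an illustration under stated assumptions: the table is a tiny
hand-picked excerpt in the spirit of the Adobe Glyph List, and the two
encodings and the decode_codes helper are invented for the example, not
taken from any real file or tool.

```python
# Tiny excerpt of a glyph-name-to-character table; a real table has
# thousands of entries.
GLYPH_TO_CHAR = {
    "H": "H", "e": "e", "l": "l", "o": "o", "space": " ",
    "adieresis": "\u00e4",  # a with diaeresis
    "fi": "\ufb01",         # fi ligature
}

UNKNOWN = "\ufffd"  # replacement character for glyphs we cannot name


def decode_codes(codes, encoding, glyph_to_char=GLYPH_TO_CHAR):
    """Map character codes to text via the font's code -> glyph-name table."""
    chars = []
    for code in codes:
        name = encoding.get(code)  # step 1: code -> glyph name
        chars.append(glyph_to_char.get(name, UNKNOWN))  # step 2: name -> char
    return "".join(chars)


# Hypothetical re-encoded font that kept meaningful glyph names.
helpful_encoding = {1: "H", 2: "adieresis", 3: "l", 4: "o"}
# The same font after a subsetter renamed the glyphs to /G00, /G01, ...
opaque_encoding = {1: "G00", 2: "G01", 3: "G02", 4: "G03"}

codes = bytes([1, 2, 3, 3, 4])  # what the show string actually contains

if __name__ == "__main__":
    print(decode_codes(codes, helpful_encoding))  # Hällo
    print(decode_codes(codes, opaque_encoding))   # five replacement characters
```

With the helpful names the text comes back; with names like /G00 every
lookup in step 2 fails, which is the dead end described above.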

This, of course, also applies to PostScript: if the font has been 
embedded like that, then there really is no way back through software. In 
my experience this happens most with embedded, subset, TrueType fonts in 
output from the Microsoft PostScript drivers supplied with older 
versions of Windows.

TBH, the 'right' way to do this is by extracting text from the original 
application. If you can't do that, then the most reliable method is to 
OCR the document.

If you have complete control over the workflow, it may be possible to set 
up the printer driver in such a way that the text does come through in a 
representation which happens to match ASCII, and then you can do a 
simple extraction. This may not be reliable, as it depends on particular 
drivers configured in particular ways; any slight change may cause this 
to stop working.

I'm afraid neither PostScript nor PDF is intended as an 'editable' 
format, and so there is no provision for doing what you want.


			Ken
0
ken161 (742)
8/2/2006 3:20:57 PM
In article <44d0530e$0$7872$6e1ede2f@read.cnntp.org>, fho@mailinator.com 
says...

> I want to extract text from a printer spool file(PS) in a CUPS backend.
> (for some reasons...)
> The user prints in windows(MS office) on a CUPS printer.
> 
> Another way is exactly, send the file RAW to the printer and try to
> extract the text from .doc file.

If you absolutely must use a printer, have you tried printing to a 
'Generic / Text Only' printer? I don't know if that would be any better.

Failing that, then yes, I would use the document, although extracting 
text from a Word document isn't always trivial either.



		Ken
0
ken161 (742)
8/2/2006 3:23:48 PM
>>>>> "Ken" == Ken Sharp <ken@spamcop.net> writes:

    >> Yes sry, I've tested a little bit and pdftotext works if I can
    >> copy&paste, if I can't convert the pdf to text copy&paste also
    >> didn't work.

    Ken> No surprise I guess, sounds like you have a re-encoded
    Ken> font. The only way to deal with this is to go from the
    Ken> encoding back to the glyph name, then try and gifure out if
    Ken> the glyphname has a matching ASCII character.

In an attempt to obfuscate a PDF file, I tried re-encoding a font.
But both xpdf and Acrobat Reader (Linux version) could still do the
cut & paste correctly. (They can even recognize the "fi" ligature!) At
first I was puzzled: the recoding was a random shuffling, so how could
these tools figure out which charcode is which char? It must be the
glyph names...

So I went a step further and also obfuscated the glyph names in the
fonts! After that, neither xpdf nor Acrobat Reader could work out the
original text anymore. ;)



    Ken> Sadly, some subset fonts will use unhelpful glyph names; for
    Ken> example instead of /adieresis, it might be /G00. There isn't
    Ken> really any way to go from names like that back to ASCII.

That's also what I did. I gave generated, meaningless names to the
glyphs, and that achieved my goal: obfuscation. :)


    Ken> TBH, the 'right' way to do this is by extracting text from
    Ken> the original application. If you can't do that, then the most
    Ken> reliable method is to OCR the document.

I think so, too. But instead of printing a hard copy and then
scanning it back in (or converting the PS/PDF to an image file
directly), is there OCR software smart enough to recognize characters
from the glyph information? E.g. the glyph metrics may already be
useful for guessing which characters they map to. There is no need to
rework these measures from a raster sub-image.
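
To make the metrics idea concrete, here is a speculative Python sketch:
it compares the advance widths of obfuscated glyphs with the widths of a
reference font and lists the characters that could match. Everything in
it is an assumption for illustration: the candidates_by_width helper,
the tolerance, and the width values themselves (made-up numbers, not
taken from any real font metrics file).

```python
def candidates_by_width(unknown_widths, reference_widths, tolerance=2):
    """For each obfuscated glyph, list reference characters of similar width.

    unknown_widths:   {glyph name: advance width} from the obfuscated font
    reference_widths: {character: advance width} from a known font
    """
    result = {}
    for glyph, width in unknown_widths.items():
        result[glyph] = sorted(
            ch for ch, ref in reference_widths.items()
            if abs(ref - width) <= tolerance
        )
    return result


# Illustrative widths in 1/1000 em units (invented for this example).
reference = {"i": 222, "l": 222, "t": 278, "a": 556, "e": 556, "m": 833}
obfuscated = {"G00": 833, "G01": 556, "G02": 222}

if __name__ == "__main__":
    for glyph, chars in candidates_by_width(obfuscated, reference).items():
        print(glyph, chars)
```

Widths alone narrow the choice but rarely settle it (here /G01 could be
"a" or "e"), so a real tool would still need the outlines or letter
statistics to decide.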


    Ken> I'm afraid neither PostScript nor PDF is intended as an
    Ken> 'editable' format, and so there is no provision for doign
    Ken> what you want.

In the worst case, the PS/PDF file could have just drawn each
character as a drawing, skipping the use of the font machinery! :(
(Yeah, I know that's very inefficient; it can't take advantage of the
font caching mechanism.)



-- 
Lee Sau Dan                     李守敦                          ~{@nJX6X~}

E-mail: danlee@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee
0
danlee (1495)
8/3/2006 4:00:27 PM
In article <87psfhdd2c.fsf@informatik.uni-freiburg.de>, 
danlee@informatik.uni-freiburg.de says...

>     Ken> TBH, the 'right' way to do this is by extracting text from
>     Ken> the original application. If you can't do that, then the most
>     Ken> reliable method is to OCR the document.
> 
> I  think so,  too.   But instead  of  printing a  hard  copy and  then
> scanning  it  back in  (or  converting the  PS/PDF  to  an image  file
> directly), are there  OCR software that are smart  enough to recognize
> from the  glyph information?   e.g. the glyph  metrics may  already be
> useful for guessing which characters they map to.  There is no need to
> rework these measures from a raster sub-image.

I don't think there's anything that will deal with a PDF file, extract 
the font outlines and use those, no, but I could be wrong. In any event, 
as you note below, this is insufficient in the general case, since 
'text' can also be emitted as linework (draw a path matching the glyph 
and fill, or stroke) or as images (type 3 fonts, especially logos).

 
>     Ken> I'm afraid neither PostScript nor PDF is intended as an
>     Ken> 'editable' format, and so there is no provision for doign
>     Ken> what you want.
> 
> In  the  worse  case, the  PS/PDF  file  could  have just  drawn  each
> character as  a drawing, skipping the  use of the  font machinery!  :(
> (Yeah, I know that's very  inefficient, it can't take advantage of the
> font caching mechanism.)

It's also possible to get multiple instances of the same 'text', for 
example by filling a glyph in one colour, and stroking in another. A 
naive extraction might extract the text twice ;-)
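
One way an extractor might guard against that double extraction is
sketched below in Python. It assumes the extractor already records each
text run together with the position where it was drawn; the (x, y, text)
tuple layout, the dedupe_runs name and the tolerance are all hypothetical
choices for this illustration.

```python
def dedupe_runs(runs, tolerance=0.01):
    """Drop text runs that repeat an earlier run at the same position.

    runs: list of (x, y, text) tuples in drawing order.
    """
    kept = []
    for x, y, text in runs:
        duplicate = any(
            seen == text
            and abs(kx - x) <= tolerance
            and abs(ky - y) <= tolerance
            for kx, ky, seen in kept
        )
        if not duplicate:
            kept.append((x, y, text))
    return kept


if __name__ == "__main__":
    runs = [
        (72, 720, "Title"),   # glyphs filled in one colour
        (72, 720, "Title"),   # same glyphs stroked in another
        (72, 700, "Body"),
        (300, 700, "Title"),  # same word elsewhere on the page: kept
    ]
    print([text for _, _, text in dedupe_runs(runs)])
```

The same word at a different position is kept, so genuinely repeated
text is not lost.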


				Ken
0
ken161 (742)
8/3/2006 4:42:50 PM
Ken Sharp wrote:

> Its also possible to get multiple instances of the same 'text', for
> example by filling a glyph in one colour, and stroking in another. A
> naive extraction might extract the text twice ;-)
>
>
> 				Ken

There are some .pdf files where one can't extract (or copy and paste)
text from the file. Some of these can be saved to .ps by Acrobat Pro,
and then 'extract text' from gsview will work. I read this thread
briefly, but I saw no mention of author-set PDF file permissions as a
possible cause.

A grey limned boundary,
Stephen

0
TeXtonyx (2)
8/13/2006 2:42:15 PM
Reply:

Similar Artilces:

PDF::API2
Hello All, I am new to PDF files so I don't really know if what I want to do is possible and how to use the PDF::API2 modules. I need to extract information from columns in a table ( I assume that PDF does not know anything about tables). What I was thinking of doing was finding the horizontal location of the header (I know what it should be), then extract all text that starts at that location. I have played around with the PDF::API2 module and read the 'Using PDF::API2 - The code' help page, however it doesn't show me how to extract information from an existing file. ...

extracting text from pdf files
Can anyone help me with how to extract text from pdf files using PHP or ColdFusion? Thanks for any help. Hi, Try the Xpdf project. Run the pdftotext command in the shell to produce the text. http://www.foolabs.com/xpdf/download.html There's more tips at php.net/pdf. runner7@fastmail.fm wrote: > Can anyone help me with how to extract text from pdf files using PHP or > ColdFusion? Thanks for any help. petersprc@gmail.com wrote: > Hi, > > Try the Xpdf project. Run the pdftotext command in the shell to produce > the text. > > http://www.foolabs.com/xpd...

Extract Text out of PDF file
Does anyone know how to extract text out of a PDF file so that it can be ealisy imported into a databse? Example: Books. I would need a sepearte field for the title, author, publisher, date, description, image name, etc... I know all of this informaiton is stored in the PDF however, I can't seem to get it out correctly with doing it manually. Maybe, a apple script to pull based on font(?) or something... Any help will be greatly appricated. If there is a program out there or if anyone can build this for me that would rock. Matt PDFBox from http://www.pdfbox.org will do the trick for ...

ANN: Fly Text to PDF
Hi All: Fly Text to PDF 1.3 is powerful tool which can convert your text files into PDF. This tool is powerful converter tool running on Microsoft Windows Operating System. You can use this tool to convert your text report, text documents and other text files into PDF quickly and easily. You also can set the PDF properties in each text files by using special tags, or set the default properties for every output PDF files. Please visit our website for more information: http://www.medafan.com/pdf-tools For the output sample, please click on: http://www.medafan.com/pdf-tools/license.pdf Key fea...

Question about Text Extraction from postscript files
Hi, with the ERP of my company I can create some reports. These reports can generate postscript files that I can view with gsview, without problems. The problem is this: if I open the postscript source file with a text editor and I search for the text that I can see with gsview, I find nothing. For example: if with gsview I see a string like "BREAK", in the postscript source file I should find a line like "(BREAK) show", but I find only lines like "(\000%\0005\000\(\000$\000.\000!) show". Could someone tell me, what kind of encoding is this? I need to extract plain text string from the postscript files, generated from my ERP. Has someone suggestions ? Thanks in advance for any comments. Stefano sseghetto@gmail.com wrote: >For example: if with gsview I see a string like "BREAK", in the >postscript source file I should find a line like "(BREAK) show", but I >find only lines like "(\000%\0005\000\(\000$\000.\000!) show". > >Could someone tell me, what kind of encoding is this? Only wild guesses. It looks like a double byte font, perhaps a CID font, anyway, something using a two byte encoding. You'd have to refer to the embedded font and do detailed analysis for more information. It also looks likely that it uses arbitrary codes, and in most such cases text extraction is absolutely impossible. It may be time to start retyping; or perhaps focus on direct export from the ERP software --------...

Script to extract text from PDF files
I have a very crude Python script that extracts text from some (and I emphasize some) PDF documents. On many PDF docs, I cannot extract text, but this is because I'm doing something wrong. The PDF spec is large and complex and there are various ways in which to store and encode text. I wanted to post here and ask if anyone is interested in helping make the script better which means it should accurately extract text from most any pdf file... not just some. I know the topic of reading/extracting the text from a PDF document natively in Python comes up every now and then on comp.lang....

extract text from a PDF file with JAVA
Hi to all the newsgroup. this is my first post. I'm approaching the text retrieving from PDF files with java. I know this is a very known problem in this newgroup, but i need for some example code, tutorial, guide or similar. I'm using, for the moment, PDFBox library but i notice a lot of errors in its PDF parsing. So i've tried with "Pjx" library and i've found a good example code in this site: http://www.jguru.com/faq/view.jsp?EID=1074237 ....but i can't find a way to call "PdfParser.getContents()" method. I will appreciate any advice. Thanks in adv...

How to extract Arabic Text from PDF file
Dear All, Hi, I am doing following task in PHP.... I am using pdftotext command line utility of xpdf package for Windows and Linux. It successfully extracts English text from PDF files. Now I need to extract Unicoded Arabic text from PDF files. For this, I tried: "pdftotext -enc UTF-8 arabicFile.pdf arabicFile.txt" If I remove -enc switch/parameter, there is empty space in place of Arabic text, but English text is extracted from PDF. With -enc UTF-8, some Arabic characters/alphabet s are extracted from PDF, but the complete Arabic text is not extracted. I also have downloaded and ...

postscript file and pdf file
Dear Sir: On web if I need to download an ariticle, most of them are pdf files, which I can open it through acroread program. Ocassionally, there are postscript files(ps files). I noticed the size of a ps file is very big, while pdf file is small. I am curious why some people use ps files, not pdf files. I know that it is very easy to use ps2pdf, pdf2ps to change between them. I am just wondering if there is any advantage for using postscript files. By the way, for printing a postscript file, the printer needs to be a postscript printer, right? Thank you very much. Best Regards, Xiaoshen Xiaoshen Li wrote: > > I am curious why some people use ps files, not pdf files. I know that it > is very easy to use ps2pdf, pdf2ps to change between them. I am just > wondering if there is any advantage for using postscript files. > Very little advantage over PDF that I can imagine, for most purposes, and in fact any device-specific code there may be in a PS file can cause problems on other types of printer than the one originally specified. I think most of these files would be 'legacies' from before PDF became so widespread, or are created out of reluctance to adopt a newer format. Although raster images in PS files will typically be of much higher resolution than those in "screen optimized" PDFs intended for distribution over networks, the PDF format is equally capable of supporting high-quality imaging. While a PS file containing programmin...

Script to extract text from PDF files #2
On Wed Sep 26 15:06:54 CEST 2007, byte8bits wrote: > On Sep 25, 10:19 pm, Lawrence D'Oliveiro <l... at geek- > central.gen.new_zealand> wrote: > > > This is inherent in the nature of PDF: it's a page-description language, > > not a document-interchange language. Each text-drawing command can put a > > block of text anywhere on the page, so you have no idea, just from > > parsing the PDF content, how to join these blocks up into lines, > > paragraphs, columns etc. > > So (I'm not being a wise guy) how does pdftotext do it so well...

Re: Script to extract text from PDF files
On Wed Sep 26 23:50:16 CEST 2007, byte8bits wrote: > On Sep 26, 4:49 pm, Svenn Are Bjerkem <svenn.bjer... at googlemail.com> > wrote: > > > I have downloaded this package and installed it and found that the > > text-extraction is more or less useless. Looking into the code and > > comparing with the PDF spec show a very early implementation of text > > extraction. Luckily it is possible to overwrite the textextraction > > method in the base class without having to fiddle with the original > > code. I tried to contact the developer to offer s...

How to extract the text from pdf file using PDFBOX library
I have tried to extract the pdf document to text using pdf box library in Android public static void read(String[] args) throws IOException{ PDDocument doc = null; try { doc = PDDocument.load("C:\\Android.pdf"); PDFTextStripper stripper = new PDFTextStripper(); String text =stripper.getText(doc); } finally { if (doc != null) { doc.close(); } } But getting error in the logcat that Could not find method org.apache.pdfbox.pdmodel.PDDocument.load, ref...

extracting from a pdf files txt file
Hi there,is there a methode to extract from a pdf files in sed & awk languages the text file? thanks in advance On 2008-01-08, franzi <hazzino@gmail.com> wrote: > > > Hi there,is there a methode to extract from a pdf files in sed & awk > languages the text file? > thanks in advance pdftotext, but it is not a standard unix command. On Jan 8, 2:10 am, franzi <hazz...@gmail.com> wrote: > Hi there,is there a methode to extract from a pdf files in sed & awk > languages the text file? > thanks in advance One poster suggested pdftotext. An alternat...

Acumen Journal December 2006 now available. This month: Extracting Text from a PostScript File
Acumen Journal
A Technical Newsletter from Acumen Training
Issue 46, December 2006
----------------------

Announcing issue 46 of Acumen Journal, a periodic technical newsletter for the PostScript and Acrobat community. Each issue of the Journal presents two articles, one each on PostScript and Acrobat:

*PostScript Tech* is an article on some aspect of PostScript programming.
*Acrobat User* is an article on the advanced use of some part of Acrobat.

Also in each issue is the current schedule of Acumen Training's PostScript and Acrobat classes.

The Acumen Journal is free for the downloading. Any comments, questions about PostScript or Acrobat, or suggestions for future articles, feel free to contact John Deubert at Acumen Training. (john@acumentraining.com)

To get the latest issue, go to http://www.acumentraining.com/acumen_journal.asp

-------------
In this issue
-------------

PostScript Tech - Extracting Text From a PostScript File
--------------------------------------------------------
We see how to use redefinitions of PostScript operators to extract the text from a PostScript file.

The Acrobat User - Customizing Acrobat 8 Toolbars
-------------------------------------------------
Acrobat 8 makes it remarkably simple to add and remove controls from its toolbars. This month's article demonstrates the process.

Class Schedule: Jan-Feb-Mar
---------------------------
Where and when are we teaching our PDF, PostScript, and Acrobat class...

Script to extract portions of text from a text file
I had a need to extract portions of text (delimited by fixed patterns) from a large text file. Here's a shell script that I wrote for this purpose. Kindly suggest a more "elegant" way to do the same:

Thanks,
Bhat

    #!/bin/ksh
    # Script to extract portions of text from a text file
    #
    #
    #set -x
    if [ "$#" -lt "3" ]
    then
        echo "usage $0: <input-file> <begin-pattern> <end-pattern>"
        exit 1
    fi
    if [ ! -f "$1" ]
    then
        echo "File $1 does not exist"
        exit 2
    fi
    if [ ! -r "$1" ]
    then
        echo "Error reading file $1...
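
The posted script is cut off before its extraction logic, so the following is only a sketch of the general idea (collect the lines between a begin pattern and an end pattern), written in Python rather than ksh. The function name `extract_sections`, the choice to include the delimiter lines, and the choice to drop an unterminated section are all assumptions, not the original poster's behaviour.

```python
import re


def extract_sections(lines, begin_pattern, end_pattern):
    """Yield one list of lines per begin..end section, delimiters included.

    The end pattern is only tested on lines after the begin line, and a
    section that never reaches an end line is silently dropped.
    """
    begin = re.compile(begin_pattern)
    end = re.compile(end_pattern)
    current = None  # None while outside a section
    for line in lines:
        if current is None:
            if begin.search(line):
                current = [line]
        else:
            current.append(line)
            if end.search(line):
                yield current
                current = None


sample = ["intro", "BEGIN", "x", "END", "noise", "BEGIN", "y", "END"]
for section in extract_sections(sample, r"^BEGIN", r"^END"):
    print(section)
```

Because it works on any iterable of lines, an open file object can be passed directly, so a large input never has to be read into memory at once.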

Script to rename file to text extracted from file.
I have hundreds of lscfg files that are named with all sorts of names. I would like to rename each file to the serial number in each file. I first sorted the files into power, rspc and chrp. I am only interested in the chrp machines at this time.

I can create a file with a list of the files in it (chrplist). I can get the serial number:

    grep "Machine/Type" $filename | cut -d . -f 5

My problem is I can't seem to handle two variables. Example:

    while read filename; do
        grep "Machine/Type" $filename | cut -d . -f 5 > newname
        cp $filename $newname
    done < chrplist

I know this does not work, and this is not even close to all the things I have tried; I have tried a lot of other methods. I can do the renames by hand, but what a pain. If this was a one-time thing I would not worry too much, but I get these files in often, and it makes it a lot easier to do compares and maintain the latest configuration files if they are all named by their serial number. Any pointers would be appreciated.

Andy <noname@nowhere.net> writes:

After a tiny modification it *should* work:

    while read filename; do
        newname=`grep "Machine/Type" $filename | cut -d . -f 5`
        cp $filename $newname
    done < chrplist

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.

On Fri, 11 Aug 2006 20:29:14 -0700, Paul Pluzhnikov <ppluzhnikov-nsp@charter.net> wrote:
>Andy <noname@nowhere.net> writes:
>
>After a tiny modif...
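
For comparison, the same capture-then-copy idea can be sketched in Python. This is an illustration only: `serial_from_text` and `copy_by_serial` are made-up names, the sample line is invented rather than real lscfg output, and unlike `cut` the sketch returns `None` when a matching line has fewer than five dot-separated fields.

```python
import shutil
from pathlib import Path


def serial_from_text(text, marker="Machine/Type", field=5):
    """Return the given dot-separated field of the first line containing marker."""
    for line in text.splitlines():
        if marker in line:
            parts = line.split(".")
            if len(parts) >= field:
                return parts[field - 1].strip()
    return None


def copy_by_serial(list_file):
    """Copy each file named in list_file to a new file named by its serial."""
    for name in Path(list_file).read_text().splitlines():
        name = name.strip()
        if not name:
            continue
        serial = serial_from_text(Path(name).read_text())
        if serial:  # skip files with no usable serial rather than guess a name
            shutil.copy(name, serial)


# Invented sample line; real lscfg output will look different.
print(serial_from_text("header\nMachine/Type.a.b.c.XYZ123\n"))
```

As in the shell version, the copy lands in the current directory, and two files that report the same serial would overwrite each other.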

ghostscript PDF page extraction, leaving text as text
Ghostscript may be used to extract pages from a PDF file with a command like this:

    gs -sDEVICE=pdfwrite \
       -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage=48 -dLastPage=48 \
       -sOutputFile=onepage.pdf input.pdf

The problem is, while that page looks the same as the original in a PDF reader, it seems to be an image rather than an "object" representation. That is, open the original PDF in something like Acrobat or PDF XChange Viewer and "search" and "text selection" work, whereas in the extracted one neither function works. Presumably this is because the text has been rasterized.

Is it possible to use gs to extract ranges of pages, preferably also reducing the resolution of the embedded images, but leaving the text as text? I frequently need to reduce the size of PDF files, but it should all come out of the resolution of the images, and the text should remain as accessible as it was in the original. If ghostscript cannot do this, is there another linux tool that can?

Thanks,
David Mathog

>>>>> "David" == David Mathog <dmathog@gmail.com> writes:

David> gs -sDEVICE=pdfwrite \
David>    -dNOPAUSE -dBATCH -dSAFER \
David>    -dFirstPage=48 -dLastPage=48 \
David>    -sOutputFile=onepage.pdf input.pdf

I've just tried this with a PDF file, and it works: search and select work on both onepage.pdf and input.pdf.

David> The problem is, while that page looks the same as the
David> ori...

How to convert this ps file to a pdf file w/ searchable text? (pkfix-helper: No Type 3 fonts were encountered in the input file)
I try to use pkfix-helper to fix the pdf and then run ps2pdf on it. But I get the following error. Could you please let me know if there is any other way to generate a searchable pdf file?

    $pkfix-helper 0375.ps
    Reading 0375.ps ... done.
    Number of Type 3 fonts encountered: 0
    pkfix-helper: No Type 3 fonts were encountered in the input file

On May 13, 10:45 pm, Peng Yu <pengyu...@gmail.com> wrote:
> I try to use pkfix-helper to fix the pdf and then run ps2pdf on it.
> But I get the following error. Could you please let me know if there
> is any other way to generate a searchable pdf file?
>
> $pkfix-helper 0375.ps
> Reading 0375.ps ... done.
> Number of Type 3 fonts encountered: 0
> pkfix-helper: No Type 3 fonts were encountered in the input file

That looks more like a "nothing to do" message than an "error" per se. If we could see some of the postscript, ...

On May 14, 12:41 am, luser- -droog <mijo...@yahoo.com> wrote:
> On May 13, 10:45 pm, Peng Yu <pengyu...@gmail.com> wrote:
>
> > I try to use pkfix-helper to fix the pdf and then run ps2pdf on it.
> > But I get the following error. Could you please let me know if there
> > is any other way to generate a searchable pdf file?
>
> > $pkfix-helper 0375.ps
> > Reading 0375.ps ... done.
> > Number of Type 3 fonts encountered: 0
> > pkfix-helper: No Type 3 fonts were encountered in the input file
>
> ...

Reading Text File, Text Scrolling and erase line from a text file
Hi,

How can I read a text file, scroll the text with the "Page Up" and "Page Down" keys, and erase a specific line from the text file when the "F2" key is pressed?

Best Regards,
Lidia from Poland

--
=========================================
Regards, Lidiaa

Lidiaa wrote:
> Hi,
>
> How can I read a text file, scroll the text with the
> "Page Up" and "Page Down" keys, and erase a specific line from the text
> file when the "F2" key is pressed?
>
> ...

extract text layer from searchable pdf and merge with another pdf
Dear comp.text.pdfians,

I have a PDF (a searchable PDF consisting of scanned book pages that were then passed through OCR, which added a text layer hidden under the images, so the PDF is searchable). This PDF uses JBIG2 compression (it has 135 pages in A5 format scanned at 300 dpi, and its size is about 1928 KB).

After the OCR pass, I noticed that the scans have been degraded in quality, so I want to extract the text layer and merge it with another copy of the same PDF containing high-quality scans.

Is it possible to extract a text layer from a PDF and then merge it with the raster layer of another PDF?

-- Puppy Linux...

A PDF into a FM file then save as pdf... how to have alll the pages of the included pdf file
FM 7.2

We can import a complete PDF file as an object into a FrameMaker file. Then, when we try to generate a PDF from this FrameMaker file, we get only the first page of the imported PDF.

Is there a method to import a PDF file and force FM to generate a PDF file with Save As... that contains the complete inserted PDF?

> We can import a complete PDF file as an object into a FrameMaker file. Then,
> when we try to generate a PDF from this FrameMaker file, we get only the
> first page of the imported PDF.
>
> Is there a method to import a PDF file and force FM to generate a PDF file
> with Save As... that contains the complete inserted PDF?

FrameMaker imports PDF files per page only. To do something similar to what you intend, I once came up with a Windows shell script that writes out a MIF file, incorporating the name of a referenced PDF file and its page count. Note that the MIF code was just stripped off a file and isn't constructed very nicely, but it does the job. After running the script, you should use FrameMaker to open and save the output file. There is no text flow, thus you can't import the file into a document, but you can use it as part of a book instead.

Regards
Johannes

Here is the VB script, feel free to post an optimized version:

    Dim fso, mif, pdfname, pagecount, counter
    Set fso = CreateObject("Scripting.FileSystemObject")
    pdfname = InputBox("Name o...

Please help: Automatically extract PDF files from zip by keeping file structure?
Hello,

Here is my problem: I would like to be able to make an automatic backup copy of only the PDF files that are included inside many different zip files. In fact, I would like to keep only password-protected PDF files. But at first I wish to keep all the PDF files with the original folder structure. Then, I hope that I can find a way to keep only password-protected PDF files (maybe using pdftotext: if pdftotext can read the content of a PDF file, it is not protected, so delete it!).

I am asking all this because I have 10 GB of various data (html, pdf, ..). And I would l...

text on pdf file
How can I add text to a PDF file so that it prints with the page?

Thanks
Sandro

"Sandrobi" <sandbizzXXX@libero.it> wrote in message news:<TQP0c.42052$Kc3.1337410@twister2.libero.it>...
> How can I add text to a PDF file so that it prints with the page?
>
> Thanks
>
> Sandro

There are so many ways to do it if you have the full version of Acrobat. Not the Free Reader.

1. Use the free text tool.
2. Use the forms tool and make a text box and fill it up.
3. Use the touchup text tool, click where you want the text inserted while holding the &...

Extract text from .pdf
I have Acrobat Pro, is it possible to extract text from a .pdf? I see the "save as" options including Word Doc but it still seems to be an image? The OCR software with my Canon LiDE 200 scanner is as useless as tits on a boar hog..........

In article <C61B79B5.40FEF%elvisp@compuserve.com>, The Wolf <elvisp@compuserve.com> wrote:
> I have Acrobat Pro, is it possible to extract text from a .pdf? I see the
> "save as" options including Word Doc but it still seems to be an image?

Acrobat has its own OCR built-in. I've found it to be very accurate, eve...
