need help parsing PDF documents

Hi,

I'm currently writing a routine in PHP to extract the text parts from
a PDF so that I can build an index and search it. So far so good: I can
extract all text parts that were compressed with FlateDecode. But now
I have two problems:

1) When FlateDecode was used in combination with ASCII85Decode, I
always end up with a failure. I wrote a routine which does the ASCII85
decoding. It "should" work properly: at least when I take a test string
and calculate the result on paper, the function produces the same
result! But every time I decode an ASCII85-encoded stream from a PDF
and then try to apply FlateDecode, the inflate step fails. My guess is
that PHP might mix up some of the data when it is returned as a binary
string, for which I use the pack() function, by the way.

2) The PDF documents to be parsed were mostly created on a Macintosh,
and some under Windows. When I extract the text parts from a "Mac PDF",
the character codes always come out scrambled. To be more precise, the
files were made using XPress and Distiller. The result is that the
German special characters like "üäöß" (ASCII 129, 132, 148, 225) from the
extended ASCII table have the wrong character codes. They are printed as
"ŸŠš§" (not sure if they will be displayed correctly on your machine).
This in turn means that words using those special characters can't be
found in the index. I tried to find out which character set was used
when the documents were created so that I could convert them, but no
luck so far.

Can someone help me on this??? Thanks in advance! :)
ChK
Reply chKostal 4/23/2005 11:50:05 AM

In article <NDqae.16163$0z2.12513@news.chello.at>, plauderer@gmx.at
says...

> 1) When FlateDecode was used in combination with ASCII85Decode, I
> always end up with a failure. I wrote a routine which does the ASCII85
> decoding. It "should" work properly: at least when I take a test string
> and calculate the result on paper, the function produces the same
> result! But every time I decode an ASCII85-encoded stream from a PDF
> and then try to apply FlateDecode, the inflate step fails. My guess is
> that PHP might mix up some of the data when it is returned as a binary
> string, for which I use the pack() function, by the way.

If your PHP server is messing up the data, then there's not a lot that
can be done to help you, but this seems unlikely, unless it's converting
the line endings in a text stream. If it's doing that, then chances are
your PDF file will be corrupted in other ways.

Do you have a copy of GhostScript? If so, you can use it to decode the
ASCII85 data. In fact you can use it to decode the Flate data too.

This *may* work on GS; it's written for a different interpreter:
-------------------------------------------------------------------
% Buffer for one chunk of decoded data
/WorkString 255 string def
/OutFile (inflate.ps) (w) file def
% Chain the filters: ASCII85 first, then Flate
/InFile (flate.ps) (r) file /ASCII85Decode filter /FlateDecode filter def

{
    InFile WorkString readstring
    {
        % Buffer was filled completely, more data may follow
        OutFile exch writestring
    }
    {
        %% EOF, did we read *any* data ?
        dup length 0 gt
        {
            OutFile exch writestring
        }
        {
            pop
        } ifelse
        exit
    } ifelse
} loop

% Flush and close the output file
OutFile closefile
-------------------------------------------------------------------

This reads data from a file called 'flate.ps' and writes it to a file
called 'inflate.ps'.

You can use this to check the results of your own code. You can also use
GS to create streams which you can test your code with. Better to test
with known data, then you can be sure it's not you that's at fault.
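For a cross-check that does not need GhostScript, here is a minimal sketch in Python (the logic ports directly to PHP). The function names `ascii85_decode` and `decode_stream` are made up for illustration. It decodes ASCII85 by hand so that the two spots where hand-written decoders often go wrong are visible: the short final group, and the 5-digit base-85 value, which needs an unsigned 32-bit range. On a 32-bit PHP build integers are signed, so that value can overflow; nothing in this thread confirms that this is the cause here, but it is worth ruling out.

```python
import base64
import zlib


def _group_to_bytes(digits, keep):
    """Turn five base-85 digits into 4 bytes and return the first `keep`."""
    value = 0
    for digit in digits:
        value = value * 85 + digit          # needs an unsigned 32-bit range
    if value > 0xFFFFFFFF:
        raise ValueError("ASCII85 group overflows 32 bits")
    return value.to_bytes(4, "big")[:keep]


def ascii85_decode(data: bytes) -> bytes:
    """Decode the body of an ASCII85-encoded PDF stream."""
    end = data.find(b"~>")                  # end-of-data marker
    if end != -1:
        data = data[:end]
    if data.startswith(b"<~"):              # optional PostScript-style start marker
        data = data[2:]

    out = bytearray()
    group = []
    for byte in data:
        if byte in b" \t\r\n\f\x00":        # whitespace is ignored everywhere
            continue
        if byte == ord("z") and not group:  # 'z' abbreviates four zero bytes
            out += b"\x00\x00\x00\x00"
            continue
        if not 33 <= byte <= 117:           # valid digits are '!' .. 'u'
            raise ValueError(f"invalid ASCII85 byte {byte:#x}")
        group.append(byte - 33)
        if len(group) == 5:
            out += _group_to_bytes(group, 4)
            group = []

    if group:                               # short final group
        if len(group) == 1:
            raise ValueError("a final group of one character is invalid")
        keep = len(group) - 1               # n characters yield n - 1 bytes
        group += [84] * (5 - len(group))    # pad with 'u' (digit value 84)
        out += _group_to_bytes(group, keep)
    return bytes(out)


def decode_stream(raw: bytes) -> bytes:
    """Undo /Filter [/ASCII85Decode /FlateDecode]: ASCII85 first, then inflate."""
    return zlib.decompress(ascii85_decode(raw))


# Known test data, built with the standard library encoder.
sample = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = base64.a85encode(zlib.compress(sample)) + b"~>"
print(decode_stream(encoded) == sample)
```

If the hand-written decoder and a library decoder disagree on the same known input, the fault is in the decoder rather than in the PDF.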


> 2) The PDF documents to be parsed were mostly created on a Macintosh,
> and some under Windows. When I extract the text parts from a "Mac PDF",
> the character codes always come out scrambled. To be more precise, the
> files were made using XPress and Distiller. The result is that the
> German special characters like "üäöß" (ASCII 129, 132, 148, 225) from the
> extended ASCII table have the wrong character codes. They are printed as
> "ŸŠš§" (not sure if they will be displayed correctly on your machine).
> This in turn means that words using those special characters can't be
> found in the index. I tried to find out which character set was used
> when the documents were created so that I could convert them, but no
> luck so far.

You need to apply the Encoding in use with the font at the time. I know
it looks like the text in the strings is ASCII, but that's a false
impression. The outlines in a font are stored by *name*, usually names
like /A, /B, /C etc., but not necessarily so; names like /G01 are not
uncommon.

The way the text string you see in the PDF file is mapped to the named
glyph in the font program is via the Encoding array. The numeric
character value is looked up in a 256-element array, which maps the
number to a name. For example, the Encoding array might contain /A at
index 0x41, /B at 0x42 etc.

This is quite common, and since these are ASCII values, it *looks* like
PDF text uses ASCII values.

However, because it's not that simple, this will inevitably fail
eventually. A subset font, for instance, usually does not bother encoding
its glyphs with ASCII values. A font with 9 glyphs may well position
them at Encoding positions 0 to 8. For example:

/Encoding:
Index       Glyph Name
0           /A
1           /n
2           /space
3           /e
4           /x
5           /a
6           /m
7           /p
8           /l

Now the text string will look like this (the character codes are written
as octal escapes, so code 8 is \010):

(\000\001\002\003\004\005\006\007\010\003)

but the text in Acrobat will read 'An example'.
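The lookup for this subset font can be sketched in a few lines of Python. `ENCODING`, `GLYPH_TEXT` and `show_text` are illustrative names, and the glyph-name table is deliberately tiny: single-letter names stand for themselves and only /space needs an entry.

```python
# Encoding array from the example: list index = character code, value = glyph name.
ENCODING = ["A", "n", "space", "e", "x", "a", "m", "p", "l"]

# Hypothetical, minimal glyph-name table; a real one is far larger.
GLYPH_TEXT = {"space": " "}


def show_text(codes: bytes) -> str:
    """Map each character code to a glyph name, then to readable text."""
    chars = []
    for code in codes:
        name = ENCODING[code] if code < len(ENCODING) else None
        if name is None:
            chars.append("\ufffd")          # code not present in this Encoding
        else:
            chars.append(GLYPH_TEXT.get(name, name))
    return "".join(chars)


# The string from the PDF, written with octal escapes (code 8 is \010).
raw = b"\000\001\002\003\004\005\006\007\010\003"
print(show_text(raw))  # An example
```

Treating `raw` as ASCII would yield only control characters, which is why the string cannot be indexed without the font's Encoding.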

I imagine that the text you are looking at is encoded with MacRoman
encoding, which is different from WinAnsi encoding. As a result the
characters outside the normal 7-bit range (i.e. those above 127) are
different. IIRC the PDF Reference Manual has a list of the standard
Encodings; I expect MacRoman will be in there.

Now, how does this affect you? Well, basically, unless you can be
*certain* how the PDF files you are disassembling were produced, the
technique you are using is doomed to eventual failure.
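The MacRoman versus WinAnsi difference can be seen with the four bytes from the original post. This is only a sketch: Python's `mac_roman` and `cp1252` codecs are the operating-system character sets, which are close to but not guaranteed identical to PDF's MacRomanEncoding and WinAnsiEncoding, and the conversion is only valid when the font really uses that encoding with no Differences array.

```python
# The bytes found in the "Mac PDF" for the German characters.
mac_bytes = bytes([0x9F, 0x8A, 0x9A, 0xA7])

as_mac = mac_bytes.decode("mac_roman")   # interpreted with the right table
as_win = mac_bytes.decode("cp1252")      # interpreted as if they were Windows text

# Re-encode for a Windows-style (Latin-1 compatible) index.
win_bytes = as_mac.encode("cp1252")

print(as_mac)            # üäöß
print(as_win)            # ŸŠš§
print(win_bytes.hex())   # fce4f6df
```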

The only way to really extract text from a PDF file is to fully
interpret it, and take account of the Encoding. Even then I'm afraid
that there are files which will defeat you. While it's possible to follow
the Encoding array and extract the glyph name for a given text
character, it's only possible to know what that character is by
understanding the glyph name (e.g. /space). If the font uses non-standard
glyph names (such as /G01) then you have no way to know what the
character actually is. Well, short of OCR anyway....

Even this is an over-simplification; things get still more complicated
when dealing with TrueType fonts, and with CIDFonts.

Is this a commercial project, or a spare-time hobby? If it's commercial,
then there are tools which will do this for you. Even if it's just a
hobby, there are freeware tools which can do some or all of what you are
attempting. I would suggest looking at www.pdfzone.com to see if
anything there will suit your needs. GhostScript is especially useful,
as is xpdf.


            Ken
Reply Ken 4/23/2005 12:57:14 PM


First of all thanks for your professional answer! I was very impressed. :)
....
> Do you have a copy of GhostScript ? If so you can use it to decode the 
> ASCII85 data. In fact you can use it to decode the Flate data too.
Good idea to use GhostScript as a test-drive. I'll have to check the 
script you provided.
> 
>  
> 
> ...
> You need to apply the Encoding in use with the font at the time. I know
> it looks like the text in the strings is ASCII, but that's a false
> impression. The outlines in a font are stored by *name*, usually names
> like /A, /B, /C etc., but not necessarily so; names like /G01 are not
> uncommon.
Well, I thought that MacRoman encoding had been used, but I couldn't just
"remap" it using some sort of translation table.
I really hoped there would be an easier way to do that! I read through
the PDF Reference again and found out how to get the font and, with it,
the encoding used for the text fragment in question. So I would handle
the streams like this:
1) Extract the text fragments enclosed by BT and ET.
2) Check for a line ending with "Tf", the font operator. Its operand is
a sort of alias for the font in question. (Which leaves another question:
e.g. in the line "/F2 1 Tf", "F2" is the alias, but what is the "1" for?
In my samples it is always a 1. I can't find any reference for that.)
3) Get the page objects ("/Type /Page"). Each should hold a subsection
"/Font" resolving the aliases as links to the objects holding the Font
object ("/Type /Font"). Either that or it has another link ("/Resources")
pointing to the object with the needed "/Font" info.
4) Get the Font object in question and look for the "/Encoding" parameter,
if any, which tells us the character set used. According to the reference
it can be /StandardEncoding, /MacRomanEncoding or /WinAnsiEncoding (I guess
the others are less important). It also usually holds a link to the
"/FontDescriptor" object.
5) Get the "/FontDescriptor" object and look for an entry "/Differences"
for changes in the charmap (usually referred to as cmap). Either it is
directly in there or in another linked object. It can hold information
on whole character ranges or single characters and their Unicode equivalent.

OK, now I have the information I need to "remap" the charset to Unicode.
But now I have another problem: how do I do that? :) OK, I guess I'll
need some sort of translation table holding the values I need to replace
to get a Unicode-encoded string. But as soon as I have "unzipped" the
FlateDecode stream, isn't the content already set to a specific charmap?
Would it be a solution to "pack" it into another binary string and
replace the characters there?
> 
> ...
> The only way to really extract text from a PDF file is to fully
> interpret it, and take account of the Encoding. Even then I'm afraid
> that there are files which will defeat you. While it's possible to follow
> the Encoding array and extract the glyph name for a given text
> character, it's only possible to know what that character is by
> understanding the glyph name (e.g. /space). If the font uses non-standard
> glyph names (such as /G01) then you have no way to know what the
> character actually is. Well, short of OCR anyway....
>
> Even this is an over-simplification; things get still more complicated
> when dealing with TrueType fonts, and with CIDFonts.
Well, that is what I feared. The PDF standard is really complicated. I
thought it would be easy to just extract the text parts, and it was
(more or less). But the further I get, the more problems arise. And a
text where 10% has the wrong character codes leaves me unsatisfied.
> 
> Is this a commercial project, or a spare-time hobby? If it's commercial,
> then there are tools which will do this for you. Even if it's just a
> hobby, there are freeware tools which can do some or all of what you are
> attempting. I would suggest looking at www.pdfzone.com to see if
> anything there will suit your needs. GhostScript is especially useful,
> as is xpdf.
I'm making this for a CMS, to build a search engine capable of searching
within downloadable PDFs as well. And as I'm self-employed this is partly
commercial, because I need it for a customer's project and for myself as
well for future use.
The thing is that most of the time you don't have the opportunity to
install a program or module on another server. But you almost always can
use PHP. So I searched for PHP code that would do the job but couldn't
find anything suitable. (I found two, but one didn't work at all and the
other one has the same problem.) I already thought about making this a new
pet project, for I guess that others might be interested in such a parser
as well.

Anyhow thanks Ken for your help! It is much appreciated. :)

ChK
Reply chKostal 4/23/2005 6:57:01 PM

chKostal <plauderer@gmx.at> wrote:

>> You need to apply the Encoding in use with the font at the time. I know
>> it looks like the text in the strings is ASCII, but that's a false
>> impression. The outlines in a font are stored by *name*, usually names
>> like /A, /B, /C etc., but not necessarily so; names like /G01 are not
>> uncommon.
>Well, I thought that MacRoman encoding had been used, but I couldn't just
>"remap" it using some sort of translation table.
>I really hoped there would be an easier way to do that!

If you think text extraction is easy in any way, you just haven't read
enough of the PDF reference yet!

>1) Extract the text fragments enclosed by BT and ET.
>2) Check for a line ending with "Tf", the font operator. Its operand is
>a sort of alias for the font in question. (Which leaves another question:
>e.g. in the line "/F2 1 Tf", "F2" is the alias, but what is the "1" for?
>In my samples it is always a 1. I can't find any reference for that.)

No, you must parse for the operator, but it need not be at the end of
a line.

The PDF Reference should tell you what both operands in Tf are. 

Tf persists from one text group to the next, and may appear outside a
text group.

Tf must honour q/Q as it forms part of the graphics state.
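These three rules can be sketched as a small state machine in Python. This is a simplified illustration, not a full content-stream parser: the tokenizer ignores nested parentheses, hex strings, comments and inline images, and only the Tj operator is handled (TJ, ' and " are left out). The name `strings_with_fonts` is made up. For reference, the second operand of Tf is the font size, which answers the question about the "1" in "/F2 1 Tf".

```python
import re

# Simplified tokens: literal strings (no nesting), names, array brackets,
# and any other run of characters (numbers and operators).
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|/[^\s/()<>\[\]]*|[\[\]]|[^\s/()<>\[\]]+")
OPERAND_START = b"/([]+-.0123456789"


def strings_with_fonts(content: bytes):
    """Return (font_alias, font_size, string_token) for every Tj operator."""
    font = None        # current (alias, size); part of the graphics state
    saved = []         # stack maintained by q / Q
    operands = []      # operands waiting for the next operator
    shown = []
    for match in TOKEN.finditer(content):
        token = match.group()
        if token[:1] in OPERAND_START:
            operands.append(token)
            continue
        if token == b"q":
            saved.append(font)                  # save graphics state
        elif token == b"Q" and saved:
            font = saved.pop()                  # restore graphics state
        elif token == b"Tf" and len(operands) >= 2:
            alias = operands[-2][1:].decode("ascii")
            font = (alias, float(operands[-1].decode("ascii")))
        elif token == b"Tj" and operands:
            alias, size = font if font else (None, None)
            shown.append((alias, size, operands[-1]))
        operands = []                           # every operator consumes its operands
    return shown


stream = (b"q /F1 12 Tf BT (A) Tj ET Q "
          b"/F2 1 Tf BT (B) Tj ET BT (C) Tj ET "
          b"q /F3 8 Tf Q BT (D) Tj ET")
for entry in strings_with_fonts(stream):
    print(entry)
```

In this sample, (C) inherits F2 from the previous text group, and (D) is shown in F2 again because Q discards the F3 selection made after q.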

>3) Get the objects ("/Type /Page"). It should hold a subsection "/Font" 
>resolving the aliases as a link to the objects holding the Font-object 
>("/Type /Font"). Either that or it has another link ("/Resources") 
>linking to the object with the needed "/Font" info.

No, there is no choice. In the page object will be a Resources
dictionary, which will contain a Font dictionary, which will contain a
font object.

Or it may be inherited.

Or the font reference may be in a form XObject, pattern or Type 3 font,
each of which has its own resources.
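The lookup chain for the page case, including inheritance, could look like the following Python sketch. Plain dictionaries stand in for already-parsed PDF objects with indirect references resolved; `find_resources` and `resolve_font` are hypothetical helper names. Form XObjects, patterns and Type 3 fonts are not covered.

```python
def find_resources(page: dict) -> dict:
    """Return the page's Resources, walking up /Parent when it is inherited."""
    node = page
    seen = set()                      # guard against a malformed parent loop
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        if "Resources" in node:
            return node["Resources"]
        node = node.get("Parent")
    return {}


def resolve_font(page: dict, alias: str):
    """Map a font alias such as 'F2' to its font object, or None."""
    return find_resources(page).get("Font", {}).get(alias)


# A page tree node that supplies Resources for its children.
pages = {
    "Type": "Pages",
    "Resources": {"Font": {"F2": {"Type": "Font", "BaseFont": "Helvetica"}}},
}
inheriting_page = {"Type": "Page", "Parent": pages}
own_page = {
    "Type": "Page",
    "Parent": pages,
    "Resources": {"Font": {"F1": {"Type": "Font", "BaseFont": "Times-Roman"}}},
}

print(resolve_font(inheriting_page, "F2")["BaseFont"])  # Helvetica
print(resolve_font(own_page, "F1")["BaseFont"])         # Times-Roman
print(resolve_font(own_page, "F2"))                     # None
```

The last call returns None because a page with its own Resources dictionary does not also see the parent's one in this sketch.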

>4) Get the Font object in question and look for the "/Encoding" parameter,
>if any, which tells us the character set used. According to the reference
>it can be /StandardEncoding, /MacRomanEncoding or /WinAnsiEncoding (I guess
>the others are less important).

All are equally important, and in many cases there will be a
Differences array.

> It also usually holds a link to the 
>"/FontDescriptor" object.

Except in specific cases, yes.

>5) get the "/FontDescriptor" object and look for an entry "/Differences" 
>for changes in the charmap (usually referenced as cmap). 

No, Differences is part of the Encoding.

A CMap is, however, something else used for CID fonts. These are MUCH
more complicated again.

>directly in there or in another linked object. It can hold information 
>on whole char-ranges or single chars and their Unicode equivalent.

CID fonts again.
>
>Ok, now i have the information i need to "remap" the charset to Unicode. 
>But now i have another problem: How do i do that? 

Use the names in the encoding you have derived. You will need a
mapping of known names to Unicode.

Don't forget that the font might include a ToUnicode entry, that takes
precedence.
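A name-to-Unicode lookup might be sketched like this in Python. The table holds only a handful of real Adobe glyph names; a working tool would load Adobe's complete glyph list. The handling of "uniXXXX" and "uXXXX" names and of suffixes after a period follows the glyph naming conventions as I understand them, so treat those rules as an assumption to verify. ToUnicode handling is not shown.

```python
import re

# A few real glyph names; hypothetical stand-in for the full glyph list.
GLYPH_LIST = {
    "space": " ",
    "A": "A",
    "adieresis": "\u00e4",
    "odieresis": "\u00f6",
    "udieresis": "\u00fc",
    "germandbls": "\u00df",
}


def glyph_to_unicode(name: str):
    """Return the text for a glyph name, or None if the name is not understood."""
    base = name.split(".", 1)[0]                  # drop suffixes such as ".alt"
    if base in GLYPH_LIST:
        return GLYPH_LIST[base]
    match = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", base)
    if match:                                     # uniXXXX, possibly several code units
        digits = match.group(1)
        return "".join(chr(int(digits[i:i + 4], 16))
                       for i in range(0, len(digits), 4))
    match = re.fullmatch(r"u([0-9A-F]{4,6})", base)
    if match and int(match.group(1), 16) <= 0x10FFFF:
        return chr(int(match.group(1), 16))       # uXXXX .. uXXXXXX
    return None                                   # e.g. /G01: nothing to go on


for name in ("udieresis", "uni00DF", "A.alt", "G01"):
    print(name, repr(glyph_to_unicode(name)))
```

Names like /G01 return None here, which matches the point made earlier in the thread that such fonts cannot be decoded from names alone.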

Remember that a Font object might not include an Encoding. In this
case, you need to analyse the embedded font program to get its native
encoding.

>But as soon as I have "unzipped" the
>FlateDecode stream, isn't the content already set to a specific charmap?

Encoding, yes. For each string, possibly separate; a single page may
refer to many fonts, each with its own encoding.

>Would it be a solution to "pack" it into another binary string and 
>replacing the chars there?

No.
>> 
>> ...
>> The only way to really extract text from a PDF file is to fully
>> interpret it, and take account of the Encoding. Even then I'm afraid
>> that there are files which will defeat you. While it's possible to follow
>> the Encoding array and extract the glyph name for a given text
>> character, it's only possible to know what that character is by
>> understanding the glyph name (e.g. /space). If the font uses non-standard
>> glyph names (such as /G01) then you have no way to know what the
>> character actually is. Well, short of OCR anyway....
>>
>> Even this is an over-simplification; things get still more complicated
>> when dealing with TrueType fonts, and with CIDFonts.
>Well, that is what I feared. The PDF standard is really complicated. I
>thought it would be easy to just extract the text parts, and it was
>(more or less). But the further I get, the more problems arise. And a
>text where 10% has the wrong character codes leaves me unsatisfied.

You seem to be only a small way into the complexity of this problem;
to be brutally honest, you have been very lucky with your examples.

I would allow at least six months for me to create such a tool from
scratch, starting with a moderately deep knowledge of PDF.
----------------------------------------
Aandi Inston  quite@dial.pipex.com http://www.quite.com
Please support usenet! Post replies and follow-ups, don't e-mail them.

Reply quite 4/23/2005 7:10:57 PM

>>3) Get the objects ("/Type /Page"). It should hold a subsection "/Font" 
>>resolving the aliases as a link to the objects holding the Font-object 
>>("/Type /Font"). Either that or it has another link ("/Resources") 
>>linking to the object with the needed "/Font" info.
> 
> 
> No, there is no choice. In the page object will be a Resources
> dictionary, which will contain a Font dictionary, which will contain a
> font object.
> 
> Or it may be inherited.
How is that referenced?
> 
> Or the font reference may be a form XObject, pattern or type 3 font.
> 
> 
>>4) Get the Font object in question and look for the "/Encoding" parameter,
>>if any, which tells us the character set used. According to the reference
>>it can be /StandardEncoding, /MacRomanEncoding or /WinAnsiEncoding (I guess
>>the others are less important).
>
>
> All are equally important, and in many cases there will be a
> Differences array.
True, of course. I just assume that most of the documents I will have to
cope with will use one of those encodings.
> 
> 
>>It also usually holds a link to the 
>>"/FontDescriptor" object.
> 
> 
> Except in specific cases, yes.
> 
> 
>>5) get the "/FontDescriptor" object and look for an entry "/Differences" 
>>for changes in the charmap (usually referenced as cmap). 
> 
> 
> No, Differences is part of the Encoding.
I thought they describe the differences between the used 
encoding-standard and the actual encoding in the document.
> 
>>Ok, now i have the information i need to "remap" the charset to Unicode. 
>>But now i have another problem: How do i do that? 
> 
> Remember that a Font object might not include an Encoding. In this
> case, you need to analyse the embedded font program to get its native
> encoding.
Erm, I guess I will rather give up in that case. :)
> 
>>But as soon as I have "unzipped" the
>>FlateDecode stream, isn't the content already set to a specific charmap?
> 
> Encoding, yes. For each string, possibly separate; a single page may
> refer to many fonts, each with its own encoding.
Well, this question was aimed more at the actual PHP coding. I know
that the PDF's encoding might/will differ from page to page and from
text block to text block and so on. The question is, assuming that I
have correctly retrieved all the encoding info I need, how can I change
the text to the original encoding again after FlateDecoding the stream?
Or in other words: when decoding the stream, isn't the resulting string
already encoded with an OS-specific encoding? And if so, how can I change
that?
> 
> 
>>Would it be a solution to "pack" it into another binary string and 
>>replacing the chars there?
> 
> 
> No.
> 
> You seem to be only a small way into the complexity of this problem,
> to be brutally honest, you have been very lucky with your examples.
> 
> I would allow at least six months for me to create such a tool from
> scratch, starting with a moderately deep knowledge of PDF.
Well, as the topic is getting more and more complex, it also gets more
and more interesting for me. I like the challenge. :)

Thanks,
ChK
Reply chKostal 4/24/2005 6:30:39 PM

chKostal <plauderer@gmx.at> wrote:

>> Or it may be inherited.
>How is that referenced?

See Inheritance > Page attributes in the index.
>> 
>> All are equally important, and in many cases there will be a
>> Differences array.
>True, of course. I just assume that most of the documents I will have to
>cope with will use one of those encodings.

Almost all documents I see have a Differences array.
>> 

>> No, Differences is part of the Encoding.
>I thought they describe the differences between the used 
>encoding-standard and the actual encoding in the document.

They do, but they are stored in the Encoding dictionary.
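As a sketch of how a Differences array modifies a base encoding, in Python: a number in the array sets the next character code, and each name that follows is assigned to consecutive codes from there. `apply_differences` is an illustrative name, and the base encoding here is a nearly empty stand-in rather than a real MacRoman or WinAnsi table.

```python
def apply_differences(base, differences):
    """Return a copy of a 256-entry encoding with a Differences array applied."""
    encoding = list(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item                    # a number restarts the position
        else:
            if code is None:
                raise ValueError("Differences must start with a character code")
            encoding[code] = item          # a name fills the current position
            code += 1                      # following names use consecutive codes
    return encoding


# Stand-in base encoding: only /A is defined.
base = [None] * 256
base[0x41] = "A"

# Equivalent of /Differences [128 /Adieresis /Aring 159 /udieresis]
result = apply_differences(base, [128, "Adieresis", "Aring", 159, "udieresis"])
print(result[128], result[129], result[159], result[0x41])
```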

> The question is, assuming that I
>have correctly retrieved all the encoding info I need, how can I change
>the text to the original encoding again after FlateDecoding the stream?
>Or in other words: when decoding the stream, isn't the resulting string
>already encoded with an OS-specific encoding?

No, definitely not. Each string is encoded according to the font in
use for that particular string at that time. There may be many different
encodings in one page stream.

Conceptually: you need a table mapping Adobe's glyph names into your
target encoding. Then, parsing the PDF will give you, for each string
that is shown, the encoding in use. This encoding is an array of 256
names. You would take each byte from your string and map it to a
name, using the encoding from the PDF. Then you would search your
target encoding list for that name. Finally, you would use the entry
you found to get the target code for that byte.

That assumes the font has no ToUnicode map, and is a single byte font.
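The conceptual steps above can be sketched in Python as a byte-to-name-to-byte translation. Both tables below are small hand-made excerpts (a MacRoman-like source and a WinAnsi-like target) and the function names are illustrative; a real implementation would build the source table per font from its Encoding and Differences.

```python
def build_encoding(entries: dict) -> list:
    """Build a 256-entry encoding array from {code: glyph name}."""
    encoding = [None] * 256
    for code, name in entries.items():
        encoding[code] = name
    return encoding


# Excerpts only: encoding in use in the PDF, and the encoding wanted for the index.
SOURCE = build_encoding({0x41: "A", 0x8A: "adieresis", 0x9A: "odieresis",
                         0x9F: "udieresis", 0xA7: "germandbls"})
TARGET = build_encoding({0x41: "A", 0xE4: "adieresis", 0xF6: "odieresis",
                         0xFC: "udieresis", 0xDF: "germandbls"})


def transcode(data: bytes, source: list, target: list, missing: int = 0x3F) -> bytes:
    """Map each byte to a glyph name via `source`, then to a code via `target`."""
    name_to_code = {name: code for code, name in enumerate(target)
                    if name is not None}
    out = bytearray()
    for byte in data:
        name = source[byte]                          # byte -> glyph name
        out.append(name_to_code.get(name, missing))  # name -> target code, '?' if unknown
    return bytes(out)


converted = transcode(bytes([0x41, 0x9F, 0x8A, 0x9A, 0xA7]), SOURCE, TARGET)
print(converted.hex())  # 41fce4f6df
```

Going through glyph names rather than a fixed byte-to-byte table is what lets the same routine cope with a different encoding for every font on the page.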

> I like the challenge. :)

That's good. There are many interesting things to learn on the way.
----------------------------------------
Aandi Inston  quite@dial.pipex.com http://www.quite.com
Please support usenet! Post replies and follow-ups, don't e-mail them.

Reply quite 4/24/2005 6:52:11 PM
