Strange output through pdftotext

Good evening everyone,

Could someone kindly explain why this PDF file
(http://dl.free.fr/ambGjrH7D) produces very strange output (with no
error code) from pdftotext (from poppler-utils on Debian lenny)?

<strange_output_excerpt>
!"

!" # "

!#$

%

!" & "

' ( )
#
</strange_output_excerpt>

Thanks in advance.

--
Vincent
Reply pat143 (1) 11/17/2009 9:20:09 PM

On 2009-11-17, LoZ <pat@chau.de> wrote:
> Good evening everyone,
>
> Could someone kindly explain why this PDF file
> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
> error code) from pdftotext (from poppler-utils on Debian lenny)?
>
><strange_output_excerpt>
> !"
>
> !" # "
> ...

I can't get that download to work, but typically that sort of output
means that there's no valid encoding information for the font(s) in
the PDF file.

For more details, see:

    http://www.glyphandcog.com/textext.html

- Derek
Reply Derek 11/17/2009 10:35:21 PM


Derek B. Noonburg wrote:
> On 2009-11-17, LoZ <pat@chau.de> wrote:
>> Good evening everyone,
>>
>> Could someone kindly explain why this PDF file
>> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
>> error code) from pdftotext (from poppler-utils on Debian lenny)?
>>
>> <strange_output_excerpt>
>> !"
>>
>> !" # "
>> ...
> 
> I can't get that download to work, 
Try this one: wget http://www.cijoint.fr/cj200911/cijXki3VKL.pdf

> but typically that sort of output
> means that there's no valid encoding information for the font(s) in
> the PDF file.
Thanks for your reply and for the information at
http://www.glyphandcog.com/textext.html

If the problem is "unable to find information from the font", is there
a way to make pdftotext return a message or an exit code in that situation?

-- 
Vincent
Reply LoZ 11/18/2009 7:31:11 AM

On 2009-11-18, LoZ <pat@chau.de> wrote:
> Derek B. Noonburg wrote:
>> On 2009-11-17, LoZ <pat@chau.de> wrote:
>>> Good evening everyone,
>>>
>>> Could someone kindly explain why this PDF file
>>> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
>>> error code) from pdftotext (from poppler-utils on Debian lenny)?
>>>
>>> <strange_output_excerpt>
>>> !"
>>>
>>> !" # "
>>> ...
>> 
>> I can't get that download to work, 
> Try this one: wget http://www.cijoint.fr/cj200911/cijXki3VKL.pdf
>
>> but typically that sort of output
>> means that there's no valid encoding information for the font(s) in
>> the PDF file.
> Thanks for your reply and for the information at
> http://www.glyphandcog.com/textext.html
>
> If the problem is "unable to find information from the font", is there
> a way to make pdftotext return a message or an exit code in that situation?

Not really.  I downloaded that PDF file, and in this case, the PDF
font objects contain no Encoding key at all.  With TrueType fonts,
that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
but those two are close enough, at least for the 7-bit ASCII part).
But in this case, the fonts are subsets and are not using an
ASCII-based encoding.  There's really no way for pdftotext to tell.  A
very similar PDF file could generate valid extracted text.

- Derek
Reply Derek 11/18/2009 10:05:30 PM
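
One hedged way to look at the situation Derek describes is pdffonts, which ships alongside pdftotext and prints one row per font, including "sub" (subset) and "uni" (ToUnicode map present) columns. The Python sketch below parses that table and lists subset fonts that have no ToUnicode map. The function names are made up for illustration, and the column layout is assumed to match the usual pdffonts table (a header line followed by a ruler of dashes). A hit is only a hint: as noted above, a very similar file can still extract cleanly.

```python
import subprocess


def run_pdffonts(pdf_path):
    """Return the raw table printed by `pdffonts <file>` (requires pdffonts on PATH)."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_pdffonts(output):
    """Parse the pdffonts table into a list of dicts keyed by column header.

    Column boundaries are taken from the ruler line of dashes, so the parser
    works whether or not the installed version prints an "encoding" column.
    Assumption: no font name overflows its column.
    """
    lines = output.splitlines()
    ruler_idx = next(
        (i for i, line in enumerate(lines) if line.startswith("---")), None
    )
    if ruler_idx is None or ruler_idx == 0:
        return []
    header, ruler = lines[ruler_idx - 1], lines[ruler_idx]

    # Each run of dashes in the ruler marks one column span.
    spans, start = [], None
    for i, ch in enumerate(ruler + " "):
        if ch == "-" and start is None:
            start = i
        elif ch != "-" and start is not None:
            spans.append((start, i))
            start = None

    names = [header[a:b].strip() for a, b in spans]
    rows = []
    for line in lines[ruler_idx + 1:]:
        if line.strip():
            rows.append({n: line[a:b].strip() for n, (a, b) in zip(names, spans)})
    return rows


def suspect_fonts(rows):
    """Names of subset fonts without a ToUnicode map (a hint, not a verdict)."""
    return [r["name"] for r in rows if r.get("sub") == "yes" and r.get("uni") == "no"]
```

Typical use would be `suspect_fonts(parse_pdffonts(run_pdffonts("file.pdf")))`, where a non-empty result marks the file for a closer look.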

Derek B. Noonburg wrote:
> Not really.  I downloaded that PDF file, and in this case, the PDF
> font objects contain no Encoding key at all.  With TrueType fonts,
> that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
> but those two are close enough, at least for the 7-bit ASCII part).
> But in this case, the fonts are subsets and are not using an
> ASCII-based encoding.  There's really no way for pdftotext to tell.  A
> very similar PDF file could generate valid extracted text.

OK, I understand.

I use pdftotext as the first step in indexing nearly 2000 PDF files, so I
would really appreciate a way to know when the PDF-to-text
conversion fails with that "strange output".

pdfinfo can output the "creator" of the PDF file. Are any of these
programs (or particular versions of them) known to produce PDFs with subset
fonts, or PDFs without an Encoding key?

Another question: is there a way to have pdftotext return an error code
meaning "there is no Encoding key at all"?

Thanks in advance.

-- 
Vincent
Reply LoZ 11/30/2009 7:37:39 PM
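
For batch indexing, one possible workaround is to inspect the extracted text itself rather than wait for an error code. A minimal Python sketch, assuming that garbled output like the excerpt at the top of the thread is dominated by punctuation: flag a file when too few of its visible characters are letters or digits. The 0.5 threshold and the function names are arbitrary illustrative choices, not anything pdftotext provides.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text ("-" sends output to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def alnum_ratio(text):
    """Share of non-whitespace characters that are letters or digits."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c.isalnum() for c in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Heuristic: True when the text is mostly punctuation or is empty."""
    return alnum_ratio(text) < threshold
```

In an indexing script, `looks_garbled(extract_text(path))` could route a file to manual review. Empty output is flagged too. The check can miss files whose wrongly mapped character codes happen to land on letters or digits, so it is a filter for obvious cases only.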

LoZ wrote:
> Derek B. Noonburg wrote:
>> Not really.  I downloaded that PDF file, and in this case, the PDF
>> font objects contain no Encoding key at all.  With TrueType fonts,
>> that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
>> but those two are close enough, at least for the 7-bit ASCII part).
>> But in this case, the fonts are subsets and are not using an
>> ASCII-based encoding.  There's really no way for pdftotext to tell.  A
>> very similar PDF file could generate valid extracted text.
> 
> OK, I understand.
>
> I use pdftotext as the first step in indexing nearly 2000 PDF files, so I
> would really appreciate a way to know when the PDF-to-text
> conversion fails with that "strange output".
>
> pdfinfo can output the "creator" of the PDF file. Are any of these
> programs (or particular versions of them) known to produce PDFs with subset
> fonts, or PDFs without an Encoding key?
>
> Another question: is there a way to have pdftotext return an error code
> meaning "there is no Encoding key at all"?

Doesn't pdfinfo provide this?

///Peter
Reply Peter 12/1/2009 11:17:26 PM

Peter Flynn wrote:
> LoZ wrote:
>> Another question: is there a way to have pdftotext return an error code
>> meaning "there is no Encoding key at all"?
> 
> Doesn't pdfinfo provide this?
As far as I know, no.
Reply LoZ 12/7/2009 2:35:41 PM
