Good evening, everyone,
Could someone be kind enough to explain to me why this PDF file
(http://dl.free.fr/ambGjrH7D) produces very strange output (with no
error code) through pdftotext (from poppler-utils on Debian lenny)?
<strange_output_excerpt>
!"
!" # "
!#$
%
!" & "
' ( )
#
</strange_output_excerpt>
Thanks in advance
--
Vincent
Posted by pat143 on 11/17/2009 9:20:09 PM
On 2009-11-17, LoZ <pat@chau.de> wrote:
> Good evening, everyone,
>
> Could someone be kind enough to explain to me why this PDF file
> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
> error code) through pdftotext (from poppler-utils on Debian lenny)?
>
> <strange_output_excerpt>
> !"
>
> !" # "
> ...
I can't get that download to work, but typically that sort of output
means that there's no valid encoding information for the font(s) in
the PDF file.
For more details, see:
http://www.glyphandcog.com/textext.html
- Derek
Posted by Derek on 11/17/2009 10:35:21 PM
Derek B. Noonburg wrote:
> On 2009-11-17, LoZ <pat@chau.de> wrote:
>> Good evening, everyone,
>>
>> Could someone be kind enough to explain to me why this PDF file
>> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
>> error code) through pdftotext (from poppler-utils on Debian lenny)?
>>
>> <strange_output_excerpt>
>> !"
>>
>> !" # "
>> ...
>
> I can't get that download to work,
Try this one: wget http://www.cijoint.fr/cj200911/cijXki3VKL.pdf
> but typically that sort of output
> means that there's no valid encoding information for the font(s) in
> the PDF file.
Thanks for your response and for the information at
http://www.glyphandcog.com/textext.html
If the problem is "unable to find information for the font", is there
a way for pdftotext to return a message or an exit code in that situation?
--
Vincent
Posted by LoZ on 11/18/2009 7:31:11 AM
On 2009-11-18, LoZ <pat@chau.de> wrote:
> Derek B. Noonburg wrote:
>> On 2009-11-17, LoZ <pat@chau.de> wrote:
>>> Good evening, everyone,
>>>
>>> Could someone be kind enough to explain to me why this PDF file
>>> (http://dl.free.fr/ambGjrH7D) produces very strange output (with no
>>> error code) through pdftotext (from poppler-utils on Debian lenny)?
>>>
>>> <strange_output_excerpt>
>>> !"
>>>
>>> !" # "
>>> ...
>>
>> I can't get that download to work,
> Try this one: wget http://www.cijoint.fr/cj200911/cijXki3VKL.pdf
>
>> but typically that sort of output
>> means that there's no valid encoding information for the font(s) in
>> the PDF file.
> Thanks for your response and for the information at
> http://www.glyphandcog.com/textext.html
>
> If the problem is "unable to find information for the font", is there
> a way for pdftotext to return a message or an exit code in that situation?
Not really. I downloaded that PDF file, and in this case, the PDF
font objects contain no Encoding key at all. With TrueType fonts,
that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
but those two are close enough, at least for the 7-bit ASCII part).
But in this case, the fonts are subsets and are not using an
ASCII-based encoding. There's really no way for pdftotext to tell. A
very similar PDF file could generate valid extracted text.
- Derek
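
For readers who want to inspect the fonts of many files: the pdffonts tool that ships alongside pdftotext prints one row per font, with emb, sub and uni columns followed by the object ID. Below is a minimal sketch, assuming that column layout and assuming that a subset font without a ToUnicode map (uni = no) is worth flagging. The function names are hypothetical, and as Derek notes, a flagged file may still extract correctly, so this is only a hint.

```python
import subprocess


def parse_pdffonts(output):
    """Parse pdffonts text output into a list of dicts.

    Assumption: each font row ends with 'emb sub uni <obj> <gen>', so the
    columns are read from the right; header and dash lines are skipped
    because they do not carry yes/no values in those positions.
    """
    fonts = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 6:
            continue
        emb, sub, uni = parts[-5:-2]
        if not all(flag in ("yes", "no") for flag in (emb, sub, uni)):
            continue
        fonts.append({
            "name": parts[0],  # first token only; names with spaces are cut
            "embedded": emb == "yes",
            "subset": sub == "yes",
            "to_unicode": uni == "yes",
        })
    return fonts


def risky_fonts(fonts):
    """Names of subset fonts that have no ToUnicode map (heuristic only)."""
    return [f["name"] for f in fonts if f["subset"] and not f["to_unicode"]]


def run_pdffonts(path):
    """Run pdffonts on one file and return its stdout (requires pdffonts)."""
    result = subprocess.run(["pdffonts", path], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

Typical use would be `risky_fonts(parse_pdffonts(run_pdffonts("file.pdf")))`, treating a non-empty result as "check this file by hand".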
Posted by Derek on 11/18/2009 10:05:30 PM
Derek B. Noonburg wrote:
> Not really. I downloaded that PDF file, and in this case, the PDF
> font objects contain no Encoding key at all. With TrueType fonts,
> that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
> but those two are close enough, at least for the 7-bit ASCII part).
> But in this case, the fonts are subsets and are not using an
> ASCII-based encoding. There's really no way for pdftotext to tell. A
> very similar PDF file could generate valid extracted text.
OK, I understand.
I use pdftotext as the first step in indexing nearly 2000 PDF files, so I
would really appreciate a way to know when the PDF-to-text conversion
fails with that "strange output".
pdfinfo can output the "creator" of the PDF file. Are some of these
programs (or versions of them) known to produce PDFs with subset
fonts, or PDFs without an Encoding key?
Another question: is there a way to have pdftotext return an error code
that says "there is no Encoding key at all"?
Thanks in advance.
--
Vincent
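
Since pdftotext exits without an error code in this situation, one workaround for a batch-indexing job is to judge the extracted text itself. Below is a minimal sketch of such a check, assuming that garbled output looks like the excerpt at the top of this thread (mostly punctuation). The function names and the 0.5 threshold are hypothetical choices for illustration, not part of pdftotext, and garbling that maps glyphs to the wrong letters would not be caught.

```python
import subprocess


def letter_ratio(text):
    """Share of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0  # nothing extracted at all
    return sum(ch.isalnum() for ch in visible) / len(visible)


def looks_garbled(text, threshold=0.5):
    """Heuristic: flag text whose letter/digit share is below the threshold.

    The threshold is an assumed value; tune it on files known to be good.
    Empty output is flagged as well.
    """
    return letter_ratio(text) < threshold


def extract_text(path):
    """Run pdftotext and return the text it writes to stdout ('-')."""
    result = subprocess.run(["pdftotext", path, "-"], capture_output=True,
                            text=True, encoding="utf-8", errors="replace",
                            check=True)
    return result.stdout
```

In an indexing script, `looks_garbled(extract_text(path))` could be used to set suspicious files aside for manual review instead of indexing them.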
Posted by LoZ on 11/30/2009 7:37:39 PM
LoZ wrote:
> Derek B. Noonburg wrote:
>> Not really. I downloaded that PDF file, and in this case, the PDF
>> font objects contain no Encoding key at all. With TrueType fonts,
>> that typically indicates use of WinAnsiEncoding (or MacRomanEncoding -
>> but those two are close enough, at least for the 7-bit ASCII part).
>> But in this case, the fonts are subsets and are not using an
>> ASCII-based encoding. There's really no way for pdftotext to tell. A
>> very similar PDF file could generate valid extracted text.
>
> OK, I understand.
>
> I use pdftotext as the first step in indexing nearly 2000 PDF files, so I
> would really appreciate a way to know when the PDF-to-text conversion
> fails with that "strange output".
>
> pdfinfo can output the "creator" of the PDF file. Are some of these
> programs (or versions of them) known to produce PDFs with subset
> fonts, or PDFs without an Encoding key?
>
> Another question: is there a way to have pdftotext return an error code
> that says "there is no Encoding key at all"?
Doesn't pdfinfo provide this?
///Peter
Posted by Peter on 12/1/2009 11:17:26 PM
Peter Flynn wrote:
> LoZ wrote:
>> Another question: is there a way to have pdftotext return an error code
>> that says "there is no Encoding key at all"?
>
> Doesn't pdfinfo provide this?
As far as I know, no.
Posted by LoZ on 12/7/2009 2:35:41 PM