Can't <pdf to text>
What's with these *.pdf files which can't do <pdf to text>?
eg. http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
Is the idea to prevent them being <copied>?
Or is it that a photo/pixel-grab of the paper was the source?
Is it that pdf & postscript render [to the VDU] a rectangle
of pixels: being the font of a single char/glyph/image;
and for a single char, the pixels are obtained from the
bit-map/font?
And for these problematic/un-decodable 'texts', it's
a full-page rectangle 'photo' of the original text?
Can someone recommend an OCR-utility for linux?
== TIA.
no.top.post (346)
8/12/2012 8:06:46 AM

On Sun, 12 Aug 2012 08:06:46 +0000, no.top.post wrote:
> Is the idea to prevent them being <copied>?
[...]
I don't think so in this case (though the same effect can be obtained
intentionally with Ghostscript).
In my opinion it is a matter of a badly encoded PDF.
In any case, remember that what one person has done, another can do too
(or undo).
> And for these problematic/un-decodable 'texts', it's a full-page
> rectangle 'photo' of the original text?
[...]
No. This PDF contains only vector data for its fonts.
> Can someone recommend an OCR-utility for linux?
[...]
For a while now, the free
*PDF-XChange Viewer*
has included OCR at no charge.
- Download *PDF-XChange Viewer* (Wine is needed to run it; I used Wine 1.3.13,
other builds were problematic)
http://www.tracker-software.com/product/downloads (core program)
http://www.tracker-software.com/pdf-xchange-viewer-ocr (language pack)
Now you can import your PDF into *PDF-XChange Viewer*; it will rasterize
each page and then perform the OCR.
I also produced a DjVu version with OCRed text (you can select and copy).
Evince can read DjVu, but there are many other DjVu readers (some in Java).
- http://ge.tt/2PNjStL/v/0
- http://ge.tt/2PNjStL/v/1 (tiff multipage raster version)
- http://ge.tt/2PNjStL/v/2 (rasterized pdf from multipage tiff)
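
For anyone who wants to try the same rasterize-then-OCR workflow without Wine, here is a minimal Python sketch. It is not what was used above: it assumes the `pdftoppm` tool (from poppler-utils) and the `tesseract` OCR engine are installed and on the PATH, and the function names are purely illustrative.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders every page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command that writes recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Extract the page number from names such as page-07.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize each page of pdf_path, OCR it, and return the joined text."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf_path, prefix, dpi), check=True)
        # Sort numerically so page 10 does not come before page 2.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(ocr_cmd(image, lang), check=True,
                                    capture_output=True, text=True)
            pages.append(result.stdout)
    return "\n".join(pages)
```

Calling `ocr_pdf("sun.clarion2005.pdf")` on a local copy of the file should return the recognized text as one string. Accuracy depends heavily on the resolution, so 300 dpi is only a starting guess.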
--
Puppy Linux wiki: http://dokupuppylinux.tk
Puppy Linux Forum: http://www.italianpuppy.org/
Windows me genuit, Ubuntu rapuere / tenet nunc Puppy Linux...
Dropbox invite, 2 GB free - http://db.tt/Vc6IeN4
elicona (24)
8/12/2012 9:12:18 AM

On Sunday, August 12, 2012 4:06:46 AM UTC-4, (unknown) wrote:
> What's with these *.pdf files which can't do <pdf to text>?
>
> eg. http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
> Is the idea to prevent them being <copied>?
Doubtful.
> Or is it that a photo/pixel-grab of the paper was the source?
No. The glyphs on the page are true glyphs; but the encoding
used is scrambled in such a way that the association between glyphs
and original ASCII text is lost.
The producer of this PDF is GNU Ghostscript 7.05, but the encoding
scrambling doubtless happened before that. The fonts are all Type 3.
From that and from the look of the paper my guess is that the
original producer was LaTeX.
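
One way to check the font types yourself is the `pdffonts` tool from poppler-utils, which prints a fixed-width table with one row per font. The Python sketch below parses that table; it assumes the usual layout (a header line followed by a rule of dashes), and the function names are illustrative only. Very long font names can push a row out of its columns, so treat the result as a rough guide.

```python
import re
import subprocess


def parse_pdffonts(report):
    """Parse pdffonts' fixed-width table into a list of dicts, one per font."""
    lines = report.splitlines()
    # The line of dashes under the header marks where each column sits.
    rule_index = next(
        (i for i, line in enumerate(lines) if line.startswith("---")), None)
    if rule_index is None or rule_index == 0:
        return []
    spans = [m.span() for m in re.finditer(r"-+", lines[rule_index])]
    header = lines[rule_index - 1]
    names = [header[start:end].strip() for start, end in spans]
    fonts = []
    for row in lines[rule_index + 1:]:
        if not row.strip():
            continue
        cells = [row[start:end].strip() for start, end in spans[:-1]]
        # Let the last column run to the end of the line.
        cells.append(row[spans[-1][0]:].strip())
        fonts.append(dict(zip(names, cells)))
    return fonts


def list_fonts(pdf_path):
    """Run pdffonts on pdf_path (the tool must be installed) and parse it."""
    result = subprocess.run(["pdffonts", str(pdf_path)], check=True,
                            capture_output=True, text=True)
    return parse_pdffonts(result.stdout)


def all_type3(fonts):
    """Return True when every listed font is reported as Type 3."""
    return bool(fonts) and all(f.get("type") == "Type 3" for f in fonts)
```

If `all_type3(list_fonts("sun.clarion2005.pdf"))` is true for a local copy, that is consistent with the observation above, though it does not by itself say where the encoding was lost.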
Others with more experience here can tell you whether it's
practically possible to recover the original text. It's certainly
theoretically possible -- it would be even less difficult than
a decryption, since you can read the text on the screen as a human
and therefore rebuild the encoding yourself.
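
To make "rebuild the encoding yourself" concrete, here is a minimal Python sketch. It assumes the scrambling is a consistent one-to-one character substitution, which may well not hold for this file (each Type 3 font could carry its own encoding). You read a short run of text on screen, pair it character by character with what the extractor gives you, and reuse the resulting table on the rest. The function names and sample strings are made up for illustration.

```python
def build_mapping(garbled, plain):
    """Learn a garbled-to-plain character table from an aligned sample pair."""
    if len(garbled) != len(plain):
        raise ValueError("samples must align character for character")
    mapping = {}
    for g, p in zip(garbled, plain):
        # setdefault keeps the first pairing; a later conflict is an error.
        if mapping.setdefault(g, p) != p:
            raise ValueError(f"{g!r} maps to both {mapping[g]!r} and {p!r}")
    return mapping


def decode(text, mapping, unknown="?"):
    """Translate extracted text, marking characters not yet in the table."""
    return "".join(mapping.get(ch, unknown) for ch in text)


# Hypothetical sample: the extractor returned "{#*" where the page shows "hat".
table = build_mapping("{#*", "hat")
print(decode("*{#*", table))  # prints "that"
print(decode("*{#!", table))  # prints "tha?" because "!" is not in the table
```

The conflict check is the useful part: if the same extracted character turns up for two different letters, the simple-substitution assumption is wrong for that font and you need one table per font, or OCR.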
rpresser (225)
8/13/2012 4:16:08 AM

On Sun, 12 Aug 2012 08:06:46 +0000 (UTC), no.top.post@gmail.com wrote:
> What's with these *.pdf files which can't do <pdf to text>?
> eg. http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
>
> Is the idea to prevent them being <copied>?
>
> Or is it that a photo/pixel-grab of the paper was the source?
>
> Is it that pdf & postscript render [to the VDU] a rectangle
> of pixels: being the font of a single char/glyph/image;
> and for a single char, the pixels are obtained from the
> bit-map/font?
>
> And for these problematic/un-decodable 'texts', it's
> a full-page rectangle 'photo' of the original text?
>
> Can someone recommend an OCR-utility for linux?
>
> == TIA.
Here's a 1-pixel grab, of the letter "h", from the pdf you cite:
-- { --
and now the line fragment, "It also has a dual representational structure":
-- ¨spzl4{j&pzl
p&+pzbuBtvq&uBtwlBtwxysBprsBk~{x+pz(lBs4uo&nsBuBt --
so whatever is going on, what you see is *not* just graphics, it's
characters, but they're encoded pretty oddly, by my lights.
'Zat help at all? Cheers, -- tlvp
--
Avant de repondre, jeter la poubelle, SVP.
mPiOsUcB.EtLlLvEp (145)
8/13/2012 5:48:46 AM

On Sun, 12 Aug 2012 08:06:46 +0000, no.top.post wrote:
> What's with these *.pdf files which can't do <pdf to text>? eg.
> http://www.cogsci.rpi.edu/~rsun/sun.clarion2005.pdf
>
> Is the idea to prevent them being <copied>?
>
> Or is it that a photo/pixel-grab of the paper was the source?
>
> Is it that pdf & postscript render [to the VDU] a rectangle of pixels:
> being the font of a single char/glyph/image;
> and for a single char, the pixels are obtained from the bit-map/font?
>
> And for these problematic/un-decodable 'texts', it's a full-page
> rectangle 'photo' of the original text?
>
> Can someone recommend an OCR-utility for linux?
>
> == TIA.
I notice that Acrobat Reader 9 for Linux can't even open the file.
Acrobat Reader X for Windows opens and displays it without issue, but
copy and paste from the document gives apparent gibberish, just as pdftotext does.
joebeanfish (188)
8/13/2012 2:13:10 PM