Differentiating image+text from PDF normal

Is there an easy way, either in Acrobat or via a software utility or
plugin, to determine whether a PDF file was created as image+text, as
opposed to PDF normal, without having to visually inspect the pages? I
have a large number of PDFs that I need to sort by type.

Thank You.
mikegross
9/23/2004 4:59:39 PM
comp.text.pdf

Mike Gross wrote:
> Is there an easy way either in Acrobat, or via a s/w utility or
> plugin, to determine if a pdf file was done as image+text, as opposed
> to pdf normal (without having to visually inspect the pages).  I have
> a large number of pdfs that I need to differentiate between the types.

Do I understand correctly that you mean:

"image+text": PDFs typically created by Adobe Capture with large scans 
(typically black and white) embedded into the PDF and text behind the 
scan to support indexing and searching.

"pdf normal": all the rest, typically documents with much text, vector 
graphics, several small- and mid-size raster images, very few big raster 
images

You would need some tool to get metadata from pdfs. I recommend using 
tool.pdf.Info
	http://multivalent.sourceforge.net/Tools/pdf/Info.html

from Multivalent
	http://multivalent.sourceforge.net/

Features to check
------------------

The Producer and Creator fields might serve as an indicator. Search for 
"Adobe Capture".
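
A minimal sketch of that metadata check in Python, assuming the tool output
has already been captured as a list of "Key: value" lines like the ones
shown further down. The function name and the marker string are my own
assumptions. The default marker is just "capture", matched case-insensitively,
because the exact Producer string may differ between versions; adjust it
to whatever your files actually contain.

```python
def looks_like_capture(info_lines, marker="capture"):
    """Return True if the Producer or Creator field mentions the marker.

    info_lines: lines such as "Producer: Acrobat Distiller 6.0 for Macintosh".
    marker:     substring to look for (an assumption, compared case-insensitively).
    """
    for line in info_lines:
        # Split at the first colon only, since values may contain colons too.
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("Producer", "Creator"):
            if marker.lower() in value.lower():
                return True
    return False
```

This only looks at two fields, so treat a hit as a hint rather than proof;
scanned PDFs from other software will not match.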

Or check the number and size of images in relation to the total number 
of pages, and use a classifier such as: if there are about as many 
large images (larger than 600x800) as pages (page count), then assume it 
is an "image+text" type document; otherwise it is a "normal" one.

All this can be automated in a couple of lines of Perl, Python or 
whatever language you are most comfortable with.
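
As a rough sketch of such a classifier in Python: it parses output lines in
the format shown in the test runs of this thread ("Page count: N" and
"image: object N, WxH, ..."). The function names, the reading of "larger
than 600x800" as "short side >= 600 and long side >= 800", and the 0.9 ratio
for "about as many" are all my assumptions, not anything the tool defines.

```python
import re

# Patterns for the two kinds of lines we care about.
IMAGE_RE = re.compile(r"^image: object \d+, (\d+)x(\d+),")
PAGES_RE = re.compile(r"^Page count: (\d+)")


def count_large_images(image_lines, short_side=600, long_side=800):
    """Count images at least short_side x long_side, in either orientation."""
    count = 0
    for line in image_lines:
        match = IMAGE_RE.match(line)
        if not match:
            continue  # skips the "Filename:" line and anything unexpected
        width, height = int(match.group(1)), int(match.group(2))
        if min(width, height) >= short_side and max(width, height) >= long_side:
            count += 1
    return count


def page_count(info_lines):
    """Return the page count from the info output, or None if it is missing."""
    for line in info_lines:
        match = PAGES_RE.match(line)
        if match:
            return int(match.group(1))
    return None


def classify(pages, large_images, ratio=0.9):
    """'image+text' if there are about as many large images as pages."""
    if pages and large_images >= ratio * pages:
        return "image+text"
    return "normal"
```

Feeding it the captured output of the two commands per file and printing the
result next to the filename should be enough to sort a directory; the
thresholds will probably need tuning on real data.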

------------------------

Here is a test with one version of the 911 Report (a "normal" PDF, of 
course) to show you how it would work.

mvinfo is my wrapper around "java -jar Multivalent.jar tool.pdf.Info".

$ mvinfo 911Report.pdf | grep ^Producer
Producer: Acrobat Distiller 6.0 for Macintosh

$ mvinfo 911Report.pdf | grep ^Creator
Creator: QuarkXPress: pictwpstops filter 1.0

$ mvinfo 911Report.pdf | grep ^Page
Page count: 585

$ mvinfo -image 911Report.pdf
Filename: 911Report.pdf
image: object 208, 3276x666, CCITTFaxDecode, length = 5687
image: object 209, 5976x1224, CCITTFaxDecode, length = 8757
image: object 210, 2760x1758, CCITTFaxDecode, length = 8804
image: object 211, 4478x704, CCITTFaxDecode, length = 4274
image: object 212, 3734x1286, CCITTFaxDecode, length = 5838
image: object 213, 5293x1526, CCITTFaxDecode, length = 9832
image: object 214, 4368x1616, CCITTFaxDecode, length = 8242
image: object 215, 4430x879, CCITTFaxDecode, length = 6656
image: object 216, 3228x636, CCITTFaxDecode, length = 4638
image: object 217, 2310x1002, CCITTFaxDecode, length = 4757
image: object 297, 288x238, DCTDecode, length = 10823
image: object 298, 309x225, DCTDecode, length = 9080
image: object 350, 157x124, DCTDecode, length = 3184
image: object 351, 161x118, DCTDecode, length = 2886
image: object 355, 156x112, DCTDecode, length = 2943
image: object 356, 154x111, DCTDecode, length = 3597
image: object 405, 313x285, DCTDecode, length = 12551
image: object 451, 288x304, DCTDecode, length = 22077
image: object 704, 313x237, DCTDecode, length = 10909
image: object 975, 86x108, DCTDecode, length = 2353
image: object 976, 86x105, DCTDecode, length = 2263
image: object 977, 85x107, DCTDecode, length = 1815
image: object 978, 86x110, DCTDecode, length = 2373
image: object 979, 86x108, DCTDecode, length = 2298
image: object 980, 85x108, DCTDecode, length = 2008
image: object 981, 85x110, DCTDecode, length = 2192
image: object 982, 86x108, DCTDecode, length = 2073
image: object 983, 86x108, DCTDecode, length = 2209
image: object 984, 84x107, DCTDecode, length = 2076
image: object 988, 85x108, DCTDecode, length = 1945
image: object 989, 84x107, DCTDecode, length = 2110
image: object 990, 84x106, DCTDecode, length = 2303
image: object 991, 86x104, DCTDecode, length = 2332
image: object 992, 86x108, DCTDecode, length = 2256
image: object 993, 84x107, DCTDecode, length = 2515
image: object 994, 85x108, DCTDecode, length = 2103
image: object 995, 86x112, DCTDecode, length = 2124
image: object 996, 87x110, DCTDecode, length = 2319
image: object 1117, 313x243, DCTDecode, length = 14200
image: object 1133, 313x307, DCTDecode, length = 10987
image: object 1146, 313x324, DCTDecode, length = 17395
image: object 1219, 313x463, DCTDecode, length = 20489
image: object 1223, 313x208, DCTDecode, length = 16225
image: object 1224, 313x223, DCTDecode, length = 11537

----------------

Classification:
Only 10 large images (the first 10 entries in the list) in a 585-page 
document -> "normal" PDF.

I could not quickly find a PDF made with "Adobe Capture" to check.

The classifier probably needs some fine-tuning, but all in all this 
procedure looks likely to produce acceptable results.

Ralf

-- 
Ralf Koenig
Research Associate at the
Chair of Computer Networks and Distributed Systems
TU Chemnitz, Room 1/B320, Tel. 0371-531-1532

Ralf
9/24/2004 8:30:00 AM
Ralf Koenig wrote:
> Mike Gross wrote:
> 
>> Is there an easy way either in Acrobat, or via a s/w utility or
>> plugin, to determine if a pdf file was done as image+text, as opposed
>> to pdf normal (without having to visually inspect the pages).  I have
>> a large number of pdfs that I need to differentiate between the types.


> [...]
> I did not quickly find any "Adobe Capture" made PDF, that I could check.

I've just found a typical example of those scanned PDFs in the PDF 
database administered by Michael Still.

http://www.stillhq.com/pdfdb/000004/info.html

$ mvinfo data.pdf
Filename: data.pdf
Title: 199479714E
Author: IP Australia
Subject: AU-B Australian Documents
Keywords: Patents
Creator: HPA image bureau 1998-1999
Producer: HPA image bureau 1998-1999
Created: Fri Jan 01 00:00:00 CET 1999
Modified: Thu May 27 11:39:06 CEST 1999
Page count: 68
PDF version: 1.2
Linearized: version 1

$ mvinfo -image data.pdf
Filename: data.pdf
image: object 5, 2496x3584, CCITTFaxDecode, length = 53004
image: object 10, 2496x3584, CCITTFaxDecode, length = 50392
image: object 15, 2496x3584, CCITTFaxDecode, length = 43658
image: object 20, 2496x3584, CCITTFaxDecode, length = 52206
image: object 25, 2496x3584, CCITTFaxDecode, length = 50735
image: object 30, 2496x3584, CCITTFaxDecode, length = 50252
image: object 35, 2496x3584, CCITTFaxDecode, length = 50264
image: object 40, 2496x3584, CCITTFaxDecode, length = 48389
.... [many more similar lines omitted]

$ mvinfo -image data.pdf | wc -l
      69

So, this fits quite well: 68 big images in a 68-page document (the 
first line counted by wc was the "Filename" line), all encoded 
with a compression algorithm well suited to b/w images.

Strictly speaking, it is not an "image+text" document, as the text behind 
the images is missing (look at the content streams, they all look similar to

q
	600 0 0 861 0 0 cm
	/Im1 Do
Q

referring only to the scans, with no annotations), but I think you would 
still put it into that category.
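
If you want to tell a pure scan from a real "image+text" file, one option is
to look for text objects in the page content streams. The Python sketch
below assumes you already have a decompressed content stream as a string
(getting it out of the PDF is not shown), and the function name is my own.
The tokenizing is deliberately naive: it splits on whitespace, so a "BT"
inside a string literal or inline image would fool it, and "Do" also paints
form XObjects, not only images.

```python
def summarize_content_stream(stream):
    """Report whether a decoded content stream has text objects and XObject draws.

    BT opens a text object in PDF content streams; Do paints an XObject
    (usually the scanned image in this kind of file).
    """
    tokens = stream.split()
    return {
        "has_text": "BT" in tokens,
        "xobject_draws": tokens.count("Do"),
    }
```

For a stream like the one above, this reports one XObject draw and no text,
which is what a scan without hidden text looks like.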

Ralf

-- 
Ralf Koenig
Research Associate at the
Chair of Computer Networks and Distributed Systems
TU Chemnitz, Room 1/B320, Tel. 0371-531-1532

Ralf
9/24/2004 8:52:54 AM