How to determine the original file extension from which the pdf was created


Hi,

I have a question about PDFs. I would like to determine the original
source from which a PDF was created. For example, if the PDF was created
from a Word document, how would you check that? From my understanding,
you would need to check "File -> Document Properties -> Summary". Is
that correct? If so, which field would I need to look at, and what would
it contain to tell me that, for example, the PDF was created from a Word
document?

Now to the more difficult issue. I need to determine the original
source from which the PDFs were created, and there are additional
criteria that I need to meet.
1) I need to check the original source without actually opening the
file. Is there some scripting or programming language that I can use to
do this?
2) I need to determine the original source in order to tell whether the
PDF was created from a scanned document or from some other document,
e.g. Word. Are there fields in the PDF file that state that the
document was created from a scanned document? Or would the only way be
to guess: conclude that the original source is an image, so the PDF was
possibly created from a scanned document?

Please let me know your thoughts.

Thanks.

Reply pchulani (2) 3/2/2006 6:21:01 PM

I think this is going to be difficult unless you are only interested in
a few producers of PDFs, for example Word vs. a scanned image.

Open the PDF in a text editor (e.g. Notepad) and you will find that
much of this information is plain text and easily readable. For example,
if you create a PDF from Word with Distiller, you will find a text
string like:

/Creator(Acrobat PDFMaker 7.0.7 for Word)

So for Word, it is easy to just look for that string... but you would
need to know all the possible strings in your environment.
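
The string search described above can be sketched in Python without
opening the file in a viewer. This is only a rough illustration, and it
rests on some assumptions: the Info dictionary is stored uncompressed and
unencrypted, the values are literal strings in parentheses (not hex
strings or UTF-16), and the function names are made up for this example.

```python
import re

# Matches "/Creator(...)", "/Producer (...)" or "/Title(...)" where the value
# is a PDF literal string. Escaped characters such as "\)" are allowed inside
# the value; unescaped nested parentheses are not handled by this pattern.
INFO_FIELD = re.compile(rb"/(Creator|Producer|Title)\s*\(((?:\\.|[^\\()])*)\)")


def read_info_fields(data: bytes) -> dict:
    """Return the last value seen for each matching key in raw PDF bytes."""
    fields = {}
    for match in INFO_FIELD.finditer(data):
        key = match.group(1).decode("ascii")
        # latin-1 never fails to decode; escape sequences are left as-is.
        fields[key] = match.group(2).decode("latin-1")
    return fields


def read_info_fields_from_file(path: str) -> dict:
    """Read a PDF as raw bytes and scan it for the Info fields above."""
    with open(path, "rb") as handle:
        return read_info_fields(handle.read())
```

If the fields come back empty, the metadata is probably compressed or
encrypted, and a real PDF toolkit is the better route.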

http://www.accesspdf.com/pdftk might help since it can dump this
information, but I think it's simpler just to search for the string
directly in the PDF. If you know Java,
http://www.pdfbox.org/commandlineutilities/index.html might come in
handy. I think there are tools for other languages too.
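
For the pdftk route, a script can call its dump_data operation and read
the result. The sketch below assumes pdftk is installed and on the PATH,
and that its output contains "InfoKey:" / "InfoValue:" line pairs; the
function names are hypothetical.

```python
import subprocess


def parse_dump_data(text: str) -> dict:
    """Collect InfoKey/InfoValue pairs from pdftk dump_data output."""
    info = {}
    key = None
    for line in text.splitlines():
        if line.startswith("InfoKey:"):
            key = line[len("InfoKey:"):].strip()
        elif line.startswith("InfoValue:") and key is not None:
            info[key] = line[len("InfoValue:"):].strip()
            key = None
    return info


def dump_info(path: str) -> dict:
    """Run `pdftk <path> dump_data` and return the Info fields it reports."""
    result = subprocess.run(
        ["pdftk", path, "dump_data"],
        capture_output=True,
        text=True,
        check=True,  # raise if pdftk exits with an error
    )
    return parse_dump_data(result.stdout)
```

Something like dump_info("report.pdf").get("Creator") would then give the
same value as the raw string search, with pdftk doing the parsing.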

I hope this helps a little...

Martin

Reply Mycroft 3/3/2006 4:13:48 PM


Thank you very much for the reply.

A follow-up question: what would you suggest is the best way of
determining whether a PDF was created from a scanned image? I understand
that for Word, for example, you would look at the converter that was
used ("/Creator(Acrobat PDFMaker 7.0.7 for Word)"), but for a scanned
image, is there some metadata field that I could look at to determine
that the PDF came from a scan? That is, does the PDF have a metadata
field that gives the type of the original document, for example
image/jpeg? (Looking for the converter helps with documents like .doc
or .xls, but I cannot use that suggestion for images since I do not know
which scanning tool was used.)

Second question: what would you suggest I use Java for? Would that be
easier than using a scripting language to search through a PDF's
metadata fields without actually opening it?

Could you suggest what Java would be used for in this case? I am really
curious how one would use Java to solve this issue.

Appreciate all your help.

PC

Reply pchulani 3/3/2006 11:38:21 PM

Well, in my case, a scanned image that I convert seems to have a
/Title(XXX.jpg) with the name of the image... other than that, I don't
know. It all depends on your environment. If the images are always
scanned in the same manner, surely there will be a distinctive field.
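
Putting the two hints together (a known creator string, or a title that
looks like an image file name), a guess could be sketched like this in
Python. The lookup table, the extension list, and the labels are all
assumptions for illustration; you would fill them in from the strings
that actually occur in your environment.

```python
# Hypothetical table: lowercase substrings of the creator value -> label.
# Only the "for word" entry reflects a string seen in this thread.
KNOWN_CREATORS = {
    "for word": "word",
    "for excel": "excel",  # assumed by analogy, not confirmed here
}

# Assumed list of image extensions that hint at a converted image.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".tif", ".tiff", ".png", ".bmp")


def guess_source(creator: str = "", title: str = "") -> str:
    """Guess the original source from the creator and title metadata."""
    lowered = creator.lower()
    for needle, label in KNOWN_CREATORS.items():
        if needle in lowered:
            return label
    # A title that is an image file name suggests an image was converted,
    # which may (or may not) have come from a scanner.
    if title.lower().endswith(IMAGE_EXTENSIONS):
        return "image (possibly scanned)"
    return "unknown"
```

This is a heuristic, not a definitive test: a PDF built from a scan can
carry any title, so "unknown" does not rule a scan out.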

As for the second question, it was just a reference to a toolkit I
know that handles PDFs. All the parsing is already done, so you have
access to the metadata in no time. It just happens to be in Java, but
that's just another programming language. Like I said, there are toolkits
for other languages, just none that I know off the top of my head. A
scripting language will do just fine.

Martin

Reply Mycroft 3/7/2006 2:05:47 PM

Hi Martin,

Thanks for the reply. This has been very helpful. 

PC

Reply pchulani 3/7/2006 6:31:31 PM
