Hi,
I have a question about PDFs. I would like to determine the original
source from which a PDF was created. For example, if the PDF was
created from a Word document, how would I check that? From my
understanding, I would need to look at "File -> Document Properties
-> Summary." Is that correct? If so, which field would I need to look
at, and what would it contain that tells me, for example, that the PDF
was created from a Word document?
Now to the more difficult issue. I need to determine the original
source from which the PDFs were created, but there are additional
criteria that I need to meet.
1) I need to check the original source without actually opening the
file. Is there some scripting or programming language that I can use to
do this?
2) I need to determine the original source in order to tell whether
the PDF was created from a scanned document or from some other
document, e.g. Word. Are there fields in the PDF file that state
that the document was created from a scanned document? Or is the only
way to guess: conclude that the original source is an image, so the
PDF was possibly created from a scanned document?
Please let me know your thoughts.
Thanks.
pchulani | 3/2/2006 6:21:01 PM

I think this is gonna be difficult unless you only have a few producers
of PDF that you are interested in. For example, Word vs. a scanned
image.
Open up the PDF in a text editor (e.g. Notepad) and you will find that
much of this information is pure text and easily readable. For example,
if you create a PDF from Word with Distiller, you will find a text string
like:
/Creator(Acrobat PDFMaker 7.0.7 for Word)
So for Word, it is easy to just look for that string... but you would
need to know all the possible strings in your environment.
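
The string search described above can be sketched in a few lines of Python. This is only a rough illustration, not a real PDF parser: it assumes the Info entries are stored as uncompressed, unencrypted literal strings (as in the /Creator example), and it will miss hex strings, UTF-16 values, and anything inside compressed streams. The function names and the sample Producer value are made up for the example.

```python
import re

# Matches literal-string entries such as /Creator(Acrobat PDFMaker 7.0.7 for Word).
# Assumption: the value is a plain literal string without nested parentheses.
INFO_PATTERN = re.compile(rb"/(Creator|Producer|Title)\s*\(((?:\\.|[^\\()])*)\)")


def find_info_strings(data: bytes) -> dict:
    """Return the first literal-string value found for each key."""
    found = {}
    for match in INFO_PATTERN.finditer(data):
        key = match.group(1).decode("ascii")
        value = match.group(2).decode("latin-1")
        found.setdefault(key, value)  # keep only the first occurrence
    return found


def find_info_in_file(path: str) -> dict:
    """Read the file as raw bytes (no PDF viewer involved) and scan it."""
    with open(path, "rb") as handle:
        return find_info_strings(handle.read())


# Illustrative bytes; the Producer value is invented for this example.
sample = b"<< /Creator(Acrobat PDFMaker 7.0.7 for Word) /Producer (Example Producer 1.0) >>"
print(find_info_strings(sample))
```

To cover your environment, you would compare the returned Creator and Producer values against the list of strings you know your tools write.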
http://www.accesspdf.com/pdftk might help since it can dump the
information but I think it's simpler just to search for the string
directly in the PDF. If you know Java,
http://www.pdfbox.org/commandlineutilities/index.html might come in
handy. I think there are tools for other languages too.
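
If you go the pdftk route, its dump_data operation prints the Info dictionary as InfoKey/InfoValue line pairs, which a script can pick apart. The following is a minimal sketch assuming pdftk is installed and on the PATH; the helper names are hypothetical, and the sample output is hand-written in the style pdftk produces rather than captured from a real run.

```python
import subprocess


def parse_dump_data(text: str) -> dict:
    """Pair each InfoKey line with the InfoValue line that follows it."""
    info = {}
    key = None
    for line in text.splitlines():
        if line.startswith("InfoKey:"):
            key = line[len("InfoKey:"):].strip()
        elif line.startswith("InfoValue:") and key is not None:
            info[key] = line[len("InfoValue:"):].strip()
            key = None
    return info


def dump_info(pdf_path: str) -> dict:
    """Run `pdftk <file> dump_data` and parse its output (requires pdftk)."""
    result = subprocess.run(
        ["pdftk", pdf_path, "dump_data"],
        capture_output=True, text=True, check=True,
    )
    return parse_dump_data(result.stdout)


# Hand-written sample in the style of dump_data output.
sample_output = """InfoKey: Creator
InfoValue: Acrobat PDFMaker 7.0.7 for Word
NumberOfPages: 3
"""
print(parse_dump_data(sample_output))
```

Compared with searching the raw file, this lets pdftk deal with the PDF structure, at the cost of depending on an external tool.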
I hope this helps a little...
Martin
Mycroft | 3/3/2006 4:13:48 PM

Thank you very much for the reply.
A follow-up question: what would you suggest is the best way of
determining whether a PDF was created from a scanned image or not? I
understand that for Word, for example, you would look at the converter
that was used ("/Creator(Acrobat PDFMaker 7.0.7 for Word)"), but for a
scanned image, is there some metadata field that I could look at to
determine that the PDF came from a scan? That is, does the PDF have a
metadata field that states the type of the original document, for
example image/jpeg? Looking for the converter helps with documents like
.doc or .xls, but I cannot use that approach for images, since I do not
know which scanning tool was used.
Second question: what would you suggest I use Java for? Would that be
easier than using a scripting language to search through a PDF's
metadata fields without actually opening it? I am really curious how
one would use Java to solve this issue.
Appreciate all your help.
PC
pchulani | 3/3/2006 11:38:21 PM

Well, in my case, a scanned image that I convert seems to have a
/Title(XXX.jpg) with the name of the image... other than that, I don't
know. It all depends on your environment. If the images are always
scanned in the same manner, surely there will be a distinctive field.
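
When no distinctive field turns up, one fallback is the guess raised earlier in the thread: a PDF that contains image objects but no fonts was probably made from a scan. Here is a rough sketch of that heuristic, and it is only a heuristic. It assumes the relevant objects are visible in the raw bytes (not hidden in compressed object streams), and it will misjudge scans that carry an OCR text layer, since those contain fonts. The function name and sample bytes are invented for illustration.

```python
import re


def looks_scanned(data: bytes) -> bool:
    """Guess: image XObjects present and no font resources present."""
    has_image = re.search(rb"/Subtype\s*/Image\b", data) is not None
    has_font = re.search(rb"/Font\b", data) is not None
    return has_image and not has_font


# Invented fragments standing in for the raw bytes of two PDFs.
scan_like = b"<< /Type /XObject /Subtype /Image /Width 1700 /Height 2200 >>"
word_like = b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >> << /Subtype /Image >>"

print(looks_scanned(scan_like))
print(looks_scanned(word_like))
```

Treat a positive result as "possibly scanned" and combine it with whatever Creator, Producer, or Title values your scanning setup writes.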
As for the second question, it was just a reference to a toolkit I
know that handles PDFs. All the parsing is already done, so you
have access to the metadata in no time. It just happens to be Java, but
that's just another programming language. Like I said, there are toolkits
for other languages, just none that I know off the top of my head. A
scripting language will do just fine.
Martin
Mycroft | 3/7/2006 2:05:47 PM

Hi Martin,
Thanks for the reply. This has been very helpful.
PC
pchulani | 3/7/2006 6:31:31 PM