specific text extraction from pdf

I've researched this a lot but still haven't found a solution. Let me
explain:

A PDF file is uploaded. Its layout can vary in a million ways, right?
What I need to do is fetch every odd line of the text (but only the
paragraph text; extracting text from a PDF often also returns text
that is, for example, inside an image) and cover that line with black,
so that the line is no longer readable.

Or maybe I want to do the same for every odd word in the
paragraphs.

As you can see, it comes down to:

1) Extract the text from the PDF.
2) Analyse it: which text is "real" text, and which is unimportant
(table of contents, image captions, text inside images, page headers,
etc.).
3) Rewrite the PDF file exactly as it was, but apply the text
manipulation while rewriting (black out every second line, for
example).

How can I solve this? After a lot of research, I'm fairly sure this is
close to mission impossible.
The most advanced tool I could find for this kind of manipulation is
PDFlib, and especially its TET library. But I couldn't find a good way
to analyse text in the way I described above. Has anybody out there
worked on something like this who can give me some advice on how to
proceed?

8/11/2009 9:44:04 AM
comp.lang.php


Aka Unknown wrote:
> I've researched this a lot but still haven't found a solution. Let me
> explain:
> 
> A PDF file is uploaded. Its layout can vary in a million ways, right?
> What I need to do is fetch every odd line of the text (but only the
> paragraph text; extracting text from a PDF often also returns text
> that is, for example, inside an image) and cover that line with black,
> so that the line is no longer readable.

In a semantic sense, the PDF format sucks. It has no concept of
"paragraph", "table" or "row". It just handles little boxes that
happen to contain text and that, gracefully positioned around the page,
look like a document to human eyes. It's great for printing, but
totally useless for automated information exchange.
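
To make the "little boxes" point concrete, here is a minimal sketch of how lines could be rebuilt from positioned fragments. It is written in Python only so that it stays self-contained; the `Fragment` class, the `group_into_lines` function and the 2-point tolerance are assumptions made for illustration, not part of any PDF library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A positioned piece of text, as a PDF text extractor might report it."""
    text: str
    x: float       # left edge, in points
    y: float       # baseline, in points (PDF origin is bottom-left)
    width: float
    height: float


def group_into_lines(fragments, tolerance=2.0):
    """Cluster fragments whose baselines lie within `tolerance` points."""
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        # Compare against the first fragment of the most recent line.
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


fragments = [
    Fragment("world", 140.0, 700.0, 30.0, 12.0),
    Fragment("Hello", 100.0, 700.5, 35.0, 12.0),
    Fragment("Second line", 100.0, 686.0, 70.0, 12.0),
]
lines = group_into_lines(fragments)
line_texts = [" ".join(f.text for f in line) for line in lines]
print(line_texts)
```

This only recovers lines. Deciding which lines form a paragraph, a caption or a header still needs further heuristics, which is the hard part.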

I've learnt never to say never, but I don't think what you're trying
to do is feasible unless you find a very good third-party tool that
implements the PDF equivalent of an OCR utility for pictures. Google has
one (you can see an HTML version of indexed PDFs), but even Google's
utility works poorly on most documents.


> Or maybe I want to do the same for every odd word in the
> paragraphs.
> 
> As you can see, it comes down to:
> 
> 1) Extract the text from the PDF.

Extracting text itself is quite easy... provided that the text boxes
are generated in reading order.
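
A small illustration of why the order matters, using made-up fragments as `(text, x, y)` tuples. The data and the simple top-to-bottom, left-to-right sort are assumptions for a single-column page; a multi-column layout would need more than this.

```python
# Fragments as (text, x, y) tuples, in the order they appear in the
# content stream. Hypothetical data: the producer wrote the footer
# first and the title last.
stream_order = [
    ("Page 1", 280.0, 40.0),
    ("the body.", 160.0, 680.0),
    ("This is", 100.0, 680.0),
    ("Title", 100.0, 720.0),
]

# Naive extraction: concatenate in stream order.
naive_text = " ".join(text for text, _, _ in stream_order)

# Sort top to bottom (larger y is higher on the page), then left to right.
visual_order = sorted(stream_order, key=lambda f: (-f[2], f[1]))
sorted_text = " ".join(text for text, _, _ in visual_order)

print(naive_text)
print(sorted_text)
```

The first line printed is scrambled, the second reads as the page would.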

> 2) Analyse it: which text is "real" text, and which is unimportant
> (table of contents, image captions, text inside images, page headers,
> etc.).
> 3) Rewrite the PDF file exactly as it was, but apply the text
> manipulation while rewriting (black out every second line, for
> example).

Editing an existing PDF should be doable, see:

http://www.setasign.de/products/pdf-php-solutions/fpdi/

I'm not sure, though, whether it's possible to remove existing parts
(beyond drawing a white rectangle on top).
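
Assuming the line positions are already known, working out where to draw the covering rectangles is the easy part. The sketch below is Python again, to stay self-contained; the `Line` class, the `cover_rectangles` function and the 1-point padding are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Bounding box of one text line, in PDF points (origin bottom-left)."""
    x: float
    y: float        # bottom edge
    width: float
    height: float


def cover_rectangles(lines, padding=1.0):
    """Return (x, y, w, h) boxes covering the 1st, 3rd, 5th, ... line."""
    boxes = []
    for index, line in enumerate(lines):
        if index % 2 == 0:  # index 0 is the first (odd-numbered) line
            boxes.append((
                line.x - padding,
                line.y - padding,
                line.width + 2 * padding,
                line.height + 2 * padding,
            ))
    return boxes


paragraph = [
    Line(72.0, 700.0, 400.0, 12.0),
    Line(72.0, 686.0, 390.0, 12.0),
    Line(72.0, 672.0, 410.0, 12.0),
]
boxes = cover_rectangles(paragraph)
print(boxes)
```

With FPDI, which builds on FPDF, something like SetFillColor(0, 0, 0) followed by Rect(x, y, w, h, 'F') on the imported page would presumably do the drawing, after converting to that library's units and top-left origin. Keep in mind that a rectangle drawn on top only hides the text visually: the text is still in the file and can be selected or extracted.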


> How can I solve this? After a lot of research, I'm fairly sure this is
> close to mission impossible.
> The most advanced tool I could find for this kind of manipulation is
> PDFlib, and especially its TET library. But I couldn't find a good way
> to analyse text in the way I described above. Has anybody out there
> worked on something like this who can give me some advice on how to
> proceed?



-- 
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin humor website: http://www.demogracia.com
--
8/11/2009 10:22:31 AM