Extract Text Coordinates from PDF



Hi,
I was wondering if anyone could recommend a program which can extract
the starting (top left) coordinates (x,y) of each word in a PDF file
(and the end if possible). Ideally output would be in a format that
could be easily inserted into a database.

Reply sebclark (1) 10/11/2005 3:26:55 PM

Hi,

We did that here for an internal parsing requirement but did not turn it into a
commercial product.  Bringing it up to a marketable product would take
additional funding, and for a one-time job it would not be worth the cost.  As
an OEM or volume product, of course, the picture changes.  BTW, our output was
designed to place the information on an OctoTools Template, which is somewhat
XML-like.  From there we could output CSV or a custom format if required.  Call
me if you are looking for a more commercial solution.

Larry T.  (978) 535-7676  US-Boston, MA
Reply larrynospam 10/11/2005 3:48:30 PM


On 11 Oct 2005 08:26:55 -0700, sebclark@gmail.com wrote:

>Hi,
>I was wondering if anyone could recommend a program which can extract
>the starting (top left) coordinates (x,y) of each word in a PDF file
>(and the end if possible). Ideally output would be in a format that
>could be easily inserted into a database.

pdw.exe, part of PDF Command Line Tools
http://www.pdf-tools.com/asp/products.asp?name=CLE

sample output using the -w option:
231.9 663.0 12.0  50.4   0 Cour: permits
295.7 663.0 12.0  21.6   0 Cour: the
330.6 663.0 12.0  28.8   0 Cour: text
372.8 663.0 12.0  72.0   0 Cour: extraction
458.2 663.0 12.0  28.8   0 Cour: from
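For the "easily inserted into a database" part, here is a rough Python sketch that parses lines shaped like the sample above and loads them into SQLite. The column meanings are an assumption inferred from the sample only (x, y, font size, width, one unidentified field, font name, word); they are not taken from the pdw documentation. The table name `words` and the helper names are made up for illustration. Under that assumption, the end x-coordinate is simply start x plus width.

```python
import sqlite3

# The five sample lines from the post above, copied verbatim.
SAMPLE = """\
231.9 663.0 12.0  50.4   0 Cour: permits
295.7 663.0 12.0  21.6   0 Cour: the
330.6 663.0 12.0  28.8   0 Cour: text
372.8 663.0 12.0  72.0   0 Cour: extraction
458.2 663.0 12.0  28.8   0 Cour: from
"""


def parse_line(line):
    """Split one output line into (word, font, size, x_start, y, x_end).

    ASSUMED layout (guessed from the sample, not from any manual):
        x  y  font_size  width  <unknown>  font: word
    Also assumes the font name contains no spaces.
    """
    head, word = line.split(": ", 1)
    x, y, size, width, _unknown, font = head.split()
    x, y, size, width = (float(v) for v in (x, y, size, width))
    # End of the word on the x axis = start + width (rounded to one decimal).
    return (word.strip(), font, size, x, y, round(x + width, 1))


def load_words(text, conn):
    """Create a hypothetical 'words' table and insert one row per line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(word TEXT, font TEXT, size REAL, x_start REAL, y REAL, x_end REAL)"
    )
    rows = [parse_line(ln) for ln in text.splitlines() if ln.strip()]
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return rows


conn = sqlite3.connect(":memory:")
rows = load_words(SAMPLE, conn)
```

Whether y is measured from the top or the bottom of the page is not clear from the sample, so check that against a known document before relying on it.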
Reply JB 10/11/2005 10:07:31 PM

PDFLib TET : http://www.pdflib.com/products/tet/index.html





sebclark@gmail.com wrote:
> Hi,
> I was wondering if anyone could recommend a program which can extract
> the starting (top left) coordinates (x,y) of each word in a PDF file
> (and the end if possible). Ideally output would be in a format that
> could be easily inserted into a database.

Reply fhtino 10/12/2005 7:54:48 AM

fhtino wrote:

> PDFLib TET : http://www.pdflib.com/products/tet/index.html

> sebclark@gmail.com wrote:
> > Hi,
> > I was wondering if anyone could recommend a program which can extract
> > the starting (top left) coordinates (x,y) of each word in a PDF file
> > (and the end if possible). Ideally output would be in a format that
> > could be easily inserted into a database.
> 

Or write your own PostScript header library that hooks the show operator and
logs the current point for each string it paints.

Eric

Reply Eric 10/12/2005 9:33:26 AM

sebclark@gmail.com wrote:
> Hi,
> I was wondering if anyone could recommend a program which can extract
> the starting (top left) coordinates (x,y) of each word in a PDF file
> (and the end if possible). Ideally output would be in a format that
> could be easily inserted into a database.

pdftohtml has an "-xml" mode, which writes each chunk of text along with its
position and size.

http://pdftohtml.sourceforge.net/

Ralf
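As a rough illustration of what you could do with that XML, here is a small Python sketch that reads the coordinates back out. The sample document is hand-written, not real tool output, and the element and attribute names (`page`, `text`, `top`, `left`, `width`, `height`) are what I recall the "-xml" mode producing, so treat them as an assumption and compare against your own output. Note that this gives one box per text chunk, not per word. The function name `text_boxes` is made up.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the ASSUMED shape of pdftohtml -xml output.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="163" left="231" width="50" height="12" font="0">permits</text>
<text top="163" left="295" width="150" height="12" font="0">the <b>text</b> extraction</text>
</page>
</pdf2xml>"""


def text_boxes(xml_string):
    """Return one dict per <text> element with its bounding box."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            left = int(node.get("left"))
            top = int(node.get("top"))
            width = int(node.get("width"))
            height = int(node.get("height"))
            boxes.append({
                "page": page_no,
                # itertext() also collects text nested inside <b>/<i> tags.
                "text": "".join(node.itertext()),
                "left": left,
                "top": top,
                "right": left + width,
                "bottom": top + height,
            })
    return boxes


boxes = text_boxes(SAMPLE_XML)
for box in boxes:
    print(box)
```

If you need per-word boxes, you would have to split each chunk yourself and estimate widths, which this sketch does not attempt.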
Reply Ralf 10/25/2005 10:16:08 PM

Ralf Koenig wrote:
> sebclark@gmail.com wrote:
> 
>> Hi,
>> I was wondering if anyone could recommend a program which can extract
>> the starting (top left) coordinates (x,y) of each word in a PDF file
>> (and the end if possible). Ideally output would be in a format that
>> could be easily inserted into a database.
> 
> 
> pdftohtml has an "-xml" mode, which does stuff like that.
> 
> http://pdftohtml.sourceforge.net/
> 
> Ralf


http://www.tinaja.com/glib/extract1.pdf

and similar tools at http://www.tinaja.com/gurgrm01.asp


-- 
Many thanks,

Don Lancaster                          voice phone: (928)428-4073
Synergetics   3860 West First Street   Box 809 Thatcher, AZ 85552
rss: http://www.tinaja.com/whtnu.xml   email: don@tinaja.com

Please visit my GURU's LAIR web site at http://www.tinaja.com
Reply Don 10/28/2005 6:21:07 PM
