Extract Text Coordinates from PDF
Hi,
I was wondering if anyone could recommend a program which can extract
the starting (top left) coordinates (x,y) of each word in a PDF file
(and the end if possible). Ideally output would be in a format that
could be easily inserted into a database.
Posted by sebclark, 10/11/2005 3:26:55 PM

Hi,
We did that here for an internal parsing requirement but did not turn it into
a commercial product; bringing it up to a marketable standard would take
additional funding. For a one-time job, it would not be worth the cost. As an
OEM or volume product, of course, the picture changes. BTW, our output was
designed to place the information on an OctoTools template, which is somewhat
XML-like. From there we could output CSV or a custom format if required. Call
me if you are looking for a more commercial solution.
Larry T. (978) 535-7676 US-Boston, MA
Posted by larrynospam, 10/11/2005 3:48:30 PM

On 11 Oct 2005 08:26:55 -0700, sebclark@gmail.com wrote:
>Hi,
>I was wondering if anyone could recommend a program which can extract
>the starting (top left) coordinates (x,y) of each word in a PDF file
>(and the end if possible). Ideally output would be in a format that
>could be easily inserted into a database.
pdw.exe, part of PDF Command Line Tools
http://www.pdf-tools.com/asp/products.asp?name=CLE
sample output using the -w option:
231.9 663.0 12.0 50.4 0 Cour: permits
295.7 663.0 12.0 21.6 0 Cour: the
330.6 663.0 12.0 28.8 0 Cour: text
372.8 663.0 12.0 72.0 0 Cour: extraction
458.2 663.0 12.0 28.8 0 Cour: from
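
The sample output above looks regular enough to load into a database directly. The sketch below is one way to do that in Python. The column meanings (x, y, font size, width, an unexplained flag, font name, word) are inferred from the sample, not taken from the pdw documentation, and the table name and schema are made up for illustration. The width guess is at least consistent with the data: 12 pt Courier characters are 7.2 pt wide, and "permits" (7 characters) is listed as 50.4.

```python
import sqlite3
from typing import NamedTuple

# Sample lines copied from the post above.
SAMPLE = """\
231.9 663.0 12.0 50.4 0 Cour: permits
295.7 663.0 12.0 21.6 0 Cour: the
330.6 663.0 12.0 28.8 0 Cour: text
372.8 663.0 12.0 72.0 0 Cour: extraction
458.2 663.0 12.0 28.8 0 Cour: from
"""


class WordBox(NamedTuple):
    x: float          # assumed: start x of the word
    y: float          # assumed: y position of the word
    font_size: float  # assumed: font size in points
    width: float      # assumed: width of the word
    flag: int         # meaning unknown; kept as-is
    font: str
    text: str

    @property
    def x_end(self) -> float:
        # End coordinate derived from the assumed start and width columns.
        return self.x + self.width


def parse_line(line: str) -> WordBox:
    # Everything before the first ": " is the numeric header plus font name.
    head, _, text = line.partition(": ")
    x, y, size, width, flag, font = head.split(None, 5)
    return WordBox(float(x), float(y), float(size), float(width),
                   int(flag), font, text.strip())


def store(conn: sqlite3.Connection, words: list) -> None:
    # Hypothetical schema; adjust to whatever the target database needs.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words "
        "(x REAL, y REAL, x_end REAL, font_size REAL, font TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)",
        [(w.x, w.y, w.x_end, w.font_size, w.font, w.text) for w in words],
    )
    conn.commit()


if __name__ == "__main__":
    parsed = [parse_line(line) for line in SAMPLE.splitlines()]
    db = sqlite3.connect(":memory:")
    store(db, parsed)
    for row in db.execute("SELECT text, x, x_end, y FROM words"):
        print(row)
```

Note that PDF coordinates normally have their origin at the bottom left of the page, so if a top-left origin is needed the y value would have to be subtracted from the page height; whether pdw already does this is not clear from the sample.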
Posted by JB, 10/11/2005 10:07:31 PM

PDFlib TET: http://www.pdflib.com/products/tet/index.html
sebclark@gmail.com wrote:
> Hi,
> I was wondering if anyone could recommend a program which can extract
> the starting (top left) coordinates (x,y) of each word in a PDF file
> (and the end if possible). Ideally output would be in a format that
> could be easily inserted into a database.
Posted by fhtino, 10/12/2005 7:54:48 AM

fhtino wrote:
> PDFlib TET: http://www.pdflib.com/products/tet/index.html
> sebclark@gmail.com wrote:
> > Hi,
> > I was wondering if anyone could recommend a program which can extract
> > the starting (top left) coordinates (x,y) of each word in a PDF file
> > (and the end if possible). Ideally output would be in a format that
> > could be easily inserted into a database.
>
Or convert the PDF to PostScript and write your own PS header library that
hooks the show operator, logging the current point for each string it paints.
Eric
Posted by Eric, 10/12/2005 9:33:26 AM

sebclark@gmail.com wrote:
> Hi,
> I was wondering if anyone could recommend a program which can extract
> the starting (top left) coordinates (x,y) of each word in a PDF file
> (and the end if possible). Ideally output would be in a format that
> could be easily inserted into a database.
pdftohtml has an "-xml" mode, which writes each text fragment together with
its position and size on the page.
http://pdftohtml.sourceforge.net/
Ralf
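
For anyone wanting to get from that XML to database rows, here is a rough Python sketch. The sample XML is hand-written for illustration (reusing words from the pdw sample earlier in the thread), and it assumes the usual shape of pdftohtml's output: `page` elements containing `text` elements with `top`, `left`, `width` and `height` attributes. Keep in mind that these boxes describe text fragments, which are often whole lines, so per-word coordinates would still need to be estimated or obtained from another tool.

```python
import xml.etree.ElementTree as ET

# Hand-made sample in the assumed shape of "pdftohtml -xml" output.
SAMPLE_XML = """\
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="120" left="108" width="210" height="16" font="0">permits the <b>text</b> extraction</text>
<text top="140" left="108" width="90" height="16" font="0">from PDF</text>
</page>
</pdf2xml>
"""


def text_boxes(xml_text: str) -> list:
    """Return (page, x0, y0, x1, y1, text) tuples, one per text fragment."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            left = float(node.get("left"))
            top = float(node.get("top"))
            width = float(node.get("width"))
            height = float(node.get("height"))
            # itertext() also collects text inside nested <b>/<i>/<a> tags.
            content = "".join(node.itertext())
            rows.append((page_no, left, top, left + width, top + height, content))
    return rows


if __name__ == "__main__":
    for row in text_boxes(SAMPLE_XML):
        print(row)
```

Here (x0, y0) is the top-left corner and (x1, y1) the bottom-right, which matches the start and end coordinates asked for in the original question. The tuples can be passed straight to a database driver's bulk insert.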
Posted by Ralf, 10/25/2005 10:16:08 PM

Ralf Koenig wrote:
> sebclark@gmail.com wrote:
>
>> Hi,
>> I was wondering if anyone could recommend a program which can extract
>> the starting (top left) coordinates (x,y) of each word in a PDF file
>> (and the end if possible). Ideally output would be in a format that
>> could be easily inserted into a database.
>
>
> pdftohtml has an "-xml" mode, which does stuff like that.
>
> http://pdftohtml.sourceforge.net/
>
> Ralf
http://www.tinaja.com/glib/extract1.pdf
and similar tools at http://www.tinaja.com/gurgrm01.asp
--
Many thanks,
Don Lancaster voice phone: (928)428-4073
Synergetics 3860 West First Street Box 809 Thatcher, AZ 85552
rss: http://www.tinaja.com/whtnu.xml email: don@tinaja.com
Please visit my GURU's LAIR web site at http://www.tinaja.com
Posted by Don, 10/28/2005 6:21:07 PM