Help with pdf to text

I am trying to use the sample code posted by thodge at ipswich dot qld
dot gov dot au, found here:

http://au2.php.net/pdf

to convert a PDF file to a string. I am currently testing with this
document:
http://www.tececo.com/files/appraisals/GlasserTecEcoAppraisal.pdf
but other files fail in the same way. The file read itself works, since
echoing $content after this point:

    $fp = fopen($sourcefile, 'rb');
    $content = fread($fp, filesize($sourcefile));
    fclose($fp);


works fine. However, echo pdf2string($sourcefile) produces blank
output. Can anyone suggest what the problem could be in the way I am
using it, or point me to another easy-to-use, cross-platform script
that will extract the text from PDF files?


The entire script is copied here for easy reference (sorry, but I am
not sure what is going wrong; I have no experience with this):


<?php


function pdf2string($sourcefile) {

   $fp = fopen($sourcefile, 'rb');

   $content = fread($fp, filesize($sourcefile));
  fclose($fp);

   echo $content;
   $searchstart = 'stream';
   $searchend = 'endstream';
   $pdfText = '';
   $pos = 0;
   $pos2 = 0;
   $startpos = 0;
   while ($pos !== false && $pos2 !== false) {

       $pos = strpos($content, $searchstart, $startpos);
       $pos2 = strpos($content, $searchend, $startpos + 1);

       if ($pos !== false && $pos2 !== false){

           if ($content[$pos] == 0x0d && $content[$pos + 1] == 0x0a) {
               $pos += 2;
           } else if ($content[$pos] == 0x0a) {
               $pos++;
           }

           if ($content[$pos2 - 2] == 0x0d && $content[$pos2 - 1] == 0x0a) {
               $pos2 -= 2;
           } else if ($content[$pos2 - 1] == 0x0a) {
               $pos2--;
           }

           $textsection = substr(
               $content,
               $pos + strlen($searchstart) + 2,
               $pos2 - $pos - strlen($searchstart) - 1
           );
           $data = @gzuncompress($textsection);
           $pdfText .= pdfExtractText($data);
           $startpos = $pos2 + strlen($searchend) - 1;

       }
   }

   return preg_replace('/(\s)+/', ' ', $pdfText);

}

function pdfExtractText($psData){

   if (!is_string($psData)) {
       return '';
   }

   $text = '';

   // Handle brackets in the text stream that could be mistaken for
   // the end of a text field. I'm sure you can do this as part of the
   // regular expression, but my skills aren't good enough yet.
   $psData = str_replace('\)', '##ENDBRACKET##', $psData);
   $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

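   // Match against $psData, the function's parameter (the posted code
   // passed an undefined variable here, so nothing could ever match).
   preg_match_all(
       '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
       $psData,
       $matches
   );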
   for ($i = 0; $i < sizeof($matches[0]); $i++) {
       if ($matches[3][$i] != '') {
           // Run another match over the contents.
           preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
           foreach ($subMatches[1] as $subMatch) {
               $text .= $subMatch;
           }
       } else if ($matches[4][$i] != '') {
           $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
       }
   }

   // Translate special characters and put back brackets.
   $trans = array(
       '...'                => '&hellip;',
       '\205'                => '&hellip;',
       '\221'                => chr(145),
       '\222'                => chr(146),
       '\223'                => chr(147),
       '\224'                => chr(148),
       '\226'                => '-',
       '\267'                => '&bull;',
       '\('                => '(',
       '\['                => '[',
       '##ENDBRACKET##'    => ')',
       '##ENDSBRACKET##'    => ']',
       chr(133)            => '-',
       chr(141)            => chr(147),
       chr(142)            => chr(148),
       chr(143)            => chr(145),
       chr(144)            => chr(146),
   );
   $text = strtr($text, $trans);

   return $text;

}

echo pdf2string('GlasserTecEcoAppraisal.pdf');

?>

9/3/2006 9:52:48 AM
comp.lang.php

"noodle_snacks" <noodle_snacks@yahoo.com.au> wrote:
>
>I am trying to use the sample of code posted by thodge at ipswich dot
>qld dot gov dot au found here:
>
>http://au2.php.net/pdf
>
>In order to convert a PDF file to a string. I am currently trying with
>this document:
>http://www.tececo.com/files/appraisals/GlasserTecEcoAppraisal.pdf
>however others fail in the same fashion. Basically the file read works,
>
>since echoing $content after this point:
>
>   $fp = fopen($sourcefile, 'rb');
>   $content = fread($fp, filesize($sourcefile));
>   fclose($fp);
>
>Works fine, however using echo pdf2string($sourcefile)  the final
>result of this script is blank output. Can anyone suggest what could be
>the problem in the way I am using it, or another easy to use, cross
>platform script that will extract the text from PDF files?

That PDF file is compressed, as most PDF files are.  The script you
provided only works on uncompressed PDF files.  You can use the "pdftk"
tool to uncompress it.
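
As a rough illustration of what "compressed" means here: PDF writers
usually pass page content streams through the /FlateDecode filter, which
is zlib compression. The sketch below is not taken from the posted
script. The helper names hasFlateStreams() and inflateStream() are
hypothetical, and the "PDF object" is a synthetic string built on the
spot. It only shows how a compressed stream could be detected and
inflated with PHP's zlib functions. With pdftk, the usual invocation is
along the lines of "pdftk in.pdf output out.pdf uncompress", where the
file names are placeholders.

```php
<?php
// Hypothetical helper: reports whether the raw PDF bytes mention the
// /FlateDecode filter, i.e. contain at least one zlib-compressed stream.
function hasFlateStreams(string $pdfBytes): bool
{
    return strpos($pdfBytes, '/FlateDecode') !== false;
}

// Hypothetical helper: inflates the body of one FlateDecode stream.
// Returns null when the bytes are not valid zlib data.
function inflateStream(string $streamBody): ?string
{
    $inflated = @gzuncompress($streamBody);
    return $inflated === false ? null : $inflated;
}

// Synthetic example: a tiny content stream, compressed with zlib the
// way a PDF writer would compress it.
$plain  = "BT /F1 12 Tf (Hello) Tj ET";
$packed = gzcompress($plain);
$object = "<< /Length " . strlen($packed) . " /Filter /FlateDecode >>\n"
        . "stream\n" . $packed . "\nendstream";

var_dump(hasFlateStreams($object));          // bool(true)
var_dump(inflateStream($packed) === $plain); // bool(true)
var_dump(inflateStream($plain));             // NULL: plain text is not zlib data
```

A real extractor would also have to honor the /Length entry and any
other filters listed in the stream dictionary; this sketch skips all of
that.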
-- 
- Tim Roberts, timr@probo.com
  Providenza & Boekelheide, Inc.
timr (1409)
9/5/2006 12:05:13 AM