Help: String search in Windows 2000 doesn't find text in Windows XP: MS Word document

Hi:

	I am using Perl 5 (I believe both machines are using ActivePerl 5) on 
two machines with the same data files.  One machine is Win 2000; the 
other is Win XP.  The files are MS Word 2000 documents e-mailed 
(manually) from the Win 2000 machine to the XP machine.

	The program searches the MS Word Files (both created with MS Word 2000) 
for the word HYPERLINK.  The format for the HYPERLINK that I am 
searching for in the document is:

HYPERLINK "mydoc.doc"

(I checked this on the XP machine in Notepad and it is OK.)

PROBLEM: 	The program works on the Windows 2000 machine, but does not 
find the files on the Win XP machine.

	The code that is not finding the text on the Win XP machine (same as 
the Win 2000 machine which does find the text) is:

-----------  start actual code segment --------------------
	while (/HYPERLINK(\s+.{1,80}?\.doc)/gim)  #  the "g" causes multiple matches

	{
		  $fndxx = $1;

		  $fndxx =~ s/\"//;   # remove leading quote
		  $fndxx =~ s/\s+//;   # remove leading spaces
		  $dir="C:\\IGINproducts\\UserDocuments\\"; 	
		
		  $fullname = ($dir . $fndxx);
		  $date_string = "Cannot Find";
		  if (-e $fullname) { $date_string = ctime(stat($dir . 
$fndxx)->mtime); }     #last update date of that file
		  print(OUTFILE  $fndxx,",",$date_string,", in: ",basename($file), 
"\n")  ;
		  $matches += 1;  # count matches
		
	}	#end while HYPERLINK
-----------  end actual code segment --------------------

The output for a found HYPERLINK should look like this (it does on the 
Win 2000 machine):

mydoc.doc,(date of last update), in: otherdoc.doc

On Win XP, the program cannot even find the word HYPERLINK (if I modify 
the code to just search for that).  The directories are valid, I can 
have the program print a list of all files as it processes them.

If I try this with a test program (the string to test is in the program 
itself ) it works fine on the XP machine.

There are no encryption issues, nor any file or directory problems.

I would really appreciate any comments or suggestions about what I am 
doing wrong.

Thanks,

Barry Millman

	
0
11/27/2005 2:13:16 PM
comp.lang.perl.misc

Just some added info:

The search works fine if I save the MS Word files as RTF.

Also I wanted to mention that I have this around the hyperlink search code:
	# open the file
	open(INFILE, "< $file") or die "Couldn't open file $file: $!";
	

	while(<INFILE>)
	{
		# the hyperlink code I posted earlier
	}       # end while infile

Barry



0
Barry
11/27/2005 3:27:28 PM
Barry Millman wrote:

(snipped)

>The code that is not finding the text on the Win XP machine (same as
> the Win 2000 machine which does find the text) is:

(snipped)

Move this line above and outside your while loop:

>   $dir="C:\\IGINproducts\\UserDocuments\\"; 	
 		
The reason for moving that line above and outside your while loop
is you are creating a new value for that variable with each loop
iteration. That is inefficient because that variable has a "fixed"
value; set the value above and outside your while loop.

You do not need to use double backslashes for your
file path but doing so causes no harm. You can use single
forward slashes for your path, for an open(FILE) syntax
as shown below.

However, despite claims of one of the "experts" in this group,
you must use double backslashes for some syntax,
certainly for some system command syntax for Win32.

For a file open, you do not need double slashes but it
is perfectly ok to use them.

Uppercase letters in a file path are not needed for Win32
but are ok to use; no problem.

Your code produces this directory / file name path:

    C:\IGINproducts\UserDocuments\mydoc.doc

That "appears" to be a valid path. Check to be sure it is valid.
Double check to be sure there are not spaces in a directory
name, such as,   User Documents   which is typical.

You do not show your syntax for your  OUTFILE  open for write.
Be sure to use error checking to verify that file opens for write.

Run this test code,

#!perl

open (TEST, "c:/iginproducts/userdocuments/mydoc.doc") || die "File Open Failed: $!";

while (<TEST>)
 {
  if (index ($_, "HYPERLINK") > -1)
   { print "HYPERLINK found at line $.\n"; }
 }

close (TEST) || die "File Close Failed $!";


Clearly I cannot test that code not having your file to test.
However, my syntax is ok,

C:\APACHE\USERS\TEST>perl -c test.pl
test.pl syntax OK

Running that test code will determine if your file path and file name
are valid, and will determine if HYPERLINK is actually in your file.

Be cautious. If your HYPERLINK word spans lines, index will not
find that specific instance.

Often, reducing your code to the simplest possible version will find
errors for you, quickly.

Purl Gurl
0
Purl
11/27/2005 3:47:59 PM
Barry Millman <millmanbarry@hotmail.com> wrote:

> The format for the HYPERLINK that I am 
> searching for in the document is:
> 
> HYPERLINK "mydoc.doc"

> PROBLEM: 	The program works on the Windows 2000 machine, but does not 
> find the files on the Win Xp machine.


I don't think I can help with that part, but the code is too hokey
to just let it pass...


> -----------  start actual code segment --------------------
> 	while (/HYPERLINK(\s+.{1,80}?\.doc)/gim)  #  the "g" causes multiple 
> matches


The //m does not do anything, so why is it there?

It changes the meaning of ^ and $, but you don't use those
anchors in your pattern, so you don't need //m.

   .{1,80}?

is the same as

   .{0,80}

Do you really want to match ' .doc' ?


We can't help you analyse why the match is failing because we
need two things to do that: the pattern and the string that
the pattern is to be matched against.

We have only one of those two things...


> 
> 	{
> 		  $fndxx = $1;
> 
> 		  $fndxx =~ s/\"//;   # remove leading quote
> 		  $fndxx =~ s/\s+//;   # remove leading spaces


Why capture them only to strip them out of the captured string?

Why not just leave them out of the capture in the first place?


   while (/HYPERLINK\s+"(.{1,78}\.doc)"/gi)

or, probably better:

   while (/HYPERLINK\s+"([^"]{1,78}\.doc)"/gi)
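
Here is a minimal sketch of that idea, assuming the link text really looks like HYPERLINK "mydoc.doc" as in the original post. The sub name hyperlink_targets and the sample string are made up for illustration. The quotes and the whitespace stay outside the parentheses, so $1 should hold just the file name and nothing needs stripping afterwards.

```perl
use strict;
use warnings;

# Return every .doc target that follows a HYPERLINK field in $text.
# The quotes and the leading whitespace sit outside the capture group,
# so the captured string needs no clean-up afterwards.
sub hyperlink_targets {
    my ($text) = @_;
    my @found;
    while ($text =~ /HYPERLINK\s+"([^"]{1,78}\.doc)"/gi) {
        push @found, $1;
    }
    return @found;
}

# Hypothetical sample text; a real Word file holds much more than this.
my $sample = qq{see HYPERLINK "mydoc.doc" and HYPERLINK  "other file.doc" here};
print "$_\n" for hyperlink_targets($sample);
```

Because [^"] cannot run past a closing quote, a match cannot swallow two links at once, which is the main reason to prefer it over the dot.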


> 		  $dir="C:\\IGINproducts\\UserDocuments\\"; 	
> 		


Use single quotes unless you want to make use of one of the two
extra things that double quotes give you (interpolation
and backslash escapes).

Use forward slashes instead of silly slashes unless the path
is going to be fed to the "command interpreter".


   $dir='C:/IGINproducts/UserDocuments/';


> 		  print(OUTFILE  $fndxx,",",$date_string,", in: ",basename($file), 
> "\n")  ;


Gak!

Use double quoted strings to concatenate your output string:

    print(OUTFILE  "$fndxx,$date_string, in: ", basename($file), "\n")  ;


> If I try this with a test program (the string to test is in the program 
> itself ) it works fine on the XP machine.


If you had shown us your complete test program, then we could
have helped you debug it.

But you didn't, so we can't.  (hint)


> I would really appreciate any comments or suggestions about what I am 
> doing wrong.


Not posting a short and complete program that we can run that
illustrates your problem.

Have you seen the Posting Guidelines that are posted here frequently?


-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas
0
Tad
11/27/2005 3:54:43 PM
Hi:

	I tried your suggestions, but no luck.  I did move that directory 
assignment outside the loop.  Stupid of me!

	There is something really odd in MS Word storage in Win XP.  If I save 
the document to RTF it finds the stuff in the RTF file.

	I looked at both the MS Word and RTF files with the XVI32 Hex editor. 
They both showed the same hex values for the string  HYPERLINK.

	Barry




0
Barry
11/27/2005 4:00:59 PM
Barry Millman  wrote:

> Purl Gurl wrote:
> > Barry Millman wrote:

(snipped)

> >>The code that is not finding the text on the Win XP machine (same as
> >>the Win 2000 machine which does find the text) is:

> >   if (index ($_, "HYPERLINK") > -1)
> >    { print "HYPERLINK found at line $.\n"; }

>I tried your suggestions, but no luck.

Then you have verified the word  HYPERLINK  does not exist in your file.
Your regex will never match that word with that word not existing.

I would instantly question why that word exists in your Win2K file and
does not exist in your WinXP file. You did indicate both files are the same,
or at least I think you did. I have not gone back to read again.

> There is something really odd in MS Word storage in Win XP.  If I save
> the document to RTF it finds the stuff in the RTF file.

> I looked at both the MS Word and RTF files with the XVI32 Hex editor.
> They both showed the same hex values for the string  HYPERLINK.

I never work with MS Word nor RTF in a programming environment. I do
use those for writing business letters, however!

Documents of that type do contain binary data. This presents myriad
problems for Perl based programs.

An example problem is some binary data will create a false end of file
signal, resulting in termination of reading, early.

There are myriad other problems created by reading in an ASCII mode
and encountering binary data; no telling what will happen.

Obvious problem is you will never find  HYPERLINK  in your file simply
because that word does not exist. It is possible that word is in binary
format or in partial binary format. Again, use of Perl's  index  function
has verified that word does not exist in your file. So, there is no error.

You have some choices.

You can open your file in a binary mode, cross your fingers and search
for your string in binary format. Almost a certainty that will fail. Perl is
not all that capable of reading and "regex" searching binary data; some
characters are "out of range" for perl core.

An alternative is to print your Word / RTF file in "plaintext" to a test file,
then open and search that file as you are currently trying to do. My use
of "plaintext" means in pure ASCII format, such as these articles we
post and read.

A rather wild alternative, and I use this method at times, is to write a
simple Visual Basic macro which runs your MS Word processor, sends
a command to seek and find incidents of your HYPERLINK, then return
data to your Perl program. For some cases, you can use VB commands,

Control C

OR

Control Insert

to move data to your clipboard, then use the Win32 clipboard module to
capture that data to move it into your Perl program.

My choice would be to print your binary document to a file, in plaintext, then
use that file for my Perl program.

Whatever, index has verified your search word does not exist. Now you
know what is causing your problem.

Purl Gurl
0
Purl
11/27/2005 4:29:01 PM
Tad McClellan wrote:

(snipped)

> I don't think I can help with that part, but the code is too hokey
> to just let it pass...

Have you helped the author resolve his problem?

Purl Gurl
0
Purl
11/27/2005 5:04:03 PM
OK.  Sorry about the bad code.  However, let's reduce this to the 
minimum, removing the search for the text.  All we will do is read 
chunks of data, with this program:

-------------------- start of program --------------------------
open (TEST, "c:\\PERL\\Barry\\Starthere.rtf") || die "File Open Failed: $!";

while (<TEST>)
  {

     print( "Chunk length: ", length($_),"\n");
     $chunks += 1; 	
  }

close (TEST) || die "File Close Failed $!";

     print( $chunks, " Chunks\n");
-------------------- end of program --------------------------

Now, if I run this using Starthere.rtf, I get 1544 Chunks and they have 
all sorts of different lengths.  Some of the first chunks are of length:
103, 218, 250,1,230,63, 255.

However, if I run this using Starthere.doc, I get only ONE chunk, and it 
is of length 6 bytes.

If I examine the MS Word file using a Hex editor, I get the following 
values for bytes 5 through 7 (counting the first byte as zero):
B1   1A   E1

The 1A is the seventh byte of the file.

The PERL program (above) seems to stop at this character.

So forgetting about the search, does this yield any clues?

Thank you,

Barry




0
Barry
11/27/2005 5:58:41 PM
Purl Gurl <purlgurl@purlgurl.net> wrote:
> Tad McClellan wrote:
> 
> (snipped)
> 
>> I don't think I can help with that part, but the code is too hokey
>> to just let it pass...
> 
> Have you helped the author resolve his problem?


Have you?


-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas
0
Tad
11/27/2005 6:03:33 PM
Purl Gurl wrote:

> Whatever, index has verified your search word does not exist. Now you
> know what is causing your problem.

I have looked over Word Perfect and MS Word but not RTF formats, on a
9.x machine, a 2K machine and an XP machine.

There are some variations, but very minor. All present an ability to
save a file in a plaintext (.txt) format. Word Perfect on my XP contains
MS Word software. There is a feature which will convert various file
formats to other formats by using "all files" or specific file types.

However, I tried converting a desktop.ini file, which I know to contain
binary data, and this caused Word Perfect to hang. Eventually I had
to kill the process, with some difficulties; the "hang" remained in
RAM and I could not run Word Perfect again until a reboot. I would
suggest you not try to move outside limits of Perfect / Word software.

Bottom line is if you need to convert a MS Word document to a
plaintext format, this is very easy.

A follow-up suggestion is to use my same test script but index
for instances of,

"http:"

If returns are successful, you could pull http hyperlinks without
searching for HYPERLINK as you were doing. Upon success,
which is dubious, you could try a regex to match hyperlink
URL formats, sans your word HYPERLINK in your regex.

if ($_ =~ "http://[your regex set]+ ")
 { print success or use ( ) to capture $1 for printing }

I would expect a space to follow a hyperlink, thus my space at the end.
Perhaps Word does not binary encode hyperlinks. No guarantee.

A hex editor will display plaintext format, if in a binary file. I use
Hex Workshop v. 2.2x for this. Very old program but works with
excellence. You could simply open your Word document with a
hex editor, then search for http: from there.

Purl Gurl
0
Purl
11/27/2005 6:13:56 PM
Tad McClellan wrote:

> Purl Gurl wrote:
> > Tad McClellan wrote:

(snipped)

> >> I don't think I can help with that part, but the code is too hokey
> >> to just let it pass...

> > Have you helped the author resolve his problem?

> Have you?

I have. You have not.

Clearly you are trolling, as is your habit. Before posting this troll
article of yours, you knew I have and do know, I am helping the
author reach resolution of his problem.

This troll article of yours again affirms you are the troll many
of us know you to be, a very persistent troll at that.

Purl Gurl
0
Purl
11/27/2005 6:21:37 PM
Barry Millman wrote:

> Hi:
> 
>     I am using Perl 5 (I believe both machines are using ActivePERL 5) 
> on two machines with the same data files.  One machine is Win 2000 the 
> other is Win XP.  The files are MS Word 2000 documents e-mailed 
> (manually) from the Win 2000 machine to the XP machine.
> 
>     The program searches the MS Word Files (both created with MS Word 
> 2000) for the word HYPERLINK.  The format for the HYPERLINK that I am 
> searching for in the document is:
> 
> HYPERLINK "mydoc.doc"
 >
 > (I checked this on the XP machine in Notepad and it is OK.)
 >

Note that MS Word documents are stored in a proprietary binary 
gibberish format.  To assume that a given word in a document will 
actually always be stored in an ASCII string in the .doc file is 
assuming too much.  For example, perhaps it is stored in Unicode? 
  And maybe newer Notepad versions understand enough to present 
Unicode strings?  Try looking at your files with an editor that 
you *know* won't munge the contents.  I suggest VIM.
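
One way to test the Unicode guess without trusting any editor is to look for both byte patterns directly. This is only a sketch: the sub name find_encodings is invented here, the byte string is a stand-in for real file contents, and UTF-16LE is an assumption about how Word might store the text, not something confirmed in this thread. On Windows the real file would have to be read with binmode() first.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Report which encodings of $word occur in the raw bytes $data.
sub find_encodings {
    my ($data, $word) = @_;
    my %needle = (
        'single-byte' => $word,                      # plain ASCII/ANSI bytes
        'UTF-16LE'    => encode('UTF-16LE', $word),  # two bytes per character
    );
    return grep { index($data, $needle{$_}) >= 0 } sort keys %needle;
}

# Hypothetical byte string standing in for the contents of a .doc file.
my $bytes = "\xD0\xCF\x11\xE0" . encode('UTF-16LE', 'HYPERLINK "mydoc.doc"');
print join(', ', find_encodings($bytes, 'HYPERLINK')), "\n";
```

If only the UTF-16LE form turns up in a given file, a plain /HYPERLINK/ match would miss it even when the file is read correctly.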

It is a mystery why a document would get changed while emailing 
it from one system to another.  Or did you perhaps open the 
document with Word after emailing it, and then save it?  You 
don't say.  Is it the same version of Word?  And what email 
system are you using on each of the computers?  Does the same 
thing happen if you zip the file, email the zipped version, and 
unzip it on the other system?

> PROBLEM:     The program works on the Windows 2000 machine, but does not 
> find the files on the Win Xp machine.
> 
>     The code that is not finding the text on the Win XP machine (same as 
> the Win 2000 machine which does find the text) is:
> 
> -----------  start actual code segment --------------------
>     while (/HYPERLINK(\s+.{1,80}?\.doc)/gim)  #  the "g" causes multiple 
> matches

As others have mentioned, the /m modifier does nothing, and the 
.{1,80}? would be better as .{0,80} .

> 
>     {
>           $fndxx = $1;
> 
>           $fndxx =~ s/\"//;   # remove leading quote

Your comment doesn't match the regex -- it will remove the first 
quote, not a leading quote.

>           $fndxx =~ s/\s+//;   # remove leading spaces

Again, this will remove the first run of whitespace from the 
string, not leading whitespace.
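
If the intent really is "leading only", anchoring each substitution with ^ is the usual way to say so. A small sketch, with a made-up sub name (strip_leading) and a made-up sample string:

```perl
use strict;
use warnings;

# Strip only *leading* whitespace and one leading double quote.
# The ^ anchor limits each substitution to the start of the string.
sub strip_leading {
    my ($s) = @_;
    $s =~ s/^\s+//;   # leading whitespace only
    $s =~ s/^"//;     # a quote only if it is now the first character
    return $s;
}

print strip_leading(qq{  "my doc.doc}), "\n";
```

With the anchors in place, whitespace or quotes further inside a file name are left alone.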

>           $dir="C:\\IGINproducts\\UserDocuments\\";    
>        
>           $fullname = ($dir . $fndxx);
>           $date_string = "Cannot Find";
>           if (-e $fullname) { $date_string = ctime(stat($dir . 
> $fndxx)->mtime); }     #last update date of that file
>           print(OUTFILE  $fndxx,",",$date_string,", in: 
> ",basename($file), "\n")  ;
>           $matches += 1;  # count matches
>        
>     }    #end while HYPERLINK
> -----------  end actual code segment --------------------
> 
> The output for a found HYPERLINK should look like this (it does on the 
> Win 2000 machine):
> 
> mydoc.doc,(date of last update), in: otherdoc.doc
> 
> On Win XP, the program cannot even find the word HYPERLINK (if I modify 
> the code to just search for that).  The directories are valid, I can 
> have the program print a list of all files as it processes them.
> 
> If I try this with a test program (the string to test is in the program 
> itself ) it works fine on the XP machine.
> 
> There are no encryption issues, nor any file or directory problems.

How exactly do you know this?  Using a piece of garbage like 
Notepad won't definitively tell you this.  I would trust Perl 
much further than Notepad.
....
> Barry Millman
-- 
Bob Walton
Email: http://bwalton.com/cgi-bin/emailbob.pl
0
Bob
11/27/2005 6:22:05 PM
Barry Millman wrote:

(snipped)

> If I examine the MS Word file using a Hex editor, I get the following
> values for bytes 5 through 7 (calling the first byte as zero):
> B1   1A   E1

> The 1A is the seventh byte of the file.

> The PERL program (above) seems to stop at this character.

Possible false end of file (eof) signal or a general collapse
of the read filehandle function because of illegal characters
for the specific read mode, ASCII for what you show.

Give binmode a try.

binmode (STDOUT);

open (TEST ....

binmode (TEST);

My sincere suggestion is you pursue your binary files for fun, only.

Should you need to accomplish your task, soon, use your RTF format
or convert your Word documents to plaintext.

Working with binary files via Perl, is very challenging. Perl core is simply
not designed to handle binary data. Perl core is designed to open filehandles
for various functions, tell a system to read or write in a specific mode, but
perl core is not involved in the actual transfer of data, ASCII or binary. Perl
is designed to manipulate "plaintext" data, not binary.

You can be successful in reading and writing binary data, but most likely will
not be successful using Perl to manipulate binary data, such as substr,
index, regex and other functions; Perl is not binary capable.

I have not looked at CPAN for binary handling modules. Have a look. You might
find a module which can be adapted for your needs.

If not, I suggest you stop mucking around with binary data and get your task done. =)

Purl Gurl
0
Purl
11/27/2005 6:37:38 PM
Purl Gurl wrote:
> Purl Gurl wrote:

Isn't talking to yourself the first sign?


>
> I have looked over Word Perfect and MS Word but not RTF formats, on a
> 9.x machine, a 2K machine and an XP machine.

Somewhat irrelevant because the OP wrote " The files are MS Word 2000
documents e-mailed (manually) from the Win 2000 machine to the XP
machine."


<half-baked story about WordPerfect deleted>


> A hex editor will display plaintext format, if in a binary file. I use
> Hex Workshop v. 2.2x for this. Very old program but works with
> excellence. You could simply open your Word document with a
> hex editor, then search for http: from there.

Pay attention Kira, the OP already wrote "I looked at both the MS Word
and RTF files with the XVI32 Hex editor. They both showed the same hex
values for the string  HYPERLINK."


It's so sad to see an old rusty V8 that's only running on three
cylinders.

0
foo
11/27/2005 7:03:06 PM
Purl Gurl wrote:
> Tad McClellan wrote:
>
> > Purl Gurl wrote:
> > > Tad McClellan wrote:
>
> (snipped)
>
> > >> I don't think I can help with that part, but the code is too hokey
> > >> to just let it pass...
>
> > > Have you helped the author resolve his problem?
>
> > Have you?
>
> I have. You have not.
>

The OP wrote about MS Word and you entertained him with a pointless and
inconclusive story about an unrelated product: WordPerfect. After he
wrote about using a hex editor you advised him to use a hex editor.

0
foo
11/27/2005 7:10:33 PM
Purl Gurl wrote:
> Barry Millman wrote:
>
> (snipped)
>
> > If I examine the MS Word file using a Hex editor, I get the following
> > values for bytes 5 through 7 (calling the first byte as zero):
> > B1   1A   E1
>
> > The 1A is the seventh byte of the file.
>
> > The PERL program (above) seems to stop at this character.
>
> Possible false end of file (eof) signal

"Possible"? Don't be such an unassertive wimp Kira, it is well known
that control-Z (hex 1A) *is* the end of file marker for text files on
MS-DOS and hence (for compatibility reasons) on Win32.

Perl uses the OS for file I/O and it is inevitable that Windows stops
reading your binary file prematurely unless you tell it to use binary
mode.
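
A rough sketch of what "tell it to use binary mode" can look like in practice. The sub name slurp_binary and the temporary file are invented for the demonstration. The eight header bytes are the ones found at the start of Word .doc files, which matches the B1 1A E1 the OP saw at offsets 5 through 7.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file as raw bytes. binmode() switches off the text-mode
# layer that, on Windows, treats Ctrl-Z (0x1A) as end of file and
# translates CR LF pairs.
sub slurp_binary {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);
    local $/;                 # undef record separator: read everything at once
    my $data = <$fh>;
    close($fh) or die "Cannot close $path: $!";
    return $data;
}

# Build a small stand-in file: the 8-byte .doc header (note the 1A at
# offset 6), followed by some text to search for.
my ($out, $tmp) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 'HYPERLINK "mydoc.doc"';
close($out) or die "Cannot close $tmp: $!";

my $data = slurp_binary($tmp);
printf "read %d bytes, HYPERLINK at offset %d\n",
    length($data), index($data, 'HYPERLINK');
```

Once the bytes are in a scalar, index() and regex matches work on them like on any other string, so the search itself does not need to change.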

0
foo
11/27/2005 7:21:36 PM
Well Purl Gurl you are the BEST!!!!!

The binmode solved the problem.

Thank you all for your help.  Please don't fight!

It still seems strange that the same file, created by the same word 
processor (Word 2000) would behave differently on two different versions 
of the same OS.

Thanks to Bill Gates and his team for a wonderful morning.

All the best,

Barry



0
Barry
11/27/2005 7:50:08 PM
Barry Millman <millmanbarry@hotmail.com> wrote:

> OK.  Sorry about the bad code.


Please do not send stealth Cc's. 

That is considered a rude practice, so I'm moving on to 
someone else's post...


-- 
    Tad McClellan                          SGML consulting
    tadmc@augustmail.com                   Perl programming
    Fort Worth, Texas
0
Tad
11/27/2005 8:08:01 PM
Barry Millman wrote:

> Purl Gurl wrote:
> > Barry Millman wrote:

(snipped)

> >>The PERL program (above) seems to stop at this character.

> > Give binmode a try.

> Well Purl Gurl you are the BEST!!!!!

Nope. I am simply a good thinker, as are many. If I am considered
to excel at "something" that would be my vivid imagination.

> The binmode solved the problem.

Cool. Not sure if that solves your problem of being able
to pull hyperlinks, but I suppose this will give you a start.
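For anyone following along, here is a minimal sketch of the binmode fix inside a complete read-and-search routine. The sub name find_hyperlinks and the sample file contents are made up for illustration (five filler bytes, then the B1 1A E1 bytes reported earlier); the regex is only loosely adapted from the one in the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file in binary mode and return every
# HYPERLINK target ending in .doc.
sub find_hyperlinks {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);    # on Windows, text mode can treat 0x1A (Ctrl-Z) as end of file
    local $/;        # slurp mode: read the whole file in one go
    my $data = <$fh>;
    close($fh);

    my @found;
    while ($data =~ /HYPERLINK\s+"?(.{1,80}?\.doc)/gi) {
        push @found, $1;
    }
    return @found;
}

# Made-up sample file: filler, the reported bytes, then one link.
my ($out, $name) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "\x00" x 5, "\xB1\x1A\xE1", ' junk HYPERLINK "mydoc.doc" junk';
close($out);

print "$_\n" for find_hyperlinks($name);
```

The binmode call has to come after open and before the first read. On systems that do not distinguish text and binary files it does no harm.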

Certainly would be easier if you can pull your data directly
from a binary rather than have to jump through hoops to
arrive at a plaintext file.

> Thank you all for your help.

I enjoy challenges. This is a mutually equitable bargain. Solving
or trying to solve problems, all learn something. I love learning.

> Please don't fight!

Pffttt... the boys here are nothing more than entertainment. I don't
fight with them, I pull their puppet strings so they will entertain me.

> It still seems strange that the same file, created by the same word
> processor (Word 2000) would behave differently on two different versions
> of the same OS.

Not really. Different versions of Perl behave quite differently, as do
different versions of Windows, Mac and Linux. This is to be expected.

Over the years, I have enjoyed every version of Windows, even a very
early release which was not known as Windows. Changes in versions
are often quite radical. I play around with Mac, play around with Linux,
and have found the same to be true; constant changes in behavior.

I often harp about a lack of standards. More often, though, I harp about
a lack of factual information. Even with factual information, we are often
surprised by "undocumented" behaviors, or those behaviors not yet
discovered. For those reasons, it is very important to never state,

"This is an absolute."

Good luck with your project.

Purl Gurl
0
Purl
11/27/2005 8:12:22 PM
Tad McClellan <tadmc@augustmail.com> wrote in 
news:slrndok4h1.o2e.tadmc@magna.augustmail.com:

> Barry Millman <millmanbarry@hotmail.com> wrote:
> 
>> OK.  Sorry about the bad code.
> 
> 
> Please do not send stealth Cc's. 
> 
> That is considered a rude practice, so I'm moving on to 
> someone else's post...

Well, he seems to have found a good match (see elsethread) ;-)

Sinan
-- 
A. Sinan Unur <1usa@llenroc.ude.invalid>
(reverse each component and remove .invalid for email address)

comp.lang.perl.misc guidelines on the WWW:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

0
A
11/27/2005 9:49:18 PM
On Sun, 27 Nov 2005 18:22:05 GMT, Bob Walton
<see.sig@rochester.rr.com> wrote:

>Barry Millman wrote:
>
>> Hi:
>> 
>>     I am using Perl 5 (I believe both machines are using ActivePERL 5) 
>> on two machines with the same data files.  One machine is Win 2000 the 
>> other is Win XP.  The files are MS Word 2000 documents e-mailed 
>> (manually) from the Win 2000 machine to the XP machine.
>> 
>>     The program searches the MS Word Files (both created with MS Word 
>> 2000) for the word HYPERLINK.  The format for the HYPERLINK that I am 
>> searching for in the document is:
>> 
>> HYPERLINK "mydoc.doc"
>> 
>> (I checked this on the XP machine in Notepad and it is OK.)
>> 
>
>Note that MS Word documents are stored in a proprietary binary 
>gibberish format.  To assume that a given word in a document will 
>actually always be stored in an ASCII string in the .doc file is 
>assuming too much.  For example, perhaps it is stored in Unicode? 
>  And maybe newer Notepad versions understand enough to present 
>Unicode strings?  Try looking at your files with an editor that 
>you *know* won't munge the contents.  I suggest VIM.
>
>It is a mystery why a document would get changed while emailing 
>it from one system to another.  Or did you perhaps open the 
>document with Word after emailing it, and then save it?  You 
>don't say.  Is it the same version of Word?  And what email 
>system are you using on each of the computers?  Does the same 
>thing happen if you zip the file, email the zipped version, and 
>unzip it on the other system?
>
[--snip--]
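On the Unicode point quoted above: one quick way to check is to look for the word in both forms. A rough sketch, assuming the wide form would be UTF-16LE (an assumption for illustration, nothing in this thread confirms how Word stores it); count_word and the sample data are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: count how often a word appears as single-byte
# text and how often as UTF-16LE (two bytes per character).
sub count_word {
    my ($data, $word) = @_;
    my $narrow = quotemeta $word;
    my $wide   = quotemeta encode('UTF-16LE', $word);
    my $n_narrow = () = $data =~ /$narrow/g;
    my $n_wide   = () = $data =~ /$wide/g;
    return ($n_narrow, $n_wide);
}

# Made-up sample: one single-byte occurrence, one UTF-16LE occurrence.
my $sample = "xx HYPERLINK xx" . encode('UTF-16LE', ' HYPERLINK "mydoc.doc"');
my ($narrow, $wide) = count_word($sample, 'HYPERLINK');
print "single-byte: $narrow, UTF-16LE: $wide\n";
```

If only the second count is non-zero for a real file, a plain /HYPERLINK/ match will never fire on it, whatever the read mode.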

Yeah, "proprietary binary", that's a phrase you don't hear much.
Comparing MD5s or even checksums should resolve transmission
or open/save issues between versions/machines. Email? Maybe the AV
firewall did some elective stripping somewhere en route.
You wasted your time on this; you should have tried to code
to discern the "difference" between saves. In reality
that's what you are trying to do. Just because you can "see" some
discernible text sometimes doesn't mean it's a text stream.
You can type out a ".exe" file too. What are the odds it reads
everything to the eof sequence? Pretty good. What are the odds
it's got thousands of them in the file? Pretty good. Why?
You can't reliably code for strings in a binary stream unless
you already know the format and read the entire thing into
waiting structures. By that time you're past stream processing.
Why do you think xml was invented, or yenc or uucp? Control
codes munge up stream processing. The binary file data are
sometimes control codes when read by consoles, editors and the like.
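The MD5 comparison is only a few lines with the core Digest::MD5 module. A minimal sketch; the helper name file_md5 and the script name in the usage comment are made up, and the files are read with binmode so the digest covers every byte.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: MD5 hex digest of a file, read in binary mode.
sub file_md5 {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $digest;
}

# Usage (hypothetical script name): perl md5cmp.pl original.doc emailed.doc
if (@ARGV == 2) {
    my ($first, $second) = map { file_md5($_) } @ARGV;
    print $first eq $second ? "identical\n" : "files differ\n";
}
```

Run it on the copy from each machine. If the digests match, the file survived the email trip unchanged and the difference lies in how it is being read.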

There is no solution to the OP's problem, there is none. 
The approach is wrong. He made what engineers call a "conceptual error".
"It worked once" is not proof of concept! Given binary structured data
files, it is absolutely, positively, impossible to treat them as
streaming text in ANY search capacity, unless controls can be 
discerned from data at the search core api routines, and that's
not what it does. You can't monitor or change fast enough the api
concept of control codes. The attempt is a bridge to nowhere.
It's a good bridge but the traffic drives off the end.

>> 
>> The output for a found HYPERLINK should look like this (it does on the 
>> Win 2000 machine):
>> 
>> mydoc.doc,(date of last update), in: otherdoc.doc
>> 
>> On Win XP, the program cannot even find the word HYPERLINK (if I modify 
>> the code to just search for that).  The directories are valid, I can 
>> have the program print a list of all files as it processes them.
>> 
>> If I try this with a test program (the string to test is in the program 
>> itself ) it works fine on the XP machine.
>> 
>> There are no encryption issues, nor any file or directory problems.
>
>How exactly do you know this?  Using a piece of garbage like 
>Notepad won't definitively tell you this.  I would trust Perl 
>much further than Notepad.
>...
>> Barry Millman

0
robic0
11/28/2005 12:09:04 AM
I'd add that MS Office files are compound files, that's "file system in
a file" objects. You need an advanced hex editor like FlexHex to see
the structure of a compound file:
http://www.flexhex.com/docs/help/objects/compound_files.phtml

I doubt that Perl supports structured storage, so there may be a
problem locating the main data stream.
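Recognizing a compound file does not need structured-storage support, since the format starts with a fixed eight-byte signature. A small sketch; the sub name is_compound_file is made up. The B1 1A E1 bytes reported earlier in the thread for offsets 5 through 7 match the tail of this signature.

```perl
use strict;
use warnings;

# Signature found in the first eight bytes of an OLE2 compound file.
my $OLE_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

# Hypothetical helper: true if the file starts with the compound-file signature.
sub is_compound_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    binmode($fh);    # the signature itself contains 0x1A
    my $got = read($fh, my $header, 8);
    close($fh);
    return defined($got) && $got == 8 && $header eq $OLE_MAGIC;
}

for my $file (@ARGV) {
    printf "%s: %s\n", $file,
        is_compound_file($file) ? "compound file" : "something else";
}
```

This only identifies the container. Walking the streams inside it is a separate job.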

0
Grod
11/28/2005 9:48:15 AM
Barry Millman wrote:
> Well Purl Gurl you are the BEST!!!!!
> 
> The binmode solved the problem.
> 

Even a stopped clock tells the right time twice a day.
0
Mark
11/28/2005 6:24:20 PM
Mark Clements wrote:

> 
> Even a stopped clock tells the right time twice a day.

And it even tells us when it does so: exactly at the time at which it stopped.
0
fda
11/30/2005 7:47:41 AM