Read strings from one file and search for them in a directory containing htm files

Hi Folks,

Trust this message finds you all in great spirits.

I have a problem -

I have one file where each line is treated as ONE STRING. I need to
read each line from this file and search for that line in another
directory which contains some 100 .htm files. Once I find a matching
line that contains that ONE STRING, I need to write that ONE STRING
into another file. I need to discard those that are not found in any of
those 100 .htm files. So basically my intention is to find out if the
strings are used in any of the 100 .htm files that exist in another
directory.

I am new to awk and have tried a host of things that I found on the net.
Somehow they don't seem to provide the solution that I am looking for, or
anything close to it.

I am looking for an awk script for this. Any help will be really great
to have.

Thank you all so much in advance.
Megh

0
Meghavvarnam
11/20/2005 1:57:42 PM
comp.lang.awk

Meghavvarnam wrote:

> I have one file where each line is treated as ONE STRING. I need to
> read each line from this file and search for that line in another

Use the shell to loop over the lines/strings from
the file. Inside the loop, use grep to search for each
string in the directory.
0
ISO
11/20/2005 2:33:14 PM
I tried this :

for string in `cat ./allStrings.txt`
do
    echo "SEARCHING" $(string) ". . . ."
    grep string ./htm/*.htm
done


This does not seem to work. Do you see any problem with what I have
written? Thank you for your help.

- Megh

0
Meghavvarnam
11/20/2005 2:39:01 PM
Meghavvarnam wrote:

> I tried this :
> 
> for string in `cat ./allStrings.txt`
> do
>     echo "SEARCHING" $(string) ". . . ."
>     grep string ./htm/*.htm
> done
> 
> 
> This does not seem to work. Do you see any problem with what I have
> written. Thank you for your help.

Yes, but the above has nothing to do with awk and so should be discussed 
at comp.unix.shell instead of comp.lang.awk so I'm crossposting and 
setting followups to go there.

In the meantime, try this:

while IFS= read -r string
do
     echo "SEARCHING $string . . . ."
     grep "$string" htm/*.htm
done < allStrings.txt
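
A minimal sketch of the difference between the two loops, assuming a made-up two-line input file (demo_strings.txt and the counter variables are invented for this illustration). The unquoted backtick substitution is split on whitespace, so the loop sees words rather than lines, while "IFS= read -r" hands over each line intact:

```shell
#!/bin/sh
# Hypothetical demo file: two lines, the second one containing spaces.
tmpdir=$(mktemp -d)
printf '%s\n' 'WPA1' 'Automatic (WPA2 or WPA1)' > "$tmpdir/demo_strings.txt"

# Unquoted command substitution: the shell splits the output on
# whitespace, so each word becomes a separate loop iteration.
for_count=0
for string in `cat "$tmpdir/demo_strings.txt"`
do
    for_count=$((for_count + 1))
done

# read -r with an empty IFS: one iteration per line, text left untouched.
read_count=0
while IFS= read -r string
do
    read_count=$((read_count + 1))
    last_line=$string
done < "$tmpdir/demo_strings.txt"

echo "for loop iterations:   $for_count"
echo "while loop iterations: $read_count"
rm -rf "$tmpdir"
```

With this input the for loop runs once per word, the while loop once per line.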

read this for an explanation:

	http://home.comcast.net/~j.p.h/cus-faq-2.html#14

and read this before posting again:

	http://cfaj.freeshell.org/google

Regards,

	Ed.
0
Ed
11/20/2005 2:51:18 PM
Ed,

Thanks a lot for cross posting over here.

Here is the Original Problem Folks:

I have one file, named allStrings.txt where each line is treated as ONE
STRING. I need to read each line from this file and search for that
line in another directory which contains some 100 .htm files. Once I
find a matching line that contains that ONE STRING, I need to write
that ONE STRING into another file. I need to discard those that are not
found in any of those 100 .htm files.

So basically my intention is to find out if the strings are used in any
of the 100 .htm files that exist in another directory. And if they
exist simply write the string into a new file - let us call the file
"usedStrings.txt". So basically when I compare usedStrings.txt and
allStrings.txt I will know which of them are NOT USED.

I am looking for a shell script for this. Any help will be really great
to have.

Thank you all so much in advance.
Megh

0
Meghavvarnam
11/20/2005 3:06:56 PM
Meghavvarnam wrote:

> Ed,
> 
> Thanks a lot for cross posting over here.

No problem, now as I said in my previous posting, PLEASE:

> read this befre posting again:
> 
>     http://cfaj.freeshell.org/google 

Regards,

	Ed.
0
Ed
11/20/2005 3:27:57 PM
I read and I understood that we should not be putting the original
message again over here.

However I posted it only to make it easier for folks who read it so
that they understand the original problem

Apologies if it REEEAlly was a violation.

Thanks,
Megh

0
Meghavvarnam
11/20/2005 3:30:55 PM
Meghavvarnam wrote:

> Ed,
> 
> Thanks a lot for cross posting over here.
> 
> Here is the Original Problem Folks:
> 
> I have one file, named allStrings.txt where each line is treated as ONE
> STRING. I need to read each line from this file and search for that
> line in another directory which contains some 100 .html files. Once I
> find a matching line that contains that ONE STRING, I need to write
> that ONE STRING into another file. I need to discard those that are not
> found in any of those 100 .htm files.
> 
> So basically my intention is to find out if the strings are used in any
> of the 100 .htm files that exist in another directory. And if they
> exist simply write the string into a new file - let us call the file
> "usedStrings.txt". So basically when I compare usedStrings.txt and
> allStrings.txt I will know which of them are NOT USED.
> 
> I am looking for a shell script for this. Any help will be really great
> to have.
> 
> Thank you all so much in advance.
> Megh
> 

Try this:

 > usedStrings.txt
while IFS= read -r string
do
     # -x: match whole lines only; -F: treat the string as fixed text
     grep -qxF -e "$string" directory/*.htm &&
	echo "$string" >> usedStrings.txt
done < allStrings.txt

Regards,

	Ed.
0
Ed
11/20/2005 3:32:45 PM
> I have one file, named allStrings.txt where each line is treated as ONE
> STRING. I need to read each line from this file and search for that
> line in another directory which contains some 100 .html files. Once I
> find a matching line that contains that ONE STRING, I need to write
> that ONE STRING into another file. I need to discard those that are not
> found in any of those 100 .htm files.
>
> So basically my intention is to find out if the strings are used in any
> of the 100 .htm files that exist in another directory. And if they
> exist simply write the string into a new file - let us call the file
> "usedStrings.txt". So basically when I compare usedStrings.txt and
> allStrings.txt I will know which of them are NOT USED.
>
> I am looking for a shell script for this. Any help will be really great
> to have.

I thought I had already seen a solution to this posted.  Something like
this ought to work:

  #!/bin/sh

  : ${HTMLDIR:=/path/to/some/directory}

  while IFS= read -r string; do
    if grep -lrF "$string" "$HTMLDIR" > /dev/null 2>&1; then
      echo "$string" >> usedStrings.txt
    fi
  done < allStrings.txt

In the 'grep' command above:

  -r for recursive searching, since with a very large number of files
     the expanded directory/*.htm list could exceed the command-line
     length limit and result in an error,

  -F to search for fixed strings, rather than treating the lines of
     allStrings.txt as regular expressions, and

  -l to make grep stop scanning a file as soon as it finds the
     first match (it prints the filename, but we discard that).
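
To illustrate why -F matters here, the following is a rough sketch using invented sample data (the file page.htm, its contents and the result variables are assumptions made for the demonstration, not taken from the real .htm files):

```shell
#!/bin/sh
# Hypothetical sample data, invented for this illustration.
tmpdir=$(mktemp -d)
printf '%s\n' '<td>Press [OK] to continue</td>' \
              '<option>WPA1</option>' > "$tmpdir/page.htm"

# As a regular expression, [OK] is a bracket expression that matches a
# single character (O or K), so the literal text "[OK]" is not found.
if grep -q 'Press [OK] to' "$tmpdir/page.htm"
then regex_literal=found; else regex_literal=missed; fi

# With -F the pattern is a fixed string and matches the literal text.
if grep -qF 'Press [OK] to' "$tmpdir/page.htm"
then fixed_literal=found; else fixed_literal=missed; fi

# The reverse problem: as a regex, "WPA." matches "WPA1" because "."
# stands for any character; as a fixed string it does not match.
if grep -q 'WPA.' "$tmpdir/page.htm"
then regex_dot=found; else regex_dot=missed; fi
if grep -qF 'WPA.' "$tmpdir/page.htm"
then fixed_dot=found; else fixed_dot=missed; fi

echo "regex [OK]: $regex_literal, fixed [OK]: $fixed_literal"
echo "regex WPA.: $regex_dot, fixed WPA.: $fixed_dot"
rm -rf "$tmpdir"
```

Lines of allStrings.txt that contain characters such as ".", "[" or "*" can therefore be both missed and falsely reported unless -F is given.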

-- Lars

-- 
Lars Kellogg-Stedman <8273grkci8q8kgt@jetable.net>
This email address will expire on 2005-11-23.

0
Lars
11/20/2005 3:33:47 PM
Meghavvarnam wrote:

> I read and I understood that we should not be putting the original
> message again over here.

You misunderstood. Please re-read it. It really is pretty clear on what 
the problem is and precisely what steps to take on Google groups to fix it.

	Ed.

0
Ed
11/20/2005 3:36:10 PM
Thank you Lars.

I just tried it and it does not seem to print the desired result in
usedStrings.txt.

I grep'd usedStrings.txt for a string that does exist in the .htm files
and did not find it.

A string that does not exist in them is not in usedStrings.txt either.

Sorry, it just does not seem to work.

I am tense as well, since I really need a solution for this.

Thanks,
Megh

0
Meghavvarnam
11/20/2005 3:53:46 PM
Ed,

I will certainly re-read it, but when I am really at peace.

Apologies again that I bothered you. I did not know that a small mistake
would cause so much disturbance. I still live in an imperfect world.

Thanks,
Megh

0
Meghavvarnam
11/20/2005 3:55:29 PM
> I just tried it and it does not seem to print the desired result in
> usedStrings.txt.

Megh,

In the absence of some sample input data and sample files there's only
so much we can do.  The script as written works just fine for the simple
tests I threw at it, but it certainly isn't particularly robust.

> Am tensed as well since I really need a solution  for this.

The task you're trying to accomplish is really a very simple one, and
using my script or one of the others that has been suggested as a
starting point you should be able to solve it.

-- Lars

-- 
Lars Kellogg-Stedman <8273grkci8q8kgt@jetable.net>
This email address will expire on 2005-11-23.

0
Lars
11/20/2005 4:03:24 PM
Meghavvarnam wrote:

> Ed,
> 
> I will certainly re-read it.. but when am really peacefull.
> 
> Apologies again that I bothered. I did not know that  a small mistake
> would cause so much of disturbance. I still live in a imperfect world.

There's no disturbance here. I'm just trying to help you learn how to 
post to usenet so people can understand what problem you're trying to 
solve. As of now, thanks to using Google Groups, you're unknowingly 
posting a bunch of disconnected statements which some people will find 
difficult to follow (and many more will just find plain annoying) and so 
won't bother trying to piece back together to understand. That link I 
keep referring you to will tell you how to easily fix your problem so 
you can provide the right information to everyone so you stand the best 
chance of getting a solution to your problem. I understand where you're 
coming from, in that you just want to get your problem fixed, but you're 
currently making that much more of a challenge than it has to be.

	Ed.
0
Ed
11/20/2005 4:04:26 PM
In article 
<1132495062.788033.272840@g14g2000cwa.googlegroups.com>,
 "Meghavvarnam" <meghsatish@yahoo.com> wrote:

> Hi Folks,
> 
> Trust this message finds you all in great spirits.
> 
> I have a problem -
> 
> I have one file where each line is treated as ONE STRING. I need to
> read each line from this file and search for that line in another
> directory which contains some 100 .html files. Once I find a matching
> line that contains that ONE STRING, I need to write that ONE STRING
> into another file. I need to discard those that are not found in any of
> those 100 .htm files. So basically my intention is to find out if the
> strings are used in any of the 100 .htm files that exist in another
> directory.
> 
> Am new to awk and tried a host of things that I found on the net. Some
> how they dont seem to provide the solution that am looking for or
> anywhere close to it.
> 
> I am looking for an awk script for this. Any help will be really great
> to have.
> 
> Thank you all so much in advance.
> Megh

If file ONE.STRING contains entire lines to be found in the 100 
.htm files, then try this:

awk '
    FNR == 1 { filecount++ }
    filecount == 1 { match_line[$0]=1 }
    filecount != 1 { if ( $0 in match_line ) print $0 }
' ONE.STRING *.htm >output.file

If ONE.STRING contains just search strings that might exist in a 
line of *.htm, then try this:

awk '
    FNR == 1 { filecount++ }
    filecount == 1 { match_string[$0]=1 }
    filecount != 1 {
        for( str in match_string ) {
            if ( $0 ~ str ) print $0
        }
    }
' ONE.STRING *.htm >output.file

This 2nd one is a bit more CPU intensive, but it should do the job.

                                        Bob Harris
0
Bob
11/20/2005 11:35:34 PM
> Megh,
>
> In the absence of some sample input data and sample files there's only
> so much we can do.  The script as written works just fine for the simple
> tests I through at it, but it certainly isn't particularly robust.

Lars,

I understand the difficulty when we don't have the data. It's difficult
to get a sense of what the behaviour of our code should be. Here is
part of a huge .htm file that I have:

================== Begin Snippet ====================
                  <h3 class="subTitle">Reset Bluetooth</h3>
                  <div class="pad10">
                     <table width="450" class="tabbedcontent" summary
="This table is used to display the Bluetooth interface configuration
parameters.">
                        <tr>
                           <td class="clf" colspan="3">
                           Use this option to reset Bluetooth to
factory default settings.
                           </td>
                        </tr>
                        <tr>
                           <td width="5%">&nbsp;</td>
                           <td class="clf" colspan="2">
                           <INPUT type="radio"
name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
value="choice_bt_reset_bluetooth_yes" accesskey="y" >
                           Yes, reset Bluetooth
                           </td>
                        </tr>
                        <tr>
                           <td width="5%">&nbsp;</td>
                           <td class="clf" colspan="2">
                           <INPUT type="radio"
name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
value="choice_bt_reset_bluetooth_no" accesskey="n" CHECKED>
                           No
                           </td>
                        </tr>
                     </table>
                  </div> <!-- end div pad10 -->
================== End Snippet ====================

Now the file that contains all the strings that I search for has, for
example, the string "Reset".

When I look for "Reset", I need output that meets the following
criteria:
1. Only lines that contain just "Reset", and not a line from the above
snippet like <h3 class="subTitle">Reset Bluetooth</h3>
2. The match is case sensitive. For example, lines that only have
"reset" are not printed.
3. Since we search in HTML files, strings found in a comment do
not form part of the output.
4. Basically, attributes like title, summary and value need to contain,
for example, ONLY "Reset" to match the criteria and be part of the output.

Thank you so much for your understanding and support.
Best Regards,
Megh

> > Am tensed as well since I really need a solution  for this.
>
> The task you're trying to accomplish is really a very simple one, and
> using my script or one of the others that has been suggested as a
> starting point you should be able to solve it.
>
> -- Lars
>
> --
> Lars Kellogg-Stedman <8273grkci8q8kgt@jetable.net>
> This email address will expire on 2005-11-23.

0
Meghavvarnam
11/21/2005 12:34:08 PM
Meghavvarnam <meghsatish@yahoo.com> wrote:
> I tried this :
> 
> for string in `cat ./allStrings.txt`
> do
>    echo "SEARCHING" $(string) ". . . ."
>    grep string ./htm/*.htm
> done
> 
> 
> This does not seem to work. Do you see any problem with what I have
> written. Thank you for your help.
> 
> - Megh
> 
How about the following?

while IFS= read -r string
do
   grep "$string" ./htm/*.htm >/dev/null 2>&1
   if [ $? -eq 0 ]; then echo "$string"; fi
done < ./allStrings.txt

grep exits with 0 if any input line matches "$string".  A more
succinct way is:

while IFS= read -r string
do
   grep "$string" ./htm/*.htm >/dev/null 2>&1 && echo "$string"
done < ./allStrings.txt
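
The exit statuses that both loops rely on can be checked with a short sketch like the one below. The sample file and the status_* variable names are made up for the demonstration; the statuses themselves (0 for a match, 1 for no match, greater than 1 for an error) are the ones POSIX specifies for grep:

```shell
#!/bin/sh
# Hypothetical one-line sample file, invented for this illustration.
tmpdir=$(mktemp -d)
printf '%s\n' '<option value="WPA1">WPA1</option>' > "$tmpdir/page.htm"

# 0: at least one input line matched.
status_match=0
grep -q 'WPA1' "$tmpdir/page.htm" || status_match=$?

# 1: the file was read, but no line matched.
status_nomatch=0
grep -q 'WPA9' "$tmpdir/page.htm" || status_nomatch=$?

# Greater than 1: an error occurred (here, the file does not exist).
status_error=0
grep -q 'WPA1' "$tmpdir/missing.htm" 2>/dev/null || status_error=$?

echo "match=$status_match nomatch=$status_nomatch error=$status_error"
rm -rf "$tmpdir"
```

Because "&&" only looks at whether the status is 0, an unreadable file is treated the same as "string not used", which is worth keeping in mind when the results look suspiciously empty.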

HTH.

Chih-Cherng Chin

0
Chih
11/21/2005 1:33:22 PM
Meghavvarnam wrote:
 >>Megh,
 >>
 >>In the absence of some sample input data and sample files there's only
 >>so much we can do.  The script as written works just fine for the simple
 >>tests I through at it, but it certainly isn't particularly robust.
 >
 >
 > Lars,
 >
 > I understand the difficulty when we dont have the data. Its difficult
 > to get a sense of what the behaviour of our code should be. Here is
 > part of a huge .htm file that I have :
 >
 > ================== Begin Snippet ====================
 >                   <h3 class="subTitle">Reset Bluetooth</h3>
 >                   <div class="pad10">
 >                      <table width="450" class="tabbedcontent" summary
 > ="This table is used to display the Bluetooth interface configuration
 > parameters.">
 >                         <tr>
 >                            <td class="clf" colspan="3">
 >                            Use this option to reset Bluetooth to
 > factory default settings.
 >                            </td>
 >                         </tr>
 >                         <tr>
 >                            <td width="5%">&nbsp;</td>
 >                            <td class="clf" colspan="2">
 >                            <INPUT type="radio"
 > name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
 > value="choice_bt_reset_bluetooth_yes" accesskey="y" >
 >                            Yes, reset Bluetooth
 >                            </td>
 >                         </tr>
 >                         <tr>
 >                            <td width="5%">&nbsp;</td>
 >                            <td class="clf" colspan="2">
 >                            <INPUT type="radio"
 > name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
 > value="choice_bt_reset_bluetooth_no" accesskey="n" CHECKED>
 >                            No
 >                            </td>
 >                         </tr>
 >                      </table>
 >                   </div> <!-- end div pad10 -->
 > ================== End Snippet ====================
 >
 > Now the file that contains all the strings that I search for has, for
 > example, a string - "Reset".
 >
 > When I look for "Reset", I need as output that meets the following
 > criteria :
 > 1. Only lines that contain "Reset" and not a line from the aboe snippet
 > like - <h3 class="subTitle">Reset Bluetooth</h3>
 > 2. The output is also case sensitive. For example, it will not print
 > lines that may have "reset".
 > 3. And since we search in a html file, strings found in a comment do
 > not form part of the output.
 > 4. Basically tags like title, summary, value need to contain for
 > example, ONLY "Reset", to match the criteria and be part of the output.

So, it sounds like none of the lines in your sample input actually match 
your search criteria. Is that right? It now sounds like what you really 
want is something like this:

 > usedStrings.txt
while IFS= read -r string
do
     gawk -v string="=\"$string\"" '{
	for (i=1;i<=NF;i++) {
	    if ($i ~ string) {
		# the output file name must be a quoted awk string; append,
		# since gawk is started afresh for every search string
		print FILENAME >> "usedStrings.txt"
		nextfile
	    }
	}
     }' directory/*.htm
done < allStrings.txt

But posting sample input with some matches to your selection criteria, 
plus the expected output, would help. Note that I used "gawk" to take 
advantage of its "nextfile" statement. If you use some other awk, you 
need to work around that.

Glad to see you've overcome your google groups non-quoting hurdle!

	Ed.


0
Ed
11/21/2005 1:35:08 PM
Ed Morton wrote:
> Meghavvarnam wrote:
>  >>Megh,
>  >>
>  >>In the absence of some sample input data and sample files there's only
>  >>so much we can do.  The script as written works just fine for the simple
>  >>tests I through at it, but it certainly isn't particularly robust.
>  >
>  >
>  > Lars,
>  >
>  > I understand the difficulty when we dont have the data. Its difficult
>  > to get a sense of what the behaviour of our code should be. Here is
>  > part of a huge .htm file that I have :
>  >
>  > ================== Begin Snippet ====================
>  >                   <h3 class="subTitle">Reset Bluetooth</h3>
>  >                   <div class="pad10">
>  >                      <table width="450" class="tabbedcontent" summary
>  > ="This table is used to display the Bluetooth interface configuration
>  > parameters.">
>  >                         <tr>
>  >                            <td class="clf" colspan="3">
>  >                            Use this option to reset Bluetooth to
>  > factory default settings.
>  >                            </td>
>  >                         </tr>
>  >                         <tr>
>  >                            <td width="5%">&nbsp;</td>
>  >                            <td class="clf" colspan="2">
>  >                            <INPUT type="radio"
>  > name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
>  > value="choice_bt_reset_bluetooth_yes" accesskey="y" >
>  >                            Yes, reset Bluetooth
>  >                            </td>
>  >                         </tr>
>  >                         <tr>
>  >                            <td width="5%">&nbsp;</td>
>  >                            <td class="clf" colspan="2">
>  >                            <INPUT type="radio"
>  > name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
>  > value="choice_bt_reset_bluetooth_no" accesskey="n" CHECKED>
>  >                            No
>  >                            </td>
>  >                         </tr>
>  >                      </table>
>  >                   </div> <!-- end div pad10 -->
>  > ================== End Snippet ====================
>  >
>  > Now the file that contains all the strings that I search for has, for
>  > example, a string - "Reset".
>  >
>  > When I look for "Reset", I need as output that meets the following
>  > criteria :
>  > 1. Only lines that contain "Reset" and not a line from the aboe snippet
>  > like - <h3 class="subTitle">Reset Bluetooth</h3>
>  > 2. The output is also case sensitive. For example, it will not print
>  > lines that may have "reset".
>  > 3. And since we search in a html file, strings found in a comment do
>  > not form part of the output.
>  > 4. Basically tags like title, summary, value need to contain for
>  > example, ONLY "Reset", to match the criteria and be part of the output.
>
> So, it sounds like none of the lines in your sample input actually match
> your search criteria. Is that right? It now sounds like what you really
> want is something like this:
>
>  > usedStrings.txt
> while IFS= read -r string
> do
>      gawk -vstring="=\"$string\"" '{
> 	for (i=1;i<=NF;i++) {
> 	    if ($i ~ string) {
> 		print FILENAME > usedStrings.txt
> 		nextfile
> 	    }
> 	}
>      }' directory/*.htm
> done < allStrings.txt
>
> put posting sample input with some matches to your selection criteria
> plus that expected output would help. Note that I used "gawk" to take
> advantage of it's "nextfile" operator. If you use some other awk, you
> need to work around that.
>
> Glad to see you've overcome your google groups non-quoting hurdle!
>

Ed,

Thanks for your input. Well, it was not a hurdle. It's just that I had
to read that URL peacefully.

I will try that script out. And I use gawk too! So I am hoping it will
work.

Thanks again. 
Megh

> 	Ed.

0
Meghavvarnam
11/22/2005 5:37:21 AM
Ed Morton wrote:
> Meghavvarnam wrote:
>  >>Megh,
>  >>
>  >>In the absence of some sample input data and sample files there's only
>  >>so much we can do.  The script as written works just fine for the simple
>  >>tests I through at it, but it certainly isn't particularly robust.
>  >
>  >
>  > Lars,
>  >
>  > I understand the difficulty when we dont have the data. Its difficult
>  > to get a sense of what the behaviour of our code should be. Here is
>  > part of a huge .htm file that I have :
>  >
>  > ================== Begin Snippet ====================
>  >                   <h3 class="subTitle">Reset Bluetooth</h3>
>  >                   <div class="pad10">
>  >                      <table width="450" class="tabbedcontent" summary
>  > ="This table is used to display the Bluetooth interface configuration
>  > parameters.">
>  >                         <tr>
>  >                            <td class="clf" colspan="3">
>  >                            Use this option to reset Bluetooth to
>  > factory default settings.
>  >                            </td>
>  >                         </tr>
>  >                         <tr>
>  >                            <td width="5%">&nbsp;</td>
>  >                            <td class="clf" colspan="2">
>  >                            <INPUT type="radio"
>  > name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
>  > value="choice_bt_reset_bluetooth_yes" accesskey="y" >
>  >                            Yes, reset Bluetooth
>  >                            </td>
>  >                         </tr>
>  >                         <tr>
>  >                            <td width="5%">&nbsp;</td>
>  >                            <td class="clf" colspan="2">
>  >                            <INPUT type="radio"
>  > name="bt_reset_bluetooth" title="Select a reset Bluetooth setting"
>  > value="choice_bt_reset_bluetooth_no" accesskey="n" CHECKED>
>  >                            No
>  >                            </td>
>  >                         </tr>
>  >                      </table>
>  >                   </div> <!-- end div pad10 -->
>  > ================== End Snippet ====================
>  >
>  > Now the file that contains all the strings that I search for has, for
>  > example, a string - "Reset".
>  >
>  > When I look for "Reset", I need as output that meets the following
>  > criteria :
>  > 1. Only lines that contain "Reset" and not a line from the aboe snippet
>  > like - <h3 class="subTitle">Reset Bluetooth</h3>
>  > 2. The output is also case sensitive. For example, it will not print
>  > lines that may have "reset".
>  > 3. And since we search in a html file, strings found in a comment do
>  > not form part of the output.
>  > 4. Basically tags like title, summary, value need to contain for
>  > example, ONLY "Reset", to match the criteria and be part of the output.
>
> So, it sounds like none of the lines in your sample input actually match
> your search criteria. Is that right? It now sounds like what you really
> want is something like this:
>
>  > usedStrings.txt
> while IFS= read -r string
> do
>      gawk -vstring="=\"$string\"" '{
> 	for (i=1;i<=NF;i++) {
> 	    if ($i ~ string) {
> 		print FILENAME > usedStrings.txt
> 		nextfile
> 	    }
> 	}
>      }' directory/*.htm
> done < allStrings.txt
>
> put posting sample input with some matches to your selection criteria
> plus that expected output would help. Note that I used "gawk" to take
> advantage of it's "nextfile" operator. If you use some other awk, you
> need to work around that.

Sample data does help a great deal. Here it is:

allStrings.txt contains lines like these -
=================== Begin allStrings.txt ====================
WPA1
WPA2
Automatic (WPA2 or WPA1)
XyZ technology helps make home networking simple.
XyZ architecture offers network connectivity between personal
computers, printers, intelligent appliances and wireless devices.
XyZ architecture leverages ABC/DE and the Web to enable seamless
proximity networking in addition to control and data transfer among
networked devices in the home and office.
If you enable XYZ , then XYZ-enabled devices can print to this device.
Privacy
SampleText:<br> Simpler, smarter online supplies ordering
Learn more about <br>XYZ SampleText
Transfer printer information to XYZ SampleText?
==================== End allStrings.txt =====================

This means the script will search for these lines in the .htm files. Each
of these lines needs to appear as is (case sensitive) for there to be
a match. Now consider the 3rd line in the file above -
Automatic (WPA2 or WPA1).

From the .htm snippet pasted below, the third option tag contains the
search string -

Automatic (WPA2 or WPA1)

So when a match like this occurs, I simply need to write Automatic
(WPA2 or WPA1) into the file usedStrings.txt.

==================== Begin .htm Snippet =====================
                        <tr>
                           <td>&nbsp;</td>
                           <td
class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WPA
Version
                           </td>
                           <td>
                              <select name="wpa_version" size="1"
title="Select a WPA version setting">
                                 <option value="WPA1">WPA1</option>
                                 <option value="WPA2">WPA2</option>
                                 <option selected="SELECTED"
value="Automatic">Automatic (WPA2 or WPA1)</option>
                              </select>
                           </td>
                        </tr>
                        <tr>
                           <td>&nbsp;</td>
                           <td
class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Encryption:
                           </td>
                           <td>
                              <select name="encr_type" size="1"
title="Select an encryption setting">
                                 <option value="AES_TKIP">Automatic
(AES or TKIP)</option>
                                 <option value="AES">AES</option>
                                 <option value="TKIP">TKIP</option>
                              </select>
                           </td>
                        </tr>
===================== End .htm Snippet ======================

I also tried the script that you sent in your previous post. It created
the file usedStrings.txt, but it does not populate it. The file
remains empty.

In the process you are helping me learn awk as well.

Thank you so much once again for all the help! I am growing to
understand some awk scripts and their behaviour.

Warm Regards,
Megh

> Glad to see you've overcome your google groups non-quoting hurdle!
> 
> 	Ed.

0
Meghavvarnam
11/22/2005 9:28:15 AM
Meghavvarnam wrote:

<snip>
> Sample data does help a great deal. Here it is:
> 
> allStrings.txt contains lines likes these -
> =================== Begin allStrings.txt ====================
> WPA1
> WPA2
> Automatic (WPA2 or WPA1)
> XyZ technology helps make home networking simple.
> XyZ architecture offers network connectivity between personal
> computers, printers, intelligent appliances and wireless devices.
> XyZ architecture leverages ABC/DE and the Web to enable seamless
> proximity networking in addition to control and data transfer among
> networked devices in the home and office.
> If you enable XYZ , then XYZ-enabled devices can print to this device.
> Privacy
> SampleText:<br> Simpler, smarter online supplies ordering
> Learn more about <br>XYZ SampleText
> Transfer printer information to XYZ SampleText?
> ==================== End allStrings.txt =====================
> 
> Which means, the script will search for these lines in .htm files. Each
> of these lines need to appear as is (case sensitive) to say that there
> is a match. Now consider we read the 3rd line in the file above -
> Automatic (WPA2 or WPA1).
> 
>>From the .htm snippet pasted below, the third option tag contains the
> search string -
> 
> Automatic (WPA2 or WPA1)
> 
> So when a match like this occurs, I simply need to write Automatic
> (WPA2 or WPA1) in the files - usedStrings.
> 
> ==================== Begin .htm Snippet =====================
>                         <tr>
>                            <td>&nbsp;</td>
>                            <td
> class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WPA
> Version
>                            </td>
>                            <td>
>                               <select name="wpa_version" size="1"
> title="Select a WPA version setting">
>                                  <option value="WPA1">WPA1</option>
>                                  <option value="WPA2">WPA2</option>
>                                  <option selected="SELECTED"
> value="Automatic">Automatic (WPA2 or WPA1)</option>
>                               </select>
>                            </td>
>                         </tr>
>                         <tr>
>                            <td>&nbsp;</td>
>                            <td
> class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> Encryption:
>                            </td>
>                            <td>
>                               <select name="encr_type" size="1"
> title="Select an encryption setting">
>                                  <option value="AES_TKIP">Automatic
> (AES or TKIP)</option>
>                                  <option value="AES">AES</option>
>                                  <option value="TKIP">TKIP</option>
>                               </select>
>                            </td>
>                         </tr>
> ===================== End .htm Snippet ======================
> 

Now we're back to roughly my original suggestion, provided there are no
newlines in the searched text:

 > usedStrings.txt
while IFS= read -r string
do
     grep -q ">${string}<" directory/*.htm &&
     echo "$string" >> usedStrings.txt
done < allStrings.txt

Alternatively, doing it all in awk, it's:

gawk 'NR==FNR{strings[$0]++;next}
     { for (string in strings}
	    if (index($0,">"string"<") {
		usedStrings[string]++
		delete strings[string]	# for efficiency
	    }
     }
     END { for (string in usedStrings)
	    print string
     }' allStrings.txt directory/*.htm > usedStrings.txt

Note that, since you said something in a previous posting about only 
wanting to look for text when it's part of an HTML tag (or something 
like that...), the search for ">"string"<" surrounds the line from 
"allStrings.txt" with ">" and "<" so it only matches when the text 
appears between those two characters. If you don't want that restriction, 
just get rid of the ">" and "<". Similarly for the grep solution.
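
One caveat on the grep loop, offered as a sketch rather than a tested fix: unless it is given -F, grep reads each line of allStrings.txt as a regular expression, so a "." in a string matches any character. The version below adds -F and runs on made-up sample data in a temporary directory; the file names follow the thread, the contents are assumptions.

```shell
# Hypothetical sample data; only the file names come from the thread.
workdir=$(mktemp -d)
mkdir "$workdir/htm"
printf '%s\n' 'WPA1' 'Automatic (WPA2 or WPA1)' 'simple.' 'Privacy' \
    > "$workdir/allStrings.txt"
printf '%s\n' '<option value="WPA1">WPA1</option>' \
              '<option value="Automatic">Automatic (WPA2 or WPA1)</option>' \
              '<td>simpleX</td>' > "$workdir/htm/page.htm"

: > "$workdir/usedStrings.txt"          # start with an empty result file
while IFS= read -r string
do
    # -F: treat the pattern as a fixed string; -q: no output, status only.
    if grep -qF -e ">${string}<" "$workdir"/htm/*.htm
    then
        printf '%s\n' "$string" >> "$workdir/usedStrings.txt"
    fi
done < "$workdir/allStrings.txt"

used=$(cat "$workdir/usedStrings.txt")
printf '%s\n' "$used"
rm -rf "$workdir"
```

Here "simple." is rejected because the page only contains "simpleX"; without -F the "." would match the "X" and the string would be reported as used.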

If you'd like the awk script to tell you which strings are/aren't used, 
that's trivial, e.g.:

gawk 'NR==FNR{strings[$0]++;next}
     { for (string in strings}
	    if (index($0,">"string"<") {
		usedStrings[string]++
		delete strings[string]	# for efficiency
	    }
     }
     END {
	print "Used Strings:"
	for (string in usedStrings)
	    printf "\t%s\n",string
	print "Unused Strings:"
	for (string in strings)
	    printf "\t%s\n",string
     }' allStrings.txt directory/*.htm

If there can be newlines in the strings you're trying to match in the 
HTML files, then we need to figure out what "match" means, since there 
aren't newlines in the strings in "allStrings.txt", and we need to figure 
out a record separator other than a newline char.
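
If the matches do turn out to span lines, one possible rule (an assumption, not something settled in this thread) is to treat a line break plus its surrounding blanks as a single space. A rough sketch in plain awk, on invented sample data, gathers each file into one string before searching:

```shell
# Hypothetical sample data: the first string is split across two lines.
workdir=$(mktemp -d)
mkdir "$workdir/htm"
printf '%s\n' 'Automatic (AES or TKIP)' 'AES' 'Privacy' \
    > "$workdir/allStrings.txt"
printf '%s\n' '<option value="AES_TKIP">Automatic' \
              '(AES or TKIP)</option>' \
              '<option value="AES">AES</option>' > "$workdir/htm/page.htm"

used=$(awk '
    # First file: remember every candidate string.
    NR == FNR { strings[$0]; next }
    # Start of each later file: search the text gathered from the previous one.
    FNR == 1 { scan(); text = "" }
    # Trim each line and join the lines of a file with single spaces.
    {
        line = $0
        gsub(/^[ \t]+|[ \t\r]+$/, "", line)
        text = text " " line
    }
    END { scan(); for (s in used) print s }
    function scan(   s) {
        for (s in strings)
            if (index(text, ">" s "<"))
                used[s] = 1
    }
' "$workdir/allStrings.txt" "$workdir"/htm/*.htm | sort)

printf '%s\n' "$used"
rm -rf "$workdir"
```

The whole file is held in memory, which should be acceptable for roughly 100 small .htm files. Text that starts on the line after a ">" will not match under this rule, because the joined text then has a space after the ">".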

	Ed.
0
Ed
11/22/2005 1:31:03 PM
Ed Morton wrote:
> Meghavvarnam wrote:
>
> <snip>
> > Sample data does help a great deal. Here it is:
> >
> > allStrings.txt contains lines likes these -
> > =================== Begin allStrings.txt ====================
> > WPA1
> > WPA2
> > Automatic (WPA2 or WPA1)
> > XyZ technology helps make home networking simple.
> > XyZ architecture offers network connectivity between personal
> > computers, printers, intelligent appliances and wireless devices.
> > XyZ architecture leverages ABC/DE and the Web to enable seamless
> > proximity networking in addition to control and data transfer among
> > networked devices in the home and office.
> > If you enable XYZ , then XYZ-enabled devices can print to this device.
> > Privacy
> > SampleText:<br> Simpler, smarter online supplies ordering
> > Learn more about <br>XYZ SampleText
> > Transfer printer information to XYZ SampleText?
> > ==================== End allStrings.txt =====================
> >
> > Which means, the script will search for these lines in .htm files. Each
> > of these lines need to appear as is (case sensitive) to say that there
> > is a match. Now consider we read the 3rd line in the file above -
> > Automatic (WPA2 or WPA1).
> >
> >>From the .htm snippet pasted below, the third option tag contains the
> > search string -
> >
> > Automatic (WPA2 or WPA1)
> >
> > So when a match like this occurs, I simply need to write Automatic
> > (WPA2 or WPA1) in the files - usedStrings.
> >
> > ==================== Begin .htm Snippet =====================
> >                         <tr>
> >                            <td>&nbsp;</td>
> >                            <td
> > class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WPA
> > Version
> >                            </td>
> >                            <td>
> >                               <select name="wpa_version" size="1"
> > title="Select a WPA version setting">
> >                                  <option value="WPA1">WPA1</option>
> >                                  <option value="WPA2">WPA2</option>
> >                                  <option selected="SELECTED"
> > value="Automatic">Automatic (WPA2 or WPA1)</option>
> >                               </select>
> >                            </td>
> >                         </tr>
> >                         <tr>
> >                            <td>&nbsp;</td>
> >                            <td
> > class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> > Encryption:
> >                            </td>
> >                            <td>
> >                               <select name="encr_type" size="1"
> > title="Select an encryption setting">
> >                                  <option value="AES_TKIP">Automatic
> > (AES or TKIP)</option>
> >                                  <option value="AES">AES</option>
> >                                  <option value="TKIP">TKIP</option>
> >                               </select>
> >                            </td>
> >                         </tr>
> > ===================== End .htm Snippet ======================
> >
>
> Now we're back to about my original suggestion, if there's no newlines
> in the searched text:
>
>  > usedStrings.txt
> while IFS= read -r string
> do
>      grep -q ">${string}<" directory/*.htm &&
>      echo "$string" >> usedStrings.txt
> done < allStrings.txt
>
> Alternatively, doing it all in awk, it's:
>
> gawk 'NR==FNR{strings[$0]++;next}
>      { for (string in strings}
> 	    if (index($0,">"string"<") {
> 		usedStrings[string]++
> 		delete strings[string]	# for efficiency
> 	    }
>      }
>      END { for (string in usedStrings)
> 	    print string
>      }' allStrings.txt directory/*.htm > usedStrings.txt
>

Apologies for asking some basic questions.

I have some doubts about the above gawk script. Request your time for
clarifications.

1. There are two variables - string and strings. I don't know what value
the variable 'string' will have when it iterates for the very first
time. Is it assigned 0 by default?
2. I simply pasted this script into a file, gave it execute permission,
and then executed it from the command line. Have I done it right? I am
in doubt here because I was wondering whether I should use gawk -f <script
file name> instead.
Did I run your script in the right way at all?

I have modified the if in the for loop to this:
if (index($0,">"string"<") || index($0,"\""string"\"") ||
index($0,">"string"\n" ) {

to include strings within double quotes (") and strings that have ">"
before them and end with a newline. Please correct me if I am wrong
here.
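
A note on the third test, as a hedged sketch: with the default record separator, awk strips the newline before the line lands in $0, so index($0, ">"string"\n") can never find anything. One way to ask "does the line end with >string" is to compare the tail of the record instead. The input lines and the variable names needle and tail below are made up for illustration.

```shell
result=$(printf '%s\n' '<td>Privacy' '<td>Privacy</td>' '<td>Privacy policy' |
awk -v string='Privacy' '
    {
        needle = ">" string
        # Last length(needle) characters of the record.
        tail = substr($0, length($0) - length(needle) + 1)
        # Looking for a newline inside $0 always fails with the default RS.
        newline_hit = (index($0, needle "\n") > 0)
        if (tail == needle)
            print NR ": ends with needle (newline test says " newline_hit ")"
    }')
printf '%s\n' "$result"
```

Only the first input line ends with ">Privacy", and even there the newline test reports 0.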

Here is what happened when I ran the script :
gawk: cmd. line:2:      { for (string in strings}
gawk: cmd. line:2:                                    ^ parse error
gawk: cmd. line:3:             if (index($0,">"string"<") ||
index($0,"\""string"\"") || index($0,">"string"\n" ) {
gawk: cmd. line:3:             ^ parse error
gawk: cmd. line:3:             if (index($0,">"string"<") ||
index($0,"\""string"\"") || index($0,">"string"\n" ) {
gawk: cmd. line:3:       ^ parse error
gawk: cmd. line:5:                 usedStrings[string]++
gawk: cmd. line:5:                                                ^
unexpected newline
gawk: cmd. line:6:                 delete strings[string]  # for
efficiency
gawk: cmd. line:6:                                                 ^
parse error
gawk: cmd. line:9:      END { for (string in usedStrings)
gawk: cmd. line:9:                                                    ^
unexpected newline

> Note that, since you said something in a previous posting about only
> wanting to look for text when it's part of an HTML tag (or something
> like that...) the search for ">"string"<" surrounds the line from
> "allStrings.txt" with ">" and "<" so it only matches when the text
> appears between those 2 characters. If you don't want that restriction,
> just get rid of the ">" and "<". Similairly for the grep solution.
>
> If you'd like the awk script to tell you which strings are/aren't used,
> that's trivial, e.g.:
>
> gawk 'NR==FNR{strings[$0]++;next}
>      { for (string in strings}
> 	    if (index($0,">"string"<") {
> 		usedStrings[string]++
> 		delete strings[string]	# for efficiency
> 	    }
>      }
>      END {
> 	print "Used Strings:"
> 	for (string in usedStrings)
> 	    printf "\t%s\n",string
> 	print "Unused Strings:"
> 	for (string in strings)
> 	    printf "\t%s\n",string
>      }' allStrings.txt directory/*.htm

Am getting some errors for the above script as well. Here it is:

gawk: cmd. line:2:       { for (string in strings}
gawk: cmd. line:2:                                     ^ parse error
gawk: cmd. line:3:          if (index($0,">"string"<") {
gawk: cmd. line:3:          ^ parse error
gawk: cmd. line:3:          if (index($0,">"string"<") {
gawk: cmd. line:3:                                              ^ parse
error
gawk: cmd. line:5:              usedStrings[string]++
gawk: cmd. line:5:                                             ^
unexpected newline
gawk: cmd. line:6:              delete strings[string]  # for
efficiency
gawk: cmd. line:6:                                              ^ parse
error
gawk: cmd. line:11:     for (string in usedStrings)
gawk: cmd. line:11:                                         ^
unexpected newline
gawk: cmd. line:12:         printf "\t%s\n",string
gawk: cmd. line:12:                                      ^ unexpected
newline
gawk: cmd. line:14:     for (string in strings)
gawk: cmd. line:14:                                  ^ unexpected
newline
gawk: cmd. line:15:         printf "\t%s\n",string
gawk: cmd. line:15:                                      ^ unexpected
newline

Any thoughts on the fix would help a great deal.

> If there can be newlines in the strings yopu're trying to match in the
> HTML files, then we need to figure out what "match" means since there
> aren't newlines in the strings in "allStrings.txt" and we need to figure
> out a different record separator than a newline char.

There are no newlines "within" any line in "allStrings.txt", only the
one at the end of each line.

Warm Regards,
Megh

> 
> 	Ed.

0
Meghavvarnam
11/23/2005 1:40:54 PM
Meghavvarnam wrote:
> Ed Morton wrote:
> 
>>Alternatively, doing it all in awk, it's:
>>
>>gawk 'NR==FNR{strings[$0]++;next}
>>     { for (string in strings}
>>	    if (index($0,">"string"<") {
>>		usedStrings[string]++
>>		delete strings[string]	# for efficiency
>>	    }
>>     }
>>     END { for (string in usedStrings)
>>	    print string
>>     }' allStrings.txt directory/*.htm > usedStrings.txt
>>
> 
> 
> Apologies for asking some basic questions.
> 
> I have some doubts about the above gawk script. Request your time for
> clarifications.
> 
> 1. There are two variables - string and strings. I dont know what value
> will the variable 'string' have when it iterates for the very first
> time. Is it by default assigned to 0 ?

All variables are either "" (empty string) or 0 (numeric zero) when
you access them the first time.
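
A small illustration, using made-up data: an unassigned variable compares equal to both "" and 0, and the variable in a for-in loop needs no starting value at all, because awk assigns each array index to it in turn.

```shell
result=$(awk 'BEGIN {
    # x has never been assigned: it compares equal to both "" and 0.
    if (x == "" && x == 0) print "x is empty and zero"
    # Referencing an element creates it; the loop then hands out its index.
    strings["WPA1"]
    for (string in strings) print "string is now: " string
}')
printf '%s\n' "$result"
```

So in the script under discussion, "string" is never 0 inside the loop; it holds one line of allStrings.txt on each pass.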

> 2. I simply pasted this script in a file and gave execute permission
> and then executed it from the command line. Have I done it right ? Am
> in doubt here because I was thinking if I should use gawk -f <script
> file name>
> Did I run your script in the right way at all ?

On Unix you can define as first line in your script

   #!/bin/awk -f

(or wherever your awk is installed), or you can call the awk interpreter
explicitly with the file

   awk -f file_with_your_awk_program

or with the program (not recommended for programs of non-trivial size)

   awk '...awk-statements...'
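
A minimal sketch of the last two forms side by side; the file names count.awk and data.txt are invented for the example. Note that a file passed to -f holds only the awk program, without the surrounding quotes or the awk command itself. A file that contains a complete "gawk '...' files" command line is a shell script, and is run as one.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' > "$workdir/data.txt"

# The program file contains awk statements only.
cat > "$workdir/count.awk" <<'EOF'
{ n++ }
END { print n " lines" }
EOF

# Same program, once from a file and once on the command line.
out_file=$(awk -f "$workdir/count.awk" "$workdir/data.txt")
out_inline=$(awk '{ n++ } END { print n " lines" }' "$workdir/data.txt")

printf '%s\n%s\n' "$out_file" "$out_inline"
rm -rf "$workdir"
```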


> I have modified the if in the for loop to this:
> if (index($0,">"string"<") || index($0,"\""string"\"") ||
> index($0,">"string"\n" ) {
> 
> to include strings within double quotes (") and strings that have ">"
> before them and end with a new line. Please correct me if am wrong
> here.
> 
> Here is what happened when I ran the script :
> gawk: cmd. line:2:      { for (string in strings}
> gawk: cmd. line:2:                                    ^ parse error

It's like in other computer languages: the brackets must match.

                           { for (string in strings)
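
With that bracket corrected, and the "if" condition also given its closing parenthesis, the script quoted above would read roughly as follows. This is a sketch run on invented sample files in a temporary directory, using plain awk rather than gawk, with the output piped through sort only to make the order predictable.

```shell
# Hypothetical sample data; only the file names come from the thread.
workdir=$(mktemp -d)
mkdir "$workdir/htm"
printf '%s\n' 'WPA1' 'WPA2' 'Automatic (WPA2 or WPA1)' 'Privacy' \
    > "$workdir/allStrings.txt"
printf '%s\n' '<option value="WPA1">WPA1</option>' \
              '<option value="Automatic">Automatic (WPA2 or WPA1)</option>' \
    > "$workdir/htm/a.htm"
printf '%s\n' '<option value="WPA2">WPA2</option>' > "$workdir/htm/b.htm"

used=$(awk 'NR==FNR { strings[$0]++; next }
    { for (string in strings)               # ")" here, not "}"
          if (index($0, ">" string "<")) {  # both parentheses closed
              usedStrings[string]++
              delete strings[string]        # for efficiency
          }
    }
    END { for (string in usedStrings)
              print string
    }' "$workdir/allStrings.txt" "$workdir"/htm/*.htm | sort)

printf '%s\n' "$used"
rm -rf "$workdir"
```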


> [big snip]

Janis
0
Janis
11/23/2005 2:47:21 PM
Janis Papanagnou wrote:
> Meghavvarnam wrote:
> > Ed Morton wrote:
> >
> >>Alternatively, doing it all in awk, it's:
> >>
> >>gawk 'NR==FNR{strings[$0]++;next}
> >>     { for (string in strings}
> >>	    if (index($0,">"string"<") {
> >>		usedStrings[string]++
> >>		delete strings[string]	# for efficiency
> >>	    }
> >>     }
> >>     END { for (string in usedStrings)
> >>	    print string
> >>     }' allStrings.txt directory/*.htm > usedStrings.txt
> >>
> >
> >
> > Apologies for asking some basic questions.
> >
> > I have some doubts about the above gawk script. Request your time for
> > clarifications.
> >
> > 1. There are two variables - string and strings. I dont know what value
> > will the variable 'string' have when it iterates for the very first
> > time. Is it by default assigned to 0 ?
>
> All variables are either "" (empty string) or 0 (numeric zero) when
> you access them the first time.
>
> > 2. I simply pasted this script in a file and gave execute permission
> > and then executed it from the command line. Have I done it right ? Am
> > in doubt here because I was thinking if I should use gawk -f <script
> > file name>
> > Did I run your script in the right way at all ?
>
> On Unix you can define as first line in your script
>
>    #!/bin/awk -f
>
> (or whereever your awk is installed), or you can call the awk interpreter
> explicitly with the file
>
>    awk -f file_with_your_awk_program
>
> or with the program (not recommended for programs of non-trivial size)
>
>    awk '...awk-statements...'
>
>
> > I have modified the if in the for loop to this:
> > if (index($0,">"string"<") || index($0,"\""string"\"") ||
> > index($0,">"string"\n" ) {
> >
> > to include strings within double quotes (") and strings that have ">"
> > before them and end with a new line. Please correct me if am wrong
> > here.
> >
> > Here is what happened when I ran the script :
> > gawk: cmd. line:2:      { for (string in strings}
> > gawk: cmd. line:2:                                    ^ parse error
>
> It's like in other computer laguages, the brackets must match
>
>                            { for (string in strings)
>
>
I do have a closing brace for that. My intention was to paste just the
modified part and leave out the original script that Ed had put out
over here.

Thanks anyway for your inputs Janis.

Warm Regards,
Megh

> > [big snip]
> 
> Janis

0
Meghavvarnam
11/23/2005 2:57:29 PM
Meghavvarnam wrote:
> Janis Papanagnou wrote:
>>Meghavvarnam wrote:
>>
>>>Here is what happened when I ran the script :
>>>gawk: cmd. line:2:      { for (string in strings}
>>>gawk: cmd. line:2:                                    ^ parse error
>>
>>It's like in other computer laguages, the brackets must match
>>
>>                           { for (string in strings)
> 
> I do have a closing brace for that. My intention was to paste just the
> modified part and leave out the original script that Ed had put out
> over here.

I don't understand you. You posted a script that *has* this error and
you posted a bunch of errors that are created by that script exactly
at that location. If you replace the } bracket with the ) bracket,
the output should look quite good.

You wrote...
 >>>Am getting some errors for the above script as well. Here it is:
 >>>
 >>>gawk: cmd. line:2:       { for (string in strings}
 >>>gawk: cmd. line:2:                                     ^ parse error
 >>>
 >>> etc. etc.
 >>>
 >>>Any thoughts on the fix would help a great deal.

With the same error in the second script as in the first one (Ed's).

Janis
0
Janis
11/23/2005 3:19:59 PM
Janis Papanagnou wrote:
> Meghavvarnam wrote:
> > Janis Papanagnou wrote:
> >>Meghavvarnam wrote:
> >>
> >>>Here is what happened when I ran the script :
> >>>gawk: cmd. line:2:      { for (string in strings}
> >>>gawk: cmd. line:2:                                    ^ parse error
> >>
> >>It's like in other computer laguages, the brackets must match
> >>
> >>                           { for (string in strings)
> >
> > I do have a closing brace for that. My intention was to paste just the
> > modified part and leave out the original script that Ed had put out
> > over here.
>
> I don't understand you. You posted a script that *has* this error and
> you posted a bunch of errors that are created by that script exactly
> at that location. If you would replaced the } bracket by the ) bracket
> then the output should look quite good.
>
I very much agree with you.

That was a copy-paste error :-( And thanks a bunch for identifying that.

Though I figured out there was a missing ) in the if within the for, I
didn't quite get to the one you pointed out over here.

Thanks Again,
Megh
> 
> Janis

0
Meghavvarnam
11/23/2005 4:44:53 PM
Ed Morton wrote:
> Meghavvarnam wrote:
>
> <snip>
> > Sample data does help a great deal. Here it is:
> >
> > allStrings.txt contains lines likes these -
> > =================== Begin allStrings.txt ====================
> > WPA1
> > WPA2
> > Automatic (WPA2 or WPA1)
> > XyZ technology helps make home networking simple.
> > XyZ architecture offers network connectivity between personal
> > computers, printers, intelligent appliances and wireless devices.
> > XyZ architecture leverages ABC/DE and the Web to enable seamless
> > proximity networking in addition to control and data transfer among
> > networked devices in the home and office.
> > If you enable XYZ , then XYZ-enabled devices can print to this device.
> > Privacy
> > SampleText:<br> Simpler, smarter online supplies ordering
> > Learn more about <br>XYZ SampleText
> > Transfer printer information to XYZ SampleText?
> > ==================== End allStrings.txt =====================
> >
> > Which means, the script will search for these lines in .htm files. Each
> > of these lines need to appear as is (case sensitive) to say that there
> > is a match. Now consider we read the 3rd line in the file above -
> > Automatic (WPA2 or WPA1).
> >
> >>From the .htm snippet pasted below, the third option tag contains the
> > search string -
> >
> > Automatic (WPA2 or WPA1)
> >
> > So when a match like this occurs, I simply need to write Automatic
> > (WPA2 or WPA1) in the files - usedStrings.
> >
> > ==================== Begin .htm Snippet =====================
> >                         <tr>
> >                            <td>&nbsp;</td>
> >                            <td
> > class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; WPA
> > Version
> >                            </td>
> >                            <td>
> >                               <select name="wpa_version" size="1"
> > title="Select a WPA version setting">
> >                                  <option value="WPA1">WPA1</option>
> >                                  <option value="WPA2">WPA2</option>
> >                                  <option selected="SELECTED"
> > value="Automatic">Automatic (WPA2 or WPA1)</option>
> >                               </select>
> >                            </td>
> >                         </tr>
> >                         <tr>
> >                            <td>&nbsp;</td>
> >                            <td
> > class="clf">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
> > Encryption:
> >                            </td>
> >                            <td>
> >                               <select name="encr_type" size="1"
> > title="Select an encryption setting">
> >                                  <option value="AES_TKIP">Automatic
> > (AES or TKIP)</option>
> >                                  <option value="AES">AES</option>
> >                                  <option value="TKIP">TKIP</option>
> >                               </select>
> >                            </td>
> >                         </tr>
> > ===================== End .htm Snippet ======================
> >
>
> Now we're back to about my original suggestion, if there's no newlines
> in the searched text:
>
>  > usedStrings.txt
> while IFS= read -r string
> do
>      grep -q ">${string}<" directory/*.htm &&
>      echo "$string" >> usedStrings.txt
> done < allStrings.txt
>
> Alternatively, doing it all in awk, it's:
>
> gawk 'NR==FNR{strings[$0]++;next}
>      { for (string in strings}
> 	    if (index($0,">"string"<") {
> 		usedStrings[string]++
> 		delete strings[string]	# for efficiency
> 	    }
>      }
>      END { for (string in usedStrings)
> 	    print string
>      }' allStrings.txt directory/*.htm > usedStrings.txt
>
> Note that, since you said something in a previous posting about only
> wanting to look for text when it's part of an HTML tag (or something
> like that...) the search for ">"string"<" surrounds the line from
> "allStrings.txt" with ">" and "<" so it only matches when the text
> appears between those 2 characters. If you don't want that restriction,
> just get rid of the ">" and "<". Similairly for the grep solution.
>
This is the script that I tried -

# listused
# lists strings that are used in all .htm files

gawk 'NR==FNR{strings[$0]++;next} {
        for (string in strings) #}
print string
                if (index($0,">"string"<") || index($0,"\""string"\"")
|| index($0,">"string"\n")) {
                        usedStrings[string]++
                        delete strings[string]  # for efficiency
        }
}
END {
        for (string in usedStrings)
                print string
}' allStrings.txt htm/*.htm > usedStringsfile

Please let me know, if there is any mistake in this. I gave execute
permission for the file that contained this script and ran it from the
shell.

usedStringsfile was empty at the end of it.

Any pointers will be of great help.

> If you'd like the awk script to tell you which strings are/aren't used,
> that's trivial, e.g.:
>
> gawk 'NR==FNR{strings[$0]++;next}
>      { for (string in strings}
> 	    if (index($0,">"string"<") {
> 		usedStrings[string]++
> 		delete strings[string]	# for efficiency
> 	    }
>      }
>      END {
> 	print "Used Strings:"
> 	for (string in usedStrings)
> 	    printf "\t%s\n",string
> 	print "Unused Strings:"
> 	for (string in strings)
> 	    printf "\t%s\n",string
>      }' allStrings.txt directory/*.htm
>
I modified the script above to remove all parse errors. Here is the
script that I used to try out -

gawk ' NR==FNR{strings[$0]++;next}
      { for (string1 in strings)
            string = sprintf("<%s>", string1)
            if (index($0,">"string"<")) {
                usedStrings[string]++
                delete strings[string]  # for efficiency
            }
      }
      END {
        print "Used Strings:"
        for (string in usedStrings)
            printf "\t%s\n", string
        print "Unused Strings:"
        for (string in strings)
            printf "\t%s\n", string
      }' allStrings.txt htm/*.htm

I see the same behaviour with this as with the earlier script. Would we
need a different approach for this thing at all ??

What does the line - NR==FNR{strings[$0]++;next} do.

Thank you in advance so much for your help.

Megh

> If there can be newlines in the strings yopu're trying to match in the
> HTML files, then we need to figure out what "match" means since there
> aren't newlines in the strings in "allStrings.txt" and we need to figure
> out a different record separator than a newline char.
> 
> 	Ed.

0
Meghavvarnam
11/28/2005 7:37:09 AM
Meghavvarnam wrote:
> 
> What does the line - NR==FNR{strings[$0]++;next} do.

'NR==FNR' is a condition that is true only while awk is reading the very
first file on your argument list. The pattern is used, e.g., for two-pass
computing...

   awk -f yourprog datafile datafile

or for initialization purpose...

   awk -f yourprog initdatafile datafile1 datafile2 ...

'strings[$0]++' increments a counter for each distinct line, i.e. when
a line is present in the file multiple times, the element of 'strings'
with the key $0 (the whole line) holds the number of times it occurred.

'next' skips the remaining pattern/action statements and continues with
the next data record and the first pattern in the program.
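
A small demonstration of the three pieces together, on two invented files (first.txt and second.txt are names made up for the example):

```shell
workdir=$(mktemp -d)
printf '%s\n' 'WPA1' 'WPA2' 'WPA1' > "$workdir/first.txt"
printf '%s\n' 'x' 'y' > "$workdir/second.txt"

result=$(awk '
    # NR counts records across all files, FNR restarts at 1 for each file,
    # so the two are equal only while the first file is being read.
    NR == FNR { strings[$0]++; next }
    # Reached only for the second file, because of the "next" above.
    { print "second file, FNR=" FNR ", NR=" NR }
    END { print "WPA1 seen " strings["WPA1"] " times" }
' "$workdir/first.txt" "$workdir/second.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```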

Janis
0
Janis
11/28/2005 10:16:42 AM
Meghavvarnam wrote:

> Ed Morton wrote:
<snip>
>>gawk 'NR==FNR{strings[$0]++;next}
>>     { for (string in strings}
>>	    if (index($0,">"string"<") {
>>		usedStrings[string]++
>>		delete strings[string]	# for efficiency
>>	    }
>>     }
>>     END { for (string in usedStrings)
>>	    print string
>>     }' allStrings.txt directory/*.htm > usedStrings.txt
<snip>
> This is the script that I tried -
> 
> # listused
> # lists strings that are used in all .htm files
> 
> gawk 'NR==FNR{strings[$0]++;next} {
>         for (string in strings) #}
> print string
>                 if (index($0,">"string"<") || index($0,"\""string"\"")
> || index($0,">"string"\n")) {
>                         usedStrings[string]++
>                         delete strings[string]  # for efficiency
>         }

Note that the above is now:

	for (string in strings)
		print string
	if (index...) {
	}

By adding "print string" between the "for.." and the "if..", you've 
taken the "if..." outside of the loop. Add braces {...} to make what you 
want explicit.
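
The effect can be seen in isolation with a toy array; the counter names below are invented for the example.

```shell
result=$(awk 'BEGIN {
    strings["a"]; strings["b"]; strings["c"]

    # No braces: only the first statement belongs to the loop.
    for (s in strings)
        inside++
    after++            # runs once, after the loop has finished

    # Braces: both statements run for every element.
    for (s in strings) {
        both1++
        both2++
    }
    print inside, after, both1, both2
}')
printf '%s\n' "$result"
```

Indentation makes no difference to awk; only the braces decide what is inside the loop.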

> }
> END {
>         for (string in usedStrings)
>                 print string
> }' allStrings.txt htm/*.htm > usedStringsfile
> 
> Please let me know, if there is any mistake in this.

Yes, there is. You now only have "print string" in the "for" loop. The 
"if ..." is outside of it.

  I gave execute
> permission for the file that contained this script and ran it from the
> shell.
> 
> usedStringsfile was empty at the end of it.
> 
> Any pointers will be of great help.
> 
> 
>>If you'd like the awk script to tell you which strings are/aren't used,
>>that's trivial, e.g.:
>>
>>gawk 'NR==FNR{strings[$0]++;next}
>>     { for (string in strings}
>>	    if (index($0,">"string"<") {
>>		usedStrings[string]++
>>		delete strings[string]	# for efficiency
>>	    }
>>     }
>>     END {
>>	print "Used Strings:"
>>	for (string in usedStrings)
>>	    printf "\t%s\n",string
>>	print "Unused Strings:"
>>	for (string in strings)
>>	    printf "\t%s\n",string
>>     }' allStrings.txt directory/*.htm
>>
> 
> I modified the script above to remove all parse errors. 

What parse errors? There may be some since it's untested, but I don't 
see any.

Here is the
> script that I used to try out -
> 
> gawk ' NR==FNR{strings[$0]++;next}
>       { for (string1 in strings)
>             string = sprintf("<%s>", string1)

Here again you've added a line and so taken the subsequent block (the 
"if...") out of the loop.

>             if (index($0,">"string"<")) {
>                 usedStrings[string]++
>                 delete strings[string]  # for efficiency
>             }
>       }
>       END {
>         print "Used Strings:"
>         for (string in usedStrings)
>             printf "\t%s\n", string
>         print "Unused Strings:"
>         for (string in strings)
>             printf "\t%s\n", string
>       }' allStrings.txt htm/*.htm
> 
> I see the same behaviour with this as with the earlier script.

By that do you mean that "usedStringsfile" is empty? Well, yes, it would 
be, since nowhere above do you direct any output to it; additionally, 
you've broken the loop again.

  Would we
> need a different approach for this thing at all ??

No.

> What does the line - NR==FNR{strings[$0]++;next} do.

See Janis' response.

> Thank you in advance so much for your help.

You're welcome,

	Ed.
0
Ed
11/28/2005 2:53:39 PM
Ed Morton wrote:
> Meghavvarnam wrote:
>
> > Ed Morton wrote:
> <snip>
> >>gawk 'NR==FNR{strings[$0]++;next}
> >>     { for (string in strings}
> >>	    if (index($0,">"string"<") {
> >>		usedStrings[string]++
> >>		delete strings[string]	# for efficiency
> >>	    }
> >>     }
> >>     END { for (string in usedStrings)
> >>	    print string
> >>     }' allStrings.txt directory/*.htm > usedStrings.txt
> <snip>
> > This is the script that I tried -
> >
> > # listused
> > # lists strings that are used in all .htm files
> >
> > gawk 'NR==FNR{strings[$0]++;next} {
> >         for (string in strings) #}
> > print string
> >                 if (index($0,">"string"<") || index($0,"\""string"\"")
> > || index($0,">"string"\n")) {
> >                         usedStrings[string]++
> >                         delete strings[string]  # for efficiency
> >         }
>
> Note that the above is now:
>
> 	for (string in strings)
> 		print string
> 	if (index...) {
> 	}
>
> By adding "print string" between the "for.." and the "if..", you've
> taken the "if..." outside of the loop. Add parens to make what you want
> explicit {...}.
>
> > }
> > END {
> >         for (string in usedStrings)
> >                 print string
> > }' allStrings.txt htm/*.htm > usedStringsfile
> >
> > Please let me know, if there is any mistake in this.
>
> Yes, there is. You now only have "print string" in the "for" loop. The
> "if ..." is outside of it.
>
I completely agree with you. Here is how the modified script looks
like...

gawk 'NR==FNR{strings[$0]++;next} {
        for (string in strings) {
                if (index($0,">"string"<") || index($0,"\""string"\"")
|| index($0,">"string"\n")) {
                        usedStrings[string]++
                        delete strings[string]  # for efficiency
                } # if
        } # for loop
}
END {
        for (string in usedStrings)
                print string
}' allStrings.txt htm/*.htm > usedStringsfile

This script is saved in a file called listused. I gave it execute
permission, then executed it from the command line as shown below:

[Megh@razor] listused

I still see the same behaviour - usedStringsfile was empty.

>   I gave execute
> > permission for the file that contained this script and ran it from the
> > shell.
> >
> > usedStringsfile was empty at the end of it.
> >
> > Any pointers will be of great help.
> >
> >
> >>If you'd like the awk script to tell you which strings are/aren't used,
> >>that's trivial, e.g.:
> >>
> >>gawk 'NR==FNR{strings[$0]++;next}
> >>     { for (string in strings}
> >>	    if (index($0,">"string"<") {
> >>		usedStrings[string]++
> >>		delete strings[string]	# for efficiency
> >>	    }
> >>     }
> >>     END {
> >>	print "Used Strings:"
> >>	for (string in usedStrings)
> >>	    printf "\t%s\n",string
> >>	print "Unused Strings:"
> >>	for (string in strings)
> >>	    printf "\t%s\n",string
> >>     }' allStrings.txt directory/*.htm
> >>
> >
> > I modified the script above to remove all parse errors.
>
> What parse errors? There may be some since it's untested, but I don't
> see any.
>
> Here is the
> > script that I used to try out -
> >
> > gawk ' NR==FNR{strings[$0]++;next}
> >       { for (string1 in strings)
> >             string = sprintf("<%s>", string1)
>
> Here again you've added a line and so taken the subsequent block (the
> "if...") out of the loop.
>
> >             if (index($0,">"string"<")) {
> >                 usedStrings[string]++
> >                 delete strings[string]  # for efficiency
> >             }
> >       }
> >       END {
> >         print "Used Strings:"
> >         for (string in usedStrings)
> >             printf "\t%s\n", string
> >         print "Unused Strings:"
> >         for (string in strings)
> >             printf "\t%s\n", string
> >       }' allStrings.txt htm/*.htm
> >
> > I see the same behaviour with this as with the earlier script.
>
> By that do you mean that "usedStringsfile" is empty? Well, yes, it would
> be since no-where above do you direct any output to it, but additionally
> you've broken the loop again.
>
Here is the modified script:
gawk ' NR==FNR{strings[$0]++;next}
      { for (string in strings) {
            if (index($0,">"string"<") || index($0,"\""string"\"") ||
index($0,">"string"\n")) {
                usedStrings[string]++
                delete strings[string]  # for efficiency
            }
        }
      }
      END {
        print "Used Strings:"
        for (string in usedStrings)
            printf "\t%s\n", string
        print "Unused Strings:"
        for (string in strings)
            printf "\t%s\n", string
      }' allStrings.txt htm/*.htm

I saved this in a file named listused1, gave it execute permission, and ran it
from the command line like this:

listused1 > output1

When I cross-check manually, output1 contains both used and unused strings,
all of them listed as unused.
Here is how the output1 file begins:
Used Strings:
Unused Strings:
... All the strings follow here

Ed,

Given that the script is saved in a file, it would help if you can tell
me the correct way to run it from the command line.

We need to get this working.. Help please !

Thank you again!

Regards,
Megh

>   Would we
> > need a different approach for this thing at all ??
>
> No.
>
> > What does the line - NR==FNR{strings[$0]++;next} do.
>
> See Janis' response.
>
> > Thank you in advance so much for your help.
> 
> You're welcome,
> 
> 	Ed.
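
On the NR==FNR question quoted above, a minimal sketch may help. The file names a.txt and b.txt are made up for illustration, and plain awk is used here (gawk should behave the same for this). NR counts records across all input files, while FNR restarts at 1 for each file, so NR==FNR is only true while awk is reading the first file:

```shell
# Hypothetical sample files, created in a scratch directory.
tmp=$(mktemp -d)
printf 'one\ntwo\n' > "$tmp/a.txt"
printf 'three\n'    > "$tmp/b.txt"

# NR==FNR holds only while reading the first file; "next" then skips
# the second rule, so only lines from later files reach it.
result=$(awk 'NR==FNR { print "first file: " $0; next }
                      { print "later file: " $0 }' "$tmp/a.txt" "$tmp/b.txt")
echo "$result"
rm -rf "$tmp"
```

In the script under discussion the first rule stores each line of allStrings.txt in the strings array instead of printing it. Note that if the first file is empty, NR==FNR is also true for the second file, so the idiom relies on allStrings.txt having at least one line.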

Meghavvarnam
11/29/2005 1:39:45 PM
Meghavvarnam wrote:
<snip>
> I completely agree with you. Here is how the modified script looks
> like...
> 
> gawk 'NR==FNR{strings[$0]++;next} {
>         for (string in strings) {
>                 if (index($0,">"string"<") || index($0,"\""string"\"") ||
>                     index($0,">"string"\n")) {
>                         usedStrings[string]++
>                         delete strings[string]  # for efficiency
>                 } # if
>         } # for loop
> }
> END {
>         for (string in usedStrings)
>                 print string
> }' allStrings.txt htm/*.htm > usedStringsfile
<snip>
> Given that the script is saved in a file, it would help if you can tell
> me the correct way to run it from the command line.
> 
> We need to get this working.. Help please !

OK, let's just focus on one version for now. If you have the above in a
file named "listused" and it's executable, then just execute it as
/path/listused as you appear to have been doing. So, there are really only
three ways you'd get usedStringsfile empty:

1) allStrings.txt is empty, or
2) There are no files matching htm/*.htm, or
3) None of the files that match htm/*.htm contain any of the strings in 
allStrings.txt
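
Those three conditions can be checked before running the awk script at all. Here is a rough sketch; check_inputs is a hypothetical helper name, not part of the script above, and the grep test is looser than the script's own tests because it looks for each string as a plain substring anywhere in a line:

```shell
# check_inputs STRINGS_FILE HTM_DIR  (hypothetical helper for illustration)
check_inputs() {
    strings_file=$1
    htm_dir=$2

    # 1) the strings file must exist and be non-empty
    if [ ! -s "$strings_file" ]; then
        echo "empty or missing: $strings_file"
        return 1
    fi

    # 2) at least one file must match HTM_DIR/*.htm
    set -- "$htm_dir"/*.htm
    if [ ! -e "$1" ]; then
        echo "no files match $htm_dir/*.htm"
        return 1
    fi

    # 3) at least one string must occur somewhere in those files
    #    (-F: fixed strings, -f: read patterns from file, -q: quiet)
    if grep -qFf "$strings_file" "$@"; then
        echo "at least one string occurs in the .htm files"
    else
        echo "none of the strings occur in any .htm file"
        return 1
    fi
}
```

It would be called as `check_inputs allStrings.txt htm`. A blank line in the strings file makes grep match every line, so treat a pass on the third check as a hint rather than proof.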

So, let's instrument the program for debugging:

gawk '{printf "Working on file %s\n",FILENAME}
NR==FNR{strings[$0]++;printf "Added string %s\n",$0;next}
{
	for (string in strings) {
		printf "Searching for string \"%s\" in line \"%s\"\n",string,$0
		# keep "||" at the end of the line: awk does not allow a
		# newline in front of it
		if (index($0,">"string"<") || index($0,"\""string"\"") ||
		    index($0,">"string"\n")) {
			printf "Found string \"%s\" in line \"%s\"\n",string,$0
			usedStrings[string]++
			delete strings[string]  # for efficiency
		} # if
	} # for loop
}
END {
	for (string in usedStrings)
		print string
}' allStrings.txt htm/*.htm > usedStringsfile

Then run that on a small sample input and post the result.
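
As an illustration of what such a small sample could look like, here is a sketch with made-up data (the strings and the file sample.htm are assumptions, and plain awk stands in for gawk). It runs the same three index() tests without the debugging output and sorts the result, since the order of "for (string in ...)" is unspecified:

```shell
tmp=$(mktemp -d)
mkdir "$tmp/htm"
printf 'Hello\nSave\nTrailing\nMissing\n' > "$tmp/allStrings.txt"
# "Trailing" sits at the very end of a line, right before the newline.
printf '<p>Hello</p>\n<input value="Save">\n<td>Trailing\n</td>\n' \
    > "$tmp/htm/sample.htm"

used=$(awk 'NR==FNR { strings[$0]++; next }
{
    for (string in strings)
        if (index($0, ">" string "<") || index($0, "\"" string "\"") ||
            index($0, ">" string "\n")) {
            usedStrings[string]++
            delete strings[string]
        }
}
END { for (string in usedStrings) print string }' \
    "$tmp/allStrings.txt" "$tmp"/htm/*.htm | sort)
echo "$used"
rm -rf "$tmp"
```

With this data only Hello and Save are reported. Trailing is not, because awk removes the newline from each record before the rules run, so the test against ">"string"\n" has nothing to match. If strings at the end of a line should count, one possible replacement for that third test is `substr($0, length($0) - length(string)) == ">" string`, which compares against the end of the record instead.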

	Ed.
Ed
11/29/2005 3:03:49 PM