Extracting values from an HTML table

Hi All,

Occasionally, I have to extract values from a long HTML table (see some 
sample lines below) and convert them into a more useful format. The HTML 
code looks like this:

<table>

   <tr>

      <td>1</td>

      <td>1</td>

      <td>1</td>

      <td>25.25384484</td>

      <td>56.46285198</td>

   </tr>
   <tr>

      <td>1</td>

      <td>1</td>

      <td>2</td>

      <td>25.25595084</td>

      <td>56.46250503</td>

   </tr>

(...)

</table>

The expected output format is:

1 1 1 25.25384484 56.46285198
1 1 2 25.25595084 56.46250503
(...)

Could someone help me out?

Thanks in advance, Hermann
Reply peifer (24) 10/1/2007 5:22:26 PM

Hermann Peifer wrote:

> Occasionally, I have to extract values from a long HTML table (see some
> sample lines below) and convert them into a more useful format. The HTML
> code looks like this:

If it is properly tagged, why don't you treat it as XML?
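
A minimal sketch of this suggestion, assuming the table really is well-formed XML (every tag closed, no bare "&" and no HTML-only entities such as &nbsp;). Python's standard xml.etree.ElementTree module is used here purely for illustration; it is not the tool proposed in this post, and the function name table_rows is hypothetical.

```python
import xml.etree.ElementTree as ET

def table_rows(markup):
    """Return one space-joined line per <tr>, built from its <td> cells."""
    root = ET.fromstring(markup)   # raises ParseError if the input is not well-formed
    lines = []
    for tr in root.iter("tr"):
        cells = [(td.text or "").strip() for td in tr.iter("td")]
        lines.append(" ".join(cells))
    return lines

# Sample data taken from the original post (the elided rows are left out).
sample = """<table>
   <tr>
      <td>1</td> <td>1</td> <td>1</td>
      <td>25.25384484</td> <td>56.46285198</td>
   </tr>
   <tr>
      <td>1</td> <td>1</td> <td>2</td>
      <td>25.25595084</td> <td>56.46250503</td>
   </tr>
</table>"""

print("\n".join(table_rows(sample)))
```

ET.fromstring builds the whole tree in memory, so this form suits small inputs; a very large file would need a streaming approach instead.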
Reply Jürgen Kahrs 10/1/2007 6:24:02 PM


On Oct 1, 20:22, Hermann Peifer <pei...@gmx.net> wrote:
> Hi All,
>
> Occasionally, I have to extract values from a long HTML table (see some
> sample lines below) and convert them into a more useful format. The HTML
> code looks like this:


I would definitely use Perl (HTML::TreeBuilder) for this ;-)
You can also try this awk script:

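/<tr>/ { start = 1; row = ""; next }
/<\/tr>/ { start = 0; print row; next }
start {
	gsub(/<[^>]+>/, "")	# strip the tags
	sub(/^[ \t]+/, "")	# trim leading whitespace
	sub(/[ \t]+$/, "")	# trim trailing whitespace
	# skip the blank lines between cells, so no stray spaces are printed
	if ($0 != "") row = (row == "" ? $0 : row " " $0)
}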

  Vassilis
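
For comparison, a sketch of the parser-based route mentioned at the top of this post. It does not use HTML::TreeBuilder; it swaps in Python's standard html.parser module, which likewise tolerates markup that is not well-formed XML. The class name RowCollector and its attributes are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of each <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by a missing </td>

parser = RowCollector()
parser.feed("<table><tr><td>1</td><td>1</td><td>1</td>"
            "<td>25.25384484</td><td>56.46285198</td></tr></table>")
parser.close()
for row in parser.rows:
    print(" ".join(row))
```

Because feed() can be called repeatedly with successive chunks of a file, the same class can process input piece by piece rather than all at once.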

Reply Vassilis 10/1/2007 6:28:46 PM

Jürgen Kahrs wrote:
> Hermann Peifer wrote:
> 
>> Occasionally, I have to extract values from a long HTML table (see some
>> sample lines below) and convert them into a more useful format. The HTML
>> code looks like this:
> 
> If it is properly tagged, why don't you treat it as XML?

Good idea. I hadn't thought about it. Sometimes, one can't see the 
forest for the trees.

Hermann
Reply Hermann 10/1/2007 6:36:09 PM

Can you use lynx?

$ lynx --dump filename.html

   1 1 1 25.25384484 56.46285198
   1 1 2 25.25595084 56.46250503


On Oct 1, 2:22 pm, Hermann Peifer <pei...@gmx.net> wrote:
> Hi All,
>
> Occasionally, I have to extract values from a long HTML table (see some
> sample lines below) and convert them into a more useful format. The HTML
> code looks like this:


Reply Tiago 10/1/2007 6:43:48 PM

Hermann Peifer wrote:

>> If it is properly tagged, why don't you treat it as XML?
> 
> Good idea. I hadn't thought about it. Sometimes, one can't see the
> forest for the trees.

Ha! This problem is so simple that you can
even use the "portable subset" library. Notice
that you will _not_ depend on XMLgawk if you copy
this library into your script.
Reply Jürgen Kahrs 10/1/2007 6:43:56 PM

Tiago Peczenyj wrote:
> Can you use lynx?
> 
> $ lynx --dump filename.html
> 
>    1 1 1 25.25384484 56.46285198
>    1 1 2 25.25595084 56.46250503
> 

Great.

I'll get a few more spaces between the values, but this shouldn't be a 
problem.

Thanks, Hermann

Reply Hermann 10/1/2007 7:44:15 PM

On Oct 1, 12:22 pm, Hermann Peifer <pei...@gmx.net> wrote:
> Hi All,
>
> Occasionally, I have to extract values from a long HTML table (see some
> sample lines below) and convert them into a more useful format. The HTML
> code looks like this:

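#!gawk
# Needs gawk: RT is a gawk extension.
BEGIN { RS = "</?tr>" }    # a record ends at every <tr> or </tr>

# RT holds the text that matched RS; "</tr>" means the record was a row body.
"</tr>" == RT {
  gsub( /[^0-9.-]/, " " )  # blank out everything except digits, "." and "-"
  $1 = $1                  # rebuild $0 so the fields are joined by single spaces
  print
}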

==== output ====
1 1 1 25.25384484 56.46285198
1 1 2 25.25595084 56.46250503
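
For readers without gawk, the record-splitting idea of this script can be sketched in Python. This is an illustration under assumptions, not a port endorsed by anyone in the thread: the function name rows_from_records is hypothetical, and re.split with a capturing group stands in for gawk's RS/RT pair.

```python
import re

def rows_from_records(text):
    """Split on <tr> or </tr>; keep the records that were ended by </tr>."""
    # With a capturing group, re.split returns
    # [record, terminator, record, terminator, ..., last record].
    parts = re.split(r"(</?tr>)", text)
    rows = []
    for record, terminator in zip(parts[0::2], parts[1::2]):
        if terminator == "</tr>":
            # Blank out everything except digits, "." and "-", then
            # squeeze the remaining whitespace to single spaces.
            rows.append(" ".join(re.sub(r"[^0-9.-]", " ", record).split()))
    return rows

# Sample data taken from the original post (the elided rows are left out).
sample = """<table>
<tr>
<td>1</td> <td>1</td> <td>1</td> <td>25.25384484</td> <td>56.46285198</td>
</tr>
<tr>
<td>1</td> <td>1</td> <td>2</td> <td>25.25595084</td> <td>56.46250503</td>
</tr>
</table>"""

print("\n".join(rows_from_records(sample)))
```

As with the gawk original, the character filter keeps any digit, "." or "-" found inside a row, so digits in tag attributes (for example colspan="2") would leak into the output. It fits the purely numeric table shown here.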

Reply William 10/1/2007 7:48:35 PM

Hermann Peifer wrote:
> Hi All,
> 
> Occasionally, I have to extract values from a long HTML table (see some 
> sample lines below) and convert them into a more useful format. 

Thanks for all replies.

Tiago's hint with lynx works fine with smaller and medium-sized files.

William's gawk script also works great with bigger files (100+ MB).

Hermann
Reply Hermann 10/2/2007 7:39:23 PM
