Hi All,
Occasionally, I have to extract values from a long HTML table (see some
sample lines below) and convert them into a more useful format. The HTML
code looks like this:
<table>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>25.25384484</td>
<td>56.46285198</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>2</td>
<td>25.25595084</td>
<td>56.46250503</td>
</tr>
(...)
</table>
The expected output format is:
1 1 1 25.25384484 56.46285198
1 1 2 25.25595084 56.46250503
(...)
Could someone help me out?
Thanks in advance, Hermann
peifer (24) | 10/1/2007 5:22:26 PM
Hermann Peifer wrote:
> Occasionally, I have to extract values from a long HTML table (see some
> sample lines below) and convert them into a more useful format. The HTML
> code looks like this:
If it is properly tagged, why don't you treat it as XML?
Jürgen Kahrs | 10/1/2007 6:24:02 PM
On 1 Oct, 20:22, Hermann Peifer <pei...@gmx.net> wrote:
> Occasionally, I have to extract values from a long HTML table (see some
> sample lines below) and convert them into a more useful format. The HTML
> code looks like this:
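I would definitely use Perl (HTML::TreeBuilder) for this ;-)
You can also try:
/<tr>/   { start = 1; row = ""; next }     # a table row begins
/<\/tr>/ { start = 0; print row; next }    # the row ends: emit it
start {
    gsub(/<[^>]+>/, "")                    # strip the tags
    sub(/^[ \t]+/, "")                     # trim leading blanks
    sub(/[ \t]+$/, "")                     # trim trailing blanks
    # append the cell, separated by one space (no trailing space)
    if ($0 != "") row = (row == "" ? $0 : row " " $0)
}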
Vassilis
Vassilis | 10/1/2007 6:28:46 PM
Jürgen Kahrs wrote:
> Hermann Peifer wrote:
>
>> Occasionally, I have to extract values from a long HTML table (see some
>> sample lines below) and convert them into a more useful format. The HTML
>> code looks like this:
>
> If it is properly tagged, why don't you treat it as XML?
Good idea. I hadn't thought about it. Sometimes, one can't see the
forest for the trees.
Hermann
Hermann | 10/1/2007 6:36:09 PM
Can you use lynx?
$ lynx --dump filename.html
1 1 1 25.25384484 56.46285198
1 1 2 25.25595084 56.46250503
Tiago | 10/1/2007 6:43:48 PM
Hermann Peifer wrote:
>> If it is properly tagged, why don't you treat it as XML?
>
> Good idea. I hadn't thought about it. Sometimes, one can't see the
> forest for the trees.
Ha! This problem is so simple that you can
even use the "portable subset" library. Note
that you will _not_ depend on XMLgawk if you copy
this library into your script.
Jürgen Kahrs | 10/1/2007 6:43:56 PM
Tiago Peczenyj wrote:
> Can you use lynx ?
>
> $ lynx --dump filename.html
>
> 1 1 1 25.25384484 56.46285198
> 1 1 2 25.25595084 56.46250503
>
Great.
I'll get a few more spaces between the values, but this shouldn't be a
problem.
Thanks, Hermann
Hermann | 10/1/2007 7:44:15 PM
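The extra spaces that lynx leaves between the values can be squeezed with a small awk filter. This is only a minimal sketch, assuming the dump contains nothing but the table rows; the function name squeeze_fields is made up for illustration and does not come from the thread.

```shell
# Hypothetical helper (not from the thread): reads text on stdin and
# prints every non-blank line with its fields separated by one space.
squeeze_fields() {
    # NF skips blank lines; assigning $1 to itself makes awk rebuild
    # the record using the default output separator, a single space.
    awk 'NF { $1 = $1; print }'
}
```

With lynx installed, the call would look like `lynx --dump filename.html | squeeze_fields`.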
On Oct 1, 12:22 pm, Hermann Peifer <pei...@gmx.net> wrote:
> Occasionally, I have to extract values from a long HTML table (see some
> sample lines below) and convert them into a more useful format. The HTML
> code looks like this:
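# gawk only: a regex RS and the RT variable are gawk extensions
BEGIN { RS = "</?tr>" }        # split the input at every <tr> and </tr>
RT == "</tr>" {                # a record that ends at </tr> is a row body
    gsub( /[^0-9.-]/, " " )    # blank out everything except digits, "." and "-"
    $1 = $1                    # rebuild $0 with single-space separators
    print
}
==== output ====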
1 1 1 25.25384484 56.46285198
1 1 2 25.25595084 56.46250503
William | 10/1/2007 7:48:35 PM
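William's script depends on two gawk features that are not part of POSIX awk: a regular expression in RS and the RT variable. Where only a plain awk is available, roughly the same result can be had by splitting the input at every "<" instead. The following is a minimal sketch, assuming lower-case tags and cells that hold plain text without nested tags; table_rows is a hypothetical name, not something from the thread.

```shell
# Hypothetical helper (not from the thread): reads HTML on stdin and
# prints one line per table row, cells separated by a single space.
table_rows() {
    awk '
    BEGIN { RS = "<"; FS = ">" }      # each record looks like "tagname>text"
    $1 ~ /^td([ \t\n]|$)/ {           # an opening <td>, with or without attributes
        cell = $2
        gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", cell)   # trim surrounding whitespace
        row = row sep cell
        sep = " "
    }
    $1 == "/tr" {                     # closing </tr>: emit the collected row
        if (sep != "") print row      # skip rows that had no <td> cells
        row = ""; sep = ""
    }'
}
```

It would be called as `table_rows < filename.html`. Because it reads the cell text rather than blanking out non-numeric characters, it also keeps cells that are not numbers, and it does not care whether the tags sit on separate lines.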
Hermann Peifer wrote:
> Hi All,
>
> Occasionally, I have to extract values from a long HTML table (see some
> sample lines below) and convert them into a more useful format.
Thanks for all replies.
Tiago's hint with lynx works fine with smaller and medium-sized files.
William's gawk script also works well with bigger files (100+ MB).
Hermann
Hermann | 10/2/2007 7:39:23 PM