If I had this tag and wanted to return 123 how would I do it? I have
tried countless methods but cannot get only the 123 without the
<TD> tags
<TD class=tblform3 id=L_listing width=23>123</TD>
After 3 hours I am giving up and asking the experts.
Posted by tdmailbox (42), 5/26/2005 10:04:18 PM

tdmailbox@yahoo.com wrote:
> If I had this tag and wanted to return 123 how would I do it? I have
> tried countless methods but cannot get only the 123 without the
> <TD> tags
>
> <TD class=tblform3 id=L_listing width=23>123</TD>
>
> After 3 hours I am giving up and asking the experts.
Did you study the applicable docs during those 3 hours?
perldoc perlrequick
perldoc perlretut
perldoc perlre
perldoc -f m
perldoc perlop
Or did you read this FAQ entry
perldoc -q "remove HTML"
which lets you know that you'd better think twice before attempting to
use regexes for this task?
If you have studied those documents, please post the code you have and
somebody may be able to help you fix it.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Posted by Gunnar, 5/26/2005 10:23:39 PM

tdmailbox@yahoo.com writes:
> If I had this tag and wanted to return 123 how would I do it? I have
> tried countless methods but cannot get only the 123 without the
> <TD> tags
>
> <TD class=tblform3 id=L_listing width=23>123</TD>
>
> After 3 hours I am giving up and asking the experts.
If you'd asked your computer, you'd have had the answer much faster:
perldoc -q HTML
And the first returned result is:
"How do I remove HTML from a string?"
Which is exactly what you need. If you get in the habit of searching
your local documentation first, then you'll get better answers faster,
as you won't have to wait for an answer here, and also the people who
can give you the best answers to your questions are tired of answering
them all the time, which is why they wrote the FAQ in the first place!
So if you ask FAQs here, then you will by definition only get the
less-experienced people answering your questions, as a rule.
But I'm feeling generous, and I'd been meaning to poke at
HTML::Parser for a while anyhow. So I whipped up this little example:
#!/usr/bin/perl
use warnings;
use strict;

use HTML::Parser ();

sub start_handler
{
    return if shift ne "td";    # ignore every start tag except <td>
    my $self = shift;
    $self->handler(text => sub { print shift }, "dtext");    # print decoded text
    $self->handler(end  => sub { shift->eof if shift eq "td"; },    # stop at </td>
                   "tagname,self");
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, "tagname, self" );
$p->parse( <<EODATA );
<TD class=tblform3 id=L_listing width=23>123</TD>
EODATA
print "\n";

__END__
For future reference, if you have a problem, you're going to get the
best results here if you can create an example of it that looks
something like that: short (I went to 21 lines, and that's about as
big as I try to let them get) and complete, with a clear statement of
what is happening and how that differs from what you wanted to happen.
Also, note that the above example stops parsing after the first </TD>;
if you are going to parse text containing multiple TD elements, you'll
want to read the HTML::Parser documentation to find out better ways of
doing that.
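As a rough illustration of the multiple-TD case, here is a minimal sketch that collects the text of every cell instead of stopping at the first </TD>. It assumes HTML::Parser's version 3 handler options (start_h, end_h, text_h); the two-cell row is made-up sample input, and the names @cells and $in_td are only illustrative. It does not try to cope with nested tables.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser ();

my @cells;        # text of each <td>, in document order
my $in_td = 0;    # true while the parser is inside a <td> element

my $p = HTML::Parser->new(
    api_version => 3,
    # On <td>, start a new (empty) cell and begin collecting text.
    start_h => [ sub {
                     my ($tag) = @_;
                     return unless $tag eq 'td';
                     $in_td = 1;
                     push @cells, '';
                 }, 'tagname' ],
    # On </td>, stop collecting.
    end_h   => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],
    # Text may arrive in several chunks, so append to the current cell.
    text_h  => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
);

# Hypothetical sample row built from the two cells mentioned in the thread.
$p->parse('<TR><TD class=tblform3 id=L_listing width=23>123</TD>'
        . '<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>');
$p->eof;    # flush anything the parser is still buffering

print "$_\n" for @cells;
```

Tag names are reported in lower case by default, which is why the handlers compare against 'td' even though the input uses <TD>.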
-=Eric
--
Come to think of it, there are already a million monkeys on a million
typewriters, and Usenet is NOTHING like Shakespeare.
-- Blair Houghton.
Posted by Eric, 5/26/2005 10:35:48 PM

($result) = ($bunch_of_html =~ /<td.*?>(.*?)<\/td>/i);
Posted by andrewflanders, 5/26/2005 11:21:56 PM

Eric Schwartz wrote:
>
> use HTML::Parser ();
>
> sub start_handler
> {
> return if shift ne "td";
> my $self = shift;
> $self->handler(text => sub { print shift }, "dtext");
> $self->handler(end => sub { shift->eof if shift eq "td"; },
> "tagname,self");
> }
>
> my $p = HTML::Parser->new(api_version => 3);
> $p->handler( start => \&start_handler, "tagname, self" );
> $p->parse( <<EODATA );
> <TD class=tblform3 id=L_listing width=23>123</TD>
> EODATA
> print "\n";
And this is a "simple-minded" way:
print '<TD class=tblform3 id=L_listing width=23>123</TD>'
=~ m{<td.*?>([^<]+)</td>}is, "\n";
If I was to parse a whole HTML page, possibly with nested elements, and
whose design I don't control, I wouldn't dream of using regular
expressions. If, OTOH, the task actually is as simple as the literal
question asked by the OP, I wouldn't dream of using a parsing module.
Which way is most suitable depends reasonably on the complexity of the
task together with how much you know about regular expressions.
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Posted by Gunnar, 5/26/2005 11:28:44 PM

andrewflanders@gmail.com writes:
> ($result) = ($bunch_of_html =~ /<td.*?>(.*?)<\/td>/i);
Hrm.
#!/usr/bin/perl
use warnings;
use strict;
my $bunch_of_html = <<EOHTML;
<td><img src='closetd.jpg' alt='image of </td>' /></td>
EOHTML
my ($result) = ($bunch_of_html =~ /<td.*?>(.*?)<\/td>/i);
print "result: [$result]\n";
__END__
gives:
result: [<img src='closetd.jpg' alt='image of ]
Parsing HTML with a regex is, ultimately, an exercise in futility.
You can do it for one small subset, but as soon as you change it even
a small amount, your solution can easily break. And then you tweak.
And then it breaks again. It's easier to spend a little effort
up-front with HTML::Parser or the like, than to constantly be fixing
regex-based hacks.
-=Eric
--
Come to think of it, there are already a million monkeys on a million
typewriters, and Usenet is NOTHING like Shakespeare.
-- Blair Houghton.
Posted by Eric, 5/26/2005 11:29:35 PM

wrote:
> If I had this tag and wanted to return 123 how would I do it? I have
> tried countless methods but cannot get only the 123 without the
> <TD> tags
>
> <TD class=tblform3 id=L_listing width=23>123</TD>
>
> After 3 hours I am giving up and asking the experts.
use strict;
use warnings;

use HTML::TreeBuilder;
:
:
my $root = HTML::TreeBuilder->new_from_content( $content );
my $td   = $root->look_down(
    _tag => 'td', class => 'tblform3', id => 'L_listing'
);
defined $td or die "TD not found";
print $td->as_text, "\n";

(untested, assumes $content contains the HTML)
see also:
http://johnbokma.com/perl/phpbb-remote-backup.html
http://johnbokma.com/perl/froogle-script.html
--
John Small Perl scripts: http://johnbokma.com/perl/
Perl programmer available: http://castleamber.com/
Happy Customers: http://castleamber.com/testimonials.html
Posted by John, 5/26/2005 11:36:41 PM

Gunnar Hjalmarsson <noreply@gunnar.cc> writes:
> And this is a "simple-minded" way:
>
> print '<TD class=tblform3 id=L_listing width=23>123</TD>'
> =~ m{<td.*?>([^<]+)</td>}is, "\n";
Which, as you knew, fails if the <TD> has comments in it:
$ perl -e 'print "<TD class=tblform3 id=L_listing width=23>\n123<!-- this is the item ID from the database -->\n</td>" =~ m{<td.*?>([^<]+)</td>}is, "\n";'
$
If there is content on both sides of the comment, only the
post-comment parts get printed, but if the content is after the
comment, it will do what it's supposed to. This is the sort of thing
that causes me to lose sleep and pull out my hair before its time. I
know you knew that, I'm just pointing out to the OP how fragile a
regex-based solution can be. It may work now, in one place, but
there's all sorts of things that could cause it to fail later, some of
which can be very subtle.
> Which way is most suitable depends reasonably on the complexity of the
> task together with how much you know about regular expressions.
Also the likelihood of your input changing-- a regex solution might be
right at first, but can easily fail later-- as well as the intended
scope of use. Subroutines have a way around here of quickly migrating
out into general-use modules, where they are used by people in very
different contexts from where they originated. What works for one
particular task is likely to need serious changes if used for others.
-=Eric
--
Come to think of it, there are already a million monkeys on a million
typewriters, and Usenet is NOTHING like Shakespeare.
-- Blair Houghton.
Posted by Eric, 5/26/2005 11:44:45 PM

That's true if you are writing a web crawler, but most of the time the
purpose for doing this is to strip spreadsheet-style data from a
website you don't control and insert it into your own database. In that
case the formatting of the target HTML is likely to be fairly
consistent, and it's quicker for me to write that regex
than to install and learn how to use HTML::Parser. Add to that the fact
that your example case is silly.
Posted by andrewflanders, 5/26/2005 11:55:27 PM

Eric Schwartz wrote:
> I'm just pointing out to the OP how fragile a
> regex-based solution can be.
Agreed. You need to know that no comments will be inserted that way, and
that there are no attributes containing '>' characters, etc., etc.
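Both caveats can be seen in a small sketch. It runs the "simple-minded" pattern quoted earlier in the thread against three inputs; the second and third inputs are invented here for illustration, as are the names @samples and %captured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The "simple-minded" pattern quoted earlier in the thread.
my $re = qr{<td.*?>([^<]+)</td>}is;

# The first sample is the OP's tag; the other two are made up.
my @samples = (
    [ 'plain cell'          => '<TD class=tblform3 id=L_listing width=23>123</TD>' ],
    [ "'>' in an attribute" => '<td title="a>b">123</td>' ],
    [ 'comment in the cell' => '<td>123<!-- item ID --></td>' ],
);

my %captured;
for my $sample (@samples) {
    my ($label, $html) = @$sample;
    my ($text) = $html =~ $re;    # list context returns the capture, or nothing
    $captured{$label} = $text;
    printf "%-20s %s\n", $label, defined $text ? "[$text]" : 'no match';
}
```

The plain cell yields 123. With a '>' inside the attribute value, the tag is taken to end too early, so the capture picks up the tail of the attribute (b">123). With a comment after the value, the pattern does not match at all.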
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Posted by Gunnar, 5/26/2005 11:56:51 PM

andrewflanders@gmail.com writes:
> That's true if you are writing a web crawler, but most of the time the
> purpose for doing this is to strip spreadsheet-style data from a
> website you don't control and insert it into your own database. In that
> case the formatting of the target HTML is likely to be fairly
> consistent, and it's quicker for me to write that regex
> than to install and learn how to use HTML::Parser. Add to that the fact
> that your example case is silly.
Please quote the messages you're replying to, at least enough so that
we can tell what you're replying to. Guessing that you're replying to
my reply to you, the fact that you don't control the HTML is exactly
why you need something like HTML::Parser-- if you control the HTML,
you can force it to always be produced so your regex can parse it. If
you don't, though, the producer of that HTML can do all kinds of
things to break your regex. Inserting comments in the middle of table
data is only one of the most obvious ways a regex can break; see my
reply to Gunnar's regex solution for more detail.
-=Eric
--
Come to think of it, there are already a million monkeys on a million
typewriters, and Usenet is NOTHING like Shakespeare.
-- Blair Houghton.
Posted by Eric, 5/27/2005 12:27:20 AM

<TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>
That works.. however it returns the whole <TD> tag.. I just want the
value inside the tag. That is my core issue that I can't find the
solution to. I can find plenty of expressions that will find the right
<TD> tag but not one that will just give me the data between the tags
Posted by tdmailbox, 5/27/2005 12:36:59 AM

"the fact that you don't control the HTML is exactly why you need
something like HTML::Parser"
You don't really know that the target HTML is as dirty as you assume.
Unless the poster says he or she is writing a long-term use and robust
data miner I'm assuming it's a one-off script where the html and data
in question is uniform because this is most often the case.
Posted by andrewflanders, 5/27/2005 12:41:23 AM

andrewflanders@gmail.com wrote:
> You don't really know that the target HTML is as dirty as you assume.
Maybe not dirty. Maybe just subject to change. I have been bit by that
even using an HTML parser.
> Unless the poster says he or she is writing a long-term use and robust
> data miner I'm assuming it's a one-off script
We don't know that, so the discussion has merit.
Posted by Scott, 5/27/2005 12:50:29 AM

tdmailbox@yahoo.com wrote:
> <TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>
>
> That works.. however it returns the whole <TD> tag..
No, it doesn't. It doesn't return anything.
Have you read any of the replies in this thread??
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Posted by Gunnar, 5/27/2005 12:55:37 AM

Since L_listing is what makes the tag unique, I took your code and
modified it to
<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>
and I get the right tag..
However the issue is that I only want to return the data between the
tag. The expression above includes the tag.
<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>
Thanks in advance for any help on that.
Posted by tdmailbox, 5/27/2005 1:01:41 AM

[ Please provide some context when replying!! Most people are not
reading this group via Google Groups. ]
tdmailbox@yahoo.com wrote:
> Gunnar Hjalmarsson wrote:
>> And this is a "simple-minded" way:
>>
>> print '<TD class=tblform3 id=L_listing width=23>123</TD>'
>> =~ m{<td.*?>([^<]+)</td>}is, "\n";
>
> Since L_listing is what makes the tag unique, I took your code and
> modified it to
> <TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>
>
> and I get the right tag..
>
> However the issue is that I only want to return the data between the
> tag. The expression above includes the tag.
> <TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>
Don't try to just explain in English what you are doing, but post a
short but complete program that demonstrates the problem you are having.
Also, have you read the description of the m// operator in "perldoc perlop"?
--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl
Posted by Gunnar, 5/27/2005 1:07:50 AM

You must be accessing the result of the match wrong because the match
that is found between the ( ) will not include the entire td tag but
it's possible that some other variable does. Try printing $1 after the
match is supposed to occur and see if it prints the value you want to
parse out.
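A minimal sketch of the two usual ways to get at the captured text, assuming the HTML sits in a scalar (called $html here only for illustration) and using the pattern and sample tag from earlier in the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input taken from the thread; single quotes keep the $ literal.
my $html = '<TD class=tblform3 id=L_listnum width=106>$799,000</TD></TR>';

# Scalar context: the match only reports success (1) or failure.
my $matched = $html =~ m{<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>}i;

# After a successful match, $1 holds whatever the first ( ) captured.
my $from_dollar_one = $matched ? $1 : undef;

# List context: the match returns the captured groups directly.
my ($from_list) = $html =~ m{<TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>}i;

print "matched:  $matched\n";            # matched:  1
print "via \$1:   $from_dollar_one\n";    # via $1:   $799,000
print "via list: $from_list\n";          # via list: $799,000
```

Only check $1 after confirming the match succeeded; after a failed match it still holds the value from the previous successful one.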
Posted by andrewflanders, 5/27/2005 1:16:33 AM

tdmailbox@yahoo.com wrote:
> <TD class=tblform3 id=L_listnum.*?>(.*?)<\/TD>
>
> That works.. however it returns the whole <TD> tag.. I just want the
> value inside the tag. That is my core issue that I can't find the
> solution to. I can find plenty of expressions that will find the right
> <TD> tag but not one that will just give me the data between the tags
>
Read up on HTML::TableExtract.
Getting this sort of data using regex or similar is tricky and the page
definition may change ( will change ).
If the tables are not well structured you may have to search by depth
and count to get the right table. You will have to come to grips with
the structure of the data you are dealing with - the tables and the form.
Start here
"http://search.cpan.org/~msisk/HTML-TableExtract-1.08/lib/HTML/TableExtract.pm"
Happy reading.
Posted by Paul, 5/27/2005 2:05:13 AM

tdmailbox@yahoo.com wrote:
> Since L_listing is what makes the tag unique, I took your code and
> modified it to
> <TD class=tblform3 id=L_listnum.*?>([^<]+)</TD>
>
> and I get the right tag..
>
> However the issue is that I only want to return the data between the
> tag. The expression above includes the tag.
No, it doesn't. You must not be using the regex in the proper manner.
Hint: /(.*)/; m/(.*)/; m%(.*)%; m{(.*)}; m[(.*)]; m<(.*)>;
-Joe
Posted by Joe, 5/27/2005 4:42:32 AM

You are totally correct. After further investigation I realized that I
needed to grab the $1. Once I did that I got the value that I was
looking for.
andrewflanders@gmail.com wrote:
> You must be accessing the result of the match wrong because the match
> that is found between the ( ) will not include the entire td tag but
> it's possible that some other variable does. Try printing $1 after the
> match is supposed to occur and see if it prints the value you want to
> parse out.
Posted by tdmailbox, 5/27/2005 4:44:03 AM