Java API for correcting malformed HTML code


Hello,
What are the Java APIs out there that can simply correct malformed
HTML code, like take an input stream of badly formed HTML and produce
an output stream of clean HTML code (parsable by the Swing HTML
parser)?
Reply sunrise1647 (4) 6/9/2004 1:03:20 PM

MCP wrote:
> What are the Java APIs out there that can simply correct malformed
> HTML code, like take an input stream of badly formed HTML and produce
> an output stream of clean HTML code (parsable by the Swing HTML
> parser)?

Maybe this can help: http://jtidy.sourceforge.net/ No idea whether it
fulfills all your requirements.
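
Whichever tool does the cleaning, one way to check the "parsable by the Swing HTML parser" requirement is to feed the result to that parser and look at what it reports. The following is only a sketch using the JDK's javax.swing.text.html classes; the class name SwingParseCheck and its two fields are made up for illustration, and the wording of the error messages depends on the JDK in use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records what the Swing HTML parser sees in a document.
class SwingParseCheck extends HTMLEditorKit.ParserCallback {
    final List<String> tags = new ArrayList<>();   // start tags, in document order
    final List<String> errors = new ArrayList<>(); // parser complaints, if any

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        tags.add(tag.toString());
    }

    @Override
    public void handleError(String message, int pos) {
        errors.add(pos + ": " + message);
    }

    /** Runs the Swing parser over html and returns the collected report. */
    static SwingParseCheck parse(String html) {
        SwingParseCheck check = new SwingParseCheck();
        try {
            // 'true' tells the parser to ignore any charset declaration in the markup.
            new ParserDelegator().parse(new StringReader(html), check, true);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return check;
    }

    public static void main(String[] args) {
        SwingParseCheck report = parse("<ul><li>one<li>two</ul>");
        System.out.println("tags:   " + report.tags);
        System.out.println("errors: " + report.errors);
    }
}
```

An empty error list does not prove the markup is valid HTML; it only means this particular parser did not complain.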

/Thomas
Reply nobody89 (1419) 6/9/2004 1:28:31 PM


On 9 Jun 2004 06:03:20 -0700, sunrise@cliffhanger.com (MCP) wrote or
quoted :

>What are the Java APIs out there that can simply correct malformed
>HTML code, like take an input stream of badly formed HTML and produce
>an output stream of clean HTML code (parsable by the Swing HTML
>parser)?

I have been bugging the HTMLValidator people to write such a beast.  I
figured it could save me a ton of work if it did simple unambiguous
corrections like inserting a missing </li> or converting a stray & to &amp;.
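
As a rough illustration of the "insert missing </li>" kind of correction, here is a sketch that scans only the <li>, <ul> and <ol> tags and adds a </li> wherever an item is still open when the next item or the end of its list arrives. The class and method names are invented for this example, and it makes no attempt to understand comments, scripts, or attribute values containing ">".

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-shot fixer: adds the </li> end tags that were left out.
class ListItemCloser {
    // Matches <li>, </li>, <ul>, </ul>, <ol>, </ol> (any case, with attributes).
    private static final Pattern LIST_TAG =
            Pattern.compile("<(/?)(li|ul|ol)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String closeListItems(String html) {
        StringBuilder out = new StringBuilder();
        // One flag per enclosing list: is an <li> currently open at that level?
        Deque<Boolean> itemOpen = new ArrayDeque<>();
        itemOpen.push(false); // pseudo-level for markup outside any list
        int last = 0;
        Matcher m = LIST_TAG.matcher(html);
        while (m.find()) {
            out.append(html, last, m.start());
            boolean closing = !m.group(1).isEmpty();
            if (m.group(2).equalsIgnoreCase("li")) {
                // A new <li> while the previous one is still open: close it first.
                if (itemOpen.pop() && !closing) out.append("</li>");
                itemOpen.push(!closing);
            } else if (!closing) {
                itemOpen.push(false);          // entering a (possibly nested) list
            } else {
                // Leaving a list: close its last item if it is still open.
                if (itemOpen.pop()) out.append("</li>");
                if (itemOpen.isEmpty()) itemOpen.push(false);
            }
            out.append(m.group());
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(closeListItems("<ul><li>one<li>two</ul>"));
    }
}
```

Markup whose items are already closed passes through unchanged, so the fixer can be run repeatedly over the same files.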

The author's fear is making a change that the user did not want.  He does
not want to be morally liable for messing up the source.
 
I have done a number of one-shot programs to clean up various problems
in my website. They do it all with indexOf and substring. If you are
just trying to correct a single problem at a time, it can be pretty
simple.
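
To give an idea of what such a one-shot program can look like, here is a sketch for the stray-ampersand case that sticks to indexOf and substring. It is not the actual program used on the site; the names AmpersandFixer, escapeStrayAmpersands and looksLikeEntity are invented, and the test for "already an entity" is a deliberately simple assumption (a short run of letters or digits, or a numeric reference, ending in a semicolon).

```java
// Hypothetical one-shot fixer: turns stray '&' into "&amp;" and leaves
// things that already look like entities (&amp;, &#169;, &#xA9;) alone.
class AmpersandFixer {

    static String escapeStrayAmpersands(String html) {
        StringBuilder out = new StringBuilder(html.length() + 16);
        int from = 0;
        while (true) {
            int amp = html.indexOf('&', from);
            if (amp < 0) {
                out.append(html.substring(from)); // no more ampersands
                return out.toString();
            }
            out.append(html.substring(from, amp));
            out.append(looksLikeEntity(html, amp) ? "&" : "&amp;");
            from = amp + 1;
        }
    }

    /** True if the text starting at pos looks like "&name;", "&#123;" or "&#x1F;". */
    static boolean looksLikeEntity(String html, int pos) {
        int semi = html.indexOf(';', pos + 1);
        // No semicolon, an empty body, or an implausibly long body: not an entity.
        if (semi < 0 || semi == pos + 1 || semi - pos > 10) return false;
        String body = html.substring(pos + 1, semi);
        if (body.startsWith("#")) {
            boolean hex = body.length() > 1
                    && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
            String digits = body.substring(hex ? 2 : 1);
            if (digits.isEmpty()) return false;
            for (char c : digits.toCharArray()) {
                if (Character.digit(c, hex ? 16 : 10) < 0) return false;
            }
            return true;
        }
        for (char c : body.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(escapeStrayAmpersands("Tom & Jerry &amp; friends &#169; AT&T"));
    }
}
```

The sketch does not check entity names against the real list, so something like "&bogus;" is left untouched rather than escaped.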

-- 
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming. 
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Reply look-on (3298) 6/9/2004 8:54:17 PM

On Wed, 09 Jun 2004 20:54:17 GMT, Roedy Green wrote:

> ..it could save me a ton of work if it did simple unambiguous
> corrections like insert missing </li>

(whispers)  The W3C definition of <li>
is that it does not require a closing </li>..

<http://www.w3.org/TR/1999/REC-html401-19991224/struct/lists.html#didx-list>

-- 
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
Reply SeeMySites (3836) 6/10/2004 4:03:36 AM

On Thu, 10 Jun 2004 04:03:36 GMT, Andrew Thompson
<SeeMySites@www.invalid> wrote or quoted :

>(whispers)  The W3C definition of <li>
>is that it does not require a closing </li>..

what about </td> and </tr>?

Anyway I like to have the HTML consistent.

-- 
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming. 
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Reply look-on (3298) 6/10/2004 6:14:58 AM

On Thu, 10 Jun 2004 06:14:58 GMT, Roedy Green wrote:

> On Thu, 10 Jun 2004 04:03:36 GMT, Andrew Thompson
> <SeeMySites@www.invalid> wrote or quoted :
> 
>>(whispers)  The W3C definition of <li>
>>is that it does not require a closing </li>..
> 
> what about </td> and </tr>?

As far as I can tell from the HTML 4.01 DTD,
those end tags are optional as well.  (shrugs) If in doubt,
leave one out and throw it at the validator
(which is usually quicker than finding the
element on W3C's site)

> Anyway I like to have the HTML consistent.

;-)   I know what you mean, it has taken
some training to *prevent* myself from 
typing </p> and </li>..

-- 
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
Reply SeeMySites (3836) 6/10/2004 7:41:26 AM

On Thu, 10 Jun 2004 18:37:46 GMT, arne thormodsen wrote:

>> ;-)   I know what you mean, it has taken
>> some training to *prevent* myself from
>> typing </p> and </li>..
>>
> 
> Why bother?  All new browsers..

...not all browsers are new, not all users
can update, not all sites can afford to
turn away customers just because their
browser is not flavour of the month.

That's why.

-- 
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
Reply SeeMySites (3836) 6/10/2004 6:18:09 PM

>
> ;-)   I know what you mean, it has taken
> some training to *prevent* myself from
> typing </p> and </li>..
>

Why bother?  All new browsers interpret XHTML properly, so you might
as well make your HTML well-formed as XML too.  Then you can use XML
tools to process it.
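
A small sketch of what "use XML tools" can mean in practice, using only the JAXP parser that ships with the JDK. The class name XhtmlAsXml and the choice of counting <li> elements are just for illustration. The sample markup deliberately has no DOCTYPE and no named entities such as &nbsp;, since either would need the XHTML DTD to be resolved.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: treats markup as XML and queries it with the DOM API.
class XhtmlAsXml {

    /** Counts <li> elements if markup is well-formed XML; returns -1 if it is not. */
    static int countListItems(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getElementsByTagName("li").getLength();
        } catch (SAXException notWellFormed) {
            return -1; // e.g. an unclosed <li> makes the XML parser give up
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countListItems("<ul><li>one</li><li>two</li></ul>"));
        System.out.println(countListItems("<ul><li>one<li>two</ul>"));
    }
}
```

The second call shows the flip side: an XML parser rejects markup with the optional end tags left out, even though it is legal HTML.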

--arne


Reply arneDOTthormodsen (21) 6/10/2004 6:37:46 PM


>
> Maybe this can help http://jtidy.sourceforge.net/ No idea if it
> fulfills all your requirements.
>

I've used it extensively in the past.  It works pretty well.

--arne

> /Thomas


Reply arneDOTthormodsen (21) 6/10/2004 6:38:34 PM

Andrew Thompson wrote:

> On Thu, 10 Jun 2004 18:37:46 GMT, arne thormodsen wrote:
> 
>>> ;-)   I know what you mean, it has taken
>>> some training to *prevent* myself from
>>> typing </p> and </li>..
>>>
>> 
>> Why bother?  All new browsers..
> 
> ..not all browsers are new, not all users
> can update, not all sites can afford to
> turn away customers just because their
> browser is not flavour of the month.
> 
> That's why.
> 

I'm pretty sure even Netscape 4.7 and Lynx interpret </p> and </li>
correctly. Even pure XHTML should pose no problem for them, as long as you
write empty elements like <br> as <br /> rather than <br/>. Any browser better
than those (that's all of the currently used browsers :) should have no
problems if you close your tags.
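
If existing pages need converting to that <br /> form, a regex pass is one possible approach. This is only a sketch: the class name VoidTagNormalizer is invented, the list of empty elements is a partial assumption, and the pattern will misfire on attribute values that contain ">".

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrites <br>, <br/> and friends as <br /> (with the space).
class VoidTagNormalizer {
    // A few empty elements; group 2 captures any attributes, and the optional
    // trailing whitespace and slash are dropped so they can be re-added uniformly.
    private static final Pattern VOID_TAG = Pattern.compile(
            "<(br|hr|img|input|meta|link)\\b([^>]*?)\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    static String normalize(String html) {
        return VOID_TAG.matcher(html).replaceAll("<$1$2 />");
    }

    public static void main(String[] args) {
        System.out.println(normalize("line one<br>line two<br/><img src=\"a/b.png\">"));
    }
}
```

Tag names keep whatever case they had, so <BR> becomes <BR />; real XHTML would also want them in lower case.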

As the spec says, the closing tags are not *required*; it doesn't say
that they shouldn't be present. And the advantages of writing
XML-compatible HTML outweigh those of catering to the lowest common
denominator, IMHO.

Have you got any example of a browser which breaks when you add the optional
closing tags?

-- 
Kind regards,
Christophe Vanfleteren
Reply c.v4nfl3t3r3n (486) 6/10/2004 8:17:16 PM

Christophe Vanfleteren <c.v4nfl3t3r3n@pandora.be> wrote:
 
> I'm pretty sure even Netscape 4.7 and Lynx interpret </p> and </li>
> correctly.

I can confirm that both do. I always use <p></p> and <li></li> in my HTML.

-- 
JustThe.net Internet & New Media Services, http://JustThe.net/ 
Steven J. Sobol, Geek In Charge / 888.480.4NET (4638) / sjsobol@JustThe.net
PGP Key available from your friendly local key server (0xE3AE35ED)
Apple Valley, California     Nothing scares me anymore. I have three kids.
Reply sjsobol (486) 6/10/2004 9:29:41 PM

On Thu, 10 Jun 2004 20:17:16 GMT, Christophe Vanfleteren wrote:
> Andrew Thompson wrote:
>> On Thu, 10 Jun 2004 18:37:46 GMT, arne thormodsen wrote:
>> 
>>>> ;-)   I know what you mean, it has taken
>>>> some training to *prevent* myself from
>>>> typing </p> and </li>..
...
>>> Why bother?  All new browsers..
>> 
>> ..not all browsers are new,
....
> I'm pretty sure even Netscape 4.7 and Lynx interpret </p> and </li>
> correctly. Even pure XHTML should pose no problem for them, as long as you
> write empty elements like <br> as <br /> rather than <br/>.

Oh, alright,..  I suppose I tuned out at 
the 'new browsers' comment.  

I had rejected XHTML earlier for some reason 
...no 'target' for 'href's.. no applet tags or 
something..  I do not quite remember.

Maybe I should take another look..

[ ..but damn-it, if it does not work on 
my NN 4.08, it is *out*!  ;-) ]

-- 
Andrew Thompson
http://www.PhySci.org/ Open-source software suite
http://www.PhySci.org/codes/ Web & IT Help
http://www.1point1C.org/ Science & Technology
Reply SeeMySites (3836) 6/11/2004 12:43:24 AM

