Distinguish text URLs from non-text URLs?

Hello,
I have a question about writing a crawler-like program in Java.
I only need the text (HTML) files, so I am wondering whether anyone
knows a good way to distinguish text URLs (files such as .html, .htm,
etc.) from non-text URLs.

What I want is: given a URL as a String, how can I decide whether it
points to a text file (.htm, etc.) or not? Text pages usually have URLs
ending in .htm, .html, etc., but many dynamic pages do not. For example,
http://www.amazon.com/exec/obidos/ASIN/B00006HXJ6/ref=nosim/fatwalletcom/002-5149236-2409652
points to an HTML page, but the URL has no file extension.

I have tried getContentType() from class URLConnection, but it
works badly; it even reports many .pdf files as text. :-(

Does anyone have any ideas?
Thanks and happy new year!~
kaidizhao (106)
1/1/2004 3:01:16 AM
comp.lang.java.programmer

The content type of the response is a web server setting.
If the server is responding with "Content-Type: text/plain" for a PDF file,
then it has not been configured correctly.

Server configuration is *usually* done with a mapping between file extension
and content type, so if you were to duplicate this functionality on the
client side, you may be prone to problems.

I'd configure the server correctly and use the getContentType() call.
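
To illustrate why duplicating the extension-to-type mapping on the client is fragile, here is a minimal sketch that uses the JDK's own extension table through `URLConnection.guessContentTypeFromName`. The class name `ExtensionGuess` and the helper `guessFromExtension` are made up for this example. The lookup only inspects the name, never the server's response.

```java
import java.net.URLConnection;

// Hypothetical helper class, for illustration only.
final class ExtensionGuess {

    // Returns the MIME type the JDK's built-in extension table associates
    // with the name, or null when the name carries no known extension.
    static String guessFromExtension(String url) {
        return URLConnection.guessContentTypeFromName(url);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/index.html",
            "http://www.amazon.com/exec/obidos/ASIN/B00006HXJ6/ref=nosim/fatwalletcom/002-5149236-2409652"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + guessFromExtension(sample));
        }
    }
}
```

The second sample is the URL from the question: it has no extension, so an extension-based guess has nothing to go on, which is the kind of problem a client-side mapping runs into.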

--
Tony Morris
(BInfTech, Cert 3 I.T., SCJP[1.4], SCJD)
Software Engineer
IBM Australia - Tivoli Security Software



dibblego (448)
1/1/2004 3:04:18 AM
kaidizhao@yahoo.com.sg (Kaidi) writes:

> What I want is: given a URL as a String, how can I decide whether it
> points to a text file (.htm, etc.) or not?

Connect to the URL, do a HEAD request, and check the content type.

> We know text pages usually have URLs ending with .htm, .html, etc.

Not necessarily.

> I have tried getContentType() from class URLConnection, but it
> works badly; it even reports many .pdf files as text. :-(

No, it's not the method URLConnection.getContentType() that is bad,
it's the web server sending the wrong content type. The API cannot fix
outside errors.
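
A minimal sketch of the HEAD-request approach, assuming plain `java.net.HttpURLConnection`. The class name `HeadCheck`, the method names, the 5-second timeouts, and the decision to also accept `application/xhtml+xml` are all assumptions made for this example, not something stated in the thread. The result is only as reliable as the header the server sends.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class HeadCheck {

    // True when a Content-Type header value names an HTML media type.
    // Parameters such as "; charset=UTF-8" are ignored.
    static boolean isHtmlType(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        // Accepting XHTML as well is an assumption of this sketch.
        return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
    }

    // Sends a HEAD request (headers only, no body) and checks the declared type.
    // Only http/https URLs are supported; other schemes fail the cast below.
    static boolean serverSaysHtml(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000); // arbitrary timeout for the sketch
            conn.setReadTimeout(5000);
            return isHtmlType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more URLs on the command line.
        for (String url : args) {
            System.out.println(url + " -> " + serverSaysHtml(url));
        }
    }
}
```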
1/1/2004 11:29:12 AM
Kaidi wrote:
> Hello,
> I have a question about writing a crawler-like program in Java.
> I only need the text (HTML) files, so I am wondering whether anyone
> knows a good way to distinguish text URLs (files such as .html, .htm,
> etc.) from non-text URLs.

You could try 'sniffing' the first few bytes of the files. If they start 
with <!DOC or <html you can be pretty sure they're html files. Of 
course, this isn't foolproof.
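
A rough sketch of that kind of sniffing, under some assumptions: the class name `PrefixSniffer`, the 64-byte window, and the ISO-8859-1 decoding are choices made for this example only. It checks just the two prefixes mentioned above, so a page that opens with a byte-order mark, a comment, or an XML declaration would be missed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class PrefixSniffer {

    private static final int SNIFF_BYTES = 64; // arbitrary window size

    // Checks whether the first `length` bytes, ignoring leading whitespace
    // and letter case, start with "<!doc" or "<html".
    static boolean looksLikeHtmlPrefix(byte[] data, int length) {
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1)
                .trim()
                .toLowerCase(Locale.ROOT);
        return head.startsWith("<!doc") || head.startsWith("<html");
    }

    // Reads up to SNIFF_BYTES from the stream and applies the prefix check.
    static boolean startsLikeHtml(InputStream in) throws IOException {
        byte[] buf = new byte[SNIFF_BYTES];
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (filled < buf.length) {
            int read = in.read(buf, filled, buf.length - filled);
            if (read < 0) {
                break;
            }
            filled += read;
        }
        return looksLikeHtmlPrefix(buf, filled);
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<!DOCTYPE html><html></html>".getBytes(StandardCharsets.US_ASCII);
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsLikeHtml(new ByteArrayInputStream(page)));
        System.out.println(startsLikeHtml(new ByteArrayInputStream(pdf)));
    }
}
```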
-- 
Daniel Sjöblom

dsjoblom (223)
1/2/2004 12:34:01 AM
Thanks friends.
I will try the HEAD request as suggested above.
Currently, I 'sniff' the bytes to see if they contain any HTML tags
such as HTML, HEAD, META, BODY, <p>, etc. It works OK for me, although
it is somewhat troublesome and something of a "heuristic".

kaidizhao (106)
1/3/2004 8:21:32 PM
"Kaidi" <kaidizhao@yahoo.com.sg> wrote in message
news:7844729b.0401031221.629d41e5@posting.google.com...
> Thanks friends.
> I will try the HEAD request as suggested above.
> Currently, I 'sniff' the bytes to see if they contain any HTML tags
> such as HTML, HEAD, META, BODY, <p>, etc. It works OK for me, although
> it is somewhat troublesome and something of a "heuristic".

Properly formed HTML documents have a
line like this at the very top:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

So if the file _starts_ with '<!DOCTYPE HTML'
you can tell early that this is an HTML document.
[  Unfortunately, very few pages _are_ properly
formed. ]

Otherwise I would recommend searching for the strings
you mentioned, but with the opening '<', such as
'<head' or '<html'.

That reminds me of something else: make sure
you check for both upper and lower case,
as either is valid.
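
The case-insensitive marker search could be sketched roughly like this. The class name `MarkerSniffer` and the exact marker list (including `<body`) are assumptions for the example; the idea is simply to lower-case the start of the document once and look for tag openings rather than bare words.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class MarkerSniffer {

    // Lower-case tag openings to look for; the list is an arbitrary choice.
    private static final String[] MARKERS = {"<!doctype html", "<html", "<head", "<body"};

    // True when the text contains any marker, regardless of letter case.
    static boolean containsHtmlMarker(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String marker : MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsHtmlMarker("<HTML><HEAD><TITLE>x</TITLE></HEAD>"));
        System.out.println(containsHtmlMarker("The words HTML and HEAD, but no tags"));
    }
}
```

Requiring the opening '<' is what keeps the second sample from matching, since the bare words HTML and HEAD can appear in any text file.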

HTH

--
Andrew Thompson
* http://www.PhySci.org/ PhySci software suite
* http://www.1point1C.org/ 1.1C - Superluminal!
* http://www.AThompson.info/andrew/ personal site


andrew64 (796)
1/4/2004 10:15:46 AM