Why doesn't Python's "robotparser" like Wikipedia's "robots.txt" file?

    For some reason, Python's parser for "robots.txt" files
doesn't like Wikipedia's "robots.txt" file:

 >>> import robotparser
 >>> url = 'http://wikipedia.org/robots.txt'
 >>> chk = robotparser.RobotFileParser()
 >>> chk.set_url(url)
 >>> chk.read()
 >>> testurl = 'http://wikipedia.org'
 >>> chk.can_fetch('Mozilla', testurl)
False
 >>>

    The Wikipedia robots.txt file passes robots.txt validation,
and it doesn't disallow unknown user agents.  But the Python
parser doesn't see it that way.  No matter what user agent or URL
is specified, the only answer for that robots.txt file is "False".
It fails in Python 2.4 on Windows and 2.5 on Fedora Core.

    I use "robotparser" on lots of other robots.txt files, and it
normally works.  It even used to work on Wikipedia's older file.
But there's something in there now that robotparser doesn't like.
Any ideas?

				John Nagle
nagle
10/2/2007 3:31:51 AM

In message <HYiMi.9932$JD.6615@newssvr21.news.prodigy.net>, John Nagle
wrote:

>     For some reason, Python's parser for "robots.txt" files
> doesn't like Wikipedia's "robots.txt" file:
> 
>  >>> import robotparser
>  >>> url = 'http://wikipedia.org/robots.txt'
>  >>> chk = robotparser.RobotFileParser()
>  >>> chk.set_url(url)
>  >>> chk.read()
>  >>> testurl = 'http://wikipedia.org'
>  >>> chk.can_fetch('Mozilla', testurl)
> False
>  >>>

    >>> chk.errcode
    403

Significant?

ldo
10/2/2007 4:24:16 AM
On 02/10/2007, John Nagle <nagle@animats.com> wrote:
>
> But there's something in there now that robotparser doesn't like.
> Any ideas?

Wikipedia denies _all_ access for the standard urllib user agent, and
when the robotparser gets a 401 or 403 response when trying to fetch
robots.txt, it is equivalent to "Disallow: *".

http://infix.se/2006/05/17/robotparser
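
A possible workaround (just a sketch, not tested against Wikipedia;
the User-Agent string below is only a placeholder) is to fetch
robots.txt yourself with urllib2, sending an explicit User-Agent
header, and hand the lines to parse() instead of calling read():

    import urllib2
    import robotparser

    # Fetch robots.txt ourselves, with a User-Agent header the server
    # is less likely to reject than urllib's default one.
    req = urllib2.Request('http://wikipedia.org/robots.txt',
                          headers={'User-Agent': 'ExampleBot/0.1'})
    lines = urllib2.urlopen(req).read().splitlines()

    # Give the lines to the parser instead of letting read() fetch them.
    chk = robotparser.RobotFileParser()
    chk.set_url('http://wikipedia.org/robots.txt')
    chk.parse(lines)
    print chk.can_fetch('Mozilla', 'http://wikipedia.org')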

It could also be worth mentioning that if you were planning on
crawling a lot of Wikipedia pages, you may be better off downloading
the whole thing instead: <http://download.wikimedia.org/>
(perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
wiki markup to HTML).
-- 
filip salomonsson
10/2/2007 2:10:02 PM
Lawrence D'Oliveiro wrote:
> In message <HYiMi.9932$JD.6615@newssvr21.news.prodigy.net>, John Nagle
> wrote:
> 
>>     For some reason, Python's parser for "robots.txt" files
>> doesn't like Wikipedia's "robots.txt" file:
>>
>>  >>> import robotparser
>>  >>> url = 'http://wikipedia.org/robots.txt'
>>  >>> chk = robotparser.RobotFileParser()
>>  >>> chk.set_url(url)
>>  >>> chk.read()
>>  >>> testurl = 'http://wikipedia.org'
>>  >>> chk.can_fetch('Mozilla', testurl)
>> False
>>  >>>
> 
>     >>> chk.errcode
>     403
> 
> Significant?
> 
    Helpful.  Also an undocumented feature.  See

	http://docs.python.org/lib/module-robotparser.html

					John Nagle
nagle
10/2/2007 3:00:19 PM
Filip Salomonsson wrote:
> On 02/10/2007, John Nagle <nagle@animats.com> wrote:
>> But there's something in there now that robotparser doesn't like.
>> Any ideas?
> 
> Wikipedia denies _all_ access for the standard urllib user agent, and
> when the robotparser gets a 401 or 403 response when trying to fetch
> robots.txt, it is equivalent to "Disallow: *".
> 
> http://infix.se/2006/05/17/robotparser

     That explains it.  It's an undocumented feature of "robotparser",
as is the 'errcode' variable.  The documentation of "robotparser" is
silent on error handling (can it raise an exception?) and should be
updated.
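
For what it's worth, HTTP error statuses seem to be recorded in
'errcode' rather than raised, but network-level failures (DNS errors,
refused connections) do come out of read() as IOError, so a defensive
caller ends up looking roughly like this (a sketch, not from the docs):

    import robotparser

    chk = robotparser.RobotFileParser()
    chk.set_url('http://wikipedia.org/robots.txt')
    try:
        chk.read()          # HTTP errors (401/403/...) just set chk.errcode
    except IOError, e:
        # Network-level failures propagate out of urllib as IOError.
        print "couldn't fetch robots.txt:", e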

> It could also be worth mentioning that if you were planning on
> crawling a lot of Wikipedia pages, you may be better off downloading
> the whole thing instead: <http://download.wikimedia.org/>
> (perhaps adding <http://code.google.com/p/wikimarkup/> to convert the
> wiki markup to HTML).

     This is for SiteTruth, the site rating system (see "sitetruth.com"),
and we never look at more than 21 pages per site.  We're looking for
the name and address of the business behind the web site, and if we
can't find that after checking 20 of the most obvious places, it's
either not there or not "prominently disclosed".

				John Nagle
nagle
10/2/2007 3:11:28 PM
In article <ActMi.30614$eY.11375@newssvr13.news.prodigy.net>,
 John Nagle <nagle@animats.com> wrote:

> Filip Salomonsson wrote:
> > On 02/10/2007, John Nagle <nagle@animats.com> wrote:
> >> But there's something in there now that robotparser doesn't like.
> >> Any ideas?
> > 
> > Wikipedia denies _all_ access for the standard urllib user agent, and
> > when the robotparser gets a 401 or 403 response when trying to fetch
> > robots.txt, it is equivalent to "Disallow: *".
> > 
> > http://infix.se/2006/05/17/robotparser
> 
>      That explains it.  It's an undocumented feature of "robotparser",
> as is the 'errcode' variable.  The documentation of "robotparser" is
> silent on error handling (can it raise an exception?) and should be
> updated.

Hi John,
Robotparser is probably following the never-approved RFC for robots.txt,
which is the closest thing there is to a standard. It says, "On server 
response indicating access restrictions (HTTP Status Code 401 or 403) a 
robot should regard access to the site completely restricted."
http://www.robotstxt.org/wc/norobots-rfc.html
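
In practice that means a 401 or 403 on robots.txt makes robotparser
answer False for every URL on the site, and the only way to tell that
apart from an ordinary rule-based disallow is to peek at the
undocumented errcode attribute. A rough sketch (assuming the 2.x
robotparser discussed in this thread):

    import robotparser

    chk = robotparser.RobotFileParser()
    chk.set_url('http://wikipedia.org/robots.txt')
    chk.read()
    if not chk.can_fetch('Mozilla', 'http://wikipedia.org'):
        if getattr(chk, 'errcode', 200) in (401, 403):
            # robots.txt itself was refused, so the whole site is
            # treated as restricted, per the draft RFC quoted above.
            print "blanket restriction (robots.txt gave %d)" % chk.errcode
        else:
            # robots.txt was fetched and its rules disallow this URL.
            print "URL disallowed by robots.txt rules"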

If you're interested, I have a replacement for the robotparser module 
that works a little better (IMHO) and which you might also find better 
documented. I'm using it in production code:
http://nikitathespider.com/python/rerp/

Happy spidering

-- 
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
10/4/2007 4:14:40 PM