Python robotparser module giving wrong results


I am facing an issue with Python's robotparser module. It works fine for a particular URL but starts failing once I perform a specific sequence of steps. Below are the steps I performed and the outcome:

This sequence works fine:

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> url = "http://www.ontheissues.org/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm")
True
>>> 

However, the sequence below fails for the same steps:

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> url = "http://menendez.senate.gov/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://menendez.senate.gov/contact/contact.cfm")
False
>>>
>>> url = "http://www.ontheissues.org/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm")
False
>>>

After debugging it for some time, I found that it works fine if I create a new object every time I use a new URL. That is, I have to do "rp = robotparser.RobotFileParser()" every time the URL changes.

I am not sure my approach is right: since robotparser lets me change the URL with set_url(), it should be able to handle such cases.
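
For reference, a minimal sketch of the workaround I am using (fetch_allowed is just a name I made up for illustration):

import robotparser

def fetch_allowed(robots_url, page_url, agent="*"):
    # Workaround: build a fresh RobotFileParser for every robots.txt URL
    # instead of reusing one object across set_url()/read() calls.
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(agent, page_url)

print(fetch_allowed("http://www.ontheissues.org/robots.txt",
                    "http://www.ontheissues.org/House/Jim_Nussle.htm"))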

Also, in the case above, I get a 503 error code when I try to download "http://menendez.senate.gov/contact/contact.cfm" using requests.get() or any other way. I looked at the code of robotparser.py, and in the read() method of class RobotFileParser there is no check for HTTP response codes above 500. I am not sure why those response codes are not handled; I just want some pointers on what the reason could be.
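
For example, this is roughly how I see the 503 with requests (the status code is whatever the server happens to return at the time):

import requests

# Fetch the page that robotparser is being asked about and show the
# HTTP status code; for me the server currently answers with 503.
resp = requests.get("http://menendez.senate.gov/contact/contact.cfm")
print(resp.status_code)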

1 Answer

robotparser can only parse files in the "/robots.txt" format specified at http://www.robotstxt.org/orig.html, and for such files to actually exclude robot traversals they must be located at /robots.txt on a website. Based on this, robotparser should not be able to parse "http://menendez.senate.gov/contact/contact.cfm", because it is probably not in "/robots.txt" format, even if there were no problems accessing it.
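
As a quick illustration of the expected format, rules can also be fed to robotparser directly through its parse() method (a minimal sketch; the rules below are invented for illustration):

from urllib import robotparser  # Python 3; on Python 2 use "import robotparser"

rp = robotparser.RobotFileParser()
# parse() accepts an iterable of lines in the /robots.txt format.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/public/page.html"))   # True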

Facebook has a robots.txt file at https://www.facebook.com/robots.txt. It is in plain text and can be read in a browser. robotparser can parse it with no problems; however, access to other files on facebook.com appears to be excluded by the following rule in robots.txt:

User-agent: *
Disallow: /

Here is a session using robotparser to read and parse https://www.facebook.com/robots.txt:

>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("https://www.facebook.com/robots.txt")
>>> rp.read()  # no error
>>> rp.can_fetch("*", "https://www.facebook.com/")
False
>>> rp.can_fetch("*", "https://www.facebook.com/about/privacy")
False

When testing access to http://www.ontheissues.org/robots.txt in my browser, I got HTTP Error 404 - File or directory not found. I then downloaded http://svn.python.org/projects/python/branches/release22-maint/Lib/robotparser.py, modified its read() function to print every line it read, and ran it on this URL; the first line it printed was:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

This line indicates the format of http://www.ontheissues.org/robots.txt is incorrect for a "/robots.txt" file although it may redirect to one.
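
That check can be reproduced without patching robotparser.py; here is an illustrative sketch using urllib.request from Python 3 (on Python 2 the equivalent is urllib2.urlopen):

from urllib import request, error

URL = "http://www.ontheissues.org/robots.txt"

# Fetch whatever the server returns for the robots.txt URL and show the
# first line of the body, which is enough to tell HTML from robots.txt rules.
try:
    resp = request.urlopen(URL)
except error.HTTPError as e:
    resp = e  # an HTTPError still carries the response body
body = resp.read().decode("utf-8", errors="replace")
print(body.splitlines()[0] if body else "(empty response)")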

Doing the same test on "https://www.facebook.com/robots.txt" again resulted in only one line, this time with a warning message:

# Notice: Crawling Facebook is prohibited unless you have express written

Testing http://menendez.senate.gov/contact/contact.cfm with the modified robotparser.read() function again resulted in an HTML header, similar but not identical to that of http://www.ontheissues.org/robots.txt, and with no errors. Here is the header line it printed for http://menendez.senate.gov/contact/contact.cfm:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Browsing http://menendez.senate.gov/contact/contact.cfm again initially results in http://www.menendez.senate.gov/404, which redirects after 10-15 seconds to http://www.menendez.senate.gov/. Such a redirect can be coded as follows:

<meta http-equiv="refresh" content="15;url=http://www.menendez.senate.gov/" />

Searching the source of http://www.menendez.senate.gov/contact/ finds no match for "cfm", showing that it contains no link to contact.cfm. Although such a link could be configured elsewhere on the web server or generated dynamically, that is unlikely given that browsing it results in an HTTP 404 error at http://www.menendez.senate.gov/404.