I am facing an issue with Python's robotparser module. It works fine for one URL but starts failing once I perform a specific sequence of steps. Below are the steps I performed and the outcome:
This sequence works fine:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> url = "http://www.ontheissues.org/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm")
True
>>>
However, the sequence below fails for the same steps I performed above:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> url = "http://menendez.senate.gov/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://menendez.senate.gov/contact/contact.cfm")
False
>>>
>>> url = "http://www.ontheissues.org/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm")
False
>>>
After debugging it for some time, I found that it works fine if I create a new object every time I use a new URL. In other words, I have to do "rp = robotparser.RobotFileParser()" every time the URL changes.
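If reusing one parser object leaks state (a likely cause: the allow_all/disallow_all flags set when a fetch fails are not reset by the next read() in older versions of the module), the workaround of using one fresh RobotFileParser per site can be wrapped in a helper. A minimal sketch in Python 3, where the module is urllib.robotparser; parse() is used instead of read() only to keep the example offline, and the example.com rules are made up:

```python
from urllib.robotparser import RobotFileParser  # Python 2: import robotparser

def can_fetch(robots_lines, agent, url):
    """Check one URL against one set of robots.txt rules, building a fresh
    parser each time so no allow/disallow state leaks between sites."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)

# Made-up rules; in practice you would call rp.set_url(...) and rp.read()
# once per site instead of feeding lines to parse().
rules = ["User-agent: *", "Disallow: /private/"]
print(can_fetch(rules, "*", "http://example.com/House/Jim_Nussle.htm"))  # True
print(can_fetch(rules, "*", "http://example.com/private/x.htm"))         # False
```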
I am not sure my approach is right: since robotparser lets me change the URL with set_url(), it should be able to handle such cases itself.
Also, in the above case, I get a 503 error code when I try to download the link "http://menendez.senate.gov/contact/contact.cfm" using requests.get() or any other way. I looked into the code of robotparser.py, and in the read() method of the RobotFileParser class there is no check for HTTP response codes in the 5xx range. I am not sure why those response codes are not handled; I just wanted some pointers on what the reason for not handling them could be.
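For reference, the status-code dispatch in read() can be summarized as a small pure function. This reflects my reading of the Python 3 urllib.robotparser source (the old robotparser.py differs in details), so treat it as a sketch, not the canonical logic:

```python
def robots_fetch_policy(status):
    """Summarize how RobotFileParser.read() reacts to the HTTP status of
    a robots.txt fetch (per the Python 3 urllib.robotparser source):
    401/403 are taken as "everything disallowed", other 4xx as "no
    robots.txt exists, everything allowed", and 5xx is simply not
    handled, which is the gap noted in the question."""
    if status in (401, 403):
        return "disallow_all"
    if 400 <= status < 500:
        return "allow_all"
    if status >= 500:
        return "unhandled"  # parser state is left as-is after a server error
    return "parse"  # success: the body is parsed as robots.txt rules

print(robots_fetch_policy(503))  # unhandled
```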
robotparser can only parse files in the "/robots.txt" format specified at http://www.robotstxt.org/orig.html, and for such files to be effective in excluding robot traversals they must be located at /robots.txt on a website. Given this, robotparser should not be able to parse "http://menendez.senate.gov/contact/contact.cfm", because it is probably not in "/robots.txt" format, even if there were no problems accessing it.
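One way to see what robotparser does with non-robots content: fed an HTML page, parse() recognizes none of the lines, records no rules, and can_fetch() then permits everything, which can mask a broken fetch. A sketch in Python 3 (urllib.robotparser; the HTML lines are made up):

```python
from urllib.robotparser import RobotFileParser

# A made-up HTML error page fed to the parser in place of robots.txt rules.
# robotparser silently skips lines it does not recognize, so nothing is recorded.
html_lines = ["<html>", "<head><title>404 Not Found</title></head>", "</html>"]

rp = RobotFileParser()
rp.parse(html_lines)
print(rp.can_fetch("*", "http://example.com/anything.htm"))  # True: no rules were recorded
```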
Facebook has a robots.txt file at https://www.facebook.com/robots.txt. It is plain text and can be read in a browser. robotparser can parse it with no problems; however, access to other files on facebook.com appears to be excluded by the following rule in robots.txt:
Here is a session using robotparser to read and parse https://www.facebook.com/robots.txt:
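Assuming the catch-all section takes the common form `User-agent: *` followed by `Disallow: /` (an assumption here, not a quote of the live file), the exclusion can be reproduced offline; the agent name "mybot" is made up:

```python
from urllib.robotparser import RobotFileParser

# Assumed catch-all exclusion: any agent without its own section is
# denied every path on the site.
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("mybot", "https://www.facebook.com/zuck"))  # False
print(rp.can_fetch("mybot", "https://www.facebook.com/"))      # False: the rule covers everything
```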
When testing access to http://www.ontheissues.org/robots.txt in my browser, I got HTTP Error 404 (File or directory not found). I then downloaded http://svn.python.org/projects/python/branches/release22-maint/Lib/robotparser.py, modified its read() function to print every line it read, and ran it on this URL; it printed only the first line:
This line indicates the format of http://www.ontheissues.org/robots.txt is incorrect for a "/robots.txt" file although it may redirect to one.
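The same line-printing modification can be made without editing robotparser.py, by subclassing and overriding parse(); since read() hands its lines to parse(), the override also fires on live fetches. A Python 3 sketch with made-up sample lines:

```python
from urllib.robotparser import RobotFileParser

class VerboseRobotFileParser(RobotFileParser):
    """Print every line handed to the parser before parsing it, so you can
    see whether the server returned robots.txt rules or an HTML page."""
    def parse(self, lines):
        for line in lines:
            print(repr(line))
        super().parse(lines)

rp = VerboseRobotFileParser()
rp.parse(["User-agent: *", "Disallow: /tmp/"])  # echoes both lines, then parses them
print(rp.can_fetch("*", "http://example.com/tmp/x"))  # False
```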
Doing the same test on "https://www.facebook.com/robots.txt" again resulted in only one line, this time with a warning message:
Testing http://menendez.senate.gov/contact/contact.cfm with the modified robotparser.read() function again resulted in an HTML header similar but not identical to that of http://www.ontheissues.org/robots.txt, with no errors. Here is the header line it printed for http://menendez.senate.gov/contact/contact.cfm:
Browsing http://menendez.senate.gov/contact/contact.cfm again initially results in http://www.menendez.senate.gov/404, which redirects after 10-15 seconds to http://www.menendez.senate.gov/. Such a redirect link can be coded as follows:
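A timed redirect of this kind is usually an HTML meta-refresh tag rather than an HTTP redirect. A sketch that detects one (the markup below is an assumed example matching the observed delay, not the page's actual source):

```python
import re

# Assumed meta-refresh markup: wait 10 seconds, then load the target URL.
html = '<meta http-equiv="refresh" content="10; url=http://www.menendez.senate.gov/">'

# Pull the delay and target URL out of the content attribute.
match = re.search(r'content="(\d+);\s*url=([^"]+)"', html, re.IGNORECASE)
if match:
    delay, target = int(match.group(1)), match.group(2)
    print(delay, target)  # 10 http://www.menendez.senate.gov/
```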
Searching the source of http://www.menendez.senate.gov/contact/ finds no match for "cfm", showing it contains no link to contact.cfm. Although such a link could be configured elsewhere on the web server or generated dynamically, that is unlikely given that browsing it results in an HTTP 404 error at http://www.menendez.senate.gov/404.