I need to check whether a given URL is allowed for crawling by a particular user agent according to a site's robots.txt file. I'm using:
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser("https://rus-teplici.ru/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://rus-teplici.ru/"))
print(rp.can_fetch("*", "https://rus-teplici.ru/search/?q=fd"))
But the parser does not understand the wildcard characters *, ? and $, so it produces wrong answers. It cannot handle rules such as:
Disallow: *set_filter=
can_fetch returns True for the URL https://rus-teplici.ru/catalog/teplitsy-polikarbonat/?arrFilter_2_MIN=990&arrFilter_2_MAX=85990&set_filter=, even though that URL matches the rule above, because the parser does not handle the * and ? characters in robots.txt.
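Under the wildcard semantics used by Google (and now RFC 9309), that rule does match the URL. A minimal sketch of such a pattern check, where rule_matches is a hypothetical helper name, not part of urllib:

```python
import re
from urllib.parse import urlparse

def rule_matches(rule: str, url: str) -> bool:
    """Check a robots.txt path pattern against a URL, treating '*' as
    'any run of characters' and a trailing '$' as an end anchor."""
    parsed = urlparse(url)
    target = parsed.path + ("?" + parsed.query if parsed.query else "")
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    # robots.txt rules are prefix matches, so anchor only at the start
    return re.match(pattern, target) is not None

print(rule_matches(
    "*set_filter=",
    "https://rus-teplici.ru/catalog/teplitsy-polikarbonat/"
    "?arrFilter_2_MIN=990&arrFilter_2_MAX=85990&set_filter=",
))  # → True: the rule should block this URL
```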
I looked at what the parser actually stores from robots.txt:
Disallow: /%2Abitrix
Disallow: /%2Aindex.php
Disallow: %2Aset_filter%3D**
whereas it should see:
Disallow: /*bitrix
Disallow: /*index.php
Disallow: *set_filter=*
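Since urllib.robotparser only implements the original robots.txt rules (plain path prefixes, no wildcards), one workaround is to apply the wildcard matching yourself, or to use a third-party parser that supports the wildcard extensions (Protego, for example, is reported to). Below is a minimal, simplified sketch of a wildcard-aware check; it handles only User-agent and Disallow lines (no Allow, no grouping subtleties), and the can_fetch name here is my own, not the urllib API:

```python
import re
from urllib.parse import urlparse

def _to_regex(rule: str) -> re.Pattern:
    # Translate a robots.txt path pattern into a regex:
    # '*' -> '.*', a trailing '$' -> end anchor, everything else literal.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

def can_fetch(robots_txt: str, useragent: str, url: str) -> bool:
    """Simplified wildcard-aware check: True if no Disallow rule in the
    matching agent group matches the URL's path + query."""
    parsed = urlparse(url)
    target = parsed.path or "/"
    if parsed.query:
        target += "?" + parsed.query
    rules = []       # Disallow patterns that apply to this user agent
    applies = False  # inside a group whose User-agent matches?
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            applies = value == "*" or value.lower() in useragent.lower()
        elif field == "disallow" and applies and value:
            rules.append(_to_regex(value))
    return not any(r.match(target) for r in rules)

robots = """\
User-agent: *
Disallow: /*bitrix
Disallow: *set_filter=
"""
print(can_fetch(robots, "*", "https://rus-teplici.ru/"))  # → True
print(can_fetch(robots, "*", "https://rus-teplici.ru/search/?set_filter=1"))  # → False
```

You would fetch robots.txt yourself (for example with urllib.request) and pass its text in, which also avoids the percent-encoding mangling shown above.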