Check URL in robots.txt

I need to check whether a certain URL is allowed for crawling by a given user agent, according to the site's robots.txt file.

I'm using:

import urllib.robotparser

# Point the parser at the site's robots.txt and download it
rp = urllib.robotparser.RobotFileParser("https://rus-teplici.ru/robots.txt")
rp.read()

# Check whether any user agent ("*") may fetch these URLs
print(rp.can_fetch("*", "https://rus-teplici.ru/"))
print(rp.can_fetch("*", "https://rus-teplici.ru/search/?q=fd"))

But the parser does not understand the wildcard characters *, ? and $, so it returns wrong results.

For example, it can't handle rules such as:

Disallow: *set_filter=

can_fetch() returns True for the URL https://rus-teplici.ru/catalog/teplitsy-polikarbonat/?arrFilter_2_MIN=990&arrFilter_2_MAX=85990&set_filter= even though this rule should disallow it, because the parser ignores the * and ? characters in robots.txt.
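
As a workaround I tried translating such rules into regular expressions myself. This is only a rough sketch assuming Google-style wildcard semantics (* matches any sequence of characters, $ anchors the end, and rules match from the start of the path); rule_to_regex and is_disallowed are my own helper names, not part of urllib:

import re
from urllib.parse import urlsplit

def rule_to_regex(rule):
    # Escape the rule, then turn '\*' into '.*' and a trailing '\$'
    # back into an end-of-string anchor
    pattern = re.escape(rule).replace(r"\*", ".*").replace(r"\$", "$")
    return re.compile(pattern)

def is_disallowed(url, rule):
    # robots.txt rules are matched against path + query, from the start
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return rule_to_regex(rule).match(target) is not None

print(is_disallowed(
    "https://rus-teplici.ru/catalog/teplitsy-polikarbonat/"
    "?arrFilter_2_MIN=990&arrFilter_2_MAX=85990&set_filter=",
    "*set_filter="))  # True -> this rule blocks the URL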

I looked at what the parser actually stores from robots.txt:

Disallow: /%2Abitrix
Disallow: /%2Aindex.php
Disallow: %2Aset_filter%3D**

And it should see:

Disallow: /*bitrix
Disallow: /*index.php
Disallow: *set_filter=*
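
If reimplementing the matching by hand is too fragile, one alternative I found is a third-party parser that supports the wildcard syntax, e.g. Protego (the robots.txt parser used by Scrapy). A minimal sketch, assuming it is installed with pip install protego:

import urllib.request
from protego import Protego

# Protego parses the raw robots.txt text and handles * and $ wildcards
content = urllib.request.urlopen("https://rus-teplici.ru/robots.txt").read().decode("utf-8")
rp = Protego.parse(content)

# Note the argument order: URL first, then user agent
print(rp.can_fetch("https://rus-teplici.ru/", "*"))
print(rp.can_fetch(
    "https://rus-teplici.ru/catalog/teplitsy-polikarbonat/"
    "?arrFilter_2_MIN=990&arrFilter_2_MAX=85990&set_filter=", "*"))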