Why is the robots.txt file, which is supposed to block a subfolder, also blocking some random files?

I was getting some strange URLs indexed for my site, where file names are treated as folders. One sample URL is here:

https://www.plus2net.com/python/tkinter-scale.php/math.php

I have a file named tkinter-scale.php, but I don't have a directory with that name.

Similarly, this URL is also indexed; you can see that file names are included as folder names:

https://www.plus2net.com/python/tkinter-sqlite.php/javascript_tutorial/asp-tutorial/site_map.php

Then I added these two lines to the robots.txt file to block everything in subfolders below /python/, so that files directly inside the python folder can be indexed but nothing deeper:

Allow: /python/$
Disallow: /python/

Now I have a big list of files blocked by robots.txt, which is correct, as they are inside subfolders of the python directory. But there are also five blocked files (out of a list of nearly 500) that sit directly inside /python/:

https://www.plus2net.com/python/string-rjust.php
https://www.plus2net.com/python/dj-mysql-add-data.php
https://www.plus2net.com/python/next.php
https://www.plus2net.com/python/test.csv
https://www.plus2net.com/python/string-islower.php

Why are these files blocked? (There is no page-level blocking for these files.)


1 Answer

Stephen Ostermiller:

Your current robots.txt Allow rule is not allowing anything within the python directory. Neither files nor subdirectories, just the base directory URL itself, because the trailing $ anchors the pattern to the end of the path. It looks to me like you want your robots.txt to look like:

User-Agent: *
Disallow: /*.php/

OR

User-Agent: *
Disallow: /python/
Allow: /python/$
Allow: /python/*.php$

These two would be pretty much equivalent for the major search engines: they process Allow directives, treat * as a wildcard, and treat $ as "ends with". The rules would differ for most other crawlers, which don't know about these extensions to the original robots.txt syntax: the first set of rules would allow them to crawl everything, while the second set would block the entire python directory.
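
To see why the original rules blocked /python/string-rjust.php, you can replay that matching procedure: collect every rule whose pattern matches the path, let the longest pattern win, and let Allow win a length tie. Here is a minimal sketch of the documented behaviour (hypothetical helper names, standard-library re only), showing that Allow: /python/$ matches nothing except /python/ itself:

import re

def to_regex(pattern):
    # '*' matches any character sequence; a trailing '$' anchors
    # the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    # rules: ("allow" | "disallow", pattern) pairs. The longest
    # matching pattern wins, Allow wins ties, no match means allowed.
    verdict, best_len = True, -1
    for directive, pattern in rules:
        if to_regex(pattern).match(path):
            allow = directive == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                verdict, best_len = allow, len(pattern)
    return verdict

original = [("allow", "/python/$"), ("disallow", "/python/")]
for path in ("/python/", "/python/string-rjust.php",
             "/python/tkinter-scale.php/math.php"):
    print(path, "->", "allowed" if is_allowed(original, path) else "blocked")
# Only /python/ comes out allowed; for every file in the directory the
# sole matching rule is Disallow: /python/.

suggested = [("disallow", "/python/"), ("allow", "/python/$"),
             ("allow", "/python/*.php$")]
print(is_allowed(suggested, "/python/string-rjust.php"))  # True

One caveat with the second rule set: only .php files are re-allowed, so a file like /python/test.csv from your list would still be blocked unless you add a further Allow line for it.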

For future reference, I would suggest testing your robots.txt against Google's robots.txt testing tool.


Another way of solving this problem is to configure your web server not to accept extra path segments after file names. If you are using Apache, you could set the following in httpd.conf or .htaccess; requests such as /python/tkinter-scale.php/math.php will then return 404 Not Found instead of serving tkinter-scale.php:

AcceptPathInfo Off
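
As a quick spot check after enabling it (a sketch using only the standard library; it assumes the directive is active on the live server), compare the status codes of the real file and a path-info variant:

import urllib.error
import urllib.request

# Expect 200 for the real file; with AcceptPathInfo Off, the
# file-name-as-folder URL should return 404 instead.
for url in ("https://www.plus2net.com/python/tkinter-scale.php",
            "https://www.plus2net.com/python/tkinter-scale.php/math.php"):
    try:
        with urllib.request.urlopen(url) as response:
            print(response.status, url)
    except urllib.error.HTTPError as err:
        print(err.code, url)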