I was getting some strange URLs indexed for my site, in which file names are added as if they were folders. One sample URL is here:
https://www.plus2net.com/python/tkinter-scale.php/math.php
I have a file tkinter-scale.php but no directory with that name.
Similarly, this URL is also indexed; you can see that file names are included as folder names:
https://www.plus2net.com/python/tkinter-sqlite.php/javascript_tutorial/asp-tutorial/site_map.php
Then I added these two lines to the robots.txt file to drop all the subfolder URLs after /python/, so that the files directly inside the python folder can be indexed but nothing a level down:
Allow: /python/$
Disallow: /python/
Now I have a big list of URLs blocked by the robots.txt file, which is correct, as they are inside sub-folders of the python directory. But there are five files that are also blocked (out of a list of nearly 500):
https://www.plus2net.com/python/string-rjust.php
https://www.plus2net.com/python/dj-mysql-add-data.php
https://www.plus2net.com/python/next.php
https://www.plus2net.com/python/test.csv
https://www.plus2net.com/python/string-islower.php
Why are these files blocked? (There is no page-level blocking for these files.)
Your current robots.txt Allow rule is not allowing anything within the python directory. Neither files, nor sub-directories; just the base directory URL itself.
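It looks to me like you want your robots.txt to look something like one of the sketches below; the Allow patterns in the second set are one reasonable guess based on the .php and .csv file names in your list, so adjust them for any other file types you want crawled directly under /python/.

# Option 1: a single wildcard Disallow
Disallow: /python/*/

OR

# Option 2: wildcard Allow rules plus a plain Disallow
# (adjust the patterns to the file types you actually serve)
Allow: /python/*.php$
Allow: /python/*.csv$
Disallow: /python/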
These two would be pretty equivalent for the major search engines: they process Allow directives, treat * as a wildcard, and treat $ as "ends with". These rules would differ for most other crawlers that don't know about Google's robots.txt syntax extensions; the first set of rules would allow them to crawl everything, while the second set would block the entire python directory.
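To see how that matching plays out, here is a minimal Python sketch of the longest-match logic (my own illustration of the documented behaviour, not Google's actual parser; the helper names are my own), applied to your original rules and to the first suggested set:

import re

def rule_to_regex(path):
    # Turn a robots.txt path into a regex: * matches any characters,
    # and a trailing $ anchors the match at the end of the URL path.
    pattern = re.escape(path).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile("^" + pattern)

def is_allowed(rules, url_path):
    # The longest matching rule wins; an Allow beats a Disallow of
    # equal length. If no rule matches, the URL is allowed.
    best = None
    for directive, path in rules:
        if rule_to_regex(path).match(url_path):
            candidate = (len(path), directive == "Allow")
            best = candidate if best is None else max(best, candidate)
    return True if best is None else best[1]

original = [("Allow", "/python/$"), ("Disallow", "/python/")]
suggested = [("Disallow", "/python/*/")]

print(is_allowed(original, "/python/"))                             # True
print(is_allowed(original, "/python/string-rjust.php"))             # False: blocked
print(is_allowed(suggested, "/python/string-rjust.php"))            # True
print(is_allowed(suggested, "/python/tkinter-scale.php/math.php"))  # False

The second print shows why your five files were reported as blocked: with your current rules the Allow only ever matches /python/ itself, so Disallow: /python/ wins for every real file in the directory.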
For future reference, I would suggest testing your robots.txt against Google's robots.txt testing tool.

Another way of solving this problem is to configure your web server not to allow extra path segments after file names. If you are using Apache, you could set the following in httpd.conf or .htaccess:

AcceptPathInfo Off
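AcceptPathInfo needs Apache 2.x, and if you set it in .htaccess, AllowOverride must include FileInfo for it to take effect. With it set, a request for a URL like https://www.plus2net.com/python/tkinter-scale.php/math.php should return 404 Not Found instead of serving tkinter-scale.php, so the doubled-up URLs should eventually drop out of the index. You can check with something like:

curl -I https://www.plus2net.com/python/tkinter-scale.php/math.php

and look for a 404 status in the response once the change is live.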