Check if PHP file is called from another site's PHP script

Recently, when I look at my web statistics in AWStats, I see several things that concern me. The first is 'Unknown robot' listed under the 'Robots/spiders visitors' tab. The second, and most concerning, line is 'A PHP script' under the same tab. I run content on my site that should not be fetched by other sites' PHP scripts. Is there any way to log this in the Apache logs? In other words, how can I tell whether my script is being called by a PHP script (through logs or PHP functions)? Lastly, the image below shows what I'm describing: as you can see, hits from a normal bot such as Googlebot number in the hundreds, whereas hits from the 'Unknown robots' number roughly 700 thousand.

872 views · Asked by B. Schmidt · 2 answers below

To your first concern: a client can put anything it wants into the user agent. These stats only look at the client's User-Agent header, and like any other user-supplied information it cannot be trusted.
I would suggest you look at AWStats' robots.pm file to get an idea of how that classification is coded; it explains why you get a large list of otherwise uncategorized PHP agents.
The much larger concern should be all the requests that have no user agent at all, since they are eating up gigabytes of your bandwidth.
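If you decide to refuse such requests outright, a minimal Apache sketch using mod_rewrite is one common approach (whether you block or merely log them is a policy choice, not a recommendation):

```apache
# Illustrative mod_rewrite rules: return 403 Forbidden to any request
# that sends an empty User-Agent header. Requires mod_rewrite enabled.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule ^ - [F]
```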
This is the kind of problem where a service like Cloudflare can help. Cloudflare began life as an outgrowth of Project Honey Pot, which maintains a real-time blackhole list (RBL). Acting as a reverse proxy, Cloudflare can buffer your site from a lot of this activity, serving as both an RBL and a cache.
With that said, even if you add Cloudflare today, the reality is that bot networks capture IPs or simply crawl IP blocks, so it can't prevent traffic it never sees.
You can, however, use Project Honey Pot's Http:BL to block requests from known bad actors directly. There are many PHP library implementations you could integrate, as well as plugins for CMSes like WordPress.
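As a rough illustration of how an Http:BL integration works, here is a sketch based on the publicly documented DNS query format rather than any particular library; 'YOUR_ACCESS_KEY' is a placeholder for the free key Project Honey Pot issues on registration, and the threat threshold is an arbitrary example:

```php
<?php
// Sketch of a Project Honey Pot Http:BL lookup via its DNS API.
function httpbl_lookup(string $ip, string $accessKey): ?array
{
    // Query name: <key>.<reversed IPv4 octets>.dnsbl.httpbl.org
    $name = $accessKey . '.'
          . implode('.', array_reverse(explode('.', $ip)))
          . '.dnsbl.httpbl.org';

    $answer = gethostbyname($name);
    if ($answer === $name) {
        return null; // no DNS record: the IP is not listed
    }

    // Valid answers fall in 127.0.0.0/8: 127.<days>.<threat>.<type>
    [$first, $days, $threat, $type] = array_map('intval', explode('.', $answer));
    if ($first !== 127) {
        return null;
    }
    return ['days' => $days, 'threat' => $threat, 'type' => $type];
}

$hit = httpbl_lookup($_SERVER['REMOTE_ADDR'], 'YOUR_ACCESS_KEY');
if ($hit !== null && $hit['threat'] >= 25) { // threshold is a policy choice
    http_response_code(403);
    exit('Forbidden');
}
```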
There are also many other RBLs out there you can investigate, and most if not all have implementations that allow you to integrate them in some way.
You can also implement your own server-side mitigations. The suggestions below only scratch the surface of the topic, but they should serve as an entry point for looking into options you might find useful.
Rate limiting. Apache, for example, has a rate-limit module that can implement simple but effective limits, and there are modules or alternatives for other web servers. Something like mod_ratelimit doesn't tend to play well with a reverse proxy, but the reverse-proxy solutions usually offer something similar. NGINX, for instance, has sophisticated rate-limiting settings that can be applied to different types of content and allow for short bursts of activity (as does mod_ratelimit). In fact, in the past, nginx was often deployed even in front of Apache precisely to handle requests for static content separately from requests for dynamic content (CGI/PHP scripts).
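As a concrete illustration of the NGINX side of this, a minimal sketch (the zone name and numbers are arbitrary examples, not tuning advice):

```nginx
# Track clients by IP in a 10 MB shared zone, allowing 5 requests/second;
# "burst" lets up to 10 extra requests through before excess is rejected.
http {
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    server {
        listen 80;
        location / {
            limit_req zone=perip burst=10 nodelay;
        }
    }
}
```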
Firewall/IP banning. There are many solutions in this area, including integration with the RBLs I mentioned previously. Again, as an example, the popular and long-standing fail2ban package can be used to identify and block traffic you don't want: you can have fail2ban examine your incoming Apache logs and install firewall blocking rules in response. There are numerous tutorials and how-tos on getting this set up, like this one from DigitalOcean, and there are additional filters you can find by searching GitHub, which are easy to understand and modify. Fail2ban is just one of a number of similar approaches to these problems; CrowdSec, for example, is a newer alternative you might investigate, combining a bit of Project Honey Pot-style reputation sharing with local pattern recognition and IP banning.
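For instance, fail2ban ships with a stock apache-badbots filter; a minimal jail.local fragment enabling it might look like this (the log path assumes a Debian-style Apache layout, and the numbers are illustrative):

```ini
[apache-badbots]
enabled  = true
port     = http,https
filter   = apache-badbots
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400
```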
In conclusion, this is a non-trivial social and technical challenge, and you can easily shoot yourself in the foot when combining some of these mitigations. In general, you need a high degree of system-administration expertise and control over your server to implement these solutions.
There's no 100% reliable way to do it: no matter what kind of script is connecting to your site, it can make itself look like a browser, so you will never know for certain.
The only thing crossing my mind: they call "PHP script" those requests that contain something specific in $_SERVER['HTTP_USER_AGENT'], for example user agents starting with PHP/, like PHP/5.2.9.
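If you want to act on that heuristic yourself, a minimal PHP sketch might look like this (keeping in mind the caveat above that the header is client-supplied and trivially spoofed):

```php
<?php
// Minimal sketch of the heuristic above: flag requests whose User-Agent
// starts with "PHP/" (or is missing entirely). This only catches
// unsophisticated scripts that don't bother to fake the header.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if ($agent === '' || stripos($agent, 'PHP/') === 0) {
    // error_log() writes to the server's error log, so the event is
    // recorded even if you choose not to block the request.
    error_log(sprintf(
        'Suspected script client: ip=%s ua="%s" uri=%s',
        $_SERVER['REMOTE_ADDR'],
        $agent,
        $_SERVER['REQUEST_URI'] ?? ''
    ));
    http_response_code(403);
    exit;
}
```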