I have been facing a serious issue on my website due to potential spam traffic originating from Google hosted IP addresses. Here are two examples:
Example 1: IP: 34.77.98.119 | User Agent: newspaper/0.2.8 | Hostname: 119.98.77.34.bc.googleusercontent.com
Example 2: IP: 34.170.179.100 | User Agent: go-http-client/2.0 | Hostname: 100.179.170.34.bc.googleusercontent.com
As you can see above, each hostname contains the IP address with its octets reversed, and the UAs are cryptic and not mentioned in Google's official documentation, such as [1] https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers and [2] https://developers.google.com/search/docs/crawling-indexing/verifying-googlebot.
I need to ensure my website remains safe and user-friendly. I also do not want to mistakenly block legitimate Google crawlers while addressing this issue.
I request the community's guidance on how to tell legitimate traffic apart from malicious traffic coming from Google-hosted IPs. (By legitimate, I primarily mean Google crawlers and services; for everyone else, I will do a security profile and decide whether we consider them malicious for us.)
The lists in [1] and [2] seem to be incomplete. When I trigger a hit from the Google PageSpeed Insights tool, the IP is 66.249.82.64, the UA is "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4590.2 Safari/537.36 Chrome-Lighthouse" and the hostname is google-proxy-66-249-82-64.google.com, but neither the UA nor the hostname appears in lists [1] and [2] of genuine UAs and crawlers, including the "user-triggered-fetchers". Similarly, in the two examples above, the hostnames end in bc.googleusercontent.com, and that hostname is not listed among the genuine Google crawlers either.
I look forward to understanding how, based on the UA and IP combination, we can separate genuine Google-triggered traffic from malicious traffic that also comes from Google servers, such as Google Cloud / Compute Engine VMs, which anyone in the world can "rent".
The second document you linked to shows how to manually verify a Google crawler. Following the steps in that section for your first IP, you'd run the command
$ host 34.77.98.119
and this gives
119.98.77.34.in-addr.arpa domain name pointer 119.98.77.34.bc.googleusercontent.com.
and then running
$ host 119.98.77.34.bc.googleusercontent.com
gives
119.98.77.34.bc.googleusercontent.com has address 34.77.98.119
From the above, I'd say that the IP is from Google: the reverse lookup returns a googleusercontent.com hostname, and the forward lookup of that hostname returns the original IP.
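If you want to run the same two lookups for every suspicious hit instead of typing `host` by hand, here is a minimal Python sketch of that reverse-then-forward check. The function names `forward_confirmed_hostname` and `has_suffix` are my own, not from any Google tool, and the resolver arguments are only there so the logic can be exercised without touching the network. It uses `socket.gethostbyname_ex`, so as written it only handles IPv4.

```python
import socket


def forward_confirmed_hostname(ip, reverse=socket.gethostbyaddr,
                               forward=socket.gethostbyname_ex):
    """Return the PTR hostname for `ip` if it resolves back to `ip`, else None.

    `reverse` and `forward` are injectable (hypothetical parameters) so the
    check can be tested with canned answers instead of real DNS.
    """
    try:
        hostname = reverse(ip)[0]         # same role as `host <ip>`
        addresses = forward(hostname)[2]  # same role as `host <hostname>`
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return None
    return hostname if ip in addresses else None


def has_suffix(hostname, suffixes):
    """True if `hostname` equals a suffix or ends with '.' + that suffix.

    Matching on the dot boundary stops 'evilgoogle.com' from passing as
    'google.com'. Which suffixes to trust is the caller's decision.
    """
    name = hostname.rstrip(".").lower()
    return any(name == s or name.endswith("." + s) for s in suffixes)
```

You would call `forward_confirmed_hostname("34.77.98.119")` and, if it returns a hostname, pass that to `has_suffix` with whatever suffix list you decide on. Note that this only confirms that the hostname and the IP agree with each other. Whether a suffix such as bc.googleusercontent.com should be treated the same as a google.com hostname is a separate decision that the sketch deliberately leaves to you.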