How to deal with bots using the in-site search and overloading the SQL server with too many requests?


What is the best practice for not annoying users with flood limits while still blocking bots that run automated searches?

What is going on:

I have become more aware of odd search behaviour, and I finally had the time to catch who is behind it. It is 157.55.39.*, also known as Bing. Which is odd, because when _GET['q'] is detected, noindex is added.

The problem, however, is that they are slowing down the SQL server, because there are simply too many requests coming in.

What I have done so far:

I have implemented a search flood limit. But since I did it with a session cookie, checking and calculating from the timestamp of the last search, Bing obviously ignores cookies and carries on.
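
For reference, that session-based limit works roughly along these lines (a minimal sketch; the 5-second window, session key and response text are illustrative, not the exact code). A client that ignores cookies gets a fresh session on every request, so the check never fires for it:

<?php
session_start();

# Illustrative minimum interval between searches for one session.
$minInterval = 5;

if (isset($_GET['q'])) {
    $last = isset($_SESSION['last_search']) ? (int)$_SESSION['last_search'] : 0;

    if (time() - $last < $minInterval) {
        # Too soon since the last search from this session:
        header('HTTP/1.1 429 Too Many Requests');
        die('Please wait a few seconds before searching again.');
    }

    $_SESSION['last_search'] = time();
    # ...run the actual search query here...
}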

The worst-case scenario is to add reCAPTCHA, but I don't want the "Are you human?" tickbox every time you search. It should appear only when a flood is detected. So the real question is: how do I detect too many requests from a client, so I can trigger some sort of reCAPTCHA to stop the requests?

EDIT #1:
I handled the situation currently, with:

<?php

# Resolve the end-client IP, preferring proxy headers when they contain a valid IP:
define('CLIENT_IP',
    filter_var(@$_SERVER['HTTP_X_FORWARDED_IP'], FILTER_VALIDATE_IP)
        ? $_SERVER['HTTP_X_FORWARDED_IP']
        : (filter_var(@$_SERVER['HTTP_X_FORWARDED_FOR'], FILTER_VALIDATE_IP)
            ? $_SERVER['HTTP_X_FORWARDED_FOR']
            : $_SERVER['REMOTE_ADDR']));

# Detect Bing's crawler range (157.55.39.*):
if (substr(CLIENT_IP, 0, strrpos(CLIENT_IP, '.')) == '157.55.39') {

    # Tell them not right now:
    header('HTTP/1.1 503 Service Temporarily Unavailable');

    # ..and block the request
    die();
}

It works. But it feels like yet another temporary patch for a more systemic problem.

I would like to mention that I still want search engines, including Bing, to index /search.html, just not to actually run searches there. There is no "latest searches" list or anything like that, so it's a mystery where they are getting the queries from.

EDIT #2 -- How I solved it
If someone else in the future has these problems, I hope this helps.

First of all, it turns out that Bing has the same URL parameter feature that Google has, so I was able to tell Bing to ignore the URL parameter "q".

Based on the accepted answer, I added Disallow rules for the q parameter to robots.txt (inside the relevant User-agent block):

Disallow: /*?q=*
Disallow: /*?*q=*

I also configured the crawl control in Bing Webmaster Tools so the crawler avoids our peak traffic hours.

Overall, this immediately showed a positive effect on server resource usage. I will, however, still implement an overall flood limit for identical queries, specifically where _GET is involved, in case Bing ever decides to hit an AJAX call (for example ?action=upvote&postid=1).
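
A rough sketch of what that per-query limit could look like, using APCu as a shared counter store (the key prefix, 20-per-minute threshold and window are assumptions, not production values; the TTL argument to apcu_inc() needs APCu 5.1.12 or newer, and Memcached or Redis would work the same way):

<?php

# Throttle identical queries regardless of cookies, keyed on the normalized query string.
function query_is_flooding(string $query, int $maxHits = 20, int $window = 60): bool
{
    $key = 'flood:' . md5(strtolower(trim($query)));

    # apcu_inc() creates the counter with the TTL if it does not exist, then increments it.
    $hits = apcu_inc($key, 1, $success, $window);

    return $hits !== false && $hits > $maxHits;
}

if (isset($_GET['q']) && query_is_flooding($_GET['q'])) {
    header('HTTP/1.1 429 Too Many Requests');
    die();
}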

BEST ANSWER

Spam is a problem that all website owners struggle with, and there are many ways to build good protection, ranging from very simple measures to very elaborate and robust mechanisms.

But for your case right now I see one simple solution: use robots.txt and disallow the Bing spider from crawling your search page.
This is very easy to do.

Your robots.txt file would look like:

User-agent: bingbot
Disallow: /search.html?q=

But this will completely block the search engine spider from crawling your search results.
If you just want to limit such requests rather than block them entirely, try this:

User-agent: bingbot
Crawl-delay: 10

This forces Bing to crawl your website pages only once every 10 seconds.
With such a delay, it will crawl at most 8,640 pages a day (86,400 seconds / 10), which is a very small number of requests per day.
If you are fine with that, you are set.

But what if you want to control this behaviour manually, on the server itself, protecting the search form not only from web crawlers but also from attackers?
They could easily send your server over 50,000 requests per hour.

In this case, I would recommend two solutions.
First, put CloudFlare in front of your website, and don't forget to check whether your server's real IP is still discoverable via services like ViewDNS IP History, because many websites behind CF protection miss this (even popular ones).
If your active server IP is visible in the history, consider changing it (highly recommended).

Second, you could use Memcached to store flood data and detect whether a certain IP is querying too often (e.g. 30 queries per minute).
If it is, block its ability to search (via Memcached) for some time.
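
A minimal sketch of that idea with PHP's Memcached extension (the 30-requests-per-minute threshold comes from the figure above; the key name, local memcached server and 429 response are illustrative assumptions):

<?php

# Per-IP flood detection backed by Memcached (assumes a memcached instance on localhost).
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$ip  = $_SERVER['REMOTE_ADDR'];
$key = 'search_hits:' . $ip;

# add() only succeeds if the key does not exist yet; the 60-second TTL is the counting window.
if (!$mc->add($key, 1, 60)) {
    $hits = $mc->increment($key);

    if ($hits !== false && $hits > 30) {
        # Over 30 queries this minute: block, or show a reCAPTCHA instead of blocking outright.
        header('HTTP/1.1 429 Too Many Requests');
        die();
    }
}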

Of course, this is not the best solution you could use, but it will work and won't cost your server much.