I am interacting with a search engine programmatically and I need to trick it into thinking that I am a human making queries, as opposed to a robot. This involves generating queries that an ordinary user might plausibly search for, like "ncaa football schedule" or "When was the lunar landing?" I'll be making over a thousand of these queries daily, and searching for random words out of a dictionary won't cut it, since that's not a typical search habit.
So far I have thought of a few ways to generate realistic queries:
- Obtain a list of the top Google (or Yahoo, Bing, etc.) searches for the day.
- Make use of Google's autocomplete feature by entering a random word from the dictionary followed by a space, then scraping the recommended queries.
The latter approach sounds like it would involve a lot of reverse engineering. And with the former approach, I've been unable to find a list of more than 80 or so queries; the only sources I've found are AOL trends (50-100) and Google Trends (30).
How might I go about generating a large set of human-like search phrases?
(For any language-dependent answers: I'm programming in Python)
Although this most likely breaks Google's TOS, you can scrape the autocomplete data easily:
autocomplete('a', depth=2)
gives you the top 110 queries that start with "a" (with some duplicates). Scrape each letter to a depth of 2, and you should have a ton of legitimate queries to choose from.
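The body of `autocomplete` isn't shown above, so here is a minimal sketch of what such a function might look like. Everything in it is an assumption on my part: the endpoint URL, the `client=firefox` parameter, and the response shape (`[query, [suggestion, ...], ...]`) describe an unofficial, undocumented interface that may change or be blocked at any time, and `fetch_suggestions` and the `fetch` parameter are hypothetical names added so the recursion can be exercised without touching the network.

```python
import json
import urllib.parse
import urllib.request

# Assumption: unofficial, undocumented suggestion endpoint; it may change
# or stop responding without notice.
SUGGEST_URL = "https://suggestqueries.google.com/complete/search"


def fetch_suggestions(query):
    """Fetch the suggestions for a single query (performs a network call)."""
    params = urllib.parse.urlencode({"client": "firefox", "q": query})
    with urllib.request.urlopen(f"{SUGGEST_URL}?{params}", timeout=10) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        payload = json.loads(resp.read().decode(charset, errors="replace"))
    # Assumed response shape: [query, [suggestion, ...], ...]
    return payload[1]


def autocomplete(query, depth=1, fetch=fetch_suggestions):
    """Return suggestions for `query`, then recurse on each suggestion.

    Duplicates are kept, so the caller can de-duplicate with set().
    """
    if depth <= 0:
        return []
    suggestions = list(fetch(query))
    results = list(suggestions)
    for suggestion in suggestions:
        results.extend(autocomplete(suggestion, depth - 1, fetch))
    return results
```

If each call returns 10 suggestions, a depth of 2 yields 10 + 10 × 10 = 110 entries, which would match the figure above. Wrapping the result in `set(...)` drops the duplicates, and it is probably wise to pause between requests rather than fire them all at once.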