Neither BigQuery nor the public data sets seem to have all the bigrams


Summary: I'm trying to find out where to download the data I can see in the n-gram viewer, since neither the raw data nor BigQuery seems to have as many results as the viewer.

To download all the bigrams without opening each raw data file manually, I turned to BigQuery and tried to reduce the trigram data to bigrams. I then realized that, because of how the trigrams were constructed, plenty of bigrams are not included in them.
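A minimal sketch of that trigram-to-bigram reduction, to show why it cannot recover everything. It assumes each row is a (trigram, count) pair; the function name and the sample rows are hypothetical and do not come from the actual BigQuery table.

```python
def bigrams_from_trigrams(rows):
    """Return the set of bigrams that occur inside at least one trigram row.

    rows: iterable of (trigram_text, match_count) pairs (assumed layout).
    A bigram that never appears inside a stored trigram cannot be recovered.
    """
    recovered = set()
    for trigram, _count in rows:
        words = trigram.split()
        if len(words) != 3:
            continue  # skip malformed rows
        recovered.add(" ".join(words[:2]))  # leading bigram
        recovered.add(" ".join(words[1:]))  # trailing bigram
    return recovered


# Hypothetical sample rows, for illustration only.
sample_rows = [("please stay here", 50), ("stay here until", 40)]
found = bigrams_from_trigrams(sample_rows)
print("stay here" in found)      # present in a stored trigram
print("stay strapped" in found)  # absent from every stored trigram
```

Any bigram whose surrounding trigrams were never stored simply does not show up in the result, however common the bigram itself may be.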

So I went the old-fashioned way and, as a test, downloaded the st file from the raw data available here. It is a huge file, but for some reason it does not contain the obvious bigram "stay here", even though the n-gram viewer has it. Another example is "stay strapped". The viewer shows a graph for both phrases, but the st file, which I would expect to contain that data, does not. Does anyone know why, and what I could do to obtain such data? I presume that if it is available through the n-gram viewer, there must be some way to download it.
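For reference, a rough sketch of how such a check can be scripted instead of opening the file by hand. It assumes the downloaded file is gzip-compressed with tab-separated rows whose first three columns are the n-gram, the year, and the match count; the function names and any path passed to `scan_file` are hypothetical.

```python
import gzip


def total_matches(lines, target):
    """Sum the match counts over all years for one exact n-gram.

    Assumes tab-separated rows: ngram, year, match_count, then any extra
    columns, which are ignored.
    """
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip rows that do not fit the assumed layout
        if fields[0] == target:
            total += int(fields[2])
    return total


def scan_file(path, target):
    """Stream a gzip-compressed file without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return total_matches(handle, target)
```

The comparison is exact and case-sensitive, so a row stored as "Stay here" or with any extra annotation on the words would not be counted by this sketch.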


There is 1 best solution below


According to the documentation you link to, the nGram data sets available for download are snapshots in time. The most recent was posted in July 2012. I believe the nGram Viewer itself is running against much more recent data.

I know that in BigQuery's case, the trigram data is an old snapshot of the nGram data, dating back to the time that BigQuery first launched. Note that our sample dataset documentation does not include the trigrams data set, in part due to how old our snapshot is.