I have an AppEngine app that tracks various sorts of user impression data across several websites. We're currently gathering roughly 40 million records a month, the main BigQuery table is closing in on 15 GB after 6 weeks of collecting data, and our estimates show that within 6 more weeks we will be gathering over 100 million records a month. A relatively small dataset in big data terms, but with the potential to grow quite a bit, quite fast.
Now, faced with a successful trial, we need to build an API on top of BigQuery that lets us analyze the data and deliver the results to a dashboard we provide.
My concern here is that most of the data a customer analyzes spans only a few days at most (per request), and since BigQuery queries are in fact full table scans, the API may become slower to respond over time as the table grows and BQ has to process more data to return the same results.
My question is therefore this: should we shard the BigQuery log tables, for instance by month or week, to reduce the data that needs processing (see the sketch below), or would it be "wiser" to pre-process the data and store the results in the NDB datastore? The latter would give us a blazingly fast API, but requires us to pre-process everything, including things some customers may never need.
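For context, here is a minimal sketch of what the sharding option might look like on the write path. The dataset name `impression_logs`, the `impressions_` prefix, and the use of the `google-cloud-bigquery` client are all assumptions for illustration; the only real point is the month-stamped table naming convention:

```python
from datetime import date, datetime, timezone

from google.cloud import bigquery  # assumed client library

client = bigquery.Client()
DATASET = "impression_logs"     # hypothetical dataset name
TABLE_PREFIX = "impressions_"   # hypothetical shard prefix


def shard_table_id(day: date) -> str:
    # One table per month: impressions_201405, impressions_201406, ...
    return f"{DATASET}.{TABLE_PREFIX}{day.strftime('%Y%m')}"


def log_impression(record: dict) -> None:
    # Route each record to the shard for the month it arrives in, so
    # later queries only have to touch the shards they actually need.
    table_id = shard_table_id(datetime.now(timezone.utc).date())
    errors = client.insert_rows_json(table_id, [record])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```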
Or am I perhaps optimizing prematurely?
Based on my experience analyzing the performance of similar projects in BigQuery: if you are concerned with performance only, you don't have to change anything. BigQuery's optimizer can figure out many things, and if a query's WHERE clause restricts it to only a few days, performance will be good.

From a billing point of view, though, you will pay more and more as your data grows, because each query scans the full table. So to save money, it is wise to shard the data by month or even by week. With the table wildcard functions (TABLE_DATE_RANGE for day-sharded tables, TABLE_QUERY for coarser shards) you will still be able to query across all the data when you need to, so you don't lose any functionality.
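As a sketch, assuming the monthly shards follow a hypothetical `impressions_YYYYMM` convention in a dataset called `impression_logs` (and hypothetical `site` and `ts` columns), a dashboard query for a few days spanning a month boundary could use TABLE_QUERY in legacy SQL to scan only the two relevant shards, run here through the `google-cloud-bigquery` client:

```python
from google.cloud import bigquery  # assumed client library

client = bigquery.Client()

# Legacy SQL: TABLE_QUERY expands to only the monthly shards whose names
# fall in the requested range, so BQ scans two months of data instead of
# the whole history; the WHERE clause then narrows to the few days asked
# for. `site` and `ts` are hypothetical column names.
query = """
SELECT site, COUNT(*) AS impressions
FROM TABLE_QUERY(impression_logs,
                 "table_id BETWEEN 'impressions_201405' AND 'impressions_201406'")
WHERE ts >= TIMESTAMP('2014-05-29')
  AND ts <  TIMESTAMP('2014-06-02')
GROUP BY site
"""

job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(use_legacy_sql=True),
)
for row in job.result():
    print(row.site, row.impressions)
```

If you shard by day instead, the equivalent is `TABLE_DATE_RANGE(impression_logs.impressions_, TIMESTAMP('2014-05-29'), TIMESTAMP('2014-06-01'))`, which matches tables named `impressions_YYYYMMDD` within the given range.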