Analytical Function Scalability in BigQuery
One big setback for our team in transitioning to BigQuery is how poorly analytical functions scale. I've read through numerous replies here, and they either say it can't be done because the sort has to happen on a single slot [1][2][3], or suggest adding a PARTITION BY [1][2][3] or a LIMIT [1][2][3][4][5].
The problem for us is that we need to be able to do this without partitioning and without limiting. We'd much rather not have to export the data, run the analytical function in another tool, and upload the results back to BigQuery every time.
Questions:
- Is there a road map for improving the scalability of analytical functions in BigQuery?
- Is it possible to allot more memory so that more data can fit on a single slot?
Example:
- Table size: 3.76 GB
- Two columns: UNIQUE_ID and SALES
- 140 million rows
SELECT
  UNIQUE_ID
  , SALES
  , NTILE(10) OVER (ORDER BY SALES ASC) AS SALES_DECILE
FROM `project.dataset.table`
Error:
Resources exceeded during query execution: The query could not be executed in the allotted memory. OVER() operator used too much memory.
For my purposes, not having exactly 10% of records in each decile is okay, so APPROX_QUANTILES() does the trick (first sketch below), but I'm not sure that's acceptable to other teams. The questions still stand for the other analytical functions that need exact results. For example, I need to assign ROW_NUMBER() OVER () to every record in a 140 million row table whose one unique ID attribute isn't numeric, so I can't PARTITION at all (second sketch below). Otherwise, I'd totally go with that suggestion.
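For context, here's a sketch of one way the APPROX_QUANTILES() workaround can be wired up against the same table: compute the nine decile cut points once, then bucket each row against them instead of sorting the whole table with NTILE(). Tie handling at the cut points won't match NTILE() exactly, and the decile sizes are only approximately 10% each.

WITH bounds AS (
  -- APPROX_QUANTILES(SALES, 10) returns 11 values: the minimum,
  -- the 9 inner decile cut points, and the maximum.
  SELECT APPROX_QUANTILES(SALES, 10) AS q
  FROM `project.dataset.table`
)
SELECT
  t.UNIQUE_ID
  , t.SALES
  -- Count how many of the 9 inner cut points (array offsets 1-9) this row's
  -- SALES value exceeds; adding 1 maps the count onto deciles 1 through 10.
  , 1 + (SELECT COUNTIF(t.SALES > cut)
         FROM UNNEST(b.q) AS cut WITH OFFSET AS pos
         WHERE pos BETWEEN 1 AND 9) AS SALES_DECILE
FROM `project.dataset.table` AS t
CROSS JOIN bounds AS b

There's no global ORDER BY here, so nothing has to fit on a single slot; the only aggregation is the approximate quantile sketch.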
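And for concreteness, this is the shape of the exact-numbering query I still need. The unique ID isn't numeric, so there's no numeric range to partition on, and OVER () with no PARTITION BY hits the same memory limit:

SELECT
  UNIQUE_ID
  -- No PARTITION BY, so the whole 140 million row table is numbered on one slot.
  , ROW_NUMBER() OVER () AS ROW_NUM
FROM `project.dataset.table`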