Hi, I want to create a Presto resource calculator for the different teams in my organization to estimate cluster CPU core and memory requirements.
As I understand it, the points below are the key inputs for deciding the number of CPU cores and the amount of memory for a Presto cluster. Can someone check this and suggest any Presto-related terms I'm missing, so I can build a simplified Presto cluster resource calculator?
Inputs:
Total Data Size (TS):
Split Size (SS):
Concurrent queries (CS):
Threads per worker node:
Expected processing time for one split (SPT):
Expected memory requirement by one task for a split (SM):
Desired Query SLA (SLA):
Calculation:
Total number of splits per query = TS / SS
Total (serial) processing time per query = Total number of splits * SPT
Required task parallelism (= VCores required per query) = Total processing time per query / SLA
Required worker nodes per query = Required task parallelism / Threads per worker node
Total memory required per query = Required task parallelism * SM
Total VCores required for concurrent queries = Required task parallelism * CS
Total memory required for all concurrent queries = Total memory required per query * CS
For example: Inputs:
Total Data Size (TS): 512 GB = 512 * 1024 = 524288 MB
Split Size (SS): 256 MB
Concurrent queries (CS): 10
Threads per worker node: 6
Expected processing time for one split (SPT): 3 s
Expected memory requirement by one task for a split (SM): Split Size * 5 = 256 * 5 = 1280 MB
Desired Query SLA (SLA): 30 s
Calculation:
Total number of splits per query = 524288 / 256 = 2048
Total processing time per query = 2048 * 3 = 6144 s
Required task parallelism per query = 6144 / 30 ≈ 204
Required worker nodes = Required task parallelism / Threads per worker node = 204 / 6 = 34
Total memory required per query = 204 * 1280 = 261120 MB ≈ 255 GB
OUTPUT
Total VCores required for all concurrent queries = 204 * 10 = 2040
Total memory required for all concurrent queries = 255 GB * 10 = 2550 GB ≈ 2.5 TB
Total number of worker nodes = Total VCores required for concurrent queries / Threads per worker node = 2040 / 6 = 340
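For reference, here is the arithmetic above as a small script. This is only a sketch of the formulas in the question (the variable names are my own), and it rounds up with `math.ceil` so the SLA is actually met, which is why it yields 205 vcores per query instead of the rounded-down 204 used above:

```python
import math

def presto_resources(total_data_mb, split_mb, concurrent_queries,
                     threads_per_worker, split_time_s, split_mem_mb, sla_s):
    """Estimate Presto cluster CPU/memory from the formulas in the question.

    Rounds up at each step (ceil) so the SLA is actually met; the worked
    example above rounds 204.8 down to 204 instead.
    """
    splits = math.ceil(total_data_mb / split_mb)         # splits per query
    serial_time_s = splits * split_time_s                # single-thread runtime
    vcores_per_query = math.ceil(serial_time_s / sla_s)  # required parallelism
    mem_per_query_mb = vcores_per_query * split_mem_mb
    return {
        "splits_per_query": splits,
        "vcores_per_query": vcores_per_query,
        "workers_per_query": math.ceil(vcores_per_query / threads_per_worker),
        "mem_per_query_mb": mem_per_query_mb,
        "total_vcores": vcores_per_query * concurrent_queries,
        "total_mem_mb": mem_per_query_mb * concurrent_queries,
        "total_workers": math.ceil(vcores_per_query * concurrent_queries
                                   / threads_per_worker),
    }

# The worked example from the question:
r = presto_resources(512 * 1024, 256, 10, 6, 3, 1280, 30)
print(r["total_vcores"], r["total_workers"])  # 2050 342
```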
This is a pretty complicated topic. I'll do my best to provide some guidance, but I don't think it's possible to condense this into a simple formula without at least knowing more about how the cluster will be used. Below, I try to share some pointers that may help you derive a heuristic suited to your particular usage of Presto.
I think the starting point for this question is how the clusters will be used, since that affects which resource you count. For example, if you're optimizing for ad hoc querying, you'll probably want to tune the Presto cluster to minimize queuing. Ad hoc queries may last from milliseconds to tens of minutes or even longer, so you can't infer any reasonable expectation around per-query latency. Instead, you'll want to determine a worst-case over-admission scenario, which depends on the cluster size and hardware.
An example of why it's important to optimize for the correct thing: tasks often become skewed, or splits end up incorrectly sized, which reduces availability. Over time, your cluster may hot-spot with several queries exhibiting these problems, and when this happens it can slow down the cluster overall. When the cluster slows down, more concurrent queries are left running in an unproductive manner, which occupies more concurrency slots in the cluster's resource groups.
In this example, you'd want to make your cluster resilient to poorly sized splits or data skew. To do this, you'd want to cap the maximum concurrency allowed by your resource groups to prevent cluster over-admission, which is generally never advisable. To calculate this, I'd recommend working backwards from a level of concurrency that fully utilizes the Presto cluster. There are several levels of concurrency in Presto: query level (controlled by resource groups), stage level (how many concurrent stages are scheduled at once), task level (how many tasks are scheduled on the cluster for a given stage), and split level (how many running or pending splits are assigned to workers). You'd want to estimate task- and stage-level parallelism, and then, within those estimates, figure out how much split-level parallelism to allow. Once you've done that, consider that hash joins and aggregations often hold memory for roughly the entire duration of the query, so you'd also want to cap concurrency to prevent excessive memory usage (another symptom of over-admission).
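As a concrete anchor for the query-level cap: with Presto's file-based resource group manager, the concurrency limit you derive would typically go in a JSON file such as etc/resource-groups.json, via `hardConcurrencyLimit` (with `softMemoryLimit` bounding the group's memory footprint). The group name and numbers below are placeholders, not recommendations:

```json
{
  "rootGroups": [
    {
      "name": "adhoc",
      "softMemoryLimit": "60%",
      "hardConcurrencyLimit": 20,
      "maxQueued": 100
    }
  ],
  "selectors": [
    {
      "group": "adhoc"
    }
  ]
}
```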
Now, suppose we were optimizing for minimal latency, and we know that queries can actually be expected to complete quickly. Here, we'd probably design our queues to facilitate very quick queries, but penalize excessive use or queries that don't complete quickly, by killing them. So we probably wouldn't count resource-group concurrency; instead we'd count split-level parallelism. To calculate the worst case as above, we'd consider the maximum parallelism we'd hit with the longest-duration queries allowed. The goal would be to schedule as many splits per query as possible without building up excessively long split queues on the workers.
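One way to sanity-check split-level parallelism: the scheduler caps pending splits per worker (`node-scheduler.max-splits-per-node`, 100 by default), so the cluster's total split capacity bounds how many queries can have all of their splits in flight at once. A rough back-of-envelope sketch, assuming uniform split distribution (real scheduling is more nuanced):

```python
def max_fully_parallel_queries(workers, max_splits_per_node, splits_per_query):
    """Rough upper bound on how many queries can have every split
    scheduled simultaneously before worker split queues saturate.

    Assumes splits spread uniformly across workers; a heuristic only.
    """
    cluster_split_capacity = workers * max_splits_per_node
    return cluster_split_capacity // splits_per_query

# With the question's numbers: a 34-worker cluster at the default cap
# can only fully parallelize one 2048-split query at a time.
print(max_fully_parallel_queries(34, 100, 2048))   # 1
print(max_fully_parallel_queries(340, 100, 2048))  # 16
```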
Determining this a priori may be very challenging. A better approach may be to start with heuristics for the above and tweak them based on empirical observations in production. You'd want to start on the more conservative end to ensure reliability, then relax constraints and measure the impact on overall memory usage and query parallelism, while monitoring for signs of over-admission: aggregate worker CPU, split queue lengths on the workers, and resource group queueing.