GKE Sticky-Sessions + Auto-scaling with large memory usage per session


We have a web service that we are considering running on GKE, with the following unusual characteristics:

  • Each session involves the client uploading a large amount of data, which is pre-processed and stored in-memory in the Pod - somewhere between 1GB and 5GB per session. For performance reasons, we think it makes sense to simply keep the session state in memory in the Pod rather than in a shared store like Redis (i.e. the Pods are NOT stateless - hence the sticky sessions mentioned in the title). That way, when the next request comes from the client to run another heavy computation on their data, the Pod is ready to process it and return the result without regenerating the memory-heavy state from an external source like Redis.

  • Each Pod requires the resources of an entire VM (e.g. 4 vCPU / 32GB with 1 GPU attached) because the service it provides to each connected session involves short bursts of GPU/CPU usage (i.e. horizontal scaling will require new NODES every time, not additional Pods on a single node). The idea is that the number of sessions a Pod can support is constrained only by the memory on the node (since the actual processing occurs in short bursts, as mentioned).
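
For a sense of scale, here's a minimal sketch of how I imagine the per-Pod resource requests would look (the image, names, and exact sizes are placeholders, and I'm assuming roughly one Pod per node):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: session-worker                  # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: session-worker
  template:
    metadata:
      labels:
        app: session-worker
    spec:
      containers:
      - name: worker
        image: gcr.io/my-project/session-worker:latest   # placeholder image
        resources:
          requests:
            cpu: "3500m"                # just under the node's 4 vCPU, leaving room for system Pods
            memory: 28Gi                # most of a 32GB node
          limits:
            memory: 28Gi
            nvidia.com/gpu: 1           # GPU count goes in limits; it implies an equal request
```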

It seems as though GKE auto-scaling out-of-the-box with a target memory usage of ~50% would work nicely for these heavy-duty sessions, but here are my concerns:

  1. GKE, sticky sessions, and load balancing: Suppose we used something like nginx-ingress with cookie-based affinity (https://kubernetes.github.io/ingress-nginx/examples/affinity/cookie/) for sticky sessions - given the involvement of both GKE and the nginx load balancer, how would new sessions be assigned to the "right" node? For example, suppose I have 2 nodes with 10 sessions each, but it so happens that the first node's sessions use 3GB each (90% node memory usage) and the second node's sessions use 1GB each (30% node memory usage). Ideally any new session would be routed to the node at 30% memory usage - would GKE make that happen in this setup (given its goal of 50% average memory usage)? Or would it just be simple round-robin, so the node at 90% memory usage could become over-extended even though the other node has plenty of memory and could easily support additional sessions? (A sketch of the affinity annotations follows this list.)
  2. Sticky sessions and scaling down: Consider the following hypothetical: Suppose we have 10 nodes each with 10 currently active sessions and under those circumstances each node is at 50% memory usage (exactly at our GKE memory usage target). Now suppose a bit later, each node drops to only 1 currently active session and 5% memory usage. It is important that these nodes stay "alive" to keep their respective sticky sessions going. Is GKE "smart" enough (perhaps with a specific configuration related to sticky sessions?) to recognize this and avoid killing the nodes with active sticky sessions in spite of the average memory usage being so low? Is there some way to configure it to act this way?
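
For concern 1, my understanding of the linked ingress-nginx example is that cookie affinity comes down to a few annotations on the Ingress. A minimal sketch, with the hostname, Service name, and port as placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: session-ingress                                            # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "172800"   # 48h; align with the session timeout
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com                                          # placeholder host
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: session-worker                                   # placeholder Service
            port:
              number: 80
```

As far as I can tell, though, the affinity cookie only pins an existing session to the Pod that first served it; the first request of a new session is placed by the ingress controller's normal load-balancing algorithm (round-robin by default), not by node memory usage - which is exactly what concern 1 is about.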

GKE seems great for stateless services, but for one like ours - with lots of memory per session, where offloading the state to an external store (like Redis) and reloading it for each subsequent request would mean a significant performance hit - I'm not so sure it can be made to work well. But if I can address the above concerns, so that I can be confident GKE will route new sessions to nodes with lower memory usage and won't kill nodes with an active session (and yes, sessions would have timeouts so they don't keep nodes alive forever), then I think it might be a great fit.
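
One knob I've come across for the scale-down concern is the cluster autoscaler's safe-to-evict Pod annotation, which (as I understand it) prevents the autoscaler from removing a node while a Pod carrying the annotation is running on it. A sketch of where it would go in the Deployment's Pod template:

```yaml
# Fragment of the Deployment spec - the annotation lives on the Pod template
spec:
  template:
    metadata:
      labels:
        app: session-worker                              # placeholder label
      annotations:
        # Ask the GKE cluster autoscaler not to evict this Pod, so the node
        # it runs on is not scaled down while the Pod (and its sessions) exist.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```

The flip side would be that the Pod has to exit (or be deleted) once all of its sessions time out; otherwise the node never gets reclaimed.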

I'm definitely open to alternative suggestions too - I'm new to Kubernetes, and the black box is intimidating, to say the least!

EDIT (additional details about the application for more context): our current thinking is to have each Pod run a Python web application that acts as a supervisor, spawning subprocesses that do the heavy work - one subprocess per sticky session. It then seems to be just a matter of routing each client's requests to its respective subprocess for the heavy-duty processing and returning the result. The subprocess would stay alive as long as the session was alive (with a timeout, of course). Again, the key point is that each of these subprocesses/sessions will require 1-5GB of memory, so passing that state to and from an external source seems like a bad idea for performance reasons (not only the bandwidth, but also regenerating the state). So we're thinking it best to keep the state in the subprocess/session in the Pod itself and rely on GKE's horizontal scaling to add memory resources as needed... assuming we can sort out the 2 concerns mentioned above.
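
For the "~50% memory" target, my understanding is that the standard mechanism is a HorizontalPodAutoscaler with a memory utilization target (measured against the Pod's memory request), with the cluster autoscaler then adding nodes for the new Pods. A minimal sketch against the hypothetical session-worker Deployment above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: session-worker-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: session-worker                # placeholder Deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 50          # aim for ~50% of each Pod's memory request
```

Note that this controls how many Pods (and therefore nodes) exist, not which Pod a new session is routed to - which loops back to concern 1.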

There is 1 answer below.

To configure the Ingress object on GKE, the cluster must be a VPC-native cluster. You can use a BackendConfig to set session affinity to the client IP or to a generated cookie. Session affinity in GKE operates on a best-effort basis to deliver requests to the same backend that served the initial request, and it is disabled by default. The balancing mode determines when a backend is at capacity; if you want to use the external HTTPS Load Balancer, the recommended balancing mode is RATE. The example you mentioned appears to use UTILIZATION, which is not recommended with session affinity, because changes in instance utilization can cause the load balancing service to direct new requests or connections to backend VMs that are less full.
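
For reference, a generated-cookie BackendConfig would look roughly like the sketch below (names are placeholders); the Service opts into it through the cloud.google.com/backend-config annotation:

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: session-backendconfig           # hypothetical name
spec:
  sessionAffinity:
    affinityType: "GENERATED_COOKIE"
    affinityCookieTtlSec: 3600          # cookie lifetime in seconds
---
apiVersion: v1
kind: Service
metadata:
  name: session-worker                  # placeholder Service
  annotations:
    cloud.google.com/backend-config: '{"default": "session-backendconfig"}'
spec:
  type: ClusterIP
  selector:
    app: session-worker
  ports:
  - port: 80
    targetPort: 8080
```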

Regarding autoscaling: GKE offers the cluster autoscaler to automatically resize your cluster's node pools based on the demands of your workloads. When demand is high, the cluster autoscaler adds nodes to the node pool; when demand is low, it scales back down to a minimum size that you designate. There is also a newer feature that combines vertical and horizontal autoscaling, called multidimensional Pod autoscaling, which is still in beta. A MultidimPodAutoscaler object modifies memory requests and adds replicas so that the average CPU utilization of each replica matches your target utilization.
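
A rough sketch of a MultidimPodAutoscaler manifest, based on the beta documentation (since the feature is in beta, field names may change, so treat this as illustrative only; the Deployment name and bounds are placeholders):

```yaml
apiVersion: autoscaling.gke.io/v1beta1
kind: MultidimPodAutoscaler
metadata:
  name: session-worker-autoscaler       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: session-worker                # placeholder Deployment
  goals:
    metrics:
    - type: Resource
      resource:
        name: cpu                       # horizontal scaling keys off CPU utilization
        target:
          type: Utilization
          averageUtilization: 60
  constraints:
    global:
      minReplicas: 1
      maxReplicas: 20
    containerControlledResources: [ memory ]   # memory requests are adjusted vertically
    container:
    - name: '*'
      requests:
        minAllowed:
          memory: 2Gi                   # placeholder lower bound per container
        maxAllowed:
          memory: 28Gi                  # placeholder upper bound per container
  policy:
    updateMode: Auto
```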