GAE is very slow loading a sentence transformer


I'm using Google App Engine to host a website using Python and Flask.

I need to add text-similarity functionality using sentence_transformers. In requirements.txt, I add a dependency on the CPU version of torch:

torch @ https://download.pytorch.org/whl/cpu/torch-2.2.1%2Bcpu-cp311-cp311-linux_x86_64.whl 
sentence-transformers==2.4.0

When I add these statements to the main.py file:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

the GAE instance creation time degrades from < 1 sec to > 20 sec.

Performance improves if I save the model to a directory in the project and use:

model = SentenceTransformer('./idp_web_server/model')

but it is still over 15 sec. (Removing the model-creation statement reduces instance creation time to 4 sec.) Moving from an F4 instance (2.4 GHz, automatic scaling) to a B8 instance (4.8 GHz, basic scaling) does not improve performance, so it seems to be I/O bound. Running the app locally on my machine (2.4 GHz), model creation takes only 1.7 sec, i.e., it is 5 to 10 times faster.
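One way to produce that local copy is a one-time build step that downloads the model and saves it to disk; a minimal sketch, using the model name and path from the question:

```python
# One-time build step: download the model and save a local copy so the
# deployed app loads from disk instead of fetching from the Hugging Face Hub.
# The path matches the question's './idp_web_server/model'.
def save_local_copy(path="./idp_web_server/model"):
    from sentence_transformers import SentenceTransformer  # heavy import kept local
    model = SentenceTransformer("all-MiniLM-L6-v2")  # fetches the model files once
    model.save(path)                                 # writes config + weights to disk
    return path
```

Running this once before deployment and shipping the saved directory with the app avoids any network fetch at instance startup.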

Can this be improved? Should I move to Google Cloud instead of GAE?

2 Answers

minou:

Two suggestions to try:

  1. Don't load the model during instance creation. Instead load it at the first request that needs it. This is described in more detail here.
  2. You might need more memory. For my ML models, I use GAE flexible with this instance specification:
resources:
  cpu: 2
  memory_gb: 8.0
  disk_size_gb: 20
NoCommandLine:
  1. GAE is still Google Cloud (Google Cloud consists of multiple products/services). I assume you're asking whether you should switch to something like Google Compute Engine (GCE) or Cloud Run.

  2. Try to pinpoint exactly where the bottleneck is:

    a) Go to the Logs Explorer: https://console.cloud.google.com/logs/

    b) Find an entry that seems to have taken a long time. If you mouse over the time, a menu should pop up whose first entry is 'view trace details'. Click it to get a breakdown of the calls to internal APIs and how long each one took. This can help you figure out where your bottleneck is and whether it's something you can fix.

    c) Also check how often new instances are being started (your logs will show when a visit kicked off a new instance). This can help you decide whether to increase the minimum or maximum number of instances.
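If the logs show frequent cold starts, raising the instance floor in app.yaml keeps a warm instance ready; a sketch, assuming automatic scaling on GAE standard (values are illustrative):

```yaml
automatic_scaling:
  min_instances: 1   # keep one instance warm to avoid cold-start model loads
  max_instances: 4   # cap cost; tune to observed traffic
```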