I have built an ML model API using FastAPI, and the model it serves relies heavily on the GPU. I want the API to handle at least some number of parallel requests. To achieve this, I made all the endpoint functions def instead of async def so that FastAPI runs them in its thread pool and can handle requests concurrently (as mentioned here, here and here). What I am currently seeing is that a single request takes 3 seconds to produce output, but when three parallel requests are made, all three users receive their output after about 9 seconds. All users do get the output at the same time, but the response time grows with the number of requests; what I actually want is for every user to get their output in about 3 seconds.
I have tried some approaches like ThreadPoolExecutor (here), ProcessPoolExecutor (here), asyncio (here), and run_in_threadpool (here), but none of these methods worked for me.
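For reference, the run_in_threadpool variant I tried looked roughly like this (a simplified sketch; the gpu_based_processing stub here just stands in for the real GPU call shown further below):

# Rough sketch of the run_in_threadpool attempt (simplified)
from fastapi import FastAPI, File, UploadFile, Response
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

def gpu_based_processing(data: bytes) -> bytes:
    # stand-in for the real GPU-bound model call
    return data

@app.post('/model-testing')
async def my_function(file: UploadFile = File(...)):
    data = await file.read()
    # run the blocking GPU call in the threadpool so the event loop is not blocked
    output = await run_in_threadpool(gpu_based_processing, data)
    return Response(content=output, media_type="image/jpg")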
This is how my API code looks with a plain def endpoint:
import torch
import uvicorn
from fastapi import FastAPI, File, UploadFile, Response

class Model_loading():
    def __init__(self):
        # load the model once at startup so it is shared across requests
        self.model = torch.load('model.pth')

app = FastAPI()
model_instance = Model_loading()

def gpu_based_processing(x):
    # ---- doing some GPU-based computation ----
    return result

@app.post('/model-testing')
def my_function(file: UploadFile = File(...)):
    # ---- doing some initial preprocessing of `file` into `x` ----
    output = gpu_based_processing(x)
    return Response(content=output, media_type="image/jpg")
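For context, this is roughly how I fire the parallel requests to measure the timings (a hypothetical client script; the URL, port, and file name are placeholders):

# Hypothetical client used to send N parallel requests and time the responses
import time
from concurrent.futures import ThreadPoolExecutor
import requests  # assumes the requests package is installed

URL = "http://localhost:8000/model-testing"  # placeholder URL

def send_request(path: str) -> float:
    start = time.perf_counter()
    with open(path, "rb") as f:
        requests.post(URL, files={"file": f})
    return time.perf_counter() - start

if __name__ == "__main__":
    n_parallel = 3  # raise to 20 to reproduce the out-of-memory behaviour described below
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        timings = list(pool.map(send_request, ["test.jpg"] * n_parallel))
    print(timings)  # each request takes roughly 3 s * n_parallel instead of ~3 s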
Additionally, I have observed that making 20 parallel requests to the above API leads to a CUDA out-of-memory error, so the service cannot handle even 20 concurrent requests. How can I address the CUDA memory issue and handle multiple parallel requests at the same time?