Hi, I am new to GCP and I have a project using FastAPI with Llama 2 Chat. I can't run it well on my computer because it takes too long to respond, and when I tried to deploy it on App Engine I got a 502 error. I am also new to Llama. This is my code:
from fastapi import FastAPI, HTTPException
from llama_cpp import Llama
from pydantic import BaseModel

# Load the model once at startup (this takes a while for a 7B model)
print("loading model...")
llm = Llama(model_path=r"C:\Users\Harry\Project\models\llama-2-7b-chat.Q2_K.gguf")
print("model loaded!")

app = FastAPI()

class InputMessage(BaseModel):
    message: str

class OutputMessage(BaseModel):
    response: str

@app.get("/")
def read_root():
    return {"message": "Welcome to the chatbot API!"}

@app.post("/chat", response_model=OutputMessage)
def chat_post(input_message: InputMessage):
    try:
        global llm
        if llm is None:
            raise HTTPException(status_code=500, detail="Chatbot model not loaded")

        user_message = input_message.message

        # Use the chatbot model to generate a response
        bot_response_dict = llm(user_message,
                                max_tokens=-1,
                                echo=False,
                                temperature=0.1,
                                top_p=0.9)

        # llama-cpp-python returns an OpenAI-style completion dict,
        # so the generated text is under choices[0]["text"]
        bot_response = bot_response_dict["choices"][0]["text"]

        # Print the full response dict to the console for debugging
        print(f"bot_response_dict: {bot_response_dict}")

        return OutputMessage(response=bot_response)
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
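For reference, this is how I start the server locally and test the endpoint (assuming the file is named main.py and uvicorn is installed; both names are just my setup):

uvicorn main:app --host 127.0.0.1 --port 8000

curl -X POST http://127.0.0.1:8000/chat -H "Content-Type: application/json" -d "{\"message\": \"Hello\"}"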
When I run it locally like this, it takes very long to respond and I don't know how to fix that. My questions are: how should I be running my FastAPI app, and how do I deploy it on GCP?
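For context, this is roughly the app.yaml I used for the App Engine attempt (I pieced it together from the docs, so the instance class and entrypoint are my own guesses and may be part of the problem):

runtime: python311
instance_class: F4
entrypoint: gunicorn -b :$PORT -w 1 -k uvicorn.workers.UvicornWorker main:app

My requirements.txt lists fastapi, gunicorn, uvicorn, and llama-cpp-python.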