Error in deploying Falcon-7B after fine-tuning to AWS SageMaker endpoint using SageMaker Python SDK


I am currently running into a problem with AWS SageMaker where I cannot deploy my fine-tuned Falcon-7B model to a SageMaker endpoint after training it using an AWS training job. I am roughly following this tutorial:

https://www.philschmid.de/sagemaker-mistral#2-load-and-prepare-the-dataset

The tutorial follows a fairly predictable workflow: create a training script, set hyperparameters, create a Hugging Face estimator, and then train the model on data in an S3 bucket. This part works fine, and I can store the uncompressed model weights in an S3 bucket.
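For context, the training step looks roughly like this; the entry point, hyperparameters, and framework versions below are illustrative placeholders, not my exact values:

import sagemaker
from sagemaker.huggingface import HuggingFace

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# hyperparameters passed to the training script (illustrative values)
hyperparameters = {
  "model_id": "tiiuae/falcon-7b",
  "epochs": 3,
  "per_device_train_batch_size": 2,
  "lr": 2e-4,
}

# Hugging Face estimator wrapping the training script
huggingface_estimator = HuggingFace(
  entry_point="train.py",         # placeholder script name
  source_dir="scripts",           # placeholder source directory
  instance_type="ml.g5.4xlarge",  # illustrative training instance
  instance_count=1,
  role=role,
  transformers_version="4.28",    # versions depend on the DLC you pick
  pytorch_version="2.0",
  py_version="py310",
  hyperparameters=hyperparameters,
)

# start the training job on the dataset in S3
huggingface_estimator.fit({"training": "s3://my-bucket/train"})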

From there, I get the LLM image uri:

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.1.0",
  session=sess,
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

create a new HuggingFaceModel with model_data pointing to the S3 path of the trained weights:

import json
from sagemaker.huggingface import HuggingFaceModel

model_s3_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 1
health_check_timeout = 300

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024), # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048), # Max length of the generation (including input text)
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  model_data={'S3DataSource':{'S3Uri': model_s3_path,'S3DataType': 'S3Prefix','CompressionType': 'None'}},
  env=config
)

and then finally deploy the model:

llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 5 minutes to be able to load the model
)

From here I always get an error like this:

UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-tgi-inference-2023-10-21-16-47-53-072: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint..

When I check the CloudWatch logs, this is the first error that appears:

RuntimeError: Not enough memory to handle 4096 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
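For reference, this is roughly how I pull the endpoint logs programmatically (the CloudWatch console works just as well); the endpoint name is the one from the error above:

import boto3

logs = boto3.client("logs")
endpoint_name = "huggingface-pytorch-tgi-inference-2023-10-21-16-47-53-072"

# SageMaker endpoints write to a log group named after the endpoint
response = logs.filter_log_events(
  logGroupName=f"/aws/sagemaker/Endpoints/{endpoint_name}",
)
for event in response["events"]:
  print(event["message"])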

This doesn't really make sense to me: ml.g5.12xlarge is a fairly large instance, Falcon-7B is not a particularly large model, and in the tutorial the author successfully deploys Mistral 7B to even smaller instances like ml.g5.2xlarge. Even when I cut the prefill tokens in half, I still get the same error:

RuntimeError: Not enough memory to handle 2048 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
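For what it's worth, this is how I tried lowering the limit; I'm assuming here that the TGI container maps a MAX_BATCH_PREFILL_TOKENS environment variable to the --max-batch-prefill-tokens flag, the same way it maps the other settings:

config = {
  'HF_MODEL_ID': "/opt/ml/model",
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_INPUT_LENGTH': json.dumps(1024),
  'MAX_TOTAL_TOKENS': json.dumps(2048),
  # assumed to map to --max-batch-prefill-tokens inside the container
  'MAX_BATCH_PREFILL_TOKENS': json.dumps(2048),
}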

I've tried several permutations of this. I trained the model on AWS, pushed it to the Hugging Face Hub, and then tried to deploy it from the Hub (which I know is unnecessarily complicated, but I was desperate for a workaround); that failed with ValueError: Unsupported model type falcon. I've also tried training and deploying the model in compressed form (as a model.tar.gz), but that didn't work either and returned the same error. I am not an expert by any means, but I feel like this shouldn't be this hard. Has anyone experienced and solved this problem before, and is it unique to the Falcon model series and SageMaker?
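For completeness, the Hub-based attempt looked roughly like this; the repo name and token are placeholders for my fine-tuned model:

llm_model_hub = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env={
    'HF_MODEL_ID': "my-username/falcon-7b-finetuned",  # placeholder repo name
    'HUGGING_FACE_HUB_TOKEN': "<token>",               # only needed for a private repo
    'SM_NUM_GPUS': json.dumps(number_of_gpu),
    'MAX_INPUT_LENGTH': json.dumps(1024),
    'MAX_TOTAL_TOKENS': json.dumps(2048),
  },
)

llm = llm_model_hub.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
)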

There is 1 answer below.

Please refer to this example - https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/generativeai/llm-workshop/lab11-llama2/meta-llama-2-7b-lmi.ipynb

There could be 2 potential issues: 1/ You are setting a larger value of prefill tokens than the instance can handle, which is causing these errors, so start small. 2/ Max input + output tokens for Llama 2 is 4096, but from my testing it should ideally be around 3500, so be sure to set these values appropriately.
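For example, a conservative starting configuration might look like the below (values are illustrative, and I'm assuming the container reads MAX_BATCH_PREFILL_TOKENS from the environment; tune upward once the endpoint is healthy):

import json

config = {
  'HF_MODEL_ID': "/opt/ml/model",
  'SM_NUM_GPUS': json.dumps(1),                 # as in the question's config
  'MAX_INPUT_LENGTH': json.dumps(512),          # start small, then increase
  'MAX_TOTAL_TOKENS': json.dumps(1024),         # input + output tokens
  'MAX_BATCH_PREFILL_TOKENS': json.dumps(1024), # keep the prefill budget low at first
}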

Also, you no longer need to create a tarball to host models in SageMaker. Please see this - https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-uncompressed.html