How to set up configuration file for sagemaker triton inference?


I have been looking at examples and ran into this one from AWS: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-triton/ensemble/sentence-transformer-trt/examples/ensemble_hf/bert-trt/config.pbtxt. Based on this example, we need to define the inputs and outputs and the data types for those inputs and outputs. The example is not clear on what dims (probably "dimensions") represents: is it the number of elements in an input array? Also, what is max_batch_size? And at the bottom we have to specify an instance_group, where kind is set to KIND_GPU; I assume that if we are using a CPU-based instance we can change this to CPU. Do we need to specify how many CPUs we want to use?

name: "bert-trt"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [128]
  }...
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [128, 384]
  }...
]
instance_group [
    {
      kind: KIND_GPU
    }
  ]

I have tested the given example, but if we want to use a text input and do the tokenization on the server, what does this config.pbtxt file look like?

There is 1 answer below.

BEST ANSWER

The max_batch_size entry specifies the maximum batch size that Triton's dynamic batcher will use: Triton combines multiple requests into a single batch in order to increase throughput.
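For example, the dynamic batcher itself can be tuned in config.pbtxt; the values below are illustrative, not recommendations:

```
name: "bert-trt"
platform: "tensorrt_plan"
max_batch_size: 16
# combine requests into batches of these preferred sizes,
# waiting at most 100 microseconds for more requests to arrive
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100
}
```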

If you set max_batch_size to zero, you need to define the batch dimension explicitly in config.pbtxt, e.g.

name: "bert-trt"
platform: "tensorrt_plan"
max_batch_size: 0
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [-1, 128]
  }...
]

In this case -1 means that the batch dimension is variable (you can also set the sequence dimension to -1).

In order to tokenize on the server, you need to create a Python backend and either build an ensemble model, or use Python for both the tokenizer and the model:

name: "tokenizer"
max_batch_size: 0
backend: "python"

input [
    {
        name: "text"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]
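With the ensemble route, a top-level config.pbtxt wires the tokenizer's output into the TensorRT model. A sketch, assuming the tokenizer exposes a token_ids output that matches the bert-trt input (tensor and model names taken from the snippets above):

```
name: "ensemble"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "text"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, 128, 384 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "text" value: "text" }
      output_map { key: "token_ids" value: "token_ids" }
    },
    {
      model_name: "bert-trt"
      model_version: -1
      input_map { key: "token_ids" value: "token_ids" }
      output_map { key: "output" value: "output" }
    }
  ]
}
```

Requests then target the ensemble, and Triton runs the tokenizer and the model in sequence on the server.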

I found triton-ensemble-model-for-deploying-transformers-into-production a good resource.
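To make the tokenizer step concrete, here is a minimal standalone sketch of the fixed-length encoding that a python-backend model.py would perform. The whitespace vocabulary is a toy stand-in for a real tokenizer (e.g. a HuggingFace one), and encode_batch is a hypothetical helper name; inside Triton this logic would live in TritonPythonModel.execute(), reading the "text" input tensor and returning a "token_ids" tensor.

```python
import numpy as np

def encode_batch(texts, vocab, max_len=128, pad_id=0, unk_id=1):
    """Map a batch of strings to a fixed-shape int32 array [batch, max_len]."""
    batch = np.full((len(texts), max_len), pad_id, dtype=np.int32)
    for i, text in enumerate(texts):
        # toy whitespace tokenizer: look up each token, truncate to max_len
        ids = [vocab.get(tok, unk_id) for tok in text.split()][:max_len]
        batch[i, : len(ids)] = ids  # remaining positions stay pad_id
    return batch

vocab = {"hello": 2, "world": 3}
token_ids = encode_batch(["hello world", "hello"], vocab, max_len=8)
# token_ids has shape (2, 8) and dtype int32, i.e. the fixed-length
# layout expected by an input declared with dims: [128] (here max_len=8)
```

The key point is that the output shape and dtype must match the data_type and dims declared for the downstream model's input in config.pbtxt.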