How can I use the full 128,000-token context of GPT-4 Turbo?


The new GPT-4 Turbo has a 128,000-token context and a 4,096-token output limit.

However, when I call it with an input of 4,000 tokens, it only produces 96 tokens of output. I expected that with a 123,000-token input I would still be able to generate up to 4,096 tokens of output. At least, that is my experience with GPT-4, which has an 8,192-token limit: with an input of 3,000 tokens, it can generate up to 5,192 tokens of output.
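In other words, the expected arithmetic is: output budget = min(context window − input tokens, output cap). A minimal sketch of that expectation, using the limits stated above (`expected_output_budget` is a hypothetical helper name):

```python
CONTEXT_WINDOW = 128_000  # gpt-4-1106-preview total context
OUTPUT_CAP = 4_096        # model's separate output limit

def expected_output_budget(input_tokens: int) -> int:
    """Whatever context remains after the input, capped at the output limit."""
    return min(CONTEXT_WINDOW - input_tokens, OUTPUT_CAP)

print(expected_output_budget(4_000))    # 4096 -- plenty of room left
print(expected_output_budget(123_000))  # 4096 -- 5,000 remain, cap applies
print(expected_output_budget(126_000))  # 2000 -- only the remainder
```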

This is my function for interfacing with the OpenAI API.

import html
import os

import openai

def process_ai(input_text, res=None):
    final_res = ''

    # Defaults, overridden by the settings document below
    model = 'gpt-4'
    temperature = 0.7
    if res:
        if 'model' in res:
            model = res['model']
        if res.get('temperature'):
            temperature = res['temperature']

    openai.api_key      = os.getenv('API_KEY')
    openai.organization = os.getenv('ORGANIZATION_KEY')

    try:
        # Cap the budget at 4096, then subtract an estimate of the input
        # (~3.9 characters per token, plus 650 tokens of overhead)
        max_allowed_tokens = min(res.get('max_tokens', 8192), 4096)
        max_input_length = max_allowed_tokens - int(len(input_text) / 3.9 + 650)

        # Ensure the output budget doesn't go below a threshold (e.g. 10)
        max_input_length = max(max_input_length, 10)

        result = openai.ChatCompletion.create(
            model             = model,
            messages          = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": input_text},
            ],
            temperature       = temperature,
            max_tokens        = max_input_length,
            top_p             = 0.3,
            frequency_penalty = 0.5,
            presence_penalty  = 0.0,
        )
        final_res = result['choices'][0]['message']['content']
    except (openai.error.RateLimitError, openai.error.ServiceUnavailableError):
        final_res = 'AI Server is busy, try again in a few minutes.'
    except Exception as e:
        final_res = ('Error occurred, try again in a few minutes or contact admin. '
                     'Input estimate: ' + str(int(len(input_text) / 3.9 + 650)) +
                     ' Dev info: ' + str(e))
    final_res = html.unescape(final_res)  # undo HTML entity escaping

    return final_res
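Plugging numbers into the budget calculation above shows where it collapses. A sketch using the question's own chars-per-token heuristic; the 13,065-character input is a hypothetical figure chosen to reproduce the observed 96-token budget:

```python
# min() caps the window at 4,096, discarding the 128k context entirely
max_allowed_tokens = min(128_000, 4_096)

# Estimate the input at ~3.9 characters per token plus 650 tokens of overhead
input_estimate = int(13_065 / 3.9 + 650)   # ~3,350-token input -> 4,000

# What remains for the output is almost nothing
print(max_allowed_tokens - input_estimate)  # 96
```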

FYI, max_tokens comes from a NoSQL document:

"name": "GPT4-Turbo",
"enabled": NumberInt(1),
"model": "gpt-4-1106-preview",
"desc": "",
"max_tokens": NumberInt(128000)

If I flip back to GPT-4, an input string of 4,000 tokens allows for up to 8192 - 4000 = 4192 tokens of output. If I use GPT-4 Turbo and set max_tokens = 4095, I am stuck with my original problem of only receiving 96 tokens of output.

How are we supposed to use this 128,000-token context? Are we supposed to chunk data into it somehow?

1 Answer
I figured it out. It was a simple math error.

I changed these lines:

max_allowed_tokens = res.get('max_tokens')  # full 128k window, no 4096 cap
max_input_length = max_allowed_tokens - int(len(input_text)/3.9+650)

Then I took the minimum of the two when setting max_tokens:

max_tokens = min(max_input_length, 4095),
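Put together, the corrected calculation looks like this. A sketch only: the ~3.9 chars-per-token estimate and the 650-token overhead are the question's own heuristics, and `output_token_budget` is a hypothetical helper name:

```python
def output_token_budget(input_text: str,
                        context_window: int = 128_000,
                        output_cap: int = 4_095) -> int:
    # Estimate input tokens from the character count, plus fixed overhead
    input_estimate = int(len(input_text) / 3.9 + 650)
    # Spend the remaining context on output, but never exceed the output cap,
    # and keep a small floor so the request is never degenerate
    return max(min(context_window - input_estimate, output_cap), 10)

# A ~4,000-token (15,600-character) input now gets the full output budget
print(output_token_budget("x" * 15_600))  # 4095
```

Subtracting the input estimate from the full context window (instead of from the already-capped 4,096) is what restores the 128k budget; the cap is then applied last, only to the output.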