I have a JSON file where the structure looks like the following:
{
  "events": [
    {
      "id": 1,
      "name": "EV001",
      "note": "",
      "pages": [
        {
          "list": [
            {
              "code": 231,
              "indent": 0,
              "parameters": [
                0
              ]
            },
            {
              "code": 401,
              "indent": 0,
              "parameters": [
                "ひな"
              ]
            },
            {
              "code": 401,
              "indent": 0,
              "parameters": [
                "ひらがな"
              ]
            },
            {
              "code": 131,
              "indent": 0,
              "parameters": [
                0
              ]
            },...
          ]
        }
      ]
    }
  ]
}
My goal is to extract the text inside "parameters" of every command whose "code" is 401, translate it, and then write the translation back into the same spot.
Currently I use the following code to extract the text:
# Extract 401 Text
# `data` is the parsed JSON document (e.g. the result of json.load)
untranslatedTextList = []
events = data['events']
for event in events:
    # Skip empty (null) entries in the events array
    if event is not None:
        for page in event['pages']:
            for command in page['list']:
                if command['code'] == 401:
                    untranslatedTextList.append(command['parameters'][0])
This gives me untranslatedTextList, which is a list of all the strings I need to translate. I can translate this list using whatever method I like.
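For illustration, here is a minimal sketch of how the position of each string could be kept alongside the text. It assumes the same structure as the JSON above; the helper name `collect_401_commands` and the inline sample data are hypothetical and only there to make the example self-contained. Because Python dicts are mutable, keeping the command dict itself (rather than only the string) is one way to remember where each string came from.

```python
# Small sample mirroring the structure shown above (illustrative values only).
data = {
    "events": [
        None,
        {"id": 1, "name": "EV001", "note": "", "pages": [{"list": [
            {"code": 231, "indent": 0, "parameters": [0]},
            {"code": 401, "indent": 0, "parameters": ["ひな"]},
            {"code": 401, "indent": 0, "parameters": ["ひらがな"]},
            {"code": 131, "indent": 0, "parameters": [0]},
        ]}]},
    ]
}


def collect_401_commands(data):
    """Return the command dicts (references, not copies) with code 401, in file order."""
    commands = []
    for event in data["events"]:
        if event is None:
            continue
        for page in event["pages"]:
            for command in page["list"]:
                if command["code"] == 401:
                    commands.append(command)
    return commands


commands_401 = collect_401_commands(data)
# Same content as untranslatedTextList, but index i now maps to commands_401[i].
untranslated = [command["parameters"][0] for command in commands_401]
```

With this layout, assigning to `commands_401[i]["parameters"][0]` changes `data` in place, so the only remaining requirement is that the translated list has exactly one entry per collected command.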
My problem starts here. Normally I would translate line by line, so that I could easily keep track of where each piece of raw text came from and write the translation back into the same command. However, this has too many drawbacks:
- (Main Issue) The translation quality suffers greatly because the machine doesn't have the context. Much of the text is dialogue and requires knowledge of what was just said or what the context is.
- The cost is much higher line by line vs one giant batch.
- The time taken for translation is much greater due to the larger number of requests.
Therefore, my only option is to translate all of the text in the list in a single request to avoid the above pitfalls. However, I am then left with a translated blob of a different length, where it is nearly impossible to know which sentences belong to which 401 codes. I have tried using delimiters to mark where each group of 401s ends, but GPT-3.5 likes to randomly add or remove these delimiters, which throws everything off.
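To make the failure mode concrete, the delimiter approach can be sketched roughly as follows. The marker string `<<<SEP>>>` and the function names are hypothetical, chosen only for this example. The count check detects an added or removed delimiter but does not repair it.

```python
# Hypothetical marker; any string that cannot appear in the dialogue would do.
MARKER = "<<<SEP>>>"
DELIMITER = "\n" + MARKER + "\n"


def join_for_translation(lines):
    """Build one blob to send to the translator in a single request."""
    return DELIMITER.join(lines)


def split_translation(blob, expected_count):
    """Split the translated blob back into lines and verify the count.

    Splitting on the bare marker tolerates whitespace changes around it.
    Raises ValueError if the translator added or dropped a marker.
    """
    parts = [part.strip() for part in blob.split(MARKER)]
    if len(parts) != expected_count:
        raise ValueError(
            f"expected {expected_count} segments, got {len(parts)}"
        )
    return parts
```

Raising on a mismatch at least prevents a shifted write-back; the caller can then retry the request or fall back to a smaller batch.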
Frankly, after thinking about it for a long time, it seems like an impossible task, but maybe someone in the community has a good idea.
I have tried groupings, delimiters, and forcefully matching the two lists. Each of them ends up with a small mismatch at one of the 401 positions, which throws off the order of everything in the file and causes bugs.
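The write-back step itself can be sketched as below, under the assumption that the 401 command dicts were collected in file order. The function name `write_back` and the sample data are hypothetical. The length guard refuses to write anything when the two lists differ, so an off-by-one mismatch cannot silently shift every later line.

```python
def write_back(commands, translations):
    """Write translations into the 401 commands, one to one, in order.

    Nothing is modified unless both lists have the same length.
    """
    if len(commands) != len(translations):
        raise ValueError(
            f"{len(commands)} commands but {len(translations)} translations"
        )
    for command, text in zip(commands, translations):
        command["parameters"][0] = text


# Illustrative data: two 401 commands as they would appear in a page's list.
sample_commands = [
    {"code": 401, "indent": 0, "parameters": ["ひな"]},
    {"code": 401, "indent": 0, "parameters": ["ひらがな"]},
]
write_back(sample_commands, ["Hina", "Hiragana"])
```

This only guards against a wrong count; two sentences merged in one place and split in another would still pass the check, so it is a safety net rather than a full solution.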