Training a BPE tokenizer learns the merge rules that are then applied to new text to split it into tokens.
From the Hugging Face tutorial on this topic, merge rules look like this:
```python
{('Ġ', 't'): 'Ġt', ('i', 's'): 'is', ('e', 'r'): 'er', ('Ġ', 'a'): 'Ġa', ('Ġt', 'o'): 'Ġto', ('e', 'n'): 'en',
 ('T', 'h'): 'Th', ('Th', 'is'): 'This', ('o', 'u'): 'ou', ('s', 'e'): 'se', ('Ġto', 'k'): 'Ġtok',
 ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('Ġ', 'is'): 'Ġis', ('Ġt', 'h'): 'Ġth', ('Ġth', 'e'): 'Ġthe',
 ('i', 'n'): 'in', ('Ġa', 'b'): 'Ġab', ('Ġtoken', 'i'): 'Ġtokeni'}
```
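For intuition (this is my reading, not something the tutorial spells out): tokenization replays these merges in the order they were learned, each rule pairing up pieces built by earlier rules. A minimal sketch on the word "This", using just the three rules from the table that fire on it:

```python
# Subset of the merge table above, kept in learned order
# (Python dicts preserve insertion order since 3.7).
merges = {('i', 's'): 'is', ('T', 'h'): 'Th', ('Th', 'is'): 'This'}

split = list("This")  # ['T', 'h', 'i', 's']
for pair, merged in merges.items():
    i = 0
    while i < len(split) - 1:
        if (split[i], split[i + 1]) == pair:
            # Replace the matched pair with its merged token.
            split = split[:i] + [merged] + split[i + 2:]
        else:
            i += 1

print(split)  # ['This']
```

So the characters collapse step by step: `['T', 'h', 'i', 's']` → `['T', 'h', 'is']` → `['Th', 'is']` → `['This']`.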
And when we get new text, this is how the tutorial says tokenization should happen:
```python
def tokenize(text):
    # Pre-tokenization splits the text into words according to predefined
    # rules, e.g. split on every whitespace.
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2 :]
                else:
                    i += 1  # This is where my doubt is
            splits[idx] = split
    return sum(splits, [])
```
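For context, here is a toy stand-in for the pre-tokenization step (my own sketch, not the Hugging Face implementation), assuming GPT-2-style behaviour where every space-prefixed word gets the `Ġ` marker; the real `pre_tokenize_str` also returns character offsets, which the code above discards anyway:

```python
def toy_pre_tokenize(text):
    # Toy stand-in for the pre-tokenizer: split on spaces and mark each
    # non-initial word with 'Ġ', the GPT-2 byte-level symbol for a
    # leading space. Offsets are omitted for simplicity.
    words = text.split(' ')
    return [words[0]] + ['Ġ' + w for w in words[1:]]

print(toy_pre_tokenize("This is a token"))  # ['This', 'Ġis', 'Ġa', 'Ġtoken']
```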
It might be trivial, but I do not see why the `i += 1` is under the `else` clause. I think `i += 1` should always happen, irrespective of whether or not the merge was found. Notwithstanding my understanding, I am assuming the demonstrated code is correct and that I am missing some edge case where moving `i += 1` out of the `else` block would break things and/or produce a wrong tokenization. Can you help me understand what that case would be? Or, alternatively, confirm that `i += 1` does belong outside of the `else` block?
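To make the question concrete and testable, here is a self-contained harness comparing the two variants on single words (the names `apply_merges` and `always_bump` are my own; `always_bump=True` is my proposed change, i.e. advance `i` on every iteration):

```python
# The merges dict from the tutorial, quoted above.
merges = {('Ġ', 't'): 'Ġt', ('i', 's'): 'is', ('e', 'r'): 'er', ('Ġ', 'a'): 'Ġa',
          ('Ġt', 'o'): 'Ġto', ('e', 'n'): 'en', ('T', 'h'): 'Th', ('Th', 'is'): 'This',
          ('o', 'u'): 'ou', ('s', 'e'): 'se', ('Ġto', 'k'): 'Ġtok', ('Ġtok', 'en'): 'Ġtoken',
          ('n', 'd'): 'nd', ('Ġ', 'is'): 'Ġis', ('Ġt', 'h'): 'Ġth', ('Ġth', 'e'): 'Ġthe',
          ('i', 'n'): 'in', ('Ġa', 'b'): 'Ġab', ('Ġtoken', 'i'): 'Ġtokeni'}

def apply_merges(word, always_bump):
    split = list(word)
    for pair, merged in merges.items():
        i = 0
        while i < len(split) - 1:
            if (split[i], split[i + 1]) == pair:
                split = split[:i] + [merged] + split[i + 2:]
                if always_bump:
                    i += 1   # my proposed variant: advance even after a merge
            else:
                i += 1       # the tutorial's version
    return split

for word in ['This', 'Ġis', 'Ġthe', 'Ġtokenizer']:
    print(word, apply_merges(word, False), apply_merges(word, True))
```

On every word I tried, the two variants print identical token lists, which is exactly what makes me suspect the `else` placement is either redundant or guarding against an edge case I cannot see.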