Understanding the Hugging Face tokenization test-time implementation


Training a tokenizer learns the merge rules that are then applied to new text to split it into tokens.

From the Hugging Face tutorial on this topic:

Merge rules look like this:

{('Ġ', 't'): 'Ġt', ('i', 's'): 'is', ('e', 'r'): 'er', ('Ġ', 'a'): 'Ġa', ('Ġt', 'o'): 'Ġto', ('e', 'n'): 'en',
 ('T', 'h'): 'Th', ('Th', 'is'): 'This', ('o', 'u'): 'ou', ('s', 'e'): 'se', ('Ġto', 'k'): 'Ġtok',
 ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('Ġ', 'is'): 'Ġis', ('Ġt', 'h'): 'Ġth', ('Ġth', 'e'): 'Ġthe',
 ('i', 'n'): 'in', ('Ġa', 'b'): 'Ġab', ('Ġtoken', 'i'): 'Ġtokeni'}
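For example, applying a subset of these rules, in the order learned, builds up 'Ġtoken' from single characters. This is a toy sketch of my own (the helper `merge_pair` is mine, not from the tutorial), just to show how the rules compose:

```python
def merge_pair(split, pair, merge):
    # Replace every occurrence of the adjacent pair with the merged token.
    out = []
    i = 0
    while i < len(split):
        if i < len(split) - 1 and (split[i], split[i + 1]) == pair:
            out.append(merge)
            i += 2
        else:
            out.append(split[i])
            i += 1
    return out

# Subset of the rules above, kept in their original order.
split = ['Ġ', 't', 'o', 'k', 'e', 'n']
for pair, merge in [(('Ġ', 't'), 'Ġt'), (('Ġt', 'o'), 'Ġto'),
                    (('e', 'n'), 'en'), (('Ġto', 'k'), 'Ġtok'),
                    (('Ġtok', 'en'), 'Ġtoken')]:
    split = merge_pair(split, pair, merge)
print(split)  # ['Ġtoken']
```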

And when we get new text, this is how the Hugging Face tutorial says tokenization should happen:

def tokenize(text):
    # Pre-tokenization splits the text into words according to predefined
    # rules, e.g. on every whitespace.
    pre_tokenize_result = tokenizer._tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2:]
                else:
                    i += 1  # This is where my doubt is
            splits[idx] = split

    return sum(splits, [])
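To make the snippet runnable without a loaded tokenizer, here is a self-contained version I put together: the `toy_pre_tokenize` helper is my own stand-in for `pre_tokenize_str` (marking non-initial words with 'Ġ' is my simplification of the GPT-2 byte-level behavior), while the merge loop is the tutorial's:

```python
# The merge rules from the tutorial, in learned order.
merges = {('Ġ', 't'): 'Ġt', ('i', 's'): 'is', ('e', 'r'): 'er', ('Ġ', 'a'): 'Ġa',
          ('Ġt', 'o'): 'Ġto', ('e', 'n'): 'en', ('T', 'h'): 'Th', ('Th', 'is'): 'This',
          ('o', 'u'): 'ou', ('s', 'e'): 'se', ('Ġto', 'k'): 'Ġtok',
          ('Ġtok', 'en'): 'Ġtoken', ('n', 'd'): 'nd', ('Ġ', 'is'): 'Ġis',
          ('Ġt', 'h'): 'Ġth', ('Ġth', 'e'): 'Ġthe', ('i', 'n'): 'in',
          ('Ġa', 'b'): 'Ġab', ('Ġtoken', 'i'): 'Ġtokeni'}

def toy_pre_tokenize(text):
    # Stand-in for the real pre-tokenizer: split on whitespace and prefix
    # every non-initial word with 'Ġ' (my assumption for this sketch).
    words = text.split()
    return [w if i == 0 else 'Ġ' + w for i, w in enumerate(words)]

def tokenize(text):
    splits = [[ch for ch in word] for word in toy_pre_tokenize(text)]
    for pair, merge in merges.items():  # dicts preserve insertion order (3.7+)
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2:]
                else:
                    i += 1
            splits[idx] = split
    return sum(splits, [])

print(tokenize("This is the token"))  # ['This', 'Ġis', 'Ġthe', 'Ġtoken']
```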

It might be trivial, but I do not see why the i += 1 sits inside the else clause. I would expect i += 1 to happen on every iteration, whether or not a merge was made. That said, I am assuming the demonstrated code is correct and that I am missing some edge case where moving i += 1 out of the else block would break and/or produce wrong tokenization. Can you help me understand what that edge case would be? Or, alternatively, confirm that i += 1 belongs outside the else block?
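For what it's worth, here is a minimal probe I wrote comparing the two placements on a single merge rule over an input with overlapping candidate pairs (my own code, not from the tutorial; I'm not claiming it settles the question):

```python
def apply_merge_stay(split, pair, merge):
    # i += 1 only in the else branch, as in the tutorial code.
    i = 0
    while i < len(split) - 1:
        if split[i] == pair[0] and split[i + 1] == pair[1]:
            split = split[:i] + [merge] + split[i + 2:]
        else:
            i += 1
    return split

def apply_merge_advance(split, pair, merge):
    # i += 1 unconditionally, as I would have written it.
    i = 0
    while i < len(split) - 1:
        if split[i] == pair[0] and split[i + 1] == pair[1]:
            split = split[:i] + [merge] + split[i + 2:]
        i += 1
    return split

word = list("aaaa")  # overlapping ('a', 'a') pairs
print(apply_merge_stay(word[:], ("a", "a"), "aa"))     # ['aa', 'aa']
print(apply_merge_advance(word[:], ("a", "a"), "aa"))  # ['aa', 'aa']
```

On this input both variants agree, which is part of why I can't see where they would diverge.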
