Byte-level BPE tokenizer for handling bigrams and trigrams


I'm currently using the HuggingFace tokenizers library to tokenize a text dataset, and here's how I'm doing it:

from tokenizers import ByteLevelBPETokenizer
from tokenizers import normalizers

# Byte-level BPE tokenizer with a BERT-style normalizer (case preserved).
tokenizer = ByteLevelBPETokenizer()
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)

# Data is an iterable of strings; train a 50264-entry vocabulary.
tokenizer.train_from_iterator(Data, vocab_size=50264, min_frequency=2, special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
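For reference, this is how I inspect the output once training is finished (a minimal sketch; the sample sentence is just an illustration):

encoding = tokenizer.encode("The quick brown fox jumps over the lazy dog")
print(encoding.tokens)  # mostly single words or subword pieces, e.g. 'Ġquick', 'Ġbrown'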

When I apply this to the data, I notice that many of the identified tokens are single words. Sometimes it even splits a word into smaller parts, which is something I anticipated. I'm curious whether there's a way to incorporate bigrams (and trigrams), or n-grams in general, into this process. I'd like to see longer tokens that consist of two or three words grouped together.
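As far as I can tell, the byte-level pre-tokenizer splits the text at word boundaries, so the learned merges never span two words. The only workaround I can think of is to group consecutive tokens after encoding; here is a rough sketch (the to_ngrams helper is just something I wrote, not part of the library):

def to_ngrams(tokens, n=2):
    # Slide a window of size n over the token list to build n-grams.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenizer.encode("The quick brown fox").tokens
print(to_ngrams(tokens, n=2))  # e.g. [('The', 'Ġquick'), ('Ġquick', 'Ġbrown'), ...]

But this happens outside the tokenizer, so the n-grams never become vocabulary entries. Is there a way to make the training itself produce such multi-word tokens?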

