I'm currently using the HuggingFace tokenizers library to tokenize a text corpus, and here's how I'm doing it:
from tokenizers import ByteLevelBPETokenizer
from tokenizers import normalizers

# Byte-level BPE with a BERT-style normalizer (keeping the original casing)
tokenizer = ByteLevelBPETokenizer()
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)

# Data is an iterable over the raw text
tokenizer.train_from_iterator(Data, vocab_size=50264, min_frequency=2,
                              special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
When I apply this to my data, I notice that many of the resulting tokens are single words, and sometimes a word is even split into smaller subword pieces, which I anticipated. Is there a way to incorporate bigrams (and trigrams), or n-grams in general, into this process? I'd like to see longer tokens that consist of two or three words grouped together.
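For illustration, this is roughly what I see when encoding a sample sentence (the sentence and the output below are just illustrative, not taken from my actual data):

encoding = tokenizer.encode("new york city has a large population")
print(encoding.tokens)
# Mostly single-word (or subword) tokens, something like:
# ['new', 'Ġyork', 'Ġcity', 'Ġhas', 'Ġa', 'Ġlarge', 'Ġpopulation']
# whereas I would like frequent pairs such as "new york" to appear as one token.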