I'm currently using the HuggingFace tokenizers library to tokenize a text corpus, and here's how I'm doing it:
from tokenizers import ByteLevelBPETokenizer
from tokenizers import normalizers

# Byte-level BPE with a BERT-style normalizer (keeping the original casing)
tokenizer = ByteLevelBPETokenizer()
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=False)

# Data is an iterable over the raw text
tokenizer.train_from_iterator(Data, vocab_size=50264, min_frequency=2,
                              special_tokens=["<s>", "<pad>", "</s>", "<unk>"])
When I apply this to my data, I notice that many of the resulting tokens are single words, and sometimes a word is even split into smaller subword pieces, which I anticipated. Is there a way to incorporate bigrams (and trigrams), or n-grams in general, into this process? I'd like to see longer tokens that consist of two or three words grouped together.
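For illustration, this is roughly what I see when encoding a sample sentence (the sentence and the output below are just illustrative, not taken from my actual data):

encoding = tokenizer.encode("new york city has a large population")
print(encoding.tokens)
# Mostly single-word (or subword) tokens, something like:
# ['new', 'Ġyork', 'Ġcity', 'Ġhas', 'Ġa', 'Ġlarge', 'Ġpopulation']
# whereas I would like frequent pairs such as "new york" to appear as one token.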