I recently ran into some questions while learning Google’s SentencePiece.
- BPE, WordPiece, and Unigram are all common subword algorithms, so what is the relationship between SentencePiece and them? Some tutorials say that SentencePiece is itself another subword algorithm, while others say it is an implementation of the subword algorithms above.
- SentencePiece seems to do nothing in preprocessing except replace spaces with a special underline character. If there is no pre-tokenization stage, how can it apply subword algorithms such as BPE and Unigram that require pre-tokenization?
My own understanding:
- I lean toward viewing SentencePiece as an implementation of subword algorithms such as BPE and Unigram, because if SentencePiece were itself classified as a subword algorithm, why would expressions like SentencePiece+BPE and SentencePiece+Unigram exist?
- SentencePiece supports BPE, Unigram, and other algorithms, yet those algorithms clearly require pre-tokenization, while SentencePiece does not need it. Isn't that a contradiction?
I had the same question a couple of days ago, so I did some research and here is my answer (which may or may not be 100% correct, but might be helpful):
If a sentence contains no delimiters at all (for example, a Japanese sentence that says "Hello World."), SentencePiece does not pre-tokenize it; it treats the whole sentence as a single chunk of text and probably sends that one chunk to the BPE algorithm.
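As a rough illustration of both cases, here is a minimal sketch using the sentencepiece Python package. The toy corpus and training settings are my own and are only meant to show the shape of the output, not to reproduce the original example.

```python
import io
import sentencepiece as spm

# Toy corpus, made up purely for illustration; a real model is trained on far more text.
corpus = [
    "Hello World.",
    "Hello there, world of subwords.",
    "SentencePiece treats raw text as a sequence of characters.",
    "こんにちは世界。",
]

# Train a tiny BPE model entirely in memory.
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(corpus),
    model_writer=model,
    model_type="bpe",        # could also be "unigram", the default
    vocab_size=50,           # tiny vocabulary for the toy corpus
    character_coverage=1.0,  # keep every character, including the Japanese ones
    hard_vocab_limit=False,  # treat vocab_size as a soft limit on this tiny corpus
)
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())

# No delimiters at all: the whole sentence goes to the subword algorithm as one chunk,
# and the resulting pieces depend entirely on the learned merges.
print(sp.encode("こんにちは世界。", out_type=str))

# With spaces: each space is kept as the meta symbol '▁' (U+2581) attached to the
# following piece, e.g. something like ['▁Hello', '▁World', '.'] (exact pieces vary).
print(sp.encode("Hello World.", out_type=str))
```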
See how it pre-tokenizes the English sentence based on the whitespace? The only difference is that it preserves the whitespace information, which a naive whitespace pre-tokenizer (or any other pre-tokenizer) would lose. Based on my understanding of the paper, this is the only strength of SentencePiece: it makes encoding and decoding lossless by keeping the whitespace information intact.
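The lossless round trip is easy to check with the toy model from the sketch above (again, my own illustration, not from the original answer):

```python
# Decoding the pieces restores the original string exactly, because the whitespace
# survived as the '▁' meta symbol instead of being thrown away during pre-tokenization.
pieces = sp.encode("Hello World.", out_type=str)
print(pieces)             # whitespace kept as '▁' on the pieces
print(sp.decode(pieces))  # -> 'Hello World.', the round trip is lossless
```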
Reference:
- Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. arXiv:1808.06226.