From the transformers library by Hugging Face:
from transformers import BertTokenizer
tb = BertTokenizer.from_pretrained("bert-base-uncased")
tb is now a WordPiece tokenizer. It has the arguments text and text_target. What is the difference between the two? Can you please give the functional difference as well?
The documentation says:
text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
I do not understand the difference between the two from the descriptions above.
The BertTokenizer from Hugging Face's Transformers library takes two arguments, text and text_target, which serve different purposes in various NLP tasks.

The text argument is used for the input sequence that needs to be encoded; it can be a single string, a list of strings, or a list of lists of strings (for pretokenized inputs). For example, in question answering the text would be the question, while in translation it would be the sentence in the source language.
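To make that concrete, here is a minimal sketch of the text argument (the sample sentence is just an illustration; the listed keys are what the BERT tokenizer returns by default):

from transformers import BertTokenizer

tb = BertTokenizer.from_pretrained("bert-base-uncased")

# A single string is encoded as one input sequence;
# enc holds input_ids, token_type_ids and attention_mask for the model
enc = tb(text="How old are you?")

# A pretokenized input (a list of words) needs is_split_into_words=True,
# otherwise the list would be read as a batch of separate sequences
enc = tb(text=["How", "old", "are", "you", "?"], is_split_into_words=True)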
On the other hand, the text_target argument is used for the target sequence, i.e. the label or desired output for a given input sequence. It can likewise be a single string, a list of strings, or a list of lists of strings. Returning to the examples above, in question answering the text_target would be the answer to the question, while in translation it would be the sentence in the target language.

The main difference between the two arguments therefore lies in their intended use. In tasks without paired sequences, such as plain text classification, only the text argument is used. In tasks with an input/target pair, such as translation or generative question answering, both text and text_target are passed so that the input and the desired output are encoded together, and the model can learn the mapping between the two sequences.