From the transformers library by Hugging Face:

from transformers import BertTokenizer
tb = BertTokenizer.from_pretrained("bert-base-uncased")

tb is now a WordPiece tokenizer. Its __call__ method accepts the arguments text and text_target. What is the difference between the two? Can you please give the functional difference as well?

The documentation says:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

text_target (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

I do not understand the difference between the two from the descriptions above.


The BertTokenizer from Hugging Face's Transformers library accepts two text arguments when called, text and text_target, which serve different purposes across NLP tasks.

The text argument is used for the input sequence that needs to be encoded, and its format can be a single string, a list of strings, or a list of lists of strings (for pretokenized inputs). For example, in tasks like question-answering, the text would be the question, while in translation, it would be the sentence in the source language.
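
As a minimal sketch (the sentence is invented for illustration), passing only text encodes the input sequence:

from transformers import BertTokenizer

tb = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode an input sequence only
enc = tb(text="the cat sat on the mat")
print(list(enc.keys()))
# ['input_ids', 'token_type_ids', 'attention_mask']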

On the other hand, the text_target argument is used for the target sequence, i.e. the label or desired output for a given input sequence. Its format can also be a single string, a list of strings, or a list of lists of strings. Returning to the examples above: in question-answering, the text_target would be the answer to the question, while in translation, it would be the sentence in the target language.
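
For BertTokenizer nothing visibly changes when you pass text_target alone, because BERT tokenizes source and target text the same way; text_target exists so that tokenizers with distinct source/target settings (e.g. translation tokenizers with separate vocabularies per language) can switch to their target-side configuration. A hedged sketch, with an invented target sentence:

from transformers import BertTokenizer

tb = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a target sequence only; the result comes back as ordinary input_ids
tgt = tb(text_target="le chat est assis sur le tapis")
print(list(tgt.keys()))
# ['input_ids', 'token_type_ids', 'attention_mask']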

The main difference between the two arguments therefore lies in their intended use. In tasks without paired sequences, such as simple text classification, only the text argument is used. In tasks involving paired input/output sequences, like question-answering or translation, text encodes the model input and text_target encodes the desired output; the model then learns the mapping between the two sequences.
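
The clearest functional difference appears when both arguments are passed in one call: in recent versions of transformers, the tokenizer encodes text as the model input and returns the encoded text_target under a labels key, ready to be used as the training target. A sketch (the sentence pair is invented):

from transformers import BertTokenizer

tb = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode the input and the desired output together
batch = tb(text="the cat sat on the mat",
           text_target="le chat est assis sur le tapis")
print(list(batch.keys()))
# ['input_ids', 'token_type_ids', 'attention_mask', 'labels']
# batch["labels"] holds the input_ids of the target sequence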