Error message when trying to use huggingface pretrained Tokenizer (roberta-base)

I am pretty new at this, so there might be something I am missing completely, but here is my problem: I am trying to create a Tokenizer class that uses a pretrained tokenizer model from Hugging Face, and I would then like to use this class in a larger transformer model to tokenize my input data. Here is the class code:

from transformers import AutoTokenizer
from transformers import RobertaTokenizer


class Roberta(MyTokenizer):

    def build(self, *args, **kwargs):
        self.max_length = self.phd.max_length
        self.untokenized_data = self.questions + self.answers

    def tokenize_and_filter(self):
        # Initialize the tokenizer with a pretrained model
        tokenizer = AutoTokenizer.from_pretrained('roberta-base')

        # Tokenize questions and answers, padding them to the same length
        inputs = tokenizer(self.questions, padding=True)
        outputs = tokenizer(self.answers, padding=True)

        tokenized_inputs = inputs['input_ids']
        tokenized_outputs = outputs['input_ids']

        return tokenized_inputs, tokenized_outputs
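
For reference, this is roughly what I expect the pretrained tokenizer to produce when it is called directly on a list of strings (a minimal standalone sketch, assuming the roberta-base checkpoint; the example strings are made up):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('roberta-base')

    # Calling the tokenizer on a list of strings returns a dict-like object
    # with 'input_ids' and 'attention_mask', one entry per input string
    encoded = tokenizer(["How are you?", "What time is it?"], padding=True)
    print(encoded['input_ids'])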

When I call the function tokenize_and_filter in my Transformer model as below

    questions = self.get_tokenizer().tokenize_and_filter
    answers   = self.get_tokenizer().tokenize_and_filter

    print(questions)

and then try to print the tokenized data, I get this output:

<bound method Roberta.tokenize_and_filter of <MyTokenizer.Roberta.Roberta object at 0x000002779A9E4D30>>

It appears that the function returns a method instead of a list or a tensor. I've tried passing the parameter return_tensors='tf', using the tokenizer.encode() method, using both AutoTokenizer and RobertaTokenizer, and using the batch_encode_plus() method; nothing seems to work.

Please help!

There is 1 answer below.

It seems this was a really stupid error on my part: I forgot to put parentheses when calling the function.

questions = self.get_tokenizer().tokenize_and_filter
answers   = self.get_tokenizer().tokenize_and_filter

should actually be

questions = self.get_tokenizer().tokenize_and_filter()
answers   = self.get_tokenizer().tokenize_and_filter()

and it works this way :)
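
For anyone else hitting this: referring to a method without parentheses gives you the bound method object instead of calling it, which is exactly what the printed output was showing. A tiny sketch with a made-up class, just to illustrate:

    class Demo:
        def tokenize_and_filter(self):
            return [1, 2, 3], [4, 5, 6]

    d = Demo()
    print(d.tokenize_and_filter)    # <bound method Demo.tokenize_and_filter of <...>>
    print(d.tokenize_and_filter())  # ([1, 2, 3], [4, 5, 6])

Also, since tokenize_and_filter already returns both lists as a tuple, a single call can be unpacked instead of calling it twice:

    questions, answers = self.get_tokenizer().tokenize_and_filter()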