I am trying to run DistilBERT with ktrain in Colab, but I am getting the error "too many values to unpack". I am performing toxic comment classification: I uploaded 'train.csv' from CivilComments, and I am able to run BERT but not DistilBERT.
#prerequisites:
!pip install ktrain
import ktrain
from ktrain import text as txt
DATA_PATH = '/content/train.csv'
NUM_WORDS = 50000
MAXLEN = 150
label_columns = ["toxic", "severe_toxic", "obscene",
"threat", "insult", "identity_hate"]
It works fine if I preprocess with 'bert', but then I cannot use the DistilBERT model. When preprocessing with 'distilbert' I get an error:
(x_test, y_test), preproc = txt.texts_from_csv(DATA_PATH, 'comment_text', label_columns=label_columns, val_filepath=None, max_features=NUM_WORDS, maxlen=MAXLEN, preprocess_mode='distilbert')
'too many values to unpack, expected 2'. If I substitute 'distilbert' with 'bert' it works fine (code below), but then I am forced to use BERT as the model. Preprocessing with 'bert' works fine:
(x_train, y_train), (x_test, y_test), preproc = txt.texts_from_csv(DATA_PATH, 'comment_text', label_columns=label_columns, val_filepath=None, max_features=NUM_WORDS, maxlen=MAXLEN, preprocess_mode='bert')
No error with this one, but then I cannot use DistilBERT, see below:
example: model = txt.text_classifier('distilbert', train_data=(x_train, y_train), preproc=preproc)
error message: if 'bert' is selected model, then preprocess_mode='bert' should be used and vice versa
I want to use (x_test, y_test), preproc = txt.texts_from_csv(DATA_PATH, 'comment_text', label_columns=label_columns, val_filepath=None, max_features=NUM_WORDS, maxlen=MAXLEN, preprocess_mode='distilbert')
with the DistilBERT model. How can I avoid the error 'too many values to unpack'?
Link on which the code is based: Arun Maiya (2019). ktrain: A Lightweight Wrapper for Keras to Help Train Neural Networks. https://towardsdatascience.com/ktrain-a-lightweight-wrapper-for-keras-to-help-train-neural-networks-82851ba889c
As shown in this example notebook, the texts_from_* functions return TransformerDataset objects (not Numpy arrays) when specifying preprocess_mode='distilbert'. So, you'll need to do something like this:
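A minimal sketch of the corrected unpacking, reusing DATA_PATH, NUM_WORDS, MAXLEN, and label_columns defined above; the batch size and learning rate below are illustrative assumptions, not values from the question:

import ktrain
from ktrain import text as txt

# With preprocess_mode='distilbert', texts_from_csv returns three values:
# a training TransformerDataset, a validation TransformerDataset, and the
# preprocessor. Unpack them directly instead of into (x, y) tuples.
trn, val, preproc = txt.texts_from_csv(DATA_PATH, 'comment_text',
                                       label_columns=label_columns,
                                       val_filepath=None,
                                       max_features=NUM_WORDS,
                                       maxlen=MAXLEN,
                                       preprocess_mode='distilbert')

# Pass the TransformerDataset itself (not a tuple of arrays) as train_data.
model = txt.text_classifier('distilbert', train_data=trn, preproc=preproc)

# Wrap in a Learner and train; batch_size=6 and lr=3e-5 are just example settings.
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(3e-5, 1)

The key change is that with 'distilbert' the train and validation sets stay wrapped as TransformerDataset objects, so they are passed around whole rather than being split into x and y arrays as in the 'bert' workflow.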