Feature engineering on BERT


I'm trying to develop a Tweet classifier using the BERT model (bert-base-uncased, BertForSequenceClassification). During the preprocessing of the dataset, my teacher told me that it would be better if I extracted characteristic features such as the length of the tweet, the number of emojis and profanity words, etc.

So, I gathered some useful features into a data frame like:

profanity_word  positive_emoji  negative_emoji  tweet_length
2               1               0               123
0               0               1               52
1               0               1               87

However, I couldn't find anything in the documentation about how to feed these values into the model for fine-tuning. Is there any way to achieve this, or am I missing something here?

Answer from ewz93:

BERT (Bidirectional Encoder Representations from Transformers) is a model that, as the name suggests, works as well as it does because it learns contextualized representations of input sequences in an unsupervised fashion. When you fine-tune this model for a specific task such as text classification on labeled training data (supervised learning), the representations the model outputs are used as input to one or a few simple layers, which learn how these model-internal semantic representations map to the class you want to predict. In other words, BERT does its own representation and feature learning.
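
For concreteness, a minimal sketch of this standard fine-tuning setup with the Hugging Face transformers library might look like the following (the example tweets, labels, and num_labels=2 are placeholder assumptions, not taken from your data):

```python
# Minimal sketch of standard BertForSequenceClassification fine-tuning.
# The tweets, labels and num_labels=2 below are placeholder assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

tweets = ["example tweet one", "example tweet two"]
labels = torch.tensor([1, 0])

# Only raw text goes in: BERT learns its own features from the token ids.
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

# Passing labels makes the model return the classification loss directly.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # a real training loop would follow with an optimizer step
```

Note that there is no obvious slot in this interface for extra per-tweet feature columns; the model only consumes tokenized text.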

What your teacher describes is more of a traditional machine learning approach, where the features of the input sequences are hand-crafted, so the representations are not learned by the model itself but hard-coded.

These two approaches are usually not applied together, and I personally doubt that the performance of a powerful pretrained language model such as BERT would improve much by adding handcrafted features (although you would have to try it to really find out). What usually leads to good BERT performance is a large-ish amount of high-quality (unambiguous) training data, larger variants of the standard BERT, or more modern and powerful variants (e.g. ALBERT or DeBERTaV3). So in general that is where I would recommend focusing your efforts.

If you insist on adding the handcrafted features to the decision-making in a simple way, your best shot is probably taking the BERT outputs and combining them with your simple features. I do not know what the target class(es) you want to predict look like (binary, categorical, numerical?), but you could fine-tune BERT to produce that prediction, convert it into an integer or float if it isn't one already, and then concatenate that with the simple features into a new feature vector you can use in a simple classifier (e.g. a small fully connected neural network in scikit-learn). Alternatively, instead of fine-tuning BERT on the task at all, you could just encode the inputs with BERT, which produces a (usually 768-dimensional) representation vector, and combine that with your simple features, as in the sketch below.
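
A rough sketch of that second option, assuming binary labels and the four feature columns from your data frame (the example tweets, feature values, and classifier settings are purely illustrative):

```python
# Sketch: use BERT as a frozen encoder and concatenate its [CLS] representation
# with the handcrafted features before a simple scikit-learn classifier.
# Tweets, feature values and classifier settings are illustrative assumptions.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.neural_network import MLPClassifier

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def encode(texts):
    # One 768-dimensional vector per text: the [CLS] token's last hidden state.
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return bert(**batch).last_hidden_state[:, 0, :].numpy()

tweets = ["example tweet one", "example tweet two"]
# Columns: profanity_word, positive_emoji, negative_emoji, tweet_length
handcrafted = np.array([[2, 1, 0, 123],
                        [0, 0, 1, 52]], dtype=float)
labels = [1, 0]

# Combined feature vector: 768 BERT dimensions + 4 handcrafted columns.
X = np.concatenate([encode(tweets), handcrafted], axis=1)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, labels)
print(clf.predict(X))
```

With this setup the handcrafted columns sit alongside the BERT representation in one feature vector, and the downstream classifier decides how much weight each part gets.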