BERT multi-class sentiment analysis getting low accuracy?


I am working on a small data set that:

  • Contains 1,500 news articles.

  • Was rated by human annotators for sentiment (degree of positivity) on a 5-point scale.

  • Is clean in terms of spelling errors. I used Google Sheets to check spelling before importing the data into the analysis. A few characters are still incorrectly encoded, but not many.

  • Has an average article length greater than 512 words.

  • Is slightly imbalanced.

I treat this as a multi-class classification problem and want to fine-tune BERT on this data set. To do that, I used the ktrain package and basically followed its tutorial. Below is my code:

import ktrain
from ktrain import text

(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_test, y_test=y_test,
    class_names=categories,
    preprocess_mode='bert',
    maxlen=510,
    max_features=35000)

model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)
learner.fit_onecycle(2e-5, 4)
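
For reference, the validation numbers below come from an evaluation along these lines (a sketch; it assumes the test split above is passed as val_data and evaluated with ktrain's validate helper):

# hedged sketch: pass the held-out split so validation accuracy is reported each epoch,
# then print a per-class precision/recall/F1 report like the one below
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_test, y_test), batch_size=6)
learner.fit_onecycle(2e-5, 4)
learner.validate(val_data=(x_test, y_test), class_names=categories)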

However, I only get a validation accuracy of around 25%, which is far too low.

              precision    recall  f1-score   support

   1       0.33      0.40      0.36        75
   2       0.27      0.36      0.31        84
   3       0.23      0.24      0.23        58
   4       0.18      0.09      0.12        54
   5       0.33      0.04      0.07        24
    accuracy                           0.27       295
   macro avg       0.27      0.23      0.22       295
weighted avg       0.26      0.27      0.25       295

I also tried the head+tail truncation strategy, since some of the articles are pretty long; however, the performance remained about the same. (The idea is sketched below.)
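
Roughly, head+tail keeps the opening and the ending of each article so that both survive truncation. A minimal sketch of what I mean (the word counts are illustrative, and a whitespace split only approximates BERT's WordPiece tokens):

def head_tail(article, head=128, tail=382):
    # hedged sketch: keep the first `head` and last `tail` words (128 + 382 = 510, matching maxlen)
    words = article.split()
    if len(words) <= head + tail:
        return article
    return " ".join(words[:head] + words[-tail:])

x_train = [head_tail(t) for t in x_train]
x_test  = [head_tail(t) for t in x_test]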

Can anyone give me some suggestions?

Thank you very much!

Best

Xu

================== Update 7.21 =================

Following Kartikey's advice, I tried lr_find. Below is the result. It seems that 2e-5 is a reasonable learning rate.
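
The call itself was along these lines (a sketch; the parameters follow the suggestion below):

learner.lr_find(show_plot=True, max_epochs=2)  # simulate training across a range of learning rates
learner.lr_plot()                              # re-plot loss vs. learning rate if needed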

simulating training for different learning rates... this may take a few moments...
Train on 1182 samples
Epoch 1/2
1182/1182 [==============================] - 223s 188ms/sample - loss: 1.6878 - accuracy: 0.2487
Epoch 2/2
 432/1182 [=========>....................] - ETA: 2:12 - loss: 3.4780 - accuracy: 0.2639
done.
Visually inspect loss plot and select learning rate associated with falling loss

learning rate.jpg

I also tried running it with some class weighting:

{0: 0,
 1: 0.8294736842105264,
 2: 0.6715909090909091,
 3: 1.0844036697247708,
 4: 1.1311004784688996,
 5: 2.0033898305084747}
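
For reference, weights like these can be computed and passed roughly as follows (a sketch using sklearn's balanced heuristic; it assumes y_train holds the integer labels, and that ktrain's fit_onecycle forwards class_weight to the underlying Keras fit):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {int(c): w for c, w in zip(classes, weights)}

# assumption: fit_onecycle passes class_weight through to Keras
learner.fit_onecycle(2e-5, 4, class_weight=class_weight)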

Here is the result. Not much changed.

          precision    recall  f1-score   support

       1       0.43      0.27      0.33        88
       2       0.22      0.46      0.30        69
       3       0.19      0.09      0.13        64
       4       0.13      0.13      0.13        47
       5       0.16      0.11      0.13        28

    accuracy                           0.24       296
   macro avg       0.23      0.21      0.20       296
weighted avg       0.26      0.24      0.23       296

array([[24, 41,  9,  8,  6],
       [13, 32,  6, 12,  6],
       [ 9, 33,  6, 14,  2],
       [ 4, 25, 10,  6,  2],
       [ 6, 14,  0,  5,  3]])

============== Update 7.22 =============

To get some baseline results, I collapsed the 5-point classification problem into a binary one, i.e. just predicting positive vs. negative. This time the accuracy increased to around 55%. Below is a detailed description of my strategy:

Training data: 956 samples (excluding those classified as neutral)
Truncation strategy: use the first 128 and last 128 tokens

(x_train, y_train), (x_test, y_test), preproc_l1 = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_test, y_test=y_test,
    class_names=categories_1,
    preprocess_mode='bert',
    maxlen=256,
    max_features=35000)
Results:
              precision    recall  f1-score   support

       1       0.65      0.80      0.72       151
       2       0.45      0.28      0.35        89

    accuracy                           0.61       240
   macro avg       0.55      0.54      0.53       240
weighted avg       0.58      0.61      0.58       240

array([[121,  30],
       [ 64,  25]])

However, I think 55% is still not a satisfactory accuracy, only slightly better than a random guess.
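
For anyone reproducing this step, the collapse from the 5-point scale to binary labels is a simple mapping (a sketch; it assumes ratings 1-2 are negative, 4-5 are positive, and 3 is the neutral class that gets dropped):

# hedged sketch: `texts` and `ratings` are illustrative names for the raw data
pairs = [(t, 'neg' if r <= 2 else 'pos') for t, r in zip(texts, ratings) if r != 3]
x_bin = [t for t, _ in pairs]
y_bin = [label for _, label in pairs]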

============ Update 7.26 ============

Following Marcos Lima's suggestion, I added several steps to my procedure:

  1. Remove all numbers, punctuation, and redundant spaces before the text is pre-processed by the ktrain package. (I thought ktrain would do this for me, but I am not sure.) A sketch of this cleanup is shown after this list.

  2. Use the first 384 and the last 128 tokens of each text. This is what I call the "head+tail" strategy.

  3. The task is still binary classification (positive vs. negative).
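
The cleanup in step 1 was along these lines (a rough sketch; the exact regular expressions are illustrative, not something ktrain provides):

import re

def clean(article):
    # hedged sketch of step 1: strip digits and punctuation, collapse whitespace
    article = re.sub(r'\d+', ' ', article)          # remove numbers
    article = re.sub(r'[^\w\s]', ' ', article)      # remove punctuation
    article = re.sub(r'\s+', ' ', article).strip()  # collapse redundant spaces
    return article

x_train = [clean(t) for t in x_train]
x_test  = [clean(t) for t in x_test]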

This is the learning-rate curve. It is essentially the same as the one I posted before, and it still looks very different from the one posted by Marcos Lima:

The updated learning curve

Below are my results, which are probably the best I have obtained so far.

begin training using onecycle policy with max lr of 1e-05...
Train on 1405 samples
Epoch 1/4
1405/1405 [==============================] - 186s 133ms/sample - loss: 0.7220 - accuracy: 0.5431
Epoch 2/4
1405/1405 [==============================] - 167s 119ms/sample - loss: 0.6866 - accuracy: 0.5843
Epoch 3/4
1405/1405 [==============================] - 166s 118ms/sample - loss: 0.6565 - accuracy: 0.6335
Epoch 4/4
1405/1405 [==============================] - 166s 118ms/sample - loss: 0.5321 - accuracy: 0.7587

             precision    recall  f1-score   support

       1       0.77      0.69      0.73       241
       2       0.46      0.56      0.50       111

    accuracy                           0.65       352
   macro avg       0.61      0.63      0.62       352
weighted avg       0.67      0.65      0.66       352

array([[167,  74],
       [ 49,  62]])

Note: I think the reason it is so difficult for the package to work well on my task may be that the task is a combination of topic classification and sentiment analysis. The classical classification task for news articles is to decide which category an article belongs to, for example biology, economics, or sports, and the vocabulary differs considerably across categories. On the other hand, the classical sentiment-classification examples are Yelp or IMDB reviews. My guess is that those texts are quite direct in expressing their sentiment, whereas the texts in my sample, economic news, are polished and carefully edited before publication, so the sentiment may be expressed only implicitly, in ways that BERT cannot easily detect.


There are 3 best solutions below

================ Answer 1 ================

Try treating the problem as a text regression task like this Yelp sentiment model, which was trained using ktrain.
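
If you go this route, ktrain also has a text regression path; a rough sketch, assuming the 1-5 ratings are passed directly as numeric targets (parameter values are illustrative):

# hedged sketch: treat the 1-5 rating as a continuous target rather than a class label
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=x_train, y_train=y_train,   # y values are the raw numeric ratings
    x_test=x_test, y_test=y_test,
    preprocess_mode='bert',
    maxlen=510)                          # class_names omitted so numeric y is treated as regression (assumption)
model = text.text_regression_model('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)
learner.fit_onecycle(2e-5, 4)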

================ Answer 2 ================

Try hyperparameter optimization.

Before doing learner.fit_onecycle(2e-5, 4), try: learner.lr_find(show_plot=True, max_epochs=2)

Do all the classes have around 20% weightage? Maybe try something along these lines:

MODEL_NAME = 'bert'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)

.....
.....

# the one we got most wrong
learner.view_top_losses(n=1, preproc=t)

For the class above, increase the weightage.

Does the validation set use stratified sampling or random sampling?
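
If it is currently random, a stratified split is one line with sklearn (a sketch; `texts` and `labels` are illustrative names for the raw data):

from sklearn.model_selection import train_test_split

# keep the class proportions identical in train and test
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)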

================ Answer 3 ================

The shape of your learning curve is not what I would expect.

LR curve for a similar problem

My curve (above) shows that the LR should be around 1e-5, but yours is flat.

Try to pre-process your data:

  • Remove numbers and emojis.
  • Recheck your data for errors (usually in y_train).
  • Use a BERT model for your language (or the multilingual one) if your texts are not in English.

You said that:

The average length is greater than 512 words.

Try breaking each text into 512-token-long chunks, because you can lose a lot of information for classification when the BERT model truncates it. One way to do that is sketched below.
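
The idea: split each article into overlapping chunks short enough for BERT, predict on every chunk, and average the chunk probabilities to get an article-level prediction. A sketch (the whitespace chunker and the ktrain predictor calls are assumptions for illustration):

import numpy as np
import ktrain

def chunks(article, size=400, stride=300):
    # overlapping windows of ~400 words so each stays under BERT's 512-token limit;
    # a whitespace split is only a rough stand-in for WordPiece tokens
    words = article.split()
    starts = range(0, max(len(words) - size, 0) + 1, stride)
    return [" ".join(words[i:i + size]) for i in starts]

predictor = ktrain.get_predictor(learner.model, preproc)

def predict_long(article):
    probs = predictor.predict_proba(chunks(article))  # one probability vector per chunk
    return np.mean(probs, axis=0)                     # average over chunks -> article-level prediction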