I am working on a small data set with the following characteristics:
It contains 1,500 news articles.
All of the articles were rated by human annotators for sentiment (degree of positivity) on a 5-point scale.
The text is clean in terms of spelling errors; I used Google Sheets to check spelling before importing the data into the analysis. A few characters are still incorrectly encoded, but not many.
The average article length is greater than 512 words.
The data set is slightly imbalanced.
I regard this as a multi-class classification problem and want to fine-tune BERT on this data set. To do that, I used the ktrain package and basically followed its tutorial. Below is my code:
import ktrain
from ktrain import text

# preprocess raw texts and labels for BERT (tokenization, padding/truncation to maxlen)
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_test, y_test=y_test,
    class_names=categories,
    preprocess_mode='bert',
    maxlen=510,
    max_features=35000)

# build a BERT classifier and wrap it in a ktrain Learner
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)

# train for 4 epochs with the 1cycle policy and a max learning rate of 2e-5
learner.fit_onecycle(2e-5, 4)
However, I only get a validation accuracy of around 25%, which is way too low.
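The classification report below was produced on the held-out test set, roughly with ktrain's validate helper; the exact call is a sketch, and using preproc.get_classes() for the class names is an assumption about how I stored the labels:

# print per-class precision/recall/F1 and a confusion matrix for the held-out set
learner.validate(val_data=(x_test, y_test), class_names=preproc.get_classes())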
              precision    recall  f1-score   support

           1       0.33      0.40      0.36        75
           2       0.27      0.36      0.31        84
           3       0.23      0.24      0.23        58
           4       0.18      0.09      0.12        54
           5       0.33      0.04      0.07        24

    accuracy                           0.27       295
   macro avg       0.27      0.23      0.22       295
weighted avg       0.26      0.27      0.25       295
I also tried the head+tail truncation strategy, since some of the articles are pretty long, but the performance remained the same.
Can anyone give me some suggestions?
Thank you very much!
Best
Xu
================== Update 7.21 ==================
Following Kartikey's advice, I ran the learning-rate finder (lr_find). Below is the output. It seems that 2e-5 is a reasonable learning rate.
simulating training for different learning rates... this may take a few moments...
Train on 1182 samples
Epoch 1/2
1182/1182 [==============================] - 223s 188ms/sample - loss: 1.6878 - accuracy: 0.2487
Epoch 2/2
 432/1182 [=========>....................] - ETA: 2:12 - loss: 3.4780 - accuracy: 0.2639
done.
Visually inspect loss plot and select learning rate associated with falling loss
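(For reference, the output above comes from ktrain's learning-rate finder; roughly, assuming default arguments:)

# simulate training across a range of learning rates, then plot loss vs. learning rate
learner.lr_find()
learner.lr_plot()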
I also tried training with some class weighting to address the imbalance:
{0: 0,
1: 0.8294736842105264,
2: 0.6715909090909091,
3: 1.0844036697247708,
4: 1.1311004784688996,
5: 2.0033898305084747}
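Roughly, the weights were derived from the label frequencies and passed to training as follows (a sketch; y_labels stands for the original integer labels before preprocessing and is an assumed name):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# inverse-frequency ('balanced') weights over the original 1-5 labels
classes = np.unique(y_labels)
weights = compute_class_weight('balanced', classes=classes, y=y_labels)
class_weight = dict(zip(classes, weights))

# fit_onecycle forwards class_weight to the underlying Keras fit()
learner.fit_onecycle(2e-5, 4, class_weight=class_weight)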
Here is the result. Not much changed.
precision recall f1-score support
1 0.43 0.27 0.33 88
2 0.22 0.46 0.30 69
3 0.19 0.09 0.13 64
4 0.13 0.13 0.13 47
5 0.16 0.11 0.13 28
accuracy 0.24 296
macro avg 0.23 0.21 0.20 296
weighted avg 0.26 0.24 0.23 296
Confusion matrix (rows = true label, columns = predicted label):
array([[24, 41,  9,  8,  6],
       [13, 32,  6, 12,  6],
       [ 9, 33,  6, 14,  2],
       [ 4, 25, 10,  6,  2],
       [ 6, 14,  0,  5,  3]])
============== Update 7.22 ==============
To get some baseline results, I collapsed the 5-point classification problem into a binary one, i.e. predicting positive vs. negative. This time the accuracy increased to around 55%. Below is a detailed description of my strategy:
Training data: 956 samples (excluding articles labeled as neutral).
Truncation strategy: use the first 128 and the last 128 tokens of each article.
(x_train, y_train), (x_test, y_test), preproc_l1 = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_test, y_test=y_test,
    class_names=categories_1,
    preprocess_mode='bert',
    maxlen=256,
    max_features=35000)
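For reference, the 956 training samples were obtained by dropping the neutral class and mapping the remaining ratings to two labels; a minimal sketch of that collapse (the variable names texts, ratings, and the label names are illustrative assumptions):

# map 1-2 -> negative and 4-5 -> positive, dropping the neutral midpoint (3)
label_map = {1: 'negative', 2: 'negative', 4: 'positive', 5: 'positive'}
pairs = [(t, label_map[r]) for t, r in zip(texts, ratings) if r != 3]
x_binary = [t for t, _ in pairs]
y_binary = [y for _, y in pairs]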
Results:
precision recall f1-score support
1 0.65 0.80 0.72 151
2 0.45 0.28 0.35 89
accuracy 0.61 240
macro avg 0.55 0.54 0.53 240
weighted avg 0.58 0.61 0.58 240
Confusion matrix:
array([[121,  30],
       [ 64,  25]])
However, I think 55% is still not a satisfactory accuracy, only slightly better than random guessing.
============ Update 7.26 ============
Following Marcos Lima's suggestion, I added several steps to my procedure:
Remove all numbers, punctuation, and redundant spaces before the text is pre-processed by the ktrain package. (I had assumed ktrain would do this for me, but I am not sure.)
Use the first 384 and the last 128 tokens of each text; this is what I call the "head+tail" strategy. (A rough sketch of these two steps follows this list.)
The task is still binary classification (positive vs. negative).
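A minimal sketch of the cleaning and head+tail truncation, assuming plain whitespace tokens rather than BERT word pieces (the function and variable names, e.g. x_train_raw, are illustrative):

import re

def clean_text(s):
    # drop digits and punctuation, then collapse redundant whitespace
    s = re.sub(r'[0-9]+', ' ', s)
    s = re.sub(r'[^\w\s]', ' ', s)
    return re.sub(r'\s+', ' ', s).strip()

def head_tail(s, head=384, tail=128):
    # keep the first `head` and the last `tail` whitespace tokens of a long article
    words = s.split()
    if len(words) <= head + tail:
        return s
    return ' '.join(words[:head] + words[-tail:])

x_train = [head_tail(clean_text(t)) for t in x_train_raw]
x_test = [head_tail(clean_text(t)) for t in x_test_raw]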
The learning-curve plot (not reproduced here) remains the same as the one I posted before, and it still looks very different from the one posted by Marcos Lima.
Below are my results, which are probably the best set of results I have obtained so far.
begin training using onecycle policy with max lr of 1e-05...
Train on 1405 samples
Epoch 1/4
1405/1405 [==============================] - 186s 133ms/sample - loss: 0.7220 - accuracy: 0.5431
Epoch 2/4
1405/1405 [==============================] - 167s 119ms/sample - loss: 0.6866 - accuracy: 0.5843
Epoch 3/4
1405/1405 [==============================] - 166s 118ms/sample - loss: 0.6565 - accuracy: 0.6335
Epoch 4/4
1405/1405 [==============================] - 166s 118ms/sample - loss: 0.5321 - accuracy: 0.7587
precision recall f1-score support
1 0.77 0.69 0.73 241
2 0.46 0.56 0.50 111
accuracy 0.65 352
macro avg 0.61 0.63 0.62 352
weighted avg 0.67 0.65 0.66 352
Confusion matrix:
array([[167,  74],
       [ 49,  62]])
Note: I think the reason it is so difficult for the package to work well on my task may be that the task is a combination of topic classification and sentiment analysis. The classical classification task for news articles is to decide which category an article belongs to, for example biology, economics, or sports, and the vocabulary differs a lot across categories. The classical sentiment examples, on the other hand, are Yelp or IMDB reviews. My guess is that such texts are fairly direct in expressing their sentiment, whereas the texts in my sample (economic news) are polished and carefully edited before publication, so the sentiment may be expressed in an implicit way that BERT is not able to detect.
Try treating the problem as a text regression task, like this Yelp sentiment model, which was trained using ktrain.
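If I go the regression route, a rough sketch of how it might look in ktrain follows; the helper name text_regression_model and the assumption that ktrain treats continuous targets as a regression task are from memory, so treat the exact calls as assumptions (y_train_scores / y_test_scores stand for the raw 1-5 ratings as floats):

# preprocess with continuous targets instead of class names
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=x_train, y_train=y_train_scores,
    x_test=x_test, y_test=y_test_scores,
    preprocess_mode='bert',
    maxlen=256)

# build a BERT regression model and train it the same way as before
model = text.text_regression_model('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)
learner.fit_onecycle(2e-5, 4)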