BERT fine-tuning for NER (specifically for phone numbers and credit card numbers)

!pip install simpletransformers
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs

data = pd.read_csv("/content/sample_data/test.csv", encoding="utf-8")
# Forward-fill the blank "Sentence #" cells; fillna(method="ffill") is deprecated in pandas 2.x
data = data.ffill()
data["Sentence #"] = LabelEncoder().fit_transform(data["Sentence #"])
data.rename(columns={"Sentence #": "sentence_id", "Word": "words", "Tag": "labels"}, inplace=True)

data["labels"] = data["labels"].str.upper()
X = data[["sentence_id", "words"]]
Y = data["labels"]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# building up train data and test data
train_data = pd.DataFrame({"sentence_id": x_train["sentence_id"], "words": x_train["words"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": x_test["sentence_id"], "words": x_test["words"], "labels": y_test})
label = data["labels"].unique().tolist()

args = NERArgs()
args.num_train_epochs = 3
args.learning_rate = 1e-5
args.overwrite_output_dir = True
args.train_batch_size = 32
args.eval_batch_size = 32
args.output_dir = "/content/sample_data/model-folder"

model = NERModel("bert", "bert-base-cased", labels=label, args=args)

model.train_model(train_data, eval_data=test_data, acc=accuracy_score)
# Assuming 'model' is your NERModel instance
# model.save_model("/content/sample_data/models-1")

result, model_outputs, preds_list = model.eval_model(test_data)
print("RESULT: ", result)

prediction, model_output = model.predict(["The credit number is 4111-7224-4222-1111."])
print(prediction)
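One thing worth checking in the code above: `train_test_split` is applied to individual word rows, so the words of a single sentence can end up partly in the training set and partly in the test set, and the model then sees broken sentences. A minimal sketch of splitting on whole sentences instead is shown below. The helper name `split_by_sentence` and its parameters are hypothetical, not part of simpletransformers or scikit-learn, and it assumes the rows are plain dicts with the same `sentence_id`, `words` and `labels` keys as the DataFrame.

```python
import random


def split_by_sentence(rows, test_size=0.2, seed=42):
    """Split token rows into train/test sets so that every sentence stays whole.

    `rows` is a list of dicts that each carry a "sentence_id" key
    (hypothetical helper, shown for illustration only).
    """
    # Shuffle the distinct sentence ids, not the individual word rows.
    sentence_ids = sorted({row["sentence_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(sentence_ids)

    # Reserve at least one sentence for the test set.
    n_test = max(1, round(len(sentence_ids) * test_size))
    test_ids = set(sentence_ids[:n_test])

    train_rows = [r for r in rows if r["sentence_id"] not in test_ids]
    test_rows = [r for r in rows if r["sentence_id"] in test_ids]
    return train_rows, test_rows
```

With the DataFrame from the code above, this could be called as `train_rows, test_rows = split_by_sentence(data.to_dict("records"))`, and each result turned back into a DataFrame with `pd.DataFrame(...)` before training.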
Sample rows from test.csv:

Sentence #,Word,Tag
Sentence: 1,Your,o
,credit,o
,card,o
,2146 2216 1231 8666,CREDIT-CARD
,has,o
,been,o
,successfully,o
,charged,o
,.,o
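In the sample above the whole number `2146 2216 1231 8666` is stored as one word, but as far as I can tell `model.predict` splits each input string on spaces by default (`split_on_space=True`), so at prediction time a number written with spaces reaches the model as several separate words. A possible way to make the training data match is to give every whitespace-separated piece its own row. The sketch below does that with B-/I- prefixes; the helper name `expand_to_bio` and the choice of a BIO tagging scheme are assumptions, not something the original data uses, and the `labels` list passed to `NERModel` would have to contain the new tags.

```python
def expand_to_bio(rows):
    """Expand (sentence_id, word, label) rows to one row per whitespace token.

    Rows labelled "o"/"O" keep the tag "O". Entity rows get B-<LABEL> on the
    first token and I-<LABEL> on the following ones (hypothetical helper).
    """
    expanded = []
    for sentence_id, word, label in rows:
        for i, token in enumerate(word.split()):
            if label.upper() == "O":
                tag = "O"
            else:
                prefix = "B" if i == 0 else "I"
                tag = f"{prefix}-{label.upper()}"
            expanded.append((sentence_id, token, tag))
    return expanded


# Rows taken from the test.csv sample, with the sentence id already encoded as 0.
sample_rows = [
    (0, "Your", "o"),
    (0, "credit", "o"),
    (0, "card", "o"),
    (0, "2146 2216 1231 8666", "CREDIT-CARD"),
    (0, "has", "o"),
]
for row in expand_to_bio(sample_rows):
    print(row)
```

The card number then becomes four rows tagged `B-CREDIT-CARD`, `I-CREDIT-CARD`, `I-CREDIT-CARD`, `I-CREDIT-CARD`, which is the shape the model sees when a sentence is split on spaces.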

This is my code for fine-tuning BERT to identify credit card numbers and phone numbers. My dataset consists of sentences in which each word has a tag (o, phone, credit-card). I am not getting the results I need: the model is not accurate. It sometimes tags even a date of birth as a credit card, and it fails to identify a phone number that contains spaces. I am very new to this. Is there any way I can improve the BERT model so that it identifies these entities more accurately?
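For the date-of-birth false positives, one option that does not depend on the model is to post-filter spans predicted as credit cards with the Luhn checksum, which real card numbers satisfy. This is only a sketch: the function name `luhn_valid` and the 13 to 19 digit length window are assumptions made for illustration, and made-up numbers such as the ones in the sample data and the test sentence do not pass the checksum, so the filter only makes sense on real inputs.

```python
def luhn_valid(candidate):
    """Return True if the digits in `candidate` pass the Luhn checksum.

    Spaces, dashes and other separators are ignored (hypothetical helper).
    """
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    # Assumed length window for card numbers; shorter strings such as dates are rejected.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used test number
print(luhn_valid("12/05/1990"))           # a date has too few digits
```

A predicted credit-card span that fails this check could be relabelled as `O` before the predictions are used.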
