Text classification using BERT - how to handle misspelled words

3.6k Views Asked by user1877600 At 03 April 2020 at 16:33

I am not sure if this is the best place to ask this kind of question; perhaps Cross Validated would be a better fit.

I am working on a multiclass text classification problem. I built a BERT-based model in PyTorch using the Hugging Face transformers library. The model performs pretty well, except when the input sentence contains an OCR error or, equivalently, a misspelled word.

For instance, if the input is "NALIBU DRINK", the BERT tokenizer generates ['na', '##lib', '##u', 'drink'] and the model's prediction is completely wrong. On the other hand, if I correct the first character so that the input is "MALIBU DRINK", the BERT tokenizer generates two tokens, ['malibu', 'drink'], and the model makes a correct prediction with very high confidence.
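The fragmentation can be reproduced without loading a model. The sketch below is a toy greedy longest-match-first WordPiece splitter; `TOY_VOCAB` is a made-up five-entry vocabulary chosen to mirror the example (it is not BERT's real vocabulary), and `wordpiece` and `tokenize` are illustrative helper names rather than library functions.

```python
# Toy greedy longest-match-first WordPiece splitter (illustrative only).
# TOY_VOCAB is a hypothetical five-entry vocabulary, not BERT's real one.
TOY_VOCAB = {"na", "##lib", "##u", "malibu", "drink"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercased word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and WordPiece-split each word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


print(tokenize("NALIBU DRINK"))  # ['na', '##lib', '##u', 'drink']
print(tokenize("MALIBU DRINK"))  # ['malibu', 'drink']
```

Because "nalibu" is not a vocabulary entry, the splitter falls back to shorter pieces, so the classifier never sees the 'malibu' token it learned from.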

Is there any way to enhance the BERT tokenizer so that it can handle misspelled words?

NRJ_Varshney On 06 April 2020 at 23:40

You can use BERT itself to correct the misspelled word. This article explains the process with code snippets: https://web.archive.org/web/20220507023114/https://www.statestitle.com/resource/using-nlp-bert-to-improve-ocr-accuracy/

To summarize, you can identify misspelled words via a SpellChecker function and get replacement suggestions. Then, find the most appropriate replacement using BERT.
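A minimal sketch of that two-step idea follows; it is not the article's code. It assumes a small hypothetical word list (`KNOWN_WORDS`) and uses the standard-library `difflib` in place of a dedicated spell checker. The helper names `suggest`, `correct`, `similarity_score` and `bert_score_fn` are illustrative. The BERT-backed scorer assumes the `transformers` package and the `bert-base-uncased` checkpoint are available, and relies on the `targets` argument of the `fill-mask` pipeline.

```python
import difflib

# Hypothetical domain vocabulary; in practice this might come from a
# dictionary or from the words seen in the training data.
KNOWN_WORDS = ["malibu", "drink", "vodka", "whisky", "juice"]


def suggest(word, known_words=KNOWN_WORDS, n=3, cutoff=0.7):
    """Return known words that are close to `word` by difflib similarity."""
    return difflib.get_close_matches(word.lower(), known_words, n=n, cutoff=cutoff)


def correct(text, score_fn, known_words=KNOWN_WORDS):
    """Replace each unknown word with its best-scoring suggestion, if any."""
    words = text.lower().split()
    for i, word in enumerate(words):
        if word in known_words:
            continue
        candidates = suggest(word, known_words)
        if candidates:
            words[i] = max(candidates, key=lambda c: score_fn(words, i, c))
    return " ".join(words)


def similarity_score(words, i, candidate):
    """Context-free fallback scorer: plain string similarity."""
    return difflib.SequenceMatcher(None, words[i], candidate).ratio()


def bert_score_fn(model_name="bert-base-uncased"):
    """Build a context-aware scorer from a masked language model.

    Assumes `transformers` is installed. Candidates that are not single
    vocabulary tokens are scored by the pipeline on their first sub-token.
    """
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)

    def score(words, i, candidate):
        mask = unmasker.tokenizer.mask_token
        masked = " ".join(words[:i] + [mask] + words[i + 1:])
        return unmasker(masked, targets=[candidate])[0]["score"]

    return score


print(correct("NALIBU DRINK", similarity_score))  # malibu drink
```

Passing `bert_score_fn()` instead of `similarity_score` ranks the same candidates by how well each fits the surrounding words, which matters when several suggestions are equally close in spelling. The corrected text can then be fed to the classifier unchanged.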
