How to normalize Hindi text using Python?

I am testing an Automatic Speech Recognition (ASR) model on audio files containing Hindi speech.

I am using Word Error Rate (WER) as the metric.

reference (ground truth) - वह शादीशुदा नहीं है
hypothesis (model output) - वह शादी शुदा नहीं है

I need some way to normalize the reference and hypothesis sentences so that the WER makes more sense. The above example should really have WER = 0, but because of the space in शादी शुदा, the scorer counts one substitution and one insertion against 4 reference words, so WER becomes 2/4 = 0.5.
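To make the arithmetic concrete, here is a minimal sketch of a word-level WER computation in plain Python. The function names `word_errors` and `wer` are made up for this illustration and are not taken from any particular ASR toolkit; tokens are obtained by simple whitespace splitting, which is an assumption about how the scorer works.

```python
def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hyp_words) / len(ref_words)


reference = "वह शादीशुदा नहीं है"
hypothesis = "वह शादी शुदा नहीं है"

print(wer(reference, hypothesis))  # 0.5

# The two sentences differ only in whitespace:
same_without_spaces = "".join(reference.split()) == "".join(hypothesis.split())
print(same_without_spaces)  # True
```

The last check shows that the whole error comes from the word boundary, not from any character the model got wrong.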

I am not able to find any way to do it for Hindi text.

Can somebody please help me with this? Thanks

There is 1 best solution below

I searched Google for 'Normalizing text in Hindi language using Python' and found an NLP library for Hindi text developed by IITB. You can check out the links below:

https://www.cse.iitb.ac.in/~anoopk/pages/softwares.html

https://github.com/anoopkunchukuttan/indic_nlp_library

Maybe it will help you.
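A rough sketch of how the library's normalizer could be applied to both sentences before scoring is shown here. `normalize_hindi` is a hypothetical helper name, and the fallback to the standard library's `unicodedata` when `indicnlp` is not installed is an assumption added only so the snippet runs anywhere. As far as I can tell, this kind of normalization canonicalizes characters within the script and does not merge a split compound such as शादी शुदा, so it may not bring the WER of the example to 0 on its own.

```python
import unicodedata

try:
    # Provided by the indic_nlp_library package linked above.
    from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
    _normalizer = IndicNormalizerFactory().get_normalizer("hi")
except ImportError:
    # Assumption: fall back to plain Unicode normalization if the library is missing.
    _normalizer = None


def normalize_hindi(text):
    """Hypothetical helper: canonicalize Hindi text and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    if _normalizer is not None:
        text = _normalizer.normalize(text)
    # Collapse runs of whitespace so tokenization is consistent.
    return " ".join(text.split())


reference = normalize_hindi("वह शादीशुदा नहीं है")
hypothesis = normalize_hindi("वह शादी शुदा नहीं है")
print(reference)
print(hypothesis)
```

Apply the same function to the reference and the hypothesis so that both sides of the WER comparison go through identical processing.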