I am testing an Automatic Speech Recognition (ASR) model on audio files containing Hindi speech.
I am using Word Error Rate (WER) as the metric.
reference (ground truth) - वह शादीशुदा नहीं है
hypothesis (model output) - वह शादी शुदा नहीं है
I need a way to normalize the reference and hypothesis sentences so that the WER makes more sense. The example above should have WER = 0, but because of the space in शादी शुदा, the WER comes out as 2/4 = 0.5 (one substitution plus one insertion, over four reference words).
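To make the arithmetic concrete, here is a minimal sketch of WER as word-level edit distance divided by the number of reference words. The helper names `word_edit_distance` and `wer` are made up for this illustration and are not from any particular library; whitespace tokenization is an assumption.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word tokens (substitution, insertion, deletion)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "वह शादीशुदा नहीं है"
hypothesis = "वह शादी शुदा नहीं है"
print(wer(reference, hypothesis))  # 0.5
```

The split compound costs two edits (शादीशुदा is substituted by शादी, then शुदा is inserted), which is where 2/4 comes from.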
I have not been able to find a way to do this for Hindi text.
Can somebody please help me with this? Thanks
I searched for 'Normalizing text in Hindi language using Python' on Google and found an NLP library for Hindi text developed at IIT Bombay. You can check out the links below:
https://www.cse.iitb.ac.in/~anoopk/pages/softwares.html
https://github.com/anoopkunchukuttan/indic_nlp_library
Maybe it will help you.
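As far as I can tell, that library's normalizer works at the script level (canonical character forms), so it may not fix the space in शादी शुदा by itself. One possible workaround, sketched below with the standard library only, is to glue adjacent hypothesis words back together whenever their concatenation is a word in the reference. The function names `normalize_text` and `merge_split_words` and the `max_parts` limit are made up for this illustration; treat it as a rough idea, not a standard method.

```python
import unicodedata


def normalize_text(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def merge_split_words(hypothesis, reference, max_parts=3):
    """Join adjacent hypothesis words whose concatenation is a reference word."""
    ref_vocab = set(normalize_text(reference).split())
    hyp_words = normalize_text(hypothesis).split()
    merged = []
    i = 0
    while i < len(hyp_words):
        # Try the longest run of words first, down to a pair.
        for n in range(min(max_parts, len(hyp_words) - i), 1, -1):
            candidate = "".join(hyp_words[i:i + n])
            if candidate in ref_vocab:
                merged.append(candidate)
                i += n
                break
        else:
            # No run starting here forms a reference word; keep the word as is.
            merged.append(hyp_words[i])
            i += 1
    return " ".join(merged)


reference = "वह शादीशुदा नहीं है"
hypothesis = "वह शादी शुदा नहीं है"
print(merge_split_words(hypothesis, reference))
print(merge_split_words(hypothesis, reference) == normalize_text(reference))  # True
```

After this step the two sentences are identical, so the WER for the example is 0. Two caveats: it only handles the case where the hypothesis splits a word (you would have to apply it in the other direction for a split reference), and because it looks at the reference to decide what to merge, it can make scores slightly optimistic. Reporting character error rate alongside WER is another way to make spacing differences matter less.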