Large corpus of Hindi text in Roman script


Where can I find such a corpus? I need it to build a token-level (word-level) language detector that distinguishes Hindi from English.

For instance, something like the Hindi Wikipedia in the Roman alphabet would be quite useful. Short stories, social media posts, tweets, or blogs would also work. Any ideas?

Existing transliteration engines are not very good, as far as I can tell. If there is a good one, I would consider using that too.


There is 1 answer below


Google Translate shows a transliterated (Romanized) version of the input when you select the 'Text' option at https://translate.google.co.in/.

But there's a catch: it has a character limit of 5k per input. Surprisingly, Google does not offer this feature in its other products that translate text (Google Docs, Gmail, etc.). Please let me know if you find a more feasible and robust solution to your problem.
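To work within the 5k limit, one option is to split the Devanagari text into pieces before pasting them in one at a time. Below is a minimal Python sketch, not a tested pipeline. The function name `chunk_text` is made up for illustration, and it assumes the limit is counted the same way as Python's `len()` (Unicode code points), which I have not verified against the site.

```python
def chunk_text(text: str, limit: int = 5000) -> list[str]:
    """Split text into pieces of at most `limit` characters.

    Breaks on whitespace so that words stay intact. Note that runs of
    whitespace (including newlines) are collapsed to single spaces.
    """
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit has to be hard-split.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this word would exceed the limit: close the chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "नमस्ते दुनिया " * 1000  # placeholder text, not a real corpus
    pieces = chunk_text(sample)
    print(len(pieces), max(len(p) for p in pieces))
```

Each piece can then be pasted in separately and the Romanized outputs joined with spaces. This only gets around the length limit; it does nothing for the quality of the transliteration itself.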