I am using PHP PdfParser and pdftotext to extract Hindi/Devanagari text from a PDF, but both tools give me the same kind of garbage output.
For example:
f{kfrt114; rhanz feJ dk tUe lu~ 1977 esa v;ksè;k (mÙkj izns"k) esa gqvkA mUgksaus y[kumQ fo"ofo|ky;] y[kumQ ls ¯gnh esa ,e-,- fd;kA os vktdy Lora=k ys[ku osQ lkFk v¼Zokf"kZd lfgr if=kdk dk laiknu dj jgs gSaA lu~ 1999 eas lkfgR; vkSj dykvksa osQ lao¼Zu vkSj vuq"khyu osQ fy, ,d lkaLÑfrd U;kl ^foeyk nsoh iQkmaMs"ku* dk lapkyu Hkh dj jgs gSaA ;rhanz feJ osQ rhu dkO;&laxzg izdkf"kr gq, gSaμ;nk&dnk] v;ksè;k rFkk vU; dfork,¡] M~;ks<+h ij vkykiA blosQ vykok "kkL=kh; xkf;dk fxfjtk nsoh osQ thou vkSj laxhr lk/uk ij ,d iqLrd fxfjtk fy[khA jhfrdky osQ vafre izfrfuf/ dfo f}tnso dh xzaFkkoyh (2000) dk lg&laiknu fd;kA oq¡Qoj ukjk;.k ij osaQfnzr nks iqLrdksa osQ vykok fLid eSosQ osQ fy, fojklr&2001
If I paste this junk into Google, it shows the correct Hindi page. Maybe the garbled words are correct, just in a different encoding.
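To illustrate what I mean by a different encoding, here is a minimal Python sketch of the kind of conversion I think is needed. It assumes the PDF uses a legacy (non-Unicode) Hindi font in which ASCII codes stand for Devanagari glyphs. The table `GLYPH_TO_UNICODE` and the function `legacy_to_unicode` are hypothetical names of my own, and the few mappings are only guessed from the sample text, so this is not a complete or verified table for any particular font.

```python
# Partial, illustrative glyph table guessed from the sample text.
# ASSUMPTION: the PDF font is a legacy ASCII-coded Hindi font; this is
# NOT a complete or verified mapping for any specific font.
GLYPH_TO_UNICODE = {
    "d": "क", "j": "र", "g": "ह", "e": "म", ";": "य",
    "k": "ा", "s": "े", "S": "ै", "a": "ं", "A": "।",
}

SHORT_I_CODE = "f"   # assumed code for the short-i matra, typed BEFORE its consonant
SHORT_I = "ि"        # in Unicode the matra is stored AFTER the consonant


def legacy_to_unicode(text: str) -> str:
    """Convert legacy glyph codes to Unicode; unknown characters pass through."""
    out = []
    pending_i = False
    for ch in text:
        if ch == SHORT_I_CODE:
            # Hold the matra until the next character has been emitted.
            pending_i = True
            continue
        out.append(GLYPH_TO_UNICODE.get(ch, ch))
        if pending_i:
            # Simplification: attach the matra to the very next character.
            # Conjunct consonants would need extra handling.
            out.append(SHORT_I)
            pending_i = False
    if pending_i:
        # Nothing followed the matra code, so keep the raw character.
        out.append(SHORT_I_CODE)
    return "".join(out)


print(legacy_to_unicode("dj jgs gSaA"))  # कर रहे हैं।
print(legacy_to_unicode("fd;kA"))        # किया।
```

A real converter would need the full table for whichever font the PDF actually embeds, plus rules for conjuncts and half-letters, which this sketch does not attempt.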
Can anybody help me extract the exact, readable text from the PDF?