Text changes when I copy it from a searchable PDF file (created with the tesseract command) and paste it into Notepad


I created a searchable PDF file by running the following command on one of my images.

tesseract page.jpg test pdf --oem 1 --psm 5 -l urd

This is the image that I converted to a searchable PDF.

The image contains Urdu text, but when I copy it from the newly created PDF file and paste it into any text editor, this is what I get:

GehbFie”

Can any Tesseract OCR and encoding expert help me solve this issue? Any help will be highly appreciated. Thanks in advance.


There is 1 best solution below

Best answer

pdf is the config file name. It needs to come last in the command, after options such as --oem, --psm, and -l.

The correct format for the command is the following:

tesseract page.jpg test --oem 1 --psm 5 -l urd pdf

This is how I resolved my issue.
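To avoid getting the order wrong again, the argument ordering described above can be wrapped in a small helper. This is only a sketch: build_tesseract_cmd is a hypothetical function name, not part of tesseract, and it only prints the command line so it can be checked before it is run. It assumes the first three arguments are the image, the output base, and the config name, and that everything after them is an option.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): prints a tesseract command
# line with the config file name (e.g. pdf) placed after all options.
# Usage: build_tesseract_cmd IMAGE OUTPUTBASE CONFIG [OPTIONS...]
build_tesseract_cmd() {
  image=$1
  outbase=$2
  config=$3
  shift 3                       # the remaining arguments are the options
  printf 'tesseract %s %s' "$image" "$outbase"
  for opt in "$@"; do
    printf ' %s' "$opt"         # options keep their original order
  done
  printf ' %s\n' "$config"      # config name always goes last
}

# Dry run: only prints the command, does not invoke tesseract.
build_tesseract_cmd page.jpg test pdf --oem 1 --psm 5 -l urd
```

The printed line can be pasted into the terminal once it looks right. The sketch does no quoting, so it is only suitable for file names without spaces.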