I've worked with popular Python packages for PDF files, such as PDFMiner, PyMuPDF, and PyPDF2, but none of them extracts text correctly from PDF files written in right-to-left languages (Persian, Arabic).
For example:
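import fitz  # PyMuPDF

doc = fitz.open("*/path/to/file.pdf")
# Extract the text of the first page (this method was named getPageText() in older PyMuPDF releases)
txt = doc.get_page_text(0)
print(txt)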
It returns something like this:
...
اﯾﻨﺘﺮﻧﺖ و ﮐﺎﻣﭙﯿﻮﺗﺮ ﺑﻪ ﻣﺴﻠﻂ
ﻣﺴﻠﻂ ﻫﺎیزﺑﺎن
...
Sometimes the characters of a word come out reversed (the first character comes last) and the words in a sentence are swapped; sometimes the words are extracted correctly. Either way, the extractor does not know how to handle the zero-width non-joiner (نیم‌فاصله), which is commonly used in Persian.
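Part of the garbling in the output above seems to come from the letters being returned as Arabic presentation forms (positional glyph code points) rather than as base letters. The snippet below is a minimal sketch of one way to map them back, assuming Unicode NFKC normalization is acceptable for the text; the function name `normalize_presentation_forms` is made up for illustration and is not part of any of the packages mentioned.

```python
import unicodedata


def normalize_presentation_forms(text: str) -> str:
    """Map Arabic presentation forms to their base letters (hypothetical helper)."""
    # NFKC applies compatibility decompositions, so a positional glyph such as
    # U+FEE3 (ARABIC LETTER MEEM INITIAL FORM) becomes the plain letter U+0645.
    # Caveat: NFKC also rewrites other compatibility characters
    # (ligatures, full-width forms), which may not always be wanted.
    return unicodedata.normalize("NFKC", text)


# Four presentation-form code points (meem initial, seen medial, lam medial, tah final).
raw = "\uFEE3\uFEB4\uFEE0\uFEC2"
clean = normalize_presentation_forms(raw)
print([f"U+{ord(c):04X}" for c in clean])
```

This only repairs the code points. It does not fix reversed character order, and a zero-width non-joiner that the extractor dropped cannot be recovered this way (NFKC leaves an existing U+200C in place but cannot reinsert a missing one).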
I have tried a lot but got nowhere. Thanks in advance for your help.
I had this problem, and I wrote the following code:
But this package has two problems: 1) it reverses the words (e.g. "سلام" -> "مالس"), which I solved in this code; 2) it has problems with multilingual documents, such as ones mixing Farsi and English.
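As a rough illustration of the two issues listed above (this is not the package's code or the code referred to in this answer), here is a sketch that reverses only the words made of right-to-left characters and leaves Latin words and numbers alone. The helper names `is_rtl_word` and `unreverse_rtl_words` are hypothetical, and the sketch assumes the input really is in reversed (visual) order; applying it to text that was extracted correctly would break it.

```python
import unicodedata


def is_rtl_word(word: str) -> bool:
    """True if the word has strong directional characters and all are right-to-left."""
    classes = [unicodedata.bidirectional(ch) for ch in word]
    # Keep only strong classes: L (left-to-right), R and AL (right-to-left).
    strong = [c for c in classes if c in ("L", "R", "AL")]
    return bool(strong) and all(c in ("R", "AL") for c in strong)


def unreverse_rtl_words(line: str) -> str:
    """Reverse the characters of each RTL word; keep other words unchanged."""
    return " ".join(w[::-1] if is_rtl_word(w) else w for w in line.split(" "))


# U+0645 U+0627 U+0644 U+0633 is the reversed spelling from the example above.
fixed = unreverse_rtl_words("\u0645\u0627\u0644\u0633 PDF 2")
print(fixed)
```

The sketch works word by word, so it does not restore swapped word order within a sentence, and punctuation or diacritics attached to a word are reversed along with it. A full solution would need proper bidirectional text handling.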