RTL (Arabic) ligatures problem when extracting text from PDF

482 Views Asked by At

When extracting Arabic text from a PDF file using librairies like PyMuPDF or PDFMiner, the words are returned in backward order which is a normal behavior for RTL languages, and you need to use bidi algorithm to be able to display it correctly across UI/GUIs.

The problem is when you have ligatures chars that are composed of two chars, these ligatures chars are not reversed which makes the extracted text inaccurate.

Here's an example :

Let's say we have a font with a ligature glyph "لا" that maps to "uni0644 uni0627". The pdf is rendered like this:

enter image description here

When you extract the pdf text using the library text extraction method, you get this:

كارتــــــشلاا

Notice how all chars are in reverse order except "لا".

And here's the final result after applying bidi algorithm:

االشــــــتراك

Am I missing something? Is there any workaround to fix this without detecting false positives and breaking them, or should I write my own implementation that correctly handles ligatures decomposition in bidirectional text?

1

There are 1 best solutions below

3
dirck On

Most likely, the actual text on the PDF page isn't Unicode, but font CIDs (identifying the glyph used) and that the program converting the CIDs to Unicode doesn't take RTL into account.

An example using RTL with english (sorry), suppose the word "fire" is rendering RTL as "erif" with 3 glyphs: e, r, and fi (through arbitrary CIDs, perhaps \001\002\003). If the CIDs are used to get the Unicode information, and the "fi" ligature is de-ligatured, you'll get "erfi" as the data.

In this case, there's no way of knowing that the 'f' and 'i' characters should actually compose a ligature and be flipped around. I'm assuming that's the case for these Arabic characters.

It's unlikely that the tools you're using know anything about RTL or are going to be much help here. You'll need different tools, or to use an approach that can get you the CID's directly so you can recompose the Unicode in the correct order.