PDF-to-text extraction for non-English PDFs

I am using the Datalogics library (Datalogics.PDFL) to manipulate PDFs, and I am running into a problem in the following scenario: when I extract text from a PDF that contains non-English text, the output is garbled.

Sample input file (screenshot):

[image: sample input file]

For the same file, I get output like this:

[image: extracted output]

    WordFinderConfig wordConfig = new WordFinderConfig();
    wordConfig.IgnoreCharGaps = false;
    wordConfig.IgnoreLineGaps = false;
    wordConfig.NoAnnots = false;
    wordConfig.NoEncodingGuess = false;

    // Std Roman treatment for custom encoding; overrides the noEncodingGuess option
    wordConfig.UnknownToStdEnc = true;

    wordConfig.DisableTaggedPDF = false;    // legacy mode WordFinder creation
    wordConfig.NoXYSort = true;
    wordConfig.PreserveSpaces = false;
    wordConfig.NoLigatureExp = false;
    wordConfig.NoHyphenDetection = false;
    wordConfig.TrustNBSpace = false;
    wordConfig.NoExtCharOffset = false;     // text extraction efficiency
    wordConfig.NoStyleInfo = false;         // text extraction efficiency

    WordFinder wordFinder = new WordFinder(doc, WordFinderVersion.Latest, wordConfig);
There is 1 solution below

I'd encourage you to upgrade to the most current release (e.g. via NuGet). If you still see problematic text extraction results after that, please contact our (Datalogics) Support Department and provide the input document along with a runnable sample so they can reproduce the issue.
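
For anyone preparing such a runnable sample, a minimal sketch might look like the one below. It is not an official Datalogics sample. The file names are hypothetical, and it assumes the `GetWordList` and `Word.Text` members behave as in the library's text extraction samples. It writes each word as UTF-8 together with its UTF-16 code units, which can help show whether the garbling comes from the extraction itself or only from how the output is displayed. Setting `UnknownToStdEnc` to `false` is an assumption worth testing, not a confirmed fix: the comment in the question says that option applies Std Roman treatment to custom encodings, which may not suit non-English text.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using Datalogics.PDFL;

class ExtractRepro
{
    static void Main(string[] args)
    {
        // Hypothetical paths; pass the problem document as the first argument.
        string input = args.Length > 0 ? args[0] : "input.pdf";
        string output = args.Length > 1 ? args[1] : "extracted.txt";

        using (Library lib = new Library())
        using (Document doc = new Document(input))
        {
            WordFinderConfig wordConfig = new WordFinderConfig();

            // Assumption to test: turn off the Std Roman fallback used in the question.
            wordConfig.UnknownToStdEnc = false;

            WordFinder wordFinder = new WordFinder(doc, WordFinderVersion.Latest, wordConfig);

            // Write UTF-8 explicitly so the output file cannot mangle non-ASCII text.
            using (StreamWriter writer = new StreamWriter(output, false, new UTF8Encoding(false)))
            {
                for (int page = 0; page < doc.NumPages; page++)
                {
                    IList<Word> words = wordFinder.GetWordList(page);
                    foreach (Word word in words)
                    {
                        string text = word.Text;

                        // Dump the UTF-16 code units next to the text for inspection.
                        StringBuilder units = new StringBuilder();
                        foreach (char c in text)
                        {
                            units.AppendFormat("U+{0:X4} ", (int)c);
                        }

                        writer.WriteLine("{0}\t{1}", text, units.ToString().TrimEnd());
                    }
                }
            }
        }
    }
}
```

If the code units in the output file already look wrong (for example, Latin-range values where the PDF shows another script), the problem is in extraction or in the PDF's font encoding, and that file plus the input PDF is what Support would need. If the code units look right but the text displays incorrectly, the viewer or console encoding is the more likely culprit.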