I am using the Datalogics library (Datalogics.PDFL) to manipulate PDFs and am having trouble with the following scenario: extracting text from a PDF that contains non-English text produces unexpected output.
Sample input file (screenshot):
This is the output I get for the same file, using the WordFinder configuration below:
WordFinderConfig wordConfig = new WordFinderConfig();
wordConfig.IgnoreCharGaps = false;
wordConfig.IgnoreLineGaps = false;
wordConfig.NoAnnots = false;
wordConfig.NoEncodingGuess = false;
// Std Roman treatment for custom encoding; overrides the noEncodingGuess option
wordConfig.UnknownToStdEnc = true;
wordConfig.DisableTaggedPDF = false; // legacy mode WordFinder creation
wordConfig.NoXYSort = true;
wordConfig.PreserveSpaces = false;
wordConfig.NoLigatureExp = false;
wordConfig.NoHyphenDetection = false;
wordConfig.TrustNBSpace = false;
wordConfig.NoExtCharOffset = false; // text extraction efficiency
wordConfig.NoStyleInfo = false; // text extraction efficiency
WordFinder wordFinder = new WordFinder(doc, WordFinderVersion.Latest, wordConfig);
I'd encourage you to upgrade to the most current release (e.g. via NuGet). If you still get problematic text extraction results after that, contact our (Datalogics) Support Department and provide them with the input document and a runnable sample so they can reproduce the issue.