PDF to text extraction for non-English language PDF

100 Views Asked by rohit pandit At 27 July 2025 at 20:02

I am using the Datalogics utilities (Datalogics.PDFL) to manipulate PDFs, and I am facing an issue in the following scenario: extracting text from a PDF that contains non-English text produces weird output.

Sample input file SS

For the same file I get garbled output when using the WordFinder configuration below:

    WordFinderConfig wordConfig = new WordFinderConfig();
    wordConfig.IgnoreCharGaps = false;
    wordConfig.IgnoreLineGaps = false;
    wordConfig.NoAnnots = false;
    wordConfig.NoEncodingGuess = false;

    // Std Roman treatment for custom encoding; overrides the noEncodingGuess option
    wordConfig.UnknownToStdEnc = true;

    wordConfig.DisableTaggedPDF = false;    // legacy mode WordFinder creation
    wordConfig.NoXYSort = true;
    wordConfig.PreserveSpaces = false;
    wordConfig.NoLigatureExp = false;
    wordConfig.NoHyphenDetection = false;
    wordConfig.TrustNBSpace = false;
    wordConfig.NoExtCharOffset = false;     // text extraction efficiency
    wordConfig.NoStyleInfo = false;         // text extraction efficiency

    // doc is an already opened Datalogics.PDFL Document
    WordFinder wordFinder = new WordFinder(doc, WordFinderVersion.Latest, wordConfig);


There is 1 best solution below

JosephA On 12 January 2023 at 14:06

I'd encourage you to upgrade to the most current release (e.g. via NuGet). If you still get problematic text extraction results after that, contact our (Datalogics) Support Department and provide the input document along with a runnable sample so they can reproduce the issue.
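A runnable sample of the kind requested above might look like the following sketch. It is not an official Datalogics sample: the file names are hypothetical, the default WordFinderConfig is used instead of the settings from the question, and writing the output as UTF-8 plus dumping each word's code points are assumptions about what would help narrow down where the non-English text goes wrong.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Datalogics.PDFL;

class TextExtractRepro
{
    static void Main(string[] args)
    {
        // Hypothetical paths; point inputPath at the problem document.
        string inputPath = args.Length > 0 ? args[0] : "input.pdf";
        string outputPath = "extracted.txt";

        using (Library lib = new Library())
        using (Document doc = new Document(inputPath))
        {
            // Default configuration; apply the settings from the question
            // here to reproduce the exact behavior being reported.
            WordFinderConfig wordConfig = new WordFinderConfig();
            WordFinder wordFinder =
                new WordFinder(doc, WordFinderVersion.Latest, wordConfig);

            // Write UTF-8 with a BOM so that a viewer guessing the wrong
            // encoding is ruled out as the cause of unreadable output.
            using (StreamWriter writer =
                new StreamWriter(outputPath, false, new UTF8Encoding(true)))
            {
                for (int page = 0; page < doc.NumPages; page++)
                {
                    IList<Word> words = wordFinder.GetWordList(page);
                    foreach (Word word in words)
                    {
                        string text = word.Text;

                        // Dump the UTF-16 code units next to the text so it is
                        // visible whether the extractor returned real script
                        // characters or Latin/private-use substitutes.
                        StringBuilder codes = new StringBuilder();
                        foreach (char c in text)
                        {
                            codes.Append(((int)c).ToString("X4")).Append(' ');
                        }

                        writer.WriteLine(text + "\t[" + codes.ToString().Trim() + "]");
                    }
                }
            }
        }

        Console.WriteLine("Wrote " + outputPath);
    }
}
```

If the code points in the output already fall outside the expected script range, the problem lies in the extraction step (or the PDF's font encoding) rather than in how the text file is displayed, which is useful information to include in a support request.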
