How to search a Cyrillic PDF using PDF Clown


I'm trying to programmatically search a Russian-language PDF file for a string using PDF Clown, like this:

var FilePath = @"‪C:\Users\Yvoloshin\source\repos\SearchPdf\Газета «Красная Звезда» №001 от 01 января 1942 года.pdf";
org.pdfclown.files.File file = new org.pdfclown.files.File(FilePath);

// Define the text pattern to look for
var pattern = new Regex("К новым", RegexOptions.IgnoreCase);

// Instantiate the extractor
TextExtractor textExtractor = new TextExtractor(true, true);

foreach (var page in file.Document.Pages)
{
    // Extract the page text
    var textStrings = textExtractor.Extract(page);

    // Find the text pattern matches
    var matches = pattern.Matches(TextExtractor.ToString(textStrings));
    Console.WriteLine(matches);
    Console.ReadLine();
}

When I run this, I get this error:

Unhandled Exception: System.NotSupportedException: The given path's format is not supported.
   at System.Security.Permissions.FileIOPermission.EmulateFileIOPermissionChecks(String fullPath)
   at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath, Boolean checkHost)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access)
   at org.pdfclown.files.File..ctor(String path)

Is this a problem with PDF Clown not being set up for Cyrillic fonts, or is the problem elsewhere? I'm using Visual Studio 2017 and .NET Framework 4.8.
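
One thing worth checking, based only on the stack trace: the exception is thrown inside System.IO.FileStream while org.pdfclown.files.File is opening the path, which happens before any PDF content or font is read. That points at the path string rather than at Cyrillic handling. A common cause of this message on .NET Framework is an invisible Unicode formatting character (for example U+202A, LEFT-TO-RIGHT EMBEDDING) at the start of a path literal, which can come along when a path is copied from a Windows properties dialog. The sketch below is a diagnostic under that assumption, not a confirmed fix; PathCheck and StripFormatChars are hypothetical names, and the sample path is made up for illustration.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class PathCheck
{
    // Hypothetical helper: drops Unicode "Format" (Cf) characters such as
    // U+202A, which are invisible in most editors.
    public static string StripFormatChars(string path)
    {
        return new string(path
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format)
            .ToArray());
    }

    static void Main()
    {
        // Made-up path with an invisible U+202A in front of the drive letter.
        string suspect = "\u202AC:\\Users\\example\\file.pdf";

        // Dump the first few code units so hidden characters become visible.
        foreach (char c in suspect.Take(3))
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }

        string cleaned = StripFormatChars(suspect);
        Console.WriteLine(cleaned.Length == suspect.Length
            ? "No format characters found."
            : "Removed hidden format characters.");
    }
}
```

If the dump of the real FilePath shows anything other than the drive letter as its first character, retyping the path by hand (instead of pasting it) or passing it through a helper like the one above before constructing org.pdfclown.files.File would be the next thing to try.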
