FineReader Engine Java SDK. How to ignore pictures during conversion from PDF to DOCX

I need a way to ignore pictures and photos in a PDF document when converting it to a DOCX file.

I am creating an instance of FineReader Engine:

IEngine engine = Engine.InitializeEngine(
    engineConfig.getDllFolder(), engineConfig.getCustomerProjectId(),
    engineConfig.getLicensePath(), engineConfig.getLicensePassword(), "", "", false);

After that, I am converting a document:

IFRDocument document = engine.CreateFRDocument();
document.AddImageFile(file.getAbsolutePath(), null, null);
document.Process(null);
String exportPath = FileUtil.prepareExportPath(file, resultFolder);
document.Export(exportPath, FileExportFormatEnum.FEF_DOCX, null);

As a result, all images from the original PDF document end up in the DOCX file.

There are 3 best solutions below

0
BEST ANSWER

I'm not really familiar with PDF to DOCX conversion, but I think you could try a custom profile tailored to your needs.

At some point in your code you create an Engine object and then a Document object (or an IFRDocument object, depending on your application). Add this line just before you hand the document to the engine for processing:

engine.LoadProfile(PROFILE_FILENAME);

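Then create the profile file with the processing parameters you need; they are described in the "Working with Profiles" section of the documentation packaged with your FRE installation. Make sure the file includes the following (the text after each --> is an explanatory note, not part of the file):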

... some params under other sections

[PageAnalysisParams]
DetectText = TRUE       --> force text detection
DetectPictures = FALSE  --> ignore pictures
... other params under PageAnalysisParams

... some params under other sections

It works the same way for barcodes, etc. Benchmark your results whenever you add or remove settings in this file, because they can change both the processing speed and the overall quality of the output.
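If you would rather generate the profile from code than maintain it by hand, something like the following might do. It is only a sketch: it writes the two keys mentioned in this answer and nothing else, the file name ignore-pictures.ini and the class name ProfileWriter are made up for illustration, and I am assuming a plain-text file with no extra sections is enough for your setup. Check the "Working with Profiles" documentation for the exact format your FRE version expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ProfileWriter {

    /** Lines of a minimal profile: only the two keys named in the answer. */
    static List<String> profileLines() {
        return List.of(
            "[PageAnalysisParams]",
            "DetectText = TRUE",
            "DetectPictures = FALSE");
    }

    /** Writes the profile to the given path (overwriting it) and returns the path. */
    static Path writeProfile(Path target) throws IOException {
        // Content is ASCII-only; the encoding FRE expects is an assumption here.
        return Files.write(target, profileLines(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name, chosen for this example only.
        Path profile = writeProfile(Path.of("ignore-pictures.ini"));
        System.out.println("Profile written to " + profile.toAbsolutePath());
        // With the SDK available, the call from the answer would follow:
        // engine.LoadProfile(profile.toAbsolutePath().toString());
    }
}
```

The resulting path is what you would pass as PROFILE_FILENAME to engine.LoadProfile(), before calling document.Process().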

0
On

When you export a PDF to DOCX, you should pass export parameters. For this you can use IRTFExportParams, which you create like this:

IRTFExportParams irtfExportParams = engine.CreateRTFExportParams();

Then turn off its WritePictures property:

irtfExportParams.setWritePictures(false);

Here, engine is the main IEngine interface; I assume you already know how to initialize it ;)

You also have to pass parameters to document.Process() (document is your IFRDocument). Process() takes an IDocumentProcessingParams object. Call its setPageProcessingParams() method with an IPageProcessingParams object, which you can create with engine.CreatePageProcessingParams(). On that object, call:

iPageProcessingParams.setPerformAnalysis(true);
iPageProcessingParams.setPageAnalysisParams(iPageAnalysisParams);

The first call enables page analysis. The second takes an IPageAnalysisParams object, which you create with IPageAnalysisParams iPageAnalysisParams = engine.CreatePageAnalysisParams().

The last step is to call iPageAnalysisParams.setDetectPictures(false) so that pictures are not detected. To be safe, do this before passing the object to setPageAnalysisParams(), in case the setter copies it. That's all :)

When you export the document, pass the export parameters like this:

// document is the IFRDocument that has already been loaded and processed
document.Export(filePath, FileExportFormatEnum.FEF_DOCX, irtfExportParams);

I hope this answer helps everyone :)

1
On

What do the input PDF pages contain, and what do you expect to see in MS Word? It would help if you attached an example input PDF and an example of the desired result in MS Word format. Giving a useful recommendation would then be much easier.