I am trying to extract the text of a PDF via iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage, which fails because the PDF file is malformed around an inline image.
I found that I can work around the problem in two ways: (A) open the PDF in Adobe Acrobat and save it as an optimized PDF, after which the parsing works; or (B) open it in Adobe Acrobat and print it to a new PDF via the Adobe PDF printer.
Now I have 14,000 of these files and want to automate (A) or (B), but so far I have not succeeded.
For (A) I referenced the Adobe library and do, in short, something like this:
mApp = new AcroAppClass();
avDoc = new AcroAVDocClass();
avDoc.Open(strFilePath, "");
pdDoc = (CAcroPDDoc)avDoc.GetPDDoc();
// 1 = PDSaveFull
pdDoc.Save(1, strFilePath.Substring(0, strFilePath.Length - 4) + "_changed.pdf");
But the Adobe SDK does not allow me to save in a different format.
For (B) I tried something like this:
Process pdfProcess = new Process();
pdfProcess.StartInfo.FileName = @"C:\Program Files (x86)\Adobe\Acrobat 11.0\Acrobat\AcroRd32.exe";
pdfProcess.StartInfo.Arguments = string.Format(@"/t", strFilePathSource, "Adobe PDF", "Adobe PDF", strFilePathTarget);
pdfProcess.Start();
This does not throw any error, but no file is produced either.
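One thing that may be worth checking in (B), offered as a sketch rather than a confirmed fix: the format string @"/t" contains no {0}..{3} placeholders, so string.Format ignores the four extra arguments and the process is started with just "/t". The hypothetical helper below (AcrobatPrintArgs.Build is not part of any Adobe API) shows how the arguments could be assembled, assuming the commonly documented /t syntax of file name, printer name, driver name, port name. Note that the fourth parameter is a port name, so passing a target file path there is an assumption and may still not produce a file.

```csharp
using System;

// Hypothetical helper, not part of the Adobe SDK: builds the argument
// string for the "/t" print switch with each value quoted.
public static class AcrobatPrintArgs
{
    public static string Build(string sourcePath, string printerName,
                               string driverName, string portName)
    {
        // Each placeholder is wrapped in quotes so that paths and printer
        // names containing spaces are passed as single arguments.
        return string.Format("/t \"{0}\" \"{1}\" \"{2}\" \"{3}\"",
                             sourcePath, printerName, driverName, portName);
    }
}
```

With this helper, the assignment in (B) would read pdfProcess.StartInfo.Arguments = AcrobatPrintArgs.Build(strFilePathSource, "Adobe PDF", "Adobe PDF", strFilePathTarget);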