Extract text from PDF files (printed)

I'm using RedMon (Redirection Port Monitor), the HP Universal Driver PS, and Ghostscript to intercept document printing.

However, text extraction fails in the following scenario:

PDF file -> HP Universal Driver PS -> RedMon -> PostScript file** -> Ghostscript creates printed.pdf*

* Text cannot be extracted from this PDF file: gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=output.txt printed.pdf

** The PostScript file is created in compacted form, and text cannot be extracted from it either.

Question:

Can I create a PostScript file without compaction when a PDF is sent to the printer?

Observation: printed.pdf -> image (TIFF) -> Tesseract (OCR) -> text file works, but it is slow.

There is 1 best solution below

As Dweeberly says in the comments, if you want to extract text from a PDF file, do not start by printing it. Especially do not turn it into PostScript.

PDF files can contain ToUnicode CMaps (they are optional), and these allow reliable text extraction. PostScript doesn't support them, so the information is lost if you create a PostScript file from the PDF (no matter what means you use to create the PostScript).

In addition, the PostScript program will usually be created with subset fonts, non-standard Encodings and other modifications to the text, which will make it hard, or impossible, to extract text from it.

Since Ghostscript can accept PostScript and PDF as input, there is no value in turning the PDF into PostScript before feeding it to the txtwrite device. All you are doing is making life harder for the device and discarding useful information.

Just use Ghostscript and the txtwrite device, and give it the PDF file as an input.
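
As a rough illustration of the advice above, here is a minimal Python sketch that runs the txtwrite device directly on the original PDF, using the same switches as the command in the question. The function names, the file names and the default executable name `gs` are assumptions made for this example; on Windows the console executable is typically called gswin64c or gswin32c instead. The sketch also assumes txtwrite's default output is UTF-8 text.

```python
import shutil
import subprocess
from pathlib import Path


def build_txtwrite_command(pdf_path, txt_path, gs_executable="gs"):
    """Return the Ghostscript argument list for txtwrite extraction.

    The switches are the ones used in the question; only the input file
    differs (the original PDF rather than the printed one).
    """
    return [
        gs_executable,
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=txtwrite",
        f"-sOutputFile={txt_path}",
        str(pdf_path),
    ]


def extract_text(pdf_path, txt_path, gs_executable="gs"):
    """Run Ghostscript on the PDF and return the extracted text."""
    if shutil.which(gs_executable) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_executable}")
    subprocess.run(
        build_txtwrite_command(pdf_path, txt_path, gs_executable),
        check=True,  # raise CalledProcessError if Ghostscript fails
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling `extract_text("original.pdf", "output.txt")` would then be equivalent to typing the command by hand, with the original PDF (not printed.pdf) as the input.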

Naturally OCR works, because it scans the shapes of the text to determine the characters, but yes, it's slow. On the other hand, it will work with PDF files which only contain images of text, not actual text, which the txtwrite device won't.
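
One way to combine the two approaches is to try txtwrite first and only fall back to the slower OCR route when it yields no usable text. The sketch below is only an illustration under stated assumptions: the helper names, the `min_chars` threshold, the 300 dpi resolution, the `tiffgray` device choice and the file names are all hypothetical choices for this example, not something from the question. It only builds the command lines for the OCR route and does not run them.

```python
def needs_ocr(extracted_text, min_chars=1):
    """Return True if txtwrite produced fewer than min_chars visible characters.

    An image-only PDF is expected to give empty or whitespace-only output.
    The threshold is a hypothetical tuning knob for this sketch.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


def build_ocr_commands(pdf_path, tiff_path, text_base, gs_executable="gs"):
    """Return the two command lines for the PDF -> TIFF -> Tesseract route.

    Tesseract writes its result to text_base + ".txt".
    """
    render = [
        gs_executable,
        "-dSAFER",
        "-dBATCH",
        "-dNOPAUSE",
        "-sDEVICE=tiffgray",  # assumed device; any TIFF device would do
        "-r300",              # assumed resolution in dpi
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]
    ocr = ["tesseract", str(tiff_path), str(text_base)]
    return render, ocr
```

With this split, the OCR commands would only be executed for documents where `needs_ocr` returns True, so PDF files that contain real text stay on the fast path.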