How to extract embedded OCR data from a PDF?

1.6k Views Asked by erik At 02 March 2011 at 13:57

I have PDF files with embedded OCR data (I have already OCRed them), so they are searchable. Now I want to extract this OCR data, because I want to put it into my Tomcat 6 search server. For that I need the plain OCR data. So my question is: is it possible to extract this embedded OCR data from the PDF files? It would be nice to get files with coordinates, but plain-text files would also be sufficient.


There is 1 best solution below

david On 02 March 2011 at 17:04

You should be able to do this with iText or iTextSharp. iTextSharp has next to no documentation, however, and a good number of its functions are not equivalent to those found in iText.

PDFsharp does not support iref streams. These libraries are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they have some pretty extensive PDF libraries as well.
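
To give an idea of what "extracting the embedded OCR data with coordinates" involves, here is a minimal sketch in plain Python. It does not use iText, iTextSharp, or PDFsharp. It only shows the underlying idea: the OCR layer is ordinary text in the page content streams, placed with the `Tm`/`Td` operators and shown with `Tj`/`TJ`. The function names (`extract_text_with_positions` and its helpers) are made up for this example. It assumes uncompressed or Flate-compressed streams, literal `(...)` strings, and a Latin-1-compatible font encoding. It ignores the page transformation matrix, hex strings, custom font encodings, object streams, and encryption, all of which a full library handles for you.

```python
import re
import zlib

# Everything between "stream" and "endstream" (a real parser would use /Length).
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A literal string "(...)" or a bare token; array brackets are skipped.
TOKEN_RE = re.compile(rb"\((?:\\.|[^\\()])*\)|[^\s()\[\]]+")


def iter_content_streams(pdf_bytes):
    """Yield stream bodies, inflating those that are Flate-compressed."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the newline left before "endstream".
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # assumption: the stream was stored uncompressed


def is_operand(token):
    """Numbers, names, and strings are operands; anything else is an operator."""
    return token[:1] in b"(/<+-.0123456789"


def unescape(literal):
    """Drop the outer parentheses; resolve escaped parentheses and backslashes only."""
    body = re.sub(rb"\\([()\\])", lambda m: m.group(1), literal[1:-1])
    return body.decode("latin-1")


def numbers(operands, count):
    """Return the last `count` operands as floats, or None if malformed."""
    try:
        values = [float(t.decode("latin-1")) for t in operands[-count:]]
    except ValueError:
        return None
    return values if len(values) == count else None


def extract_text_with_positions(pdf_bytes):
    """Return (x, y, text) tuples, one per Tj/TJ operator found."""
    found = []
    for content in iter_content_streams(pdf_bytes):
        matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # text line matrix: a b c d e f
        operands = []
        for token in TOKEN_RE.findall(content):
            if is_operand(token):
                operands.append(token)
                continue
            if token == b"BT":  # begin text object: reset the matrix
                matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
            elif token == b"Tm":  # set the matrix explicitly
                matrix = numbers(operands, 6) or matrix
            elif token in (b"Td", b"TD"):  # move to the next line by (tx, ty)
                shift = numbers(operands, 2)
                if shift:
                    tx, ty = shift
                    a, b, c, d, e, f = matrix
                    matrix = [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f]
            elif token in (b"Tj", b"TJ"):  # show text
                text = "".join(unescape(t) for t in operands if t[:1] == b"(")
                if text:
                    found.append((matrix[4], matrix[5], text))
            operands = []  # every operator consumes its operands
    return found
```

You would call it as `extract_text_with_positions(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder file name. The coordinates are in text space (the line origin, not per-character boxes). If your OCR tool writes hex strings or uses a custom font encoding, this sketch will return nothing useful and you need a real PDF library instead.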
