I have PDF files with embedded OCR data (I have already run OCR on them), so they are searchable. Now I want to extract this OCR data so that I can feed it into my Tomcat 6 search server. For that I need the plain OCR text. So my question is: is it possible to extract the embedded OCR data from the PDF files? Files with coordinates would be ideal, but plain-text files would also be sufficient.
How to extract embedded OCR data from a PDF?
1.6k Views Asked by erik At
You should be able to do this with iText or iTextSharp. iTextSharp has essentially no documentation, however, and a good number of its functions are not equivalent to those found in iText.
PDFsharp does not support iref streams. Those are pretty much the only comprehensive open-source solutions. If you do not mind paying, vista solutions may have something for you; they mostly handle workflow, but they also have some pretty extensive PDF libraries.
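For the "files with coordinates" part of the question, here is a minimal sketch in Python. It does not use iText; it assumes the poppler command-line tool `pdftotext` is installed and on the PATH, and relies on its `-bbox` option, which writes an XHTML document with one `<word>` element per word carrying `xMin`, `yMin`, `xMax` and `yMax` attributes. The names `BBoxParser`, `parse_bbox_xhtml` and `extract_words` are made up for this example.

```python
import subprocess
from html.parser import HTMLParser


class BBoxParser(HTMLParser):
    """Collects (page, xmin, ymin, xmax, ymax, text) tuples from -bbox output."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._page = 0      # 1-based page counter
        self._box = None    # bounding box of the <word> currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "page":
            self._page += 1
        elif tag == "word":
            a = dict(attrs)  # HTMLParser lowercases attribute names
            self._box = tuple(float(a[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
            self._text = []

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "word" and self._box is not None:
            self.words.append((self._page, *self._box, "".join(self._text)))
            self._box = None


def parse_bbox_xhtml(xhtml):
    """Parse the XHTML text produced by `pdftotext -bbox` into word tuples."""
    parser = BBoxParser()
    parser.feed(xhtml)
    parser.close()
    return parser.words


def extract_words(pdf_path):
    """Run pdftotext on pdf_path (assumes poppler is installed)."""
    # Assumption: "-" as the output name sends the result to stdout.
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return parse_bbox_xhtml(result.stdout.decode("utf-8"))
```

If plain text is enough, joining the last field of each tuple with spaces gives a searchable string per document; running `pdftotext` without `-bbox` would also produce a plain-text file directly. Whether the coordinates are accurate depends on how the OCR tool positioned the hidden text layer.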