I want to find the location of every word on a PDF page. I have searched the web but couldn't find anything. Can anyone tell me which library (preferably on the Java platform) I should use?
Getting text location from pdf
Asked by Prabhjot Rai
not2qubit
You can use Textricator. Unfortunately, its documentation is not maintained, so it is very difficult to get the more interesting features working. However, if you just want to see the text locations, you can use its simple text mode:
./textricator.bat text --pages=2 xxx.pdf
# Output is a long list of CSV properties for the document, including the text that was read and its x,y coordinates.
Take a look at this tutorial: http://www.luigimicco.altervista.org/doku.php/en/doc/pdf_structure
Basically, with PDFBox, you can access the PDF content stream and then search for the X Y Td line you're looking for. I'm REALLY sure there is a simpler way to do it, but since I work a lot with the content stream for a project, this is the only way I am aware of.
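To illustrate what searching for an X Y Td line can look like, here is a minimal sketch in plain Java. It assumes the content stream has already been decoded into text (getting it out of the PDF with PDFBox is not shown here). The class name `TdScanner` and the method `findTdMoves` are made up for this example and are not part of PDFBox. Note that the two operands of Td are an offset from the start of the current text line, not absolute page coordinates, so a real solution would also have to track Tm, TD, T* and the current transformation matrix.

```java
import java.util.ArrayList;
import java.util.List;

public class TdScanner {

    /**
     * Scans an already-decoded content stream for "tx ty Td" operators and
     * returns each pair of operands as {tx, ty}.
     *
     * Simplification: tokens are split on whitespace, so a string literal
     * that happens to contain a standalone "Td" token could produce a
     * false match.
     */
    public static List<double[]> findTdMoves(String contentStream) {
        List<double[]> moves = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        // A Td operator needs two operands before it, so start at index 2.
        for (int i = 2; i < tokens.length; i++) {
            if (!tokens[i].equals("Td")) {
                continue;
            }
            try {
                double tx = Double.parseDouble(tokens[i - 2]);
                double ty = Double.parseDouble(tokens[i - 1]);
                moves.add(new double[] {tx, ty});
            } catch (NumberFormatException e) {
                // The two preceding tokens were not numbers; skip this match.
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        // Hypothetical content stream fragment, written by hand for the demo.
        String sample = "BT\n/F1 12 Tf\n72 700 Td\n(Hello) Tj\n0 -14 Td\n(World) Tj\nET";
        for (double[] move : findTdMoves(sample)) {
            System.out.println(move[0] + " " + move[1]);
        }
        // prints:
        // 72.0 700.0
        // 0.0 -14.0
    }
}
```

In the sample, the second Td (0 -14) moves 14 units down from the start of the first line, so "World" would start at (72, 686) only if no other matrix is in effect.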
Search PDFBox's Javadocs for more details!
I hope this will help you :)