I want to know the location of every word on a PDF page. I have searched the web but couldn't find anything. Can anyone suggest which library (preferably for the Java platform) I should use?
Getting text location from pdf
468 Views Asked by Prabhjot Rai
There are 2 best solutions below
not2qubit
You can use Textricator. Unfortunately its documentation is not maintained, so it's very difficult to get the more interesting features working. However, if you just want to see the text locations, you can use its simple text mode:
./textricator.bat text --pages=2 xxx.pdf
# The output is a long list of CSV records for the document, including the extracted text and its x,y coordinates.
Take a look at this tutorial: http://www.luigimicco.altervista.org/doku.php/en/doc/pdf_structure
Basically, with PDFBox, you can access the page's content stream and then search it for the "X Y Td" line you're looking for. I'm REALLY sure there is a simpler way to do it, but since I work a lot with the content stream for a project, this is the only way I know.
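To illustrate the "search for the X Y Td line" step, here is a minimal sketch in plain Java. It assumes you have already obtained the page's content stream as decoded text (for example through PDFBox; that part is not shown). The class name `TdOperatorFinder`, the `Move` type and the sample stream are made up for this example and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans an already-decoded content stream for "x y Td" operators.
public class TdOperatorFinder {

    /** Operands of one Td (or TD) operator found in the stream. */
    public static final class Move {
        public final double x;
        public final double y;

        Move(double x, double y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public String toString() {
            return "Td(" + x + ", " + y + ")";
        }
    }

    // A PDF number: optional sign, then digits with an optional fraction.
    private static final String NUM = "[-+]?(?:\\d+\\.?\\d*|\\.\\d+)";

    // Two numeric operands followed by the Td or TD operator.
    private static final Pattern TD = Pattern.compile(
            "(?<![\\w.+-])(" + NUM + ")\\s+(" + NUM + ")\\s+T[dD](?![A-Za-z*])");

    /** Returns the operands of every Td/TD operator, in stream order. */
    public static List<Move> findTextMoves(String contentStream) {
        List<Move> moves = new ArrayList<>();
        Matcher m = TD.matcher(contentStream);
        while (m.find()) {
            moves.add(new Move(Double.parseDouble(m.group(1)),
                               Double.parseDouble(m.group(2))));
        }
        return moves;
    }

    public static void main(String[] args) {
        // Made-up sample stream; a real one comes from the PDF page.
        String sample = "BT /F1 12 Tf 72 700 Td (Hello) Tj 0 -14.5 Td (World) Tj ET";
        for (Move move : findTextMoves(sample)) {
            System.out.println(move);
        }
    }
}
```

Keep in mind that the Td operands are offsets from the start of the current text line, not absolute page coordinates, and that the text matrix (Tm) and the current transformation matrix (cm) also affect where text ends up. The regex is naive as well: it would also match "1 2 Td" inside a string literal. If you don't need the raw stream, PDFBox's PDFTextStripper hands you TextPosition objects per glyph, which is probably the simpler way mentioned above.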
Search PDFBox's Javadocs for more details!
I hope this helps :)