I want to extract text from some PDF files (programmatically, with some utility, or even by copy/paste), but some characters come out wrong. Although I specify UTF-8 encoding when extracting the text, characters like "ș", "ț", "ă", etc. come out as "„ ˛" instead of "s, t, a" (or, better, the characters as displayed).
The text is displayed correctly, but when I try to copy it, for example, those characters come out wrong.
Is there some way to extract the text correctly (with Java/C/Python etc., or a Windows/Linux utility), or are those PDF files corrupted in some way?
Extracting correctly the text from a pdf (UTF-8)
1.7k Views Asked by Andrei F At
There is 1 best solution below
Can you extract the text correctly in Acrobat from the PDF?
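If Acrobat's own copy/paste produces the same wrong characters, the cause is more likely the font encoding information stored in the PDF than the extraction tool or the UTF-8 setting. A minimal Python sketch (standard library only) for checking this on the extracted text follows. It first lists the code points that actually came out, then tries to undo one specific mix-up. Everything here is an assumption rather than a confirmed diagnosis: the helper names `describe` and `repair` are made up for illustration, and the pairing of MacRoman with Windows-1250 is only a guess based on the two characters quoted in the question ("„" and "˛" are the MacRoman readings of the Windows-1250 bytes for "ă" and "ţ").

```python
import unicodedata


def describe(text):
    """Return (char, code point, Unicode name) for each non-space character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if not ch.isspace()
    ]


# Legacy cedilla letters -> the comma-below letters used in modern Romanian.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})


def repair(text, wrong="mac_roman", intended="cp1250"):
    """Undo one mix-up: bytes meant as `intended` but decoded as `wrong`.

    ASSUMPTION: the default code pages are a guess from the question's
    sample characters, not something read from the PDF itself.
    Characters that do not exist in either code page are left unchanged.
    """
    fixed = []
    for ch in text:
        try:
            fixed.append(ch.encode(wrong).decode(intended))
        except UnicodeError:
            fixed.append(ch)  # not representable: keep the character as-is
    return "".join(fixed).translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    sample = "„ ˛"  # the characters quoted in the question
    for row in describe(sample):
        print(row)
    print(repair(sample))  # prints "ă ț"
```

Note that `repair` rewrites every non-ASCII character it can, so it would damage text that was extracted correctly. Apply it only to passages known to be affected, and compare the result against what the PDF displays before trusting it. If the output still looks wrong, the guess about the code pages does not fit these files.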