I am using pdftools in R to extract text from both scanned and text-based PDF files. One problem is the § character, which tesseract does not recognize.
I looked at the CRAN tesseract package vignette.
And I tried the following:
I found the configuration files using
tesseract_info()
and edited the digits file under configs. The digits file content was like this:
tessedit_char_whitelist 0123456789.
After editing it looks like this:
tessedit_char_whitelist 0123456789-$§.
This did not change anything at all. I am still not able to extract §; each one still appears as 8.
After the 1st step failed, I tried the following:
filepng <- pdftools::pdf_convert(filePathPDF, dpi = 600)
specs <- tesseract("deu", options = list(tessedit_char_whitelist = "1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@߀!$%&§/()=?+"))
text <- tesseract::ocr(filepng, engine = specs)
This one failed too. I am by no means an expert on OCR, and the tesseract documentation leaves room for improvement.
What is the right way to add § to the list of characters to be recognized, so that the setting actually takes effect?
Update
The following works to recognize §, when I remove language from the argument list:
charlist <- tesseract(options = list(tessedit_char_whitelist = " 1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@߀!$%&§/()=?+"))
text <- tesseract::ocr(filepng, engine = charlist)
But this time, I am losing German umlauts. I cannot find out how to specify the language and the char_whitelist at the same time. According to the documentation, tesseract() accepts a language argument and an options argument, but combining them does not seem to work. Any ideas?
Update: I tried using tesseract on the command line (macOS Catalina 10.15.7).
I converted a scanned PDF file first to an image then used this:
tesseract fileConverted.tiff fileToText
It creates fileToText.txt, and every § is recognized correctly. But German umlauts are not recognized correctly, since I did not specify a language at all. When I use the same command with the language argument
tesseract fileConverted.tiff fileToText -l deu
German umlauts are recognized properly but § is not.
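To compare the two runs without reading the output by eye, one option is to count how often § and an umlaut appear in each result. This is only a rough sketch: `count_symbol` is a hypothetical helper, and `fileToText_deu.txt` is an assumed name for the output of the `-l deu` run (the command above would overwrite fileToText.txt unless a different output base is used).

```shell
#!/bin/sh
# count_symbol CHAR FILE: print how many times CHAR occurs in FILE.
# (Hypothetical helper, not part of tesseract.)
count_symbol() {
  # grep -o prints every match on its own line; -F treats CHAR literally.
  grep -oF -- "$1" "$2" | wc -l | tr -d ' '
}

# Assumed output names: one run without -l, one run with -l deu.
for f in fileToText.txt fileToText_deu.txt; do
  if [ -f "$f" ]; then
    printf '%s: § x %s, ü x %s\n' "$f" \
      "$(count_symbol '§' "$f")" "$(count_symbol 'ü' "$f")"
  fi
done
```

If the `-l deu` run reports zero § while the default run reports several, that confirms the symptom is tied to the language model rather than to the image.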
The digits config file I changed is here:
/usr/local/Cellar/tesseract/4.1.1/share/tessdata/configs
My understanding is that this is not a problem specific to R; it occurs with tesseract itself. Setting tessedit_char_whitelist and the language at the same time does not seem to be possible, or I am missing something fundamental.
As said above, tesseract 4 does not support setting a whitelist with its default (LSTM) engine. To work around that problem, you can use command-line switches. Set the OCR Engine mode to "Original Tesseract only" with
--oem 0 and then use -c tessedit_char_whitelist=abc... to pass your whitelist directly on the command line. Overall, it should look something like this:
tesseract fileConverted.tiff fileToText --oem 0 -l deu -c tessedit_char_whitelist=0123456789-$§
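As a variation on the command above, the whitelist can live in its own config file instead of being edited into the stock digits file or passed with -c. This is a minimal sketch under a few assumptions: `whitelist.cfg` and `write_whitelist_config` are made-up names, the installed deu traineddata must contain the legacy engine for --oem 0 to work, and the config file is passed as a trailing argument in the way the tesseract man page describes.

```shell
#!/bin/sh
# write_whitelist_config FILE CHARS: create a tesseract config file that
# sets tessedit_char_whitelist to CHARS. (Hypothetical helper.)
write_whitelist_config() {
  # Same "name value" layout as the stock digits config shown earlier.
  printf 'tessedit_char_whitelist %s\n' "$2" > "$1"
}

# Single quotes keep the shell from touching the $ in the whitelist.
write_whitelist_config ./whitelist.cfg '0123456789-$§.'

# Run OCR only if tesseract is installed and the input image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f fileConverted.tiff ]; then
  tesseract fileConverted.tiff fileToText --oem 0 -l deu ./whitelist.cfg
fi
```

Keeping the whitelist in a separate file leaves the files under tessdata/configs untouched, so a package upgrade does not silently undo the change.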