Changing language in tika


Is it possible to change the language (default detection) for Tika?

I am trying to use a PDF file in Tamil (language code 'ta'), but Tika detects it as 'th' (Thai). Most characters are recognized well, but a few are not detected.

See the example below, where some 'o' characters appear in between the text.

ஓவச - அக் ைரும்பாகலைளில் ைருப்பஞ்ொறு பாய்வதால் எழுகின்ற ஓகெயும்; வவவலச் சங்கின் வாய்ப் கபாங்கும் ஓவச - நீர்க் ைகரைளில் உள்ள ெங்குைளிடமிருந்து

from tika import language
print(language.from_file(u'pdf/KambaRamayanam1.pdf' ))

The result is 'th'; the expected result is 'ta'.


marek.kapowicki On
  1. In the TesseractOCRConfig class, there is a setLanguage method.
  2. When you use Tika Server, you can set the language with the X-Tika-OCRLanguage header. The list of Tesseract languages is here: https://github.com/tesseract-ocr/tessdata
marek.kapowicki On

Tika can process a PDF either in OCR mode (which works fine with scanned PDFs) or in no-OCR mode. In OCR mode, Tika sends the request to Tesseract.

  1. Make sure that Tika is using OCR, either in code:

    PDFParserConfig::setExtractInlineImages(true) // is important
    PDFParserConfig::setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY)

or using headers in Tika Server:

    X-Tika-PDFextractInlineImages: true
    X-Tika-PDFocrStrategy: OCR_ONLY

  2. Tika then uses Tesseract, and you can change the Tesseract configuration: https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html
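
For tika-python users, the same headers can be sent from Python. This is only a minimal sketch, assuming a running Tika Server with Tesseract and the Tamil model (`tam`) installed. The helper names `pdf_ocr_headers` and `extract_text_with_ocr` are hypothetical, and the headers are passed through `requestOptions` on the assumption that tika-python forwards them to the HTTP request.

```python
def pdf_ocr_headers(ocr_languages):
    """Build Tika Server headers that force OCR on PDFs (hypothetical helper)."""
    return {
        # Header names are the ones quoted in the answers above.
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFocrStrategy": "OCR_ONLY",
        # Tesseract expects several models joined with '+', e.g. "tam+eng".
        "X-Tika-OCRLanguage": "+".join(ocr_languages),
    }


def extract_text_with_ocr(path, ocr_languages=("tam",)):
    """Parse a file through Tika Server with OCR forced on (needs tika-python)."""
    from tika import parser  # imported lazily so the helper above works without tika

    parsed = parser.from_file(
        path, requestOptions={"headers": pdf_ocr_headers(ocr_languages)}
    )
    # from_file returns a dict; the text is under "content" and may be None.
    return parsed.get("content") or ""
```

Calling `extract_text_with_ocr("pdf/KambaRamayanam1.pdf")` would then return the OCR text as a plain string, ready to be passed to a language detector.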

To get the big picture, I strongly recommend looking at my Java project https://github.com/marekkapowicki/nlp and the blog post: https://medium.com/@masreis/text-extraction-and-ocr-with-apache-tika-302464895e5f

Dave Meikle On

There seem to be two parts to this question, if I understand correctly: the OCR and the language detection. From the comments on the question and the other answers, I understand that the two are interlinked.

OCR for Indian Languages

In terms of OCR, by default, Apache Tika uses the eng Tesseract model only unless you tell it to use others. You can do this by setting them in the TesseractOCRConfig, either through:

  1. Creating your own TesseractOCRConfig.properties file and placing it on the classpath in the appropriate package
  2. Sending an appropriate header on the request to Tika Server in the REST call.

(Option 2 can also be used to override defaults you've set in option 1.)

This allows you to give a list of one or more Tesseract models to load for use during the OCR.

You can use some of the models provided by Tesseract, or custom ones, such as the Tesseract models for Indian languages, that you install or build yourself.

Once a model is installed, you just need to use its name in the language list in the TesseractOCRConfig.
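
One way to confirm that a model is actually installed before naming it in the language list is to ask Tesseract itself. The following is a minimal sketch, assuming the `tesseract` binary is on the PATH and supports the `--list-langs` option. The helper names are hypothetical, and the exact heading line Tesseract prints may differ between versions.

```python
import subprocess


def parse_list_langs(output):
    """Extract model names from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Skip the heading line, e.g. "List of available languages (3):".
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_models():
    """Run tesseract and return the installed model names (hypothetical helper)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


def missing_models(language_list, available):
    """Return the models in a 'tam+eng' style list that are not installed."""
    return [m for m in language_list.split("+") if m not in available]
```

For example, `missing_models("eng+tam", installed_models())` returns an empty list when both models are available to Tesseract.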

Both options are explained in more detail on the wiki: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

Language Detection

Because you are using the tika-python module, which in turn uses Tika Server, you are not calling the language detector from tika-core referenced in the other answer. You are actually using the OptimaizeLangDetector and its default models, with profiles for the languages it supports.

There is no parsing going on in the /language endpoints you are using, and thus no OCR; the detector merely uses the raw String, or whatever it reads from the InputStream sent.

To maximise the chance of a good result, you'll want to use:

from tika import parser, language
# from_file returns a dict; the extracted text is under the "content" key
parsed = parser.from_file("sample.pdf")
print(language.from_buffer(parsed["content"]))

Together with a TesseractOCRConfig set up to use the models that match the scripts you are sending in, you should get a reasonable result.

from tika import parser, language
headers = { "X-Tika-OCRLanguage": "eng+tam" }
parsed = parser.from_file("sample.pdf", xmlContent=False, requestOptions={'headers': headers})
# pass only the extracted text, not the whole result dict, to the detector
print(language.from_buffer(parsed["content"]))
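
Since the original problem was Tamil text being reported as Thai, a quick sanity check on the extracted text can help tell an OCR problem from a detection problem. The sketch below is not part of Tika. It simply counts characters in the Tamil (U+0B80 to U+0BFF) and Thai (U+0E00 to U+0E7F) Unicode blocks, and the function name `dominant_script` is hypothetical.

```python
def dominant_script(text):
    """Return 'tamil', 'thai', or 'other' based on Unicode block counts."""
    counts = {"tamil": 0, "thai": 0}
    for ch in text:
        code = ord(ch)
        if 0x0B80 <= code <= 0x0BFF:    # Tamil block
            counts["tamil"] += 1
        elif 0x0E00 <= code <= 0x0E7F:  # Thai block
            counts["thai"] += 1
    if counts["tamil"] == 0 and counts["thai"] == 0:
        return "other"
    # Ties go to Tamil, which is the script expected in this question.
    return "tamil" if counts["tamil"] >= counts["thai"] else "thai"
```

If the extracted content is mostly Tamil script but the detector still reports 'th', the issue lies with language detection rather than with OCR.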