Changing language in tika

Question

1.8k Views Asked by Prem Anand At 24 October 2020 at 04:42

Is it possible to change the language (default detection) for Tika?

I am trying to process a PDF file in Tamil (language code 'ta'), but Tika detects it as 'th' (Thai). Although most characters are recognized well, a few are not detected correctly.

See the example below, where a stray 'o' appears in places within the text.

ஓவச - அக் ைரும்பாகலைளில் ைருப்பஞ்ொறு பாய்வதால் எழுகின்ற ஓகெயும்; வவவலச் சங்கின் வாய்ப் கபாங்கும் ஓவச - நீர்க் ைகரைளில் உள்ள ெங்குைளிடமிருந்து

from tika import language
print(language.from_file(u'pdf/KambaRamayanam1.pdf' ))

The result is 'th'; the expected result is 'ta'.

There are 3 best solutions below

**marek.kapowicki** · Answer 1 · 2020-10-27T11:06:48.900000

The TesseractOCRConfig class has a setLanguage method.
When you use Tika Server, you can set the language with the X-Tika-OCRLanguage header. The list of Tesseract languages is here: https://github.com/tesseract-ocr/tessdata

**marek.kapowicki** · Answer 2 · 2020-10-28T11:16:07.090000

Tika can process a PDF either in OCR mode (which works well with scanned PDFs) or in no_ocr mode. In OCR mode, Tika sends the request to Tesseract.

Make sure that Tika is actually using OCR, either in code:

PDFParserConfig::setExtractInlineImages(true) // is important
PDFParserConfig::setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY)

or using headers in tika server:

X-Tika-PDFextractInlineImages: true, X-Tika-PDFocrStrategy: OCR_ONLY

Tika then uses Tesseract, and you can change the Tesseract configuration: https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html
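For tika-python users, the headers above can be assembled in one place. This is only a sketch: `build_ocr_headers` and `extract_text` are hypothetical helper names, the header names and the OCR_ONLY value are taken from this answer, and `tam` is assumed to be the Tesseract model installed for Tamil.

```python
def build_ocr_headers(languages, ocr_only=True):
    """Build Tika Server request headers that enable OCR for the given Tesseract models."""
    headers = {
        "X-Tika-PDFextractInlineImages": "true",
        # Tesseract accepts several models joined with '+', e.g. "eng+tam"
        "X-Tika-OCRLanguage": "+".join(languages),
    }
    if ocr_only:
        headers["X-Tika-PDFocrStrategy"] = "OCR_ONLY"
    return headers


def extract_text(path, languages=("tam",)):
    """Parse a file through Tika Server with OCR headers (needs tika-python and a running server)."""
    # Imported lazily so the header helper works without tika-python installed
    from tika import parser

    parsed = parser.from_file(path, headers=build_ocr_headers(languages))
    return parsed.get("content") or ""
```

Calling `extract_text("sample.pdf", ("eng", "tam"))` would then return the OCR text, or an empty string if nothing was extracted.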

To get the big picture, I strongly recommend looking at my Java project https://github.com/marekkapowicki/nlp and the blog post: https://medium.com/@masreis/text-extraction-and-ocr-with-apache-tika-302464895e5f

**Dave Meikle** · Answer 3 · 2020-11-13T01:28:32.647000

There seem to be two parts to this question, if I understand correctly: the OCR and the language detection. From the comments on the other questions and answers, I understand that they are interlinked.

OCR for Indian Languages

In terms of OCR, by default, Apache Tika uses the eng Tesseract model only unless you tell it to use others. You can do this by setting them in the TesseractOCRConfig, either through:

1. Creating your own TesseractOCRConfig.properties file and placing it on the classpath in the appropriate package
2. Sending an appropriate header on the request to Tika Server in the REST call

(Option 2 can also be used to override defaults you've set in option 1.)

This allows you to give a list of one or more Tesseract models to load for use during the OCR.

You can use some of the models provided by Tesseract, or custom ones, such as Tesseract models for Indian languages, that you install or build.

Once installed, you just need to use the relevant model name in the language list in the TesseractOCRConfig.

Both are explained in more detail on the wiki: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
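The properties-file option can be sketched from Python as well. This is a rough illustration, not an official tool: `write_tesseract_config` is a hypothetical helper, the directory layout assumes the file must sit in the package of TesseractOCRConfig (org.apache.tika.parser.ocr), and `language` is assumed to be the property key, as in the defaults shipped with Tika 1.x. Check the wiki for your version.

```python
from pathlib import Path

# Assumed package path of TesseractOCRConfig on the classpath
PACKAGE_DIR = Path("org/apache/tika/parser/ocr")


def write_tesseract_config(classpath_root, languages):
    """Write a TesseractOCRConfig.properties override under a classpath directory."""
    target_dir = Path(classpath_root) / PACKAGE_DIR
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "TesseractOCRConfig.properties"
    # Several Tesseract models are joined with '+', e.g. language=eng+tam
    target.write_text("language=" + "+".join(languages) + "\n", encoding="utf-8")
    return target
```

The directory passed as `classpath_root` still has to be added to the classpath of the Tika process for the override to be picked up.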

Language Detection

Because you are using the tika-python module, which in turn uses Tika Server, you are not calling the language detector from tika-core referenced in the other answer. You are actually using the OptimaizeLangDetector and its default models, which only have profiles for a fixed set of supported languages.

No parsing happens in the /language endpoints you are using, and therefore no OCR; the detector merely works on the raw string, or on whatever it reads from the InputStream that was sent.

To maximise the chance of a good result, you'll want to use:

from tika import parser, language
# parser.from_file returns a dict; the extracted text is under the "content" key
parsed = parser.from_file("sample.pdf")
print(language.from_buffer(parsed["content"]))

Together with a TesseractOCRConfig set up to use the models that match the scripts you are sending in, you should get a reasonable result.

from tika import parser, language
# Ask Tesseract to load both the English and the Tamil model
headers = {"X-Tika-OCRLanguage": "eng+tam"}
parsed = parser.from_file("sample.pdf", xmlContent=False, headers=headers)
print(language.from_buffer(parsed["content"]))
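If the detector still answers 'th', a quick check of the extracted text can show whether the text itself is Tamil. The sketch below is not part of Tika; it is a plain Unicode block count, with `dominant_script` as a hypothetical helper name and only the Tamil (U+0B80 to U+0BFF) and Thai (U+0E00 to U+0E7F) blocks considered.

```python
from collections import Counter

# Unicode block ranges for the two scripts being confused in the question
SCRIPT_RANGES = {
    "ta": (0x0B80, 0x0BFF),  # Tamil block
    "th": (0x0E00, 0x0E7F),  # Thai block
}


def dominant_script(text):
    """Return 'ta' or 'th' for the script with the most characters, or None if neither occurs."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for code, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[code] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

If `dominant_script(parsed["content"])` returns 'ta' while the detector says 'th', the extraction is fine and the problem lies in the detection step or in what was sent to it.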
