Issue extracting bold heading letters from a PDF using Tika


I am new to reading text from PDFs with Python. I am using Tika to extract the content of a PDF, and the extraction seems to fail on bold headings.

example image

In the example above, it reads "Rating the Items" as "RRaattiinngg tthhee IItteemms", and the same thing happens with other headings. Is this caused by the library I am using, or is the issue with the PDF itself?

Code I am using:

from tika import parser

# config.PATH holds the path to the PDF file
raw = parser.from_file(config.PATH)
print(raw['content'])
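
One stopgap I could imagine is collapsing the doubled letters after extraction. This is only a minimal sketch, not a fix for the extraction itself: the helper names `collapse_doubled_word` and `collapse_doubled_text` are made up for illustration, and it assumes a heading word is damaged only when every character appears twice in a row (with an optional single trailing character, as in "IItteemms").

```python
import re


def collapse_doubled_word(word: str) -> str:
    """Return the word with doubled characters collapsed, if it looks doubled.

    Hypothetical helper: a word counts as doubled only when every
    consecutive pair holds the same character twice.
    """
    # Leave short tokens and tokens without letters (e.g. "1100") alone.
    if len(word) < 4 or not any(c.isalpha() for c in word):
        return word
    # Split into consecutive pairs; an odd trailing character is kept aside.
    pairs = [word[i:i + 2] for i in range(0, len(word) - 1, 2)]
    if any(p[0] != p[1] for p in pairs):
        return word
    tail = word[-1] if len(word) % 2 else ""
    return "".join(p[0] for p in pairs) + tail


def collapse_doubled_text(text: str) -> str:
    """Apply collapse_doubled_word to each whitespace-separated token."""
    return re.sub(r"\S+", lambda m: collapse_doubled_word(m.group()), text)


if __name__ == "__main__":
    print(collapse_doubled_text("RRaattiinngg tthhee IItteemms"))
```

Ordinary words such as "book" or "noon" are left unchanged because their pairs do not match, but a real word made entirely of doubled letters would be collapsed by mistake, so this would need checking against the actual output.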

Is there a better library for extracting text from PDFs?
Thank you.
