Python Tika cannot parse pdf from url


I am using Tika in Python to parse an online PDF for future use. My code is below.

from tika import parser
import requests
import io
url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'
response = requests.get(url)
with io.BytesIO(response.content) as open_pdf_file:
    pdfFile = parser.from_file(open_pdf_file)
print(pdfFile)

However, it raises the following error:

AttributeError: '_io.BytesIO' object has no attribute 'decode'

I took the approach from the question "How can i read a PDF file from inline raw_bytes (not from file)?"

That example uses PyPDF2, but I need to use Tika because it gives me better results than PyPDF2.

Thank you for helping

Best answer

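To use tika you need Java 8 installed, because the tika package runs the Java-based Tika server in the background. parser.from_file expects a path or a URL, not a BytesIO object, which is why the original code fails. You can pass the URL directly and skip the requests download. The code to retrieve and print the contents of the PDF is as follows: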

from tika import parser

url = 'https://www.whitehouse.gov/wp-content/uploads/2017/12/NSS-Final-12-18-2017-0905.pdf'

pdfFile = parser.from_file(url)

print(pdfFile["content"])
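If the PDF bytes are already in memory (for example from requests, as in the question), a possible alternative is parser.from_buffer, which takes the raw content instead of a path or URL. The sketch below is only an illustration: the helper names pdf_bytes_from and extract_text and the parse_buffer parameter are hypothetical, not part of tika, and it assumes the parsed result is a dict with a "content" key, as in the answer above.

```python
import io


def pdf_bytes_from(source):
    """Return raw bytes from a bytes object or a binary file-like object."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)
    return source.read()


def extract_text(source, parse_buffer=None):
    """Extract text from in-memory PDF data.

    parse_buffer is a hypothetical hook that defaults to
    tika.parser.from_buffer; it exists only so the function can be
    tried without a running Tika server.
    """
    if parse_buffer is None:
        # Imported lazily: tika needs Java and starts a Tika server on first use.
        from tika import parser
        parse_buffer = parser.from_buffer
    parsed = parse_buffer(pdf_bytes_from(source))
    # "content" can be None when no text was extracted.
    return parsed.get("content") or ""


# Intended usage with the question's download step (not run here):
#   response = requests.get(url)
#   print(extract_text(response.content))
```

Passing response.content directly avoids wrapping the bytes in io.BytesIO at all. Passing the URL to parser.from_file, as shown above, remains the shorter route when no separate download is needed.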