Why is some of the text I extracted not properly decoded in Python?


I have written the following code to download the text of one of Apple's financial reports from the SEC's EDGAR archive:

import requests
from bs4 import BeautifulSoup

# Identify the client to the server
headers = {'User-Agent': 'email'}
response = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/0001193125-14-383437.txt', headers=headers)
# bytes.decode() with no argument assumes UTF-8
content = response.content.decode()
try:
    soup = BeautifulSoup(content, "html.parser")
    if soup is None:
        raise Exception("Failed to parse with html.parser")
except Exception:
    # Fall back to the lxml parser if html.parser fails
    soup = BeautifulSoup(content, "lxml")
text = soup.get_text()
print(text)

This prints the full text of the report I downloaded, but some characters are not decoded properly. For example, instead of Company's the output shows Company’s. I have tried encoding and decoding the text again, but that does not help, so I am stuck. How should I modify my code to get the desired output?
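For context, the sequence ’ is what a right single quotation mark (U+2019) looks like when its UTF-8 bytes are read as Windows-1252. The sketch below reproduces that garbling and reverses it. It is only an illustration under the assumption that this is the kind of mis-decoding involved; `repair_mojibake` is a hypothetical helper name, not part of requests or BeautifulSoup, and it has not been checked against this particular filing.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252.

    Hypothetical helper for illustration only.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # The string does not round-trip, so leave it unchanged.
        return text


# Reproduce the garbling: U+2019 is three bytes in UTF-8, and each byte
# becomes a separate character when read as Windows-1252.
garbled = "Company\u2019s".encode("utf-8").decode("cp1252")
print(garbled)

# Reverse it.
print(repair_mojibake(garbled))
```

The helper returns its input unchanged whenever the round trip fails, so it will not repair every case. Python's cp1252 codec leaves a few byte values undefined, so some garbled characters cannot be encoded back. If the garbled characters are already present in the file as served, changing the decoding step will not help, and a repair of this kind would have to be applied to the extracted text instead.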
