How to correctly parse a JSONP file from a website to extract well-formatted text (with Python)?


I am trying to extract correct, well-formatted text from this link:

https://html.scribdassets.com/5lamlvj3nkau3ato/pages/100-ca9665a40f.jsonp

It was taken from this website:

https://www.scribd.com/document/628782766/La-machoire-de-Cain

I tried using BeautifulSoup, but the output was wrong. It gave something like this: 'peut-être moins tendre et plussincère. Mon cœur se'. As you can see, the words 'plus' and 'sincère' are joined into 'plussincère' (in the JSONP there are no spaces between the text and the tags). I then tried adding a space between each tag and the text, but that gave odd results: on this file or on another one, it returned words like 'B o n jou r', because some words are split across different spans.
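A minimal sketch that reproduces both symptoms, for anyone who wants to experiment. It rests on several assumptions: the payload is taken to be a callback wrapping a JSON array with one HTML string, the `sample` string is hypothetical (not copied from the real file), and the standard library's `html.parser` stands in for BeautifulSoup so the snippet has no dependencies. The names `unwrap_jsonp`, `TextCollector` and `text_chunks` are illustrative only.

```python
import json
from html.parser import HTMLParser


def unwrap_jsonp(payload: str):
    """Strip the `callback(...)` wrapper and parse the JSON inside it."""
    start = payload.index("(") + 1
    end = payload.rindex(")")
    return json.loads(payload[start:end])


class TextCollector(HTMLParser):
    """Collect every text node in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_chunks(html: str):
    """Return the list of text nodes found in `html`."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical payload (assumed shape, not the real Scribd content).
sample = (
    'window.page100_callback(["<div>'
    "<span>moins tendre et plus</span><span>sincère.</span>"
    "<span>B</span><span>on</span><span>jour</span>"
    '</div>"]);'
)

html = unwrap_jsonp(sample)[0]
chunks = text_chunks(html)

print("".join(chunks))   # moins tendre et plussincère.Bonjour
print(" ".join(chunks))  # moins tendre et plus sincère. B on jour
```

Neither join is right on its own: an empty separator glues separate words together, while a space separator breaks words that are split across several spans.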

I then tried using the Viterbi algorithm with a large data set (300k) to re-segment the joined words, but that didn't work either.
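For context, the kind of Viterbi word segmentation meant here can be sketched as below. This is an assumption about the general technique (a unigram model over a word list), not the exact code or data set that was tried; `segment`, `word_prob` and the probabilities are hypothetical and chosen only for illustration.

```python
import math


def segment(text, word_prob, max_len=20):
    """Most probable split of `text` into known words (unigram Viterbi)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best log-probability of text[:i]
    back = [0] * (n + 1)            # start index of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_prob and best[j] > -math.inf:
                score = best[j] + math.log(word_prob[word])
                if score > best[i]:
                    best[i] = score
                    back[i] = j
    if best[n] == -math.inf:
        return None  # the vocabulary cannot cover the whole string
    words = []
    i = n
    while i > 0:  # walk the back-pointers from the end to the start
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy vocabulary with made-up probabilities (hypothetical values).
word_prob = {"plus": 0.4, "sincère": 0.2, "bon": 0.2, "jour": 0.1, "bonjour": 0.1}

print(segment("plussincère", word_prob))  # ['plus', 'sincère']
print(segment("bonjour", word_prob))      # ['bonjour']
print(segment("xyz", word_prob))          # None
```

A segmenter like this only handles strings made entirely of vocabulary words, so punctuation, capitalisation, and rare or inflected French words all need separate handling, which may be part of why it failed on real pages.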

Here are links to other pages, if you need them.

{'pageNum': 43, 'contentUrl': 'https://html.scribdassets.com/5lamlvj3nkau3ato/pages/43-f3a7f37540.jsonp'}
{'pageNum': 44, 'contentUrl': 'https://html.scribdassets.com/5lamlvj3nkau3ato/pages/44-a06fccf8e0.jsonp'}

Thank you in advance.
