I am trying to extract correct, well-formatted text from this link:
https://html.scribdassets.com/5lamlvj3nkau3ato/pages/100-ca9665a40f.jsonp
It was taken from this website:
https://www.scribd.com/document/628782766/La-machoire-de-Cain
I tried using BeautifulSoup, but the output was wrong. It gave something like this: 'peut-être moins tendre et plussincère. Mon cœur se'. As you can see, the words 'plus' and 'sincère' are joined into 'plussincère' (in the JSONP there are no spaces between the text and the tag). I then tried adding a space between the tag and the text, but that gave something weird: on this file or on another one, it returned words like 'B o n jou r', because some words are split across different spans.
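Here is a minimal reproduction of both problems. The two HTML fragments are made up to imitate the page markup (they are not copied from the real file), and I use the standard library's `html.parser` here so the snippet runs without extra dependencies. The helper `extract` and its `separator` argument are just names for this sketch; as far as I can tell, BeautifulSoup's `get_text()` with and without a separator behaves the same way on these fragments.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect every text node of an HTML fragment, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract(html, separator=""):
    """Join all text nodes of `html` with `separator` (hypothetical helper)."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


# Made-up fragments: one where a span boundary is a word boundary,
# one where a single word is split across several spans.
joined_words = "<span>moins tendre et plus</span><span>sincère.</span>"
split_word = "<span>B</span><span>o</span><span>n</span><span>jou</span><span>r</span>"

print(extract(joined_words))       # moins tendre et plussincère.
print(extract(joined_words, " "))  # moins tendre et plus sincère.
print(extract(split_word))         # Bonjour
print(extract(split_word, " "))    # B o n jou r
```

No single separator works for both fragments: an empty one glues separate words together, and a space breaks words that are spread over several spans.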
I then tried using the Viterbi algorithm with a large dataset (300k), but that didn't work either.
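To make clear what I mean by the Viterbi approach, here is a minimal sketch of the general technique: dynamic programming over split points, scoring each candidate word by its frequency. This is not my exact code. The function name `segment`, the unknown-word penalty, and the tiny `toy_counts` dictionary are assumptions for illustration only; my real attempt used the large dataset mentioned above.

```python
import math


def segment(text, word_counts, max_len=20):
    """Split `text` into the cheapest sequence of words (Viterbi-style DP)."""
    total = sum(word_counts.values())

    def cost(word):
        # Known words cost their negative log-probability.
        if word in word_counts:
            return -math.log(word_counts[word] / total)
        # Assumed penalty: unknown strings are expensive, more so when long.
        return 10.0 * len(word) + math.log(total)

    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i]: cheapest cost of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + cost(text[start:end])
            if candidate < best[end]:
                best[end] = candidate
                back[end] = start

    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


# Invented counts, only to show the mechanics.
toy_counts = {"plus": 50, "sincère": 5, "et": 100, "tendre": 5}
print(segment("plussincère", toy_counts))  # ['plus', 'sincère']
```

On a toy input like this it splits correctly, but it depends entirely on the word list covering the text.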
Here are links to other pages, if you need them:
{'pageNum': 43, 'contentUrl': 'https://html.scribdassets.com/5lamlvj3nkau3ato/pages/43-f3a7f37540.jsonp'}
{'pageNum': 44, 'contentUrl': 'https://html.scribdassets.com/5lamlvj3nkau3ato/pages/44-a06fccf8e0.jsonp'}
Thank you in advance.