I am trying to use Goose to read from .html files (a URL is used here for convenience in the examples)[1]. But at times it doesn't show any text. Please help me out with this issue.
Goose version used: https://github.com/agolo/python-goose/ The present version gives some errors.
from goose import Goose
from requests import get

# Fetch the page and hand the raw HTML to Goose
response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text  # sometimes an empty string
print(text)
Goose does use several predefined elements, which are a likely starting point for finding the top node. If no "known" elements are found, it starts looking for the top_node, which in general is an element containing a lot of p tags. You can read extractors/content.py for more details.
The given article does not have many traits of a common article, which is normally wrapped in an article tag, or in a div tag with a class or id such as 'post-content', 'story-body', 'article', etc. Here the content sits in a div tag with id = 'docText' and has no paragraphs, so Goose cannot make a good guess about it.
What I can suggest is to add this line at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:
and here is the extracted body:
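As a rough illustration of the suggestion: assuming the entries of KNOWN_ARTICLE_CONTENT_TAGS are dicts of the form {'attr': ..., 'value': ...} or {'tag': ...} (an assumption about Goose's internals, so check extractors/content.py in your copy), an entry for this page would target the id 'docText'. The sketch below is not Goose code. It is a small standard-library stand-in, written for Python 3, that mimics a "known tags first" lookup; the class KnownTagText, the function extract_known and the sample HTML are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

# Assumed shape of the entries; the first one is the hypothetical addition,
# placed at the beginning so it is tried before the generic rules.
KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'id', 'value': 'docText'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]

# Elements without a closing tag; they must not affect nesting depth.
VOID_TAGS = {'br', 'img', 'hr', 'input', 'meta', 'link'}


class KnownTagText(HTMLParser):
    """Collect the text inside the first element that matches one rule."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.depth = 0       # > 0 while inside the matched element
        self.done = False    # True once the matched element has closed
        self.chunks = []

    def _matches(self, tag, attrs):
        if 'tag' in self.rule and tag != self.rule['tag']:
            return False
        if 'attr' in self.rule:
            value = dict(attrs).get(self.rule['attr']) or ''
            return self.rule['value'] in value.split()
        return True

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_known(html, rules=KNOWN_ARTICLE_CONTENT_TAGS):
    """Return the text for the first rule that matches, or '' if none does."""
    for rule in rules:
        parser = KnownTagText(rule)
        parser.feed(html)
        parser.close()
        text = ' '.join(' '.join(parser.chunks).split())
        if text:
            return text
    return ''


# Made-up page with the same trait as the article: a div with id 'docText'
# and no p tags inside it.
sample = ('<html><body><div id="nav">Menu</div>'
          '<div id="docText">First line.<br>Second line.</div></body></html>')
print(extract_known(sample))
```

Without the 'docText' entry none of the rules match the sample, which mirrors why a page with no article tag and no paragraphs can end up with empty text.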