I am trying to use Goose to read from .html files (a URL is specified here for the sake of convenience in the examples) [1]. But at times it doesn't show any text. Please help me out with this issue.
Goose version used: https://github.com/agolo/python-goose/ (the present version gives some errors).
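from goose import Goose
from requests import get
# Fetch the page and hand the raw HTML to Goose instead of a URL
response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
# cleaned_text is empty when Goose cannot find a content node
text = article.cleaned_text
print(text)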
Goose indeed uses several predefined elements which are likely a good starting point for finding the top node. If none of these "known" elements is found, it starts looking for the top_node, which in general is an element containing a lot of p tags. You can read extractors/content.py for more details.

The given article does not have many traits of a common article, which is normally wrapped inside an article tag, or a div tag with a class or id such as 'post-content', 'story-body', 'article', etc. Here the content sits in a div tag with id = 'docText' and has no paragraphs, so Goose has very little to go on when choosing the top node.

What I can suggest is to add this line at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:

and here is the extracted body:
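
As a rough illustration of the known-element lookup described in this answer, here is a minimal, self-contained sketch. It is not Goose's code: it uses only the standard library, and KNOWN_TAGS, KnownNodeText, find_known_node_text and the sample HTML are all hypothetical names made up for this example. The descriptor format, including the {'attr': 'id', 'value': 'docText'} entry, is an assumption about what such an entry could look like, so check it against extractors/content.py in your Goose version before relying on it.

```python
from html.parser import HTMLParser

# Hypothetical descriptors (assumed format, not copied from Goose).
KNOWN_TAGS = [
    {'attr': 'id', 'value': 'docText'},        # assumed entry for this page
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {'br', 'hr', 'img', 'input', 'link', 'meta'}


class KnownNodeText(HTMLParser):
    """Collect the text inside the first element matching a descriptor."""

    def __init__(self, known_tags):
        super().__init__()
        self.known_tags = known_tags
        self.depth = 0      # nesting depth inside the matched element
        self.done = False   # set once the matched element has closed
        self.chunks = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        for known in self.known_tags:
            if 'tag' in known and known['tag'] != tag:
                continue
            if 'attr' in known:
                values = (attrs.get(known['attr']) or '').split()
                if known['value'] not in values:
                    continue
            return True
        return False

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.chunks.append(data)


def find_known_node_text(html, known_tags=KNOWN_TAGS):
    """Return the whitespace-normalized text of the first known element."""
    parser = KnownNodeText(known_tags)
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.chunks).split())


sample = ('<html><body><div id="nav">Menu</div>'
          '<div id="docText">First line.<br>Second line.</div></body></html>')
print(find_known_node_text(sample))
```

The point of the sketch is that a page with no p tags can still be extracted once its container is listed as a known element; without a matching entry the function returns an empty string, which mirrors the empty cleaned_text seen in the question.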