Read article content using goose retrieving nothing

Question

1.2k Views Asked by Abhishek Bhatia At 07 June 2025 at 12:18

I am trying to use Goose to read article text from .html files (a URL is used in the example here for convenience)[1]. But at times it doesn't return any text. Please help me with this issue.

Goose version used: https://github.com/agolo/python-goose/ (the present version gives some errors).

from goose import Goose
from requests import get

response = get('http://www.highbeam.com/doc/1P3-979471971.html')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text
print text

**Thiem Nguyen** · Accepted Answer

Goose uses several predefined ("known") elements that are likely to be a good starting point for finding the top node. If none of these known elements is found, it starts looking for the top_node heuristically, which in general is an element containing a lot of p tags. You can read extractors/content.py for more details.

The given article does not have many traits of a common article, which is normally wrapped inside an article tag, or a div tag with a class or id such as 'post-content', 'story-body', 'article', etc. Its body is a div tag with id = 'docText' that contains no paragraphs, so Goose has nothing to base a good guess on.
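To make the two-stage lookup concrete, here is a much simplified sketch of the idea, not Goose's actual code. It uses only the Python 3 standard library; `KNOWN_TAGS`, `TopNodeFinder` and `pick_top_node` are hypothetical names invented for this illustration, and the real scoring in extractors/content.py is more involved than counting p tags.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for Goose's list of known content selectors.
KNOWN_TAGS = [
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {'br', 'img', 'hr', 'meta', 'link', 'input'}


class TopNodeFinder(HTMLParser):
    """Records the first known element and counts <p> children per parent."""

    def __init__(self, known_tags):
        super().__init__()
        self.known_tags = known_tags
        self.stack = []          # labels of the currently open elements
        self.seen = 0            # running counter used to label elements without an id
        self.known_match = None  # first element matching a known selector
        self.p_counts = {}       # parent label -> number of direct <p> children

    def _is_known(self, tag, attrs):
        for selector in self.known_tags:
            if selector.get('tag') == tag:
                return True
            if 'attr' in selector:
                values = (attrs.get(selector['attr']) or '').split()
                if selector['value'] in values:
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        self.seen += 1
        if attrs.get('id'):
            label = '%s#%s' % (tag, attrs['id'])
        else:
            label = '%s[%d]' % (tag, self.seen)
        if self.known_match is None and self._is_known(tag, attrs):
            self.known_match = label
        if tag == 'p' and self.stack:
            parent = self.stack[-1]
            self.p_counts[parent] = self.p_counts.get(parent, 0) + 1
        self.stack.append(label)

    def handle_endtag(self, tag):
        # Simplification: assumes well-formed nesting.
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def pick_top_node(html, known_tags=KNOWN_TAGS):
    """Return a label for the chosen top node, or None if nothing qualifies."""
    finder = TopNodeFinder(known_tags)
    finder.feed(html)
    finder.close()
    if finder.known_match is not None:   # stage 1: a known element wins
        return finder.known_match
    if finder.p_counts:                  # stage 2: element with the most <p> children
        return max(finder.p_counts, key=finder.p_counts.get)
    return None                          # no known element and no paragraphs
```

On a page shaped like the one in the question (a single div with id 'docText', text separated by br tags and no p tags), both stages come up empty, which matches the empty cleaned_text you are seeing.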

What I suggest is adding this entry at the beginning of the KNOWN_ARTICLE_CONTENT_TAGS constant in extractors/content.py:

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'id', 'value': 'docText'},
    # ... other entries go here
]

With that entry in place, here is the extracted body:

Chennai, Dec. 19 -- The Tamil Nadu Government on Monday appointed a one-man judicial commission of inquiry to look into the reasons for Sunday's stampede in state capital Chennai, which claimed 42 lives and left another 37 injured.

The announcement of the formation of the commission came even as family members of those killed in a stampede agonised and agitated over the unexpected tragedy.

The 42 homeless people were trampled to death during the distribution of flood relief supplies at a shelter in the Tamil Nadu capital.

Officials said over 5,000 people rushed in as the gates of the shelter opened, causing the stampede.

Chitra, family member of a victim, said it was mismanagement that led to the tragedy. …
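If you would rather not edit the library source, one possible fallback is to pull the text out of the known container yourself whenever Goose returns nothing. The following is only a sketch under that assumption: it uses the Python 3 standard library, and `TextById` and `text_by_id` are hypothetical helper names, not part of Goose.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collects the text inside the first element that has the given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.target_tag = None  # tag name of the matched element
        self.depth = 0          # > 0 while inside the matched element
        self.done = False       # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if dict(attrs).get('id') == self.element_id:
                self.target_tag = tag
                self.depth = 1
        elif tag == self.target_tag:
            self.depth += 1            # nested element with the same tag name
        elif tag == 'br':
            self.chunks.append('\n')   # keep line breaks as newlines

    def handle_endtag(self, tag):
        if self.depth and tag == self.target_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, element_id):
    """Return the stripped text of the element with this id, or '' if absent."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks).strip()
```

With the names from the question, the fallback could then look something like `text = article.cleaned_text or text_by_id(response.text, 'docText')`. Note that this only helps for pages whose container id you already know, whereas the KNOWN_ARTICLE_CONTENT_TAGS change lets Goose still apply its own cleaning.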
