Best visible content extractor available

My application needs the visible content of a given URL: just the text, with no HTML and no header or footer data. Currently I am using BeautifulSoup and boilerpipe for this, but in some rare cases I do not get enough data, or I get the wrong data. Is there any alternative I should consider? The programming language is not a barrier.

There is 1 solution below

I would recommend using XPath or CSS selectors directly for content extraction; both are already implemented in the parsel module.

For a complete web-crawling and content-extraction suite, Scrapy would be my preferred option.

And if you want to visually select which parts of the HTML to extract, I would recommend Portia.

Hope that helps.