I want to extract text blocks from an HTML page, and I'm using boilerpipe to do this. It works fine when a page contains a single text, but some pages, such as blogs, contain several texts.
I want to extract all of them, with each one identified as a separate text rather than merged into one.
Is there a library that can do this?
EDIT: I'm already using Jsoup to parse the HTML, but I'm not looking for parsing. I want information extraction like boilerpipe does on these pages, and I'd like to try other, similar tools.
Jsoup is a very widely used parser for this type of task. Please check it out.