Java libraries to extract text blocks from HTML pages

1.3k Views Asked by Renato Dinhani At 28 July 2025 at 22:45

I want to extract text blocks from an HTML page, and I'm using boilerpipe to do this. It works fine when a page contains a single text, but some pages, such as blogs, contain several texts.

I want to extract all of the texts and identify each one as a separate text, rather than getting them merged into one.

Is there a library that can do this?

EDIT: I'm using Jsoup to parse the HTML, but I'm not looking for parsing. I'm looking for information extraction like boilerpipe performs on pages. I want to try other, similar tools.


There are 3 best solutions below

Santosh On 20 January 2012 at 15:47

Jsoup is a very widely used parser for these types of tasks. Please check it out.

bezmax On 20 January 2012 at 12:41

Well, personally I liked using Doj together with HtmlUnit. Basically, Doj introduces something similar to CSS selectors for Java.

Example (from the official page):

Doj spanDoj = Doj.on(page).get("#updates tr", 1).get("td", 2).get("span.item");

You can see a more complex example on the linked page (scroll down).

Lucas Wiman On 20 January 2012 at 19:19

The closest Java library I'm aware of is the Road Runner project (http://www.dia.uniroma3.it/db/roadRunner/). It's a system that constructs a special kind of regular expression over the tokens of an HTML document and can, in many cases, detect patterns of this kind when given several documents based on the same template. For blogs, you might get those documents by, for example, looking at paginated pages. You would probably still have to pick out which of the repeated patterns are the ones of interest for each site.

For blogs, I would probably look for a feed link in the header of the blog and use a feed parsing library to parse out the permalinks for each article. Crawl those and run boilerpipe on each article page (this is only necessary because lots of blogs don't include the full text in the RSS/Atom feed). Lots of blogs don't include the full text on the main page either, so I'd focus on methods of identifying the permalinks, and go from there.
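To make the permalink step concrete, here is a minimal sketch that pulls the article links out of a feed that has already been downloaded as a string. It uses only the JDK's built-in XML parser rather than a dedicated feed library, and the class and method names (FeedPermalinks, extractPermalinks) are made up for illustration. It assumes the first link element inside each RSS item or Atom entry is the permalink, which is a simplification (Atom entries can carry several links with different rel values).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of boilerpipe or any feed library.
final class FeedPermalinks {

    /** Returns one link per RSS <item> or Atom <entry> found in the feed XML. */
    static List<String> extractPermalinks(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds come from untrusted sites, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(feedXml)));

            List<String> links = new ArrayList<>();
            collect(doc.getElementsByTagName("item"), links);  // RSS 2.0 entries
            collect(doc.getElementsByTagName("entry"), links); // Atom entries
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    private static void collect(NodeList entries, List<String> out) {
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            NodeList linkNodes = entry.getElementsByTagName("link");
            if (linkNodes.getLength() == 0) {
                continue; // entry without a link, nothing to crawl
            }
            // Assumption: the first <link> of the entry is the permalink.
            Element link = (Element) linkNodes.item(0);
            // Atom uses <link href="..."/>, RSS uses <link>...</link>.
            String url = link.hasAttribute("href")
                    ? link.getAttribute("href").trim()
                    : link.getTextContent().trim();
            if (!url.isEmpty()) {
                out.add(url);
            }
        }
    }
}
```

Each URL returned by FeedPermalinks.extractPermalinks(feedXml) would then be fetched and passed to boilerpipe separately, so every article comes back as its own text.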
