TagSoup vs JSoup :: Performance?


I'm looking for a performance comparison between TagSoup and JSoup on real-world documents. So far I've been using TagSoup for HTML processing, and it works quite well. The only drawback is that, because of its SAX nature, a lot has to be done programmatically with stacks (for example, to process the text within tags). JSoup looks more concise, but I'm concerned about its performance.
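To make the "stacks" point concrete, here is a minimal sketch of the bookkeeping a SAX handler needs just to collect the text inside a given tag. It uses only the JDK's own SAX parser on a well-formed snippet so that it runs without extra libraries. The class and method names (`StackTextDemo`, `TextCollector`, `collectText`, `jdkReader`) are made up for illustration. The assumption is that the same handler could be attached to TagSoup's `org.ccil.cowan.tagsoup.Parser`, which implements `XMLReader`.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StackTextDemo {

    /** Collects the text of every element named `target`, using an explicit stack. */
    static class TextCollector extends DefaultHandler {
        private final String target;
        // One text buffer per currently open element.
        private final Deque<StringBuilder> open = new ArrayDeque<>();
        final List<String> results = new ArrayList<>();

        TextCollector(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            open.push(new StringBuilder());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (!open.isEmpty()) {
                open.peek().append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            StringBuilder text = open.pop();
            if (qName.equalsIgnoreCase(target)) {
                results.add(text.toString().trim());
            }
            // Text of a nested element also belongs to its parent.
            if (!open.isEmpty()) {
                open.peek().append(text);
            }
        }
    }

    /** Runs the collector over `markup` with whatever XMLReader is supplied. */
    static List<String> collectText(XMLReader reader, String markup, String tag) {
        try {
            TextCollector handler = new TextCollector(tag);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return handler.results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** JDK stand-in reader; it only accepts well-formed markup. */
    static XMLReader jdkReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello <b>bold</b> world</p><p>Second</p></body></html>";
        System.out.println(collectText(jdkReader(), xhtml, "p")); // [Hello bold world, Second]
    }
}
```

For comparison, the jsoup counterpart would be roughly `Jsoup.parse(html).select("p")` followed by `text()` on each element, which is where the conciseness comes from.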


There is 1 best solution below


The TagSoup website states:

There are a variety of other HTML SAX parsers written in Java, notably NekoHTML, JTidy (a port of the C library and tool HTML Tidy), and HTML Parser. All have their good and bad points: the general view around the Web seems to be that TagSoup is the slowest, but also the most robust and reliable.

I tried to write an application that would parse 5 pages with jsoup and 5 pages with TagSoup and post the timings. Unfortunately, I could not figure out how to get TagSoup 1.2.1 to parse a web page into a DOM, which makes an apples-to-apples comparison difficult.
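One possible route, not verified against TagSoup 1.2.1 here, is to feed a SAX `XMLReader` through the JDK's identity `Transformer` into a `DOMResult`. The sketch below shows that route together with a crude timing loop. It uses the JDK's own parser as a stand-in so that it runs without extra libraries. The assumption is that passing `new org.ccil.cowan.tagsoup.Parser()` as the reader would give a TagSoup-built DOM. The names `SaxToDomTiming`, `toDom`, `averageNanos` and `jdkReader` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDomTiming {

    /** Builds a W3C DOM from the SAX events produced by `reader`. */
    static Document toDom(XMLReader reader, String markup) {
        try {
            DOMResult result = new DOMResult();
            SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
            // A Transformer created without a stylesheet performs an identity copy.
            TransformerFactory.newInstance().newTransformer().transform(source, result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Average wall-clock time per run in nanoseconds, after some warm-up runs. */
    static long averageNanos(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / runs;
    }

    /** JDK stand-in reader; swap in TagSoup's Parser to measure TagSoup (assumption). */
    static XMLReader jdkReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>one</p><p>two</p></body></html>";
        // The measured time includes creating the reader on every run.
        long nanos = averageNanos(() -> toDom(jdkReader(), page), 50, 200);
        System.out.println("average ns per parse: " + nanos);
    }
}
```

The jsoup side of the comparison would time `Jsoup.parse(html)` in the same loop. jsoup builds its own `Document` type rather than a W3C DOM, so the two measurements would still not cover exactly the same work, and a loop like this is only a rough indication compared with a proper benchmark harness.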