Web Crawler vs HTML Parser

1k Views Asked by Ahmed Sakr At 14 November 2018 at 16:40

What is the difference between a web crawler and a parser?

In Java there are several libraries for fetching web pages. For example, Nutch is described as a crawler and jsoup as a parser.

Do they serve the same purpose?

Are they interchangeable for the same job?

Thanks


There are 2 best solutions below

rzo1 On 10 December 2018 at 10:20 BEST ANSWER

The jsoup library is a Java library for working with real-world HTML. It can fetch and parse HTML. However, it is not a web crawler by itself, because it fetches only one page at a time. To crawl with it, you would have to write a custom program (a crawler) around jsoup that fetches a page, extracts its URLs, and then fetches those URLs in turn.

A web crawler uses an HTML parser to extract URLs from a previously fetched page and adds each newly discovered URL to its frontier (the queue of URLs still to be visited).
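To make the "parser" half concrete, here is a minimal sketch of the link-extraction step using only the Java standard library. The class name `LinkExtractor` and the method `extractLinks` are made up for this illustration, and the regular expression is only a crude stand-in for a real HTML parser such as jsoup; it will miss or misread links in messy markup.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Crude stand-in for an HTML parser (assumption for this sketch):
    // matches href="..." or href='...' inside <a ...> tags.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the absolute URLs of all links found in the given HTML.
    public static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            if (href.isEmpty() || href.startsWith("#")) {
                continue; // skip empty links and same-page anchors
            }
            try {
                // Resolve relative links against the URL of the page itself.
                links.add(base.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it rather than abort the whole page.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/about\">About</a> "
                + "<a href='https://example.org/x'>X</a></p>";
        System.out.println(extractLinks(html, "https://example.com/index.html"));
        // prints [https://example.com/about, https://example.org/x]
    }
}
```

Note that this code only turns one HTML string into a list of URLs. It does not decide what to fetch next; that decision is the crawler's job.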

A general sequence diagram of a Web crawler can be found in this answer: What sequence of steps does crawler4j follow to fetch data?

To summarize it:

An HTML parser is a necessary component of a web crawler: it parses the given HTML input and extracts URLs from it. However, an HTML parser alone is not a web crawler, because it lacks necessary features such as keeping track of previously visited URLs, politeness (limiting how often each host is requested), etc.
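The features a crawler adds on top of a parser can be sketched as follows. This is a simplified illustration, not how Nutch or crawler4j are implemented: the class `MiniCrawler` and its parameters are hypothetical, and the `fetchAndParse` function stands for "download the page and let an HTML parser return its links". In the demo it is backed by an in-memory map, so nothing is actually fetched.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class MiniCrawler {
    // Hypothetical hook: given a URL, fetch the page and return the links
    // an HTML parser found in it.
    private final Function<String, List<String>> fetchAndParse;
    private final long delayMillis;                          // politeness delay per host
    private final Map<String, Long> lastHit = new HashMap<>();

    public MiniCrawler(Function<String, List<String>> fetchAndParse, long delayMillis) {
        this.fetchAndParse = fetchAndParse;
        this.delayMillis = delayMillis;
    }

    // Breadth-first crawl from the seed; returns URLs in the order visited.
    public Set<String> crawl(String seed, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();         // URLs still to visit
        Set<String> visited = new LinkedHashSet<>();         // URLs already visited
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages
                && !Thread.currentThread().isInterrupted()) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                                    // seen before, skip
            }
            waitForHost(url);
            for (String link : fetchAndParse.apply(url)) {
                if (!visited.contains(link)) {
                    frontier.add(link);                      // newly discovered URL
                }
            }
        }
        return visited;
    }

    // Politeness: leave at least delayMillis between requests to the same host.
    private void waitForHost(String url) {
        String host = URI.create(url).getHost();
        Long last = lastHit.get(host);
        if (last != null) {
            long wait = last + delayMillis - System.currentTimeMillis();
            if (wait > 0) {
                try {
                    Thread.sleep(wait);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();      // stop the crawl loop
                }
            }
        }
        lastHit.put(host, System.currentTimeMillis());
    }

    public static void main(String[] args) {
        // Tiny in-memory "web" (made-up URLs): page -> links on that page.
        Map<String, List<String>> web = Map.of(
                "https://example.com/", List.of("https://example.com/a", "https://example.com/b"),
                "https://example.com/a", List.of("https://example.com/", "https://example.com/b"),
                "https://example.com/b", List.of());
        MiniCrawler crawler = new MiniCrawler(u -> web.getOrDefault(u, List.of()), 100);
        System.out.println(crawler.crawl("https://example.com/", 10));
        // prints [https://example.com/, https://example.com/a, https://example.com/b]
    }
}
```

The parser never appears in this class except through `fetchAndParse`, which reflects the division of labour described above: the parser extracts URLs, while the crawler owns the frontier, the visited set and the request pacing.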

maio290 On 14 November 2018 at 16:45

This is easily answered by looking it up on Wikipedia:

A parser is a software component that takes input data (frequently text) and builds a data structure

https://en.wikipedia.org/wiki/Parsing#Computer_languages

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

https://en.wikipedia.org/wiki/Web_crawler
