What is the difference between web crawler and parser?
In java there are some name for fetching libraries . For example , they name nutch as a crawler and jsoup as a parser .
Are they do the same purpose?
Are they fully similar for the job?
thanks
What is the difference between web crawler and parser?
In java there are some name for fetching libraries . For example , they name nutch as a crawler and jsoup as a parser .
Are they do the same purpose?
Are they fully similar for the job?
thanks
On
This is easily answered by looking this up on Wikipedia:
A parser is a software component that takes input data (frequently text) and builds a data structure
https://en.wikipedia.org/wiki/Parsing#Computer_languages
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an [Internet bot] that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
The
jsouplibrary is a Java library for working with real-world HTML. It is capable of fetching and working with HTML. However, it is not a Web-Crawler in general as it is only capable of fetching one page at a time (without writing a custom program (=crawler) usingjsoupto fetch, extract and fetch new urls).A Web crawler uses a HTML parser to extract URLs from a previously fetched Website and adds this newly discovered URL to its frontier.
A general sequence diagram of a Web crawler can be found in this answer: What sequence of steps does crawler4j follow to fetch data?
To summarize it:
A HTML parser is a necessary component of a Web crawler for parsing and extracting URLs from given HTML input. However, a HTML parser alone, is not a Web crawler as it lacks some necessary features such as maintaining previously visted URLs, politeness, etc.