I want to crawl web pages and save the keywords with their frequency. For example, I want to crawl the category Arts from the URL http://www.dmoz.org/Arts/ and save a list of words with their frequency. So I want the following output:
    Word       Frequency
    Movies     400
    Song       100
    magazine   120
What is the simplest way to achieve that? Any tool or library in any language would be a great help.
Ok, here we go.
(minor edits, mostly for grammar, 20110316)
I can only spare the time to show you the simplest, non-production-ready solution to the problem. If you need a one-off solution, then this should save you a lot of time. If you're looking for a production-level tool, then you'll want to do this entirely differently, especially how you boil the html down to plain text. Just search here on SO for "awk html parser" to see how wrong this solution is ;-) (more about this below) ... Anyway ...
1 -- spider/capture text to files
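One simple way is wget in recursive mode; the particular flags here (depth limit, 1-second wait, quiet output, don't climb to the parent directory) are just my suggestions, so tune them as you like:

    wget -r -l 2 -np -w 1 -nv http://www.dmoz.org/Arts/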
This will put all the www.dmoz.org files in a directory structure under your current directory, with www.dmoz.org at the top. cd down into it to see the raw data if you like.
2 -- make a bare-bones html stripper script
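A crude awk wrapper is enough for a one-off job; the script name rmHTML.sh is a placeholder of mine, reused in the pipeline below:

    #!/bin/sh
    # rmHTML.sh -- bare-bones tag stripper: throw away anything that looks like <...>
    # (not real HTML parsing; it will miss tags that span lines, comments, scripts, etc.)
    awk '{ gsub(/<[^>]*>/, " "); print }' "$@"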
This will bring the "don't even think about parsing html in awk" police down on us ;-), so maybe someone will recommend a simple command-line xslt processor (or other tool) that will do a cleaner job than the above. I've only just figured some of this out recently and am looking for proper solutions that will fit into a unix scripting environment. Or you can check the open-source web crawlers listed in the Wikipedia entry for web crawlers.
3 -- make a big unix-pipeline to get the output you want.
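A sketch of what that pipeline could look like (rmHTML.sh is the placeholder stripper from step 2, and 50 is an arbitrary cutoff):

    find www.dmoz.org -type f -print |
      xargs ./rmHTML.sh |
      awk 'BEGIN{ RS=" " }; { print $0 }' |
      sort |
      uniq -c |
      sort -n |
      tail -50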
You can easily take this apart and see what each phase adds to the process.
The unusual bit is
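    awk 'BEGIN{ RS=" " }; { print $0 }'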
This resets the awk record separator (RS) to the space character, so that each word is printed on a separate line.
Then it is easy to sort them, get a count of unique items, sort numerically by the leading count, and only display the last 50 entries. (Obviously, you can change that to any number you feel might be useful.)
If you don't like looking at all the noise words (the, at, it, etc.), put those words in a file (one per line) and use something like:
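    # noiseWords.txt is a placeholder name; -x makes each word match a whole line,
    # so filtering "the" doesn't also throw away "theater"
    egrep -v -i -x -f noiseWords.txt

Drop that stage into the pipeline right after the word-splitting awk, before the first sort.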
4 -- I'm looking at the output after letting the spider run for half an hour, and I see some other things you'll want to add to the pipeline, which is left as an exercise for you ;-)
I hope this helps.