Easiest tool (Windows Platform) to crawl the web and save words?

I want to crawl web pages and save the keywords with their frequencies. For example, I want to crawl the Arts category at http://www.dmoz.org/Arts/ and save a list of words with their frequencies. So I want output like the following:

Word Frequency
Movies 400
Song 100
magazine 120

What is the simplest way to achieve that? Any tool or library in any language would be a great help.

1 Answer

Ok, here we go.

(minor edits, mostly for grammar, 20110316)

I can only spare the time to show you the simplest, non-production-ready solution to the problem. If you need a one-off solution, this should save you a lot of time. If you're looking for a production-level tool, you'll want to do this entirely differently, especially how you boil the HTML down to plain text. Just search here on SO for "awk html parser" to see how wrong this solution is ;-) (more about this below) ... Anyway ...

1 -- spider/capture text to files

wget -nc -S -r -l4 -k -np -w10 --random-wait http://www.dmoz.org/Arts/
     # -nc            no-clobber: don't re-download files you already have
     # -S             print the server's response headers
     # -r             recursive
     # -l4            recurse at most 4 levels deep
     # -k             convert links in the downloaded HTML to point to the local files
     # -np            no-parent: don't ascend to the parent directory
     # -w10           wait 10 seconds between requests
     # --random-wait  vary that 10-second wait randomly between requests

This will put all the www.dmoz.org files in a dir-structure in your current directory, starting with www.dmoz.org at the top. Cd down into it to see the raw data if you like.
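
If you want a quick sanity check on what the spider actually fetched (assuming the mirror landed under www.dmoz.org/ as described), something like this shows how much you got and a few of the paths:

find www.dmoz.org -name '*.html' | wc -l    # how many pages were fetched
find www.dmoz.org -name '*.html' | head -5  # peek at a few of the file paths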

2 -- make a bare-bones HTML stripper script like this:

$: cat rmhtml3

#! /bin/awk -f
{
        gsub(/[{<].*[>}]/, "")       # crudely strip anything bracketed by < > (or { })
        gsub("&nbsp;", "")           # drop non-breaking-space entities
        gsub(/[ \t][ \t]*/, " ")     # squeeze runs of whitespace to a single space
        if ($0 !~ /^[ \t]*$/) {      # only print lines that still have something on them
                print $0
        }
}
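
To run it the way step 3 does, just make the script executable and point it at one of the saved pages (the path here is only an example; use whatever .html file wget gave you):

chmod +x rmhtml3
./rmhtml3 www.dmoz.org/Arts/index.html | head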

This will bring the "don't even think about parsing HTML in awk" police down on us ;-), so maybe someone will recommend a simple command-line XSLT processor (or something similar) that does a cleaner job than the above. I've only figured some of this out recently and am still looking for proper solutions that fit into a Unix scripting environment. Or you can check the open-source web crawlers listed in the Wikipedia entry for web crawlers.
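
For example (just a suggestion, not something I've wired into the pipeline below): if you have lynx installed, its -dump mode renders a page to plain text and -nolist suppresses the trailing list of links, which is a much saner HTML-to-text step than the awk hack above:

lynx -dump -nolist www.dmoz.org/Arts/index.html

or, swapped into step 3 in place of ./rmhtml3:

find . -name '*.html' -exec lynx -dump -nolist {} \;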

3 -- make a big unix-pipeline to get the output you want.

find . -name '*.html' | xargs ./rmhtml3 \
| awk 'BEGIN {RS=" ";};{ print $0}' \
| sort | uniq -c \
| sort -n | tail -50

You can easily take this apart and see what each phase adds to the process.

The unusual bit is

awk 'BEGIN{RS=" ";};{print $0}'

This resets the awk record separator (RS) to the space character, so that each word is printed on its own line.

Then it is easy to sort the words, count the unique items, sort numerically by the leading count, and display only the last 50 entries. (Obviously, you can change that to any number you find useful.)
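
A tiny toy run of the core of that pipeline (the words are made up, just to show the shape of the output):

printf 'movies song movies magazine' \
| awk 'BEGIN {RS=" ";};{ print $0}' \
| sort | uniq -c | sort -n

prints something like

      1 magazine
      1 song
      2 movies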

If you don't like looking at all the noise words (the, at, it, etc.), put those words in a file and use

.... | fgrep -vif skipwords | sort | uniq -c ...
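
A skipwords file is just one word per line, for example:

$: cat skipwords
the
at
it
a
an
and
of
to

Note that fgrep matches substrings, so "the" will also knock out lines containing "theatre", "other", and so on; if that matters, grep -vwiFf skipwords (word match) is stricter.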

4 -- Looking at the output after letting the spider run for half an hour, I can see a few other things you'll want to add to the pipeline, which is left as an exercise for you ;-) A couple of hints:

   sort -f   # fold case (treat upper and lower case the same) while sorting
   sed 's/[,]//g'   # delete all commas. Add any other chars you find inside the []
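
One way those hints might slot into the step 3 pipeline (just a sketch; here I fold case by lowercasing everything with tr instead of leaning on sort's case handling, and the character list inside the sed brackets is only a starting point):

find . -name '*.html' | xargs ./rmhtml3 \
| awk 'BEGIN {RS=" ";};{ print $0}' \
| tr 'A-Z' 'a-z' \
| sed 's/[,.():;"]//g' \
| fgrep -vif skipwords \
| sort | uniq -c \
| sort -n | tail -50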

I hope this helps.