How to get webpage text from Common Crawl?


Using Common Crawl, is there a way to download the raw text of all pages from a particular domain (e.g., wisc.edu)? I am only interested in the text, for NLP purposes such as topic modeling.


There are 2 answers below.


Common Crawl provides two indexes that allow you to pick out arbitrary WARC records:

  1. the CDX index (https://index.commoncrawl.org/), to search records by URL (prefix) or domain name (a minimal query is sketched right after this list)
  2. the columnar index, which additionally allows records to be selected efficiently by metadata (e.g., content type or language)
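
For a quick look at what the CDX index returns, you can query it directly over HTTP. The sketch below (Python with the requests library) asks the index of a single crawl for wisc.edu captures; the collection name CC-MAIN-2020-45 is only an example, so substitute whichever crawl you need:

    import json
    import requests

    # Query the CDX index of one crawl (the collection name is an example;
    # the list of available crawls is published at https://index.commoncrawl.org/).
    API = "https://index.commoncrawl.org/CC-MAIN-2020-45-index"
    params = {
        "url": "wisc.edu/*",   # URL prefix; matchType=domain would also cover subdomains
        "output": "json",      # one JSON object per line
        "limit": 5,
    }
    resp = requests.get(API, params=params)
    resp.raise_for_status()

    for line in resp.text.splitlines():
        record = json.loads(line)
        # Each result points into a WARC file via filename, offset and length.
        print(record["url"], record["filename"], record["offset"], record["length"])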

To download all WARC records of a single domain you could use

  • cdx-toolkit, e.g.
    cdxt -v --cc --from=20201001000000 --to=20201101000000 --limit 10 warc 'wisc.edu/*'
    
    downloads 10 WARC records from the University of Wisconsin archived by Common Crawl during October 2020 and writes them into a local WARC file (fetching a single record by hand is sketched after this list).
  • to scale up and process millions of WARC records, consider using the columnar index in combination with Spark; see the projects cc-index-table and cc-pyspark for examples.
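
For the second half of the pipeline, here is a rough sketch of fetching one record by hand: the CDX index gives you filename, offset and length for each capture, so you can retrieve just that slice of the WARC file with an HTTP Range request and parse it with the warcio library. The data.commoncrawl.org prefix and the field names follow the CDX API's JSON output; treat this as a sketch rather than a finished tool:

    import io
    import requests
    from warcio.archiveiterator import ArchiveIterator

    def fetch_warc_record(filename, offset, length):
        """Fetch a single WARC record via an HTTP Range request and return its HTML payload."""
        offset, length = int(offset), int(length)
        url = "https://data.commoncrawl.org/" + filename
        headers = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        # The requested byte range is itself a small gzipped WARC file,
        # so warcio can iterate over it directly.
        for record in ArchiveIterator(io.BytesIO(resp.content)):
            if record.rec_type == "response":
                return record.content_stream().read()   # raw HTML bytes
        return None

    # html = fetch_warc_record(rec["filename"], rec["offset"], rec["length"])
    # Plain text for topic modeling would then come from an HTML-to-text step,
    # e.g. BeautifulSoup's get_text().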

No, I don't think there is any easy way to partition the data set by source without parsing all of it.

The URLs in each WARC file appear to be sorted alphabetically, but if you are searching for something near the end of the alphabet, like www.wisc.edu, you will have to examine nearly all of the URLs before you find the ones you want to target.

tripleee$ zgrep ^WARC-Target-URI: CC-MAIN-20201020021700-20201020051700-00116.warc.gz | head
WARC-Target-URI: http://024licai.com/index.php/art/detail/id/6007.html
WARC-Target-URI: http://024licai.com/index.php/art/detail/id/6007.html
WARC-Target-URI: http://024licai.com/index.php/art/detail/id/6007.html
WARC-Target-URI: http://04732033888.com/mrjh/1508.html
WARC-Target-URI: http://04732033888.com/mrjh/1508.html
WARC-Target-URI: http://04732033888.com/mrjh/1508.html
WARC-Target-URI: http://04nn.com/a/cp/lvhuagai/123.html
WARC-Target-URI: http://04nn.com/a/cp/lvhuagai/123.html
WARC-Target-URI: http://04nn.com/a/cp/lvhuagai/123.html
WARC-Target-URI: http://0551ftl.com/0551ftl_196119_138772_338002/

(This example is from one of the first files of the October 2020 dump.)
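
If you do go the brute-force route, the scan itself is simple, just slow: stream every WARC file and keep only the records whose target URI belongs to the domain. A minimal sketch, assuming the warcio library and a locally downloaded file:

    from warcio.archiveiterator import ArchiveIterator

    # Brute-force scan of one local WARC file, keeping only wisc.edu captures.
    # A full crawl consists of tens of thousands of such files.
    with open("CC-MAIN-20201020021700-20201020051700-00116.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            uri = record.rec_headers.get_header("WARC-Target-URI") or ""
            host = uri.split("//", 1)[-1].split("/", 1)[0]
            if host == "wisc.edu" or host.endswith(".wisc.edu"):
                html = record.content_stream().read()
                # ... extract text and hand it to the NLP pipeline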

The whole point of the Common Crawl is to pull together results from many different places. A much less resource-intensive path is probably to examine what archive.org has on file for this domain.

That only covers one specific server (www.wisc.edu); there seem to be a large number of subdomains, like mcburney.wisc.edu, sohe.wisc.edu, etc.
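
The Wayback Machine exposes a CDX query API of its own, which can list captures for a whole domain, subdomains included. A minimal sketch, assuming the standard web.archive.org/cdx/search/cdx endpoint and the requests library:

    import requests

    # List Wayback Machine captures for wisc.edu and all of its subdomains.
    params = {
        "url": "wisc.edu",
        "matchType": "domain",       # includes mcburney.wisc.edu, sohe.wisc.edu, ...
        "output": "json",
        "fl": "original,timestamp",
        "limit": 20,
    }
    resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params)
    resp.raise_for_status()

    rows = resp.json()
    header, captures = rows[0], rows[1:]   # the first row holds the field names
    for original, timestamp in captures:
        print(timestamp, original)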

Of course, if you are lucky, somebody will already have partitioned or indexed the Common Crawl material and can offer you a map of where to find your specific domain, but I am not aware of any such index. My expectation would be that those who do that sort of thing generally neither want, nor expect others to want, to examine the material from that particular angle.