How to get webpage text from Common Crawl?

Using Common Crawl, is there a way I can download the raw text from all pages of a particular domain (e.g., wisc.edu)? I am only interested in the text for NLP purposes such as topic modeling.

No, I don't think there is any easy way to partition the data set by source without parsing all of it.
The URLs in each WARC file appear to be sorted alphabetically, but if you are searching for something near the end of the alphabet, like www.wisc.edu, you will have to examine nearly all of the URLs before you find the ones you want to target.
tripleee$ zgrep ^WARC-Target-URI: CC-MAIN-20201020021700-20201020051700-00116.warc.gz | head
WARC-Target-URI: http://024licai.com/index.php/art/detail/id/6007.html
WARC-Target-URI: http://024licai.com/index.php/art/detail/id/6007.html
WARC-Target-URI: http://024licai.com/index.php/art/detail/id/6007.html
WARC-Target-URI: http://04732033888.com/mrjh/1508.html
WARC-Target-URI: http://04732033888.com/mrjh/1508.html
WARC-Target-URI: http://04732033888.com/mrjh/1508.html
WARC-Target-URI: http://04nn.com/a/cp/lvhuagai/123.html
WARC-Target-URI: http://04nn.com/a/cp/lvhuagai/123.html
WARC-Target-URI: http://04nn.com/a/cp/lvhuagai/123.html
WARC-Target-URI: http://0551ftl.com/0551ftl_196119_138772_338002/
(This example is from one of the first files of the October 2020 dump.)
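If you do go down this route, the warcio package makes the scan easy to script. Here is a minimal sketch, assuming warcio and beautifulsoup4 are installed (pip install warcio beautifulsoup4); the file name is the one from the listing above, and the HTML-to-text step via BeautifulSoup is my own choice, not something the Common Crawl tooling prescribes:

from urllib.parse import urlparse

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

# Scan one compressed WARC file and print the plain text of every
# response captured from wisc.edu or one of its subdomains.
with open('CC-MAIN-20201020021700-20201020051700-00116.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        host = urlparse(url).netloc
        if host != 'wisc.edu' and not host.endswith('.wisc.edu'):
            continue
        html = record.content_stream().read()
        text = BeautifulSoup(html, 'html.parser').get_text(' ', strip=True)
        print(url, text[:200])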
The whole point of the Common Crawl is to pull together results from many different places. A much less resource-intensive path is probably to examine what archive.org has on file from this domain.
Note that www.wisc.edu is only one specific server; there seem to be a large number of subdomains, like mcburney.wisc.edu, sohe.wisc.edu, etc.
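The Wayback Machine exposes its holdings through the CDX API, which can list every capture under a domain, subdomains included. A small sketch, assuming the requests package; the query parameters are standard CDX-server options:

import requests

# List captures of wisc.edu and all of its subdomains (matchType=domain).
# The first row of the JSON response names the fields; limit keeps it small.
resp = requests.get(
    'https://web.archive.org/cdx/search/cdx',
    params={'url': 'wisc.edu', 'matchType': 'domain',
            'output': 'json', 'limit': 20},
    timeout=60,
)
rows = resp.json()
header, captures = rows[0], rows[1:]
for capture in captures:
    entry = dict(zip(header, capture))
    print(entry['timestamp'], entry['original'])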
Of course, if you are lucky, somebody will already have divided or indexed the Common Crawl material and can offer you a map of where to find your specific domain, but I am not aware of any such index. My expectation is that those who do that sort of thing generally neither want, nor expect others to want, to examine the material from that particular angle.
Common Crawl provides two indexes which allow you to pick out arbitrary WARC records: the URL index (a CDX-style index server at https://index.commoncrawl.org/) and the columnar index (Parquet files that can be queried with SQL tools such as Amazon Athena).
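The URL index can be queried with a plain HTTP request. A sketch, assuming the requests package and using CC-MAIN-2020-45 (the October 2020 crawl mentioned in the other answer) as the collection; the response is newline-delimited JSON:

import json

import requests

# Ask the index server for captures of wisc.edu and its subdomains.
resp = requests.get(
    'https://index.commoncrawl.org/CC-MAIN-2020-45-index',
    params={'url': 'wisc.edu', 'matchType': 'domain',
            'output': 'json', 'limit': 10},
    timeout=60,
)
for line in resp.text.splitlines():
    record = json.loads(line)
    # filename, offset and length locate the WARC record in the archive
    print(record['url'], record['filename'], record['offset'], record['length'])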
To download all WARC records of a single domain you could use cdx_toolkit, which combines the index lookup with fetching the matching records.
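Alternatively, each index entry's filename, offset and length fields let you fetch a single record with an HTTP range request; every Common Crawl WARC record is an independent gzip member, so warcio can parse the downloaded bytes directly. A sketch, assuming requests and warcio, where record is one parsed line of the index output above:

import io

import requests
from warcio.archiveiterator import ArchiveIterator

def fetch_record(record):
    # Fetch exactly the bytes of one WARC record from the public bucket.
    start = int(record['offset'])
    end = start + int(record['length']) - 1
    resp = requests.get(
        'https://data.commoncrawl.org/' + record['filename'],
        headers={'Range': 'bytes=%d-%d' % (start, end)},
        timeout=60,
    )
    # The range response is a self-contained gzipped WARC record.
    for warc_record in ArchiveIterator(io.BytesIO(resp.content)):
        if warc_record.rec_type == 'response':
            return warc_record.content_stream().read()
    return None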