Means of getting data for a given website from the Web Data Commons?

563 Views Asked by user1556658 At 27 June 2015 at 22:14

I'm trying to find interesting data inside the Web Data Commons dumps. It is taking a day to grep across them on my machine (in parallel). Is there an index of which websites are covered, and a way to extract data specifically from those sites?


There is 1 best solution below

Chris On 11 August 2015 at 21:53

To get all of the pages from a particular domain, one option is to query the Common Crawl index API:

http://index.commoncrawl.org

For example, to see how many pages of index results exist for the domain wikipedia.org:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org*/&showNumPages=true

This shows how many pages of result blocks Common Crawl has for this domain. Note that you can use wildcards, as in this example.
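The page-count query can also be scripted. The following is a minimal Python sketch, not an official client. The function names are hypothetical, and it assumes the showNumPages response is a JSON object with a "pages" key, which is worth checking against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_BASE = "http://index.commoncrawl.org"  # index site named in the answer
CRAWL_ID = "CC-MAIN-2015-11"                 # crawl used in the answer's examples


def build_index_url(url_pattern, crawl_id=CRAWL_ID, **params):
    """Build a query URL for one crawl's index endpoint (hypothetical helper)."""
    query = {"url": url_pattern}
    query.update(params)
    # Keep '*' and '/' unescaped so the URL looks like the ones in the answer.
    return f"{INDEX_BASE}/{crawl_id}-index?{urlencode(query, safe='*/')}"


def parse_num_pages(body):
    """Extract the page count from a showNumPages response.

    Assumption: the body is a JSON object containing a "pages" key.
    """
    return int(json.loads(body)["pages"])


def fetch_num_pages(url_pattern, crawl_id=CRAWL_ID):
    """Ask the index how many result pages exist (performs a network call)."""
    url = build_index_url(url_pattern, crawl_id, showNumPages="true")
    with urlopen(url) as response:
        return parse_num_pages(response.read().decode("utf-8"))
```

Calling fetch_num_pages("en.wikipedia.org/*") should return the number of pages to loop over, provided the response has the assumed shape.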

Then request each page in turn and ask Common Crawl to return a JSON object for each captured file:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=en.wikipedia.org/*&page=0&output=json

You can then parse the JSON and locate each WARC file through the filename field.
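The parsing step might look like the sketch below. It assumes the output=json response contains one JSON object per line, and that each record carries "offset" and "length" fields alongside "filename". Those two field names, the helper names, and the sample record are assumptions for illustration, not taken from the answer. The host that serves the WARC files is not given in the answer, so it is left out here.

```python
import json


def parse_index_records(body):
    """Parse an index response, assumed to hold one JSON object per line."""
    records = []
    for line in body.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


def warc_locations(records):
    """Return (filename, offset, length) for each record.

    Assumption: "offset" and "length" are present and numeric (possibly
    encoded as strings), marking where the capture sits in the WARC file.
    """
    return [
        (record["filename"], int(record["offset"]), int(record["length"]))
        for record in records
    ]


def byte_range_header(offset, length):
    """Build an HTTP Range header covering exactly one record."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


# Hypothetical sample response body with made-up values.
sample_body = (
    '{"url": "http://en.wikipedia.org/", "filename": "path/to/a.warc.gz", '
    '"offset": "100", "length": "50"}\n'
)
locations = warc_locations(parse_index_records(sample_body))
```

With the range header, a client can request only the bytes of a single capture instead of downloading a whole WARC file, which avoids grepping through full dumps.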

