How to download a subset of Amazon Common Crawl (only the text (WET files?) is needed)

389 Views Asked by UriCS At 17 December 2014 at 20:09

For research purposes, I want a large set (~100K) of web pages, though I am only interested in their text. I plan to use them for a gensim LDA topic model. Common Crawl seems like a good place to start, but I am not sure how to go about it. Could someone point me to how to download 100K text files, or how to access them (if that is easier than downloading them)?


There is 1 best solution below

UriCS On 17 December 2014 at 21:42 BEST ANSWER

It seems it is possible to download only part of the dataset (you can just select the month you want), and to download only the extracted text (the WET files). For example, you can download the August 2014 crawl data from: http://blog.commoncrawl.org/2014/09/august-2014-crawl-data-available/ and an explanation of the file format can be found here: http://blog.commoncrawl.org/2014/04/navigating-the-warc-file-format/
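As a rough sketch of what that could look like in Python (standard library only): each monthly crawl publishes a gzipped listing of its WET files, so you can fetch a few of those files and read the plain-text records out of them. The download host `https://data.commoncrawl.org/`, the crawl ID `CC-MAIN-2014-35` (which I believe corresponds to August 2014) and the `wet.paths.gz` listing name are my assumptions and should be checked against the announcement post linked above. The function names are made up for illustration, and the record reader is a deliberately minimal parser rather than a full WARC implementation.

```python
import gzip
import shutil
import urllib.request
from typing import BinaryIO, Iterator, List, Tuple

# Assumptions (not stated in the answer above): download host and crawl ID.
BASE_URL = "https://data.commoncrawl.org/"
CRAWL_ID = "CC-MAIN-2014-35"  # believed to be the August 2014 crawl


def wet_paths_url(crawl_id: str = CRAWL_ID) -> str:
    """URL of the gzipped listing of WET files for one crawl (assumed layout)."""
    return f"{BASE_URL}crawl-data/{crawl_id}/wet.paths.gz"


def list_wet_urls(crawl_id: str = CRAWL_ID, limit: int = 1) -> List[str]:
    """Fetch the listing and return full URLs for the first `limit` WET files."""
    with urllib.request.urlopen(wet_paths_url(crawl_id)) as resp:
        lines = gzip.decompress(resp.read()).decode("utf-8").splitlines()
    return [BASE_URL + line.strip() for line in lines[:limit]]


def download(url: str, dest: str) -> None:
    """Stream one remote file to disk."""
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


def iter_wet_records(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) for every 'conversion' record in a decompressed WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        # Read the record headers up to the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The body is exactly Content-Length bytes long.
        body = stream.read(int(headers.get("content-length", 0)))
        # 'conversion' records hold the extracted page text; skip 'warcinfo' etc.
        if headers.get("warc-type") == "conversion":
            yield headers.get("warc-target-uri", ""), body.decode("utf-8", "replace")


def read_texts(wet_gz_path: str, limit: int) -> List[str]:
    """Return the text of the first `limit` pages in a local .warc.wet.gz file."""
    texts = []
    with gzip.open(wet_gz_path, "rb") as fh:
        for _url, text in iter_wet_records(fh):
            texts.append(text)
            if len(texts) >= limit:
                break
    return texts
```

To use it, call `list_wet_urls(limit=3)`, pass each URL to `download(url, "some_local_name.warc.wet.gz")`, then call `read_texts(...)` on the local files until you have collected about 100K documents. Since a single WET file contains many pages, a handful of files is probably enough. The resulting list of strings can then be tokenized and fed to gensim.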
