I am trying to do a little project on a small-ish WARC file. I used this command:
[ ! -f course.warc.gz ] && wget -r -l 3 "https://www.ru.nl/datascience/" --delete-after --no-directories --warc-file="course" || echo Most likely, course.warc.gz already exists
The first time I ran it, everything went fine; I got over 150 pages' worth, amazing. Now I wanted to redo it from scratch, so I deleted the file 'course.warc.gz'. The problem is that when I run the same command now, I get only 3 pages: the page I requested, plus two robots.txt-related records. Why is this happening?
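In case it matters, here is one way to count what actually landed in the archive, by counting WARC response records (a rough check; it assumes the standard WARC-Type record headers that wget writes):

# Count response records in the (gzipped) WARC:
zcat course.warc.gz | grep -ac "^WARC-Type: response"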
The robots.txt includes the following rule:
It's difficult to say what happened during the previous run of wget; maybe the robots.txt changed in the meantime? wget follows the Robot Exclusion Standard by default during recursive retrieval, so if that rule disallows the pages you want, the crawl will stop after the first page.
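You can check what the crawler is currently being told, and, if the site's terms permit it, tell wget to ignore robots.txt entirely via its documented `-e robots=off` switch. A minimal sketch, reusing your command from above (curl is just one way to fetch the file):

# Inspect the current crawl rules:
curl https://www.ru.nl/robots.txt

# Same command as before, but ignoring robots.txt (only if you are allowed to):
[ ! -f course.warc.gz ] && wget -r -l 3 -e robots=off "https://www.ru.nl/datascience/" --delete-after --no-directories --warc-file="course" || echo "Most likely, course.warc.gz already exists"

With robots=off, wget should also stop fetching robots.txt during recursion, so the two extra robots records ought to disappear from the archive as well.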