Using WGET to Get Links on the Page


I'm using:

wget --spider --force-html -r -l5 http://example.com 2>&1 | grep '^--' | awk '{print $3}' > urls.txt

It works well; however, it doesn't seem to extract the href= links on each page.

wget -q http://example.com -O - | \
tr "\t\r\n'" '   "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g' > urls.txt

This second one does grab the href links I'm looking for, but it doesn't spider.

I'm trying to make the first command also extract the href links on each page, or the second one spider recursively. I'm aware there are better tools for this, but I have to use wget in this example.
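One direction I've considered is letting the first command mirror the pages to disk and then running the second command's extraction over every saved file. A rough sketch of that, with the depth, the domain, and the mirror/ directory all being placeholders:

wget -q -r -l5 -np -P mirror/ http://example.com
# extract hrefs from every mirrored HTML file (mirror/ is a placeholder path)
find mirror/ -type f \( -name '*.html' -o -name '*.htm' \) -exec cat {} + | \
tr "\t\r\n'" '   "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g' | sort -u > urls.txt

That needs two passes and the full download, though, which is why I'd rather get one of the original commands to do it directly.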


There is 1 answer below.

Nihad Badalov

wget does not offer such an option. Please read its man page.

You could use lynx for this:

lynx -dump -listonly http://aligajani.com | grep -v facebook.com > file.txt

From its man page:

   -listonly
          for -dump, show only the list of links.
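Note that lynx only lists the links on the single page it is given, so an extra pass is needed if you want any spidering. A rough sketch of that, assuming lynx's numbered reference-list output, with the start URL and the file names as placeholders:

# collect links from the start page (placeholder URL), then follow each link once
lynx -dump -listonly http://example.com | awk '/^ *[0-9]+\./ {print $2}' > level1.txt
while read -r url; do
    lynx -dump -listonly "$url" | awk '/^ *[0-9]+\./ {print $2}'
done < level1.txt | sort -u > urls.txt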
