I'm making a bash script that takes all the links from a Google search results page. I fetch the page with the w3m utility using this script:
#!/bin/bash
# performs a Google search using a word given as input
word=$1
if [ -z "$word" ]   # quote $word so the test doesn't break when it's empty
then
echo "Search word missing!"
echo "Aborting..."
exit 1
fi
a="www.google.com/search?q="
search=$a$word
w3m -no-cookie "$search" > .google
sleep 1
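
For example, assuming the script is saved as google.sh (the file name is just for illustration):

./google.sh linux
# the rendered results page for "linux" is now in .google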
Next, I have to extract all the sites from this page. My idea was to take every string that starts with www. and ends with /:
grep -wo "www[^/]*" .google > .temp
The problem with this is that I miss a lot of the links that don't start with www, and at the same time I risk breaking things when a site doesn't end with /.
What better way could I get the URLs from this response?
You might want to grep for <a href=" and take the value up to the next quote symbol, then filter out all the javascript: stuff. Note that for this you need the raw HTML rather than w3m's rendered output (which strips the tags), so fetch the page with w3m -dump_source or curl. This solution is probably not fool-proof either.
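
A minimal sketch along those lines, assuming the raw HTML was saved with -dump_source and reusing the .google and .temp file names from your script:

#!/bin/bash
# Sketch: pull href values out of the raw HTML of the results page.
# Fetch the source first (the rendered w3m output contains no <a> tags):
#   w3m -no-cookie -dump_source "www.google.com/search?q=$word" > .google

grep -o '<a href="[^"]*"' .google |
sed 's/^<a href="//; s/"$//' |
grep -v '^javascript:' > .temp

Keep in mind that on a Google results page many of the extracted hrefs are redirect wrappers of the form /url?q=..., so you may want an extra sed step to unwrap and URL-decode the real target.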