Bash shell script to find Robots meta tag value

I found this Bash script, which checks the HTTP status of each URL listed in a text file and prints the destination URL when there is a redirect:

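#!/bin/bash
# $1: text file with one URL per line
while IFS= read -r url
do
    dt=$(date '+%H:%M:%S')
    # HEAD request only: record the status code and the redirect target, if any
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url")
    echo "$url $urlstatus $dt" >> urlstatus.txt

done < "$1"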

I'm not very good at Bash. For each URL, I'd also like to record the value of its robots meta tag, if it exists.

There are 2 best solutions below

Best answer

I'd really suggest a DOM parser (e.g. Nokogiri, hxselect, etc.), but you can do something like this, which handles lines containing a <meta tag and extracts the value of the content attribute of the robots tag:

curl -s "$url" | sed -n 's/.*<meta[[:space:]][[:space:]]*name="*robots"*[[:space:]][[:space:]]*content="*\([^">]*\)"*.*/\1/p'

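This prints the value of the attribute. If no matching tag is found it prints nothing, so a command substitution yields the empty string. The match is case-sensitive and expects name to come before content on a single line.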

Do you need a pure Bash solution? Or do you have sed?
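The sed approach above can be made a little more tolerant. The following is a minimal sketch, not a replacement for a real HTML parser: the function name robots_meta and the "-" placeholder are illustrative choices, and it assumes the attribute values are quoted. It joins lines first, so a tag split across lines is still found, ignores case, and accepts name and content in either order.

```shell
#!/bin/bash
# robots_meta (hypothetical helper): read HTML on stdin and print the content
# value of the first <meta name="robots"> tag, or "-" if none is found.
# Assumption: attribute values are quoted with " or '.
robots_meta() {
    local tag value
    # Join lines, isolate each <meta ...> tag on its own line,
    # then keep the first one whose name attribute is "robots".
    tag=$(tr '\n\r' '  ' | grep -oiE '<meta[^>]*>' | grep -iE "name=[\"']robots[\"']" | head -n 1)
    # Extract the quoted value of the content attribute from that tag.
    value=$(printf '%s\n' "$tag" | sed -n "s/.*[Cc][Oo][Nn][Tt][Ee][Nn][Tt]=[\"']\([^\"']*\)[\"'].*/\1/p")
    # Fall back to "-" when no tag or no content attribute was found.
    printf '%s\n' "${value:--}"
}
```

It would be used as curl -s "$url" | robots_meta inside the loop. Because it only looks at the text of the page, it will also pick up a tag that sits inside an HTML comment, which a DOM parser would not.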

Another answer

You can add a line that extracts the robots meta tag from the page source, and modify the echo line to record it:

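#!/bin/bash
# $1: text file with one URL per line
while IFS= read -r url
do
    dt=$(date '+%H:%M:%S')
    # HEAD request only: record the status code and the redirect target, if any
    urlstatus=$(curl -kH 'Cache-Control: no-cache' -o /dev/null --silent --head --write-out '%{http_code} %{redirect_url}' "$url")
    # Second request fetches the body; keep the source line(s) holding the robots meta tag
    metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots")
    echo "$url $urlstatus $dt $metarobotsheader" >> urlstatus.txt
done < "$1"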

This version records the whole source line that contains the robots meta tag.

To record a "-" when the page has no robots meta tag, replace the metarobotsheader line with this one:

    metarobotsheader=$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots" || echo "-")

To extract just the value of the content attribute, replace that line with the one below. The capture group takes everything up to the closing quote, so a multi-directive value such as noindex,nofollow is kept whole:

    metarobotsheader="$(curl -kH 'Cache-Control: no-cache' --silent "$url" | grep -P -i "<meta.+robots" | perl -e '$line = <STDIN>; if ( $line =~ m#content=[\x27"]?([^\x27">]+)[\x27"]?#i) { print "$1"; } else {print "no_meta_robots";}')"

When the page has no robots meta tag, no_meta_robots is recorded instead.