Using X-Robots-Tag in the .htaccess file to deindex query string URLs from Google


I am looking for a solution to deindex all URLs with the query string ?te= from Google. For example, I want to deindex all URLs of the form https://example.com/?te= from Google.

Google has currently indexed 21k URLs with that query string and I want them all to be deindexed. Should I use the X-Robots-Tag to do so?

What are the possible solutions for doing that?

I have tried blocking them in robots.txt using the directive

Disallow: /*?te=

But it didn't help me out.


There is 1 answer below.


Your robots.txt solution would mostly work if you gave it enough time. Google usually stops indexing URLs it can't crawl. However, Google occasionally indexes such URLs based on external links without indexing the contents of the page.

Using X-Robots-Tag is a much better idea. It will prevent Google from indexing the pages. You will need to remove your disallow rule from robots.txt or Googlebot won't be able to crawl your URLs and see the X-Robots-Tag. You'll also need to give Googlebot time to crawl all the pages. Some pages will start getting de-indexed in a few days, but it could take months for Googlebot to get through all of them.
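For example, if the ?te= rule is the only relevant entry, the cleaned-up robots.txt would look roughly like this (a sketch; keep any other rules your site relies on):

User-agent: *
Disallow: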

If you are using Apache 2.4 or later, you can do this in .htaccess using Apache's built-in expressions:

<If "%{QUERY_STRING} =~ /te=/">
    Header set X-Robots-Tag noindex
</If>

If you are still on Apache 2.2 or earlier, you'll have to use a rewrite rule and environment variable to achieve the same effect:

RewriteEngine On
# Set an environment variable when the query string contains "te="
RewriteCond %{QUERY_STRING} te=
RewriteRule ^(.*)$ $1 [E=teinquery:1]
# Send the noindex header only when that variable is set
Header set X-Robots-Tag noindex env=teinquery

I recommend using curl on the command line to test that it is working.

curl --head "https://example.com/"

should NOT show an X-Robots-Tag: noindex line, but the following command should show it:

curl --head "https://example.com/?te=foo"