Why is Google crawling pages blocked by my robots.txt?


I have a two-part question about the number of pages Google crawls on my site, its possible relation to duplicate content, and the impact on SEO.

Facts on my number of pages and pages crawled by Google

I launched a new website two months ago. Today it has close to 150 pages (and that number grows every day); at least, that is the number of pages in my sitemap.

If I look at "Crawl stats" in Google Webmaster Tools, I can see that the number of pages Google crawls every day is much bigger (see image below): Google crawled up to 903 pages in a single day.

I'm not sure that's a good thing, actually: not only does it put extra load on my server (5.6 MB downloaded for 903 pages in one day), but I'm afraid it creates duplicate content as well.

I checked on Google (site:mysite.com) and it returns 1,290 pages, but only 191 are shown unless I click "repeat the search with the omitted results included". Let's suppose those 191 are the ones in my sitemap. (I think I have a duplicate-content problem affecting around 40 pages, but I just updated the website to fix that.)

Facts on my robots.txt

I use a robots.txt file to disallow all crawlers from visiting pages with URL parameters (see the robots.txt below) and also "tags" pages.

User-Agent: *
Disallow: /administrator
Disallow: *?s
Disallow: *?r
Disallow: *?c
Disallow: *?viewmode
Disallow: */tags/*
Disallow: *?page=1
Disallow: */user/*
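
(For reference, Googlebot treats `*` in these rules as a wildcard matching any run of characters, which goes beyond the original robots.txt specification. A minimal sketch of that matching logic in Python, using a hypothetical `is_blocked` helper, to sanity-check which paths the rules above catch:)

```python
import re

def is_blocked(path, disallow_rules):
    """Check a URL path against Disallow rules using Google-style
    matching: '*' matches any run of characters, a trailing '$'
    anchors the rule to the end of the URL, and rules otherwise
    match as prefixes."""
    for rule in disallow_rules:
        anchored = rule.endswith("$")
        body = rule[:-1] if anchored else rule
        # Escape regex metacharacters, then restore '*' as a wildcard
        pattern = "^" + re.escape(body).replace(r"\*", ".*")
        if anchored:
            pattern += "$"
        if re.match(pattern, path):
            return True
    return False

# A few of the rules from the robots.txt above
rules = ["/administrator", "*?s", "*?viewmode", "*/tags/*", "*?page=1"]

print(is_blocked("/tags/Advertising/writing", rules))  # True
print(is_blocked("/blog/my-article", rules))           # False
```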

The most important rule is the one for tags. Tag URLs look like this:

www.mysite.com/tags/Advertising/writing

It is blocked by robots.txt (I've checked with Google Webmaster Tools), but it is still present in Google search results (you need to click "repeat the search with the omitted results included" to see it).

I don't want those pages to be crawled because they are duplicate content (each one is essentially a search on a keyword), which is why I put them in robots.txt.

Finally, my questions are:

Why is Google crawling the pages that I blocked in robots.txt?

Why is Google indexing pages that I have blocked? Does Google consider those pages duplicate content? If so, I guess that's bad for SEO.

EDIT: I'm NOT asking how to remove the pages indexed in Google (I know the answer already).

1 Answer

Why is Google crawling the pages that I blocked in robots.txt? Why is Google indexing pages that I have blocked?

They may have crawled those pages before you blocked them. You have to wait until they re-read your updated robots.txt file and then update their index accordingly; there is no set timetable for this, but it is typically longer for newer websites. Note also that robots.txt blocks crawling, not indexing: Google can still keep a blocked URL in its index (without recrawling its content) if other pages link to it.

Are those pages considered as duplicate content?

You tell us. Duplicate content is when identical or nearly identical content appears on two or more pages. Is that happening on your site?

Blocking duplicate content is not the way to solve that problem; you should be using canonical URLs. Blocking pages means you're linking to "black holes" in your website, which hurts your SEO efforts. A canonical URL avoids this and consolidates ranking credit, including links pointing at any of the duplicate pages, onto the canonical page.
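
As a concrete sketch (the target URL here is hypothetical, since which page each tag page duplicates depends on your site), each tag page would carry a canonical link element in its `<head>` pointing at the preferred version:

```html
<!-- In the <head> of www.mysite.com/tags/Advertising/writing -->
<link rel="canonical" href="https://www.mysite.com/some-original-page">
```

Unlike a robots.txt block, Google has to be able to crawl the page to see this element, so the tag pages would need to stay crawlable for canonicalization to work.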