On an existing .NET MVC3 site, we implemented paging where the URL looks something like www.mysite.com/someterm/anotherterm/_p/89/10, where 89 is the page number and 10 is the number of results per page.
Unfortunately, rel="nofollow" was missing from page-number links greater than 3, and those pages were also missing <meta name="robots" content="noindex,nofollow" />.
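For reference, this is roughly the kind of view code that was missing, assuming a Razor view whose model exposes a hypothetical PageNumber property:

@* Sketch only: emit noindex,nofollow on deep pages (Model.PageNumber is a hypothetical property) *@
@if (Model.PageNumber > 3)
{
    <meta name="robots" content="noindex,nofollow" />
}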
The problem is that Google and a few other search engines have now indexed those pages and are attempting to crawl all of them, quite frequently, which, as we found, started having a drastic impact on the production DB server. We don't want all of those additional thousands of pages crawled, only the first few.
For now I have reverted the code to a version of the site that does not include paging, so that our DB server won't be hit so hard. While the search engines will get 404 errors for all of those pages, I want to know whether this is the best thing to do, since after a while I will reintroduce the paging.
I could add the following to web.config to have all 404s redirected to the home page:
<httpErrors errorMode="Custom">
  <remove statusCode="404" />
  <error statusCode="404" path="/" responseMode="ExecuteURL" />
</httpErrors>
But I'm thinking that doing this would be treated as "duplicate content" for all of those paginated URLs.
Is the best idea here to just let those 404s continue for a week or two and then reintroduce the paging site?
Another option may be to release the paging site again with some code added to reject crawlers on pages greater than 3, roughly as sketched below. Suggestions?
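Something along these lines is what I have in mind, using Request.Browser.Crawler for the crawler check; the action name, parameters and the threshold of 3 are only illustrative:

using System.Web.Mvc;

public class SearchController : Controller
{
    // Illustrative paging action: refuse deep pages to crawlers so they stop hammering the DB.
    public ActionResult Results(string someterm, string anotherterm, int page = 1, int pageSize = 10)
    {
        if (page > 3 && Request.Browser.Crawler)
        {
            // Tell crawlers these deep pages are off limits instead of querying the database.
            return new HttpStatusCodeResult(410, "Gone");
        }

        // ... normal paged query and view rendering ...
        return View();
    }
}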
Is there a quicker way of getting those pages out of the indices so they won't be crawled?
Thanks.
Simply leaving the pages as 404s wouldn't do, because a 404 does not signal a permanent removal. Looking at RFC 2616, Hypertext Transfer Protocol – HTTP/1.1, chapter 10 Status Code Definitions: a 404 (Not Found) gives no indication of whether the condition is temporary or permanent, whereas a 410 (Gone) is expected to be considered permanent. Returning 410 is therefore the right way to tell crawlers that these URLs are gone for good.
I simply added a new ActionResult method:
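A minimal sketch of such a method, assuming it returns 410 Gone via MVC 3's HttpStatusCodeResult (the controller and method names are placeholders):

using System.Web.Mvc;

public class CrawlerCleanupController : Controller
{
    // Returns 410 Gone so crawlers treat the old paged URLs as permanently removed.
    public ActionResult RemovePageFromIndex()
    {
        return new HttpStatusCodeResult(410, "Gone");
    }
}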
and created new routes for matching "_p":
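Roughly like the following, registered in RegisterRoutes in Global.asax.cs before the default route; the route name, segment names and controller/action names are assumptions matching the sketch above:

using System.Web;
using System.Web.Mvc;
using System.Web.Routing;

public class MvcApplication : HttpApplication
{
    public static void RegisterRoutes(RouteCollection routes)
    {
        // Catch paged URLs such as /someterm/anotherterm/_p/89/10 and send them to the
        // 410 action above. Registered before the default route so it wins the match.
        routes.MapRoute(
            "RemovePagedUrls",
            "{someterm}/{anotherterm}/_p/{page}/{pageSize}",
            new { controller = "CrawlerCleanup", action = "RemovePageFromIndex" });

        // ... existing route registrations, including the default route ...
    }
}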