HTML5 Boilerplate build script, VCS-based deployment and duplicate content


I'm using the HTML5 Boilerplate build script on a new project that I've just deployed to a staging environment. The script works like a charm; it's well documented, so it was easy to configure for use in my application.

After reading through the documentation I decided to use Paul Irish's approach for VCS-based deployment to point to the /publish directory, using this snippet from his documentation in my .htaccess file:

RewriteEngine On
RewriteCond $1 !^publish/
RewriteRule ^(.*)$ publish/$1 [L]

I have it configured like this for my particular setup, and everything points to the minified and concatenated files just like it should. This is great, but the /publish directory is also browsable directly by going to http://[mysite.com]/publish/

This seems like kind of a loose thread to leave dangling. I'm wondering if anyone here has run into this and come up with a good solution. I'm not expecting users to type in /publish/ after the URL, but I wouldn't want it to be crawlable for sure, and it just seems a little sloppy to leave it like that.

Any ideas?

Thanks in advance

Update: after much appreciated help from Gerben, below, I ended up changing my thinking on this a bit. There is no need to redirect users from /publish to the root URL, because users won't be typing in /publish and there will never be any links to [site.com]/publish. Instead, I've added the following rule to the .htaccess file within the /publish directory. It produces a 403 (Forbidden) error for any request made directly to the publish subdirectory; the F flag is documented at http://httpd.apache.org/docs/current/rewrite/flags.html#flag_f

RewriteCond %{THE_REQUEST} publish
RewriteRule .? - [F]

In addition, I've added the publish directory to robots.txt just to be sure search bots aren't indexing two sets of files which contain the same data.

There are 2 best solutions below

Seems I misread your question. I think the following would redirect anything back to the root folder.

RewriteCond %{THE_REQUEST} " /publish/"
RewriteRule ^publish/(.*) /$1 [R=302,L]

To be sure, I would probably also add /publish to my robots.txt as disallowed, just in case you accidentally remove the .htaccess or something.

Explanation: The RewriteRule checks that the requested URL starts with publish/___ and redirects those URLs to /___. But to distinguish between direct requests to /publish and URLs that were internally rewritten to /publish, you need to examine the originally requested URL. The only way to get at that is via the THE_REQUEST variable, which holds the raw request line sent by the client. For direct requests it should contain something like GET /publish/___ HTTP/1.1. So the RewriteCond checks for the presence of <space>/publish/.
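
A minimal sketch of how the redirect could sit next to the h5bp rule in the root .htaccess, to make the explanation above concrete. It is untested against this particular setup, and the anchored pattern (method, whitespace, then /publish/) is an illustrative tightening, not something taken from the h5bp documentation. It also assumes the /publish directory has no .htaccess of its own containing rewrite rules, since by default those would be used for requests under /publish instead of the rules in the root file.

```apache
RewriteEngine On

# 1. The client itself asked for /publish/...: send it to the same path at the root.
#    THE_REQUEST is the raw request line, e.g. "GET /publish/css/style.css HTTP/1.1",
#    and it is not changed by internal rewrites.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/publish/
RewriteRule ^publish/(.*)$ /$1 [R=302,L]

# 2. Everything else is served from /publish internally (rule from the question).
RewriteCond $1 !^publish/
RewriteRule ^(.*)$ publish/$1 [L]
```

When rule 2 rewrites /css/style.css to publish/css/style.css, THE_REQUEST still reads GET /css/style.css, so rule 1 does not fire on the rewritten URL and no redirect loop should result.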

EDIT: final attempt:

RewriteBase /
RewriteCond %{ENV:REDIRECT_STATUS} ^$
RewriteRule ^publish/(.*) /$1 [R=302,L]

So the solution I ended up with was to throw users a 403 (forbidden) error on the odd chance that they happen to stumble on [site.com]/publish. Here's how I did it:

In the root directory's .htaccess file, I kept this rule from the h5bp documentation:

RewriteCond $1 !^publish/
RewriteRule ^(.*)$ publish/$1 [L]

In the /publish directory's .htaccess file I added this rule, with the F flag (forbidden):

RewriteCond %{THE_REQUEST} publish
RewriteRule .? - [F]
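
One thing to be aware of: the unanchored pattern publish matches the word anywhere in the request line, so a request such as GET /page?action=publish would also get a 403 once it has been rewritten into the /publish directory. A slightly tighter variant, offered as an untested sketch, only matches when the requested path itself begins with /publish/:

```apache
# /publish/.htaccess
RewriteEngine On

# Forbid only requests whose original request line targets /publish/...
# Internally rewritten requests keep their original request line,
# e.g. "GET /css/style.css HTTP/1.1", and so pass through.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/publish/
RewriteRule .? - [F]
```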

I hope this is helpful for anyone else who runs into this problem!