.htaccess, 301 Redirects and the Subtleties of Syntax

By now who doesn’t know that duplicate content is a bad thing for SEO?  And although the universe of people who have heard of the famous “www resolve” issue is a smaller group, it’s still undoubtedly the vast majority of professional SEO consultants.

Why a Single Web Page Can Look Like Duplicate Content to Google

If you know what I’m talking about you can skip down to the next subheading.  If you need a refresher, read on.

This realization is crucial: Google does not deal in web pages; it deals in URLs.  So if you have two different URLs that point to the same page, Google will see them as two different “pages” with the same content.  In other words, Google may very well see that as duplicate content.

Enter the wild and woolly www.  Although the “www” prefix added to website domain names is a relic of those good old days when people actually called it the “World Wide Web,” its legacy continues.  Put 100 people in a room with a computer and a browser, tell them to go to the CNN website and probably half of them will enter www.cnn.com and the other half, the impatient half, the half who, like me, believe their lives will be shortened by the use of excess keystrokes, will simply enter cnn.com.  (After all, counting the “dot,” eliminating the “www” saves me four precious keystrokes – hurray for me!).

So, in order to please the whole universe, web servers will typically accept both versions of the URL and you’ll end up looking at all the news that’s fit to, well, print (at least in CNN’s biased opinion).  Additionally, web servers will often let people enter the bare domain name and actually serve a file called something like “index.html.”  So, using the CNN example, you could potentially enter any of the following four addresses into the address bar of your browser…

  • cnn.com
  • www.cnn.com
  • cnn.com/index.html
  • www.cnn.com/index.html

…and each time you would arrive at the home page of CNN.  Yet, Google thinks of 4 URLs as 4 pages, right?  Uh oh, duplicate content.  So what do we do about it?

Using Redirects to Make Sure Google Doesn’t Go Stupid on Us

If you’ve spent more than a few hours investigating the SEO implications of this, you know that the enlightened way to deal with the above scenario is the search-engine-friendly 301 redirect.  When we want to direct web traffic (and search engine spiders) from an obsolete, removed, or simply mistaken URL to a valid URL, as web developers we have the choice of either a 302 (temporary) redirect, or a 301 (permanent) redirect.  A redirect basically says, “hey, careless, you typed cnn.com into your browser, and that’s not valid, but I’m going to do you a favor and send you over to www.cnn.com instead.”

At this point you could ask all sorts of questions about the relative advantages and disadvantages of 301 vs. 302 redirects, but why waste the time?  Google blesses the 301 and damns the 302, so we know which type of redirect we want to use on our sites – All hail Google!  And if you do already know this, you know that the best way of handling this, at least on Apache web servers, is through the .htaccess file, a small text file that has big implications for how your website behaves.  (For more on what an .htaccess file is, and even how to configure it, check out the official page here.)
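To make the distinction concrete: in Apache's mod_rewrite, the choice between the two redirect types comes down to a single flag. The sketch below is only an illustration. The host www.example.com and both file names are made up, and it assumes mod_rewrite is enabled on the server.

```apache
RewriteEngine On

# Permanent move: R=301 tells browsers and spiders the old URL is gone for good.
# (old-page.html and new-page.html are hypothetical names.)
RewriteRule ^old-page\.html$ http://www.example.com/new-page.html [R=301,L]

# Temporary move: a bare R flag defaults to a 302, so leaving off "=301"
# quietly gives you the redirect type you were trying to avoid.
# (sale.html and holding-page.html are hypothetical names.)
RewriteRule ^sale\.html$ http://www.example.com/holding-page.html [R,L]
```

The practical upshot is that a snippet with a plain [R,L] is not a 301 redirect, however it was labeled where you found it.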

If you put a few lines of code into an .htaccess file, and place that file in the public root of your website (that’s the primary, top-level folder that is accessible by website visitors), then your problem with this situation is solved.

But wait.  What is that little bit of code anyway?  Since I’m not a web server guru, I’m at the mercy of what I find online for this trick, and as I recently found out, what you find online is not always exactly what you need.  Let’s take a look at an example site and see how different flavors of .htaccess code that you find online can affect it.

First, the example.  Here I checked the site I was working on, namely bicycle jersey reseller ecyclingstore.com, using the handy little redirect tool found at RagePank for just such a purpose.  Here’s what I got:

[Screen capture: Redirect Tool results]

As you can see from the screen capture, even though I had placed what I thought was the necessary code in an .htaccess file on the site, the results surprised me: the tool shows four different 200 responses (a 200 code returned by a website says, “hey, you bet we have that page, and here it is!”) on four different URLs, only one of which is a real page.

Here’s the code that was in my .htaccess file.  And this is code I find all over the Internet:

    Options +FollowSymLinks
    RewriteEngine on
    RewriteCond %{HTTP_HOST} ^ecyclingstore.com$ [NC]
    RewriteRule ^(.*)$ http://www.ecyclingstore.com/$1 [R=301,L]

 

I looked at the rewrite rule suggested on that redirect checker’s page.  It was almost, but not quite, the same:

    RewriteEngine on
    RewriteCond %{HTTP_HOST} !^www.ecyclingstore.com
    RewriteRule (.*) http://www.ecyclingstore.com/$1 [R=301,L]
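The two versions differ in more than it first appears, so here is a commented sketch of how each condition reads. It is not a diagnosis of why the first version failed on this particular site. The escaped dots, the [NC] flag on the second condition, and the empty-Host guard are my additions, not part of either snippet as found.

```apache
RewriteEngine On

# Version 1 (positive match), shown commented out for comparison:
# redirect ONLY when the Host header is exactly "ecyclingstore.com".
# Anything else slips through untouched, including another alias of
# the site or a Host value carrying a port, like "ecyclingstore.com:80".
# RewriteCond %{HTTP_HOST} ^ecyclingstore\.com$ [NC]

# Version 2 (negative match): redirect EVERY request whose Host header
# does not start with "www.ecyclingstore.com".
# The first condition skips requests that send no Host header at all,
# which could otherwise be redirected over and over (my addition).
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{HTTP_HOST} !^www\.ecyclingstore\.com [NC]

# In .htaccess the pattern sees the path without its leading slash,
# so $1 is e.g. "index.php" and the target keeps the same path.
RewriteRule ^(.*)$ http://www.ecyclingstore.com/$1 [R=301,L]
```

An unescaped dot in these patterns matches any character, not just a literal ".", which is usually harmless here but is exactly the kind of subtlety that is easy to copy without noticing.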

 

When I modified the .htaccess rule and did a rerun on the redirect checker, I got this:

[Screen capture: Redirect Tool results after the change]

Much better, but I still have two 200 responses and I want to get that down to one and only one 200 response.  RagePank recommended another bit of code, as follows:

    Options +FollowSymLinks
    RewriteCond %{THE_REQUEST} ^.*/index.php
    RewriteRule ^(.*)index.php$ http://www.ecyclingstore.com/$1 [R=301,L]

 

Once I added that code to .htaccess, everything was hunky-dory.
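Putting the pieces together, a complete file might look something like the sketch below. The ordering of the two rules, the escaped dots, the empty-Host guard, and the tighter THE_REQUEST pattern are my own assumptions rather than what either source handed me, so treat it as a starting point to test, not a finished recipe.

```apache
# mod_rewrite in .htaccess needs symlink following switched on. Some hosts
# already enable it, and some forbid overriding it here (in which case this
# line causes a 500 error and should be removed).
Options +FollowSymLinks
RewriteEngine On

# 1. Collapse an explicitly requested index.php onto its directory URL.
#    THE_REQUEST holds the browser's original request line, for example
#    "GET /index.php HTTP/1.1". Testing it means the rule fires only when
#    index.php was really in the address bar, not when Apache serves
#    index.php internally for "/", which would otherwise loop.
RewriteCond %{THE_REQUEST} "^[A-Z]+ /([^? ]*/)?index\.php[? ]"
RewriteRule ^(.*)index\.php$ http://www.ecyclingstore.com/$1 [R=301,L]

# 2. Send every other host name to the www host, keeping the path.
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{HTTP_HOST} !^www\.ecyclingstore\.com [NC]
RewriteRule ^(.*)$ http://www.ecyclingstore.com/$1 [R=301,L]
```

I put the index.php rule first so that a request for ecyclingstore.com/index.php should reach the final URL in one redirect instead of two. That is a design choice on my part, and the redirect checker is still the judge of whether it behaves that way on a given server.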

[Screen capture: Redirect Tool results showing a single 200 response]

The Conclusion You Should Not Draw From This

I don’t want you to think that the only relevant way to do an .htaccess redirect is what I’ve shown here.  For one thing, this code gets pretty hairy and I’m no expert at it.

The conclusion you should draw is that you need to test your .htaccess code once you get everything set up.  I left the wrong .htaccess code in place for years because I fell prey to the second classic blunder of all time (the first of which is, of course, getting involved in a land war in Asia), namely trusting something I pulled off the Internet without verifying it.

Postscript: Don’t Even Trust the Redirect Checkers

As I was going to press with this blog post I took another look at RagePank’s redirect checker (mentioned above).  Something in the results it was giving me got me suspicious.  After a bit of investigation using a more classic, much more mundane-looking tool, namely Web Sniffer, I found that RagePank can return a 301 code on URLs that get a 404 on Web Sniffer.  So I guess you even need to check the checkers.