I came across the following howto on the web the other day, and was amazed at just how many ways one could get such a simple thing wrong. It serves as a great example of how not to do things, while at the same time providing an opportunity to show you how you can do things better.
First, here’s the article: http://www.programmingfacts.com/2009/12/24/how-to-remove-index-php-from-url-using-htaccess-mod_rewrite/
I sincerely hope that parts of it will be updated at some point, so don’t be too surprised if what you read there doesn’t seem to line up with my remarks about it. Also, it may seem that I’m being unduly harsh to Mr. Patel, and perhaps I am. But I see stuff like this every day, and Mr. Patel is taking all of the heat for those other articles too. My apologies.
So, let’s start with his recipe:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9} /([^/]+/)*index.php HTTP/
RewriteRule ^(([^/]+/)*)index.php$ http://www.%{HTTP_HOST}/ [R=301,NS,L]
The concepts here are good enough – it distinguishes between the initial HTTP request from the browser – %{THE_REQUEST} – and the actual URI that ends up being considered for mapping. That way, you can force THE_REQUEST to be one thing, but map it to another. So far, so good.
Let’s look at the RewriteCond.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9} /([^/]+/)*index.php HTTP/
This RewriteCond ignores one of the most important rules of regular expressions: a regular expression is a substring match unless you force it not to be. So if all you care about is that THE_REQUEST is for something ending in index.php, most of that stuff is unnecessary. Now, this may seem to be picking nits, but on a website that receives thousands or millions of requests per minute, every microsecond that you spend evaluating unnecessary regex bits is wasted time.
Remember that THE_REQUEST is the entire HTTP request string – something that looks like
GET /docs/images/feather.gif HTTP/1.1
And so the regex presented here attempts to match the entire HTTP method (GET or POST or a bunch of other things), hence the {3,9}. But we really don’t care, so that’s wasted time. It also tries to match the entire URI path, and even captures it into backreferences which it then discards. Wasted time *and* memory.
Finally, please note that the RewriteCond won’t actually work at all, because the pattern contains spaces and yet is not enclosed in quotes. mod_rewrite will treat the first space as the end of the regex and will try to interpret the next bit as a flag list, such as [NC] or [OR]. Since that bit isn’t wrapped in square brackets, the configuration is rejected with a “bad flag delimiters” error.
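To see why the unquoted spaces matter, here is a rough Python analogy. It uses shlex.split() as a stand-in for whitespace-and-quote aware argument splitting; this is not Apache's actual parser, and the quoted variant of the directive is my own illustration rather than something from the original article.

```python
import shlex

# The directive as published, and the same directive with the pattern quoted.
broken = "RewriteCond %{THE_REQUEST} ^[A-Z]{3,9} /([^/]+/)*index.php HTTP/"
quoted = 'RewriteCond %{THE_REQUEST} "^[A-Z]{3,9} /([^/]+/)*index.php HTTP/"'

# shlex.split() breaks on unquoted whitespace and keeps quoted text together.
# It only approximates how a config parser tokenizes a directive line.
broken_args = shlex.split(broken)[1:]  # drop the directive name
quoted_args = shlex.split(quoted)[1:]

print(broken_args)  # four pieces: the regex has been cut apart at its spaces
print(quoted_args)  # two pieces: the test string and the whole regex
```

In the unquoted form, the piece that lands where the flags should be is "/([^/]+/)*index.php", which is not a bracketed flag list.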
Instead, we really only need this:
RewriteCond %{THE_REQUEST} "/index.php HTTP"
Or, perhaps, if you want to be even more minimalist, and perhaps less clear to read:
RewriteCond %{THE_REQUEST} "index.php H"
But then we’re trading readability for performance, so you’ll have to make a judgment call on that.
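The substring-match point can be checked quickly outside Apache. The sketch below uses Python's re module, which is close enough to Apache's regex engine for these particular patterns; the sample request lines are made up for illustration, and agreement on these samples does not prove the patterns are equivalent on every possible input.

```python
import re

# Sample request lines, in the same shape as the feather.gif example above.
requests = [
    "GET /index.php HTTP/1.1",
    "POST /wordpress/index.php HTTP/1.1",
    "GET /docs/images/feather.gif HTTP/1.1",
]

# The original pattern, and the two shorter substring patterns.
original = re.compile(r"^[A-Z]{3,9} /([^/]+/)*index.php HTTP/")
short = re.compile(r"/index.php HTTP")
shortest = re.compile(r"index.php H")

def verdicts(pattern):
    # re.search() looks for the pattern anywhere in the string, which is
    # how an unanchored regex behaves in a RewriteCond.
    return [bool(pattern.search(line)) for line in requests]

print(verdicts(original))
print(verdicts(short))
print(verdicts(shortest))
```

All three patterns accept the two index.php requests and reject the feather.gif one, so on these samples the extra anchoring and grouping buys nothing.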
Next, the RewriteRule:
RewriteRule ^(([^/]+/)*)index.php$ http://www.%{HTTP_HOST}/ [R=301,NS,L]
Remember that the goal is to redirect to a URL that still works but lacks the ‘index.php’ on the end of it, in the mistaken belief that this will improve your search engine ranking. (It won’t, but that’s an article for another day.) However, this rewrite rule not only fails to do that, but very likely redirects to the wrong hostname entirely. It’s pretty clear that this rule was never tested, since it won’t work.
First of all, although it captures the leading path (i.e., the bit before /index.php) so that it can redirect to the correct path (such as /application/index.php or /wordpress/index.php, minus the “index.php”), it then discards this, instead of using it in the redirection URL.
Secondly, it seems to assume that HTTP_HOST is lacking the ‘www’ prefix, which it may or may not be. So you may end up redirecting from www.example.com to www.www.example.com, which likely won’t work.
What it does get right is the use of the NS flag, so that it doesn’t enter a redirection loop on subrequest – mod_dir will map “/” back to “/index.php” in such a subrequest.
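The two problems with the original rule (the discarded path and the unconditional ‘www.’ prefix) can be modelled in a few lines of Python. This is a simplified sketch, not mod_rewrite itself: original_redirect is a hypothetical helper, and the paths and hostnames are example values.

```python
import re

# The pattern from the original RewriteRule.
PATTERN = re.compile(r"^(([^/]+/)*)index.php$")

def original_redirect(path, http_host):
    """Return the redirect target the original rule would build, or None."""
    if PATTERN.match(path) is None:
        return None
    # The substitution is http://www.%{HTTP_HOST}/ : the captured directory
    # part is never used, and "www." is prepended no matter what.
    return "http://www." + http_host + "/"

print(original_redirect("wordpress/index.php", "www.example.com"))
print(original_redirect("index.php", "example.com"))
```

The first call yields http://www.www.example.com/, with both the hostname doubled up and the /wordpress/ directory lost. Only the second case, a bare hostname with index.php at the top level, happens to come out right.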
In the RewriteCond, we have already determined that the request is something ending in index.php, so we really don’t need to go to any trouble at all to craft a complicated regex to re-verify this. Instead, we only have to capture the bit that comes before index.php. Here’s what we need:
RewriteRule ^(.*)index.php$ http://%{HTTP_HOST}/$1 [R=301,NS,L]
Note that Mr. Patel has assumed that we’re doing all of this in a .htaccess file, something which I object to on principle but will let go for now. In that context, the path that the RewriteRule pattern sees has the directory prefix stripped, so there is no leading slash on it as there would be in server config scope. If you use these rules in your main config, you’ll need to tweak accordingly.
But we capture the leading part of the request, if any, in $1, which we then use in the redirection.
If the path exists, it will contain a trailing slash, which will end up on the end of the redirection URL. If it doesn’t, well, we’ve already put a slash on there, so it accomplishes the same end.
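Here is the same kind of sketch for the corrected rule, to show what ends up in $1 in each case. As before, this is an approximation in Python rather than mod_rewrite: corrected_redirect is a hypothetical helper, and the paths are given without a leading slash, as a rule in a .htaccess file would see them.

```python
import re

# The pattern from the corrected RewriteRule.
PATTERN = re.compile(r"^(.*)index.php$")

def corrected_redirect(path, http_host):
    """Return the redirect target the corrected rule would build, or None."""
    match = PATTERN.match(path)
    if match is None:
        return None
    # group(1) plays the role of $1: everything before "index.php",
    # which is either empty or a directory path ending in a slash.
    return "http://" + http_host + "/" + match.group(1)

print(corrected_redirect("wordpress/index.php", "www.example.com"))
print(corrected_redirect("index.php", "example.com"))
```

The first call produces http://www.example.com/wordpress/ and the second produces http://example.com/, so the directory survives and the hostname is left exactly as the client sent it.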
I know I’ve been very long-winded here, but I wanted to demonstrate the dangers of these kinds of articles. Someone posts nonsense, and it gets re-tweeted a dozen times, and suddenly folks think that it’s their fault that they can’t get it working.
So, the full recipe, which will actually work:
RewriteEngine On
RewriteCond %{THE_REQUEST} "/index.php HTTP"
RewriteRule ^(.*)index.php$ http://%{HTTP_HOST}/$1 [R=301,NS,L]
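To tie the condition and the rule together, here is a simplified Python model of the full recipe. It is a sketch under stated assumptions, not a reimplementation of Apache: recipe is a hypothetical function, the request lines are invented, and the second call assumes that mod_dir has internally mapped a request for /wordpress/ to wordpress/index.php while THE_REQUEST still holds what the browser originally sent.

```python
import re

COND = re.compile(r"/index.php HTTP")  # the RewriteCond pattern
RULE = re.compile(r"^(.*)index.php$")  # the RewriteRule pattern

def recipe(the_request, path, http_host):
    """Return the 301 target, or None if the recipe leaves the request alone."""
    # The condition inspects the request line exactly as the client sent it.
    if COND.search(the_request) is None:
        return None
    # The rule inspects the path currently being mapped.
    match = RULE.match(path)
    if match is None:
        return None
    return "http://" + http_host + "/" + match.group(1)

# The browser explicitly asked for index.php: redirect it.
print(recipe("GET /wordpress/index.php HTTP/1.1", "wordpress/index.php", "example.com"))

# The browser asked for the directory, and index.php was only filled in
# internally: the condition fails, so there is no second redirect.
print(recipe("GET /wordpress/ HTTP/1.1", "wordpress/index.php", "example.com"))
```

The second case is the reason for testing THE_REQUEST rather than the mapped path: once the browser follows the redirect to /wordpress/, the condition no longer matches and the request is served normally.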
And another day we’ll discuss the fallacy that removing index.php from your URLs actually helps your search engine ranking. It doesn’t, but I suppose people like to feel that they’re at least doing something.