
Tuesday, February 3, 2009

Exploding myths about mod_rewrite. Part I

For many years mod_rewrite has been considered an enigmatic and incomprehensible voodoo art. We want to shatter this myth and show that mod_rewrite can be an easy and handy instrument for anyone who is at least slightly acquainted with regular expressions (more info on regular expressions can be found at http://www.regular-expressions.info/).

What can and what cannot be done with mod_rewrite

000 - The beginning

mod_rewrite processes (verifies, changes, adds and deletes) any incoming (request) headers (highlighted in yellow). Usually most of the work is done on the URL part of the request. The browser divides the URL into parts that are sent to the server separately (in the first line of the request and in the Host: header). That’s why you are unlikely to meet the whole URL in mod_rewrite directives.
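To make the split concrete, here is a minimal Python sketch of how a URL ends up in two separate places in the request. The function name split_request is hypothetical and only illustrates the idea; it is not part of mod_rewrite.

```python
from urllib.parse import urlsplit

def split_request(url):
    """Return the request line and Host header a browser would send for `url`."""
    parts = urlsplit(url)
    # The request line carries only the path and the query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    request_line = f"GET {target} HTTP/1.1"
    # The host name (and port, if any) travels in its own header.
    host_header = f"Host: {parts.netloc}"
    return request_line, host_header

print(split_request("http://example.net/robots.txt?x=1"))
```

Rules therefore usually work on the path from the request line, and the host name has to be read from the Host: header when it is needed.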

Despite its versatility, mod_rewrite is NOT capable of processing response headers or the response body.

mod_rewrite directives and order of processing

The mod_rewrite module has 10 directives, but 2 of them are used much more often than the others: RewriteRule and RewriteCond.

Whenever you want to put RewriteRule directives into some context, you must start your config with the following directive:

RewriteEngine on

All mod_rewrite directives except RewriteCond and RewriteRule (and their extended alternatives RewriteHeader and RewriteProxy) may be written in any order and even several times; in that case the module accepts only the last value (the one closest to the bottom of the config).

RewriteRule directives are processed from top to bottom; RewriteCond directives apply only to the one RewriteRule that follows them (see details below).

Simple example

As an example we'll take a small .htaccess file in the root of the site:

RewriteEngine on
RewriteRule robots\.txt robots.asp [NC]

and robots.asp:

<H1>Asp function "Now"</H1>
<%=now()%>

000 - Simple Rule Scheme

This rule replaces the requested robots.txt (this file may not even exist on the server) with the real dynamic robots.asp file. The [NC] flag makes string comparison case-insensitive. Clients requesting robots.txt have no idea of what’s happening on the server, i.e. they don’t know that instead of a static robots.txt they get a dynamically generated response (e.g. as in the picture below).

001 - Simple Rule - Result

Another simple example

To understand the logic of config processing let’s make it a little more complicated and add a rule rewriting default.htm to default.asp at the end of .htaccess. How will this config be processed?

RewriteEngine on
RewriteRule robots\.txt robots.asp [NC]
RewriteRule default\.htm default.asp [NC]

000 - Simple Rule

For requests to any file in the root of the site the sequence will be the following:

  1. If “default.htm” is requested, initial URL is compared with “robots.txt”. Strings don’t match, no rewriting occurs.
  2. “default.htm” is compared with “default.htm”. Strings match, URL is rewritten to “default.asp”.

or

  1. If “robots.txt” is requested, initial URL is compared with “robots.txt”. Strings match, URL is rewritten to robots.asp.
  2. Then rewritten URL - robots.asp – is compared with default.htm. Strings don’t match. (But this last comparison is obviously unnecessary!)

Processing stops.

In other words: all requested URLs are compared with all rules regardless of whether one of the rules has already matched or not. This leads to excessive actions. That’s why we’ll add [L] flag that terminates rules processing if the current rule matched.

RewriteEngine on
RewriteRule robots\.txt robots.asp [NC,L]
RewriteRule default\.htm default.asp [NC,L]

  1. If “default.htm” is requested, the sequence is the same as in the previous scenario because the rule that matches the requested URL is the last rule.
  2. If “robots.txt” is requested, initial URL is compared with “robots.txt”. Strings match, URL is rewritten to robots.asp, BUT the second rule is ignored and processing stops.

001 - Rule with flag L

That’s why it’s better to place frequently used rules at the top of the config.
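The effect of [L] can be sketched in Python. This is only a simplified model of the two-rule config, not the module's real implementation; the names RULES, rewrite and honor_last are hypothetical.

```python
import re

# (pattern, substitution, last) mirrors the two rules of the config;
# `last` plays the role of the [L] flag.
RULES = [
    (r"robots\.txt", "robots.asp", True),
    (r"default\.htm", "default.asp", True),
]

def rewrite(url, rules, honor_last=True):
    """Return (final_url, number_of_patterns_tested)."""
    tested = 0
    for pattern, substitution, last in rules:
        tested += 1
        if re.search(pattern, url, re.IGNORECASE):  # [NC]
            url = substitution  # the matched URL is replaced as a whole
            if last and honor_last:
                break  # [L]: skip all remaining rules
    return url, tested

print(rewrite("robots.txt", RULES, honor_last=False))  # every rule is tested
print(rewrite("robots.txt", RULES))                    # stops at the first match
```

In this model a request for robots.txt costs one comparison with [L] and two without it, which is why the order of the rules matters.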

 Important note!

We’ll wander off the topic a little and explain one subtle thing about string comparison in regular expressions. If the ^ character is not specified at the beginning of the match pattern, the regexp engine will look for the substring starting from every possible position in the string.

Example:

We request default.htm and regular expression is robots\.txt. Regexp mechanism compares:

  1. default.htm ≠ robots\.txt
  2. efault.htm ≠ robots\.txt
  3. fault.htm ≠ robots\.txt
  4. ault.htm ≠ robots\.txt
  5. ult.htm ≠ robots\.txt
  6. lt.htm ≠ robots\.txt
  7. t.htm ≠ robots\.txt
  8. .htm ≠ robots\.txt
  9. htm ≠ robots\.txt
  10. tm ≠ robots\.txt
  11. m ≠ robots\.txt

and only after that regexp mechanism will inform you that the rule was not matched!

But if we add ^ at the beginning and $ at the end of each RewriteRule,

RewriteEngine on
RewriteRule ^robots\.txt$ robots.asp [NC,L]
RewriteRule ^default\.htm$ default.asp [NC,L]

001 - Rule with flag L with right rule

comparison will only occur once:

  1. default.htm ≠ robots\.txt

which is much faster. We strongly recommend adding ^ wherever possible.
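Anchors affect not only speed but also what gets matched. The following Python sketch uses the standard re module as a stand-in for the module's regexp engine (an assumption; the engines are not identical) to show the difference on a few sample URLs, which are made up for illustration.

```python
import re

unanchored = re.compile(r"robots\.txt", re.IGNORECASE)
anchored = re.compile(r"^robots\.txt$", re.IGNORECASE)

# search() scans the string for the pattern, like the unanchored rule does.
for url in ["robots.txt", "old-robots.txt.bak", "default.htm"]:
    print(url, bool(unanchored.search(url)), bool(anchored.search(url)))
```

The unanchored pattern also accepts old-robots.txt.bak, because robots.txt occurs inside it; the anchored pattern accepts only the exact name.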

Using conditions

It’s often necessary to apply additional conditions to a rule. The RewriteCond directive is intended for this purpose.

Say we need to dynamically generate gif and jpg images and there are two scripts for that: render_gif.asp and render_jpg.asp. The mod_rewrite config will look like this:

RewriteEngine on
RewriteRule ^(.*)\.gif$ render_gif.asp?file=$1 [NC,L]
RewriteRule ^(.*)\.jpg$ render_jpg.asp?file=$1 [NC,L]

Now we’ll add a condition: if requested gif or jpg file physically exists on the disk, return real file. This check may be performed using the following directive:

RewriteCond %{REQUEST_FILENAME} !-f

And the config will become:

RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)\.gif$ render_gif.asp?file=$1 [NC,L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^(.*)\.jpg$ render_jpg.asp?file=$1 [NC,L]

004 - Rule with conds 2

We want to draw your attention to the following:

  1. One or several consecutive RewriteCond conditions preceding a RewriteRule (RewriteProxy, RewriteHeader) directive affect only that directive. This means that if you have 2 rules and you want to apply the same conditions to both of them, these conditions should be put before each rule (see the last piece of code above).
  2. RewriteCond directives will only be processed if the first (left) part of RewriteRule matched.

Order of processing RewriteRule with RewriteCond

Let’s take a config that prevents hotlinking. This config blocks requests to gif and jpg files whose referrer is not empty and doesn’t start with http://www.example.net (requests without a referrer are let through):

RewriteEngine on
RewriteCond %{HTTP_REFERER} .
RewriteCond %{HTTP_REFERER} !^http://www\.example\.net [NC]
RewriteRule \.(jpe?g|gif)$ - [F]

004 - Rule with conds 4

The config is processed in the following order:

  1. RewriteRule checks whether gif or jpg file is requested using this regular expression \.(jpe?g|gif)$.
  2. First RewriteCond checks if Referer: header has a value. To be exact it makes sure it contains at least one character. If it does, processing moves on.
  3. Second RewriteCond checks if Referer value starts with http://www.example.net. If it doesn’t, control goes back to RewriteRule; if it does, processing of this rule terminates and the user gets the requested resource.
  4. If all directives matched, then the image was requested from some unwanted site and  RewriteRule returns “403 – Forbidden”.

Comments:

  • RewriteCond directives are processed consecutively from top to bottom but only AFTER the first part of RewriteRule gets matched against requested resource.
  • The second part of RewriteRule will be executed only if ALL RewriteCond directives get matched.
  • All flags (but [NC]) are taken into account and executed after RewriteRule execution. If the flags do not prohibit further processing ([L], [F], [G], [P], etc), next RewriteRule is processed.
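The order described above can be modelled in a few lines of Python. This is a sketch of the hotlinking config only; the function name is_forbidden and the sample referrers are hypothetical.

```python
import re

def is_forbidden(path, referer):
    """Model of the hotlinking rule: True means a 403 response."""
    # 1. The RewriteRule pattern is tested first (no [NC], so case matters).
    if not re.search(r"\.(jpe?g|gif)$", path):
        return False
    # 2. First condition: the Referer value has at least one character.
    if not re.search(r".", referer):
        return False
    # 3. Second condition is negated with "!": it holds only when the
    #    Referer does NOT start with the allowed site.
    if re.search(r"^http://www\.example\.net", referer, re.IGNORECASE):
        return False
    # 4. Pattern and all conditions matched: the [F] flag applies.
    return True

print(is_forbidden("/test.jpg", "http://other.example.org/page"))
print(is_forbidden("/test.jpg", ""))
```

Note that an empty referrer fails the first condition, so such requests are served normally.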

The following picture illustrates the result of the above rule. As you can see, for the request /test.jpg the referrer did not start with http://www.example.net and the result was “403 - Forbidden”.

004 - Result of Rule with conds -F

Using variables in RewriteCond’s and format string

Let’s take an example that emulates several sites on one IIS site by placing them into different directories:

RewriteEngine on
RewriteCond %{HTTP:Host} ^(www\.)?(.+)
RewriteRule (.*) /%2$1 [L]

Say the request is http://example.net/index.php

  1. /index.php is matched against (.*) and is saved in $1 variable
  2. %{HTTP:Host} represents example.net
  3. example.net is compared with ^(www\.)?(.+), matches it and is saved in %2
  4. /%2$1 represents /example.net/index.php

Now all files requested on http://example.net/ will be searched not in C:\inetpub\wwwroot\ but in C:\inetpub\wwwroot\example.net\ folder.
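The four steps can be traced with a short Python sketch. It assumes Python's re module behaves like the module's regexp engine for these simple patterns; the function name map_to_site_folder is hypothetical.

```python
import re

def map_to_site_folder(host, path):
    """Model of: RewriteCond %{HTTP:Host} ^(www\\.)?(.+)  /  RewriteRule (.*) /%2$1"""
    rule = re.search(r"(.*)", path)           # $1 = requested path
    cond = re.search(r"^(www\.)?(.+)", host)  # %2 = host without "www."
    if not (rule and cond):
        return path                           # no rewriting
    return "/" + cond.group(2) + rule.group(1)  # /%2$1

print(map_to_site_folder("example.net", "/index.php"))
print(map_to_site_folder("www.example.net", "/index.php"))
```

Both host spellings end up in the same folder because the optional first group swallows the "www." prefix.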

004 - Rule with conds 5 (2)

In green areas you may and should use the following types of variables:

  • %{ServerVariable}
  • header sent in the HTTP request %{HTTP:header}
  • map file values ${mapname:key|default}
  • environment variables %{ENV:variable}
  • back references $0-$9 (to match groups in the first part of RewriteRule) and %1-%9 (to match groups in the second part of RewriteCond). Notice that %n may be used only in the left (green) part of subsequent RewriteCond’s or the right (green) part of RewriteRule; $n may be used anywhere inside green areas.

Yellow areas require regular expressions. $n and %n are not possible.

Conditional operators in format string

(extended functionality, don’t use in Apache)

Here’s the config allowing redirection of non-www requests to www:

RewriteEngine on
RewriteCond %{HTTPS} (on)?
RewriteCond %{HTTP:Host} ^(?!www\.)(.+)$ [NC]
RewriteCond %{REQUEST_URI} (.+)
RewriteRule .? http(?%1s)://www.%2%3 [R=301,L]

007 - Rule with optional conds

  1. RewriteRule matches any requested resource.
  2. RewriteCond checks if HTTPS is switched on for this request. If yes, on value is saved in %1 variable, if no, %1 remains empty.
  3. RewriteCond checks that the Host: header does not start with www. If it doesn’t, the host name is saved in the %2 variable.
  4. RewriteCond simply saves URI of requested resource in %3 variable.
  5. If all RewriteCond directives matched, RewriteRule builds the substitution string.

a. This part (?%1s) checks whether %1 matched and if yes, it adds “s” character. http(?%1s) -> https

Please pay attention!

The check of whether a group matched or not uses different syntax depending on the directive the group belongs to:

  • RewriteRule: ?Ntrue_string:false_string
  • RewriteCond: ?%Ntrue_string:false_string

b. Then https://www.%2%3 -> https://www.example.com%3

c. Then https://www.example.com%3 -> https://www.example.com/index.php
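The substitution steps a to c can be reproduced with a Python sketch. It follows the numbering used in this article, where %1, %2 and %3 come from the three consecutive conditions; the function name www_redirect and the sample values are hypothetical.

```python
import re

def www_redirect(https, host, uri):
    """Return the redirect target, or None if the rule does not apply."""
    c1 = re.search(r"(on)?", https)                         # %1: "on" or unset
    c2 = re.search(r"^(?!www\.)(.+)$", host, re.IGNORECASE) # %2: non-www host
    c3 = re.search(r"(.+)", uri)                            # %3: requested URI
    if not (c1 and c2 and c3):
        return None
    # (?%1s): add "s" only when group %1 actually matched.
    scheme = "http" + ("s" if c1.group(1) else "")
    return f"{scheme}://www.{c2.group(1)}{c3.group(1)}"

print(www_redirect("on", "example.com", "/index.php"))
print(www_redirect("off", "example.com", "/index.php"))
```

A host that already starts with www. fails the second condition, so no redirect is produced for it.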

Dealing with QueryString

RewriteRule directive deals only with the part of the request AFTER host name and BEFORE QueryString.

E.g. for the request http://localhost/test?param=foo only the /test part is matched against the rule pattern.

So, there are 4 variants of passing the initial QueryString:

  1. By default, if no new QueryString is specified in substitution string, initial QueryString will be added to the rewritten URL. For
    RewriteRule ^/test$ /test.asp

    the result is /test.asp?param=foo

  2. If new QueryString is specified in substitution string, initial QueryString will NOT be added to the rewritten URL. So, for
    RewriteRule ^/test$ /test.asp?bar=foo

    the result is /test.asp?bar=foo.

  3. If [QSA] flag is put after the rule, new and initial QueryStrings will be joined. For
    RewriteRule ^/test$ /test.asp?bar=foo [QSA]

    the result is /test.asp?bar=foo&param=foo.

  4. If it’s necessary not to add initial QueryString to rewritten URL when no new QueryString is specified, you should add “?” character at the end of the substitution string:
    RewriteRule ^/test$ /test.asp?

    And the result will be: /test.asp.
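The four variants can be summarised in one small Python function. It is a simplified model of the behaviour described in the list, not the module's code; the name apply_query_string and the qsa parameter are hypothetical.

```python
def apply_query_string(substitution, original_qs, qsa=False):
    """Combine a substitution string with the initial QueryString."""
    if "?" not in substitution:
        # Variant 1: no new QueryString, the initial one is kept.
        return substitution + ("?" + original_qs if original_qs else "")
    path, new_qs = substitution.split("?", 1)
    if qsa and original_qs:
        # Variant 3: [QSA] joins the new and the initial QueryStrings.
        new_qs = new_qs + "&" + original_qs if new_qs else original_qs
    # Variant 2: the new QueryString replaces the initial one.
    # Variant 4: a bare trailing "?" leaves no QueryString at all.
    return path + ("?" + new_qs if new_qs else "")

for subst, qsa in [("/test.asp", False), ("/test.asp?bar=foo", False),
                   ("/test.asp?bar=foo", True), ("/test.asp?", False)]:
    print(apply_query_string(subst, "param=foo", qsa))
```

Run against the initial QueryString param=foo, the loop prints the four results listed above in the same order.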

If you need to work with QueryString parameters more selectively, use the RewriteCond directive and the %{QUERY_STRING} variable (remember that the RewriteRule directive doesn’t match the QueryString).

Example

Redirect /index.php?id=123 to /index/123. The config is:

RewriteEngine on
RewriteCond %{QUERY_STRING} ^id=(.*)$
RewriteRule ^/index\.php$ /index/%1? [NC,R=301,L]

  1. RewriteRule checks if requested file is index.php.
  2. RewriteCond retrieves QueryString value (part of request after “?”) from %{QUERY_STRING} server variable. In our case it’s “id=123”.
  3. RewriteCond applies ^id=(.*)$ regular expression to “id=123” string and saves 123 value in %1 variable.
  4. %1 is substituted with 123 in substitution string: /index/%1? -> /index/123. As there’s a “?” character at the end of the line, initial QueryString is not added.
  5. [R=301] flag is processed. If absolute address is not set for redirect, by default mod_rewrite adds http:// + requested_host + requested_port. So, /index/123 -> http://example.com/index/123.
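The five steps can be followed in a Python sketch. The pattern is the one from the rule, with the dot escaped; the function name redirect_for and the host example.com are assumptions for illustration, and the default port is left out.

```python
import re

def redirect_for(path, query_string, host):
    """Return (status, location) for the redirect, or None if nothing matches."""
    # Step 1: the rule pattern is tested against the path only ([NC]).
    if not re.search(r"^/index\.php$", path, re.IGNORECASE):
        return None
    # Steps 2-3: the condition extracts the id value into %1.
    cond = re.search(r"^id=(.*)$", query_string)
    if cond is None:
        return None
    # Step 4: %1 is substituted; the trailing "?" drops the initial QueryString.
    target = "/index/" + cond.group(1)
    # Step 5: a relative target is made absolute for the 301 redirect.
    return 301, f"http://{host}{target}"

print(redirect_for("/index.php", "id=123", "example.com"))
```

Requests whose QueryString does not start with id= fail the condition and are left untouched.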

Small remark about # character

Please notice that the browser does not send anchor information (everything that comes after the # character) to the server. That’s why it’s impossible to write a rule that uses this info. Nevertheless, the browser will correctly process relative links, because rewriting is completely transparent for it.

005 - Request with params and #

Conclusion

We hope we have convinced you that mod_rewrite is much easier to use than you thought. We are happy if we managed to shed some light on this scary-looking subject and gave you a better understanding of it. The next article about mod_rewrite will tell the story of context merging and distributed configurations.

Best wishes, HeliconTech Team
