I use my own PHP-based CMS for almost all my web development. This is great: since I know the code inside and out, I can make whatever the client wishes happen. It has a lot of nice, reusable features (plugins) that keep the development time of a generic website pretty short. Still, all was not well, because one of the things it didn’t support was URL rewriting. In this blog post, I’ll describe how URL rewriting is done, what pitfalls there are and how they can be avoided.

What is URL rewriting?

In my CMS, a typical URL to a page looks like this:

http://www.mysite.com/index.php?__target__=mypage

Inside the CMS, pages have names, and by supplying a __target__ argument, I can tell it to show a particular page. Alternatively, the CMS may show the output of some plugin, a combination of pages, etc. The key is that there is a __target__ that describes what must be shown. There are also other types of URLs. For example, this URL would show blog post #23:

http://www.mysite.com/index.php?__target__=blog&id=23

These URLs, though technically fine for the CMS, are ugly. They contain question marks, ampersands and variable names (__target__), which look inviting neither to the user nor to Google. From a search engine optimization perspective, these cryptic URLs don’t help.

Enter friendly URLs. It would be much nicer if we could offer URLs in this style:

http://www.mysite.com/provinces/maputo
http://www.mysite.com/provinces/inhambane
http://www.mysite.com/hotels/polana

or

http://www.mysite.com/blog/23

or, better yet

http://www.mysite.com/blog/practical-guide-to-url-rewriting

These URLs:

  • Are human-readable;
  • Make it clear to the user that beyond Maputo and Inhambane, there are probably other provinces. The user can simply change the URL to visit them;
  • Illustrate the structure of the website. Evidently there are provinces and hotels;
  • Allow Google to index keywords such as “Maputo”, “Inhambane” and “practical guide to URL rewriting” straight from the URLs.

Our CMS does not understand friendly URLs. It requires a __target__ argument to be present in any URL. This goes for other systems as well, such as WordPress or Joomla. They all use “technical” URLs.

What’s needed now is a mechanism that translates friendly URLs to technical URLs. A CMS like WordPress already does this. The question is, how does it do it, and which code does the job?

IIS URL Rewriting

It turns out that URL rewriting is done by the web server (IIS, Apache, or other), and not by the CMS code. A CMS like WordPress simply instructs Apache to perform URL rewriting, provided that Apache has its mod_rewrite module enabled. Internet Information Services (IIS) is also capable of URL rewriting, and requires the URL rewriting extension to be installed. Just follow the link to add it to your IIS installation. Upon launching IIS Manager, you’ll see the extension appear:

IIS Extension for URL Rewriting

Regular Expressions

URL rewriting works by applying regular expressions to URLs. Ordinarily, when the web server receives a request, it passes the request URL along to the CMS PHP script, which produces content, and the web server sends this content back to the requesting client. With the URL rewriting extension installed, an intermediate step is taken. Any incoming URL is passed through a regular expression filter, resulting in a rewritten URL. This URL is then passed to the CMS script.

As an example, let’s consider the following URLs:

http://www.mysite.com/provinces/maputo
http://www.mysite.com/provinces/inhambane/

We’d like this rewritten as:

http://www.mysite.com/index.php?__target__=province-maputo
http://www.mysite.com/index.php?__target__=province-inhambane

In order to do this, we’ll provide the URL rewriting extension with the following regular expression to match:

^provinces/([^/]+)

This will match the literal provinces/ prefix, followed by the name of the province, but excluding a possible terminating forward slash. Note that the pattern is matched against the path only, without the host name and the leading slash. Notice that the part of the regular expression that matches the province name is contained in parentheses. This will allow us to refer to a matched group when we do the rewriting.

Next, we provide the formatting for the actual rewriting:

index.php?__target__=province-{R:1}

The URL rewriting extension will now replace the requested URL with this pattern, filling in the name of the province at the {R:1} position. Because the whole URL is replaced rather than just the matched substring, the trailing slash of the second example is dropped as well. The {R:1} label refers to the first matched group. If it were necessary to refer to more than one group, we would use {R:2}, {R:3} and so on.
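The match-and-substitute step can be imitated outside IIS to get a feel for it. What follows is a minimal Python sketch, not the way IIS implements it: the function name rewrite_inbound and the way back-references are expanded are my own illustration. It assumes the pattern is applied to the path without host name or leading slash.

```python
import re

PATTERN = re.compile(r"^provinces/([^/]+)")
TEMPLATE = "index.php?__target__=province-{R:1}"

def rewrite_inbound(path):
    """Return the rewritten URL for path, or path unchanged if the rule does not match."""
    match = PATTERN.search(path)
    if match is None:
        return path
    # Expand every {R:n} back-reference to the n-th captured group
    # ({R:0} would be the whole matched text).
    return re.sub(r"\{R:(\d+)\}", lambda ref: match.group(int(ref.group(1))), TEMPLATE)

print(rewrite_inbound("provinces/maputo"))      # index.php?__target__=province-maputo
print(rewrite_inbound("provinces/inhambane/"))  # index.php?__target__=province-inhambane
print(rewrite_inbound("hotels/polana"))         # hotels/polana (no rule matches)
```

Trying a handful of URLs this way is a quick check that a regular expression captures what you expect before you type it into IIS Manager.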

Rewriting Rules in IIS

That’s the theory. Now for a practical implementation in IIS. When you open the URL Rewriting extension in IIS Manager by double-clicking it, the following screen appears:

Adding a rewrite rule

Two sets of rules appear: Inbound rules and Outbound rules. For the time being, only the inbound rules are of interest. In order to add a new rule, click “Add Rule(s)…“.

Creating a blank rule

There’s a slew of options available. In order to best get to know how IIS URL rewriting works, we’ll be adding a blank inbound rule. Selecting this option brings up a form:

Creating an inbound rule

This is where we’re going to input our regular expression.

  • Give the rule a name, e.g. “Provinces”
  • Under Pattern, fill out the regular expression that the URL must match: ^provinces/([^/]+)
  • Under Action Properties, Rewrite URL, provide the rewriting pattern: index.php?__target__=province-{R:1}

There are many additional options that you can set. For instance, you can apply your rewriting pattern to URLs that do not match the regular expression you provide. Or, you can use wildcards instead of regular expressions. The overall mechanism is the same though: match a URL, and replace it with some pattern. Do note the checkbox labelled Append query string – this should normally be checked, as anything that your rule did not rewrite, i.e. additional arguments to the query, will then still be appended to the rewritten string.

Having saved the rule, the rewriting should now work in your browser.
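The effect of the Append query string checkbox is easiest to see with a request that carries an extra argument. The sketch below is plain Python, not IIS code; the function rewrite and the lang=pt argument are made up for the example.

```python
import re

PATTERN = re.compile(r"^provinces/([^/]+)")

def rewrite(path, query="", append_query_string=True):
    """Rewrite a path matching the Provinces rule, optionally keeping the original query string."""
    match = PATTERN.search(path)
    if match is None:
        # The rule does not apply: hand the request on as it came in.
        return path + ("?" + query if query else "")
    rewritten = "index.php?__target__=province-" + match.group(1)
    if append_query_string and query:
        # The rewritten URL already contains a "?", so the old arguments are joined with "&".
        rewritten += "&" + query
    return rewritten

# Request: /provinces/maputo?lang=pt
print(rewrite("provinces/maputo", "lang=pt"))
# index.php?__target__=province-maputo&lang=pt
print(rewrite("provinces/maputo", "lang=pt", append_query_string=False))
# index.php?__target__=province-maputo
```

With the option switched off, the lang argument never reaches the CMS, which is why leaving it checked is the safer default.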

Pitfall: relative URLs break

It’s time to discuss something that had me stumped for a while. If your friendly URLs have the following form:

http://www.mysite.com/provinces/maputo

then the browser will treat “provinces” as the current directory when it resolves relative links in your HTML. If your style sheet lives at css/style.css, the browser will now request it from provinces/css/style.css, resulting in a 404 error for the style sheet! Relative links to your JavaScript files suffer the same fate, breaking your site.
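Python’s urljoin applies the standard resolution rules for relative links, so it can be used to reproduce the problem. This is only an illustration; the two page URLs are the examples used earlier in this post.

```python
from urllib.parse import urljoin

STYLESHEET = "css/style.css"  # relative link as it appears in the HTML

technical = "http://www.mysite.com/index.php?__target__=province-maputo"
friendly = "http://www.mysite.com/provinces/maputo"

# Everything after the last slash of the page's path is dropped before
# the relative link is appended.
print(urljoin(technical, STYLESHEET))  # http://www.mysite.com/css/style.css
print(urljoin(friendly, STYLESHEET))   # http://www.mysite.com/provinces/css/style.css
```

The same page content therefore points at two different style sheet locations, depending on which URL was used to reach it.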

Broken relative links could be fixed with the <BASE> tag, but this tag has some very undesirable side effects. It is better to simply use absolute URLs, since these do not depend on the URL of the page that contains them. To make this work, our CMS has to be prepared to generate them (CMSs like WordPress and Joomla produce absolute URLs for the same reason).

I ended up configuring all my sites with a document root, e.g. “http://www.mysite.com/”, which would be prefixed to all relative URLs that the CMS generates.
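In rough outline, the prefixing could look like the sketch below. It is written in Python for brevity rather than taken from my CMS, and the names SITE_ROOT and absolute_url are hypothetical.

```python
from urllib.parse import urljoin

SITE_ROOT = "http://www.mysite.com/"  # configured once per site

def absolute_url(link):
    """Prefix a site-relative link with the site root; leave absolute URLs alone."""
    if link.startswith(("http://", "https://", "//")):
        return link
    return SITE_ROOT + link.lstrip("/")

# An absolute URL resolves to itself, whatever the URL of the page it appears on.
page = "http://www.mysite.com/provinces/maputo"
print(absolute_url("css/style.css"))                 # http://www.mysite.com/css/style.css
print(urljoin(page, absolute_url("css/style.css")))  # http://www.mysite.com/css/style.css
```

Keeping the root in a single setting means a site can move to another domain without touching any templates.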

Outbound Rules

This may come as a surprise, but we’re only half done. While we may have configured inbound rules to have the web server rewrite all friendly URLs to technical URLs, there is still something missing. Any URLs that our CMS generates are still technical URLs! One of our content pages may contain a link to some other content page in the technical format:

http://www.mysite.com/index.php?__target__=city-pemba

This means that while a visitor may access our website initially through a friendly URL, as soon as she clicks a link, we revert to technical URLs and we have gained very little.

What to do? A first impulse might be to have the CMS translate all its technical URLs to friendly URLs, but this would require that the CMS have knowledge of how we form our friendly URLs. There is no need for this – this translation mechanism can be implemented in the web server’s URL rewriting extension as well.

Outbound rules are regular expressions that are applied to the HTML content of each response just before it is sent off to the client. This means that the CMS generates whatever content it needs to generate, after which the web server goes through the resulting HTML and changes all technical URLs to friendly URLs.

Actually, this process can be applied to more than just HTML, and we will in fact have to instruct the web server to limit its efforts to HTML content only.

In order to create an outbound rule, select Add Rule(s)… and create a Blank outbound rule:

Creating an outbound rule

We need to set the following:

  • The matching scope should be Response.
  • Match the content within should be set to A (links only). Do note that you can actually rewrite the content of many other elements!
  • For the Pattern, take province-(.*)
  • The Action type should be Rewrite
  • The Value should be http://www.mysite.com/provinces/{R:1}

This completes an outbound rule. However, this rule will process all content types, not just HTML! The web server will perform replacements in your CSS and your JavaScript as well. In order to limit this, we must add a Precondition. At the top of the form, create a new Precondition like this (call it IsHTML):

Editing a precondition

This precondition checks the content type of each response that the web server returns to the client, and verifies that it’s of type text/html. Applying this precondition to your outbound rule prevents the web server from messing up your CSS, JavaScript and other files.
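Put together, the outbound rule and its precondition behave roughly like the Python sketch below. This is an approximation for illustration only: it assumes the pattern is tested against the href value of each A tag and that a match replaces the whole value, and the names is_html and rewrite_outbound are mine. IIS parses tags properly; the simple HREF expression here only handles double-quoted attributes.

```python
import re

HREF = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")', re.IGNORECASE)
PATTERN = re.compile(r"province-(.*)")
VALUE = "http://www.mysite.com/provinces/{R:1}"

def is_html(content_type):
    """Precondition IsHTML: only responses served as text/html are touched."""
    return re.match(r"^text/html", content_type) is not None

def rewrite_outbound(body, content_type):
    """Turn technical province links inside A tags into friendly ones."""
    if not is_html(content_type):
        return body

    def replace_href(tag):
        match = PATTERN.search(tag.group(2))
        if match is None:
            return tag.group(0)  # this link does not point at a province page
        return tag.group(1) + VALUE.replace("{R:1}", match.group(1)) + tag.group(3)

    return HREF.sub(replace_href, body)

html = '<a href="http://www.mysite.com/index.php?__target__=province-maputo">Maputo</a>'
print(rewrite_outbound(html, "text/html; charset=utf-8"))
# <a href="http://www.mysite.com/provinces/maputo">Maputo</a>
print(rewrite_outbound("a.province-link { color: red }", "text/css"))
# a.province-link { color: red }
```

The second call shows what the precondition buys us: the style sheet happens to contain the text “province-”, yet it passes through untouched.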

Doing it in code

Maybe you’re not a fan of configuring all your rules in the IIS Manager. Good news! Everything you do in this system is actually saved to a file called web.config, which is stored in the root directory of your website. This means that you can also simply edit this file outside of IIS Manager.
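To give an idea of what ends up in that file, the sketch below embeds a web.config fragment for the Provinces rule in a Python string and reads the rule back out. The element and attribute names are the ones the URL Rewrite extension uses as far as I know; compare with the file IIS Manager generates for your own site before relying on them. The helper load_inbound_rules is hypothetical.

```python
import xml.etree.ElementTree as ET

# Roughly what IIS Manager writes for the inbound rule created above.
WEB_CONFIG = """\
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <rule name="Provinces">
          <match url="^provinces/([^/]+)" />
          <action type="Rewrite" url="index.php?__target__=province-{R:1}" appendQueryString="true" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>
"""

def load_inbound_rules(config_text):
    """Return (name, pattern, rewrite URL) for every inbound rule in the file."""
    root = ET.fromstring(config_text)
    found = []
    for rules in root.iter("rules"):  # inbound rules live under <rules>
        for rule in rules.findall("rule"):
            found.append((
                rule.get("name"),
                rule.find("match").get("url"),
                rule.find("action").get("url"),
            ))
    return found

for name, pattern, target in load_inbound_rules(WEB_CONFIG):
    print(name, pattern, target)
# Provinces ^provinces/([^/]+) index.php?__target__=province-{R:1}
```

Since the rules are plain XML, they can be kept under version control and copied from one site to the next.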

What about Apache?

The Apache web server also does URL rewriting, through its mod_rewrite module, which must be enabled. The concept is similar: you define rules using regular expressions, albeit with a syntax that is slightly different. Note that mod_rewrite only rewrites incoming request URLs; rewriting links in the response body, as IIS outbound rules do, requires a separate module such as mod_substitute. The rules are then commonly stored in a .htaccess file in the root directory of your website.

Summary

This article provides an introduction to doing URL rewriting with the IIS web server. The main points are:

  • URL rewriting should be done by the web server, not the CMS
  • The web server allows definition of inbound rules for rewriting friendly URLs to technical URLs accepted by the CMS, and outbound rules for rewriting any technical URLs produced by the CMS to friendly URLs in all the HTML generated by the CMS
  • Care must be taken, because friendly URLs change the base path against which relative links are resolved, causing them to break. Changing all relative URLs to absolute URLs (a job for the CMS) is a catch-all solution.