Oct 16 2012
 

Sometimes I get surprised by how many people do not fully understand how URLs work … or more specifically how they are decomposed and what each part means. And not just people who have no real reason to understand them, but people in IT. As a DNS administrator (amongst other things) I get some surprising requests – surprising to me at least – to which the answer involves explaining that I would like to help, but that accomplishing the impossible is a task somewhat above my pay grade.

With any luck (so probably not then), this little post may go some way towards explaining URLs and what can and cannot be accomplished with the dark arts of the domain name system.

To start with, URLs can be thought of as web addresses. Not the kind you find painted on the sides of vans (www.plumbers-are-us.com) but what they turn into in the location bar with an honest web browser when you visit a site. Such as http://www.plumbers-are-us.com/. Although I note that my own browser is less than honest!

But just to make things a little more interesting, I will add a couple of extra parts to that example URL: http://www.plumbers-are-us.com:8080/directory/portsmouth.html.

And now to the dissection. The first part of that URL above is the http bit … to be precise that which appears before the colon and the two slashes (apologies if you have been deceived by Microsoft but a ‘/’ is a forwards slash and a ‘\’ is a backwards slash, although those formal graphemologists who write the standards prefer to call a slash a solidus). This part of the URL is the scheme.

The scheme defines which protocol should be used to fetch a page. You should be familiar with http and https as these are conventionally used to fetch web pages … with the latter adding SSL/TLS encryption of course. There are of course other, less well known schemes :-

ftp File Transfer Protocol – a pre-web method for transferring files.
gopher Gopher – an earlier competitor to the Web.
mailto Used to compose a mail message to an address.

In fact that is just a tiny sneak peek at the full list, which contains a number of things even I have never heard of. But the usual scheme is either http or https (at least for now), so we can skip over the scheme part.
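For those who prefer to see this sort of thing done by a machine, here is a minimal sketch of pulling the scheme out of a URL using Python's standard urllib.parse module. The helper name scheme_of and the ftp and mailto example addresses are made up for illustration.

```python
from urllib.parse import urlsplit

def scheme_of(url: str) -> str:
    """Return the scheme, i.e. the part of the URL before the first colon."""
    return urlsplit(url).scheme

# The example addresses below are invented purely for illustration.
for example in (
    "http://www.plumbers-are-us.com:8080/directory/portsmouth.html",
    "ftp://ftp.example.org/pub/file.txt",
    "mailto:someone@example.org",
):
    print(scheme_of(example), "<-", example)
```

Note that mailto has no double slash at all; the scheme is simply whatever comes before the first colon.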

The next part (between the ‘//’ and the next ‘/’) contains two items of information :-

  1. The “hostname” where the web server can be found.
  2. The “port” to connect to on that web server.
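As a rough illustration of those two items, the same standard Python module will split them out of the example URL. This is only a sketch of the decomposition, not how any particular browser does it.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.plumbers-are-us.com:8080/directory/portsmouth.html")

hostname = parts.hostname  # the name of the machine running the web server
port = parts.port          # the port number, or None if the URL does not give one

print(hostname, port)
```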

The “port” is relatively uninteresting. If the server where the URL is served from is configured properly, there is no need to specify a port number, as any browser is capable of realising that the default port number for http is 80 (computers are good with numbers after all) and 443 for https. Unfortunately, whilst there is (arguably) no real excuse for running web servers on non-standard ports these days, some people insist on doing the Wrong Thing; quite often through archaic knowledge picked up during the 1990s which would be best recycled.
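The default port logic is simple enough to sketch in a few lines of Python. The function name effective_port and the DEFAULT_PORTS table are my own invention for this example; real browsers know about more schemes than these two.

```python
from urllib.parse import urlsplit

# Only the two schemes discussed here; an assumption made for brevity.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url: str):
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        # An explicit port in the URL always wins.
        return parts.port
    # Otherwise fall back to the scheme's default (None if we do not know it).
    return DEFAULT_PORTS.get(parts.scheme)
```

So http://www.plumbers-are-us.com/ quietly means port 80, and only the :8080 variant needs the number spelt out.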

The “hostname” part is where it starts to get interesting. This is turned into an IP address by your browser (with the help of the DNS), so it can go off across the Internet and have a polite conversation with a web server at the other end to ask nicely for a copy of the web page you have asked for. You can just put an IP address in there, but the expectation is that sometimes URLs may be typed in, and isn’t really.zonky.org slightly more memorable than 2001:8b0:640c:dead::d00d ?
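A minimal sketch of that name-to-address step, using Python's standard socket module, might look like the following. The helper name addresses_for is hypothetical, and what it returns for a real name depends entirely on your resolver and what the DNS says on the day.

```python
import socket
from urllib.parse import urlsplit

def addresses_for(hostname: str, port: int = 80):
    """Ask the system resolver for the IP addresses behind a hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry ends with a socket address whose first item is the IP address.
    return sorted({info[4][0] for info in infos})

# A bare IPv6 address in a URL has to be wrapped in square brackets,
# otherwise its colons would be confused with the port separator.
literal = urlsplit("http://[2001:8b0:640c:dead::d00d]/")
print(literal.hostname)
```

Calling addresses_for("really.zonky.org") would perform a real DNS lookup, so it needs a working network connection; passing it something that is already an IP address simply hands the address back.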

But wait! It gets more interesting: the DNS allows you to point more than one name at a server, so mine can be reached with several different URLs such as http://zonky.org and http://really.zonky.org plus a few others. These in fact show different web pages, by using so-called virtual servers (which have nothing to do with virtual machines).
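The trick works because the browser tells the web server which name it asked for (in the Host header of the request), and the server picks a site accordingly. Here is a toy sketch of that decision; the SITES table and pick_site function are invented for illustration, and real web servers do this through their configuration rather than anything like this code.

```python
# Hypothetical table of sites living on one server (one IP address).
SITES = {
    "zonky.org": "front page for zonky.org",
    "really.zonky.org": "front page for really.zonky.org",
}

def pick_site(host_header: str, default: str = "zonky.org") -> str:
    """Choose which site to serve based on the name the browser asked for."""
    # Drop any ":port" suffix and ignore case. This simple split would not
    # cope with a bracketed IPv6 address, which is fine for a toy example.
    name = host_header.split(":", 1)[0].strip().lower()
    return SITES.get(name, SITES[default])
```

Both names resolve to the same machine; it is the web server, not the DNS, that decides they should show different pages.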

So the DNS can be used to change a boring server name such as server0032.facilities.north.some.organisation into a more meaningful name such as internet.some.organisation, but it can only pull tricks with the “hostname” part. Any messing with any other part of the URL including the bit after the slash is the job of something else; usually the web server itself, although that can sometimes require additional support.
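To make the limitation concrete, here is a toy model in Python. The TOY_DNS table stands in for the real DNS, and the addresses in it are made-up ones from a range reserved for documentation. The point is simply that the only thing ever handed to the DNS is the hostname; the port and everything after the slash never get anywhere near it.

```python
from urllib.parse import urlsplit

# A pretend DNS zone: two names pointing at the same (made-up) address.
TOY_DNS = {
    "server0032.facilities.north.some.organisation": "192.0.2.32",
    "internet.some.organisation": "192.0.2.32",
}

def dns_question(url: str) -> str:
    """Return the one part of a URL that the DNS is ever asked about."""
    return urlsplit(url).hostname

def toy_lookup(url: str) -> str:
    """Look the hostname up in the pretend zone, ignoring the rest of the URL."""
    return TOY_DNS[dns_question(url)]
```

Whatever port or path you bolt onto either name, toy_lookup gives the same answer, which is why requests to "make the DNS send /foo somewhere else" cannot be granted.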

The last part of the URL comes after the first single slash – in our example the “/directory/portsmouth.html” part – which can be best called the pathname as it provides a path to the page within the web server to fetch. In a very simplistic way, web servers can be thought of as file servers which require you to tell them which file you want; just like working with the command-line on a Linux machine or even a Windows machine.
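Sticking with that very simplistic view, the mapping from pathname to file could be sketched like this. The document root /var/www/html is an assumption (a common default on Linux, but yours may differ), file_for is a made-up name, and real web servers do a great deal more than this.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def file_for(url: str, root: str = "/var/www/html") -> str:
    """Map the pathname of a URL onto a file below an assumed document root."""
    path = unquote(urlsplit(url).path)
    # Normalising an absolute path collapses any ".." components, so the
    # result cannot climb out of the document root.
    cleaned = posixpath.normpath("/" + path.lstrip("/"))
    return root + cleaned

print(file_for("http://www.plumbers-are-us.com:8080/directory/portsmouth.html"))
```

Notice that neither the hostname nor the port plays any part in choosing the file here; that is all down to the pathname.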

BTW: I’m not really that scary – I haven’t bitten anyone’s head off for ages … a couple of weeks at least!