Mar 12 2007

I have been using large IT systems since 1986; quite a while. Of course in many ways, IT systems have improved dramatically … they are faster, we have more of them, we have graphical user interfaces, etc. One thing, though, hasn’t changed … they are still unreliable. In fact I would go further: I believe they are actually less reliable than they used to be.

Why is this? Well, I don’t have a definitive answer, but I do have a few ideas …

Humans are fallible, and IT systems are written by humans, so naturally IT systems fail. However there is an assumption that with enough time, effort and testing it is possible to eliminate all the problems with an IT system, despite plenty of evidence that this is foolish thinking. We need to accept that IT systems will fail, and design them to fail gracefully … for example, Firefox is set up to restore your browsing session if it gets killed or crashes.
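Graceful failure usually comes down to persisting enough state, often enough, that recovery after a crash is cheap. A minimal sketch of the session-restore idea in Python (the file name and data layout here are my own invention for illustration, not how Firefox actually does it):

```python
import json
import os
import tempfile

SESSION_FILE = "session.json"  # hypothetical location for saved state

def save_session(open_tabs):
    """Persist session state atomically, so a crash mid-write
    can never leave a half-written (corrupt) session file."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump({"tabs": open_tabs}, f)
    os.replace(tmp, SESSION_FILE)  # atomic rename over the old file

def restore_session():
    """Return the last saved tabs; fall back to an empty session
    on first run or if the saved file is missing/unreadable."""
    try:
        with open(SESSION_FILE) as f:
            return json.load(f)["tabs"]
    except (FileNotFoundError, json.JSONDecodeError):
        return []
```

The point is not the file format but the posture: the application assumes it will die at an arbitrary moment, and arranges things so that restarting is a non-event rather than a disaster.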

IT systems are all too frequently designed monolithically … for instance (an over-simplification) a monolithic web application could work better as three separate components … a web user interface, a command-line tool to do the work, and a database backend for storage. When the parts of an application can operate independently, faults are easier to isolate by running individual components on their own. Separating components also makes applications easier to scale; it becomes easier to see where the bottlenecks are, where you need more resources, and where to re-engineer problematic areas.
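As a sketch of that separation (the function and file names here are hypothetical), the “do the work” part can be an ordinary importable module, with the command-line tool, and eventually the web front end, reduced to thin wrappers around it:

```python
# core.py -- the component that does the work; it knows
# nothing about the web UI or how it was invoked.
def generate_report(records):
    """Summarise a list of {"amount": ...} records."""
    total = sum(r["amount"] for r in records)
    return {"count": len(records), "total": total}

# cli.py -- a thin command-line wrapper, so the working
# component can be run, tested and debugged in isolation:
#   cat records.json | python cli.py
if __name__ == "__main__":
    import json
    import sys
    records = json.load(sys.stdin)
    print(json.dumps(generate_report(records)))
```

Because the core logic is callable on its own, a fault can be reproduced at the command line without the web server or the database in the loop, which is exactly the fault-isolation benefit described above.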

We are too fond of the “big bang” approach to improving IT systems. We go out and ask for a list of improvements to make, decide we need to roll out a huge new IT system to meet “user requirements”, initiate a huge project to replace a critical IT system, spend huge amounts of money on the new system, make the new system “live” after huge amounts of testing by the user population, and keep the old system running for years “just in case”.

We all know where the big bang approach leads … the “big headache” when things don’t work, cost too much, etc.

It is far less sexy to evolve existing IT systems in the direction we want. It takes longer, but it is safer. It also means you don’t have to keep old systems around “just in case” … indeed you can’t, because the new system is the old system. It is also less stressful for all involved to change things a little bit at a time; because each change is smaller, you can be more confident that it will work.

Users of IT systems need to have more realistic expectations. This is partly the fault of IT people … we like to promise the earth … and partly the fault of users, who can set unrealistic requirements. Users set the requirements so high that meeting them becomes exceptionally difficult, and because deadlines are unrealistic, many hidden requirements end up not being met. For example, if we ask the users whether they want a fancy web-based front end to their finance system, the answer will be “yes”; if we also ask whether they want a reliable system, the answer is of course “yes”. If those requirements are incompatible, users will insist we accomplish the impossible.