Jan 12 2014

Computers have gotten faster … a lot faster. For some workloads there is never enough speed, but for everyday use today's computers are not noticeably faster than those of a few years ago, at least not unless you run benchmarks. So there is little incentive to upgrade that five-year-old desktop machine, unless of course you are running Windows XP (support for which will be dropped soon).

Unless of course you look at aspects other than simple speed – such as reliability.

A few years ago I used to run old Unix workstations in preference to PCs despite their lack of speed, because they were simply more reliable; I could leave a workstation running for weeks without any ill effects. The PCs I used at the time were just not quite as stable: every so often something unexpected would happen and a reboot would be needed, usually at the most irritating possible time.

We expect computers to be reliable, but are all too often disappointed.

Desktop manufacturers may be able to revive the flagging market for desktops by offering something new – desktops with reliability. There are a number of reliability features that are commonly found in servers that could be offered in desktops with only a marginal increase in cost.

Error Correcting Code Memory

Forget the “code” part in the title; without going into a great deal of technical detail, ECC memory automatically corrects memory errors when they occur. And occur they do.

There are a variety of causes of bit errors within memory, ranging from cosmic rays to atmospheric radiation; the cause matters less than how frequently the errors occur. According to small studies and theory they should be quite rare, but Google have released a paper actually measuring the error rate across a large pool of machines, and it works out at roughly 5 single-bit errors per 8 gigabytes of RAM per hour.

If true, that's more than enough to have a significant impact on the reliability of your average desktop PC. If a bit flip lands in a program's instructions, the program will usually crash or do something strange to your data; if it lands in the data itself, you might find a strange coloured blob appearing in your favourite photo.
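
To put a rough number on that, here is a back-of-envelope calculation in Python which simply takes the quoted figure at face value and scales it to a few common memory sizes; the rate comes from the paragraph above, not from any new measurement:

```python
# Back-of-envelope only: scale the quoted rate (5 single-bit errors per
# 8 GB of RAM per hour) to a few typical desktop memory sizes.
errors_per_gb_per_hour = 5 / 8

for gigabytes in (4, 8, 16):
    per_day = errors_per_gb_per_hour * gigabytes * 24
    print(f"{gigabytes} GB of RAM: roughly {per_day:.0f} bit errors per day")

# 4 GB of RAM: roughly 60 bit errors per day
# 8 GB of RAM: roughly 120 bit errors per day
# 16 GB of RAM: roughly 240 bit errors per day
```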

Normal desktop PCs do not come supplied with ECC memory because it is slightly more expensive than ordinary memory. Without going into details, ECC memory uses additional memory to maintain a check on the contents of main memory.
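
To make the idea of "additional memory maintaining a check" concrete, here is a toy Hamming(7,4) code in Python: four data bits are protected by three parity bits, and any single flipped bit can be located and corrected. Real ECC DIMMs use wider SECDED codes implemented in the memory controller, so this is an illustration of the principle rather than the actual scheme:

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 extra parity bits.
# The same idea, on a much larger scale, is what the extra chips on an
# ECC DIMM provide.

def encode(d):
    """d is a list of 4 data bits; returns a 7-bit codeword (p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(c):
    """Returns (corrected codeword, 1-based position of the flipped bit, or 0 if none)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # points at the bad bit, 0 means all is well
    if syndrome:
        c[syndrome - 1] ^= 1         # flip it back
    return c, syndrome

codeword = encode([1, 0, 1, 1])
damaged = list(codeword)
damaged[4] ^= 1                      # simulate a cosmic-ray bit flip at position 5
fixed, where = correct(damaged)
print(where, fixed == codeword)      # 5 True: the error was found and corrected
```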

That additional memory costs more. Not a lot more, but in a competitive market a small saving may lead to increased sales. Of course there are other ways to increase sales, such as making a feature of ECC memory and reliability.

Storage

We are currently in a transition period between mechanical storage (disks) and electronic mass storage (flash). Flash is very fast, but its price still makes it impractical for large amounts of storage. That will of course change.

In the meantime we have to deal with two storage solutions: one with a reputation for unreliability (flash) and one that really is unreliable (disks). Both fail with regrettable regularity (although disks fail more often!) but they fail in different ways. A disk is likely to go through a short period where it does not work very well before refusing to do anything at all, although as a mechanical device it can fail in surprising ways too! Flash tends to fail more gracefully: it eventually reaches the point where all attempts to write fail, but the information already stored remains readable.

Because they fail in different ways, we have to cope with their failure in different ways too. Except for the most obvious thing – everything needs to be backed up. And of course getting a backup mechanism up and running is a pretty tedious task.

It would make a great deal of sense for a vendor to offer a cloud-based disaster recovery backup for your system disk(s). An account with a copy of the system disk image is created before your system is shipped, and once online your desktop PC sends updates to that image in the cloud. When the disk fails, you can ask the vendor to ship a replacement disk with almost everything you previously had already in place.
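
As a sketch of what "sends updates to that image in the cloud" might look like, the following splits a disk image into fixed-size chunks, hashes each one, and reports only the chunks that have changed since the last run. The paths and upload_chunk() are hypothetical placeholders; no particular vendor service is assumed:

```python
# Sketch only: find the chunks of a disk image that changed since the last
# sync, so that only those need to be sent to the cloud copy.

import hashlib
import json
import os

CHUNK_SIZE = 4 * 1024 * 1024   # 4 MiB chunks

def changed_chunks(image_path, manifest_path):
    """Yield (offset, chunk) for every chunk that differs from the last manifest."""
    old = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            old = json.load(f)             # maps str(offset) -> sha256 digest
    new = {}
    with open(image_path, "rb") as image:
        offset = 0
        while True:
            chunk = image.read(CHUNK_SIZE)
            if not chunk:
                break
            digest = hashlib.sha256(chunk).hexdigest()
            new[str(offset)] = digest
            if old.get(str(offset)) != digest:
                yield offset, chunk        # this chunk needs to be sent
            offset += len(chunk)
    # A real implementation would only record a chunk once its upload succeeded.
    with open(manifest_path, "w") as f:
        json.dump(new, f)

# Usage, with upload_chunk() standing in for the vendor's (hypothetical) API:
# for offset, chunk in changed_chunks("system.img", "system.manifest.json"):
#     upload_chunk(offset, chunk)
```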

On a more general note, it is worth mentioning that most consumer hard disks at the bottom end of the market are complete rubbish. And I would pay extra to buy disks from a vendor that :-

  1. Takes ordinary disks and burns them in for a week to verify that they are not going to go bad in the first few months; there’s a NAS vendor (whose name escapes me for the moment) that does this and has one of the lowest disk failure rates on the market despite using relatively cheap and nasty disks.
  2. Ships them in proper packaging that absorbs the shipping bumps and knocks. Just because a disk drive looks intact does not mean it is safe to use.

And What About The File System?

So far it has all been about the hardware, but there is more we can do about reliability in software too. Carrying on from the previous section, one of those areas is how the operating system stores files on disks. The software module that does this is (to use the Unix or Linux term) the file system, and there are different kinds.

Historically, file systems have assumed that the underlying storage is perfectly reliable. With increased awareness of silent data corruption, however, there are now a few file systems that check for it, including what was probably the first: ZFS.

Even at the cost of a small loss of performance, file systems should detect silent data corruption and correct it where possible.
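
A toy model of the checksum-on-read idea is sketched below; it is an in-memory illustration of the principle used by ZFS and similar file systems, not a description of how any real file system lays out its data:

```python
# Sketch: every block is stored with a checksum, reads verify the data before
# returning it, and a corrupted block can be healed from a mirror copy.

import hashlib

class CheckedStore:
    def __init__(self):
        self.blocks = {}     # block number -> (data, checksum)

    def write(self, blockno, data):
        self.blocks[blockno] = (data, hashlib.sha256(data).hexdigest())

    def read(self, blockno, mirror=None):
        data, checksum = self.blocks[blockno]
        if hashlib.sha256(data).hexdigest() == checksum:
            return data
        # Silent corruption detected: try to heal from the mirror copy.
        if mirror is not None:
            good = mirror.read(blockno)
            self.write(blockno, good)    # rewrite the bad block with good data
            return good
        raise IOError(f"block {blockno} is corrupt and no mirror is available")

# Usage: corrupt one copy behind the store's back and watch it self-heal.
a, b = CheckedStore(), CheckedStore()
for store in (a, b):
    store.write(0, b"important data")
a.blocks[0] = (b"important dat\x00", a.blocks[0][1])   # simulate a flipped byte
print(a.read(0, mirror=b))                             # b'important data'
```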

Preparing To Fail

We all know that software is unreliable; or to be precise, it is not perfectly reliable, although it is a great deal more reliable than we give it credit for. After all, we only notice the failures, and only some of those.

Rather than just trying to produce reliable software, programmers should be designing software that fails safely without losing any data. See crash-only software.
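
One concrete failing-safe technique in the crash-only spirit is never to overwrite a file in place: write a new copy, flush it to disk, then atomically rename it over the old one. A minimal Python sketch (assuming a POSIX-style file system) looks like this:

```python
# If the program or the whole machine crashes at any point, the file on disk
# is either the complete old version or the complete new one, never a mixture.

import os
import tempfile

def atomic_write(path, data: bytes):
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)   # temp file on the same file system
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # make sure the bytes are on disk, not in a cache
        os.replace(tmp_path, path)      # atomic rename over the old file
    except BaseException:
        os.unlink(tmp_path)             # leave no half-written temp file behind
        raise

atomic_write("settings.conf", b"colour=blue\n")
```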

Jan 05 2011

Various forms of streaming media service are becoming increasingly popular: Last.FM has personalised radio stations I can tune into on my phone, the BBC has iPlayer which lets me catch up on BBC TV (or radio) programmes I've missed, and my film rental service even has a streaming option that lets me watch films without waiting for discs to arrive in the post. All very cool of course, and quite handy too, but there are a few problems that need to be solved before streaming services can beat having the real disc: compact disc for music and Blu-ray for films.

We sometimes look at these services under the best of conditions and rarely consider how they would work under the worst of conditions.

Firstly there is the quality issue. Whilst streaming music may well approach the quality of CDs, films and other forms of video are a long way from matching the quality of the discs, sometimes not even getting close to DVD quality when Blu-ray is the standard to aim for. Sure, it is no big deal; the convenience of online streaming makes up for it to a certain extent, but it does not remove the need for quality.

Secondly, reliability is an issue. Not only does streaming media (even audio) have a tendency to stutter to the point where listening or watching becomes unbearable, but sometimes streaming services simply crash because they are overloaded, which is very frustrating half-way through a film. In theory most of our network connections have more than enough bandwidth to support streaming media, at least audio. In fact my own network connection is good enough for streaming video with just the occasional stutter, maybe once an hour, and of course the occasional stutter may well be because of other activity on my network. I do after all have people visiting my “server under the stairs” for blog postings and photographs on a regular basis.

However, my wireless network is bad enough that even streaming audio can become unusable in the evening. It is not the fault of the streaming media companies that I live in a very dense environment with lots of wireless “noise”, but it still means that I tend to avoid wireless networking except on devices where there is no choice. Even on those devices I have sometimes been forced to put them away, or switch to using 3G.

It would be helpful if media streaming companies allowed people to buffer larger amounts of the media stream to cope with this. I would not mind waiting 10 minutes for a buffer to fill to ensure that I could watch a film all the way through without stuttering, or indeed waiting 60 seconds for an audio stream to buffer.
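
A client-side sketch of that idea might look like the following, where fetch_next_chunk() and play_chunk() are hypothetical stand-ins for a real player's internals:

```python
# Sketch: keep fetching stream chunks into a queue, and only start playback
# once a configurable number of seconds has been buffered.

from collections import deque

def buffered_playback(fetch_next_chunk, play_chunk,
                      chunk_seconds=2, prefill_seconds=600):
    """Fill the buffer with prefill_seconds of media before starting playback."""
    buffer = deque()
    while len(buffer) * chunk_seconds < prefill_seconds:
        chunk = fetch_next_chunk()
        if chunk is None:               # end of stream while prefilling
            break
        buffer.append(chunk)
    while buffer:
        play_chunk(buffer.popleft())    # play from the front...
        chunk = fetch_next_chunk()      # ...while topping the buffer up at the back
        if chunk is not None:
            buffer.append(chunk)
    # A ten-minute prefill (600 s) trades waiting up front for far fewer stutters later.
```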

On the subject of media servers crashing, it is a little hard to see what can be done. The obvious thing is that streaming media companies need to be very careful about the code they write (or buy) in order to increase reliability; software always has bugs, but putting more emphasis on hunting them down would be very wise. Less obvious is to measure how reliable the media servers are at various loads, and to limit the load to the level they can support reliably.

A message saying “please wait for an available film slot” is better by far than trying to start playing a film only to have it drop out half-way through!
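
A minimal sketch of that kind of admission control, using nothing more than a counting semaphore, might look like this; MAX_STREAMS and the start_stream()/notify() callbacks are assumptions rather than any real service's API:

```python
# Sketch: cap concurrent streams at the level the servers are known to handle
# reliably, and queue everyone else with an honest message instead of letting
# the whole service fall over.

import threading

MAX_STREAMS = 500                        # the measured safe load, not a guess
slots = threading.BoundedSemaphore(MAX_STREAMS)

def request_film(start_stream, notify):
    if not slots.acquire(blocking=False):
        notify("Please wait for an available film slot...")
        slots.acquire()                  # block until a slot frees up
    try:
        start_stream()                   # the viewer gets an uninterrupted film
    finally:
        slots.release()
```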