ECC – More Zonkyness

An Awesome Window Manager ECC Widget

Apr 012018

This is a continuation of an earlier post regarding ECC memory under Linux, and is how I added a little widget to display the current ECC memory status. Because I don’t really know lua, most of the work is carried out with a shell script that is run via cron on a frequent basis.

The shell script simply runs edac-util to obtain the number of correctable errors and uncorrectable errors, and formats the numbers in a way suitable for setting the text of a widget :-

#!/bin/zsh
#
# Use edac-util to report some numbers to display ...

correctables=$(edac-util --report=ce | awk '{print $NF}')
uncorrectables=$(edac-util --report=ue | awk '{print $NF}')

c="chartreuse"
if [[ "$correctables" != "0" ]]
then 
  c="orange"
fi
if [[ "$uncorrectables" != "0" ]]
then
  c="red"
fi

echo "ECC: $correctables/$uncorrectables "

This is run with a crontab entry :-

*/7 * * * * /site/scripts/gen-ecc-wtext > /home/mike/lib/awesome/widget-texts/ecc-status

Once the file is being generated, the Awesome configuration can take effect :-

-- The following function does what it says and is used in a number of dumb widgets
-- to gather strings from shell scripts
function readfiletostring (filename)
  file = io.open(filename, "r")
  io.input(file)
  s = io.read()
  io.close(file)
  return s
end

eccstatus = wibox.widget.textbox()
eccstatus:set_markup(readfiletostring(homedir .. "/lib/awesome/widget-texts/ecc-status"))
eccstatustimer = timer({ timeout = 60 })
eccstatustimer:connect_signal("timeout",
  function()
      eccstatus:set_markup(readfiletostring(homedir .. "/lib/awesome/widget-texts/ecc-status"))
  end
)
eccstatustimer:start()
...
layout = wibox.layout.fixed.horizontal, ... eccstatus, ...

There plenty of ways this could be improved – there’s nothing really that requires a separate shell script, but this works which is good enough for now.

Linux: Adding ECC Memory Error Reporting (EDAC)

Computing, Linux 1 Response »

Jun 262014

Came across a hint today about reporting on ECC memory errors. For those who do not know, ECC memory detects memory errors and corrects correctable errors. Normal memory (as found in almost all laptops and desktops) simply ignores the errors and lets them accumulate and cause problems either with data corruption or by causing software errors.

As I happen to have ECC memory in my desktop machine I thought I would have a look into the hint. Turns out that Linux does not report on ECC events automatically; you need to install the relevant EDAC (Error Detection and Correction) tools. Which for Debian, turns out to be pretty simple :-

# apt-get install edac-utils

As part of the installation process, a daemon process is started. But for whatever reason, it didn’t automatically detect what driver to load. So I edited /etc/default/edac and added :-

EDAC_DRIVER=amd64_edac_mod

Once that is done, a simple /etc/init.d/edac restart loads the driver and starts monitoring. Messages should appear in your log files (/var/log/messages) and reports can be displayed with edac-util :-

# edac-util --report=full 
mc0:csrow0:mc#0csrow#0channel#0:CE:0
mc0:csrow0:mc#0csrow#0channel#1:CE:0
mc0:csrow1:mc#0csrow#1channel#0:CE:0
mc0:csrow1:mc#0csrow#1channel#1:CE:0
mc0:csrow2:mc#0csrow#2channel#0:CE:0
mc0:csrow2:mc#0csrow#2channel#1:CE:0
mc0:csrow3:mc#0csrow#3channel#0:CE:0
mc0:csrow3:mc#0csrow#3channel#1:CE:0
mc0:noinfo:all:UE:0
mc0:noinfo:all:CE:0

Of course memory errors are relatively rare (or at least should be) so it may take months before any error is reported.

It’s All About Reliability, Reliability, and More Reliability

Computing No Responses »

Jan 122014

Computers have gotten faster … a lot faster. In some cases there is never enough speed, but to a certain extent today’s computers are not noticeably faster than computers of a few years ago. At least not if you do not run benchmarks. So there is little incentive to upgrade that 5 year old desktop machine – unless you are running Windows XP of course (support for which will be dropped soon).

Unless of course you look at aspects other than simple speed – such as reliability.

A few years ago I used to run old Unix workstations in preference to PCs despite their lack of speed, because they were simply more reliable – I could leave a workstation running for weeks without any negative effects. Whereas the PCs I was used to using were just not quite as stable; every so often something unexpected would occur and a reboot would be necessary. Usually at the most irritating possible time.

We expect computers to be reliable, but are all too often disappointed.

Desktop manufacturers may be able to revive the flagging market for desktops by offering something new – desktops with reliability. There are a number of reliability features that are commonly found in servers that could be offered in desktops with only a marginal increase in cost.

Error Correcting Code Memory

Forget the “code” part in the title; without going into a great deal of technical detail, ECC memory automatically corrects memory errors when they occur. And occur they do.

There are a variety of causes of bit errors within memory varying from cosmic rays to atmospheric radiation; the cause does not matter so much. What matters is how frequently they occur. According to small studies and theory, they should be quite rare, but Google have released a paper actually measuring the error rate in a large pool of machines; the error rate is roughly about 5 single bit errors in 8 Gigabytes of RAM per hour.

If true, that’s more than enough to have a significant impact on the reliability of your average desktop PC. If a piece of software has some random instructions changed into something else, it will usually crash or do something strange to your data. Or if that random memory error occurs within your data, then you might expect a strange coloured blob to appear in your favourite photo.

Normal desktop PCs do not come supplied with ECC memory because it is slightly more expensive than ordinary memory. Without going into details, ECC memory uses additional memory to maintain a check on the contents of main memory.

And that costs more. Not a lot more, but in a competitive market, a small saving may lead to increased sales. Of course there are other ways to increase sales – such as by making a feature of ECC memory and reliability.

Storage

We are currently in a transition period between mechanical storage (disks) and electronic mass storage (flash). Flash storage currently offers very fast storage but with a price tag attached meaning it is infeasible for large amounts of storage. That will of course change.

In the meantime we have to deal with two storage solutions; one with a reputation of unreliability (flash) and one that is really unreliable (disks). Both fail with regrettable regularity (although discs will fail more often!) but fail in different ways. Disks themselves are likely to have a short period where they do not work very well before refusing to do anything, although as mechanical devices they can fail in surprising ways too! Flash will tend to fail in a rather nice way – it will get to the point where all attempts to write will fail, but all of the information is still readable.

Because they fail in different ways, we have to cope with their failure in different ways too. Except for the most obvious thing – everything needs to be backed up. And of course getting a backup mechanism up and running is a pretty tedious task.

It would make a great deal of sense for a vendor to offer a cloud-based disaster recovery backup for your system disk(s). An account with a copy of the system disk image is created before your system is shipped. And once on line, your desktop PC sends updates to that image in the cloud. And when the disk fails, you can ask the vendor to ship a replacement disk with almost everything you previously had already put in place.

On a more general note, it is worth mentioning that most consumer hard disks at the bottom end of the market are complete rubbish. And I would pay extra to buy disks from a vendor that :-

Takes ordinary disks and burns them in for a week to verify that they are not going to go bad in the first few months; there’s a NAS vendor (whose name escapes me for the moment) that does this and has one of the lowest disk failure rates on the market despite using relatively cheap and nasty disks.
Ships them in proper packaging that absorbs the shipping bumps and knocks. Just because a disk drive looks intact does not mean it is safe to use.

And What About The File System?

So far it has all been about the hardware, but there is more we can do about reliability in software too. And carrying on from the previous section, one of those areas is how the operating system stores files on disks. The software module that does this is (to use the Unix or Linux term) the file system and there are different kinds.

Historically different file systems have assumed that the storage is perfectly reliable. However with the increased awareness of silent data corruption, there are now a few file systems that check for silent data corruption – including what is probably the first: ZFS.

Even if there is a small loss of performance, file systems should detect silent data corruption and correct if possible.

Preparing To Fail

We all know that software is unreliable; to be precise it is not perfectly reliable as it is a great deal more reliable than we give it credit for. After all we only notice the failures; and some of the failures at that.

Rather than trying just to produce reliable software, programmers should be designing software that fails safe without losing any data. See crash-only software.