Sep 01 2016
 

One of the advantages that ZFS brings is that creating file systems is so easy that you can create them for purposes you would not previously have considered. For example, I have an additional file system mounted under my home directory for a certain application that generates a lot of data that I do not need backed up. Because the script I use to back up stuff does not cross file system boundaries (i.e. it does not descend into a directory that contains a mounted file system), I can simply exclude a large amount of frequently changing data by making a file system.

Or I might (as it happens I do not, but I could well do) create file systems for large lumps of data to easily see how much space they occupy – perhaps ~/Pictures. You can run a command like du -sh ~/Pictures, but that is an expensive command (it takes a while) and it tells you how large the files are, not how much space they occupy on disk. And on-disk compression can make that a significant difference! So simply run df -h ~/Pictures if that directory is on a separate file system.

But there is a bit of a gotcha with that. If you create such file systems in the normal way (such as zfs create pool/mikes-pictures; zfs set mountpoint=/home/mike/Pictures pool/mikes-pictures) you risk creating a situation that may prevent your home directory from mounting. If the “child” file system is mounted before the parent, it will not be possible for the parent file system to be mounted when booting.

Instead create the hierarchy properly :-

zfs create pool/h2
mkdir /h2
zfs set mountpoint=/h2 pool/h2
zfs create pool/h2/mike
zfs create pool/h2/mike/Pictures
ls /h2/mike/Pictures

You will also have to fix the permissions, but this is a far safer way of organising things, and one that is suitable for future file system creation.
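A sketch of the permissions fix, assuming the user and group are both called mike :-

chown -R mike:mike /h2/mike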


Jan 12 2014
 

Computers have gotten faster … a lot faster. In some cases there is never enough speed, but to a certain extent today’s computers are not noticeably faster than computers of a few years ago. At least not if you do not run benchmarks. So there is little incentive to upgrade that 5 year old desktop machine – unless you are running Windows XP of course (support for which will be dropped soon).

Unless of course you look at aspects other than simple speed – such as reliability.

A few years ago I used to run old Unix workstations in preference to PCs despite their lack of speed, because they were simply more reliable – I could leave a workstation running for weeks without any negative effects. Whereas the PCs I was used to using were just not quite as stable; every so often something unexpected would occur and a reboot would be necessary. Usually at the most irritating possible time.

We expect computers to be reliable, but are all too often disappointed.

Desktop manufacturers may be able to revive the flagging market for desktops by offering something new – desktops with reliability. There are a number of reliability features that are commonly found in servers that could be offered in desktops with only a marginal increase in cost.

Error Correcting Code Memory

Forget the “code” part in the title; without going into a great deal of technical detail, ECC memory automatically corrects memory errors when they occur. And occur they do.

There are a variety of causes of bit errors within memory, varying from cosmic rays to atmospheric radiation; the cause does not matter so much. What matters is how frequently they occur. According to small studies and theory, they should be quite rare, but Google have released a paper actually measuring the error rate in a large pool of machines; the error rate is roughly 5 single-bit errors in 8 Gigabytes of RAM per hour.

If true, that’s more than enough to have a significant impact on the reliability of your average desktop PC. If a piece of software has some random instructions changed into something else, it will usually crash or do something strange to your data. Or if that random memory error occurs within your data, then you might expect a strange coloured blob to appear in your favourite photo.

Normal desktop PCs do not come supplied with ECC memory because it is slightly more expensive than ordinary memory. Without going into details, ECC memory uses additional memory to maintain a check on the contents of main memory.

And that costs more. Not a lot more, but in a competitive market, a small saving may lead to increased sales. Of course there are other ways to increase sales – such as by making a feature of ECC memory and reliability.

Storage

We are currently in a transition period between mechanical storage (disks) and electronic mass storage (flash). Flash storage currently offers very fast storage but with a price tag attached meaning it is infeasible for large amounts of storage. That will of course change.

In the meantime we have to deal with two storage solutions; one with a reputation for unreliability (flash) and one that is really unreliable (disks). Both fail with regrettable regularity (although disks will fail more often!) but fail in different ways. Disks are likely to have a short period where they do not work very well before refusing to do anything, although as mechanical devices they can fail in surprising ways too! Flash will tend to fail in a rather nicer way – it will get to the point where all attempts to write fail, but all of the information is still readable.

Because they fail in different ways, we have to cope with their failure in different ways too. Except for the most obvious thing – everything needs to be backed up. And of course getting a backup mechanism up and running is a pretty tedious task.

It would make a great deal of sense for a vendor to offer a cloud-based disaster recovery backup for your system disk(s). An account with a copy of the system disk image is created before your system is shipped. And once on line, your desktop PC sends updates to that image in the cloud. And when the disk fails, you can ask the vendor to ship a replacement disk with almost everything you previously had already put in place.

On a more general note, it is worth mentioning that most consumer hard disks at the bottom end of the market are complete rubbish. And I would pay extra to buy disks from a vendor that :-

  1. Takes ordinary disks and burns them in for a week to verify that they are not going to go bad in the first few months; there’s a NAS vendor (whose name escapes me for the moment) that does this and has one of the lowest disk failure rates on the market despite using relatively cheap and nasty disks.
  2. Ships them in proper packaging that absorbs the shipping bumps and knocks. Just because a disk drive looks intact does not mean it is safe to use.

 And What About The File System?

So far it has all been about the hardware, but there is more we can do about reliability in software too. And carrying on from the previous section, one of those areas is how the operating system stores files on disks.  The software module that does this is (to use the Unix or Linux term) the file system and there are different kinds.

Historically different file systems have assumed that the storage is perfectly reliable. However with the increased awareness of silent data corruption, there are now a few file systems that check for silent data corruption – including what is probably the first: ZFS.

Even if there is a small loss of performance, file systems should detect silent data corruption and correct it where possible.
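With ZFS, for example, you can ask the file system to walk every block and verify its checksums whenever you like (the pool name here is just a placeholder) :-

zpool scrub pool0
zpool status -v pool0

The second command shows the progress of the scrub and lists any files affected by checksum errors that could not be repaired.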

Preparing To Fail

We all know that software is unreliable; to be precise, it is not perfectly reliable, although it is a great deal more reliable than we give it credit for. After all, we only notice the failures – and only some of the failures at that.

Rather than trying just to produce reliable software, programmers should be designing software that fails safe without losing any data. See crash-only software.

Oct 13 2013
 

I discovered this cool feature of Linux quite by accident. zRAM is a block device (i.e. a “disk”) where the contents are compressed and stored in memory, which makes it sound rather mundane and hardly very interesting. However in use, it does appear to be quite nifty; sufficiently so that Google are enabling it for Chrome OS. So why?

The way that it is usually configured is as a swap space … so in effect, zRAM is used to compress normal memory, trading processor utilisation for more memory. What should happen is that instead of hitting the performance brick wall of suddenly paging to disk when you hit the memory limits of your machine, the zRAM is used instead eating a bit of processor time but with any luck keeping everything within memory rather than going to disk. It should have no effect during normal operation, but during temporary surges of memory utilisation, it should allow things to proceed at more or less normal performance.

That’s the theory anyway; but if it were not the case would Google be enabling it by default?

Of course in addition to using it as a swap device, there are other possible uses for zRAM devices :-

  1. As an L2ARC cache device for those using ZFS.
  2. As a block device for very hot areas of disk, such as Exim’s retry database – which can be safely discarded on reboot.
  3. Or any other cache whose contents can be safely discarded at any point.

The last point is worth remembering. Because zRAM devices are contained within main memory, their contents are discarded when the power goes away.

Configuration

To use zRAM, we need to load the zRAM module, and choose how many devices to make at the same time. Some people believe that it makes sense to create as many devices as you have cores, as that gives each core (or thread) a device to spend its time compressing. To do this, we add the following to the /etc/rc.local file (assuming a Debian system) :-

/sbin/modprobe zram zram_num_devices=$(cat /proc/cpuinfo | grep processor | wc -l)

By default the zRAM will allocate 25% of the main memory to all of the zRAM devices; personally I think that is reasonable enough. However it seems that as soon as you set the number of devices, the size defaults to zero … so we have to set the size of the device as we configure it. Once created, you will have to decide how to use the devices. In my case, I wanted to use half of the devices for swap and half for L2ARC, which I did by adding the following to /etc/rc.local :-

size=$(( ($(cat /proc/meminfo | awk '/^MemTotal/ {print $2}')*1024) / (4 * $(cat /proc/cpuinfo| grep "^processor" | wc -l)) ))
#       Complex way of determining the size of each zRAM device:
#       a quarter of the total memory (in bytes) divided by the number of devices
for dev in /dev/zram*
do
  base=$(basename $dev)
  # Set the size of this device
  echo $size > /sys/block/${base}/disksize
  # Extract the device number and use it to alternate between the two uses
  odd=$(( $(echo $dev | sed -e "s/^.*zram//") % 2 ))
  if [ $odd = 0 ]
  then
    # Even-numbered devices become high-priority swap
    /sbin/mkswap $dev
    /sbin/swapon -p 32767 $dev
  else
    # Odd-numbered devices become L2ARC cache devices for the ZFS pool
    zpool remove pool0 $dev > /dev/null 2>&1
    zpool add pool0 cache $dev
  fi
done

This is a rather complex way of doing it, and doesn’t contain much in the way of error checking, but it does work.

Feb 11 2013
 

One of the obvious things to do with a ZFS storage pool is to increase the size of the disks in it – after all disks get bigger and cheaper over time. Not that it is a very difficult thing to do, but it is always worth doing a quick search to find out what others have done before setting forth. And if nobody blogs their own experience, there’s nothing for anybody to find!

So I started off with four 2Tbyte drives configured as two vdevs each of which was a mirror. And I had two 3Tbyte disks to swap in. So I was going to be swapping one of the vdevs (consisting of two 2Tbyte drives) with the 3Tbyte drives.

In the details below, I have a storage pool called zroot and the two disks being replaced are gpt/disk3 and gpt/disk2. As you will notice, I am growing the storage pool I boot off; however the disks I am using do not contain a boot partition with the boot code.

The first job was to swap out one of the 2Tbyte drives. This was done by :-

  1. Take disk to be swapped out offline: zpool offline zroot gpt/disk3
  2. Shut down the server and take the selected drive out. Swap over the disk caddy onto a new 3Tbyte drive, and swap that back in.
  3. Power on the server.
  4. Create a GPT partition table: gpart create -s gpt ada3
  5. Optionally create a swap partition: gpart add -t freebsd-swap -s 4G -l swap3 ada3
  6. Create a ZFS partition: gpart add -t freebsd-zfs -l disk3 ada3
  7. Replace the device: zpool replace zroot gpt/disk3

Now is the time to wait for the resilvering process to complete. Once that has finished, the steps above can be repeated for the other drive in the vdev. Once the resilvering for that replacement has finished, you may want to check the size of the pool.

If the size has not increased, you may need to do: zpool online -e zroot gpt/disk2 gpt/disk3.
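Put together, a rough sketch of the checking steps (using the pool and disk names from above) :-

zpool status zroot                        # watch the resilvering progress
zpool list zroot                          # check whether the extra space has appeared
zpool online -e zroot gpt/disk2 gpt/disk3 # if not, expand onto the larger disks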

Oct 27 2009
 

Whether you are using UFS filesystems or ZFS storage pools, Solaris has a rather nifty way of migrating storage from one SAN to another with no (or little) downtime – or indeed for various other reasons involving moving from one disk to another. The key advantage of the following method is reducing or eliminating downtime. Even if your users can take the hit, not having to slowly watch a multi-terabyte filesystem copying from one disk to another is reason enough to use this technique.

Basically it is done by using mirroring. Using mirroring to copy a disk might seem a little odd to begin with, but once you’ve seen it work you’ll be a fan.

For UFS (and SVM) Filesystems

This section assumes that the source disk device (cXXXXX) is set in the variable ${sourcedisk} and the destination is in ${destdisk}.

For UFS filesystems, the first step (which does require an outage) is to :-

  1. Stop the application that uses the filesystem being migrated.
  2. Unmount the filesystem.
  3. Encapsulate the existing filesystem device into a SVM metadevice: metainit d1001 1 1 ${sourcedisk}
  4. Create a mirror device with the new metadevice as a submirror: metainit d1000 -m d1001
  5. Change the references in /etc/vfstab from the old device name (${sourcedisk}) to the new mirror (not sub-mirror!) device – d1000
  6. Remount the filesystem and restart the application.

This should take no more than 10 minutes and is the only outage involved. There are two remaining sets of steps :-

  1. Create a new metadevice using the new disk: metainit d1002 1 1 ${destdisk}
  2. Attach the new metadevice to the mirror as an additional sub-mirror: metattach d1000 d1002

At this point, the mirror will start resilvering. It may take some time to complete, but the time it takes to do so does not really matter. In particular the resilvering process should not cause a performance problem to your application – the application I/O takes priority.

When the resilvering is complete :-

  1. Remove the metadevice containing the old SAN disk: metadetach d1000 d1001
  2. Remove the metadevice that is no longer required: metaclear d1001
  3. Attach “nothing” to the mirror metadevice (this is to ensure that the mirror grows to the size of the new submirror): metattach d1000
  4. Finally, ignore the warning on the manual page (which is outdated) and grow the filesystem: growfs -M /mount/point /dev/md/rdsk/d1000

You will see that I have used the metadevice names d1000 (for the mirror), d1001 (for the old sub-mirror), and d1002 (for the new submirror). Whatever device names you use, it is worth trying to be consistent – it helps a lot when you have dozens of filesystems to process.
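For reference, the whole sequence strung together looks something like this (a sketch only – the metadevice names, mount point, and disk variables are as above, and the first few lines are the only ones needing an outage) :-

# Outage: stop the application and unmount the filesystem first
metainit d1001 1 1 ${sourcedisk}     # encapsulate the existing device as a sub-mirror
metainit d1000 -m d1001              # create the mirror on top of it
# ... edit /etc/vfstab to reference /dev/md/dsk/d1000, remount, restart the application ...

metainit d1002 1 1 ${destdisk}       # metadevice on the new disk
metattach d1000 d1002                # attach it; resilvering starts in the background

# After resilvering has completed
metadetach d1000 d1001               # drop the old sub-mirror
metaclear d1001                      # and remove its metadevice
metattach d1000                      # let the mirror grow to the new sub-mirror's size
growfs -M /mount/point /dev/md/rdsk/d1000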

ZFS Storage Pools

This is even simpler. If you have a storage pool called ${pool} which contains a single device called ${sourcedisk}, you simply :-

  1. Attach the new device: zpool attach ${pool} ${sourcedisk} ${destdisk}
  2. Wait for the resilvering to finish.
  3. Detach the old device: zpool detach ${pool} ${sourcedisk}
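Strung together as a rough sketch (the grep is a crude way of waiting for the resilvering to finish; checking zpool status by hand works just as well) :-

zpool attach ${pool} ${sourcedisk} ${destdisk}

while zpool status ${pool} | grep -q "resilver in progress"
do
  sleep 60
done

zpool detach ${pool} ${sourcedisk}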

Of course, be wary of anything you read on the Internet! I have not actually tested the above; I’m merely regurgitating memory that has recently been exercised – I’m doing a SAN migration at work right now.

Oct 03 2009
 

Yesterday I went through the process of creating a ZFS storage pool with a single device :-

zpool create zt1 cXXXXX

Next I added an additional device to mirror the first :-

zpool attach zt1 cXXXXX cYYYYY

I watched it resilver, and then detached the first replica, reducing the number of replicas to one :-

zpool detach zt1 cXXXXX

This is one of the nicest ways possible to migrate a large dataset from one set of devices to another (say replacing a SAN). However the documentation on Sun’s manual page for zpool is just a little vague in the relevant area and does not explicitly say that a single replica is a perfectly valid configuration.

This might all seem a little obvious, but removing a replica to reduce a storage pool to a pool without a mirror (no redundancy) is something that some volume managers don’t allow.

Feb 25 2009
 

Traditionally I have always mounted just the filesystems I needed in single user mode whilst tinkering in Solaris. Turns out this is a dumb method for ZFS filesystems.

What happens is that the zfs mount command will create any directories necessary to mount the filesystem. Because ZFS will not mount a filesystem on top of a non-empty directory, those newly created directories can later stop other ZFS filesystems from mounting once the tinkering is finished. This could be an argument for not creating hierarchies of filesystems, but that’s rather extreme.

The better solution is to mount all the ZFS filesystems in one go with :-

zfs mount -a
Jan 08 2009
 

If you think that you can use a ZFS volume as a Solaris LVM metadevice database, you will be disappointed. Whilst it works initially, the LVM subsystem is initialised before ZFS at boot, so it cannot find the databases. Whilst this may seem to be a perverse configuration, at least one administrator has tried it – that being me!

Nov 07 2007
 

I have been spending some time looking up information on ZFS for OSX because I’ve used ZFS under Solaris and would quite like it on my new Macbook. In many of the places I looked, there were tons of comments wondering why ZFS would be of any use for ordinary users. Oddly, the responders were pointing to features that are more useful for servers than workstations, and the doubters were responding with “So?”.

This is perhaps understandable because most of the information out there is for Solaris ZFS and tends to concentrate on the advantages for the server (and the server administrator). This is perhaps unfortunate because I can see plenty of advantages for ordinary users.

I will go through some of the advantages of ZFS that may work for ordinary users. In some cases I will give examples using a command-line. Apple will undoubtedly come up with a GUI for doing much of this, but I don’t have access to that version of OSX and the command-line still works.

ZFS Checks Writes

Unlike most conventional filesystems, ZFS does not assume that hard disks are perfect and uses checks on the data it writes to ensure that what gets read back is what was written. As each “block” is written to disk, ZFS will also write a checksum; when reading a “block” ZFS will verify that the block read matches the checksum.

This has already been commented on by people using ZFS under Solaris as showing up problematic disks that were thought to be fine. Who wants to lose data ?

This checksum checking that zfs does will not protect from the most common forms of data loss … hard disk failures or accidentally removing files. But it does protect against silent data corruption. As someone who has seen this personally, I can tell you it is more than a little scary with mysterious problems becoming more and more common. Protecting against this is probably the biggest feature of ZFS although it is not something that is immediately obvious.
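You can see the result of all this checking with zpool status; a healthy pool shows zero checksum errors in the CKSUM column (the pool and disk names below are made up, and the output trimmed) :-

% zpool status zpool0
  pool: zpool0
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        zpool0      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            disk0   ONLINE       0     0     0
            disk1   ONLINE       0     0     0
errors: No known data errors
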
ZFS Filesystems Are Easy To Create

So easy in fact that it frequently makes sense to create a filesystem where in the past we would create a directory. Why? So that it is very easy and quick to see who or what is using all that disk space that got eaten up since last week.

Let’s assume you currently have a directory structure like :-

/Users/mike
/Users/john
/Users/stuart
/Users/stuart/music
/Users/stuart/photos

If those directories were ZFS filesystems, you could instantly see how much disk space is in use for each with the command zfs list :-

% zfs list
NAME                                 USED   AVAIL   REFER   MOUNTPOINT
zpool0                               91.3G  23G     3.91M   /zpool0
zpool0/Users/mike                    112M   23G     112M    /Users/mike
zpool0/Users/john                    919M   23G     919M    /Users/john
zpool0/Users/stuart                  90.3G  23G     309M    /Users/stuart
zpool0/Users/stuart/music            78G    23G     78G     /Users/stuart/music
zpool0/Users/stuart/photos           12G    23G     12G     /Users/stuart/photos

With one very simple (and quick) command you can see that Stuart is using the most space in his ‘music’ folder … perhaps he has discovered Bittorrent! The equivalent for a series of directories on a normal filesystem can take a long time to complete.

With any luck Apple will modify the Finder so that alongside the option to create a new folder is a new option to “create a new folder as a ZFS filesystem” (or something more user-friendly).

It may seem silly to have many filesystems when we are used to filesystems that are fixed in size (or are adjustable but in limited ways), but zfs filesystems are allocated out of a common storage pool and grow and shrink as required.

ZFS Supports Snapshots

Heard of “Time Machine” ? Nifty isn’t it ?

Well ZFS snapshots do the same thing … only better. Time Machine is pretty much limited to an external hard disk which is all very well if you happen to have one with you, but not much use when you only have a single disk. ZFS snapshots work “in place” and are instantaneous. In addition you can create a snapshot when you want to … for instance just before starting to revise a large document so that if everything goes wrong you can quickly revert.
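As an illustration (re-using the filesystem names from the earlier example), taking a snapshot before a risky edit and reverting afterwards is one command each way :-

zfs snapshot zpool0/Users/stuart/photos@before-editing

# ... edit away; if it all goes horribly wrong ...

zfs rollback zpool0/Users/stuart/photos@before-editing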

Time Machine has one little disadvantage … if you modify a very large file, it will need to duplicate the entire file multiple times. For instance if you have a 1Gbyte video that you are editing over multiple days, Time Machine will store the entire video every time it ‘checkpoints’ the filesystem. This can add up pretty quickly, and could be a problem if you work on very large files. ZFS snapshots store only the changes to the file (although an application can accidentally ‘break’ this), making them far more space efficient.

One thing that zfs snapshots does not do that Time Machine does, is to ensure you have a backup of your data on an external hard disk. The zfs equivalent is the zfs send command which sends a zfs snapshot “somewhere”. The somewhere could be to a zfs storage pool on an external hard disk, to a zfs pool on a remote server somewhere (for instance an external hard disk attached to your Mac at work to give you offsite backups), or even to a storage server that does not understand ZFS! And yes you can send “incrementals” in much the same way too.

Currently using zfs send (and the opposite zfs receive) requires inscrutable Unix commands, but somebody will soon come up with a friendlier way of doing it. Oh! It seems they already have!
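For the curious, the inscrutable commands look something like this (the pool, filesystem, and host names are of course invented) :-

zfs send zpool0/Users/stuart@monday | zfs receive backuppool/stuart

zfs send -i zpool0/Users/stuart@monday zpool0/Users/stuart@tuesday | ssh backuphost zfs receive backuppool/stuart

The first sends a complete copy of a snapshot to a pool on (say) an external disk; the second sends only the changes between two snapshots to a pool on a remote machine.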

Unfortunately I’ve found out that using ZFS with Leopard is currently (10.5.0) pretty difficult … the beta code for ZFS is hard to get hold of, and may not be too reliable. Funnily enough this mirrors what happened when Solaris 10 first came out … ZFS was not ready until the first update of Solaris 10!

Unfortunately it seems that Apple have retreated from using ZFS in OSX, which is a great shame, and until they come up with something better we are stuck with HFS+, which means not only do we lack the features of a modern filesystem, but we are also stuck with slow fsck times. Ever wonder why that blue screen when a Mac starts sometimes takes much longer? The chances are that it is because a filesystem is being checked – something that isn’t necessary with a modern filesystem.

Aug 25 2007
 

If you’re hoping to read about Linux finally getting ZFS (except as a FUSE module) then you are going to be disappointed … this is merely a rant about the foolishness shown by the open-source world. It seems that the reason we won’t see ZFS in the Linux kernel is not because of technical issues but because of licensing issues … the two open-source licenses (GPL and CDDL) are allegedly incompatible!

Now some may wonder why ZFS is so great given that most of the features are available in other storage/filesystem solutions. Well as an old Unix systems administrator, I have seen many different storage and filesystem solutions over time … Veritas, Solaris Volume Manager, the AIX logical volume manager, Linux software RAID, Linux LVM, …, and none come as close to perfection as ZFS. In particular ZFS is insanely simple to manage, and those who have never managed a server with hundreds of disks may not appreciate just how desirable this simplicity is.

Let’s take a relatively common example from Linux: we have two disks and no RAID controller, so it makes sense to use Linux software RAID to create a virtual disk that is a mirror of the two physical disks. Not a difficult task. Now we want to split that disk up into separate virtual disks to put filesystems on; we don’t know how large the different filesystems will become, so we need some facility to grow and shrink those virtual disks. So we use LVM and make that software RAID virtual disk into an LVM “physical volume”, add the “physical volume” to a volume group, and finally create “logical volumes” for each filesystem we want. Then of course we need to put a filesystem on each “logical volume”. None of these steps are particularly difficult, but there are 5 separate steps, and the separate software components are isolated from each other … which imposes some limitations.
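To make the point concrete, the Linux version looks roughly like this (the device names, size, and choice of ext3 are all just for illustration) :-

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
pvcreate /dev/md0
vgcreate vg0 /dev/md0
lvcreate -L 100G -n home vg0
mkfs -t ext3 /dev/vg0/home
mount /dev/vg0/home /home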

Now imagine doing the same thing with ZFS … we create a storage pool consisting of two mirrored physical disks with a single command. This storage pool is automatically mounted as a filesystem ready for immediate use. If we need separate filesystems, we can create each with a single command. Now we come to the advantages … filesystem ‘snapshots’ are almost instantaneous and do not consume additional disk space until changes are made to the original filesystem at which point the increase in size is directly proportional to the changes made. Each ZFS filesystem shares the storage pool with the size being totally dynamic (by default) so that you do not have a set size reserved for each filesystem … essentially the free space on every single filesystem is available to all filesystems.
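The ZFS equivalent, under the same assumptions (disk names invented), collapses to :-

zpool create pool0 mirror c0t0d0 c0t1d0   # a mirrored pool, immediately mounted and usable

zfs create pool0/home                     # separate filesystems are one command each,
zfs create pool0/home/mike                # all sharing the pool's free space

zfs snapshot pool0/home/mike@today        # a near-instantaneous snapshot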

So what is the reason for not having ZFS under Linux ? It is open-source so it is technically possible to add to the Linux kernel. It has already been added to the FreeBSD kernel (in “-CURRENT”) and will shortly be added to the released version of OSX. Allegedly because the license is incompatible. The ZFS code from Sun is licensed under the CDDL license and the Linux kernel is licensed under the GPL license. I’m not sure how they are incompatible because frankly I have better things to do with my time than read license small-print and try to determine the effects.

But Linux (reluctantly admittedly) allows binary kernel modules to be loaded into the kernel and the license on those certainly isn’t the GPL! So why is it not possible to allow GPLed code and CDDLed code to co-exist peacefully ? After all it seems that if ZFS were compiled as a kernel module and released as a binary blob, it could then be used … which is insane!

The suspicion I have is that there is a certain amount of “not invented here” going on.
