Oct 252013
 

This is about updating FreeBSD once you have broken free of binary updates – probably not the best thing to do, but I needed to do it for various reasons. This isn’t a recommended method; merely the method I use. It is more about gathering the instructions I need.

First update the source :-

cd /usr/src
make update SVN_UPDATE=yes

Next step is to build the world :-

make -j 4 buildworld

This step takes quite a while, so it’s very helpful to add the “-j 4” to build in parallel.

The next step is to build the locally configured kernel. This usually starts with configuring it, but comparing the GENERIC configuration with the local configuration shows that in my case at the time of writing I didn’t have to re-configure.

make buildkernel KERNCONF=${NAME-OF-YOUR-CONFIG}

Once built the kernel needs to be installed :-

make installkernel KERNCONF=${NAME-OF-YOUR-CONFIG}

Next is to boot to single user mode which is done by pressing a key other than Return during the 10 second countdown, and entering boot -s at the prompt. Once booted into single user mode, the following steps need to be taken :-

# mount -u /; zfs mount -a
(To mount all filesystems with zfs in read-write mode)
# mergemaster -p
(To start the update of the configuration files)
# cd /usr/src
# make installworld
(To install the new user land)
# mergemaster
(To complete the update of the configuration files)
# make delete-old

That’s it for single-user mode. Now reboot into full multi-user and carry on …

Next step is to update the ports :-

# portsnap fetch
# portsnap update
# pkgdb -F
# cd /usr/ports/ports-mgmt/portupgrade
# make deinstall
# make install clean
# portupgrade -af --batch

This of course takes a very long time to complete.

And that should be it … seems to have been successful for me.

Jul 292013
 

… which is of course massive overkill. But fun. It should increase the raw bandwidth available between the two machines from 1Gbps to 20Gbps (with one link) and 40Gbps with both links bonded. It was a bit of a surprise to me when I looked around at prices of second-hand kit to realise that InfiniBand was so much cheaper to acquire than Fibre Channel; the kit I acquired cost less than £100 all in whereas FC kit would be in the region of £1,000, and InfiniBand is generally quicker. There is of course 16Gb FC and 10Gb InfiniBand, but that is hardly comparing like with like. So what is this overkill for? Networking of course. I’ve acquired two HP InfiniBand dual link cards which means I can connect my workstation to my server :- InfiniBand Network Using dual links is of course overkill on top of overkill, but given that these cards have dual links, why not use them? And it does give a couple of experiments to try later. To prepare in advance, the following network addresses will be used :-

Server Link Number IPv4 Address IPv6 Address
A 1 10.255.0.1 AAISP:d00d::1
A 2 10.255.1.1 AAISP:d00f::1
B 1 10.255.0.254 AAISP:d00d:2
B 1 10.255.1.254 AAISP:d00f:2

Yes I have cheated for the IPv6 addresses! The first step is to configure each “server” … one is running Debian Linux, and the other is running FreeBSD.

Configuring Linux

This was subject to much delay whilst I believed that I had a problem with the InfiniBand card, but putting the card into a new desktop machine caused it to spring back to life. Either some sort of incompatibility with my old desktop (which was quite old), or some sort of problem with the BIOS settings.

Inserting the card should load the core module (mlx4_core) automatically, and spit out messages similar to the following :-

[    3.678189] mlx4_core 0000:07:00.0: irq 108 for MSI/MSI-X
[    3.678195] mlx4_core 0000:07:00.0: irq 109 for MSI/MSI-X
[    3.678199] mlx4_core 0000:07:00.0: irq 110 for MSI/MSI-X
[    3.678204] mlx4_core 0000:07:00.0: irq 111 for MSI/MSI-X
[    3.678208] mlx4_core 0000:07:00.0: irq 112 for MSI/MSI-X
[    3.678212] mlx4_core 0000:07:00.0: irq 113 for MSI/MSI-X
[    3.678216] mlx4_core 0000:07:00.0: irq 114 for MSI/MSI-X
[    3.678220] mlx4_core 0000:07:00.0: irq 115 for MSI/MSI-X
[    3.678223] mlx4_core 0000:07:00.0: irq 116 for MSI/MSI-X
[    3.678228] mlx4_core 0000:07:00.0: irq 117 for MSI/MSI-X
[    3.678232] mlx4_core 0000:07:00.0: irq 118 for MSI/MSI-X
[    3.678236] mlx4_core 0000:07:00.0: irq 119 for MSI/MSI-X
[    3.678239] mlx4_core 0000:07:00.0: irq 120 for MSI/MSI-X
[    3.678243] mlx4_core 0000:07:00.0: irq 121 for MSI/MSI-X
[    3.678247] mlx4_core 0000:07:00.0: irq 122 for MSI/MSI-X
[    3.678250] mlx4_core 0000:07:00.0: irq 123 for MSI/MSI-X
[    3.678254] mlx4_core 0000:07:00.0: irq 124 for MSI/MSI-X
[    3.678259] mlx4_core 0000:07:00.0: irq 125 for MSI/MSI-X
[    3.678263] mlx4_core 0000:07:00.0: irq 126 for MSI/MSI-X
[    3.678267] mlx4_core 0000:07:00.0: irq 127 for MSI/MSI-X
[    3.678271] mlx4_core 0000:07:00.0: irq 128 for MSI/MSI-X
[    3.678275] mlx4_core 0000:07:00.0: irq 129 for MSI/MSI-X

This is just the core driver; at this point additional modules are needed to do anything useful. You can manually load the modules with modprobe but sooner or later it is better to make sure they’re loaded automatically by adding their names to /etc/modules. The modules you want to load are :-

  1. mlx4_ib
  2. ib_umad
  3. ib_uverbs
  4. ib_ipoib

This is a minimal set necessary for networking (“IP”) rather than additional features such as SCSI. It’s generally better to start with a minimal set of features initially. At this point, it is generally a good idea to reboot to verify that things are getting closer. After a reboot, you should have one or more new network interfaces listed by ifconfig :-

ib0       Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00  
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

ib1       Link encap:UNSPEC  HWaddr 80-00-00-49-FE-80-00-00-00-00-00-00-00-00-00-00  
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Despite the appearance, we still have quite a way to go yet. The next step is to install some additional packages: ibutilsinfiniband-diags, and opensm. The last package is for a subnet manager which is unnecessary if you have an InfiniBand switch (but I don’t). The first step is to get opensm up and running. Edit /etc/default/opensm and change the PORTS variable to “ALL” (unless you want to restrict the managed ports, and make things more complicated). And start opensm: /etc/init.d/opensm start; update-rc.d opensm defaults.

At this point, you can configure the network addresses by editing /etc/network/interfaces. If you need help doing this, then you’re in the tech pool beyond your depth! Without something at the other end, these interfaces won’t work (obviously), so it’s time to start work on the other end …

Configuring FreeBSD

See: https://wiki.freebsd.org/InfiniBand I hadn’t had cause to build a custom kernel before, so the very first task was to use subversion to checkout a copy of the FreeBSD source code :-

svn co svn://svn0.us-east.FreeBSD.org/base/stable/9 /usr/src

Updating will of course require just: cd /usr/src && svn update. Once installed, create a symlink from /sys to /usr/src/sys if the link does not already exist: ln -s /usr/src/sys /sys

Go to the kernel configuration directory (/usr/src/sys/amd64/conf), copy the GENERIC configuration file to a new file, and edit the new file to add in certain options :-

# Infiniband stuff (locally added)
options         OFED
options         IPOIB_CM
device          ipoib
device          mlx4ib

Again, this is a minimal set that will not offer full functionality … but should be enough to get IP networking up and running. The next step is to build and install the kernel :-

make buildkernel KERNCONF=${NAME-OF-YOUR-CONFIG}; make installkernel KERNCONF=${NAME-OF-YOUR-CONFIG}

The next step is to build the “world”  :-

  1. Edit /etc/src.conf and add “WITH_OFED=’yes'” to that file.
  2. Change to /usr/src and run: make buildworld
  3. Finalise with make installworld

As it happens I had to build the user-land first, as the kernel compilation needed a new user-land feature.

After a reboot, the new network interface(s) should show up as ib0 upwards. And these can be configured with an address in exactly the same as any other network interface.

Testing The Network

A tip for making sure the interfaces you think are connected together is to configure one of the machines, send a broadcast ping to the relevant network address of each interface in turn, and run tcpdump on the other machine to verify that the packets coming down the wire match what you expect.

Below the level of IP, it is possible to run an InfiniBand ping to verify connectivity. First you need a GUID on “the server”, which can be obtained by running ibstat and looking for the “Port GUID”, which will be something like “0x0002c90200273985”. Next run ibping -S on the server.

Now on the other machine (“the client”), run ibping :-

# ibping -G 0x0002c90200273985
Pong from polio.inside.zonky.org (Lid 3): time 0.242 ms
Pong from polio.inside.zonky.org (Lid 3): time 0.153 ms
Pong from polio.inside.zonky.org (Lid 3): time 0.160 ms

The next step is to run an IP ping to one of the hosts. If that works, it is time to start looking at something that will do a reasonable attempt at a speed test.

This can be done in a variety of different ways, but I chose to use nttcp which is widely available. On one of the hosts, run nttcp -i to act as the “partner” (or server). On the sending server, run nntcp -T ${ip-address-to-test} which will give output something like :-

# nttcp -T 10.0.0.26
     Bytes  Real s   CPU s Real-MBit/s  CPU-MBit/s   Calls  Real-C/s   CPU-C/s
l  8388608    0.70    0.01     95.7975   5592.4053    2048   2923.51  170666.7
1  8388608    0.71    0.04     94.0667   1878.6950    5444   7630.87  152403.3

According to the documentation, the second line should begin with ‘r’, but for a simple speed test we can simply average the numbers in the “Real-MBit/s” to get an approximate speed. Oddly my gigabit ethernet seems to have mysteriously degraded to 100Mbps! At least it makes the InfiniBand speed slightly more impressive :-

# nttcp -T 10.255.0.2
     Bytes  Real s   CPU s Real-MBit/s  CPU-MBit/s   Calls  Real-C/s   CPU-C/s
l  8388608    0.03    0.00   2521.9415  16777.2160    2048  76963.55  512000.0
1  8388608    0.03    0.03   2206.6574   2568.6620    4032 132579.25  154329.0

Before getting into a panic over what appears to be a pretty poor result, it is worth bearing in mind that IP over InfiniBand isn’t especially efficient, and InfiniBand seems to suffer from marketing exaggeration. From what I understand, DDR’s 20Gbps signalling rate becomes 16Gbps, which in turn becomes 8.5Gbps when looking at the output of ibstatus (not ibstat) – why the halving here is a bit of a mystery, but that may become apparent later.

There has also been a hint that FreeBSD is due for a significant improvement in InfiniBand performance sometime after the release of 9.2.

As a late addition, it would appear that running OpenSM (the subnet manager) on both hosts means that when one or other is rebooting, the other can take over the duties of the subnet manager. To enable on FreeBSD, simply add opensm_enable=”YES” to the file /etc/rc.conf and reboot.

Feb 112013
 

One of the obvious things to do with a ZFS storage pool is to increase the size of the disks in it – after all disks get bigger and cheaper over time. Not that it is a very difficult thing to do, but it is always worth doing a quick search to find out what others have done before setting forth. And if nobody blogs their own experience, there’s nothing for anybody to find!

So I started off with four 2Tbyte drives configured as two vdevs each of which was a mirror. And I had two 3Tbyte disks to swap in. So I was going to be swapping one of the vdevs (consisting of two 2Tbyte drives) with the 3Tbyte drives.

In the details below, I have a storage pool called zroot and the two disks being replaced are gpt/disk3 and gpt/disk2. As you will notice, I am growing the storage pool I boot off; however the disks I am using do not contain a boot partition with the boot code.

The first job was to swap out one of the 2Tbyte drives. This was done by :-

  1. Take disk to be swapped out offline: zpool offline zroot gpt/disk3
  2. Shut down the server and take the selected drive out. Swap over the disk caddy onto a new 3Tbyte drive, and swap that back in.
  3. Power on the server.
  4. Create an EFI partition table: gpart create -s gpt ada3
  5. Optionally create a swap partition: gpart add -t freebsd-swap -s 4G -l swap3 ada3
  6. Create a ZFS partion: gpart add -t freebsd-zfs -l disk3 ada3
  7. Replace the device: zpool replace zroot gpt/disk3

Now is the time to wait for the resilvering process to complete. Once that has finished, the steps above can be repeated for the other drive in the vdev. Once the resilvering for that replacement has finished, you may want to check the size of the pool.

If the size has not increased, you may need to do: zpool online -e zroot gpt/disk2 gpt/disk3.

Aug 252007
 

If you’re hoping to read about Linux finally getting ZFS (except as a FUSE module) then you are going to be disappointed … this is merely a rant about the foolishness shown by the open-source world. It seems that the reason we won’t see ZFS in the Linux kernel is not because of technical issues but because of licensing issues … the two open-source licenses (GPL and CDDL) are allegedly incompatible!

Now some may wonder why ZFS is so great given that most of the features are available in other storage/filesystem solutions. Well as an old Unix systems administrator, I have seen many different storage and filesystem solutions over time … Veritas, Solaris Volume Manager, the AIX logical volume manager, Linux software RAID, Linux LVM, …, and none come as close to perfection as ZFS. In particular ZFS is insanely simple to manage, and those who have never managed a server with hundreds of disks may not appreciate just how desireable this simplicity is.

Lets take a relatively common example from Linux; we have two disks and no RAID controller so it makes sense to use Linux software RAID to create a virtual disk that is a mirror of the two physical disks. Not a difficult task. Now we want to split that disk up into seperate virtual disks to put filesystems on; we don’t know how large the different filesystems will become so we need to have some facility to grow and shrink those virtual disks. So we use LVM and make that software RAID virtual disk into an LVM “physical volume”, add the “physical volume” to a volume group, and finally create “logical volumes” for each filesystem we want. Then of course we need to put a filesystem on each “logical volume”. None of these steps are particularly difficult, but there are 5 seperate steps, and the separate software components are isolated from each other … which imposes some limitations.

Now imagine doing the same thing with ZFS … we create a storage pool consisting of two mirrored physical disks with a single command. This storage pool is automatically mounted as a filesystem ready for immediate use. If we need separate filesystems, we can create each with a single command. Now we come to the advantages … filesystem ‘snapshots’ are almost instantaneous and do not consume additional disk space until changes are made to the original filesystem at which point the increase in size is directly proportional to the changes made. Each ZFS filesystem shares the storage pool with the size being totally dynamic (by default) so that you do not have a set size reserved for each filesystem … essentially the free space on every single filesystem is available to all filesystems.

So what is the reason for not having ZFS under Linux ? It is open-source so it is technically possible to add to the Linux kernel. It has already been added to the FreeBSD kernel (in “-CURRENT”) and will shortly be added to the released version of OSX. Allegedly because the license is incompatible. The ZFS code from Sun is licensed under the CDDL license and the Linux kernel is licensed under the GPL license. I’m not sure how they are incompatible because frankly I have better things to do with my time than read license small-print and try to determine the effects.

But Linux (reluctantly admittedly) allows binary kernel modules to be loaded into the kernel and the license on those certainly isn’t the GPL! So why is not possible to allow GPLed code and CDDLed code to co-exist peacefully ? After all it seems that if ZFS were compiled as a kernel module and released as a binary blob, it could then be used … which is insane!

The suspicion I have is that there is a certain amount of “not invented here” going on.