Mar 9, 2020

ZFS on Linux with all flash?

In the previous post I described how I mounted 4 NVMe flash drives in a single host in order to build an all-flash datastore. In this post, I’ll describe how to move from FreeNAS, TrueNAS (or any other ZFS host OS) to ZFS on Linux, and test whether the performance is good enough to continue with ZFS. Spoiler: it’s not.

Moving to ZFS on Linux

Still on the FreeNAS box, I extended my single SSD pool (yes really, 1TB NVMe SSDs were expensive 3 years ago) to a pool with 2 mirrored vdevs through standard zpool commands. Afterwards I shut down the VM and connected the disks (via passthrough/VT-d) to a new Ubuntu VM.
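For readers who want to do the same, the extension from a single disk to two mirrored vdevs could look roughly like the sketch below. The device names nvd0 to nvd3 are hypothetical (check zpool status for the real ones), the order of the steps is an assumption, and the script only prints the commands unless it is run with DRY= set to empty.

```shell
#!/bin/sh
# Sketch only. nvd0..nvd3 are hypothetical FreeNAS-style device names.
DRY=${DRY-echo}   # prints the commands by default; run with DRY= to execute them
POOL=tank

extend_pool() {
    # Attach a second disk to the existing one: the single-disk vdev becomes
    # a two-way mirror and ZFS resilvers the data onto the new disk.
    $DRY zpool attach "$POOL" nvd0 nvd1
    # Add a second mirrored vdev to the pool.
    $DRY zpool add "$POOL" mirror nvd2 nvd3
    # Inspect the resulting layout and the resilver progress.
    $DRY zpool status "$POOL"
}

extend_pool
```

Waiting for the resilver to finish before adding the second mirror keeps the two steps easy to verify separately.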

Installing ZFS on Ubuntu is easy as the essential parts are included in the kernel, and there’s a standard package for managing ZFS:

$ sudo apt install zfsutils-linux

Instead of creating a pool, I imported my existing pool of 4 disks:

$ sudo zpool import tank -f

The -f (force) option is needed because I didn’t export the pool properly in FreeNAS (shame on me).
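A cleaner handover, which should make the force flag unnecessary, might look like the sketch below. The function names are made up for illustration, and the commands are only printed unless the script is run with DRY= set to empty.

```shell
#!/bin/sh
# Sketch only: export on the old host, then import on the new one.
DRY=${DRY-echo}   # prints the commands by default; run with DRY= to execute them

on_old_host() {
    # Flush and mark the pool as exported before the disks are detached.
    $DRY zpool export tank
}

on_new_host() {
    # Without arguments, zpool import lists the pools available for import.
    $DRY sudo zpool import
    # Import by name; no -f should be needed after a clean export.
    $DRY sudo zpool import tank
}

on_old_host
on_new_host
```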

$ zpool status
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: resilvered 232G in 1h29m with 0 errors on Thu Feb 27 12:39:37 2020

        NAME           STATE     READ WRITE CKSUM
        tank           ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1    ONLINE       0     0     0
            nvme3n1    ONLINE       0     0     0
          mirror-1     ONLINE       0     0     0
            nvme1n1    ONLINE       0     0     0
            nvme2n1    ONLINE       0     0     0

errors: No known data errors

The pool contains a ZFS volume (zvol) named iSCSIvol, which is a virtual block device to allow ESXi to connect through iSCSI:

$ zfs list
NAME                                                    USED  AVAIL  REFER  MOUNTPOINT
tank                                                   1.53T   339G    88K  /tank
tank/.system                                           50.8M   339G   100K  legacy
tank/iSCSIvol                                          1.51T  1.50T   350G  -


We already measured the raw performance of one of the disks in an earlier post:

Mode          Blocksize  IOPS   Bandwidth (MB/s)
random read   4k         184k   752
random write  4k         165k   675
random read   64k        23.9k  1566
random write  64k        25.8k  1694
random read   1M         1544   1620
random write  1M         1616   1695

The next step is to measure the raw performance of the ZFS volume before things like iSCSI, VMFS, network or ESXi come into play. To do this, I created a new zvol on the zpool and ran the same experiment as before, only instead of using the NVMe disk (/dev/nvmex) directly, I now accessed the zvol test device (/dev/zvol/tank/test) directly from the VM containing the pool. The results:

$ fio --ioengine=libaio --direct=1 --name=test --filename=/dev/zvol/tank/test --iodepth=32 --size=12G --numjobs=16 --group_reporting --bs=4k --readwrite=randread
Mode          Blocksize  IOPS   Bandwidth (MB/s)  vs. single raw device
random read   4k         155k   634               0.84x
random write  4k         30.3k  124               0.18x
random read   64k        16.4k  1073              0.69x
random write  64k        2.8k   184               0.11x
random read   1M         1.22k  1278              0.79x
random write  1M         189    199               0.12x
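The fio command above covers only the 4k random read case. A sketch of how the test zvol and the full sweep over the six rows of the table could be scripted is shown below. The 16G volume size, the cleanup step and the function names are assumptions, and the script only prints the commands unless it is run with DRY= set to empty.

```shell
#!/bin/sh
# Sketch only: create a scratch zvol and repeat the fio run per mode and block size.
DRY=${DRY-echo}   # prints the commands by default; run with DRY= to execute them
ZVOL=tank/test

fio_cmd() {  # $1 = randread or randwrite, $2 = block size
    $DRY fio --ioengine=libaio --direct=1 --name=test \
        --filename="/dev/zvol/$ZVOL" --iodepth=32 --size=12G --numjobs=16 \
        --group_reporting --bs="$2" --readwrite="$1"
}

sweep() {
    # 16G is an assumed size; it only has to be at least the 12G fio touches.
    $DRY sudo zfs create -V 16G "$ZVOL"
    for bs in 4k 64k 1M; do
        for mode in randread randwrite; do
            fio_cmd "$mode" "$bs"
        done
    done
    # Remove the scratch volume again once the measurements are done.
    $DRY sudo zfs destroy "$ZVOL"
}

sweep
```

The loop order matches the table: read and write for 4k, then 64k, then 1M.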

Goodbye ZFS

So I expected some overhead, but this is terrible. Keep in mind that the devices are mirrored and there are two mirrors, so ideally reads should be up to 4x, and writes up to 2x, as fast as on a single device. ZFS needs a lot of RAM to perform well, but assume for now that the VM has plenty. The results are not CPU-constrained either: I increased the number of virtual CPUs (vCPUs) until the VM was no longer constantly at 100% load. Instead, iostat -x showed that the NVMe devices themselves were never more than 25% utilized, while the ZFS device sat at 100%. Tuning ZFS doesn’t help here, as performance traces showed most of the time was spent in spinlocks and mutexes. This is clearly a performance problem in the ZFS code, which only becomes apparent because the NVMe devices are so fast.
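To put a number on that gap, the small sketch below takes the IOPS figures from the two tables and compares the zvol results against the idealized scaling described above (4x a single device for reads, 2x for writes). The factors are that rule of thumb, not a measurement, and the function name is made up for illustration.

```shell
#!/bin/sh
# Columns of the embedded data: mode, block size, raw single-device IOPS, zvol IOPS.
# All figures are copied from the two tables in this post.
ideal_vs_measured() {
    awk '
    BEGIN { printf "%-13s %-5s %10s %10s %8s\n", "mode", "bs", "ideal", "measured", "percent" }
    {
        # Idealized scaling: 4 disks can serve reads, 2 mirrors absorb writes.
        factor = ($1 == "read") ? 4 : 2
        ideal = $3 * factor
        printf "%-13s %-5s %10d %10d %7.0f%%\n", "random " $1, $2, ideal, $4, 100 * $4 / ideal
    }' <<'EOF'
read  4k  184000 155000
write 4k  165000 30300
read  64k 23900  16400
write 64k 25800  2800
read  1M  1544   1220
write 1M  1616   189
EOF
}

ideal_vs_measured
```

Against that idealized ceiling, the zvol reaches roughly 17 to 21% for reads and 5 to 9% for writes. While fio is running, iostat -x 1 in a second terminal shows the per-device utilization mentioned above.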

These results are so bad, in fact, that as much as I love ZFS, I can’t continue with it as the backing for my all-flash datastore until ZFS gets some much-needed performance tweaks for all-flash use. That’s why in the next post I’ll continue the same exercise with Linux software RAID and LVM, and see how that works.