I have very strong feelings about the possibility of losing my data, and I have taken some fairly extreme steps to make sure that never happens. This is already the third time I am writing about the system, or systems, I have built for that purpose.

It all began, however, with my interest in ZFS and my evaluation of it. A compilation of my early articles about ZFS, in Finnish, is at http://www.raimokoski.com/ZFS.php. That work soon resulted in the first backup recipe, dated 9 March 2012. Then I assembled another PC as my backup server and wrote separately about the backup scripts.

The techniques I use to reach my goals depend heavily on filesystem features that Linux and Unix-like operating systems offer. The most essential ones are hard links and snapshots. Hard links are an old and universally available feature, whereas snapshots are far more poorly supported; on Linux they used to be available only through LVM, and my personal opinion about LVM is not printable. Luckily, until btrfs reaches maturity, ZFS on Linux is like heaven on earth.
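If hard links are new to you, here is a minimal demonstration with made-up file names; a hard link is just another directory entry pointing to the same inode, so no extra data is stored:

# touch original
# ln original copy
# stat -c 'name: %n  inode: %i  links: %h' original copy

Both names report the same inode number and a link count of 2, and the data stays on disk until the last name pointing to it is removed.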

System disk backup

You really should avoid having to do system disk backups at all. RAID-1 is the proper solution, and if you are really paranoid, add a third or even a fourth disk as an additional mirror or a hot spare.

Unfortunately it is sometimes hard to avoid the need for system disk backups. I have two such computers. The more recent one is my current workstation, which seems to have a really troublesome motherboard. To cut a long story short, it seems to support only one SATA hard disk at a time properly. I have several 250 GB disks reserved for RAID-1 arrays, but all of them have started acting flaky soon after being installed as anything other than the first disk. When I have removed them and tested them as external disks in an Icy Dock over USB, they have again been fully OK.

Using an SSD mitigates the problem, but doesn’t fully solve it. In a workstation it is of course also a good way to speed things up.

I use Fedora on my workstation and laptops. It is widely known that reinstalling Fedora means going through various “100 things to do after Fedora installation” lists, for example to be able to play music and movies and browse the web; you need various more or less proprietary codecs and plugins for those tasks. So the only reasonable option is to do a full backup and restore. There are easy-to-use tools for that, but if you also want incremental backups and want to save on the size of the backups, the options are so limited that I ended up building my own.

File level deduplication using hard links

If you are not familiar with deduplication, you might wonder why you would need it. A simple demonstration shows why. With the system disk backup system in use, my system has seven main directories, each containing a full backup from a different day. The disk space they need is, however, much less than seven times the size of one full backup. I call “du” to testify:

# du -sh current ?
49G     current
2,4G    1
2,4G    2
1,7G    3
1,2G    4
1,3G    5
1,1G    6

The total size is told by:

# du -sh .
58G    . 

Those were the actual sizes. “du” can also tell the size as if hard links were not used:

# du -sh --count-links .
344G    .

So in this case 286 GB was saved. The idea probably seems so great by now that you ask why not do backups every day and keep them for a whole month. In an ideal world, yes, but many filesystems have problems with a large number of hard links: ext4 has a limit of 65 000 hard links per file and ext2/3 a limit of 32 000. Below is a simple way to test the maximum number of hard links:

# mkdir linktest
# cd linktest
# rm -fv * ; touch 1 ; i=2 ; while true ; do cp -l 1 $i || \
break ; i=$(($i+1)) ; done ; echo -n "Number of links created " ; \
ls | sort -n | tail -n 1

You can monitor the progress on a different terminal session with “ls | wc”. The end result on ext4 should be:

cp: cannot create hard link '65001' to '1': Too many links
Number of links created 65000

On ZFS the limit is so high that it shouldn’t normally bother you. I let the loop run on ZFS until it had created over half a million links. When you start using the backup script on ext2/3/4, rsync may also complain, as below:

rsync: link "/mirrors/rk5bk/current/sda3/var/lib/dnf/yumdb/f/bf8a1235c04e7d762543ff96d2d61207a0186099-fsarchiver-0.6.19-5.fc23-x86_64/checksum_type" => var/lib/dnf/yumdb/A/005b25c99812c179b34114410bdc6983eb9a38e8-Add64-1.2.2-9.fc23-x86_64/checksum_type failed: Too many links (31)

“(31)” looks confusing at first, but it is just the errno value (31 = EMLINK, “Too many links”), and stat tells what it is all about. Note the link count.

# LANG=C stat /var/lib/dnf/yumdb/A/005b25c99812c179b34114410bdc6983eb9a38e8-Add64-1.2.2-9.fc23-x86_64/checksum_type
File: '/var/lib/dnf/yumdb/A/005b25c99812c179b34114410bdc6983eb9a38e8-Add64-1.2.2-9.fc23-x86_64/checksum_type'
Size: 6               Blocks: 8          IO Block: 4096   regular file
Device: 803h/2051d      Inode: 526201      Links: 9657
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:rpm_var_lib_t:s0
Access: 2016-02-22 21:30:47.554611260 +0200
Modify: 2016-01-14 16:20:42.108021864 +0200
Change: 2016-02-22 21:30:50.025636311 +0200
Birth: -

Then a guess at the culprit:

# rpm -qa | wc -l 
9650

When a file with almost 10 000 hard links is copied as hard links, another almost 10 000 links are created, and so on. The result is:

# LANG=C stat /mirrors/rk5bk/current/sda3/var/lib/dnf/yumdb/f/bf8a1235c04e7d762543ff96d2d61207a0186099-fsarchiver-0.6.19-5.fc23-x86_64/checksum_type
File: '/mirrors/rk5bk/current/sda3/var/lib/dnf/yumdb/f/bf8a1235c04e7d762543ff96d2d61207a0186099-fsarchiver-0.6.19-5.fc23-x86
_64/checksum_type'
Size: 6               Blocks: 8          IO Block: 4096   regular file
Device: 900h/2304d      Inode: 328925601   Links: 65000
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-02-26 07:41:17.228950162 +0200
Modify: 2016-01-14 16:20:42.108021864 +0200
Change: 2016-02-27 02:36:32.832579219 +0200
Birth: -

So, how bad is the problem?

# cat /etc/redhat-release
Fedora release 23 (Twenty Three)

# find /var/lib/dnf/yumdb/ -type f -exec stat {}  \; | awk '/Links:/ {print $6}' | sort -n | tail -n1 
10002
# rpm -qa | wc -l 
9650

# cat /etc/redhat-release  
Scientific Linux release 6.4 (Carbon)
# find /var/lib/yum/yumdb/ -type f -exec stat {}  \; | awk '/Links:/ {print $6}' | sort -n | tail -n1             
339
# rpm -qa | wc -l
1420

Scientific Linux is an RHEL clone, like CentOS. It looks like yum already had a problem with an excessive number of links, but dnf has made it much worse. In fact, dnf is already broken on ext2/3 in the sense that you could not install all the available rpm packages on one system; that would require more hard links than ext2/3 allow:

# dnf list all | wc -l          
56097

Tailoring a backup script

To understand what needs to be backed up, an fdisk listing is in order:

# LANG=C fdisk -l  
Disk /dev/sda: 223.6 GiB, 240057409536 bytes, 468862128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x7720098c

Device     Boot    Start       End   Sectors   Size Id Type 
/dev/sda1  *        2048  16779263  16777216     8G 83 Linux
/dev/sda2       16779264  83888127  67108864    32G 82 Linux swap / Solaris
/dev/sda3       83888128 468862127 384974000 183.6G 83 Linux

/dev/sda1 is the boot partition and /dev/sda3 the root partition. /dev/sda2 doesn’t need to be backed up, but it must be recreated as part of the restore.

So let’s take a look at the backup script:

#!/bin/bash
# /mirrors is a big NFS mounted data filesystem
# Change the mount point to suit your system
# Proceed only if the backup target is actually mounted
if mount | grep -q /mirrors
then
# Only for the first time..
  mkdir -p /mirrors/rk5bk/current > /dev/null
  cd /mirrors/rk5bk || exit
# Try to squash a bug occurring in some unusual condition?
  rmdir ? 2> /dev/null
  echo "Deleting the oldest backup"
  time ssh rk7 "cd /mirrors/rk5bk ; rm -rf 6 > /dev/null ; sync"
  for i in `seq 6`
  do     
    mv $((6-$i)) $((7-$i)) 2> /dev/null
  done
  echo "Copying the most recent backup to next recent as hard links"
  time  ssh rk7 "cd /mirrors/rk5bk ; cp -al current 1"
  date > /bktimestamp
  echo "Backing up the root partition"
  time  rsync -axH --delete /  current/sda3
  echo "Backing up the boot partition"
  time  rsync -aH  --delete  /boot/*  current/sda1
  echo "Saving bootblock and grub related info"
  time dd if=/dev/sda of=current/mbr.raw bs=512 count=2048
# Change the partition numbers if needed. Below sda1 is boot and sda3 root partition.
  /usr/sbin/dumpe2fs /dev/sda1 | awk '/Filesystem UUID/ {print $3}' > current/sda1.uuid
  /usr/sbin/dumpe2fs /dev/sda3 | awk '/Filesystem UUID/ {print $3}' > current/sda3.uuid
# rsync doesn't change the date of the target directory, so..
  #  touch current
#  touch current/bktimestamp  
  echo Systemdisk backup done at `date`
fi

There are many lines beginning with “time”. Those are operations that take a longish time, and timing them shows what the script spends its running time on. Some lines use “ssh” to run a command locally on the file server; in those cases that is considerably faster than performing the operation over the network. For ssh to work that way non-interactively, your ssh public key must of course be copied to the file server.
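If key-based ssh is not set up yet, something like the following does it; here the file server is called rk7, as in the script:

# ssh-keygen -t rsa
# ssh-copy-id root@rk7
# ssh root@rk7 uptime

Accept the defaults and use an empty passphrase, because the script runs from cron; the last command should then complete without a password prompt.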

On a local network rsync runs fastest when the remote filesystem is shared via e.g. NFS and rsync thinks it is copying files locally; otherwise it would go over ssh and the encryption would slow the transfers down. If the network or Internet link is slow, compression (with -z) is worth a try.
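As an illustration, the same transfer could be written either way; the paths and the host name below are only examples:

# Fast local network, target mounted over NFS: no ssh, no encryption overhead
rsync -aH --delete /data/ /mirrors/backup/data
# Slow link: go over ssh and let rsync compress the stream with -z
rsync -aHz --delete /data/ backupserver:/backup/data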

Some things in the script need explaining. “dd” copies 2048 sectors instead of only the MBR sector, which contains the partition table. Grub uses the sectors right after the MBR for storing the code that loads the next stage of grub. If you look at the fdisk listing, you will notice that the first partition starts at sector 2048 (counting starts at 0). All the sectors between the MBR and sector 2048 are officially unclaimed, and grub usually uses the first 15 of them.
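If you want an additional, human-readable copy of the partition table alongside mbr.raw, sfdisk can dump it as text; this is optional and not part of the script above:

/usr/sbin/sfdisk -d /dev/sda > current/sda.sfdisk

On restore the table could then be written back with "sfdisk /dev/sdb < current/sda.sfdisk" instead of, or in addition to, the raw dd image.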

Another probably unfamiliar detail is storing the filesystem UUIDs. They are referenced in the grub configuration file and must be set during restore for the system to boot up properly. Here is one boot menu entry that uses filesystem UUIDs:

menuentry 'Fedora (4.3.3-303.fc23.x86_64) 23 (Workstation Edition)' --class fedora --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-4.3.3-300.fc23.x86_64-advanced-ab89bb56-fb4e-47db-b22b-f083a9d32902' {
        load_video
        set gfxpayload=keep
        insmod gzio
        insmod part_msdos
        insmod ext2
        set root='hd0,msdos1'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint-bios=hd0,msdos1 --hint-efi=hd0,msdos1 --hint-baremetal=ahci0,msdos1 --hint='hd0,msdos1'  7c2569c1-9fc1-4ae0-8efe-c83bd1a5d346
        else
          search --no-floppy --fs-uuid --set=root 7c2569c1-9fc1-4ae0-8efe-c83bd1a5d346
        fi
        linux16 /vmlinuz-4.3.3-303.fc23.x86_64 root=UUID=ab89bb56-fb4e-47db-b22b-f083a9d32902 ro rhgb quiet LANG=fi_FI.UTF-8
        initrd16 /initramfs-4.3.3-303.fc23.x86_64.img
}

Those filesystem UUIDs referenced are:

# cat sda1.uuid 
7c2569c1-9fc1-4ae0-8efe-c83bd1a5d346
# cat sda3.uuid
ab89bb56-fb4e-47db-b22b-f083a9d32902

Now we are ready to set the backup script to run automatically, for example every night or every second day, by adding a line like the following to /etc/crontab (as written here it runs daily at 02:32; adjust the day-of-month field if you want it to run less often):

# Example of job definition:
# .---------------- minute (0 - 59)
# | .------------- hour (0 - 23)
# | | .---------- day of month (1 - 31)
# | | | .------- month (1 - 12) OR jan,feb,mar,apr ...
# | | | | .---- day of week (0 - 6) (Sunday=0 or 7) OR
# | | | | |     sun,mon,tue,wed,thu,fri,sat
# | | | | |
# * * * * * user-name command to be executed
32 2 * * * root /usr/local/bin/sysbackup

The restore script is divided into two parts. Here is the first part, listed and then run:

[root@rk5 current]# cat ../restore1.sh 
#!/bin/sh
if [ -b /dev/$1 ]
   then
       dd if=mbr.raw of=/dev/$1
       echo "Copied boot block and partition table to $1."
       echo "Adjust now the partition table if you need and"
       echo "continue with restore2.sh."
       fdisk -l /dev/$1
else echo "Please specify a valid block device file name like sda."
fi
[root@rk5 current]# LANG=C sh ../restore1.sh sdb
2048+0 records in
2048+0 records out
1048576 bytes (1.0 MB) copied, 0.124054 s, 8.5 MB/s
Copied boot block and partition table to sdb.
Adjust now the partition table if you need and
continue with restore2.sh.
Disk /dev/sdb: 233.8 GiB, 251000193024 bytes, 490234752 sectors 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x7720098c

Device     Boot    Start       End   Sectors   Size Id Type 
/dev/sdb1  *        2048  16779263  16777216     8G 83 Linux
/dev/sdb2       16779264  83888127  67108864    32G 82 Linux swap / Solaris
/dev/sdb3       83888128 468862127 384974000 183.6G 83 Linux

The backed-up disk was 240 GB in size and it was restored to a 250 GB disk, so the last partition could now be expanded to end at sector 490234751 instead of 468862127. Expansion is optional, but if the new disk is smaller, you must shrink the partition(s) instead. The second part of the restore script, listed after the sketch below, should offer no surprises.
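One way to do the optional expansion before running restore2.sh is parted, assuming its version is recent enough to have the resizepart command and that the target disk is sdb as above; mkfs.ext4 in restore2.sh then simply creates the filesystem over the whole enlarged partition:

# parted /dev/sdb resizepart 3 100%
# fdisk -l /dev/sdb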

[root@rk5 current]# cat ../restore2.sh 
#!/bin/sh
if [ -b /dev/$1 ]
   then
       mkfs.ext4 /dev/${1}1
       mkfs.ext4 /dev/${1}3
       mkswap /dev/${1}2
       tune2fs -U `cat sda1.uuid` /dev/${1}1
       tune2fs -U `cat sda3.uuid` /dev/${1}3
       mkdir /mnt/${1}1
       mkdir /mnt/${1}3
       mount /dev/${1}1 /mnt/${1}1
       mount /dev/${1}3 /mnt/${1}3
       time rsync -aH sda1/ /mnt/${1}1
       time rsync -aH sda3/ /mnt/${1}3
       umount /mnt/${1}1 /mnt/${1}3
       rmdir /mnt/${1}1 /mnt/${1}3
else echo "Please specify a valid block device file name like sda."
fi

Note however the location of the script. It is on the file server.

Testing backup and restore

This is of course an important thing to do. You should run the backup script seven times to see whether you run into the too-many-hard-links problem. You can also increase the number of saved copies one by one, and if you reach seven without problems, you can set the crontab to run the script daily. I would, however, advise caution. I once had fairly big problems with a big ext3 filesystem, and I suspect an excessive number of hard links caused them. I tried to store a copy every day over a one-month period, and the problems started on the 20th or 21st day. So it is good that ext4 complains when the limit is exceeded. Probably ext3 has been fixed in that respect by now, but all I can do is warn.

When you test the restore, it helps if you have an external docking station like an Icy Dock. Search for “hdd docking station usb3” and you’ll find plenty of options. eSATA is even faster, but not as universal as USB. After restoring, shut down, unplug your system disk, plug in the restored disk instead, and try booting.

If you would like to simulate a real disaster situation, you should boot with a rescue disk like SystemRescueCD. Having it on a USB stick and being familiar with it is a very good idea in any case.

Dedicated Backup Server

I have used my backup server in a double role. Because snapshot support in ZFS is so good, it is very well suited as a backup medium. ZFS also has many other interesting features, like deduplication. When I started to evaluate it, I soon noticed that those features come with a price: deduplication needs RAM, much RAM, much, much RAM. Snapshots were not without problems either. Using ZFS for backups has been a kind of long-term reliability test, and during that time ZFS and ZFS on Linux have also matured.

Having a backup server really makes sense only when it is physically located apart from the file server, which means the network connection has to be fast enough. “rsync” with compressed transfers makes even fairly slow connections acceptably fast. I don’t have the luxury of separate locations, so I don’t use compression, and I might soon trust ZFS with raidz3 enough to let one machine serve as both my file and backup server.

Before I write more about my backup server, I have to tell a little bit about my file server. It has 15 disks. The first is an SSD system disk and the rest are 2 TB disks forming a 13-disk RAID6 array with one hot spare. Because RAID6 effectively uses two disks for parity, the array has 11 times 2 TB of storage; in binary terabytes that is roughly 20 TiB. The filesystem is ext4. I didn’t choose ZFS because growing a ZFS pool has severe limitations. Growing a RAID6 array one disk at a time is easy, and now that the old 16 TiB limit of the ext4 tools has been lifted, ext4 grows just as effortlessly.
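Growing that kind of setup is, in outline, only a few commands; the device names below are just examples and the reshape itself takes a long time:

# Add the new disk and reshape the RAID6 array from 13 to 14 member disks
mdadm --add /dev/md0 /dev/sdp
mdadm --grow /dev/md0 --raid-devices=14
# When the reshape has finished, grow the mounted ext4 filesystem online
resize2fs /dev/md0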

Actually, the RAID6 array on my file server has already been introduced: it is what is mounted as /mirrors on my workstation. Of the 20 TiB, 13 TiB is currently in use, so that is what is being backed up.

Briefly about the Backup Server Hardware

I started my acquaintance with ZFS on OpenSolaris. That limited the hardware choices greatly, but I found that Supermicro’s AOC-USAS and its SAS2 version AOC-USAS2 were affordable, good, and supported by OpenSolaris. They are 8-port SAS adapters that also support SATA disks. Now that I use ZoL (ZFS on Linux), and there probably are no SATA/SAS adapters without Linux support, I would still recommend them. Disk adapters were the biggest grievance, but you also had to be careful when choosing network adapters.

My first backup server supported only 4 GB of RAM. Normally that wouldn’t be a problem for a file server with no need for a GUI, but ZFS needed a lot of RAM, for example for deduplication. Snapshot deletion would also occasionally fail if the ratio of RAM to used disk space was too low. That problem was fixed in ZFS version 26 or 27, and when I started to use my current backup server, the ZFS version in OpenIndiana was 28.

The deduplication ZFS offers is block based. The foundation for it comes for free, because ZFS calculates and stores checksums of all stored blocks to ensure data integrity. If the checksum of a new block is the same as that of a previously stored block, there is no need to store the block again; a reference to the previously stored duplicate is enough. The rather obvious problem is that all the checksums have to be in RAM for deduplication to be reasonably fast. The recommendation has been 2 GB of RAM per terabyte of used disk storage.

Deduplication and many other ZFS features are not all-or-nothing options. You can create datasets, which are like directory trees with individual settings. If you know you have a set of highly deduplicable files, you can place them in a dataset with deduplication enabled while it is disabled elsewhere, and thus limit the need for RAM and keep the speed up.
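For example, a dataset reserved for highly deduplicable data could be created like this; the dataset name is made up:

# zfs create -o dedup=on tank/isos
# zfs get dedup tank tank/isos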

When I started to build my current backup server, the old one had three main limitations. The disks were only 2 TB models and the total space was getting too small. The RAM limit was already mentioned. The third was the lack of support for WOL (Wake On LAN). That meant all of the old hardware had to go.

My current backup server is an HP ProLiant ML350 G5. It has a built-in E200i 8-port SAS adapter and two 2.5” 147 GB 10 000 RPM SAS disks in a hardware mirror configuration. The CPU is a 2 GHz dual-core Xeon 5130. Originally it had 6 GB of ECC RAM; the RAM limit is 32 GB and it now has 12 GB. I bought it used from huuto.net (a local eBay clone) for 250 € a couple of years ago. Upgrading the memory was cheap from eBay: currently 4 x 4 GB of RAM for it costs 30 €, so reaching the 32 GB maximum costs only 60 €. I also had an ML350 G4, from which I got a drive cage for 3.5” disks.

The 147 GB SAS disks serve as the system disk. For data there are eight 4 TB disks in a raidz2 configuration, plus a 60 GB SSD as ZFS cache.
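A pool like that is created roughly as follows; the device names are placeholders, and in practice you would use the persistent /dev/disk/by-id/ names:

# zpool create tank raidz2 sdb sdc sdd sde sdf sdg sdh sdi cache sdj
# zpool status tank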

Why ZFS on Linux?

My first backup server originally ran OpenSolaris snv_134, the last free build, but then I upgraded it to OpenIndiana. When I started to assemble my current backup server, I was rather confident that an old server machine from HP would be fully supported. As it turned out, I was wrong. When I tried to install OpenIndiana, the installation program wrote 2 % of the files to the disk and then froze. Another try, onto a SATA disk connected to a supported adapter, produced an unbootable installation with grub already throwing garbage on the screen.

I would have continued to use OpenIndiana even though I had no great interest in the other rather advanced features it offers besides ZFS, but I had had enough of disk-related problems. It was ZFS, after all, that I was mainly after, and I was starting to find disk-related problems inexcusable.

There are other operating systems that support ZFS, but Linux clearly has the best hardware support. DragonFly BSD could have been interesting for comparison because of its HAMMER2 filesystem, but Linux, too, has one interesting filesystem in the emerging-promise category: btrfs.

Backing up the Fileserver

The backup script needs a couple of unusual utilities. “zfs-auto-snapshot” has instructions and a download link at https://github.com/zfsonlinux/zfs-auto-snapshot. The other is wol; on CentOS you have to have the EPEL repository enabled to install it with yum. Once you have wol, you need the MAC address of your backup server, obtained like below:

[root@rk4 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:1C:C4:16:6E:6E
inet addr:192.168.0.4  Bcast:192.168.0.255  Mask:255.255.255.0

What you need is listed after HWaddr, in this case 00:1C:C4:16:6E:6E. Below is the whole script:

#!/bin/bash 
# Is the backup server running?
ping -c 3 rk4 > /dev/null  && ISUP=1 || ISUP=0 
if [ "$ISUP" = "0" ]
 then
# Wake up rk4
 if  [ "$1" = "-v" ]
   then
     wol 00:1C:C4:16:6E:6E
   else
     wol 00:1C:C4:16:6E:6E > /dev/null
 fi
# The BIOSes of the machine and its disk adapters take almost two minutes
# of the startup time. Scientific Linux boots to a command prompt in a bit
# over half a minute, so the total is about 2 min 30 s, but we wait an
# extra 15 seconds to be on the safe side.
  sleep 165
  if  [ "$1" = "-v" ]
    then
      ping -c 3 rk4 || exit
    else
      ping -c 3 rk4 > /dev/null || exit
  fi
  ssh rk4 "mount rk7:/mirrors /rk7/mirrors"  || exit
  else
    ssh rk4 "mount rk7:/mirrors /rk7/mirrors"  
  fi
cd /mirrors || exit
# Generate file listing
ls -lR > ls_-lR
if [ "$1" = "-v" ]
 then
  ssh rk4 "cd /tank ; \
           rsync --progress -aH --delete --exclude=bk/ --exclude=rk5bk \
                 /rk7/mirrors . ; \
           rsync --progress -aH --delete  /rk7/mirrors/bk/current mirrors/bk ; \  
           rsync --progress -aH --delete  /rk7/mirrors/rk5bk/current mirrors/rk5bk"    
 else
  ssh rk4 "cd /tank ; \
           rsync -aH --delete --exclude=bk/ --exclude=rk5bk /rk7/mirrors . ; \
           rsync -aH --delete  /rk7/mirrors/bk/current mirrors/bk ; \ 
           rsync -aH --delete  /rk7/mirrors/rk5bk/current mirrors/rk5bk" 
fi 
ssh rk4 "/usr/local/sbin/zfs-auto-snapshot.sh --label=weekly --keep=26 tank"
ssh rk4 "df /tank"
ssh rk4 "zfs list tank"
ssh rk4 "zfs list -t snapshot"
echo Mirror backup done at `date`
if [ "$ISUP" = "0" ]
 then
   ssh rk4 "shutdown -h now"
fi
exit

The backup script is run every Friday by cron according to the following /etc/crontab line:

05 6 * * fri root  /usr/local/bin/mirrorbackupnew

The backup script is really rather simple. If the backup server is already up, it is not shut down afterwards; if it is not, it is woken up only for the duration of the backup. Currently two directories are treated specially because they contain daily rotating backups; only the most recent version of each is backed up, which results in a weekly rotation after the first week. It is assumed that the backup server has nothing else to do, while the backed-up file server might have other duties, so the backup server is told to do the hardest part of the work via ssh. Therefore the root user’s ssh public key must be copied from the file server to the backup server’s root authorized_keys file.

The backup script keeps the 26 most recent weekly backups, so you can go back half a year in time. What that means in terms of disk usage varies, depending on how much and how often the files and their contents change. At the end the script prints some statistics, which are mailed to the root user. Below, the same commands are run manually:

[root@rk4 ~]# df /tank 
Filesystem       1K-blocks        Used  Available Use% Mounted on
tank           21453608832 13942261760 7511347072  65% /tank
[root@rk4 ~]# zfs list tank
NAME   USED  AVAIL  REFER  MOUNTPOINT
tank  13,3T  7,00T  13,0T  /tank
[root@rk4 ~]# zfs list -t snapshot
NAME                                        USED  AVAIL  REFER  MOUNTPOINT
tank@zfs-auto-snap_weekly-2015-08-21-0434  2,10G      -  12,8T  -
tank@zfs-auto-snap_weekly-2015-08-28-0426  2,10G      -  12,8T  -
tank@zfs-auto-snap_weekly-2015-09-04-0429  2,09G      -  12,8T  -
tank@zfs-auto-snap_weekly-2015-09-11-0433  2,53G      -  12,8T  -
tank@zfs-auto-snap_weekly-2015-09-18-0426  2,16G      -  12,8T  -
tank@zfs-auto-snap_weekly-2015-09-25-0428  2,83G      -  12,8T  -
tank@zfs-auto-snap_weekly-2015-10-02-0427  2,18G      -  12,9T  -
tank@zfs-auto-snap_weekly-2015-10-09-0429  2,19G      -  12,9T  -
tank@zfs-auto-snap_weekly-2015-10-16-0642  2,53G      -  12,9T  -
tank@zfs-auto-snap_weekly-2015-10-23-0502  2,54G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-10-30-0557  2,57G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-11-06-0559  2,57G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-11-13-0609  2,65G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-11-20-0559  2,60G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-11-27-0558  1,93G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-12-04-0521  1,94G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-12-11-0601  2,64G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-12-18-0610  2,32G      -  13,0T  -
tank@zfs-auto-snap_weekly-2015-12-25-0639  2,29G      -  13,0T  -
tank@zfs-auto-snap_weekly-2016-01-01-0613  2,32G      -  13,0T  -
tank@zfs-auto-snap_weekly-2016-01-08-0605  1,01G      -  13,0T  -
tank@zfs-auto-snap_weekly-2016-01-15-0524  1,01G      -  13,0T  -
tank@zfs-auto-snap_weekly-2016-01-22-0612  3,34G      -  13,0T  -
tank@zfs-auto-snap_weekly-2016-01-29-0822  6,95G      -  13,0T  -
tank@zfs-auto-snap_weekly-2016-02-05-0757  32,9G      -  13,0T  -
tank@zfs-auto-snap_weekly-2016-02-12-0704   197M      -  13,0T  -

In my use snapshots use very little space.
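ZFS can itemize how much of the pool’s usage is attributable to the snapshots alone:

# zfs get used,usedbydataset,usedbysnapshots tank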

If you want to access the snapshots, the easiest way is to enable .zfs directory visibility:

# zfs get snapdir tank
NAME  PROPERTY  VALUE    SOURCE
tank  snapdir   hidden   local
# zfs set snapdir=visible tank
# ls /tank/.zfs/snapshot/
zfs-auto-snap_weekly-2015-08-21-0434  zfs-auto-snap_weekly-2015-11-20-0559
zfs-auto-snap_weekly-2015-08-28-0426  zfs-auto-snap_weekly-2015-11-27-0558
zfs-auto-snap_weekly-2015-09-04-0429  zfs-auto-snap_weekly-2015-12-04-0521
zfs-auto-snap_weekly-2015-09-11-0433  zfs-auto-snap_weekly-2015-12-11-0601
zfs-auto-snap_weekly-2015-09-18-0426  zfs-auto-snap_weekly-2015-12-18-0610
zfs-auto-snap_weekly-2015-09-25-0428  zfs-auto-snap_weekly-2015-12-25-0639
zfs-auto-snap_weekly-2015-10-02-0427  zfs-auto-snap_weekly-2016-01-01-0613
zfs-auto-snap_weekly-2015-10-09-0429  zfs-auto-snap_weekly-2016-01-08-0605
zfs-auto-snap_weekly-2015-10-16-0642  zfs-auto-snap_weekly-2016-01-15-0524
zfs-auto-snap_weekly-2015-10-23-0502  zfs-auto-snap_weekly-2016-01-22-0612
zfs-auto-snap_weekly-2015-10-30-0557  zfs-auto-snap_weekly-2016-01-29-0822
zfs-auto-snap_weekly-2015-11-06-0559  zfs-auto-snap_weekly-2016-02-05-0757
zfs-auto-snap_weekly-2015-11-13-0609  zfs-auto-snap_weekly-2016-02-12-0704

Each of those subdirectories contains all the files that were present when the snapshot was taken, so the apparent total size is 26 times the 12.8 to 13.0 TiB shown above.
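Restoring an individual file from a snapshot is then an ordinary copy; the file path here is hypothetical:

# cp -a /tank/.zfs/snapshot/zfs-auto-snap_weekly-2016-02-12-0704/mirrors/some/lost/file /tank/mirrors/some/lost/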

Reflections and Future Directions

Third time’s the charm, they say. It is now also chronologically the right time to consider future directions again.

The First Backup Server

I started evaluating ZFS a little over four years ago. Deduplication was then the most intriguing feature of ZFS, and testing, or rather simulating, it was very instructive. At that time I had “only” 8 TB of files on my file server, and calculating what the deduplication ratio of those files would be needed 28 GB of memory. Because my original backup server had, and supported, only 4 GB of RAM, it was really slow. I actually ended up cheating, in a way, by using an SSD as swap space and tweaking every possible setting; the calculation took 40:31 hours. Now that I have 12 GB of RAM and 13 TiB of used disk space, the calculation took 6:30 hours, and a total of 18 GB of RAM plus swap was used. The result was:

# time zdb -S tank                    
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
1    93.4M   11.5T   11.5T   11.5T    93.4M   11.5T   11.5T   11.5T
2    4.27M    484G    484G    488G    9.63M   1.05T   1.05T   1.06T
4     379K   29.7G   29.7G   30.7G    1.69M    134G    134G    139G
8    53.7K   4.28G   4.28G   4.42G     553K   44.5G   44.5G   45.9G
16    17.4K   1.80G   1.80G   1.82G     371K   38.8G   38.8G   39.2G
32    1.55K    105M    105M    110M    61.3K   3.90G   3.90G   4.13G
64      294   9.88M   9.88M   11.5M    25.0K    850M    850M    993M
128       98   2.36M   2.36M   3.00M    16.7K    467M    467M    574M
256       34    808K    808K   1.01M    11.4K    272M    272M    347M
512       20    191K    191K    341K    15.1K    119M    119M    235M
1K       20    922K    922K   1024K    29.2K   1.58G   1.58G   1.71G
2K        8    133K    133K    188K    22.1K    419M    419M    565M
4K        3    129K    129K    145K    16.2K    711M    711M    796M
2M        1    128K    128K    128K    3.25M    416G    416G    416G
Total    98.1M   12.0T   12.0T   12.1T     109M   13.2T   13.2T   13.2T

dedup = 1.10, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.10

real    389m49.067s
user    30m52.490s
sys     5m12.914s

Yesterday I ordered 4 x 4 GB, i.e. 16 GB, of memory from China because it is now so cheap, but I’ll have to wait about two weeks before I can try the deduplication calculation without swapping to disk. The new total memory amount will be 24 GB. As it now seems, the old requirement of 2 GB of RAM per TB of used disk space no longer holds; 24 GB should be enough for at least 16 TB of used disk space, maybe even more. In any case, more RAM opens options that were previously more or less closed.

The Second Backup Server

I built my second backup server about two years ago, so now is a very good time to think about whether some changes should be made. Surprisingly little has happened in two years. I paid 160 € for each of the 4 TB disks and the same model now costs 140 €. There are now also 8 TB disks; their cost per TB is 30 € while on 4 TB disks it is 35 €. 30 is less than 35, but when you use RAID levels with parity, you must take the number of disks into consideration: the fewer disks you have, the larger the share you lose to parity, as the quick calculation below shows. I think that for RAID6, or raidz2 in ZFS parlance, 8 to 12 disks is optimal. So for my needs 8 TB disks are still too big, and 5 or 6 TB disks are too small a change to be worthwhile. Adding more 4 TB disks therefore seems like a good idea.
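The parity share is easy to put into numbers: two disks’ worth of raw capacity divided by the total number of disks in the RAID6/raidz2 array.

# for n in 6 8 10 12 ; do echo $n | awk '{printf "%2d disks: %.0f %% of raw capacity lost to parity\n", $1, 200/$1}' ; done

Going from 8 to 12 disks drops the overhead from a quarter to about a sixth.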

Another big change that looks feasible is combining the roles of the file and backup servers, in practice getting rid of the file server. When I have upgraded previously, in practice doubling the size of the disks, I have sold the old disks on huuto.net and gotten a fairly good per-terabyte price for them. Bigger disks tend to have cheaper terabytes, so you don’t usually lose much, if anything, by doubling your disk sizes, if you count per-terabyte prices.

Faster CPU

In addition to adding RAM and more disks, a faster CPU was so cheap used that it was worth a try. The ML350 G5 would also accept a second CPU, but that would mean buying a CPU upgrade kit costing about 50 €, compared to the 10 € I paid for two bare 3 GHz CPUs.

To test the faster CPU, I first ran MP3 encoding and memory speed tests with the original 2 GHz CPU:

# time for j in *.wav ; do lame "$j" `basename "$j" .wav`.mp3 ; done
real    0m56.755s 
user    0m54.627s
sys     0m0.185s

# for i in 1 2 3 4 5 ; do hdparm -Tt /dev/sda | grep "cached reads" ; done   
Timing cached reads:   5882 MB in  2.00 seconds = 2943.33 MB/sec
Timing cached reads:   6070 MB in  2.00 seconds = 3037.93 MB/sec
Timing cached reads:   6258 MB in  2.00 seconds = 3131.98 MB/sec
Timing cached reads:   5970 MB in  2.00 seconds = 2988.21 MB/sec
Timing cached reads:   5912 MB in  2.00 seconds = 2958.74 MB/sec

Then the same tests with the 3 GHz CPU:

real    0m39.536s 
user    0m36.572s
sys     0m0.316s

Timing cached reads:   7754 MB in  2.00 seconds = 3879.81 MB/sec
Timing cached reads:   6740 MB in  2.00 seconds = 3372.07 MB/sec
Timing cached reads:   7178 MB in  2.00 seconds = 3591.59 MB/sec
Timing cached reads:   7190 MB in  2.00 seconds = 3597.95 MB/sec
Timing cached reads:   7100 MB in  2.00 seconds = 3553.06 MB/sec

Then calculations:

# echo | awk '{print 56.755/39.536}'
1.43553 
# echo | awk '{print 3879.81/3131.98}'
1.23877

So, CPU-intensive tasks are now 44 % faster and memory-intensive tasks 24 % faster. A well-spent 10 €! What about disk-intensive tasks?

# time zdb -S tank 
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
1    93.4M   11.5T   11.5T   11.6T    93.4M   11.5T   11.5T   11.6T
2    4.15M    479G    479G    482G    9.22M   1.04T   1.04T   1.04T
4     621K   38.3G   38.3G   40.6G    2.75M    175G    175G    185G
8    71.6K   5.43G   5.43G   5.63G     720K   55.4G   55.4G   57.4G
16    18.7K   1.87G   1.87G   1.90G     401K   40.9G   40.9G   41.4G
32    1.75K    118M    118M    124M    69.7K   4.42G   4.42G   4.68G
64      309   10.0M   10.0M   11.8M    26.3K    872M    872M   1023M
128      127   2.54M   2.54M   3.39M    21.6K    503M    503M    647M
256       33    789K    789K   1007K    11.2K    262M    262M    336M
512       22    210K    210K    375K    16.3K    120M    120M    245M
1K       18    957K    957K   1.02M    25.6K   1.62G   1.62G   1.72G
2K       10    132K    132K    205K    26.8K    416M    416M    606M
4K        5    132K    132K    162K    29.5K    729M    729M    909M
2M        1    128K    128K    128K    3.27M    419G    419G    419G
Total    98.2M   12.1T   12.1T   12.1T     110M   13.3T   13.3T   13.3T

dedup = 1.10, compress = 1.00, copies = 1.00, dedup * compress / copies = 1.10

real    312m30.460s
user    28m38.933s
sys     4m21.458s

That is quite impressive, 25 % faster.

The Speed of ZFS on Linux

Two years ago, when I started to use ZFS on Linux, it was not yet optimized for speed and the test results were modest at best. Soon after that the results got much better, but I didn’t write about it.

Because I don’t now have the luxury of being able to create zpools with different raidz levels, or mdadm devices with different filesystems, it is not worth testing much more than the basics, i.e. write and read speed. Ordinarily “hdparm -Tt” would be enough, but a zpool doesn’t create a device file the way normal software RAID, i.e. mdadm, does. First, write speed:

# sync ; time ( dd if=/dev/zero of=test40G bs=1M count=40960 ; sync )          
40960+0 records in
40960+0 records out
42949672960 bytes (43 GB) copied, 127.158 s, 338 MB/s

real    2m12.102s
user    0m0.149s
sys     0m47.108s

# echo | awk '{print "Writing speed =",42949672960/132.102/1024/1024,"MB/s"}'
Writing speed = 310.063 MB/s
# sync ; time ( dd if=/dev/zero of=test40-2G bs=1M count=40960 ; sync ) 
40960+0 records in
40960+0 records out
42949672960 bytes (43 GB) copied, 129.581 s, 331 MB/s

real    2m14.984s
user    0m0.167s
sys     0m47.958s
# echo | awk '{print "Writing speed =",42949672960/134.984/1024/1024,”MB/s”}' 
Writing speed = 303.443 MB/s

Reading speed doesn’t need recalculating or “time”, because there is no final sync to wait for; dd’s own figure is accurate:

# sync ; time ( dd if=test40G of=/dev/null bs=1M count=40960 ; sync ) 
40960+0 records in
40960+0 records out
42949672960 bytes (43 GB) copied, 86.6026 s, 496 MB/s

real    1m26.606s
user    0m0.161s
sys     0m43.542s
[root@rk4 tank]# sync ; time ( dd if=test40-2G of=/dev/null bs=1M count=40960 ; sync )
40960+0 records in
40960+0 records out
42949672960 bytes (43 GB) copied, 82.4693 s, 521 MB/s

real    1m22.482s
user    0m0.168s
sys     0m43.892s

Two years ago writing speed was only 69 MB/s and reading speed 575 MB/s.

The Future

Progress in the PC hardware market seems to get slower and slower. Over 10 years ago you got a hard disk twice as big for the same price the next year, and CPU frequencies still had room to rise. By then I had followed the hard disk market closely for 10 years, starting with my first article in MikroPC magazine in 1994, and the yearly doubling of capacities continued very accurately for nearly a decade after that. Back then the conventional wisdom was that a clearly noticeable increase in speed requires a fourfold increase in the measured numbers. It used to mean waiting two or three years before upgrading was sensible; now you should wait at least five, preferably almost ten years.

On the other hand, stability means that planning is easier. If I now add two or three disks, I can expect the new configuration to last for the next three to five years.

Lessons Learned

TBD