Hi all,
I think that I encountered buggy behavior of the ZFS auto-snapshot feature in Solaris 11, 11.1, 11.2 and 11.3. Let's assume, we have the following ZFS datasets on a freshly installed test machine (running Solaris 11.3):
steffen@solaris:~$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 9,48G 21,5G 4,85M /rpool
rpool/ROOT 5,67G 21,5G 31K legacy
rpool/ROOT/solaris 5,67G 21,5G 4,70G /
rpool/ROOT/solaris/var 854M 21,5G 710M /var
rpool/VARSHARE 246K 21,5G 47K /var/share
rpool/VARSHARE/pkg 81K 21,5G 32K /var/share/pkg
rpool/VARSHARE/pkg/repositories 31K 21,5G 31K /var/share/pkg/repositories
rpool/VARSHARE/zones 31K 21,5G 31K /system/zones
rpool/dump 2,06G 21,6G 2,00G -
rpool/export 5,77M 21,5G 32K /export
rpool/export/home 5,72M 21,5G 32K /export/home
rpool/export/home/steffen 5,67M 21,5G 5,33M /export/home/steffen
rpool/swap 1,03G 21,6G 1,00G -
rpool/zones 724M 21,5G 32K /zones
rpool/zones/test 724M 21,5G 32K /zones/test
rpool/zones/test/rpool 724M 21,5G 31K /rpool
rpool/zones/test/rpool/ROOT 724M 21,5G 31K legacy
rpool/zones/test/rpool/ROOT/solaris 724M 21,5G 667M /zones/test/root
rpool/zones/test/rpool/ROOT/solaris/var 49,7M 21,5G 48,7M /zones/test/root/var
rpool/zones/test/rpool/VARSHARE 38K 21,5G 38K /var/share
rpool/zones/test/rpool/export 62K 21,5G 31K /export
rpool/zones/test/rpool/export/home 31K 21,5G 31K /export/home
When activating the auto-snapshot feature, we will find that the following datasets will get auto-snapshotted correctly:
steffen@solaris:~$ zfs list -t snapshot|grep zfs-auto-snap
rpool@zfs-auto-snap_frequent-2016-02-29-03h01 0 - 4,85M -
rpool/ROOT/solaris/var@zfs-auto-snap_frequent-2016-02-29-03h01 0 - 1,07G -
rpool/VARSHARE/pkg/repositories@zfs-auto-snap_frequent-2016-02-29-03h01 0 - 31K -
rpool/VARSHARE/zones@zfs-auto-snap_frequent-2016-02-29-03h01 0 - 31K -
rpool/export/home/steffen@zfs-auto-snap_frequent-2016-02-29-03h01 0 - 5,33M -
rpool/zones/test/rpool/ROOT/solaris/var@zfs-auto-snap_frequent-2016-02-29-03h01 0 - 239M -
rpool/zones/test/rpool/VARSHARE@zfs-auto-snap_frequent-2016-02-29-03h01 0 - 38K -
rpool/zones/test/rpool/export/home@zfs-auto-snap_frequent-2016-02-29-03h01 0 - 31K -
But there are other datasets which will never get an auto-snapshot - completely independent from what is modified on that dataset:
rpool/ROOT
rpool/ROOT/solaris
rpool/VARSHARE
rpool/dump
rpool/export
rpool/swap
rpool/zones
rpool/zones/test
rpool/zones/test/rpool
rpool/zones/test/rpool/ROOT
rpool/zones/test/rpool/ROOT/solaris
rpool/zones/test/rpool/export
For example, if I added, deleted or modified files or directories at "rpool/ROOT/solaris" (which is mounted as "/" in the global zone) or at "rpool/zones/test/rpool/ROOT/solaris" (which is mounted as "/" in the zone "test"), I would expect that there will be snapshots for these datasets, too -- but they are never created. While it is, of course, obvious for "rpool/dump" and "rpool/swap", such a behavior shouldn't occur for the other datasets (e.g. "rpool/zones/test/rpool/ROOT/solaris") as the ZFS properties are set to enable auto-snapshots:
NAME | PROPERTY | VALUE SOURCE |
rpool/zones/test/rpool/ROOT/solaris com.sun:auto-snapshot true inherited from rpool
There is neither a problem to snapshot them manually nor to get them snapshotted when updating the system and creating a new boot environment. So it seems that it is not a problem of ZFS or the ZFS snapshot feature in general, but has rather to do with the auto-snapshot feature of time-slider.
The problem occurs on all servers I am working with and it occurs in Solaris 11, 11.1, 11.2 and 11.3.
To track this down, I made the python script
/usr/share/time-slider/lib/time_slider/zfs.py
a bit more verbose by adding several debug outputs. It shows the following behavior which presumably explains the whole problem:
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones/test/rpool/export/home
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones/test/rpool/export/home
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool/export/home@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones/test/rpool/export
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones/test/rpool/export
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool/export@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones/test/rpool/VARSHARE
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones/test/rpool/VARSHARE
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool/VARSHARE@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones/test/rpool/ROOT/solaris/var
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones/test/rpool/ROOT/solaris/var
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool/ROOT/solaris/var@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones/test/rpool/ROOT/solaris
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones/test/rpool/ROOT/solaris
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool/ROOT/solaris@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones/test/rpool/ROOT
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones/test/rpool/ROOT
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool/ROOT@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones/test/rpool
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones/test/rpool
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones/test
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones/test
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/zones
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/zones
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/export/home/steffen
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/export/home/steffen
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/export/home/steffen@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/export/home
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/export/home
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/export/home@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/export
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/export
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/export@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/VARSHARE/zones
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/VARSHARE/zones
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/VARSHARE/zones@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/VARSHARE/pkg/repositories
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/VARSHARE/pkg/repositories
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/VARSHARE/pkg/repositories@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/VARSHARE/pkg
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/VARSHARE/pkg
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/VARSHARE/pkg@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/VARSHARE
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/VARSHARE
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/VARSHARE@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/ROOT/solaris/var
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/ROOT/solaris/var
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/ROOT/solaris/var@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] in finalrecursive: rpool/ROOT
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool/ROOT
Feb 28 03:13:10 solaris time-sliderd: [ID 702911 daemon.notice] util.run_command(): /usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/ROOT@zfs-auto-snap_daily-2016-02-28-03h13
Feb 28 03:13:11 solaris time-sliderd: [ID 702911 daemon.notice] in single: rpool
Feb 28 03:13:11 solaris time-sliderd: [ID 702911 daemon.notice] create_snapshot: rpool
The lines with "util.run_command()" show the commands which are fired by time-slider to create the auto-snapshots. To make it more clearly, I'll pick the two bold-highlighted lines of the above log.
What happens is: At first, a snapshot is created this way:
/usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool/ROOT/solaris/var@zfs-auto-snap_daily-2016-02-28-03h13
Immediately after this snapshot command, another command is executed:
/usr/bin/pfexec /usr/sbin/zfs snapshot -r rpool/zones/test/rpool/ROOT/solaris@zfs-auto-snap_daily-2016-02-28-03h13
The second command fails because "-r" tells "zfs snapshot" to create snapshots for all sub-datasets of "rpool/zones/test/rpool/ROOT/solaris". As "rpool/zones/test/rpool/ROOT/solaris/var" is a sub-dataset of "rpool/zones/test/rpool/ROOT/solaris" and a snapshot for "rpool/zones/test/rpool/ROOT/solaris/var" has already been created before using the same snapshot tag, the "zfs" tool quits with the message:
cannot create snapshot 'rpool/zones/test/rpool/ROOT/solaris/var@zfs-auto-snap_daily-2016-02-28-03h13': snapshot already exists
no snapshots were created
For this reason, auto-snapshots only seem to work for ZFS datasets which don't have child datasets. Only the leaves of the tree (and the root) actually get auto-snapshotted, but branches in-between won't.
In my opinion, this behavior is not documented this way and therefore should not occur. People who rely on auto-snapshots might get disappointed when they see that snapshots they have been expecting have actually never created. Especially in times were cryptographic ransom-ware viruses and Trojan horses are in the wild which attack Windows systems (and encrypt even possibly their server-stored backups), it is important to have a reliably working auto-snapshot feature on the server (presuming that Solaris is often used to provide data storage services to Windows clients). Independently from the usage scenario, a bug like this should be fixed as soon as possible.
My question to you is: Is this problem known? Does anybody else observe the problem as well? Did I do or understand anything in a wrong way?
Thank you very much in advance for any helpful replies!
Best regards,
Steffen