This discussion is archived
5 Replies Latest reply: Feb 21, 2013 6:29 AM by JulianG

Alternate Boot Environment disaster

JulianG Newbie
Hi - I'm hoping someone can help me with a small disaster I've just had while patching one of my SPARC T4-1 servers running zones, using the alternate boot environment (ABE) patching method. The patching itself appeared to work perfectly well; I ran the following commands:

sudo su -
zlogin tdukihstestz01 shutdown -y -g0 -i 0
zlogin tdukihstestz02 shutdown -y -g0 -i 0
zlogin tdukbackupz01 shutdown -y -g0 -i 0

lucreate -n CPU_2013-01
mkdir /tdukwbadm01
mount -F nfs tdukwbadm01:/export/jumpstart/Patches/Solaris10/10_Recommended_CPU_2013-01 /tdukwbadm01/
cd /tdukwbadm01/
./installpatchset --apply-prereq --s10patchset
nohup ./installpatchset -B CPU_2013-01 --s10patchset
luactivate CPU_2013-01
lustatus
init 6

However, when the server came back up only one zone would start: tdukbackupz01.

The other two zones were left in the installed state, even though they are set to autoboot. The ONLY difference between the zones is that for the two that won't start I had added an "fs" resource like this:

zonepath: /export/zones/tdukihstestz01
fs:
special: /export/zones/tdukihstestz01/logs

So the backing directory for /logs actually lives under the zonepath - and it appears that after patching the ABE this directory doesn't exist, so the zone won't start. In fact /export/zones/tdukihstestz01-CPU_2013-01/ is completely empty now. So I can only assume that having /logs inside the zone's file system is what has caused this problem.
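For reference, the fs resource was added with a zonecfg session roughly like the one below. Treat it as a sketch: only the special path is definitely as shown above, while the dir value and the lofs type are illustrative.

zonecfg -z tdukihstestz01
zonecfg:tdukihstestz01> add fs
zonecfg:tdukihstestz01:fs> set dir=/logs
zonecfg:tdukihstestz01:fs> set special=/export/zones/tdukihstestz01/logs
zonecfg:tdukihstestz01:fs> set type=lofs
zonecfg:tdukihstestz01:fs> end
zonecfg:tdukihstestz01> commit
zonecfg:tdukihstestz01> exit

The backing directory (the special value) sitting inside the zonepath is the part that turned out to be the mistake.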

So after a bit of manual intervention I have my zones running again - basically I edited the zones' XML files and the index file in /etc/zones and removed the references to CPU_2013-01, which did the trick.
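For anyone in the same hole, the quickest way to see what needs changing is just to grep /etc/zones for the BE name before touching anything - a sketch, and obviously back the files up first:

cd /etc/zones
cp index index.backup
grep CPU_2013-01 index *.xml    # shows which entries still point at the -CPU_2013-01 zonepaths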

However my ZFS looks a bit of a mess. It now looks like this:

root@tdukunxtest01:~ 503$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
archives 42.8G 504G 42.8G /archives
rpool 126G 421G 106K /rpool
rpool/ROOT 5.48G 421G 31K legacy
rpool/ROOT/CPU_2013-01 5.38G 421G 3.60G /
rpool/ROOT/CPU_2013-01@CPU_2013-01 592M - 3.60G -
rpool/ROOT/CPU_2013-01/var 1.21G 421G 1.19G /var
rpool/ROOT/CPU_2013-01/var@CPU_2013-01 14.4M - 659M -
rpool/ROOT/Solaris10 96.9M 421G 3.60G /.alt.Solaris10
rpool/ROOT/Solaris10/var 22.2M 421G 671M /.alt.Solaris10/var
rpool/dump 32.0G 421G 32.0G -
rpool/export 17.9G 421G 35K /export
rpool/export/home 1.01G 31.0G 1.01G /export/home
rpool/export/zones 16.9G 421G 35K /export/zones
rpool/export/zones/tdukbackupz01 41.8M 421G 3.14G /export/zones/tdukbackupz01
rpool/export/zones/tdukbackupz01-Solaris10 3.14G 96.9G 3.13G /export/zones/tdukbackupz01-Solaris10
rpool/export/zones/tdukbackupz01-Solaris10@CPU_2013-01 1.80M - 3.13G -
rpool/export/zones/tdukihstestz01 43.3M 421G 10.1G /export/zones/tdukihstestz01
rpool/export/zones/tdukihstestz01-Solaris10 10.2G 21.8G 10.2G /export/zones/tdukihstestz01-Solaris10
rpool/export/zones/tdukihstestz01-Solaris10@CPU_2013-01 2.28M - 10.2G -
rpool/export/zones/tdukihstestz02 35.3M 421G 3.37G /export/zones/tdukihstestz02
rpool/export/zones/tdukihstestz02-Solaris10 3.40G 28.6G 3.40G /export/zones/tdukihstestz02-Solaris10
rpool/export/zones/tdukihstestz02-Solaris10@CPU_2013-01 1.66M - 3.40G -
rpool/logs 5.10G 26.9G 5.10G /logs
rpool/swap 66.0G 423G 64.0G -

Whereas previously it looked more like this:

NAME USED AVAIL REFER MOUNTPOINT
archives 42.8G 504G 42.8G /archives
rpool 126G 421G 106K /rpool
rpool/ROOT 5.48G 421G 31K legacy
rpool/dump 32.0G 421G 32.0G -
rpool/export 17.9G 421G 35K /export
rpool/export/home 1.01G 31.0G 1.01G /export/home
rpool/export/zones 16.9G 421G 35K /export/zones
rpool/export/zones/tdukbackupz01 41.8M 421G 3.14G /export/zones/tdukbackupz01
rpool/export/zones/tdukihstestz01 43.3M 421G 10.1G /export/zones/tdukihstestz01
rpool/export/zones/tdukihstestz02 35.3M 421G 3.37G /export/zones/tdukihstestz02
rpool/logs 5.10G 26.9G 5.10G /logs
rpool/swap 66.0G 423G 64.0G -


Does anyone know how to fix my file system mess? And is having a non-global zone's /logs inside the actual zone's zonepath a bad idea? It would certainly appear so.

Thanks - Julian.
  • 1. Re: Alternate Boot Environment disaster
    800381 Explorer
    First, I'm not sure if you got bitten by this one or not, but there's a bug somewhere in the LU scripts that causes all kinds of problems if your zpool names are long enough to (IIRC) make the output columns from a "df" command run together.

    Second, the snapshot/clone nightmare is what happens when you create a new boot environment in the same root pool as the current boot environment. IME it's best to have two root pools, and to always create a new boot environment on the root pool that the current boot environment is NOT on. Yeah, it takes longer because files actually have to be copied from one pool to the other, but there's no mess of file system clones and snapshots afterwards.
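    Something along these lines, for example - just a sketch, and "rpool2" is only a placeholder for a second root pool that already exists and is bootable:

    # create the new BE in the other root pool instead of cloning within rpool
    lucreate -n CPU_2013-01 -p rpool2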
  • 2. Re: Alternate Boot Environment disaster
    JulianG Newbie
    Thanks, I've had mixed success with this today and in the past. I posted on here some months ago because the whole Live Upgrade process failed miserably where zones were concerned. I think in the end it was acknowledged that I'd found an issue with the process. I've not looked at this for some months now. Anyway, today was my first attempt at patching from CPU 2012-07 to CPU 2013-01 and I thought that by shutting all non-global zones down I'd be OK ... obviously not.

    I actually patched 2 x SPARC T4-1 servers the other day that ONLY had a global zone and a single ZFS pool, and the patching went without fault. The only difference today is that the server(s) all have non-global zones defined. I'm also 99.9% sure that having /logs inside the non-global zone's zonepath was a bad idea.

    I'm still not entirely sure what has happened and why, but as it's a Sunday I don't really want to spend hours trying to figure this out, just in case I totally brick the server. They're up and running, but the ZFS layout looks a real mess.

    Thanks.
  • 3. Re: Alternate Boot Environment disaster
    JulianG Newbie
    I suppose my other question now is how the heck do I sort my file system mess out? The server is working, but just looking at one of the non-global zones I now seem to have multiple ZFS datasets defined for it.

    1) zoneadm list -cv
    ID NAME             STATUS     PATH                           BRAND    IP
     0 global           running    /                              native   shared
     4 tdukihstestz01   running    /export/zones/tdukihstestz01   native   shared


    2) lustatus
    Boot Environment           Is       Active Active    Can    Copy
    Name                       Complete Now    On Reboot Delete Status
    -------------------------- -------- ------ --------- ------ ----------
    CPU_2013-01                yes      yes    yes       no     -


    3) zfs list |grep tdukihstestz01
    rpool/export/zones/tdukihstestz01 64.9M 421G 10.1G /export/zones/tdukihstestz01
    rpool/export/zones/tdukihstestz01-Solaris10 10.2G 21.8G 10.2G /export/zones/tdukihstestz01-Solaris10
    rpool/export/zones/tdukihstestz01-Solaris10@CPU_2013-01 2.28M - 10.2G -


    One thing I don't understand so far: if the zonepath for tdukihstestz01 is /export/zones/tdukihstestz01, how can zfs list show that rpool/export/zones/tdukihstestz01 has only 84.7M used? All the space for this zone is taken up in rpool/export/zones/tdukihstestz01-Solaris10, but that doesn't exist anymore under /export/zones/?

    root@tdukunxtest01:~ 503$ cd /export/zones/
    root@tdukunxtest01:zones 504$ ls
    tdukbackupz01/ tdukihstestz01/ tdukihstestz02/


    This alone has me a little confused, and tbh I daren't do anything in case I destroy the zone or something.
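    Presumably something read-only along these lines would show which dataset is actually mounted at the zonepath without risking anything - just a sketch:

    zfs get origin,mounted,mountpoint rpool/export/zones/tdukihstestz01
    df -h /export/zones/tdukihstestz01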

    Any help would be appreciated.

    Thanks - Julian.
  • 4. Re: Alternate Boot Environment disaster
    JulianG Newbie
    OK, I've got a little further with this. I can now trace the start of my problems back to defining a file system in a non-global zone whose backing directory was actually inside the zonepath itself - having looked at the Solaris zones documentation, there's nothing to stop you doing this, it's just a bad idea. So I've amended ALL my non-global zones to NOT do this anymore and checked them.

    Taking a single non-global zone I can see that ZFS did the following when I ran the lucreate command:

    2013-02-17.07:39:58 zfs snapshot rpool/export/zones/tdukihstestz01@CPU_2013-01
    2013-02-17.07:39:58 zfs clone rpool/export/zones/tdukihstestz01@CPU_2013-01 rpool/export/zones/tdukihstestz01-CPU_2013-01
    2013-02-17.07:39:58 zfs set zoned=off rpool/export/zones/tdukihstestz01-CPU_2013-01
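    (These entries look like they come from the pool history; something like the following should show them on any system - it's read-only:)

    zpool history rpool | grep tdukihstestz01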

    So a snapshot / clone was taken. There is then a series of zfs set canmount=on and zfs set canmount=off commands against rpool/export/zones/tdukihstestz01-CPU_2013-01 - I'm not entirely sure what these are for; well, I know what the command does, just not why Live Upgrade is doing it.

    The patch process finished at 08:46 and I rebooted the server with an init 6 a little time after this. I then see a few more canmount commands and then:

    2013-02-17.08:49:22 zfs rename rpool/export/zones/tdukihstestz01 rpool/export/zones/tdukihstestz01-Solaris10

    And then a load more canmount commands against rpool/export/zones/tdukihstestz01-Solaris10 but also the following is shown:

    2013-02-17.08:54:31 zfs rename rpool/export/zones/tdukihstestz01-CPU_2013-01 rpool/export/zones/tdukihstestz01

    Now my memory is a little fuzzy about what happened next, but the failure of the non-global zone to boot was because <zonepath>/logs/ did not exist - which takes me back to my point above about defining a file system within the <zonepath>. When I tried to start the zone tdukihstestz01 it complained that /logs did not exist. It did exist in the zone on the old boot environment but NOT in the new one. And I remember that when I actually created these zones several months ago I had to manually create these directories BEFORE I ran the initial sudo zoneadm -z tdukihstestz01 boot command.
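    (i.e. roughly this before the first boot - a sketch based on what I remember doing, with the path as per my config:)

    sudo mkdir -p /export/zones/tdukihstestz01/logs
    sudo zoneadm -z tdukihstestz01 boot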

    So basically I'm 99.9% sure that I know what I did wrong to cause this for the non-global zones, and I can only assume this has had a knock-on effect on the root environment. To fix a non-global zone I ran the following commands earlier today:


    zfs list |grep tdukihstestz02
    rpool/export/zones/tdukihstestz02 81.1M 421G 3.41G /export/zones/tdukihstestz02 <-- clone
    rpool/export/zones/tdukihstestz02-Solaris10 3.40G 28.6G 3.40G /export/zones/tdukihstestz02-Solaris10
    rpool/export/zones/tdukihstestz02-Solaris10@CPU_2013-01 1.66M - 3.40G - <-- snapshot

    zlogin tdukihstestz02
    init 5    # halt the zone cleanly from inside it

    zfs destroy -R rpool/export/zones/tdukihstestz02-Solaris10@CPU_2013-01    # -R also removes the dependent clone that was mounted at the zonepath

    zfs list |grep tdukihstestz02
    rpool/export/zones/tdukihstestz02-Solaris10 3.40G 28.6G 3.40G /export/zones/tdukihstestz02-Solaris10

    zfs rename rpool/export/zones/tdukihstestz02-Solaris10 rpool/export/zones/tdukihstestz02

    zfs set canmount=on rpool/export/zones/tdukihstestz02

    zfs mount rpool/export/zones/tdukihstestz02
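
    Followed by simply booting the zone again and checking it shows as running - nothing clever here:

    zoneadm -z tdukihstestz02 boot
    zoneadm list -cv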


    I also see that the 81.1M of space used in rpool/export/zones/tdukihstestz02 (before the destroy) must refer to the differences between the original file system and the clone ... I think. These will only have been log files, so I'm not too bothered ... again I think, well, actually hope.


    So I'm almost sorted; there is still the small matter of the root file system, which tbh I won't be so gung-ho about fixing. But again, if anyone has any ideas on this I'd love to hear them.
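
    (For anyone wondering what it currently looks like from the Live Upgrade side, these read-only commands are probably the safe starting point - nothing here changes anything:)

    lustatus                    # what Live Upgrade thinks the boot environments are
    lufslist CPU_2013-01        # the file systems recorded against the current BE
    zfs list -r rpool/ROOT      # the datasets actually left under the root pool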

    Thanks - Julian.
  • 5. Re: Alternate Boot Environment disaster
    JulianG Newbie
    I'll answer my own post again ;-)

    Well, the answer seems to be: make sure that your non-global zones are set up correctly. I've tested this twice this morning on T4-1 servers with non-global zones, using the ABE / Live Upgrade method, and it works a treat. However, I did shut the non-global zones down first; I simply daren't do this with running non-global zones. Maybe one day I will try it, but for now my preferred method works just fine.
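
    For the record, "set up correctly" here just means the fs resource points at a backing directory outside the zonepath. A rough sketch of the change (the /logs/tdukihstestz01 path is illustrative, not a statement of exactly what I configured):

    mkdir -p /logs/tdukihstestz01                 # new backing directory, outside the zonepath
    zonecfg -z tdukihstestz01
    zonecfg:tdukihstestz01> select fs dir=/logs
    zonecfg:tdukihstestz01:fs> set special=/logs/tdukihstestz01
    zonecfg:tdukihstestz01:fs> end
    zonecfg:tdukihstestz01> commit
    zonecfg:tdukihstestz01> exit

    Plus, of course, copying anything that was already sitting in <zonepath>/logs across to the new location before booting the zone.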

    Thanks - Julian.
