Hi - hoping someone can help me with a small disaster I've just had while trying to patch one of my SPARC T4-1 servers running zones, using the patch-an-ABE (alternate boot environment) method. The patching appeared to work perfectly well; I ran the following commands:
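For context, the usual Live Upgrade ABE patch sequence on Solaris 10 goes roughly like this - the BE name and the patch bundle path here are illustrative, not necessarily the exact arguments I used:

```shell
# Create an alternate boot environment as a clone of the current BE
lucreate -n CPU_2013-01

# Apply the patch bundle to the inactive ABE (bundle path is an example)
luupgrade -t -n CPU_2013-01 -s /var/tmp/10_Recommended/patches

# Activate the patched BE, then reboot with init 6 (not "reboot",
# or the BE switch-over scripts don't run)
luactivate CPU_2013-01
init 6
```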
So I actually made /logs a folder under the zonepath - and after patching, the ABE copy of it doesn't exist, so the zone won't start. In fact /export/zones/tdukihstestz01-CPU_2013-01/ is completely empty now. So I can only assume that having /logs inside the zone's file system has caused this problem.
So after a bit of manual intervention I have my zones running again - basically I edited the zone XML files and the index file in /etc/zones and removed the references to CPU_2013-01, which has done the trick.
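For anyone else who ends up in the same hole: the index file is just colon-delimited text, one line per configured zone, along these lines (the UUID here is shortened/illustrative):

```
# /etc/zones/index format: zone_name:state:zone_path:uuid
tdukihstestz01:installed:/export/zones/tdukihstestz01:61901438-...
```

Back up /etc/zones before hand-editing anything in there.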
However my ZFS looks a bit of a mess. It now looks like this:
First, not sure if you got bit by this one or not, but there's a bug somewhere in the LU scripts that causes all kinds of problems if your zpool names are long enough to, IIRC, cause the output columns from a "df" command to run together.
Second, the snapshot/clone nightmare is what happens when you create a new boot environment in the same root pool as the current boot environment. IME it's best to have two root pools, and to always create a new boot environment on the root pool that the current boot environment is NOT on. Yeah, it takes longer because files actually have to be copied from one pool to the other, but there's no mess of file system clones and snapshots afterwards.
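A sketch of that two-pool approach, assuming a second root pool (called rpool2 here, purely as an example) already exists:

```shell
# Put the new BE on the OTHER root pool; files get copied rather than
# cloned, so no snapshot/clone dependencies are left in the current pool
lucreate -n CPU_2013-01 -p rpool2
```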
Thanks, I've had mixed success with this today and in the past. I posted on here some months ago because the whole live upgrade process failed miserably when zones were concerned. I think in the end it was acknowledged that I'd found an issue with the process. I've not looked at this for some months now. Anyway today was my first attempt at patching from CPU 2012-07 to CPU 2013-01 and I thought by shutting all non-global zones down I'd be ok .... obviously not.
I actually patched 2 x SPARC T4-1 servers the other day that ONLY had global zones and a single ZFS resource pool and the patching went without fault. The only difference today is that the server(s) all have non-global zones defined. I'm also 99.9% sure that having /logs inside the non-global zone zonepath was a bad idea.
I'm still not entirely sure what has happened and why, but as it's a Sunday I don't really want to spend hours trying to figure this out, just in case I totally brick the server. They're up and running, but the ZFS looks a real mess.
I suppose my other question now is how the heck do I sort my file system mess out? The server is working, but just looking at one of the non-global zones I now seem to have multiple ZFS filesystems defined for it.
1) zoneadm list -cv
  ID NAME            STATUS   PATH                          BRAND    IP
   0 global          running  /                             native   shared
   4 tdukihstestz01  running  /export/zones/tdukihstestz01  native   shared
Boot Environment           Is       Active Active    Can    Copy
Name                       Complete Now    On Reboot Delete Status
-------------------------- -------- ------ --------- ------ ----------
CPU_2013-01                yes      yes    yes       no     -
One thing I don't understand so far is if the zonepath is /export/zones/tdukihstestz01 for tdukihstestz01 then how is the output from zfs list showing that rpool/export/zones/tdukihstestz01 has only 84.7M used? All the space for this zone is taken up in /export/zones/tdukihstestz01-Solaris10 but that doesn't exist anymore in /export/zones/???
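(If I've understood ZFS clones right, that small USED figure is expected - a clone only gets charged for blocks that differ from its origin snapshot, with the rest accounted against the origin. Something like this should show who is a clone of what:)

```shell
# "origin" is "-" for ordinary filesystems and names the parent snapshot
# for clones; USED only counts blocks unique to each dataset
zfs list -r -o name,used,refer,origin rpool/export/zones
```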
root@tdukunxtest01:~ 503$ cd /export/zones/
root@tdukunxtest01:zones 504$ ls
tdukbackupz01/ tdukihstestz01/ tdukihstestz02/
This alone has me a little confused, and tbh I daren't do anything in case I destroy the zone or something.
Ok, got a little further with this. I now think I can trace the start of my problems to defining a filesystem within a non-global zone that was actually inside the zonepath itself - having looked at the Solaris zones documentation, there's nothing to stop you doing this, it's just a bad idea. So I've amended ALL my non-global zones to NOT do this anymore, and checked.
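If a zone needs its own /logs filesystem, the safer pattern seems to be to keep the dataset outside the zonepath and hand it to the zone via zonecfg - the rpool/zonelogs dataset name below is illustrative:

```shell
# Create a dataset OUTSIDE the zonepath, then map it to /logs in the zone
zfs create -p rpool/zonelogs/tdukihstestz01
zonecfg -z tdukihstestz01 <<EOF
add fs
set dir=/logs
set special=rpool/zonelogs/tdukihstestz01
set type=zfs
end
commit
EOF
```

That way Live Upgrade clones the zonepath without any loopback directories living inside it.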
Taking a single non-global zone I can see that ZFS did the following when I ran the lucreate command:
So a snapshot / clone was taken. There is then a series of zfs canmount=on and zfs canmount=off commands seen against rpool/export/zones/tdukihstestz01-CPU_2013-01 - I'm not entirely sure what these are doing; well, I know what the command does, just not why it's doing it.
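My understanding (hedged - I haven't found this documented precisely) is that LU toggles canmount so that only the datasets belonging to the BE it is working on get mounted, keeping the inactive BE's datasets from mounting over the live ones. The current state can be checked with:

```shell
# Show how each zone dataset is currently set to mount
zfs get -r -o name,property,value canmount,mountpoint rpool/export/zones
```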
The patch process finished at 08:46 and I rebooted the server with an init 6 a little time after this. I then see a few more canmount commands and then:
Now my memory is a little fuzzy over what happened next, but the failure of the non-global zone to boot was because <zonepath>/logs/ did not exist - which takes me back to my point above about defining a file system within the <zonepath>. When I tried to start the zone tdukihstestz01 it complained that /logs did not exist. It did exist in the zone on the old Boot Environment but NOT the new one. And when I actually created these zones several months ago, I remember I had to manually create these directories BEFORE I ran the initial sudo zoneadm -z tdukihstestz01 boot command.
So basically I'm 99.9% sure that I know what I did wrong for the non-global zones, and I can only assume this has had a knock-on effect on the root environment. To fix a non-global zone I ran the following commands earlier today:
zfs set canmount=on rpool/export/zones/tdukihstestz02
zfs mount rpool/export/zones/tdukihstestz02
I also see that the 81.1M of space used in rpool/export/zones/tdukihstestz01 must refer to changes between the original file system and the clone ... I think. These will only have been log files so I'm not too bothered ... again I think, well actually hope.
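If you want to break that usage down further, zfs list has a space view (available on later Solaris 10 updates) that splits USED into its components:

```shell
# USEDDS = live data, USEDSNAP = space pinned by snapshots,
# USEDCHILD = space consumed by child datasets
zfs list -o space -r rpool/export/zones
```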
So I'm almost sorted; there is the small matter of the root file system - which tbh I won't be so gung-ho in my approach to fixing. But again, if anyone has any ideas on this I'd love to hear them.
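One cautious sequence for the root-pool clean-up, assuming lustatus still knows about the stale BE (the BE and dataset names below are placeholders - double-check everything before destroying anything):

```shell
# Prefer the LU tools first; ludelete removes the BE and its datasets
lustatus
ludelete CPU_2013-01        # only if this BE really is unwanted

# If LU's metadata is already broken, promote the dataset you are
# currently booted from so it no longer depends on the old BE's
# snapshot; the leftover snapshots can then be destroyed (destructive!)
zfs list -o name,origin -r rpool/ROOT
zfs promote rpool/ROOT/current-be-dataset   # placeholder dataset name
```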
Well, the answer seems to be: make sure that your non-global zones are set up correctly. I've tested this twice this morning on T4-1 servers with non-global zones using the ABE / live upgrade method, and it works a treat. However, I did stop the non-global zones first - I simply daren't do this with running non-global zones. Maybe one day I will try it, but for now my preferred method works just fine.