At present I have a requirement to make a hung system panic. One possible approach I have come across, which should help in some cases, is the deadman timer. I am trying to test it and build test cases before it is used in production. I have been trying to make a Solaris system hang, but as I am a newbie I have not been successful so far.
Before I go ahead and share the details (OS, OS patch level, server model, etc.), I would like to confirm: am I on the right track? Will implementing the deadman timer on Solaris (preferably 10) ensure that a hung system (mainly caused by dodgy storage) panics? If yes, is there a better way of doing this?
Kindly let me know if any details are required from my end as a prerequisite for the above queries.
If you create a non-redundant ZFS storage pool, set the "fail mode" pool property to panic, and start pulling
out devices from the pool, the system should hang and then panic. This helpful editor keeps putting fail mode
as two words, but it's actually one, like this:
# zpool set failmode=panic pool-name
I'm not sure what you are trying to accomplish, but this method is worth a try.
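To make the steps above repeatable on a scratch box, they could be wrapped roughly like this. It is only a sketch: the function names (run, failmode_test), the pool name and the device name are made up for illustration, and it assumes the installed ZFS version supports the failmode property. Creating a pool wipes the device, so use a disk you can afford to lose.

```shell
#!/bin/sh
# run CMD [ARGS...]
# Hypothetical wrapper: with DRY_RUN=1 it only prints the command,
# otherwise it executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# failmode_test POOL DEVICE
# Hypothetical helper: builds a single-device (non-redundant) pool,
# sets failmode=panic on it, then reads the property back.
failmode_test() {
    pool=$1
    dev=$2
    run zpool create "$pool" "$dev" || return 1
    run zpool set failmode=panic "$pool" || return 1
    run zpool get failmode "$pool"
}
```

For a rehearsal that touches nothing, set DRY_RUN=1 and call failmode_test testpool c1t1d0 (both names are placeholders); it then just prints the three zpool commands.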
Just to elaborate: in the last 4 months we have had a couple of instances where our Solaris 10 UltraSPARC (UFS) systems hung. Even a sync from OBP did not work, and as a last resort we had to hard reset the systems to bring them back online. Because the systems were hard reset, we have no crash dumps to work on and cannot find the exact reason.
From dmesg, it seems the hangs were caused by a dodgy hard disk. The setup contains RAID 0+1 JBOD.
While searching for information on system hangs, I came across the deadman timer, which looks promising to me. But before I can implement it on our production UFS systems, I have to test it to confirm that it is useful in our scenario.
If this works, it will reduce the downtime and give us a crash dump as well :)
As your procedure revolves around ZFS, it might not be that useful for us (all our systems use UFS), but thanks again for your reply. If the same concept (pulling a disk) is used with UFS, it might cause the system to hang. So I will search around on this and check its feasibility.
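For anyone following along, the deadman timer is, as far as I know, switched on with the snooping tunable in /etc/system and takes effect after a reboot. The sketch below is written from memory, not quoted from any Sun document: the 90-second interval is an arbitrary example, and deadman_enabled is a made-up helper name. Note also that the deadman timer fires when the clock stops advancing, so a hang where the kernel is still ticking but I/O is stuck may not trigger it; that is exactly what needs testing.

```shell
#!/bin/sh
# Lines that would go into /etc/system (shown as comments here; a reboot
# is needed before they take effect):
#
#   set snooping=1
#   set snoop_interval=90000000
#
# snoop_interval is in microseconds; 90000000 (90 s) is only an example.

# deadman_enabled FILE
# Hypothetical helper: succeeds if FILE has an active "set snooping=1"
# line. Comment lines in /etc/system start with '*', so anchoring the
# pattern at the start of the line skips them.
deadman_enabled() {
    grep '^set [ ]*snooping[ ]*=[ ]*1' "$1" >/dev/null 2>&1
}
```

After the reboot, deadman_enabled /etc/system && echo "deadman configured" gives a quick check that the setting is really in the file.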
if someone does have a copy of Sun Document ID 13258 ...
That statement suggests you don't have service contract privileges to MOS.
If that is correct, then if someone provides the information to you they will be violating the terms of their service agreement privileges.
Giving something like that to anyone that does not have proper access would subject them to having all the privileges revoked by Oracle. That is a high price to pay, and it is doubtful anyone would take that chance.
(.... and pasting a quote of the document into a forum post is the same sort of reportable violation as handing it to you.)
Now, having said all that, it doesn't take much creativity to use an Internet search site to look for Sun Infodoc 13258 and get usable results.
I assume that you have also tried the usual diagnostics like booting with kadb if the system
hangs to attempt to get a crash dump.
Also, I should have mentioned that with dodgy hardware, creating non-redundant ZFS pools and then pulling the disks is not a good idea if you want your data consistent. I don't recommend this under normal conditions.
To track down bad devices that are hanging the system, you can also review iostat -En and fmadm faulty and fmdump -eV.
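If it helps, those three commands can be captured in one go so there is something to compare after the next hang. This is only a sketch: collect_diag and the log file naming are made up for illustration, and fmadm will most likely need root.

```shell
#!/bin/sh
# collect_diag OUTDIR
# Hypothetical helper: runs each diagnostic command mentioned above and
# appends its output to a timestamped log under OUTDIR. A command that
# fails or is not installed is noted in the log instead of aborting.
collect_diag() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    stamp=`date '+%Y%m%d-%H%M%S'`
    log="$outdir/diag-$stamp.log"
    for cmd in "iostat -En" "fmadm faulty" "fmdump -eV"; do
        echo "=== $cmd ===" >> "$log"
        # $cmd is left unquoted on purpose so it splits into command + args
        $cmd >> "$log" 2>&1 || echo "(command failed or not available)" >> "$log"
    done
    # Print the log path so the caller knows where the output went
    echo "$log"
}
```

Calling collect_diag /var/tmp/diag (the directory is just an example) prints the path of the log it wrote.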
Scripts that gather system statistics are already in place. The script containing the commands below has been scheduled to run 12 times a day from cron, but in the previous instances we did not see any issue in the captured logs.