We have a test environment with about 50 virtual machines in OVM, 5 or 6 of which are Windows Server 2008 and 2012 guests. Only about 15-20 of them are running at any given time, with varying memory and processor usage. Roughly 50% of our 192 GB of memory is allocated to the VMs while they are running, and processor utilization is almost always below 10%. Over the past few months we have run into a rare snag where 9-10 of our VMs (both Linux and Windows) show 100% CPU utilization and our OVM/OVS environment freezes. When we try to shut down the Windows machines, they show "Stopping" but never stop, and when we try to kill them, OVM reports that they are locked and they still never shut down. The Linux machines usually shut down fine. The OVM environment itself keeps working, but the OVS server seems to time out because of the processor lock.
My question is: has anyone else run into this? Is it a known issue, and if so, is there a workaround? None of these machines are in production, but this is a nasty snag: it forces us to hard power off the HP ProLiant blade that runs them, which causes a prolonged outage.
I've seen rare instances where even Linux guests hang while shutting down, especially if you try to shut them down while a process on the guest is hung. Most of the time a "kill" works fine, with no lasting issue for the guest.
I have had one instance in which a guest refused to die even with a kill. In that case I did the following from the CLI of the VM Server:
1. Try to release the lock:
/usr/sbin/ovs-agent-dlm --unlock --uuid <VM guest UUID>
2. Check whether the guest's lock file is still there:
ls -ltr /dlm/ovm
3. If it is, force-remove the lock by deleting the file:
rm -f /dlm/ovm/<VM guest UUID>
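If you end up doing this more than once, the three steps can be wrapped in a small shell function. This is only a sketch: the names clear_guest_lock, run, DRY_RUN and LOCK_DIR are made up for illustration, and it assumes you run it as root on the VM Server and that the lock file under /dlm/ovm is named after the guest's UUID, as in the commands above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: replays the manual lock cleanup on an Oracle VM Server.
# Assumptions (not from Oracle docs): run as root on the VM Server, and the
# guest's lock file lives at $LOCK_DIR/<VM guest UUID>.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}
LOCK_DIR=${LOCK_DIR:-/dlm/ovm}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper; takes the VM guest's UUID as its only argument.
clear_guest_lock() {
    uuid=$1
    if [ -z "$uuid" ]; then
        echo "usage: clear_guest_lock <vm-guest-uuid>" >&2
        return 2
    fi
    # 1. Ask the agent's lock manager to release the lock.
    run /usr/sbin/ovs-agent-dlm --unlock --uuid "$uuid"
    # 2. Show what is left in the lock directory.
    run ls -ltr "$LOCK_DIR"
    # 3. Force-remove the guest's lock file; rm -f is a no-op if the
    #    file is already gone.
    run rm -f "$LOCK_DIR/$uuid"
}
```

Source the file, call clear_guest_lock with the guest's UUID to review the commands it would run, and only re-run it with DRY_RUN=0 once the output looks right.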
You can also use the xm command on the VM Server's CLI to kill a VM guest:
xm list lists the running domains (VM guests).
xm shutdown <guest UUID> asks a running guest to shut down cleanly.
xm destroy <guest UUID> "kills" a running guest immediately, like pulling its power cord.
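Those xm commands combine naturally into a "clean shutdown first, destroy only if it hangs" routine. This is a rough sketch under assumptions: it runs as root on the OVS host where xm is installed, `xm list <domain>` exits non-zero once the domain is gone, and the names stop_guest, domain_running, XM and TIMEOUT are hypothetical. Remember that the destroy step is a hard power-off for the guest.

```shell
#!/bin/sh
# Sketch only: clean shutdown first, hard destroy only if the guest hangs.
# Assumptions: runs as root on the OVS host, and `xm list <domain>` exits
# non-zero once the domain no longer exists. XM and TIMEOUT are hypothetical
# knobs; XM exists so the logic can be tried against a stub instead of Xen.
XM=${XM:-xm}
TIMEOUT=${TIMEOUT:-120}   # seconds to wait for a clean shutdown

# Succeeds while the domain is still listed by xm.
domain_running() {
    "$XM" list "$1" >/dev/null 2>&1
}

# Hypothetical helper; pass the domain name/UUID exactly as xm list shows it.
stop_guest() {
    dom=$1
    "$XM" shutdown "$dom"
    waited=0
    while domain_running "$dom" && [ "$waited" -lt "$TIMEOUT" ]; do
        sleep 5
        waited=$((waited + 5))
    done
    if domain_running "$dom"; then
        echo "clean shutdown of $dom timed out; destroying it"
        "$XM" destroy "$dom"
    else
        echo "$dom shut down cleanly"
    fi
}
```

The polling loop gives the guest's OS a chance to flush its disks before falling back to destroy; if a Windows guest routinely needs longer than two minutes to stop, raise TIMEOUT rather than destroying it early.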
The one time this happened to me, I was getting NFS errors on the shared NFS repository.
Thanks a ton for that information! That is going to help us in the long run so we don't have to physically reboot the blade and kill everything!
I read in another Oracle forum post that NFS could cause hangs in OVM/OVS. Did you ever figure out what NFS errors you were getting? We are using NFS for two of our repositories. Thanks again!
I can't remember exactly what the error was. It didn't show up in any log on the VM Server, but I could see it on the VM guest through the VNC console after the logical disks were unmounted. It seemed as if the repository had been unmounted while something was still trying to access files within it, and it was throwing stale NFS file handle errors. Don't quote me, though; my memory isn't what it used to be.
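If you suspect stale NFS handles, a quick probe of the NFS mounts on the VM Server may catch a bad repository before it takes the server down. This is a hypothetical check, not an Oracle VM tool: it assumes GNU coreutils `timeout` is installed and treats any NFS mount that can't be stat'ed within STAT_TIMEOUT seconds as suspect. On a hard-mounted share a hung stat may ignore the signal, in which case the check itself can block, so treat it as a quick probe rather than a guarantee.

```shell
#!/bin/sh
# Sketch only, not an Oracle VM tool: flag NFS mounts that do not answer.
# Assumptions: GNU coreutils `timeout` is installed, and a mount that cannot
# be stat'ed within STAT_TIMEOUT seconds is treated as stale or hung.
STAT_TIMEOUT=${STAT_TIMEOUT:-10}

# List the mount point (field 2) of every nfs/nfs4 entry in a mounts table;
# defaults to /proc/mounts when no file is given.
nfs_mounts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "${1:-/proc/mounts}"
}

# Probe each NFS mount; returns 1 if any of them failed to respond.
check_nfs_mounts() {
    bad=0
    for mnt in $(nfs_mounts "$1"); do
        if timeout "$STAT_TIMEOUT" stat "$mnt" >/dev/null 2>&1; then
            echo "ok    $mnt"
        else
            echo "FAIL  $mnt"
            bad=1
        fi
    done
    return "$bad"
}
```

Running check_nfs_mounts with no argument reads /proc/mounts; a FAIL line for a repository mount is a hint to look at the NFS server before starting or stopping guests stored on it.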