I opened an SR on this problem 48 hours ago, and have gotten almost nowhere with support. Jeez.
I should have asked here first.
I have a newly installed 3.1.1 2-node VM Server cluster. It was working fine. Then I tried to "update" one of the VM servers via YUM, and it went into la-la land. It is in a state where, from VM Manager, I can't start it, can't stop it, can't restart it, can't delete it, and can't re-discover it.
What the heck do I do, short of completely re-installing it?
Or is that the easy solution? I can re-install it and discover it, except that it's "stuck" in VM Manager, and I fear that I couldn't discover it again.
Support told me to run "ovs-agent-cleanup", which, I understand, is supposed to get it back into a state where I can re-discover it. But I've looked everywhere on that server, and there's no such script. Maybe it's only on version 2.2 or 3.0. I'm Linux-savvy enough to search thoroughly through the filesystem, and there's no file with a name anywhere close to "ovs-agent-cleanup".
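For anyone who wants to repeat that kind of search, here is a minimal sketch of a case-insensitive filename search. The function name find_by_name is made up for illustration, and the fragment you search for is whatever part of the tool name you trust; it only uses standard find options.

```shell
# Hypothetical helper (not an Oracle tool): list every path under a root
# directory whose file name contains the given fragment, ignoring case.
find_by_name() {
    root="$1"
    fragment="$2"
    # -xdev keeps the search on one filesystem (skips /proc, NFS mounts, etc.);
    # errors from unreadable directories are discarded.
    find "$root" -xdev -iname "*${fragment}*" 2>/dev/null
}

# Example invocation, run as root on the VM server:
#   find_by_name / cleanup
```

Searching for a short fragment such as "cleanup" rather than the full name catches variants like ovs-agent-cleanup and cleanup_ovs_agent in one pass.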
Anyone know what I can do to get my cluster back?
Well, I have just checked with my 3.1.1 VM servers and the installation layout has indeed changed. I'd agree that the support engineer is referring to some former ovs-agent installation.
You can check out the ovs stuff in /usr/sbin, but a quick check of /usr/sbin/ovs-agent-db didn't reveal anything particularly similar to the option that the support engineer referred to. I'd request to escalate the SR up the chain to get some feedback from Oracle.
Wow. I thought this would be an easy one. Maybe things have changed in 3.1.1 so much that the old methods (which surely must exist, I theorize) don't work any more. My Oracle support case on this has not returned any solution either. The node really is totally in limbo, unusable in any way I can see. If I could delete it from VM Manager, I could at least do the traditional Windows solution: re-install from scratch.
With the release of OVM 3.1.1 there have been some CLI tools released separately. Maybe you'll find something in there that'd help.
On the other hand… if you're able to remove the VM server from the cluster, I'd surely take that route to get it up and running again - it's only a couple of minutes to do so and another couple of minutes to get it into the Cluster again.
I have finally gotten a better response from support, after they decided that, yes, the cleanup_ovs_agent tool isn't there any more with 3.1.1.
Their response was:
Please just try to remove the following directory
After removing it, readd the server and rediscover it on the manager.
Can someone please confirm or deny that removing that directory is a reasonable thing to do?
Also, after removing it, they say to "readd the server" and rediscover it. I have no idea what "readd the server" means, and they (of course) sent me nothing more than what you see above. Anyone know what that means? I have asked for clarification, but that might be another day of waiting.
I removed the /etc/ovs-agent/db directory, rebooted it, and tried to rediscover it.
That FAILED, and among the messages that came from that, it said:
Job Internal Error (Operation)com.oracle.ovm.mgr.api.exception.IllegalOperationException: If the IP address of this server has changed, please delete the server: toravm1.acbl.net, and re-discover.
Well, the IP address has NOT changed, but I tried to delete the server anyhow (which I had done several times before). That failed, too.
So, the server is still dead...
You've already gone past the point where you might be able to recover from the app interface but for the next time you see something similar:
When I've seen the VM client interface 'freeze' and display various system status icons on both VMs and VM servers that I know are incorrect (like claiming an OVM server is running that I know is powered off), it's always been due to an 'in-progress' running job/task that had to be manually aborted via the jobs control pane. Only then would the UI refresh properly and the control functions like rediscover/start/stop work correctly again.
The first time I saw this I had to basically take our test farm down and rebuild it by an uninstall and reinstall of the OVM manager and associated database. The VM servers themselves seemed to be fine and were able to be rediscovered and added in afterwards. This was not a problem as the system was only a pilot/test at that stage. Obviously this is a critical issue for production, but the remediation step of aborting the hanging 'in-progress' job has worked fine since.
I suppose a fix would be to put appropriate timeouts and other checks around these jobs, so they don't run indefinitely and block admin functions from being issued while they're running.
Thanks for the input.
I've just about lost all faith that Oracle support can help me on this one. Now the OTHER VM Server looks 'hosed' too. By that I mean its status says "starting(error)", even though it's obviously okay, since I have running, usable VMs on it.
Can you describe the steps to recover a "farm" by re-installing the manager? I guess the first step is to install with the same UUID, right? But then, can I just discover my two servers, and everything comes back - the server pool, networking configs, VMs, and all that?
That may be my only hope.
I'm a bit skeptical about how much value formal support can really offer for something as bleeding-edge as the 3.1.x OVM branch. They won't have had time to build up anything resembling a robust knowledge base of use cases, defects, best-practice guides, fault workarounds, etc. Still, being able to log CSIs is helpful at least.
Before considering a VM manager rebuild have you acknowledged any and all events in the
Servers and VMs -> Vm Hostname X -> Perspective/Events pane?
I have found that an error state on an apparently well-running VM can be a sign of a past error state rather than a current error state, and you need to ACK the alert or warning state before it clears and goes back to green.
For my VM manager install I did a full wipe and reinstall of Oracle VM Manager and the 11g R2 database we used as the store. So it was effectively a clean install, with tables needing to be repopulated and so on. Once all patching was done and the interface was ready, I pointed it at the VM servers via discovery. I don't know much about UUIDs or how to use them for a VM manager install. As I recall, I repeated the networking config over again, so this may not suit you.
Oh, I see now. I didn't understand that I might have critical events that might be interfering with my efforts. Thanks for the info.
I did what you said, and acked all events on my servers. That turned the colors of the icons from red to green, so that's A Good Thing. My "bad server" still has a status of "starting", though, even after rebooting it. When I try to rediscover it, the manager refuses to do so, since it "is not running". I'm going to give up, and tell support to go away. I will reinstall the manager, and see what happens. If I lose everything, I won't mind all that much.
I have already re-installed OVMM a couple of times and it's a pretty straightforward process:
- mount the OVMM Installer iso
- run the installer and un-install Oracle VM Manager, but make a note of its UUID beforehand
- choose to uninstall everything (Weblogic, Database, Java...)
- run the installer again, but this time using -u <UUID> and it will perform a clean install
When OVMM is ready, simply add/discover one of the VM servers and OVMM will go ahead and discover the whole server pool. You only need to add one server that you know is in a good state.
And be patient - the discovery process takes some time, but it has always (!) discovered my cluster consisting of 6 VM Servers running 35+ VMs.
Thanks for your comments.
After waiting A WEEK for support to give me some intelligent response, I took matters into my own hands. It was relatively painless. I'll describe what happened, in case it's helpful to someone else:
Shut down the manager service, via "service ovmm stop".
Saved a copy of the config file, which contains the manager UUID. It's called /u01/app/oracle/ovm-manager-3/.config.
Removed everything under /u01/app/oracle, with "rm -fr".
Dropped the schema in the database I was using.
Ran the installer, specifying the "-u <uuid>" option.
It completed, and I logged in.
I rediscovered one of my two servers, the one that was still in the pool. I had removed the other one from the pool when my problems started, last week.
The rediscovery got back the server pool and VMs, and they were still running.
I rediscovered the other server, the one that I could not discover before I reinstalled the manager. That worked, too.
I added the server to the pool, and it's all working.
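In case it helps, here is a rough sketch of the "save the config file and grab the UUID" part of the steps above, done before anything gets removed. It assumes the .config file is a plain key=value file with a line starting with UUID= (check your own copy first), and both function names are made up for illustration.

```shell
# Hypothetical helper: print the value of the first UUID=... line in a
# key=value config file. Assumes that line format; prints nothing otherwise.
read_manager_uuid() {
    config_file="$1"
    sed -n 's/^UUID=//p' "$config_file" | head -n 1
}

# Hypothetical helper: keep a copy of the config file somewhere outside the
# directory tree that is about to be removed.
backup_manager_config() {
    config_file="$1"
    backup_dir="$2"
    mkdir -p "$backup_dir" && cp -p "$config_file" "$backup_dir/ovm-manager.config.bak"
}

# Example, using the path from the steps above (backup directory is arbitrary):
#   backup_manager_config /u01/app/oracle/ovm-manager-3/.config /root/ovmm-backup
#   uuid=$(read_manager_uuid /root/ovmm-backup/ovm-manager.config.bak)
#   echo "Manager UUID: $uuid"
```

The value printed at the end is what gets passed to the installer with the "-u <uuid>" option mentioned above.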
The only casualty I've seen so far, is that I've lost the names of all my VMs. Oops... even more is wrong. I just noticed.
I was about to say that the VMs have names like "ORPHAN_...". But there's more wrong than that:
If I look at the info for a VM, the 'networks' and 'disks' panels are completely blank.
And if I right-click a VM and select 'edit', nothing happens.
But I can open a console window to them, so they really ARE running.
Anyone know what's wrong, and how to fix this?
Oh, hell. NEVER MIND.
I restarted the manager service ("service ovmm restart"), and now all the VM information is there, and I can indeed 'edit'.
This thing is sure quirky, isn't it?
I think I'm all back up and running again.
Thanks for the information. After the OVM server crashed, I managed to bring the OVM server back up. But I too have a similar problem with my OVS servers. The status on OVM is showing as "Running(Error)". I cannot remove the OVS server from the pool, and neither can I re-discover the OVS servers. OVM 3.1.1 is really quite buggy. I have been struggling with it for days.
Since Oracle support doesn't have a published solution for this, my best bet is to re-install the OVM.....