1 Reply Latest reply: Jun 27, 2013 3:51 PM by bperoutka

    OVM Manager 3.1.1 can't discover one of the servers

    dosielczak

      Hi,

       

      I have a weird problem with one of our OVM3 servers. Because of "System is initializing ..." errors, I had to delete and recreate the OVM Manager DB. The operation went smoothly, but after a restart I can only discover one of the servers (soaovm2). Each of our servers is in a separate pool (a self-taught best practice if you plan to use RAC on the VMs). As mentioned, the second server was discovered just fine (with its pool and everything), but the first server refuses to be discovered:

       

      Job Construction Phase

      ----------------------

      begin()

      Appended operation 'Discover Manager Server Discover' to object 'OVM Foundry : Discover Manager'.

      commit()

      Completed Step: COMMIT

       

      Objects and Operations

      ----------------------

      Object (IN_USE): [DiscoverManager] OVM Foundry : Discover Manager

      Operation: Discover Manager Server Discover

       

      Job Running Phase at 15:48 on Tue, Jun 11, 2013

      ----------------------------------------------

      Job Participants: []

       

       

      Actioner

      --------

      Starting operation 'Discover Manager Server Discover' on object 'OVM Foundry : Discover Manager'

      Setting Context to model only in job with id=1370965683697

      Operation 'NTP Service Configure' in non-job running context, not adding it to object 'e4:11:5b:ac:b1:10:e4:11:5b:ac:b1:10:e4:11:5b:ac'.

      Operation 'NTP Service Configure' in non-job running context, not adding it to object 'e4:11:5b:ac:b1:10:e4:11:5b:ac:b1:10:e4:11:5b:ac'.

      Operation 'Server Set Statistic Interval' in non-job running context, not adding it to object 'e4:11:5b:ac:b1:10:e4:11:5b:ac:b1:10:e4:11:5b:ac'.

      Job Internal Error (Operation)com.oracle.ovm.mgr.api.exception.FailedOperationException: OVMAPI_4010E Attempt to send command: discover_hardware to server: e4:11:5b:ac:b1:10:e4:11:5b:ac:b1:10:e4:11:5b:ac failed. OVMAPI_4004E Server Failed Command: discover_hardware , Status: org.apache.xmlrpc.XmlRpcException: I/O error while communicating with HTTP server: The server 57.56.168.171 failed to respond

      Tue Jun 11 15:48:08 UTC 2013

      Tue Jun 11 15:48:08 UTC 2013

      at com.oracle.ovm.mgr.action.ActionEngine.sendCommandToServer(ActionEngine.java:507)

      at com.oracle.ovm.mgr.action.ActionEngine.sendUndispatchedServerCommand(ActionEngine.java:459)

      at com.oracle.ovm.mgr.action.ActionEngine.sendServerCommand(ActionEngine.java:385)

      at com.oracle.ovm.mgr.action.ActionEngine.sendDiscoverCommand(ActionEngine.java:308)

      at com.oracle.ovm.mgr.action.ServerAction.getHardwareInfo(ServerAction.java:104)

      at com.oracle.ovm.mgr.discover.ovm.ServerHardwareDiscoverHandler.query(ServerHardwareDiscoverHandler.java:206)

      at com.oracle.ovm.mgr.discover.ovm.ServerHardwareDiscoverHandler.query(ServerHardwareDiscoverHandler.java:42)

      at com.oracle.ovm.mgr.discover.ovm.DiscoverHandler.execute(DiscoverHandler.java:61)

      at com.oracle.ovm.mgr.discover.DiscoverEngine.handleDiscover(DiscoverEngine.java:461)

      at com.oracle.ovm.mgr.discover.DiscoverEngine.handleDiscover(DiscoverEngine.java:446)

      at com.oracle.ovm.mgr.discover.DiscoverEngine.handleDiscover(DiscoverEngine.java:430)

      at com.oracle.ovm.mgr.discover.DiscoverEngine.handleDefaultDiscover(DiscoverEngine.java:391)

      at com.oracle.ovm.mgr.discover.DiscoverEngine.discoverNewServer(DiscoverEngine.java:377)

      at com.oracle.ovm.mgr.discover.DiscoverEngine.discoverServer(DiscoverEngine.java:280)

      at com.oracle.ovm.mgr.op.manager.DiscoverManagerServerDiscover.action(DiscoverManagerServerDiscover.java:48)

      at com.oracle.ovm.mgr.api.collectable.ManagedObjectDbImpl.executeCurrentJobOperationAction(ManagedObjectDbImpl.java:1012)

      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:329)

      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:289)

      at com.oracle.odof.core.storage.Transaction.invokeMethod(Transaction.java:826)

      at com.oracle.odof.core.Exchange.invokeMethod(Exchange.java:245)

      at com.oracle.ovm.mgr.api.manager.DiscoverManagerProxy.executeCurrentJobOperationAction(Unknown Source)

      at com.oracle.ovm.mgr.api.job.JobEngine.operationActioner(JobEngine.java:218)

      at com.oracle.ovm.mgr.api.job.JobEngine.objectActioner(JobEngine.java:309)

      at com.oracle.ovm.mgr.api.job.InternalJobDbImpl.objectCommitter(InternalJobDbImpl.java:1140)

      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:329)

      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:289)

      at com.oracle.odof.core.BasicWork.invokeMethod(BasicWork.java:136)

      at com.oracle.odof.command.InvokeMethodCommand.process(InvokeMethodCommand.java:105)

      at com.oracle.odof.core.BasicWork.processCommand(BasicWork.java:81)

      at com.oracle.odof.core.TransactionManager.processCommand(TransactionManager.java:773)

      at com.oracle.odof.core.WorkflowManager.processCommand(WorkflowManager.java:455)

      at com.oracle.odof.core.WorkflowManager.processWork(WorkflowManager.java:513)

      at com.oracle.odof.io.AbstractClient.run(AbstractClient.java:42)

      at java.lang.Thread.run(Thread.java:662)

      Caused by: com.oracle.ovm.mgr.api.exception.IllegalOperationException: OVMAPI_4004E Server Failed Command: discover_hardware , Status: org.apache.xmlrpc.XmlRpcException: I/O error while communicating with HTTP server: The server 57.56.168.171 failed to respond

      Tue Jun 11 15:48:08 UTC 2013

      at com.oracle.ovm.mgr.action.ActionEngine.sendAction(ActionEngine.java:798)

      at com.oracle.ovm.mgr.action.ActionEngine.sendCommandToServer(ActionEngine.java:503)

      ... 41 more

       

       

      ...

       

      ----------

      End of Job

      ----------


      from Agent log:


      [2013-06-11 15:40:24 5177] DEBUG (OVSCommons:124) get_api_version: ()

      [2013-06-11 15:40:24 5177] DEBUG (OVSCommons:132) get_api_version: call completed.

      [2013-06-11 15:40:24 5178] DEBUG (OVSCommons:124) discover_server: ()

      [2013-06-11 15:40:25 5178] DEBUG (OVSCommons:132) discover_server: call completed.

      [2013-06-11 15:40:28 5335] DEBUG (OVSCommons:124) discover_hardware: ()

      [2013-06-11 15:40:28 5336] DEBUG (OVSCommons:124) discover_hardware: ()

      [2013-06-11 15:40:28 5337] DEBUG (OVSCommons:124) discover_hardware: ()

      [2013-06-11 15:40:28 5338] DEBUG (OVSCommons:124) discover_hardware: ()
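In the excerpt above, get_api_version and discover_server each have a matching "call completed." line, while the four discover_hardware calls do not. A minimal sketch for listing such unfinished calls in a longer agent log follows. It assumes the number before the closing bracket identifies the worker handling the call and that every finished call logs "call completed." under the same number; pending_calls is a hypothetical helper name, not part of ovs-agent.

```shell
#!/bin/bash
# Sketch: list agent calls that were logged as started but never completed.
# Assumption: field 3 ("5335]") identifies the worker, field 6 is the call name.
pending_calls() {
  awk '
    $4 == "DEBUG" && $5 ~ /^\(OVSCommons:/ {
      key = $3 " " $6
      if ($0 ~ /call completed\.$/) {
        delete started[key]            # call finished, forget it
      } else {
        started[key] = $1 " " $2       # remember when the call started
      }
    }
    END { for (k in started) print started[k], k }
  ' "$@" | sort
}
```

Called as `pending_calls /path/to/agent.log` (the path is a placeholder), it prints one line per call that has no completion entry.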



      I have tried every trick I know of (deleting discover_hardware.lock, deleting the agent DB) and made sure communication works both ways (I can connect to soaovm1:8899 from the manager and to ovmmgr:7001 from the server; passwords double-checked), but nothing helps: I still get the same error. My suspicion is that this has something to do with the one-node cluster pool the server is (was?) in. The cluster seems to be working OK, but I can't stop it:


      [root@lx-cgnclh-soaovm1 ~]# /etc/init.d/o2cb status

      Driver for "configfs": Loaded

      Filesystem "configfs": Mounted

      Stack glue driver: Loaded

      Stack plugin "o2cb": Loaded

      Driver for "ocfs2_dlmfs": Loaded

      Filesystem "ocfs2_dlmfs": Mounted

      Checking O2CB cluster "6d707b8abd12b5af": Online

        Heartbeat dead threshold: 31

        Network idle timeout: 60000

        Network keepalive delay: 2000

        Network reconnect delay: 2000

        Heartbeat mode: Global

      Checking O2CB heartbeat: Active

        0004FB0000050000DBE4AF6EF63C7A3A /dev/dm-11

      Nodes in O2CB cluster: 0

      [root@lx-cgnclh-soaovm1 ~]# /etc/init.d/o2cb stop

      Clean userdlm domains: OK

      Stopping global heartbeat on cluster "6d707b8abd12b5af": Failed

      o2cb: Heartbeat region in use while stopping heartbeat on region '0004FB0000050000DBE4AF6EF63C7A3A'

       

      I would appreciate any ideas here (preferably ones which wouldn't require rebooting guest VMs ...) as I have no clue what to try next.
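For reference, the two-way connectivity check described in this post can be sketched as below. This is only an illustration: check_port is a hypothetical helper name, and it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command being available on the machine where it runs.

```shell
#!/bin/bash
# Sketch: report whether a TCP port accepts connections.
# check_port is a hypothetical helper, not an Oracle VM tool.
check_port() {
  local host=$1 port=$2 wait_s=${3:-5}
  # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connection;
  # timeout bounds how long we wait for it.
  if timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "open: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}
```

On the manager this would be `check_port soaovm1 8899`, and on the server `check_port ovmmgr 7001`. A successful connect only shows that the port is reachable; it does not show that the agent answers XML-RPC calls such as discover_hardware.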

        • 1. Re: OVM Manager 3.1.1 can't discover one of the servers
          bperoutka

          I ran into this issue. The fix for me was to restart the OVM Manager (on the manager host I ran "/etc/init.d/ovmm restart") and then restart the ovs-agent on the host ("/etc/init.d/ovs-agent restart").
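The order of the two restarts can be written down as a small sketch. The two init script paths are the ones quoted in this reply; the run wrapper and the DRY_RUN variable are assumptions added here so that nothing is restarted unless DRY_RUN=0 is set explicitly.

```shell
#!/bin/bash
# Sketch of the restart order. DRY_RUN defaults to 1, so by default
# the commands are only printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Step 1, on the OVM Manager host.
restart_manager() { run /etc/init.d/ovmm restart; }

# Step 2, on the Oracle VM Server that fails discovery.
restart_agent() { run /etc/init.d/ovs-agent restart; }
```

The two functions belong on different machines: run `DRY_RUN=0 restart_manager` on the manager first, then `DRY_RUN=0 restart_agent` on the server, and retry the discovery afterwards.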

           

          If you need to, you can reset the ovs-agent password from the host by running "ovs-agent-passwd oracle <password>".

           

          Hope this helps.

          ~Bryan