1 Reply Latest reply: Jun 27, 2013 1:51 PM by bperoutka

OVM Manager 3.1.1 can't discover one of the servers

dosielczak Newbie

Hi,

 

I have a weird problem with one of our OVM3 servers. Because of "System is initializing ..." errors I had to delete and recreate the OVM Manager DB. The operation went smoothly, but after the restart I can only discover one of the servers (soaovm2). We keep each server in a separate pool (a self-taught best practice if you plan to run RAC on the VMs). So, as mentioned, the second server was discovered just fine (with its pool and everything), but the first server (soaovm1) refuses to be discovered:

 

Job Construction Phase

----------------------

begin()

Appended operation 'Discover Manager Server Discover' to object 'OVM Foundry : Discover Manager'.

commit()

Completed Step: COMMIT

 

Objects and Operations

----------------------

Object (IN_USE): [DiscoverManager] OVM Foundry : Discover Manager

Operation: Discover Manager Server Discover

 

Job Running Phase at 15:48 on Tue, Jun 11, 2013

----------------------------------------------

Job Participants: []

 

 

Actioner

--------

Starting operation 'Discover Manager Server Discover' on object 'OVM Foundry : Discover Manager'

Setting Context to model only in job with id=1370965683697

Operation 'NTP Service Configure' in non-job running context, not adding it to object 'e4:11:5b:ac:b1:10:e4:11:5b:ac:b1:10:e4:11:5b:ac'.

Operation 'NTP Service Configure' in non-job running context, not adding it to object 'e4:11:5b:ac:b1:10:e4:11:5b:ac:b1:10:e4:11:5b:ac'.

Operation 'Server Set Statistic Interval' in non-job running context, not adding it to object 'e4:11:5b:ac:b1:10:e4:11:5b:ac:b1:10:e4:11:5b:ac'.

Job Internal Error (Operation)com.oracle.ovm.mgr.api.exception.FailedOperationException: OVMAPI_4010E Attempt to send command: discover_hardware to server: e4:11:5b:ac:b1:10:e4:11:5b:ac:b1:10:e4:11:5b:ac failed. OVMAPI_4004E Server Failed Command: discover_hardware , Status: org.apache.xmlrpc.XmlRpcException: I/O error while communicating with HTTP server: The server 57.56.168.171 failed to respond

Tue Jun 11 15:48:08 UTC 2013

Tue Jun 11 15:48:08 UTC 2013

at com.oracle.ovm.mgr.action.ActionEngine.sendCommandToServer(ActionEngine.java:507)

at com.oracle.ovm.mgr.action.ActionEngine.sendUndispatchedServerCommand(ActionEngine.java:459)

at com.oracle.ovm.mgr.action.ActionEngine.sendServerCommand(ActionEngine.java:385)

at com.oracle.ovm.mgr.action.ActionEngine.sendDiscoverCommand(ActionEngine.java:308)

at com.oracle.ovm.mgr.action.ServerAction.getHardwareInfo(ServerAction.java:104)

at com.oracle.ovm.mgr.discover.ovm.ServerHardwareDiscoverHandler.query(ServerHardwareDiscoverHandler.java:206)

at com.oracle.ovm.mgr.discover.ovm.ServerHardwareDiscoverHandler.query(ServerHardwareDiscoverHandler.java:42)

at com.oracle.ovm.mgr.discover.ovm.DiscoverHandler.execute(DiscoverHandler.java:61)

at com.oracle.ovm.mgr.discover.DiscoverEngine.handleDiscover(DiscoverEngine.java:461)

at com.oracle.ovm.mgr.discover.DiscoverEngine.handleDiscover(DiscoverEngine.java:446)

at com.oracle.ovm.mgr.discover.DiscoverEngine.handleDiscover(DiscoverEngine.java:430)

at com.oracle.ovm.mgr.discover.DiscoverEngine.handleDefaultDiscover(DiscoverEngine.java:391)

at com.oracle.ovm.mgr.discover.DiscoverEngine.discoverNewServer(DiscoverEngine.java:377)

at com.oracle.ovm.mgr.discover.DiscoverEngine.discoverServer(DiscoverEngine.java:280)

at com.oracle.ovm.mgr.op.manager.DiscoverManagerServerDiscover.action(DiscoverManagerServerDiscover.java:48)

at com.oracle.ovm.mgr.api.collectable.ManagedObjectDbImpl.executeCurrentJobOperationAction(ManagedObjectDbImpl.java:1012)

at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:329)

at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:289)

at com.oracle.odof.core.storage.Transaction.invokeMethod(Transaction.java:826)

at com.oracle.odof.core.Exchange.invokeMethod(Exchange.java:245)

at com.oracle.ovm.mgr.api.manager.DiscoverManagerProxy.executeCurrentJobOperationAction(Unknown Source)

at com.oracle.ovm.mgr.api.job.JobEngine.operationActioner(JobEngine.java:218)

at com.oracle.ovm.mgr.api.job.JobEngine.objectActioner(JobEngine.java:309)

at com.oracle.ovm.mgr.api.job.InternalJobDbImpl.objectCommitter(InternalJobDbImpl.java:1140)

at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:329)

at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:289)

at com.oracle.odof.core.BasicWork.invokeMethod(BasicWork.java:136)

at com.oracle.odof.command.InvokeMethodCommand.process(InvokeMethodCommand.java:105)

at com.oracle.odof.core.BasicWork.processCommand(BasicWork.java:81)

at com.oracle.odof.core.TransactionManager.processCommand(TransactionManager.java:773)

at com.oracle.odof.core.WorkflowManager.processCommand(WorkflowManager.java:455)

at com.oracle.odof.core.WorkflowManager.processWork(WorkflowManager.java:513)

at com.oracle.odof.io.AbstractClient.run(AbstractClient.java:42)

at java.lang.Thread.run(Thread.java:662)

Caused by: com.oracle.ovm.mgr.api.exception.IllegalOperationException: OVMAPI_4004E Server Failed Command: discover_hardware , Status: org.apache.xmlrpc.XmlRpcException: I/O error while communicating with HTTP server: The server 57.56.168.171 failed to respond

Tue Jun 11 15:48:08 UTC 2013

at com.oracle.ovm.mgr.action.ActionEngine.sendAction(ActionEngine.java:798)

at com.oracle.ovm.mgr.action.ActionEngine.sendCommandToServer(ActionEngine.java:503)

... 41 more

 

 

...

 

----------

End of Job

----------


From the agent log on the server:


[2013-06-11 15:40:24 5177] DEBUG (OVSCommons:124) get_api_version: ()

[2013-06-11 15:40:24 5177] DEBUG (OVSCommons:132) get_api_version: call completed.

[2013-06-11 15:40:24 5178] DEBUG (OVSCommons:124) discover_server: ()

[2013-06-11 15:40:25 5178] DEBUG (OVSCommons:132) discover_server: call completed.

[2013-06-11 15:40:28 5335] DEBUG (OVSCommons:124) discover_hardware: ()

[2013-06-11 15:40:28 5336] DEBUG (OVSCommons:124) discover_hardware: ()

[2013-06-11 15:40:28 5337] DEBUG (OVSCommons:124) discover_hardware: ()

[2013-06-11 15:40:28 5338] DEBUG (OVSCommons:124) discover_hardware: ()
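
What stands out to me in this excerpt is that get_api_version and discover_server each log a "call completed." line, while none of the four discover_hardware calls does. To check a longer agent log for the same pattern, a minimal Python sketch like the one below might help. It assumes the line format shown above and assumes the number after the timestamp is a process or thread ID (I have not confirmed that); the function name `unfinished_calls` is just something I made up for illustration.

```python
import re

# Matches lines such as:
# [2013-06-11 15:40:28 5335] DEBUG (OVSCommons:124) discover_hardware: ()
# Assumption: the number after the timestamp identifies the worker handling the call.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<pid>\d+)\] "
    r"DEBUG \(OVSCommons:\d+\) (?P<cmd>\w+): (?P<rest>.*)$"
)


def unfinished_calls(lines):
    """Return sorted (id, command) pairs that were started but never logged 'call completed'."""
    pending = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank lines and anything in a different format
        key = (m.group("pid"), m.group("cmd"))
        if m.group("rest").startswith("call completed"):
            pending.pop(key, None)  # the call finished, forget it
        else:
            pending[key] = m.group("ts")  # remember when the call started
    return sorted(pending)
```

Fed the eight lines above, this would report only the four discover_hardware entries (5335 to 5338) as unfinished, which fits the manager-side "failed to respond" error but does not by itself say why the agent never returns.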



I have tried every trick I know of (deleting discover_hardware.lock, deleting the agent DB) and made sure communication works both ways (I can connect to soaovm1:8899 from the manager and to ovmmgr:7001 from the server; passwords double-checked), but nothing helps and I still get the same error. My suspicion is that this has something to do with the one-node cluster pool the server is (was?) in. The cluster seems to be working OK, but I can't stop it:


[root@lx-cgnclh-soaovm1 ~]# /etc/init.d/o2cb status

Driver for "configfs": Loaded

Filesystem "configfs": Mounted

Stack glue driver: Loaded

Stack plugin "o2cb": Loaded

Driver for "ocfs2_dlmfs": Loaded

Filesystem "ocfs2_dlmfs": Mounted

Checking O2CB cluster "6d707b8abd12b5af": Online

  Heartbeat dead threshold: 31

  Network idle timeout: 60000

  Network keepalive delay: 2000

  Network reconnect delay: 2000

  Heartbeat mode: Global

Checking O2CB heartbeat: Active

  0004FB0000050000DBE4AF6EF63C7A3A /dev/dm-11

Nodes in O2CB cluster: 0

[root@lx-cgnclh-soaovm1 ~]# /etc/init.d/o2cb stop

Clean userdlm domains: OK

Stopping global heartbeat on cluster "6d707b8abd12b5af": Failed

o2cb: Heartbeat region in use while stopping heartbeat on region '0004FB0000050000DBE4AF6EF63C7A3A'
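
For anyone who wants to repeat the two-way connectivity check mentioned earlier (soaovm1:8899 from the manager, ovmmgr:7001 from the server), here is a minimal Python sketch of a plain TCP reachability test. The helper name `port_open` and the 5-second timeout are my own choices, not anything OVM provides. Note that it only shows that the port accepts connections; it does not prove that the agent actually answers an XML-RPC call such as discover_hardware.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and name resolution failures
        return False


# Example usage (host names taken from this post):
#   on the manager: port_open("soaovm1", 8899)
#   on the server:  port_open("ovmmgr", 7001)
```

In my case both directions connect, so the problem does not look like a firewall or routing issue.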

 

I would appreciate any ideas here (preferably ones that don't require rebooting the guest VMs ...), as I have no clue what to try next.
