I've done many live migrations over the last few months, with no problems. Suddenly, they don't work any more. Here's what I see:
I start the migration from the Manager.
The VM immediately disappears from the list of VMs on the source server, and appears on the destination server.
The job shows "in progress", and it NEVER completes.
The "% complete" for the job never says anything but ZERO.
If I look at the 'details' on the 'in progress' migration job, it says:
Job Construction Phase
Appended operation 'Bridge Configure Operation' to object '0004fb00002000005c945b4212271249 (network.BondPort (2) in oravm3.acbl.net)'.
Appended operation 'Virtual Machine Migrate' to object '0004fb000006000066c8e49bc5ab54b0 (jiplcm01)'.
Completed Step: COMMIT
Objects and Operations
Object (IN_USE): [Server] e2:a3:70:c6:67:89:e1:11:bb:8e:e4:1f:13:eb:92:b2 (oravm3.acbl.net)
Object (IN_USE): [BondPort] 0004fb00002000005c945b4212271249 (network.BondPort (2) in oravm3.acbl.net)
Operation: Bridge Configure Operation
Object (IN_USE): [Server] 92:0f:60:b4:84:91:e1:11:aa:cb:e4:1f:13:eb:d2:3a (oravm2.acbl.net)
Object (IN_USE): [VirtualMachine] 0004fb000006000066c8e49bc5ab54b0 (jiplcm01)
Operation: Virtual Machine Migrate
Job Running Phase at 13:10 on Wed, Jan 2, 2013
Job Participants: [92:0f:60:b4:84:91:e1:11:aa:cb:e4:1f:13:eb:d2:3a (oravm2.acbl.net)]
Starting operation 'Bridge Configure Operation' on object '0004fb00002000005c945b4212271249 (network.BondPort (2) in oravm3.acbl.net)'
Bridge [0004fb001018c4c] already exists (and should exist) on interface [bond1] on server [oravm3.acbl.net]; skipping bridge creation
Completed operation 'Bridge Configure Operation' completed with direction ==> DONE
Starting operation 'Virtual Machine Migrate' on object '0004fb000006000066c8e49bc5ab54b0 (jiplcm01)'
Job failed commit (internal) due to Caught during invoke method: java.net.SocketException: Socket closed
Wed Jan 02 13:11:36 EST 2013
com.oracle.odof.exception.InternalException: Caught during invoke method: java.net.SocketException: Socket closed
Wed Jan 02 13:11:36 EST 2013
at com.oracle.ovm.mgr.api.job.InternalJobProxy.objectCommitter(Unknown Source)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
... 7 more
Anyone have any idea what the problem is? What can I do to gather useful information?
Job failed commit (internal) due to Caught during invoke method: java.net.SocketException: Socket closed at com.oracle.ovm.mgr.api.job.InternalJobProxy.objectCommitter(Unknown Source)
It looks to me like either the target server does not have access to everything needed to complete the migration (access to the shared pool, access to the shared storage, etc.) or the target server is having an issue communicating with the VM Manager.
I wish such errors were more descriptive, but I believe the "Unknown Source" and "Socket closed" indicate such a problem.
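One quick way to test the communication theory is a plain TCP connection check, run from the Manager host against each server. This is only a rough sketch: the function name `can_connect` is made up for illustration, and the port you test is an assumption you need to confirm for your own setup (I believe the Oracle VM Agent normally listens on 8899, but check your install). A successful connect only shows the port is reachable, not that the agent is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only proves the port accepts
    connections, not that the service behind it is working properly.
    """
    try:
        # create_connection resolves the name and connects in one call.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, `can_connect("oravm2.acbl.net", 8899)` and `can_connect("oravm3.acbl.net", 8899)` would check both servers in this thread, assuming 8899 really is the agent port on your systems.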
I opened an SR (priority 2) on this, and actually got an intelligent response within a reasonable time.
The problem was: My Oracle VM servers had too many open files.
I don't know WHY they were running out of open files, but support had me change /etc/security/limits.conf to make root have 128K open files, and then restart the agent. Somehow that cleared up the problem.
I have now made that change on all my servers, and rebooted them.