13 Replies Latest reply: Feb 4, 2013 2:06 AM by budachst RSS

    power outage - lost my local storage and ocfs2 mount errors

    user5373812
      I recently installed OVM and had everything working nicely for about 2 days. My OVM Server shut down unexpectedly and I seemed to have lost access to my repository on the OVM.
      I am using local storage on the OVM Server. When I go to /OVS/Repositories and ls, there is nothing there.

      On the OVM server, when I boot, I get a bunch of errors about ofcf2 which I now understand is the oracle file system that was probably installed on the local storage when I originally set everything up.

      Any help would be GREATLY appreciated!!!!

      When I reboot the OVM Server, I get the following errors:
      mount.ocfs2,4344,0):ocfs2_check_volume:2468 File system was not unmounted cleanly, recovering volume.
      Info fld=0x240b3
      end_request: critical target error, dev sdd, sector 148632
      end_request: critical target error, dev dm-2, sector 147632
      Info fld=0x2a399
      end_request: critical target error, dev sdd, sector 175768
      end_request: critical target error, dev dm-2, sec 175768
      JBD: Failed to read block at offset 20435
      JBD: recovery failed
      (mount.ocfs2,4515,2):ocfs2_journal_load:1004 ERROR: Failed to load journal!
      (mount.ocfs2,4515,2):ocfs2_check_volume:2476 ERROR: ocfs2 journal load failed! -5
      (mount.ocfs2,4515,2):ocfs2_check_volume:2525 ERROR: status = -5
      (mount.ocfs2,4515,2):ocfs2_check_volume:1910 ERROR: status = -5
      (mount.ocfs2,4515,2):ocfs2_journal_shutdown:942 ERROR: status = -5
      (mount.ocfs2,4515,2):ocfs2_fill_super:1240 ERROR: status: -5

      OVM Server running on PowerEdge 2900

      When I try to start a VM, I get the following error:

      Job Construction Phase
      ----------------------
      begin()
      Appended operation 'Virtual Machine Start' to object '0004fb0000060000205a8c08f511c265 (DB)'.
      Appended operation 'Bridge Configure Operation' to object '0004fb00002000001a58d1d8e5d47fa2 (network.BondPort (1) in ovs-36-01-3f)'.
      commit()
      Completed Step: COMMIT

      Objects and Operations
      ----------------------
      Object (IN_USE): [VirtualDisk] 0004fb00001200002f0d60a7afa9a9c4.img (System.img (12))
      Object (IN_USE): [VmDiskMapping] 0004fb0000130000547e921f59135cba
      Object (IN_USE): [VmDiskMapping] 0004fb00001300001e5ab7465015e717
      Object (IN_USE): [VirtualMachine] 0004fb0000060000205a8c08f511c265 (DB)
      Operation: Virtual Machine Start
      Object (IN_USE): [BondPort] 0004fb00002000001a58d1d8e5d47fa2 (network.BondPort (1) in ovs-36-01-3f)
      Operation: Bridge Configure Operation
      Object (IN_USE): [Vnic] 0004fb0000070000711bbf916a1066d8 (00:21:f6:00:00:00)
      Object (IN_USE): [VirtualDisk] 0004fb00001200002cad6110e182ef76.img (FSCMDB.img (4))
      Object (IN_USE): [VmDiskMapping] 0004fb0000130000607a9d0741a8f768
      Object (IN_USE): [VirtualDisk] 0004fb000012000005bee2b246851db1.img (oracle11g_x86_64_asm.img (4))

      Job Running Phase at 21:11 on Thu, Dec 6, 2012
      ----------------------------------------------
      Job Participants: [44:45:4c:4c:43:00:10:38:80:44:c7:c0:4f:59:42:31 (ovs-36-01-3f)]


      Actioner
      --------
      Starting operation 'Virtual Machine Start' on object '0004fb0000060000205a8c08f511c265 (DB)'
      Completed operation 'Virtual Machine Start' completed with direction ==> LATER
      Starting operation 'Bridge Configure Operation' on object '0004fb00002000001a58d1d8e5d47fa2 (network.BondPort (1) in ovs-36-01-3f)'
      Bridge [10.0.1.0] already exists (and should exist) on interface [bond0] on server [ovs-36-01-3f]; skipping bridge creation
      Completed operation 'Bridge Configure Operation' completed with direction ==> DONE
      Starting operation 'Virtual Machine Start' on object '0004fb0000060000205a8c08f511c265 (DB)'
      Job Internal Error (Operation)com.oracle.ovm.mgr.api.exception.FailedOperationException: OVMAPI_4010E Attempt to send command: start_vm to server: ovs-36-01-3f failed. OVMAPI_4004E Server Failed Command: start_vm 0004fb00000300007e64717b049deaef 0004fb0000060000205a8c08f511c265, Status: org.apache.xmlrpc.XmlRpcException: exceptions.RuntimeError:Command: ['xm', 'create', '/OVS/Repositories/0004fb00000300007e64717b049deaef/VirtualMachines/0004fb0000060000205a8c08f511c265/vm.cfg'] failed (1): stderr: Error: Unable to open config file: /OVS/Repositories/0004fb00000300007e64717b049deaef/VirtualMachines/0004fb0000060000205a8c08f511c265/vm.cfg
      stdout: Usage: xm create [options] [vars]

      Create a domain based on .
      ...

      Create a domain.

      Domain creation parameters can be set by command-line switches, from
      a python configuration script or an SXP config file. See documentation
      for defconfig, config. Configuration variables can be set using
      VAR=VAL on the command line. For example vmid=3 sets vmid to 3.


      ...
      Thu Dec 06 21:11:03 EST 2012...
      Thu Dec 06 21:11:03 EST 2012
      at com.oracle.ovm.mgr.action.ActionEngine.sendCommandToServer(ActionEngine.java:507)
      at com.oracle.ovm.mgr.action.ActionEngine.sendUndispatchedServerCommand(ActionEngine.java:459)
      at com.oracle.ovm.mgr.action.ActionEngine.sendServerCommand(ActionEngine.java:385)
      at com.oracle.ovm.mgr.action.VirtualMachineAction.sendStartVmCommand(VirtualMachineAction.java:121)
      at com.oracle.ovm.mgr.op.virtual.VirtualMachineStart.asyncAction(VirtualMachineStart.java:65)
      at com.oracle.ovm.mgr.api.collectable.ManagedObjectDbImpl.executeCurrentJobOperationAsyncAction(ManagedObjectDbImpl.java:999)
      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:330)
      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:290)
      at com.oracle.odof.core.storage.Transaction.invokeMethod(Transaction.java:822)
      at com.oracle.odof.core.Exchange.invokeMethod(Exchange.java:245)
      at com.oracle.ovm.mgr.api.virtual.VirtualMachineProxy.executeCurrentJobOperationAsyncAction(Unknown Source)
      at com.oracle.ovm.mgr.api.job.JobEngine.objectAsyncStart(JobEngine.java:596)
      at com.oracle.ovm.mgr.op.virtual.VirtualMachineStart.action(VirtualMachineStart.java:52)
      at com.oracle.ovm.mgr.api.collectable.ManagedObjectDbImpl.executeCurrentJobOperationAction(ManagedObjectDbImpl.java:1009)
      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:330)
      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:290)
      at com.oracle.odof.core.storage.Transaction.invokeMethod(Transaction.java:822)
      at com.oracle.odof.core.Exchange.invokeMethod(Exchange.java:245)
      at com.oracle.ovm.mgr.api.virtual.VirtualMachineProxy.executeCurrentJobOperationAction(Unknown Source)
      at com.oracle.ovm.mgr.api.job.JobEngine.operationActioner(JobEngine.java:218)
      at com.oracle.ovm.mgr.api.job.JobEngine.objectActioner(JobEngine.java:309)
      at com.oracle.ovm.mgr.api.job.InternalJobDbImpl.objectCommitter(InternalJobDbImpl.java:1140)
      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:330)
      at com.oracle.odof.core.AbstractVessel.invokeMethod(AbstractVessel.java:290)
      at com.oracle.odof.core.BasicWork.invokeMethod(BasicWork.java:136)
      at com.oracle.odof.command.InvokeMethodCommand.process(InvokeMethodCommand.java:100)
      at com.oracle.odof.core.BasicWork.processCommand(BasicWork.java:81)
      at com.oracle.odof.core.TransactionManager.processCommand(TransactionManager.java:773)
      at com.oracle.odof.core.WorkflowManager.processCommand(WorkflowManager.java:401)
      at com.oracle.odof.core.WorkflowManager.processWork(WorkflowManager.java:459)
      at com.oracle.odof.io.AbstractClient.run(AbstractClient.java:42)
      at java.lang.Thread.run(Thread.java:662)
      Caused by: com.oracle.ovm.mgr.api.exception.IllegalOperationException: OVMAPI_4004E Server Failed Command: start_vm 0004fb00000300007e64717b049deaef 0004fb0000060000205a8c08f511c265, Status: org.apache.xmlrpc.XmlRpcException: exceptions.RuntimeError:Command: ['xm', 'create', '/OVS/Repositories/0004fb00000300007e64717b049deaef/VirtualMachines/0004fb0000060000205a8c08f511c265/vm.cfg'] failed (1): stderr: Error: Unable to open config file: /OVS/Repositories/0004fb00000300007e64717b049deaef/VirtualMachines/0004fb0000060000205a8c08f511c265/vm.cfg
      stdout: Usage: xm create [options] [vars]

      Create a domain based on .

      Options:

      ...

      ...
      Thu Dec 06 21:11:03 EST 2012
      at com.oracle.ovm.mgr.action.ActionEngine.sendAction(ActionEngine.java:798)
      at com.oracle.ovm.mgr.action.ActionEngine.sendCommandToServer(ActionEngine.java:503)
      ... 40 more


      FailedOperationCleanup
      ----------
      Starting failed operation 'Virtual Machine Start' cleanup on object 'DB'
      Complete rollback operation 'Virtual Machine Start' completed with direction=DB

      Rollbacker
      ----------
      Executing rollback operation 'Virtual Machine Start' on object '0004fb0000060000205a8c08f511c265 (DB)'
      Complete rollback operation 'Virtual Machine Start' completed with direction=DONE
      Executing rollback operation 'Bridge Configure Operation' on object '0004fb00002000001a58d1d8e5d47fa2 (network.BondPort (1) in ovs-36-01-3f)'
      Complete rollback operation 'Bridge Configure Operation' completed with direction=DONE

      Objects To Be Rolled Back
      -------------------------
      Object (IN_USE): [VirtualDisk] 0004fb00001200002f0d60a7afa9a9c4.img (System.img (12))
      Object (IN_USE): [VmDiskMapping] 0004fb0000130000547e921f59135cba
      Object (IN_USE): [VmDiskMapping] 0004fb00001300001e5ab7465015e717
      Object (IN_USE): [VirtualMachine] 0004fb0000060000205a8c08f511c265 (DB)
      Object (IN_USE): [BondPort] 0004fb00002000001a58d1d8e5d47fa2 (network.BondPort (1) in ovs-36-01-3f)
      Object (IN_USE): [Vnic] 0004fb0000070000711bbf916a1066d8 (00:21:f6:00:00:00)
      Object (IN_USE): [VirtualDisk] 0004fb00001200002cad6110e182ef76.img (FSCMDB.img (4))
      Object (IN_USE): [VmDiskMapping] 0004fb0000130000607a9d0741a8f768
      Object (IN_USE): [VirtualDisk] 0004fb000012000005bee2b246851db1.img (oracle11g_x86_64_asm.img (4))


      Write Methods Invoked
      -------------------
      Class=InternalJobDbImpl vessel_id=9797 method=addTransactionIdentifier accessLevel=6
      Class=VirtualMachineDbImpl vessel_id=5862 method=start accessLevel=6
      Class=VnicDbImpl vessel_id=1644 method=lock accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5874 method=lock accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5884 method=lock accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5896 method=lock accessLevel=6
      Class=VnicDbImpl vessel_id=1644 method=lock accessLevel=6
      Class=VmDiskMappingDbImpl vessel_id=5879 method=lock accessLevel=6
      Class=VmDiskMappingDbImpl vessel_id=5891 method=lock accessLevel=6
      Class=VmDiskMappingDbImpl vessel_id=5901 method=lock accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5874 method=lock accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5884 method=lock accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5896 method=lock accessLevel=6
      Class=VirtualMachineStartingEventDbImpl vessel_id=9806 method=notify accessLevel=6
      Class=VirtualMachineStartingEventDbImpl vessel_id=9806 method=setDescription accessLevel=6
      Class=BondPortDbImpl vessel_id=802 method=addJobOperation accessLevel=6
      Class=InternalJobDbImpl vessel_id=9797 method=setCompletedStep accessLevel=6
      Class=InternalJobDbImpl vessel_id=9797 method=setAssociatedHandles accessLevel=6
      Class=InternalJobDbImpl vessel_id=9797 method=setTuringMachineFlag accessLevel=6
      Class=VirtualMachineDbImpl vessel_id=5862 method=setCurrentOperationToLater accessLevel=6
      Class=InternalJobDbImpl vessel_id=9797 method=setTuringMachineFlag accessLevel=6
      Class=BondPortDbImpl vessel_id=802 method=setCurrentJobOperationComplete accessLevel=6
      Class=BondPortDbImpl vessel_id=802 method=nextJobOperation accessLevel=6
      Class=VirtualMachineDbImpl vessel_id=5862 method=nextJobOperation accessLevel=6
      Class=InternalJobDbImpl vessel_id=9797 method=setFailedOperation accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5874 method=nextJobOperation accessLevel=6
      Class=VmDiskMappingDbImpl vessel_id=5901 method=nextJobOperation accessLevel=6
      Class=VmDiskMappingDbImpl vessel_id=5879 method=nextJobOperation accessLevel=6
      Class=VirtualMachineDbImpl vessel_id=5862 method=nextJobOperation accessLevel=6
      Class=BondPortDbImpl vessel_id=802 method=nextJobOperation accessLevel=6
      Class=VnicDbImpl vessel_id=1644 method=nextJobOperation accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5896 method=nextJobOperation accessLevel=6
      Class=VmDiskMappingDbImpl vessel_id=5891 method=nextJobOperation accessLevel=6
      Class=VirtualDiskDbImpl vessel_id=5884 method=nextJobOperation accessLevel=6
      Class=VirtualMachineDbImpl vessel_id=5862 method=nextJobOperation accessLevel=6
      Class=BondPortDbImpl vessel_id=802 method=nextJobOperation accessLevel=6
      Completed Step: ROLLBACK
      Job failed commit (internal) due to OVMAPI_4010E Attempt to send command: start_vm to server: ovs-36-01-3f failed. OVMAPI_4004E Server Failed Command: start_vm 0004fb00000300007e64717b049deaef 0004fb0000060000205a8c08f511c265, Status: org.apache.xmlrpc.XmlRpcException: exceptions.RuntimeError:Command: ['xm', 'create', '/OVS/Repositories/0004fb00000300007e64717b049deaef/VirtualMachines/0004fb0000060000205a8c08f511c265/vm.cfg'] failed (1): stderr: Error: Unable to open config file: /OVS/Repositories/0004fb00000300007e64717b049deaef/VirtualMachines/0004fb0000060000205a8c08f511c265/vm.cfg
      stdout: Usage: xm create [options] [vars]

      Create a domain based on .

      Options:

      Edited by: user5373812 on Dec 7, 2012 9:26 AM
        • 2. Re: power outage - lost my local storage and ocfs2 mount errors
          user5373812
          fsck.ocfs didn't work, but I tried fsck.ocfs2 on /dev/sda3 and I get bad magic number in superblock while opening /dev/sda3

          Edited by: user5373812 on Dec 7, 2012 2:22 PM
          • 3. Re: power outage - lost my local storage and ocfs2 mount errors
            user5373812
            Ok...I was wrong on /dev/sd3
            Looks like ocfs was on /dev/sdd

            I ran fsck.ocfs2 on /dev/sdd and /dev/dm-2and did something, but I'm still having trouble - still giving me errors
            JBD:Ignoring recovery information on journal
            end_request: critical target error, dev sdd, sector 147362
            end_request: critical target error, dev dm-2, sector 147362
            JBD: Failed to read block at offset 20345

            then some ocfs errors:
            mount.ocfs2,4596,3 -
            ocfs2_journal_wipe:1045
            ocfs2_check_volume:2463
            ocfs2_mount_volume:1910
            ocfs2_journal_shutdown:942
            ocfs2_fill_super:1240

            If I run blkid, I get the following:
            /dev/sda1
            /dev/sda2
            /dev/sda3
            /dev/sr0
            /dev/mapper/SATA_ST31000340AS_9QJ1MSOP
            /dev/sdd

            sda1 and sda2 are ext3
            sdd and the dev mapper are ocfs2
            sr0 is an iso

            if I do df, I only see sda2, sda1 an then tempfs and none
            • 4. Re: power outage - lost my local storage and ocfs2 mount errors
              budachst
              df will only report on mounted volumes and since your ocfs2 volume isn't mounted, it won't be reported by df.
              Since blkid doesn't report a valid OCFS2 fs on /dev/sdd, there's actually no chance that, that fsck.ocfs could work anyway.
              If I run blkid on my VM server it spits out all existing file systems - can you provide the exact output of blkid?
              • 5. Re: power outage - lost my local storage and ocfs2 mount errors
                user5373812
                Thanks for the response, but when the oracle forums went down, I let the frustration get the best of me and reinstalled everything from scratch. Was able to get back to where I was previously pretty quickly.
                • 6. Re: power outage - lost my local storage and ocfs2 mount errors
                  user5373812
                  Ok - another extended power outage and same issue. This time sI have some work into the various environments, so its not as easy for me to just start over.
                  Here's what I have figured out. The device is /dev/sdd. This is a separate 1TB drive on the physical server. When the power outage occurs, the drive seems to get unresponsive when it tries to remount. So, successive reboots, the device is not recognized after reboots.
                  I seem to need to do a complete power down (unplug server) and then the NEXT reboot, I can see the drive.

                  Problem is, its not mounting and I don't understand where in OCFS2 I need to configure it.
                  Here's my configuration:
                  3 disks presented as repositories. 2 are working fine, the 3rd is not.
                  If I do df, I see the two that work correctly as (full names shortened for clarity):
                  /dev/mapper/SATA_8162
                  /dev/mapper/SATA_8186

                  If I go to /dev/mapper and do an ls, I see all 3 devices:
                  SATA_8186, SATA_8162 and SATA_ST31 (SATA_ST31 is the one that isn't working correctly/remounting)


                  If I go to /OVS/Repositories, I see the 3 mount points, but only two of them have the OVM directories (the two that work), the one thats not working is not present.

                  Leads me to believe that its simply not mounting, but I'm not sure where to look to see the errors or where to look to configure it to automatically mount.

                  Any ideas?
                  • 7. Re: power outage - lost my local storage and ocfs2 mount errors
                    user5373812
                    Ok - got a little farther - actually found the disk in ovm manager on the 'storage' tab and then was able to associate it with the server.
                    At that point, it showed up under repository tab but was unpresented to the server - but at this point, I could click the present button and give it a go.

                    Doing the present I assume is where it mounts and here I got a whole bunch of errors

                    starting with slot 0 in use
                    JDB: Ignoring recovery information on journal
                    then end-request critical target error dev dm-2 (and various sectors)
                    the ocfs2_write_block: 94 ERROR
                    ocfs2_update_disk_slot:217 ERROR
                    ocfs2_put_slot:531 ERROR

                    Then a big trace for the mount
                    followed by a series of end_request: I/0 error, dev sdd, sector 0
                    • 8. Re: power outage - lost my local storage and ocfs2 mount errors
                      budachst
                      Besides you seem to have some flaky hardware and no backup of your OVS storage repos, what everybody should have, just in case, a desaster like this happens…

                      what dies blkid return?
                      • 9. Re: power outage - lost my local storage and ocfs2 mount errors
                        user5373812
                        This is basically a POC environment that I was using to learn about the system - I didn't think I needed to build in redundancy at this level, but after this one, I agree - I definitely need it. I have a new storage server on the way, but I'd like to recover this disk if possible.

                        It shows up in blkid, its specific entry is
                        /dev/mapper/SATA_ST31... : LABEL="OVS4b.." UUID="0004f..." TYPE="ocfs2"
                        • 10. Re: power outage - lost my local storage and ocfs2 mount errors
                          user5373812
                          I just noticed that /dev/sdd is not showing up in blkid

                          /dev/sdb, /dev/sdc both show up with details that correspond to the OVS dev/mapper entries.
                          • 11. Re: power outage - lost my local storage and ocfs2 mount errors
                            budachst
                            Have you checked the cabling of sdd? Maybe you could put this disk into another box where you can install OVM onto and see, if blkid comes up with something useful.
                            • 12. Re: power outage - lost my local storage and ocfs2 mount errors
                              user5373812
                              What specifically would I check for? The drive appears to work until a power outage, and then will not remount and gives the errors I listed above (critical sectors, etc).
                              My guess is that I could completely re-initialize the drive and present it to the server and it would work fine - but obviously that wouldn't get me anywhere.

                              Is there something I can do to fix the bad sectors? Also, are there any logs I can look at to figure out whether there are any other errors I'm missing?
                              • 13. Re: power outage - lost my local storage and ocfs2 mount errors
                                budachst
                                If the drive only shows up after a reboot, than there's probably something wrong with your system. That's why I suggested trying the drive on another host.
                                Obviously, you're having issues getting the drive to work reliably in your current system, which is not a very good starting point for a desaster recovery.

                                There is also a really good - if not the best - hard drive maintenace and recovery utility called SpinRite. This will take care of physical issues with your drive sectors, should they need repair - provided that any repair is possible. After that, I'd mount the drive on another system and have OCFS repair it, if possible.