18 Replies. Latest reply: Dec 18, 2013 9:42 AM by Cindys-Oracle

    How to repair a corrupted zfs filesystems?

    lameon

      I am running Solaris 11 x86. I had a zpool running on 2 mirrored 3TB disks. The short story is that some of the ZFS file systems on this pool have permanent errors, including the root ZFS file system, which means I've lost the whole pool, about 2TB of data.

       

      The following is the status of my pool:

      root@solaris:~# zpool status -v dps
        pool: dps
       state: DEGRADED
      status: One or more devices has experienced an error resulting in data
              corruption.  Applications may be affected.
      action: Restore the file in question if possible. Otherwise restore the
              entire pool from backup.
         see: http://support.oracle.com/msg/ZFS-8000-8A
        scan: resilvered 442K in 0h0m with 551 errors on Thu Sep 12 00:11:57 2013
      config:

              NAME      STATE     READ WRITE CKSUM
              dps       DEGRADED     0     0    12
                c4t1d0  DEGRADED     0     0    24

      device details:

              c4t1d0  DEGRADED          too many errors
              status: FMA has degraded this device.
              action: Run 'fmadm faulty' for more information. Clear the errors
                      using 'fmadm repaired'.
                 see: http://support.oracle.com/msg/ZFS-8000-GH for recovery

      errors: Permanent errors have been detected in the following files:

              dps/Sharepoint/VirtualDisks:<0x0>
              dps:<0x0>
              dps:/VirtualBoxDisks/xppro64.vdi
              dps/Media:<0x0>
              dps/Sharepoint:<0x0>

       

      And here is the long story: one day my server hung and I had to force it to power off. When it started up again, I saw an error saying that one of the files on that pool had a permanent error on one of the disks. Since I had mirrored disks, I tried to fix it by resilvering the disks, and the pool was still in service during the resilvering. After a few hours, I found that the resilvering seemed to be stuck and more errors had appeared. I then detached one disk from the mirror in order to preserve a copy from further damage. But later, the detached disk became completely useless: "zpool import" couldn't find any pool info on that disk. I tried my best to recover it by exporting and importing the pool, rebooting the system, etc. Unfortunately things got worse and worse, and I ended up losing the whole pool.

       

      Does anyone here know if this pool is still recoverable, fully or partially? Any suggestions on what I should do next? Any input would be greatly appreciated.
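
The `errors:` list in the status output above mixes two kinds of entries: `dataset:<0x...>` entries, where an object number is shown instead of a file path, and `dataset:/path` entries that name a specific file. As a rough sketch (the function name `classify_zfs_errors` is hypothetical, and it only handles the entry shapes seen in this thread), the list can be split like this:

```shell
# Classify entries from the "errors:" section of `zpool status -v`.
# Reads one entry per line on stdin and prints one of:
#   object <dataset> <object-number>   for entries like  dps/Media:<0x0>
#   file   <dataset> <path>            for entries like  dps:/dir/file
#   path   - <path>                    for plain absolute paths
classify_zfs_errors() {
  awk '
    {
      sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")     # trim surrounding blanks
      if ($0 == "") next
      sep = index($0, ":")
      if (substr($0, 1, 1) == "/" || sep == 0) { print "path", "-", $0; next }
      dataset = substr($0, 1, sep - 1)
      rest    = substr($0, sep + 1)
      if (rest ~ /^<0x[0-9a-fA-F]+>$/)
        print "object", dataset, substr(rest, 2, length(rest) - 2)
      else
        print "file", dataset, rest
    }'
}

# Example with three of the entries reported for the dps pool.
classify_zfs_errors <<'EOF'
        dps/Sharepoint/VirtualDisks:<0x0>
        dps:<0x0>
        dps:/VirtualBoxDisks/xppro64.vdi
EOF
```

In this thread only one entry names an individual file; the others point at object 0x0 of a dataset, which is why the damage cannot be judged from the paths alone.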

        • 1. Re: How to repair a corrupted zfs filesystems?
          Cindys-Oracle

          Hi--

           

          Sounds like this pool had 2 DEGRADED disks, which caused the data corruption. It's unusual that both disks started failing at the same time. Unfortunately, detaching a disk from a pool wipes the pool info from that disk. Resilvering onto DEGRADED disks won't help resolve the existing corrupted data. Data doesn't go bad on its own, but disks do, which causes bad data, so you have to resolve the disk problems before you can resolve the data problems.

          You can use these commands to determine when the device problems started:

           

          # fmadm faulty

          # iostat -En

          # fmdump

           

          Then, use this command to review the fmdump reports:

           

          # fmdump -v -u <EVENT-ID>

           

          I would also rule out a larger problem like a bad cable or a controller problem.

           

          If you had REPLACED one of the DEGRADED disks instead of detaching it, then you might have recovered this pool. You would most likely have some corrupted data and I can't tell how severe it is from the paths above. Is this system running on VB or just hosting VMs with VB?

           

          Thanks, Cindy

          • 2. Re: How to repair a corrupted zfs filesystems?
            lameon

            Hi Cindy,

             

            Thanks for your reply.

             

            I assumed the problem was in the controller rather than the disks, because I thought both disks failing at the same time would be really rare. I sent the MB to the manufacturer to have it repaired. BTW, I'm using a GigaByte GA-Z77M-D3H-MVP. GigaByte said the MB is repaired, but didn't say what problem they found. I'm not quite sure, but I hope the hardware problem is eliminated. Of course I'll be monitoring it closely.

             

            How stupid of me! I should have realized that detaching a disk would not preserve the data, for basic security's sake. The current state is kind of a disaster for me. Almost all of my digital life was in that pool. The disaster happened 3 months ago; after several repair attempts, things got worse and worse, and I was scared to touch it any more. I turned off the server for 3 months, and only recently thought that someone online might be able to help, so I started seeking help here.

             

            Cindy, do you think this pool is repairable, by some magic tool or manually with amazing expertise? I just had another thought: detaching wipes out the pool info, but does it wipe out all of the ZFS file system info? My best guess is that it doesn't. If that is the case, can I try to rebuild the pool info without wiping the ZFS file system info?

             

            Thanks,

            Lance

            • 3. Re: How to repair a corrupted zfs filesystems?
              Cindys-Oracle

              Hi Lance,

               

              We won't know the state of this pool until you can get the system back together. I doubt that this will completely solve the data corruption. Do you have snapshots of these file systems? Snapshots don't always help because they originate from the same location of the file systems and share the same blocks.

               

              Let us know when the system is back up and running.

               

              Trying to recover data when the pool info is lost is not impossible, but very difficult. It will be easier to try to recover from the remaining disk, I think.

               

              Thanks, Cindy

              • 4. Re: How to repair a corrupted zfs filesystems?
                lameon

                Hi Cindy,

                 

                The system is back now. Unfortunately I don't have snapshots for it. Please let me know what I should do next.

                 

                Thanks,

                Lance

                • 5. Re: How to repair a corrupted zfs filesystems?
                  Cindys-Oracle

                  Hi Lance,

                   

                  Can you attempt to import this pool, if it's not available:

                   

                  # zpool import dps

                   

                  Let us know the result.

                   

                  I'm catching a plane shortly so will not be available again until this evening.

                   

                  Thanks, Cindy

                  • 6. Re: How to repair a corrupted zfs filesystems?
                    lameon

                    The pool is already imported. The status info I posted is its current status.

                    • 7. Re: How to repair a corrupted zfs filesystems?
                      Cindys-Oracle

                      Hi Lance,

                       

                      I need to review some things first and I'll get back to you. Don't give up just yet. Thanks, Cindy

                      • 8. Re: How to repair a corrupted zfs filesystems?
                        Cindys-Oracle

                        Lance,

                         

                        If the higher level system problems are resolved then I would try the next steps to see if the device issues are resolved.

                         

                        # zpool online dps c4t1d0

                        # zpool clear dps

                         

                        Let me know the results. Thanks, Cindy

                        • 9. Re: How to repair a corrupted zfs filesystems?
                          lameon

                          Hi Cindy,

                           

                          Here are the results:

                          root@solaris:~# zpool online dps c4t1d0
                          warning: device 'c4t1d0' onlined, but remains in degraded state
                          root@solaris:~# zpool clear dps
                          root@solaris:~# echo $?
                          0

                           

                          Seems nothing happened.

                          • 10. Re: How to repair a corrupted zfs filesystems?
                            Cindys-Oracle

                            Lance,

                             

                            I take it that the root pool is functioning on this system and just dps is a problem.

                            Looks like the disk is failing, so let's see if FMA confirms this problem.

                             

                            # fmdump -eV > /tmp/fmdump.out

                            # grep c4t1d0 /tmp/fmdump.out

                             

                            If c4t1d0 is listed in this file, then vi the file to find out the date of the problems.

                            Maybe this is a separate problem from the motherboard problem but it is hard to say.

                            If the root pool is fine then maybe this is a separate disk failure.

                             

                            If this disk needs to be replaced based on the FMA data, then I want to check with

                            some experts to see if we should try to recover the data before the disk is replaced.

                             

                            Thanks, Cindy
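
Instead of paging through the file in vi, the dates can be pulled out with a small filter. This is only a sketch: the function name `summarize_ereports` is hypothetical, and it assumes the `fmdump -eV` layout in which each event starts with a `Mon DD YYYY HH:MM:SS.nnn ereport.class` line and later carries a `vdev_path = ...` field, with no line-number prefixes.

```shell
# Count error reports per day and class for one device.
# $1: device name to look for in vdev_path (for example c4t1d0)
# stdin: output of `fmdump -eV`
summarize_ereports() {
  awk -v dev="$1" '
    # Event header: month, day, year, time, class
    NF == 5 && $3 ~ /^[0-9][0-9][0-9][0-9]$/ && $5 ~ /^ereport\./ {
      day = $1 " " $2 " " $3; cls = $5; next
    }
    # Field naming the affected device; count it under the current header
    $1 == "vdev_path" && index($3, dev) > 0 { count[day " " cls]++ }
    END { for (key in count) print count[key], key }
  ' | sort -k2
}

# Small demonstration on an abbreviated two-event excerpt.
summarize_ereports c4t1d0 <<'EOF'
Jul 21 2013 23:27:52.414828706 ereport.fs.zfs.probe_failure
        vdev_path = /dev/dsk/c4t1d0s0
Jul 21 2013 23:27:52.414875304 ereport.fs.zfs.io
        vdev_path = /dev/dsk/c4t1d0s0
EOF
```

On the real system the input would come from the saved file, for example `summarize_ereports c4t1d0 < /tmp/fmdump.out`. The output is sorted as text, not chronologically.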

                            • 11. Re: How to repair a corrupted zfs filesystems?
                              lameon

                              Hi Cindy,

                               

                              There are 110 occurrences of c4t1d0 in total in the fmdump output, and all of them denote the same device, vdev_path = /dev/dsk/c4t1d0s0.

                               

                              The first date it appeared was July 21, 2013, with three entries at almost the same time. Then there is a bunch of entries on Aug 30 and 31, which were the days I had problems accessing this pool. And then Nov 30, which was the day I restarted my system after the MB was repaired.

                               

                              The following is the first batch in the fmdump output:

                               

                              Jul 21 2013 23:27:52.414828706 ereport.fs.zfs.probe_failure
                              nvlist version: 0
                                      class = ereport.fs.zfs.probe_failure
                                      ena = 0x3dcdf54e21e00401
                                      detector = (embedded nvlist)
                                      nvlist version: 0
                                              version = 0x0
                                              scheme = zfs
                                              pool = 0xae7e2d470cb4144c
                                              vdev = 0xb410bba0db8b87fd
                                      (end detector)

                                      pool = dps
                                      pool_guid = 0xae7e2d470cb4144c
                                      pool_context = 0
                                      pool_failmode = wait
                                      vdev_guid = 0xb410bba0db8b87fd
                                      vdev_type = disk
                                      vdev_path = /dev/dsk/c4t1d0s0
                                      vdev_devid = id1,sd@SATA_____ST3000DM001-9YN1____________W1F0K4F3/a
                                      parent_guid = 0xee23f913afcb43d2
                                      parent_type = mirror
                                      prev_state = 0x0
                                      __ttl = 0x1
                                      __tod = 0x51eca6b8 0x18b9c8a2

                              Jul 21 2013 23:27:52.414875304 ereport.fs.zfs.io
                              nvlist version: 0
                                      class = ereport.fs.zfs.io
                                      ena = 0x3dcdf5595c600c01
                                      detector = (embedded nvlist)
                                      nvlist version: 0
                                              version = 0x0
                                              scheme = zfs
                                              pool = 0xae7e2d470cb4144c
                                              vdev = 0xb410bba0db8b87fd
                                      (end detector)

                                      pool = dps
                                      pool_guid = 0xae7e2d470cb4144c
                                      pool_context = 0
                                      pool_failmode = wait
                                      vdev_guid = 0xb410bba0db8b87fd
                                      vdev_type = disk
                                      vdev_path = /dev/dsk/c4t1d0s0
                                      vdev_devid = id1,sd@SATA_____ST3000DM001-9YN1____________W1F0K4F3/a
                                      parent_guid = 0xee23f913afcb43d2
                                      parent_type = mirror
                                      zio_err = 6
                                      zio_txg = 0x27ab27a
                                      zio_offset = 0xb219648000
                                      zio_size = 0x20000
                                      zio_objset = 0x10
                                      zio_object = 0x19
                                      zio_level = 0
                                      zio_blkid = 0x4de6
                                      __ttl = 0x1
                                      __tod = 0x51eca6b8 0x18ba7ea8

                              Jul 21 2013 23:27:52.414875452 ereport.fs.zfs.io
                              nvlist version: 0
                                      class = ereport.fs.zfs.io
                                      ena = 0x3dcdf5595c600c01
                                      detector = (embedded nvlist)
                                      nvlist version: 0
                                              version = 0x0
                                              scheme = zfs
                                              pool = 0xae7e2d470cb4144c
                                              vdev = 0xb410bba0db8b87fd
                                      (end detector)

                                      pool = dps
                                      pool_guid = 0xae7e2d470cb4144c
                                      pool_context = 0
                                      pool_failmode = wait
                                      vdev_guid = 0xb410bba0db8b87fd
                                      vdev_type = disk
                                      vdev_path = /dev/dsk/c4t1d0s0
                                      vdev_devid = id1,sd@SATA_____ST3000DM001-9YN1____________W1F0K4F3/a
                                      parent_guid = 0xee23f913afcb43d2
                                      parent_type = mirror
                                      zio_err = 6
                                      zio_txg = 0x27ab27a
                                      zio_offset = 0xb219648000
                                      zio_size = 0x20000
                                      zio_objset = 0x10
                                      zio_object = 0x19
                                      zio_level = 0
                                      zio_blkid = 0x4de6
                                      __ttl = 0x1
                                      __tod = 0x51eca6b8 0x18ba7f3c

                               

                              Cindy, please let me know if you need the whole fmdump.out.

                               

                              Thanks,

                              Lance
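
The `__tod` field in each event above appears to hold the event time as two hexadecimal numbers: seconds since the Unix epoch, then nanoseconds. That reading is an assumption, but the nanosecond part of the first event does match the fractional seconds in its header line. A minimal sketch of the conversion (the function name `tod_to_epoch` is hypothetical):

```shell
# Convert a "__tod" pair (hex seconds, hex nanoseconds) into decimal
# "seconds.nanoseconds" since the Unix epoch.
tod_to_epoch() {
  printf '%d.%09d\n' "$(($1))" "$(($2))"
}

# Values from the first event (ereport.fs.zfs.probe_failure).
tod_to_epoch 0x51eca6b8 0x18b9c8a2   # prints 1374463672.414828706
```

1374463672 seconds corresponds to 2013-07-22 03:27:52 UTC, which is consistent with the header time of Jul 21 23:27:52 if the server clock was four hours behind UTC.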

                              • 12. Re: How to repair a corrupted zfs filesystems?
                                Cindys-Oracle

                                Hi Lance,

                                 

                                No, I don't need the entire fmdump.out. This is enough to see that this disk has had problems and continues

                                to have problems. I think it needs to be replaced but I want to see if we can try to rescue the data before

                                the disk replacement. I'll get back to you tomorrow.

                                 

                                Thanks, Cindy

                                • 13. Re: How to repair a corrupted zfs filesystems?
                                  lameon

                                  Cindy, I see your point. I hope you can find a way to rescue the data. Again, since the pool had mirrored disks, is it possible to replicate the pool info from c4t1d0 to the detached disk? I hope that disk is good and its ZFS info is not corrupted.

                                   

                                  Thanks,

                                  Lance

                                  • 14. Re: How to repair a corrupted zfs filesystems?
                                    Cindys-Oracle

                                    Hi Lance,

                                     

                                    Corrupted pool recovery is not my speciality but I discussed your issues with someone who is a recovery expert and we

                                    have a few ideas:

                                     

                                    1. Keep the detached disk available, don't overwrite it or do anything with it just yet.

                                    2. Can you create a new ZFS pool on an extra spare disk?

                                    3. If you can do #2 above, then please create a new pool and copy all of your existing data from the existing dps pool.

                                    4. Let me know if you can do 2-3 and if the data is in reasonably good shape.

                                    5. If you can't do 2-3, then we need to go to more involved steps.

                                     

                                    Thanks, Cindy
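
Steps 2 and 3 above could look roughly like the following sketch. Everything named here is an assumption for illustration: the new pool name `rescue`, the spare device `c4t2d0`, the default mount points `/dps` and `/rescue`, and the helper `copy_tree_logging_failures`. The idea is to keep copying when a damaged file cannot be read and to record which files failed, rather than stopping at the first error.

```shell
# Hypothetical preparation (not run here): create the new pool on the spare disk.
#   zpool create rescue c4t2d0

# Copy every regular file under $1 into $2, keeping relative paths.
# Files that cannot be copied are recorded in the log file $3 and skipped.
# Symbolic links, empty directories, and file names containing newlines
# are not handled by this simple loop.
copy_tree_logging_failures() {
  ct_src=$1 ct_dst=$2 ct_log=$3
  : > "$ct_log"
  ( cd "$ct_src" && find . -type f -print ) | while IFS= read -r ct_rel; do
    ct_dir=$ct_dst/$(dirname "$ct_rel")
    if mkdir -p "$ct_dir" 2>/dev/null &&
       cp -p "$ct_src/$ct_rel" "$ct_dst/$ct_rel" 2>/dev/null; then
      :                                   # copied successfully
    else
      printf '%s\n' "$ct_rel" >> "$ct_log"   # remember the failure, keep going
    fi
  done
}

# Hypothetical usage, assuming the default mount points:
#   copy_tree_logging_failures /dps /rescue /var/tmp/dps-copy-failed.log
```

The log then gives a starting list for step 4, showing how much of the data came across in usable shape.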
