I am running Solaris 11 x86. I had a zpool running on 2 mirrored 3TB disks. The short story is that some of the ZFS file systems on this pool have permanent errors, including the root ZFS file system, which means I've lost the whole pool, about 2TB of data.
The following is the status of my pool:
root@solaris:~# zpool status -v dps
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
scan: resilvered 442K in 0h0m with 551 errors on Thu Sep 12 00:11:57 2013
        NAME        STATE     READ WRITE CKSUM
        dps         DEGRADED     0     0    12
          c4t1d0    DEGRADED     0     0    24

        c4t1d0      DEGRADED    too many errors
        status: FMA has degraded this device.
        action: Run 'fmadm faulty' for more information.
                Clear the errors using 'fmadm repaired'.
           see: http://support.oracle.com/msg/ZFS-8000-GH for recovery
errors: Permanent errors have been detected in the following files:
And here is the long story. One day my server hung and I had to force it off, and when it started up again I saw an error saying that one of the files on that pool had a permanent error on one of the disks. Since I had mirrored disks, I tried to fix it by resilvering, and the pool stayed in service during the resilvering. After a few hours the resilvering seemed to be stuck and more errors appeared. I then detached one disk from the mirror in order to preserve a copy from further damage. But later the detached disk turned out to be completely useless: "zpool import" couldn't find any pool info on it. I tried my best to recover by exporting and importing the pool, rebooting the system, etc. Unfortunately things got worse and worse, and I ended up losing the whole pool.
Does anyone here know if this pool is still recoverable, fully or partially? Any suggestion on what I should do next? Any input would be greatly appreciated.
Sounds like this pool had 2 DEGRADED disks, which caused the data corruption. It's unusual that both disks started failing at the same time. Unfortunately, detaching a disk from a pool wipes the pool info from that disk. Resilvering onto DEGRADED disks won't resolve existing corruption. Data doesn't go bad on its own, but disks do, and failing disks produce bad data, so you have to resolve the disk problems before you can resolve the data problems.
You can use these commands to determine when the device problems started:
# fmadm faulty
# iostat -En
Then, use this command to review the fmdump reports:
# fmdump -v -u <EVENT-ID>
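The <EVENT-ID> is the UUID string shown in the EVENT-ID column of the 'fmadm faulty' output, so the last step looks like this (the UUID below is only an illustrative placeholder, substitute one from your own output):
# fmdump -v -u 7b2e630e-ffbf-6cd0-ad2f-d4d9b6d78c42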
I would also rule out a larger problem like a bad cable or a controller problem.
If you had REPLACED one of the DEGRADED disks instead of detaching it, then you might have recovered this pool. You would most likely have some corrupted data and I can't tell how severe it is from the paths above. Is this system running on VB or just hosting VMs with VB?
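Just for reference on the replace point, it would have looked roughly like this, where c4t0d0 stands in for the degraded disk that was detached and c4t2d0 for a new disk (both are placeholders for whatever the devices enumerate as on your system):
# zpool replace dps c4t0d0 c4t2d0
# zpool status dps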
Thanks for your reply.
I assumed the problem was with the controller and not the disks, because I thought both disks failing at the same time would be a really rare case. I sent the motherboard to the manufacturer to have it repaired. BTW, I'm using a GigaByte GA-Z77M-D3H-MVP. GigaByte says the board has been repaired, but didn't say what problem they found. I'm not quite sure, but I hope the hardware problem has been eliminated. Of course I'll be monitoring it closely.
How foolish I was! I should have realized that detaching a disk would not preserve the data, for basic safety's sake. The current situation is kind of a disaster for me; almost all of my digital life was in that pool. The failure happened 3 months ago, and after several attempts at repairing it things got worse and worse, so I became scared of touching it any more. I kept the server turned off for 3 months, and only recently did I think that someone online could probably help, which is why I started seeking help here.
Cindy, do you think this pool is repairable, by any magic tool or manually with amazing expertise? And I just had another thought: detaching wipes out the pool info, but does it also wipe out all of the ZFS file system info? My best guess is that it doesn't. If that's the case, can I try to rebuild the pool info without wiping out the file system info?
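In case it helps, once the system is up I was planning to check whether the detached disk still carries any vdev label at all with something like this (c4t0d0 is only a guess at its device name, and I'm not sure zdb will report anything useful on a detached disk):
# zdb -l /dev/dsk/c4t0d0s0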
We won't know the state of this pool until you can get the system back together. I doubt that this will completely solve the data corruption. Do you have snapshots of these file systems? Snapshots don't always help because they originate from the same location of the file systems and share the same blocks.
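Once the pool is accessible again, something like this should show whether any snapshots exist:
# zfs list -t snapshot -r dps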
Let us know when the system is back up and running.
Trying to recover data when the pool info is lost is not impossible, but very difficult. It will be easier to try to recover from the remaining disk, I think.
I take it that the root pool is functioning on this system and just dps is a problem.
Looks like the disk is failing, so let's see if FMA confirms this problem.
# fmdump -eV > /tmp/fmdump.out
# grep c4t1d0 /tmp/fmdump.out
If c4t1d0 is listed in this file, then vi the file to find out the date of the problems.
Maybe this is a separate problem from the motherboard problem but it is hard to say.
If the root pool is fine, then maybe this is a separate disk failure.
If this disk needs to be replaced based on the FMA data, then I want to check with
some experts to see if we should try to recover the data before the disk is replaced.
There are 110 occurrences of c4t1d0 in the fmdump output in total, and all of them denote the same device, vdev_path = /dev/dsk/c4t1d0s0.
The first date it appeared was July 21, 2013, with three entries at almost the same time. Then there is a bunch of entries on Aug 30 and 31, which were the days I had problems accessing this pool, and then on Nov 30, which was the day I restarted the system after the motherboard was repaired.
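For reference, this is roughly how I counted the occurrences (command from memory):
# grep -c c4t1d0 /tmp/fmdump.out
110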
The following is the first batch from the fmdump output:
Jul 21 2013 23:27:52.414828706 ereport.fs.zfs.probe_failure
nvlist version: 0
        class = ereport.fs.zfs.probe_failure
        ena = 0x3dcdf54e21e00401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xae7e2d470cb4144c
                vdev = 0xb410bba0db8b87fd
        (end detector)

        pool = dps
        pool_guid = 0xae7e2d470cb4144c
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xb410bba0db8b87fd
        vdev_type = disk
        vdev_path = /dev/dsk/c4t1d0s0
        vdev_devid = id1,sd@SATA_____ST3000DM001-9YN1____________W1F0K4F3/a
        parent_guid = 0xee23f913afcb43d2
        parent_type = mirror
        prev_state = 0x0
        __ttl = 0x1
        __tod = 0x51eca6b8 0x18b9c8a2

Jul 21 2013 23:27:52.414875304 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena = 0x3dcdf5595c600c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xae7e2d470cb4144c
                vdev = 0xb410bba0db8b87fd
        (end detector)

        pool = dps
        pool_guid = 0xae7e2d470cb4144c
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xb410bba0db8b87fd
        vdev_type = disk
        vdev_path = /dev/dsk/c4t1d0s0
        vdev_devid = id1,sd@SATA_____ST3000DM001-9YN1____________W1F0K4F3/a
        parent_guid = 0xee23f913afcb43d2
        parent_type = mirror
        zio_err = 6
        zio_txg = 0x27ab27a
        zio_offset = 0xb219648000
        zio_size = 0x20000
        zio_objset = 0x10
        zio_object = 0x19
        zio_level = 0
        zio_blkid = 0x4de6
        __ttl = 0x1
        __tod = 0x51eca6b8 0x18ba7ea8

Jul 21 2013 23:27:52.414875452 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena = 0x3dcdf5595c600c01
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0xae7e2d470cb4144c
                vdev = 0xb410bba0db8b87fd
        (end detector)

        pool = dps
        pool_guid = 0xae7e2d470cb4144c
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xb410bba0db8b87fd
        vdev_type = disk
        vdev_path = /dev/dsk/c4t1d0s0
        vdev_devid = id1,sd@SATA_____ST3000DM001-9YN1____________W1F0K4F3/a
        parent_guid = 0xee23f913afcb43d2
        parent_type = mirror
        zio_err = 6
        zio_txg = 0x27ab27a
        zio_offset = 0xb219648000
        zio_size = 0x20000
        zio_objset = 0x10
        zio_object = 0x19
        zio_level = 0
        zio_blkid = 0x4de6
        __ttl = 0x1
        __tod = 0x51eca6b8 0x18ba7f3c
Cindy, please let me know if you need the whole fmdump.out.
No, I don't need the entire fmdump.out. This is enough to see that this disk has had problems and continues
to have problems. I think it needs to be replaced but I want to see if we can try to rescue the data before
the disk replacement. I'll get back to you tomorrow.
Corrupted pool recovery is not my speciality but I discussed your issues with someone who is a recovery expert and we
have a few ideas:
1. Keep the detached disk available, don't overwrite it or do anything with it just yet.
2. Can you create a new ZFS pool on an extra spare disk?
3. If you can do #2 above, then please create a new pool and copy all of your existing data from the existing dps pool (a rough sketch follows this list).
4. Let me know if you can do 2-3 and whether the data is in reasonably good shape.
5. If you can't do 2-3, then we need to go to more involved steps.
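Here is a rough sketch of steps 2-3, assuming the spare disk shows up as c4t3d0 and calling the new pool "rescue" (both placeholders), and that zfs send can get past the corrupted blocks; if it can't, copying the mounted file systems with cp or rsync is the fallback:
# zpool create rescue c4t3d0
# zfs snapshot -r dps@rescue
# zfs send -R dps@rescue | zfs receive -d rescue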