      • 30. Re: Solaris 10 hopelessly unreliable - random hangs
        807557
         I’m having very similar freezes on zfs receive operations. The hang seems to happen every time I have a second terminal open running zpool iostat 1.
         Once it hangs I can still open new terminals, but ANY zfs activity hangs. The machine will not shut down cleanly; I have to force-reset it.

        The last command that hung was:
         zfs send -vR zportable/data@20081212 | zfs recv -vdF zlocal
         The three disks involved are one zpool on a single 1TB drive being sent to one zpool striped across two 1TB drives.
        AVAILABLE DISK SELECTIONS:
               0. c0d0 <DEFAULT cyl 12747 alt 2 hd 255 sec 63>
                  /pci@0,0/pci-ide@1f,2/ide@0/cmdk@0,0
               1. c2d0 <DEFAULT cyl 59523 alt 2 hd 255 sec 126>
                  /pci@0,0/pci-ide@1f,5/ide@0/cmdk@0,0
               2. c3d0 <ST310003-         5QJ07WR-0001-931.51GB>
                  /pci@0,0/pci8086,244e@1e/pci-ide@3/ide@1/cmdk@0,0
               3. c4d0 <ST340062-         9QG418B-0001-372.61GB>
                  /pci@0,0/pci8086,244e@1e/pci-ide@3/ide@0/cmdk@0,0
               4. c4d1 <ST340062-         9QG41DD-0001-372.61GB>
                  /pci@0,0/pci8086,244e@1e/pci-ide@3/ide@0/cmdk@1,0
               5. c5d0 <ST310003-         5QJ0L9T-0001-931.51GB>
                  /pci@0,0/pci-ide@1f,5/ide@1/cmdk@0,0
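         For what it's worth, here is roughly what I intend to run the next time it hangs, to see where the zfs commands are stuck (just a sketch, I have not verified it catches the hung threads):

         # find the stuck zfs processes and dump their user-level stacks
         ps -ef | grep zfs
         pstack <pid-of-hung-zfs-send>               # <pid-...> is a placeholder

         # dump all kernel thread stacks to a file for later inspection
         echo "::threadlist -v" | mdb -k > /var/tmp/threads.txt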
        • 31. Re: Solaris 10 hopelessly unreliable - random hangs
          807557
          Followup:
           Disk hardware is SATA, running on the Intel integrated controller and a SiI3114 controller.
          SunOS 5.11      snv_79a January 2008
           Once I have rebooted I will post the zpool status.
          bash-3.2# dmesg
          Fri Dec 12 23:09:00 SAST 2008
          Dec 12 18:15:46 sols1 genunix: [ID 936769 kern.info] uhci3 is /pci@0,0/pci8086,d701@1d,1
          Dec 12 18:15:46 sols1 npe: [ID 236367 kern.info] PCI Express-device: pci8086,d701@1d,2, uhci4
          Dec 12 18:15:46 sols1 genunix: [ID 936769 kern.info] uhci4 is /pci@0,0/pci8086,d701@1d,2
          Dec 12 18:15:46 sols1 unix: [ID 950921 kern.info] cpu0: x86 (chipid 0x0 GenuineIntel 1067A family 6 model 23 step 10 clock 2983 MHz)
          Dec 12 18:15:46 sols1 unix: [ID 950921 kern.info] cpu0: Intel(r) Core(tm)2 CPU         E8400  @ 3.00GHz
          Dec 12 18:15:46 sols1 unix: [ID 950921 kern.info] cpu1: x86 (chipid 0x0 GenuineIntel 1067A family 6 model 23 step 10 clock 2983 MHz)
          Dec 12 18:15:46 sols1 unix: [ID 950921 kern.info] cpu1: Intel(r) Core(tm)2 CPU         E8400  @ 3.00GHz
          Dec 12 18:15:46 sols1 unix: [ID 557827 kern.info] cpu1 initialization complete - online
          Dec 12 18:15:46 sols1 npe: [ID 236367 kern.info] PCI Express-device: pci8086,244e@1e, pci_pci0
          Dec 12 18:15:46 sols1 genunix: [ID 936769 kern.info] pci_pci0 is /pci@0,0/pci8086,244e@1e
          Dec 12 18:15:47 sols1 pcplusmp: [ID 803547 kern.info] pcplusmp: pciclass,0c0010 (hci1394) instance 0 vector 0x14 ioapic 0x2 intin 0x14 is bound to cpu 0
          [...]
          Dec 12 20:37:29 sols1 zfs: [ID 664491 kern.warning] WARNING: Pool 'zportable' has encountered an uncorrectable I/O error. Manual intervention is required.
          [cut pseudo-device info]
          Dec 12 22:50:21 sols1 pcplusmp: [ID 803547 kern.info] pcplusmp: asy (asy) instance 0 vector 0x4 ioapic 0x2 intin 0x4 is bound to cpu 0
          Dec 12 22:50:21 sols1 isa: [ID 202937 kern.info] ISA-device: asy0
          Dec 12 22:50:21 sols1 genunix: [ID 936769 kern.info] asy0 is /isa/asy@1,3f8
          Dec 12 22:50:21 sols1 genunix: [ID 773945 kern.info]        UltraDMA mode 5 selected
          Dec 12 22:50:21 sols1 genunix: [ID 773945 kern.info]        UltraDMA mode 4 selected
          Dec 12 22:50:21 sols1 last message repeated 2 times
          Dec 12 22:50:21 sols1 unix: [ID 954099 kern.info] NOTICE: IRQ18 is being shared by drivers with different interrupt levels.
          Dec 12 22:50:21 sols1 This may result in reduced system performance.
          Dec 12 22:50:21 sols1 genunix: [ID 773945 kern.info]        UltraDMA mode 5 selected
          Dec 12 22:50:21 sols1 last message repeated 1 time
          Dec 12 22:50:21 sols1 unix: [ID 954099 kern.info] NOTICE: IRQ18 is being shared by drivers with different interrupt levels.
          Dec 12 22:50:21 sols1 This may result in reduced system performance.
          Dec 12 22:50:21 sols1 genunix: [ID 640982 kern.info]        IDE device at targ 0, lun 0 lastlun 0x0
          Dec 12 22:50:21 sols1 genunix: [ID 846691 kern.info]        model ST3400620AS
          Dec 12 22:50:21 sols1 genunix: [ID 479077 kern.info]        ATA/ATAPI-7 supported, majver 0xfe minver 0x0
          Dec 12 22:50:21 sols1 genunix: [ID 640982 kern.info]        IDE device at targ 1, lun 0 lastlun 0x0
          Dec 12 22:50:21 sols1 genunix: [ID 846691 kern.info]        model ST3400620AS
          Dec 12 22:50:21 sols1 genunix: [ID 479077 kern.info]        ATA/ATAPI-7 supported, majver 0xfe minver 0x0
          Dec 12 22:50:21 sols1 isa: [ID 202937 kern.info] ISA-device: pit_beep0
          [cut pseudo-device info]
          Dec 12 22:50:22 sols1 pci_pci: [ID 370704 kern.info] PCI-device: ide@0, ata6
          Dec 12 22:50:22 sols1 genunix: [ID 936769 kern.info] ata6 is /pci@0,0/pci8086,244e@1e/pci-ide@3/ide@0
          Dec 12 22:50:22 sols1 genunix: [ID 773945 kern.info]        UltraDMA mode 6 selected
          Dec 12 22:50:22 sols1 last message repeated 2 times
          Dec 12 22:50:22 sols1 gda: [ID 243001 kern.info] Disk5:     <Vendor 'Gen-ATA ' Product 'ST3400620AS     '>
          [cut pseudo-device info]
          Dec 12 22:50:22 sols1 sv: [ID 443358 kern.info] sv Nov 27 2007 20:53:03 (revision 11.11, 11.11.0_5.11, 11.27.2007)
          [cut pseudo-device info]
          Dec 12 22:50:22 sols1 ipf: [ID 774698 kern.info] IP Filter: v4.1.9, running.
          Dec 12 22:50:22 sols1 rdc: [ID 517869 kern.info] @(#) rdc: built 20:52:37 Nov 27 2007
          Dec 12 22:50:22 sols1 pseudo: [ID 129642 kern.info] pseudo-device: rdc0
          Dec 12 22:50:22 sols1 genunix: [ID 936769 kern.info] rdc0 is /pseudo/rdc@0
          Dec 12 22:50:23 sols1 ata: [ID 496167 kern.info] cmdk5 at ata6 target 0 lun 0
          Dec 12 22:50:23 sols1 genunix: [ID 936769 kern.info] cmdk5 is /pci@0,0/pci8086,244e@1e/pci-ide@3/ide@0/cmdk@0,0
          Dec 12 22:50:23 sols1 gda: [ID 243001 kern.info] Disk3:     <Vendor 'Gen-ATA ' Product 'ST3400620AS     '>
          Dec 12 22:50:24 sols1 ata: [ID 496167 kern.info] cmdk3 at ata6 target 1 lun 0
          Dec 12 22:50:24 sols1 genunix: [ID 936769 kern.info] cmdk3 is /pci@0,0/pci8086,244e@1e/pci-ide@3/ide@0/cmdk@1,0
          • 32. Re: Solaris 10 hopelessly unreliable - random hangs
            807557
            Follow up:
             This is happening so regularly that I have not been able to send the full zportable partition in six tries over six days.
             I see an IRQ18 interrupt clash in the log but do not know how to identify the culprit hardware.
            bash-3.2# zpool status
            pool: zlocal
            state: ONLINE
            scrub: none requested
            config:
                    NAME        STATE     READ WRITE CKSUM
                    zlocal      ONLINE       0     0     0
                      c3d0      ONLINE       0     0     0
                      c5d0      ONLINE       0     0     0
            errors: No known data errors
            pool: zportable
            state: ONLINE
            scrub: none requested
            config:
                    NAME        STATE     READ WRITE CKSUM
                    zportable   ONLINE       0     0     0
                      c2d0s2    ONLINE       0     0     0
            errors: No known data errors
            bash-3.2# zfs list
            NAME                       USED  AVAIL  REFER  MOUNTPOINT
            zlocal                     594G  1.20T    21K  /zlocal
            zlocal/jpv                 594G  1.20T   584G  /zlocal/jpv
            zlocal/jpv@20081123a      9.54G      -   586G  -
            zlocal/jpv@20081206        192M      -   583G  -
            zlocal/jpv@20081210        141K      -   584G  -
            zlocal/jpv@20081211         46K      -   584G  -
            zportable                  788G   102G  21.5K  /zportable
            zportable@20081212            0      -  21.5K  -
            zportable/data             185G   102G  86.9G  /zportable/data
            zportable/data@20081123a  98.1G      -   185G  -
            zportable/data@20081206    442K      -  86.9G  -
            zportable/data@20081210     24K      -  86.9G  -
            zportable/data@20081211     24K      -  86.9G  -
            zportable/data@20081212       0      -  86.9G  -
            zportable/jpv              603G   102G   593G  /zportable/jpv
            zportable/jpv@20081123a   9.54G      -   586G  -
            zportable/jpv@20081206     191M      -   583G  -
            zportable/jpv@20081210     139K      -   584G  -
            zportable/jpv@20081211     139K      -   584G  -
            zportable/jpv@20081212    9.06M      -   593G  -
            • 33. Re: Solaris 10 hopelessly unreliable - random hangs
              807557
               Two points - snv_79 is pretty old (Sun is on snv_103 right now). Heck, Sun has released two(!) versions of S10 since then. Various ZFS problems and driver bugs have been resolved since snv_79. The second point is that I recall the SiI3114 controller has various issues in general (not Solaris-specific). You should make sure the controller firmware is current.
              • 34. Re: Solaris 10 hopelessly unreliable - random hangs
                807557
                Thank you rogerfujii for your help.
                 Based on your advice I removed the SiI3114 from the system; this meant also removing all non-critical SATA drives as well as a SATA DVD-ROM drive.
                 I then tested again but hit the same problem. As soon as I can download a newer version of the OS I will try it. In the meantime I will attempt the send-receive over SSH, which seemed to work in the past.
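                 Roughly what I plan to run for the SSH attempt (the remote host and target pool names below are just placeholders for whatever I end up using):

                 # send the recursive snapshot stream to a pool on another box over ssh
                 zfs send -vR zportable/data@20081212 | ssh admin@backuphost zfs recv -vdF zbackup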

                When I tried to reboot after the last freeze the system panicked. I’m downloading the crash analyzer to see if I can find any information there.

                <rant>In Windows one would open the Event viewer to see error reports on any failing subsystems.
                 In Solaris it seems one has to remove components of the system and diagnose by elimination. </rant>

                 If there is any documentation out there on how to debug this kind of problem, please point it out.

                Edited by: JacoVosloo on Dec 13, 2008 8:15 AM
                • 35. Re: Solaris 10 hopelessly unreliable - random hangs
                  807557
                   I will attempt the send-receive over SSH, which seemed to work in the past.
                  This worked in the past? This puts a different light on things. Looking through your logs a little more carefully, there's this warning message:
                  Dec 12 20:37:29 sols1 zfs: [ID 664491 kern.warning] WARNING: Pool 'zportable' has encountered an uncorrectable I/O error. Manual intervention is required.
                  [cut pseudo-device info]

                   (I didn't see this fully before because the frame it's in clipped it at the "has".) Anyway, this implies you ran into some r/w error. You might look through /var/adm/messages and see if there are any IDE error messages. ZFS gets especially grumpy with drives that have errors on them. You also might want to scrub the pool (zportable).
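                   Something along these lines (pool name taken from your output, adjust as needed):

                   # look for disk/controller errors around the time of the warning
                   egrep -i "error|warn" /var/adm/messages | more

                   # kick off a scrub, then watch its progress and any per-file errors
                   zpool scrub zportable
                   zpool status -v zportable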
                  When I tried to reboot after the last freeze the system panicked
                  Never a good sign. Does it boot in failsafe?

                   re failing subsystems - huh? Solaris has multiple mechanisms for logging errors. The "quick" way is to scan /var/adm/messages. There is also a more comprehensive framework (using fmd). You can run fmdump and get the list of fault events.
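                   For example (none of these need anything beyond a root shell):

                   fmadm faulty        # resources the fault manager currently considers faulty
                   fmdump              # list of logged fault events
                   fmdump -eV          # the underlying error reports, in full detail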
                   If there is any documentation out there on how to debug this kind of problem, please point it out.
                   Diagnostics is pretty much an acquired skill on most things. From the situation you seem to be in, I'd first take the hardware out of the equation - run format, choose your zportable disk, then run analyze/read (non-destructive) and see if it can read all the blocks on the disk. If it detects any errors, you know the source of your headaches. If you are confident that the disk is ok, then I'd start focusing on zfs - but I'd upgrade first. There have been a lot of funky problems that have disappeared after patching/upgrading. No sense doing any brain bashing over something that's fixed already.....
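                   A rough sketch of the format session (the disk number you pick will differ on your box):

                   # format is interactive; pick the zportable disk from the menu it prints
                   bash-3.2# format
                   format> analyze
                   analyze> read       # non-destructive surface read of every block
                   analyze> quit
                   format> quit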

                  -r
                  • 36. Re: Solaris 10 hopelessly unreliable - random hangs
                    807557
                    Thank you for the good and thorough information.
                     Based on the error message you pointed out, I realized that it may be the zportable pool that is corrupt and not the target pool. Unfortunately I only realized this after destroying the target pool, which was my only backup :( I found that the SSH receive did not work on the DATA filesystem, but the JPV filesystem seemed to work, albeit slowly. I'm now trying to get it completed and will then continue trying to recover the DATA filesystem. (Fortunately the DATA fs is not critical to recover.)

                     After the panic the system kept hanging while initializing the pool, even when booting in failsafe mode, if I tried to import the pool. That was when I decided to destroy the pool; I had to unplug one of the striped disks to force the pool into an unavailable state before I could destroy and recreate it.
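                     For the record, once the pool showed up as unavailable this is more or less what I ran to get rid of it and start over (device names as in my earlier zpool status output):

                     zpool destroy -f zlocal
                     zpool create zlocal c3d0 c5d0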

                     I read the fmdump man page; it looks like a great event management system.
                     Due to lack of download bandwidth I was hoping to upgrade directly to xVM Server, but I don't know if I should risk EA3.
                    When I have more info I will post back.
                    • 37. Re: Solaris 10 hopelessly unreliable - random hangs
                      807557
                       I had to unplug one of the striped disks to force the pool into an unavailable state
                       If you have a scrambled disk, and you are going to destroy the contents anyway, you can wipe out the disk header either by using format / analyze / purge (of the first cylinder), or be more direct and do a dd if=/dev/zero of=/dev/dsk/c1d0p0 count=10 (substitute c1d0p0 with whatever disk you want clobbered).
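                       Spelled out, either of these should do it (c1d0 here is just an example device - double-check you have the right one before running anything):

                       # option 1: interactive, via format (purge is destructive)
                       format              # select the disk to clobber from the menu
                       format> analyze
                       analyze> purge

                       # option 2: just zero out the start of the disk
                       dd if=/dev/zero of=/dev/dsk/c1d0p0 count=10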
                      • 38. Re: Solaris 10 hopelessly unreliable - random hangs
                        807557
                        Thanks, next time I want to quickly clear a disk I will try this.
                         I was finally able to complete the send of the problem filesystem. It seems that a file in an old snapshot got corrupted, and that got ZFS's knickers in a knot. My solution was to delete the specific file, create a new snapshot and destroy the old snapshot.
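                         In terms of commands it boiled down to something like this (the file path and snapshot names here are only examples, not the exact ones I used):

                         rm /zportable/data/<path-to-corrupt-file>     # the file zpool status -v pointed at
                         zfs snapshot zportable/data@20081216          # fresh snapshot without the bad file
                         zfs destroy zportable/data@20081123a          # drop the snapshot holding the corrupt block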
                         Just to recap what I did: at first, when doing the following send, ZFS would hang, ignoring kill commands and requiring a cold reboot.
                        zfs send -v zportable/data@20081212 | zfs recv -vFd zsols1
                        I saw the following error in dmesg:
                        Dec 15 06:42:23 sols1.zenblox.net zfs: [ID 664491 kern.warning] WARNING: Pool 'zportable' has encountered an uncorrectable I/O error. Manual intervention is required.
                        Then I got the following from fmdump:
                        bash-3.2# fmdump -t 2008-12-15 -eV
                        TIME                           CLASS
                        Dec 15 2008 06:42:23.392818680 ereport.fs.zfs.checksum
                        nvlist version: 0
                                class = ereport.fs.zfs.checksum
                                ena = 0x6220376de8a00401
                                detector = (embedded nvlist)
                                nvlist version: 0
                                        version = 0x0
                                        scheme = zfs
                                        pool = 0x766b4380fa3c352b
                                        vdev = 0xd0526d6461725b6f
                                (end detector)
                        
                                pool = zportable
                                pool_guid = 0x766b4380fa3c352b
                                pool_context = 0
                                vdev_guid = 0xd0526d6461725b6f
                                vdev_type = disk
                                vdev_path = /dev/dsk/c1d0s2
                                vdev_devid = id1,cmdk@AST31000340AS=____________9QJ17PZ0/c
                                parent_guid = 0x766b4380fa3c352b
                                parent_type = root
                                zio_err = 50
                                zio_offset = 0xdf1ae00000
                                zio_size = 0x20000
                                zio_object = 0x13c99
                                zio_level = 0
                                zio_blkid = 0x2ed
                                __ttl = 0x1
                                __tod = 0x4945e02f 0x1769eff8
                        
                        Dec 15 2008 06:42:23.392818864 ereport.fs.zfs.checksum
                        nvlist version: 0
                                class = ereport.fs.zfs.checksum
                                ena = 0x6220376de8a00401
                                detector = (embedded nvlist)
                                nvlist version: 0
                                        version = 0x0
                                        scheme = zfs
                                        pool = 0x766b4380fa3c352b
                                        vdev = 0xd0526d6461725b6f
                                (end detector)
                        
                                pool = zportable
                                pool_guid = 0x766b4380fa3c352b
                                pool_context = 0
                                vdev_guid = 0xd0526d6461725b6f
                                vdev_type = disk
                                vdev_path = /dev/dsk/c1d0s2
                                vdev_devid = id1,cmdk@AST31000340AS=____________9QJ17PZ0/c
                                parent_guid = 0x766b4380fa3c352b
                                parent_type = root
                                zio_err = 50
                                zio_offset = 0xdf1ae00000
                                zio_size = 0x20000
                                zio_object = 0x13c99
                                zio_level = 0
                                zio_blkid = 0x2ed
                                __ttl = 0x1
                                __tod = 0x4945e02f 0x1769f0b0
                        
                        Dec 15 2008 06:42:23.392818769 ereport.fs.zfs.data
                        nvlist version: 0
                                class = ereport.fs.zfs.data
                                ena = 0x6220376de8a00401
                                detector = (embedded nvlist)
                                nvlist version: 0
                                        version = 0x0
                                        scheme = zfs
                                        pool = 0x766b4380fa3c352b
                                (end detector)
                        
                                pool = zportable
                                pool_guid = 0x766b4380fa3c352b
                                pool_context = 0
                                zio_err = 50
                                zio_object = 0x13c99
                                zio_level = 0
                                zio_blkid = 0x2ed
                                __ttl = 0x1
                                __tod = 0x4945e02f 0x1769f051
                         So I ran format / analyze / read and found no hardware errors.
                        Then I ran a scrub which told me that a specific file was corrupt on a specific snapshot.
                         Even after the scrub it could not send the filesystem; I tried the zfs online command to clear the errors, but it did not help. Eventually deleting the offending file and the snapshot did the trick.
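                         For reference, the scrub side of it was roughly (pool name as above):

                         zpool scrub zportable            # walks every block and finds the corrupt data
                         zpool status -v zportable        # shows scrub progress and the list of files with permanent errors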

                        Now I just hope my data is still valid after this whole exercise. Thanks again for your help.
                        • 39. Re: Solaris 10 hopelessly unreliable - random hangs
                          807557
                           So I ran format / analyze / read and found no hardware errors.
                           Just an FYI - just because you didn't find any errors now doesn't mean there weren't errors then. The SMART firmware on drives will do 'magic' things - like auto-remapping bad sectors. I think what it does when it can't recover the data it has remapped depends on the firmware.
                          Then I ran a scrub which told me that a specific file was corrupt on a specific snapshot.
                          scrub can't fix something unless the pool is in some sort of redundant configuration. ZFS can be unforgiving since it refuses to return bogus data.

                          Now I just hope my data is still valid after this whole exercise. Thanks again for your help.

                          You can have some confidence that if it passes a scrub, the files you can read are still the files you think they are. Given the random reliability of drives nowadays, and the low prices of drives, running in a mirror is a good hedge against a future headache....
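                           For example, assuming you had a spare drive at least as big as the one zportable lives on (cXdY below is just a placeholder), you could convert the single-disk pool into a two-way mirror in place:

                           zpool attach zportable c2d0s2 cXdY      # resilvers onto the new disk, leaving a mirror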

                          -r
                          • 40. Re: Solaris 10 hopelessly unreliable - random hangs
                            807557
                            Good point. I was hoping to use AVS to mirror my data across machines rather than on one physical machine.
                            My machines don’t use ECC RAM so I am worried that bad memory may gradually corrupt my data after a few send-receives. Do you know if ZFS has enough protection against data becoming corrupt during a send?
                            • 41. Re: Solaris 10 hopelessly unreliable - random hangs
                              807557
                              My machines don’t use ECC RAM so I am worried that bad memory may gradually corrupt my data after a few send-receives. Do you know if ZFS has enough protection against data becoming corrupt during a send?
                               Because ZFS uses RAM so aggressively, it has a tendency to show memory problems before they are noticed anywhere else. As someone else put it, it's a very good memory tester. While it is still possible to have data corrupted during a send, because of ZFS's end-to-end checksums and TCP checksums (though the latter might be short-circuited if the connections are on the same box), you will probably detect memory problems before that happens: normal ZFS usage will exercise the RAM enough that you will see various checksum errors / data read errors first. I have a SPARC box that ran S10 with UFS fine (though it was light duty); when I installed ZFS boot, the machine failed to boot. It turned out to be a bad memory stick, so I'll vouch that it's not a myth....
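                               A quick way to keep an eye on that (nothing fancy, just the standard tools):

                               zpool status -x                     # "all pools are healthy" unless errors have started accumulating
                               fmdump -e | grep zfs.checksum       # any logged ZFS checksum error reports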

                              -r