4 Replies Latest reply: May 13, 2013 12:13 AM by Stlouis1

    how to tell which disk is where

    Stlouis1
      I'm having a really hard time with this now

      My understanding was that disk names work as c#, which indicates the controller, then t#, which indicates the port the disk is on, and so on. It turns out my understanding was wrong.

      I replaced two drives the other day, c9t5### and c9t6####. After I removed them and put in the new drives, the new ones showed up as c9t10 and c9t11, which is throwing me off.

      When I replaced them, I decided it would be wise to label the drive bays so I knew what was what. I used iostat -eEn to get the serial numbers and match them up.
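
      A minimal sketch of the serial-number matching described above, assuming the usual multi-line layout of iostat -En output (a device name line followed by a line containing "Serial No:") and a POSIX awk (nawk or /usr/xpg4/bin/awk on Solaris). The function name serial_map is made up for illustration.

```shell
# serial_map: hypothetical helper, not part of Solaris.
# Reads `iostat -En` style text on stdin and prints "<device> <serial>"
# for each disk, so the serials can be matched against bay labels.
serial_map() {
  awk '
    # A line starting with a cXtYdZ name opens a new device record.
    /^c[0-9]+t[0-9]+d[0-9]+/ { dev = $1 }
    # The serial number follows the "Serial No:" label on a later line.
    /Serial No:/ {
      sn = $0
      sub(/.*Serial No:[ \t]*/, "", sn)   # drop everything up to the label
      sub(/[ \t].*$/, "", sn)             # drop anything after the serial
      print dev, sn
    }
  '
}

# Intended use on the live system (not run here):
#   iostat -eEn | serial_map
```

      A device that reports no serial number comes out with an empty second column, so it is worth checking the list by eye before relabelling bays.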

      I really could use some help before I start losing data. The idea was to swap in larger drives to grow the pool, and I'm about halfway there now.

      I started by taking the disk offline:
      root@srv-data:~# zpool status tank
        pool: tank
       state: DEGRADED
      status: One or more devices has been taken offline by the administrator.
              Sufficient replicas exist for the pool to continue functioning in a
              degraded state.
      action: Online the device using 'zpool online' or replace the device with
              'zpool replace'.
        scan: resilvered 1.59M in 0h21m with 0 errors on Tue May  7 09:50:26 2013
      config:
      
              NAME           STATE     READ WRITE CKSUM
              tank           DEGRADED     0     0     0
                raidz2-0     DEGRADED     0     0     0
                  c9t2d0p0   OFFLINE      3     0     0
                  c9t3d0p0   ONLINE       0     0     0
                  c9t4d0p0   ONLINE       0     0     0
                  c9t7d0p0   ONLINE       0     0     0
                  c9t10d0p0  ONLINE       0     0     0
                  c9t11d0p0  ONLINE       0     0     0
      
      errors: No known data errors
      Then I removed it and put in the new disk, which became c9t12:
      root@srv-data:~# cfgadm -al
      Ap_Id                          Type         Receptacle   Occupant     Condition
      c8                             scsi-bus     connected    configured   unknown
      c8::dsk/c8t0d0                 disk         connected    configured   unknown
      c9                             scsi-sas     connected    configured   unknown
      c9::dsk/c9t3d0                 disk         connected    configured   unknown
      c9::dsk/c9t4d0                 disk         connected    configured   unknown
      c9::dsk/c9t7d0                 disk         connected    configured   unknown
      c9::dsk/c9t10d0                disk         connected    configured   unknown
      c9::dsk/c9t11d0                disk         connected    configured   unknown
      c9::dsk/c9t12d0                disk         connected    configured   unknown
      So I replaced the disk and checked the pool status:
      root@srv-data:~# zpool replace tank c9t2d0p0 c9t12d0p0
      root@srv-data:~# zpool status tank
        pool: tank
       state: DEGRADED
      status: One or more devices is currently being resilvered.  The pool will
              continue to function in a degraded state.
      action: Wait for the resilver to complete.
              Run 'zpool status -v' to see device specific details.
        scan: resilver in progress since Sat May 11 05:42:15 2013
          20.3M scanned out of 8.34T at 718K/s, (scan is slow, no estimated time)
          3.36M resilvered, 0.00% done
      config:
      
              NAME             STATE     READ WRITE CKSUM
              tank             DEGRADED     0     0     0
                raidz2-0       DEGRADED     0     0     0
                  replacing-0  DEGRADED     0     0     0
                    c9t2d0p0   OFFLINE      3     0     0
                    c9t12d0p0  DEGRADED     0     0     0  (resilvering)
                  c9t3d0p0     REMOVED      0     0     0
                  c9t4d0p0     ONLINE       0     0     0
                  c9t7d0p0     ONLINE       0     0     0
                  c9t10d0p0    ONLINE       0     0     0
                  c9t11d0p0    ONLINE       0     0     0
      
      errors: No known data errors
      But now c9t3 is showing REMOVED for no reason that I can tell: I don't see any error reported, or I'm not checking the right place. The drive seems to still be connected, because I can gather information from it using iostat -eEn.
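
      To spot a member like c9t3d0p0 dropping out without reading the whole listing, the config section of zpool status can be filtered. This is only a sketch: it assumes device lines look like the ones pasted above (name in the first column, state in the second) and a POSIX awk; unhealthy_devices is a made-up name.

```shell
# unhealthy_devices: hypothetical helper.
# Reads `zpool status` text on stdin and prints "<device> <state>"
# for every cXtYdZ member whose state is anything other than ONLINE.
unhealthy_devices() {
  awk '$1 ~ /^c[0-9]+t[0-9]+d[0-9]+/ && $2 != "ONLINE" { print $1, $2 }'
}

# Intended use on the live system (not run here):
#   zpool status tank | unhealthy_devices
```

      Run against the status output above, this would list the OFFLINE, DEGRADED and REMOVED members and skip the four ONLINE ones.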


      Edit: I just re-read this. I think I must have misread something, because I have now sorted out my understanding of the disk numbering, and what the page below explains coincides with what I'm seeing, so it makes sense.
      http://initialprogramload.blogspot.ca/2008/07/how-solaris-disk-device-names-work.html
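
      For reference, a small sketch that splits a name such as c9t12d0p0 into its fields: c is the controller, t the target, d the disk (LUN), and the optional suffix is either an fdisk partition (p) or a slice (s). The function name parse_dev is made up, and names without a t field (for example c0d0) are deliberately not handled.

```shell
# parse_dev: hypothetical helper that labels the fields of a Solaris
# cXtYdZ[pN|sN] device name. Prints nothing if the name does not match.
parse_dev() {
  printf '%s\n' "$1" | sed -n \
    -e 's/^c\([0-9][0-9]*\)t\([0-9][0-9]*\)d\([0-9][0-9]*\)p\([0-9][0-9]*\)$/controller=\1 target=\2 disk=\3 fdisk_partition=\4/p' \
    -e 's/^c\([0-9][0-9]*\)t\([0-9][0-9]*\)d\([0-9][0-9]*\)s\([0-9][0-9]*\)$/controller=\1 target=\2 disk=\3 slice=\4/p' \
    -e 's/^c\([0-9][0-9]*\)t\([0-9][0-9]*\)d\([0-9][0-9]*\)$/controller=\1 target=\2 disk=\3/p'
}

# Example:
#   parse_dev c9t12d0p0
```

      The target number is whatever the driver assigns, so as this thread shows it does not have to match the physical bay.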

      Now I just need to figure out why that one drive removed itself. I checked fmadm faulty and there's nothing there other than the fact that I replaced a disk.

      Edited by: Stlouis1 on May 11, 2013 7:39 AM
        • 1. Re: how to tell which disk is where
          Cindys-Oracle
          Hi--

          What kind of hardware is this? I've seen this on our own Oracle Sun hardware with SATA disks and the older mpt driver. For example, disk c11t5d0 failed and the replacement disk became c11t36d0 because the driver just rescans the existing list and adds the replacement disk to the end. Yes, it's confusing and unexpected, but it's normal for this type of hardware and driver. I don't know why your c9t3d0p0 is REMOVED now. Can you reseat it?

          A couple of other comments:

          1. If you are physically replacing disks with larger disks, then do one at a time, and issue the zpool replace command.
          Then make sure it resilvers before moving on to the next disk. The ZFS Admin Guide covers this process with a specific
          example for SATA disks that do need to be offlined first.

          2. Your pool is built on p* devices, which are the fdisk partitions. You should build your pools on the d* devices.
          A couple of Solaris docs showed these examples incorrectly. The ZFS Admin Guide and zpool.1m do not and never did. You won't have any operational problems, but some people misunderstood and built pools with overlapping devices, like c8t1d0p0 and then also c8t1d0s0.
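
          As a rough illustration of the overlap problem in point 2, the sketch below flags any disk that appears more than once in a list of device names once the p* or s* suffix is stripped. The function name overlapping_disks is made up, and a hit only means "look closer": two distinct slices of one disk need not overlap, whereas p0 (the whole disk) overlaps any slice on it.

```shell
# overlapping_disks: hypothetical helper.
# Takes device names as arguments, strips a trailing pN or sN suffix,
# and prints each underlying disk that was named more than once.
overlapping_disks() {
  printf '%s\n' "$@" | sed 's/[ps][0-9][0-9]*$//' | sort | uniq -d
}

# Example:
#   overlapping_disks c8t1d0p0 c8t1d0s0 c9t3d0p0
```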

          Thanks, Cindy
          • 2. Re: how to tell which disk is where
            Stlouis1
            it is slightly confusing and unexpected, but at least I can recognize the behaviour

            To be honest, when drive 3 was showing removed I thought I had pulled the wrong drive and then issued the replace command against the wrong drive yet again. I checked and rechecked, and reseated that drive several times, and it would not come back online.

            I have been doing one drive at a time

            This Solaris system is running as a file server on an ESXi 5.1 host. I found myself installing update 1 for VMware at 5am this morning, and since the reboot the drive I was having problems with is showing online again. Not sure what happened, but it's working, and hopefully I'll be able to get the other 3 drives changed over next weekend.
            • 3. Re: how to tell which disk is where
              Cindys-Oracle
              I have no experience with VMware but I've seen devices from virtual layers go offline in various ways that are surprising.
              Make sure the pool devices are stable when you attempt to replace the devices and do one at a time as I directed.
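
              The one-at-a-time procedure can be sketched roughly as follows. It assumes that zpool status prints "resilver in progress" while a resilver is running, as in the output earlier in this thread; the function names and the 60-second default polling interval are made up for illustration, and the device names in the comments are the ones from this thread.

```shell
# resilver_in_progress: hypothetical helper.
# Succeeds (exit 0) if the `zpool status` text on stdin reports an
# active resilver, fails otherwise.
resilver_in_progress() {
  grep 'resilver in progress' >/dev/null
}

# wait_for_resilver: hypothetical helper.
# Polls `zpool status <pool>` until no resilver is reported.
# $1 = pool name, $2 = seconds between checks (default 60).
wait_for_resilver() {
  pool=$1
  interval=${2:-60}
  while zpool status "$pool" | resilver_in_progress; do
    sleep "$interval"
  done
}

# Sequence for a single disk (not run here):
#   zpool offline tank c9t2d0p0            # where the disk must be offlined first
#   ... swap the drive, find its new name with: cfgadm -al
#   zpool replace tank c9t2d0p0 c9t12d0p0
#   wait_for_resilver tank
#   zpool status tank                      # confirm every device is ONLINE before the next swap
```

              The loop only detects that the resilver has stopped, not that it succeeded, so the final zpool status check is still needed before touching the next disk.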

              Thanks, Cindy
              • 4. Re: how to tell which disk is where
                Stlouis1
                1 at a time is just what I've been doing.

                I'm hoping the VMware update addresses a couple of issues, though. I skimmed through the list of things it addresses, and one of them had something to do with LSI controllers, so hopefully I don't see that happen again.

                I'm also hoping the other issue with CPU usage goes away. I've been having to reset the Solaris VM periodically because CPU usage was spiking to 100% on the VM when Solaris itself was only using 3-4%.

                Either way, things seem to be in order for the moment. I suppose I can mark this thread as answered