5 Replies, latest reply Feb 7, 2012 5:18 PM by BryanWood

RDAC on RHEL5.5 with 6180

903538 Newbie
Hi all,
My platform is:
RHEL 5.5 installed on an X6270 M2 blade inside a Sun Blade 6000 chassis. The blade has two dual-port fiber HBAs connected to two different fiber switches, on which ports 0 and 1 are dedicated to the 6180.
As I understand it, in order to set up multipathing correctly on RHEL connected to a dual-controller StorageTek 6180, we need to download and install the "RDAC" package on RHEL and then configure and edit some files to get the best IO and redundancy across the storage controllers.
The issue I am facing is:
Whenever I test the redundancy and availability of the mapped volumes by pulling a fiber cable from one of the server's HBAs, the path takes a long time (approximately 60 seconds) to fail over to the other path!
I can tell the failover is taking that long because running "fdisk" in RHEL takes a long time to list the disks.
Any help will be highly appreciated.
Many thanks in advance.
  • 1. Re: RDAC on RHEL5.5 with 6180
    BryanWood Explorer
    60 seconds is probably the SCSI timeout. You can confirm this with:
    root# cat /sys/block/sdac/device/timeout   ## replace sdac with your device name
    60
    root# echo 30 > /sys/block/sdac/device/timeout
    root# cat /sys/block/sdac/device/timeout
    30
    root# 
    CAUTION: lowering the SCSI device timeout can have adverse effects on IO delivery when your IO path is under load. There is no harm in lowering the value temporarily as a test to confirm it is the source of the 60 second delay you are experiencing. I wouldn't recommend going below 30 seconds, and to be honest the default value was chosen for a good reason.

    You could create a udev rule for [re]setting this value automatically upon every system boot:
    root# more /etc/udev/rules.d/50-udev.rules
    ACTION=="add", SUBSYSTEM=="scsi" , SYSFS{type}=="0|7|14", \
    RUN+="/bin/sh -c 'echo 45 > /sys$$DEVPATH/timeout'"
    Another parameter to be aware of: for FC failures between your HBA and the switch, you can configure how long the host bus adapter (HBA) driver waits before declaring a lost device failed. Assuming you have Emulex HBAs:
    root# grep -v test /sys/class/scsi_host/*/lpfc_nodev_tmo
    host3: 60
    root# echo 30 > /sys/class/scsi_host/host3/lpfc_nodev_tmo
    root# grep -v test /sys/class/scsi_host/*/lpfc_nodev_tmo
    host3: 30
    NOTE: if you're pulling cables between the switch and the array, where the link between your RHEL5 host and the switch is unaffected, lpfc_nodev_tmo may not have any effect since your HBA is not losing signal. In that scenario, the SCSI timeout is your sole recourse for handling the failure.
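    If your HBAs turn out to be QLogic rather than Emulex, the equivalent knobs (from memory, so double-check against your qla2xxx driver documentation) are the per-remote-port dev_loss_tmo and the qlport_down_retry module option rather than a per-host lpfc parameter:
    root# grep . /sys/class/fc_remote_ports/rport-*/dev_loss_tmo          ## per-target link-loss timeout
    root# echo 30 > /sys/class/fc_remote_ports/rport-3:0-0/dev_loss_tmo   ## rport name is just an example
    root# echo "options qla2xxx qlport_down_retry=30" >> /etc/modprobe.conf   ## persist across driver reloads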

    Reply back with your HBA model, the outputs of the above grep commands, and relevant /var/log/messages entries from your cable pull tests.


    Best Regards,
    Bryan Wood

    Edited by: BryanWood on Jan 26, 2012 4:11 PM
  • 2. Re: RDAC on RHEL5.5 with 6180
    903538 Newbie
    Hello BryanWood,
    Thank you for the reply, and sorry for being late :)
    I checked the commands and grep output you mentioned, and everything looks fine as you stated. I am using QLogic HBAs.
    Well, let's assume one of my FC HBA ports has failed, knowing that this port carries an initiator through which three DB volumes are mapped. Does that mean data destined for these volumes will not be written to disk for 60 seconds, until the volumes are remapped through the other initiator (the one created on the second HBA port)?! What is the benefit of multipathing if that is going to happen and, as a consequence, the DB ends up out of sync and degraded?

    Edited by: 900535 on Jan 31, 2012 6:40 AM
  • 3. Re: RDAC on RHEL5.5 with 6180
    BryanWood Explorer
    Thank you for the reply, and sorry for being late :)
    Very glad to help!
    I checked the commands and grep output you mentioned, and everything looks fine as you stated. I am using QLogic HBAs.
    Well, let's assume one of my FC HBA ports has failed, knowing that this port carries an initiator through which three DB volumes are mapped. Does that mean data destined for these volumes will not be written to disk for 60 seconds, until the volumes are remapped through the other initiator (the one created on the second HBA port)?!
    Oracle writes to disk asynchronously, and only the IO that gets routed to the suspect path will get queued while the IO to the remaining path(s) will continue without problem. Once the suspect path is marked as failed, the IO that was previously queued there will get re-routed to a surviving path. The volume is still available on the good path during this timeframe, and good path managers use heuristics such as queue length in determining where to route traffic, thereby minimizing the impact of a path failure.
    What is the benefit of multipathing if that is going to happen and, as a consequence, the DB ends up out of sync and degraded?
    All asynchronous IO that is simultaneously in flight at any given time has already been "blessed" by the RDBMS, regardless of how or what order it gets delivered. When the order of delivery matters, Oracle will issue an aiowait() system call, which means the asynchronous IO that was previously in flight must be acknowledged before continuing. Rest assured that the database will call aiowait() to confirm those IO have completed before it takes any action that might have resulted in the database becoming out of sync. Async IO (aiowrite, aioread) is a huge performance improvement over synchronous IO (pread, pwrite), the latter of which waits for each and every IO before proceeding. Because Oracle uses Async IO, a good measure of work can still be accomplished during the interval of time we're waiting for the SCSI timeout.
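    As a side note, on Linux those async calls show up as io_submit()/io_getevents() at the OS level rather than the aiowrite()/aiowait() names used above. Once your database is up you can watch them yourself if you're curious (the PID below is just a placeholder for your DBWR process):
    root# ps -ef | grep dbw0_                                 ## locate the database writer process
    root# strace -e trace=io_submit,io_getevents -p 12345     ## 12345 = your DBWR pid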


    There are only a few places where you would observe an issue, and only a subset of user sessions would observe the delay. Here are a couple of the more common scenarios:

    - Sessions presently attempting to commit data (log buffer flush), or in extreme situations simply modify data when no log buffer space is available pending the flush.
    - A session reading a block from disk that was not found in the buffer cache.

    Other sessions that are reading blocks still located in the block buffer cache would not even notice the delay. So long as the delay is resolved within 600 seconds, and you're not running RAC, the database will patiently wait for the timeout. If you are running RAC, it is possible the cluster misscount tuning may result in a node eviction, which may or may not be the desired action depending on your point of view.
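    If you do eventually run RAC, the current misscount can be read from the clusterware command line (10g/11g syntax, from memory):
    root# crsctl get css misscount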

    The database will not make a change that might compromise the integrity of your data without first confirming that the prerequisite IO resides on disk. The only exceptions would be:

    - If non-POSIX-compliant storage is used that caches writes, ignoring Oracle's explicit request for Direct IO and falsely telling Oracle the IO is persisted on disk when in fact it is not.
    - If a submitted IO is "lost" by the underlying OS, driver, and/or storage array. In my entire career of 18+ years, I've heard lost write theorized at least 50 times but only proven twice.

    Best Regards,
    Bryan Wood
  • 4. Re: RDAC on RHEL5.5 with 6180
    903538 Newbie
    Thank you again,
    (sorry, my English is not very good :))
    Actually, I haven't installed and configured my Oracle DB yet.
    What I thought about multipathing is that whenever a path fails, another path should handle the IO to the volume so that the path failure doesn't affect the data flow. But what I understand now is that multipathing is just a passive mechanism that moves the data to the other available path after 60 seconds.
    Right?
    By the way, I tried to FTP some data to the SAN volume and disconnected one of the fiber cables (the volume on the 6180 was mounted through controller A; I disconnected the fiber cable that connects controller A to the server). When the cable was disconnected, the data flow stopped for approximately 60 seconds and then continued normally. When I reconnected the fiber cable, I didn't see any pause in IO, so I am not sure whether failback happened.
  • 5. Re: RDAC on RHEL5.5 with 6180
    BryanWood Explorer
    What I thought about multipathing is that whenever a path fails, another path should handle the IO to the volume so that the path failure doesn't affect the data flow. But what I understand now is that multipathing is just a passive mechanism that moves the data to the other available path after 60 seconds.
    Right?
    By the way, I tried to FTP some data to the SAN volume and disconnected one of the fiber cables (the volume on the 6180 was mounted through controller A; I disconnected the fiber cable that connects controller A to the server). When the cable was disconnected, the data flow stopped for approximately 60 seconds and then continued normally.
    The 6180 is a traditional Active/Passive array, which means that some of your device paths will not be servicing IO.

    If you have only two paths to storage, one will most likely be Active and the other Passive for a given LUN. Having said that, you may have another LUN that is Active for the other path, and Passive for this path. This is known as LUN spreading (poor man's load balancing), to allow maximum use of the available hardware. For a given LUN, IO will only use a single path. If that path fails, RDAC will send a special command called "AVT" (automated volume transfer) through the surviving path to the Passive controller telling it to take ownership of the LUN. After the AVT, your IO then resumes on the promoted path.

    If you have four paths to storage, you most likely have two Active paths and two Passive paths. In this case, RDAC will route traffic to both Active paths simultaneously, while the remaining two Passive paths will not receive any IO. If one of the two Active paths fails in this scenario, RDAC will continue to use the surviving Active path without requiring any AVT operation.
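    If you want to see how many physical paths the MPP/RDAC driver has discovered per controller, its /proc interface is a quick way to look (the array name below is a placeholder, and the exact layout can vary between RDAC releases):
    root# ls /proc/mpp/                         ## one directory per virtual array
    MyArray
    root# ls /proc/mpp/MyArray/controllerA/     ## one entry per physical path through controller A
    root# ls /proc/mpp/MyArray/controllerB/     ## and through controller B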
    When I reconnected the fiber cable, I didn't see any pause in IO, so I am not sure whether failback happened.
    Simply reconnecting the old path(s) does not necessarily cause ownership of the LUN to move back to the original controller. There is an option for automatic failback if you want that behavior (see DisableLUNRebalance in /etc/mpp.conf). Even with that feature enabled, IOs are gracefully quiesced during the AVT operation, which is nowhere near as disruptive as your first test of yanking a cable and forcing SCSI to eventually time out the IOs in flight (see more discussion below). A manual AVT can also be initiated if you prefer to schedule a time of lighter load to move LUN ownership back to its original (default) controller.
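    For example, something along these lines; the value shown is only illustrative, and the defaults and accepted values differ between RDAC releases, so check the README that ships with your driver:
    root# grep DisableLUNRebalance /etc/mpp.conf
    DisableLUNRebalance=0                 ## 0 = allow automatic LUN failback (illustrative value)
    root# mppUpdate                       ## rebuild the MPP ramdisk so the change persists across reboots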

    In any case, when you induce a failure on one path, the IOs in flight on the failed path (if any) are held until a SCSI timeout occurs, after which those IOs are resubmitted on the surviving path(s). If no remaining paths are Active, RDAC will issue an AVT command through a Passive path to regain access to the LUN. This is known as a "Preferred Path" policy:
    root# mppUtil -g 0
     [..]
    Lun Information
    ---------------
        Lun #0 - WWN: 600a0b80003ac1f800000bde4a7bab80
        ----------------
           LunObject: present                                 CurrentOwningPath: A     <-- current owner
      RemoveEligible: N                                          BootOwningPath: A
       NotConfigured: N                                           PreferredPath: A     <--- "preferred owner"
            DevState: OPTIMAL                                   ReportedPresent: Y
                                                                ReportedMissing: N
                                                          NeedsReservationCheck: N
                                                                      TASBitSet: N
                                                                       NotReady: N
                                                                           Busy: N
                                                                      Quiescent: N
    
        Controller 'A' Path
        --------------------
       NumLunObjects: 2                                         RoundRobinIndex: 0
             Path #1: LunPathDevice: present
                           DevState: OPTIMAL
                        RemoveState: 0x0  StartState: 0x1  PowerState: 0x0
             Path #2: LunPathDevice: present
                           DevState: OPTIMAL
                        RemoveState: 0x0  StartState: 0x1  PowerState: 0x0
    
        Controller 'B' Path
        --------------------
       NumLunObjects: 2                                         RoundRobinIndex: 0
             Path #1: LunPathDevice: present
                           DevState: OPTIMAL
                        RemoveState: 0x0  StartState: 0x1  PowerState: 0x0
             Path #2: LunPathDevice: present
                           DevState: OPTIMAL
                        RemoveState: 0x0  StartState: 0x1  PowerState: 0x0
    It takes a failed IO (a timeout is one type of failure) to mark a path as "failed", and while we are waiting for the oldest IO to time out, more IO may stack onto the path that will ultimately be marked as failed. As a result, under heavy load it may sometimes take more than 60 seconds. RDAC is an aging path manager and as such has limited path-selection capabilities (failover or round-robin, IIRC). More modern path managers offer options such as "minimum queue", where a bottlenecked path will not be chosen, reducing the backlog of work once the path does finally go disabled.

    It looks like RDAC is the only supported choice with the 6180.

    Back to my original response, you can either live with the 60 second delay (the default value, and the recommended one), tune the block device timeout value to something less than 60 seconds (try not to go below 30 seconds), and/or tune the HBA driver nodev timeout setting if applicable.

    To see which controller currently owns your LUN, re-run mppUtil as shown above and observe the value of "CurrentOwningPath". Also check out the "mppUtil -S" output.
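    For a quick check without wading through the full output, something like this filters the relevant fields:
    root# mppUtil -g 0 | grep -E 'CurrentOwningPath|PreferredPath'
    root# mppUtil -S                      ## overall array and path status summary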

    Please remember to mark the question as answered if a participant has adequately helped with the question/issue.

    Best Regards,
    Bryan Wood
