Troubleshooting I/O Load in an Oracle Solaris OpenStack Cloud


    by Dave Miner

     

    Tips for how to determine what's causing slow disk I/O in an Oracle OpenStack for Oracle Solaris cloud environment

     

    The other day we noticed that a number of guests running on our Oracle Solaris OpenStack cloud were very lethargic, with even simple operations like listing boot environments with beadm taking a long time, let alone more complex things like installing packages.  It was easy to rule out CPU load, as prstat quickly showed that load averages were not at all high on the compute nodes hosting my guests.
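
     

    For the record, that check amounts to a couple of quick commands on each compute node hosting the slow guests; a sketch of the sort of thing meant here:

     

    $ uptime            # load averages at a glance
    $ prstat -Z 5 5     # per-zone CPU summary over a few 5-second samples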

     

    The next suspect was disk I/O.  When you're running on bare metal with local disks, this is pretty easy to check; things like zpool status, zpool iostat, and iostat provide a pretty good high-level view of what's going on, and tools like prstat or truss might help identify the culprit.  In an OpenStack cloud it's not nearly that simple.  It's all virtual machines, the "disks" are ZFS volumes served up as LUNs by an iSCSI target, and with a couple hundred VMs running on a bunch of compute nodes, you're now searching for the needle(s) in a haystack.  Where to start?
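
     

    For comparison, here's roughly what that bare-metal quick look consists of (a sketch of the kind of checks meant above):

     

    $ zpool status -x       # only reports pools that aren't healthy
    $ zpool iostat -v 5     # per-pool and per-vdev throughput in 5-second samples
    $ iostat -xnz 5         # per-device service times and %busy, non-zero devices only
    $ prstat -mL 5          # per-thread microstates; who is actually waiting on I/O?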

     

    First I confirmed that we didn't have any funny business going on with the network.  We've had occasional issues with the ixgbe NICs negotiating down to 100 Mb with our switch, but dladm show-aggr -x aggr0 confirmed that wasn't happening on either the Cinder node or the compute nodes.  My next step was to look at I/O on the Cinder volume server.  We're not yet running this on an Oracle ZFS Storage Appliance; it's just an Oracle Solaris box with a pile of disks, so there's no analytics BUI to consult.  A quick look with zpool iostat showed we're pretty busy:

     

    $ zpool iostat 30
            capacity       operations    bandwidth
    pool   alloc   free   read  write   read  write
    -----  -----  -----  -----  -----  -----  -----
    rpool   224G  54.2G      0     46      0   204K
    tank   7.17T  25.3T  1.10K  5.53K  8.78M  92.4M

     

    But are we really that loaded as far as the disks are concerned?

     

    $ iostat -xnzc 30
         cpu
     us sy st id
      0 12  0 88
                        extended device statistics
        r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
       57.9  272.6  769.1 10579.3  0.0  4.7    0.0   14.1   0  88 c1t0d0
        0.0    8.9    0.0    78.1  0.0  0.0    0.0    2.1   0   1 c1t1d0
       54.6  256.9  759.7 10475.1  0.0  4.6    0.0   14.8   0  90 c1t2d0
       59.1  283.3  749.9 10070.2  0.0  4.3    0.0   12.5   0  87 c1t3d0
       76.9  224.9 1172.9 10275.5  0.0  5.4    0.0   18.0   0  96 c1t4d0
       51.7  259.8  784.0 10579.2  0.0  4.2    0.0   13.6   0  93 c1t5d0
       47.3  275.2  781.3 10109.4  0.0  4.3    0.0   13.4   0  93 c1t6d0
       49.6  262.0  766.2 10416.9  0.0  4.7    0.0   15.1   0  92 c1t7d0
       66.3  243.2 1163.7 10172.9  0.0  5.3    0.0   17.1   0  97 c1t8d0
       57.1  269.1  814.6 10584.6  0.0  4.2    0.0   13.0   0  90 c1t9d0
       62.3  272.2  820.1 10298.4  0.0  4.0    0.0   12.0   0  87 c1t10d0
       55.1  247.3  791.9 10558.6  0.0  4.6    0.0   15.3   0  91 c1t11d0
       63.5  221.6 1166.7 10559.2  0.0  5.7    0.0   20.0   0 100 c1t12d0
        0.0    9.1    0.0    78.1  0.0  0.0    0.0    4.9   0   1 c1t13d0

     

    Yeah, that looks heavy; all of the disks in the tank pool are at least 87% busy, and that didn't change much over several minutes of monitoring, so it's not just a brief burst of activity.  Note that I don't have continuous monitoring set up for this data, but past quick looks had shown we typically ran at about half those zpool iostat bandwidth numbers, with disks in the 50–60% busy range.  So we have an unusually heavy load; but where's it coming from, and is it expected activity from something like a source build, or something that's gone runaway?  This is where the problem gets challenging: we've got a bunch of compute nodes hammering this server over iSCSI, so we need to figure out which ones to look at.
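
     

    (As an aside, that missing baseline is cheap to fix; even a crude capture loop gives future-me something to compare against.  A sketch, with the log path and interval being arbitrary choices:)

     

    # Append a timestamped 30-second zpool iostat sample for tank on each pass;
    # cron or an SMF service would be tidier than a backgrounded loop.
    $ while :; do { date -u; zpool iostat tank 30 2 | tail -1; } >> /var/tmp/tank-iostat.log; done &

     

    Back to the immediate question, though: which compute nodes are generating all of this load?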

     

    Fortunately, my friends over at solarisinternals.com have a nice little DTrace script called iscsiio.d for breaking down iSCSI traffic at the target end.  Running it for less than a minute got this output:

     

    # dtrace -s /var/tmp/iscsiio.d
    Tracing... Hit Ctrl-C to end.
    ^C
       REMOTE IP                     EVENT    COUNT     Kbytes     KB/sec
       10.134.13.59                   read       56        238         11
       10.134.13.37                   read      631       2835        132
       10.134.13.58                   read      636       3719        173
       10.134.13.54                   read      671       3561        166
       10.134.13.59                  write      869       7330        342
       10.134.13.54                  write     2200      23153       1080
       10.134.13.70                   read     2273      13422        626
       10.134.13.33                  write     3730      69042       3221
       10.134.13.58                  write     4236      47745       2227
       10.134.13.37                  write     5997     108624       5068
       10.134.13.70                  write     6803     105302       4913
       10.134.13.56                   read     8152      45458       2121
       10.134.13.56                  write     9557     130043       6068
       10.134.13.33                   read    46666     372022      17359

     

    Wow, a lot of read traffic on that last one.  So I logged into that node and ran:

     

    $ iostat -xnzc 30
         cpu
     us sy st id
      0  4  0 96
                        extended device statistics
        r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
        0.0    4.3    0.0   27.6  0.0  0.0    0.0    2.3   0   0 c0t5000CCA03C31E530d0
        0.0    5.0    0.0   27.6  0.0  0.0    0.0    1.8   0   0 c0t5000CCA03C2DB384d0
        0.2    2.3    1.1    7.9  0.0  0.6    0.0  256.4   0  19 c0t600144F07DA70300000055DE259000B9d0
        0.2    2.4    1.1    5.4  0.0  1.1    0.0  421.1   0  24 c0t600144F07DA70300000056548EDA0080d0
        5.5    4.0 4543.6   18.8  0.0  3.5    0.0  363.5   0 100 c0t600144F07DA70300000055706B06000Dd0
        0.8   12.1    4.5   68.1  0.0  1.7    0.0  133.7   0  43 c0t600144F07DA7030000005655824D0085d0
        8.1    7.1 6127.3  443.2  0.0  6.4    0.0  417.8   0 100 c0t600144F07DA703000000555C7D48003Bd0
        0.9   11.6    4.7  110.0  0.0  2.0    0.0  157.8   0  58 c0t600144F07DA703000000564EB7950065d0
        0.0    0.7    0.0    2.1  0.0  0.4    0.0  550.8   0   7 c0t600144F07DA70300000056A9F12A0121d0
        0.2    2.2    1.1    6.2  0.0  0.5    0.0  215.4   0  12 c0t600144F07DA70300000056A8CB070112d0
        0.0    0.0    0.0    0.0  0.0  0.1    0.0    0.0   0   2 c0t600144F07DA703000000569558FE00BBd0
        0.2    2.4    1.1    6.5  0.0  0.2    0.0   75.6   0   7 c0t600144F07DA7030000005695379300B6d0
        0.4    4.9    2.3   12.3  0.0  0.9    0.0  163.0   0  34 c0t600144F07DA703000000569522F800B4d0
        0.4    2.8    2.3   14.6  0.0  0.5    0.0  158.9   0   7 c0t600144F07DA70300000056A1722900F4d0
        0.6    7.6    3.4   40.0  0.0  1.2    0.0  140.0   0  27 c0t600144F07DA70300000056A652E700FBd0
        0.2    2.6    1.1    6.6  0.0  0.0    0.0    0.6   0   0 c0t600144F07DA70300000056B02C5F0150d0

     

    Two of the LUNs are doing a lot of read traffic (and are sitting at 100% busy).  Which kernel zones are responsible?  Answering that requires help from OpenStack to map each LUN back to its guest.  Back on the target, we first get COMSTAR to tell us which volume a LUN maps to:

     

    $ stmfadm list-lu -v 600144F07DA703000000555C7D48003B
    LU Name: 600144F07DA703000000555C7D48003B
        Operational Status     : Online
        Provider Name          : sbd
        Alias                  : /dev/zvol/rdsk/tank/cinder/volume-ed69313a-8954-405b-b89d-a11e5723a88f
        View Entry Count       : 1
        Data File              : /dev/zvol/rdsk/tank/cinder/volume-ed69313a-8954-405b-b89d-a11e5723a88f
        Meta File              : not set
        Size                   : 107374182400
        Block Size             : 512
        Management URL         : not set
        Software ID            : not set
        Vendor ID              : SUN
        Product ID             : COMSTAR
        Serial Num             : not set
        Write Protect          : Disabled
        Write Cache Mode Select: Enabled
        Writeback Cache        : Enabled
        Access State           : Active
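
     

    If you find yourself doing this mapping often, the volume UUID can be pulled straight out of that Data File path; a throwaway one-liner along these lines does it (the awk split on "volume-" is just one way):

     

    $ stmfadm list-lu -v 600144F07DA703000000555C7D48003B | \
          awk -F 'volume-' '/Data File/ {print $2}'
    ed69313a-8954-405b-b89d-a11e5723a88f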

     

    Now that I have the volume name, I can feed the UUID portion of that name to Cinder:

     

    $ cinder show ed69313a-8954-405b-b89d-a11e5723a88f
    ...
    |              attachments              | [{u'device': u'c1d0',
    u'server_id': u'5751bd18-78aa-4387-be55-8e395aa081eb',
    u'id': u'ed69313a-8954-405b-b89d-a11e5723a88f',
    u'host_name': None, u'volume_id': u'ed69313a-8954-405b-b89d-a11e5723a88f'}] |

     

    (I've wrapped the above output to aid its readability.)  That maps to a server_id I can provide to Nova to identify the zone:

     

    $ nova show 5751bd18-78aa-4387-be55-8e395aa081eb

     

    I've elided that output as it's private info.  From there, I'm able to talk to the owner, use zlogin to inspect it, etc.
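
     

    That inspection step is just ordinary zones tooling on the compute node hosting the guest; a sketch, with instance-00000abc standing in for the guest's real zone name:

     

    # zoneadm list -cv | grep instance-00000abc    # confirm the zone lives on this node
    # zlogin instance-00000abc                     # then, from a shell inside the guest:
    # prstat -mL 5                                 # which threads are doing the work?
    # zpool iostat 10                              # how hard is the guest pushing its disks?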

     

    It turned out that the two problem zones were running a test OpenStack Swift deployment that seemed to have run away a bit; we shut them down and voilà, we were back to nominal performance for the rest of the cloud.

     


    About the Author

     

    Dave Miner, the architect for Oracle Solaris deployment and system management, also has an extensive background in networking. He spends much of his time operating Oracle Solaris Engineering's OpenStack cloud, and is coauthor of the OpenSolaris Bible (Wiley, 2009).

     

     

    Revision 1.0, 02/09/2016

     
