8 Replies Latest reply: Apr 9, 2011 5:47 PM by User101-Oracle

    dtrace and io problem

    831265
      Hi,

      We have an M4000 which is connected to numerous HDS USPV1 LUNs (not sure of the HDS configuration).

      The problem is that when a database load occurs the response time goes from 50ms up to 100ms. There is no iowait, the load average is low, and there are no CPU or memory problems, but %b in iostat is constantly at 99-100%.

      I'm told the maximum number of IOPS from an HDS LUN is 720 (and there are 15 LUNs presented to zpools).

      I turned to dtrace in the hope of finding something that would explain the problem, but after reading countless blogs and PDFs I'm still at a loss. I downloaded the DTraceToolkit and ran iotop, iosnoop, etc., which produced a lot of info (in fact too much info) and didn't really show what the problem was (although I didn't think it would be that easy).

      In fact when running iosnoop/iowait I received the following:

      dtrace: 679 dynamic variable drops
      dtrace: 1 dynamic variable drop with non-empty rinsing list
      dtrace: 664 dynamic variable drops with non-empty dirty list
      dtrace: 1712 dynamic variable drops
      dtrace: 2074 dynamic variable drops
      dtrace: 2123 dynamic variable drops

      I'll be investigating and reading up further on dtrace, but in the meantime, if anyone could point me in the direction of a dtrace command which could enlighten me as to the IO problem it would be appreciated. I appreciate it's a far-reaching question, but anything would help!



      Thanks.
        • 1. Re: dtrace and io problem
          YoungWinston
          user13465954 wrote:
          I'll be investigating and reading up further on dtrace but in the meantime if anyone could point me in the direction of a dtrace command which could enlighten me as to the IO problem it would be appreciated.
          And this has to do with Java ... How exactly?

          Winston
          • 2. Re: dtrace and io problem
            rukbat
            Moved from the Java Programming forum to the Dtrace Forum
            • 3. Re: dtrace and io problem
              Nik
              Hi.
               Can you show the iostat output when the database shows normal and bad response times?

              Regards.
              • 4. Re: dtrace and io problem
                gleng
                I've found the following syntax useful:

                ~/DTraceToolkit/DTraceToolkit-0.99/iosnoop -o -m /u06 -vN

                (Note: this was on a V490 and LUNs on a 6140 SAN.)

                GlenG
                • 5. Re: dtrace and io problem
                  gleng
                  It's my understanding that:

                  kstat -m cpu_stat | grep "iowait "

                   will show the number of iowaits by CPU (this is the number of I/Os waiting on each CPU when the command is run; I see mostly 0s on my V490 Oracle DB server, but if I recall the command and run it a dozen times I can usually get a 1 for one or two CPUs (there are 8 CPUs in the box)).

                  I do not remember where I found the above command but it came with this explanation:

                  "This field is incremented whenever biowait() is called and decremented before return. boiwait() is documented in biowait(9f)."

                  GlenG
                  • 6. Re: dtrace and io problem
                    800381
                    It's hard to diagnose IO performance issues across the internet, but here are some things you need to be looking at:

                    1. When you're seeing your IO performance problem, what does the output of "iostat -sndxz 2" look like?
                    2. How are the LUNs laid out on your storage?
                    3. Do multiple LUNs share the same physical disks (bad for performance)?
                    4. Are your IO operations aligned with the LUN blocksize?
                    5. What kind(s) of LUNs do you have? RAID-1? RAID-5?
                    6. What kind of disks? SATA? FC? SAS?

                     It's not that hard to take a supposedly high-performance disk system and make it run really slowly. Something like a lot of really small random writes to several LUNs, built with large block sizes and all sharing the same drives in a RAID-5 array, is a really good way to do just that.
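
                     To help answer the blocksize/alignment questions above, a rough io-provider sketch like this shows the IO size mix per device, split by read/write (illustrative only; dev_statname, b_flags and b_bcount are the standard io provider translator fields):

                     #!/usr/sbin/dtrace -s
                     /* distribution of IO sizes per device, split by read/write */
                     io:::start
                     {
                             @sizes[args[1]->dev_statname,
                                 args[0]->b_flags & B_READ ? "read" : "write"] =
                                 quantize(args[0]->b_bcount);
                     }

                     If the distribution is dominated by small writes on LUNs built with a large stripe/block size, that lines up with the scenario described above.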
                    • 7. Re: dtrace and io problem
                      831265
                       Thanks for the replies and the example commands; I appreciate it's almost impossible to diagnose over a message board.

                       I still need to do some serious digging and reading on dtrace to find the reason for the slow performance, but I will post a solution if I ever find it.


                      Thanks again.
                      • 8. Re: dtrace and io problem
                        User101-Oracle
                         Thanks, I'll try and go through them one by one...

                         The %b/busy column in iostat is commonly misunderstood. There is a good document in the My Oracle Support knowledge base explaining why busy is at 100%. It's hard to generalise, but if %b is at 100% with few other signs of issues it normally means the application (the database in this case) is not fully utilising the hardware, e.g. doing stuff serially instead of in parallel.

                        What Does %b (or %Busy) Actually Mean in the Output of iostat(1M)? [ID 1003635.1]
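
                         One hedged way to test the serial-vs-parallel theory with DTrace is to watch how many IOs are outstanding per device. This is only a sketch using the io provider, not a tuned script:

                         #!/usr/sbin/dtrace -s
                         /* track IOs outstanding per device; if the maximum stays
                            around 1, the workload is essentially issuing IO serially */
                         io:::start
                         {
                                 outstanding[args[1]->dev_statname]++;
                                 @maxq[args[1]->dev_statname] =
                                     max(outstanding[args[1]->dev_statname]);
                         }

                         io:::done
                         /outstanding[args[1]->dev_statname] > 0/
                         {
                                 outstanding[args[1]->dev_statname]--;
                         }

                         If the maximum stays at 1 per device while %b sits at 100%, that points at the application issuing IO serially rather than the array being saturated.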

                         The 'dtrace: xxxx dynamic variable drops' messages are associated with the internal dtrace buffers overfilling. Off the top of my head, the easiest ways to address this are to reduce the amount of data you are logging, increase the buffer size, or increase the rate at which these buffers are flushed. Try looking at the following two documents; you may be interested in bufsize, dynvarsize and cleanrate. There's a rough sketch after the links showing where those options go.

                        http://wikis.sun.com/display/DTrace/Buffers+and+Buffering
                        http://wikis.sun.com/display/DTrace/Options+and+Tunables
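
                         As a rough illustration of where those options go (the values below are guesses, not recommendations), a per-IO latency script like this keeps one dynamic variable per outstanding IO, which is exactly the sort of thing that gets dropped when dynvarsize is too small:

                         #!/usr/sbin/dtrace -s
                         #pragma D option bufsize=8m
                         #pragma D option dynvarsize=64m
                         #pragma D option cleanrate=303hz

                         /* per-device IO latency distribution */
                         io:::start
                         {
                                 ts[arg0] = timestamp;
                         }

                         io:::done
                         /ts[arg0]/
                         {
                                 @lat[args[1]->dev_statname] = quantize(timestamp - ts[arg0]);
                                 ts[arg0] = 0;   /* release the dynamic variable */
                         }

                         You can bump the numbers until the drop messages go away, at the cost of more memory.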

                         No idea where 720 IOPS came from. A very rough rule of thumb is that if you are getting 500 IOPS from a standard disk you are doing well. With some workloads you may only see 150 IOPS. If you look at the specs for some of these arrays you are talking 100k IOPS, if not more.

                         Stuff changes all the time, but last time I looked the response times within iostat are the amount of time an IO is on the wire PLUS the time to complete that IO in software PLUS the time to schedule the next IO. So get the numbers iostat reports and compare them with those the array is reporting. Are they the same (array saturated?), different (software issue?), or something else?

                        Anyway, play with dtrace and see what you can work out.

                        All rather interesting for a nerd.