9 Replies Latest reply: Nov 8, 2011 9:52 AM by Rich Headrick-Oracle

    Performance monitoring on linux with collectl, tips and tricks?

    888491
      There was a previous post about a nifty tool based on collectl and, as the author of collectl, I thought I'd post a general note to ask whether anyone has any questions about using it. It has a zillion switches and options, and when coupled with colmux it can do a pretty decent job of monitoring clusters, giving you a 'top-like', sorted view of almost any Linux metric. Toss in 'colplot' for web-based plotting and things get even more interesting. Unfortunately, all this flexibility comes at the cost of complexity. I like to think that most of the simple things are pretty easy to do, and most people only do the simple things; the more complex ones can be pretty slick, but they are still tricky or even unknown.
      -mark
        • 1. Re: Performance monitoring on linux with collectl, tips and tricks?
          Marc Fielding
          Hi Mark,

          I do have a question for you: given that this is an Exadata forum, OSWatcher is the pre-installed and Oracle-supported OS statistic gathering tool for Exadata. Why would someone want to use collectl instead of OSWatcher in an Exadata context?

          Thanks!
          • 2. Re: Performance monitoring on linux with collectl, tips and tricks?
            888491
            good question ;)
            I posted my note specifically in response to Re: Exadata screen monitoring tool (OLL : Migrate 1TB in 20 minutes), which identified a really nifty monitoring tool that is clearly a hacked up version of collectl - note that I'm using 'hacked' as a good thing. From what little I saw of that tool, it looks like it explicitly hardcodes a few of the counters collectl provides. I guess the point I'm trying to make is that there are a lot of other counters that people might find complement what they see in that tool. Does that help?
            -mark
            • 3. Re: Performance monitoring on linux with collectl, tips and tricks?
              603349
              I think the main reason we built that tool was to see predetermined system metrics from a cluster of computers all on the same screen in near real-time. Using the socket option in collectl makes that possible. Not sure that is possible with OSWatcher.

              --
              Regards,
              Greg Rahn
              http://structureddata.org
              • 4. Re: Performance monitoring on linux with collectl, tips and tricks?
                888491
                yeah, that socket interface proved to be pretty useful, though most people don't take advantage of it. I've been able to use colmux to look at 'top' anything on clusters with as many as 1K nodes or more! I particularly like to occasionally monitor all my lun access times and look for the 'slow' disks.

                I'm curious: with your tool, do you have collectl write data to local logs and monitor more than what you display? I've seen situations where a system was slow not because of a disk or a network but because of a NIC generating too many interrupts and swamping a CPU. This is the type of thing you'd never see if you didn't collect interrupt stats too. ;) I really like to watch data in real time, but when I see something not right, I want to be able to go back to the original data and drill into it.

                one thing about monitoring: to be of any value you need data as close in time as possible, and that's typically not possible unless you use the same tool to collect it all. that's why collectl tries to synchronize its collection to the nearest microsecond, so even on clusters that run ntp all the samples are within a couple of msec of each other. the other thing is that I'm always surprised when something I never would have thought to be a problem turns out to be the problem, and that's why I tend to collect everything. then when something goes wrong I can always go back and look across all data on all nodes at the same time! in fact, with colmux I can play back the data from all nodes and sort on the top-n stats of a particular type.
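The idea of lining samples up on a shared clock can be sketched as below. This is only a minimal illustration, assuming each node sleeps until the next whole multiple of the sampling interval on its own (ntp-synchronized) wall clock. It is not collectl's actual implementation, and the names `next_boundary` and `sample_aligned` are hypothetical.

```python
import math
import time


def next_boundary(now: float, interval: float) -> float:
    """Return the first multiple of `interval` strictly after `now`."""
    return (math.floor(now / interval) + 1) * interval


def sample_aligned(collect, interval: float = 1.0, count: int = 3):
    """Call `collect()` at wall-clock multiples of `interval` seconds.

    `collect` is any zero-argument callable (hypothetical stand-in for
    reading a set of counters). Returns a list of (timestamp, value) pairs.
    """
    samples = []
    for _ in range(count):
        target = next_boundary(time.time(), interval)
        # Sleep until the boundary; clamp at zero in case we are already past it.
        time.sleep(max(0.0, target - time.time()))
        samples.append((time.time(), collect()))
    return samples
```

Because every node aims at the same boundaries rather than at "start time plus n intervals", the gap between samples from different nodes depends mostly on how well their clocks agree.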

                have you tried the newer colmux yet?

                -mark
                • 5. Re: Performance monitoring on linux with collectl, tips and tricks?
                  603349
                  All the data is read directly from the remote collectl socket and parsed and pretty printed with perl.
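Reading from a remote socket and parsing the lines might look roughly like the sketch below. The tool described here does this in perl; this Python version is only an illustration, and it assumes the remote end sends whitespace-separated text lines over TCP. The host, the port, and the function names are placeholders, not anything taken from the actual script.

```python
import socket


def parse_line(line: str):
    """Split one whitespace-separated line, converting numeric fields."""
    fields = []
    for token in line.split():
        try:
            fields.append(int(token))
        except ValueError:
            try:
                fields.append(float(token))
            except ValueError:
                fields.append(token)  # keep non-numeric tokens as strings
    return fields


def read_lines(host: str, port: int, max_lines: int = 10):
    """Read at most `max_lines` lines from a TCP text stream and yield
    the parsed fields of each non-blank line.

    Assumption: the server writes newline-terminated ASCII text.
    """
    with socket.create_connection((host, port), timeout=10) as sock:
        with sock.makefile("r", encoding="ascii", errors="replace") as stream:
            for i, line in enumerate(stream):
                if i >= max_lines:
                    break
                if line.strip():
                    yield parse_line(line)
```

Running one such reader per node and printing the results side by side gives the "all nodes on one screen" view.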

                  I have yet to take a look at the colmux stuff, though it certainly looks interesting. Thanks again for a great tool, Mark.

                  --
                  Regards,
                  Greg Rahn
                  http://structureddata.org
                  • 6. Re: Performance monitoring on linux with collectl, tips and tricks?
                    888491
                    glad you're enjoying collectl. for a quick test with colmux, install collectl-utils and try the command:

                    colmux -addr addr1,addr2... -command "-sD"

                    this will start collectl on each of the addresses listed and run the command "collectl -sD", which shows individual disk stats, sorted by column. to sort by column 4, just add "-column 4" to the colmux command itself, not to the -command part, which gets sent to collectl. if you want to change sort columns dynamically, just type the column number and press return while colmux is running.
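To make the "sorted by column" idea concrete, here is a small sketch of a top-like sort over rows gathered from several nodes. It only illustrates the concept and is not how colmux is implemented; the rows are made-up sample data and `sort_by_column` is a hypothetical helper.

```python
def sort_by_column(rows, column, descending=True):
    """Sort rows (lists of fields) numerically by a 1-based column index."""
    return sorted(rows, key=lambda row: float(row[column - 1]), reverse=descending)


# Made-up rows: host, disk name, and two numeric counters.
rows = [
    ["node1", "sda", 12, 340],
    ["node2", "sda", 95, 12],
    ["node3", "sdb", 40, 77],
]

# Highest value in column 4 first, as in a 'top'-style display.
top = sort_by_column(rows, 4)
for row in top:
    print(*row)
```

Changing the sort column on the fly is then just a matter of calling the same function with a different column number on the next refresh.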

                    once you get that far we can talk about additional tricks/options, if you're interested

                    -mark
                    • 7. Re: Performance monitoring on linux with collectl, tips and tricks?
                      Rich Headrick-Oracle
                      Here's what we've done with the tool. This was run on a 1/4 Exadata rack composed of 2 db nodes and 3 cell nodes. Sorry, the format doesn't paste well without courier fonts.

                      Current Time: Tue Oct 25 19:40:26 EDT 2011
                                 <-----------------Disks----------------->  <-----------------Flash----------------->  <----------CPU--------->  <----------Memory--------->
                                 MBRead  Reads  RSize MBWrit Writes  WSize  MBRead  Reads  RSize MBWrit Writes  WSize  User  Sys Wait  Irq  Run  FreeMB SwapMB   SwIn  SwOut
                      testcel01       0     11      1      0    113      0       0     16      5      0    151      0     0    0    0    0    0    9104      0      0      0
                      testcel02       0      3      1      0     14     11       0      0      0      0      0      0     0    0    0    0    1   13273      0      0      0
                      testcel03       0      3      2      0     15     11       0      0      0      0      0      0     0    0    0    0    1    6338      0      0      0
                      TotalIO: 0 MB/s; DiskRead: 0 MB/s; DiskWrite: 0 MB/s; FlashRead: 0 MB/s; FlashWrite: 0 MB/s; Average CPU: 0%;
                                 <----------CPU--------->  <-----------------Disks----------------->  <----------Memory--------->
                                 User  Sys Wait  Irq  Run  MBRead  Reads  RSize MBWrit Writes  WSize  FreeMB SwapMB   SwIn  SwOut
                      testldb01     1    0    0    0    1       0      0      0      0     25     10    4153      0      0      0
                      testdb02      0    0    0    0    0       0      0      0      0      5     14    6674      0      0      0
                      Average CPU: 1%;

                      Mark, can you ping me offline?

                      Edited by: Rich Headrick on Oct 25, 2011 4:48 PM
                      • 8. Re: Performance monitoring on linux with collectl, tips and tricks?
                        Rich Headrick-Oracle
                        Sorry, my last post didn't format well, but I think you get the idea.
                        • 9. Re: Performance monitoring on linux with collectl, tips and tricks?
                          Rich Headrick-Oracle
                          Sorry folks. Oracle has decided NOT to include collectl and the monitoring script with Exadata, for legal and other reasons beyond my control. I know some of you are pursuing development of the tool for your own purposes, so I have asked whether or not our script can be shared with you.

                          --Rich