I do have a question for you: given that this is an Exadata forum, OSWatcher is the pre-installed, Oracle-supported OS statistics-gathering tool for Exadata. Why would someone want to use collectl instead of OSWatcher in an Exadata context?
good question ;)
I posted my note specifically in response to Re: Exadata screen monitoring tool (OLL : Migrate 1TB in 20 minutes), which identified a really nifty monitoring tool that is clearly a hacked-up version of collectl - note that I'm using 'hacked' as a good thing. The point I'm trying to make is that, from what little I saw of that tool, it looks like it explicitly hardcodes a few of the counters collectl provides, and there are a lot of other counters people might find complement what they see in that tool. Does that help?
I think the main reason we built that tool was to see predetermined system metrics from a cluster of computers all on the same screen in near real-time. Using the socket option in collectl makes that possible. Not sure that is possible with OSWatcher.
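The "many nodes, one screen" pattern the socket option enables can be sketched roughly as below. This is a hypothetical illustration in Python, not collectl's real wire protocol: the fake node threads, port numbers, and metric strings are all made up; a real per-node collectl in server mode would serve its actual counters.

```python
# Hypothetical sketch: each "node" serves one metrics line over TCP,
# and an aggregator pulls every node's line onto one screen.
# Ports, hostnames, and the metrics format are invented for illustration.
import socket
import threading

def fake_node(name, port, ready):
    """Stand-in for a per-node collector listening on a socket."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    ready.set()                      # tell the main thread we're listening
    conn, _ = srv.accept()
    conn.sendall(f"{name} cpu=12% disk=3MB/s\n".encode())
    conn.close()
    srv.close()

def aggregate(addrs):
    """Stand-in for the aggregator: read one line per node for display."""
    lines = []
    for host, port in addrs:
        with socket.create_connection((host, port)) as s:
            lines.append(s.makefile().readline().rstrip())
    return lines

ports = {"testcel01": 42301, "testcel02": 42302}
for name, port in ports.items():
    ready = threading.Event()
    threading.Thread(target=fake_node, args=(name, port, ready), daemon=True).start()
    ready.wait()                     # don't connect before the node is listening

collected = aggregate([("127.0.0.1", p) for p in ports.values()])
for line in collected:
    print(line)
```

The design point is simply that pull-over-socket lets one screen stay current across a whole cluster without log shipping.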
yeah, that socket interface has proved to be pretty useful, though most people don't take advantage of it. I've been able to use colmux to look at 'top anything' across clusters with as many as 1K nodes or more! I particularly like to occasionally monitor all my LUN access times and look for the 'slow' disks.
I'm curious: using your tool, do you have collectl write data to local logs and monitor more than what you display? I've seen situations where a system was slow not because of a disk or a network but because of a NIC generating too many interrupts and swamping a CPU. This is the type of thing you'd never see if you didn't collect interrupt stats too. ;) I really like to watch data in real time, but when I see something that's not right, I want to be able to go back to the original data and drill into it.
One thing about monitoring: for the data to be of any value you need samples as close together in time as possible, and that's typically not possible unless you use the same tool to collect it all. That's why collectl tries to synchronize its collection to the nearest microsecond, so even on clusters that run NTP all the samples are within a couple of milliseconds of each other. The other thing that always surprises me is seeing something I never would have thought to be a problem turn out to be the problem, and that's why I tend to collect everything. Then when something goes wrong I can always go back and look across all the data on all nodes at the same time! In fact, with colmux I can play back the data from all nodes and sort on the top-n stats of a particular type.
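The idea of synchronized collection can be sketched as follows: instead of sleeping a fixed delta (which drifts), each node sleeps until the next exact multiple of the interval on the wall clock, so NTP-synced nodes sample at nearly the same instants. This is an illustrative sketch of the technique, not collectl's actual implementation; the function names are mine.

```python
# Sketch of interval-aligned sampling: every node that runs this loop
# with a synced clock takes its samples on the same wall-clock boundaries.
import time

def next_aligned_deadline(interval, now=None):
    """Next wall-clock instant that is an exact multiple of `interval` seconds."""
    now = time.time() if now is None else now
    return (int(now // interval) + 1) * interval

def sample_loop(interval, samples, collect):
    """Take `samples` readings, each on an interval boundary, not a drifting delta."""
    results = []
    for _ in range(samples):
        deadline = next_aligned_deadline(interval)
        time.sleep(max(0.0, deadline - time.time()))
        results.append((deadline, collect()))
    return results

# Example: three samples on 0.2-second wall-clock boundaries.
data = sample_loop(0.2, 3, lambda: "metrics snapshot")
print(len(data), "samples, each taken on a 0.2s boundary")
```

Because the deadlines are computed from the clock rather than from the previous wake-up, any scheduling jitter in one sample doesn't accumulate into the next.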
have you tried the newer colmux yet?
All the data is read directly from the remote collectl socket and parsed and pretty printed with perl.
I have yet to take a look at the colmux stuff, though it certainly looks interesting. Thanks again for a great tool, Mark.
glad you're enjoying collectl. for a quick test with colmux, install collectl-utils and try the command:
colmux -addr addr1,addr2... -command "-sD"
this will start collectl running on the addresses listed and run the command "collectl -sD", which shows individual disk stats sorted by column. To sort by column 4, just add "column 4" to the colmux part of the command line, not the -command part, which gets sent to collectl. If you want to change sort columns dynamically, just type the column number followed by return while colmux is running.
once you get that far we can talk about additional tricks/options, if you're interested
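The sort-by-column behavior described above boils down to something like this toy sketch: parse whitespace-separated rows (hostname first) and re-sort on any numeric column. The rows, column numbering, and function name here are illustrative, not colmux's code.

```python
# Toy version of sort-by-column over per-host metric rows.
# Sample rows are made up to resemble per-cell disk stats.
rows_text = """\
testcel01  0 11 1 0 113
testcel02  0  3 1 0  14
testcel03  0  3 2 0  15
"""

def sort_by_column(text, col):
    """Sort data rows descending on numeric column `col` (0 = hostname)."""
    rows = [line.split() for line in text.strip().splitlines()]
    return sorted(rows, key=lambda r: float(r[col]), reverse=True)

# Sort on the last column (index 5), like picking a sort column in colmux.
for row in sort_by_column(rows_text, 5):
    print(" ".join(row))
```

Changing the sort column at runtime is then just re-sorting the same parsed rows with a different key index before the next screen refresh.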
Here's what we've done with the tool. This was run on a 1/4 Exadata rack composed of 2 db nodes and 3 cell nodes. Sorry, the format doesn't paste well without courier fonts.
Current Time: Tue Oct 25 19:40:26 EDT 2011

          MBRead Reads RSize MBWrit Writes WSize MBRead Reads RSize MBWrit Writes WSize User Sys Wait Irq Run FreeMB SwapMB SwIn SwOut
testcel01      0    11     1      0    113     0      0    16     5      0    151     0    0   0    0   0   0   9104      0    0     0
testcel02      0     3     1      0     14    11      0     0     0      0      0     0    0   0    0   0   1  13273      0    0     0
testcel03      0     3     2      0     15    11      0     0     0      0      0     0    0   0    0   0   1   6338      0    0     0

TotalIO: 0 MB/s; DiskRead: 0 MB/s; DiskWrite: 0 MB/s; FlashRead: 0 MB/s; FlashWrite: 0 MB/s; Average CPU: 0%

          User Sys Wait Irq Run MBRead Reads RSize MBWrit Writes WSize FreeMB SwapMB SwIn SwOut
testldb01    1   0    0   0   1      0     0     0      0     25    10   4153      0    0     0
testdb02     0   0    0   0   0      0     0     0      0      5    14   6674      0    0     0

Average CPU: 1%
Mark, can you ping me offline?
Edited by: Rich Headrick on Oct 25, 2011 4:48 PM
Sorry, my last post didn't format well, but I think you get the idea.
Sorry folks. Oracle has decided to NOT include the collectl and monitoring script with Exadata for legal and other reasons beyond my control. I know some of you are pursuing development of the tool for your own purposes, so I have asked whether or not our script can be shared with you.