I’ve been asked on numerous occasions why people’s database benchmark results often vary. In other words, why does one observe a particular transactions-per-second (TPS) value today and a value 20% or more off tomorrow? In short, why aren’t database benchmark results reliably repeatable?


Often people try to blame the benchmarking software. That is the logical choice, since that software runs the benchmark and scores the result. But I’ve seen several database benchmarking tools exhibit this same odd behavior. I know that commercial software gets rigorous QA testing, and open-source or freeware tools have very active communities reporting issues. So I’m hesitant to throw any database benchmarking tool under the bus. The answer must lie elsewhere.


Another potential source of issues is the database benchmarks themselves. Maybe (and I’m purely conjecturing here) the specs contain items open to interpretation or misinterpretation by those coding database benchmarking tools. It might even be possible that those benchmark specs possess some design flaws. But most database benchmarking tools have been around for years, some well over a decade, and my belief is that QA and user communities would have reported real problems or bugs long ago. So again, the real answer must lie elsewhere.


However, there’s yet another source of potential issues we could indirectly blame on the benchmark specs: very few people actually read or fully comprehend them. TPS seems like such a simple concept that any computer-literate person should be able to run a database benchmarking tool. Wrong! Most of us know how to drive an automobile, but how many of us would get behind the wheel of a Formula One race car and expect good things to happen? In a dozen or so years of helping people do database benchmarking, I can honestly say that maybe 25% of them had read and understood the benchmark spec well enough to select the options or parameters in the benchmarking tools needed to obtain the desired results. That is probably the principal source of observed performance variance. But even the minority of well-prepared people still see variance. Why?


I have an explanation that should make sense, though I cannot claim a mathematical proof or certainty that it is 100% correct. I offer it nonetheless because, at an absolute minimum, I think it will radically reduce the observed performance variance for many people.


Let’s use a very simple analogy based on an automobile’s speedometer (MPH) and tachometer (RPM). Remember that MPH is what the police measure to decide whether you get a speeding ticket. So MPH is like the average response time: what the database user observes. TPS is more like RPM: how hard the database engine is working. This distinction is simple yet critical.
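One way to make that distinction concrete (this sketch is my illustration, not part of the original analogy) is Little’s Law from queueing theory: on average, concurrency = throughput × response time. The same TPS can coexist with very different user-visible response times, just as the same RPM can mean very different speeds depending on gear.

```python
def concurrency(tps: float, avg_response_sec: float) -> float:
    """Average number of in-flight transactions, by Little's Law:
    concurrency = throughput (TPS) * average response time (seconds)."""
    return tps * avg_response_sec

# Hypothetical numbers for illustration only:
# 500 TPS at 20 ms per transaction -> only ~10 transactions in flight.
print(concurrency(500, 0.020))   # ~10
# 500 TPS at 200 ms per transaction -> ~100 in flight. Same "RPM" (TPS),
# but the "MPH" the user feels (response time) is ten times worse.
print(concurrency(500, 0.200))   # ~100
```

The point of the sketch: quoting TPS alone, without the response time users experienced, tells you as little as quoting RPM without knowing the gear.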


If I don’t try to reach the fastest speed at which the automobile can travel (i.e., I do not push it to its limit), then the RPM reading may well be meaningless. If I’m going 40 MPH at 2,500 RPM, it really matters what gear the automobile’s transmission is in: first gear may top out at 40 MPH, while fifth gear may permit speeds of 120 MPH or more at the same RPM.


The same is true for database benchmarking. In my experience, benchmark tests that do not push the hardware to its limits tend to produce widely varying results at those lower levels. I don’t know whether that’s a flaw in the benchmark design or in the software running it. I just know that a database server needs to be tested at the upper end of its capacity for the results to be both meaningful and repeatable.


Hence my advice to the people I work with on database benchmarks is to size appropriately. My old notebook with a Core 2 Duo (P8700), 8 GB of RAM, and a first-generation commercial-grade SSD can handle a few hundred concurrent database sessions. Remember, a typical transaction has “think time” between operations, so 100 users might really mean just 10 concurrent operations at any instant. A current- or prior-generation Xeon CPU with 128 GB of RAM and an enterprise-grade I/O subsystem should therefore be able to handle tens of thousands of concurrent sessions. Running tiny database benchmarks of under 1,000 users on such hardware falls into the zone where results may be neither reliable nor repeatable. So test appropriately, even if it means spending more money on a properly sized tool. Testing workloads that are far too small for the hardware is really just an exercise in futility, a checkmark somewhere that benchmarks were performed.
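The 100-users-to-10-operations arithmetic above can be written down as a back-of-envelope formula. The numbers below are assumed for illustration (benchmark specs define their own think times); the point is only the ratio of work time to total cycle time.

```python
def effective_concurrency(users: int, work_sec: float, think_sec: float) -> float:
    """Average number of sessions actively executing a transaction at once.
    Each user cycles between work_sec of real work and think_sec of idle
    'think time', so only work_sec / (work_sec + think_sec) of the users
    are busy at any instant."""
    return users * work_sec / (work_sec + think_sec)

# Hypothetical: 100 users, each doing 0.1 s of work then 0.9 s of thinking
# per cycle -> on average only ~10 sessions are actually working.
print(effective_concurrency(100, 0.1, 0.9))   # ~10
```

This is why a “1,000-user” benchmark can leave enterprise hardware nearly idle, right in the zone where results stop being repeatable.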