Just to minimize interpretation errors: do I summarize/rephrase correctly if I say:
The test setups yield results like the left-hand side of the graph, while your production setup behaves like the right-hand side of the red line, and this data is specific to a certain combination of hardware and config?
Regarding the test setups: Do you have nested zones for the MySQL write (b), i.e. one kernel zone with a non-global zone inside, or a kernel zone and a kernel zone next to each other, i.e. different kernel instances for MySQL and Apache?
May we further assume that the production combination deployed on 'test' hardware behaves like production, while test c (in the global zone rather than a kernel/non-global zone) yields 'steady' results?
Regarding the test load: Does the measured test load combine contributions from the CMS, MediaWiki and MySQL, or just a subset (e.g. CMS + MySQL)? Do the requests require spawning new processes, or are they served by established (long-lived) processes? Is PHP involved?
Generally speaking, the heavy spiking smells a lot like waiting to me.
Do you have some kind of resource management (rcapadm) or similar deployed? This might explain a difference between a kernel and a non-global zone, though not necessarily between a kernel zone and the global zone.
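If you are unsure whether capping is in play, a quick check with the standard Solaris tools might look like this (zone names and intervals are just examples):

```shell
# Show whether the resource capping daemon (rcapd) is enabled
# and its current scan/sample intervals
rcapadm

# Report per-zone cap statistics every 5 seconds, 5 samples;
# non-zero paging columns alongside a configured cap would point
# at rcapd actively reclaiming memory
rcapstat -z 5 5
```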
Is there any sort of additional traffic/load (e.g. monitoring on production) that seems negligible but might make a difference (e.g. by means of locking)?
Do you have some dependency on network resources to fulfil the request? I fondly remember a Nagle/delayed-ACK interaction that resulted in n*200 ms 'spikes' for certain combinations/lengths of payloads.
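On Solaris, one way to spot such stalls is to capture the exchange with delta timestamps and look for recurring ~200 ms gaps. A sketch, with the interface name, host and port as placeholders for your setup:

```shell
# Capture traffic to the backend on interface net0:
#   -r  do not resolve addresses to names
#   -t d  print delta timestamps between packets
# Runs until interrupted with Ctrl-C.
snoop -r -t d -d net0 host 10.0.0.1 port 3306

# In the output, repeated ~200 ms deltas immediately before small
# segments are the classic Nagle/delayed-ACK signature.
```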
Do you see differences if you stop the statstore data collection?
I assume you have already ruled out storage and have had a look at narrowing down which component is to blame for the delays, e.g. using the DTrace providers to compare overall syscall counts/cumulative times, or MySQL 'response times'.
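For the syscall side, a minimal DTrace one-liner along these lines (assuming the process is named "mysqld") would show where the time accumulates:

```shell
# Sum wall-clock time spent per syscall for mysqld processes.
# Let it run during a slow period, then Ctrl-C to print the
# aggregation sorted by cumulative nanoseconds.
dtrace -n '
syscall:::entry /execname == "mysqld"/ { self->ts = timestamp; }
syscall:::return /self->ts/ {
    @[probefunc] = sum(timestamp - self->ts);
    self->ts = 0;
}'
```

Comparing the output between a 'good' and a 'bad' interval should quickly show whether the extra time is spent in I/O syscalls, locking, or elsewhere.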
Sorry for just contributing some questions.
How do you actually measure the throughput and latency of your server?
When looking at "ippatrol.com", it seems that a distant (cloud) service is involved. Can you make sure that the network between your server and "ippatrol.com" is not the culprit, or at least part of the problem? Do your production server and your test server have the same IP address? I've seen routing decisions where the actual choice of an IP route was determined by both the sender's and the receiver's IP addresses, so the routes between your server(s) and "ippatrol.com" could differ depending on the host you are testing with. In one very special case, one of the ATM interfaces used in one of the routes (it was a long time ago, when my carrier still had this problem) lost some packets (about 3 %). Because the routes were chosen based on source and destination IP address, the behavior looked quite non-deterministic at first glance.
And another question: What happens if you put a large static test file (e.g. 4 GB) on your server(s) and start a download from a remote site? Does the throughput fluctuate just as much?
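A sketch of such a test (file path, size and server URL are placeholders; the size is scaled down here, the original suggestion was ~4 GB):

```shell
# Create a static test file filled with zeros in the web server's
# document root (placeholder path; 64 MiB for illustration)
SIZE_MB=64
dd if=/dev/zero of=/tmp/testfile bs=1048576 count=$SIZE_MB 2>/dev/null

# Confirm the size before pointing the web server at it
ls -l /tmp/testfile

# Then, from a remote client, download it and let wget report the
# average rate (placeholder URL):
#   wget -O /dev/null http://your-server/testfile
```

Because no PHP, MySQL or zone-crossing is involved, a fluctuating rate here would point at the network or storage path rather than the application stack.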
Sorry for not being able to come up with a solution but with rather more questions…
I have an update, but no solution yet. Short answer: "MySQL replication is the cause."
First, to the question of how throughput is measured:
- using the external site (ipPatrol), but also double-checking it via a simple wget on the local network
2019-11-29 09:16:03 (35.3 MB/s) - '10.61.29.21_index.php' saved  *Replication Off
2019-11-29 09:31:05 (13.7 MB/s) - '10.61.29.21_index.php' saved 
2019-11-29 09:46:03 (20.3 MB/s) - '10.61.29.21_index.php' saved 
2019-11-29 10:01:04 (28.9 MB/s) - '10.61.29.21_index.php' saved 
2019-11-29 10:16:04 (11.8 MB/s) - '10.61.29.21_index.php' saved 
2019-11-29 10:31:04 (17.8 MB/s) - '10.61.29.21_index.php' saved 
2019-11-29 10:46:02 (28.9 MB/s) - '10.61.29.21_index.php' saved 
2019-11-29 11:01:05 (856 KB/s) - '10.61.29.21_index.php' saved  *Replication On
2019-11-29 11:16:03 (160 KB/s) - '10.61.29.21_index.php' saved 
2019-11-29 11:31:04 (681 KB/s) - '10.61.29.21_index.php' saved 
The above shows that with MySQL replication off, performance is good, but as soon as I turn it on, it drops like a lead balloon!
So, I still need to see whether this is a MySQL replication problem or a Solaris disk I/O problem.
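One way to separate the two (using the standard Solaris tools) would be to watch the disks while toggling replication on and off:

```shell
# Per-device latency/utilization, extended stats, skipping idle
# devices; watch the asvc_t (service time) and %b (busy) columns
iostat -xnz 5

# The same view at the ZFS level, per vdev, for the pool holding
# the MySQL data (rpool here, per the dataset names in this thread)
zpool iostat -v rpool 5
```

If service times and pool write activity jump only while replication is on, the disks are merely the victim of the extra (likely synchronous) write load, not the root cause.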
OK! It looks like the problem comes down to MySQL and ZFS.
I am looking at some of the documentation about MySQL performance on ZFS.
Here are the changes I have made so far which have fixed the problem (but it's early days):
zfs set atime=off rpool/VARSHARE/local/mysql/data
zfs set recordsize=128k rpool/VARSHARE/local/mysql/data/mysql
zfs set recordsize=16k rpool/VARSHARE/local/mysql/data/wwwdb
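The rationale: atime=off avoids a metadata write per file access, and recordsize=16k matches InnoDB's default 16 KB page size so that a page write does not turn into a 128 KB read-modify-write. To confirm the properties took effect:

```shell
# Verify atime and recordsize on the MySQL datasets
zfs get atime,recordsize \
    rpool/VARSHARE/local/mysql/data \
    rpool/VARSHARE/local/mysql/data/mysql \
    rpool/VARSHARE/local/mysql/data/wwwdb
```

Note that recordsize only applies to blocks written after the change, so existing data files keep their old block size until rewritten (e.g. by dumping and reloading the database).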