Could you figure out who is driving the writes from the guest domains? Maybe run 'prstat -Lm' to see who has a lot of system calls. The zpool-* processes are just workers handling load created by somebody else; there must be somebody asking for I/O to be done. Also, could you send the output of 'zpool list' and 'zpool status'? Is this RAIDZ or a mirror? Is it close to full? Are there scrub or resilver operations going on? There are many things that can affect ZFS performance. It would also be good to see the domain definition, as in 'ldm list -l $MYDOMAIN'.
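For reference, the checks suggested above might look like this on the control domain (the commands are the ones named in the post; $MYDOMAIN stands in for the actual guest domain name):

```
# Per-thread microstate accounting; look for threads with a high
# SCL (time in system calls) percentage to find who is driving I/O.
prstat -Lm

# Pool layout, capacity, and health; 'zpool status' also shows
# whether a scrub or resilver is currently running.
zpool list
zpool status

# Full definition of the guest domain (virtual disks, vcpus, memory).
ldm list -l $MYDOMAIN
```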
I hope that's helpful to get started. Jeff
Thank you for answering.
No, I couldn't figure it out, because the problem disappeared in the meantime. So far I have no idea why.
I traced the read and write system calls with DTrace, and what I saw confused me: in the global zone there was a lot of I/O from the zpool-* process, while at the same time the non-global zone accounted for only 2%-5% of the syscalls. That's what puzzled me.
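A minimal sketch of that kind of tracing with the stock DTrace syscall provider (the aggregation keys here are my choice, not necessarily what was used in the original trace):

```
# Count read/write syscalls per zone and per process name;
# press Ctrl-C to print the aggregation.
dtrace -n 'syscall::read:entry, syscall::write:entry
    { @[zonename, execname] = count(); }'
```

Breaking the counts out by zonename makes it easy to compare how much of the syscall load actually originates in the non-global zone versus the global zone.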
The pool is healthy; I/O from the global zone to another dataset in the same pool shows no problems. The pool was about 60% full, and it's a two-way mirror.
It was not a ZFS problem in general. I just had no idea where to look next, because:
"The zpool-* processes are just workers handling load created by somebody else. There must be somebody asking for I/O to be done..."
Yes, but I couldn't identify any process in the zone asking for that I/O.
But for now everything works fine. Maybe the problem will strike back in the future; if so, you'll read more here :-)
I'm glad the problem went away; too bad we don't know why it happened the first time. If it reoccurs, please check back here, or better yet, on the part of the forums that deals with ZFS. Another thing to do is subscribe to the firstname.lastname@example.org mailing list. You can post questions there, where very ZFS-knowledgeable people are likely to see them.
The problem occurred again two weeks later.
It seems we ran into a bug:
"ZFS ARC can shrink down without memory pressure result in slow performance" [ID 1404581.1]
I created an SR and got help figuring out what happens here. I'm currently updating to 11.1. Hopefully the problem will fade away then.
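A quick way to watch for the ARC shrinking without memory pressure is to poll the standard arcstats kstat (the statistic names below are the stock Solaris ones; whether they explain this particular bug is an assumption):

```
# Current ARC size and its target size, in bytes.
# If 'size' falls far below 'c' with plenty of free memory,
# the ARC has shrunk on its own.
kstat -p zfs:0:arcstats:size zfs:0:arcstats:c
```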
Sorry that the problem reoccurred, and there are definitely a lot of fixes in Solaris 11.1 - make sure you go to a recent SRU and check out the README information. I'm still concerned about the very high service times, though. Are the backend disks zvols within a local (on the control domain) ZFS pool? Check that the pool isn't too full, as that will affect performance.
Download arcstat.pl and arc_summary.pl if you don't have them, as they give very good information about the current and recent sizes of the ARC and the cache hit and miss statistics. Please post how things turn out.
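Typical invocations of those two scripts look like this (exact column sets vary between script versions; arcsz and c are the current and target ARC sizes):

```
# Sample ARC statistics every 5 seconds, 10 samples:
# read/miss rates, hit percentages, arcsz, and the target size c.
./arcstat.pl 5 10

# One-shot summary: ARC size breakdown, hit/miss ratios, tunables.
./arc_summary.pl
```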