I do not see Oracle on ZFS often; in fact, I was called in to meet my first. The database was experiencing heavy IO problems, partly from undersized IOPS capability, but also from poor performance on the backups - specifically the reading part. The IOPS capability was easily extended by adding more LUNs, so I was left with the very poor bandwidth RMAN experienced while reading the datafiles. iostat showed that during a simple datafile copy (both cp and dd with a 1 MiB blocksize), the average IO blocksize was very small and varied wildly. I suspected fragmentation, so I set off to test.
I wrote a small C program that initializes a 10 GiB datafile on ZFS and then repeatedly does the following (a sketch of the program appears below):
1 - 1000 random 8 KiB writes with random contents at 8 KiB boundaries (mimicking an 8 KiB database block size)
2 - a full read of the datafile from start to finish in 128*8 KiB = 1 MiB IOs (mimicking datafile copies, RMAN backups, full table scans, and index fast full scans)
3 - goto 1
So it's a datafile that receives random writes and is then fully scanned, to see the impact of the random writes on multiblock read performance. Note that the datafile is not grown; all writes land over existing data.
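For reference, here is a minimal sketch of the test - not the exact program I ran. Error handling is mostly omitted, and the file path ("testfile"), the pass count, and the compile command are placeholders:

/* frag_test.c - sketch of the fragmentation test described above.
 * Assumed compile command: gcc -O2 -D_FILE_OFFSET_BITS=64 frag_test.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BLOCK            8192ULL                       /* 8 KiB database block */
#define FILE_SIZE        (10ULL * 1024 * 1024 * 1024)  /* 10 GiB datafile */
#define NBLOCKS          (FILE_SIZE / BLOCK)
#define READ_SIZE        (128 * BLOCK)                 /* 1 MiB multiblock read */
#define WRITES_PER_PASS  1000
#define PASSES           200                           /* arbitrary test length */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    /* "testfile" is a placeholder path on the ZFS dataset under test */
    int fd = open("testfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(READ_SIZE);
    if (!buf) { perror("malloc"); return 1; }
    srandom(time(NULL));

    /* step 0: fill the full 10 GiB, so later writes hit existing data */
    memset(buf, 'x', READ_SIZE);
    for (unsigned long long off = 0; off < FILE_SIZE; off += READ_SIZE)
        pwrite(fd, buf, READ_SIZE, off);

    for (int pass = 0; pass < PASSES; pass++) {
        /* step 1: 1000 random 8 KiB writes at 8 KiB boundaries */
        for (int i = 0; i < WRITES_PER_PASS; i++) {
            unsigned long long blk = (unsigned long long)random() % NBLOCKS;
            for (unsigned long long j = 0; j < BLOCK; j++)
                buf[j] = random();                     /* random contents */
            pwrite(fd, buf, BLOCK, blk * BLOCK);
        }
        fsync(fd);

        /* step 2: full scan in 1 MiB IOs, timed ("goto 1" is the outer loop) */
        double t0 = now();
        for (unsigned long long off = 0; off < FILE_SIZE; off += READ_SIZE)
            pread(fd, buf, READ_SIZE, off);
        double elapsed = now() - t0;

        printf("pass %d: %.1f ms avg per 1 MiB read\n",
               pass + 1, elapsed * 1000.0 / (FILE_SIZE / READ_SIZE));
        fflush(stdout);
    }
    free(buf);
    close(fd);
    return 0;
}

The sketch does nothing special about caching; in my runs the ARC was simply sized as described further down.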
Even though I expected fragmentation (it must have come from somewhere), I was appalled by the results. ZFS truly sucks big time in this scenario. Where EXT3, on which I ran the same test (on the exact same storage), showed stable read timings of around 10 ms per 1 MiB IO, ZFS started off at 10 ms and climbed to 35 ms per 128*8 KiB IO after 100,000 random writes into the file. The test has not reached its end yet - the service times are still increasing, so it is taking very long. I do expect it to level off at some point, as the file would eventually be completely fragmented and cannot be fragmented any further.
I started noticing statements that seem to acknowledge this behavior in some Oracle whitepapers, such as the otherwise unexplained advice to copy datafiles regularly. Indeed, copying the file back and forth defragments it. I don't have to tell you that all of this means downtime.
On the production server this issue has gotten so bad that migrating to a different filesystem by copying the files would take much longer than restoring from disk backup - the disk backups are written once and are not fragmented. They are lucky the application does not require full table scans or index fast full scans. Or perhaps unlucky, because then this issue would have become impossible to ignore much earlier.
I observed the fragmentation with all the logbias and recordsize settings that Oracle recommends for ZFS. The ZFS caches were allowed to use 14 GiB of RAM (and mostly did), larger than the file itself.
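For reference, the settings I cycled through were along these lines ('tank/oradata' is a placeholder for the actual dataset name):

zfs set recordsize=8k tank/oradata       # match the 8 KiB database block size
zfs set logbias=latency tank/oradata     # the default
zfs set logbias=throughput tank/oradata  # the documented alternative for datafiles

None of the combinations made a difference to the fragmentation behavior.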
The question is, of course: am I missing something here? Who else has seen this behavior?