ZFS and fragmentation

Jan-Marten Spit Member Posts: 144
edited Nov 21, 2013 10:24AM in General Database Discussions

I do not see Oracle on ZFS often; in fact, I was called in to meet my first. The database was experiencing heavy I/O problems, caused both by undersized IOPS capability and by poor performance on the backups - the reading part of them. The IOPS capability was easily extended by adding more LUNs, so I was left with the very poor bandwidth experienced by RMAN reading the datafiles. iostat showed that during a simple datafile copy (both cp and dd with a 1MiB blocksize), the average I/O blocksize was very small and varying wildly. I feared fragmentation, so I set off to test.
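
For reference, that kind of observation can be made roughly like this (datafile and output paths are just examples here):

    # copy a datafile with 1MiB (1048576 byte) reads in the background
    dd if=/oradata/users01.dbf of=/backup/users01.dbf bs=1048576 &

    # Solaris: extended per-device statistics every 5 seconds;
    # the average read size per device is kr/s divided by r/s
    iostat -xn 5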

I wrote a small C program that initializes a 10 GiB datafile on ZFS and then repeatedly does the following:

1 - 1000 random 8KiB writes with random data (contents) at 8KiB boundaries (mimicking an 8KiB database block size)

2 - a full read of the datafile from start to finish in 128*8KiB = 1MiB I/Os (mimicking datafile copies, RMAN backups, full table scans, and index fast full scans)

3 - goto 1

So it is a datafile that receives random writes and is then fully scanned, to see the impact of the random writes on multiblock read performance. Note that the datafile is not grown; all writes go over existing data.
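
The original C program is not reproduced here, but a rough bash approximation of the same loop (file name, size and counts are placeholders, and it is far slower than the C version) looks like this:

    #!/bin/bash
    # crude stand-in for the C test program described above; path and size are placeholders
    F=/tank/test/testfile.dat
    BLOCKS=1310720                     # 10 GiB in 8KiB blocks

    # initialize the 10 GiB datafile once
    dd if=/dev/urandom of="$F" bs=8192 count=$BLOCKS 2>/dev/null

    while true; do
        # 1 - 1000 random 8KiB writes with random contents at 8KiB boundaries,
        #     all over existing data (the file is never grown)
        for ((i = 0; i < 1000; i++)); do
            OFF=$(( (RANDOM * 32768 + RANDOM) % BLOCKS ))
            dd if=/dev/urandom of="$F" bs=8192 count=1 seek=$OFF conv=notrunc 2>/dev/null
        done
        # 2 - full scan in 1MiB (128*8KiB = 1048576 byte) reads, timed
        time dd if="$F" of=/dev/null bs=1048576 2>/dev/null
        # 3 - goto 1
    done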

Even though I expected fragmentation (it must have come from somewhere), I was appalled by the results. ZFS truly sucks big time in this scenario. Whereas on EXT3, on which I ran the same tests (on the exact same storage), the read timings were stable (around 10ms for a 1MiB IO), ZFS started off at 10ms and went up to 35ms for one 128*8KiB IO after 100,000 random writes into the file. It has not reached the end of the test yet - the service times are still increasing, so the test is taking very long. I do expect it to stop somewhere, as the file would eventually be completely fragmented and could not be fragmented any further.

I started noticing statements that seem to acknowledge this behavior in some Oracle whitepapers, such as the otherwise unexplained advice to copy datafiles regularly. Indeed, copying the file back and forth defragments it. I don't have to tell you all this means downtime.
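
For clarity, that workaround amounts to something like the following (paths are made up, and the tablespace or database has to be offline while its datafile is rewritten, hence the downtime):

    # the copy is written out sequentially into free space, so it is not fragmented
    cp /tank/oradata/users01.dbf /tank/oradata/users01.dbf.new
    mv /tank/oradata/users01.dbf.new /tank/oradata/users01.dbf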

On the production server this issue has gotten so bad that migrating to a different filesystem by copying the files will take much longer than restoring from a disk backup - the disk backups are written once and are not fragmented. They are lucky the application does not require full table scans or index fast full scans, or perhaps unlucky, because then this issue would have become impossible to ignore much earlier.

I observed the fragmentation with all settings for logbias and recordsize that are recommended by Oracle for ZFS. The ZFS caches were allowed to use 14 GiB of RAM (and mostly did), bigger than the file itself.
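
For reference, the settings in question are of this kind (the dataset name is just an example; an 8K recordsize matching the database block size, and a logbias setting as per the whitepaper):

    zfs set recordsize=8k tank/oradata
    zfs set logbias=throughput tank/oradata
    zfs get recordsize,logbias,primarycache tank/oradata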

The question is, of course: am I missing something here? Who else has seen this behavior?

Best Answer

  • Stefan Koehler Member Posts: 281 Bronze Badge
    Accepted Answer

    Hi Jan-Marten,

    Well, I have a multi-billion-dollar enterprise client running their whole Oracle infrastructure on ZFS (Solaris x86), and it runs pretty well. ZFS does introduce a "new level of complexity", but it is worth it for some clients (especially because of the snapshot feature, for example).

    > So I was left with the very poor bandwidth experienced by RMAN reading the datafiles

    Maybe you hit a sync I/O issue. I have written a blog post about a ZFS issue and its sync I/O behavior with RMAN: [Oracle] RMAN (backup) performance with synchronous I/O dependent on OS limitations

    Unfortunately you have not provided enough information to confirm this.


    > I observed the fragmentation with all settings for logbias and recordsize that are recommended by Oracle for ZFS.

    What does the ZFS pool layout look like? Is the whole database in the same pool? First of all, you should separate the log files and the data files into different pools. ZFS works with "copy on write".

    What does the free space in the ZFS pool look like? Depending on the free space in the pool, "ZFS ganging" can be delayed or, depending on the pool usage, sometimes made to disappear completely.
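
    For example (the pool name "tank" is just a placeholder), the pool usage can be checked with:

    shell> zpool list tank
    shell> zpool get capacity tank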

    ZFS ganging can be traced with DTrace, for example like this:

    shell> dtrace -qn 'fbt::zio_gang_tree_issue:entry { @[pid]=count(); }' -c "sleep 300"
    

    Regards

    Stefan


Answers

  • Donghua Member Posts: 9 Blue Ribbon

    I have quite a number of databases running on ZFS, and the experience has been quite positive. Have you read this paper: http://www.oracle.com/technetwork/server-storage/solaris/config-solaris-zfs-wp-167894.pdf

  • Donghua Member Posts: 9 Blue Ribbon

    I did not perform such detailed analysis as you did, but I do have one DB on which I use the RMAN RATE parameter to limit RMAN I/O consumption.

  • Donghua,

    Yes, I have read it, but thanks.

    Actually, that whitepaper states:

    "Periodically copying data files reorganizes the file location on disk and gives better full scan response time."

    which acknowledges the issue without using the word fragmentation.

  • Donghua,

    The problem is not that the backups are using too much I/O capacity (although the number of IOPS is staggering due to the fragmentation); the problem is that a full (1TiB) backup takes 22 hours to complete. That is a backup to disk that is not bound by writing, but by reading the fragmented datafiles.

    thanks.

  • JohnWatson Member Posts: 2,459

    I have used ZFS for Oracle databases with no problems, including migrating from UFS to ZFS with no change in performance. This was on the same storage hardware, EMC Clariion.

    Is it possible that your problems are caused by something else, like a change of CPU? For example, if you have moved to a T-series chip you will not get the performance you might expect for a single-threaded operation such as copying a file. Even an old V490 can outperform these newer chips for that kind of work.

    I am not saying you are wrong - I know very little about this sort of thing - but my experience has been different from yours.

  • John, thanks.

    I observed the poor bandwidth on a T3, but on other T3s, such as in the preproduction environment, the average blocksize during a copy is much higher. That could be explained by the fact that on preprod we do not have that many random writes - it is only used for testing and is -restored- every now and then. Now what else but fragmentation could break up the 1MiB I/Os that I specify with dd into (on average) 55k chunks? Note that copying the copy (which is not fragmented) has no bandwidth problems on the same T3 chip (it's not great, but it is way higher than the 21 MiB/s read bandwidth I get on the 'fragmented' file, which ultimately sits on a high-end storage subsystem).

    The tests I am running now are on Linux with a Core i7 (Linux may have an older ZFS version; I am unable to check that right now). EXT3 and JFS give steady scan-read performance no matter how many random I/Os I write into the file. ZFS with different recordsizes and logbias settings gives the same picture every time: ZFS fragments files under sustained random writes.

    I found some blog posts that came to the same conclusion, but not as many as I would expect based on the terrible results I get.

    Besides, why else would Oracle state that on ZFS

    "Periodically copying data files reorganizes the file location on disk and gives better full scan response time."


    It appears to me that ZFS may be a wise choice for a lot of applications, but databases are not among them.

  • JohnWatson Member Posts: 2,459

    Well, you believe that your problem is to do with the file system. All I know is that the T3 chip is only one quarter the speed you would expect for a single-threaded application (this is documented), and in my experience, if you want fast backups you need to launch multiple channels and use multi-section backups.
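
    For what it is worth, such a backup would look something like this (the channel count and section size are just examples):

    # four disk channels plus multi-section backups of large datafiles
    rman target / <<'EOF'
    run {
      allocate channel d1 device type disk;
      allocate channel d2 device type disk;
      allocate channel d3 device type disk;
      allocate channel d4 device type disk;
      backup section size 16g database;
    }
    EOF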

    That's it from me.

  • John,


    "Well, you believe that your problem is to do with the file system."


    I am not a believer; the facts demonstrate it.


    "All I know is that the T3 chip is only one quarter the speed you would expect for a single threaded application (this is documented) and in my experience if you want fast backups you need to launch multiple channels and use multi-section backups."


    I have heard that too, and it is noted.


    However, I can reproduce this on a Core i7 that has no problem whatsoever reaching high read bandwidths on ZFS, as long as I have not done 100,000 random writes into the file I read. The same test on the exact same storage with any other filesystem I tested does not have this problem.


    thx!

  • Hemant K Chitale Member Posts: 15,759 Blue Diamond

    Interesting. I have never used ZFS, but when I look at the paper referenced, I find:

    "Do not pre-allocate Oracle table spaces - As ZFS writes the new blocks on free space and

    marks free the previous block versions, the subsequent block versions have no reason to be

    contiguous. The standard policy of allocating large empty data files at the start of the database

    to provide continuous blocks is not beneficial with ZFS."

    "ZFS writing strategies change when the storage volume used goes over 80% of the storage

    pool capacity. This change can impact the performance of rewriting data files as Oracle's main

    activity. Keep more than 20% of free space is suggested for an OLTP database."

    and, of course,

    "Periodically copying data files reorganizes the file location on disk and gives better full scan

    response time."

    These indicate that data may be physically non-contiguous in datafiles.

    Hemant K Chitale


This discussion has been closed.