14 Replies Latest reply on Nov 21, 2013 3:24 PM by Jan-Marten Spit

    ZFS and fragmentation

    Jan-Marten Spit

      I do not see Oracle on ZFS often; in fact, I was called in to meet my first. The database was experiencing heavy IO problems, both from undersized IOPS capability and from poor performance on the backups - the reading part of them. The IOPS capability was easily extended by adding more LUNs, so I was left with the very poor bandwidth experienced by RMAN reading the datafiles. iostat showed that during a simple datafile copy (both cp and dd with a 1 MiB blocksize), the average IO blocksize was very small and varying wildly. I feared fragmentation, so I set off to test.
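
      For anyone who wants to reproduce the observation, a sketch of the kind of session I used (the path is a placeholder; on Solaris, kr/s divided by r/s in the iostat output gives the average read size per device):

      shell> dd if=/path/to/datafile.dbf of=/dev/null bs=1M &
      shell> iostat -xn 5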

       

      I wrote a small C program that initializes a 10 GiB datafile on ZFS and repeatedly does:

       

      1 - 1000 random 8 KiB writes with random data (contents) at 8 KiB boundaries (mimicking an 8 KiB database block size)

      2 - a full read of the datafile from start to finish in 128*8 KiB = 1 MiB IOs (mimicking datafile copies, RMAN backups, full table scans, index fast full scans)

      3 - goto 1

       

      So it's a datafile that gets random writes and is then full-scanned, to see the impact of the random writes on multiblock read performance. Note that the datafile is not grown; all writes are over existing data. A minimal sketch of the loop is shown below.
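
      For reference, this is roughly what the harness looks like (a sketch only: creation of the 10 GiB file, timing code and error handling are omitted, "testfile" is a placeholder name, and it assumes RAND_MAX covers the block count, as it does on Linux and Solaris):

      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      #define BLOCK   8192                          /* 8 KiB database block */
      #define FILESZ  (10LL * 1024 * 1024 * 1024)   /* 10 GiB datafile */
      #define NBLOCKS (FILESZ / BLOCK)
      #define SCANSZ  (128 * BLOCK)                 /* 1 MiB multiblock IO */

      int main(void)
      {
          static char buf[SCANSZ];
          int fd = open("testfile", O_RDWR);
          if (fd < 0) { perror("open"); return 1; }

          for (;;) {
              /* 1 - 1000 random 8 KiB writes at 8 KiB boundaries */
              for (int i = 0; i < 1000; i++) {
                  off_t off = (off_t)(rand() % NBLOCKS) * BLOCK;
                  memset(buf, rand() & 0xff, BLOCK);  /* crude stand-in for random contents */
                  pwrite(fd, buf, BLOCK, off);
              }
              /* 2 - full scan in 1 MiB IOs; timing this loop exposes the
                 degradation as the file fragments */
              for (off_t off = 0; off < FILESZ; off += SCANSZ)
                  pread(fd, buf, SCANSZ, off);
              /* 3 - goto 1 */
          }
      }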

       

      Even though I expected fragmentation (it must have come from somewhere), I was appalled by the results. ZFS truly sucks big time in this scenario. Whereas on EXT3, on which I ran the same tests (on the exact same storage), the read timings were stable (around 10 ms for a 1 MiB IO), ZFS started off at 10 ms and went up to 35 ms for one 128*8 KiB IO after 100,000 random writes into the file. It has not reached the end of the test yet - the service times are still increasing, so the test is taking very long. I do expect it to stop somewhere, as the file would eventually be completely fragmented and cannot be fragmented further.

       

      I started noticing statements that seem to acknowledge this behavior in some Oracle whitepapers, such as the otherwise unexplained advice to copy datafiles regularly. Indeed, copying the file back and forth defragments it. I don't have to tell you all that this means downtime.

       

      On the production server this issue has gotten so bad that migrating to a different filesystem by copying the files will take much longer than restoring from disk backup - the disk backups are written once and are not fragmented. They are lucky the application does not require full table scans or index fast full scans. Or perhaps unlucky, because then this issue would have become impossible to ignore earlier.

       

      I observed the fragmentation with all settings for logbias and recordsize that are recommended by Oracle for ZFS. The ZFS caches were allowed to use 14 GiB of RAM (and mostly did), bigger than the file itself.
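
      For reference, the recommended settings I tested were along these lines (pool and filesystem names are placeholders):

      shell> zfs set recordsize=8k datapool/oradata        # match the DB block size
      shell> zfs set logbias=throughput datapool/oradata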

       

      The question is, of course: am I missing something here? Who else has seen this behavior?

        • 1. Re: ZFS and fragmentation
          Donghua

          I have quite a number of databases running on ZFS, and the experience is quite positive. Have you read this paper: http://www.oracle.com/technetwork/server-storage/solaris/config-solaris-zfs-wp-167894.pdf

          • 2. Re: ZFS and fragmentation
            Donghua

            I did not perform such detailed analysis as you did, but I do have one DB for which I use the RMAN RATE parameter to limit RMAN IO consumption.

            • 3. Re: ZFS and fragmentation
              Jan-Marten Spit

              Donghua,

               

              Yes, I have read it, but thanks.

               

              Actually, that whitepaper states:

               

              "Periodically copying data files reorganizes the file location on disk and gives better full scan response time."

               

              which acknowledges the issue without using the word fragmentation.

              • 4. Re: ZFS and fragmentation
                Jan-Marten Spit

                Donghua,

                 

                The problem is not that the backups are using too much IO capacity (although the number of IOPS is staggering due to the fragmentation); the problem is that a full (1 TiB) backup is taking 22 hours to complete. That is a backup to disk that is not bound by writing, but by reading the fragmented datafiles.

                 

                thanks.

                • 5. Re: ZFS and fragmentation
                  JohnWatson

                  I have used ZFS for Oracle databases with no problems, including migrating from UFS to ZFS with no change in performance. This was on the same storage hardware, EMC Clariion.

                   

                  Is it possible that your problems are something else, like a change of CPU? For example, if you have moved to a T-series chip you will not get the performance you might expect for a single-threaded operation such as copying a file. Even an old V490 can outperform these newer chips for that kind of work.

                   

                  I am not saying you are wrong - I know very little about this sort of thing - but my experience has been different from yours.

                  • 6. Re: ZFS and fragmentation
                    Jan-Marten Spit

                    John, thanks.

                     

                    I observed the poor bandwidth on a T3, but on other T3s, as in the preproduction environment, the average blocksize during a copy is much higher. That could be explained by the fact that on preprod we do not have that many random writes - it is only used for testing and is restored every now and then. Now what else than fragmentation could break up the 1 MiB IOs that I specify with dd into (on average) 55 KiB chunks? Note that copying the copy (which is not fragmented) has no bandwidth problems on the same T3 chip (it's not great, but it is way higher than the 21 MiB/s read bandwidth I get on the 'fragmented' file, which ultimately sits on a high-end storage subsystem).

                     

                    The tests I am running now are on Linux with a Core i7 (Linux may have an older ZFS version; I am unable to check that now). EXT3 and JFS give a steady scan-read performance no matter how many random IOs I write into the file. ZFS with different recordsize and logbias settings always gives the same picture: ZFS fragments files on sustained random writes.

                     

                    I found some blog posts that came to the same conclusion, but not as many as I would expect based on the terrible results I get.

                     

                    Besides, why else would Oracle state that on ZFS

                     

                    "Periodically copying data files reorganizes the file location on disk and gives better full scan response time."


                    It appears to me that ZFS may be a wise choice for a lot of applications, but databases are not among them.

                    • 7. Re: ZFS and fragmentation
                      JohnWatson

                      Well, you believe that your problem is to do with the file system. All I know is that the T3 chip is only one quarter of the speed you would expect for a single-threaded application (this is documented), and in my experience, if you want fast backups, you need to launch multiple channels and use multi-section backups.
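
                      For what it's worth, that combination looks something like this (the channel count and section size are arbitrary examples):

                      RUN {
                        ALLOCATE CHANNEL c1 DEVICE TYPE DISK;
                        ALLOCATE CHANNEL c2 DEVICE TYPE DISK;
                        ALLOCATE CHANNEL c3 DEVICE TYPE DISK;
                        ALLOCATE CHANNEL c4 DEVICE TYPE DISK;
                        BACKUP SECTION SIZE 32G DATABASE;
                      }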

                      That's it from me.

                      • 8. Re: ZFS and fragmentation
                        Jan-Marten Spit

                        John,


                        "Well, you believe that your problem is to do with the file system."


                        I am not a believer; the facts demonstrate it.


                        "All I know is that the T3 chip is only one quarter the speed you would expect for a single threaded application (this is documented) and in my experience if you want fast backups you need to launch multiple channels and use multi-section backups."


                        I have heard that too, and it is noted.


                        However, I can reproduce this on a Core i7 that has no problem whatsoever reaching high read bandwidths on ZFS, as long as I have not done 100,000 random writes to the file I read. The same test on the exact same storage with any other filesystem I tested does not have this problem.


                        thx!

                        • 9. Re: ZFS and fragmentation
                          Stefan Koehler

                          Hi Jan-Marten,

                          Well, I have a multi-billion-dollar enterprise client running their whole Oracle infrastructure on ZFS (Solaris x86), and it runs pretty well. ZFS does introduce a "new level of complexity", but it is worth it for some clients (especially the snapshot feature, for example).

                           

                          > So I was left with the very poor bandwidth experienced by RMAN reading the datafiles

                          Maybe you hit a sync I/O issue. I have written a blog post about a ZFS issue and its sync I/O behavior with RMAN: [Oracle] RMAN (backup) performance with synchronous I/O dependent on OS limitations

                          Unfortunately you have not provided enough information to confirm this.


                          > I observed the fragmentation with all settings for logbias and recordsize that are recommended by Oracle for ZFS.

                          What does the ZFS pool layout look like? Is the whole database in the same pool? First, you should separate the redo logs and data files into different pools. ZFS works with "copy on write".

                          What does the ZFS free space look like? Depending on the free space of the ZFS pool, you can delay "ZFS ganging" or sometimes (depending on the pool usage) make it disappear completely.

                           

                          ZFS ganging can be traced with DTrace, for example like this:

                          shell> dtrace -qn 'fbt::zio_gang_tree_issue:entry { @[pid]=count(); }' -c "sleep 300"
                          (this counts calls to zio_gang_tree_issue() per PID over a 300 second window)

                           

                          Regards

                          Stefan

                          • 10. Re: ZFS and fragmentation
                            Hemant K Chitale

                            Interesting. I have never used ZFS, but when I look at the paper referenced, I find:

                             

                            "Do not pre-allocate Oracle table spaces - As ZFS writes the new blocks on free space and

                            marks free the previous block versions, the subsequent block versions have no reason to be

                            contiguous. The standard policy of allocating large empty data files at the start of the database

                            to provide continuous blocks is not beneficial with ZFS."

                             

                            "ZFS writing strategies change when the storage volume used goes over 80% of the storage

                            pool capacity. This change can impact the performance of rewriting data files as Oracle's main

                            activity. Keep more than 20% of free space is suggested for an OLTP database."

                             

                            and, of course,

                            "Periodically copying data files reorganizes the file location on disk and gives better full scan

                            response time."

                             

                             

                            These indicate that data may be physically non-contiguous in datafiles.

                             

                             

                            Hemant K Chitale


                            • 11. Re: ZFS and fragmentation
                              Jan-Marten Spit

                              Stefan,


                              "well i got a multi billion dollar enterprise client running his whole Oracle infrastructure on ZFS (Solaris x86) and it runs pretty good."


                              For random reads there is almost no penalty, because randomness is not increased by fragmentation. The problem is in scan reads (aka scattered reads). The SAN cache may reduce the impact, or in the case of tiered storage, SSDs obviously do not suffer as much from fragmentation as rotational devices.


                              "In fact ZFS introduces a "new level of complexity", but it is worth for some clients (especially the snapshot feature for example)."


                              Certainly, ZFS has some very nice features.


                              "Maybe you hit a sync I/O issue. I have written a blog post about a ZFS issue and its sync I/O behavior with RMAN: [Oracle] RMAN (backup) performance with synchronous I/O dependent on OS limitations

                              Unfortunately you have not provided enough information to confirm this."


                              Thanks for that article. In my case it is a simple fact that the datafiles are getting fragmented by random writes. This fact is easily established by doing large scanning read IOs and observing the average block size during the read. Moreover, fragmentation MUST be happening, because that's what ZFS is designed to do with random writes - it allocates a new block for each write; data is not overwritten in place. I can 'make' test files fragmented simply by doing random writes to them, and this reproduces on both Solaris and Linux. Obviously this ruins scanning read performance on rotational devices (i.e. devices for which the seek time is a function of the distance between consecutive file offsets).


                              "How does the ZFS pool layout look like?"

                               

                              Separate pools for datafiles, redo+control, archives, disk backups, and oracle_home+diag. There is no separate device for the ZIL (ZFS intent log), but I tested with setups that do have a separate ZIL device; fragmentation still occurs.
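
                              In sketch form, the layout is along these lines (device names are placeholders):

                              shell> zpool create datapool c0t1d0 c0t2d0     # datafiles
                              shell> zpool create redopool c0t3d0            # redo + control files
                              shell> zpool create archpool c0t4d0            # archived logs
                              shell> zpool create backpool c0t5d0            # disk backups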

                               

                              "Is the whole database in the same pool?"

                              As in all the datafiles: yes.

                               

                              "At first you should separate the log and data files into different pools. ZFS works with "copy on write""

                               

                              It's already configured like that.

                               

                              "How does the ZFS free space look like? Depending on the free space of the ZFS pool you can delay the "ZFS ganging" or sometimes let (depending on the pool usage) it disappear completely."

                               

                              Yes, I have read that. We never surpassed 55% pool usage.
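
                              (For anyone checking their own pools: the CAP column of zpool list shows the usage; the pool name is a placeholder.)

                              shell> zpool list datapool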

                               

                              thanks!

                              • 12. Re: ZFS and fragmentation
                                Jan-Marten Spit

                                Hemant,

                                 

                                "These indicate that data may be physically non-contiguous in datafiles."

                                 

                                We can drop the 'may'. ZFS is not suited for random writes on rotational storage. (I tested btrfs too, which is also COW, but btrfs performed far worse than ZFS, so I stopped the tests to spare my drives.) The same tests on the same storage for JFS, EXT3, EXT4 and XFS show perfectly stable read-scan timings no matter how many random writes are made into the file (as one might expect). I am currently testing the behavior with SSDs, as on SSDs the seek time does not depend on the relative offset distance (in theory; I have read articles stating that this is not true in the long run, but I am no SSD specialist). So ZFS may be a filesystem of the future, given its great functionality and the SSD future of storage. For now, on rotational storage, ZFS is ill-advised for OLTP database loads - unless you do not care about scan-read performance.

                                 

                                After a few days, I found that many people, among them DBAs, are observing fragmentation on ZFS, and the ZFS maintainers acknowledge the problem.

                                 

                                I will provide a link to the results and the method used in a few days.

                                 

                                Regarding the production problem: we are moving the datafiles to a new striped ZPOOL with more LUNs (the customer is reluctant to replace ZFS), which has helped lower the impact of fragmentation on the read-scan bandwidth. (Striping also lowers the average seek distance on the same disk for fragmented scan-reads, something I only realized afterwards.)
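
                                In sketch form (LUN names are placeholders) - a pool created from several plain LUNs stripes across all of them by default:

                                shell> zpool create newdatapool lun1 lun2 lun3 lun4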

                                 

                                thx!

                                • 13. Re: ZFS and fragmentation
                                  Stefan Koehler

                                  Hi Jan-Marten,

                                   

                                  > The SAN cache may reduce the impact, or in the case of tiered storage, SSDs obviously do not suffer as much from fragmentation as rotational devices.

                                  In fact, they use ZFS L2ARC with SSDs in their production environment. Such ZFS enhancements may hide the possible "copy on write" impacts, but the result is what matters.

                                  I am not quite sure about the ZIL, as I cannot remember anymore.

                                   

                                  > fragmentation MUST be happening, because that's what ZFS is designed to do with random writes - it allocates a new block for each write; data is not overwritten in place.

                                  Absolutely - this is what I called "copy on write", and it is necessary for features like snapshots.

                                  In addition, the logbias setting recommended by Oracle did not fit their environment. They figured out the right setting for their environment by running load tests of their own.

                                   

                                  > we never surpassed 55% pool usage.

                                  Please check the graphics in this white paper (page 9): http://www.trivadis.com/uploads/tx_cabagdownloadarea/kopfschmerzen_mit_zfs_abgabe_tvd_whitepaper.pdf

                                  Unfortunately it is in German, but in that case the problem disappeared at a ZFS pool usage <= 50%.

                                   

                                  Another important point is ZFS (file-level) prefetching, which can influence performance. As previously stated, ZFS introduces a new level of complexity - so test it, measure it, test it.

                                   

                                  Regards

                                  Stefan

                                  • 14. Re: ZFS and fragmentation
                                    Jan-Marten Spit

                                    "As previously stated ZFS introduces a new level of complexity"

                                     

                                    This seems to be the correct answer to my question.

                                     

                                    ZFS will fragment the datafiles on random writes, as it is copy-on-write. This can be a serious problem, but it does not have to be - the effects can be minimized or nullified by using SSDs as an intermediate cache. In fact, ZFS gives you this option, whilst other filesystems cannot 'tier' the storage.
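
                                    For reference, adding an SSD as an L2ARC read cache to an existing pool is a one-liner (pool and device names are placeholders):

                                    shell> zpool add datapool cache c0t9d0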

                                     

                                    In the practical case I mentioned, the DB had a certain size, divided into 200 GiB LUNs which exhibit rotational behavior - seek time depending on the relative seek offset - and ZFS was dropped onto these LUNs. That is something you can get away with on other filesystems, but with ZFS it is asking for trouble.