Part 10 - Monitoring and Tuning ZFS Performance in Oracle Solaris 11.1

Version 4


    by Alexandre Borges


    Part 10 is the final article in a series that describes the key features of ZFS in Oracle Solaris 11.1 and provides step-by-step procedures explaining how to use them. This article provides an overview of how to monitor ZFS statistics and tune ZFS performance.


    Introduction to ZFS and the ZFS Intent Log


    ZFS provides transactional behavior that enforces data and metadata integrity by using a powerful 256-bit checksum. This design provides a big advantage: data and metadata are written together (though not exactly at the same time) by using the "uberblock ring" concept, which represents a round that is completed only when both data and metadata have been written. Thus, by using the uberblock ring concept, either both are written to disk or neither is written. The entire operation uses a copy-on-write (COW) mechanism to help guarantee the atomicity of the process, so the ZFS file system is always consistent, even after a crash. This is very different from traditional file systems, which can be corrupted because they write data and metadata in separate stages, increasing the chance of consistency problems that cannot always be fixed using the fsck command.


    Surprisingly, some people compare ZFS transactions to journaling file systems, but this comparison is not correct: the journaling process records a log to be replayed when a crash occurs, which accelerates recovery but decreases performance during the writing process. ZFS transactions are closer to the ACID (Atomicity, Consistency, Isolation, Durability) operations that occur in databases such as Oracle Database, where the complete operation is either committed or entirely rolled back.


    By the way, the ZFS file system has a log named the ZIL (ZFS Intent Log) that performs similarly to a journaling file system, but only synchronous writes are written to the ZIL (all other write operations are written directly to memory and later committed to disk), and without suffering a penalty. While the file system is online, the ZIL is never read; it is only written to. Thus, it must never be placed on volatile or temporary storage (the general recommendation is to use a dedicated log device, such as a solid-state disk), because after a crash, uncommitted transactions depend on the ZIL to be committed (and confirmed) to disk. In a nutshell, the fundamental role of the ZIL is to replay the last transactions in the event of a crash. During a write cycle, the information is written as a transaction group (txg) to memory and to the ZIL at the same time. About five seconds later, the transaction group is committed to disk (remember the uberblock ring) and the ZIL entries are discarded. If the system crashes during the txg commit operation, the ZIL is used during the next Oracle Solaris boot to recover the data and to try to mount the data set.
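    The write cycle just described can be condensed into a toy model. All class and method names below are illustrative inventions for intuition, not actual ZFS internals:

```python
class ToyZfs:
    """Toy model of the ZIL write path (illustrative names, not ZFS internals)."""

    def __init__(self):
        self.memory = []   # open transaction group (txg) buffered in RAM
        self.zil = []      # intent log: synchronous writes only
        self.disk = []     # main pool storage

    def write(self, data, synchronous=False):
        self.memory.append(data)        # every write lands in the open txg
        if synchronous:
            self.zil.append(data)       # sync writes are also logged to the ZIL

    def commit_txg(self):
        """Happens roughly every five seconds: flush the txg, discard the ZIL."""
        self.disk.extend(self.memory)
        self.memory = []
        self.zil = []                   # committed data makes the log redundant

    def crash_and_replay(self):
        """After a crash, RAM content is lost; replaying the ZIL recovers sync writes."""
        self.memory = []
        self.disk.extend(self.zil)
        self.zil = []


fs = ToyZfs()
fs.write("async-block")
fs.write("sync-block", synchronous=True)
fs.crash_and_replay()
print(fs.disk)   # only the synchronous write survives: ['sync-block']
```

    The asynchronous write is lost in the simulated crash because it lived only in RAM, while the synchronous write is recovered from the log, which is exactly why the ZIL must live on stable storage.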


    From the explanation above, we now know that the ZIL is used by a data set (that is, a volume or file system), but two questions arise: What is the appropriate size for the ZIL? And how do we know whether our data set uses the ZIL heavily?


    The first question is difficult to answer, but the minimum size of the ZIL is 64 MB and the maximum size is half the amount of RAM. However, it is uncommon to have a ZIL larger than 16 GB.
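    Expressed as a rough sketch (this is only the rule of thumb above, not an official formula):

```python
MB = 1024 ** 2
GB = 1024 ** 3

def zil_size_bounds(ram_bytes):
    """Rule of thumb: 64 MB minimum, half of RAM maximum,
    and ZILs larger than 16 GB are uncommon in practice."""
    minimum = 64 * MB
    maximum = ram_bytes // 2
    typical_cap = min(maximum, 16 * GB)   # sizes beyond 16 GB are uncommon
    return minimum, maximum, typical_cap

# on an 8 GB system the ceiling is 4 GB, well under the 16 GB practical cap
print(zil_size_bounds(8 * GB))   # (67108864, 4294967296, 4294967296)
```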


    To answer the second question, we could use the zilstat.ksh script from Richard Elling to follow the write activity on the ZIL device.


    To get a feel for zilstat.ksh, execute the command shown below:


    root@solaris11-1:~# ./zilstat.ksh -h


    The following usage information comes from Elling's website:


    zilstat.ksh [-Mt] [-l linecount] [-p poolname] [interval [count]]

        -M  # print numbers as megabytes (base 10)

        -t  # print timestamp

        -p poolname  # only look at poolname

        -l linecount # print header every linecount lines (default=only once)

        interval in seconds or "txg" for transaction group commit intervals

                 note: "txg" only appropriate when -p poolname is used

        count will limit the number of intervals reported


    Here are some examples:



    zilstat.ksh              Default output, 1-second samples
    zilstat.ksh 10           Ten-second samples
    zilstat.ksh 10 6         Print 6 x 10-second samples
    zilstat.ksh -p rpool     Show ZIL stats for rpool only


    Output (Note: data bytes are actual data; total bytes counts buffer size.):


    • [TIME]
    • N-Bytes: data bytes written to ZIL over the interval
    • N-Bytes/s: data bytes per second written to ZIL over the interval
    • N-Max-Rate: maximum data rate during any 1-second sample
    • B-Bytes: buffer bytes written to ZIL over the interval
    • B-Bytes/s: buffer bytes per second written to ZIL over the interval
    • B-Max-Rate: maximum buffer rate during any 1-second sample
    • Ops: number of synchronous IOPS per interval
    • <=4kB: number of synchronous IOPS <= 4k bytes per interval
    • 4-32kB: number of synchronous IOPS between 4k and 32k bytes per interval
    • >=32kB: number of synchronous IOPS >= 32k bytes per interval


    To test the script, create a mirrored pool with a mirrored log:


    root@solaris11-1:~# zpool create zil_pool mirror c7t2d0 c7t3d0 log mirror c7t4d0 c7t5d0

    root@solaris11-1:~# zpool status zil_pool


      pool: zil_pool
     state: ONLINE
      scan: none requested
    config:

       NAME          STATE     READ WRITE CKSUM
       zil_pool      ONLINE       0     0     0
         mirror-0    ONLINE       0     0     0
           c7t2d0    ONLINE       0     0     0
           c7t3d0    ONLINE       0     0     0
       logs
         mirror-1    ONLINE       0     0     0
           c7t4d0    ONLINE       0     0     0
           c7t5d0    ONLINE       0     0     0

    errors: No known data errors


    root@solaris11-1:~# zfs create zil_pool/fs_zil

    root@solaris11-1:~# zfs list zil_pool/fs_zil



    NAME             USED  AVAIL  REFER  MOUNTPOINT
    zil_pool/fs_zil   31K  9.78G    31K  /zil_pool/fs_zil


    Now it's time to check whether any write operations hit the log, by executing the following command (of course, this isolated test will not show anything):


    root@solaris11-1:~# ./zilstat.ksh -p zil_pool 1 60

    N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB

          0         0          0       0         0          0    0    0      0      0

          0         0          0       0         0          0    0    0      0      0


    Our analysis from running zilstat shows that the ZIL is not being used, so it might seem suitable to disable it. Nonetheless, the usual recommendation is not to disable the ZIL, especially when serving NFS.


    To disable the ZIL, run the following commands:


    root@solaris11-1:~# zfs get sync zil_pool



    NAME      PROPERTY  VALUE     SOURCE
    zil_pool  sync      standard  default


    root@solaris11-1:~# zfs set sync=disabled zil_pool

    root@solaris11-1:~# zfs get sync zil_pool



    NAME      PROPERTY  VALUE     SOURCE
    zil_pool  sync      disabled  local


    To re-enable the ZIL, execute the following commands:


    root@solaris11-1:~# zfs set sync=standard zil_pool

    root@solaris11-1:~# zfs get sync zil_pool



    NAME      PROPERTY  VALUE     SOURCE
    zil_pool  sync      standard  local


    DTrace offers some possibilities for observing ZIL behavior and operations, as shown below:


    root@solaris11-1:~# dtrace -n zil*:entry'{@[probefunc]=count();}'

    dtrace: description 'zil*:entry' matched 60 probes



      zil_clean                                                         1

      zil_itxg_clean                                                    1

      zil_header_in_syncing_context                                     3

      zil_sync                                                          3


    To finish this small experiment using the ZIL, destroy the zil_pool data set by running the following command:


    root@solaris11-1:~# zpool destroy zil_pool


    The Self-Healing Capability of ZFS


    Continuing on, ZFS is a self-healing file system that uses its 256-bit checksum verification to fix corrupted blocks. For example, in a two-disk mirror scenario, ZFS tries to read a block from the first disk. If the checksum reveals that this block is corrupted, ZFS performs a self-healing job by reading the block from the second disk and verifying its checksum. If that copy is OK, ZFS replaces the bad block on the first disk with the good one from the second disk.
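    The read-repair logic just described can be sketched in a few lines. This is illustrative only: the function names and the use of SHA-256 as a stand-in for ZFS's checksums are my own, not ZFS internals:

```python
import hashlib

def checksum(block):
    # SHA-256 stands in here for ZFS's own 256-bit checksums
    return hashlib.sha256(block).hexdigest()

def mirrored_read(disks, checksums, index):
    """Toy self-healing read over a mirror: find a side whose copy
    matches the stored checksum, then repair every corrupted side."""
    good = None
    for side in disks:
        if checksum(side[index]) == checksums[index]:
            good = side[index]
            break
    if good is None:
        raise IOError("unrecoverable: no side holds a valid copy")
    for side in disks:                  # the self-healing step
        if checksum(side[index]) != checksums[index]:
            side[index] = good
    return good

disk_a = [b"data-0", b"data-1"]
disk_b = [b"data-0", b"data-1"]
sums = [checksum(b) for b in disk_a]
disk_a[1] = b"garbage"                  # simulate on-disk corruption
print(mirrored_read([disk_a, disk_b], sums, 1))   # b'data-1'
print(disk_a[1])                        # the bad copy was repaired: b'data-1'
```

    Note that the repair happens as a side effect of an ordinary read, which is exactly what the step-by-step demonstration below shows with a real pool.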


    As a simple and straightforward demonstration of the self-healing feature, a quick step-by-step sequence follows. First, we choose the disks, create the pool, and verify its status, as shown below:


    root@solaris11-1:~# devfsadm

    root@solaris11-1:~# format

    Searching for disks...done


    AVAILABLE DISK SELECTIONS:



           0. c7t0d0 <ATA-VBOX HARDDISK-1.0-40.00GB>


           1. c7t2d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>


           2. c7t3d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>


           3. c7t4d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>


    Specify disk (enter its number): ^D


    root@solaris11-1:~# zpool create selfheal_pool mirror c7t2d0 c7t3d0


    root@solaris11-1:~# zpool list selfheal_pool


    NAME            SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
    selfheal_pool  9.94G    85K  9.94G   0%  1.00x  ONLINE  -



    Create a file system named test_fs inside selfheal_pool by running the following command:


    root@solaris11-1:~# zfs create selfheal_pool/test_fs


    Copy some information (any data) to the created file system, for example, by executing the following command:


    root@solaris11-1:~# cp -r /kernel /usr/jdk /usr/gnu /selfheal_pool/test_fs


    Verify the file system status by running the following:


    root@solaris11-1:~# zfs list selfheal_pool/test_fs

    NAME                   USED  AVAIL  REFER  MOUNTPOINT

    selfheal_pool/test_fs  956M  8.85G   956M  /selfheal_pool/test_fs


    In another terminal window, execute the following command:


    root@solaris11-1:~# dtrace -n zfs:zio_checksum_error:entry


    Now it is time to destroy some of the data of a ZFS disk from selfheal_pool by executing the following command:


    root@solaris11-1:~# dd if=/dev/urandom of=/dev/dsk/c7t2d0 bs=1024k count=5000 conv=notrunc

    0+5000 records in

    0+5000 records out


    To ensure that there is not any data in cache, export and import the pool again, as shown below:


    root@solaris11-1:~# zpool export -f selfheal_pool

    root@solaris11-1:/# zpool import -f selfheal_pool

    root@solaris11-1:~# zpool status selfheal_pool

      pool: selfheal_pool

    state: ONLINE

      scan: none requested



       NAME             STATE     READ WRITE CKSUM
       selfheal_pool    ONLINE       0     0     0
         mirror-0       ONLINE       0     0     0
           c7t2d0       ONLINE       0     0     0
           c7t3d0       ONLINE       0     0     0


    errors: No known data errors


    root@solaris11-1:~# cd /selfheal_pool/test_fs/

    root@solaris11-1:/selfheal_pool/test_fs# ls

    gnu     jdk     kernel


    No data was lost and everything in the file system is fine. It is simple to see that ZFS caught the checksum errors introduced by the dd command when the corrupted blocks were read back. In the last column of the output from the DTrace command below, a zero means "OK" and a non-zero value means "error"; ZFS healed itself, no data was lost, and there is nothing to worry about, as the previous step proved:


    root@solaris11-1:/# dtrace -n  'fbt::zio_checksum_error:return {trace(arg1)}'


    CPU     ID                    FUNCTION:NAME

       0  41554        zio_checksum_error:return                 0

       0  41554        zio_checksum_error:return                 0

       0  41554        zio_checksum_error:return                50

       0  41554        zio_checksum_error:return                50

       0  41554        zio_checksum_error:return                50

       0  41554        zio_checksum_error:return                50



    ZFS Caches


    ZFS deploys a very interesting kind of cache named the ARC (Adaptive Replacement Cache) that caches data from all active storage pools. The ARC grows and shrinks as the system's workload demand for memory fluctuates, using two caching algorithms at the same time to balance main memory: MRU (most recently used) and MFU (most frequently used). These two caching algorithms generate two lists (the MRU list and the MFU list) that hold metadata about the data cached in memory, whereas their counterpart lists (MRU ghost and MFU ghost) hold metadata about data evicted from the cache.


    Note that both the MRU and MFU ghost lists play an important role in the ARC's capacity to self-adapt to load, and it is worth highlighting the fundamental concept again: both the MFU ghost and the MRU ghost hold metadata about pages evicted from the cache, and there is no data in these lists. Furthermore, it is useful to understand the inner workings of the ARC in order to understand the MRU and MFU algorithms.


    Commonly deployed LRU (least recently used) mechanisms present some problems because they have no good way to handle sequential scans through the file systems. If a heavy read of sequential data happens, it can trash the file system cache. The worst case occurs when this read happens only once, evicting "good" data from the cache anyway.


    The directory table for pages in the cache is twice as big as the data held in the cache, because ZFS keeps four internal lists: the MRU (for most recently used pages), the MFU (for most frequently used pages), and the ghost MRU and ghost MFU, which do not hold any data! When an application reads a page from a file system, the page goes into the cache and a reference to it is placed in the MRU (probably at the first position). If the same page is read again, the page appears in the MFU, because a simple rule states that after two or more read operations on the same page, the page reference goes to the MFU, and its reference is also updated in the MRU.


    Obviously, pages that have not been accessed recently move toward the end of both lists. However, the cache size is finite, so sooner or later the least-used pages must be evicted from the cache to disk. Nevertheless, no page can be evicted while any reference to it remains in either the MRU or the MFU.


    Therefore, at some point, the page reference is evicted from the MRU list, so a reference to this page is created in the ghost MRU, and this page can be evicted from cache to disk. (We should remember that a page can be evicted from cache only if there isn't any reference to this page in either the MRU list or the MFU list.) In this case, clearly the ghost MRU works as a kind of "back log" of recently evicted pages.


    Sometime later, after many more pages have been evicted from the cache and their references placed in the ghost MRU, our first page's reference is pushed out of the ghost MRU as well. At that point, no reference to the page exists anywhere.


    As a simple conclusion, if a read happens after the first page is evicted from the cache but before its reference leaves the ghost MRU, the page must be fetched from disk. However, the ghost MRU gives a clear indication that this page was evicted recently, which in turn indicates that the ZFS cache is smaller than the workload needs.
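    The list mechanics described above can be condensed into a toy model. This is only a sketch for intuition: the class and its rules are illustrative and omit the real ARC's adaptive balancing between the lists:

```python
from collections import OrderedDict

class ToyArc:
    """Sketch of the MRU/MFU/ghost bookkeeping (illustrative, not real ARC code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.mru = OrderedDict()        # pages seen once (holds data here)
        self.mfu = OrderedDict()        # pages seen twice or more
        self.mru_ghost = OrderedDict()  # metadata only: recently evicted pages
        self.hits = self.misses = self.ghost_hits = 0

    def read(self, page):
        if page in self.mfu:
            self.mfu.move_to_end(page)
            self.hits += 1
        elif page in self.mru:
            self.mfu[page] = self.mru.pop(page)   # second read: promote to MFU
            self.hits += 1
        else:
            self.misses += 1
            if page in self.mru_ghost:            # recently evicted: cache too small
                self.ghost_hits += 1
                del self.mru_ghost[page]
            self.mru[page] = "data"
            while len(self.mru) + len(self.mfu) > self.capacity:
                source = self.mru if self.mru else self.mfu
                old, _ = source.popitem(last=False)   # evict the oldest page
                self.mru_ghost[old] = None            # keep only a reference

arc = ToyArc(capacity=2)
for p in ["a", "b", "c", "a"]:    # "a" is evicted, then read again
    arc.read(p)
print(arc.ghost_hits)             # 1: a ghost hit signals an undersized cache
```

    Re-reading "a" misses the cache but hits the ghost MRU, which is precisely the signal that the cache was too small for this workload.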


    As a final note, we should remember that the ZFS ARC is smart: the MRU and MFU lists (and their respective ghosts) do not have a fixed size and adapt according to the application load. On Oracle Solaris 11, the maximum ARC size grows to fill almost all of the available memory according to the following rules: at most 75 percent of total RAM on systems with less than 4 GB of memory, or total RAM minus 1 GB otherwise; the minimum ARC size is 64 MB; and metadata is limited to one quarter of the ARC. Nonetheless, the ARC adapts its size based on the amount of free physical memory; occasionally, the ARC reaches its limit (its maximum size is specified by the zfs_arc_max parameter), and then a reclaim process, which evicts some data from memory, begins. Moreover, other factors can lower this upper limit, such as page scanner activity, insufficient swap space, and the kernel heap being more than 75 percent full.
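    As a sketch of the sizing rules just described (illustrative only; the real ceiling is governed by zfs_arc_max and the other factors mentioned):

```python
GB = 1024 ** 3

def default_arc_max(ram_bytes):
    """Default ARC ceiling per the rules above: 75% of RAM on systems
    with less than 4 GB of memory, otherwise RAM minus 1 GB."""
    if ram_bytes < 4 * GB:
        return int(ram_bytes * 0.75)
    return ram_bytes - 1 * GB

print(default_arc_max(2 * GB) // (1024 ** 2))   # 1536 (MB) on a 2 GB system
print(default_arc_max(8 * GB) // GB)            # 7 (GB) on an 8 GB system
```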


    MRU and MFU statistics can be obtained by executing the following commands:


    root@solaris11-1:~# kstat -p zfs:0:arcstats | grep mru

    zfs:0:arcstats:mru_ghost_hits 0

    zfs:0:arcstats:mru_hits 402147

    root@solaris11-1:~# kstat -p zfs:0:arcstats | grep mfu

    zfs:0:arcstats:mfu_ghost_hits 0

    zfs:0:arcstats:mfu_hits 4240888


    In addition to the ARC, there's another cache named the Level 2 Adaptive Replacement Cache (L2ARC), which is like a level-2 cache between main memory and disk. Actually, the L2ARC works like an extension to the ARC for data recently evicted from the ARC. Very good candidates for housing the L2ARC are solid-state disks (SSDs). Data evicted from the ARC goes to the L2ARC, and having the L2ARC on SSDs is a good way to accelerate random reads.


    Here's a fact we must remember: The L2ARC is a low-bandwidth but low-latency device. It performs best when your working set is too large to fit into main memory, but the block size is small (32k or less). Being able to do random reads from the L2ARC's SSD devices, bypassing the main pool, boosts performance considerably.


    To add SSD cache disks to a ZFS pool, run the following command and wait for some time until the data comes into the cache (the warm-up phase):


    root@solaris11-1:~# zpool add zfs_l2_pool cache <ssd disk 1> <ssd disk 2> <ssd disk 3> <ssd disk 4>...


    Getting ZFS Statistics


    You can get basic statistics about the ZFS rpool by using the following zpool iostat command, which enables us to see how many read and write operations have happened in the rpool and how much data was written or read:


    root@solaris11-1:~# zpool iostat rpool 1

               capacity     operations    bandwidth
    pool    alloc   free   read  write   read  write
    -----   -----  -----  -----  -----  -----  -----
    rpool   31.1G  48.4G     14     10  1.35M   563K
    rpool   31.1G  48.4G      6     48   267K  27.2M
    rpool   31.1G  48.4G     14     27   859K   106K
    rpool   31.2G  48.3G     23    136  1.44M  1.18M
    rpool   31.2G  48.3G      4    120   516K  28.9M
    rpool   31.2G  48.3G      8    162   897K  13.6M
    rpool   31.2G  48.3G     12    114   522K  19.2M
    rpool   31.2G  48.3G      0     28   1023  20.4M



    Another very good tool for tracing which operations are slowing the ZFS system performance is the DTrace script zfsslower.d (from Brendan Gregg's DTrace book). For example, the following command shows what ZFS operations take more than 15 milliseconds during a real access to disk (not cached):


    root@solaris11-1:~/zfs_scripts# ./zfsslower.d 15

    TIME                 PROCESS D KB ms FILE

    2014 May 11 16:14:32 bash    R 0  25 /opt/openv/java/jnbSA

    2014 May 11 16:14:33 jnbSA   R 8  22 /opt/openv/java/.nbjConf

    2014 May 11 16:14:33 jnbSA   R 0  21 /usr/bin/tail

    2014 May 11 16:14:33 jnbSA   R 0  17 /usr/bin/awk

    2014 May 11 16:14:33 jnbSA   R 0  21 /usr/bin/whoami

    2014 May 11 16:14:33 jnbSA   R 0  24 /usr/bin/locale

    2014 May 11 16:14:33 jnbSA   R 0  68 /usr/bin/grep

    2014 May 11 16:14:33 java    R 0  29 /opt/openv/java/jre/bin/amd64/java

    2014 May 11 16:14:33 java    R 0  22 /opt/openv/java/jre/lib/amd64/server/

    2014 May 11 16:14:33 java    R 0  20 /usr/lib/amd64/

    2014 May 11 16:14:33 java    R 0  18 /lib/amd64/

    2014 May 11 16:14:33 java    R 0  39 /usr/lib/amd64/

    2014 May 11 16:14:33 java    R 0  25 /opt/openv/java/jre/lib/amd64/

    2014 May 11 16:14:33 java    R 0  22 /opt/openv/java/jre/lib/amd64/

    2014 May 11 16:14:33 java    R 0  22 /opt/openv/java/jre/lib/amd64/

    2014 May 11 16:14:34 java    R 2  19 /opt/openv/java/jre/lib/meta-index



    The DTrace script zfssnoop.d is a helpful tool for seeing what processes are requesting ZFS I/O operations, and it's very suitable for analyzing performance issues, as shown below:


    root@solaris11-1:~/zfs_scripts# ./zfssnoop.d


    TIME(ms) UID   PID PROCESS CALL        KB FILE
    6316520  0    6293 bprd    zfs_getattr 0  /opt/openv/netbackup/bin/bprd_parent

    6316520  0  6293 bprd    zfs_open    0  /opt/openv/netbackup/bin/bprd_parent

    6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/netbackup/bin/bprd_parent

    6316520  0  6293 bprd    zfs_readdir 0  /opt/openv/netbackup/bin/bprd_parent

    6316520  0  6293 bprd    zfs_readdir 0  /opt/openv/netbackup/bin/bprd_parent

    6316520  0  6293 bprd    zfs_close   0  /opt/openv/netbackup/bin/bprd_parent

    6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/netbackup/bp.conf

    6316520  0   797 nscd    zfs_getattr 0  /etc/nsswitch.conf

    6316520  0   797 nscd    zfs_getattr 0  /etc/resolv.conf

    6316520  0   797 nscd    zfs_getattr 0  /etc/passwd

    6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/netbackup/bp.conf

    6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/var/license.txt

    6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/var/license.txt

    6316521  0  797  nscd    zfs_getattr 0  /etc/nsswitch.conf

    6316521  0  797  nscd    zfs_getattr 0  /etc/resolv.conf

    6316521  0  797  nscd    zfs_getattr 0  /etc/passwd



    For getting ARC-specific statistics, there are many parameters that can be collected using different tools and techniques. For example, kstat is a great way to gather ARC statistics:


    root@solaris11-1:~/zfs_scripts# kstat -p "zfs:0:arcstats:"

    zfs:0:arcstats:buf_size 3301320

    zfs:0:arcstats:c 320595190

    zfs:0:arcstats:c_max 1597679616

    zfs:0:arcstats:c_min 67108864

    zfs:0:arcstats:class misc

    zfs:0:arcstats:crtime 824.813864746

    zfs:0:arcstats:data_size 281752896

    zfs:0:arcstats:deleted 124895

    zfs:0:arcstats:demand_data_hits 4181581

    zfs:0:arcstats:demand_data_misses 30322

    zfs:0:arcstats:demand_metadata_hits 329062

    zfs:0:arcstats:demand_metadata_misses 15013

    zfs:0:arcstats:hash_chain_max 5

    zfs:0:arcstats:hash_chains 18446744073709535339

    zfs:0:arcstats:hash_collisions 31739

    zfs:0:arcstats:hash_elements 18446744073709478435

    zfs:0:arcstats:hash_elements_max 18446744073709551615

    zfs:0:arcstats:hits 4606729

    zfs:0:arcstats:l2_abort_lowmem 0

    zfs:0:arcstats:l2_cksum_bad 0

    zfs:0:arcstats:l2_evict_lock_retry 0

    zfs:0:arcstats:l2_evict_reading 0

    zfs:0:arcstats:l2_feeds 0

    zfs:0:arcstats:l2_hdr_size 0

    zfs:0:arcstats:l2_hits 0

    zfs:0:arcstats:l2_io_error 0

    zfs:0:arcstats:l2_misses 45335

    zfs:0:arcstats:l2_read_bytes 0

    zfs:0:arcstats:l2_rw_clash 0

    zfs:0:arcstats:l2_write_bytes 0

    zfs:0:arcstats:l2_writes_done 0

    zfs:0:arcstats:l2_writes_error 0

    zfs:0:arcstats:l2_writes_hdr_miss 0

    zfs:0:arcstats:l2_writes_sent 0

    zfs:0:arcstats:memory_throttle_count 141

    zfs:0:arcstats:meta_limit 0

    zfs:0:arcstats:meta_max 41224144

    zfs:0:arcstats:meta_used 23865960

    zfs:0:arcstats:mfu_ghost_hits 0

    zfs:0:arcstats:mfu_hits 4029416

    zfs:0:arcstats:misses 132715

    zfs:0:arcstats:mru_ghost_hits 0

    zfs:0:arcstats:mru_hits 268796

    zfs:0:arcstats:mutex_miss 4980

    zfs:0:arcstats:other_size 20564640

    zfs:0:arcstats:p 320595190

    zfs:0:arcstats:prefetch_data_hits 50530

    zfs:0:arcstats:prefetch_data_misses 68513

    zfs:0:arcstats:prefetch_metadata_hits 45556

    zfs:0:arcstats:prefetch_metadata_misses 18867

    zfs:0:arcstats:size 305618856

    zfs:0:arcstats:snaptime 10481.932353541
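    Output in this key-value form is easy to post-process. As a small convenience sketch (the parse_kstat helper is my own, not part of Solaris), a few lines of Python can turn kstat -p output into a dictionary for further calculations:

```python
def parse_kstat(text):
    """Parse `kstat -p` style lines (module:instance:name:statistic value)
    into a dict mapping statistic name to value."""
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split(None, 1)   # key and value separated by whitespace
        stat = key.split(":")[-1]          # last colon field is the statistic name
        try:
            stats[stat] = int(value)
        except ValueError:
            stats[stat] = value.strip()    # non-numeric stats such as "class misc"
    return stats

sample = """zfs:0:arcstats:hits 4606729
zfs:0:arcstats:misses 132715
zfs:0:arcstats:class misc"""
arc_stats = parse_kstat(sample)
print(arc_stats["hits"], arc_stats["misses"])   # 4606729 132715
```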


    Personally, I prefer another method, which is shown in Listing 1:


    root@solaris11-1:~/zfs_scripts# echo "::arc" | mdb -k

    hits = 4696135

    misses = 133577

    demand_data_hits = 4251628

    demand_data_misses = 30808

    demand_metadata_hits = 345851

    demand_metadata_misses = 15083

    prefetch_data_hits = 50577

    prefetch_data_misses = 68616

    prefetch_metadata_hits = 48079

    prefetch_metadata_misses = 19070

    mru_hits = 310110

    mru_ghost_hits = 0

    mfu_hits = 4102471

    mfu_ghost_hits = 0

    deleted = 124895

    mutex_miss = 4986

    hash_elements = 18446744073709478638

    hash_elements_max = 18446744073709551615

    hash_collisions = 39936

    hash_chains = 18446744073709535395

    hash_chain_max = 5

    p = 352 MB

    c = 318 MB

    c_min = 64 MB

    c_max = 1523 MB

    size = 318 MB

    buf_size = 3 MB

    data_size = 294 MB

    other_size = 20 MB

    l2_hits = 0

    l2_misses = 45891

    l2_feeds = 0

    l2_rw_clash = 0

    l2_read_bytes = 0 MB

    l2_write_bytes = 0 MB

    l2_writes_sent = 0

    l2_writes_done = 0

    l2_writes_error = 0

    l2_writes_hdr_miss = 0

    l2_evict_lock_retry = 0

    l2_evict_reading = 0

    l2_abort_lowmem = 0

    l2_cksum_bad = 0

    l2_io_error = 0

    l2_hdr_size = 0 MB

    memory_throttle_count = 141

    meta_used = 23 MB

    meta_max = 39 MB

    meta_limit = 0 MB

    arc_no_grow = 1

    arc_tempreserve = 0 MB

    Listing 1


    There are some important statistics shown in Listing 1:


    • size shows the current ARC size.
    • c is the target ARC size.
    • c_max is the maximum target ARC size.
    • c_min is the minimum target ARC size.
    • p is the target size of the MRU portion of the cache.
    • l2_hdr_size is the space in the ARC that is consumed by managing the L2ARC.
    • l2_size is the size of the data in the L2ARC.
    • memory_throttle_count is the number of times that ZFS had to limit the ARC growth.


    For example, a constantly increasing memory_throttle_count statistic can indicate excessive pressure to evict data from the ARC.


    Some of the arcstat statistics can be examined by running the following command:


    root@solaris11-1:~# echo "arc_stats::print -d arcstat_p.value.ui64 \

    arcstat_c.value.ui64 arcstat_c_max.value.ui64" | mdb -k


    arcstat_p.value.ui64 = 0t2325356992

    arcstat_c.value.ui64 = 0t3208292352

    arcstat_c_max.value.ui64 = 0t3208292352


    Additionally, to interpret the output shown in Listing 1, we must understand some operations, such as prefetch requests and demand requests. Demand requests are done directly to the ARC without leveraging prefetch requests. But, what are prefetch requests? A prefetch request is a "read ahead" operation where data is brought from disk in advance and kept in the ARC. Consequently, future sequential read operations can take advantage of the data in the ARC, because the prefetch request anticipated that the data brought from the disk would be needed, and this prevents the need to perform a new physical read operation later. This ZFS file-level prefetch mechanism is called a zfetch.


    Therefore, the contents of the ARC can be broken down to the following: prefetch (data + metadata) and demand (data + metadata). A hit happens when the requested information is found in the ARC and a miss happens when the requested information isn't found in the ARC.


    Surprisingly, it is possible to disable ZFS prefetch by setting zfs_prefetch_disable = 0x1 in the /etc/system file. This might be recommended when you are facing contention on zfetch locks, or when the prefetch efficiency ratio is very low and is hurting performance. For example, if many small sequential reads hit the cache, prefetching data from disk can consume a significant amount of CPU time and degrade CPU performance. Additionally, there are rare cases where application loads are limited by zfetch. Based on these concepts, we can calculate efficiency ratios. For example, to see the hit and miss counts for ARC data and metadata prefetch requests, we could run the command shown in Listing 2:


    root@solaris11-1:~/zfs_scripts# kstat -p "zfs:0:arcstats:" | grep prefetch

    zfs:0:arcstats:prefetch_data_hits 50530

    zfs:0:arcstats:prefetch_data_misses 68513

    zfs:0:arcstats:prefetch_metadata_hits 45556

    zfs:0:arcstats:prefetch_metadata_misses 18867

    Listing 2


    Using the values shown in Listing 2, we can calculate the prefetch efficiency ratio by running the following command, which uses the following formula:


    [(data_hits + metadata_hits)/(data_hits + data_misses + metadata_hits + metadata_misses)]:


    root@solaris11-1:~/zfs_scripts# bc -l

    (50530 + 45556)/(50530 + 68513 + 45556 + 18867)

    .52372646702931333326 = 52.37%



    If we only wanted to know the ratio of data hits to all data requests, which follows the formula [(data_hits)/(data_hits + data_misses)], we could run this command:


    root@solaris11-1:~/zfs_scripts# bc -l

    (50530/(50530 + 68513))

    .42446846937661181253 = 42.44%
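    These ratio calculations are easy to script. Here is a minimal sketch in Python (the hit_ratio helper name is my own) that reproduces both numbers from the bc sessions above:

```python
def hit_ratio(hits, misses):
    """Hit ratio as a percentage; mirrors the bc calculations above."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

# prefetch efficiency: (data_hits + metadata_hits) over all prefetch requests
prefetch = hit_ratio(50530 + 45556, 68513 + 18867)
# data-only ratio: data_hits over all data prefetch requests
data_only = hit_ratio(50530, 68513)
print(round(prefetch, 2), round(data_only, 2))   # 52.37 42.45
```

    Standard rounding gives 42.45% for the data-only ratio; bc simply truncates the trailing digits, which is why 42.44% appears in the session above.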


    I'm sure you get the idea.


    The following script from Brendan Gregg reports the hit rate; our main concern, though, should be the misses (not shown, but easy to calculate):


    root@solaris11-1:~/zfs_scripts# more


    interval=${1:-5}        # 5 secs by default

    kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses $interval | awk '
    BEGIN {
            printf "%12s %12s %9s\n", "HITS", "MISSES", "HITRATE"
    }
    /hits/ {
            hits = $2 - hitslast
            hitslast = $2
    }
    /misses/ {
            misses = $2 - misslast
            misslast = $2
            rate = 0
            total = hits + misses
            if (total)
                    rate = (hits * 100) / total
            printf "%12d %12d %8.2f%%\n", hits, misses, rate
    }'




    Running the script produces output like the following:


    root@solaris11-1:~/zfs_scripts# ./


            HITS       MISSES   HITRATE
         4667846       132926    97.23%
              25            0   100.00%
              92            2    97.87%
             287           20    93.49%
              58            0   100.00%
            3429          127    96.43%
            4088          111    97.36%
              39            0   100.00%


    Eventually, we might be interested in collecting information about which processes are using the ARC most often, as shown in Listing 3:


    root@solaris11-1:~/zfs_scripts# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname] = count() }'

    dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes



    svc.configd 314

    tpconfig 331

    bptm 358

    firefox 398

    userreq_notify 455

    ksh93 469

    update-refresh.s 563

    nbpem 660

    nbemm 683

    nbsl 795

    oprd 871

    nbproxy 887

    nbemmcmd 999

    bpgetconfig 1032

    dbsrv11 1170

    bprd 1278

    fsflush 1638

    bpjava-susvc 1945

    bpstsinfo 2942

    sched 5433

    java 8461

    bpdbm 12290

    Listing 3


    The output in Listing 3 shows that bpdbm (a database manager from NetBackup) is using the ARC most of the time.


    As mentioned previously, the maximum ARC size grows until RAM minus 1 GB is filled, but it adapts dynamically based upon the amount of free physical memory on the system. Normally, that isn't a problem, but if an application needs a guaranteed amount of memory (for example, 2 GB), it could be appropriate to limit the ARC size through the zfs_arc_max parameter. For example, on a system with 8 GB of RAM, we could limit the ARC to 6 GB (8 GB minus 2 GB) in the /etc/system file, as follows:


    set zfs_arc_max=6442450944       (value in bytes)


    There are many ZFS parameters in the kernel, which we can see by running the following command:


    root@solaris11-1:~/zfs_scripts# echo "::zfs_params" | mdb -k

    arc_reduce_dnlc_percent = 0x3

    zfs_arc_max = 0x0

    zfs_arc_min = 0x0

    arc_shrink_shift = 0x7

    zfs_mdcomp_disable = 0x0

    zfs_prefetch_disable = 0x0

    zfetch_max_streams = 0x8

    zfetch_min_sec_reap = 0x2

    zfetch_block_cap = 0x100

    zfetch_array_rd_sz = 0x100000

    zfs_default_bs = 0x9

    zfs_default_ibs = 0xe

    metaslab_aliquot = 0x80000

    spa_max_replication_override = 0x3

    spa_mode_global = 0x3

    zfs_flags = 0x0

    zfs_txg_synctime_ms = 0x1388

    zfs_txg_timeout = 0x5

    zfs_write_limit_min = 0x2000000

    zfs_write_limit_max = 0xfdf1c00

    zfs_write_limit_shift = 0x3

    zfs_write_limit_override = 0x0

    zfs_no_write_throttle = 0x0

    zfs_vdev_cache_max = 0x4000

    zfs_vdev_cache_size = 0x0

    zfs_vdev_cache_bshift = 0x10

    vdev_mirror_shift = 0x15

    zfs_vdev_max_pending = 0xa

    zfs_vdev_min_pending = 0x4

    zfs_vdev_future_pending = 0xa

    zfs_scrub_limit = 0xa

    zfs_no_scrub_io = 0x0

    zfs_no_scrub_prefetch = 0x0

    zfs_vdev_time_shift = 0x6

    zfs_vdev_ramp_rate = 0x2

    zfs_vdev_aggregation_limit = 0x20000

    fzap_default_block_shift = 0xe

    zfs_immediate_write_sz = 0x8000

    zfs_read_chunk_size = 0x100000

    zfs_nocacheflush = 0x0

    zil_replay_disable = 0x0

    metaslab_gang_threshold = 0x100001

    metaslab_df_alloc_threshold = 0x100000

    metaslab_df_free_pct = 0x4

    zio_injection_enabled = 0x0

    zvol_immediate_write_sz = 0x8000


    For example, you can modify the block size of a ZFS file system to match the needs of an application:


    root@solaris11-1:~# zfs get recordsize netbackup_zfs_pool/fs_backup_1

    NAME                           PROPERTY   VALUE SOURCE

    netbackup_zfs_pool/fs_backup_1 recordsize 128K  default


    Usually, Oracle recommends setting the recordsize parameter equal to the block_size parameter of Oracle Database, and the zfs_immediate_write_sz parameter could be set a bit lower than the recordsize value to improve ZIL performance. Additionally, any modification of the recordsize parameter affects only files created after the change, so you should copy existing files to another place, change the ZFS file system's recordsize parameter, and then copy them back.


    It's recommended that other ZFS parameters, such as atime, primarycache, and secondarycache be analyzed in a case-by-case fashion.




    In this series of articles, I showed how to explore some nice features of ZFS. I sincerely hope this series has motivated you to learn more about ZFS. Remember: ZFS always wins!!!





    About the Author


    Alexandre Borges is an Oracle ACE in Solaris and has been teaching courses on Oracle Solaris since 2001. He worked as an employee and a contracted instructor at Sun Microsystems, Inc. until 2010, teaching hundreds of courses on Oracle Solaris (such as Administration, Networking, DTrace, and ZFS), Oracle Solaris Performance Analysis, Oracle Solaris Security, Oracle Cluster Server, Oracle/Sun hardware, Java Enterprise System, MySQL Administration, MySQL Developer, MySQL Cluster, and MySQL tuning. He was awarded the title of Instructor of the Year twice for his performance teaching Sun Microsystems courses. Since 2009, he has been imparting training at Symantec Corporation (NetBackup, Symantec Cluster Server, Storage Foundation, and Backup Exec) and EC-Council [Certified Ethical Hacking (CEH)]. In addition, he has been working as a freelance instructor for Oracle education partners since 2010. In 2014, he became an instructor for Hitachi Data Systems (HDS) and Brocade.


    Currently, he also teaches courses on Reverse Engineering, Windows Debugging, Memory Forensic Analysis, Assembly, Digital Forensic Analysis, and Malware Analysis. Alexandre is also an (ISC)2 CISSP instructor and has been writing articles on the Oracle Technical Network (OTN) on a regular basis since 2013.



    Revision 1.0, 03/16/2015

