Part 10 - Monitoring and Tuning ZFS Performance

steph-choyer-Oracle
edited Nov 30, 2016 4:00AM in Solaris 11

in Oracle Solaris 11.1

by Alexandre Borges

This is Part 10, the final article in a series that describes the key features of ZFS in Oracle Solaris 11.1 and provides step-by-step procedures explaining how to use them. This article provides an overview of how to monitor ZFS statistics and tune ZFS performance.



Introduction to ZFS and the ZFS Intent Log

ZFS provides transactional behavior that enforces data and metadata integrity by using a 256-bit checksum. This brings a big advantage: data and metadata are written together (although not at exactly the same instant) by using the "uberblock ring" concept, in which a round is complete only when both data and metadata have been written. Thus, either both are written to disk or neither is. The entire operation uses a copy-on-write (COW) mechanism to help guarantee the atomicity of the process. As a result, the ZFS file system is always consistent on disk, even after a crash. This is very different from traditional file systems, which write data and metadata in separate stages and can therefore be left in an inconsistent state that the fsck command is not always able to repair.

Surprisingly, some people compare ZFS transactions to a journaling file system, but this comparison is not correct: a journaling file system records a log that is replayed after a crash, which speeds up recovery but slows down writes. ZFS transactions are closer to the ACID (Atomicity, Consistency, Isolation, Durability) operations found in databases such as Oracle Database, where an operation is either committed completely or rolled back entirely.

That said, ZFS does have a log, named the ZIL (ZFS Intent Log), that behaves somewhat like a file system journal. However, only synchronous writes are recorded in the ZIL; all other writes go to memory and are committed to disk later, so they incur no logging penalty. While the file system is online, the ZIL is never read; it is only written to. It must never be placed on temporary storage, because after a crash the pending transactions depend on the ZIL to be committed (and confirmed) to disk. The general recommendation is to use a dedicated log device, such as a solid-state disk. In a nutshell, the fundamental role of the ZIL is to replay the last transactions in the event of a crash. During a write cycle, the information is written as a transaction group (txg) to memory and to the ZIL at the same time. About five seconds later, the transaction group is committed to disk (remember the uberblock ring) and the ZIL records are thrown away. If the system crashes during the txg commit operation, the ZIL is used during the next Oracle Solaris boot to recover the data and to try to mount the data set.

From the explanation above, we know that the ZIL is used by a data set (that is, a volume or file system), but two questions arise: What is the appropriate size for the ZIL? And how do we know how heavily our data set uses the ZIL?

The first question is difficult to answer, but the minimum size of the ZIL is 64 MB and the maximum size is the amount of RAM divided by 2. However, it is uncommon to have a ZIL larger than 16 GB.

To answer the second question, we could use the zilstat.ksh script from Richard Elling to follow the write activity on the ZIL device.

To get a feel for zilstat.ksh, execute the command shown below:

[email protected]:~# ./zilstat.ksh -h

The following usage information comes from Elling's website:

zilstat.ksh [gMt][-l linecount] [-p poolname] [interval [count]]

    -M  # print numbers as megabytes (base 10)

    -t  # print timestamp

    -p poolname  # only look at poolname

    -l linecount # print header every linecount lines (default=only once)

    interval in seconds or "txg" for transaction group commit intervals

             note: "txg" only appropriate when -p poolname is used

    count will limit the number of intervals reported

Here are some examples:

      

Command                   Description
zilstat.ksh               Default output, 1-second samples
zilstat.ksh 10            Ten-second samples
zilstat.ksh 10 6          Print 6 x 10-second samples
zilstat.ksh -p rpool      Show ZIL stats for rpool only

Output (Note: data bytes are actual data; total bytes counts buffer size.):

  • [TIME]
  • N-Bytes: data bytes written to ZIL over the interval
  • N-Bytes/s: data bytes per second written to ZIL over the interval
  • N-Max-Rate: maximum data rate during any 1-second sample
  • B-Bytes: buffer bytes written to ZIL over the interval
  • B-Bytes/s: buffer bytes per second written to ZIL over the interval
  • B-Max-Rate: maximum buffer rate during any 1-second sample
  • Ops: number of synchronous IOPS per interval
  • <=4kB: number of synchronous IOPS <= 4k bytes per interval
  • 4-32kB: number of synchronous IOPS between 4k and 32k bytes per interval
  • >=32kB: number of synchronous IOPS >= 32k bytes per interval

To test the script, create a mirrored pool with a mirrored log:

[email protected]:~# zpool create zil_pool mirror c7t2d0 c7t3d0 log mirror c7t4d0 c7t5d0

[email protected]:~# zpool status zil_pool

  pool: zil_pool

state: ONLINE

  scan: none requested

config:

   NAME        STATE     READ WRITE CKSUM

   zil_pool    ONLINE       0     0     0

     mirror-0  ONLINE       0     0     0

       c7t2d0  ONLINE       0     0     0

       c7t3d0  ONLINE       0     0     0

   logs

     mirror-1  ONLINE       0     0     0

       c7t4d0  ONLINE       0     0     0

       c7t5d0  ONLINE       0     0     0

[email protected]:~# zfs create zil_pool/fs_zil

[email protected]:~# zfs list zil_pool/fs_zil

NAME             USED  AVAIL  REFER  MOUNTPOINT

zil_pool/fs_zil   31K  9.78G    31K  /zil_pool/fs_zil

Now it's time to check for write operations on the log by executing the following command (of course, this case shows no activity because it's an isolated test):

[email protected]:~# ./zilstat.ksh -p zil_pool 1 60

N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB

      0         0          0       0         0          0    0    0      0      0

      0         0          0       0         0          0    0    0      0      0
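When zilstat.ksh runs for many intervals, scanning the output by eye gets tedious. The following is a minimal sketch, not part of Elling's script, that summarizes a capture. It assumes the default column layout shown above (no -t timestamp and no -M option), where the seventh column is ops. The zil_summary name is hypothetical.

```shell
#!/bin/sh
# zil_summary (hypothetical helper): summarize zilstat.ksh output read
# from standard input. Assumes the default layout, where column 7 is "ops".
zil_summary() {
    awk '
        $1 ~ /^[0-9]+$/ {          # skip the header and blank lines
            samples++
            if ($7 > 0) busy++     # interval with at least one synchronous write
            if ($7 > peak) peak = $7
        }
        END {
            printf "samples=%d busy=%d peak_ops=%d\n", samples, busy, peak
        }
    '
}

# Demo with the two idle samples from the test pool above.
zil_summary <<'EOF'
N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
      0         0          0       0         0          0    0    0      0      0
      0         0          0       0         0          0    0    0      0      0
EOF
```

On a live system, the same function could be fed directly, for example with ./zilstat.ksh -p zil_pool 1 60 | zil_summary. A busy count of zero over a representative period suggests that the data set issues no synchronous writes.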

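Running zilstat shows that the ZIL is not being used by this pool, so disabling the ZIL might look like a reasonable option. Nonetheless, the usual recommendation is not to disable the ZIL, especially when the data set is shared over NFS, because with sync=disabled synchronous write requests are no longer guaranteed to be on stable storage when they return.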

To disable the ZIL, run the following commands:

[email protected]:~# zfs get sync zil_pool

NAME      PROPERTY  VALUE     SOURCE

zil_pool  sync      standard  default

[email protected]:~# zfs set sync=disabled zil_pool

[email protected]:~# zfs get sync zil_pool

NAME      PROPERTY  VALUE     SOURCE

zil_pool  sync      disabled  local

To re-enable the ZIL, execute the following commands:

[email protected]:~# zfs set sync=standard zil_pool

[email protected]:~# zfs get sync zil_pool

NAME      PROPERTY  VALUE     SOURCE

zil_pool  sync      standard  local

DTrace offers several ways to follow ZIL behavior and operations, as shown below:

[email protected]:~# dtrace -n zil*:entry'{@[probefunc]=count();}'

dtrace: description 'zil*:entry' matched 60 probes

^C

  zil_clean                                                         1

  zil_itxg_clean                                                    1

  zil_header_in_syncing_context                                     3

  zil_sync                                                          3

To finish this small experiment with the ZIL, destroy zil_pool by running the following command:

[email protected]:~# zpool destroy zil_pool

The Self-Healing Capability of ZFS

ZFS is also a self-healing file system that uses 256-bit checksum verification to detect and repair a corrupted block. For example, in a mirrored configuration using two disks, ZFS tries to read a block from the first disk. If the checksum reveals that this block is corrupted, ZFS reads the same block from the second disk and verifies its checksum. If that copy is OK, ZFS replaces the bad block on the first disk with the good one from the second disk.

As a simple and straightforward demonstration of the self-healing feature, a quick step-by-step sequence follows. First, we choose the disks, create the pool, and verify its status, as shown below:

[email protected]:~# devfsadm

[email protected]:~# format

Searching for disks...done

AVAILABLE DISK SELECTIONS:

       0. c7t0d0 <ATA-VBOX HARDDISK-1.0-40.00GB>

          /[email protected],0/pci8086,[email protected]/[email protected],0

       1. c7t2d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>

          /[email protected],0/pci8086,[email protected]/[email protected],0

       2. c7t3d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>

          /[email protected],0/pci8086,[email protected]/[email protected],0

       3. c7t4d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>

          /[email protected],0/pci8086,[email protected]/[email protected],0

Specify disk (enter its number): ^D

[email protected]:~# zpool create selfheal_pool mirror c7t2d0 c7t3d0

[email protected]:~# zpool list selfheal_pool

NAME            SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT

selfheal_pool  9.94G    85K  9.94G   0%  1.00x  ONLINE  -

Create a file system named test_fs inside selfheal_pool by running the following command:

[email protected]:~# zfs create selfheal_pool/test_fs

Copy some information (any data) to the created file system, for example, by executing the following command:

[email protected]:~# cp -r /kernel /usr/jdk /usr/gnu /selfheal_pool/test_fs

Verify the file system status by running the following:

[email protected]:~# zfs list selfheal_pool/test_fs

NAME                   USED  AVAIL  REFER  MOUNTPOINT

selfheal_pool/test_fs  956M  8.85G   956M  /selfheal_pool/test_fs

In another terminal window, execute the following command:

[email protected]:~# dtrace -n zfs:zio_checksum_error:entry

Now it is time to corrupt some of the data on one disk of selfheal_pool by executing the following command:

[email protected]:~# dd if=/dev/urandom of=/dev/c7t2d0 bs=1024k count=5000 conv=notrunc

0+5000 records in

0+5000 records out

To ensure that there is not any data in cache, export and import the pool again, as shown below:

[email protected]:~# zpool export -f selfheal_pool

[email protected]:/# zpool import -f selfheal_pool

[email protected]:~# zpool status selfheal_pool

  pool: selfheal_pool

state: ONLINE

  scan: none requested

config:

   NAME           STATE     READ WRITE CKSUM

   selfheal_pool  ONLINE       0     0     0

     mirror-0     ONLINE       0     0     0

       c7t2d0     ONLINE       0     0     0

       c7t3d0     ONLINE       0     0     0

errors: No known data errors

[email protected]:~# cd /selfheal_pool/test_fs/

[email protected]:/selfheal_pool/test_fs# ls

gnu     jdk     kernel

No data was lost and everything in the file system is fine. The DTrace output below shows that ZFS detected checksum errors on blocks that the dd command overwrote. In the last column of the output, a zero means "OK" and a non-zero value means "error." Even so, ZFS recovered itself, no data was lost, and there is nothing to worry about, as the previous step proved:

[email protected]:/# dtrace -n  'fbt::zio_checksum_error:return {trace(arg1)}'

CPU     ID                    FUNCTION:NAME

   0  41554        zio_checksum_error:return                 0

   0  41554        zio_checksum_error:return                 0

   0  41554        zio_checksum_error:return                50

   0  41554        zio_checksum_error:return                50

   0  41554        zio_checksum_error:return                50

   0  41554        zio_checksum_error:return                50

...

ZFS Caches

ZFS deploys a very interesting kind of cache named ARC (Adaptive Replacement Cache) that caches data from all active storage pools. The ARC grows and shrinks as the system's workload demand for memory fluctuates, using two caching algorithms at the same time to balance main memory: MRU (most recently used) and MFU (most frequently used). These two caching algorithms generate two lists (the MRU list and the MFU list) that hold metadata about the data cached in memory, whereas counterpart lists (MRU ghost and MFU ghost) hold metadata about the evicted data from cache in memory.

Note that both the MRU and MFU ghost lists play an important role in the ARC's ability to adapt to load, and it is worth highlighting the fundamental concept again: both the MFU ghost and the MRU ghost hold only metadata about pages evicted from the cache; there is no data in these lists. Furthermore, understanding the inner workings of the ARC makes the MRU and MFU algorithms easier to understand.

Commonly deployed LRU (least recently used) mechanisms have a weakness: they have no good way to handle scans through the file system. If a heavy read of sequential data happens, it can "trash" the file system cache. The worst case occurs when the data is read only once, because it still evicts "good" data from the cache.

The directory table for pages in the cache tracks twice as many pages as the cache can hold, and ZFS keeps four internal lists. The first is the MRU (most recently used pages) and the second is the MFU (most frequently used pages). The other two lists are the ghost MRU and ghost MFU, which do not hold any data. When an application reads a page from a file system, the page goes into the cache and a reference to it is put into the MRU (probably at the first position). If the same page is read again, it appears in the MFU, because of a simple rule: once a page has been read two or more times, its reference goes to the MFU and its reference in the MRU is also updated.

Naturally, pages that have not been accessed recently move toward the end of both lists. Because the cache size is finite, sooner or later a page (the least used one) must be evicted from the cache. Nevertheless, no page can be evicted while a reference to it remains in either the MRU or the MFU.

Therefore, at some point, the page reference is evicted from the MRU list, a reference to the page is created in the ghost MRU, and the page itself can be evicted from the cache. (Remember that a page can be evicted from the cache only if there is no reference to it in either the MRU list or the MFU list.) In this case, the ghost MRU clearly works as a kind of "back log" of recently evicted pages.

Sometime later, after many more pages have been evicted from the cache and their references have been added to the ghost MRU, our first page's reference is evicted from the ghost MRU as well. At that point, no reference to the page exists in the ghost MRU either.

In conclusion, if a read operation happens after the first page is evicted from the cache but before its reference is evicted from the ghost MRU, the page must be fetched from disk. However, the ghost MRU provides a clear indication that this page was evicted recently, and such a ghost hit indicates that the ZFS cache is smaller than necessary.

As a final note, the ZFS ARC is smart because the MRU and MFU lists (and their respective ghosts) do not have a fixed size and adapt to the application load. Usually, the ARC grows to fill almost all of the available memory, according to the following rules on Oracle Solaris 11: a maximum of 75 percent of total RAM on systems with less than 4 GB of RAM, or total RAM minus 1 GB on larger systems; a minimum of 64 MB; and an upper limit of one quarter of the ARC for metadata. Nonetheless, the ARC adapts its size based on the amount of free physical memory. Occasionally, the ARC reaches its limit (its maximum size is specified by the zfs_arc_max parameter), and then the reallocation process, which evicts some data from memory, begins. Moreover, other factors can affect this upper limit, such as page scanner activity, insufficient swap space, and the kernel heap being more than 75 percent full.
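To make the sizing rule concrete, here is a rough sketch of the arithmetic. It assumes the rule above reads as "75 percent of RAM below 4 GB, otherwise RAM minus 1 GB"; that reading is an interpretation of this article, and the arc_default_max_mb name is hypothetical. It does not query the running kernel.

```shell
#!/bin/sh
# arc_default_max_mb (hypothetical helper): given physical RAM in MB,
# print the default ARC ceiling in MB under the rule of thumb quoted above.
arc_default_max_mb() {
    ram_mb=$1
    if [ "$ram_mb" -lt 4096 ]; then
        echo $(( ram_mb * 3 / 4 ))      # 75 percent of RAM below 4 GB
    else
        echo $(( ram_mb - 1024 ))       # RAM minus 1 GB otherwise
    fi
}

arc_default_max_mb 2048     # small system
arc_default_max_mb 16384    # larger system
```

The value the kernel actually uses is reported as c_max in the arcstats output, so the estimate can be compared against a real system.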

MRU and MFU statistics can be obtained by executing the following commands:

[email protected]:~# kstat -p zfs:0:arcstats | grep mru

zfs:0:arcstats:mru_ghost_hits 0

zfs:0:arcstats:mru_hits 402147

[email protected]:~# kstat -p zfs:0:arcstats | grep mfu

zfs:0:arcstats:mfu_ghost_hits 0

zfs:0:arcstats:mfu_hits 4240888
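Because ghost hits are the signal that recently evicted pages were requested again, it can be handy to total them next to the regular list hits. The sketch below is one way to do that with awk. It assumes "name value" lines as printed by kstat -p, and the arc_list_hits name is hypothetical.

```shell
#!/bin/sh
# arc_list_hits (hypothetical helper): read `kstat -p zfs:0:arcstats`
# style lines on standard input and total the list hits and ghost hits.
arc_list_hits() {
    awk '
        /:mru_ghost_hits/ || /:mfu_ghost_hits/ { ghost += $2; next }
        /:mru_hits/       || /:mfu_hits/       { hits  += $2 }
        END { printf "list_hits=%d ghost_hits=%d\n", hits, ghost }
    '
}

# Demo with the values shown above; on a live system, pipe in
#   kstat -p zfs:0:arcstats
arc_list_hits <<'EOF'
zfs:0:arcstats:mru_ghost_hits 0
zfs:0:arcstats:mru_hits 402147
zfs:0:arcstats:mfu_ghost_hits 0
zfs:0:arcstats:mfu_hits 4240888
EOF
```

With the sample values, the ghost total is zero, so nothing was re-read shortly after being evicted.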

In addition to the ARC, there's another cache named the Level 2 Adaptive Replacement Cache (L2ARC), which is like a level-2 cache between main memory and disk. Actually, the L2ARC works like an extension to the ARC for data recently evicted from the ARC. Very good candidates for housing the L2ARC are solid-state disks (SSDs). Data evicted from the ARC goes to the L2ARC, and having the L2ARC on SSDs is a good way to accelerate random reads.

Here's a fact we must remember: The L2ARC is a low-bandwidth but low-latency device. It performs best when your working set is too large to fit into main memory and the block size is small (32 KB or less). Being able to do random reads from the L2ARC's SSD devices, bypassing the main pool, boosts performance considerably.

To add SSD cache disks to a ZFS pool, run the following command and wait for some time until the data comes into the cache (the warm-up phase):

[email protected]:~# zpool add zfs_l2_pool cache <ssd disk 1> <ssd disk 2> <ssd disk 3> <ssd disk 4>...

Getting ZFS Statistics

You can get basic statistics about the ZFS rpool by using the following zpool iostat command, which enables us to see how many read and write operations have happened in the rpool and how much data was written or read:

[email protected]:~# zpool iostat rpool 1

            capacity     operations    bandwidth

pool     alloc   free   read  write   read  write

------   -----  -----  -----  -----  -----  -----

rpool    31.1G  48.4G     14     10  1.35M   563K

rpool    31.1G  48.4G      6     48   267K  27.2M

rpool    31.1G  48.4G     14     27   859K   106K

rpool    31.2G  48.3G     23    136  1.44M  1.18M

rpool    31.2G  48.3G      4    120   516K  28.9M

rpool    31.2G  48.3G      8    162   897K  13.6M

rpool    31.2G  48.3G     12    114   522K  19.2M

rpool    31.2G  48.3G      0     28   1023  20.4M

^C

Another very good tool for finding which operations are slowing ZFS performance is the DTrace script zfsslower.d (from Brendan Gregg's DTrace book). For example, the following command shows which ZFS operations take more than 15 milliseconds because they require a real disk access (not cached):

[email protected]:~/zfs_scripts# ./zfsslower.d 15

TIME                 PROCESS D KB ms FILE

2014 May 11 16:14:32 bash    R 0  25 /opt/openv/java/jnbSA

2014 May 11 16:14:33 jnbSA   R 8  22 /opt/openv/java/.nbjConf

2014 May 11 16:14:33 jnbSA   R 0  21 /usr/bin/tail

2014 May 11 16:14:33 jnbSA   R 0  17 /usr/bin/awk

2014 May 11 16:14:33 jnbSA   R 0  21 /usr/bin/whoami

2014 May 11 16:14:33 jnbSA   R 0  24 /usr/bin/locale

2014 May 11 16:14:33 jnbSA   R 0  68 /usr/bin/grep

2014 May 11 16:14:33 java    R 0  29 /opt/openv/java/jre/bin/amd64/java

2014 May 11 16:14:33 java    R 0  22 /opt/openv/java/jre/lib/amd64/server/libjvm.so

2014 May 11 16:14:33 java    R 0  20 /usr/lib/amd64/libsched.so.1

2014 May 11 16:14:33 java    R 0  18 /lib/amd64/libm.so.1

2014 May 11 16:14:33 java    R 0  39 /usr/lib/amd64/libdemangle.so.1

2014 May 11 16:14:33 java    R 0  25 /opt/openv/java/jre/lib/amd64/libverify.so

2014 May 11 16:14:33 java    R 0  22 /opt/openv/java/jre/lib/amd64/libjava.so

2014 May 11 16:14:33 java    R 0  22 /opt/openv/java/jre/lib/amd64/libzip.so

2014 May 11 16:14:34 java    R 2  19 /opt/openv/java/jre/lib/meta-index

...

The DTrace script zfssnoop.d is a helpful tool for seeing what processes are requesting ZFS I/O operations, and it's very suitable for analyzing performance issues, as shown below:

[email protected]:~/zfs_scripts# ./zfssnoop.d

TIME(ms) UID PID PROCESS CALL        KB PATH

6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/netbackup/bin/bprd_parent

6316520  0  6293 bprd    zfs_open    0  /opt/openv/netbackup/bin/bprd_parent

6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/netbackup/bin/bprd_parent

6316520  0  6293 bprd    zfs_readdir 0  /opt/openv/netbackup/bin/bprd_parent

6316520  0  6293 bprd    zfs_readdir 0  /opt/openv/netbackup/bin/bprd_parent

6316520  0  6293 bprd    zfs_close   0  /opt/openv/netbackup/bin/bprd_parent

6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/netbackup/bp.conf

6316520  0   797 nscd    zfs_getattr 0  /etc/nsswitch.conf

6316520  0   797 nscd    zfs_getattr 0  /etc/resolv.conf

6316520  0   797 nscd    zfs_getattr 0  /etc/passwd

6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/netbackup/bp.conf

6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/var/license.txt

6316520  0  6293 bprd    zfs_getattr 0  /opt/openv/var/license.txt

6316521  0  797  nscd    zfs_getattr 0  /etc/nsswitch.conf

6316521  0  797  nscd    zfs_getattr 0  /etc/resolv.conf

6316521  0  797  nscd    zfs_getattr 0  /etc/passwd

...

For getting ARC-specific statistics, there are many parameters that can be collected using different tools and techniques. For example, kstat is a great way to gather ARC statistics:

[email protected]:~/zfs_scripts# kstat -p "zfs:0:arcstats:"

zfs:0:arcstats:buf_size 3301320

zfs:0:arcstats:c 320595190

zfs:0:arcstats:c_max 1597679616

zfs:0:arcstats:c_min 67108864

zfs:0:arcstats:class misc

zfs:0:arcstats:crtime 824.813864746

zfs:0:arcstats:data_size 281752896

zfs:0:arcstats:deleted 124895

zfs:0:arcstats:demand_data_hits 4181581

zfs:0:arcstats:demand_data_misses 30322

zfs:0:arcstats:demand_metadata_hits 329062

zfs:0:arcstats:demand_metadata_misses 15013

zfs:0:arcstats:hash_chain_max 5

zfs:0:arcstats:hash_chains 18446744073709535339

zfs:0:arcstats:hash_collisions 31739

zfs:0:arcstats:hash_elements 18446744073709478435

zfs:0:arcstats:hash_elements_max 18446744073709551615

zfs:0:arcstats:hits 4606729

zfs:0:arcstats:l2_abort_lowmem 0

zfs:0:arcstats:l2_cksum_bad 0

zfs:0:arcstats:l2_evict_lock_retry 0

zfs:0:arcstats:l2_evict_reading 0

zfs:0:arcstats:l2_feeds 0

zfs:0:arcstats:l2_hdr_size 0

zfs:0:arcstats:l2_hits 0

zfs:0:arcstats:l2_io_error 0

zfs:0:arcstats:l2_misses 45335

zfs:0:arcstats:l2_read_bytes 0

zfs:0:arcstats:l2_rw_clash 0

zfs:0:arcstats:l2_write_bytes 0

zfs:0:arcstats:l2_writes_done 0

zfs:0:arcstats:l2_writes_error 0

zfs:0:arcstats:l2_writes_hdr_miss 0

zfs:0:arcstats:l2_writes_sent 0

zfs:0:arcstats:memory_throttle_count 141

zfs:0:arcstats:meta_limit 0

zfs:0:arcstats:meta_max 41224144

zfs:0:arcstats:meta_used 23865960

zfs:0:arcstats:mfu_ghost_hits 0

zfs:0:arcstats:mfu_hits 4029416

zfs:0:arcstats:misses 132715

zfs:0:arcstats:mru_ghost_hits 0

zfs:0:arcstats:mru_hits 268796

zfs:0:arcstats:mutex_miss 4980

zfs:0:arcstats:other_size 20564640

zfs:0:arcstats:p 320595190

zfs:0:arcstats:prefetch_data_hits 50530

zfs:0:arcstats:prefetch_data_misses 68513

zfs:0:arcstats:prefetch_metadata_hits 45556

zfs:0:arcstats:prefetch_metadata_misses 18867

zfs:0:arcstats:size 305618856

zfs:0:arcstats:snaptime 10481.932353541

Personally, I prefer another method, which is shown in Listing 1:

[email protected]:~/zfs_scripts# echo "::arc" | mdb -k

hits = 4696135

misses = 133577

demand_data_hits = 4251628

demand_data_misses = 30808

demand_metadata_hits = 345851

demand_metadata_misses = 15083

prefetch_data_hits = 50577

prefetch_data_misses = 68616

prefetch_metadata_hits = 48079

prefetch_metadata_misses = 19070

mru_hits = 310110

mru_ghost_hits = 0

mfu_hits = 4102471

mfu_ghost_hits = 0

deleted = 124895

mutex_miss = 4986

hash_elements = 18446744073709478638

hash_elements_max = 18446744073709551615

hash_collisions = 39936

hash_chains = 18446744073709535395

hash_chain_max = 5

p = 352 MB

c = 318 MB

c_min = 64 MB

c_max = 1523 MB

size = 318 MB

buf_size = 3 MB

data_size = 294 MB

other_size = 20 MB

l2_hits = 0

l2_misses = 45891

l2_feeds = 0

l2_rw_clash = 0

l2_read_bytes = 0 MB

l2_write_bytes = 0 MB

l2_writes_sent = 0

l2_writes_done = 0

l2_writes_error = 0

l2_writes_hdr_miss = 0

l2_evict_lock_retry = 0

l2_evict_reading = 0

l2_abort_lowmem = 0

l2_cksum_bad = 0

l2_io_error = 0

l2_hdr_size = 0 MB

memory_throttle_count = 141

meta_used = 23 MB

meta_max = 39 MB

meta_limit = 0 MB

arc_no_grow = 1

arc_tempreserve = 0 MB

Listing 1

There are some important statistics shown in Listing 1:

  • size shows the current ARC size.
  • c is the target ARC size.
  • c_max is the maximum target ARC size.
  • c_min is the minimum target ARC size.
  • p is the target size of the MRU list (the remainder of c is the target for the MFU list).
  • l2_hdr_size is the space in the ARC that is consumed by managing the L2ARC.
  • l2_size is the size of the data in the L2ARC.
  • memory_throttle_count is the number of times that ZFS had to limit the ARC growth.

For example, a steadily increasing memory_throttle_count value can indicate excessive pressure to evict data from the ARC.

Some of the arcstat statistics can be examined by running the following command:

[email protected]:~# echo "arc_stats::print -d arcstat_p.value.ui64 \

arcstat_c.value.ui64 arcstat_c_max.value.ui64" | mdb -k

arcstat_p.value.ui64 = 0t2325356992

arcstat_c.value.ui64 = 0t3208292352

arcstat_c_max.value.ui64 = 0t3208292352

Additionally, to interpret the output shown in Listing 1, we must understand two kinds of requests: demand requests and prefetch requests. A demand request is a read issued directly against the ARC, without the help of prefetching. But what is a prefetch request? A prefetch request is a "read ahead" operation in which data is brought from disk in advance and kept in the ARC. Future sequential reads can then be satisfied from the ARC, which avoids a new physical read operation later. This ZFS file-level prefetch mechanism is called zfetch.

Therefore, the contents of the ARC can be broken down to the following: prefetch (data + metadata) and demand (data + metadata). A hit happens when the requested information is found in the ARC and a miss happens when the requested information isn't found in the ARC.

Surprisingly, it is possible to disable ZFS prefetch with a setting in the /etc/system file: zfs_prefetch_disable = 0x1. This might be recommended when you are facing contention on zfetch locks or when the prefetch efficiency ratio is very low and is causing slow performance. For example, if many small sequential reads already hit the cache, prefetching data from disk can consume a significant amount of CPU time and degrade overall performance. Additionally, there are rare cases where application loads are limited by zfetch. Based on these concepts, we can calculate efficiency ratios. For example, to see the hits and misses for ARC data and metadata prefetch requests, we could run the command shown in Listing 2:

[email protected]:~/zfs_scripts# kstat -p "zfs:0:arcstats:" | grep prefetch

zfs:0:arcstats:prefetch_data_hits 50530

zfs:0:arcstats:prefetch_data_misses 68513

zfs:0:arcstats:prefetch_metadata_hits 45556

zfs:0:arcstats:prefetch_metadata_misses 18867

Listing 2

Using the values shown in Listing 2, we can calculate the prefetch efficiency ratio with the bc command, using the following formula:

[(data_hits + metadata_hits)/(data_hits + data_misses + metadata_hits + metadata_misses)]:

[email protected]:~/zfs_scripts# bc -l

(50530 + 45556)/(50530 + 68513 + 45556 + 18867)

.52372646702931333326 = 52.37%

If we only wanted to know the ratio of data hits to all data requests, which follows the formula [(data_hits)/(data_hits + data_misses)], we could run this command:

[email protected]:~/zfs_scripts# bc -l

(50530/(50530 + 68513))

.42446846937661181253 = 42.44%

I'm sure you get the idea.
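The same ratios can be computed without typing numbers into bc. The following is a small sketch that reads kstat -p style lines and applies the formula above to either the prefetch or the demand counters. The arc_hit_ratio name is hypothetical, and the output is rounded to two decimal places rather than truncated.

```shell
#!/bin/sh
# arc_hit_ratio (hypothetical helper): $1 is a counter prefix such as
# "prefetch" or "demand"; kstat -p style lines are read from standard input.
# Prints (data_hits + metadata_hits) * 100 / all requests of that kind.
arc_hit_ratio() {
    awk -v kind="$1" '
        $1 ~ (":" kind "_(data|metadata)_hits$")   { hits   += $2 }
        $1 ~ (":" kind "_(data|metadata)_misses$") { misses += $2 }
        END {
            total = hits + misses
            if (total > 0)
                printf "%.2f\n", hits * 100 / total
            else
                print "n/a"
        }
    '
}

# Demo with the prefetch counters from Listing 2.
arc_hit_ratio prefetch <<'EOF'
zfs:0:arcstats:prefetch_data_hits 50530
zfs:0:arcstats:prefetch_data_misses 68513
zfs:0:arcstats:prefetch_metadata_hits 45556
zfs:0:arcstats:prefetch_metadata_misses 18867
EOF
```

On a live system, something like kstat -p zfs:0:arcstats | arc_hit_ratio demand would report the demand hit ratio in the same way.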

The following archits.sh script from Brendan Gregg makes it possible to know the hit rate, but our main concern should be with misses (not shown, but easy to calculate):

[email protected]:~/zfs_scripts# more archits.sh

#!/usr/bin/sh

interval=${1:-5} # 5 secs by default

kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses $interval | awk '

BEGIN {

printf "%12s %12s %9s\n", "HITS", "MISSES", "HITRATE"

}

/hits/ {

hits = $2 - hitslast

hitslast = $2

}

/misses/ {

misses = $2 - misslast

misslast = $2

rate = 0

total = hits + misses

if (total)

rate = (hits * 100) / total

printf "%12d %12d %8.2f%%\n", hits, misses, rate

}

'

Running the archits.sh script produces output like the following:

[email protected]:~/zfs_scripts# ./archits.sh

HITS    MISSES HITRATE

4667846 132926 97.23%

25      0      100.00%

92      2      97.87%

287     20     93.49%

58      0      100.00%

3429    127    96.43%

4088    111    97.36%

39      0      100.00%
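Since misses are the main concern, a miss-rate column can be derived from the same output. The filter below is a sketch that is not part of Gregg's script. It assumes the three-column HITS, MISSES, HITRATE layout shown above, and the add_missrate name is hypothetical.

```shell
#!/bin/sh
# add_missrate (hypothetical filter): read archits.sh output on standard
# input and append a MISSRATE column computed as misses * 100 / total.
add_missrate() {
    awk '
        $1 == "HITS" { print $0, "MISSRATE"; next }   # header line
        NF >= 2 {
            total = $1 + $2
            rate = total ? ($2 * 100) / total : 0     # avoid division by zero
            printf "%s %.2f%%\n", $0, rate
        }
    '
}

# Demo with the first two samples shown above.
add_missrate <<'EOF'
HITS    MISSES HITRATE
4667846 132926 97.23%
25      0      100.00%
EOF
```

On a live system, the filter could be placed at the end of the pipeline, for example ./archits.sh | add_missrate.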

We might also be interested in collecting information about which processes access the ARC most often, as shown in Listing 3:

[email protected]:~/zfs_scripts# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname] = count() }'

dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes

^C

...

svc.configd 314

tpconfig 331

bptm 358

firefox 398

userreq_notify 455

ksh93 469

update-refresh.s 563

nbpem 660

nbemm 683

nbsl 795

oprd 871

nbproxy 887

nbemmcmd 999

bpgetconfig 1032

dbsrv11 1170

bprd 1278

fsflush 1638

bpjava-susvc 1945

bpstsinfo 2942

sched 5433

java 8461

bpdbm 12290

Listing 3

The output in Listing 3 shows that bpdbm (a database manager from NetBackup) accesses the ARC more often than any other process.

As mentioned previously, the ARC can grow until all RAM except 1 GB is filled, but it adapts dynamically to the amount of free physical memory on the system. Normally, that isn't a problem, but if an application needs a guaranteed amount of memory (for example, 2 GB), it can be appropriate to limit the ARC size through the zfs_arc_max parameter. For example, on a system with 8 GB of RAM, we could limit the ARC to 6 GB (8 GB - 2 GB) by adding the following line, with the value given in bytes, to the /etc/system file:

set zfs:zfs_arc_max = 6442450944
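The byte value is easy to get wrong by hand, so a one-line conversion helps. This is a minimal sketch that assumes whole gigabytes in base 2; the gb_to_bytes name is hypothetical.

```shell
#!/bin/sh
# gb_to_bytes (hypothetical helper): convert whole gigabytes (base 2)
# to the byte count used as the zfs_arc_max value.
gb_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

gb_to_bytes 6    # prints 6442450944
```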

There are many ZFS parameters in the kernel, which we can see by running the following command:

[email protected]:~/zfs_scripts# echo "::zfs_params" | mdb -k

arc_reduce_dnlc_percent = 0x3

zfs_arc_max = 0x0

zfs_arc_min = 0x0

arc_shrink_shift = 0x7

zfs_mdcomp_disable = 0x0

zfs_prefetch_disable = 0x0

zfetch_max_streams = 0x8

zfetch_min_sec_reap = 0x2

zfetch_block_cap = 0x100

zfetch_array_rd_sz = 0x100000

zfs_default_bs = 0x9

zfs_default_ibs = 0xe

metaslab_aliquot = 0x80000

spa_max_replication_override = 0x3

spa_mode_global = 0x3

zfs_flags = 0x0

zfs_txg_synctime_ms = 0x1388

zfs_txg_timeout = 0x5

zfs_write_limit_min = 0x2000000

zfs_write_limit_max = 0xfdf1c00

zfs_write_limit_shift = 0x3

zfs_write_limit_override = 0x0

zfs_no_write_throttle = 0x0

zfs_vdev_cache_max = 0x4000

zfs_vdev_cache_size = 0x0

zfs_vdev_cache_bshift = 0x10
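The ::zfs_params values are printed in hexadecimal, which makes them harder to relate to the sizes and timings discussed earlier. The following sketch converts "name = 0xNN" lines to decimal with the shell's printf. The params_to_decimal name is hypothetical, and lines that do not carry a hexadecimal value are skipped.

```shell
#!/bin/sh
# params_to_decimal (hypothetical filter): read "name = 0xNN" lines, such as
# the ::zfs_params output above, and print each value in decimal.
params_to_decimal() {
    while read -r name eq value; do
        case $value in
            0x*) printf '%s = %d\n' "$name" "$value" ;;
        esac
    done
}

# Demo with three of the parameters listed above.
params_to_decimal <<'EOF'
zfs_txg_synctime_ms = 0x1388
zfs_txg_timeout = 0x5
zfs_write_limit_min = 0x2000000
EOF
```

For instance, 0x1388 is 5000 and 0x2000000 is 33554432. On a live system, the filter could be appended to the mdb pipeline, for example echo "::zfs_params" | mdb -k | params_to_decimal.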


Comments

  • Krum

    Terrific explanation. This isn't my first time reading it, but it was a pleasure to refresh my memory!

    I wonder (and I want to underline that this is not a critique) why other operating systems, proprietary or not, all lack documentation like this.

    My "default" source of Solaris knowledge is the Oracle manuals and blogs; for other operating systems, it is certain third-party forums and blogs.