in Oracle Solaris 11.1
by Alexandre Borges
Part 10, the final article in a series that describes the key features of ZFS in Oracle Solaris 11.1 and provides step-by-step procedures explaining how to use them. This article provides an overview of how to monitor ZFS statistics and tune ZFS performance.
Introduction to ZFS and the ZFS Intent Log
ZFS provides transactional behavior that enforces data and metadata integrity by using a powerful 256-bit checksum. This design offers a big advantage: data and metadata are written together (though not at exactly the same time) by using the "uberblock ring" concept, a round that is completed only when both data and metadata have been written. Thus, either both are written to disk or neither is. The entire operation uses a copy-on-write (COW) mechanism to guarantee the atomicity of the process, so the ZFS file system is always consistent, even after a crash. This is very different from traditional file systems, which can be corrupted because they write data and metadata in separate stages, increasing the chance of consistency problems that cannot be fixed using the fsck command.
Surprisingly, some people compare ZFS transactions to journaling file systems, but this comparison is not correct, because journaling records a log that is replayed when a crash occurs, which accelerates the recovery process but decreases performance during writes. ZFS transactions are instead similar to ACID (Atomicity, Consistency, Isolation, Durability) operations in databases such as Oracle Database, where the complete operation is either committed or totally rolled back.
By the way, the ZFS file system has a log named the ZIL (ZFS Intent Log) that performs similarly to a journal, but only synchronous writes are written to the ZIL (all other write operations are written directly to memory and later committed to disk), and without suffering a penalty. While the file system is online, the ZIL is never read; it is only written to. Thus, it must never be placed on temporary storage (the general recommendation is to use a dedicated log disk, such as a solid-state disk), because after a crash, committing (and confirming) the pending transactions to disk depends on the ZIL. In a nutshell, the fundamental role of the ZIL is to replay the last transactions in the event of a crash. During a write cycle, the information is written as a transaction group (txg) to memory and to the ZIL at the same time. About five seconds later, the transaction group is committed to disk (remember the uberblock ring) and the ZIL is discarded. If the system crashes during the txg commit operation, the ZIL will be used during the next Oracle Solaris boot to recover the data and to try to mount the data set.
From the explanation above, we now know that the ZIL is used by a data set (that is, a volume or file system), but two more questions arise: What is the appropriate size for the ZIL? And how do we know whether our data set uses the ZIL heavily?
The first question is difficult to answer, but the minimum size of the ZIL is 64 MB and the maximum size is the amount of RAM divided by 2. However, it is uncommon to have a ZIL larger than 16 GB.
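Those bounds can be sketched with a tiny helper; the `zil_bounds` function name below is ours, for illustration only, applying the 64 MB floor and RAM/2 ceiling quoted above:

```shell
# zil_bounds -- print the ZIL sizing bounds described above:
# a 64 MB minimum and a maximum of half the physical RAM.
# (Hypothetical helper; not a Solaris tool.)
zil_bounds() {
  ram_mb=$1                     # physical RAM in MB, supplied by the caller
  min_mb=64
  max_mb=$((ram_mb / 2))
  echo "min=${min_mb}MB max=${max_mb}MB"
}
```

For example, `zil_bounds 8192` (an 8 GB system) reports a 64 MB floor and a 4096 MB ceiling, well below the 16 GB size that is rarely exceeded in practice.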
To answer the second question, we could use the zilstat.ksh script from Richard Elling to follow the write activity on the ZIL device. To get a feel for zilstat.ksh, execute the command shown below:
root@solaris11-1:~# ./zilstat.ksh -h
The following usage information comes from Elling's website:
zilstat.ksh [gMt][-l linecount] [-p poolname] [interval [count]]
-M # print numbers as megabytes (base 10)
-t # print timestamp
-p poolname # only look at poolname
-l linecount # print header every linecount lines (default=only once)
interval in seconds or "txg" for transaction group commit intervals
note: "txg" only appropriate when -p poolname is used
count will limit the number of intervals reported
Here are some examples:
| Command | Description |
| --- | --- |
| zilstat.ksh | Default output, 1-second samples |
| zilstat.ksh 10 | Ten-second samples |
| zilstat.ksh 10 6 | Print 6 x 10-second samples |
| zilstat.ksh -p rpool | Show ZIL stats for rpool only |
Output (Note: data bytes are actual data; total bytes counts buffer size.):
[TIME]
N-Bytes: data bytes written to ZIL over the interval
N-Bytes/s: data bytes per second written to ZIL over the interval
N-Max-Rate: maximum data rate during any 1-second sample
B-Bytes: buffer bytes written to ZIL over the interval
B-Bytes/s: buffer bytes per second written to ZIL over the interval
B-Max-Rate: maximum buffer rate during any 1-second sample
Ops: number of synchronous IOPS per interval
<=4kB: number of synchronous IOPS <= 4k bytes per interval
4-32kB: number of synchronous IOPS 4k-32k bytes per interval
>=32kB: number of synchronous IOPS >= 32k bytes per interval
To test the script, create a mirrored pool with a mirrored log:
root@solaris11-1:~# zpool create zil_pool mirror c7t2d0 c7t3d0 log mirror c7t4d0 c7t5d0
root@solaris11-1:~# zpool status zil_pool
pool: zil_pool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
zil_pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c7t2d0 ONLINE 0 0 0
c7t3d0 ONLINE 0 0 0
logs
mirror-1 ONLINE 0 0 0
c7t4d0 ONLINE 0 0 0
c7t5d0 ONLINE 0 0 0
root@solaris11-1:~# zfs create zil_pool/fs_zil
root@solaris11-1:~# zfs list zil_pool/fs_zil
NAME USED AVAIL REFER MOUNTPOINT
zil_pool/fs_zil 31K 9.78G 31K /zil_pool/fs_zil
Now it's time to check for any write operations on the log by executing the following command (of course, this case will not show anything because it's an isolated test):
root@solaris11-1:~# ./zilstat.ksh -p zil_pool 1 60
N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
Our analysis from running zilstat shows that the ZIL is not being used, so it might seem suitable to disable it. Nonetheless, the usual recommendation is not to disable the ZIL, especially when serving NFS.
To disable the ZIL, run the following commands:
root@solaris11-1:~# zfs get sync zil_pool
NAME PROPERTY VALUE SOURCE
zil_pool sync standard default
root@solaris11-1:~# zfs set sync=disabled zil_pool
root@solaris11-1:~# zfs get sync zil_pool
NAME PROPERTY VALUE SOURCE
zil_pool sync disabled local
To re-enable the ZIL, execute the following commands:
root@solaris11-1:~# zfs set sync=standard zil_pool
root@solaris11-1:~# zfs get sync zil_pool
NAME PROPERTY VALUE SOURCE
zil_pool sync standard local
DTrace brings some possibilities for following the ZIL behavior and operations, as shown below:
root@solaris11-1:~# dtrace -n zil*:entry'{@[probefunc]=count();}'
dtrace: description 'zil*:entry' matched 60 probes
^C
zil_clean 1
zil_itxg_clean 1
zil_header_in_syncing_context 3
zil_sync 3
To finish this small experiment using the ZIL, destroy the zil_pool pool by running the following command:
root@solaris11-1:~# zpool destroy zil_pool
The Self-Healing Capability of ZFS
Continuing on, ZFS is a self-healing file system that uses 256-bit checksum verification to fix a corrupted block. For example, in a mirrored configuration using two disks, ZFS tries to read a block from the first disk. If the checksum reveals that the block is corrupted, ZFS performs a self-healing job: it reads the block from the second disk and verifies its checksum. If that copy is OK, ZFS replaces the bad block on the first disk with the good one from the second disk.
As a simple and straightforward demonstration of the self-healing feature, a quick step-by-step sequence follows. First, we choose the disks, create the pool, and verify its status, as shown below:
root@solaris11-1:~# devfsadm
root@solaris11-1:~# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c7t0d0 <ATA-VBOX HARDDISK-1.0-40.00GB>
/pci@0,0/pci8086,2829@d/disk@0,0
1. c7t2d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
/pci@0,0/pci8086,2829@d/disk@2,0
2. c7t3d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
/pci@0,0/pci8086,2829@d/disk@3,0
3. c7t4d0 <ATA-VBOX HARDDISK-1.0 cyl 1303 alt 2 hd 255 sec 63>
/pci@0,0/pci8086,2829@d/disk@4,0
Specify disk (enter its number): ^D
root@solaris11-1:~# zpool create selfheal_pool mirror c7t2d0 c7t3d0
root@solaris11-1:~# zpool list selfheal_pool
NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
selfheal_pool 9.94G 85K 9.94G 0% 1.00x ONLINE -
Create a file system named test_fs inside selfheal_pool by running the following command:
root@solaris11-1:~# zfs create selfheal_pool/test_fs
Copy some information (any data) to the created file system, for example, by executing the following command:
root@solaris11-1:~# cp -r /kernel /usr/jdk /usr/gnu /selfheal_pool/test_fs
Verify the file system status by running the following:
root@solaris11-1:~# zfs list selfheal_pool/test_fs
NAME USED AVAIL REFER MOUNTPOINT
selfheal_pool/test_fs 956M 8.85G 956M /selfheal_pool/test_fs
In another terminal window, execute the following command:
root@solaris11-1:~# dtrace -n zfs:zio_checksum_error:entry
Now it is time to destroy some of the data of a ZFS disk from selfheal_pool by executing the following command:
root@solaris11-1:~# dd if=/dev/urandom of=/dev/dsk/c7t2d0s0 bs=1024k count=5000 conv=notrunc
0+5000 records in
0+5000 records out
To ensure that no data remains in the cache, export and import the pool again, as shown below:
root@solaris11-1:~# zpool export -f selfheal_pool
root@solaris11-1:/# zpool import -f selfheal_pool
root@solaris11-1:~# zpool status selfheal_pool
pool: selfheal_pool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
selfheal_pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c7t2d0 ONLINE 0 0 0
c7t3d0 ONLINE 0 0 0
errors: No known data errors
root@solaris11-1:~# cd /selfheal_pool/test_fs/
root@solaris11-1:/selfheal_pool/test_fs# ls
gnu jdk kernel
No data was lost and everything in the file system is fine. It is easy to see that ZFS caught the checksum errors introduced by the dd command. In the last column of the output from the DTrace command below, zero means "OK" and a non-zero value means "error," but ZFS healed itself, no data was lost, and there is nothing to worry about, as the previous step proved:
root@solaris11-1:/# dtrace -n 'fbt::zio_checksum_error:return {trace(arg1)}'
CPU ID FUNCTION:NAME
0 41554 zio_checksum_error:return 0
0 41554 zio_checksum_error:return 0
0 41554 zio_checksum_error:return 50
0 41554 zio_checksum_error:return 50
0 41554 zio_checksum_error:return 50
0 41554 zio_checksum_error:return 50
...
ZFS Caches
ZFS deploys a very interesting kind of cache named ARC (Adaptive Replacement Cache) that caches data from all active storage pools. The ARC grows and shrinks as the system's workload demand for memory fluctuates, using two caching algorithms at the same time to balance main memory: MRU (most recently used) and MFU (most frequently used). These two caching algorithms generate two lists (the MRU list and the MFU list) that hold metadata about the data cached in memory, whereas counterpart lists (MRU ghost and MFU ghost) hold metadata about the evicted data from cache in memory.
Note that both the MRU and MFU ghost lists play an important role in the ARC's capacity to self-adapt to load, and it is worth highlighting the fundamental concept again: both ghost lists hold only metadata about evicted pages; there is no data in them. Furthermore, understanding the inner workings of the ARC helps in understanding the MRU and MFU algorithms.
Usually, commonly deployed LRU (least recently used) mechanisms present problems because they don't have a good way to handle scans of files through the file system. Therefore, a heavy read of sequential data can "trash" the file system cache. The worst case occurs when this read operation happens only once, evicting "good" data from the cache.
The directory table for pages in the cache is sized to track twice as much data as the cache holds, and ZFS maintains four internal lists: the MRU (most recently used pages), the MFU (most frequently used pages), and the ghost MRU and ghost MFU, which do not hold any data. When an application reads a page from a file system, the page goes into the cache and a reference to it is placed in the MRU (probably at the first position). If the same page is read again, the page appears in the MFU, because a simple rule states that when a page is read two or more times, its reference goes to the MFU and is also updated in the MRU.
Obviously, pages that have not been accessed recently are moved to the end of both lists. However, the cache size is finite, so eventually a page (the least used) must be evicted from the cache to disk. Nevertheless, no page can be evicted if there is any reference to it in either the MRU or the MFU.
Therefore, at some point, the page reference is evicted from the MRU list, so a reference to this page is created in the ghost MRU, and this page can be evicted from cache to disk. (We should remember that a page can be evicted from cache only if there isn't any reference to this page in either the MRU list or the MFU list.) In this case, clearly the ghost MRU works as a kind of "back log" of recently evicted pages.
Sometime later, after many pages were evicted from cache, their references are created in the ghost MRU (in this case, it works as a second list), and our first page's reference in the ghost MRU is evicted from there. Thus, no reference exists in the ghost MRU either.
As a simple conclusion, if a read operation happens after the first page is evicted from cache but before it is evicted from the ghost MRU, this page must be fetched from disk. However, the ghost MRU has a clear indication that this page was evicted recently. Finally, this indicates that the ZFS cache is smaller than necessary.
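The ghost-MRU behavior described above can be illustrated with a toy model. This is a deliberately simplified sketch (not ZFS code): a two-page "cache" plus a two-entry ghost list; a hit in the ghost list is the signal that the cache is smaller than the working set:

```shell
# Toy illustration of the ghost-MRU idea (not ZFS code):
# cache holds at most 2 page names; ghost remembers the names
# (metadata only, no data) of the last 2 evicted pages.
cache=""; ghost=""
touch_page() {
  page=$1
  case " $cache " in
    *" $page "*) echo "$page: cache hit"; return ;;
  esac
  case " $ghost " in
    *" $page "*) echo "$page: ghost hit (cache smaller than working set)" ;;
    *)           echo "$page: miss" ;;
  esac
  cache="$cache $page"
  # Evict the oldest page once more than 2 pages are cached;
  # its name moves to the ghost list, which is also capped at 2.
  set -- $cache
  if [ $# -gt 2 ]; then
    ghost="$ghost $1"
    shift; cache="$*"
    set -- $ghost
    if [ $# -gt 2 ]; then shift; ghost="$*"; fi
  fi
}
```

Touching pages A, B, C and then A again reports a ghost hit for A: it was recently evicted, so a larger cache would have kept it, which is exactly the signal the real ghost MRU provides.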
As a final note, remember that the ZFS ARC is smart: the MRU and MFU lists (and their ghosts) do not have a fixed size and adapt according to the application load. Usually, the maximum ARC size grows to fill almost all of the available memory according to the following rules: a maximum of 75 percent of total RAM for systems with less than 4 GB, or total RAM minus 1 GB otherwise; a minimum of 64 MB; and an upper limit of one quarter of the ARC for metadata on Oracle Solaris 11. Nonetheless, the ARC adapts its size based on the amount of free physical memory; occasionally, the ARC reaches its limit (its maximum size is specified by the zfs_arc_max parameter), and then the reallocation process, which evicts some data from memory, begins. Moreover, other factors can lower this upper limit, such as page scanner activity, insufficient swap space, and the kernel heap being more than 75 percent full.
MRU and MFU statistics can be obtained by executing the following commands:
root@solaris11-1:~# kstat -p zfs:0:arcstats | grep mru
zfs:0:arcstats:mru_ghost_hits 0
zfs:0:arcstats:mru_hits 402147
root@solaris11-1:~# kstat -p zfs:0:arcstats | grep mfu
zfs:0:arcstats:mfu_ghost_hits 0
zfs:0:arcstats:mfu_hits 4240888
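From these two counters, we can compute what share of ARC hits the MFU list served. The sketch below is a hedged example (the `mfu_share` function name is ours, not a Solaris tool) that filters the kstat output through awk:

```shell
# Compute the percentage of ARC hits served from the MFU list,
# reading the mru_hits and mfu_hits counters from kstat output.
mfu_share() {
  awk '/mru_hits/ { mru = $2 } /mfu_hits/ { mfu = $2 }
       END { printf "%.2f%%\n", mfu * 100 / (mru + mfu) }'
}
```

Running `kstat -p zfs:0:arcstats | mfu_share` against the sample above would report roughly 91%, indicating that most hits come from frequently (rather than just recently) used pages.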
In addition to the ARC, there's another cache named the Level 2 Adaptive Replacement Cache (L2ARC), which is like a level-2 cache between main memory and disk. Actually, the L2ARC works like an extension to the ARC for data recently evicted from the ARC. Very good candidates for housing the L2ARC are solid-state disks (SSDs). Data evicted from the ARC goes to the L2ARC, and having the L2ARC on SSDs is a good way to accelerate random reads.
Here's a fact we must remember: The L2ARC is a low-bandwidth but low-latency device. It performs best when your working set is too large to fit into main memory, but the block size is small (32k or less). Being able to do random reads from the L2ARC's SSD devices, bypassing the main pool, boosts performance considerably.
To add SSD cache disks to a ZFS pool, run the following command and wait for some time until the data comes into the cache (the warm-up phase):
root@solaris11-1:~# zpool add zfs_l2_pool cache <ssd disk 1> <ssd disk 2> <ssd disk 3> <ssd disk 4>...
Getting ZFS Statistics
You can get basic statistics about the ZFS rpool by using the following zpool iostat command, which enables us to see how many read and write operations have happened in the rpool and how much data was written or read:
root@solaris11-1:~# zpool iostat rpool 1
capacity operations bandwidth
pool alloc free read write read write
---- ------ ----- ----- ----- ----- -----
rpool 31.1G 48.4G 14 10 1.35M 563K
rpool 31.1G 48.4G 6 48 267K 27.2M
rpool 31.1G 48.4G 14 27 859K 106K
rpool 31.2G 48.3G 23 136 1.44M 1.18M
rpool 31.2G 48.3G 4 120 516K 28.9M
rpool 31.2G 48.3G 8 162 897K 13.6M
rpool 31.2G 48.3G 12 114 522K 19.2M
rpool 31.2G 48.3G 0 28 1023 20.4M
^C
Another very good tool for tracing which operations are slowing ZFS performance is the DTrace script zfsslower.d (from Brendan Gregg's DTrace book). For example, the following command shows which ZFS operations take more than 15 milliseconds during a real access to disk (not cached):
root@solaris11-1:~/zfs_scripts# ./zfsslower.d 15
TIME PROCESS D KB ms FILE
2014 May 11 16:14:32 bash R 0 25 /opt/openv/java/jnbSA
2014 May 11 16:14:33 jnbSA R 8 22 /opt/openv/java/.nbjConf
2014 May 11 16:14:33 jnbSA R 0 21 /usr/bin/tail
2014 May 11 16:14:33 jnbSA R 0 17 /usr/bin/awk
2014 May 11 16:14:33 jnbSA R 0 21 /usr/bin/whoami
2014 May 11 16:14:33 jnbSA R 0 24 /usr/bin/locale
2014 May 11 16:14:33 jnbSA R 0 68 /usr/bin/grep
2014 May 11 16:14:33 java R 0 29 /opt/openv/java/jre/bin/amd64/java
2014 May 11 16:14:33 java R 0 22 /opt/openv/java/jre/lib/amd64/server/libjvm.so
2014 May 11 16:14:33 java R 0 20 /usr/lib/amd64/libsched.so.1
2014 May 11 16:14:33 java R 0 18 /lib/amd64/libm.so.1
2014 May 11 16:14:33 java R 0 39 /usr/lib/amd64/libdemangle.so.1
2014 May 11 16:14:33 java R 0 25 /opt/openv/java/jre/lib/amd64/libverify.so
2014 May 11 16:14:33 java R 0 22 /opt/openv/java/jre/lib/amd64/libjava.so
2014 May 11 16:14:33 java R 0 22 /opt/openv/java/jre/lib/amd64/libzip.so
2014 May 11 16:14:34 java R 2 19 /opt/openv/java/jre/lib/meta-index
...
The DTrace script zfssnoop.d is a helpful tool for seeing which processes are requesting ZFS I/O operations, and it's very suitable for analyzing performance issues, as shown below:
root@solaris11-1:~/zfs_scripts# ./zfssnoop.d
TIME(ms) UID PID PROCESS CALL KB PATH
6316520 0 6293 bprd zfs_getattr 0 /opt/openv/netbackup/bin/bprd_parent
6316520 0 6293 bprd zfs_open 0 /opt/openv/netbackup/bin/bprd_parent
6316520 0 6293 bprd zfs_getattr 0 /opt/openv/netbackup/bin/bprd_parent
6316520 0 6293 bprd zfs_readdir 0 /opt/openv/netbackup/bin/bprd_parent
6316520 0 6293 bprd zfs_readdir 0 /opt/openv/netbackup/bin/bprd_parent
6316520 0 6293 bprd zfs_close 0 /opt/openv/netbackup/bin/bprd_parent
6316520 0 6293 bprd zfs_getattr 0 /opt/openv/netbackup/bp.conf
6316520 0 797 nscd zfs_getattr 0 /etc/nsswitch.conf
6316520 0 797 nscd zfs_getattr 0 /etc/resolv.conf
6316520 0 797 nscd zfs_getattr 0 /etc/passwd
6316520 0 6293 bprd zfs_getattr 0 /opt/openv/netbackup/bp.conf
6316520 0 6293 bprd zfs_getattr 0 /opt/openv/var/license.txt
6316520 0 6293 bprd zfs_getattr 0 /opt/openv/var/license.txt
6316521 0 797 nscd zfs_getattr 0 /etc/nsswitch.conf
6316521 0 797 nscd zfs_getattr 0 /etc/resolv.conf
6316521 0 797 nscd zfs_getattr 0 /etc/passwd
...
To get ARC-specific statistics, there are many parameters that can be collected using different tools and techniques. For example, kstat is a great way to gather ARC statistics:
root@solaris11-1:~/zfs_scripts# kstat -p "zfs:0:arcstats:"
zfs:0:arcstats:buf_size 3301320
zfs:0:arcstats:c 320595190
zfs:0:arcstats:c_max 1597679616
zfs:0:arcstats:c_min 67108864
zfs:0:arcstats:class misc
zfs:0:arcstats:crtime 824.813864746
zfs:0:arcstats:data_size 281752896
zfs:0:arcstats:deleted 124895
zfs:0:arcstats:demand_data_hits 4181581
zfs:0:arcstats:demand_data_misses 30322
zfs:0:arcstats:demand_metadata_hits 329062
zfs:0:arcstats:demand_metadata_misses 15013
zfs:0:arcstats:hash_chain_max 5
zfs:0:arcstats:hash_chains 18446744073709535339
zfs:0:arcstats:hash_collisions 31739
zfs:0:arcstats:hash_elements 18446744073709478435
zfs:0:arcstats:hash_elements_max 18446744073709551615
zfs:0:arcstats:hits 4606729
zfs:0:arcstats:l2_abort_lowmem 0
zfs:0:arcstats:l2_cksum_bad 0
zfs:0:arcstats:l2_evict_lock_retry 0
zfs:0:arcstats:l2_evict_reading 0
zfs:0:arcstats:l2_feeds 0
zfs:0:arcstats:l2_hdr_size 0
zfs:0:arcstats:l2_hits 0
zfs:0:arcstats:l2_io_error 0
zfs:0:arcstats:l2_misses 45335
zfs:0:arcstats:l2_read_bytes 0
zfs:0:arcstats:l2_rw_clash 0
zfs:0:arcstats:l2_write_bytes 0
zfs:0:arcstats:l2_writes_done 0
zfs:0:arcstats:l2_writes_error 0
zfs:0:arcstats:l2_writes_hdr_miss 0
zfs:0:arcstats:l2_writes_sent 0
zfs:0:arcstats:memory_throttle_count 141
zfs:0:arcstats:meta_limit 0
zfs:0:arcstats:meta_max 41224144
zfs:0:arcstats:meta_used 23865960
zfs:0:arcstats:mfu_ghost_hits 0
zfs:0:arcstats:mfu_hits 4029416
zfs:0:arcstats:misses 132715
zfs:0:arcstats:mru_ghost_hits 0
zfs:0:arcstats:mru_hits 268796
zfs:0:arcstats:mutex_miss 4980
zfs:0:arcstats:other_size 20564640
zfs:0:arcstats:p 320595190
zfs:0:arcstats:prefetch_data_hits 50530
zfs:0:arcstats:prefetch_data_misses 68513
zfs:0:arcstats:prefetch_metadata_hits 45556
zfs:0:arcstats:prefetch_metadata_misses 18867
zfs:0:arcstats:size 305618856
zfs:0:arcstats:snaptime 10481.932353541
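From a dump like this, the overall ARC hit rate follows directly from the hits and misses counters. A hedged awk sketch (the `arc_hitrate` name is ours):

```shell
# Overall ARC hit rate from the top-level hits and misses counters.
# The ':hits$' / ':misses$' anchors skip the demand_*, prefetch_*,
# mfu_*, mru_*, and l2_* variants of those counter names.
arc_hitrate() {
  awk '$1 ~ /:hits$/ { h = $2 } $1 ~ /:misses$/ { m = $2 }
       END { printf "%.2f%%\n", h * 100 / (h + m) }'
}
```

Piping `kstat -p "zfs:0:arcstats:"` through this function on the sample above yields a hit rate of about 97%, consistent with the archits.sh output shown later in this article.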
Personally, I prefer another method: the one shown in Listing 1:
root@solaris11-1:~/zfs_scripts# echo "::arc" | mdb -k
hits = 4696135
misses = 133577
demand_data_hits = 4251628
demand_data_misses = 30808
demand_metadata_hits = 345851
demand_metadata_misses = 15083
prefetch_data_hits = 50577
prefetch_data_misses = 68616
prefetch_metadata_hits = 48079
prefetch_metadata_misses = 19070
mru_hits = 310110
mru_ghost_hits = 0
mfu_hits = 4102471
mfu_ghost_hits = 0
deleted = 124895
mutex_miss = 4986
hash_elements = 18446744073709478638
hash_elements_max = 18446744073709551615
hash_collisions = 39936
hash_chains = 18446744073709535395
hash_chain_max = 5
p = 352 MB
c = 318 MB
c_min = 64 MB
c_max = 1523 MB
size = 318 MB
buf_size = 3 MB
data_size = 294 MB
other_size = 20 MB
l2_hits = 0
l2_misses = 45891
l2_feeds = 0
l2_rw_clash = 0
l2_read_bytes = 0 MB
l2_write_bytes = 0 MB
l2_writes_sent = 0
l2_writes_done = 0
l2_writes_error = 0
l2_writes_hdr_miss = 0
l2_evict_lock_retry = 0
l2_evict_reading = 0
l2_abort_lowmem = 0
l2_cksum_bad = 0
l2_io_error = 0
l2_hdr_size = 0 MB
memory_throttle_count = 141
meta_used = 23 MB
meta_max = 39 MB
meta_limit = 0 MB
arc_no_grow = 1
arc_tempreserve = 0 MB
Listing 1
There are some important statistics shown in Listing 1:
size shows the current ARC size.
c is the target ARC size.
c_max is the maximum target ARC size.
c_min is the minimum target ARC size.
p is the target size of the MRU portion of the cache.
l2_hdr_size is the space in the ARC that is consumed by managing the L2ARC.
l2_size is the size of the data in the L2ARC.
memory_throttle_count is the number of times that ZFS had to limit the ARC growth.
For example, a constantly increasing memory_throttle_count statistic can indicate excessive pressure to evict data from the ARC.
Some of the arcstat statistics can be examined by running the following command:
root@solaris11-1:~# echo "arc_stats::print -d arcstat_p.value.ui64 \
arcstat_c.value.ui64 arcstat_c_max.value.ui64" | mdb -k
arcstat_p.value.ui64 = 0t2325356992
arcstat_c.value.ui64 = 0t3208292352
arcstat_c_max.value.ui64 = 0t3208292352
Additionally, to interpret the output shown in Listing 1, we must understand some operations, such as prefetch requests and demand requests. Demand requests are done directly to the ARC without leveraging prefetch requests. But, what are prefetch requests? A prefetch request is a "read ahead" operation where data is brought from disk in advance and kept in the ARC. Consequently, future sequential read operations can take advantage of the data in the ARC, because the prefetch request anticipated that the data brought from the disk would be needed, and this prevents the need to perform a new physical read operation later. This ZFS file-level prefetch mechanism is called a zfetch.
Therefore, the contents of the ARC can be broken down to the following: prefetch (data + metadata) and demand (data + metadata). A hit happens when the requested information is found in the ARC and a miss happens when the requested information isn't found in the ARC.
Surprisingly, it is possible to disable ZFS prefetch by adding the line set zfs:zfs_prefetch_disable = 1 to the /etc/system file. This might be recommended when you face contention on zfetch locks or when the prefetch efficiency ratio is very low and causes slow performance. For example, if many sequential small reads hit the cache, prefetching data from disk can consume a significant amount of CPU time and degrade CPU performance. Additionally, there are rare cases where application loads are limited by zfetch. Based on these concepts, we can calculate efficiency ratios. For example, to see the hit rate and miss rate for ARC data and metadata prefetch requests, we could run the command shown in Listing 2:
root@solaris11-1:~/zfs_scripts# kstat -p "zfs:0:arcstats:" | grep prefetch
zfs:0:arcstats:prefetch_data_hits 50530
zfs:0:arcstats:prefetch_data_misses 68513
zfs:0:arcstats:prefetch_metadata_hits 45556
zfs:0:arcstats:prefetch_metadata_misses 18867
Listing 2
Using the values shown in Listing 2, we can calculate the prefetch efficiency ratio by running the following command, which uses the following formula:
[(data_hits + metadata_hits)/(data_hits + data_misses + metadata_hits + metadata_misses)]:
root@solaris11-1:~/zfs_scripts# bc -l
(50530 + 45556)/(50530 + 68513 + 45556 + 18867)
.52372646702931333326 = 52.37%
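The same ratio can be computed without typing the numbers into bc. The following hedged sketch (the `prefetch_eff` name is ours) sums the four prefetch counters from Listing 2 with awk:

```shell
# Prefetch efficiency: (data_hits + metadata_hits) over all prefetch
# requests, computed from the four prefetch counters in kstat output.
prefetch_eff() {
  awk '/prefetch.*hits/   { h += $2 }
       /prefetch.*misses/ { m += $2 }
       END { printf "%.2f%%\n", h * 100 / (h + m) }'
}
```

It could be invoked as `kstat -p "zfs:0:arcstats:" | grep prefetch | prefetch_eff`, matching the 52.37% obtained manually above.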
If we only wanted to know the ratio of data hits to all data requests, which follows the formula [(data_hits)/(data_hits + data_misses)], we could run this command:
root@solaris11-1:~/zfs_scripts# bc -l
(50530/(50530 + 68513))
.42446846937661181253 = 42.44%
I'm sure you get the idea.
The following archits.sh script from Brendan Gregg makes it possible to know the hit rate, but our main concern should be with misses (not shown, but easy to calculate):
root@solaris11-1:~/zfs_scripts# more archits.sh
#!/usr/bin/sh
interval=${1:-5} # 5 secs by default
kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses $interval | awk '
BEGIN {
printf "%12s %12s %9s\n", "HITS", "MISSES", "HITRATE"
}
/hits/ {
hits = $2 - hitslast
hitslast = $2
}
/misses/ {
misses = $2 - misslast
misslast = $2
rate = 0
total = hits + misses
if (total)
rate = (hits * 100) / total
printf "%12d %12d %8.2f%%\n", hits, misses, rate
}
'
Running the archits.sh script produces output like the following:
root@solaris11-1:~/zfs_scripts# ./archits.sh
HITS MISSES HITRATE
4667846 132926 97.23%
25 0 100.00%
92 2 97.87%
287 20 93.49%
58 0 100.00%
3429 127 96.43%
4088 111 97.36%
39 0 100.00%
Eventually, we might be interested in collecting information about which processes are using the ARC most often, as shown in Listing 3:
root@solaris11-1:~/zfs_scripts# dtrace -n 'sdt:zfs::arc-hit,sdt:zfs::arc-miss { @[execname] = count() }'
dtrace: description 'sdt:zfs::arc-hit,sdt:zfs::arc-miss ' matched 3 probes
^C
...
svc.configd 314
tpconfig 331
bptm 358
firefox 398
userreq_notify 455
ksh93 469
update-refresh.s 563
nbpem 660
nbemm 683
nbsl 795
oprd 871
nbproxy 887
nbemmcmd 999
bpgetconfig 1032
dbsrv11 1170
bprd 1278
fsflush 1638
bpjava-susvc 1945
bpstsinfo 2942
sched 5433
java 8461
bpdbm 12290
Listing 3
The output in Listing 3 shows that bpdbm (a database manager from NetBackup) is using the ARC most of the time.
As mentioned previously, the maximum ARC size grows until RAM minus 1 GB is filled, but it adapts dynamically based on the amount of free physical memory on the system. Normally, that isn't a problem, but if an application needs a guaranteed amount of memory (for example, 2 GB), it could be appropriate to limit the ARC size through the zfs_arc_max parameter. For example, on a system with 8 GB of RAM, we could limit the ARC to 6 GB (8 GB - 2 GB) in the /etc/system file, as follows:
set zfs:zfs_arc_max = 6442450944
(The value is given in bytes.)
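As a sanity check on that figure (assuming a hypothetical system with 8 GB of RAM and a 2 GB application reservation), the byte value for /etc/system can be derived with shell arithmetic:

```shell
# Hypothetical sizing: 8 GB of RAM minus a 2 GB application reservation
# leaves a 6 GB ARC cap, expressed in bytes for the /etc/system setting.
arc_cap_bytes=$(( (8 - 2) * 1024 * 1024 * 1024 ))
echo "set zfs:zfs_arc_max = $arc_cap_bytes"   # the value printed is 6442450944
```

A reboot is required for /etc/system changes to take effect.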
There are many ZFS parameters in the kernel, which we can see by running the following command:
root@solaris11-1:~/zfs_scripts# echo "::zfs_params" | mdb -k
arc_reduce_dnlc_percent = 0x3
zfs_arc_max = 0x0
zfs_arc_min = 0x0
arc_shrink_shift = 0x7
zfs_mdcomp_disable = 0x0
zfs_prefetch_disable = 0x0
zfetch_max_streams = 0x8
zfetch_min_sec_reap = 0x2
zfetch_block_cap = 0x100
zfetch_array_rd_sz = 0x100000
zfs_default_bs = 0x9
zfs_default_ibs = 0xe
metaslab_aliquot = 0x80000
spa_max_replication_override = 0x3
spa_mode_global = 0x3
zfs_flags = 0x0
zfs_txg_synctime_ms = 0x1388
zfs_txg_timeout = 0x5
zfs_write_limit_min = 0x2000000
zfs_write_limit_max = 0xfdf1c00
zfs_write_limit_shift = 0x3
zfs_write_limit_override = 0x0
zfs_no_write_throttle = 0x0
zfs_vdev_cache_max = 0x4000
zfs_vdev_cache_size = 0x0
zfs_vdev_cache_bshift = 0x10