This discussion is archived
13 Replies. Latest reply: Nov 16, 2011 12:06 PM by kutrovsky

zfs list on solaris express 11 always reads from disk (metadata not cached)

kutrovsky Newbie
Hello All,

I am migrating from OpenSolaris 2009.11 to Solaris Express 11.

I noticed that "zfs list" takes longer than usual, and is not instant. I then discovered via a combination of arcstat.pl and iostat -xnc 2 that every time a list command is issued, there are disk reads. This leads me to believe that some metadata is not getting cached.

This is not the case on OpenSolaris, where repeated "zfs list" runs do not cause disk reads.

Has anyone observed this, and do you know of any solution?

This is on an IDLE system with 48 GB of RAM, with plenty of free memory.
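
For reference, the check I'm doing is roughly this (a sketch - arcstat.pl is the usual perl ARC-stats script, run from wherever it lives, and the loop is only for illustration):

# in one terminal: ARC hit/miss stats at a 1-second interval
./arcstat.pl 1
# in another: per-disk IO
iostat -xnc 2
# in a third: repeat the listing; every run still shows reads on the data disks
while true; do zfs list > /dev/null; sleep 2; done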
  • 1. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    SteveS Pro
    Can you quantify the time differences with time/timex so we know what we're dealing with?

    Is this an actual problem for you or just an observation you're trying to clarify?

    I don't have a b151a (Solaris 11 Express) system to hand, but I do have a b171 system and mine shows no IO when doing a list. Admittedly it's a laptop with only one disk, so it might be a bad system to test this on. But the approach I took is somewhat different from yours.

    If you truss the 'zfs list' command you'll see ioctls as we walk the dataset list
    $ truss -leaf -o /tmp/truss.zfs-list zfs list
    
    --- truss.zfs-list ---
    [...snip...]
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080460D0) = 0
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080460D0) = 0
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080448A0) = 0
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08043070) = 0
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08041840) Err#3 ESRCH
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08043070) Err#3 ESRCH
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080448A0) Err#3 ESRCH
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080460D0) Err#12 ENOMEM
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080460D0) = 0
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080460D0) = 0
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080448A0) = 0
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08043070) Err#3 ESRCH
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080448A0) = 0
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x08043070) Err#3 ESRCH
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080448A0) Err#3 ESRCH
    2567/1:         ioctl(3, ZFS_IOC_DATASET_LIST_NEXT, 0x080460D0) Err#3 ESRCH
    2567/1:         ioctl(1, TCGETA, 0x080449A0)                    = 0
    2567/1:         fstat64(1, 0x08044900)                          = 0
    2567/1:         write(1, " N A M E                ".., 59)      = 59
    2567/1:         write(1, " r p o o l              ".., 55)      = 55
    2567/1:         write(1, " r p o o l / R O O T    ".., 55)      = 55
    2567/1:         write(1, " r p o o l / R O O T / S".., 50)      = 50
    2567/1:         write(1, " r p o o l / R O O T / S".., 50)      = 50
    2567/1:         write(1, " r p o o l / d u m p    ".., 50)      = 50
    2567/1:         write(1, " r p o o l / e x p o r t".., 56)      = 56
    2567/1:         write(1, " r p o o l / e x p o r t".., 61)      = 61
    [...snip...]
    --- end ---
    The ioctl() calls may or may not end up in physical IOs. You need to confirm that the IO you're seeing is actually caused by 'zfs list' and not something else.

    The following will report the disk name if 'zfs list' is doing IO to the device
    $ sudo dtrace -n 'io:::start /pid == $target/ { printf("%s", args[1]->dev_statname); }' -c "zfs list"
    For me the above reported no output, and thus no physical IOs during the 'zfs list'

    If the above shows IO, you can print backtraces to see the code flow, which will give you an idea of where to probe next
    $ sudo dtrace -n 'io:::start /pid == $target/ { @[stack()] = count(); }' -c "zfs list"
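    If the per-IO output gets noisy, the same probe can be aggregated per device instead of printing every IO (just a sketch of the same idea):
    $ sudo dtrace -n 'io:::start /pid == $target/ { @[args[1]->dev_statname] = count(); }' -c "zfs list"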
    Hope it helps,
    Steve
  • 2. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    kutrovsky Newbie
    Hi Steve,

    Thanks so much for your post. I wasn't expecting such a quick and valuable response.

    Background:

    2 zpools
    root with 2 mirrors
    data with 7x 2TB WD Green drives (64 MB cache each) in RAIDZ2 mode

    I noticed "zfs list" taking a moment, and after a few runs been fast again. Machine has lots of RAM, yet this behavior seemed consistent. I started investigating. Using the "arcstat.pl" monitoring script and iostat I determined that every time there is IO sent to the disk drives. The reason for speed up is due to disk level caching.

    Although 1s is not bad, listing all snapshots takes a long time, and so does rolling back snapshots.

    Btw, how can I obtain b171 ? Is it "safe" to use?

    Timing:

    time zfs list
    real 0m1.599s
    user 0m0.009s
    sys 0m0.026s

    time zfs list

    real 0m0.057s
    user 0m0.008s
    sys 0m0.021s

    Notice how the second and subsequent runs are sub-second. However, IO is still going to disk. Here's the DTrace you sent, run AFTER the list has already been executed a few times and is "fast":

    sudo dtrace -n 'io:::start /pid == $target/ { printf("%s", args[1]->dev_statname); }' -c "zfs list"
    dtrace: description 'io:::start ' matched 6 probes
    dtrace: pid 8708 has exited
    CPU ID FUNCTION:NAME
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd4
    1 72288 bdev_strategy:start sd7
    1 72288 bdev_strategy:start sd5
    1 72288 bdev_strategy:start sd6
    1 72288 bdev_strategy:start sd10
    1 72288 bdev_strategy:start sd8
    1 72288 bdev_strategy:start sd9
    1 72288 bdev_strategy:start sd6
    7 72288 bdev_strategy:start sd8
    7 72288 bdev_strategy:start sd8
    7 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd5
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd4
    2 72288 bdev_strategy:start sd7
    2 72288 bdev_strategy:start sd5
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd4
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd4
    2 72288 bdev_strategy:start sd7
    2 72288 bdev_strategy:start sd5
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd4
    2 72288 bdev_strategy:start sd7
    2 72288 bdev_strategy:start sd5
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd4
    2 72288 bdev_strategy:start sd7
    2 72288 bdev_strategy:start sd5
    2 72288 bdev_strategy:start sd5
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd7
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd4
    2 72288 bdev_strategy:start sd7
    2 72288 bdev_strategy:start sd5
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd4
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd5
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd4
    2 72288 bdev_strategy:start sd6
    2 72288 bdev_strategy:start sd10
    2 72288 bdev_strategy:start sd8
    2 72288 bdev_strategy:start sd9
    2 72288 bdev_strategy:start sd5


    p.s.
    How do I tag a code section ?
  • 3. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    kutrovsky Newbie
    Here's the second dtrace with the backtrace:

    dtrace -n 'io:::start /pid == $target/ { @[stack()] = count(); }' -c "zfs list"
    genunix`ldi_strategy+0x59
    zfs`vdev_disk_io_start+0xd0
    zfs`zio_vdev_io_start+0x1ed
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`vdev_mirror_io_start+0x14c
    zfs`zio_vdev_io_start+0x1ed
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`vdev_mirror_io_start+0x14c
    zfs`zio_vdev_io_start+0x229
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`arc_read_nolock+0x7ee
    zfs`arc_read+0x79
    zfs`dsl_read+0x33
    zfs`dbuf_read_impl+0x17e
    zfs`dbuf_read+0xfd
    zfs`dnode_hold_impl+0xf1
    zfs`dnode_hold+0x2b
    20

    genunix`ldi_strategy+0x59
    zfs`vdev_disk_io_start+0xd0
    zfs`zio_vdev_io_start+0x1ed
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`vdev_mirror_io_start+0x14c
    zfs`zio_vdev_io_start+0x1ed
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`vdev_mirror_io_start+0x14c
    zfs`zio_vdev_io_start+0x229
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`arc_read_nolock+0x7ee
    zfs`arc_read+0x79
    zfs`dsl_read+0x33
    zfs`dbuf_read_impl+0x17e
    zfs`dbuf_read+0x1b4
    zfs`dmu_buf_hold+0xac
    zfs`zap_lockdir+0x6d
    20

    genunix`ldi_strategy+0x59
    zfs`vdev_disk_io_start+0xd0
    zfs`zio_vdev_io_start+0x1ed
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`vdev_mirror_io_start+0x14c
    zfs`zio_vdev_io_start+0x1ed
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`vdev_mirror_io_start+0x14c
    zfs`zio_vdev_io_start+0x229
    zfs`zio_execute+0x8d
    zfs`zio_nowait+0x42
    zfs`arc_read_nolock+0x7ee
    zfs`arc_read+0x79
    zfs`dsl_read+0x33
    zfs`dbuf_read_impl+0x17e
    zfs`dbuf_read+0x1b4
    zfs`dbuf_findbp+0xf8
    zfs`dbuf_hold_impl+0x81
    120
  • 4. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    SteveS Pro
    kutrovsky wrote:

    Btw, how can I obtain b171 ? Is it "safe" to use?
    You can't until it's officially released. Development versions are only available to Oracle Employees.
    Timing:

    time zfs list
    real 0m1.599s
    user 0m0.009s
    sys 0m0.026s

    time zfs list

    real 0m0.057s
    user 0m0.008s
    sys 0m0.021s

    Notice how the second and subsequent runs are sub-second. However, IO is still going to disk. Here's the DTrace you sent, run AFTER the list has already been executed a few times and is "fast":
    In version 31 of zpool we implemented an enhancement to 'zfs list' - http://hub.opensolaris.org/bin/view/Community+Group+zfs/31
    $ zpool upgrade -v
    This system is currently running ZFS pool version 33.
    
    The following versions are supported:
    
    VER  DESCRIPTION
    ---  --------------------------------------------------------
    [...snip...]
     31  Improved 'zfs list' performance
    [...snip...]
    The improved 'zfs list' performance came in part from fixing an accumulation of several bugs:

    6386929 - initial "zfs list" is slow.
    6755389 - Initial run of any zfs command that iterates over datasets can be slow
    6539380 - zfs list uses too much memory
    6596160 - zfs list is slow due to version property
    6734916 - zfs list should omit snapshots by default

    Zpool version 31 was released in build 150. Solaris 11 Express was distributed with b151a, so check your zpool version to see what you're running. If it's older than version 31, run "zpool upgrade -a" to bring it up to date. The command can be run while the zpool is online and takes effect immediately. However, be warned that it's a one-way process: you cannot downgrade your zpool version.
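
    To check the version of a specific pool (the pool names here are only examples), and to see whether any pool is below the current version:
    $ zpool get version rpool data
    $ zpool upgrade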

    When most of the data is cached in the ARC, zfs list completes relatively quickly. Otherwise, the vast majority of the time needed for zfs list to complete is spent waiting for the root block of each dataset to be read in from disk. The DTrace you ran proves this is what's happening.

    If you're already running zpool v31 then it's likely that the data is being pushed out of cache/ARC and there's no way to persistently cache the data to ensure zfs commands are always quick on the first execution. The more zpools, filesystems, and snapshots you have, the more chance you have that the data isn't in cache and thus 'zfs list' will need to go to disk for the information.
    p.s.
    How do I tag a code section ?
    Use the [ code ] [ /code ] tags (without spaces)

    For other formatting codes click the FAQ link in the top right.
  • 5. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    kutrovsky Newbie
    Hi Steve,

    - Definitely not getting aged out of the cache; the cache is not yet fully utilized
    * The DTrace was run after repeated runs on an IDLE system, so nothing could have aged it out
    - My pool version is 31. (checked with zpool)
    - I created this pool with b151

    Now, I did use a modified zpool program to create the pool with a different ashift. My disk drives are the lying kind: they have 4K physical sectors but report 512-byte sectors for compatibility. The performance difference is significant. ZFS does write large blocks, but with the default ashift they weren't aligned on 4K boundaries.

    I used the zpool command from this blog:

    http://www.solarismen.de/archives/5-Solaris-and-the-new-4K-Sector-Disks-e.g.-WDxxEARS-Part-2.html

    Could this be the problem? Can't cache 4K metadata?
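
    For reference, the ashift the pool actually ended up with can be confirmed from the cached pool config (a sketch - it should show 12, i.e. 2^12 = 4K, instead of the default 9):
    $ zdb | grep ashift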
  • 6. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    kutrovsky Newbie
    Steve,

    A follow up on my case.

    This lack of caching for certain metadata is causing a lot of problems. After enabling the automatic snapshots, especially the "Frequent" one (every 15 minutes), the resulting IO is significant, since every snapshot created gets read back from disk at a few IOs each.

    I am wondering if you've managed to work out why these differences exist, especially given that this is not the case on OpenSolaris.
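
    To put a number on it, the same io probe from before should be able to count the reads for a full snapshot listing (sketch only):
    $ sudo dtrace -n 'io:::start /pid == $target/ { @ = count(); }' -c "zfs list -t snapshot"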
  • 7. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    kutrovsky Newbie
    Hi Steve,

    I was wondering if you were planning to help me out a bit more in this particular case?

    I wonder if this is an issue that is resolved in future releases. I would be willing to give a recent beta a try, although I am sure you can try it as well; it's quite reproducible.

    Thanks !
  • 8. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    SteveS Pro
    Apologies for the late reply. I've been busy with service request volume, but I do still have this on my radar.
    If you need a quicker response, feel free to raise a P4 service request with us and one of my Kernel ZFS colleagues will be able to assist.

    Regarding the beta testing, that's not going to be possible. If you have a support contract you should have access to the S11 Support Repository where some of the bug fixes are being deployed. Doc ID: 1021281.1 - "Support Repositories Explained" explains more and has the link to request your SSL Certificate which is required to get access.
  • 9. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    kutrovsky Newbie
    Thanks for the info, Steve. I appreciate your reply even when you're under load!

    Do you know how close you are to a new major release? I checked the list of bug fixes (SRU10 readme), but I did not notice anything regarding this lack of caching of certain metadata. Would something like that be there?

    I keep wondering which specific type of metadata is not cached; any idea how to find out?
  • 10. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    SteveS Pro
    Do you know how close you are to a new major release?
    Solaris 11 will be released before the end of the year, but this is still subject to change.
    I checked the list of bug fixes (SRU10 readme), but I did not notice anything regarding this lack of caching of certain metadata. Would something like that be there?
    The SRUs only contain bug fixes for certain issues at the moment. Unlike the OpenSolaris updates, which carried both features and bug fixes, an SRU provides bug fixes only. Solaris 11 will contain all the bug fixes and feature updates when it's released.
  • 11. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    SteveS Pro
    As we can see from the stacks, arc_read() is where we start to check whether the info we need is in the cache or not.
    http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#2644
    
       2621 /*
       2622  * "Read" the block block at the specified DVA (in bp) via the
       2623  * cache.  If the block is found in the cache, invoke the provided
       2624  * callback immediately and return.  Note that the `zio' parameter
       2625  * in the callback will be NULL in this case, since no IO was
       2626  * required.  If the block is not in the cache pass the read request
       2627  * on to the spa with a substitute callback function, so that the
       2628  * requested block will be added to the cache.
       2629  *
       2630  * If a read request arrives for a block that has a read in-progress,
       2631  * either wait for the in-progress read to complete (and return the
       2632  * results); or, if this is a read with a "done" func, add a record
       2633  * to the read to invoke the "done" func when the read completes,
       2634  * and return; or just return.
       2635  *
       2636  * arc_read_done() will invoke all the requested "done" functions
       2637  * for readers of this block.
       2638  *
       2639  * Normal callers should use arc_read and pass the arc buffer and offset
       2640  * for the bp.  But if you know you don't need locking, you can use
       2641  * arc_read_bp.
       2642  */
       2643 int
       2644 arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_buf_t *pbuf,
       2645     arc_done_func_t *done, void *private, int priority, int zio_flags,
       2646     uint32_t *arc_flags, const zbookmark_t *zb)
       2647 {
    You can see the code flow within arc_read() using the following DTrace script
    --- zfs_list.d ---
    #!/usr/sbin/dtrace -s
    
    #pragma D option flowindent
    
    fbt:zfs:arc_read:entry
    /pid == $target/
    {
         self->interested = 1;
    }
    
    fbt:zfs::,
    sdt:zfs::
    /self->interested/
    {
    }
    
    fbt:zfs:arc_read:return
    /pid == $target/
    {
            self->interested = 0;
    }
    --- end ---
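    To run it, save the script as zfs_list.d, make it executable, and pass the command under test with -c:
    $ chmod +x zfs_list.d
    $ sudo ./zfs_list.d -c "zfs list"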
    Here's an example where the data block IS in the ARC
    CPU FUNCTION                                 
      1  -> arc_read                              
      1    -> arc_hash_find                       
      1      -> spa_guid                          
      1      <- spa_guid                          
      1      -> arc_hash                          
      1      <- arc_hash                          
      1      -> arc_hold                          
      1      <- arc_hold                          
      1    <- arc_hash_find                       
      1    -> arc_add_ref                         
      1      -> arc_hold                          
      1      <- arc_hold                          
      1    <- arc_add_ref                         
      1    -> arc_access                          
      1    <- arc_access                          
      1   | arc_read:arc-hit        <-- HIT
      1    -> arc_getref_func                     
      1    <- arc_getref_func                     
      1    -> arc_rele                            
      1    <- arc_rele                            
      1  <- arc_read 
    [ ... ]
    And here's an example where it's not
      1  -> arc_read                              
      1    -> arc_hash_find                       
      1      -> spa_guid                          
      1      <- spa_guid                          
      1      -> arc_hash                          
      1      <- arc_hash                          
      1    <- arc_hash_find                       
      1   | arc_read:arc-miss   <-- MISS                  
      1    -> arc_get_data_block                  
      1      -> zio_data_buf_alloc                
      1      <- zio_data_buf_alloc                
      1      -> arc_space_consume                 
      1      <- arc_space_consume                 
      1    <- arc_get_data_block                  
      1   | arc_read:l2arc-miss                   
      1    -> zio_read                            
      1      -> spa_version                       
      1      <- spa_version                       
      1      -> zio_create                        
      1        -> zio_add_child                   
      1        <- zio_add_child                   
      1      <- zio_create                        
      1    <- zio_read                            
      1    -> zio_nowait                          
      1      -> zio_unique_parent                 
      1        -> zio_walk_parents                
      1        <- zio_walk_parents                
      1        -> zio_walk_parents                
      1        <- zio_walk_parents                
      1      <- zio_unique_parent                 
      1      -> zio_execute                       
      1        -> zio_read_bp_init                
      1        <- zio_read_bp_init                
      1        -> zio_ready                       
      1          -> zio_wait_for_children         
      1          <- zio_wait_for_children         
      1          -> zio_wait_for_children         
      1          <- zio_wait_for_children         
      1          -> zio_walk_parents              
      1          <- zio_walk_parents              
      1          -> zio_walk_parents              
      1          <- zio_walk_parents              
      1          -> zio_notify_parent             
      1          <- zio_notify_parent             
      1        <- zio_ready                       
      1        -> zio_taskq_member                
      1        <- zio_taskq_member                
      1        -> zio_vdev_io_start               
      1          -> spa_config_enter              
      1          <- spa_config_enter              
      1          -> vdev_mirror_io_start          
      1            -> vdev_mirror_map_alloc       
      1              -> spa_get_random            
      1              <- spa_get_random            
      1              -> vdev_lookup_top           
      1              <- vdev_lookup_top           
      1            <- vdev_mirror_map_alloc       
      1            -> vdev_mirror_child_select    
      1              -> vdev_readable             
      1                -> vdev_is_dead            
      1                <- vdev_is_dead            
      1              <- vdev_readable             
      1              -> vdev_dtl_contains         
      1              <- vdev_dtl_contains         
      1            <- vdev_mirror_child_select    
      1            -> zio_vdev_child_io           
      1              -> zio_create                
      1                -> zio_add_child           
      1                <- zio_add_child           
      1              <- zio_create                
      1            <- zio_vdev_child_io           
      1            -> zio_nowait                  
      1              -> zio_execute               
      1                -> zio_vdev_io_start       
      1                  -> spa_syncing_txg       
      1                  <- spa_syncing_txg       
      1                  -> vdev_cache_read       
      1                  <- vdev_cache_read       
      1                  -> vdev_queue_io         
      1                    -> vdev_queue_io_add   
      1                    <- vdev_queue_io_add   
      1                    -> vdev_queue_io_to_issue 
      1                    <- vdev_queue_io_to_issue 
      1                  <- vdev_queue_io         
      1                <- zio_vdev_io_start       
      1              <- zio_execute               
      1            <- zio_nowait                  
      1          <- vdev_mirror_io_start          
      1        <- zio_vdev_io_start               
      1        -> zio_vdev_io_done                
      1          -> zio_wait_for_children         
      1          <- zio_wait_for_children         
      1        <- zio_vdev_io_done                
      1      <- zio_execute                       
      1    <- zio_nowait                          
      1  <- arc_read               
    You can clearly see that when it's not in the cache, we go to the disk to retrieve it.

    It's not obvious from the above, but looking at the source code the real magic/decision happens in arc_read_nolock()
    http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#2671
    Remember that not all functions have DTrace probes assigned. The stack traces you provided earlier + DTrace show the real code flow in varying degrees of detail.

    The above was actually reproduced on my laptop (b171a), but I'm short on memory, so the 'zfs list' metadata put into the ARC was eventually evicted to make way for new data. That's okay with me :)
    $ echo "::memstat" | sudo mdb -k 
    Password: 
    Page Summary                Pages                MB  %Tot
    ------------     ----------------  ----------------  ----
    Kernel                     196070               765   23%
    ZFS File Data              135280               528   16%
    Anon                       399927              1562   47%
    Exec and libs                9088                35    1%
    Page cache                  38662               151    5%
    Free (cachelist)            27618               107    3%
    Free (freelist)             43065               168    5%
    
    Total                      849710              3319
    Physical                   849709              3319
    The following looks at ARC activity in general and, for me, clearly shows I'm short on memory
    $ sudo dtrace -n 'fbt:zfs:arc*: {@[probefunc] = count();}'
    Q) Have you implemented any ARC tuning to limit the ARC?
    Q) Are you running short on memory? (the memstat above should tell you)
    Q) Are any of your filesystems over 70% full?
    Q) Have you altered what data is/is not cached? ($ zfs get primarycache)

    All of those can affect the ARC eviction algorithm.
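
    A few quick checks should answer most of those (a sketch - pool names are examples; the arcstats kstat is the standard one):
    $ kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_max   # current ARC size vs. its limit
    $ zpool list                                          # the CAP column shows how full each pool is
    $ zfs get primarycache rpool data                     # confirm the caching policy hasn't been changed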

    What would be more interesting is to watch the data blocks be loaded into the ARC by 'zfs list', then watch them be pushed to the ghost list before finally being dereferenced, forcing the next request for the data to fetch it from disk. A task for another day, I'm afraid.

    There have been a lot of changes to the way the ARC works between b134 (OpenSolaris) and b151a (Solaris 11 Express), so it's not surprising that there are some behavioural differences as well.

    To be continued.....
  • 12. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    kutrovsky Newbie
    Hi Steve,

    Great info again. I am still new to DTrace, particularly navigating probes, etc. I've seen that navigation tree before.

    I would like to start by answering your questions:

    Q) Have you implemented any ARC tuning to limit the ARC?
    -> No, out-of-the-box config.

    Q) Are you running short on memory? (the memstat above should tell you)
    -> Definitely not. I have 48 GB of RAM; the ARC grows to about 38 GB and then stops growing. I can reproduce the problem at will right after boot with only 8 GB used. Nothing is getting aged out of the ARC at that time; only these metadata reads never get stored.

    Q) Are any of your fileystems over 70% full?
    -> No. I am curious, though: what changes when this happens, particularly in regard to the ARC? Perhaps a topic for another discussion; I don't want to distract from this subject.

    Q) Have you altered what data is/is not cached? ($ zfs get primarycache)
    -> No - everything should be cached. I also recently added an L2ARC cache device (80 GB). The metadata is not cached there either.


    I am not yet familiar with DTrace's aggregation capabilities, so I had to parse the output with perl. Notice how each execution has the exact same number of misses. This is because these particular data blocks (metadata blocks) are not cached in the ARC at all:
    :~/dtrace# perl -MData::Dumper -e 'while (<>) {if (/.+(:arc-hit|:arc-miss).*/) { $h{$1}+=1}} print Dumper \%h' ^C
    :~/dtrace# ./zfs_list.d -c 'zfs list' |perl -MData::Dumper -e 'while (<>) {if (/.+(:arc-hit|:arc-miss).*/) { $h{$1}+=1}} print Dumper \%h'
    dtrace: script './zfs_list.d' matched 4828 probes
    dtrace: pid 11021 has exited
    $VAR1 = {
              ':arc-hit' => 2,
              ':arc-miss' => 192
            };
    :~/dtrace# ./zfs_list.d -c 'zfs list' |perl -MData::Dumper -e 'while (<>) {if (/.+(:arc-hit|:arc-miss).*/) { $h{$1}+=1}} print Dumper \%h'
    dtrace: script './zfs_list.d' matched 4828 probes
    dtrace: pid 11026 has exited
    $VAR1 = {
              ':arc-hit' => 1,
              ':arc-miss' => 192
            };
    :~/dtrace# ./zfs_list.d -c 'zfs list' |perl -MData::Dumper -e 'while (<>) {if (/.+(:arc-hit|:arc-miss).*/) { $h{$1}+=1}} print Dumper \%h'
    dtrace: script './zfs_list.d' matched 4828 probes
    dtrace: pid 11031 has exited
    $VAR1 = {
              ':arc-hit' => 12,
              ':arc-miss' => 192
            };
    :~/dtrace# ./zfs_list.d -c 'zfs list' |perl -MData::Dumper -e 'while (<>) {if (/.+(:arc-hit|:arc-miss).*/) { $h{$1}+=1}} print Dumper \%h'
    dtrace: script './zfs_list.d' matched 4828 probes
    dtrace: pid 11036 has exited
    $VAR1 = {
              ':arc-hit' => 4,
              ':arc-miss' => 192
            };
    :~/dtrace# ./zfs_list.d -c 'zfs list' |perl -MData::Dumper -e 'while (<>) {if (/.+(:arc-hit|:arc-miss).*/) { $h{$1}+=1}} print Dumper \%h'
    dtrace: script './zfs_list.d' matched 4828 probes
    dtrace: pid 11041 has exited
    $VAR1 = {
              ':arc-hit' => 27,
              ':arc-miss' => 192
            };
    :~/dtrace# 
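    (Presumably the same counts could come straight out of DTrace by aggregating on the sdt probes seen above, skipping the perl - an untested sketch:)
    dtrace -n 'sdt:zfs::arc-hit /pid == $target/ { @["hit"] = count(); } sdt:zfs::arc-miss /pid == $target/ { @["miss"] = count(); }' -c 'zfs list'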
    I presume the next step would be to perform stack analysis on which blocks are not being cached. I don't know how to do this ... I am guessing "| arc_read_nolock:arc-miss" is a mid-function probe? I don't know how to access its parameters.
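    My best guess at the stack-analysis part (again untested, and since I don't know the probe's arguments this only aggregates the kernel stacks at each miss):
    dtrace -n 'sdt:zfs:arc_read_nolock:arc-miss /pid == $target/ { @[stack(20)] = count(); }' -c 'zfs list'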

    FYI, here's an example of a cache miss in my zfs list:
      0  -> arc_read                              
      0    -> arc_read_nolock                     
      0      -> spa_guid                          
      0      <- spa_guid                          
      0      -> buf_hash_find                     
      0        -> buf_hash                        
      0        <- buf_hash                        
      0      <- buf_hash_find                     
      0      -> add_reference                     
      0      <- add_reference                     
      0      -> buf_cons                          
      0        -> arc_space_consume               
      0        <- arc_space_consume               
      0      <- buf_cons                          
      0      -> arc_get_data_buf                  
      0        -> arc_adapt                       
      0          -> arc_reclaim_needed            
      0          <- arc_reclaim_needed            
      0        <- arc_adapt                       
      0        -> arc_evict_needed                
      0          -> arc_reclaim_needed            
      0          <- arc_reclaim_needed            
      0        <- arc_evict_needed                
      0        -> zio_buf_alloc                   
      0        <- zio_buf_alloc                   
      0        -> arc_space_consume               
      0        <- arc_space_consume               
      0      <- arc_get_data_buf                  
      0      -> arc_access                        
      0       | arc_access:new_state-mfu          
      0        -> arc_change_state                
      0        <- arc_change_state                
      0      <- arc_access                        
      0     | arc_read_nolock:arc-miss            
      0     | arc_read_nolock:l2arc-miss  
      0      -> zio_read                          
      0        -> zio_create                      
      0          -> zio_add_child                 
      0          <- zio_add_child                 
      0        <- zio_create                      
      0      <- zio_read                          
      0      -> zio_nowait                        
      0        -> zio_unique_parent               
      0          -> zio_walk_parents              
      0          <- zio_walk_parents              
      0          -> zio_walk_parents              
      0          <- zio_walk_parents              
      0        <- zio_unique_parent               
      0        -> zio_execute                     
      0          -> zio_read_bp_init              
      0            -> zio_buf_alloc               
      0            <- zio_buf_alloc               
      0            -> zio_push_transform          
      0            <- zio_push_transform          
      0          <- zio_read_bp_init              
      0          -> zio_ready                     
      0            -> zio_wait_for_children       
      0            <- zio_wait_for_children       
      0            -> zio_wait_for_children       
      0            <- zio_wait_for_children       
      0            -> zio_walk_parents            
      0            <- zio_walk_parents            
      0            -> zio_walk_parents            
      0            <- zio_walk_parents            
      0            -> zio_notify_parent           
      0            <- zio_notify_parent           
      0          <- zio_ready                     
      0          -> zio_taskq_member              
      0          <- zio_taskq_member              
      0          -> zio_vdev_io_start             
      0            -> spa_config_enter            
      0            <- spa_config_enter            
      0            -> vdev_mirror_io_start        
      0              -> vdev_mirror_map_alloc     
      0                -> spa_get_random          
      0                <- spa_get_random          
      0                -> vdev_lookup_top         
      0                <- vdev_lookup_top         
      0              <- vdev_mirror_map_alloc     
      0              -> vdev_mirror_child_select  
      0                -> vdev_readable           
      0                  -> vdev_is_dead          
      0                  <- vdev_is_dead          
      0                <- vdev_readable           
      0                -> vdev_dtl_contains       
      0                <- vdev_dtl_contains       
      0              <- vdev_mirror_child_select  
      0              -> zio_vdev_child_io         
      0                -> zio_create              
      0                  -> zio_add_child         
      0                  <- zio_add_child         
      0                <- zio_create              
      0              <- zio_vdev_child_io         
      0              -> zio_nowait                
      0                -> zio_execute             
      0                  -> zio_vdev_io_start     
      0                    -> spa_syncing_txg     
      0                    <- spa_syncing_txg     
      0                    -> zio_buf_alloc       
      0                    <- zio_buf_alloc       
      0                    -> zio_push_transform  
      0                    <- zio_push_transform  
      0                    -> vdev_mirror_io_start 
      0                      -> vdev_mirror_map_alloc 
      0                      <- vdev_mirror_map_alloc 
      0                      -> vdev_mirror_child_select 
      0                        -> vdev_readable   
      0                          -> vdev_is_dead  
      0                          <- vdev_is_dead  
      0                        <- vdev_readable   
      0                        -> vdev_dtl_contains 
      0                        <- vdev_dtl_contains 
      0                      <- vdev_mirror_child_select 
      0                      -> zio_vdev_child_io 
      0                        -> zio_create      
      0                          -> zio_add_child 
      0                          <- zio_add_child 
      0                        <- zio_create      
      0                      <- zio_vdev_child_io 
      0                      -> zio_nowait        
      0                        -> zio_execute     
      0                          -> zio_vdev_io_start 
      0                            -> vdev_cache_read 
      0                              -> vdev_cache_allocate 
      0                              <- vdev_cache_allocate 
      0                            <- vdev_cache_read 
      0                            -> vdev_queue_io 
      0                              -> vdev_queue_io_add 
      0                              <- vdev_queue_io_add 
      0                              -> vdev_queue_io_to_issue 
      0                                -> vdev_queue_io_remove 
      0                                <- vdev_queue_io_remove 
      0                              <- vdev_queue_io_to_issue 
      0                            <- vdev_queue_io 
      0                            -> vdev_accessible 
      0                              -> vdev_is_dead 
      0                              <- vdev_is_dead 
      0                            <- vdev_accessible 
      0                            -> vdev_disk_io_start 
      0                            <- vdev_disk_io_start 
      0                          <- zio_vdev_io_start 
      0                        <- zio_execute     
      0                      <- zio_nowait        
      0                    <- vdev_mirror_io_start 
      0                  <- zio_vdev_io_start     
      0                  -> zio_vdev_io_done      
      0                    -> zio_wait_for_children 
      0                    <- zio_wait_for_children 
      0                  <- zio_vdev_io_done      
      0                <- zio_execute             
      0              <- zio_nowait                
      0            <- vdev_mirror_io_start        
      0          <- zio_vdev_io_start             
      0          -> zio_vdev_io_done              
      0            -> zio_wait_for_children       
      0            <- zio_wait_for_children       
      0          <- zio_vdev_io_done              
      0        <- zio_execute                     
      0      <- zio_nowait                        
      0    <- arc_read_nolock                     
      0  <- arc_read                              
    I've compared the output of a single non-cached metadata read with a single filesystem read, done by running dd against a file that is not in the cache. The only difference in the flow is that the metadata reads are missing these calls:
      0                                -> vdev_queue_offset_compare 
      0                                <- vdev_queue_offset_compare 
    This is called from "-> vdev_queue_io_to_issue", but I don't think it's relevant; it's probably just a difference between metadata and file data reads.

    What do you think should be next?
  • 13. Re: zfs list on solaris express 11 always reads from disk (metadata not cached)
    kutrovsky Newbie
    I've upgraded to the Solaris 11 release - the issue persists.

    I have some new information. Only specific snapshots cause the extra reads, and the amount of reads changes with usage. Other snapshots on the same volume do not cause repeated disk reads.

    Example output after being run several times (the metadata should have been cached by now!):
    root@storage1:~/dtrace# dtrace -n 'io:::start /pid == $target/ { printf("%s", args[1]->dev_statname); }' -c "zfs list -r -t all data/vol/srv1@snap1"
    dtrace: description 'io:::start ' matched 6 probes
    NAME                                         USED  AVAIL  REFER  MOUNTPOINT
    data/vol/srv1@snap1                     41.5M      -  13.5G  -
    dtrace: pid 1700 has exited
    CPU     ID                    FUNCTION:NAME
      0    631              bdev_strategy:start sd11
      0    631              bdev_strategy:start sd11
      0    631              bdev_strategy:start sd11
      1    631              bdev_strategy:start sd9
      1    631              bdev_strategy:start sd4
      2    631              bdev_strategy:start sd11
      2    631              bdev_strategy:start sd10
      2    631              bdev_strategy:start sd7
      2    631              bdev_strategy:start sd9
      2    631              bdev_strategy:start sd9
      2    631              bdev_strategy:start sd4
      4    631              bdev_strategy:start sd3
      4    631              bdev_strategy:start sd4
    ...
    Any ideas how to troubleshoot this next?
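
    One more measurement I can try: sizing the reads per device with the same probe, in case the pattern points at particular vdevs (a sketch, reusing the dataset name from above):
    dtrace -n 'io:::start /pid == $target/ { @[args[1]->dev_statname] = sum(args[0]->b_bcount); }' -c "zfs list -r -t all data/vol/srv1@snap1"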
