10405 Implement ZFS sorted scans

Review Request #1226 — Created Oct. 8, 2018 and submitted

tsoome
illumos-gate
10405
1224
1572
645d0e2...
general
This originated from ZFS on Linux, as https://github.com/zfsonlinux/zfs/commit/d4a72f23863382bdf6d0ae33196f5b5decbc48fd

https://svnweb.freebsd.org/base?view=revision&revision=334844

During scans (scrubs or resilvers), it sorts the blocks in each transaction group by block offset; the result can be a significant improvement. (On my test system just now, a pool I set up yesterday and deliberately fragmented, a scrub went from 1h2m to 33.5m with the changes.) I've seen similar ratios on production systems.
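As a rough illustration of why the sorting helps (a hedged userland sketch, not the actual dsl_scan.c code; the offsets are invented): traversal discovers blocks in roughly logical order, while issuing the reads in ascending offset order lets a rotating disk sweep forward instead of seeking randomly.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* Compare two block offsets for qsort(). */
    static int
    offset_cmp(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a;
        uint64_t y = *(const uint64_t *)b;

        return (x < y ? -1 : (x > y ? 1 : 0));
    }

    int
    main(void)
    {
        /* Offsets in the order a traversal might discover them. */
        uint64_t offs[] = { 9437184, 131072, 67108864, 524288, 1048576 };
        size_t n = sizeof (offs) / sizeof (offs[0]);
        size_t i;

        qsort(offs, n, sizeof (offs[0]), offset_cmp);

        /* Issue the reads in on-disk order. */
        for (i = 0; i < n; i++)
            printf("read at offset %llu\n", (unsigned long long)offs[i]);
        return (0);
    }

The actual change accumulates the block pointers discovered in each transaction group into sorted in-memory structures and dispatches the I/O in offset order, so the effect sketched above applies to the whole scan.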

FreeNAS has had these changes since Oct 2017.

Scrub & rebuild pools. Note times for performance analysis.

The pools are compatible with systems without the changes, so bouncing back and forth between two versions is possible, and I've used that for correctness-checking.

zpool scrub
zpool offline rpool disk and then later zpool online rpool disk
zpool replace

In my test systems the speedup on physical disks is very noticeable: on a 4x 4TB raidz1 (7200RPM WD Black SATA), scrub/resilver time is down to 2 hours from 4 hours (3.5TB of allocated data). Virtual disks under VMware Fusion (on top of SSD) show a less drastic but still significant speedup, and the overall impression is consistent with the notes from ZoL and FreeBSD.

mahrens
  1. I think this is similar to, but supersedes https://github.com/openzfs/openzfs/pull/648/, is that right?

    There were a few follow-on commits which George Wilson or Tom Caputi can point you at, including a8b2e30685c9214ccfd0181977540e080340df4e "Support re-prioritizing asynchronous prefetches"

    1. Tom provided this list:
      a8b2e30685c9214ccfd0181977540e080340df4e: Support re-prioritizing asynchronous prefetches
      a76f3d0437e5e974f0f748f8735af3539443b388: Fix deadlock in IO pipeline
      c26cf0966d131b722c32f8ccecfe5791a789d975: Fix zio->io_priority failed (7 < 6) assert

    2. I haven't really been tracking 648, but I will most certainly check it over. As I wrote on the dev list, my attention was caught by the commit in FreeBSD (a commit to 11, meaning it has been around for some time in current and therefore should be in relatively good shape :D).

      Also included (verified):
      a8b2e30685c9214ccfd0181977540e080340df4e: Support re-prioritizing asynchronous prefetches (dtrace script update added)

      These patches were missing and I have now added them:
      a76f3d0437e5e974f0f748f8735af3539443b388: Fix deadlock in IO pipeline
      c26cf0966d131b722c32f8ccecfe5791a789d975: Fix zio->io_priority failed (7 < 6) assert

jjelinek
  1. I know you didn't write this code, so I concentrated only on looking for possible bugs or problems. Overall this looks good. I assume this was already reviewed by people who are more knowledgeable about zfs than me. Here are the issues I noticed.

    usr/src/uts/common/fs/zfs/arc.c
    348-349 This comment seems wrong now
    5069 At first I thought this should still be wrapped in VERIFY0, but after
    looking at the next change, I think we're missing the assignment of
    the return to 'rc'.

    usr/src/uts/common/fs/zfs/dbuf.c
    2445 Why was this block added? It looks redundant with the existing block
    at 2406.

    usr/src/uts/common/fs/zfs/dsl_scan.c
    99 s/limitted/limited/
    137 and 160 either change to boolean or use 0
    159 this tunable seems worthy of an explanatory comment
    384 I'm not sure why the fill_weight code is duplicated here from scan_init.
    1757 Unless I'm misreading the code, it appears that we'll leak bp_toread if
    dsl_scan_recurse returns an error.
    3857 I don't see how the rest of the code in this function will block the
    caller. Is something missing here?

    usr/src/uts/common/fs/zfs/range_tree.c
    210 This function seems kind of problematic. If rtop_remove is set but
    rtop_add is NULL, we lose the entry. It seems like it would be better
    to do the checks up front before we remove/adjust/add, although based
    on how this is used, maybe it should just assert the ops? I don't see
    any callers with only one op and not the other, and I'm not sure why
    the extra checks were added for the individual ops throughout the
    code. (A sketch of the up-front check follows this comment.)

    usr/src/uts/common/fs/zfs/sys/dsl_scan.h
    136 Trailing whitespace. cstyle should have caught this.
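    A hedged sketch of the up-front check suggested for range_tree.c line 210, with simplified stand-in types rather than the real range tree code:

      #include <assert.h>
      #include <stdio.h>
      #include <stdint.h>

      /* Simplified stand-ins for the real range tree types. */
      typedef struct range_seg { uint64_t rs_start, rs_end; } range_seg_t;
      typedef struct range_ops {
          void (*rtop_remove)(range_seg_t *, void *);
          void (*rtop_add)(range_seg_t *, void *);
      } range_ops_t;
      typedef struct range_tree {
          range_ops_t *rt_ops;
          void *rt_arg;
      } range_tree_t;

      /*
       * Resize a segment, notifying the callbacks.  The assert catches a
       * half-populated ops table; otherwise a segment removed through
       * rtop_remove with a NULL rtop_add silently vanishes from the
       * caller's auxiliary structure.
       */
      static void
      rt_resize_segment(range_tree_t *rt, range_seg_t *rs, uint64_t start,
          uint64_t size)
      {
          assert(rt->rt_ops == NULL ||
              (rt->rt_ops->rtop_remove != NULL &&
              rt->rt_ops->rtop_add != NULL));

          if (rt->rt_ops != NULL)
              rt->rt_ops->rtop_remove(rs, rt->rt_arg);
          rs->rs_start = start;
          rs->rs_end = start + size;
          if (rt->rt_ops != NULL)
              rt->rt_ops->rtop_add(rs, rt->rt_arg);
      }

      static void track_remove(range_seg_t *rs, void *arg) { (void) rs; (void) arg; }
      static void track_add(range_seg_t *rs, void *arg) { (void) rs; (void) arg; }

      int
      main(void)
      {
          range_ops_t ops = { track_remove, track_add };
          range_tree_t rt = { &ops, NULL };
          range_seg_t rs = { 0, 4096 };

          rt_resize_segment(&rt, &rs, 8192, 4096);
          printf("segment now [%llu, %llu)\n",
              (unsigned long long)rs.rs_start, (unsigned long long)rs.rs_end);
          return (0);
      }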

jjelinek
  1. I went back through my old review comments and cross-checked them with the latest ZoL code. These are my comments that I still think we need to address:

    usr/src/uts/common/fs/zfs/arc.c
    5069 assign return value to 'rc'

    usr/src/uts/common/fs/zfs/dbuf.c
    2445 Why was this block added? It looks redundant with the existing block at 2406 and doesn't exist in ZoL.

    usr/src/uts/common/fs/zfs/dsl_scan.c
    99 s/limitted/limited/ this is spelled correctly in ZoL.
    384 I'm not sure why the fill_weight code is duplicated here from scan_init, which matches ZoL. This function doesn't exist in ZoL. We would also need to remove the call from spa_init.
    1757 We'll leak bp_toread if dsl_scan_recurse returns an error. In ZoL, the code does goto out (see the sketch after this thread).

    usr/src/uts/common/fs/zfs/sys/dsl_scan.h
    136 Trailing whitespace; it doesn't exist in ZoL.

    1. Fixed. The differences were there because I started this update from the FreeBSD commit :)
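    For illustration, a minimal userland sketch of the goto-out pattern described above, using hypothetical simplified types (the real code uses blkptr_t and kmem_alloc/kmem_free):

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      /* Hypothetical stand-in for the real blkptr_t. */
      typedef struct blkptr { unsigned char b[128]; } blkptr_t;

      /* Stub standing in for dsl_scan_recurse(); always fails here. */
      static int
      scan_recurse(const blkptr_t *bp)
      {
          (void) bp;
          return (5);    /* pretend EIO-style failure */
      }

      /*
       * Every exit path, including the error return from the recursion,
       * goes through "out" so the heap copy of the block pointer is
       * always freed.
       */
      static int
      visit_bp(const blkptr_t *bp)
      {
          blkptr_t *bp_toread = malloc(sizeof (*bp_toread));
          int err;

          if (bp_toread == NULL)
              return (-1);
          (void) memcpy(bp_toread, bp, sizeof (*bp_toread));

          err = scan_recurse(bp_toread);
          if (err != 0)
              goto out;    /* an early return here is what leaked */

          /* ... visit children, update bookmarks ... */
      out:
          free(bp_toread);
          return (err);
      }

      int
      main(void)
      {
          blkptr_t bp = { { 0 } };

          printf("visit_bp returned %d\n", visit_bp(&bp));
          return (0);
      }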

mahrens
  1. you'll also want the fix for https://github.com/zfsonlinux/zfs/pull/8453 (once finalized).

jjelinek
  1. I went back through the diffs that have changed since the last time I reviewed this, and it looks good to me, as long as the separate "Multiple DVA Scrubbing Fix" which you ported from ZoL is also included in the same integration.

    1. I was looking at the original ZoL patch (d4a72f23863) one more time and I realized that the patch includes some changes to various tests, but it doesn't look like those changes are included here. Is there a reason?

    2. The reason is that some of the tests they changed don't exist here, and some are already different. We can definitely add those test changes.

    3. Yes, if we can update the tests we already have, I think that would be good. We don't need to bring in the tests they modified that we don't have yet.

jjelinek
  1. Thanks for updating the tests, sorry I missed that earlier.

kkantor
  1. usr/src/uts/common/fs/zfs/dsl_scan.c (Diff revision 8)

    It seems to me like it would be more correct if these were hrtime_t. I see scn_sync_start_time is a uint64_t, so maybe this isn't such a big deal. (A small sketch follows this thread.)

    1. I think we should be correct there; the ZoL gethrtime() also returns hrtime_t.

    2. I'm confused about this change. It seems like the ZoL patch removed the elapsed_nanosecs variable altogether and you didn't change the type on the following line where curr_time_ns is still a uint64_t. Am I misreading this somehow?

    3. Yep, it is my mistake there.
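    A small sketch of the type usage discussed in this thread (illumos-specific; the variable names simply mirror the ones mentioned above):

      #include <sys/time.h>    /* hrtime_t and gethrtime() on illumos */
      #include <stdio.h>

      int
      main(void)
      {
          /*
           * Keeping every timestamp as hrtime_t, the signed 64-bit
           * nanosecond count that gethrtime() returns, avoids silently
           * mixing signed and unsigned arithmetic the way a stray
           * uint64_t would.
           */
          hrtime_t scn_sync_start_time = gethrtime();
          /* ... scan work would happen here ... */
          hrtime_t curr_time_ns = gethrtime();

          printf("elapsed: %lld ns\n",
              (long long)(curr_time_ns - scn_sync_start_time));
          return (0);
      }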

  2. usr/src/uts/common/fs/zfs/dsl_scan.c (Diff revision 8)

    IIUC, Linux has much more stringent stack requirements than illumos. I would vote to keep this in, but I know we're trying to minimize our diff from ZoL. The ZoL issue states that this saves up to 390 bytes with a recursion depth of around 13.

    Also, if it matters, it doesn't look like this was added to ZoL as part of the sequential scan/scrub change. (See the sketch after this thread.)

    1. Yes, I am aware of the Linux limit; also, we can always use dtrace :)
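    A hedged sketch of the stack-saving idea, using the figures quoted above (about 390 bytes per frame, recursion depth around 13); this is not the actual dsl_scan code:

      #include <stdio.h>
      #include <stdlib.h>

      enum { FRAME_BUF = 390, DEPTH = 13 };

      /* Each recursion level keeps ~390 bytes of buffer on the stack. */
      static void
      visit_stack(int depth)
      {
          char buf[FRAME_BUF];

          buf[0] = (char)depth;    /* pretend to use the buffer */
          if (depth > 0)
              visit_stack(depth - 1);
      }

      /* Heap-allocating the buffer keeps each stack frame small. */
      static void
      visit_heap(int depth)
      {
          char *buf = malloc(FRAME_BUF);

          if (buf == NULL)
              return;
          buf[0] = (char)depth;
          free(buf);
          if (depth > 0)
              visit_heap(depth - 1);
      }

      int
      main(void)
      {
          visit_stack(DEPTH);    /* roughly 13 * 390 bytes of stack */
          visit_heap(DEPTH);     /* the stack stays shallow */
          return (0);
      }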

jjelinek
  1. Thanks for fixing that up, this looks good to me now.

Review request changed

Status: Closed (submitted)
