9318 vol_volsize_to_reservation does not account for raidz skip blocks

Review Request #1997 — Created June 13, 2019 and submitted

mgerdts
illumos-gate
general

With this change, refreservation=auto sizes the refreservation based on the least efficient top-level vdev in the pool.

  • Performed a full run of zfstest on a baseline (pkg://omnios/entire@11-151031.0:20190603T195911Z) and on the new bits. No new failures were observed. Note that some tests caused a zfs-on-zfs deadlock in the baseline, so I commented out those tests.
  • Added new tests to cover the new functionality. They pass with 3 disks and with 14+ disks, with 512-byte or 4096-byte sectors: 512x3, 512x20, 4096x3, 4096x20
  • Created raidz{1,2,3} pools with various numbers of 4Kn disks, created volumes with various values of volblocksize, then filled the volumes. Verified that referenced never exceeded refreservation and that refreservation was not unreasonably high. results
  • Ran the volume overwrite test on a bunch of other wacky multi-raidz{1,2,3} and mirror combo pools. results
  • While testing volblocksize=512 on ashift=9 pools, I found that things are whack even in the baseline. See illumos#11237.
relling
  1. overall it looks promising, minor comments below

    1. Thanks for the quick review.

  2. volsize_from_vdevs() doesn't seem to consider metadata requirements (L1+ and copies). This won't
    be seen in small volumes, but will be important for large volumes.

    1. I think you are suggesting that I should be scaling volsize after numdb has been added. Is that right?

  3. nit: we've been trying to change the ZTS test names to something more human-friendly.
    I'd suggest renaming as:

    • refreserv_006_pos.ksh -> refreserv_raidz.ksh
    • refreserv_007_pos.ksh -> refreserv_multi_raidz.ksh
  4. for consistency $parity can be parity

    1. heh. I used to always avoid $ in arithmetic expressions, then I was convinced to switch to using $ and suffer the "unnecessary string conversion" messages from ksh -n. The argument for $ is that tracing (set -x) gives more meaningful output. Regardless, it should be consistent.
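A quick way to see the tracing difference being described. This is a hypothetical snippet, not part of the review diff; it assumes bash/ksh xtrace behavior for ((...)) arithmetic commands, where word expansion happens before the trace line is printed.

```shell
# Inside ((...)), a $-expanded variable is substituted before the xtrace
# line is printed, so "set -x" shows its value; a bare name is not expanded,
# so the trace shows only the variable's name.
parity=2
with_dollar=$( { set -x; (( x = $parity + 1 )); } 2>&1 )
no_dollar=$( { set -x; (( x = parity + 1 )); } 2>&1 )
echo "$with_dollar"   # trace shows the expanded value, e.g. "x = 2 + 1"
echo "$no_dollar"     # trace still shows the bare name "parity"
```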

  5. this seems like a copy-n-paste from refreserv_006_pos.ksh, is that intentional?

    1. I previously had refreserv_00{6,7} as a single test. Since they are really testing different things, I split them apart. Most of the logic near the top of both is quite similar. 006 now skips tracking of refreservation for each zvol and 007 skips the dd in the first loop.

mgerdts
relling
  1. Ship It!
mgerdts
jjelinek
  1. In libzfs_dataset.c, line 5191, the comment talks about updating the relevant test. Should there be an associated change in that test for the modifications in this function?

    1. The version that is in reservation.shlib is still accurate for single-device or mirrored zpools (and some raidz), as evidenced by tests continuing to pass. While this could be extended to add the special case for raidz, I question whether implementing the same algorithm in a different language is the right way to ensure the algorithm is correct. New tests verify that the refreservation is appropriate by actually filling some zvols and ensuring that the result falls within an acceptable range.

    2. I'm not sure if the existing tests pass due to the specific sizes of disks you're testing with or if there are no configurations in the existing tests which would fail with raidz and different sizes of disks. I went through all of the tests using the shell version of the code, but I am still not certain.

mgerdts
  1. For the record:

    To: Mike Gerdts <mike.gerdts@joyent.com>
    From: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
    Subject: RB 9318 vol_volsize_to_reservation does not account for raidz skip blocks
    Date: Sat, 15 Jun 2019 18:26:14 -0600
    
    Hi Mike,
               I can't seem to log in to RB. I've reviewed your code and it
    looks good to me.
    
    -Sanjay
    
jjelinek
  1. Ship It!
mahrens
  2. usr/src/lib/libzfs/common/libzfs_dataset.c (Diff revision 2)

    Could you add a comment here explaining this math? The use of "SPA_OLD_BLOCKSIZE / tsize" is especially non-obvious. Explain that the refreservation already takes this "raid-z deflation" into account, but that we need to adjust the deflation ratio based on the actual block size.

    Also you might mention that we are assuming all the data blocks will be this size, but that isn't actually the case if compression is enabled. But that is OK since compression won't make the space usage any worse.

    You might also talk about how we're applying this to the total size (including both data blocks and indirect blocks). Since smaller blocks have worse RAID-Z space inflation, and the data blocks are (typically) larger than indirect blocks, and the data blocks are the vast majority of the space, this is all fine.
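The deflation adjustment discussed above can be sketched numerically. This is a hypothetical ksh illustration, not the committed C code; the `raidz_asize` helper and its formula (data sectors, plus one parity sector per stripe of data sectors, padded with skip blocks up to a multiple of nparity + 1 sectors) are my reading of the raidz allocation behavior being described.

```shell
# Hypothetical sketch of raidz allocation for one block: a block of psize
# bytes on an ndisks-wide raidz with nparity parity disks and 2^ashift-byte
# sectors consumes data sectors, parity sectors, and skip-block padding.
raidz_asize() {
	typeset psize=$1 ndisks=$2 nparity=$3 ashift=$4
	typeset nsect=$(( (psize + (1 << ashift) - 1) >> ashift ))
	typeset ndata=$(( ndisks - nparity ))
	# one parity sector per ndata data sectors, rounded up
	typeset asize=$(( nsect + ((nsect + ndata - 1) / ndata) * nparity ))
	# skip blocks: pad the allocation to a multiple of nparity + 1 sectors
	asize=$(( ((asize + nparity) / (nparity + 1)) * (nparity + 1) ))
	echo $(( asize << ashift ))
}

# An 8K block on a 3-disk raidz1 with 4K sectors allocates 16K (200%),
# while a 128K block on the same vdev allocates 192K (150%).
raidz_asize 8192 3 1 12
raidz_asize 131072 3 1 12
```

Comparing the small-block and large-block results shows why a deflation ratio derived from one block size must be re-derived for the actual volblocksize: smaller blocks suffer proportionally more parity and skip-block overhead.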

mgerdts
kkantor
  1. Just a few comments on the test files. Thank you.

    1. Thanks for the review.

  2. Other tests will create flat files to use to back the zpool when specific configurations like this are required. Do you think we could do that here and in the next test instead of requiring the user to create/use a large number of disks?

    If this test created its own backing files then we wouldn't have to worry as much about the 4k vs 512 sector size logic. On my machine zfs uses a 512 sector size for file-backed zpools.

    Personally, I only run this test suite with three disks.

    1. This work was done specifically for the 4k scenario, which is why I go through specific effort to test it.

      I had initially started down the path of using files or zvols on $TESTPOOL and kept getting zfs-on-zfs deadlock. As noted in my test notes, some of the existing tests caused repeatable zfs-on-zfs deadlock while trying to get a baseline. It is for this reason that I went down this other path.

      A 3 disk configuration will provide basic coverage for 2 and 3 disk raidz configurations.

  3. What about default_cleanup_noexit, or default_cleanup instead?

    1. It looks as though default_cleanup_noexit could be used in place of the zpool destroy, but it is not a substitute for default_setup_noexit. Other tests in this directory assume that $TESTPOOL has already been created by setup.

  4. We could use the TESTVOL variable here instead.

  5. Can we use '-eq' here and in the following conditionals instead of '==' for more strict checking?

    1. I think that would be the wrong thing to do. 0 is a string that can only be interpreted as itself. Since [[ ]] is used, it is not coerced into an integer. The thing we are comparing it to is the name of a key from an associative array, which is also a string. If I were to change it to -eq, then ksh -n is not happy.

      $ cat -n f.ksh
           1  #! /bin/ksh
           2
           3  if [[ $a == 0 ]]; then
           4          echo blah
           5  fi
           6  if [[ $a -eq 0 ]]; then
           7          echo blah
           8  fi
      $ ksh -n f.ksh
      f.ksh: warning: line 6: -eq within [[...]] obsolete, use ((...))
      

      Per The New Kornshell by Bolsky and Korn (1995) page 141, (emphasis mine):

      A primitive can be any of the following expressions that compare two arithmetic expressions:
      exp1 -eq exp2 ...

      The man page in SmartOS is kinda broken in the area around -eq. If you squint a little bit, you can see how someone messed up while transcribing the book, leaving out the sentence about arithmetic expressions, implying that -eq has something to do with files.

      https://drive.google.com/open?id=1HpHMxIrX23aWt4Cz35-0Npv0m3eEwqiq
      https://drive.google.com/open?id=1-809JrpyUx-xiHgVOir582Pv9R312Slg

  6. Should this be default_cleanup_noexit or default_cleanup instead?

    1. Same as with the other file.

  7. Could use the TESTVOL variable here (from default.cfg).

jjelinek
  1. Great job on the new comment

mgerdts
kkantor
  1. Ship It!
jjelinek
  1. Ship It!
relling
  2. why not use parens here to be explicit and then remove the comment above?

    1. I then expect the next reviewer to say "you don't need those parens, because that's the normal order of operations". I'll just delete the comment because none of the reviewers were confused before I added it.

  3. see comment below on switch

  4. personal preference is to use a switch here. Something like
    
    ```
    case "$bps" in 
      512)
        allshifts=(9 10 17)
        maxpct=151
        ;;
      4096)
        allshifts=(12 13 17)
        maxpct=110
        ;;
      *)
        log_fail "bytes/sector != (512 | 4096)"
        ;;
    esac
    ```
mgerdts
jjelinek
  1. Ship It!
relling
  1. Ship It!
mgerdts
Review request changed

Status: Closed (submitted)
