7537 want nextboot (one time boot) support

Review Request #249 — Created Nov. 6, 2016 and submitted

tsoome
illumos-gate
9af5e6b9b0
7537
68aed11...
general
7537 want nextboot (one time boot) support

beadm activate -t/-T and boot-time / post-reboot verification.

root@openindiana:/home/tsoome# beadm list
BE                     Active Mountpoint Space  Policy Created
openindiana-2020:04:15 -      -          19,23M static 2020-04-15 09:28
openindiana-2020:07:09 NR     /          11,56G static 2020-07-09 11:16
root@openindiana:/home/tsoome# beadm activate -t openindiana-2020:04:15
Activated successfully
root@openindiana:/home/tsoome# beadm list
BE                     Active Mountpoint Space  Policy Created
openindiana-2020:04:15 T      -          10,01G static 2020-04-15 09:28
openindiana-2020:07:09 NR     /          1,57G  static 2020-07-09 11:16
root@openindiana:/home/tsoome# beadm activate -T openindiana-2020:04:15
Temporary activation removed
root@openindiana:/home/tsoome# beadm list
BE                     Active Mountpoint Space  Policy Created
openindiana-2020:04:15 -      -          10,01G static 2020-04-15 09:28
openindiana-2020:07:09 NR     /          1,57G  static 2020-07-09 11:16
root@openindiana:/home/tsoome# 

After activate -t, we do have on disk (first label):
00002000  01 01 00 00 00 00 00 00  00 00 00 01 00 00 00 48  |...............H|
00002010  00 00 00 40 00 00 00 07  63 6f 6d 6d 61 6e 64 00  |...@....command.|
00002020  00 00 00 09 00 00 00 01  00 00 00 26 7a 66 73 3a  |...........&zfs:|
00002030  72 70 6f 6f 6c 2f 52 4f  4f 54 2f 6f 70 65 6e 69  |rpool/ROOT/openi|
00002040  6e 64 69 61 6e 61 2d 32  30 32 30 3a 30 34 3a 31  |ndiana-2020:04:1|
00002050  35 3a 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |5:..............|
00002060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
00003fe0  77 08 7c 75 55 e9 df e3  27 f9 25 4a da 52 65 fc  |w.|uU...'.%J.Re.|
00003ff0  43 29 e7 6c 00 4f 1f db  28 36 85 29 be 35 92 73  |C).l.O..(6.).5.s|
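(For reference, this is the packed XDR nvlist: the 4-byte nvlist stream header (encoding/endianness), the nvlist version and nvflag, then a single nvpair named "command" whose string value is "zfs:rpool/ROOT/openindiana-2020:04:15:". The tail of the 8K area, starting at 0x3fd8, is a zio_eck_t: ZEC_MAGIC (0x0210da7ab10c7a11, stored little-endian here) followed by the 256-bit embedded checksum.)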

On reboot, indeed, the loader used currdev "zfs:rpool/ROOT/openindiana-2020:04:15:". After rebooting again, the loader came up with currdev "zfs:rpool/ROOT/openindiana-2020:07:09:", which got mounted as the root file system:
tsoome@openindiana:~$ df -h
Filesystem             Size   Used  Available Capacity  Mounted on
rpool/ROOT/openindiana-2020:07:09
                     28,70G  5,41G      8,19G    40%    /

And now we can confirm that the nvlist in the label is still there, but empty:
00002000  01 01 00 00 00 00 00 00  00 00 00 01 00 00 00 00  |................|
00002010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00003fd0  00 00 00 00 00 00 00 00  11 7a 0c b1 7a da 10 02  |.........z..z...|
00003fe0  b3 46 76 2e 24 fd fa 01  b7 cd 13 e5 42 1f d8 64  |.Fv.$.......B..d|
00003ff0  ad ae 2f 35 db 6b c2 29  db 4d 99 28 85 3c 13 15  |../5.k.).M.(.<..|
pcd
  1. I'll admit to not having thought too much about this, but is the second label the best place for this information? We don't actually need to store this on every vdev in the pool, we just need to store it once. And we only have two of these padding blocks; if we want to use them later, we'll be out of luck. It's entirely possible this is the best place to store this information, I just want to make sure we've thought about alternatives.

    1. You meant PAD2, not second label?

      Well, there are many possibilities. Firstly, the original (current) FreeBSD implementation uses exactly the scenario you describe - the info is stored on the first disk, first label (PAD2). However, there are some things to think about: what if that first disk is somehow disconnected and reconnected later? What if the update by the early boot program fails? And which disk is the first one - BIOS systems reorder boot disks on many machines, depending on where you boot from.

      Just as an idea (it does not have to be this way, but as a reference): we do not know which disk is the first one, so if we boot from one disk or another (I do it fairly often to verify things), which leaf vdev should we probe? So I thought, let's record it on every label of every vdev of the boot pool - then it does not matter which disk ends up being the "first" one. If a disk is readable, all of its label instances should carry this config data. Now, when we reset the data, what if one write succeeds and the others do not? The natural rule is: if there is even one empty PAD2, we fall back to the normal bootfs property. For completeness, this also means that on kernel startup the boot pool PAD2 areas should be wiped, in case the bootloader failed all of its writes.

      PAD2 itself is just one of the currently safe areas. Note that we cannot use PAD1, or at least not its first sector (the first 512B), as it is reserved for the partition boot block (and used as such on BIOS systems). Actually, if we keep SPARC in mind, PAD1 is entirely used there - VTOC plus the first part of the boot program. In fact, we could even leave the first label alone and not touch it at all. But we do not have much else left except the tail of the boot area (3.5MB). We could use that, but there is an obvious risk as well.

      So I would take this implementation as a reference, a proof of concept, and something to start from. Note that this version does not actually scan all 4 labels from a vdev yet; I only realized while writing it that the loader is slacking there, so the label-read rework is in a separate update, and once it is in place we will have the tools to read all 4 labels.

      Another possible idea is to not use the disk at all and rely on, say, UEFI variables, but that would limit the target audience and push the whole idea into the future.

      And finally, we could invent yet another area - for example the reserved partition on GPT - but the reserved partition is already in use for the device id (usable or not), and it also may not be present at all...

    2. I did mean PAD2, my mistake.

      I agree that storing it on only one disk is a poor idea; and besides, if you use pad2 on one disk, you might as well use it on all of them, because once it's in use for one purpose, using it for a different purpose on only some devices would be much trickier. I was mostly trying to think of other locations we could store it, like reserved partitions. Probably this is the best spot we have available. The only other thing I would consider, in that case, is allocating only part of the 8k region to the nextboot config info; I don't think we expect it to be very large, and it might be useful to save most of the space to store other things later on. Deciding on a maximum length for something like this is easier to do early, rather than late.

  2. usr/src/boot/sys/boot/zfs/zfs.c (Diff revision 1)
     
     

    Rather than treating this as an array of integers, could we cast the appropriate location to a zio_eck_t and use ZEC_MAGIC and ZIO_SET_CHECKSUM? Or create a struct for this padding space that has the appropriate length char[] followed by a zio_eck_t, and then cast the whole thing. We do that for the vdev_label_phys_t.
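    A minimal sketch of what the suggested struct could look like, assuming VDEV_PAD_SIZE and zio_eck_t from the existing headers and mirroring how vdev_phys_t is laid out; the type and field names here are invented for illustration:

        /*
         * Hypothetical view of the 8K pad2 area: payload followed by
         * a trailing zio_eck_t, just like vdev_phys_t.
         */
        typedef struct vdev_pad_phys {
                char        vp_nvlist[VDEV_PAD_SIZE - sizeof (zio_eck_t)];
                zio_eck_t   vp_zbt;     /* zec_magic + embedded checksum */
        } vdev_pad_phys_t;

        /* then e.g.: vdev_pad_phys_t *pp = (vdev_pad_phys_t *)pad2_buf; */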

  3. usr/src/boot/sys/boot/zfs/zfs.c (Diff revision 1)
     
     

    Not sure I understand this conditional. My reading of this loop is "iterate over all of the vdev's children, and at the end, result will be the contents of pad2 from the first one, unless any of them had the empty string as the contents of their pad. If they did, we return the empty string." Is the intent here to protect ourselves against a case where only some vdevs get zeroed? A comment explaining that conditional might be helpful.

    I think specifically, what confuses me is that we trust earlier vdevs more in one way (we use their result over others if they're all non-empty), but not in others (empty strings have priority?).

    1. I updated the comment and added a check on the pad2 content from the different labels.

      Yes, the idea is that for nextboot to be used, we need the same non-empty nextboot string from all (healthy) disks. This is because unsetting the nextboot may have been interrupted between updates, leaving some pad2 areas with the command cleared and others with it still set.

      Additionally, we need to require all pads to carry the same command string; otherwise we do not know which one should be preferred...

      Also note this version only reads the first label from the vdev (label 0); while writing it I realized there is no code to read all the labels, so this code will need an update once https://www.illumos.org/rb/r/250/ lands.
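    A minimal sketch of the rule described in this reply, under the assumption that the caller has already collected one pad2 string per readable label; the function and parameter names are hypothetical, not the actual patch:

        #include <errno.h>
        #include <string.h>

        /*
         * Every readable pad2 must carry the same non-empty nextboot
         * command; any empty pad2 or any disagreement cancels nextboot
         * and we fall back to the normal bootfs property.
         */
        static int
        nextboot_pick(char *const *pads, int npads, char *result, size_t size)
        {
                result[0] = '\0';
                for (int i = 0; i < npads; i++) {
                        if (pads[i][0] == '\0')
                                return (ENOENT); /* empty pad2 cancels nextboot */
                        if (result[0] == '\0')
                                (void) strlcpy(result, pads[i], size);
                        else if (strcmp(result, pads[i]) != 0)
                                return (EINVAL); /* labels disagree, do not guess */
                }
                return (result[0] == '\0' ? ENOENT : 0);
        }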

  4. usr/src/boot/sys/boot/zfs/zfs.c (Diff revision 1)
     
     

    If you have {} on the if statement, you should have them on the else as well.

  5. usr/src/uts/common/fs/zfs/vdev_label.c (Diff revision 1)
     
     

    Why do we pass cbp in here? We pass it to our children, but they don't do anything with it.

    1. Yes, it's a leftover from before I figured out I could set it as the zio private data in zio_root().

  6. usr/src/uts/common/fs/zfs/vdev_label.c (Diff revision 1)
     
     

    Why do we return the best error? If only one disk managed to successfully write the nextboot information, and the others left the padding area zeroed out, then we're going to have a pretty bad day when it comes time to boot and we're expecting it to work, but we ignore the first disk's nextboot because the others are zeroed out.

    1. The idea is that from this write we should only get physical IO errors, and if we are lucky the failing disk ends up in an error state while we still have good writes elsewhere. Then, while reading the nextboot data, the errored disks are ignored, the IO error is ignored as well, and we read from the "good" label(s).

      Also, on boot we at least attempt to zero out the nextboot data, so even if we read something bad, we still get a chance to boot without the nextboot setting. And if everything goes wrong, the user still has the option to get to the boot2: prompt and enter the boot dataset manually.

      Of course the alternative would be to error out on whatever error we hit and thereby disable the feature. Basically it is a decision point: which way should we go - be opportunistic and try, or be more conservative and deny the feature on any error.
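    For illustration, one way to be opportunistic here is the pattern already used for uberblock writes in vdev_label.c: count successful child writes via the zio private data and fail only when none succeeded. A rough kernel-side sketch, with names invented for this example and relying on the existing zio machinery:

        /* zio done callback: counter is passed as the zio private data */
        static void
        pad2_write_done(zio_t *zio)
        {
                uint64_t *good_writes = zio->io_private;

                if (zio->io_error == 0)
                        atomic_add_64(good_writes, 1);
        }

        /*
         * ... and after zio_wait() on the root zio:
         *      error = (good_writes == 0 ? EIO : 0);
         */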

pcd
  2. usr/src/boot/sys/boot/zfs/zfs.c (Diff revision 2)
     
     

    Why are we using zap_scratch here? I guess it doesn't really matter, but it seems wrong to use the zap_scratch space for non-zap stuff.

  3. usr/src/boot/sys/boot/zfs/zfs.c (Diff revision 2)
     
     

    Why do we use a precalculated checksum here? It looks like the sha256 source code is present in usr/src/boot/sys/cddl/boot/zfs/sha256.c. Could we use that instead?

  4. usr/src/boot/sys/boot/zfs/zfs.c (Diff revision 2)
     
     

    Could we verify the checksum here?
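    A rough sketch of what verification could look like, reusing the vdev_pad_phys_t layout sketched earlier and following the usual embedded-checksum convention for label data (zec_cksum temporarily holds the on-disk offset while the digest is computed). ZEC_MAGIC and ZIO_SET_CHECKSUM are the definitions mentioned in the earlier comment; the zio_checksum_SHA256() signature shown here is an assumption and should be taken from sha256.c:

        static int
        pad2_cksum_verify(vdev_pad_phys_t *pp, uint64_t offset)
        {
                zio_cksum_t stored, actual;

                if (pp->vp_zbt.zec_magic != ZEC_MAGIC)  /* same-endian case only */
                        return (EINVAL);

                stored = pp->vp_zbt.zec_cksum;
                ZIO_SET_CHECKSUM(&pp->vp_zbt.zec_cksum, offset, 0, 0, 0);
                zio_checksum_SHA256(pp, sizeof (*pp), NULL, &actual);
                pp->vp_zbt.zec_cksum = stored;

                return (memcmp(&stored, &actual, sizeof (stored)) == 0 ? 0 : EIO);
        }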

mahrens
  2. usr/src/lib/libzfs/common/libzfs_impl.h (Diff revision 2)
     
     

    use the zpool_ prefix

  3. usr/src/lib/libzfs/common/libzfs_pool.c (Diff revision 2)
     
     
    It's a little confusing that we are defining the "nextboot" to mean the "dataset" string (for zpool_get_nextboot) and the "zfs:dataset:" string (for zfs_ioc_get_nextboot).  Would it be possible to use the same string all the way, or to use different names for these different strings?
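    For reference, the on-disk form seen in the label dump above is "zfs:<dataset>:", while the user-facing value is the bare dataset name. A trivial sketch of building the loader-facing form from the dataset name; the helper name is made up for this example:

        #include <errno.h>
        #include <stdio.h>

        /* wrap a dataset name into the loader's "zfs:<dataset>:" form */
        static int
        nextboot_to_loader(const char *dataset, char *buf, size_t size)
        {
                if (snprintf(buf, size, "zfs:%s:", dataset) >= (int)size)
                        return (ENAMETOOLONG);
                return (0);
        }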
  4. usr/src/uts/common/fs/zfs/vdev_label.c (Diff revision 2)
     
     
     

    The semantic here is that we return a random pad2, unless one of them is a zero-length string, in which case we return the zero-length string.

    There's no checksum of the pad2, right? So if one of them is damaged such that it becomes a zero-length string, the overall property is lost.

    What's the motivation behind this behavior?

  5. usr/src/uts/common/fs/zfs/vdev_label.c (Diff revision 2)
     
     
     
     
     

    This seems very specific to treating pad2 as a string. I.e. we aren't really checking if it's "zeroed", only that it's a zero-length string (subsequent bytes may be nonzero). Can we document that somewhere?

    Also, I think this is equivalent to:

    if (pad2[0] == '\0')
      *cbp[0] = '\0';
    
  6. usr/src/uts/common/fs/zfs/vdev_label.c (Diff revision 2)
     
     

    do we also want to check vdev_writeable()?

  7. usr/src/uts/common/fs/zfs/vdev_label.c (Diff revision 2)
     
     

    I guess there's no performance concern with writing each vdev one at a time? If there are a lot of disks, this could be slow, but presumably this is typically used only on the root pool which has a small number of disks?

  8. usr/src/uts/common/fs/zfs/zfs_ioctl.c (Diff revision 2)
     
     

    Why is this needed? It looks like we are modifying the pad2 area of the labels, and not modifying anything referenced by the uberblock.

  9. usr/src/uts/common/fs/zfs/zfs_ioctl.c (Diff revision 2)
     
     

    Are we intentionally allowing this on suspended / readonly pools? That seems dangerous.

    We may also want to pass allow_log=B_TRUE. Even if not used from the CLI, this will cause the ioctl to log itself.

  10. usr/src/uts/common/fs/zfs/zfs_ioctl.c (Diff revision 2)
     
     

    What would be the downside of using zfs_secpolicy_read here?

  11. usr/src/uts/common/fs/zfs/zfs_ioctl.c (Diff revision 2)
     
     

    What would be the downside of using zfs_secpolicy_read here?

alhazred
  2. usr/src/boot/lib/libstand/zfs/zfs.c (Diff revision 3)
     
     

    Can we make this spot safer?

Review request changed

Status: Closed (submitted)
