Add support to vioblk for DISCARD operation

Review Request #2534 — Created April 10, 2020 and submitted


Adds support for vioblk DISCARD operation

zfs TRIM tests, manually issuing various DKIOCFREE calls while observing the resulting guest and host calls

  2. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)

    This comment is likely no longer true.

  3. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)

    Should this really be spelled FALLTHROUGH? I don't remember exactly if gcc wasn't happy with FALLTHRU.

  4. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)

    Should this be above along with "current" version as it's BD_OPS_VERSION_1 where we add new functionality?

  5. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1)

    Why do we need to explicitly state that 0x21 is unused?

  6. usr/src/uts/common/sys/blkdev.h (Diff revision 1)

    Likely needs to be updated? Or I would drop the first sentence completely as it's rather obvious.

  1. I started going through this, but ended up with several larger architectural questions. I figured it was worth sharing and talking through those first, so I opted to send this out rather than go through all the details. Hopefully that makes sense. I think we really want to think through the error semantics and the actual blkdev APIs, how that works with multiple queues, etc.
  2. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)
    Clarify logical or physical.
  3. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)
    Should we sanity check any of these values before we hand them back further in the stack?
    1. I'll add a check.

  4. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)
    Given that you're passing the sleeping or non-sleeping allocation behavior into the iteration function, maybe that's worth passing into this as well?
    1. Yeah, I'll add that.

  5. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)
    The current design of this makes it seem like we're only ever going to issue a single request at a time to a given device. It feels like that's not going to be great in an NVMe world where you need a larger queue depth to be able to saturate the device. That doesn't mean this needs to change at this particular time, but it probably means we should think through what kinds of questions we want to ask.
    1. I'm not exactly sure I understand -- dkioc_free_list_t consists of a header followed by an array of one or more extents (offset, length) to free (I think the existing utility functions in os/dkioc_free_util.c limit it to 2^20 extents -- at least it won't do a copyin() of more than that).

      dfl_iter() takes care of breaking up a request into multiple smaller requests or splitting a single large extent into multiple smaller extents -- only when the driver needs it. The result -- whether it's the single dkioc_free_list_t from the original request or multiple smaller requests -- is then queued. If it results in multiple dkioc_free_list_ts, each of those is queued on its own.

      For NVMe, if I'm understanding the 1.4 spec correctly, DKIOCFREE would be implemented by using the 'Dataset Management' command which supports up to 256 ranges (extents) in a single command. If there's concern that submitting a single command with multiple extents would be worse vs. breaking those up into multiple dataset management requests, it should just be a matter of telling blkdev to limit things to a smaller number of requests.

    2. You're right, I misunderstood this when trying to come up to speed with all the different pieces (honestly an IPD would have helped).
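
      For readers coming up to speed, the shape of the structure being discussed is roughly the following. The field names match the illumos sys/dkio.h definitions, but this is a simplified sketch; dfl_size() is a hypothetical helper mirroring what the DFL_SZ() macro computes.

      ```c
      #include <stddef.h>
      #include <stdint.h>

      /* One extent to free: a starting byte offset and a length in bytes. */
      typedef struct dkioc_free_list_ext_s {
      	uint64_t dfle_start;
      	uint64_t dfle_length;
      } dkioc_free_list_ext_t;

      /* Header followed by an array of one or more extents. */
      typedef struct dkioc_free_list_s {
      	uint64_t dfl_flags;
      	uint64_t dfl_num_exts;	/* number of entries in dfl_exts */
      	int64_t  dfl_offset;	/* adjustment applied to every extent */
      	dkioc_free_list_ext_t dfl_exts[1];
      } dkioc_free_list_t;

      /* Allocation size for a list of n extents (first one is in the header). */
      static size_t
      dfl_size(uint64_t n)
      {
      	return (sizeof (dkioc_free_list_t) +
      	    (n - 1) * sizeof (dkioc_free_list_ext_t));
      }
      ```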

  6. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 1)
    This suggests that the interface is really unsafe for an application to use as written, if they care about reliable results. While forcing all callers to break this up into reasonable requests on their own is painful, if we can't communicate what happened with a dkioc free correctly, will the right thing happen?
    For example, if a caller assumes that it's either all worked or none of it worked, but the reality is that some of the blocks were freed before an error, but they don't think they were and that the data is still valid, it seems like that'd lead to probable data corruption. What is the ZFS usage here? What are its assumptions going to be? Are callers really all going to treat this as advisory?
    1. That is true -- however, those are the limitations of the existing DKIOCFREE ioctl, which itself is based on the limitations of the SCSI UNMAP command (FWIW, the NVMe equivalent -- 'dataset management w/ discard bit set' -- has similar limitations). dfl_iter() can't fix those, but it's not making anything worse either; the unsafeness has always been there (and isn't something we can really fix) -- this is just being more explicit about it.

      I think callers (including zfs) must treat the DKIOCFREE ioctl as advisory (regardless of this change). From what I've been able to gather, the contents of unmapped LBAs on subsequent reads are undefined for SCSI devices, and for NVMe may be 0x00..., 0xFF..., or cause an error depending on what was specified during the dataset management command. While I'm not sure about other protocols, just between those two it means we can't use a single ioctl for SCSI, vioblk, NVMe, etc. and have any expectations other than 'undefined' as it stands today. ZFS itself appears to treat TRIM as advisory -- basically the only error it acts upon for TRIM operations is ENOTSUP (and ENOTTY, which is treated as equivalent to ENOTSUP -- see vdev_disk_io_start()), which disables subsequent TRIM requests on the vdev. All other errors are ignored AFAICT (see zio_vdev_io_done()).

    2. I understand that these are limitations of the existing DKIOCFREE interface, but there's nothing that says we can't add new interfaces to better communicate this and create a better programmatic interface. Seeing this made me feel like there's a lot of effort going to accommodate a bad API when we should instead ask what can we do to accommodate what would be a good programming interface to this and give us back the kind of information we'd like. That doesn't mean we shouldn't try and make the existing thing work.
      The question is if we were to then go back and add an API that did allow us to correctly communicate what we did and why, would the blkdev APIs be able to actually reflect that to us?
    3. I've been trying to think of a good answer to this (I actually thought a lot about this before posting the review as well, because the current interfaces bothered me, which is why I wanted to at least make sure the issues were noted for others).

      There are a few things that complicate trying to answer what the 'ideal' interface would look like -- the current interfaces tend to reflect what the various protocols do, and even as new protocols (e.g. NVMe) have arisen, they've tended to resemble the previous interfaces, making much improvement from any of the protocols in the future seem unlikely. Trying to come up with something we think is better but isn't actually supported by any devices (and seems unlikely to be in the future just due to inertia) doesn't seem useful.

      The other side of it is understanding what things a consumer of this interface would want. Today that's largely just ZFS, which doesn't really care. I suspect what could be useful is to report back what ranges were successful. For things like autotrim, it could avoid issuing requests for already unmapped blocks (I suspect zpool trim would always want to issue trim requests for all unused ranges even if it thinks they might already be unmapped). Without more users, it's hard to say how useful a more comprehensive API would be.

      Another thing that might be useful is something analogous to 'sync' or a write barrier -- IIUC, today (again not changed by this change) I don't believe there are any guarantees about the order in which I/Os issued to a blkdev device complete. There is a way for the underlying device to flush its write cache, but that request is queued just like any other I/O. I don't see an obvious way to say 'tell me when all I/Os currently queued are complete'. There's probably an argument to be made that the current flush write cache bit in blkdev should do this, though I'd probably want to think about it more before I'd make a judgement one way or another.

      I don't know if that helps any -- I don't think anything in this change would prevent implementing any of the bits I mentioned (if they were developed into a more fully fleshed out design), and I think it's still useful to support DKIOCFREE (warts and all) since it's what existing consumers are going to expect. And since DKIOCFREE is optional (in the sense that a consumer cannot expect it to be supported by every device), we'd have a way to transition consumers in the future if those better interfaces ever arrive.

    4. Thanks for at least thinking through this. I agree that it's still important to support ZFS and that since the only consumer that exists treats this as advisory, it's probably OK. But we should be careful that others who come around have the same advisory expectations.
  7. usr/src/uts/common/sys/blkdev.h (Diff revision 1)
    Why is this opaque? The client is going to need to know the size and layout of the structure so they can actually use it.
    1. It's not really opaque as much as trying to avoid bringing in extra header files when not needed -- blkdev drivers that elect to implement the functionality will need to #include <sys/dkioc_free_util.h> or #include <sys/dkio.h>, those that don't implement it won't have to. I can add the #include to blkdev.h if you feel strongly enough about it.

    2. It feels like a weird amount of effort to go through and something that a driver would miss. It's not the choice I would have made, but I guess I can see where this is coming from.
  8. usr/src/uts/common/sys/blkdev.h (Diff revision 1)
    I think it's worth spending some time documenting this in more detail. For example, it took me a while to realize that bfi_align was in units of the sector size and it's not clear that it should be the physical or the logical sector size, given that a blkdev device can specify both.
    How did you pick whether it was the physical or the logical block size that it represented?
    Did you check the NVMe and eMMC 4.5 spec (which added a logical discard) to make sure the constraints can be reflected here?
    1. I'll add comments -- I just missed these. It's always logical blocks -- AFAICT, while blkdev tracks both physical and logical block sizes (and can report them via the DKIOCGMEDIAINFOEXT ioctl), it only ever uses logical blocks, so logical blocks seemed the natural choice.

      The eMMC spec appears to be behind a paywall, so I wasn't able to look at that. For NVMe, it looks like there's a limit of 256 segments, and looking a bit closer, it does appear that while it supports 64-bit LBAs, it only supports 32-bit lengths for the list of extents in the dataset management command. I'll post an update that includes support for that.

      There are certainly some additional bits that, while not essential, we could look at supporting either now or in the future -- e.g. I believe NVMe has a preferred I/O size and alignment (I think other protocols may have similar parameters), though it wasn't clear (maybe I missed it going over the specs -- they do get a bit dry after a while :P) how those values should guide issuing UNMAP/TRIM/DISCARD requests. I.e., is it better to split a single extent that doesn't start on a preferred alignment boundary, but crosses one or more alignment boundaries, into aligned and unaligned requests? E.g. on a device with a block size of 512 bytes and a preferred alignment of 4096 (8 blocks), a request starting at LBA 7 with a length of 10 blocks (offset 3584, length 5120) has a segment (LBA 8-15 / offset 4096-8191) that conforms, but two portions that are unaligned (the first and last block) -- would it be better to issue 1, 2, or 3 requests (and would it make a difference in that case)?

    2. The eMMC spec requires registration, but is free, for what it's worth.

      Right, these are all good questions and things we don't have a great answer to. I imagine we'll find over time that this will change for different classes of devices. I think if there's additional information we'd like a device to plumb through that it can tell us, then it might be useful for us to do so even if we don't consume it today.
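
      The LBA 7 / length 10 example above can be sketched in code. split_on_alignment() is a hypothetical helper for illustration only, not part of the change; it produces at most an unaligned head, an aligned middle, and an unaligned tail.

      ```c
      #include <stdint.h>

      /* An extent expressed in blocks. */
      typedef struct {
      	uint64_t start;
      	uint64_t len;
      } ext_t;

      /*
       * Split [start, start + len) around a preferred alignment (all values
       * in blocks). Returns the number of resulting extents (1 to 3).
       */
      static int
      split_on_alignment(uint64_t start, uint64_t len, uint64_t align,
          ext_t out[3])
      {
      	uint64_t end = start + len;
      	uint64_t mid_start = (start + align - 1) / align * align; /* round up */
      	uint64_t mid_end = end / align * align;                   /* round down */
      	int n = 0;

      	if (mid_start >= mid_end) {
      		/* No full aligned chunk exists; keep the request whole. */
      		out[n++] = (ext_t){ start, len };
      		return (n);
      	}
      	if (start < mid_start)
      		out[n++] = (ext_t){ start, mid_start - start };	/* head */
      	out[n++] = (ext_t){ mid_start, mid_end - mid_start };	/* aligned */
      	if (mid_end < end)
      		out[n++] = (ext_t){ mid_end, end - mid_end };	/* tail */
      	return (n);
      }
      ```

      For the example in the discussion (start 7, length 10, alignment 8), this yields three extents: LBA 7 for 1 block, LBA 8 for 8 blocks, and LBA 16 for 1 block.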

  9. usr/src/uts/common/sys/blkdev.h (Diff revision 1)
    So, right now the only member of the bd_xfer_t that a client driver uses is the additional segment list when processing free requests. Does it really make sense to use that for the user?
    Actually, as I was reviewing other bits, a more general question came to mind:
    1) Should the free space info just be part of the existing drive information?
    2) Should this really be its own submission point? The main reason I ask is that I wonder if devices need to control whether or not other I/O is going on at the same time, and whether we want the ability to leverage all the existing queuing mechanisms and be able to specify that. While vioblk or, say, the eMMC driver probably don't care, nvme definitely will if these are interwoven in any serious fashion, and we'd want that to tie into the existing queueing mechanism.
    1. Since the bd_drive_t struct isn't versioned, I figured there would be objections to changing it (even if just appending members). If we're not concerned about that, it would be fairly simple to add it in and remove the extra entry point in bd_ops_t.

      FWIW, freebsd uses a separate queue for TRIM/DISCARD/FREE requests (one that will also split requests as required), but it's mostly concerned with limiting the rate at which TRIMs are issued to devices rather than maximizing parallelism of requests (there also doesn't appear to be any mechanism to enforce any ordering). That does predate the ZFS trim work (though ZFS on FreeBSD does appear to go through this layer). ZFS itself puts its own limits on the number of TRIM zios it issues (as well as the size of the TRIM requests). Since it's currently the only major consumer, it didn't seem like it was worth adding a whole separate queueing mechanism initially. I'd also be concerned that if DKIOCFREE requests could effectively bypass the existing blkdev I/O queue, they could starve other I/O -- at least with the existing queues, other I/Os should be able to make progress in the face of DKIOCFREE requests.

    2. So, regarding 1), the overall interface is versioned. So if you use the updated version you can extend it, and we have to know not to write to older things. It was already extended in the work that added multiple queue support.
      Regarding the second bit, I agree that it's important that we don't want to overwhelm the device with this work and we want other I/Os to be snappy. But in essence, you actually have created a second queue system on the side. Right now I/Os are distributed across all the queues, but in this mechanism a driver has to figure out how to actually make this fit into that. Right?
      Tease this out for the NVMe driver. If a device has 16 queues, right now blkdev is scheduling I/Os across all 16. But the trim/discard/free/whatever requests are just being handed to the driver to figure out which queue to send them to. If it only ever, say, submitted to the first queue, then that seems problematic. But if it's doing its own round-robining across things independent of the I/O submission, that seems somewhat less than ideal.
      As you're suggesting in your comment there are really two different things today that we want to consider:
      1) What is the rate of trim/discard/free/whatevers that are issued to the underlying device.
      2) When it is issued to the device, how should that be scheduled across queues.
      When we look at the SATA and SAS APIs, there is no difference in how 2 is performed for these. Basically 2 isn't special to the device and it's up to whatever is managing 1 to do so. So I think it's worth considering adding this as a top-level bd op that we dispatch through the existing mechanism so it can tie into the normal I/O flow. As you point out, 1) is important and we don't want to starve I/O in the face of issuing them, but that shouldn't impact the actual API that blkdev exposes to devices.
    3. Since we're ok with extending bd_drive_t, I'll go ahead and move the bits currently in free_space_info() into the drive_info bits then -- that makes more sense.

      For the second part -- I get what you're saying, but I'm not sure what you mean by doing 'its own round-robining' -- for each DKIOCFREE request, bd_submit() is called for each piece (possibly multiple times if the request needs to be broken up into smaller pieces) -- the sequence is ioctl(DKIOCFREE, ...) -> bd_free_space() -> dfl_iter() -> bd_free_space_cb() -> bd_submit(). AFAIK, bd_submit() does the round-robin queueing for all blkdev I/O requests, so nothing different should be happening here. The reason I added the x_dfl field to bd_xfer_t to hold a dkioc_free_list_t is explicitly so that the requests could flow through the same queueing path as all the other requests instead of short-circuiting it. Did I miss a piece?

    4. I believe I misread the original queueing path and missed the extra step through the bd_submit() path which applies the normal round-robining.

  10. usr/src/uts/common/sys/blkdev.h (Diff revision 1)
    This should definitely return an int so that a caller can indicate failure. The inability to indicate failure has been a problem in other APIs.
  11. The macro name implies this is a boolean, but it's not actually. I'd suggest actually constructing that.
  12. Why does the only producer of this structure speak in terms of uint64_t units and here you're using units of a size_t? While they are conveniently the same on an LP64 system, it seems like that's a recipe for eventual confusion and type mismatches. It also seems odd when the dfl_iter function which takes this is comparing it to uint64_t lengths.
  13. usr/src/uts/common/sys/dkioc_free_util.h (Diff revision 1)
    Does the system already have a constraint on power of 2 block sizes?
    1. blkdev does require (AFAICT) that the logical block size be a power of two >= 512. However, the DKIOCFREE ioctl uses bytes, not blocks (logical or physical), for all of its values, while SCSI and NVMe (at least -- I suspect others will be the same) indicate their limits in logical blocks. I tried to avoid mixing bytes/blocks in structs, since the dkioc and blkdev interfaces (and corresponding structs) already using different units seemed confusing enough. The kernel already has the notion of both a block size and a block shift, and the block shift amount was already readily available from bd_t. If we're concerned about potential non-LP64 platforms, we might need to introduce a ddi_ffsll() function to correctly handle converting block sizes into a shift value where needed (illumos-joyent has that, but it hasn't been upstreamed yet -- I'm not sure what the current consumers are). So going the block shift route seemed simpler (and no less 'wrong').

    2. OK. Thank you for the additional detail here.
  2. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)

    There's almost 400 existing uses of it in usr/src/uts/common -- and gcc7 doesn't seem to flag it -- so seems like it's ok.

  3. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)

    This confused me a bit since instead of making BD_OPS_CURRENT_VERSION effectively an alias to the latest BD_OPS_VERSION_xx value, it is the current version -- that is, everything prior to this change uses a value of 1 or 0 (though the bd_ops_t definition is the same for those two).

    This change adds a new version (2) as BD_OPS_CURRENT_VERSION, which follows the previous work.

    It would probably be clearer to add BD_OPS_VERSION_2 to the enum and have BD_OPS_CURRENT_VERSION be an alias to BD_OPS_VERSION_2, but I'm not sure how badly that'd break existing expectations elsewhere.

    1. I don't believe doing that would break anything. I agree that it'd be clearer.
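
    A sketch of the suggested enum shape (illustrative, not the actual blkdev.h change): each revision gets an explicit BD_OPS_VERSION_x, and BD_OPS_CURRENT_VERSION aliases the newest one.

    ```c
    /*
     * Sketch of the suggested versioning scheme: explicit per-revision
     * values, with the "current" name as an alias to the latest.
     */
    typedef enum {
    	BD_OPS_VERSION_0 = 0,
    	BD_OPS_VERSION_1 = 1,
    	BD_OPS_VERSION_2 = 2,
    	BD_OPS_CURRENT_VERSION = BD_OPS_VERSION_2
    } bd_version_t;
    ```

    With this shape, an older driver can keep passing BD_OPS_VERSION_1 unchanged, while new drivers that use BD_OPS_CURRENT_VERSION automatically pick up the newest value.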
  4. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1)

    The existing values (with the sizes in the comments beside them) end up being a contiguous chunk of the configuration space. I thought it'd be good to explicitly note that there's a gap there in case someone thought that perhaps VIRTIO_BLK_CONFIG_WRITEBACK was actually supposed to be 16 bits and not 8.

  1. Apologies for the delay in getting back to you, Jason. I've gone through this now and have a few questions and thoughts, some of which are inspired by your own comments pointing out issues with the interfaces. Thanks for doing this work.

  2. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3)
    I think the flow of control is clearer when the thing that does the allocation is the same place that does the freeing. Does dfl_iter(), called by bd_free_space(), have to be the thing that does the kmem_free() of dfl? It feels like it might be clearer to see how data gets freed if it's possible to have it be the responsibility of the allocator. Given that there's a couple layers here to track through, I'd recommend at least commenting about this fact here.
    1. I've added a comment here, but to explain further -- the data in dfl is what needs to get queued by the blkdev framework to be sent to the driver. We could use a different struct to hold the data in blkdev, but it'd end up looking a lot like dkioc_free_list_t already, so that doesn't seem useful (we might as well reuse what we have). Since we need to queue the data in dfl, we cannot free it until after the request has been processed by the driver (we do this by attaching it to the bd_xfer_t; the driver calls bd_xfer_done() with the bd_xfer_t instance once it completes the request, and bd_xfer_done() will then free any dkioc_free_list_t that's attached to the bd_xfer_t it was given as an argument).

      Conceptually, the way I thought of it is that we try to pass dfl down to the driver; if all is good, it gets passed down like any other I/O. If the free request sent to blkdev can't be used as-is (due to alignment, sizing, etc.), blkdev instead passes down newly allocated dkioc_free_list_t instances made by adjusting or splitting up the original request, and frees the original request (effectively replacing it with one or more new instances that the underlying driver can accept).

      We could create a copy of dfl to pass down and free the original dfl after we call bd_free_space(), but it seems a bit gratuitous -- dfl_copyin() already always allocates, populates, and returns a newly allocated dkioc_free_list_t instance (on success). We'd be creating a new dkioc_free_list_t instance, then immediately creating a copy of what we just created, calling bd_free_space() with the copy (to let it dispose of it), then freeing the original instance. It seems like an equally valid question would be why not just use the first dkioc_free_list_t we created?

      The other alternative would be to block until bd_xfer_done() is called on any/all of the resulting dkioc_free_list_t instances that get passed down to the driver as a result of the initial request, but that effectively turns all DKIOCFREE ioctls into blocking calls. Granted, there are no man pages (or much in the way of documented behavior) for the ioctl (so things are far from certain here), but there is a DF_WAIT_SYNC flag that can be passed to the ioctl, which seems to imply that its absence means it shouldn't (or at least doesn't have to) wait for the request to complete before returning.

    2. I do understand the manipulations and transformations and it's a reasonable reason for it. There are certainly a lot of advantages to it. I mostly brought this up originally because it was hard for me to understand that control flow and convince myself of it at first, which I know is often a complaint of yours with other pieces of code like in ixgbe. The comment definitely helps. Since this is rarely allocated, it'll hopefully be clear for most folks. I don't think trying to turn this into a wait or another operation necessarily makes sense just because of the issue of a reader's comprehension.
  3. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3)
    I wonder if we want a different way of a driver being able to indicate the support for trim / discard than plumbing through the op vector or not. In other areas when there's conditional support, usually the function remains but we indicate the ops vector support another way. Though I realize that this is something that the blkdev does already with other areas.
    1. Did you have anything specific in mind? I suppose we could require drivers that implement the endpoint to support some sort of 'probe' invocation (vs. being called to service a real request) -- e.g. call the endpoint with a NULL dkioc_free_list_t argument and return 0 or ENOTSUP depending on what that device instance supports -- though I haven't thought hard about whether such behavior might suggest also changing other parts of blkdev for consistency.

    2. The things that come to mind are cases like the ufm code or other areas where we have a lot of parts with divergent behaviors; there are usually different ways to ask that question. From the driver writer's perspective, it's nice when you can use a relatively constant ops array here, but I guess this ship has well sailed in blkdev, and trying to do something different here that isn't holistic with everything else isn't worth it. At the end of the day, while I think it ends up being a harder design to understand, changing only this to be different wouldn't help, so we should probably just stick with what we have.
  4. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3)
    FWIW, the return value from biowait is the same as the call to geterror(9F), at least as documented.
    1. I noticed that -- bd_flush_write_cache() does it that way, so I figured I'd be consistent with it just in case there was some subtlety or convention I had missed.

  5. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3)
    Given that we're shifting all of these values by shift, perhaps we should verify that they are all actually valid, so that a driver that attaches and reports a bad value (because it got a bad value, say, through virtio or from hardware) doesn't cause an overflow. If we do that as part of the other checks we're doing, we know this is safe everywhere else and don't have to check all subsequent operations here and in other parts of blkdev.
    1. We do some sanity checking in bd_attach() -- d_free_align must be > 0, and if either d_max_free_blks (max total bytes) or d_max_free_seg_blks (max bytes in one segment) are > 0, both must be set and d_max_free_blks >= d_max_free_seg_blks is enforced.

      What isn't currently enforced by blkdev is that d_blkshift defaults to 9, but the driver can change it to any value (including 0), similarly for d_qcount (not relevant for this change, but similar issue). As mentioned earlier, I suspect we really can't support block sizes < 512, so we can enforce 512 as the minimum block size. Similarly, I don't think a queue size of 0 makes any sense, and we should probably error out and fail to attach the instance if the driver alters it to 0.
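
      The checks described could be sketched as follows. The struct and function names here are illustrative only, not the actual blkdev code; the rules encoded are the ones stated above (d_free_align > 0, both free limits set together with total >= per-segment, a 512-byte minimum block size, and a nonzero queue count).

      ```c
      #include <stdint.h>

      /* Illustrative stand-in for the relevant bd_drive_t fields. */
      typedef struct {
      	uint32_t d_free_align;		/* in logical blocks, must be > 0 */
      	uint64_t d_max_free_blks;	/* 0 == no limit */
      	uint64_t d_max_free_seg_blks;	/* 0 == no limit */
      	uint32_t d_blkshift;		/* log2 of logical block size */
      	uint32_t d_qcount;		/* number of I/O queues */
      } drive_info_t;

      /* Returns 1 if the driver-reported values pass the sanity checks. */
      static int
      drive_info_valid(const drive_info_t *di)
      {
      	if (di->d_free_align == 0)
      		return (0);
      	/* If either limit is set, both must be. */
      	if ((di->d_max_free_blks > 0) != (di->d_max_free_seg_blks > 0))
      		return (0);
      	/* Total limit must cover at least one full segment. */
      	if (di->d_max_free_blks > 0 &&
      	    di->d_max_free_blks < di->d_max_free_seg_blks)
      		return (0);
      	/* Block sizes below 512 (shift 9) aren't supported. */
      	if (di->d_blkshift < 9)
      		return (0);
      	/* A queue count of 0 makes no sense. */
      	if (di->d_qcount == 0)
      		return (0);
      	return (1);
      }
      ```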

  6. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3)
    Is there any particular reason you chose ENXIO here?
    1. All the other uses of cmlb_partinfo (bd_dump, bd_strategy) treat any failures in the call as ENXIO, so it seemed the most consistent.

  7. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3)
    Are we guaranteed that the longlong_t in the form of the diskaddr_t values that we're getting won't be negative?
    1. I think so, though available information makes it a somewhat weak assertion -- I cannot find anything that states that an offset of -1 or a length of -1 are valid partition values (and what -1 would mean as a 'valid value'), and cmlb_partinfo is only supposed to succeed for valid partitions.

  8. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 3)
    This isn't in the Virtio block spec yet, but is in qemu/linux, fwiw. Not sure we should remove it.
  9. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 3)
    Since this is the first usage of the WO string in this context, probably useful to indicate what WO means for future readers.
  10. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3)
    dfl_num_exts is a uint64_t, not a size_t. We should probably match.
  11. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3)
    I assume that this is being verified / checked as something that won't overflow by callers?
    1. Yes -- the only caller should be from blkdev which has already validated this. I can add an ASSERT or VERIFY if you want (though usually then someone asks why it's being checked twice :P).

  12. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3)
    Since we're only using this for the bcopy, maybe worth making it const so bcopy will make sure we never mess up the argument orders.
    1. Seems reasonable.

  13. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3)
    So in vioblk_bd_flush() we sanity check the feature is here. Do we need to do the same here?
    1. I don't think there's a way for the host or the guest to renegotiate the features, but also no harm in being as paranoid as vioblk_bd_flush() either.

  14. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3)
    Should we sanity check these at all? If the host gave us a bad value, that would cause us to fail to attach. If this is a boot device, we may prefer to disable the advertisement of the feature.
    1. We can do a little bit at least -- the spec is pretty thin here, but I think we can safely assume none of those values should be zero (and disable support if any of them are).

  15. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3)
    These kinds of comments make me feel like we should fix this. Certainly it's an easier driver design when we can have a constant ops vector if possible.
  16. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3)
    Maybe it's an rb thing, but it seems visually like this assignment looks different from the others?
    1. Hrm.. needs two tabs, not one. I'll fix that.

  17. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    So, reading this doesn't immediately clarify things for me, fwiw. It's not clear what the resources are in this case, and given that a driver would have to call dfl_free() after calling this, which you do anyway, I think I'm missing something here.
    1. Let me see if I can word it better (I think I was probably merging a bit of how blkdev handles them into here, muddying things a bit) -- the idea is that dfl_iter() may just pass along the dkioc_free_list_t to the callback unchanged if the contents don't require any modification. Since the only major consumer of DKIOCFREE -- zfs -- is already fairly conservative in how it generates DKIOCFREE requests, there's a fairly decent chance that most of the time dfl_iter() can do this.

      When dfl_iter() cannot just pass the dkioc_free_list_t along, it must allocate new instances of dkioc_free_list_ts, populate and pass those along to the callback. To try to keep things simple, the callback is responsible for freeing the dkioc_free_list_t passed to it. If the dkioc_free_list_t passed to the callback is not the one passed to dfl_iter() (because dfl_iter() had to allocate a new one), dfl_iter() frees the dkioc_free_list_t passed to it.

      The net effect is that the caller hands the dkioc_free_list_t off to dfl_iter(), the callback ensures whatever list it receives is freed, and the caller never has to worry about whether the request needed modification. The alternative would require creating multiple dkioc_free_list_t copies along the way -- which we could do, it just seems unnecessary.
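      The ownership contract described above could be sketched like this (simplified, illustrative names -- not the actual dkioc_free_util.c code):

```c
#include <stdlib.h>

/*
 * Sketch of the dfl_iter() ownership contract: the callback always
 * frees the list it receives; dfl_iter() frees the original only when
 * it had to allocate an adjusted replacement.
 */
typedef struct dfl {
	int dfl_dummy;		/* stand-in for the real extent data */
} dfl_t;

typedef int (*dfl_cb_t)(dfl_t *, void *);

static int
dfl_iter(dfl_t *dfl, int needs_copy, dfl_cb_t cb, void *arg)
{
	dfl_t *out = dfl;

	if (needs_copy) {
		/* Request needs adjustment; build a new list. */
		out = malloc(sizeof (*out));
		if (out == NULL)
			return (-1);
		*out = *dfl;
		/* We allocated a replacement, so the original is ours. */
		free(dfl);
	}

	/* The callback takes ownership of whatever list it is handed. */
	return (cb(out, arg));
}

/* A minimal callback honoring the contract: free what you're given. */
static int
cb_free(dfl_t *dfl, void *arg)
{
	(void)arg;
	free(dfl);
	return (0);
}
```

      Either way, exactly one free happens per list, and the dfl_iter() caller never needs to know which path was taken.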

  18. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    These all appear to be uint64_t / size_t mismatches. The indexes are all being compared to a uint64_t, and n_bytes is having uint64_ts added to it.
  19. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    Should we assume that dfi_bshift is well formed and correct and that this isn't a value that'll cause this to overflow?
    1. Since it's unsigned, it cannot be negative, and 0 could be valid for something that uses bytes, so the only thing we could check would be a maximum size, though any cap we impose is going to be fairly arbitrary. That being said, it wouldn't surprise me if most of the block device code implicitly requires a block size of at least 512, so if we're fairly confident about that, we could add a requirement that dfi_bshift >= 9 and enforce it.

    2. This was referring to a case where a probably erroneous dfi_bshift of 33 was passed, which is how we would end up overflowing a uint_t.
    3. I'll add a check here.

  20. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    A comment about what the different cases are handling might be useful -- I've been trying to make sure I've understood all the cases that are covered. This applies to all the major cases here.
  21. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    You store bsize as a uint_t above in dfl_iter(), but as a uint64_t here. Is there a reason for the difference?
    1. The P2ALIGN macro will truncate to 32 bits -- I'll add a comment.
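      The truncation hazard being referred to can be demonstrated in isolation (the macro below follows the usual illumos definition; the helper names are made up for the example):

```c
#include <stdint.h>

/*
 * P2ALIGN does its arithmetic at the type of its operands: with a
 * 32-bit alignment, -(align) is a 32-bit mask, which zero-extends and
 * silently clears the upper 32 bits of a 64-bit offset.  Promoting the
 * alignment (bsize) to uint64_t keeps the math 64-bit.
 */
#define	P2ALIGN(x, align)	((x) & -(align))

static uint64_t
align_off32(uint64_t off, uint32_t bsize)
{
	return (P2ALIGN(off, bsize));	/* 32-bit mask: truncates */
}

static uint64_t
align_off64(uint64_t off, uint64_t bsize)
{
	return (P2ALIGN(off, bsize));	/* 64-bit math: correct */
}
```

      This is why a uint64_t copy of bsize (with a comment) is the safer choice at the P2ALIGN call site.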

  22. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    type mismatch: dfl_num_exts is a uint64_t, i should be a size_t.
  23. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    It does seem that ENXIO is more documented for this case, though it is overloaded.
    1. I have no preference here, so I'm happy to put whatever feels more appropriate.

  24. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    I've been looking at this and there's something that's not settling for me. If we've been given a starting and ending address rounding up the starting address and aligning (truncating) the ending address will necessarily shrink the range we trim, but don't we need to actually increase it to cover what we've been asked to do?
    Or instead, are you shrinking it because it doesn't make sense to extend the range -- we'd be trimming something that still holds valid data?
    All of this makes me feel that, semantically, we're doing a best effort at what the caller asked for, that your comments about this being a bad interface we're making the best of are true, and that we should come up with a better way to express this. If I were trying to trim/discard something and the device had different constraints than I thought, I'd really want to know what they were so I could deal with them properly at my level and adjust my expectations.
    1. Yes -- the adjustment is to prevent freeing of data outside of the original range. If the start and lengths already meet the alignment requirements, the values are unchanged.

      A better interface would be to have a way for consumers to discover the requirements, and then require that they only issue conforming requests (rejecting the rest). That's a large enough change to probably warrant its own changeset, since it potentially touches most of the block device drivers (sd, sata, blkdev, etc.) as well as consumers (e.g. zfs). Though for things like an iSCSI target, they'd still need to implement the SCSI semantics (which I believe treat it more as an advisory request).
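      The conservative shrinking adjustment could be sketched like this (illustrative names and P2 macros per their usual illumos definitions; the real logic lives in dkioc_free_util.c):

```c
#include <stdint.h>

/*
 * Round the start of an extent up and the end down to the device's
 * alignment.  The trimmed range can only shrink -- we never discard
 * bytes outside what the caller asked for -- and an extent that is
 * already aligned passes through unchanged.
 */
#define	P2ROUNDUP(x, align)	((((x) - 1) | ((align) - 1)) + 1)
#define	P2ALIGN(x, align)	((x) & -(align))

/* Returns the adjusted length; *startp is updated in place. */
static uint64_t
clamp_extent(uint64_t *startp, uint64_t len, uint64_t align)
{
	uint64_t start = P2ROUNDUP(*startp, align);
	uint64_t end = P2ALIGN(*startp + len, align);

	*startp = start;
	return (end > start ? end - start : 0);
}
```

      An extent smaller than one alignment unit collapses to length zero, which matches the advisory, best-effort semantics discussed above.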

  25. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    I believe start_idx and n probably want to be uint64_t's given the values they're derived from are.
  26. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3)
    size_t / uint64_t comparison. You probably want to make idx a uint64_t.
    Should this instead be a check that we do at the start and just fail, rather than asserting it?
    1. I'll work it in the earlier overflow check.

  27. usr/src/uts/common/sys/blkdev.h (Diff revision 3)
    Is there value in making this a const pointer?
  2. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 4)

    Shouldn't these only be checked if the device provides a o_free_space entry point?

    1. I suppose it depends on perspective -- a driver shouldn't bother altering them if they don't support o_free_space. I'd rather make sure the values were always sane (even if they're not used).

  3. usr/src/uts/common/os/dkioc_free_util.c (Diff revision 4)

    ERANGE is "Math result not representable", so I would prefer ENXIO here. Another good choice might be EINVAL, after all an invalid argument was given if this happens.

    1. I'll go with EINVAL

  2. usr/src/uts/common/io/vioblk/vioblk.c (Diff revisions 3 - 5)
    I would recommend we include the invalid values in this error message. Otherwise a user will need to be taught how to fetch them out with mdb.
  3. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 5)
    This comment is out of date as there is no o_free_info member anymore.
  1. Thanks for all your work on this. I think this looks pretty good.

  1. I have one minor comment nit, but otherwise looks good.

  2. usr/src/uts/common/io/vioblk/vioblk.c (Diff revisions 7 - 9)
    I would use the actual spec name, virtio as opposed to vioblk.
  1. Ship It!
  1. Ship It!
Review request changed

Status: Closed (submitted)