-
-
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) Should this really be spelled FALLTHROUGH? I don't remember exactly if gcc wasn't happy with FALLTHRU.
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) Should this be above along with "current" version as it's BD_OPS_VERSION_1 where we add new functionality?
-
usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1) Why do we need to explicitly state that 0x21 is unused?
-
usr/src/uts/common/sys/blkdev.h (Diff revision 1) Likely needs to be updated? Or I would drop the first sentence completely as it's rather obvious.
Add support to vioblk for DISCARD operation
Review Request #2534 — Created April 10, 2020 and submitted
Information | |
---|---|
jbk | |
illumos-gate | |
12506 | |
Reviewers | |
general | |
Adds support for vioblk DISCARD operation
zfs TRIM tests, manually issuing various DKIOCFREE calls while observing the resulting guest and host calls
- 3
- 0
- 34
- 9
- 46
Description | From | Last Updated |
---|---|---|
I think the flow of control is clearer when the thing that does the allocation is the same place that ... |
|
|
I wonder if we want a different way of a driver being able to indicate the support for trim / ... |
|
|
These kinds of comments make me feel like we should fix this. Certainly it's an easier driver design when we ... |
|
-
I started going through this, but ended up with several larger architectural questions. I figured it was worth sharing those and talking through those first and opted to get that out rather than go through all the details. Hopefully that makes sense. I think we really want to think through the error semantics and the actual blkdev APIs, how that works with multiple queues, etc.
-
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) Should we sanity check any of these values before we hand them back further in the stack?
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) Given that your specifying the sleeping or non-sleeping allocation behavior into the iteration function, maybe that's worth passing into this since you're passing that along?
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) The current design of this makes it seem like we're only ever going to be issue a single request at a time to a given device. It feels like that's not going to be great in an nmve world where you need a larger queue depth to be able to support saturation of the device. That doesn't mean this needs to change at this particular time, but it probably means we should think through what kinds of questions that we want to ask.
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 1) This suggests that the interface is really unsafe for an application to use as written, if they care about reliable results. While forcing all callers to break this up into reasonable requests on their own is painful, if we can't communicate what happened with a dkioc free correctly, will the right thing happen? For example, if a caller assumes that it's either all worked or none of it worked, but the reality is that some of the blocks were freed before an error, but they don't think they were and that the data is still valid, it seems like that'd lead to probable data corruption. What is the ZFS usage here? What are its assumptions going to be? Are callers really all going to treat this as advisory?
-
usr/src/uts/common/sys/blkdev.h (Diff revision 1) Why is this opaque? The client is going to need to know the size and layout of the structure so they can actually use it.
-
usr/src/uts/common/sys/blkdev.h (Diff revision 1) I think it's worth spending some time documenting this in more detail. For example, it took me a while to realize that bfi_align was in units of the sector size and it's not clear that it should be the physical or the logical sector size, given that a blkdev device can specify both. How did you pick whether it was the physical or the logical block size that it represented. Did you check the NVMe and eMMC 4.5 spec (which added a logical discard) to make sure the constraints can be reflected here.
-
usr/src/uts/common/sys/blkdev.h (Diff revision 1) So, right now the only member of the bd_xfer_t that a client driver uses is the additional segment list when processing free requests. Does it really make sense to use that for the user? Actually, as I was reviewing other bits, a more general question came to mind: 1) Should the free space info just be part of the existing drive information? 2) Should this really be its own submission point. The main reason I ask is that I wonder if devices need to control whether or not other I/O is going on at the same time and if we don't want the ability to leverage all the existing queuing mechanisms and be able to specify that. While vioblk or say the eMCC driver don't probably care, nvme definitely will if these are interwoven in any serious fashion and we'd want that to tie into the existing queueing mechanism.
-
usr/src/uts/common/sys/blkdev.h (Diff revision 1) This should definitely return an int so that a caller can indicate failure. The inability to indicate failure has been a problem in other APIs.
-
usr/src/uts/common/sys/dkioc_free_util.h (Diff revision 1) The macro name implies this is a boolean, but it's not actually. I'd suggest actually constructing that.
-
usr/src/uts/common/sys/dkioc_free_util.h (Diff revision 1) Why does the only producer of this structure speak in terms of uint64_t units and here you're using units of a size_t? While they are conveniently the same on an LP64 system, it seems like that's a recipe for eventual confusion and type mismatches. It also seems odd when the dfl_iter function which takes this is comparing it to uint64_t lengths.
-
usr/src/uts/common/sys/dkioc_free_util.h (Diff revision 1) Does the system already have a constraint on power of 2 block sizes?
-
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) There's almost 400 existing uses of it in usr/src/uts/common -- and gcc7 doesn't seem to flag it -- so seems like it's ok.
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) This confused me a bit since instead of making
BD_OPS_CURRENT_VERSION
effectively an alias to the latestBD_OPS_VERSION_xx
value, it is the current version -- that is everything prior to this change uses a value of1
or0
(though thebd_ops_t
definition is the same with those two).This change adds a new version (
2
) asBD_OPS_CURRENT_VERSION
, which follows the previous work.It would probably be clearer to add
BD_OPS_VERSION_2
to the enum, and haveBD_OPS_CURRENT_VERSION
be an alias toBD_OPS_VERSION_2
, but I'm not sure how bad that'd break existing expectations elsewhere. -
usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1) The existing values (with the sizes in the comments beside them) end up being contiguous chunk of the configuration space. I though it'd be good to explicitly note that there's a gap there in case someone though that perhaps
VIRTIO_BLK_CONFIG_WRITEBACK
was actually supposed to be 16 bits and not 8.
Diff: |
Revision 3 (+850 -92)
|
---|
-
Apologies for the delay in getting back to you, Jason. I've gone through this now and a have a few questions and thoughts, some of it which is inspired by your own comments which are pointing out issues with the interfaces. Thanks for doing this work.
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3) I think the flow of control is clearer when the thing that does the allocation is the same place that does the freeing. Does dfl_iter(), called by bd_free_space(), have to be the thing that does the kmem_free() of dfl? It feels like it might be clearer to see how data gets freed if it's possible to have it be the responsibility of the allocator. Given that there's a couple layers here to track through, I'd recommend at least commenting about this fact here.
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3) I wonder if we want a different way of a driver being able to indicate the support for trim / discard than plumbing through the op vector or not. In other areas when there's conditional support, usually the function remains but we indicate the ops vector support another way. Though I realize that this is something that the blkdev does already with other areas.
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3) FWIW, the return value from biowait is the same as the call to geterror(9F) at last as documented.
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3) Given that we're shifting all of these values by shift, perhaps we should verify that they are all actually valid and that a driver that attaches and reports a bad value (because it got a bad value say through the virtio or from hardware) doesn't cause an overflow as part of the other checks that we're doing so we know that everywhere else this is safe and that way we don't have to check all subsequent operations here and in other parts of blkdev.
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3) Is there any particular reason you chose ENXIO here?
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 3) Are we guaranteed that the longlong_t in the form of the diskaddr_t values that we're getting won't be negative?
-
usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 3) This isn't in the Virtio block spec yet, but is in qemu/linux, fwiw. Not sure we should remove it.
-
usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 3) Since this is the first usage of the WO string in this context, probably useful to indicate what WO means for future readers.
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3) dfl_num_exts is a uint64_t, not a size_t. We should probably match.
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3) I assume that this is being verified / checked as something that won't overflow by callers?
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3) Since we're only using this for the bcopy, maybe worth making it const so bcopy will make sure we never mess up the argument orders.
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3) So in vioblk_bd_flush() we sanity check the feature is here. Do we need to do the same here?
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3) Should we sanity check these at all? If the guest gave us a bad value, that would cause us to fail to attach. If this is a boot device, we may prefer to disable the advertisement of the feature.
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3) These kinds of comments make me feel like we should fix this. Certainly it's an easier driver design when we can have a constant ops vector if possible.
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 3) Maybe it's an rb thing, but it seems visually like this assignment looks different from the others?
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) So, reading this it doesn't immediately clarify things for me to the point that this is clear, fwiw. It's not clear what the resources are in this case and given that a driver would have to call dfl_free() after calling this, which you do anyways, I think I'm missing something here.
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) These all appear to be a uint64_t / size_t mismatch. The indexes are all being compared to a uint64_t and the n_bytes is having uint64_t's added to it.
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) Should we assume that dfi_bshift is well formed and correct and that this isn't a value that'll cause this to overflow?
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) Some comment about what the different cases are handling might be useful. I've been trying to make sure I've understood all the different cases that are covered, but it may be useful. This is for all the major cases here.
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) You store bsize as a uint_t above in dfl_iter(), but as a uint64_t here. Is there a reason for the difference?
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) type mismatch: dfl_num_exts is a uint64_t, i should be a size_t.
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) It does seem that ENXIO is more documented for this case, though it is overloaded.
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) I've been looking at this and there's something that's not settling for me. If we've been given a starting and ending address rounding up the starting address and aligning (truncating) the ending address will necessarily shrink the range we trim, but don't we need to actually increase it to cover what we've been asked to do? Or instead, is the reason that you're shrinking it is because it doesn't make sense to do this, because we'd be trimming something that has valid data, which is why we're shrinking the range. All of this makes me feel like semantically we're doing something close on a best effort to what the caller wants and that your comments that this is a bad interface we're doing the best we can with are true and that we should be coming up with a better way to express this. If I was trying to trim/discard something but the device had different constraints from what I thought, I'd really want to know what those were so I could deal with that properly at my level and adjust my expectations.
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) I believe start_idx and n probably want to be uint64_t's given the values they're derived from are.
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 3) size_t / uint64_t comparison. You probably want to make idx a uint64_t. Should this instead be a check that we do at the start and just fail, rather than asserting it?
-
Diff: |
Revision 4 (+930 -93)
|
---|
-
-
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 4) Shouldn't these only be checked if the device provides a o_free_space entry point?
-
usr/src/uts/common/os/dkioc_free_util.c (Diff revision 4) ERANGE is "Math result not representable", so I would prefer ENXIO here. Another good choice might be EINVAL, after all an invalid argument was given if this happens.
Diff: |
Revision 5 (+929 -93)
|
---|
-
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revisions 3 - 5) I would recommend we include the invalid values in this error message. Otherwise a user will need to be taught how to fetch them out with mdb.
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 5) This comment is out of date as there is no o_free_info member anymore.
Diff: |
Revision 6 (+930 -93)
|
---|
Diff: |
Revision 7 (+944 -93)
|
---|
Diff: |
Revision 8 (+958 -93)
|
---|
Change Summary:
Rebase and fix missing
d_max_free_seg_blks
in vioblk.
Diff: |
Revision 9 (+972 -97)
|
---|
-
I have one minor comment nit, but otherise looks good.
-
usr/src/uts/common/io/vioblk/vioblk.c (Diff revisions 7 - 9) I would use the actual spec name, virtio as opposed to vioblk.
Diff: |
Revision 10 (+972 -97)
|
---|