usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) Would you be willing to start a small theory statement on blkdev that describes the relationships between the queues, devices, etc.? It would also help to document the locking rules.
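For instance, a hedged sketch of the kind of block comment that would help (the structure and lock names here are assumptions about this diff, not its actual contents):

```
/*
 * Theory statement sketch (names are illustrative only):
 *
 * Each bd_t (one per attached device) owns d_qcount bd_queue_t structures.
 * A bd_queue_t pairs a wait queue (I/O not yet issued) with a run queue
 * (I/O outstanding at the device), both protected by that queue's
 * q_iomutex.  Submission picks a queue, takes q_iomutex, and moves the
 * xfer from the waitq to the runq; completion takes the same q_iomutex
 * to remove it.  No queue lock is held across a call into the parent
 * driver.
 */
```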
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) Since we've added a new field here, what happens if folks don't actually set d_qcount? Should we pre-init it to 1 prior to calling this entry point? If a driver returns with a zeroed value, should we actually continue attaching it? If a driver requests some ridiculous number of queues, say accidentally ending up with a (uint64_t)-1, we should consider capping it at some point, I imagine.
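Something along these lines is what I have in mind; the default, the cap, and BD_MAX_QCOUNT are invented for illustration, not a prescription:

```
#define	BD_MAX_QCOUNT	64		/* hypothetical upper bound */

bzero(&drive, sizeof (drive));
drive.d_qcount = 1;			/* default for drivers that never set it */
bd->d_ops.o_drive_info(bd->d_private, &drive);

if (drive.d_qcount == 0)
	drive.d_qcount = 1;		/* tolerate drivers that zero it */
else if (drive.d_qcount > BD_MAX_QCOUNT)
	drive.d_qcount = BD_MAX_QCOUNT;	/* cap ridiculous requests */
```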
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1) Isn't this a global value? Aren't we clobbering it by having each thread always enter it? This is true of all the other global activity here. Perhaps there's some other way this always worked that makes it fine. Just not sure how it changes since we'll be progressing through the queues in parallel now.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) If we're using taskq_dispatch_ent below, I'm not sure we should try and use the ddi_taskq_create function here and play casting games.
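For comparison, a minimal sketch of going straight to the native taskq interface; the taskq name, thread counts, and the nvme_cq_task/ncq_tqent identifiers are placeholders, not the driver's actual code:

```
taskq_t *tq;

tq = taskq_create("nvme_cq_tq", 1, minclsyspri, 1, 4, TASKQ_PREPOPULATE);

/* completion path: the ent is preallocated with the queue, so dispatch cannot fail */
taskq_dispatch_ent(tq, nvme_cq_task, cq, 0, &cq->ncq_tqent);

/* teardown */
taskq_destroy(tq);
```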
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) How did you come up with 4 here? Why do we need more than one if we're just doing this to get the device set up?
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Maybe worth a comment that describes how you came up with this calculation.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Please put '{' and '}' around this multi-line for loop.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Why are we decrementing the namespace attachable count here? Why are there no corresponding changes here for nvme_ioctl_detach()?
Remove a couple of performance bottlenecks in nvme
Review Request #2401 — Created Oct. 17, 2019 and submitted
Information
Submitter: pwinder
Repository: illumos-gate
Reviewers: general
This started as an exercise to improve concurrency through blkdev (https://www.illumos.org/issues/11827). Whilst testing that, I discovered that the single taskq used to handle command completions is also a bottleneck (https://www.illumos.org/issues/11847).
This fixes both, and when using vdbench it shows significant performance benefits - again, see https://www.illumos.org/issues/11827 for the benchmark results.
The changes in blkdev provide multiple wait/run queues per device.
Some of the changes in nvme are for the multiple blkdev queues; the majority change the single command completion taskq to one taskq per completion queue (a rough sketch of the resulting structures follows below).
I ran vdbench as a stress test and to confirm performance.
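Roughly the shape of the resulting structures, with invented names rather than the exact layout in the diff:

```
typedef struct bd_queue {
	kmutex_t	q_iomutex;	/* protects the two lists below */
	list_t		q_waitq;	/* xfers waiting to be issued */
	list_t		q_runq;		/* xfers outstanding at the device */
	uint32_t	q_qactive;	/* count of outstanding commands */
} bd_queue_t;

/*
 * Each bd_t then carries an array of d_qcount of these instead of a single
 * wait/run queue pair, and nvme maps each one onto its own submission/
 * completion queue pair with a per-queue completion taskq.
 */
```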
Modulo my desire to version the blkdev driver API to avoid future possible runtime incompatibilities (it would be fine to just fail to attach a driver that doesn't use the same API version for now), the rest looks very good.
Right now the multiple queues are being serviced in a naive round-robin fashion. I wonder if it might make sense to let the submission queues place I/Os more optimally by having blkdev keep track of which queues are least busy (a count per queue) and choose a queue based on queue depth. One possible way to do that would be to simply keep the queues in a list and move a queue to the head of the list when it completes an I/O. This might deal better with devices that take differing amounts of time for different operations.
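For example, a queue-depth based pick could be as simple as this sketch (bd_queue_t, d_queues, d_qcount, and q_qactive are assumed names from the discussion above, not necessarily what the diff uses):

```
static bd_queue_t *
bd_select_queue(bd_t *bd)
{
	bd_queue_t *best = &bd->d_queues[0];

	/* pick the queue with the fewest outstanding commands */
	for (uint_t i = 1; i < bd->d_qcount; i++) {
		if (bd->d_queues[i].q_qactive < best->q_qactive)
			best = &bd->d_queues[i];
	}
	return (best);
}
```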
usr/src/uts/common/sys/blkdev.h (Diff revision 1) So this alters the blkdev API. I think this is only used internally -- not a public API, so I think this is OK but we should check to make sure no other external consumers exist. (There should not be.)
It seems to me that it might be a good idea to version this. I'd also have put d_qcount at the end, so that we don't disrupt other field members.
Actually as I think about this -- we should go ahead and version this API now -- that will prevent problems in case someone has a binary that uses different structure offsets, etc.
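One possible shape for the version check; the macro, field placement, and failure behaviour here are illustrative only, not a statement of how the change should or does do it:

```
#define	BD_API_VERSION	1

typedef struct bd_drive {
	/* ... existing fields ... */
	uint32_t	d_qcount;
	uint32_t	d_version;	/* driver sets this to BD_API_VERSION */
} bd_drive_t;

/* in bd_attach_handle(), refuse drivers built against another version */
if (drive.d_version != BD_API_VERSION) {
	cmn_err(CE_WARN, "blkdev API version mismatch");
	return (DDI_FAILURE);
}
```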
I actually am wondering if the premise that the blkdev waitq/runq is a bottleneck is accurate. There may be some contention there, but I'd expect it to be minor. Did you try using a qsize that is really qcount * qsize in the nvme driver, and letting the nvme driver figure out how to dispatch the operation to its per-device queue instead? blkdev is meant to support a number of parallel operations at one time, without necessarily needing to know about the parallelism.
I guess what I'm saying here is, I'm not sure this complexity is warranted in blkdev, as I think about it.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) This comment probably needs updating.
It's still true that blkdev knows only one qsize per instance, but that is now applied to multiple queues. You also fixed the issue that we reduced the queue length even when we had only one usable namespace and any number of unusable ones.
Ship It!
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 2) I find "routines that a parent" easier to read.
usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 2) ... that it is lockless.
... it will be worthwhile
Thanks for the follow ups and the more detailed comments.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 3) What happens if bd_attach_handle() fails for some transient reason and never reaches the namespace attachable point? If that means we come back here, we could double-decrement, no? The current implementation of bd_attach_handle() may make this unlikely at best, but if it did happen it seems we'd be in trouble. For example, bd_attach_handle() silently returns success if we call it on an already attached item, though I don't think that would happen in this particular case.
usr/src/uts/common/io/nvme/nvme.c (Diff revisions 3 - 4) Reading the updated comment and the previous issue made me ask what makes sure there aren't two attach ioctls going on simultaneously here, both modifying this data. Shouldn't there be some kind of lock around this? Apologies, I didn't think about this sooner. Looking further, this seems to be a general problem with the attach and detach ioctls. We should probably file a follow-up bug, as this is unsafe. I'm not sure you're making it much worse given the kmem manipulations going on here.
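Illustrative only, to make the follow-up concrete: something like a per-controller lock taken around both paths. The lock and helper names here are hypothetical, not existing nvme fields:

```
mutex_enter(&nvme->n_mgmt_mutex);	/* hypothetical serialization lock */
rv = nvme_attach_ns(nvme, nsid);	/* hypothetical attach helper */
mutex_exit(&nvme->n_mgmt_mutex);
```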
Change Summary:
I have consolidated the counting of attachable namespaces into one function, nvme_init_ns().
Diff: Revision 5 (+344 -131)
usr/src/uts/common/io/nvme/nvme.c (Diff revision 5) So the way this code is written, it seems like the intent is such that this can change across the lifetime of the device. For example, a namespace was ignored by default; however, it is then formatted.
If that happens, and the n_namespaces_attachable count changes, then all of the old namespaces will have the wrong divisor count in nvme_bd_driveinfo and there will be overlap in the mappings.
At least, as written, this would mean that if we came back later and attached the fourth namespace after formatting it, we wouldn't go back and call nvme_bd_driveinfo() on the three existing, valid namespaces. I think the first three would each still believe they have 1/3rd of the submission queue size they reported to blkdev, while the fourth would take 1/4 of it, so the original three would overlap with the fourth.
I'm not sure if this is actually a problem. Only that the comment in nvme_bd_driveinfo() suggests that this is.
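To make the overlap concrete, a worked example with invented numbers (not taken from the driver):

```
uint_t sq_len = 64;		/* pretend submission queue length */
uint_t old_share = sq_len / 3;	/* 21: what the first three namespaces reported */
uint_t new_share = sq_len / 4;	/* 16: what the fourth reports after formatting */
/* 3 * 21 + 16 = 79 claimed slots mapped onto 64 real entries */
```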
Change Summary:
Rework some comments around the use of n_namespaces_attachable.
Diff: Revision 6 (+355 -131)
For some reason the diff between 5 -> 6 doesn't show all the changes. The diff from orig -> 6 does have them all.