Remove a couple of performance bottlenecks in nvme

Review Request #2401 — Created Oct. 17, 2019 and submitted


This started as an exercise to improve concurrency through blkdev ( ). Whilst testing that I discovered that the single taskq used to handle command completions is also a bottleneck ( ).

This fixes both and, when using vdbench, shows significant performance benefits; again, see for benchmark results.

The changes in blkdev are to provide multiple wait/run queues per device.
Some of the changes in nvme support the multiple blkdev queues; the majority change the single command completion taskq to one taskq per completion queue.

I ran vdbench as a stress test and to confirm performance.

  2. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)
    Would you be willing to start a small theory statement on blkdev that describes the relationships between the queues, devices, etc.? It would also help to document the locking rules.
  3. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)
    Since we've added a new field here, what happens if folks don't actually set d_qcount? Should we pre-init it to 1 prior to calling this entry point? If a driver returns with a zeroed value, should we actually continue attaching it? If a driver requests some ridiculous number of queues, say accidentally ending up with a (uint64_t)-1, we should consider capping it at some point, I imagine.
    1. That's a good idea, and I can drop the one-line changes I made in the other drivers to set it to 1.
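
A minimal sketch of the defensive handling discussed above: pre-initialise d_qcount to 1 before calling the driver's drive-info entry point, then clamp whatever the driver set. The function name and the `BD_MAX_QCOUNT` cap are illustrative only, not the actual blkdev code.

```c
#include <assert.h>

/* Hypothetical cap; the review only suggests "capping it at some point". */
#define	BD_MAX_QCOUNT	64

/*
 * Sanitise the queue count a driver reported: a zeroed value (driver
 * never set it) falls back to 1, and a ridiculous value (e.g. an
 * accidental (uint64_t)-1) is clamped to the cap.
 */
unsigned int
bd_sanitize_qcount(unsigned int d_qcount)
{
	if (d_qcount == 0)
		return (1);
	if (d_qcount > BD_MAX_QCOUNT)
		return (BD_MAX_QCOUNT);
	return (d_qcount);
}
```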

  4. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 1)
    Isn't this a global value? Aren't we clobbering it by having each thread always enter it? This is true of all the other global activity here. Perhaps there's some other way this always worked that makes it fine. Just not sure how it changes since we'll be progressing through the queues in parallel now.
    1. It's not a single global value. The kstat runq/waitq counters are intended to support multiple outstanding entries in the queue.

  5. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    idx is unsigned so the last %d should be %u.
  6. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    If we're using taskq_dispatch_ent below, I'm not sure we should try and use the ddi_taskq_create function here and play casting games.
  7. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    How did you come up with 4 here? Why do we need more than one if we're just doing this to get the device set up?
    1. The admin queue is not just used during setup. It is used by quite a few ioctls, including set/get_features, format, firmware updating, and retrieving log pages.

  8. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Maybe worth a comment that describes how you came up with this calculation.
  9. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Please put '{' and '}' around this multi-line for loop.
  10. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Why are we decrementing the namespace attachable count here?
    Why are there no corresponding changes here for nvme_ioctl_detach()?
    1. During namespace enumeration, if a namespace fails to meet certain criteria it is marked to be ignored; if a namespace is not flagged to be ignored, it is considered attachable.
      When a namespace is attached through the ioctl, it goes through the same initialisation and the attachable count could be incremented. To avoid double counting, if the namespace was previously accounted for (i.e. it wasn't ignored), it needs to be discounted first.
      Detaching a namespace doesn't affect its attachable state; it is still there, just not attached.
      The attachable counter is a mechanism so we can set reasonable queue sizes. The previous mechanism counted all namespaces, good and bad (as many as 128), and unnecessarily reduced the queue size.

    2. OK. I think this would be worth a comment that explains this. It was hard for me to see that interaction.
    3. Done
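
The accounting described in this thread can be sketched as follows. The struct and function names are illustrative stand-ins, not the nvme driver's actual fields: during enumeration, a namespace that isn't flagged "ignore" counts as attachable; when an already-counted namespace is re-initialised via the attach ioctl, it is discounted first so that re-counting it doesn't double-count; detach does not change attachability.

```c
#include <assert.h>

/* Illustrative per-namespace state plus a device-wide counter. */
typedef struct ns_state {
	int		ns_ignore;	/* namespace flagged to ignore */
	unsigned int	*ns_attachable;	/* device-wide attachable count */
} ns_state_t;

/*
 * Re-run namespace initialisation.  still_ok says whether the
 * namespace still meets the attachability criteria.
 */
void
ns_reinit(ns_state_t *ns, int still_ok)
{
	/* Discount first if previously counted (i.e. not ignored). */
	if (!ns->ns_ignore)
		(*ns->ns_attachable)--;
	/* Re-apply the init-time check; re-count if still attachable. */
	ns->ns_ignore = !still_ok;
	if (still_ok)
		(*ns->ns_attachable)++;
}
```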

  1. Modulo my desire to version the blkdev driver API to avoid future possible runtime incompatibilities (it would be fine to just fail to attach a driver that doesn't use the same API version for now), the rest looks very good.

    Right now the multiple queues are being serviced in naive round-robin fashion in this implementation. I wonder if it might make sense to allow the submission queues to place I/Os more optimally by having blkdev keep track of which queues are least busy (a count per queue), and choose the optimal queue based on queue depth. One possible way to do that would be to simply have the queues in a list where we move queues to the head of the list when they complete an I/O. This might deal better with devices that take differing amounts of time for different operations.

    1. The bd_ops struct has a version field. I can bump it to 1, and fail bd_alloc_handle() if it doesn't match (that is where the ops is first referenced).
      Seem ok?

    2. Yes, that would be fine.

    3. Done

    4. The round-robin mechanism is naive, and we do have the kstats with queue sizes. As I won't have access to my resources for too much longer, I'd rather leave it as it is here.

    5. I agree that we shouldn't change it at this point. The least-filled queue isn't always the best option and has its own pathologies that you can end up in.
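
The version check agreed above ("bump bd_ops to version 1 and fail bd_alloc_handle() on mismatch, since that is where the ops is first referenced") might look something like this sketch. The struct is heavily simplified and the constant name is an assumption, not the actual blkdev.h definition.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constant; the review settles on bumping the version to 1. */
#define	BD_OPS_CURRENT_VERSION	1

/* Simplified stand-in for the blkdev ops structure. */
typedef struct bd_ops {
	int	o_version;
	/* ... entry points elided ... */
} bd_ops_t;

/*
 * Check performed at the top of bd_alloc_handle(): refuse to hand out
 * a handle to a driver built against a different bd_ops API version,
 * which in turn makes that driver fail to attach.
 */
int
bd_ops_version_ok(const bd_ops_t *ops)
{
	return (ops != NULL && ops->o_version == BD_OPS_CURRENT_VERSION);
}
```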

  2. usr/src/uts/common/sys/blkdev.h (Diff revision 1)

    So this alters the blkdev API. I think this is only used internally -- not a public API, so I think this is OK but we should check to make sure no other external consumers exist. (There should not be.)

    It seems to me that it might be a good idea to version this. I'd also have put d_qcount at the end, so that we don't disrupt other field members.

    Actually as I think about this -- we should go ahead and version this API now -- that will prevent problems in case someone has a binary that uses different structure offsets, etc.

    1. I will put d_qcount at the end

    2. With the change of bd_ops version, I have left the field as is.

  1. I actually am wondering if the premise that the blkdev waitq/runq is a bottleneck is accurate. There may be some contention there, but I'd expect it to be minor. Did you try to use a qsize that is really qcount * qsize in the nvme driver, and letting the nvme driver figure out how to dispatch the operation to its per device queue instead? blkdev is meant to support a number of parallel operations at one time, without necessarily needing to know about the parallelism.

    I guess what I'm saying here is, I'm not sure this complexity is warranted in blkdev, as I think about it.

    1. It was lockstat that highlighted which mutexes were hot. At the top of the list was the d_iomutex, followed by the mutex within the taskq enqueue/dequeue path.
      If you look at it more closely, when an I/O is enqueued it goes onto the waitq, then moves to the runq, and is removed from the runq when it is complete. Each of these requires a lock/unlock of the d_iomutex. So, to achieve 200K IOPS, there will be 600K lock/unlock mutex events from multiple threads.
      It was only after I refactored the mutexes that I was able to achieve this level of IOPS.
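
A userland sketch (pthreads standing in for kernel kmutex_t, names illustrative) of the arithmetic above: each I/O takes the queue mutex three times, once to enqueue on the waitq, once to move to the runq, and once on completion, so a single shared d_iomutex sees three lock/unlock pairs per I/O from every submitting thread. Splitting the state into per-queue mutexes divides that traffic by the queue count.

```c
#include <pthread.h>

/* One wait/run queue pair with its own mutex (per the blkdev change). */
typedef struct bd_queue {
	pthread_mutex_t	q_iomutex;
	unsigned long	q_lockops;	/* lock acquisitions, for the demo */
} bd_queue_t;

/* Take and drop the queue mutex once, counting the acquisition. */
void
bd_queue_lockop(bd_queue_t *bq)
{
	pthread_mutex_lock(&bq->q_iomutex);
	bq->q_lockops++;
	pthread_mutex_unlock(&bq->q_iomutex);
}

/* One simulated I/O: waitq insert, waitq-to-runq move, runq removal. */
void
bd_sim_io(bd_queue_t *bq)
{
	bd_queue_lockop(bq);	/* enqueue on waitq */
	bd_queue_lockop(bq);	/* move to runq */
	bd_queue_lockop(bq);	/* remove on completion */
}
```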

  2. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    Would it be worth to #define this magic number?

    1. I can if you want. It just has to be large enough for the formatted taskq name.

    2. I've added a comment

  3. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    This comment probably needs updating.

    It's still true that blkdev knows only one qsize per instance, but that is now applied to multiple queues. You also fixed the issue that we reduced the queue length even when we had only one usable namespace and any number of unusable ones.

  2. usr/src/uts/common/io/blkdev/blkdev.c (Diff revisions 1 - 2)

    nit: Incoming

  1. Ship It!
  2. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 2)
    "One of the"?
  3. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 2)

    I find "routines that a parent" easier to read.

  4. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 2)

    ... that it is lockless.

    ... it will be worthwhile

  5. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 2)

    ... queues depending on

  6. usr/src/uts/common/io/blkdev/blkdev.c (Diff revision 2)

    ... pair is protected by its mutex?

  7. usr/src/uts/common/io/nvme/nvme.c (Diff revision 2)


  1. Ship It!
  1. Thanks for the follow ups and the more detailed comments.

  2. usr/src/uts/common/io/nvme/nvme.c (Diff revision 3)

    What happens if bd_attach_handle() fails for some reason and doesn't get to the namespace attachable point due to a transient reason? If that means we came back here we could double decrement, no? It may be that the current implementation of bd_attach_handle makes this unlikely at best, but it seems if that's the case then we'd be in trouble. For example, bd_attach_handle() silently returns success if we call this on an already attached item. Though I don't think that'd happen in this particular case.

    1. My bad ... the comment is incorrect. It will not be counted in bd_attach_handle. I will change the comment to:

          * If the namespace wasn't previously ignored, discount it now,
          * because if it is still attachable, the above nvme_init_ns()
          * would have re-counted it.
          * It is important that n_namespaces_attachable is maintained
          * accurately, as it is used in the determination of the active
          * I/O queue size limit passed back to blkdev.
    2. OK. I guess because this is the last thing that's done in nvme_init_ns() then it's safe for the time being. The updated comment makes this a bit clearer.

  2. usr/src/uts/common/io/nvme/nvme.c (Diff revisions 3 - 4)
    Reading the updated comment and the previous issue made me ask the question of what makes sure that there aren't two ioctls in attach going on simultaneously here both modifying this data? Shouldn't there be some kind of lock around this? Apologies I didn't think about this sooner.
    Actually, looking further, this seems to be a general problem with the attach and detach ioctls. We should probably file a follow-up bug, as this is unsafe. I'm not sure you're making this much worse given the kmem manipulations going on here.
    1. I have created bug 11964

  1. Ship It!
  1. Ship It!
  2. usr/src/uts/common/io/nvme/nvme.c (Diff revision 5)

    So the way this code is written, it seems like the intent is such that this can change across the lifetime of the device. For example, a namespace was ignored by default; however, it is then formatted.

    If that happens, and the n_namespaces_attachable count changes, then all of the old namespaces will have the wrong divisor count in nvme_bd_driveinfo and there will be overlap in the mappings.

    At least, as written, this would mean that if we then came back later and attached the fourth namespace after formatting it, that we wouldn't go back and call nvme_bd_driveinfo() on the three existing, valid namespaces. I think this means that the first three would think they each have 1/3rd of the submission queue size that they've told blkdev. However, the fourth would take up 1/4 of that and the original three would overlap with the fourth.

    I'm not sure if this is actually a problem. Only that the comment in nvme_bd_driveinfo() suggests that this is.

    1. As far as I'm aware in illumos we have no way of creating or deleting namespaces using nvmeadm. We can only attach/detach what is there.
      You are correct in that if we were able to manipulate namespaces in this way, then there would be disparity in the size of the queues.

    2. Also, it isn't really a mapping of the submission queues. It's an attempt to give equal queue sizes; there is already code in nvme_bd_driveinfo() which will stop the blkdev queue size from going below a minimum, in which case the submission queues will be over-subscribed.

      Generally, if I/O rates are high enough that blkdev submits more I/Os than there are submission queue entries, then excess commands will be held behind a semaphore in nvme.

      Prior to this change, blkdev tended to have a very small queue size because the divisor was the total number of namespaces (in the drives I have seen that tends to be 128) even if only one is used. We are now giving more reasonable blkdev queue sizes, and if they exceed the nvme submission queue size it isn't a real problem.

    3. You can use nvmeadm format to change a particular namespace's lba format. This is one of the ways in which you can toggle the ignorability bits. So there is a way to do this.

      That said, based on your reply, it sounds like the oversubscription is OK. The reason I mention it is that the comment at +3713 and at +2574 suggest that this wasn't true and that basically getting this wrong would be bad. I'm not sure I have a great suggestion for the comments about that.

    4. The drives I have worked with have had invalid namespaces where the block size is 1, and there are no other sizes indicated as supported by the namespace; I guess that's because they have zero size. I would expect (but I may be wrong) other namespaces to at least have a valid block size, which can be changed with the format option; in that case they would still be counted as attachable and not flagged as ignore.

      With all that, I will review and make appropriate amendments to the comments.
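
The queue-size calculation discussed in this thread can be sketched as below: the per-namespace blkdev queue size is the submission queue size divided by the number of attachable namespaces, clamped to a minimum, at which point the submission queues may be over-subscribed (excess commands are held behind a semaphore in nvme). The function name and the `BD_MIN_QSIZE` constant are illustrative assumptions, not the driver's actual identifiers.

```c
#include <assert.h>

/* Illustrative floor; nvme_bd_driveinfo() enforces some minimum. */
#define	BD_MIN_QSIZE	16

/*
 * Divide the submission queue size among the attachable namespaces,
 * but never report less than the minimum to blkdev.  Dividing by the
 * attachable count (rather than all 128 namespaces, good and bad)
 * is what keeps the blkdev queue sizes reasonable.
 */
unsigned int
nvme_bd_qsize(unsigned int sq_size, unsigned int n_attachable)
{
	unsigned int qsize;

	if (n_attachable == 0)		/* defensive: avoid divide-by-zero */
		n_attachable = 1;
	qsize = sq_size / n_attachable;
	if (qsize < BD_MIN_QSIZE)
		qsize = BD_MIN_QSIZE;	/* over-subscribe rather than starve */
	return (qsize);
}
```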

  1. For some reason the diff between 5 -> 6 doesn't show all the changes. The diff from orig -> 6 does have them all.

  1. Thanks for updating the comment here.

  2. usr/src/uts/common/io/nvme/nvme.c (Diff revisions 4 - 6)
Review request changed

Status: Closed (submitted)