Disconnect NVMe submission and completion queues

Review Request #1979 — Created June 7, 2019 and submitted

hans, rm

Separate the submission and completion queues. Each queue pair has a dedicated submission queue, but may share a completion queue. This allows the number of submission queues to be increased beyond the current limit imposed by the number of interrupt vectors.

The change includes a few bug fixes we have come across. They are highlighted in the diff.

See: https://www.illumos.org/issues/11202

Issues also covered here are:
nvme may queue more submissions than allowed
nvme_get_logpage() can allocate a too small buffer to receive logpage data
Panic in nvme_fill_prp() because of miscalculation of the number of PRPs per page
nvme in polled mode ignores the command callback

The base of this code has been in extensive use in WDC's Intelliflash O/S.

Within Illumos (using SmartOS), it has been tested by running the full zfs test suite against a set of NVMe drives, as well as stress tested using vdbench.

It has been run on raw hardware with Dual Intel(R) Xeon(R) Gold 6130, with NVMe drives in a PCI fanout behind a Switchtec switch.

Also tested in a Fusion VM with a NVMe root disk.

  2. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    When len > 1 page, we could get 2 cookies

  3. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    We spotted this - although never actually hit it.

  4. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    The size argument was incorrect

  5. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    These caused a panic when running zfs test suite

  2. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    I wondered if it would make sense to do something with io-queue-len if it is set and the new values are not present? It is possible that some people will have customised this in their driver conf file.

    I then checked and it seems that any local modifications to nvme.conf are not preserved on upgrade; that seems wrong but means you do not need to worry about backwards compatibility here at least.

    1. The algorithm effectively computes 2^(log2(n - 1) + 1).
      The n-- takes care of the case when n is already a power-of-2.
      The shift operations set to 1 all bits from the most significant set bit down to the lsb.
      I will add a comment on these lines to the code, and will publish an update when more have reviewed.
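
      For illustration, the round-up described above can be written as a stand-alone sketch (the function name here is hypothetical, not the driver's):

      ```c
      #include <assert.h>
      #include <stdint.h>

      /*
       * Round n up to the next power of two, i.e. 2^(log2(n - 1) + 1),
       * using the bit trick described above.
       */
      static uint32_t
      roundup_pow2(uint32_t n)
      {
      	n--;		/* handles n already being a power of 2 */
      	n |= n >> 1;	/* propagate the msb downwards ... */
      	n |= n >> 2;
      	n |= n >> 4;
      	n |= n >> 8;
      	n |= n >> 16;	/* ... until every bit below it is set */
      	return (n + 1);	/* all-ones plus one is a power of 2 */
      }

      int
      main(void)
      {
      	assert(roundup_pow2(1000) == 1024);
      	assert(roundup_pow2(1024) == 1024);	/* already a power of 2 */
      	assert(roundup_pow2(65537) == 131072);
      	return (0);
      }
      ```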

  2. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    Could you perhaps add a short comment on how or why this works or what the algorithm is called?

  1. Thanks for putting this together, I have a few questions and some minor feedback. In general, the approach seems good.

  2. usr/src/man/man7d/nvme.7d (Diff revision 1)
    This should clarify which will win then, the lower or higher value.
  3. usr/src/man/man7d/nvme.7d (Diff revision 1)
    larger than
  4. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Is it legal to hold these for multiple queues at any given time? If so what order do we need to take these? For example, if we have more sq's than cq's, we should probably comment on how to lock everything.
  5. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    This will overflow a uint32_t to zero when we have the largest value here. We should make sure no one actually uses it with anything larger than 1 << 31. The most common, though hopefully unlikely, case would be someone passing UINT32_MAX as an attempt to set a value to -1.
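
    To make the overflow concrete: rounding any value above 1 << 31 up to the next power of two wraps a uint32_t to zero, so a caller passing UINT32_MAX would end up with zero. A minimal sketch (the helper is a stand-alone illustration of the round-up, not the driver's code):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Stand-alone illustration of the round-up-to-power-of-2 step. */
    static uint32_t
    roundup_pow2(uint32_t n)
    {
    	n--;
    	n |= n >> 1;
    	n |= n >> 2;
    	n |= n >> 4;
    	n |= n >> 8;
    	n |= n >> 16;
    	return (n + 1);
    }

    int
    main(void)
    {
    	/* 1u << 31 is the largest 32-bit power of two; it rounds to itself. */
    	assert(roundup_pow2(1U << 31) == (1U << 31));
    	/* Anything above it sets all 32 bits, and the final +1 wraps to 0. */
    	assert(roundup_pow2((1U << 31) + 1) == 0);
    	assert(roundup_pow2(UINT32_MAX) == 0);
    	return (0);
    }
    ```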

  6. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    Can another bug be opened that describes this particular problem (it can be fixed in the same commit as the queue separation).

    Why did we need to round up to a power of two to guarantee that we only get a single DMA cookie? If we only want a single DMA cookie, all we need to do is just ask for a single DMA cookie in the request. It's not clear why we need more memory to be allocated. With the sgllen set to one, that's the thing that actually guarantees we only get a single cookie.

    Is this comment trying to help the system by rounding it up to a power of two to try and make that more likely? Is there a reason we're doing that to a power of two and not say page size? It seems like we could be wasting a lot of contiguous DMA memory, which if a device is hot plugged late in life of the system won't be available.

    Given that we can use a PRP list to represent a queue (based on my read of the spec), we should probably look at coming back to that for future work so we can make this much more likely to work and not have to constrain ourselves to a single cookie -- though I will admit it's drastically simpler.

    1. The work-around is for this particular check near the start of i_ddi_mem_alloc:

          if (attr->dma_attr_minxfer == 0 || attr->dma_attr_align == 0 ||
              (attr->dma_attr_align & (attr->dma_attr_align - 1)) ||
              (attr->dma_attr_minxfer & (attr->dma_attr_minxfer - 1))) {
                  return (DDI_FAILURE);
          }

      So, if dma_attr_minxfer and dma_attr_align (i.e. the len argument to nvme_zalloc_queue_dma()) are not a power of 2, the memory allocation fails.

    2. Been giving this some more thought.
      We should set dma_attr_align to a fixed size (I'd suggest n_pagesize); then the memory allocation should succeed and dma_attr_sgllen should work as expected.
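
      The i_ddi_mem_alloc() check quoted earlier relies on the classic bit test: a nonzero x is a power of two iff clearing its lowest set bit leaves zero. A minimal user-space demonstration of why a page-sized alignment passes while an arbitrary queue length can fail:

      ```c
      #include <assert.h>
      #include <stdint.h>

      /* x is a power of two iff it is nonzero and x & (x - 1) clears
       * its only set bit, leaving zero. */
      static int
      is_pow2(uint64_t x)
      {
      	return (x != 0 && (x & (x - 1)) == 0);
      }

      int
      main(void)
      {
      	assert(is_pow2(4096));		/* page-sized alignment passes */
      	assert(!is_pow2(3 * 1024));	/* non-power-of-2 length fails */
      	assert(!is_pow2(0));		/* zero is rejected as well */
      	return (0);
      }
      ```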

  7. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Is there a reason we don't pass ncq as a uint16_t given that's the valid range that we can actually create something? I know that we're casting it here, but it seems like there's never a case where it would be incorrect for this to take a uint16_t, and the callers should make sure that they are doing proper size checking before reaching here. I've left additional comments assuming you don't change this.
    1. The cq array includes the admin completion queue as well as all the others. Which does mean the array can be UINT16_MAX + 1.
      I have changed ncq to be uint_t.

  8. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    If we're casting ncq, shouldn't we do that before we actually do the allocation? This also means that currently if we get a value that's larger than a uint16_t, we'll allocate a different number of arrays than we have recorded, which means we'll incorrectly free this.
    1. I have changed n_cq_count to uint_t.
      There can also be UINT16_MAX + 1 qpairs (1 admin + UINT16_MAX std). I have changed n_ioq_count to uint_t too.

  9. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Shouldn't this be a uint16_t to actually match the width of the n_cq_count? This value should never be negative.
  10. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Can we add an ASSERT that the ncq mutex is held here please?
  11. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Any reason this isn't unsigned? This value should never be negative.
  12. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    I'd strongly suggest that the size of the DMA region be set to zero, which will do the entire region. This is what you're trying to do and will also cut out on a potential failure mode if that size is not right. In this case, casting to void means we could miss errors. I realize this is just a moved/renamed version of what already exists, but seems like we should fix that now.
    1. Changed the call to use a size of zero.
      99% of the kernel code either voids ddi_dma_sync() or simply prints a warning and continues. I have changed it to do the latter.

    2. If we pass zero for the offset and length, then it's defined not to fail, which is why I was suggesting changing it.
  13. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Doesn't this need to be cast to void if we're going to ignore the return value?
    1. It is a void function.

  14. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    I would change the size to zero here as well.
  15. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Can we please open another bug that covers this and can be fixed in this change?
  16. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Worth adding an ASSERT that neither of these are zero before we do the subtraction?
  17. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Do we need to log this every time we attach a device?
  18. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
    Should we actually change this to be MAX(n_submission_queues, n_completion_queues) so that way if we need to change things in the future around how many to have, we won't hit any implicit assumptions. Otherwise, should we assert or do something else to make sure that can't happen in the face of future refactoring?
    1. I'll add an ASSERT(n_submission_queues >= n_completion_queues)

  19. usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)

    Worth opening another bug for this issue?

  1. Thanks for the feedback.
    I'm reviewing the comments; I will open a few more bugs, address the comments, and re-post in a few days.

  1. Ship It!
  2. usr/src/uts/common/io/nvme/nvme.c (Diff revision 2)

    Extra word in comment: "When the there are".

    I have gone through all the changes; they look good to me.

  1. Ship It!
Review request changed

Status: Closed (submitted)