usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) We spotted this, although we never actually hit it.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) These caused a panic when running the zfs test suite.
Disconnect NVMe submission and completion queues
Review Request #1979 — Created June 7, 2019 and submitted
| Information | |
|---|---|
| Submitter | pwinder |
| Repository | illumos-gate |
| Reviewers | |
| Groups | general |
| People | hans, rm |
Separate the submission and completion queues. Each queue pair has a dedicated submission queue, but its completion queue may be shared with other queue pairs. This allows the number of submission queues to be increased beyond the current limit set by the number of interrupt vectors.
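Roughly, the new layout looks something like this (a minimal sketch with made-up type and field names, not the actual structures in the diff):

    #include <sys/types.h>
    #include <sys/mutex.h>

    /*
     * Illustrative sketch only -- the type and field names are made up,
     * not the ones used in nvme.c.  Each queue pair owns its submission
     * queue, while several queue pairs may point at the same completion
     * queue, so the number of submission queues is no longer bounded by
     * the number of interrupt vectors (one vector per completion queue
     * is enough).
     */
    typedef struct sketch_cq {
        kmutex_t    scq_mutex;    /* protects head index and phase */
        uint16_t    scq_id;       /* completion queue ID */
        uint16_t    scq_head;
        /* doorbell, DMA memory, ... */
    } sketch_cq_t;

    typedef struct sketch_qpair {
        kmutex_t    sqp_mutex;    /* protects tail index, outstanding cmds */
        uint16_t    sqp_sqid;     /* dedicated submission queue ID */
        uint16_t    sqp_tail;
        sketch_cq_t *sqp_cq;      /* completion queue, possibly shared */
        /* doorbell, DMA memory, command array, ... */
    } sketch_qpair_t;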
The change includes a few bug fixes we have come across. They are highlighted in the diff.
See: https://www.illumos.org/issues/11202
Issues also covered here are:
nvme may queue more submissions than allowed
nvme_get_logpage() can allocate a too small buffer to receive logpage data
Panic in nvme_fill_prp() because of miscalculation of the number of PRPs per page
nvme in polled mode ignores the command call back
The base of this code has been in extensive use in WDC's Intelliflash O/S.
Within Illumos (using SmartOS), it has been tested by running the full zfs test suite against a set of NVMe drives, as well as stress tested using vdbench.
It has been run on raw hardware with Dual Intel(R) Xeon(R) Gold 6130, with NVMe drives in a PCI fanout behind a Switchtec switch.
Also tested in a Fusion VM with an NVMe root disk.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) I wondered if it makes sense to do something with io-queue-len if it is set and the new values are not present? It is possible that some people will have customised this in their driver conf file.
I then checked, and it seems that any local modifications to nvme.conf are not preserved on upgrade; that seems wrong, but it means you do not need to worry about backwards compatibility here, at least.
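If backwards compatibility were wanted anyway, a fallback along these lines would cover it (a sketch only: io-squeue-len, io-cqueue-len and NVME_DEFAULT_IO_QUEUE_LEN are placeholder names and may not match what the diff actually uses):

    /*
     * Sketch with placeholder property names ("io-squeue-len",
     * "io-cqueue-len") and a made-up default.  Prefer the new
     * per-queue-type properties; if they are absent from nvme.conf,
     * fall back to a legacy io-queue-len setting if one was customised.
     */
    int legacy_len = ddi_prop_get_int(DDI_DEV_T_ANY, dip, DDI_PROP_DONTPASS,
        "io-queue-len", NVME_DEFAULT_IO_QUEUE_LEN);
    int sq_len = ddi_prop_get_int(DDI_DEV_T_ANY, dip, DDI_PROP_DONTPASS,
        "io-squeue-len", legacy_len);
    int cq_len = ddi_prop_get_int(DDI_DEV_T_ANY, dip, DDI_PROP_DONTPASS,
        "io-cqueue-len", legacy_len);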
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Could you perhaps add a short comment on how or why this works or what the algorithm is called?
Thanks for putting this together, I have a few questions and some minor feedback. In general, the approach seems good.
usr/src/man/man7d/nvme.7d (Diff revision 1) This should clarify which value wins then: the lower or the higher.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Is it legal to hold these for multiple queues at any given time? If so, what order do we need to take them in? For example, if we have more SQs than CQs, we should probably comment on how to lock everything.
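For instance, if the convention were that the shared completion queue's lock is always taken before a submission queue's lock (purely an assumed ordering, illustrated with the made-up names from the sketch above), it could be documented and asserted like this:

    /*
     * Assumed lock-order convention (not taken from the diff), using the
     * made-up names from the earlier sketch: when a thread needs both
     * locks, it takes the shared completion queue's mutex before the
     * submission queue's mutex, and never holds two submission queue
     * mutexes at once.
     */
    mutex_enter(&cq->scq_mutex);
    mutex_enter(&qp->sqp_mutex);
    ASSERT(MUTEX_HELD(&cq->scq_mutex));
    ASSERT(MUTEX_HELD(&qp->sqp_mutex));
    /* process completions / ring doorbells ... */
    mutex_exit(&qp->sqp_mutex);
    mutex_exit(&cq->scq_mutex);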
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) This will overflow a uint32_t to zero when we get the largest values here. We should make sure no one can actually use it with anything larger than 1 << 31. The most likely, though hopefully rare, case would be someone passing UINT32_MAX in an attempt to set the value to -1.
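To make the failure mode concrete, here is a sketch of a guarded round-up helper (an assumption about the shape of the code, not the code in the diff): anything above 1 << 31 rounds to 2^32, which wraps to 0 in a uint32_t.

    /*
     * Sketch of a guarded round-up.  Anything above 1 << 31 would
     * otherwise round to 2^32 and wrap to 0 in a uint32_t; UINT32_MAX
     * (a likely "set it to -1" mistake) hits exactly that case.
     */
    static uint32_t
    round_up_pow2_clamped(uint32_t x)
    {
        uint32_t p = 1;

        if (x > (1U << 31))
            return (1U << 31);    /* clamp instead of wrapping to 0 */
        while (p < x)
            p <<= 1;
        return (p);
    }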
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Can another bug be opened that describes this particular problem? (It can be fixed in the same commit as the queue separation.)
Why do we need to round up to a power of two to guarantee that we only get a single DMA cookie? If we only want a single DMA cookie, all we need to do is ask for a single cookie in the request. It's not clear why more memory needs to be allocated; sgllen set to one is what actually guarantees we only get a single cookie.
Is this trying to help the system by rounding up to a power of two to make a single cookie more likely? Is there a reason we're rounding to a power of two and not, say, page size? It seems like we could be wasting a lot of contiguous DMA memory, which, if a device is hot-plugged late in the life of the system, may not be available.
Given that we can use a PRP list to represent a queue (based on my read of the spec), we should probably look at coming back to that as future work so we can make this much more likely to work and not have to constrain ourselves to a single cookie -- though I will admit the single-cookie approach is drastically simpler.
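For reference, here is a sketch of DMA attributes that force a single cookie; the specific values are placeholders, and dma_attr_sgllen set to 1 is the part that matters:

    /*
     * Sketch only -- the values are placeholders, not what nvme.c uses.
     * dma_attr_sgllen == 1 is the field that forces the bind to return
     * a single cookie (or fail), independent of whether the buffer size
     * is a power of two.
     */
    static const ddi_dma_attr_t queue_dma_attr_sketch = {
        .dma_attr_version    = DMA_ATTR_V0,
        .dma_attr_addr_lo    = 0,
        .dma_attr_addr_hi    = 0xffffffffffffffffULL,
        .dma_attr_count_max  = 0xffffffffULL,
        .dma_attr_align      = 0x1000,            /* page aligned */
        .dma_attr_burstsizes = 0x7ff,
        .dma_attr_minxfer    = 1,
        .dma_attr_maxxfer    = 0xffffffffULL,
        .dma_attr_seg        = 0xffffffffULL,
        .dma_attr_sgllen     = 1,                 /* single cookie */
        .dma_attr_granular   = 1,
        .dma_attr_flags      = 0,
    };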
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Is there a reason we don't pass ncq as a uint16_t, given that's the valid range in which we can actually create something? I know that we're casting it here, but it seems like there's never a case where it's correct for this not to take a uint16_t, and the callers should make sure that they are doing proper size checking before reaching here. I've left additional comments assuming you don't change this.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) If we're casting ncq, shouldn't we do that before we actually do the allocation? As it stands, if we get a value larger than a uint16_t, we'll allocate a different number of array entries than we record, which means we'll free this incorrectly.
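Something like the following keeps the two in step (a sketch; nvme_cq_t, n_cq and n_cq_count are stand-ins for whatever the diff actually calls them):

    /*
     * Sketch with assumed names (nvme_cq_t, n_cq, n_cq_count): narrow
     * ncq once, up front, and use the narrowed value for both the
     * allocation and the recorded count, so the matching kmem_free()
     * length cannot diverge.
     */
    uint16_t ncq16 = (uint16_t)ncq;

    nvme->n_cq_count = ncq16;
    nvme->n_cq = kmem_zalloc(sizeof (nvme_cq_t *) * ncq16, KM_SLEEP);

    /* ... and later, on teardown ... */
    kmem_free(nvme->n_cq, sizeof (nvme_cq_t *) * nvme->n_cq_count);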
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Shouldn't this be a uint16_t to actually match the width of the n_cq_count? This value should never be negative.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Can we add an ASSERT that the ncq mutex is held here please?
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Any reason this isn't unsigned? This value should never be negative.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) I'd strongly suggest that the size of the DMA region be set to zero, which will sync the entire region. That is what you're trying to do, and it also cuts out a potential failure mode if that size is wrong. In this case, casting to void means we could miss errors. I realize this is just a moved/renamed version of what already exists, but it seems like we should fix that now.
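Roughly what I mean (a sketch; the handle and field names are stand-ins):

    /*
     * Sketch (handle and field names assumed): a length of 0 tells
     * ddi_dma_sync(9F) to sync the entire object, and the return value
     * is checked rather than discarded with a (void) cast.
     */
    if (ddi_dma_sync(dma->nd_dmah, 0, 0, DDI_DMA_SYNC_FORDEV) !=
        DDI_SUCCESS) {
        dev_err(nvme->n_dip, CE_WARN, "!ddi_dma_sync() failed");
        return (DDI_FAILURE);
    }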
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Doesn't this need to be cast to void if we're going to ignore the return value?
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Can we please open another bug that covers this and can be fixed in this change?
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Worth adding an ASSERT that neither of these are zero before we do the subtraction?
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Do we need to log this every time we attach a device?
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1) Should we actually change this to be MAX(n_submission_queues, n_completion_queues), so that if we need to change how many of each we have in the future, we won't hit any implicit assumptions? Otherwise, should we assert or do something else to make sure that can't happen in the face of future refactoring?
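For example (a sketch, using the field names from the comment above):

    /*
     * Sketch: size for whichever count is larger, so a future change to
     * the SQ/CQ ratio cannot silently violate an implicit "submission
     * queues >= completion queues" assumption.
     */
    uint_t nmax = MAX(nvme->n_submission_queues, nvme->n_completion_queues);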
Thanks for the feedback.
I'm reviewing the comments; I will open a few more bugs, address the comments, and re-post in a few days.
usr/src/uts/common/io/nvme/nvme.c (Diff revision 2) Extra word in comment: "When the there are".
Have gone through all the changes; they look good to me.