NVMe hotplug support and related bug fixes

Review Request #2308 - Created Sept. 17, 2019 and updated

Information
Rob Johnston
illumos-gate
Reviewers
general

11698 Want NVMe Hotplug Support
11699 x86 pci configurator should not fail device teardown if device is gone
11700 DDI hotplug request handler resets connection handle state before performing state change operations
11701 ldi_handle dcmd segfaults occasionally

Please see the detailed testing notes in https://www.illumos.org/issues/11698

Rob Johnston
Review request changed

Status: Re-opened

Paul Winder

   
usr/src/uts/common/io/nvme/nvme.c (Diff revision 1)
 
 

I think there is a race condition between this piece of code and nvme_submit_io_cmd()/nvme_submit_cmd_common().
1. A drive is hot-unplugged.
2. An I/O is being submitted, and the code is after the n_dead check in nvme_submit_io_cmd().
3. The remove_callback is run. It sets n_dead and acquires the nq_mutex before nvme_submit_cmd_common() does.
4. nvme_submit_cmd_common() blocks whilst nvme_remove_callback() empties the cmd queues.
5. nvme_submit_cmd_common() queues the command.

After 5. the I/O will hang.
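
To make the ordering concrete, here is a rough, single-threaded replay of steps 1 to 5. It is only a model, not driver code: apart from n_dead, every name in it (race_model_t, race_check_alive(), race_remove(), race_enqueue(), race_replay()) is hypothetical, and the interleaving is serialised by hand rather than produced by real threads.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of one controller and one queue. */
typedef struct race_model {
	bool n_dead;	/* mirrors the driver's n_dead flag */
	int queued;	/* commands currently sitting in the queue */
	int completed;	/* commands that have been completed/aborted */
} race_model_t;

/* Step 2: the n_dead check made early in the submit path. */
static bool
race_check_alive(const race_model_t *m)
{
	return (!m->n_dead);
}

/* Steps 3-4: the remove callback marks the device dead and drains the queue. */
static void
race_remove(race_model_t *m)
{
	m->n_dead = true;
	m->completed += m->queued;
	m->queued = 0;
}

/* Step 5: the common submit path queues without looking at n_dead again. */
static void
race_enqueue(race_model_t *m)
{
	m->queued++;
}

/* Replays the steps in order; returns how many commands are left stranded. */
static int
race_replay(void)
{
	race_model_t m = { false, 0, 0 };

	if (!race_check_alive(&m))	/* step 2: passes, device still alive */
		return (0);
	race_remove(&m);		/* steps 3-4: hot-unplug handled */
	race_enqueue(&m);		/* step 5: queued after the drain */
	assert(m.n_dead);
	return (m.queued);
}
```

Calling race_replay() leaves one command in the queue after the drain has already run, which corresponds to the hung I/O: nothing remains that would complete it.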

  1. Yes, I see what you mean. I need to look at this some more. Offhand, though, I wonder if we could address this potential race by having nvme_submit_cmd_common() check the value of n_dead after acquiring nq_mutex and, if it's true, simply free the cmd and return. We'd have to make some other changes to get a pointer to the nvme_t passed into nvme_submit_cmd_common().

  2. That seems fine. You also need to dispatch the cmd's callback.

  3. And it turns out that the nvme_cmd_t has a pointer to the nvme_t, so that makes this a much less intrusive change. I should have an updated change later today, but I probably won't post it to RB until sometime Monday, as that's the next time I'll be in the office and hence be able to test the change on actual hardware.
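
For readers following the thread, the approach discussed above might look roughly like the user-space sketch below. It is not the actual change: the sk_* types and functions are hypothetical stand-ins, the field names other than n_dead and nq_mutex (nc_nvme, nc_callback, nc_failed, nc_done, nq_cmd, nq_active) are assumptions, and a pthread mutex plus a C11 atomic stand in for the kernel primitives. The point is only the ordering: n_dead is re-checked after nq_mutex is taken, and a rejected command still has its callback dispatched.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define	SK_QLEN	8	/* arbitrary queue depth for the sketch */

typedef struct sk_nvme {
	atomic_bool n_dead;	/* set by the remove callback */
} sk_nvme_t;

typedef struct sk_cmd {
	sk_nvme_t *nc_nvme;			/* back pointer to the controller */
	void (*nc_callback)(struct sk_cmd *);	/* completion callback */
	bool nc_failed;				/* hypothetical: cmd was aborted */
	bool nc_done;				/* hypothetical: callback has run */
} sk_cmd_t;

typedef struct sk_qpair {
	pthread_mutex_t nq_mutex;
	sk_cmd_t *nq_cmd[SK_QLEN];
	int nq_active;
} sk_qpair_t;

static void
sk_init(sk_nvme_t *nvme, sk_qpair_t *qp)
{
	atomic_init(&nvme->n_dead, false);
	pthread_mutex_init(&qp->nq_mutex, NULL);
	qp->nq_active = 0;
}

/* Example completion callback: just records that it ran. */
static void
sk_cmd_done(sk_cmd_t *cmd)
{
	cmd->nc_done = true;
}

/* Mark a command as failed and dispatch its callback, if any. */
static void
sk_fail_cmd(sk_cmd_t *cmd)
{
	cmd->nc_failed = true;
	if (cmd->nc_callback != NULL)
		cmd->nc_callback(cmd);
}

/* Returns 0 if queued, -1 if rejected (device dead or sketch queue full). */
static int
sk_submit_cmd_common(sk_qpair_t *qp, sk_cmd_t *cmd)
{
	bool reject;

	assert(cmd != NULL && cmd->nc_nvme != NULL);

	pthread_mutex_lock(&qp->nq_mutex);
	/* Re-check n_dead now that the remove path can no longer interleave. */
	/* The full-queue test only keeps the sketch memory-safe. */
	reject = atomic_load(&cmd->nc_nvme->n_dead) ||
	    qp->nq_active == SK_QLEN;
	if (!reject)
		qp->nq_cmd[qp->nq_active++] = cmd;
	pthread_mutex_unlock(&qp->nq_mutex);

	if (reject) {
		sk_fail_cmd(cmd);	/* the callback still runs */
		return (-1);
	}
	return (0);
}

/* Sets n_dead first, then drains the queue under nq_mutex. */
static void
sk_remove_callback(sk_nvme_t *nvme, sk_qpair_t *qp)
{
	atomic_store(&nvme->n_dead, true);
	pthread_mutex_lock(&qp->nq_mutex);
	for (int i = 0; i < qp->nq_active; i++)
		sk_fail_cmd(qp->nq_cmd[i]);
	qp->nq_active = 0;
	pthread_mutex_unlock(&qp->nq_mutex);
}
```

With this ordering a submitter that loses the race for nq_mutex sees n_dead set and fails the command itself, while one that wins the race is queued before the drain and is failed by the remove path. Freeing the command is left to the caller in this sketch.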
