Topo enumeration of NVMe devices

Review Request #2483 — Created Jan. 9, 2020 and submitted

rejohnst
illumos-gate
11958, 11959
general

These changes extend the disk and pcibus topo plugins to support enumerating NVMe devices in the topo tree. It also provides a platform-specific topo map for the Supermicro SYS-2028U-E1CNRT+ server platform.

Testing was performed on a Supermicro SYS-2028U-E1CNRT+ with a variety of u.2. NVMe devices and an Intel PCIe Add-in card NVMe device. Testing was also performed on a desktop system with an M.2 NVMe device. I verified that the NVMe devices were properly enumerated. The code changes were fully exercised with UMEM_DEBUG enabled and checks were run to verify no memory leaks or other bad memory juju are being introduced by these changes. The code is pbchk-clean and builds cleanly.

  • 0
  • 0
  • 23
  • 6
  • 29
Description From Last Updated
rm
  1. Thanks for putting to this together Rob.

  2. Why did this have to have? Is it because the dependents need to rely on the pgroups now?
    1. Yes. The disk enumerator was being executed before the "binding" property group was added to the parent "bay" node. As a result the disk enumerator didn't find the properties it was expecting because they hadn't been created, yet. I've added a comment to clarify the dependency on the ordering here.

  3. This should be topo_usb_metadata.c, not topo_usb_file.c. I think the existing ones need to be updated (I'll look at that). I think topo_usb_file.c was from an older refactoring.
  4. It looks like this is missing the EHCI ports. This platform is based on Broadwell, which means that it should support operating with xhci disabled and falling back on ehci.
    1. Oops - that was an old version of that file. It's updated now - but still doesn't include the EHCI ports. I see 6 EHCI ports in ACPI, but the PLD doesn't match the PLD's for any of the HS or SSP ports so I don't know how to match them up.

    2. The way that I tested this is I forced the BIOS into disabling xhci so it only enumerated EHCI. Then I just manually tested each one and noted the port and then manually matched it up. The EHCI PLD usually doesn't match the XHCI PLD. Unfortunately there's no guarantee that the same PLD is present across both which seems rather weird, but well, that's ACPI for you.
    3. Thanks for the tip! By booting the system with xhci disabled in BIOS, I was able to work out which EHCI ports to put into the map.

  5. Is this actually determined by SPARC and x86? It looks like cmlb actually uses the _FIRMWARE_NEEDS_FDISK and the other macros. Though maybe this is simpler than using the macros in sys/isa_defs.h that seem to control this.
    1. Yeah, I'm not really sure but this was pre-existing logic - I just moved it.

  6. usr/src/lib/fm/topo/modules/common/disk/disk.c (Diff revision 1)
     
     
     
     
    Maybe this would be more useful for the disk_enum dprintf to say what it didn't know how to enumerate?
  7. Minor thing, but based on prototypes.c we try to put new lines on either side of the copyright statement and the other block comment.
  8. Elsewhere you use 'AIC' instead of AiC. Maybe pick one for the whole file? Probably the former?
    1. I've updated the code to use "AIC" everywhere.

  9. typedef?
  10. usr/src/lib/fm/topo/modules/common/disk/disk_nvme.c (Diff revision 1)
     
     
     
     
     
    Are target names guaranteed to never have an 's' in them?
    1. Yes, for nvme devices.

  11. Maybe we can make devpath const in this function? It looks like it's read-only.
  12. It is documented that this can fail. Shouldn't we verify that it actually returns zero?
    1. I guess I'm not sure that we'd do anything different here if it did fail, other than log a debug message.

    2. It looks like the only thing that it fails on (today) is related to bad arguments. So the other option would be to abort due to programmer error, but I'm not sure if that's worthwhile.

    3. Cool - I'm going to leave it as-is then.

  13. A couple of high level questions about this function.  Right now we call this once for every namespace on the system. But it seems like we don't ever account for the fact that a namespace may not have a blkdev handle attached for a few reasons:
    
    1) We don't support the namespace
    2) We don't handle the case where someone uses nvmeadm to detach the namespace.
    
    In particular, if one fails it seems like the whole thing will fail. Is there a reason that this doesn't use the nvme_identify_nsid_t information to try and get this instead of walking devinfo? It seems like getting the capacity for the namespace that way would be better than going via DKIOCGMEDIAINFO or others.
    1. Right - but that was part of the reasoning for iterating over the devinfo children of the nvme device - i.e. the assumption being that if the child node exists, then the namespace has attached and hopefully is valid. But I did change the code to not break out of the loop on a make_disk_node() failure.

    2. I guess if the goal is to only ever enumerate namespaces for the purposes of what block devices should be underneath them, then that makes some sense. Thanks for cleaning up the single failure issue.
  14. Rather than encode these here can we mkae them macros that sys/nvme.h defines? It seems like it'd be useful in general.
  15. I assume we don't care if this fails because the auth is optional in the hc fmri?
  16. Perhaps this comment should indicate where it happens for the others? Why do we only need an NVME node for the AIC case and not for the U.2, M.2, etc. cases?

    1. We actually always create an nvme node. We create an nvme node for each nvme controller and child disk node(s) for each namespace. The issue with this code is for the pcibus case, the nvme range won't exist yet, so we need to create it before creating the nvme node. For cases where the enumeration is being driven by a topo map file, the range will have already been created in the XML. I've updated the comment to clarify this.

  17. I notice that a lot of other enumeration modules don't call this. Why do we need to do so here?
    1. We don't need it here. I used it in the smbios topo module to handle a corner case with the motherboard node and I copied/pasted parts of that code in here and inadvertently included the call to topo_group_hcset(). I removed it.

  18. usr/src/lib/fm/topo/modules/common/disk/disk_nvme.c (Diff revision 1)
     
     
     
     
     
     
     
     
    Should we fail the whole enumeration if the parent doesn't have a label? What if we're on a PCIe bus without label information (for example emulated NVMe in an HVM)? It seems like we should just ignore ETOPO_PROP_NOENT.
    1. Yeah, that makes more sense. Fixed.

  19. Should we be putting topo_mod_dprintf's in the property cases here to be consistent with everything else?
  20. I believe this cast is unnecessary.
    1. You're right - removed.

  21. While the existing code ensures that 'path' is always initialized, maybe it's worth initializing it to NULL to make sure that this is safe in the face of a refactor?
    1. Couldn't hurt - done.

  22. With the comment, this becomes a multi-line if statement (even though it's not from a C perspective). Mind putting parens around that?
  23. If this fails, don't we need to call topo_mod_seterrno()?
  24. Here and all the other devfs places, shouldn't we call topo_mod_seterrno() with EMOD_UKNOWN_ENUM or similar? This is true for all the places you use 'goto out'.

  25. Since pgi is only used in the if statement at +800, perhaps we should just scope it there rather than worry about it in the whole function?
  26. This whole bit is a bit gnarly. Maybe it's worth a function dedicated to this?
    1. I've reworked this code a bit so it's not quite as bad - but let me know if you think it should still be factored out into a separate function.

    2. It seems fine for now. We can always refactor it another time.
  27. So it's not clear to me why we'd expect them to have an MB label. Can you clarify that a little bit, at least here? Also, if someone had some NVMe soldered onto the board, would we not want to catch that?
    
    Also, if we require M.2 to have a topo map, what do we do for something like an Intel NUC? Does that mean we won't enumerate the NVMe devices on it?
    1. If we relaxed this and allowed the pcibus enumerator to enumerate all nvme controllers (not just ones on AICs) then it wasn't clear to me how to prevent the pcibus enumerator from enumerating U.2 devices on platforms where we do have a topo map - thus we'd end up with the same device potentially getting enumerated twice in two different places in the tree - which I think is undesirable. I don't see anything in the NVMe spec that allows me to determine the form-factor of a given controller so essentially U.2 devices seem indistinguishable from M.2 devices. Maybe there's a way I'm not seeing.

    2. OK, thanks for the additional explanation. I guess this makes sense. Probably this gets back to the general desire for a logical hw map as we talked about back in the project tiresias rfd.
  28. Would di_driver_name() be simpler here? It also gets you out of the string allocs.
    
    Also, is there a reason we don't use the NVMe class code and sub-class code here like we do for other stuff?
    1. No good reason :). I've reworked this code to check the class/subclass.

  29. Are we always guaranteed that we're going to have a label for a PCI device? It's not clear to me that virtualized NVMe will hae one per se.
    1. Yes, they should always have a label because the way that the pcibus labeling code works, they will either get a slot label or will end up inherting the label from the motherboard topo node ("MB")

      I have verified in both Virtualbox and VMware Fusion that the pciexfn nodes for virtualized NVMe devices end up with the default motherboard "MB" label.

  30. I think it'd be clearer if you just keep track of the size like you do in the other case. I don't think it's obvious to most folks that setting it back to a valid character is loadbearing for the strlen in topo_mod_strfree(), especially since you're restoring it to a different character.
  31. 
      
rejohnst
rm
  1. Thanks for the clean up here. I noticed one small memory leak introduced in the new bits.

  2. We need a corresponding di_devfs_path_free().
  3. 
      
rejohnst
rm
  1. Thanks for putting all this together, it looks good!

  2. 
      
rejohnst
Review request changed

Status: Closed (submitted)

Loading...