11329 improved Virtio framework

Review Request #2068 — Created July 8, 2019 and submitted

7366, 10012, 11329

11329 improved Virtio framework
10012 vioblk should not accept an all-zero serial number
7366 vioif happily creates rx descriptors until it consumes all memory

NOTE: I recommend starting with the block comment in virtio.h.

  • 0
  • 0
  • 39
  • 24
  • 63
Description From Last Updated
  2. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1)

    It seems wrong to me to use an enum for things that are actually flag bits. Does this cause any problems with compiler checks? If two bits are set, the value of a vioblk_req_status_t does not match any of the enum values.

    1. The reason I'm using an enum for this is so that MDB is able to render the field in terms of the constant symbols (using the CTF data) rather than just as a number. As far as I can tell, the compiler doesn't (at least currently) have a problem with this. I'm not using any switch statements with this value, which is where I'd expect to run into trouble.

    2. Interesting, thanks for the explanation!

  3. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)

    VERIFY0? I'm not sure what cstyle says about using ! for non-boolean operations.

    1. In this case, I'm using vbr_status as a set of flags so though the result isn't strictly 1 or 0, it is (in the C sense) more boolean than int/bitfield. I thought it was clearer to write this in the boolean style, as that's the intent of the comparison here; i.e., "Verify that the request is not COMPLETE".

  4. This is a DDI_DEVICE_ATTR_V1 structure
    (V0 does not have .devacc_attr_access, or at least that field is ignored with V0). It seems that our man page needs a fix.

    1. Ah, thanks! I'll file a follow-up ticket to fix the manual page.

    2. https://www.illumos.org/issues/11497

  1. Thanks for putting this together. The new virtio framework looks like this has greatly simplified things and will make it easier to add newer virtio devices, which is pretty great.
  2. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1)
    Shouldn't this be BLK and not net?
  3. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1)
    Should we file a future RFE to cover enabling / toggling the write cache?
    1. I think we can probably put together a roadmap for future improvements after this integrates.

  4. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1)
    Any particular reason the data array isn't here like in the spec? I guess it's not necessary.
    1. I've updated the comment to explicitly call out the fact that the data and status byte parts of the request are in separate descriptors from this structure.

    2. Thanks. After going through that in more detail it made a lot more sense. I think the block statement you have makes that much clearer.
  5. usr/src/uts/common/io/vioblk/vioblk.h (Diff revision 1)
    QEMU 0.14.x further says it was supposed to be modeled off of the ATA IDENTIFY command and was considered deprecated during its era.
  6. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    It'd be worth a comment about how you're always using the dma descriptor that's part of the chain to cover the status and header information. Also talking about how that DMA memory is allocated, etc. I think that and a little bit about the chain management and other things might be helpful.
    1. I have drawn a picture!

  7. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    If we're making copies can it be const?
    1. Yes! I've fixed the virtio_dma_*() API to make this const as well.

  8. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    VIOBLK_REQSTAT_ALLOCATED seems a bit weird to me right now. It's always set and never appears to be cleared. Did you mean to clear it before putting it back on the list in vioblk_req_free()?
    1. We do actually clear it in vioblk_req_free() by setting vbr_status to 0. I'll add a comment.

  9. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    I find this comment a little confusing. I get why for the asynchoronous approach it has to be freed there, but it leaves the reader a question of what happens in the polled case which is a bit of an adventure to determine.
    1. I've tried to rearrange this to make it a bit clearer.

    2. Thanks, I think that is.
  10. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    Perhaps this should assert that the vib_mutex is held like in the vioblk_rquest path?
  11. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    Is this just a sanity check that we're doing? If so do we want to think about overflow?
    1. It is, but I'll add an overflow check!

  12. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    ENOTTY and ENXIO seem really weird here, but I guess that's what was already done.
    1. Right. I think I'll leave it for now -- we can still improve it in future.

  13. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    Shouldn't this be BLK and not NET? Though it is correctly queue zero.
  14. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    Is seg_max explicitly just for data and it's safe for us to add these?
    1. Unfortunately the specification just says "Maximum number of segments in a request is in seg_max" and that's all. If that isn't true, a degenerate segment count of 0 or 1 or even 2 would probably be invalid because we need to specify the request header and the status byte in two different descriptors (they have opposite data direction). If this turns out to be an issue with an actual implementation in the wild I think we'll have to fix it later based on those reports.

    2. OK, I guess that approach makes sense. Certainly we don't have enough information to say either way.
  15. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    Why is this kstat persistent?
    1. If you look in kstat_create(9F) you'll see:

      KSTAT_FLAG_PERSISTENT ... This feature is provided so that statistics are not lost across driver close/open (such as raw disk I/O on a disk with no mounted partitions.) ...

      It seems like that applies to this, a disk device?

    2. OK, fair enough. We don't do that for NICs, which is interesting.
  16. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    I see that while we get the RO flag here and we pass it to blkdev in the media info, that doesn't stop us from issuing write commands. Does that make sense?
    1. In theory, I blkdev might then stop you opening the device for writes. In practice, at least, it seems not unreasonable to allow the device to fail writes (or not!) here.

  17. usr/src/uts/common/io/vioblk/vioblk.c (Diff revision 1)
    Should we check for PCI_EINVAL32?
  18. usr/src/uts/common/io/vioif/vioif.h (Diff revision 1)
    So... what you actually want is that the IP header is 4 byte aligned. You don't want the MAC header to be 4 byte aligned, instead you want it on a 2-byte phase.
    1. OK, I've drawn a picture also!

  19. usr/src/uts/common/io/vioif/vioif.h (Diff revision 1)
    FWIW, I try to avoid bitfields in structs and instead just use flags for things like this. Doesn't matter either way.
  20. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    If they're going to be NULL, I don't think you ned to specify them in as explicitly initialized.
  21. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    Why don't we care if this fails to add a buffer? Might be worth a comment. I have a few guesses, but I'm not sure which, if any, is right.
    1. I've added comments to the vioif_add_rx() call sites that don't care about the return value.

  22. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)

    FWIW, I find 'mems' an odd term.

    1. Yes! I have changed it to *bufs().

  23. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    Normally these are non-sleeping allocations for NIC drivers as they are being done in the mc_start() entry point. I guess in this case you're doing it from attach, so it should be fine, but it's just a little different from other NIC drivers.
    1. It seemed easier to just do it up front, with sleeping, because we have a fixed (and not huge) target size in mind.

  24. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    See earlier notes, but this is the place you want to make sure that the ethernet header is actually being started with at a 2-byte phase offset so that IP is 4 byte aligned. As in the addr % 4 == 2.
    1. I decided to check this at allocation time in vioif_alloc_bufs() instead.

  25. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    You comment elsewhere that the length value isn't always correct or safe to use. Perhaps it's worth noting why it's ok for us to do so here?
  26. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    I was a little surprised to see that we don't allow the host to checksum anything for us. Maybe we should file an RFE to cover doing so in the future?
    1. I think we can probably put together a roadmap for future improvements after this integrates.

  27. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    So there's no reason that you need to do this outside of the lock, that I know of. I'm not sure either way if we need to consider the implications and possibilities for a race in updates and saying it's corked or not.
  28. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    To protect against future refactoring, should we make sure that this conditional also includes !vif_tx_drain?
    1. I think the VERIFY() above it is probably sufficient?

  29. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    Is there some point at which we should time this out and move to reset the device?
    1. This could probably be improved, but I think (for now, at least) we're assuming in a few places that the device isn't going to misbehave in a way where that would be required. We can revisit and improve the handling if it comes up.

  30. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    This goto fail ends up returning B_TRUE to suggest that we've consumed the mblk_t to vioif_send. However, the mblk_t hasn't actually been consumed. The actual failure paths of the vioif_tx_inline and vioif_tx_external will always consume the mblk_t. I think we need to either do a freemsg() here before the goto fail or we need to do something else.
  31. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    Do we want to do this every tx or only when we've crossed some threshold?
    1. I talked to Patrick about this at some point, and I believe he had expressed a desire to see TX resources freed as promptly as possible due to his experience with viona backing up the networking stack under some conditions, if that makes sense?

  32. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    Any particular reason that you dropped support for VIRTIO_NET_F_STATUS?
    1. Yes. I believe we were not doing anything with it. That is: I think we were not correctly updating the link state later, and that seemed to be about the only useful thing _F_STATUS does, at least with the driver structured the way it otherwise is today. I opted to make it simpler and to have the link state always be "up".

  33. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    Is there any particular reason we don't care about the number of rx buffers that we added here?
  34. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    This should be an unsigned long to match the actual types of the txcopy/rxcopy thresholds. While I know you constrain it in the setprop to things that'll always fit in a uint_t, I don't think it makes sense to encode that assumption in other parts. If it does, then the types should be changed.
  35. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    Since you're not constructing the buffer to know that it fits, you need to check the return value of snprintf and return EOVERFLOW if it doesn't fit.
    Also, vif_tx_copy_thresh and vif_rxcopy_thresh are unsigned long values, the percent format needs a %l (once the truncation to value is fixed).
  36. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    ether_sprintf() seems highly dubious and not safe! It uses a single shared global kernel buffer. This seems like something we shouldn't use!
    1. Why is this even in the kernel?!

  37. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    FYI, this doesn't have to be kept around after mac_register succeeds.
  38. usr/src/uts/common/io/vioif/vioif.c (Diff revision 1)
    There's no need to keep the registration around. It can be freed immediately after you call mac_register(), fwiw.
    1. OK, I've done this! I see it in the example, but it was not clear (to me, at least) from the prose in the manual.

  39. usr/src/uts/common/io/virtio/virtio.h (Diff revision 1)
    Can you add a comment about endianness here somewhere? Mostly that everything is in host endianness for legacy devices which is why folks don't see anything related to endianness and that if this needs to be adapted for 1.x it will need to be done? It may also be useful to indicate that no one really supports 1.x which is why this only is legacy.
    It might also be useful to talk a bit somewhere (maybe virito_main.c) about how you chose to place membars and when they are needed and not needed.
    Similarly, there isn't much talking about the lock ordering between vio_mutex and the queues. This probably would be useful to include.
  40. usr/src/uts/common/io/virtio/virtio.h (Diff revision 1)
    How does one reset it?
  41. usr/src/uts/common/io/virtio/virtio.h (Diff revision 1)
    This is a bit weird since descriptor chains and submitting them or receiving them haven't been covered at all at this point. It may make sense to move this after the descriptor chain section.
    1. I've added a bit on synchronous requests, and I've moved VIRTQUEUE OPERATION down under DESCRIPTOR CHAINS.

  42. usr/src/uts/common/io/virtio/virtio.h (Diff revision 1)

    Maybe mention an example of what would be considered synchronous or that it doesn't apply?

  43. init_early() threw me off for a bit. It made me think that this should only be used early in boot or something similar. It seems like the point of it is to only allocate the handle so it can be used for binding cases.
    1. I've renamed it to virtio_dma_init_handle(), as that's basically what it does.

  44. usr/src/uts/common/io/virtio/virtio_dma.c (Diff revision 1)
    I'm not sure that it's required, but often after allocating DMA memory I do a bzero to scrub it to make sure that it's in a sane state as I'm not sure the framework makes any guarantees about its contents and I'd rather not leak data if we could avoid it.
    1. I've shifted the bzero() I was doing in other places into a central location in the base framework.

  45. There's nothing to change per se, but it took me a moment to realize that the explanation was below the bit. But I know sometimes that style makes more sense.
    1. I've restyled this to look a bit more like the feature bits stuff in vioif.h.

  46. Is there a reason that these and the other attributes aren't const? It seems that we're passing them directly to a number of different routines, which can result in them changing the actual DMA attributes down the line. We probably want to actually copy them to a per-usage value.
    1. We pass it to ddi_regs_map_setup() and ddi_dma_mem_alloc(), both of which already make a defensive copy. I'd love to make it const, but those functions (for whatever reason) aren't const yet so I'd need to cast and that seems more error prone.

  47. Shouldn't this be DDI_DEVICE_ATTR_V1? Especially if you're specifying .devacc_attr_access.
  48. usr/src/uts/common/io/virtio/virtio_main.c (Diff revision 1)
    Currently all the consumers of this cast it's return value away. This only seems to return success. Is there any reason we bother with a return value then?
  49. I find having a macro for the register number in the implementation header with a comment is useful. YMMV.
  50. Shouldn't we destroy the mutex and list we just created before freeing the memory?
    1. I have, instead, rearranged things!

  51. usr/src/uts/common/io/virtio/virtio_main.c (Diff revision 1)

    It's not clear to me from the spec that we've ever promissed a 64-bit read to necessairily be plausible, fwiw.

    1. I've made the language a bit broader then!

  52. Any reason not to check for the overflow here?
    1. I think it's likely to be long enough, and it's a purely cosmetic string, so I think it's fine to just assume it'll be OK. I can increase the buffer lenght if you'd like, but 256 seems long enough in general?

    2. OK, I'll treat this as differences in style.
  53. Can't this fail?
  54. It's not clear to me what this is referring to here, fwiw. I think it's referring to how we're basically doing a single large DMA allocation and parcelling it up between the driver and device parts.
    1. I've fleshed it out a bit more.

  55. Should we pre-emptively zero the rest of this so that way if we ever use other fields we're not using stack garbage in the search?
  56. usr/src/uts/common/io/virtio/virtio_main.c (Diff revision 1)
    If we need to take a mutex to return this information, how does the caller manage to use it correctly as they won't be holding the queue lock and it can change out from under them?
    1. I'm taking the mutex to protect the AVL, but you're right it doesn't protect the value from changing. If that's required, the driver will need to arrange for serialisation one level up.

    2. OK. I've tried to audit all the callers here, but I'm not sure if I missed something or not. I think the usage is probably OK.

  57. usr/src/uts/common/io/virtio/virtio_main.c (Diff revision 1)
    If this comment is accurate, then shouldn't this be placed after we actually read the index at line 845? Otherwise, it's not clear to me what we're trying to protect in this case.
    1. I've fleshed the comment out a bit, mentioning the names of the members to make it clearer.

  58. I realize the != NULL check here is a bit irrelevant for the vic_indirect_dma cases because the current label can only be reached if other stuff is bad. Still, I wonder if we should then be putting a check in virtio_dma_fini() to allow for a NULL argument in case something like this ever gets refactored.
  59. If this is just returning DDI_SUCCESS and failure maybe we should just use a boolean_t rather than further overload the values? Though I can also see how it's being used directly for users and in that case you'd rather use the DDI values.
    1. I think FAILURE/SUCCESS is slightly clearer than TRUE/FALSE for a verb like append -- and yes, as you note, I want to pass the value back on to the external caller.

  60. usr/src/uts/common/io/virtio/virtio_main.c (Diff revision 1)
    I think this comment will be clearer if you make it clear the writes to the descriptors have been done in other threads possibly and that the gotcha is making sure those have all posted before it reads and sees the following change. I'm still trying to convince myself we don't need another membar after we set the index, but I'm not sure.
    1. It's not about writes in other threads, actually -- it's just to avoid the current CPU from reordering a store to the ring pointer so that it "happens before" the store to the ring memory. For synchronisation with other threads or CPUs we use the mutexes.

  61. usr/src/uts/common/io/virtio/virtio_main.c (Diff revision 1)

    This comment threw me off because the code has the length in bytes, which is right. Maybe clarify in the comment it's the length in bytes? I thought it was just the number of entries based on the comment.

    1. I have reworded this!

  62. How often do we fall back to FIXED interrupts? Is there ever a case where we should just used a shared MSI-X rather than a shared legacy interrupt?
    1. Infrequently, as near as I can tell. In fact, we should probably never need to fall back unless something is seriously wrong. If it turns out later that's not true, we could enhance the code to try multiplexing handlers onto a smaller number of MSI-X vectors.

  63. This appears unused?

  64. usr/src/uts/common/io/virtio/virtio_main.c (Diff revision 1)
    When using INTX interrupts, none of the queues will have viq_handler_added. It doesn't appear we'll remove the interrupt handler otherwise which I think we likely still need to do in that case before we free them below.
  65. Is it worth logging an error if this fails?
  66. usr/src/uts/common/io/virtio/virtio_main.c (Diff revision 1)
    How do we distinguish between MSI zero being successfully programmed and it ignoring the write?
    1. The value for there being no vector programmed is VIRTIO_LEGACY_MSI_NO_VECTOR (0xFFFF), rather than zero.

  1. Thanks for the clean up, this all looks pretty good. I just have a few additional notes below.

  2. usr/src/uts/common/io/vioblk/vioblk.h (Diff revisions 1 - 2)
    The 'not appear' sounds weird. Maybe strike the not?
  3. usr/src/uts/common/io/virtio/virtio_dma.c (Diff revisions 1 - 2)
    I don't believe it's safe to cast this away. The DMA attributes definitely can be modified by the system.
    1. I don't think they can: https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/common/os/sunddi.c#L6956-L6967

      ddi_dma_alloc_handle(9F) should really have a const argument here -- it just doesn't yet.

  4. usr/src/uts/common/io/virtio/virtio_main.c (Diff revisions 1 - 2)
    So vic_indirect_used is a uint_t. Is there a reason this is an id_t? Note an id_t is signed.
    1. The return value from the ID space allocation is signed, unfortunately. I've moved the actual call there into another (inline) function with a check on the value before casting.

  1. Ship It!
  1. Ship It!
Review request changed

Status: Closed (submitted)