9873 SMB logon fails during 1st second after service start

Review Request #1245 — Created Oct. 16, 2018 and submitted

gwr
illumos-gate
9873
general

Correct some problems with how authentication requests wait for domain controller (DC) discovery to complete.

Return meaningful status codes from session setup
Let smb_domain_getinfo() wait for DC info
No longer need smb_ddiscover_wait
Retry named pipe opens
Be more selective about when to mark the DC as failed
Rework netlogon loops
fix lint

This came up in some fail-over testing we were doing on later versions of the server.
It turns out that in a fail-over scenario, SMB clients may be "knocking on the door"
(trying to authenticate) immediately as the SMB server is coming up.
Those tests no longer have unexpected logon failures.

  • 0
  • 0
  • 1
  • 3
  • 4
Description From Last Updated
gwr
gdamore
  1. 
      
  2. No actual changes here besides copyright?

    1. Huh. Must have been left over from my moving changes around.

  3. Where does "5" come from? Could this start as a smaller value and increase gradually? Should it be perturbed slightly with randomness (though sleep maybe is coarse enough not to need that...)

    1. The value is somewhat arbitrary, but long enough so we don't "spam" the network with a constant barrage of DC location traffic when DC location is failing. This should only happen if all your DNS and AD servers are down, so I don't think there's any need to get fancy with exponential backoff or something like that.

    2. I think this should at least be a constant defined elsewhere rather than a magic number.

      I'm still thinking that maybe 5 is too long to start, but it's probably "good enough" at any rate. Please do consider sticking it as a tunable somewhere though.

    3. fair enough

  4. Is this description the only way this error can occur? Or perhaps the description is incorrect?

    1. Well, the code is just moved down, as you can see.
      The error message is reasonably accurate.

      smb_domain_start_update / smb_dcache_updating
      is essentially a form of readers/writer lock,
      which this is taking as writer in preparation
      for updating the domain cache state.

  5. Just a minor thing. But I am of the opinion that we should be use C99 and stdbool.h "bool" types now. The C99 boolean values have some very minor semantic changes that make them superior to the old hack of using ints and #defines....

    1. OK, but let's consider that later (for new code) shall we?
      I'd rather not frob all this code for that right now.

    2. I'm fine with that... as I said, its a minor thing. But something to keep in mind going forward.

  6. Good catch here.
  7. 
      
gwr
gwr
gwr
gwr
gwr
gwr
gdamore
  1. Ship It!
  2. 
      
gdamore
  1. Ship It!
  2. 
      
danmcd
  1. 
      
  2. Nice!

  3. 
      
gwr
Review request changed

Status: Closed (submitted)

Loading...