5753 libnsl: set_up_connection() over TCP does not adhere the specified timeout

Review Request #37 — Created April 17, 2015 and submitted


webrev: http://cr.illumos.org/~webrev/marcel/il-5753-set_up_connection/

The set_up_connection() over TCP does not adhere the specified timeout.

Without the fix:

# time ./rpc-timeout 5
clnt_create_service_timed failed

real    6m29.996s
user    0m0.002s
sys     0m0.005s

With the fix:

# time ./rpc-timeout 5
clnt_create_service_timed failed

real    0m7.934s
user    0m0.001s
sys     0m0.004s
  2. usr/src/lib/libnsl/rpc/clnt_vc.c (Diff revision 1)
    typo: nanoseonds
  1. I have a question about the retry loop in clnt_vc.c
    for (nconnect = 0; nconnect < 3; nconnect++) ...

    It's not clear why this should ever retry a connect attempt,
    given that the transport (TCP) has its own retry mechanism
    where it re-sends the TCP SYN at some interval.
    Doesn't this therefore constitute nested retry loops?
    (those are generally undesirable)

    I think I prefer the version of this fix I saw earlier in
    NexentaStor, where you just eliminate the outer retry loop.
    Why not do that here too?

    1. The t_connect() could fail for various reasons, not only because of a timeout, for example TLOOK.  The Sun's 6564477 was touching this area and I believe there was some good reason for that.
    2. I see. But to solve that problem you don't really need all the fancy deadline-based timeout computation.
      If we get a connect timeout, we're going to get one every time, so you can probably just give up in pass one if the error code says the connect timed out.
      That would let you simplify or eliminate much of the code that's trying to keep track of "how much more time is left to wait?"

    3. Or perhaps, only loop for things like TLOOK, but I suppose that would require more information about what what the error scenario there.

    4. sorry: what what / what was...

    5. Actually, on the timeout, t_connect() fails with t_errno == 9 (TLOOK), and t_look() returns 16 (T_DISCONNECT). Is there some other way (other than the timeout) how the t_connect() could fail with TLOOK, and t_look() would return T_DISCONNECT? I'm not sure, and the t_connect(3nsl) and t_look(3nsl) man pages does not give satisfying answer to resolve this question. So we cannot exclude such possibility. Because of that, there seems to be no reliable way how to be sure that t_connect() returned because of the timeout (other than to measure the time spent in the t_connect()).

      Here is an example of a scenario:

      The timeout is 10 seconds, so t_connect() is called with the 10 seconds timeout. After 3 seconds, the t_connect() will return (fail) with TLOOK, so I'd like to handle this error condition and continue to wait in t_connect() for 7 seconds now.

      I believe my proposed fix is going to handle the above scenario properly, but I'm not sure how it will work with your proposal.

      Thank you.

Review request changed

Status: Closed (submitted)

Change Summary:

commit 7e89328164e4b89906924cf4e0387ea13a77631b
Author:     Marcel Telka <marcel.telka@nexenta.com>
AuthorDate: Mon Apr 20 15:41:13 2015 +0200
Commit:     Richard Lowe <richlowe@richlowe.net>
CommitDate: Tue May 5 14:41:51 2015 -0400

    5753 libnsl: set_up_connection() over TCP does not adhere the specified timeout
    Reviewed by: Dan Fields <dan.fields@nexenta.com>
    Approved by: Richard Lowe <richlowe@richlowe.net>

:100644 100644 4f92a33... 4ab5010... M  usr/src/lib/libnsl/rpc/clnt_vc.c