8040 NFSv4 client: 3-way deadlock between nfs4_bio(), nfs4_do_delegreturn(), and nfs4_flush_pages()

Review Request #591 — Created June 19, 2017 and submitted

marcel
illumos-gate
master
8040
afe8399...
general
This fixes a 3-way deadlock in NFSv4 client.
I tested the fix using steps at https://www.illumos.org/issues/8040#note-2 and
I confirmed the deadlock is no longer reproducible.

This fix is in production for about 3 years now without any regression found.
  • 0
  • 0
  • 0
  • 1
  • 1
Description From Last Updated
marcel
sensille
  1. Ship It!
  2. 
      
vgusev
  1. 
      
  2. usr/src/uts/common/fs/nfs/nfs4_client.c (Diff revision 2)
     
     

    I see that r_serial is set only in this function and at the end always is reset to NULL.

    Is it possible that @was_serial is always FALSE here or I missed something ? In other words, if @was_serial is TRUE, it should mean that nfs4_attr_cache() is nested call, but I don't see where it can happen.

    1. Maybe there is a nested call possible via the nfs4_attr_cache() -> nfs4_purge_caches() -> nfs4_flush_pages() -> VOP_PUTPAGE() sequence. In any case, if was_serial is always FALSE it is harmless and not related to my fix.

    2. But you would agree that your new code use this value/variable and we need to know all cases for calling of nfs4_attr_cache() before modify its behaviour.

      Nevertheless, nfs4_putpage/nfs4_putpages don't call r4_do_attrcache() or something similar that can call nfs4_attr_cache().

    3. But you would agree that your new code use this value/variable and we need to know all cases for calling of nfs4_attr_cache() before modify its behaviour.

      I'm not sure I understand your arguing. No, my code is not using was_serial.

      Nevertheless, nfs4_putpage/nfs4_putpages don't call r4_do_attrcache() or something similar that can call nfs4_attr_cache().

      If that's the case then was_serial is probably always FALSE. But I do not understand how is that related to my fix and how it affects my fix. Please elaborate.


      Also, I do not understand what is the issue you are suggesting to fix. Please be explicit what is the problem you see here. What is the scenario that could cause a problem? How do you suggest to fix the problem you see? Thanks.

    4. | I'm not sure I understand your arguing. No, my code is not using was_serial.

      | If that's the case then was_serial is probably always FALSE. But I do not understand how is that related to my fix and how it affects my fix. Please elaborate.

      There are two places:
      1.
      - if (rp->r_serial && !was_serial) {
      - klwp_t *lwp = ttolwp(curthread);
      -
      + if (rp->r_serial != NULL && !was_serial) {

      1. Code Under this condition "if (rp->r_serial != NULL && !was_serial)". Line 477,478.

      All above mean that was_serial has meaning and use that value. Therefore if was_serial has no meaning, does it mean that patch that correcting original code and adding a new code should be improved ? I suppose yes.

      | Also, I do not understand what is the issue you are suggesting to fix. Please be explicit what is the problem you see here. What is the scenario that could cause a problem? How do you suggest to fix the problem you see? Thanks.

      My intention is: new code and fixes should not introduce a new uncertainty or leave old uncertainties. Maybe there is not any issue, but to answer for sure, we need to clear understand what each variable means and what each line does. Without it the new code will be as old one. But we want to make the code better. Don't we?

      About scenario. If was_serial is always FALSE , rp->r_serial != NULL means that simultaneously another "nfs4_attr_cache" is being executed. For this case, when you remove check on "recov", does it mean that PURGE_ATTRCACHE4_LOCKED(rp) will almost always clean cached attributes and this may increase number of GETATTR requests ?

    5. All above mean that was_serial has meaning and use that value.

      Sure.

      Therefore if was_serial has no meaning, does it mean that patch that correcting original code and adding a new code should be improved ? I suppose yes.

      I never said was_serial has no meaning.

      My intention is: new code and fixes should not introduce a new uncertainty or leave old uncertainties.

      Sorry, I disagree. Trying to achieve that I would need to fix all possible uncertanities you could find anywhere in the code.

      Maybe there is not any issue, but to answer for sure, we need to clear understand what each variable means and what each line does.

      You are free to do so and propose an improvement if you find something needs to be changed.

      Without it the new code will be as old one.

      No.

      But we want to make the code better. Don't we?

      Yes. And because of that I propose this fix.

      About scenario. If was_serial is always FALSE , rp->r_serial != NULL means that simultaneously another "nfs4_attr_cache" is being executed. For this case, when you remove check on "recov", does it mean that PURGE_ATTRCACHE4_LOCKED(rp) will almost always clean cached attributes and this may increase number of GETATTR requests ?

      Yes, but only for this dangerous case (i.e. when there is a risk of possible deadlock). It is far better to issue one more GETATTR than to risk a deadlock.

      BTW, this fix is in production on all our NFSv4 clients for 2+ years now without any issue.

  3. 
      
marcel
vgusev
  1. Ship It!
  2. 
      
marcel
Review request changed

Status: Closed (submitted)

Change Summary:

commit 6dc7d05754d992040097e8ba8f85e77512125c60
Author:     Marcel Telka <marcel@telka.sk>
AuthorDate: Tue Dec 15 22:08:36 2020 +0100
Commit:     Gordon Ross <gordon.ross@tintri.com>
CommitDate: Fri Jan 8 09:50:43 2021 -0500

    8040 NFSv4 client: 3-way deadlock between nfs4_bio(), nfs4_do_delegreturn(), and nfs4_flush_pages()
    Reviewed by: Arne Jansen <arne@die-jansens.de>
    Reviewed by: Vitaliy Gusev <gusev.vitaliy@icloud.com>
    Approved by: Gordon Ross <gordon.w.ross@gmail.com>

:100644 100644 5456fc7c63 856da430ea M	usr/src/uts/common/fs/nfs/nfs4_client.c
:100644 100644 9f4099c3f8 db93bf2e72 M	usr/src/uts/common/fs/nfs/nfs4_vnops.c
Loading...