Bug #3354

kernel crash in rpcsec_gss after using gsscred

Added by Andreas Philipp about 6 years ago. Updated 4 months ago.

Status: Closed
Start date: 2012-11-13
Priority: High
Due date:
Assignee: Andy Fiddaman
% Done: 100%
Category: kernel
Target version: -
Difficulty: Medium
Tags:

Description

Dear team,

we hit a kernel panic; we assume the following messages from /var/adm/messages are relevant:

Nov 13 15:36:13 wertach savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff000da967c0 addr=10 occurred in module "rpcmod" due to a NULL pointer dereference

Nov 13 15:36:13 wertach savecore: [ID 442558 auth.error] Panic crashdump pending on dump device but dumpadm -n in effect; run savecore(1M) manually to extract. Image UUID a373c73c-a002-64ad-d3ac-c6b5ab0b625c.

The machine is a VMware guest acting as an NFS server with krb5/ldap authentication. crash.0 is attached; the whole dump can be made available. The crash has occurred several times now.

Thanks for looking into it! It is our home-directory file server and we rely on it heavily, since it has been working great so far!

crash.0 (469 KB) Andreas Philipp, 2012-11-13 04:46 PM


Related issues

Duplicated by illumos gate - Bug #7342: kernel crash in rpcsec_gss after using gsscred Closed 2016-08-30
Duplicated by illumos gate - Bug #9563: Panic with kerberized NFS and GSS Closed 2018-05-29

History

#1 Updated by Andreas Philipp about 6 years ago

echo '::status' | mdb 0

debugging crash dump vmcore.0 (64-bit) from wertach
operating system: 5.11 oi_151a7 (i86pc)
image uuid: a373c73c-a002-64ad-d3ac-c6b5ab0b625c
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff000da967c0 addr=10 occurred in module
"rpcmod" due to a NULL pointer dereference
dump content: kernel pages only

#2 Updated by Rich Lowe about 6 years ago

It'd be good to also see the stack trace ($C in mdb), and possibly to make the crash dump available somehow as well.

#3 Updated by Andreas Philipp about 6 years ago

Like this?

echo '$C' | mdb 0
ffffff000da968c0 rmm_wput+0x1a(ffffff02a1b92138, ffffff029a1269e0)
ffffff000da96930 put+0x1aa(ffffff02a1b92138, ffffff029a1269e0)
ffffff000da969a0 svc_cots_ksend+0x1de(ffffff02addd6400, ffffff000da969b0)
ffffff000da96a20 svc_sendreply+0x59(ffffff02addd6400, fffffffff81ac8c8,
ffffff000da96ab0)
ffffff000da96b40 do_gss_accept+0x3fd(ffffff02addd6400, ffffff02f207e0e0,
ffffff02a86f2378, ffffff03802c02f8, 1, 1)
ffffff000da96b70 svcrpcsec_gss_taskq_func+0x38(ffffff0342775680)
ffffff000da96c20 taskq_thread+0x285(ffffff0295e97378)
ffffff000da96c30 thread_start+8()

Is it possible to send a link to vdump.0 via personal message? And if yes, how?

#4 Updated by Andreas Philipp about 6 years ago

Here is a link to vdump.0 (1.4Gb, should do it):

http://megastore.uni-augsburg.de/get/AeIygv9F9L/

#5 Updated by Marcel Telka about 6 years ago

553 rmm_wput(queue_t *q, mblk_t *mp)
554 {
555         (*((struct temp_slot *)q->q_ptr)->ops->xo_wput)(q, mp);
556 }

> ffffff02a1b92138::whatis
ffffff02a1b92138 is ffffff02a1b92040+f8, freed from queue_cache
>

#6 Updated by Marcel Telka about 6 years ago

  • Status changed from New to Feedback

Please try to reproduce the issue with kmem_flags=0xf. Please read http://illumos.org/books/mdb/kmem-1.html for more details.
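To enable these flags persistently, the setting can be added to /etc/system and the machine rebooted — a sketch of the fragment, per the kmem debugging chapter of the linked mdb book:

```shell
# /etc/system fragment: enable full kmem debugging (audit, deadbeef,
# redzone, contents) so a use-after-free is caught at the point of access.
# A reboot is required for the setting to take effect.
set kmem_flags=0xf
```

Note that 0xf enables all the standard debugging flags and adds memory and CPU overhead, so it is meant for reproducing a bug, not for normal production use.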

#7 Updated by Marcel Telka about 6 years ago

  • Status changed from Feedback to In Progress

#8 Updated by Marcel Telka about 6 years ago

  • Assignee set to Marcel Telka

#9 Updated by Jan Veninga almost 6 years ago

We have the same issue. The machine is also running on VMware: an NFS server with Kerberos authentication. If more info is needed, please let me know.

#10 Updated by Matthias Frey over 5 years ago

Hi

We have the same issue. We are serving the home directories via nfs/kerberos.
A dump is available here:

https://gigamove.rz.rwth-aachen.de/d/id/6jkY8jK7cCYDMa

#11 Updated by Piotr Jasiukajtis over 5 years ago

Please share your kernel version and panic stack trace.

#12 Updated by Matthias Frey over 5 years ago

>$C
ffffff001feb7870 rmm_wput+0x1a(ffffff04f2588ba0, ffffff0501070220)
ffffff001feb78e0 put+0x1b5(ffffff04f2588ba0, ffffff0501070220)
ffffff001feb7950 svc_cots_ksend+0x8f(ffffff04ffdf0200, ffffff001feb7960)
ffffff001feb79d0 svc_sendreply+0x57(ffffff04ffdf0200, fffffffff8305e40, ffffff001feb7a70)
ffffff001feb7b30 do_gss_accept+0x349(ffffff04ffdf0200, ffffff05305e2a20, ffffff05060c09c8, ffffff0530751bb8, 1, 1)
ffffff001feb7b60 svcrpcsec_gss_taskq_func+0x2f(ffffff050c910e78)
ffffff001feb7c20 taskq_thread+0x2d0(ffffff04f21f5178)
ffffff001feb7c30 thread_start+8()

uname -a:
SunOS iai-fileserv1 5.11 omnios-b281e50 i86pc i386 i86pc

The server is running OmniOS at the moment, but we have this problem also with Nexenta and OpenIndiana.

#13 Updated by Piotr Jasiukajtis over 5 years ago

Are you able to reproduce this with the oi_151a8 OpenIndiana version?

#14 Updated by Stefan Skoglund about 5 years ago

I'm seeing this too, and this is also a VMware guest.

----
SunOS san2 5.11 oi_151a8 i86pc i386 i86pc Solaris
----

#15 Updated by Marcel Telka over 2 years ago

  • Duplicated by Bug #7342: kernel crash in rpcsec_gss after using gsscred added

#16 Updated by Marcel Telka about 2 years ago

  • Subject changed from kernel panic due to rpcmod? to kernel crash in rpcsec_gss after using gsscred

#17 Updated by Paul Henson 11 months ago

We just hit this in the latest omnios CE:

# cat /etc/release
    OmniOS v11 r151022ac

ffffff0011d54870 rmm_wput+0x1a(ffffff0477fe5670, ffffff03dd6c7b40)
ffffff0011d548e0 put+0x1b5(ffffff0477fe5670, ffffff03dd6c7b40)
ffffff0011d54950 svc_cots_ksend+0x9e(ffffff047644f000, ffffff0011d54960)
ffffff0011d549d0 svc_sendreply+0x57(ffffff047644f000, fffffffff8579e40, ffffff0011d54a70)
ffffff0011d54b30 do_gss_accept+0x443(ffffff047644f000, ffffff0464b24870, ffffff30f4a17c80, ffffff1a23f3cea0, 1, 1)
ffffff0011d54b60 svcrpcsec_gss_taskq_func+0x2f(ffffff0423cd9830)
ffffff0011d54c20 taskq_thread+0x2d0(ffffff03eddec390)
ffffff0011d54c30 thread_start+8()

Longstanding bug 8-/; Marcel, any chance you'd have time to take a look at it again and finish the fix you mentioned in the duplicate ticket :)? I can provide a core dump if you'd like.

Thanks...

#18 Updated by Marcel Telka 7 months ago

  • Duplicated by Bug #9563: Panic with kerberized NFS and GSS added

#19 Updated by Marcel Telka 7 months ago

I received this today from Santosh Pradhan:

On Tue, May 29, 2018 at 8:35 PM, Marcel Telka <marcel@telka.sk> wrote:

> Hi Santosh,
>
> On Tue, May 29, 2018 at 08:32:17PM +0530, Santosh Pradhan wrote:
> > Hi Marcel,
> > This is Santosh again. Sorry to disturb you. Just wanted to let you know
> > that this issue I have fixed in our branch. I can share the one liner if
> > you want.
>
> Feel free to propose your fix upstream, but I doubt the proper fix is just
> oneliner.  IIRC the problem (and the fix) is more complex.
>

Could be, but as per my understanding it should be simple one and the fix
works for all the use cases we tested.

The stacktrace looks similar, here was my analysis:

[root@meeren:meeren]# mdb 0
Loading modules: [ unix genunix dtrace specfs mac cpu.generic uppc
pcplusmp scsi_vhci zfs mpt_sas3 sd ip hook neti sockfs arp usba uhci
stmf_sbd stmf md lofs random idm crypto nfs ipc nsmb smbsrv ptm ufs
logindmux ]
> $C
ffffff005e1089a0 svc_cots_ksend+0x1e1(ffffff0dcade6800, ffffff005e1089b0)
ffffff005e108a20 svc_sendreply+0x59(ffffff0dcade6800,
fffffffff821e8c8, ffffff005e108ab0)
ffffff005e108b40 do_gss_accept+0x3fd(ffffff0dcade6800,
ffffff12d1e49f80, ffffff105b172a58, ffffff0e85125ad8, 1, 1)
ffffff005e108b70 svcrpcsec_gss_taskq_func+0x38(ffffff0eb11ea658)
ffffff005e108c20 taskq_thread+0x285(ffffff0da63d1360)
ffffff005e108c30 thread_start+8()
>

> ffffff0dcade6800::print -ta SVCXPRT xp_master
ffffff0dcade6858 SVCMASTERXPRT *xp_master = 0xffffff0dcaea3448

> 0xffffff0dcaea3448::print -ta SVCMASTERXPRT xp_xpc.xpc_wq
ffffff0dcaea3468 queue_t *xp_xpc.xpc_wq = 0   <==== the xprt is gone, i.e. wq is already closed in the master xprt

> ffffff0dcade6800::print -ta SVCXPRT xp_xpc.xpc_wq
ffffff0dcade6810 queue_t *xp_xpc.xpc_wq = 0xffffff0da66aab90

But the clone xprt is still pointing to the stale queue, causing the corruption.

The GSS security exchange is handled in one thread (a taskq), i.e. svcrpcsec_gss_taskq_func(), while NFS RPC requests are handled in a different thread, i.e. svc_run() [created in svc_queuereq()].

Both can execute independently of each other; they know nothing about each other.

If svc_run()->svc_getreq()->sec_svc_msg() finishes first [since the GSS security exchange is handled in its own thread, rpcsec_gss_init() can return AUTH_OK as soon as it dispatches the task into the taskq], then svc_run() releases its slot (it fetched one mblk_t message from the STREAMS wq), and if there are no more messages the queue might get closed, which is what happened in this case.
> 0xffffff0da66aab90::print -ta queue_t q_flag
ffffff0da66aabc8 uint_t q_flag = 0x64002

#define QWCLOSE         0x00020000      /* will not be enabled          */

> ffffff0dcade6800::print -ta SVCXPRT xp_master
ffffff0dcade6858 SVCMASTERXPRT *xp_master = 0xffffff0dcaea3448

> 0xffffff0dcaea3448::print -ta SVCMASTERXPRT xp_threads xp_xpc.xpc_wq
ffffff0dcaea34d0 int xp_threads = 0x1
ffffff0dcaea3468 queue_t *xp_xpc.xpc_wq = 0

xp_threads is 1 because svcrpcsec_gss_taskq_func() is linked with this xprt, but the queue is gone.

svcrpcsec_gss_taskq_func() still accesses the same queue, which was already closed.

> 0xffffff0da66aab90::whatis
ffffff0da66aab90 is ffffff0da66aaa98+f8, freed from queue_cache

This is just a race between these threads.

I believe svc_run() [where svc_getreq() invokes sec_svc_msg()] should wait for the svcrpcsec_gss_taskq_func() thread to finish, i.e. rpcsec_gss_init() should wait for the task to finish (ddi_taskq_wait()) before returning AUTH_OK, so that the queue would not be closed.

The patch was:

diff --git a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
index 23f98b9..ccbdd16 100644
--- a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
+++ b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
@@ -1000,6 +1000,7 @@ rpcsec_gss_init(
        }

        mutex_exit(&client_data->clm);
+       ddi_taskq_wait(svcrpcsec_gss_init_taskq);
        *no_dispatch = TRUE;
        return (AUTH_OK);

Let me know what you think.

Best R, Santosh

#20 Updated by Marcel Telka 7 months ago

And here is my reply:

On Tue, May 29, 2018 at 08:57:27PM +0530, Santosh Pradhan wrote:
> diff --git a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
> b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
> index 23f98b9..ccbdd16 100644
> --- a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
> +++ b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c
> @@ -1000,6 +1000,7 @@ rpcsec_gss_init(
>         }
> 
>         mutex_exit(&client_data->clm);
> +       ddi_taskq_wait(svcrpcsec_gss_init_taskq);
>         *no_dispatch = TRUE;
>         return (AUTH_OK);
> 
> 
> Let me know what do you think.

Ah, okay.  You simply made svcrpcsec_gss_taskq_func() synchronous to
rpcsec_gss_init().  This obviously fixes the problem, but with a possible
performance penalty.  I'm not sure how serious the performance degradation is
(or could be), but this definitely needs some measurement to be sure it is
acceptable.

If this fix is acceptable, then you can just call svcrpcsec_gss_taskq_func()
directly and do not need the taskq for it.  This would make your fix a bit
wider, but would simplify the code significantly.

OTOH, I was thinking about fixing this without any possible performance impact,
but that would require more complex change.

#21 Updated by Marcel Telka 7 months ago

  • Assignee deleted (Marcel Telka)

#22 Updated by santosh pradhan 6 months ago

Again, rpcsec_gss_init() is called only during mount, not in the data path (e.g. rpcsec_gss_data()), so it would not impact performance, as per my understanding.

#23 Updated by Marcel Telka 6 months ago

santosh pradhan wrote:

Again rpcsec_gss_init() gets called only during mount and not in the data path (e.g. rpcsec_gss_data()). So, it would not impact performance as per my understanding.

Sure, but in some scenarios even a slightly slower mount could have a significant performance impact. In any case, I always prefer correct (even if possibly slower) code over faster but buggy code, so I have no problem with your fix. :-)

#24 Updated by Andy Fiddaman 4 months ago

We've had this reported in OmniOS LTS again after we integrated KPTI.
KPTI is not at fault but appears to have changed timings enough in places to trigger the problem more often - at least for the customer who reported it.

#25 Updated by Andy Fiddaman 4 months ago

  • % Done changed from 0 to 80
  • Assignee set to Andy Fiddaman
  • Tags deleted (needs-triage)

#26 Updated by Andy Fiddaman 4 months ago

  • % Done changed from 80 to 100
  • Status changed from In Progress to Pending RTI

#27 Updated by Electric Monk 4 months ago

  • Status changed from Pending RTI to Closed

git commit 4a3b05278938491ea95d557939c130d3eb17cfd4

commit  4a3b05278938491ea95d557939c130d3eb17cfd4
Author: Andy Fiddaman <omnios@citrus-it.co.uk>
Date:   2018-08-29T17:20:11.000Z

    3354 kernel crash in rpcsec_gss after using gsscred
    Reviewed by: Toomas Soome <tsoome@me.com>
    Reviewed by: Carlos Neira <cneirabustos@gmail.com>
    Approved by: Robert Mustacchi <rm@joyent.com>
