kernel crash in rpcsec_gss after using gsscred
|Assignee:||Andy Fiddaman||% Done:|
we received a kernel panic, assuming that the following messages of /var/adm/messages are relevant:
Nov 13 15:36:13 wertach savecore: [ID 570001 auth.error] reboot after panic: BAD TRAP: type=e (#pf Page fault) rp=ffffff000da967c0 addr=10 occurred in module "rpcmod" due to a
NULL pointer dereference
Nov 13 15:36:13 wertach savecore: [ID 442558 auth.error] Panic crashdump pending on dump device but dumpadm -n in effect; run savecore(1M) manually to extract. Image UUID a373
The machine is a vmware NFS-server using krb5/ldap authentication. crash.0 is attached, whole dump can be made available. The crash appeared svereal times now.
Thanks for looking at it! It is our home-dir file server and we heavily rely on it, since it is working great so far!
#1 Updated by Andreas Philipp over 6 years ago
echo '::status' | mdb 0
debugging crash dump vmcore.0 (64-bit) from wertach
operating system: 5.11 oi_151a7 (i86pc)
image uuid: a373c73c-a002-64ad-d3ac-c6b5ab0b625c
BAD TRAP: type=e (#pf Page fault) rp=ffffff000da967c0 addr=10 occurred in module
"rpcmod" due to a NULL pointer dereference
dump content: kernel pages only
#3 Updated by Andreas Philipp over 6 years ago
echo '$C' | mdb 0
ffffff000da968c0 rmm_wput+0x1a(ffffff02a1b92138, ffffff029a1269e0)
ffffff000da96930 put+0x1aa(ffffff02a1b92138, ffffff029a1269e0)
ffffff000da969a0 svc_cots_ksend+0x1de(ffffff02addd6400, ffffff000da969b0)
ffffff000da96a20 svc_sendreply+0x59(ffffff02addd6400, fffffffff81ac8c8,
ffffff000da96b40 do_gss_accept+0x3fd(ffffff02addd6400, ffffff02f207e0e0,
ffffff02a86f2378, ffffff03802c02f8, 1, 1)
Is it possible to send a link to vdump.0 via personal message? And if yes, how?
#10 Updated by Matthias Frey over 5 years ago
We have the same issue. We are serving the home directories via nfs/kerberos.
A dump is available here:
#12 Updated by Matthias Frey over 5 years ago
>$C ffffff001feb7870 rmm_wput+0x1a(ffffff04f2588ba0, ffffff0501070220) ffffff001feb78e0 put+0x1b5(ffffff04f2588ba0, ffffff0501070220) ffffff001feb7950 svc_cots_ksend+0x8f(ffffff04ffdf0200, ffffff001feb7960) ffffff001feb79d0 svc_sendreply+0x57(ffffff04ffdf0200, fffffffff8305e40, ffffff001feb7a70) ffffff001feb7b30 do_gss_accept+0x349(ffffff04ffdf0200, ffffff05305e2a20, ffffff05060c09c8, ffffff0530751bb8, 1, 1) ffffff001feb7b60 svcrpcsec_gss_taskq_func+0x2f(ffffff050c910e78) ffffff001feb7c20 taskq_thread+0x2d0(ffffff04f21f5178) ffffff001feb7c30 thread_start+8()
SunOS iai-fileserv1 5.11 omnios-b281e50 i86pc i386 i86pc
The Server is running OmniOs at the moment. But we have this Problem also with Nexenta and Openindiana.
#17 Updated by Paul Henson about 1 year ago
We just hit this in the latest omnios CE:
- cat /etc/release
OmniOS v11 r151022ac
ffffff0011d54870 rmm_wput+0x1a(ffffff0477fe5670, ffffff03dd6c7b40)
ffffff0011d548e0 put+0x1b5(ffffff0477fe5670, ffffff03dd6c7b40)
ffffff0011d54950 svc_cots_ksend+0x9e(ffffff047644f000, ffffff0011d54960)
ffffff0011d549d0 svc_sendreply+0x57(ffffff047644f000, fffffffff8579e40, ffffff0011d54a70)
ffffff0011d54b30 do_gss_accept+0x443(ffffff047644f000, ffffff0464b24870, ffffff30f4a17c80, ffffff1a23f3cea0, 1, 1)
Longstanding bug 8-/; Marcel, any chance you'd have time to take a look at it again and finish the fix you mentioned in the duplicate ticket :)? I can provide a core dump if you'd like.
#19 Updated by Marcel Telka 9 months ago
I received this today from Santosh Pradhan:
On Tue, May 29, 2018 at 8:35 PM, Marcel Telka <firstname.lastname@example.org> wrote: > Hi Santosh, > > On Tue, May 29, 2018 at 08:32:17PM +0530, Santosh Pradhan wrote: > > Hi Marcel, > > This is Santosh again. Sorry to disturb you. Just wanted to let you know > > that this issue I have fixed in our branch. I can share the one liner if > > you want. > > Feel free to propose your fix upstream, but I doubt the proper fix is just > oneliner. IIRC the problem (and the fix) is more complex. > Could be, but as per my understanding it should be simple one and the fix works for all the use cases we tested. The stacktrace looks similar, here was my analysis: [root@meeren:meeren]# mdb 0 Loading modules: [ unix genunix dtrace specfs mac cpu.generic uppc pcplusmp scsi_vhci zfs mpt_sas3 sd ip hook neti sockfs arp usba uhci stmf_sbd stmf md lofs random idm crypto nfs ipc nsmb smbsrv ptm ufs logindmux ] > $C ffffff005e1089a0 svc_cots_ksend+0x1e1(ffffff0dcade6800, ffffff005e1089b0) ffffff005e108a20 svc_sendreply+0x59(ffffff0dcade6800, fffffffff821e8c8, ffffff005e108ab0) ffffff005e108b40 do_gss_accept+0x3fd(ffffff0dcade6800, ffffff12d1e49f80, ffffff105b172a58, ffffff0e85125ad8, 1, 1) ffffff005e108b70 svcrpcsec_gss_taskq_func+0x38(ffffff0eb11ea658) ffffff005e108c20 taskq_thread+0x285(ffffff0da63d1360) ffffff005e108c30 thread_start+8() > > ffffff0dcade6800::print -ta SVCXPRT xp_master ffffff0dcade6858 SVCMASTERXPRT *xp_master = 0xffffff0dcaea3448 > 0xffffff0dcaea3448::print -ta SVCMASTERXPRT xp_xpc.xpc_wq ffffff0dcaea3468 queue_t *xp_xpc.xpc_wq = 0 <==== xprt is gone means wq is already closed in master xprt > ffffff0dcade6800::print -ta SVCXPRT xp_xpc.xpc_wq ffffff0dcade6810 queue_t *xp_xpc.xpc_wq = 0xffffff0da66aab90 But clone xprt is still\pointing to stale one causing the corruption. The gss security is handled in different thread (taskq) i.e. svcrpcsec_gss_taskq_func() and NFS RPC reqs are handled in different thread i.e. svc_run() [created in svc_queuereq()]. Both can execute independent of each other, they know nothing about each other. If svc_run()->svc_getreq()->sec_svc_msg() finishes first [as gss sec is handled in thread, rpcsec_gss_init() can return AUTH_OK if it could dispatches the task into taskq], then svc_run() ll releases it slot (fetched one mblk_t msg from STREAMS wq) and if there are no more messages that queue might get closed which is happened in this case. > 0xffffff0da66aab90::print -ta queue_t q_flag ffffff0da66aabc8 uint_t q_flag = 0x64002 #define QWCLOSE 0x00020000 /* will not be enabled */ > ffffff0dcade6800::print -ta SVCXPRT xp_master ffffff0dcade6858 SVCMASTERXPRT *xp_master = 0xffffff0dcaea3448 > 0xffffff0dcaea3448::print -ta SVCMASTERXPRT xp_threads xp_xpc.xpc_wq ffffff0dcaea34d0 int xp_threads = 0x1 ffffff0dcaea3468 queue_t *xp_xpc.xpc_wq = 0 xp_threads is 1 as svcrpcs_gss_taskq_func() links with this xprt. But queue is gone. svcrpcsec_gss_taskq_func() still access the same queue which was closed. > 0xffffff0da66aab90::whatis ffffff0da66aab90 is ffffff0da66aaa98+f8, freed from queue_cache This is just a race between these threads. I believe svc_run() [svc_getreq() invokes sec_svc_msg()], should wait for svcrpcsec_gss_taskq_func() thread to finish. i.e. rpcsec_gss_init() should wait for the task to finish (ddi_taskq_wait()) before returning AUTH_OK so that queue would not be closed. The patch was: diff --git a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c index 23f98b9..ccbdd16 100644 --- a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c +++ b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c @@ -1000,6 +1000,7 @@ rpcsec_gss_init( } mutex_exit(&client_data->clm); + ddi_taskq_wait(svcrpcsec_gss_init_taskq); *no_dispatch = TRUE; return (AUTH_OK); Let me know what do you think. Best R, Santosh
#20 Updated by Marcel Telka 9 months ago
And here is my reply:
On Tue, May 29, 2018 at 08:57:27PM +0530, Santosh Pradhan wrote: > diff --git a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c > b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c > index 23f98b9..ccbdd16 100644 > --- a/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c > +++ b/usr/src/uts/common/rpc/sec_gss/svc_rpcsec_gss.c > @@ -1000,6 +1000,7 @@ rpcsec_gss_init( > } > > mutex_exit(&client_data->clm); > + ddi_taskq_wait(svcrpcsec_gss_init_taskq); > *no_dispatch = TRUE; > return (AUTH_OK); > > > Let me know what do you think. Ah, okay. You simply made svcrpcsec_gss_taskq_func() synchronous to rpcsec_gss_init(). This obviously fixes the problem but with possible performance penalty. I'm not sure how serious the performance degradation is (or could be) but this definitely needs some measurement to be sure it is acceptable. If this fix is acceptable, then you can just call svcrpcsec_gss_taskq_func() directly and do not need the tasq for it. This would make your fix a bit wider, but simplify the code significantly. OTOH, I was thinking about fixing this without any possible performance impact, but that would require more complex change.
#23 Updated by Marcel Telka 8 months ago
santosh pradhan wrote:
Again rpcsec_gss_init() gets called only during mount and not in the data path (e.g. rpcsec_gss_data()). So, it would not impact performance as per my understanding.
Sure, but in some scenarios even a bit slower mount could have significant performance impact. In any case, I always prefer correct (even maybe possible slower) code, over faster, but buggy code, so I've no problem with your fix. :-)
#27 Updated by Electric Monk 6 months ago
- Status changed from Pending RTI to Closed
commit 4a3b05278938491ea95d557939c130d3eb17cfd4 Author: Andy Fiddaman <email@example.com> Date: 2018-08-29T17:20:11.000Z 3354 kernel crash in rpcsec_gss after using gsscred Reviewed by: Toomas Soome <firstname.lastname@example.org> Reviewed by: Carlos Neira <email@example.com> Approved by: Robert Mustacchi <firstname.lastname@example.org>
Also available in: Atom