Bug 213273

Summary: BUG: kernel NULL pointer dereference - nfs4_proc_lock
Product: File System
Component: NFS
Assignee: Trond Myklebust (trondmy)
Reporter: Daire Byrne (daire)
Status: NEW
Severity: normal
Priority: P1
CC: bfields
Hardware: All
OS: Linux
Kernel Version: 5.13
Regression: No
Attachments: lockd: lockd server-side shouldn't set fl_ops

Description Daire Byrne 2021-05-30 10:51:58 UTC
Hi,

We are doing that weird thing where we re-export NFSv4.2 client mounts to NFSv3 clients. This has been working fine for months now, without issue and without any changes in kernel versions.

But something might have changed in our production workload that is now triggering this crash of the re-export server on a daily basis.

It occurs on all recent kernel versions (v5.11, v5.12) and the stack looks the same. Here's what it looks like on v5.13-rc1:

[  162.335712] BUG: kernel NULL pointer dereference, address: 0000000000000028
[  162.342668] #PF: supervisor read access in kernel mode
[  162.347801] #PF: error_code(0x0000) - not-present page
[  162.351067] PGD 0 P4D 0
[  162.352326] Oops: 0000 [#1] SMP PTI
[  162.354080] CPU: 13 PID: 3058 Comm: lockd Kdump: loaded Not tainted 5.13.0-1.dneg.x86_64 #1
[  162.358139] Hardware name: Red Hat dneg, BIOS 0.5.1 01/01/2011
[  162.360078] RIP: 0010:nfs4_proc_lock+0xd5/0x540 [nfsv4]
[  162.362200] Code: 4d 8b 8c 24 b8 00 00 00 48 8b 42 28 48 8b b0 70 03 00 00 31 c0 4c 8b 06 f3 48 ab 48 89 55 88 49 8b 80 18 01 00 00 48 89 45 90 <49> 63 41 28 48 89 45 98 8b 86 e0 00 00 00 89 45 a0 49 8b 45 6
[  162.370859] RSP: 0018:ffffa591414ebc30 EFLAGS: 00010246
[  162.372838] RAX: cc0f8f607f51c240 RBX: 0000000000000006 RCX: 0000000000000000
[  162.374938] RDX: ffff946d12ca01e8 RSI: ffff946d154c2800 RDI: ffffa591414ebc88
[  162.377636] RBP: ffffa591414ebcb8 R08: ffff946d04182400 R09: 0000000000000000
[  162.380368] R10: ffff946d01b52840 R11: 0000000000000000 R12: ffff946d5143f8e0
[  162.383262] R13: ffff946d01b52840 R14: ffff946d12ca01e8 R15: ffff946d1b52b900
[  162.385953] FS:  0000000000000000(0000) GS:ffff94a74f540000(0000) knlGS:0000000000000000
[  162.388653] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  162.390270] CR2: 0000000000000028 CR3: 000000245e40a002 CR4: 00000000007706e0
[  162.392322] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  162.394348] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  162.396378] PKRU: 55555554
[  162.397159] Call Trace:
[  162.397875]  ? sysvec_apic_timer_interrupt+0x4e/0x90
[  162.399344]  do_setlk+0x67/0xf0 [nfs]
[  162.400415]  nfs_lock+0xb7/0x180 [nfs]
[  162.401533]  ? kmem_cache_alloc_trace+0x2c6/0x440
[  162.402908]  vfs_lock_file+0x21/0x40
[  162.403943]  nlmsvc_lock+0x14a/0x4c0 [lockd]
[  162.405146]  __nlm4svc_proc_lock+0xe6/0x130 [lockd]
[  162.406913]  nlm4svc_proc_lock+0x15/0x20 [lockd]
[  162.408505]  svc_process_common+0x64e/0x730 [sunrpc]
[  162.410251]  svc_process+0xbc/0xf0 [sunrpc]
[  162.411771]  lockd+0xce/0x1b0 [lockd]
[  162.412815]  kthread+0x114/0x150
[  162.413772]  ? grace_ender+0x20/0x20 [lockd]
[  162.414958]  ? kthread_park+0x90/0x90
[  162.415960]  ret_from_fork+0x22/0x30
[  162.417015] Modules linked in: tcp_diag inet_diag nfsv3 rpcsec_gss_krb5 nfsv4 act_mirred dns_resolver sch_ingress ifb nfs cls_u32 sch_fq sch_prio cachefiles fscache netfs ext4 mbcache jbd2 cirrus drm_kms_hed
[  162.435249] CR2: 0000000000000028
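
(A side note on reading the oops, in case it helps with triage: a fault address as small as 0x28 usually just means a struct member at offset 0x28 was read through a NULL base pointer, which matches the CR2 value above. The toy C program below is purely illustrative, has nothing to do with the kernel structures involved, and only prints the arithmetic rather than actually faulting.)

#include <stdio.h>
#include <stddef.h>

/* Toy illustration only, not kernel code: if the base pointer of a struct is
 * NULL, reading a member that sits 0x28 bytes into the struct faults at
 * virtual address 0x28 -- the "address: 0000000000000028" reported above. */
struct example_state {
        char pad[0x28];         /* whatever occupies the first 0x28 bytes */
        int member_at_0x28;     /* hypothetical member at offset 0x28 */
};

int main(void)
{
        printf("offset of member_at_0x28 = %#zx\n",
               offsetof(struct example_state, member_at_0x28));
        /* Dereferencing that member through a NULL pointer would fault at
         * the same address; here we only print the offset. */
        return 0;
}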

Once this happens, the re-export server reboots and remounts all the NFSv4.2 servers (RHEL7.6), and then, as soon as the NFSv3 clients of those re-exported mounts start doing whatever they were doing, it promptly crashes once again.

I have to turn off all the NFSv3 clients to stop the crashing and I can't figure out which one is responsible or what operations our re-export server doesn't like.

I know our re-export case is pretty niche, but I figure it might at least be highlighting an issue that only this kind of workload can trigger?

Like I said, there have been no noticeable changes in kernel or configurations recently on any of these systems (that I'm aware of) - it just feels like our production workload is doing something different.

I also see some sporadic messages like this on our originating RHEL7.6 NFSv4.2 servers:

[2256248.594537] NFSD: client 10.XX.YY.10 testing state ID with incorrect client ID
[2256248.730531] NFSD: client 10.XX.YY.10 testing state ID with incorrect client ID

Where "10.XX.YY.10" is our re-export server. But it's not clear if that is related to this crash. I can think of no reason why the ID, hostname or network of the re-export server might change.

We are also using nconnect for the re-export server's client mounts if that is relevant.
Comment 1 bfields 2021-06-09 13:46:52 UTC
I can reproduce this just by running connectathon tests over a 4.2 re-export of a v3 mount.  I can't explain it yet.

The v4.2/v4.2 case crashes too (in a different place); I can post a patch for that.

In general, file locking on re-exports looks to be in pretty poor shape, but fixable....
Comment 2 Trond Myklebust 2021-06-09 14:47:05 UTC
No... file locking on re-exports is terminally broken and is not trivially fixable.

Consider the case where the re-exporting server reboots, but neither the NFS client nor the 'original' server change. How do you recover locks?
Comment 3 bfields 2021-06-09 18:23:47 UTC
(In reply to Trond Myklebust from comment #2)
> No... file locking on re-exports is terminally broken and is not trivially
> fixable.
> 
> Consider the case where the re-exporting server reboots, but neither the NFS
> client nor the 'original' server change. How do you recover locks?

I was thinking it would be worthwhile just to fix up the crashes.

If we don't think "works if the re-export server never restarts" is really a supportable state, though, then maybe the safest thing for now would be to add an export_op flag or otherwise give nfsd a way to fail any locking attempts.
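
Very roughly, the sort of check I mean, as a sketch only (the flag and helper names here are made up for illustration and are not actual nfsd code):

#include <stdio.h>
#include <errno.h>

/* Hypothetical export flag the re-exported filesystem could advertise to say
 * "don't attempt locking through me", plus the kind of up-front check the
 * nfsd/lockd lock paths would do before touching the filesystem at all. */
#define EXPORT_OP_NOLOCKSUPPORT 0x20    /* made-up flag */

struct hypothetical_export_ops {
        unsigned long flags;
};

static int lock_allowed_on_export(const struct hypothetical_export_ops *ops)
{
        if (ops->flags & EXPORT_OP_NOLOCKSUPPORT)
                return -ENOLCK;         /* refuse cleanly instead of crashing later */
        return 0;
}

int main(void)
{
        struct hypothetical_export_ops reexport = { .flags = EXPORT_OP_NOLOCKSUPPORT };

        printf("lock attempt on flagged export -> %d\n",
               lock_allowed_on_export(&reexport));
        return 0;
}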

I think we can handle recovery of the export server eventually, but it's definitely not trivial.
Comment 4 Daire Byrne 2021-06-09 22:15:38 UTC
In our particular case, the reboot recovery limitation is acceptable as long as the re-export server doesn't crash or need to be rebooted. Our production NFS servers (RHEL) have uptimes on the order of years, after all.

That said, I do appreciate that it would be better to have a general (reboot recovery) solution so that users don't have to worry about caveats that may or may not be written down somewhere (the re-export wiki page only helps if you know where to look).

But obviously now that we are seeing crashes with some (rare) workloads, we can no longer even rely on our hopeful "don't reboot or crash" strategy.

So on a purely selfish level, fixing the crashes would help us out and ensure we never have to deal with the reboot recovery problem.

Or at least if we do reboot, it's part of a scheduled outage and we understand that the clients will also need rebooting.
Comment 5 bfields 2021-06-11 20:59:47 UTC
Created attachment 297343
lockd: lockd server-side shouldn't set fl_ops

Does this (applied to kernel on the re-export server) fix the crash?
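
For context on the patch title: struct file_lock carries two callback tables, fl_ops (for the filesystem backing the file) and fl_lmops (for the lock manager, i.e. lockd or nfsd). The premise is that server-side lockd has no business setting fl_ops, the idea being that when the exported filesystem is itself an NFS client mount, NFS owns that field and the per-filesystem lock state that goes with it. A minimal user-space model of that ownership rule (purely illustrative, not the patch itself):

#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins only -- not the kernel structures. The point is the
 * ownership rule: fl_ops belongs to the filesystem under the lock, fl_lmops
 * to the lock manager, so a lock built by the server-side lock manager should
 * leave fl_ops alone. */
struct model_lock {
        const void *fl_ops;     /* filesystem-owned callbacks */
        const void *fl_lmops;   /* lock-manager-owned callbacks */
};

static const int lock_manager_callbacks;        /* stand-in for lockd's op table */

static void lock_manager_build_lock(struct model_lock *fl)
{
        fl->fl_lmops = &lock_manager_callbacks; /* manager state lives here... */
        fl->fl_ops = NULL;                      /* ...and never in fl_ops */
}

int main(void)
{
        struct model_lock fl = { NULL, NULL };

        lock_manager_build_lock(&fl);
        assert(fl.fl_ops == NULL);      /* the filesystem below can trust fl_ops */
        return 0;
}

The actual patch deals with the real lockd structures, of course; the sketch above is just the rule it enforces.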
Comment 6 Daire Byrne 2021-06-11 22:02:20 UTC
Thanks for that!

I've put it into production now, but I should say that I have not seen the crash in a couple of days. It was happening at least twice a day all last week.

So it may take a little while to say anything for sure as this particular workload seems to come and go and I don't know what it is exactly.

I'm assuming that this patch fixes your reproducible connectathon crashes (NFSv4.2 re-export of a NFSv3 mount)?

I was sure this was actually happening for us with an NFSv3 re-export of an NFSv4.2 client mount (the other way around), but I have just realised that the same re-export server was also re-exporting a single NFSv3 mount over NFSv4.2, which confused matters somewhat (well, me at least).
Comment 7 bfields 2021-06-12 22:19:36 UTC
I've got two easily reproducible crashes: one a NULL dereference that looks like yours, whenever I take a lock on a v3 re-export of a v4.2 mount; the other on a 4.2 re-export of a 4.2 mount, which I have a separate patch for.

I've also got a third patch that forces reclaims after a reboot to fail in the re-export case; that seems like the safer way to handle the lack of reboot recovery support.
Comment 8 Daire Byrne 2021-06-14 15:13:42 UTC
Thanks Bruce, these patches sound great for us. I haven't seen any issues over the weekend's production loads but I'll continue to run with it throughout the week.

I'm actually a little surprised that, after almost 2 years of using a re-export server, I've only recently started seeing this lock crash on a v3 re-export of a v4.2 mount.

Now I know our workloads are not lock heavy (lots of unique output files per process), but certainly there should be lots of overlapping read locks as we pull common (cacheable) files from lots of clients.

I actually stopped using an NFSv4.2 re-export of an NFSv4.2 mount because I was hitting a rare hang there too. I didn't spend too much time digging into it because the NFSv3 re-export of NFSv4.2 worked fine for our purposes. We get the benefits of NFSv4.2 over the WAN (less chatter), and then NFSv3 to the local LAN clients is performant.

For NFSv4.2 + v4.2 there wasn't a crash exactly; it's just that all the nfsds would get stuck (on something) and then I was just left with slowpath messages:

Oct  3 19:37:56 loncloudnfscache1 kernel: INFO: task nfsd:2864 blocked for more than 122 seconds.
Oct  3 19:37:56 loncloudnfscache1 kernel:      Not tainted 5.7.10-1.dneg.x86_64 #1
Oct  3 19:37:56 loncloudnfscache1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  3 19:37:56 loncloudnfscache1 kernel: nfsd            D    0  2864      2 0x80004000
Oct  3 19:37:56 loncloudnfscache1 kernel: Call Trace:
Oct  3 19:37:56 loncloudnfscache1 kernel: __schedule+0x3a1/0x6f0
Oct  3 19:37:56 loncloudnfscache1 kernel: ? nfs_check_cache_invalid+0x38/0xa0 [nfs]
Oct  3 19:37:56 loncloudnfscache1 kernel: schedule+0x4f/0xc0
Oct  3 19:37:56 loncloudnfscache1 kernel: rwsem_down_write_slowpath+0x2bb/0x472
Oct  3 19:37:56 loncloudnfscache1 kernel: down_write+0x42/0x50
Oct  3 19:37:56 loncloudnfscache1 kernel: nfsd_lookup_dentry+0xba/0x420 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: ? fh_verify+0x341/0x6e0 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: nfsd_lookup+0x82/0x140 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: nfsd4_lookup+0x1a/0x20 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: nfsd4_proc_compound+0x646/0x830 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: ? svc_reserve+0x40/0x50 [sunrpc]
Oct  3 19:37:56 loncloudnfscache1 kernel: nfsd_dispatch+0xc1/0x260 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: svc_process_common+0x323/0x760 [sunrpc]
Oct  3 19:37:56 loncloudnfscache1 kernel: ? svc_sock_secure_port+0x16/0x40 [sunrpc]
Oct  3 19:37:56 loncloudnfscache1 kernel: ? nfsd_svc+0x360/0x360 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: svc_process+0xfc/0x110 [sunrpc]
Oct  3 19:37:56 loncloudnfscache1 kernel: nfsd+0xe9/0x160 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: kthread+0x105/0x140
Oct  3 19:37:56 loncloudnfscache1 kernel: ? nfsd_destroy+0x60/0x60 [nfsd]
Oct  3 19:37:56 loncloudnfscache1 kernel: ? kthread_bind+0x20/0x20
Oct  3 19:37:56 loncloudnfscache1 kernel: ret_from_fork+0x35/0x40

All the nfsd threads and clients were completely stalled until a reboot (of everything). I'd be interested to re-test with your v4.2 + v4.2 patch to see if it makes any difference.

Many thanks for your time. I know we are a bit of an oddball case, but it mostly works for us. I'd say it has even helped us win Oscars... ;)

Daire
Comment 9 Daire Byrne 2021-06-17 17:41:44 UTC
Just a small update to say that I've not seen this crash at all this week (or any other issues, for that matter). I'm running with it applied to 5.13-rc6.

I hope to be able to test an NFSv4.2 re-export of an NFSv4.2 server later next week to see if that other patch helps us out there too.

Thanks again,

Daire
Comment 10 bfields 2021-06-17 19:42:54 UTC
Thanks for the report.