Bug 203363 - High amount of specifically crafted file ops causes server crash / kernel panic (NFS 4.1 / NFS 4.2)
Summary: High amount of specifically crafted file ops causes server crash / kernel panic (NFS 4.1 / NFS 4.2)
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All
OS: Linux
Importance: P1 high
Assignee: bfields
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-19 11:05 UTC by Slawomir Pryczek
Modified: 2022-01-21 21:37 UTC
CC List: 4 users

See Also:
Kernel Version: 5.0.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
FC 29 crash log (38.62 KB, text/plain)
2019-04-19 11:10 UTC, Slawomir Pryczek
debug patch (1.01 KB, application/mbox)
2019-04-19 18:53 UTC, Jeff Layton
Draft patch (1.12 KB, patch)
2019-04-20 01:40 UTC, Jeff Layton
patch - delete block before sending callback (1.54 KB, patch)
2019-04-20 10:05 UTC, Jeff Layton

Description Slawomir Pryczek 2019-04-19 11:05:34 UTC
This bug leads to an NFS server machine crash or kernel panic when a client is connected using NFS 4.1 or 4.2. When the client uses NFS 4.0 everything seems to work fine, so the problem appears to be related to functionality implemented in NFS 4.1 (or also 4.2; 4.2 seems to crash much faster, in 5-15s instead of 20-30s).

The exploit/test code starts 30 threads, and each makes random file ops on 30 files (rename/write/read/lock). Everything works fine if we're _NOT_ doing flock in LOCK_EX mode (it can be disabled in the code). If the lock is taken with LOCK_SH instead, everything still works fine. So it is flock with LOCK_EX issued in parallel that causes the server to lock up or crash. It usually takes 2-3 runs for any server to crash (each run takes 5s).
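
For illustration, a minimal sketch of this kind of stress pattern (this is NOT the actual test code from the github repo below; the thread/file counts, file names and the mix of ops are just placeholders):

/* Stress sketch: N threads hammering flock(LOCK_EX) + write/rename on
 * a small set of files living on the NFS 4.1/4.2 mount.
 * Compile with: gcc -O2 -pthread sketch.c -o sketch
 * Run it from a directory on the NFS mount.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

#define THREADS 30
#define FILES   30

static void *worker(void *arg)
{
	char name[64], tmp[64];
	(void)arg;

	for (;;) {
		int n = rand() % FILES;
		snprintf(name, sizeof(name), "testfile_%d", n);

		int fd = open(name, O_RDWR | O_CREAT, 0644);
		if (fd < 0)
			continue;

		/* LOCK_EX is what triggers the crash; with LOCK_SH
		 * (or no flock at all) everything keeps working. */
		if (flock(fd, LOCK_EX) == 0) {
			if (write(fd, "x", 1) < 0)
				perror("write");
			flock(fd, LOCK_UN);
		}
		close(fd);

		/* mix in renames like the real test does */
		if (rand() % 4 == 0) {
			snprintf(tmp, sizeof(tmp), "testfile_%d.tmp", n);
			rename(name, tmp);
			rename(tmp, name);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];

	for (int i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	sleep(5);	/* one run of the real tool takes ~5s */
	return 0;	/* workers die with the process */
}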

I have put everything, including logs, test code, compile instructions, and more info on github:
https://github.com/slawomir-pryczek/drbd_kill (it's called this way because I first thought it was related to drbd, but it's not)

This was tested on Fedora 29 and Fedora 30, on several machines and several different VMs. I'm reporting this as high severity because this code can crash any Fedora server with a new kernel and an exposed NFS 4 share in seconds (VMs as well). Also, a couple of times after this code was run and the machine crashed, we lost all data on the whole partition (it couldn't be mounted; the error was something like "invalid filesystem").

I also ran this code on Debian 9 with a somewhat older kernel (4.0.9-4-amd64). Same thing, except that instead of crashing, the NFS server locks up and the machine needs to be restarted, so it seems this bug has been there for a long time; it just got worse in recent kernel releases...

When we used NFS 4.2 on FC26 the bug manifested the same way as on the older Debian kernels (no errors in dmesg, the NFS server just stopped responding and needed to be restarted; no server crashes), and it happened very rarely (every couple of months). Now that we have tested the new kernel and FC29 in production, the machine is basically dead within 5-10 minutes(!), and it happens again and again every time (the test tool generates more FS operations than our normal production traffic, which is probably why it takes much less time to kill the server).
Comment 1 Slawomir Pryczek 2019-04-19 11:10:17 UTC
Created attachment 282405 [details]
FC 29 crash log
Comment 2 bfields 2019-04-19 14:09:56 UTC
Thanks for the report!  The lock code has seen some change recently, and it's possible the problems with older kernels are different bugs.  I think it's best to start with the most recent results.  From your Fedora 30 log, it looks like posix_lock_file attempted to dereference an obviously bogus address:

Apr 19 00:02:38 localhost.localdomain kernel: BUG: unable to handle kernel paging request at 000000e8000000e8
Apr 19 00:02:38 localhost.localdomain kernel: #PF error: [normal kernel read fault]
Apr 19 00:02:38 localhost.localdomain kernel: PGD 80000000a7d91067 P4D 80000000a7d91067 PUD 0 
Apr 19 00:02:38 localhost.localdomain kernel: Oops: 0000 [#1] SMP PTI
Apr 19 00:02:38 localhost.localdomain kernel: CPU: 2 PID: 1007 Comm: nfsd Not tainted 5.0.7-300.fc30.x86_64 #1
Apr 19 00:02:38 localhost.localdomain kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
Apr 19 00:02:38 localhost.localdomain kernel: RIP: 0010:posix_lock_inode+0x54a/0x8c0
Apr 19 00:02:38 localhost.localdomain kernel: Code: 49 89 fe 0f 84 bd 00 00 00 49 8b 86 b0 00 00 00 48 85 c0 74 08 48 8b 10 48 85 d2 75 c5 48 8b 45 48 49 39 46 48 75 c8 4c 89 f5 <4d> 8b 36 4d 85 f6 75 f5 41 83 ef 01 0f 84 8a 00 00 00 49 8b 94 24
Apr 19 00:02:38 localhost.localdomain kernel: RSP: 0018:ffffac108243bcc0 EFLAGS: 00010206
Apr 19 00:02:38 localhost.localdomain kernel: RAX: ffff94c3a9eaf598 RBX: ffff94c4388c6b98 RCX: 61c8864680b583eb
Apr 19 00:02:38 localhost.localdomain kernel: RDX: 0000000000000000 RSI: ffff94c42e7ce360 RDI: ffff94c3a8f51a28
Apr 19 00:02:38 localhost.localdomain kernel: RBP: 000000e8000000e8 R08: ffff94c439ca80c0 R09: 0000000000000000
Apr 19 00:02:38 localhost.localdomain kernel: R10: ffff94c434b0d7a0 R11: ffff94c3a8f51a28 R12: ffff94c3a8f51a28
Apr 19 00:02:38 localhost.localdomain kernel: R13: ffff94c3aa996df8 R14: 000000e8000000e8 R15: 000000000000000c
Apr 19 00:02:38 localhost.localdomain kernel: FS:  0000000000000000(0000) GS:ffff94c439c80000(0000) knlGS:0000000000000000
Apr 19 00:02:38 localhost.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 19 00:02:38 localhost.localdomain kernel: CR2: 000000e8000000e8 CR3: 00000000aa6a0004 CR4: 00000000001606e0
Apr 19 00:02:38 localhost.localdomain kernel: Call Trace:
Apr 19 00:02:38 localhost.localdomain kernel:  ? _cond_resched+0x15/0x30
Apr 19 00:02:38 localhost.localdomain kernel:  nfsd4_lock+0x710/0x1110 [nfsd]
Apr 19 00:02:38 localhost.localdomain kernel:  nfsd4_proc_compound+0x34b/0x660 [nfsd]
Apr 19 00:02:38 localhost.localdomain kernel:  ? nfs4svc_decode_compoundargs+0x40f/0x4d0 [nfsd]
Apr 19 00:02:38 localhost.localdomain kernel:  nfsd_dispatch+0xc1/0x200 [nfsd]
Apr 19 00:02:38 localhost.localdomain kernel:  svc_process_common+0x44c/0x750 [sunrpc]
Apr 19 00:02:38 localhost.localdomain kernel:  svc_process+0xb7/0xf0 [sunrpc]
Apr 19 00:02:38 localhost.localdomain kernel:  nfsd+0xe3/0x140 [nfsd]
Apr 19 00:02:38 localhost.localdomain kernel:  kthread+0xfb/0x130
Apr 19 00:02:38 localhost.localdomain kernel:  ? nfsd_destroy+0x50/0x50 [nfsd]
Apr 19 00:02:38 localhost.localdomain kernel:  ? kthread_park+0x90/0x90
Apr 19 00:02:38 localhost.localdomain kernel:  ret_from_fork+0x35/0x40
Apr 19 00:02:38 localhost.localdomain kernel: Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables vmw_vsock_vmci_transport vsock coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel vmw_balloon intel_rapl_perf snd_ens1371 snd_ac97_codec gameport ac97_bus snd_rawmidi snd_seq_device joydev snd_pcm snd_timer snd soundcore vmw_vmci i2c_piix4 pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c vmwgfx crc32c_intel drm_kms_helper ttm serio_raw mptspi drm scsi_transport_spi e1000 mptscsih mptbase ata_generic pata_acpi fuse
Apr 19 00:02:38 localhost.localdomain kernel: CR2: 000000e8000000e8
Apr 19 00:02:38 localhost.localdomain kernel: ---[ end trace 7d4b236eb6560f1e ]---
Comment 3 bfields 2019-04-19 14:11:45 UTC
5.0.7 includes the thundering herd patches.  The only locks.c bugfix I see not included is 945ab8f6de944 "locks: wake any locks blocked on request before deadlock check" which shouldn't cause these symptoms.
Comment 4 bfields 2019-04-19 14:19:25 UTC
But if it's really minorversion specific, then it's less likely a bug in locks.c and more likely something like a bug in the nfsv4 state code that's freeing an object out from under the lock code.
Comment 5 Jeff Layton 2019-04-19 14:38:54 UTC
(In reply to bfields from comment #3)
> 5.0.7 includes the thundering herd patches.  The only locks.c bugfix I see
> not included is 945ab8f6de944 "locks: wake any locks blocked on request
> before deadlock check" which shouldn't cause these symptoms.

Actually, 5.0.6 got that patch so 5.0.7 should also have it. It might be worth trying a v5.0.5 kernel if you have a reliable reproducer.
Comment 6 bfields 2019-04-19 14:44:11 UTC
I forgot to ask: what filesystem are you exporting?
Comment 7 Jeff Layton 2019-04-19 15:01:44 UTC
In fact:

(gdb) list *(posix_lock_inode+0x4cf)
0xffffffff8131497f is in posix_lock_inode (fs/locks.c:879).
874	/* Find a lock that the owner of the given block_fl is blocking on. */
875	static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
876	{
877		struct file_lock *fl;
878	
879		hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
880			if (posix_same_owner(fl, block_fl))
881				return fl->fl_next;
882		}
883		return NULL;

Which makes me suspect the recent changes to deadlock detection. If you haven't already, testing this on v5.0.5 would help determine whether those changes are related to the issue.
Comment 8 bfields 2019-04-19 15:10:48 UTC
Oh, and we only set FL_SLEEP (and run deadlock detection) in the 4.1 case, so that'd also explain the minorversion difference; never mind my comment 4.
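
Roughly, the gating in nfsd4_lock() looks like this (a paraphrased sketch, not an exact quote of fs/nfsd/nfs4state.c; the local variable names are approximations):

/* Paraphrased sketch of the lock-type handling in nfsd4_lock():
 * only the "blocking" lock types get FL_SLEEP, and only when the
 * compound came in over a session (i.e. minorversion >= 1), since
 * only then can we send CB_NOTIFY_LOCK when the lock frees up.
 * posix_lock_inode() in turn only runs its deadlock check for
 * requests that have FL_SLEEP set.
 */
switch (lock->lk_type) {
case NFS4_READW_LT:
	if (nfsd4_has_session(cstate))
		fl_flags |= FL_SLEEP;
	/* fallthrough */
case NFS4_READ_LT:
	fl_type = F_RDLCK;
	break;
case NFS4_WRITEW_LT:
	if (nfsd4_has_session(cstate))
		fl_flags |= FL_SLEEP;
	/* fallthrough */
case NFS4_WRITE_LT:
	fl_type = F_WRLCK;
	break;
default:
	status = nfserr_inval;
	goto out;
}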
Comment 9 Jeff Layton 2019-04-19 17:44:35 UTC
NVM on 5.0.5. I tried Slawomir's reproducer and seem to get a consistent oops on the second run of "./main". I tried a v5.0.5 kernel and it didn't help.

Now that I read over the original problem description more carefully, it sounds like this is an older bug, maybe going back to the original lock callback work.

Many thanks for the work on the reproducer. That helps immensely. I'll plan to do some more careful testing with it, and we should be able to sort out what's going on soon.
Comment 10 Slawomir Pryczek 2019-04-19 18:06:33 UTC
Thanks guys for all the comments and for working on it. A quick note about the different kernel versions: we went straight from FC26 to FC29/30, so I'm not sure whether the problem is present in FC27 or 28. I will try to do some tests when I have more time and post the results here.

@bfields - as for the filesystem, we always used ext4, both in production and in my tests
Comment 11 Jeff Layton 2019-04-19 18:53:16 UTC
Created attachment 282415 [details]
debug patch

This is a debug patch I added today. With it applied to the kernel, I got this oops with the reproducer:

[  123.547980] ------------[ cut here ]------------
[  123.548876] kernel BUG at fs/nfsd/nfs4state.c:272!
[  123.549723] invalid opcode: 0000 [#1] SMP NOPTI
[  123.550832] CPU: 0 PID: 110 Comm: kworker/u32:1 Not tainted 5.1.0-rc5+ #99
[  123.551994] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
[  123.554045] Workqueue: rpciod rpc_async_schedule [sunrpc]
[  123.554983] RIP: 0010:free_blocked_lock+0x63/0x70 [nfsd]
[  123.555877] Code: 47 60 48 8d 57 60 48 39 c2 75 21 48 83 7f 48 00 75 1c 48 89 fb 48 89 cf e8 4a 04 d9 d5 48 89 df 5b e9 21 5d cf d5 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 81 ef 88
[  123.558723] RSP: 0018:ffffa5d400b4be08 EFLAGS: 00010206
[  123.559444] RAX: ffff89cc37525460 RBX: ffff89cc30ddaf00 RCX: ffff89cc37527028
[  123.560873] RDX: ffff89cc37527050 RSI: 0000000000000000 RDI: ffff89cc37527000
[  123.562615] RBP: 0000000000000e81 R08: 0000000000000000 R09: ffffffffc0541020
[  123.564097] R10: 0000000000000003 R11: 0000000000000400 R12: 0000000000000000
[  123.565261] R13: 0000000000000e81 R14: ffffffffc0504770 R15: 0000000000000000
[  123.566450] FS:  0000000000000000(0000) GS:ffff89cc3b800000(0000) knlGS:0000000000000000
[  123.567857] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  123.568637] CR2: 00007fd020010cb8 CR3: 000000013a322000 CR4: 00000000000006f0
[  123.569747] Call Trace:
[  123.570172]  rpc_free_task+0x39/0x70 [sunrpc]
[  123.570909]  __rpc_execute+0x330/0x340 [sunrpc]
[  123.571624]  ? __switch_to_asm+0x40/0x70
[  123.572194]  rpc_async_schedule+0x29/0x40 [sunrpc]
[  123.572948]  process_one_work+0x1a1/0x3a0
[  123.573517]  worker_thread+0x30/0x380
[  123.574054]  ? process_one_work+0x3a0/0x3a0
[  123.574637]  kthread+0x112/0x130
[  123.575117]  ? kthread_create_on_node+0x60/0x60
[  123.575767]  ret_from_fork+0x22/0x40
[  123.576282] Modules linked in: rpcsec_gss_krb5 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables joydev virtio_balloon pcspkr i2c_piix4 edac_mce_amd nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c qxl drm_kms_helper ttm virtio_net net_failover ata_generic virtio_console failover serio_raw virtio_blk drm pata_acpi floppy qemu_fw_cfg
[  123.583544] ---[ end trace c19a9186f99a6712 ]---

That's:

        BUG_ON(!list_empty(&fl->fl_blocked_requests));

...so it looks like we have some blocked locks that are being freed without being properly cleaned up.
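
For context, the RIP above points at free_blocked_lock() in fs/nfsd/nfs4state.c. Roughly the kind of check the debug patch adds (a sketch, not the attachment verbatim):

/* Sketch of the debug check: before an nfsd4_blocked_lock is freed,
 * nothing in the file locking core should still reference its
 * embedded file_lock (nbl_lock).
 */
static void
free_blocked_lock(struct nfsd4_blocked_lock *nbl)
{
	struct file_lock *fl = &nbl->nbl_lock;

	/* nothing may still be queued up waiting behind this request... */
	BUG_ON(!list_empty(&fl->fl_blocked_requests));
	/* ...and it must not itself still be queued behind a blocker */
	BUG_ON(!list_empty(&fl->fl_blocked_member));

	locks_release_private(fl);
	kfree(nbl);
}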
Comment 12 Slawomir Pryczek 2019-04-19 19:39:43 UTC
I made tests with FC28 (4.16.3-301.fc28.x86_64 #1 SMP).

Everything works fine with this one; there are just some errors in dmesg, but they don't seem to have any bad effects on the server, so this bug appears to have been introduced after this kernel...


>[ 1368.904463] NFS: nfs4_handle_delegation_recall_error: unhandled error -512.
>[ 1413.608628] RPC request reserved 200 but used 268
>[ 1414.538163] RPC request reserved 200 but used 268
>[ 1420.472541] NFS: nfs4_handle_delegation_recall_error: unhandled error -512.
>[ 1565.025006] NFS: nfs4_handle_delegation_recall_error: unhandled error -512.
Comment 13 Jeff Layton 2019-04-20 01:40:57 UTC
Created attachment 282429 [details]
Draft patch

This patch seems to fix it for me. I was able to run your test program in a loop for many iterations with this patch added. I need to do a bit of digging to figure out which patch caused the problem, but it would have been introduced in v5.0.

This may also not be the best fix. This wakes up blocked locks after the callback RPC is done, but there is probably a case to be made for doing this before we send the request (maybe in the prepare phase of the callback).

Anyway, it's worth testing this if you are able.
Comment 14 Jeff Layton 2019-04-20 01:42:47 UTC
I know you mentioned a problem with a 4.0.9-based kernel too; most likely that is a different bug.
Comment 15 Jeff Layton 2019-04-20 10:05:39 UTC
Created attachment 282431 [details]
patch - delete block before sending callback

This patch also fixes the issue and I think I like it better. This one wakes all of the locks blocked on the lock callback before sending it.
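
Roughly, the idea is the following (a simplified sketch, not the attachment verbatim; it assumes locks_delete_block() also wakes the requests queued behind the waiter, as in the post-v5.0 blocked-lock code):

/* Sketch: nfsd4_lm_notify() runs when the blocked lock can be granted.
 * Before queueing the CB_NOTIFY_LOCK callback, detach the embedded
 * file_lock and wake anything that queued up behind it, so nothing is
 * left pointing at the nfsd4_blocked_lock when it is later freed.
 */
static void
nfsd4_lm_notify(struct file_lock *fl)
{
	struct nfsd4_blocked_lock *nbl = container_of(fl,
				struct nfsd4_blocked_lock, nbl_lock);
	bool queue = false;

	/* ... under the per-net blocked_locks_lock: if the block is
	 * still on the nbl_list/nbl_lru lists, unhook it and set
	 * queue = true ... */

	if (queue) {
		locks_delete_block(fl);		/* wake waiters behind us */
		nfsd4_run_cb(&nbl->nbl_cb);
	}
}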
Comment 16 Jeff Layton 2019-04-20 10:08:42 UTC
Note too that I do see a lot of log messages like this here:

[  640.080843] RPC request reserved 200 but used 324

...which looks like an unrelated bug, but something that we should track down and fix.
Comment 17 Jeff Layton 2019-04-20 10:16:14 UTC
Patch sent to linux-nfs mailing list.
Comment 18 Jeff Layton 2019-04-20 10:17:16 UTC
https://patchwork.kernel.org/patch/10910147/
Comment 19 bfields 2022-01-21 21:37:21 UTC
Looks like this has been fixed?  Reopen (or open a new bug) if not.
