Bug 201655
Summary: | nfsd hangs forever sporadically on rwsem_down_read_failed | |
---|---|---|---
Product: | File System | Reporter: | Kevin Liu (kevin)
Component: | NFS | Assignee: | bfields
Status: | RESOLVED UNREPRODUCIBLE | |
Severity: | high | CC: | bfields, forum, qball, trondmy, tytso
Priority: | P1 | |
Hardware: | Intel | |
OS: | Linux | |
Kernel Version: | 4.19, 4.19.1, 4.18.17 | Subsystem: |
Regression: | No | Bisected commit-id: |
Description
Kevin Liu
2018-11-10 20:59:23 UTC
Comment 1 (bfields):

Those are the only processes it reports blocked? I'm guessing it's the inode_lock at the start of ext4_file_write_iter that's blocking. On a quick look I don't see any of the callers taking any locks, so I'd expect that elsewhere there's a process holding that inode lock and blocking on something else. Based just on this, my first guess would be a VFS or maybe ext4 bug rather than an nfsd bug, but I'm not seeing how to reassign. It may be worth reporting to the relevant mailing lists.

Comment 2 (Kevin Liu):

No, there are more blocked processes; looking at htop when this happens (assuming htop itself doesn't freeze), every process reading from NFS blocks. Thanks for the analysis! I'll try posting on the mailing lists about it.

Comment 3 (Ted Ts'o):

From my reply on linux-ext4: Sorry, this isn't something I've seen before, and it's not at all obvious from the information in Bugzilla what's causing the deadlock. The down_read() appears to be in mm/memory.c, in __access_remote_vm() getting called from proc_pid_cmdline_read(). __access_remote_vm() is apparently trying to take a shared lock on &mm->mmap_sem; how this would get involved with the inode_lock() is not immediately obvious. Things I would suggest:

1) Try running your kernel console log through ./scripts/decode_stacktrace.sh so we can be sure we've correctly assessed where the kernel is grabbing which lock. Enabling CONFIG_DEBUG_INFO and CONFIG_DEBUG_INFO_REDUCED will be helpful.

2) Try turning on CONFIG_LOCKDEP and see if it reports a potential deadlock.

3) Try using sysrq-d to find all held locks (running the resulting kernel console output through decode_stacktrace.sh will also be helpful).

Comment 4:

I have a (similar?) problem. Where do I start debugging?

[Mo Sep 30 21:37:47 2019] INFO: task nfsd:1557 blocked for more than 122 seconds.
[Mo Sep 30 21:37:47 2019] Not tainted 5.2.14-artix1-1-ARTIX #1
[Mo Sep 30 21:37:47 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mo Sep 30 21:37:47 2019] nfsd D 0 1557 2 0x80004000
[Mo Sep 30 21:37:47 2019] Call Trace:
[Mo Sep 30 21:37:47 2019]  ? __schedule+0x27f/0x6d0
[Mo Sep 30 21:37:47 2019]  schedule+0x43/0xd0
[Mo Sep 30 21:37:47 2019]  rwsem_down_write_failed+0x192/0x280
[Mo Sep 30 21:37:47 2019]  ? btrfs_lookup_fs_root+0x32/0x40 [btrfs]
[Mo Sep 30 21:37:47 2019]  ? __d_obtain_alias+0x31/0x80
[Mo Sep 30 21:37:47 2019]  down_write+0x38/0x40
[Mo Sep 30 21:37:47 2019]  btrfs_file_write_iter+0xc1/0x630 [btrfs]
[Mo Sep 30 21:37:47 2019]  do_iter_readv_writev+0x166/0x1e0
[Mo Sep 30 21:37:47 2019]  do_iter_write+0x7d/0x190
[Mo Sep 30 21:37:47 2019]  nfsd_vfs_write+0x127/0x4b0 [nfsd]
[Mo Sep 30 21:37:47 2019]  nfsd4_write+0xf1/0x230 [nfsd]
[Mo Sep 30 21:37:47 2019]  nfsd4_proc_compound+0x334/0x740 [nfsd]
[Mo Sep 30 21:37:47 2019]  nfsd_dispatch+0xc1/0x200 [nfsd]
[Mo Sep 30 21:37:47 2019]  svc_process_common.isra.0+0x3cf/0x750 [sunrpc]
[Mo Sep 30 21:37:47 2019]  ? svc_sock_secure_port+0x12/0x30 [sunrpc]
[Mo Sep 30 21:37:47 2019]  ? nfsd_svc+0x2d0/0x2d0 [nfsd]
[Mo Sep 30 21:37:47 2019]  svc_process+0xe2/0x120 [sunrpc]
[Mo Sep 30 21:37:47 2019]  nfsd+0xed/0x150 [nfsd]
[Mo Sep 30 21:37:47 2019]  kthread+0xfb/0x130
[Mo Sep 30 21:37:47 2019]  ? nfsd_destroy+0x60/0x60 [nfsd]
[Mo Sep 30 21:37:47 2019]  ? kthread_park+0x80/0x80
[Mo Sep 30 21:37:47 2019]  ret_from_fork+0x1f/0x40

Comment 5:

For about a month now I have been hitting this bug too, under Ubuntu 20.04.

Jan 25 09:11:32 o9 kernel: [  604.872233] INFO: task nfsd:894 blocked for more than 120 seconds.
Jan 25 09:11:32 o9 kernel: [  604.872273] Not tainted 5.8.0-40-generic #45~20.04.1-Ubuntu
Jan 25 09:11:32 o9 kernel: [  604.872310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 25 09:11:32 o9 kernel: [  604.872358] nfsd D 0 894 2 0x00004000
Jan 25 09:11:32 o9 kernel: [  604.872360] Call Trace:
Jan 25 09:11:32 o9 kernel: [  604.872365]  __schedule+0x394/0xa60
Jan 25 09:11:32 o9 kernel: [  604.872369]  schedule+0x55/0xc0
Jan 25 09:11:32 o9 kernel: [  604.872372]  rwsem_down_write_slowpath+0x239/0x4b0
Jan 25 09:11:32 o9 kernel: [  604.872377]  down_write+0x41/0x50
Jan 25 09:11:32 o9 kernel: [  604.872385]  ovl_dir_release+0x2f/0x70 [overlay]
Jan 25 09:11:32 o9 kernel: [  604.872388]  __fput+0xe9/0x250
Jan 25 09:11:32 o9 kernel: [  604.872390]  flush_delayed_fput+0x23/0x30
Jan 25 09:11:32 o9 kernel: [  604.872407]  nfsd_file_close_inode_sync+0x15a/0x170 [nfsd]
Jan 25 09:11:32 o9 kernel: [  604.872421]  nfsd_unlink+0x256/0x2b0 [nfsd]
Jan 25 09:11:32 o9 kernel: [  604.872437]  nfsd3_proc_remove+0x74/0x110 [nfsd]
Jan 25 09:11:32 o9 kernel: [  604.872450]  nfsd_dispatch+0xe2/0x220 [nfsd]
Jan 25 09:11:32 o9 kernel: [  604.872476]  svc_process_common+0x47b/0x6f0 [sunrpc]
Jan 25 09:11:32 o9 kernel: [  604.872501]  ? svc_sock_secure_port+0x16/0x30 [sunrpc]
Jan 25 09:11:32 o9 kernel: [  604.872513]  ? nfsd_svc+0x330/0x330 [nfsd]
Jan 25 09:11:32 o9 kernel: [  604.872538]  svc_process+0xbc/0xf0 [sunrpc]
Jan 25 09:11:32 o9 kernel: [  604.872551]  nfsd+0xe8/0x150 [nfsd]
Jan 25 09:11:32 o9 kernel: [  604.872555]  kthread+0x114/0x150
Jan 25 09:11:32 o9 kernel: [  604.872568]  ? nfsd_destroy+0x60/0x60 [nfsd]
Jan 25 09:11:32 o9 kernel: [  604.872571]  ? kthread_park+0x90/0x90
Jan 25 09:11:32 o9 kernel: [  604.872575]  ret_from_fork+0x22/0x30

This box serves the root file system via NFS for several small boards. Once all NFS processes hang, I have to restart the machine to get everything running again.

Comment 6 (bfields):

Comments 4 and 5 are for different filesystem operations on different filesystems, with different NFS versions.
It's possible they could all come down to the same nfsd bug somehow, but my first guess would be that they're unrelated to each other and to the originally reported bug, so I'd prefer separate bugs. That said, the debugging steps Ted describes in comment 3 look like good starting points for debugging all three reports.

Comment 7:

Assuming this is no longer an issue. Reopen if it actually is. (And please consider a new bug if you're in any doubt as to whether it's the same as the bug in the original description.)
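For reference, the debugging steps Ted describes in comment 3 can be sketched as a short shell session. This is a hedged sketch, not a prescribed procedure: the kernel source path and log file name are assumptions about your setup, decode_stacktrace.sh only gives useful output if the kernel was built with CONFIG_DEBUG_INFO=y, and the sysrq step needs root. The script below deliberately prints the commands rather than running them, since both steps depend on local paths and privileges.

```shell
#!/bin/sh
# Sketch of the debugging workflow from comment 3.
# KSRC (kernel source tree) and LOG (saved console log) are assumptions;
# override them in the environment for your own machine.
KSRC=${KSRC:-/usr/src/linux}
LOG=${LOG:-console.log}

# 1) Symbolize the raw stack traces so we can see exactly which lock is
#    being taken where (requires CONFIG_DEBUG_INFO=y in the running kernel).
decode_cmd="$KSRC/scripts/decode_stacktrace.sh $KSRC/vmlinux < $LOG"
echo "step 1: $decode_cmd"

# 2) Lockdep: rebuild with CONFIG_PROVE_LOCKING=y (which selects
#    CONFIG_LOCKDEP), reproduce the hang, and watch dmesg for a
#    "possible circular locking dependency detected" report.

# 3) Dump all currently held locks with sysrq-d (root only; also needs
#    CONFIG_LOCKDEP). The output lands on the console, so it can be fed
#    back through decode_stacktrace.sh as well.
sysrq_cmd="echo d > /proc/sysrq-trigger"
echo "step 3 (as root): $sysrq_cmd"
```

Running the script just prints the two commands to execute by hand on the affected server; the lockdep step in the middle is a rebuild-and-observe step with no command to print.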