Bug 29062
Summary: | soft lockup in nfs_commit_inode()? | ||
---|---|---|---|
Product: | File System | Reporter: | Roman Kononov (roman) |
Component: | NFS | Assignee: | Trond Myklebust (trondmy) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | ||
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.36, 2.6.37, 2.6.38 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | more call traces in 2.6.37; nfs: remove extraneous and problematic calls to nfs_clear_request; NFS: Fix a hang/infinite loop in nfs_wb_page() |
Description
Roman Kononov
2011-02-13 18:59:50 UTC
Created attachment 47652
more call traces in 2.6.37
2.6.37 locks up as well
2.6.36 locks up as well

2.6.38-rc4 is likewise

Can you use 'git bisect' to figure out where the problem started between 2.6.35 and 2.6.36?

$ git bisect log
# bad: [f6f94e2ab1b33f0082ac22d71f66385a60d8157f] Linux 2.6.36
# good: [9fe6206f400646a2322096b56c59891d530e8d51] Linux 2.6.35
git bisect start 'v2.6.36' 'v2.6.35' '--' 'fs/nfs'
# skip: [452e93523d9433f83670e7b42cbe75319c208762] NFSv4: Clean up the process of renewing the NFSv4 lease
git bisect skip 452e93523d9433f83670e7b42cbe75319c208762
# skip: [a17c2153d2e271b0cbacae9bed83b0eaa41db7e1] SUNRPC: Move the bound cred to struct rpc_rqst
git bisect skip a17c2153d2e271b0cbacae9bed83b0eaa41db7e1
# bad: [df486a25900f4dba9cdc3886c4ac871951c6aef3] NFS: Fix the selection of security flavours in Kconfig
git bisect bad df486a25900f4dba9cdc3886c4ac871951c6aef3
# bad: [77041ed9b49a9e10f374bfa6e482d30ee7a3d46e] NFSv4: Ensure the lockowners are labelled using the fl_owner and/or fl_pid
git bisect bad 77041ed9b49a9e10f374bfa6e482d30ee7a3d46e
# skip: [c48f4f3541e67881c9eb7c46e052f5ece48ef530] NFSv41: Convert the various reboot recovery ops etc to minor version ops
git bisect skip c48f4f3541e67881c9eb7c46e052f5ece48ef530
# good: [d185a334c748b3ca9de1f3a293fd8a9cf68378ab] NFSv4.1: Simplify nfs41_sequence_done()
git bisect good d185a334c748b3ca9de1f3a293fd8a9cf68378ab
# skip: [a4432345352c2be157ed844603147ac2c82f209c] NFSv41: Deprecate nfs_client->cl_minorversion
git bisect skip a4432345352c2be157ed844603147ac2c82f209c
# good: [035168ab39f66e4946d493f9ee20d11e154f332a] NFSv4.1: Make nfs4_setup_sequence take a nfs_server argument
git bisect good 035168ab39f66e4946d493f9ee20d11e154f332a
# good: [d77d76ffb638bd013782138cca6d8f4918c5afd6] NFSv41: Clean up exclusive create
git bisect good d77d76ffb638bd013782138cca6d8f4918c5afd6
# good: [1f0e890dba5b0f543fea47732116b1c65d55614e] NFSv4: Clean up struct nfs4_state_owner
git bisect good 1f0e890dba5b0f543fea47732116b1c65d55614e
# skip: [daccbded7f153ec84a3baf3136052e41d0eab555] NFSv4: Clean up for lockowner XDR encoding
git bisect skip daccbded7f153ec84a3baf3136052e41d0eab555
# bad: [d3c7b7ccc199ee564177ee914c04771d6bc00295] NFSv4: Add support for the RELEASE_LOCKOWNER operation
git bisect bad d3c7b7ccc199ee564177ee914c04771d6bc00295
# bad: [f11ac8db5d07b6e99d41ff4aa39d878ee5cef1c5] NFSv4: Ensure that we track the NFSv4 lock state in read/write requests.
git bisect bad f11ac8db5d07b6e99d41ff4aa39d878ee5cef1c5

It says that "f11ac8db5d07b6e99d41ff4aa39d878ee5cef1c5 is the first bad commit". Does it make sense? I skipped some revisions because they caused some seemingly unrelated failures and I could neither prove nor disprove the lockup.

Thanks! That certainly helps a lot! The patch in question is indeed known to trigger a problem in the writeback code. A fix was merged into kernel 2.6.37, but does not appear to have been sent to the stable kernel.

I'll attach the fix patch to this bug report, but you can find it as commit 2df485a774ba59c3f43bfe84107672c1d9b731a0 (nfs: remove extraneous and problematic calls to nfs_clear_request)

Created attachment 48482
nfs: remove extraneous and problematic calls to nfs_clear_request
I am confused. Are you saying that the 2df485a7 patch should fix the lockup? Earlier, I tried 2.6.37, which has the patch, and it locked up.

Created attachment 51572
NFS: Fix a hang/infinite loop in nfs_wb_page()
When one of the two waits in nfs_commit_inode() is interrupted, it
returns a non-negative value, which causes nfs_wb_page() to think
that the operation was successful causing it to busy-loop rather
than exiting.
It also causes nfs_file_fsync() to incorrectly report the file as
being successfully committed to disk.
This patch fixes both problems by ensuring that we return an error
if the attempts to wait fail.
In my setup, v2.6.37 with the patches has worked for 15 hours. Without the second patch it fails within an hour.

Thanks. Closing. The patch has been submitted to mainline and stable@kernel.org. Please reopen this report if the bug reoccurs in a future kernel...