Background: We use CIFS extensively in our firmware to allow file sharing from a Windows or a Samba server to the appliance firmware. This was all working fine upto kernel versions 3.14.77 until we recently upgraded to LTS 3.16.38. Problem: In 3.16.38, we noticed that when the Windows or Samba share goes offline(by means of a shutdown or a network change @ the share), processes on the appliance goes to a hung state and the system stops responding to requests. It remains in the state for an unpredictable period of time ranging from 1 hour to a worst case time of approx 6 hours. During this period, SSH to the device hangs and does not respond, on a logged in SSH session execution of any command including ls, ps, anything just hangs and sometimes does not even respond to Ctrl-C and nothing of any help could be executed. This used to work fine in 3.14.77 where a file operation accessing the share would return a valid 112 error which the applications used to detect and unmount the drive immediately and trigger application logic to manage recovery such as sending alerts and so on. This is not the case in the new kernel and as mentioned the entire system does not respond. Once the appliance comes out of the hung state, the below messages can be seen in /var/log/messages: -------------------------- Feb 17 15:26:55 hostname INFO: task java:3493 blocked for more than 120 seconds. Feb 17 15:26:55 hostname Not tainted 3.16.38 #1 Feb 17 15:26:55 hostname "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Feb 17 15:26:55 hostname java D ffff8800ca3aa860 0 3493 1 0x00000000 Feb 17 15:26:55 hostname ffff8801fe3ffc80 0000000000000086 ffff8800ca3aa3d0 0000000000010240 Feb 17 15:26:55 hostname ffff8801d7d22490 ffff8801fe3fffd8 0000000000010240 ffff88015476a520 Feb 17 15:26:55 hostname ffff88015476a524 ffff8800ca3aa3d0 00000000ffffffff ffff88015476a528 Feb 17 15:26:55 hostname Call Trace: Feb 17 15:26:55 hostname [<ffffffff814a3334>] schedule_preempt_disabled+0x24/0x70 Feb 17 15:26:55 hostname [<ffffffff814a4b7a>] __mutex_lock_slowpath+0xca/0x1c0 Feb 17 15:26:55 hostname [<ffffffff810f9d2e>] ? lookup_fast+0x24e/0x2c0 Feb 17 15:26:55 hostname [<ffffffff814a4c82>] mutex_lock+0x12/0x22 Feb 17 15:26:55 hostname [<ffffffff810fdcd7>] do_last.isra.60+0x2f7/0x10c0 Feb 17 15:26:55 hostname [<ffffffff810fab69>] ? link_path_walk+0x69/0x860 Feb 17 15:26:55 hostname [<ffffffff8109e90f>] ? futex_wait+0x18f/0x250 Feb 17 15:26:55 hostname [<ffffffff810feb52>] path_openat+0xb2/0x5f0 Feb 17 15:26:55 hostname [<ffffffff810ff6c5>] do_filp_open+0x35/0x80 Feb 17 15:26:55 hostname [<ffffffff8110b1ed>] ? __alloc_fd+0x7d/0x120 Feb 17 15:26:55 hostname [<ffffffff810f0083>] do_sys_open+0x123/0x220 Feb 17 15:26:55 hostname [<ffffffff810f01af>] SyS_openat+0xf/0x20 Feb 17 15:26:55 hostname [<ffffffff814a66b2>] system_call_fastpath+0x16/0x1b Feb 17 15:26:55 hostname INFO: task java:3494 blocked for more than 120 seconds. Feb 17 15:26:55 hostname Not tainted 3.16.38 #1 Feb 17 15:26:55 hostname "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Feb 17 15:26:55 hostname java D ffff8800a5d9a5e0 0 3494 1 0x00000000 Feb 17 15:26:55 hostname ffff8800a5f2fc80 0000000000000086 ffff8800a5d9a150 0000000000010240 Feb 17 15:26:55 hostname ffff8800d6936090 ffff8800a5f2ffd8 0000000000010240 ffff88015476a520 Feb 17 15:26:56 hostname ffff88015476a524 ffff8800a5d9a150 00000000ffffffff ffff88015476a528 -------------------------- There would be multiple tasks that are in this state. And once the system becomes responsive, executing CIFS mount command to the unreachable share puts the process performing mount go into 'D' state with the below stack: # cat /proc/12554/stack [<f836158d>] cifs_get_smb_ses+0x16d/0x290 [cifs] [<f8361a08>] cifs_mount+0x358/0x670 [cifs] [<f83565df>] cifs_do_mount+0x6f/0x370 [cifs] [<c10c8706>] mount_fs+0x16/0xb0 [<c10de344>] vfs_kern_mount+0x44/0xd0 [<c10e0640>] do_mount+0x1b0/0xab0 [<c10e11e7>] SyS_mount+0x77/0xb0 [<c13968c9>] sysenter_after_call+0x0/0x14 Command used to mount, defaulting to the "soft" mount option: mount -t cifs //x.x.x.x/data1 /etc/tempmountpoint/ -o user=username,pass=password Searched a lot for similar issues/patches for this kernel version but none of the solutions helped. I should also report 2 other observations: 1. When the system has recovered from the hung state and the share is still inaccessible, rebooting the system gets it back to a good state and the application logic waits for the share to become available again or provides means to reconfigure. 2. When the system is in the hung state and the share server becomes accessible again, the system recovers almost immediately and comes back to a working state and continues to work correctly.
Upon noticing fixes related to CIFS in 3.16.40, upgrading to it did not help either.
Created attachment 254905 [details] Kernel Configuration Attaching Kernel configuration. Please let me know if the any other information is required for debugging. Also, the issue is 100% reproducible.
I have the same problem on 4.4 and 4.8 kernels (Ubuntu 14.04 LTS and 16.04 LTS) and share on Windows Server 2003 and/or Windows Server 2008 R2. For testing I used simplest form of mount: sudo mount -t cifs //srv/public /home/user/public -o username=user,rw,uid=user,gid=users . And if /home/user/public is not used for some time (stays idle for 10-60 minutes) the connection stalls and next operation with it (i.e. "ls /home/user/public" ) will hang for dozens of minutes or even hours. It's all started from kernel 4.4.16. Using git bisect between builds 4.4.15 and 4.4.16 pointed the following patch to be the cause of the problem: "Fix reconnect to not defer smb3 session reconnect long after socket reconnect" https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.4.16&id=4ce7aa4e44d88ce64ea8ae2337b8910f3670b0ba This patch finds its way both to 4.8 ( https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.8.17&id=4fcd1813e6404dd4420c7d12fb483f9320f0bf93 ) and 3.16 trees ( https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v3.16.38&id=2e4378ee60049b752c9dce16f62ce6fbd11b379a )
It seems that the bug was corrected in main line between builds 4.10.12 and 4.10.13. Reverse bisecting points to commit 5cd77ebf2254e6f27753ec2041fa5e084bf3eb5e which is a backport of commit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit?id=62a6cfddcc0a5313e7da3e8311ba16226fe0ac10 . So the question is whether this fix finds its way back to old (supported) lines of kernel. I found some discussion on the question here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1686099
Patch for issue: "Fix reconnect to not defer smb3 session reconnect long after socket reconnect" https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.4.16&id=4ce7aa4e44d88ce64ea8ae2337b8910f3670b0ba is already on kernel version 3.16.38 and I still see the system being unresponsive when the remote share goes offline.
The following patch fixes the issue with reconnects: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53e0e11efe9289535b060a51d4cf37c25e0d0f2b v3.16.42: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/fs/cifs?h=linux-3.16.y&id=0ba4c6eaaacbcc4b18f51bb3b1567c65a8fecca9 v4.4.40: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/fs/cifs?h=linux-4.4.y&id=f0b715409cb9cf7e21e690f9b163047739761962 v4.8.16: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/fs/cifs?h=linux-4.8.y&id=ff04da387c10b6bf7b510392742c8cd46c130fd6