Bug 194691

Summary: Unreachable CIFS mount hangs system for an inconsistent amount of time ( between 60 mins to 5 hours)
Product: File System Reporter: Rajesh (rajesh.chivas)
Component: CIFSAssignee: fs_cifs (fs_cifs)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: blocking CC: 2contras, piastryyy, rajesh.chivas, rdiezmail-kernelbugzilla
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.16.38 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Kernel Configuration

Description Rajesh 2017-02-24 07:40:52 UTC
Background:
We use CIFS extensively in our firmware to allow file sharing from a Windows or a Samba server to the appliance firmware. This was all working fine upto kernel versions 3.14.77 until we recently upgraded to LTS 3.16.38. 

Problem:
In 3.16.38, we noticed that when the Windows or Samba share goes offline(by means of a shutdown or a network change @ the share), processes on the appliance goes to a hung state and the system stops responding to requests. It remains in the state for an unpredictable period of time ranging from 1 hour to a worst case time of approx 6 hours. During this period, SSH to the device hangs and does not respond, on a logged in SSH session execution of any command including ls, ps, anything just hangs and sometimes does not even respond to Ctrl-C and nothing of any help could be executed. 

This used to work fine in 3.14.77 where a file operation accessing the share would return a valid 112 error which the applications used to detect and unmount the drive immediately and trigger application logic to manage recovery such as sending alerts and so on. This is not the case in the new kernel and as mentioned the entire system does not respond.

Once the appliance comes out of the hung state, the below messages can be seen in /var/log/messages:
--------------------------
Feb 17 15:26:55 hostname INFO: task java:3493 blocked for more than 120 seconds. 
Feb 17 15:26:55 hostname Not tainted 3.16.38 #1 
Feb 17 15:26:55 hostname "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message. 
Feb 17 15:26:55 hostname java D ffff8800ca3aa860 0 3493 1 0x00000000 
Feb 17 15:26:55 hostname ffff8801fe3ffc80 0000000000000086 ffff8800ca3aa3d0 
0000000000010240 
Feb 17 15:26:55 hostname ffff8801d7d22490 ffff8801fe3fffd8 0000000000010240 
ffff88015476a520 
Feb 17 15:26:55 hostname ffff88015476a524 ffff8800ca3aa3d0 00000000ffffffff 
ffff88015476a528 
Feb 17 15:26:55 hostname Call Trace: 
Feb 17 15:26:55 hostname [<ffffffff814a3334>] schedule_preempt_disabled+0x24/0x70 
Feb 17 15:26:55 hostname [<ffffffff814a4b7a>] __mutex_lock_slowpath+0xca/0x1c0 
Feb 17 15:26:55 hostname [<ffffffff810f9d2e>] ? lookup_fast+0x24e/0x2c0 
Feb 17 15:26:55 hostname [<ffffffff814a4c82>] mutex_lock+0x12/0x22 
Feb 17 15:26:55 hostname [<ffffffff810fdcd7>] do_last.isra.60+0x2f7/0x10c0 
Feb 17 15:26:55 hostname [<ffffffff810fab69>] ? link_path_walk+0x69/0x860 
Feb 17 15:26:55 hostname [<ffffffff8109e90f>] ? futex_wait+0x18f/0x250 
Feb 17 15:26:55 hostname [<ffffffff810feb52>] path_openat+0xb2/0x5f0 
Feb 17 15:26:55 hostname [<ffffffff810ff6c5>] do_filp_open+0x35/0x80 
Feb 17 15:26:55 hostname [<ffffffff8110b1ed>] ? __alloc_fd+0x7d/0x120 
Feb 17 15:26:55 hostname [<ffffffff810f0083>] do_sys_open+0x123/0x220 
Feb 17 15:26:55 hostname [<ffffffff810f01af>] SyS_openat+0xf/0x20 
Feb 17 15:26:55 hostname [<ffffffff814a66b2>] system_call_fastpath+0x16/0x1b 
Feb 17 15:26:55 hostname INFO: task java:3494 blocked for more than 120 seconds. 
Feb 17 15:26:55 hostname Not tainted 3.16.38 #1 
Feb 17 15:26:55 hostname "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message. 
Feb 17 15:26:55 hostname java D ffff8800a5d9a5e0 0 3494 1 0x00000000 
Feb 17 15:26:55 hostname ffff8800a5f2fc80 0000000000000086 ffff8800a5d9a150 
0000000000010240 
Feb 17 15:26:55 hostname ffff8800d6936090 ffff8800a5f2ffd8 0000000000010240 
ffff88015476a520 
Feb 17 15:26:56 hostname ffff88015476a524 ffff8800a5d9a150 00000000ffffffff 
ffff88015476a528 
--------------------------

There would be multiple tasks that are in this state. And once the system becomes responsive, executing CIFS mount command to the unreachable share puts the process performing mount go into 'D' state with the below stack:

# cat /proc/12554/stack
[<f836158d>] cifs_get_smb_ses+0x16d/0x290 [cifs]
[<f8361a08>] cifs_mount+0x358/0x670 [cifs]
[<f83565df>] cifs_do_mount+0x6f/0x370 [cifs]
[<c10c8706>] mount_fs+0x16/0xb0
[<c10de344>] vfs_kern_mount+0x44/0xd0
[<c10e0640>] do_mount+0x1b0/0xab0
[<c10e11e7>] SyS_mount+0x77/0xb0
[<c13968c9>] sysenter_after_call+0x0/0x14


Command used to mount, defaulting to the "soft" mount option:
mount -t cifs //x.x.x.x/data1 /etc/tempmountpoint/ -o user=username,pass=password

Searched a lot for similar issues/patches for this kernel version but none of the solutions helped. 

I should also report 2 other observations:
1. When the system has recovered from the hung state and the share is still inaccessible, rebooting the system gets it back to a good state and the application logic waits for the share to become available again or provides means to reconfigure.
2. When the system is in the hung state and the share server becomes accessible again, the system recovers almost immediately and comes back to a working state and continues to work correctly.
Comment 1 Rajesh 2017-02-24 07:43:35 UTC
Upon noticing fixes related to CIFS in 3.16.40, upgrading to it did not help either.
Comment 2 Rajesh 2017-02-24 07:51:23 UTC
Created attachment 254905 [details]
Kernel Configuration

Attaching Kernel configuration. Please let me know if the any other information is required for debugging. Also, the issue is 100% reproducible.
Comment 3 Oleg Blashchuk 2017-06-07 20:06:37 UTC
I have the same problem on 4.4 and 4.8 kernels (Ubuntu 14.04 LTS and 16.04 LTS) and share on Windows Server 2003 and/or Windows Server 2008 R2. For testing I used simplest form of mount:
sudo mount -t cifs //srv/public /home/user/public -o username=user,rw,uid=user,gid=users .
And if /home/user/public is not used for some time (stays idle for 10-60 minutes) the connection stalls and next operation with it (i.e. "ls /home/user/public" ) will hang for dozens of minutes or even hours.
It's all started from kernel 4.4.16.
Using git bisect between builds 4.4.15 and 4.4.16 pointed the following patch to be the cause of the problem:
"Fix reconnect to not defer smb3 session reconnect long after socket reconnect" https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.4.16&id=4ce7aa4e44d88ce64ea8ae2337b8910f3670b0ba

This patch finds its way both to 4.8 ( https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.8.17&id=4fcd1813e6404dd4420c7d12fb483f9320f0bf93 ) and 3.16 trees ( https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v3.16.38&id=2e4378ee60049b752c9dce16f62ce6fbd11b379a )
Comment 4 Oleg Blashchuk 2017-06-13 03:35:23 UTC
It seems that the bug was corrected in main line between builds 4.10.12 and 4.10.13. Reverse bisecting points to commit 5cd77ebf2254e6f27753ec2041fa5e084bf3eb5e which is a backport of commit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit?id=62a6cfddcc0a5313e7da3e8311ba16226fe0ac10 . So the question is whether this fix finds its way back to old (supported) lines of kernel. I found some discussion on the question here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1686099
Comment 5 Rajesh 2017-06-27 04:11:15 UTC
Patch for issue:

"Fix reconnect to not defer smb3 session reconnect long after socket reconnect" https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=v4.4.16&id=4ce7aa4e44d88ce64ea8ae2337b8910f3670b0ba

is already on kernel version 3.16.38 and I still see the system being unresponsive when the remote share goes offline.