Bug 194531 - CIFS mounts begin thrashing port 445 if samba server is using deadtime option.
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: CIFS
Hardware: All Linux
Importance: P1 normal
Assignee: fs_cifs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-09 21:07 UTC by Paul Klapperich
Modified: 2017-04-24 16:42 UTC
CC List: 8 users

See Also:
Kernel Version: 4.9
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
default smb.conf with deadtime=1 set (10.29 KB, text/plain)
2017-02-09 21:07 UTC, Paul Klapperich
Wireshark capture with kernel 4.4.52 (28.00 KB, application/octet-stream)
2017-03-16 10:35 UTC, Valantis Trigonis
Wireshark capture with kernel 4.10.2 (25.54 KB, application/octet-stream)
2017-03-16 10:36 UTC, Valantis Trigonis

Description Paul Klapperich 2017-02-09 21:07:42 UTC
Created attachment 254689
default smb.conf with deadtime=1 set

Starting with kernel 4.9, the kernel's CIFS client does not seem to handle the disconnect caused by Samba's deadtime option very well. A few minutes after deadtime triggers a disconnect, the disconnected client starts thrashing port 445. On the server, netstat shows as many as 14,000 open connections, while on the client only one connection is shown as open at a time, with a local port that keeps changing. It still appears to be a problem in 4.10-rc7.

Steps to reproduce:
1. Set up a Samba server using the default config, but add deadtime=1 to the [global] section. See the attachment for an example.
2. Mount one or more shares using mount.cifs. Example:
   mount.cifs //server/homes -o username=someuser /mnt
3. Do not open any files. Wait 1-5 minutes.
4. Notice CPU usage on the server has spiked. Notice netstat shows thousands of open connections from the client. Notice the client is continuously opening new connections to server:445

On the server side I see:
$ sudo netstat -pnet | grep 10.0.1.87 | grep 445 | wc
  5283   47548  649809
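
A breakdown by TCP state may help show whether those entries are live connections or sockets lingering in states such as TIME_WAIT. The following is only a sketch: it assumes the `netstat -nt` column layout seen in this report (local address in column 4, foreign address in column 5, state in column 6), and the function name `count_states` is made up for illustration.

```shell
# count_states CLIENT_IP: read `netstat -nt` output on stdin and print
# "<state> <count>" for connections from CLIENT_IP to local port 445.
count_states() {
    awk -v client="$1" '
        # $4 = local address, $5 = foreign address, $6 = TCP state
        $4 ~ /:445$/ && index($5, client ":") == 1 { n[$6]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Usage on the server (client IP taken from this report):
#   sudo netstat -nt | count_states 10.0.1.87
```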

And on the client, the port is constantly changing:
$ netstat -net | grep 10.0.0.8
tcp        0      0 10.0.1.87:53122         10.0.0.8:445            ESTABLISHED 0          1253359   
$ netstat -net | grep 10.0.0.8
tcp        0      0 10.0.1.87:53700         10.0.0.8:445            ESTABLISHED 0          1253439   
$ netstat -net | grep 10.0.0.8
tcp        0      0 10.0.1.87:53926         10.0.0.8:445            ESTABLISHED 0          1254557   
$ netstat -net | grep 10.0.0.8
tcp        0      0 10.0.1.87:54148         10.0.0.8:445            ESTABLISHED 0          1253578   
$ netstat -net | grep 10.0.0.8
tcp        0      0 10.0.1.87:54352         10.0.0.8:445            ESTABLISHED 0          1253604   
$ netstat -net | grep 10.0.0.8
tcp        0      0 10.0.1.87:54518         10.0.0.8:445            ESTABLISHED 0          1254685   
$ netstat -net | grep 10.0.0.8
tcp        0      0 10.0.1.87:54698         10.0.0.8:445            ESTABLISHED 0          1252177  
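
Rather than rerunning netstat by hand, the local port can be sampled in a loop. This is a rough sketch assuming the same `netstat -nt` column layout as above; `local_port_to` is a hypothetical helper name, not an existing tool.

```shell
# local_port_to SERVER_IP: read `netstat -nt` output on stdin and print the
# local port of each connection whose foreign address is SERVER_IP:445.
local_port_to() {
    awk -v remote="$1:445" '$5 == remote { sub(/.*:/, "", $4); print $4 }'
}

# Usage on the client: sample once per second. A stable connection should
# keep printing the same port; an affected mount prints a different one
# on most samples.
#   while sleep 1; do netstat -nt | local_port_to 10.0.0.8; done
```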

Workarounds:
- downgrade to a 4.8.x kernel.
- set deadtime = 0 in the smb.conf used on the server
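
Before changing the server config, it may be useful to check which deadtime value is currently set. The sketch below is an assumption-laden helper (the name `global_deadtime` is invented here) that only understands plain `deadtime = N` lines in the [global] section; it does not follow include files or other smb.conf syntax variants.

```shell
# global_deadtime FILE: print the value of "deadtime" from the [global]
# section of an smb.conf-style file (prints nothing if it is not set there).
global_deadtime() {
    awk '
        # A section header switches the in_global flag on or off.
        /^[ \t]*\[/ { in_global = (tolower($0) ~ /^[ \t]*\[global\]/); next }
        in_global && tolower($0) ~ /^[ \t]*deadtime[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")   # drop "deadtime =" and leading blanks
            sub(/[ \t]+$/, "")         # drop trailing blanks
            print
        }
    ' "$1"
}

# Usage (the path is an assumption; it differs between systems):
#   global_deadtime /etc/samba/smb.conf
```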
Comment 1 Valantis Trigonis 2017-02-21 11:59:49 UTC
I'm having the same issue. I'm creating a CIFS mount to a share on a Windows file server to which I have no physical or remote administrative access ("nmap -O" reports "Windows Server 2008", though).

Examining a Wireshark capture on remote port 445 reveals that the server is "pinged" (i.e. SMB Echo request/response) every 1 minute. However it seems that this is not sufficient to keep the connection alive and, if no other kind of activity occurs for 15 minutes, the server closes the connection with a RST/ACK.

After that, my machine continuously tries to re-establish a new connection on a different local port. The conversation reaches the point where my client machine sends the "SMB Echo Request" packet but receives a RST/ACK instead of an "SMB Echo Response". This repeats in a tight loop, SYN flooding the network.

I started noticing this behaviour somewhere around kernel 4.9.6. I'm currently on 4.9.9 and can work around the issue either by switching back to the LTS kernel 4.4 or by starting a background job from the shell that does a simple "ls" on the mounted share every few minutes, in order to keep the connection active.
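
The background job mentioned above could look roughly like this. It is a minimal sketch: the function name `keep_share_alive`, the mount point, and the 5-minute interval are all assumptions, chosen only so that the interval stays below the 15-minute idle timeout observed here.

```shell
# keep_share_alive DIR INTERVAL [COUNT]: list DIR every INTERVAL seconds so
# the connection does not sit idle. COUNT limits the number of iterations;
# 0 or omitted means loop forever.
keep_share_alive() {
    dir=$1 interval=$2 count=${3:-0} i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        ls "$dir" >/dev/null 2>&1
        i=$((i + 1))
        sleep "$interval"
    done
}

# Usage: touch the share every 5 minutes in the background.
#   keep_share_alive /mnt 300 &
```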
Comment 2 Valantis Trigonis 2017-03-15 13:13:55 UTC
Issue still exists in kernel 4.10.2
Comment 3 Paul Klapperich 2017-03-15 18:45:20 UTC
I suspect the bug is caused by revision 87dbe42a16b6, but I can't actually get my computer to boot when I build 87dbe42a16b6 or its parent d3304cadb2e2, so I've been unable to confirm that's the at-fault commit.

Valantis, do you think you could attach Wireshark captures from a kernel expressing the issue as well as from an older kernel that's well behaved?
Comment 4 Valantis Trigonis 2017-03-16 10:35:35 UTC
Created attachment 255293
Wireshark capture with kernel 4.4.52
Comment 5 Valantis Trigonis 2017-03-16 10:36:35 UTC
Created attachment 255295
Wireshark capture with kernel 4.10.2
Comment 6 Valantis Trigonis 2017-03-16 10:37:45 UTC
Paul, I've added two captures, one with kernel 4.4.52 that apparently doesn't have the issue and one with 4.10.2 that does.

In the first capture, after mounting the remote directory and without any other interaction, I noticed some RST/ACK packets appearing at the 24 minute mark, while the periodic echo request/response packets stopped. Three minutes later (packet no. 127) I did an 'ls' on the mounted directory just to confirm that I could still access it. This worked fine, seemingly triggering a protocol re-negotiation in the background that was transparent to the user. Echo request/response packets then resumed.

In the second capture, 15 minutes after mounting the directory and again without any other interaction, the connection is reset by the server and never recovers.
Comment 7 Paul Klapperich 2017-03-16 17:07:56 UTC
Thanks Valantis!

On my end... I just learned of the modversions issue (http://lkml.iu.edu/hypermail/linux/kernel/1612.0/02013.html) in 4.9 development kernels; this is what was preventing me from booting some of the kernels I built. Now that I can build working kernels I've resumed bisecting Linus's kernel tree searching for the merge that broke things. 

I can report it wasn't rev 87dbe42a16b6.
Comment 8 Paul Klapperich 2017-03-16 20:14:25 UTC
It looks like the bug is introduced by commit b8c600120fc87d53642476f48c8055b38d6e14c7, "Call echo service immediately after socket reconnect". Additionally, if I checkout a 4.10 kernel and revert that changeset, I have a working 4.10 kernel without this issue.
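
For anyone wanting to repeat the revert test, the steps can be sketched as below. The helper name `revert_on_tag`, the scratch branch name, and the tree path in the usage line are assumptions; only the tag and commit id come from this report.

```shell
# revert_on_tag TREE TAG COMMIT: in the git checkout at TREE, create a
# scratch branch at TAG and revert COMMIT on top of it.
revert_on_tag() {
    git -C "$1" checkout -b cifs-revert-test "$2" &&
    git -C "$1" revert --no-edit "$3"
}

# Usage (tree location is hypothetical):
#   revert_on_tag ~/linux v4.10 b8c600120fc87d53642476f48c8055b38d6e14c7
```

The kernel then has to be rebuilt and booted as usual before retesting the mount.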
Comment 9 Valantis Trigonis 2017-03-17 12:14:07 UTC
Good job, Paul. I also re-compiled 4.10 without the changes introduced by that particular commit and I confirm that the issue disappeared for me too!
Comment 10 Paul Klapperich 2017-03-17 21:01:51 UTC
Some additional details I've noticed... I've been testing with Samba as the server (primarily 4.3.6 on FreeBSD, but also Arch Linux builds whose versions I didn't keep track of), but today I tried with Win 10 Home as the server and there is a difference in how things react.

With Samba as the server, the user might not even notice the issue. We noticed the heavy CPU usage on the server as it dealt with all the network traffic from misbehaving clients, but the users could still use their shares and all was fine.

With Win10 as the server, though, the share locks up for me. I can't ls the mount point and must lazy unmount (umount -l /mnt/path) before I can use it again.

Workaround for Windows Servers:
- MS KB297684 shows how to change and disable the Autodisconnect setting (synonymous with Samba's deadtime). https://support.microsoft.com/en-us/help/297684/mapped-drive-connection-to-network-share-may-be-lost
Comment 11 Jonathan Liu 2017-04-11 14:59:58 UTC
I can reproduce this issue even without setting deadtime=1 in smb.conf (deadtime is set to 0 by default if not specified). See Arch Linux bug report  https://bugs.archlinux.org/task/53639 for minimal instructions to reproduce the issue there.
Comment 12 Sachin Prabhu 2017-04-16 21:43:50 UTC
Patch sent to the mailing list
http://www.spinics.net/lists/linux-cifs/msg12456.html
Comment 13 Paul Klapperich 2017-04-24 16:36:23 UTC
I can confirm the above patch resolves the issue for me.
Comment 14 Steve French 2017-04-24 16:42:43 UTC
The fix was merged into the mainline kernel last week, and should be backported to at least a few older kernels due to cc:stable.
