Created attachment 254689 [details]
default smb.conf with deadtime=1 set

Starting with kernel 4.9, the kernel's CIFS client does not handle disconnects caused by Samba's deadtime option well. A few minutes after deadtime triggers a disconnect, the disconnected client starts thrashing port 445. On the server, netstat shows as many as 14,000 open connections, while on the client only one port is shown as open at a time, but that port keeps changing. It still appears to be a problem in 4.10-rc7.

Steps to reproduce:
1. Set up a Samba server using the default config, but add deadtime=1 to the [global] section. See the attachment for an example.
2. Mount one or more shares using mount.cifs. Example:
   mount.cifs //server/homes -o username=someuser /mnt
3. Do not open any files. Wait 1-5 minutes.
4. Notice that CPU usage on the server has spiked, that netstat shows thousands of open connections from the client, and that the client is continuously opening new connections to server:445.

On the server side I see:

$ sudo netstat -pnet | grep 10.0.1.87 | grep 445 | wc
5283 47548 649809

And on the client, the port is constantly changing:

$ netstat -net | grep 10.0.0.8
tcp 0 0 10.0.1.87:53122 10.0.0.8:445 ESTABLISHED 0 1253359
$ netstat -net | grep 10.0.0.8
tcp 0 0 10.0.1.87:53700 10.0.0.8:445 ESTABLISHED 0 1253439
$ netstat -net | grep 10.0.0.8
tcp 0 0 10.0.1.87:53926 10.0.0.8:445 ESTABLISHED 0 1254557
$ netstat -net | grep 10.0.0.8
tcp 0 0 10.0.1.87:54148 10.0.0.8:445 ESTABLISHED 0 1253578
$ netstat -net | grep 10.0.0.8
tcp 0 0 10.0.1.87:54352 10.0.0.8:445 ESTABLISHED 0 1253604
$ netstat -net | grep 10.0.0.8
tcp 0 0 10.0.1.87:54518 10.0.0.8:445 ESTABLISHED 0 1254685
$ netstat -net | grep 10.0.0.8
tcp 0 0 10.0.1.87:54698 10.0.0.8:445 ESTABLISHED 0 1252177

Workarounds:
- Downgrade to a 4.8.x kernel.
- Set deadtime = 0 in the smb.conf used on the server.
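For anyone who wants a stricter count than the chained grep used on the server side, here is a minimal sketch. The function name count_smb_conns is hypothetical, and it assumes IPv4 addresses and the usual netstat -net column layout (local address in column 4, foreign address in column 5). It only counts lines whose local port is exactly 445 and whose remote address is exactly the given client, so it should not be thrown off by an unrelated port or inode number that happens to contain "445".

```shell
#!/bin/sh
# count_smb_conns CLIENT_IP
# Reads netstat-style output on stdin and prints the number of TCP lines
# whose local address ends in :445 and whose foreign address is CLIENT_IP.
# Assumption: IPv4, with column 4 = local addr:port, column 5 = foreign addr:port.
count_smb_conns() {
    awk -v c="$1" '
        $1 ~ /^tcp/ && $4 ~ /:445$/ && index($5, c ":") == 1 { n++ }
        END { print n + 0 }
    '
}

# Usage on the server (not run here):
#   sudo netstat -net | count_smb_conns 10.0.1.87
```

Running it every few seconds on an affected server should show the count climbing if the client is in the reconnect loop described above.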
I'm having the same issue. I'm creating a CIFS mount to a share on a Windows file server that I don't have physical or remote administrative access to ("nmap -O" reports "Windows Server 2008", though). Examining a Wireshark capture on remote port 445 reveals that the server is "pinged" (i.e. SMB Echo request/response) every minute. However, it seems that this is not sufficient to keep the connection alive: if no other kind of activity occurs for 15 minutes, the server closes the connection with a RST/ACK. After that, my machine continuously tries to establish a new connection from a different local port. The conversation reaches the point where my client machine sends the "SMB Echo Request" packet but receives a RST/ACK instead of an "SMB Echo Response". This repeats in a tight loop, SYN flooding the network. I started noticing this behaviour somewhere around kernel 4.9.6. I'm currently on 4.9.9 and can work around the issue either by switching back to the LTS kernel 4.4 or by starting a background job from the shell that does a simple "ls" on the mounted share every few minutes, in order to keep the connection active.
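The background-job workaround mentioned above could look roughly like the following sketch. The function names touch_share and keepalive are hypothetical, and the 300-second default interval is an assumption chosen only to stay well under the 15-minute idle disconnect seen in the capture; adjust it for your server.

```shell
#!/bin/sh
# touch_share MOUNTPOINT
# Lists the mount point once to generate some SMB traffic.
# Returns the exit status of ls.
touch_share() {
    ls "$1" >/dev/null 2>&1
}

# keepalive MOUNTPOINT [INTERVAL_SECONDS]
# Calls touch_share repeatedly, sleeping between calls,
# and stops as soon as a listing fails.
keepalive() {
    interval="${2:-300}"   # assumed default: 5 minutes
    while touch_share "$1"; do
        sleep "$interval"
    done
}

# Usage (not run here): start it as a background job after mounting.
#   keepalive /mnt 300 &
```

This only hides the symptom by keeping the connection busy; it does not change how the client behaves once the server does drop the connection.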
Issue still exists in kernel 4.10.2
I suspect the bug is caused by revision 87dbe42a16b6, but I can't actually get my computer to boot when I build 87dbe42a16b6 or its parent d3304cadb2e2, so I've been unable to confirm that it's the at-fault commit. Valantis, do you think you could attach Wireshark captures from a kernel exhibiting the issue as well as from an older kernel that's well behaved?
Created attachment 255293 [details] Wireshark capture with kernel 4.4.52
Created attachment 255295 [details] Wireshark capture with kernel 4.10.2
Paul, I've added two captures, one with kernel 4.4.52, which apparently doesn't have the issue, and one with 4.10.2, which does. In the first capture, after mounting the remote directory and without any other interaction, I noticed some RST/ACK packets appearing at the 24-minute mark, while the periodic echo request/response packets stopped. Three minutes later (packet no. 127) I did an 'ls' on the mounted directory just to confirm that I could still access it. This worked fine, seemingly triggering a protocol re-negotiation in the background that is transparent to the user. Echo request/response packets then resumed. In the second capture, 15 minutes after mounting the directory and again without any other interaction, the connection is reset by the server and never recovers.
Thanks Valantis! On my end... I just learned of the modversions issue (http://lkml.iu.edu/hypermail/linux/kernel/1612.0/02013.html) in 4.9 development kernels; this is what was preventing me from booting some of the kernels I built. Now that I can build working kernels I've resumed bisecting Linus's kernel tree searching for the merge that broke things. I can report it wasn't rev 87dbe42a16b6.
It looks like the bug was introduced by commit b8c600120fc87d53642476f48c8055b38d6e14c7, "Call echo service immediately after socket reconnect". Additionally, if I check out a 4.10 kernel and revert that changeset, I have a working 4.10 kernel without this issue.
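For others who want to repeat the revert test, a minimal sketch follows. The helper name revert_on_ref and the default branch name revert-test are hypothetical, and using the v4.10 tag as the starting point is an assumption about which 4.10 tree was used. It creates a scratch branch at the given ref and reverts a single commit on top of it; building and installing the resulting kernel is left out.

```shell
#!/bin/sh
# revert_on_ref REF COMMIT [BRANCH]
# Creates BRANCH (default: revert-test, a hypothetical name) at REF and
# reverts COMMIT on it. Must be run inside a git working tree.
revert_on_ref() {
    ref="$1"
    commit="$2"
    branch="${3:-revert-test}"
    git checkout -b "$branch" "$ref" &&
    git revert --no-edit "$commit"
}

# Usage inside a clone of the kernel tree (not run here; v4.10 tag assumed):
#   revert_on_ref v4.10 b8c600120fc87d53642476f48c8055b38d6e14c7
```

If the revert does not apply cleanly on the tree you pick, git will stop and report conflicts, which have to be resolved by hand.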
Good job, Paul. I also recompiled 4.10 without the changes introduced by that particular commit, and I can confirm that the issue disappeared for me too!
Some additional details I've noticed... I've been testing with Samba as the server (primarily 4.3.6 on FreeBSD, but also Arch Linux builds of versions I didn't keep track of), but today I tried Win 10 Home as the server, and there is a difference in how things react.

With Samba as the server, the user might not even notice that the issue is occurring. We noticed heavy CPU usage on the server as it dealt with all the network traffic from misbehaving clients, but the users could still use their shares and all was fine.

With Win10 as the server, though, the share locks up for me. I can't ls the mount point and must lazy unmount (umount -l /mnt/path) before I can use it again.

Workaround for Windows servers:
- MS KB297684 shows how to change and disable the Autodisconnect setting (the equivalent of Samba's deadtime). https://support.microsoft.com/en-us/help/297684/mapped-drive-connection-to-network-share-may-be-lost
I can reproduce this issue even without setting deadtime=1 in smb.conf (deadtime is set to 0 by default if not specified). See Arch Linux bug report https://bugs.archlinux.org/task/53639 for minimal instructions to reproduce the issue there.
Patch sent to the mailing list http://www.spinics.net/lists/linux-cifs/msg12456.html
I can confirm the above patch resolves the issue for me.
The fix was merged into the mainline kernel last week and should be backported to at least a few older kernels, since it was tagged cc:stable.