Bug 202611

Summary: Race condition in fs/cifs/connect.c causing "has not responded in 120 seconds. Reconnecting..."
Product: File System Reporter: Lutz Donnerhacke (lutz)
Component: CIFSAssignee: fs_cifs (fs_cifs)
Status: RESOLVED CODE_FIX    
Severity: low CC: ferry.toth, lsahlber, paul.ruiz, smfrench
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.9.0-8-amd64 (Debian 4.9.130-2) Subsystem:
Regression: No Bisected commit-id:

Description Lutz Donnerhacke 2019-02-18 14:53:04 UTC
Detailed description of the bug: https://lutz.donnerhacke.de/eng/Blog/Groundhog-Day-with-SMB-remount

Summary of the race condition:
 1) Daisy chaining scheduling creates a gap.
 2) If traffic comes unfortunate shortly after
    the last echo, the planned echo is suppressed.
 3) Due to the gap, the next echo transmission is delayed
    until after the timeout, which is set hard to twice
    the echo interval.

Possible solutions:
 a) Eliminate the gap by scheduling at fixed times.
 b) Send the echo requests in any case regardless of other traffic.
 c) Avoiding the entire problem by waiting at least three times
    the interval length.

For a quick fix change the 2 to a 3 in the code:
https://github.com/torvalds/linux/blob/master/fs/cifs/connect.c#L712
Comment 1 Lutz Donnerhacke 2019-02-19 12:18:03 UTC
I was advised to open a similar ticket at the samba side: https://bugzilla.samba.org/show_bug.cgi?id=13795
Comment 2 Lutz Donnerhacke 2019-05-27 10:26:13 UTC
Is there anything unclear in this ticket?

What are the reasons to deny the recommended quick fix?
Do you need a patch file?
Comment 3 Ferry Toth 2019-07-05 14:45:09 UTC
Hi Lutz,

I seems you have done quite an investigation. I sure would like to test it, but it not easy to get a patched Ubuntu kernel.

Could it be 202903 is a duplicate of this bug?
Comment 4 Steve French 2019-07-05 21:09:05 UTC
Ronnie posted a fix for this onto linux-cifs mailing list which I have tentatively merged into cifs-2.6.git for-next

Let me know if this works ok for you
Comment 5 Lutz Donnerhacke 2019-07-18 10:46:55 UTC
Thank you.

If you are referring to https://marc.info/?l=linux-cifs&m=156235998723334&w=2 the patch looks promising. It follows the "quick fix" proposal and AFAIK the problem vanishes using this patch.

But the patch for the comment section is wrong. Please keep the original text.
Comment 6 Lutz Donnerhacke 2019-07-18 11:01:31 UTC
Correct comment would be:

 712  * We need to wait 3 echo intervals to make sure we handle such
 713  * situations right:
 714  *   1s  client sends a normal SMB request
 715  *   2s  client gets a response
 716  *  30s echo workqueue job pops, and decides we got a response recently

712 refers to the wait interval.
715 refers to the hard coded delay (one second)
Comment 7 Ronnie Sahlberg 2019-07-24 01:44:31 UTC
I have sent a fix to the mailinglist to update the comment.

Will close bug as we have the code change in upstream now.