Bug 62221

Summary: Fanotify blocks any write operation to CIFS mounted volumes for about 10-15 secs
Product: File System Reporter: priyamn (pria.mn9)
Component: CIFSAssignee: fs_cifs (fs_cifs)
Status: NEW ---    
Severity: blocking CC: douglas.leeder, eparis, jlayton, lsahlber, smfrench, tim.pickersgill
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.0 Subsystem:
Regression: No Bisected commit-id:
Attachments: syslog output without fanotify
syslog output with fanotify

Description priyamn 2013-09-27 13:46:35 UTC
Whenever we use Fanotify to monitor "CIFS" mounted volumes and
try to write to a new file by doing:

echo "ABC" > /mnt/cifs/test.c

It takes around 10-15 secs to complete the write operation.

This happens when we try to monitor "FAN_OPEN_PERM". Looks like "FAN_OPEN_PERM" blocks and notifies only after 10-15 secs.
Comment 1 Jeff Layton 2013-09-30 10:07:04 UTC
Eric, any idea of what the problem here might be?
Comment 2 priyamn 2013-10-30 10:26:57 UTC
Any updates on this ?
Comment 3 Tim Pickersgill 2014-03-06 16:24:13 UTC
Easily reproduced using Eric Paris's fanotify-example test program (from http://git.infradead.org/users/eparis/fanotify-example.git/tree) E.g.:

./fanotify -o open_perm -m /mnt/cifs

We see a 30s or 40s delay when creating a new file with touch. Other accesses (including writes) appear to be okay.


Used git bisect to find the commit where this was introduced:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8b0192a5f478da1c1ae906bf3ffff53f26204f56

commit 8b0192a5f478da1c1ae906bf3ffff53f26204f56
Author: Jeff Layton <jlayton@redhat.com>
Date:   Tue Feb 7 06:30:52 2012 -0500

    cifs: request oplock when doing open on lookup
   
    Currently, it's always set to 0 (no oplock requested).
   
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Jeff Layton <jlayton@redhat.com>
    Signed-off-by: Steve French <smfrench@gmail.com>
Comment 4 Tim Pickersgill 2014-03-07 09:28:54 UTC
Setting enable_oplocks (in /sys/module/cifs/parameters) to 0 makes the problem go away.
Comment 5 Tim Pickersgill 2014-03-07 14:50:07 UTC
Created attachment 128441 [details]
syslog output without fanotify

syslog output (with cifsFYI=7) when running touch to create a new file, without fanotify.
Comment 6 Tim Pickersgill 2014-03-07 14:50:36 UTC
Created attachment 128451 [details]
syslog output with fanotify

syslog output (with cifsFYI=7) when running touch to create a new file, with fanotify.
Comment 7 Tim Pickersgill 2014-03-07 15:28:13 UTC
Have set cifsFYI to 7 and re-run touch with and without fanotify trapping OPEN_PERM events. The syslog output from both are attached.

with fanotify trapping OPEN_PERM events, there's a cifs_open call in the middle of the cifs_lookup, which doesn't occur normally. I've inserted blank lines into the logs to highlight this difference. Everything else looks the same to me.

I guess this additional open may have been happening before, but without the oplock, it just worked anyway?
Comment 8 Douglas Leeder 2014-05-20 08:13:07 UTC
Backtrace of the fanotify and touch processes during the pause:

0xffff8800379617c0     1241     1204  0    0   D  0xffff880037961d28  fanotify_test
0xffff88003d3e8000     1244     1204  0    0   S  0xffff88003d3e8568  su
0xffff8800379e2f80     1245     1244  0    0   D  0xffff8800379e34e8  touch
[0]kdb> btp 1241
Stack traceback for pid 1241
0xffff8800379617c0     1241     1204  0    0   D  0xffff880037961d28  fanotify_test
 ffff88003d3e1a80 0000000000000086 ffff8800379617c0 ffffffff81909480
 ffff88003d3e1fd8 00000000000125c0 00000000000125c0 ffff880037b325d0
 ffff88003d04f968 0000000000000000 ffff88003d4d8000 ffff88003d3e1b64
Call Trace:
 [<ffffffff815f9689>] schedule+0x29/0x70
 [<ffffffff812c4ada>] wait_for_response.isra.7+0x9a/0xf0
 [<ffffffff8107d0b0>] ? prepare_to_wait_event+0x100/0x100
 [<ffffffff812c5a3f>] SendReceive+0x12f/0x300
 [<ffffffff812a6e34>] CIFSPOSIXCreate+0x194/0x3b0
 [<ffffffff812b7280>] cifs_posix_open+0x180/0x2b0
 [<ffffffff815f26d0>] ? printk+0x67/0x69
 [<ffffffff812b8f7d>] cifs_open+0x44d/0x650
 [<ffffffff81074f45>] ? set_next_entity+0x85/0xa0
 [<ffffffff815f9223>] ? __schedule+0x2f3/0x730
 [<ffffffff811511d2>] ? kmem_cache_alloc_trace+0x132/0x140
 [<ffffffff81156d5f>] do_dentry_open+0x1ff/0x2d0
 [<ffffffff812b8b30>] ? cifsFileInfo_get+0x30/0x30
 [<ffffffff81156e89>] dentry_open+0x59/0xe0
 [<ffffffff8119c730>] fanotify_read+0x210/0x550
 [<ffffffff8107d0b0>] ? prepare_to_wait_event+0x100/0x100
 [<ffffffff8115935b>] vfs_read+0x9b/0x160
 [<ffffffff81159f76>] SyS_read+0x46/0xb0
 [<ffffffff8116df95>] ? SyS_poll+0x65/0x100
 [<ffffffff81604f92>] system_call_fastpath+0x16/0x1b


[0]kdb> btp 1245
Stack traceback for pid 1245
0xffff8800379e2f80     1245     1244  0    0   D  0xffff8800379e34e8  touch
 ffff88003d32d9d8 0000000000000086 ffff8800379e2f80 ffff8800379617c0
 ffff88003d32dfd8 00000000000125c0 00000000000125c0 0000000000000000
 ffff88003d27a3c0 ffff880037b30c00 ffff880037b30cd8 0000000000000000
Call Trace:
 [<ffffffff815f9689>] schedule+0x29/0x70
 [<ffffffff8119c028>] fanotify_handle_event+0x1a8/0x310
 [<ffffffff8107d0b0>] ? prepare_to_wait_event+0x100/0x100
 [<ffffffff811984f1>] send_to_group+0xd1/0x1b0
 [<ffffffff811987a3>] fsnotify+0x1d3/0x2f0
 [<ffffffff812f6ace>] security_file_open+0x6e/0x70
 [<ffffffff81156c24>] do_dentry_open+0xc4/0x2d0
 [<ffffffff811569e0>] ? finish_no_open+0x20/0x20
 [<ffffffff81157001>] finish_open+0x31/0x40
 [<ffffffff812b625b>] cifs_atomic_open+0x16b/0x300
 [<ffffffff81167f3f>] do_last+0x76f/0x1200
 [<ffffffff812f68a6>] ? security_file_alloc+0x16/0x20
 [<ffffffff81168a8b>] path_openat+0xbb/0x660
 [<ffffffff8112e739>] ? handle_mm_fault+0x389/0x9a0
 [<ffffffff8116982a>] do_filp_open+0x3a/0x90
 [<ffffffff81175d0d>] ? __alloc_fd+0x7d/0x120
 [<ffffffff81158878>] do_sys_open+0x128/0x220
 [<ffffffff8115898e>] SyS_open+0x1e/0x20
 [<ffffffff81604f92>] system_call_fastpath+0x16/0x1b
Comment 9 Jeff Layton 2014-05-20 09:35:14 UTC
The fanotify case is resulting in an extra open of the file, which is just wrong. With a local filesystem, an extra open is generally painless -- not so much with a network filesystem.

So, it looks like the the client requests an oplock on the file (as it should), but then second open races in before the reply is processed. That causes the server to immediately request that the oplock be broken, but we have no record of that oplock yet. The original open hasn't had a chance to complete because it's blocked in the fanotify code. Eventually (15s later) the server gives up on the client and revokes the oplock anyway which lets the second open complete and that unwedges things.

The only way I can see to fix this is to fix fanotify not to do this extra open.
Comment 10 Douglas Leeder 2014-05-20 09:56:35 UTC
How would we do that? We need access to the file contents to to content-based access control, so would need to ?duplicate? the existing open file handle? (And close it if permission is denied)

The other alternatives we thought of:

a) Take the op-lock later, after the security check
b) Remember the file earlier so the release can happen.
c) Release unknown op-locks: Send the appropriate message back to the server if we can't match the op-lock to a file. (the code looks like it needs the inode to send the message though?)
Comment 11 Jeff Layton 2014-05-20 10:02:30 UTC
(In reply to Douglas Leeder from comment #10)
> How would we do that? We need access to the file contents to to
> content-based access control, so would need to ?duplicate? the existing open
> file handle? (And close it if permission is denied)
> 
> The other alternatives we thought of:
> 
> a) Take the op-lock later, after the security check
> b) Remember the file earlier so the release can happen.
> c) Release unknown op-locks: Send the appropriate message back to the server
> if we can't match the op-lock to a file. (the code looks like it needs the
> inode to send the message though?)

All of those options are pretty icky. Why reopen the file in the first place? What makes you think that the server is going to allow you to read the file. What if you don't have any credentials that allow it?

I think Trond's comments in bug 60563 also apply here. The correct way to do this is to plumb this sort of notification into the VFS layer as a proper operation, but even with that I'm not sure this will ever work properly with a network filesystem like NFS or CIFS. If you don't have an open file handle that you can read from, then I don't think you can reasonably do any sort of content-based access control.
Comment 12 Douglas Leeder 2014-05-20 10:08:42 UTC
c) is certainly very icky. a) or b) seem like they close a general race-condition. cifs isn't ready for a broken op-lock at the time when it requests the op-lock, which seems like it could happen anyway if another machine requested the same file.
Comment 13 Jeff Layton 2014-05-20 10:20:22 UTC
Ok, I'll buy that...

I don't see how you can reasonably do "a" since at this level you don't know that there will be a recursive open occurring. "b" is possible, but that would require an overhaul of the oplock handling.

That said...doing that might fix your immediate problem, but it's going to mean that performance will suck. Without an oplock, you can't do any client-side caching, so the *best* thing is still to avoid doing any sort of extra open on the file if you can help it.
Comment 14 Jeff Layton 2014-05-20 10:33:52 UTC
I do wonder though -- why doesn't commit 233839b1df65a prevent this from occurring?
Comment 15 Jeff Layton 2014-05-20 10:36:27 UTC
Ahh right...that patch only fixed this for the SMB2 case. The SMB1 case still has the same problem. Still, you can probably hook into the same infrastructure to fix the race there too.
Comment 16 Ronnie Sahlberg 2019-02-07 22:26:44 UTC
SMB1 is getting obsolete fast.  Since this is fixed for SMB2+  I suggest we close this bug.
Comment 17 Steve French 2019-06-24 03:02:13 UTC
Have we verified that this still fails on SMB1 even?

It does remind me that we should improve fanotify support