Bug 7903
Summary: | CIFS client completely broken? | ||
---|---|---|---|
Product: | File System | Reporter: | Brian Wang (ywang221) |
Component: | CIFS | Assignee: | Steve French (sfrench) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | aarora, akpm, jjudin+linuxkernel, shaggy, ywang221 |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.16-13 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
lock/unlock mutexes surrounding i_size_write in cifs
protect i_size_write calls in cifs via spinlock 2nd pass at i_lock fix (handles vmtruncate) missed one i_size_write in cifs truncate from last patch fix to test for this hang |
Description
Brian Wang
2007-01-30 17:33:28 UTC
More information. It looks like it is easy to screw up with multiple cp or find commands. I think there are some kind of race within the module. I have a SLES10 test machine with two (cpu0 and cpu1) Pentium III (coppermine) which are 32 bit processors running 2.6.20 kernel and cifs client version 1.47. I do not whether this MP machine is SMP (how do you determine that?). But I mounted /usr from a samba server running on another machine and started the cp -R command in three different ttys simultaneously. Do not see any error messages (at least the kinds mentioned in the bug) in dmesg and no hang experienced on the machine running cifs client so far. /usr on the samba server is pretyy big, 2GBytes, it has been quite a few minutes since copies started and although all three cp commands have not completed (have copied 500 or so MBytes each), I guess I would have seen error messages and experienced hang by now. Will check back in the morning. Shirish, Built the kernel with SMP enabled is crucial to reproduce this bug and the freeze. because i_size_read and i_size_write are completely different on SMP/32 bits than on kernel without SMP enabled or 64 bit machines. You have to rebuild the kernel by change the config file to enable SMP. I guess there are a few flags in the config you need, my kernel was built with the following. CONFIG_SMP=y CONFIG_X86_FIND_SMP_CONFIG=y CONFIG_X86_SMP=y I think I filed another bug on the freeze. The freez problem is reported numberous of times if you search GOOGLE. if you check the implementation of i_size_read and i_size_write, they will give you hints how it happens, they take different path on SMP/32bit platform. Bascially there are a few places within cifs module calling i_size_write without holding the inode mutex. Once we put the mutex lock/unlock around those calls if the mutex is not held, the freez problem would be fixed. by the way, if you see those error messages, pay attention to the number of files you copied (or find if you run "find", pipe it through wc, you will see different process report different count). I am restarting the three copies test. The machine was hung, could only ping and had to reboot. But then / filesystem was full. So not sure if hang was a result of that, it is very likely. And there were no errors in dmesg, if there were, they probably were overwritten. So this time, I am doing three simultaneous copies of a 1G directory under /usr (/usr/share). I can't recreate the problem, this time / filesystem did not become 100% full. I tried few times. I do not see any error messages in the syslog as well. I always had the config options CONFIG_SMP=y CONFIG_X86_FIND_SMP_CONFIG=y CONFIG_X86_SMP=y and then I added some more config options like CONFIG_DEBUG_SPINLOCK=y CONFIG_DEBUG_MUTEX=y CONFIG_DEBUG_INFO=y CONFIG_DEBUG_VM=y I will rebuild the kernel by taking out some of the new options I added and redo the testing, may be they change/affect any timing issues. Shirish The freeze problem is easier to see logically by reading the code and put WARN_ON(!mutex_is_locked(&file->f_dentry->d_inode->i_mutex)) statement before it. I suggest that you find all the places within the module where "i_size_write" is called and check is the inode mutex is held. My undersstanding is that you absoultely CANNOT call i_size_wrie without holding the inode mutex, which is violated in CIFS module. Sepcifically, in cifs_partialpagewrite, I believe it should acquire the i_mutex before cifs_write is called, calling cifs_write without holding the mutex result in i_size_write gets called without the lock and take the risk of i_size_read freezes the kernel. There are 2 paths which could end up calling cifs_write, one path is normal VFS path, the other is from mm, VFS path seems to always holding i_mutex, but mm path doesn't, which could cause the freeze and potential data corruption problem. We have been testing and working on this module for a few weeks, I personally believe this module is not for serious use and major work is needed, expecially the lock strategy. The errors i filed for this bug we believe is caused by the lack of proper lock when the server doesn't respond in time, the connection is torn down, a lag was set, but the flag set/check doesn't have the proper lock strategy which result in other threads still hold the dead connection to throws tons of errors. sorry about the typos Brian, Thanks. I am looking at the code and i_size_xxxx calls. It is just that, if I can_have/have_access_to a system/set-up where I can consistently recreate the problem, I can verify the fixed code. All I reported could be reprodued in a few seconds on my machine, which is a Intel server with 4 CPUs and 3.6 GB memory. The machine hung. I could reproduce the problem. Can ping but nothing else. Taking out those extra debug config options helped. The only error message in syslog I get though is init_special_inode: bogus i_mode (755) Anyway, looking at it. *** Bug 7013 has been marked as a duplicate of this bug. *** Brian, I agree - looks like you are correct. I wonder if the aio/vectored-io/mm changes altered the default locking for that path. Thanks , guys. I added locks to a few places in file.c aand inode.c and the freeze problem is pretty much gone, But I don't think my fix is good fix because I don't have a good overall view of VFS&CIFS. I am glad you guys are able to reproduce it and work on it. But even without the freeze problem, still tons of those send error in read, and tons of errors such as : Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:41:20 angband kernel: CIFS VFS: No writable handles for inode Feb 21 15:47:27 angband kernel: CIFS VFS: Received no data, expecting 398 "No writable handles for inode" I believe is when fsync is called against a fd. If you think you have a fix, I am willing to test the patch, my problem could reprodue the problems pretty quick. my problem=my program sorry OK - would like to see your patch in any case although I suspect it is similar to what Shirish has been running. My worry about locking in CIFS is not about cifs specific data structures, global variables, linked lists (although as with every large piece of code, those need to be rereviewed Instead, the cases like this (locking rules implicit in the use of a worker function, in this case i_size_write) are easy to fall through the cracks. Looking at Documentation/filesystems/Locking and related files to refresh my memory, I noticed that there seems to be some dated information in there (although a very useful file) relating to i_mutex. Grepping for i_mutex usage in the vfs really worries me - there are lots of places it is held across blocking calls, and if these ever recursed into the memory manager in low memory situations (balance_inode_pages kind of stuff) could deadlock ... right? No sense going from a deadlock to a different deadlock, so am thinking that a better approach to protecting i_size_write of cifs inodes is not to grab a mutex that is held all over the place (fsync, readdir, ...) in the vfs but to lock the cifs i_size writes with something else. -9 error on read is bad file handle presumably caused by the connection to the server dropping while i/o was in progress. The case of "No writable handle" on write is probably similar - in writepages we need the inode to be open to write out data to it (which is not a big deal because before file close it first writes out data) - but perhaps in your case the session to the server dropped - I keep stats of the number of reconnects and session status in /proc/fs/cifs/Stats or /proc/fs/cifs/DebugData Shirish's patch seems to fix the problem, but am concerned that a vfs can not safely take the i_mutex if we are in the context of an fsync (since that path already seems to take the i_mutex). Anyone have ideas or a good test case for this? Where can I get Shirish's patch? would you post it here? Thanks By the way, the locking fix i out in myself runs for more than 3 days before it falls apart. It looks like the mm path calls cifs_write takes i_mutex most of the time and fails to take it in some very rare cases which is enough to break it down after a while. Shirish, would you email me your patch or direct me to where I can download it? Can't wait to try it out. I appreciate it. Thanks Created attachment 10510 [details]
lock/unlock mutexes surrounding i_size_write in cifs
patch based on cifs-2.6 git tree
Shirish, Thanks for the patch. It is similar to what I did, but i think your patch may lock too many places?. I don't know why you are not running into deadlock, I guess your test only does "read" from a CIFS mount? If you also do "write" to it, I think you may run into deadlock. If i only lock only 2 places in inode.c cifs_get_inode_info_unix, 2 places in readdir in function , i can pretty much get rid of the freeze. but if i put WARN_ON(!mutex_is_locked(&file->f_dentry->d_inode->i_mutex)); in cifs_write before i_size_write, i got 1 warning after a few days running, which means in some corner cases the lock is not held, but it is held most of the time. Badness in cifs_write at fs/cifs/file.c:966 [<c01061b3>] show_trace+0x13/0x20 [<c010633e>] dump_stack+0x1e/0x20 [<f8ab072c>] cifs_write+0x51c/0x540 [cifs] [<f8ab0fe6>] cifs_writepage+0x186/0x230 [cifs] [<c015038d>] shrink_zone+0xe4d/0xfe0 [<c0150cb2>] try_to_free_pages+0x122/0x220 [<c014bcce>] __alloc_pages+0x15e/0x340 [<c014da43>] __do_page_cache_readahead+0x103/0x270 [<c014dc15>] blockable_page_cache_readahead+0x65/0xc0 [<c014dcdd>] make_ahead_window+0x6d/0xc0 [<c014de4f>] page_cache_readahead+0x11f/0x1c0 [<c01472ad>] do_generic_mapping_read+0x51d/0x580 [<c0147b80>] __generic_file_aio_read+0xf0/0x250 [<c0147d29>] generic_file_aio_read+0x49/0x60 [<c0167149>] do_sync_read+0xb9/0xf0 [<c0167b91>] vfs_read+0xc1/0x150 [<c0168127>] sys_read+0x47/0x70 [<c010436b>] sysenter_past_esp+0x54/0x79 I pulled cifs 1.47 from cifs-2.6 git tree, but unfortuately 1.47 doesn't compile on my box with kernel 2.6.16-13. Steve, so 1.47 is is not compaitable with 2.6.16 ? Brian, You are right, I was doing cp -R from cifs mount to local directory. I will run the test other way and see what happens. cifs 1.47 will not build against kernel 2.6.16, Steve can tell better but I am at kernel 2.6.20 and building 1.47 of cifs. Steve and Shirish, So what's your thought on using a module global spin lock to protect i_size_write? so we don't need to worry about the interaction with VFS and mm. any concerns with the approach? I think Steve made the good point on not to grab i_mutex. Also, a question for steve, we have to stay with 2.6.16 and I noticed that there are quite a few important fixes in 1.46 and 1.47, I am just wonder if those fixes could be back port to 1.45. Started the copy from local directory to cifs mounted directory, machine did not hang but individual cp commands deadlocked. Will work on spinlock solution. My spin-lock test over night runs pretty well . I am convinced this is a better solution than playing with i_mutex. I will keep my test run but I do believe the freeze problem should be fixed. There are 2 parts about this bug, May be I should file a different one? the other part I also believe is because of race condition. You can reproduce by the same way, run multiple "cp" on same mount point, you will get lots of read errors and the some files are corrupted. same thing with writing into the mount point. I think it might be better to file it as a seperate bug, what do you guys think? The spinlock to use seems to be inode->i_lock (see note in struct inode in linux/include/fs.h). Not sure how I missed that, but it is not used enough by filesystems yet. And yes, it would be easier to file a distinct problem if we see a second problem, especially if it turns out to be a server or networking issue, it would be easier to track. Thanks, Steve, i guess most people missed that! if you look at the note for i_size_write, it actually suggest holding i_mutex, I think that should be updated! bug 8077(missing files and file corruption) is filed for the other part of this bug. Reply-To: akpm@linux-foundation.org <wakes up> Yes, i_mutex is required during i_size_write(). It might be that we can fix the CIFS lockups by takeing i_lock around the i_size_write calls, but if we do this then we will need to ensure that the VFS absolutely never runs i_size_write() against a cifs file or directory, because the VFS doens't use i_lock for i_size protection. Taking i_mutex around i_size_write() would be preferred. In that case, why not take i_lock inside of i_size_write and make it a rule that nowhere else shoudl take i_lock for i_size? Reply-To: akpm@linux-foundation.org Because i_size_write() is already protected by i_mutex. Taking another lock in the VFS and in all filesystems just because of a couple of unlocked codepaths in CIFS sounds like a bad dead. I don't see anywhere that the vfs could be calling i_size_write on cifs inodes (doesn't really make sense for the vfs to update the file size for a remote inode - only the network filesystem can know the real size). The problem is not really changing the unlocked paths in cifs to taking the i_mutex, it is avoiding all potential deadlock with the 70+ places in the fs and mm directory that take the i_mutex - it is much easier to avoid deadlock by taking something that is guaranteed not to deadlock and still protects all the writes to this field (i_size). Created attachment 10524 [details]
protect i_size_write calls in cifs via spinlock
The same idea, but protected with spinlock on i_lock.
Seems safer to me, although looking more carefully there is one place in the
vfs and mm that calls i_size_write on cifs behalf (vmtruncate).
Not sure the best way to handle that without deadlock.
> Because i_size_write() is already protected by i_mutex.
That may be true, but it is not obvious - the callers (outside the fs ie in mm
and vfs) of i_size_write include:
1) vmtruncate
2) simple_commit_write/no_bh_commit_write/generic_commit_write
3) fsstack_copy_inode_size
4) generic_file_direct_write
5) xip_file_write
None of them grab i_mutex and all are exported.
A network filesystem of course could update the file size on all kinds of events
... (e.g. lookup/revalidate of a file changed on the server) not just a local
write on the client - so how could one ever tell if the thread already has
i_mutex. Seems the easiest way is to find a way to avoid calling vmtruncate
(which presumably can't be wrapped with a spinlock because it could block).
> a way to avoid calling vmtruncate
I meant find a way to do the equivalent of truncate - and protect its
i_size_write with whatever safe lock we decide on.
Reply-To: akpm@linux-foundation.org > On Sat, 24 Feb 2007 22:53:27 -0800 bugme-daemon@bugzilla.kernel.org wrote: > 1) vmtruncate > 2) simple_commit_write/no_bh_commit_write/generic_commit_write > 3) fsstack_copy_inode_size > 4) generic_file_direct_write > 5) xip_file_write They should all be called under i_mutex with the exception of fsstack_copy_inode_size, which is I think copying to an inode whcih hasn't been made visible to anyone else yet. Sorry, but the rule is that i_size_write() requires i_mutex. There _are_ exceptions to this (as I discovered when I had an assertion in i_size_write) but they're all sutiations where we know that no other thread can be running i_size_write) on that inode at the same time. The prrferred solution might be fixing it with i_mutex.
But I think the quick and safe fix for this specific CIFS bug might be holding
i_lock , at the same time, looks like we need to duplicate vmtruncate into
CIFS module to make it a static cifs_vmtruncate and protect i_size_write with
i_lock.
Now we know that fsstack_copy_inode_size won't be a problem as the reason
stated above. The question is could the other 3 be potential problems for
CIFS, according to Steve.
> 2) simple_commit_write/no_bh_commit_write/generic_commit_write
> 4) generic_file_direct_write
> 5) xip_file_write
The only "external" (external to cifs) use of i_size_write is vmtruncate, and that is easy to handle and can be done within cifs with little code duplication. But isn't this cut and dried? Based on Brian and Shirish's reports it seems we (a network filesystem) can't really know when the vfs has i_mutex on a particular inode (for this thread) for sure so we can't safely use it in this case (if the vfs didn't hold the i_mutex before calling write and/or writepages we wouldn't have this problem in the first place - cifs would simply grab i_mutex around all its i_size_write calls and this would be trivial and we would be done ... but Brian and Shirish's notes seem to indicate that the vfs sometimes has i_mutex before it gets to cifs on write and other times not and there is no easy rule as there is in path based calls into the vfs on the use of i_mutex). cifs currently can set file size in its: 1) write (for the direct case) 2) writepage 3) writepages 4) commit_write 5) readdir (cifs_filldir) 6) and in rmdir (on directory rather than a file) 7) and in many other calls symlink read_inode (on the root inode of the sb) mkdir (and due to revalidate ... it is also called in) file_mmap dentry_revalidate certain lseek calls getattr I don't see any way to figure out based on Documentation/filesystem/Locking or grepping fs and mm - exactly which of these 13+ cases could hold the i_mutex on a particular inode. It is a much more performance sensitive operation on a network filesystem of course since the file size could change on the server on revalidate of inode metadata so a spinlock may be preferable to a mutex anyway. Created attachment 10532 [details]
2nd pass at i_lock fix (handles vmtruncate)
Did some testing - the latest patch (attached to this bugzilla bug #) seems to run ok vs. the stable version of the server code (Samba 3.0.23d) (but found an unrelated problem in the most current Samba server svn code which I am looking at). Let me know if you see any testing problem. It seems like it covers all paths safely and I am satisfied with it, although I really don't want to push it if akpm objects (he knows the mm quite well) i ran my test without vmtruncate change for more than 20 hours and encountered no problem. I started my test 2 hours ago with vmtruncate change included and will keep it running and I will let you know if I find anything. I understand that from Linux OS point of view, this may not be "perfect" fix, which might be the reason of akpm's objection. But i use this as part of the product development, so it is good enough for me. Thanks and I hope you will have time look at #8077,which I believe is a very critical bug too. Reply-To: akpm@linux-foundation.org I guess we can live with fs-private i_lock usage. We just have to be careful about any future vfs-level i_size_write()s popping up. checked into cifs-2.6 tree - pending minor additional tests will request a merge soon Created attachment 10540 [details]
missed one i_size_write in cifs truncate from last patch
Shaggy noticed that there are two i_size_writes in the new cifs truncate (we
forgot spinlock around one in the previous patch).
This patch applies on top of the previous one.
Steve, the 2nd patch above, looks like the first spin_unlock(&inode->i_lock); needs to be removed? you got double unlocks. sorry, my bad, didn't notice the goto after that. Please reopen this bug (I am not able to do it from my id), since the patch in this bug report is resulting in a deadlock situation. The problem is happening since the cifs code calls is_size_safe_to_change() holding a spinlock, and this function can sleep. Obviously, sleeping inside spinlock is not a good idea! Process with 8720 pid is the culprit here. It gets the spinlock on inode->i_lock and calls is_size_safe_to_change(). This in turn eventually calls wait_for_response() which puts the process on a waitqueue and makes it go to uninterruptible sleep. Now, since the above process is holding the lock, suppose all other CPUs (in this server we have only two) get to run processes which also start waiting on the same inode->i_lock, we get the deadlock - since the process with 8720 pid, which is holding the lock can _not_ be woken up now (as both the cpus are busy spinning!). This is the stack trace for the process with 8720 pid: locktests D 000000000fcff7fc 8720 23007 22971 23008 23006 (NOTLB) Call Trace: [C0000000725EB070] [C000000000521010] 0xc000000000521010 (unreliable) [C0000000725EB240] [C000000000010E80] .__switch_to+0x130/0x154 [C0000000725EB2D0] [C000000000357F38] .schedule+0xa98/0xbf4 [C0000000725EB3E0] [D000000000A21DA8] .wait_for_response+0xe8/0x1bc [cifs] [C0000000725EB4C0] [D000000000A22758] .SendReceive+0x2dc/0x598 [cifs] [C0000000725EB590] [D000000000A06BEC] .CIFSSMBOpen+0x2d8/0x518 [cifs] [C0000000725EB690] [D000000000A16FDC] .cifs_reopen_file+0x2e4/0x524 [cifs] [C0000000725EB780] [D000000000A17E18] .find_writable_file+0xe4/0x184 [cifs] [C0000000725EB820] [D000000000A185C0] .is_size_safe_to_change+0x24/0x90 [cifs] [C0000000725EB8B0] [D000000000A1BD08] .cifs_get_inode_info_unix+0x794/0x9a8 [cifs] [C0000000725EBA20] [D000000000A1A79C] .cifs_open+0x71c/0x8c4 [cifs] [C0000000725EBB30] [C0000000000E69D8] .__dentry_open+0x13c/0x2bc [C0000000725EBBE0] [C0000000000E6CCC] .do_filp_open+0x50/0x70 [C0000000725EBD00] [C0000000000E6D60] .do_sys_open+0x74/0x130 [C0000000725EBDB0] [C000000000125148] .compat_sys_open+0x24/0x38 [C0000000725EBE30] [C0000000000086A4] syscall_exit+0x0/0x40 reopened.. good catch - in this path the tcp session has died and we are reconnecting while someone is changing file size. is_size_safe_to_change will need to be changed so that it does not block. In this case the intent of calling find_writable_file is to check whether the inode might be in the process of being written to, not to get a file handle we can use to write. I see a few easy ways to change this - but one is to add a parm to find_writable_file Could you check to make sure that this patch is ok and works? Created attachment 12124 [details]
fix to test for this hang
also could you reassign this bug back to me, I hit the wrong button when I tried to add the attachment
Hi Steve, Thanks for the new patch. I have asked the person who faced this problem to try it out. Will update the bug when I get to hear from him. This is also being tracked (for RHEL5) via Red Hat bugzilla: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249394 (In reply to comment #56) > Hi Steve, > > Thanks for the new patch. I have asked the person who faced this problem to > try > it out. Will update the bug when I get to hear from him. The tests with this new patch have passed. They have been running fine for more than 12 hours, where we were observing the bug within 1-2 hours. Steve, thanks again for this patch ! You may want to monitor /proc/fs/cifs/Stats during the test to see if the number of session reconnects keeps increasing (a possible indication of networking problems or server crash). Although the cifs client automatically reconnects after a tcp session crash (and it might not be visible to an application or test) it is a possible indication that there is a networking problem and worth keeping an eye out for (the cifs client logs a dmesg entry in most of these cases too). |