Bug 7903 - CIFS client completely broken?
Summary: CIFS client completely broken?
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: CIFS
Hardware: i386 Linux
Importance: P2 blocking
Assignee: Steve French
URL:
Keywords:
Duplicates: 7013
Depends on:
Blocks:
 
Reported: 2007-01-30 17:33 UTC by Brian Wang
Modified: 2007-07-26 08:48 UTC
CC: 5 users

See Also:
Kernel Version: 2.6.16-13
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
lock/unlock mutexes surrounding i_size_write in cifs (4.87 KB, patch)
2007-02-23 12:47 UTC, Shirish Pargaonkar
protect i_size_write calls in cifs via spinlock (6.47 KB, patch)
2007-02-24 22:41 UTC, Steve French
2nd pass at i_lock fix (handles vmtruncate) (7.93 KB, patch)
2007-02-25 13:33 UTC, Steve French
missed one i_size_write in cifs truncate from last patch (1.32 KB, patch)
2007-02-26 12:08 UTC, Steve French
fix to test for this hang (1.63 KB, text/x-diff)
2007-07-24 12:42 UTC, Steve French

Description Brian Wang 2007-01-30 17:33:28 UTC
Most recent kernel where this bug did *NOT* occur:
Distribution: SUSE
Hardware Environment: Intel
Software Environment: SUSE 10, kernel 2.6.16 with CIFS 1.45
Problem Description:

1. With SMP/32-bit kernels, the CIFS client freezes the kernel. The problem is that the calls to i_size_write are made without holding the inode mutex, which can cause i_size_read to lock up the kernel.

2. The CIFS client looks completely unusable. dmesg shows tons of errors (see below). If you run a "find" command against a CIFS mount, it returns a different result every time (e.g. "find /cifs_mnt | wc" returns a different count on every run, even though the mounted share is completely static).

The question I'd like to ask: are there any serious users of the Linux CIFS client? The problem is so easy to reproduce, and the client just seems completely broken and useless. Surprisingly, not many bugs have been reported.

I have tried different CIFS versions from 1.40 to the most recent 1.45 on 2.6.16 and 2.6.17; the problem is everywhere!


CIFS VFS: Send error in read = -11
 CIFS VFS: No response for cmd 4 mid 17535
 CIFS VFS: No response to cmd 46 mid 17534
 CIFS VFS: No response to cmd 46 mid 17532
 CIFS VFS: Send error in Close = -11
 CIFS VFS: No response to cmd 46 mid 17526
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 17530
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 17529
 CIFS VFS: No response to cmd 46 mid 17527
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 17528
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 17531
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 17525
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 17533
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 17524
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: No response for cmd 50 mid 20566
 CIFS VFS: No response for cmd 50 mid 20565
 CIFS VFS: No response to cmd 46 mid 20567
 CIFS VFS: No response to cmd 46 mid 20559
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20563
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20560
 CIFS VFS: No response to cmd 46 mid 20562
 CIFS VFS: No response to cmd 46 mid 20557
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20555
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20564
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20553
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20551
 CIFS VFS: No response to cmd 46 mid 20550
 CIFS VFS: No response to cmd 46 mid 20561
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20549
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20558
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20548
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20547
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20546
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response to cmd 46 mid 20552
 CIFS VFS: Send error in read = -11
 CIFS VFS: No response for cmd 4 mid 20556
 CIFS VFS: Send error in Close = -11
 CIFS VFS: No response to cmd 46 mid 20554
 CIFS VFS: Send error in read = -11
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in Close = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
 CIFS VFS: Send error in read = -9
init_special_inode: bogus i_mode (777)
init_special_inode: bogus i_mode (777)


Steps to reproduce:


1. Mount a CIFS share from a Windows or Samba server with a significant number of files (for example, something like /usr on any Linux system).

2. Run "cp -R /cifs-mnt /somedir". It happens quickly with multiple cp commands.

3. Do "dmesg" and you will see tons of errors.

4. If your system is SMP/32-bit, the system will freeze up in a few minutes. You can ping it, but you CANNOT do anything else.
Comment 1 Brian Wang 2007-01-30 17:58:11 UTC
More information.

It looks like it is easy to trigger with multiple cp or find commands. I think there is some kind of race within the module.
Comment 2 Shirish Pargaonkar 2007-02-07 22:13:16 UTC
I have a SLES10 test machine with two Pentium III (Coppermine) processors (cpu0 and cpu1), which are 32-bit, running a 2.6.20 kernel and CIFS client version 1.47. I do not know whether this MP machine is SMP (how do you determine that?).

But I mounted /usr from a samba server running on another machine and started the cp -R command in three different ttys simultaneously. I do not see any error messages (at least of the kinds mentioned in the bug) in dmesg, and no hang has been experienced so far on the machine running the cifs client.

/usr on the samba server is pretty big, 2 GB. It has been quite a few minutes since the copies started, and although all three cp commands have not completed (each has copied 500 or so MB), I would guess I would have seen error messages and experienced a hang by now.

Will check back in the morning.
Comment 3 Brian Wang 2007-02-08 07:36:07 UTC
Shirish,

Building the kernel with SMP enabled is crucial to reproducing this bug and the freeze, because i_size_read and i_size_write behave completely differently on SMP/32-bit than on kernels without SMP enabled or on 64-bit machines.

You have to rebuild the kernel, changing the config file to enable SMP. I guess there are a few flags in the config you need; my kernel was built with the following.

CONFIG_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_SMP=y

I think I filed another bug on the freeze. The freeze problem has been reported numerous times if you search Google. If you check the implementation of i_size_read and i_size_write, they will give you hints about how it happens; they take a different path on SMP/32-bit platforms. Basically, there are a few places within the cifs module calling i_size_write without holding the inode mutex. Once we put mutex lock/unlock around those calls where the mutex is not held, the freeze problem should be fixed.
Comment 4 Brian Wang 2007-02-08 07:37:59 UTC
By the way, if you see those error messages, pay attention to the number of files you copied (or found, if you run "find"; pipe it through wc and you will see different processes report different counts).
Comment 5 Shirish Pargaonkar 2007-02-08 08:07:13 UTC
I am restarting the three-copies test. The machine was hung; I could only ping it and had to reboot.
But then the / filesystem was full, so I am not sure if the hang was a result of that; it is very likely. And there were no errors in dmesg; if there were, they probably were overwritten.

So this time, I am doing three simultaneous copies of a 1G directory under
/usr (/usr/share).
Comment 6 Shirish Pargaonkar 2007-02-15 07:30:50 UTC
I can't recreate the problem; this time the / filesystem did not become 100% full. I tried a few times. I do not see any error messages in the syslog either.

I always had the config options 

CONFIG_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_SMP=y

and then I added some more config options like 
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEX=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_VM=y

I will rebuild the kernel, taking out some of the new options I added, and redo the testing; maybe they change/affect the timing.
Comment 7 Brian Wang 2007-02-15 08:45:19 UTC
Shirish

The freeze problem is easier to see logically by reading the code and putting a WARN_ON(!mutex_is_locked(&file->f_dentry->d_inode->i_mutex)) statement before it.

I suggest that you find all the places within the module where i_size_write is called and check whether the inode mutex is held. My understanding is that you absolutely CANNOT call i_size_write without holding the inode mutex, which is violated in the CIFS module. Specifically, in cifs_partialpagewrite, I believe it should acquire i_mutex before cifs_write is called; calling cifs_write without holding the mutex results in i_size_write getting called without the lock and risks i_size_read freezing the kernel.

There are 2 paths that could end up calling cifs_write: one is the normal VFS path, the other is from the mm. The VFS path seems to always hold i_mutex, but the mm path doesn't, which could cause the freeze and potential data corruption problems.


We have been testing and working on this module for a few weeks. I personally believe this module is not ready for serious use and major work is needed, especially on the locking strategy. The errors I filed for this bug, we believe, are caused by the lack of proper locking when the server doesn't respond in time: the connection is torn down and a flag is set, but the flag set/check doesn't have a proper locking strategy, which results in other threads still holding the dead connection and throwing tons of errors.
Comment 8 Brian Wang 2007-02-15 08:54:59 UTC
sorry about the typos
Comment 9 Shirish Pargaonkar 2007-02-15 09:06:55 UTC
Brian,

Thanks. I am looking at the code and the i_size_xxxx calls.
It is just that, if I can have access to a system/setup where I can consistently recreate the problem, I can verify the fixed code.
Comment 10 Brian Wang 2007-02-15 09:11:11 UTC
Everything I reported can be reproduced in a few seconds on my machine, which is an Intel server with 4 CPUs and 3.6 GB of memory.

Comment 11 Shirish Pargaonkar 2007-02-15 09:48:15 UTC
The machine hung. I could reproduce the problem: I can ping but do nothing else. Taking out those extra debug config options helped. The only error message I get in syslog, though, is
 init_special_inode: bogus i_mode (755)

Anyway, looking at it.
Comment 12 Steve French 2007-02-21 13:09:56 UTC
*** Bug 7013 has been marked as a duplicate of this bug. ***
Comment 13 Steve French 2007-02-21 13:10:35 UTC
Brian,
I agree - looks like you are correct.

I wonder if the aio/vectored-io/mm changes altered the default locking for that
path.
Comment 14 Brian Wang 2007-02-21 14:18:58 UTC
Thanks , guys.

I added locks to a few places in file.c and inode.c, and the freeze problem is pretty much gone.

But I don't think my fix is a good fix, because I don't have a good overall view of the VFS and CIFS. I am glad you guys are able to reproduce it and work on it.

But even without the freeze problem, there are still tons of those send errors in read, and tons of other errors such as:

Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode
Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode
Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode
Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode
Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode
Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode
Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode
Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode
Feb 21 15:41:20 angband kernel:  CIFS VFS: No writable handles for inode



Feb 21 15:47:27 angband kernel:  CIFS VFS: Received no data, expecting 398


"No writable handles for inode" I believe occurs when fsync is called against an fd.


If you think you have a fix, I am willing to test the patch; my problem could reproduce the problems pretty quickly.

Comment 15 Brian Wang 2007-02-21 14:20:27 UTC
my problem=my program

sorry
Comment 16 Steve French 2007-02-21 15:21:12 UTC
OK - I would like to see your patch in any case, although I suspect it is similar to what Shirish has been running. My worry about locking in CIFS is not about cifs-specific data structures, global variables, and linked lists (although, as with every large piece of code, those need to be re-reviewed). Instead, cases like this (locking rules implicit in the use of a worker function, in this case i_size_write) can easily fall through the cracks.

Looking at Documentation/filesystems/Locking and related files to refresh my memory, I noticed that there seems to be some dated information in there (although it is a very useful file) relating to i_mutex. Grepping for i_mutex usage in the vfs really worries me - there are lots of places it is held across blocking calls, and if these ever recursed into the memory manager in low-memory situations (balance_inode_pages kind of stuff) they could deadlock ... right?

No sense going from one deadlock to a different deadlock, so I am thinking that a better approach to protecting i_size_write of cifs inodes is not to grab a mutex that is held all over the place (fsync, readdir, ...) in the vfs, but to lock the cifs i_size writes with something else.
Comment 17 Steve French 2007-02-21 15:25:51 UTC
-9 error on read is a bad file handle, presumably caused by the connection to the server dropping while i/o was in progress.

The case of "No writable handle" on write is probably similar - in writepages we need the inode to be open for write in order to write out data (which is not a big deal because before file close it first writes out data) - but perhaps in your case the session to the server dropped. I keep stats on the number of reconnects and session status in /proc/fs/cifs/Stats or /proc/fs/cifs/DebugData.
Comment 18 Steve French 2007-02-22 20:42:21 UTC
Shirish's patch seems to fix the problem, but I am concerned that a filesystem cannot safely take the i_mutex if we are in the context of an fsync (since that path already seems to take the i_mutex).

Anyone have ideas or a good test case for this?
Comment 19 Brian Wang 2007-02-22 20:52:39 UTC
Where can I get Shirish's patch? would you post it here? 

Thanks
Comment 20 Brian Wang 2007-02-22 20:58:21 UTC
By the way, the locking fix I put in myself ran for more than 3 days before it fell apart. It looks like the mm path that calls cifs_write takes i_mutex most of the time but fails to take it in some very rare cases, which is enough to break things down after a while.
Comment 21 Brian Wang 2007-02-22 22:09:16 UTC
Shirish,

would you email me your patch or direct me to where I can download it? Can't 
wait to try it out. I appreciate it.

Thanks
Comment 22 Shirish Pargaonkar 2007-02-23 12:47:57 UTC
Created attachment 10510 [details]
lock/unlock mutexes surrounding i_size_write in cifs

patch based on cifs-2.6 git tree
Comment 23 Brian Wang 2007-02-23 13:29:59 UTC
Shirish,

Thanks for the patch. It is similar to what I did, but I think your patch may lock too many places. I don't know why you are not running into a deadlock; I guess your test only does "read" from a CIFS mount? If you also do "write" to it, I think you may run into a deadlock.

If I lock only 2 places in cifs_get_inode_info_unix in inode.c and 2 places in readdir, I can pretty much get rid of the freeze. But if I put WARN_ON(!mutex_is_locked(&file->f_dentry->d_inode->i_mutex)); in cifs_write before i_size_write, I get 1 warning after a few days of running, which means that in some corner cases the lock is not held, though it is held most of the time.

Badness in cifs_write at fs/cifs/file.c:966
 [<c01061b3>] show_trace+0x13/0x20
 [<c010633e>] dump_stack+0x1e/0x20
 [<f8ab072c>] cifs_write+0x51c/0x540 [cifs]
 [<f8ab0fe6>] cifs_writepage+0x186/0x230 [cifs]
 [<c015038d>] shrink_zone+0xe4d/0xfe0
 [<c0150cb2>] try_to_free_pages+0x122/0x220
 [<c014bcce>] __alloc_pages+0x15e/0x340
 [<c014da43>] __do_page_cache_readahead+0x103/0x270
 [<c014dc15>] blockable_page_cache_readahead+0x65/0xc0
 [<c014dcdd>] make_ahead_window+0x6d/0xc0
 [<c014de4f>] page_cache_readahead+0x11f/0x1c0
 [<c01472ad>] do_generic_mapping_read+0x51d/0x580
 [<c0147b80>] __generic_file_aio_read+0xf0/0x250
 [<c0147d29>] generic_file_aio_read+0x49/0x60
 [<c0167149>] do_sync_read+0xb9/0xf0
 [<c0167b91>] vfs_read+0xc1/0x150
 [<c0168127>] sys_read+0x47/0x70
 [<c010436b>] sysenter_past_esp+0x54/0x79
Comment 24 Brian Wang 2007-02-23 14:33:39 UTC
I pulled cifs 1.47 from the cifs-2.6 git tree, but unfortunately 1.47 doesn't compile on my box with kernel 2.6.16-13. Steve, so 1.47 is not compatible with 2.6.16?
Comment 25 Shirish Pargaonkar 2007-02-23 19:55:48 UTC
Brian,

You are right, I was doing cp -R from the cifs mount to a local directory. I will run the test the other way and see what happens.

cifs 1.47 will not build against kernel 2.6.16; Steve can tell better, but I am at kernel 2.6.20 and building cifs 1.47.
Comment 26 Brian Wang 2007-02-23 21:17:20 UTC
Steve and Shirish,

So what's your thought on using a module-global spinlock to protect i_size_write, so we don't need to worry about the interaction with the VFS and mm? Any concerns with the approach? I think Steve made a good point about not grabbing i_mutex.

Also, a question for Steve: we have to stay with 2.6.16, and I noticed that there are quite a few important fixes in 1.46 and 1.47. I am just wondering if those fixes could be backported to 1.45.
Comment 27 Shirish Pargaonkar 2007-02-24 05:52:57 UTC
I started the copy from a local directory to the cifs-mounted directory; the machine did not hang, but individual cp commands deadlocked.
I will work on the spinlock solution.
Comment 28 Brian Wang 2007-02-24 08:46:07 UTC
My spinlock test ran pretty well overnight. I am convinced this is a better solution than playing with i_mutex. I will keep my test running, but I do believe the freeze problem should be fixed.

There are 2 parts to this bug; maybe I should file a different one? The other part, I also believe, is due to a race condition. You can reproduce it the same way: run multiple "cp" commands on the same mount point and you will get lots of read errors, and some files will be corrupted. The same thing happens when writing into the mount point. I think it might be better to file it as a separate bug; what do you guys think?
Comment 29 Steve French 2007-02-24 08:55:59 UTC
The spinlock to use seems to be inode->i_lock (see the note in struct inode in include/linux/fs.h).

Not sure how I missed that, but it is not yet used much by filesystems.
Comment 30 Steve French 2007-02-24 08:57:00 UTC
And yes, it would be easier to file a distinct bug if we see a second problem; especially if it turns out to be a server or networking issue, it would be easier to track.
Comment 31 Brian Wang 2007-02-24 09:02:57 UTC
Thanks, Steve, I guess most people missed that! If you look at the note for i_size_write, it actually suggests holding i_mutex; I think that should be updated!
Comment 32 Brian Wang 2007-02-24 09:14:39 UTC
Bug 8077 (missing files and file corruption) has been filed for the other part of this bug.
Comment 33 Anonymous Emailer 2007-02-24 09:52:00 UTC
Reply-To: akpm@linux-foundation.org

<wakes up>

Yes, i_mutex is required during i_size_write().  It might be that we can fix the CIFS lockups by taking i_lock around the i_size_write calls, but if we do this then we will need to ensure that the VFS absolutely never runs i_size_write() against a cifs file or directory, because the VFS doesn't use i_lock for i_size protection.

Taking i_mutex around i_size_write() would be preferred.

Comment 34 Brian Wang 2007-02-24 10:05:28 UTC
In that case, why not take i_lock inside i_size_write and make it a rule that nowhere else should take i_lock for i_size?
Comment 35 Anonymous Emailer 2007-02-24 18:53:30 UTC
Reply-To: akpm@linux-foundation.org

Because i_size_write() is already protected by i_mutex.  Taking another lock in the VFS and in all filesystems just because of a couple of unlocked codepaths in CIFS sounds like a bad deal.

Comment 36 Steve French 2007-02-24 22:33:47 UTC
I don't see anywhere that the vfs could be calling i_size_write on cifs inodes (it doesn't really make sense for the vfs to update the file size of a remote inode - only the network filesystem can know the real size).

The problem is not really changing the unlocked paths in cifs to take i_mutex; it is avoiding all potential deadlocks with the 70+ places in the fs and mm directories that take i_mutex - it is much easier to avoid deadlock by taking something that is guaranteed not to deadlock and still protects all the writes to this field (i_size).
Comment 37 Steve French 2007-02-24 22:41:40 UTC
Created attachment 10524 [details]
protect i_size_write calls in cifs via spinlock

The same idea, but protected with spinlock on i_lock.

Seems safer to me, although looking more carefully there is one place in the vfs and mm that calls i_size_write on cifs's behalf (vmtruncate).

Not sure of the best way to handle that without deadlock.
Comment 38 Steve French 2007-02-24 22:53:25 UTC
> Because i_size_write() is already protected by i_mutex.

That may be true, but it is not obvious - the callers of i_size_write outside the fs (i.e. in mm and vfs) include:

1) vmtruncate
2) simple_commit_write/no_bh_commit_write/generic_commit_write
3) fsstack_copy_inode_size
4) generic_file_direct_write
5) xip_file_write

None of them grab i_mutex and all are exported.

A network filesystem of course could update the file size on all kinds of events (e.g. lookup/revalidate of a file changed on the server), not just a local write on the client - so how could one ever tell whether the thread already has i_mutex? It seems the easiest way is to find a way to avoid calling vmtruncate (which presumably can't be wrapped with a spinlock because it could block).
Comment 39 Steve French 2007-02-24 22:55:31 UTC
> a way to avoid calling vmtruncate
I meant find a way to do the equivalent of truncate - and protect its
i_size_write with whatever safe lock we decide on.
Comment 40 Anonymous Emailer 2007-02-25 01:11:23 UTC
Reply-To: akpm@linux-foundation.org

> On Sat, 24 Feb 2007 22:53:27 -0800 bugme-daemon@bugzilla.kernel.org wrote:
> 1) vmtruncate
> 2) simple_commit_write/no_bh_commit_write/generic_commit_write
> 3) fsstack_copy_inode_size
> 4) generic_file_direct_write
> 5) xip_file_write

They should all be called under i_mutex, with the exception of fsstack_copy_inode_size, which I think is copying to an inode which hasn't been made visible to anyone else yet.

Sorry, but the rule is that i_size_write() requires i_mutex.

There _are_ exceptions to this (as I discovered when I had an assertion in i_size_write), but they're all situations where we know that no other thread can be running i_size_write() on that inode at the same time.

Comment 41 Brian Wang 2007-02-25 10:28:33 UTC
The preferred solution might be fixing it with i_mutex.

But I think the quick and safe fix for this specific CIFS bug might be holding i_lock; at the same time, it looks like we need to duplicate vmtruncate into the CIFS module as a static cifs_vmtruncate and protect its i_size_write with i_lock.

Now we know that fsstack_copy_inode_size won't be a problem, for the reason stated above. The question is whether the other 3 could be potential problems for CIFS, according to Steve.

> 2) simple_commit_write/no_bh_commit_write/generic_commit_write
> 4) generic_file_direct_write
> 5) xip_file_write
Comment 42 Steve French 2007-02-25 13:07:34 UTC
The only "external" (external to cifs) use of i_size_write is vmtruncate, and
that is easy to handle and can be done within cifs with little code duplication.

But isn't this cut and dried? Based on Brian and Shirish's reports, it seems we (a network filesystem) can't really know for sure when the vfs has i_mutex on a particular inode (for this thread), so we can't safely use it in this case. (If the vfs didn't hold i_mutex before calling write and/or writepages, we wouldn't have this problem in the first place - cifs would simply grab i_mutex around all its i_size_write calls, this would be trivial, and we would be done ... but Brian and Shirish's notes seem to indicate that the vfs sometimes holds i_mutex before it gets to cifs on write and other times not, and there is no easy rule as there is for path-based calls into the vfs on the use of i_mutex.)

cifs currently can set file size in its:
1) write (for the direct case)
2) writepage
3) writepages
4) commit_write
5) readdir (cifs_filldir)
6) and in rmdir (on directory rather than a file)
7) and in many other calls
    symlink
    read_inode (on the root inode of the sb)
    mkdir
(and due to revalidate ... it is also called in)
    file_mmap
    dentry_revalidate
    certain lseek calls
    getattr

I don't see any way to figure out, based on Documentation/filesystems/Locking or grepping fs and mm, exactly which of these 13+ cases could hold the i_mutex on a particular inode.

It is a much more performance-sensitive operation on a network filesystem, of course, since the file size could change on the server and be picked up on revalidation of inode metadata, so a spinlock may be preferable to a mutex anyway.
Comment 43 Steve French 2007-02-25 13:33:25 UTC
Created attachment 10532 [details]
2nd pass at i_lock fix (handles vmtruncate)
Comment 44 Steve French 2007-02-25 19:38:33 UTC
Did some testing - the latest patch (attached to this bugzilla bug) seems to run ok against the stable version of the server code (Samba 3.0.23d) (but I found an unrelated problem in the most current Samba server svn code, which I am looking at).

Let me know if you see any testing problems.

It seems like it covers all paths safely and I am satisfied with it, although I
really don't want to push it if akpm objects (he knows the mm quite well)
Comment 45 Brian Wang 2007-02-25 20:25:10 UTC
I ran my test without the vmtruncate change for more than 20 hours and encountered no problems. I started my test 2 hours ago with the vmtruncate change included, will keep it running, and will let you know if I find anything.

I understand that from the Linux OS point of view this may not be the "perfect" fix, which might be the reason for akpm's objection. But I use this as part of product development, so it is good enough for me.

Thanks, and I hope you will have time to look at bug 8077, which I believe is a very critical bug too.
Comment 46 Anonymous Emailer 2007-02-25 23:41:05 UTC
Reply-To: akpm@linux-foundation.org


I guess we can live with fs-private i_lock usage.  We just have to be
careful about any future vfs-level i_size_write()s popping up.

Comment 47 Steve French 2007-02-26 11:13:54 UTC
Checked into the cifs-2.6 tree - pending minor additional tests; will request a merge soon.
Comment 48 Steve French 2007-02-26 12:08:12 UTC
Created attachment 10540 [details]
missed one i_size_write in cifs truncate from last patch

Shaggy noticed that there are two i_size_writes in the new cifs truncate (we forgot the spinlock around one in the previous patch).

This patch applies on top of the previous one.
Comment 49 Brian Wang 2007-02-26 22:30:11 UTC
Steve,

The 2nd patch above - it looks like the first spin_unlock(&inode->i_lock); needs to be removed? You got a double unlock.
Comment 50 Brian Wang 2007-02-26 22:48:21 UTC
Sorry, my bad - I didn't notice the goto after that.
Comment 51 Amit K Arora 2007-07-24 03:51:17 UTC
Please reopen this bug (I am not able to do it from my id), since the patch in this bug report results in a deadlock situation.

The problem happens because the cifs code calls is_size_safe_to_change() while holding a spinlock, and this function can sleep. Obviously, sleeping inside a spinlock is not a good idea!

The process with pid 8720 is the culprit here. It takes the spinlock on inode->i_lock and calls is_size_safe_to_change(). This in turn eventually calls wait_for_response(), which puts the process on a waitqueue and makes it go into an uninterruptible sleep.
Now, since the above process is holding the lock, suppose all other CPUs (in this server we have only two) get to run processes that also start waiting on the same inode->i_lock; then we get the deadlock, since the process with pid 8720, which is holding the lock, can _not_ be woken up (as both CPUs are busy spinning!).

This is the stack trace for the process with pid 8720:


locktests     D 000000000fcff7fc  8720 23007  22971         23008 23006 (NOTLB)
Call Trace:
[C0000000725EB070] [C000000000521010] 0xc000000000521010 (unreliable)
[C0000000725EB240] [C000000000010E80] .__switch_to+0x130/0x154
[C0000000725EB2D0] [C000000000357F38] .schedule+0xa98/0xbf4
[C0000000725EB3E0] [D000000000A21DA8] .wait_for_response+0xe8/0x1bc [cifs]
[C0000000725EB4C0] [D000000000A22758] .SendReceive+0x2dc/0x598 [cifs]
[C0000000725EB590] [D000000000A06BEC] .CIFSSMBOpen+0x2d8/0x518 [cifs]
[C0000000725EB690] [D000000000A16FDC] .cifs_reopen_file+0x2e4/0x524 [cifs]
[C0000000725EB780] [D000000000A17E18] .find_writable_file+0xe4/0x184 [cifs]
[C0000000725EB820] [D000000000A185C0] .is_size_safe_to_change+0x24/0x90 [cifs]
[C0000000725EB8B0] [D000000000A1BD08] .cifs_get_inode_info_unix+0x794/0x9a8 [cifs]
[C0000000725EBA20] [D000000000A1A79C] .cifs_open+0x71c/0x8c4 [cifs]
[C0000000725EBB30] [C0000000000E69D8] .__dentry_open+0x13c/0x2bc
[C0000000725EBBE0] [C0000000000E6CCC] .do_filp_open+0x50/0x70
[C0000000725EBD00] [C0000000000E6D60] .do_sys_open+0x74/0x130
[C0000000725EBDB0] [C000000000125148] .compat_sys_open+0x24/0x38
[C0000000725EBE30] [C0000000000086A4] syscall_exit+0x0/0x40
Comment 52 Andrew Morton 2007-07-24 09:45:08 UTC
reopened..
Comment 53 Steve French 2007-07-24 12:08:19 UTC
Good catch - in this path the tcp session has died and we are reconnecting while someone is changing the file size.

is_size_safe_to_change will need to be changed so that it does not block. In this case, the intent of calling find_writable_file is to check whether the inode might be in the process of being written to, not to get a file handle we can use for writing.

I see a few easy ways to change this - one is to add a parameter to find_writable_file.
Comment 54 Steve French 2007-07-24 12:40:48 UTC
Could you check to make sure that this patch is ok and works?
Comment 55 Steve French 2007-07-24 12:42:35 UTC
Created attachment 12124 [details]
fix to test for this hang

Also, could you reassign this bug back to me? I hit the wrong button when I tried to add the attachment.
Comment 56 Amit K Arora 2007-07-25 04:53:36 UTC
Hi Steve,

Thanks for the new patch. I have asked the person who faced this problem to try it out. I will update the bug when I hear back from him.
Comment 57 Bryn M. Reeves 2007-07-25 05:15:33 UTC
This is also being tracked (for RHEL5) via Red Hat bugzilla:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=249394
Comment 58 Amit K Arora 2007-07-26 02:08:06 UTC
(In reply to comment #56)
> Hi Steve,
> 
> Thanks for the new patch. I have asked the person who faced this problem to
> try
> it out. Will update the bug when I get to hear from him.
 
The tests with this new patch have passed. They have been running fine for more than 12 hours, whereas we were previously observing the bug within 1-2 hours.

Steve, thanks again for this patch !
Comment 59 Steve French 2007-07-26 08:48:08 UTC
You may want to monitor /proc/fs/cifs/Stats during the test to see if the number of session reconnects keeps increasing (a possible indication of networking problems or a server crash). Although the cifs client automatically reconnects after a tcp session crash (and it might not be visible to an application or test), it is a possible indication of a networking problem and worth keeping an eye on (the cifs client logs a dmesg entry in most of these cases too).
