Bug 8077 - Missing files and data corruption when reading and writing the same CIFS mount point with multiple threads.
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: CIFS
Hardware: i386 Linux
Importance: P2 high
Assignee: Steve French
Reported: 2007-02-24 09:12 UTC by Brian Wang
Modified: 2009-01-23 11:03 UTC
CC List: 2 users

Kernel Version: 2.6.16-13
Regression: ---

Description Brian Wang 2007-02-24 09:12:57 UTC
Most recent kernel where this bug did *NOT* occur:
Distribution: SUSE
Hardware Environment: Intel, SMP/32bit
Software Environment: 2.6.16-13; reproducible in later versions, 2.6.17-2.6.19
Problem Description: 

On the same CIFS mount point, if reads and writes are done by multiple threads
or processes, some files go missing and some files get corrupted.

Steps to reproduce:


1. Mount a CIFS share.
2. Run multiple "cp -R ..." or "find ...." commands against the mount point.
3. dmesg shows tons of errors (-9, -11, -13, "no file handles for inode", etc.)
and userland commands also report errors.

If you run one process, you still get a read error about once in every ten
rounds (my share has more than 30,000 files).
Comment 1 Brian Wang 2007-02-24 09:31:26 UTC
P.S. There is a worse case of data corruption on the write side: big files
(>300MB) tend to get corrupted very often, even when running a single-process
copy.

One more piece of information: multiple processes seem to make it much worse
than multiple threads. Of course, part of the reason is that in my
multi-threaded program there is only one find command and many read threads.
So I believe the bug is most likely related to "lookup".
Comment 2 Brian Wang 2007-02-24 11:30:26 UTC
We believe the cause of this problem might be that cifsFileInfo.fh_sem is not
held at multiple places where it is supposed to be taken.

Steve and Shirish, would you take a look at the code and confirm?
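
A minimal sketch of the locking pattern in question, assuming the 2.6.16-era
cifsFileInfo fields (fh_sem, invalidHandle, netfid); the helper called here is
hypothetical:

/* hypothetical illustration: a path dereferencing netfid is expected to
 * hold fh_sem so it serializes with the handle being re-established
 * (cifs_reopen_file) after a session reconnect */
static int use_fid_locked(struct cifsFileInfo *open_file)
{
	int rc = 0;

	down(&open_file->fh_sem);
	if (open_file->invalidHandle)
		rc = -EAGAIN;		/* handle is being reopened; retry */
	else
		rc = do_work_with_fid(open_file->netfid); /* hypothetical helper */
	up(&open_file->fh_sem);
	return rc;
}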
Comment 3 Steve French 2007-02-25 20:58:39 UTC
Reminds me of the cifs page handling bug that Linus found/fixed in the course of
addressing an ext file corruption.  See:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cb876f451455b6187a7d69de2c112c45ec4b7f99

Have you tried this on 2.6.20 (which presumably includes the fix)?

I was fairly certain that others (not just Shirish and me) had done cifs large
file copy tests with no particular problems. Assuming this problem was not
addressed by the above, typical approaches for debugging data corruption
problems include looking for clues:

1) Identify the offset of the beginning and end of the corruptions
2) Identify whether the corruption includes a bad file size
3) Identify whether the corrupt range is zero-filled
4) Identify whether the corrupt range or contents is consistent when the test is
repeated

We could also add additional debug code for when sessions time out, but we may
not need to if cifs already was compiled (or can be compiled) with additional
debugging options (CONFIG_CIFS_DEBUG2 and CONFIG_CIFS_STATS2 IIRC) - then it is
possible to get information on "slow" responses coming back from the server
("echo 4 > /proc/fs/cifs/cifsFYI" enables timing messages on slow responses to
be logged to the client's dmesg).

All of these are clues as to the type of problem which could cause this
(historically fs developers have had to tell truncation races apart from
uninitialized page ranges, writes at wrong offsets, corruption at the tcp
socket or ethernet level, etc.).

I find it useful to create a test source file for the copy which contains a
regular pattern such as:
    00000001 00000002 00000003 00000004 etc.
in order to quickly spot the area of the file which is corrupt.
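
A quick way to generate such a source file is a small userland program (a
sketch; the file name and size here are arbitrary):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("pattern.dat", "w");
	unsigned long i;

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* 40,000,000 nine-byte tokens ("%08lu ") makes a ~360 MB file */
	for (i = 1; i <= 40000000UL; i++)
		fprintf(f, "%08lu ", i);
	fclose(f);
	return 0;
}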

An important consideration in the problem you describe is that we may actually
be taking the server or network connection down with this load. I am somewhat
suspicious of that given the particular errors that you report seeing in the
log, some of which can be caused when the session is dropped (e.g. due to a
stuck request on the server or the tcp socket getting killed). Could you
confirm whether /proc/fs/cifs/Stats indicates any session reconnects?

What is the server type?
Comment 4 Steve French 2007-02-25 21:05:13 UTC
Another interesting approach which can be used to isolate this kind of thing a
bit is to take the page cache on the client out of the equation by mounting
cifs on the client with the "forcedirectio" mount option, which causes
reads/writes to flow at their requested size directly, without going through
the pagecache (readahead/writebehind). If that still fails, it reduces the
possible causes of the failure significantly and is somewhat simpler to debug.
Comment 5 Shirish Pargaonkar 2007-02-26 07:27:09 UTC
Thought I would mention that Danny Sung had noted this wrt the forcedirectio option

"Just giving you an update.  forcedirectio seems to be a lot more
reliable than not having the option.  However, for a particularly large
file (eg. 6GB) it aborts with a message saying "Host down". "
Comment 6 Brian Wang 2007-02-26 07:40:09 UTC
My testing shows "forcedirectio" has no effect on this bug regardless of what
server I use (Samba, Win2003 Server, WinXP, etc.).

It is very reproducible and persistent: as long as you run multiple streams
on the same share, you get problems. The interesting thing is that if you run
multiple streams on the same share but against disjoint trees under it, you are
fine. I strongly believe it is some race condition on the connection, and the
connection drops are not related to physical connection drops; instead, CIFS
attempts the disconnection and re-connects because of some errors.
Comment 7 Steve French 2007-02-26 20:32:12 UTC
OK - the one piece of data we are missing is whether Linus's fix to
cifs_writepages is in the code that you tried (as it would be if it were 2.6.20
or a current kernel).
Comment 8 Brian Wang 2007-02-26 20:50:12 UTC
Yes, I put that patch into it already. I have not seen the data corruption on
the write side yet; before I had the patch in, the file miscompared once in a
few rounds of cp. So it does seem to fix the corruption, but I will do more
testing on that. This corruption is silent corruption: no errors are reported
to userland. But the read/write errors are reported to userland if you run
multiple streams on one share. The kernel also reports tons of errors. In my
case, my share has about 34,000 files; if I run 2 "cp" commands, about 200-500
read errors are reported to userland. If I run 1 cp command, the result is
always good.
Comment 9 Brian Wang 2007-02-26 23:33:45 UTC
Another way to see this is to run a few "find /cifs_share | wc" commands at the
same time: you will get different counts. The share is absolutely static.
Comment 10 Steve French 2007-02-27 13:26:07 UTC
When I see readdir returning different numbers of files it does remind me of the
readdir fixes that went in after 2.6.16, and scanning through the log of cifs
changes ("git log v2.6.16.. fs/cifs") showed a few obvious candidates for a fix
which is already in and which addresses this.

I am not sure what version of cifs.ko you are using - but if you cannot verify
on a more current kernel, you should at least try the backport tree (which
includes all but the last three or four fixes):
http://master.kernel.org/git/?p=linux/kernel/git/sfrench/cifs-backport-old-kernels.git;a=shortlog
Comment 11 Brian Wang 2007-02-27 15:45:15 UTC
I brought in a bunch of patches, and it works much better, except that I still
get errors occasionally with multiple streams.

But I found out most of the read errors are due to "ECONNRESET" from smb_send
or smb_send2. I then added code to check for "ECONNRESET" and translate it
to "EAGAIN"; in the last few tests, I have not seen read errors. I will keep
running more stress tests.

Any concerns with translating "ECONNRESET" to "EAGAIN" in SendReceive and
SendReceive2 after smb_send returns "ECONNRESET"?
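
Something like this, as a sketch of the idea rather than the actual patch (the
wrapper name is made up and the smb_send signature shown is approximate):

/* wrap the send so a TCP reset is reported as -EAGAIN, which the existing
 * retry logic in SendReceive/SendReceive2 already understands */
static int smb_send_retriable(struct socket *ssocket, struct smb_hdr *buf,
			      unsigned int len, struct sockaddr *addr)
{
	int rc = smb_send(ssocket, buf, len, addr);

	if (rc == -ECONNRESET)
		rc = -EAGAIN;	/* socket was reset; let the caller retry */
	return rc;
}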
Comment 12 Steve French 2007-02-27 20:47:49 UTC
This sounds good.

That said, I don't want to remap to EAGAIN without more indication of what is
going on and more analysis of what would happen in this code path.

That ECONNRESET is happening from time to time on sendmsg calls is strange.

Are the ECONNRESETs associated with an increase in the session reconnect count
(this can be displayed in /proc/fs/cifs/Stats - note that the stats can also be
reset to zero)?
Comment 13 Brian Wang 2007-02-27 21:10:14 UTC
The ECONNRESETs are associated with an increase in the session reconnect count.
It looks like some servers keep resetting the connections under heavy load, so
mapping ECONNRESET to EAGAIN does solve most of my problem, and it also seems a
logical thing to do in CIFS.

I get lots of ERRbadfid errors (mapped to EBADF by cifs) from servers when they
are under heavy load. I don't think there is anything we can do about that, is
there? But the weird thing is that only some of those EBADF errors are reported
to userland and some don't show up in userland at all, yet nothing seems to be
wrong when nothing is reported to userland. I guess I still need to investigate
it more.
Comment 14 Steve French 2007-03-01 08:27:56 UTC
I think you may be right. SendReceive and SendReceive2 do not special-case the
return codes coming back from smb_send and smb_send2, so they are likely to
return ECONNRESET rather than EAGAIN.

But one question ...
Associated with the ECONNRESET, do you see any errors in dmesg indicating:
   "No response for cmd " ...
which might indicate that the client was the one closing the socket?

Comment 15 Brian Wang 2007-03-01 08:54:15 UTC
Yes, I do see "No response for cmd " ... but I don't think this always causes
the client to reset the connections.

Actually, mapping ECONNRESET to EAGAIN would also prevent the client from
reconnecting, is that correct?
Comment 16 Steve French 2007-03-01 10:01:46 UTC
We typically mark the connection as bad in only one thread (the cifsd thread,
which manages reads from the socket connection to this particular mounted
server; to be precise, there is one cifsd thread per unique server IP address
we are connected to).

So basically the trigger for marking the session as bad (and reconnecting the
socket) is usually:

kernel_recvmsg (peek or read of the tcp socket) returning something other than:
1) ERESTARTSYS or EAGAIN (which need to be retried)
2) EINTR (indicating the cifsd thread was killed intentionally)
or
another thread marking the server->tcpStatus field as CIFSNeedReconnect

So if there are errors on smb_send and smb_send2 (on kernel_sendmsg), e.g.
ECONNRESET (?), which clearly indicate that the socket is down and not coming
back, then we can mark "CIFSNeedReconnect" in the ses->server->tcpStatus field
and cifsd will reconnect it.
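
Roughly, as a sketch (exactly which send errors should be treated as fatal to
the socket still needs analysis):

/* after smb_send/smb_send2 fails with an error meaning the socket is gone,
 * flag the session so cifsd reconnects it, and return -EAGAIN so the
 * caller retries once the session is back */
if (rc == -ECONNRESET || rc == -EPIPE) {
	ses->server->tcpStatus = CifsNeedReconnect;
	rc = -EAGAIN;
}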
Comment 17 Steve French 2007-03-01 10:03:36 UTC
I should make a minor correction:
> We typically mark the connection as bad in only one thread (the cifsd thread

should be
"We reconnect a connection marked as bad in only one thread (the cifsd thread)"

We can mark the tcpStatus as "needing reconnect" in a number of places in
different threads, not just after kernel_recvmsg in cifsd.

Comment 18 Brian Wang 2007-03-03 08:35:59 UTC
There is another interesting thing I found in the last 2 days.

My program copies the files from the CIFS share and also saves the attributes,
then copies the files back to the CIFS share and restores them. But in each
round, there are always a number of files that fail to restore the correct
last-modified time. The userland call "utime" doesn't return an error, but
those files end up with the last-modified time set to the current time at which
the file was written back. Most of the files have the time restored correctly,
but there are always a few files that don't (different ones each round).

My guess is that smb requests are not serialized for a single file? If that's
the case, then a request to set the time, sent after a write request, could be
executed before the write request, and the time would then be set back to the
current time. Could that be the case?

We consider this a different type of data corruption. Any thoughts? But if
what I said is correct, then the fix could be extremely hard.
Comment 19 Brian Wang 2007-03-07 08:03:41 UTC
OK guys, here are some updates on the timestamp issue.

It looks like whenever there is an unclosed file handle, CIFSSMBSetTimes won't
be effective, because the time gets reset later.

What I did is look up the open file; if there is one, we re-use the file id and
pass it to CIFSSMBSetFileTimes; otherwise, we open the file again and call
CIFSSMBSetFileTimes.

This seems to fix the timestamp problem.
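
In outline (a sketch of the approach; the lookup helper is hypothetical and the
CIFSSMBSetFileTimes/CIFSSMBSetTimes argument lists are abbreviated):

/* prefer setting times through an already-open handle, so a later flush or
 * close on that handle cannot clobber the timestamp */
open_file = find_writable_file(cifs_inode);	/* hypothetical lookup */
if (open_file)
	rc = CIFSSMBSetFileTimes(xid, tcon, info, open_file->netfid);
else
	rc = CIFSSMBSetTimes(xid, tcon, full_path, info, nls_codepage, remap);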
Comment 20 Brian Wang 2007-03-07 08:41:00 UTC
I still get "ERRbadfid" (which gets mapped to EBADF in CIFS), but the file
definitely is on the server; it is just that in the middle of a read, ERRbadfid
gets returned.

Any idea how to attack this problem? One possibility is that CIFS sends
something wrong to the server?
Comment 21 Brian Wang 2007-03-07 12:45:53 UTC
Also, is there a fix for the "read" data corruption after 1.45? We are now
seeing read corruptions, though not as often as the "write" corruption fixed by
Linus's patch.

It has been quiet. :-)
Comment 22 Brian Wang 2007-03-14 13:48:32 UTC
Data corruption update.

It seems that there is a bug somewhere in the VM: when many processes run on
CIFS, data corruption happens under heavy load.

But if we disable the page cache by removing .writepages = cifs_writepages
and .readpages = cifs_readpages, we can get rid of the data corruption.

But we also get lower performance.

/linux-2.6-copan/src/linux/fs/cifs/file.c
====
@@ -2000,13 +2000,17 @@

  struct address_space_operations cifs_addr_ops = {
  	.readpage = cifs_readpage,
+#if 0 /* xxx */
  #if LINUX_VERSION_CODE > KERNEL_VERSION(2, 5, 0)
  	.readpages = cifs_readpages,
  #endif
+#endif /* xxx */
  	.writepage = cifs_writepage,
+#if 0 /* xxx */
  #if LINUX_VERSION_CODE > KERNEL_VERSION(2, 6, 14)
  	.writepages = cifs_writepages,
  #endif
+#endif /* xxx */
  	.prepare_write = cifs_prepare_write,
  	.commit_write = cifs_commit_write,
  	.set_page_dirty = __set_page_dirty_nobuffers,
@@ -2022,9 +2026,11 @@
  struct address_space_operations cifs_addr_ops_smallbuf = {
  	.readpage = cifs_readpage,
  	.writepage = cifs_writepage,
+#if 0 /* xxx */
  #if LINUX_VERSION_CODE > KERNEL_VERSION(2, 6, 14)
  	.writepages = cifs_writepages,
  #endif
+#endif /* xxx */
  	.prepare_write = cifs_prepare_write,
  	.commit_write = cifs_commit_write,
  	.set_page_dirty = __set_page_dirty_nobuffers,
Comment 23 Shirish Pargaonkar 2009-01-22 11:27:22 UTC
I think these problems are fixed as part of the data integrity patches that
went into cifs last year: changing sockets from non-blocking to blocking, and
a fix in cifs_writepages.
Comment 24 Steve French 2009-01-23 11:03:39 UTC
We made major changes to the cifs SendReceive code to alter sendmsg (now blocking), which helps reduce or eliminate the retries (and the code which allowed partial frames to be sent in some write cases is also fixed); the handle invalidation code is now fixed, and writepages as well - so we should be past this problem now. Please reopen if you rediscover this on 2.6.27.x or later.
