Bug 8077
| Summary: | Missing files and data corruption when read/write to same CIFS mount point with multiple threads. | | |
| --- | --- | --- | --- |
| Product: | File System | Reporter: | Brian Wang (ywang221) |
| Component: | CIFS | Assignee: | Steve French (sfrench) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | high | CC: | shaggy, shirishp |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.16-13 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
Description
Brian Wang
2007-02-24 09:12:57 UTC
P.S. There is a worst-case data corruption on the write side: big files (>300 MB) tend to get corrupted very often, even when running a single-process copy. One more piece of information: multiple processes seem to make it much worse than multiple threads. Part of the reason, of course, is that in my multi-threaded program there is only one "find" command and many read threads. So I believe the bug is most likely related to "lookup". We believe the cause of this problem might be that cifsFileInfo.fh_sem is not held at multiple places where it is supposed to be taken. Steve and Shirish, would you take a look at the code and confirm?

Reminds me of the cifs page handling bug that Linus found/fixed in the course of addressing an ext filesystem corruption. See: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cb876f451455b6187a7d69de2c112c45ec4b7f99

Have you tried this on 2.6.20 (which presumably includes the fix)? I was fairly certain that others (not just Shirish and me) had done cifs large file copy tests with no particular problems. Assuming this problem was not addressed by the above, typical approaches for debugging data corruption problems include looking for clues:

1) Identify the offset of the beginning and end of the corruptions
2) Identify whether the corruption includes a bad file size
3) Identify whether the corrupt range is zero-filled
4) Identify whether the corrupt range or contents is consistent when the test is repeated

We could also add additional debug code when sessions time out, but we may not need to if cifs was already compiled (or can be compiled) with additional debugging options (CONFIG_CIFS_DEBUG2 and CONFIG_CIFS_STATS2 IIRC); then it is possible to get information on "slow" responses coming back from the server ("echo 4 > /proc/fs/cifs/cifsFYI" enables timing messages on slow responses to be logged to the client's dmesg).

All of these are clues as to the type of problem which could cause this (historically fs developers have had to tell truncation races apart from uninitialized page ranges, writes at wrong offsets, corruption at the tcp socket or ethernet level, etc.). I find it useful to create a test source file for the copy which contains a regular pattern such as "00000001 00000002 00000003 00000004" etc., in order to quickly spot the area of the file which is corrupt (a sketch of such a generator follows this comment).

An important consideration in the problem you describe is that we may actually be taking the server or network connection down with this load. I am somewhat suspicious of that given the particular errors that you report seeing in the log, some of which can be caused when the session is dropped (e.g. due to a stuck request on the server or the tcp socket getting killed). Could you confirm whether /proc/fs/cifs/Stats indicates any session reconnects? What is the server type?

Another interesting approach which can be used to isolate this kind of thing a bit is to take the page manager on the client out of the equation by cifs mounting on the client with the "forcedirectio" mount option, which causes reads/writes to flow through at their requested size directly, without going through the pagecache (readahead/writebehind). If that still fails, it reduces the possible causes of the failure significantly and is somewhat simpler to debug. Though I will mention that Danny Sung had noted this wrt the forcedirectio option: "Just giving you an update. forcedirectio seems to be a lot more reliable than not having the option. However, for a particularly large file (eg. 6GB) it aborts with a message saying 'Host down'."
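A minimal sketch of the patterned test file suggested above, assuming nothing beyond standard C: every 9-byte cell holds its own index, so a hex dump (or `cmp -l`) of a corrupt copy immediately shows which file offset went bad. The file name and size here are arbitrary choices, not from the thread.

```c
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("pattern.dat", "w");
	unsigned long i;
	const unsigned long cells = 64UL * 1024 * 1024;	/* ~600 MB, past the 300 MB threshold */

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* "00000001 00000002 00000003 ..." - cell N sits at offset (N-1)*9 */
	for (i = 1; i <= cells; i++)
		fprintf(f, "%08lu ", i);
	fclose(f);
	return 0;
}
```

A corrupted byte at offset X then maps back to cell X/9 + 1, so the damaged region can be located without bisecting the file by hand.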
" My test with "forcedirectio" has no effect to this bug regradless of what server I use (SAMBA, Win2003 Server, WinXP etc). It is very reproducibble and persistent, as long as you run multiple streams on same share, you got problem. The interesting thing is that if you run multiple streams on same share but against disjoint trees under it, you are fine. I strongly believe it is some race condition on the connection. and connection drops not related to physical connection drop, instead, CIFS attemps the disconnection and re-connects because of some errors. OK - the one piece of data we are missing is whether Linus's fix to cifs_writepages is in the code that you tried (as it would be if it were 2.6.20 or current kernel). Yes, I put that patch into it already. I have not seen the data corruption on the write yet. before i had the patch in, the file mis compare once in a few rounds of cp. So it does seem to fix the corruption, but I will do more test on that. This corruption is silent corruption, no errors are reported to the user land. But the read/write errors are reported to user land if you run multiple stream on one share. Kernel also reports tons of errors. In my case, my share has abour 34000 files, if I run 2 "cp" commands, about 200-500 read errors are reported to user land. If I run 1 cp command, the result is always good. another way to see this if to run a few "find /cifs_share | wc" at the same time, you will get different counts. the share is absolutely static. When I see readdir returning different numbers of files it does remind me of the readdir fixes that went in after 2.6.16, and scanning through the log of cifs changes ("git log v2.6.16.. fs/cifs") showed a few obvious possibilities for a fix which already is in which addresses this. I am not sure what version of cifs.ko that you are using - but if you can not verify on more current kernel, should at least try the backport tree (which includes all but the last three or four fixes) http://master.kernel.org/git/?p=linux/kernel/git/sfrench/cifs-backport-old-kernels.git;a=shortlog I brought in a bunch of patches. It works much better. Except I still got errors occasionally with multiple streams. But i found out most of the read errors are due to "ECONNRESET" from smb_send or smb_send2. Then I added code to check "ECONNRESET" and then translate to "EAGAIN", for the last few tests, I have not seen read errors. I will keep running more stress tests. Any concerns with translating "ECONNRESET" to "EAGAIN" in SendReceive and SendReceive2 after smb_send returns "ECONNRESET"? This sounds good. I don't want to remap to EAGAIN without more indication of what is going on and more analysis of what would happen in this code path. That ECONNRESET is happening from time to time on sendmsg calls is strange. Are the ECONNRESET associated with an increase in the session reconnect count (can be displayed in /proc/fs/cifs/Stats - note that stats can also be reset to zero)? ECONNRESEts are associated with an increase in the session reconnect count. Looks like some servers keep resetting the connections on heavy load. so map ECONNRESET to EAGAIN does solve most of my problem, which also seems to be a logic thing to do in CIFS. I got lots of ERRbadfid (mapped to EBADF by cifs) from servers when they are on heavy load. I don't think there is anything we can do about it, is there? 
But the weird thing is that only some of those EBADF are reported to userland and some don't show up in userland at all; yet when nothing is reported to userland, nothing seems to be wrong. I guess I still need to investigate it more.

I think you may be right. SendReceive and SendReceive2 do not special-case the return codes coming back from smb_send and smb_send2, so they are likely to return ECONNRESET rather than EAGAIN. But one question: associated with the ECONNRESET, do you see any errors in dmesg indicating "No response for cmd" ... which might indicate that the client was the one closing the socket?

Yes, I do see "No response for cmd" ... but I don't think this always causes the client to reset the connections. Actually, wouldn't mapping ECONNRESET to EAGAIN also prevent the client from reconnecting - is that correct?

We typically mark the connection as bad in only one thread (the cifsd thread, which manages reads from the socket connection to this particular mounted server; to be precise, there is one cifsd thread per unique server ip address we are connected to). So basically the trigger for marking the session as bad (and reconnecting the socket) is usually kernel_recvmsg (a peek or read of the tcp socket) returning something other than:

1) ERESTARTSYS or EAGAIN (which need to be retried)
2) EINTR (indicating the cifsd thread was killed intentionally)

or another thread marking the server->tcpStatus field as CIFSNeedReconnect. So if there are errors on smb_send and smb_send2 (on kern_sendmsg), e.g. ECONNRESET (?), which clearly indicate that the socket is down and not coming back, then we can mark "CIFSNeedReconnect" in the ses->server->tcpStatus field and cifsd will reconnect it (see the sketch after the correction below). I should make a minor correction:
> We typically mark the connection as bad in only one thread (the cifsd thread
should be
"We reconnect a connetion marked as bad in only one thread (the cifsd thread)"
We can mark the tcpStatus as "needing reconnect" in a number of places in
different threads, not just after kern_recvmsg in cifsd
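To make the description above concrete, here is a sketch of the send-path alternative Steve outlines: flag the session so that cifsd reconnects it. The struct layouts below are illustrative stand-ins (the real definitions live in fs/cifs/cifsglob.h), the enum spelling follows the thread approximately, and the helper name handle_send_error is hypothetical.

```c
#include <errno.h>

/* illustrative stand-ins for the kernel structures named in the thread */
enum statusEnum { CifsGood, CifsNeedReconnect };
struct TCP_Server_Info { enum statusEnum tcpStatus; };
struct cifsSesInfo     { struct TCP_Server_Info *server; };

/*
 * When smb_send/smb_send2 fail in a way that clearly means the
 * socket is down and not coming back (e.g. -ECONNRESET), mark the
 * session so the cifsd thread reconnects it, and return a
 * retryable error to the caller.
 */
static int handle_send_error(struct cifsSesInfo *ses, int rc)
{
	if (rc == -ECONNRESET) {
		ses->server->tcpStatus = CifsNeedReconnect;	/* cifsd rebuilds the socket */
		rc = -EAGAIN;
	}
	return rc;
}
```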
There is another interesting thing found in the last 2 days. My program copies the files from the CIFS share and also saves the attributes, then copies the files back to the CIFS share and restores them. But in each round there is always a number of files that fail to restore the correct last-modified time. The userland call "utime" doesn't return an error, but those files end up with the last-modified time set to the current time from when the file was written back. Most of the files have the time restored correctly, but there are always a few that don't (different ones each round). My guess is that smb requests are not serialized for a single file? If that's the case, then a set-time request sent after a write request could be executed before the write request, and the time would be set back to the current time. Could that be the case? We consider this a different type of data corruption. Any thoughts? If what I said is correct, though, the fix could be extremely hard.

OK guys, here are some updates on the timestamp issue. It looks like whenever there is an unclosed file handle, CIFSSMBSetTimes won't be effective, because the time gets reset later. What I did is look up the open file: if there is one, we re-use the file id and pass it to CIFSSMBSetFileTimes; otherwise, we open the file again and do CIFSSMBSetFileTimes. This seems to fix the timestamp problem (sketched below, after this comment). I still get "ERRbadfid" (mapped to EBADF in CIFS), but the file is definitely on the server; it is just in the middle of a read when ERRbadfid gets returned. Any idea how to attack this problem? One possibility is that CIFS sends something wrong to the server? Also, is there a fix for the "read" data corruption after 1.45? We are now seeing read corruptions, though not as often as the "write" corruption fixed by Linus's patch.

It has been quiet. :-) A data corruption update: it seems that there is a bug somewhere in the VM when many processes run on CIFS; data corruption happens under heavy load. But if we disable the page cache by removing .readpages = cifs_readpages and .writepages = cifs_writepages, we can get rid of the data corruption - at the cost of lower performance:

```diff
/linux-2.6-copan/src/linux/fs/cifs/file.c
====
@@ -2000,13 +2000,17 @@ struct address_space_operations cifs_addr_ops = {
 	.readpage = cifs_readpage,
+#if 0 /* xxx */
 #if LINUX_VERSION_CODE > KERNEL_VERSION(2, 5, 0)
 	.readpages = cifs_readpages,
 #endif
+#endif /* xxx */
 	.writepage = cifs_writepage,
+#if 0 /* xxx */
 #if LINUX_VERSION_CODE > KERNEL_VERSION(2, 6, 14)
 	.writepages = cifs_writepages,
 #endif
+#endif /* xxx */
 	.prepare_write = cifs_prepare_write,
 	.commit_write = cifs_commit_write,
 	.set_page_dirty = __set_page_dirty_nobuffers,
@@ -2022,9 +2026,11 @@ struct address_space_operations cifs_addr_ops_smallbuf = {
 	.readpage = cifs_readpage,
 	.writepage = cifs_writepage,
+#if 0 /* xxx */
 #if LINUX_VERSION_CODE > KERNEL_VERSION(2, 6, 14)
 	.writepages = cifs_writepages,
 #endif
+#endif /* xxx */
 	.prepare_write = cifs_prepare_write,
 	.commit_write = cifs_commit_write,
 	.set_page_dirty = __set_page_dirty_nobuffers,
```
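For clarity, a sketch of Brian's handle-reuse timestamp fix described above, under stated assumptions: find_open_fid, open_fid, close_fid, and set_file_times_by_fid are hypothetical stand-ins for the open-file lookup and for CIFSSMBSetFileTimes; only the control flow mirrors the fix.

```c
/* hypothetical stand-ins; only the control flow is the point */
struct file_times;
int find_open_fid(const char *path);                     /* reuse an existing handle */
int open_fid(const char *path);                          /* else open one ourselves  */
void close_fid(int fid);
int set_file_times_by_fid(int fid, const struct file_times *ft);

/*
 * A path-based set-times (CIFSSMBSetTimes) can be undone when another
 * still-open handle is later flushed and closed, so set the times by
 * file id instead, reusing an open handle when one exists.
 */
static int restore_times(const char *path, const struct file_times *ft)
{
	int opened_here = 0;
	int rc, fid = find_open_fid(path);

	if (fid < 0) {
		fid = open_fid(path);	/* no open handle: open one */
		if (fid < 0)
			return fid;
		opened_here = 1;
	}
	rc = set_file_times_by_fid(fid, ft);
	if (opened_here)
		close_fid(fid);
	return rc;
}
```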
I think these problems are fixed as part of the data integrity patches that went into cifs last year: changing the sockets from non-blocking to blocking, and a fix in cifs_writepages. We made major changes to the cifs SendReceive code to alter sendmsg (now blocking), which helps reduce or eliminate the retries (and the code which allowed partial frames to be sent in some write cases is also fixed); the handle invalidation code and writepages are now fixed as well, so we should be past this problem. Please reopen if you rediscover this on 2.6.27.x or later.