Most recent kernel where this bug did not occur: I traced the change back to commit a32ea1e1f925399e0d81ca3f7394a44a6dafa12c with git-bisect so that would probably be 2.6.22. I actually ran 2.6.21 prior to hitting the bug. I pulled the latest from git a few days ago and the bug was still there. Distribution: Slackware 11. Hardware Environment: client: Core 2 Duo, 1 GB ram, gigabit ethernet (e1000). server: dual Athlon MP at 2133 MHz, 3 GB ram, hardware RAID, gigabit ethernet (e1000) and no load. Software Environment: client: Slackware 11 with a plain kernels from kernel.org. server: redhat linux 9 upgraded to kernel 2.6.19.2 from kernel.org. Both are running nfs-utils 1.0.10 and mount-2.12r. server (origin1)> cat /var/lib/nfs/xtab /users1 spock(rw,sync,wdelay,hide,nocrossmnt,insecure,no_root_squash,no_all_squash,no_subtree_check,secure_locks,acl,mapping=identity,anonuid=65534,anongid=65534) client (spock)> cat /proc/mounts origin1:/users1 /users1 nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=udp,timeo=7,retrans=3,sec=sys,addr=origin1 0 0 Problem Description: Code which does repeated lseek(end of file), read(), write() calls on a file now runs 10 times slower than it used to. It seems that each loop on that sequence now requires a round trip to the server. Possibly the read() used to be skipped because it is known locally that is is beyond EOF, and the commit removed that behavior to fix some other bug. Interesting note: exporting with async instead of sync on the server reduces the slowdown to a factor of 3 (only retested the buggy kernel though). Steps to reproduce: I will attach a sample program with the exact calls. The sequence of calls is generated by glibc which inserts huge read() calls between the lseek() and write() ones (from fseek() and fwrite()) because the file is opened in read/write mode. This is arguably bad but it used to have no noticeable effect.
Created attachment 14044 [details] simple test case
Do you have a 'strace' that demonstrates what glibc is doing? Also a tcpdump might be useful (binary). Something like tcpdump -s 9000 -w dump.out host my.server and port 2049 should do the trick.
The attached program contains both a version which goes through glibc and one with syscalls which I wrote from the strace. The glibc strace looks like: open("slow_nfs_test_file", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3 fstat64(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0 mmap2(NULL, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb 7f55000 _llseek(3, 0, [0], SEEK_SET) = 0 write(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 25 60 _llseek(3, 0, [0], SEEK_SET) = 0 read(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 25 60 write(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 25 60 _llseek(3, 0, [0], SEEK_SET) = 0 read(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 51 20 write(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 25 60 [...] _llseek(3, 1343488, [1343488], SEEK_SET) = 0 read(3, "\0\0\0\0\370\277\365\267\250\205\247\2774\200\4\10\210"..., 32768) = 8192 write(3, "|~\247\277\226\252\364\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2560 _llseek(3, 1343488, [1343488], SEEK_SET) = 0 read(3, "\0\0\0\0\370\277\365\267\250\205\247\2774\200\4\10\210"..., 32768) = 10752 write(3, "|~\247\277\226\252\364\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2560 _llseek(3, 1343488, [1343488], SEEK_SET) = 0 read(3, "\0\0\0\0\370\277\365\267\250\205\247\2774\200\4\10\210"..., 32768) = 13312 write(3, "|~\247\277\226\252\364\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2560 [...] I'm not quite sure what's up with there being so many writes to the same location (don't remember that) but it does not matter for reproducing the problem. The simplified version has the following strace and the same slowdown: open("slow_nfs_test_file", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3 lseek(3, 0, SEEK_SET) = 0 read(3, "", 32768) = 0 write(3, "\274\367\272\277\226J\362\267\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2 560 lseek(3, 2560, SEEK_SET) = 2560 read(3, "", 32768) = 0 write(3, "\274\367\272\277\226J\362\267\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2 560 lseek(3, 5120, SEEK_SET) = 5120 read(3, "", 32768) = 0 write(3, "\274\367\272\277\226J\362\267\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2 560 [... all the same but with incrementing offsets until the end]
Created attachment 14047 [details] output of tcpdump when writing a 1 MB file
That is a really _sucky_ implementation of buffered i/o. I fail to see what that read() is supposed to be good for. That needs to be filed as a glibc bug. Looking at what it is doing, I'll bet that the real reason for the regression is once again commit 44dd151d5c21234cc534c47d7382f5c28c3143cd (NFS: Don't mark a written page as uptodate until it is on disk). I'm adding a patch that should fix this particular regression, but really this should be fixed on the glibc side.
Created attachment 14048 [details] NFS: Fix a read performance regression Fix the regression by setting the page as uptodate if we're appending to the file.
Reply-To: akpm@linux-foundation.org On Sat, 15 Dec 2007 09:59:44 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > That is a really _sucky_ implementation of buffered i/o. I fail to see what > that > read() is supposed to be good for. That needs to be filed as a glibc bug. Ulrich, are your ears burning?
The patch works ok here. Thanks. The glibc involved is 2.3.6 and is already 2 years old. I tried building 2.7 to see if it has the same problem but had no luck getting it to work. I'll file a glibc bug if I can't verify that they've fixed this already by some other means.
(In reply to comment #7) > Ulrich, are your ears burning? The stdio implementation is unfortunately littered with idiosyncrasies due to backward compatibility. These make it hard to impossible to make fundamental changes. In this case I don't see much of a problem. The implementation cannot make assumptions about the file content. The previously known file size might be wrong at the time when the next buffer has to be read from disk. Therefore the reads should all be necessary.
While I appreciate that backward compatibility may be a problem, I still don't fully understand your point. AFAICS the application isn't actually requesting a fread(). It is rather just performing a series of fseek() and fwrite() operations. While I can understand that the buffer contents may need to be revalidated should the client issue a fread() request, that is not the case here. I'm therefore surprised that glibc is inserting these read() syscalls which would appear to serve no useful purpose as far as the example application is concerned.
Ultimately, this problem was fixed by commit efc91ed0191e3fc62bb1c556ac93fc4e661214d2 (NFS: Optimise append writes with holes). Marking as resolved.