Bug 9566 - NFS client 10x performance regression with lseek/read/write mix
Summary: NFS client 10x performance regression with lseek/read/write mix
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-12-14 14:58 UTC by Olivier Paquet
Modified: 2010-02-03 20:52 UTC (History)
7 users (show)

See Also:
Kernel Version: 2.6.23
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
simple test case (1.69 KB, text/plain)
2007-12-14 15:00 UTC, Olivier Paquet
Details
output of tcpdump when writing a 1 MB file (663.03 KB, application/octet-stream)
2007-12-14 16:40 UTC, Olivier Paquet
Details
NFS: Fix a read performance regression (1.30 KB, patch)
2007-12-15 10:00 UTC, Trond Myklebust
Details | Diff

Description Olivier Paquet 2007-12-14 14:58:14 UTC
Most recent kernel where this bug did not occur:
I traced the change back to commit a32ea1e1f925399e0d81ca3f7394a44a6dafa12c with git-bisect so that would probably be 2.6.22. I actually ran 2.6.21 prior to hitting the bug. I pulled the latest from git a few days ago and the bug was still there.

Distribution:
Slackware 11.

Hardware Environment:
client: Core 2 Duo, 1 GB ram, gigabit ethernet (e1000).
server: dual Athlon MP at 2133 MHz, 3 GB ram, hardware RAID, gigabit ethernet (e1000) and no load.

Software Environment:
client: Slackware 11 with a plain kernels from kernel.org.
server: redhat linux 9 upgraded to kernel 2.6.19.2 from kernel.org.
Both are running nfs-utils 1.0.10 and mount-2.12r.

server (origin1)> cat /var/lib/nfs/xtab
/users1 spock(rw,sync,wdelay,hide,nocrossmnt,insecure,no_root_squash,no_all_squash,no_subtree_check,secure_locks,acl,mapping=identity,anonuid=65534,anongid=65534)

client (spock)> cat /proc/mounts
origin1:/users1 /users1 nfs
rw,vers=3,rsize=32768,wsize=32768,hard,proto=udp,timeo=7,retrans=3,sec=sys,addr=origin1
0 0

Problem Description:
Code which does repeated lseek(end of file), read(), write() calls on a file now runs 10 times slower than it used to. It seems that each loop on that sequence now requires a round trip to the server. Possibly the read() used to be skipped because it is known locally that is is beyond EOF, and the commit removed that behavior to fix some other bug.

Interesting note: exporting with async instead of sync on the server reduces the slowdown to a factor of 3 (only retested the buggy kernel though).

Steps to reproduce:
I will attach a sample program with the exact calls.

The sequence of calls is generated by glibc which inserts huge read() calls between the lseek() and write() ones (from fseek() and fwrite()) because the file is opened in read/write mode. This is arguably bad but it used to have no noticeable effect.
Comment 1 Olivier Paquet 2007-12-14 15:00:29 UTC
Created attachment 14044 [details]
simple test case
Comment 2 Trond Myklebust 2007-12-14 15:36:39 UTC
Do you have a 'strace' that demonstrates what glibc is doing?

Also a tcpdump might be useful (binary). Something like

  tcpdump -s 9000 -w dump.out host my.server and port 2049

should do the trick.
Comment 3 Olivier Paquet 2007-12-14 16:37:10 UTC
The attached program contains both a version which goes through glibc and one with syscalls which I wrote from the strace. The glibc strace looks like:

open("slow_nfs_test_file", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap2(NULL, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb
7f55000
_llseek(3, 0, [0], SEEK_SET)            = 0
write(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 25
60
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 25
60
write(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 25
60
_llseek(3, 0, [0], SEEK_SET)            = 0
read(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 51
20
write(3, "\234<\211\277\226:\366\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 25
60
[...]
_llseek(3, 1343488, [1343488], SEEK_SET) = 0
read(3, "\0\0\0\0\370\277\365\267\250\205\247\2774\200\4\10\210"..., 32768) = 8192
write(3, "|~\247\277\226\252\364\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2560
_llseek(3, 1343488, [1343488], SEEK_SET) = 0
read(3, "\0\0\0\0\370\277\365\267\250\205\247\2774\200\4\10\210"..., 32768) = 10752
write(3, "|~\247\277\226\252\364\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2560
_llseek(3, 1343488, [1343488], SEEK_SET) = 0
read(3, "\0\0\0\0\370\277\365\267\250\205\247\2774\200\4\10\210"..., 32768) = 13312
write(3, "|~\247\277\226\252\364\267\3\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2560
[...]

I'm not quite sure what's up with there being so many writes to the same location (don't remember that) but it does not matter for reproducing the problem. The simplified version has the following strace and the same slowdown:

open("slow_nfs_test_file", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
lseek(3, 0, SEEK_SET)                   = 0
read(3, "", 32768)                      = 0
write(3, "\274\367\272\277\226J\362\267\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2
560
lseek(3, 2560, SEEK_SET)                = 2560
read(3, "", 32768)                      = 0
write(3, "\274\367\272\277\226J\362\267\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2
560
lseek(3, 5120, SEEK_SET)                = 5120
read(3, "", 32768)                      = 0
write(3, "\274\367\272\277\226J\362\267\3\0\0\0\0\0\0\0\0\0\0\0\0"..., 2560) = 2
560
[... all the same but with incrementing offsets until the end]
Comment 4 Olivier Paquet 2007-12-14 16:40:23 UTC
Created attachment 14047 [details]
output of tcpdump when writing a 1 MB file
Comment 5 Trond Myklebust 2007-12-15 09:59:43 UTC
That is a really _sucky_ implementation of buffered i/o. I fail to see what that
read() is supposed to be good for. That needs to be filed as a glibc bug.

Looking at what it is doing, I'll bet that the real reason for the regression is
once again commit 44dd151d5c21234cc534c47d7382f5c28c3143cd (NFS: Don't mark a
written page as uptodate until it is on disk).
I'm adding a patch that should fix this particular regression, but really this
should be fixed on the glibc side.
Comment 6 Trond Myklebust 2007-12-15 10:00:38 UTC
Created attachment 14048 [details]
NFS: Fix a read performance regression

Fix the regression by setting the page as uptodate if we're appending to
the file.
Comment 7 Anonymous Emailer 2007-12-15 20:28:02 UTC
Reply-To: akpm@linux-foundation.org

On Sat, 15 Dec 2007 09:59:44 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> That is a really _sucky_ implementation of buffered i/o. I fail to see what
> that
> read() is supposed to be good for. That needs to be filed as a glibc bug.

Ulrich, are your ears burning?
Comment 8 Olivier Paquet 2007-12-16 11:26:14 UTC
The patch works ok here. Thanks.

The glibc involved is 2.3.6 and is already 2 years old. I tried building 2.7 to see if it has the same problem but had no luck getting it to work. I'll file a glibc bug if I can't verify that they've fixed this already by some other means.
Comment 9 Ulrich Drepper 2007-12-19 09:58:18 UTC
(In reply to comment #7)
> Ulrich, are your ears burning?

The stdio implementation is unfortunately littered with idiosyncrasies due to backward compatibility.  These make it hard to impossible to make fundamental changes.  In this case I don't see much of a problem.  The implementation cannot make assumptions about the file content.  The previously known file size might be wrong at the time when the next buffer has to be read from disk.  Therefore the reads should all be necessary.
Comment 10 Trond Myklebust 2007-12-19 10:51:42 UTC
While I appreciate that backward compatibility may be a problem, I still don't
fully understand your point.

AFAICS the application isn't actually requesting a fread(). It is rather just
performing a series of fseek() and fwrite() operations. While I can understand
that the buffer contents may need to be revalidated should the client issue a fread() request, that is not the case here. I'm therefore surprised that glibc is
inserting these read() syscalls which would appear to serve no useful purpose as
far as the example application is concerned.
Comment 11 Trond Myklebust 2010-02-03 20:52:11 UTC
Ultimately, this problem was fixed by commit efc91ed0191e3fc62bb1c556ac93fc4e661214d2 (NFS: Optimise append writes with holes).

Marking as resolved.

Note You need to log in before you can comment on or make changes to this bug.