Bug 5949

Summary: sendfile() with 100 simultaneous 100MB files
Product: File System
Reporter: Jon Smirl (jonsmirl)
Component: VFS
Assignee: fs_vfs
Status: CLOSED CODE_FIX
Severity: normal
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.13
Subsystem:
Regression: ---
Bisected commit-id:

Description Jon Smirl 2006-01-24 08:28:00 UTC
I was reading this blog post about the lighttpd web server.
http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads
It describes problems they were having serving 100 simultaneous 100MB file downloads.

In this post they complain about sendfile() getting into disk seek storms and
the machine ending up at 72% IO wait. As a result they built a user-space
mechanism to work around the problem.

LKML thread about the issue:
http://lkml.org/lkml/2006/1/20/331

The lighttpd source, showing how it uses sendfile(), is here:
http://www.lighttpd.net/download/

My general conclusion is that, since they were able to write a user-space
implementation that avoids the problem, something must be broken in the kernel
readahead logic for sendfile().
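
For reference, the serving pattern in question is roughly the following (a
simplified sketch of per-connection sendfile() serving, not lighttpd's actual
code; names and error handling are illustrative):

/* Simplified sketch: one sendfile() loop per connection.
 * Illustrative only, not lighttpd's actual code. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int serve_file(int sock, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    while (offset < st.st_size) {
        /* With a default 16 KB socket send buffer, each call moves
         * at most ~16 KB, and when the data is not in the page cache
         * the next call stalls on a fresh disk seek. */
        ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;  /* error, or EAGAIN on a nonblocking socket */
    }
    close(fd);
    return 0;
}

With hundreds of such loops running against different uncached files, the disk
sees an endless stream of small, scattered requests.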
Comment 1 A.D.F. 2006-03-29 10:04:14 UTC
The same problem (disk seek storms) happens (under the scenario below) on Linux
kernels 2.2.26 and 2.4.31, so I guess it has always been there (even though the
sendfile() implementation has changed over time).

I would say that (under the scenario below) it happens with mmapped files too.

To clarify things, it is useful to add a few details about the scenario that
triggers the problem.

1) There is a web server (but it could be any type of file server) that has to
serve lots of large files to the world through (mainly) fast connections (over
1 Mbit/sec. each).

2) The total size of all the served files is much bigger than available RAM,
let's say at least 2 or 3 times RAM size.

3) Each file can be very large, from a few MBytes up to many GBytes.

4) In practice, given points 2) and 3), file contents are almost never found in
the page cache (the same files are accessed rarely and very randomly).

5) The (socket) send buffer size is set to the default (16 KB), thus each
sendfile() call only writes 16 KB at a time, and as all 100, 200, 500 or 1000
parallel downloads are for different files, each sendfile() call has to wait for
disk head seeks (5 - 20 msec.) because 99% of the time the requested content is
not in the page cache.
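
To put rough numbers on it (assuming ~10 msec. per seek, purely for
illustration): a single disk then manages on the order of 100 seeks/sec., so
with 16 KB transferred per seek the disk delivers about

  100 seeks/sec. x 16 KB = 1.6 MB/sec. (~13 Mbit/sec.)

of aggregate throughput, shared among ALL parallel downloads, while the CPU
mostly sits in IO wait.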

The conclusion is that, in this case, the standard readahead logic does not
seem to be working at all.

The problem (disk seek storms) can be mitigated by reading data in large chunks
(e.g. 256 KB, 512 KB, 1024 KB) at once.

In practice, user applications usually implement one of these two workarounds:

1) a big (512 KB) per-connection user-space buffer is used (as in lighttpd):
data is read into it in one large request and the buffer contents are then
written to the socket (see the sketch after this list);

   and/or

2) the (socket) send buffer size is increased to 128 KB, 256 KB, etc., but
this is not a good solution because:
  A) slow modem clients can suffer occasional network timeouts, especially with
persistent HTTP/1.1 connections (and querying how much data is left in the
socket send buffer is not a standard facility available on every OS);
  B) sockets eat a lot of RAM (up to twice the receive + send buffer size) and
under memory pressure the socket send buffer size is shrunk (so the trick does
not work very well).
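
For illustration, workaround 1) boils down to something like the sketch below
(a rough sketch under the assumptions above, not lighttpd's actual
implementation); workaround 2) is simply a setsockopt(sock, SOL_SOCKET,
SO_SNDBUF, ...) call with a larger size, with the caveats listed in A) and B).

/* Rough sketch of workaround 1): read in large chunks so the disk
 * gets one big sequential request per connection instead of many
 * 16 KB ones. Illustrative only, not lighttpd's actual code. */
#include <unistd.h>

#define CHUNK_SIZE (512 * 1024)  /* 512 KB per disk request */

static int serve_file_buffered(int sock, int fd)
{
    static char buf[CHUNK_SIZE]; /* per-connection buffer in real code */

    for (;;) {
        /* One large sequential read: a single seek instead of the
         * ~32 seeks needed for 32 separate 16 KB requests. */
        ssize_t got = read(fd, buf, sizeof(buf));
        if (got <= 0)
            return (int)got;     /* 0 = EOF, -1 = error */

        ssize_t off = 0;
        while (off < got) {      /* drain the buffer to the socket */
            ssize_t sent = write(sock, buf + off, (size_t)(got - off));
            if (sent < 0)
                return -1;
            off += sent;
        }
    }
}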
Comment 2 Jon Smirl 2006-03-29 21:27:08 UTC
There is a report of this patch fixing the problem:
http://lkml.org/lkml/2006/3/27/186