I was reading this blog post about the lighttpd web server:

http://blog.lighttpd.net/articles/2005/11/11/optimizing-lighty-for-high-concurrent-large-file-downloads

It describes the problems they hit serving 100 simultaneous downloads of 100 MB files: sendfile() degenerates into disk seek storms and the machine ends up at 72% I/O wait. As a workaround they built a user-space mechanism that avoids the problem.

LKML thread about the issue: http://lkml.org/lkml/2006/1/20/331
The lighttpd source, which shows how it uses sendfile(), is here: http://www.lighttpd.net/download/

My general conclusion: since they were able to write a user-space implementation that avoids the problem, something must be broken in the kernel readahead logic for sendfile().
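To make the failure mode concrete, here is a minimal sketch (illustrative names, assuming a nonblocking socket driven by an event loop; this is not lighttpd's actual code) of the per-connection sendfile() loop such a server typically runs:

#include <sys/types.h>
#include <sys/sendfile.h>
#include <errno.h>

/* Called from the event loop whenever client_fd becomes writable.
 * Returns bytes queued, 0 if the send buffer is full, -1 on error. */
ssize_t serve_chunk(int client_fd, int file_fd, off_t *offset, size_t left)
{
    /* With the default 16 KB send buffer, each call copies at most
     * about 16 KB into the socket before the buffer fills up and
     * sendfile() returns.  With hundreds of connections, each reading
     * a different file, the disk sees a stream of small, scattered
     * reads: a seek storm. */
    ssize_t n = sendfile(client_fd, file_fd, offset, left);
    if (n < 0 && errno == EAGAIN)
        return 0;               /* send buffer full: wait for POLLOUT */
    return n;                   /* bytes queued; *offset was advanced */
}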
The same problem (disk seek storms) occurs under the scenario below on Linux kernels 2.2.26 and 2.4.31, so I guess it has always been there (even though the sendfile() implementation has changed over time). I would say that under this scenario it happens with mmapped files too.

To clarify things, here are the details of the scenario that triggers the problem:

1) A web server (it could be any kind of file server) has to serve lots of large files to the world over mostly fast connections (over 1 Mbit/sec each).

2) The total size of the served files is much bigger than available RAM, say at least 2 or 3 times RAM size.

3) Each file can be very large, from a few MBytes up to many GBytes.

4) In practice, given points 2) and 3), file contents are almost never found in the page cache (the same files are accessed rarely and very randomly).

5) The socket send buffer size is left at the default (16 KB), so each sendfile() call writes only 16 KB at a time, and since all 100, 200, 500 or 1000 parallel downloads are for different files, nearly every sendfile() call has to wait for disk head seeks (5 - 20 msec) because 99% of the time the requested content is not in the page cache.

The conclusion is that, in this case, the standard readahead logic does not seem to be working at all.

The problem can be mitigated by reading data in large chunks (256 KB, 512 KB, 1024 KB, etc.) at once. User applications usually implement one of these two workarounds (the first is sketched in the code after this list):

1) a big (e.g. 512 KB) user-space buffer per connection (as used by lighttpd): data is read into it in one go and then written from the buffer to the socket; and/or

2) the socket send buffer size is increased to 128 KB, 256 KB, etc. This is not a good solution, because:

   A) slow modem clients can suffer occasional network timeouts, especially with persistent HTTP/1.1 connections (and querying how much data is left in the socket send buffer is not a standard facility available on every OS);

   B) sockets then eat a lot of RAM (twice the read + send buffer size), and under memory pressure the socket send buffer is shrunk, so the trick does not work very well.
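For concreteness, here is a rough sketch of workaround 1, reconstructed from the description above (the struct and function names are mine, not lighttpd's): read one large chunk sequentially into a per-connection buffer, then drain it to the socket as it becomes writable.

#include <sys/types.h>
#include <unistd.h>

#define CHUNK_SIZE (512 * 1024)    /* 512 KB, as in lighttpd's workaround */

struct conn {
    int    client_fd, file_fd;
    char  *buf;                    /* CHUNK_SIZE bytes, allocated per connection */
    size_t filled, sent;           /* bytes read into buf / written out of it */
};

/* Refill the buffer with one large sequential read, so the disk sees
 * a single 512 KB request instead of thirty-two 16 KB ones. */
ssize_t refill(struct conn *c)
{
    ssize_t n = read(c->file_fd, c->buf, CHUNK_SIZE);
    if (n > 0) { c->filled = (size_t)n; c->sent = 0; }
    return n;
}

/* Called when the socket is writable: drain what is already buffered. */
ssize_t drain(struct conn *c)
{
    ssize_t n = write(c->client_fd, c->buf + c->sent, c->filled - c->sent);
    if (n > 0) c->sent += (size_t)n;
    return n;
}

Workaround 2 is just a per-socket setsockopt() call, e.g.:

#include <sys/socket.h>

int sz = 256 * 1024;               /* 256 KB send buffer */
setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));

The large sequential read gives the disk scheduler something to work with, at the cost of an extra user-space copy for every byte served, which is exactly the copy sendfile() was supposed to eliminate.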
There is a report that this patch fixes the problem: http://lkml.org/lkml/2006/3/27/186