Bug 54271 - readahead() docs incorrectly say it blocks
Summary: readahead() docs incorrectly say it blocks
Status: RESOLVED CODE_FIX
Alias: None
Product: Documentation
Classification: Unclassified
Component: man-pages
Hardware: All Linux
Importance: P1 normal
Assignee: documentation_man-pages@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-02-23 04:25 UTC by Phillip Susi
Modified: 2014-03-15 09:11 UTC
CC List: 1 user

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Patch from Phillip Susi (1.71 KB, patch)
2014-03-15 09:11 UTC, Michael Kerrisk

Description Phillip Susi 2013-02-23 04:25:19 UTC
The readahead(2) man page says that it blocks until the requested data has been read.  The whole point of this call is that it does NOT block but initiates background reads of data that will be required soon, so that when it is read, hopefully it can be done without blocking.
Comment 1 Michael Kerrisk 2013-02-25 09:36:57 UTC
(In reply to comment #0)
> The readahead(2) man page says that it blocks until the requested data has
> been
> read.  The whole point of this call is that it does NOT block but initiates
> background reads of data that will be required soon, so that when it is read,
> hopefully it can be done without blocking.

How did you verify or test this assertion?
Comment 2 Phillip Susi 2013-02-25 15:47:11 UTC
Common usage of the call (see the ureadahead and sreadahead packages), inspection of blktrace data while using these packages, discussion on linux-mm, inspection of the kernel sources.  The call *can* block, for instance, if it must read ext2/3 indirect blocks to map the data blocks so their reads can be queued, but the whole point of the call is to *avoid* blocking.
Comment 3 Michael Kerrisk 2013-02-25 20:55:17 UTC
So, I'm no expert here, but on my system (ext4), reading a 1 GB file (with large userspace buffers) takes about 12 seconds. Doing a readahead() of a 1GB file also takes about 12 seconds. If I'm understanding what you are saying correctly, the readahead() should return much faster than the file-read case. What am I missing?
Comment 4 Phillip Susi 2013-02-26 00:56:25 UTC
Around two years back I dug into why readahead() was blocking when it wasn't supposed to and after discussion on linux-mm, it turned out to be due to ext3 using indirect blocks which had to be read first to learn the location of the data blocks before they could be queued.  Switching to ext4 extents fixed that and readahead() stopped blocking.  It seems there has been a regression recently so I started a thread on linux-mm to try and figure it out.
Comment 5 Phillip Susi 2013-02-26 03:15:44 UTC
After closer inspection, it turned out to be the same issue as before.  I was testing on an iso file I had downloaded with bittorrent, and looking at it with debugfs showed that while the data blocks were contiguous, the extent tree was fragmented, causing ext4 to have to block the readahead to read extent tree blocks, the last of which was fairly close to the end of the file.  After creating a new file with dd from /dev/zero, and verifying it did not have the extent tree problem, readahead() does not block on the file.

Specifically, running readahead ; iotop -d 2 (after drop_caches) immediately completes the readahead, then iotop shows lots of read throughput for several seconds.

So it seems that there are still some ext4 issues that cause readahead to block more than you would want to, but as I said before, the call is not supposed to block if possible.
Comment 6 Michael Kerrisk 2013-02-26 14:37:12 UTC
I'd be interested to see a (simple) script or log file so I can follow your steps.

I tried repeating the same measurements (read() an entire file, readahead() the entire file) for a variety of other file systems: XFS, JFS, Reiserfs. In each case, the time taken for the read loop to complete and the readahead() system call to complete were similar (10 to 15 seconds for a 1GB file).
Comment 7 Phillip Susi 2013-02-26 15:57:55 UTC
dd if=/dev/zero of=foo bs=1MiB count=512
echo 1 > /proc/sys/vm/drop_caches
time readahead foo && iostat -d 2

real    0m0.212s
user    0m0.000s
sys     0m0.164s


Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              13.07       348.27       564.72   24916849   40401892
sdb              12.34       342.80       564.88   24525467   40413276
sdc              12.64       353.08       558.45   25260487   39953612
sdd              12.70       344.31       561.36   24633134   40162088
md0             128.52       690.00      1518.61   49365215  108647060
dm-0              4.09        68.50        14.09    4900585    1008368
dm-1              1.13       529.16        10.01   37857989     716364
dm-2              0.00         0.01         0.00        748          0
dm-3            104.29        16.60      1494.47    1187561  106920224

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              83.50     42244.00         0.50      84488          1
sdb              73.50     37122.00         0.50      74244          1
sdc              76.50     38658.00         0.50      77316          1
sdd              80.00     40450.00         0.50      80900          1
md0               1.00         2.00         2.00          4          4
dm-0              1.00         2.00         2.00          4          4
dm-1              0.00         0.00         0.00          0          0
dm-2              0.00         0.00         0.00          0          0
dm-3              0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              33.00     16384.00         2.50      32768          5
sdb              40.00     20224.00         0.50      40448          1
sdc              37.00     18688.00         0.50      37376          1
sdd              32.00     15872.00         2.50      31744          5
md0               0.00         0.00         0.00          0          0
dm-0              0.00         0.00         0.00          0          0
dm-1              0.00         0.00         0.00          0          0
dm-2              0.00         0.00         0.00          0          0
dm-3              0.00         0.00         0.00          0          0

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.50         0.00         1.00          0          2
sdb               2.50         0.00        19.00          0         38
sdc               1.50         0.00         1.00          0          2
sdd               2.50         0.00        19.00          0         38
md0               5.00         0.00        18.00          0         36
dm-0              4.50         0.00        18.00          0         36
dm-1              0.00         0.00         0.00          0          0
dm-2              0.00         0.00         0.00          0          0
dm-3              0.00         0.00         0.00          0          0


When I was seeing the blocking, I observed the following output from a stat command on the file in a debugfs session:

EXTENTS:
(ETB0):33795, (0-30975):370688-401663, (30976-40191):401664-410879,
(ETB0):483328, (40192-72575):410880-443263,
(72576-81919):443264-452607, (ETB0):483329, (81920-111578):452608-482266

As you can see, the file has three level-zero extent tree blocks (ETB0).  The last extent of blocks (452608-482266) can't be queued for reading until the ETB at 483329 has been read, so the readahead blocks until all of the preceding blocks have been read, because they are ahead of the ETB in the queue; once it has the ETB, it queues the last extent and returns.

The freshly created file with dd has no extent tree blocks to read, so no blocking happens.
Comment 8 Phillip Susi 2013-04-08 15:09:00 UTC
Bump.
Comment 9 Michael Kerrisk 2014-03-15 09:11:27 UTC
Created attachment 129481 [details]
Patch from Phillip Susi

Patch submitted to linux-man by Phillip Susi on 2014-03-14
Comment 10 Michael Kerrisk 2014-03-15 09:11:54 UTC
Applied Phillip's patch, with some tweaks.
