Bug 200857
| Summary: | 4.18 regression: /proc/diskstats: I/O busy time not updated correctly | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Klaus Kusche (klaus.kusche) |
| Component: | Other | Assignee: | io_other |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | osandov, tom.leiming |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.18 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Attachments:
- vmstat sda 4.17
- vmstat sdb 4.17
- vmstat sda 4.18 unpatched
- vmstat sdb 4.18 unpatched
- vmstat sda 4.18 patched
- vmstat sdb 4.18 patched
- iostat 4.17
- iostat 4.18 unpatched
- iostat 4.18 patched
Description
Klaus Kusche
2018-08-19 04:58:04 UTC
Can you use 'iostat' for observing the difference between 4.18 and 4.17 about your workload? If yes, please post out the two 'iostat' output. Also do you use scsi_mq in your test kernel?

(In reply to Lei Ming from comment #1)
> Can you use 'iostat' for observing the difference between 4.18 and 4.17 about
> your workload?
>
> If yes, please post out the two 'iostat' output.

I'll try when I find time (I no longer have 4.17 and need to download, build and install it), please be patient.

> Also do you use scsi_mq in your test kernel?

CONFIG_SCSI_MQ_DEFAULT is set in my 4.18 kernel (I don't know about 4.17). However, it does not make a difference: I just tried without it, same problem.

Did some testing: a large tar backup (~13 GB; all tests took around 3 minutes) from sdb (SATA SSD) to sda (conventional SATA disk), always on a freshly rebooted system (so it really reads from the disk, not from the cache). tar keeps sdb almost 100 % busy reading most of the time during this test, with some short periods of activity between 75 % and just below 100 %, and 0 % at the end when sda is writing out the rest of the backup. iostat and vmstat were run at 2-second intervals.

iostat looks fine in all cases; it does not seem to use the broken diskstats value in any way.

vmstat for sda also looks fine in all cases. I assume that is because an operation on a conventional disk always takes long enough to avoid the truncation problem which seems to cause the wrong numbers (see below).

The bug shows up in the vmstat files for sdb. The vmstat values shown are cumulative totals, not per interval. "Total reads" and "sectors read" are almost identical for all three tests, which proves that I measured the same thing in all tests. The interesting column is "reads ms" for sdb, i.e. the cumulative time spent reading since reboot:

- For 4.18, it is 3530 ms, which is way too low for a backup reading at 100 % for several minutes and 13 GB read.
- For 4.17, it is roughly 3 minutes, which is correct (it corresponds to the time the backup takes).
- For patched 4.18 (see below), it is roughly 52 minutes, which is way too high for 3 minutes of wallclock time.

This corresponds to xfce's graphical disk load panel plugin: for 4.17 it seems to be correct (between 90 and 100 %), for unpatched 4.18 it is almost always at 0 %, and for patched 4.18 it is permanently at 100 %.

Unpatched 4.18 also shows "0 ms total write time" since reboot for sdb, which is also definitely wrong (sdb is my system disk, it writes!).

About the patch: osandov@osandov.com sent me the following:

This was probably 522a777566f5 ("block: consolidate struct request timestamp fields"). Not the proper fix, but can you try the following:

diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f43b38..663430090281 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2738,7 +2738,7 @@ void blk_account_io_done(struct request *req, u64 now)
 	struct hd_struct *part;
 	int cpu;
-	duration = nsecs_to_jiffies(now - req->start_time_ns);
+	duration = max(nsecs_to_jiffies(now - req->start_time_ns), 1);
 	cpu = part_stat_lock();
 	part = req->part;

So obviously, if a disk operation completes quickly, its duration gets truncated to 0 jiffies and is lost from the statistics (a small arithmetic sketch of this truncation follows after the attachment list below).

Created attachment 278499 [details]
vmstat sda 4.17
Created attachment 278501 [details]
vmstat sdb 4.17
Created attachment 278503 [details]
vmstat sda 4.18 unpatched
Created attachment 278505 [details]
vmstat sdb 4.18 unpatched
Created attachment 278507 [details]
vmstat sda 4.18 patched
Created attachment 278509 [details]
vmstat sdb 4.18 patched
Created attachment 278511 [details]
iostat 4.17
Created attachment 278513 [details]
iostat 4.18 unpatched
Created attachment 278515 [details]
iostat 4.18 patched
Klaus, please try 4.19-rc5. It should be fixed by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b57e99b4b8b0ebdf9707424e7ddc0c392bdc5fe6.

Tried 4.19-rc5. The Xfce graphical disk load (busy time) looks good. The diskstats and vmstat values are much better than for 4.18, but still differ from 4.17: they are a few percent higher in 4.19. I noticed because "time spent reading" for my test case (the backup) now exceeds wallclock time by 20 percent or so, whereas in 4.17 it was just below wallclock time.

Is this diskstats field expected to contain the "disk busy time reading" (i.e. if there are 10 parallel read requests, the time is counted only once), or the "total time spent in all read requests" (i.e. for 10 parallel requests, 10 times the wallclock time is counted)? In the first case, the actual number in 4.19 would be too high; in the second case, it is perfectly fine.

It is the second case, so that's good to hear! As the commit mentions, before 4.18 we would actually under-account, so it's not surprising that it exceeds wall-clock time where it previously didn't.
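
To make the two interpretations discussed above concrete, here is a hedged userspace sketch (my own illustration, not part of the kernel fix or the patch quoted earlier) that reads the two time columns of /proc/diskstats as documented in Documentation/iostats.txt: "# of milliseconds spent reading" (field 4) sums the duration of every read request, so with 10 requests in flight it can grow about 10 times faster than wallclock time, while "# of milliseconds spent doing I/Os" (field 10) counts device-busy wallclock time only once. The device name passed on the command line (e.g. sdb) is just an example.

```c
/*
 * Print the per-request read time and the device busy time for one
 * block device, straight from /proc/diskstats.
 * Usage: ./diskstats_time sdb
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *want = argc > 1 ? argv[1] : "sda";
	FILE *f = fopen("/proc/diskstats", "r");
	char line[512];

	if (!f) {
		perror("/proc/diskstats");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		unsigned int major, minor;
		char name[64];
		unsigned long long rd_ios, rd_merges, rd_sectors, rd_ms;
		unsigned long long wr_ios, wr_merges, wr_sectors, wr_ms;
		unsigned long long in_flight, io_ms, weighted_ms;

		/* major minor name + the 11 classic stat fields */
		if (sscanf(line,
			   "%u %u %63s %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
			   &major, &minor, name,
			   &rd_ios, &rd_merges, &rd_sectors, &rd_ms,
			   &wr_ios, &wr_merges, &wr_sectors, &wr_ms,
			   &in_flight, &io_ms, &weighted_ms) < 14)
			continue;

		if (strcmp(name, want))
			continue;

		printf("%s: reads completed = %llu\n", name, rd_ios);
		printf("%s: time spent reading = %llu ms (summed per request)\n",
		       name, rd_ms);
		printf("%s: time spent doing I/Os = %llu ms (busy time, counted once)\n",
		       name, io_ms);
		break;
	}

	fclose(f);
	return 0;
}
```

Sampling it before and after a run like the backup above gives, as the difference of the "time spent reading" values, the number vmstat reports, while the busy-time column is roughly what percentage-style displays such as iostat's %util are derived from.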