After upgrading to 4.18, I noticed that my panel app which displays the current disk loads no longer showed any activity for my SSD (it should display % busy time read and % busy time write for each drive): read load was shown as absolutely zero all the time. Sometimes a little write load was displayed, but only a small fraction of the actual load.

I checked /proc/diskstats, and it seems to give wrong data: according to diskstats, an uncached kernel tree copy (which takes many seconds wallclock time and is almost 100 % disk read bound) did 75 K reads totalling 2.3 M sectors in just 200 ms, which is completely unreasonable for a more than 5 year old, SATA-connected SSD that can read around 300 MB/s sequentially at best. After syncing, I deleted the copied tree (including discard!), which took more than one minute wallclock time. According to diskstats, the whole copy, sync, delete and discard accumulated just 7000 ms write busy time in total, which is also way too low.

However, the disk busy times seem to be reasonable for my mechanical disk, which is hooked to the same SATA controller: diskstats shows very close to 100 % read and/or write time when the disk is really busy.

Quick guess (without knowing anything about the internals of the diskstats accounting): perhaps most of the very short I/O times on the SSD are below measurement resolution or get truncated to zero before being summed up, whereas the longer mechanical disk I/O times are summed up correctly?

Disk statistics worked with previous kernels, also for the SSD.
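For context, this is roughly how such a load monitor derives the percentages from /proc/diskstats (a minimal sketch, not the plugin's actual code; the device name "sdb" and the 2 second interval are just example values). Field 7 is "time spent reading (ms)" and field 11 is "time spent writing (ms)" per Documentation/iostats.txt:

/*
 * Minimal illustration (not the plugin's actual code): derive % busy
 * read/write from the millisecond counters in /proc/diskstats.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int read_ms(const char *dev, unsigned long long *rd, unsigned long long *wr)
{
	char line[256], name[64];
	FILE *fp = fopen("/proc/diskstats", "r");

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp)) {
		/* major minor name reads rd_merged rd_sectors rd_ms
		 *                  writes wr_merged wr_sectors wr_ms ... */
		if (sscanf(line, "%*u %*u %63s %*u %*u %*u %llu %*u %*u %*u %llu",
			   name, rd, wr) == 3 && strcmp(name, dev) == 0) {
			fclose(fp);
			return 0;
		}
	}
	fclose(fp);
	return -1;
}

int main(void)
{
	unsigned long long r1, w1, r2, w2;
	const unsigned int interval_ms = 2000;

	if (read_ms("sdb", &r1, &w1))
		return 1;
	sleep(interval_ms / 1000);
	if (read_ms("sdb", &r2, &w2))
		return 1;
	/* If the kernel drops sub-jiffy request times, these deltas stay
	 * near zero on a fast SSD and the displayed load is 0 %. */
	printf("read busy %.1f %%, write busy %.1f %%\n",
	       100.0 * (r2 - r1) / interval_ms,
	       100.0 * (w2 - w1) / interval_ms);
	return 0;
}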
Can you use 'iostat' to observe the difference between 4.18 and 4.17 for your workload? If yes, please post the two 'iostat' outputs. Also, do you use scsi_mq in your test kernel?
(In reply to Lei Ming from comment #1)
> Can you use 'iostat' to observe the difference between 4.18 and 4.17 for
> your workload?
>
> If yes, please post the two 'iostat' outputs.

I'll try when I find time (I no longer have 4.17 and need to download, build and install it), so please be patient.

> Also, do you use scsi_mq in your test kernel?

CONFIG_SCSI_MQ_DEFAULT is set in my 4.18 kernel (I don't know about 4.17). However, it does not make a difference: I just tried without it, same problem.
Did some testing: a large tar backup (~13 GB, all tests took around 3 minutes) from sdb (SATA SSD) to sda (conventional SATA disk), always on a freshly rebooted system (really reading from the disk, not from the cache). tar keeps sdb almost 100 % busy reading most of the time during this test, with some short periods of activity between 75 and just below 100 %, and 0 % at the end when sda is writing out the rest of the backup. iostat and vmstat were run at a 2 sec interval.

iostat looks fine in all cases; it doesn't seem to use the broken diskstats value in any way. vmstat for sda also looks fine in all cases. I assume that's because a conventional disk operation always takes long enough to avoid the truncation problem which seems to cause the wrong numbers (see below).

The bug shows up in the vmstat files for sdb. The vmstat values shown are cumulative totals, not per interval. "Total reads" and "sectors read" are almost identical for all three tests, which proves that I measured the same thing in all tests. The interesting column is "reads ms" for sdb, i.e. cumulative time spent reading since reboot:

- For 4.18, it is 3530 ms, which is way too low for a backup reading at 100 % for several minutes with 13 GB read.
- For 4.17, it is roughly 3 minutes, which is correct (corresponds to the time the backup takes).
- For patched 4.18 (see below), it is roughly 52 minutes, which is way too high for 3 minutes wallclock time.

This corresponds to xfce's graphical disk load panel plugin: for 4.17, it seems to be correct (between 90 and 100 %), for 4.18 unpatched it is almost always at 0 %, and for 4.18 patched it is permanently at 100 %. Unpatched 4.18 also shows "0 ms total write time" since reboot for sdb, which is also definitely wrong (sdb is my system disk, it writes!).

About the patch: osandov@osandov.com sent me the following: This was probably 522a777566f5 ("block: consolidate struct request timestamp fields"). Not the proper fix, but can you try the following:

diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f43b38..663430090281 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2738,7 +2738,7 @@ void blk_account_io_done(struct request *req, u64 now)
 		struct hd_struct *part;
 		int cpu;
 
-		duration = nsecs_to_jiffies(now - req->start_time_ns);
+		duration = max(nsecs_to_jiffies(now - req->start_time_ns), 1);
 		cpu = part_stat_lock();
 		part = req->part;

So obviously, if a disk operation completes quickly, its timing gets truncated to 0 and is lost from the statistics.
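To make the truncation effect concrete, here is a plain userspace illustration of the arithmetic (it mirrors the truncating division that nsecs_to_jiffies() effectively performs; HZ=250, i.e. 4 ms per jiffy, is just an assumed example value, and the request times are made up):

/*
 * Illustrative userspace arithmetic only, not kernel code.
 */
#include <stdio.h>

#define HZ           250ULL
#define NSEC_PER_SEC 1000000000ULL
#define TICK_NSEC    (NSEC_PER_SEC / HZ)

static unsigned long long to_jiffies(unsigned long long ns)
{
	return ns / TICK_NSEC;	/* truncates toward zero */
}

int main(void)
{
	unsigned long long ssd_read_ns = 300000ULL;	/* 0.3 ms, fast SSD read */
	unsigned long long hdd_read_ns = 8000000ULL;	/* 8 ms, mechanical seek */

	/* Sub-jiffy SSD requests account as 0 and vanish from the stats;
	 * the longer mechanical disk requests survive the conversion. */
	printf("SSD: %llu jiffies, HDD: %llu jiffies\n",
	       to_jiffies(ssd_read_ns), to_jiffies(hdd_read_ns));

	/* The max(..., 1) workaround instead counts every request as at
	 * least one full jiffy (4 ms here), which over-accounts the many
	 * short SSD requests -- hence 52 minutes for a 3 minute run. */
	return 0;
}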
Created attachment 278499: vmstat sda 4.17
Created attachment 278501: vmstat sdb 4.17
Created attachment 278503: vmstat sda 4.18 unpatched
Created attachment 278505: vmstat sdb 4.18 unpatched
Created attachment 278507: vmstat sda 4.18 patched
Created attachment 278509: vmstat sdb 4.18 patched
Created attachment 278511: iostat 4.17
Created attachment 278513: iostat 4.18 unpatched
Created attachment 278515: iostat 4.18 patched
Klaus, please try 4.19-rc5. It should be fixed by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b57e99b4b8b0ebdf9707424e7ddc0c392bdc5fe6.
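If it helps, the general idea behind such a fix is to accumulate the per-request time at nanosecond resolution and convert to milliseconds only when the stats are read, so nothing is rounded away per request. A rough sketch of that approach (not code quoted from the commit; struct and function names are illustrative):

#include <stdio.h>

/* Rough sketch only: sum nanoseconds per request, convert once for output. */
struct io_time_stat {
	unsigned long long read_time_ns;	/* summed at ns resolution */
};

static void account_request_done(struct io_time_stat *s,
				 unsigned long long start_ns,
				 unsigned long long now_ns)
{
	s->read_time_ns += now_ns - start_ns;	/* no per-request rounding */
}

static unsigned long long read_time_msecs(const struct io_time_stat *s)
{
	return s->read_time_ns / 1000000ULL;	/* convert only when reported */
}

int main(void)
{
	struct io_time_stat s = { 0 };
	int i;

	/* 10000 requests of 0.3 ms each: 3000 ms total, none of it lost. */
	for (i = 0; i < 10000; i++)
		account_request_done(&s, 0, 300000ULL);
	printf("%llu ms\n", read_time_msecs(&s));
	return 0;
}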
Tried 4.19-rc5. The xfce graphical disk load (busy time) looks good. The diskstats and vmstat values are much better than in 4.18, but still different from 4.17: they are a few percent higher in 4.19. I noticed because "time spent reading" for my test case (the backup) now exceeds wallclock time by 20 percent or so, whereas in 4.17 it was just below wallclock time.

Is this diskstats field expected to contain "disk busy time reading" (i.e. if there are 10 parallel read requests, the time is counted only once), or is it "total time spent in all read requests" (i.e. for 10 parallel requests, 10 times wallclock time is counted)? In the first case, the number in 4.19 would be too high; in the second case, it's perfectly fine.
It is the second case, so that's good to hear! As the commit mentions, before 4.18 we would actually under-account, so it's not surprising that it exceeds wall clock time where it previously didn't.
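For a concrete illustration (numbers made up, not measured): if the backup keeps an average of 1.2 read requests in flight on sdb over 180 s of wall clock time, the summed per-request read time comes out at roughly 1.2 × 180 s = 216 s, about 20 % above wall clock, even though the device is never more than 100 % busy at any instant. A "time spent reading" value that modestly exceeds wall clock is therefore exactly what the second interpretation predicts.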