Bug 91691 - Extremely slow disk I/O, high iowait in 3.16 and newer kernels, unless blktrace is running
Summary: Extremely slow disk I/O, high iowait in 3.16 and newer kernels, unless blktrace is running
Status: NEEDINFO
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Jens Axboe
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-21 16:17 UTC by Andreas Klauer
Modified: 2016-08-20 01:13 UTC
CC List: 2 users

See Also:
Kernel Version: 3.16
Subsystem:
Regression: No
Bisected commit-id:


Attachments
cpuinfo (1.29 KB, text/plain) - 2015-01-21 16:17 UTC, Andreas Klauer
dmesg (33.91 KB, text/plain) - 2015-01-21 16:17 UTC, Andreas Klauer
lspci (643 bytes, text/plain) - 2015-01-21 16:18 UTC, Andreas Klauer
3.15.10 btrace vda3 statistics (1.56 KB, text/plain) - 2015-01-21 19:10 UTC, Andreas Klauer
3.15.10 btrace vda statistics (1.56 KB, text/plain) - 2015-01-21 19:10 UTC, Andreas Klauer
3.16.6 btrace vda3 statistics (1.56 KB, text/plain) - 2015-01-21 19:10 UTC, Andreas Klauer
3.16.6 btrace vda statistics (1.56 KB, text/plain) - 2015-01-21 19:11 UTC, Andreas Klauer

Description Andreas Klauer 2015-01-21 16:17:36 UTC
Created attachment 164161
cpuinfo

I rented a KVM server; nothing special: 2x Westmere with AES-NI, 6GB RAM, 120GB SATA disk (/dev/vda).

With kernels 3.2.x through 3.14.x/3.15.x it works fine; disk write speed is around 30-50 MB/s. Starting with 3.16 and up through the current 3.19-rc, it drops to ~2-3 MB/s, and that is for linear writes. Trying to extract a kernel tar.xz, the box gets stuck in 100% iowait for hours.
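
The iowait can be watched with standard tools while a write is running, e.g. (assuming sysstat is installed for iostat):

    # in a second terminal while the write runs; the "wa" column in vmstat
    # and the per-device utilization in iostat show the stall
    vmstat 1
    iostat -x 1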

I found a remotely similar issue on LKML with no replies: https://lkml.org/lkml/2014/12/23/290

I tried to bisect and ended up at this commit: https://github.com/torvalds/linux/commit/07068d5b8ed8fa6759b2826ba9197e49b69a1fc3

But I'm not sure it's correct: I don't have much experience with bisecting, and many of the intermediate states had their own issues such as kernel panics. I was also unable to revert this commit on later kernels, as I'm not sure how to resolve the resulting conflicts.
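
For reference, a revert attempt looks roughly like this (a sketch of the standard git workflow, not the exact commands I ran):

    # in a clone of torvalds/linux
    git checkout v3.17
    git revert 07068d5b8ed8fa6759b2826ba9197e49b69a1fc3
    # on kernels newer than 3.16 this stops with conflicts that would
    # have to be resolved by hand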

I am using the following testing method, on an otherwise idle box:

    truncate -s 512M /dev/shm/foobar
    shred /dev/shm/foobar
    # write 512M of random data to disk
    sync; time (dd bs=1M if=/dev/shm/foobar of=/dev/vda3; sync)

The partition /dev/vda3 is MiB-aligned.
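
(Checked along these lines; with 512-byte logical sectors, a start sector that is a multiple of 2048 means 1 MiB alignment:)

    # partition start sector as seen by the kernel
    cat /sys/class/block/vda3/start
    # or the partition table view
    fdisk -l /dev/vda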

The result in 3.15.10:

    512+0 records in
    512+0 records out
    536870912 bytes (537 MB) copied, 11.6523 s, 46.1 MB/s
    
    real 	0m13.468s
    user 	0m0.002s
    sys 	0m0.656s

The result in 3.16.6:

    512+0 records in
    512+0 records out
    536870912 bytes (537 MB) copied, 232.702 s, 2.3 MB/s
    
    real 	3m52.706s
    user 	0m0.000s
    sys 	0m0.770s

Changing the I/O scheduler (noop/deadline/cfq) does not seem to have any effect.
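
(The scheduler was switched at runtime via sysfs in the usual way, along the lines of:)

    cat /sys/block/vda/queue/scheduler       # shows e.g. "noop deadline [cfq]"
    echo noop > /sys/block/vda/queue/scheduler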

However, running blktrace seems to affect it a great deal, for some reason.

The result in 3.16.6, while in another terminal 'btrace /dev/vda3':

    512+0 records in
    512+0 records out
    536870912 bytes (537 MB) copied, 0.409312 s, 1.3 GB/s
    
    real    0m13.184s
    user    0m0.000s
    sys     0m0.413s

I get these results on both Ubuntu and Arch Linux, as well as with custom-compiled minimal kernels. Please let me know what I can do to help debug the issue; I can also provide root access to the box if you need it (I won't be using it until the end of February).

I'll also attach cpuinfo, lspci and dmesg in case they are of any relevance.
Comment 1 Andreas Klauer 2015-01-21 16:17:56 UTC
Created attachment 164171
dmesg
Comment 2 Andreas Klauer 2015-01-21 16:18:11 UTC
Created attachment 164181
lspci
Comment 3 Jens Axboe 2015-01-21 17:07:12 UTC
Would it be possible to use blktrace to log to a different machine, assuming the slowdown still shows up then? That might help us pinpoint further where things are going wrong. You can do that along these lines:

(on other machine, same network): # blktrace -l

then do blktrace -s other-machine /dev/vda

If you have some short, isolated sequence of IO you can run, that would help.
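
Put together, with the host flag spelled out ("other-machine" is just a placeholder for the receiving box), that is roughly:

    # on the other machine, same network: listen for incoming trace data
    blktrace -l
    # on the affected box: send the trace over the network instead of writing it locally
    blktrace -s -h other-machine /dev/vda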
Comment 4 Andreas Klauer 2015-01-21 19:01:47 UTC
It seems to behave the same over the network as it does locally. But there does seem to be a difference between running blktrace on /dev/vda as opposed to /dev/vda3.

The result in 3.16.6, blktrace -s -h othermachine /dev/vda3:

    512+0 records in
    512+0 records out
    536870912 bytes (537 MB) copied, 0.391734 s, 1.4 GB/s
    
    real    0m12.161s
    user    0m0.000s
    sys     0m0.397s
    server: connection from 37.120.172.165
    server: end of run for 37.120.172.165:vda3
    === vda3 ===
      CPU  0:               440602 events,    20654 KiB data
      CPU  1:                    1 events,        1 KiB data
      Total:                440603 events (dropped 0),    20654 KiB data

The result in 3.16.6, blktrace -s -h othermachine /dev/vda:

    512+0 records in
    512+0 records out
    536870912 bytes (537 MB) copied, 262.732 s, 2.0 MB/s
    
    real    4m22.735s
    user    0m0.000s
    sys     0m0.645s
    server: connection from 37.120.172.165
    server: end of run for 37.120.172.165:vda
    === vda ===
      CPU  0:               246610 events,    11560 KiB data
      CPU  1:               251797 events,    11804 KiB data
      Total:                498407 events (dropped 0),    23363 KiB data

I'll attach some btrace statistics (for 3.15, 3.16 and vda, vda3) in case they are of help.
Comment 5 Andreas Klauer 2015-01-21 19:10:10 UTC
Created attachment 164191
3.15.10 btrace vda3 statistics
Comment 6 Andreas Klauer 2015-01-21 19:10:32 UTC
Created attachment 164201
3.15.10 btrace vda statistics
Comment 7 Andreas Klauer 2015-01-21 19:10:54 UTC
Created attachment 164211
3.16.6 btrace vda3 statistics
Comment 8 Andreas Klauer 2015-01-21 19:11:11 UTC
Created attachment 164221
3.16.6 btrace vda statistics
