Bug 12553 - System slows down to a crawl during and after prolonged disk usage
Summary: System slows down to a crawl during and after prolonged disk usage
Status: CLOSED DOCUMENTED
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 high
Assignee: io_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-27 11:40 UTC by Matti Niemenmaa
Modified: 2012-05-30 12:00 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.28.1
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
strace man man >/dev/null, normal speed (704.88 KB, text/plain)
2009-01-27 11:42 UTC, Matti Niemenmaa
strace man man >/dev/null, slow speed (700.77 KB, text/plain)
2009-01-27 11:43 UTC, Matti Niemenmaa
kernel .config (38.99 KB, text/plain)
2009-01-27 11:43 UTC, Matti Niemenmaa
dmesg (40.80 KB, text/plain)
2009-01-27 11:44 UTC, Matti Niemenmaa
lspci -vvv (27.46 KB, text/plain)
2009-01-27 11:45 UTC, Matti Niemenmaa

Description Matti Niemenmaa 2009-01-27 11:40:30 UTC
Latest working kernel version: none known; I have only tested the 2.6.28 series and can't easily try anything else since I'm on ext4.
Earliest failing kernel version: 2.6.28
Distribution: Arch Linux
Hardware Environment: Intel Core 2 Q9550 (x86-64), 8 GB RAM, Asus P5Q-E motherboard, Seagate Barracuda 7200.11 1.5 TB disk.
Software Environment: "none". Reproduced at runlevel 1 with practically nothing running.

Heavy or simply prolonged disk usage, e.g. downloading torrents, watching
video, or copying a lot of stuff from one partition to another, eventually
makes Linux slow down to the point of being unusable. This slowdown persists
even after the IO has finished and nothing is using the disk. A reboot is a
temporary fix for the problem.

This could be related to Bug 7372 or Bug 12309, but I think the key here is
that I tend to get system-wide poor performance even after the IO is done: when
both 'top' and 'iotop' tell me that for all practical purposes absolutely
nothing is going on.

What happens is that literally /everything/ takes more CPU time. A good
benchmark I found for testing this is 'time man man >/dev/null': typically it
claims around 0.10 seconds of user time, but when this slowdown has occurred it
can take over 8 seconds. An strace -tt doesn't point at any particular culprit:
every individual operation simply takes many times longer.

The Windows I occasionally dual boot into has no problems at all in this
regard.

I have no idea of the actual cause of the problem; I can only describe the
symptoms. I filed this under IO/Storage as my best guess.

Steps to reproduce:

I did an experiment at runlevel 1. First I dd'd a 500M file from /dev/zero.
Then I started up two programs to run repeatedly: 'time strace -fttT man man
>/dev/null; sleep 1' for benchmarking and 'cp zero-file zero-file2; rm
zero-file2' to reproduce the problem. I also set up 'vmstat 1' in case it gives
anything useful. I left this running for a few hours and logged everything.
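
In shell terms the experiment was roughly the following (the dd invocation and
the loop wrappers here are reconstructions rather than a verbatim copy):

   dd if=/dev/zero of=zero-file bs=1M count=500

   # benchmark loop, in one shell:
   while true; do time strace -fttT man man >/dev/null; sleep 1; done

   # load loop, in another shell:
   while true; do cp zero-file zero-file2; rm zero-file2; done

   # and logging in a third:
   vmstat 1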

This started at around 9:20 local time. The logs of the 'time man man' runs
show that for a long time all of the processes completed within 0.15 seconds of
user time, most below 0.10. Three hours later, at 12:21:03, we start to see the
following:

0.10user 0.13system 0:00.23elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
0.07user 0.10system 0:02.79elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
2.60user 0.44system 0:07.92elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k
3.15user 0.51system 0:05.74elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k
4.54user 0.68system 0:12.46elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k
5.05user 0.53system 0:11.19elapsed 49%CPU (0avgtext+0avgdata 0maxresident)k
4.28user 0.55system 0:11.39elapsed 42%CPU (0avgtext+0avgdata 0maxresident)k
0.10user 0.14system 0:03.60elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k
0.08user 0.16system 0:00.82elapsed 30%CPU (0avgtext+0avgdata 0maxresident)k

The times spiked really high for a while there. (Normally I'd reboot around
this time, since doing almost anything is painfully slow.) This spiking goes on
and on. An hour later I arrive at the scene and kill the 'cp' loop. About 10
minutes later this happens:

0.16user 0.16system 0:03.16elapsed 10%CPU (0avgtext+0avgdata 0maxresident)k
0.36user 0.15system 0:03.62elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k
1.49user 0.22system 0:01.99elapsed 86%CPU (0avgtext+0avgdata 0maxresident)k
3.12user 0.28system 0:05.33elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k
4.03user 0.50system 0:06.45elapsed 70%CPU (0avgtext+0avgdata 0maxresident)k
4.61user 0.51system 0:08.89elapsed 57%CPU (0avgtext+0avgdata 0maxresident)k
4.40user 0.58system 0:09.18elapsed 54%CPU (0avgtext+0avgdata 0maxresident)k
5.31user 0.54system 0:09.94elapsed 58%CPU (0avgtext+0avgdata 0maxresident)k
5.43user 0.55system 0:10.71elapsed 55%CPU (0avgtext+0avgdata 0maxresident)k
5.36user 0.57system 0:10.03elapsed 59%CPU (0avgtext+0avgdata 0maxresident)k
5.31user 0.70system 0:07.93elapsed 75%CPU (0avgtext+0avgdata 0maxresident)k
5.30user 0.66system 0:09.76elapsed 61%CPU (0avgtext+0avgdata 0maxresident)k
5.35user 0.75system 0:09.37elapsed 65%CPU (0avgtext+0avgdata 0maxresident)k

The times just stayed up from then on. After 10 minutes of times ranging from 5
to 8 seconds of user time I ended the experiment.

I omitted the 'pagefaults' lines from the time output: they're all nearly
identical, 1416 outputs and around 7700 minor pagefaults, with everything else
at zero.

Attached are two straces: the first run, and the first run that managed over 7
seconds of user time (among the last of all the runs). I'll also attach my
kernel .config, a boot dmesg, and lspci -vvv output; I don't know what other
information would be useful.

Here's the ver_linux output for the kernel I ran my experiment on:

Linux niðavellir 2.6.28.1-deewiant #20 SMP PREEMPT Fri Jan 23 22:20:54 EET 2009 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux

Gnu C                  4.3.2
Gnu make               3.81
binutils               2.19.0.20081119
util-linux             2.14
mount                  support
module-init-tools      3.5
e2fsprogs              1.41.3
pcmciautils            015
PPP                    2.4.4
Linux C Library        2.9
Dynamic linker (ldd)   2.9
Linux C++ Library      6.0.10
Procps                 3.2.7
Net-tools              1.60
Kbd                    1.13
Sh-utils               6.12
Modules Loaded
Comment 1 Matti Niemenmaa 2009-01-27 11:42:06 UTC
Created attachment 20011 [details]
strace man man >/dev/null, normal speed
Comment 2 Matti Niemenmaa 2009-01-27 11:43:06 UTC
Created attachment 20012 [details]
strace man man >/dev/null, slow speed
Comment 3 Matti Niemenmaa 2009-01-27 11:43:58 UTC
Created attachment 20013 [details]
kernel .config
Comment 4 Matti Niemenmaa 2009-01-27 11:44:34 UTC
Created attachment 20014 [details]
dmesg
Comment 5 Matti Niemenmaa 2009-01-27 11:45:09 UTC
Created attachment 20016 [details]
lspci -vvv
Comment 6 Andrew Morton 2009-01-27 11:51:33 UTC
Do you know that the slowdown is IO related at all?

What if you just run something like


int main(void)
{
        volatile int i;         /* volatile, so the empty loop isn't optimized away */

        for (i = 0; i < 1000000000; i++)
                ;
        return 0;
}

when the machine is running fast and when it's running slow?


Try running

   echo 3 > /proc/sys/vm/drop_caches

to blow away cache, see if that helps.


Try running `hdparm -t /dev/sda' or whatever to see if the disk
is getting slower.

If it's actually the CPU getting slower then check /proc/cpuinfo while
running the above busy-loop, see if the CPU clock frequency has slowed down.

This could even be MTRR-related.  Maybe there's some memory which is running
really slowly.  That memory doesn't initially get used.  Later, the system
starts to use it and you notice the effects.
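
In shell terms those checks boil down to something like the following (assuming
the disk is /dev/sda; run them once while the machine is fast and once while
it's slow):

   grep "cpu MHz" /proc/cpuinfo         # has the clock frequency dropped?
   cat /proc/mtrr                       # current MTRR layout
   hdparm -t /dev/sda                   # raw buffered read speed
   echo 3 > /proc/sys/vm/drop_caches    # as root: drop pagecache, dentries and inodes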
Comment 7 Matti Niemenmaa 2009-01-27 13:03:16 UTC
I don't think I've looked at the behaviour of a program that does no IO (d'oh!). I'll see how that does, and try drop_caches, when I get a chance to; thanks.

I recall having tried hdparm -t and -T without seeing any slowdown, although it was only as a result of my experiment that I noticed how the slowdown behaves: I might have just run hdparm in between those spikes.

For what it's worth I did have an MTRR-related problem earlier, although it didn't seem to cause any performance problems, only complaints from the proprietary fglrx ATI graphics driver, and I've managed to 'fix' it since. My original /proc/mtrr looked like the following:

reg00: base=0x000000000 (    0MB), size= 8192MB, count=1: write-back
reg01: base=0x200000000 ( 8192MB), size=  512MB, count=1: write-back
reg02: base=0x220000000 ( 8704MB), size=  256MB, count=1: write-back
reg03: base=0x0d0000000 ( 3328MB), size=  256MB, count=1: uncachable
reg04: base=0x0e0000000 ( 3584MB), size=  512MB, count=1: uncachable

Currently I just run a script on startup (much trial and error was involved in getting it to work) which turns it into the following:

reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x0d0000000 ( 3328MB), size=  256MB, count=1: write-combining
reg02: base=0x0e0000000 ( 3584MB), size=  512MB, count=1: uncachable
reg03: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back
reg04: base=0x200000000 ( 8192MB), size= 2048MB, count=1: write-back

I've noticed the slowdown behaviour with both of the above MTRR tables, though.
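
(For reference, that script just drives the /proc/mtrr text interface: it
disables registers by number and then writes replacement ranges. As a rough
illustration rather than the exact script:

echo "disable=2" >| /proc/mtrr
echo "base=0x0d0000000 size=0x10000000 type=write-combining" >| /proc/mtrr

where size must be a power of two and base must be aligned to the size.)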
Comment 8 Matti Niemenmaa 2009-01-31 06:34:50 UTC
Alright, some more info:

- echo 3 > /proc/sys/vm/drop_caches is a better way of fixing it than a reboot, thanks for that. Of course, the problem recurs after some time.

- /proc/cpuinfo tells me that I'm still running at around 3 GHz.

- I ran that program, but with a constant of only 10 million instead of 1000 million; it takes around 0.02 seconds normally, but it took over 4 seconds when the system was going slow:

$ time ./arst
Command exited with non-zero status 192
4.63user 0.01system 0:04.66elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+124minor)pagefaults 0swaps

- hdparm -t does show a slowdown, but judging from the output of 'time' it's more due to the CPU overhead on system calls:

$ sudo time hdparm -t /dev/sda      

/dev/sda:
 Timing buffered disk reads:   60 MB in  3.09 seconds =  19.45 MB/sec
0.07user 3.18system 0:06.72elapsed 48%CPU (0avgtext+0avgdata 0maxresident)k
123136inputs+0outputs (0major+1089minor)pagefaults 0swaps

- I ran memtest86+ for about 12 hours in case there's a problem with my RAM, but it came up with zero errors.
Comment 9 Matti Niemenmaa 2009-02-02 09:44:14 UTC
This may or may not be related: doing 'echo 3 > /proc/sys/vm/drop_caches' caused a GPF in the kernel:

[37331.924053] general protection fault: 0000 [#5] PREEMPT SMP 
[37331.924057] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[37331.924059] CPU 3 
[37331.924061] Modules linked in:
[37331.924064] Pid: 15141, comm: sh Tainted: G      D    2.6.28.1-deewiant #20
[37331.924066] RIP: 0010:[<ffffffff80244cd1>]  [<ffffffff80244cd1>] __wake_up_bit+0xc/0x2d
[37331.924072] RSP: 0018:ffff88010ea33d88  EFLAGS: 00010296
[37331.924074] RAX: 57fc74959adaa508 RBX: ffffe20001f3c6e0 RCX: 0000000000000040
[37331.924076] RDX: 0000000000000000 RSI: ffffe20001f3c6e0 RDI: 57fc74959adaa500
[37331.924078] RBP: 0000000005472286 R08: c000000000000000 R09: 0000000d80349a49
[37331.924080] R10: 0000000000006000 R11: 0000000000000001 R12: 0000000000000009
[37331.924082] R13: ffff88010ea33e20 R14: 000000000000027f R15: ffff88022f70c7d0
[37331.924084] FS:  00007f12bfb0d6f0(0000) GS:ffff88022f802b00(0000) knlGS:0000000000000000
[37331.924087] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[37331.924088] CR2: 00007f12bfb28000 CR3: 000000013a99e000 CR4: 00000000000406e0
[37331.924091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[37331.924093] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[37331.924095] Process sh (pid: 15141, threadinfo ffff88010ea32000, task ffff88016828a000)
[37331.924096] Stack:
[37331.924098]  0000000000000008 ffffffff00000000 ffffe20001f3c6e0 ffffffff80264687
[37331.924101]  ffff88010ea33dd9 ffff88010ea33dd8 0100000100000000 ffffffffffffffff
[37331.924104]  000000000000000e 0000000000000000 ffffe200053d7618 ffffe20004c011d8
[37331.924108] Call Trace:
[37331.924110]  [<ffffffff80264687>] ? __invalidate_mapping_pages+0xcb/0x140
[37331.924115]  [<ffffffff8050bbae>] ? _spin_lock_irqsave+0x33/0x3b
[37331.924119]  [<ffffffff8029d162>] ? drop_caches_sysctl_handler+0xca/0x184
[37331.924122]  [<ffffffff802c1eae>] ? proc_sys_call_handler+0x91/0xb8
[37331.924127]  [<ffffffff8028451b>] ? vfs_write+0xad/0x156
[37331.924130]  [<ffffffff80284680>] ? sys_write+0x45/0x6e
[37331.924132]  [<ffffffff8020b43a>] ? system_call_fastpath+0x16/0x1b
[37331.924136] Code: 00 48 29 f8 2b 8a 10 d0 66 80 4c 01 c0 48 d3 e8 48 6b c0 18 48 03 82 00 d0 66 80 5a 5b 5d c3 48 83 ec 18 48 8d 47 08 89 54 24 08 <48> 39 47 08 74 16 48 89 34 24 48 89 e1 ba 01 00 00 00 be 03 00 
[37331.924162] RIP  [<ffffffff80244cd1>] __wake_up_bit+0xc/0x2d
[37331.924165]  RSP <ffff88010ea33d88>
[37331.924167] ---[ end trace 4d409d3bbf01b279 ]---

The system isn't currently in its slow state, but every time I run the command I get a trace like the above in dmesg. I don't know if it's notable that the first trace I got said "Not tainted" and later ones say "Tainted: G      D".
Comment 10 Alan 2009-03-20 06:02:51 UTC
How much RAM does the system have, and how is it laid out according to the early boot messages? Wrong MTRR settings for an area of RAM could cause exactly what you see.
Comment 11 Matti Niemenmaa 2009-03-20 06:15:36 UTC
The system has 8 GB of RAM, four 2 GB sticks. As for how it's laid out, I've attached my dmesg and posted my MTRR settings (both original and current, both of which exhibit the problem) in comment #7; is there any other source of info I should look at?
Comment 12 Alan 2009-03-20 06:59:54 UTC
Ok so physical memory layout is

0-639K     RAM
640K-1MB   ISA hole
1MB-3GB    RAM
3GB-4GB    PCI and the like
4GB-8GB+   RAM (plus a bit)

You don't seem to have an MTRR range covering 2GB-3GB, so that gigabyte of RAM presumably falls back to the default (uncachable) memory type


Your original BIOS MTRR looks closer

You might also want to see if it occurs without ever loading fglrx, in case it's messing up due to the (rather odd) MTRR setup coming from the system
Comment 13 Matti Niemenmaa 2009-03-20 08:59:13 UTC
I haven't used fglrx in months; this occurs without it as well. It was one of the first things I removed when trying to get rid of the problem. :-)

xf86-video-radeonhd (the video driver I use currently) does complain about MTRRs if I restart X: the 256MB range completely disappears from /proc/mtrr when I close X. It doesn't seem to cause any problems, though.

I'll see about having the MTRRs cover the 2-3 GB range, in case that helps; I'll post results later.
Comment 14 Matti Niemenmaa 2009-03-28 11:14:15 UTC
Okay, it seems that it was indeed an MTRR problem. I've been running with the following settings for the past week and haven't run into the problem even once:

reg00: base=0x000000000 (    0MB), size= 2048MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back
reg02: base=0x0c0000000 ( 3072MB), size=  256MB, count=1: write-back
reg03: base=0x0d0000000 ( 3328MB), size=  256MB, count=1: write-combining
reg04: base=0x0e0000000 ( 3584MB), size=  512MB, count=1: uncachable
reg05: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back
reg06: base=0x200000000 ( 8192MB), size=  512MB, count=1: write-back
reg07: base=0x220000000 ( 8704MB), size=  256MB, count=1: write-back

I'm not sure if this is optimal (X still manages to lose that write-combining MTRR permanently if restarted) but it takes care of this problem, at least.

I'll leave this report open for now, since it seems to me that setting the MTRRs up correctly is something the kernel should handle. Bug 10508 looks similar to this (the original MTRRs described there are similar to what I started out with), but I'll leave deciding whether this is a duplicate to somebody more knowledgeable.

Reassigned to Platform Specific/Hardware since I guess that's where this belongs.

In case it's relevant, here's the order in which I need to disable the original MTRR settings (see comment #7 for them). I haven't tried all permutations but this is the first one I found that didn't slow the system down to pre-286 speeds:

echo "disable=0" >| /proc/mtrr
echo "disable=4" >| /proc/mtrr
echo "disable=3" >| /proc/mtrr
echo "disable=2" >| /proc/mtrr
echo "disable=1" >| /proc/mtrr
