Latest working kernel version: none known, I have only tested the 2.6.28 series and can't easily try anything else since I'm on ext4. Earliest failing kernel version: 2.6.28 Distribution: Arch Linux Hardware Environment: Intel Core 2 Q9550 (x86-64), 8 GB RAM, Asus P5Q-E motherboard, Seagate Barracuda 7200.11 1.5 TB disk. Software Environment: "none". Reproduced at runlevel 1 with practically nothing running. Heavy disk usage or something which simply continues for a period of time, e.g. downloading torrents, watching video, or simply copying a lot of stuff from one partition to another makes Linux eventually slow down to the point of being unusable. This slowdown persists even after the IO has finished and nothing is using the disk. A reboot is a temporary fix for the problem. This could be related to Bug 7372 or Bug 12309, but I think the key here is that I tend to get system-wide poor performance even after the IO is done: when both 'top' and 'iotop' tell me that for all practical purposes absolutely nothing is going on. What happens is that literally /everything/ takes more CPU time. A good benchmark I found for testing this is 'time man man >/dev/null': typically it claims around 0.10 seconds of user time, but when this slowdown has occurred it can take over 8 seconds. An strace -tt doesn't point at any particular culprit: every individual operation takes many times more time. The Windows I occasionally dual boot into has no problems at all in this regard. I have no idea of the actual cause of the problem, I can only describe the symptoms. I filed this under IO/Storage as my best guess. Steps to reproduce: I did an experiment at runlevel 1. First I dd'd a 500M file from /dev/zero. Then I started up two programs to run repeatedly: 'time strace -fttT man man >/dev/null; sleep 1' for benchmarking and 'cp zero-file zero-file2; rm zero-file2' to reproduce the problem. I also set up 'vmstat 1' in case it gives anything useful. I left this running for a few hours and logged everything. This started at around 9:20 local time. The logs of the 'time man man' show that for a long time all of the processes completed within 0.15 seconds of user time, most below 0.10. Three hours later, at 12:21:03, we start to see the following: 0.10user 0.13system 0:00.23elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k 0.07user 0.10system 0:02.79elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k 2.60user 0.44system 0:07.92elapsed 38%CPU (0avgtext+0avgdata 0maxresident)k 3.15user 0.51system 0:05.74elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k 4.54user 0.68system 0:12.46elapsed 41%CPU (0avgtext+0avgdata 0maxresident)k 5.05user 0.53system 0:11.19elapsed 49%CPU (0avgtext+0avgdata 0maxresident)k 4.28user 0.55system 0:11.39elapsed 42%CPU (0avgtext+0avgdata 0maxresident)k 0.10user 0.14system 0:03.60elapsed 6%CPU (0avgtext+0avgdata 0maxresident)k 0.08user 0.16system 0:00.82elapsed 30%CPU (0avgtext+0avgdata 0maxresident)k The times spiked really high for a while there. (Normally I'd reboot around this time, doing almost anything is painfully slow.) This spiking goes on and on. An hour later I arrive at the scene and kill the 'cp' loop. About 10 minutes later this happens: 0.16user 0.16system 0:03.16elapsed 10%CPU (0avgtext+0avgdata 0maxresident)k 0.36user 0.15system 0:03.62elapsed 14%CPU (0avgtext+0avgdata 0maxresident)k 1.49user 0.22system 0:01.99elapsed 86%CPU (0avgtext+0avgdata 0maxresident)k 3.12user 0.28system 0:05.33elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k 4.03user 0.50system 0:06.45elapsed 70%CPU (0avgtext+0avgdata 0maxresident)k 4.61user 0.51system 0:08.89elapsed 57%CPU (0avgtext+0avgdata 0maxresident)k 4.40user 0.58system 0:09.18elapsed 54%CPU (0avgtext+0avgdata 0maxresident)k 5.31user 0.54system 0:09.94elapsed 58%CPU (0avgtext+0avgdata 0maxresident)k 5.43user 0.55system 0:10.71elapsed 55%CPU (0avgtext+0avgdata 0maxresident)k 5.36user 0.57system 0:10.03elapsed 59%CPU (0avgtext+0avgdata 0maxresident)k 5.31user 0.70system 0:07.93elapsed 75%CPU (0avgtext+0avgdata 0maxresident)k 5.30user 0.66system 0:09.76elapsed 61%CPU (0avgtext+0avgdata 0maxresident)k 5.35user 0.75system 0:09.37elapsed 65%CPU (0avgtext+0avgdata 0maxresident)k The times just stayed up from then on. After 10 minutes of times ranging from 5 to 8 seconds of user time I ended the experiment. I omitted the 'pagefaults' lines from the time output: they're all fairly identical, 1416 outputs and around 7700 minor pagefaults, everything else is at zero. Attached are two straces: the first one and the first one that managed over 7 seconds of user time, among the last of all the runs. I'll also attach my kernel .config, a boot dmesg, and lspci -vvv output; I don't know what other information would be useful. Here's the ver_linux output for the kernel I ran my experiment on: Linux niðavellir 2.6.28.1-deewiant #20 SMP PREEMPT Fri Jan 23 22:20:54 EET 2009 x86_64 Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz GenuineIntel GNU/Linux Gnu C 4.3.2 Gnu make 3.81 binutils 2.19.0.20081119 util-linux 2.14 mount support module-init-tools 3.5 e2fsprogs 1.41.3 pcmciautils 015 PPP 2.4.4 Linux C Library 2.9 Dynamic linker (ldd) 2.9 Linux C++ Library 6.0.10 Procps 3.2.7 Net-tools 1.60 Kbd 1.13 Sh-utils 6.12 Modules Loaded
Created attachment 20011 [details] strace man man >/dev/null, normal speed
Created attachment 20012 [details] strace man man >/dev/null, slow speed
Created attachment 20013 [details] kernel .config
Created attachment 20014 [details] dmesg
Created attachment 20016 [details] lspci -vvv
Do you know that the slowdown is IO related at all? What if you just run something like main() { int i; for (i = 0; i < 1000000000; i++) ; } when the machine is running fast and when it's running slow? Try running echo 3 > /proc/sys/vm/drop_caches to blow away cache, see if that helps. Try running `hdparm -t /dev/sda' or whatever to see if the disk is getting slower. If it's actually the CPU getting slower then check /proc/cpuinfo while running the above busy-loop, see if the CPU clock frequency has slowed down. This could even be MTRR-related. Maybe there's some memory which is running really slowly. That memory doesn't initially get used. Later, the system start to use it and you notice the effects.
I don't think I've looked at the behaviour of a program that does no IO (d'oh!). I'll see how that does as well as try drop_caches when I get a chance to, thanks. I recall having tried hdparm -t and -T without seeing any slowdown. Although it's only as a result of my experiment that I noticed the behaviour of the system slowdown: I might have just run hdparm in between those spikes. For what it's worth I did have an MTRR-related problem earlier, although it didn't seem to cause any performance problems, only complaints from the proprietary fglrx ATI graphics driver, and I've managed to 'fix' it since. My original /proc/mtrr looked like the following: reg00: base=0x000000000 ( 0MB), size= 8192MB, count=1: write-back reg01: base=0x200000000 ( 8192MB), size= 512MB, count=1: write-back reg02: base=0x220000000 ( 8704MB), size= 256MB, count=1: write-back reg03: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: uncachable reg04: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable Currently I just run a script on startup (much trial and error was involved in getting it to work) which turns it into the following: reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back reg01: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: write-combining reg02: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable reg03: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back reg04: base=0x200000000 ( 8192MB), size= 2048MB, count=1: write-back I've noticed the slowdown behaviour with both of the above MTRR tables, though.
Alright, some more info: - echo 3 > /proc/sys/vm/drop_caches is a better way of fixing it than a reboot, thanks for that. Of course, the problem recurs after some time. - /proc/cpuinfo tells me that I'm still running at around 3 GHz. - I ran that program but with a constant of only 10 million and not 1000 million; takes around 0.02 seconds normally but it took over 4 seconds when the system was going slow: $ time ./arst Command exited with non-zero status 192 4.63user 0.01system 0:04.66elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+124minor)pagefaults 0swaps - hdparm -t does show a slowdown, but judging from the output of 'time' it's more due to the CPU overhead on system calls: $ sudo time hdparm -t /dev/sda /dev/sda: Timing buffered disk reads: 60 MB in 3.09 seconds = 19.45 MB/sec 0.07user 3.18system 0:06.72elapsed 48%CPU (0avgtext+0avgdata 0maxresident)k 123136inputs+0outputs (0major+1089minor)pagefaults 0swaps - I ran memtest86+ for about 12 hours in case there's a problem with my RAM, but it came up with zero errors.
This may or may not be related: doing 'echo 3 > /proc/sys/vm/drop_caches' caused a GPF in the kernel: [37331.924053] general protection fault: 0000 [#5] PREEMPT SMP [37331.924057] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map [37331.924059] CPU 3 [37331.924061] Modules linked in: [37331.924064] Pid: 15141, comm: sh Tainted: G D 2.6.28.1-deewiant #20 [37331.924066] RIP: 0010:[<ffffffff80244cd1>] [<ffffffff80244cd1>] __wake_up_bit+0xc/0x2d [37331.924072] RSP: 0018:ffff88010ea33d88 EFLAGS: 00010296 [37331.924074] RAX: 57fc74959adaa508 RBX: ffffe20001f3c6e0 RCX: 0000000000000040 [37331.924076] RDX: 0000000000000000 RSI: ffffe20001f3c6e0 RDI: 57fc74959adaa500 [37331.924078] RBP: 0000000005472286 R08: c000000000000000 R09: 0000000d80349a49 [37331.924080] R10: 0000000000006000 R11: 0000000000000001 R12: 0000000000000009 [37331.924082] R13: ffff88010ea33e20 R14: 000000000000027f R15: ffff88022f70c7d0 [37331.924084] FS: 00007f12bfb0d6f0(0000) GS:ffff88022f802b00(0000) knlGS:0000000000000000 [37331.924087] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [37331.924088] CR2: 00007f12bfb28000 CR3: 000000013a99e000 CR4: 00000000000406e0 [37331.924091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [37331.924093] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [37331.924095] Process sh (pid: 15141, threadinfo ffff88010ea32000, task ffff88016828a000) [37331.924096] Stack: [37331.924098] 0000000000000008 ffffffff00000000 ffffe20001f3c6e0 ffffffff80264687 [37331.924101] ffff88010ea33dd9 ffff88010ea33dd8 0100000100000000 ffffffffffffffff [37331.924104] 000000000000000e 0000000000000000 ffffe200053d7618 ffffe20004c011d8 [37331.924108] Call Trace: [37331.924110] [<ffffffff80264687>] ? __invalidate_mapping_pages+0xcb/0x140 [37331.924115] [<ffffffff8050bbae>] ? _spin_lock_irqsave+0x33/0x3b [37331.924119] [<ffffffff8029d162>] ? drop_caches_sysctl_handler+0xca/0x184 [37331.924122] [<ffffffff802c1eae>] ? proc_sys_call_handler+0x91/0xb8 [37331.924127] [<ffffffff8028451b>] ? vfs_write+0xad/0x156 [37331.924130] [<ffffffff80284680>] ? sys_write+0x45/0x6e [37331.924132] [<ffffffff8020b43a>] ? system_call_fastpath+0x16/0x1b [37331.924136] Code: 00 48 29 f8 2b 8a 10 d0 66 80 4c 01 c0 48 d3 e8 48 6b c0 18 48 03 82 00 d0 66 80 5a 5b 5d c3 48 83 ec 18 48 8d 47 08 89 54 24 08 <48> 39 47 08 74 16 48 89 34 24 48 89 e1 ba 01 00 00 00 be 03 00 [37331.924162] RIP [<ffffffff80244cd1>] __wake_up_bit+0xc/0x2d [37331.924165] RSP <ffff88010ea33d88> [37331.924167] ---[ end trace 4d409d3bbf01b279 ]--- The system isn't currently in its slow state but every time I run the command I get a trace like the above in dmesg. I don't know if its notable that the first trace I got said "Not tainted" and later ones say "Tainted: G D".
How much RAM does the system have and how is it laid out according to the early boot messages - wrong MTRR settings for an area of RAM could cause exactly what you see
The system has 8 GB of RAM, four 2 GB sticks. As for how it's laid out, I've attached my dmesg and posted my MTRR settings (both original and current, both of which exhibit the problem) in comment #7; is there any other source of info I should look at?
Ok so physical memory layout is 0-639K RAM 640K->1M ISA hole RAM 1MB -> 3GB PCI and the like 3GB-4GB RAM 4GB->8GB + a bit You don't seem to have an MTRR range covering 2GB-3GB Your original BIOS MTRR looks closer You might also want to see if it occurs without ever loading fglrx in case its messing up due to the (rather odd) memory mtrr setups coming from the system
I haven't used fglrx in months, this does occur without it as well. It was one of the first things I removed when trying to get rid of the problem. :-) xf86-video-radeonhd (the video driver I use currently) does complain about MTRRs if I restart X: the 256MB range completely disappears from /proc/mtrr when I close X. It doesn't seem to cause any problems, though. I'll see about having the MTRRs cover the 2-3 GB range, in case that helps; I'll post results later.
Okay, it seems that it was indeed an MTRR problem. I've been running with the following settings for the past week and haven't run into the problem even once: reg00: base=0x000000000 ( 0MB), size= 2048MB, count=1: write-back reg01: base=0x080000000 ( 2048MB), size= 1024MB, count=1: write-back reg02: base=0x0c0000000 ( 3072MB), size= 256MB, count=1: write-back reg03: base=0x0d0000000 ( 3328MB), size= 256MB, count=1: write-combining reg04: base=0x0e0000000 ( 3584MB), size= 512MB, count=1: uncachable reg05: base=0x100000000 ( 4096MB), size= 4096MB, count=1: write-back reg06: base=0x200000000 ( 8192MB), size= 512MB, count=1: write-back reg07: base=0x220000000 ( 8704MB), size= 256MB, count=1: write-back I'm not sure if this is optimal (X still manages to lose that write-combining MTRR permanently if restarted) but it takes care of this problem, at least. I'll leave this report open for now, since it seems to me that correct MTRR setting is something that should be handled by the kernel. Bug 10508 looks similar to this (the original MTRRs described there are similar to what I started out with), but I'll leave deciding whether this is a duplicate to somebody more knowledgeable. Reassigned to Platform Specific/Hardware since I guess that's where this belongs. In case it's relevant, here's the order in which I need to disable the original MTRR settings (see comment #7 for them). I haven't tried all permutations but this is the first one I found that didn't slow the system down to pre-286 speeds: echo "disable=0" >| /proc/mtrr echo "disable=4" >| /proc/mtrr echo "disable=3" >| /proc/mtrr echo "disable=2" >| /proc/mtrr echo "disable=1" >| /proc/mtrr