This is an attempt at bringing sanity to bug #7372. Please only comment here if you are experiencing high I/O wait times and poor interactivity on reasonable workloads.

Latest working kernel version: 2.6.18?

Problem Description:
I/O operations on large files tend to produce extremely high iowait times and poor system I/O performance (degraded interactivity). This behavior can be seen to varying degrees in tasks such as,
 - Backing up /home (40GB with numerous large files) with diffbackup to external USB hard drive
 - Moving messages between large maildirs
 - updatedb
 - Upgrading large numbers of packages with rpm

Steps to reproduce:
The best synthetic reproduction case I have found is,
$ dd if=/dev/zero of=/tmp/test bs=1M count=1M
During this copy, IO wait times are very high (70-80%) with extremely degraded interactivity, although throughput averages about 29MB/s (about the disk's capacity, I think). Even starting a new shell takes minutes, especially after letting the machine copy for a while without being actively used. Could this mean it's a caching issue?
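For anyone who wants to put a number on the iowait figure while the dd test runs, here is a minimal sketch that computes the iowait share from two samples of the aggregate "cpu" line in /proc/stat. The function names are made up for illustration, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); it is not how top or vmstat are implemented.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two 'cpu' line samples."""
    old = parse_cpu_line(before)
    new = parse_cpu_line(after)
    # Only user..steal (first 8 fields) are summed; guest time is already in user.
    deltas = [b - a for a, b in zip(old, new)][:8]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(3)  # sampling interval, chosen arbitrarily
    second = read_cpu_line()
    print("iowait: %.1f%%" % iowait_percent(first, second))
```

Running it in a second terminal during the copy should give a figure comparable to the "wa" column of vmstat.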
For the record, this is even reproducible with Linus's master.
I'm also having this problem.

Latest working kernel version: 2.6.18.8 with config:
http://svn.pardus.org.tr/pardus/2007/kernel/kernel/files/pardus-kernel-config.patch

Currently working on 2.6.25.20 with config:
http://svn.pardus.org.tr/pardus/2008/kernel/kernel/files/pardus-kernel-config.patch

Tested also with 2.6.28 and felt no significant performance improvement.

--

Heavy disk I/O, such as running 'svn up', hogs the system and makes it nearly impossible to start a new shell, browse the internet, do some text editing in vim, etc.

For example, once a text buffer is finally open in vim, there are 4-5 second delays between consecutive search attempts.
Hello Ben,

I don't know where to post it exactly. Why Linux Memory Management? Or why -mm and not mainstream? Can you do it for me please?

I have added a second test case, which uses threads with pthread_mutex and pthread_cond instead of processes with pipes for communication, to make sure it is a CPU scheduler issue.

I have repeated the tests with some vanilla kernels again, as there is a remark in the bug report about tainted or distro kernels. As I got a segmentation fault with the 2.6.28 kernel, I added the result of the Ubuntu 9.04 kernel (see attachment). The results are not comparable to the results posted before, as I have changed the time handling (doubles instead of int32_t, as some echo messages take more than one second).

The first three results are 2*100, 2*50 and 2*20 processes exchanging 100k, 200k and 1M messages over a pipe. The last three results are 2*100, 2*50, and 2*20 threads exchanging 100k, 200k and 1M messages with pthread_mutex and pthread_cond.

I have added a 10 second pause at the beginning of every thread/process to ensure that the 2*100 processes or threads are all created and start exchanging messages at nearly the same time. This was not the case in the old test case with 2*100 processes, as the first thread was already destroyed before the last was created.

With the second test case (threads), I got the problems (threads:2*100/msg:1M) immediately with kernel 2.6.22.19. Kernel 2.6.20.21 was fine with both test cases.

The meaning of the results:
 - min message time
 - average message time (80% of the messages)
 - message time at median
 - maximal message time
 - test duration

Here are the results.
Linux balrog704 2.6.20.21 #1 SMP Wed Jan 14 10:11:34 CET 2009 x86_64 GNU/Linux
min:0.000ms|avg:0.241-0.249ms|mid:0.244ms|max:18.367ms|duration:25.304s
min:0.002ms|avg:0.088-0.094ms|mid:0.093ms|max:17.845ms|duration:19.694s
min:0.002ms|avg:0.030-0.038ms|mid:0.038ms|max:564.062ms|duration:38.370s
min:0.002ms|avg:0.004-0.007ms|mid:0.004ms|max:1212.746ms|duration:33.137s
min:0.002ms|avg:0.004-0.005ms|mid:0.004ms|max:1092.045ms|duration:31.686s
min:0.002ms|avg:0.004-0.007ms|mid:0.004ms|max:4532.159ms|duration:59.773s

Linux balrog704 2.6.22.19 #1 SMP Wed Jan 14 10:16:43 CET 2009 x86_64 GNU/Linux
min:0.003ms|avg:0.394-0.413ms|mid:0.403ms|max:19.673ms|duration:42.422s
min:0.003ms|avg:0.083-0.188ms|mid:0.182ms|max:13.405ms|duration:37.038s
min:0.003ms|avg:0.056-0.075ms|mid:0.070ms|max:656.112ms|duration:72.943s
min:0.003ms|avg:0.005-0.010ms|mid:0.007ms|max:1756.113ms|duration:49.163s
min:0.003ms|avg:0.005-0.010ms|mid:0.007ms|max:11560.976ms|duration:52.836s
min:0.003ms|avg:0.008-0.010ms|mid:0.010ms|max:5316.424ms|duration:111.323s

Linux balrog704 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64 GNU/Linux
min:0.003ms|avg:0.223-0.450ms|mid:0.428ms|max:8.494ms|duration:46.123s
min:0.003ms|avg:0.140-0.209ms|mid:0.200ms|max:12.514ms|duration:39.100s
min:0.003ms|avg:0.068-0.084ms|mid:0.076ms|max:38.778ms|duration:78.157s
min:0.003ms|avg:0.454-0.784ms|mid:0.625ms|max:11.063ms|duration:65.619s
min:0.004ms|avg:0.244-0.399ms|mid:0.319ms|max:21.018ms|duration:64.741s
min:0.003ms|avg:0.061-0.138ms|mid:0.111ms|max:23.861ms|duration:126.309s
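To compare result lines like these across kernels without eyeballing them, a small parser may help. This is only a sketch: the regular expression is inferred from the result lines shown in this comment, and the function names are invented for illustration, not part of the attached test cases.

```python
import re

# Pattern inferred from the posted lines, e.g.
# min:0.003ms|avg:0.005-0.010ms|mid:0.007ms|max:11560.976ms|duration:52.836s
RESULT_RE = re.compile(
    r"min:(?P<min>[\d.]+)ms\|"
    r"avg:(?P<avg_lo>[\d.]+)-(?P<avg_hi>[\d.]+)ms\|"
    r"mid:(?P<mid>[\d.]+)ms\|"
    r"max:(?P<max>[\d.]+)ms\|"
    r"duration:(?P<duration>[\d.]+)s"
)


def parse_result(line):
    """Turn one result line into a dict of floats (ms, except duration in s)."""
    match = RESULT_RE.search(line)
    if match is None:
        raise ValueError("not a result line: %r" % line)
    return {key: float(value) for key, value in match.groupdict().items()}


def worst_max_ms(lines):
    """Largest 'max' message time over a set of result lines."""
    return max(parse_result(line)["max"] for line in lines)
```

For example, feeding it the six lines of one kernel gives the worst single message latency seen on that kernel, which is the number that matters most for interactivity.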
Created attachment 19795 [details] test case with processes and pipes
Created attachment 19796 [details] test case with threads and mutexes
Created attachment 19797 [details] All testresult on Core2 T7700 @ 2.40GHz / 4GB RAM
I guess the high I/O wait time and the poor responsiveness are the same problem, caused by the CPU scheduler, as I can produce the same symptoms without disk I/O. Since 2.6.26/27 everyone should be affected by this issue.

What I do not understand is: why does the test with threads and mutexes take twice as long as the test with processes and pipes, yet stress the system much more? The mouse freezes almost immediately, while the test with processes and pipes still allows moving windows.
I've met the high I/O wait problem with 3ware cards on Centos 5.x. This is related to pci_try_set_mwi. More information here: https://bugzilla.redhat.com/show_bug.cgi?id=444759 Now Thomas seems to have found another source for the problem. Maybe mwi is adding on top of that (not every controller driver sets MWI - BIOS is supposed to do so, but I've met a couple of boards that do not). HTH.
If I run "google desktop indexer", then I get the long waits. E.G. vim goes away for up to 5-30 seconds, repeatably! So, I don't run "google desktop indexer". No problem since 12/15/08!
You can also add the task:
 - copying a file from a CompactFlash card through a USB adaptor or PCMCIA card.

The computer is not usable until the copy of the file (3 to 5 MB) has finished. It doesn't matter whether it copies the whole card or only a single file.

It seems to be similar to the description of the bug here.
I have found that this may be an issue with the Completely Fair Queuing (CFQ) I/O scheduler that was introduced as default in 2.6.18 (when most started observing this performance issue). Reverting back to the old AS (anticipatory) scheduler seems to have resolved the problem for me.

To use the AS scheduler and test for yourself, just specify "elevator=as" as a boot option.
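To confirm which scheduler a disk is actually using after booting with that option, the active one is shown in square brackets in /sys/block/<device>/queue/scheduler. A minimal sketch for reading it follows; the function names and the default device name "sda" are assumptions for illustration.

```python
def active_scheduler(text):
    """Return the bracketed entry from a scheduler line, or None if none is marked.

    The sysfs file looks like: 'noop anticipatory deadline [cfq]'.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_scheduler(device="sda"):
    """Read the active I/O scheduler of a block device (Linux only)."""
    path = "/sys/block/%s/queue/scheduler" % device
    with open(path) as sched_file:
        return active_scheduler(sched_file.read())


if __name__ == "__main__":
    print(read_scheduler())
```

Writing a scheduler name into the same sysfs file as root switches it at runtime, which avoids a reboot between test runs.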
(In reply to comment #2)
> I'm also having this problem.
>
> Latest working kernel version: 2.6.18.8 with config:
>
> http://svn.pardus.org.tr/pardus/2007/kernel/kernel/files/pardus-kernel-config.patch
>
> Currently working on 2.6.25.20 with config:
>
> http://svn.pardus.org.tr/pardus/2008/kernel/kernel/files/pardus-kernel-config.patch
>
> Tested also with 2.6.28 and felt no significant performance improvement.
>
> --
>
> During heavy disk IO's like running 'svn up' hogs the system avoiding the start
> a new shell, browse on the internet, do some text editing using vim, etc.
>
> For example, after being able to open a text buffer with vim, 4-5 seconds
> delays happens between consecutive search attempts.

You seem to be able to reproduce the bug easily, and have found a non-affected kernel version. Can you git bisect between those kernels to at least isolate the culprit commit?
(In reply to comment #3)
>
> With the second test-case with threads, I got the problems
> (threads:2*100/msg:1M) immediately with the kernel 2.6.22.19. There kernel
> 2.6.20.21 was fine with both test-cases.

I'm not sure that's the same issue I had when I posted bug 7372, but since you seem to be a programmer you should git bisect between those kernels to isolate the culprit commit.
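For anyone unfamiliar with bisecting: conceptually it is a binary search over the commit history between a known-good and a known-bad kernel. The sketch below illustrates only that idea; the function and the is_bad callback are hypothetical, and real git bisect works on a commit graph rather than a flat list.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is true.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    once the regression appears every later commit is bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):  # build, boot and run the test case here
            bad = mid
        else:
            good = mid
    return commits[bad]
```

The point is the cost: with a few thousand commits between 2.6.20 and 2.6.22, roughly a dozen build-and-test rounds are enough to single out one commit.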
I'm not sure if this is related or not, but I'm getting similar behaviour on my own system, but *only* when copying files *from* my USB memory stick (a 4 GB Corsair Flash Voyager) *to* the internal SSD on my Asus Eee PC 900 running Ubuntu 8.10 with a custom build of Linux 2.6.27 (probably slightly patched) provided by array.org. I.e. reading a file from the USB stick to /dev/null, no slowdown. Writing /dev/zero to USB stick, no slowdown. Reading a file from the internal SSD to /dev/null, no slowdown. Writing /dev/zero to internal SSD, no slowdown. Copying a file from internal SSD to USB stick, no slowdown. Copying a file from USB stick to internal SSD, I get massive slowdowns on interactive performance. Launching a terminal, which usually takes a few seconds, suddenly takes the better part of a minute. Linux used is 2.6.27-8-eeepc on i686 SMP, as prebuilt by http://www.array.org/ubuntu/ The filesystem on the internal SSD is ext3, running on LVM, running on LUKS (encrypted filesystem). As set up by the Ubuntu 8.10 installer. Swap is also on the same encrypted LVM. The filesystem on the USB stick is vfat. Nothing fancy at all. I should also add that the read performance of my USB stick is faster (about 25 MB/s) than the write performance on the built-in SSD (about 10 MB/s). If you feel that it is useful, I can provide dumps of lspci/lsusb/lsmod or any other information. As for the exact build options and patches, that should be determinable by checking the web site specified above. Hope more data makes it possible to determine a pattern to this bug.
I tried Mike's solution from comment http://bugzilla.kernel.org/show_bug.cgi?id=12309#c11 and indeed that solved my issue.

So it seems that he is right, at least for my problem.
I tried elevator=as on my system, and it did not change the behaviour. Copying files from external USB to internal encrypted SSD still totally smashes interactive performance. So this issue might be unrelated.
(In reply to comment #16)
> I tried elevator=as on my system, and it did not change the behaviour. Copying
> files from external USB to internal encrypted SSD still totally smashes
> interactive performance. So this issue might be unrelated.
>

This may be an unrelated issue having to do with USB I/O, since USB seems to be more CPU intensive anyway. When I experienced this bug (prior to switching from CFQ), it would happen whenever I copied a large file on ATA or SCSI devices, and I noticed extremely high I/O wait times with very low CPU usage. Not only during copying, but during any disk-intensive operation. Everything on my affected machines would come to a grinding halt until the operation was complete.

Using AS has so far seemed to resolve the issue for me, as my machines are now as responsive as they should be during heavy disk I/O.
I have had a very similar problem to this. I still have it often, but not as much since I changed from EXT3 to ReiserFS. For the scheduler, I've been using BFQ or V(R), which is included in the Zen patchset. I have tried the stock kernel, and the same problem exists; however, I can't remember which scheduler I used at that point, I believe Deadline.

Most of the IOWait I get comes when I'm either copying files to the local drives or using multiple VMs (generally Windows, as that's what is needed for work). I'm willing to try just about anything to get this fixed. It's a little better since I switched filesystems on my VM drive, but it still isn't totally fixed.
(In reply to comment #11)
> I have found that this may be an issue with the Complete Fair Queuing I/O
> scheduler that was introduced as default in 2.6.18 (when most started observing
> this performance issue). Reverting back to the old AS scheduler for me seems
> to have resolved the problem.
>
> To use the AS scheduler and test for yourself, just specify "elevator=as" as a
> boot option.
>

Fwiw, I've never used the CFQ scheduler. I'm on the deadline scheduler with my 3ware 9560SE and still see this problem crop up from time to time, usually when doing a file copy large enough to fill the page cache.
I too have found that the choice of I/O scheduler makes little difference. Using AS generally yields no noticeable improvement.
>
> Fwiw, I've never used the CFQ scheduler. I'm on the deadline scheduler with my
> 3ware 9560SE and still see this problem crop up from time to time, usually when
> doing a file copy large enough to fill the page cache.
>

Another deadliner here. And the same thing. There are two clear-cut triggers for me:
1. The test case Thomas posted.
2. Large copies which fill up the page cache.

I think it's a process scheduling bug, because page cache fill-up might be triggering the pdflush processes (which are, btw, normal priority. Why?) into hyperdrive and causing all other processes to wait. We do see various processes going into 'D' state and pdflush at the top of the CPU usage list when the symptoms occur.

If CFQ is used, and process priority determines IO priority, aren't pdflush processes going to compete with processes doing their own IO when dirty_ratio is reached and the process has priority equal or better than 0 (-1 and higher)? That may explain some of the stories with CFQ here.
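To check whether the stalls line up with the page cache filling with dirty pages, one can watch the Dirty and Writeback counters in /proc/meminfo while the copy runs. The sketch below is a rough indicator only: the function names are invented, and the kernel's own accounting against vm.dirty_ratio is not simply this fraction of MemTotal.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {name: value}, values as printed (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])
    return values


def dirty_fraction(meminfo):
    """(Dirty + Writeback) as a fraction of MemTotal; a rough indicator only."""
    return (meminfo["Dirty"] + meminfo["Writeback"]) / meminfo["MemTotal"]


if __name__ == "__main__":
    while True:  # stop with Ctrl-C
        with open("/proc/meminfo") as meminfo_file:
            info = parse_meminfo(meminfo_file.read())
        print("dirty+writeback: %.1f%% of RAM" % (100 * dirty_fraction(info)))
        time.sleep(1)
```

If interactivity collapses at about the moment this figure stops growing, that would support the page cache / pdflush theory over a pure I/O scheduler problem.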
Re: blaming the scheduler in 2.6.26

The problem was observed a long time before that. There might be additional scheduler problems (this bug in general suffers from the "lots of different problems" disease), but that is unlikely to be the old, well-known issue of disk starvation with different devices.

Re comment #9, vim stalls while the disk is pounded:

You're running ext3 or reiser, right? That's a known problem: vim regularly does an fsync on its autosave file, which causes a synchronous JBD transaction, and since all transactions are strictly ordered, if there are enough of them in front and the disk is busy, it takes quite a long time.

At least on the higher level, that is supposed to be mostly solved by ext4 or by XFS.

Of course it's another problem that the disk schedulers allow such long starvation in the first place.
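The fsync stall described here is easy to measure in isolation. A minimal sketch, assuming a writable directory on the filesystem under test (the function name and the 4 KiB payload are arbitrary choices):

```python
import os
import tempfile
import time


def time_fsync(directory=None, payload=b"x" * 4096):
    """Write a small temp file in `directory` and return how long fsync() took, in seconds."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, payload)
        start = time.monotonic()
        os.fsync(fd)  # blocks until the data (and, on ext3, the journal) is on disk
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    for _ in range(10):
        print("fsync took %.3f s" % time_fsync("."))
        time.sleep(1)
```

Running this in a loop while the dd or cp workload is active should show multi-second fsync times on an affected ext3 system if the explanation above applies.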
Hi Thomas~

Can you elaborate on your test? You wrote:
"The first three results are 2*100, 2*50 and 2*20 processes exchanging 100k, 200k and 1M messages over a pipe. The last three results are 2*100, 2*50, and 2*20 threads exchanging 100k, 200k and 1M messages with pthread_mutex and pthread_cond."

So, I'm guessing you want the test to be run like this:
./processtest 200 100000
./processtest 100 200000
./processtest 40 1000000
./threadtest 200 100000
./threadtest 100 200000
./threadtest 40 1000000

Is that correct? Just want to be sure I'm running the same tests. (Also, the code limits the number of processes to max 100... so I just edited this, allowing the max limit to be 200.)

Here are our results:

2.6.15.7-ubuntu1-custom-1000HZ_CLK #1 SMP Thu Jan 15 19:06:30 PST 2009 x86_64 GNU/Linux
(ubuntu 6.06.2 server LTS with clk_hz set to 1000HZ)
min:0.004ms|avg:0.004-0.271ms|mid:0.005ms|max:42.049ms|duration:34.029s
min:0.004ms|avg:0.004-0.138ms|mid:0.035ms|max:884.865ms|duration:33.105s
min:0.004ms|avg:0.004-0.042ms|mid:0.004ms|max:2319.621ms|duration:62.438s
min:0.005ms|avg:0.010-0.026ms|mid:0.012ms|max:1407.923ms|duration:92.132s
min:0.005ms|avg:0.011-0.029ms|mid:0.013ms|max:1539.929ms|duration:97.034s
min:0.005ms|avg:0.010-0.031ms|mid:0.013ms|max:18669.095ms|duration:176.555s

2.6.24-23-server #1 SMP Thu Nov 27 18:45:02 UTC 2008 x86_64 GNU/Linux
(default ubuntu 64 8.04 server LTS at default 100HZ clock)
min:0.004ms|avg:0.034-0.357ms|mid:0.324ms|max:39.789ms|duration:43.390s
min:0.004ms|avg:0.006-0.149ms|mid:0.131ms|max:79.430ms|duration:39.288s
min:0.004ms|avg:0.046-0.057ms|mid:0.052ms|max:52.427ms|duration:64.481s
min:0.005ms|avg:0.006-0.650ms|mid:0.330ms|max:22.120ms|duration:60.142s
min:0.005ms|avg:0.053-0.309ms|mid:0.276ms|max:21.560ms|duration:62.353s
min:0.004ms|avg:0.033-0.123ms|mid:0.112ms|max:22.007ms|duration:131.029s

Linux la 2.6.24.6-custom #1 SMP Thu Jan 15 23:34:10 UTC 2009 x86_64 GNU/Linux
(ubuntu 8.04 server LTS with clk_hz custom set to 1000HZ)
min:0.004ms|avg:0.054-0.364ms|mid:0.332ms|max:24.524ms|duration:42.522s
min:0.004ms|avg:0.125-0.156ms|mid:0.144ms|max:13.171ms|duration:33.573s
min:0.004ms|avg:0.046-0.058ms|mid:0.052ms|max:13.005ms|duration:64.388s
min:0.005ms|avg:0.006-0.594ms|mid:0.302ms|max:13.481ms|duration:61.105s
min:0.005ms|avg:0.109-0.336ms|mid:0.307ms|max:13.345ms|duration:65.000s
min:0.002ms|avg:0.070-0.130ms|mid:0.120ms|max:13.137ms|duration:133.786s

Side notes... we have been experiencing problems with MySQL, specifically with sync-binlog=1 and log-bin on while performing a high volume of concurrent transactions. Although we run RAID-1 with battery cache on... our throughput is horrible. For some reason, we have found that by increasing CONFIG_HZ from 100 to 1000 in the kernel, we get much higher throughput. Otherwise our benchmarks just sit around and have trouble context switching.

#CONFIG_HZ_100=y
#CONFIG_HZ=100
#change to:
CONFIG_HZ_1000=y
CONFIG_HZ=1000

I do not know if the problems we are experiencing with the clock are related to the bug listed here. However, I did want to submit our feedback showing the difference between kernels where our bottleneck runs better. We use sysbench for our test (with vmstat -S M 3, iostat -dx 3, and mpstat 3 to monitor; all part of the sysstat suite).
FYI, here are our sysbench commands (be sure to change your mysql username and password and create the database sbtest):

You can get sysbench here:
http://sysbench.sourceforge.net/

Compile it like:
./configure --with-mysql --with-mysql-include=/usr/share/include --with-mysql-lib=/usr/share/lib
make
make install

Prepare it:
./sysbench --num-threads=50 --test=oltp --oltp-test-mode=complex --oltp-table-size=100000 --oltp-distinct-ranges=0 --oltp-order-ranges=0 --oltp-sum-ranges=0 --oltp-simple-ranges=0 --oltp-point-selects=0 --oltp-range-size=0 --mysql-table-engine=innodb --mysql-host=127.0.0.1 --mysql-user=ROOT --mysql-password=PASSWORD prepare

Run it:
./sysbench --num-threads=50 --test=oltp --oltp-test-mode=complex --oltp-table-size=100000 --oltp-distinct-ranges=0 --oltp-order-ranges=0 --oltp-sum-ranges=0 --oltp-simple-ranges=0 --oltp-point-selects=0 --oltp-range-size=0 --mysql-table-engine=innodb --mysql-host=127.0.0.1 --mysql-user=ROOT --mysql-password=PASSWORD run

The important lines of output are read/write requests per second, and total time.

===
2.6.15.7-ubuntu1-custom-1000HZ_CLK #1 SMP Thu Jan 15 19:06:30 PST 2009 x86_64 GNU/Linux
(ubuntu 6.06.2 server LTS with clk_hz custom set to 1000)

read/write requests: 50000 (2394.13 per sec.)
total time: 20.8844s

vmstat -S M 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0   9043    142    559    0    0     1 30341 5020 25659  6 15 78  1

iostat -dx 3
Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s    wsec/s   rkB/s     wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  4320.74    0.00  4836.12     0.00  73254.85    0.00  36627.42    15.15     4.93   1.02   0.16  77.02
===

===
2.6.24-23-server #1 SMP Thu Nov 27 18:45:02 UTC 2008 x86_64 GNU/Linux
(default ubuntu 64 8.04 server LTS at default 100HZ clock)

read/write requests: 50000 (434.33 per sec.)
total time: 115.1207s

vmstat -S M 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0      0   1506    109    100    0    0   155  5011  531  4532  5  3 91  1

iostat -dx 3
Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   951.67   30.67   551.00   274.67  12021.33    21.14     1.18   2.03   1.60  93.00
===

===
Linux la 2.6.24.6-custom #1 SMP Thu Jan 15 23:34:10 UTC 2009 x86_64 GNU/Linux
(ubuntu 8.04 server LTS with clk_hz custom set to 1000)

read/write requests: 50003 (2680.47 per sec.)
total time: 18.6546s

vmstat -S M 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0   1710     46     73    0    0  1296 27104 3474 31095  5  3 82  9

iostat -dx 3
Device:  rrqm/s   wrqm/s     r/s      w/s   rsec/s    wsec/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  2432.33  159.33  2576.00  1632.00  40066.67    15.24     1.95   0.71   0.35  94.47
===

Note: Our servers are 2x Intel Xeon 5110 dual core 1.6GHz with 15k SAS RAID-1 and 2+GB RAM.

Not sure if this feedback is helping or not; my hope is that it is relevant to what you are trying to fix. My personal opinion is that the kernel should scale a little more uniformly than 434.33 per sec versus 2680.47 per sec... that seems to be a large difference. Even though the 100Hz clock setting is recommended for servers, it seems this could actually not be ideal for anyone running a MySQL server that needs safe transaction support via sync-binlog=1 (at least, that's what we are finding for high insert/update load).

Perhaps you can look at sysbench, as there are a number of tests for threads, fileio, etc., to determine if this can expose the kernel issues in a different way?

Any feedback about an ideal kernel and kernel config for servers is much appreciated, as these are no doubt difficult to debug.
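As a sanity check on the three sysbench runs quoted in this comment, the per-second rates follow directly from the request counts and total times, and the gap between the 100HZ and 1000HZ kernels comes out at roughly a factor of six. A small sketch of that arithmetic (the labels and function names are only for illustration):

```python
# (label, read/write requests, total time in seconds), copied from the runs above
RUNS = [
    ("2.6.15.7 custom, HZ=1000", 50000, 20.8844),
    ("2.6.24-23-server, HZ=100", 50000, 115.1207),
    ("2.6.24.6 custom, HZ=1000", 50003, 18.6546),
]


def requests_per_second(requests, seconds):
    """Throughput as sysbench reports it: requests divided by wall time."""
    return requests / seconds


def speedup(fast, slow):
    """Ratio of the throughput of two (requests, seconds) pairs."""
    return requests_per_second(*fast) / requests_per_second(*slow)


if __name__ == "__main__":
    for label, requests, seconds in RUNS:
        print("%-26s %8.2f req/s" % (label, requests_per_second(requests, seconds)))
    print("HZ=1000 vs HZ=100 on 2.6.24: %.1fx" % speedup(RUNS[2][1:], RUNS[1][1:]))
```

The recomputed rates agree with the reported ones to within rounding of the total time.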
Did some more testing. My father has an Eee PC 900 exactly the same as mine, also running Ubuntu 8.10 with the same kernel as mentioned before. The only difference that I can think of: he doesn't use LUKS and LVM like me; instead he has his / directly on /dev/sdb1 (internal SSD).

In addition to trying to launch a Terminal via Gnome (as I did previously), I also tried the vim "stuttering" test by creating a file, saving it, and holding down a key to see when it stutters.

The results of these tests:
 - On both my own (encrypted) and the other (unencrypted) computer, vim occasionally freezes for a few seconds while I cp a file from USB memory to internal SSD.
 - On my computer (encrypted), launching a gnome-terminal takes much longer while copying a file from SSD than on the other computer. While there is a noticeable slowdown on the unencrypted machine, on the encrypted machine sometimes the gnome-terminal won't even launch until *after* the copy is complete.

In conclusion: the effect exists on both machines, but the encryption of the SSD very significantly increases the problem. While some slowdown due to encryption should be expected, it should not make the machine almost completely unusable while copying a file from a USB stick to the internal SSD.
Different scheduler (#11) doesn't seem to do much. I did some quick and dirty testing with my laptop:

Linux lupaus 2.6.28-customlupaus #4 SMP PREEMPT Thu Dec 25 15:05:35 EET 2008 x86_64 GNU/Linux
Vanilla 2.6.28 kernel, config from Ubuntu 8.10, with some modifications to suit my laptop

with io scheduler cfq
./threadtest 100 200000
min:0.004ms|avg:0.007-0.008ms|mid:0.008ms|max:894.480ms|duration:187.588s

with elevator=as (i.e. io scheduler anticipatory)
./threadtest 100 200000
min:0.004ms|avg:0.007-0.008ms|mid:0.008ms|max:884.016ms|duration:188.248s

---

with io scheduler cfq
./proctest 50 100000
min:0.005ms|avg:0.005-0.006ms|mid:0.006ms|max:460.631ms|duration:35.773s

with elevator=as (i.e. io scheduler anticipatory)
./proctest 50 100000
min:0.005ms|avg:0.006-0.006ms|mid:0.006ms|max:479.695ms|duration:36.645s
One more observation from another experiment I did: I have swap on the same encrypted LVM as my root partition. Disabling swap makes the terminal launch much faster while copying; it is still slower than when not copying files, but it opens within a few seconds of clicking instead of within minutes.

However! Now, instead, individual running processes (like Firefox and vim) hang much more aggressively and frequently during copying.

I'm not sure what to make of this, but I hope somebody who actually knows something about the Linux kernel will find this useful. :-)
I'm not sure any developer will be able to pinpoint the problem in all this mess! ;-) There are likely several bugs here. For a start, I think it could be nice to separate people whose problem is fixed by elevator=as. And then separate people using encrypted disks. And then problems occurring only with USB disks. Please open new reports. What do developers think?
Created attachment 19828 [details]
Bisect results

I have done the bisect and isolated the patch. In the attachment you can find the bisect result. I have done the sysbench test too.

Tests: 100 Process / 1k messages

Linux balrog704 2.6.20 #13 SMP Fri Jan 16 10:13:21 CET 2009 x86_64 GNU/Linux
min:0.003ms|avg:0.243-0.253ms|mid:0.246ms|max:29.503ms|duration:25.080s
min:0.002ms|avg:0.022-0.038ms|mid:0.037ms|max:756.082ms|duration:37.894s
min:0.002ms|avg:0.004-0.007ms|mid:0.004ms|max:929.790ms|duration:34.608s

Linux balrog704 2.6.20bad #14 SMP Fri Jan 16 10:52:17 CET 2009 x86_64 GNU/Linux
min:0.003ms|avg:0.411-0.434ms|mid:0.424ms|max:18.328ms|duration:43.549s
min:0.003ms|avg:0.063-0.075ms|mid:0.071ms|max:404.088ms|duration:72.860s
min:0.003ms|avg:0.005-0.010ms|mid:0.009ms|max:712.033ms|duration:51.654s
Created attachment 19829 [details]
sysbench results

As I am using Firefox3 with the bad kernel, my post was submitted by accident. With the good kernel there are (nearly) no problems with Firefox3 any more.

The tests were run with the following parameters:
 - 2*100 processes / 100k messages
 - 2*20 processes / 1M messages
 - 2*200 threads / 100k messages
Created attachment 19830 [details] Bisect results wrong file
Re #26 There's some performance problem in general with encrypted swap. I've seen that too. But it's probably a different issue than the primary one which should be discussed here.
> Is that correct? Just want to be sure i'm running the same tests (Also, the
> code limits number of processes to max 100... so I just edited this allowing
> the max limit to be 200)

I have used 100/50/20, as one echo process uses 2 threads or processes. But it is not important, as these tests should only compare different kernel versions on the same computer.
(In reply to comment #18)
> I have had a very similar problem to this. I still have it often, but not as
> much from when I changed from EXT3 to ReiserFS. For the Scheduler, I've been
> using BFQ or V(R) thats included in the Zen Patchset. I have tried the stock
> kernel, and same problem exists, however I can't remember which scheduler I
> used at that point, I believe Deadline.
> Most of the IOWait I get comes when either I'm copying files to the local
> drives, or using multiple VM's (generally Windows as thats what is needed for
> work). I'm willing to try about anything to get this fixed. It's a little
> better since I switched FS's on my VM Drive, but still isn't totally fixed.
>

I did try the AS scheduler (that was the only thing I changed in my kernel), and it didn't change anything interactively; I still get high IO wait. The other thing I noticed, at least with AS, is that I start using swap. It's not a lot (within about 2 minutes I was using 10MB), but it was still climbing.

One other thing: I'm wondering if this is 64-bit related. All of my personal boxes are 64-bit, and judging from the reports posted here, along with other threads I've read (over on the Gentoo forums), this seems to hit 64-bit users more than 32-bit users. Any truth to this, or am I trying to relate things that aren't related?

My work box (most heavily used):
Linux PC010233L 2.6.28-zen1-2 #2 SMP PREEMPT Thu Jan 15 16:06:37 EST 2009 x86_64 Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz GenuineIntel GNU/Linux
(In reply to comment #30)
> Created an attachment (id=19830) [details]
> Bisect results
>

If that bisection is to be believed, the assertion that the issue is caused by a scheduling issue seems quite plausible.

(In reply to comment #33)
> One other thing, I'm wondering if this is 64bit related. All of my personal
> boxes are 64bit, and it seems of ones posted here, along with other threads
> I've read (over on Gentoo forums) that it seems this hits the 64bit users more
> then the 32bit users. Any truth to this, or am I trying to relate things that
> aren't related?
>

There is evidence that x86-64 is a factor here.
It does strike me as quite odd how large of a factor the size of the transfer seems to be. When I first start evolution (I have very large folders), the system will exhibit poor interactivity for upwards of 5 to 10 minutes. However, when transferring lots of small files (i.e. module_install'ing), the kernel behaves fine. (although modpost also seems to produce poor interactivity) I think it might help if we had a kernel developer here to list the kernel block/memory manager/scheduler statistics that might indicate where this I/O wait time is going. If sufficient statistics don't exist, it might be worthwhile to instrument the kernel specifically for this bug. It does seem clear that the bug I intended this ticket to describe is invariant on I/O scheduler, so that's one factor that needn't be accounted for.
I just recompiled my kernel without any SMP support and tested again. My laptop went from usable to totally unusable. Network traffic stops and it's even hard to type anything while the process/thread test is running. I have only a single CPU in my laptop.

I also tried changing the scheduler with this setup and that didn't make any difference.

Good luck :)
Could this be a jiffies wraparound bug?
I've seen different formulas for doing interval arithmetic, and (not) handling wraparound.
For instance, in as_antic_expired()
::
long delta_jif;

delta_jif = jiffies - ad->antic_start;
if (unlikely(delta_jif < 0))
        delta_jif = -delta_jif;
::
, which seems incorrect to me.
(it could alter the predictive powers of the scheduler in mysterious ways ;-)
(A different calculation is performed at other places.)

Jiffies wrap around depending on the HZ value (but still, intervals above INT_MAX should be relatively rare), and the jiffies start value will cause the first wrap @ 5 min after booting, so that would show.

My 2 cents,
AvK
Adriaan: drivers shouldn't be manually doing comparisons on jiffies values. There are helpers in linux/jiffies.h for doing the comparison (time_before() / time_after()) and those should handle wraparounds. If you do see a driver that is doing the wrong thing, I'd open another bug specifically about that (or post a patch yourself :D).
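To illustrate why the helpers matter, here is a small emulation of wrap-safe jiffies comparison. It is a sketch under stated assumptions: a 32-bit counter is assumed for illustration (jiffies is an unsigned long, so 64 bits wide on x86_64), and the Python functions only mimic the signed-difference trick used by time_after() / time_before(), they are not kernel code.

```python
BITS = 32                      # assumed counter width, for illustration only
MASK = (1 << BITS) - 1


def to_signed(value):
    """Reinterpret an unsigned BITS-wide value as a signed one."""
    value &= MASK
    return value - (1 << BITS) if value & (1 << (BITS - 1)) else value


def time_after(a, b):
    """True if tick count a is later than b, even across a wraparound."""
    return to_signed(b - a) < 0


def time_before(a, b):
    """True if tick count a is earlier than b, even across a wraparound."""
    return time_after(b, a)


def elapsed(now, start):
    """Ticks elapsed from start to now, using unsigned modular subtraction."""
    return (now - start) & MASK


def wrap_period_days(hz):
    """How long a BITS-wide tick counter runs before wrapping, in days."""
    return (1 << BITS) / hz / 86400.0
```

With start just below the wrap point and now just after it, a plain `now > start` comparison gives the wrong answer while time_after(now, start) still holds. The comparison is only meaningful while the two values are less than half the counter range apart.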
With the following code I got negative time differences of about -127ms. The tv_sec values were equal and the second tv_usec was smaller than the first. I cannot say which kernel it was, as I am no longer able to reproduce it. Some days ago it occurred on nearly every test. As this behaviour is connected with the TSC synchronisation patch, I have posted it here. I will try to figure out the kernel version.

> gettimeofday(&tv_s, &tz);
> write(a2b[1], &c, 1);
> read(b2a[0], &c, 1);
> gettimeofday(&tv_e, &tz);
> timersub(&tv_e, &tv_s, &tv_r);
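For reference, this is what the timersub() normalization does with such a pair of timestamps. The sketch emulates it on (seconds, microseconds) tuples; the concrete values are invented to match the -127ms case described (equal tv_sec, second tv_usec smaller), not taken from a real run.

```python
def timersub(end, start):
    """Emulate timersub(): subtract (sec, usec) pairs, keeping 0 <= usec < 1000000."""
    sec = end[0] - start[0]
    usec = end[1] - start[1]
    if usec < 0:
        sec -= 1
        usec += 1000000
    return sec, usec


def to_ms(tv):
    """Convert a (sec, usec) pair to milliseconds."""
    return tv[0] * 1000.0 + tv[1] / 1000.0


if __name__ == "__main__":
    tv_s = (100, 327000)   # hypothetical start timestamp
    tv_e = (100, 200000)   # hypothetical end timestamp, earlier than the start
    tv_r = timersub(tv_e, tv_s)
    print(tv_r, to_ms(tv_r))
```

So a result with tv_sec of -1 and a large tv_usec is the signature of the clock having stepped backwards between the two calls. For interval measurements a monotonic clock (clock_gettime with CLOCK_MONOTONIC, or time.monotonic() in Python) is usually the safer choice, though whether it would have hidden this particular problem is not something this sketch can tell.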
I get the negative time difference on 2.6.17.14 kernel.org, 2.6.18.8 kernel.org and 2.6.18-92.el5 CentOS. My system is unusable with these three kernels when I use ide_generic: disk throughput is ~3MB/s and I/O wait time is at 100%. There are no problems with ahci and libata on 2.6.18-92.el5.

I was not able to provoke a negative time difference with kernels 2.6.20, 2.6.21, 2.6.24, 2.6.27 and 2.6.8.
Created attachment 19839 [details]
32v64test

32-bit test vs 64-bit

This test is slightly apples and oranges... however, because someone inquired whether this was a 32bit or a 64bit problem, I ran these tests.

I'm inclined to think it applies to both 32bit and 64bit for 2 reasons:
 - The 32 bit test didn't perform that great
 - The git bisect comment states "the biggest change is the removal of the 'fix up TSCs' code on x86_64 and i386"
Created attachment 19840 [details] 32v64testCleanNewLines.txt formatting fix
Please ignore my comments #39 and #40, as these are different problems.
Are you guys aware of the Latencytop utility? http://www.latencytop.org/ You have to add CONFIG_LATENCYTOP=y to your config. Then run your tests which break down the system with Latencytop running. It might give additional information.
I've reproduced this problem with LTTng (http://ltt.polymtl.ca). It looks like the block layer is backmerging the large "dd if=/dev/zero ...." requests at a rate which leaves the request on the top of the request queue. I've started a more thorough discussion on lkml here : http://lkml.org/lkml/2009/1/16/487
re: the 32bit vs 64bit idea - I've experienced this issue on both 32 and 64 bit platforms, however - all of the platforms were on x64-capable CPUs (not sure if that would matter).
I hit this bug on Ubuntu 8.10 (updated to 2.6.27-9-generic) running VMware Workstation 6.5.126130 with Ubuntu 8.04.1 LTS as a guest. It was especially pronounced when resuming a suspended VM. I tried the different elevator I/O schedulers. Nothing helped. Independent of VMware, if I ran bonnie in one shell and launched firefox, the whole system behaved in a very chunky manner. Renicing pdflush to -10 gave a great improvement in basic responsiveness. The weird part was that after re-creating a new VM and not seeing the iowait problems, I then tried resuming a VM with VMware at the same time as I was compressing a tar file with pbzip2 (parallel bzip). All 4 cores were pegged, my load average was normal, and system responsiveness was good. As **soon** as I tried resuming the VM with VMware Workstation, the CPU load dropped to 1-5% across all CPUs and iowait times shot way up. I have now killed VMware and iowait times have dropped, but my maximum read speed hovers around 1MB/s (as measured with iostat). This is another symptom of the iowait problem.
iostat -c -d -m -x sda 1
rMB/s is usually never over 2MB/s
(In reply to comment #46)
> re: the 32bit vs 64bit idea - I've experienced this issue on both 32 and 64 bit
> platforms, however - all of the platforms were on x64-capable CPUs (not sure if
> that would matter).

Running Thomas.pi's testcases on an IBM X40 with an old Pentium M (32bit) made my machine totally unusable. So I don't think this has anything to do with x64-capable CPUs.
(In reply to comment #38)
> Adriaan: drivers shouldnt be manually doing comparison on jiffies values.
> there are helps in linux/jiffies.h for doing the comparison (time_before() /
> time_after()) and those should handle wrap arounds. if you do see a driver
> that is doing the wrong thing, i'd open another bug specifically about that
> (or post a patch yourself :D).

Well, it was not in a driver's code but in block/as-iosched.c:as_fifo_expired(). The observed behavior indicates that something is wrong with the scheduling of disk I/O, and that most time is spent by all threads competing for one or more (spin-)locks; you might call it a convoy or a thundering herd syndrome. But it might be unrelated. AvK
Hi all, more tests.
Linux ws-esp16 2.6.27-11-generic #1 SMP Thu Jan 8 08:38:33 UTC 2009 i686 GNU/Linux
$ ./processtest 100 200000
min:0.006ms|avg:0.278-0.520ms|mid:0.475ms|max:141.058ms|duration:107.646s
$ ./threadtest 100 200000
min:0.006ms|avg:0.690-0.768ms|mid:0.715ms|max:235.106ms|duration:159.355s
But if this is an I/O problem, why don't the monitors show a big I/O wait percentage? They show a high system usage percentage instead. So I suppose this is not an I/O problem; it seems to be related to process handling inside the kernel. Might it be related to the preemption model? I did some additional tests:
1. Change clock timing -> (no improvement)
2. Change preemption model (tested all of them) -> (no improvement)
3. Change I/O scheduler -> (no improvement)
Is there any way to profile the kernel to see which function gets the most attention? Hope you find something... I am also attaching a screenshot.
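Since several people are posting result lines in this format, a small parser may help when comparing runs. This is only a sketch: the field names avg_lo and avg_hi are my own labels for the two numbers in the avg field, and the regular expression assumes exactly the layout shown in the result lines above.

```python
import re

# Matches lines such as:
# min:0.006ms|avg:0.278-0.520ms|mid:0.475ms|max:141.058ms|duration:107.646s
LINE_RE = re.compile(
    r"min:(?P<min>[\d.]+)ms\|"
    r"avg:(?P<avg_lo>[\d.]+)-(?P<avg_hi>[\d.]+)ms\|"
    r"mid:(?P<mid>[\d.]+)ms\|"
    r"max:(?P<max>[\d.]+)ms\|"
    r"duration:(?P<duration>[\d.]+)s"
)


def parse_result(line):
    """Return the numeric fields of one result line as a dict of floats."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a result line: {line!r}")
    return {name: float(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    process_run = parse_result(
        "min:0.006ms|avg:0.278-0.520ms|mid:0.475ms|max:141.058ms|duration:107.646s"
    )
    thread_run = parse_result(
        "min:0.006ms|avg:0.690-0.768ms|mid:0.715ms|max:235.106ms|duration:159.355s"
    )
    print("max latency ratio (thread/process):", thread_run["max"] / process_run["max"])
```

Times are in milliseconds except duration, which is in seconds, as in the original lines.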
Created attachment 19858 [details] Top output while running test
Created attachment 19859 [details] RFC patch to put a maximum to the number of cached bio merge done in a row Can you try this patch, which applies to 2.6.28, to see if it helps ? I have not been able to reproduce the problem with the patch applied.
Hi Mathieu, I tried this patch against 2.6.27 because it applied cleanly there. But the results are not good. It took even more time to complete the test. Can anyone confirm this?
This patch will probably diminish the overall throughput, because it is making sure that we do not merge more than 128 requests together. I am more interested in the I/O _latency_ (delay) you get when you run the system under a heavy I/O load. Mathieu
Created attachment 19866 [details] Port Attachment #19859 [details] to Linus's master
(In reply to comment #53)
> Hi Mathieu,
>
> I tried this patch against 2.6.27 because it patched right. But the results are
> not good. It took even more time to complete the test.
>
> Can anyone confirm this?

I can. Unfortunately, not only did the patch fail to reduce latency, it also reduced throughput. Even opening the file selection dialog to attach this patch took over 30 seconds while building a kernel.
Also, a patch set providing an ftrace interface to blktrace was recently submitted to the LKML (http://marc.info/?t=123212992300002&r=1&w=2). This could come in handy in further debugging.
Just a comment that might have gone unnoticed, but that appears relevant to me: this bug again seems to be turning into a collection of multiple issues, as happened with #7372, which made the kernel devs start to ignore it. The bisect done by thomas.pi yields a first bad commit dating from February 2007, while these symptoms first surfaced in 2.6.18, which dates from the end of 2006. Bug #7372 basically predates this first bad commit; the bisect I did in that bug, for example, pointed towards a problem with NCQ and the CFQ scheduler from November 2006 that clearly was only present on 64bit. See http://bugzilla.kernel.org/show_bug.cgi?id=7372#c112 as a reminder of this proof. I'm not sure that issue got resolved in the end... there were no clear pointers on what I could do to help further. Seeing reports in this bug of improvements when switching IO-scheduler and reports on differences between 32/64 bit makes me think those might be more related to that commit. The bottom line is to be sceptical of reports on whether or not a patch helps fully, as to me it still appears to be multiple issues that have very similar symptoms which are difficult to trigger reliably. However, the test-case of Thomas does bring my system to its knees as well, so it is definitely a good way to tackle at least part of the problem. But I don't think it is the only problem.
No, the patch does not fix the problem, but I think it is now better than before. I think that it is a CPU scheduler problem, as one process with many threads and a lot of thread switching can nearly stop the execution of other processes. This problem exists in every kernel version, even 2.6.15. You can test it by executing the thread-based test with 2*100 threads. My system starts to become unusable with kernel 2.6.27 (Fedora 10) when executing the thread-based test with 2*40-50 threads. I don't know how many interrupts occur while copying data, but perhaps that is what copying files and the thread-based test have in common. The provided bisect points to a CPU scheduler performance regression, which makes the problem more noticeable. The biggest CPU scheduler performance regression was in 2.6.24 - 2.6.27. There was another CPU scheduler performance regression between 2.6.22 and 2.6.24.
(In reply to comment #57)
> Seeing reports in this bug reporting improvements when switching IO-scheduler
> and reports on differences between 32/64 bit makes me think those might be more
> related to that commit.

Nobody confirms that changing the io-scheduler or 32<->64bit improves the system much? People are also testing different things: some test disk i/o and others are running process/thread tests. It's very confusing, and someone should run a couple of identical tests (including disk i/o AND process/thread tests) with different kernel options. On my setup, just disabling or enabling SMP support made a HUGE difference. I'm happy to do testing, but only if someone really needs information I can provide. Again, my worthless 5 cents.. :)
I just created a fio job file which acts like a "ls" executed while doing a large dd. It looks like the anticipatory I/O scheduler was causing those delays for me. The results for the ls-like jobs are interesting:

I/O scheduler   runt-min (msec)   runt-max (msec)
noop            41                10563
anticipatory    63                8185
deadline        52                33387
cfq             43                1420

Is it me or do all I/O schedulers except cfq generate unexpectedly high latency? Details here (including fio job file): http://lkml.org/lkml/2009/1/18/198
Mathieu
Actually, in this bug as well as in the other (7372), there is no clear direction. None of the kernel devs have taken a leadership role and pointed the reporters in a direction where we can start to get a handle on things. What we see here is a lot of speculation on the part of the users and hence an enormous variety of things being tried. It's like everybody shooting in the dark. Unless someone in the kernel team takes ownership of this bug, sorts out quarters from the pennies and directs users with a clear set of instructions to get well-defined data, I don't see this bug going anywhere. The question is who has the know-how and willingness to do that? We see the process as well as the io scheduler being involved, we see vm having an effect, we see some libata effects. With so many components in the line of fire and the kernel being as vast as it is, I don't see the above (one savior coming along and putting 2 & 2 together) happening. IOW, take a beer and head away from the computer and into the sun....;-)
Created attachment 19894 [details] Test job description for fio Attaching the test case written by Mathieu Desnoyers and included in his earlier email
(In reply to comment #33)
>
> I did try the AS Scheduler, as that was the only thing I changed in my kernel,
> and it didn't change anything interactively, still get a high IO Wait.
>
> The other thing I noticed, at least when in AS, I start using Swap, it's not a
> lot (within about 2 minutes I was using 10MB), but it was still climbing.
>
> My work box (most heavily used):
> Linux PC010233L 2.6.28-zen1-2 #2 SMP PREEMPT Thu Jan 15 16:06:37 EST 2009
> x86_64 Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz GenuineIntel GNU/Linux
>

Ok, I tried playing a little bit more, and switching to the DeadLine scheduler really helped things. I have topped around 73% IOWait, but it never bogged the whole box down. I still need to do definitive testing (via tests already in the bug report), but this seems to have helped. Not sure which problem this relates to in this bug though; I'm guessing the scheduler one.
Created attachment 19906 [details] fio test results of kernel 2.6.15 - 2.6.24
I have executed the test case of Mathieu Desnoyers on several different kernel versions. I took the bad and good kernels from my bisection. The results do not confirm my theory. If someone can identify a problem in it, I can run some more tests. The only regression I can see is the one with the noop scheduler. The values are the average of the average latencies.

./test.2.6.15-53-amd64-genericresult.noop   700,62ms
./test.2.6.20-17-genericresult.noop        3520,24ms
./test.2.6.20result.noop                   3005,24ms
./test.2.6.20badresult.noop                3698,64ms
./test.2.6.22.19result.noop                1393,67ms
./test.2.6.24.7result.noop                  589,66ms

I will check whether the 2.6.24.7 kernel test build has improved desktop responsiveness.
There is no performance improvement in 2.6.24.7. The list below shows the average times of the 41 small jobs with the cfq scheduler. I have the best desktop responsiveness on 2.6.20. Gimp starts under heavy I/O in 10 seconds instead of 30 seconds. The application freezes exist on 2.6.20 too, but they are much shorter, mostly under one second, while in kernels >= 2.6.22 there are freezes of up to one minute.

                     min     max     avg    stdev
2.6.20-17-generic    9.9   126.00   49.97   59.89
2.6.20               8.66  115.05   39.68   50.41
2.6.22.19           10.34  195.29   66.88   96.07
2.6.24.7             9.93  185.02   64.38   89.95

The I/O wait is at 75% at the start and climbs to 99-100% after ~5 seconds. I have noticed that the freezes occur more often in all applications when firefox is running. Currently I create a ram disk on startup, extract the .mozilla folder to it and save it again on shutdown. It makes my system more responsive, especially firefox3.
Created attachment 19912 [details] fio results for kernel 2.6.28
And finally the results for 2.6.28. I have removed all the tracing stuff I could find, but the system is still sluggish under heavy I/O.

               min     max      avg    stdev
2.6.28 noop   97,61  1799,06  654,84  861,90
2.6.28 cfq     9,32   169,32   55,59   79,50
Created attachment 19920 [details] ext3 and ext4 comparison with patched and unpatched kernel
Here are some more results. I could gain or lose some latency with different kernel settings. In 2.6.20 I could reproducibly lose 10ms of latency, which is a decrease of 25% in average latency. But it makes no difference to desktop responsiveness. I have tested the 2.6.28 kernel patched ( http://bugzilla.kernel.org/attachment.cgi?id=19866 ) and unpatched, with ext3 and ext4, using exactly the same kernel settings. My test system is installed on an ext3 partition; the tests are executed on an extra ext3 or ext4 partition (on the slower one) on the same hard drive. The write performance on ext4 is now at 45MB/s instead of 35MB/s (ext3). Desktop responsiveness in the ext4 test with the patched version decreases extremely. While copying a 10gig file from ext4 to ext4, there are nearly no problems with the unpatched kernel. With the patched kernel, the system becomes unusable. With ext3 there is a small responsiveness improvement with the patched kernel, but it could be coincidence, as I have no exact test for desktop responsiveness. But while copying the 10gig file on ext4 and compiling the kernel, my system becomes unusable with the unpatched kernel too. There are freezes of >20 seconds when accessing the menu in applications for the first time. You can easily simulate this behaviour by executing the following compression once per core.
bzip2 -9 -c /dev/urandom >/dev/null &
And the average latencies of the last four tests:

                          min     max      avg     stdev
2.6.28 unpatched ext3   11.24   181.20    62.35    86.15
2.6.28 patched ext3     10.82   175.93    62.18    83.89
2.6.28 unpatched ext4    6.90   396.17   132.52   213.18
2.6.28 patched ext4      6.85  2078.93   707.26  1006.74
Forget the back merge patch. Have you tried running latencytop to spot big sleep offenders?
Created attachment 19924 [details] Latencytop results
(In reply to comment #68)
> Have you tried running latencytop to spot big sleep offenders?

I am not sure what I should look at. You can find the most results in the file latencytop-ext4-2*bzip2.txt.
Most of them look as expected, up to about 1 second latency for a single IO under load. latencytop-ext4-2*bzip2.txt looks pretty bad, though. It has a 10 second wait on a single lock_page(), that's pretty slow. Again, this whole thread confuses me. The IO latencies from the fio jobs posted look OK, in the sense that they haven't regressed and that you can't expect zero latency when you are fully loading a disk with writes. So while we could do better there, it's not a catastrophe. The bisect you originally did pointed to something interesting, I think. If we have clock problems, the CPU scheduler could easily delay a single process for large amounts of time if other processes are repeatedly ready to run.
The scheduler normally has special code to handle bad clocks (like ones going backwards). Of course it has its limits, but it should handle the typical cases. Of course a bad clock could still confuse other subsystems. For testing you could force another clock, like clock=pmtmr or clock=hpet (if you have HPET).
It may be something as simple as a wrapped variable. IIRC, someone recently found something like that in the scheduler, though I can't find the posting just now. It was in kernel/sched_fair.c:update_curr() I think.
My default clock source is hpet. It is faster, but I have long freezes. With acpi_pm the system is sluggish, but the freezes were always below 5 seconds. Test: copy a 10gig file and execute "bzip2 -9 -c /dev/urandom >/dev/null" twice on a core2duo.

hpet
1299.7 / 1651.3 / 39790.7 / 4580.1 / 943.9 / 2069.3 / 145.7 / 1739.2 / 691.4 / 2060.2 / 172.3 / 492.4 / 2286.4 / 3064.9 / 696.9 / 716.9 / 14096.2 / 3131.2 / 1640.2 /
min:145.7 ms|max:39790.7 ms|avg:4277.31

acpi_pm
1969 / 1276.8 / 658.8 / 16303.8 / 1604.3 / 3885.8 / 823.6 / 3659.1 / 2719.6 / 2064.2 / 672.9 / 1327.9 / 1783.9 / 604.3 / 1289 / 9535.1 / 1271.5 / 280.9 / 2621.8 / 759.1 /
min:280.9 ms|max:16303.8 ms|avg:2755.57
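For anyone who wants to post comparable numbers, the summary lines can be derived from the slash-separated lists with a few lines of Python. This is a sketch; the function names are illustrative, and the two input strings are copied from the hpet and acpi_pm lists above.

```python
def parse_latencies(text):
    """Turn '1299.7 / 1651.3 / ...' into a list of floats, skipping empty fields."""
    return [float(tok) for tok in text.split("/") if tok.strip()]


def summarize(values):
    """Return (min, max, mean) of a non-empty list of latencies in ms."""
    return min(values), max(values), sum(values) / len(values)


HPET = (
    "1299.7 / 1651.3 / 39790.7 / 4580.1 / 943.9 / 2069.3 / 145.7 / 1739.2 / "
    "691.4 / 2060.2 / 172.3 / 492.4 / 2286.4 / 3064.9 / 696.9 / 716.9 / "
    "14096.2 / 3131.2 / 1640.2 /"
)
ACPI_PM = (
    "1969 / 1276.8 / 658.8 / 16303.8 / 1604.3 / 3885.8 / 823.6 / 3659.1 / "
    "2719.6 / 2064.2 / 672.9 / 1327.9 / 1783.9 / 604.3 / 1289 / 9535.1 / "
    "1271.5 / 280.9 / 2621.8 / 759.1 /"
)

if __name__ == "__main__":
    for name, raw in (("hpet", HPET), ("acpi_pm", ACPI_PM)):
        lo, hi, avg = summarize(parse_latencies(raw))
        print(f"{name}: min:{lo} ms|max:{hi} ms|avg:{avg:.2f}")
```

Run on these two lists it reproduces the quoted summaries (avg 4277.31 for hpet over 19 samples, avg 2755.57 for acpi_pm over 20 samples). Note how much a single long freeze such as 39790.7 ms dominates a plain average.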
I'm not sure what my default clock source is (where does one look to determine this?), however I just booted with clock=hpet and things don't seem to be particularly better (50% IO wait time while evolution is starting, a process which takes over 5 minutes; this is with Jens' patch). These numbers are common with Jens' patch (which is a bit of an improvement, without the patch evolution pegs IO wait times at 70%+ and is very sluggish even after starting).
I just tried clock=acpi_pm and evolution startup performance seems no better. Tonight I'm going to try some quantitative benchmarks on these configurations so that legitimate comparisons can be made. One thing that I have neglected to mention is that Jens' patch does seem to help overall system interactivity---an application with a high IO load doesn't degrade the latency of the entire system nearly as much---although I have no numbers to support this claim.
On my computer, jiffies was the default clocksource on the 2.6.20 kernel. Since 2.6.22 it is hpet. On my old notebook it is now acpi_pm; I don't know what it was before. With jiffies under 2.6.28, my system seems much better, although there are still some short freezes. It does not solve the problem, but makes it much better. Please try clocksource=jiffies. You can check your current clocksource with:
cat /sys/devices/system/clocksource/clocksource0/*

jiffies
645 / 598.3 / 462.5 / 1496.2 / 213.2 / 1353.1 / 6470.6 / 337.6 / 3406.9 / 2057.5 / 155.3 / 309 / 2332 / 463.1 / 1804.4 / 3258.6 / 261.7 / 8124.3 / 2373.2 / 2471.1
min:116.1 ms|max:8124.3 ms|avg:1843.32

The long values are freezes of firefox. The maximum per clocksource:
hpet     39790.7
acpi_pm  16303.8
jiffies   8124.3
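For scripted test runs it can be handy to record the clocksource along with the results. Below is a minimal Python sketch that reads the same sysfs directory as the cat command above. The function name and the base parameter are illustrative, and it assumes the directory contains the usual current_clocksource and available_clocksource files.

```python
from pathlib import Path

SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_DIR):
    """Return (current clocksource, list of available clocksources)."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    try:
        current, available = read_clocksources()
    except OSError as exc:
        # Not running on Linux, or sysfs is not mounted where expected.
        print(f"could not read clocksource info: {exc}")
    else:
        print(f"current: {current}")
        print(f"available: {', '.join(available)}")
```

The base directory is a parameter only so that the function can be pointed at a copy of the files for testing.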
(In reply to comment #76) > The long values are freezes of firefox. Do you mean startup time? or you click on a tab and it takes that long for it to switch?
Using the jiffies clocksource on Linus's master causes the machine to wedge up on attempting to start Xorg. I'll have to look into it later.
(In reply to comment #77)
> Do you mean startup time? or you click on a tab and it takes that long for it
> to switch?

It is the longest time for switching or opening tabs during heavy I/O and 2*bzip2 urandom.
> (WW) intel(0): No outputs definitely connected, trying again...
> (WW) intel(0): Unable to find initial modes
> (EE) intel(0): No valid modes.

no Xorg coming up with jiffies clocksource. takes the console with it. I have darkness on the screen...:) I can ssh into it, though. some weird interaction between i915 and clocksource there.
echo hpet > current_clocksource and things are back to normal.
(In reply to comment #81)
> echo hpet > current_clocksource
>
> and things are back to normal.

I got a crash when I tried to set the jiffies clocksource while Linux was running. There is now an improvement in the process and thread tests with clocksource jiffies. Here is the result. The performance is nearly the same as in 2.6.20.
Linux bugs-laptop 2.6.28t61p4 #5 SMP Wed Jan 21 14:30:24 CET 2009 x86_64 GNU/Linux
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:945.000ms|duration:24.354s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:466.000ms|duration:24.206s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:220.000ms|duration:47.452s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:870.000ms|duration:34.105s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:479.000ms|duration:36.610s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:212.000ms|duration:77.449s
I booted up with clocksource=jiffies and lost Xorg and console. So, it wasn't set while running.
(In reply to comment #83) > I booted up with clocksource=jiffies and lost Xorg and console. So, it wasn't > set while running. Try to blacklist the thermal and the processor kernel module.
Hi, I currently have the following running: 2 x "bzip2 -9 -c /dev/urandom >/dev/null" (since I have 2 cores) and one "dd if=/dev/zero of=test.10g bs=1M count=10000". Only small lockups happened during that time, which was about 9 minutes. By small lockups I mean a couple of seconds. After the dd command had finished, the lockups were still occurring, but they were generally much shorter. For me it is definitely a fix.
It seems it is more complex. Running only the dd command halts my system in the same way as described earlier in this bug: ~100% iowait etc. Adding a single bzip command results in an iowait of around 40% and improved desktop response, and finally adding the second bzip command results in 5% iowait and even better desktop response.
(In reply to comment #84)
> (In reply to comment #83)
> > I booted up with clocksource=jiffies and lost Xorg and console. So, it wasn't
> > set while running.
>
> Try to blacklist the thermal and the processor kernel module.
>

Wouldn't that throw everything cpufreq into a tizzy? It's a laptop, so losing cpufreq and other potential ACPI functions is a big loss. Let me know if I am wrong about this.
blacklisting processor and thermal didn't work either. I give up on jiffies...:-)
Well, looks like there's a good reason why machines hang with clock=jiffies. http://lkml.org/lkml/2009/1/21/402 Any ideas why those users whose machines didn't crash saw improvement? Does this suggest a scheduler issue?
> Well, looks like there's a good reason why machines hang with clock=jiffies.
> http://lkml.org/lkml/2009/1/21/402
>

Does this mean I need to recompile the kernel without high resolution timers and then pass clocksource=jiffies? Do we have an explanation for why the freezing period was reduced to half with acpi_pm and to a quarter with jiffies for Thomas? I would have thought faster timers would result in better behavior and that they were a step in the future direction. But we seem to be going backwards.
(In reply to comment #90)
> This means I need to recompile kernel without high resolution timer and then
> pass clocksource=jiffies?

No, it shouldn't be possible to run the kernel using jiffies as a clocksource. The system's time source needs to have a sufficiently high resolution. Using a low resolution time source (like jiffies) can cause the kernel to hang.

> Do we have an explanation for why the freezing period reduced to half with
> acpi_pm and to a quarter with jiffies for Thomas? I would have thought faster
> timers will result in better behavior and it was a step in the future
> direction. But we seem to be going backwards.

It's far more complicated than that. If we have a timer wrapping around, it is entirely possible that a slower clock source would give you expected behavior whereas a higher resolution time source would fail. It completely depends upon the source of the freezes. Jens, what do you think in light of this growing body of evidence pointing towards timer issues?
(In reply to comment #91) Hmm, I think I was a little tired last night. To clarify, I guess you probably could recompile without CONFIG_HIGH_RES_TIMERS, however I'm not sure you'd want to. If I'm not mistaken, the no-tick kernel option is dependent on high-res timers, so you'd have to give that up. Also, correction: s/towards timer issues/towards timer-triggered-issues/
Has anyone run latency top yet?
These are the total values from latencytop. http://bugzilla.kernel.org/show_bug.cgi?id=12309#c73 http://bugzilla.kernel.org/show_bug.cgi?id=12309#c76 Currently my system crashes while I am executing the copy and 2*bzip operation with jiffies. I will take some new measurements as soon as my test system runs again.
Created attachment 19954 [details] latencytop captures with clocksource jiffies and hpet
I was not able to execute the 2*bzip2 test with jiffies any more. The system freezes forever while copying a file and zipping urandom. It happens in runlevels 1, 3 and 5, during CPU-intensive tasks. I have made a test with less CPU consumption. The test uses a script so that the execution is the same with different clocksources. It copies a file, extracts the kernel source, builds the kernel and finally deletes the kernel path. Concurrently, the script starts gimp, oowriter, firefox and htop, and opens some web pages and a document. Here are the "Total:" times from the captures.
jiffies min:0.1 ms|max:5442.1 ms|avg:213.2
hpet min:0.0 ms|max:14777.7 ms|avg:403.71
The full captures without the escape sequences are in the attachment. The escape sequences are not removed completely, but it is enough to see what is necessary. I can provide the captures with the escape sequences too, if someone wants them.
Filtering out the upper and lower 10% of the times results in an average latency of 1737.18ms for jiffies and 3164.72ms for hpet.
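One plausible reading of this filter is a trimmed mean: sort the samples, drop the lowest and highest 10%, and average the rest. The Python sketch below shows that reading only. The exact script used for the figures above is not shown in this thread, and the sample values are made up for illustration.

```python
def trimmed_mean(values, trim=0.10):
    """Average the values after dropping the lowest and highest `trim` fraction."""
    if not values:
        raise ValueError("no samples")
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim)          # samples dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Hypothetical latencies in ms with one long freeze at the end.
    samples = [0.1, 5, 10, 20, 30, 40, 50, 60, 70, 5442.1]
    print("plain mean:  ", sum(samples) / len(samples))
    print("trimmed mean:", trimmed_mean(samples))
```

The point of trimming is that a handful of very long freezes no longer decides the average, so runs with different clocksources can be compared on their typical latency as well as their worst case.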
Basically all of them show waiting for an async page write to finish, and that can take quite a bit of time with heavy writing going on. First thing next week I'll try and provide a 'this async write now went sync' helper for the io schedulers, so that they can make sure it gets expedited as quickly as sync io is. This should drastically reduce latencies for this situation. It'll probably be less than straightforward, but a test patch should be quite doable.
That sounds good. I have to correct the last values, as I was using the filter meant for capture logs with escape sequences.
jiffies: min:0.1 ms|max:5442.1 ms|avg:834.12|avg80:2248.28
hpet: min:0 ms|max:14777.7 ms|avg:1474.09|avg80:3638.15
Why is there such a big difference in the average latency between jiffies and hpet? The total latency of 80% of the recording is 2,2s with jiffies and 3,6s with hpet.
Created attachment 19996 [details] Test patch for async page promotion First attempt at doing sync promotion of async page waiting. It actually booted, however I haven't done any sort of testing with it yet. Note that this will only work with CFQ currently.
Created attachment 19997 [details] latencytop captures with clocksource hpet and patched kernel Same test, with patched kernel and hpet as clocksource. hpet: min:10.1 ms|max:11733 ms|avg:3096.22|avg80:4082.79
One observation is that ext4 seems quite latency prone in waiting for write access to the journal. IIRC, that matches earlier results where ext3 was much quicker in that area. No idea what causes this, as I'm not familiar with the ext4 internals. Another observation is that I neglected to include the buffer waiting in the async promotion, it only worked for page locking. I'll add an updated patch below after this posting. And finally, lots of time is spent waiting for a new write request in the block layer. So you are maxing all 128 requests out in this test case. You can try and increase that to 512 for testing purposes, you can do that ala: # echo 512 > /sys/block/sda/queue/nr_requests That will get your async wait numbers down, but it may not reduce your latencies. Fact is that 128 writes is already a lot, and with more requests in the queue, you will have higher completion times for each individual request.
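If you want to switch the queue depth back and forth between test runs, the same sysfs file can be read and written from a script. This is a rough Python sketch: the helper names and the sysfs_root parameter are illustrative, and writing the real file needs root.

```python
from pathlib import Path


def nr_requests_path(device, sysfs_root="/sys/block"):
    """Path of the request queue depth setting for a block device such as 'sda'."""
    return Path(sysfs_root) / device / "queue" / "nr_requests"


def get_nr_requests(device, sysfs_root="/sys/block"):
    """Read the current queue depth."""
    return int(nr_requests_path(device, sysfs_root).read_text())


def set_nr_requests(device, value, sysfs_root="/sys/block"):
    """Write a new queue depth (equivalent to the echo command above)."""
    if value < 1:
        raise ValueError("nr_requests must be positive")
    nr_requests_path(device, sysfs_root).write_text(f"{value}\n")


if __name__ == "__main__":
    try:
        print("sda nr_requests:", get_nr_requests("sda"))
    except OSError as exc:
        print(f"could not read nr_requests: {exc}")
```

Recording the value before changing it makes it easy to restore the default of 128 afterwards, so later runs are not skewed by a leftover setting.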
Created attachment 19998 [details] Test patch for async page promotion v2
Attachment http://bugzilla.kernel.org/attachment.cgi?id=19998&action=view causes the following OOPS as soon as stress-testing starts. Is it possible that bdi->unplug_io_data can be NULL in blk_backing_dev_wop ? Should we simply discard those ?

[ 138.345195] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 138.346301] IP: [<ffffffff803f997d>] elv_wait_on_page+0xd/0x20
[ 138.346301] PGD 434c05067 PUD 434c06067 PMD 0
[ 138.346301] Oops: 0000 [#1] PREEMPT SMP
[ 138.346301] LTT NESTING LEVEL : 0
[ 138.346301] last sysfs file: /sys/block/md1/md/raid_disks
[ 138.346301] Dumping ftrace buffer:
[ 138.346301] (ftrace buffer empty)
[ 138.346301] CPU 3
[ 138.346301] Modules linked in: e1000e loop ltt_tracer ltt_trace_control ltt_type_serializer ltte
[ 138.346301] Pid: 1272, comm: kjournald Not tainted 2.6.28.1 #69
[ 138.346301] RIP: 0010:[<ffffffff803f997d>] [<ffffffff803f997d>] elv_wait_on_page+0xd/0x20
[ 138.346301] RSP: 0018:ffff88043cc19cd0 EFLAGS: 00010286
[ 138.346301] RAX: 0000000000000000 RBX: ffff88043f460938 RCX: 0000000000000000
[ 138.346301] RDX: ffff880438490000 RSI: ffffe200193f0bc0 RDI: ffff88043e580a40
[ 138.346301] RBP: ffff88043cc19cd0 R08: ffff88043d09de78 R09: 0000000000000001
[ 138.346301] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88043cc19d50
[ 138.346301] R13: ffff88043cc19d60 R14: 0000000000000002 R15: ffff8800280590c8
[ 138.346301] FS: 0000000000000000(0000) GS:ffff88043f804d00(0000) knlGS:0000000000000000
[ 138.346301] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 138.346301] CR2: 0000000000000000 CR3: 0000000434817000 CR4: 00000000000006e0
[ 138.346301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 138.346301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 138.346301] Process kjournald (pid: 1272, threadinfo ffff88043cc18000, task ffff88043d09d8c0)
[ 138.346301] Stack:
[ 138.346301] ffff88043cc19ce0 ffffffff803fd2a2 ffff88043cc19d00 ffffffff802f6762
[ 138.346301] ffff88043cc19d60 0000000000000000 ffff88043cc19d40 ffffffff8067ace2
[ 138.346301] ffffffff802f6710 ffff880438490000 0000000000000002 0000000000000002
[ 138.346301] Call Trace:
[ 138.346301] [<ffffffff803fd2a2>] blk_backing_dev_wop+0x12/0x20
[ 138.346301] [<ffffffff802f6762>] sync_buffer+0x52/0x80
[ 138.346301] [<ffffffff8067ace2>] __wait_on_bit+0x62/0x90
[ 138.346301] [<ffffffff802f6710>] ? sync_buffer+0x0/0x80
[ 138.346301] [<ffffffff802f6710>] ? sync_buffer+0x0/0x80
[ 138.346301] [<ffffffff8067ad89>] out_of_line_wait_on_bit+0x79/0x90
[ 138.346301] [<ffffffff802566f0>] ? wake_bit_function+0x0/0x50
[ 138.346301] [<ffffffff802f6649>] __wait_on_buffer+0xf9/0x130
[ 138.346301] [<ffffffff8036c0c5>] journal_commit_transaction+0x7d5/0x1540
[ 138.346301] [<ffffffff80265991>] ? trace_hardirqs_on_caller+0x1b1/0x210
[ 138.346301] [<ffffffff8067d457>] ? _spin_unlock_irqrestore+0x47/0x80
[ 138.346301] [<ffffffff80249cef>] ? try_to_del_timer_sync+0x5f/0x70
[ 138.346301] [<ffffffff803708c8>] kjournald+0xe8/0x250
[ 138.346301] [<ffffffff802566b0>] ? autoremove_wake_function+0x0/0x40
[ 138.346301] [<ffffffff803707e0>] ? kjournald+0x0/0x250
[ 138.346301] [<ffffffff802561de>] kthread+0x4e/0x90
[ 138.346301] [<ffffffff80256190>] ? kthread+0x0/0x90
[ 138.346301] [<ffffffff8020d8d9>] child_rip+0xa/0x11
[ 138.346301] [<ffffffff8020cd58>] ? restore_args+0x0/0x30
[ 138.346301] [<ffffffff80256190>] ? kthread+0x0/0x90
[ 138.346301] [<ffffffff8020d8cf>] ? child_rip+0x0/0x11
Yes that's expected, I didn't fixup the non-request_fn based drivers. It's trickier to do for dm/md, since you need to know where that page went. Or you can just cycle all the bottom backing_dev_info's like it's done for unplug. I'll be back at the machine in an hour or two, I'll update the patch for dm/md.
Created attachment 20001 [details] Test patch for async page promotion v2
Adds support for raid0/1/10/5 and should not oops on dm (just not work as intended, it'll do nothing). There's still the debug printk in there that notifies you of when something has happened, ala:
$ dmesg | tail
cfq: moving e4a348d4 to dispatch
cfq: moving e49dede4 to dispatch
cfq: moving f687d8d4 to dispatch
Another question - are people using CONFIG_NO_HZ or not?
(In reply to comment #106)
> Another question - are people using CONFIG_NO_HZ or not?

Yes, I am.
(In reply to comment #106)
> Another question - are people using CONFIG_NO_HZ or not?
>

As am I
(In reply to comment #106) Me currently too.
So my next question would be if disabling that option makes any difference?
We are not using CONFIG_NO_HZ and get high latency (subjective) while running:
dd if=/dev/zero of=file bs=1M count=2048
Additionally, all 8 cores go to at least 50% iowait, and several peg at ~95%. We see similar results with 2.6.18, 2.6.24, deadline and cfq.
Created attachment 20024 [details] 2.6.25.20 fio test with NOHZ disabled
Created attachment 20025 [details] 2.6.25.20 fio test with NOHZ enabled
What is the preferred way of testing different kernels against this bug? I've run Mathieu's fio test, but I'm not sure it gives a detailed clue about the problem. I've attached the results.
Created attachment 20026 [details] latencytop captures with clocksource hpet with nohz and no high resolution timer

hpet - no hz - no high resolution timer
min:0 ms|max:10888.7 ms|avg:1311.17

hpet - no hz
min:2 ms|max:16980.9 ms|avg:1513.26

Same settings as in http://bugzilla.kernel.org/attachment.cgi?id=19954&action=view

hpet
min:0 ms|max:14777.7 ms|avg:1474.09

jiffies
min:0.1 ms|max:5442.1 ms|avg:834.12
Created attachment 20027 [details] latencytop captures + fio results amd64

I have run the fio job on a different machine on two different discs. While running the fio job, I have captured the latency with latencytop. Each test was executed twice. Once with 2*bzip-urandom and the other without cpu consumption. You can find the test results for every io scheduler in the archive.

100MB/s disk + 2bzip (2009-01-27.0847-2.6.28.2-acpi_pm)
100MB/s disk (2009-01-27.0908-2.6.28.2-acpi_pm)
40MB/s disk + 2bzip (2009-01-27.0934-2.6.28.2-acpi_pm)
40MB/s disk (2009-01-27.1029-2.6.28.2-acpi_pm)

Total latency - cfq
min: 133.3 ms | max 18555.8 ms | avg 5978.08
min: 25.5 ms | max 5057.2 ms | avg 1660.21
min: 369.0 ms | max 11872.0 ms | avg 3764.57
min: 557.0 ms | max 12215.6 ms | avg 3002.81

fio results - cfq
mint 25msec | maxt 1669msec
mint 23msec | maxt 1596msec
mint 77msec | maxt 2370msec
mint 106msec | maxt 738msec
// Adding myself to CC
(In reply to comment #101)
> One observation is that ext4 seems quite latency prone in waiting for write
> access to the journal. IIRC, that matches earlier results where ext3 was much
> quicker in that area. No idea what causes this, as I'm not familiar with the
> ext4 internals.

It is possible that the reduced latency on ext4 is a result of the increased write speed, which is nearly doubled. In the results posted before (comment #116), you can see a reduction on ext3 partitions with different hard drives.
I have really noticed this lately. I replaced an old server running an older kernel. The replacement hardware is orders of magnitude more powerful. The I/O system in the old machine was a 4 disk hardware RAID 5 on 64 bit PCI with the very first SATA 10,000RPM WD Raptors (WD740-00FLA). The new machine has an 8 disk hardware RAID 5 using the new 300gig 10,000RPM Velociraptor SATA drives on PCI-Express. The old machine had a Pentium 4 HT CPU. The new machine has a 4 core Core 2 CPU. All high end gear.

The new machine does get far better disk throughput; however, on these workloads the latencies seem far higher, the interactivity of the machine is poor and all CPU cores show high I/O waits. This machine serves an application that runs from Samba shares to 15 or so Windows workstations. This involves lots of file activity on large flat-file database files. Some of the files are up to 4GB in size.

The old server was very busy, but not a huge amount of I/O wait was seen. On the new server, using a 2.6.18 kernel on an enterprise distro, the I/O waits are far higher, especially at backup times. Users of the system have noticed the extra latencies when the system is busy, and at these times the I/O waits are high. The server feels slower than the old machine and this should not be so.

Just thought I would let you know this info, as it seems hard to quantify this in real-world terms.
Just wanted to add a couple of links to places where some additional real world experience is related, for whatever they might be worth. http://forums.storagereview.net/index.php?s=121e3f0d26cbd551c84271019f82f6d3&showtopic=25923&st=0 http://community.novacaster.com/showarticle.pl?id=7395&n=8001
(In reply to comment #105)
I have tried the patch with 2.6.28.2 and 2.6.29-rc3 and always get a crash when I/O starts, sometimes only after the X server has started.

kernel 2.6.29-rc3 at cfq_remove_request 0xe3/0x251

0xffffffff811ca8fc is in cfq_remove_request (block/cfq-iosched.c:650).
645 {
646         struct cfq_queue *cfqq = RQ_CFQQ(rq);
647         struct cfq_data *cfqd = cfqq->cfqd;
648         const int sync = rq_is_sync(rq);
649
650         BUG_ON(!cfqq->queued[sync]);
651         cfqq->queued[sync]--;
652
653         elv_rb_del(&cfqq->sort_list, rq);
654

kernel 2.6.28.2 elv_rb_del+0x21/0x4b

394 }
395 EXPORT_SYMBOL(elv_rb_add);
396
397 void elv_rb_del(struct rb_root *root, struct request *rq)
398 {
399         BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
400         rb_erase(&rq->rb_node, root);
401         RB_CLEAR_NODE(&rq->rb_node);
402 }
403 EXPORT_SYMBOL(elv_rb_del);
Could this be the same bug as http://lkml.org/lkml/2008/6/15/163 ? I ask because on the same system where I see the symptoms this bug describes, the following also happens: http://beheer.eduwijs.nl/kernellog-brikama.log

I should add that changing the IO scheduler from CFQ to AS seems to help a bit. It does not solve the problem, but the system is much more responsive.

System information:
IO Scheduler: AS (default is CFQ, using elevator=as)
Timer: hpet
CONFIG_NO_HZ=y
Kernel: Linux brikama 2.6.27-9-generic #1 SMP x86_64 GNU/Linux
Distro: Ubuntu 8.10 Intrepid amd64
CPU: Intel(R) Core(TM)2 CPU E8400 @ 3.00GHz (2 cores)
Memory: 4GB
Using LVM: yes
Using LVM encryption: no
LVM version:
  LVM version: 2.02.39 (2008-06-27)
  Library version: 1.02.27 (2008-06-25)
  Driver version: 4.14.0
Using DM: yes
HDD:
/dev/sda:
 Model=WDC WD5000AACS-00G8B1 , FwRev=05.04C05, SerialNo= WD-WCAUF0869014
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=16384kB, MaxMultSect=16, MultSect=?0?
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=976773168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified: ATA/ATAPI-1,2,3,4,5,6,7
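Since several reports here compare I/O schedulers, it helps to state which one is actually active for the disk under test. A small sketch, assuming the sysfs file lists the available schedulers with the active one in square brackets (for example "noop anticipatory deadline [cfq]"); the function name active_scheduler is made up for this example.

```shell
# active_scheduler FILE
# FILE has the format of /sys/block/<dev>/queue/scheduler.
# Prints the name of the scheduler shown in square brackets.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}
```

Possible usage: active_scheduler /sys/block/sda/queue/scheduler (sda being the disk from the report above). The same file can be written as root to switch schedulers at runtime without rebooting with elevator=..., using one of the names it lists.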
Has anyone here managed to reproduce this problem on an AMD platform? I ask because I can't seem to reproduce it there. But both my 965GM and 945GM chipset motherboards have the problem, with the T7600 and T9500 CPUs. My old Celeron has the same problem, but it doesn't feel like it freezes as much.
(In reply to comment #123) > Anyone here managed to reproduce this problem on an AMD platform ? Because I > can't seem to be able to reproduce it. But both of 965GM and 945GM chipset > motherboard have the problem with the T7600 and T9500 cpu. My old celeron has > the same problem but it doesn't feel like freezing so much. > AMD on nForce4 here running x86_64. Look over at gentoo forums, there is a long thread. And almost all of the people experiencing the problem there are on amd.
The problem exists on an AMD platform too, but it is not as bad as on an Intel platform. By changing the clocksource to acpi_pm, you can reduce the problem a bit on an Intel platform, but the system feels a little bit slower. Using ext4 reduces the problem enormously. Even firefox is usable while eclipse is indexing the kernel build tree. The problem still exists under heavy I/O.
Sounds like the infamous ext3 fsync() issue is also a factor. Can you try mounting ext3 with -o data=writeback and see if that makes ext3 behave better?
On my machine (nForce 5, AMD Phenom II 940) I also experience huge slowdowns when performing I/O. For example, using Ben's test:

dd if=/dev/zero of=/tmp/test bs=1M count=1M

it takes me about 40 secs to spawn a shell (15 secs for konsole to open a new tab, and about 25 secs for the shell to actually spawn). This was conducted on an HD with ext4. Turning off swap helps a lot with launching a shell.

On a more substantial note, I use Unison to sync files between various places, and when it is running my system is hardly responsive. This happens to me on ext4, ext3, and ReiserFS.

After changing the scheduler to noop, the dd-and-open-shell test is very responsive (with swap either on or off), but any substantial usage, such as using firefox, is still slow, just as it is above with cfq.

If I can free up some space on one of my partitions, I'm going to install some distro with a pre-2.6.18 kernel and "feel" what the performance is like.
I think the appearance of this bug depends on CPU speed and drive speed. I have made some more tests. Currently I am using the following command:

for i in 1 2 ; do \
  dd if=/dev/zero of=test-$i bs=1M count=4K oflag=direct & echo test-$i; \
done

Once with oflag=direct and once without. With ext3 the problem occurs immediately in both cases. With ext4 the problem occurs immediately without oflag=direct. With oflag=direct I can even use firefox, but sometimes the desktop is unusable. In direct mode new applications do not start and disk-intensive operations take a long time, but I can move windows and change desktops without problems, with io-wait at 60%. With dd in non-direct mode, I can start new applications (it still takes a lot of time), but everything freezes from time to time and io-wait is immediately at 100%.

I have captured some statistics by adding a printk with the duration in the function __make_request (blk-core.c). The time is taken directly before and after the spin_lock_irq(q->queue_lock); and finally before the unlock. There is a dramatic difference in requests per second between direct and non-direct mode.
W: wait time before entering lock state
D: duration time of the make_request
T: total time = W + D

ext3 - direct
requests: 209.694080/s
total: W: 0.000645 / D: 0.014584 / T: 0.015229
W: avg: 0.000000307 / min: 0.000000000 / max: 0.000007606
D: avg: 0.000006948 / min: 0.000000255 / max: 0.000085018
T: avg: 0.000007255 / min: 0.000000365 / max: 0.000085018
4294967296 bytes (4.3 GB) copied, 203.66 s, 21.1 MB/s
4294967296 bytes (4.3 GB) copied, 203.582 s, 21.1 MB/s

ext3
requests: 4662.272968/s
total: W: 0.013624 / D: 15.256149 / T: 15.269773
W: avg: 0.000000291 / min: 0.000000000 / max: 0.000275893
D: avg: 0.000325819 / min: 0.000000000 / max: 1.092940760
T: avg: 0.000326110 / min: 0.000000000 / max: 1.092940920
4294967296 bytes (4.3 GB) copied, 203.559 s, 21.1 MB/s
4294967296 bytes (4.3 GB) copied, 214.995 s, 20.0 MB/s

ext4 - direct
requests: 114.510132/s
total: W: 0.000356 / D: 0.017658 / T: 0.018014
W: avg: 0.000000311 / min: 0.000000110 / max: 0.000000630
D: avg: 0.000015408 / min: 0.000000220 / max: 0.000127249
T: avg: 0.000015719 / min: 0.000000330 / max: 0.000127689
4294967296 bytes (4.3 GB) copied, 154.491 s, 27.8 MB/s
4294967296 bytes (4.3 GB) copied, 157.853 s, 27.2 MB/s

ext4
requests: 7009.744726/s
total: W: 0.018928 / D: 6.110891 / T: 6.129819
W: avg: 0.000000270 / min: 0.000000000 / max: 0.000032916
D: avg: 0.000087046 / min: 0.000000000 / max: 0.603327176
T: avg: 0.000087316 / min: 0.000000000 / max: 0.603327516
4294967296 bytes (4.3 GB) copied, 146.303 s, 29.4 MB/s
4294967296 bytes (4.3 GB) copied, 149.361 s, 28.8 MB/s
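For anyone who wants to produce comparable numbers from their own instrumentation, the W/D/T summary can be sketched as below. This is only an illustration and not the code used for the figures above: it assumes each request has been logged as one line with two values in seconds (wait time, then duration), and the function name wdt_stats is made up.

```shell
# wdt_stats
# Reads "WAIT DURATION" pairs (seconds, one request per line) on stdin and
# prints avg/min/max for W, D and T, where T = W + D for each request.
wdt_stats() {
    LC_ALL=C awk '
        function upd(k, v) {
            sum[k] += v
            if (n == 1 || v < min[k]) min[k] = v
            if (n == 1 || v > max[k]) max[k] = v
        }
        NF >= 2 {
            n++
            upd("W", $1); upd("D", $2); upd("T", $1 + $2)
        }
        END {
            if (n == 0) exit 1
            split("W D T", keys, " ")
            for (i = 1; i <= 3; i++) {
                k = keys[i]
                printf "%s: avg: %.9f / min: %.9f / max: %.9f\n", k, sum[k] / n, min[k], max[k]
            }
        }'
}
```

Because T is computed per request, its maximum is not necessarily the sum of the W and D maxima, which is also what the figures above show.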
And some test results with clocksource=jiffies instead of hpet (non-direct only), which runs much better on my machine. The total times are summed over an interval of 10s. The 15s with ext3 above should come from adding up the two cores.

ext3
total: W: 0.018617 / D: 3.714917 / T: 3.733534
requests: 4050.191168/s
W: avg: 0.000000459 / min: 0.000000000 / max: 0.000048408
D: avg: 0.000091496 / min: 0.000000000 / max: 0.615268038
T: avg: 0.000091954 / min: 0.000000000 / max: 0.615268379
4294967296 bytes (4.3 GB) copied, 213.215 s, 20.1 MB/s
4294967296 bytes (4.3 GB) copied, 222.198 s, 19.3 MB/s

ext4
total: W: 0.026263 / D: 3.681891 / T: 3.708154
requests: 6006.413044/s
W: avg: 0.000000431 / min: 0.000000000 / max: 0.001003075
D: avg: 0.000060427 / min: 0.000000000 / max: 0.344179020
T: avg: 0.000060858 / min: 0.000000000 / max: 0.344179370
4294967296 bytes (4.3 GB) copied, 147.343 s, 29.1 MB/s
4294967296 bytes (4.3 GB) copied, 146.386 s, 29.3 MB/s
Can you try with this simple patch applied?

diff --git a/block/blk.h b/block/blk.h
index 6e1ed40..a145c3a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -5,7 +5,7 @@
 #define BLK_BATCH_TIME (HZ/50UL)

 /* Number of requests a "batching" process may submit */
-#define BLK_BATCH_REQ 32
+#define BLK_BATCH_REQ 1

 extern struct kmem_cache *blk_requestq_cachep;
 extern struct kobj_type blk_queue_ktype;
I would say there are no changes. Perhaps it is a little bit worse. There are still freezes with non-direct write access, e.g. while painting circles in gimp. No freezes with direct I/O, but high latency with concurrent disk access (as before).

ext3 - direct
requests: 205.795295/s
total: W: 0.000616 / D: 0.011195 / T: 0.011811
W: avg: 0.000000299 / min: 0.000000000 / max: 0.000007085
D: avg: 0.000005434 / min: 0.000000000 / max: 0.000100447
T: avg: 0.000005733 / min: 0.000000000 / max: 0.000100958
4294967296 bytes (4.3 GB) copied, 210.281 s, 20.4 MB/s
4294967296 bytes (4.3 GB) copied, 210.525 s, 20.4 MB/s

ext3
requests: 4960.868922
total: W: 0.032503 / D: 21.032077 / T: 21.064580
W: avg: 0.000000655 / min: 0.000000000 / max: 0.000069624
D: avg: 0.000423863 / min: 0.000000000 / max: 0.415194973
T: avg: 0.000424518 / min: 0.000000000 / max: 0.415195303
requests: 3588.105593/s
total: W: 0.014912 / D: 10.578434 / T: 10.593346
W: avg: 0.000000415 / min: 0.000000000 / max: 0.000077581
D: avg: 0.000294754 / min: 0.000000000 / max: 0.447073476
T: avg: 0.000295170 / min: 0.000000000 / max: 0.447073806
4294967296 bytes (4.3 GB) copied, 218.708 s, 19.6 MB/s
4294967296 bytes (4.3 GB) copied, 228.355 s, 18.8 MB/s

ext4 - direct
requests: 115.981745/s
total: W: 0.000344 / D: 0.016716 / T: 0.017061
W: avg: 0.000000297 / min: 0.000000110 / max: 0.000025846
D: avg: 0.000014398 / min: 0.000000650 / max: 0.000075554
T: avg: 0.000014695 / min: 0.000000990 / max: 0.000076195
4294967296 bytes (4.3 GB) copied, 156.476 s, 27.4 MB/s
4294967296 bytes (4.3 GB) copied, 157.78 s, 27.2 MB/s

ext4
requests: 7556.114616/s
total: W: 0.029942 / D: 9.424271 / T: 9.454213
W: avg: 0.000000396 / min: 0.000000000 / max: 0.000127857
D: avg: 0.000124722 / min: 0.000000000 / max: 0.046151790
T: avg: 0.000125119 / min: 0.000000000 / max: 0.046152130
4294967296 bytes (4.3 GB) copied, 147.553 s, 29.1 MB/s
4294967296 bytes (4.3 GB) copied, 151.226 s, 28.4 MB/s
(In reply to comment #130)
> Can you try with this simple patch applied?
>
> diff --git a/block/blk.h b/block/blk.h
> index 6e1ed40..a145c3a 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -5,7 +5,7 @@
> #define BLK_BATCH_TIME (HZ/50UL)
>
> /* Number of requests a "batching" process may submit */
> -#define BLK_BATCH_REQ 32
> +#define BLK_BATCH_REQ 1
>
> extern struct kmem_cache *blk_requestq_cachep;
> extern struct kobj_type blk_queue_ktype;
>

Hi Jens,

I tried it on a 2.6.29-rc3 kernel. It made things worse for the "default" config, but did help with config1 (fio "ssh" test bench).

(config1: quantum=1, slice_async_rq=1, queue_depth=1)

max runt 2.6.29-rc3 default no patch 14247msec
max runt 2.6.29-rc3 default patch 30833msec
max runt 2.6.29-rc3 config1 no patch 7574msec
max runt 2.6.29-rc3 config1 patch 6585msec

Note that the results seem to indicate that the larger run times occur near the "write" job.

The listings below show the runtime of the jobs in the order they were run: 1 large write and many 2M reads executed at a regular interval for most of the load, ending with more randomly delayed jobs. All the read jobs are started at a 4s interval, except the last 2 jobs: the first of these is started after 50s, and the last one after another 10s.
Here is the listing of the 2.6.29-rc3 default no patch write: io=10240MiB, bw=56062KiB/s, iops=53, runt=191526msec read : io=2052KiB, bw=3411KiB/s, iops=141, runt= 616msec read : io=2084KiB, bw=409KiB/s, iops=16, runt= 5215msec read : io=2060KiB, bw=349KiB/s, iops=15, runt= 6031msec read : io=2060KiB, bw=445KiB/s, iops=17, runt= 4731msec read : io=2068KiB, bw=377KiB/s, iops=14, runt= 5606msec read : io=2084KiB, bw=558KiB/s, iops=23, runt= 3824msec read : io=2056KiB, bw=398KiB/s, iops=15, runt= 5279msec read : io=2048KiB, bw=328KiB/s, iops=13, runt= 6393msec read : io=2056KiB, bw=337KiB/s, iops=12, runt= 6236msec read : io=2072KiB, bw=596KiB/s, iops=23, runt= 3558msec read : io=2068KiB, bw=448KiB/s, iops=17, runt= 4723msec read : io=2052KiB, bw=342KiB/s, iops=14, runt= 6143msec read : io=2056KiB, bw=448KiB/s, iops=19, runt= 4695msec read : io=2060KiB, bw=362KiB/s, iops=14, runt= 5814msec read : io=2072KiB, bw=1202KiB/s, iops=44, runt= 1765msec read : io=2048KiB, bw=395KiB/s, iops=17, runt= 5308msec read : io=2056KiB, bw=434KiB/s, iops=17, runt= 4851msec read : io=2064KiB, bw=382KiB/s, iops=14, runt= 5521msec read : io=2072KiB, bw=412KiB/s, iops=16, runt= 5144msec read : io=2052KiB, bw=439KiB/s, iops=17, runt= 4784msec read : io=2076KiB, bw=408KiB/s, iops=15, runt= 5209msec read : io=2084KiB, bw=405KiB/s, iops=15, runt= 5263msec read : io=2052KiB, bw=379KiB/s, iops=14, runt= 5543msec read : io=2076KiB, bw=438KiB/s, iops=18, runt= 4852msec read : io=2052KiB, bw=1016KiB/s, iops=38, runt= 2068msec read : io=2056KiB, bw=227KiB/s, iops=9, runt= 9271msec read : io=2072KiB, bw=1256KiB/s, iops=48, runt= 1689msec read : io=2048KiB, bw=347KiB/s, iops=13, runt= 6036msec read : io=2068KiB, bw=594KiB/s, iops=24, runt= 3562msec read : io=2052KiB, bw=415KiB/s, iops=16, runt= 5057msec read : io=2052KiB, bw=326KiB/s, iops=14, runt= 6430msec read : io=2064KiB, bw=394KiB/s, iops=16, runt= 5362msec read : io=2068KiB, bw=280KiB/s, iops=12, runt= 7553msec read : io=2064KiB, bw=364KiB/s, 
iops=15, runt= 5806msec read : io=2052KiB, bw=1001KiB/s, iops=41, runt= 2098msec read : io=2084KiB, bw=490KiB/s, iops=18, runt= 4352msec read : io=2056KiB, bw=1197KiB/s, iops=51, runt= 1758msec read : io=2048KiB, bw=471KiB/s, iops=19, runt= 4444msec read : io=2052KiB, bw=158KiB/s, iops=6, runt= 13259msec read : io=2052KiB, bw=147KiB/s, iops=6, runt= 14247msec read : io=2060KiB, bw=3906KiB/s, iops=148, runt= 540msec Here is the listing of the 2.6.29-rc3 default patch write: io=10240MiB, bw=54981KiB/s, iops=52, runt=195291msec read : io=2072KiB, bw=3843KiB/s, iops=159, runt= 552msec read : io=2080KiB, bw=4302KiB/s, iops=173, runt= 495msec read : io=2064KiB, bw=276KiB/s, iops=11, runt= 7642msec read : io=2056KiB, bw=462KiB/s, iops=18, runt= 4552msec read : io=2064KiB, bw=311KiB/s, iops=12, runt= 6790msec read : io=2076KiB, bw=832KiB/s, iops=34, runt= 2554msec read : io=2052KiB, bw=298KiB/s, iops=12, runt= 7038msec read : io=2048KiB, bw=493KiB/s, iops=20, runt= 4250msec read : io=2048KiB, bw=310KiB/s, iops=12, runt= 6746msec read : io=2060KiB, bw=595KiB/s, iops=24, runt= 3542msec read : io=2068KiB, bw=280KiB/s, iops=12, runt= 7542msec read : io=2056KiB, bw=506KiB/s, iops=20, runt= 4155msec read : io=2052KiB, bw=352KiB/s, iops=13, runt= 5953msec read : io=2068KiB, bw=1778KiB/s, iops=73, runt= 1191msec read : io=2080KiB, bw=239KiB/s, iops=9, runt= 8885msec read : io=2064KiB, bw=790KiB/s, iops=31, runt= 2675msec read : io=2048KiB, bw=235KiB/s, iops=9, runt= 8900msec read : io=2052KiB, bw=395KiB/s, iops=16, runt= 5312msec read : io=2048KiB, bw=490KiB/s, iops=20, runt= 4279msec read : io=2048KiB, bw=350KiB/s, iops=14, runt= 5991msec read : io=2060KiB, bw=289KiB/s, iops=13, runt= 7296msec read : io=2060KiB, bw=392KiB/s, iops=14, runt= 5368msec read : io=2048KiB, bw=323KiB/s, iops=13, runt= 6487msec read : io=2052KiB, bw=442KiB/s, iops=17, runt= 4753msec read : io=2056KiB, bw=382KiB/s, iops=15, runt= 5506msec read : io=2052KiB, bw=299KiB/s, iops=11, runt= 7005msec read : 
io=2052KiB, bw=372KiB/s, iops=15, runt= 5647msec read : io=2068KiB, bw=512KiB/s, iops=18, runt= 4136msec read : io=2056KiB, bw=326KiB/s, iops=13, runt= 6453msec read : io=2060KiB, bw=765KiB/s, iops=30, runt= 2756msec read : io=2052KiB, bw=392KiB/s, iops=15, runt= 5357msec read : io=2060KiB, bw=420KiB/s, iops=19, runt= 5013msec read : io=2052KiB, bw=307KiB/s, iops=12, runt= 6838msec read : io=2056KiB, bw=724KiB/s, iops=33, runt= 2905msec read : io=2052KiB, bw=407KiB/s, iops=16, runt= 5153msec read : io=2048KiB, bw=417KiB/s, iops=15, runt= 5021msec read : io=2048KiB, bw=345KiB/s, iops=15, runt= 6069msec read : io=2048KiB, bw=451KiB/s, iops=21, runt= 4643msec read : io=2048KiB, bw=68KiB/s, iops=2, runt= 30833msec read : io=2048KiB, bw=121KiB/s, iops=5, runt= 17290msec read : io=2052KiB, bw=3876KiB/s, iops=167, runt= 542msec Here is the listing of the 2.6.29-rc3 config1 no patch write: io=10240MiB, bw=61068KiB/s, iops=58, runt=175827msec read : io=2048KiB, bw=4185KiB/s, iops=167, runt= 501msec read : io=2056KiB, bw=3814KiB/s, iops=161, runt= 552msec read : io=2056KiB, bw=448KiB/s, iops=17, runt= 4692msec read : io=2056KiB, bw=1070KiB/s, iops=42, runt= 1966msec read : io=2052KiB, bw=424KiB/s, iops=16, runt= 4946msec read : io=2076KiB, bw=512KiB/s, iops=19, runt= 4149msec read : io=2076KiB, bw=580KiB/s, iops=25, runt= 3664msec read : io=2052KiB, bw=470KiB/s, iops=18, runt= 4467msec read : io=2068KiB, bw=624KiB/s, iops=26, runt= 3390msec read : io=2060KiB, bw=929KiB/s, iops=39, runt= 2270msec read : io=2064KiB, bw=508KiB/s, iops=19, runt= 4160msec read : io=2076KiB, bw=659KiB/s, iops=26, runt= 3224msec read : io=2080KiB, bw=366KiB/s, iops=14, runt= 5819msec read : io=2064KiB, bw=1023KiB/s, iops=42, runt= 2066msec read : io=2060KiB, bw=322KiB/s, iops=13, runt= 6540msec read : io=2060KiB, bw=1383KiB/s, iops=52, runt= 1525msec read : io=2052KiB, bw=691KiB/s, iops=26, runt= 3039msec read : io=2064KiB, bw=444KiB/s, iops=20, runt= 4755msec read : io=2080KiB, bw=551KiB/s, 
iops=20, runt= 3860msec read : io=2084KiB, bw=743KiB/s, iops=29, runt= 2870msec read : io=2056KiB, bw=412KiB/s, iops=16, runt= 5106msec read : io=2056KiB, bw=406KiB/s, iops=15, runt= 5179msec read : io=2048KiB, bw=465KiB/s, iops=19, runt= 4507msec read : io=2060KiB, bw=446KiB/s, iops=15, runt= 4725msec read : io=2068KiB, bw=467KiB/s, iops=20, runt= 4528msec read : io=2052KiB, bw=461KiB/s, iops=18, runt= 4557msec read : io=2076KiB, bw=628KiB/s, iops=25, runt= 3385msec read : io=2052KiB, bw=518KiB/s, iops=23, runt= 4054msec read : io=2068KiB, bw=492KiB/s, iops=20, runt= 4296msec read : io=2048KiB, bw=543KiB/s, iops=21, runt= 3858msec read : io=2048KiB, bw=559KiB/s, iops=20, runt= 3750msec read : io=2064KiB, bw=646KiB/s, iops=26, runt= 3270msec read : io=2056KiB, bw=426KiB/s, iops=17, runt= 4938msec read : io=2052KiB, bw=741KiB/s, iops=29, runt= 2835msec read : io=2048KiB, bw=453KiB/s, iops=19, runt= 4621msec read : io=2072KiB, bw=579KiB/s, iops=24, runt= 3662msec read : io=2068KiB, bw=418KiB/s, iops=16, runt= 5066msec read : io=2056KiB, bw=2101KiB/s, iops=82, runt= 1002msec read : io=2072KiB, bw=280KiB/s, iops=11, runt= 7574msec read : io=2048KiB, bw=4877KiB/s, iops=190, runt= 430msec read : io=2076KiB, bw=4160KiB/s, iops=168, runt= 511msec and, for comparison, here is the listing of the 2.6.29-rc3 config1 patch write: io=10240MiB, bw=59607KiB/s, iops=56, runt=180134msec read : io=2068KiB, bw=4152KiB/s, iops=162, runt= 510msec read : io=2060KiB, bw=4185KiB/s, iops=168, runt= 504msec read : io=2064KiB, bw=508KiB/s, iops=21, runt= 4157msec read : io=2060KiB, bw=476KiB/s, iops=19, runt= 4425msec read : io=2056KiB, bw=444KiB/s, iops=18, runt= 4738msec read : io=2084KiB, bw=525KiB/s, iops=21, runt= 4063msec read : io=2072KiB, bw=481KiB/s, iops=20, runt= 4406msec read : io=2084KiB, bw=565KiB/s, iops=22, runt= 3777msec read : io=2048KiB, bw=498KiB/s, iops=20, runt= 4209msec read : io=2068KiB, bw=544KiB/s, iops=21, runt= 3888msec read : io=2080KiB, bw=389KiB/s, iops=15, 
runt= 5462msec read : io=2068KiB, bw=1384KiB/s, iops=55, runt= 1529msec read : io=2072KiB, bw=444KiB/s, iops=18, runt= 4774msec read : io=2064KiB, bw=320KiB/s, iops=12, runt= 6585msec read : io=2060KiB, bw=630KiB/s, iops=28, runt= 3348msec read : io=2064KiB, bw=428KiB/s, iops=15, runt= 4931msec read : io=2052KiB, bw=422KiB/s, iops=15, runt= 4973msec read : io=2056KiB, bw=480KiB/s, iops=21, runt= 4385msec read : io=2060KiB, bw=1453KiB/s, iops=61, runt= 1451msec read : io=2076KiB, bw=426KiB/s, iops=16, runt= 4983msec read : io=2052KiB, bw=735KiB/s, iops=28, runt= 2855msec read : io=2060KiB, bw=427KiB/s, iops=16, runt= 4939msec read : io=2064KiB, bw=508KiB/s, iops=19, runt= 4158msec read : io=2064KiB, bw=511KiB/s, iops=21, runt= 4134msec read : io=2052KiB, bw=538KiB/s, iops=20, runt= 3900msec read : io=2048KiB, bw=454KiB/s, iops=18, runt= 4612msec read : io=2052KiB, bw=520KiB/s, iops=21, runt= 4034msec read : io=2064KiB, bw=505KiB/s, iops=19, runt= 4183msec read : io=2052KiB, bw=414KiB/s, iops=17, runt= 5074msec read : io=2068KiB, bw=520KiB/s, iops=19, runt= 4065msec read : io=2048KiB, bw=392KiB/s, iops=15, runt= 5349msec read : io=2064KiB, bw=671KiB/s, iops=27, runt= 3148msec read : io=2068KiB, bw=551KiB/s, iops=21, runt= 3843msec read : io=2056KiB, bw=665KiB/s, iops=28, runt= 3162msec read : io=2084KiB, bw=606KiB/s, iops=23, runt= 3518msec read : io=2056KiB, bw=346KiB/s, iops=14, runt= 6076msec read : io=2056KiB, bw=452KiB/s, iops=19, runt= 4656msec read : io=2076KiB, bw=495KiB/s, iops=20, runt= 4291msec read : io=2052KiB, bw=407KiB/s, iops=17, runt= 5152msec read : io=2068KiB, bw=2267KiB/s, iops=92, runt= 934msec read : io=2064KiB, bw=4080KiB/s, iops=144, runt= 518msec I start to think that I should put more than a 4s delay between the jobs, since the duration of the reads is always around those 4s. Things become more interesting with the 50s delay probably because the read queue is empty. Mathieu
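The "max runt" figures quoted at the top of this comment can be re-derived from listings in this format. A minimal sketch, assuming the fio summary lines are fed in one job per line (as fio prints them) and look like "read : io=..., bw=..., iops=..., runt= NNNmsec"; the function name max_read_runt is made up for this example.

```shell
# max_read_runt
# Reads fio job summary lines on stdin and prints the largest runt value
# (in msec) among the read jobs. The write job is ignored.
max_read_runt() {
    awk '
        /^ *read *:/ {
            if (match($0, /runt= *[0-9]+msec/)) {
                s = substr($0, RSTART, RLENGTH)
                gsub(/[^0-9]/, "", s)        # keep only the digits
                if (s + 0 > max) max = s + 0
            }
        }
        END { print max + 0 }'
}
```

Possible usage: max_read_runt < fio-default-nopatch.log, where the file name is just a placeholder for wherever the fio output was saved.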
(edit) Note that the results seems to indicate that the larger run times occur near the "write" job *end*.
Hi. On my laptop (Core2Duo 1.6 GHz) I run my Gentoo kernel, 2.6.28-gentoo. I don't have any problems with latency. If I run "dd if=/dev/zero of=file bs=1M count=2048" or "dd if=/dev/zero of=/tmp/test bs=1M count=1M" (I tried running it as a user and also as root), my system works well and I can start firefox, another shell, open dolphin (I'm on kde4-svn) and everything is fast. I have an XFS filesystem on my home and reiserfs on root. Since I configured my kernel manually, it might be useful for someone to have my .config, so I'll post it.
Created attachment 20105 [details] With this .config I don't have latency bug. My 2.6.28 .config , Everything is ok with this .config. I didn't have any slowdowns running "dd if=/dev/zero of=/tmp/test bs=1M count=1M" on my core2duo laptop(1.6 ghz).
After looking through Alexsandar's kernel I decided to try a new config. Changing my kernel from 250HZ and Voluntary Kernel Preemption to 1000HZ and Preemptible Kernel (Low-Latency Desktop), I can actually open tabs in firefox, new terminals, or SSH into my computer (from itself) without waiting 10-30 seconds. Perhaps there is no bug but this is just expected behavior. I wonder if it was more of the clock change or the preemption change which made the difference, or both. For those of you who have this problem what is your HZ and preemption model?
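To make answers to the HZ and preemption question easy to compare, the relevant options can be pulled straight out of the kernel config. A minimal sketch; the function name show_latency_config is made up, and the selection of options is only what has been discussed in this bug so far.

```shell
# show_latency_config FILE
# FILE is a kernel .config. Prints the timer frequency, tickless,
# high resolution timer and preemption model options that are set.
show_latency_config() {
    grep -E '^CONFIG_(HZ|NO_HZ|HIGH_RES_TIMERS|PREEMPT|PREEMPT_NONE|PREEMPT_VOLUNTARY)=' "$1"
}
```

Possible usage: show_latency_config /boot/config-$(uname -r) on distros that install the config there, or point it at the .config in your kernel build tree.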
Enabling the 1000Hz timer frequency and Low-Latency Desktop as the preemption model does not solve the problem for me. The mouse still freezes, and I cannot move windows or switch between desktops under heavy I/O. The freezes now last less than 3s and occur every 2-10s, and the desktop is still unusable for me.
Maybe it's not only the preemption and the frequency. I think one of these things could be:

General setup:
- Control Group support DISABLED
- Group CPU Scheduler DISABLED
- Enable full-sized data structures for core ENABLED
- Enable futex support ENABLED
- Use full shmem filesystem ENABLED
- Enable AIO support ENABLED
- SLAB Allocator: SLUB

Processor type and features (ENABLED):
- Tickless System (NO_HZ)
- High Resolution Timer Support
- HPET Timer Support
- Multi-core scheduler support
- Preemptible RCU
- 64 bit Memory and IO resources
- Add LRU list to track non-evictable pages

Good luck...
I think it would be great if one of the kernel developers could take a look at this. Linux is starting to lose its advantage in performance tests because of this problem. Is there any kernel developer who can address this issue?
(In reply to comment #136)
> For those of you who have this problem what is your HZ and preemption model?
>

I'm currently using Voluntary Preemption and HZ=1000. However, I think we're probably losing focus here. Just randomly changing configurations seems like grasping at straws to me. There are far too many potentially relevant configuration options to realistically test them all. If we are going to make progress, we are going to have to use more targeted investigation.

(In reply to comment #139)
> Is there any kernel developer who can address this issue?
>

Jens Axboe has sent us a few patches although he doesn't seem to have a lot of time to dedicate to the issue. Honestly, I think we might need to find a distribution with a block layer developer on payroll who could focus on this issue until it is solved. In my discussions on #fedora-kernel, it doesn't look like Redhat has such a person. I haven't received any responses one way or another on #ubuntu-kernel with respect to Canonical. Does anyone know of a company who might have someone with the requisite skill set to debug this issue?

Jens, do you think you'll be able to sustainably work on this bug? (Thanks for your work so far, by the way) I think it would be amazing if we could give 2.6.29 proper I/O performance. I know it's getting late considering we're at -rc3, but this bug has been with us for far too long.
Well, I'm fairly certain at least part of the issue is a scheduler bug. Just now I was running make modules_install for a few kernels and after some time found that specific processes had stopped responding. This pattern continued, with more and more processes blocking. Eventually the entire X session stopped responding. For a while I could maintain an SSH session and found that IO wait time was 40%, with the rest of the CPU time going idle. After some time, however, even the ssh session stopped responding. This is the third time I have seen behavior like this, with the previous instances involving copying 15GB of data between external hard drives.

Also, Jens, what do you think is the most useful benchmark we've seen here? Testers have used several benchmarks, including dd and various fio jobs. Would it help if we standardized on a single benchmark?
The best illustration of this behavior seems to be comments #128, #129 and #131. IMHO they show that most CPU time is burned on a spinlock. If the time spent inside the critical section also increases (which it does), there is IMHO a strong indication that there must be another (spin)lock inside this code path. Currently I'm looking into mm/filemap.c.

My own testing consists of a toy search engine I am developing. It uses the maximum number of mmap()ed files (32K or 64K); the program maintains its own LRU. In the first stage of its indexer, it just reads mmap()ed pages, maybe dirtying them. When it is done, it munmap()s them (causing the buffers to be written back to backing store). The frozen cursor and non-responsive system only occur during the first phase. During the writing phase, things are back to normal again.

IMHO, this could mean two things:
1) There is a funneling lock in the read() pathway
2) The mm runs into the mud
Sorry, I wish I could spend more time on it. I'll be on vacation the next 9 days, so no response until the week beginning on Feb 16th. I'll try and set aside a few days to work on it then. With complete freezing of the mouse, it does look like some sort of spinning issue. To that extent, the most valuable information would be profiling from those 5 seconds surrounding the freeze. Hard to do, but would be very valuable. People seem to be certain that this is a block layer issue, I'm far from convinced that is the case.
I have limited the usage of generic_file_aio_write in filemap.c for every process, in two ways. First I limited the throughput of every process. When the overall throughput was below disk capacity, there were no more freezes of the mouse. When the overall throughput was above disk capacity, the problem appeared immediately. Then I limited the usage to at most 20% of the interval time for every process, suspending the thread when it needed more. The problem was present as before, as every 20th request to __generic_file_aio_write_nolock needed more than 2s to finish. I tried the same for the cfq scheduler in cfq_choose_req and added penalties for processes with heavy I/O, but the pid is not correctly set for all cfq_queue entries and I got a kernel panic after a while. Before the kernel panic there was no improvement.
Created attachment 20148 [details]
Graph of I/O waits on CPU Core 0

Running
  dd if=/dev/zero of=/storage/hwraid0/test1 bs=1M count=1M
on my AMD Phenom 9950 Quad-Core Processor running a distro kernel (2.6.27.12-170.2.5.fc10.x86_64).

This test was run against an XFS file system on an 8 disk PCI-Express hardware RAID card. I also get the same if I run against ext4 on the same hardware. I also get similar results with this machine on a single 10,000RPM drive connected to the motherboard's SATA with ext4.

When this test was running the system was very unresponsive. In a different test run I launched evolution and it took around 60 seconds to load.

[root@bajor hwraid0]# dd if=/dev/zero of=/storage/hwraid0/test1 bs=1M count=1M
436560+0 records in
436560+0 records out
457766338560 bytes (458 GB) copied, 2535.92 s, 181 MB/s
<stupidmetoopost /> Now at least i know what's going on.. it seems like its somehow coupled with mm because when this happens a) i can see invocations of the oom_killer in the logs after reboot and b) SYSRQ + sync & unmount action do not end the furious HDD LED flashing so i presume the kernel is misusing swapspace.. btw this is a very indeterminate and simply doing the same thing again will not reproduce the problem... so my vote is for uuhm race condition or spinlock recursion, too.
P5K, CPU - Core 2 Duo E8400, ST31000340AS connected to the motherboard's (ICH9) SATA, openSUSE 11.1, kernel - 2.6.28.3

yura@suse:~> dd if=/dev/zero of=test1 bs=1M count=1M
^C
128443+0 records in
128443+0 records out
134682247168 bytes (135 GB), 1872.43 s, 71.9 MB/s

vmstat 1 (fragment)

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b swpd  free buff   cache si so   bi    bo   in   cs us sy id wa
 1  7    0 46780    0 7750564  0  0 4880 12808  838 1491  2  3  0 95
 0  9    0 45480    0 7751004  0  0 1876 36888 1268 2506  2  5  0 93
 1  9    0 45420    0 7752056  0  0 7120 12296  705 1790  1  3  0 96
 0  7    0 43924    0 7751636  0  0 1416 36888  979 2178  3  4  0 93
 0  8    0 44148    0 7751480  0  0  900 28176  672 1444  2  3  0 95
 0  2    0 54144    0 7753008  0  0 2468 24680  649 1191  2  3  0 95
 0 11    0 46420    0 7757720  0  0 1508 72240  994 1696  2  7  2 88
 4 10    0 43348    0 7749244  0  0 5212 51212 1247 2436  6  6  0 87
 1 10    0 46256    0 7749752  0  0 1268 42504  799 1963  2  5  0 93
 0  1    0 45468    0 7757836  0  0    0 81959 1126 2249  1  9  6 84
 0  1    0 43880    0 7758912  0  0    0 71736  830 1818  1  8 31 60
 0 10    0 43280    0 7756472  0  0    0 59473  998 1879  1  5  8 85
 1  9    0 46832    0 7748176  0  0    0 81996 1114 2332  1  8  0 91
 0 10    0 46652    0 7747356  0  0    0 79920  867 1748  1  8  0 91
 0 10    0 45836    0 7747508  0  0    0 76848 1021 1947  1  8  0 91
 0 10    0 46724    0 7751964  0  0    0 52272  821 1775  1  6  0 93
 0 10    0 44388    0 7754660  0  0    0 77896 1054 2230  1  7  0 92
 0  6    0 45672    0 7755792  0  0    0 71736 1343 2886  1  7  0 91
 1  8    0 44624    0 7756444  0  0    0 77863  826 1736  0  7  0 92
 1  6    0 43132    0 7757664  0  0    0 63560 1036 1911  1  7  0 91
 0  3    0 43200    0 7757936  0  0    0 77896  721 1539  1  6  0 92
 1 11    0 46716    0 7760684  0  0  428 63544 1538 2789 12  8  0 79
 0 10    0 44808    0 7756940  0  0 6876 31248 1241 2857  4  4  0 91

The system dies. Even opening the KDE main menu is extremely inconvenient. As for the rest, I would rather say nothing.
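For anyone who wants to compare vmstat fragments like the one above across kernels, here is a minimal sketch that turns the rows into numbers. It assumes the 16-column layout shown in this comment; the function names `parse_vmstat_rows` and `summarize` are made up for illustration, and the sample uses the first three rows quoted above.

```python
# Column order as printed by the `vmstat 1` fragment in this comment.
COLUMNS = ["r", "b", "swpd", "free", "buff", "cache", "si", "so",
           "bi", "bo", "in", "cs", "us", "sy", "id", "wa"]


def parse_vmstat_rows(text):
    """Return one dict per data row; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row is exactly 16 non-negative integers.
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def summarize(rows):
    """Average iowait, worst blocked-process count, average blocks out."""
    n = len(rows)
    return {
        "avg_wa": sum(r["wa"] for r in rows) / n,
        "max_blocked": max(r["b"] for r in rows),
        "avg_bo": sum(r["bo"] for r in rows) / n,
    }


SAMPLE = """\
 r b swpd free buff cache si so bi bo in cs us sy id wa
1 7 0 46780 0 7750564 0 0 4880 12808 838 1491 2 3 0 95
0 9 0 45480 0 7751004 0 0 1876 36888 1268 2506 2 5 0 93
1 9 0 45420 0 7752056 0 0 7120 12296 705 1790 1 3 0 96
"""

if __name__ == "__main__":
    print(summarize(parse_vmstat_rows(SAMPLE)))
```

Feeding it the complete output of a run (for example saved with `vmstat 1 > log.txt`) would give one summary per kernel to put side by side.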
Ah!! I think that could be the problem. I ran the dd test with a large file (20GB) on my machine with 16GB of RAM. Looking at top while it runs shows me that the available memory steadily shrinks, all being incrementally reserved for cache. It actually shrinks down to 80kB. Starting from that point, I experience lags when I type "ls". So.. I think this could be the problem. Is there any reason why the memory used for cache is allowed to grow out of proportion like this?

Mathieu

(In reply to comment #146)
> <stupidmetoopost />
> Now at least i know what's going on.. it seems like its somehow coupled with mm
> because when this happens a) i can see invocations of the oom_killer in the
> logs after reboot and b) SYSRQ + sync & unmount action do not end the furious
> HDD LED flashing so i presume the kernel is misusing swapspace..
> btw this is a very indeterminate and simply doing the same thing again will not
> reproduce the problem... so my vote is for uuhm race condition or spinlock
> recursion, too.
Well, actually it is worse than that. If you have not tuned vm.swappiness to something much lower than the default of 60 (1 or something), the kernel will also start swapping out stuff to free memory. I don't know a way to limit the cache memory's size.
There seems to be some information about how to tune this here. Trying out parameter variations would be interesting : http://www.westnet.com/~gsmith/content/linux-pdflush.htm Mathieu
echo "1" > dirty_background_ratio
echo "1" > dirty_ratio
echo "3" > drop_caches

and vmstat says

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  2 355844 427256   3508  67544   10   21   315   180  459  781  5  3 80 12

then after doing a 10gig dd-operation vmstat says

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0 355872  24532   8656 457200   10   21   338   497  456  763  5  3 79 13

So if I read the numbers correctly, around 400 MB of memory has now been used for caches. Hmm, that doesn't match setting dirty_background_ratio and dirty_ratio to 1. Since I have 1G of memory, only 1% (10 MB) should be allowed to be dirty before forcing applications to wait. But this is apparently not the cause here.
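The arithmetic in this comment can be checked with a short sketch. It assumes exactly 1 GB of RAM as stated and applies the ratio to total memory, which is a simplification (as far as I know the kernel applies it to a smaller "dirtyable" amount). Note also that the cache column counts clean pages that have already been written back, so cache growth alone does not show that the dirty limit was exceeded. The helper name `dirty_limit_kb` is made up for illustration.

```python
def dirty_limit_kb(mem_total_kb, ratio_percent):
    """Approximate dirty-page limit implied by a dirty_ratio-style percentage.

    Simplification: uses total memory instead of the kernel's notion of
    dirtyable memory.
    """
    return mem_total_kb * ratio_percent // 100


MEM_TOTAL_KB = 1024 * 1024          # 1 GB, as stated in the comment
CACHE_BEFORE_KB = 67544             # "cache" column before the dd run
CACHE_AFTER_KB = 457200             # "cache" column after the dd run

limit_kb = dirty_limit_kb(MEM_TOTAL_KB, 1)
growth_kb = CACHE_AFTER_KB - CACHE_BEFORE_KB

if __name__ == "__main__":
    print("dirty limit at 1%%: %d kB (about %.0f MB)" % (limit_kb, limit_kb / 1024))
    print("cache growth: %d kB (about %.0f MB)" % (growth_kb, growth_kb / 1024))
```

To see how much of the cache is actually dirty at a given moment, the Dirty and Writeback lines of /proc/meminfo are the numbers to watch rather than the cache column of vmstat.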
In __block_write_full_page (buffer.c) nearly all submits to the block device are caused by pdflush. At the beginning there are submits of 300MB on a VM with 384MB. After that the dd processes submit the data directly. As soon as there is available memory, it is filled and submitted immediately by pdflush. The 300MB are submitted at once or nearly at once.

On the VM there is the following pattern, caused by the double buffering (VM/Host):
At 67.506825 300MB (pdflush) 100MB (dd processes)
At 72.750497 300MB (pdflush) 100MB (dd processes)
At 74.215577 50MB (pdflush) // Host cache filled
...

My guess is that the dirty page count is not increased correctly by create_empty_buffers in __block_write_full_page. I currently don't know how to check it, as I have just started to read and understand the kernel code.
The following solution works for me. I use cgroups to limit the amount of memory dd can use. That shows that there is a problem with the kernel otherwise allowing the page cache to take _all_ the available kernel memory.

mkdir -p /cgroups
mount -t cgroup none /cgroups -o memory
mkdir /cgroups/0
echo $$ > /cgroups/0/tasks
echo 4M > /cgroups/0/memory.limit_in_bytes
dd if=/dev/zero of=/tmp/bigfile bs=1024k count=20480

The same works with the fio "ssh" test case when run under the cgroups limitations:

write: io=10240MiB, bw=34349KiB/s, iops=32, runt=312595msec
read : io=2068KiB, bw=404KiB/s, iops=16, runt= 5239msec
read : io=2048KiB, bw=598KiB/s, iops=25, runt= 3505msec
read : io=2056KiB, bw=283KiB/s, iops=12, runt= 7437msec
read : io=2056KiB, bw=542KiB/s, iops=21, runt= 3879msec
read : io=2060KiB, bw=388KiB/s, iops=16, runt= 5431msec
read : io=2052KiB, bw=591KiB/s, iops=25, runt= 3554msec
read : io=2076KiB, bw=375KiB/s, iops=15, runt= 5658msec
read : io=2048KiB, bw=522KiB/s, iops=19, runt= 4011msec
read : io=2080KiB, bw=468KiB/s, iops=19, runt= 4548msec
read : io=2068KiB, bw=406KiB/s, iops=16, runt= 5206msec
read : io=2080KiB, bw=412KiB/s, iops=17, runt= 5161msec
read : io=2068KiB, bw=410KiB/s, iops=18, runt= 5159msec
read : io=2064KiB, bw=320KiB/s, iops=13, runt= 6603msec
read : io=2064KiB, bw=356KiB/s, iops=13, runt= 5924msec
read : io=2052KiB, bw=565KiB/s, iops=22, runt= 3716msec
read : io=2060KiB, bw=396KiB/s, iops=18, runt= 5321msec
read : io=2048KiB, bw=507KiB/s, iops=19, runt= 4129msec
read : io=2048KiB, bw=302KiB/s, iops=12, runt= 6924msec
read : io=2060KiB, bw=497KiB/s, iops=20, runt= 4243msec
read : io=2072KiB, bw=3138KiB/s, iops=130, runt= 676msec
read : io=2048KiB, bw=3472KiB/s, iops=130, runt= 604msec
read : io=2060KiB, bw=4080KiB/s, iops=172, runt= 517msec
read : io=2052KiB, bw=4227KiB/s, iops=171, runt= 497msec
read : io=2048KiB, bw=3744KiB/s, iops=166, runt= 560msec
read : io=2076KiB, bw=4201KiB/s, iops=169, runt= 506msec
read : io=2052KiB, bw=3531KiB/s, iops=159, runt= 595msec

See Documentation/cgroups/memory.txt for more details.

Mathieu
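To compare fio runs with and without the memory limit, the per-job summary lines above can be reduced to a few numbers. This is only a sketch: the regular expression matches the line shape quoted in this comment (other fio versions print different summaries), and `parse_fio_line` and `slowest_read_ms` are hypothetical helper names.

```python
import re

# Matches lines such as
#   "read : io=2068KiB, bw=404KiB/s, iops=16, runt= 5239msec"
LINE_RE = re.compile(
    r"^\s*(read|write)\s*:\s*io=(\d+)(KiB|MiB),\s*bw=(\d+)KiB/s,"
    r"\s*iops=(\d+),\s*runt=\s*(\d+)msec"
)


def parse_fio_line(line):
    """Return a dict for one summary line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    op, io, unit, bw, iops, runt = m.groups()
    return {
        "op": op,
        "io_kib": int(io) * (1024 if unit == "MiB" else 1),
        "bw_kib_s": int(bw),
        "iops": int(iops),
        "runt_ms": int(runt),
    }


def slowest_read_ms(lines):
    """Longest runtime among the read jobs, in milliseconds."""
    jobs = [j for j in map(parse_fio_line, lines) if j and j["op"] == "read"]
    return max(j["runt_ms"] for j in jobs)


SAMPLE = [
    "write: io=10240MiB, bw=34349KiB/s, iops=32, runt=312595msec",
    "read : io=2068KiB, bw=404KiB/s, iops=16, runt= 5239msec",
    "read : io=2056KiB, bw=283KiB/s, iops=12, runt= 7437msec",
    "read : io=2052KiB, bw=4227KiB/s, iops=171, runt= 497msec",
]

if __name__ == "__main__":
    print("slowest read job: %d ms" % slowest_read_ms(SAMPLE))
```

The slowest read job is a reasonable stand-in for "how long did the small reader have to wait", which is the latency this bug is about.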
How can we limit this with pre 2.6.29* kernels? I'm using 2.6.28.4 but there's no memory.limit_in_bytes and documentation doesn't help much about this... Should we completely remove cgroups support from kernel until upgrading or waiting for a fix? (In reply to comment #153) [...] > echo 4M > /cgroups/0/memory.limit_in_bytes [...]
Is CONFIG_CGROUPS (and sub-options) enabled in your 2.6.28.x kernel ? I cannot guarantee that memory limits will be available, but I can see the CONFIG_CGROUPS option in my old 2.6.28.x .config. Mathieu
Does not work for me. I succeed in limiting the memory usage so it no longer grows without bound, but I still get 98% iowait and bad loss of responsiveness. I'm running 2.6.28.7.
Well, it is a little more nuanced than that. The 4M limit ended up killing my dd operation. A limit of 16M is better for me and seems to be way better than the default without any limits.
The CGROUPS are available in 2.6.28.3, but there is no memory limit.

(In reply to comment #156) Søren, can you test it with clocksource=jiffies too? I still think that the reduced scheduler performance (#3) makes the problem worse. You can see the differences in comment #128 and #129 on my machine.

The number of dirty pages and writeback pages (/proc/meminfo) is always below 20% of memory on my systems, even under heavy io. But there is a lot of "traffic" caused by pdflush when the dirty page count reaches the limit. All dirty pages are passed to the blk/elevator nearly at once. Sorting the rb-tree, or perhaps the locks, takes more time for every request, as there are a lot of requests. On ext3 it takes up to 1 second, and 0.3 seconds on average, to insert a new request. And there are up to 7000 requests submitted on my notebook (see comment #128 and #129). I think this is one reason for the high io.

The high memory usage is caused by pdflush too, which is called by generic_perform_write (filemap.c) -> balance_dirty_pages_ratelimited. clear_page_dirty_for_io is called directly before the page is submitted to the blk/elevator in write_cache_pages. As a result the page buffers are still in the elevator queue and global_page_state(NR_FILE_DIRTY) has too small a value.
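The "below 20% of memory" observation can be checked on other machines with a small sketch that reads the Dirty and Writeback lines of /proc/meminfo. The parsing assumes the usual "Name: value kB" layout; the sample numbers are invented for illustration and are not from any machine in this report, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def dirty_fraction(values):
    """Share of total memory that is dirty or currently under writeback."""
    return (values["Dirty"] + values["Writeback"]) / values["MemTotal"]


# Invented sample, only to show the shape of the input.
SAMPLE = """\
MemTotal:        4000000 kB
MemFree:          120000 kB
Dirty:            600000 kB
Writeback:        100000 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the invented sample off Linux
    print("dirty+writeback: %.1f%% of memory"
          % (100 * dirty_fraction(parse_meminfo(text))))
```

Sampling this once a second during the dd test would show whether the counters stay under the configured ratio while the elevator queue fills up, which is the discrepancy described above.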
It does not matter whether I use jiffies in these cases where memory is limited.

memory.limit_in_bytes = 4M
Responsiveness : Very good
Disk speed : 40% of disk capability
iowait : Generally around 50%

Responsiveness : Good
Disk speed : 50% of disk capability
iowait : Generally around 50%

Interestingly, I can't get the disk speed above 50% of the disk capability reported by hdparm, not even with oflag=direct.

Earlier I reported that jiffies performed better, but that was without memory limitations.
Created attachment 20172 [details] mm fix page writeback accounting to fix oom condition under heavy I/O Makes sure the page cache accounting behaves correctly with I/O elevator, thus fixing OOM condition. Does not seem to fix the latency problem though. See changelog.
Hi Søren,

It's possible that the memory limits do not help with the problem; as you say, the HD will run under speed for lack of data (due to the memory limits). So it will trigger the problem later or not trigger it at all. But it's good to have a way to limit the problem anyway.

So I have a question. What is the right way to work? I mean, under heavy IO load, what is the right behaviour for a sane kernel? I propose some cases:

A) We have 2 processes: one that performs heavy IO (this time on the HD) and one that only does so occasionally.
1.- Process 1 (high IO) starts to do IO ops. So it will switch between being blocked on IO ops and being active as it reads and sends data to the controller.
2.- Process 2 tries to access the disk, so it has to wait for a chance to read.
In this case the IO wait of process 1 should be almost 0, as it only waits microseconds while the last IO op finishes. But process 2 should have high IO waits because process 1 takes all the IO bandwidth.

B) Same case but with a round-robin style queue. CFQ? IO wait should be nearly 0 for process 2 as it gets a chance to write to disk, but process 1 must wait for each operation to finish...

What is the correct way? Is there any other? What is clear is that it is not normal for one process to block all the other processes because it is waiting to write. Only if every process wants to write should the IO wait rise, as all processes are waiting to get a chance. In this case... should we only have IO wait times? Is this our case?
Created attachment 20176 [details]
Screenshot of current status of the bug while letting a program hang the system

Here you can see that IO wait is at 72.2%, with Xorg going crazy on CPU usage and the rest of the system completely unusable. That was just because transmission was verifying my torrents. So again, it's not acceptable that the system becomes unusable because of a background operation... How can I help more?
Would it be possible to reuse the concept from CPU scheduling? Instead of talking about time slices we could talk about IO slices. Then favor the processes which use the fewest IO slices; this would prevent an evil dd from starving other light readers/writers. I'm not kernel-skilled at all, so maybe this sounds a lot like your RR queue, but these are just some thoughts.
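The IO-slice idea can be illustrated with a toy model. This is not how CFQ or any kernel elevator works; it only assumes that every request costs one slice and compares a plain FIFO queue with a dispatcher that always serves the pending process that has used the fewest slices so far. The function names are made up for this sketch.

```python
def fifo_wait(queue, target):
    """Slices served before the first request of `target` in a FIFO queue."""
    return list(queue).index(target)


def fair_wait(queue, target, used=None):
    """Slices served before `target` when the dispatcher always picks the
    pending process with the fewest slices used (ties: earliest arrival)."""
    pending = {}
    first_seen = {}
    for pos, owner in enumerate(queue):
        pending[owner] = pending.get(owner, 0) + 1
        first_seen.setdefault(owner, pos)
    used = dict(used or {})          # slices consumed before this snapshot
    served = 0
    while pending.get(target, 0) > 0:
        owner = min(
            (o for o, n in pending.items() if n > 0),
            key=lambda o: (used.get(o, 0), first_seen[o]),
        )
        if owner == target:
            return served
        pending[owner] -= 1
        used[owner] = used.get(owner, 0) + 1
        served += 1
    raise ValueError("target has no pending request")


if __name__ == "__main__":
    # A heavy writer queued 1000 requests, then a light reader queued one.
    queue = ["dd"] * 1000 + ["ls"]
    print("FIFO wait:", fifo_wait(queue, "ls"), "slices")
    print("fewest-slices-first wait:", fair_wait(queue, "ls"), "slices")
```

In this model the light process waits for the whole backlog under FIFO and for at most one slice under fewest-slices-first, which is the starvation difference the proposal is after. Real devices add seek costs and request merging that the sketch ignores.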
Maybe someone can explain to me why simple copying eats ~50% CPU? Maybe it is a part of this problem? The same copy on Windows eats 5-10% CPU. UDMA 100 is enabled on my PATA drive. I have jfs partitions.
With the last patch, the problem is permanent on my notebook on an ext4 and an ext3 partition. The io wait time is at 100% with heavy io. Mouse clicks are often not recognised, or keyboard input is delayed for up to 10 seconds (all under xorg).

I got a deadlock with the patch on kernel 2.6.28.2, but only once. The io wait time was at 100%, but there was no disc io any more. I could not start any programs or save any data, but I was able to use the running programs. I am not sure whether this is caused by the patch or whether it is our problem. I also got a complete freeze with clocksource=jiffies on an unpatched kernel with heavy io and heavy cpu usage.

I have checked some timings in the block and elevator functions (__make_request, get_request, get_request_wait, blk_complete_request, cfq_service_tree_add and cfq_add_rq_rb). All the timings were below 5µs. At some points they climb to 80µs. But that looks fine to me. In get_request_wait the writing dd processes wait up to one second for a new free request. It was only the dd processes or sometimes the pdflush process. Should be OK.

Can prepare_to_wait_exclusive(&rl->wait[rw], &wait, TASK_UNINTERRUPTIBLE) in get_request_wait (blk-core.c) cause such a problem?
The patch from #160 to keep the kernel from just taking all available memory almost works for me. Thanks Mathieu. I don't get crazy swapout as I used to, but the cache still occupies 400 MB of memory out of my 1G, which is also wrong.
Hmm... I assume that the cache is both read and write cache. In that case everything is all right. I can confirm the almost 100% iowait.
I have limited the number of requests of a process to 200 per second by adding a msleep_interruptible(5) just before spin_lock_irq(q->queue_lock) in __make_request when the process uses it intensively. The number of requests is counted in a ring buffer covering four seconds and updated every 100ms. The throughput of the two dd processes is really bad at 3MB/s (as expected). Processes with a priority higher than 0, kjournald(2), and requests for which (bio_data_dir(bio) == READ || bio_sync(bio)) is true are passed without delay.

The io wait time is at 100% of one core at the beginning and 100% of both cores after ~5-10s. Only the two dd processes and pdflush are delayed. The problem is permanent. I cannot switch between the windows of two consoles or switch desktops without long delays. It's exactly the freezing known from heavy io, with the difference that the mouse cursor stays movable. I am not able to use gedit to write a text, as every 5-15 seconds the keys are recognised with a delay of at least 5 seconds. This happens even when the dd processes are killed and there is only a maximum write speed of 3MB/s (pdflush and perhaps kjournald) in the background, with 0% io wait time. Gimp starts in 10 seconds without preloading. The cache usage is at less than 20% of memory (~800MB).

I am using kernel 2.6.28.2 with the patch from Mathieu. Thanks a lot. I think it, together with my delay in __make_request, stops the mouse cursor from freezing. Removing only the delay restores the previous state. I think it is the main problem, as I can simulate it!

The high io wait times are caused by the sleeping threads. In __make_request only 100-200 of the 7000 requests during heavy io call get_request_wait. And only 10 of those requests enter the while loop in get_request_wait and really wait more than 20ms, up to 1 second on my machine (prepare_to_wait_exclusive(&rl->wait[rw], &wait, TASK_UNINTERRUPTIBLE); ...; io_schedule(); in get_request_wait).
I have just replaced prepare_to_wait_exclusive(&rl->wait[rw], &wait, TASK_UNINTERRUPTIBLE); and io_schedule(); in the function get_request_wait with msleep_interruptible(500). The throughput of the two dd processes is at 57MB/s (27/30). The desktop freezes for up to 100 seconds.
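The request counting described above (a ring buffer covering four seconds, updated every 100ms, with a limit of 200 requests per second) can be sketched in user space as follows. This is not the kernel change itself, which was not posted; the class name, the bucket layout and the `should_delay` helper are assumptions made for illustration.

```python
class RequestWindow:
    """Sliding-window request counter: fixed-size ring of time buckets."""

    def __init__(self, window_ms=4000, bucket_ms=100):
        self.bucket_ms = bucket_ms
        self.buckets = [0] * (window_ms // bucket_ms)
        self.current = 0  # absolute index of the newest bucket

    def _advance(self, now_ms):
        idx = now_ms // self.bucket_ms
        # Clear every bucket that expired since the last update.
        steps = min(idx - self.current, len(self.buckets))
        for i in range(1, steps + 1):
            self.buckets[(self.current + i) % len(self.buckets)] = 0
        self.current = max(self.current, idx)

    def record(self, now_ms):
        """Count one request submitted at time now_ms."""
        self._advance(now_ms)
        self.buckets[self.current % len(self.buckets)] += 1

    def rate_per_second(self, now_ms):
        """Average request rate over the whole window."""
        self._advance(now_ms)
        window_ms = len(self.buckets) * self.bucket_ms
        return sum(self.buckets) * 1000 / window_ms


def should_delay(window, now_ms, limit_per_second=200):
    """True when the submitter exceeded the limit and should sleep briefly."""
    return window.rate_per_second(now_ms) > limit_per_second


if __name__ == "__main__":
    w = RequestWindow()
    for t in range(0, 4000, 4):      # one request every 4 ms for 4 seconds
        w.record(t)
    print("rate:", w.rate_per_second(3999), "req/s")
    print("delay?", should_delay(w, 3999))
```

A caller would check `should_delay` before submitting and sleep for a few milliseconds when it returns true, which mirrors the msleep_interruptible(5) placement described in the comment.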
Is there any way I can help debugging this?
(In reply to comment #138)
> Maybe it's not only the preemption and the frequency. I think one of these
> things could be:
>
> General setup:
> - Control Group support DISABLED
> - Group CPU Scheduler DISABLED
> - Enable full-sized data structures for core ENABLED
> - Enable futex support ENABLED
> - Use full shmem filesystem ENABLED
> - Enable AIO support ENABLED
> - SLAB Allocator: SLUB
>
> Processor type and features (ENABLED):
> - Tickless System (NO_HZ)
> - High Resolution Timer Support
> - HPET Timer Support
> - Multi-core scheduler support
> - Preemptible RCU
> - 64 bit Memory and IO resources
> - Add LRU list to track non-evictable pages
>
> Good luck...

Many of these seem to be 32-bit settings. The funny thing is that if I boot into x86 32-bit, I don't see any of the slowdowns, or they are so small that I effectively don't feel them. It's only x86-64 which freezes on me during IO.
Must admit all machines I have noticed this on are x86_64.
The systems I have noticed it on are also x86_64.
I have noticed this bug on a Pentium-M (32-Bit only) processor.
I have seen this bug on an Opteron 250 system with a 32-bit OS (CentOS 4.4 thru CentOS 5) installed.
Mine is

gad@ws-esp16:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz
stepping : 10
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm ida
bogomips : 4388.98
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz
stepping : 10
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm ida
bogomips : 4389.07
clflush size : 64
power management:
My cpu model is :AMD Turion(tm) 64 X2 Mobile Technology TL-50 The kernel is compiled for i686, and I see large slowdowns.
I see this on my Intel T8100 notebook on both kernel-2.6.29-0.33.rc5.fc10.x86_64 and kernel-2.6.27.15-170.2.24.fc10.x86_64 (default Fedora config options). Just using the simple dd /dev/zero test can provoke it; the desktop feels less responsive. latencytop shows things like evolution waiting almost 10 seconds for an fsync to complete. Hardware has an ICH8 chipset; DMA etc. seems configured properly.

vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz
stepping : 6
cpu MHz : 800.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
In certain cases (2.6.28.5), with the patch "mm fix page writeback accounting to fix oom condition under heavy I/O" applied, the system goes out of control: iowait rises to ~100% and the system stops completely. I cannot provide any data, as only the reset button on the box still works. There is probably a set of contributing factors that needs a more detailed check.
(In reply to comment #179)
> In certain cases (2.6.28.5) at patch usage "mm fix page writeback accounting to
> fix oom condition under heavy I/O" the output from under the control, increase
> iowait ~ 100 % and a complete stop of system is observed. Any data I can not
> put, as one button reset on a box works only. Probably there is a set of the
> influencing factors demanding more detailed check.

My patch "mm fix page writeback accounting to fix oom condition under heavy I/O" is probably not the right solution, but rather a step in the right direction. It pinpoints that the elevator fails to increment counters that are tested by the code which decides whether the memory pressure from the dirty pages and writeback pages is high enough to make the process fall into "sync write" mode.

Therefore, I think a cleaner solution to this particular problem could be to create a new page type counter (like dirty pages, write buffers, ...) to let the vm know how many pages are used by the elevator. The fs/buffer.c code should then check this value too, to see if the pressure on memory is high enough to make the process do a "sync write".

However, this problem is harder than it appears, because the buffer.c code would probably put such a process in sync write mode independently of the elevator, and I really wonder what the interaction of such a solution with CFQ would be. I am not sure the CFQ I/O scheduler would behave correctly in such a situation, but Jens could tell better than I on the subject.

Hope this helps,

Mathieu
(In reply to comment #179)
> Any data I can not
> put, as one button reset on a box works only. Probably there is a set of the
> influencing factors demanding more detailed check.

I have noticed this issue with an unpatched kernel too. The "mm fix page writeback accounting to fix oom condition under heavy I/O" patch makes the problem reproducible. Sometimes the io wait time is at 100%. Sometimes there is no io wait time. There is no problem with read access, but no write access is executed. I can reproduce the problem with xfs. With ext4 the problem does not appear very often on either the patched or the unpatched kernel.
(In reply to comment #181) Then let me clarify: I use only xfs. The patch probably affects it badly and probably works well with other file systems. I am sorry, I simply have nothing else to check it on.
I have consistently had this problem with any kernel I have tried above 2.6.17, so I have stuck with that up until now. There are some supposed resolutions to the problem at http://linux-ata.org/faq.html, but none of them work for me, and I don't have the mentioned BIOS setting in my BIOS.

lspci reports...

00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface Bridge (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GBM/GHM (ICH7 Family) SATA IDE Controller (rev 01)

I do not appear to have the problem on my Macbook 2,1, although the disk performance is like 21M/s, which is lousy. But what I'm seeing on my one machine is 1M-3M/s.

I also tried passing "pci=routeirq" and "acpi=off" (grasping at straws), but that did not change anything. I did however notice that my HD is /dev/sda in 2.6.17, and /dev/hda in 2.6.25 and 2.6.27. On 2.6.17, dmesg tells me...

ata_piix 0000:00:1f.2: version 1.05
ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ]
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 17 (level, low) -> IRQ 18
ata: 0x170 IDE port busy
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0xBFA0 irq 14
ata1: dev 0 cfg 49:2f00 82:346b 83:7d09 84:6123 85:3469 86:bc09 87:6123 88:207f
ata1: dev 0 ATA-8, max UDMA/133, 625142448 sectors: LBA48
ata1: dev 0 configured for UDMA/133
scsi2 : ata_piix
Vendor: ATA Model: ST9320421ASG Rev: SD13
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3
sd 2:0:0:0: Attached scsi disk sda
sd 2:0:0:0: Attached scsi generic sg0 type 0

But on 2.6.27, I get nothing of the sort. Nothing to do with SATA or anything. I did notice that with 2.6.27, libata was enabled, while with 2.6.17 it didn't even appear to be an option. Ever since that libata, nothing seems to work, and my computer is relatively new. I have a Dell D820 Core 2 Duo.
I noticed the same - would it be possible to revert the libata-integration ?
Trenton, it's unclear to me what you're describing here.

> I have consistently had this problem

which problem?

Anyway, it sounds like what you're reporting is a straightforward regression in ATA throughput? If so, please raise a separate, new bug report against SATA for that, thanks.
Oops, mid-air collision. I'll answer Andrew's question first. I'm having two problems. 1. on my Dell D820 I see degraded throughput AND high io wait times as everyone else here has described 2. on my Macbook, I do not see degraded performance, but I see the extremely high io wait times. Both of these systems have the IDENTICAL IDE chipsets. Read on with my original reply, before collision, for more information. Quick question, is anyone else using the Intel 82801GBM/GHM IDE chipset, who has this problem as well??? I have a Dell D820 (64 bit) notebook, and a Macbook from late 2007 (the 64 bit ones). I noticed that they both have Intel 82801GBM/GHM IDE chipsets. They both exhibit the problem. If running Gentoo Linux 32 bit on the D820, and one of these bad kernels, my hard drive (which was renamed to hda), gets about 3M/sec, and the high wait times are also present. With the Macbook, the high io wait times are there, but I get a good throughput, with Gentoo 32 bit. Not sure what the difference is between the D820 and the Macbook, seeing they have very similar hardware (almost identical). I suppose it is possible that Apple made the suggested change that the linux-ata guy suggested (for the bios). This truly is debilitating. I have now tried two distributions with the latest 2.6.x kernels (Gentoo and OpenSUSE 11.1), and all of them exhibit these symptoms on my hardware. I am almost certain that if this does not get fixed, I will be unable to continue using Linux at work, unless I get a new computer (slim chance but possible). After all, eventually, Gentoo will move towards some new features that require a newer kernel, and I will be left in the dust. I will then be forced to run Linux in vmware under Windows. Please, someone save me from this awful DEATH. muhahahaha.
(In reply to comment #172)
> Must admit all machines I have noticed this on are x86_64.

I am seeing both x86_64 and i686 machines exhibit this. Before my Dell D820 died on me, it was a Core Duo 32-bit machine. Then it got replaced with a newer D820, which is a Core 2 Duo 64-bit machine. This issue happened on both of those. And, as mentioned in my last comment, it also happens on my Core 2 Duo Macbook.
I once had a similar *traumatic* throughput regression with an Intel processor + p4_clockmod. So the issues may completely have different causes.
(In reply to comment #134)
> Hi.
>
> On my laptop(Core2Duo 1.6 ghz) I run my gentoo kernel 2.6.28-gentoo.
> I didn't have any problems with latency.
>
> If I run "dd if=/dev/zero of=file bs=1M count=2048" or "dd if=/dev/zero
> of=/tmp/test bs=1M count=1M" (I tried to run it as user and also as root), my
> system works well and I can start firefox, another shell, open dolphin (i'm
> under kde4-svn) and everything is faster.
>
> I have XFS filesystem on my home and reiserfs on root.
>
> Since I configured my kernel manually, maybe it could be usefull for someone to
> have my .config so I'll post it.

I have just unmasked and tried 2.6.28 on Gentoo Linux as well, and the problem appears to be gone. This is on my D820, which is the one with really bad throughput as well. As I am in the process of converting to 64bit on my D820, I am unable to try GUI stuff out. But before, during heavy load, I was unable to switch between terminals very well either. Now the system is EXTREMELY responsive during these heavy load times, which is what I expect. And I'm getting 82M/sec once the caching limit has been reached, and 256M/sec with caching. This is equivalent to what I was getting with 2.6.17.

Now, I don't know if the gentoo guys applied someone's patch from here, as comment #52 mentioned patching 2.6.28, but it's working for me now. I'm VERY happy about that. :D Based on his description, it very much sounds like the Gentoo guys must have applied the patch.

I was doing a while loop with dd, increasing the amount of data by 1M at a time. The first few, up to about 60M, were getting 256M/sec. Then I noticed in my other terminal, running vmstat, that the iowait times got pinned to nearly 100%. So I'm thinking that all those dd's that got cached were finally catching up to the NO LIMIT on cached items, and causing thrashing in the IO system. That caused a COMPLETE freezeup of the while loop. Also, during this time, my HD light was going crazy. Then, when the io wait times dropped to 0 again (cached items flushed), the loop did a few more iterations (and my HD light was off), and it started all over again. Then again the loop froze, etc, etc, etc.
Also, I feel kind of stupid because I should have reported this back in 2007 when I saw it. But, I figured someone else would find it before too long, so I just hung back with my kernel version. SORRY!!! :( I guess I shouldn't do that next time. Especially considering it is way easier to find bugs when a new release just came out, and there is a new bug due to the changes in that release.
bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12309
>
> ------- Comment #189 from trent.bugzilla@trentonadams.ca 2009-03-01 01:26 -------
> (In reply to comment #134)
>> Hi.
>>
>> On my laptop(Core2Duo 1.6 ghz) I run my gentoo kernel 2.6.28-gentoo.
>> I didn't have any problems with latency.
>>
>> If I run "dd if=/dev/zero of=file bs=1M count=2048" or "dd if=/dev/zero
>> of=/tmp/test bs=1M count=1M" (I tried to run it as user and also as root),
>> my system works well and I can start firefox, another shell, open dolphin
>> (i'm under kde4-svn) and everything is faster.
>>
>> I have XFS filesystem on my home and reiserfs on root.
>>
>> Since I configured my kernel manually, maybe it could be usefull for
>> someone to have my .config so I'll post it.
>
> I have just unmasked, and tried 2.6.28 on Gentoo Linux as well, and the
> problem appears to be gone. This is on my D820, which is the one with really
> bad throughput as well. As I am in the process of converting to 64bit on my
> D820, I am unable to try GUI stuff out. But, before, during heavy load, I was
> unable to switch between terminals very well either. Now, the system is
> EXTREMELY responsive, during these heavy load times, which is what I expect.
> And, I'm getting 82M/sec once the caching limit has been reached, and
> 256M/sec with caching. This is equivalent to what I was getting with 2.6.17.
>
> Now, I don't know if the gentoo guys applied someone's patch from here, as
> comment #52 mentioned patching 2.6.28, but it's working for me now. I'm VERY
> happy about that. :D Based on his description, it very much sounds like the
> Gentoo guys must have applied the patch. I was doing a while loop, with dd,
> increasing the amount of data by 1M at a time. The first few, up to about
> 60M, were getting 256M/sec. Then, I noticed in my other terminal, running
> vmstat, the iowait times got pinned to nearly 100%. So, I'm thinking that all
> those dd's that got cached, were finally catching up to the NO LIMIT on
> cached items, and causing thrashing in the IO system. That caused a COMPLETE
> freezeup of the while loop. Also, during this time, my HD light was going
> crazy. Then, when the io wait times dropped to 0 again (cached items
> flushed), the loop did a few more iterations (and my HD light was off), and
> it started all over again. Then, again the loop froze, etc, etc, etc.
Ok, so if that Gentoo version is working for you, can we compare it with the vanilla kernel? Can you send us some system info so we can compare your kernel config with the vanilla one? Can we have a tarball with the following structure (to make it easy to diff)?
--------------------------------------------------
systeminfo.txt
vanilla
 \- config (original config of the vanilla kernel, not yours)
 |- kernel-info.txt
 |- dmesg.txt
 |- lsmod-output.txt
 |- test-report.txt
gentoo-youredition
 \- config (the config file of your kernel version)
 |- dmesg.txt
 |- lsmod-output.txt
 |- test-report.txt
 |- gentoo.patch
--------------------------------------------------
If you have the time, can you do the following on the system:
- Get the source for the gentoo version you are using (shouldn't be too hard on Gentoo ;-) )
- Get the source of the vanilla kernel with the same version/patch level as your gentoo kernel
- Check whether your current gentoo config works on the vanilla kernel and whether that results in a responsive system
- If that does not solve the bug on your system, create a patch file for the gentoo patches, so we can see exactly what gentoo has patched
If you try this and send us the information, we can use a tool like Meld (http://meld.sourceforge.net/) to compare the 2 kernel configurations with each other.
Can you put the following information in systeminfo.txt:
 cat /proc/cpuinfo
 cat /proc/meminfo
 cat /proc/swaps
And for the per-kernel information, in kernel-info.txt:
 cat /proc/version
 uname -a
 cat /proc/cmdline
 cat /sys/block/<disk>/queue/scheduler
Config is just the .config file. You can get it with the command zcat /proc/config.gz, from your /boot/config-<something>, or from the kernel source.
In dmesg.txt: your dmesg output.
In lsmod-output.txt: your lsmod output.
In test-report.txt: a report of your tests on that kernel, which tests you did and how they performed.
In gentoo.patch: the patches Gentoo made to the vanilla kernel (using the diff command).
I hope we can find a piece of the cause with this information.
Greetings, Michiel
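The collection steps requested above could be scripted roughly as follows. This is only a sketch: the function name collect_info, the default disk name sda and the directory name passed in (for example vanilla) are assumptions made for illustration, not something Michiel specified. Commands that fail, such as a missing /proc/config.gz, simply leave their error text in the output file.

```shell
#!/bin/sh
# collect_info TOPDIR KERNELNAME [DISK]
# Writes TOPDIR/systeminfo.txt and the per-kernel files under
# TOPDIR/KERNELNAME/. All three argument names are hypothetical.
collect_info() {
    ci_top=$1
    ci_dir=$1/$2
    ci_disk=${3:-sda}    # assumption: the disk under test is sda
    mkdir -p "$ci_dir" || return 1

    # Machine-wide information, independent of the running kernel.
    {
        cat /proc/cpuinfo
        cat /proc/meminfo
        cat /proc/swaps
    } > "$ci_top/systeminfo.txt" 2>&1

    # Information about the kernel that is running right now.
    {
        cat /proc/version
        uname -a
        cat /proc/cmdline
        cat "/sys/block/$ci_disk/queue/scheduler"
    } > "$ci_dir/kernel-info.txt" 2>&1

    dmesg > "$ci_dir/dmesg.txt" 2>&1
    lsmod > "$ci_dir/lsmod-output.txt" 2>&1

    # The config is taken from /proc/config.gz when the kernel exposes it,
    # otherwise from /boot; if neither exists it has to be copied by hand.
    if [ -r /proc/config.gz ]; then
        zcat /proc/config.gz > "$ci_dir/config"
    elif [ -r "/boot/config-$(uname -r)" ]; then
        cp "/boot/config-$(uname -r)" "$ci_dir/config"
    fi
    return 0
}
```

Booting each kernel in turn and running, for example, collect_info ./bug-12309 vanilla sda would fill in one subdirectory per kernel; test-report.txt and gentoo.patch still have to be written by hand.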
(In reply to comment #16)
> I tried elevator=as on my system, and it did not change the behaviour.
> Copying files from external USB to internal encrypted SSD still totally
> smashes interactive performance. So this issue might be unrelated.
Note that some SSDs have very poor random-write performance, which can cause stuttering and all sorts of side effects. AnandTech investigated this issue when reviewing Intel's SSDs against parts from OCZ that use a certain JMicron controller. See here: http://www.anandtech.com/showdoc.aspx?i=3403&p=7 You should probably just read the entire review. It is therefore possible that your issue has more to do with the behaviour of your SSD during writes than with the kernel scheduler or anything else.
Working on it now Michiel. I'll try and get that info for 2.6.27, 2.6.28, and vanilla 2.6.28. ttyl
Hmmm, apparently I forgot to try vmstat. The high io wait times are still there, but I haven't been noticing it. I wonder what could have caused me to not notice it now. The performance is way better, even with the high io wait though. I'm not seeing 30 second delays on stuff. Every now and then there's a second or two delay, perhaps five tops. I'll get the info anyhow, and see what the differences are. FYI: This is still on my D820.
(In reply to comment #192)
> (In reply to comment #16)
> > I tried elevator=as on my system, and it did not change the behaviour.
> > Copying files from external USB to internal encrypted SSD still totally
> > smashes interactive performance. So this issue might be unrelated.
>
> Note, some SSD's have very poor random-write performance, this can cause
> stuttering and all sorts of side effects. Anandtech investigated this issue
> when comparing/reviewing Intel's SSD's vs. parts from OCZ which uses a
> certain JMicron controller. See here:
> http://www.anandtech.com/showdoc.aspx?i=3403&p=7
> You should probably just read the entire review.
>
> It is therefore possible that your issue has more to do with the behaviour of
> your SSD during writes than the kernel scheduler or anything else.
Well, if that is true, it would have to be a combination of the kernel and my system. Mainly because my system was SUPER fast before I tried upgrading my kernel past 2.6.17. As for my Mac, I don't recall having performance issues while running Mac OS X. Nothing like the article describes anyhow.
(In reply to comment #189) - #195
> Well, if that is true, it would have to be a combination of the kernel and my
> system. Mainly because my system was SUPER fast before I tried upgrading my
> kernel past 2.6.17. As for my Mac, I don't recall having performance issues
> while running Mac OS X. Nothing like the article describes anyhow.
There is another bug in 2.6.17/18-??, which gives poor disc performance when the SATA controller on an ICH8M (or similar?) platform runs in compatibility mode. It also causes high i/o wait times and makes this bug appear. There are dependencies between cpu power, disc throughput, task switching time (e.g. clocksource) and this bug. Has anyone tried to identify the source of the problem with the info provided in Comment #168 and Comment #169? There is a comment in the code (blk-core.c @ ~1300):
/*
 * After dropping the lock and possibly sleeping here, our request
 * may now be mergeable after it had proven unmergeable (above).
 * We don't worry about that case for efficiency. It won't happen
 * often, and the elevators are able to handle it.
 */
But it happens up to 20 times every second during heavy io, causing high io wait times for the writing process (or pdflush) and making desktop responsiveness poor. My proof is the really poor desktop responsiveness when replacing prepare_to_wait_exclusive with msleep_interruptible (see Comment #169). I will be able to spend some more time on this bug in April.
Created attachment 20405 [details] info request by Michiel in comment 191
Here's the info you wanted, Michiel. Doing a diff on the config of the bad kernel and the new one reveals this interesting tidbit...
diff -u 2.6.27-gentoo-r8-kernel-config.txt 2.6.28-gentoo-r2-kernel-config.txt
-CONFIG_BLK_DEV_IDEDISK=y
-CONFIG_IDEDISK_MULTI_MODE=y
+CONFIG_IDE_GD=y
+CONFIG_IDE_GD_ATA=y
That must have been what switched me back to using sda. Anyhow, that was obviously a separate issue. So, my system performance and io wait times are totally fine during normal system operation. When I do REALLY heavy io, the wait times go up, but the responsiveness is still relatively good. I can start kwrite in about 2-3 seconds. It seems like it is fixed to me. But, I'll still try that patched 2.6.28 and get back to you, to see if it is even better. Perhaps Andrew Morton was right. Maybe my issue was entirely to do with my SATA issues.
(In reply to comment #196) > There is another bug in 2.6.17/18-??, which gives a poor disc performance, > while running the SATA controller on a ICH8M (or equal?) platform in > compatibility mode, which gives a high i/o wait time too and lets this bug > appear. > > There are dependencies between cpu-power, disc throughput, task switching > time > (eg. clocksource) and this bug. This is interesting, since my notebook has an ICH8M stuck in compatibility mode (no BIOS option). I'll see how it compares to my other notebook with an ATI-IXP chipset.
Has anyone seen this on a non-SATA drive? If I do a "dd if=/dev/zero of=outfile bs=1M count=50000" on 2.6.28 the load rises to around 8; on 2.6.29-rc5 it never gets past 4. I'm testing on 64bit, ICH9 + SATA, btw. I tried installing CentOS 4.7, with kernel 2.6.9.+, and it's just as bad as 2.6.28.
I have just tested 2.6.29-rc6. The desktop responsiveness has improved enormously. Firefox in particular is now usable. The problem still exists for me, but it is not as noticeable as before.
(In reply to comment #195)
> (In reply to comment #192)
> > It is therefore possible that your issue has more to do with the behaviour
> > of your SSD during writes than the kernel scheduler or anything else.
>
> Well, if that is true, it would have to be a combination of the kernel and my
> system. Mainly because my system was SUPER fast before I tried upgrading my
> kernel past 2.6.17. As for my Mac, I don't recall having performance issues
> while running Mac OS X. Nothing like the article describes anyhow.
OK, well, in that case I absolutely agree it's obviously a software-only problem in your case and probably this scheduler kernel issue. (I just wanted to point out for the record, so everyone's aware, that there are some SSD hardware combinations with inherent limitations that may very well cause similar sluggishness regardless of the kernel/software itself.) As an aside, high IO wait percentages are, as far as I understand it, not in and of themselves problematic, since high IO wait only means that a process is waiting for IO. This measure will therefore predictably be high when a process is doing heavy IO with a comparatively slow device. Normally, however, one would expect such IO not to negatively affect other processes or general system responsiveness, *except* if the other processes also need IO in order to proceed and you have some sort of IO resource contention going on, or, as appears in this thread, there's actually a scheduling problem which causes processes that are runnable to not receive the CPU when they should, thus resulting in perceived sluggishness.
I must correct my last post (Comment #200). I was working with VMs the whole day and it is still as awful as before. But there is a big improvement when using Firefox.
I would agree that -rc6 has for some reason greatly improved system responsiveness under I/O load, but there are most certainly still great issues in the block I/O world. Just now I once again managed to completely wedge my machine by doing nothing more than copying a few gigabytes of files between drives. Furthermore, Firefox still freezes for several seconds when I first start typing in the location bar as it looks in its history database. Lastly, Evolution still takes several minutes to start and become usable while its I/O rate is less than 1 MB/s. All in all, things are pretty unusable. Jens, are you around? I've been asking various distributions and vendors whether they could spare some qualified man-hours to get this problem finally worked out, but it seems like you're our best hope. I know you'll be getting at least one case of beer when this is fixed ;)
Hi Guys, My brother has apparently been having the same problem on his computer. I hadn't realized it when I submitted my bug. For him, he has an ICH8 family of chipsets. The following works for him, and the problem goes away. echo anticipatory > /sys/block/sda/queue/scheduler Looks like this may be a tough one to nail down, because everyone's symptoms are slightly different. I'm wondering if perhaps there are multiple issues going on here.
Oh, crap, I forgot the details. Before the details, I also wanted to say that I am going to get him to try changing the BIOS option mentioned on the libata page I gave earlier, to see what happens.
[03:05 root@zipper ~]# lspci
00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G965 Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82P965/G965 HECI Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DC Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Contoller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2 (rev 02)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3 (rev 02)
00:1c.3 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 4 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HB/HR (ICH8/R) LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801H (ICH8 Family) 4 port SATA IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
00:1f.5 IDE interface: Intel Corporation 82801H (ICH8 Family) 2 port SATA IDE Controller (rev 02)
02:00.0 IDE interface: Marvell Technology Group Ltd. 88SE6101 single-port PATA133 interface (rev b1)
06:00.0 RAID bus controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)
06:01.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
06:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
[03:05 root@zipper ~]# uname -a
Linux zipper 2.6.18-53.el5xen #1 SMP Mon Nov 12 02:46:57 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
[03:09 root@zipper ~]# cat /etc/issue
CentOS release 5 (Final)
Kernel \r on an \m
I have noticed that while working with VMs my system starts swapping after a while. I tried -rc7 with Mathieu's patch (Comment #160) and my system seems to be usable. There is still the unfair io scheduling between processes, but that's another problem. I am using a kernel without "Group CPU Scheduler" and "Control Group Support" and am writing this text in firefox at load avg 12. To reach such a high load avg, I have to run eight concurrent dd write operations:
for i in 1 2 3 4 5 6 7 8; do \
  dd if=/dev/zero of=test-$i bs=1M count=4K oflag=direct & echo test-$i; \
done
Copying big files with nautilus makes my system unusable from time to time, with known symptoms such as "Unable to switch desktop" and "mouse freezes". And finally, I have not seen the complete io freeze with the -rc7 kernel on xfs, ext3 and ext4.
Trenton, I too set my kernel to the anticipatory scheduler, and for a while I thought all was well when I ran dd if=/dev/zero of=~/test bs=1M count=1500 to test. Then I realized that it's not a reliable testing method, since the *anticipatory* scheduler can anticipate the coming zeroes that will be written. I ran dd if=/dev/zero of=~/test bs=1M count=1500 simultaneously with the one writing from /dev/zero, and realized that part of the syndrome is fixed with AS, but the problem persists...
Argh, forgot to give details too... I'm running the 2.6.28-8-generic kernel (64bit) in Ubuntu Jaunty, and I had this problem with 32-bit kernels before as well.
khaal@Xeraphim:~$ sudo lspci
[sudo] password for khaal:
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:02.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0a.2 RAM memory: nVidia Corporation MCP51 Memory Controller 0 (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
02:00.0 VGA compatible controller: nVidia Corporation G80 [GeForce 8800 GTS] (rev a2)
03:05.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 70)
03:06.0 Multimedia controller: Philips Semiconductors SAA7131/SAA7133/SAA7135 Video Broadcast Decoder (rev d1)
03:07.0 Multimedia audio controller: Creative Labs SB X-Fi
03:09.0 Ethernet controller: Atheros Communications Inc. AR5413 802.11abg NIC (rev 01)
I have done some initial testing on my x86_64 box with vanilla 2.6.17 (downloaded from kernel.org), and it seems to me that it has the problem too. I don't understand why my problem started with 2.6.18 if vanilla 2.6.17 has the problem. Note that I tested both the first and the last version of 2.6.17. I'm thoroughly confused. I think I'll switch to 2.6.17 and run that for a while to see if there's better performance overall. Perhaps loading the system is not the best way to see if there are latency issues, as there will always be some. Then, if I do see some improvement, I'll increment to 2.6.18. Hopefully, slowly but surely, I can figure out exactly which kernel has the problem, and then a kernel dev can fix it. That's the plan anyhow. :P
Hi, I tried the new 2.6.28.7 kernel, and things seem to have gotten worse... Even btorrent checking downloaded files is able to lock up the computer... I will upload a new screenshot showing 91.4% of processor time waiting for the HD to read data... This is nonsense... I will try to do the same check for every new kernel that comes out to check for improvements.
Created attachment 20464 [details] IWait problem 91,4% 2.6.28.7
I wanted to add even more testing results from my side. I tried the suggestions from this source: http://stackoverflow.com/questions/392198/how-to-make-linux-gui-usable-when-lots-of-disk-activity-is-happening by changing some vm.dirty_ variables. No improvement could be seen, and changing to the deadline scheduler didn't improve the situation either. I also changed /sys/block/sda/queue/nr_requests to 64, with the same unresponsiveness. I'm still on the same kernel (2.6.28-8) and my fstab mounts the partition with the relatime,noatime,nodiratime flags.
I am currently installing 2.6.29-rc7, hoping that it will solve some of the issues in this bug. Could changing the slab allocator be an option to test for the problem? We can choose between SLAB/SLUB/SLOB. Maybe that can be helpful.
There are still some confusing comments on IO wait in here, so let's clear that up at least. 91% io wait does not mean it's using 91% cpu power for doing the IO, it merely means that some process is BLOCKED waiting for IO 91% of the time. It has zero relevance to cpu cycles consumed. The same goes for the observed load. Having a load of 2.0 due to io wait times does not mean that you have a doubly loaded system. It just means that, on average, two processes are blocked waiting for IO. When you start a bittorrent client and it checks the file data, you would expect io wait to be nearly 100%. It does do some cpu processing, so that's why it's not completely at 100%. So forget IO wait, it doesn't tell you ANYTHING about whether a system is supposed to be slow or not.
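For anyone who wants to see what the figure reported by top or vmstat is made of, here is a minimal sketch. It assumes the aggregate "cpu" line of /proc/stat, where the values after the label are cumulative tick counters for user, nice, system, idle, iowait, irq, softirq and steal. The function name iowait_pct is made up for this example. The iowait counter accumulates time the CPUs spent idle while IO was outstanding, which is why it says nothing about cycles consumed.

```shell
#!/bin/sh
# iowait_pct LINE1 LINE2
# LINE1 and LINE2 are two samples of the aggregate "cpu" line of /proc/stat.
# Prints the share of elapsed CPU time, in percent, that was accounted as
# iowait between the two samples.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # Fields 2..9 are user nice system idle iowait irq softirq steal;
        # later fields (guest time) are already included in user time.
        NR == 1 { for (i = 2; i <= NF && i <= 9; i++) t1 += $i; w1 = $6 }
        NR == 2 { for (i = 2; i <= NF && i <= 9; i++) t2 += $i; w2 = $6
                  if (t2 > t1) printf "%.1f\n", 100 * (w2 - w1) / (t2 - t1)
                  else print "0.0" }'
}

# Example use on a live system (not run here):
#   a=$(grep "^cpu " /proc/stat); sleep 5; b=$(grep "^cpu " /proc/stat)
#   iowait_pct "$a" "$b"
```

A machine can show a high value here while a CPU-bound task still gets all the cycles it asks for, so a measurement of responsiveness has to come from somewhere else.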
And to make a more general comment... This bug is impossible to solve, since it (once again) has degraded into somewhere for everybody to tunnel everything that relates to a system feeling sluggish. There could be at least 10 separate issues described in here, or more. And while some of these are surely things we could do better, some are also certainly expected behaviour. We are at least touching several file systems, mm issues, and io scheduler issues. I'm quite sure that some of the mentioned behaviour is completely due to ext3 sucking at fsync. I'd LOVE to be able to look into this, but honestly I have no idea where to start. What I would also love is for someone to post a test case that actually works. This includes observed behaviour and a description of what you would EXPECT to see happen. Then we/I should be able to at least judge whether there's something we can do about it. Expecting a fully fluid system while having 100 threads writing data to the device is not reasonable, for instance. But if it behaves significantly worse than previous kernels, then there's still something to look into.
I totally agree with you Jens. I have been having a hard time localizing the problem myself. I went back to the 2.6.17 kernel, and it seems to be worse than my 2.6.28 kernel. But keep in mind, I was running i686 when I originally discovered the problem, and now I'm doing x86_64. I think the only way I will be able to localize the issue, is if I restore my system to i686 gentoo, and then trying 2.6.28, then I may start getting somewhere. I also agree that it is nearly impossible to solve this one without some more concrete data. I wish I had chosen a different time to upgrade to 64bit, because then I could be fiddling with this issue on my i686 still. I'll post again if I find something more concrete.
I will admit that many of my issues seem to be caused by fsync() (I'm on ext4). One of the largest issues I'm currently having is Liferea blocking in fsync() for several seconds every time a new item is selected. During this time kjournald2 is writing, although iotop only shows a total write rate of ~500kB/s. This seems extremely slow and far below the disk's (a 7200 RPM SATA drive) capacity. This low I/O rate is common for all sluggish I/O cases. Does this sound like expected behavior? Perhaps my problems have been caused by just generally slow I/O?
The last kernel without the problem was 2.6.16 (the evidence for that: SLES 10 SP2 does not give high iowait on an ASUS P5K). So let's look at what big new feature appeared in 2.6.17 and was absent in 2.6.16. It clearly cannot belong to any single file system (all file systems are affected). I did not find any changes in the schedulers between 2.6.16 and 2.6.17. The introduction of libata is the only difference. Whether the high iowait comes from libata itself or from the infrastructure that embeds it in the kernel makes practically no difference. Only one thing matters: the kernel is broken. That is the sad fact.
I do not mean the fsync problem, which is no longer a problem for me in the 29 kernel. I mean the sluggish behaviour of all GUI applications, especially while working with VMware Workstation. Suspend and resume time rises from less than two minutes to up to ten minutes. It started for me when I upgraded from Feisty (2.6.20) to Gutsy (2.6.22) on a 32-bit Pentium-M. It is hard to pin this problem down, as it does not appear all the time, and there are a lot of other problems and many solved problems, which make a comparison very difficult. My assumption is that it depends on the cpu, the hard drive and the user. The best hint for me was the duration of the process test. I did not submit this test so that the kernel would be tuned for this special test case, as I have seen happen on LKML; it should help to localize the problem. The results of these tests seem to fit the regression in sluggish behaviour. See http://bugzilla.kernel.org/attachment.cgi?id=19797&action=view
CentOS   2.6.18-92.el5  29.995s  good
Feisty   2.6.20.21      25.304s  good
Gutsy    2.6.22-16      40.405s  bad
Hardy    2.6.24-23      37.604s  bad
Intrepid 2.6.27-9       96.922s  unusable
I have seen with powertop that the number of interrupts for keyboard input doubled from 200 to 400 when heavy io was running in the background. And I know there is nothing wrong with a high io wait time, but as soon as the io wait time reaches 100% the desktop becomes sluggish and unusable. You can try this on an installation on a slow disk with ext3, or even on a fully encrypted disc. The slow SSD could be related to this bug, as write performance with linux is really poor on many SSDs. I have measured transfer rates as low as 2MB/s for non-direct writes (4KB cache splitting), while direct writes reach up to 90MB/s on my SSD. My system on my SSD is completely unusable. I will run some tests in a virtual machine, as it seems to me that an application running in the virtual machine is more affected by this sluggish behaviour than an application executed on the host. I will run exactly the same vm and test on different host kernels. But I am not able to spend more time on this before April. Perhaps someone else can start earlier?
Jens, I'm trying to nail this down on my computer. So, I'm creating a vm of my i686 gentoo system, to see if I can get the same results as I was seeing before. I used the following command, inside the vm, to extract the system tarball backup of my previous system.
ssh root@192.168.8.4 'gunzip -c /media/backup/system.tar.gz' | tar -xv --exclude './usr/portage/packages/*' --exclude './userportage/distfiles/*' --exclude './var/log/apache2/*' --exclude ./Bonnie.10218 >extract-list.txt
Now, on the host system (192.168.8.4) I am seeing the following...
trenta@tdamac ~/Desktop $ uptime
 01:39:37 up 1:21, 6 users, load average: 20.49, 14.92, 9.35
Obviously I'm getting REALLY sick performance. Normally something linear like a tar extraction does not produce these kinds of performance issues. Granted, the disk may have to move around a little, but is it that bad? Is there something I can do to analyze why this is happening, e.g. something like strace? I ran strace -c on kwrite during heavy load like this, and it claims that it finished everything in a tenth of a second, even though it took like 30. So, is there a lower level mechanism I can use to get a fix on what is making processes wait? For example, something that will tell me "kernel function X" is blocking? Thanks.
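One low-tech way to approach that question, sketched here under the assumption of a procps-style ps, is to list the tasks in uninterruptible sleep (state D) together with their WCHAN column, which names the kernel function a sleeping task is waiting in when the kernel exposes it. The helper name blocked_only is made up for this example, and nobody in this thread has confirmed that its output would be enough to diagnose the bug.

```shell
#!/bin/sh
# blocked_only: filter "ps -eo pid,stat,wchan:32,comm" style output read
# from stdin. Keeps the header line plus every row whose STAT column
# (second field) starts with D, i.e. uninterruptible sleep, usually on IO.
blocked_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Example use during a stall (not run here):
#   while sleep 1; do
#       date
#       ps -eo pid,stat,wchan:32,comm | blocked_only
#   done
```

Sampling once a second while kwrite is starting shows which tasks are stuck and in which kernel function they sleep, which is more to go on than strace -c, since strace only accounts for time spent inside system calls.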
Hi Ben, Thank you for the clarification. I think I was really lost on this. I expected the process to wait during IO, but then the rest of the system should be able to use the remaining processor power, and it does not. The system seems to hang until the IO stops. So I think the best way to proceed is to start ruling out problems. I propose to start with this: I will run a CPU-intensive task with no IO while another process writes a file without using much CPU, to check whether the first process takes the same time to execute under high IO or not.
Process 1: CPU / No IO
Process 2: High IO / No CPU
And measure the times... Should this test trigger the problem? As process 1 does no IO, it should finish in almost the same time as under no load at all. Right?
Can we rule out an ext3-related problem? Test case: write files with 1 thread on ext3, ext4, reiser, etc. and observe responsiveness.
Can we track whether this is an fsync problem? How (commands, test case)?
How can we test this without the filesystem taking part in the tests?
Can we show differences between kernel 2.6.16 and >=2.6.28? (I will do this today.)
How do we measure responsiveness? Can we put a numeric value on it?
Thank you all.
Gonzalo, I think you're asking great questions that can help us establish the cause of the problem. Even though I can't answer most of your questions (I'm no guru), I think we should all agree on a unified way to test and measure the responsiveness. Regarding filesystems, I tried ReiserFS, ext3 and ext4 with two terminals running dd if=/dev/zero of=/test1 bs=1M count=1400 and dd if=/dev/urandom of=/tst2 bs=1M count=700 as a test, and they all gave the same sluggish feeling to the system.
I agree that those tests let us know that there's a problem, because we see the sluggish behaviour. However, if a kernel dev is not seeing the performance issues on their machines, it won't be very convincing for them. If however, we provide some concrete tests, showing which kernels didn't have the problem, which did, and the test results, then they may be able to get somewhere. That's why I'm hoping someone can chime in and tell us what sorts of tests would be useful, such as I suggested in comment #220.
Ok. Here are my first tests with 2.6.28.7. I used a modified version of ThreadSchedulerTest.cpp that removes the initial timeout, and a dd to simulate high IO load. The first hypothesis seems to be disproved: high IO load does not seem to affect processing much.
------------------------------------------------------------------
./kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 3362
min:0.008ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:19.791s
Break!
We have Burning CPU with 4855
min:0.006ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:18.754s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 6211
We have Burning CPU with 6212
min:0.007ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:20.265s
DD Finished
--- Finish ---
Kernel tested: 2.6.28.7-level2crm i686
-----------------------------------------------------------------------
The results say that it takes up to 2 seconds more to complete. (Is this relevant for a process that takes ~18-19s to complete?) A curious thing is that I observed no IO wait while processing in test 2, only system processor time. This also seems strange, as it should be 100% USER time. System time (correct me if I'm wrong) means that the OS is spending a lot of time scheduling the threads... Anyway, I will try to reproduce high iowait times before starting the CPU-intensive program to see if we are right. I will post the test suite in bash. Feel free to add more tests.
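To put a number on that question, the relative slowdown can be computed from the durations above. This is a small sketch with a made-up helper name, slowdown_pct. It only does the arithmetic and says nothing about why the times differ.

```shell
#!/bin/sh
# slowdown_pct BASE LOADED
# Prints by how many percent the LOADED duration exceeds the BASE duration.
slowdown_pct() {
    awk -v base="$1" -v loaded="$2" 'BEGIN {
        if (base <= 0) exit 1          # refuse to divide by zero
        printf "%.1f\n", 100 * (loaded - base) / base
    }'
}

# With the durations reported above:
#   slowdown_pct 18.754 20.265   # run under IO vs. the faster idle run
#   slowdown_pct 18.754 19.791   # the two idle runs against each other
```

The first call gives 8.1 and the second 5.5, so most of the apparent slowdown under IO is of the same size as the run-to-run variation without any IO. More repetitions would be needed before reading anything into a two-second difference.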
Created attachment 20489 [details] Initial effort to build an automatic test suite for this bug Please feel free to add tests or correct what's wrong
Hello Gonzalo, I just ran your test suite and here are the results:
---------------------------------
khaal@Xeraphim:~/Desktop/test-suite-bug-12309$ sh kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 17986
min:0.006ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:21.873s
We have Burning CPU with 19909
min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:17.708s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 21084
We have Burning CPU with 21085
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 12.5488 s, 16.7 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 16.0014 s, 13.1 MB/s
DD Finished
Killing 21085 process
--- Finish ---
Kernel tested: 2.6.28-8-generic x86_64
khaal@Xeraphim:~/Desktop/test-suite-bug-12309$ 200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 18.6493 s, 11.2 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 18.9091 s, 11.1 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 20.0353 s, 10.5 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 20.1651 s, 10.4 MB/s
-------------------------------------
I'm not really familiar with what it's saying, but it did affect the desktop responsiveness. I made a Google spreadsheet that's open to anyone, in order to organise test results and see common traits among our systems: http://spreadsheets.google.com/ccc?key=p3aerC-xkjEqvo7BvMHaxXg - there is one thing missing, and that is a place to upload the output of these tests. Does anyone know of a service that's like photobucket but for text/console output? The document is open for everyone to edit. Please choose a specific color for yourself so we keep it readable :-)
Created attachment 20491 [details] Initial effort to build an automatic test suite for this bug V2 This fixes the killing of the process (I hope)
I will try to explain. TEST 1: The first test takes two measurements of a CPU-intensive program: We have Burning CPU with 17986 min:0.006ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:21.873s We have Burning CPU with 19909 min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:17.708s It takes between 17s and 22s to complete. Lines like: 209715200 bytes (210 MB) copied, 18.6493 s, 11.2 MB/s tell you the throughput of your HD. This throughput is shared between 6 processes that are writing at the same time. TEST 2: It then does the same thing but under high IO. Unfortunately the CPU program was killed before it finished, because the high IO finished before the CPU-intensive program did, so it seems the IO is affecting you badly. On my computer the CPU program finished first. Can you run it with the new version, please? NOTE: It writes several 200MB files to your hard disk. Please remove them after the tests; they will take 200x6=1200MB of your disk.
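For readers who have not opened the attachment, the structure described above can be sketched roughly as follows. This is not the attached kernel-test.sh: the function names, the ddtest file prefix and the use of conv=fdatasync are assumptions made for illustration, and GNU dd is assumed for bs=1M.

```shell
#!/bin/sh
# Illustrative sketch only, NOT the attached kernel-test.sh.
# All function and file names here are hypothetical.

# CPU-bound busy loop: counts up to $1.
burn_cpu() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=$((i + 1))
    done
}

# Start $2 background dd writers of $3 MB each in directory $1.
start_writers() {
    n=0
    while [ "$n" -lt "$2" ]; do
        dd if=/dev/zero of="$1/ddtest.$n" bs=1M count="$3" conv=fdatasync &
        n=$((n + 1))
    done
}

# Time the CPU loop while the writers run, then remove the temp files.
# $1 = directory, $2 = writers, $3 = MB per writer, $4 = loop iterations.
run_test() {
    start_writers "$1" "$2" "$3"
    t0=$(date +%s)
    burn_cpu "$4"
    t1=$(date +%s)
    wait
    rm -f "$1"/ddtest.*
    echo "cpu_seconds=$((t1 - t0))"
}

# Example matching the description above (6 writers of 200 MB each):
#   run_test . 6 200 5000000
```

Running the same burn_cpu count once without writers and once through run_test gives the two durations being compared in this thread; date +%s only has one-second resolution, so the loop count needs to be large enough for the difference to show.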
For me throughput is horrible: First Test: How much gets to run the CPU intensive task? We have Burning CPU with 14987 min:0.005ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:21.527s We have Burning CPU with 16371 min:0.005ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:21.833s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 17768 We have Burning CPU with 17769 min:0.007ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:22.777s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 64,2187 s, 3,3 MB/s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 75,1226 s, 2,8 MB/s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.28.7-level2crm i686 gad@ws-esp16:~$ 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 76,8811 s, 2,7 MB/s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 79,4772 s, 2,6 MB/s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 82,0248 s, 2,6 MB/s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 82,9147 s, 2,5 MB/s --------------------------- I forgot to say ext3 filesystem here... I will try with different kernels from now on.
Results from my notebook: [james@rhapsody tsb]$ ./kernel-test.sh Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 3772 min:0.009ms|avg:0.013-0.013ms|mid:0.000ms|max:0.000ms|duration:37.528s We have Burning CPU with 6762 min:0.011ms|avg:0.013-0.013ms|mid:0.000ms|max:0.000ms|duration:37.351s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 9489 We have Burning CPU with 9490 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.1718 s, 9.9 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 38.183 s, 5.5 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 41.1141 s, 5.1 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 45.3742 s, 4.6 MB/s min:0.007ms|avg:0.012-0.013ms|mid:0.000ms|max:0.000ms|duration:38.801s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 49.0724 s, 4.3 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 50.0517 s, 4.2 MB/s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.29-0.54.rc7.git3.fc10.x86_64 x86_64
Output on kubuntu 8.10 running on EliteBook 8530w. While running, it felt 'sluggish' but not by much. When copying/unziping big files, I can get 10+ seconds of firefox inactivity. Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 24021 min:0.004ms|avg:0.018-0.022ms|mid:0.000ms|max:0.000ms|duration:15.861s We have Burning CPU with 25229 min:0.004ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:15.678s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 27067 We have Burning CPU with 27068 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 15.0066 s, 14.0 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 19.0474 s, 11.0 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.9454 s, 9.6 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 22.6718 s, 9.3 MB/s DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 22.9066 s, 9.2 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 23.667 s, 8.9 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD FinishedUsing current dir to do IO tests First Test: How much gets to run the CPU intensive task? 
We have Burning CPU with 24021 min:0.004ms|avg:0.018-0.022ms|mid:0.000ms|max:0.000ms|duration:15.861s We have Burning CPU with 25229 min:0.004ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:15.678s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 27067 We have Burning CPU with 27068 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 15.0066 s, 14.0 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 19.0474 s, 11.0 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.9454 s, 9.6 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 22.6718 s, 9.3 MB/s DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 22.9066 s, 9.2 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 23.667 s, 8.9 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished min:0.004ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:17.371s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.27-13-generic x86_64 DD Finished DD Finished min:0.004ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:17.371s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.27-13-generic x86_64
gad@ws-esp16:~$ ./kernel-test.sh /mnt/data/gad/ First Test: How much gets to run the CPU intensive task? We have Burning CPU with 8103 min:0.006ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:21.766s We have Burning CPU with 10098 min:0.007ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:21.275s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 12105 We have Burning CPU with 12106 min:0.007ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:20.630s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 34,4896 s, 6,1 MB/s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 35,157 s, 6,0 MB/s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 37,4852 s, 5,6 MB/s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.28-8-generic i686 gad@ws-esp16:~$ 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 40,6583 s, 5,2 MB/s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 49,9392 s, 4,2 MB/s 200+0 registros de entrada 200+0 registros de salida 209715200 bytes (210 MB) copiados, 51,9306 s, 4,0 MB/s ----- Filesystem ext4
Seams last comment has double c/p, making it hard to read. Here goes another result (for some reason, I get a bunch of "DD Finished", I didn't want to cut as do not know if its relevant for test - probably not): Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 7139 min:0.005ms|avg:0.015-0.031ms|mid:0.000ms|max:0.000ms|duration:22.600s We have Burning CPU with 8947 min:0.004ms|avg:0.014-0.031ms|mid:0.000ms|max:0.000ms|duration:22.342s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 10772 We have Burning CPU with 10773 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 14.7651 s, 14.2 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 16.8547 s, 12.4 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 18.5809 s, 11.3 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 19.6679 s, 10.7 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 20.7152 s, 10.1 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 22.0414 s, 9.5 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD 
Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished min:0.004ms|avg:0.018-0.033ms|mid:0.000ms|max:0.000ms|duration:24.033s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.27-13-generic x86_64 This is ext3.
What for you brake disks a finger? yura@suse:~/Desktop> sh kernel-test.sh Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 14170 min:0.003ms|avg:0.006-0.007ms|mid:0.000ms|max:0.000ms|duration:4.725s We have Burning CPU with 14815 min:0.004ms|avg:0.006-0.007ms|mid:0.000ms|max:0.000ms|duration:4.752s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 15470 We have Burning CPU with 15471 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 2,45896 c, 85,3 MB/c 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 4,33352 c, 48,4 MB/c 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 4,51529 c, 46,4 MB/c 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 5,22602 c, 40,1 MB/c DD Finished DD Finished DD Finished 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 5,97021 c, 35,1 MB/c DD Finished DD Finished DD Finished 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 6,38097 c, 32,9 MB/c DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD 
Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished min:0.003ms|avg:0.006-0.007ms|mid:0.000ms|max:0.000ms|duration:6.047s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.28.5-default x86_64
$ ./kernel-test.sh Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 4215 min:0.005ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:14.822s We have Burning CPU with 5656 min:0.007ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:15.624s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 7403 We have Burning CPU with 7404 200+0 poster in 200+0 poster ut 209715200 byte (210 MB) kopierade, 12,7466 s, 16,5 MB/s 200+0 poster in 200+0 poster ut 209715200 byte (210 MB) kopierade, 15,3423 s, 13,7 MB/s 200+0 poster in 200+0 poster ut 209715200 byte (210 MB) kopierade, 17,363 s, 12,1 MB/s 200+0 poster in 200+0 poster ut 209715200 byte (210 MB) kopierade, 18,3437 s, 11,4 MB/s 200+0 poster in 200+0 poster ut 209715200 byte (210 MB) kopierade, 18,9163 s, 11,1 MB/s 200+0 poster in 200+0 poster ut 209715200 byte (210 MB) kopierade, 19,3732 s, 10,8 MB/s min:0.005ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:18.564s IO Finished before than processing --- Finish --- Kernel tested: 2.6.29-rc7-zen2-ARCH-20090309 x86_64
I have noticed that CPU clock scaling responds sluggishly during heavy IO. From time to time it stays at the lowest clock rate, although there was CPU-intensive, if discontinuous, work in other processes. I just had a freeze for 20 seconds during such a state.
I could move the mouse, but the cursor did not change. All panels were working, but I could not move or switch windows.
Gonzalo, is it possible to include the motherboard chipset in the test? It would be interesting to see if everybody who's affected have the same or similiar chipsets... Here's another test result, with 2.5.29 RC7. Still affected by the bug, on ext4. khaal@Xeraphim:~/Desktop/test-suite-bug-12309-v2$ sh kernel-test.sh Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 9080 min:0.007ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:23.801s We have Burning CPU with 14728 min:0.007ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:22.593s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 19811 We have Burning CPU with 19812 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 13.901 s, 15.1 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 15.2808 s, 13.7 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 15.4188 s, 13.6 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 16.1941 s, 13.0 MB/s DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 16.6363 s, 12.6 MB/s DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 17.1937 s, 12.2 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD 
Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished min:0.004ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:18.957s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.29-020629rc7-generic x86_64
Created attachment 20503 [details] Results in ODF for spreadsheet This shows the information recovered by each of the tests performed.
Created attachment 20504 [details] Results in ODF for spreadsheet This shows the information recovered by each of the tests performed.
I uploaded a spreadsheet to show the results... To me it looks like high IO is affecting the scheduler or the processor. The effect is not very large in these tests, but it may matter when long-running processing takes place. It is very significant that the increase is always about 2 seconds for everyone, including the tests of Yuriy Lalym, where the CPU task normally takes only 4.7s and the processing time increases by 1.3 secs. Why always around 2 secs? We can also see that ext4 does not really seem to be affected. Maybe because of its throughput? It would be interesting to know the filesystem tested by Khalid Rashid, because there the CPU task takes less time to complete under high IO, like for me on ext4. What "DD Finished" says is that the last IO transfer finished before the CPU-intensive task. Maybe this also affected the result. OK. I will fix the output format of the test suite program and include other tests. Temp files will also be deleted after the tests. What other tests should be included? I will try to look into the fsync problem so I can include it in the tests. I will also try to report the motherboard chipset as requested... Any ideas on what to test?
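The "increase" discussed above can be computed directly from the duration values in the test output. The helper below is only a sketch: io_penalty is a made-up name, and taking the mean of the two baseline runs is an assumption about how the comparison is meant.

```shell
#!/bin/sh
# Hypothetical helper: how much longer the CPU task ran under IO load.
# $1, $2 = the two baseline durations (s), $3 = duration under high IO (s).
io_penalty() {
    LC_ALL=C awk -v a="$1" -v b="$2" -v c="$3" \
        'BEGIN { printf "%.2f\n", c - (a + b) / 2 }'
}

# Durations from Yuriy Lalym's run earlier in this thread:
io_penalty 4.725 4.752 6.047
```

A negative result means the CPU task finished faster under IO load than in the baseline runs, which is what the last ext4 report shows.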
I have one question for the kernel developers... How much processor time is normal for a dd process using DMA? I have two hypotheses: 1.- The kernel is taking too much time switching the process in and out even while it is blocked on IO. 2.- There is a lock that prevents the scheduler from running freely... How can I track down the processor time of a program (say dd)? I want to see whether the times for each kind of process are normal. Current computers are fast, and sometimes we do not realize that a process is taking too much time to complete. Are there any good ways to profile the kernel looking at only one PID? I want to profile specific parts of the kernel. Any good docs? Thank you all! I forgot to say: for now, don't use the test suite anymore until the new tests are here.
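One low-tech way to see how much CPU time a single process such as dd has used is to read fields 14 (utime) and 15 (stime) of /proc/<pid>/stat. The sketch below does only that; cpu_ticks is a made-up name, and it does not profile the kernel itself.

```shell
#!/bin/sh
# Sketch: pull utime/stime (fields 14 and 15) out of a /proc/<pid>/stat line.
# cpu_ticks is a made-up name; values are in clock ticks (see getconf CLK_TCK).
cpu_ticks() {
    rest=${1##*) }   # strip "pid (comm) "; comm itself may contain spaces
    set -- $rest     # $1 is now field 3 (state), so field 14 is ${12}
    echo "utime=${12} stime=${13}"
}

# Usage against a live process (PID 1234 is a placeholder):
#   cpu_ticks "$(cat /proc/1234/stat)"
```

Sampling this twice, a few seconds apart, and dividing the difference by the value of getconf CLK_TCK gives the user and system CPU seconds the process consumed in that interval.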
Server on Xeon based, internal HDD SATA2 (no RAID), SLES 10 SP2 Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 31607 min:0.004ms|avg:0.013-0.049ms|mid:0.000ms|max:0.000ms|duration:19.071s We have Burning CPU with 7637 min:0.004ms|avg:0.015-0.057ms|mid:0.000ms|max:0.000ms|duration:21.218s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 15831 We have Burning CPU with 15832 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 1,0195 секунд, 206 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 1,04578 секунд, 201 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 1,26246 секунд, 166 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 1,90053 секунд, 110 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 2,19354 секунд, 95,6 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 2,22529 секунд, 94,2 MB/s min:0.003ms|avg:0.014-0.060ms|mid:0.000ms|max:0.000ms|duration:20.705s IO Finished before than processing --- Finish --- Kernel tested: 2.6.16.60-0.21-smp x86_64 Server on Xeon based, 3-Ware RAID-1 (2 pieces SAS), SLES 10 SP2 Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 22420 min:0.004ms|avg:0.015-0.071ms|mid:0.000ms|max:0.000ms|duration:25.210s We have Burning CPU with 28763 min:0.004ms|avg:0.018-0.083ms|mid:0.000ms|max:0.000ms|duration:33.232s Second Test: Does the process queue get blocked because high IO? 
Starting We have High IO PID 1628 We have Burning CPU with 1629 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 0,335776 секунд, 625 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 0,367063 секунд, 571 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 0,363934 секунд, 576 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 0,430686 секунд, 487 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 0,520617 секунд, 403 MB/s 200+0 записей считано 200+0 записей написано скопировано 209715200 байт (210 MB), 0,531063 секунд, 395 MB/s min:0.004ms|avg:0.014-0.065ms|mid:0.000ms|max:0.000ms|duration:22.025s IO Finished before than processing --- Finish --- Kernel tested: 2.6.16.60-0.21-smp x86_64
bpenglas@PC010233L ~/Desktop/bug $ ./kernel-test.sh Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 10638 min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:14.790s We have Burning CPU with 13523 min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:13.953s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 14793 We have Burning CPU with 14794 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 14.7986 s, 14.2 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 17.6264 s, 11.9 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 19.4253 s, 10.8 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 19.9593 s, 10.5 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.898 s, 9.6 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.9509 s, 9.6 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:14.694s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.29-rc3-zen1-1-07438-g2953ca1 x86_64
Gonzalo, as I stated before I am on ext4 mounted with noatime and nodiratime flags. However, even if my throughput is fast according to the test, my performance still takes a big hit during the tests. I'm considering reformatting my partitions to ext3 so I can get an older kernel running and test how it fares. Also, it would be great to collect the results in one place; I've put one up at http://tinyurl.com/au4fda - feel free to rearrange it to fit your needs. Well done with the test suite, and good bug hunting everyone :-)
(In reply to comment #234) (In reply to comment #243) File system - xfs
(In reply to comment #244) Forgot to mention, on this system, all filesystems are EXT3. That is also without my VMs running, and it's my work machine. I'll try to get with VMs running, and also my home box tomorrow(3/13/09).
My Work machine: bpenglas@PC010233L ~/kernel $ ./kernel-test.sh Using current dir to do IO tests First Test: How much gets to run the CPU intensive task? We have Burning CPU with 16034 min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:19.169s We have Burning CPU with 18771 min:0.005ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:17.182s Second Test: Does the process queue get blocked because high IO? Starting We have High IO PID 21066 We have Burning CPU with 21067 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.8451 s, 9.6 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.7598 s, 9.6 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.9914 s, 9.5 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 24.8323 s, 8.4 MB/s 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 24.9565 s, 8.4 MB/s DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished DD Finished 200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 25.6149 s, 8.2 MB/s DD Finished DD Finished DD Finished min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:15.944s DD Finished IO Finished before than processing --- Finish --- Kernel tested: 2.6.29-rc3-zen1-1-07438-g2953ca1 x86_64 This is while FireFox is open, Audacious is playing music, and two VMWare Workstation VM's running (Windows Vista, and Windows XP). All filesystems are EXT3, main system drive is a WD 80gig at 10kRPM, other drive is 250gig 7.2kRPM. All intel Chipset, with an Core2Dou E8200. 
It's a Dell GX755.
Simple test case: dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync & sleep 10 ; time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync You'd expect the small file to be written fairly quickly - as in a couple of seconds at most. But on every system with a recent kernel I've tried this on, it takes 6-45 seconds. Why the huge range? I'm not sure, but available memory seems to have something to do with it. The more memory in the machine, the longer the smallfile write takes.
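To make the test case above easier to repeat with different sizes and delays, it can be wrapped in a small function. This is a sketch under assumptions: small_write_latency and its parameters are made-up names, GNU dd is assumed, and date +%s only resolves whole seconds, so the original time dd form remains the better tool for sub-second numbers.

```shell
#!/bin/sh
# Sketch only: wraps the two dd commands from the test case above.
# $1 = directory, $2 = big file size in MB, $3 = delay before the small write (s).
small_write_latency() {
    dir=$1
    # Large background write that fills the page cache with dirty data.
    dd if=/dev/zero of="$dir/bigfile" bs=1M count="$2" conv=fdatasync &
    big_pid=$!
    sleep "$3"
    # Time a single 4k synchronous write while the big write is in flight.
    t0=$(date +%s)
    dd if=/dev/zero of="$dir/smallfile" bs=4k count=1 conv=fdatasync
    t1=$(date +%s)
    wait "$big_pid"
    rm -f "$dir/bigfile" "$dir/smallfile"
    echo "small_write_seconds=$((t1 - t0))"
}

# The original parameters would be:
#   small_write_latency /tmp 10000 10
```

Unlike the original commands, this version waits for the big write to finish and deletes both files afterwards.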
(In reply to comment #249) > dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync & > sleep 10 > time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync real 0m1.808s user 0m0.001s sys 0m0.001s I don't think this gets to the issue.
Well, for me it does: dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync & sleep 10 time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 15.8284 s, 0.3 kB/s real 0m16.024s user 0m0.004s sys 0m0.020s
(In reply to comment #249) > dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync & > sleep 10 > time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync 2.6.28-gentoo-r2 (/tmp on reiser3.6, rootfs-drive): >4096 bytes (4,1 kB) copied, 10.618 s, 0.4 kB/s >real 0m10.620s >user 0m0.000s >sys 0m0.077s 2.6.28-gentoo-r2 (/tmp on ext4, other drive): >4096 bytes (4,1 kB) copied, 5,34679 s, 0,8 kB/s >real 0m5.349s >user 0m0.000s >sys 0m0.003s 2.6.27.19-3.2-default (opensuse 11.1) (/tmp on ext3, rootfs): >4096 bytes (4,1 kB) copied, 60.5764 s, 0.1 kB/s >real 1m2.827s >user 0m0.004s >sys 0m0.036s
(In reply to comment #250) > real 0m1.808s > user 0m0.001s > sys 0m0.001s My 1.808s was on 2.6.27-gentoo-r8 with XFS on a 3ware 8-drive SATA RAID.
2.6.28.7 w/reiserFS 4096 bytes (4.1 kB) copied, 6.96955 s, 0.6 kB/s real 0m6.972s user 0m0.001s sys 0m0.026s
André, did you mean to take ownership of this bug away from Jens? The test case I posted earlier seems to be very effective at demonstrating at least one of the issues affecting people in this thread (namely people using ext3 or reiserfs). It appears that xfs and ext4 are better at avoiding these huge latencies - I'm also assuming that the IO scheduler interacts with these filesystems differently. Matt - I don't think this test case shows as much for you because you have such a fast disk array. I imagine that you can write 10GB pretty quickly with an 8-drive array. Try increasing the 10GB to 100GB and increasing the sleep to 20-30 seconds so that you get more data waiting to be flushed to disk.
David, I want to stress that while my earlier test results looked good on my ext4 filesystem, I was still affected by the slow performance. I think we need a (different?) way to measure desktop responsiveness in order to get actual values from there too.
Where are your performance numbers from the test case, Khalid, and what is your hardware setup like? André posted numbers in comment #252 on ext4 which, while better than his ext3/reiserfs numbers, are still very poor, IMO. It seems fairly likely that there are multiple bugs causing similar symptoms and that all of them have been jumbled into this bug report. Jens has asked for a simple test case illustrating at least one issue discussed in this thread. I have presented one extremely simple test case which duplicates the problems I (and others) am seeing. Feel free to create another.
sorry, didn't mean to reassign the bug in the first place
I've been doing some testing - two tunables I've found (briefly mentioned earlier) that help immensely are /proc/sys/vm/dirty_background_ratio, set to 1, and /proc/sys/vm/dirty_ratio, set to 2. On some of the systems I've run the test on, this reduces latency to a fraction of a second - on other systems it reduces it from 20+ seconds to less than 10. Does anyone else see similar behaviour with my simple test?
(In reply to comment #259) > I've been doing some testing - two tunables I've found (briefly mentioned > earlier) that helps immensely is setting /proc/sys/vm/dirty_background_ratio > to > 1 and /proc/sys/vm/dirty_ratio to 2. > > On some of my systems that I've run the test on it reduces latency down to a > fraction of a second - on other systems it reduces it from 20+ seconds to > less > than 10. > > Anyone else see similar behaviour with my simple test? > This is right. Although it doesn't eliminate stutter (mouse freezing for 1-2 seconds) during heavy IO, it does make that stutter tolerable. It's basically converting your IO to almost synchronous, inline writes instead of leaving the work for pdflush to pick up later and choke the hell out of the IO subsystem. I have no idea why on larger memory configurations those default values are set as high as 40 and 20 (IIRC). I mean, on a 4GB RAM system we may not see any IO landing until the expiry timers fire in pdflush or 40% of 4GB = 1.6GB is ready to be written.
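The 40% of 4GB figure above is easy to redo for other machines and ratios. The helper below is a rough estimate only: dirty_limit_mb is a made-up name, and it applies the ratio to total RAM, whereas the kernel applies it to dirtyable (free plus reclaimable) memory, so the real threshold is lower.

```shell
#!/bin/sh
# Rough upper bound, in MB, on dirty page cache for a given ratio.
# $1 = memory in MB, $2 = ratio in percent. dirty_limit_mb is a made-up name.
# The kernel uses dirtyable (free + reclaimable) memory rather than total RAM,
# so this overestimates the real threshold.
dirty_limit_mb() {
    echo $(( $1 * $2 / 100 ))
}

dirty_limit_mb 4096 40   # the 4GB / 40% case from the comment above
dirty_limit_mb 4096 2    # the same machine with dirty_ratio=2

# Current settings can be read without root:
#   cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
```

With the ratio at 2 the same 4GB machine would start forcing writeback after roughly 80MB of dirty data instead of about 1.6GB, which fits the much shorter stalls reported with the lowered values.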
PC010233L vmware # dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
[1] 10528
PC010233L vmware # sleep 10
PC010233L vmware # time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.00333981 s, 1.2 MB/s
real 0m0.054s
user 0m0.000s
sys 0m0.000s
PC010233L vmware #
PC010233L vmware # time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.604249 s, 6.8 kB/s
real 0m3.219s
user 0m0.000s
sys 0m0.000s

The second time I ran the second dd was about 2 minutes later. My / (or /tmp) is located on a WD 10K RPM SATA II drive.

And after fixing the dirty ratios....

PC010233L vmware # dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
[1] 10548
PC010233L vmware # sleep 10
PC010233L vmware # time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 1.41179 s, 2.9 kB/s
real 0m2.044s
user 0m0.000s
sys 0m0.002s
PC010233L vmware # time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000649804 s, 6.3 MB/s
real 0m6.366s
user 0m0.000s
sys 0m0.002s
PC010233L vmware #

Again, the second one was about 2 minutes afterwards.
(In reply to comment #261) Brandon, this test case doesn't seem to reproduce any significant latency issues for you. I suspect that 10k RPM disk is able to write fast enough to keep a significant amount of data from being buffered in memory. 1.5 seconds isn't great, but all my systems are at least 5 times worse than that and often 10-40 times worse. Do you notice a large latency hit on the system when the large write is running? Why are you running that second small write afterwards? Was the big write done at that point or not? The latency of your small writes does seem to vary by quite a bit.
(In reply to comment #255)
> Matt - I don't think this test case works for you as much because you have
> such a fast disk array. I imagine that you can write 10GB pretty quickly with
> an 8-drive array. Try increasing the 10GB to 100GB and increasing the sleep to
> 20-30 seconds so that you get more data waiting to be flushed to disk.

Setting dirty_background_ratio=1 and dirty_ratio=2 had a HUGE effect on my system.

$ dd if=/dev/zero of=/var/tmp/bigfile bs=1M count=100000 conv=fdatasync & sleep 30 ; time dd if=/dev/zero of=/var/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 6.96642 s, 0.6 kB/s
real 0m8.590s
user 0m0.000s
sys 0m0.004s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 1354.9 s, 77.4 MB/s

# echo 1 > dirty_background_ratio ; echo 2 > dirty_ratio

$ dd if=/dev/zero of=/var/tmp/bigfile bs=1M count=100000 conv=fdatasync & sleep 30 ; time dd if=/dev/zero of=/var/tmp/smallfile bs=4k count=1 conv=fdatasync
[1] 22718
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.72366 s, 5.7 kB/s
real 0m0.725s
user 0m0.000s
sys 0m0.001s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 359.02 s, 292 MB/s
(In reply to comment #262) > (In reply to comment #261) > Brandon, this test case doesn't seem to reproduce any significant latency > issues for you. I suspect that 10k RPM disk is able to write fast enough to > keep a significant amount of data from being buffered in memory. 1.5 seconds > isn't great, but all my systems are at least 5 times worse than that and > often > 10-40 times worse. > > Do you notice a large latency hit on the system when the large write is > running? > > Why are you running that second small write afterwards? Was the big write > done > at that point or not? The latency of your small writes does seem to vary by > quite a bit. > The large write took a while to complete (about 10 minutes.. and only got to 5.3gig before I killed it), and yes, VERY degraded performance... took me a while to ssh in and kill it.. as local was almost unusable. The first small write wasn't when the system started lagging out on me... it was when the ram usage was going up, and cpu usage was going up, so I decided to run it again at a later point just to see. I can try doing the writes on my 7.2K RPM disc tomorrow when I'm back at work. just need to point the output to a different partition.
David Rees, all my test results are presented here: https://spreadsheets.google.com/ccc?key=p3aerC-xkjEqvo7BvMHaxXg&hl=en and my computer components can be seen here: http://h10025.www1.hp.com/ewfrf/wc/prodinfoCategory?lc=en&cc=se&dlc=sv&product=3387690&lang=sv& I also tried this on a WD Raptor drive, just to make sure that faulty hard drives were not the cause, and the symptoms were still present.
PC010233L ~ # dd if=/dev/zero of=/home/bigfile bs=1M count=10000 conv=fdatasync &
[1] 22333
PC010233L ~ # sleep 10
PC010233L ~ # time dd if=/dev/zero of=/home/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 6.27386 s, 0.7 kB/s
real 0m6.275s
user 0m0.000s
sys 0m0.000s
PC010233L ~ # time dd if=/dev/zero of=/home/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.4702 s, 1.7 kB/s
real 0m2.482s
user 0m0.000s
sys 0m0.000s

This was going to /home, which is on a 250 GB 7200 RPM SATA II drive. Also, even though the second one (run about a minute or two later) completed quickly, it was about another 10 seconds until I got the prompt back.
(In reply to comment #249)
Same system as in #254, but I changed the kernel to the latest rc of .29.

2.6.29-rc8 w/reiserFS
4096 bytes (4.1 kB) copied, 1.2374 s, 3.3 kB/s
real 0m2.843s
user 0m0.001s
sys 0m0.003s
Just a quick note: I've been having considerable trouble with kernels since 2.6.17 as well, yet I recently ran across this article http://kerneltrap.org/node/3000, citing: "Kernel maintainer Andrew Morton has said that he runs his desktop machines with a swappiness of 100"... That made me wonder whether my swappiness of 1 might not be such a good idea. An example of misbehaviour which I was actually crediting to this bug can be seen here: http://hfopi.org/files/temp/time-trouble.jpg (look at the three different clock times). This problem was the result of physical memory running full, which was happening a lot (VLC mem leak..), stalling the system sometimes for hours. Well, setting swappiness .. ah, let me quote Andrew: "I'm gonna stick my fingers in my ears and sing 'la la la' until people tell me 'I set swappiness to zero and it didn't do what I wanted it to do'." .. well, here I am. To all of you: setting swappiness to extremely low values is a bad idea and won't achieve what you expect it to. So that might actually be your problem if you have done so; run echo 100 > /proc/sys/vm/swappiness and repeat the test.
(In reply to comment #268)
I see the problem, and I've never touched 'swappiness'.
$ cat /proc/sys/vm/swappiness
60
Actually, I have no swap at all.
# swapon -s
swapon: /proc/swaps: No such file or directory
Well, Mathieu Desnoyers did a fix for the write-cache accounting which stops the kernel write cache from eating up all available memory + swap. Without the fix, the slowness is solved by setting swappiness to 0 or disabling swap. The fix is AFAIK not in 2.6.29.
So, we have one guy (#268) saying high swappiness will solve the problem and the other guy (#270) saying setting swappiness to 0 will solve the problem. I have a feeling neither is going to work, because I have run my system with both and this bug appears under high IO load in both cases. But I would like to see what others find.
The swappiness setting is irrelevant to this bug, as it is a disk io problem no matter which way you look at it. Yes, if you are swapping, this bug will cause the system to be even slower. p.s. I'm convinced that swap is evil. I just disable my swap, and my system works much better, especially when I get a run-away memory hog process.
Hi:

A) If you have multiple hard drives, they are not equally affected. If you copy a file (e.g. 7 GB) from drive A to drive B, a job running on drive C does not slow down, except perhaps if a swap file is in use. A job, in my case, is a vmware virtual machine. I have been spreading machines over different hard drives to reduce the trouble.

B) Isn't this slowdown a planned action of the system? About /proc/sys/vm/dirty_ratio:
> Note that all processes are blocked for writes when this happens
(see below for the original text). This is what slows everything down. IMHO, it should be: if "dirty_ratio" is reached, slow down the job that is creating so much "dirt" and leave the other ones alone.

Cut out from http://www.westnet.com/~gsmith/content/linux-pdflush.htm
8< -------------------
Process page writes

There is another parameter involved though that can spill over into management of user processes:

/proc/sys/vm/dirty_ratio (default 40): Maximum percentage of total memory that can be filled with dirty pages before processes are forced to write dirty buffers themselves during their time slice instead of being allowed to do more writes.

Note that all processes are blocked for writes when this happens, not just the one that filled the write buffers. This can cause what is perceived as an unfair behavior where one "write-hog" process can block all I/O on the system. The classic way to trigger this behavior is to execute a script that does "dd if=/dev/zero of=hog" and watch what happens. See Kernel Korner: I/O Schedulers for examples showing this behavior.
8< -------------------
Reference: http://www.westnet.com/~gsmith/content/linux-pdflush.htm

Does someone have an idea how to slow down the IO-heavy job (automatically)? If the throughput of dd, rsync or "whatever" were reduced the moment a trigger value is reached, the problem would only affect dd, rsync, ... and not the rest of the system.
Hi again: My test is to throttle the bandwidth using "rsync --bwlimit=<throughput>". I am testing using vmware on /images3. Vmware runs smoothly until I copy a lot (a 7 GB vmdk file) to /images3, which is a separate hard drive holding the .vmdk files of 5 vmware systems. Copying this 7 GB file freezes the vmware systems for > 30 seconds. And now with limited bandwidth...

All jobs run fine, no hanging or anything else:
rsync --bwlimit=10000 /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test
rsync --bwlimit=20000 /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test

Some jobs start to become slow and hang:
rsync --bwlimit=30000 /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test

A lot of jobs hang and are very slow, some freeze:
rsync --bwlimit=40000 /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test

This is my estimation: rsync is creating more dirt than the "kernel" can get rid of, and the system is put into this "processes are blocked for writes" mode (see previous posting). I hope that my input can help.
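For anyone who wants to try the same throttling idea with a tool other than rsync, here is a rough Python sketch of a rate-limited copy. It is not how rsync implements --bwlimit; the function name and parameters are invented for this example. It only limits how fast the writer produces dirty pages, so the page cache still buffers whatever is written.

```python
import io
import time


def throttled_copy(src, dst, bytes_per_sec, chunk_size=64 * 1024, sleep=time.sleep):
    """Copy src to dst, pausing so the average rate stays near bytes_per_sec.

    src and dst are binary file objects. `sleep` is injectable so the
    pacing logic can be exercised without actually waiting.
    """
    if bytes_per_sec <= 0:
        raise ValueError("bytes_per_sec must be positive")
    start = time.monotonic()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes may have been written.
        earliest = start + copied / bytes_per_sec
        delay = earliest - time.monotonic()
        if delay > 0:
            sleep(delay)
    return copied


if __name__ == "__main__":
    # Tiny in-memory demonstration: 256 KiB at roughly 1 MiB/s.
    source = io.BytesIO(b"\0" * (256 * 1024))
    target = io.BytesIO()
    began = time.monotonic()
    total = throttled_copy(source, target, bytes_per_sec=1024 * 1024)
    print(f"copied {total} bytes in {time.monotonic() - began:.2f} s")
```

As far as I know, rsync's --bwlimit value is in units of 1024 bytes per second, so --bwlimit=10000 corresponds to a bytes_per_sec of about 10 * 1024 * 1000 here.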
Created attachment 20656: vmstat with high # of uninterruptible processes

I just had a hang for about 10-15 minutes. My system started to freeze, so I immediately switched to a console and ran "vmstat 1" (see attachment). I sat there and watched it, as I wanted to catch it immediately after it became usable again, so that I could check the load average.

uptime
23:38:18 up 6 days, 4:49, 8 users, load average: 23.30, 26.12, 16.21

23, with a 5 minute load average of 26. OUCH. I have no swap, and I think the problem happened when one of my processes did something to lock up the machine. But take note how many processes are blocked in UNINTERRUPTIBLE sleep at various times...

I think I also realized something very interesting about this bug. It does not occur as readily when you have a fast disk. As I had mentioned in previous comments, my Macbook and my D820 have the same hardware. Well, I'm rarely experiencing this on my D820 now. The only difference I can see, related to IO, is that the D820 just had a 320G 80M/s drive put into it. My Macbook runs at approximately 20-25M/s.

Also, given that I am pretty sure that one of my processes hung the machine, it seems (though I am not a kernel hacker) like this bug may be related to a wait on a mutex or semaphore in a location where it should not be, hence the high number of uninterruptible processes? Could that be?
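If it helps anyone capture the same information during a stall, here is a hedged Python sketch that counts processes in uninterruptible sleep (state "D") by reading /proc/<pid>/stat. The helper names are made up for this example, and it counts processes rather than individual threads, so it will not match the "b" column of vmstat exactly.

```python
import os


def parse_state(stat_line):
    """Return the one-letter state from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the LAST ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_uninterruptible(proc_root="/proc"):
    """Count processes currently in state 'D' (uninterruptible sleep)."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if parse_state(f.read()) == "D":
                    count += 1
        except (OSError, IndexError):
            # The process exited while we were looking, or is unreadable.
            continue
    return count


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        print("processes in D state:", count_uninterruptible())
```

On Linux, tasks in state D are included in the load average, which would explain a load of 23 on a machine whose CPUs are mostly idle.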
There has been more discussion on LKML related to this issue attached to the 2.6.29 kernel release thread. I'll direct interested parties to this post from Ted Tso: http://lkml.org/lkml/2009/3/24/227 Attached to that post is Ted's fsync latency measuring tool. If people have a workload which generates high latency, this tool may be useful for measuring it and then posting that workload to Ted/LKML. His testing tool doesn't do anything much different from my earlier dd test, except that he writes 1MB of data, which may show higher latencies.

For those interested, I picked up a couple of other workarounds for people this is affecting:

1. Mount ext3 in writeback instead of ordered mode. This has the drawback of leaving your data a bit more vulnerable than the default, but data writes will no longer be forced to complete in order with metadata.

2. Increase the IO priority of kjournald:
for i in `pidof kjournald` ; do ionice -c1 -p $i ; done
One theory is that by default kjournald is fighting for IO priority with normal processes. By making the IO priority of kjournald higher, the "important" data (i.e., data that is getting synced to disk) should get written out faster, reducing user-visible latency. See this post/thread for more detail: http://lkml.org/lkml/2008/10/2/205
I've tested the second workaround posted by David above (high IO priority for kjournald), and it definitely improves things in my case. My test is very simple: doing normal upgrades under Ubuntu (esp. kernel packages) always makes Firefox and even Evolution or the whole desktop freeze for several seconds, up to about 20 sec in some cases. With that workaround, the freezes don't last more than ~1 sec; the desktop experience is not really smooth, but I can work during upgrades. So I guess we can track down at least one specific issue here, which may be the major one affecting desktop boxes, and which seems to have appeared (maybe in different ways) between 2.6.17 and 2.6.28. I'm using a fairly basic Toshiba Satellite laptop with 512 MB of RAM and a 4200 rpm HD. Can anybody confirm that too?
Ok. I'm also testing the kjournald option to see if it improves things. I will post after some testing... I want to include the fsync tests you pointed out. I tested the tool and it gave me:

fsync time: 0.0145
fsync time: 0.0205
fsync time: 0.0221
fsync time: 0.0195
fsync time: 0.0177
fsync time: 0.0702
fsync time: 0.0456

What's the correct way to do reliable tests? I will include it in the test suite.
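One way to make runs easier to compare (a suggestion only, not an agreed procedure from this thread) is to reduce each fsync-tester log to a few summary figures. The sketch below assumes the output lines look like "fsync time: 0.0145", optionally followed by other text; the function name and the 1-second "slow" threshold are choices made for this example.

```python
import re
import statistics

# Matches lines such as "fsync time: 0.0145" or "fsync time: 1.0965s".
FSYNC_LINE = re.compile(r"fsync time:\s*([0-9]+(?:\.[0-9]+)?)")


def summarize_fsync(output, slow_threshold=1.0):
    """Summarize fsync latencies (seconds) found in tester output text."""
    times = [float(value) for value in FSYNC_LINE.findall(output)]
    if not times:
        raise ValueError("no 'fsync time:' samples found")
    return {
        "samples": len(times),
        "min": min(times),
        "median": statistics.median(times),
        "max": max(times),
        # Number of samples slower than the chosen threshold.
        "slow": sum(1 for t in times if t > slow_threshold),
    }


if __name__ == "__main__":
    log = """fsync time: 0.0145
fsync time: 0.0205
fsync time: 0.0221
fsync time: 0.0195
fsync time: 0.0177
fsync time: 0.0702
fsync time: 0.0456"""
    print(summarize_fsync(log))
```

Reporting the maximum and the count of slow samples alongside the median matters here, because the complaint in this bug is about occasional multi-second stalls rather than average speed.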
The kjournald option makes my system much more responsive.
Hi Guys,

After reading those LKML messages from Theodore regarding his sync patches, it gave me an idea. Why not just mount my filesystem with the "sync" mount option?

I run the following command on one console...
dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000

And Theodore's fsync-test on another. On the standard test, WITHOUT mounting with sync, I get these results out of Theodore's test...

fsync time: 1.5693
fsync time: 18.8047
fsync time: 21.2672
fsync time: 18.6747
fsync time: 2.3821
fsync time: 2.0494
fsync time: 2.8781
fsync time: 21.6300

Here's a "vmstat 1" snippet. All the lines while the dd is running are roughly the same.
2 9 380388 16716 33412 1409988 0 0 0 15340 806 1188 3 4 0 93
0 8 380388 15748 33428 1411080 0 0 0 16284 1165 2350 7 8 0 85
0 9 380388 16620 33432 1409752 0 0 0 18240 878 1108 5 3 0 92
1 8 380388 16776 33452 1410108 0 0 0 11888 1046 1140 10 8 0 82

When I do the following...
mount -o remount,rw,sync /dev/s/sys /

I get the following benches while running the same dd command...
fsync time: 0.0067
fsync time: 0.0369
fsync time: 0.0208
fsync time: 0.0099
fsync time: 0.1175
fsync time: 0.0337
fsync time: 0.0003
fsync time: 0.0219
fsync time: 0.0110
fsync time: 0.0142
fsync time: 0.0076
fsync time: 0.0146
fsync time: 0.0153
fsync time: 0.1104
fsync time: 0.0061
fsync time: 0.0003

With a "vmstat 1" snippet of...
1 0 380624 1112236 93104 297252 0 0 0 13056 920 1167 5 3 49 43
0 1 380624 1098212 93252 311044 0 0 0 15876 925 1165 5 4 52 38
1 2 380624 1085796 93408 323296 0 0 0 13800 996 1239 10 4 47 38

Did something in the kernel change a couple of years ago in regard to syncing?
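For readers comparing the two vmstat snippets above, here is a small Python sketch that labels the columns. It assumes the 16-column layout that vmstat prints on these kernels (r b swpd free buff cache si so bi bo in cs us sy id wa); newer vmstat versions append further columns, which this sketch simply ignores. The names are chosen for this example.

```python
VMSTAT_COLUMNS = ("r", "b", "swpd", "free", "buff", "cache", "si", "so",
                  "bi", "bo", "in", "cs", "us", "sy", "id", "wa")


def parse_vmstat_line(line):
    """Turn one numeric line of `vmstat 1` output into a dict of ints."""
    fields = line.split()
    if len(fields) < len(VMSTAT_COLUMNS):
        raise ValueError(f"expected at least {len(VMSTAT_COLUMNS)} fields, got {len(fields)}")
    # zip() stops at the shorter sequence, so extra trailing columns are dropped.
    return dict(zip(VMSTAT_COLUMNS, (int(f) for f in fields)))


if __name__ == "__main__":
    async_row = parse_vmstat_line("2 9 380388 16716 33412 1409988 0 0 0 15340 806 1188 3 4 0 93")
    sync_row = parse_vmstat_line("1 0 380624 1112236 93104 297252 0 0 0 13056 920 1167 5 3 49 43")
    for name, row in (("default", async_row), ("sync", sync_row)):
        print(f"{name}: blocked={row['b']} bo={row['bo']} idle={row['id']}% iowait={row['wa']}%")
```

Read this way, the two runs write at a similar rate (the bo column), but without "sync" there are 8-9 blocked processes and no idle time, while with "sync" at most a couple are blocked and about half the CPU time is idle.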
Just an FYI, there were some mm/msync.c "fsync" related changes between 2.6.16.62 and 2.6.17 vanilla. I didn't see the problem until after 2.6.17, but perhaps gentoo had patched the kernel heavily, I don't know. I'll try to do some more diffs between the kernel versions around the time I started having the problem, in case it can help you guys figure it out.
From first 2.6.17 release to first 2.6.18 release (haven't narrowed it down to exact versions), 3 PF_SYNCWRITE related lines have been removed from mm/msync.c. And some PF_SYNCWRITE related stuff in block/cfq-iosched.c was added in 2.6.17 (diff between 2.6.16.62 and 2.6.17), and then removed in 2.6.18. There's also fs/ sync related stuff between 2.6.16.62 and 2.6.17. I hope I'm not spamming. :P
(In reply to comment #280)
> Hi Guys,
>
> After reading those LKML messages from Theodore, regarding his sync patches,
> it gave me an idea. Why not just mount my filesystem with "sync" mount option.

What are the disadvantages of the sync mount option? Reduced bandwidth? Higher latency? The data you posted doesn't show any disadvantages, or maybe I don't know what to conclude from that data?
(In reply to comment #284)
> (In reply to comment #280)
> > Hi Guys,
> >
> > After reading those LKML messages from Theodore, regarding his sync
> > patches, it gave me an idea. Why not just mount my filesystem with "sync"
> > mount option.
>
> what are the disadvantages of sync mount option? reduced b/w? higher latency?
> data you posted doesn't show any disadvantages or may be I don't know what to
> conclude from that data?

It appears that the overall transfer rate has decreased a tiny bit. But the big advantage of not doing "sync" on mount is that the system can queue the writes. So, for anything that fits into kernel queues, the writes appear way faster to the user. That's my understanding of the difference between sync and not using sync.
Oh, I should have given an example. Normally, when doing a dd of say 10M, your write would be several hundred MEGABYTES per second, because it's writing to memory, not disk. In my case, I only get disk speeds, even with 10M. So yeah, the memory queueing is WAAAAY faster until you reach the limit.

One last thing, for the kernel devs, as this may be important... The comment in 2.6.28's version of msync.c is as follows...

/*
 * MS_SYNC syncs the entire file - including mappings.
 *
 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
 * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
 * Now it doesn't do anything, since dirty pages are properly tracked.
 *
 * The application may now run fsync() to
 * write out the dirty pages and wait on the writeout and check the result.
 * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
 * async writeout immediately.
 * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
 * applications.
 */

This is an interesting comment, mainly because there was some logic based on MS_SYNC that was removed from msync.c in 2.6.18 (as I mentioned at the TOP of comment #282). That code would set the PF_SYNCWRITE flag. The code exists in 2.6.17 but not 2.6.18. I haven't checked whether it was the 2.6.18 change that did it, or a previous 2.6.17.x change. Is this a problem, kernel devs???????
I get the same result here when mounting with the "sync" option. I also tried async with ionice -c1 for kjournald, and it doesn't seem to improve the latency measured by fsync-tester.

(In reply to comment #280)
> Hi Guys,
>
> After reading those LKML messages from Theodore, regarding his sync patches,
> it gave me an idea. Why not just mount my filesystem with "sync" mount option.
>
> I run the following command on one console...
> dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000
>
> And Theodore's fsync-test on another. On the standard test, WITHOUT mounting
> with sync, I get these results out of Theodore's test...
>
> fsync time: 1.5693
> fsync time: 18.8047
> fsync time: 21.2672
> fsync time: 18.6747
> fsync time: 2.3821
> fsync time: 2.0494
> fsync time: 2.8781
> fsync time: 21.6300
>
> Here's a "vmstat 1" snippet. All the lines while the dd is running are
> roughly the same.
> 2 9 380388 16716 33412 1409988 0 0 0 15340 806 1188 3 4 0 93
> 0 8 380388 15748 33428 1411080 0 0 0 16284 1165 2350 7 8 0 85
> 0 9 380388 16620 33432 1409752 0 0 0 18240 878 1108 5 3 0 92
> 1 8 380388 16776 33452 1410108 0 0 0 11888 1046 1140 10 8 0 82
>
> When I do the following...
> mount -o remount,rw,sync /dev/s/sys /
>
> I get the following benches while running the same dd command...
> fsync time: 0.0067
> fsync time: 0.0369
> fsync time: 0.0208
> fsync time: 0.0099
> fsync time: 0.1175
> fsync time: 0.0337
> fsync time: 0.0003
> fsync time: 0.0219
> fsync time: 0.0110
> fsync time: 0.0142
> fsync time: 0.0076
> fsync time: 0.0146
> fsync time: 0.0153
> fsync time: 0.1104
> fsync time: 0.0061
> fsync time: 0.0003
>
> With "vmstat 1" snippet of ...
> 1 0 380624 1112236 93104 297252 0 0 0 13056 920 1167 5 3 49 43
> 0 1 380624 1098212 93252 311044 0 0 0 15876 925 1165 5 4 52 38
> 1 2 380624 1085796 93408 323296 0 0 0 13800 996 1239 10 4 47 38
>
> Did something in the kernel change a couple years ago, in regard to syncing?
@ #286: no, regarding msync.c: IMHO MS_[A]SYNC is _not_ related. With the introduction of the Unified (disk) Buffer Cache, msync(MS_ASYNC) became basically a no-op. Every process will see the same contents for a block, whether it uses read() or mmap() to access it. Other unices (without UBC) may behave differently. For MS_SYNC, the situation is more complicated. (IIUC: it is hard to wait for all pages to have been written if other processes may re-dirty them simultaneously.)

This bug / issue is not about throughput, it is about latency and (lack of) responsiveness (of other, unrelated processes).

BTW, to me it seems there are actually two symptoms:
1) initially, the mouse cursor is stuck ("stuck/jerky mouse syndrome")
2) later on, the cursor gets quicker, but the actions (pop-ups, window focus, ...) are still slow.

(1) can be associated with CPU scheduling, unix-domain socket I/O, maybe even pagefaulting of X's code segments.
(2) can be associated with CPU scheduling, pagefaulting of code, or memory shortage ( -->> pagefaulting + induced writing of dirty pages)
File system - xfs (mounted with default options)
dd if=/dev/zero of=test.img bs=572041216 count=1

Kernel 2.6.28.8
# time (cp test.img test1.img && sync)
real 0m7.372s
user 0m0.021s
sys 0m1.152s

Kernel 2.6.29
# time (cp test.img test2.img && sync)
real 0m13.704s
user 0m0.016s
sys 0m1.060s
This bug is present at least from 2.6.15 and up, so it's older than the 2.6.18 (with question mark) reported in this bug.

@breezer:~$ { sleep 5; dd if=/dev/zero of=/tmp/bigfile bs=1M count=5000 conv=fdatasync ; } & /tmp/fsync-tester
[1] 4946
fsync time: 0.0188
fsync time: 0.0142
fsync time: 0.0142
fsync time: 0.0142
fsync time: 0.0143
fsync time: 9.2283
fsync time: 12.0892
fsync time: 11.9867
fsync time: 17.6123
fsync time: 13.5469

I've seen sync times up to 20 seconds. This is Ubuntu 6.06LTS, 2.6.15-53-686 kernel. I am seeing this behaviour on various machines with different hardware. It is a real problem for NFS servers in combination with clients that run Firefox 3.
Anyone willing to do some before and after tests? It looks like the huge filesystem thread has produced some results and latency during large writes should be much better now with 2.6.30-rc1 + Theodore Ts'o's ext3-latency-fixes. http://lkml.org/lkml/2009/4/8/760
Look at the difference in disk throughput when running with dirty_background_ratio=0 and dirty_ratio=0: http://img9.imageshack.us/img9/811/fsyncgraph00.png versus with dirty_background_ratio=40 and dirty_ratio=80: http://img154.imageshack.us/img154/9427/fsyncgraph4080.png Both images are graphs of vmstat output during this command: dd if=/dev/zero of=bigfile bs=1M count=20k conv=fdatasync I collected this data in single-user mode so no other processes were touching the disk. Do note the fairly steady throughput in the first case, in stark contrast with the huge burst at the beginning and end of, and slowness throughout, the second case. In case anyone missed the point, it took 55 seconds to write 20 GB with dbr=0,dr=0 and 593 seconds to write 20 GB with dbr=40,dr=80. For some reason, the page cache appears to be really gumming up the works.
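To put the two run times above into throughput terms, here is the arithmetic as a short Python sketch. It assumes dd's "20k" count suffix means 20 * 1024 blocks of 1 MiB each; the function name is made up for this example.

```python
def throughput_mib_per_s(total_mib, seconds):
    """Average throughput in MiB/s for a transfer of total_mib in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_mib / seconds


# dd bs=1M count=20k: assumed to be 20 * 1024 blocks of 1 MiB.
TOTAL_MIB = 20 * 1024

fast = throughput_mib_per_s(TOTAL_MIB, 55)    # dbr=0, dr=0
slow = throughput_mib_per_s(TOTAL_MIB, 593)   # dbr=40, dr=80

if __name__ == "__main__":
    print(f"dbr=0,dr=0:   {fast:.1f} MiB/s")
    print(f"dbr=40,dr=80: {slow:.1f} MiB/s")
    print(f"slowdown:     {593 / 55:.1f}x")
```

So the same write runs roughly an order of magnitude slower with the higher ratios on that machine, going by the reported times.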
Hi David, I would be willing to do some before/after testing. But it may be a couple of days, at least, before I can. When is 2.6.30 going to be released? Also, I have a local Linus git tree. How do I update it with the latest git, or do I have to re-clone the entire thing again?
Before you run the comparison tests, you should ensure that you use the same journal mode with ext3. The default ext3 journal mode was changed to writeback.
Is there a reliable testcase for the latency issue?
How about XFS, JFS, Reiserfs ???
I'm using XFS, and I have the same latency problems. I've been checking this thread, and testing the proposed ideas, without success. if you want me to test something, I'll happily do it. Cheers, Jose (In reply to comment #296) > How about XFS, JFS, Reiserfs ???
Same here using XFS on a multi disk (8) volume and seeing high IO waits.
For anyone who wants to test, here's what to do:

1. Document latencies with your current setup, which is performing poorly.
2. Document latencies with 2.6.30-rc1 (which should be much better for most people; if you are using ext3, make sure that you mount your filesystem with the same journalling mode, as the default has changed).

To document latencies, start a large streaming write:
# dd if=/dev/zero of=/tmp/bigfile bs=1M count=5000
And run Ted Tso's latency testing tool in parallel (grab/compile it from here: http://lkml.org/lkml/2009/3/24/227)

If you still have questions, read the last 50 or so comments to this bug for more information.
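For those who cannot compile the C tool, the following Python sketch measures something similar. To be clear, this is not Ted's tool: the function name, defaults, and file location are assumptions for this example, and it times only the fsync() call after rewriting a small file. Run it while the large dd is writing to the same filesystem.

```python
import os
import tempfile
import time


def measure_fsync(path, rounds=5, size=1024 * 1024, pause=0.0):
    """Rewrite `size` bytes at the start of `path` and time each fsync().

    Returns a list of fsync latencies in seconds, one per round.
    """
    payload = b"a" * size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(rounds):
            os.lseek(fd, 0, os.SEEK_SET)
            view = memoryview(payload)
            while len(view) > 0:
                # os.write may write fewer bytes than requested.
                written = os.write(fd, view)
                view = view[written:]
            start = time.monotonic()
            os.fsync(fd)
            latencies.append(time.monotonic() - start)
            if pause > 0:
                time.sleep(pause)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    # Uses a temporary directory here; point `path` at the filesystem under
    # test (the one the large dd is writing to) for a meaningful result.
    with tempfile.TemporaryDirectory() as tmp:
        for latency in measure_fsync(os.path.join(tmp, "fsync-probe"), rounds=5, pause=1.0):
            print(f"fsync time: {latency:.4f}")
```

The output format deliberately mimics the "fsync time:" lines posted in this thread so results can be compared by eye, but the numbers are not guaranteed to be equivalent to those from the original tool.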
(In reply to comment #299) > For anyone who wants to test, here's what to do: # uname -a Linux amd64 2.6.29.1 #4 SMP PREEMPT Fri Apr 3 07:27:52 MSD 2009 x86_64 x86_64 x86_64 GNU/Linux # cat /proc/meminfo | grep MemTotal MemTotal: 4127376 kB #cat /proc/cpuinfo | grep -i "Model name" | uniq model name : Dual Core AMD Opteron(tm) Processor 265 # cat /proc/mounts | grep ' / ' /dev/sda2 / xfs rw,noatime,nodiratime,relatime,noquota 0 0 # hdparm -i /dev/sda | grep Model Model=WDC WD1500AHFD-00RAR5, FwRev=21.07QR5, SerialNo=WD-WMAP43732535 /* Western Digital Raptor */ # dd if=/dev/zero of=./bigfile bs=1M count=5000 && ./fsync-tester 5000+0 records in 5000+0 records out 5242880000 bytes (5,2 GB) copied, 69,7789 s, 75,1 MB/s fsync time: 0.0076 fsync time: 0.0091 fsync time: 0.0436 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0358 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0358 fsync time: 0.0359 fsync time: 0.0358 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 ^C
(In reply to comment #300) > # dd if=/dev/zero of=./bigfile bs=1M count=5000 && ./fsync-tester That's supposed to be a single ampersand, which causes the dd process to start in the background so the fsync-tester process can run simultaneously with it.
(In reply to comment #301) > ...to start in the background ... dd if=/dev/zero of=./bigfile bs=1M count=5000 & ./fsync-tester; [1] 5298 fsync time: 0.0266 fsync time: 0.7677 fsync time: 0.6938 fsync time: 0.5879 fsync time: 1.1956 fsync time: 0.9582 fsync time: 0.9866 fsync time: 1.1833 fsync time: 0.6964 fsync time: 0.9986 fsync time: 0.9624 fsync time: 0.9093 fsync time: 0.9999 fsync time: 0.4423 fsync time: 0.8406 fsync time: 1.0880 fsync time: 0.1754 fsync time: 0.9039 fsync time: 0.8727 fsync time: 0.1261 fsync time: 0.2749 fsync time: 0.8547 fsync time: 0.5241 fsync time: 0.8164 fsync time: 0.4006 fsync time: 0.6532 fsync time: 0.8521 fsync time: 0.4151 fsync time: 0.3384 fsync time: 0.3326 fsync time: 0.4330 fsync time: 0.5800 fsync time: 0.8854 fsync time: 0.5953 fsync time: 0.3899 fsync time: 0.6722 fsync time: 0.1056 fsync time: 0.5554 ^C
Created attachment 20972: fsync tester kernel 17 - 30

I have tested the kernels 17, 18, 20, 28, 29, 29 (patched with http://bugzilla.kernel.org/attachment.cgi?id=20172) and 30 (f4efdd65b754ebbf41484d3a2255c59282720650), which should include the patches. I got great results with the patched 29 kernel at the beginning and bad results when executing the test again. Either this test case is not reliable, or my installation is changing parameters while switching kernels. I executed the two commands concurrently (comment #299):
dd if=/dev/zero of=./bigfile bs=1M count=5000 & ./fsync-tester
ASUS P5K
linux suse 2.6.29-53-default x86_64

# cat /proc/meminfo | grep MemTotal
MemTotal: 8196428 kB
# cat /proc/cpuinfo | grep -i "Model name" | uniq
model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
# cat /proc/mounts | grep ' /home '
/dev/sda3 /home xfs rw,attr2,noquota 0 0
# hdparm -i /dev/sda
Model=ST31000340AS /* Seagate SATA2 */
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6

~> dd if=/dev/zero of=./bigfile bs=1M count=5000 & ./fsync-tester
[1] 5346
setting up random write file
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 90.9677 s, 57.6 MB/s
done setting up random write file
starting fsync run
starting random io!
fsync time: 1.0965s
fsync time: 0.4574s
fsync time: 0.7729s
fsync time: 0.3746s
fsync time: 0.5232s
fsync time: 0.1928s
fsync time: 0.9374s
fsync time: 0.6353s
fsync time: 0.3625s
fsync time: 0.4970s
fsync time: 0.3150s
run done 11 fsyncs total, killing random writer
[1]+ Done dd if=/dev/zero of=./bigfile bs=1M count=5000

~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 13 0 38868 164 7778824 0 0 12 23407 959 1940 1 3 0 95
1 13 0 47144 164 7770084 0 0 0 26260 1435 2732 2 3 0 95
0 13 0 39740 164 7774280 0 0 60 30724 1534 2860 2 4 0 94
0 13 0 41124 164 7776080 0 0 0 13888 1103 2038 2 3 0 95
0 13 0 42460 164 7768056 0 0 0 52248 1320 2334 2 3 0 95
1 13 0 40456 164 7776908 0 0 0 3028 1058 1934 2 3 0 95

While the test is running, working in the KDE graphical interface is impossible.
Just tried dd if=/dev/zero of=bigfile bs=1M count=20k conv=fdatasync on 2.6.30-rc2 and top still shows iowait of 70% to 90%, on ext3 filesystem. Motherboard: Gigabyte M57SLI-S4 Distro: Slamd64 12.2 $ cat /proc/meminfo | grep MemTotal MemTotal: 3089672 kB $ cat /proc/cpuinfo | grep -i "Model name" | uniq model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ sda: Model=WDC WD5000AAKS-00TMA0, FwRev=12.01C01, SerialNo=WD-WCAPW4009869 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 I believe the ext3 partition was mounted with data=writeback option, but can reboot and confirm if it is important enough.
(In reply to comment #304) > ASUS P5K > linux suse 2.6.29-53-default x86_64 You're running a kernel that is known to have high write latencies, and it doesn't appear that your fsync latency test is running in parallel with the dd. With 8GB of RAM, you likely need to change your dd to write out at least 10GB of data instead of 5GB. (In reply to comment #305) > Just tried dd if=/dev/zero of=bigfile bs=1M count=20k conv=fdatasync on > 2.6.30-rc2 and top still shows iowait of 70% to 90%, on ext3 filesystem. Your system *should* show high iowait when you're stress testing it like that. If it doesn't, you're not writing to disk as fast as it can handle it. High iowait is normal and expected. It is not an indication of a problem. What is not expected is high latency during those stress tests. Ideally you should see sync latencies of less than a second - if latencies get higher than that you are likely using ext3 data=ordered or a broken kernel. 2.6.30-rc2 was just released - that should be used for future tests.
2.6.30-rc2 fsync-tester shows mostly < 1 second, except a few times when it goes just above 1 sec. fsync time: 0.1964 fsync time: 0.2317 fsync time: 0.2923 fsync time: 0.0565 fsync time: 1.1033 fsync time: 0.2297 fsync time: 0.0124 fsync time: 0.0848 fsync time: 0.1049 fsync time: 0.6525 fsync time: 11.1130 <--- not sure what that was fsync time: 2.2619 fsync time: 0.3535 fsync time: 0.1543 fsync time: 0.2699 Unfortunately, the load average shoots up, peaking at about 8 before I run out of space on the disk. System responsiveness is also affected, but I don't have a meaningful way to quantify it. top - 21:41:06 up 16 min, 6 users, load average: 7.23, 5.93, 3.98 top - 21:42:19 up 17 min, 7 users, load average: 8.12, 6.53, 4.34 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 9 0 19428 10752 2681252 0 0 180 13957 1344 497 2 9 30 59 1 8 0 20100 10780 2680416 0 0 0 47644 2883 1290 2 12 0 86 0 9 0 18908 10816 2681888 0 0 0 22528 2819 858 2 11 0 88 0 10 0 20116 10828 2680952 0 0 4 25080 2865 781 2 7 0 92 0 9 0 18900 10844 2682280 0 0 4 32696 3496 835 0 11 0 90 0 9 0 19040 10876 2681736 0 0 0 29936 3060 1064 1 10 0 89 2 8 0 18880 10892 2680868 0 0 4 47736 2954 731 0 7 0 92 0 9 0 18180 10920 2681448 0 0 0 44160 2723 971 0 13 0 87 /dev/sda4 /home ext3 rw,relatime,errors=continue,data=writeback 0 0
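Editor's sketch: long runs of "fsync time:" lines like the ones posted in this thread are easier to compare once reduced to a few numbers. The helper below is an assumption of how one might do that; the name `summarize_fsync_log` and the 1-second threshold (taken from the "less than a second" guideline in comment #306) are illustrative only.

```python
import re
import statistics

# Matches both "fsync time: 0.1964" and "fsync time: 1.0965s".
FSYNC_RE = re.compile(r"fsync time:\s*([0-9]+\.[0-9]+)")


def summarize_fsync_log(text, threshold=1.0):
    """Return count, max, median and number of samples above `threshold`."""
    times = [float(value) for value in FSYNC_RE.findall(text)]
    if not times:
        return None
    return {
        "count": len(times),
        "max": max(times),
        "median": statistics.median(times),
        "over_threshold": sum(1 for t in times if t > threshold),
    }
```

Fed the fifteen samples from the comment above, this reports a median of 0.2317 s, a maximum of 11.113 s and three samples above one second.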
Hi all! I just ran the tests and obtained this: ###################################################### gad@ws-esp16:~$ ./kernel-test2.sh Using current dir to do IO tests #################### ## System info System: 2.6.28-11-generic i686 Tag: 2.6.28-11-generic Memory MemTotal: 2060636 kB CPU Model: model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz Running in . Mounts: --------------------- rootfs / rootfs rw 0 0 /dev/disk/by-uuid/ee364958-34b6-474e-8e54-9a9eaff56d12 / ext3 rw,relatime,errors=remount-ro,data=ordered 0 0 --------------------- Sda info: Model=ST91608220AS , FwRev=3.ALE , SerialNo= 5MA4TF4V #################### First Test: FsyncProblem Starting ./test-2.6.28-11-generic-1 We have High IO PID 8949 running We have fsync-tester with 8950 running... fsync time: 0.1504 fsync time: 0.5174 fsync time: 0.3664 fsync time: 0.1727 fsync time: 0.2163 fsync time: 0.3080 fsync time: 0.3914 fsync time: 0.1766 fsync time: 0.4800 fsync time: 0.2304 fsync time: 0.4018 fsync time: 0.1159 fsync time: 0.4537 fsync time: 0.1837 fsync time: 0.3032 fsync time: 0.5013 fsync time: 2.0128 fsync time: 0.9343 fsync time: 0.3027 fsync time: 1.2761 fsync time: 0.7145 fsync time: 0.4678 fsync time: 2.0326 fsync time: 0.2019 fsync time: 0.5484 fsync time: 0.3867 fsync time: 0.0912 fsync time: 0.2040 fsync time: 0.3893 fsync time: 0.2703 fsync time: 0.3794 fsync time: 0.5449 fsync time: 0.7379 fsync time: 0.5957 fsync time: 0.6034 fsync time: 0.7915 fsync time: 1.0564 fsync time: 0.5795 fsync time: 0.4501 fsync time: 2.2850 fsync time: 8.1411 fsync time: 1.4754 fsync time: 1.3487 fsync time: 0.9896 fsync time: 0.6221 fsync time: 1.1703 fsync time: 0.2775 fsync time: 0.1842 fsync time: 0.3994 fsync time: 0.5275 fsync time: 0.3382 fsync time: 0.3295 fsync time: 0.6451 fsync time: 0.6803 fsync time: 1.2621 fsync time: 1.3397 fsync time: 0.3250 fsync time: 0.3182 fsync time: 0.3491 fsync time: 0.2745 fsync time: 0.3489 fsync time: 0.5478 fsync time: 0.6009 fsync time: 0.4482 fsync 
time: 0.3772 fsync time: 0.1414 fsync time: 0.2948 fsync time: 0.2228 fsync time: 0.3758 fsync time: 0.3091 fsync time: 0.2624 fsync time: 0.3526 fsync time: 0.0771 fsync time: 0.2078 fsync time: 0.1613 fsync time: 0.2265 fsync time: 0.2759 fsync time: 0.3231 fsync time: 0.3532 fsync time: 0.1200 fsync time: 0.2788 fsync time: 0.4866 fsync time: 0.2710 fsync time: 0.4107 fsync time: 0.4903 fsync time: 0.5680 fsync time: 0.1199 fsync time: 0.3397 fsync time: 0.3929 fsync time: 0.3373 fsync time: 0.4407 fsync time: 0.2629 fsync time: 0.2998 fsync time: 0.2175 fsync time: 0.3119 fsync time: 0.0971 fsync time: 0.1899 fsync time: 0.4977 fsync time: 0.4127 fsync time: 0.2498 fsync time: 0.8439 fsync time: 0.1513 fsync time: 0.1109 fsync time: 0.2506 fsync time: 0.3414 fsync time: 0.1470 fsync time: 0.0558 ./kernel-test2.sh: line 84: 8949 Terminated dd if=/dev/zero of="$io_test_path/test-$info_tag-$i" bs=1M count=5000 oflag=direct ./kernel-test2.sh: line 86: 8950 Terminated ./fsync-tester "$io_test_path/test-$info_tag-$i.fsynctest" ./test-2.6.28-11-generic-1 deleted! ./test-2.6.28-11-generic-1.fsynctest deleted! --- Finish --- Kernel tested: 2.6.28-11-generic i686 ###################################################### I have to say that I killed the dd program manually because it took too much time. I don't know whether getting only about 1.2 MB/s of IO is a problem in itself... I suppose that even for a laptop this is not normal. Anyway, those are the results. I updated the test suite to include this new test. It's called kernel-testsuite.tar.gz and it includes: kernel-test-fsync.sh fsync-tester The package contains Theodore Ts'o's sources, modified to take one parameter for the filename of the output. I hope this helps.
Created attachment 21007 [details] Automatic test suite for this bug V3 This includes the fsync test
(In reply to comment #306) > You're running a kernel that is known to have high write latencies, and it > doesn't appear that your fsync latency test is running in parallel with the > dd. Known to whom? Most likely nobody knows about it: the bug status is still 'NEW'. > With 8GB of RAM, you likely need to change your dd to write out at least > 10GB > of data instead of 5GB. OK (addition to comment #304) dd if=/dev/zero of=./bigfile bs=1M count=15000 & ./fsync-tester ... fsync time: 2.3800 fsync time: 2.4295 fsync time: 2.4099 fsync time: 2.1599 fsync time: 2.0760 fsync time: 2.6152 fsync time: 2.1427 fsync time: 2.4893 fsync time: 2.3252 fsync time: 2.3208 fsync time: 2.4223 ... fsync time: 2.3710 fsync time: 1.3094 fsync time: 1.4473 fsync time: 2.7260 fsync time: 2.2739 fsync time: 2.2078 fsync time: 0.5446 15000+0 records in 15000+0 records out fsync time: 1.5607 15728640000 bytes (16 GB) copied, 201.724 s, 78.0 MB/s procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 5 0 3930476 6108 3873852 0 0 0 74384 883 1632 1 4 0 94 0 5 0 3864644 6108 3941216 0 0 0 64512 667 1088 1 5 0 93 0 4 0 3788956 6108 4015088 0 0 0 73728 943 1738 2 5 0 93 0 5 0 3735848 6108 4070376 0 0 0 53268 666 1181 1 5 0 94 2 5 0 3671468 6108 4135384 0 0 0 65024 735 1277 1 4 0 94 0 4 0 3590356 6108 4213988 0 0 0 77824 860 1590 2 5 0 93 1 5 0 3524484 6108 4280384 0 0 0 64392 749 1495 1 4 0 94
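Editor's sketch: the advice in comment #306 amounts to making the dd write noticeably larger than RAM, so the page cache cannot absorb it. A rough way to derive a `count=` value from the MemTotal line is shown below; the function name `dd_count_for` and the default factor of 1.5 are assumptions, not values anyone in this thread prescribed.

```python
import math


def dd_count_for(meminfo_text, factor=1.5, block_mib=1):
    """Return a dd count= value so that count * block_mib MiB >= factor * RAM.

    `meminfo_text` is the content of /proc/meminfo (or just its MemTotal line).
    """
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            mem_kib = int(line.split()[1])
            break
    else:
        raise ValueError("MemTotal not found")
    mem_mib = mem_kib / 1024
    return math.ceil(mem_mib * factor / block_mib)
```

For the 8196428 kB machine from comment #304 this gives 12007 blocks of 1 MiB at the default factor, and 10006 with `factor=1.25`, which corresponds to the "at least 10GB" suggestion.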
Created attachment 21054 [details] test case: Takes the time of mouse click events All my results show a high probability of high latencies when system time is high. Most posts were about high latencies during high IO with an SSH connection or with the X server. Both use a network/socket connection. The bug may be in the network stack and not in the IO scheduler or block layer. Here is my first test. The "Example Network Job" test (Flexible IO Tester) shows a regression since 2.6.22 (see the last test on http://global.phoronix-test-suite.com/?k=profile&u=ebird-3722-22013-9288 ). And here is the mouse click test. This test case shows exactly the same regression on all kernels and the same behavior I have observed in a real environment. It's !!not!! caused by the fsync bug. The test case just clicks on a label and measures the time until the event arrives. It uses the platform's native input queue (see java.awt.Robot). The test case is only a quick solution and has no error handling. It expects a factor as parameter. A high factor like 40.0 means high sensitivity and produces a high probability of high latencies, but increases the probability of a missing precondition (no high cpu usage and no high system time) on the current kernels. A value below 5.0 means low sensitivity, which reduces the system time and reduces the probability of capturing a high latency event. These values may differ on other machines, as the test has not been run on other machines. For generating the high IO, I have used the following commands, but it's enough to copy a big folder (> memory size) too. # for i in 1 2 3 4 5 6; do dd if=/dev/zero of=t-$i bs=1M count=1K & done The error occurs with the kernels 2.6.17, 2.6.18 and 2.6.20 only while the cache is filling up within the first five seconds.
kernel        no IO        high IO
2.6.17        max 160ms    max 35ms (max 2.859s within the first 5 seconds)
2.6.18        max 152ms    max 101ms (max 2.430s within the first 5 seconds)
2.6.20        max 164ms    max 100ms (max 1.049s within the first 5 seconds)
2.6.27        max 46ms     max 6.988s (during IO)
2.6.28        max 51ms     max 3.778s (during IO)
2.6.29        max 99ms     max 3.632s (during IO)
2.6.30-rc2    max 50ms     max 4.993s (during IO)
Unable to run test on this kernel, because of missing preconditions.
2.6.22
2.6.30-rc2 (smp) max 3.624s (during IO)
An output like this, or no cpu usage, means the preconditions for the test are missing; reduce the factor.
> High total latency of last 19 events at 138.783s - total latency : 646ms
A factor below 5.0 means the test cannot be run on this kernel. P.S. All tests were done on a kernel without SMP support, to reduce multi-core scheduler differences, with a 250Hz timer and without cpu scaling. On multi-core systems you should keep n-1 cores busy with a job like this. # bzip2 -c /dev/zero >/dev/null &
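Editor's sketch: the attached test measures click-to-event latency through the Java input queue. A much cruder proxy that needs no GUI is to ask for a short sleep repeatedly and record how late each wakeup is. This is a different technique from the attachment, shown only as an assumption of how stalls could be observed from user space; `wakeup_overshoots` is a hypothetical name.

```python
import time


def wakeup_overshoots(interval=0.01, samples=100):
    """Sleep `interval` seconds `samples` times; return how late each wakeup was.

    A large value means the process was not scheduled again promptly
    after its timer expired.
    """
    late = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval)
        elapsed = time.monotonic() - start
        late.append(max(0.0, elapsed - interval))
    return late
```

Running `max(wakeup_overshoots(0.01, 1000))` once on an idle system and once during the dd loop above would show whether plain timer wakeups are delayed too, or only input events.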
Created attachment 21055 [details] Complete test log
Hi guys, I have run my test script, which I ran with previous kernels. There is a pretty big increase in performance on 2.6.30-rc3. The BIGGEST difference I noticed in my test output is that vmstat used to report large numbers (10) of "uninterruptible sleep" processes. Now, it's down to about 1-4. I saw some 9 and 10 second fsync latencies, but most were around 0.3 seconds, with some around 1-2 seconds. However, I don't think the kernel is back to what it used to be yet. I never used to have problems with ext3 fsync latencies at all. It used to be that a simple file copy would not noticeably affect the responsiveness of my regular apps. In fact, generally speaking, I never noticed any problems when copying huge files. Now, when copying large files, I still get some choppiness, even with Ted's patches. I'm wondering if the real problem lies in the block IO layer, and not the file system layer?
The reliability of the mouse click test case (Comment #311) can be improved by adding a random reading process. # for i in 1 2 3 4 5 6; do dd if=/dev/zero of=t-$i bs=1M count=1K & done # find / 2>%1 >/dev/null # java MouseClickTester 40 I am able to catch latencies of up to 12 seconds with kernel 2.6.27 (no SMP support). Is there a way to trace such a mouse click event in the kernel? It should be suspend/wait and resume.
Kernel 2.6.30-rc2 Other info see comment #304 TEST 1 ---------------------------------------------------------------------------- yura@suse:~> dd if=/dev/zero of=./bigfile bs=1M count=15000 & ./fsync-tester [1] 4561 fsync time: 0.0401 fsync time: 2.4475 fsync time: 1.7808 fsync time: 1.1141 fsync time: 1.6912 fsync time: 1.0753 fsync time: 1.2931 fsync time: 0.3260 fsync time: 0.3653 fsync time: 0.5603 ..... fsync time: 1.3651 fsync time: 1.0479 fsync time: 1.0806 fsync time: 0.6021 fsync time: 0.4708 fsync time: 1.3952 fsync time: 0.6665 fsync time: 1.4431 fsync time: 1.0893 fsync time: 1.7844 fsync time: 0.6520 fsync time: 0.3665 fsync time: 0.8171 fsync time: 0.7537 fsync time: 1.2100 fsync time: 0.9319 fsync time: 1.1578 fsync time: 1.1377 fsync time: 1.4913 fsync time: 1.0317 fsync time: 0.5870 fsync time: 1.8464 fsync time: 1.4770 fsync time: 1.3934 fsync time: 1.3794 fsync time: 0.7868 15000+0 records in 15000+0 records out 15728640000 bytes (16 GB) copied, 172.839 s, 91.0 MB/s ^C procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 4 0 6189644 808 1572324 0 0 4 116748 1585 1548 2 26 6 67 1 3 0 6098828 808 1663460 0 0 0 84472 973 1538 2 7 0 91 0 4 0 6011692 808 1749652 0 0 0 88416 722 1248 2 6 0 92 0 3 0 5915592 808 1844204 0 0 0 95232 996 1668 1 7 0 92 1 4 0 5834692 808 1925564 0 0 0 77832 672 838 1 6 0 93 0 4 0 5755452 808 2005900 0 0 0 79872 940 1472 1 5 0 93 1 2 0 5664856 808 2096760 0 0 0 88744 746 1316 1 6 0 92 0 4 0 5574556 808 2185520 0 0 0 86368 802 1286 1 6 0 93 0 3 0 5492072 808 2268036 0 0 0 81408 785 1112 1 6 0 93 0 4 0 5412744 808 2347624 0 0 0 78344 926 1400 1 5 0 93 0 3 0 5333768 808 2428624 0 0 0 78848 659 1046 1 5 50 43 0 4 0 5245744 808 2516336 0 0 0 86536 992 1526 1 6 50 42 0 4 0 5153952 808 2605988 0 0 0 89088 947 4596 4 7 48 41 0 3 0 5074720 808 2686532 0 0 0 78336 958 1768 1 6 49 43 0 4 0 4974280 808 2787192 0 0 0 92198 706 1028 1 7 20 72 0 
3 0 4897224 808 2862716 0 0 0 80905 1046 1650 1 5 49 45 0 4 0 4819832 808 2940944 0 0 0 77348 1193 2076 1 6 0 93 1 2 0 4730172 808 3031732 0 0 0 82104 733 1020 1 6 1 91 0 3 0 4648668 808 3112676 0 0 0 86864 994 1674 1 6 50 42 1 3 0 4556864 808 3203828 0 0 0 87232 708 1136 2 6 49 43 TEST 2 ---------------------------------------------------------------------------- yura@suse:~> dd if=/dev/zero of=./bigfile2 bs=1M count=15000 15000+0 records in 15000+0 records out 15728640000 bytes (16 GB) copied, 174.036 s, 90.4 MB/s procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 3 0 45296 0 7683084 0 0 0 79360 1196 2213 1 6 33 60 0 3 0 46896 0 7682140 0 0 0 74752 792 1526 1 6 49 43 0 3 0 48996 0 7681420 0 0 0 79360 1103 2084 1 6 50 42 0 3 0 45216 0 7684640 0 0 0 84480 824 1494 2 7 49 42 1 3 0 46028 0 7684060 0 0 0 78336 1081 1981 1 7 16 76 1 2 0 46960 0 7684032 0 0 0 82536 1138 2168 1 8 0 91 0 3 0 50072 0 7680432 0 0 0 79768 760 1473 1 6 0 92 1 2 0 47396 0 7683396 0 0 0 86680 966 1670 1 6 19 74 1 3 0 48876 0 7681688 0 0 0 75624 758 1304 2 6 0 92 0 3 0 50652 0 7680384 0 0 0 83456 983 1656 1 7 8 83 0 3 0 45072 0 7684236 0 0 0 90624 1151 2103 1 7 47 45 0 3 0 45308 0 7683464 0 0 0 80896 817 1380 1 6 46 46 1 3 0 45936 0 7684280 0 0 0 73216 1049 1807 2 6 46 46 2 2 0 45284 0 7685624 0 0 0 81008 881 1397 1 7 47 45 0 3 0 47208 0 7683352 0 0 0 84405 1055 1642 1 7 47 45 0 4 0 48368 0 7682056 0 0 0 76299 1049 1721 1 5 48 45 TEST 3 (all parallel - one hdd) ---------------------------------------------------------------------------- yura@suse:~> dd if=/dev/zero of=./bigfile4 bs=1M count=15000 (terminal 1) 15000+0 records in 15000+0 records out 15728640000 bytes (16 GB) copied, 481.226 s, 32.7 MB/s yura@suse:~> dd if=/dev/zero of=./bigfile5 bs=1M count=15000 (terminal 2) 15000+0 records in 15000+0 records out 15728640000 bytes (16 GB) copied, 485.821 s, 32.4 MB/s And 
at the same time KDE Menu -> Dolphin -> click -> WAIT 1s -> Dolphin is opened procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 5 0 44792 0 7682396 0 0 116 57368 1016 2083 3 6 0 91 0 5 0 47988 0 7679356 0 0 140 45112 768 1116 2 4 0 94 0 6 0 48080 0 7679688 0 0 744 57352 935 1410 1 5 0 94 3 4 0 45584 0 7679752 0 0 1080 16396 1549 2173 2 3 0 95 1 5 0 46408 0 7680768 0 0 4648 32 1364 1708 4 2 0 93 0 6 0 46648 0 7680468 0 0 1080 32824 1052 1592 3 4 0 93 0 6 0 44884 0 7681468 0 0 8 73453 852 1252 1 6 0 93 0 5 0 48664 0 7676500 0 0 72 44126 825 1770 1 4 0 95 0 6 0 44752 0 7678648 0 0 540 71215 1272 2865 2 6 0 92
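Editor's sketch: the vmstat dumps in these comments are hard to compare by eye. Assuming the classic 16-column layout shown in the headers above (r b swpd free buff cache si so bi bo in cs us sy id wa), the rows can be parsed and averaged as follows. The helper names are hypothetical.

```python
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat(text):
    """Turn the numeric rows of `vmstat 1` output into a list of dicts.

    Header lines are skipped because they are not purely numeric. Extra
    trailing columns (such as `st` on newer vmstat versions) are ignored.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < len(VMSTAT_COLUMNS):
            continue
        if not all(field.isdigit() for field in fields):
            continue
        rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows


def average(rows, column):
    """Mean of one column over all parsed rows."""
    return sum(row[column] for row in rows) / len(rows)
```

With the output saved one sample per line, `average(rows, "wa")`, `average(rows, "b")` and `average(rows, "bo")` give the mean iowait percentage, the number of blocked processes and the write rate in blocks per second for each test.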
This absolutely cannot be an ext3 bug. I'm using reiserfs for my root, and it happens here too. System totally locks up with a swap storm when memory pressure starts forcing things into swap. Firefox using > 2GB memory, and a wine memory bug which causes it to report ~4GB VIRT are what triggers it for me. Killing either one fixes the storm. (which is often not possible because keyboard/mouse are unresponsive) Machine has 4GB RAM, 4GB swap. It must be in the block layer, or elsewhere. It also seems to happen with swap *off*.
(In reply to comment #316) > This absolutely cannot be an ext3 bug. I'm using reiserfs for my root, and > it > happens here too. System totally locks up with a swap storm when memory > pressure starts forcing things into swap. Firefox using > 2GB memory, and a > wine memory bug which causes it to report ~4GB VIRT are what triggers it for > me. Killing either one fixes the storm. (which is often not possible > because > keyboard/mouse are unresponsive) Machine has 4GB RAM, 4GB swap. > > It must be in the block layer, or elsewhere. > > It also seems to happen with swap *off*. Bob, what exact symptoms are you seeing? There is another issue in the kernel, which I have been unable to reproduce for the kernel devs. I have seen it numerous times where the kernel has "futex" deadlocks. It is possible that yours could be related to that, because the performance problem in this bug does not cause a complete lockup. It may seem that way for a bit, but if you leave the machine, it will eventually recover. The futex one appears to be a complete deadlock: no matter how long I leave it, it never recovers.
I recently experienced a new (for me) condition wherein this bug reared its ugly head, and it actually did not involve high disk throughput. I was running mencoder, which was pegging three of my CPU cores and using a fair share of the fourth. It was reading from a file on my RAID and writing to a file on a tmpfs, not particularly quickly on either end since it was doing a lot of number crunching in between. The bug cropped up when I started an rsync at the same time, sending some files from my RAID to a remote system, again not particularly quickly (my upstream network bandwidth is only about 80 KB/s). So I wasn't stressing the disk at all, yet my system came to a crawl. I could literally watch windows repainting themselves on expose events. Pressing Ctrl+Alt+Delete to bring up the KRunner process list took at least a minute, if not more. My disks were churning an awful lot, which was odd given the quite low demands I should have been placing on them. I thought maybe the input file to mencoder might have been heavily fragmented, but I ran xfs_fsr on it, and it said it only had 4 extents. Something is seriously FUBAR here. A possible theory: forcing the disks to seek back and forth to read from the two files "simultaneously" meant that the majority of the time was spent waiting for disk seeks. If the kernel was holding a big lock while waiting for those seeks, it could have seriously degraded the performance of the rest of the system.
The bug I'm seeing is extremely reproducible. (I just wait for about a day with firefox running and lots of tabs open and it will happen) As I mentioned, it occurs when memory pressure starts forcing things into swap. This is not a hard lockup and the system will eventually recover. (Where "eventually" can be > 30 minutes) updatedb and trackerd also make my system unusable, as reported above. I have disabled them as a consequence... Given that I can trigger it, I can run jobs in the background that could log something useful...locks? fsync? What do you suggest? (This system has a quad core intel and raid5 root as well -- don't know if that's related)
(In reply to comment #318) > I recently experienced a new (for me) condition wherein this bug reared its > ugly head, and it actually did not involve high disk throughput. Yes, that is one of the reasons that I believe there is more to it than just ext3 fsync improvements; it doesn't always take a lot to make it happen. Matt, do these things happen on 2.6.30-rc3? My issues have almost disappeared with this release. They're still not gone, which indicates to me that they just didn't hit the nail on the head. But it certainly is WAY better.
(In reply to comment #320) > Matt, do these things happen on 2.6.30-rc3? I'm not willing to run a pre-release kernel. In fact, the kernel is the only package on my Gentoo system that I intentionally maintain at the "Gentoo stable" level, rather than at the leading edge. This is mostly because I don't want to have to reboot every time a new patch set comes out. Right now I'm running 2.6.28-gentoo-r5, which is based on 2.6.28.9. If this bug is indeed improved upon in 2.6.30, then I look forward to the release of 2.6.31! :)
(In reply to comment #317) > > Bob, what exact symptoms are you seeing? There is another issue in the > kernel, > which I have been unable to reproduce for the kernel devs. I have seen it > numerous times where the kernel has "futex" deadlocks. It is > possible that yours could be related to that. Trenton, could you please point me to the bug report for the issue you are speaking of?
I am using Ubuntu 9.04 with the 2.6.30-rc3 x86_64 kernel and I can confirm the whole behavior. The irony is that it feels like Windows 95 while a floppy was being formatted. You know, the whole pseudo multitasking on top of DOS - everything was really choppy. An easy test case is to set up two LUKS-encrypted partitions and copy from one to the other. Even if no core is under heavy load, everything is slow. The same happens with USB transfers too. But as Matt Whitlock pointed out, it is not always a disk IO problem. It can happen even under higher CPU usage. If I encode a DVD with ogmrip/mencoder h264 and 16 threads (16 threads get the highest CPU usage from my quad core, which is still under 80% per core), Gnome feels like Win 95 formatting a floppy. This latter problem has become less severe with 2.6.30-rc3, but it is still noticeably slow, which makes no sense since no core is at 100% load. For a comparison of how it could work: if I fire up Prime95 with 100% load on every core in Windows Vista, I can still play modern 3D games without lag. Windows of course also has flaws with IO and so on, but its CPU multitasking works really well. Way to go imho.
FWIW, I've tried the test proposed by Thomas in comment 314: # for i in 1 2 3 4 5 6; do dd if=/dev/zero of=t-$i bs=1M count=1K & done # find / 2>%1 >/dev/null (the Java part did not start for some reason) I ended up force-rebooting my laptop, since it was impossible to control *after a few seconds*. I could only switch to a VT and back to X, but very slowly, and I couldn't even type a character there or in X. I have 500MB of RAM with a swap of the same size, Pentium M 1500 MHz: not a very high-end config, but that should be sufficient to work, shouldn't it? :-) This was with 2.6.28, I'll try with 2.6.30rc2.
My system also locks up when it tries to access swap. This is on Ubuntu Jaunty with both the Ubuntu 2.6.28 kernel and Ubuntu's vanilla 2.6.30.rc3 kernel. This machine has 4GB of RAM and 4GB of swap and is running on a root ext4 partition. My test case is to run multiple VirtualBox VMs (e.g. Jaunty installations) with say 1.4GB of RAM assigned to each. When I run the third one, as soon as the kernel starts to hit swap, it thrashes the hard drive, X rapidly becomes unresponsive and I have to hard reset the machine. I am able to move the mouse (slowly) but clicking on individual windows doesn't work and the keyboard doesn't respond. atop -d manages to update itself as far as about 300MB of swap use and then stops updating. I've left it as long as 15 minutes to see if it will recover, but it doesn't.
(In reply to comment #325) > My system also locks up when it tries to access swap. > My test case is to run multiple VirtualBox VMs (e.g. Jaunty installations) with > say 1.4GB of RAM assigned to each. When I run the third one, as soon as the > kernel starts to hit swap, it thrashes the hard drive, X rapidly becomes > unresponsive and I have to hard reset the machine. There are definitely some huge issues with the kernel, but I think this is not one of them. If your applications try to use more RAM than is available and keep trying to access/reserve this memory, which is likely with VirtualBox, no other OS would operate well either. Of course it should be possible to switch to a console and run some commands, but I think this has nothing to do with this report. Btw. I forgot to mention that I don't use swap.
(In reply to comment #324) > I ended up force-rebooting my laptop, since it was impossible to control *after > a > few seconds*. It's an extreme test case, as it generates a very high load. You can try with only two concurrent write processes, as your machine is PATA, only 1.5GHz and single core. And start the Java test case first; the order was swapped in my earlier post (it was a long day). # java MouseClickTester 40 # for i in 1 2; do dd if=/dev/zero of=t-$i bs=1M count=1K & done # find / 2>%1 >/dev/null
A small correction: # java MouseClickTester 40 # for i in 1 2; do dd if=/dev/zero of=t-$i bs=1M count=1K & done # find >/dev/null 2>&1
(In reply to comment #326) > There are definitely some huge issues with the kernel, but I think this is not > one of them. If your applications try to use more RAM than is available and > keep trying to access/reserve this memory, which is likely with VirtualBox, no > other OS would operate well either. Of course it should be possible to > switch to a console and run some commands, but I think this has nothing to do with this > report. > Btw. I forgot to mention that I don't use swap. @unggnu: this is not a kernel issue?!!! If multiple apps are trying to reserve more RAM than is available and thus causing continuous access to swap, the kernel should NOT become completely unresponsive and require a hard reset, risking data loss or, in the case of a remote server that you can't hard reset, denial of service. Surely the memory management system should be able to recognise this condition and take appropriate action, e.g. freeze one or more processes with high RAM requirements. At the VERY least it should allow an operator to kill off offending processes, but this is impossible because you can't even log in via ssh or access a console. This is where the test case is relevant to this bug - if the system didn't become completely unresponsive, the operator could fix the problem without a hard reset.
IMO, this bug is long past the point where it is useful. There are far too many people posting with different issues. There is too much noise to filter through to find a single bug. There aren't any interested kernel developers following the bug. The bug needs to be closed and reopened with separate bugs for each issue. Each issue should be reproducible with the latest 2.6.30-rc kernel with a simple test case. Anything else will just result in another huge bug with 300+ comments and no kernel developer interest. (In reply to comment #329) > @unggnu: this is not a kernel issue?!!! If multiple apps are trying to > reserve > more RAM than is available and thus causing continuous access to swap It is not a kernel issue. It is a system configuration issue. If you have a half dozen large memory processes fighting for more memory than is available in the system, causing each of those processes to be continuously swapped in and out as they fight to run, you're going to get horrible performance. You either need more memory, less swap so that the OOM killer can kill a process, or to avoid running so many large memory processes in parallel.
(In reply to comment #330) > IMO, this bug is long past the point where it is useful. Even I (the reporter) have more or less stopped tracking this bug. I absolutely agree. > There are far too many people posting with different issues. > > There is too much noise to filter through to find a single bug. > > There aren't any interested kernel developers following the bug. I would definitely agree; the bug has long outlived its usefulness. Closing with INSUFFICIENT_DATA. > The bug needs to be closed and reopened with separate bugs for each issue. > Each issue should be reproducible with the latest 2.6.30-rc kernel with a > simple test case. Absolutely, all of you who have commented on this bug thus far should open new bugs. While I can't stop anyone from opening bug reports, it is likely that any report without a definite test case reproducing the issue will turn into yet another grab-bag like this one.
Having tracked bugs 7372 and 12309 on the primary issue (performance hitting a brick wall with heavy IO) since October 2007, and now facing the prospect of needing to track yet another one, can I make a plea that whoever opens the new one(s) posts a reference to the new bug ID(s) in this thread?
Thomas: thanks for that update, and indeed the second and more reasonable testcase does not completely kill the system. I'm seeing a possibly interesting phenomenon: the testcase does not trigger any hang when run alone, but when Firefox is started, I can see swap usage rise, and then the mouse won't move for about a second from time to time. So my guess is that when the system needs to swap, even for only a few MB, it's not able to do that smoothly for the user. Maybe there's a scheduling problem when the kernel needs to choose whether to give priority to the swap or to the root partition. Or it's simply because writing to widely separated places on the disk leads to high latencies. Would that be worth a new bug? I think a few of us here are experiencing this problem. I generally agree that this bug is not leading anywhere, but at the same time we don't even know how many different issues there are, so opening new reports is problematic too. Maybe we could concentrate on the few cases we're best able to describe precisely, and hope we all suffer from these...
I found this bug report after I had another "freeze". Just before the freeze, free memory was running out, swap was barely used, buffers were a few hundred kB, BUT CACHE was over 2.7GB out of a total of 3GB of memory. After about 20 minutes I managed to switch to VT1, and by then there was about 500MB of free memory, less cache and increased swap usage. The last output of top showed the kswapd process kicking in. Googling gave me this thread: http://lkml.indiana.edu/hypermail/linux/kernel/0311.3/0406.html
Let's summarize the bugs.
- High cache usage during write processes forces processes to be swapped out. The patch in comment #160 works, but is not included in the Linux tree.
- Fsync bug in ext3 (there is a test case and ongoing activity).
- Too high prioritization of heavy writing processes (copying a big file can delay the start of a program until the copy operation finishes).
- Missing read and write based scheduler.
And finally the annoying bugs:
- Low GUI responsiveness during heavy IO. A reliable test case is still missing.
  - The test case in comment #311 shows high click latencies of 2-12s during heavy IO on non-SMP kernels (on SMP kernels too, but it's not easy to catch such an event).
  - I have a socket ping-pong test (not submitted), which shows latencies of ~2s after the writing processes have finished.
- Low GUI responsiveness in virtual machines. No test case; maybe the same bug as the "Low GUI responsiveness during heavy IO" bug.
GUI responsiveness is not deterministic: there may be a day with almost no latencies and then an hour with continuous latencies of up to 60 seconds.
Does anybody know why the caches are not dropped after I echo 3 to drop_caches? I would expect that number to come down to 0 ideally, or in practice at least to a few megabytes. What I see is that after some usage of the system, the caches keep increasing and never go down with drop_caches. The graph is ever increasing. Almost like a leak of caches. Has anybody debugged this aspect? I think this is one of the primary reasons for the slowdown, because memory is locked in the caches and new memory requests are swapping the crap out of the system.
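Editor's sketch: for reference, the drop_caches knob lives at /proc/sys/vm/drop_caches and accepts 1 (page cache), 2 (dentries and inodes) or 3 (both). It only frees clean objects, so dirty pages have to be written back with sync first. The wrapper below is an assumption of how one might script that sequence; the function name and the `path` parameter (there only so the function can be tried against an ordinary file) are illustrative, and writing to the real file requires root.

```python
import os


def drop_caches(level=3, path="/proc/sys/vm/drop_caches"):
    """Flush dirty data, then ask the kernel to drop clean caches.

    level 1: page cache, level 2: dentries and inodes, level 3: both.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    os.sync()  # dirty pages cannot be dropped, so write them back first
    with open(path, "w") as knob:
        knob.write("%d\n" % level)
```

Given the crash reported in comment #340, this is something to try on a test machine rather than a production box.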
Regarding GUI responsiveness: after the freeze, iotop showed relatively high read IO for the X process. Maybe X's poor responsiveness is caused by waiting for IO as well.
The interesting thing is the cache usage and the inability to drop most of it. From my understanding, cached memory can be dropped if it's not dirty (i.e. it has been written back to disk). This brought me to this thread about lack of writeback: http://marc.info/?l=linux-kernel&m=113919849421679&w=2 On the other hand, /proc/meminfo shows only ~160kB of dirty memory. Cache shows 880868 kB. echo 3 > /proc/sys/vm/drop_caches doesn't do anything. So why can't the cache be freed? Is it possible to have a cache leak?
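Editor's sketch: the comparison made above (Cached versus Dirty) can be read straight from /proc/meminfo. The helper below extracts a few fields in kB; its name and default key list are assumptions. One possible explanation for a large Cached value that survives drop_caches, not confirmed for this poster's machine, is that Cached also counts tmpfs and shared-memory pages, and that pages mapped by running processes stay in use; none of these are freed by drop_caches.

```python
def meminfo_kib(text, keys=("MemTotal", "MemFree", "Cached", "Dirty", "Writeback")):
    """Extract selected fields (in kB) from the content of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if name in keys and parts:
            values[name] = int(parts[0])
    return values
```

Calling it on the file content before and after `echo 3 > /proc/sys/vm/drop_caches` shows how much of Cached was actually droppable.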
Looks like drop_caches stopped working as expected somewhere around 2.6.18: look at first comment: http://jons-thoughts.blogspot.com/2007/09/tip-of-day-dropcaches.html
Be careful using drop_caches. I actually managed to cause a kernel crash by using it in combination with a removable medium. Think it was a double-free bug, but I don't remember for certain.
It has been mentioned time and again that none of the kernel devs have gotten a concise description of the problem and hence none of them seems to have any answers. Well, does anybody know why my caches show 700MB in a 2GB machine and why can't I get rid of any of it? I don't think the question can get any more precise. This is the heart of the problem, folks.
(In reply to comment #341)
> It has been mentioned time and again that none of the kernel devs have gotten
> a concise description of the problem and hence none of them seems to have any
> answers. Well, does anybody know why my caches show 700MB in a 2GB machine and
> why can't I get rid of any of it? I don't think the question can get any more
> precise. This is the heart of the problem, folks.
I don't understand why you'd assume that cache is a problem. The kernel uses available RAM as cache as it's the most productive use for it. To assume that this is buggy behavior is extremely misguided logic.
(In reply to comment #342)
> (In reply to comment #341)
> > It has been mentioned time and again that none of the kernel devs have gotten
> > a concise description of the problem and hence none of them seems to have any
> > answers. Well, does anybody know why my caches show 700MB in a 2GB machine and
> > why can't I get rid of any of it? I don't think the question can get any more
> > precise. This is the heart of the problem, folks.
>
> I don't understand why you'd assume that cache is a problem. The kernel uses
> available RAM as cache as it's the most productive use for it. To assume that
> this is buggy behavior is extremely misguided logic.
What's buggy is that it's not ready to relinquish the cache when asked to drop it or when the memory is needed. echo 3 to drop_caches should drop the damn thing. If I configure swappiness=1, cache should be dropped first and only then should the swap disk be used. I don't like it locking 700MB out of my 2GB RAM, then swapping heavily. If this behavior is by design, someone needs to change that design.
Kernel 2.6.30-rc3: if a task is using about 30 percent of a CPU while file system work (cp, mv, rm) runs at the same time, the computer becomes unusable; starting anything else at that point is practically impossible.
Kernel 2.6.30-rc4 is no better. This bug, "Large I/O operations result in slow performance and high iowait times", has gone from status NEW to an unclear state, yet iowait was high and remains high. Stop these frauds. What data is still necessary?
(In reply to comment #346)
> The kernel 2.6.30-rc4 is not better.
> This bug "Large I/O operations result in slow performance and high iowait
> times" has passed from the status NEW in not clear state but iowait both was
> high and remained. Stop these frauds. What data is still necessary?
No. Kernel folks will not "stop these frauds". Is your system using the latest DDR3 memory running at 2000 MHz? Is it using a Core i7 based processor, overclocked to 4.5 GHz? Does it have SSD drives with at least 150 MB/s writes? Are you using ext4 yet? If all of these are true, and your system still hangs, only then will kernel devs "stop these frauds" and fix this bug. Until then, just use Vista (make sure to upgrade to SP1)... :D
... in case you couldn't tell, I was just kidding with ya! Please file a separate bug report with specific details about what you are experiencing on your system.
Terminal 1 (no other active task)
:~/x1> time cp -r qt-x11-opensource-src-4.5.1 qt-x11-opensource-src-4.5.1-1
real 5m51.075s
user 0m0.147s
sys 0m2.192s
302.6 MB / 351 s = 0.9 MB/s
Terminal 2
:~/x1> vmstat 1
<- only cp
0 0 0 4916172 808 2774512 0 0 24 16248 794 1469 2 3 95 0
2 0 0 4915228 808 2776340 0 0 24 2180 959 1385 2 1 97 0
1 0 0 4913492 808 2778140 0 0 24 3144 841 1251 1 1 97 0
0 0 0 4912500 808 2779104 0 0 24 2636 679 936 2 1 97 0
1 0 0 4910516 808 2781112 0 0 32 2804 862 1258 2 1 96 0
0 1 0 4908872 808 2781812 0 0 36 27160 749 913 5 2 91 2
<- entering a folder in dolphin (100 files in this folder)
2 0 0 4907012 808 2783712 0 0 48 2615 1108 1563 3 2 82 12
0 1 0 4906020 808 2784728 0 0 80 3248 890 1274 2 1 64 33
0 1 0 4905648 808 2785164 0 0 56 2933 705 920 3 1 67 28
0 1 0 4904828 808 2786028 0 0 84 2600 884 1240 2 1 49 47
0 1 0 4903456 808 2787400 0 0 44 3148 723 873 3 1 62 35
0 1 0 4902084 808 2788572 0 0 64 2681 1177 2604 3 1 49 47
0 1 0 4901464 808 2789284 0 0 48 2328 952 1407 2 1 63 34
0 1 0 4900556 808 2790416 0 0 36 2624 951 2373 4 1 59 35
1 1 0 4898040 808 2792868 0 0 60 2672 1224 4018 8 3 46 43
0 1 0 4897032 808 2793868 0 0 80 2760 693 1004 2 1 49 47
0 1 0 4895552 808 2795304 0 0 28 2459 1029 1495 2 1 81 15
0 1 0 4894552 808 2796408 0 0 84 2744 877 1279 2 1 49 47
0 1 0 4892700 808 2798272 0 0 76 2204 773 1078 4 1 48 47
(In reply to comment #348) You need to open a new bug report with a thorough explanation of your test case, expected and observed results, and any pertinent data you may have collected. Leave a note here referencing your newly created bug but posting any data here is not going to help anyone. This bug is closed due to a lack of focus.
I need to get back to 2.6.17, I can't work like this! I have 3GB RAM, of which >2GB are used by cache that won't drop even if memory is running out.
(In reply to comment #349)
> (In reply to comment #348)
>
> You need to open a new bug report with a thorough explanation of your test
> case, expected and observed results, and any pertinent data you may have
> collected. Leave a note here referencing your newly created bug but posting any
> data here is not going to help anyone. This bug is closed due to a lack of
> focus.
Yup. Guys, problems like this aren't solved very effectively via bugzilla. Please prefer to report these issues via email to linux-kernel, myself, and any developers who you think might be relevant. This report has become confusing, and clarity is important. Being able to provide a means by which others can demonstrate the problem is a huge benefit.
Why am I still being CC'd on this bug, even though I'm not on the CC list?
(In reply to comment #352) > Why am I still being CC'd on this bug, even though I'm not on the CC list? Maybe you're watching Jens Axboe (the assignee), Ben Gamari (the reporter), or another user who is still in the CC list.
kernel 2.6.30-rc6
yura@suse:~> export LANG=en
yura@suse:~> dd if=/dev/zero of=test1 bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 129.928 s, 80.7 MB/s
yura@suse:~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
3 8 0 44708 0 7628596 0 0 249 12283 362 821 4 3 66 28
0 8 0 49180 0 7627532 0 0 0 40517 1061 1581 7 4 0 89
0 7 0 47188 0 7627180 0 0 0 59694 1156 1505 5 6 0 89
1 6 0 46692 0 7628276 0 0 0 55553 1144 1476 6 5 0 90
0 8 0 46428 0 7628160 0 0 20 51573 900 1096 5 4 0 90
0 7 0 46568 0 7627860 0 0 0 64024 1127 1480 5 5 0 90
0 7 0 45796 0 7629100 0 0 12 44597 889 987 6 4 0 90
0 8 0 46904 0 7627808 0 0 332 40500 1100 1485 6 4 0 90
0 7 0 47772 0 7626884 0 0 168 45300 1158 1628 6 4 0 90
0 8 0 47216 0 7624456 0 0 72 67116 958 1151 5 5 0 90
0 7 0 47032 0 7626480 0 0 280 29244 1177 1667 5 4 0 91
0 7 0 45936 0 7626640 0 0 248 58872 922 1060 6 5 0 89
0 9 0 44988 0 7626640 0 0 216 62492 945 1359 2 6 0 92
0 8 0 47548 0 7625932 0 0 152 47164 926 1425 1 4 0 95
1 6 0 45276 0 7627256 0 0 36 54721 605 1089 2 4 0 94
0 7 0 48208 0 7626388 0 0 44 43612 834 1198 1 4 0 95
0 8 0 47096 0 7625644 0 0 132 53789 655 1156 1 4 0 94
0 7 0 46344 0 7624828 0 0 468 50292 981 2089 2 4 0 94
0 8 0 46576 0 7625416 0 0 116 44056 1155 2119 1 3 0 96
0 8 0 47476 0 7624800 0 0 636 38936 734 1125 2 4 0 94
0 8 0 47348 0 7626676 0 0 32 58410 885 1613 1 5 0 93
1 6 0 48508 0 7626280 0 0 0 67256 623 969 1 4 0 94
0 7 0 47984 0 7625328 0 0 0 64888 694 1335 2 6 0 92
0 7 0 45800 0 7626692 0 0 0 62496 1002 1698 1 4 0 95
0 7 0 48220 0 7625052 0 0 0 61952 614 1222 2 5 0 93
0 7 0 48508 0 7623300 0 0 0 69632 890 1586 1 5 0 94
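Since most reports in this thread come down to reading the 'wa' column of vmstat by eye, a small helper may make runs easier to compare. This is only a sketch: average_iowait is a hypothetical name, and it assumes the captured text contains the usual vmstat header row with a 'wa' column.

```python
def average_iowait(vmstat_output):
    """Average the 'wa' (I/O wait %) column over all data rows of vmstat output.

    Returns None when no data rows are found.
    """
    wa_index = None
    samples = []
    for line in vmstat_output.splitlines():
        tokens = line.split()
        if "wa" in tokens:
            # Column header row: remember where 'wa' sits.
            wa_index = tokens.index("wa")
            continue
        if wa_index is None or len(tokens) <= wa_index:
            continue
        # Data rows are purely numeric; banner lines such as
        # "procs ---memory---" are skipped by this check.
        if all(token.isdigit() for token in tokens):
            samples.append(int(tokens[wa_index]))
    return sum(samples) / len(samples) if samples else None
```

One caveat: the first data row printed by vmstat reports averages since boot, so it may be worth dropping that row before averaging.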
#354 this bigfile yura@suse:~> time cp bigfile bigfile.cp real 5m52.457s user 0m0.343s sys 0m21.356s calc speed => 10485760000 / 352.457 = 29.75 Mb/s procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 46688 0 7686820 0 0 0 12 564 862 1 0 98 0 0 0 0 46688 0 7686820 0 0 20 0 387 730 2 0 96 1 0 0 0 46688 0 7686840 0 0 0 0 559 879 1 0 98 0 0 0 0 46688 0 7686840 0 0 0 0 598 937 1 1 97 0 0 0 0 46688 0 7686840 0 0 0 0 315 517 2 1 98 0 0 0 0 46704 0 7686840 0 0 0 16 600 1058 2 1 97 0 0 0 0 46704 0 7686840 0 0 0 0 328 473 2 0 98 0 0 0 0 46704 0 7686840 0 0 0 0 610 1122 2 0 98 0 0 0 0 46704 0 7686840 0 0 0 0 582 1013 2 0 98 0 0 0 0 46876 0 7686840 0 0 0 1 341 475 1 0 98 0 0 0 0 46876 0 7686840 0 0 0 0 577 988 2 0 98 0 0 0 0 46876 0 7686840 0 0 0 0 339 543 2 1 97 0 start cp 3 0 0 46500 0 7686704 0 0 17500 0 857 2379 2 2 91 5 3 0 0 43840 0 7689132 0 0 90624 0 2119 5710 4 11 61 24 0 1 0 46532 0 7686180 0 0 83968 0 2008 5246 8 11 57 24 1 1 0 43884 0 7689020 0 0 81024 46 2159 8097 6 10 59 25 0 1 0 45020 0 7687772 0 0 81920 1 1759 3732 4 10 60 26 0 1 0 44948 0 7687472 0 0 91264 0 2154 4449 4 10 60 25 0 1 0 43924 0 7688888 0 0 88064 0 2040 4500 3 11 60 26 0 1 0 46180 0 7686288 0 0 89984 0 1919 4107 3 11 63 22 0 2 0 44692 0 7680932 0 0 86784 39184 2156 4820 4 12 47 38 0 2 0 44568 0 7681376 0 0 64384 22436 1569 4127 3 7 35 54 0 2 0 44092 0 7682832 0 0 35584 37396 1331 2886 3 5 37 55 4 2 0 46920 0 7678572 0 0 42624 43336 1544 3311 3 6 28 63 0 2 0 45724 0 7679280 0 0 49792 31240 1301 3076 2 6 27 64 0 2 0 45328 0 7681288 0 0 41856 31648 1473 3322 3 5 27 65 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 2 0 46328 0 7679272 0 0 52480 29232 1425 3345 3 8 33 56 1 2 0 46276 0 7679748 0 0 48768 24844 1564 3539 3 6 32 60 1 2 0 47196 0 7678088 0 0 63360 28688 1830 4781 3 9 14 74 5 3 0 44052 0 7681744 0 0 58112 23612 1493 
3905 3 8 5 83 1 2 0 44988 0 7679956 0 0 18560 53021 1107 2129 2 4 0 94 0 4 0 46872 0 7677272 0 0 55808 22541 1478 4117 3 7 1 89 0 4 0 43824 0 7681360 0 0 52608 33800 1627 4628 3 7 0 89 0 4 0 45536 0 7679600 0 0 41856 31720 1491 4026 3 6 2 89 0 4 0 45688 0 7680620 0 0 39424 34324 1202 3190 3 6 6 85 1 4 0 46052 0 7679708 0 0 48104 27148 1901 5505 3 7 2 88 0 2 0 44120 0 7680964 0 0 53280 6660 1531 5967 3 8 6 83 2 3 0 45692 0 7679440 0 0 55784 0 1837 6209 4 7 4 84 0 5 0 44796 0 7678908 0 0 52452 10248 1953 6444 4 8 1 87 0 2 0 46952 0 7676344 0 0 45264 24619 2085 8506 6 7 0 87 0 6 0 44820 0 7677336 0 0 34588 41550 2438 5289 2 7 4 86 0 2 0 47084 0 7675392 0 0 31016 34352 1203 3506 3 5 17 74 2 3 0 46856 0 7674440 0 0 10252 67612 685 1250 2 3 11 84 0 5 0 45072 0 7677236 0 0 48004 15368 1575 4859 2 7 5 85 0 5 0 45504 0 7678588 0 0 29824 23240 952 2688 3 3 8 85 0 5 0 45020 0 7678452 0 0 62080 8196 1865 6184 3 9 0 88 0 3 0 44564 0 7679208 0 0 6272 61473 607 926 2 2 24 71 1 3 0 46444 0 7676624 0 0 14216 64046 1059 2083 3 2 41 54 0 2 0 44444 0 7680932 0 0 53636 16048 1448 5787 3 8 10 79 0 2 0 45188 0 7680048 0 0 40320 34024 1439 3785 3 6 1 90 1 3 0 46208 0 7679612 0 0 63872 10248 1628 4998 3 9 1 87 3 4 0 45808 0 7680852 0 0 47360 27152 2030 5505 3 6 7 83 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 5 0 44496 0 7681320 0 0 45828 29452 1683 4815 3 6 0 90 0 5 0 46044 0 7679900 0 0 44160 28704 1273 3575 3 7 3 87 0 5 0 44676 0 7679868 0 0 17280 62076 840 1855 3 3 1 93 0 5 0 44716 0 7681108 0 0 53504 11268 1862 6838 3 6 0 91 0 4 0 44504 0 7681556 0 0 42880 30905 1460 3885 3 7 2 88 0 3 0 44364 0 7680420 0 0 19712 69876 904 1960 2 4 3 91 0 4 0 48668 0 7678024 0 0 52096 21788 1904 5313 3 7 1 88 0 4 0 47032 0 7677752 0 0 28032 54580 1066 2496 3 5 16 76 0 3 0 46600 0 7678904 0 0 41344 33300 1426 4071 3 6 10 80 0 4 0 45424 0 7679184 0 0 41856 40276 1240 3204 3 6 0 91 0 4 0 45692 0 7680156 0 0 46080 
18172 1479 4094 3 6 8 83 0 4 0 45908 0 7679112 0 0 45824 40108 1660 4158 3 7 4 86 0 4 0 44048 0 7681292 0 0 49408 32776 1345 3872 3 7 8 81 0 3 0 46548 0 7678436 0 0 50816 22604 1609 4448 3 7 12 77 0 2 0 46156 0 7678672 0 0 46464 35852 1350 3829 2 7 12 79 0 3 0 45244 0 7678956 0 0 50304 26420 1664 4697 3 7 4 85 0 3 0 48256 0 7675968 0 0 33280 31759 1325 3366 3 5 13 79 0 4 0 44852 0 7679008 0 0 35328 53281 1168 3202 3 5 39 52 0 4 0 46668 0 7677444 0 0 38784 24628 1390 3743 2 5 14 78 0 5 0 45028 0 7680464 0 0 49920 20492 1373 4244 3 7 3 87 0 5 0 45356 0 7681500 0 0 33408 35548 1369 3642 3 5 2 90 0 5 0 46896 0 7679744 0 0 57216 23808 1526 4868 3 7 3 87 1 1 0 45884 0 7679008 0 0 34432 50311 1427 3789 3 5 3 89 2 3 0 44592 0 7679592 0 0 45952 40989 1378 4140 2 7 3 87 1 2 0 44188 0 7680924 0 0 25856 48793 1023 2593 3 4 1 92 0 5 0 45092 0 7680536 0 0 46336 26184 1533 4502 3 6 5 86 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 5 0 44880 0 7681796 0 0 46208 23228 1560 4530 3 7 4 85 0 5 0 46412 0 7679076 0 0 60800 23884 1623 5242 3 8 7 81 0 5 0 46488 0 7679024 0 0 50048 24908 1637 4677 3 7 8 83 0 2 0 44564 0 7681684 0 0 36224 39584 1140 3492 3 4 11 81 0 5 0 45200 0 7681196 0 0 50304 22796 1757 5024 3 7 7 82 0 3 0 46924 0 7672852 0 0 34560 43484 1175 3359 2 6 0 91 0 6 0 45632 0 7675104 0 0 37504 36780 1346 3847 3 6 4 87 0 6 0 45988 0 7673412 0 0 43776 42904 1434 4643 3 6 1 91 0 5 0 48244 0 7669064 0 0 30848 44548 1094 3317 3 5 3 88 0 5 0 286084 0 7439272 0 0 35456 33520 1403 3765 3 7 8 82 2 4 0 181440 0 7543056 0 0 51968 27148 1491 4836 3 6 3 87 0 4 0 119520 0 7601468 0 0 29184 43984 969 2825 3 4 4 89 0 6 0 44456 0 7680104 0 0 43008 34584 1377 4120 3 5 2 89 1 5 0 49396 0 7666304 0 0 29956 56836 1091 3636 2 5 0 92 0 6 0 47008 0 7669068 0 0 37508 22336 1439 4405 3 6 0 91 0 6 0 45312 0 7670448 0 0 20864 32748 1173 3639 2 4 0 94 0 7 0 45792 0 7673916 0 0 29568 14856 996 2870 3 5 1 91 0 6 0 
44136 0 7679304 0 0 16128 63532 1052 2344 2 3 4 90 0 3 0 44800 0 7678884 0 0 49664 15644 1407 5044 2 7 1 89 0 7 0 45180 0 7679664 0 0 44416 28660 1534 4548 2 6 2 89 0 7 0 43892 0 7678188 0 0 30080 35848 1365 4087 3 4 2 91 0 7 0 45176 0 7672668 0 0 35456 39344 1145 3608 3 5 1 90 0 7 0 46560 0 7671748 0 0 25472 36872 1239 3494 3 5 0 92 0 4 0 47096 0 7675136 0 0 30464 24052 1044 3068 2 5 5 88 0 5 0 43864 0 7683468 0 0 34944 31988 1274 3345 3 5 10 82 0 3 0 44452 0 7676068 0 0 32640 43300 1758 5028 2 5 1 93 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 5 0 46088 0 7679004 0 0 26368 34104 951 2415 3 4 10 83 1 7 0 44644 0 7680964 0 0 39296 40568 1457 4161 3 5 4 87 0 3 0 46904 0 7677248 0 0 17280 52767 847 2121 2 4 26 68 0 6 0 45004 0 7679484 0 0 33408 36716 1135 3297 3 5 8 83 0 3 0 45324 0 7680364 0 0 32128 41892 1873 4455 2 5 7 85 0 3 0 45672 0 7679632 0 0 38144 32788 1146 2985 2 5 4 88 0 4 0 44792 0 7681208 0 0 31488 34320 1255 2825 3 4 6 87 0 5 0 44856 0 7678824 0 0 40960 31488 1259 3753 3 6 1 89 5 4 0 46340 0 7678096 0 0 53632 25620 1703 5203 3 7 4 86 0 3 0 46204 0 7678884 0 0 34816 37500 1395 3642 2 5 4 88 0 5 0 47052 0 7671596 0 0 51840 40476 1460 4763 3 6 3 87 0 3 0 44276 0 7676316 0 0 27520 30743 1245 3234 3 4 2 90 0 5 0 46692 0 7678184 0 0 22784 48161 909 2197 2 5 3 90 0 5 0 44396 0 7679168 0 0 46336 30268 1546 4511 3 6 5 85 0 5 0 44352 0 7680436 0 0 43264 26744 1543 4340 2 6 2 90 0 5 0 45496 0 7672684 0 0 44672 37352 1397 4869 3 7 0 90 0 5 0 49372 0 7662316 0 0 33792 49596 1403 3972 3 6 1 90 0 4 0 45644 0 7677232 0 0 24960 22280 926 2576 2 4 3 90 0 6 0 44964 0 7680548 0 0 53120 33176 1705 5679 3 7 1 88 0 7 0 45844 0 7671820 0 0 22456 37240 1288 4498 3 5 0 92 0 8 0 45220 0 7667760 0 0 27404 24584 979 2885 2 5 0 93 0 7 0 45812 0 7666412 0 0 21308 24720 1304 4636 2 3 5 89 0 6 0 44268 0 7664988 0 0 29152 36896 1250 4517 3 5 0 92 1 7 0 48448 0 7667896 0 0 27376 21508 1298 4418 3 
5 0 92 0 3 0 52464 0 7672108 0 0 27776 36073 1239 4920 3 5 0 92 0 4 0 46272 0 7677508 0 0 43008 40522 1351 4645 2 6 0 91 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 7 0 45520 0 7672060 0 0 39296 36952 1499 4941 3 6 0 90 2 6 0 44008 0 7667764 0 0 41088 39800 1286 4794 3 6 0 91 0 5 0 45480 0 7665824 0 0 32512 39428 1362 4147 3 5 1 90 5 7 0 45260 0 7674556 0 0 41728 18204 1464 5160 2 6 1 90 0 6 0 44860 0 7674124 0 0 31360 43012 1069 3452 3 5 4 88 1 5 0 44912 0 7672628 0 0 31744 43300 1365 4663 3 5 0 92 0 4 0 50228 0 7674084 0 0 22656 51238 922 2896 2 5 3 90 1 6 0 44768 0 7680248 0 0 51840 24756 1661 6082 3 7 5 84 0 6 0 47312 0 7678352 0 0 45952 25656 1554 5644 3 6 3 87 0 8 0 44808 0 7675632 0 0 39552 47932 1247 4241 3 6 1 90 0 8 0 45832 0 7664640 0 0 33024 47104 1447 4698 2 5 0 93 0 9 0 46444 0 7664368 0 0 40192 41608 1299 4904 3 6 0 91 0 9 0 45212 0 7673000 0 0 39296 16156 1461 5472 3 5 0 92 5 6 0 44980 0 7673068 0 0 25216 51256 1245 4057 2 5 2 90 0 6 0 44220 0 7681560 0 0 34816 27188 1088 3746 2 6 6 85 0 9 0 44824 0 7680132 0 0 47616 19960 1891 6827 3 6 9 81 0 2 0 45324 0 7678940 0 0 3200 81459 615 855 2 2 42 54 0 6 0 40888 0 7683144 0 0 24080 43565 1102 3157 3 4 28 65 0 9 0 46280 0 7678620 0 0 51460 7176 1738 6587 3 6 3 88 1 1 0 44716 0 7681788 0 0 31360 39488 1113 3891 3 5 14 78 0 9 0 45096 0 7680228 0 0 52864 21536 1640 5803 3 7 6 84 0 9 0 45744 0 7680164 0 0 44032 31268 1332 4769 4 6 2 88 0 10 0 45880 0 7669856 0 0 39336 49300 1433 4746 1 5 0 94 ^C
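The "calc speed" figure in the cp test above can be reproduced with a few lines of Python, which also makes it easy to compare against the 80.7 MB/s that dd reported for writing the same file. This is a sketch; both function names are hypothetical, and MB is taken as 10**6 bytes, which matches the dd output quoted in this thread.

```python
def parse_real_time(value):
    """Convert the 'real' field printed by `time` (e.g. '5m52.457s') to seconds."""
    minutes, _, seconds = value.rstrip("s").partition("m")
    return int(minutes) * 60 + float(seconds)


def throughput_mb_per_s(num_bytes, seconds):
    """Throughput in decimal megabytes per second (1 MB = 10**6 bytes)."""
    return num_bytes / seconds / 1e6


# Copying the 10485760000-byte file took 5m52.457s:
copy_speed = throughput_mb_per_s(10485760000, parse_real_time("5m52.457s"))
# Writing the same file with dd took 129.928 s:
write_speed = throughput_mb_per_s(10485760000, 129.928)
```

With these inputs the copy comes out at roughly 29.75 MB/s against roughly 80.7 MB/s for the plain write, i.e. the mixed read/write workload reaches well under half of the write-only throughput.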
Bug 12309 - "Large I/O operations result in slow performance and high iowait times". Where is the low iowait? Where is the result for small I/O operations? Where? Status: RESOLVED INSUFFICIENT_DATA
http://bugzilla.kernel.org/show_bug.cgi?id=13347
There is an ongoing discussion about a similar issue: http://lkml.org/lkml/2009/5/15/320 and http://lkml.org/lkml/2009/5/16/23
I can confirm this bug.
My OS is Fedora Core release 6
Kernel: 2.6.22.14-72.fc6
2 CPUs: Intel® Xeon® CPU 5130 @ 2.00GHz
HDDs: SAS 3.0 Gb/s, FUJITSU
RAID: Adaptec 4800SAS RAID10
How to test:
# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
In another terminal, while the copy is running, run:
# vmstat 1
I see, for example:
r b swpd free buff cache si so bi bo in cs us sy id wa st
14 8 460 120716 280236 1509844 0 0 9 14 0 0 9 3 66 22 0
0 13 468 121936 279216 1550936 0 0 1368 47776 1927 4153 24 8 8 60 0
0 15 468 121516 280200 1551200 0 0 1408 3744 1726 2846 1 2 3 94 0
0 8 468 129804 280520 1545940 0 0 1612 4280 1854 4060 3 2 1 95 0
0 6 468 131388 281868 1546628 0 0 2140 3620 2020 4650 12 3 13 71 0
0 17 468 114220 282792 1571864 0 0 1208 3212 1647 2715 4 3 6 87 0
1 12 468 115356 283164 1570704 0 0 1420 18964 1718 2397 2 2 2 94 0
0 9 468 114320 283628 1570868 0 0 768 1204 1753 2831 3 1 0 96
iowait is 80-90% during 'dd', and all other CPU tasks run very, very slowly. AND (!!!), the output of 'dd' shows only 9.4 MB/s:
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 112.086 seconds, 9.4 MB/s
For some years I have seen the following behaviour: if the server uses the hard disk heavily (for testing: the 'dd' examples here), iowait stays at 50-90% and many tasks freeze for several seconds (10-20, and maybe more in my case). It's easy to test with 'dd'. I cannot resolve this problem with ionice, for example: iowait is high even if I run the I/O tasks with ionice -c3 or ionice -c2 -n7! So every server running kernel 2.6.18 or later (I have read many topics) has this bug. People in forums write that kernel 2.6.30-rc2 has the bug too, and that FreeBSD stays responsive (mouse movement, video playback and other CPU tasks) during the 'dd' test, unlike Linux... I don't know what evidence you need in order to find this bug! This bug has existed since 2007... Please help!
Here are examples from my loaded server at various times (not dd; only typical MySQL database tasks and Apache tasks):
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
13 14 120 68460 574784 1286748 0 0 13 1 0 0 9 3 66 22 0
1 11 120 74564 576080 1286976 0 0 1560 0 1632 3641 34 10 0 57 0
0 12 120 69988 577572 1287352 0 0 1904 0 1969 3696 5 2 0 93 0
0 11 120 66916 578984 1287860 0 0 1900 0 1809 3615 6 2 0 92 0
0 11 120 64960 580424 1288028 0 0 1668 0 1642 2188 1 1 0 97 0
0 11 120 72764 576508 1286788 0 0 1668 0 1681 2198 3 2 0 96 0
1 11 120 71424 577940 1287300 0 0 1604 332 1575 2152 2 1 0 97 0
3 11 120 58852 579528 1289100 0 0 2000 0 1984 3286 44 7 0 49 0
1 11 120 75104 581012 1287472 0 0 1608 0 2119 2839 39 7 0 55 0
0 13 120 72160 582572 1287672 0 0 1908 120 1645 2366 7 1 0 92 0
[root@63 logs]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 9 120 95540 570248 1276840 0 0 13 1 0 0 9 3 66 22 0
1 7 120 93996 571428 1277440 0 0 1772 33712 2024 4341 28 4 11 57 0
0 7 120 97980 572528 1277884 0 0 1444 300 1568 2339 13 1 17 70 0
0 7 120 99900 573532 1278468 0 0 1504 0 1513 2364 4 2 3 90 0
1 5 120 98656 574484 1278540 0 0 1052 400 1629 1924 2 1 0 97 0
1 3 120 97924 574932 1278916 0 0 480 21108 2276 1987 11 2 47 40 0
1 4 120 87280 575264 1279040 0 0 432 3676 2456 2654 23 2 40 35 0
1 5 120 95856 575668 1279140 0 0 780 4128 2249 3097 26 2 25 47 0
Here you can see a consistently high 'wa' field. When my tasks freeze for 10-20 seconds, I see 80-90% 'wa' there. Please catch this bug! Thanks
Created attachment 21774 [details] Test patch against heavy io bug
I have done a bisection and ended up with these two patches. Reverting them improves the desktop responsiveness on my notebook enormously. I have tested it on a 2.6.28 non-SMP kernel (my heavy I/O testing installation) during four concurrent read and write operations, while working with two VMs. It's only a Core2 @ 2.4 GHz system. I can even start new applications during heavy I/O. I have attached the patch that I applied to my test installation. Use it with care, as I am not a kernel developer and do not know the dependencies in the CFQ scheduler. I have reverted these two patches:
07db59bd6b0f279c31044cba6787344f63be87ea is first bad commit
commit 07db59bd6b0f279c31044cba6787344f63be87ea
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Fri Apr 27 09:10:47 2007 -0700
    Change default dirty-writeback limits
    Do this really early in the 2.6.22-rc series, so that we'll get feedback. And don't change by half measures. Just cut the default dirty limit to a quarter of what it was, and see if anybody even notices.
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
:040000 040000 b63eb9faf5b9a42a1cdad901a5f18d6cceb7fdf6 2b8b4117ca34077cb0b817c77595aa6c9e34253a M mm
a993800655ee516b6f6a6fc4c2ee13fedfb0590b is first bad commit
commit a993800655ee516b6f6a6fc4c2ee13fedfb0590b
Author: Jens Axboe <jens.axboe@oracle.com>
Date: Fri Apr 20 08:55:52 2007 +0200
    cfq-iosched: fix sequential write regression
    We have a 10-15% performance regression for sequential writes on TCQ/NCQ enabled drives in 2.6.21-rcX after the CFQ update went in. It has been reported by Valerie Clement <valerie.clement@bull.net> and the Intel testing folks. The regression is because of CFQ's now more aggressive queue control, limiting the depth available to the device.
    This patches fixes that regression by allowing a greater depth when only one queue is busy. It has been tested to not impact sync-vs-async workloads too much - we still do a lot better than 2.6.20.
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
:040000 040000 07c48a6930ce62d36540b6650e3ea0563bd7ec59 95fc11105fe3339c90c4e7bebb66a820f7084601 M block
Here is the fsync result on my machine:
**************************************************************************
Without patch
Linux balrog 2.6.28 #2 Mon Mar 23 11:19:13 CET 2009 x86_64 GNU/Linux
fsync time: 7.8282
fsync time: 17.3598
fsync time: 24.0352
fsync time: 19.7307
fsync time: 21.9559
fsync time: 21.0571
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 129.286 s, 40.6 MB/s
fsync time: 21.8491
fsync time: 0.0430
fsync time: 0.0448
fsync time: 0.0451
fsync time: 0.0451
fsync time: 0.0451
fsync time: 0.0452
**************************************************************************
With patch
Linux balrog 2.6.28 #5 Fri Jun 5 22:23:54 CEST 2009 x86_64 GNU/Linux
fsync time: 2.8409
fsync time: 2.3345
fsync time: 2.8423
fsync time: 0.0851
fsync time: 1.2497
fsync time: 0.9981
fsync time: 0.9494
fsync time: 2.7094
fsync time: 2.9753
fsync time: 2.8886
fsync time: 2.9894
fsync time: 1.2673
fsync time: 2.6728
fsync time: 1.3408
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 117.388 s, 44.7 MB/s
fsync time: 85.1461
fsync time: 23.5310
fsync time: 0.0317
fsync time: 0.0337
fsync time: 0.0338
fsync time: 0.0338
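For readers who want to collect comparable "fsync time" numbers without the original tester, here is a rough Python sketch of the same idea: repeatedly rewrite a small file, call fsync, and record how long each fsync takes. It is not the fsync-tester program used in this thread; the function name time_fsyncs and its parameters are assumptions made for this example.

```python
import os
import time


def time_fsyncs(path, rounds=5, chunk_size=1024 * 1024, pause=0.0):
    """Rewrite `chunk_size` bytes at the start of `path`, fsync, and time each fsync.

    Returns a list with one duration in seconds per round.
    """
    durations = []
    payload = b"a" * chunk_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(rounds):
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # blocks until the data has reached the device
            durations.append(time.perf_counter() - start)
            time.sleep(pause)
    finally:
        os.close(fd)
    return durations
```

To imitate the test above, one would start a large dd in the background on the same file system and then call something like time_fsyncs("fsync-test.tmp", rounds=20, pause=1.0), printing each value as it arrives.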
Fantastic! Have you bisected the whole kernel tree between 2.17 and 2.20? It's really great that you've found those patches. The first one doesn't seem very important to me, and in 2.6.30 some of its changes have been reverted. But the second one dramatically changes my system's responsiveness. I'm now running with it reverted, and there's no comparison with the old behavior: my pointer no longer freezes when performing updates, and almost everything is smooth! For those who would like to try the patch on 2.6.30, I've updated it as best I could, and I'm attaching it. It's quite dirty and I doubted it would work, but it looks like it's enough. Would a kernel dev look at the patches Thomas identified and tell us what he thinks?
Created attachment 21816 [details] Patch to revert second commit, updated to apply against 2.6.30rc8
(In reply to comment #360) Thank you very much for your work. I can't imagine how long that bisection must have taken, and it is very exciting to have finally found a potential culprit. It would be best for everyone if you opened a new bug report with this information. Developers would be far more likely to look at it if we had a clean slate on which to start.
Are there patches for 2.6.29 available that I can test?
Isn't the second patch just adjusting things which can be adjusted in proc? echo 10 > /proc/sys/vm/dirty_background_ratio echo 40 > /proc/sys/vm/dirty_ratio Someone want to do some tests after adjusting those two?
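To get a feel for what the two values suggested above mean in bytes, the ratios can be applied to the amount of RAM. This is only a back-of-the-envelope sketch: dirty_limit_bytes is a hypothetical helper, and as far as I understand the kernel applies the percentage to the memory it considers dirtyable rather than to total RAM, so the real thresholds will be somewhat lower.

```python
def dirty_limit_bytes(memory_bytes, ratio_percent):
    """Approximate byte threshold for a vm.dirty_*_ratio percentage.

    Assumption: the percentage is applied to `memory_bytes` as given;
    the kernel itself bases the calculation on dirtyable memory.
    """
    return memory_bytes * ratio_percent // 100


two_gib = 2 * 1024**3
background_limit = dirty_limit_bytes(two_gib, 10)  # dirty_background_ratio = 10
hard_limit = dirty_limit_bytes(two_gib, 40)        # dirty_ratio = 40
```

On a 2 GB machine that works out to roughly 200 MB of dirty data before background writeback starts and roughly 800 MB before writers are throttled, which gives an idea of how much data can pile up ahead of a slow disk.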
Created attachment 21822 [details] Backport of the reverted CFQ commit
This is a proper backport of the commit that was identified by Thomas to be the problematic one. Thomas, can you please verify that this makes 2.6.30-rc8 behave better? And if it does, it would be interesting to narrow it down to one single change. The first always makes sure that we drain the queue before servicing a queue that has idling enabled, and the second is just a tweak for idle/async immediate expiration. I think the first one is likely the interesting bit, but it would be good to have confirmation on that. And Thomas, thanks for all your work on this!
(In reply to comment #365) > Isn't the second patch just adjusting things which can be adjusted in proc? > > echo 10 > /proc/sys/vm/dirty_background_ratio > echo 40 > /proc/sys/vm/dirty_ratio > > Someone want to do some tests after adjusting those two? We already determined months ago that tuning those knobs way down was a way to minimize the problem. (See comment #263 and comment #292 for test results.) It's not a solution, though; it just skirts around the real issue.
(In reply to comment #366)
> I think the first one is likely the interesting bit, but it would
> be good to have confirmation on that.
Yes, it is the first one. I could only run my long-lasting test, which can only expose a bad kernel and cannot confirm a good one, but I ran it for a long time and the lame encoding never took unusually long. It took 40 s without any I/O on all kernels, and 48-55 s during heavy I/O with the following lines applied:
+ if (cfqd->rq_in_driver && cfq_cfqq_idle_window(cfqq))
+         return 0;
It took 55-80 s during heavy I/O without any patch or with only the second patch. This may be related too: with the second core enabled and without the first patch, the lame encoding process was shifted between the cores and took up to 130 s. I could see this because the maximum clock frequency switched between the cores.
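For those not familiar with the CFQ code, the effect of the two quoted lines can be illustrated with a toy model. This is not kernel code; may_dispatch and its arguments are invented names that simply mirror the condition in the snippet (requests still in the driver, and the queue having its idle window enabled).

```python
def may_dispatch(rq_in_driver, idle_window_enabled):
    """Toy model of the quoted check.

    Dispatch is held back (the C snippet returns 0) while requests are still
    outstanding in the driver and the queue has idling enabled; otherwise
    dispatch may proceed.
    """
    if rq_in_driver > 0 and idle_window_enabled:
        return False
    return True
```

In other words, a queue with idling enabled is only serviced once the device has drained its outstanding requests, which matches the description of the first change given in comment #366.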
This question has probably been answered before, but this bug is huge so I'll just ask again... Thomas, what kind of drive are you using? Does it have NCQ enabled? If so, does disabling NCQ make any difference? You can disable NCQ on sda by doing: # echo 1 > /sys/block/sda/device/queue_depth (or use sdX for others, naturally).
I ran the last tests on SATA drives with queue depth 31. Reducing the queue depth roughly halves the overall throughput of the two/four concurrent copy operations, both with and without the patch. I have tried to run some tests, but got some really strange results. I will try again on my test installation at home.
(In reply to comment #366)
cd /usr/src/linux-2.6.30-rc8+
suse:/usr/src/linux-2.6.30-rc8+ # patch -p1 < cfq.dif (#360)
patching file block/cfq-iosched.c
Hunk #1 FAILED at 1073.
Hunk #2 FAILED at 1119.
Hunk #3 FAILED at 1129.
3 out of 3 hunks FAILED -- saving rejects to file block/cfq-iosched.c.rej
patching file mm/page-writeback.c
Reversed (or previously applied) patch detected! Assume -R? [n] y
Hunk #1 succeeded at 66 with fuzz 1.
Hunk #2 FAILED at 77.
1 out of 2 hunks FAILED -- saving rejects to file mm/page-writeback.c.rej
suse:/usr/src/linux-2.6.30-rc8+ # patch -p1 < cfq.dif (#360 + #366)
patching file block/cfq-iosched.c
Hunk #3 FAILED at 1119.
Hunk #4 FAILED at 1129.
2 out of 4 hunks FAILED -- saving rejects to file block/cfq-iosched.c.rej
patching file mm/page-writeback.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n] y
Hunk #1 FAILED at 66.
Hunk #2 FAILED at 77.
2 out of 2 hunks FAILED -- saving rejects to file mm/page-writeback.c.rej
(In reply to comment #371) You should only try the patch in comment #366
Ok, 2.6.30-rc8 + patch in comment #366, xfs dd if=/dev/zero of=./bigfile bs=1M count=15000 & ./fsync-tester fsync time: 1.7085 fsync time: 1.6639 fsync time: 0.4616 fsync time: 1.3800 fsync time: 1.3603 fsync time: 1.5529 fsync time: 1.8435 fsync time: 0.2561 fsync time: 0.9318 fsync time: 0.1965 fsync time: 1.2233 fsync time: 1.3920 fsync time: 0.4677 fsync time: 0.4560 fsync time: 1.8206 fsync time: 1.8135 fsync time: 1.8342 fsync time: 0.8565 fsync time: 0.9477 fsync time: 2.8569 fsync time: 0.4323 15000+0 records in 15000+0 records out 15728640000 bytes (16 GB) copied, 181.923 s, 86.5 MB/s fsync time: 1.3716 fsync time: 0.0168 fsync time: 1.5381 fsync time: 1.5649 fsync time: 0.0349 fsync time: 0.0636 fsync time: 0.0657 fsync time: 0.3337 fsync time: 0.0393 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 2 0 4230432 808 3417716 0 0 0 87568 1102 1850 1 7 13 79 0 4 0 4149632 808 3499392 0 0 0 83960 722 1037 1 5 36 57 0 4 0 4069892 808 3578140 0 0 0 76840 701 1178 1 5 0 93 1 3 0 3988784 808 3659444 0 0 0 78848 727 1151 1 5 14 79 0 4 0 3889380 808 3757188 0 0 0 97310 804 1200 2 6 33 59 0 3 0 3807540 808 3838720 0 0 0 79888 614 1010 2 5 19 74 0 4 0 3729056 808 3918092 0 0 0 76866 840 1367 0 5 29 65 0 3 0 3002860 808 4645932 0 0 0 90672 597 817 2 6 0 93 0 4 0 2921840 808 4728132 0 0 0 80416 865 1377 1 6 0 93 0 3 0 2841564 808 4810132 0 0 0 80384 627 933 1 5 0 93 1 4 0 2743820 808 4906136 0 0 0 94216 892 1398 1 7 0 92 0 3 0 2666100 808 4984280 0 0 0 77824 770 1217 1 5 0 93 1 2 0 2590248 808 5063188 0 0 0 82496 795 1283 2 6 0 92
While copying /usr/src/linux-2.6.30-rc8 -> /usr/src/linux-2.6.30-rc8+ (in Konsole, without using dolphin, ... or any other GUI), it is impossible to start :~> kdesu /usr/bin/kwrite. After the copy completes it is still impossible; the only remedy is to log the user out or reboot the computer. The speed of copying /usr/src/linux-2.6.30-rc8 -> /usr/src/linux-2.6.30-rc8+ was near zero before and remains so. time cp -r /usr/src/linux-2.6.30-rc8 /usr/src/linux-2.6.30-rc8+ real 6m14.566s user 0m0.158s sys 0m2.838s
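For readers who do not have the fsync-tester binary at hand, the kind of measurement reported above can be approximated as follows. This is only a minimal sketch under stated assumptions, not the actual fsync-tester source (which is not reproduced in this thread): the names `time_fsync` and `run`, the 4 KiB payload and the pause between samples are all hypothetical choices made for illustration.

```python
import os
import tempfile
import time


def time_fsync(path, payload=b"x" * 4096):
    """Rewrite a small file and return the seconds spent inside fsync().

    Assumption: a 4 KiB rewrite is enough to force a journal commit;
    the real fsync-tester may write a different amount.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        start = time.perf_counter()
        os.fsync(fd)  # blocks until the data has been handed to stable storage
        return time.perf_counter() - start
    finally:
        os.close(fd)


def run(path, iterations=5, pause=0.0):
    """Print one 'fsync time:' line per sample and return the samples."""
    samples = []
    for _ in range(iterations):
        elapsed = time_fsync(path)
        print(f"fsync time: {elapsed:.4f}")
        samples.append(elapsed)
        time.sleep(pause)
    return samples


if __name__ == "__main__":
    # Hypothetical usage: run this next to a large dd writer on the same
    # filesystem to see how long small synchronous writes are delayed.
    with tempfile.TemporaryDirectory() as tmp:
        run(os.path.join(tmp, "fsync-tester.tmp"), iterations=3, pause=0.1)
```

To mimic the test in this comment, the file would have to live on the same filesystem as the `dd` target, so a temporary directory on tmpfs would not show the effect.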
Correction to my previous comment: sometimes, after the copy has completed, kdesu /usr/bin/kwrite does start successfully, but it never starts while the copy is running.
(In reply to comment #369) > This question has probably been answered before, but this bug is huge so I'll > just ask again... Thomas, what kind of drive are you using? Does it have NCQ > enabled? If so, does disabling NCQ make any difference? This bug is really annoying. I was not able to reproduce the mouse freezes any more, with or without the patch and with or without NCQ. I will try again later. Is it possible to simulate a disk in RAM with parameterized speed and latency?
Created attachment 21849 [details] The corrected patch from post #360 (for 2.6.29 and maybe later kernels) I tried to apply the patch from post #360 to kernel 2.6.29 and got some rejects. I resolved the rejects by hand and am attaching the cleaned-up version here. I saw the patch from #366, but I think it does not contain the same corrections as #360, so I would like to suggest testing this patch (cfq-iosched.c file only).
I also looked at the patch from post #366, and I did not understand why the author says it is the "proper backport". There is no code using the 'prev_cfqq' variable. I think the patch from #366 may not be valid. Please try this patch instead. I believe it applies to 2.6.29 and later kernels. I have not tested it because I don't have a test machine for experiments. I only have a Linux server under heavy load...
It IS the proper backport. I'm the author and maintainer of CFQ, I should know... I would generally advise against using patches from people who don't know what they are doing, especially for code as important to data integrity as the IO scheduler. There could be data loss from bad patches. The reason the 2.6.30 and 2.6.29 patches are different is that the CFQ request dispatch mechanism is different in 2.6.30. As such there's no prev_cfqq to take into account, since we never dispatch from more than one cfqq in one round. You would need to take the prev_cfqq out of local function scope for it to have any meaning. So, not to be rude, but the last thing this bug needs is more cooks or chefs asking people to test things. It's a huge mess already. For now the focus is making Thomas happy, since he's spent much time on this and has a reproducible (sort of) way of testing it. Once that is done, we can proceed to any other potential issues. Any comments not related to that exact issue will be ignored.
Created attachment 21852 [details] test results Two SAMSUNG HD753LJ hard drives + NCQ + mdadm raid1 + ext3 + 2GB RAM + Core2Duo E6750 2.66 @ 3.44 GHz
Comment on attachment 21852 [details] test results >==============================2.6.30============================== >ff@home-desktop:~$ dd if=/dev/zero of=./bigfile bs=1M count=15000 & >./fsync-tester >[1] 6958 >fsync time: 0.1025 >fsync time: 0.8720 >fsync time: 5.5800 >fsync time: 5.6179 >fsync time: 3.7413 >fsync time: 4.2393 >fsync time: 5.2596 >fsync time: 0.0985 >fsync time: 1.7070 >fsync time: 4.1414 >fsync time: 0.1577 >fsync time: 4.8191 >fsync time: 0.6993 >fsync time: 3.6732 >fsync time: 3.6963 >fsync time: 4.7696 >fsync time: 6.0947 >fsync time: 3.4383 >fsync time: 0.7583 >fsync time: 4.0760 >fsync time: 4.1786 >fsync time: 3.9886 >fsync time: 0.3802 >fsync time: 3.4182 >fsync time: 1.1262 >fsync time: 2.8425 >fsync time: 3.9217 >fsync time: 1.4758 >fsync time: 3.7798 >fsync time: 3.9234 >fsync time: 0.3557 >fsync time: 4.1882 >fsync time: 4.4526 >15000+0 records in >15000+0 records out >15728640000 bytes (16 GB) copied, 231.473 s, 68.0 MB/s >fsync time: 2.1747 >fsync time: 0.0820 >fsync time: 0.0774 >fsync time: 0.0299 >fsync time: 0.0268 >fsync time: 0.0282 >fsync time: 0.0277 >fsync time: 0.0270 >^C >[1]+ Done dd if=/dev/zero of=./bigfile bs=1M count=15000 > >ff@home-desktop:~$ vmstat 1 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 0 214308 1592368 6344 66260 1 5 253 653 611 592 15 5 73 7 > 0 0 214308 1592400 6344 66264 0 0 0 0 309 525 7 3 90 0 > 2 0 214308 1592448 6344 66264 0 0 0 0 365 686 5 3 91 0 > 0 0 214308 1592400 6344 66264 0 0 0 0 291 543 5 3 92 0 > 0 2 214308 1126216 6756 464876 0 0 24 398976 980 1265 7 36 37 > 20 > 0 4 214308 1107468 6780 489032 0 0 0 20524 671 551 9 5 35 51 > 0 6 214308 1118544 6780 489032 0 0 0 4 658 575 7 3 32 58 > 0 5 214308 1129752 6784 489032 0 0 0 4 646 578 6 5 36 53 > 0 4 214308 1142036 6784 489032 0 0 0 8 656 576 6 4 36 54 > 2 3 214308 1151708 6784 489032 0 0 0 0 590 501 8 3 16 72 > 0 1 214308 1156616 6792 491124 0 0 0 
1572 587 485 7 3 29 60 > 0 2 214308 704504 7188 876836 0 0 0 392152 885 716 8 38 21 32 > 0 4 214308 637132 7252 942604 0 0 0 65728 666 494 7 10 0 83 > 0 4 214308 561368 7324 1016556 0 0 0 73984 686 499 7 12 0 81 > 0 4 214308 490020 7392 1086476 0 0 0 69920 693 537 7 10 0 83 > 0 4 214308 418224 7460 1156364 0 0 0 69888 686 490 9 9 0 82 > 0 3 214308 398752 7496 1177372 0 0 4 22316 781 500 6 7 7 80 > 0 4 214308 406700 7496 1177404 0 0 28 0 532 510 8 3 10 79 > 0 5 214308 416212 7516 1177528 0 0 160 8 645 550 6 4 14 76 > 0 4 214308 427788 7524 1177648 0 0 108 12 620 526 8 4 1 87 > 0 3 214308 437528 7536 1177688 0 0 56 0 674 651 6 5 0 89 > 1 2 214308 302288 7688 1321540 0 0 8 16 540 533 9 12 20 58 > 1 3 214308 15268 7536 1548008 0 0 56 391944 878 707 7 28 0 > 65 > 0 5 214308 15152 7372 1548204 0 0 96 69896 699 574 8 11 0 81 > 1 7 214308 15232 7420 1549092 0 0 220 45220 661 616 6 9 0 85 > 7 6 214308 14752 7544 1548616 0 0 640 82216 755 747 6 13 0 81 > 0 5 214308 15084 7620 1548232 0 0 24 65720 709 674 8 11 0 81 > 0 4 214308 17436 7636 1549072 0 0 284 8224 889 572 6 6 0 88 > 0 4 214308 17776 7660 1550176 0 0 100 1496 637 545 6 5 0 89 > 0 4 214308 27604 7684 1550280 0 0 132 0 606 513 7 5 0 88 > 0 4 214308 35192 7712 1550412 0 0 156 0 588 572 8 3 0 90 > 0 5 214308 44060 7744 1550820 0 0 480 0 681 683 6 6 0 88 > 0 3 214308 55500 7800 1551100 0 0 296 12 750 746 5 6 31 58 > 2 1 214308 13580 8028 1601036 0 0 1356 120 696 881 8 9 26 56 > 0 3 214308 15220 8316 1547984 0 0 0 381300 944 1397 7 39 18 > 37 > 0 5 214308 13560 8384 1550264 0 0 0 69944 769 536 7 10 0 82 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 2 5 214308 14424 8452 1549564 0 0 0 69888 739 508 6 9 0 85 > 1 5 214308 13952 8524 1549784 0 0 0 74016 719 527 7 12 0 81 > 0 3 214308 14468 8564 1549144 0 0 0 69920 781 576 6 12 1 81 > 0 3 214308 20844 8564 1549148 0 0 0 0 580 477 7 3 22 67 > 0 5 214308 28084 8568 1549144 0 0 0 8 659 
498 7 3 36 54 > 0 3 214308 37388 8576 1549148 0 0 0 1764 605 555 8 5 12 76 > 0 3 214308 45184 8576 1549148 0 0 0 0 525 470 6 5 0 90 > 0 1 214308 58560 8576 1549148 0 0 0 12 811 623 8 5 5 81 > 0 3 214308 14952 8812 1592052 0 0 0 57644 637 804 6 25 37 32 > 0 2 214308 14012 8952 1549216 0 0 0 347944 1108 767 8 19 14 > 59 > 0 2 214308 15548 8964 1549476 0 0 0 5320 421 472 7 3 2 88 > 0 4 214308 28940 8964 1549476 0 0 0 12 694 466 7 5 11 77 > 0 4 214308 43720 8964 1549476 0 0 0 0 669 455 6 3 34 57 > 0 4 214308 56184 8964 1549476 0 0 0 0 714 437 6 4 36 54 > 0 4 214308 13704 8752 1594000 0 0 8 32772 889 834 4 28 30 38 > 0 3 214308 14292 8760 1592532 0 0 0 72228 720 1005 8 6 36 51 > 0 2 214308 15452 8680 1548660 0 0 0 301880 688 756 6 18 18 > 59 > 0 2 214308 24520 8680 1548660 0 0 0 0 618 443 7 4 25 64 > 0 3 214308 39648 8680 1548660 0 0 0 4 683 482 5 5 21 68 > 1 2 214308 52696 8684 1548660 0 0 0 12 583 602 6 5 35 53 > 3 0 214308 13620 8344 1599184 0 0 40 4 817 598 6 14 4 76 > 1 1 214308 14660 7476 1565564 0 0 112 276832 772 970 3 27 14 > 56 > 0 4 214308 14024 7544 1552748 0 0 4 145788 625 542 0 9 0 > 91 > 2 2 214308 13996 7596 1556652 0 0 4 28800 523 529 1 5 3 90 > 0 5 214308 14692 7692 1552168 0 0 4 115004 702 578 0 12 2 > 86 > 0 5 214308 17748 7744 1551596 0 0 8 41132 619 522 2 5 0 93 > 0 4 214308 15604 7808 1560576 0 0 204 0 631 537 0 6 0 94 > 0 3 214308 26272 7808 1560576 0 0 0 0 539 486 2 1 0 97 > 0 3 214308 38512 7816 1560576 0 0 0 1440 555 503 0 0 0 > 100 > 1 1 214308 50424 7816 1560576 0 0 0 0 465 497 2 0 27 71 > 3 2 214308 14948 8096 1595136 0 0 12 81980 816 866 0 25 42 32 > 0 3 214308 14936 8188 1550608 0 0 4 347056 643 579 2 15 15 > 69 > 1 3 214308 13788 8256 1552700 0 0 4 64168 599 476 0 7 0 93 > 0 3 214308 14436 8292 1551948 0 0 0 73920 657 567 2 10 11 77 > 1 2 214308 14452 8380 1551596 0 0 4 80780 795 529 0 10 0 90 > 0 3 214308 15284 8400 1552484 0 0 4 16384 390 461 2 2 36 60 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b 
swpd free buff cache si so bi bo in cs us sy id wa > 0 2 214308 27336 8408 1552484 0 0 0 1688 631 507 0 1 10 88 > 0 2 214308 39496 8408 1552484 0 0 0 0 500 442 2 1 29 68 > 1 1 214308 53336 8408 1552484 0 0 0 0 616 503 0 1 44 55 > 1 2 214308 14904 8456 1602564 0 0 148 4 515 588 1 10 42 46 > 0 6 214308 14088 6968 1595120 0 0 1280 123568 793 1043 1 26 1 > 73 > 0 4 214308 15032 5900 1554300 0 0 1108 312700 695 1060 0 13 0 > 87 > 0 6 214308 19824 5904 1556772 0 0 192 0 465 459 0 1 0 99 > 0 4 214308 14432 5640 1554152 0 0 680 125936 698 583 0 14 0 > 85 > 1 4 214308 14268 4524 1556644 0 0 4 79152 627 536 0 9 0 91 > 0 4 214308 15472 4328 1557116 0 0 44 67104 653 512 2 9 0 89 > 0 3 214308 14192 4356 1557264 0 0 4 34360 869 524 0 4 0 96 > 0 3 214308 23236 4356 1557264 0 0 0 0 521 509 2 1 37 60 > 0 4 214308 34784 4364 1557264 0 0 4 12 536 499 0 1 32 68 > 0 4 214308 47144 4364 1557336 0 0 72 0 420 426 0 1 28 71 > 0 1 214308 62740 4368 1557336 0 0 0 16 677 727 3 1 30 68 > 0 1 214308 14492 4608 1604772 0 0 72 37004 608 779 8 21 41 30 > 0 3 214308 17452 4768 1600460 0 0 4 91240 801 740 6 20 7 67 > 0 2 214308 13752 4848 1557448 0 0 4 352920 1058 907 8 15 9 > 69 > 0 3 214308 20620 4848 1557444 0 0 0 0 618 478 6 4 30 60 > 0 3 214308 18152 4972 1556620 0 0 4 117476 743 575 2 14 2 > 82 > 0 4 214308 13708 5060 1557112 0 0 0 100632 848 556 1 11 19 > 69 > 3 4 214308 17728 5132 1556508 0 0 592 32928 650 580 1 4 0 95 > 0 4 214308 17744 5148 1559500 0 0 324 1388 626 543 0 2 19 79 > 0 4 214308 27648 5160 1559788 0 0 236 0 498 504 0 1 2 97 > 0 3 214308 41392 5160 1559920 0 0 140 0 566 592 2 1 8 89 > 0 1 214308 55464 5160 1559920 0 0 0 0 613 659 2 1 43 53 > 2 3 214308 13524 5464 1566192 0 0 8 286840 771 911 3 35 35 > 27 > 1 2 214308 14304 5536 1556740 0 0 4 138424 727 671 7 12 10 > 71 > 0 3 214308 15288 5604 1555804 0 0 4 78144 740 555 7 13 18 63 > 0 3 214308 14120 5680 1554992 0 0 0 84032 622 538 2 10 19 69 > 0 4 214308 15312 5720 1556812 0 0 4 49508 590 564 0 6 12 81 > 0 3 214308 18076 5764 
1555236 0 0 0 41132 730 548 2 6 21 71 > 1 3 214308 29532 5768 1555236 0 0 0 1720 501 498 0 1 16 83 > 1 4 214308 41840 5772 1555256 0 0 20 0 612 566 6 4 24 66 > 0 1 214308 56008 5772 1555332 0 0 80 12 747 670 6 3 3 88 > 3 0 214308 13156 5952 1607880 0 0 4 172 609 625 8 16 35 41 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 4 214308 15264 6168 1553756 0 0 8 409604 877 936 5 30 29 > 36 > 0 3 214308 15940 6232 1554652 0 0 4 64624 604 571 2 9 13 75 > 1 2 214308 14836 6316 1555000 0 0 0 93280 682 535 0 13 0 87 > 2 3 214308 14840 6388 1554324 0 0 4 74644 618 515 1 8 3 87 > 0 2 214308 15460 6452 1555048 0 0 4 57504 736 537 0 6 15 79 > 0 3 214308 19076 6460 1556076 0 0 0 1632 672 487 7 4 39 50 > 0 4 214308 30076 6460 1556076 0 0 0 4 630 492 5 5 0 90 > 0 3 214308 42704 6464 1556076 0 0 0 12 569 475 3 2 19 76 > 0 2 214308 51996 6464 1556076 0 0 0 0 520 551 0 0 4 96 > 2 1 214308 13928 6584 1606556 0 0 4 16 532 625 1 10 38 50 > 0 2 214308 14848 6812 1596100 0 0 8 129312 765 897 0 22 22 > 56 > 0 4 214308 14632 6880 1553992 0 0 0 308888 701 662 7 13 0 > 80 > 0 4 214308 15096 6976 1553728 0 0 4 94592 716 592 5 14 0 81 > 0 4 214308 14700 7048 1554080 0 0 0 74048 700 570 7 11 0 82 > 0 4 214308 13780 7116 1555188 0 0 4 71092 685 538 7 10 0 83 > 0 3 214308 14964 7152 1554204 0 0 0 36944 918 502 7 7 0 86 > 0 4 214308 19908 7156 1554292 0 0 88 1532 565 514 6 4 0 90 > 0 4 214308 28984 7160 1554336 0 0 44 12 552 487 8 3 0 89 > 0 5 214308 37492 7160 1554452 0 0 116 36 610 539 6 5 21 68 > 0 2 214308 52188 7164 1554452 0 0 0 12 731 722 6 5 9 79 > 1 1 214308 14644 7304 1605148 0 0 4 8 509 594 6 13 37 44 > 0 1 214308 17204 7340 1597036 0 0 8 81928 678 707 7 15 36 42 > 0 2 214308 15036 7276 1553748 0 0 8 357892 807 848 7 23 37 > 33 > 0 4 214308 14800 7344 1554312 0 0 4 65824 713 585 8 11 9 72 > 1 3 214308 14808 7420 1554168 0 0 0 86340 739 575 7 13 0 80 > 0 4 214308 14004 7492 1554560 0 0 4 74016 639 
531 2 8 0 89 > 0 3 214308 19856 7504 1554272 0 0 0 12320 596 506 0 2 29 68 > 0 3 214308 23068 7508 1554272 0 0 0 1476 668 489 1 1 14 83 > 0 3 214308 35408 7508 1554272 0 0 0 0 529 474 0 1 0 99 > 0 3 214308 46364 7512 1554272 0 0 0 12 521 457 1 1 12 86 > 0 2 214308 62656 7512 1554272 0 0 0 4 492 609 0 0 14 86 > 0 2 214308 15412 7680 1594804 0 0 32 94884 735 897 2 32 25 41 > 1 4 214308 14692 7744 1575932 0 0 0 194608 731 1150 0 14 34 > 52 > 0 3 214308 17144 7780 1554232 0 0 4 180920 572 517 2 7 22 > 70 > 1 2 214308 14296 7864 1553780 0 0 0 86432 587 514 0 10 46 44 > 0 3 214308 13692 7944 1554200 0 0 4 82304 620 578 2 8 9 82 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 4 214308 14068 8020 1553412 0 0 0 69900 746 633 0 8 22 71 > 0 4 214308 19652 8036 1553056 0 0 4 8224 472 540 2 3 0 96 > 0 3 214308 25872 8048 1554388 0 0 0 1456 731 634 0 1 0 98 > 0 3 214308 34260 8048 1554388 0 0 0 0 431 509 1 1 13 84 > 1 3 214308 46200 8048 1554388 0 0 0 0 525 539 0 0 23 76 > 0 1 214308 58964 8048 1554388 0 0 0 0 605 600 2 1 39 58 > 1 1 214308 14124 8136 1556832 0 0 8 333512 796 925 0 35 36 > 28 > 0 3 214308 14004 8072 1552952 0 0 4 99628 638 512 1 10 0 88 > 0 3 214308 14708 8020 1552224 0 0 0 74048 648 501 0 10 12 78 > 0 3 214308 14104 8100 1553228 0 0 4 75852 661 524 2 11 0 87 > 0 3 214308 15048 8040 1552704 0 0 0 65772 775 572 0 8 0 92 > 0 5 214308 21864 8044 1552888 0 0 4 36 441 477 2 0 0 97 > 0 4 214308 33624 8052 1552884 0 0 0 1700 541 524 0 0 0 > 100 > 0 3 214308 45344 8052 1552892 0 0 0 8 533 574 2 1 33 64 > 0 2 214308 57328 8052 1552892 0 0 0 0 537 561 0 1 33 67 > 2 1 214308 14412 8188 1600840 0 0 4 28688 583 749 2 18 41 39 > 0 2 214308 20020 8212 1595688 0 0 4 56132 754 587 0 6 21 73 > 1 1 214308 15360 8220 1552048 0 0 4 369572 841 905 2 23 6 > 69 > 0 3 214308 21504 8224 1552504 0 0 0 5344 423 421 0 2 28 70 > 0 3 214308 32504 8224 1552504 0 0 0 0 547 606 1 1 10 88 > 0 3 214308 
45044 8224 1552504 0 0 0 0 513 427 0 1 0 99 > 1 2 214308 57088 8224 1552504 0 0 0 0 609 652 2 0 4 94 > 2 0 214308 16176 8216 1602536 0 0 8 16 651 602 0 16 28 55 > 0 2 214308 15432 8324 1594212 0 0 4 117376 746 789 2 16 7 > 76 > 0 3 214308 15028 8424 1550876 0 0 0 318200 632 621 0 18 1 > 81 > 0 3 214308 14672 8472 1553268 0 0 4 75828 630 571 2 9 17 71 > 0 3 214308 15204 8516 1553032 0 0 4 57568 622 475 0 10 18 73 > 1 3 214308 14388 8580 1553448 0 0 0 86368 674 567 2 12 26 60 > 2 2 214308 20152 8580 1553432 0 0 0 0 521 413 0 0 35 65 > 0 4 214308 21876 8588 1553432 0 0 0 1508 569 578 2 2 21 75 > 0 2 214308 36576 8592 1553428 0 0 0 4 550 416 0 1 35 64 > 0 1 214308 49388 8592 1553432 0 0 0 8 565 553 1 1 39 58 > 0 1 214308 21752 8652 1596876 0 0 4 36 387 460 1 5 51 44 > 0 2 214308 15584 8900 1593092 0 0 8 118212 838 1253 2 24 24 > 50 > 0 1 214308 15640 9016 1599208 0 0 4 4540 610 557 1 11 30 59 > 0 3 214308 13976 9080 1551688 0 0 0 392328 813 684 2 16 16 > 67 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 3 214308 15492 9120 1551696 0 0 4 42364 511 481 3 6 7 84 > 0 3 214308 13956 9204 1552296 0 0 4 82272 686 633 8 12 0 80 > 0 2 214308 16768 9236 1551476 0 0 0 53472 652 483 7 9 5 79 > 0 3 214308 27796 9236 1551476 0 0 0 0 673 583 7 4 29 60 > 0 3 214308 37796 9240 1551476 0 0 0 12 693 458 5 5 36 53 > 0 5 214308 44304 9252 1551476 0 0 8 1448 589 612 8 4 36 52 > 0 1 214308 59348 9256 1551472 0 0 0 4 731 614 6 4 3 87 > 0 3 214308 14708 9500 1595132 0 0 12 57356 696 947 8 26 23 43 > 1 1 214308 13588 9620 1554652 0 0 4 335112 829 756 9 20 12 > 60 > 0 3 214308 14964 9660 1550636 0 0 0 69192 705 554 1 6 10 82 > 2 2 214308 13688 9724 1554112 0 0 4 49324 571 571 0 6 7 86 > 0 2 214308 17120 9788 1549904 0 0 0 68184 615 659 2 8 16 73 > 0 3 214308 28960 9788 1549904 0 0 0 4 530 409 0 1 33 66 > 1 1 214308 40744 9792 1549904 0 0 0 12 540 498 1 2 25 72 > 0 2 214308 47964 9792 1549904 0 0 0 0 481 
410 0 0 21 79 > 1 0 214308 15176 9920 1598424 0 0 4 4 690 677 2 11 44 43 > 0 2 214308 14808 10060 1593288 0 0 8 122856 793 1034 0 21 30 > 49 > 0 5 214308 14700 10136 1549124 0 0 28 314392 636 812 2 17 22 > 60 > 2 2 214308 14576 10192 1551960 0 0 168 70868 654 548 0 9 7 83 > 1 5 214308 15064 10228 1550548 0 0 320 77584 605 582 1 11 5 82 > 1 4 214308 13652 10300 1551720 0 0 144 69852 654 557 0 9 8 83 > 1 3 214308 14824 10316 1551632 0 0 184 45248 606 587 2 6 17 75 > 0 3 214308 17100 10332 1552116 0 0 492 12 734 511 0 1 10 89 > 0 4 214308 24752 10336 1552140 0 0 24 1432 462 503 1 1 4 93 > 0 5 214308 36596 10344 1552336 0 0 288 12 535 508 0 0 0 > 100 > 0 6 214308 48460 10400 1554488 0 0 2116 8 686 1024 2 1 7 89 > 1 2 214308 13784 10568 1602292 0 0 528 20 535 581 0 12 34 53 > 0 3 214308 15168 10736 1549184 0 0 336 359376 1141 1015 2 26 1 > 72 > 0 3 214308 15156 10760 1549976 0 0 72 21776 496 442 0 3 1 95 > 0 3 214308 29756 10768 1550304 0 0 268 0 653 533 2 1 6 92 > 0 3 214308 38128 10776 1550776 0 0 492 0 536 448 0 1 4 95 > 0 3 214308 49364 10780 1551048 0 0 276 0 549 532 2 1 0 96 > 1 5 214304 13884 10992 1593420 32 0 892 20484 726 760 0 19 9 71 > 0 4 214304 14648 11116 1547548 0 0 852 324424 804 1278 2 19 4 > 75 > 0 6 214304 14964 11184 1549504 0 0 1848 35824 638 796 0 6 0 94 > 0 6 214304 28112 11188 1549480 0 0 4 12 620 571 2 1 0 97 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 7 214304 14604 11296 1549212 0 0 48 134928 640 696 0 13 0 > 86 > 0 7 214304 15052 11372 1547720 0 0 168 95424 822 1137 3 17 0 80 > 0 4 214304 20360 11392 1547520 0 0 0 20552 709 512 0 3 7 91 > 0 4 214304 25844 11396 1548544 0 0 16 1416 438 555 2 1 0 97 > 0 4 214304 38652 11396 1548560 0 0 0 0 571 407 0 2 0 98 > 0 3 214304 52360 11396 1548560 0 0 0 0 601 721 2 1 0 97 > 0 3 214304 50348 11424 1565520 0 0 4 16 481 397 0 2 2 97 > 1 3 214304 14952 11768 1596452 0 0 252 52444 661 952 2 25 19 54 > 0 5 214304 
15320 11856 1548480 0 0 4 364900 760 666 0 16 4 > 79 > 0 5 214304 14896 11940 1547704 0 0 0 90496 709 633 2 11 0 87 > 0 5 214304 14400 11936 1547988 0 0 188 79464 652 544 0 10 0 90 > 0 7 214304 13556 11972 1549776 0 0 8 57844 668 553 2 8 0 90 > 0 4 214304 16036 11996 1550756 0 0 4 38756 755 518 0 6 0 94 > 0 4 214304 28108 11996 1550780 0 0 24 0 485 505 2 1 16 82 > 0 4 214304 40644 11996 1550780 0 0 0 0 498 427 1 1 12 86 > 0 3 214304 53108 11996 1550780 0 0 0 0 634 689 2 2 32 65 > 0 3 214304 14060 12008 1599916 0 0 52 20 525 490 0 8 27 66 > 1 2 214304 14952 11652 1554332 0 0 144 337780 805 1056 2 32 1 > 66 > 1 4 214304 14136 11672 1548960 0 0 0 121500 693 537 0 11 0 > 89 > 1 4 214304 20988 11700 1549328 0 0 888 12288 557 525 2 2 11 85 > 0 3 214304 35248 11700 1549436 0 0 120 0 659 461 6 4 28 62 > 0 3 214304 48952 11700 1549436 0 0 0 0 586 508 8 3 2 86 > 0 4 214304 49148 11748 1552932 0 0 3468 1368 888 542 3 2 1 93 > 0 5 214304 52508 11756 1552948 0 0 80 12 422 243 2 0 0 98 > 0 5 214304 60292 11784 1553904 32 0 1064 8 492 258 0 0 0 > 100 > 1 2 214304 61600 12056 1556072 0 0 2476 1092 916 2595 9 5 42 44 > 2 0 214304 60032 12072 1557844 0 0 1644 16 528 1295 7 6 66 22 > 0 0 214304 59660 12080 1558152 0 0 308 1052 448 937 7 2 86 4 > 3 0 214304 59660 12088 1558152 0 0 0 1052 297 627 1 0 98 1 > 0 0 214304 59668 12096 1558152 0 0 0 1052 390 1060 2 1 96 1 > 1 0 214304 59660 12104 1558152 0 0 0 1052 434 858 6 2 92 0 > 0 0 214304 59536 12112 1558152 0 0 0 1052 449 897 6 3 90 1 > 2 0 214304 60048 12120 1558620 0 0 468 1052 390 809 3 1 93 3 > 0 0 214304 57320 12136 1561792 0 0 3188 0 395 785 3 1 88 8 >^C > >==============================2.6.30 + patch from >#366============================== >ff@home-desktop:~$ dd if=/dev/zero of=./bigfile bs=1M count=15000 & >./fsync-tester >[1] 5148 >fsync time: 0.1111 >fsync time: 4.3442 >fsync time: 3.9939 >fsync time: 3.7558 >fsync time: 5.8475 >fsync time: 1.3059 >fsync time: 3.0354 >fsync time: 4.5832 >fsync time: 4.3041 >fsync time: 
0.2866 >fsync time: 0.7935 >fsync time: 3.2131 >fsync time: 1.5684 >fsync time: 2.0876 >fsync time: 0.9385 >fsync time: 4.3251 >fsync time: 4.1135 >fsync time: 0.7379 >fsync time: 4.9408 >fsync time: 1.1250 >fsync time: 4.2838 >fsync time: 1.1455 >fsync time: 4.7464 >fsync time: 2.8139 >fsync time: 3.5942 >fsync time: 0.9125 >fsync time: 3.4242 >fsync time: 0.1742 >fsync time: 4.8445 >fsync time: 4.0925 >fsync time: 0.8951 >fsync time: 4.1239 >fsync time: 0.0716 >fsync time: 4.5728 >fsync time: 0.3215 >fsync time: 4.6018 >fsync time: 3.5965 >15000+0 records in >15000+0 records out >15728640000 bytes (16 GB) copied, 234.895 s, 67.0 MB/s >fsync time: 0.0515 >fsync time: 0.0345 >fsync time: 0.0722 >^C >[1]+ Done dd if=/dev/zero of=./bigfile bs=1M count=15000 > >ff@home-desktop:~$ vmstat 1 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 0 0 1235868 42744 329892 0 0 1120 36 267 866 7 4 68 22 > 0 0 0 1235860 42744 329892 0 0 0 0 193 463 0 1 99 0 > 0 0 0 1235860 42744 329892 0 0 0 64 279 633 1 0 99 0 > 1 0 0 1045624 42956 514680 0 0 0 1252 443 792 0 13 83 3 > 0 3 0 821604 43124 686576 0 0 0 355816 810 622 0 19 20 61 > 0 3 0 754580 43188 751656 0 0 0 65088 609 538 1 8 18 74 > 1 2 0 680256 43260 823852 0 0 0 67648 610 486 1 8 8 83 > 0 4 0 608228 43328 893888 0 0 0 74496 567 357 0 8 0 92 > 0 4 0 559444 43380 941264 0 0 0 48372 613 490 1 5 0 94 > 0 4 0 565140 43384 941268 0 0 0 12 435 464 0 0 0 100 > 0 5 0 578448 43384 941272 0 0 0 4 574 603 1 2 0 97 > 0 5 0 589836 43384 941272 0 0 0 0 507 469 0 1 0 99 > 0 1 0 603220 43388 941272 0 0 0 52 495 648 0 0 0 99 > 0 3 0 236396 43696 1253428 0 0 0 313596 912 502 2 26 26 > 45 > 0 4 0 242476 43700 1253432 0 0 0 1380 497 487 2 0 0 98 > 0 4 0 254124 43704 1253432 0 0 0 12 472 428 0 1 0 > 100 > 0 6 0 263184 43704 1253432 0 0 0 8 478 507 1 0 0 > 100 > 0 3 0 276748 43704 1253432 0 0 0 4 576 550 0 1 0 99 > 1 3 0 14600 42256 1501064 0 0 0 96932 816 697 
6 23 10 61 > 0 4 0 20824 37932 1474172 0 0 0 213872 710 496 7 8 12 > 72 > 0 4 0 32524 37932 1474172 0 0 0 0 540 504 2 1 0 97 > 0 4 0 44192 37936 1474172 0 0 0 12 519 408 0 1 0 > 100 > 0 2 0 57712 37940 1474168 0 0 0 12 608 531 1 0 29 70 > 0 3 0 14572 19936 1491896 0 0 4 286460 714 1011 2 29 25 > 43 > 0 3 0 25932 19940 1491820 0 0 0 328 640 515 8 2 26 64 > 0 3 0 38744 19940 1491820 0 0 0 0 609 428 6 4 37 54 > 1 1 0 15000 20040 1526688 0 0 0 4 700 629 6 13 35 45 > 2 2 0 13512 20240 1493504 0 0 4 301892 782 750 6 24 4 > 67 > 1 3 0 14852 20304 1491700 0 0 0 52476 721 567 6 11 0 83 > 1 2 0 18284 20316 1492080 0 0 0 28780 566 451 5 3 5 87 > 2 2 0 27224 20316 1492080 0 0 0 0 496 495 0 1 46 52 > 1 2 0 35648 20316 1492080 0 0 0 0 677 381 0 1 25 74 > 0 1 0 42856 20328 1493104 0 0 8 1456 654 503 6 3 31 59 > 1 1 0 15036 20432 1536900 0 0 0 16 443 503 6 11 37 46 > 0 3 0 15760 20644 1490228 0 0 8 311972 751 904 8 23 7 > 63 > 0 3 0 30516 20648 1490232 0 0 0 12 501 426 0 1 34 65 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 3 0 42548 20648 1490232 0 0 0 0 632 528 6 4 23 68 > 1 1 0 50416 20664 1494928 0 0 8 4 604 478 7 3 2 88 > 0 3 0 14100 20860 1490812 0 0 24 325696 928 950 7 35 35 > 22 > 0 2 0 15028 19232 1491500 0 0 48 53484 716 533 7 8 22 62 > 0 4 0 24768 19232 1491512 0 0 0 4 545 482 2 1 28 69 > 0 2 0 35968 19236 1491512 0 0 0 12 607 423 3 2 16 79 > 0 2 0 46564 19236 1491512 0 0 0 0 492 483 1 1 31 67 > 0 1 0 51728 19240 1491508 0 0 0 1468 666 410 1 1 31 68 > 0 1 0 61724 19240 1491512 0 0 0 0 533 478 5 4 38 53 > 0 2 0 15500 17552 1490608 0 0 16 362168 805 1000 1 36 34 > 30 > 0 4 0 14232 17588 1492032 0 0 0 78084 692 585 1 9 6 85 > 0 3 0 14324 17496 1491412 0 0 0 69932 694 499 6 9 0 84 > 0 3 0 14712 16420 1492288 0 0 0 85520 715 568 2 11 19 68 > 0 3 0 15200 16464 1491824 0 0 0 74048 615 468 0 11 16 74 > 0 3 0 21124 16472 1492776 0 0 0 8224 606 517 1 2 33 64 > 0 3 0 30116 16476 1492772 0 
0 0 12 476 425 0 1 38 61 > 0 5 0 39592 16480 1492776 0 0 0 1760 468 525 0 1 49 49 > 0 3 0 55292 16480 1492776 0 0 0 0 588 552 1 1 29 69 > 1 1 0 14584 16508 1543964 0 0 0 16 558 553 7 9 29 56 > 0 3 0 14960 9296 1542656 0 0 16 121004 802 950 6 27 20 > 48 > 0 3 0 15168 9224 1499636 0 0 4 347852 832 699 1 17 4 > 77 > 0 3 0 13620 9180 1500896 0 0 0 69932 609 485 0 6 0 94 > 0 4 0 13716 8364 1501620 0 0 4 66352 612 544 1 9 0 90 > 1 4 0 14900 6416 1502408 0 0 4 89640 627 492 0 10 0 90 > 1 3 0 17748 6452 1502668 0 0 104 29000 656 531 1 4 23 72 > 0 4 0 15436 6480 1506132 0 0 972 1556 641 527 0 1 24 75 > 0 4 0 25444 6488 1506384 0 0 236 0 464 505 0 1 6 94 > 0 5 0 34200 6504 1507236 0 0 836 0 491 557 0 0 16 83 > 0 2 0 46676 6508 1507436 0 0 232 12 649 608 7 4 16 72 > 0 1 0 57780 6508 1507436 0 0 0 0 558 434 5 4 25 66 > 0 4 0 14308 6440 1501448 0 0 8 346148 956 1041 6 34 33 > 27 > 0 3 0 14916 6444 1501928 0 0 36 13740 434 463 7 4 0 89 > 0 3 0 24976 6468 1502376 0 0 496 0 581 503 2 1 29 68 > 0 3 0 39420 6468 1502400 0 0 0 8 519 413 0 1 27 72 > 0 4 0 49148 6492 1502900 0 0 604 0 530 543 1 0 0 98 > 0 3 0 14804 5096 1547228 0 0 2128 8 689 686 1 10 2 86 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 1 0 14040 5168 1553184 0 0 16 54552 848 908 7 16 40 37 > 0 2 0 14596 5292 1544912 0 0 4 119588 757 883 6 17 12 > 66 > 0 2 0 14616 4196 1504332 0 0 0 318128 911 798 5 17 21 > 56 > 0 5 0 14252 4248 1504768 0 0 124 41100 573 479 0 6 17 77 > 0 5 0 21548 4248 1504760 0 0 0 0 356 454 0 0 0 > 100 > 0 5 0 30616 4256 1508596 0 0 4 8 492 408 0 1 0 99 > 1 4 0 13568 4480 1518228 0 0 64 128736 740 773 1 20 4 > 75 > 0 2 0 19012 4504 1504444 0 0 16 108876 500 458 0 4 11 > 85 > 0 2 0 29868 4504 1504444 0 0 0 0 629 680 1 1 30 68 > 0 3 0 43568 4508 1504444 0 0 0 1468 506 482 0 0 24 76 > 0 2 0 56920 4508 1504444 0 0 0 0 567 667 1 1 13 85 > 0 2 0 41276 4544 1530452 0 0 4 20 451 476 0 2 42 56 > 0 3 0 13836 4888 
1546144 0 0 116 114292 802 1039 2 31 35 > 32 > 0 2 0 23516 4896 1525112 0 0 44 154896 622 499 0 6 16 > 77 > 0 3 0 14904 4984 1546848 0 0 4 8196 680 774 0 9 8 82 > 0 4 0 14516 5040 1503056 0 0 132 342932 655 613 0 18 0 > 82 > 0 3 0 13804 5116 1503620 0 0 76 73128 610 557 1 8 8 83 > 0 3 0 16136 5136 1504412 0 0 4 53484 605 481 0 7 0 93 > 0 3 0 27920 5136 1504404 0 0 0 4 505 470 1 1 5 93 > 0 2 0 40740 5140 1504408 0 0 0 12 710 456 0 0 36 63 > 1 2 0 44156 5144 1504408 0 0 0 1400 508 473 1 1 34 64 > 0 1 0 57520 5144 1504408 0 0 0 0 560 404 0 0 34 66 > 0 1 0 14276 5268 1555548 0 0 4 28 473 597 1 8 50 40 > 0 2 0 14720 5464 1544052 0 0 8 145540 769 793 0 25 41 > 34 > 0 4 0 13724 5552 1502248 0 0 208 318396 714 964 1 13 21 > 64 > 1 6 0 15652 5596 1502480 0 0 92 45248 478 454 0 6 0 94 > 1 8 0 13848 5700 1503536 0 0 3588 61408 784 1299 1 9 0 90 > 0 6 0 15004 5776 1502076 0 0 3012 57504 748 1322 0 8 0 92 > 0 6 0 13600 5840 1503888 0 0 380 78176 653 666 1 9 0 90 > 0 5 0 16248 5860 1502668 0 0 180 20556 821 557 0 3 0 97 > 0 7 0 19132 5868 1502708 0 0 56 1416 432 523 1 0 0 99 > 0 5 0 35600 5892 1504112 0 0 1360 4 511 614 0 1 0 99 > 1 5 0 45772 5892 1504196 0 0 100 8 512 547 1 1 0 98 > 0 3 0 55852 5900 1504464 0 0 316 0 562 604 0 1 0 99 > 1 2 0 14092 6252 1551468 0 0 3968 49244 783 1201 2 20 0 78 > 0 3 0 14336 6412 1505532 0 0 104 369176 788 713 0 22 34 > 45 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 5 0 14668 6476 1505676 0 0 132 57660 550 511 1 6 7 85 > 0 4 0 17624 6544 1508624 0 0 16 45260 588 483 0 6 0 93 > 0 4 0 13984 6668 1504972 0 0 64 115264 700 588 0 11 0 > 89 > 0 4 0 15700 6696 1505884 0 0 396 12868 477 508 0 2 0 98 > 0 3 0 28964 6700 1506988 0 0 0 1672 565 565 1 1 0 98 > 0 3 0 41988 6700 1506988 0 0 0 0 513 413 0 1 0 99 > 0 2 0 54796 6704 1506984 0 0 0 12 525 536 1 1 0 98 > 0 1 0 65068 6704 1506988 0 0 0 4 417 386 0 0 37 63 > 1 1 0 14544 7056 1547220 0 0 8 114364 785 1032 1 
31 41 > 27 > 1 3 0 14984 7112 1546224 0 0 4 62020 633 990 0 6 10 84 > 0 4 0 15400 7180 1503764 0 0 4 322084 662 695 1 14 0 > 85 > 0 4 0 14676 7252 1505040 0 0 0 74048 542 511 0 8 0 92 > 0 4 0 14160 7328 1504748 0 0 4 74124 624 585 1 9 0 90 > 1 6 0 14044 7388 1505560 0 0 0 74020 619 511 0 8 0 92 > 1 3 0 14800 7452 1503940 0 0 4 58168 896 593 1 6 4 88 > 2 2 0 15276 7456 1503888 0 0 0 1464 386 432 0 1 0 98 > 1 3 0 25124 7456 1503892 0 0 0 0 547 505 1 0 0 99 > 0 3 0 38420 7456 1503892 0 0 0 0 490 410 0 0 0 > 100 > 0 3 0 51580 7456 1503892 0 0 0 0 533 498 1 0 0 98 > 0 1 0 65620 7456 1503892 0 0 0 12 482 564 0 0 36 64 > 0 2 0 14968 7680 1545992 0 0 52 103228 791 1004 1 32 37 > 29 > 0 3 0 13916 7708 1547864 0 0 0 36232 565 931 0 2 8 89 > 3 3 0 14048 7776 1504432 0 0 4 349496 640 762 1 16 9 > 74 > 0 4 0 14252 7860 1503760 0 0 4 78124 625 530 0 8 12 80 > 0 4 0 13900 7784 1504388 0 0 436 47364 554 581 1 6 1 92 > 0 4 0 15136 7740 1503292 0 0 196 87056 634 543 0 10 0 90 > 1 3 0 14756 7776 1503712 0 0 0 53468 626 516 2 6 0 92 > 0 3 0 20884 7780 1503724 0 0 0 12 726 446 0 1 0 99 > 0 5 0 24668 7788 1504748 0 0 52 1460 459 520 1 2 4 93 > 0 3 0 37628 7792 1504920 0 0 124 4 554 491 0 2 0 98 > 0 2 0 51356 7792 1504924 0 0 0 8 520 583 1 1 40 58 > 0 2 0 15188 7868 1556060 0 0 4 16 566 514 0 6 32 62 > 0 3 0 14740 7764 1547640 0 0 8 119488 762 941 1 23 47 > 28 > 0 1 0 19600 7776 1547712 0 0 4 27084 511 427 0 1 52 46 > 0 3 0 13664 7708 1505740 0 0 4 371988 717 828 2 23 18 > 58 > 0 3 0 13956 7656 1505784 0 0 4 61676 622 502 0 9 30 61 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 3 0 13952 7588 1506016 0 0 0 78208 600 589 1 10 26 63 > 0 3 0 15800 7628 1505272 0 0 4 51392 544 461 0 8 27 65 > 0 3 0 18264 7672 1504532 0 0 0 63748 745 536 4 7 10 80 > 1 3 0 19980 7684 1505532 0 0 0 1504 558 441 0 0 40 60 > 0 3 0 27952 7684 1505536 0 0 0 0 482 500 2 0 49 49 > 0 3 0 43392 7688 1505532 0 0 0 8 556 486 1 2 
28 68 > 0 3 0 52820 7688 1505604 0 0 68 8 517 603 6 4 8 82 > 0 1 0 12496 7836 1558900 0 0 4 24 632 719 5 13 22 59 > 1 2 0 15284 7840 1547452 0 0 8 104228 767 898 6 18 20 > 56 > 1 2 0 21324 7900 1514460 0 0 0 249132 691 548 7 12 18 > 63 > 0 5 0 30912 7900 1514464 0 0 0 4 583 505 8 4 33 55 > 1 3 0 46064 7904 1515080 0 0 0 4 681 588 6 3 37 54 > 1 3 0 18740 8060 1552056 0 0 8 876 659 746 6 17 6 71 > 0 3 0 14936 7920 1503820 0 0 0 356160 956 624 5 21 22 > 52 > 0 2 0 25384 7936 1504544 0 0 4 1356 583 540 7 4 13 76 > 1 3 0 38272 7940 1504544 0 0 0 12 604 464 7 4 32 57 > 0 3 0 44132 7940 1504544 0 0 0 0 540 551 8 4 35 53 > 2 1 0 15056 8028 1547112 0 0 0 12 739 691 6 11 31 51 > 0 4 0 14924 8208 1538096 0 0 8 120368 687 900 8 24 16 > 52 > 0 3 0 16108 8228 1545492 0 0 4 13968 710 766 7 9 8 76 > 1 3 0 14888 8248 1504728 0 0 0 338572 708 670 7 16 25 > 52 > 1 3 0 15092 8288 1504772 0 0 4 41132 619 469 6 9 0 84 > 0 4 0 13720 8372 1506108 0 0 0 98724 718 577 7 13 27 53 > 0 2 0 15300 8412 1505056 0 0 4 32908 630 470 6 7 14 73 > 0 2 0 23212 8412 1505064 0 0 0 0 554 526 7 3 36 54 > 0 2 0 29264 8416 1505064 0 0 0 1364 801 430 7 4 39 50 > 0 2 0 40780 8416 1505064 0 0 0 0 521 478 3 2 18 77 > 2 1 0 53716 8420 1505064 0 0 0 12 437 405 0 0 51 49 > 2 0 0 13800 8500 1557188 0 0 0 36 456 623 2 8 49 40 > 1 2 0 15452 8712 1545408 0 0 8 125840 930 1130 7 26 37 > 30 > 1 1 0 14220 8788 1503544 0 0 4 339620 775 788 7 19 25 > 49 > 0 3 0 15200 8752 1502960 0 0 4 45216 621 477 6 10 1 83 > 0 3 0 15364 8816 1502940 0 0 0 90496 728 600 8 12 0 80 > 0 3 0 17116 8868 1505288 0 0 4 32908 641 454 5 10 0 85 > 0 4 0 18532 8836 1502924 0 0 0 90564 734 577 7 12 0 80 > 0 2 0 24432 8844 1502932 0 0 0 1440 724 466 6 4 21 69 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 2 0 32152 8844 1502932 0 0 0 0 484 497 7 3 36 54 > 1 1 0 45520 8844 1502932 0 0 0 0 594 401 7 2 37 54 > 0 1 0 58376 8844 1502932 0 0 0 0 669 498 6 6 27 61 > 
1 1 0 14164 8908 1556064 0 0 4 16 531 552 7 14 37 41 > 0 2 0 15200 8868 1502456 0 0 12 394016 824 1047 2 26 14 > 57 > 0 3 0 17988 8920 1505192 0 0 4 32948 556 477 2 6 0 92 > 1 2 0 15244 9008 1502740 0 0 0 109784 739 597 7 14 0 > 78 > 1 2 0 14120 9064 1504488 0 0 4 69952 660 487 7 11 0 82 > 0 3 0 15544 9136 1502676 0 0 0 78112 795 584 9 10 0 81 > 0 2 0 14772 9164 1503996 0 0 4 20524 769 449 6 6 4 84 > 0 2 0 23116 9168 1505024 0 0 0 1456 472 500 8 3 21 67 > 0 3 0 35232 9168 1505024 0 0 0 4 598 413 6 5 11 78 > 0 2 0 47668 9172 1505024 0 0 0 12 506 558 2 0 14 83 > 0 2 0 56336 9180 1505016 0 0 104 0 548 426 0 1 37 62 > 1 1 0 14308 9104 1550864 0 0 4 45552 585 752 4 22 42 32 > 0 4 0 14908 9000 1546296 0 0 8 81468 787 747 6 15 8 70 > 2 1 0 13672 8984 1503684 0 0 4 356028 843 854 8 19 16 > 57 > 1 3 0 15088 9044 1506536 0 0 4 45248 707 506 7 10 15 68 > 0 4 0 15088 8956 1502944 0 0 0 98656 687 581 4 11 0 85 > 0 4 0 13996 9016 1503984 0 0 4 78144 631 477 0 10 0 89 > 1 4 0 17544 9072 1502820 0 0 4 49852 599 569 1 5 0 93 > 0 3 0 17292 9052 1505400 0 0 0 1468 730 476 0 3 1 97 > 0 4 0 30052 9052 1505404 0 0 0 36 566 506 6 4 30 60 > 0 3 0 43416 9056 1505404 0 0 0 12 585 449 4 2 12 83 > 0 2 0 53728 9060 1505408 0 0 60 0 534 604 1 2 6 91 > 2 1 0 13792 9204 1556156 0 0 4 16 556 618 3 12 47 39 > 1 0 0 15192 9128 1547256 0 0 4 119120 774 796 6 18 38 > 38 > 1 2 0 14468 9124 1503200 0 0 4 355892 930 949 5 21 30 > 44 > 0 4 0 16568 9168 1506296 0 0 4 24644 602 532 8 8 7 77 > 0 3 0 14576 9176 1503172 0 0 0 119060 728 587 6 16 18 > 61 > 1 2 0 14352 9120 1508740 0 0 4 20576 612 558 8 7 19 65 > 0 3 0 15532 9136 1501896 0 0 0 128572 813 532 7 15 0 > 78 > 0 2 0 15140 9152 1504820 0 0 4 1448 768 547 7 5 5 83 > 1 3 0 25064 9156 1504856 0 0 0 12 586 437 6 4 7 82 > 1 5 0 38540 9156 1504856 0 0 0 4 642 611 9 3 26 62 > 0 3 0 52500 9156 1504856 0 0 0 0 624 481 5 5 37 53 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id 
wa > 2 1 0 14500 9232 1555248 0 0 0 16 682 704 7 10 28 55 > 1 2 0 14644 9376 1546224 0 0 8 111736 724 810 6 26 40 > 28 > 0 2 0 13508 9308 1503808 0 0 4 341320 782 948 8 18 8 > 67 > 1 2 0 13724 9328 1504316 0 0 0 61708 662 620 7 10 15 67 > 1 3 0 14292 9376 1503728 0 0 4 61764 667 673 7 10 0 83 > 0 4 0 16960 9312 1504224 0 0 4 61700 661 587 7 12 0 81 > 0 3 0 14488 9400 1502776 0 0 0 98792 723 605 7 10 0 83 > 0 3 0 14544 9408 1502508 0 0 0 26160 794 573 6 7 34 53 > 0 3 0 26252 9412 1502512 0 0 0 1420 650 546 7 4 2 87 > 0 3 0 39100 9416 1502512 0 0 0 12 592 432 6 4 0 89 > 0 4 0 49712 9416 1502512 0 0 0 4 499 493 2 2 8 88 > 0 1 0 63372 9420 1502512 0 0 0 12 582 593 0 0 14 85 > 2 0 0 13900 9588 1555884 0 0 1632 56 531 762 2 14 27 56 > 1 2 0 14012 9768 1505492 0 0 8 391688 790 846 0 26 23 > 52 > 0 3 0 14904 9812 1502068 0 0 0 98880 695 574 1 11 21 66 > 0 3 0 13512 9812 1503280 0 0 4 74060 642 496 0 10 29 61 > 0 4 0 14024 9844 1502960 0 0 0 73940 658 598 1 10 32 57 > 0 3 0 18676 9872 1501804 0 0 4 35736 591 437 0 5 15 80 > 0 4 0 27660 9884 1501824 0 0 8 12 541 284 2 1 2 95 > 0 3 0 39176 9888 1501912 0 0 92 1720 448 274 0 1 10 89 > 0 3 0 51596 9888 1501912 0 0 0 0 545 593 1 1 36 61 > 0 1 0 66220 9892 1501912 0 0 0 16 570 555 0 0 48 51 > 0 0 0 15144 9940 1556124 0 0 2240 176 506 753 4 10 37 49 > 0 0 0 15192 9948 1556176 0 0 0 1052 283 483 0 0 98 1 > 0 0 0 15192 9956 1556176 0 0 0 1052 316 769 2 0 97 2 > 0 1 0 14348 9988 1556772 0 0 2996 1052 408 718 0 0 77 22 > 0 0 0 14252 10012 1558320 0 0 1588 0 404 803 2 6 78 13 > 0 0 0 14680 10036 1557968 0 0 1840 28 400 1197 0 3 88 8 > 0 0 0 14740 10080 1557812 0 0 2168 0 406 992 2 2 85 12 >^C
I have been reading Russian forums about this problem. I have to go now and cannot write more details at the moment, but people there tried changing the scheduler from cfq to another one and the iowait bug remained. If you think the bug is in the scheduler, maybe try changing the scheduler through /proc?
I'm annoyed by the same bug (I suspect), and I'm able to reproduce it with both the anticipatory and cfq schedulers. So is this bug really linked to cfq? I'm running my kernel with: elevator=as
@Jens Axboe: I tried your patch in comment 366 on the 2.6.30 kernel, and it did improve responsiveness in my initial testing. I used to have the problem that the kernel became highly unresponsive on large file copies to the same partition or as soon as it tried to use swap (in 2.6.30-rc3 and earlier), but the unpatched 2.6.30 performs quite reasonably and the patch improved responsiveness further (my unscientific test results are that moving the mouse resulted in much less 'stuttering' after the patch - note that with earlier kernels the mouse would just freeze). I did though just find a problem where an overnight memory leak caused X to become so unresponsive it couldn't even draw the screen background until I killed the culprit (firefox). This might be unrelated to the patch, ie a problem with swap management, but it does show that the kernel can still become bogged down under high disk I/O.
Did anybody here resolve this bug? The only workaround I see is installing FreeBSD instead of a Linux kernel version >= 2.6.18.
I think I have coped with this bug! I changed some kernel settings, and my server has now been working stably, without freezes and timeouts from high iowait, for 10-12 hours already!

Detailed info: my kernel is now 2.6.22.14-72.fc6 (Fedora Core 6).

This suggestion does not resolve the bug (I think there is a bug in the kernel and it is still there), but it is a workaround. I have read many topics and forums and settled on these commands:

# echo 50 > /proc/sys/vm/vfs_cache_pressure
# echo deadline > /sys/block/DEVICE/queue/scheduler
# # echo 1 > /sys/block/DEVICE/device/queue_depth
# echo 1024 > /sys/block/DEVICE/queue/nr_requests

DEVICE is 'hda' or 'sda' for some HDDs. I didn't test queue_depth because for my HDDs (SAS SCSI + RAID10) this file is read-only (I think there is no NCQ support there). But maybe this command will help you. I don't know.

I suggest that anybody who has freezes and timeouts with high iowait try this tuning. I am very glad! Please try this workaround. I didn't test the 'dd' command, but heavy HDD activity used to freeze the server. Now I don't see this.
Can you try the three settings separately, to see which one makes the large difference?
I will try, but this is my work server under heavy load. I am afraid to touch anything there now :-/ But soon I will try to determine the main option of this tuning. More than 24 hours have passed already and I have had no trouble with freezes there. I cannot believe it ...
Here is a test on the same server as in my post #359, but after the tuning from post #385:

# dd if=/dev/zero of=testfile.1gb bs=1M count=1000

And during 'dd' I run vmstat 1:

0 2 116 103632 507240 2016112 0 0 1324 16 1024 963 1 1 50 48 0
1 2 116 101512 507484 2015736 0 0 1436 0 1314 1253 21 5 25 48 0
0 2 116 103632 507240 2016112 0 0 1324 16 1024 963 1 1 50 48 0
0 7 116 25208 496944 2105464 0 0 4 26272 2892 239 0 4 23 73 0
0 9 116 21636 496972 2109568 0 0 32 21904 2150 339 0 2 8 90 0
0 10 116 39888 481904 2105552 0 0 4 23544 1964 368 0 4 1 96 0
0 9 116 49036 472984 2105016 0 0 8 18252 1730 728 0 3 0 97 0
0 7 116 61700 459736 2105412 0 0 16 74176 2167 317 0 5 13 82 0
0 7 116 71416 450576 2104272 0 0 24 8680 1322 237 0 4 16 80 0
1 5 116 82772 439000 2106280 0 0 24 58616 1457 3332 0 7 5 88 0
1 5 116 97224 424752 2105804 0 0 20 60164 848 286 0 6 24 70 0
0 7 116 110700 409384 2107036 0 0 56 105584 884 397 0 9 15 76 0
2 5 116 116444 392304 2118776 0 0 288 95624 1096 424 1 11 10 78

As you can see, iowait is no longer steadily at 90-99%, only sometimes ...
Here are some tests. First I go back to the default settings from before the tuning:

# echo 100 > /proc/sys/vm/vfs_cache_pressure
# echo cfq > /sys/block/sda/queue/scheduler
# echo 128 > /sys/block/sda/queue/nr_requests
# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
^C
116+0 records in
116+0 records out
121634816 bytes (122 MB) copied, 20.5609 seconds, 5.9 MB/s   (!!!)

While 'dd' is running I run vmstat 1:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b swpd   free   buff   cache si so   bi   bo   in   cs us sy id wa st
 0 10  116 760168 503488 1329836  0  0    4    4    0    0  9  3 65 23  0
 0 11  116 756132 502488 1330536  0  0 1332 5648 1744 4909  5  2  2 91  0
 0 12  116 760208 503128 1330856  0  0 1136 4388 1875 3053  4  2  0 94  0
 0 11  116 759832 502668 1331608  0  0 1004 7488 2379 4032  1  2  0 97  0
 0 12  116 758740 503288 1331832  0  0 1280 3252 1818 2402  1  1  0 98  0
 0 10  116 733976 502936 1356780  0  0 1232 4476 1753 4143  1  3  0 96  0
 1  8  116 733596 502368 1357324  0  0  804 5792 1831 2980 20  2  0 79  0
 1  7  116 738388 502920 1357788  0  0  928 6652 1875 2349 17  2  4 77  0

**************************

After this I apply the tuning:

# echo 50 > /proc/sys/vm/vfs_cache_pressure
# echo deadline > /sys/block/sda/queue/scheduler
# echo 1024 > /sys/block/sda/queue/nr_requests
# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
^C
638+0 records in
638+0 records out
668991488 bytes (669 MB) copied, 10.463 seconds, 63.9 MB/s   (!!! :-))) )

During 'dd' I run in another terminal:

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b swpd   free   buff   cache si so   bi     bo   in    cs us sy id wa st
 1  7  116 718764 502884 1371484  0  0    4      4    0     0  9  3 65 23  0
 0  9  116 687208 502924 1405624  0  0    8  26664 2708   746  6  4  3 87  0
 1  8  116 668924 502976 1422116  0  0   16  21404 2246  8462  1  4  9 87  0
 0  8  116 654804 501632 1434492  0  0   24  30804 2072  9249 10  4  0 86  0
 0 10  116 613152 501692 1475220  0  0   20  42880 2021  4408 15  5  7 73  0
 2 10  116 559860 499464 1524600  0  0   32  58504 2108 10612  5  6 15 74  0
 0 11  116 510132 499528 1578340  0  0   36  59400  984  1748 17  5  2 77  0
 0 10  116 399420 499672 1689316  0  0  108 111332  910   957  4 11  2 84  0
 1  7  116 331556 499756 1750580  0  0  104  62268 1501  5255 11  6 10 74  0

*********************

One more note: I have other servers with different hardware. There I cannot reproduce this iowait problem, with or without this tuning (they run Fedora release 7 (Moonshine), kernel 2.6.23.17-88.fc7). Now I think that this trouble does not affect all HDDs. Maybe it is hardware dependent. I am now investigating which option actually helps to resolve the iowait problem.
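To compare two vmstat captures like the ones above without eyeballing the wa column, the rows can be averaged with a small script. This is only a sketch: the function name summarize_vmstat is made up for illustration, it assumes the column header line (r b swpd ... wa st) is present in the capture, and it drops the first data row because vmstat reports averages since boot there.

```python
def summarize_vmstat(text, columns=("bo", "wa"), skip_first=True):
    """Average selected columns of `vmstat 1` output.

    Assumes the header line naming the columns ("r b swpd ...") is present.
    """
    names = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r" and "wa" in fields:
            names = fields  # column header; it may repeat in long captures
            continue
        if names is None or len(fields) != len(names):
            continue  # banner line ("procs ---memory---") or truncated row
        try:
            rows.append(dict(zip(names, map(int, fields))))
        except ValueError:
            continue  # not a numeric data row
    if skip_first:
        rows = rows[1:]  # first sample is the average since boot
    if not rows:
        raise ValueError("no vmstat data rows found")
    return {c: sum(r[c] for r in rows) / len(rows) for c in columns}


if __name__ == "__main__":
    import sys

    # Usage: vmstat 1 | tee capture.txt   (stop with Ctrl-C), then
    #        python summarize_vmstat.py capture.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as fh:
            print(summarize_vmstat(fh.read()))
```

Averaging bo (blocks written out per second) next to wa makes it easier to see whether a lower iowait figure also comes with higher write throughput, which is the comparison being made between the cfq and deadline runs.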
I determined the main option. Only this option helped me:

# echo deadline > /sys/block/sda/queue/scheduler

I don't understand why. I have read many Russian topics saying that changing the scheduler doesn't help ... I didn't think that changing only the scheduler would help me. But I only changed the scheduler from cfq to deadline, and the 'dd' test now gives this:

# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.7121 seconds, 76.5 MB/s

iowait was only sometimes at 80-90%. Here are my current settings:

# cat /proc/sys/vm/vfs_cache_pressure
100
# cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq
# cat /sys/block/sda/queue/nr_requests
128

Now I will keep these settings and watch whether there are freezes or not.
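For anyone who wants to record which scheduler was active during a test, the bracketed entry in the scheduler file can be read programmatically. The following is a minimal sketch; the helper names parse_scheduler_line and read_scheduler, and the sys_root parameter (only there so the code can be pointed at a test directory), are assumptions made for illustration.

```python
from pathlib import Path


def parse_scheduler_line(line):
    """Split e.g. 'noop anticipatory [deadline] cfq' into (active, available)."""
    available, active = [], None
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # the active scheduler is shown in brackets
            active = token
        available.append(token)
    return active, available


def read_scheduler(device, sys_root="/sys"):
    """Read the scheduler file of one block device, e.g. device='sda'."""
    path = Path(sys_root) / "block" / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())


if __name__ == "__main__":
    block_dir = Path("/sys/block")
    if block_dir.is_dir():
        for dev in sorted(p.name for p in block_dir.iterdir()):
            try:
                active, available = read_scheduler(dev)
            except OSError:
                continue  # device without a queue/scheduler file
            print(f"{dev}: active={active} available={available}")
```

Logging this next to each dd result avoids mixing up runs when switching back and forth between schedulers.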
I made some experiments, and I think I have found the main reason for the high iowait with the cfq scheduler. I switched between the cfq and deadline schedulers on my two servers with the same hardware and OS (FC6, kernel 2.6.22.14-72.fc6). They have the same CPUs, motherboard, SAS and RAID controllers, and HDDs. But only on one server did I see high iowait with the cfq scheduler during the 'dd' command. I think that the main reason is A LARGE NUMBER OF USED INODES ON THE PARTITION. For example:

The 'OK' server, where I could not reproduce the bug:

# df -i
/dev/sda1    524288    8543   515745    2% /
tmpfs        219756       1   219755    1% /dev/shm
/dev/sda6    787200   34068   753132    5% /usr
/dev/sda5    787200   25582   761618    4% /usr/local
/dev/sda7    524288    1993   522295    1% /var
/dev/sda8  30900224 1719787 29180437    6% /wwws
/dev/sda3   1048576   49655   998921    5% /wwws/accel-proxy

I wrote the test file testfile.1gb to the /wwws partition. There is no very high iowait there with either the deadline or the cfq scheduler.

The second, 'BAD' server has the same hardware and software, but df -i there shows:

Filesystem   Inodes   IUsed    IFree IUse% Mounted on
/dev/sda1    524288    7444   516844    2% /
tmpfs        219756       1   219755    1% /dev/shm
/dev/sda7    787200   35307   751893    5% /usr
/dev/sda6    787200   27520   759680    4% /usr/local
/dev/sda8    524288    2334   521954    1% /var
/dev/sda3  30900224 5332794 25567430   18% /wwws
/dev/sda5    524288    4128   520160    1% /wwws/accel-proxy

I ran the 'dd' tests on the /wwws partition there too (I am used to writing big files there). If I use the cfq scheduler and (important) some processes are working (apache, mysql, i.e. the server is not idle), then during the 'dd' command I get very high iowait (90-99%) and a very low write speed (9-10 MB/s). If I switch to the deadline scheduler and write to the /wwws partition again, I get 60-80 MB/s and no high iowait. But if I write testfile.1gb to another partition (for example /var), I get no iowait even with the cfq scheduler. Thus the cfq scheduler plus many used inodes is bad, I think. The deadline scheduler plus many used inodes is not bad.
So I think that a large number of used inodes on a partition and the cfq scheduler together go wrong somehow. Maybe if I had as many used inodes on my other servers (FC7, where I could not reproduce the iowait problem) I could reproduce this high iowait bug there too. Please try creating very many small files on some partition (5-6 million, for example) and then test 'dd' with the cfq scheduler.
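To try the experiment suggested above, a partition can be filled with small files by a short script. What follows is a rough sketch under assumptions of my own: the function names, the one-byte payload and the layout of 1000 files per subdirectory are all illustrative, and nothing here has been confirmed to reproduce the problem.

```python
import os
import tempfile


def inode_usage(path):
    """Return (total, used, free) inode counts for the filesystem holding `path`."""
    st = os.statvfs(path)
    return st.f_files, st.f_files - st.f_ffree, st.f_ffree


def make_small_files(directory, count, per_dir=1000, payload=b"x"):
    """Create `count` tiny files under `directory`, `per_dir` files per subdirectory."""
    created = 0
    for i in range(count):
        sub = os.path.join(directory, f"d{i // per_dir:06d}")
        if i % per_dir == 0:
            os.makedirs(sub, exist_ok=True)
        with open(os.path.join(sub, f"f{i:09d}"), "wb") as fh:
            fh.write(payload)
        created += 1
    return created


if __name__ == "__main__":
    # Small demo in a temporary directory. For the experiment described above
    # one would point this at the partition under test and use a count in the
    # millions, then rerun the dd test with cfq and with deadline.
    with tempfile.TemporaryDirectory() as tmp:
        before = inode_usage(tmp)[1]
        make_small_files(tmp, 2000)
        print("inodes used before/after:", before, inode_usage(tmp)[1])
```

The inode_usage helper reports the same totals that df -i prints, so the IUsed figure can be checked before and after filling the partition.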
It would be ideal if you could try 2.6.30 on the problematic server. I realize that this may not be easy; however, there's not much I can do about a problem on an ancient kernel. If you do try 2.6.30 and it also has the same problem, then I want you to capture some blktrace data of both deadline and cfq. Basically, right after you start the dd test, in another terminal do:

# cd /dev/shm; blktrace /dev/sda

and ctrl-c that blktrace after ~5 seconds or so. Then stop the dd as well. Save the blktrace files on the hard drive. Now switch to deadline and repeat the exact same thing. Then tar up the two sets of files and attach them to this bug report.
Jens Axboe, I would be happy to help, but I cannot try 2.6.30 :( I have never installed a kernel, and I am afraid that something may go wrong after installing it and I will not be able to access the server. This server is under heavy load and is located on another continent. I cannot take the risk, sorry ;-( Maybe somebody else will try creating many small files on an HDD (many inodes, ~5-6 million for example) and compare the cfq and deadline schedulers?
Created attachment 22019 [details] test results 2.6.30: cfq, deadline It gives no improvement in responsiveness in any variant. Maybe a tiny bit at most.
Hi, I am using the 2.6.30 kernel with the patch from #366. Before using the patch I had real trouble when downloading large files with torrent at high speed (over 5MB/sec). Now it just works great. Thanks for this patch.
Created attachment 22167 [details] test result 2.6.30 without ACHI I turned off AHCI in the BIOS on the laptop. The system has become much more responsive. It is now possible to start new applications while dd is running.
Created attachment 22180 [details] Drain async IO on the hw side This patch makes sure that async IO has completely drained from the device queue before starting sync IO. Hopefully that should make things as good as disabling NCQ, and it should even improve the situation without NCQ. I'd like for people to test this patch and see if it makes a difference. It's against 2.6.31-rc (ish), but I _think_ it will apply against 2.6.30 as well. If not, holler, and I'll do a backport too.
Created attachment 22184 [details] test result 2.6.30 with patch from #397 (2.6.30 + NCQ + patch from #397) == (2.6.30 + NCQ). New applications start very slowly.
I have come across Toshiba notebooks before and have seen how Linux ran on them. With ACPI switched off it was possible to listen to music (but, understandably, not to see how much charge was left in the battery); with ACPI switched on there was no sound at all, but at least you could see what was happening with the battery (and tell whether someone had stolen it while we were enjoying the music). One could blame the scheduler (in our case cfq) for this and look for the reason it cannot schedule two processes at once, playing music and checking the battery state. In our case it turned out that all schedulers were broken (I tried them all, and none worked). Probability theory does not rule out such an event. But for all schedulers to be broken for one person while they all work for another would take supernatural forces, and fighting those is useless. So why does everything work for one person while for another the system barely crawls? Where is the difference? Only in the computers (or, more precisely, in their components). I may be mistaken, but maybe someone can then explain why on one piece of hardware everything simply flies and on another it barely crawls (even though the second has the faster processor, faster disks and a faster bus).
Had big troubles on a ASUS PN5e Motherboard and a WD 320 G. Compiled a 2.6.31-rc3 with your patch and it works great. Thank you very much! I'd like to backport it to 2.6.29 to try it together with the realtime patch. Is there a chance to get it working?
2.6.31-rc3-git3 + NCQ + patch from #397: new applications start very slowly. Without NCQ new applications start quickly.
There is one more interesting question. KSysGuard shows "Used Memory" = 0.66 GB.

> top
top - 21:43:57 up 7:00, 3 users, load average: 0.74, 0.39, 0.29
Tasks: 149 total, 3 running, 146 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 1.3%sy, 0.0%ni, 93.1%id, 2.5%wa, 0.2%hi, 0.2%si, 0.0%st
Mem: 8035628k total, 7998716k used, 36912k free, 0k buffers
Swap: 2104472k total, 6564k used, 2097908k free, 7402836k cached

When the Mem: used value approaches the Mem: total value, the graphical interface becomes much slower (even without any disk operations). Am I the only one with this problem?
I applied the patch in 397 to a vanilla 2.6.30.4 and the difference was dramatic (with the patch is _much_ better, ie the complete freezing for 15+ seconds when running multiple IO intensive jobs are gone). I'll work on getting some hard numbers (with iobench, etc) to see if they agree.
(In reply to comment #397)
> Created an attachment (id=22180) [details]
> Drain async IO on the hw side
>
> This patch makes sure that async IO has completely drained from the device
> queue before starting sync IO. Hopefully that should make things as good as
> disabling NCQ, and it should even improve the situation without NCQ.
>
> I'd like for people to test this patch and see if it makes a difference. It's
> against 2.6.31-rc (ish), but I _think_ it will apply against 2.6.30 as well.
> If not, holler, and I'll do a backport too.

Is this in the vanilla 2.6.31-rc5 already?
No, the patch is queued up for 2.6.32 since it was a rather risky change for 2.6.31. But I'm glad it makes a difference, that means that the starvation experienced is largely on the device side. By draining the queue, we prevent that from happening (or, at least we lessen the effect dramatically).
2.6.31-rc7 + patch in 397 - There are no improvements
No improvements seen here with 2.6.30.5 and the patch, either. Pretty much *any* write to swap causes major latency (disruption to audio, graphics etc.).
There is an improvement in desktop responsiveness with kernel 2.6.31 and the as (anticipatory) scheduler compared to the cfq scheduler. It does not solve the problem, but it makes it more bearable. I am using a fully encrypted LVM drive with ext3 partitions, mounted with noatime and data=ordered.
I've observed something that might be relevant to this bug (using the 2.6.31.5 kernel): when I do large I/O operations from one external device (say /dev/sdb) to another slow USB flash key (say /dev/sdc), I can hear my *internal* hard drive (/dev/sda) thrashing away constantly even though its light indicates that no read/write activity is going on. During this time anything that requires access to /dev/sda is slowed right down, and hence starting new programs is slow. When I start copying, e.g. using nautilus, there is usually a 400 MB buffering delay before writing to the USB drive starts (i.e. before its light starts flashing). During this time, there is NO /dev/sda thrashing. /dev/sda starts thrashing as soon as the USB key light starts flashing. So there appears to be a bug that makes /dev/sda constantly seek during the /dev/sdc USB write operation, and this is affecting system responsiveness.
Please try 2.6.32-rc5. Make sure you are using CFQ as your io scheduler.
I opened http://bugzilla.kernel.org/show_bug.cgi?id=14491 to track this bug separately - I've put comments in there about 2.6.32-rc5, which I don't think exhibits the problem.
Created attachment 23618 [details] Simple sleeper test case

This bug occurs more persistently while working in a virtual machine or while using Java, and I still think that this is a process scheduler bug (or something related), so here is another test case which shows the suspected behaviour. As there are many system calls while using a virtual machine, I tried to find an equivalent test. The test case just sleeps for 1µs and measures the time taken by the usleep operation. I use this many usleep operations because the problem does not occur deterministically and I tried to catch as many occurrences as possible.

I have run this test case on three machines. The first one was a Core2 Duo with a first-generation SSD (OCZ Core Series) with poor write performance, on an Ubuntu kernel 2.6.31-14-generic. The partitions are block aligned. I ran this test while my wife was using firefox. Every time she submitted something and firefox used sqlite to write the history, there was a high latency in the sleep test.

Timediff 7629094: 16.80ms Total: 61.12ms
Timediff 7629100: 18.82ms Total: 93.68ms
Timediff 7629101: 19.96ms Total: 113.54ms
Timediff 7629102: 19.98ms Total: 133.43ms
Timediff 7629103: 19.97ms Total: 153.31ms
Timediff 7629104: 20.00ms Total: 173.24ms
Timediff 7629105: 19.96ms Total: 193.09ms
Timediff 7629106: 20.02ms Total: 213.02ms
Timediff 7629107: 19.94ms Total: 232.86ms
Timediff 7636162: 16.40ms Total: 34.44ms
Timediff 7636164: 19.90ms Total: 64.00ms

While the duration of 100 usleep calls should be somewhere between 10ms and 20ms, 10 usleep(1) calls take more than 200ms. This behaviour is reproducible.

On my machine, a Core2Duo with a normal 2.5" hard drive and a vanilla kernel 2.6.31.5, the behaviour is similar. While making a backup from one hard drive to another, the latency jumps to >30ms for one usleep(1) roughly every second. But there are some latencies greater than 150ms for one usleep(1).
Timediff 11054523: 38.23ms Total: 53.19ms
Timediff 11212737: 21.64ms Total: 31.46ms
Timediff 11213557: 35.59ms Total: 44.62ms
Timediff 11213939: 59.88ms Total: 65.76ms
Timediff 11264190: 40.83ms Total: 49.72ms
Timediff 11264709: 53.77ms Total: 63.09ms
Timediff 11265629: 145.74ms Total: 155.96ms
Timediff 11327458: 16.94ms Total: 25.23ms
Timediff 11376430: 18.91ms Total: 27.67ms
Timediff 11408941: 17.67ms Total: 26.36ms
Timediff 11424964: 19.26ms Total: 28.01ms
Timediff 11509722: 19.84ms Total: 28.30ms
Timediff 11627259: 27.01ms Total: 34.51ms
Timediff 11645718: 18.26ms Total: 29.80ms

On my server, an Athlon X2 with a fully encrypted RAID-5 with LVM and a 2.6.18-128.2.1.el5.028stab064.7 (CentOS with OpenVZ) kernel, the behaviour was even worse. While copying a 4GB iso, there are latencies of up to one second for one usleep(1).

Timediff 40397: 24.16ms Total: 122.93ms
Timediff 40417: 859.04ms Total: 981.78ms
Total 40417: 981.78ms
Timediff 45928: 22.62ms Total: 220.16ms
Timediff 50471: 25.02ms Total: 135.80ms
Timediff 51085: 19.23ms Total: 163.03ms
Timediff 51097: 205.12ms Total: 360.66ms
Timediff 51160: 47.47ms Total: 422.81ms
Total 51160: 422.81ms
Timediff 51662: 21.93ms Total: 279.08ms
Timediff 51663: 40.87ms Total: 318.58ms
Total 52068: 401.49ms
Timediff 54540: 16.69ms Total: 150.93ms
Timediff 63056: 78.07ms Total: 203.86ms
Timediff 65673: 16.43ms Total: 228.44ms
Timediff 65675: 24.04ms Total: 265.11ms

On all three machines, the latencies were small when no fsync or copy operation was running. On the machines with the Core2Duo and kernel 2.6.31 the latencies are below 0.2ms and 0.1ms, even while watching a movie or using 100% of the CPU. On the Athlon X2 with kernel 2.6.18, the latencies are always below 1ms.

A 200ms latency while moving the mouse is noticeable. A delay of 1 second while moving the mouse would be the freezes which many of us notice during copy operations. Why does the kernel delay the resumption of the usleep(1) operation by up to one second during a copy operation?

Please have a look at this behaviour.
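The attached test case itself is not reproduced in this thread. As a rough illustration of the idea only (sleep for about 1µs in a loop and report every iteration that takes far longer than expected), a sketch could look like the one below. It is not the attachment: the function name, the 10ms reporting threshold and the output format are assumptions, and the interpreter adds overhead of its own, so absolute numbers will differ from a C usleep(1) loop.

```python
import time


def measure_sleep_latency(iterations=10000, sleep_s=1e-6, report_ms=10.0):
    """Sleep `sleep_s` seconds repeatedly; return (worst_ms, [(index, ms), ...]).

    The list holds every iteration that took longer than `report_ms`.
    """
    worst = 0.0
    outliers = []
    for i in range(iterations):
        start = time.perf_counter()
        time.sleep(sleep_s)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        worst = max(worst, elapsed_ms)
        if elapsed_ms > report_ms:
            outliers.append((i, elapsed_ms))
    return worst, outliers


if __name__ == "__main__":
    worst, outliers = measure_sleep_latency()
    for index, ms in outliers:
        print(f"Timediff {index}: {ms:.2f}ms")
    print(f"worst: {worst:.2f}ms, outliers: {len(outliers)}")
```

Running it once on an idle machine and once during a large copy gives two sets of outliers that can be compared directly.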
I also had a problem with system latency under high I/O usage. After applying the patch from #397 to kernel 2.6.31.5, the problem became much smaller. Before patching, the machine sometimes froze for more than 5 minutes. Now the maximum latency is less than half a second.
I have the same issue on a machine with i845e chipset, P4-1.5 Northwood, 2GB DDR RAM, GF6800 video and Audigy2 sound card. My main HDD is 160GB IDE Seagate. When there is disk activity the system becomes virtually unusable. For example, when I am burning a DVD on the drive attached to SII 3512 SATA controller, the CPU load goes from 40% at 7-8x to 98% at 16x. Downloading Fedora12 ISO last night at 500 kb/s kept system busy at 90%! If I start kernel compile, CPU load is stable 100%, which is Okay, but switching tabs in Firefox takes 10 seconds and starting any application like JUK, Dolphin, Konsole takes up to 1 minute. Running Fedora11 with 2.6.30.9.96 FC11 i686 PAE kernel. The system has become a bit more responsive (by about 10-20%) since I noticed p4-clockmod was being loaded and shut it down.
There are no enthusiastic comments after the release of 2.6.32. As I understand it: "and the cart is still where it was" (nothing has moved).
Created attachment 25281 [details] perf chart high io latency I am using the 2.6.33 kernel and this problem is still present. When I copy a big file (a few GB) the system becomes unresponsive. I ran perf chart and generated an svg image. You can see that Plasma-desktop (part of KDE) is blocked by IO for a long time. I copied the file from an ntfs partition, but it also happens when I copy big files within my Linux partition or from the hard drive to a pendrive.
(In reply to comment #416) > I am using 2.6.33 kernel and this problem is still present. Yep, this definitely earns the Most Embarrassing Linux Bug Award 2009 and is a Nominee for Most Annoying Linux Bug 2009 although the ATI binary driver wins in this category. Call me unfair for allowing binary blobs.
I will agree that something still isn't right with the VM. In my uninformed opinion, it does seem to be far too eager to swap out executable pages in favor of streaming pages. Unfortunately, it seems that very few people know the VM well enough to fix it.
I am currently using the linux kernel 2.6.33 and the desktop responsiveness is awful on my machine compared to the 2.6.32.x kernel. It's even worse than I have ever seen it before. The load avg rises to >7 very quickly while writing many small files to the filesystem. I can run some tests with my configuration, but a kernel developer should tell me which tests.
(In reply to comment #419)
> I am currently using the linux kernel 2.6.33 and the desktop responsiveness
> is awful on my machine compared to the 2.6.32.x kernel. It's even worse than
> I have ever seen it before. The load avg rises to >7 very quickly while
> writing many small files to the filesystem. I can run some tests with my
> configuration, but a kernel developer should tell me which tests.

This isn't really the best place to bring this up. Please send a full description to linux-kernel@vger.kernel.org. cc myself, Ingo Molnar <mingo@elte.hu>, Peter Zijlstra <a.p.zijlstra@chello.nl>, Jens Axboe <jens.axboe@oracle.com>. In that email, please identify what the system is doing at the time. Is it disk-related? CPU scheduler related? etc. Thanks.
Gentlemen,

I have suffered from the high iowait problem for almost 4 years and have been watching this bug report (Bug 12309) on bugzilla.kernel.org for 1 year. Yesterday I finally managed to get out of this trouble by switching from CentOS 5.4 (with kernel 2.6.18) to Zenwalk 6.2 (with a snapshot kernel 2.6.32.2).

The computer is used to collect signal data from 4 gas turbines in a power plant. The project started in 2004, and we used Mandrake 9 and Zenwalk, both with 2.4.x kernels, and there were no high iowait problems. In 2006 we switched to Fedora Core 6 (kernel 2.6.18) and then CentOS 5, and iowait began to cause trouble: the system's response to mouse and keyboard became very slow, and new applications took a long time to launch.

During these years I always thought the main reason was that the computer's hardware was not good enough. But early this month the plant upgraded the computer to a new Lenovo server with two Xeon E5504 CPUs (8 cores in total) and 4GB of memory, and iowait is still very, very high. The following is the output of the "top" command on that machine:

Tasks: 215 total, 1 running, 213 sleeping, 0 stopped, 1 zombie
Cpu0 : 1.0%us, 0.3%sy, 0.0%ni, 65.9%id, 32.8%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 1.0%us, 3.6%sy, 0.0%ni, 45.0%id, 50.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 1.0%us, 4.0%sy, 0.0%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 1.3%us, 3.3%sy, 0.0%ni, 56.3%id, 38.3%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu4 : 1.3%us, 6.7%sy, 0.0%ni, 0.0%id, 89.7%wa, 0.7%hi, 1.7%si, 0.0%st
Cpu5 : 0.3%us, 3.3%sy, 0.0%ni, 91.7%id, 0.0%wa, 0.7%hi, 4.0%si, 0.0%st
Cpu6 : 10.3%us, 30.2%sy, 0.0%ni, 50.2%id, 2.3%wa, 1.0%hi, 6.0%si, 0.0%st
Cpu7 : 1.3%us, 8.6%sy, 0.0%ni, 83.1%id, 4.0%wa, 1.0%hi, 2.0%si, 0.0%st
Mem: 4078540k total, 3872720k used, 205820k free, 182344k buffers
Swap: 4192956k total, 0k used, 4192956k free, 2815596k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 3841 markv 15   0 72172  12m 8380 S 42.2  0.3   1984:24 lvinf
 8573 markv 15   0 60232  12m 8876 S 11.6  0.3   0:17.22 mark
 4067 markv 15   0 19056 3224 2336 S 10.6  0.1 759:52.00 dms
 3548 mysql 21   0  656m 617m 9292 S  9.0 15.5 764:42.05 mysqld
27042 markv 15   0 69404  12m 8756 S  4.3  0.3 290:36.14 walin
 3810 root  15   0 39772  15m 8224 S  1.3  0.4   3:59.76 Xorg
    1 root  15   0  2068  620  532 S  0.0  0.0   0:01.19 init
    2 root  RT  -5     0    0    0 S  0.0  0.0   0:00.04 migration/0
    3 root  34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root  RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root  RT  -5     0    0    0 S  0.0  0.0   0:00.02 migration/1
    6 root  34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
    7 root  RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
    8 root  RT  -5     0    0    0 S  0.0  0.0   0:00.01 migration/2
    9 root  34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/2
   10 root  RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2
   11 root  RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/3
   12 root  34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/3
   13 root  RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/3
   14 root  RT  -5     0    0    0 S  0.0  0.0   0:00.09 migration/4
   15 root  34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/4
   16 root  RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/4
   17 root  RT  -5     0    0    0 S  0.0  0.0   0:00.03 migration/5
   18 root  36  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/5
[markv@markgt ~]$

Our application's whole job is to insert 16 records per second (every record is a fixed 12 bytes in 3 fields) into a MySQL database; the storage is an LVM volume consisting of two 750GB Seagate SATA 7200RPM disks.

I am sure this high iowait was not caused by other things like network cards or the video card, because I experimented with commenting out only the MySQL insert lines in our source code, and the system iowait dropped to 0 and the GUI became very responsive. It also has nothing to do with the I/O scheduler, because I tested deadline and noop on CentOS 5.4 and iowait could not be reduced. I also tried enlarging /sys/block/sda/queue/nr_requests, and it did not help.
I learned from this bugzilla report that kernel 2.6.32 has fixed this high iowait problem, so I tested the snapshot kernel 2.6.32.2 of Zenwalk on my notebook and found that the high iowait was gone. So yesterday I installed Zenwalk 6.2 with the 2.6.32.2 kernel on that server. Although the kernel only detected/used one Xeon CPU and 2GB of memory, iowait is very low and the whole system has become very fast; only for a few seconds at a time does iowait reach 30%-40%, and then it drops back to 0 very quickly. By the way, the I/O scheduler is cfq. The following is the "top" output on it:

Tasks: 157 total, 2 running, 155 sleeping, 0 stopped, 0 zombie
Cpu0 : 12.3%us, 7.8%sy, 0.0%ni, 77.3%id, 0.0%wa, 0.0%hi, 2.6%si, 0.0%st
Cpu1 : 11.3%us, 8.4%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.0%hi, 4.2%si, 0.0%st
Cpu2 : 5.2%us, 7.2%sy, 0.0%ni, 84.0%id, 0.0%wa, 0.3%hi, 3.3%si, 0.0%st
Cpu3 : 8.1%us, 7.7%sy, 0.0%ni, 81.3%id, 0.0%wa, 0.0%hi, 2.9%si, 0.0%st
Mem: 2272368k total, 1153508k used, 1118860k free, 79384k buffers
Swap: 4192956k total, 0k used, 4192956k free, 797568k cached

 PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
3085 mysql 40  0  453m  68m 4864 S   35  3.1 12:25.12 mysqld
3203 markv 40  0 77852  17m  11m S   24  0.8  7:53.75 mark
2684 root  40  0 16440 2896 2144 S    9  0.1 11:03.99 dms
3879 markv 40  0 42256  12m 9336 S    4  0.6  1:43.26 walin
1520 root  40  0  4156 1232  972 S    0  0.1  0:00.06 ntpd
3235 root  40  0 64164  29m   9m S    0  1.3  0:50.78 X
3885 markv 40  0  2452 1180  892 R    0  0.1  0:02.08 top
   1 root  40  0   804  332  292 S    0  0.0  0:00.90 init
   2 root  40  0     0    0    0 S    0  0.0  0:00.00 kthreadd
   3 root  RT  0     0    0    0 S    0  0.0  0:00.00 migration/0
   4 root  20  0     0    0    0 S    0  0.0  0:00.05 ksoftirqd/0

We didn't change the storage; it is still the same LVM volume on two Seagate disks, with 4 years of turbine data on it.

I found that in kernel 2.6.32.8 the high iowait is back. How do I know that?
When I copy a 700MB avi file from my notebook disk to a 3.5" USB mobile disk, the reading-side disk LED starts to flash quickly and immediately, but the writing-side disk LED stays still for a long time (around 25-30 seconds) and then starts to flash slowly, and the whole copy is abnormally long and the system is barely responsive. Kernel 2.6.32.2 is the only 2.6 kernel (since 2.6.18) on which I found that both the reading-side and writing-side disk LEDs start to flash quickly and immediately. There must be something wrong with the write cache behavior that causes the high iowait; it was fixed in 2.6.32.2 and brought back in 2.6.32.8. So copying a big file to a USB disk and watching the disk LEDs can be used by the kernel developers as a method to reproduce and observe this bug. I hope this may be helpful. I noticed Mr. Morton said this is not the best place to discuss, but linux-kernel@vger.kernel.org rejected two of my emails from two different email accounts. So I mailed and cc-ed the email, and I am also posting it here so that more people can share my experience. Best regards, Frank Ren
I'm using Mandriva 2010 with kernel 2.6.33-rc5. The freeze is huge: the system becomes unusable at every small disk activity (for example, sudo urpmi blackbox). The problem is present with kernels 2.6.31 and 2.6.32 too. Other kernels were not tested. Please reopen the bug. It is a huge problem for many people.
It *is* a huge problem indeed. I kinda got used to it but it feels like in the 80s. I still have a Windows in a 10GB corner of my HDD which I use very rarely but every time I do it feels like a miracle to see what these modern computers are able to do when they don't run a f*cked up kernel :/
One angle to tackle this would be to look at those who don't suffer from this bug: what kind of kernel (and with what parameters) and hardware they're running. Since this seems to affect a wide range of people and setups, it could be interesting... but also a huge undertaking.
Could this bug be triggered by the GCC compiler? Has anyone tried to compile a 2.6.30-2.6.33 kernel with an earlier gcc version?
Could be interesting, but I've read some comments whose writers had tried to isolate the two consecutive kernel versions surrounding the bug. In the end, it might be quicker, though quite boring for the operator, to try a laggy scenario with many different kernel versions, catching the bug by dichotomia? We might also distribute the effort between ourselves. I propose something like this: everybody starts from the same kernel version which exhibits the bug, and tries the same laggy scenario against a set of kernel versions. Let's take 4 versions each to cover the 2.6.x revisions; it should not take so long. Who volunteers? Have a nice day. Topaz.
Topaz, you'll have to explain the meaning of catching the bug by dichotomia. I wonder if running with a "barebone" kernel can trigger this bug?
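For readers unfamiliar with the term: "dichotomia" here presumably means bisection, i.e. repeatedly testing the version halfway between the last known good and the first known bad kernel. A minimal sketch in Python, assuming the versions are sorted from oldest to newest, the first is known good, the last is known bad, and there is a single regression point. The callback `is_laggy` is hypothetical: in practice it would mean "boot that kernel, run the agreed laggy scenario, and report whether the system stalled".

```python
from typing import Callable, Sequence


def first_bad_version(versions: Sequence[str],
                      is_laggy: Callable[[str], bool]) -> str:
    """Return the first version for which is_laggy() is True.

    Assumes versions[0] is good, versions[-1] is bad, and every version
    after the first bad one is bad too (a single regression point).
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_laggy(versions[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return versions[bad]
```

With N candidate kernels this needs roughly log2(N) test boots instead of N, which is why it may be quicker than testing every version in turn. It only works if the scenario gives a reliable yes/no answer on each kernel.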
I'm currently running Ubuntu Lucid, and I've noticed the bug since the Jaunty release (Intel x86 Centrino platform with a Core 2 Duo, on two different machines, both affected). When I first had poor performance problems, I tried compiling the vanilla kernel myself, but that did not help: the vanilla 2.6.30 kernel was also affected by this bug. My plan is to establish a laggy scenario, compile all versions of the 2.6.x kernel, and test them all against that scenario. It should not take that long, but the more the merrier :)
One clear angle that has not been investigated by kernel developers is that this issue is highlighted by 64-bit code. I don't see this lag and high IO wait in 32-bit kernel. I have a laptop with 2GB RAM and I got so sick of the lag that I have gone back to 32-bit kernel and userspace. And the speed difference is amazing to say the least. No more stuck mouse and no more waiting to see that konsole window pop up. Everything is much faster. Feels like a new laptop. And this is a 2Ghz core2duo based T61, not a slow hardware by any means! And I get extra 300-400MB of RAM back (YES! that's what you are reading!) by just switching to 32-bit system. 64-bit C++ apps like firefox and KDE eat almost twice the RAM. Firefox is running at 250MB with the same number of tabs and windows as was the 64-bit system, where it was consuming (RSS) about 450MB. Go figure! I am running a Virtualbox copy of XP on the laptop and I still don't see swap kick in. With 64-bit, running firefox and XP in VB at the same time would lead to heavy swapping and things would be crawling! So much for advancement to 64-bit! I have been running 64-bit systems for 4 years now and switching to 32-bit feels like I was living under a rock! I know all this sounds backwards. But give it a try.
Frank Ren: If you are sure the bug doesn't happen with 2.6.32.2, but does with all other releases you could test, then you should try to find what has changed in it. Were you always running vanilla upstream kernels? Or always kernels from your distribution? Built on the same machine with the same compiler? If so, then have a look at the changelog from 2.6.32.2 to 2.6.32.8, looking for the culprit. I'd suggest you try 2.6.32.3 and check if the bug is there; if not, increase the minor version until you get it: that will make the changelog really small. Then send a mail to LKML with your findings. You seem to be the reporter with the most precise information out there, so you may catch something interesting! devsk: Beware not to be misled by the swapping behavior of your system. If you're often completely filling your RAM when on 64 bits, then swapping may hurt responsiveness badly. When moving to 32 bits, if you gain 300MB, you may not suffer from this because there's free RAM, but that's not really linked with a 64-bit-only bug.
(In reply to comment #430)
> One clear angle that has not been investigated by kernel developers is that
> this issue is highlighted by 64-bit code. I don't see this lag and high IO
> wait in 32-bit kernel.
I do. I share your opinion on RAM use though but it surely doesn't belong here. The bug itself is definitely not restricted to 64-bit systems.
Everybody, please read this comment: https://bugzilla.kernel.org/show_bug.cgi?id=13347#c59 I think it contains a worthwhile suggestion.
I'm really unsure CFQ is the (only?) culprit. I've seen the same behaviour using deadline and a 3ware 9650, and the fix was something completely different (pci_set_mwi). See https://bugzilla.redhat.com/show_bug.cgi?id=444759 for more details.
I'm gonna chip in with my experiences: I've had this bug with both 32-bit and 64-bit kernels. Setting different schedulers didn't make a difference. I've tried different versions of the kernel with no luck (though I haven't tried specifically 2.6.32-2).
I have a 32bit system. The bug is almost there.
This bug depends on CPU, memory and, first of all, disk and filesystem, LVM and encryption. It's a mix of transactions/s and throughput: if both are in a system-dependent range, the problem starts. The scheduler keeps no per-process throughput/transaction statistics that would let it penalize the processes causing a high load. A process can grab all available dirty pages and block the other processes.
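One way to watch the dirty-page side of this while a test runs is to sample the system-wide `Dirty` and `Writeback` counters from /proc/meminfo. A minimal sketch in Python follows; it assumes the usual `Name:   value kB` line format, and it shows only system-wide totals, not the per-process statistics that the comment above says are missing. The function name `dirty_counters` is made up for illustration.

```python
def dirty_counters(meminfo_text: str) -> dict:
    """Extract the Dirty and Writeback values (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        # Exact name match, so WritebackTmp and similar fields are skipped.
        if name in wanted and rest.split():
            result[name] = int(rest.split()[0])
    return result
```

A typical use would be to read the file once per second during the dd run, e.g. `dirty_counters(open("/proc/meminfo").read())`, and see whether the stalls coincide with `Dirty` staying pinned at a high value.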
I updated from 2.6.32 to 2.6.34 and the bug is fixed on two computers. In vmstat, wa still takes up all of the idle time, but the interface does not freeze. I can provide any information needed and try building any version from git for testing.
To topaz (#429): Ready to join. It would be nice to settle on the testing methods: please suggest some. On the current Lucid kernel (2.6.32) the bug is reproducible.
That's the problem. There is no reliable method for testing.
I know about using dd. In addition I can move really big files (files of 4-7 GB). My database on the server is really tiny, and there I can easily reproduce the bug (the system is Hardy 32bit, 2.6.24-19-server) by copying the archives of sites and virtual machines. Further, in the office I use Lucid (2.6.32-22-386, 32bit), and at home Fedora 12 (32bit). But this is all subjective:
After the update to 2.6.34:
ivan1986@ivan1986:~/$ dd if=/dev/zero of=testfile.1gb bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 36.1762 s, 29.0 MB/s
ivan1986@ivan1986:~/$ dd if=/dev/zero of=testfile.1gb bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 26.7475 s, 39.2 MB/s
ivan1986@ivan1986:~/$ dd if=/dev/zero of=testfile.1gb bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 32.8729 s, 31.9 MB/s
1 3 0 20940 19664 315188 0 0 128 7860 571 1108 6 10 3 81
2 2 0 15744 19668 320272 0 0 68 65332 893 1593 3 32 7 58
0 3 0 11932 19668 323260 0 0 96 49384 579 1142 3 11 0 85
0 3 0 17252 19704 318232 0 0 0 6832 516 1131 2 3 0 94
0 4 0 12732 19704 323204 0 0 128 6520 940 1145 4 22 4 69
2 4 0 11808 19980 323796 0 0 88 30492 1093 1393 7 20 2 70
0 4 0 32860 19980 302340 0 0 148 70892 1117 2026 2 10 4 84
0 4 0 11856 19980 323400 0 0 176 6652 553 1217 3 9 33 54
1 4 0 12340 19980 323156 0 0 12 12396 604 1269 2 8 4 85
0 4 0 12228 19980 323768 0 0 0 13520 816 1612 2 6 0 91
0 4 0 13136 19980 322244 0 0 0 21924 937 1504 7 8 0 85
0 3 0 11820 19980 324064 0 0 112 42740 857 1404 1 33 14 52
0 3 0 11896 19980 323468 0 0 48 9668 600 1161 4 6 1 88
0 4 0 12608 19980 322604 0 0 128 55032 746 1342 10 11 19 61
0 3 0 11328 19980 323508 0 0 76 27868 498 1087 4 3 6 86
0 4 0 11952 20020 322996 0 0 36 1196 502 1268 5 3 0 92
0 4 0 11952 20020 323512 0 0 0 4036 540 1064 3 8 0 89
0 4 0 11868 20304 323064 0 0 112 64560 893 1190 5 28 3 64
0 5 0 21888 20304 312760 0 0 336 35284 639 1520 4 15 0 82
0 5 0 21764 20304 313068 0 0 0 20936 572 1490 6 3 0 90
0 4 0 11844 20316 323896 0 0 248 364 610 1165 5 12 0 83
1 3 0 12336 20360 323368 0 0 0 31160 1113 1188 3 18 0 78
Max 30% CPU in htop. The interface does NOT freeze, music plays normally, and everything else works fine.
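To compare runs like this one across kernels, the 16-number rows in the comment above can be reduced to a single figure. The sketch below is in Python and rests on an assumption: that the rows are vmstat samples in the classic 16-column layout, where the last four values are us, sy, id and wa (in the posted rows they do sum to 100). The function name `mean_iowait` is made up for illustration.

```python
from typing import Iterable


def mean_iowait(rows: Iterable[str]) -> float:
    """Average the last column (assumed to be 'wa') over vmstat sample rows."""
    values = []
    for row in rows:
        fields = row.split()
        # Skip headers and anything that is not a 16-column numeric sample.
        if len(fields) == 16 and all(f.isdigit() for f in fields):
            values.append(int(fields[-1]))
    if not values:
        raise ValueError("no vmstat sample rows found")
    return sum(values) / len(values)
```

An average hides the spikes, so it may also be worth looking at the maximum; and as this report shows, a high wa figure does not by itself mean the interface freezes.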
The simplest way to reproduce this bug on most hardware is:
1. Create a cryptsetup partition (on LVM or without LVM, both variants are OK). Preferably, all partitions used in the test case should be encrypted.
2. Install VirtualBox and try to create a preallocated hard disk image; the size must be 4GB or more.
That's it! If you try to use other applications at the same time, you will see 5-10 sec freezes. I've reproduced the bug on many hardware configurations with 2.6.34 and older kernels, such as:
C2Q Q9650 / 8GB RAM / Seagate HDD / x86_64
i7 920 / 6GB / WD HDD / x86_64
C2D U7600 / 2GB / Samsung SSD / i686
C2D T7200 / 3GB / Seagate HDD / i686
So it's not a hardware problem: the hardware is between 1 and 4 years old and the results are the same. Also, on *BSD and Windows there are no problems with that hardware.
I'm wondering: doesn't bad responsiveness amount to starvation of processes in the CPU scheduler? In that case it would be better to measure the number of CPU cycles it is possible to burn during pekmop1024's procedure. I have tried to just dd an 8 GB file, and it gives me stalls in the GUI, but that is because of stat64 calls in the application. Under normal circumstances the file that is stat'ed is cached. But during high throughput the cache is filled up with other data, so the stat64 call has to read from the disk, which then competes with my dd. Running glxgears alongside the dd shows a constant fps during the whole dd. I have followed this thread for a long time and I do not remember anyone mentioning that a single high-throughput application renders the cache useless to other applications. Is it possible to avoid filling the cache with data that is written?
I'm wondering: doesn't bad responsiveness amount to starvation of processes in the CPU scheduler? In that case it would be better to measure the number of CPU cycles it is possible to burn during pekmop1024's procedure. I have tried to just dd an 8 GB file, and it gives me stalls in the GUI, but that is because of stat64 calls in the application. Under normal circumstances the file that is stat'ed is cached. But during high throughput the cache is filled up with other data, so the stat64 call has to read from the disk, which then competes with my dd. Running glxgears alongside the dd shows a constant fps during the whole dd. I have followed this thread for a long time and I do not remember anyone mentioning that a single high-throughput application renders the cache useless to other applications. I'm guessing that a simple application that reads the first byte from a memory-mapped file once per second will stall, even if it is only a single byte that needs to be cached. I'm sorry if my thoughts have been mentioned before in this thread :)
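The stat64 observation can be checked without attaching to a real application by timing stat calls directly while the dd runs. A minimal sketch in Python, where `probe_stat` and its parameters are made-up names and `os.stat` stands in for the application's stat64 calls:

```python
import os
import time


def probe_stat(path: str, samples: int, interval: float) -> list:
    """Time `samples` os.stat() calls on path, sleeping `interval` s between them."""
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        os.stat(path)
        latencies.append(time.monotonic() - start)
        time.sleep(interval)
    return latencies
```

One caveat: stat'ing the same path over and over tends to keep its metadata cached, so a fairer test would probably cycle through many rarely used files, or use a long interval, to give the cache a chance to evict them.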
I've tested my assumption about the 1-byte mmap'ed file. It turned out that it runs fine during my dd. Probably 1 byte is not enough.
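For anyone who wants to repeat this experiment, here is a rough sketch in Python of such a probe. It is not the program used above, only a guess at an equivalent: it maps an existing, non-empty file read-only and times a read of its first byte at a fixed interval. The name `probe_mapped_byte` is made up.

```python
import mmap
import time


def probe_mapped_byte(path: str, samples: int, interval: float) -> list:
    """Map path read-only and time `samples` reads of its first byte."""
    latencies = []
    # Length 0 maps the whole file; this raises ValueError for an empty file.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        for _ in range(samples):
            start = time.monotonic()
            mapped[0]  # touch the first page; may fault it in from disk
            latencies.append(time.monotonic() - start)
            time.sleep(interval)
    return latencies
```

A page that is touched every second is likely to look recently used to the kernel and stay resident, which may be why the test showed no stall. Touching a different page of a large file on each iteration might be a harsher test.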
Still reproducible: compiling psi freezes the interface.
(In reply to comment #421) > I have suffered the high iowait problem for almost 4 years Then let's finally kill it! > I got information from this bugzilla report that kernel 2.6.32 has fixed > this high iowait problem, and I tested the snapshot kernel 2.6.32.2 of > zenwalk > on my notebook, and found the high iowait is gone > I found in kernel 2.6.32.8 the high iowait is back. How do I know > that? When I copy a 700MB avi file from my notebook disk to a 3.5" usb > mobile disk, I found the reading side disk LED start to falsh quickly and > immediately, but the writing side disk LED will keep still for a long > time(like > 25-30 seconds), and then start to flash slowly,and the course is abnormally > long and low responsive. > The kernel 2.6.32.2 is the only 2.6 kernel (since 2.6.18) on which I found > both of the reading and writing side disk LED will start to falsh > quickly and immediately.There must be somthing wrong with the write > cache behavior which will cause the high iowait, and it has been fixed in > 2.6.32.2 and brought back in 2.6.32.8. This is the complete git log 2.6.32.2..2.6.32.8: b0e4370 Linux 2.6.32.8 6117db7 NET: fix oops at bootime in sysctl code e4a6a35 powerpc: TIF_ABI_PENDING bit removal a420e9f ath9k: fix beacon slot/buffer leak 1c97637 ath9k: fix eeprom INI values override for 2GHz-only cards 2c7f87e pktcdvd: removing device does not remove its sysfs dir b31aa5c uartlite: fix crash when using as console e06fbe9 kernel/cred.c: use kmem_cache_free 35cfb03 starfire: clean up properly if firmware loading fails 906f68d mx3fb: some debug and initialisation fixes 682efb8 imxfb: correct location of callbacks in suspend and resume b260729 mac80211: fix NULL pointer dereference when ftrace is enabled 3a9353f mm: flush dcache before writing into page to avoid alias 78da404 be2net: Fix memset() arg ordering. e38d76e be2net: Bug fix to support newer generation of BE ASIC 43d7ff2 connector: Delete buggy notification code. 
f06f00e usb: r8a66597-hdc disable interrupts fix 0ae2b7d block: fix bugs in bio-integrity mempool usage 9648148 random: Remove unused inode variable 8857a1a random: drop weird m_time/a_time manipulation 94af44b Fix 'flush_old_exec()/setup_new_exec()' split cb723ba block: fix bio_add_page for non trivial merge_bvec_fn case e52299d mm: purge fragmented percpu vmap blocks 56d4b77 mm: percpu-vmap fix RCU list walking dce6a09 libata: retry link resume if necessary 42f7e23 oprofile/x86: fix crash when profiling more than 28 events 9c66557 oprofile/x86: add Xeon 7500 series support 4f7d666 KVM: allow userspace to adjust kvmclock offset a74e62c ax25: netrom: rose: Fix timer oopses 3125258 af_packet: Don't use skb after dev_queue_xmit() ecb7287 net: restore ip source validation 1681333 sky2: Fix oops in sky2_xmit_frame() after TX timeout 16b8efa tcp: update the netstamp_needed counter when cloning sockets 359e2f2 clocksource: fix compilation if no GENERIC_TIME 253f887 x86/amd-iommu: Fix possible integer overflow d1a3103 x86: Add quirk for Intel DG45FC board to avoid low memory corruption 8159070 x86: Add Dell OptiPlex 760 reboot quirk 00362b9 regulator: Specify REGULATOR_CHANGE_STATUS for WM835x LED constraints 6db6ace SECURITY: selinux, fix update_rlimit_cpu parameter 80569f6 firewire: core: add_descriptor size check 612e99b drm/i915: only enable hotplug for detected outputs 69bf9a6 iwlwifi: set default aggregation frame count limit to 31 3492bbb x86: Disable HPET MSI on ATI SB700/SB800 cf135e5 Input: winbond-cir - remove dmesg spam 5e806e1 x86: get rid of the insane TIF_ABI_PENDING bit c2e245d sparc: TIF_ABI_PENDING bit removal 336ca4c Split 'flush_old_exec' into two functions 944a638 FDPIC: Respect PT_GNU_STACK exec protection markings when creating NOMMU stack 0b3bf81 mm: fix migratetype bug which slowed swapping 629527c Fix failure exit in ipathfs 30d3844 fix affs parse_options() d842c31 Fix remount races with symlink handling in affs 36a0a4a fix leak in 
romfs_fill_super() 26d2257 fix oops in fs/9p late mount failure deb20f1 Fix failure exits in bfs_fill_super() 703c300 Fix a leak in affs_fill_super() 61d4374 drm/i915: Reload hangcheck timer too for Ironlake f0b4195 e1000/e1000e: don't use small hardware rx buffers b9ad9bb e1000e: enhance frame fragment detection dff2267 e1000: enhance frame fragment detection cfc7e54 UBI: fix volume creation input checking 3b4f785 ACPI: Advertise to BIOS in _OSC: _OST on _PPC changes 0d48a1a ACPI: fix OSC regression that caused aer and pciehp not to load 1a52add ACPI: Add platform-wide _OSC support. e62a96c ACPI: Add a generic API for _OSC -v2 1e88960 dasd: fix possible NULL pointer errors 083beff zcrypt: Do not remove coprocessor for error 8/72 63693ee libata: retry FS IOs even if it has failed with AC_ERR_INVALID 8c2cd3f x86: Remove "x86 CPU features in debugfs" (CONFIG_X86_CPU_DEBUG) b5b39c3 x86: Set hotpluggable nodes in nodes_possible_map 76e789c S390: fix single stepped svcs with TRACE_IRQFLAGS=y 16a2ae6 firewire: ohci: fix crashes with TSB43AB23 on 64bit systems d8e0902 drm/i915: Selectively enable self-reclaim 8268c0b mm: add new 'read_cache_page_gfp()' helper function b7a9d92 mptsas: Fix issue with chain pools allocation on katmai e15fca0 scsi_lib: Fix bug in completion of bidi commands b4bdd73 Linux 2.6.32.7 a8e96d6 x86, msr/cpuid: Pass the number of minors when unregistering MSR and CPUID drivers. 
0a1c275 fnctl: f_modown should call write_lock_irqsave/restore 01e991b iwlwifi: Fix throughput stall issue in HT mode for 5000 d274df6 ACPI: enable C2 and Turbo-mode on Nehalem notebooks on A/C 59568be x86: Reenable TSC sync check at boot, even with NONSTOP_TSC 194223f IPoIB: Clear ipoib_neigh.dgid in ipoib_neigh_alloc() 454f8b1 KVM: only clear irq_source_id if irqchip is present eaccd49 KVM: fix lock imbalance in kvm_*_irq_source_id() 9801911 KVM: x86: Fix leak of free lapic date in kvm_arch_vcpu_init() 8e5c20d KVM: x86: Fix probable memory leak of vcpu->arch.mce_banks 0118bac KVM: x86: Fix host_mapping_level() 4938210 KVM: MMU: bail out pagewalk on kvm_read_guest error 59cf854 KVM: Fix race between APIC TMR and IRR f0d13b8 KVM: only allow one gsi per fd 70be4d7 KVM: S390: fix potential array overrun in intercept handling eb60025 cfg80211: fix channel setting for wext 304cd19 mac80211: check that ieee80211_set_power_mgmt only handles STA interfaces. 09e4d0f ASoC: fix a memory-leak in wm8903 2cdc2dc UBI: initialise update marker f6fbe0b UBI: fix memory leak in update path 4d845d6 hwmon: (fschmd) Fix a memleak on multiple opens of /dev/watchdog 00bd133 ALSA: hda - Fix HP T5735 automute a0dffef ipc ns: fix memory leak (idr) a5981df netiucv: displayed TX bytes value much too high 27aeefb cio: dont panic in non-fatal conditions f5b1bc5 cio: fix double free in case of probe failure da02974 V4L/DVB (13826): uvcvideo: Fix controls blacklisting 2928b68 md: fix small irregularity with start_ro module parameter 31cf6d8 ata_piix: fix MWDMA handling on PIIX3 3de08a12 ahci: disable SNotification capability for ich8 c817c19 iTCO_wdt: Add Intel Cougar Point and PCH DeviceIDs 42b4505 iTCO_wdt: add PCI ID for the Intel EP80579 (Tolapai) SoC 53691f2 iTCO_wdt.c - cleanup chipset documentation 4220098 ALSA: hda - Add missing Line-Out and PCM switches as slave 9049580 ALSA: hda - Fix quirk for Maxdata obook4-1 a2c5952 ALSA: hda - select IbexPeak handler for Calpella d160610 Input: 
i8042 - add Dritek quirk for Acer Aspire 5610. 461eb3f Input: i8042 - add Gigabyte M1022M to the noloop list f6278f1 Input: i8042 - remove identification strings from DMI tables 44d13be DMI: allow omitting ident strings in DMI tables 5172b4b PCI: AER: fix aer inject result in kernel oops bf9a88d qlge: Bonding fix for mode 6. 6b07617 qlge: Add handler for DCBX firmware event. 6055e7f qlge: Don't fail open when port is not initialized. 836750b qlge: Set PCIE max read request size. ffd1fab qlge: Remove explicit setting of PCI Dev CTL reg. 7c0798e fcoe: Fix getting san mac for VLAN interface 1ce0348 fcoe: Fix checking san mac address e166cb1 fcoe, libfc: fix an libfc issue with queue ramp down in libfc 2792e0ce libfc: remote port gets stuck in restart state without really restarting 407590a libfc: fix free of fc_rport_priv with timer pending a3d46ca libfc: fix memory corruption caused by double frees and bad error handling 4c40dbe libfc: Fix frags in frame exceeding SKB_MAX_FRAGS in fc_fcp_send_data 88cc93a fcoe: initialize return value in fcoe_destroy 7c8a0dc libfc: don't WARN_ON in lport_timeout for RESET state 83d236b libfc: lport: fix minor documentation errors 56320f6 libfc: Fix wrong scsi return status under FC_DATA_UNDRUN d5d72da fcoe: remove redundant checking of netdev->netdev_ops 34556a1 libfc: fix ddp in fc_fcp for 0 xid 1e418b2 libfc: fix typo in retry check on received PRLI 253f41b lpfc: fix hang on SGI ia64 platform 4b2bc96 scsi_transport_fc: remove invalid BUG_ON d502a76 scsi_dh: create sysfs file, dh_state for all SCSI disk devices e7c8167 scsi_devinfo: update Hitachi entries (v2) 001252f HID: fixup quirk for NCR devices 5e05787 NFS: Revert default r/wsize behavior 1d42a1b iscsi class: modify handling of replacement timeout 83886fa PCI: Always set prefetchable base/limit upper32 registers 5cf92e9 timers, init: Limit the number of per cpu calibration bootup messages 34911bf nfsd: Fix sort_pacl in fs/nfsd/nf4acl.c to actually sort groups a9238ce nohz: 
Prevent clocksource wrapping during idle db47a16 sched: Fix missing sched tunable recalculation on cpu add/remove 08b84be sched: Fix isolcpus boot option eb9dbd9 ALSA: ice1724 - Patch for suspend/resume for ESI Juli@ e96610c partitions: use sector size for EFI GPT 6f8de29 partitions: read whole sector with EFI GPT header 8f2fefc netfilter: xtables: fix conntrack match v1 ipt-save output 3cd4bea V4L/DVB (13680b): DocBook/media: create links for included sources 35f42c9 V4L/DVB (13680a): DocBook/media: copy images after building HTML 857ffb8 atl1e:disable NETIF_F_TSO6 for hardware limit f7b1714 atl1c:use common_task instead of reset_task and link_chg_task b68f619 iTCO_wdt: Add support for Intel Ibex Peak 96ef353 V4L/DVB (13168): Add support for Asus Europa Hybrid DVB-T card (SAA7134 SubVendor ID: 0x1043 Device ID: 0x4847) 8429570 USB: ftdi_sio: add USB device ID's for B&B Electronics line 5bcaffb USB: mos7840: add device IDs for B&B electronics devices 4d3c678 V4L/DVB (13569): smsusb: add autodetection support for five additional Hauppauge USB IDs ff23399 ALSA: hda - Add PCI IDs for Nvidia G2xx-series 4bc685e vfs: get_sb_single() - do not pass options twice 1b715f1 driver-core: fix devtmpfs crash on s390 da30443 Driver-Core: devtmpfs - set root directory mode to 0755 04daa51 Input: ALPS - add interleaved protocol support (Dell E6x00 series) 30dc12e davinci: dm646x: Add support for 3.x silicon revision c375e84 powerpc/fsl: Add PCI device ids for new QoirQ chips a98917c ar9170: Add support for D-Link DWA 160 A2 002464c mpt2sas: New device SAS2208 support is added 90ee3ca be2net: Add the new PCI IDs to PCI_DEVICE_TABLE. 879c8e8 be2net: Add support for next generation of BladeEngine device. 
c97c73d sfc: Fix DMA mapping cleanup in case of an error in TSO 9396c90 ACPI: don't cond_resched if irq is disabled ce946bc clockevents: Add missing include to pacify sparse 08b8ff4 clockevent: Don't remove broadcast device when cpu is dead f584d37 Linux 2.6.32.6 9607f06 perf: Honour event state for aux stream data b0a9392 perf events: Dont report side-band events on each cpu for per-task-per-cpu events 5a20267 perf timechart: Use tid not pid for COMM change f2fa92b vmalloc: remove BUG_ON due to racy counting of VM_LAZY_FREE 3d0cc9a USB: fix usbstorage for 2770:915d delivers no FAT 538a6fd x86/PCI/PAT: return EINVAL for pci mmap WC request for !pat_enabled e0f5cfa DM: Fix device mapper topology stacking fbe2992 block: bdev_stack_limits wrapper ed0cd89 drm/i915: try another possible DDC bus for the SDVO device with multiple outputs 4fb77a3 drm/i915: Read the response after issuing DDC bus switch command 8cef765 SCSI: enclosure: fix oops while iterating enclosure_status array 5f0ab2d ACPI: EC: Add wait for irq storm 1ff7b99 ACPI: EC: Accelerate query execution 111ab4b USB: add speed values for USB 3.0 and wireless controllers a2a5b33 USB: add missing delay during remote wakeup bfec5ce USB: EHCI & UHCI: fix race between root-hub suspend and port resume 07d577f USB: EHCI: fix handling of unusual interrupt intervals 186c74d USB: Don't use GFP_KERNEL while we cannot reset a storage device fa68188 USB: fix bitmask merge error 911b8be usb: serial: fix memory leak in generic driver 04f7ec7 serial: 8250_pnp: use wildcard for serial Wacom tablets 6fc7937 nozomi: quick fix for the close/close bug 8c53542 ecryptfs: initialize private persistent file before dereferencing pointer 3621216 ecryptfs: use after free 179b7e5 tty: fix race in tty_fasync b70922a Staging: hv: fix smp problems in the hyperv core code 50e4975 Staging: asus_oled: fix oops in 2.6.32.2 ccb90b8 V4L/DVB (13900): gspca - sunplus: Fix bridge exchanges. 
d547e91 x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers a2febcd Linux 2.6.32.5 af55a3d vfs: Fix vmtruncate() regression 2693139 sched: Fix task priority bug fdc360e serial/8250_pnp: add a new Fujitsu Wacom Tablet PC device 2d22b38 i2c/pca: Don't use *_interruptible c1f77a7 i2c: Do not use device name after device_unregister 4bff5ff sparc64: Fix Niagara2 perf event handling. 9d6567c sparc64: Fix NMI programming when perf events are active. 896fb0d sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 9fc68ca asus-laptop: add Lenovo SL hotkey support 2196ca4 Input: pmouse - move Sentelic probe down the list 94249e6 megaraid_sas: remove sysfs poll_mode_io world writeable permissions 2db740c PCI/cardbus: Add a fixup hook and fix powerpc eecd8a9 HID: add device IDs for new model of Apple Wireless Keyboard 781d5c4 reiserfs: truncate blocks not used by a write 56a7f72 V4L/DVB (13868): gspca - sn9c20x: Fix test of unsigned. fe52cee ALSA: hda - Fix missing capture mixer for ALC861/660 codecs 34e7aa0 mfd: Correct WM835x ISINK ramp time defines 33faa3c mfd: WM835x GPIO direction register is not locked 7f08f93 x86: SGI UV: Fix mapping of MMIO registers 7f40c6b edac: i5000_edac critical fix panic out of bounds 25d5699 x86, apic: use physical mode for IBM summit platforms c91ab04 page allocator: update NR_FREE_PAGES only when necessary d4c893f futexes: Remove rw parameter from get_futex_key() 8410b13 x86, mce: Thermal monitoring depends on APIC being enabled 1bd24fd block: Fix incorrect reporting of partition alignment 8a9c3f5 drm/i915: remove loop in Ironlake interrupt handler 4334ab7 memcg: ensure list is empty at rmdir 70f800f revert "drivers/video/s3c-fb.c: fix clock setting for Samsung SoC Framebuffer" 800c028 inotify: only warn once for inotify problems cec3ad6 inotify: do not reuse watch descriptors 3df7673 Linux 2.6.32.4 5877960 agp/intel-agp: Clear entire GTT on startup 5deb72e ipv6: skb_dst() can be NULL in ipv6_hop_jumbo(). 
54f1b39 module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y 9ef9a7c fix more leaks in audit_tree.c tag_chunk() dffaea5 fix braindamage in audit_tree.c untag_chunk() d3b1e3b mac80211: fix skb buffering issue (and fixes to that) 71c7707 kernel/sysctl.c: fix stable merge error in NOMMU mmap_min_addr 904e373 libertas: Remove carrier signaling from the scan code b9945e7 drm/i915: remove render reclock support 9b13cca mac80211: add missing sanity checks for action frames 0ea5505 iwl: off by one bug 724ad42 cfg80211: fix syntax error on user regulatory hints e6efac7 ath5k: Fix eeprom checksum check for custom sized eeproms fc95845 iwlwifi: fix iwl_queue_used bug when read_ptr == write_ptr a111c28 xen: fix hang on suspend. 38c4d8d quota: Fix dquot_transfer for filesystems different from ext4 a61dcb0 hwmon: (adt7462) Fix pin 28 monitoring 4052fbf hwmon: (coretemp) Fix TjMax for Atom N450/D410/D510 CPUs 545b020 netfilter: nf_ct_ftp: fix out of bounds read in update_nl_seq() 635b4f9 netfilter: ebtables: enforce CAP_NET_ADMIN 954c8ef ASoC: Fix WM8350 DSP mode B configuration cf99848 ALSA: atiixp: Specify codec for Foxconn RC4107MA-RS2 0385cc0 ALSA: ac97: Add Dell Dimension 2400 to Headphone/Line Jack Sense blacklist 5bb4e84 ALSA: hda - Fix ALC861-VD capture source mixer e0abcea mmc_block: fix queue cleanup 0c74f45 mmc_block: fix probe error cleanup bug 0798abf mmc_block: add dev_t initialization check 0696a3b kernel/signal.c: fix kernel information leak with print-fatal-signals=1 ecac13f dma-debug: allow DMA_BIDIRECTIONAL mappings to be synced with DMA_FROM_DEVICE and f21efc5 lib/rational.c needs module.h 21f7654 cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput() 6abb6ac drivers/cpuidle/governors/menu.c: fix undefined reference to `__udivdi3' fdc0895 rtc_cmos: convert shutdown to new pnp_driver->shutdown 0c51b5c drm/i915: fix unused var c7e8c26 drm/i915: Select the correct BPC for LVDS on Ironlake c04fd30 drm/i915: Make the BPC in FDI 
rx/transcoder be consistent with that in pipeconf on Ironlake cba0270 drm/i915: Enable/disable the dithering for LVDS based on VBT setting de04091 drm: remove address mask param for drm_pci_alloc() c693959 drm/i915: Permit pinning whilst the device is 'suspended' d241962 drm/i915: fix order of fence release wrt flushing d3e4d5f drm/i915: Update LVDS connector status when receiving ACPI LID event 8064af1 sunrpc: on successful gss error pipe write, don't return error 8ffe947 SUNRPC: Fix the return value in gss_import_sec_context() e64b13f SUNRPC: Fix up an error return value in gss_import_sec_context_kerberos() eb0b93d sunrpc: fix peername failed on closed listener 3aafc55 nfsd: make sure data is on disk before calling ->fsync b7e5f77 Revert "x86: Side-step lguest problem by only building cmpxchg8b_emu for pre-Pentium" 2448811 exofs: simple_write_end does not mark_inode_dirty 8dfabfc modules: Skip empty sections when exporting section notes efd38f4 ASoC: fix params_rate() macro use in several codecs e4dd8ca fasync: split 'fasync_helper()' into separate add/remove functions 1f51eb3 untangle the do_mremap() mess c3a8e0e Linux 2.6.32.3 84d330e generic_permission: MAY_OPEN is not write access 3815270 rt2x00: Disable powersaving for rt61pci and rt2800pci. 
8ac9e80 ksm: fix mlockfreed to munlocked b2ea8cb vmscan: do not evict inactive pages when skipping an active list scan 370b758 lguest: fix bug in setting guest GDT entry 743c078 ext4: Update documentation to correct the inode_readahead_blks option name fc31022 sched: Sched_rt_periodic_timer vs cpu hotplug 9127720 amd64_edac: fix forcing module load/unload 1538323 amd64_edac: make driver loading more robust 44a529c amd64_edac: fix driver instance freeing 2d9e1f0 x86, msr: msrs_alloc/free for CONFIG_SMP=n eb21839 x86, msr: Add support for non-contiguous cpumasks 26eb2ac amd64_edac: unify MCGCTL ECC switching ebd2802 cpumask: use modern cpumask style in drivers/edac/amd64_edac.c a89a9e1 x86, msr: Unify rdmsr_on_cpus/wrmsr_on_cpus b2dbc46 ext4: fix sleep inside spinlock issue with quota and dealloc (#14739) dbe5cc0 ext4: Convert to generic reserved quota's space management. bbf2450 quota: decouple fs reserved space from quota reservation f07c88d Add unlocked version of inode_add_bytes() function 0aebc28 udf: Try harder when looking for VAT inode 3196f98 orinoco: fix GFP_KERNEL in orinoco_set_key with interrupts disabled fad0c31 xen: wait up to 5 minutes for device connetion 2cfea00 xen: improvement to wait_for_devices() af70ddf xen: fix is_disconnected_device/exists_disconnected_device 1dc51f1 S390: dasd: support DIAG access for read-only devices 4012cf6 drm: disable all the possible outputs/crtcs before entering KMS mode 08ff733 drm/radeon/kms: fix crtc vblank update for r600 a09adfe sched: Fix balance vs hotplug race fb70ac4 Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support 7fcb558 b43: avoid PPC fault during resume a8e3ec9 hwmon: (sht15) Off-by-one error in array index + incorrect constants 048a424 netfilter: fix crashes in bridge netfilter caused by fragment jumps 89cf4f4 ipv6: reassembly: use seperate reassembly queues for conntrack and local delivery ee6bfc6 e100: Fix broken cbs accounting due to missing memset. 
ad46fed memcg: avoid oom-killing innocent task in case of use_hierarchy b52d855 x86/ptrace: make genregs[32]_get/set more robust 6e2aa7d V4L/DVB (13596): ov511.c typo: lock => unlock 4b6d263 kernel/sysctl.c: fix the incomplete part of sysctl_max_map_count-should-be-non-negative.patch 3ec268a 'sysctl_max_map_count' should be non-negative 0399123 NOMMU: Optimise away the {dac_,}mmap_min_addr tests 1cfe005 mac80211: fix race with suspend and dynamic_ps_disable_work 14b4d74 iwlwifi: fix 40MHz operation setting on cards that do not allow it c4ae8ae iwlwifi: fix more eeprom endian bugs df5d119 iwlwifi: fix EEPROM/OTP reading endian annotations and a bug 0c0cdaf iwl3945: fix panic in iwl3945 driver 66c9e44 iwl3945: disable power save 87d512c ath9k_hw: Fix AR_GPIO_INPUT_EN_VAL_BT_PRIORITY_BB and its shift value in 0x4054 a6d8cc6 ath9k_hw: Fix possible OOB array indexing in gen_timer_index[] on 64-bit 12ba709 ath9k: fix suspend by waking device prior to stop c965e1e ath9k: wake hardware during AMPDU TX actions 463a7f9 ath9k: fix missed error codes in the tx status check bef82b6 ath9k: Fix TX queue draining 0ebbdd7 ath9k: wake hardware for interface IBSS/AP/Mesh removal d5086b9 ath5k: fix SWI calibration interrupt storm 4777020 cfg80211: fix race between deauth and assoc response 9f7028e mac80211: Fix IBSS merge 0b41c5a mac80211: fix WMM AP settings application 330b937 mac80211: fix propagation of failed hardware reconfigurations 38cf2a0 iwmc3200wifi: fix array out-of-boundary access 08a9378 Libertas: fix buffer overflow in lbs_get_essid() 3b96f9a KVM: LAPIC: make sure IRR bitmap is scanned after vm load 3a9f992 KVM: MMU: remove prefault from invlpg handler 8b9f038 ioat2,3: put channel hardware in known state at init e05a6f0 ioat3: fix p-disabled q-continuation e93166f x86/amd-iommu: Fix initialization failure panic cd7bc18 cifs: NULL out tcon, pSesInfo, and srvTcp pointers when chasing DFS referrals 6cb5fcc dma-debug: Fix bug causing build warning 120dbaa dma-debug: Do not 
add notifier when dma debugging is disabled. c4ddbba dma: at_hdmac: correct incompatible type for argument 1 of 'spin_lock_bh' ed8f6eb md: Fix unfortunate interaction with evms acb8be4 x86: SGI UV: Fix writes to led registers on remote uv hubs 4ba51fe drivers/net/usb: Correct code taking the size of a pointer 526fed8 USB: fix bugs in usb_(de)authorize_device c6d7a67 USB: rename usb_configure_device f661c3f Bluetooth: Prevent ill-timed autosuspend in USB driver b71bfa6 USB: musb: gadget_ep0: avoid SetupEnd interrupt 3635acd USB: Fix a bug on appledisplay.c regarding signedness 5a82dd5 USB: option: support hi speed for modem Haier CE100 702a0a0 USB: emi62: fix crash when trying to load EMI 6|2 firmware 2d67231 drm/radeon: fix build on 64-bit with some compilers. 474ae5e ASoC: Do not write to invalid registers on the wm9712. d75621c powerpc: Handle VSX alignment faults correctly in little-endian mode 8aafd7d ACPI: Use the return result of ACPI lid notifier chain correctly 3872bf5 ACPI: EC: Fix MSI DMI detection 5ab8996 acerhdf: limit modalias matching to supported 296e9be ALSA: hda - Fix missing capsrc_nids for ALC88x aec8dc2 sound: sgio2audio/pdaudiocf/usb-audio: initialize PCM buffer e255d3c ASoC: wm8974: fix a wrong bit definition 1ee0552 pata_cmd64x: fix overclocking of UDMA0-2 modes f31733a pata_hpt3x2n: fix clock turnaround fa3f5a5 clockevents: Prevent clockevent_devices list corruption on cpu hotplug 8e04c81 sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE c9ac6a9 x86, cpuid: Add "volatile" to asm in native_cpuid() 14ae082 sched: Fix task_hot() test order fdf2675 SCSI: fc class: fix fc_transport_init error handling 1ab0714 SCSI: st: fix mdata->page_order handling 9f63d27 SCSI: qla2xxx: dpc thread can execute before scsi host has been added c1d17da SCSI: ipr: fix EEH recovery a1092bf Linux 2.6.32.2 The problem has to be somewhere in there. Frank, you're the only guy up to now bringing up hard evidence and two relatively close good/bad kernel versions. 
Would you be able to dig deeper on this? It's just ridiculous that some IO can prevent a quad-core from skip-free video playback (on 2.6.34-git12, that is). Because of btrfs I can't switch back to 2.6.32.2, but maybe someone can figure out how to use the phoronix-test-suite's automagic to bisect this? And despite all the noise: this bug really shouldn't be marked RESOLVED INSUFFICIENT_DATA ^^
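For anyone wondering what bisecting between a good and a bad kernel amounts to, here is a minimal sketch of the search itself. It assumes the commits are ordered oldest to newest and that once the regression appears it stays; the commit names and the `is_bad` check are hypothetical stand-ins for "build, boot, run the reproducer".

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the last one is bad,
    and every commit after the culprit is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # culprit is mid or something older
        else:
            lo = mid + 1    # culprit is newer than mid
    return commits[lo]


# Hypothetical history and culprit, purely for illustration.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
tested = []


def is_bad(commit: str) -> bool:
    tested.append(commit)  # each call stands for one build-and-test cycle
    return history.index(commit) >= history.index("c5")


print(first_bad(history, is_bad), "found after", len(tested), "test builds")
```

The point is that the number of kernels to build grows with the logarithm of the number of commits, so even a long stable-release changelog needs only a handful of test boots.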
(In reply to comment #448 and comment #431) Sorry, I really want to help, but I am not a kernel developer; hacking the kernel source is too difficult for me. Besides, the gas turbine historian is a live production system and cannot be used as a debug system. I will keep watching for the final resolution; for now, we will stick with 2.6.32.2.
Wait, wait, what is this :O Updating to yesterday's git kernel (from 2.6.34-git12) gave me a huge perceived speed boost? I haven't specifically compared iowait times, but all processes seem to be using less CPU time. My BOINC likes that very much ;) There is a lot of concurrent IO here and the system, apart from minor application stalling (despite 8GiB RAM and no swap), hasn't been this un-sluggish for a loooong time (2.6.18? ;) It feels like someone finally released the brakes. Hope you guys can confirm this!
wrt comment #450, 2.6.35-rc1 is out! I hope that has something for all of us sufferers. I will try it later today. Can other folks also try and report here?
(In reply to comment #450) Maybe this is related to the observations at Phoronix's kernel tracker[1]. An in-depth article was also posted[2].
1: http://www.phoromatic.com/kernel-tracker.php?sys_1=yes&sys_3=yes&sys_4=yes&sub_type_System=yes&sub_type_Processor=yes&sub_type_Disk=yes&sub_type_Graphics=yes&sub_type_Memory=yes&sub_type_Network=yes&date_range=15&regression_threshold=0.15&only_show_regressions=yes&submit=Update+Results
2: http://www.phoronix.com/scan.php?page=article&item=linux_2635_fail&num=1
Note: Link 1 is valid for the next few days; after that you have to raise the number of displayed days to get the regression back into view.
lol, this bug was marked "resolved". I wish. (Hi, everyone). I suspect we have about 25 different bugs here. Really the only way we'll make progress here is if people can come up with specific test cases which developers can run on their own machines, and reproduce the bug. So if any of you guys have time to try that and are successful then please attach that testcase here, or send it out via email to the relevant culprits. It's really that important. There's practically a 1:1 ratio between reproduction-test-cases and bugfixes.
Let me point out a potential pitfall: For a long while I thought my machine was suffering from this bug. However, the real reason for my high IO wait and extremely poor performance was this: http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives So everyone should rule out that one first... for me, a repartitioning of my drive helped a lot :).
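To rule out the misalignment case quickly, the check is simple arithmetic on the partition's start sector. The sketch below assumes 512-byte logical sectors and 4096-byte physical sectors; on Linux the start sector can usually be read from a file such as /sys/block/sda/sda1/start (the device name here is only an example).

```python
def is_4k_aligned(start_sector: int,
                  logical_sector_size: int = 512,
                  physical_sector_size: int = 4096) -> bool:
    """True if the partition's first byte lies on a physical-sector boundary."""
    return (start_sector * logical_sector_size) % physical_sector_size == 0


# 63 is the traditional DOS-style start sector; 64 and 2048 are aligned choices.
for start in (63, 64, 2048):
    print(start, is_4k_aligned(start))
```

A partition starting at sector 63 fails the check, which is the layout older partitioning tools produced by default.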
Just want to report that I've had great success with kernel 2.6.35-020635rc4-generic on 32-bit Ubuntu. Apps can still grey out when allocating space for big files, but the interface stays responsive in other apps. I'll try it out on more setups and report back here if I notice the problem appearing elsewhere. Finally I can say that my Linux machines are usable again. Cheers!
(In reply to comment #453) > lol, this bug was marked "resolved". I wish. > > (Hi, everyone). > > I suspect we have about 25 different bugs here. Really the only way we'll > make > progress here is if people can come up with specific test cases which > developers can run on their own machines, and reproduce the bug. > > So if any of you guys have time to try that and are successful then please > attach that testcase here, or send it out via email to the relevant culprits. > > It's really that important. There's practically a 1:1 ratio between > reproduction-test-cases and bugfixes.
Hi Andrew,
Very simple testing procedure:
1. Launch Firefox.
2. Run 'stress -d 1'.
3. Try to open some websites.
4. The machine hangs.
Thanks
(In reply to comment #455) > Just want to report that I've had great success with the kernel > 2.6.35-020635rc4-generic on ubuntu 32 bit. Apps can still grey out when > allocating space for big files, but the interface is still responsive on > other > apps. I'll try it out on more setups and report back here if i notice it > appearing on other places. > > Finally I can say that my linux machines are usable again. Cheers! I will try that, but I have no issues in XP, my hard drive is at least 2 1/2 years old, and this issue has been around for even longer than that. I doubt it's the reason for my issues. I have also tried playing around with other schedulers and disk mounting options. I have tried writeback and journal mode. Writeback provides only a very minimal improvement, not enough to make it worth my while to run it always. Changing between ATA and AHCI mode makes no difference, nor does changing the scheduler from cfq to anticipatory or deadline. I am testing this on a Dell Precision M6300 laptop with a SATA drive, but I have experienced this issue on all my various types of PCs since at least Ubuntu Gutsy or Intrepid.
(In reply to comment #456) > > Very simple testing procedure: > > Launch Firefox > > Run 'stress -d 1' > From where does one obtain a copy of `stress'? Thanks.
I believe this is the website (according to gentoo portage). http://weather.ou.edu/~apw/projects/stress/ Benj
I've tried stress also. I have 2 GB of memory and 1.5 GB of swap. With swap activated, stress -d 1 hangs my machine. The same happens with stress -d while swappiness is set to 0. With swap deactivated, things run pretty fine. Of course, apps using synchronous disk IO fight stress for priority. There must be a reasonable explanation for why everything stops when swap is activated. Even a simple app like "dstat" stalls.
I can also confirm this. Disabling swap with swapoff -a solves the problem. I have 8 GB of RAM and 8 GB of swap with a fake-RAID mirror. Before this I couldn't do backups without the whole system grinding to a halt. Right now I am doing a backup from the drives, watching a movie from the same drives, and more. No more iowait times and no more programs freezing because they are starved of access to the drives.
Perhaps you could capture some vmstat 1 output from just before/when the stall occurs?
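Once such a capture exists, the interesting columns are swap-in/swap-out (si/so), blocks in/out (bi/bo) and iowait (wa). A minimal sketch for pulling them out of saved `vmstat 1` output follows; the sample numbers are made up for illustration and the column layout assumed is the usual procps one.

```python
from typing import Dict, List


def parse_vmstat(text: str) -> List[Dict[str, int]]:
    """Parse `vmstat 1` output into one dict per sample, keyed by column name."""
    samples: List[Dict[str, int]] = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("procs"):
            continue                  # blank line or the group banner
        if fields[0] == "r":
            columns = fields          # header row (vmstat repeats it periodically)
            continue
        if columns and len(fields) == len(columns):
            samples.append(dict(zip(columns, map(int, fields))))
    return samples


# Made-up sample: one quiet second, then one second of swap-out and heavy writes.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 812344  40212 701232    0    0    12    40  310  620  5  2 92  1
 0  3  10240  20112   1204 1650112   96 4100   220 29800  900 1500  2  6 10 82
"""

for s in parse_vmstat(SAMPLE):
    print("so=%d bo=%d wa=%d%%" % (s["so"], s["bo"], s["wa"]))
```

Seconds where `so` is non-zero while `bo` and `wa` are both high are the ones worth correlating with the moment the desktop freezes.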
Created attachment 27230 [details] vmstat for my system running "stress -d 1" without hanging. My system had just logged into KDE, with around 650 MB of memory used by applications prior to starting "stress -d 1".
Created attachment 27231 [details] vmstat for my system running "stress -d 1". System hangs. My system had just logged into KDE, with around 860 MB of memory used by applications prior to starting "stress -d 1". The applications using the extra memory are digikam and kontact, both sitting there doing nothing.
Created attachment 27232 [details] vmstat for my system (without swap) running "stress -d 1" without hanging. Same setup as stress_swap_hang.vmstat except that swap is turned off using "swapoff -a" in this run.
The strange thing about every high-throughput IO is that *every* byte of memory is used up until a certain limit. That use of memory will even swap out stuff. Looking especially at stress_noswap_nohang.vmstat, the behavior mimics this:
1. Place data to be written into memory.
2. Write some data to the disk.
3. Go to 1 if not all allowed memory is used.
Interesting is that "stress -d 1" places data into memory a lot faster than a normal hard disk can handle, so the memory will be filled up eventually (the limit will be reached eventually). So for me, I only have a hanging system when "stress -d 1" writes compete with "swap out", which is actually caused by "stress -d 1" filling the memory. So the big question: Why does the kernel allow large data writes to fill up the memory and even swap out stuff just to get data to be written into memory?
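The three-step loop above can be sketched as a toy simulation to see how quickly a fast writer reaches the limit. This is only a model under stated assumptions: the rates and the limit are made-up parameters, and the real kernel throttles writers in more complicated ways.

```python
from typing import Optional


def seconds_until_limit(limit_mb: float, fill_mb_s: float, drain_mb_s: float,
                        step_s: float = 0.1, max_s: float = 60.0) -> Optional[float]:
    """Simulate the loop; return the time at which dirty data reaches limit_mb.

    Returns None if the limit is not reached within max_s seconds.
    """
    dirty = 0.0
    elapsed = 0.0
    while elapsed < max_s:
        dirty += fill_mb_s * step_s                     # 1. place data in memory
        dirty = max(0.0, dirty - drain_mb_s * step_s)   # 2. write some to disk
        elapsed += step_s
        if dirty >= limit_mb:                           # 3. stop once memory is used up
            return elapsed
    return None


# Hypothetical figures: 1200 MB allowed, writer at 1000 MB/s, disk at 50 MB/s.
print(seconds_until_limit(1200, 1000, 50))
# A writer slower than the disk never accumulates dirty data.
print(seconds_until_limit(1200, 40, 50))
```

With a writer that is much faster than the disk, the limit is reached almost immediately, whatever its size, which matches the vmstat traces: the disk rate only decides how long the backlog takes to drain afterwards.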
(In reply to comment #466) > So the big question: Why do the kernel allow large data writes to fill up the > memory and even swap out stuff just to get data to be written into memory? A good question, but not the real source of this problem, I guess. Judging by the previous posts and my own experience, this problem seems to occur with any concurrent I/O, possibly promoted by encryption. Provided that it is only one bug we are talking about.
I've noticed that earlier in the long list of comments. But could it be that others confuse the real issue with swap-out slowing things down during heavy disk writes?
(In reply to comment #468) > I've notices that earlier in the long list of comments. But could it be that > others confuse the real issue with swapout slowing things down during high > disk > write? This squares somewhat with my own experience:
1. The file cache is *very* aggressive, even pushing out to swap stuff I think I might be using.
2. Large writes to swap trounce interactivity (and little gets scheduled). Small writes seem not to have an adverse effect.
OK, I understand pushing out pages that haven't been used in a while in favour of more current caches; however, doing something that can result in 1.5 GiB going to page cache on a 2 GiB system (large copy, kernel compile) seems to provoke these large writes, which make everything go slow.
(In reply to comment #469) > > 1. The file cache is *very* aggressive, even pushing out to swap stuff I > think > I might be using. > Now, I'm not a kernel hacker, but a programmer after all, and to me it seems to be an easier job to fix the aggressive file cache than to fix this "large I/O operations ......"-thing, which is not at all that concrete and varies over platforms, machine specs etc. Maybe fixing the aggressive file cache would fix a lot of people's problems. I'm guessing that the file cache behaves 100% the same on all systems. Is that a correct assumption?
(In reply to comment #470) > (In reply to comment #469) > > > > 1. The file cache is *very* aggressive, even pushing out to swap stuff I > think > > I might be using. > > > > Now, I'm not a kernel hacker, but a programmer afterall, and to me it seems > to > be a an easier job to fix the aggressive file cache than to fix this "large > I/O > operations ......"-thing - which is not at all that concrete and varies over > platforms, machine specs etc. Isn't there already a knob for controlling the kernel's preference for swapping anonymous pages out to disk versus retaining cached/buffered block-device pages? /proc/sys/vm/swappiness — http://kerneltrap.org/node/3000 Our apps are appearing to hang because their GUI threads have stalled while waiting on pages (containing either executable code or auxiliary data like pixmaps) to come back into RAM from the disk. Reading those pages back in is taking forever because the disk queue is full of writes. The situation is worsened because reading the pages is not pipelined since the requests are being submitted from the page fault handler, so a program executing while huge disk activity is in progress will submit a request to load one page from disk and stall; then when that request is fulfilled, the program will execute a few hundred instructions more until its instruction pointer crosses into another page that isn't loaded from disk, whereupon the page fault handler will be invoked again, a new request will be submitted to the disk queue, and the application will hang again. Repeat ad infinitum. Meanwhile, while the program is stalled waiting for the page it needs to be loaded in from disk, all the rest of its pages are being evicted from RAM to make room for the huge disk buffers, thus perpetuating the problem. I would think the easiest and most reliable solution to this problem would be for the kernel to prefer fulfilling page-in requests ahead of dirtying blocks. 
If there are any requests to read pages in from disk to satisfy page faults, those requests should be fulfilled and a process's request to dirty a new page should be blocked. In other words, as dirty blocks are flushed to disk, thus freeing up RAM, the process performing the huge write shouldn't be allowed to dirty another block (thus consuming that freed RAM) if there are page-ins waiting to be fulfilled.
Created attachment 27243 [details] vmstat for my system running "stress -d 1" without hanging. My system had just logged into KDE, with around 650 MB of memory used by applications prior to starting "stress -d 1".
(In reply to comment #471) > > I would think the easiest and most reliable solution to this problem would be > for the kernel to prefer fulfilling page-in requests ahead of dirtying > blocks. > If there are any requests to read pages in from disk to satisfy page faults, > those requests should be fulfilled and a process's request to dirty a new > page > should be blocked. In other words, as dirty blocks are flushed to disk, thus > freeing up RAM, the process performing the huge write shouldn't be allowed to > dirty another block (thus consuming that freed RAM) if there are page-ins > waiting to be fulfilled. I agree with you on the preference-part. It will fix the race-like situation. But as I understand, it will not keep the file cache from swapping out a single page?
(In reply to comment #473) > I agree with you on the preference-part. It will fix the race-like situation. > But as I understand, it will not keep the file cache from swapping out a > single > page? Implementing my suggestion wouldn't prevent mmap'd pages from being evicted from RAM to make room for file cache. It would only mean (1) that the file cache wouldn't be allowed to consume pages that are needed to satisfy page faults, and (2) that requests to read pages in from disk (whether from swap (anonymous pages) or from mmap'd files such as executables) would be serviced ahead of any other reads or writes in the disk queue.
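To make the two rules concrete, here is a toy model of a disk queue that services page-in reads ahead of everything else and tells a writer to wait while page-ins are pending. It is a sketch of the proposal only, not of any existing kernel I/O scheduler; the class and method names are invented for illustration.

```python
import heapq
from itertools import count

PAGE_IN, OTHER = 0, 1  # lower value is dispatched first


class ToyIoQueue:
    """Priority queue: page-ins first, FIFO order within each class."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = count()  # tie-breaker that preserves submission order

    def submit(self, kind: int, name: str) -> None:
        heapq.heappush(self._heap, (kind, next(self._seq), name))

    def dispatch(self) -> str:
        """Pop the next request to send to the disk (rule 2)."""
        return heapq.heappop(self._heap)[2]

    def writer_may_dirty(self) -> bool:
        """Rule 1: no new dirty pages while any page-in is still waiting."""
        return not self._heap or self._heap[0][0] != PAGE_IN

    def __len__(self) -> int:
        return len(self._heap)


q = ToyIoQueue()
for kind, name in [(OTHER, "write-1"), (OTHER, "write-2"), (PAGE_IN, "fault-A"),
                   (OTHER, "write-3"), (PAGE_IN, "fault-B")]:
    q.submit(kind, name)

print(q.writer_may_dirty())                 # page-ins pending, writer must wait
print([q.dispatch() for _ in range(len(q))])
```

One obvious caveat of a strict priority like this is that a steady stream of page faults could starve writeback entirely, so a real implementation would need some bound on how long writes can be deferred.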
(In reply to comment #471) > Isn't there already a knob for controlling the kernel's preference for > swapping > anonymous pages out to disk versus retaining cached/buffered block-device > pages? > > /proc/sys/vm/swappiness — http://kerneltrap.org/node/3000 (For some reason playing with this doesn't seem to do anything, but perhaps that's another bug report.)
> I would think the easiest and most reliable solution to this problem would be > for the kernel to prefer fulfilling page-in requests ahead of dirtying > blocks. > If there are any requests to read pages in from disk to satisfy page faults, > those requests should be fulfilled and a process's request to dirty a new > page > should be blocked. In other words, as dirty blocks are flushed to disk, thus > freeing up RAM, the process performing the huge write shouldn't be allowed to > dirty another block (thus consuming that freed RAM) if there are page-ins > waiting to be fulfilled. Matt: Wouldn't setting dirty_bytes to a low value make sure that processes never dirty more than a fixed number of pages, and hence never get to consume more RAM until their existing dirty pages are flushed? Or maybe that's not how dirty_*bytes is designed to work. Maybe (I am guessing here) it just controls when the flush begins to happen for dirty pages, and the application can still continue to dirty more pages. But if dirty_bytes controls when the process itself has to flush its dirty buffers, then it would be busy flushing and waiting on IO to complete and couldn't be dirtying more memory, right? So it does look like setting dirty_bytes to a low value like 4096 will produce an extreme case where the process's writes are almost completely synchronous and the page cache is not pounded at all. Can someone try this extreme test? Set dirty_bytes to 4096 and rerun your scenario. The sequential bandwidth seen by the disk stresser will go down the drain, but your system should survive.
According to http://www.kernel.org/doc/Documentation/sysctl/vm.txt "Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit will be ignored and the old configuration will be retained." Better make that 8192 Also you could try lowering /proc/sys/vm/dirty_ratio
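The two-page minimum depends on the machine's page size, so 8192 is only right for 4 KiB pages. A small sketch for working out the smallest value the kernel should accept, assuming the rule is exactly as quoted from vm.txt above:

```python
import mmap


def min_dirty_bytes(page_size: int) -> int:
    """Smallest dirty_bytes value accepted per the quoted vm.txt note: two pages."""
    return 2 * page_size


def is_accepted(requested: int, page_size: int) -> bool:
    """Per the quoted note, smaller values are ignored and the old setting is kept."""
    return requested >= min_dirty_bytes(page_size)


page = mmap.PAGESIZE  # page size of the machine running this script
print("page size:", page, "minimum dirty_bytes:", min_dirty_bytes(page))
print("4096 accepted with 4 KiB pages?", is_accepted(4096, 4096))
print("8192 accepted with 4 KiB pages?", is_accepted(8192, 4096))
```

On architectures with larger pages the minimum scales accordingly, so it is worth computing rather than hard-coding 8192.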
(In reply to comment #477) > According to http://www.kernel.org/doc/Documentation/sysctl/vm.txt > > "Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any > value lower than this limit will be ignored and the old configuration will be > retained." > > Better make that 8192 > > Also you could try lowering /proc/sys/vm/dirty_ratio Setting dirty_bytes to 8192 solves the slowdown for me. Of course, it ends up with a throughput from "stress -d 1" which is considerably lower than when dirty_bytes was set to 0. From the documentation: "If dirty_bytes is written, dirty_ratio becomes a function of its value (dirty_bytes / the amount of dirtyable system memory)." Now, dirty_ratio is 60 by default, so 60% of my system memory can be used for dirty pages. On my system that is 1.2GB. So if I do not have 1.2GB free and I am doing some high-throughput write to disk, my system will hang. I think that is a bit overkill, especially considering that a standard hard disk can write no more than 100MB/sec. The kernel should be reasonable enough to behave and not just hog the majority of system memory during high-throughput operations. Just think of a system with 8GB of memory where 6GB is used by running applications. Running "stress -d 1" on such a setup would kill it. The writing application would be allowed to use 60% of the 8GB for dirty pages. It seems massive, so please correct me if I'm wrong, since I have not done a test on such a system.
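The arithmetic behind those figures can be written down directly. This is a back-of-the-envelope sketch: the ratio is passed in as a parameter (60 is the value assumed in the comment above, not something the code checks), the kernel applies the ratio to dirtyable memory rather than total RAM, and 100 MB/s is the disk speed quoted above.

```python
def dirty_limit_mb(dirtyable_mb: float, dirty_ratio_percent: float) -> float:
    """Amount of memory that may hold dirty pages before writers are blocked."""
    return dirtyable_mb * dirty_ratio_percent / 100


def flush_seconds(dirty_mb: float, disk_mb_s: float) -> float:
    """Best-case time to write that much dirty data back to disk."""
    return dirty_mb / disk_mb_s


for ram_mb in (2000, 8000):
    limit = dirty_limit_mb(ram_mb, 60)
    print("%d MB RAM: up to %.0f MB dirty, at least %.0f s to flush at 100 MB/s"
          % (ram_mb, limit, flush_seconds(limit, 100)))
```

Even in the best case of purely sequential writeback, the backlog takes many seconds to drain, and any page-in issued during that time queues behind it.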
Søren: These parameters exist to tune the system behavior. There are other parameters which control the behavior of pdflush and the FS journal threads, but getting these all in harmony to make the system perform well in all scenarios is not an easy task. I think the hope is that pages will be reclaimed fast enough by pdflush if its parameters are tuned as well. But I agree that, by default, letting one process dirty 60% of physical RAM before it blocks itself on IO flush is a bad thing, particularly when filling RAM is many orders of magnitude faster than emptying it to disk. A couple of rogue user processes can bring the system down in a hurry. Linux needs to account for the disparity between RAM and disk, and how that disparity has increased manyfold in recent times. A 2GB system is considered the minimum these days. Filling 60% of it will take well under a second even on the slowest RAM, but emptying it to disk will take many seconds, if not minutes, on the fastest drives.
Søren: These parameters exist to tune the system behavior. There are other parameters which control the behavior of pdflush and the FS journal threads, but getting these all in harmony to make the system perform well in all scenarios is not an easy task. I think the hope is that pages will be reclaimed fast enough by pdflush if its parameters are tuned as well. But I agree that, by default, letting one process dirty 60% of physical RAM before it blocks itself on IO flush is a bad thing, particularly when filling RAM is many orders of magnitude faster than emptying it to disk. A couple of rogue user processes can bring the system down in a hurry. Linux needs to account for the disparity between RAM and disk, and how that disparity has increased manyfold in recent times. A 2GB system is considered the minimum these days. Filling 60% of it will take well under a second even on the slowest RAM, but emptying it to disk will take many seconds, if not minutes, on normal drives.
Apologies for the double post. The first one timed out on me. While reposting, I realized the fastest drives on the market today (the SSDs) will likely be able to do this in seconds, so I changed the word fastest to normal...:-)
devsk: Yeah, but shouldn't those knobs be there to squeeze the most out of your system? The defaults should be set in a way that is not destructive, e.g. swappiness = 0 - 10 or dirty_ratio = 10, or a combination of both, or some other settings. People will experience trouble with the default settings anyway, so reports like "high-throughput disk writes are slow" are certainly a lot better than "high-throughput disk write locks my machine". What are the best first steps to solving this?
1. Changing defaults on existing knobs?
2. Changing the kernel code?
There are currently various patches dealing with various aspects of writeback. Some or all of these _may_ be ready for inclusion in 2.6.36
Nice .... where are those. If they apply to 2.6.35-something I will be happy to try them out.
Here are a couple of things being worked on. http://lwn.net/Articles/397003/ http://lwn.net/Articles/396512/ You'll need to dig around for the patches.
Wu Fengguang of Intel has started looking through this bug report. He has some patches that he'd like people to try. http://lkml.org/lkml/2010/8/1/40 http://lkml.org/lkml/2010/8/1/45
Created attachment 27313 [details] screenshot of extreme iowait at ridiculously low throughput
Created attachment 27314 [details] Wu Fengguang's anti-io-stall patch rebased for vanilla 2.6.35 @#486 The posted patches didn't apply to recent kernels, just rebased for latest kernel release and compiled.. Will restart machine now and party wildly if FINALLY this small change fixes this issue.
(In reply to comment #487) > Created an attachment (id=27313) [details] > screenshot of extreme iowait at ridiculously low throughput I have found that even if dstat shows 0B throughput, the disk can be very much active. So dstat seems not to measure the number of bytes actually going to the disk.
2.6.35 + patch from #488: The mouse froze four times, for 1 - 1.5 seconds each, while dd wrote. When Sweep opened the file and swap grew from 0 to 1.3 GiB, the mouse froze. After opening the file, Kopete lost its connection to the Jabber account and KWin disabled desktop effects.
Created attachment 27324 [details] test results
(In reply to comment #490) > 2.6.35 + patch from #488 > > Mouse froze four times at 1 - 1.5 seconds, while dd wrote. > > When the sweep opens the file and swap grew from 0 to 1.3 GiB, mouse frozen. > After opening the file Kopete loses connection to the Jabber account and KWin > disables desktop effects. Did you make sure memory usage was at 50% before starting the test? Just to make sure pageout is triggered.
I'm just copying a few files from an NFS folder to USB on my computer. I found that IO wait times are huge but the network is barely in use. This is strange, as the folder is an NFS one attached over gigabit Ethernet. The problem is that the IO wait times are making my desktop unusable. The window manager takes a long time to move a window around, the desktop does not respond well, the mouse hangs sometimes... This is a mess. This is the kernel: Linux azul1 2.6.35-10-generic #15-Ubuntu SMP Thu Jul 22 11:10:38 UTC 2010 x86_64 GNU/Linux. Some maintainer of the kernel should triage this bug: separate it into a few different bugs (because I'm sure that there is more than one related to this) and try to resolve them. Divide and conquer! Thank you guys!
The patch from #488 does not solve the problem on my machine. My machine starts to stall even if 2GiB of 8GiB RAM is still free. The menu stalls if the icons are not loaded and there is heavy IO. It starts to stall sooner while executing dd if=/dev/zero of=t1 bs=1M count=8K (throughput ~48.2MiB/s) than with dd if=/dev/zero of=t1 bs=4K count=2M (throughput ~52.7MiB/s). The test data is written on the inner part of the disk, while the OS is on the outer part. All partitions are ext4. High fragmentation, caused by LVM snapshots, increases this problem.
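For comparing the two dd runs, it helps to confirm that they write the same amount of data and differ only in block size. A small sketch, assuming GNU dd's convention that the K, M and G suffixes are powers of 1024:

```python
_SUFFIXES = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def dd_number(value: str) -> int:
    """Convert a dd-style number such as '4K' or '1M' to an integer."""
    digits = value.rstrip("KMG")
    return int(digits) * _SUFFIXES[value[len(digits):]]


def dd_total_bytes(bs: str, count: str) -> int:
    """Total bytes written by `dd bs=<bs> count=<count>`."""
    return dd_number(bs) * dd_number(count)


for bs, count in (("1M", "8K"), ("4K", "2M")):
    total = dd_total_bytes(bs, count)
    print("bs=%s count=%s -> %.1f GiB" % (bs, count, total / 2 ** 30))
```

Both commands write 8 GiB, so the difference in when the stall sets in comes from the request size rather than the volume of data.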
Hi, I did some tests with the patch from #488.
Test procedure:
- filled up memory to 70/80% (4GB physical memory total)
- executed "stress -d 1"
- played around changing windows, changing tabs in Chromium, accessing menus, etc.
-----------------------------------------
2.6.35 vanilla, 10GB swap partition on: Complete hang, no response at all from mouse or keyboard, had to reboot manually.
2.6.35 vanilla, 10GB swap partition off: A few hiccups, but system was still usable, although slow.
2.6.35 + patch from #488, swap partition on: A few hiccups, but system was still usable, although slow.
2.6.35 + patch from #488, swap partition off: A few hiccups, but system was still usable, although slow.
-----------------------------------------
So the patch from #488 seems to solve the problem for me. The hiccups and slowness can be attributed to my relatively slow magnetic disk and the fact that my partition is encrypted under LUKS. This is a very important bug for Linux on the desktop. I'm glad there is a patch out for it and I'll continue to use the patch for my kernels, but it should definitely be fixed in mainline!
Hi all, has anyone seen this article? http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw Are they talking about the same patches? Sounds like the same issue.
I tried the patch from #488 on 2.6.35. When running dd if=/dev/zero of=/tmp/test bs=1M count=1M the system was almost flawless: windows switched quickly and open programs reacted instantly. I might be mistaken, but I'm under the impression that my programs take more time to launch. I wonder if anyone else has seen that.
*** Bug 15463 has been marked as a duplicate of this bug. ***
#496: Yes, the patch mentioned on Phoronix IS the one from #488, and as reported by several people it seems to improve IO latency (at the cost of throughput?) but falls short of completely preventing stalls. The strange thing for me is that the problems seemingly increase with uptime... Besides, I noticed some rogue flush-btrfs-1 threads causing 1MiB/s average disk writing (uptime > 2 days, even after bringing down the services causing heavy IO). I posted a blktrace of that to the linux-btrfs ML but have no answer yet ^^ Wow, this one's a tricky one. One thing I noticed a few kernel revisions back that might be relevant: there were a lot of processes in IOWAIT state (the result of compiling packages, BOINC, munin-graph, ntop... and then some) and I wanted to prioritize a single process, so I issued an ionice -p xxx -c1 -n0 (realtime: prio 0). What I expected was that the process would instantly get its IO through and pick up work. Alas, it took SEVERAL MINUTES before it did. That really wtfed me. Is this broken by design? Shouldn't IO renicing take effect immediately?
#496 doesn't solve the problem IMHO. Tested on Ubuntu Karmic (10.04) with vanilla 2.6.35. A simple 'dd if=/dev/zero of=/some/file bs=1M' caused 100% load (dual-head Core2 Duo E8500) and a high latency on even ^C'ing dd process itself. Need more info? Ask please.
I tried the patch rebased for 2.6.35 https://bugzilla.kernel.org/attachment.cgi?id=27314 It is probably OK, but my first test is to fill my memory with all the apps I can find and then run "stress -d 1". And as expected, it started paging stuff out. You other guys must have the exact same problem, at least you, Pedro. To me, the responsiveness drops because of paging out.
echo 10 > /proc/sys/vm/vfs_cache_pressure
echo 4096 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb
echo 100 > /proc/sys/vm/swappiness
echo 0 > /proc/sys/vm/dirty_ratio
echo 0 > /proc/sys/vm/dirty_background_ratio
This solution works for me. Alternatively, use the "sync" fs mount option.
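Before applying tunings like these, it is worth recording the current values so they can be put back when testing a patch on an otherwise stock configuration. A minimal read-only sketch for the /proc/sys entries follows (the /sys/block queue settings are not covered); the helper names are made up, and the script only prints restore commands rather than writing anything.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Optional


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name such as 'vm.dirty_ratio' to its file."""
    return Path(root, *name.split("."))


def snapshot(names: Iterable[str], root: str = "/proc/sys") -> Dict[str, Optional[str]]:
    """Read current values; unreadable or missing entries are recorded as None."""
    values: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            values[name] = sysctl_path(name, root).read_text().strip()
        except OSError:
            values[name] = None
    return values


def restore_commands(values: Dict[str, Optional[str]]) -> List[str]:
    """Shell commands (to be run as root) that would restore the snapshot."""
    return ["echo %s > %s" % (value, sysctl_path(name))
            for name, value in values.items() if value is not None]


TUNABLES = ["vm.vfs_cache_pressure", "vm.swappiness",
            "vm.dirty_ratio", "vm.dirty_background_ratio"]

for command in restore_commands(snapshot(TUNABLES)):
    print(command)
```

Saving the printed lines to a file before experimenting gives a one-step way back to the previous settings.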
(In reply to comment #502) > echo 10 > /proc/sys/vm/vfs_cache_pressure > echo 4096 > /sys/block/sda/queue/nr_requests > echo 4096 > /sys/block/sda/queue/read_ahead_kb > echo 100 > /proc/sys/vm/swappiness > echo 0 > /proc/sys/vm/dirty_ratio > echo 0 > /proc/sys/vm/dirty_background_ratio > > this solution work for me. > or use "sync" fs-mount option. Yeah, but testing a kernel patch with those testtings is not good for seing its effects.
(In reply to comment #501) > I tried the patch rebased for 2.6.35 > https://bugzilla.kernel.org/attachment.cgi?id=27314 > > It is problably ok, byt my first test is to fill my memory with all apps I > can > find and then run "stress -d 1". And as expected it started paging stuff out. > You other guys must have the exact same problem, at least you Pedro. To me > the > responsiveness drop because of paging out. Hi Soren, as said in my comment, I do have the responsiveness drop, but I don't think that is a bug. If you are swapping to a slow disk, that is kind of expected. However, what is not expected is a complete loss of responsiveness, with the UI hanging if even for a few seconds. I find that the mentioned patch improves a lot this situation vs the vanilla kernel. Of course, the best option yet is to disable swap, but for me 4GB of ram is not enough...
I also have responsiveness problems on Linux when I copy large files. Other OSes stay very responsive during multiple file copies, but Linux does not. Windows does all I/O asynchronously (no sync possible, as I read in the Qt docs); why not have the same option in the Linux kernel?
After testing the patches intensively, I have to say that although they do improve the situation, they do it only slightly. I guess the best solution is still disabling swap. Also, what's the idea of having a swappiness tunable if it doesn't work? I can set it to 0, and even though I have only 70% of physical memory in use the system starts swapping to disk.
(In reply to comment #506)
> After testing the patches intensively, I have to say that although they do
> improve the situation, they do it only slightly. I guess the best solution is
> still disabling swap.

It does help initially, but not always. Under a memory crunch, I found my laptop completely unresponsive even though swap was off (RAM is 3GiB).

> Also, what's the idea of having a swappiness tunable if it doesn't work? I can
> set it to 0, and even though I have only 70% of physical memory in use the
> system starts swapping to disk.

That's weird. On my box, it does work the way it is designed. I have concluded overall that the default value of 60 is correct. If there is a buggy application, that should be fixed. I wouldn't be interested in OOMs on my box.
Memory count actually drops when the system becomes unresponsive during copying of a large file, if a bunch of small files was copied immediately before.
I've added some information on the Ubuntu bug page, but will add it here for completeness' sake:

1) I'm seeing this problem extremely frequently due to an unrelated bug that makes X leak memory.
2) On a machine with 4GB memory and no swap, the disk starts thrashing like crazy when 60-70% of the memory is used. It's so bad that I can't even log in on a console, as getty times out before I get a chance to enter the password.
3) If swap is enabled on the same machine, it will start swapping out. Doing a "swapoff -a" will force the swap-in as planned, but it proceeds at only approximately 500KB/s.
I have compiled the new 2.6.36 kernel today, I found this bug is REALLY fixed on my notebook! Copy a 700MB movie to USB disk became very smooth and quick, GUIs are very responsive, much better than 2.6.35.4(the last kernel of Zenwalk). Just like some one said, the angels are singing again! Congratulations! Great work! Long live Linux!
I'm not seeing this issue on 2.6.36, amd64, 4GB RAM, 3GB swap, swappiness 20. Ran 'stress -d 1' while browsing websites for 15 minutes with no issues.
2.6.36-zen0-00214-g665fe96

Still has page faults of about 1 second when running "stress -d 1" or "pv /dev/zero > qqq". Swap is off.

This:
> echo 10 > /proc/sys/vm/vfs_cache_pressure
> echo 4096 > /sys/block/sda/queue/nr_requests
> echo 4096 > /sys/block/sda/queue/read_ahead_kb
> echo 100 > /proc/sys/vm/swappiness
> echo 0 > /proc/sys/vm/dirty_ratio
> echo 0 > /proc/sys/vm/dirty_background_ratio
does not help.

Uniprocessor system, i386, 1.5G of RAM. 1G of it was in use by applications when testing.
Just wanted to add my two cents, since I have been experiencing this problem for a very long time now on various machines. Until now I simply adapted by doing nothing on the machine while large file copies were running. But somehow I stumbled upon a solution for this, maybe.

I had these problems (the one you are talking about in this bug and some others) after I started using MD RAID. First I thought it was something with the I/O scheduler. I tried all the schedulers there are: noop, CFQ, deadline, anticipatory... Some helped a little bit, some didn't. Then I thought it was something with the FS, so I tried ext2, ext3, XFS and now ext4. The same problem prevailed. When I started copying large files I had OS "hiccups": everything that had to do some disk work stopped. Music and OpenGL were still functioning normally; only the responsiveness of the system was gone for 1 or 2 secs. No browsing, no changing terminal windows. Then I thought that it had something to do with swap, too.

A few days ago I got myself a new machine (i7 950, 2 x SATA3 WD HD, 12GB of RAM) and installed a new OS on it: pure 64-bit, kernel 2.6.36. The thing I had to do was to copy my old data to the new disks and reuse the old disks. Now, the way I did it is very important. I took a 1 TB WD SATA3 HD, made some partitions (6 to be exact) and compiled a new OS. Then I copied the old data from the old RAID. The old RAID was 4 partitions on each disk, with each MD RAID 1 array built from two partitions. While I copied the data I had these hiccups with the new system, too.

Then I had this idea: since it is now possible to make a partitioned RAID with MD, and you can take whole disks for an array, I would make a RAID 10 out of these four disks, 2 new ones and 2 old ones. So it was like "mdadm --create /dev/md0 ... --raid-devices=4 /dev/sda /dev/sdb..." Worked like a charm. Then I partitioned the array with "fdisk /dev/md0". No problem there. Then I copied the old stuff from the single disk with 6 partitions to the new array. Now here is the interesting bit.
No hiccups!!! Throughput was around 120MB/s and the OS was running as smooth as a baby's bottom. And it was the same OS, with no changes at all regarding the kernel compile or anything else. Read throughput was 270MB/s (dd test).

But since the rootfs won't work on a partitioned MD array (some kernel race problem, but that's another story), I had to change my setup on the new HDs. So again I created 4 normal partitions on each disk: one from each HD for the bootfs (RAID 1), another 4 for swap, another 4 for the rootfs (also RAID 1), and the last four for a RAID 10, which I partitioned into two separate partitions (srv and home). And the hiccups came back.

So this isn't hardware related, because I can reproduce this problem on a lot of hardware. A list follows. It's not the file system or the like, because I used them all. It's not swap, because this new machine didn't start to swap while I was copying. But this problem always comes up when I make more (normal) partitions for MD RAID.

The list of hardware:

Quad-Core 6600, I think it was the ICH7 chipset, 8GB RAM, 2 x WD10EARS
I think the kernel was 2.6.20 something, 32-bit system, LinuxFromScratch 6.1 or 6.2. Can't remember. The system has worked for three years up to now.
The partitions of the disks were sda1,sda2,sda3,sda4,sdb1,sdb2,sdb3,sdb4
The RAID arrays were md0 -> (sda1,sdb1); ... ; md4 -> (sda4,sdb4)
md0 -> /boot
md1 -> swap
md2 -> /
md3 -> /srv

Fujitsu Siemens RX100S6 x 2
1x XEON 3220 (Quad), 4GB memory, and I can't remember the chipset.
1x XEON E3110 (Dual), 4GB RAM, still can't remember the chipset.
kernel 2.6.32.10, pure 64-bit system, LFS 6.5

And now:
i7 950, 12GB RAM, ICH10 chipset, 2 x WD10EARS, 2 x WD1002FAEX (+1 temporary)
kernel 2.6.36, pure 64-bit, LFS 6.7

The setup that worked:
sda1,sda2,sda3,sda4,sda5,sda6; sdb,sdc,sdd,sde
md0 -> (sdb,sdc,sdd,sde) RAID10
md0p1 -> boot (tried it but grub couldn't do it)
md0p2 -> swap (no problem there)
md0p3 -> / (tried it after a workaround for grub to boot from RAID10, but the kernel didn't want to play along)
md0p4 -> extended part.
md0p5 -> /home (no problem there)
md0p6 -> /srv (no problem there)
sda1 -> /boot
sda2 -> swap
sda3 -> /
sda5 -> /home
sda6 -> /srv

Unfortunately I had to dump this setup because of a race condition where the kernel can't put the partitioned md together before the rootfs boot process starts. :-(

Now the setup that doesn't work (the one with the hiccups):
sda1,sda2,sda3,sda4,sdb1,sdb2,sdb3,sdb4,sdc1,sdc2,sdc3,sdc4,sdd1,sdd2,sdd3,sdd4
md0 -> (sda1,sdb1,sdc1,sdd1) RAID 1 -> /boot
swap -> (sda2,sdb2,sdc2,sdd2), didn't know what else to do with the free space
md1 (which somehow changed to md126 automagically after the third boot) -> (sda3,sdb3,sdc3,sdd3) RAID 1 -> /
md2 (which somehow changed to md127 automagically after the third boot) -> (sda4,sdb4,sdc4,sdd4) RAID 10 ->
md2p1 (changed to md127p1) -> /home
md2p2 (changed to md127p2) -> /srv
and the temporary disk, which used the sda slot until I had copied everything to the new setup.

Just to mention that throughput is still OK, around 80MB/s write, except for those hiccups. Didn't try read yet.

So, what else do you need from me so that we can kill this pesky bug? I can do anything that is not going to kill my system, because I'm using it for everyday work. Everything else (torture tests and so on) is OK after working hours.

Oh, and yes, I tried the /sys/block/sdX/device/queue_depth thingie. It worked for 5 mins and then the hiccups were back. dd is around 120MB/s...
To tackle this bug, the people who experience it need to dig deep, or good debug data has to be generated, along with good information about the system, because there can be several bugs out there with the same symptoms as this one. The best thing you can do to get it solved is to file individual bug reports with complete information. If you cannot give complete information, don't post the report, because then you can be sure it cannot be solved. The more relevant info we get, the easier it becomes to detect the problems.

First, install the newest kernel. It has the newest code, which reduces the chance that you'll run into an old and already fixed bug. At the time of writing that is 2.6.36. Then test again; if it still happens, file a bug report.

Start with correct system information:

Kernel: uname -a and cat /proc/version
Architecture: also from uname -a
Distro: name and version (could be handy for distro-specific patches)
CPU info: cat /proc/cpuinfo | grep -e '\(model name\|bogomips\|MHz\|flags\)'
Mem info: cat /proc/meminfo | grep MemTotal
I/O scheduler used: cat /sys/block/sdX/queue/scheduler
Hard disk configuration: whether it has RAID, type of disks, speed of disks, partitions used and filesystems used
Hard disk speed by hdparm:
hdparm -tT --direct /dev/sdX
hdparm -tT /dev/sdX

Give dumps of the following commands:

lshw
dmesg
lsmod
cat /proc/swaps
cat /proc/meminfo
cat /proc/cmdline
cat /proc/config.gz | gunzip -

And give dumps of the following files:

for every disk:
/sys/block/<disk>/queue/*
/sys/block/<disk>/queue/iosched/*

/proc/sys/vm/*

This is for information, so the developers can see what configuration the system has. If there are known configurations or drivers which are bad and may be giving the same symptoms, they will be noticed earlier.
If you want a script to help you collect the information, you can use the one located at http://github.com/meghuizen/systeminfo, which builds a tar.bz2 that you can attach, so you'll have complete information.

After that, learn a bit about the I/O scheduler, to make it easier for yourself to debug and understand the situation:
- http://www.linuxjournal.com/article/6931 (info on I/O schedulers)
- http://www.devshed.com/c/a/BrainDump/Linux-IO-Schedulers/
- http://kerneltrap.org/node/7637
- kernel-source/Documentation/block/iosched-description.txt (see: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=Documentation/block;hb=HEAD)
- http://www.westnet.com/~gsmith/content/linux-pdflush.htm
- http://www.docunext.com/blog/2009/10/debugging-and-reducing-io-wait.html

There are some tools which are very handy to use. The Linux perf tool, for example, is very handy for debugging slowness and latencies in your system. For some documentation on perf see:
- https://perf.wiki.kernel.org/index.php/Main_Page
- http://anton.ozlabs.org/blog/2010/01/10/using-perf-the-linux-performance-analysis-tool-on-ubuntu-karmic/
- http://blog.fenrus.org/?p=5

perf --help also gives you a lot of information.

And other profiling tools:
- http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/basic_profiling.txt;hb=HEAD

So perf output is rather handy for debugging these problems. If slowdowns happen again, try to capture some perf record dumps at the same time, and maybe perf timechart dumps as well, so the developers can analyze them too. For example, perf top shows you what's currently happening in the kernel. perf bench can help you benchmark your system, so you can test changes across patches, kernel versions and tuning parameters.
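For readers who would rather not run a third-party script, here is a minimal Python sketch of the file-collection part of the checklist above. It is not the systeminfo script linked above and has no relation to it. The names collect and write_report, the file list, the disk name "sda" and the output file name are all assumptions made for illustration, and command output such as lshw or dmesg is not captured.

```python
import os
import tarfile
import tempfile

# Files named in the checklist above, relative to the filesystem root.
# "sda" is only a placeholder disk name; substitute your own disk.
DEFAULT_FILES = (
    "proc/version",
    "proc/cpuinfo",
    "proc/meminfo",
    "proc/swaps",
    "proc/cmdline",
    "sys/block/sda/queue/scheduler",
)


def read_text(path):
    """Return the file contents, or a short note if it cannot be read."""
    try:
        with open(path, "r", errors="replace") as fh:
            return fh.read()
    except OSError as exc:
        return "unreadable: %s\n" % exc


def collect(root="/", files=DEFAULT_FILES):
    """Map each relative path to its contents under the given root."""
    return {rel: read_text(os.path.join(root, rel)) for rel in files}


def write_report(info, out_path):
    """Store every collected file in one bzip2-compressed tarball."""
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(out_path, "w:bz2") as tar:
            for rel, text in info.items():
                staged = os.path.join(tmp, rel.replace("/", "_"))
                with open(staged, "w") as fh:
                    fh.write(text)
                # Keep the original relative path inside the archive.
                tar.add(staged, arcname=rel)
    return out_path
```

Calling write_report(collect(), "sysinfo.tar.bz2") on the affected machine would produce a single attachment. The root parameter exists only so the functions can be tried against a copy of the files.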
Tried with recent official master: 18cb657ca1bafe635f368346a1676fb04c512edf
http://vi-server.org/vi/12309_report/linux-2.6.36-09212-g18cb657_i686-sysinfo.tar.bz2

While running "pv /dev/zero > qqq" (http://vi-server.org/vi/12309_report/fill.txt), after about 2 GB I get page faults:
http://vi-server.org/vi/12309_report/pagefault.txt
http://vi-server.org/vi/12309_report/pagefault2.txt

If I try the deadline or noop scheduler, I still get page faults, but only after about 5 GB of copied data (and probably not that often).

With cfq the speed jumps between 10 MB/s and 200 MB/s. With deadline or noop it is more stable, around 40 MB/s.

Trying
> echo 10 > /proc/sys/vm/vfs_cache_pressure
> echo 4096 > /sys/block/sda/queue/nr_requests
> echo 4096 > /sys/block/sda/queue/read_ahead_kb
> echo 100 > /proc/sys/vm/swappiness
> echo 0 > /proc/sys/vm/dirty_ratio
> echo 0 > /proc/sys/vm/dirty_background_ratio
on this kernel leads to a low filling speed (lower than 10 MB/s, measured with pv).

Also, after applying those settings, applications (starting with gpg2) begin to hang in uninterruptible sleep. I cannot stop the filling (it probably hangs too).

P.S. Using this kernel I also cannot start the X server.

If somebody wants, I can try other settings, other kernel revisions, patches, or another config.
Checked a bit more with CONFIG_HZ_100 and CONFIG_PREEMPT_NONE: the same. Filling rate with vm.dirty_ratio=0 is 1 MB/s (with periodic stalls of everything). If I set vm.dirty_ratio to 1, it rises to 40 MB/s (stable). Long page faults when loading programs are present as well. I was testing with only 200 MB (of 1.5G) of memory filled.
While it feels like a general improvement with 2.6.36 (no audio stutter with swap, and building a kernel no longer drags the system down (and fills up cache) like it did with 2.6.35), I still see cursor jerkiness when I first log in and start loading Firefox, Evolution and Pidgin (all at the same time).
I ran into this problem when using the new cgroup-scheduler patch. PC: Samsung NC10 netbook, kernel 2.6.36 vanilla, Zenwalk-snapshot. When I try to upgrade some packages in an X session and browse the Net at the same time, the latency increases badly, not constantly but in hitches. If I stop surfing the Net and return to my package manager, the system keeps working; otherwise it may hang so badly that I have to reboot with a SysRq key. If I turn off the cgroup scheduler in /sys, everything works fine. The kernel is compiled with full preemption and a 1000 Hz timer.
Trying 162253844be6caa9ad8bd84562cb3271690ceca9 from zenstable/io-less-dirty-throttling-2.6.37 - the same. Page faults of random processes (including Xorg) jump over 1 second while "pv /dev/zero > qqq". The speed measurements by "pv" are fluctuating (from 64 kb/s to 120 MB/s; avg 40 MB/s) just like in usual 2.6.35-zen2
I have a reproducible test sequence for bug 12309. It's easy: take a _SCRATCHED_ DVD, put it into the drive and copy all the files on it to a HDD. The bug shows up early :) The system freezes COMPLETELY whenever the drive reads the scratched sectors.

Distro: Arch Linux
linuxhost 2.6.36-ARCH #1 SMP PREEMPT Fri Dec 10 20:01:53 UTC 2010 i686 AMD Athlon(TM) XP AuthenticAMD GNU/Linux

Drive (dmesg |grep TSS):
Feb 14 20:11:45 linuxhost kernel: scsi 2:0:0:0: CD-ROM TSSTcorp CDDVDW SH-S203B SB00 PQ: 0 ANSI: 5
Feb 10 12:05:36 linuxhost kernel: ata1.00: ATAPI: TSSTcorp CDDVDW SH-S203B, SB00, max UDMA/100

SATA controller (on the PCI bus, drive connected to it):
00:0a.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID Controller (rev 50)
(In reply to comment #521)
> I have a reproducible test sequence for a 12309. It's easy:
>
> Take a _SCRATCHED_ DVD. Put it into the drive and copy all files on it to a
> HDD. The bug comes early :)
>
> The system freezes COMPLETELY at the time the drive read a scratched sectors.

I suspect this has more to do with the IDE bus than with the interaction between the kernel's block layer and the VM.

Try this:
dd if=/dev/dvd of=/dev/null bs=2048

I bet you get the same freezes when it reaches the scratches.
I checked the same DVD with another DVD drive (that drive is on the IDE bus, not on the SATA bus). All was OK, no freezes at all. Any ideas? Is this another bug?
> Try this:
> dd if=/dev/dvd of=/dev/null bs=2048
> I bet you get the same freezes when it reaches the scratches.

You're right. But this is still the 12309 bug, isn't it?
(In reply to comment #524) > But this is still the 12309 bug, isn't it? No. However, this bug report has turned into a dumping ground for anyone experiencing any lagginess, regardless of cause. The actual bug here is related to the kernel preferring to evict memory-mapped executable pages when a process dirties blocks faster than they can be flushed to disk. The apparent hangs in responsiveness are due to threads (particularly GUI threads) triggering page faults and being unable to make progress until their code is re-fetched from disk. The fix should be to block the writing process from dirtying any more blocks well before the kernel starts evicting mapped executable pages from memory, but so far no one has been able to make it work correctly in all cases (afaik).
Also, should I file a new bug report for my bug?
Trying the kernel from writeback/dirty-throttling-v6. Nothing seems to have changed, as usual. Still lengthy "Page Faults" (and others) for firefox-bin while running "pv /dev/zero > qqq". Should I provide more info about dirty-throttling-v6 (and if so, how do I collect it)?
> The actual bug here is
> related to the kernel preferring to evict memory-mapped executable pages when a
> process dirties blocks faster than they can be flushed to disk.

Okay. Let it be so. However, the subject line for this bug is

> Large I/O operations result in poor interactive performance and high iowait
> times

and that's what I'm experiencing now, rsync'ing 100 GB worth of data with almost everything already there on the receiving side (thus making the receiving rsync read files heavily for the checksums). And I am dead sure this has nothing to do with virtual memory, as swap is completely off (I would probably need to compile a different kernel with no support for swapping to reconfirm). iowait rises to 90%, LA shows disturbingly large numbers of up to 20, and unrelated processes like Xorg freeze, taking around 15 seconds to redraw the screen or move the mouse cursor or whatever.

What I thought this bug was about is that while one process does overwhelmingly large volumes of I/O, it should by no means impact other, unrelated processes which might not even use the disc subsystem, or not use the same disc. At least this is what Mac OS X does: for example, Transmission preallocates space for 40 GB worth of torrent data, naturally freezing in the process and ceasing to respond to any events, but then again, I can minimise its window, type code in Eclipse or anything, barely noticing the disc thrashing. I think I'm reiterating this example for the umpteenth time here, sorry if that's the case.

If I'm wrong and bug #12309 has been reduced to its VM part, I'd just like to ask which bug covers the above problem: high iowait affecting unrelated processes, with no swapping involved. Is that #13347? I cannot follow it because the submitter uses a dialect of English I'm not quite capable of parsing.
If there's no specific bug, I'll take the time to report it, because it bugs me a great deal; however, I'm afraid I'll have to repeat most of the tests already conducted here. Please don't take this as an attempt to offend anyone, because it's not. I just want to know where the specific symptom described above belongs. Thank you all for every effort to have it resolved.
@Yaroslav: Your misconception is that having swap disabled means that memory pages are never backed by disk blocks. That is simply not true. All it means is that *anonymous* pages cannot be backed by disk. All Linux kernels launch processes from disk (via execve(2)) by memory-mapping the executable image on disk and then jumping to the entry point address in the mapped image. Since the entry point address is in a non-resident page, the CPU's attempt to fetch an instruction from it triggers a page fault, which the kernel then handles by loading the needed page (and usually several more) from disk. When physical memory becomes scarce, the kernel has several tricks it may employ to attempt to free up memory. One of the first of these tricks is dropping cached blocks from the block layer and cached directory entries from the file system layer, which means that those blocks and dentries will have to be fetched from disk the next time they are accessed. One of the last tricks the kernel has is the OOM killer, which selects the "most offending" process and KILLs it in order to reclaim the memory it was using. Somewhere in between those two tricks, the kernel has another trick it attempts for freeing up physical memory. It can force memory pages out to disk. If the system has swap enabled, the kernel may force anonymous pages (e.g., process heaps and stacks) out to disk. In all cases, however, the kernel may also choose to force memory-mapped pages out to disk. If those memory-mapped pages are read-only (such as is the case with executable images), then "forcing them out to disk" really just means dropping them from physical memory, since they can always be fetched back in later. So, what does this mean in the context of this bug? The process that's hitting the disk a lot (usually it's dirtying blocks, but maybe it's possible that this happens even if it's just reading blocks) causes RAM to fill up with disk blocks. 
The kernel starts attempting its tricks to free up physical memory. One of those tricks is dropping memory-mapped pages from RAM, since they can always be fetched back into RAM from disk later. Then you the user switch applications or click on a button in the GUI or try to log into an SSH session, and what happens? Page fault! The code for repainting the X11 window or handling the button click or spawning a login session is not resident in memory because it was forced out by the kernel. That code now must be refetched from disk to satisfy the page fault, but uh oh, the disk is VERY busy and has very long queue depths, so it will be a while before the needed pages can be fetched. And at the same time as those pages are being fetched, the kernel is evicting other memory-mapped pages from RAM, so the responsiveness problem is just going to persist until the pressure on RAM subsides. Ideally, the kernel should not allow so many blocks to be dirtied that it has to resort to dropping memory-mapped pages from RAM. The dirty_ratio knob is supposed to control how much of RAM a process is allowed to fill with dirty blocks before it's forced to write them to disk itself (synchronously), but that does not appear to be working properly.
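One rough way to check the page-fault explanation above on your own machine is to watch the kernel's major-fault counter while the heavy writer is running. The sketch below is only an illustration under stated assumptions: it parses the "name value" layout of /proc/vmstat and uses the pgmajfault counter, and the function names parse_vmstat and major_fault_rate are made up for this example.

```python
def parse_vmstat(text):
    """Parse 'name value' lines, the layout used by /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def major_fault_rate(before_text, after_text, seconds):
    """Major page faults per second between two /proc/vmstat snapshots.

    A major fault is one that had to go to disk, which is the kind of
    fault described above for evicted executable pages.
    """
    before = parse_vmstat(before_text)["pgmajfault"]
    after = parse_vmstat(after_text)["pgmajfault"]
    return (after - before) / seconds
```

Taking one snapshot of /proc/vmstat before the copy starts and another during a stall gives a number to compare against an idle system. A rate that jumps during the stall would be consistent with the explanation, although it does not prove it.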
Incidentally, one reason this bug seems to manifest a lot more on 64-bit systems than on 32-bit systems is that 64-bit systems use Position-Independent Code (PIC) in their shared libraries universally, whereas 32-bit systems usually don't. Not using PIC means that 32-bit systems usually have to perform relocations throughout their shared libraries upon memory-mapping them, and those relocations cause private (anonymous) copies of those pages to be created, and those anonymous pages cannot be forced out to disk on systems without swap, so accessing those pages can never cause page faults. On 64-bit systems, PIC virtually eliminates the need to perform relocations in shared libraries, meaning most mappings of shared-library code are directly backed by the images on the disk and thus *may* be forced out of RAM and *may* cause page faults. In principle, using PIC (on 64-bit systems, which have new addressing modes to make it efficient) is a good idea because it means only one copy of a library needs to be in RAM, regardless of how many processes map it, rather than one relocated, private copy for each process, but because of this bug, *not* making private copies of the library code is what's killing us, as the only copy we have in memory is evictable. Please note, I am not arguing that the kernel should be making private copies of all executable pages; that would be the wrong solution. A better solution would be to prevent processes from dirtying so much RAM that the kernel has to start evicting pages that were memory-mapped by execve or dlopen (but not by plain old mmap!).
Thanks for the prompt reply and the patience to explain these things, but there's one more misconception on my side in desperate need of debunking. It's about the I/O queues.

This misconception starts from the suggestion that not all data are equal. For example, non-resident executable pages are tier 0. I/O buffers for application usage, like those for read(), write() and friends, are tier 1. If there are no priorities on the queue, we cannot tell the origins of I/O requests apart and thus get what we have: swapping a process in has to wait until the queue is emptied by a disk-hungry application beast which just happened to fill it up. If, on the other hand, we prioritize the queue and find a way to tell swap-in reads from application reads (say), it might improve interactive responsiveness. And the expense of having a tiered queue might be mitigated by employing it only on media which have at least one mmap'ed process.

I say "it might improve things" because the solution is so obvious, in fact, that I have little doubt it has been thoroughly thought through and ultimately rejected. And I have no doubt that everyone who gets a single line of code accepted and committed into mainline is smarter than me in this respect[1], so this must have popped up a while ago.

[1] I'm no kernel hacker at all, just your average applications developer.
@Yaroslav: I agree. I've had the same thoughts regarding priority in the I/O queues. The biggest problem with this approach is that much of the queues actually sit inside the hardware nowadays. SCSI TCQ (tagged command queuing) and SATA NCQ (native command queuing) have exacerbated this. The Linux kernel can't do anything to prioritize queues inside the hardware, but it can limit how much of the hardware queue it will use, thus effectively keeping the queue in software only. Some proposed workarounds to this bug 12309 involve reducing the depth of the hardware queue that Linux is allowed to use, and that does seem to improve the worst case, although it severely degrades the common case. Another workaround might be to prevent the kernel from evicting executable memory-mapped pages in the first place. This would be only a partial solution, though, as applications often memory map resources that are not executable (for example, fonts, pixmaps, databases), so their responsiveness could hang on page faults for those resources just as readily as on page faults for code.
You are right about the workaround, but having a queue prioritised would be of help when, despite all workarounds, pages were actually evicted. I actually imagine it as a 4-tier queue: tier 0 for realtime processes, 1 for swap-ins we are talking about now, 2 for every other virtual memory operations, and 3 for everything else (or count 2 and 3 as everything else, maybe). My question then will be as follows: yes, we cannot control the commands queueing once they enter the hardware. But if we happen to know the hardware command queue size (which we do) and if we are able to tell how full it currently is (which I'm not quite sure about but I think it can be figured out), we could split it so that every tier is permitted to fill no more than some percentage of the hardware queue. It would of course hit average case performance, but still guarantee some bandwidth for higher tier I/O which is a good thing IMHO. Sorry for bugging and probably ignorance, but I really want this nailed.
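The per-tier cap proposed above can be illustrated with a toy model. This is not how the Linux block layer works and not a proposed patch; the class TieredQueue, the queue depth of 32 and the share values are all hypothetical, chosen only to show the admission rule.

```python
class TieredQueue:
    """Toy admission control for a fixed-depth hardware command queue.

    Each tier may occupy at most its share of the queue, so bulk I/O
    can never fill every slot and starve higher-priority requests.
    """

    def __init__(self, depth, shares):
        self.depth = depth
        # Cap per tier, in slots; every tier gets at least one slot.
        self.caps = [max(1, int(depth * share)) for share in shares]
        self.in_flight = [0] * len(shares)

    def try_submit(self, tier):
        """Return True if a request from this tier may enter the queue."""
        if sum(self.in_flight) >= self.depth:
            return False  # hardware queue is completely full
        if self.in_flight[tier] >= self.caps[tier]:
            return False  # this tier has used up its share
        self.in_flight[tier] += 1
        return True

    def complete(self, tier):
        """Record that one request from this tier has finished."""
        if self.in_flight[tier] > 0:
            self.in_flight[tier] -= 1
```

With a depth of 32 and shares of (1.0, 0.75, 0.5, 0.5) for tiers 0 to 3, a flood of tier-3 requests stops being admitted after 16 slots, which leaves room for a tier-1 swap-in read to be queued immediately. The cost, as noted above, is that the bulk writer cannot use the full queue even when nothing else is waiting.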
To everyone interested in this bug: An easy and reliable way to demonstrate the issues surrounding this bug (on a system without anonymous swap) is to mount a tmpfs that is sized as large as your physical RAM. Then start writing to it (slowly!!!). The kernel will be unable to flush those blocks to disk, as they are not backed by disk. As you continue writing to the tmpfs, the kernel will gradually evict everything else in your block cache and file system cache. At some point, the kernel will have run out of caches to evict and will start evicting memory-mapped pages. You'll know this has happened when the system responsiveness comes to a crawl and your disk starts thrashing. Yes, your disk will thrash, even though you're only writing to a tmpfs. The thrashing is due to all the page-ins of executable pages that are being accessed as various processes on your system struggle to keep executing their background threads and event processing loops. If your writer process continues writing to the tmpfs, your system will become completely unusable. If you're lucky, eventually the kernel's OOM killer will be invoked. The OOM killer probably won't choose your tmpfs writer as its victim, though, so you'll have only a short time to kill the writer yourself before your system grinds to a halt again. If you do manage to get it killed, you can simply unmount the tmpfs, and everything will return to normal in short order. You will notice a bit of lag the first time you switch back to other applications that were running, as they will trigger page faults to get their code loaded back into RAM, but once that's done, everything will be as usual.
It would have made sense if only starting new processes were slow. But copying large volumes of data slows down even the mouse cursor, where the Xorg HID driver already sits in memory. If what you've described affects a driver already in memory, the entire architecture has to be abandoned. Take this as a definition of the problem, not an excuse.
Hm, I then have another wild suggestion. It is in fact a very rare event that a process needs to hang in memory but wake up once in a blue moon, so that it can be harmlessly paged out and not bring the system to a halt. From my desktop experience I can only remember LibreOffice sitting on my long-running machine and be actually used once in two weeks or so. If the problem is really so grave that an often-running process (like Xorg!) is selected by the kernel to be paged out, why not work this around by disabling evicting processes' pages altogether? I think it must be somewhat easier than designing an over-engineered strategy for choosing what pages to throw away, test it over a couple years, find bugs in the very design, throw it away, design another one and so on. I would love to see a flag which I could set per control group. If the flag is set, pages owned by processes in that cgroup are never swapped out. Combined with pessimistic overcommit policy, it could help at least a bit. Or at least worth a try.
(In reply to comment #535)
> It would have made sense if only starting new processes was slow. Copying large
> volumes of data slows down even mouse cursor, where Xorg HID driver already
> sits in memory. If what you've described affects driver already in memory,
> entire architecture has to be abandoned. So to say, definition of the problem,
> not an excuse.

If you're seeing the mouse cursor lag/skip while copying large volumes of data, an alternative explanation could be that you're using PIO mode for your data transfers rather than DMA. However, as you identify, it's possible that the X.org driver that handles the mouse input is indeed being paged out, and that would result in mouse interrupts triggering page faults, and the mouse cursor would not update on screen until the code for doing so had been paged back in.

To say the entire architecture must be abandoned is too extreme. Memory-mapping executable images is a very efficient mechanism that ordinarily works beautifully. This bug is creating pathological conditions that should never occur.

(In reply to comment #536)
> If the problem is really so grave that an often-running process (like Xorg!) is
> selected by the kernel to be paged out, why not work this around by disabling
> evicting processes' pages altogether?

You can't do that. Consider a process that maps a 1 TB file into memory and then starts randomly reading from it, thus causing more and more of the file to be loaded from disk into physical memory. You *must* allow pages to be evicted, or you will run out of RAM.

Don't try to solve a problem that doesn't exist. The actual problem here is that the block layer is using too much RAM for dirty (or possibly even clean) blocks. To demonstrate to yourself that this is so, you may try another of the proposed workarounds, which is to mount your file system in "sync" mode, which causes all file writes to be performed synchronously rather than being buffered and written back later.
Under that constraint, you will never run into this bug, because the block layer is never allowed to use so much RAM that the kernel starts paging out "hot" memory-mapped pages. (By "hot," I mean pages that are regularly being accessed, such that you would notice if they had to be paged back in from disk.)
Okay, sync might work, but it would also make filesystems slow as hell and contribute to media wear from another side. If what you say is the case, and I have no reason for disbelief, then there must be a way to limit the number of dirty blocks (and total blocks) which may exist before buffers are flushed. E.g., there's X seconds of commit interval or Y dirty blocks, whichever comes first, and a max of Z buffered blocks in total per device or per system. This would be 'almost sync', I think, and it would solve one more problem with USB flash media. The problem is that overly large write buffers tend to be flushed at a sub-optimal speed, thus increasing the total time needed to copy and sync the data. Again, this occurs neither with Windows nor with OS X. And they don't mount 'sync'; they buffer writes (which is a good thing with any device with expensive and wear-inducing writes), it's just that their buffers are considerably smaller than those of Linux. I'd be happy to know that a solution for limiting buffer sizes exists; this at least would enable us to fine-tune the system so that in 90% of use cases the problem wouldn't appear, and that it would appear only in the cases where it's tough anyway.
@Yaroslav: There is already a knob for tuning the maximum amount of RAM that may be used for holding dirty blocks. From Documentation/sysctl/vm.txt: > dirty_ratio > > Contains, as a percentage of total system memory, the number of pages at > which > a process which is generating disk writes will itself start writing out dirty > data. The intent is as you describe: asynchronous writing until dirty_ratio is reached, and then synchronous writing only. "dirty_ratio" is 10% by default. You can test if it's working by starting a large write to disk (`dd if=/dev/zero of=/bigfile bs=1M`) and monitoring the "Dirty" counter in /proc/meminfo (`watch grep Dirty /proc/meminfo`). For what it's worth, it does work for me (and I haven't seen this bug manifest on my system in quite a while). I'm running Linux 2.6.36-gentoo-r5. I can still get the unresponsiveness and disk thrashing to happen using the tmpfs test case I described in comment #534, but that's not a failing of the kernel; that's a failing of the user (filling a tmpfs too much).
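The `watch grep Dirty /proc/meminfo` check above can also be scripted. What follows is only a minimal sketch in Python, assuming the usual `Name:   value kB` layout of /proc/meminfo. The helper names `parse_meminfo` and `dirty_percent` are made up for illustration, and dividing by MemTotal is a rough indicator only, since the kernel's own accounting for dirty_ratio may use a different base.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most values are in kB; a few fields (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_percent(fields):
    """Dirty memory as a percentage of MemTotal (rough indicator only)."""
    return 100.0 * fields["Dirty"] / fields["MemTotal"]
```

To use it, read the file with `open("/proc/meminfo").read()`, pass the text to `parse_meminfo`, and poll `dirty_percent` once a second while the large `dd` write is running.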
(In reply to comment #537) > an alternative explanation could be that you're using PIO mode for your data > transfers rather than DMA. However, as you identify, it's possible that the Excuse me, I am using you said? That would be like, specifically configuring the kernel to use PIO? Why would anyone do that? [ 1.101092] ata2.00: ATA-7: WDC WD3200KS-00PFB0, 21.00M21, max UDMA/133 [ 1.101205] ata2.00: 625142448 sectors, multi 0: LBA48 NCQ (depth 1), AA [ 1.102146] ata2.00: configured for UDMA/133 [ 2.191312] ata13.00: ATA-7: ST3160215A, 3.AAD, max UDMA/100 [ 2.191343] ata13.00: 312581808 sectors, multi 16: LBA48 [ 2.266143] ata13.00: configured for UDMA/100
(In reply to comment #540) > That would be like, specifically configuring > the kernel to use PIO? Why would anyone do that? The kernel can fall back to PIO mode if DMA mode is encountering problems (which can happen with faulty hardware). It happens with CD/DVD drives more often than with hard drives. The next time you encounter system sluggishness and the mouse cursor starts skipping, see if you can get a readout of /proc/meminfo (while the sluggishness is happening). If your "MemFree" is very low *and* your "Cached" or "Dirty" is very high, then you might be suffering from this bug.
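The /proc/meminfo check described above can be written down as a small predicate. This is a sketch only: the function name and both thresholds (5% free, 50% cached or dirty) are arbitrary assumptions chosen for illustration, not values from the kernel or from this thread, and the input is assumed to be a dict of meminfo fields in kB.

```python
def might_be_this_bug(meminfo_kb, free_below=0.05, high_above=0.50):
    """Return True if MemFree is very low AND Cached or Dirty is very high.

    meminfo_kb: dict with MemTotal, MemFree, Cached and Dirty in kB.
    free_below / high_above: fractions of MemTotal (illustrative guesses).
    """
    total = meminfo_kb["MemTotal"]
    free_is_low = meminfo_kb["MemFree"] < free_below * total
    cached_is_high = meminfo_kb["Cached"] > high_above * total
    dirty_is_high = meminfo_kb["Dirty"] > high_above * total
    return free_is_low and (cached_is_high or dirty_is_high)
```

A True result is only a hint to look closer; a box that has been up for a while will normally have little free memory and a large page cache without being sluggish.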
dirty_ratio is not really a good measure of when to start flushing to disk. On a 24GB system, even 1% may be too much for your disks to handle. It's better to configure dirty_bytes and dirty_background_bytes. dirty_bytes applies to the process which is doing the IO and dirty_background_bytes applies to the kernel flush threads. When these thresholds are hit, if the sum total of IO happening in the system is at a rate higher than your disks can take, you will start seeing the very first symptoms of this bug. The overall flow has been described well by Matt. I think this is precisely what's happening. One way to avoid the issue would be to set dirty_bytes and dirty_background_bytes in such a way that their sum total is within a reasonable ratio of your disk's sequential bandwidth. When a Linux system is in steady state with a reasonable uptime, it will likely use all RAM for read-side caches. It will free those up on demand when it comes under memory pressure (which may be created by large IO). By keeping (dirty_bytes + dirty_background_bytes) a multiple of your disk's raw speed, you can put a bound on the overall latency of the system. For example, I don't let dirty data go beyond 200MB on my laptop. It makes all my sequential operations bound by the sequential speed of the disk but lets small random IO be buffered (so it's better than the "sync" mode of the FS in that sense).
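The sizing rule above is simple arithmetic: the amount of dirty data allowed, divided by the disk's sequential bandwidth, is roughly how long a full flush takes. A minimal sketch in Python, where the function names, the 50 MB/s disk speed and the choice to set the background threshold to half of the hard limit are all assumptions made for illustration (the comment above speaks of the sum of both values; this sketch bounds dirty_bytes itself):

```python
MB = 1000 * 1000


def worst_case_flush_seconds(dirty_bytes, seq_bytes_per_s):
    """Time to write out `dirty_bytes` at the disk's sequential speed."""
    return dirty_bytes / seq_bytes_per_s


def dirty_limits(seq_bytes_per_s, max_flush_seconds, background_share=0.5):
    """Pick vm.dirty_bytes so that a full flush takes about max_flush_seconds.

    background_share is a hypothetical tuning choice, not a kernel default.
    """
    dirty = int(seq_bytes_per_s * max_flush_seconds)
    background = int(dirty * background_share)
    return {"vm.dirty_bytes": dirty, "vm.dirty_background_bytes": background}


# Example: an assumed 50 MB/s laptop disk and a 4 second latency budget.
limits = dirty_limits(50 * MB, 4)
for key, value in limits.items():
    print("%s = %d" % (key, value))
```

With those assumed numbers the hard limit comes out at 200 MB, the same figure as the laptop example in the comment.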
And can we find a solution that would apply in the case where the system is running out of free RAM and starts swapping out everything? I often experienced total unresponsiveness of both X and the consoles when a program tries to use more RAM than is available, and I wasn't even able to kill the process manually (forced reboot). Maybe that should be considered as a pathological case requiring just the OOM killer to be more aggressive - I don't know.
(In reply to comment #543) > And can we find a solution that would apply in the case where the system is > running out of free RAM and starts swapping out everything? I often > experienced > total unresponsiveness of both X and the consoles when a program tries to use > more RAM than is available, and I wasn't even able to kill the process > manually (forced reboot). Maybe that should be considered as a pathological > case requiring just the OOM killer to be more aggressive - I don't know. If you have the Magic SysRq key enabled in your kernel, you could do AltGr+SysRq+F to invoke the OOM killer manually. I do agree in principle, though, that the offending process should be denied the allocation of any additional memory before any frequently used memory-mapped pages start getting evicted from RAM. One possible solution might be to set a threshold for the minimum number of memory-mapped pages that the kernel must allow to remain in RAM. As an example, setting such a knob to 100000 would mean that the kernel would not evict any memory-mapped pages if fewer than 100000 memory-mapped pages were resident in RAM. Assuming that the kernel uses a least-recently-used eviction policy, this would prevent the debilitating thrashing scenario that occurs when essentially all memory-mapped pages have been and continue to be evicted.
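To make the proposed knob concrete, here is a toy simulation in Python. It is not kernel code and the knob does not exist; the class `FloorLRU`, its `min_mapped` parameter and the tiny page counts are all hypothetical. It models a plain LRU list in which memory-mapped pages become ineligible for eviction once the number resident would otherwise drop below the floor.

```python
from collections import OrderedDict


class FloorLRU:
    """Toy LRU page cache that keeps at least `min_mapped` mapped pages resident."""

    def __init__(self, capacity, min_mapped):
        self.capacity = capacity
        self.min_mapped = min_mapped
        self.pages = OrderedDict()  # page id -> is_mapped, least recently used first

    def mapped_count(self):
        return sum(1 for is_mapped in self.pages.values() if is_mapped)

    def touch(self, page, is_mapped):
        """Access a page, loading it (and evicting another) if necessary."""
        if page in self.pages:
            self.pages.move_to_end(page)  # now the most recently used
            return
        if len(self.pages) >= self.capacity:
            self._evict_one()
        self.pages[page] = is_mapped

    def _evict_one(self):
        # Mapped pages are protected while no more than the floor are resident.
        protect_mapped = self.mapped_count() <= self.min_mapped
        victim = None
        for page, is_mapped in self.pages.items():  # oldest first
            if is_mapped and protect_mapped:
                continue
            victim = page
            break
        if victim is None:
            raise MemoryError("nothing evictable; the allocation must be refused")
        del self.pages[victim]
```

Streaming a long run of file-cache pages through a `FloorLRU(capacity=4, min_mapped=2)` that already holds two mapped pages leaves those two pages resident, whereas with `min_mapped=0` they are pushed out like everything else. The `MemoryError` branch corresponds to the point made above about denying the offending process further memory.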
(In reply to comment #544) > Assuming that the kernel uses a least-recently-used eviction > policy, this would prevent the debilitating thrashing scenario that occurs > when > essentially all memory-mapped pages have been and continue to be evicted. Given the fact that Xorg all too often falls victim to that, and it is active most of the time, I cannot help but assume something is wrong with the kernel's definition of "least recently used." By the way, setting vm.overcommit_memory to 2 and overcommit_ratio to 80 seems to at least somewhat reduce the problem; the same rsync command which has triggered this bug (or similar bug if you prefer) now behaves a lot better, letting me type these words.
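For reference, with vm.overcommit_memory set to 2 the kernel refuses allocations beyond a commit limit derived from overcommit_ratio. The sketch below, in Python, reproduces the formula given in the kernel's overcommit accounting documentation; the function name and the example sizes (4 GB RAM, 2 GB swap) are assumptions for illustration and are not taken from the commenter's machine.

```python
def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio, hugetlb_kb=0):
    """Approximate CommitLimit under vm.overcommit_memory=2.

    CommitLimit = (RAM - huge pages) * overcommit_ratio / 100 + swap
    """
    return (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_total_kb


# Example with made-up sizes: 4 GB of RAM, 2 GB of swap, ratio 80.
print(commit_limit_kb(4000000, 2000000, 80), "kB may be committed in total")
```

The running value can be compared against the CommitLimit and Committed_AS lines in /proc/meminfo.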
I find that amount of slowness strongly depends on the writing driver. Today I had to evacuate Win7 machine onto Fedora14 and copying from NTFS to EXT3 was painful. Now I am returning the files back onto NTFS and there is no slowdown at all. Dig in the ext3 filesystem, it should be in the writing code.
This seems to be a hardware-related issue, at least in some cases. Can the other people experiencing it confirm whether they have a WD Green hard disk? A Google search for "wd15eads firmware" reveals quite a few people having similar problems. I have one of these hard disks and I was using it on a fanless VIA Samuel 2 (pre-686) CPU and I was seeing the high IOWait problem and associated poor performance. When I put the same hard disk in a dual AMD Opteron it had the same problem. Then I did a full backup and restore onto a different hard disk. It is the same Debian system on the same VIA CPU but now the high IOWait times are gone and the performance is adequate for the CPU. I should point out that the kernel should not suffer poor overall performance during disk I/O even on flaky hardware, especially with swap disabled. The offending hard disk is now blanked. I can run a few tests with it if somebody is interested.
Blaming hardware is the lamest practice in IT world and it surely earns those who practice that great deal of disrespect.
(In reply to comment #548) > Blaming hardware is the lamest practice in IT world and it surely earns those > who practice that great deal of disrespect. Vesselin Kostadinov doesn't blame hardware; he says this bug (or one of the bugs discussed here) is hardware-dependent. I can confirm this too: initially I used a Barracuda 7200.10 320GB ST3320620AS, then I tried replacing it with a Seagate Barracuda LP 2TB without success (nothing changed), then I replaced it with a Samsung HD103UJ 1TB and this helped a lot - the bug is still noticeable, but very, very rarely, and has much less impact on overall system performance. You can find more details about this in my comments on bug 13347.
Regarding the WD EADS disks: it has something to do with disk geometry. We had some problems with them as well. We have some 30 of them. But actually it's not a problem, it's more of an RTFM thing. I think there's something about it on the WD site, not sure. To partition this disk under Linux / Windows XP (Win 7 does it automagically) you have to use fdisk -H 224 -S 56 /dev/sd... You can read my comment at https://bugzilla.kernel.org/show_bug.cgi?id=12309#c513 Two of the disks are green WDs partitioned with the fdisk method. Until then I also had problems with speed, where the HDs only had a throughput of 2-5 MB/s. After the fdisk I had a throughput of up to 100 MB/s. But again, the problem with this bug is not throughput; it's that if you start a big file copy or something like dd if=/dev/zero of=test.img bs=1M count=5000, your desktop comes almost to a halt. But after some time I think that this isn't even a bug, it's more a new kernel queuing methodology. After entering this: vm.swappiness=1 vm.dirty_background_ratio=1 vm.dirty_ratio=1 into sysctl.conf I almost don't have this problem anymore. I read a lot about this problem and as far as I can understand, the new way the kernel works is that, depending on the above configuration, it puts something first into RAM and then writes it to disk (very simplified). So if you have a lot of RAM (in my case 12GB) and the above configuration is 40% by default, then the kernel is putting almost 5GB as cache into RAM. And then it writes it to disk, and yes, I have a very fast RAID system, but even at 400MB/s I have to wait 10 secs or more while it writes that to disk. I forgot with which kernel version this started, but I know that I checked it and that my problems with responsiveness started after changing to this new kernel (methodology). So you can say that this is not a bug but merely a kernel configuration matter. 
Because with this new methodology a default vm configuration doesn't work for everyone, especially for those with a lot of RAM. And yes, I would like the old methodology to be integrated again into the new kernels, but until then I'll try to circumvent this problem by understanding and configuring the kernel. The above sysctl configuration is working for me with the setup that I described in my comment #513 in this bug. There are slight hiccups but nothing as severe as earlier, when I couldn't do anything until the file writing finished.
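The arithmetic in this comment can be checked in a few lines. A sketch in Python using only the figures quoted by the commenter (12GB of RAM, a 40% ratio, a 400MB/s array, and the 1% setting from the sysctl.conf lines); the function names are invented for illustration, and decimal units are assumed.

```python
GB = 10 ** 9
MB = 10 ** 6


def dirty_limit_bytes(ram_bytes, dirty_ratio_percent):
    """How much dirty data a given ratio allows on a machine with ram_bytes of RAM."""
    return ram_bytes * dirty_ratio_percent // 100


def flush_seconds(nbytes, bytes_per_s):
    """Time needed to write nbytes at a sustained rate of bytes_per_s."""
    return nbytes / bytes_per_s


ram = 12 * GB
raid_speed = 400 * MB
for ratio in (40, 1):
    limit = dirty_limit_bytes(ram, ratio)
    print("ratio %2d%%: %.2f GB dirty, %.1f s to flush"
          % (ratio, limit / GB, flush_seconds(limit, raid_speed)))
```

At 40% the limit is 4.8 GB, which takes about 12 seconds to flush at 400 MB/s, in line with the "10 secs, and more" reported; at 1% it drops to 0.12 GB and roughly a third of a second.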
Sorry for interrupting your research with my naive question, but does this bug have clear steps to reproduce it? The initial comment says 'starting a new shell takes minutes' after the system is left with dd running for significant time. But for me shells/browsers etc take just maybe 1 or 2 seconds longer to start after I have 'stress -d 1' or 'dd if=/dev/zero of=bigfile bs=1M' running for ~10 minutes (bigfile is 30Gb after my tests, dirty blocks quickly reach ~670M (3.67G RAM total) and stay there. The small file test that I accidentally ran with TWO simultaneous bigfile dd processes in the background finished in 0.073s (or is this bad?): $ dd if=/dev/zero of=/tmp/bigfile bs=1M count=30000 conv=fdatasync & sleep 30 ; time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync [2] 27953 1+0 records in 1+0 records out 4096 bytes (4.1 kB) copied, 0.0718053 s, 57.0 kB/s real 0m0.073s user 0m0.001s sys 0m0.001s dd: writing `/tmp/bigfile': No space left on device dd: writing `/var/tmp/bigfile': No space left on device 22891+0 records in 22890+0 records out 24002064384 bytes (24 GB) copied, 1211.53 s, 19.8 MB/s 21957+0 records in 21956+0 records out 23022534656 bytes (23 GB) copied, 1189.07 s, 19.4 MB/s [1]- Exit 1 dd if=/dev/zero of=/var/tmp/bigfile bs=1M count=100000 conv=fdatasync [2]+ Exit 1 dd if=/dev/zero of=/tmp/bigfile bs=1M count=30000 conv=fdatasync I'm noticing loss of interactivity when my RAM gets filled up and swap grows >500M, but this bug is not about such case is it? Could it be my HW on latest stable vanilla 2.6.38.2 amd64 (swappiness 20, the rest being defaults)? Or could I have just configured my kernel in some genius way? 
[ 2.051391] ata1.00: ATA-8: HITACHI HTS545025B9A300, PB2ZC61H, max UDMA/100 [ 2.054162] ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA [ 2.065605] ata1.00: configured for UDMA/100 [ 2.087958] scsi 0:0:0:0: Direct-Access ATA HITACHI HTS54502 PB2Z PQ: 0 ANSI: 5 $ sudo hdparm -i /dev/sda /dev/sda: Model=HITACHI HTS545025B9A300, FwRev=PB2ZC61H, SerialNo=100408PBNXXXXXXXXXX Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4 BuffType=DualPortCache, BuffSize=7208kB, MaxMultSect=16, MultSect=off CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=488397168 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120} PIO modes: pio0 pio1 pio2 pio3 pio4 DMA modes: mdma0 mdma1 mdma2 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7 PS: I'm on ext3
controller in the previous comment was *-storage description: SATA controller product: Ibex Peak 6 port SATA AHCI Controller vendor: Intel Corporation physical id: 1f.2 bus info: pci@0000:00:1f.2 logical name: scsi0 version: 06 width: 32 bits clock: 66MHz capabilities: storage msi pm ahci_1.0 bus_master cap_list emulated configuration: driver=ahci latency=0 resources: irq:41 ioport:1860(size=8) ioport:1814(size=4) ioport:1818(size=8) ioport:1810(size=4) ioport:1840(size=32) memory:f2727000-f27277ff(In reply to comment #551)
OK, the fun continues. Installed the offending hard disk in another system, booted Fedora 14 live and the drive worked OK: [root@localhost ~]# dd if=/dev/zero of=/dev/sd_ bs=1M count=4000 conv=fdatasync 4000+0 records in 4000+0 records out 4194304000 bytes (4.2 GB) copied, 50.0265 s, 83.8 MB/s (Replaced /dev/sda with /dev/sd_ in case someone decides to copy/paste the command). Then I booted Knoppix 5.1.1 (from 2007) and saw the fault. CPU usage was 49.7%wa (dual cpu) and had to interrupt dd because it was taking way too long. Then I tried again with a smaller file: root@Knoppix:~# uname -a Linux Knoppix 2.6.19 #7 SMP PREEMPT Sun Dec 17 22:01:07 CET 2006 i686 GNU/Linux root@Knoppix:~# dd if=/dev/zero of=/dev/sd_ bs=1M count=40 conv=fdatasync 40+0 records in 40+0 records out 41943040 bytes (42 MB) copied, 20.8245 seconds, 2.0 MB/s Then I booted Fedora again and saw the fault again: [root@localhost ~]# uname -a Linux localhost.localdomain 2.6.35.6-45.fc14.i686 #1 SMP Mon Oct 18 23:56:17 UTC 2010 i686 i686 i386 GNU/Linux [root@localhost ~]# dd if=/dev/zero of=/dev/sd_ bs=1M count=40 conv=fdatasync 40+0 records in 40+0 records out 41943040 bytes (42 MB) copied, 20.3055 s, 2.1 MB/s @ #548 From Zenith88: Ignoring the possibility of a hardware fault when the evidence points that way surely brings those who practice that great deal of fruitless debugging and frustration. @ #550 From D.M. I don't think it is the "partition starts at the wrong sector" issue. In the dd commands listed above I was writing to the drive as a whole, without messing with partitions at all. For the sake of it I decided to create a new partition and see what will happen: [root@localhost ~]# fdisk -H 224 -S 56 /dev/sd_ Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel Building a new DOS disklabel with disk identifier 0x9b81ad16. Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable. 
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite) Command (m for help): n Command action e extended p primary partition (1-4) p Partition number (1-4, default 1): 1 First sector (2048-2930275054, default 2048): Using default value 2048 Last sector, +sectors or +size{K,M,G} (2048-2930275054, default 2930275054): +10G Command (m for help): p Disk /dev/sda: 1500.3 GB, 1500300828160 bytes 224 heads, 56 sectors/track, 233599 cylinders, total 2930275055 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x9b81ad16 Device Boot Start End Blocks Id System /dev/sda1 2048 20973567 10485760 83 Linux Command (m for help): w The partition table has been altered! Calling ioctl() to re-read partition table. Syncing disks. [root@localhost ~]# mkfs.ext2 -q /dev/sda_ [root@localhost ~]# mount /dev/sda1 /mnt [root@localhost ~]# dd if=/dev/zero of=/mnt/bigfile bs=1M count=100 conv=fdatasync 100+0 records in 100+0 records out 104857600 bytes (105 MB) copied, 77.3839 s, 1.4 MB/s I guess the performance drop can be attributed to the filesystem overhead. The issue you describe with writing a large bunch of dirty pages is a real one but is different to the high iowait times. I have seen high iowait times when the only active application I had was rtorrent running in seeding mode - so no disk writes but lots of disk reads from all over the place, with total system memory less than the size of the torrent. Basically when the performance of the drive drops from 80 MB/s to 2 MB/s the only thing the kernel does is waiting for I/O operations to complete. I am not sure if there is a solution for this problem at all. The disk is still available so I can run more tests if anyone is interested.
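When comparing dd runs like the ones above, it can help to recompute the rate from the byte count and elapsed time rather than trusting the rounded figure. A small sketch in Python; the regular expression is an assumption that matches the two summary formats shown in this comment ("... s" and "... seconds") and may not fit other dd versions or locales.

```python
import re

# Matches e.g. "41943040 bytes (42 MB) copied, 20.8245 seconds, 2.0 MB/s"
_DD_SUMMARY = re.compile(r"(\d+) bytes .*? copied, ([0-9.]+) s")


def dd_rate_mb_s(summary_line):
    """Return the throughput in MB/s (10**6 bytes) from a dd summary line."""
    match = _DD_SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a dd summary line: %r" % summary_line)
    nbytes = int(match.group(1))
    seconds = float(match.group(2))
    return nbytes / seconds / 1e6


good = dd_rate_mb_s("4194304000 bytes (4.2 GB) copied, 50.0265 s, 83.8 MB/s")
bad = dd_rate_mb_s("41943040 bytes (42 MB) copied, 20.8245 seconds, 2.0 MB/s")
print("good run: %.1f MB/s, bad run: %.1f MB/s, slowdown: %.0fx" % (good, bad, good / bad))
```

For the two runs quoted, the drive is roughly forty times slower in the faulty state.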
You can continue debating hard drives, or look into comparison of ntfs vs ext3 code on the cue from post #546 which is a reproducible test case. Your call.
Oleg: Unfortunately no clear steps. If you read my comment #513 you'll see that I didn't have any troubles with whole-disk software RAID10. After that I thought that it was something file-system related, but I tested ext2 through ext4 and xfs, and this also answers your question, Zenith. Same thing, no matter what. And regarding the hardware, it may be that this particular HD is broken, and in the case of Kostadinov I even think it is a broken-hardware problem, because on one system (Fedora) it worked and then after using Knoppix and getting back to Fedora it didn't. I'm just mentioning which troubles we had with the green WDs, and not only under Linux, until I read about this fdisk thing. Now I have two of them and they didn't give me any troubles when I had them in the whole-disk RAID10 or when I had an older kernel, or now with the new kernel setting. But to get back to the substance again. Yes, if you dd multiple times your RAM onto the HD, the system comes to a halt. With the old kernels it was "I'm doing dd and the system automagically knows that firefox or mail or whatever is of higher priority to me than dd, so it slows down dd a bit so firefox can get some time reading from the HD." Or maybe the queuing was fairer, so all processes got some time hammering the HD, I don't know, I'm not a kernel developer. I'm just a user, and as a user I'm mentioning the differences between the old and the new kernels. With the new kernel, whoever is writing has all the power over the HD. But again, that is more a perception than a fact. The difference from before I configured vm is that wa used to go up to 98, and now it goes up to 45-50 at most.
You can deny reality however much you see fit, it won't change the fact that writing onto ext3 partition causes freeze, while writing to ntfs does not on the same system. And this is not a VM but physical machine. Denial of reality and passing the blame is what's causing this project to sit on its hands for 3 years.
There are probably different bugs at stake here, and investigating one doesn't mean denying the other. Please be more respectful of people that try to improve our understanding of the problem instead of ranting. Just a guess: ntfs-3g driver is using FUSE, while ext3 driver is in kernelspace. *Maybe* this can explain the difference (ntfs-3g isn't considered as in-kernel as regards I/O scheduling).
I'm sorry if I offended you in any way. Again I'm not in denial, and I'm not blaming anyone I'm merely pointing out that it's not only a ext3 problem because I had the same problem on xfs, and that, as you pointed it out, the kernels from 3 years ago didn't have this kind of problem. And with vm I didn't mean Virtual Machine but virtual memory, because I was referring to the sysctl.conf (i.e. ... vm.dirty_background_ratio = 1 vm.dirty_background_bytes = 0 vm.dirty_ratio = 1 vm.dirty_bytes = 0 vm.dirty_writeback_centisecs = 500 vm.dirty_expire_centisecs = 3000 ... and so on.) Again, I'm sorry if I have offended you in any way.
Another attempt to narrow down the use case for the issue. You are not going to get anywhere if you continue reporting issues against all of the different breeds of Linuxes. You never know how Fedora or Knoppix patched the kernel, and you should report issues with their kernels to them instead of posting your observations here. As I see it the only way to trace down the issue is to use the same version of VANILLA kernel (preferably the latest) with different build and runtime configs. I personally have ext3 compiled in the kernel - could it be the reason why I can't reproduce the issue? Zenith88: would it take you a lot of effort to produce the latest not patched vanilla kernel with ext3 compiled into it and to see if it makes things any better for you?
It seems to be fixed in 3.2.
> It seems to be fixed in 3.2. Somewhere in parallel universe I think. Nothing changed for me on > Intel Corporation 5 Series/3400 Series Chipset SMBus Controller
(In reply to comment #561) > Nothing changed for me on > > > Intel Corporation 5 Series/3400 Series Chipset SMBus Controller Nor here on my ICH8-based notebook, with 2GiB RAM. If anything, 3.2 seems worse than 3.1 when it comes to the ability of one process to binge out on dirtying pages, and then bring the rest of the system down to a snail's pace. One consistent example case is unpacking to the local SATA drive an ISO image (using Nautilus, for example) stored on another drive. Compute-heavy processes with little disc access suffer (and even those without any I/O do --- CPU usage shoots right down). Another one is a kernel build. The file cache goes bananas, and even with no other desktop applications loaded, everything gets paged out and it takes around a minute (in the worst case) for the unlock screen prompt to appear.
(In reply to comment #561) > > It seems to be fixed in 3.2. > Somewhere in parallel universe I think. There are multiple issues that can lead to behaviour like the one discussed in this bug. A few patches that went into 3.2 make some situations better. But some problems were still known back then; see http://lwn.net/Articles/467328/ Fixes for those went into 3.3-rc1. Quoting from this week's LWN.net kernel page (I'm quite sure Jonathan won't mind): """ There have been some significant changes made to the memory compaction code to avoid the lengthy stalls experienced by some users when writing data to slow devices (USB keys, for example). This problem was described in this article (http://lwn.net/Articles/467328/), but the solution has evolved considerably. By making a number of changes to how compaction works, the memory management hackers (and Mel Gorman in particular) were able to avoid disabling synchronous compaction, which had the unfortunate effect of reducing huge page usage. See this commit ( http://git.kernel.org/linus/a77ebd333cd810d7b680d544be88c875131c2bd3 ) for a lot of information on how this problem was addressed. """ IOW: Best to test 3.3-rc and report bugs if there are still issues. While at it (and with a view from someone who is not very active in this bug tracker): I'd say opening a new bug and mentioning it here in this report might be the best way forward for any remaining issues, as the long history might be misleading/confusing when it comes to solving today's bugs. Just my 2 cents.
The problem is really fixed in 3.3rc4. I installed two guest systems on a first-generation SSD. The SSD was used only for the virtualisation guests. My system is on a >40000 IOPS SSD. The first installation was done with kernel 3.2.6, in which the long stalls of up to 10 seconds reappeared, as bad as in kernel 2.6.2[4-9]. The second installation was done with kernel 3.3rc4. I could even work in another running virtualisation guest. It's really great. Thanks to all people involved in solving this bug.
Could someone else prove it?
(In reply to comment #564) > Thanks to all people involved in solving this bug. Does anyone have a link to a discussion list post or a technical article detailing the theory behind the solution to this bug? Since this "bug" encompasses so many scenarios, I have doubts about whether all of them have indeed been resolved. I'm glad one person's problem went away, but until a kernel hacker can stand up and explain exactly what was wrong and how they fixed it, I'm going to assume there are still lurking problems in Linux's I/O subsystem. One problem we've seen and discussed in this thread is that large numbers of dirty blocks waiting to be flushed to disk can cause eviction of "hot" pages of code that are needed by interactive user processes, thus bringing the system to a state of thrashing in which processes continually trigger page faults because their actively executing code keeps being forced out of RAM by the large buffered write to disk. Even if this problem has been solved (presumably by fixing a bug in the code that is supposed to force a process to flush its own dirty pages to disk once dirty_ratio has been reached), there would still be the problem of the kernel's evicting hot pages from RAM so aggressively in low-memory conditions that interactivity of the system is compromised to the point where it's impossible for the user to resolve the memory shortage. It's pretty easy to reproduce the thrashing scenario: just mount a tmpfs whose max size is close to the amount of physical memory in the system and start writing data to it. Eventually you may find that you are no longer able to do anything, even to give input focus to your terminal emulator so you can interrupt your writing process (or in some setups, even to move your mouse cursor on the screen), because your entire desktop environment and even the X server have been evicted from RAM and are continually paging back in from disk (and being immediately evicted again), hindering your ability to do anything. 
I've encountered this scenario while compiling Chromium in a tmpfs. I'd expect the OOM killer to activate, but instead I find that all of my running applications are responding at a snail's pace because they have to keep paging in bits of their program code from disk. I should mention that I run without swap. I would think one way to solve the thrashing problem would be to introduce a kernel knob that would set how much time must elapse between a page being fetched from disk into RAM due to a page fault and that page becoming eligible for eviction from RAM. If set to, say, 30 seconds, then the user's interactive processes could retain a usable degree of interactivity, even under extremely low memory conditions. This would, of course, mean that the OOM killer would activate sooner than it does now, since pages that the kernel would presently choose to evict in order to free up RAM would be ineligible under this new time limit. Setting the knob to zero would yield the behavior we have now, in which the kernel is free to evict all unlocked pages. I'll reiterate once more, as a refresher, that this was formerly not such a problem on 32-bit x86 systems because most library code there contained relocations that would cause the pages containing the code for libraries to differ from disk, so they could not be evicted (assuming no swap). Now that we use position-independent code on x86_64, most executable pages in RAM are identical to the copies on disk, so they are eligible for evicting since the kernel can just page them back in from disk when they're needed. That convenience turns on us when we find that pages that are needed very frequently (like pages that handle moving the mouse cursor or blinking a cursor) are being evicted aggressively.
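The proposed time-based knob can be illustrated with a toy model. The Python below is a sketch only, not kernel code: the class `ResidencyLRU`, the parameter `min_residency_s` and the explicit `now` timestamps are hypothetical and exist purely to show the behaviour described above, including the earlier out-of-memory condition.

```python
from collections import OrderedDict


class ResidencyLRU:
    """Toy LRU cache where a page may not be evicted until it has been
    resident for at least `min_residency_s` seconds after its page fault."""

    def __init__(self, capacity, min_residency_s):
        self.capacity = capacity
        self.min_residency_s = min_residency_s
        self.pages = OrderedDict()  # page id -> fault-in time, least recently used first

    def access(self, page, now):
        """Access a page at time `now`; return True if it caused a page fault."""
        if page in self.pages:
            self.pages.move_to_end(page)  # refresh LRU position, keep fault-in time
            return False
        if len(self.pages) >= self.capacity:
            self._evict(now)
        self.pages[page] = now
        return True

    def _evict(self, now):
        victim = None
        for page, loaded_at in self.pages.items():  # oldest first
            if now - loaded_at >= self.min_residency_s:
                victim = page
                break
        if victim is None:
            # Every page is still inside its protected window: this is the
            # point at which the OOM killer would have to step in sooner.
            raise MemoryError("no page is old enough to evict")
        del self.pages[victim]
```

With `min_residency_s=0` the model degenerates to plain LRU, matching the remark that a zero setting would yield the current behaviour.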
Maybe this: http://lwn.net/Articles/467328/
(In reply to comment #567) > Maybe this: > http://lwn.net/Articles/467328/ Interesting. Thanks for the link. However, this article doesn't explain why we see thrashing and extremely degraded interactivity on systems that don't have HugeTLB support enabled in the kernel (such as mine). This reinforces the point that there are many scenarios that exhibit poor interactive responsiveness under heavy disk writing load. Regarding this debate about the transparent huge pages, I have to wonder why the kernel would bother trying to create a huge page in a location where there are dirty pages waiting to be written to disk. Shouldn't it just choose some other area in RAM that doesn't intersect any dirty buffers? This isn't really the place for a discussion of page compaction, though, so I'll discourage anyone from responding to my idle musing here.
Large IOW on writing/reading to/from any Hard drive disk still occurs. Wasting huge amounts of ticks on any disk IO while _waiting_ is nonsense.
I confirm this. The system becomes unresponsive when it begins swapping memory to disk.
Good day, everybody! I found optimal options against this bug and ones like it. Please, anybody, try the following options. I found that my server, with kernels 2.6.*, 3.2.* and 3.3.*, has periodic freezes of 4-15 secs. I found that this occurs at writeback time (flushing to disk), when the 30 sec expiry of dirty pages is reached. I tried many variants of the dirty_* options and found these to be optimal. I can suggest two variants. Both are shown here; the second variant is commented out. To use it, uncomment the second-variant line and change nothing else.
#######################################
# check the dirty status every 3 sec
# this is for smoothing the writeback; maybe 100 would be better
echo 300 > /proc/sys/vm/dirty_writeback_centisecs
# only 100 KB of dirty pages, then writeback...
# a very important option :)
echo 102400 > /proc/sys/vm/dirty_background_bytes
# second variant - uncomment it - but you will have freezes, though rarely
# echo 225280000 > /proc/sys/vm/dirty_background_bytes
# my freezes happen when dirty pages expire (default 30 sec)
# I increased it (it doesn't matter for the 1st variant - it will never happen)
echo 864000 > /proc/sys/vm/dirty_expire_centisecs
# I increased the limit for non-background writeback (it never happens, I think)
echo 10 > /proc/sys/vm/dirty_ratio
#######################################
I like the 1st variant - my system now works smoothly. I found that the freezes occur when dirty pages are written to disk. You can see it with this: watch -n1 grep -A 1 dirty /proc/vmstat The new kernel feature from the 3.2.* versions (writeback throttling) did not help me. I have now tested kernel 3.3.2-6 of FC16 and it has the same troubles. But these settings work for me! I don't have time for a detailed description, but if you test it and it helps, I will be ready to discuss it. Sorry for my English :) Bye! Perlover
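The echo commands above only last until the next reboot. As a hedged sketch, the Python below renders the first variant's values as sysctl.conf-style lines and converts the centisecond values to seconds. The dictionary contents come from the comment; the function names are invented for illustration, and the mapping from /proc/sys/vm/NAME to the key vm.NAME is the standard sysctl naming.

```python
# First variant from the comment above: name under /proc/sys/vm -> value.
FIRST_VARIANT = {
    "dirty_writeback_centisecs": 300,
    "dirty_background_bytes": 102400,
    "dirty_expire_centisecs": 864000,
    "dirty_ratio": 10,
}


def to_sysctl_conf(vm_settings):
    """Render {name: value} as 'vm.name = value' lines for sysctl.conf."""
    return "\n".join("vm.%s = %d" % (name, value) for name, value in vm_settings.items())


def centisecs_to_seconds(centisecs):
    """The *_centisecs knobs are expressed in hundredths of a second."""
    return centisecs / 100


print(to_sysctl_conf(FIRST_VARIANT))
print("flusher wakes every %.0f s" % centisecs_to_seconds(300))
print("dirty data expires after %.0f s" % centisecs_to_seconds(864000))
```

Worth noting when reading the numbers: 864000 centiseconds is 8640 seconds, or about 2.4 hours, so under this configuration age-based expiry effectively never triggers and writeback is driven by the 100 KB background threshold instead.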
After upgrading from 3.0.x to 3.2.0, this bug is completely eating my brain :( I have tried the solution from #571. Now it is not the whole system that hangs, but just some applications (browser, terminal, ooffice etc).

The disk is an SSD:
Read : 1145044992 bytes (1,1 GB), 1,56616 s, 731 MB/s
Write: 1145044992 bytes (1,1 GB), 14,30301 s, 80 MB/s

RAM:
MemTotal: 3969340 kB
MemFree: 112720 kB
Buffers: 721196 kB
Cached: 1246456 kB
SwapCached: 656 kB
Active: 918656 kB
Inactive: 1666252 kB
Active(anon): 507868 kB
Inactive(anon): 158192 kB
Active(file): 410788 kB
Inactive(file): 1508060 kB
SwapTotal: 6290428 kB
SwapFree: 6288604 kB

But it still freezes sometimes on simple actions such as Alt+Tab to another app, and that app then hangs for 3-6 seconds.
The in-kernel process scheduler is generally crap. Ok, make that majorly crap. Move away from it. Use BFS (search Con Kolivas) if you want sanity. Someone recently posted a simple test case where heavy kernel-space load starves the user-space processes to death. The person switched to BFS and all his troubles went away. Nobody replied to him on the list. I don't think even Ingo knows what's wrong with CFS. So don't get your hopes up of ever seeing this fixed. Here is the user space starvation thread I am talking about: https://lkml.org/lkml/2012/6/7/448
Wow!! That's a pretty bold statement to make. Given that the code is all open, why don't you instrument the kernel and pinpoint where exactly the crap is? Most of you who suffer from the stall problem may want to give Daniel Poelzleithner's ulatency a try.
@Ritesh: you are assuming I am capable of debugging the kernel. None of the users who have reported on this thread are. The only person capable of debugging this issue is Ingo. How many comments have you seen from him? Go ahead and count them! I will tell you the answer: Zilch! Process scheduling in the stock Linux kernel is a REAL problem. That nobody wants to debug it is a different story. That does not mean the problem goes away. After seeing the thread I linked above, I am convinced that some manifestation of CFS issues is at fault here.
Anton can you file yours as a separate bug - it's clear the main problem has been fixed and the scenario you described seems different.
> it's clear the main problem has been fixed

Alan: Can you describe how it is clear to you, when the general public keeps suffering and reporting the issue, or worse, just gives up? What is the code change that "fixed" the issue? Just because someone mentioned BFS in a message somewhere and someone is pointing out a potential problem with the in-kernel scheduler doesn't give you the right to close this bug randomly. That's arrogant behavior and does a disservice to all the reporters here.
Installed a BFQ + BFS patched 3.4 kernel ( http://pf.natalenko.name/ ) -- there are no hangs so far.
Hey Anton! Big Alan says this problem does not exist. How dare you claim otherwise...:) I am just kidding....I am moving to BFS myself. So...
No I said that Anton's case appears to be different and asked him to open a new bug for it, given the other cases seem fixed. If BFS fixes your case that's also interesting and wants putting in the bug too.
Comment #571 at least indicates that the problems remain, and are unrelated to the CPU scheduler. So describing all this as "fixed" seems a tad optimistic.

That being said, this bugzilla report clearly isn't getting the job done. I suggest that people who are still seeing writeback-related problems should report them via email. Suitable recipients are

linux-mm@kvack.org
linux-kernel@vger.kernel.org
Wu Fengguang <wfg@linux.intel.com>
Andrew Morton <akpm@linux-foundation.org>

And please, the thing to spend time on is to work out how to enable kernel developers to locally reproduce the problem. If we can do this, we'll fix it.
571 indicates someone has a possible problem of the same type. It's separate from all the other debug - hence I asked for it to be filed as a new bug, otherwise nothing useful is going to occur. (eg I can get 3 second freezes on alt-tab out of gnome 3 but it doesn't appear to be anything to do with the kernel)

Alan
Andrew: if it is not CPU scheduler, then how come JUST replacing the CPU scheduler fixes the issue? This does not make basic CS101 sense!
Hmm... Are you sure that JUST the CPU scheduler was replaced? In my case I replaced both the CPU and the disk scheduler, with BFS and BFQ.

Jun 10 12:05:54 nuuzerpogodible kernel: [ 1.611737] io scheduler bfq registered (default)
Jun 10 12:05:54 nuuzerpogodible kernel: [ 1.826589] BFS CPU scheduler v0.422 by Con Kolivas.
I like stability and typically use a very minimalistic approach. I changed only the CPU scheduler, and I haven't noticed any hangs or a stuck mouse so far. Maybe you can change one variable at a time as well and tell us which one (or both) helped. I will report back if I have any new findings. For now, I am happy that I can use my system without getting annoyed with it.
devsk: it makes basic systems 101 sense, however. All the bits interact. The fact that replacing just the CPU scheduler makes a difference is valuable info, though.
I propose an interesting experiment.

1. Install Opera from this location: http://snapshot.opera.com/unix/rc4_12.00-1456/
2. Switch on hardware acceleration: set opera:config#UserPrefs|EnableHardwareAcceleration to 1
3. Open the test http://ie.microsoft.com/testdrive/Performance/LoveIsInTheAir/ or http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/

Try switching between ttys, and also use your GUI.

I believe no single program should be able to affect the responsiveness of the system as a whole, should it?
(In reply to comment #587)
> I propose an interesting experiment.
>
> 1. Install Opera from this location:
> http://snapshot.opera.com/unix/rc4_12.00-1456/
> 2. Switch on hardware acceleration
> opera:config#UserPrefs|EnableHardwareAcceleration
> set to 1
> 3. Open the test
> http://ie.microsoft.com/testdrive/Performance/LoveIsInTheAir/
> or http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/
>
> Try switching between tty, also use your GUI.
>
> I consider that no program should not affect the responsiveness of the system
> as a whole, is not it?

My case is very similar. I play a movie with VLC on one screen and have some transparent Konsole windows open, with KWin desktop effects enabled. Whenever the graphics card is at 100% (which happens very often with a low-end card like mine), the whole system slows down (on the tty too).
(In reply to comment #578)
> Installed BFQ + BFS patched 3.4 kernel ( http://pf.natalenko.name/ ) -- there
> no hangs for now.

BFS + BFQ really helps… but only until you run a couple of VMware virtual machines. :( With BFS and BFQ this results in incredible freezes, both in the host OS and in the guest OSes, especially when some of the OSes do intensive I/O such as installing updates. Without BFS and BFQ freezes still happen, but they are much less noticeable! Probably only one of BFS and BFQ is responsible for this bad behavior, but I haven't tested them separately.
Continuing from post #571.

Sorry, my English is not as good as I would like :)

I now run Fedora Core with the 3.3.2-6.fc16.x86_64 kernel. My server has 48 GB of memory and a hardware RAID1 array. I now use these settings, which work well for me:

echo 1000 > /proc/sys/vm/dirty_writeback_centisecs
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 9000000 > /proc/sys/vm/dirty_expire_centisecs
echo 30 > /proc/sys/vm/dirty_ratio

Before these settings, as I wrote in post #571, I regularly had freezes of 10-20 seconds every 2-5 minutes. I found that the cause is the writeback phase of dirty pages. The writeback phase can be watched with the "watch -n1 grep -A 1 dirty /proc/vmstat" command: the nr_writeback value is the number of dirty pages being written to disk right now. Writeback can be started, for example, by the 'sync' command or when dirty pages in memory expire (30 seconds with the common settings). If many dirty pages have accumulated by the next writeback (even just 2000-3000), my server freezes during this stage.

With the settings above, I run 'sync' from crontab once a day (when the load is at its minimum). During this phase the load average on my server rises from 1-2 up to 80-90, and this lasts about 1-2 minutes. My system is frozen for those 1-2 minutes! The rest of the time (24 hours * 60 minutes - 3 minutes) I now have a load average of 1-2 and no I/O freezes. Before these settings I had a load average of 8-9. I know that if the server loses power, the data on disk will be stale (up to 24 hours old).

I think the system stops I/O until all the dirty pages marked for writing have actually been written to disk. I think a normal system should not block all I/O and should spread the writing of dirty pages out over time.

I also noticed that I don't have this problem on my second server, which has the same OS, the same kernel version and the same amount of RAM. It uses software RAID1 (/dev/md*). During writeback this server works smoothly. I think software RAID has a different buffering mechanism for writing to disk. So maybe one of you could test these problems with software RAID?

I think these articles are useful and related:
http://lwn.net/Articles/405076/
https://lwn.net/Articles/456904/

As I understand it, this feature was partly implemented in kernel 3.3, but I did not see any improvement with the new kernel. As I understand it, it is still under development.

Sorry for my English. Bye! :)
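The *_centisecs values in the settings above are easy to misread, so here is a small Python sketch that converts them to seconds and hours. The helper name is invented for illustration; the only assumption is that a centisecond is 1/100 of a second, as the sysctl names suggest.

```python
def centisecs_to_seconds(centisecs):
    """Convert a *_centisecs sysctl value to seconds (1 cs = 0.01 s)."""
    return centisecs / 100.0


# Values taken from the settings quoted in the comment above.
settings = {
    "dirty_writeback_centisecs": 1000,
    "dirty_expire_centisecs": 9000000,
}

for name, value in settings.items():
    seconds = centisecs_to_seconds(value)
    print(f"{name}: {value} cs = {seconds:.0f} s = {seconds / 3600:.2f} h")
```

By this arithmetic the flusher wakes up every 10 seconds, and dirty pages expire only after 90000 seconds (25 hours). If that reading is right, the daily 'sync' from crontab always runs before the expiry does, which would match the "up to 24 hours old" remark.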
And for some time now (maybe 1-2 years) I have not seen the high iowait values described at the top of this thread. But the problem with freezing during large I/O operations remains. So maybe the iowait problem no longer exists, but all other I/O still gets blocked during high-volume writes.
About I/O schedulers: I have read in many places that devices with NCQ support do not need a scheduler. Is that so?

$ dmesg | grep NCQ
[2.145261] ata1.00: 175836528 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[3.109745] ata5.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)

It seems all my devices support NCQ. I manually set the noop scheduler, and the system became noticeably more responsive. I hope this is not a placebo. If it is true, why doesn't the kernel automatically switch off the scheduler for devices that support NCQ?
(In reply to comment #592)
> [...] devices which have NCQ support not needing in schedulers [...]
>
> Seems all my devices support NCQ. I manually set noop sheduler, and system was
> apparently much responsible. [...]

While it is true that the hard drive will reorder I/O requests within its native command queue to optimize armature movements, the on-device queue is really very shallow (only 32 requests maximum on your hardware). By circumventing the kernel's I/O scheduler (by selecting "noop"), you are losing the benefits of merging adjacent I/O requests and of distributing I/O throughput fairly across multiple processes.
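Before comparing noop with anything else, it helps to confirm which scheduler a disk is actually using. A minimal Python sketch follows, assuming the usual sysfs layout where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets (for example "noop deadline [cfq]"). The function names and the "sda" device name are illustrative only.

```python
import re


def active_scheduler(scheduler_text):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", scheduler_text)
    return match.group(1) if match else None


def available_schedulers(scheduler_text):
    """Return every scheduler name listed, with the brackets stripped."""
    return [name.strip("[]") for name in scheduler_text.split()]


def read_active_scheduler(device, sysfs_root="/sys/block"):
    """Read the active scheduler for a block device such as 'sda' (Linux only)."""
    with open(f"{sysfs_root}/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())
```

For example, read_active_scheduler("sda") could be called before and after switching schedulers, so that any responsiveness comparison is known to test the intended one.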
I think in my case the system hangs for two reasons.

I don't know what Opera does there, but it looks like the video card's output is also limited: when an application tries to send it too much data, the GUI is refreshed less often and appears to hang, even though htop shows no CPU utilization and no waiting for I/O.

The second case is more traditional: a hang on memory access. The machine has 2 GB of memory, and htop shows that even 2 GB of swap is allocated. The freezing then occurs while waiting for the hard disk. That in itself is OK, but it is bad that it affects all applications, even those for which the available memory would be enough. The worst thing is that it affects the responsiveness of the whole system and the GUI.

I want to help fix these problems; tell me what I can do.
And I have been wondering why the noop scheduler might be better... Here is a stupid idea: what if the queue scheduler gets swapped out? Is this theoretically possible? If so, it would explain why noop is better.
I wonder whether anyone has tried oprofile while forcing a machine into #12309? It may be stalling somewhere, waiting for locks or for some hardware action. Alas, I have no hardware at hand to reproduce #12309 on.
Created attachment 78231: htop screenshot
Please look at my htop screenshot: https://bugzilla.kernel.org/attachment.cgi?id=78231

I am just copying a file from HDD to HDD. Is such high I/O wait on the CPU normal? I think the bug is not fixed. What other information should I provide?

$ uname -a
Linux u3s3 3.5.2-1.fc17.i686.PAE #1 SMP Wed Aug 15 16:30:14 UTC 2012 i686 i686 i386 GNU/Linux
Looks fairly normal to me - I'd expect a lot of waiting for I/O during a big copy because rotating disks are incredibly slow relative to processor performance. The CPU is also generally having to work harder on a 32bit machine with > 1GB of RAM doing MMU management due to the lack of address space. The scheduler btw is kernel side so doesn't get paged/swapped out.
Please correct me if I'm wrong, but I do believe that "I/O Wait" time is the amount of time that processes are blocked on disk I/O operations. What I don't understand is why I/O Wait appears to consume CPU time. Is the kernel spinning in a busy wait loop while an I/O operation is pending on a disk? If so, why? The kernel should be allowing some other task to use the CPU during the I/O wait.
I/O wait isn't consuming CPU time but the process of reading/writing disks does consume CPU time because the process is doing work in the kernel managing the I/O and the things that go with it.
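To put a number on this, the I/O wait figure that tools like htop display can be derived from two samples of the aggregate "cpu" line in /proc/stat. The Python sketch below assumes the documented field order (user, nice, system, idle, iowait, irq, softirq, steal, then guest fields) and treats iowait as a category of idle time, which is my understanding of how the kernel accounts it. The function names are made up.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_share(before_line, after_line):
    """Fraction of elapsed CPU ticks accounted as iowait between two samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    deltas = [a - b for a, b in zip(after, before)]
    # Sum only the first eight fields: guest time is assumed to be
    # already included in user/nice, so adding it would double count.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is the iowait column
```

A high share here means the CPU had nothing runnable while I/O was outstanding. It does not by itself show that the CPU was busy, which is consistent with the explanation above.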
(In reply to comment #601)
> I/O wait isn't consuming CPU time but the process of reading/writing disks does
> consume CPU time because the process is doing work in the kernel managing the
> I/O and the things that go with it.

Alan, if I understand you correctly, why doesn't the kernel switch to another process while the current process is waiting for I/O? For example, why does the GUI (meaning GNOME Shell) lag while another application is swapping or writing a lot to disk?
Alan, isn't what you just described called PIO? Isn't DMA the solution that resolved high CPU load on storage I/O? Isn't high CPU load on VM I/O (IOWAIT) very similar to the PIO mode of storage operation? Just to repeat a question I already asked: is a polling technique suitable for VM I/O, as it was for network I/O some years ago?

> I/O wait isn't consuming CPU time but the process of reading/writing disks does
> consume CPU time because the process is doing work in the kernel managing the
> I/O and the things that go with it.
The data transfers are done by DMA where possible, but you still have to do all the housekeeping, controller management, I/O queue handling and the like. On a 32bit box there can also be a lot of memory management work involved. Old (pre AHCI) controllers need PIO for some parts of a transfer. That is a hardware limit. And the kernel does switch to other processes and back and forth between them when one is waiting for I/O. The gnome shell is a very large program so on any system without vast quantities of memory the shell tends to be waiting for stuff to come from disk when there is any memory pressure. Last time I looked the compositor was single threaded with all of that so Gnome 3 stalled horribly under paging. That I'm afraid is mostly a problem in Gnome 3. Rotating disks are in relative terms very very slow. They've not materially improved in the past ten years yet memory sizes have grown vastly, processor speeds have grown likewise. They are also very bad at trying to do two things at once so writing a large file to disk tends to really slow down reading.
(In reply to comment #604)
> And the kernel does switch to other processes and back and forth between them
> when one is waiting for I/O. The gnome shell is a very large program so on any
> system without vast quantities of memory the shell tends to be waiting for
> stuff to come from disk when there is any memory pressure. Last time I looked
> the compositor was single threaded with all of that so Gnome 3 stalled horribly
> under paging. That I'm afraid is mostly a problem in Gnome 3.

OK, but why is mouse movement choppy as well? And why is switching to a virtual terminal slow? How can I make sure that the stalls do not occur in the kernel? And how can I find out where they occur? I really want to help find and fix them. Apologies for the many questions.
Because the gnome compositor is going to end up stalling waiting to get data back. Ditto switching to/from X will be pulling in lots off disk if your machine has been paging stuff out. To actually get detailed data you need to start profiling the system and generating detailed information to analyse - that's way beyond a bugzilla discussion (but the linux-mm list might be a starting point if you want to get involved in understanding what is a very complicated area - because so many things interact). Ultimately though I suspect that unless someone does something drastic about its memory footprint the "fix" is not to run huge bloated inefficient desktops on a box with 1GB of RAM.
Alan: What would be your example of a huge bloated inefficient desktop? I guess KDE/GNOME. And the efficient one might be icewm/fvwm etc. Not common, unfortunately.

The I/O wait problem is still valid. It is just that you need different patterns to hit it. A lot has improved with the latest writeback work, but still, when hit, this is a terrible problem.

If you want to reproduce it, take your laptop/desktop with 4 GiB of memory and a regular SATA disk. Pump (buffered) I/O into it with dd. Write zeroes with a block size of 1 MiB. Since it is buffered, you'll start off well, until you have consumed all your 4 GiB of memory. After that is when you will start seeing the problem. At that moment (i.e. after you have consumed all of your RAM), every write() will contend for page availability. And given that you also have a slow rotating disk (you can also include remote storage, both block and file), try to execute a task alongside the I/O. A simple sync command is good to start with. The CPU stays blocked until the pages have been scanned for the best fits and the buffers have been synced. You can run dstat and observe the CPU wait time there.

(In our tests) Linux is good at pumping I/O. This doesn't always fit the regular OS model, where the user could also be doing other random stuff while I/O is in progress. They expect the machine to be responsive. MS Windows, while not the best, is still better than the Linux desktop in this use case.

Over the years, my workaround has been to have only 1 process doing I/O. Never let 2 or more processes do I/O at the same time. Like don't do 2 cp. Don't do 2 copy operations in your GUI file browser. If you follow this policy, you have a higher chance of avoiding this ugly bug.
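To turn "execute a task alongside the I/O" into numbers, one option is a small probe that repeatedly times a tiny synced write while the dd load runs in another terminal. This Python sketch is only one possible measurement, not an established test case; the function name, the scratch file name and the default parameters are all invented for illustration.

```python
import os
import time


def probe_fsync_latency(directory, samples=10, interval=1.0, payload=b"x" * 4096):
    """Time small write+fsync operations in `directory`; return seconds per sample."""
    path = os.path.join(directory, "latency_probe.tmp")  # hypothetical scratch file
    latencies = []
    try:
        for i in range(samples):
            start = time.monotonic()
            with open(path, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force the data out to the device
            latencies.append(time.monotonic() - start)
            if i + 1 < samples:
                time.sleep(interval)
    finally:
        # Clean up the scratch file even if a write failed.
        if os.path.exists(path):
            os.remove(path)
    return latencies
```

Running it once on an idle system and once during the buffered dd write, on the same filesystem, would give a before/after comparison that could be posted to the list along with the kernel version.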
Ritesh: if you have some test cases then discuss them on the linux-mm list.
(In reply to comment #608)
> Ritesh: if you have some test cases then discuss them on the linux-mm list.

Alan, I see that in the previous comments you have given the same explanation in the right technical terms. :-)

I would just add one more comment. All these symptoms were also tested and seen on my lab machine, which has:

- 2 core CPU
- 8 GiB RAM (we have also tested with 48 GiB RAM)
- All tests were done with a SAN array (over software iSCSI).

The slow rotating media can be mapped to the slow network in this case. The stalls were visible on these machines too, once buffered I/O had consumed all of the system RAM.

I then spent some time tweaking values in /proc/sys/vm but did not see great improvements. I will surely post my results on -mm after the next run I do (could be in some weeks). Thank you.
Is it possible that one process can consume all (dirty?) pages and stall other processes, even if these are running from or accessing other disks? My system is on two SSDs, one for the system and one for the data. I can stall the whole system while running a VM on a third, slow external USB 2.0 disk.
Thomas - the kernel tries very hard to avoid that sort of thing happening and to throttle a process generating too much I/O. Older kernels were certainly very bad at that and an rsync to a USB disk was horrible. It ought to be much better with the most recent kernels although still not great.
> Over the years, my workaround have been to have only 1 process doing I/O. Never
> let 2 or more processes do I/O at the same time. Like don't do 2 cp. Don't do 2
> copy operations in your gui file browser.

I heard a while ago that Linux is a multitasking OS, so I figure they lied to me?
By moving to BFS, it has been proven (empirically) that this IS a CPU scheduler issue and not a slow rotational media problem. The kernel can do other work when the rotational media is not giving it what it wants. And it should not let buffers and caches fill up so much (again a scheduling issue) that even the kernel has no free pages left to run its own components from. All the kernel is doing is spinning, looking for free pages all the time (kswapd hogging the CPU searching through millions of pages on modern systems). Why it does not evict caches sooner by default is not clear. You need to set a bunch of proc parameters for it to start doing that. And it still eventually keels over. There was a bug reported by someone (and I linked it above) where just pumping network traffic through Linux kernel brought it to it