Bug 12309
Description
Ben Gamari
2008-12-27 06:56:11 UTC
For the record, this is even reproducible with Linus's master.

I'm also having this problem. Latest working kernel version: 2.6.18.8 with config: http://svn.pardus.org.tr/pardus/2007/kernel/kernel/files/pardus-kernel-config.patch Currently working on 2.6.25.20 with config: http://svn.pardus.org.tr/pardus/2008/kernel/kernel/files/pardus-kernel-config.patch Tested also with 2.6.28 and felt no significant performance improvement.

During heavy disk I/O, running 'svn up' hogs the system, preventing me from starting a new shell, browsing the internet, doing some text editing in vim, etc. For example, after managing to open a text buffer in vim, 4-5 second delays happen between consecutive search attempts.

Hello Ben, I don't know where to post it exactly. Why Linux Memory Management? Or why -mm and not mainline? Can you do it for me please?

I have added a second test case, which uses threads with pthread_mutex and pthread_cond instead of processes with pipes for communicating, to check whether it is a CPU scheduler issue. I have repeated the tests with some vanilla kernels again, as there is a remark in the bug report about tainted or distro kernels. As I got a segmentation fault with the 2.6.28 kernel, I added the result of the Ubuntu 9.04 kernel (see attachment). The results are not comparable to the results posted before, as I have changed the time handling (doubles instead of int32_t, as some echo messages take more than one second).

The first three results are 2*100, 2*50 and 2*20 processes exchanging 100k, 200k and 1M messages over a pipe. The last three results are 2*100, 2*50 and 2*20 threads exchanging 100k, 200k and 1M messages with pthread_mutex and pthread_cond. I have added a 10 second pause at the beginning of every thread/process to ensure the 2*100 processes or threads are all created and start to exchange messages at nearly the same time. This was not the case in the old test case with 2*100 processes, as the first thread was already destroyed before the last was created.

With the second test case using threads, I got the problems (threads:2*100/msg:1M) immediately with kernel 2.6.22.19. Kernel 2.6.20.21 was fine with both test cases.

The meaning of the results:
- min message time
- average message time (80% of the messages)
- message time at median
- maximal message time
- test duration

Here are the results.
Linux balrog704 2.6.20.21 #1 SMP Wed Jan 14 10:11:34 CET 2009 x86_64 GNU/Linux
min:0.000ms|avg:0.241-0.249ms|mid:0.244ms|max:18.367ms|duration:25.304s
min:0.002ms|avg:0.088-0.094ms|mid:0.093ms|max:17.845ms|duration:19.694s
min:0.002ms|avg:0.030-0.038ms|mid:0.038ms|max:564.062ms|duration:38.370s
min:0.002ms|avg:0.004-0.007ms|mid:0.004ms|max:1212.746ms|duration:33.137s
min:0.002ms|avg:0.004-0.005ms|mid:0.004ms|max:1092.045ms|duration:31.686s
min:0.002ms|avg:0.004-0.007ms|mid:0.004ms|max:4532.159ms|duration:59.773s

Linux balrog704 2.6.22.19 #1 SMP Wed Jan 14 10:16:43 CET 2009 x86_64 GNU/Linux
min:0.003ms|avg:0.394-0.413ms|mid:0.403ms|max:19.673ms|duration:42.422s
min:0.003ms|avg:0.083-0.188ms|mid:0.182ms|max:13.405ms|duration:37.038s
min:0.003ms|avg:0.056-0.075ms|mid:0.070ms|max:656.112ms|duration:72.943s
min:0.003ms|avg:0.005-0.010ms|mid:0.007ms|max:1756.113ms|duration:49.163s
min:0.003ms|avg:0.005-0.010ms|mid:0.007ms|max:11560.976ms|duration:52.836s
min:0.003ms|avg:0.008-0.010ms|mid:0.010ms|max:5316.424ms|duration:111.323s

Linux balrog704 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64 GNU/Linux
min:0.003ms|avg:0.223-0.450ms|mid:0.428ms|max:8.494ms|duration:46.123s
min:0.003ms|avg:0.140-0.209ms|mid:0.200ms|max:12.514ms|duration:39.100s
min:0.003ms|avg:0.068-0.084ms|mid:0.076ms|max:38.778ms|duration:78.157s
min:0.003ms|avg:0.454-0.784ms|mid:0.625ms|max:11.063ms|duration:65.619s
min:0.004ms|avg:0.244-0.399ms|mid:0.319ms|max:21.018ms|duration:64.741s
min:0.003ms|avg:0.061-0.138ms|mid:0.111ms|max:23.861ms|duration:126.309s

Created attachment 19795 [details]
test case with processes and pipes
Created attachment 19796 [details]
test case with threads and mutexes
Created attachment 19797 [details]
All test results on Core2 T7700 @ 2.40GHz / 4GB RAM
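Since the attachments may not survive forever, here is a minimal sketch of the shape of the pipe-based test, for readers following along. This is an assumption about what the attached processtest does, reduced to a single echo pair; the real test runs N concurrent pairs and also reports avg/mid/duration. The pipe names a2b/b2a match the snippet quoted later in this thread.

    /* sketch: one parent/child pair ping-ponging a byte over two pipes,
     * timing each round trip with gettimeofday() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/time.h>   /* timersub() is a BSD/glibc macro */
    #include <sys/wait.h>

    int main(int argc, char **argv)
    {
        int a2b[2], b2a[2];
        long i, messages = (argc > 1) ? atol(argv[1]) : 100000;
        char c = 'x';
        double min = 1e9, max = 0;

        if (pipe(a2b) || pipe(b2a)) { perror("pipe"); return 1; }
        if (fork() == 0) {              /* child: echo every byte back */
            for (i = 0; i < messages; i++) {
                if (read(a2b[0], &c, 1) != 1) break;
                if (write(b2a[1], &c, 1) != 1) break;
            }
            _exit(0);
        }
        for (i = 0; i < messages; i++) {  /* parent: time each round trip */
            struct timeval tv_s, tv_e, tv_r;
            gettimeofday(&tv_s, NULL);
            write(a2b[1], &c, 1);
            read(b2a[0], &c, 1);
            gettimeofday(&tv_e, NULL);
            timersub(&tv_e, &tv_s, &tv_r);
            double ms = tv_r.tv_sec * 1000.0 + tv_r.tv_usec / 1000.0;
            if (ms < min) min = ms;
            if (ms > max) max = ms;
        }
        wait(NULL);
        printf("min:%.3fms|max:%.3fms\n", min, max);
        return 0;
    }

The thread variant described above replaces the pipes with a shared variable protected by pthread_mutex/pthread_cond, but the measurement idea is the same.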
I guess the high I/O wait time and the poor responsiveness are the same problem, caused by the CPU scheduler, as I can produce the same symptoms without disk I/O. Since 2.6.26/27 everyone should be affected by this issue. What I did not understand is: why does the test with threads and mutexes take twice as long as the test with processes and pipes, yet stress the system much more? The mouse freezes nearly immediately, while the test with processes and pipes still allows me to move windows.

I've met the high I/O wait problem with 3ware cards on CentOS 5.x. This is related to pci_try_set_mwi. More information here: https://bugzilla.redhat.com/show_bug.cgi?id=444759 Now Thomas seems to have found another source for the problem. Maybe MWI is adding on top of that (not every controller driver sets MWI - the BIOS is supposed to do so, but I've met a couple of boards that do not). HTH.

If I run "google desktop indexer", then I get the long waits. E.g. vim goes away for up to 5-30 seconds, repeatably! So, I don't run "google desktop indexer". No problem since 12/15/08!

You can also add this task: copy a file from a CompactFlash card through a USB adaptor or PCMCIA card. The computer is not usable until the copy of the file (3 to 5 MB) is finished. It doesn't matter whether it copies the whole card or only one file. It seems to be similar to the description of the bug here.

I have found that this may be an issue with the Completely Fair Queuing (CFQ) I/O scheduler that became the default in 2.6.18 (when most started observing this performance issue). Reverting back to the old AS scheduler seems to have resolved the problem for me. To use the AS scheduler and test for yourself, just specify "elevator=as" as a boot option.

(In reply to comment #2)
> I'm also having this problem.
>
> Latest working kernel version: 2.6.18.8 with config:
> http://svn.pardus.org.tr/pardus/2007/kernel/kernel/files/pardus-kernel-config.patch
>
> Currently working on 2.6.25.20 with config:
> http://svn.pardus.org.tr/pardus/2008/kernel/kernel/files/pardus-kernel-config.patch
>
> Tested also with 2.6.28 and felt no significant performance improvement.
>
> During heavy disk I/O, running 'svn up' hogs the system, preventing me from
> starting a new shell, browsing the internet, doing some text editing in vim,
> etc.
>
> For example, after managing to open a text buffer in vim, 4-5 second delays
> happen between consecutive search attempts.

You seem to be able to reproduce the bug easily, and have found a non-affected kernel version. Can you git bisect between those kernels to at least isolate the culprit commit?

(In reply to comment #3)
> With the second test case using threads, I got the problems
> (threads:2*100/msg:1M) immediately with kernel 2.6.22.19. Kernel 2.6.20.21
> was fine with both test cases.

I'm not sure that's the same issue I had when I posted bug 7372, but since you seem to be a programmer, you should git bisect between those kernels to isolate the culprit commit.

I'm not sure if this is related or not, but I'm getting similar behaviour on my own system, but *only* when copying files *from* my USB memory stick (a 4 GB Corsair Flash Voyager) *to* the internal SSD on my Asus Eee PC 900 running Ubuntu 8.10 with a custom build of Linux 2.6.27 (probably slightly patched) provided by array.org. I.e. reading a file from the USB stick to /dev/null, no slowdown. Writing /dev/zero to the USB stick, no slowdown. Reading a file from the internal SSD to /dev/null, no slowdown. Writing /dev/zero to the internal SSD, no slowdown.
Copying a file from internal SSD to USB stick, no slowdown. Copying a file from USB stick to internal SSD, I get massive slowdowns in interactive performance. Launching a terminal, which usually takes a few seconds, suddenly takes the better part of a minute. The kernel used is 2.6.27-8-eeepc on i686 SMP, as prebuilt by http://www.array.org/ubuntu/ The filesystem on the internal SSD is ext3, running on LVM, running on LUKS (encrypted filesystem), as set up by the Ubuntu 8.10 installer. Swap is also on the same encrypted LVM. The filesystem on the USB stick is vfat. Nothing fancy at all. I should also add that the read performance of my USB stick is faster (about 25 MB/s) than the write performance of the built-in SSD (about 10 MB/s). If you feel that it is useful, I can provide dumps of lspci/lsusb/lsmod or any other information. As for the exact build options and patches, that should be determinable by checking the web site specified above. Hope more data makes it possible to determine a pattern to this bug.

I tried the solution Mike suggested in comment http://bugzilla.kernel.org/show_bug.cgi?id=12309#c11 and indeed it solved my issue. So it seems that he is right, at least for my problem.

I tried elevator=as on my system, and it did not change the behaviour. Copying files from external USB to internal encrypted SSD still totally smashes interactive performance. So this issue might be unrelated.

(In reply to comment #16)
> I tried elevator=as on my system, and it did not change the behaviour.
> Copying files from external USB to internal encrypted SSD still totally
> smashes interactive performance. So this issue might be unrelated.

This may be an unrelated issue having to do with USB I/O, since USB seems to be more CPU intensive anyway. When I experienced this bug (prior to switching from CFQ), it would happen whenever I copied a large file on ATA or SCSI devices, and I noticed extremely high I/O wait times with very low CPU usage. Not only during copying, but during any disk-intensive operation. Everything on my affected machines would come to a grinding halt until the operation was complete. Using AS has so far seemed to resolve the issue for me, as my machines are now as responsive as they should be during heavy disk I/O.

I have had a very similar problem to this. I still have it often, but not as much since I changed from ext3 to ReiserFS. For the scheduler, I've been using BFQ or V(R), which are included in the Zen patchset. I have tried the stock kernel, and the same problem exists; however, I can't remember which scheduler I used at that point, I believe deadline. Most of the I/O wait I get comes when either I'm copying files to the local drives or using multiple VMs (generally Windows, as that's what is needed for work). I'm willing to try about anything to get this fixed. It's a little better since I switched filesystems on my VM drive, but still isn't totally fixed.

(In reply to comment #11)
> I have found that this may be an issue with the Completely Fair Queuing
> (CFQ) I/O scheduler that became the default in 2.6.18 (when most started
> observing this performance issue). Reverting back to the old AS scheduler
> seems to have resolved the problem for me.
>
> To use the AS scheduler and test for yourself, just specify "elevator=as"
> as a boot option.

Fwiw, I've never used the CFQ scheduler. I'm on the deadline scheduler with my 3ware 9560SE and still see this problem crop up from time to time, usually when doing a file copy large enough to fill the page cache.
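A practical note for those trying different elevators: besides the elevator=as boot option, kernels of this era let you switch the scheduler per device at runtime through sysfs, which makes A/B testing much quicker (sda is just an example device name):

    cat /sys/block/sda/queue/scheduler
    echo anticipatory > /sys/block/sda/queue/scheduler

The runtime names are noop, anticipatory, deadline and cfq; only the boot parameter uses the short form "as".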
I too have found that the choice of I/O scheduler makes little difference. Using AS generally yields no noticeable improvement.

> Fwiw, I've never used the CFQ scheduler. I'm on the deadline scheduler with
> my 3ware 9560SE and still see this problem crop up from time to time,
> usually when doing a file copy large enough to fill the page cache.
Another deadliner here. And the same thing. There are two clear-cut triggers for me:
1. The test case Thomas posted.
2. Large copies which fill up the page cache.
I think it's a process scheduling bug, because page cache fill-up might be driving the pdflush processes (which are, by the way, normal priority - why?) into hyperdrive and causing all other processes to wait. We do see various processes going into 'D' state and pdflush at the top of the CPU usage list when the symptoms occur.
If CFQ is used, and process priority determines I/O priority, aren't pdflush processes going to compete with processes doing their own I/O when dirty_ratio is reached and the process has priority equal to or better than 0 (-1 and higher)? That may explain some of the stories with CFQ here.
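The writeback theory above can at least be probed from userspace; this is a hedged experiment, not a fix proposed in the thread, and the values are illustrative:

    cat /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio
    # echo 5 > /proc/sys/vm/dirty_ratio   # start flushing earlier, in smaller batches
    # ionice -c3 -p <pid>                 # with CFQ, drop a bulk writer to the idle class

If pdflush competing with foreground I/O really is the trigger, lowering dirty_ratio should shorten the stalls (at some throughput cost); if it makes no difference, that points away from writeback as the culprit.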
Re: blaming the scheduler in 2.6.26: the problem was observed a long time before that. There might be additional scheduler problems (this bug in general suffers from the "lots of different problems" disease), but that is unlikely to be the old, well-known disk starvation with different devices issue.

Re comment #9, vim stalls while the disk is pounded: you're running ext3 or reiser, right? That's a known problem: vim regularly does fsync on its auto-save file, and that causes a synchronous JBD transaction, and since all transactions are strictly ordered, if there are enough of them in front and the disk is busy it takes quite a long time. At the higher level that is supposed to be mostly solved by ext4 or by XFS. Of course, it's another problem that the disk schedulers allow that long starvation in the first place.

Hi Thomas~ Can you elaborate on your test? You wrote: "The first three results are 2*100, 2*50 and 2*20 processes exchanging 100k, 200k and 1M messages over a pipe. The last three results are 2*100, 2*50, and 2*20 threads exchanging 100k, 200k and 1M messages with pthread_mutex and pthread_cond." So, I'm guessing you want the test to be run like this:
./processtest 200 100000
./processtest 100 200000
./processtest 40 1000000
./threadtest 200 100000
./threadtest 100 200000
./threadtest 40 1000000
Is that correct? Just want to be sure I'm running the same tests. (Also, the code limits the number of processes to a max of 100... so I just edited this, allowing the max limit to be 200.)

Here are our results:

2.6.15.7-ubuntu1-custom-1000HZ_CLK #1 SMP Thu Jan 15 19:06:30 PST 2009 x86_64 GNU/Linux (Ubuntu 6.06.2 server LTS with CLK_HZ set to 1000HZ)
min:0.004ms|avg:0.004-0.271ms|mid:0.005ms|max:42.049ms|duration:34.029s
min:0.004ms|avg:0.004-0.138ms|mid:0.035ms|max:884.865ms|duration:33.105s
min:0.004ms|avg:0.004-0.042ms|mid:0.004ms|max:2319.621ms|duration:62.438s
min:0.005ms|avg:0.010-0.026ms|mid:0.012ms|max:1407.923ms|duration:92.132s
min:0.005ms|avg:0.011-0.029ms|mid:0.013ms|max:1539.929ms|duration:97.034s
min:0.005ms|avg:0.010-0.031ms|mid:0.013ms|max:18669.095ms|duration:176.555s

2.6.24-23-server #1 SMP Thu Nov 27 18:45:02 UTC 2008 x86_64 GNU/Linux (default Ubuntu 64 8.04 server LTS at default 100HZ clock)
min:0.004ms|avg:0.034-0.357ms|mid:0.324ms|max:39.789ms|duration:43.390s
min:0.004ms|avg:0.006-0.149ms|mid:0.131ms|max:79.430ms|duration:39.288s
min:0.004ms|avg:0.046-0.057ms|mid:0.052ms|max:52.427ms|duration:64.481s
min:0.005ms|avg:0.006-0.650ms|mid:0.330ms|max:22.120ms|duration:60.142s
min:0.005ms|avg:0.053-0.309ms|mid:0.276ms|max:21.560ms|duration:62.353s
min:0.004ms|avg:0.033-0.123ms|mid:0.112ms|max:22.007ms|duration:131.029s

Linux la 2.6.24.6-custom #1 SMP Thu Jan 15 23:34:10 UTC 2009 x86_64 GNU/Linux (Ubuntu 8.04 server LTS with CLK_HZ custom set to 1000HZ)
min:0.004ms|avg:0.054-0.364ms|mid:0.332ms|max:24.524ms|duration:42.522s
min:0.004ms|avg:0.125-0.156ms|mid:0.144ms|max:13.171ms|duration:33.573s
min:0.004ms|avg:0.046-0.058ms|mid:0.052ms|max:13.005ms|duration:64.388s
min:0.005ms|avg:0.006-0.594ms|mid:0.302ms|max:13.481ms|duration:61.105s
min:0.005ms|avg:0.109-0.336ms|mid:0.307ms|max:13.345ms|duration:65.000s
min:0.002ms|avg:0.070-0.130ms|mid:0.120ms|max:13.137ms|duration:133.786s

Side note: we have been experiencing problems with MySQL, specifically with sync-binlog=1 and log-bin on, when performing a high volume of concurrent transactions. Although we run RAID-1 with battery cache on... our throughput is horrible.
For some reason, we have found that by increasing CONFIG_HZ from 100 to 1000 in the kernel, we get much higher throughput. Otherwise our benchmarks just sit around and have trouble context switching.

#CONFIG_HZ_100=y
#CONFIG_HZ=100
# change to:
CONFIG_HZ_1000=y
CONFIG_HZ=1000

I do not know if the problems we are experiencing with the clock are related to the bug listed here. However, I did want to submit our feedback showing the difference in kernels where our bottleneck runs better. We use sysbench for our tests (with vmstat -S M 3, iostat -dx 3 and mpstat 3 to monitor, all part of the sysstat suite). FYI, here are our sysbench commands (be sure to change your MySQL username and password and create the database sbtest). You can get sysbench here: http://sysbench.sourceforge.net/

Compile it like:
./configure --with-mysql --with-mysql-include=/usr/share/include --with-mysql-lib=/usr/share/lib
make
make install

Prepare it:
./sysbench --num-threads=50 --test=oltp --oltp-test-mode=complex --oltp-table-size=100000 --oltp-distinct-ranges=0 --oltp-order-ranges=0 --oltp-sum-ranges=0 --oltp-simple-ranges=0 --oltp-point-selects=0 --oltp-range-size=0 --mysql-table-engine=innodb --mysql-host=127.0.0.1 --mysql-user=ROOT --mysql-password=PASSWORD prepare

Run it:
./sysbench --num-threads=50 --test=oltp --oltp-test-mode=complex --oltp-table-size=100000 --oltp-distinct-ranges=0 --oltp-order-ranges=0 --oltp-sum-ranges=0 --oltp-simple-ranges=0 --oltp-point-selects=0 --oltp-range-size=0 --mysql-table-engine=innodb --mysql-host=127.0.0.1 --mysql-user=ROOT --mysql-password=PASSWORD run

The important lines of output are read/write requests per second, and total time.

===
2.6.15.7-ubuntu1-custom-1000HZ_CLK #1 SMP Thu Jan 15 19:06:30 PST 2009 x86_64 GNU/Linux (Ubuntu 6.06.2 server LTS with CLK_HZ custom set to 1000)
read/write requests: 50000 (2394.13 per sec.)
total time: 20.8844s

vmstat -S M 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b  swpd  free  buff  cache  si  so  bi  bo  in  cs  us sy id wa
 0  0  0  9043  142  559  0  0  1  30341  5020  25659  6 15 78 1

iostat -dx 3
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 4320.74 0.00 4836.12 0.00 73254.85 0.00 36627.42 15.15 4.93 1.02 0.16 77.02
===

===
2.6.24-23-server #1 SMP Thu Nov 27 18:45:02 UTC 2008 x86_64 GNU/Linux (default Ubuntu 64 8.04 server LTS at default 100HZ clock)
read/write requests: 50000 (434.33 per sec.)
total time: 115.1207s

vmstat -S M 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b  swpd  free  buff  cache  si  so  bi  bo  in  cs  us sy id wa
 0  0  0  1506  109  100  0  0  155  5011  531  4532  5 3 91 1

iostat -dx 3
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 951.67 30.67 551.00 274.67 12021.33 21.14 1.18 2.03 1.60 93.00
===

===
Linux la 2.6.24.6-custom #1 SMP Thu Jan 15 23:34:10 UTC 2009 x86_64 GNU/Linux (Ubuntu 8.04 server LTS with CLK_HZ custom set to 1000)
read/write requests: 50003 (2680.47 per sec.)
total time: 18.6546s

vmstat -S M 3
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b  swpd  free  buff  cache  si  so  bi  bo  in  cs  us sy id wa
 1  0  0  1710  46  73  0  0  1296  27104  3474  31095  5 3 82 9

iostat -dx 3
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 2432.33 159.33 2576.00 1632.00 40066.67 15.24 1.95 0.71 0.35 94.47
===

Note: our servers are 2x Intel Xeon 5110 dual-core 1.6GHz with 15k SAS RAID-1 and 2+GB RAM. Not sure if this feedback is helping or not; my hope is that it is relevant to what you are trying to fix. My personal opinion is that the kernel should scale a little more uniformly than 434.33 per sec versus 2680.47 per sec... that seems to be a large difference. Even though the 100Hz clock setting is recommended for servers, it seems this could actually not be ideal for anyone running a MySQL server that needs safe transaction support via sync-binlog=1 (at least, that's what we are finding for high insert/update load). Perhaps you can look at sysbench, as there are a number of tests for threads, fileio, etc. to determine if this can expose the kernel issues in a different way? Any feedback about an ideal kernel and kernel config for servers is much appreciated, as these are no doubt difficult to debug.

Did some more testing. My father has an Eee PC 900 exactly the same as mine, also running Ubuntu 8.10 with the same kernel as mentioned before. The only difference that I can think of: he doesn't use LUKS and LVM like me; he instead has his / directly on /dev/sdb1 (internal SSD). Also, in addition to trying to launch a terminal via Gnome (as I did previously), I tried the vim "stuttering" test by creating a file, saving it, and holding down a key to see when it stutters. The results of these tests:
- On both my own (encrypted) and the other (unencrypted) computer, vim occasionally freezes for a few seconds while I cp a file from USB memory to internal SSD.
- On my computer (encrypted) launching a gnome-terminal takes much longer while copying a file to SSD than on the other computer. While there is a noticeable slowdown on the unencrypted machine, on the encrypted machine sometimes the gnome-terminal won't even launch until *after* the copy is complete.
In conclusion: the effect exists on both machines, but the encryption of the SSD very significantly increases the problem. While some slowdown due to encryption should be expected, it should not make the machine almost completely unusable while copying a file from a USB stick to the internal SSD. A different scheduler (#11) doesn't seem to do much.

I did some quick and dirty testing with my laptop:
Linux lupaus 2.6.28-customlupaus #4 SMP PREEMPT Thu Dec 25 15:05:35 EET 2008 x86_64 GNU/Linux
Vanilla 2.6.28 kernel, config from Ubuntu 8.10, with some modifications to suit my laptop.

with io scheduler cfq
./threadtest 100 200000
min:0.004ms|avg:0.007-0.008ms|mid:0.008ms|max:894.480ms|duration:187.588s
with elevator=as (i.e. io scheduler anticipatory)
./threadtest 100 200000
min:0.004ms|avg:0.007-0.008ms|mid:0.008ms|max:884.016ms|duration:188.248s
---
with io scheduler cfq
./proctest 50 100000
min:0.005ms|avg:0.005-0.006ms|mid:0.006ms|max:460.631ms|duration:35.773s
with elevator=as (i.e. io scheduler anticipatory)
./proctest 50 100000
min:0.005ms|avg:0.006-0.006ms|mid:0.006ms|max:479.695ms|duration:36.645s

One more observation from another experiment I did: I have swap on the same encrypted LVM as my root partition.
Disabling swap makes the terminal launch much faster while copying - still slower than when not copying files, but within a few seconds of clicking instead of within minutes. However! Now, instead, individual running processes (like Firefox and vim) hang much more aggressively and frequently during copying. I'm not sure what to make of this, but I hope somebody who actually knows something about the Linux kernel will find this useful. :-)

I'm not sure any developer will be able to pinpoint the problem in all this mess! ;-) There are likely several bugs here. For a start, I think it would be nice to separate the people whose problem is fixed by elevator=as. And then separate the people using encrypted disks. And then the problems occurring only with USB disks. Please open new reports. What do developers think?

Created attachment 19828 [details]
Bisect results

I have done the bisect and isolated the patch. In the attachment you can find the bisect result. I have done the sysbench test too.

Tests: 100 processes / 1k messages
Linux balrog704 2.6.20 #13 SMP Fri Jan 16 10:13:21 CET 2009 x86_64 GNU/Linux
min:0.003ms|avg:0.243-0.253ms|mid:0.246ms|max:29.503ms|duration:25.080s
min:0.002ms|avg:0.022-0.038ms|mid:0.037ms|max:756.082ms|duration:37.894s
min:0.002ms|avg:0.004-0.007ms|mid:0.004ms|max:929.790ms|duration:34.608s
Linux balrog704 2.6.20bad #14 SMP Fri Jan 16 10:52:17 CET 2009 x86_64 GNU/Linux
min:0.003ms|avg:0.411-0.434ms|mid:0.424ms|max:18.328ms|duration:43.549s
min:0.003ms|avg:0.063-0.075ms|mid:0.071ms|max:404.088ms|duration:72.860s
min:0.003ms|avg:0.005-0.010ms|mid:0.009ms|max:712.033ms|duration:51.654s

Created attachment 19829 [details]
sysbench results
As I am using Firefox 3 with the bad kernel, my post was submitted by accident. With the good kernel there are (nearly) no problems with Firefox 3 any more.
The tests were run with the following parameters:
- 2*100 processes / 100k messages
- 2*20 processes / 1M messages
- 2*200 threads / 100k messages
Created attachment 19830 [details]
Bisect results
wrong file
Re #26: There's some performance problem in general with encrypted swap. I've seen that too. But it's probably a different issue than the primary one which should be discussed here.

> Is that correct? Just want to be sure I'm running the same tests. (Also,
> the code limits the number of processes to a max of 100... so I just edited
> this, allowing the max limit to be 200.)
I have used 100/50/20, as one echo process uses 2 threads or processes. But it is not important, as these tests should only compare different kernel versions on the same computer.
(In reply to comment #18)
> I have had a very similar problem to this. I still have it often, but not
> as much since I changed from ext3 to ReiserFS. For the scheduler, I've been
> using BFQ or V(R), which are included in the Zen patchset. I have tried the
> stock kernel, and the same problem exists; however, I can't remember which
> scheduler I used at that point, I believe deadline.
> Most of the I/O wait I get comes when either I'm copying files to the local
> drives or using multiple VMs (generally Windows, as that's what is needed
> for work). I'm willing to try about anything to get this fixed. It's a
> little better since I switched filesystems on my VM drive, but still isn't
> totally fixed.

I did try the AS scheduler, as that was the only thing I changed in my kernel, and it didn't change anything interactively; I still get a high I/O wait. The other thing I noticed, at least when on AS: I start using swap. It's not a lot (within about 2 minutes I was using 10MB), but it was still climbing.

One other thing: I'm wondering if this is 64-bit related. All of my personal boxes are 64-bit, and it seems, of the ones posted here, along with other threads I've read (over on the Gentoo forums), that this hits the 64-bit users more than the 32-bit users. Any truth to this, or am I trying to relate things that aren't related?

My work box (most heavily used):
Linux PC010233L 2.6.28-zen1-2 #2 SMP PREEMPT Thu Jan 15 16:06:37 EST 2009 x86_64 Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz GenuineIntel GNU/Linux

(In reply to comment #30)
> Created an attachment (id=19830) [details]
> Bisect results

If that bisection is to be believed, the assertion that the issue is caused by a scheduling issue seems quite plausible.

(In reply to comment #33)
> One other thing: I'm wondering if this is 64-bit related. All of my
> personal boxes are 64-bit, and it seems, of the ones posted here, along
> with other threads I've read (over on the Gentoo forums), that this hits
> the 64-bit users more than the 32-bit users. Any truth to this, or am I
> trying to relate things that aren't related?

There is evidence that x86-64 is a factor here. It does strike me as quite odd how large a factor the size of the transfer seems to be. When I first start evolution (I have very large folders), the system will exhibit poor interactivity for upwards of 5 to 10 minutes. However, when transferring lots of small files (i.e. module_install'ing), the kernel behaves fine (although modpost also seems to produce poor interactivity).

I think it might help if we had a kernel developer here to list the kernel block/memory manager/scheduler statistics that might indicate where this I/O wait time is going. If sufficient statistics don't exist, it might be worthwhile to instrument the kernel specifically for this bug. It does seem clear that the bug I intended this ticket to describe is invariant under the choice of I/O scheduler, so that's one factor that needn't be accounted for.

I just recompiled my kernel without any SMP support and tested again. My laptop went from usable to totally unusable. Network traffic stops and it's even hard to type anything when the process/thread test is running. I have only a single CPU on my laptop. I also tried to change the scheduler with this setup and that didn't make any difference. Good luck :)

Could this be a jiffies wraparound bug? I've seen different formulas for doing interval arithmetic, and (not) handling wraparound.
For instance, in as_antic_expired():

    long delta_jif;
    delta_jif = jiffies - ad->antic_start;
    if (unlikely(delta_jif < 0))
            delta_jif = -delta_jif;

, which seems incorrect to me (it could alter the predictive powers of the scheduler in mysterious ways ;-). A different calculation is performed at other places. Jiffies wrap around depending on the HZ value (but still, intervals above INT_MAX should be relatively rare), and the jiffies start value will cause the first wrap 5 minutes after booting, so that would show. My 2 cents, AvK

Adriaan: drivers shouldn't be manually doing comparisons on jiffies values. There are helpers in linux/jiffies.h for doing the comparison (time_before() / time_after()) and those should handle wraparounds. If you do see a driver that is doing the wrong thing, I'd open another bug specifically about that (or post a patch yourself :D).

With the following code I got negative time differences of about -127ms. The tv_sec values were equal and the second tv_usec was smaller than the first. I cannot say which kernel it was, as I am no longer able to reproduce it. A few days earlier it occurred on nearly every test. As this behaviour is connected with the TSC synchronisation patch, I have posted it here. I will try to figure out the kernel version.
> gettimeofday(&tv_s, &tz);
> write(a2b[1], &c, 1);
> read(b2a[0], &c, 1);
> gettimeofday(&tv_e, &tz);
> timersub(&tv_e, &tv_s, &tv_r);
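For reference, a wrap-safe version of the expiry check criticized above, using the linux/jiffies.h helpers just mentioned, might look like this. This is a sketch, not the actual as-iosched.c code; the timeout parameter is illustrative:

    #include <linux/jiffies.h>

    static int antic_expired(unsigned long antic_start, unsigned long timeout)
    {
            /* time_after() compares jiffies values with wraparound taken
             * into account, so no sign juggling on the raw difference
             * is needed */
            return time_after(jiffies, antic_start + timeout);
    }

For the userspace timing snippet quoted above, using clock_gettime(CLOCK_MONOTONIC, ...) instead of gettimeofday() would at least rule out wall-clock adjustments as a source of the negative differences, though it still relies on the same underlying clocksource, so a TSC synchronisation bug could affect it too.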
I get the negative time difference on 2.6.17.14 kernel.org, 2.6.18.8 kernel.org and 2.6.18-92.el5 CentOS. My system is unusable with these three kernels when I use ide_generic: disk throughput ~3MB/s, I/O wait time at 100%. No problems with ahci and libata on 2.6.18-92.el5. I was not able to provoke a negative time difference with kernels 2.6.20, 2.6.21, 2.6.24, 2.6.27 and 2.6.8.

Created attachment 19839 [details]
32v64test
32 Bit Test vs 64-Bit
This test is slightly apples and oranges... however, because someone inquired whether this was a 32-bit or a 64-bit problem, I ran these tests.
I'm inclined to think it applies to both 32-bit and 64-bit, for 2 reasons:
- The 32-bit test didn't perform that great.
- The git bisect comment states "the biggest change is the removal of the 'fix up TSCs' code on x86_64 and i386".
Created attachment 19840 [details]
32v64testCleanNewLines.txt
formatting fix
Please ignore my comments #39 and #40, as these are other problems.

Are you guys aware of the LatencyTOP utility? http://www.latencytop.org/ You have to add CONFIG_LATENCYTOP=y to your config, then run your tests which break down the system with LatencyTOP running. It might give additional information.

I've reproduced this problem with LTTng (http://ltt.polymtl.ca). It looks like the block layer is back-merging the large "dd if=/dev/zero ...." requests at a rate which leaves the request at the top of the request queue. I've started a more thorough discussion on lkml here: http://lkml.org/lkml/2009/1/16/487

Re: the 32-bit vs 64-bit idea - I've experienced this issue on both 32- and 64-bit platforms; however, all of the platforms were on x64-capable CPUs (not sure if that would matter).

I hit this bug on Ubuntu 8.10 (updated to 2.6.27-9-generic) running VMware Workstation 6.5.126130 with Ubuntu 8.04.1 LTS as a guest. It was especially pronounced when resuming a suspended VM. I tried the different elevator I/O schedulers. Nothing helped. Independent of VMware, if I ran bonnie in one shell and launched Firefox, the whole system behaved in a very chunky manner. Renicing pdflush to -10 had some great improvement on basic responsiveness. The weird part: after re-creating a new VM and not seeing the iowait problems, I then tried resuming a VM with VMware at the same time I was compressing a tar file with pbzip2 (parallel bzip). All 4 cores were pegged, my load average was normal, and system responsiveness was good. As **soon** as I tried resuming the VM with VMware Workstation, the CPU load dropped to 1-5% across all CPUs and iowait times shot way up. I have now killed VMware and iowait times have dropped, but my maximum read speed hovers around 1MB/s (as measured with iostat). This is another symptom of the iowait problem. With "iostat -c -d -m -x sda 1", rMB/s is usually never over 2MB/s.

(In reply to comment #46)
> Re: the 32-bit vs 64-bit idea - I've experienced this issue on both 32- and
> 64-bit platforms; however, all of the platforms were on x64-capable CPUs
> (not sure if that would matter).

Using an IBM X40 with an old Pentium M (32-bit) and Thomas.pi's test cases made my machine totally unusable. So I don't think this has anything to do with x64-capable CPUs.

(In reply to comment #38)
> Adriaan: drivers shouldn't be manually doing comparisons on jiffies values.
> There are helpers in linux/jiffies.h for doing the comparison
> (time_before() / time_after()) and those should handle wraparounds. If you
> do see a driver that is doing the wrong thing, I'd open another bug
> specifically about that (or post a patch yourself :D).

Well, it was not in one of the drivers' code but in block/as-iosched.c:as_fifo_expired(). The observed behavior indicates that something is wrong with the scheduling of disk I/O, and that most time is spent by all threads competing for one or more (spin-)locks; you might call it a convoy or a thundering herd syndrome. But it might be unrelated. AvK

Hi all, more tests:
Linux ws-esp16 2.6.27-11-generic #1 SMP Thu Jan 8 08:38:33 UTC 2009 i686 GNU/Linux
$ ./processtest 100 200000
min:0.006ms|avg:0.278-0.520ms|mid:0.475ms|max:141.058ms|duration:107.646s
$ ./threadtest 100 200000
min:0.006ms|avg:0.690-0.768ms|mid:0.715ms|max:235.106ms|duration:159.355s
But if this is an I/O problem, why do the monitors not show a big I/O wait percentage? They show a high system usage percentage. So I suppose it is not an I/O problem; it seems to be related to process handling inside the kernel.
May it be related to the preemption model? I did some additional tests:
1. Change clock timing -> no improvement
2. Change preemption model (tested all of them) -> no improvement
3. Change I/O scheduler -> no improvement
Is there any way to profile the kernel to see which functions get the most attention? Hope you find something... I attach a screenshot also...

Created attachment 19858 [details]
Top output while running test
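Re profiling the kernel: one option available on kernels of this era is oprofile, which can attribute samples to kernel symbols. A rough sequence, sketched here as an assumption about a typical setup (the vmlinux path depends on your build, and the workload line is just an example from this thread):

    opcontrol --init
    opcontrol --vmlinux=/usr/src/linux/vmlinux
    opcontrol --start
    ./threadtest 100 200000    # run the workload that shows the problem
    opcontrol --stop
    opreport --symbols | head -20

Hot spin-locks or scheduler functions at the top of the report would support the lock-contention/scheduler theories above; a flat profile with high iowait would point back at the block layer.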
Created attachment 19859 [details]
RFC patch to put a maximum to the number of cached bio merge done in a row
Can you try this patch, which applies to 2.6.28, to see if it helps? I have not been able to reproduce the problem with the patch applied.
Hi Mathieu, I tried this patch against 2.6.27 because it applied cleanly. But the results are not good. It took even more time to complete the test. Can anyone confirm this?

This patch will probably diminish overall throughput, because it makes sure that we do not merge more than 128 requests together. I am more interested in the I/O _latency_ (delay) you get when you run the system under a heavy I/O load. Mathieu

Created attachment 19866 [details]
Port Attachment #19859 [details] to Linus's master

(In reply to comment #53)
> Hi Mathieu,
>
> I tried this patch against 2.6.27 because it applied cleanly. But the
> results are not good. It took even more time to complete the test.
>
> Can anyone confirm this?

I can. Unfortunately, not only did the patch fail to reduce latency, it also reduces throughput. Even opening the file selection dialog to attach this patch took over 30 seconds while building a kernel. Also, a patch set providing an ftrace interface to blktrace was recently submitted to the LKML (http://marc.info/?t=123212992300002&r=1&w=2). This could come in handy in further debugging.

Just a comment that might have gone unnoticed, but to me appears relevant, as this bug again appears to be becoming a collection of multiple issues, as happened with #7372, with the result that the kernel devs started to ignore it. The bisect done by thomas.pi yields a first bad commit dating from February 2007, while these symptoms first surfaced in 2.6.18, which dates from the end of 2006. Bug #7372 basically is from before this first bad commit; the bisect I did in that bug, for example, pointed towards a problem with NCQ with the CFQ scheduler from November 2006 that was clearly only present on 64-bit. See http://bugzilla.kernel.org/show_bug.cgi?id=7372#c112 as a reminder for this proof. I'm not sure that issue got resolved in the end... no clear pointers on what I could do to help further. Seeing reports in this bug of improvements when switching I/O scheduler, and reports of differences between 32/64-bit, makes me think those might be more related to that commit. The bottom line is to be sceptical with reports on whether or not a patch helps fully, as to me this still appears to be multiple issues that have very similar but difficult-to-reliably-trigger symptoms. However, the test case of Thomas does bring my system to its knees as well, so it is definitely a good way to tackle at least part of the problem. But I don't think it is the only problem.

No, the patch does not fix the problem, but I think it is now better than before. I think that it is a CPU scheduler problem, as one process with many threads and thread switching can nearly stop the execution of other processes. This problem exists in every kernel version, even 2.6.15. You can test it by executing the thread-based test with 2*100 threads. My system starts to become unusable with kernel 2.6.27 (Fedora 10) when executing the thread-based test with 2*40-50 threads. I don't know how many interrupts occur while copying some data, but perhaps that is what copying files and the thread-based test have in common. The provided bisect points to a CPU scheduler performance regression, which makes the problem more noticeable. The biggest CPU scheduler performance regression was in 2.6.24 - 2.6.27. There was another CPU scheduler performance regression between 2.6.22 and 2.6.24.
(In reply to comment #57)
> Seeing reports in this bug of improvements when switching I/O scheduler,
> and reports of differences between 32/64-bit, makes me think those might be
> more related to that commit.

Nobody has confirmed that changing the I/O scheduler or going 32<->64-bit improves the system much. People are also testing different things: some test disk I/O and others are running the process/thread tests. It's very confusing, and someone should run a couple of identical tests (including disk I/O AND the process/thread test) with different kernel options. On my setup, just disabling or enabling SMP support made a HUGE difference. I'm happy to do testing, but only if someone really needs the information I can provide. Again, my worthless 5 cents... :)

I just created a fio job file which acts like an "ls" executed while doing a large dd. It looks like the anticipatory I/O scheduler was causing those delays for me. The results for the ls-like jobs are interesting:

I/O scheduler   runt-min (msec)   runt-max (msec)
noop                 41               10563
anticipatory         63                8185
deadline             52               33387
cfq                  43                1420

Is it me, or do all I/O schedulers except cfq generate unexpectedly high latency? Details here (including the fio job file): http://lkml.org/lkml/2009/1/18/198 Mathieu

Actually, in this bug as well as in the other (7372), there is no clear direction. None of the kernel devs have taken a leadership role and directed the reporters in a direction where we can start to get a handle on things. What we see here is a lot of speculation on the part of the users and hence an enormous variety of things being tried. It's like everybody is shooting in the dark. Unless someone in the kernel team takes ownership of this bug, sorts out the quarters from the pennies, and directs users with a clear set of instructions to get well-defined data, I don't see this bug going anywhere. The question is: who has the know-how and willingness to do that? We see the process scheduler as well as the I/O scheduler being involved, we see the VM having an effect, we see some libata effects. With so many components in the line of fire, and the kernel being as vast as it is, I don't see the above (one savior coming along and putting 2 and 2 together) happening. IOW, take a beer and head away from the computer and into the sun... ;-)

Created attachment 19894 [details]
Test job description for fio
Attaching the test case written by Mathieu Desnoyers and included in his earlier email
(In reply to comment #33)
> I did try the AS scheduler, as that was the only thing I changed in my
> kernel, and it didn't change anything interactively; I still get a high
> I/O wait.
>
> The other thing I noticed, at least when on AS: I start using swap. It's
> not a lot (within about 2 minutes I was using 10MB), but it was still
> climbing.
>
> My work box (most heavily used):
> Linux PC010233L 2.6.28-zen1-2 #2 SMP PREEMPT Thu Jan 15 16:06:37 EST 2009
> x86_64 Intel(R) Core(TM)2 Duo CPU E8200 @ 2.66GHz GenuineIntel GNU/Linux

OK, I tried playing a little bit more, and switching to the deadline scheduler really helped things. I have topped out around 73% I/O wait, but it never bogged the whole box down. I still need to do definitive testing (via the tests already in the bug report), but this seems to have helped. Not sure which problem this relates to in this bug, though; I'm guessing the scheduler one.

Created attachment 19906 [details]
fio test results for kernels 2.6.15 - 2.6.24
I have executed the test case of Mathieu Desnoyers on some different kernel versions. I took the bad and good kernels from my bisection. The results do not confirm my theory. If someone can identify a problem in them, I can make some more tests.
The only regression I can see is the regression with the noop scheduler. The value is the average of the average latencies.
./test.2.6.15-53-amd64-genericresult.noop 700,62ms
./test.2.6.20-17-genericresult.noop 3520,24ms
./test.2.6.20result.noop 3005,24ms
./test.2.6.20badresult.noop 3698,64ms
./test.2.6.22.19result.noop 1393,67ms
./test.2.6.24.7result.noop 589,66ms
I will check whether the 2.6.24.7 kernel test build has improved desktop responsiveness.
There is no performance improvement in 2.6.24.7. The list below shows the average times of the 41 small jobs with the cfq scheduler. I have the best desktop responsiveness on 2.6.20. Gimp starts under heavy I/O in 10 seconds instead of 30 seconds. The freezes of applications exist on 2.6.20, but they are much shorter, mostly under one second, while in kernels >= 2.6.22 there are freezes of up to one minute.

                    min    max     avg    stdev
2.6.20-17-generic   9.9    126.00  49.97  59.89
2.6.20              8.66   115.05  39.68  50.41
2.6.22.19           10.34  195.29  66.88  96.07
2.6.24.7            9.93   185.02  64.38  89.95

The high I/O wait is at 75% at the start and climbs to 99-100% after ~5 seconds. I have noticed that the freezes occur in all applications more often when Firefox is running. Currently I create a RAM disk on startup, extract the .mozilla folder to it, and save it again on shutdown. It makes my system more responsive, especially Firefox 3.
fio results for kernel 2.6.28
And finally the results for 2.6.28. I have removed all the tracing stuff I could find, but the system is still sluggish under heavy I/O.
              min    max      avg     stdev
2.6.28 noop   97,61  1799,06  654,84  861,90
2.6.28 cfq    9,32   169,32   55,59   79,50
Created attachment 19920 [details]
ext3 and ext4 comparison with patched and unpatched kernel

Here are some more results. I could gain or lose some latency via different kernel settings. On 2.6.20 I could reproducibly lose 10ms, which makes a decrease of 25% in average latency. But it makes no difference in desktop responsiveness. I have tested the 2.6.28 kernel patched ( http://bugzilla.kernel.org/attachment.cgi?id=19866 ) and unpatched with ext3 and ext4, with exactly the same kernel settings. My test system is installed on an ext3 partition; the tests are executed on an extra ext3 or ext4 (on the slower one) partition on the same hard drive. The write performance on ext4 is now at 45MB/s instead of 35MB/s (ext3). The desktop responsiveness in the ext4 test with the patched version decreases extremely. While copying a 10-gig file from ext4 to ext4, there are nearly no problems with the unpatched kernel; with the patched kernel, the system becomes unusable. With ext3 there is a little responsiveness improvement with the patched kernel, but it could be coincidence, as I have no exact test for desktop responsiveness. While copying the 10-gig file on ext4 and compiling the kernel, my system becomes unusable with the unpatched kernel too. There are freezes of >20 seconds while accessing the menu in applications for the first time. You can easily simulate this behaviour by executing the following compression once for every core:

bzip2 -9 -c /dev/urandom >/dev/null &

And the average latencies of the last four tests:

                        min    max      avg     stdev
2.6.28 unpatched ext3   11.24  181.20   62.35   86.15
2.6.28 patched ext3     10.82  175.93   62.18   83.89
2.6.28 unpatched ext4   6.90   396.17   132.52  213.18
2.6.28 patched ext4     6.85   2078.93  707.26  1006.74

Forget the back-merge patch. Have you tried running latencytop to spot big sleep offenders?

Created attachment 19924 [details]
Latencytop results

(In reply to comment #68)
> Have you tried running latencytop to spot big sleep offenders?

I am not sure what I should look at. You can find most of the results in the file latencytop-ext4-2*bzip2.txt.

Most of them look as expected, up to about 1 second latency for a single I/O under load. latencytop-ext4-2*bzip2.txt looks pretty bad, though. It has a 10 second wait on a single lock_page(); that's pretty slow. Again, this whole thread confuses me. The I/O latencies from the fio jobs posted look OK, in the sense that they haven't regressed and that you can't expect zero latency when you are fully loading a disk with writes. So while we could do better there, it's not a catastrophe. The bisect you originally did pointed to something interesting, I think. If we have clock problems, the CPU scheduler could easily delay a single process for large amounts of time if other processes are repeatedly ready to run. The scheduler has special code to handle bad (e.g. going backwards) clocks. Of course it has its limits, but it should handle the typical cases. Of course a bad clock could confuse other subsystems too. For testing you could force another clock, like clock=pmtmr or clock=hpet (if you have HPET). It may be something as simple as a wrapped variable. IIRC, someone recently found something like that in the scheduler, though I can't find the posting just now. It was in kernel/sched_fair.c:update_curr(), I think.

My default clock source is hpet. It is faster, but I have long freezes. With acpi_pm the system is sluggish, but the freezes were always below 5 seconds. Test: copy a 10-gig file and execute "bzip2 -9 -c /dev/urandom >/dev/null" twice on a Core 2 Duo.
hpet: 1299.7 / 1651.3 / 39790.7 / 4580.1 / 943.9 / 2069.3 / 145.7 / 1739.2 / 691.4 / 2060.2 / 172.3 / 492.4 / 2286.4 / 3064.9 / 696.9 / 716.9 / 14096.2 / 3131.2 / 1640.2
min:145.7 ms|max:39790.7 ms|avg:4277.31

acpi_pm: 1969 / 1276.8 / 658.8 / 16303.8 / 1604.3 / 3885.8 / 823.6 / 3659.1 / 2719.6 / 2064.2 / 672.9 / 1327.9 / 1783.9 / 604.3 / 1289 / 9535.1 / 1271.5 / 280.9 / 2621.8 / 759.1
min:280.9 ms|max:16303.8 ms|avg:2755.57

I'm not sure what my default clock source is (where does one look to determine this?); however, I just booted with clock=hpet and things don't seem to be particularly better (50% I/O wait time while evolution is starting, a process which takes over 5 minutes; this is with Jens' patch). These numbers are common with Jens' patch (which is a bit of an improvement; without the patch, evolution pegs I/O wait times at 70%+ and is very sluggish even after starting). I just tried clock=acpi_pm and evolution startup performance seems no better. Tonight I'm going to try some quantitative benchmarks on these configurations so that legitimate comparisons can be made. One thing that I have neglected to mention is that Jens' patch does seem to help overall system interactivity - an application with a high I/O load doesn't degrade the latency of the entire system nearly as much - although I have no numbers to support this claim.

On my computer, jiffies was the default clocksource on the 2.6.20 kernel. Since 2.6.22, hpet is. On my old notebook it is now acpi_pm; I don't know what it was before. With jiffies under 2.6.28, my system seems much better, although there are still some short freezes. It does not solve the problem, but makes it much better. Please try clocksource=jiffies. You can check your current clocksource with:

cat /sys/devices/system/clocksource/clocksource0/*

jiffies: 645 / 598.3 / 462.5 / 1496.2 / 213.2 / 1353.1 / 6470.6 / 337.6 / 3406.9 / 2057.5 / 155.3 / 309 / 2332 / 463.1 / 1804.4 / 3258.6 / 261.7 / 8124.3 / 2373.2 / 2471.1
min:116.1 ms|max:8124.3 ms|avg:1843.32

The long values are freezes of Firefox:
hpet 39790.7
acpi_pm 16303.8
jiffies 8124.3

(In reply to comment #76)
> The long values are freezes of Firefox.

Do you mean startup time? Or you click on a tab and it takes that long for it to switch?

Using the jiffies clocksource on Linus's master causes the machine to wedge up when attempting to start Xorg. I'll have to look into it later.

(In reply to comment #77)
> Do you mean startup time? Or you click on a tab and it takes that long for
> it to switch?

It is the longest time for switching or opening tabs during heavy I/O and 2*bzip2 urandom.
> (WW) intel(0): Unable to find initial modes
> (EE) intel(0): No valid modes.
No Xorg coming up with the jiffies clocksource; it takes the console with it. I have darkness on the screen... :) I can ssh into it, though.
Some weird interaction between i915 and the clocksource there.
echo hpet > current_clocksource

and things are back to normal.

(In reply to comment #81)
> echo hpet > current_clocksource
> and things are back to normal.

I got a crash when I tried to set the jiffies clocksource while Linux was running.

There is now an improvement in the process and thread test with the jiffies clocksource. Here are the results. The performance is nearly the same as in 2.6.20.
Linux bugs-laptop 2.6.28t61p4 #5 SMP Wed Jan 21 14:30:24 CET 2009 x86_64 GNU/Linux
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:945.000ms|duration:24.354s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:466.000ms|duration:24.206s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:220.000ms|duration:47.452s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:870.000ms|duration:34.105s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:479.000ms|duration:36.610s
min:0.000ms|avg:0.000-0.000ms|mid:0.000ms|max:212.000ms|duration:77.449s

I booted up with clocksource=jiffies and lost Xorg and the console. So, it wasn't set while running.

(In reply to comment #83)
> I booted up with clocksource=jiffies and lost Xorg and the console. So, it
> wasn't set while running.

Try blacklisting the thermal and processor kernel modules.

Hi, I currently have the following running: 2 x "bzip2 -9 -c /dev/urandom >/dev/null", since I have 2 cores, and one "dd if=/dev/zero of=test.10g bs=1M count=10000". Only small lockups happened during that time, which was about 9 minutes. By small lockups I mean a couple of seconds. After the dd command had finished, the lockups were still occurring, but they were generally much shorter. For me it is definitely a fix.

Seems like it is more complex. Only doing the dd command halts my system in the same way as described earlier in this bug: ~100% iowait etc. Adding a single bzip2 command results in an iowait of around 40% and improved desktop response, and finally adding the second bzip2 command results in 5% iowait and even better desktop response.

(In reply to comment #84)
> Try blacklisting the thermal and processor kernel modules.

Wouldn't that throw everything cpufreq into a tizzy? It's a laptop, so losing cpufreq and other potential ACPI functions is a big loss. Let me know if I am wrong about this.

Blacklisting processor and thermal didn't work either. I give up on jiffies... :-)

Well, it looks like there's a good reason why machines hang with clock=jiffies: http://lkml.org/lkml/2009/1/21/402 Any ideas why those users whose machines didn't crash saw improvement? Does this suggest a scheduler issue?

> Well, it looks like there's a good reason why machines hang with clock=jiffies:
> http://lkml.org/lkml/2009/1/21/402
>
This means I need to recompile the kernel without high-resolution timers and then pass clocksource=jiffies?
Do we have an explanation for why the freezing period was reduced to half with acpi_pm and to a quarter with jiffies for Thomas? I would have thought faster timers would result in better behavior and be a step in the right direction, but we seem to be going backwards.
(In reply to comment #90)
> This means I need to recompile the kernel without high-resolution timers
> and then pass clocksource=jiffies?

No, it shouldn't be possible to run the kernel using jiffies as a clocksource. The system's time source needs to have a sufficiently high resolution. Using a low-resolution time source (like jiffies) can cause the kernel to hang.

> Do we have an explanation for why the freezing period was reduced to half
> with acpi_pm and to a quarter with jiffies for Thomas? I would have thought
> faster timers would result in better behavior and be a step in the right
> direction, but we seem to be going backwards.

It's far more complicated than that. If we have a timer wrapping around, it is entirely possible that a slower clock source would give you expected behavior whereas a higher-resolution time source would fail. It completely depends on the source of the freezes. Jens, what do you think in light of this growing body of evidence pointing towards timer issues?

(In reply to comment #91)
Hmm, I think I was a little tired last night. To clarify: I guess you probably could recompile without CONFIG_HIGH_RES_TIMERS, however I'm not sure you'd want to. If I'm not mistaken, the no-tick kernel option depends on high-res timers, so you'd have to give that up. Also, a correction: s/towards timer issues/towards timer-triggered issues/

Has anyone run latencytop yet? These are the total values of latencytop:
http://bugzilla.kernel.org/show_bug.cgi?id=12309#c73
http://bugzilla.kernel.org/show_bug.cgi?id=12309#c76

Currently my system crashes while I am executing the copy and 2*bzip2 operation with jiffies. I will take some new measurements as soon as my test system runs.

Created attachment 19954 [details]
latencytop captures with clocksource jiffies and hpet
I was not able to execute the 2*bzip2 test with jiffies any more. The system freezes forever while copying a file and zipping urandom. It happens in runlevels 1, 3 and 5, during CPU-intensive tasks.
I have made a test with less CPU consumption. The test uses a script so that the execution is the same with different clocksources. It copies a file, extracts the kernel source, builds the kernel and finally deletes the kernel tree. Concurrently, the script starts gimp, oowriter, firefox and htop, and opens some web pages and a document.
Here are the "Total:" times from the captures.
jiffies
min:0.1 ms|max:5442.1 ms|avg:213.2
hpet
min:0.0 ms|max:14777.7 ms|avg:403.71
The full captures without the escape sequences are added in the attachment. The escape sequences are not completely removed, but it's enough to see what matters. I can provide the captures with the escape sequences too, if someone wants them.
Filtering out the upper and lower 10% of the times results in an average latency of 1737.18ms for jiffies and 3164.72ms for hpet.

Basically all of them show waiting for an async page write to finish, and that can take quite a bit of time with heavy writing going on. First thing next week I'll try to provide a 'this async write now went sync' helper for the I/O scheduler, so that it can make sure the write gets expedited as soon as the sync I/O is. This should drastically reduce latencies for this situation. It'll probably be less than straightforward, but a test patch should be quite doable.

That sounds good. I have to correct the last values, as I was using the filter for capture logs with escape sequences.
jiffies: min:0.1 ms|max:5442.1 ms|avg:834.12|avg80:2248.28
hpet: min:0 ms|max:14777.7 ms|avg:1474.09|avg80:3638.15
Why is there such a big difference in the average latency between jiffies and hpet? The total latency of 80% of the recording is 2.2s with jiffies and 3.6s with hpet.

Created attachment 19996 [details]
Test patch for async page promotion
First attempt at doing sync promotion of async page waiting. It actually booted, however I haven't done any sort of testing with it yet.
Note that this will only work with CFQ currently.
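As a conceptual aid only (this is NOT the attached patch, and find_rq_for_page() is a hypothetical helper): the idea is that when a task blocks waiting on a page whose write is still queued as async, the io scheduler gets a hook to promote that request, e.g. by moving it straight onto the dispatch list. A rough sketch under those assumptions:

/*
 * Rough sketch only -- not attachment 19996. When someone waits on a
 * page backed by a queued async write, let the elevator expedite it.
 */
static void elv_wait_on_page(struct request_queue *q, struct page *page)
{
	struct request *rq = find_rq_for_page(q, page);	/* hypothetical lookup */

	if (rq && !rq_is_sync(rq)) {
		/*
		 * A real patch must also detach rq from the io scheduler's
		 * internal structures first (compare the cfq_remove_request
		 * crashes reported later in this thread).
		 */
		elv_dispatch_add_tail(q, rq);
	}
}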
Created attachment 19997 [details]
latencytop captures with clocksource hpet and patched kernel
Same test, with patched kernel and hpet as clocksource.
hpet:
min:10.1 ms|max:11733 ms|avg:3096.22|avg80:4082.79
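For clarity, the avg80 figures in these captures are trimmed means: as described in the earlier comment, the upper and lower 10% of the samples are filtered out before averaging. A minimal C sketch of that computation, assuming the latency values have already been parsed out of the capture (this is not the reporter's actual tooling):

#include <stdlib.h>

/* avg80: mean of the samples remaining after the top and bottom 10%
 * of the sorted latency values have been dropped. */
static int cmp_lat(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;
	return (x > y) - (x < y);
}

double avg80(double *lat_ms, size_t n)
{
	size_t cut = n / 10, i;	/* 10% of the samples on each side */
	double sum = 0.0;

	qsort(lat_ms, n, sizeof(*lat_ms), cmp_lat);
	for (i = cut; i < n - cut; i++)
		sum += lat_ms[i];
	return sum / (double)(n - 2 * cut);
}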
One observation is that ext4 seems quite latency-prone in waiting for write access to the journal. IIRC, that matches earlier results where ext3 was much quicker in that area. No idea what causes this, as I'm not familiar with the ext4 internals.

Another observation is that I neglected to include the buffer waiting in the async promotion; it only worked for page locking. I'll add an updated patch below after this posting.

And finally, lots of time is spent waiting for a new write request in the block layer. So you are maxing out all 128 requests in this test case. You can try and increase that to 512 for testing purposes, ala:

# echo 512 > /sys/block/sda/queue/nr_requests

That will get your async wait numbers down, but it may not reduce your latencies. Fact is that 128 writes is already a lot, and with more requests in the queue, you will have higher completion times for each individual request.

Created attachment 19998 [details]
Test patch for async page promotion v2
Attachment http://bugzilla.kernel.org/attachment.cgi?id=19998&action=view causes the following OOPS as soon as stress-testing starts. Is it possible that bdi->unplug_io_data can be NULL in blk_backing_dev_wop? Should we simply discard those?

[  138.345195] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  138.346301] IP: [<ffffffff803f997d>] elv_wait_on_page+0xd/0x20
[  138.346301] PGD 434c05067 PUD 434c06067 PMD 0
[  138.346301] Oops: 0000 [#1] PREEMPT SMP
[  138.346301] LTT NESTING LEVEL : 0
[  138.346301] last sysfs file: /sys/block/md1/md/raid_disks
[  138.346301] Dumping ftrace buffer:
[  138.346301]    (ftrace buffer empty)
[  138.346301] CPU 3
[  138.346301] Modules linked in: e1000e loop ltt_tracer ltt_trace_control ltt_type_serializer ltte
[  138.346301] Pid: 1272, comm: kjournald Not tainted 2.6.28.1 #69
[  138.346301] RIP: 0010:[<ffffffff803f997d>]  [<ffffffff803f997d>] elv_wait_on_page+0xd/0x20
[  138.346301] RSP: 0018:ffff88043cc19cd0  EFLAGS: 00010286
[  138.346301] RAX: 0000000000000000 RBX: ffff88043f460938 RCX: 0000000000000000
[  138.346301] RDX: ffff880438490000 RSI: ffffe200193f0bc0 RDI: ffff88043e580a40
[  138.346301] RBP: ffff88043cc19cd0 R08: ffff88043d09de78 R09: 0000000000000001
[  138.346301] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88043cc19d50
[  138.346301] R13: ffff88043cc19d60 R14: 0000000000000002 R15: ffff8800280590c8
[  138.346301] FS:  0000000000000000(0000) GS:ffff88043f804d00(0000) knlGS:0000000000000000
[  138.346301] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  138.346301] CR2: 0000000000000000 CR3: 0000000434817000 CR4: 00000000000006e0
[  138.346301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  138.346301] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  138.346301] Process kjournald (pid: 1272, threadinfo ffff88043cc18000, task ffff88043d09d8c0)
[  138.346301] Stack:
[  138.346301]  ffff88043cc19ce0 ffffffff803fd2a2 ffff88043cc19d00 ffffffff802f6762
[  138.346301]  ffff88043cc19d60 0000000000000000 ffff88043cc19d40 ffffffff8067ace2
[  138.346301]  ffffffff802f6710 ffff880438490000 0000000000000002 0000000000000002
[  138.346301] Call Trace:
[  138.346301]  [<ffffffff803fd2a2>] blk_backing_dev_wop+0x12/0x20
[  138.346301]  [<ffffffff802f6762>] sync_buffer+0x52/0x80
[  138.346301]  [<ffffffff8067ace2>] __wait_on_bit+0x62/0x90
[  138.346301]  [<ffffffff802f6710>] ? sync_buffer+0x0/0x80
[  138.346301]  [<ffffffff802f6710>] ? sync_buffer+0x0/0x80
[  138.346301]  [<ffffffff8067ad89>] out_of_line_wait_on_bit+0x79/0x90
[  138.346301]  [<ffffffff802566f0>] ? wake_bit_function+0x0/0x50
[  138.346301]  [<ffffffff802f6649>] __wait_on_buffer+0xf9/0x130
[  138.346301]  [<ffffffff8036c0c5>] journal_commit_transaction+0x7d5/0x1540
[  138.346301]  [<ffffffff80265991>] ? trace_hardirqs_on_caller+0x1b1/0x210
[  138.346301]  [<ffffffff8067d457>] ? _spin_unlock_irqrestore+0x47/0x80
[  138.346301]  [<ffffffff80249cef>] ? try_to_del_timer_sync+0x5f/0x70
[  138.346301]  [<ffffffff803708c8>] kjournald+0xe8/0x250
[  138.346301]  [<ffffffff802566b0>] ? autoremove_wake_function+0x0/0x40
[  138.346301]  [<ffffffff803707e0>] ? kjournald+0x0/0x250
[  138.346301]  [<ffffffff802561de>] kthread+0x4e/0x90
[  138.346301]  [<ffffffff80256190>] ? kthread+0x0/0x90
[  138.346301]  [<ffffffff8020d8d9>] child_rip+0xa/0x11
[  138.346301]  [<ffffffff8020cd58>] ? restore_args+0x0/0x30
[  138.346301]  [<ffffffff80256190>] ? kthread+0x0/0x90
[  138.346301]  [<ffffffff8020d8cf>] ? child_rip+0x0/0x11

Yes, that's expected, I didn't fixup the non-request_fn based drivers.
It's trickier to do for dm/md, since you need to know where that page went. Or you can just cycle all the bottom backing_dev_info's like it's done for unplug. I'll be back at the machine in an hour or two; I'll update the patch for dm/md then.

Created attachment 20001 [details]
Test patch for async page promotion v2
Adds support for raid0/1/10/5 and should not oops on dm (just not work as intended, it'll do nothing).
There's still the debug printk in there that notifies you of when something has happened, ala:
$ dmesg | tail
cfq: moving e4a348d4 to dispatch
cfq: moving e49dede4 to dispatch
cfq: moving f687d8d4 to dispatch
Another question - are people using CONFIG_NO_HZ or not?

(In reply to comment #106)
> Another question - are people using CONFIG_NO_HZ or not?
Yes, I am.

(In reply to comment #106)
> Another question - are people using CONFIG_NO_HZ or not?
As am I.

(In reply to comment #106)
Me currently too.

So my next question would be: does disabling that option make any difference?

We are not using CONFIG_NO_HZ and get high latency (subjective) while running:

dd if=/dev/zero of=file bs=1M count=2048

Additionally, all 8 CPU cores go to at least 50% iowait, and several peg at ~95%. We see similar results with 2.6.18, 2.6.24, deadline and cfq.

Created attachment 20024 [details]
2.6.25.20 fio test with NOHZ disabled
Created attachment 20025 [details]
2.6.25.20 fio test with NOHZ enabled
What is the preferred way of testing different kernels against this bug? I've done Mathieu's fio test, but I'm not sure it gives a detailed clue about the problem. I've attached the results.

Created attachment 20026 [details]
latencytop captures with clocksource hpet with nohz and no high resolution timer

hpet - no hz - no high resolution timer
min:0 ms|max:10888.7 ms|avg:1311.17
hpet - no hz
min:2 ms|max:16980.9 ms|avg:1513.26

Same settings as in http://bugzilla.kernel.org/attachment.cgi?id=19954&action=view

hpet
min:0 ms|max:14777.7 ms|avg:1474.09
jiffies
min:0.1 ms|max:5442.1 ms|avg:834.12

Created attachment 20027 [details]
latencytop captures + fio results amd64
I have run the fio job on a different machine with two different discs. While running the fio job, I captured the latency with latencytop. Each test was executed twice: once with 2*bzip-urandom running and once without CPU load.
You can find the test results for every io scheduler in the archive.
100MB/s disk + 2bzip (2009-01-27.0847-2.6.28.2-acpi_pm)
100MB/s disk (2009-01-27.0908-2.6.28.2-acpi_pm)
40MB/s disk + 2bzip (2009-01-27.0934-2.6.28.2-acpi_pm)
40MB/s disk (2009-01-27.1029-2.6.28.2-acpi_pm)
Total latency - cfq
min: 133.3 ms | max: 18555.8 ms | avg: 5978.08
min: 25.5 ms | max: 5057.2 ms | avg: 1660.21
min: 369.0 ms | max: 11872.0 ms | avg: 3764.57
min: 557.0 ms | max: 12215.6 ms | avg: 3002.81
fio results - cfq
mint 25msec | maxt 1669msec
mint 23msec | maxt 1596msec
mint 77msec | maxt 2370msec
mint 106msec | maxt 738msec
// Adding myself to CC

(In reply to comment #101)
> One observation is that ext4 seems quite latency prone in waiting for write
> access to the journal. IIRC, that matches earlier results where ext3 was much
> quicker in that area. No idea what causes this, as I'm not familiar with the
> ext4 internals.

It is possible that the reduced latency on ext4 is a result of the increased write speed, which is nearly doubled. You can see in the results posted before (comment #116) a reduction on ext3 partitions with different hard drives.

I have really noticed this lately. I replaced an old server running an older kernel. The replacement hardware was more powerful by orders of magnitude. The I/O system in the old machine was a 4-disk hardware RAID 5 on 64-bit PCI with the very first SATA 10,000RPM WD Raptors (WD740-00FLA). The new machine has an 8-disk hardware RAID 5 using the new 300GB 10,000RPM VelociRaptor SATA drives on PCI-Express. The old machine had a Pentium 4 HT CPU; the new machine has a 4-core Core 2 CPU. All high-end gear. The new machine does get far better disk throughput; however, on the same workloads the latencies seem far higher, the interactivity of the machine is poor, and all CPU cores show high I/O waits. This machine serves an application that runs from Samba shares for 15 or so Windows workstations, which involves lots of file activity on large flat-file database files; some of the files are up to 4GB in size. The old server was very busy, yet not a huge amount of I/O wait was seen. On the new server, using a 2.6.18 kernel on an enterprise distro, the I/O waits are heaps higher, especially noticeable at backup times. Users of the system have noticed the extra latencies when the system is busy, and at these times the I/O waits are high. The server feels slower than the old machine, and that should not be so. Just thought I would let you know this info, as it seems hard to quantify this bug in real-world terms.

Just wanted to add a couple of links to places where some additional real-world experience is related, for whatever they might be worth:
http://forums.storagereview.net/index.php?s=121e3f0d26cbd551c84271019f82f6d3&showtopic=25923&st=0
http://community.novacaster.com/showarticle.pl?id=7395&n=8001

(In reply to comment #105)
I have tried the patch with 2.6.28.2 and 2.6.29-rc3 and always get a crash when I/O starts, sometimes even after the X server has started.

kernel 2.6.29-rc3, at cfq_remove_request 0xe3/0x251
0xffffffff811ca8fc is in cfq_remove_request (block/cfq-iosched.c:650).

645 {
646         struct cfq_queue *cfqq = RQ_CFQQ(rq);
647         struct cfq_data *cfqd = cfqq->cfqd;
648         const int sync = rq_is_sync(rq);
649
650         BUG_ON(!cfqq->queued[sync]);
651         cfqq->queued[sync]--;
652
653         elv_rb_del(&cfqq->sort_list, rq);
654

kernel 2.6.28.2, elv_rb_del+0x21/0x4b

394 }
395 EXPORT_SYMBOL(elv_rb_add);
396
397 void elv_rb_del(struct rb_root *root, struct request *rq)
398 {
399         BUG_ON(RB_EMPTY_NODE(&rq->rb_node));
400         rb_erase(&rq->rb_node, root);
401         RB_CLEAR_NODE(&rq->rb_node);
402 }
403 EXPORT_SYMBOL(elv_rb_del);

Could this be the same bug as http://lkml.org/lkml/2008/6/15/163 ? On the same system, which shows the same symptoms this bug describes, the following also happens: http://beheer.eduwijs.nl/kernellog-brikama.log

I need to say that changing the IO scheduler from CFQ to AS seems to help a bit. It will not solve the problem, but the system will be much more responsive.
System information:
IO Scheduler: AS (default is CFQ; using elevator=as)
Timer: hpet
CONFIG_NO_HZ=y
Kernel: Linux brikama 2.6.27-9-generic #1 SMP x86_64 GNU/Linux
Distro: Ubuntu 8.10 Intrepid amd64
CPU: Intel(R) Core(TM)2 CPU E8400 @ 3.00GHz (2 cores)
Memory: 4GB
Using LVM: yes
Using LVM encryption: no
LVM version: 2.02.39 (2008-06-27)
Library version: 1.02.27 (2008-06-25)
Driver version: 4.14.0
Using DM: yes
HDD: /dev/sda:
 Model=WDC WD5000AACS-00G8B1, FwRev=05.04C05, SerialNo=WD-WCAUF0869014
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=16384kB, MaxMultSect=16, MultSect=?0?
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=976773168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified: ATA/ATAPI-1,2,3,4,5,6,7

Anyone here managed to reproduce this problem on an AMD platform? Because I can't seem to be able to reproduce it. But both my 965GM and 945GM chipset motherboards have the problem with the T7600 and T9500 CPUs. My old Celeron has the same problem, but it doesn't feel like it freezes as much.

(In reply to comment #123)
> Anyone here managed to reproduce this problem on an AMD platform ? Because I
> can't seem to be able to reproduce it.

AMD on nForce4 here, running x86_64. Look over at the Gentoo forums; there is a long thread, and almost all of the people experiencing the problem there are on AMD.

The problem exists on an AMD platform too, but not as badly as on an Intel platform. By changing the clocksource to acpi_pm you can reduce the problem a bit on an Intel platform, but the system feels a little bit slower. Using ext4 reduces the problem enormously: even firefox is usable while eclipse is indexing the kernel build tree. The problem still exists under heavy I/O.

Sounds like the infamous ext3 fsync() issue is also a factor. Can you try mounting ext3 with -o data=writeback and see if that makes ext3 behave better?

On my machine (nForce 5, AMD Phenom II 940) I also experience huge slowdowns when performing I/O. For example, using Ben's

dd if=/dev/zero of=/tmp/test bs=1M count=1M

test, it takes me about 40 secs to spawn a shell (15 secs for konsole to open a new tab, and about 25 secs for the shell to actually spawn). This was conducted on a HD with ext4. Turning off swap helps a lot with launching a shell. On a more substantial note, I use Unison to sync files between various places, and when it is running my system is hardly responsive. This happens to me on ext4, ext3 and ReiserFS. Changing the scheduler to noop, the dd-and-open-shell test is very responsive (with swap both on and off), but any substantial usage, such as using firefox, is still slow, just as it is above with cfq. If I can free up some space on one of my partitions, I'm going to install some distro with a pre-2.6.18 kernel and "feel" what the performance is like. I think the appearance of this bug is conditioned on CPU speed and drive speed.

I have made some more tests. Currently I am using the following command, once with oflag=direct and once without:

for i in 1 2 ; do \
    dd if=/dev/zero of=test-$i bs=1M count=4K oflag=direct & echo test-$i; \
done
With ext3 the problem occurs immediately in both cases. With ext4 the problem occurs immediately without oflag=direct; with oflag=direct I can even use firefox, but sometimes the desktop is unusable. In direct mode, new applications do not start and disk-intensive operations take a long time, but I can move windows and switch desktops without problems, with io-wait at 60%. With dd in non-direct mode I can start new applications (it still takes a lot of time), but everything freezes from time to time and io-wait goes immediately to 100%.

I have captured some statistics by adding a printk with the duration time in the function __make_request (blk-core.c). The time is taken directly before and after the spin_lock_irq(q->queue_lock), and finally before the unlock. There is a dramatic difference between the requests per second in direct and non-direct mode.

W: wait time before entering the lock
D: duration of the make_request
T: total time = W + D

ext3 - direct
requests: 209.694080/s
total: W: 0.000645 / D: 0.014584 / T: 0.015229
W: avg: 0.000000307 / min: 0.000000000 / max: 0.000007606
D: avg: 0.000006948 / min: 0.000000255 / max: 0.000085018
T: avg: 0.000007255 / min: 0.000000365 / max: 0.000085018
4294967296 bytes (4.3 GB) copied, 203.66 s, 21.1 MB/s
4294967296 bytes (4.3 GB) copied, 203.582 s, 21.1 MB/s

ext3
requests: 4662.272968/s
total: W: 0.013624 / D: 15.256149 / T: 15.269773
W: avg: 0.000000291 / min: 0.000000000 / max: 0.000275893
D: avg: 0.000325819 / min: 0.000000000 / max: 1.092940760
T: avg: 0.000326110 / min: 0.000000000 / max: 1.092940920
4294967296 bytes (4.3 GB) copied, 203.559 s, 21.1 MB/s
4294967296 bytes (4.3 GB) copied, 214.995 s, 20.0 MB/s

ext4 - direct
requests: 114.510132/s
total: W: 0.000356 / D: 0.017658 / T: 0.018014
W: avg: 0.000000311 / min: 0.000000110 / max: 0.000000630
D: avg: 0.000015408 / min: 0.000000220 / max: 0.000127249
T: avg: 0.000015719 / min: 0.000000330 / max: 0.000127689
4294967296 bytes (4.3 GB) copied, 154.491 s, 27.8 MB/s
4294967296 bytes (4.3 GB) copied, 157.853 s, 27.2 MB/s

ext4
requests: 7009.744726/s
total: W: 0.018928 / D: 6.110891 / T: 6.129819
W: avg: 0.000000270 / min: 0.000000000 / max: 0.000032916
D: avg: 0.000087046 / min: 0.000000000 / max: 0.603327176
T: avg: 0.000087316 / min: 0.000000000 / max: 0.603327516
4294967296 bytes (4.3 GB) copied, 146.303 s, 29.4 MB/s
4294967296 bytes (4.3 GB) copied, 149.361 s, 28.8 MB/s

And some test results with clocksource=jiffies instead of hpet (non-direct only), which runs much better on my machine. The total times are added up over an interval of 10s; the 15s for ext3 above should come from the two cores.

ext3
requests: 4050.191168/s
total: W: 0.018617 / D: 3.714917 / T: 3.733534
W: avg: 0.000000459 / min: 0.000000000 / max: 0.000048408
D: avg: 0.000091496 / min: 0.000000000 / max: 0.615268038
T: avg: 0.000091954 / min: 0.000000000 / max: 0.615268379
4294967296 bytes (4.3 GB) copied, 213.215 s, 20.1 MB/s
4294967296 bytes (4.3 GB) copied, 222.198 s, 19.3 MB/s

ext4
requests: 6006.413044/s
total: W: 0.026263 / D: 3.681891 / T: 3.708154
W: avg: 0.000000431 / min: 0.000000000 / max: 0.001003075
D: avg: 0.000060427 / min: 0.000000000 / max: 0.344179020
T: avg: 0.000060858 / min: 0.000000000 / max: 0.344179370
4294967296 bytes (4.3 GB) copied, 147.343 s, 29.1 MB/s
4294967296 bytes (4.3 GB) copied, 146.386 s, 29.3 MB/s
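A minimal sketch of the kind of instrumentation described above, assuming a 2.6.28-era block/blk-core.c; the timestamp points and printk format are illustrative, not the actual debug patch:

/* Illustrative only: measure how long __make_request() waits for and
 * then holds q->queue_lock, matching the W/D/T columns above. */
static int __make_request(struct request_queue *q, struct bio *bio)
{
	unsigned long long t0, t1, t2;

	t0 = sched_clock();		/* before contending for the lock */
	spin_lock_irq(q->queue_lock);
	t1 = sched_clock();		/* lock acquired: W = t1 - t0 */

	/* ... the existing merge and request-allocation logic ... */

	t2 = sched_clock();		/* before release: D = t2 - t1 */
	spin_unlock_irq(q->queue_lock);

	printk(KERN_DEBUG "make_request: W=%lluns D=%lluns T=%lluns\n",
	       t1 - t0, t2 - t1, t2 - t0);
	return 0;
}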
Can you try with this simple patch applied?

diff --git a/block/blk.h b/block/blk.h
index 6e1ed40..a145c3a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -5,7 +5,7 @@
 #define BLK_BATCH_TIME	(HZ/50UL)
 
 /* Number of requests a "batching" process may submit */
-#define BLK_BATCH_REQ	32
+#define BLK_BATCH_REQ	1
 
 extern struct kmem_cache *blk_requestq_cachep;
 extern struct kobj_type blk_queue_ktype;

I would say there are no changes; perhaps a little bit worse. There are still freezes with non-direct write access, e.g. while painting circles in gimp. No freezes with direct-io, but high latency with concurrent disk access (as before).

ext3 - direct
requests: 205.795295/s
total: W: 0.000616 / D: 0.011195 / T: 0.011811
W: avg: 0.000000299 / min: 0.000000000 / max: 0.000007085
D: avg: 0.000005434 / min: 0.000000000 / max: 0.000100447
T: avg: 0.000005733 / min: 0.000000000 / max: 0.000100958
4294967296 bytes (4.3 GB) copied, 210.281 s, 20.4 MB/s
4294967296 bytes (4.3 GB) copied, 210.525 s, 20.4 MB/s

ext3
requests: 4960.868922/s
total: W: 0.032503 / D: 21.032077 / T: 21.064580
W: avg: 0.000000655 / min: 0.000000000 / max: 0.000069624
D: avg: 0.000423863 / min: 0.000000000 / max: 0.415194973
T: avg: 0.000424518 / min: 0.000000000 / max: 0.415195303
requests: 3588.105593/s
total: W: 0.014912 / D: 10.578434 / T: 10.593346
W: avg: 0.000000415 / min: 0.000000000 / max: 0.000077581
D: avg: 0.000294754 / min: 0.000000000 / max: 0.447073476
T: avg: 0.000295170 / min: 0.000000000 / max: 0.447073806
4294967296 bytes (4.3 GB) copied, 218.708 s, 19.6 MB/s
4294967296 bytes (4.3 GB) copied, 228.355 s, 18.8 MB/s

ext4 - direct
requests: 115.981745/s
total: W: 0.000344 / D: 0.016716 / T: 0.017061
W: avg: 0.000000297 / min: 0.000000110 / max: 0.000025846
D: avg: 0.000014398 / min: 0.000000650 / max: 0.000075554
T: avg: 0.000014695 / min: 0.000000990 / max: 0.000076195
4294967296 bytes (4.3 GB) copied, 156.476 s, 27.4 MB/s
4294967296 bytes (4.3 GB) copied, 157.78 s, 27.2 MB/s

ext4
requests: 7556.114616/s
total: W: 0.029942 / D: 9.424271 / T: 9.454213
W: avg: 0.000000396 / min: 0.000000000 / max: 0.000127857
D: avg: 0.000124722 / min: 0.000000000 / max: 0.046151790
T: avg: 0.000125119 / min: 0.000000000 / max: 0.046152130
4294967296 bytes (4.3 GB) copied, 147.553 s, 29.1 MB/s
4294967296 bytes (4.3 GB) copied, 151.226 s, 28.4 MB/s

(In reply to comment #130)
> Can you try with this simple patch applied?
>
> -#define BLK_BATCH_REQ 32
> +#define BLK_BATCH_REQ 1

Hi Jens, I tried it on a 2.6.29-rc3 kernel. It made things worse for the "default" config, but did help with config1 (fio "ssh" test bench; config1: quantum=1, slice_async_rq=1, queue_depth=1).

max runt, 2.6.29-rc3 default, no patch: 14247msec
max runt, 2.6.29-rc3 default, patch: 30833msec
max runt, 2.6.29-rc3 config1, no patch: 7574msec
max runt, 2.6.29-rc3 config1, patch: 6585msec

Note that the results seem to indicate that the larger run times occur near the "write" job. The listings below show the runtime of the jobs (one large write and many 2M reads executed at a regular interval for most of the load, ending with more randomly delayed jobs) in the order they were run.
Note that all the read jobs are started at a 4s interval, except the last two jobs, which are started after 50s for the first one and after another 10s for the last one.

Here is the listing of the 2.6.29-rc3 default no patch:

write: io=10240MiB, bw=56062KiB/s, iops=53, runt=191526msec
read : io=2052KiB, bw=3411KiB/s, iops=141, runt= 616msec
read : io=2084KiB, bw=409KiB/s, iops=16, runt= 5215msec
read : io=2060KiB, bw=349KiB/s, iops=15, runt= 6031msec
read : io=2060KiB, bw=445KiB/s, iops=17, runt= 4731msec
read : io=2068KiB, bw=377KiB/s, iops=14, runt= 5606msec
read : io=2084KiB, bw=558KiB/s, iops=23, runt= 3824msec
read : io=2056KiB, bw=398KiB/s, iops=15, runt= 5279msec
read : io=2048KiB, bw=328KiB/s, iops=13, runt= 6393msec
read : io=2056KiB, bw=337KiB/s, iops=12, runt= 6236msec
read : io=2072KiB, bw=596KiB/s, iops=23, runt= 3558msec
read : io=2068KiB, bw=448KiB/s, iops=17, runt= 4723msec
read : io=2052KiB, bw=342KiB/s, iops=14, runt= 6143msec
read : io=2056KiB, bw=448KiB/s, iops=19, runt= 4695msec
read : io=2060KiB, bw=362KiB/s, iops=14, runt= 5814msec
read : io=2072KiB, bw=1202KiB/s, iops=44, runt= 1765msec
read : io=2048KiB, bw=395KiB/s, iops=17, runt= 5308msec
read : io=2056KiB, bw=434KiB/s, iops=17, runt= 4851msec
read : io=2064KiB, bw=382KiB/s, iops=14, runt= 5521msec
read : io=2072KiB, bw=412KiB/s, iops=16, runt= 5144msec
read : io=2052KiB, bw=439KiB/s, iops=17, runt= 4784msec
read : io=2076KiB, bw=408KiB/s, iops=15, runt= 5209msec
read : io=2084KiB, bw=405KiB/s, iops=15, runt= 5263msec
read : io=2052KiB, bw=379KiB/s, iops=14, runt= 5543msec
read : io=2076KiB, bw=438KiB/s, iops=18, runt= 4852msec
read : io=2052KiB, bw=1016KiB/s, iops=38, runt= 2068msec
read : io=2056KiB, bw=227KiB/s, iops=9, runt= 9271msec
read : io=2072KiB, bw=1256KiB/s, iops=48, runt= 1689msec
read : io=2048KiB, bw=347KiB/s, iops=13, runt= 6036msec
read : io=2068KiB, bw=594KiB/s, iops=24, runt= 3562msec
read : io=2052KiB, bw=415KiB/s, iops=16, runt= 5057msec
read : io=2052KiB, bw=326KiB/s, iops=14, runt= 6430msec
read : io=2064KiB, bw=394KiB/s, iops=16, runt= 5362msec
read : io=2068KiB, bw=280KiB/s, iops=12, runt= 7553msec
read : io=2064KiB, bw=364KiB/s, iops=15, runt= 5806msec
read : io=2052KiB, bw=1001KiB/s, iops=41, runt= 2098msec
read : io=2084KiB, bw=490KiB/s, iops=18, runt= 4352msec
read : io=2056KiB, bw=1197KiB/s, iops=51, runt= 1758msec
read : io=2048KiB, bw=471KiB/s, iops=19, runt= 4444msec
read : io=2052KiB, bw=158KiB/s, iops=6, runt= 13259msec
read : io=2052KiB, bw=147KiB/s, iops=6, runt= 14247msec
read : io=2060KiB, bw=3906KiB/s, iops=148, runt= 540msec

Here is the listing of the 2.6.29-rc3 default patch:

write: io=10240MiB, bw=54981KiB/s, iops=52, runt=195291msec
read : io=2072KiB, bw=3843KiB/s, iops=159, runt= 552msec
read : io=2080KiB, bw=4302KiB/s, iops=173, runt= 495msec
read : io=2064KiB, bw=276KiB/s, iops=11, runt= 7642msec
read : io=2056KiB, bw=462KiB/s, iops=18, runt= 4552msec
read : io=2064KiB, bw=311KiB/s, iops=12, runt= 6790msec
read : io=2076KiB, bw=832KiB/s, iops=34, runt= 2554msec
read : io=2052KiB, bw=298KiB/s, iops=12, runt= 7038msec
read : io=2048KiB, bw=493KiB/s, iops=20, runt= 4250msec
read : io=2048KiB, bw=310KiB/s, iops=12, runt= 6746msec
read : io=2060KiB, bw=595KiB/s, iops=24, runt= 3542msec
read : io=2068KiB, bw=280KiB/s, iops=12, runt= 7542msec
read : io=2056KiB, bw=506KiB/s, iops=20, runt= 4155msec
read : io=2052KiB, bw=352KiB/s, iops=13, runt= 5953msec
read : io=2068KiB, bw=1778KiB/s, iops=73, runt= 1191msec
read : io=2080KiB, bw=239KiB/s, iops=9, runt= 8885msec
read : io=2064KiB, bw=790KiB/s, iops=31, runt= 2675msec
read : io=2048KiB, bw=235KiB/s, iops=9, runt= 8900msec
read : io=2052KiB, bw=395KiB/s, iops=16, runt= 5312msec
read : io=2048KiB, bw=490KiB/s, iops=20, runt= 4279msec
read : io=2048KiB, bw=350KiB/s, iops=14, runt= 5991msec
read : io=2060KiB, bw=289KiB/s, iops=13, runt= 7296msec
read : io=2060KiB, bw=392KiB/s, iops=14, runt= 5368msec
read : io=2048KiB, bw=323KiB/s, iops=13, runt= 6487msec
read : io=2052KiB, bw=442KiB/s, iops=17, runt= 4753msec
read : io=2056KiB, bw=382KiB/s, iops=15, runt= 5506msec
read : io=2052KiB, bw=299KiB/s, iops=11, runt= 7005msec
read : io=2052KiB, bw=372KiB/s, iops=15, runt= 5647msec
read : io=2068KiB, bw=512KiB/s, iops=18, runt= 4136msec
read : io=2056KiB, bw=326KiB/s, iops=13, runt= 6453msec
read : io=2060KiB, bw=765KiB/s, iops=30, runt= 2756msec
read : io=2052KiB, bw=392KiB/s, iops=15, runt= 5357msec
read : io=2060KiB, bw=420KiB/s, iops=19, runt= 5013msec
read : io=2052KiB, bw=307KiB/s, iops=12, runt= 6838msec
read : io=2056KiB, bw=724KiB/s, iops=33, runt= 2905msec
read : io=2052KiB, bw=407KiB/s, iops=16, runt= 5153msec
read : io=2048KiB, bw=417KiB/s, iops=15, runt= 5021msec
read : io=2048KiB, bw=345KiB/s, iops=15, runt= 6069msec
read : io=2048KiB, bw=451KiB/s, iops=21, runt= 4643msec
read : io=2048KiB, bw=68KiB/s, iops=2, runt= 30833msec
read : io=2048KiB, bw=121KiB/s, iops=5, runt= 17290msec
read : io=2052KiB, bw=3876KiB/s, iops=167, runt= 542msec

Here is the listing of the 2.6.29-rc3 config1 no patch:

write: io=10240MiB, bw=61068KiB/s, iops=58, runt=175827msec
read : io=2048KiB, bw=4185KiB/s, iops=167, runt= 501msec
read : io=2056KiB, bw=3814KiB/s, iops=161, runt= 552msec
read : io=2056KiB, bw=448KiB/s, iops=17, runt= 4692msec
read : io=2056KiB, bw=1070KiB/s, iops=42, runt= 1966msec
read : io=2052KiB, bw=424KiB/s, iops=16, runt= 4946msec
read : io=2076KiB, bw=512KiB/s, iops=19, runt= 4149msec
read : io=2076KiB, bw=580KiB/s, iops=25, runt= 3664msec
read : io=2052KiB, bw=470KiB/s, iops=18, runt= 4467msec
read : io=2068KiB, bw=624KiB/s, iops=26, runt= 3390msec
read : io=2060KiB, bw=929KiB/s, iops=39, runt= 2270msec
read : io=2064KiB, bw=508KiB/s, iops=19, runt= 4160msec
read : io=2076KiB, bw=659KiB/s, iops=26, runt= 3224msec
read : io=2080KiB, bw=366KiB/s, iops=14, runt= 5819msec
read : io=2064KiB, bw=1023KiB/s, iops=42, runt= 2066msec
read : io=2060KiB, bw=322KiB/s, iops=13, runt= 6540msec
read : io=2060KiB, bw=1383KiB/s, iops=52, runt= 1525msec
read : io=2052KiB, bw=691KiB/s, iops=26, runt= 3039msec
read : io=2064KiB, bw=444KiB/s, iops=20, runt= 4755msec
read : io=2080KiB, bw=551KiB/s, iops=20, runt= 3860msec
read : io=2084KiB, bw=743KiB/s, iops=29, runt= 2870msec
read : io=2056KiB, bw=412KiB/s, iops=16, runt= 5106msec
read : io=2056KiB, bw=406KiB/s, iops=15, runt= 5179msec
read : io=2048KiB, bw=465KiB/s, iops=19, runt= 4507msec
read : io=2060KiB, bw=446KiB/s, iops=15, runt= 4725msec
read : io=2068KiB, bw=467KiB/s, iops=20, runt= 4528msec
read : io=2052KiB, bw=461KiB/s, iops=18, runt= 4557msec
read : io=2076KiB, bw=628KiB/s, iops=25, runt= 3385msec
read : io=2052KiB, bw=518KiB/s, iops=23, runt= 4054msec
read : io=2068KiB, bw=492KiB/s, iops=20, runt= 4296msec
read : io=2048KiB, bw=543KiB/s, iops=21, runt= 3858msec
read : io=2048KiB, bw=559KiB/s, iops=20, runt= 3750msec
read : io=2064KiB, bw=646KiB/s, iops=26, runt= 3270msec
read : io=2056KiB, bw=426KiB/s, iops=17, runt= 4938msec
read : io=2052KiB, bw=741KiB/s, iops=29, runt= 2835msec
read : io=2048KiB, bw=453KiB/s, iops=19, runt= 4621msec
read : io=2072KiB, bw=579KiB/s, iops=24, runt= 3662msec
read : io=2068KiB, bw=418KiB/s, iops=16, runt= 5066msec
read : io=2056KiB, bw=2101KiB/s, iops=82, runt= 1002msec
read : io=2072KiB, bw=280KiB/s, iops=11, runt= 7574msec
read : io=2048KiB, bw=4877KiB/s, iops=190, runt= 430msec
read : io=2076KiB, bw=4160KiB/s, iops=168, runt= 511msec

And, for comparison, here is the listing of the 2.6.29-rc3 config1 patch:

write: io=10240MiB, bw=59607KiB/s, iops=56, runt=180134msec
read : io=2068KiB, bw=4152KiB/s, iops=162, runt= 510msec
read : io=2060KiB, bw=4185KiB/s, iops=168, runt= 504msec
read : io=2064KiB, bw=508KiB/s, iops=21, runt= 4157msec
read : io=2060KiB, bw=476KiB/s, iops=19, runt= 4425msec
read : io=2056KiB, bw=444KiB/s, iops=18, runt= 4738msec
read : io=2084KiB, bw=525KiB/s, iops=21, runt= 4063msec
read : io=2072KiB, bw=481KiB/s, iops=20, runt= 4406msec
read : io=2084KiB, bw=565KiB/s, iops=22, runt= 3777msec
read : io=2048KiB, bw=498KiB/s, iops=20, runt= 4209msec
read : io=2068KiB, bw=544KiB/s, iops=21, runt= 3888msec
read : io=2080KiB, bw=389KiB/s, iops=15, runt= 5462msec
read : io=2068KiB, bw=1384KiB/s, iops=55, runt= 1529msec
read : io=2072KiB, bw=444KiB/s, iops=18, runt= 4774msec
read : io=2064KiB, bw=320KiB/s, iops=12, runt= 6585msec
read : io=2060KiB, bw=630KiB/s, iops=28, runt= 3348msec
read : io=2064KiB, bw=428KiB/s, iops=15, runt= 4931msec
read : io=2052KiB, bw=422KiB/s, iops=15, runt= 4973msec
read : io=2056KiB, bw=480KiB/s, iops=21, runt= 4385msec
read : io=2060KiB, bw=1453KiB/s, iops=61, runt= 1451msec
read : io=2076KiB, bw=426KiB/s, iops=16, runt= 4983msec
read : io=2052KiB, bw=735KiB/s, iops=28, runt= 2855msec
read : io=2060KiB, bw=427KiB/s, iops=16, runt= 4939msec
read : io=2064KiB, bw=508KiB/s, iops=19, runt= 4158msec
read : io=2064KiB, bw=511KiB/s, iops=21, runt= 4134msec
read : io=2052KiB, bw=538KiB/s, iops=20, runt= 3900msec
read : io=2048KiB, bw=454KiB/s, iops=18, runt= 4612msec
read : io=2052KiB, bw=520KiB/s, iops=21, runt= 4034msec
read : io=2064KiB, bw=505KiB/s, iops=19, runt= 4183msec
read : io=2052KiB, bw=414KiB/s, iops=17, runt= 5074msec
read : io=2068KiB, bw=520KiB/s, iops=19, runt= 4065msec
read : io=2048KiB, bw=392KiB/s, iops=15, runt= 5349msec
read : io=2064KiB, bw=671KiB/s, iops=27, runt= 3148msec
read : io=2068KiB, bw=551KiB/s, iops=21, runt= 3843msec
read : io=2056KiB, bw=665KiB/s, iops=28, runt= 3162msec
read : io=2084KiB, bw=606KiB/s, iops=23, runt= 3518msec
read : io=2056KiB, bw=346KiB/s, iops=14, runt= 6076msec
read : io=2056KiB, bw=452KiB/s, iops=19, runt= 4656msec
read : io=2076KiB, bw=495KiB/s, iops=20, runt= 4291msec
read : io=2052KiB, bw=407KiB/s, iops=17, runt= 5152msec
read : io=2068KiB, bw=2267KiB/s, iops=92, runt= 934msec
read : io=2064KiB, bw=4080KiB/s, iops=144, runt= 518msec

I start to think that I should put more than a 4s delay between the jobs, since the duration of the reads is always around those 4s. Things become more interesting with the 50s delay, probably because the read queue is then empty. Mathieu

(edit) Note that the results seem to indicate that the larger run times occur near the "write" job *end*.

Hi. On my laptop (Core2Duo 1.6 GHz) I run my gentoo kernel, 2.6.28-gentoo, and I don't have any problems with latency. If I run "dd if=/dev/zero of=file bs=1M count=2048" or "dd if=/dev/zero of=/tmp/test bs=1M count=1M" (I tried running them both as a user and as root), my system works well and I can start firefox, another shell, open dolphin (I'm under kde4-svn), and everything stays fast. I have an XFS filesystem on my home and reiserfs on root.
Since I configured my kernel manually, maybe it could be useful for someone to have my .config, so I'll post it.

Created attachment 20105 [details]
With this .config I don't have the latency bug.
My 2.6.28 .config. Everything is OK with this .config; I didn't have any slowdowns running "dd if=/dev/zero of=/tmp/test bs=1M count=1M" on my Core2Duo laptop (1.6 GHz).
After looking through Alexsandar's kernel config I decided to try a new config. Changing my kernel from 250HZ and Voluntary Kernel Preemption to 1000HZ and Preemptible Kernel (Low-Latency Desktop), I can actually open tabs in firefox, open new terminals, or SSH into my computer (from itself) without waiting 10-30 seconds. Perhaps there is no bug and this is just expected behavior. I wonder whether it was the clock change or the preemption change which made the difference, or both. For those of you who have this problem: what is your HZ and preemption model?

Enabling the 1000Hz timer frequency and Low-Latency Desktop as the preemption model does not solve the problem for me. The mouse still freezes, and I cannot move windows or switch between desktops under heavy I/O. The duration of these freezes is now reduced to less than 3s, the freeze interval is 2-10s, and the desktop is still unusable for me.

Maybe it's not only the preemption and the frequency. I think one of these things could be:

General setup:
- Control Group support DISABLED
- Group CPU Scheduler DISABLED
- Enable full-sized data structures for core ENABLED
- Enable futex support ENABLED
- Use full shmem filesystem ENABLED
- Enable AIO support ENABLED
- SLAB Allocator: SLUB

Processor type and features (ENABLED):
- Tickless System (NO_HZ)
- High Resolution Timer Support
- HPET Timer Support
- Multi-core scheduler support
- Preemptible RCU
- 64 bit Memory and IO resources
- Add LRU list to track non-evictable pages

Good luck...

I think it would be great if someone from the kernel side could take a look at this. Linux is starting to lose its advantage in performance tests because of this problem. Is there any kernel developer who can address this issue?

(In reply to comment #136)
> For those of you who have this problem what is your HZ and preemption model?
I'm currently using Voluntary Preemption and HZ=1000. However, I think we're probably losing focus here. Just randomly changing configurations seems like grasping at straws to me. There are far too many potentially relevant configuration options to realistically test them all. If we are going to make progress, we are going to have to use more targeted investigation.

(In reply to comment #139)
> Is there any kernel developer who can address this issue?
Jens Axboe has sent us a few patches, although he doesn't seem to have a lot of time to dedicate to the issue. Honestly, I think we might need to find a distribution with a block layer developer on the payroll who could focus on this issue until it is solved. In my discussions on #fedora-kernel, it doesn't look like Red Hat has such a person. I haven't received any responses one way or another on #ubuntu-kernel with respect to Canonical. Does anyone know of a company that might have someone with the requisite skill set to debug this issue?

Jens, do you think you'll be able to work on this bug sustainably? (Thanks for your work so far, by the way.) I think it would be amazing if we could give 2.6.29 proper I/O performance. I know it's getting late considering we're at -rc3, but this bug has been with us for far too long.

Well, I'm fairly certain at least part of the issue is a scheduler bug. Just now I was running make modules_install for a few kernels and after some time found that specific processes had stopped responding. This pattern continued, with more and more processes blocking. Eventually the entire X session stopped responding. For a while I could maintain an SSH session and found that I/O wait time was 40%, with the rest of the CPU time going idle.
After some time, however, even the ssh session stopped responding. This is the third time I have seen behavior like this, with the previous instances involving copying 15GB of data between external hard drives.

Also, Jens, what do you think is the most useful benchmark we've seen here? Testers have used several benchmarks, including dd and various fio jobs. Would it help if we standardized on a single benchmark?

The best illustration of this behavior seems to be comments #128, #129 and #131. IMHO this illustrates that most CPU is burned on a spinlock. If the time spent inside the critical section also increases (which it does), there is IMHO a strong indication that there must be another (spin-)lock inside this code path. Currently I'm looking into mm/filemap.c.

My own testing consists of a toy search engine I am developing. It uses the maximum number of mmap()ed files (32K or 64K); the program maintains its own LRU. In the first stage of its indexer, it just reads mmap()ed pages, maybe dirtying them. When it is done, it munmap()s them (causing the buffers to be written back to backing store). The frozen cursor and non-responsive system only occur during the first phase; during the writing phase, things are back to normal again. IMHO, this could mean two things:
1) There is a funneling lock in the read() pathway
2) The mm runs into the mud
Sorry, I wish I could spend more time on it.

I'll be on vacation for the next 9 days, so no response until the week beginning on Feb 16th. I'll try and set aside a few days to work on it then. With complete freezing of the mouse, it does look like some sort of spinning issue. To that extent, the most valuable information would be profiling from the 5 seconds surrounding a freeze. Hard to do, but it would be very valuable. People seem to be certain that this is a block layer issue; I'm far from convinced that is the case.

I have limited the usage of generic_file_aio_write in filemap.c for every process. First I limited the throughput of every process: when the overall throughput was below disc capacity, there were no more freezes of the mouse; when it was above disc capacity, the problem appeared immediately. Then I limited the usage to a maximum of 20% of the interval time for every process, suspending the thread when it needed more. The problem was present as before, as every 20th request __generic_file_aio_write_nolock needs more than 2s to finish. I tried the same for the cfq scheduler in cfq_choose_req and added penalties for processes with heavy I/O, but the pid is not correctly set for all cfq_queues and I got a kernel panic after a while. Before the kernel panic there was no improvement.

Created attachment 20148 [details]
Graph of I/O waits on CPU Core 0
Running dd if=/dev/zero of=/storage/hwraid0/test1 bs=1M count=1M
On my AMD Phenom 9950 Quad-Core Processor running a distro kernel (2.6.27.12-170.2.5.fc10.x86_64). This test was run against an XFS file system on an 8-disk PCI-Express hardware RAID card. I get the same if I run against ext4 on the same hardware, and similar results on this machine with a single 10,000RPM drive connected to the motherboard's SATA with ext4.
When this test was running the system was very unresponsive. In a different test run I launched evolution and it took around 60 seconds to load.
[root@bajor hwraid0]# dd if=/dev/zero of=/storage/hwraid0/test1 bs=1M count=1M
436560+0 records in
436560+0 records out
457766338560 bytes (458 GB) copied, 2535.92 s, 181 MB/s
<stupidmetoopost />
Now at least I know what's going on. It seems like it's somehow coupled with mm, because when this happens a) I can see invocations of the oom_killer in the logs after reboot, and b) SYSRQ + sync & unmount actions do not end the furious HDD LED flashing, so I presume the kernel is misusing swap space. By the way, this is very indeterminate, and simply doing the same thing again will not reproduce the problem... so my vote is for, uhm, a race condition or spinlock recursion, too.

P5K, CPU - Core 2 Duo E8400, connected to the motherboard's (ICH9) SATA - ST31000340AS, openSUSE 11.1, kernel - 2.6.28.3

yura@suse:~> dd if=/dev/zero of=test1 bs=1M count=1M
^C
128443+0 records in
128443+0 records out
134682247168 bytes (135 GB), 1872.43 s, 71.9 MB/s

vmstat 1 (fragment):

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa
 1  7      0  46780      0 7750564    0    0  4880 12808  838  1491  2  3  0 95
 0  9      0  45480      0 7751004    0    0  1876 36888 1268  2506  2  5  0 93
 1  9      0  45420      0 7752056    0    0  7120 12296  705  1790  1  3  0 96
 0  7      0  43924      0 7751636    0    0  1416 36888  979  2178  3  4  0 93
 0  8      0  44148      0 7751480    0    0   900 28176  672  1444  2  3  0 95
 0  2      0  54144      0 7753008    0    0  2468 24680  649  1191  2  3  0 95
 0 11      0  46420      0 7757720    0    0  1508 72240  994  1696  2  7  2 88
 4 10      0  43348      0 7749244    0    0  5212 51212 1247  2436  6  6  0 87
 1 10      0  46256      0 7749752    0    0  1268 42504  799  1963  2  5  0 93
 0  1      0  45468      0 7757836    0    0     0 81959 1126  2249  1  9  6 84
 0  1      0  43880      0 7758912    0    0     0 71736  830  1818  1  8 31 60
 0 10      0  43280      0 7756472    0    0     0 59473  998  1879  1  5  8 85
 1  9      0  46832      0 7748176    0    0     0 81996 1114  2332  1  8  0 91
 0 10      0  46652      0 7747356    0    0     0 79920  867  1748  1  8  0 91
 0 10      0  45836      0 7747508    0    0     0 76848 1021  1947  1  8  0 91
 0 10      0  46724      0 7751964    0    0     0 52272  821  1775  1  6  0 93
 0 10      0  44388      0 7754660    0    0     0 77896 1054  2230  1  7  0 92
 0  6      0  45672      0 7755792    0    0     0 71736 1343  2886  1  7  0 91
 1  8      0  44624      0 7756444    0    0     0 77863  826  1736  0  7  0 92
 1  6      0  43132      0 7757664    0    0     0 63560 1036  1911  1  7  0 91
 0  3      0  43200      0 7757936    0    0     0 77896  721  1539  1  6  0 92
 1 11      0  46716      0 7760684    0    0   428 63544 1538  2789 12  8  0 79
 0 10      0  44808      0 7756940    0    0  6876 31248 1241  2857  4  4  0 91

The system dies. Calling up the KDE main menu is extremely inconvenient; about the rest I will stay silent.

Ah!! I think that could be the problem. I ran the dd test with a large file (20GB) on my machine with 16GB. Looking at top while it runs shows me that the available memory steadily shrinks, all of it being incrementally reserved for cache. It actually shrinks down to 80kB. Starting from that point, I experience lags when I type "ls". So... I think this could be the problem. Is there any reason why the memory used for cache is allowed to grow out of proportion like this? Mathieu

(In reply to comment #146)
> Now at least i know what's going on.. it seems like its somehow coupled with
> mm because when this happens a) i can see invocations of the oom_killer in
> the logs after reboot and b) SYSRQ + sync & unmount action do not end the
> furious HDD LED flashing so i presume the kernel is misusing swapspace..

Well actually it is worse than that. If you have not tuned vm.swappiness to something much lower than the default of 60 (1 or something), the kernel will also start swapping out stuff to free memory. I don't know a way to limit the cache memory's size.
There seems to be some information about how to tune this here; trying out parameter variations would be interesting: http://www.westnet.com/~gsmith/content/linux-pdflush.htm Mathieu

echo "1" > dirty_background_ratio
echo "1" > dirty_ratio
echo "3" > drop_caches

and vmstat says:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free  buff  cache   si   so   bi   bo   in   cs us sy id wa
 0  2 355844 427256  3508  67544   10   21  315  180  459  781  5  3 80 12

then after doing a 10GB dd operation vmstat says:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free  buff  cache   si   so   bi   bo   in   cs us sy id wa
 1  0 355872  24532  8656 457200   10   21  338  497  456  763  5  3 79 13

So if I read the numbers correctly, around 400 MB of memory has now been used for caches. Hmm, that doesn't match setting dirty_background_ratio and dirty_ratio to 1: since I have 1GB of memory, only 1% (10 MB) should be allowed to be dirty before applications are forced to wait. But this is apparently not the cause here.

In __block_write_full_page (buffer.c), nearly all submits to the block device are caused by pdflush. At the beginning there are submits of 300MB on a VM with 384MB. After that, the dd processes submit the data directly. As soon as there is available memory, it is filled and submitted immediately by pdflush. The 300MB are submitted at once, or nearly at once. On the VM there is the following scheme, caused by the double buffering (VM/host):

At 67.506825: 300MB (pdflush), 100MB (dd processes)
At 72.750497: 300MB (pdflush), 100MB (dd processes)
At 74.215577: 50MB (pdflush) // host cache filled
...

My guess is that the dirty pages are not accounted correctly by create_empty_buffers in __block_write_full_page. I currently don't know how to check this, as I have just started to read and understand the kernel code.

The following solution works for me. I use cgroups to limit the amount of memory dd can use. That shows that there is a problem with the kernel otherwise allowing the page cache to take _all_ the available kernel memory.
mkdir -p /cgroups
mount -t cgroup none /cgroups -o memory
mkdir /cgroups/0
echo $$ > /cgroups/0/tasks
echo 4M > /cgroups/0/memory.limit_in_bytes
dd if=/dev/zero of=/tmp/bigfile bs=1024k count=20480

The same works with the fio "ssh" test case when run under the cgroup limitation:

write: io=10240MiB, bw=34349KiB/s, iops=32, runt=312595msec
read : io=2068KiB, bw=404KiB/s, iops=16, runt= 5239msec
read : io=2048KiB, bw=598KiB/s, iops=25, runt= 3505msec
read : io=2056KiB, bw=283KiB/s, iops=12, runt= 7437msec
read : io=2056KiB, bw=542KiB/s, iops=21, runt= 3879msec
read : io=2060KiB, bw=388KiB/s, iops=16, runt= 5431msec
read : io=2052KiB, bw=591KiB/s, iops=25, runt= 3554msec
read : io=2076KiB, bw=375KiB/s, iops=15, runt= 5658msec
read : io=2048KiB, bw=522KiB/s, iops=19, runt= 4011msec
read : io=2080KiB, bw=468KiB/s, iops=19, runt= 4548msec
read : io=2068KiB, bw=406KiB/s, iops=16, runt= 5206msec
read : io=2080KiB, bw=412KiB/s, iops=17, runt= 5161msec
read : io=2068KiB, bw=410KiB/s, iops=18, runt= 5159msec
read : io=2064KiB, bw=320KiB/s, iops=13, runt= 6603msec
read : io=2064KiB, bw=356KiB/s, iops=13, runt= 5924msec
read : io=2052KiB, bw=565KiB/s, iops=22, runt= 3716msec
read : io=2060KiB, bw=396KiB/s, iops=18, runt= 5321msec
read : io=2048KiB, bw=507KiB/s, iops=19, runt= 4129msec
read : io=2048KiB, bw=302KiB/s, iops=12, runt= 6924msec
read : io=2060KiB, bw=497KiB/s, iops=20, runt= 4243msec
read : io=2072KiB, bw=3138KiB/s, iops=130, runt= 676msec
read : io=2048KiB, bw=3472KiB/s, iops=130, runt= 604msec
read : io=2060KiB, bw=4080KiB/s, iops=172, runt= 517msec
read : io=2052KiB, bw=4227KiB/s, iops=171, runt= 497msec
read : io=2048KiB, bw=3744KiB/s, iops=166, runt= 560msec
read : io=2076KiB, bw=4201KiB/s, iops=169, runt= 506msec
read : io=2052KiB, bw=3531KiB/s, iops=159, runt= 595msec

See Documentation/cgroups/memory.txt for more details. Mathieu

How can we limit this with pre-2.6.29 kernels? I'm using 2.6.28.4, but there's no memory.limit_in_bytes, and the documentation doesn't help much here... Should we completely remove cgroups support from the kernel until upgrading, or wait for a fix?

(In reply to comment #153)
[...]
> echo 4M > /cgroups/0/memory.limit_in_bytes
[...]

Is CONFIG_CGROUPS (and its sub-options) enabled in your 2.6.28.x kernel? I cannot guarantee that memory limits will be available, but I can see the CONFIG_CGROUPS option in my old 2.6.28.x .config. Mathieu

Does not work for me. I succeed in keeping the memory usage from growing to infinity, but I still get 98% iowait and a bad loss of responsiveness. I'm running 2.6.28.7.

Well, it is a little bit more nuanced: a 4M limit ended up killing my dd operation. A limit of 16M is better for me and seems to be way better than the default without any limits.

The CGROUPS options are available in 2.6.28.3, but there is no memory limit.

(In reply to comment #156)
Søren, can you test it with clocksource=jiffies too? I still think that the reduced scheduler performance (comment #3) makes the problem worse. You can see the differences in comments #128 and #129 on my machine.

The number of dirty pages and writeback pages (/proc/meminfo) is always below 20% of memory on my systems, even under heavy I/O. But there is a lot of "traffic" caused by pdflush when the dirty page count reaches the limit: all dirty pages are passed to the blk/elevator nearly at once. Sorting the rb-tree, or perhaps the locks, then takes more time for every request, as there are a lot of requests. On ext3 it takes up to 1 second, and 0.3s on average, to insert a new request.
And there are up to 7000 requests submitted on my notebook (see comments #128 and #129). I think this is one reason for the high I/O wait.

The problem of the high memory usage is caused by pdflush too, which is called via generic_perform_write (filemap.c) -> balance_dirty_pages_ratelimited. clear_page_dirty_for_io is called directly before the page is submitted to the blk/elevator in write_cache_pages. As a result, the page buffers are still in the elevator queue while global_page_state(NR_FILE_DIRTY) has a value that is too small.

It does not matter if I use jiffies in these cases where memory is limited.

memory.limit_in_bytes = 4M
Responsiveness: Very good
Disk speed: 40% of disk capability
iowait: Generally around 50%

Responsiveness: Good
Disk speed: 50% of disk capability
iowait: Generally around 50%

Interestingly, I can't get the disk speed above 50% of the disk capability reported by hdparm, not even with oflag=direct. Earlier I reported that jiffies performed better, but that was without memory limitations.

Created attachment 20172 [details]
mm fix page writeback accounting to fix oom condition under heavy I/O
Makes sure the page cache accounting behaves correctly with the I/O elevator, thus fixing the OOM condition.
Does not seem to fix the latency problem though. See changelog.
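To make the accounting idea concrete: pages that the elevator has queued no longer show up in the counters that dirty-memory throttling checks, so writers are never forced to wait. A rough illustration of the direction this takes (NR_PAGES_IN_ELEVATOR is a hypothetical counter; this is not the attached patch itself):

/*
 * Rough illustration, not attachment 20172: include pages queued in
 * the elevator when judging dirty-memory pressure, so the page cache
 * cannot silently grow until the OOM killer fires.
 */
static int over_dirty_threshold(unsigned long dirty_thresh)
{
	unsigned long nr = global_page_state(NR_FILE_DIRTY) +
			   global_page_state(NR_WRITEBACK) +
			   global_page_state(NR_PAGES_IN_ELEVATOR); /* hypothetical */

	return nr > dirty_thresh;
}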
Hi Søren,
It's possible that the memory limits do not help with the problem: as you say, the HD will run under speed because of lack of data (due to the memory limits), so it will trigger the problem later or not trigger it at all. But it's good to have a way to limit the problem anyway.

So I have a question: under heavy I/O load, what is the right way for a sane kernel to behave? I propose some cases:

A) We have two processes, one that performs high-load I/O operations (to the HD this time), and one that does so only occasionally.
1. Process 1 (high I/O) starts to do I/O ops, so it switches between being blocked by I/O ops and active, as it reads and sends data to the controller.
2. Process 2 tries to access the disk, so it has to wait for a chance to read.
In this case the I/O wait of process 1 should be almost 0, as it only waits microseconds for the last I/O op to finish. But process 2 should have high I/O waits, because process 1 takes all the I/O bandwidth.

B) The same case, but with a round-robin style queue (CFQ?). I/O wait should be nearly 0 for process 2, as it gets a chance to write to disk, but process 1 must wait for each operation to finish...

Which is the correct way? Is there another? What is clear is that it is not normal for one process to block all the other processes because it is waiting to write. Only in the case that every process wants to write should the I/O wait rise, as all processes are waiting to get a chance. In that case, should we only see I/O wait times? Is this our case?

Created attachment 20176 [details]
Screenshot of current status of the bug while letting a program hang the system
Here you can see that I/O wait is 72.2%, with Xorg going crazy on CPU usage, while the rest of the system is completely unusable.
That was just because transmission was verifying my torrents. So again, it is not acceptable that the system is rendered unusable because a background operation is in place...
How can I help more?
Could it be possible to reuse the concept from CPU scheduling? Instead of talking about time slices, we could talk about IO slices, and favor the processes which use the fewest IO slices; this would keep an evil dd from starving other light readers/writers. I'm not kernel-skilled at all, so maybe this sounds a lot like your RR queue, but just some thoughts.

Maybe someone can explain to me why simple copying eats ~50% CPU? Maybe it is a part of this problem? The same copying in Windows eats 5-10% CPU. UDMA 100 is enabled on my PATA drive. I have jfs partitions.

With the last patch, the problem is permanent on my notebook on ext4 and ext3 partitions. The I/O wait time is at 100% under heavy I/O. Mouse clicks are not recognised very often, or the keyboard input is delayed for up to 10 seconds (all under Xorg). I got a deadlock with the patch on kernel 2.6.28.2, but only once: the I/O wait time was at 100%, but there was no disc I/O any more. I could not start any programs or save any data, but I was able to use the running programs. I am not sure whether this is a problem caused by the patch, or whether it is our problem. I got a complete freeze with clocksource=jiffies on an unpatched kernel with heavy I/O and heavy CPU usage too.

I have checked some timings in the block and elevator functions (__make_request, get_request, get_request_wait, blk_complete_request, cfq_service_tree_add and cfq_add_rq_rb). All the timings were below 5µs, at some points climbing to 80µs, which looks good to me. In get_request_wait, the writing dd processes wait up to one second for a new free request; it was only the dd processes, or sometimes the pdflush process. That should be OK. Can prepare_to_wait_exclusive(&rl->wait[rw], &wait, TASK_UNINTERRUPTIBLE) in get_request_wait (blk-core.c) cause such a problem?

The patch from #160, which keeps the kernel from just taking all available memory, almost works for me. Thanks Mathieu. I don't get crazy swapout as I used to, but the cache still occupies 400 megs of memory out of my 1G, which is also wrong.

Hmm... I assume that the cache is both read and write cache. In that case everything is all right.

I can confirm the almost 100% iowait.

I have limited the number of requests of a process to 200 per second by adding an msleep_interruptible(5) just before the spin_lock_irq(q->queue_lock) in __make_request, when there is intensive usage by this process. The number of requests is counted in a ring buffer covering four seconds and updated every 100ms. The throughput of the two dd processes is really bad, at 3MB/s (as expected). Processes with a priority higher than 0, kjournald(2), or requests for which (bio_data_dir(bio) == READ || bio_sync(bio)) is true are passed through without delay. The wait time is at 100% of one core at the beginning and at 100% of both cores after ~5-10s. Only the two dd processes and pdflush are delayed. The problem is permanent: I cannot change between the windows of two consoles or switch desktops; there are always long delays. It is exactly the freezing known from heavy I/O, with the difference of a movable mouse cursor. I am not able to use gedit to write a text, as every 5-15 seconds keypresses are recognised with a long delay of at least 5 seconds, even when the dd processes are killed and there is only a maximum write speed of 3MB/s (pdflush and perhaps kjournald) in the background (0% I/O wait time). Gimp starts in 10 seconds without preloading. The cache usage is at less than 20% of memory (~800MB). I am using the kernel 2.6.28.2 with the patch from Mathieu. Thanks a lot.
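A rough sketch of the throttling experiment described above, at the top of __make_request(); is_io_intensive() stands in for the hand-rolled ring-buffer bookkeeping, and this is an experiment, not a proposed fix:

/* Experiment only: delay async writers from I/O-heavy tasks before
 * they take the queue lock; reads, sync bios and elevated-priority
 * tasks pass through undelayed. */
if (bio_data_dir(bio) == WRITE && !bio_sync(bio) &&
    task_nice(current) >= 0 &&
    is_io_intensive(current))		/* hypothetical bookkeeping */
	msleep_interruptible(5);

spin_lock_irq(q->queue_lock);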
I think it stops the mouse cursor from freezing, together with my delay in __make_request; removing only the delay restores the previous state. I think this is the main problem, as I can simulate it! The high I/O wait is caused by the sleeping threads.

In __make_request there are only 100-200 out of 7000 requests during heavy I/O which call get_request_wait. And there are only about 10 requests which enter the while loop in get_request_wait, really waiting more than 20ms and up to 1 second on my machine (prepare_to_wait_exclusive(&rl->wait[rw], &wait, TASK_UNINTERRUPTIBLE); ...; io_schedule(); in get_request_wait).

I have just replaced prepare_to_wait_exclusive(&rl->wait[rw], &wait, TASK_UNINTERRUPTIBLE) and io_schedule() in get_request_wait with msleep_interruptible(500). The throughput of the two dd processes is at 57MB/s (27/30), and the desktop freezes for up to 100 seconds.

Is there any way I can help debug this?

(In reply to comment #138)
> Maybe it's not only the preemption and the frequency. I think one of these
> things could be:
> [the config options listed in comment #138]
> Good luck...

Many of these seem to be 32-bit settings. The funny thing is that if I boot into x86 32-bit, I don't see any of the slowdowns, or they are so small that effectively I don't feel them. It's only x86-64 which freezes on me during I/O.

Must admit all machines I have noticed this on are x86_64.

The systems on which I have noticed it are also x86_64.

I have noticed this bug on a Pentium-M (32-bit only) processor.

I have seen this bug on an Opteron 250 system with a 32-bit OS (CentOS 4.4 thru CentOS 5) installed.
Mine is:

gad@ws-esp16:~$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz
stepping : 10
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm ida
bogomips : 4388.98
clflush size : 64
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz
stepping : 10
cpu MHz : 800.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm ida
bogomips : 4389.07
clflush size : 64
power management:

My cpu model is: AMD Turion(tm) 64 X2 Mobile Technology TL-50. The kernel is compiled for i686, and I see large slowdowns.

I see this on my Intel T8100 notebook on both kernel-2.6.29-0.33.rc5.fc10.x86_64 and kernel-2.6.27.15-170.2.24.fc10.x86_64 (default Fedora config options). Just using the simple dd /dev/zero test can provoke it; the desktop feels less responsive. latencytop shows things like evolution waiting almost 10 seconds for an fsync to complete. The hardware has an ICH8 chipset; DMA etc. seem configured properly.

vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz
stepping : 6
cpu MHz : 800.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1

In certain cases (2.6.28.5), with the patch "mm fix page writeback accounting to fix oom condition under heavy I/O" applied, the system gets out of control: iowait increases to ~100% and a complete stop of the system is observed. I cannot provide any data, as only the reset button on the box still works. There is probably a set of influencing factors here that demands a more detailed check.

(In reply to comment #179)
> In certain cases (2.6.28.5), with the patch "mm fix page writeback
> accounting to fix oom condition under heavy I/O" applied, the system gets
> out of control: iowait increases to ~100% and a complete stop of the system
> is observed. I cannot provide any data, as only the reset button on the box
> still works. There is probably a set of influencing factors here that
> demands a more detailed check.

My patch "mm fix page writeback accounting to fix oom condition under heavy I/O" is probably not the right solution, but rather a step in the right direction. It pinpoints that the elevator fails to increment counters that are tested by the code which decides whether the memory pressure from dirty pages and writeback pages is high enough to make the process fall into "sync write" mode. Therefore, I think a cleaner solution to this particular problem could be to create a new page-type counter (like dirty pages, write buffers, ...) to let the vm know how many pages are used by the elevator.
The fs/buffer.c code should then check this value too, to see whether the pressure on memory is high enough to make the process do a "sync write". However, this problem is harder than it appears, because the buffer.c code would probably put such a process in sync write mode independently of the elevator, and I really wonder what the interaction of such a solution with CFQ would be. I am not sure the CFQ I/O scheduler would behave correctly in such a situation, but Jens could tell better than I can on the subject. Hope this helps, Mathieu

(In reply to comment #179)
> I cannot provide any data, as only the reset button on the box still works.
> There is probably a set of influencing factors here that demands a more
> detailed check.

I have noticed this issue with an unpatched kernel too. The "mm fix page writeback accounting to fix oom condition under heavy I/O" patch makes the problem reproducible. Sometimes the io wait time is at 100%; sometimes there is no io wait time. There is no problem with read access, but no write access is executed. I can reproduce the problem with xfs. With ext4 the problem does not appear very often, on either the patched or the unpatched kernel.

(In reply to comment #181) To be specific: I use only xfs. The patch probably interacts badly with it, and probably works well with other file systems. I am sorry, I simply have nothing else I can check.

I have consistently had this problem with any kernel I have tried above 2.6.17, so I have stuck with that up until now. There are some supposed resolutions to the problem at http://linux-ata.org/faq.html, but none of them work for me, and I don't have the mentioned BIOS setting in my BIOS. lspci reports...

00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface Bridge (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GBM/GHM (ICH7 Family) SATA IDE Controller (rev 01)

I do not appear to have the problem on my Macbook 2,1, although the disk performance is like 21M/s, which is lousy. But what I'm seeing on my other machine is 1M-3M/s. I also tried passing "pci=routeirq" and "acpi=off" (grasping at straws), but that did not change anything. I did however notice that my HD is /dev/sda in 2.6.17, and /dev/hda in 2.6.25 and 2.6.27. On 2.6.17, dmesg tells me...

ata_piix 0000:00:1f.2: version 1.05
ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ]
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 17 (level, low) -> IRQ 18
ata: 0x170 IDE port busy
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0xBFA0 irq 14
ata1: dev 0 cfg 49:2f00 82:346b 83:7d09 84:6123 85:3469 86:bc09 87:6123 88:207f
ata1: dev 0 ATA-8, max UDMA/133, 625142448 sectors: LBA48
ata1: dev 0 configured for UDMA/133
scsi2 : ata_piix
Vendor: ATA Model: ST9320421ASG Rev: SD13
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3
sd 2:0:0:0: Attached scsi disk sda
sd 2:0:0:0: Attached scsi generic sg0 type 0

But on 2.6.27, I get nothing of the sort - nothing to do with SATA or anything. I did notice that with 2.6.27 libata was enabled, while with 2.6.17 it didn't even appear to be an option.
Ever since libata came in, nothing seems to work, and my computer is relatively new. I have a Dell D820 Core 2 Duo.

I noticed the same - would it be possible to revert the libata integration?

Trenton, it's unclear to me what you're describing here.
> I have consistently had this problem
which problem?
Anyway, it sounds like what you're reporting is a straightforward
regression in ATA throughput?
If so, please raise a separate, new bug report against SATA for that,
thanks.
Oops, mid-air collision. I'll answer Andrew's question first. I'm having two problems: 1. on my Dell D820 I see degraded throughput AND high io wait times, as everyone else here has described; 2. on my Macbook, I do not see degraded performance, but I do see the extremely high io wait times. Both of these systems have IDENTICAL IDE chipsets. Read on with my original reply, from before the collision, for more information.

Quick question: is anyone else who has this problem using the Intel 82801GBM/GHM IDE chipset? I have a Dell D820 (64-bit) notebook and a Macbook from late 2007 (the 64-bit ones). I noticed that they both have Intel 82801GBM/GHM IDE chipsets, and they both exhibit the problem. If I run Gentoo Linux 32-bit on the D820 with one of these bad kernels, my hard drive (which was renamed to hda) gets about 3M/sec, and the high wait times are also present. With the Macbook, the high io wait times are there, but I get good throughput with Gentoo 32-bit. Not sure what the difference is between the D820 and the Macbook, seeing that they have very similar hardware (almost identical). I suppose it is possible that Apple made the BIOS change that the linux-ata page suggests.

This truly is debilitating. I have now tried two distributions with the latest 2.6.x kernels (Gentoo and OpenSUSE 11.1), and all of them exhibit these symptoms on my hardware. I am almost certain that if this does not get fixed, I will be unable to continue using Linux at work, unless I get a new computer (slim chance, but possible). After all, eventually Gentoo will move towards some new features that require a newer kernel, and I will be left in the dust. I will then be forced to run Linux in vmware under Windows. Please, someone save me from this awful DEATH. muhahahaha.

(In reply to comment #172)
> Must admit all machines I have noticed this on are x86_64.

I am seeing both x86_64 and i686 machines exhibit this. Before my Dell D820 died on me, it was a dual-core 32-bit machine. Then it got replaced with a newer D820, which is a Core 2 Duo 64-bit machine. This issue happened on both of those. And, as mentioned in my last comment, it also happens on my Core 2 Duo Macbook.

I once had a similarly *traumatic* throughput regression with an Intel processor + p4_clockmod. So the issues may have completely different causes.

(In reply to comment #134)
> Hi.
>
> On my laptop (Core2Duo 1.6 GHz) I run my gentoo kernel 2.6.28-gentoo.
> I didn't have any problems with latency.
>
> If I run "dd if=/dev/zero of=file bs=1M count=2048" or "dd if=/dev/zero
> of=/tmp/test bs=1M count=1M" (I tried to run it as user and also as root),
> my system works well and I can start firefox, another shell, open dolphin
> (I'm under kde4-svn) and everything is faster.
>
> I have XFS filesystem on my home and reiserfs on root.
>
> Since I configured my kernel manually, maybe it could be useful for someone
> to have my .config, so I'll post it.

I have just unmasked and tried 2.6.28 on Gentoo Linux as well, and the problem appears to be gone. This is on my D820, which is the one with the really bad throughput as well. As I am in the process of converting the D820 to 64-bit, I am unable to try GUI stuff out. But before, during heavy load, I was unable to switch between terminals very well either. Now the system is EXTREMELY responsive during these heavy-load times, which is what I expect. And I'm getting 82M/sec once the caching limit has been reached, and 256M/sec with caching. This is equivalent to what I was getting with 2.6.17.
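As a side note on those two numbers: the 256M/sec figure is the page cache absorbing the writes at memory speed, while the 82M/sec figure is what the disk sustains once the data actually has to go out. The gap is easy to demonstrate with a few lines of C; everything here (file name, sizes) is an arbitrary sketch, not part of any test case posted in this bug, and on a box with little free RAM the first number will already include some disk time.

/* Demonstrate cached vs sustained write rates: time the write() calls
 * alone (page cache speed), then time write()+fsync() (disk speed).
 * Sizes and the file name are arbitrary. */
#include <stdio.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

static double write_mb(int mb, int do_sync)
{
    static char buf[1 << 20];
    struct timespec t0, t1;
    int i, fd = open("tp-test.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0) { perror("open"); return 0; }
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < mb; i++)
        if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); break; }
    if (do_sync)
        fsync(fd);              /* force the cached data out to the disk */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);
    return mb / seconds(t0, t1);
}

int main(void)
{
    printf("cached:    %.1f MB/s\n", write_mb(256, 0));
    printf("sustained: %.1f MB/s\n", write_mb(256, 1));
    return 0;
}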
Now, I don't know if the Gentoo guys applied someone's patch from here, as comment #52 mentioned patching 2.6.28, but it's working for me now. I'm VERY happy about that. :D Based on his description, it very much sounds like the Gentoo guys must have applied the patch. I was doing a while loop with dd, increasing the amount of data by 1M at a time. The first few, up to about 60M, were getting 256M/sec. Then I noticed in my other terminal, running vmstat, that the iowait times got pinned to nearly 100%. So I'm thinking that all those dd's that got cached were finally catching up to the NO LIMIT on cached items and causing thrashing in the IO system. That caused a COMPLETE freezeup of the while loop. Also, during this time, my HD light was going crazy. Then, when the io wait times dropped to 0 again (cached items flushed), the loop did a few more iterations (and my HD light was off), and it started all over again. Then, again the loop froze, etc, etc, etc.

Also, I feel kind of stupid, because I should have reported this back in 2007 when I saw it. But I figured someone else would find it before too long, so I just hung back with my kernel version. SORRY!!! :( I guess I shouldn't do that next time, especially considering it is way easier to find bugs when a new release has just come out and there is a new bug due to the changes in that release.

bugme-daemon@bugzilla.kernel.org schreef:
> http://bugzilla.kernel.org/show_bug.cgi?id=12309
>
> ------- Comment #189 from trent.bugzilla@trentonadams.ca 2009-03-01 01:26 -------
Ok, so if that Gentoo version is working for you, can we compare it with the vanilla kernel? Can you send us some system info so we can compare your kernel config with the vanilla one? Can we have a tarball with the following structure (to make it easy to diff over it)?

--------------------------------------------------
systeminfo.txt
vanilla
\- config (original config of the vanilla kernel, not yours)
|- kernel-info.txt
|- dmesg.txt
|- lsmod-output.txt
|- test-report.txt
gentoo-youredition
\- config (the config file of your kernel version)
|- dmesg.txt
|- lsmod-output.txt
|- test-report.txt
|- gentoo.patch
--------------------------------------------------

If you have the time, can you do the following on the system:
- Get the source for the gentoo version you are using (shouldn't be too hard on Gentoo ;-) )
- Get the source of the vanilla kernel with the same version/patch level as your gentoo kernel
- Check whether your current gentoo config works on the vanilla kernel and whether that results in a responsive system
- If that does not solve the bug on your system, create a patch file for the gentoo patches, so we can see exactly what gentoo has patched

If you try this and send us the information, we can use a tool like Meld (http://meld.sourceforge.net/) to compare the two kernel configurations with each other.

Can you put the following information in systeminfo.txt:
cat /proc/cpuinfo
cat /proc/meminfo
cat /proc/swaps

And for per-kernel information, in kernel-info.txt:
cat /proc/version
uname -a
cat /proc/cmdline
cat /sys/block/<disk>/queue/scheduler

Config is just the .config file; you can get it with zcat /proc/config.gz, via your /boot/config-<something>, or via the kernel source. In dmesg.txt, your dmesg output. In lsmod-output.txt, your lsmod output. In test-report.txt, the reports of your tests on the kernel, how they performed, and what tests you did. In gentoo.patch, the patches Gentoo made on the vanilla kernel (using the diff command). I hope we can find a piece of the cause with this information. Greetings, Michiel

(In reply to comment #16)
> I tried elevator=as on my system, and it did not change the behaviour.
> Copying files from external USB to internal encrypted SSD still totally
> smashes interactive performance. So this issue might be unrelated.

Note, some SSDs have very poor random-write performance; this can cause stuttering and all sorts of side effects. Anandtech investigated this issue when comparing/reviewing Intel's SSDs vs. parts from OCZ which use a certain JMicron controller. See here: http://www.anandtech.com/showdoc.aspx?i=3403&p=7 - you should probably just read the entire review. It is therefore possible that your issue has more to do with the behaviour of your SSD during writes than with the kernel scheduler or anything else.

Working on it now, Michiel. I'll try and get that info for 2.6.27, 2.6.28, and vanilla 2.6.28. ttyl

Hmmm, apparently I forgot to try vmstat. The high io wait times are still there, but I haven't been noticing them; I wonder what could have caused me not to notice them now. The performance is way better, even with the high io wait though. I'm not seeing 30-second delays on stuff. Every now and then there's a second or two delay, perhaps five tops.
I'll get the info anyhow and see what the differences are. FYI: this is still on my D820.

(In reply to comment #192)
> It is therefore possible that your issue has more to do with the behaviour
> of your SSD during writes than with the kernel scheduler or anything else.

Well, if that is true, it would have to be a combination of the kernel and my system, mainly because my system was SUPER fast before I tried upgrading my kernel past 2.6.17. As for my Mac, I don't recall having performance issues while running Mac OS X. Nothing like the article describes, anyhow.

(In reply to comment #195)
> Well, if that is true, it would have to be a combination of the kernel and
> my system. Mainly because my system was SUPER fast before I tried upgrading
> my kernel past 2.6.17.

There is another bug in 2.6.17/18-??, which gives poor disc performance while running the SATA controller on an ICH8M (or equivalent?) platform in compatibility mode; it gives a high i/o wait time too and lets this bug appear. There are dependencies between cpu power, disc throughput, task switching time (e.g. clocksource) and this bug.

Has someone tried to identify the source of the problem with the info provided in Comment #168 and Comment #169? There is a comment in the code (blk-core.c @ ~1300):

/*
 * After dropping the lock and possibly sleeping here, our request
 * may now be mergeable after it had proven unmergeable (above).
 * We don't worry about that case for efficiency. It won't happen
 * often, and the elevators are able to handle it.
 */

But it happens up to 20 times every second during heavy io, causing high io wait times for the writing process (or pdflush), and it makes the desktop responsiveness poor. My proof is the really poor desktop responsiveness when replacing prepare_to_wait_exclusive with msleep_interruptible (see Comment #169). I will be able to spend some more time on this bug in April.

Created attachment 20405 [details]
info request by Michiel in comment 191

Here's the info you wanted, Michiel. Doing a diff on the config of the bad kernel and the new one reveals this interesting tidbit...

diff -u 2.6.27-gentoo-r8-kernel-config.txt 2.6.28-gentoo-r2-kernel-config.txt
-CONFIG_BLK_DEV_IDEDISK=y
-CONFIG_IDEDISK_MULTI_MODE=y
+CONFIG_IDE_GD=y
+CONFIG_IDE_GD_ATA=y

That must have been what switched me back to using sda. Anyhow, that was obviously a separate issue. So, my system performance and io wait times are totally fine during normal system operation. When I do REALLY heavy io, the wait times go up, but the responsiveness is still relatively good: I can start kwrite in about 2-3 seconds. It seems fixed to me. But I'll still try that patched 2.6.28 and get back to you, to see if it is even better. Perhaps Andrew Morton was right.
Maybe my issue was entirely down to my SATA issues.

(In reply to comment #196)
> There is another bug in 2.6.17/18-??, which gives poor disc performance
> while running the SATA controller on an ICH8M (or equivalent?) platform in
> compatibility mode; it gives a high i/o wait time too and lets this bug
> appear.
>
> There are dependencies between cpu power, disc throughput, task switching
> time (e.g. clocksource) and this bug.

This is interesting, since my notebook has an ICH8M stuck in compatibility mode (no BIOS option). I'll see how it compares to my other notebook with an ATI-IXP chipset.

Anyone seen this on a non-SATA drive? If I do a "dd if=/dev/zero of=outfile bs=1M count=50000" on 2.6.28, the load rises to around 8; on 2.6.29-rc5 it never gets past 4. I'm testing on 64-bit, ICH9 + SATA, btw. I tried to install CentOS 4.7, with kernel 2.6.9+, and it's just as bad as 2.6.28.

I have just tested 2.6.29-rc6. The desktop responsiveness has increased enormously; especially Firefox is now usable. The problem still exists for me, but it is now not as noticeable as before.

(In reply to comment #195)
> Well, if that is true, it would have to be a combination of the kernel and
> my system. Mainly because my system was SUPER fast before I tried upgrading
> my kernel past 2.6.17. As for my Mac, I don't recall having performance
> issues while running Mac OS X. Nothing like the article describes anyhow.

OK, well in that case I absolutely agree it's obviously a software-only problem in your case, and probably this scheduler kernel issue. (I just wanted to point out for the record, so everyone's aware, that there are some SSD hardware combinations that inherently have limitations that may very well cause similar sluggishness regardless of the kernel/software itself.) As an aside, high IO wait percentages are, as far as I understand it, not in and of themselves problematic, since high IO wait only means that a process is waiting for IO. This measure will therefore predictably be high when a process is doing substantial IO against a comparatively slow device. Normally, however, one would expect such IO not to negatively affect other processes or general system responsiveness, *except* if the other processes are also somehow IO-hungry in order to proceed and you have some sort of IO resource contention going on, or, as appears in this thread, there's actually a scheduling problem which causes processes that are runnable to not receive the CPU when they should, thus resulting in perceived sluggishness.

I must correct my last post (Comment #200). I was working with VMs the whole day, and it is still as awful as before. But there is a big improvement while using Firefox.

I would agree that -rc6 has for some reason greatly improved system responsiveness under I/O load, but there are most certainly still great issues in the block I/O world. Just now I once again managed to completely wedge up my machine by doing nothing more than copying a few gigabytes of files between drives. Furthermore, Firefox still freezes for several seconds when I first start typing in the location bar, as it looks in its history database. Lastly, Evolution still takes several minutes to start and become usable while its I/O rate is less than 1 MB/s. All in all, things are pretty unusable. Jens, are you around?
I've been asking various distributions and vendors whether they could spare some qualified man-hours to finally get this problem worked out, but it seems like you're our best hope. I know you'll be getting at least one case of beer when this is fixed ;)

Hi guys, my brother has apparently been having the same problem on his computer; I hadn't realized it when I submitted my bug. He has an ICH8-family chipset. The following works for him, and the problem goes away:

echo anticipatory > /sys/block/sda/queue/scheduler

Looks like this may be a tough one to nail down, because everyone's symptoms are slightly different. I'm wondering if perhaps there are multiple issues going on here.

Oh, crap, I forgot the details. Before the details, I also wanted to say that I am going to get him to try changing the BIOS option mentioned on the libata page I gave earlier, to see what happens.

[03:05 root@zipper ~]# lspci
00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G965 Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82P965/G965 HECI Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DC Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801H (ICH8 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.1 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 2 (rev 02)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3 (rev 02)
00:1c.3 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 4 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HB/HR (ICH8/R) LPC Interface Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801H (ICH8 Family) 4 port SATA IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
00:1f.5 IDE interface: Intel Corporation 82801H (ICH8 Family) 2 port SATA IDE Controller (rev 02)
02:00.0 IDE interface: Marvell Technology Group Ltd. 88SE6101 single-port PATA133 interface (rev b1)
06:00.0 RAID bus controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)
06:01.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
06:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)

[03:05 root@zipper ~]# uname -a
Linux zipper 2.6.18-53.el5xen #1 SMP Mon Nov 12 02:46:57 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
[03:09 root@zipper ~]# cat /etc/issue
CentOS release 5 (Final)
Kernel \r on an \m

I have noticed that while working with VMs my system starts swapping after a while. I tried -rc7 with Mathieu's patch (Comment #160), and my system seems to be usable. There is still the unfair io scheduling between processes, but that's another problem. I am using a kernel without "Group CPU Scheduler" and "Control Group Support" and am writing this text in firefox at a load avg of 12. To reach such a high load avg, I have to run eight concurrent dd write operations:

for i in 1 2 3 4 5 6 7 8; do \
  dd if=/dev/zero of=test-$i bs=1M count=4K oflag=direct & echo test-$i; \
done

Copying big files with nautilus makes my system unusable from time to time, with the known symptoms such as "unable to switch desktop" and "mouse freezes". And finally, I have not seen the complete io freeze with the -rc7 kernel on xfs, ext3 or ext4.

Trenton, I too set my kernel to the anticipatory scheduler, and for a while I thought all was well when I ran dd if=/dev/zero of=~/test bs=1M count=1500 in order to test. Then I realized that it's not a reliable testing method, since the *anticipatory* scheduler can anticipate the coming zeroes that will be written. I ran dd if=/dev/zero of=~/test bs=1M count=1500 simultaneously with the one writing from /dev/zero, and realized that part of the syndrome is fixed with AS, but the problem persists...

argh, forgot to give details too... running the 2.6.28-8-generic kernel (64-bit) on Ubuntu jaunty, and I had this problem in 32-bit kernels before as well.
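A note on that oflag=direct load: O_DIRECT bypasses the page cache, so it exercises the device and the io scheduler without involving pdflush-driven writeback, which is one way to separate "the disk is slow" from "writeback is misbehaving". For anyone who wants the same load from C rather than dd, a rough equivalent of one of those writers follows; the 4096-byte alignment and the file name are assumptions, not part of the posted loop.

/* Rough C equivalent of `dd if=/dev/zero of=FILE bs=1M count=4K
 * oflag=direct`: O_DIRECT writes bypass the page cache, so the observed
 * rate is the device's, not the cache's. The 4096-byte alignment is an
 * assumption; some setups require a different value. */
#define _GNU_SOURCE              /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "direct-test.out";
    void *buf;
    int i, fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }
    if (posix_memalign(&buf, 4096, 1 << 20)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, 1 << 20);

    for (i = 0; i < 4096; i++)          /* 4 GiB in 1 MiB chunks */
        if (write(fd, buf, 1 << 20) < 0) { perror("write"); break; }

    close(fd);
    free(buf);
    return 0;
}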
khaal@Xeraphim:~$ sudo lspci
[sudo] password for khaal:
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:02.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:04.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0a.2 RAM memory: nVidia Corporation MCP51 Memory Controller 0 (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev a1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev a1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:10.1 Audio device: nVidia Corporation MCP51 High Definition Audio (rev a2)
00:14.0 Bridge: nVidia Corporation MCP51 Ethernet Controller (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
02:00.0 VGA compatible controller: nVidia Corporation G80 [GeForce 8800 GTS] (rev a2)
03:05.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 70)
03:06.0 Multimedia controller: Philips Semiconductors SAA7131/SAA7133/SAA7135 Video Broadcast Decoder (rev d1)
03:07.0 Multimedia audio controller: Creative Labs SB X-Fi
03:09.0 Ethernet controller: Atheros Communications Inc. AR5413 802.11abg NIC (rev 01)

I had done some initial testing on my x86_64 box of vanilla 2.6.17 (downloaded from kernel.org), and it seems to me that it has the problem too. I don't understand why my problem started with 2.6.18 if vanilla 2.6.17 has the problem. Note that I tested both the first and the last version of 2.6.17. I'm thoroughly confused. I think I'll switch to 2.6.17 and run that for a while to see if there's better performance overall. Perhaps loading it up is not the best way to see if there are latency issues, as there will always be some. Then, if I do see some improvement, I'll increment to 2.6.18. Hopefully, slowly but surely, I can figure out exactly which kernel has the problem, and then a kernel dev can fix it. That's the plan anyhow. :P

Hi, I tried the new 2.6.28.7 kernel, and things seem to have gotten worse... Even bittorrent checking downloaded files is able to lock up the computer... I will upload a new screenshot showing 91.4% of processor time spent waiting for the HD to read data... This is nonsense... I will try to do the same check for every new kernel that comes out, to check for improvements.

Created attachment 20464 [details]
IOWait problem 91.4% 2.6.28.7
Wanted to add even more testing results from my side. I tried the suggestions from this source: http://stackoverflow.com/questions/392198/how-to-make-linux-gui-usable-when-lots-of-disk-activity-is-happening by changing some vm.dirty_ variables. No improvement could be seen; changing to the deadline scheduler didn't improve the situation either. I also changed /sys/block/sda/queue/nr_requests to 64, with the same unresponsiveness. I'm still on the same kernel (2.6.28-8), and my fstab mounts the partition with the relatime,noatime,nodiratime flags. Am currently installing 2.6.29-rc7, hoping that it will solve some of the issues in this bug.

Could changing the SLAB allocator be an option to test for the problem? We can choose between SLAB/SLUB/SLOB. Maybe that can be helpful.

There are still some confusing comments on IO wait in here, so let's clear that up at least. 91% io wait does not mean the system is using 91% of its cpu power for doing the IO; it merely means that some process is BLOCKED waiting for IO 91% of the time. It has zero relevance to cpu cycles consumed. The same goes for the observed load: having a load of 2.0 due to io wait does not mean that you have a doubly loaded system. It just means that, on average, two processes are blocked waiting for IO. When you start a bittorrent client and it checks the file data, you would expect io wait to be nearly 100%. It does do some cpu processing, which is why it's not completely at 100%. So forget IO wait; it doesn't tell you ANYTHING about whether a system is supposed to be slow or not.

And to make a more general comment... this bug is impossible to solve, since it (once again) has degenerated into a place for everybody to funnel everything that relates to a system feeling sluggish. There could be at least 10 separate issues described in here, or more. And while some of these are surely things we could do better, some are also certainly expected behaviour. We are touching at least several file systems, mm issues, and io scheduler issues. I'm quite sure that some of the mentioned behaviour is completely due to ext3 sucking at fsync. I'd LOVE to be able to look into this, but honestly I have no idea where to start. What I would also love is for someone to post a test case that actually works. This includes observed behaviour and a description of what you would EXPECT to see happen. Then we/I should be able to at least judge whether there's something we can do about it. Expecting a fully fluid system while having 100 threads writing data to the device is not reasonable, for instance. But if it behaves significantly worse than previous kernels, then there's still something to look into.

I totally agree with you, Jens. I have been having a hard time localizing the problem myself. I went back to the 2.6.17 kernel, and it seems to be worse than my 2.6.28 kernel. But keep in mind, I was running i686 when I originally discovered the problem, and now I'm on x86_64. I think the only way I will be able to localize the issue is if I restore my system to i686 Gentoo and then try 2.6.28; then I may start getting somewhere. I also agree that it is nearly impossible to solve this one without some more concrete data. I wish I had chosen a different time to upgrade to 64-bit, because then I could still be fiddling with this issue on my i686. I'll post again if I find something more concrete. I will admit that many of my issues seem to be caused by fsync() (I'm on ext4).
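To put a number on what Jens describes: the iowait figure that top and vmstat show comes from the fifth field of the cpu line in /proc/stat, and it counts ticks a CPU sat idle while at least one task was blocked on I/O. A small sampler, assuming the usual /proc/stat layout:

/* Print the iowait share once per second from /proc/stat. The cpu line
 * is "cpu user nice system idle iowait irq softirq steal ...", counted
 * in jiffies; iowait is only idle time with outstanding I/O. */
#include <stdio.h>
#include <unistd.h>

static int read_stat(long long *total, long long *iowait)
{
    long long v[8] = {0};
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return -1;
    fscanf(f, "cpu %lld %lld %lld %lld %lld %lld %lld %lld",
           &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    fclose(f);
    *iowait = v[4];
    *total  = v[0] + v[1] + v[2] + v[3] + v[4] + v[5] + v[6] + v[7];
    return 0;
}

int main(void)
{
    long long t0, w0, t1, w1;
    if (read_stat(&t0, &w0)) { perror("/proc/stat"); return 1; }
    for (;;) {
        sleep(1);
        if (read_stat(&t1, &w1)) break;
        printf("iowait: %5.1f%%\n",
               100.0 * (w1 - w0) / (double)((t1 - t0) ? t1 - t0 : 1));
        t0 = t1;
        w0 = w1;
    }
    return 0;
}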
One of the largest issues I'm currently having is Liferea blocking in fsync() for several seconds every time a new item is selected. During this time kjournald2 is writing, although iotop only shows a total write rate of ~500kB/s. This seems extremely slow and far below the capacity of the disk (a 7200 RPM SATA drive). This low I/O rate is common to all the sluggish I/O cases. Does this sound like expected behavior? Perhaps my problems have been caused by just generally slow I/O?

The last kernel without the problem was 2.6.16 (the evidence being that SLES 10 SP2 does not give high iowait on an ASUS P5K). So let's look at what super-mega-feature appeared in 2.6.17 that was absent in 2.6.16. This feature clearly cannot belong to any single file system (all file systems are subject to the error). I have not found relevant changes in the schedulers between 2.6.16 and 2.6.17. The introduction of libata is the one unique difference. Whether the high iowait comes from libata itself or from the infrastructure embedding it in the kernel matters little; only one thing matters - the kernel becomes unusable. And that is the sad fact.

I do not mean the fsync problem, which is not a problem for me any more in the .29 kernel. I mean the sluggish behaviour of all gui applications, especially while working with vmware workstation: suspend and resume time rises from less than two minutes to up to ten minutes. It started for me when I upgraded from feisty (2.6.20) to gutsy (2.6.22) on a 32-bit Pentium-M. It is hard to localize this problem, as it does not appear all the time, and there are a lot of other problems, and many solved problems, which make a comparison very problematic. My assumption is that it depends on the cpu, the hard drive and the user. The best hint for me was the duration of the process test. I have not submitted this test, to avoid the kernel being tuned to this special test case, as I have seen happen on LKML; it should help to localize the problem. The results of these tests seem to fit with the regression of the sluggish behaviour. See http://bugzilla.kernel.org/attachment.cgi?id=19797&action=view

CentOS 2.6.18-92.el5 - 29.995s - good
Feisty 2.6.20.21 - 25.304s - good
Gutsy 2.6.22-16 - 40.405s - bad
Hardy 2.6.24-23 - 37.604s - bad
Intrepid 2.6.27-9 - 96.922s - unusable

I have seen with powertop that the number of interrupts doubled from 200 to 400 for keyboard input when heavy io was running in the background. And I know there is nothing wrong with a high io wait time as such, but as soon as the io wait time reaches 100%, the desktop becomes sluggish and unusable. You can try this on an installation on a slow disk and ext3, or even on a fully encrypted disc. The slow SSDs could be related to this bug, as there is really poor write performance with Linux on many SSDs: I have measured transfer rates down to 2MB/s for non-direct writes (4KB cache splitting), while direct writing gets up to 90MB/s on my SSD. My system on my SSD is completely unusable. I will execute some tests in a virtual machine, as it seems to me that an application running in the virtual machine is more affected by this sluggish behaviour than an application executed on the host. I will run exactly the same vm and test on different host kernels. But I am not able to spend more time on this before April. Perhaps someone else can start earlier?
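A minimal probe for this kind of fsync stall, in the spirit of the simple fsync-latency testers people were passing around: append a small record, time the fsync, and report outliers while a dd load runs in the background. The file name and the 100ms reporting threshold are arbitrary choices, not part of any attached test case.

/* Append a small record and time fsync() once per second. Under
 * ext3/ext4-style journalling, an fsync can end up waiting behind
 * unrelated dirty data, which is the multi-second stall reported
 * above for Liferea and Evolution. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char rec[256];
    int fd = open("fsync-test.out", O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd < 0) { perror("open"); return 1; }
    memset(rec, 'x', sizeof(rec));

    for (;;) {
        struct timespec t0, t1;
        double ms;

        if (write(fd, rec, sizeof(rec)) < 0) { perror("write"); break; }
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fsync(fd);                      /* the call the apps block in */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
             (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (ms > 100.0)                 /* only report real stalls */
            printf("fsync took %.0f ms\n", ms);
        sleep(1);
    }
    close(fd);
    return 0;
}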
I used the following command, inside the vm, to extract my system tarball backup of my previous system. ssh root@192.168.8.4 'gunzip -c /media/backup/system.tar.gz' | tar -xv --exclude './usr/portage/packages/*' --exclude './userportage/distfiles/*' --exclude './var/log/apache2/*' --exclude ./Bonnie.10218 >extract-list.txt Now, on the host system (192.168.8.4) I am seeing the following... trenta@tdamac ~/Desktop $ uptime 01:39:37 up 1:21, 6 users, load average: 20.49, 14.92, 9.35 Obviously I'm getting REALLY sick performance. Normally something linear like a tar extraction does not produce these kinds of issues with performance. Granted that the disk may have to move around a little, but is it that bad?. Is there some sort of thing I can do, to analyze why this is happening? e.g. something like strace, or something? I ran strace -c on kwrite, during heavy load like this, and it claims that it finished everything in a tenth of a second, even though it took like 30. So, is there a lower level mechanism I can use to get a fix on what is making processes wait? For example, something that will tell me "kernel function X" is blocking? Thanks. Hi Ben, Thank you for the clarification. I think I was really lost on this. I expected the process to wait while IO but then it's supposed that the rest of the system should take the rest of the processor power while it's not. The system seems to hang until IO stops. So I think best way to proceed is to start to discard problems. I propose to start with: I will try to do CPU intensive with no IO task while other process will write a file with no CPU intensive to check if the first process take the same time to execute under high IO or not. Process 1: CPU / No IO Process 2: High UI / No CPU And measure times... Should this test trigger the problem? As no IO for process 1 it should finish almost in the same time than under no load at all. Right? Can we discard a ext3 related problem? Test case (Test writing files 1 thread, over ext3 and ext4, reiser, etc) and observe responsivness. Can we track if this is a fsync problem? How (commands, test case)? How can we test this without making filesystem take part on the tests? Can we show differences between kernel 2.6.16 and >=2.6.28? (I will do this today) How to measure responsiveness? Can we put a numeric value to this? Thank you all. Gonzalo, I think you're giving great question in order for us to establish the cause of the problem. Even though I can't anwer most of your questions (I'm no guru) I think we all should agree on a unified ways to test and measure the responsiveness. Regarding filesystems, I tried ReiserFS, ext3 and ext4 with two terminals running dd if=/dev/zero of=/test1 bs=1M count=1400 and dd if=/dev/urandom of=/tst2 bs=1M count=700 as a test, and they all gave the same sluggish feeling to the system. I agree that those tests let us know that there's a problem, because we see the sluggish behaviour. However, if a kernel dev is not seeing the performance issues on their machines, it won't be very convincing for them. If however, we provide some concrete tests, showing which kernels didn't have the problem, which did, and the test results, then they may be able to get somewhere. That's why I'm hoping someone can chime in and tell us what sorts of tests would be useful, such as I suggested in comment #220. Ok. Here are my firsts tests with 2.6.28.7: I used a modified version of the ThreadSchedulerTest.cpp that kills the initial timeout. And a dd to simulate high IO loads. 
The first hypothesis seems to be broken: high IO load does not seem to affect processing much.

------------------------------------------------------------------
./kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 3362
min:0.008ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:19.791s
Break!
We have Burning CPU with 4855
min:0.006ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:18.754s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 6211
We have Burning CPU with 6212
min:0.007ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:20.265s
DD Finished
--- Finish ---
Kernel tested: 2.6.28.7-level2crm i686
-----------------------------------------------------------------------

The results say it takes 2 seconds more to complete (is this relevant for a process that takes ~18-19s to complete?). A curious thing is that I observed no IO wait while processing in test 2, only system processor time. This also seems strange, as it should be 100% user time. System time (correct me if I'm wrong) means the OS is spending a lot of time scheduling the threads... Anyway, I will try to build up high iowait times before starting the CPU-intensive program, to see if we are right. I will post the test suite in bash. Feel free to add more tests.

Created attachment 20489 [details]
Initial effort to build an automatic test suite for this bug
Please feel free to add tests or correct what's wrong
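For readers who don't want to grab the attachment: the heart of the suite's first test is just a fixed amount of pure CPU work timed by the wall clock, run once on an idle system and once under dd load. A self-contained sketch of that idea follows; the loop body and iteration count are arbitrary stand-ins, not the actual ThreadSchedulerTest code, so scale the count to your machine.

/* Fixed CPU-only workload timed by the wall clock. Run it on an idle
 * system, then again while dd hammers the disk; if runnable tasks are
 * being scheduled fairly, the two durations should be close. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    volatile unsigned long sink = 0;    /* defeat optimization */
    unsigned long i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < 4000000000UL; i++)  /* arbitrary amount of work */
        sink += i ^ (sink >> 3);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("duration:%.3fs\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}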
Hello Gonzalo, I just ran your test suite and here are the results:

---------------------------------
khaal@Xeraphim:~/Desktop/test-suite-bug-12309$ sh kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 17986
min:0.006ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:21.873s
We have Burning CPU with 19909
min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:17.708s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 21084
We have Burning CPU with 21085
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 12.5488 s, 16.7 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 16.0014 s, 13.1 MB/s
DD Finished
Killing 21085 process
--- Finish ---
Kernel tested: 2.6.28-8-generic x86_64
khaal@Xeraphim:~/Desktop/test-suite-bug-12309$
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 18.6493 s, 11.2 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 18.9091 s, 11.1 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 20.0353 s, 10.5 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 20.1651 s, 10.4 MB/s
-------------------------------------

I'm not really familiar with what it is saying, but it did affect the desktop responsiveness. I made a Google spreadsheet that's open for anyone to access, in order to organise test results and see common traits among our systems: http://spreadsheets.google.com/ccc?key=p3aerC-xkjEqvo7BvMHaxXg - there is one thing missing, and that is a place to upload the output of these test results. Does anyone know of a service that's like photobucket, but for text/console output? The document is open for everyone to edit. Please choose a specific colour for yourself so we keep the readability :-)

Created attachment 20491 [details]
Initial effort to build an automatic test suite for this bug V2
This fixes the killing of the process (I hope)
I will try to explain.

TEST 1: The first test takes two measurements of a CPU-intensive program:

We have Burning CPU with 17986
min:0.006ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:21.873s
We have Burning CPU with 19909
min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:17.708s

It takes between 17s and 22s to complete. Lines like

209715200 bytes (210 MB) copied, 18.6493 s, 11.2 MB/s

tell you the throughput of your HD. This throughput is shared between 6 processes that are writing at the same time.

TEST 2: This then tries to do the same thing, but with high IO. Unfortunately the script killed the program before it finished, because the high IO finished before the CPU-intensive program did - so it seems the load is hitting you hard. On my computer the CPU program finished early. Can you run it with the new version, please? NOTE: it writes several 200MB files to your hard disk. Please remove them after the tests... it will take 200x6 = 1200MB of your disk.

For me the throughput is horrible:

First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 14987
min:0.005ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:21.527s
We have Burning CPU with 16371
min:0.005ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:21.833s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 17768
We have Burning CPU with 17769
min:0.007ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:22.777s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 64.2187 s, 3.3 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 75.1226 s, 2.8 MB/s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.28.7-level2crm i686
gad@ws-esp16:~$
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 76.8811 s, 2.7 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 79.4772 s, 2.6 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 82.0248 s, 2.6 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 82.9147 s, 2.5 MB/s
---------------------------

I forgot to say: ext3 filesystem here... I will try with different kernels from now on.

Results from my notebook:

[james@rhapsody tsb]$ ./kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 3772
min:0.009ms|avg:0.013-0.013ms|mid:0.000ms|max:0.000ms|duration:37.528s
We have Burning CPU with 6762
min:0.011ms|avg:0.013-0.013ms|mid:0.000ms|max:0.000ms|duration:37.351s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 9489
We have Burning CPU with 9490
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 21.1718 s, 9.9 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 38.183 s, 5.5 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 41.1141 s, 5.1 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 45.3742 s, 4.6 MB/s
min:0.007ms|avg:0.012-0.013ms|mid:0.000ms|max:0.000ms|duration:38.801s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 49.0724 s, 4.3 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 50.0517 s, 4.2 MB/s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.29-0.54.rc7.git3.fc10.x86_64 x86_64

Output on kubuntu 8.10 running on an EliteBook 8530w. While running, it felt "sluggish", but not by much. When copying/unzipping big files, I can get 10+ seconds of Firefox inactivity.

Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 24021
min:0.004ms|avg:0.018-0.022ms|mid:0.000ms|max:0.000ms|duration:15.861s
We have Burning CPU with 25229
min:0.004ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:15.678s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 27067
We have Burning CPU with 27068
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 15.0066 s, 14.0 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 19.0474 s, 11.0 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 21.9454 s, 9.6 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 22.6718 s, 9.3 MB/s
DD Finished
DD Finished
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 22.9066 s, 9.2 MB/s
DD Finished
DD Finished
DD Finished
DD Finished
DD Finished
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 23.667 s, 8.9 MB/s
DD Finished
DD Finished
DD Finished
min:0.004ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:17.371s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.27-13-generic x86_64
gad@ws-esp16:~$ ./kernel-test.sh /mnt/data/gad/
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 8103
min:0.006ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:21.766s
We have Burning CPU with 10098
min:0.007ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:21.275s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 12105
We have Burning CPU with 12106
min:0.007ms|avg:0.010-0.011ms|mid:0.000ms|max:0.000ms|duration:20.630s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 34.4896 s, 6.1 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 35.157 s, 6.0 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 37.4852 s, 5.6 MB/s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.28-8-generic i686
gad@ws-esp16:~$
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 40.6583 s, 5.2 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 49.9392 s, 4.2 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 51.9306 s, 4.0 MB/s
-----
Filesystem: ext4

Here goes another result (for some reason I get a bunch of "DD Finished" lines; I didn't want to cut them, as I don't know if they're relevant to the test - probably not):

Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 7139
min:0.005ms|avg:0.015-0.031ms|mid:0.000ms|max:0.000ms|duration:22.600s
We have Burning CPU with 8947
min:0.004ms|avg:0.014-0.031ms|mid:0.000ms|max:0.000ms|duration:22.342s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 10772
We have Burning CPU with 10773
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 14.7651 s, 14.2 MB/s
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 16.8547 s, 12.4 MB/s
DD Finished
DD Finished
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 18.5809 s, 11.3 MB/s
DD Finished
DD Finished
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 19.6679 s, 10.7 MB/s
DD Finished
DD Finished
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 20.7152 s, 10.1 MB/s
DD Finished
DD Finished
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 22.0414 s, 9.5 MB/s
DD Finished
DD Finished
min:0.004ms|avg:0.018-0.033ms|mid:0.000ms|max:0.000ms|duration:24.033s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.27-13-generic x86_64

This is ext3. Why are you torturing your disks like that?

yura@suse:~/Desktop> sh kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 14170
min:0.003ms|avg:0.006-0.007ms|mid:0.000ms|max:0.000ms|duration:4.725s
We have Burning CPU with 14815
min:0.004ms|avg:0.006-0.007ms|mid:0.000ms|max:0.000ms|duration:4.752s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 15470
We have Burning CPU with 15471
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 2.45896 s, 85.3 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 4.33352 s, 48.4 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 4.51529 s, 46.4 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 5.22602 s, 40.1 MB/s
DD Finished (x3)
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 5.97021 s, 35.1 MB/s
DD Finished (x3)
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 6.38097 s, 32.9 MB/s
DD Finished (x~85)
min:0.003ms|avg:0.006-0.007ms|mid:0.000ms|max:0.000ms|duration:6.047s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.28.5-default x86_64

$ ./kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 4215
min:0.005ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:14.822s
We have Burning CPU with 5656
min:0.007ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:15.624s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 7403
We have Burning CPU with 7404
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 12.7466 s, 16.5 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 15.3423 s, 13.7 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 17.363 s, 12.1 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 18.3437 s, 11.4 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 18.9163 s, 11.1 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 19.3732 s, 10.8 MB/s
min:0.005ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:18.564s
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.29-rc7-zen2-ARCH-20090309 x86_64

I have noticed that CPU clock scaling responds sluggishly during heavy IO. From time to time it stays at the lowest clock rate, even though other processes were doing CPU-intensive (but discontinuous) work. I just had a freeze for 20 seconds during such a state: I could move the mouse, but the cursor did not change. All panels were working, but I could not move or switch windows.
Gonzalo, is it possible to include the motherboard chipset in the test? It would be interesting to see whether everybody who is affected has the same or similar chipsets...

Here's another test result, with 2.6.29-rc7. Still affected by the bug, on ext4.
khaal@Xeraphim:~/Desktop/test-suite-bug-12309-v2$ sh kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 9080
min:0.007ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:23.801s
We have Burning CPU with 14728
min:0.007ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:22.593s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 19811
We have Burning CPU with 19812
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 13.901 s, 15.1 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 15.2808 s, 13.7 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 15.4188 s, 13.6 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 16.1941 s, 13.0 MB/s
DD Finished
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 16.6363 s, 12.6 MB/s
DD Finished (x3)
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 17.1937 s, 12.2 MB/s
DD Finished (x~95)
min:0.004ms|avg:0.008-0.009ms|mid:0.000ms|max:0.000ms|duration:18.957s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.29-020629rc7-generic x86_64
Created attachment 20503 [details]
Results in ODF for spreadsheet
This shows the information recovered by each of the tests performed.
Created attachment 20504 [details]
Results in ODF for spreadsheet
This shows the information recovered by each of the tests performed.
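A comment above asks for the motherboard chipset to be included in the reports. A minimal sketch for gathering that information; it assumes pciutils (lspci) and dmidecode are installed, and dmidecode needs root:

#!/bin/sh
# Report the board and chipset for inclusion in test results.
echo "Board: $(dmidecode -s baseboard-manufacturer) $(dmidecode -s baseboard-product-name)"
# The host bridge and IDE/SATA controller lines identify the chipset:
lspci | grep -Ei 'host bridge|ide interface|sata'

Appending this output to each kernel-test.sh run would make it easy to correlate affected systems by chipset.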
I uploaded a spreadsheet to show the results... For me, high IO is affecting the scheduler or the processor. It does not matter much for these short tests, but it may be important when long processing takes place. It is very significant that the increment is always about 2 seconds across all the runs, including Yuriy Lalym's tests, where a run that should normally take only 4.7 s gets its processing time incremented by 1.3 s. Why always around 2 seconds? We can also see that ext4 does not really seem to be affected - maybe because of its throughput? It would be interesting to know the filesystem tested by Khalid Rashid, because his runs take less time to complete under high IO, as mine do on ext4. What the "DD Finished" lines say is that the last IO transfer finished before the CPU-intensive task did; maybe this also affected the results.

OK, I will fix the format of the testsuite output and include other tests. Temp files will also be deleted after the tests. What other tests should be included? I will try to reproduce the fsync problem so I can include it in the tests, and will also try to report the motherboard chipset as requested... Any ideas on what to test?

I have one question for the kernel developers: how much processor time is normal for a dd process using DMA? I have two hypotheses:
1. The kernel is taking too much time switching the process in and out, even when it is blocked on IO.
2. There is a lock that prevents the scheduler from running freely.
How can I track down the processor time of a program (say dd)? I want to see whether the times for each kind of process are normal; current computers are fast, and sometimes we do not realize that a process is taking too much time to complete. Are there good ways to profile the kernel while looking at only one PID? I want to profile specific parts of the kernel. Any good docs? Thank you all!

I forgot to say: for now, don't use the testsuite any more until the new tests are here.

Xeon-based server, internal SATA2 HDD (no RAID), SLES 10 SP2
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 31607
min:0.004ms|avg:0.013-0.049ms|mid:0.000ms|max:0.000ms|duration:19.071s
We have Burning CPU with 7637
min:0.004ms|avg:0.015-0.057ms|mid:0.000ms|max:0.000ms|duration:21.218s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 15831
We have Burning CPU with 15832
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 1.0195 s, 206 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 1.04578 s, 201 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 1.26246 s, 166 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 1.90053 s, 110 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 2.19354 s, 95.6 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 2.22529 s, 94.2 MB/s
min:0.003ms|avg:0.014-0.060ms|mid:0.000ms|max:0.000ms|duration:20.705s
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.16.60-0.21-smp x86_64

Xeon-based server, 3ware RAID-1 (2 SAS disks), SLES 10 SP2
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 22420
min:0.004ms|avg:0.015-0.071ms|mid:0.000ms|max:0.000ms|duration:25.210s
We have Burning CPU with 28763
min:0.004ms|avg:0.018-0.083ms|mid:0.000ms|max:0.000ms|duration:33.232s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 1628
We have Burning CPU with 1629
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 0.335776 s, 625 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 0.367063 s, 571 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 0.363934 s, 576 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 0.430686 s, 487 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 0.520617 s, 403 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 0.531063 s, 395 MB/s
min:0.004ms|avg:0.014-0.065ms|mid:0.000ms|max:0.000ms|duration:22.025s
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.16.60-0.21-smp x86_64

bpenglas@PC010233L ~/Desktop/bug $ ./kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 10638
min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:14.790s
We have Burning CPU with 13523
min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:13.953s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 14793
We have Burning CPU with 14794
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 14.7986 s, 14.2 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 17.6264 s, 11.9 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 19.4253 s, 10.8 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 19.9593 s, 10.5 MB/s
DD Finished (x14)
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.898 s, 9.6 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.9509 s, 9.6 MB/s
DD Finished (x26)
min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:14.694s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.29-rc3-zen1-1-07438-g2953ca1 x86_64

Gonzalo, as I stated before, I am on ext4 mounted with the noatime and nodiratime flags. However, even though my throughput is fast according to the test, my performance still takes a big hit during the tests. I'm considering reformatting my partitions to ext3 so I can get an older kernel running and test how it fares. Also, it would be great to collect the results in one place; I've put a page up at http://tinyurl.com/au4fda - feel free to rearrange it to fit your needs. Well done with the test suite, and good bug hunting, everyone :-)

(In reply to comment #243) File system - xfs
(In reply to comment #244) Forgot to mention: on this system, all filesystems are ext3.
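On the earlier question of how to track down the processor time of a single program such as dd, a minimal sketch that samples it from /proc; the field numbers come from proc(5), and the one-second cadence is an arbitrary choice:

#!/bin/sh
# Sample the user/system CPU time of a running dd once a second.
# Fields 14 and 15 of /proc/<pid>/stat are utime and stime, in clock ticks.
pid=$(pidof -s dd) || exit 1
tck=$(getconf CLK_TCK)
while [ -d "/proc/$pid" ]; do
    set -- $(cut -d' ' -f14,15 "/proc/$pid/stat")
    echo "dd ($pid)  utime: $(($1 / tck))s  stime: $(($2 / tck))s"
    sleep 1
done

If dd shows almost no utime/stime but the system still stalls, that points away from the dd process itself and toward the kernel's writeback path.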
That is also without my VMs running, and it's my work machine. I'll try to get results with the VMs running, and also from my home box, tomorrow (3/13/09). My work machine:
bpenglas@PC010233L ~/kernel $ ./kernel-test.sh
Using current dir to do IO tests
First Test: How much gets to run the CPU intensive task?
We have Burning CPU with 16034
min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:19.169s
We have Burning CPU with 18771
min:0.005ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:17.182s
Second Test: Does the process queue get blocked because high IO?
Starting
We have High IO PID 21066
We have Burning CPU with 21067
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.8451 s, 9.6 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.7598 s, 9.6 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 21.9914 s, 9.5 MB/s
DD Finished (x~30)
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 24.8323 s, 8.4 MB/s
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 24.9565 s, 8.4 MB/s
DD Finished (x9)
200+0 records in 200+0 records out 209715200 bytes (210 MB) copied, 25.6149 s, 8.2 MB/s
DD Finished (x3)
min:0.004ms|avg:0.007-0.008ms|mid:0.000ms|max:0.000ms|duration:15.944s
DD Finished
IO Finished before than processing
--- Finish ---
Kernel tested: 2.6.29-rc3-zen1-1-07438-g2953ca1 x86_64
This is while Firefox is open, Audacious is playing music, and two VMware Workstation VMs are running (Windows Vista and Windows XP). All filesystems are ext3; the main system drive is a WD 80 GB at 10k RPM, the other drive is a 250 GB 7200 RPM. All-Intel chipset, with a Core 2 Duo E8200. It's a Dell GX755.

Simple test case:
dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
sleep 10
time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
You'd expect the small file to be written fairly quickly - as in a couple of seconds at most. But on every system with a recent kernel I've tried this on, it takes 6-45 seconds. Why the huge range? I'm not sure, but available memory seems to have something to do with it: the more memory in the machine, the longer the small-file write takes.

(In reply to comment #249)
> dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
> sleep 10
> time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
real 0m1.808s
user 0m0.001s
sys 0m0.001s
I don't think this gets to the issue.
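As a variation on the simple test case above, a sketch that repeats the 4k probe once a second for the lifetime of the big write, so latency can be sampled over time rather than just once. The paths and the one-second cadence are arbitrary choices, not from the thread:

#!/bin/sh
# Sample small-write latency repeatedly while the big streaming write runs.
dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
big=$!
while kill -0 $big 2>/dev/null; do
    # dd's own summary line ("... copied, N s, N kB/s") carries each probe's time.
    dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync 2>&1 | grep copied
    sleep 1
done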
Well, for me the test case above does get to the issue:
dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
sleep 10
time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 15.8284 s, 0.3 kB/s
real 0m16.024s
user 0m0.004s
sys 0m0.020s

(In reply to comment #249)
> dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
> sleep 10
> time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
2.6.28-gentoo-r2 (/tmp on reiser3.6, rootfs drive):
> 4096 bytes (4.1 kB) copied, 10.618 s, 0.4 kB/s
> real 0m10.620s
> user 0m0.000s
> sys 0m0.077s
2.6.28-gentoo-r2 (/tmp on ext4, other drive):
> 4096 bytes (4.1 kB) copied, 5.34679 s, 0.8 kB/s
> real 0m5.349s
> user 0m0.000s
> sys 0m0.003s
2.6.27.19-3.2-default (openSUSE 11.1) (/tmp on ext3, rootfs):
> 4096 bytes (4.1 kB) copied, 60.5764 s, 0.1 kB/s
> real 1m2.827s
> user 0m0.004s
> sys 0m0.036s

(In reply to comment #250)
> real 0m1.808s
> user 0m0.001s
> sys 0m0.001s
My 1.808s was on 2.6.27-gentoo-r8 with XFS on a 3ware 8-drive SATA RAID.

2.6.28.7 w/ReiserFS
4096 bytes (4.1 kB) copied, 6.96955 s, 0.6 kB/s
real 0m6.972s
user 0m0.001s
sys 0m0.026s

André, did you mean to take ownership of this bug away from Jens?

It looks like the test case I posted earlier is very effective at demonstrating at least one of the issues affecting people in this thread (namely, people using ext3 or reiserfs). It appears that xfs and ext4 are better at avoiding these huge latencies - I'm also assuming that the IO scheduler interacts with these filesystems differently. Matt - I don't think this test case works as well for you because you have such a fast disk array. I imagine you can write 10 GB pretty quickly with an 8-drive array. Try increasing the 10 GB to 100 GB and increasing the sleep to 20-30 seconds so that you get more data waiting to be flushed to disk.

David, I want to stress that while my earlier test results looked good on my ext4 filesystem, I was still affected by the slow performance. I think we need a (different?) way to measure desktop responsiveness in order to get actual values from there too.

Where are your performance numbers from the test case, Khalid, and what is your hardware setup like? André posted numbers in comment #252 on ext4 which are better than his ext3/reiserfs numbers but still very poor, IMO. It's fairly obvious that there are likely multiple bugs causing similar symptoms, all jumbled into this bug report. Jens has asked for a simple test case illustrating at least one issue discussed in this thread. I have presented one extremely simple test case which duplicates the problems I (and others) am seeing. Feel free to create another.

Sorry, didn't mean to reassign the bug in the first place.

I've been doing some testing - two tunables I've found (briefly mentioned earlier) that help immensely: setting /proc/sys/vm/dirty_background_ratio to 1 and /proc/sys/vm/dirty_ratio to 2. On some of the systems I've run the test on, it reduces latency to a fraction of a second; on other systems it reduces it from 20+ seconds to less than 10. Anyone else see similar behaviour with my simple test?

(In reply to comment #259)
> I've been doing some testing - two tunables I've found (briefly mentioned
> earlier) that help immensely: setting /proc/sys/vm/dirty_background_ratio
> to 1 and /proc/sys/vm/dirty_ratio to 2.
> On some of the systems I've run the test on, it reduces latency to a
> fraction of a second; on other systems it reduces it from 20+ seconds to
> less than 10.
> Anyone else see similar behaviour with my simple test?
This is right. Although it doesn't eliminate the stutter (mouse freezing for 1-2 seconds) during heavy IO, it does make that stutter tolerable. It's basically converting your IO to almost-synchronous inline writes instead of leaving the work for pdflush to pick up later and choke the hell out of the IO subsystem. I have no idea why, on larger memory configurations, those default values are set as high as 40 and 20 (IIRC). I mean, on a 4 GB RAM system we may not see any IO landing on disk until expiry alarms fire in pdflush or 40% of 4 GB = 1.6 GB is ready to be written.

PC010233L vmware # dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
[1] 10528
PC010233L vmware # sleep 10
PC010233L vmware # time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.00333981 s, 1.2 MB/s
real 0m0.054s
user 0m0.000s
sys 0m0.000s
PC010233L vmware # time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.604249 s, 6.8 kB/s
real 0m3.219s
user 0m0.000s
sys 0m0.000s
The second run of the small dd was about 2 minutes later. My / (and /tmp) is located on a WD 10k RPM SATA II drive. And after fixing the dirty ratios...
PC010233L vmware # dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000 conv=fdatasync &
[1] 10548
PC010233L vmware # sleep 10
PC010233L vmware # time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 1.41179 s, 2.9 kB/s
real 0m2.044s
user 0m0.000s
sys 0m0.002s
PC010233L vmware # time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000649804 s, 6.3 MB/s
real 0m6.366s
user 0m0.000s
sys 0m0.002s
PC010233L vmware #
Again, the second one was about 2 minutes afterwards.

(In reply to comment #261)
Brandon, this test case doesn't seem to reproduce any significant latency issues for you. I suspect that the 10k RPM disk is able to write fast enough to keep a significant amount of data from being buffered in memory. 1.5 seconds isn't great, but all my systems are at least 5 times worse than that, and often 10-40 times worse. Do you notice a large latency hit on the system while the large write is running? Why are you running that second small write afterwards? Was the big write done at that point or not? The latency of your small writes does seem to vary by quite a bit.

(In reply to comment #255)
> Matt - I don't think this test case works as well for you because you have
> such a fast disk array. Try increasing the 10 GB to 100 GB and increasing
> the sleep to 20-30 seconds so that you get more data waiting to be flushed
> to disk.
Setting dirty_background_ratio=1 and dirty_ratio=2 had a HUGE effect on my system.
$ dd if=/dev/zero of=/var/tmp/bigfile bs=1M count=100000 conv=fdatasync & sleep 30 ; time dd if=/dev/zero of=/var/tmp/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 6.96642 s, 0.6 kB/s
real 0m8.590s
user 0m0.000s
sys 0m0.004s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 1354.9 s, 77.4 MB/s

# echo 1 > dirty_background_ratio ; echo 2 > dirty_ratio

$ dd if=/dev/zero of=/var/tmp/bigfile bs=1M count=100000 conv=fdatasync & sleep 30 ; time dd if=/dev/zero of=/var/tmp/smallfile bs=4k count=1 conv=fdatasync
[1] 22718
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.72366 s, 5.7 kB/s
real 0m0.725s
user 0m0.000s
sys 0m0.001s
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 359.02 s, 292 MB/s

(In reply to comment #262)
> Brandon, this test case doesn't seem to reproduce any significant latency
> issues for you. [...]
The large write took a while to complete (about 10 minutes, and it only got to 5.3 GB before I killed it), and yes, VERY degraded performance... it took me a while to ssh in and kill it, as the local console was almost unusable. The first small write wasn't run at the point where the system started lagging out on me; it was run while RAM and CPU usage were climbing, so I decided to run it again at a later point just to see. I can try doing the writes on my 7200 RPM disk tomorrow when I'm back at work; I just need to point the output at a different partition.

David Rees, all my test results are presented here: https://spreadsheets.google.com/ccc?key=p3aerC-xkjEqvo7BvMHaxXg&hl=en and my computer components can be seen here: http://h10025.www1.hp.com/ewfrf/wc/prodinfoCategory?lc=en&cc=se&dlc=sv&product=3387690&lang=sv& I also tried this on a WD Raptor drive, just to ensure that faulty hard drives were not the cause, and the symptoms were still present.

PC010233L ~ # dd if=/dev/zero of=/home/bigfile bs=1M count=10000 conv=fdatasync &
[1] 22333
PC010233L ~ # sleep 10
PC010233L ~ # time dd if=/dev/zero of=/home/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 6.27386 s, 0.7 kB/s
real 0m6.275s
user 0m0.000s
sys 0m0.000s
PC010233L ~ # time dd if=/dev/zero of=/home/smallfile bs=4k count=1 conv=fdatasync
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 2.4702 s, 1.7 kB/s
real 0m2.482s
user 0m0.000s
sys 0m0.000s
This was going to /home, which is on a 250 GB 7200 RPM SATA II drive. Also, even though the second run (done a minute or two later) completed quickly, it was about another 10 seconds until I got the prompt back.
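For reference, a minimal sketch of applying the two tunables discussed above. The values come from the comments; running as root is required, and the persistence line is an assumption about the reader's setup (append it only once):

# Apply immediately:
echo 1 > /proc/sys/vm/dirty_background_ratio
echo 2 > /proc/sys/vm/dirty_ratio
# Equivalent via sysctl, plus persistence across reboots:
sysctl -w vm.dirty_background_ratio=1
sysctl -w vm.dirty_ratio=2
printf 'vm.dirty_background_ratio = 1\nvm.dirty_ratio = 2\n' >> /etc/sysctl.conf

The trade-off, as the results above suggest, is that writers are forced to flush almost immediately, which keeps the amount of dirty page cache (and therefore the worst-case stall) small.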
(In reply to comment #249)
Same system as in #254, but I changed the kernel to the latest rc of .29.
2.6.29-rc8 w/ReiserFS
4096 bytes (4.1 kB) copied, 1.2374 s, 3.3 kB/s
real 0m2.843s
user 0m0.001s
sys 0m0.003s

Just a quick note: I've been having considerable trouble with kernels since 2.6.17 as well, but I recently ran across this article, http://kerneltrap.org/node/3000, citing: "Kernel maintainer Andrew Morton has said that he runs his desktop machines with a swappiness of 100". That made me wonder whether my swappiness of 1 might not be such a good idea. An example of misbehaviour which I was actually crediting to this bug can be seen here: http://hfopi.org/files/temp/time-trouble.jpg (look at the three different clock times). This problem was the result of physical memory running full, which was happening a lot (VLC memory leak...), stalling the system sometimes for hours. Well, setting swappiness... ah, let me quote Andrew: "I'm gonna stick my fingers in my ears and sing 'la la la' until people tell me 'I set swappiness to zero and it didn't do what I wanted it to do'." Well, here I am. To all of you: setting swappiness to extremely low values is a bad idea and won't achieve what you expect it to. So that might actually be your problem if you have done so; echo 100 > /proc/sys/vm/swappiness and run the test.

(In reply to comment #268)
I see the problem, and I've never touched 'swappiness'.
$ cat /proc/sys/vm/swappiness
60
Actually, I have no swap at all.
# swapon -s
swapon: /proc/swaps: No such file or directory

Well, Mathieu Desnoyers did a fix for the write-cache accounting which stops the kernel write cache from eating up all available memory + swap. Without the fix, the slowness is worked around by setting swappiness to 0 or disabling swap. The fix is AFAIK not in 2.6.29.

So, we have one guy (#268) saying high swappiness will solve the problem and another guy (#270) saying setting swappiness to 0 will solve it. I have a feeling neither is going to work, because I have run my system both ways and this bug appears under high IO load in both cases. But I would like to see what others find. The swappiness setting is irrelevant to this bug; it is a disk IO problem no matter which way you look at it. Yes, if you are swapping, this bug will make the system even slower. P.S. I'm convinced that swap is evil. I just disable my swap, and my system works much better, especially when I get a runaway memory-hog process.

Hi:
A) If you have multiple hard drives, they are not equally affected. If you copy a file (e.g. 7 GB) from drive A to drive B, a job running on drive C does not slow down - except perhaps if a swap file is in use. A job, in my case, is a VMware virtual machine; I was spreading machines over different hard drives to reduce the trouble.
B) Isn't this slowdown a planned action of the system? About /proc/sys/vm/dirty_ratio: "Note that all processes are blocked for writes when this happens" (see the original text quoted below). This is what slows everything down. IMHO, it should be: if dirty_ratio is reached, slow down the job that is creating so much "dirt" and leave the other ones alone - one possible mechanism is sketched below.
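A hedged sketch of one existing knob that approximates this per-job slowdown: the idle IO class via ionice, which is honoured by the CFQ scheduler. This is a suggestion, not something tested in this thread; the rsync paths are taken from the experiment further below:

# Start the bulk writer in the idle class so it only gets disk time
# when nothing else wants it (requires the cfq IO scheduler):
ionice -c3 rsync /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test
# Or demote a copy that is already running:
ionice -c3 -p $(pidof rsync)

One caveat: ionice only shapes the IO the process submits itself; buffered writeback still happens later via pdflush at normal priority, so this may help less than rsync --bwlimit does.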
Cut out from http://www.westnet.com/~gsmith/content/linux-pdflush.htm:
8< -------------------
Process page writes
There is another parameter involved though that can spill over into management of user processes:
/proc/sys/vm/dirty_ratio (default 40): Maximum percentage of total memory that can be filled with dirty pages before processes are forced to write dirty buffers themselves during their time slice instead of being allowed to do more writes.
Note that all processes are blocked for writes when this happens, not just the one that filled the write buffers. This can cause what is perceived as an unfair behavior where one "write-hog" process can block all I/O on the system. The classic way to trigger this behavior is to execute a script that does "dd if=/dev/zero of=hog" and watch what happens. See Kernel Korner: I/O Schedulers for examples showing this behavior.
8< -------------------
Reference: http://www.westnet.com/~gsmith/content/linux-pdflush.htm
Does someone have an idea how to slow down the IO-heavy job (automatically)? If the throughput of dd, rsync, or whatever is reduced the moment a trigger value is reached, the problem would affect only dd, rsync, ... and not the rest of the system.

Hi again: My test is to throttle the bandwidth using "rsync --bwlimit=<throughput>". I am testing using VMware on /images3. VMware runs fluently until I copy a large file (a 7 GB vmdk) to /images3, which is a separate hard drive holding the .vmdk files of 5 VMware systems. Copying this 7 GB file freezes the VMware systems for > 30 seconds. And now with limited bandwidth... all jobs run fine, no hanging or anything:
rsync --bwlimit=10000 /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test
rsync --bwlimit=20000 /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test
Some jobs start to become slow and hang:
rsync --bwlimit=30000 /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test
A lot of jobs hang and are very slow; some freeze:
rsync --bwlimit=40000 /images5/vmware/vlab03/STD_XP_Prof.vmdk /images3/test
This is my estimation: rsync creates more dirt than the kernel can get rid of, and the system is put into the "processes are blocked for writes" mode (see the previous posting). I hope my input can help.
Created attachment 20656 [details]
vmstat with high # of uninterruptible processes
I just had a hang for about 10-15 minutes. My system started to freeze, so I immediately switched to a console, and ran "vmstat 1" (see attachment).
I sat there and watched it, as I wanted to catch it immediately after it became usable again, so that I could check the load average.
uptime
23:38:18 up 6 days, 4:49, 8 users, load average: 23.30, 26.12, 16.21
A load average of 23, with a 5-minute load average of 26. OUCH.
I have no swap, and I think the problem happened when one of my processes did something to lock up the machine. But take note of how many processes are blocked in UNINTERRUPTIBLE sleep at various times...
I think I have also realized something very interesting about this bug: it does not occur as readily when you have a fast disk. As I mentioned in previous comments, my MacBook and my D820 have the same hardware, yet I am now rarely experiencing this on the D820. The only IO-related difference I can see is that the D820 just had a 320 GB, 80 MB/s drive put into it. My MacBook's drive runs at approximately 20-25 MB/s.
Also, given that I am pretty sure that one of my processes hung the machine, it seems (though I am not a kernel hacker) like this bug may be related to a wait on a mutex or semaphore in a location where it should not be, hence the high number of uninterruptible processes? Could that be?
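For watching this live during a stall, a small sketch using standard procps tools; the sysrq hint assumes a kernel with magic SysRq enabled:

# Poll for processes stuck in uninterruptible (D) sleep; the wchan
# column hints at the kernel function they are blocked in.
while true; do
    date
    ps -eo state,pid,wchan:25,cmd | grep '^D'
    sleep 1
done
# As root, "echo w > /proc/sysrq-trigger" also dumps the stack of every
# blocked task to the kernel log (dmesg).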
There has been more discussion related to this issue on LKML, attached to the 2.6.29 kernel release thread. I'll direct interested parties to this post from Ted Tso: http://lkml.org/lkml/2009/3/24/227 Attached to that post is Ted's fsync latency measuring tool. If people have a workload which generates high latency, this tool may be useful for measuring it and then posting that workload to Ted/LKML. His testing tool doesn't do anything much different from my earlier dd test, except that he writes 1 MB of data, which may show higher latencies. For those interested, I picked up a couple of other workarounds for the people this is affecting:
1. Mount ext3 in writeback mode instead of ordered. This has the drawback of leaving your data a bit more vulnerable than the default, but data writes will no longer be forced to complete in order with metadata.
2. Increase the IO priority of kjournald: for i in `pidof kjournald` ; do ionice -c1 -p $i ; done
One theory is that by default kjournald is fighting for IO priority with normal processes. By making the IO priority of kjournald higher, the "important" data (i.e., data that is getting synced to disk) should get written out faster, reducing user-visible latency. See this post/thread for more detail: http://lkml.org/lkml/2008/10/2/205

I've tested the second workaround posted by David above (high IO priority for kjournald), and it definitely improves things in my case. My test is very simple: doing normal upgrades under Ubuntu (esp. kernel packages) always makes Firefox, and even Evolution or the whole desktop, freeze for several seconds, up to about 20 s in some cases. With that workaround the freezes don't last more than ~1 s; the desktop experience is not really smooth, but I can work during upgrades. So I guess we can track down at least one specific issue here, which may be the major one affecting desktop boxes, and which seems to have appeared (maybe in different ways) between 2.6.17 and 2.6.28. I'm using a fairly basic Toshiba Satellite laptop with 512 MB of RAM and a 4200 RPM HD. Can anybody confirm that too?

OK, I'm also testing the kjournald option to see if it improves things; I will post after some testing. I want to include the fsync tests you pointed out. I tested it and it gave me:
fsync time: 0.0145
fsync time: 0.0205
fsync time: 0.0221
fsync time: 0.0195
fsync time: 0.0177
fsync time: 0.0702
fsync time: 0.0456
What's the correct way to do reliable tests? I will include it in the test suite.

The kjournald option makes my system much more responsive.

Hi guys, after reading those LKML messages from Theodore regarding his sync patches, it gave me an idea: why not just mount my filesystem with the "sync" mount option? I ran the following command on one console...
dd if=/dev/zero of=/tmp/bigfile bs=1M count=10000
...and Theodore's fsync-test on another. On the standard test, WITHOUT mounting with sync, I get these results out of Theodore's test:
fsync time: 1.5693
fsync time: 18.8047
fsync time: 21.2672
fsync time: 18.6747
fsync time: 2.3821
fsync time: 2.0494
fsync time: 2.8781
fsync time: 21.6300
Here's a "vmstat 1" snippet; all the lines while the dd is running are roughly the same:
2 9 380388 16716 33412 1409988 0 0 0 15340 806 1188 3 4 0 93
0 8 380388 15748 33428 1411080 0 0 0 16284 1165 2350 7 8 0 85
0 9 380388 16620 33432 1409752 0 0 0 18240 878 1108 5 3 0 92
1 8 380388 16776 33452 1410108 0 0 0 11888 1046 1140 10 8 0 82
When I do the following...
mount -o remount,rw,sync /dev/s/sys /
...I get the following numbers while running the same dd command:
fsync time: 0.0067
fsync time: 0.0369
fsync time: 0.0208
fsync time: 0.0099
fsync time: 0.1175
fsync time: 0.0337
fsync time: 0.0003
fsync time: 0.0219
fsync time: 0.0110
fsync time: 0.0142
fsync time: 0.0076
fsync time: 0.0146
fsync time: 0.0153
fsync time: 0.1104
fsync time: 0.0061
fsync time: 0.0003
With a "vmstat 1" snippet of:
1 0 380624 1112236 93104 297252 0 0 0 13056 920 1167 5 3 49 43
0 1 380624 1098212 93252 311044 0 0 0 15876 925 1165 5 4 52 38
1 2 380624 1085796 93408 323296 0 0 0 13800 996 1239 10 4 47 38
Did something in the kernel change a couple of years ago with regard to syncing? Just an FYI: there were some mm/msync.c "fsync"-related changes between 2.6.16.62 and vanilla 2.6.17. I didn't see the problem until after 2.6.17, but perhaps Gentoo had patched the kernel heavily; I don't know. I'll try to do some more diffs between the kernel versions around the time I started having the problem, in case it helps you guys figure it out.

From the first 2.6.17 release to the first 2.6.18 release (I haven't narrowed it down to exact versions), 3 PF_SYNCWRITE-related lines were removed from mm/msync.c. Some PF_SYNCWRITE-related code in block/cfq-iosched.c was added in 2.6.17 (diff between 2.6.16.62 and 2.6.17) and then removed in 2.6.18. There is also fs/ sync-related stuff between 2.6.16.62 and 2.6.17. I hope I'm not spamming. :P

(In reply to comment #280)
> After reading those LKML messages from Theodore, regarding his sync patches,
> it gave me an idea. Why not just mount my filesystem with the "sync" mount
> option.
What are the disadvantages of the sync mount option? Reduced bandwidth? Higher latency? The data you posted doesn't show any disadvantages, or maybe I don't know what to conclude from that data.

(In reply to comment #284)
> What are the disadvantages of the sync mount option?
It appears that the overall transfer rate decreased a tiny bit. But the big advantage of not using "sync" on mount is that the system can queue the writes, so anything that fits into the kernel queues appears way faster to the user. That's my understanding of the difference between sync and not using sync. Oh, I should give an example: normally, when doing a dd of, say, 10 MB, your write would run at several hundred MEGABYTES per second, because it's writing to memory, not disk. In my case, with sync, I only get disk speeds, even with 10 MB. So yeah, the memory queueing is WAAAAY faster until you reach the limit.

One last thing, for the kernel devs, as this may be important... The comment in 2.6.28's version of msync.c is as follows:
/*
 * MS_SYNC syncs the entire file - including mappings.
 *
 * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
 * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
 * Now it doesn't do anything, since dirty pages are properly tracked.
 *
 * The application may now run fsync() to
 * write out the dirty pages and wait on the writeout and check the result.
 * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
 * async writeout immediately.
 * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
 * applications.
 */
This is an interesting comment, mainly because there was some logic based on MS_SYNC that was removed from msync.c in 2.6.18 (as I mentioned at the top of comment #282). That code would set the PF_SYNCWRITE flag. The code exists in 2.6.17 but not in 2.6.18. I haven't checked whether it was the 2.6.18 change that did it or an earlier 2.6.17.x change. Is this a problem, kernel devs?

I have the same result here when mounting with the "sync" option. I also tried async, and ionice -c1 `pidof kjournald`, and neither seems to improve the latency measured by fsync-tester.

(In reply to comment #280)
> After reading those LKML messages from Theodore, regarding his sync patches,
> it gave me an idea. Why not just mount my filesystem with the "sync" mount
> option.
> [...]
> Did something in the kernel change a couple of years ago with regard to
> syncing?
@ #286: No, the msync.c MS_[A]SYNC change is IMHO _not_ related. With the introduction of the Unified (disk) Buffer Cache, msync(MS_ASYNC) became basically a no-op: every process sees the same contents for a block, whether it uses read() or mmap() to access it. Other unices (without UBC) may behave differently. For MS_SYNC the situation is more complicated (IIUC, it is hard to wait for all pages to have been written if other processes may re-dirty them simultaneously). This bug/issue is not about throughput; it is about latency and the (lack of) responsiveness of other, unrelated processes. BTW, to me it seems there are actually two symptoms:
1) initially, the mouse cursor is stuck ("stuck/jerky mouse syndrome");
2) later on, the cursor gets quicker, but actions (pop-ups, window focus, ...) are still slow.
Document latencies with 2.6.30-rc1 (which should be much better for most people - make sure that if you are using ext3, that you mount your filesystem with the same journalling mode, as the default has changed) To document latencies, start a large streaming write: # dd if=/dev/zero of=/tmp/bigfile bs=1M count=5000 And run Ted Tso's latency testing tool in parallel (grab/compile it from here: http://lkml.org/lkml/2009/3/24/227) If you still have questions, read the last 50 or so comments to this bug for more information. (In reply to comment #299) > For anyone who wants to test, here's what to do: # uname -a Linux amd64 2.6.29.1 #4 SMP PREEMPT Fri Apr 3 07:27:52 MSD 2009 x86_64 x86_64 x86_64 GNU/Linux # cat /proc/meminfo | grep MemTotal MemTotal: 4127376 kB #cat /proc/cpuinfo | grep -i "Model name" | uniq model name : Dual Core AMD Opteron(tm) Processor 265 # cat /proc/mounts | grep ' / ' /dev/sda2 / xfs rw,noatime,nodiratime,relatime,noquota 0 0 # hdparm -i /dev/sda | grep Model Model=WDC WD1500AHFD-00RAR5, FwRev=21.07QR5, SerialNo=WD-WMAP43732535 /* Western Digital Raptor */ # dd if=/dev/zero of=./bigfile bs=1M count=5000 && ./fsync-tester 5000+0 records in 5000+0 records out 5242880000 bytes (5,2 GB) copied, 69,7789 s, 75,1 MB/s fsync time: 0.0076 fsync time: 0.0091 fsync time: 0.0436 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0358 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0358 fsync time: 0.0359 fsync time: 0.0358 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 fsync time: 0.0359 ^C (In reply to comment #300) > # dd if=/dev/zero of=./bigfile bs=1M count=5000 && ./fsync-tester That's supposed to be a single ampersand, which causes the dd process to start in the background so the fsync-tester process can run simultaneously with it. (In reply to comment #301) > ...to start in the background ... dd if=/dev/zero of=./bigfile bs=1M count=5000 & ./fsync-tester; [1] 5298 fsync time: 0.0266 fsync time: 0.7677 fsync time: 0.6938 fsync time: 0.5879 fsync time: 1.1956 fsync time: 0.9582 fsync time: 0.9866 fsync time: 1.1833 fsync time: 0.6964 fsync time: 0.9986 fsync time: 0.9624 fsync time: 0.9093 fsync time: 0.9999 fsync time: 0.4423 fsync time: 0.8406 fsync time: 1.0880 fsync time: 0.1754 fsync time: 0.9039 fsync time: 0.8727 fsync time: 0.1261 fsync time: 0.2749 fsync time: 0.8547 fsync time: 0.5241 fsync time: 0.8164 fsync time: 0.4006 fsync time: 0.6532 fsync time: 0.8521 fsync time: 0.4151 fsync time: 0.3384 fsync time: 0.3326 fsync time: 0.4330 fsync time: 0.5800 fsync time: 0.8854 fsync time: 0.5953 fsync time: 0.3899 fsync time: 0.6722 fsync time: 0.1056 fsync time: 0.5554 ^C Created attachment 20972 [details] fsync tester kernel 17 - 30 I have tested the kernels 17, 18, 20, 28, 29, 29 (patched with http://bugzilla.kernel.org/attachment.cgi?id=20172) and 30 (f4efdd65b754ebbf41484d3a2255c59282720650), which should include the patches. I got great results with the patched 29 kernel at the beginning and bad results, while executing the test again. This test case is not reliable, or my installation is changing parameters while switching the kernels. I have executed the two commands concurrent (Comment #299). 
dd if=/dev/zero of=./bigfile bs=1M count=5000 & ./fsync-tester ASUS P5K linux suse 2.6.29-53-default x86_64 # cat /proc/meminfo | grep MemTotal MemTotal: 8196428 kB # cat /proc/cpuinfo | grep -i "Model name" | uniq model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz # cat /proc/mounts | grep ' /home ' /dev/sda3 /home xfs rw,attr2,noquota 0 0 # hdparm -i /dev/sda Model=ST31000340AS /* Seagate SATA2 */ UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 ~> dd if=/dev/zero of=./bigfile bs=1M count=5000 & ./fsync-tester [1] 5346 setting up random write file 5000+0 records in 5000+0 records out 5242880000 bytes (5.2 GB) copied, 90.9677 s, 57.6 MB/s done setting up random write file starting fsync run starting random io! fsync time: 1.0965s fsync time: 0.4574s fsync time: 0.7729s fsync time: 0.3746s fsync time: 0.5232s fsync time: 0.1928s fsync time: 0.9374s fsync time: 0.6353s fsync time: 0.3625s fsync time: 0.4970s fsync time: 0.3150s run done 11 fsyncs total, killing random writer [1]+ Done dd if=/dev/zero of=./bigfile bs=1M count=5000 ~> vmstat 1 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 13 0 38868 164 7778824 0 0 12 23407 959 1940 1 3 0 95 1 13 0 47144 164 7770084 0 0 0 26260 1435 2732 2 3 0 95 0 13 0 39740 164 7774280 0 0 60 30724 1534 2860 2 4 0 94 0 13 0 41124 164 7776080 0 0 0 13888 1103 2038 2 3 0 95 0 13 0 42460 164 7768056 0 0 0 52248 1320 2334 2 3 0 95 1 13 0 40456 164 7776908 0 0 0 3028 1058 1934 2 3 0 95 At the moment of performance of the test operation with graphic interface KDE is impossible Just tried dd if=/dev/zero of=bigfile bs=1M count=20k conv=fdatasync on 2.6.30-rc2 and top still shows iowait of 70% to 90%, on ext3 filesystem. Motherboard: Gigabyte M57SLI-S4 Distro: Slamd64 12.2 $ cat /proc/meminfo | grep MemTotal MemTotal: 3089672 kB $ cat /proc/cpuinfo | grep -i "Model name" | uniq model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ sda: Model=WDC WD5000AAKS-00TMA0, FwRev=12.01C01, SerialNo=WD-WCAPW4009869 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 I believe the ext3 partition was mounted with data=writeback option, but can reboot and confirm if it is important enough. (In reply to comment #304) > ASUS P5K > linux suse 2.6.29-53-default x86_64 You're running a kernel that is known to have high write latencies, and it doesn't appear that your fsync latency test is running in parallel with the dd. With 8GB of RAM, you likely need to change your dd to write out at least 10GB of data instead of 5GB. (In reply to comment #305) > Just tried dd if=/dev/zero of=bigfile bs=1M count=20k conv=fdatasync on > 2.6.30-rc2 and top still shows iowait of 70% to 90%, on ext3 filesystem. Your system *should* show high iowait when you're stress testing it like that. If it doesn't, you're not writing to disk as fast as it can handle it. High iowait is normal and expected. It is not an indication of a problem. What is not expected is high latency during those stress tests. Ideally you should see sync latencies of less than a second - if latencies get higher than that you are likely using ext3 data=ordered or a broken kernel. 2.6.30-rc2 was just released - that should be used for future tests. 2.6.30-rc2 fsync-tester shows mostly < 1 second, except a few times when it goes just above 1 sec. 
fsync time: 0.1964 fsync time: 0.2317 fsync time: 0.2923 fsync time: 0.0565 fsync time: 1.1033 fsync time: 0.2297 fsync time: 0.0124 fsync time: 0.0848 fsync time: 0.1049 fsync time: 0.6525 fsync time: 11.1130 <--- not sure what that was fsync time: 2.2619 fsync time: 0.3535 fsync time: 0.1543 fsync time: 0.2699 Unfortunately, the load average shoots up, peaking at about 8 before I run out of space on the disk. System responsiveness is also affected, but don't have a meaningful measurable quantity. top - 21:41:06 up 16 min, 6 users, load average: 7.23, 5.93, 3.98 top - 21:42:19 up 17 min, 7 users, load average: 8.12, 6.53, 4.34 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 9 0 19428 10752 2681252 0 0 180 13957 1344 497 2 9 30 59 1 8 0 20100 10780 2680416 0 0 0 47644 2883 1290 2 12 0 86 0 9 0 18908 10816 2681888 0 0 0 22528 2819 858 2 11 0 88 0 10 0 20116 10828 2680952 0 0 4 25080 2865 781 2 7 0 92 0 9 0 18900 10844 2682280 0 0 4 32696 3496 835 0 11 0 90 0 9 0 19040 10876 2681736 0 0 0 29936 3060 1064 1 10 0 89 2 8 0 18880 10892 2680868 0 0 4 47736 2954 731 0 7 0 92 0 9 0 18180 10920 2681448 0 0 0 44160 2723 971 0 13 0 87 /dev/sda4 /home ext3 rw,relatime,errors=continue,data=writeback 0 0 Hi all! I just ran the tests and obtained this: ###################################################### gad@ws-esp16:~$ ./kernel-test2.sh Using current dir to do IO tests #################### ## System info System: 2.6.28-11-generic i686 Tag: 2.6.28-11-generic Memory MemTotal: 2060636 kB CPU Model: model name : Intel(R) Core(TM)2 Duo CPU T7500 @ 2.20GHz Running in . Mounts: --------------------- rootfs / rootfs rw 0 0 /dev/disk/by-uuid/ee364958-34b6-474e-8e54-9a9eaff56d12 / ext3 rw,relatime,errors=remount-ro,data=ordered 0 0 --------------------- Sda info: Model=ST91608220AS , FwRev=3.ALE , SerialNo= 5MA4TF4V #################### First Test: FsyncProblem Starting ./test-2.6.28-11-generic-1 We have High IO PID 8949 running We have fsync-tester with 8950 running... 
fsync time: 0.1504 fsync time: 0.5174 fsync time: 0.3664 fsync time: 0.1727 fsync time: 0.2163 fsync time: 0.3080 fsync time: 0.3914 fsync time: 0.1766 fsync time: 0.4800 fsync time: 0.2304 fsync time: 0.4018 fsync time: 0.1159 fsync time: 0.4537 fsync time: 0.1837 fsync time: 0.3032 fsync time: 0.5013 fsync time: 2.0128 fsync time: 0.9343 fsync time: 0.3027 fsync time: 1.2761 fsync time: 0.7145 fsync time: 0.4678 fsync time: 2.0326 fsync time: 0.2019 fsync time: 0.5484 fsync time: 0.3867 fsync time: 0.0912 fsync time: 0.2040 fsync time: 0.3893 fsync time: 0.2703 fsync time: 0.3794 fsync time: 0.5449 fsync time: 0.7379 fsync time: 0.5957 fsync time: 0.6034 fsync time: 0.7915 fsync time: 1.0564 fsync time: 0.5795 fsync time: 0.4501 fsync time: 2.2850 fsync time: 8.1411 fsync time: 1.4754 fsync time: 1.3487 fsync time: 0.9896 fsync time: 0.6221 fsync time: 1.1703 fsync time: 0.2775 fsync time: 0.1842 fsync time: 0.3994 fsync time: 0.5275 fsync time: 0.3382 fsync time: 0.3295 fsync time: 0.6451 fsync time: 0.6803 fsync time: 1.2621 fsync time: 1.3397 fsync time: 0.3250 fsync time: 0.3182 fsync time: 0.3491 fsync time: 0.2745 fsync time: 0.3489 fsync time: 0.5478 fsync time: 0.6009 fsync time: 0.4482 fsync time: 0.3772 fsync time: 0.1414 fsync time: 0.2948 fsync time: 0.2228 fsync time: 0.3758 fsync time: 0.3091 fsync time: 0.2624 fsync time: 0.3526 fsync time: 0.0771 fsync time: 0.2078 fsync time: 0.1613 fsync time: 0.2265 fsync time: 0.2759 fsync time: 0.3231 fsync time: 0.3532 fsync time: 0.1200 fsync time: 0.2788 fsync time: 0.4866 fsync time: 0.2710 fsync time: 0.4107 fsync time: 0.4903 fsync time: 0.5680 fsync time: 0.1199 fsync time: 0.3397 fsync time: 0.3929 fsync time: 0.3373 fsync time: 0.4407 fsync time: 0.2629 fsync time: 0.2998 fsync time: 0.2175 fsync time: 0.3119 fsync time: 0.0971 fsync time: 0.1899 fsync time: 0.4977 fsync time: 0.4127 fsync time: 0.2498 fsync time: 0.8439 fsync time: 0.1513 fsync time: 0.1109 fsync time: 0.2506 fsync time: 0.3414 fsync time: 0.1470 fsync time: 0.0558 ./kernel-test2.sh: line 84: 8949 Terminado dd if=/dev/zero of="$io_test_path/test-$info_tag-$i" bs=1M count=5000 oflag=direct ./kernel-test2.sh: line 86: 8950 Terminado ./fsync-tester "$io_test_path/test-$info_tag-$i.fsynctest" ./test-2.6.28-11-generic-1 deleted! ./test-2.6.28-11-generic-1.fsynctest deleted! --- Finish --- Kernel tested: 2.6.28-11-generic i686 ###################################################### I have to say that I killed the dd program manually because it took to much time. I don't know if it's an issue to get about 1,2MB/S IO as much... I suppose that even for a laptop this is not much normal. Anyway there are results. I updated testsuite to include this new test. It's called kernel-testsuite.tar.gz and it includes: kernel-test-fsync.sh fsync-tester The package contains Theodore Ts'o sources but modified to use 1 parameter for the filename of the output. I hope this helps. Created attachment 21007 [details]
Automatic test suite for this bug V3
This includes the fsync test
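For anyone who wants to see what the fsync test is actually measuring without unpacking the test suite: the sketch below is not the attached fsync-tester source, just a minimal stand-in written for illustration. The 1 MiB write size, the probe file name and the one-second pacing are my own assumptions.

/* fsync-probe.c - minimal fsync latency probe (a sketch, not the attached
 * tool): rewrite the same 1 MiB of a file, time fsync(), print, repeat. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "fsync-probe.dat";
    static char buf[1 << 20];                 /* 1 MiB written per iteration */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 'a', sizeof(buf));

    for (;;) {
        struct timeval t0, t1;
        lseek(fd, 0, SEEK_SET);               /* rewrite, don't grow the file */
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("write"); return 1;
        }
        gettimeofday(&t0, NULL);
        if (fsync(fd)) { perror("fsync"); return 1; }
        gettimeofday(&t1, NULL);
        printf("fsync time: %.4f\n", (t1.tv_sec - t0.tv_sec)
               + (t1.tv_usec - t0.tv_usec) / 1e6);
        sleep(1);                             /* one sample per second */
    }
}

Run something like this on the filesystem under test while a big sequential write is going on in the background, e.g. "dd if=/dev/zero of=bigfile bs=1M count=10000 & ./fsync-probe testfile"; the interesting numbers are the multi-second outliers, as in the results above.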
(In reply to comment #306)
> You're running a kernel that is known to have high write latencies, and it
> doesn't appear that your fsync latency test is running in parallel with the
> dd.

???????????????????????????? Most likely it is not "known" to anybody - the bug status is still NEW.

> With 8GB of RAM, you likely need to change your dd to write out at least 10GB
> of data instead of 5GB.

OK (adding to my results from comment #304):

dd if=/dev/zero of=./bigfile bs=1M count=15000 & ./fsync-tester
... fsync time: 2.3800 fsync time: 2.4295 fsync time: 2.4099 fsync time: 2.1599 fsync time: 2.0760 fsync time: 2.6152 fsync time: 2.1427 fsync time: 2.4893 fsync time: 2.3252 fsync time: 2.3208 fsync time: 2.4223 ... fsync time: 2.3710 fsync time: 1.3094 fsync time: 1.4473 fsync time: 2.7260 fsync time: 2.2739 fsync time: 2.2078 fsync time: 0.5446 15000+0 records in 15000+0 records out fsync time: 1.5607 15728640000 bytes (16 GB) copied, 201,724 s, 78,0 MB/s

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 5 0 3930476 6108 3873852 0 0 0 74384 883 1632 1 4 0 94 0 5 0 3864644 6108 3941216 0 0 0 64512 667 1088 1 5 0 93 0 4 0 3788956 6108 4015088 0 0 0 73728 943 1738 2 5 0 93 0 5 0 3735848 6108 4070376 0 0 0 53268 666 1181 1 5 0 94 2 5 0 3671468 6108 4135384 0 0 0 65024 735 1277 1 4 0 94 0 4 0 3590356 6108 4213988 0 0 0 77824 860 1590 2 5 0 93 1 5 0 3524484 6108 4280384 0 0 0 64392 749 1495 1 4 0 94

Created attachment 21054 [details]
test case: Takes the time of mouse click events

All my results show a high probability of high latencies whenever system time is high. Most posts were related to high latencies during heavy IO over an SSH connection or with the X server; both use a network/socket connection. The bug may be in the network stack and not in the IO scheduler or block layer. Here is my first test: the "Example Network Job" test (Flexible IO Tester) shows a regression since 2.6.22 (see the last test on http://global.phoronix-test-suite.com/?k=profile&u=ebird-3722-22013-9288 ).

And here is the mouse click test. This test case shows exactly the same regression on all kernels and the same behavior I have seen in a real environment. It is !!not!! caused by the fsync bug. The test case simply clicks on a label and measures the time until the event arrives; it uses the platform's native input queue (see java.awt.Robot). The test case is only a quick solution and has no error handling. It expects a factor as a parameter. A high factor like 40.0 means high sensitivity and gives a high probability of catching high latencies, but it increases the probability of a missing precondition (no high CPU usage and no high system time) on current kernels. A value below 5.0 means poor sensitivity, which reduces the system time and reduces the probability of capturing a high-latency event. These values may differ on other machines, as it has not been tested elsewhere. For generating the heavy IO I have used the following commands, but copying a big folder (> memory size) is enough too:

# for i in 1 2 3 4 5 6; do dd if=/dev/zero of=t-$i bs=1M count=1K & done

The error occurs on kernels 2.6.17, 2.6.18 and 2.6.20 only while the cache is filling up within the first five seconds.
kernel            no IO       high IO
2.6.17            max 160ms   max 35ms  (max 2.859s within the first 5 seconds)
2.6.18            max 152ms   max 101ms (max 2.430s within the first 5 seconds)
2.6.20            max 164ms   max 100ms (max 1.049s within the first 5 seconds)
2.6.27            max 46ms    max 6.988s (during IO)
2.6.28            max 51ms    max 3.778s (during IO)
2.6.29            max 99ms    max 3.632s (during IO)
2.6.30-rc2        max 50ms    max 4.993s (during IO)
2.6.22            unable to run the test on this kernel because of missing preconditions
2.6.30-rc2 (smp)  -           max 3.624s (during IO)

Output like the following, or no CPU usage, means the preconditions for the test are missing; reduce the factor.

> High total latency of last 19 events at 138.783s - total latency : 646ms

A factor below 5.0 means the test cannot be run on this kernel.

P.S. All tests were done on a kernel without SMP support (250Hz timer, no CPU scaling) to reduce multi-core scheduler differences. On multi-core systems you should keep n-1 cores busy with a job like this:

# bzip2 -c /dev/zero >/dev/null &

Created attachment 21055 [details]
Complete test log
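The Java source of the click tester isn't shown in this thread. As a rough non-GUI analogue of what it measures (how late a task that should wake at a fixed interval actually wakes while the disk is being hammered), a sketch like the one below could be used. It is my own illustration, and the 10 ms period is an arbitrary choice, not something taken from the attached test case.

/* wakeup-probe.c - sketch of a scheduling/wakeup latency probe: sleep to an
 * absolute deadline every 10 ms and report the worst observed overshoot.
 * Multi-second overshoots during heavy IO match the stalls reported above. */
#include <stdio.h>
#include <time.h>

#define PERIOD_NS 10000000L /* 10 ms */

int main(void)
{
    struct timespec deadline, now;
    long long worst = 0;

    clock_gettime(CLOCK_MONOTONIC, &deadline);
    for (;;) {
        deadline.tv_nsec += PERIOD_NS;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_nsec -= 1000000000L;
            deadline.tv_sec += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
        clock_gettime(CLOCK_MONOTONIC, &now);

        long long late = (now.tv_sec - deadline.tv_sec) * 1000000000LL
                       + (now.tv_nsec - deadline.tv_nsec);
        if (late > worst) {
            worst = late;
            printf("max wakeup latency so far: %.3f ms\n", worst / 1e6);
            fflush(stdout);
        }
    }
}

Build with "gcc -o wakeup-probe wakeup-probe.c -lrt" (older glibc needs -lrt for clock_gettime), start it, and then kick off the dd writers.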
Hi guys, I have run my test script, which I ran with previous kernels. There is a pretty big increase in performance on 2.6.30-rc3. The BIGGEST difference I noticed in my test output was that vmstat used to report large numbers (10) of "uninterruptible sleep" processes; now it's down to about 1-4. I saw some 9 and 10 second fsync latencies, but most were around 0.3 seconds, with some around 1-2 seconds. However, I don't think the kernel is back to what it used to be yet. I never used to have problems with ext3 fsync latencies at all. It used to be that a simple file copy would not cause much latency in the responsiveness of my regular apps; in fact, generally speaking, I never noticed any problems when copying huge files. Now, when copying large files, I still get some choppiness, even with Ted's patches. I'm wondering if the real problem lies in the block IO layer, and not the filesystem layer?

The reliability of the mouse click test case (comment #311) can be improved by adding a random reading process:

# for i in 1 2 3 4 5 6; do dd if=/dev/zero of=t-$i bs=1M count=1K & done
# find / 2>%1 >/dev/null
# java MouseClickTester 40

I am able to catch latencies of up to 12 seconds with kernel 2.6.27 (no SMP support). Is there a way to trace such a mouse click event in the kernel? It should be a suspend/wait and a resume.

Kernel 2.6.30-rc2. For other info see comment #304.

TEST 1
----------------------------------------------------------------------------
yura@suse:~> dd if=/dev/zero of=./bigfile bs=1M count=15000 & ./fsync-tester
[1] 4561
fsync time: 0.0401 fsync time: 2.4475 fsync time: 1.7808 fsync time: 1.1141 fsync time: 1.6912 fsync time: 1.0753 fsync time: 1.2931 fsync time: 0.3260 fsync time: 0.3653 fsync time: 0.5603 ..... fsync time: 1.3651 fsync time: 1.0479 fsync time: 1.0806 fsync time: 0.6021 fsync time: 0.4708 fsync time: 1.3952 fsync time: 0.6665 fsync time: 1.4431 fsync time: 1.0893 fsync time: 1.7844 fsync time: 0.6520 fsync time: 0.3665 fsync time: 0.8171 fsync time: 0.7537 fsync time: 1.2100 fsync time: 0.9319 fsync time: 1.1578 fsync time: 1.1377 fsync time: 1.4913 fsync time: 1.0317 fsync time: 0.5870 fsync time: 1.8464 fsync time: 1.4770 fsync time: 1.3934 fsync time: 1.3794 fsync time: 0.7868
15000+0 records in
15000+0 records out
15728640000 bytes (16 GB) copied, 172.839 s, 91.0 MB/s
^C

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 4 0 6189644 808 1572324 0 0 4 116748 1585 1548 2 26 6 67 1 3 0 6098828 808 1663460 0 0 0 84472 973 1538 2 7 0 91 0 4 0 6011692 808 1749652 0 0 0 88416 722 1248 2 6 0 92 0 3 0 5915592 808 1844204 0 0 0 95232 996 1668 1 7 0 92 1 4 0 5834692 808 1925564 0 0 0 77832 672 838 1 6 0 93 0 4 0 5755452 808 2005900 0 0 0 79872 940 1472 1 5 0 93 1 2 0 5664856 808 2096760 0 0 0 88744 746 1316 1 6 0 92 0 4 0 5574556 808 2185520 0 0 0 86368 802 1286 1 6 0 93 0 3 0 5492072 808 2268036 0 0 0 81408 785 1112 1 6 0 93 0 4 0 5412744 808 2347624 0 0 0 78344 926 1400 1 5 0 93 0 3 0 5333768 808 2428624 0 0 0 78848 659 1046 1 5 50 43 0 4 0 5245744 808 2516336 0 0 0 86536 992 1526 1 6 50 42 0 4 0 5153952 808 2605988 0 0 0 89088 947 4596 4 7 48 41 0 3 0 5074720 808 2686532 0 0 0 78336 958 1768 1 6 49 43 0 4 0 4974280 808 2787192 0 0 0 92198 706 1028 1 7 20 72 0 3 0 4897224 808 2862716 0 0 0 80905 1046 1650 1 5 49 45 0 4 0 4819832 808 2940944 0 0 0 77348 1193 2076 1 6 0 93 1 2 0 4730172 808 3031732 0 0 0 82104 733 1020 1 6 1 91 0 3 0 4648668 808 3112676 0 0 0 86864 994
1674 1 6 50 42 1 3 0 4556864 808 3203828 0 0 0 87232 708 1136 2 6 49 43

TEST 2
----------------------------------------------------------------------------
yura@suse:~> dd if=/dev/zero of=./bigfile2 bs=1M count=15000
15000+0 records in
15000+0 records out
15728640000 bytes (16 GB) copied, 174.036 s, 90.4 MB/s

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 3 0 45296 0 7683084 0 0 0 79360 1196 2213 1 6 33 60 0 3 0 46896 0 7682140 0 0 0 74752 792 1526 1 6 49 43 0 3 0 48996 0 7681420 0 0 0 79360 1103 2084 1 6 50 42 0 3 0 45216 0 7684640 0 0 0 84480 824 1494 2 7 49 42 1 3 0 46028 0 7684060 0 0 0 78336 1081 1981 1 7 16 76 1 2 0 46960 0 7684032 0 0 0 82536 1138 2168 1 8 0 91 0 3 0 50072 0 7680432 0 0 0 79768 760 1473 1 6 0 92 1 2 0 47396 0 7683396 0 0 0 86680 966 1670 1 6 19 74 1 3 0 48876 0 7681688 0 0 0 75624 758 1304 2 6 0 92 0 3 0 50652 0 7680384 0 0 0 83456 983 1656 1 7 8 83 0 3 0 45072 0 7684236 0 0 0 90624 1151 2103 1 7 47 45 0 3 0 45308 0 7683464 0 0 0 80896 817 1380 1 6 46 46 1 3 0 45936 0 7684280 0 0 0 73216 1049 1807 2 6 46 46 2 2 0 45284 0 7685624 0 0 0 81008 881 1397 1 7 47 45 0 3 0 47208 0 7683352 0 0 0 84405 1055 1642 1 7 47 45 0 4 0 48368 0 7682056 0 0 0 76299 1049 1721 1 5 48 45

TEST 3 (all in parallel - one hdd)
----------------------------------------------------------------------------
yura@suse:~> dd if=/dev/zero of=./bigfile4 bs=1M count=15000 (terminal 1)
15000+0 records in
15000+0 records out
15728640000 bytes (16 GB) copied, 481.226 s, 32.7 MB/s
yura@suse:~> dd if=/dev/zero of=./bigfile5 bs=1M count=15000 (terminal 2)
15000+0 records in
15000+0 records out
15728640000 bytes (16 GB) copied, 485.821 s, 32.4 MB/s

And at the same time: KDE Menu -> Dolphin -> click -> WAIT 1s -> Dolphin is opened.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 2 5 0 44792 0 7682396 0 0 116 57368 1016 2083 3 6 0 91 0 5 0 47988 0 7679356 0 0 140 45112 768 1116 2 4 0 94 0 6 0 48080 0 7679688 0 0 744 57352 935 1410 1 5 0 94 3 4 0 45584 0 7679752 0 0 1080 16396 1549 2173 2 3 0 95 1 5 0 46408 0 7680768 0 0 4648 32 1364 1708 4 2 0 93 0 6 0 46648 0 7680468 0 0 1080 32824 1052 1592 3 4 0 93 0 6 0 44884 0 7681468 0 0 8 73453 852 1252 1 6 0 93 0 5 0 48664 0 7676500 0 0 72 44126 825 1770 1 4 0 95 0 6 0 44752 0 7678648 0 0 540 71215 1272 2865 2 6 0 92

This absolutely cannot be an ext3 bug. I'm using reiserfs for my root, and it happens here too. The system totally locks up with a swap storm when memory pressure starts forcing things into swap. Firefox using > 2GB memory, and a wine memory bug which causes it to report ~4GB VIRT, are what triggers it for me. Killing either one fixes the storm (which is often not possible because keyboard/mouse are unresponsive). The machine has 4GB RAM, 4GB swap. It must be in the block layer, or elsewhere. It also seems to happen with swap *off*.

Bob, (In reply to comment #316)
> This absolutely cannot be an ext3 bug. I'm using reiserfs for my root, and it
> happens here too. The system totally locks up with a swap storm when memory
> pressure starts forcing things into swap. Firefox using > 2GB memory, and a
> wine memory bug which causes it to report ~4GB VIRT, are what triggers it for
> me. Killing either one fixes the storm (which is often not possible because
> keyboard/mouse are unresponsive). The machine has 4GB RAM, 4GB swap.
> It must be in the block layer, or elsewhere.
>
> It also seems to happen with swap *off*.

Bob, what exact symptoms are you seeing? There is another issue in the kernel which I have been unable to reproduce for the kernel devs: I have seen numerous cases where the kernel hits "futex" deadlocks. It is possible that yours is related to that, because the performance problem in this bug does not cause a complete lockup. It may seem that way for a bit, but if you leave the machine, it will eventually recover. The futex one appears to be a complete deadlock; it doesn't seem to matter how long I leave it, it never recovers.

I recently experienced a new (for me) condition wherein this bug reared its ugly head, and it actually did not involve high disk throughput. I was running mencoder, which was pegging three of my CPU cores and using a fair share of the fourth. It was reading from a file on my RAID and writing to a file on a tmpfs, not particularly quickly on either end, since it was doing a lot of number crunching in between. The bug cropped up when I started an rsync at the same time, sending some files from my RAID to a remote system, again not particularly quickly (my upstream network bandwidth is only about 80 KB/s). So I wasn't stressing the disk at all, yet my system came to a crawl. I could literally watch windows repainting themselves on expose events. Pressing Ctrl+Alt+Delete to bring up the KRunner process list took at least a minute, if not more. My disks were churning an awful lot, which was odd given the quite low demands I should have been placing on them. I thought maybe the input file to mencoder might have been heavily fragmented, but I ran xfs_fsr on it, and it said it only had 4 extents. Something is seriously FUBAR here. A possible theory: forcing the disks to seek back and forth to read from the two files "simultaneously" meant that the majority of the time was spent waiting for disk seeks. If the kernel was holding a big lock while waiting for those seeks, it could have seriously degraded the performance of the rest of the system.

The bug I'm seeing is extremely reproducible. (I just wait for about a day with Firefox running and lots of tabs open, and it will happen.) As I mentioned, it occurs when memory pressure starts forcing things into swap. This is not a hard lockup, and the system will eventually recover (where "eventually" can be > 30 minutes). updatedb and trackerd also make my system unusable, as reported above; I have disabled them as a consequence. Given that I can trigger it, I can run jobs in the background that could log something useful... locks? fsync? What do you suggest? (This system has a quad-core Intel CPU and a RAID5 root as well; I don't know if that's related.)

Matt, (In reply to comment #318)
> I recently experienced a new (for me) condition wherein this bug reared its
> ugly head, and it actually did not involve high disk throughput.
Yes, that is one of the reasons I believe there is more to it than just the ext3 fsync improvements; it doesn't always take a lot to make it happen.

Matt, do these things happen on 2.6.30-rc3? My issues have almost disappeared with this release. It's still not entirely gone, which indicates to me that they just didn't hit the nail on the head. But it certainly is WAY better.

(In reply to comment #320)
> Matt, do these things happen on 2.6.30-rc3?

I'm not willing to run a pre-release kernel.
In fact, the kernel is the only package on my Gentoo system that I intentionally maintain at the "Gentoo stable" level, rather than at the leading edge. This is mostly because I don't want to have to reboot every time a new patch set comes out. Right now I'm running 2.6.28-gentoo-r5, which is based on 2.6.28.9. If this bug is indeed improved upon in 2.6.30, then I look forward to the release of 2.6.31! :)

(In reply to comment #317)
> Bob, what exact symptoms are you seeing? There is another issue in the kernel
> which I have been unable to reproduce for the kernel devs: I have seen
> numerous cases where the kernel hits "futex" deadlocks. It is possible
> that yours is related to that.

Trenton, could you please point me to the bug for this issue you are speaking of?

I am using Ubuntu 9.04 with a 2.6.30-rc3 x86_64 kernel and I can confirm the whole behavior. The irony is that it feels like Windows 95 while a floppy was being formatted. You know, the whole pseudo multitasking on top of DOS - everything was really choppy. An easy test case is to set up two LUKS-encrypted partitions and copy from one to the other. Even if no core is under heavy load, everything is slow. The same happens with USB transfers too. But as Matt Whitlock pointed out, it is not always a disk IO problem; even under higher CPU usage this can happen. If I encode a DVD with ogmrip/mencoder h264 and 16 threads (16 threads get the highest CPU usage out of my quad core, which is still under 80% per core), GNOME feels like Windows 95 formatting a floppy. The latter problem has become less severe with 2.6.30-rc3, but it is still noticeably slow, which makes no sense since no core is at 100% load. For comparison: if I fire up Prime95 with 100% load on every core in Windows Vista, I can still play modern 3D games without lag. Windows of course also has flaws with IO and so on, but its CPU multitasking works really well. Way to go, imho.

FWIW, I've tried the test proposed by Thomas in comment 314:

# for i in 1 2 3 4 5 6; do dd if=/dev/zero of=t-$i bs=1M count=1K & done
# find / 2>%1 >/dev/null

(the Java part did not start for some reason). I ended up force-rebooting my laptop, since it was impossible to control *after a few seconds*. I could only switch to a VT and back to X, but very slowly, and I couldn't even type a character there or in X. I have 500MB of RAM with a swap of the same size, Pentium M 1500 MHz: not a very powerful configuration, but that should be sufficient to work, shouldn't it? :-) This was with 2.6.28; I'll try with 2.6.30rc2.

My system also locks up when it tries to access swap. This is on Ubuntu Jaunty with both the Ubuntu 2.6.28 kernel and Ubuntu's vanilla 2.6.30.rc3 kernel. This machine has 4GB of RAM and 4GB of swap and is running on a root ext4 partition. My test case is to run multiple VirtualBox VMs (e.g. Jaunty installations) with, say, 1.4GB of RAM assigned to each. When I run the third one, as soon as the kernel starts to hit swap, it thrashes the hard drive, X rapidly becomes unresponsive, and I have to hard reset the machine. I am able to move the mouse (slowly), but clicking on individual windows doesn't work and the keyboard doesn't respond. atop -d manages to update itself as far as about 300MB of swap use and then stops updating. I've left it as long as 15 minutes to see if it will recover, but it doesn't.

(In reply to comment #325)
> My system also locks up when it tries to access swap.
> My test case is to run multiple VirtualBox VMs (e.g. Jaunty installations)
> with, say, 1.4GB of RAM assigned to each. When I run the third one, as soon
> as the kernel starts to hit swap, it thrashes the hard drive, X rapidly
> becomes unresponsive, and I have to hard reset the machine.

There are definitely some huge issues with the kernel, but I don't think this is one of them. If your applications try to use more RAM than is available and constantly access it - which is likely with VirtualBox - no other OS would cope well either. Of course it should be possible to switch to a console and run some commands, but I think this has nothing to do with this report. Btw, I forgot to mention that I don't use swap.

(In reply to comment #324)
> I ended up force-rebooting my laptop, since it was impossible to control
> *after a few seconds*.

It's an extreme test case, as it generates a very high load. You can try with only two concurrent write processes, as your machine is PATA, only 1.5 GHz, and single-core. And start the Java test case at the beginning; I had the order switched before (it's been a long day).

# java MouseClickTester 40
# for i in 1 2; do dd if=/dev/zero of=t-$i bs=1M count=1K & done
# find / 2>%1 >/dev/null

A small correction:

# java MouseClickTester 40
# for i in 1 2; do dd if=/dev/zero of=t-$i bs=1M count=1K & done
# find >/dev/null 2>&1

(In reply to comment #326)
> There are definitely some huge issues with the kernel, but I don't think this
> is one of them. If your applications try to use more RAM than is available
> and constantly access it - which is likely with VirtualBox - no other OS
> would cope well either. Of course it should be possible to switch to a
> console and run some commands, but I think this has nothing to do with this
> report. Btw, I forgot to mention that I don't use swap.

@unggnu: this is not a kernel issue?!!! If multiple apps are trying to reserve more RAM than is available and thus causing continuous access to swap, the kernel should NOT become completely unresponsive and require a hard reset, risking data loss or, in the case of a remote server that you can't hard reset, denial of service. Surely the memory management system should be able to recognise this condition and take appropriate action, e.g. freeze one or more processes with high RAM requirements. At the VERY least it should allow an operator to kill off offending processes, but this is impossible because you can't even log in via ssh or access a console. This is where the test case is relevant to this bug: if the system didn't become completely unresponsive, the operator could fix the problem without a hard reset.

IMO, this bug has long passed the point where it is useful. There are far too many people posting with different issues. There is too much noise to filter through to find a single bug. There aren't any interested kernel developers following the bug. The bug needs to be closed and reopened with separate bugs for each issue. Each issue should be reproducible with the latest 2.6.30-rc kernel with a simple test case. Anything else will just result in another huge bug with 300+ comments and no kernel developer interest.

(In reply to comment #329)
> @unggnu: this is not a kernel issue?!!! If multiple apps are trying to reserve
> more RAM than is available and thus causing continuous access to swap

It is not a kernel issue. It is a system configuration issue.
If you have a half dozen large-memory processes fighting for more memory than is available in the system, causing each of those processes to be continuously swapped in and out as they fight to run, you're going to get horrible performance. You either need more memory, less swap (so that the OOM killer can kill a process), or to avoid running so many large-memory processes in parallel.

(In reply to comment #330)
> IMO, this bug has long passed the point where it is useful.

Even I (the reporter) have more or less stopped tracking this bug. I absolutely agree.

> There are far too many people posting with different issues.
>
> There is too much noise to filter through to find a single bug.
>
> There aren't any interested kernel developers following the bug.

I would definitely agree; the bug has long outlived its usefulness. Closing with INSUFFICIENT_DATA.

> The bug needs to be closed and reopened with separate bugs for each issue.
> Each issue should be reproducible with the latest 2.6.30-rc kernel with a
> simple test case.

Absolutely. All of you who have commented on this bug thus far should open new bugs. While I can't stop anyone from opening bug reports, it is likely that any report without a definite test case reproducing the issue will turn into yet another grab-bag like this one.

Having tracked bugs 7372 and 12309 on the primary issue (performance hitting a brick wall with heavy IO) since October 2007, and now facing the prospect of needing to track yet another one, can I make a plea that whoever opens the new one(s) posts a reference to the new bug ID(s) in this thread?

Thomas: thanks for that update, and indeed the second, more reasonable test case does not completely kill the system. I'm seeing a possibly interesting phenomenon: the test case does not trigger any hang when run alone, but when Firefox is started, I can see swap usage rise, and then the mouse won't move for about a second from time to time. So my guess is that when the system needs to swap, even for only a few MB, it's not able to do that smoothly for the user. Maybe there's a scheduling problem when the kernel needs to choose between giving priority to swap or to the root partition. Or maybe it's simply that writing to quite distant places on the disk leads to high latencies. Would that be worth a new bug? I think a few of us here are experiencing this problem. I generally agree that this bug is not leading anywhere, but at the same time we don't even know how many different issues there are, so opening new reports is problematic too. Maybe we could concentrate on the few cases we're best able to describe precisely, and hope we all suffer from the same ones...

I found this bug after I had another "freeze". Just before the freeze, free memory was running out, swap was barely used, buffers were a few hundred kB, BUT CACHE was over 2.7GB out of 3GB total. After about 20 minutes I managed to switch to VT1, and there was now about 500MB of free memory, less cache, and increased swap usage. The last output of top showed the kswapd process kicking in. Googling gave me this thread: http://lkml.indiana.edu/hypermail/linux/kernel/0311.3/0406.html

Let's summarize the bugs:
- High cache usage during write processes forces swapping of other processes. The patch in comment #160 works, but is not included in the Linux tree.
- Fsync bug in ext3 (there is a test case and activity).
- Too high prioritization of heavy writing processes (copying a big file can delay the start of a program until the copy operation finishes).
- Missing read- and write-based scheduler.

And finally the annoying bugs:
- Low GUI responsiveness during heavy IO. A reliable test case is still missing.
- The test case in comment #311 shows high click latencies of 2-12s during heavy IO on non-SMP kernels (on SMP kernels too, but it's not easy to catch such an event).
- I have a socket ping-pong test (not submitted) which shows latencies of ~2s after the writing processes have finished.
- Low GUI responsiveness in virtual machines: no test case; maybe the same bug as the "low GUI responsiveness during heavy IO" one.

GUI responsiveness is not deterministic: there may be a day with nearly no latencies and then an hour with continuous latencies of up to 60 seconds.

Does anybody know why the caches are not dropped after I echo 3 to drop_caches? I would expect that number to come down ideally to 0, or practically to a few megs. What I see is that after some use of the system, the caches keep increasing and never go down with drop_caches. The graph is ever-increasing, almost like a cache leak. Has anybody debugged this aspect? I think this is one of the primary reasons for the slowdown, because memory is locked up in the caches and new memory requests swap the crap out of the system.

In the case of GUI responsiveness, iotop showed relatively high (read) IO on the X process after the freeze. Maybe X's poor responsiveness is caused by waiting for IO as well. The interesting thing is the cache usage and the inability to drop most of it. From my understanding, memory cache can be dropped if it's not dirty (has been written back to disk); this brought me to this thread about lack of writeback: http://marc.info/?l=linux-kernel&m=113919849421679&w=2 On the other hand, /proc/meminfo shows only ~160kB of dirty memory, while cache shows 880868 kB, and echo 3 > /proc/sys/vm/drop_caches doesn't do anything. So why can't the cache be freed? Is it possible to have a cache leak?

Looks like drop_caches stopped working as expected somewhere around 2.6.18; look at the first comment here: http://jons-thoughts.blogspot.com/2007/09/tip-of-day-dropcaches.html

Be careful using drop_caches. I actually managed to cause a kernel crash by using it in combination with a removable medium. I think it was a double-free bug, but I don't remember for certain.

It has been mentioned time and again that none of the kernel devs have gotten a concise description of the problem, and hence none of them seems to have any answers. Well, does anybody know why my caches show 700MB on a 2GB machine and why I can't get rid of any of it? I don't think the question can get any more precise. This is the heart of the problem, folks.

(In reply to comment #341)
> It has been mentioned time and again that none of the kernel devs have gotten
> a concise description of the problem, and hence none of them seems to have
> any answers. Well, does anybody know why my caches show 700MB on a 2GB
> machine and why I can't get rid of any of it? I don't think the question can
> get any more precise. This is the heart of the problem, folks.

I don't understand why you'd assume that cache is a problem. The kernel uses available RAM as cache as it's the most productive use for it. To assume that this is buggy behavior is extremely misled logic.
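One factual note on the drop_caches confusion above, since it keeps coming back: writing to /proc/sys/vm/drop_caches frees only *clean* page cache, dentries and inodes. It never writes back dirty pages, so on a box with lots of dirty data much of the cache will appear "undroppable" unless a sync is issued first. A minimal sketch of the documented sequence (error handling kept deliberately short):

/* drop-clean-caches.c - sketch of the documented drop_caches sequence:
 * flush dirty pages with sync() first, because drop_caches itself never
 * writes anything back and can only discard clean cache. Needs root. */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    sync();                              /* write back dirty page cache */

    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd < 0) { perror("open (are you root?)"); return 1; }
    if (write(fd, "3", 1) != 1) {        /* 1=pagecache, 2=dentries+inodes, 3=both */
        perror("write");
        return 1;
    }
    close(fd);
    return 0;
}

If the cache number still doesn't come down after a sync has completed, then something really is holding those pages, and that would be worth its own report.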
(In reply to comment #342)
> I don't understand why you'd assume that cache is a problem. The kernel uses
> available RAM as cache as it's the most productive use for it. To assume that
> this is buggy behavior is extremely misled logic.

What's buggy is that it's not willing to relinquish it when asked to drop it, or when needed. echo 3 to drop_caches should drop the damn thing. If I configure swappiness=1, cache should be dropped first and only then should the swap disk be used. I don't like it locking 700MB out of my 2GB RAM and then swapping heavily. If this behavior is by design, someone needs to change that design.

Kernel 2.6.30-rc3: if a task is using the processor at ~30 percent and filesystem work (cp, mv, rm) runs at the same time, the computer dies; starting anything new is simply not possible. Kernel 2.6.30-rc4 is no better. This bug, "Large I/O operations result in slow performance and high iowait times", has passed from status NEW into an unclear state, but iowait was high and remains high. Stop these frauds. What data is still necessary?

(In reply to comment #346)
> Kernel 2.6.30-rc4 is no better. This bug, "Large I/O operations result in
> slow performance and high iowait times", has passed from status NEW into an
> unclear state, but iowait was high and remains high. Stop these frauds. What
> data is still necessary?

No. Kernel folks will not "stop these frauds". Is your system using the latest DDR3 memory running at 2000Mhz? Is it using a Core i7 based processor, overclocked to 4.5Ghz? Does it have SSD drives with at least 150MB/s writes? Are you using ext4 yet? If all of these are true, and your system still hangs, only then will kernel devs "stop these frauds" and fix this bug. Until then, just use Vista (make sure to upgrade to SP1)... :D ... in case you couldn't tell, I was just kidding with ya! Please file a separate bug report with specific details about what you are experiencing on your system.
Terminal 1 (no other active task):
:~/x1> time cp -r qt-x11-opensource-src-4.5.1 qt-x11-opensource-src-4.5.1-1
real 5m51.075s
user 0m0.147s
sys 0m2.192s
302.6MB / 351s = 0.9 MB/s

Terminal 2:
:~/x1> vmstat 1 <- only cp
0 0 0 4916172 808 2774512 0 0 24 16248 794 1469 2 3 95 0 2 0 0 4915228 808 2776340 0 0 24 2180 959 1385 2 1 97 0 1 0 0 4913492 808 2778140 0 0 24 3144 841 1251 1 1 97 0 0 0 0 4912500 808 2779104 0 0 24 2636 679 936 2 1 97 0 1 0 0 4910516 808 2781112 0 0 32 2804 862 1258 2 1 96 0 0 1 0 4908872 808 2781812 0 0 36 27160 749 913 5 2 91 2
<- entered a folder in Dolphin (100 files in this folder)
2 0 0 4907012 808 2783712 0 0 48 2615 1108 1563 3 2 82 12 0 1 0 4906020 808 2784728 0 0 80 3248 890 1274 2 1 64 33 0 1 0 4905648 808 2785164 0 0 56 2933 705 920 3 1 67 28 0 1 0 4904828 808 2786028 0 0 84 2600 884 1240 2 1 49 47 0 1 0 4903456 808 2787400 0 0 44 3148 723 873 3 1 62 35 0 1 0 4902084 808 2788572 0 0 64 2681 1177 2604 3 1 49 47 0 1 0 4901464 808 2789284 0 0 48 2328 952 1407 2 1 63 34 0 1 0 4900556 808 2790416 0 0 36 2624 951 2373 4 1 59 35 1 1 0 4898040 808 2792868 0 0 60 2672 1224 4018 8 3 46 43 0 1 0 4897032 808 2793868 0 0 80 2760 693 1004 2 1 49 47 0 1 0 4895552 808 2795304 0 0 28 2459 1029 1495 2 1 81 15 0 1 0 4894552 808 2796408 0 0 84 2744 877 1279 2 1 49 47 0 1 0 4892700 808 2798272 0 0 76 2204 773 1078 4 1 48 47

(In reply to comment #348)
You need to open a new bug report with a thorough explanation of your test case, expected and observed results, and any pertinent data you may have collected. Leave a note here referencing your newly created bug, but posting any data here is not going to help anyone. This bug is closed due to a lack of focus.

I need to get back to 2.6.17; I can't work like this! I have 3GB RAM of which >2GB is used by cache that won't drop even when memory is running out.

(In reply to comment #349)
> (In reply to comment #348)
> You need to open a new bug report with a thorough explanation of your test
> case, expected and observed results, and any pertinent data you may have
> collected. Leave a note here referencing your newly created bug, but posting
> any data here is not going to help anyone. This bug is closed due to a lack
> of focus.

yup. Guys, problems like this aren't solved very effectively via bugzilla. Please prefer to report these issues via email to linux-kernel, myself, and any developers who you think might be relevant. It's confusing, and clarity is important. Being able to provide a means by which others can demonstrate the problem is a huge benefit.

Why am I still being CC'd on this bug, even though I'm not on the CC list?

(In reply to comment #352)
> Why am I still being CC'd on this bug, even though I'm not on the CC list?

Maybe you're watching Jens Axboe (the assignee), Ben Gamari (the reporter), or another user who is still in the CC list.
kernel 2.6.30-rc6 yura@suse:~> export LANG=en yura@suse:~> dd if=/dev/zero of=test1 bs=1M count=10000 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied, 129.928 s, 80.7 MB/s yura@suse:~> vmstat 1 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 3 8 0 44708 0 7628596 0 0 249 12283 362 821 4 3 66 28 0 8 0 49180 0 7627532 0 0 0 40517 1061 1581 7 4 0 89 0 7 0 47188 0 7627180 0 0 0 59694 1156 1505 5 6 0 89 1 6 0 46692 0 7628276 0 0 0 55553 1144 1476 6 5 0 90 0 8 0 46428 0 7628160 0 0 20 51573 900 1096 5 4 0 90 0 7 0 46568 0 7627860 0 0 0 64024 1127 1480 5 5 0 90 0 7 0 45796 0 7629100 0 0 12 44597 889 987 6 4 0 90 0 8 0 46904 0 7627808 0 0 332 40500 1100 1485 6 4 0 90 0 7 0 47772 0 7626884 0 0 168 45300 1158 1628 6 4 0 90 0 8 0 47216 0 7624456 0 0 72 67116 958 1151 5 5 0 90 0 7 0 47032 0 7626480 0 0 280 29244 1177 1667 5 4 0 91 0 7 0 45936 0 7626640 0 0 248 58872 922 1060 6 5 0 89 0 9 0 44988 0 7626640 0 0 216 62492 945 1359 2 6 0 92 0 8 0 47548 0 7625932 0 0 152 47164 926 1425 1 4 0 95 1 6 0 45276 0 7627256 0 0 36 54721 605 1089 2 4 0 94 0 7 0 48208 0 7626388 0 0 44 43612 834 1198 1 4 0 95 0 8 0 47096 0 7625644 0 0 132 53789 655 1156 1 4 0 94 0 7 0 46344 0 7624828 0 0 468 50292 981 2089 2 4 0 94 0 8 0 46576 0 7625416 0 0 116 44056 1155 2119 1 3 0 96 0 8 0 47476 0 7624800 0 0 636 38936 734 1125 2 4 0 94 0 8 0 47348 0 7626676 0 0 32 58410 885 1613 1 5 0 93 1 6 0 48508 0 7626280 0 0 0 67256 623 969 1 4 0 94 0 7 0 47984 0 7625328 0 0 0 64888 694 1335 2 6 0 92 0 7 0 45800 0 7626692 0 0 0 62496 1002 1698 1 4 0 95 0 7 0 48220 0 7625052 0 0 0 61952 614 1222 2 5 0 93 0 7 0 48508 0 7623300 0 0 0 69632 890 1586 1 5 0 94 #354 this bigfile yura@suse:~> time cp bigfile bigfile.cp real 5m52.457s user 0m0.343s sys 0m21.356s calc speed => 10485760000 / 352.457 = 29.75 Mb/s procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 0 0 46688 0 7686820 0 0 0 12 564 862 1 0 98 0 0 0 0 46688 0 7686820 0 0 20 0 387 730 2 0 96 1 0 0 0 46688 0 7686840 0 0 0 0 559 879 1 0 98 0 0 0 0 46688 0 7686840 0 0 0 0 598 937 1 1 97 0 0 0 0 46688 0 7686840 0 0 0 0 315 517 2 1 98 0 0 0 0 46704 0 7686840 0 0 0 16 600 1058 2 1 97 0 0 0 0 46704 0 7686840 0 0 0 0 328 473 2 0 98 0 0 0 0 46704 0 7686840 0 0 0 0 610 1122 2 0 98 0 0 0 0 46704 0 7686840 0 0 0 0 582 1013 2 0 98 0 0 0 0 46876 0 7686840 0 0 0 1 341 475 1 0 98 0 0 0 0 46876 0 7686840 0 0 0 0 577 988 2 0 98 0 0 0 0 46876 0 7686840 0 0 0 0 339 543 2 1 97 0 start cp 3 0 0 46500 0 7686704 0 0 17500 0 857 2379 2 2 91 5 3 0 0 43840 0 7689132 0 0 90624 0 2119 5710 4 11 61 24 0 1 0 46532 0 7686180 0 0 83968 0 2008 5246 8 11 57 24 1 1 0 43884 0 7689020 0 0 81024 46 2159 8097 6 10 59 25 0 1 0 45020 0 7687772 0 0 81920 1 1759 3732 4 10 60 26 0 1 0 44948 0 7687472 0 0 91264 0 2154 4449 4 10 60 25 0 1 0 43924 0 7688888 0 0 88064 0 2040 4500 3 11 60 26 0 1 0 46180 0 7686288 0 0 89984 0 1919 4107 3 11 63 22 0 2 0 44692 0 7680932 0 0 86784 39184 2156 4820 4 12 47 38 0 2 0 44568 0 7681376 0 0 64384 22436 1569 4127 3 7 35 54 0 2 0 44092 0 7682832 0 0 35584 37396 1331 2886 3 5 37 55 4 2 0 46920 0 7678572 0 0 42624 43336 1544 3311 3 6 28 63 0 2 0 45724 0 7679280 0 0 49792 31240 1301 3076 2 6 27 64 0 2 0 45328 0 7681288 0 0 41856 31648 1473 3322 3 5 27 65 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 2 0 46328 0 7679272 0 0 52480 
29232 1425 3345 3 8 33 56 1 2 0 46276 0 7679748 0 0 48768 24844 1564 3539 3 6 32 60 1 2 0 47196 0 7678088 0 0 63360 28688 1830 4781 3 9 14 74 5 3 0 44052 0 7681744 0 0 58112 23612 1493 3905 3 8 5 83 1 2 0 44988 0 7679956 0 0 18560 53021 1107 2129 2 4 0 94 0 4 0 46872 0 7677272 0 0 55808 22541 1478 4117 3 7 1 89 0 4 0 43824 0 7681360 0 0 52608 33800 1627 4628 3 7 0 89 0 4 0 45536 0 7679600 0 0 41856 31720 1491 4026 3 6 2 89 0 4 0 45688 0 7680620 0 0 39424 34324 1202 3190 3 6 6 85 1 4 0 46052 0 7679708 0 0 48104 27148 1901 5505 3 7 2 88 0 2 0 44120 0 7680964 0 0 53280 6660 1531 5967 3 8 6 83 2 3 0 45692 0 7679440 0 0 55784 0 1837 6209 4 7 4 84 0 5 0 44796 0 7678908 0 0 52452 10248 1953 6444 4 8 1 87 0 2 0 46952 0 7676344 0 0 45264 24619 2085 8506 6 7 0 87 0 6 0 44820 0 7677336 0 0 34588 41550 2438 5289 2 7 4 86 0 2 0 47084 0 7675392 0 0 31016 34352 1203 3506 3 5 17 74 2 3 0 46856 0 7674440 0 0 10252 67612 685 1250 2 3 11 84 0 5 0 45072 0 7677236 0 0 48004 15368 1575 4859 2 7 5 85 0 5 0 45504 0 7678588 0 0 29824 23240 952 2688 3 3 8 85 0 5 0 45020 0 7678452 0 0 62080 8196 1865 6184 3 9 0 88 0 3 0 44564 0 7679208 0 0 6272 61473 607 926 2 2 24 71 1 3 0 46444 0 7676624 0 0 14216 64046 1059 2083 3 2 41 54 0 2 0 44444 0 7680932 0 0 53636 16048 1448 5787 3 8 10 79 0 2 0 45188 0 7680048 0 0 40320 34024 1439 3785 3 6 1 90 1 3 0 46208 0 7679612 0 0 63872 10248 1628 4998 3 9 1 87 3 4 0 45808 0 7680852 0 0 47360 27152 2030 5505 3 6 7 83 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 5 0 44496 0 7681320 0 0 45828 29452 1683 4815 3 6 0 90 0 5 0 46044 0 7679900 0 0 44160 28704 1273 3575 3 7 3 87 0 5 0 44676 0 7679868 0 0 17280 62076 840 1855 3 3 1 93 0 5 0 44716 0 7681108 0 0 53504 11268 1862 6838 3 6 0 91 0 4 0 44504 0 7681556 0 0 42880 30905 1460 3885 3 7 2 88 0 3 0 44364 0 7680420 0 0 19712 69876 904 1960 2 4 3 91 0 4 0 48668 0 7678024 0 0 52096 21788 1904 5313 3 7 1 88 0 4 0 47032 0 7677752 0 0 28032 54580 1066 2496 3 5 16 76 0 3 0 46600 0 7678904 0 0 41344 33300 1426 4071 3 6 10 80 0 4 0 45424 0 7679184 0 0 41856 40276 1240 3204 3 6 0 91 0 4 0 45692 0 7680156 0 0 46080 18172 1479 4094 3 6 8 83 0 4 0 45908 0 7679112 0 0 45824 40108 1660 4158 3 7 4 86 0 4 0 44048 0 7681292 0 0 49408 32776 1345 3872 3 7 8 81 0 3 0 46548 0 7678436 0 0 50816 22604 1609 4448 3 7 12 77 0 2 0 46156 0 7678672 0 0 46464 35852 1350 3829 2 7 12 79 0 3 0 45244 0 7678956 0 0 50304 26420 1664 4697 3 7 4 85 0 3 0 48256 0 7675968 0 0 33280 31759 1325 3366 3 5 13 79 0 4 0 44852 0 7679008 0 0 35328 53281 1168 3202 3 5 39 52 0 4 0 46668 0 7677444 0 0 38784 24628 1390 3743 2 5 14 78 0 5 0 45028 0 7680464 0 0 49920 20492 1373 4244 3 7 3 87 0 5 0 45356 0 7681500 0 0 33408 35548 1369 3642 3 5 2 90 0 5 0 46896 0 7679744 0 0 57216 23808 1526 4868 3 7 3 87 1 1 0 45884 0 7679008 0 0 34432 50311 1427 3789 3 5 3 89 2 3 0 44592 0 7679592 0 0 45952 40989 1378 4140 2 7 3 87 1 2 0 44188 0 7680924 0 0 25856 48793 1023 2593 3 4 1 92 0 5 0 45092 0 7680536 0 0 46336 26184 1533 4502 3 6 5 86 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 5 0 44880 0 7681796 0 0 46208 23228 1560 4530 3 7 4 85 0 5 0 46412 0 7679076 0 0 60800 23884 1623 5242 3 8 7 81 0 5 0 46488 0 7679024 0 0 50048 24908 1637 4677 3 7 8 83 0 2 0 44564 0 7681684 0 0 36224 39584 1140 3492 3 4 11 81 0 5 0 45200 0 7681196 0 0 50304 22796 1757 5024 3 7 7 82 0 3 0 46924 0 7672852 0 0 34560 43484 1175 3359 2 6 0 91 0 6 0 45632 
0 7675104 0 0 37504 36780 1346 3847 3 6 4 87 0 6 0 45988 0 7673412 0 0 43776 42904 1434 4643 3 6 1 91 0 5 0 48244 0 7669064 0 0 30848 44548 1094 3317 3 5 3 88 0 5 0 286084 0 7439272 0 0 35456 33520 1403 3765 3 7 8 82 2 4 0 181440 0 7543056 0 0 51968 27148 1491 4836 3 6 3 87 0 4 0 119520 0 7601468 0 0 29184 43984 969 2825 3 4 4 89 0 6 0 44456 0 7680104 0 0 43008 34584 1377 4120 3 5 2 89 1 5 0 49396 0 7666304 0 0 29956 56836 1091 3636 2 5 0 92 0 6 0 47008 0 7669068 0 0 37508 22336 1439 4405 3 6 0 91 0 6 0 45312 0 7670448 0 0 20864 32748 1173 3639 2 4 0 94 0 7 0 45792 0 7673916 0 0 29568 14856 996 2870 3 5 1 91 0 6 0 44136 0 7679304 0 0 16128 63532 1052 2344 2 3 4 90 0 3 0 44800 0 7678884 0 0 49664 15644 1407 5044 2 7 1 89 0 7 0 45180 0 7679664 0 0 44416 28660 1534 4548 2 6 2 89 0 7 0 43892 0 7678188 0 0 30080 35848 1365 4087 3 4 2 91 0 7 0 45176 0 7672668 0 0 35456 39344 1145 3608 3 5 1 90 0 7 0 46560 0 7671748 0 0 25472 36872 1239 3494 3 5 0 92 0 4 0 47096 0 7675136 0 0 30464 24052 1044 3068 2 5 5 88 0 5 0 43864 0 7683468 0 0 34944 31988 1274 3345 3 5 10 82 0 3 0 44452 0 7676068 0 0 32640 43300 1758 5028 2 5 1 93 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 5 0 46088 0 7679004 0 0 26368 34104 951 2415 3 4 10 83 1 7 0 44644 0 7680964 0 0 39296 40568 1457 4161 3 5 4 87 0 3 0 46904 0 7677248 0 0 17280 52767 847 2121 2 4 26 68 0 6 0 45004 0 7679484 0 0 33408 36716 1135 3297 3 5 8 83 0 3 0 45324 0 7680364 0 0 32128 41892 1873 4455 2 5 7 85 0 3 0 45672 0 7679632 0 0 38144 32788 1146 2985 2 5 4 88 0 4 0 44792 0 7681208 0 0 31488 34320 1255 2825 3 4 6 87 0 5 0 44856 0 7678824 0 0 40960 31488 1259 3753 3 6 1 89 5 4 0 46340 0 7678096 0 0 53632 25620 1703 5203 3 7 4 86 0 3 0 46204 0 7678884 0 0 34816 37500 1395 3642 2 5 4 88 0 5 0 47052 0 7671596 0 0 51840 40476 1460 4763 3 6 3 87 0 3 0 44276 0 7676316 0 0 27520 30743 1245 3234 3 4 2 90 0 5 0 46692 0 7678184 0 0 22784 48161 909 2197 2 5 3 90 0 5 0 44396 0 7679168 0 0 46336 30268 1546 4511 3 6 5 85 0 5 0 44352 0 7680436 0 0 43264 26744 1543 4340 2 6 2 90 0 5 0 45496 0 7672684 0 0 44672 37352 1397 4869 3 7 0 90 0 5 0 49372 0 7662316 0 0 33792 49596 1403 3972 3 6 1 90 0 4 0 45644 0 7677232 0 0 24960 22280 926 2576 2 4 3 90 0 6 0 44964 0 7680548 0 0 53120 33176 1705 5679 3 7 1 88 0 7 0 45844 0 7671820 0 0 22456 37240 1288 4498 3 5 0 92 0 8 0 45220 0 7667760 0 0 27404 24584 979 2885 2 5 0 93 0 7 0 45812 0 7666412 0 0 21308 24720 1304 4636 2 3 5 89 0 6 0 44268 0 7664988 0 0 29152 36896 1250 4517 3 5 0 92 1 7 0 48448 0 7667896 0 0 27376 21508 1298 4418 3 5 0 92 0 3 0 52464 0 7672108 0 0 27776 36073 1239 4920 3 5 0 92 0 4 0 46272 0 7677508 0 0 43008 40522 1351 4645 2 6 0 91 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 1 7 0 45520 0 7672060 0 0 39296 36952 1499 4941 3 6 0 90 2 6 0 44008 0 7667764 0 0 41088 39800 1286 4794 3 6 0 91 0 5 0 45480 0 7665824 0 0 32512 39428 1362 4147 3 5 1 90 5 7 0 45260 0 7674556 0 0 41728 18204 1464 5160 2 6 1 90 0 6 0 44860 0 7674124 0 0 31360 43012 1069 3452 3 5 4 88 1 5 0 44912 0 7672628 0 0 31744 43300 1365 4663 3 5 0 92 0 4 0 50228 0 7674084 0 0 22656 51238 922 2896 2 5 3 90 1 6 0 44768 0 7680248 0 0 51840 24756 1661 6082 3 7 5 84 0 6 0 47312 0 7678352 0 0 45952 25656 1554 5644 3 6 3 87 0 8 0 44808 0 7675632 0 0 39552 47932 1247 4241 3 6 1 90 0 8 0 45832 0 7664640 0 0 33024 47104 1447 4698 2 5 0 93 0 9 0 46444 0 7664368 0 0 40192 41608 1299 4904 3 6 0 
91 0 9 0 45212 0 7673000 0 0 39296 16156 1461 5472 3 5 0 92 5 6 0 44980 0 7673068 0 0 25216 51256 1245 4057 2 5 2 90 0 6 0 44220 0 7681560 0 0 34816 27188 1088 3746 2 6 6 85 0 9 0 44824 0 7680132 0 0 47616 19960 1891 6827 3 6 9 81 0 2 0 45324 0 7678940 0 0 3200 81459 615 855 2 2 42 54 0 6 0 40888 0 7683144 0 0 24080 43565 1102 3157 3 4 28 65 0 9 0 46280 0 7678620 0 0 51460 7176 1738 6587 3 6 3 88 1 1 0 44716 0 7681788 0 0 31360 39488 1113 3891 3 5 14 78 0 9 0 45096 0 7680228 0 0 52864 21536 1640 5803 3 7 6 84 0 9 0 45744 0 7680164 0 0 44032 31268 1332 4769 4 6 2 88 0 10 0 45880 0 7669856 0 0 39336 49300 1433 4746 1 5 0 94
^C

Bug 12309 - "Large I/O operations result in slow performance and high iowait times". Where is the low iowait? Where did the fast I/O operations go? Where? Status: RESOLVED INSUFFICIENT_DATA.

There is an ongoing discussion about a similar issue: http://lkml.org/lkml/2009/5/15/320 and http://lkml.org/lkml/2009/5/16/23

Confirming the bug. My OS is Fedora Core release 6. Kernel: 2.6.22.14-72.fc6. 2 CPUs: Intel® Xeon® CPU 5130 @ 2.00GHz. HDDs: SAS 3.0 Gb/s, FUJITSU. RAID: Adaptec 4800SAS RAID10.

How to test:
# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
In another terminal, during the copy, run:
# vmstat 1

I see, for example:

r b swpd free buff cache si so bi bo in cs us sy id wa st 14 8 460 120716 280236 1509844 0 0 9 14 0 0 9 3 66 22 0 0 13 468 121936 279216 1550936 0 0 1368 47776 1927 4153 24 8 8 60 0 0 15 468 121516 280200 1551200 0 0 1408 3744 1726 2846 1 2 3 94 0 0 8 468 129804 280520 1545940 0 0 1612 4280 1854 4060 3 2 1 95 0 0 6 468 131388 281868 1546628 0 0 2140 3620 2020 4650 12 3 13 71 0 0 17 468 114220 282792 1571864 0 0 1208 3212 1647 2715 4 3 6 87 0 1 12 468 115356 283164 1570704 0 0 1420 18964 1718 2397 2 2 2 94 0 0 9 468 114320 283628 1570868 0 0 768 1204 1753 2831 3 1 0 96

iowait -> 80-90% during 'dd'. All other CPU tasks run very, very slowly... AND (!!!), the output of 'dd' is:

1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 112.086 seconds, 9.4 MB/s
^^^^^^^^^

For some years I have been seeing the following behaviour: if the server uses the hard disk heavily (for testing: the 'dd' examples here), then iowait sits at a steady 50-90% and many tasks freeze for seconds at a time (10-20 and sometimes more in my case). It's easy to reproduce with 'dd'. I cannot work around this with ionice, for example - iowait stays high even if I run the IO tasks with ionice -c3 or ionice -c2 -n7! So every server on kernel 2.6.18 and later (I have read many threads) has this bug. People in forums write that kernel 2.6.30-rc2 has the bug too, and that FreeBSD stays fast (mouse movement, video playback and other CPU tasks) during the 'dd' test, unlike Linux... I don't know what more evidence you need to find this bug! It has existed since 2007... Please help!
Here are examples from my loaded server at various times (not dd - only the typical MySQL database tasks and Apache tasks):

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- - r b swpd free buff cache si so bi bo in cs us sy id wa st 13 14 120 68460 574784 1286748 0 0 13 1 0 0 9 3 66 22 0 1 11 120 74564 576080 1286976 0 0 1560 0 1632 3641 34 10 0 57 0 0 12 120 69988 577572 1287352 0 0 1904 0 1969 3696 5 2 0 93 0 0 11 120 66916 578984 1287860 0 0 1900 0 1809 3615 6 2 0 92 0 0 11 120 64960 580424 1288028 0 0 1668 0 1642 2188 1 1 0 97 0 0 11 120 72764 576508 1286788 0 0 1668 0 1681 2198 3 2 0 96 0 1 11 120 71424 577940 1287300 0 0 1604 332 1575 2152 2 1 0 97 0 3 11 120 58852 579528 1289100 0 0 2000 0 1984 3286 44 7 0 49 0 1 11 120 75104 581012 1287472 0 0 1608 0 2119 2839 39 7 0 55 0 0 13 120 72160 582572 1287672 0 0 1908 120 1645 2366 7 1 0 92 0

[root@63 logs]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- - r b swpd free buff cache si so bi bo in cs us sy id wa st 5 9 120 95540 570248 1276840 0 0 13 1 0 0 9 3 66 22 0 1 7 120 93996 571428 1277440 0 0 1772 33712 2024 4341 28 4 11 57 0 0 7 120 97980 572528 1277884 0 0 1444 300 1568 2339 13 1 17 70 0 0 7 120 99900 573532 1278468 0 0 1504 0 1513 2364 4 2 3 90 0 1 5 120 98656 574484 1278540 0 0 1052 400 1629 1924 2 1 0 97 0 1 3 120 97924 574932 1278916 0 0 480 21108 2276 1987 11 2 47 40 0 1 4 120 87280 575264 1279040 0 0 432 3676 2456 2654 23 2 40 35 0 1 5 120 95856 575668 1279140 0 0 780 4128 2249 3097 26 2 25 47 0

Here you can see the consistently high 'wa' field. When my tasks freeze for 10-20 seconds, I see 80-90% 'wa' there. Please catch this bug! Thanks.

Created attachment 21774 [details]
Test patch against heavy io bug

I have done a bisection and got these two patches. Reverting these patches improves the desktop responsiveness on my notebook enormously. I have tested it on a 2.6.28 non-SMP kernel (my heavy-IO testing installation) during four concurrent read and write operations, while working with two VMs. It's only a Core2 @ 2.4GHz system. I can even start new applications during heavy IO. I have attached the patch which I applied to my test installation. Use it with care, as I am not a kernel developer and do not know the dependencies in the CFQ scheduler. I have reverted these two patches:

07db59bd6b0f279c31044cba6787344f63be87ea is first bad commit
commit 07db59bd6b0f279c31044cba6787344f63be87ea
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Fri Apr 27 09:10:47 2007 -0700

Change default dirty-writeback limits

Do this really early in the 2.6.22-rc series, so that we'll get feedback. And don't change by half measures. Just cut the default dirty limit to a quarter of what it was, and see if anybody even notices.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 b63eb9faf5b9a42a1cdad901a5f18d6cceb7fdf6 2b8b4117ca34077cb0b817c77595aa6c9e34253a M mm

a993800655ee516b6f6a6fc4c2ee13fedfb0590b is first bad commit
commit a993800655ee516b6f6a6fc4c2ee13fedfb0590b
Author: Jens Axboe <jens.axboe@oracle.com>
Date: Fri Apr 20 08:55:52 2007 +0200

cfq-iosched: fix sequential write regression

We have a 10-15% performance regression for sequential writes on TCQ/NCQ enabled drives in 2.6.21-rcX after the CFQ update went in. It has been reported by Valerie Clement <valerie.clement@bull.net> and the Intel testing folks. The regression is because of CFQ's now more aggressive queue control, limiting the depth available to the device.
This patch fixes that regression by allowing a greater depth when only one queue is busy. It has been tested to not impact sync-vs-async workloads too much - we still do a lot better than 2.6.20.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

:040000 040000 07c48a6930ce62d36540b6650e3ea0563bd7ec59 95fc11105fe3339c90c4e7bebb66a820f7084601 M block

Here is the fsync result on my machine:

**************************************************************************
Without patch
Linux balrog 2.6.28 #2 Mon Mar 23 11:19:13 CET 2009 x86_64 GNU/Linux
fsync time: 7.8282 fsync time: 17.3598 fsync time: 24.0352 fsync time: 19.7307 fsync time: 21.9559 fsync time: 21.0571
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 129.286 s, 40.6 MB/s
fsync time: 21.8491 fsync time: 0.0430 fsync time: 0.0448 fsync time: 0.0451 fsync time: 0.0451 fsync time: 0.0451 fsync time: 0.0452

**************************************************************************
With patch
Linux balrog 2.6.28 #5 Fri Jun 5 22:23:54 CEST 2009 x86_64 GNU/Linux
fsync time: 2.8409 fsync time: 2.3345 fsync time: 2.8423 fsync time: 0.0851 fsync time: 1.2497 fsync time: 0.9981 fsync time: 0.9494 fsync time: 2.7094 fsync time: 2.9753 fsync time: 2.8886 fsync time: 2.9894 fsync time: 1.2673 fsync time: 2.6728 fsync time: 1.3408
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB) copied, 117.388 s, 44.7 MB/s
fsync time: 85.1461 fsync time: 23.5310 fsync time: 0.0317 fsync time: 0.0337 fsync time: 0.0338 fsync time: 0.0338

Fantastic! Have you bisected the whole kernel tree between 2.6.17 and 2.6.20? Really great.

I've found those patches. The first one doesn't seem to be very important to me, and in 2.6.30 some of its changes have been reverted. But the second one dramatically changes my system's responsiveness. I'm now running with it reverted, and there's no comparison with the old behavior: my pointer no longer freezes when performing updates, and almost everything is smooth! For those who would like to try the patch on 2.6.30, I've updated it as best I could, and I'm attaching it. It's quite dirty and I was doubtful it would work, but it looks like it's enough.

Would a kernel dev look at the patches Thomas identified and tell us what he thinks?

Created attachment 21816 [details]
Patch to revert second commit, updated to apply against 2.6.30rc8
(In reply to comment #360)
Thank you very much for your work. I can't imagine how long that bisection must have taken, and it is very exciting to have finally found a potential culprit. It would be best for everyone if you opened a new bug report with this information. Developers would be far more likely to look at it if we had a clean slate on which to start.

Are there patches for 2.6.29 available that I can test?

Isn't the second patch just adjusting things which can be adjusted in proc?

echo 10 > /proc/sys/vm/dirty_background_ratio
echo 40 > /proc/sys/vm/dirty_ratio

Someone want to do some tests after adjusting those two?

Created attachment 21822 [details]
Backport of the reverted CFQ commit
This is a proper backport of the commit that was identified by Thomas to be the problematic one.
Thomas, can you please verify that this makes 2.6.30-rc8 behave better? If it does, it would be interesting to narrow it down to one single change. The first change always makes sure that we drain the queue before servicing a queue that has idling enabled; the second is just a tweak for idle/async immediate expiration. I think the first one is likely the interesting bit, but it would be good to have confirmation of that.
And Thomas, thanks for all your work on this!
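To make the "drain" part of that description concrete: it corresponds to the two-line check that the revert restores in CFQ's dispatch path. The condition below is quoted from the test results that follow; the surrounding comment is my own annotation, not kernel source.

/* In cfq_dispatch_requests(): if the driver already has requests in flight
 * and the queue currently being serviced idles between requests (a sync
 * queue), dispatch nothing now and let the in-flight IO drain first. */
if (cfqd->rq_in_driver && cfq_cfqq_idle_window(cfqq))
        return 0;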
(In reply to comment #365)
> Isn't the second patch just adjusting things which can be adjusted in proc?
>
> echo 10 > /proc/sys/vm/dirty_background_ratio
> echo 40 > /proc/sys/vm/dirty_ratio
>
> Someone want to do some tests after adjusting those two?

We already determined months ago that tuning those knobs way down was a way to minimize the problem. (See comment #263 and comment #292 for test results.) It's not a solution, though; it just skirts around the real issue.

(In reply to comment #366)
> I think the first one is likely the interesting bit, but it would
> be good to have confirmation on that.

Yes, it is the first one. I could only run my long-running test, which can identify a bad kernel but cannot positively confirm a good one; still, I ran it for a long time and there were no long lame-encoding times. The encoding took 40s without any IO on all kernels, and 48-55s during heavy IO with the following lines applied:

+ if (cfqd->rq_in_driver && cfq_cfqq_idle_window(cfqq))
+ return 0;

It took 55-80s during heavy IO without any patch, or with only the second change. This may be related too: with the second core enabled and without the first change, the lame encoding process was shifted between the cores and took up to 130 seconds. I could see this happening, as the maximum clock frequency switched between the cores.

This question has probably been answered before, but this bug is huge so I'll just ask again... Thomas, what kind of drive are you using? Does it have NCQ enabled? If so, does disabling NCQ make any difference? You can disable NCQ on sda by doing:

# echo 1 > /sys/block/sda/device/queue_depth

(or use sdX for others, naturally).

The last tests I did were on a SATA drive with queue depth 31. Reducing the queue depth roughly halves the overall throughput of the two/four concurrent copy operations, with and without the patch. I have tried to run some tests, but got some really strange results. I will try again on my test installation at home.

(In reply to comment #366)
cd /usr/src/linux-2.6.30-rc8+
suse:/usr/src/linux-2.6.30-rc8+ # patch -p1 < cfq.dif (#360)
patching file block/cfq-iosched.c
Hunk #1 FAILED at 1073.
Hunk #2 FAILED at 1119.
Hunk #3 FAILED at 1129.
3 out of 3 hunks FAILED -- saving rejects to file block/cfq-iosched.c.rej
patching file mm/page-writeback.c
Reversed (or previously applied) patch detected! Assume -R? [n] y
Hunk #1 succeeded at 66 with fuzz 1.
Hunk #2 FAILED at 77.
1 out of 2 hunks FAILED -- saving rejects to file mm/page-writeback.c.rej
suse:/usr/src/linux-2.6.30-rc8+ # patch -p1 < cfq.dif (#360 + #366)
patching file block/cfq-iosched.c
Hunk #3 FAILED at 1119.
Hunk #4 FAILED at 1129.
2 out of 4 hunks FAILED -- saving rejects to file block/cfq-iosched.c.rej
patching file mm/page-writeback.c
Reversed (or previously applied) patch detected! Assume -R? [n] Apply anyway? [n] y
Hunk #1 FAILED at 66.
Hunk #2 FAILED at 77.
2 out of 2 hunks FAILED -- saving rejects to file mm/page-writeback.c.rej
(In reply to comment #371) You should only try the patch in comment #366 Ok, 2.6.30-rc8 + patch in comment #366, xfs
dd if=/dev/zero of=./bigfile bs=1M count=15000 & ./fsync-tester
fsync time: 1.7085
fsync time: 1.6639
fsync time: 0.4616
fsync time: 1.3800
fsync time: 1.3603
fsync time: 1.5529
fsync time: 1.8435
fsync time: 0.2561
fsync time: 0.9318
fsync time: 0.1965
fsync time: 1.2233
fsync time: 1.3920
fsync time: 0.4677
fsync time: 0.4560
fsync time: 1.8206
fsync time: 1.8135
fsync time: 1.8342
fsync time: 0.8565
fsync time: 0.9477
fsync time: 2.8569
fsync time: 0.4323
15000+0 records in
15000+0 records out
15728640000 bytes (16 GB) copied, 181.923 s, 86.5 MB/s
fsync time: 1.3716
fsync time: 0.0168
fsync time: 1.5381
fsync time: 1.5649
fsync time: 0.0349
fsync time: 0.0636
fsync time: 0.0657
fsync time: 0.3337
fsync time: 0.0393
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 2 2 0 4230432 808 3417716 0 0 0 87568 1102 1850 1 7 13 79
 0 4 0 4149632 808 3499392 0 0 0 83960 722 1037 1 5 36 57
 0 4 0 4069892 808 3578140 0 0 0 76840 701 1178 1 5 0 93
 1 3 0 3988784 808 3659444 0 0 0 78848 727 1151 1 5 14 79
 0 4 0 3889380 808 3757188 0 0 0 97310 804 1200 2 6 33 59
 0 3 0 3807540 808 3838720 0 0 0 79888 614 1010 2 5 19 74
 0 4 0 3729056 808 3918092 0 0 0 76866 840 1367 0 5 29 65
 0 3 0 3002860 808 4645932 0 0 0 90672 597 817 2 6 0 93
 0 4 0 2921840 808 4728132 0 0 0 80416 865 1377 1 6 0 93
 0 3 0 2841564 808 4810132 0 0 0 80384 627 933 1 5 0 93
 1 4 0 2743820 808 4906136 0 0 0 94216 892 1398 1 7 0 92
 0 3 0 2666100 808 4984280 0 0 0 77824 770 1217 1 5 0 93
 1 2 0 2590248 808 5063188 0 0 0 82496 795 1283 2 6 0 92
While copying /usr/src/linux-2.6.30-rc8 -> /usr/src/linux-2.6.30-rc8+ (in Konsole, without using Dolphin or any other GUI), it is impossible to launch :~> kdesu /usr/bin/kwrite; it stays impossible after the copy completes, and the only remedy is to re-login the user or reboot the computer. The copy speed of /usr/src/linux-2.6.30-rc8 -> /usr/src/linux-2.6.30-rc8+ was near zero and stayed that way.
time cp -r /usr/src/linux-2.6.30-rc8 /usr/src/linux-2.6.30-rc8+
real 6m14.566s
user 0m0.158s
sys 0m2.838s
A correction to my earlier report: sometimes kdesu /usr/bin/kwrite does launch successfully after the copy completes, but never while the copy is in progress. (In reply to comment #369) > This question has probably been answered before, but this bug is huge so I'll > just ask again... Thomas, what kind of drive are you using? Does it have NCQ > enabled? If so, does disabling NCQ make any difference? This bug is really annoying. I was not able to reproduce the mouse freezes any more, with and without the patch, and with and without NCQ. I will try again later. Is there a possibility to simulate a disk in RAM with a parametrized speed and latency? Created attachment 21849 [details]
The corrected patch from post #360 (for 2.6.29 and maybe later kernels)
I tried to apply the patch from post #360 to kernel 2.6.29 and got some rejects.
I resolved the rejects by hand and am posting the fixed variant here.
I saw the patch from #366, but I think it does not contain the same corrections as #360.
So I would like to suggest testing this patch (only the cfq-iosched.c file).
And I saw the patch from post #366, but I didn't understand why the author says that it is a "proper backport": there is no code with the 'prev_cfqq' variable. I think the patch from #366 may not be a valid patch. Please try this patch. It is for 2.6.29 and, I think, later kernels. I didn't test it because I don't have a test machine for experiments; I only have a Linux server under heavy load... It IS the proper backport. I'm the author and maintainer of CFQ, I should know... I would generally advise against using patches from people who don't know what they are doing, especially for data-integrity-critical code like the IO scheduler. There could be data loss from bad patches. The reason the 2.6.30 and 2.6.29 patches are different is that the CFQ request dispatch mechanism is different in 2.6.30. As such there's no prev_cfqq to take into account, since we never dispatch from more than one cfqq in one round. You would need to take the prev_cfqq out of local function scope for it to have any meaning. So, not to be rude, but the last thing this bug needs is more cooks or chefs asking people to test things. It's a huge mess already. For now the focus is making Thomas happy, since he's spent much time on this and has a reproducible (sort of) way of testing it. Once that is done, we can proceed to any other potential issues. Any comments not related to that exact issue will be ignored. Created attachment 21852 [details]
test results
Two SAMSUNG HD753LJ hard drives + NCQ + mdadm RAID1 + ext3 + 2GB RAM + Core2Duo E6750 2.66 @ 3.44 GHz
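The fsync-tester binary used in these runs is not quoted anywhere in the thread. As a reconstruction (not the original tool), a minimal equivalent that produces the same "fsync time:" lines might look like the following; the scratch file name and the one-sample-per-second pacing are my assumptions.

/* Minimal stand-in for fsync-tester (a reconstruction, not the original):
 * write ~1MB to a scratch file, time the fsync(), print it, repeat.
 * Run next to "dd if=/dev/zero of=./bigfile bs=1M count=15000 &". */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static char buf[1024 * 1024];   /* ~1MB of dirty data per iteration */

int main(void)
{
        struct timeval start, stop;
        int fd = open("fsync-tester.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        memset(buf, 'a', sizeof(buf));
        for (;;) {
                lseek(fd, 0, SEEK_SET);
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                        perror("write");
                        return 1;
                }
                gettimeofday(&start, NULL);
                if (fsync(fd) < 0) {    /* the latency being measured */
                        perror("fsync");
                        return 1;
                }
                gettimeofday(&stop, NULL);
                printf("fsync time: %.4f\n",
                       (stop.tv_sec - start.tv_sec) +
                       (stop.tv_usec - start.tv_usec) / 1e6);
                sleep(1);               /* one sample per second */
        }
}

The idea is that a small sync write should complete quickly; under this bug, the concurrent dd streaming writeback pushes the fsync() into the multi-second range seen below.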
Comment on attachment 21852 [details] test results >==============================2.6.30============================== >ff@home-desktop:~$ dd if=/dev/zero of=./bigfile bs=1M count=15000 & >./fsync-tester >[1] 6958 >fsync time: 0.1025 >fsync time: 0.8720 >fsync time: 5.5800 >fsync time: 5.6179 >fsync time: 3.7413 >fsync time: 4.2393 >fsync time: 5.2596 >fsync time: 0.0985 >fsync time: 1.7070 >fsync time: 4.1414 >fsync time: 0.1577 >fsync time: 4.8191 >fsync time: 0.6993 >fsync time: 3.6732 >fsync time: 3.6963 >fsync time: 4.7696 >fsync time: 6.0947 >fsync time: 3.4383 >fsync time: 0.7583 >fsync time: 4.0760 >fsync time: 4.1786 >fsync time: 3.9886 >fsync time: 0.3802 >fsync time: 3.4182 >fsync time: 1.1262 >fsync time: 2.8425 >fsync time: 3.9217 >fsync time: 1.4758 >fsync time: 3.7798 >fsync time: 3.9234 >fsync time: 0.3557 >fsync time: 4.1882 >fsync time: 4.4526 >15000+0 records in >15000+0 records out >15728640000 bytes (16 GB) copied, 231.473 s, 68.0 MB/s >fsync time: 2.1747 >fsync time: 0.0820 >fsync time: 0.0774 >fsync time: 0.0299 >fsync time: 0.0268 >fsync time: 0.0282 >fsync time: 0.0277 >fsync time: 0.0270 >^C >[1]+ Done dd if=/dev/zero of=./bigfile bs=1M count=15000 > >ff@home-desktop:~$ vmstat 1 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 0 214308 1592368 6344 66260 1 5 253 653 611 592 15 5 73 7 > 0 0 214308 1592400 6344 66264 0 0 0 0 309 525 7 3 90 0 > 2 0 214308 1592448 6344 66264 0 0 0 0 365 686 5 3 91 0 > 0 0 214308 1592400 6344 66264 0 0 0 0 291 543 5 3 92 0 > 0 2 214308 1126216 6756 464876 0 0 24 398976 980 1265 7 36 37 > 20 > 0 4 214308 1107468 6780 489032 0 0 0 20524 671 551 9 5 35 51 > 0 6 214308 1118544 6780 489032 0 0 0 4 658 575 7 3 32 58 > 0 5 214308 1129752 6784 489032 0 0 0 4 646 578 6 5 36 53 > 0 4 214308 1142036 6784 489032 0 0 0 8 656 576 6 4 36 54 > 2 3 214308 1151708 6784 489032 0 0 0 0 590 501 8 3 16 72 > 0 1 214308 1156616 6792 491124 0 0 0 1572 587 485 7 3 29 60 > 0 2 214308 704504 7188 876836 0 0 0 392152 885 716 8 38 21 32 > 0 4 214308 637132 7252 942604 0 0 0 65728 666 494 7 10 0 83 > 0 4 214308 561368 7324 1016556 0 0 0 73984 686 499 7 12 0 81 > 0 4 214308 490020 7392 1086476 0 0 0 69920 693 537 7 10 0 83 > 0 4 214308 418224 7460 1156364 0 0 0 69888 686 490 9 9 0 82 > 0 3 214308 398752 7496 1177372 0 0 4 22316 781 500 6 7 7 80 > 0 4 214308 406700 7496 1177404 0 0 28 0 532 510 8 3 10 79 > 0 5 214308 416212 7516 1177528 0 0 160 8 645 550 6 4 14 76 > 0 4 214308 427788 7524 1177648 0 0 108 12 620 526 8 4 1 87 > 0 3 214308 437528 7536 1177688 0 0 56 0 674 651 6 5 0 89 > 1 2 214308 302288 7688 1321540 0 0 8 16 540 533 9 12 20 58 > 1 3 214308 15268 7536 1548008 0 0 56 391944 878 707 7 28 0 > 65 > 0 5 214308 15152 7372 1548204 0 0 96 69896 699 574 8 11 0 81 > 1 7 214308 15232 7420 1549092 0 0 220 45220 661 616 6 9 0 85 > 7 6 214308 14752 7544 1548616 0 0 640 82216 755 747 6 13 0 81 > 0 5 214308 15084 7620 1548232 0 0 24 65720 709 674 8 11 0 81 > 0 4 214308 17436 7636 1549072 0 0 284 8224 889 572 6 6 0 88 > 0 4 214308 17776 7660 1550176 0 0 100 1496 637 545 6 5 0 89 > 0 4 214308 27604 7684 1550280 0 0 132 0 606 513 7 5 0 88 > 0 4 214308 35192 7712 1550412 0 0 156 0 588 572 8 3 0 90 > 0 5 214308 44060 7744 1550820 0 0 480 0 681 683 6 6 0 88 > 0 3 214308 55500 7800 1551100 0 0 296 12 750 746 5 6 31 58 > 2 1 214308 13580 8028 1601036 0 0 1356 120 696 881 8 9 26 56 > 0 3 214308 15220 8316 1547984 0 0 0 381300 944 1397 7 39 18 > 37 > 0 5 214308 13560 8384 1550264 0 0 0 
69944 769 536 7 10 0 82 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 2 5 214308 14424 8452 1549564 0 0 0 69888 739 508 6 9 0 85 > 1 5 214308 13952 8524 1549784 0 0 0 74016 719 527 7 12 0 81 > 0 3 214308 14468 8564 1549144 0 0 0 69920 781 576 6 12 1 81 > 0 3 214308 20844 8564 1549148 0 0 0 0 580 477 7 3 22 67 > 0 5 214308 28084 8568 1549144 0 0 0 8 659 498 7 3 36 54 > 0 3 214308 37388 8576 1549148 0 0 0 1764 605 555 8 5 12 76 > 0 3 214308 45184 8576 1549148 0 0 0 0 525 470 6 5 0 90 > 0 1 214308 58560 8576 1549148 0 0 0 12 811 623 8 5 5 81 > 0 3 214308 14952 8812 1592052 0 0 0 57644 637 804 6 25 37 32 > 0 2 214308 14012 8952 1549216 0 0 0 347944 1108 767 8 19 14 > 59 > 0 2 214308 15548 8964 1549476 0 0 0 5320 421 472 7 3 2 88 > 0 4 214308 28940 8964 1549476 0 0 0 12 694 466 7 5 11 77 > 0 4 214308 43720 8964 1549476 0 0 0 0 669 455 6 3 34 57 > 0 4 214308 56184 8964 1549476 0 0 0 0 714 437 6 4 36 54 > 0 4 214308 13704 8752 1594000 0 0 8 32772 889 834 4 28 30 38 > 0 3 214308 14292 8760 1592532 0 0 0 72228 720 1005 8 6 36 51 > 0 2 214308 15452 8680 1548660 0 0 0 301880 688 756 6 18 18 > 59 > 0 2 214308 24520 8680 1548660 0 0 0 0 618 443 7 4 25 64 > 0 3 214308 39648 8680 1548660 0 0 0 4 683 482 5 5 21 68 > 1 2 214308 52696 8684 1548660 0 0 0 12 583 602 6 5 35 53 > 3 0 214308 13620 8344 1599184 0 0 40 4 817 598 6 14 4 76 > 1 1 214308 14660 7476 1565564 0 0 112 276832 772 970 3 27 14 > 56 > 0 4 214308 14024 7544 1552748 0 0 4 145788 625 542 0 9 0 > 91 > 2 2 214308 13996 7596 1556652 0 0 4 28800 523 529 1 5 3 90 > 0 5 214308 14692 7692 1552168 0 0 4 115004 702 578 0 12 2 > 86 > 0 5 214308 17748 7744 1551596 0 0 8 41132 619 522 2 5 0 93 > 0 4 214308 15604 7808 1560576 0 0 204 0 631 537 0 6 0 94 > 0 3 214308 26272 7808 1560576 0 0 0 0 539 486 2 1 0 97 > 0 3 214308 38512 7816 1560576 0 0 0 1440 555 503 0 0 0 > 100 > 1 1 214308 50424 7816 1560576 0 0 0 0 465 497 2 0 27 71 > 3 2 214308 14948 8096 1595136 0 0 12 81980 816 866 0 25 42 32 > 0 3 214308 14936 8188 1550608 0 0 4 347056 643 579 2 15 15 > 69 > 1 3 214308 13788 8256 1552700 0 0 4 64168 599 476 0 7 0 93 > 0 3 214308 14436 8292 1551948 0 0 0 73920 657 567 2 10 11 77 > 1 2 214308 14452 8380 1551596 0 0 4 80780 795 529 0 10 0 90 > 0 3 214308 15284 8400 1552484 0 0 4 16384 390 461 2 2 36 60 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 2 214308 27336 8408 1552484 0 0 0 1688 631 507 0 1 10 88 > 0 2 214308 39496 8408 1552484 0 0 0 0 500 442 2 1 29 68 > 1 1 214308 53336 8408 1552484 0 0 0 0 616 503 0 1 44 55 > 1 2 214308 14904 8456 1602564 0 0 148 4 515 588 1 10 42 46 > 0 6 214308 14088 6968 1595120 0 0 1280 123568 793 1043 1 26 1 > 73 > 0 4 214308 15032 5900 1554300 0 0 1108 312700 695 1060 0 13 0 > 87 > 0 6 214308 19824 5904 1556772 0 0 192 0 465 459 0 1 0 99 > 0 4 214308 14432 5640 1554152 0 0 680 125936 698 583 0 14 0 > 85 > 1 4 214308 14268 4524 1556644 0 0 4 79152 627 536 0 9 0 91 > 0 4 214308 15472 4328 1557116 0 0 44 67104 653 512 2 9 0 89 > 0 3 214308 14192 4356 1557264 0 0 4 34360 869 524 0 4 0 96 > 0 3 214308 23236 4356 1557264 0 0 0 0 521 509 2 1 37 60 > 0 4 214308 34784 4364 1557264 0 0 4 12 536 499 0 1 32 68 > 0 4 214308 47144 4364 1557336 0 0 72 0 420 426 0 1 28 71 > 0 1 214308 62740 4368 1557336 0 0 0 16 677 727 3 1 30 68 > 0 1 214308 14492 4608 1604772 0 0 72 37004 608 779 8 21 41 30 > 0 3 214308 17452 4768 1600460 0 0 4 91240 801 740 6 20 7 67 > 0 2 
214308 13752 4848 1557448 0 0 4 352920 1058 907 8 15 9 > 69 > 0 3 214308 20620 4848 1557444 0 0 0 0 618 478 6 4 30 60 > 0 3 214308 18152 4972 1556620 0 0 4 117476 743 575 2 14 2 > 82 > 0 4 214308 13708 5060 1557112 0 0 0 100632 848 556 1 11 19 > 69 > 3 4 214308 17728 5132 1556508 0 0 592 32928 650 580 1 4 0 95 > 0 4 214308 17744 5148 1559500 0 0 324 1388 626 543 0 2 19 79 > 0 4 214308 27648 5160 1559788 0 0 236 0 498 504 0 1 2 97 > 0 3 214308 41392 5160 1559920 0 0 140 0 566 592 2 1 8 89 > 0 1 214308 55464 5160 1559920 0 0 0 0 613 659 2 1 43 53 > 2 3 214308 13524 5464 1566192 0 0 8 286840 771 911 3 35 35 > 27 > 1 2 214308 14304 5536 1556740 0 0 4 138424 727 671 7 12 10 > 71 > 0 3 214308 15288 5604 1555804 0 0 4 78144 740 555 7 13 18 63 > 0 3 214308 14120 5680 1554992 0 0 0 84032 622 538 2 10 19 69 > 0 4 214308 15312 5720 1556812 0 0 4 49508 590 564 0 6 12 81 > 0 3 214308 18076 5764 1555236 0 0 0 41132 730 548 2 6 21 71 > 1 3 214308 29532 5768 1555236 0 0 0 1720 501 498 0 1 16 83 > 1 4 214308 41840 5772 1555256 0 0 20 0 612 566 6 4 24 66 > 0 1 214308 56008 5772 1555332 0 0 80 12 747 670 6 3 3 88 > 3 0 214308 13156 5952 1607880 0 0 4 172 609 625 8 16 35 41 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 4 214308 15264 6168 1553756 0 0 8 409604 877 936 5 30 29 > 36 > 0 3 214308 15940 6232 1554652 0 0 4 64624 604 571 2 9 13 75 > 1 2 214308 14836 6316 1555000 0 0 0 93280 682 535 0 13 0 87 > 2 3 214308 14840 6388 1554324 0 0 4 74644 618 515 1 8 3 87 > 0 2 214308 15460 6452 1555048 0 0 4 57504 736 537 0 6 15 79 > 0 3 214308 19076 6460 1556076 0 0 0 1632 672 487 7 4 39 50 > 0 4 214308 30076 6460 1556076 0 0 0 4 630 492 5 5 0 90 > 0 3 214308 42704 6464 1556076 0 0 0 12 569 475 3 2 19 76 > 0 2 214308 51996 6464 1556076 0 0 0 0 520 551 0 0 4 96 > 2 1 214308 13928 6584 1606556 0 0 4 16 532 625 1 10 38 50 > 0 2 214308 14848 6812 1596100 0 0 8 129312 765 897 0 22 22 > 56 > 0 4 214308 14632 6880 1553992 0 0 0 308888 701 662 7 13 0 > 80 > 0 4 214308 15096 6976 1553728 0 0 4 94592 716 592 5 14 0 81 > 0 4 214308 14700 7048 1554080 0 0 0 74048 700 570 7 11 0 82 > 0 4 214308 13780 7116 1555188 0 0 4 71092 685 538 7 10 0 83 > 0 3 214308 14964 7152 1554204 0 0 0 36944 918 502 7 7 0 86 > 0 4 214308 19908 7156 1554292 0 0 88 1532 565 514 6 4 0 90 > 0 4 214308 28984 7160 1554336 0 0 44 12 552 487 8 3 0 89 > 0 5 214308 37492 7160 1554452 0 0 116 36 610 539 6 5 21 68 > 0 2 214308 52188 7164 1554452 0 0 0 12 731 722 6 5 9 79 > 1 1 214308 14644 7304 1605148 0 0 4 8 509 594 6 13 37 44 > 0 1 214308 17204 7340 1597036 0 0 8 81928 678 707 7 15 36 42 > 0 2 214308 15036 7276 1553748 0 0 8 357892 807 848 7 23 37 > 33 > 0 4 214308 14800 7344 1554312 0 0 4 65824 713 585 8 11 9 72 > 1 3 214308 14808 7420 1554168 0 0 0 86340 739 575 7 13 0 80 > 0 4 214308 14004 7492 1554560 0 0 4 74016 639 531 2 8 0 89 > 0 3 214308 19856 7504 1554272 0 0 0 12320 596 506 0 2 29 68 > 0 3 214308 23068 7508 1554272 0 0 0 1476 668 489 1 1 14 83 > 0 3 214308 35408 7508 1554272 0 0 0 0 529 474 0 1 0 99 > 0 3 214308 46364 7512 1554272 0 0 0 12 521 457 1 1 12 86 > 0 2 214308 62656 7512 1554272 0 0 0 4 492 609 0 0 14 86 > 0 2 214308 15412 7680 1594804 0 0 32 94884 735 897 2 32 25 41 > 1 4 214308 14692 7744 1575932 0 0 0 194608 731 1150 0 14 34 > 52 > 0 3 214308 17144 7780 1554232 0 0 4 180920 572 517 2 7 22 > 70 > 1 2 214308 14296 7864 1553780 0 0 0 86432 587 514 0 10 46 44 > 0 3 214308 13692 7944 1554200 0 0 4 82304 620 578 2 8 9 82 >procs 
-----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 4 214308 14068 8020 1553412 0 0 0 69900 746 633 0 8 22 71 > 0 4 214308 19652 8036 1553056 0 0 4 8224 472 540 2 3 0 96 > 0 3 214308 25872 8048 1554388 0 0 0 1456 731 634 0 1 0 98 > 0 3 214308 34260 8048 1554388 0 0 0 0 431 509 1 1 13 84 > 1 3 214308 46200 8048 1554388 0 0 0 0 525 539 0 0 23 76 > 0 1 214308 58964 8048 1554388 0 0 0 0 605 600 2 1 39 58 > 1 1 214308 14124 8136 1556832 0 0 8 333512 796 925 0 35 36 > 28 > 0 3 214308 14004 8072 1552952 0 0 4 99628 638 512 1 10 0 88 > 0 3 214308 14708 8020 1552224 0 0 0 74048 648 501 0 10 12 78 > 0 3 214308 14104 8100 1553228 0 0 4 75852 661 524 2 11 0 87 > 0 3 214308 15048 8040 1552704 0 0 0 65772 775 572 0 8 0 92 > 0 5 214308 21864 8044 1552888 0 0 4 36 441 477 2 0 0 97 > 0 4 214308 33624 8052 1552884 0 0 0 1700 541 524 0 0 0 > 100 > 0 3 214308 45344 8052 1552892 0 0 0 8 533 574 2 1 33 64 > 0 2 214308 57328 8052 1552892 0 0 0 0 537 561 0 1 33 67 > 2 1 214308 14412 8188 1600840 0 0 4 28688 583 749 2 18 41 39 > 0 2 214308 20020 8212 1595688 0 0 4 56132 754 587 0 6 21 73 > 1 1 214308 15360 8220 1552048 0 0 4 369572 841 905 2 23 6 > 69 > 0 3 214308 21504 8224 1552504 0 0 0 5344 423 421 0 2 28 70 > 0 3 214308 32504 8224 1552504 0 0 0 0 547 606 1 1 10 88 > 0 3 214308 45044 8224 1552504 0 0 0 0 513 427 0 1 0 99 > 1 2 214308 57088 8224 1552504 0 0 0 0 609 652 2 0 4 94 > 2 0 214308 16176 8216 1602536 0 0 8 16 651 602 0 16 28 55 > 0 2 214308 15432 8324 1594212 0 0 4 117376 746 789 2 16 7 > 76 > 0 3 214308 15028 8424 1550876 0 0 0 318200 632 621 0 18 1 > 81 > 0 3 214308 14672 8472 1553268 0 0 4 75828 630 571 2 9 17 71 > 0 3 214308 15204 8516 1553032 0 0 4 57568 622 475 0 10 18 73 > 1 3 214308 14388 8580 1553448 0 0 0 86368 674 567 2 12 26 60 > 2 2 214308 20152 8580 1553432 0 0 0 0 521 413 0 0 35 65 > 0 4 214308 21876 8588 1553432 0 0 0 1508 569 578 2 2 21 75 > 0 2 214308 36576 8592 1553428 0 0 0 4 550 416 0 1 35 64 > 0 1 214308 49388 8592 1553432 0 0 0 8 565 553 1 1 39 58 > 0 1 214308 21752 8652 1596876 0 0 4 36 387 460 1 5 51 44 > 0 2 214308 15584 8900 1593092 0 0 8 118212 838 1253 2 24 24 > 50 > 0 1 214308 15640 9016 1599208 0 0 4 4540 610 557 1 11 30 59 > 0 3 214308 13976 9080 1551688 0 0 0 392328 813 684 2 16 16 > 67 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 3 214308 15492 9120 1551696 0 0 4 42364 511 481 3 6 7 84 > 0 3 214308 13956 9204 1552296 0 0 4 82272 686 633 8 12 0 80 > 0 2 214308 16768 9236 1551476 0 0 0 53472 652 483 7 9 5 79 > 0 3 214308 27796 9236 1551476 0 0 0 0 673 583 7 4 29 60 > 0 3 214308 37796 9240 1551476 0 0 0 12 693 458 5 5 36 53 > 0 5 214308 44304 9252 1551476 0 0 8 1448 589 612 8 4 36 52 > 0 1 214308 59348 9256 1551472 0 0 0 4 731 614 6 4 3 87 > 0 3 214308 14708 9500 1595132 0 0 12 57356 696 947 8 26 23 43 > 1 1 214308 13588 9620 1554652 0 0 4 335112 829 756 9 20 12 > 60 > 0 3 214308 14964 9660 1550636 0 0 0 69192 705 554 1 6 10 82 > 2 2 214308 13688 9724 1554112 0 0 4 49324 571 571 0 6 7 86 > 0 2 214308 17120 9788 1549904 0 0 0 68184 615 659 2 8 16 73 > 0 3 214308 28960 9788 1549904 0 0 0 4 530 409 0 1 33 66 > 1 1 214308 40744 9792 1549904 0 0 0 12 540 498 1 2 25 72 > 0 2 214308 47964 9792 1549904 0 0 0 0 481 410 0 0 21 79 > 1 0 214308 15176 9920 1598424 0 0 4 4 690 677 2 11 44 43 > 0 2 214308 14808 10060 1593288 0 0 8 122856 793 1034 0 21 30 > 49 > 0 5 214308 14700 10136 1549124 0 0 28 314392 636 812 2 
17 22 > 60 > 2 2 214308 14576 10192 1551960 0 0 168 70868 654 548 0 9 7 83 > 1 5 214308 15064 10228 1550548 0 0 320 77584 605 582 1 11 5 82 > 1 4 214308 13652 10300 1551720 0 0 144 69852 654 557 0 9 8 83 > 1 3 214308 14824 10316 1551632 0 0 184 45248 606 587 2 6 17 75 > 0 3 214308 17100 10332 1552116 0 0 492 12 734 511 0 1 10 89 > 0 4 214308 24752 10336 1552140 0 0 24 1432 462 503 1 1 4 93 > 0 5 214308 36596 10344 1552336 0 0 288 12 535 508 0 0 0 > 100 > 0 6 214308 48460 10400 1554488 0 0 2116 8 686 1024 2 1 7 89 > 1 2 214308 13784 10568 1602292 0 0 528 20 535 581 0 12 34 53 > 0 3 214308 15168 10736 1549184 0 0 336 359376 1141 1015 2 26 1 > 72 > 0 3 214308 15156 10760 1549976 0 0 72 21776 496 442 0 3 1 95 > 0 3 214308 29756 10768 1550304 0 0 268 0 653 533 2 1 6 92 > 0 3 214308 38128 10776 1550776 0 0 492 0 536 448 0 1 4 95 > 0 3 214308 49364 10780 1551048 0 0 276 0 549 532 2 1 0 96 > 1 5 214304 13884 10992 1593420 32 0 892 20484 726 760 0 19 9 71 > 0 4 214304 14648 11116 1547548 0 0 852 324424 804 1278 2 19 4 > 75 > 0 6 214304 14964 11184 1549504 0 0 1848 35824 638 796 0 6 0 94 > 0 6 214304 28112 11188 1549480 0 0 4 12 620 571 2 1 0 97 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 7 214304 14604 11296 1549212 0 0 48 134928 640 696 0 13 0 > 86 > 0 7 214304 15052 11372 1547720 0 0 168 95424 822 1137 3 17 0 80 > 0 4 214304 20360 11392 1547520 0 0 0 20552 709 512 0 3 7 91 > 0 4 214304 25844 11396 1548544 0 0 16 1416 438 555 2 1 0 97 > 0 4 214304 38652 11396 1548560 0 0 0 0 571 407 0 2 0 98 > 0 3 214304 52360 11396 1548560 0 0 0 0 601 721 2 1 0 97 > 0 3 214304 50348 11424 1565520 0 0 4 16 481 397 0 2 2 97 > 1 3 214304 14952 11768 1596452 0 0 252 52444 661 952 2 25 19 54 > 0 5 214304 15320 11856 1548480 0 0 4 364900 760 666 0 16 4 > 79 > 0 5 214304 14896 11940 1547704 0 0 0 90496 709 633 2 11 0 87 > 0 5 214304 14400 11936 1547988 0 0 188 79464 652 544 0 10 0 90 > 0 7 214304 13556 11972 1549776 0 0 8 57844 668 553 2 8 0 90 > 0 4 214304 16036 11996 1550756 0 0 4 38756 755 518 0 6 0 94 > 0 4 214304 28108 11996 1550780 0 0 24 0 485 505 2 1 16 82 > 0 4 214304 40644 11996 1550780 0 0 0 0 498 427 1 1 12 86 > 0 3 214304 53108 11996 1550780 0 0 0 0 634 689 2 2 32 65 > 0 3 214304 14060 12008 1599916 0 0 52 20 525 490 0 8 27 66 > 1 2 214304 14952 11652 1554332 0 0 144 337780 805 1056 2 32 1 > 66 > 1 4 214304 14136 11672 1548960 0 0 0 121500 693 537 0 11 0 > 89 > 1 4 214304 20988 11700 1549328 0 0 888 12288 557 525 2 2 11 85 > 0 3 214304 35248 11700 1549436 0 0 120 0 659 461 6 4 28 62 > 0 3 214304 48952 11700 1549436 0 0 0 0 586 508 8 3 2 86 > 0 4 214304 49148 11748 1552932 0 0 3468 1368 888 542 3 2 1 93 > 0 5 214304 52508 11756 1552948 0 0 80 12 422 243 2 0 0 98 > 0 5 214304 60292 11784 1553904 32 0 1064 8 492 258 0 0 0 > 100 > 1 2 214304 61600 12056 1556072 0 0 2476 1092 916 2595 9 5 42 44 > 2 0 214304 60032 12072 1557844 0 0 1644 16 528 1295 7 6 66 22 > 0 0 214304 59660 12080 1558152 0 0 308 1052 448 937 7 2 86 4 > 3 0 214304 59660 12088 1558152 0 0 0 1052 297 627 1 0 98 1 > 0 0 214304 59668 12096 1558152 0 0 0 1052 390 1060 2 1 96 1 > 1 0 214304 59660 12104 1558152 0 0 0 1052 434 858 6 2 92 0 > 0 0 214304 59536 12112 1558152 0 0 0 1052 449 897 6 3 90 1 > 2 0 214304 60048 12120 1558620 0 0 468 1052 390 809 3 1 93 3 > 0 0 214304 57320 12136 1561792 0 0 3188 0 395 785 3 1 88 8 >^C > >==============================2.6.30 + patch from >#366============================== >ff@home-desktop:~$ dd if=/dev/zero 
of=./bigfile bs=1M count=15000 & >./fsync-tester >[1] 5148 >fsync time: 0.1111 >fsync time: 4.3442 >fsync time: 3.9939 >fsync time: 3.7558 >fsync time: 5.8475 >fsync time: 1.3059 >fsync time: 3.0354 >fsync time: 4.5832 >fsync time: 4.3041 >fsync time: 0.2866 >fsync time: 0.7935 >fsync time: 3.2131 >fsync time: 1.5684 >fsync time: 2.0876 >fsync time: 0.9385 >fsync time: 4.3251 >fsync time: 4.1135 >fsync time: 0.7379 >fsync time: 4.9408 >fsync time: 1.1250 >fsync time: 4.2838 >fsync time: 1.1455 >fsync time: 4.7464 >fsync time: 2.8139 >fsync time: 3.5942 >fsync time: 0.9125 >fsync time: 3.4242 >fsync time: 0.1742 >fsync time: 4.8445 >fsync time: 4.0925 >fsync time: 0.8951 >fsync time: 4.1239 >fsync time: 0.0716 >fsync time: 4.5728 >fsync time: 0.3215 >fsync time: 4.6018 >fsync time: 3.5965 >15000+0 records in >15000+0 records out >15728640000 bytes (16 GB) copied, 234.895 s, 67.0 MB/s >fsync time: 0.0515 >fsync time: 0.0345 >fsync time: 0.0722 >^C >[1]+ Done dd if=/dev/zero of=./bigfile bs=1M count=15000 > >ff@home-desktop:~$ vmstat 1 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 0 0 1235868 42744 329892 0 0 1120 36 267 866 7 4 68 22 > 0 0 0 1235860 42744 329892 0 0 0 0 193 463 0 1 99 0 > 0 0 0 1235860 42744 329892 0 0 0 64 279 633 1 0 99 0 > 1 0 0 1045624 42956 514680 0 0 0 1252 443 792 0 13 83 3 > 0 3 0 821604 43124 686576 0 0 0 355816 810 622 0 19 20 61 > 0 3 0 754580 43188 751656 0 0 0 65088 609 538 1 8 18 74 > 1 2 0 680256 43260 823852 0 0 0 67648 610 486 1 8 8 83 > 0 4 0 608228 43328 893888 0 0 0 74496 567 357 0 8 0 92 > 0 4 0 559444 43380 941264 0 0 0 48372 613 490 1 5 0 94 > 0 4 0 565140 43384 941268 0 0 0 12 435 464 0 0 0 100 > 0 5 0 578448 43384 941272 0 0 0 4 574 603 1 2 0 97 > 0 5 0 589836 43384 941272 0 0 0 0 507 469 0 1 0 99 > 0 1 0 603220 43388 941272 0 0 0 52 495 648 0 0 0 99 > 0 3 0 236396 43696 1253428 0 0 0 313596 912 502 2 26 26 > 45 > 0 4 0 242476 43700 1253432 0 0 0 1380 497 487 2 0 0 98 > 0 4 0 254124 43704 1253432 0 0 0 12 472 428 0 1 0 > 100 > 0 6 0 263184 43704 1253432 0 0 0 8 478 507 1 0 0 > 100 > 0 3 0 276748 43704 1253432 0 0 0 4 576 550 0 1 0 99 > 1 3 0 14600 42256 1501064 0 0 0 96932 816 697 6 23 10 61 > 0 4 0 20824 37932 1474172 0 0 0 213872 710 496 7 8 12 > 72 > 0 4 0 32524 37932 1474172 0 0 0 0 540 504 2 1 0 97 > 0 4 0 44192 37936 1474172 0 0 0 12 519 408 0 1 0 > 100 > 0 2 0 57712 37940 1474168 0 0 0 12 608 531 1 0 29 70 > 0 3 0 14572 19936 1491896 0 0 4 286460 714 1011 2 29 25 > 43 > 0 3 0 25932 19940 1491820 0 0 0 328 640 515 8 2 26 64 > 0 3 0 38744 19940 1491820 0 0 0 0 609 428 6 4 37 54 > 1 1 0 15000 20040 1526688 0 0 0 4 700 629 6 13 35 45 > 2 2 0 13512 20240 1493504 0 0 4 301892 782 750 6 24 4 > 67 > 1 3 0 14852 20304 1491700 0 0 0 52476 721 567 6 11 0 83 > 1 2 0 18284 20316 1492080 0 0 0 28780 566 451 5 3 5 87 > 2 2 0 27224 20316 1492080 0 0 0 0 496 495 0 1 46 52 > 1 2 0 35648 20316 1492080 0 0 0 0 677 381 0 1 25 74 > 0 1 0 42856 20328 1493104 0 0 8 1456 654 503 6 3 31 59 > 1 1 0 15036 20432 1536900 0 0 0 16 443 503 6 11 37 46 > 0 3 0 15760 20644 1490228 0 0 8 311972 751 904 8 23 7 > 63 > 0 3 0 30516 20648 1490232 0 0 0 12 501 426 0 1 34 65 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 3 0 42548 20648 1490232 0 0 0 0 632 528 6 4 23 68 > 1 1 0 50416 20664 1494928 0 0 8 4 604 478 7 3 2 88 > 0 3 0 14100 20860 1490812 0 0 24 325696 928 950 7 35 35 > 22 > 0 2 0 15028 
19232 1491500 0 0 48 53484 716 533 7 8 22 62 > 0 4 0 24768 19232 1491512 0 0 0 4 545 482 2 1 28 69 > 0 2 0 35968 19236 1491512 0 0 0 12 607 423 3 2 16 79 > 0 2 0 46564 19236 1491512 0 0 0 0 492 483 1 1 31 67 > 0 1 0 51728 19240 1491508 0 0 0 1468 666 410 1 1 31 68 > 0 1 0 61724 19240 1491512 0 0 0 0 533 478 5 4 38 53 > 0 2 0 15500 17552 1490608 0 0 16 362168 805 1000 1 36 34 > 30 > 0 4 0 14232 17588 1492032 0 0 0 78084 692 585 1 9 6 85 > 0 3 0 14324 17496 1491412 0 0 0 69932 694 499 6 9 0 84 > 0 3 0 14712 16420 1492288 0 0 0 85520 715 568 2 11 19 68 > 0 3 0 15200 16464 1491824 0 0 0 74048 615 468 0 11 16 74 > 0 3 0 21124 16472 1492776 0 0 0 8224 606 517 1 2 33 64 > 0 3 0 30116 16476 1492772 0 0 0 12 476 425 0 1 38 61 > 0 5 0 39592 16480 1492776 0 0 0 1760 468 525 0 1 49 49 > 0 3 0 55292 16480 1492776 0 0 0 0 588 552 1 1 29 69 > 1 1 0 14584 16508 1543964 0 0 0 16 558 553 7 9 29 56 > 0 3 0 14960 9296 1542656 0 0 16 121004 802 950 6 27 20 > 48 > 0 3 0 15168 9224 1499636 0 0 4 347852 832 699 1 17 4 > 77 > 0 3 0 13620 9180 1500896 0 0 0 69932 609 485 0 6 0 94 > 0 4 0 13716 8364 1501620 0 0 4 66352 612 544 1 9 0 90 > 1 4 0 14900 6416 1502408 0 0 4 89640 627 492 0 10 0 90 > 1 3 0 17748 6452 1502668 0 0 104 29000 656 531 1 4 23 72 > 0 4 0 15436 6480 1506132 0 0 972 1556 641 527 0 1 24 75 > 0 4 0 25444 6488 1506384 0 0 236 0 464 505 0 1 6 94 > 0 5 0 34200 6504 1507236 0 0 836 0 491 557 0 0 16 83 > 0 2 0 46676 6508 1507436 0 0 232 12 649 608 7 4 16 72 > 0 1 0 57780 6508 1507436 0 0 0 0 558 434 5 4 25 66 > 0 4 0 14308 6440 1501448 0 0 8 346148 956 1041 6 34 33 > 27 > 0 3 0 14916 6444 1501928 0 0 36 13740 434 463 7 4 0 89 > 0 3 0 24976 6468 1502376 0 0 496 0 581 503 2 1 29 68 > 0 3 0 39420 6468 1502400 0 0 0 8 519 413 0 1 27 72 > 0 4 0 49148 6492 1502900 0 0 604 0 530 543 1 0 0 98 > 0 3 0 14804 5096 1547228 0 0 2128 8 689 686 1 10 2 86 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 1 0 14040 5168 1553184 0 0 16 54552 848 908 7 16 40 37 > 0 2 0 14596 5292 1544912 0 0 4 119588 757 883 6 17 12 > 66 > 0 2 0 14616 4196 1504332 0 0 0 318128 911 798 5 17 21 > 56 > 0 5 0 14252 4248 1504768 0 0 124 41100 573 479 0 6 17 77 > 0 5 0 21548 4248 1504760 0 0 0 0 356 454 0 0 0 > 100 > 0 5 0 30616 4256 1508596 0 0 4 8 492 408 0 1 0 99 > 1 4 0 13568 4480 1518228 0 0 64 128736 740 773 1 20 4 > 75 > 0 2 0 19012 4504 1504444 0 0 16 108876 500 458 0 4 11 > 85 > 0 2 0 29868 4504 1504444 0 0 0 0 629 680 1 1 30 68 > 0 3 0 43568 4508 1504444 0 0 0 1468 506 482 0 0 24 76 > 0 2 0 56920 4508 1504444 0 0 0 0 567 667 1 1 13 85 > 0 2 0 41276 4544 1530452 0 0 4 20 451 476 0 2 42 56 > 0 3 0 13836 4888 1546144 0 0 116 114292 802 1039 2 31 35 > 32 > 0 2 0 23516 4896 1525112 0 0 44 154896 622 499 0 6 16 > 77 > 0 3 0 14904 4984 1546848 0 0 4 8196 680 774 0 9 8 82 > 0 4 0 14516 5040 1503056 0 0 132 342932 655 613 0 18 0 > 82 > 0 3 0 13804 5116 1503620 0 0 76 73128 610 557 1 8 8 83 > 0 3 0 16136 5136 1504412 0 0 4 53484 605 481 0 7 0 93 > 0 3 0 27920 5136 1504404 0 0 0 4 505 470 1 1 5 93 > 0 2 0 40740 5140 1504408 0 0 0 12 710 456 0 0 36 63 > 1 2 0 44156 5144 1504408 0 0 0 1400 508 473 1 1 34 64 > 0 1 0 57520 5144 1504408 0 0 0 0 560 404 0 0 34 66 > 0 1 0 14276 5268 1555548 0 0 4 28 473 597 1 8 50 40 > 0 2 0 14720 5464 1544052 0 0 8 145540 769 793 0 25 41 > 34 > 0 4 0 13724 5552 1502248 0 0 208 318396 714 964 1 13 21 > 64 > 1 6 0 15652 5596 1502480 0 0 92 45248 478 454 0 6 0 94 > 1 8 0 13848 5700 1503536 0 0 3588 61408 784 1299 1 9 0 90 > 0 6 0 
15004 5776 1502076 0 0 3012 57504 748 1322 0 8 0 92 > 0 6 0 13600 5840 1503888 0 0 380 78176 653 666 1 9 0 90 > 0 5 0 16248 5860 1502668 0 0 180 20556 821 557 0 3 0 97 > 0 7 0 19132 5868 1502708 0 0 56 1416 432 523 1 0 0 99 > 0 5 0 35600 5892 1504112 0 0 1360 4 511 614 0 1 0 99 > 1 5 0 45772 5892 1504196 0 0 100 8 512 547 1 1 0 98 > 0 3 0 55852 5900 1504464 0 0 316 0 562 604 0 1 0 99 > 1 2 0 14092 6252 1551468 0 0 3968 49244 783 1201 2 20 0 78 > 0 3 0 14336 6412 1505532 0 0 104 369176 788 713 0 22 34 > 45 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 5 0 14668 6476 1505676 0 0 132 57660 550 511 1 6 7 85 > 0 4 0 17624 6544 1508624 0 0 16 45260 588 483 0 6 0 93 > 0 4 0 13984 6668 1504972 0 0 64 115264 700 588 0 11 0 > 89 > 0 4 0 15700 6696 1505884 0 0 396 12868 477 508 0 2 0 98 > 0 3 0 28964 6700 1506988 0 0 0 1672 565 565 1 1 0 98 > 0 3 0 41988 6700 1506988 0 0 0 0 513 413 0 1 0 99 > 0 2 0 54796 6704 1506984 0 0 0 12 525 536 1 1 0 98 > 0 1 0 65068 6704 1506988 0 0 0 4 417 386 0 0 37 63 > 1 1 0 14544 7056 1547220 0 0 8 114364 785 1032 1 31 41 > 27 > 1 3 0 14984 7112 1546224 0 0 4 62020 633 990 0 6 10 84 > 0 4 0 15400 7180 1503764 0 0 4 322084 662 695 1 14 0 > 85 > 0 4 0 14676 7252 1505040 0 0 0 74048 542 511 0 8 0 92 > 0 4 0 14160 7328 1504748 0 0 4 74124 624 585 1 9 0 90 > 1 6 0 14044 7388 1505560 0 0 0 74020 619 511 0 8 0 92 > 1 3 0 14800 7452 1503940 0 0 4 58168 896 593 1 6 4 88 > 2 2 0 15276 7456 1503888 0 0 0 1464 386 432 0 1 0 98 > 1 3 0 25124 7456 1503892 0 0 0 0 547 505 1 0 0 99 > 0 3 0 38420 7456 1503892 0 0 0 0 490 410 0 0 0 > 100 > 0 3 0 51580 7456 1503892 0 0 0 0 533 498 1 0 0 98 > 0 1 0 65620 7456 1503892 0 0 0 12 482 564 0 0 36 64 > 0 2 0 14968 7680 1545992 0 0 52 103228 791 1004 1 32 37 > 29 > 0 3 0 13916 7708 1547864 0 0 0 36232 565 931 0 2 8 89 > 3 3 0 14048 7776 1504432 0 0 4 349496 640 762 1 16 9 > 74 > 0 4 0 14252 7860 1503760 0 0 4 78124 625 530 0 8 12 80 > 0 4 0 13900 7784 1504388 0 0 436 47364 554 581 1 6 1 92 > 0 4 0 15136 7740 1503292 0 0 196 87056 634 543 0 10 0 90 > 1 3 0 14756 7776 1503712 0 0 0 53468 626 516 2 6 0 92 > 0 3 0 20884 7780 1503724 0 0 0 12 726 446 0 1 0 99 > 0 5 0 24668 7788 1504748 0 0 52 1460 459 520 1 2 4 93 > 0 3 0 37628 7792 1504920 0 0 124 4 554 491 0 2 0 98 > 0 2 0 51356 7792 1504924 0 0 0 8 520 583 1 1 40 58 > 0 2 0 15188 7868 1556060 0 0 4 16 566 514 0 6 32 62 > 0 3 0 14740 7764 1547640 0 0 8 119488 762 941 1 23 47 > 28 > 0 1 0 19600 7776 1547712 0 0 4 27084 511 427 0 1 52 46 > 0 3 0 13664 7708 1505740 0 0 4 371988 717 828 2 23 18 > 58 > 0 3 0 13956 7656 1505784 0 0 4 61676 622 502 0 9 30 61 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 0 3 0 13952 7588 1506016 0 0 0 78208 600 589 1 10 26 63 > 0 3 0 15800 7628 1505272 0 0 4 51392 544 461 0 8 27 65 > 0 3 0 18264 7672 1504532 0 0 0 63748 745 536 4 7 10 80 > 1 3 0 19980 7684 1505532 0 0 0 1504 558 441 0 0 40 60 > 0 3 0 27952 7684 1505536 0 0 0 0 482 500 2 0 49 49 > 0 3 0 43392 7688 1505532 0 0 0 8 556 486 1 2 28 68 > 0 3 0 52820 7688 1505604 0 0 68 8 517 603 6 4 8 82 > 0 1 0 12496 7836 1558900 0 0 4 24 632 719 5 13 22 59 > 1 2 0 15284 7840 1547452 0 0 8 104228 767 898 6 18 20 > 56 > 1 2 0 21324 7900 1514460 0 0 0 249132 691 548 7 12 18 > 63 > 0 5 0 30912 7900 1514464 0 0 0 4 583 505 8 4 33 55 > 1 3 0 46064 7904 1515080 0 0 0 4 681 588 6 3 37 54 > 1 3 0 18740 8060 1552056 0 0 8 876 659 746 6 17 6 71 > 0 3 0 14936 
7920 1503820 0 0 0 356160 956 624 5 21 22 > 52 > 0 2 0 25384 7936 1504544 0 0 4 1356 583 540 7 4 13 76 > 1 3 0 38272 7940 1504544 0 0 0 12 604 464 7 4 32 57 > 0 3 0 44132 7940 1504544 0 0 0 0 540 551 8 4 35 53 > 2 1 0 15056 8028 1547112 0 0 0 12 739 691 6 11 31 51 > 0 4 0 14924 8208 1538096 0 0 8 120368 687 900 8 24 16 > 52 > 0 3 0 16108 8228 1545492 0 0 4 13968 710 766 7 9 8 76 > 1 3 0 14888 8248 1504728 0 0 0 338572 708 670 7 16 25 > 52 > 1 3 0 15092 8288 1504772 0 0 4 41132 619 469 6 9 0 84 > 0 4 0 13720 8372 1506108 0 0 0 98724 718 577 7 13 27 53 > 0 2 0 15300 8412 1505056 0 0 4 32908 630 470 6 7 14 73 > 0 2 0 23212 8412 1505064 0 0 0 0 554 526 7 3 36 54 > 0 2 0 29264 8416 1505064 0 0 0 1364 801 430 7 4 39 50 > 0 2 0 40780 8416 1505064 0 0 0 0 521 478 3 2 18 77 > 2 1 0 53716 8420 1505064 0 0 0 12 437 405 0 0 51 49 > 2 0 0 13800 8500 1557188 0 0 0 36 456 623 2 8 49 40 > 1 2 0 15452 8712 1545408 0 0 8 125840 930 1130 7 26 37 > 30 > 1 1 0 14220 8788 1503544 0 0 4 339620 775 788 7 19 25 > 49 > 0 3 0 15200 8752 1502960 0 0 4 45216 621 477 6 10 1 83 > 0 3 0 15364 8816 1502940 0 0 0 90496 728 600 8 12 0 80 > 0 3 0 17116 8868 1505288 0 0 4 32908 641 454 5 10 0 85 > 0 4 0 18532 8836 1502924 0 0 0 90564 734 577 7 12 0 80 > 0 2 0 24432 8844 1502932 0 0 0 1440 724 466 6 4 21 69 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff cache si so bi bo in cs us sy id wa > 1 2 0 32152 8844 1502932 0 0 0 0 484 497 7 3 36 54 > 1 1 0 45520 8844 1502932 0 0 0 0 594 401 7 2 37 54 > 0 1 0 58376 8844 1502932 0 0 0 0 669 498 6 6 27 61 > 1 1 0 14164 8908 1556064 0 0 4 16 531 552 7 14 37 41 > 0 2 0 15200 8868 1502456 0 0 12 394016 824 1047 2 26 14 > 57 > 0 3 0 17988 8920 1505192 0 0 4 32948 556 477 2 6 0 92 > 1 2 0 15244 9008 1502740 0 0 0 109784 739 597 7 14 0 > 78 > 1 2 0 14120 9064 1504488 0 0 4 69952 660 487 7 11 0 82 > 0 3 0 15544 9136 1502676 0 0 0 78112 795 584 9 10 0 81 > 0 2 0 14772 9164 1503996 0 0 4 20524 769 449 6 6 4 84 > 0 2 0 23116 9168 1505024 0 0 0 1456 472 500 8 3 21 67 > 0 3 0 35232 9168 1505024 0 0 0 4 598 413 6 5 11 78 > 0 2 0 47668 9172 1505024 0 0 0 12 506 558 2 0 14 83 > 0 2 0 56336 9180 1505016 0 0 104 0 548 426 0 1 37 62 > 1 1 0 14308 9104 1550864 0 0 4 45552 585 752 4 22 42 32 > 0 4 0 14908 9000 1546296 0 0 8 81468 787 747 6 15 8 70 > 2 1 0 13672 8984 1503684 0 0 4 356028 843 854 8 19 16 > 57 > 1 3 0 15088 9044 1506536 0 0 4 45248 707 506 7 10 15 68 > 0 4 0 15088 8956 1502944 0 0 0 98656 687 581 4 11 0 85 > 0 4 0 13996 9016 1503984 0 0 4 78144 631 477 0 10 0 89 > 1 4 0 17544 9072 1502820 0 0 4 49852 599 569 1 5 0 93 > 0 3 0 17292 9052 1505400 0 0 0 1468 730 476 0 3 1 97 > 0 4 0 30052 9052 1505404 0 0 0 36 566 506 6 4 30 60 > 0 3 0 43416 9056 1505404 0 0 0 12 585 449 4 2 12 83 > 0 2 0 53728 9060 1505408 0 0 60 0 534 604 1 2 6 91 > 2 1 0 13792 9204 1556156 0 0 4 16 556 618 3 12 47 39 > 1 0 0 15192 9128 1547256 0 0 4 119120 774 796 6 18 38 > 38 > 1 2 0 14468 9124 1503200 0 0 4 355892 930 949 5 21 30 > 44 > 0 4 0 16568 9168 1506296 0 0 4 24644 602 532 8 8 7 77 > 0 3 0 14576 9176 1503172 0 0 0 119060 728 587 6 16 18 > 61 > 1 2 0 14352 9120 1508740 0 0 4 20576 612 558 8 7 19 65 > 0 3 0 15532 9136 1501896 0 0 0 128572 813 532 7 15 0 > 78 > 0 2 0 15140 9152 1504820 0 0 4 1448 768 547 7 5 5 83 > 1 3 0 25064 9156 1504856 0 0 0 12 586 437 6 4 7 82 > 1 5 0 38540 9156 1504856 0 0 0 4 642 611 9 3 26 62 > 0 3 0 52500 9156 1504856 0 0 0 0 624 481 5 5 37 53 >procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- > r b swpd free buff 
cache si so bi bo in cs us sy id wa
> 2 1 0 14500 9232 1555248 0 0 0 16 682 704 7 10 28 55
> 1 2 0 14644 9376 1546224 0 0 8 111736 724 810 6 26 40 28
> 0 2 0 13508 9308 1503808 0 0 4 341320 782 948 8 18 8 67
> 1 2 0 13724 9328 1504316 0 0 0 61708 662 620 7 10 15 67
> 1 3 0 14292 9376 1503728 0 0 4 61764 667 673 7 10 0 83
> 0 4 0 16960 9312 1504224 0 0 4 61700 661 587 7 12 0 81
> 0 3 0 14488 9400 1502776 0 0 0 98792 723 605 7 10 0 83
> 0 3 0 14544 9408 1502508 0 0 0 26160 794 573 6 7 34 53
> 0 3 0 26252 9412 1502512 0 0 0 1420 650 546 7 4 2 87
> 0 3 0 39100 9416 1502512 0 0 0 12 592 432 6 4 0 89
> 0 4 0 49712 9416 1502512 0 0 0 4 499 493 2 2 8 88
> 0 1 0 63372 9420 1502512 0 0 0 12 582 593 0 0 14 85
> 2 0 0 13900 9588 1555884 0 0 1632 56 531 762 2 14 27 56
> 1 2 0 14012 9768 1505492 0 0 8 391688 790 846 0 26 23 52
> 0 3 0 14904 9812 1502068 0 0 0 98880 695 574 1 11 21 66
> 0 3 0 13512 9812 1503280 0 0 4 74060 642 496 0 10 29 61
> 0 4 0 14024 9844 1502960 0 0 0 73940 658 598 1 10 32 57
> 0 3 0 18676 9872 1501804 0 0 4 35736 591 437 0 5 15 80
> 0 4 0 27660 9884 1501824 0 0 8 12 541 284 2 1 2 95
> 0 3 0 39176 9888 1501912 0 0 92 1720 448 274 0 1 10 89
> 0 3 0 51596 9888 1501912 0 0 0 0 545 593 1 1 36 61
> 0 1 0 66220 9892 1501912 0 0 0 16 570 555 0 0 48 51
> 0 0 0 15144 9940 1556124 0 0 2240 176 506 753 4 10 37 49
> 0 0 0 15192 9948 1556176 0 0 0 1052 283 483 0 0 98 1
> 0 0 0 15192 9956 1556176 0 0 0 1052 316 769 2 0 97 2
> 0 1 0 14348 9988 1556772 0 0 2996 1052 408 718 0 0 77 22
> 0 0 0 14252 10012 1558320 0 0 1588 0 404 803 2 6 78 13
> 0 0 0 14680 10036 1557968 0 0 1840 28 400 1197 0 3 88 8
> 0 0 0 14740 10080 1557812 0 0 2168 0 406 992 2 2 85 12
>^C
I read Russian forums about this problem. I have to go now and cannot write more right now, but some people there tried to change the scheduler from cfq to another one and the iowait bug stayed. If you think the bug may be in the scheduler, should we try changing the scheduler through /proc? I'm annoyed by the same bug (I suspect). And I'm able to reproduce it with both the anticipatory and cfq schedulers. Therefore, is this bug really linked to cfq? I'm running my kernel with: elevator=as @Jens Axboe: I tried your patch in comment 366 on the 2.6.30 kernel, and it did improve responsiveness in my initial testing. I used to have the problem that the kernel became highly unresponsive on large file copies to the same partition or as soon as it tried to use swap (in 2.6.30-rc3 and earlier), but the unpatched 2.6.30 performs quite reasonably and the patch improved responsiveness further (my unscientific test results are that moving the mouse resulted in much less 'stuttering' after the patch - note that with earlier kernels the mouse would just freeze). I did though just find a problem where an overnight memory leak caused X to become so unresponsive it couldn't even draw the screen background until I killed the culprit (firefox). This might be unrelated to the patch, ie a problem with swap management, but it does show that the kernel can still become bogged down under high disk I/O. Did anybody here resolve this bug? The only workaround I see is installing FreeBSD instead of a Linux kernel version >= 2.6.18. I think that I have coped with this bug! I changed some kernel options, and my server has been working stably with no frozen timeouts under high iowait for 10-12 hours already! Detailed info: my kernel now is 2.6.22.14-72.fc6, Fedora Core 6. This suggestion does not resolve the bug (I think the bug is in the kernel and it remains), but it is a workaround.
I have read many topics and forums and settled on these commands:
# echo 50 > /proc/sys/vm/vfs_cache_pressure
# echo deadline > /sys/block/DEVICE/queue/scheduler
# echo 1 > /sys/block/DEVICE/device/queue_depth
# echo 1024 > /sys/block/DEVICE/queue/nr_requests
The DEVICE is 'hda' or 'sda', depending on the HDD. I didn't test queue_depth, because for my HDDs (SAS SCSI + RAID10) this file is read-only (no NCQ support there, I think), but maybe that command will help you; I don't know. I suggest anybody who has frozen timeouts with high iowait try this tuning. I am very glad! Please try this workaround. I didn't test with the 'dd' command, but heavy HDD work used to freeze the server, and now I don't see that. Can you try the three settings separately, to see which one makes the large difference? I will try, but this is my production server under heavy load; I am afraid to touch anything there now :-/ But soon I will try to determine the main option of this tuning. More than 24 hours have passed and I have no troubles with freezes there. I cannot believe it... Here is a test for the same server as in my post #359, but after the tuning from post #385:
# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
And during 'dd' I ran vmstat 1:
 0 2 116 103632 507240 2016112 0 0 1324 16 1024 963 1 1 50 48 0
 1 2 116 101512 507484 2015736 0 0 1436 0 1314 1253 21 5 25 48 0
 0 2 116 103632 507240 2016112 0 0 1324 16 1024 963 1 1 50 48 0
 0 7 116 25208 496944 2105464 0 0 4 26272 2892 239 0 4 23 73 0
 0 9 116 21636 496972 2109568 0 0 32 21904 2150 339 0 2 8 90 0
 0 10 116 39888 481904 2105552 0 0 4 23544 1964 368 0 4 1 96 0
 0 9 116 49036 472984 2105016 0 0 8 18252 1730 728 0 3 0 97 0
 0 7 116 61700 459736 2105412 0 0 16 74176 2167 317 0 5 13 82 0
 0 7 116 71416 450576 2104272 0 0 24 8680 1322 237 0 4 16 80 0
 1 5 116 82772 439000 2106280 0 0 24 58616 1457 3332 0 7 5 88 0
 1 5 116 97224 424752 2105804 0 0 20 60164 848 286 0 6 24 70 0
 0 7 116 110700 409384 2107036 0 0 56 105584 884 397 0 9 15 76 0
 2 5 116 116444 392304 2118776 0 0 288 95624 1096 424 1 11 10 78 0
As you can see, there is no longer constant 90-99% iowait, only sometimes... Here are some tests. First I restore the default settings from before the tuning:
# echo 100 > /proc/sys/vm/vfs_cache_pressure
# echo cfq > /sys/block/sda/queue/scheduler
# echo 128 > /sys/block/sda/queue/nr_requests
# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
^C
116+0 records in
116+0 records out
121634816 bytes (122 MB) copied, 20.5609 seconds, 5.9 MB/s
^^^^^^^^^^^^^^^^^^^^^^^^^ (!!!)
While 'dd' was running I ran vmstat 1:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 0 10 116 760168 503488 1329836 0 0 4 4 0 0 9 3 65 23 0
 0 11 116 756132 502488 1330536 0 0 1332 5648 1744 4909 5 2 2 91 0
 0 12 116 760208 503128 1330856 0 0 1136 4388 1875 3053 4 2 0 94 0
 0 11 116 759832 502668 1331608 0 0 1004 7488 2379 4032 1 2 0 97 0
 0 12 116 758740 503288 1331832 0 0 1280 3252 1818 2402 1 1 0 98 0
 0 10 116 733976 502936 1356780 0 0 1232 4476 1753 4143 1 3 0 96 0
 1 8 116 733596 502368 1357324 0 0 804 5792 1831 2980 20 2 0 79 0
 1 7 116 738388 502920 1357788 0 0 928 6652 1875 2349 17 2 4 77 0
**************************
Now after this I do:
# echo 50 > /proc/sys/vm/vfs_cache_pressure
# echo deadline > /sys/block/sda/queue/scheduler
# echo 1024 > /sys/block/sda/queue/nr_requests
# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
^C
638+0 records in
638+0 records out
668991488 bytes (669 MB) copied, 10.463 seconds, 63.9 MB/s
^^^^^^^^^^^^^^^^ (!!! :-))) )
During 'dd' I ran in another terminal:
# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 1 7 116 718764 502884 1371484 0 0 4 4 0 0 9 3 65 23 0
 0 9 116 687208 502924 1405624 0 0 8 26664 2708 746 6 4 3 87 0
 1 8 116 668924 502976 1422116 0 0 16 21404 2246 8462 1 4 9 87 0
 0 8 116 654804 501632 1434492 0 0 24 30804 2072 9249 10 4 0 86 0
 0 10 116 613152 501692 1475220 0 0 20 42880 2021 4408 15 5 7 73 0
 2 10 116 559860 499464 1524600 0 0 32 58504 2108 10612 5 6 15 74 0
 0 11 116 510132 499528 1578340 0 0 36 59400 984 1748 17 5 2 77 0
 0 10 116 399420 499672 1689316 0 0 108 111332 910 957 4 11 2 84 0
 1 7 116 331556 499756 1750580 0 0 104 62268 1501 5255 11 6 10 74 0
*********************
And a note: I have other servers with different hardware. I cannot reproduce this iowait problem there, with or without this tuning (they run Fedora release 7 (Moonshine), kernel 2.6.23.17-88.fc7). So now I think this trouble does not affect all HDDs; maybe it is hardware-dependent. I am now researching which option resolves the iowait problem. I determined the main option. Only this one helped me: # echo deadline > /sys/block/sda/queue/scheduler I don't understand why. I have read in many Russian topics that changing the scheduler doesn't help, so I didn't think that changing only the scheduler would help me. But I changed just the scheduler from cfq to deadline, and the 'dd' test now gives this:
# dd if=/dev/zero of=testfile.1gb bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 13.7121 seconds, 76.5 MB/s
iowait was sometimes only 80-90%. Here are my current settings:
# cat /proc/sys/vm/vfs_cache_pressure
100
# cat /sys/block/sda/queue/scheduler
noop anticipatory [deadline] cfq
# cat /sys/block/sda/queue/nr_requests
128
Now I will keep these settings and watch whether there are freezes or not. I made some experiments, and I think I found the main reason for high iowait with the cfq scheduler. I ran some tests: I switched the scheduler between cfq and deadline on my two servers with the same hardware & OS (FC6, kernel 2.6.22.14-72.fc6): same CPUs, motherboards, SAS & RAID controllers & HDDs. But I saw high iowait with the cfq scheduler during the 'dd' command on only one of the servers. I think the main reason is A LARGE NUMBER OF USED INODES ON A PARTITION of the HDD.
For example, here is the 'OK' server, where I could not reproduce the bug:
# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 524288 8543 515745 2% /
tmpfs 219756 1 219755 1% /dev/shm
/dev/sda6 787200 34068 753132 5% /usr
/dev/sda5 787200 25582 761618 4% /usr/local
/dev/sda7 524288 1993 522295 1% /var
/dev/sda8 30900224 1719787 29180437 6% /wwws
/dev/sda3 1048576 49655 998921 5% /wwws/accel-proxy
I wrote the testfile.1gb test file to the /wwws partition. There was no high iowait with either the deadline or the cfq scheduler. The second, 'BAD' server has the same hardware & software, but its df -i shows:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 524288 7444 516844 2% /
tmpfs 219756 1 219755 1% /dev/shm
/dev/sda7 787200 35307 751893 5% /usr
/dev/sda6 787200 27520 759680 4% /usr/local
/dev/sda8 524288 2334 521954 1% /var
/dev/sda3 30900224 5332794 25567430 18% /wwws
/dev/sda5 524288 4128 520160 1% /wwws/accel-proxy
I did the 'dd' tests to the /wwws partition there too (I am used to writing big files there)... There, if I use the cfq scheduler and (important) have some working processes (apaches, mysql; not an idle server), then during the 'dd' command I get very high iowait (90-99%) and a very low write speed (9-10 MB/sec). If I switch to the deadline scheduler and write to the same /wwws partition, I get 60-80 MB/sec and no high iowait. But if I write testfile.1gb to another partition (for example /var), I get no high iowait even with the cfq scheduler. Thus the cfq scheduler plus a lot of used inodes is bad, I think; the deadline scheduler plus a lot of used inodes is not. So I think that a high number of used inodes on a partition and the cfq scheduler together go wrong somehow. Maybe if my other servers (FC7, where I could not reproduce the iowait problem) had as many used inodes, I could reproduce this high-iowait bug there too. Please try creating many, many small files on some partition (5-6 million, for example) and test 'dd' with the cfq scheduler. It would be ideal if you could try 2.6.30 on the problematic server. I realize that this may not be easy, however there's not much I can do about a problem on an ancient kernel. If you do try 2.6.30 and it also has the same problem, then I want you to capture some blktrace data of both deadline and cfq. Basically, right after you start the dd test, in another terminal do: # cd /dev/shm; blktrace /dev/sda and ctrl-c that blktrace after ~5 seconds or so. Then stop the dd as well. Save the blktrace files on the harddrive. Now switch to deadline and repeat the exact same thing. Then tar up the two sets of files and attach them to this bug report. Jens Axboe, I am happy to help but I cannot try 2.6.30 :( I never install kernels, and I am afraid that something may go wrong after installing a kernel and I will not be able to access the server. This server is under heavy load and is located on another continent. I cannot take the risk, sorry ;-( Maybe somebody else will try to make many small files on an HDD (many inodes, ~5-6 million for example) and compare the cfq & deadline schedulers? Created attachment 22019 [details]
test results 2.6.30: cfq, deadline
There is no improvement in responsiveness in any variant; well, maybe a tiny bit.
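Referring back to the many-used-inodes reproduction suggested a few comments up: a hypothetical helper along these lines (not something posted in this bug; count, size and naming are arbitrary) would populate a partition with tiny files before re-running the dd comparison under cfq and deadline.

/* Hypothetical helper for the many-inodes test: create N small files
 * in the current directory, then repeat the dd test with cfq vs
 * deadline. Defaults to 5,000,000 files of 512 bytes each. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        long n = (argc > 1) ? atol(argv[1]) : 5000000L;
        char name[64], buf[512];
        long i;
        int fd;

        memset(buf, 0, sizeof(buf));
        for (i = 0; i < n; i++) {
                snprintf(name, sizeof(name), "inode-%09ld", i);
                fd = open(name, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                if (fd < 0) {
                        perror(name);
                        return 1;
                }
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
                        perror("write");
                        return 1;
                }
                close(fd);
        }
        printf("created %ld files of %zu bytes\n", n, sizeof(buf));
        return 0;
}

On ext3, millions of entries in a single directory get slow by themselves; spreading the files across subdirectories is a sensible variation.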
Hi, I am using the 2.6.30 kernel with the patch from #366. Before the patch I got into real trouble when downloading large files via torrent at high speed (over 5 MB/sec). Now it just works great. Thanks for this patch. Created attachment 22167 [details]
test result 2.6.30 without AHCI
I turned off AHCI in the BIOS on the laptop. The system has become much more responsive. It is now possible to start new applications while the dd is running.
Created attachment 22180 [details]
Drain async IO on the hw side
This patch makes sure that async IO has completely drained from the device queue before starting sync IO. Hopefully that should make things as good as disabling NCQ, and it should even improve the situation without NCQ.
I'd like for people to test this patch and see if it makes a difference. It's against 2.6.31-rc (ish), but I _think_ it will apply against 2.6.30 as well. If not, holler, and I'll do a backport too.
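As a rough illustration of what "draining async IO on the hw side" means, here is a self-contained sketch. The counter names are hypothetical, not taken from the attached patch; it only shows the gating idea.

/* Illustration only, with hypothetical counter names: before starting
 * sync IO, wait until the async requests already dispatched to the
 * device (e.g. sitting in its NCQ queue) have completed. */
struct hw_queue_sketch {
        int in_driver_sync;     /* sync requests the device holds */
        int in_driver_async;    /* async requests the device holds */
};

static int may_start_sync(const struct hw_queue_sketch *q)
{
        /*
         * Async writes buffered inside the drive can starve a sync
         * read for a long time; refuse to dispatch sync IO until the
         * device-side async backlog is empty.
         */
        if (q->in_driver_async > 0)
                return 0;       /* async still in flight, keep draining */
        return 1;
}

The reasoning matches the NCQ observations earlier in the thread: once requests are inside the drive, the IO scheduler can no longer reorder them, so the only lever left is to stop feeding the device async work before a sync request goes in.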
Created attachment 22184 [details]
test result 2.6.30 with patch from #397
(2.6.30 + NCQ + patch from #397) == (2.6.30 + NCQ). New applications start very slowly.
Toshiba notebooks came my way in the past, and I saw how Linux worked on them back then. With ACPI switched off it was possible to listen to music (but, of course, not to see how much charge was left in the battery); with ACPI switched on there was no sound at all, though you could see what was going on with the battery and discover that it had been stolen while we were enjoying the music. One could blame the scheduler for that too (in our case cfq) and investigate why it cannot schedule two processes at once: playing music and checking the battery state. In our case it turned out that all the schedulers were broken (because I tried them all, and none worked). Probability theory does not rule out such an event. But when all the schedulers break for one person while they all work for another, that is already the influence of supernatural forces, and fighting those is useless. So why does everything work for one person while the system barely crawls for another? What is the difference? Only the computers (or, more precisely, their exact configurations). I may be mistaken, but perhaps someone can explain why on one set of hardware everything simply flies, and on another it barely crawls (never mind that the second has a faster processor, faster disks, and a faster bus). Had big troubles on an ASUS PN5e motherboard and a WD 320 GB drive. Compiled a 2.6.31-rc3 with your patch and it works great. Thank you very much! I'd like to backport it to 2.6.29 to try it together with the realtime patch. Is there a chance to get it working? 2.6.31-rc3-git3 + NCQ + patch from #397: new applications start very slowly. Without NCQ, new applications start quickly. There is one more interesting question.
KSYSGUARD shows "Used Memory" = 0.66Gb.
> top
top - 21:43:57 up 7:00, 3 users, load average: 0.74, 0.39, 0.29
Tasks: 149 total, 3 running, 146 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.8%us, 1.3%sy, 0.0%ni, 93.1%id, 2.5%wa, 0.2%hi, 0.2%si, 0.0%st
Mem: 8035628k total, 7998716k used, 36912k free, 0k buffers
Swap: 2104472k total, 6564k used, 2097908k free, 7402836k cached
When the Mem:used value approaches the Mem:total value, the graphical interface works much more slowly (and without any disk operations involved).
Am I the only one who has this problem?
I applied the patch in 397 to a vanilla 2.6.30.4 and the difference was dramatic (with the patch it is _much_ better, ie the complete freezes for 15+ seconds when running multiple IO-intensive jobs are gone). I'll work on getting some hard numbers (with iobench, etc) to see if they agree. (In reply to comment #397) > Created an attachment (id=22180) [details] > Drain async IO on the hw side > > This patch makes sure that async IO has completely drained from the device > queue > before starting sync IO. Hopefully that should make things as good as > disabling > NCQ, and it should even improve the situation without NCQ. > > I'd like for people to test this patch and see if it makes a difference. It's > against 2.6.31-rc (ish), but I _think_ it will apply against 2.6.30 as well. > If > not, holler, and I'll do a backport too. Is this in the vanilla 2.6.31-rc5 already? No, the patch is queued up for 2.6.32 since it was a rather risky change for 2.6.31. But I'm glad it makes a difference, that means that the starvation experienced is largely on the device side. By draining the queue, we prevent that from happening (or, at least, we lessen the effect dramatically). 2.6.31-rc7 + patch in 397 - there are no improvements. No improvements seen here with 2.6.30.5 and the patch, either. Pretty much *any* write to swap causes major latency (disruption to audio, graphics etc.). There is an improvement in desktop responsiveness with kernel 2.6.31 and the as scheduler compared to the cfq scheduler. It does not solve the problem, but it makes it more bearable. I am using a fully encrypted LVM drive with ext3 partitions, mounted with noatime and data=ordered. I've observed something that might be relevant to this bug (using the 2.6.31.5 kernel): when I do large I/O operations from one external device (say /dev/sdb) to another slow USB flash key (say /dev/sdc), I can hear my *internal* hard drive (/dev/sda) thrashing away constantly even though its light indicates that no read/write activity is going on. During this time anything that requires access to /dev/sda is slowed right down, and hence running new programs slows down disk access. When I start copying, eg using nautilus, there is usually a 400 MB buffering delay before writing starts to the USB drive (ie before its light starts flashing). During this time, there is NO /dev/sda thrashing. /dev/sda starts thrashing as soon as the USB key light starts flashing. So there appears to be a bug that makes /dev/sda constantly seek during the /dev/sdc USB write operation, and this is affecting system responsiveness. Please try 2.6.32-rc5. Make sure you are using CFQ as your io scheduler. I opened http://bugzilla.kernel.org/show_bug.cgi?id=14491 to track this bug separately - I've put comments in there about 2.6.32-rc5, which I don't think exhibits the problem. Created attachment 23618 [details]
Simple sleeper test case
As this bug occurs more persistently while working in a virtual machine or while using Java, and I still think that this is a process scheduler bug (or something related), here is another test case which shows the suspected behaviour. As there are many system calls while using a virtual machine, I have tried to find an equivalent test. The test case just sleeps for 1µs and measures how long the usleep operation actually takes. I use that many usleep operations because the problem does not occur deterministically, and I tried to catch as many occurrences as possible.
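The attachment itself is not reproduced inline; a minimal sketch of the same idea follows (an assumed shape, not the attached test case): time every usleep(1) with CLOCK_MONOTONIC and report the outliers, matching the "Timediff <iteration>: <ms>" lines below. The Total column in the output, presumably a running sum over a burst, is omitted here, and the 15ms reporting threshold is my own choice.

/* Sketch of the simple sleeper test case (assumed shape, not the
 * attachment): measure how long each usleep(1) really takes and
 * report the outliers. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double ms_between(const struct timespec *a, const struct timespec *b)
{
        return (b->tv_sec - a->tv_sec) * 1e3 +
               (b->tv_nsec - a->tv_nsec) / 1e6;
}

int main(void)
{
        struct timespec before, after;
        unsigned long iter = 0;
        double diff;

        for (;;) {
                clock_gettime(CLOCK_MONOTONIC, &before);
                usleep(1);      /* should return after a few microseconds */
                clock_gettime(CLOCK_MONOTONIC, &after);
                iter++;
                diff = ms_between(&before, &after);
                if (diff > 15.0)        /* arbitrary reporting threshold */
                        printf("Timediff %lu: %.2fms\n", iter, diff);
        }
}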
I have run this test case on three machines. The first one was a Core2 Duo with a first-generation SSD (OCZ Core Series) with poor write performance, on an Ubuntu 2.6.31-14-generic kernel. The partitions are block-aligned. I ran this test while my wife was using firefox. Every time she submitted something and firefox wrote its history via sqlite, there was a high latency in the sleep test.
Timediff 7629094: 16.80ms Total: 61.12ms
Timediff 7629100: 18.82ms Total: 93.68ms
Timediff 7629101: 19.96ms Total: 113.54ms
Timediff 7629102: 19.98ms Total: 133.43ms
Timediff 7629103: 19.97ms Total: 153.31ms
Timediff 7629104: 20.00ms Total: 173.24ms
Timediff 7629105: 19.96ms Total: 193.09ms
Timediff 7629106: 20.02ms Total: 213.02ms
Timediff 7629107: 19.94ms Total: 232.86ms
Timediff 7636162: 16.40ms Total: 34.44ms
Timediff 7636164: 19.90ms Total: 64.00ms
While the duration of 100 usleep(1) calls should be somewhere between 10ms and 20ms, here 10 usleep(1) calls take more than 200ms. This behaviour is reproducible.
On my machine, a Core2 Duo with a normal 2.5" hard drive and a vanilla kernel 2.6.31.5, the behaviour is similar. While making a backup from one hard drive to another, the latency jumps to >30ms for a single usleep(1) nearly every second, and there are some latencies greater than 150ms for a single usleep(1).
Timediff 11054523: 38.23ms Total: 53.19ms
Timediff 11212737: 21.64ms Total: 31.46ms
Timediff 11213557: 35.59ms Total: 44.62ms
Timediff 11213939: 59.88ms Total: 65.76ms
Timediff 11264190: 40.83ms Total: 49.72ms
Timediff 11264709: 53.77ms Total: 63.09ms
Timediff 11265629: 145.74ms Total: 155.96ms
Timediff 11327458: 16.94ms Total: 25.23ms
Timediff 11376430: 18.91ms Total: 27.67ms
Timediff 11408941: 17.67ms Total: 26.36ms
Timediff 11424964: 19.26ms Total: 28.01ms
Timediff 11509722: 19.84ms Total: 28.30ms
Timediff 11627259: 27.01ms Total: 34.51ms
Timediff 11645718: 18.26ms Total: 29.80ms
On my server, an Athlon X2 with a fully encrypted RAID-5 with LVM on a 2.6.18-128.2.1.el5.028stab064.7 kernel (CentOS with OpenVZ), the behaviour was even worse. While copying a 4GB ISO, there are latencies of up to one second for a single usleep(1).
Timediff 40397: 24.16ms Total: 122.93ms
Timediff 40417: 859.04ms Total: 981.78ms
Total 40417: 981.78ms
Timediff 45928: 22.62ms Total: 220.16ms
Timediff 50471: 25.02ms Total: 135.80ms
Timediff 51085: 19.23ms Total: 163.03ms
Timediff 51097: 205.12ms Total: 360.66ms
Timediff 51160: 47.47ms Total: 422.81ms
Total 51160: 422.81ms
Timediff 51662: 21.93ms Total: 279.08ms
Timediff 51663: 40.87ms Total: 318.58ms
Total 52068: 401.49ms
Timediff 54540: 16.69ms Total: 150.93ms
Timediff 63056: 78.07ms Total: 203.86ms
Timediff 65673: 16.43ms Total: 228.44ms
Timediff 65675: 24.04ms Total: 265.11ms
On all three machines, the latencies were small without any fsync or copy operation. On the Core2 Duo machines with kernel 2.6.31, the latencies stay below 0.2ms and 0.1ms respectively, even while watching a movie or using 100% of the CPU. On the Athlon X2 with kernel 2.6.18, the latencies are always below 1ms.
A 200ms latency while moving the mouse is noticeable. A delay of one second while moving the mouse would explain the freezes which many of us notice during copy operations.
Why is the kernel delaying the resume of the usleep(1) operation by up to one second during a copy operation? Please have a look at this behaviour.
I also had problems with system latency under high I/O usage. After applying the patch from #397 to kernel 2.6.31.5, the problem became much smaller. Before patching, the machine would sometimes freeze for more than 5 minutes. Now the maximum latency is less than half a second.

I have the same issue on a machine with an i845e chipset, P4-1.5 Northwood, 2GB DDR RAM, GF6800 video and an Audigy2 sound card. My main HDD is a 160GB IDE Seagate. When there is disk activity the system becomes virtually unusable. For example, when I am burning a DVD on the drive attached to the SiI 3512 SATA controller, the CPU load goes from 40% at 7-8x to 98% at 16x. Downloading the Fedora 12 ISO last night at 500 kb/s kept the system busy at 90%! If I start a kernel compile, the CPU load is a stable 100%, which is okay, but switching tabs in Firefox takes 10 seconds and starting any application like JuK, Dolphin or Konsole takes up to 1 minute. Running Fedora 11 with the 2.6.30.9.96 FC11 i686 PAE kernel. The system has become a bit more responsive (by about 10-20%) since I noticed p4-clockmod was being loaded and shut it down.

There are no enthusiastic comments after the 2.6.32 release. I read that as the Russian saying goes: "the cart is still there" - no progress.

Created attachment 25281 [details]
perf chart high io latency
I am using the 2.6.33 kernel and this problem is still present. When I copy a big file (a few GB) the system becomes unresponsive. I ran perf timechart and generated an SVG image. You can see that plasma-desktop (part of KDE) is blocked by IO for a long time. I copied the file from an NTFS partition, but it also happens when I copy big files within my Linux partition or from the hard drive to a pendrive.
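For anyone wanting to reproduce such a chart, this is roughly the procedure (the paths are placeholders; it assumes a perf built with timechart support, available since around 2.6.31):

perf timechart record cp /mnt/ntfs/bigfile /home/user/bigfile   # trace the system during the copy
perf timechart                                                  # render the recorded trace to output.svg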
(In reply to comment #416)
> I am using 2.6.33 kernel and this problem is still present.

Yep, this definitely earns the Most Embarrassing Linux Bug Award 2009, and it is a nominee for Most Annoying Linux Bug 2009, although the ATI binary driver wins in that category. Call me unfair for allowing binary blobs.

I will agree that something still isn't right with the VM. In my uninformed opinion, it does seem to be far too eager to swap out executable pages in favor of streaming pages. Unfortunately, it seems that very few people know the VM well enough to fix it.

I am currently using linux kernel 2.6.33 and the desktop responsiveness is awful on my machine compared to the 2.6.32.x kernels. It's even worse than I have ever seen before. The load average rises to >7 very quickly while writing many small files to the filesystem. I can run some tests with my configuration, but a kernel developer should tell me which tests.

(In reply to comment #419)
> I am currently using linux kernel 2.6.33 and the desktop responsiveness is
> awful on my machine compared to the 2.6.32.x kernels. It's even worse than I
> have ever seen before. The load average rises to >7 very quickly while
> writing many small files to the filesystem. I can run some tests with my
> configuration, but a kernel developer should tell me which tests.

This isn't really the best place to bring this up. Please send a full description to linux-kernel@vger.kernel.org. Cc myself, Ingo Molnar <mingo@elte.hu>, Peter Zijlstra <a.p.zijlstra@chello.nl>, Jens Axboe <jens.axboe@oracle.com>. In that email, please identify what the system is doing at the time. Is it disk-related? CPU scheduler related? etc. Thanks.

Gentlemen, I have suffered from the high iowait problem for almost 4 years, and I have been watching this bug report (Bug 12309) on bugzilla.kernel.org for a year. Yesterday I finally managed to get out of this trouble by switching from CentOS 5.4 (with kernel 2.6.18) to zenwalk 6.2 (with a snapshot kernel 2.6.32.2). The computer is used to collect signal data from 4 gas turbines in a power plant. The project started in 2004, and we used mandrake 9 and zenwalk, both with 2.4.x kernels, and there were no high iowait problems. In 2006 we switched to Fedora 6 (kernel 2.6.18) and then CentOS 5, and the iowait began to make trouble: the system's response to mouse and keyboard became very slow, and new applications took a long time to launch. During these years, I always thought the main reason for this was that the computer's hardware was not good enough.
But early this month, the plant upgraded the computer to a new Lenovo server with two Xeon E5504 CPUs (8 cores total) and 4GB of memory, and the iowait is still very high. The following is the output of the "top" command on that machine:

Tasks: 215 total, 1 running, 213 sleeping, 0 stopped, 1 zombie
Cpu0 : 1.0%us, 0.3%sy, 0.0%ni, 65.9%id, 32.8%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 1.0%us, 3.6%sy, 0.0%ni, 45.0%id, 50.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 1.0%us, 4.0%sy, 0.0%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 1.3%us, 3.3%sy, 0.0%ni, 56.3%id, 38.3%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu4 : 1.3%us, 6.7%sy, 0.0%ni, 0.0%id, 89.7%wa, 0.7%hi, 1.7%si, 0.0%st
Cpu5 : 0.3%us, 3.3%sy, 0.0%ni, 91.7%id, 0.0%wa, 0.7%hi, 4.0%si, 0.0%st
Cpu6 : 10.3%us, 30.2%sy, 0.0%ni, 50.2%id, 2.3%wa, 1.0%hi, 6.0%si, 0.0%st
Cpu7 : 1.3%us, 8.6%sy, 0.0%ni, 83.1%id, 4.0%wa, 1.0%hi, 2.0%si, 0.0%st
Mem: 4078540k total, 3872720k used, 205820k free, 182344k buffers
Swap: 4192956k total, 0k used, 4192956k free, 2815596k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 3841 markv 15  0 72172  12m 8380 S 42.2  0.3 1984:24 lvinf
 8573 markv 15  0 60232  12m 8876 S 11.6  0.3 0:17.22 mark
 4067 markv 15  0 19056 3224 2336 S 10.6  0.1 759:52.00 dms
 3548 mysql 21  0  656m 617m 9292 S  9.0 15.5 764:42.05 mysqld
27042 markv 15  0 69404  12m 8756 S  4.3  0.3 290:36.14 walin
 3810 root  15  0 39772  15m 8224 S  1.3  0.4 3:59.76 Xorg
    1 root  15  0  2068  620  532 S  0.0  0.0 0:01.19 init
    2 root  RT -5      0    0    0 S  0.0  0.0 0:00.04 migration/0
    3 root  34 19      0    0    0 S  0.0  0.0 0:00.00 ksoftirqd/0
    4 root  RT -5      0    0    0 S  0.0  0.0 0:00.00 watchdog/0
    5 root  RT -5      0    0    0 S  0.0  0.0 0:00.02 migration/1
    6 root  34 19      0    0    0 S  0.0  0.0 0:00.00 ksoftirqd/1
    7 root  RT -5      0    0    0 S  0.0  0.0 0:00.00 watchdog/1
    8 root  RT -5      0    0    0 S  0.0  0.0 0:00.01 migration/2
    9 root  34 19      0    0    0 S  0.0  0.0 0:00.01 ksoftirqd/2
   10 root  RT -5      0    0    0 S  0.0  0.0 0:00.00 watchdog/2
   11 root  RT -5      0    0    0 S  0.0  0.0 0:00.00 migration/3
   12 root  34 19      0    0    0 S  0.0  0.0 0:00.00 ksoftirqd/3
   13 root  RT -5      0    0    0 S  0.0  0.0 0:00.00 watchdog/3
   14 root  RT -5      0    0    0 S  0.0  0.0 0:00.09 migration/4
   15 root  34 19      0    0    0 S  0.0  0.0 0:00.00 ksoftirqd/4
   16 root  RT -5      0    0    0 S  0.0  0.0 0:00.00 watchdog/4
   17 root  RT -5      0    0    0 S  0.0  0.0 0:00.03 migration/5
   18 root  36 19      0    0    0 S  0.0  0.0 0:00.00 ksoftirqd/5
[markv@markgt ~]$

All our application does is insert 16 records per second (every record a fixed 12 bytes in 3 fields) into a MySQL database; the storage is an LVM composed of two 750GB Seagate SATA 7200RPM disks. I am sure this high iowait was not caused by other things like the network cards or the video card, because I experimented with commenting out only the MySQL insert lines in our source code, and the system iowait would drop to 0 and the GUI would become very responsive. It also has nothing to do with the I/O scheduler, because I tested deadline and noop on CentOS 5.4 and the iowait could not be reduced. I also tried to enlarge /sys/block/sda/queue/nr_requests, and it does not work. I got the information from this bugzilla report that kernel 2.6.32 has fixed this high iowait problem, and I tested the snapshot kernel 2.6.32.2 of zenwalk on my notebook and found the high iowait is gone. So yesterday I installed zenwalk 6.2 with the 2.6.32.2 kernel on that server; although the kernel only detected/used one Xeon CPU and 2GB of memory, the iowait is very low and the whole system became very fast. Only for several seconds at a time would the iowait reach 30%-40%, and then it dropped back to 0 very soon. By the way, the I/O scheduler is cfq.
The following is the "top" output of it:

Tasks: 157 total, 2 running, 155 sleeping, 0 stopped, 0 zombie
Cpu0 : 12.3%us, 7.8%sy, 0.0%ni, 77.3%id, 0.0%wa, 0.0%hi, 2.6%si, 0.0%st
Cpu1 : 11.3%us, 8.4%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.0%hi, 4.2%si, 0.0%st
Cpu2 : 5.2%us, 7.2%sy, 0.0%ni, 84.0%id, 0.0%wa, 0.3%hi, 3.3%si, 0.0%st
Cpu3 : 8.1%us, 7.7%sy, 0.0%ni, 81.3%id, 0.0%wa, 0.0%hi, 2.9%si, 0.0%st
Mem: 2272368k total, 1153508k used, 1118860k free, 79384k buffers
Swap: 4192956k total, 0k used, 4192956k free, 797568k cached

  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 3085 mysql 40  0  453m  68m 4864 S   35  3.1 12:25.12 mysqld
 3203 markv 40  0 77852  17m  11m S   24  0.8 7:53.75 mark
 2684 root  40  0 16440 2896 2144 S    9  0.1 11:03.99 dms
 3879 markv 40  0 42256  12m 9336 S    4  0.6 1:43.26 walin
 1520 root  40  0  4156 1232  972 S    0  0.1 0:00.06 ntpd
 3235 root  40  0 64164  29m   9m S    0  1.3 0:50.78 X
 3885 markv 40  0  2452 1180  892 R    0  0.1 0:02.08 top
    1 root  40  0   804  332  292 S    0  0.0 0:00.90 init
    2 root  40  0     0    0    0 S    0  0.0 0:00.00 kthreadd
    3 root  RT  0     0    0    0 S    0  0.0 0:00.00 migration/0
    4 root  20  0     0    0    0 S    0  0.0 0:00.05 ksoftirqd/0

We didn't change the storage; it is still that LVM of two Seagate disks with 4 years of turbine data on them. I found that in kernel 2.6.32.8 the high iowait is back. How do I know that? When I copy a 700MB avi file from my notebook disk to a 3.5" USB mobile disk, the reading-side disk LED starts to flash quickly and immediately, but the writing-side disk LED keeps still for a long time (like 25-30 seconds) and then starts to flash slowly, and the whole process is abnormally long and unresponsive. Kernel 2.6.32.2 is the only 2.6 kernel (since 2.6.18) on which I found that both the reading-side and the writing-side disk LEDs start to flash quickly and immediately. There must be something wrong with the write cache behavior that causes the high iowait, and it was fixed in 2.6.32.2 and brought back in 2.6.32.8. So copying a big file to a USB disk and watching the disk LEDs can be used as a method for the kernel developers to reproduce and observe this bug. I hope this may be helpful. I noticed Mr. Morton said this is not the best place to discuss, but linux-kernel@vger.kernel.org rejected two of my emails from two different email accounts. So I mailed and cc-ed the email, and also post it here, so that more people can share my experience. Best regards, Frank Ren
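As an aside, the %wa figures quoted from top above can also be watched programmatically. A minimal sketch (not part of Frank's report) that samples the aggregate iowait share from the first line of /proc/stat once per second:

/* iowait.c - print the system-wide iowait percentage every second.
 * Build: gcc -O2 -o iowait iowait.c */
#include <stdio.h>
#include <unistd.h>

/* first line of /proc/stat: cpu user nice system idle iowait irq softirq steal */
static int read_cpu(unsigned long long *total, unsigned long long *iowait)
{
        unsigned long long v[8] = {0};
        FILE *f = fopen("/proc/stat", "r");
        int i, n;

        if (!f)
                return -1;
        n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
        fclose(f);
        if (n < 5)              /* need at least the iowait field */
                return -1;
        for (*total = 0, i = 0; i < n; i++)
                *total += v[i];
        *iowait = v[4];
        return 0;
}

int main(void)
{
        unsigned long long t0, w0, t1, w1;

        if (read_cpu(&t0, &w0))
                return 1;
        for (;;) {
                sleep(1);
                if (read_cpu(&t1, &w1))
                        return 1;
                printf("iowait: %5.1f%%\n",
                       t1 > t0 ? 100.0 * (w1 - w0) / (t1 - t0) : 0.0);
                t0 = t1;
                w0 = w1;
        }
}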
I'm using Mandriva 2010 with kernel 2.6.33-rc5. The freezes are huge; the system becomes unusable at every small disk activity (for example 'sudo urpmi blackbox'). The problem is there with kernels 2.6.31 and 2.6.32 too; other kernels were not tested. Please reopen the bug. It is a huge problem for many people.

It *is* a huge problem indeed. I kinda got used to it, but it feels like the 80s. I still have a Windows install in a 10GB corner of my HDD which I use very rarely, but every time I do, it feels like a miracle to see what these modern computers are able to do when they don't run a f*cked up kernel :/

One angle to tackle this would be to ask those who don't suffer from this bug what kind of kernel (and with what parameters) and hardware they're running. Since this seems to affect a wide range of people and setups, it could be interesting... but also a huge undertaking.

Could this bug be triggered by the GCC compiler? Has anyone tried to compile a 2.6.30-2.6.33 kernel with an earlier GCC version?

Could be interesting, but I've read some comments whose writers had tried to isolate two consecutive kernel versions surrounding the bug.
At last, it might be quicker, though quite boring for the operator, to try a laggy scenario with many different kernel versions, catching the bug by dichotomy. We might also distribute the effort between ourselves. I propose something like this: everybody uses the same laggy scenario which exhibits the bug, and tries it against a set of kernel versions. With, say, 4 versions each to cover the 2.6.x revisions, it should not take so long. Who volunteers? Have a nice day. Topaz.

Topaz, you'll have to explain the meaning of catching the bug by dichotomy.

I wonder if running with a "barebone" kernel can trigger this bug? I'm currently running Ubuntu Lucid, and I've noticed the bug since the Jaunty release (Intel x86 Centrino platform, with a Core 2 Duo, on two different machines, both contaminated). When I first had some poor performance problems, I tried to compile a vanilla kernel by myself and it resulted in a failure: the vanilla kernel 2.6.30 was also affected by this bug. My plan is to establish a laggy scenario, compile all versions of the 2.6.x kernel, and test them all against my laggy scenario. Should not take that long, but the more the merrier :)

One clear angle that has not been investigated by kernel developers is that this issue is highlighted by 64-bit code. I don't see this lag and high IO wait in a 32-bit kernel. I have a laptop with 2GB RAM and I got so sick of the lag that I have gone back to a 32-bit kernel and userspace. And the speed difference is amazing, to say the least. No more stuck mouse and no more waiting to see that konsole window pop up. Everything is much faster. Feels like a new laptop. And this is a 2GHz Core2 Duo based T61, not slow hardware by any means! And I get an extra 300-400MB of RAM back (YES! that's what you are reading!) just by switching to a 32-bit system. 64-bit C++ apps like Firefox and KDE eat almost twice the RAM. Firefox is running at 250MB with the same number of tabs and windows as on the 64-bit system, where it was consuming (RSS) about 450MB. Go figure! I am running a VirtualBox copy of XP on the laptop and I still don't see swap kick in. With 64-bit, running Firefox and XP in VirtualBox at the same time would lead to heavy swapping and things would be crawling! So much for the advancement to 64-bit! I have been running 64-bit systems for 4 years now and switching to 32-bit feels like I was living under a rock! I know all this sounds backwards. But give it a try.

Frank Ren: If you are sure the bug doesn't happen with 2.6.32.2, but does with all other releases you could test, then you should try to find what changed in it. Were you always running vanilla upstream kernels? Or always kernels from your distribution? Built on the same machine with the same compiler? If so, then have a look at the changelog from 2.6.32.2 to 2.6.32.8, looking for the culprit. I'd suggest you try 2.6.32.3 and check if the bug is there; and if not, increase the minor version until you get it: that will make the changelog really small (a sketch of this bisection procedure follows after this comment). Then, send a mail to LKML with your findings. You seem to be the reporter with the most precise information out there; you may catch something interesting!

devsk: Beware not to be misled by the swapping behavior of your system. If you're often completely filling your RAM when on 64 bits, then swapping may hurt responsiveness badly. When moving to 32 bits, if you gain 300MB, you may not suffer from this because there's free RAM, but that's not really linked with a 64-bit-only bug.
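For what it's worth, git can drive exactly this kind of dichotomy between a known-good and a known-bad version (the tags below are examples; substitute the last good and first bad versions you actually observe):

git bisect start
git bisect bad v2.6.31        # first version that shows the lag
git bisect good v2.6.30       # last version that does not
# build and boot the kernel git checks out, run the laggy scenario, then:
git bisect good               # or: git bisect bad
# repeat until git names the first bad commit; finish with:
git bisect reset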
(In reply to comment #430)
> One clear angle that has not been investigated by kernel developers is that
> this issue is highlighted by 64-bit code. I don't see this lag and high IO
> wait in a 32-bit kernel.

I do. I share your opinion on RAM use, though it surely doesn't belong here. The bug itself is definitely not restricted to 64-bit systems.

Anybody, please read this comment: https://bugzilla.kernel.org/show_bug.cgi?id=13347#c59 I think there is a worthwhile suggestion there.

I'm really unsure CFQ is the (only?) culprit. I've met the same behaviour using deadline and a 3ware 9650, and the fix was something completely different (pci_set_mwi). See https://bugzilla.redhat.com/show_bug.cgi?id=444759 for more details.

I'm going to chip in my experiences: I've had this bug with both 32-bit and 64-bit kernels. Setting different schedulers didn't make a difference. I've tried different versions of the kernel with no luck (though I haven't tried 2.6.32.2 specifically).

I have a 32-bit system. The bug is still there. This bug depends on CPU, memory and, above all, on disk, filesystem, LVM and encryption. It's a mix of transactions/s and throughput: if both are in a system-dependent range, the problem starts. There is no per-process throughput/transaction statistic in the scheduler that could penalize processes causing a high load. A single process can grab all available dirty pages and block the other processes.

I updated from 2.6.32 to 2.6.34 and the bug is fixed on two computers: in vmstat, wa still takes all the free time, but the interface does not freeze. I can give all needed info and try building any version from git for testing.

To topaz (#429): ready to join. It would be nice to determine the methods of testing; with the kernel of the current Lucid (2.6.32) the bug is reproduced. That's the problem: there is no reliable method for testing. I know about using dd. In addition I can move really big files (4-7GB each). My database on the server is really tiny, and there I can easily reproduce the bug (the system is Hardy 32bit, 2.6.24-19-server) by copying archives of sites and virtual machines. Further, in the office I use Lucid (2.6.32-22-386, 32bit), but at home Fedora 12 (32bit).
But this is all subjective. After updating to .34:

ivan1986@ivan1986:~/$ dd if=/dev/zero of=testfile.1gb bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 36.1762 s, 29.0 MB/s
ivan1986@ivan1986:~/$ dd if=/dev/zero of=testfile.1gb bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 26.7475 s, 39.2 MB/s
ivan1986@ivan1986:~/$ dd if=/dev/zero of=testfile.1gb bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 32.8729 s, 31.9 MB/s

1 3 0 20940 19664 315188 0 0 128 7860 571 1108 6 10 3 81
2 2 0 15744 19668 320272 0 0 68 65332 893 1593 3 32 7 58
0 3 0 11932 19668 323260 0 0 96 49384 579 1142 3 11 0 85
0 3 0 17252 19704 318232 0 0 0 6832 516 1131 2 3 0 94
0 4 0 12732 19704 323204 0 0 128 6520 940 1145 4 22 4 69
2 4 0 11808 19980 323796 0 0 88 30492 1093 1393 7 20 2 70
0 4 0 32860 19980 302340 0 0 148 70892 1117 2026 2 10 4 84
0 4 0 11856 19980 323400 0 0 176 6652 553 1217 3 9 33 54
1 4 0 12340 19980 323156 0 0 12 12396 604 1269 2 8 4 85
0 4 0 12228 19980 323768 0 0 0 13520 816 1612 2 6 0 91
0 4 0 13136 19980 322244 0 0 0 21924 937 1504 7 8 0 85
0 3 0 11820 19980 324064 0 0 112 42740 857 1404 1 33 14 52
0 3 0 11896 19980 323468 0 0 48 9668 600 1161 4 6 1 88
0 4 0 12608 19980 322604 0 0 128 55032 746 1342 10 11 19 61
0 3 0 11328 19980 323508 0 0 76 27868 498 1087 4 3 6 86
0 4 0 11952 20020 322996 0 0 36 1196 502 1268 5 3 0 92
0 4 0 11952 20020 323512 0 0 0 4036 540 1064 3 8 0 89
0 4 0 11868 20304 323064 0 0 112 64560 893 1190 5 28 3 64
0 5 0 21888 20304 312760 0 0 336 35284 639 1520 4 15 0 82
0 5 0 21764 20304 313068 0 0 0 20936 572 1490 6 3 0 90
0 4 0 11844 20316 323896 0 0 248 364 610 1165 5 12 0 83
1 3 0 12336 20360 323368 0 0 0 31160 1113 1188 3 18 0 78

Max 30% CPU in htop. The interface does NOT freeze, music plays normally, and other work is fine.

The simplest way to reproduce this bug on most hardware is:
1. Create a cryptsetup partition (on LVM or without LVM, both variants are ok). Preferably, all partitions used in the test case should be encrypted.
2. Install VirtualBox and try to create a preallocated hard disk image; the size must be 4GB or more.

That's it! If you try to use other applications at the same time, you will see 5-10 second freezes (a command-line sketch of this recipe follows at the end of this comment). I've reproduced the bug on many hardware configurations with 2.6.34 and older kernels, such as:

C2Q Q9650 / 8GB RAM / Seagate HDD / x86_64
i7 920 / 6GB / WD HDD / x86_64
C2D U7600 / 2GB / Samsung SSD / i686
C2D T7200 / 3GB / Seagate HDD / i686

So it's not a hardware problem - the hardware ranges from 1 to 4 years old and the results are the same. Also, on *BSD and Windows there are no problems with that hardware.
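A command-line sketch of that recipe (device names and sizes are placeholders; the VBoxManage syntax is the 3.x-era one):

cryptsetup luksFormat /dev/sdXN                 # encrypt the test partition
cryptsetup luksOpen /dev/sdXN testcrypt
mkfs.ext3 /dev/mapper/testcrypt
mount /dev/mapper/testcrypt /mnt/test
cd /mnt/test
VBoxManage createhd --filename test.vdi --size 8192 --variant Fixed   # preallocate an 8GB image
# while the preallocation runs, try to use other applications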
I'm wondering: isn't bad responsiveness equal to starvation of processes in the CPU scheduler? In that case it would be better to measure the amount of CPU cycles it is possible to burn during pekmop1024's procedure. I have tried to just dd an 8 GB file, and it gives me stalls in the GUI, but that is because of stat64 calls in the application. Under normal circumstances the file that is stat'ed is cached, but during high throughput the cache is filled up with other data, so the stat64 call has to read from the disk, which then competes with my dd. Running glxgears alongside the dd shows a constant fps during the whole dd. I have followed this thread for a long time and I do not remember anyone mentioning that a single high-throughput application renders the cache useless to other applications. Is it possible to avoid filling the cache with data that is written? I'm guessing that a simple application that once per second reads the first byte from a memory-mapped file will stall, even if it is only a single byte that needs to be cached. I'm sorry if my thoughts have been mentioned before in this thread :)

I've tested my assumption about the 1-byte mmap'ed file. It turned out that it runs fine during my dd. Probably 1 byte is not enough.

The problem still repeats itself - compiling psi freezes the interface.

(In reply to comment #421)
> I have suffered the high iowait problem for almost 4 years

Then let's finally kill it!

> I got information from this bugzilla report that kernel 2.6.32 has fixed
> this high iowait problem, and I tested the snapshot kernel 2.6.32.2 of
> zenwalk on my notebook, and found the high iowait is gone

> I found in kernel 2.6.32.8 the high iowait is back. How do I know that?
> When I copy a 700MB avi file from my notebook disk to a 3.5" USB mobile
> disk, the reading-side disk LED starts to flash quickly and immediately,
> but the writing-side disk LED keeps still for a long time (like 25-30
> seconds) and then starts to flash slowly, and the whole process is
> abnormally long and unresponsive.

> Kernel 2.6.32.2 is the only 2.6 kernel (since 2.6.18) on which I found
> that both the reading-side and the writing-side disk LEDs start to flash
> quickly and immediately. There must be something wrong with the write
> cache behavior that causes the high iowait, and it was fixed in 2.6.32.2
> and brought back in 2.6.32.8.

This is the complete git log 2.6.32.2..2.6.32.8:

b0e4370 Linux 2.6.32.8 6117db7 NET: fix oops at bootime in sysctl code e4a6a35 powerpc: TIF_ABI_PENDING bit removal a420e9f ath9k: fix beacon slot/buffer leak 1c97637 ath9k: fix eeprom INI values override for 2GHz-only cards 2c7f87e pktcdvd: removing device does not remove its sysfs dir b31aa5c uartlite: fix crash when using as console e06fbe9 kernel/cred.c: use kmem_cache_free 35cfb03 starfire: clean up properly if firmware loading fails 906f68d mx3fb: some debug and initialisation fixes 682efb8 imxfb: correct location of callbacks in suspend and resume b260729 mac80211: fix NULL pointer dereference when ftrace is enabled 3a9353f mm: flush dcache before writing into page to avoid alias 78da404 be2net: Fix memset() arg ordering. e38d76e be2net: Bug fix to support newer generation of BE ASIC 43d7ff2 connector: Delete buggy notification code.
f06f00e usb: r8a66597-hdc disable interrupts fix 0ae2b7d block: fix bugs in bio-integrity mempool usage 9648148 random: Remove unused inode variable 8857a1a random: drop weird m_time/a_time manipulation 94af44b Fix 'flush_old_exec()/setup_new_exec()' split cb723ba block: fix bio_add_page for non trivial merge_bvec_fn case e52299d mm: purge fragmented percpu vmap blocks 56d4b77 mm: percpu-vmap fix RCU list walking dce6a09 libata: retry link resume if necessary 42f7e23 oprofile/x86: fix crash when profiling more than 28 events 9c66557 oprofile/x86: add Xeon 7500 series support 4f7d666 KVM: allow userspace to adjust kvmclock offset a74e62c ax25: netrom: rose: Fix timer oopses 3125258 af_packet: Don't use skb after dev_queue_xmit() ecb7287 net: restore ip source validation 1681333 sky2: Fix oops in sky2_xmit_frame() after TX timeout 16b8efa tcp: update the netstamp_needed counter when cloning sockets 359e2f2 clocksource: fix compilation if no GENERIC_TIME 253f887 x86/amd-iommu: Fix possible integer overflow d1a3103 x86: Add quirk for Intel DG45FC board to avoid low memory corruption 8159070 x86: Add Dell OptiPlex 760 reboot quirk 00362b9 regulator: Specify REGULATOR_CHANGE_STATUS for WM835x LED constraints 6db6ace SECURITY: selinux, fix update_rlimit_cpu parameter 80569f6 firewire: core: add_descriptor size check 612e99b drm/i915: only enable hotplug for detected outputs 69bf9a6 iwlwifi: set default aggregation frame count limit to 31 3492bbb x86: Disable HPET MSI on ATI SB700/SB800 cf135e5 Input: winbond-cir - remove dmesg spam 5e806e1 x86: get rid of the insane TIF_ABI_PENDING bit c2e245d sparc: TIF_ABI_PENDING bit removal 336ca4c Split 'flush_old_exec' into two functions 944a638 FDPIC: Respect PT_GNU_STACK exec protection markings when creating NOMMU stack 0b3bf81 mm: fix migratetype bug which slowed swapping 629527c Fix failure exit in ipathfs 30d3844 fix affs parse_options() d842c31 Fix remount races with symlink handling in affs 36a0a4a fix leak in romfs_fill_super() 26d2257 fix oops in fs/9p late mount failure deb20f1 Fix failure exits in bfs_fill_super() 703c300 Fix a leak in affs_fill_super() 61d4374 drm/i915: Reload hangcheck timer too for Ironlake f0b4195 e1000/e1000e: don't use small hardware rx buffers b9ad9bb e1000e: enhance frame fragment detection dff2267 e1000: enhance frame fragment detection cfc7e54 UBI: fix volume creation input checking 3b4f785 ACPI: Advertise to BIOS in _OSC: _OST on _PPC changes 0d48a1a ACPI: fix OSC regression that caused aer and pciehp not to load 1a52add ACPI: Add platform-wide _OSC support. e62a96c ACPI: Add a generic API for _OSC -v2 1e88960 dasd: fix possible NULL pointer errors 083beff zcrypt: Do not remove coprocessor for error 8/72 63693ee libata: retry FS IOs even if it has failed with AC_ERR_INVALID 8c2cd3f x86: Remove "x86 CPU features in debugfs" (CONFIG_X86_CPU_DEBUG) b5b39c3 x86: Set hotpluggable nodes in nodes_possible_map 76e789c S390: fix single stepped svcs with TRACE_IRQFLAGS=y 16a2ae6 firewire: ohci: fix crashes with TSB43AB23 on 64bit systems d8e0902 drm/i915: Selectively enable self-reclaim 8268c0b mm: add new 'read_cache_page_gfp()' helper function b7a9d92 mptsas: Fix issue with chain pools allocation on katmai e15fca0 scsi_lib: Fix bug in completion of bidi commands b4bdd73 Linux 2.6.32.7 a8e96d6 x86, msr/cpuid: Pass the number of minors when unregistering MSR and CPUID drivers. 
0a1c275 fnctl: f_modown should call write_lock_irqsave/restore 01e991b iwlwifi: Fix throughput stall issue in HT mode for 5000 d274df6 ACPI: enable C2 and Turbo-mode on Nehalem notebooks on A/C 59568be x86: Reenable TSC sync check at boot, even with NONSTOP_TSC 194223f IPoIB: Clear ipoib_neigh.dgid in ipoib_neigh_alloc() 454f8b1 KVM: only clear irq_source_id if irqchip is present eaccd49 KVM: fix lock imbalance in kvm_*_irq_source_id() 9801911 KVM: x86: Fix leak of free lapic date in kvm_arch_vcpu_init() 8e5c20d KVM: x86: Fix probable memory leak of vcpu->arch.mce_banks 0118bac KVM: x86: Fix host_mapping_level() 4938210 KVM: MMU: bail out pagewalk on kvm_read_guest error 59cf854 KVM: Fix race between APIC TMR and IRR f0d13b8 KVM: only allow one gsi per fd 70be4d7 KVM: S390: fix potential array overrun in intercept handling eb60025 cfg80211: fix channel setting for wext 304cd19 mac80211: check that ieee80211_set_power_mgmt only handles STA interfaces. 09e4d0f ASoC: fix a memory-leak in wm8903 2cdc2dc UBI: initialise update marker f6fbe0b UBI: fix memory leak in update path 4d845d6 hwmon: (fschmd) Fix a memleak on multiple opens of /dev/watchdog 00bd133 ALSA: hda - Fix HP T5735 automute a0dffef ipc ns: fix memory leak (idr) a5981df netiucv: displayed TX bytes value much too high 27aeefb cio: dont panic in non-fatal conditions f5b1bc5 cio: fix double free in case of probe failure da02974 V4L/DVB (13826): uvcvideo: Fix controls blacklisting 2928b68 md: fix small irregularity with start_ro module parameter 31cf6d8 ata_piix: fix MWDMA handling on PIIX3 3de08a12 ahci: disable SNotification capability for ich8 c817c19 iTCO_wdt: Add Intel Cougar Point and PCH DeviceIDs 42b4505 iTCO_wdt: add PCI ID for the Intel EP80579 (Tolapai) SoC 53691f2 iTCO_wdt.c - cleanup chipset documentation 4220098 ALSA: hda - Add missing Line-Out and PCM switches as slave 9049580 ALSA: hda - Fix quirk for Maxdata obook4-1 a2c5952 ALSA: hda - select IbexPeak handler for Calpella d160610 Input: i8042 - add Dritek quirk for Acer Aspire 5610. 461eb3f Input: i8042 - add Gigabyte M1022M to the noloop list f6278f1 Input: i8042 - remove identification strings from DMI tables 44d13be DMI: allow omitting ident strings in DMI tables 5172b4b PCI: AER: fix aer inject result in kernel oops bf9a88d qlge: Bonding fix for mode 6. 6b07617 qlge: Add handler for DCBX firmware event. 6055e7f qlge: Don't fail open when port is not initialized. 836750b qlge: Set PCIE max read request size. ffd1fab qlge: Remove explicit setting of PCI Dev CTL reg. 
7c0798e fcoe: Fix getting san mac for VLAN interface 1ce0348 fcoe: Fix checking san mac address e166cb1 fcoe, libfc: fix an libfc issue with queue ramp down in libfc 2792e0ce libfc: remote port gets stuck in restart state without really restarting 407590a libfc: fix free of fc_rport_priv with timer pending a3d46ca libfc: fix memory corruption caused by double frees and bad error handling 4c40dbe libfc: Fix frags in frame exceeding SKB_MAX_FRAGS in fc_fcp_send_data 88cc93a fcoe: initialize return value in fcoe_destroy 7c8a0dc libfc: don't WARN_ON in lport_timeout for RESET state 83d236b libfc: lport: fix minor documentation errors 56320f6 libfc: Fix wrong scsi return status under FC_DATA_UNDRUN d5d72da fcoe: remove redundant checking of netdev->netdev_ops 34556a1 libfc: fix ddp in fc_fcp for 0 xid 1e418b2 libfc: fix typo in retry check on received PRLI 253f41b lpfc: fix hang on SGI ia64 platform 4b2bc96 scsi_transport_fc: remove invalid BUG_ON d502a76 scsi_dh: create sysfs file, dh_state for all SCSI disk devices e7c8167 scsi_devinfo: update Hitachi entries (v2) 001252f HID: fixup quirk for NCR devices 5e05787 NFS: Revert default r/wsize behavior 1d42a1b iscsi class: modify handling of replacement timeout 83886fa PCI: Always set prefetchable base/limit upper32 registers 5cf92e9 timers, init: Limit the number of per cpu calibration bootup messages 34911bf nfsd: Fix sort_pacl in fs/nfsd/nf4acl.c to actually sort groups a9238ce nohz: Prevent clocksource wrapping during idle db47a16 sched: Fix missing sched tunable recalculation on cpu add/remove 08b84be sched: Fix isolcpus boot option eb9dbd9 ALSA: ice1724 - Patch for suspend/resume for ESI Juli@ e96610c partitions: use sector size for EFI GPT 6f8de29 partitions: read whole sector with EFI GPT header 8f2fefc netfilter: xtables: fix conntrack match v1 ipt-save output 3cd4bea V4L/DVB (13680b): DocBook/media: create links for included sources 35f42c9 V4L/DVB (13680a): DocBook/media: copy images after building HTML 857ffb8 atl1e:disable NETIF_F_TSO6 for hardware limit f7b1714 atl1c:use common_task instead of reset_task and link_chg_task b68f619 iTCO_wdt: Add support for Intel Ibex Peak 96ef353 V4L/DVB (13168): Add support for Asus Europa Hybrid DVB-T card (SAA7134 SubVendor ID: 0x1043 Device ID: 0x4847) 8429570 USB: ftdi_sio: add USB device ID's for B&B Electronics line 5bcaffb USB: mos7840: add device IDs for B&B electronics devices 4d3c678 V4L/DVB (13569): smsusb: add autodetection support for five additional Hauppauge USB IDs ff23399 ALSA: hda - Add PCI IDs for Nvidia G2xx-series 4bc685e vfs: get_sb_single() - do not pass options twice 1b715f1 driver-core: fix devtmpfs crash on s390 da30443 Driver-Core: devtmpfs - set root directory mode to 0755 04daa51 Input: ALPS - add interleaved protocol support (Dell E6x00 series) 30dc12e davinci: dm646x: Add support for 3.x silicon revision c375e84 powerpc/fsl: Add PCI device ids for new QoirQ chips a98917c ar9170: Add support for D-Link DWA 160 A2 002464c mpt2sas: New device SAS2208 support is added 90ee3ca be2net: Add the new PCI IDs to PCI_DEVICE_TABLE. 879c8e8 be2net: Add support for next generation of BladeEngine device. 
c97c73d sfc: Fix DMA mapping cleanup in case of an error in TSO 9396c90 ACPI: don't cond_resched if irq is disabled ce946bc clockevents: Add missing include to pacify sparse 08b8ff4 clockevent: Don't remove broadcast device when cpu is dead f584d37 Linux 2.6.32.6 9607f06 perf: Honour event state for aux stream data b0a9392 perf events: Dont report side-band events on each cpu for per-task-per-cpu events 5a20267 perf timechart: Use tid not pid for COMM change f2fa92b vmalloc: remove BUG_ON due to racy counting of VM_LAZY_FREE 3d0cc9a USB: fix usbstorage for 2770:915d delivers no FAT 538a6fd x86/PCI/PAT: return EINVAL for pci mmap WC request for !pat_enabled e0f5cfa DM: Fix device mapper topology stacking fbe2992 block: bdev_stack_limits wrapper ed0cd89 drm/i915: try another possible DDC bus for the SDVO device with multiple outputs 4fb77a3 drm/i915: Read the response after issuing DDC bus switch command 8cef765 SCSI: enclosure: fix oops while iterating enclosure_status array 5f0ab2d ACPI: EC: Add wait for irq storm 1ff7b99 ACPI: EC: Accelerate query execution 111ab4b USB: add speed values for USB 3.0 and wireless controllers a2a5b33 USB: add missing delay during remote wakeup bfec5ce USB: EHCI & UHCI: fix race between root-hub suspend and port resume 07d577f USB: EHCI: fix handling of unusual interrupt intervals 186c74d USB: Don't use GFP_KERNEL while we cannot reset a storage device fa68188 USB: fix bitmask merge error 911b8be usb: serial: fix memory leak in generic driver 04f7ec7 serial: 8250_pnp: use wildcard for serial Wacom tablets 6fc7937 nozomi: quick fix for the close/close bug 8c53542 ecryptfs: initialize private persistent file before dereferencing pointer 3621216 ecryptfs: use after free 179b7e5 tty: fix race in tty_fasync b70922a Staging: hv: fix smp problems in the hyperv core code 50e4975 Staging: asus_oled: fix oops in 2.6.32.2 ccb90b8 V4L/DVB (13900): gspca - sunplus: Fix bridge exchanges. d547e91 x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers a2febcd Linux 2.6.32.5 af55a3d vfs: Fix vmtruncate() regression 2693139 sched: Fix task priority bug fdc360e serial/8250_pnp: add a new Fujitsu Wacom Tablet PC device 2d22b38 i2c/pca: Don't use *_interruptible c1f77a7 i2c: Do not use device name after device_unregister 4bff5ff sparc64: Fix Niagara2 perf event handling. 9d6567c sparc64: Fix NMI programming when perf events are active. 896fb0d sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 9fc68ca asus-laptop: add Lenovo SL hotkey support 2196ca4 Input: pmouse - move Sentelic probe down the list 94249e6 megaraid_sas: remove sysfs poll_mode_io world writeable permissions 2db740c PCI/cardbus: Add a fixup hook and fix powerpc eecd8a9 HID: add device IDs for new model of Apple Wireless Keyboard 781d5c4 reiserfs: truncate blocks not used by a write 56a7f72 V4L/DVB (13868): gspca - sn9c20x: Fix test of unsigned. 
fe52cee ALSA: hda - Fix missing capture mixer for ALC861/660 codecs 34e7aa0 mfd: Correct WM835x ISINK ramp time defines 33faa3c mfd: WM835x GPIO direction register is not locked 7f08f93 x86: SGI UV: Fix mapping of MMIO registers 7f40c6b edac: i5000_edac critical fix panic out of bounds 25d5699 x86, apic: use physical mode for IBM summit platforms c91ab04 page allocator: update NR_FREE_PAGES only when necessary d4c893f futexes: Remove rw parameter from get_futex_key() 8410b13 x86, mce: Thermal monitoring depends on APIC being enabled 1bd24fd block: Fix incorrect reporting of partition alignment 8a9c3f5 drm/i915: remove loop in Ironlake interrupt handler 4334ab7 memcg: ensure list is empty at rmdir 70f800f revert "drivers/video/s3c-fb.c: fix clock setting for Samsung SoC Framebuffer" 800c028 inotify: only warn once for inotify problems cec3ad6 inotify: do not reuse watch descriptors 3df7673 Linux 2.6.32.4 5877960 agp/intel-agp: Clear entire GTT on startup 5deb72e ipv6: skb_dst() can be NULL in ipv6_hop_jumbo(). 54f1b39 module: handle ppc64 relocating kcrctabs when CONFIG_RELOCATABLE=y 9ef9a7c fix more leaks in audit_tree.c tag_chunk() dffaea5 fix braindamage in audit_tree.c untag_chunk() d3b1e3b mac80211: fix skb buffering issue (and fixes to that) 71c7707 kernel/sysctl.c: fix stable merge error in NOMMU mmap_min_addr 904e373 libertas: Remove carrier signaling from the scan code b9945e7 drm/i915: remove render reclock support 9b13cca mac80211: add missing sanity checks for action frames 0ea5505 iwl: off by one bug 724ad42 cfg80211: fix syntax error on user regulatory hints e6efac7 ath5k: Fix eeprom checksum check for custom sized eeproms fc95845 iwlwifi: fix iwl_queue_used bug when read_ptr == write_ptr a111c28 xen: fix hang on suspend. 38c4d8d quota: Fix dquot_transfer for filesystems different from ext4 a61dcb0 hwmon: (adt7462) Fix pin 28 monitoring 4052fbf hwmon: (coretemp) Fix TjMax for Atom N450/D410/D510 CPUs 545b020 netfilter: nf_ct_ftp: fix out of bounds read in update_nl_seq() 635b4f9 netfilter: ebtables: enforce CAP_NET_ADMIN 954c8ef ASoC: Fix WM8350 DSP mode B configuration cf99848 ALSA: atiixp: Specify codec for Foxconn RC4107MA-RS2 0385cc0 ALSA: ac97: Add Dell Dimension 2400 to Headphone/Line Jack Sense blacklist 5bb4e84 ALSA: hda - Fix ALC861-VD capture source mixer e0abcea mmc_block: fix queue cleanup 0c74f45 mmc_block: fix probe error cleanup bug 0798abf mmc_block: add dev_t initialization check 0696a3b kernel/signal.c: fix kernel information leak with print-fatal-signals=1 ecac13f dma-debug: allow DMA_BIDIRECTIONAL mappings to be synced with DMA_FROM_DEVICE and f21efc5 lib/rational.c needs module.h 21f7654 cgroups: fix 2.6.32 regression causing BUG_ON() in cgroup_diput() 6abb6ac drivers/cpuidle/governors/menu.c: fix undefined reference to `__udivdi3' fdc0895 rtc_cmos: convert shutdown to new pnp_driver->shutdown 0c51b5c drm/i915: fix unused var c7e8c26 drm/i915: Select the correct BPC for LVDS on Ironlake c04fd30 drm/i915: Make the BPC in FDI rx/transcoder be consistent with that in pipeconf on Ironlake cba0270 drm/i915: Enable/disable the dithering for LVDS based on VBT setting de04091 drm: remove address mask param for drm_pci_alloc() c693959 drm/i915: Permit pinning whilst the device is 'suspended' d241962 drm/i915: fix order of fence release wrt flushing d3e4d5f drm/i915: Update LVDS connector status when receiving ACPI LID event 8064af1 sunrpc: on successful gss error pipe write, don't return error 8ffe947 SUNRPC: Fix the return value in gss_import_sec_context() e64b13f 
SUNRPC: Fix up an error return value in gss_import_sec_context_kerberos() eb0b93d sunrpc: fix peername failed on closed listener 3aafc55 nfsd: make sure data is on disk before calling ->fsync b7e5f77 Revert "x86: Side-step lguest problem by only building cmpxchg8b_emu for pre-Pentium" 2448811 exofs: simple_write_end does not mark_inode_dirty 8dfabfc modules: Skip empty sections when exporting section notes efd38f4 ASoC: fix params_rate() macro use in several codecs e4dd8ca fasync: split 'fasync_helper()' into separate add/remove functions 1f51eb3 untangle the do_mremap() mess c3a8e0e Linux 2.6.32.3 84d330e generic_permission: MAY_OPEN is not write access 3815270 rt2x00: Disable powersaving for rt61pci and rt2800pci. 8ac9e80 ksm: fix mlockfreed to munlocked b2ea8cb vmscan: do not evict inactive pages when skipping an active list scan 370b758 lguest: fix bug in setting guest GDT entry 743c078 ext4: Update documentation to correct the inode_readahead_blks option name fc31022 sched: Sched_rt_periodic_timer vs cpu hotplug 9127720 amd64_edac: fix forcing module load/unload 1538323 amd64_edac: make driver loading more robust 44a529c amd64_edac: fix driver instance freeing 2d9e1f0 x86, msr: msrs_alloc/free for CONFIG_SMP=n eb21839 x86, msr: Add support for non-contiguous cpumasks 26eb2ac amd64_edac: unify MCGCTL ECC switching ebd2802 cpumask: use modern cpumask style in drivers/edac/amd64_edac.c a89a9e1 x86, msr: Unify rdmsr_on_cpus/wrmsr_on_cpus b2dbc46 ext4: fix sleep inside spinlock issue with quota and dealloc (#14739) dbe5cc0 ext4: Convert to generic reserved quota's space management. bbf2450 quota: decouple fs reserved space from quota reservation f07c88d Add unlocked version of inode_add_bytes() function 0aebc28 udf: Try harder when looking for VAT inode 3196f98 orinoco: fix GFP_KERNEL in orinoco_set_key with interrupts disabled fad0c31 xen: wait up to 5 minutes for device connetion 2cfea00 xen: improvement to wait_for_devices() af70ddf xen: fix is_disconnected_device/exists_disconnected_device 1dc51f1 S390: dasd: support DIAG access for read-only devices 4012cf6 drm: disable all the possible outputs/crtcs before entering KMS mode 08ff733 drm/radeon/kms: fix crtc vblank update for r600 a09adfe sched: Fix balance vs hotplug race fb70ac4 Keys: KEYCTL_SESSION_TO_PARENT needs TIF_NOTIFY_RESUME architecture support 7fcb558 b43: avoid PPC fault during resume a8e3ec9 hwmon: (sht15) Off-by-one error in array index + incorrect constants 048a424 netfilter: fix crashes in bridge netfilter caused by fragment jumps 89cf4f4 ipv6: reassembly: use seperate reassembly queues for conntrack and local delivery ee6bfc6 e100: Fix broken cbs accounting due to missing memset. 
ad46fed memcg: avoid oom-killing innocent task in case of use_hierarchy b52d855 x86/ptrace: make genregs[32]_get/set more robust 6e2aa7d V4L/DVB (13596): ov511.c typo: lock => unlock 4b6d263 kernel/sysctl.c: fix the incomplete part of sysctl_max_map_count-should-be-non-negative.patch 3ec268a 'sysctl_max_map_count' should be non-negative 0399123 NOMMU: Optimise away the {dac_,}mmap_min_addr tests 1cfe005 mac80211: fix race with suspend and dynamic_ps_disable_work 14b4d74 iwlwifi: fix 40MHz operation setting on cards that do not allow it c4ae8ae iwlwifi: fix more eeprom endian bugs df5d119 iwlwifi: fix EEPROM/OTP reading endian annotations and a bug 0c0cdaf iwl3945: fix panic in iwl3945 driver 66c9e44 iwl3945: disable power save 87d512c ath9k_hw: Fix AR_GPIO_INPUT_EN_VAL_BT_PRIORITY_BB and its shift value in 0x4054 a6d8cc6 ath9k_hw: Fix possible OOB array indexing in gen_timer_index[] on 64-bit 12ba709 ath9k: fix suspend by waking device prior to stop c965e1e ath9k: wake hardware during AMPDU TX actions 463a7f9 ath9k: fix missed error codes in the tx status check bef82b6 ath9k: Fix TX queue draining 0ebbdd7 ath9k: wake hardware for interface IBSS/AP/Mesh removal d5086b9 ath5k: fix SWI calibration interrupt storm 4777020 cfg80211: fix race between deauth and assoc response 9f7028e mac80211: Fix IBSS merge 0b41c5a mac80211: fix WMM AP settings application 330b937 mac80211: fix propagation of failed hardware reconfigurations 38cf2a0 iwmc3200wifi: fix array out-of-boundary access 08a9378 Libertas: fix buffer overflow in lbs_get_essid() 3b96f9a KVM: LAPIC: make sure IRR bitmap is scanned after vm load 3a9f992 KVM: MMU: remove prefault from invlpg handler 8b9f038 ioat2,3: put channel hardware in known state at init e05a6f0 ioat3: fix p-disabled q-continuation e93166f x86/amd-iommu: Fix initialization failure panic cd7bc18 cifs: NULL out tcon, pSesInfo, and srvTcp pointers when chasing DFS referrals 6cb5fcc dma-debug: Fix bug causing build warning 120dbaa dma-debug: Do not add notifier when dma debugging is disabled. c4ddbba dma: at_hdmac: correct incompatible type for argument 1 of 'spin_lock_bh' ed8f6eb md: Fix unfortunate interaction with evms acb8be4 x86: SGI UV: Fix writes to led registers on remote uv hubs 4ba51fe drivers/net/usb: Correct code taking the size of a pointer 526fed8 USB: fix bugs in usb_(de)authorize_device c6d7a67 USB: rename usb_configure_device f661c3f Bluetooth: Prevent ill-timed autosuspend in USB driver b71bfa6 USB: musb: gadget_ep0: avoid SetupEnd interrupt 3635acd USB: Fix a bug on appledisplay.c regarding signedness 5a82dd5 USB: option: support hi speed for modem Haier CE100 702a0a0 USB: emi62: fix crash when trying to load EMI 6|2 firmware 2d67231 drm/radeon: fix build on 64-bit with some compilers. 474ae5e ASoC: Do not write to invalid registers on the wm9712. 
d75621c powerpc: Handle VSX alignment faults correctly in little-endian mode 8aafd7d ACPI: Use the return result of ACPI lid notifier chain correctly 3872bf5 ACPI: EC: Fix MSI DMI detection 5ab8996 acerhdf: limit modalias matching to supported 296e9be ALSA: hda - Fix missing capsrc_nids for ALC88x aec8dc2 sound: sgio2audio/pdaudiocf/usb-audio: initialize PCM buffer e255d3c ASoC: wm8974: fix a wrong bit definition 1ee0552 pata_cmd64x: fix overclocking of UDMA0-2 modes f31733a pata_hpt3x2n: fix clock turnaround fa3f5a5 clockevents: Prevent clockevent_devices list corruption on cpu hotplug 8e04c81 sched: Select_task_rq_fair() must honour SD_LOAD_BALANCE c9ac6a9 x86, cpuid: Add "volatile" to asm in native_cpuid() 14ae082 sched: Fix task_hot() test order fdf2675 SCSI: fc class: fix fc_transport_init error handling 1ab0714 SCSI: st: fix mdata->page_order handling 9f63d27 SCSI: qla2xxx: dpc thread can execute before scsi host has been added c1d17da SCSI: ipr: fix EEH recovery a1092bf Linux 2.6.32.2

The problem has to be somewhere in there. Frank, you're the only guy up to now bringing hard evidence and two relatively close good/bad kernel versions. Would you be able to dig deeper into this?

It's just ridiculous that some IO can prevent a quad-core from skipless video playback (on 2.6.34-git12, that is).. Because of btrfs I can't switch back to 2.6.32.2 - but maybe someone can figure out how to use the phoronix-test-suite's automagic to bisect this? And despite all the noise: this bug really shouldn't be marked RESOLVED INSUFFICIENT_DATA ^^

(In reply to comment #448 and comment #431)
Sorry, I really want to help, but I am not a kernel developer; hacking the kernel source is too difficult for me. Besides, the gas turbine historian is a live production system; it cannot be used as a debug system. I will keep watching for the final resolution. For now, we will stick with 2.6.32.2.

wait wait, what is this :O Updating to yesterday's git kernel (from 2.6.34-git12) gave me a huge perceived speed boost? I haven't specifically compared iowait times - but all processes seem to be using less CPU time? My BOINC likes that very much ;) There is a lot of concurrent IO here, and the system, apart from minor application stalling (although 8GiB RAM and no swap), hasn't been this un-sluggish for a loooong time (2.6.18? ;) It feels like someone finally released the brakes - hope you guys can confirm this!

wrt comment #450, 2.6.35-rc1 is out! I hope that has something for all of us sufferers. I will try it later today. Can other folks also try and report here?

(In reply to comment #450)
Maybe this is related to the observations at Phoronix's kernel tracker[1]. An in-depth article was also posted[2].

1: http://www.phoromatic.com/kernel-tracker.php?sys_1=yes&sys_3=yes&sys_4=yes&sub_type_System=yes&sub_type_Processor=yes&sub_type_Disk=yes&sub_type_Graphics=yes&sub_type_Memory=yes&sub_type_Network=yes&date_range=15&regression_threshold=0.15&only_show_regressions=yes&submit=Update+Results
2: http://www.phoronix.com/scan.php?page=article&item=linux_2635_fail&num=1

Note: Link 1 is valid for the next few days; thereafter you have to raise the displayed days to get the regression back into view.

lol, this bug was marked "resolved". I wish.

(Hi, everyone).

I suspect we have about 25 different bugs here. Really the only way we'll make progress here is if people can come up with specific test cases which developers can run on their own machines, and reproduce the bug.
So if any of you guys have time to try that and are successful, then please attach that test case here, or send it out via email to the relevant culprits.

It's really that important. There's practically a 1:1 ratio between reproduction test cases and bug fixes.

Let me point out a potential pitfall: for a long while I thought my machine was suffering from this bug. However, the real reason for my high I/O wait and extremely poor performance was this:
http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
So everyone should rule out that one first... for me, a repartitioning of my drive helped a lot :).

Just want to report that I've had great success with the kernel 2.6.35-020635rc4-generic on Ubuntu 32-bit. Apps can still grey out when allocating space for big files, but the interface stays responsive in other apps. I'll try it out on more setups and report back here if I notice the problem appearing in other places. Finally I can say that my Linux machines are usable again. Cheers!

(In reply to comment #453)
> lol, this bug was marked "resolved". I wish.
>
> (Hi, everyone).
>
> I suspect we have about 25 different bugs here. Really the only way we'll make
> progress here is if people can come up with specific test cases which
> developers can run on their own machines, and reproduce the bug.
>
> So if any of you guys have time to try that and are successful, then please
> attach that test case here, or send it out via email to the relevant culprits.
>
> It's really that important. There's practically a 1:1 ratio between
> reproduction test cases and bug fixes.

Hi Andrew,

Very simple testing procedure:
Launch Firefox
Run 'stress -d 1'
Try to open some websites
Machine hangs

Thanks

(In reply to comment #455)
> Just want to report that I've had great success with the kernel
> 2.6.35-020635rc4-generic on Ubuntu 32-bit. Apps can still grey out when
> allocating space for big files, but the interface stays responsive in other
> apps. I'll try it out on more setups and report back here if I notice the
> problem appearing in other places.
>
> Finally I can say that my Linux machines are usable again. Cheers!

I will try that, but I have no issues in XP, and my hard drive is at least 2 1/2 years old while this issue has been around for even longer than that, so I doubt it's the reason for my problems. I have also tried playing around with other schedulers and disk mounting options. I have tried writeback and journal mode; writeback provides very minimal improvement, not enough to make it worth my while to run it always. Changing between ATA and AHCI mode makes no difference, and neither does changing the scheduler from cfq to anticipatory or deadline. I am testing this on a Dell Precision M6300 laptop with a SATA drive, but I have experienced this issue on all my various types of PCs since at least Ubuntu Gutsy or Intrepid.

(In reply to comment #456)
> Very simple testing procedure:
> Launch Firefox
> Run 'stress -d 1'

From where does one obtain a copy of `stress'? Thanks.

I believe this is the website (according to gentoo portage).
http://weather.ou.edu/~apw/projects/stress/
Benj

I've tried stress also. I have 2 GB of memory and 1.5 GB of swap.
With swap activated, stress -d 1 hangs my machine.
stress -d 1 does the same with swappiness set to 0.
With swap deactivated, things run pretty well. Of course apps utilizing synchronous disk I/O fight stress for priority.

There must be a reasonable explanation for why everything stops when swap is activated. Even a simple app like "dstat" stalls.
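(The "Launch Firefox / stress -d 1" procedure from a few comments up can be scripted for unattended runs, roughly as follows; this is a sketch that assumes the stress tool linked above and captures vmstat output suitable for attaching here:)

vmstat 1 > vmstat.log & vmstat_pid=$!   # record memory/I/O counters to attach to this report
stress -d 1 & stress_pid=$!             # one worker spinning on disk writes
# now launch Firefox and browse for a couple of minutes; note every UI stall
sleep 120
kill $stress_pid $vmstat_pid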
I can also confirm this. Disabling swap with swapoff -a solves the problem. I have 8GB of RAM and 8GB of swap with a fake-RAID mirror. Before this I couldn't do backups without the whole system grinding to a halt. Right now I am doing a backup from the drives, watching a movie from the same drives, and more. No more huge iowait times and no programs freezing as they are starved of access to the drives.

Perhaps you could capture some vmstat 1 output from just before/when the stall occurs?

Created attachment 27230 [details]
vmstat for my system running "stress -d 1" without hanging.
My system just logged into KDE around 650 Mb of memory used by applications
prior to starting "stress -d 1"
Created attachment 27231 [details]
vmstat for my system running "stress -d 1". System hangs.
My system just logged into KDE around 860 Mb of memory used by applications
prior to starting "stress -d 1". Application utilizing extra memory is
digikam and kontact - both sitting there doing nothing.
Created attachment 27232 [details]
vmstat for my system (without swap) running "stress -d 1" without hanging.
Same setup as stress_swap_hang.vmstat except that swap is turned off using
"swapoff -a" in this run.
The strange thing about every high-throughput I/O is that *every* byte of memory is used up until a certain limit. That use of memory will even swap out stuff. Looking especially at stress_noswap_nohang.vmstat, the behavior mimics this:
1. Place data to be written into memory
2. Write some data to the disk
3. Goto 1 if not all allowed memory is used.

Interesting is that "stress -d 1" places data into memory a lot faster than a normal hard disk can handle, so the memory will be filled up eventually (the limit will be reached eventually). So for me, I only have a hanging system when "stress -d 1" writes compete with "swap out" - which is actually caused by "stress -d 1" filling the memory.

So the big question: why does the kernel allow large data writes to fill up the memory and even swap out stuff just to get data-to-be-written into memory?

(In reply to comment #466)
> So the big question: why does the kernel allow large data writes to fill up
> the memory and even swap out stuff just to get data-to-be-written into memory?

A good question, but not the real source of this problem, I guess. Judging by the previous posts and my own experience, this problem seems to occur with any concurrent I/O, possibly promoted by encryption. Provided that it is only one bug we are talking about.

I've noticed that earlier in the long list of comments. But could it be that others confuse the real issue with swapout slowing things down during a high-volume disk write?

(In reply to comment #468)
> I've noticed that earlier in the long list of comments. But could it be that
> others confuse the real issue with swapout slowing things down during a
> high-volume disk write?

This squares somewhat with my own experience:
1. The file cache is *very* aggressive, even pushing out to swap stuff I think I might be using.
2. Large writes to swap trounce interactivity (and little gets scheduled). Small writes seem not to have an adverse effect.

OK, I understand pushing out pages that haven't been used in a while in favour of more current caches; however, doing something that can result in 1.5 GiB going to page cache on a 2 GiB system (large copy, kernel compile) seems to provoke these large writes which make everything go slow.

(In reply to comment #469)
> 1. The file cache is *very* aggressive, even pushing out to swap stuff I think
> I might be using.

Now, I'm not a kernel hacker, but a programmer after all, and to me it seems to be an easier job to fix the aggressive file cache than to fix this "large I/O operations ......" thing - which is not at all that concrete and varies across platforms, machine specs, etc. Maybe fixing the aggressive file cache would fix a lot of people's problems - I'm guessing that the file cache behaves 100% the same on all systems. Is that a correct assumption?

(In reply to comment #470)
> Now, I'm not a kernel hacker, but a programmer after all, and to me it seems
> to be an easier job to fix the aggressive file cache than to fix this "large
> I/O operations ......" thing - which is not at all that concrete and varies
> across platforms, machine specs, etc.

Isn't there already a knob for controlling the kernel's preference for swapping anonymous pages out to disk versus retaining cached/buffered block-device pages?
/proc/sys/vm/swappiness — http://kerneltrap.org/node/3000 Our apps are appearing to hang because their GUI threads have stalled while waiting on pages (containing either executable code or auxiliary data like pixmaps) to come back into RAM from the disk. Reading those pages back in is taking forever because the disk queue is full of writes. The situation is worsened because reading the pages is not pipelined since the requests are being submitted from the page fault handler, so a program executing while huge disk activity is in progress will submit a request to load one page from disk and stall; then when that request is fulfilled, the program will execute a few hundred instructions more until its instruction pointer crosses into another page that isn't loaded from disk, whereupon the page fault handler will be invoked again, a new request will be submitted to the disk queue, and the application will hang again. Repeat ad infinitum. Meanwhile, while the program is stalled waiting for the page it needs to be loaded in from disk, all the rest of its pages are being evicted from RAM to make room for the huge disk buffers, thus perpetuating the problem. I would think the easiest and most reliable solution to this problem would be for the kernel to prefer fulfilling page-in requests ahead of dirtying blocks. If there are any requests to read pages in from disk to satisfy page faults, those requests should be fulfilled and a process's request to dirty a new page should be blocked. In other words, as dirty blocks are flushed to disk, thus freeing up RAM, the process performing the huge write shouldn't be allowed to dirty another block (thus consuming that freed RAM) if there are page-ins waiting to be fulfilled. Created attachment 27243 [details]
vmstat for my system running "stress -d 1" without hanging.
My system just logged into KDE around 650 Mb of memory used by applications
prior to starting "stress -d 1"
(In reply to comment #471) > > I would think the easiest and most reliable solution to this problem would be > for the kernel to prefer fulfilling page-in requests ahead of dirtying > blocks. > If there are any requests to read pages in from disk to satisfy page faults, > those requests should be fulfilled and a process's request to dirty a new > page > should be blocked. In other words, as dirty blocks are flushed to disk, thus > freeing up RAM, the process performing the huge write shouldn't be allowed to > dirty another block (thus consuming that freed RAM) if there are page-ins > waiting to be fulfilled. I agree with you on the preference-part. It will fix the race-like situation. But as I understand, it will not keep the file cache from swapping out a single page? (In reply to comment #473) > I agree with you on the preference-part. It will fix the race-like situation. > But as I understand, it will not keep the file cache from swapping out a > single > page? Implementing my suggestion wouldn't prevent mmap'd pages from being evicted from RAM to make room for file cache. It would only mean (1) that the file cache wouldn't be allowed to consume pages that are needed to satisfy page faults, and (2) that requests to read pages in from disk (whether from swap (anonymous pages) or from mmap'd files such as executables) would be serviced ahead of any other reads or writes in the disk queue. (In reply to comment #471) > Isn't there already a knob for controlling the kernel's preference for > swapping > anonymous pages out to disk versus retaining cached/buffered block-device > pages? > > /proc/sys/vm/swappiness — http://kerneltrap.org/node/3000 (For some reason playing with this doesn't seem to do anything, but perhaps that's another bug report.) > I would think the easiest and most reliable solution to this problem would be
> for the kernel to prefer fulfilling page-in requests ahead of dirtying
> blocks.
> If there are any requests to read pages in from disk to satisfy page faults,
> those requests should be fulfilled and a process's request to dirty a new
> page
> should be blocked. In other words, as dirty blocks are flushed to disk, thus
> freeing up RAM, the process performing the huge write shouldn't be allowed to
> dirty another block (thus consuming that freed RAM) if there are page-ins
> waiting to be fulfilled.
Matt: Wouldn't setting dirty_bytes to low values make sure that processes never dirty more than a fixed number of pages, and hence never get to consume more RAM until their existing dirty pages are flushed? Or maybe that's not how dirty_*bytes is designed to work. Maybe (I am guessing here) it just controls when the flush begins to happen for dirty pages, and the application can still continue to dirty more pages. But if dirty_bytes controls when the process itself has to flush its dirty buffers, then it would be busy flushing and waiting on I/O to complete and couldn't be dirtying more memory, right?

So, it does look like setting dirty_bytes to a low value like 4096 would produce an extreme case where the process's writes are almost completely synchronous and the page cache is not pounded at all.
Can someone try this extreme test? Set dirty_bytes to 4096 and rerun your scenario. The sequential bandwidth seen by the disk stresser will go down the drain, but your system should survive.
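(A sketch of that experiment, run as root; 8192 rather than 4096 is used here because of the documented two-page minimum pointed out in the next comment:)

old=$(cat /proc/sys/vm/dirty_bytes)      # 0 means dirty_ratio is in effect instead
echo 8192 > /proc/sys/vm/dirty_bytes     # two 4 KiB pages, the documented minimum
stress -d 1 -t 60                        # rerun the disk-writer scenario
echo "$old" > /proc/sys/vm/dirty_bytes   # restore the previous setting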
According to http://www.kernel.org/doc/Documentation/sysctl/vm.txt:

"Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit will be ignored and the old configuration will be retained."

Better make that 8192. Also, you could try lowering /proc/sys/vm/dirty_ratio.

(In reply to comment #477)
> According to http://www.kernel.org/doc/Documentation/sysctl/vm.txt:
>
> "Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
> value lower than this limit will be ignored and the old configuration will be
> retained."
>
> Better make that 8192. Also, you could try lowering /proc/sys/vm/dirty_ratio.

Setting dirty_bytes to 8192 solves the slowdown for me. Of course it ends up with a throughput from "stress -d 1" which is considerably lower than when dirty_bytes was set to 0 (i.e., when dirty_ratio applies: <quote-from-doc> If dirty_bytes is written, dirty_ratio becomes a function of its value (dirty_bytes / the amount of dirtyable system memory). </quote-from-doc>).

Now, dirty_ratio is 60 by default, so 60% of my system memory can be used for dirty pages. On my system that is 1.2GB. So if I do not have 1.2GB free and I am doing some high-throughput write to disk, my system will hang. I think it is a bit of overkill, especially seen from the perspective that a standard hard disk can write no more than 100MB/sec. The kernel should be reasonable enough to behave and not just hog the majority of system memory during high-throughput operations. Just think of a system with 8GB of memory where 6GB is used by running applications. Running "stress -d 1" on such a setup would kill it: the writing application would be allowed to use 60% of the 8GB for dirty pages. It seems massive, so please correct me if I'm wrong, since I have not done a test on such a system.

Søren: These parameters exist to tune the system behavior. There are other parameters which control the behavior of pdflush and FS journal threads, but getting these all in harmony to make the system perform well in all scenarios is not an easy task. I think the hope is that pages will be reclaimed fast enough by pdflush if its parameters are tuned as well. But I agree that by default letting one process dirty 60% of physical RAM before it blocks itself on I/O flush is a bad thing, particularly when filling RAM is many orders of magnitude faster than emptying it to disk. A couple of rogue user processes can bring the system down in a hurry. Linux needs to account for the disparity between RAM and disk, and how that disparity has increased manyfold in recent times.
A 2GB system is considered the minimum these days. Filling 60% of it will take a few microseconds even on the slowest RAM, but emptying it to disk will take many seconds, if not minutes, on normal drives.

Apologies for the double post. The first one timed out on me. While reposting, I realized the fastest drives on the market today (the SSDs) will likely be able to do this stuff in seconds, so I changed the word fastest to normal...:-)

devsk: Yeah, but shouldn't those knobs be there to squeeze the most out of your system? The defaults should be set in a way that is not destructive, e.g. swappiness = 0-10 or dirty_ratio = 10 or a combination of both or some other settings. People will experience trouble with the default settings anyway, so reports like "high-throughput disk writes are slow" are certainly a lot better than "high-throughput disk writes lock my machine".

What are the best first steps to solving this:
1. Changing defaults on existing knobs?
2. Changing the kernel code?

There are currently various patches dealing with various aspects of writeback. Some or all of these _may_ be ready for inclusion in 2.6.36.

Nice .... where are those? If they apply to 2.6.35-something I will be happy to try them out.

Here are a couple of things being worked on.
http://lwn.net/Articles/397003/
http://lwn.net/Articles/396512/
You'll need to dig around for the patches.

Wu Fengguang of Intel has started looking through this bug report. He has some patches that he'd like people to try.
http://lkml.org/lkml/2010/8/1/40
http://lkml.org/lkml/2010/8/1/45

Created attachment 27313 [details]
screenshot of extreme iowait at ridiculously low throughput
Created attachment 27314 [details]
Wu Fengguang's anti-io-stall patch rebased for vanilla 2.6.35
@#486
The posted patches didn't apply to recent kernels, so I just rebased them for the latest kernel release and compiled. Will restart the machine now and party wildly if this small change FINALLY fixes this issue.
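(For anyone wanting to try the same thing, applying the rebased patch (attachment 27314) to a vanilla tree looks roughly like this; a sketch in which the local patch file name is hypothetical:)

cd linux-2.6.35                           # vanilla 2.6.35 source tree
patch -p1 < anti-io-stall-2.6.35.patch    # attachment 27314, saved locally
make oldconfig && make -j4                # rebuild with the existing config
make modules_install install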
(In reply to comment #487)
> Created an attachment (id=27313) [details]
> screenshot of extreme iowait at ridiculously low throughput

I have found that even when dstat shows 0B throughput, the disk can still be very active. So dstat seems not to measure the amount of bytes actually going to the disk.

2.6.35 + patch from #488

Mouse froze four times for 1-1.5 seconds while dd wrote.

When sweep opens the file and swap grows from 0 to 1.3 GiB, the mouse freezes. After opening the file, Kopete loses its connection to the Jabber account and KWin disables desktop effects.

Created attachment 27324 [details]
test results
(In reply to comment #490)
> 2.6.35 + patch from #488
>
> Mouse froze four times for 1-1.5 seconds while dd wrote.
>
> When sweep opens the file and swap grows from 0 to 1.3 GiB, the mouse freezes.
> After opening the file, Kopete loses its connection to the Jabber account and
> KWin disables desktop effects.

Did you ensure 50% memory usage before starting the test? Just to make sure pageout is triggered.

I'm just copying a few files from an NFS folder to USB on my computer. I found that the I/O wait times are huge but the network is not in use. This is strange, as the folder is an NFS one attached via GB ethernet. The problem is that the I/O wait times are making my desktop unusable: the window manager takes a lot of time to move a window around, the desktop does not respond well, the mouse hangs sometimes... This is a mess. This is the kernel:
Linux azul1 2.6.35-10-generic #15-Ubuntu SMP Thu Jul 22 11:10:38 UTC 2010 x86_64 GNU/Linux

Some maintainer of the kernel should sort this bug out: separate it into a few different bugs (because I'm sure there is more than one issue related to this) and try to resolve them. Divide and conquer! Thank you guys!

The patch from #488 does not solve the problem on my machine. My machine starts to stall even if there are still 2GiB of 8GiB RAM free. The menu stalls if the icons are not loaded and there is heavy I/O. It starts to stall sooner while executing
dd if=/dev/zero of=t1 bs=1M count=8K (throughput ~48.2MiB/s)
instead of
dd if=/dev/zero of=t1 bs=4K count=2M (throughput ~52.7MiB/s)
The test data is written on the inner part of the disk, while the OS is on the outer part. All partitions are ext4. High fragmentation caused by LVM snapshots increases this problem.

Hi, I did some tests with the patch from #488.

Test procedure:
- filled up memory to 70/80% (4GB physical memory total)
- executed "stress -d 1"
- played around changing windows, changing tabs in chromium, accessing menus, etc.

-----------------------------------------
2.6.35 vanilla, 10GB swap partition on:
Complete hang, no response at all from mouse or keyboard, had to reboot manually

2.6.35 vanilla, 10GB swap partition off:
A few hiccups, but system was still usable, although slow.

2.6.35 + patch from #488, swap partition on:
A few hiccups, but system was still usable, although slow.

2.6.35 + patch from #488, swap partition off:
A few hiccups, but system was still usable, although slow.
-----------------------------------------

So the patch from #488 seems to solve the problem for me. The hiccups and slowness can be attributed to my relatively slow magnetic disk and the fact that my partition is encrypted under LUKS.

This is a very important bug for Linux on the desktop. I'm glad there is a patch out for it, and I'll continue to use the patch for my kernels, but it should definitely be fixed in mainline!

Hi all, has anyone seen this article?
http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw
Are they talking about the same patches? Sounds like the same issue.

I tried the patch from #488 on 2.6.35. When running
dd if=/dev/zero of=/tmp/test bs=1M count=1M
the system was almost flawless: windows switched quickly, open programs reacted instantly. It might be that I'm mistaken, but I'm under the impression that my programs take more time to launch. I wonder if anyone else sees that.

*** Bug 15463 has been marked as a duplicate of this bug. ***

#496: yes, the patch mentioned on Phoronix IS the one from #488, and as reported by several people it seems to improve I/O latency (at the cost of throughput?)
but falls short of completely preventing stalls. The strange thing for me is that the problems seemingly increase with uptime... Besides, I noticed some rogue flush-btrfs-1 threads causing 1MiB/s average disk writing (uptime > 2 days, even after bringing down the services causing heavy I/O).. posted a blktrace of that to the linux-btrfs ML but no answer yet ^^

Wow, this one's a tricky one. One thing I noticed a few kernel revisions back that might be relevant: there were a lot of processes in IOWAIT state (result of compiling packages, BOINC, munin-graph, ntop... and then some) and I wanted to prioritize a single process, so I issued an
ionice -p xxx -c1 -n0 (realtime: prio 0)
What I expected was that that process would instantly get its I/O through and pick up work - alas, it took SEVERAL MINUTES before it did. That really wtfed me.. Is this broken by design? Shouldn't io-renicing take effect immediately?

#496 doesn't solve the problem IMHO. Tested on Ubuntu Karmic (10.04) with vanilla 2.6.35. A simple 'dd if=/dev/zero of=/some/file bs=1M' caused 100% load (dual-head Core2 Duo E8500) and high latency even on ^C'ing the dd process itself. Need more info? Ask please.

I tried the patch rebased for 2.6.35:
https://bugzilla.kernel.org/attachment.cgi?id=27314
It is probably OK, but my first test is to fill my memory with all the apps I can find and then run "stress -d 1". And as expected, it started paging stuff out. You other guys must have the exact same problem, at least you, Pedro. For me the responsiveness drops because of paging out.

echo 10 > /proc/sys/vm/vfs_cache_pressure
echo 4096 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb
echo 100 > /proc/sys/vm/swappiness
echo 0 > /proc/sys/vm/dirty_ratio
echo 0 > /proc/sys/vm/dirty_background_ratio

This solution works for me. Or use the "sync" fs-mount option.

(In reply to comment #502)
> echo 10 > /proc/sys/vm/vfs_cache_pressure
> echo 4096 > /sys/block/sda/queue/nr_requests
> echo 4096 > /sys/block/sda/queue/read_ahead_kb
> echo 100 > /proc/sys/vm/swappiness
> echo 0 > /proc/sys/vm/dirty_ratio
> echo 0 > /proc/sys/vm/dirty_background_ratio
>
> This solution works for me. Or use the "sync" fs-mount option.

Yeah, but testing a kernel patch with those settings is not good for seeing its effects.

(In reply to comment #501)
> I tried the patch rebased for 2.6.35:
> https://bugzilla.kernel.org/attachment.cgi?id=27314
>
> It is probably OK, but my first test is to fill my memory with all the apps I
> can find and then run "stress -d 1". And as expected, it started paging stuff
> out. You other guys must have the exact same problem, at least you, Pedro. For
> me the responsiveness drops because of paging out.

Hi Søren, as said in my comment, I do have the responsiveness drop, but I don't think that is a bug. If you are swapping to a slow disk, that is kind of expected. However, what is not expected is a complete loss of responsiveness, with the UI hanging even for a few seconds. I find that the mentioned patch improves this situation a lot vs the vanilla kernel. Of course, the best option yet is to disable swap, but for me 4GB of RAM is not enough...

I too have reactivity problems on Linux when I do large file copies. Other OSes are very responsive when doing multiple file copies, but not Linux. Windows has all I/O async (no sync possible; read the Qt docs), so why not have the same option in the Linux kernel?

After testing the patches intensively, I have to say that although they do improve the situation, they do it only slightly.
I guess the best solution is still disabling swap.

Also, what's the idea of having a swappiness tunable if it doesn't work? I can set it to 0, and even though I have only 70% of physical memory in use, the system starts swapping to disk.

(In reply to comment #506)
> After testing the patches intensively, I have to say that although they do
> improve the situation, they do it only slightly. I guess the best solution is
> still disabling swap.

It does help initially but not always. Under memory crunch, I found my laptop completely unresponsive even though swap was off (RAM is 3GiB).

> Also, what's the idea of having a swappiness tunable if it doesn't work? I can
> set it to 0, and even though I have only 70% of physical memory in use, the
> system starts swapping to disk.

That's weird. On my box, it does work the way it is designed. I have overall concluded that the default value of 60 is correct. If there is a buggy application, that should be fixed. I wouldn't be interested in OOMs on my box.

The memory count actually drops when the system becomes unresponsive during the copying of a large file, if a bunch of small files was copied immediately before.

I've added some information on the Ubuntu bug page, but will add it here for completeness' sake:
1) I'm seeing this problem extremely frequently due to an unrelated bug that makes X leak memory.
2) On a machine with 4GB memory and no swap, the disk starts thrashing like crazy when 60-70% of the memory is used. It's so bad that I can't even log in on a console, as getty times out before I get a chance to enter the password.
3) If swap is enabled on the same machine, it will start swapping out. Doing a "swapoff -a" will force the swap contents back in as planned, but it happens at approximately 500KB/s.

I have compiled the new 2.6.36 kernel today, and I found this bug is REALLY fixed on my notebook! Copying a 700MB movie to a USB disk became very smooth and quick, GUIs are very responsive, much better than 2.6.35.4 (the last kernel of Zenwalk). Just like someone said, the angels are singing again! Congratulations! Great work! Long live Linux!

I'm not seeing this issue on 2.6.36 amd64, 4GB RAM, 3GB swap, swappiness 20. Running 'stress -d 1' and browsing websites for 15 minutes with no issues.
Still has about 1 second page faults when "stress -d 1" or "pv /dev/zero > qqq".
Swap is off.
This:
> echo 10 > /proc/sys/vm/vfs_cache_pressure
> echo 4096 > /sys/block/sda/queue/nr_requests
> echo 4096 > /sys/block/sda/queue/read_ahead_kb
> echo 100 > /proc/sys/vm/swappiness
> echo 0 > /proc/sys/vm/dirty_ratio
> echo 0 > /proc/sys/vm/dirty_background_ratio
does not help.
Uniprocessor system, i386. 1.5G of RAM. 1G of it was in use by applications when testing.
Just wanted to add my two cents, since I've been experiencing this problem for a very long time now on various machines. I had just adapted by doing nothing on the OS when I have large file copies running. But somehow I stumbled upon a solution for this, maybe.

I had these problems, the one that you are talking about in this bug and some others, after I started using MD-RAID. First I thought it was something with the I/O scheduler. Tried all the schedulers there are: no-op, CFQ, deadline, anticipatory... Some helped a little bit, some didn't. Then I thought it was something with the FS; tried ext2, ext3, XFS and now ext4. The same problem prevailed. When I started copying large files I had OS "hiccups": everything that had to do some disk work stopped. Music and OpenGL were still functioning normally, only the responsiveness of the system was gone for 1 or 2 seconds. No browsing, no changing terminal windows. Then I thought that it had something to do with swap, too.

A few days ago I got myself a new machine, i7/950, 2 x SATA3 WD HDs, 12GB of RAM, and I installed a new OS, pure 64-bit, kernel 2.6.36. The thing I had to do was to copy my old data to the new disks and reuse the old disks. Now, the way I did it is very important. I took a 1 TB WD SATA3 HD, made some partitions (6 to be exact) and compiled a new OS. Then I copied the old data from the old RAID. The old RAID was 4 partitions on each disk with MD RAID 1 on two partitions each. While I copied the data I had these hiccups also, with the new system.

I had this idea, since it is now possible to make partitioned RAID with MD and you can take whole disks for an array, to make a RAID 10 out of these four disks, 2 new ones and 2 old ones. So it was like "mdadm --create /dev/md0 ... --raid-devices=4 /dev/sda /dev/sdb...". Worked like a charm. Then I partitioned the array, "fdisk /dev/md0". No problem there. Then I copied the old stuff from the single hard disk, with 6 partitions, to the new array. Now here is the interesting bit. No hiccups!!! Throughput was around 120MB/s and the OS was working as smooth as a baby's bottom. And it was the same OS, no changes at all regarding the kernel build or anything else. Read throughput was 270MB/s (dd test).

But since rootfs won't work on a partitioned MD array (some kernel race problem, but that's another story) I had to change my setup on the new HDs. So again I created 4 normal partitions on each disk: one partition from each HD for the bootfs RAID 1, another 4 for swap, another 4 for the rootfs, also RAID 1, and the last four for a RAID 10, which I partitioned into two separate partitions (srv and home). And the hiccups came back.

So this isn't hardware related, because I can reproduce this problem on lots of hardware (a list follows). It's not the filesystem or such, because I've used them all. It's not swap, because on this new machine it didn't start to swap while I was copying. But this problem always comes up when I make more (normal) partitions for MD-RAID.

The list of hardware:

Quad-Core 6600, I think it was the ICH7 chipset, 8GB RAM, 2 x WD10EARS. I think the kernel was 2.6.20-something, 32-bit system, LinuxFromScratch 6.1 or 6.2, can't remember. The system worked for three years up to now.
The partitioning of the disks was sda1,sda2,sda3,sda4,sdb1,sdb2,sdb3,sdb4.
The RAID arrays were md0 -> (sda1,sdb1); ... ; md4 -> (sda4,sdb4)
md0 -> /boot
md1 -> swap
md2 -> /
md3 -> /srv

Fujitsu Siemens RX100S6 x 2:
1x XEON 3220 (Quad), 4GB memory, and I can't remember the chipset.
1x XEON E3110 (Dual), 4GB RAM, still can't remember the chipset.
kernel 2.6.32.10, pure 64-bit system, LFS 6.5

And now: i7 950, 12GB RAM, ICH10 chipset, 2 x WD10EARS, 2 x WD1002FAEX (+1 temporary), kernel 2.6.36, pure 64-bit, LFS 6.7.

The setup that worked: sda1,sda2,sda3,sda4,sda5,sda6; sdb,sdc,sdd,sde
md0 -> (sdb,sdc,sdd,sde) RAID 10
md0p1 -> boot (tried it, but grub couldn't do it)
md0p2 -> swap (no problem there)
md0p3 -> / (tried it after a workaround for grub to boot from RAID 10, but the kernel didn't want to play along)
md0p4 -> extended part.
md0p5 -> /home (no problem there)
md0p6 -> /srv (no problem there)
sda1 -> /boot
sda2 -> swap
sda3 -> /
sda5 -> /home
sda6 -> /srv

Unfortunately I had to dump this setup because of a race condition where the kernel can't put a partitioned MD array together before the rootfs boot process starts. :-(

Now the setup that doesn't work (the one with the hiccups):
sda1,sda2,sda3,sda4,sdb1,sdb2,sdb3,sdb4,sdc1,sdc2,sdc3,sdc4,sdd1,sdd2,sdd3,sdd4
md0 -> (sda1,sdb1,sdc1,sdd1) RAID 1 -> /boot
swap -> (sda2,sdb2,sdc2,sdd2), didn't know what else to do with the free space
md1 (which somehow changed to md126 automagically after the third boot) -> (sda3,sdb3,sdc3,sdd3) RAID 1 -> /
md2 (which somehow changed to md127 automagically after the third boot) -> (sda4,sdb4,sdc4,sdd4) RAID 10 ->
md2p1 (changed to md127p1) -> /home
md2p2 (changed to md127p2) -> /srv
and the temporary disk, which used the sda slot until I copied everything to the new setup.

Just to mention that throughput is still OK, around 80MB/s write (didn't try read yet), except for those hiccups.

So, what else do you need from me so that we can kill this pesky bug?? I can do everything that is not going to kill my system, 'cause I'm using it for everyday work. Everything else, torture tests and so on, is OK after working hours.

Oh, and yes, I tried the /sys/block/sdX/device/queue_depth thingie; it worked for 5 minutes and then it was back to hiccuping. dd is around 120MB/s...

To tackle this bug, there needs to be deep digging by the people who have these bugs, or good debug data has to be generated, and good info has to be given on the system, because there can be several bugs out there with the same symptoms as this one. To solve this, the best you can do is file individual bug reports with complete information. If you cannot give complete information, don't post that report, because then you can be sure it cannot be solved. The more relevant info we get, the easier it becomes to detect the problems.

First install the newest kernel, because that has the newest code, and it will reduce the chance that you'll run into an old and already-fixed bug. At the time of writing it's 2.6.36. Then test again; if it still happens, file a bug report.
First give correct system information:
Kernel: uname -a and cat /proc/version
Architecture: also from uname -a
Distro: name and version (could be handy for distro-specific patches)
CPU info: cat /proc/cpuinfo | grep -e '\(model name\|bogomips\|MHz\|flags\)'
Mem info: cat /proc/meminfo | grep MemTotal
I/O scheduler used: cat /sys/block/sdX/queue/scheduler
Hard disk configuration: RAID or not, type of disks, speed of disks, partitions used and filesystems used
Hard disk speed by hdparm:
hdparm -tT --direct /dev/sdX
hdparm -tT /dev/sdX

Give dumps of the following commands:
lshw
dmesg
lsmod
cat /proc/swaps
cat /proc/meminfo
cat /proc/cmdline
cat /proc/config.gz | gunzip -

And give dumps of the following files, for every disk:
/sys/block/<disk>/queue/*
/sys/block/<disk>/queue/iosched/*
/proc/sys/vm/*

This is for information, so the developers can detect what configuration the system has. And if there are known configurations or drivers which are bad and maybe giving the same symptoms, they will be noticed earlier.

If you want to use a script to help you collect the information, you can use the script located at:
http://github.com/meghuizen/systeminfo
which will build a tar.bz2 for you that you can give as an attachment, so you'll have complete information.

After that, learn a bit about the I/O scheduler, to make it easier for yourself to debug and understand the situation:
- http://www.linuxjournal.com/article/6931 (info on I/O schedulers)
- http://www.devshed.com/c/a/BrainDump/Linux-IO-Schedulers/
- http://kerneltrap.org/node/7637
- kernel-source/Documentation/block/iosched-description.txt (see: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=tree;f=Documentation/block;hb=HEAD)
- http://www.westnet.com/~gsmith/content/linux-pdflush.htm
- http://www.docunext.com/blog/2009/10/debugging-and-reducing-io-wait.html

There are some tools which are very handy to use. The Linux perf tool, for example, is very useful for debugging slowness and latencies in your system.

For some documentation on perf see:
- https://perf.wiki.kernel.org/index.php/Main_Page
- http://anton.ozlabs.org/blog/2010/01/10/using-perf-the-linux-performance-analysis-tool-on-ubuntu-karmic/
- http://blog.fenrus.org/?p=5
perf --help also gives you a lot of information.

And other profiling tools:
- http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/basic_profiling.txt;hb=HEAD

So to debug these problems, perf output is rather handy. If the slowdowns happen again, try to capture some perf record dumps, and perhaps perf timechart dumps as well, at the same time, so the developers can analyze those too. For example, perf top shows you what's currently happening in the kernel. perf bench can help you benchmark your system, so you can test changes across patches, kernel versions and tuning parameters.

Tried with recent official master: 18cb657ca1bafe635f368346a1676fb04c512edf
http://vi-server.org/vi/12309_report/linux-2.6.36-09212-g18cb657_i686-sysinfo.tar.bz2

While running "pv /dev/zero > qqq" (http://vi-server.org/vi/12309_report/fill.txt), after about 2 GB I get page faults:
http://vi-server.org/vi/12309_report/pagefault.txt
http://vi-server.org/vi/12309_report/pagefault2.txt

If I try the deadline or noop scheduler, I still get page faults, but only after about 5 GB of copied data (and probably not as often).

In the case of cfq, the speed jumps between 10 MB/s and 200 MB/s.
In the case of deadline or noop, it is more stable, around 40 MB/s.

Trying
> echo 10 > /proc/sys/vm/vfs_cache_pressure
> echo 4096 > /sys/block/sda/queue/nr_requests
> echo 4096 > /sys/block/sda/queue/read_ahead_kb
> echo 100 > /proc/sys/vm/swappiness
> echo 0 > /proc/sys/vm/dirty_ratio
> echo 0 > /proc/sys/vm/dirty_background_ratio
on this kernel leads to a low filling speed (lower than 10 MB/s, measured with pv). Also, after applying those settings, applications (starting with gpg2) begin to hang in uninterruptible sleep. I cannot stop the filling process (it probably hangs too).

P.S. Using this kernel I also cannot start the X server.

If somebody wants, I can try other settings, other kernel revisions, patches, or another config.

Checked a bit more with CONFIG_HZ_100 and CONFIG_PREEMPT_NONE: the same. The filling rate with vm.dirty_ratio=0 is 1 MB/s (with periodic stalls of everything). If I set vm.dirty_ratio to 1, it rises to 40 MB/s (stable). Long page faults when loading programs are present as well. Was testing with only 200 MB (of 1.5G) of memory filled.

While it feels like a general improvement with 2.6.36 (no audio stutter with swap, and building a kernel no longer drags the system down (and fills up the cache) like it did with 2.6.35), I still see cursor jerkiness when I first log in and start loading Firefox, Evolution and Pidgin (all at the same time).

I've come to face this problem when using the new cgroup-scheduler patch. PC: Samsung NC10 netbook, kernel 2.6.36 vanilla, Zenwalk snapshot. When trying to upgrade some packages in an X session and browsing the net at the same time, the latency increases badly, but not constantly, just in hitches. If I stop surfing the net and return to my packager, the system keeps working; otherwise it may hang so badly that I have to reboot with a sysrq key. If I turn off the cgroup scheduler in /sys, everything works fine. The kernel is compiled with full preemption, 1000 Hz timer.

Trying 162253844be6caa9ad8bd84562cb3271690ceca9 from zenstable/io-less-dirty-throttling-2.6.37 - the same. Page faults of random processes (including Xorg) jump over 1 second while "pv /dev/zero > qqq". The speed measurements by "pv" fluctuate (from 64 kB/s to 120 MB/s; avg 40 MB/s) just like on the usual 2.6.35-zen2.

I have a reproducible test sequence for 12309. It's easy:

Take a _SCRATCHED_ DVD. Put it into the drive and copy all the files on it to a HDD. The bug comes early :)

The system freezes COMPLETELY at the moment the drive reads a scratched sector.

Distro: Arch
Linux linuxhost 2.6.36-ARCH #1 SMP PREEMPT Fri Dec 10 20:01:53 UTC 2010 i686 AMD Athlon(TM) XP AuthenticAMD GNU/Linux

Drive (dmesg | grep TSS):
Feb 14 20:11:45 linuxhost kernel: scsi 2:0:0:0: CD-ROM TSSTcorp CDDVDW SH-S203B SB00 PQ: 0 ANSI: 5
Feb 10 12:05:36 linuxhost kernel: ata1.00: ATAPI: TSSTcorp CDDVDW SH-S203B, SB00, max UDMA/100

SATA controller (on the PCI bus, drive connected to it):
00:0a.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID Controller (rev 50)
(In reply to comment #521)
> I have a reproducible test sequence for 12309. It's easy:
>
> Take a _SCRATCHED_ DVD. Put it into the drive and copy all the files on it to
> a HDD. The bug comes early :)
>
> The system freezes COMPLETELY at the moment the drive reads a scratched
> sector.

I suspect this has more to do with the IDE bus than with the interaction between the kernel's block layer and the VM. Try this:

dd if=/dev/dvd of=/dev/null bs=2048

I bet you get the same freezes when it reaches the scratches.

I checked the same DVD with another DVD drive (that drive is on the IDE bus, not on the SATA bus). All was OK. No freezes at all. Any ideas? Is this another bug?

> Try this:
> dd if=/dev/dvd of=/dev/null bs=2048
> I bet you get the same freezes when it reaches the scratches.

You're right. But this is still the 12309 bug, isn't it?

(In reply to comment #524)
> But this is still the 12309 bug, isn't it?

No. However, this bug report has turned into a dumping ground for anyone experiencing any lagginess, regardless of cause. The actual bug here is related to the kernel preferring to evict memory-mapped executable pages when a process dirties blocks faster than they can be flushed to disk. The apparent hangs in responsiveness are due to threads (particularly GUI threads) triggering page faults and being unable to make progress until their code is re-fetched from disk. The fix should be to block the writing process from dirtying any more blocks well before the kernel starts evicting mapped executable pages from memory, but so far no one has been able to make it work correctly in all cases (afaik).

Also, should I rather file a new bug report for my issue?

Trying the kernel from writeback/dirty-throttling-v6: nothing seems to have changed, as usual. Still lengthy page faults (and other stalls) for firefox-bin while "pv /dev/zero > qqq". Should I provide more info about dirty-throttling-v6 (and how do I collect it)?

> The actual bug here is
> related to the kernel preferring to evict memory-mapped executable pages when
> a process dirties blocks faster than they can be flushed to disk.

Okay. Let it be so. However, the subject line for this bug is

> Large I/O operations result in poor interactive performance and high iowait
> times

and that's what I'm experiencing now, rsync'ing 100 GB worth of data with almost everything already there on the receiving side (thus making the receiving rsync read files heavily for the checksums). And I am dead sure this has nothing to do with virtual memory, as swap is completely off (I would probably need to compile a different kernel with no support for swapping to reconfirm). iowait rises to 90%, the load average shows disturbingly large numbers of up to 20, and unrelated processes like Xorg freeze, taking around 15 seconds to redraw the screen or move the mouse cursor or whatever.

What I thought this bug was about is that while one process does overwhelmingly large volumes of I/O, it should by no means impact other, unrelated processes which might not even use the disc subsystem, or not use the same disc. At least this is what Mac OS X does: for example, Transmission preallocates space for 40 GB worth of torrent data, naturally freezing in the process and ceasing to respond to any events, but then again, I can minimise its window and type code in Eclipse or anything, barely noticing the disc thrashing. I think I'm reiterating this example for the umpteenth time here, sorry if that's the case.

If I'm wrong and bug #12309 was reduced to its VM part, I just ask which bug covers the above problem: high iowait affecting unrelated processes, with no swapping involved. Is that #13347? I cannot follow it because the submitter uses a dialect of English I'm not quite capable of parsing.
If there's no specific bug, I'll take the time to report it, because it bugs me a great deal; however, I'm afraid I'll have to repeat most of the tests already conducted here. Please don't take it as if I'm trying to offend anyone, because I'm not. I just want to know where the specific symptom described above belongs. Thank you all for every effort to have it resolved.

@Yaroslav: Your misconception is that having swap disabled means that memory pages are never backed by disk blocks. That is simply not true. All it means is that *anonymous* pages cannot be backed by disk.

All Linux kernels launch processes from disk (via execve(2)) by memory-mapping the executable image on disk and then jumping to the entry point address in the mapped image. Since the entry point address is in a non-resident page, the CPU's attempt to fetch an instruction from it triggers a page fault, which the kernel then handles by loading the needed page (and usually several more) from disk.

When physical memory becomes scarce, the kernel has several tricks it may employ to attempt to free up memory. One of the first of these tricks is dropping cached blocks from the block layer and cached directory entries from the file system layer, which means that those blocks and dentries will have to be fetched from disk the next time they are accessed. One of the last tricks the kernel has is the OOM killer, which selects the "most offending" process and KILLs it in order to reclaim the memory it was using.

Somewhere in between those two tricks, the kernel has another trick it attempts for freeing up physical memory. It can force memory pages out to disk. If the system has swap enabled, the kernel may force anonymous pages (e.g., process heaps and stacks) out to disk. In all cases, however, the kernel may also choose to force memory-mapped pages out to disk. If those memory-mapped pages are read-only (such as is the case with executable images), then "forcing them out to disk" really just means dropping them from physical memory, since they can always be fetched back in later.

So, what does this mean in the context of this bug? The process that's hitting the disk a lot (usually it's dirtying blocks, but maybe it's possible that this happens even if it's just reading blocks) causes RAM to fill up with disk blocks. The kernel starts attempting its tricks to free up physical memory. One of those tricks is dropping memory-mapped pages from RAM, since they can always be fetched back into RAM from disk later. Then you the user switch applications or click on a button in the GUI or try to log into an SSH session, and what happens? Page fault! The code for repainting the X11 window or handling the button click or spawning a login session is not resident in memory because it was forced out by the kernel. That code now must be refetched from disk to satisfy the page fault, but uh oh, the disk is VERY busy and has very long queue depths, so it will be a while before the needed pages can be fetched. And at the same time as those pages are being fetched, the kernel is evicting other memory-mapped pages from RAM, so the responsiveness problem is just going to persist until the pressure on RAM subsides.

Ideally, the kernel should not allow so many blocks to be dirtied that it has to resort to dropping memory-mapped pages from RAM.
The dirty_ratio knob is supposed to control how much of RAM a process is allowed to fill with dirty blocks before it's forced to write them to disk itself (synchronously), but that does not appear to be working properly.

Incidentally, one reason this bug seems to manifest a lot more on 64-bit systems than on 32-bit systems is that 64-bit systems use Position-Independent Code (PIC) in their shared libraries universally, whereas 32-bit systems usually don't. Not using PIC means that 32-bit systems usually have to perform relocations throughout their shared libraries upon memory-mapping them, and those relocations cause private (anonymous) copies of those pages to be created, and those anonymous pages cannot be forced out to disk on systems without swap, so accessing those pages can never cause page faults. On 64-bit systems, PIC virtually eliminates the need to perform relocations in shared libraries, meaning most mappings of shared-library code are directly backed by the images on the disk and thus *may* be forced out of RAM and *may* cause page faults.

In principle, using PIC (on 64-bit systems, which have new addressing modes to make it efficient) is a good idea because it means only one copy of a library needs to be in RAM, regardless of how many processes map it, rather than one relocated, private copy for each process, but because of this bug, *not* making private copies of the library code is what's killing us, as the only copy we have in memory is evictable.

Please note, I am not arguing that the kernel should be making private copies of all executable pages; that would be the wrong solution. A better solution would be to prevent processes from dirtying so much RAM that the kernel has to start evicting pages that were memory-mapped by execve or dlopen (but not by plain old mmap!).

Thanks for the prompt reply and the patience to explain these things, but there's one more misconception on my side in desperate need of debunking. And it's about the I/O queues.

This misconception starts from a suggestion that not all data are equal. For example, non-resident executable pages are tier-0. I/O buffers for application usage, like those for read(), write() and friends, are tier-1. If there are no priorities on the queue, we cannot tell the origins of I/O requests apart and thus get what we have: swapping a process in has to wait until the queue is emptied by a disk-hungry application beast which just happened to fill it up. If we prioritize the queue and find a way to tell swap-in reads from application reads (say), on the other hand, it might improve interactive responsiveness. And the expense of having a tiered queue might be mitigated by employing it only on the media which hosts at least one mmap'ed process.

I say "it might improve things" because the solution is so obvious, in fact, that I have little doubt it has been thoroughly thought through and ultimately rejected. And I have no doubt that everyone who gets a single line of code accepted and committed into mainline is smarter than me in this respect[1], so this must have popped up a while ago.

[1] I'm no kernel hacker at all, just your average applications developer.

@Yaroslav: I agree. I've had the same thoughts regarding priority in the I/O queues. The biggest problem with this approach is that much of the queueing actually sits inside the hardware nowadays. SCSI TCQ (tagged command queuing) and SATA NCQ (native command queuing) have exacerbated this.
The Linux kernel can't do anything to prioritize queues inside the hardware, but it can limit how much of the hardware queue it will use, thus effectively keeping the queue in software only. Some proposed workarounds to this bug 12309 involve reducing the depth of the hardware queue that Linux is allowed to use, and that does seem to improve the worst case, although it severely degrades the common case.

Another workaround might be to prevent the kernel from evicting executable memory-mapped pages in the first place. This would be only a partial solution, though, as applications often memory-map resources that are not executable (for example, fonts, pixmaps, databases), so their responsiveness could hang on page faults for those resources just as readily as on page faults for code.

You are right about the workaround, but having a prioritised queue would be of help when, despite all workarounds, pages were actually evicted. I actually imagine it as a 4-tier queue: tier 0 for realtime processes, 1 for the swap-ins we are talking about now, 2 for every other virtual memory operation, and 3 for everything else (or count 2 and 3 as everything else, maybe).

My question then is as follows: yes, we cannot control the queueing of commands once they enter the hardware. But if we happen to know the hardware command queue size (which we do) and if we are able to tell how full it currently is (which I'm not quite sure about, but I think it can be figured out), we could split it so that every tier is permitted to fill no more than some percentage of the hardware queue. It would of course hit average-case performance, but it would still guarantee some bandwidth for higher-tier I/O, which is a good thing IMHO. Sorry for the pestering and probable ignorance, but I really want this nailed.

To everyone interested in this bug: an easy and reliable way to demonstrate the issues surrounding this bug (on a system without anonymous swap) is to mount a tmpfs that is sized as large as your physical RAM. Then start writing to it (slowly!!!). The kernel will be unable to flush those blocks to disk, as they are not backed by disk. As you continue writing to the tmpfs, the kernel will gradually evict everything else in your block cache and file system cache. At some point, the kernel will have run out of caches to evict and will start evicting memory-mapped pages. You'll know this has happened when the system responsiveness slows to a crawl and your disk starts thrashing. Yes, your disk will thrash, even though you're only writing to a tmpfs. The thrashing is due to all the page-ins of executable pages that are being accessed as various processes on your system struggle to keep executing their background threads and event processing loops.

If your writer process continues writing to the tmpfs, your system will become completely unusable. If you're lucky, eventually the kernel's OOM killer will be invoked. The OOM killer probably won't choose your tmpfs writer as its victim, though, so you'll have only a short time to kill the writer yourself before your system grinds to a halt again. If you do manage to get it killed, you can simply unmount the tmpfs, and everything will return to normal in short order. You will notice a bit of lag the first time you switch back to other applications that were running, as they will trigger page faults to get their code loaded back into RAM, but once that's done, everything will be as usual.
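(A condensed sketch of that demonstration; it is destructive on a loaded system, so use a test box. /mnt/ramtest is a hypothetical mount point, and size=100% sizes the tmpfs to all of physical RAM:)

mkdir -p /mnt/ramtest
mount -t tmpfs -o size=100% tmpfs /mnt/ramtest
for i in $(seq 1 64); do                          # write slowly, in 64 MiB steps
    dd if=/dev/zero of=/mnt/ramtest/fill.$i bs=1M count=64 2>/dev/null
    sleep 5                                       # watch Cached/Mapped shrink in /proc/meminfo
done
umount /mnt/ramtest                               # recovery: everything returns to normal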
It would have made sense if only starting new processes was slow. Copying large volumes of data slows down even the mouse cursor, even though the Xorg HID driver already sits in memory. If what you've described affects a driver that is already in memory, the entire architecture has to be abandoned. That is, so to say, a definition of the problem, not an excuse.

Hm, I then have another wild suggestion. It is in fact a rare event that a process sits in memory but wakes up only once in a blue moon, such that it could be harmlessly paged out without bringing the system to a halt. From my desktop experience I can only remember LibreOffice sitting on my long-running machine and being actually used once in two weeks or so. If the problem is really so grave that an often-running process (like Xorg!) is selected by the kernel to be paged out, why not work this around by disabling eviction of processes' pages altogether? I think it must be somewhat easier than designing an over-engineered strategy for choosing what pages to throw away, testing it over a couple of years, finding bugs in the very design, throwing it away, designing another one, and so on.

I would love to see a flag which I could set per control group. If the flag is set, pages owned by processes in that cgroup are never swapped out. Combined with a pessimistic overcommit policy, it could help at least a bit. Or at least it would be worth a try.

(In reply to comment #535)
> It would have made sense if only starting new processes was slow. Copying
> large volumes of data slows down even the mouse cursor, even though the Xorg
> HID driver already sits in memory. If what you've described affects a driver
> that is already in memory, the entire architecture has to be abandoned.

If you're seeing the mouse cursor lag/skip while copying large volumes of data, an alternative explanation could be that you're using PIO mode for your data transfers rather than DMA. However, as you identify, it's possible that the X.org driver that handles the mouse input is indeed being paged out, and that would result in mouse interrupts triggering page faults, and the mouse cursor would not update on screen until the code for doing so had been paged back in.

To say the entire architecture must be abandoned is too extreme. Memory-mapping executable images is a very efficient mechanism that ordinarily works beautifully. This bug is creating pathological conditions that should never occur.

(In reply to comment #536)
> If the problem is really so grave that an often-running process (like Xorg!)
> is selected by the kernel to be paged out, why not work this around by
> disabling eviction of processes' pages altogether?

You can't do that. Consider a process that maps a 1 TB file into memory and then starts randomly reading from it, thus causing more and more of the file to be loaded from disk into physical memory. You *must* allow pages to be evicted, or you will run out of RAM.

Don't try to solve a problem that doesn't exist. The actual problem here is that the block layer is using too much RAM for dirty (or possibly even clean) blocks. To demonstrate to yourself that this is so, you may try another of the proposed workarounds, which is to mount your file system in "sync" mode, which causes all file writes to be performed synchronously rather than being buffered and written back later. Under that constraint, you will never run into this bug, because the block layer is never allowed to use so much RAM that the kernel starts paging out "hot" memory-mapped pages.
(By "hot," I mean pages that are regularly being accessed, such that you would notice if they had to be paged back in from disk.) Okay, sync might work, but it also would make filesystems slow as hell and contribute to media wear from another side. If what you say is the case, and I have no reason for disbelief, then there must be a way to limit the number of dirty blocks (and total blocks) which may exist before buffers are flushed. E. g., there's X seconds of commit interval or Y dirty blocks, whichever comes first, and a max Z buffered blocks in total per device or per system. This would be 'almost sync', I think, and it would solve one more problem with USB flash media. The problem is that too big write buffers tend to be flushed at a sub-optimal speed, thus increasing the total time needed to copy and sync the data. Again, this does not occur neither with Windows nor with OS X. And they don't mount 'sync'; they buffer writes (which is a good thing with any device with expensive and wearsome writes), it's just that their buffers are considerably smaller in size than those of Linux. I'd be happy to know that a solution of limiting buffer sizes exists, this at least would enable us to fine-tune the system so that in 90% of use cases the problem wouldn't appear, and that it would appear only in the cases where it's tough anyway. @Yaroslav: There is already a knob for tuning the maximum amount of RAM that may be used for holding dirty blocks. From Documentation/sysctl/vm.txt: > dirty_ratio > > Contains, as a percentage of total system memory, the number of pages at > which > a process which is generating disk writes will itself start writing out dirty > data. The intent is as you describe: asynchronous writing until dirty_ratio is reached, and then synchronous writing only. "dirty_ratio" is 10% by default. You can test if it's working by starting a large write to disk (`dd if=/dev/zero of=/bigfile bs=1M`) and monitoring the "Dirty" counter in /proc/meminfo (`watch grep Dirty /proc/meminfo`). For what it's worth, it does work for me (and I haven't seen this bug manifest on my system in quite a while). I'm running Linux 2.6.36-gentoo-r5. I can still get the unresponsiveness and disk thrashing to happen using the tmpfs test case I described in comment #534, but that's not a failing of the kernel; that's a failing of the user (filling a tmpfs too much). (In reply to comment #537) > an alternative explanation could be that you're using PIO mode for your data > transfers rather than DMA. However, as you identify, it's possible that the Excuse me, I am using you said? That would be like, specifically configuring the kernel to use PIO? Why would anyone do that? [ 1.101092] ata2.00: ATA-7: WDC WD3200KS-00PFB0, 21.00M21, max UDMA/133 [ 1.101205] ata2.00: 625142448 sectors, multi 0: LBA48 NCQ (depth 1), AA [ 1.102146] ata2.00: configured for UDMA/133 [ 2.191312] ata13.00: ATA-7: ST3160215A, 3.AAD, max UDMA/100 [ 2.191343] ata13.00: 312581808 sectors, multi 16: LBA48 [ 2.266143] ata13.00: configured for UDMA/100 (In reply to comment #540) > That would be like, specifically configuring > the kernel to use PIO? Why would anyone do that? The kernel can fall back to PIO mode if DMA mode is encountering problems (which can happen with faulty hardware). It happens with CD/DVD drives more often than with hard drives. The next time you encounter system sluggishness and the mouse cursor starts skipping, see if you can get a readout of /proc/meminfo (while the sluggishness is happening). 
If your "MemFree" is very low *and* your "Cached" or "Dirty" is very high, then you might be suffering from this bug. dirty_ratio is not really a good measure of when to start flushing to disk. On a 24GB system, even 1% may be large for your disks to handle. Its better to configure dirty_bytes and dirty_background_bytes. dirty_bytes applies to the process which is doing the IO and dirty_background_bytes applies to kernel flush threads. When these thresholds are hit, if sum total of IO happening in the system is at a rate higher than your disks can take, you will start seeing very initial symptoms of this bug. The overall flow has been described well by Matt. I think this is precisely what's happening. One way to avoid the issue would be set the dirty_bytes and dirty_background_bytes in such a way that their sum total is within reasonable ratio of your disk's sequential bandwidth. When a Linux system is in steady state with a reasonable uptime, it will likely use all RAM for read side caches. It will free up those on demand when it comes under memory pressure (which may be created by large IO). By keeping the (dirty_bytes + dirty_background_bytes) a multiple of your disk's raw speed, you can put a bound on the overall latency of the system. For example, I don't let dirty to go beyond 200MB on my laptop. It makes all my sequential operations bound by the sequential speed of the disk but lets the small random IO to be buffered (so, its better than "sync" mode of the FS in that sense). And can we find a solution that would apply in the case where the system is running out of free RAM and starts swapping out everything? I often experienced total unresponsiveness of both X and the consoles when a program tries to use more RAM than is available, and I wasn't even even able to kill the process manually (forced reboot). Maybe that should be considered as a pathological case requiring just the OOM killer to be more aggressive - I don't know. (In reply to comment #543) > And can we find a solution that would apply in the case where the system is > running out of free RAM and starts swapping out everything? I often > experienced > total unresponsiveness of both X and the consoles when a program tries to use > more RAM than is available, and I wasn't even even able to kill the process > manually (forced reboot). Maybe that should be considered as a pathological > case requiring just the OOM killer to be more aggressive - I don't know. If you have the Magic SysRq key enabled in your kernel, you could do AltGr+SysRq+F to invoke the OOM killer manually. I do agree in principle, though, that the offending process should be denied the allocation of any additional memory before any frequently used memory-mapped pages start getting evicted from RAM. One possible solution might be to set a threshold for the minimum number of memory-mapped pages that the kernel must allow to remain in RAM. As an example, setting such a knob to 100000 would mean that the kernel would not evict any memory-mapped pages if fewer than 100000 memory-mapped pages were resident in RAM. Assuming that the kernel uses a least-recently-used eviction policy, this would prevent the debilitating thrashing scenario that occurs when essentially all memory-mapped pages have been and continue to be evicted. 
(In reply to comment #544)
> Assuming that the kernel uses a least-recently-used eviction
> policy, this would prevent the debilitating thrashing scenario that occurs
> when essentially all memory-mapped pages have been and continue to be
> evicted.

Given the fact that Xorg all too often falls victim to that, and it is active most of the time, I cannot help but assume something is wrong with the kernel's definition of "least recently used." By the way, setting vm.overcommit_memory to 2 and overcommit_ratio to 80 seems to at least somewhat reduce the problem; the same rsync command that triggered this bug (or a similar bug, if you prefer) now behaves a lot better, letting me type these words.

I find that the amount of slowness strongly depends on the writing driver. Today I had to evacuate a Win7 machine onto Fedora 14, and copying from NTFS to ext3 was painful. Now I am returning the files back onto NTFS and there is no slowdown at all. Dig into the ext3 filesystem; it should be in the write path.

This seems to be a hardware-related issue, at least in some cases. Can the other people experiencing it confirm whether they have a WD Green hard disk? A Google search for "wd15eads firmware" reveals quite a few people having similar problems. I have one of these hard disks, I was using it on a fanless VIA Samuel 2 (pre-686) CPU, and I was seeing the high iowait problem and associated poor performance. When I put the same hard disk in a dual AMD Opteron system, it had the same problem. Then I did a full backup and restore onto a different hard disk. It is the same Debian system on the same VIA CPU, but now the high iowait times are gone and the performance is adequate for the CPU. I should point out that the kernel should not suffer poor overall performance during disk I/O even on flaky hardware, especially with swap disabled. The offending hard disk is now blanked. I can run a few tests with it if somebody is interested.

Blaming hardware is the lamest practice in the IT world, and it surely earns those who practice it a great deal of disrespect.

(In reply to comment #548)
> Blaming hardware is the lamest practice in the IT world, and it surely earns
> those who practice it a great deal of disrespect.

Vesselin Kostadinov doesn't blame hardware; he says this bug (or one of the bugs discussed here) is hardware-dependent. I can confirm this too: initially I used a Barracuda 7200.10 320GB ST3320620AS, then I tried to replace it with a Seagate Barracuda LP 2TB without success (nothing changed), then I replaced it with a Samsung HD103UJ 1TB and this helped a lot - the bug is still noticeable, but only very rarely, and it has much less impact on overall system performance. You can find more details about this in my comments on bug 13347.

Regarding the WD EADS disks: it has something to do with disk geometry. We had some problems with them as well; we have some 30 pieces of them. But actually it's not a problem, it's more an RTFM thingie. I think there's something about it on the WD site, but I'm not sure. To partition this disk under Linux / Windows XP (Win 7 does it automagically) you have to use fdisk -H 224 -S 56 /dev/sd... You can read my comment at https://bugzilla.kernel.org/show_bug.cgi?id=12309#c513 Two of the disks are green WDs partitioned with the fdisk method. Until then I also had speed problems where the HDs only had a throughput of 2-5 MB/s. After the fdisk I had a throughput of up to 100 MB/s.
But again, the problem with this bug is not throughput; it's that if you start a big file copy, or something like dd if=/dev/zero of=test.img bs=1M count=5000, your desktop comes almost to a halt. But after some time I think this even isn't a bug; it's more a new kernel queueing methodology. After entering this into sysctl.conf, I almost don't have this problem anymore:

vm.swappiness = 1
vm.dirty_background_ratio = 1
vm.dirty_ratio = 1

I read a lot about this problem, and as far as I can understand, the new way the kernel works is that, depending on the above configuration, it puts data first into RAM and then writes it to disk (very simplified). So if you have a lot of RAM (in my case 12 GB) and the above setting is at its default of 40%, then the kernel caches almost 5 GB in RAM before writing it to disk. And yes, I have a very fast RAID system, but even at 400 MB/s I have to wait 10 seconds and more while it writes that out. I forgot with which kernel version this started, but I know that I checked it, and that my problems with responsiveness started after changing to this new kernel (methodology). So you could say that this is not a bug but merely a kernel configuration matter, because with this new methodology a default vm configuration doesn't work for everyone, especially those with a lot of RAM. And yes, I would like the old methodology to be integrated into the new kernels again, but until then I'll try to circumvent this problem by understanding and configuring the kernel. The above sysctl configuration is working for me with the setup that I described in my comment #513 in this bug. There are slight hiccups, but nothing as severe as earlier, when I couldn't do anything until the file writing finished.
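The ratio-based knobs above are coarse on large-RAM machines; the byte-based knobs mentioned a few comments up give finer control. A sketch (the 200 MB / 100 MB values are assumptions for a disk that writes on the order of 100 MB/s; scale to your hardware):

# cap foreground dirtying at ~200 MB and start background writeback at ~100 MB
sudo sysctl vm.dirty_bytes=209715200
sudo sysctl vm.dirty_background_bytes=104857600
# note: setting a *_bytes knob zeroes its *_ratio counterpart, and vice versa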
Sorry for interrupting your research with my naive question, but does this bug have clear steps to reproduce it? The initial comment says "starting a new shell takes minutes" after the system is left with dd running for a significant time. But for me, shells/browsers etc. take just maybe 1 or 2 seconds longer to start after I have 'stress -d 1' or 'dd if=/dev/zero of=bigfile bs=1M' running for ~10 minutes (bigfile is 30 GB after my tests; dirty blocks quickly reach ~670 MB (3.67 GB RAM total) and stay there). The small-file test that I accidentally ran with TWO simultaneous bigfile dd processes in the background finished in 0.073s (or is this bad?):

$ dd if=/dev/zero of=/tmp/bigfile bs=1M count=30000 conv=fdatasync & sleep 30 ; time dd if=/dev/zero of=/tmp/smallfile bs=4k count=1 conv=fdatasync
[2] 27953
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.0718053 s, 57.0 kB/s

real 0m0.073s
user 0m0.001s
sys 0m0.001s

dd: writing `/tmp/bigfile': No space left on device
dd: writing `/var/tmp/bigfile': No space left on device
22891+0 records in
22890+0 records out
24002064384 bytes (24 GB) copied, 1211.53 s, 19.8 MB/s
21957+0 records in
21956+0 records out
23022534656 bytes (23 GB) copied, 1189.07 s, 19.4 MB/s
[1]- Exit 1 dd if=/dev/zero of=/var/tmp/bigfile bs=1M count=100000 conv=fdatasync
[2]+ Exit 1 dd if=/dev/zero of=/tmp/bigfile bs=1M count=30000 conv=fdatasync

I'm noticing loss of interactivity when my RAM gets filled up and swap grows beyond 500 MB, but this bug is not about such a case, is it? Could it be my HW on the latest stable vanilla 2.6.38.2 amd64 (swappiness 20, the rest being defaults)? Or could I have just configured my kernel in some genius way?

[ 2.051391] ata1.00: ATA-8: HITACHI HTS545025B9A300, PB2ZC61H, max UDMA/100
[ 2.054162] ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[ 2.065605] ata1.00: configured for UDMA/100
[ 2.087958] scsi 0:0:0:0: Direct-Access ATA HITACHI HTS54502 PB2Z PQ: 0 ANSI: 5

$ sudo hdparm -i /dev/sda

/dev/sda:
Model=HITACHI HTS545025B9A300, FwRev=PB2ZC61H, SerialNo=100408PBNXXXXXXXXXX
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=DualPortCache, BuffSize=7208kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=488397168
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7

PS: I'm on ext3.

The controller in the previous comment was:

*-storage
    description: SATA controller
    product: Ibex Peak 6 port SATA AHCI Controller
    vendor: Intel Corporation
    physical id: 1f.2
    bus info: pci@0000:00:1f.2
    logical name: scsi0
    version: 06
    width: 32 bits
    clock: 66MHz
    capabilities: storage msi pm ahci_1.0 bus_master cap_list emulated
    configuration: driver=ahci latency=0
    resources: irq:41 ioport:1860(size=8) ioport:1814(size=4) ioport:1818(size=8) ioport:1810(size=4) ioport:1840(size=32) memory:f2727000-f27277ff

(In reply to comment #551) OK, the fun continues. I installed the offending hard disk in another system, booted a Fedora 14 live image, and the drive worked OK:

[root@localhost ~]# dd if=/dev/zero of=/dev/sd_ bs=1M count=4000 conv=fdatasync
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 50.0265 s, 83.8 MB/s

(Replaced /dev/sda with /dev/sd_ in case someone decides to copy/paste the command.) Then I booted Knoppix 5.1.1 (from 2007) and saw the fault. CPU usage was 49.7%wa (dual CPU) and I had to interrupt dd because it was taking way too long. Then I tried again with a smaller file:

root@Knoppix:~# uname -a
Linux Knoppix 2.6.19 #7 SMP PREEMPT Sun Dec 17 22:01:07 CET 2006 i686 GNU/Linux
root@Knoppix:~# dd if=/dev/zero of=/dev/sd_ bs=1M count=40 conv=fdatasync
40+0 records in
40+0 records out
41943040 bytes (42 MB) copied, 20.8245 seconds, 2.0 MB/s

Then I booted Fedora again and saw the fault again:

[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.35.6-45.fc14.i686 #1 SMP Mon Oct 18 23:56:17 UTC 2010 i686 i686 i386 GNU/Linux
[root@localhost ~]# dd if=/dev/zero of=/dev/sd_ bs=1M count=40 conv=fdatasync
40+0 records in
40+0 records out
41943040 bytes (42 MB) copied, 20.3055 s, 2.1 MB/s

@ #548 from Zenith88: Ignoring the possibility of a hardware fault when the evidence points that way surely brings those who practice it a great deal of fruitless debugging and frustration.

@ #550 from D.M.: I don't think it is the "partition starts at the wrong sector" issue. In the dd commands listed above I was writing to the drive as a whole, without messing with partitions at all. For the sake of it, I decided to create a new partition and see what would happen:

[root@localhost ~]# fdisk -H 224 -S 56 /dev/sd_
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x9b81ad16.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4, default 1): 1
First sector (2048-2930275054, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-2930275054, default 2930275054): +10G

Command (m for help): p

Disk /dev/sda: 1500.3 GB, 1500300828160 bytes
224 heads, 56 sectors/track, 233599 cylinders, total 2930275055 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x9b81ad16

Device Boot      Start         End      Blocks   Id  System
/dev/sda1         2048    20973567    10485760   83  Linux

Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.

[root@localhost ~]# mkfs.ext2 -q /dev/sda_
[root@localhost ~]# mount /dev/sda1 /mnt
[root@localhost ~]# dd if=/dev/zero of=/mnt/bigfile bs=1M count=100 conv=fdatasync
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 77.3839 s, 1.4 MB/s

I guess the performance drop can be attributed to the filesystem overhead. The issue you describe with writing a large batch of dirty pages is a real one, but it is different from the high iowait times. I have seen high iowait times when the only active application I had was rtorrent running in seeding mode - so no disk writes, but lots of disk reads from all over the place, with total system memory less than the size of the torrent. Basically, when the performance of the drive drops from 80 MB/s to 2 MB/s, the only thing the kernel does is wait for I/O operations to complete. I am not sure if there is a solution for this problem at all. The disk is still available, so I can run more tests if anyone is interested.

You can continue debating hard drives, or look into a comparison of the NTFS vs. ext3 code, on the cue from post #546, which is a reproducible test case. Your call.

Oleg: Unfortunately, no clear steps. If you read my comment #513 you'll see that I didn't have any troubles with whole-disk software RAID10. After that I thought it was something file-system related, but I tested ext2-ext4 and xfs - and this also answers your question, Zenith - same thing, no matter what. And regarding the hardware, it may be that this particular HD is broken, and in the case of Kostadinov I even think it is a broken-hardware problem, because on one system (Fedora) it worked, and then after using Knoppix and getting back to Fedora it didn't. I'm just mentioning the troubles we had with the green WDs, and not only under Linux, until I read about this fdisk thing. Now I have two of them, and they didn't give me any troubles when I had them in the whole-disk RAID10, or when I had an older kernel, or now with the new kernel settings.

But to get back to the substance again: yes, if you go dd-ing multiples of RAM onto the HD, the system comes to a halt. With the old kernels it was "I'm doing dd and the system automagically knows that firefox or mail or whatever has more priority for me than dd, so it slows dd down a bit so firefox can get some time reading from the HD. Or maybe the queueing was fairer, so all processes got some time hammering the HD. I don't know, I'm not a kernel developer. I'm just a user, and as a user I'm mentioning the differences between the old and the new kernels." With the new kernel it's not like that: whoever is writing has all the power over the HD. But again, that is more a perception than a fact.
The difference from earlier, before I configured the vm settings, is that wa went up to 98 and now it peaks at 45-50.

You can deny reality however much you see fit; it won't change the fact that writing onto an ext3 partition causes freezes while writing to NTFS does not, on the same system. And this is not a VM but a physical machine. Denial of reality and passing the blame is what's been causing this project to sit on its hands for 3 years.

There are probably different bugs at stake here, and investigating one doesn't mean denying the other. Please be more respectful of people who try to improve our understanding of the problem instead of ranting. Just a guess: the ntfs-3g driver uses FUSE, while the ext3 driver is in kernel space. *Maybe* this can explain the difference (ntfs-3g isn't treated as in-kernel as regards I/O scheduling).

I'm sorry if I offended you in any way. Again, I'm not in denial, and I'm not blaming anyone; I'm merely pointing out that it's not only an ext3 problem, because I had the same problem on xfs, and that, as you pointed out, the kernels from 3 years ago didn't have this kind of problem. And by "vm" I didn't mean Virtual Machine but virtual memory, because I was referring to sysctl.conf (i.e.

...
vm.dirty_background_ratio = 1
vm.dirty_background_bytes = 0
vm.dirty_ratio = 1
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
...

and so on). Again, I'm sorry if I have offended you in any way.

Another attempt to narrow down the use case for the issue: you are not going to get anywhere if you continue reporting issues against all the different breeds of Linuxes. You never know how Fedora or Knoppix patched the kernel, and you should report issues with their kernels to them instead of posting your observations here. As I see it, the only way to track down the issue is to use the same version of the VANILLA kernel (preferably the latest) with different build and runtime configs. I personally have ext3 compiled into the kernel - could that be the reason why I can't reproduce the issue? Zenith88: would it take you a lot of effort to build the latest unpatched vanilla kernel with ext3 compiled into it and to see if it makes things any better for you?

It seems to be fixed in 3.2.

> It seems to be fixed in 3.2.

Somewhere in a parallel universe, I think. Nothing changed for me on

> Intel Corporation 5 Series/3400 Series Chipset SMBus Controller

(In reply to comment #561)
> Nothing changed for me on
>
> > Intel Corporation 5 Series/3400 Series Chipset SMBus Controller

Nor here on my ICH8-based notebook with 2 GiB RAM. If anything, 3.2 seems worse than 3.1 when it comes to the ability of one process to binge on dirtying pages and then bring the rest of the system down to a snail's pace. One consistent example case is unpacking an ISO image stored on another drive to the local SATA drive (using Nautilus, for example). Compute-heavy processes with little disc access suffer (and even those without any I/O do --- CPU usage shoots right down). Another one is a kernel build. The file cache goes bananas, and even with no other desktop applications loaded, everything gets paged out and it takes around a minute (in the worst case) for the unlock screen prompt to appear.

(In reply to comment #561)
> > It seems to be fixed in 3.2.
> Somewhere in a parallel universe, I think.

There are multiple issues that can lead to behaviour like the one discussed in this bug. A few patches that went into 3.2 make some situations better.
But some problems were still known back then; see http://lwn.net/Articles/467328/ - fixes for those went into 3.3-rc1. Quoting from this week's LWN.net kernel page (I'm quite sure Jonathan won't mind):

"""
There have been some significant changes made to the memory compaction code to avoid the lengthy stalls experienced by some users when writing data to slow devices (USB keys, for example). This problem was described in this article (http://lwn.net/Articles/467328/), but the solution has evolved considerably. By making a number of changes to how compaction works, the memory management hackers (and Mel Gorman in particular) were able to avoid disabling synchronous compaction, which had the unfortunate effect of reducing huge page usage. See this commit ( http://git.kernel.org/linus/a77ebd333cd810d7b680d544be88c875131c2bd3 ) for a lot of information on how this problem was addressed.
"""

IOW: best to test 3.3-rc and report bugs if there are still issues. While at it (and with a view from someone who is not very active in this bug tracker): I'd say opening a new bug and mentioning it here in this report might be the best way forward for any remaining issues, as the long history might be misleading/confusing when it comes to solving today's bugs. Just my 2 cents.

The problem is really fixed in 3.3-rc4. I installed two guest systems on a first-generation SSD; the SSD was only for the virtualisation guests. My system is on a >40000 IOPS SSD. The first installation was done with kernel 3.2.6, in which the long stalls of up to 10 seconds reappeared - even as bad as in kernels 2.6.2[4-9]. The second installation was done with kernel 3.3-rc4. I could even work in another running virtualisation guest. It's really great. Thanks to all the people involved in solving this bug. Could someone else confirm it?

(In reply to comment #564)
> Thanks to all the people involved in solving this bug.

Does anyone have a link to a discussion list post or a technical article detailing the theory behind the solution to this bug? Since this "bug" encompasses so many scenarios, I have doubts about whether all of them have indeed been resolved. I'm glad one person's problem went away, but until a kernel hacker can stand up and explain exactly what was wrong and how they fixed it, I'm going to assume there are still lurking problems in Linux's I/O subsystem. One problem we've seen and discussed in this thread is that large numbers of dirty blocks waiting to be flushed to disk can cause eviction of "hot" pages of code that are needed by interactive user processes, thus bringing the system to a state of thrashing in which processes continually trigger page faults because their actively executing code keeps being forced out of RAM by the large buffered write to disk. Even if this problem has been solved (presumably by fixing a bug in the code that is supposed to force a process to flush its own dirty pages to disk once dirty_ratio has been reached), there would still be the problem of the kernel's evicting hot pages from RAM so aggressively in low-memory conditions that the interactivity of the system is compromised to the point where it's impossible for the user to resolve the memory shortage. It's pretty easy to reproduce the thrashing scenario: just mount a tmpfs whose max size is close to the amount of physical memory in the system and start writing data to it.
Eventually you may find that you are no longer able to do anything, even to give input focus to your terminal emulator so you can interrupt your writing process (or in some setups, even to move your mouse cursor on the screen), because your entire desktop environment and even the X server have been evicted from RAM and are continually paging back in from disk (and being immediately evicted again), hindering your ability to do anything. I've encountered this scenario while compiling Chromium in a tmpfs. I'd expect the OOM killer to activate, but instead I find that all of my running applications are responding at a snail's pace because they have to keep paging in bits of their program code from disk. I should mention that I run without swap.

I would think one way to solve the thrashing problem would be to introduce a kernel knob that would set how much time must elapse between a page being fetched from disk into RAM due to a page fault and that page becoming eligible for eviction from RAM. If set to, say, 30 seconds, then the user's interactive processes could retain a usable degree of interactivity, even under extremely low memory conditions. This would, of course, mean that the OOM killer would activate sooner than it does now, since pages that the kernel would presently choose to evict in order to free up RAM would be ineligible under this new time limit. Setting the knob to zero would yield the behavior we have now, in which the kernel is free to evict all unlocked pages.

I'll reiterate once more, as a refresher, that this was formerly not such a problem on 32-bit x86 systems because most library code there contained relocations that would cause the pages containing the code for libraries to differ from disk, so they could not be evicted (assuming no swap). Now that we use position-independent code on x86_64, most executable pages in RAM are identical to the copies on disk, so they are eligible for eviction, since the kernel can just page them back in from disk when they're needed. That convenience turns on us when we find that pages that are needed very frequently (like pages that handle moving the mouse cursor or blinking a cursor) are being evicted aggressively.

Maybe this: http://lwn.net/Articles/467328/

(In reply to comment #567)
> Maybe this:
> http://lwn.net/Articles/467328/

Interesting. Thanks for the link. However, this article doesn't explain why we see thrashing and extremely degraded interactivity on systems that don't have HugeTLB support enabled in the kernel (such as mine). This reinforces the point that there are many scenarios that exhibit poor interactive responsiveness under heavy disk-writing load. Regarding this debate about transparent huge pages, I have to wonder why the kernel would bother trying to create a huge page in a location where there are dirty pages waiting to be written to disk. Shouldn't it just choose some other area in RAM that doesn't intersect any dirty buffers? This isn't really the place for a discussion of page compaction, though, so I'll discourage anyone from responding to my idle musing here.

Large iowait on writing/reading to/from any hard disk drive still occurs. Wasting huge amounts of ticks on any disk I/O while _waiting_ is nonsense. I confirm this: the system becomes unresponsive when it begins swapping memory to disk.

Good day, everybody! I have found near-optimal options against this bug and ones like it. Please try the following options. I found that my server, on kernels of the 2.6.*, 3.2.*, and 3.3.* series, had periodic freezes of 4-15 seconds.
I found that this occurs at writeback time (flushing to disk), when the 30-second expiry for dirty pages kicks in. I tried many variants of the dirty_* options and found these optimal. I can suggest two variants; the second variant is commented out - just uncomment its lines instead:

#######################################
# every 3 sec look up the dirty status
# (this smooths out writeback; maybe 100 would be even better)
echo 300 > /proc/sys/vm/dirty_writeback_centisecs

# only 100 KB of dirty pages before background writeback starts...
# (a very important option :))
echo 102400 > /proc/sys/vm/dirty_background_bytes

# second variant - uncomment it - but you will have freezes, though rarely
# echo 225280000 > /proc/sys/vm/dirty_background_bytes

# my freezes happen when dirty pages expire (default 30 sec),
# so I increased it (it doesn't matter for the 1st variant - it will never be hit)
echo 864000 > /proc/sys/vm/dirty_expire_centisecs

# I increased the limit for non-background writeback (I think it never triggers)
echo 10 > /proc/sys/vm/dirty_ratio
#######################################

I like the 1st variant - my system now works smoothly. I found that the freezes occur when dirty pages are being written to disk. You can see it with:

watch -n1 grep -A 1 dirty /proc/vmstat

The new kernel features of the 3.2.* series (writeback throttling) did not help me. I have now tested kernel 3.3.2-6 of FC16 and it has the troubles too. But these settings work for me! I don't have time for a detailed description, but if you test it and it helps, I am ready to discuss it. Sorry for my English :) Bye! Perlover

After upgrading from 3.0.x to 3.2.0, this bug completely eats my brain :( I have tried the solution from #571 -- now it's not the whole system that hangs, just some applications (browser, terminal, ooffice, etc.). The disk is an SSD:

Read : 1145044992 bytes (1,1 GB), 1,56616 s, 731 MB/s
Write: 1145044992 bytes (1,1 GB), 14,30301 s, 80 MB/s

RAM:
MemTotal: 3969340 kB
MemFree: 112720 kB
Buffers: 721196 kB
Cached: 1246456 kB
SwapCached: 656 kB
Active: 918656 kB
Inactive: 1666252 kB
Active(anon): 507868 kB
Inactive(anon): 158192 kB
Active(file): 410788 kB
Inactive(file): 1508060 kB
SwapTotal: 6290428 kB
SwapFree: 6288604 kB

But it still freezes sometimes, on simple actions like just Alt+Tab to another app, and that app hangs for 3-6 seconds.

The in-kernel process scheduler is generally crap. OK, make that majorly crap. Move away from it. Use BFS (search Con Kolivas) if you want sanity. Someone recently posted a simple test case where heavy kernel-space work starves the user-space processes to death. The person switched to BFS and all his troubles went away. Nobody replied to him on the list. I don't think even Ingo knows what's wrong with CFS. So don't get your hopes up of ever seeing this fixed. Here is the user-space starvation thread I am talking about: https://lkml.org/lkml/2012/6/7/448

Wow!! That's a pretty bold statement to make. Given that the code is all open, why don't you instrument the kernel and pinpoint where exactly the crap is? Most of you guys who suffer the stall problem would want to give Daniel Poelzleithner's ulatency a try.

@Ritesh: you are assuming I am capable of debugging the kernel. None of the users who have reported on this thread are. The only person capable of debugging this issue is Ingo. How many comments have you seen from him? Go ahead and count them! I will tell you the answer: zilch! Process scheduling in the stock Linux kernel is a REAL problem. Nobody wants to debug it; that is a different story.
That does not mean the problem goes away. After seeing that thread I linked above, I am convinced it is some manifestation of CFS issues at fault here.

Anton, can you file yours as a separate bug? It's clear the main problem has been fixed, and the scenario you described seems different.

> it's clear the main problem has been fixed
Alan: Can you describe how it is clear to you, when the general public keeps suffering and reporting the issue? Or worse, just gives up? What is the code change that "fixed" the issue? Just because someone mentioned BFS in a message somewhere and someone is pointing out a potential problem with the in-kernel scheduler doesn't give you the right to close this bug randomly. That's arrogant behavior and does a disservice to all the reporters here.
Installed a BFQ + BFS patched 3.4 kernel ( http://pf.natalenko.name/ ) -- there are no hangs for now.

Hey Anton! Big Alan says this problem does not exist. How dare you claim otherwise... :) I am just kidding... I am moving to BFS myself.

So... no, I said that Anton's case appears to be different and asked him to open a new bug for it, given that the other cases seem fixed. If BFS fixes your case, that's also interesting and wants putting in the bug too.

Comment #571 at least indicates that the problems remain, and are unrelated to the CPU scheduler. So describing all this as "fixed" seems a tad optimistic. That being said, this bugzilla report clearly isn't getting the job done. I suggest that people who are still seeing writeback-related problems report them via email. Suitable recipients are:

linux-mm@kvack.org
linux-kernel@vger.kernel.org
Wu Fengguang <wfg@linux.intel.com>
Andrew Morton <akpm@linux-foundation.org>

And please, the thing to spend time on is working out how to enable kernel developers to locally reproduce the problem. If we can do this, we'll fix it.

571 indicates someone has a possible problem of the same type. It's separate from all the other debugging - hence I asked for it to be filed as a new bug; otherwise nothing useful is going to occur. (E.g. I can get 3-second freezes on Alt-Tab out of GNOME 3, but it doesn't appear to have anything to do with the kernel.) Alan

Alan, Andrew: if it is not the CPU scheduler, then how come JUST replacing the CPU scheduler fixes the issue? This does not make basic CS101 sense!

Hmm... Are you sure that JUST the CPU scheduler was replaced? In my case I replaced both the CPU and disk schedulers, with BFS and BFQ.

Jun 10 12:05:54 nuuzerpogodible kernel: [ 1.611737] io scheduler bfq registered (default)
Jun 10 12:05:54 nuuzerpogodible kernel: [ 1.826589] BFS CPU scheduler v0.422 by Con Kolivas.

I like stability and typically use a very minimalistic approach; I changed only the CPU scheduler. And I haven't noticed any hangs or a stuck mouse so far. Maybe you can change one variable at a time as well and tell us which one (or both) helped. I will update back if I have any new findings. For now, I am happy that I can use my system without getting annoyed with it.

devsk: it makes basic Systems 101 sense, however. All the bits interact. The fact that replacing just the CPU scheduler makes a difference is valuable info, though.

I propose an interesting experiment:

1. Install Opera from this location: http://snapshot.opera.com/unix/rc4_12.00-1456/
2. Switch on hardware acceleration: opera:config#UserPrefs|EnableHardwareAcceleration set to 1
3. Open the test http://ie.microsoft.com/testdrive/Performance/LoveIsInTheAir/ or http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/

Try switching between ttys, and also use your GUI. I believe that no program should affect the responsiveness of the system as a whole, should it?

(In reply to comment #587)
> I propose an interesting experiment. [...]

In my case it is very similar.
I play a movie with vlc on one screen and have some Konsole windows open with transparency, KWin with desktop effects. Whenever the graphics card is at 100% (very often with a low-end card like mine), the whole system suffers a general slowdown (the same on a tty too).

(In reply to comment #578)
> Installed a BFQ + BFS patched 3.4 kernel ( http://pf.natalenko.name/ ) --
> there are no hangs for now.

BFS + BFQ really helps… but only until you run a couple of VMware virtual machines. :( With BFS and BFQ it results in incredible freezes, both in the host OS and the guest OSes, especially when some OS does intensive I/O like installing updates etc. Without BFS and BFQ the freezes still happen, but they are much less noticeable! Probably only one of BFS and BFQ is responsible for such bad behavior, but I haven't tested them separately.

Continuation of post #571. Sorry, my English is not as good as I would like :) Now I have Fedora Core with the 3.3.2-6.fc16.x86_64 kernel. My server has 48 GB of memory and a hardware RAID1 array. I now use my server with these settings (good settings for me):

echo 1000 > /proc/sys/vm/dirty_writeback_centisecs
echo 20 > /proc/sys/vm/dirty_background_ratio
echo 9000000 > /proc/sys/vm/dirty_expire_centisecs
echo 30 > /proc/sys/vm/dirty_ratio

Before these settings, as I wrote in post #571, I had regular freezes of up to 10-20 seconds every 2-5 minutes. I found that the reason for this is the writeback phase for dirty pages. You can watch the writeback phase with "watch -n1 grep -A 1 dirty /proc/vmstat"; nr_writeback is the number of dirty pages currently being written to disk. The writeback phase can be started, for example, by the 'sync' command, or when dirty pages in memory expire (with the common settings, after 30 seconds). If at the next writeback there are many dirty pages (even just 2000-3000 of them), my server freezes during this stage. Now I have the settings above, and once a day I run 'sync' from crontab (when load is at its minimum). During this phase my server's load average climbs from 1-2 up to 80-90, and this takes ~1-2 minutes. My system is frozen for those 1-2 minutes! The rest of the time (24 hours * 60 minutes - 3 minutes) I now have a load average of 1-2 and no I/O freezes. Before these settings I had a load average of 8-9. I know that if the server's power is cut, I will have old data on disk (up to 24 hours old).

I think the system stops all I/O until every dirty page marked as being written has actually reached the disk. A normal system should not block all I/O; it should spread the writing of dirty pages over time. I also noticed that I don't have this problem on my second server with the same OS, same kernel version, and same amount of RAM. It has software RAID1 (/dev/md*). During the writeback phase that server works smoothly. I think software RAID there has a different buffering mechanism for writing to disk. So maybe somebody among you could test these problems with software RAID? And I think these articles are useful and related to this: http://lwn.net/Articles/405076/ https://lwn.net/Articles/456904/ But as I understood it, this feature was only partly realized in kernel 3.3, and I didn't get better results with the new kernel. As I understand, it is under development now. Sorry for my English. Bye! :)

And nowadays (for maybe 1-2 years) I don't see the huge iowait values shown at the top of this thread. But the problem of freezing during large I/O operations remains. So maybe the iowait problem no longer exists, but all I/O still gets blocked during high-volume writes.
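A sketch of that once-a-day sync scheme (the 04:00 slot is an assumption, standing in for "when load is at its minimum"):

# crontab -e entry: flush up to a day's worth of dirty pages during the nightly lull
0 4 * * * /bin/sync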
About I/O schedulers: I have read a lot that devices with NCQ support don't need an I/O scheduler - is that right?

$ dmesg | grep NCQ
[2.145261] ata1.00: 175836528 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[3.109745] ata5.00: 2930277168 sectors, multi 16: LBA48 NCQ (depth 31/32)

It seems all my devices support NCQ. I manually set the noop scheduler, and the system was apparently much more responsive. I hope this is not a placebo. If it's true, why doesn't the kernel automatically switch off the scheduler for devices that have NCQ support?

(In reply to comment #592)
> [...] devices with NCQ support don't need an I/O scheduler [...]
>
> It seems all my devices support NCQ. I manually set the noop scheduler, and
> the system was apparently much more responsive. [...]

While it is true that the hard drive will reorder I/O requests within its native command queue to optimize armature movements, the on-device queue is really very shallow (only 32 requests maximum on your hardware). By circumventing the kernel's I/O scheduler (by selecting "noop"), you are losing the benefits of merging adjacent I/O requests and of distributing I/O throughput fairly across multiple processes.

I think in my case the system hangs for two reasons. I don't know what Opera is doing there, but it looks like the video card's output capacity is also limited, and when any application tries to send too much data to it, the GUI starts feeling less responsive because of these hangs - despite the fact that htop does not show any CPU utilization or waiting for I/O. The second case is more traditional: the hang occurs when accessing memory (with 2 GB of memory), and htop shows that even 2 GB of swap is allocated. Then the freezing happens because of waiting on the hard disk. That would be OK, but the bad part is that it affects all applications, even those for which the available memory would be enough. The worst thing is that it affects the responsiveness of the whole system and the GUI. I want to help fix these problems; tell me what I can do toward that. And I keep wondering why the noop scheduler can be better... Here's a stupid idea, but what if the scheduler's queue itself gets swapped out? Is that theoretically possible? If so, it would be understandable why noop is better.

I wonder, has anyone tried oprofile while forcing a machine to fall into #12309? It may be stalling somewhere waiting for locks or hardware action. Alas, I myself have no hardware at hand on which to reproduce #12309.

Created attachment 78231 [details]
htop screenshot
Please look at my htop screenshot: https://bugzilla.kernel.org/attachment.cgi?id=78231 I'm just copying a file from HDD to HDD. Is it normal to have such high I/O wait on the CPU? I think the bug is not fixed. What other information should I provide?

$ uname -a
Linux u3s3 3.5.2-1.fc17.i686.PAE #1 SMP Wed Aug 15 16:30:14 UTC 2012 i686 i686 i386 GNU/Linux

Looks fairly normal to me - I'd expect a lot of waiting for I/O during a big copy, because rotating disks are incredibly slow relative to processor performance. The CPU is also generally having to work harder on a 32-bit machine with more than 1 GB of RAM doing MMU management, due to the lack of address space. The scheduler, btw, is kernel-side, so it doesn't get paged/swapped out.

Please correct me if I'm wrong, but I do believe that "I/O wait" time is the amount of time that processes are blocked on disk I/O operations. What I don't understand is why I/O wait appears to consume CPU time. Is the kernel spinning in a busy-wait loop while an I/O operation is pending on a disk? If so, why? The kernel should be allowing some other task to use the CPU during the I/O wait.

I/O wait isn't consuming CPU time, but the process of reading/writing disks does consume CPU time, because the process is doing work in the kernel managing the I/O and the things that go with it.

(In reply to comment #601)
> I/O wait isn't consuming CPU time, but the process of reading/writing disks
> does consume CPU time, because the process is doing work in the kernel
> managing the I/O and the things that go with it.

Alan, if I understand you correctly, why doesn't the kernel switch to another process while the current process is waiting on I/O? For example, why does the GUI (meaning GNOME Shell) stutter while another application is swapping or writing a lot to disk?

Alan, isn't what you just described called PIO? Isn't DMA the solution that resolved high CPU load on storage I/O? Isn't high CPU load on VM I/O (iowait) very similar to the PIO storage operation mode? Just to repeat my earlier question: is a polling technique suitable for VM I/O, as it was some years ago for network I/O?

> I/O wait isn't consuming CPU time, but the process of reading/writing disks
> does consume CPU time, because the process is doing work in the kernel
> managing the I/O and the things that go with it.
The data transfers are done by DMA where possible, but you still have to do all the housekeeping, controller management, I/O queue handling and the like. On a 32-bit box there can also be a lot of memory management work involved. Old (pre-AHCI) controllers need PIO for some parts of a transfer; that is a hardware limit. And the kernel does switch to other processes, and back and forth between them, when one is waiting for I/O. GNOME Shell is a very large program, so on any system without vast quantities of memory the shell tends to be waiting for stuff to come from disk whenever there is any memory pressure. Last time I looked, the compositor was single-threaded with all of that, so GNOME 3 stalled horribly under paging. That, I'm afraid, is mostly a problem in GNOME 3. Rotating disks are, in relative terms, very very slow. They've not materially improved in the past ten years, yet memory sizes have grown vastly and processor speeds have grown likewise. They are also very bad at trying to do two things at once, so writing a large file to disk tends to really slow down reading.

(In reply to comment #604)
> And the kernel does switch to other processes, and back and forth between
> them, when one is waiting for I/O. [...]

OK, but why is mouse movement also choppy? And why is switching to a virtual terminal slow? How can I make sure the locks do not occur in the kernel? And how do I find where the locks occur? I really want to help find and fix them. Apologies for the many questions.

Because the GNOME compositor is going to end up stalling while waiting to get data back. Ditto, switching to/from X will be pulling lots of data off disk if your machine has been paging stuff out. To actually get detailed data you need to start profiling the system and generating detailed information to analyse - that's way beyond a bugzilla discussion (but the linux-mm list might be a starting point if you want to get involved in understanding what is a very complicated area - because so many things interact). Ultimately, though, I suspect that unless someone does something drastic about its memory footprint, the "fix" is not to run huge bloated inefficient desktops on a box with 1 GB of RAM.

Alan: What would be your example of a huge bloated inefficient desktop? I guess KDE/GNOME. And the efficient ones might be icewm/fvwm etc. Not common, unfortunately.

The I/O wait problem is still valid. It is just that you need different patterns to hit it. A lot has improved with the latest writeback work, but still, when hit, this is a terrible problem. If you want to reproduce it, take your laptop/desktop with 4 GiB of memory and a regular SATA disk. Pump buffered I/O into it with dd: write zeroes with a block size of 1 MiB. Since it is buffered, you'll start out fine, until you have consumed all 4 GiB of memory. That is when you will start seeing the problem. At that moment (i.e. after you have consumed all of your RAM), every write() will contend for page availability. And given that you also have a slow rotating disk (remote storage - both block and file - also qualifies), try to execute a task after the I/O. A simple sync command is a good one to start with.
The CPU sits blocked until the pages are scanned for best fits and the buffers are synced. You can run dstat and observe the CPU wait time there. (In our tests) Linux is good at pumping I/O. That doesn't always fit the regular OS model, where the user could also be doing other random stuff while I/O is in progress; they expect the machine to stay responsive. MS Windows, while not the best, is still better than the Linux desktop in this use case. Over the years, my workaround has been to have only 1 process doing I/O. Never let 2 or more processes do I/O at the same time. Like, don't do 2 cp's. Don't do 2 copy operations in your GUI file browser. If you follow this policy, you have a higher chance of avoiding this ugly bug.

Ritesh: if you have some test cases, then discuss them on the linux-mm list.

(In reply to comment #608)
> Ritesh: if you have some test cases, then discuss them on the linux-mm list.

Alan, I see in the previous comments you have the same explanation done in the right technical terms. :-) I would just add one more comment. All these symptoms were tested and seen also on my lab machine, which has:

> a 2-core CPU
> 8 GiB RAM (we have also tested with 48 GiB RAM)
> all tests done against a SAN array (over software iSCSI)

The slow network plays the role of the slow rotating media in this case. The stalls were visible on these machines too, after doing buffered I/O that consumed all of the system RAM. I then spent some time tweaking values in /proc/sys/vm but didn't see great improvements. I will surely put my results on -mm in the next run I do on it (could be in some weeks). Thank you.

Is it possible that one process can consume all (dirty?) pages and stall other processes, even if those are running from or accessing other disks? My system is on two SSDs - one for the system and one for data. I can stall the whole system while running a VM on a third, slow, external USB 2.0 disk.

Thomas - the kernel tries very hard to avoid that sort of thing happening and to throttle a process generating too much I/O. Older kernels were certainly very bad at that, and an rsync to a USB disk was horrible. It ought to be much better with the most recent kernels, although still not great.
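A condensed sketch of the reproduction described above (sizes assume a 4 GiB machine; the file must comfortably exceed RAM):

# fill the page cache with buffered writes, then time an unrelated flush
dd if=/dev/zero of=/tmp/bigfile bs=1M count=6144 &
sleep 30
time sync    # on an affected system this can stall for a very long time
wait         # let dd finish, then remove /tmp/bigfile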
> Never
> let 2 or more processes do I/O at the same time. Like don't do 2 cp. Don't do
> 2
> copy operations in your gui file browser.
I've heard once a while ago that Linux is a multitasking OS, so I figure they lied to me?
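Where serializing copies by hand is impractical, ionice can demote a bulk job's I/O priority instead; a sketch (the idle class only has an effect under the CFQ I/O scheduler, and the paths are illustrative):

# run the copy in the "idle" I/O class so it only uses otherwise-unused bandwidth
ionice -c 3 cp /path/to/bigfile /backup/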
By moving to BFS, it has been proven (empirically) that IT IS a CPU scheduler issue and not a slow-rotational-media problem. The kernel can do other stuff when the rotational media is not giving it what it wants. And it shouldn't let buffers and caches fill up so much (again a scheduling issue) that even the kernel does not have free pages to run its own components from. All the kernel ends up doing is spinning, looking for free pages all the time (kswapd hogging the CPU, searching through millions of pages on modern systems). Why it does not evict caches sooner by default is not clear. You need to set a bunch of proc parameters for it to start doing that. And it still eventually keels over. There was a bug reported by someone (and I linked it above) where just pumping network traffic through the Linux kernel brought it to its knees, leading to a cluster reboot. The kernel space (SIRQs) hogged so much CPU during the network traffic processing that user space never got any chance to run. The person moved to BFS and could run network traffic as fast as he wanted without bringing anything to its knees. If this is not a CPU scheduler issue, then I don't know what is!
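One way to tell CPU starvation apart from a page-fault storm during an episode is to watch per-process major-fault rates; a sketch, assuming the sysstat package is installed:

# report memory/fault statistics every 5 seconds; a high majflt/s column while
# the system stalls means code is being paged back in from disk, not that the
# CPU scheduler is withholding time
pidstat -r 5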
nov 12:41 usb-_USB_FLASH_DRIVE_079605074ECA-0:0.img $ ddrescue usb-_USB_FLASH_DRIVE_079605074ECA-0\:0.img /dev/sdb --force GNU ddrescue 1.16 Press Ctrl-C to interrupt rescued: 2004 MB, errsize: 0 B, current rate: 2490 kB/s ipos: 2004 MB, errors: 0, average rate: 3671 kB/s opos: 2004 MB, time since last successful read: 0 s Finished Because of the massive stalling that occured, average write rate ended up at 3.5 MB/sec instead of regular 20+MB/s. Is this bug still alive, or related or does anyone here know what to look for? I'd really like to maintain responsiveness when working with USB drives. Same problem here. Kernel 3.7rc6. SSD. 4GB RAM+2GB swap. When system tries to use swap it became irresponsible (even mouse cursor doesn't moving smoothly). 00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09) 00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09) 00:04.0 Signal processing controller: Intel Corporation Device 0153 (rev 09) 00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04) 00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04) 00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04) 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04) 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4) 00:1c.1 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 2 (rev c4) 00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04) 00:1f.0 ISA bridge: Intel Corporation HM76 Express Chipset LPC Controller (rev 04) 00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04) 00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04) 00:1f.6 Signal processing controller: Intel Corporation 7 Series/C210 Series Chipset Family Thermal Management Controller (rev 04) 02:00.0 Network controller: Intel Corporation Centrino Advanced-N 6235 (rev 24) Additionally, I'm using deadline IO scheduler. So, maybe reopen?.. Progress has been made toward the goal of eliminating the timer tick while running in user space. The patches merged for 3.9 fix up the CPU time accounting code, printk() subsystem, and irq_work code to function without timer interrupts; further work can be expected in future development cycles. A relatively simple scheduler patch fixes the "bouncing cow problem," wherein, on a system with more processors than running processes, those processes can wander across the processors, yielding poor cache behavior. For a "worst-case" tbench benchmark run, the result is a 15x improvement in performance. The format of tracing events has been changed to remove some unused padding. This change created problems when it was first attempted in 2011, but it seems that the relevant user-space programs have since been fixed (by moving them to the libtraceevent library). It is worth trying again; smaller events require less bandwidth as they are communicated to user space. Anybody who observes any remaining problems would do well to report them during the 3.9 development cycle. 
On 3.9-rc5 I get a pretty good result copying from a USB flash drive to HDD: high speed (30+ MB/s) and 2-20% iowait, very good. HDD -> flash gives 50-60% iowait but no performance problems; copying speed is 12-16 MB/s (the maximum for my USB flash drive). dd if=/dev/zero of=~/ololo bs=1M count=1024 produces high iowait and some performance problems (small freezes in 3D apps, unstable fps, and the DE slows down; i.e., tasks are starving?). One very important problem remains: swap. Even with only a small amount of swap in use, the system does not run smoothly. When the kernel makes very heavy use of swap, the freezes stretch into minutes; everything freezes. It feels like the mm manager works with swap in blocking mode %) Is multitasking locked while pages are moved between RAM and swap? Any ideas why that happens? Oh, I forgot: my SATA controller is the MCP67 (the buggiest chip?).

All who believe that this problem has been fixed, please open this link in Google Chrome: http://ec2-54-229-117-209.eu-west-1.compute.amazonaws.com/party.html

Mikhail, that is not related to this bug; there is no large IO on that page, only canvas allocation that eats up RAM:

function partyHard(drunkenness) {
    var mapCanvas = [];
    var mapCanvasCtx = [];
    for (var i = 0; i < drunkenness * 1200; i++) {
        mapCanvas[i] = document.createElement('canvas');
        mapCanvas[i].width = 2500;
        mapCanvas[i].height = 2500;
        mapCanvasCtx[i] = mapCanvas[i].getContext('2d');
        mapCanvasCtx[i].fillStyle = 'rgb(0, 0, 0)';
        mapCanvasCtx[i].fillRect(0, 0, 1700, 1700);
    }
    console.log(window);
}

In this example, the large IO will come from the swap file. Try increasing the swap size to 64 GB and repeat the experiment. On my system with 16 GB of RAM and no swap there are no freezes. If you increase the swap to 64 GB, the system dies, 100%. :( Swap in Linux is something fantastic. It feels like the scheduler is locked while a RAM page is being written to swap. We expect lags in the program doing the IO, but the lags are global! That's awesome! :)

On my i7 + 8 GB RAM + 750 GB SATA HDD: if the HDD swap is in use, the system freezes and lags; even the mouse lags! Kernel 3.10 (and many other versions). To fix it I use zram swap + HDD swap. The lags are reduced, but did not go away.

(In reply to 3draven from comment #628)
> To fix it I use zram swap + HDD swap. The lags are reduced, but did not go away.

Yep, experiencing the same, currently on 3.10.15. Pushing memory usage into swap on Linux is craziness for the user. It means 8 GB of RAM is the minimum for any above-average workload. Nobody cares about problems with swap I/O :(
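For what it's worth, the zram + HDD swap setup 3draven describes can be reproduced roughly like this (a sketch; the 2G size and priority 100 are arbitrary choices, and most distros now ship a packaged zram service that does the same thing):

sudo modprobe zram                             # creates /dev/zram0 by default
echo 2G | sudo tee /sys/block/zram0/disksize   # size must be set before mkswap
sudo mkswap /dev/zram0
sudo swapon -p 100 /dev/zram0                  # prefer zram over the HDD swap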
Heya, after some years I have resolved MY problem with the responsiveness of the computer in regards to HIGH I/O. For all others I can only say: try it. If it works for you, be happy; if not...

For starters I wouldn't call this a bug. It's a DEFECT. Because if I have 20 servers with different Linux flavours and distributions, many of them compiled from scratch, and I have 200 Ubuntu desktops that all behave the same when I use the command dd if=/dev/zero of=test.img bs=1M count=xxx (above 1 GB file size), meaning this command grinds the system to a halt, and this problem has then stayed around for so many years and so many kernels, then for me it is a DEFECT.

For the past week I've been trying the BFQ patch for kernel 3.9 on several machines. One machine I have been testing heavily. I have had this machine for some years now, a Core i7 with 12 GB and 6 HDs in RAID 10. On it I also had the problem; it was somewhat better with the BFS patch, but still happening. With the BFQ patch it's working perfectly. At one moment I had two dd's running (dd if=/dev/zero of=test.img bs=1M count=100k), was creating a VirtualBox VDI of 60 GB, opening 10 ODS documents, watching YouTube, watching an HD movie in VLC and some other stuff, and the desktop / system was as responsive as if nothing was using it. Just like I remember Linux being some years ago. And now I have a sustained throughput of 470 MB/s without my computer going to /dev/null. So BFQ solved this problem for me. Maybe it's not stable yet, but for me it's more stable than using CFS !!! Just my two cents. And this bug is closed for me, but only NOW !!! For all others out there I wish you luck.

Just a general FYI: BFQ just did a fresh release where they claim another batch of significant improvements for whatever they're doing.

> So BFQ solved this problem for me. Maybe it's not stable yet,
> but for me it's more stable than using CFS !!!
BFQ and CFS are not comparable. Maybe you meant BFS?
(In reply to devsk from comment #633)
> > So BFQ solved this problem for me. Maybe it's not stable yet,
> > but for me it's more stable than using CFS !!!
>
> BFQ and CFS are not comparable. Maybe you meant BFS?

Sorry, my mistake. I meant CFQ. But on the other hand, BFS too. BFS did give me some improvements: I could listen to music while I created a big file, but that was all. So CFQ without BFS was a no-go, CFQ with BFS helped a little, but BFQ alone solved the problems I had had for the past 4-5 years, in which I had to bend and improvise to create a VDI of 60 GB while hoping that my computer stayed alive until it finished the job, and mind you, on a computer that has resources in abundance. :)

I have successfully reproduced this bug on my HP Z200 under Ubuntu 12.04 LTS. After some investigation I found that the main cause of this bug is a very ugly bottleneck in the block device layer: the cores of my Z200 spend almost all their time spinning on a spinlock while IRQs are disabled on ALL cores.

I'm still seeing this. Setup: Debian 7 Wheezy, amd64 backports kernel (3.11-0.bpo.2-amd64), ~45 MB/s write of a low number of large files by rsync (fed through a GBit ethernet link) on an ext3 FS (rw,noatime,data=ordered) in an LVM2 partition on a hardware RAID5. Observation: the machine (32-core Xeon E5-4650, 192 GB RAM), primarily servicing multiple interactive users via SSH, x2go and SunRay sessions, gets completely unusable during, and for quite some time after, the rsync transfer. TCP connections to SunRay clients time out, IRC connections are dropped, and even simple tools like "htop" don't do anything but clear the screen after being started. "iotop" shows a [jbd2/dm-1-8] process on top, reportedly doing "99.99%" I/O (but not reading or writing a single byte, maybe because it's a kernel thread?). Once I switch from the default CFQ I/O scheduler to "deadline" (echo deadline > /sys/block/sdb/queue/scheduler), the symptoms disappear completely.
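The echo into /sys above only lasts until reboot or device re-plug. A sketch of making the scheduler choice persistent with a udev rule, in the same style as the rule posted later in this thread (the file name is an arbitrary choice, and deadline must be built into the kernel):

# /etc/udev/rules.d/60-iosched.rules
# Select the deadline elevator for all rotational SCSI/SATA disks.
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="deadline"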
Still facing this bug. Kernel 3.16. Is it possible to reserve 5% of IO for user/other processes' needs? Any fast download or copy eats 99.99% of IO and the system is hard to use.

I'm curious: this bug was ostensibly fixed years ago, however I dare everyone who owns an Android smartphone to run a simple test. Invoke any terminal emulator and execute this command: $ cat < /dev/zero > /sdcard/EMPTY What's terribly unpleasant is that _all_ CPU cores become busy (more than 75% load), and the CPU jumps into the highest performance state, i.e. frequency, i.e. power consumption. Obviously this is wrong, bad and shouldn't happen. This test is kind of artificial, as no Android app can create such a high IO load, but there are multiple phones out there with either 5GHz MIMO 802.11n or 802.11ac chips which allow up to 80 MB/sec throughput, which can easily saturate most if not all internal MMC cards and have the same effect as the above command. Perhaps the vanilla kernel bugzilla is not the place to discuss bugs in Android, but the latest Android releases usually feature kernels 3.10.x and 4.1.x without that many patches, so this bug is still there. Both these kernels are currently maintained and supported. Android by default never uses swap (one of the reasons for this bug). Go figure. P.S. Sample apps from Google Play: * CPU Stats by takke * Terminal Emulator by Jack Palevich

I've just experienced this issue with 3.19.0-32-generic on Ubuntu. My KTorrent downloaded files to an NTFS filesystem on a SATA3 drive (FUSE, download speed about 100 Mbit/s); simultaneously I copied files from that filesystem to a USB 3.0 flash drive, also NTFS. That resulted in poor interactive performance: mouse and window lags. The workaround was to suspend the torrent download until the files were copied. Hardware: one AMD FX(tm)-8320 eight-core processor, 8 GB RAM. This bug is definitely not fixed.

A simple cp from one drive to another makes a huge impact on my desktop. Trying to do an rsync is even worse. It seems to mainly be a problem with large files. My system is old (Athlon II 250), but even an old P3 running Win98 doesn't lag this badly from just copying files.

I'm trying to copy 50 GB from one tower to another via USB 3.0 and it is really no fun. If I copy all files at once, the speed decreases constantly; after 30 minutes it copies at 1.0 MB/s. If I copy a bunch of directories instead, it is a little bit better but the speed also decreases. For 2 GB my Linux system needs more than an hour. This bug is definitely not fixed. On Windows this USB stick works without that speed loss. OS: Fedora 24. Kernel: 4.8.15.

I've noticed that this doesn't happen every time, even with the exact same USB stick. For my 2 GB of files (Eclipse with a workspace and a project) I needed one hour to copy: it started at 60 MB/s and decreased to 500 KB/s. Now I'm copying 16 GB (Android Studio and some other projects) and it only needs about 15 minutes: the copy speed started at 70 MB/s and was still 22 MB/s at the end. So it also decreases, but not as fast as in my 2 GB copy.

It seems kernel developers don't watch this topic here; it's much better to write to the mailing list. Does Jens' buffered writeback throttling patchset solve your issue?

(In reply to bes1002t from comment #642)
> I'm trying to copy 50 GB from one tower to another via USB 3.0 and it is
> really no fun. [...] For 2 GB my Linux system needs more than an hour. This
> bug is definitely not fixed. On Windows this USB stick works without that
> speed loss.
>
> OS: Fedora 24. Kernel: 4.8.15.

This bug report has nothing to do with the speed of copying data to a USB flash drive. It's about substantially degraded interactivity, which manifests as slowness, and it's hard to believe you can perceive that via an SSH session. I'm inclined to believe your bug is related to other subsystems, like USB.

> It seems kernel developers don't watch this topic here, it's much better to
> write to the mailing list.

The kernel bugzilla has always been neglected: thousands of bug reports with zero comments from prospective developers. LKML is hit and miss too. Your developer skipped your e-mail because he/she was busy? Bad luck.

@bes1002t: I think throughput is a different issue than this one, although it might well be related. But most important would be for someone to create an I/O concurrency / latency benchmark. Maybe the Phoronix Test Suite is an adequate tool for that? It can also be used for automatic bisecting. I clearly remember pre-2.6.18 times when I had a much inferior machine, and while Gentoo's emerge was compiling stuff in the background with multiple threads, I could browse the web, switch between programs and play an HD stream without any hiccup or stalling.
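Until someone builds the proper benchmark suggested above, a crude interactivity probe is easy to script (a sketch; GNU date with %N is assumed, and the file path and size are placeholders):

#!/bin/sh
# Run a heavy buffered writer, then once a second measure how long a
# trivial command takes to start; multi-second spikes are exactly the
# stalls this thread complains about.
dd if=/dev/zero of=./bigfile bs=1M count=8192 &
writer=$!
while kill -0 "$writer" 2>/dev/null; do
    t0=$(date +%s%N)
    /bin/true
    t1=$(date +%s%N)
    echo "exec latency: $(( (t1 - t0) / 1000000 )) ms"
    sleep 1
done
rm -f ./bigfile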
@bes1002t: Copying to a USB device always starts at the speed of the hard drive, as everything is cached until the write cache is full, and ends at the speed of the USB drive. The write process has to wait until all data is written.

@Artem S. Tashkinov: The stall problems over an SSH session exist, or existed. I migrated an old server with CentOS 6 and copied some VM images; the SSH responsiveness was very bad. I had to wait up to 20 seconds for tab completion. In many cases it was a swap problem: the buffers fill up, the caches take a long time to be written out to a slow USB device, and the server starts to swap out process data. It's only a very small amount of data. I could increase the overall desktop performance with a RAM upgrade.

Try Kernel 4.10.

> Improved writeback management
>
> Since the dawn of time, the way Linux synchronizes to disk the data written to
> memory by processes (aka. background writeback) has sucked. When Linux writes
> all that data in the background, it should have little impact on foreground
> activity. That's the definition of background activity... But for as long as it
> can be remembered, heavy buffered writers have not behaved like that. For
> instance, if you do something like $ dd if=/dev/zero of=foo bs=1M count=10k,
> or try to copy files to USB storage, and then try and start a browser or any
> other large app, it basically won't start before the buffered writeback is
> done, and your desktop, or command shell, feels unresponsive. These problems
> happen because heavy writes - the kind of write activity caused by the
> background writeback - fill up the block layer, and other IO requests have to
> wait a lot to be attended (for more details, see the LWN article).
>
> This release adds a mechanism that throttles back buffered writeback, which
> makes it more difficult for heavy writers to monopolize the IO request queue,
> and thus provides a smoother experience in Linux desktops and shells than what
> people were used to. The algorithm for when to throttle can monitor the
> latencies of requests, and shrinks or grows the request queue depth
> accordingly, which means that it's auto-tunable, and generally, a user would
> not have to touch the settings. This feature needs to be enabled explicitly in
> the configuration (and, as it should be expected, there can be regressions)

> Try Kernel 4.10.
It doesn't help with my workload :(
The mouse pointer and keyboard input still freeze.
Make sure your kernel has that option enabled.
>This feature needs to be enabled explicitly in
>the configuration (and, as it should be expected, there can be regressions)
I read the https://kernelnewbies.org/Linux_4.10 article, but I did not see the name of this option there.

Created attachment 255491 [details]
$ cat /boot/config-`uname -r`
(In reply to Mikhail from comment #651)

First, I'd recommend trying to disable swap completely - it might help:

$ sudo swapoff -a

If you compile your own kernel, or your distro hasn't enabled them for you, here's the list of the options you need to enable:

BLK_WBT, enable support for block device writeback throttling
BLK_WBT_MQ, multiqueue writeback throttling
BLK_WBT_SQ, single queue writeback throttling

They are all under "Enable the block layer". If disabling swap and enabling these options have no effect, please ***create a new bug report*** and provide the following information:

CPU
Motherboard and BIOS version
RAM type and volume
Storage and its type
Kernel version and its .config

And also the complete output of these utilities:

dmesg
lspci -vvv
lshw
free
vmstat (when the bug is exposed)
cat /proc/interrupts
cat /proc/iomem
cat /proc/meminfo
cat /proc/mtrr
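A quick way to check whether a running kernel was built with those options (a sketch; the /boot path is distro-dependent, and /proc/config.gz exists only on kernels built with CONFIG_IKCONFIG_PROC):

grep -E 'CONFIG_BLK_WBT(_MQ|_SQ)?=' /boot/config-$(uname -r)
# or, if the kernel exposes its own config:
zgrep -E 'CONFIG_BLK_WBT' /proc/config.gz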
># CONFIG_BLK_WBT_SQ is not set
>CONFIG_BLK_WBT_MQ=y
So writeback throttling is enabled only for multi-queue devices in your case. I suppose you need to use blk-mq for your sd* devices to activate writeback throttling (the scsi_mod.use_blk_mq=1 boot flag), or to recompile the kernel with CONFIG_BLK_WBT_SQ enabled.
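Once writeback throttling is actually active for a device, its target latency is exposed in sysfs and can be inspected or tuned at runtime. A sketch, with sda as a placeholder (the attribute only exists on kernels with the 4.10 WBT feature):

# 0 means throttling is off for this queue; a positive value is the
# target read latency, in microseconds, that writes are throttled to protect.
cat /sys/block/sda/queue/wbt_lat_usec
echo 75000 | sudo tee /sys/block/sda/queue/wbt_lat_usec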
Created attachment 255501 [details]
all required files in one archive
After setting the boot flag "scsi_mod.use_blk_mq=1", the freezes became much shorter. I'm no longer sure they are at the kernel level; it looks more like the window manager (GNOME mutter) is written in such a way that the mouse freezes while loading the list of applications. To finally defeat the freezes, it seems the window manager needs to be kept from being paged out to the swap file. I also caught vmstat output when a freeze occurred:

# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b     swpd    free    buff    cache   si   so   bi   bo   in   cs us sy id wa st
 2  6 15947052  205136  112592 4087608   32   41   93  119    7   23 43 19 37  1  0
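One way to act on the "keep the window manager out of swap" idea above, sketched under the assumption of a systemd-managed session with cgroup v2: the unit name gnome-shell.service is a guess that differs across distros and GNOME versions, while MemorySwapMax= is a real systemd resource-control property.

# Forbid the compositor's cgroup from touching swap at all
# (assumes mutter/gnome-shell runs as a systemd user unit).
systemctl --user set-property gnome-shell.service MemorySwapMax=0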
Twice I asked you to try disabling swap altogether and you still haven't. I'm unsubscribing from this bug report.

Created attachment 274511 [details]
Per-device dirty ratio configuration support
Per device dirty bytes configuration
The patch is not ideal; I made it for smooth flash drive writing by setting a smaller dirty_bytes value per removable device.

>> Path
# ls /sys/block/sdc/bdi/
dirty_background_bytes  dirty_background_ratio  dirty_bytes  dirty_ratio  max_ratio
min_pages_to_flush  min_ratio  power  read_ahead_kb  stable_pages_required  subsystem  uevent

>> udev rule for removable devices
# cat /etc/udev/rules.d/90-dirty-flash.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{removable}=="1", ATTR{bdi/dirty_bytes}="4194304"

Was this bug actually fixed? The status shows CLOSED CODE_FIX with a last modified date of Dec 5 2018. I don't see any updates as to what was corrected, or what version the fix was put into.

Created attachment 282477 [details]
attachment-6179-0.html

This was never fixed, and since the bug state was cheated, with no commit info ever provided even when asked directly, it never will be. Nobody cares, and I guess nobody ever figured out who broke the kernel, by which changeset, and when. Just buy another couple of Xeons for your zupa-dupa web-surfing desktop and pray it's enough for the loads of waits when you format your diskette. Another approach is to buy enough RAM to hold your whole set of block devices there, so write-outs are quick enough and you won't see microsecond lags. This is the complete workaround list they have provided since the bug was opened.

As far as I understand, this is a kind of meta-bug: there are multiple causes and multiple fixes. "I do bulk IO and it gets slow" sounds rather general, and it is a problem that can resurface anytime due to some new underlying issue. So the problem cannot really be "closed for good", no matter how much technical progress is made. For me, 12309 basically stopped happening unless I deliberately tune the "/proc/sys/vm/dirty_*" values to non-typical ranges and forget to revert them. I see the system controllably slowing down processes doing bulk IO so that the system in general stays reasonable. This behaviour is one of the outcomes of this bug. I don't expect meaningful technical discussion to happen in this thread. It should just serve as a hub for linking to specific new issues.

Sure, it's a meta-bug, but for me 12309 is still present, and I don't use any tuning for the I/O subsystem at all. It is not as bad as years ago when it happened for the first time, but I still have to throttle rtorrent to download at 2.5 MB/s maximum instead of the usual 10 MB/s if I want to watch films in mplayer at the same time without jitter/freezes/lags. And that's on a powerful and modern enough system with kernel 4.19.27, CPU i7-2600K @ 4.5GHz, RAM 24GB, and HDD 3TB Western Digital Caviar Green WD30EZRX-00D. This is annoying, and I remember the time before 12309 when rtorrent without any throttling wouldn't make mplayer freeze on less powerful hardware.
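For the rtorrent-vs-mplayer case above, an alternative to a hard bandwidth cap is to drop the downloader's I/O priority (a sketch; the idle class is honoured by the CFQ/BFQ schedulers, not by deadline or noop):

# Move a running rtorrent into the idle I/O class:
ionice -c 3 -p "$(pidof rtorrent)"
# Or start it there in the first place:
ionice -c 3 rtorrent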
Created attachment 282483 [details]
attachment-22369-0.html

Well, I've tried to report a new bug to investigate my own case of "my CPU does nothing because waiting is too hard for it". Of no interest to any kernel dev. So, just as Linus once said "f**k you Nvidia", the very same goes back to Linux itself. It's a pity some devs think that making their software Linux-bound (via udev-only bindings or ALSA-only sound output) is a good idea (GNOME and even parts of KDE). They forget that 15 years ago they picketed Adobe for having Flash for Windows only. Now one has to use 12309-bound crap with no way to run one's software on another platform.

Can 'someone' please open a bounty on the creation of a VM test case, e.g. with `vagrant` or the `phoronix test suite`? Basically, a way to reproduce and quantify the perceived/actual performance difference between

> Linux 2.6.17 Released 17 June, 2006

and

> Linux 5.0 Released Sun, 3 Mar 2019

(In reply to Alex Efros from comment #666)
> It is not as bad as years ago [...]
> And that's on a powerful and modern enough system with kernel 4.19.27,
> CPU i7-2600K @ 4.5GHz, RAM 24GB, and HDD 3TB [...]
> This is annoying, and I remember the time before 12309 when rtorrent
> without any throttling wouldn't make mplayer freeze on less powerful
> hardware.

Oh yeah, this... I can clearly remember back then when, on a then mid-range machine with a lot of compiling (Gentoo => 100% CPU 🤣) and filesystem work, VLC used to play an HD video stream even under heavy load without any hiccups or micro-stuttering. It was impressive at the time... and then it broke 🤨

According to my attempts to fix this bug, I totally disagree with you. This bug is caused purely by the design of the current block device layer. Methods which are good for developing code are absolutely improper for developing ideas. That is probably the key problem of the Linux community. Currently, there is a merged workaround for block devices with a good queue, such as the Samsung Pro NVMe. WBR, Vitaly

(In reply to _Vi from comment #665)
> As far as I understand, this is a kind of meta-bug: there are multiple
> causes and multiple fixes. [...]
Had this again 20 minutes ago. I was copying 8.7 GiB of data from one directory to another on the same filesystem (ext4 (rw,relatime,data=ordered)) on the same disk (Western Digital WDC WD30EZRX-00D8PB0, spinning metal). The KDE UI became unresponsive (everything other than /home and user data is on an SSD); I could not launch any new applications. Opening a new tab in Firefox to go to YouTube didn't load the page and kept saying "waiting for youtube.com" in the status bar (does the network get halted?). dmesg shows these; are they important?

[25013.905943] INFO: task DOMCacheThread:17496 blocked for more than 120 seconds.
[25013.905945]  Tainted: P OE 4.15.0-54-generic #58-Ubuntu
[25013.905947] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[25013.905949] DOMCacheThread D 0 17496 2243 0x00000000
[25013.905951] Call Trace:
[25013.905954]  __schedule+0x291/0x8a0
[25013.905957]  schedule+0x2c/0x80
[25013.905959]  jbd2_log_wait_commit+0xb0/0x120
[25013.905962]  ? wait_woken+0x80/0x80
[25013.905965]  __jbd2_journal_force_commit+0x61/0xb0
[25013.905967]  jbd2_journal_force_commit+0x21/0x30
[25013.905970]  ext4_force_commit+0x29/0x2d
[25013.905972]  ext4_sync_file+0x14a/0x3b0
[25013.905975]  vfs_fsync_range+0x51/0xb0
[25013.905977]  do_fsync+0x3d/0x70
[25013.905980]  SyS_fsync+0x10/0x20
[25013.905982]  do_syscall_64+0x73/0x130
[25013.905985]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[25013.905987] RIP: 0033:0x7fc9cb839b07
[25013.905988] RSP: 002b:00007fc9a7aeb200 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[25013.905990] RAX: ffffffffffffffda RBX: 00000000000000a0 RCX: 00007fc9cb839b07
[25013.905992] RDX: 0000000000000000 RSI: 00007fc9a7aeaff0 RDI: 00000000000000a0
[25013.905993] RBP: 0000000000000000 R08: 0000000000000000 R09: 72732f656d6f682f
[25013.905994] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000001f6
[25013.905995] R13: 00007fc97fc5d038 R14: 00007fc9a7aeb340 R15: 00007fc987523380

KDE has a problem too: the same copy via the CLI or via Ultracopier (GUI) causes no problem, but I note the KDE UI gets slower; plasma uses CPU whenever I use the HDD...

What's the value of `vm.dirty_writeback_centisecs`? i.e.

$ sysctl vm.dirty_writeback_centisecs

Try setting it to 0 to disable it, i.e.

$ sudo sysctl -w vm.dirty_writeback_centisecs=0

I found that this keeps my network transfer from stalling (it stalled for a few seconds at a time when that was =1000, for example) while some kind of non-async `sync`(command)-like flushing goes on periodically, when transferring GiB of data files from sftp to an SSD (via Midnight Commander, on a link limited to 10 MiB per second). vm.dirty_writeback_centisecs is how often the pdflush/flush/kdmflush processes wake up and check to see if work needs to be done. Coupled with the above, I've been using another value, `vm.dirty_expire_centisecs=1000`, in both cases (stalling and not stalling), so this one remained fixed at =1000. vm.dirty_expire_centisecs is how long something can be in the cache before it needs to be written. In this case it's 1 second. When the pdflush/flush/kdmflush processes kick in, they check how old a dirty page is, and if it's older than this value, it is written asynchronously to disk.
Since holding a dirty page in memory is unsafe, this is also a safeguard against data loss. Well, with the above, at least I'm not experiencing network stalls when copying GiB of data via Midnight Commander's sftp to my SSD until some kernel-caused syncing completes in the background. I don't know if this will work for others, but if you are curious about any of my other sysctl settings, they should be available for perusing [here](https://github.com/howaboutsynergy/q1q/tree/0a2cd4ba658067140d3f0ae89a0897af54da52a4/OSes/archlinux/etc/sysctl.d)

correction:
> In this case it's 1 second.
*In this case it's 10 seconds.
Also, heads up:
I found that 'tlp', via `/etc/default/tlp` on Arch Linux, will overwrite the values set in `/etc/sysctl.d/*.conf` files if its own settings are non-`0`, i.e.
MAX_LOST_WORK_SECS_ON_AC=10
MAX_LOST_WORK_SECS_ON_BAT=10
will set:
vm.dirty_expire_centisecs=1000
vm.dirty_writeback_centisecs=1000
regardless of what values you set them to in the `/etc/sysctl.d/*.conf` files.
/etc/default/tlp is owned by tlp 1.2.2-1
Not setting those (e.g. commenting them out) will have tlp set them to its default of 15 sec (aka =1500). So the workaround is to set them to =0, which makes tlp not touch them at all; the values from the `/etc/sysctl.d/*.conf` files are then allowed to remain as set.
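Putting the two pieces above together, the resulting configuration would look something like this (a sketch; the file name is arbitrary, and the values simply mirror what is discussed above):

# /etc/sysctl.d/99-writeback.conf
vm.dirty_writeback_centisecs = 0
vm.dirty_expire_centisecs = 1000

# /etc/default/tlp  (0 = tlp leaves the sysctl values alone)
MAX_LOST_WORK_SECS_ON_AC=0
MAX_LOST_WORK_SECS_ON_BAT=0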
I'm making this bug private to prevent more spam from being added to it.

I am facing this issue with both Debian and Arch Linux, on xfs and ext4. https://forums.debian.net/viewtopic.php?p=778803