Most recent kernel where this bug did not occur: don't know, tested only on 2.6.18
Distribution: Debian Etch
Hardware Environment: Dell PowerEdge 2850, x86_64, 2x Xeon 3.2GHz, 4GB RAM, Megaraid controller, 2 hardware RAID1 volumes, ext3 fs on top of lvm2.
Software Environment: mysql 5 is running on the server.
Problem Description: During our nightly backup, we copy (with cp) some big (a few GB) mysql files from one volume to the other. During this time the system is almost unusable:
* mysql doesn't serve new requests (ie it can't use the disk)
* ssh login is almost impossible (ie very slow)
* top -d 0.5 sometimes almost freezes
This is completely reproducible with any io scheduler (cfq and as are better, but only marginally).
Steps to reproduce:
1) have a mysql instance that consumes a large part of the RAM (ie 3.5GB out of 4GB of physical RAM)
2) issue a cp of some big files
3) watch the server running only the copy; other processes seem to be blocked waiting for the disk to be ready. For instance, a simple "ls" finally runs about 3 to 4 seconds after pressing return.
More information: swappiness is at 20, kernel compiled without preemption, but with SMP and Hyperthreading enabled. More information as attachments.
The server is a live mysql server, I can't reboot it whenever I want, so it will be a little bit complex to diagnose.
Created attachment 9256 [details] lsmod output
Created attachment 9257 [details] kernel .config
Created attachment 9258 [details] cat /proc/meminfo
Created attachment 9259 [details] cat /proc/interrupts
Created attachment 9260 [details] vmstat -n1 output
Created attachment 9261 [details] iostat -x sda sdb output
Created attachment 9262 [details] gzipped top -H -d 0.5 -b output for a few seconds
I had the same thing happen to me ... turned out DMA was off on the drive. Check with hdparm?
Hum, those are 10k RPM SCSI drives behind a brand new PCI megaraid card, so hdparm is of no help, and I suppose DMA is activated. Throughput of the storage subsystem by itself is very good; it is just that the server becomes unusable when two processes compete for it.
Important side note: I ran almost the same test on a less powerful server running 2.6.17.13, and it worked really fine (I was able to use the server for something else besides the running copy). So I guess the issue is with 2.6.18.
Was the "less powerful server" also using the megaraid driver with hardware RAID and LVM? We could be seeing an interaction between the IO scheduler and the device driver layer here. Also, it'd be interesting to see if echo 1 > /proc/sys/vm/dirty_background_ratio echo 2 > /proc/sys/vm/dirty_ratio improves things.
A simpler recipe that is trivially repeatable for me is:
mount -o loop $iso /tmp/iso
sudo /tmp/cold # fairly typical drop_caches / BLKFLSBUF script
cp -a /tmp/iso /tmp/foo-slow &
sleep 1 ; time ls /usr
with parallel cp: real: 2.210, 2.020, 2.350 : avg. 2.193
without: real: 0.317, 0.377, 0.379 : avg. 0.358
ie. a 6x degradation in I/O - of course, the user-visible symptom in terms of desktop responsiveness is long hangs whenever any I/O is attempted. The hardware is a fairly standard T60p - SMP kernel, SATA disk - and $iso is just a CD ISO of your choice. HTH.
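The /tmp/cold helper referenced above isn't attached; a minimal sketch of such a cache-dropping script, assuming a 2.6.16+ kernel (for /proc/sys/vm/drop_caches) and that the device to flush is passed as the first argument, might look like:

#!/bin/sh
# Flush dirty data, drop the page/dentry/inode caches, then flush the block
# device's buffer cache (BLKFLSBUF) so the next read starts from a cold cache.
dev=${1:-/dev/sda}
sync
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs "$dev"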
Andrew, The "less powerful" server has the same megaraid card, but there is only one RAID1 volume (also ext3 on top of LVM2), only 2GB of RAM and is running also as a mail server (along with mysql 5). The only real differences are: 1) it runs 2.6.17.13 2) mysql does not use almost all the physical RAM The two "echos" didn't change anything to the issue. My next test will be to run 2.6.18 on the "less powerful" server just to see if it exhibits the same issue. My next next test will be to run 2.6.17.x on the server running 2.6.18 (this is trickier as the server is live, I need to prepare a maintenance window). I did, like Michael, time ls /usr, just to see. After more than 10s without the ls running, I stopped the cp. ls ran and printed: real 0m11.319s user 0m0.000s sys 0m0.000s As you can see, it never ran! This all makes me thing that it is a process scheduler issue and not an io scheduler (which makes sense since I tried almost all the io scheduler and the problem is the same). Moreover, I'm heading for a business trip for the next 5 days, so I'll perform all those tests next week. Many thanks for your help,
Created attachment 9334 [details] dmesg output
Someone instructed me to do:
echo 1 > /proc/sys/kernel/sysrq # before
echo t > /proc/sysrq-trigger # (during stall)
the trace may be useless, my stall is rather shorter than Brice's - and I believe I did 1 trace before starting the I/O.
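A minimal sketch for capturing that task dump to a file during a stall, assuming the kernel log buffer is large enough to hold it and that a shell stays responsive enough to run it (the output file name is arbitrary):

#!/bin/sh
# Enable sysrq, dump all task states into the kernel log, then save the log
# so the traces survive long enough to be attached to the bug.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg -s 1000000 > /tmp/sysrq-t.$(date +%s).txt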
Hi,
I did the following test: I took my extra 'less powerful' server, upgraded it to 2.6.18, and repeated the cp test while using mysql. I couldn't observe the same issue as on the other server, so I am certainly missing another important factor. It is certainly a combination of memory usage (the mysql instance is using 3.5GB of physical RAM, with about 5GB VSS, out of 4GB total physical RAM) and i/o load (mysql load is low but constant). There is almost no shared memory or buffer usage (since mysql uses O_DIRECT for file access). Could it be that cp fills up buffers (and things like that), and since there is not enough physical memory there is a kind of struggle between them?
Next test will be to run 2.6.17.x on my mysql instance and try to reproduce the issue.
Thanks for your help,
Brice
Hi,
A quick followup to my issue. I performed our monthly maintenance window, so I could restart the server. I switched from 2.6.18 to 2.6.19. But alas, it still exhibits the same issue as before, although it feels a little bit better. I tried several experiments, like running the cp command with nice -n19, but it didn't really help. I'm puzzled.
Another experiment was running 'dd' while running mysql:
$ dd if=/data/bigfile of=/root/bigfile iflag=direct oflag=direct
It doesn't exhibit the problem; running the same dd without direct exhibits the usual problem. I will add the top -H -i output to the bug report, taken while running the dd (without direct); it shows that all the mysqld threads seem stuck in D state during the whole time dd runs (at least at the sampling points, every 500ms). I also added a new vmstat output if that matters.
So it seems that one long-reading thread (dd) is able to block many really short writing threads (mysql). I also tried to ionice -c3 or -c2 -n7 the dd process, and things feel better, but it is still not great, and as usual incurs long pauses for mysql.
Any ideas?
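Putting the two observations above together (direct I/O avoids the problem, and ionice helps a little), a minimal sketch of a backup-copy wrapper, assuming coreutils dd with iflag/oflag=direct and the cfq scheduler (the ionice idle class only has an effect with cfq); the paths are placeholders:

#!/bin/sh
# Copy a big file while bypassing the page cache (O_DIRECT) and running in the
# idle I/O class, so it only consumes disk bandwidth nobody else is asking for.
src=$1
dst=$2
ionice -c3 nice -n 19 dd if="$src" of="$dst" bs=1M iflag=direct oflag=direct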
Created attachment 9754 [details] top -H -i -d 0.5 output under 2.6.19 This file contains the top output which shows that the dd process seems to be 'favored' by the system.
Created attachment 9755 [details] vmstat -n 1 output under 2.6.19 vmstat output taken while running dd and mysql under 2.6.19
Created attachment 9761 [details] first sysrq-t task list output (gzipped)
I followed Michael Meeks' suggestion of using sysrq-t while mysql was frozen. Michael was suggesting the mysql process could suffer from lock contention. I took two successive task traces, each of them while cp was running and mysql suffered a freeze. I attached the result to this bug report. If someone more knowledgeable than me could parse it, that would be great.
Otherwise, I analyzed mysql read/write behaviour in my case, and it is mostly (ie over 90%) short writes to log files, because the whole database fits in memory. So reads are performed from the mysql internal buffer space.
It occurred to me that the kernel does not have PREEMPT enabled; maybe that's the cause of those long freezes?
Created attachment 9762 [details] second task list (sysrq-t) gzipped Second gzipped trace, taken a few seconds after the first, but still while mysql was frozen. I hope that will help understand this specific issue.
Created attachment 9766 [details] make balance_dirty_pages use nonblocking writeback You may find that this improves things. It won't fix it, but it might make it better.
Created attachment 9828 [details] Oprofile reports - running only mysqld
Andrew,
I didn't have the opportunity to test the patch you sent earlier; I'll do so as soon as I have a maintenance window on the server. But I did have time to run two oprofile sessions, in case it helps someone understand the issue. Each report contains several different opreports: the first one is a general one, then only the kernel, then finally the kernel with call graph.
The first oprofile report (oprofile_normal.txt) was taken with only mysqld running on the server (ie normal load). It shows that most of the time is spent in mysqld (as it should be) and that the clear winner in the kernel is sync_buffer.
The second oprofile report (oprofile_stressed.txt) was taken while running the huge cp session along with the regular mysql load. What's interesting is that most of the time seems to be spent running cp, and from the kernel point of view, beside sync_buffer as before, in set_personality.
I'll let the knowledgeable people interested in this report comment on it... I hope it'll help.
Created attachment 9829 [details] Oprofile report stressed version -running mysqld + cp This one is the report of profiling while running the cp command.
A quick followup: I reverted the kernel on this machine to 2.6.17.13 (which I had never tested in production, and because of the ext3 corruption detected during the holidays), and everything runs smoothly during a huge file copy, unlike with 2.6.18 or 2.6.19. This narrows the change down to the zillions of commits between 2.6.17 and 2.6.18 (sorry). It is too bad that I can't git bisect, because I'm not free to reboot the server whenever I want. While I'm running 2.6.17, if someone needs me to run some tests, please say so, I'll do it happily. Thanks,
It seems that the above bugreport is related to problems I experience with bad system performance upon copying/moving/unt(r)aring files, similar to problems reported in the gentoo forum:
http://forums.gentoo.org/viewtopic-t-462052-highlight-.html
http://forums.gentoo.org/viewtopic-t-482731.html
My system is a simple AMD64, with 1 SATA drive and 1GB memory, so nothing special (running kernel-2.6.18-gentoo). Upon copying a big file, performing any operation that requires disk I/O becomes impossible until the copying process is finished (throughput of the disk with hdparm is correct), as also shown by the output of dstat during calculation of a hash of an iso file.
-----------------------------------------------------------------------------
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq|_read _writ|_recv _send|__in_ _out_|_int_ _csw_
 32   7  48  12   0   0|  69M     0 |   0     0 |   0     0 |1246   897
 26   6  46  19   0   1|  60M     0 |   0     0 |   0     0 |1132   933
-----------------------------------------------------------------------------
Setting swappiness to 20 has no effect, nor does setting /proc/sys/vm/dirty_background_ratio differently. I've also tried the patch given in this topic to see if it would resolve the problem. Unfortunately it does not solve the problem here. I'm happy to test other suggestions...
Created attachment 10296 [details] Output of dstat while calculating a checksum
As a follow-up to my previous post, I now attach the output of dstat while calculating the checksum of a DVD ISO. Simultaneously, I try to open other applications such as a simple terminal window. If you look at the lines where the processes are being started, you see that there are a large number of blocked processes - something that does not happen when I try to start an application without applying the heavy load. That does make sense, as I try to open applications during the calculation of the checksum, but they don't appear.
However, there is one thing that appears to be important for this bug. You see that roughly in the middle there is a line where 1290M is read from the harddisk, which is clearly wrong as it should be ~60MB/s. That is exactly the point where I see the whole system briefly stop, in this case very noticeably. The output from dstat is also halted. The large numbers are an artefact because dstat simply adds up all the reading that occurred over a much longer interval (20 seconds, in fact). The same is visible in the network and system columns: divide those numbers by 20 and they correspond to the rest.
As the IO clearly keeps running, I think this provides clear evidence that there is a problem in the process scheduling codepath, not the IO scheduling. The only strange thing I see is that exactly at that point 6 new processes are created, but that could be an artefact as well.
I hope this provides some insight into the problem. As said, I'm happy to test things.
There are too many people with x86_64 suffering from this issue. Can we please have someone look at the data provided here and make some progress on this issue?
The threads are blocked in io_schedule. Most likely a driver or block layer problem. I had a similar problem some time ago which is also still unsolved, but the best theory for that was starvation in dirty accounting.
Limit the queuing depth on the SCSI device, it's likely starvation there. Ala:
echo 4 > /sys/block/sda/device/queue_depth
echo 4 > /sys/block/sdb/device/queue_depth
Reply-To: akpm@linux-foundation.org
On Tue, 6 Feb 2007 03:46:22 -0800 bugme-daemon@bugzilla.kernel.org wrote:
> I had a similar problem some time ago which is also still unsolved,
> but the best theory for that was starvation in dirty accounting.
Yes, I expect that's the case. Could the reporters please always capture the sysrq-T output during the stalls? I expect we're stuck in balance_dirty_pages(). People are slowly making changes in this area, but this one is tough, and really the guy who should fix it is me and I just don't seem to have five minutes to think about the same thing lately :(
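A quick, hedged way to see where stalled tasks are sleeping without waiting for a full sysrq-T dump, assuming /proc/<pid>/wchan is available and CONFIG_KALLSYMS is set so the wait channel shows up as a symbol name (a pile-up in balance_dirty_pages or io_schedule is what this would reveal):

#!/bin/sh
# Count the kernel functions that currently-sleeping tasks are blocked in.
for w in /proc/[0-9]*/wchan; do
    cat "$w"; echo
done | sort | uniq -c | sort -rn | head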
The current value of /sys/block/sd[a-d]/device/queue_depth is 1 on my system. Did you mean to increase it to 4?
Moreover, I can reproduce short freezes (of a few seconds) doing just a copy of a large file over cifs/smbfs, and the file involved is on an IDE (/dev/hda5) disk. In another experiment, I copied a large file (2 gig on /dev/hda) over cifs and ran an md5sum calculation on a file (3.5 gig on /dev/sdd) at the same time in k3b. During this time the system freezes every 2-3 seconds, throughput of both disks falls to 2-5 Mbytes/sec compared to the normal 50-60 Mbytes/sec, the md5sum calculation just crawls, and the mouse freezes. You can see the dstat output (sorry, I didn't save it, and the formatting in bugzilla would be destroyed) at http://forums.gentoo.org/viewtopic-p-3888210.html#3888210
These are separate disks on separate channels, one IDE and one sata. Their throughput should be pretty much at the max, even for concurrent access. The thing is that the freezes and throughput reduction don't happen if the network is taken out of the picture.
I've tried the suggestion of Jens to see if setting queue_depth to 4 has an effect. It seems to have very little to no effect. While calculating that checksum I still have to wait 40s before a simple terminal opens. Clearer proof that it doesn't help is dstat again: I can still see stalls happening. In this case the system stalls for roughly 8 seconds while IO continues, as you can see below. This is with the queue_depth set to 4 (normally it is set to 31).
----------------------------------------
32 4 49 14 0 0| 66M 0 |
28 5 43 23 0 1| 483M 8192B|
31 6 47 17 0 1| 60M 0 |
----------------------------------------
@anonymous emailer... comment #30: In my case it is very difficult to trigger something recognizable like the stall above. Most freezes on my system are short... less than a second. Not even noticeable when looking at the throughput from the HD. Unfortunately I don't completely understand which info you need and at what exact time. Do you have a clear pointer to a document that shows me what to provide and how to do it? I do have CONFIG_MAGIC_SYSRQ set to y in my kernel.
Created attachment 10321 [details] Output of SysRQ during stall
Ignore my previous question... stupid me, Brice already posted how to do it. The attached file shows the output obtained with Sysrq-T. I set it up with "echo 1 > ..." and had already typed "echo t > ..." so that I only needed to hit enter. Then I repeatedly calculated the checksum of an ISO with K3B until I saw a clear stall. At that point I hit enter in the console to trigger the dump. At that specific point the system stalled for ~10s judging from dstat. I hope that the dump indeed corresponds to that point.
Reply-To: akpm@linux-foundation.org
On Tue, 6 Feb 2007 13:16:58 -0800 bugme-daemon@bugzilla.kernel.org wrote:
> Created an attachment (id=10321)
Hm, that's unexpected: everyone is sitting waiting for a read to complete - nobody is stuck in balance_dirty_pages(). This points more at an elevator problem, or device driver, or perhaps contention for the IDE bus itself. Other people in this thread _have_ been seeing stalls due to balance_dirty_pages() throttling, so I suspect we're looking at two different and unrelated problems here.
So you were reading an ISO from a CDROM drive, yes? Are all those stalled processes also trying to read from IDE? Perhaps on the second channel of the same controller?
No, it's not from a CD. It is an ISO image of a DVD that is located on my harddisk. I calculate the MD5 checksum within K3B, which you have to do to burn it to DVD. I took that as a testcase as it gives the clearest pattern I could find. I have only a single SATA harddrive in my system, so all other IO operations access the same HD. Note however that I also try to open a terminal while this runs, which I do on a routine basis; for that reason it should come from memory and not from the HD. However, even that takes 40+ seconds to happen.
Reply-To: akpm@linux-foundation.org
OK. This actually could be a CPU scheduler problem - we've had problems in the past where specific operations like this trick the CPU scheduler into giving the mostly-busy task a too-high priority. Please try nicing k3b down, see if that makes the problem go away. Maybe nice -20 k3b ...
Sorry for spamming this evening... I've now tried an older kernel as requested by the starter of the bug report. I compiled the oldest 2.6.17 kernel in the Gentoo tree, which was the vanilla 2.6.17.6 version. With this older kernel there is no problem opening other applications while the checksum is being calculated. The system also behaves decently under large copy operations such as "cp *.iso /dev/null", which under the 2.6.18 kernel also lead to huge slowdowns.
> Please try nicing k3b down, see if that makes the problem go away. Maybe nice -20 k3b
Are you sure you didn't mean "nice -n 20 k3b"? (I thought negative values increased the priority.)
If I run k3b with "nice -20 k3b" the behavior is pretty much the same: very long waiting times before an application appears, and long stalls visible in dstat. If I run k3b with "nice -n 20 k3b" the behavior appears a bit smoother most of the time, but long waiting times happen here as well. With both commands I see the long stalls of ~10 seconds appear in the output of dstat, so overall I would say it has little effect.
It might be relevant: in both cases I see K3B always at the top of the list in the output of top, with priority 34 or 35 and nice equal to 19.
Ingo asks if Henkjans can try the -rt kernel: "I've got function tracing there so if he can reproduce it there i can tell him how to get a proper trace and fix it in no time." That's http://people.redhat.com/~mingo/realtime-preempt/
henkjans_bagger@hotmail.com @ 2007-02-06 14:45:
>With this older kernel there is no problem to open other applications while the
>checksum is being calculated. Also the system behaves decent under large copy
>operations such as "cp *.iso /dev/null". Which under the 2.6.18 kernel also
>leads to huge slowdowns.
This matches what I was observing on my server. Can you try to git-bisect between 2.6.17 and 2.6.18 as I suggested in a private mail? Assuming we suffer from the same issue, it would isolate the patch that introduced the problem and allow kernel developers to spot the issue. I can't do it myself because I can't freely reboot the server (ie I have a strict maintenance schedule every month for these systems, so I estimated it would take no more than 6 months to find the culprit :-P ).
Also, for the kernel developers reading this bug report: would it help to have some sysrq-t output (or other debug material) from a non-affected kernel (2.6.17)?
Thanks,
Brice
> Ingo asks if Henkjans can try the -rt kernel ...
1) I don't know if it is good or bad news. Although it seems more difficult to trigger the stall, I can reproduce the problem with the RT kernel. I've taken the vanilla 2.6.18 kernel and patched it with patch-2.6.18-rt3. Also in this case I'm able to get a clear 10s stall in the output from dstat. I was not sure which kernel to take, but as I have always been having the problem with the 2.6.18 kernel I thought moving to 2.6.20 would add an additional unknown. So the question: what should I do to provide a proper trace?
2) I've also been trying to narrow down which kernel was the last that did not show the problem. It seems 2.6.17.14 does not show the problem of the long stalls, while the first kernel from the 2.6.18 branch does.
I've now finally completed running a git-bisect. It took a long time as seeing the stall occur was not always easy. The output:
6edad161cd4dfe1df772e7a74ab63cab53b5e8c1 is first bad commit
A quick search with google yields the info given below. Seeing that the commit is indeed related to libata/SATA, I'm pretty sure that I did not make an error during the git-bisect.
---
commit 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1
tree 389d6daa728b2ba1bd8c2180cab705706449f62a
parent 236ee8c33277ab48671995f26dc68a4639936418
parent 0dd4b21f517e138ea113db255645fbae1bf5eef3
author Linus Torvalds <torvalds@g5.osdl.org> Fri, 23 Jun 2006 15:58:44 -0700
committer Linus Torvalds <torvalds@g5.osdl.org> Fri, 23 Jun 2006 15:58:44 -0700

Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev

* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev: (258 commits)
  [libata] conversion to new debug scheme, part 1 of $N
  [PATCH] libata: Add ata_scsi_dev_disabled
  [libata] Add host lock to struct ata_port
  [PATCH] libata: implement per-dev EH action mask eh_info->dev_action[]
  [PATCH] libata-dev: move the CDB-intr DMA blacklisting
  [PATCH] ahci: disable NCQ support on vt8251
  [libata] ahci: add JMicron PCI IDs
  [libata] sata_nv: add PCI IDs
  [libata] ahci: Add NVIDIA PCI IDs.
  [PATCH] libata: convert several bmdma-style controllers to new EH, take #3
  [PATCH] sata_via: convert to new EH, take #3
  [libata] sata_nv: s/spin_lock_irqsave/spin_lock/ in irq handler
  [PATCH] sata_nv: add hotplug support
  [PATCH] sata_nv: convert to new EH
  [PATCH] sata_nv: better irq handlers
  [PATCH] sata_nv: simplify constants
  [PATCH] sata_nv: kill struct nv_host_desc and nv_host
  [PATCH] sata_nv: kill not-working hotplug code
  [libata] Update docs to reflect current driver API
  [PATCH] libata: add host_set->next for legacy two host_sets case, take #3
  ...
---
>From henkjans_bagger@hotmail.com 2007-02-07 16:05
>I've now finally completed running a git-bisect. It took a long time as seeing
>the stall occur was not always easy.
>
>The output: 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1 is first bad commit
>
>A quick search with google yields the info given below. Seeing that indeed the
>commit is related to libata/Sata I'm pretty sure that I indeed did not make an
>error during the git-bisect.
Thanks for taking the time to bisect the problem. Unfortunately, at first glance that's not good news for me: my servers are pure SCSI based (no sata)... That means either:
1) your problem is different from mine, as was suggested earlier, or
2) the aforementioned patch changed a subtle thing in the SCSI layer (libata is implemented behind it). The only thing it seems to touch in the upper SCSI layer is the implementation of the new SCSI error handling. Maybe this change "broke" something regarding the SCSI low-level driver I'm using (megaraid).
Regarding the -rt kernel (I'm no expert, I just read the rt wiki): you have to make sure Latency Tracing is activated in the kernel configuration (under Kernel Hacking, I guess), and then you can read the offending stack traces from /proc/latency_trace.
Brice
@Brice: I'm afraid somebody has to tell me in a bit more detail what info is needed from the rt-kernel. I've compiled that kernel with the given option. However, what I see happening is that the content of /proc/latency_trace is changing all the time (latencies >3ms), so I have no clue what exactly I should look for or attach to this bugreport. Another disadvantage is that my system completely deadlocks after some time with the RT-kernel (the only option is a reboot via the power button) - something I've never seen happen with any other kernel on this machine.
It has now been almost a week since I performed the bisect to locate which commit introduced the problem in my specific case; see the post in comment #c42. Is any more information needed from me to be able to solve these seconds-long stalls during any IO-intensive process? It would be really nice to be able to move to a newer kernel than 2.6.17.
ps) As said: I can reproduce the problem with the RT-kernel as well, but I need to know what information I should provide.
I don't have the faintest idea how a sata change could cause this behaviour, sorry. Perhaps you could have another go with git-bisect and see if you can identify the offending change amongst those 20?
> Perhaps you could have another go with git-bisect and see if
> you can identify the offending change amongst those 20?
1) Unfortunately I fail to see how I should do that. I did the bisect between 2.6.17 and 2.6.18. The given first bad commit was the last one in the tree. After that I could go no further.
2) The output with the RT-kernel would not provide the needed input as asked in comment #39? As said, I could reproduce it there as well.
Reply-To: akpm@linux-foundation.org
> On Tue, 13 Feb 2007 12:59:31 -0800 bugme-daemon@bugzilla.kernel.org wrote:
> 1) Unfortunately I fail to see how I should do that. I did the bisect between
> 2.6.17 and 2.6.18. The given first bad commit was the last one in the tree.
> After that I could go no further.
Yeah, I know. git-bisect is a bit sucky. I expect that you can bisect it further by doing `git-bisect good' and `git-bisect bad' against the first and last patches in that series-of-20.
Henkjans, your kernel might just be stuck in EH thus hanging all IO accesses. Does the kernel complain about anything (dmesg) after such stuttering? Also, please report the result of 'lspci -nnvvv' and 'dmesg' right after boot and 'dmesg' right after the problem has occurred. Thanks.
Created attachment 10409 [details] Output of lspci -nnvvv as requested
Created attachment 10410 [details] output of dmesg after boot (=identical after stutter) Although the stutter is clearly observable in the output of dstat the kernel is not complaining at all. The output of dmesg after the stutter is identical to the output of dmesg immediately after boot.
>> Perhaps you could have another go with git-bisect and see if
>> you can identify the offending change amongst those 20?
>
>1) Unfortunately I fail to see how I should do that. I did the bisect between
>2.6.17 and 2.6.18. The given first bad commit was the last one in the tree.
>After that I could go no further.
I think you'd have to fetch the libata git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
Then:
git-bisect reset
git-bisect start
git-bisect bad 0dd4b21f517e138ea113db255645fbae1bf5eef3
git-bisect good v2.6.17
(I think that the last known bad commit is 0dd4b21..., but I'm not really fluent with git...)
>2) The output with the RT-kernel would not provide the needed input as asked in
>comment #39? As said I could reproduce it there as well.
Make sure you have the various RT debug options set under "Kernel Hacking" in the kernel config (see: http://lwn.net/Articles/146861/). Then you can read the highest latency from /proc/latency_trace and observe dmesg, as there should be something there. You can reset it (ie just before your test, to make sure you read something useful) with:
echo 0 > /proc/sys/kernel/preempt_max_latency
Regarding your crashes with -rt, I unfortunately can't comment. Maybe you should ask on lkml or a more appropriate list.
I'll soon perform a test myself (ie when I'm able to cg-fetch a kernel) on my db server, eliminating the patch you bisected from a 2.6.18 kernel, to see if it solves my problem or not. This will allow us to verify whether we're suffering from the same exact issue.
Maybe it is also time to bring the discussion to lkml to ask if other people suffer from this issue?
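To tie the -rt tracing instructions above into a repeatable procedure, a minimal sketch, assuming the latency-tracing knobs are where the comment above says they are (/proc/sys/kernel/preempt_max_latency and /proc/latency_trace) and that the cp below stands in for whatever reproduces the stall:

#!/bin/sh
# Reset the recorded maximum latency, run the reproducer, then keep the trace.
echo 0 > /proc/sys/kernel/preempt_max_latency
cp /data/bigfile /backup/bigfile.copy     # placeholder reproducer
cat /proc/latency_trace > /tmp/latency_trace.$(date +%s)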
I'm not really convinced that it is useful to run another bisect. (It would take me a very long evening, as it is not easy to determine good or bad.) If I look at the patches that showed up in my original run in comment #42, I can exclude 50% of them, as those are drivers that I don't use. Basically my system uses ahci and libata. If my problem is the same as what Brice was experiencing, the file ahci.c can be excluded. If I then look at the patches that have been introduced for files with libata or scsi in their name for the commit given in #42, it seems to me that the changes to the files involved are closely related.
ps) I looked at the patches at http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6edad161cd4dfe1df772e7a74ab63cab53b5e8c1
Created attachment 10441 [details] trace-it output
Attached you will find trace-it output of 2.6.19.1-rt15 during a particularly busy period, with an emerge --sync / updatedb / matlab in the background and a movie playing. There was some skipping of the audio, but the skipping and unresponsiveness were not as bad as they sometimes get after a few days of uptime. But maybe you can see the culprit function from this?
Btw, where is a good tutorial on how to read/interpret the output of trace-it?
Thanks
Pages are kept in memory for re-use, which is fast and fine except for:
1) data without re-use value, or even single-use data
2) applications that _do not_ advise the kernel how to cache pages related to their self-generated i/o. POSIX does provide mechanisms to do this properly, but the kernel should help these poor apps.
To minimize your MySQL backup cp(1) problem, try this workaround:
cat ./cp_direct.sh
#!/bin/sh
dd if=$1 of=$2 bs=1M iflag=direct oflag=direct
Combine this with [/etc/sysctl.conf]:
vm.vfs_cache_pressure = 1
vm.dirty_ratio = 2
vm.dirty_background_ratio = 1
This should reduce both the stress on the vm and the response latency during interactive work.
-- Leroy
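A short usage sketch for the workaround above, assuming the sysctl values should also take effect immediately rather than only at the next boot (the paths are examples):

sh ./cp_direct.sh /data/bigfile /backup/bigfile
# apply the values without a reboot (or run "sysctl -p" after editing /etc/sysctl.conf)
sysctl -w vm.vfs_cache_pressure=1 vm.dirty_ratio=2 vm.dirty_background_ratio=1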
I'm adding myself to the cc as I seem to be hit by the same bug: 2 dual-core Xeons running amd64 with a 3ware 9550sx sata raid controller (scsi driver). I tried completely without ide-ata or libata, but the problem remains even when using only the scsi subsystem. The system is otherwise idle, so if you have any suggestions... compiling works fast (small files .-( )
I can confirm that 2.6.17.14 works without showing this bug. 2.6.16.36 seems not to work - at least it's worse! The gentoo kernels for 2.6.18, 2.6.19 and 2.6.20 show this bug. I'm just testing a vanilla 2.6.20.3.
*bump* *vote*
Hi, my specs: Asus M2N-E, AMD64 X2 3800+, SATA hdd, IDE cdrom, nvidia driver, 1 gig RAM, 1 gig swap. I read all the bug reports and stuff, invested 4 hours, and am getting pissed off at this one. I just wanted to emerge hugin, and as it eats all the memory during compile it renders my system unusable under a serious swapstorm. The CPU is not that busy, but disk I/O seems to kill it. This is an issue which should be fixed ASAP!
I did some compiling/booting cycles over the last days trying to get a hold of this bug on my server. I stripped my config down to the needed things, tested vanilla 2.6.16 to 2.6.20, and was not able to reproduce my error. It seems I had configured it out somewhere...
I use SMP, scheduler extensions for multicore and hyperthreading, the CFQ scheduler, acpi, cpufreq, preemption, scsi, sata (old), netfilter, e1000, i2c, sensors, usb, ext2, ext3, vfat, i9xxx sata raid on scsi, and some others. I didn't configure libata, numa support, or sound.
Now I switched my kernel from old ide/sata to libata and voila: the bug reappeared and I'm waiting 4 min for my ls. I will try to track it down and git-bisect it.
Hi,
I went backward within git through every kernel from 2.6.20 to 2.6.16 and found the bug in each one. My test case: boot, mount my sata partition, and copy a 16GB file from the sata disk to my scsi/raid controller. Watch the processes with top while doing so, and run a "time ls -l /usr/share/doc" 30 sec after starting cp. With old ata it needs up to 30 sec; with libata it waits for more than 10 min! You can see via top that every process which wants the disk gets onto its own cpu and this cpu sits at 100% waiting for i/o. Processes without disk access (like the top process) show normal and fast response.
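A minimal sketch of that test case as a script, so each kernel (or bisect step) can be scored the same way; the source and destination paths and the 16GB file are placeholders for whatever exists on the test box:

#!/bin/bash
# Start the big copy in the background, wait 30s, then time a small directory
# listing; the elapsed "real" time of the ls is the pass/fail criterion.
src=/mnt/sata/bigfile
dst=/mnt/raid/bigfile.copy
cp "$src" "$dst" &
cp_pid=$!
sleep 30
time ls -l /usr/share/doc > /dev/null
kill $cp_pid
rm -f "$dst"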
Surprisingly, ulf.dambacher@web.de removed me from the cc list for some unknown reason. More surprisingly, he could actually do that!!
Anyway, this seems like a clear ata regression. It has even been brought down to a set of commits. What else is needed for kernel devs to pinpoint the problem? What is the strategy going forward? It's been 5 months since the bug was reported, and we haven't seen any serious in-depth research and/or update from any kernel dev.
@ulf.dambacher
I will soon test what Ulf discovered (ie that libata is the culprit); libata is compiled into my kernels, but I don't use it (no SATA drives). I will report my findings to the bug report. There is still something that doesn't match: I'm not affected on kernels up to and including 2.6.17, which does not seem to be your case. Did you have time to git-bisect the issue, to see if you come to the same conclusion as henkjans_bagger@hotmail.com?
@devsk
Yes, I agree with you: we have now clearly identified the culprits, even if they are not explained. It's a shame that this bug gets no kernel dev attention. Maybe we should CC the libata maintainer and move the bug section to Drivers.
First: my apologies for removing devsk. I mistook his email for a spambot because I got emails with the same timestamp as bugzilla. Won't happen again!
I did some more tests to find the problematic subsystems. Now I am totally puzzled: I compiled a kernel with only scsi and usb hd support, no ata or libata. I booted and did my standard test: BUG. Then I reconfigured a bit to boot the machine with a 32-bit kernel from sata. My 32-bit kernel with scsi and libata does not show the bug and works more or less fluently.
From all this I conclude that:
- it is amd64 specific; x86 on the same machine works
- I only see it with the scsi subsystem and additional subsystems using scsi (like usb-storage or libata)
- scsi + old ata/sata works
- it does not depend on the io scheduler
- it is not dependent on preemption
- it is not dependent on acpi
Can one of the kernel gurus please give me a hint on how to do further testing? I may even give you a live connection if needed... Currently I have time for testing, but this machine should be online ASAP. If I don't find the bug, I have to reinstall 32-bit (and I will hate that .-)
Reply-To: akpm@linux-foundation.org Ulf, what do you mean by "BUG"? Reading back, I don't see anywhere where you've actually described the symptoms which you're seeing. And I have a suspicion that we're dealing with multiple, different problems in this lengthy report. Thanks.
@akpm
My bug shows up as follows: I copy one 15GB backup file from one (removable) harddisk to my main harddisk. While doing so, the system gets unusable: I cannot log in, and every command I type takes ages to complete. E.g. a simple ls in my home dir or an ls /usr/share waits for more than 10 min. If it ran before and gets its data from the cache, it works faster. Interactively running programs like top behave normally and fast as long as they don't need disk access. Speaking of top: it shows that at least the cp process keeps one cpu 100% waiting. I see that a process needing hd access gets onto one processor, which is then 100% waiting. If I stop the cp process, it takes some time until the system gets responsive again.
Compare this with "normal operation": naturally, if the system is under heavy disk access, other processes work less fast. But a simple ls while doing the cp does not take more than around 5 sec in my 32-bit test case. With only scsi and old ata on 64-bit it takes some 3-10 sec.
I think it's the same kind of behaviour Brice, who opened this bug, talks about.
I did some more tests, e.g. I completely removed any preemption and turned off the nmi watchdog. No visible changes.
The only way I see normal/fast behaviour while doing my "cp" is if I copy from old ide/sata to my scsi system. Trust the LEDs: the source drive is slowly glimmering, the destination drive LEDs flicker at a slower frequency showing writes. If I do read i/o they flicker differently and out of sync. top shows %wa of around 20%, and ls commands behave normally/fast.
If I try that with libata, the destination disk LEDs are on, signalling heavy load, and the source is flickering at a slow frequency. By the out-of-sync flickering you can really see when other jobs get their read access in between. And top shows %wa of 100% for at least 2 processes, and you have to wait for ls. Using usb_storage I get nearly the same behaviour as with libata.
I found out why my bug shows up with usb and sata but not with the old ide/sata system: the latter is too slow on my computer. dstat revealed that the cp ran at ~40M/s on usb or sata and only 3k/s on old ide, so there was much time for other processes' reads.
I'm using (Linux flatline 2.6.20-gentoo #1 PREEMPT x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux).
I can confirm that if I perform a lot of reads using process A (e.g. an AIDE scan), and a lot of reads using process B (e.g. an emerge --metadata), then process C (e.g. adding one line of text to a 20-line text file and saving the file), performing a small write, is blocked and completes in around 30 seconds. So to summarise: I can create a condition on my linux box whereby it takes 30 seconds to extend a file by 3 bytes. In addition, if multiple processes are performing intensive reads, the IO throughput achieved by each process is extremely degraded.
And this is the clincher... if I boot into any kernel with version <= 2.6.15 it works just fine.
Also I would like to say that there is little visibility in this bug as to how many people are actually affected, perhaps because they do not post to bugzilla.kernel.org, but the thread in the gentoo forums about this problem is now 12 pages long. I just want to make people aware of how many people are affected (= lots).
Info to support this bug:
Arch: amd64, glibc 2.5
Filesystems: xfs (I have also tried ext3; the same problem occurs)
IO scheduler: cfq (I have tried the others, same problem)
Block layer: md raid0
Devices: sata_sil
It's pretty much a show stopper and I hope it can be solved. Is it being taken seriously enough? There has been no activity on this bug for a while.
John Helliwell,
Can you try to git-bisect the problem? Or ask on the gentoo forum you mention whether someone can git-bisect the issue precisely. If that matches what other bug posters have found, we'll forward everything we know to lkml, where it'll get more public scrutiny (and developer eyes too, I hope).
Did you try all the mentioned tips, like vfs_cache_pressure and things like that? Are you affected with vanilla kernel 2.6.17 too?
I've been experiencing this issue for some time now. I'm on an Athlon64 X2, 1GB RAM, nForce 4 motherboard with both PATA and SATA drives, running Gentoo. System compiler is gcc-4.1.2 with glibc-2.5. Yesterday, a user posted on the Gentoo forums about Ingo Molnar's "Modular Scheduler Core and Completely Fair Scheduler" patch (http://kerneltrap.org/node/8059). I applied v3 of this patch (http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v3-v2.6.21-rc7.patch) to kernel 2.6.21-rc7. The "stuttering" is present in unpatched 2.6.21-rc7, but is not present in 2.6.21-rc7 with Ingo's patch.
@Thomas Tuegel
This doesn't mean that the scheduler is the culprit. It could just be that using another scheduler gives less cpu to the various processes doing reads, and thus the process doing writes gets more chance to run (and does more processing). Applying the cfs scheduler patch might fix the symptom but doesn't cure the disease.
Can some of the gentoo users try to git-bisect the issue, to see if it matches what others have found? Can you please post the gentoo thread URL?
Thanks,
Brice
@Brice Figureau Thread starts here: http://forums.gentoo.org/viewtopic-t-482731.html I think it may have been mentioned before, but I couldn't find it in the 70-odd bugzilla posts. You're right about symptoms vs disease. It turns out that a process doing heavy reads no longer locks down the system, but a process doing heavy writes still does. My apologies for the incorrect post earlier. There are a few users on the forums now talking about doing git-bisects. I'll be able to divert my attention to that this weekend.
Just a quick followup to my original bug reports. I have run 2.6.17 with great success since I discovered the issue in subsequent kernels. I decided to move to 2.6.20.6 last week, to see if things have changed (who knows?). Alas, nothing new; it is still the same story. Beside that, running with vfs_cache_pressure=1 seems to really help: not perfect (as in 2.6.17), but much more usable. As soon as I have the possibility to reboot the server, I'll try to revert the culprit commit found by Henkjans. I'll report my findings. /me going back to 2.6.17 :-(
There is a patchkit on linux-kernel from Peter Zijlstra that might fix this. http://thread.gmane.org/gmane.linux.kernel/518019 Can people perhaps test it and report if it fixes the problem?
Did anyone try the patches in comment #74?
New patches at: http://lkml.org/lkml/2007/5/10/161 These do include something to make inter task starvation better. The previous set was only geared at inter device starvation issues.
I just finished reading over these comments, and see there has been no update since 2007/5/10. Have the patches listed in c75/c76 fixed the issue? I've used both kernels 2.6.18 and 2.6.20 (stock kernels from yum updates via fc6) and have the issue of high iowaits during file reads & writes; during these times requests to the io subsystem block and wait. I tried setting vfs_cache_pressure = 1, and that has helped system response time during io, but the iowaits still persist.
platform details:
cpu - amd athlon xp 2800+
memory - 2gb
motherboard - asus a7n8 - includes sata_nv (nvidia), sata_sil
3ware 9500s controller - kernel module 3w_9xxx
- lun 0 - 2 x 320gb seagate barracudas
- lun 1 - 6 x 500gb seagate barracudas
I'm currently looking for a 2.6.17 kernel rpm (sorry, lowly user here) to install on the system. If unsuccessful, I'll grab kernel source for both 2.6.17 and 2.6.21, including the patch, and give those a try during the next couple of days.
I have another dimension to add to this problem. I took the 2.6.21 sources from my amd64 install and created an x86 kernel in a chroot install. I did a 'make mrproper', copied the amd64 .config, and ran 'make oldconfig', followed by make menuconfig to verify the config. I booted the resulting x86 kernel and the freeze problems are gone. So, in a nutshell, I have two installs built from the same kernel source: an amd64 one and an x86 one. I can easily make the mouse, gkrellm2, ksysguard, and other visual updates freeze for > 5 secs under amd64. It is not reproducible even under much more severe (almost three times the) IO load under x86. During the tests, the IO waits in x86 stay close to the 50-70% range, touching the 90s once in a while. I am staying with x86 for now.
I see that the original reporter as well as many others reporting the issue are indeed using x86_64. So, can you guys please boot an x86 livecd with 2.6.21, mount your FSs and do your tests to verify if you see what I see. If anyone needs me to post some info, please let me know.
Does the x86 system run with APIC enabled or disabled? (You can see it in /proc/interrupts.) If not, try running the 64-bit system with noapic for testing and see if that helps -- or run the 32-bit system with APIC enabled and see if it stalls too. I recently looked at an unrelated stall on my workstation and one of the problems seemed to be interrupt starvation for some interrupts. Changing to PIC mode could possibly change that.
APIC is enabled on both. Same .config (well, to the extent that x86_64 and x86 configs can be the same). Same cmdline (except for the install root) for both kernels. I have booted the 64-bit kernel with 'noapic nolapic' in the past without any improvement. irqbalance (I noticed one difference between the x86_64 and x86 .config: there is no in-kernel IRQ balancing on x86_64), running or not, doesn't make any difference.
This bug is so utterly annoying that I have already downloaded the FreeBSD install CDs to switch to their kernel, as this bug leads to an unworkable system under load. Based on the last posts of devsk I decided to first try switching to a 32-bit system. I can confirm the findings of devsk: the problems I experience with the 64-bit kernel are NOT present when running a 32-bit system. Hopefully, together with the bad commit reported in post #42, this gives a reason to look at that commit. Please have a look at this bug, as it is annoying as hell for the people that experience it, and it is still present in the latest kernel I've tried (2.6.22-rc3).
Subject: Re: Unusable system (ie slow) when copying large files.
On Thu, 14 Jun 2007 11:38:07 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=7372
>
> henkjans_bagger@hotmail.com changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |henkjans_bagger@hotmail.com
>
> ------- Comment #81 from henkjans_bagger@hotmail.com 2007-06-14 11:41 -------
> This bug is so utterly annoying that I already downloaded the FreeBSD
> install-CD's to switch to their kernel as this bug leads to an unworkable
> system under load.
>
> Based on the last posts of devsk I decided to first try to switch to a 32-bit
> system. I can confirm the findings of Devsk. The problems I experience with
> the 64-bit kernel are NOT present when running a 32-bit system.
>
> Hopefully with the bad commit reported in post #42 this gives a reason to
> look at that commit. Please have a look at this bug as it is annoying as hell
> for the people that experience it and it is still present in the latest
> kernel I've tried (2.6.22-rc3).
>
Jeff, Tejun: have you guys been following this long-running saga? It seems to be hurting a few people and I have a suspicion that multiple unrelated problems are at play here. But this reporter at least has bisected it down to an ata commit from a year ago.
_some_ of these reports (eg #42 and thereabouts, and #68) do I think point at an ata change. But some of the others make it look more like the balance_dirty_pages() starvation thing. Quite possibly it's an interaction between both, which means that fixing balance_dirty_pages() starvation will fix it.
Peter, could I ask that when you have time, you spin up a version of your balance_dirty_pages patchset for the various reporters to try? Against 2.6.21 would suit, I guess. Thanks.
Created attachment 11756 [details] output of diff /tmp/config.x86 /tmp/config.x86_64 > /tmp/config.diff Is anyone ready to take on the x86 vs. x86_64 angle? I have identical kernel source and .config (only differs in parts where autoconfig removes unsupported options or adds default options for the ARCH) files, and problem only appears in x86_64.
Well, i'm going to plead some ignorance here, on the x86 vs. x86_64. I do understand the difference (32/64bit), and I'm running x86 mode. vendor_id : AuthenticAMD cpu family : 6 model : 8 model name : AMD Sempron(tm) 2400+ But regarding the issue of the IOWAITs under high io load, the issue seems isolated to volumes connected via the 3ware controller, both raid1 (sda, pair of 320s) volume and raid5 (6 x 500gb) volume. Another couple points I'll mention here. - using xfs, reiserfs, ext2 and ext3, problem persists - 2.6.18, 2.6.20, problem exists - yet to test 2.6.17 - you guys want this tested? write caching is enabled on both 3ware volumes. When I copy data to these volumes, they will push iowaits up to 80-100%. however, on the same system, I have an external sata drive mounted, and when I write to that (reading from nfs mounted disk from other machine in all tests), i -don't- get the iowaits on that volume, which is connected via sata_sil, i believe here are some quick copy tests, WITH the option in vfs_cache_pressure set to 1 >> nfs mount to external sata 1 34 26 23 2 15| 0 19M| 0 0 | 0 0 |3137 2388 ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 1 39 0 41 3 16| 0 41M| 27M 820k| 0 0 |3896 2576 1 35 22 25 2 15| 0 13M| 0 0 | 0 0 |3742 2288 0 35 30 17 1 17| 0 0 | 31M 949k| 0 0 |3569 2274 1 31 35 19 1 13| 0 0 | 0 0 | 0 0 |3430 2288 1 35 20 27 2 15| 0 18M| 29M 889k| 0 0 |3463 2369 0 24 0 56 4 16|4096B 42M| 0 0 | 0 0 |3533 2759 1 24 36 19 1 18| 0 11M| 27M 837k| 0 0 |3686 2471 1 28 32 22 2 16| 0 0 | 0 0 | 0 0 |3571 2400 1 31 35 18 2 13| 0 0 | 29M 898k| 0 0 |3410 1936 1 35 21 29 1 14| 0 14M| 0 0 | 0 0 |3314 2154 1 37 0 43 4 15| 0 40M| 28M 843k| 0 0 |3441 2381 1 36 21 25 1 16| 0 15M| 0 0 | 0 0 |3382 2180 0 36 30 19 2 14| 0 0 | 28M 877k| 0 0 |3515 2166 1 35 35 13 1 15| 0 0 | 0 0 | 0 0 |3816 2257 1 37 20 29 1 12|4096B 12M| 31M 942k| 0 0 |3089 2165 0 30 0 53 1 16| 0 41M| 26M 793k| 0 0 |3255 2785 [root@achilles external]# time cp /swraid5b/linux.iso . real 0m47.415s user 0m0.078s sys 0m5.700s [root@achilles external]# >> nfs mount to sda, 3ware 9500 raid1 volume ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 1 36 0 46 1 15| 36k 560k| 32M 995k| 0 0 |3483 2282 1 37 0 44 2 16|1280k 3108k| 0 0 | 0 0 |3906 2450 0 38 0 47 1 14| 208k 332k| 31M 964k| 0 0 |3381 2426 1 5 0 93 0 1| 188k 19M| 0 0 | 0 0 |1222 1093 0 1 0 99 0 0| 40k 7544k|2565k 76k| 0 0 |1131 411 0 9 0 89 0 2| 16k 21M| 0 0 | 0 0 |1405 1338 1 38 0 46 2 13| 100k 2476k| 12M 355k| 0 0 |3482 2387 1 40 0 42 2 15| 20k 7492k| 0 0 | 0 0 |3667 2575 0 33 0 54 1 12| 116k 6412k| 27M 831k| 0 0 |2858 2233 1 24 0 73 0 2| 56k 19M| 0 0 | 0 0 |1635 2359 0 17 0 82 0 1| 116k 20M|4922k 147k| 0 0 |1209 2337 1 22 0 76 0 1| 52k 14M| 0 0 | 0 0 |1185 2027 6 25 0 68 1 0| 14M 116k|1904B 3233B| 0 0 |1635 2662 9 29 0 62 0 0| 10M 0 | 0 0 | 0 0 |1599 2637 2 18 0 79 0 1|1704k 5828k|3298B 1800B| 0 0 |1165 2043 50 19 0 32 0 0|2868k 0 | 0 0 | 0 0 |1130 1800 39 14 45 1 0 0|8192B 0 |1898B 2748B| 0 0 |1036 1520 [root@achilles external]# cd [root@achilles ~]# time cp /swraid5b/linux.iso . 
real 0m57.618s user 0m0.075s sys 0m6.151s [root@achilles ~]# >> nfsmount to sdb, 3ware 9500 raid5 volume ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 0 30 0 52 1 17| 0 9876k| 23M 683k| 0 0 |3634 2647 1 29 0 51 2 17| 0 17M| 0 0 | 0 0 |3983 2899 0 32 0 54 1 13| 0 16M| 32M 948k| 0 0 |3562 2684 1 30 0 52 1 16| 0 14M| 0 0 | 0 0 |3417 2830 1 32 0 48 3 16| 0 16M| 29M 872k| 0 0 |4150 2963 1 28 0 52 1 18| 0 12M| 0 0 | 0 0 |3832 2780 0 32 0 50 1 17| 0 5815k| 33M 985k| 0 0 |3592 2555 1 34 0 48 1 16| 0 4841k| 0 0 | 0 0 |3922 2472 1 26 0 56 2 15| 0 16M| 29M 858k| 0 0 |3172 2763 1 25 0 57 1 16| 0 12M| 0 0 | 0 0 |3853 2914 1 23 0 55 2 19| 0 11M| 31M 919k| 0 0 |3856 2922 0 26 0 57 1 16| 0 13M| 0 0 | 0 0 |5678 4563 1 34 0 50 2 13| 0 17M| 25M 1060k| 0 0 |5751 4429 0 26 0 62 1 11| 0 6936k| 0 0 | 0 0 |3666 3402 1 25 0 69 0 5| 0 12M| 16M 708k| 0 0 |3309 3247 0 18 0 78 1 3| 0 7480k| 0 0 | 0 0 |1958 2519 1 24 0 65 1 9| 0 17M|8940k 452k| 0 0 |4654 4239 0 27 0 60 2 11|4096B 10M| 0 0 | 0 0 |4612 4004 1 29 0 57 2 11| 0 18M| 15M 785k| 0 0 |5110 4513 0 29 0 53 2 16| 0 9352k| 0 0 | 0 0 |6226 5387 1 31 0 52 2 14| 0 8316k| 23M 1184k| 0 0 |5949 4908 0 27 0 55 4 14| 0 18M| 0 0 | 0 0 |6541 5119 0 24 0 64 1 11|4096B 9460k| 20M 1075k| 0 0 |4312 3889 ... [root@achilles array]# time cp /swraid5b/linux.iso . real 0m52.801s user 0m0.074s sys 0m4.534s [root@achilles array]# Which actually isnt' as bad as I've seen for IOwaits. When I copy a larger file, such as > 2GB, i seem to end up with higher iowait values after a longer copy process. I'm still not sure if my issue is with the iowait problem this thread is looking for, or, if i'm having an issue with my 3ware controller. Searches for '3ware 9500s" and "high iowaits do return results, though on 3ware's site they do indicate this is a problem with the linux kernel, not their issue. [root@achilles ~]# lsmod|egrep "ata|3w" sata_nv 22469 0 sata_sil 15689 1 libata 104661 2 sata_nv,sata_sil 3w_9xxx 36037 5 scsi_mod 138093 5 usb_storage,sg,libata,3w_9xxx,sd_mod [root@achilles ~]# [portions of dmesg output] scsi0 : 3ware 9000 Storage Controller 3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xdf840000, IRQ: 19. 3w-9xxx: scsi0: Firmware FE9X 2.08.00.009, BIOS BE9X 2.03.01.052, Ports: 12. scsi 0:0:0:0: Direct-Access AMCC 9500S-12 DISK 2.08 PQ: 0 ANSI: 3 SCSI device sda: 624977920 512-byte hdwr sectors (319989 MB) sda: Write Protect is off sda: Mode Sense: 23 00 00 00 SCSI device sda: write cache: enabled, read cache: disabled, doesn't support DPO or FUA SCSI device sda: 624977920 512-byte hdwr sectors (319989 MB) sda: Write Protect is off sda: Mode Sense: 23 00 00 00 SCSI device sda: write cache: enabled, read cache: disabled, doesn't support DPO or FUA sda: sda1 sda2 sda3 sda4 < sda5 > sd 0:0:0:0: Attached scsi disk sda scsi 0:0:1:0: Direct-Access AMCC 9500S-12 DISK 2.08 PQ: 0 ANSI: 3 sdb : very big device. try to use READ CAPACITY(16). SCSI device sdb: 4882708480 512-byte hdwr sectors (2499947 MB) sdb: Write Protect is off sdb: Mode Sense: 23 00 00 00 SCSI device sdb: write cache: enabled, read cache: disabled, doesn't support DPO or FUA sdb : very big device. try to use READ CAPACITY(16). SCSI device sdb: 4882708480 512-byte hdwr sectors (2499947 MB) sdb: Write Protect is off sdb: Mode Sense: 23 00 00 00 SCSI device sdb: write cache: enabled, read cache: disabled, doesn't support DPO or FUA sdb: unknown partition table sd 0:0:1:0: Attached scsi disk sdb libata version 2.00 loaded. 
sata_sil 0000:01:0b.0: version 2.0 ACPI: PCI Interrupt Link [APC3] enabled at IRQ 18 ACPI: PCI Interrupt 0000:01:0b.0[A] -> Link [APC3] -> GSI 18 (level, high) -> IRQ 20 ata1: SATA max UDMA/100 cmd 0xF8836080 ctl 0xF883608A bmdma 0xF8836000 irq 20 ata2: SATA max UDMA/100 cmd 0xF88360C0 ctl 0xF88360CA bmdma 0xF8836008 irq 20 scsi1 : sata_sil ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310) ata1.00: ATA-7, max UDMA/133, 488397168 sectors: LBA48 NCQ (depth 0/32) ata1.00: ata1: dev 0 multi count 16 ata1.00: configured for UDMA/100 scsi2 : sata_sil ata2: SATA link down (SStatus 0 SControl 310) scsi 1:0:0:0: Direct-Access ATA WDC WD2500AAJS-0 12.0 PQ: 0 ANSI: 5 SCSI device sdc: 488397168 512-byte hdwr sectors (250059 MB) sdc: Write Protect is off sdc: Mode Sense: 00 3a 00 00 SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA SCSI device sdc: 488397168 512-byte hdwr sectors (250059 MB) sdc: Write Protect is off sdc: Mode Sense: 00 3a 00 00 SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA sdc: sdc1 sd 1:0:0:0: Attached scsi disk sdc I have seen comments about people trying *bsd and not having this issue. I've been tempted to try another flavor unix, and might give this a try. dwight s.
(In reply to comment #82)
> Peter, could I ask that when you have time, you spin up a version of your
> balance_dirty_pages patchset for the various reporters to try? Against
> 2.6.21 would suit, I guess.
Based on this response I've been trying a number of things today, but unfortunately did not get much further. One of the reasons is that on my system the appearance of the problem while running the 2.6.21 kernel is not as clear-cut as what I saw with 2.6.18.
1) I've applied the patches posted by Peter Zijlstra in comment #76 to the 2.6.21-mm2 kernel to check if it removes the problem. My impression is that the problem is still there with those patches, although less visible than with the clean 2.6.21 kernel. This would be consistent with the change I see going from 2.6.18 to 2.6.22. Again I checked both the 32 and 64 bit kernels, but it is hard to see a clear difference in this case. As I can't reach a clear conclusion here, it might be good if somebody with a clearer stall of his/her system under 2.6.21 tries this as well.
2) Seeing that I did not get much further with that approach, I tried again to get the kernel running with the rt-patch as requested in comment #39. I now have this running for both the 32 and 64 bit kernels. My first impression is that the RT-kernel does not really show the problems I experience under the normal 64-bit kernel. As I'm not sure, I would like to check the latency_trace output to confirm. Unfortunately /proc/latency_trace refuses to update with the 64-bit RT-kernel. Note that the same kernel config works fine for the 32-bit RT-kernel. My logs display the following error message on this:
----
Jun 16 13:56:41 PulsarX2 BUG: at kernel/latency_trace.c:1099 update_out_trace()
Jun 16 13:56:41 PulsarX2
Jun 16 13:56:41 PulsarX2 Call Trace:
Jun 16 13:56:41 PulsarX2 [<ffffffff8026ccbd>] l_start+0x1ed/0x960
Jun 16 13:56:41 PulsarX2 [<ffffffff80292d19>] obj_size+0x9/0x20
Jun 16 13:56:41 PulsarX2 [<ffffffff80293a67>] poison_obj+0x47/0x60
Jun 16 13:56:41 PulsarX2 [<ffffffff802b7bb1>] seq_read+0x71/0x2d0
Jun 16 13:56:41 PulsarX2 [<ffffffff80292cf9>] obj_offset+0x9/0x20
Jun 16 13:56:41 PulsarX2 [<ffffffff80293c0a>] cache_alloc_debugcheck_after+0x18a/0x1d0
Jun 16 13:56:41 PulsarX2 [<ffffffff80295939>] kmem_cache_alloc+0xe9/0x110
Jun 16 13:56:41 PulsarX2 [<ffffffff802b7c47>] seq_read+0x107/0x2d0
Jun 16 13:56:41 PulsarX2 [<ffffffff80299e19>] vfs_read+0xc9/0x170
Jun 16 13:56:41 PulsarX2 [<ffffffff8029a2e8>] sys_read+0x58/0x90
Jun 16 13:56:41 PulsarX2 [<ffffffff80209d9f>] system_call+0xef/0x15c
Jun 16 13:56:41 PulsarX2
Jun 16 13:56:41 PulsarX2 INFO: lockdep is turned off.
Jun 16 13:56:41 PulsarX2 ---------------------------
Jun 16 13:56:41 PulsarX2 | preempt count: 00000000 ]
Jun 16 13:56:41 PulsarX2 | 0-level deep critical section nesting:
Jun 16 13:56:41 PulsarX2 ----------------------------------------
Jun 16 13:56:41 PulsarX2
Jun 16 13:56:41 PulsarX2 CPU0: 0000000000ac1d4a (0000000000ac1d51) ... #4 (0000000000ac1d59) 0000000000ac1d64
Jun 16 13:56:41 PulsarX2 CPU1: 00000000007e4131 (00000000007e413f) ... #2 (00000000007e4131) 00000000007e413f
Jun 16 13:56:41 PulsarX2 CPU1 entries: 2
Jun 16 13:56:41 PulsarX2 first stamp: 0000000000ac1d4a
Jun 16 13:56:41 PulsarX2 last stamp: 0000000000ac1d4a
-----
Any thoughts on what causes this would be appreciated.....
3) I now noticed that a compiler warning is displayed while compiling the kernel for one of the files that was in the bad commit I found.
I can't judge if that warning is really relevant for this bug so it might be good that I bring this forward:

----the warning----
drivers/ata/libata-core.c: In function 'ata_dev_configure':
drivers/ata/libata-core.c:1788: warning: comparison of distinct pointer types lacks a cast

----the code that causes this warning (second line)-----
1787    if (ata_device_blacklisted(dev) & ATA_HORKAGE_MAX_SEC_128)
1788            dev->max_sectors = min(ATA_MAX_SECTORS_128, dev->max_sectors);
----
Does anyone find it odd that all the interrupts are being tied to one CPU? There is a lot of I/O on this system, yet only one processor is servicing interrupts.
Why is IRQ_BALANCE not needed on x86_64? Or does the code not work on x86_64?
Hello, Andrew. Yeah, I've been watching this bug report. The only libata-related thing I can think of is NCQ interacting badly with the io scheduler, but I haven't succeeded in reproducing the reported symptoms, nor does it explain the x86-64/i386 difference. I think we're looking at several different problems reported here, and it's pretty difficult to track down because the symptom is "being slow". I'll try again to reproduce the reported problems. If you're still seeing this problem, please post the kernel boot log (dmesg right after boot) and see whether "echo deadline > /sys/block/sdX/queue/scheduler" changes anything.
It also could be that the firmware is starving io requests which are far from the current head position. Please see whether "echo 1 > /sys/block/sdX/device/queue_depth" changes anything.
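To make the two suggestions above easy to repeat side by side, here is a rough sketch of a test loop (assumptions: the disk under test is sda, /home/big.iso is any multi-GB file on it, and "ls -al /usr" is the interactivity probe - adjust all three; writing queue_depth only works on drives/controllers that expose it, and schedulers not compiled in will simply be rejected):

#!/bin/sh
DEV=sda
BIGFILE=/home/big.iso
for SCHED in cfq anticipatory deadline noop; do
    echo "$SCHED" > /sys/block/$DEV/queue/scheduler
    for DEPTH in 31 1; do
        echo "$DEPTH" > /sys/block/$DEV/device/queue_depth 2>/dev/null
        sync; echo 3 > /proc/sys/vm/drop_caches      # start with cold caches
        dd if="$BIGFILE" of=/dev/null bs=1M &        # background sequential read
        sleep 5
        echo "scheduler=$SCHED queue_depth=$DEPTH:"
        time ls -al /usr > /dev/null                 # how long does a trivial read take?
        wait                                         # let the background read finish
    done
done

The numbers are only comparable if nothing else touches the disk during the run.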
Good to know that this bug still receives attention; having some programming experience, I'm sure it is a nightmare to figure out what is going on here.

1) Regarding comments #88 and #89 - I've tried both suggestions. Setting different schedulers and reducing the queue depth does not noticeably influence the behavior I experience under AMD64. The problem remains.

2) As a reminder: the complaint is not just that the system is "being slow". In my case, generating heavy IO on a disk makes it very difficult to start a new task. When I start calculating a checksum for a DVD image in K3B with the 2.6.18 kernel, I have to wait until that process has finished before I can do anything else, which is completely unworkable. Something as simple as opening a new terminal window is impossible, even in cases where it should have been cached in memory. This problematic behavior has diminished in newer kernels, but it is still clearly there. Now it can take up to 20 seconds to open a new terminal, but it does happen in the end, which is an improvement. (Note that switching users in an existing terminal can also take this long... bizarre.)

--> With the exact same tests, the i386 kernel behaves as expected.
Created attachment 11781 [details] output of dmesg after boot AMD64 This is the output of dmesg under AMD64 with kernel 2.6.21.5 (clean)
Hmm, I could be wrong, but didn't someone recently discover (through some CFS testing) that x86_64 didn't use properly fine-grained time stamps in the mainline kernel? Since this sounds like a scheduler issue, and it behaves as described on x86 and x86-64 where only the kernel differs, it sounds suspiciously like something in that area.

I'd suggest: try 2.6.22-rc5 and see if it reproduces there. IIRC, the bug is fixed there now and it wasn't in -rc4.

Who should this bug get reassigned to? To my knowledge, Robert hasn't done any kernel work in ages.
http://lkml.org/lkml/2007/5/25/69 This is the patch I'm thinking of, however it's not included in 2.6.22-rc5. So I'd say reproduce with 2.6.22-rc5, then apply that patch and try to reproduce again.
@Jens: The 2.6.22-rc5 kernel still seems to have the problem. I tried applying the patch indicated in comment #93. However, the kernel refuses to compile with this patch.

---the output----
  CC      arch/x86_64/kernel/../../i386/kernel/quirks.o
make[1]: *** No rule to make target `arch/x86_64/kernel/../../i386/kernel/sched-clock.o', needed by `arch/x86_64/kernel/bootflag.o'.  Stop.
make: *** [arch/x86_64/kernel] Error 2
---

So no luck there. Should I try against 2.6.22-rc4-mm2 instead?
Is it possible to backport the fix given by Ingo to 2.6.21? It looks like something that would apply.
I think the patch in #93 can't fix the issues in this bug. A quick glance at 2.6.19 and 2.6.22-rc4 shows that the sched_clock used by x86_64 is the same as the one used by i386, up to and including 2.6.22-rc4 vanilla. The only thing that changes across architectures is rdtscll, but it has the same granularity on x86_64 and i386. The bug that Ingo fixed was probably introduced recently, when the function was separated out into its own C file. Can someone please verify that?
Bug#8636 sounds pretty similar but not yet sure whether it's the same problem. http://bugzilla.kernel.org/show_bug.cgi?id=8636
WRT comment #96 - it does look like a -mm only patch, so I can only conclude that the issue probably isn't there in mainline. Testing the CFS patches could be interesting still, since it does smell like a scheduler induced delay.
Hi,

A quick follow-up: I'm still running 2.6.17.x because it is, for the moment, the kernel that works best for me. I tried 2.6.20.6 a while back; it was better than 2.6.18, but still not perfect. I couldn't conclude that the effect was produced by the kernel alone, since I also changed the mysql version at the same time.

As I reported originally, I'm using arch amd64/x86_64. I never tried i386 since all my userspace is 64 bits.

Like Henkjans said in comment #90, this is not a slowness issue; it is more as if the system stalls for _several_ seconds under high I/O. It is not possible to launch new processes, hence my first feeling of a scheduler issue (and that's also why I chose scheduler as the bug category).

The only thing that helped (when running >= 2.6.18) was using vfs_cache_pressure=1. It helps, but is still far from the "perfectness" of 2.6.17.

Just a reminder that I'm running an LSI megaraid RAID card (ie SCSI on the whole device chain).

I will have a maintenance window in about 12 days, so if you want me to change the kernel to something else (sorry, nothing too experimental) to try something different, please let me know. I didn't try Peter's patch, since it was (at the time) only for -mm and seemed rather experimental (the machine is in production).

Anyway, it's great to see that this bug report is still alive and that developers are still trying to diagnose/fix the issue. Thank you for your efforts :-)

Brice
Seeing that there seems to be no progress in solving this bug, it is probably best that I try to pinpoint the problem more precisely, as suggested previously in comment #46 (Bug #8636 no longer appears related to me). It would however be helpful if somebody who actually understands git helps me figure out how to do this, to prevent me from wasting a lot of time without getting a more detailed answer.

In comment #42 I found the first bad commit to be 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1. This commit includes 258 different commits, as it merges branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev. If I understand it correctly, the culprit should be among those 258 commits. So the question is how to run git-bisect only on those 258 commits - please help.

===
I think I have to fetch the libata git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
and then use
git-bisect bad 0dd4b21f517e138ea113db255645fbae1bf5eef3
git-bisect good 236ee8c33277ab48671995f26dc68a4639936418
The problem is that I'm not sure which commit IDs should be filled in to only bisect over the specified 258 commits! I think it should be the parents given in the header of the first bad commit found in comment #42. Is that correct, and if so, is it in the correct order?
===
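For what it's worth, a minimal sketch of how such a session would look, assuming the two SHAs above really are the parents of the merge from comment #42 (the "bad" one being the libata branch tip, the "good" one the mainline parent); bisect then only walks commits reachable from the bad tip and not from the good parent, i.e. essentially the merged libata commits:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
cd libata-dev
git bisect start
git bisect bad  0dd4b21f517e138ea113db255645fbae1bf5eef3
git bisect good 236ee8c33277ab48671995f26dc68a4639936418
# now build, boot and test the checked-out revision, then mark it:
#   git bisect good     (the stall does not reproduce)
#   git bisect bad      (the stall reproduces)
# and repeat until git prints the first bad commit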
The outcome of the new git-bisect attempt is slightly embarrassing... (blush)

---
12fad3f965830d71f6454f02b2af002a64cec4d3 is first bad commit
[PATCH] ahci: implement NCQ suppport
Implement NCQ support.
---

So I tried again to enable/disable NCQ as suggested earlier, but this time on the vanilla 2.6.18 kernel, as this is the kernel that most clearly shows the problem in my case. And indeed, enabling/disabling NCQ has a clear effect for 2.6.18!

** With echo 1 > /sys/block/sda/device/queue_depth it takes ~7s to open a terminal while simultaneously calculating a checksum.
** With echo 31 > /sys/block/sda/device/queue_depth it takes >40s to open a terminal under the exact same conditions.

So the conclusion: NCQ plays an important role in the problem I have, BUT those 7s are still much longer than under my 32bit environment. When doing exactly the same test on the 32bit kernel (2.6.21), the terminal opens in 1-2s, so substantially faster than the 7s I get with NCQ disabled! I would say that those 7s are still too long...
(In reply to comment #101)
> The outcome of the new git-bisect attempt is slightly embarrassing....(blush)
>
> ---
> 12fad3f965830d71f6454f02b2af002a64cec4d3 is first bad commit
> [PATCH] ahci: implement NCQ suppport
> Implement NCQ support.
> ---

Just a reminder that my server exhibiting the same issue uses SCSI, and as such doesn't have NCQ, but has TCQ. I suppose both share the same code base in the scsi subsystem, on which SATA depends.

[henkjans comment snipped]

> I would say that those 7s are still too long....

Yes, and this means that the problem gets worse with TCQ/NCQ enabled, but that it is not the root cause. Are you ready to git-bisect the whole kernel without NCQ enabled, just to see what the culprit commit could be?

Anyway, thanks for the great help :-)
Re #102: Brice, NCQ/TCQ is implemented almost completely in the disk. But yes, the disk vendors likely use the same code base for it.

Re #101: queue_depth=1 is probably a little too radical; it would be interesting to try other values. Also, out of curiosity: if you use ionice to raise the priority of the terminal, together with a short queue length, does it load faster?

Also, it's still a mystery why 64bit behaves differently. Did you ever have any kernel where that was not the case?
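A concrete sketch of that ionice experiment (assumptions: CFQ is the active scheduler, since io priorities only take effect there; the realtime class needs root; "xterm"/"konsole" and the device name are placeholders):

# try a moderate queue depth instead of 1 or 31
echo 4 > /sys/block/sda/device/queue_depth

# time how long a terminal takes to appear when it runs in the
# realtime io class at the highest priority
time ionice -c1 -n0 xterm -e true

# or raise the io priority of an already-running process by PID
ionice -c1 -n0 -p "$(pidof konsole)"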
(In reply to comment #101)
> So conclusion: NCQ plays an important role for the problem I have, BUT those
> 7s

Can you give the deadline io scheduler a shot (echo deadline > /sys/block/sdX/queue/scheduler)? Also, please post the result of "hdparm -I /dev/sdX".
I use deadline scheduler and it doesn't help with intermittent freezes.
I'm not sure whether all reports here are caused by the same problem. Stuttering can be a symptom of many different problems. Even if you and Henkjans are experiencing the same problem, Henkjans' case might be worse due to a faulty NCQ implementation, so I want to separate that out.
(In reply to comment #102)
> Are you ready to git-bisect the whole kernel without NCQ enabled, just to see
> what could be the culprit commit ?

This is indeed what I intend to do, but I first have to solve two things:
1) With NCQ disabled the effect becomes too subtle for my previous good/bad test, so I have to find a better way to do this... I've been looking in the direction of comment #12. The problem is that I have to find out how to do the drop_caches part on Gentoo.
2) I'm no longer sure that 2.6.17.14 is the last good kernel. The enabling of NCQ for my drive when going to 2.6.18 could have made it seem that way. For that reason I must first test a much older kernel for comparison.

(In reply to comment #103)
> Also it's still a mystery why 64bit behaves differently.
> Did you ever have any kernel where that was not the case?

I can't say what plays a role in the difference between 32/64. The only strange thing I observe lies in the throughput seen when calculating an iso checksum. Under 64-bit the read speed from the HD is 55-60 MB/s, which is about the maximum for my setup. Under 32-bit the same test shows 30-35 MB/s (observed with dstat). This clearly plays a big role in how easy it is for the system to respond to an IO request. Note that the output of hdparm is as it should be for both 32/64 bit.

(In reply to comment #104)
> Can you give a shot at deadline io scheduler (echo deadline >
> /sys/block/sdX/queue/scheduler)? Also, please post the result of 'hdparm -I
> /dev/sdX".

I'll post these tonight when I'm back behind my linux box, but I have trouble finding a good test (see above). Note that these observations were most likely with the anticipatory scheduler. I'll test again to be sure.
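On the drop_caches question: the knob isn't distribution-specific, so the usual sequence should work on Gentoo as well. A minimal sketch (needs root and a kernel >= 2.6.16):

sync                                  # write out dirty pages first
echo 3 > /proc/sys/vm/drop_caches     # drop pagecache, dentries and inodes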
(In reply to comment #104)
> (In reply to comment #101)
> > So conclusion: NCQ plays an important role for the problem I have, BUT
> those 7s
>
> Can you give a shot at deadline io scheduler (echo deadline >
> /sys/block/sdX/queue/scheduler)? Also, please post the result of 'hdparm -I
> /dev/sdX".

Playing with the schedulers makes life interesting. I calculate the checksum for a DVD iso, and at 40% into the checksum I try to open a terminal and measure how many seconds it takes (between each run I drop caches and change the scheduler). Here come the statistics:

1) The bad 2.6.18 kernel with NCQ enabled
- cfq: >43s, >43s, >45s (after this time the checksum is done)
- anticipatory: 16s, 13s, 14s
- deadline: 5s, 3s, 5s

2) The bad 2.6.18 kernel with NCQ disabled
- cfq: 15s, 5s, 7s, 8s
- anticipatory: 13s, 13s, 13s
- deadline: 2s, 2s, 4s, 2s

3) The supposedly good 2.6.17.14 kernel with NCQ disabled (by definition)
- cfq: 11s, 12s, 12s
- anticipatory: 14s, 14s, 15s
- deadline: 2s, 2s, 2s

4) For a rough comparison, a 32-bit 2.6.21 kernel
- cfq: 2s, 2s
- anticipatory: 3s, 2s
- deadline: 2s, 2s

**) Note that the 32-bit kernel is located on a different HD than the iso image and the 64-bit kernels, so it might not be a completely fair comparison.

Conclusion:
- The cfq and anticipatory schedulers give a pretty bad response in all tests with the 64-bit kernels, while they work fine with a 32-bit kernel.
- Enabling NCQ seems to make the response worse. In particular, the cfq scheduler takes a serious hit.
Created attachment 11876 [details] Output of hdparm -I This is the output of hdparm -I as requested. Note that this is the drive that contains the 64-bit root-system as well as the iso-image. **) In principle this drive should be using AHCI on SATA-II
Created attachment 11877 [details] output of hdparm -I For completeness also the output of hdparm -I for the drive that contains the 32-bit root-system. Note that the bug was already present with only the other SATA drive in the system.
That's an interesting result. With NCQ, a bit of increased latency is expected because more commands are in flight. Also, the kernel has no control over the order in which the hard drive chooses to execute commands once they're issued to the drive.

Can you put a 32bit system on the same disk as the 64bit one and repeat the test, preferably on the same region of the disk as the 32bit system (which of course requires reinstallation)?

Another factor to consider is the on-disk layout of the needed files and the checksummed file - ie. how far apart they are.

It might be a combination of all factors - firmware, disk layout, and cfq/as amplifying the effect due to the resulting skewed statistics. cfq allocates time slices to each io context, and if the unfairness of the firmware makes non-clustered access much more expensive than it should be, cfq can exaggerate the effect by over-charging those IOs accordingly.

Jens, any other ideas?
(In reply to comment #111)
> Can you put 32bit system on the same disk as 64bit one and repeat the test,
> preferably on the same region of the disk as the 32bit system (which of
> course
> requires reinstallation)?

I'm too "attached" to my current 64-bit install to wipe it, so I did the test slightly differently. I sequentially mirrored the 32-bit (2.6.21) and the 64-bit (2.6.18) root onto the same partition at the end of the disk showing the problems. The test I ran is exactly the same as the one in comment #108, so the numbers can be compared.

1) The 64-bit 2.6.18 kernel with NCQ enabled
- cfq: >44s, >60s, >55s (after this time the checksum is done)
- anticipatory: 10s, 22s, 8s
- deadline: 4s, 4s, 5s

2) The 64-bit 2.6.18 kernel with NCQ disabled
- cfq: 6s, 6s, 5s
- anticipatory: 10s, 9s, 10s
- deadline: 2s, 3s, 2s

3) The 32-bit 2.6.21 kernel with NCQ enabled
- cfq: 11s, 17s, 15s
- anticipatory: 10s, 20s, 7s, 9s
- deadline: 4s, 5s, 5s

4) The 32-bit 2.6.21 kernel with NCQ disabled
- cfq: 4s, 4s, 4s
- anticipatory: 9s, 10s, 16s
- deadline: 2s, 2s, 3s

The pattern in these times seems consistent with the obvious conclusion that enabling NCQ in combination with the cfq scheduler results in very long delays only for the 64-bit kernel, not for the 32-bit kernel.

> Another factor to consider is layout of the needed files and the checksumming
> file on disk - ie. how far they are.

To check this aspect, I also tested the effect of first copying the iso file onto the original root disk of the 64-bit system. This way the file that is read for calculating the checksum should be much closer than previously; however, this clearly has little effect on the observed delays, see:

5) The 64-bit 2.6.18 kernel with NCQ enabled (iso file in /)
- cfq: 47s, 54s, 46s (after this time the checksum is done)
- anticipatory: 13s, 17s, 16s
- deadline: 9s, 5s, 4s

6) The 64-bit 2.6.18 kernel with NCQ disabled (iso file in /)
- cfq: 5s, 4s, 4s
- anticipatory: 13s, 13s, 13s
- deadline: 2s, 3s, 2s

So both moving the root file system to a completely different area of the disk and moving the iso into the 64-bit root itself have no noticeable effect. I would say that the actual layout of the disk matters little.

For completeness, the partition scheme:
75MB /boot
1GB /swap
30GB / (original 64-bit root)
136GB /home (contains the iso file somewhere)
15GB / (the 32-bit and 64-bit root in this test)
130GB /spare
Thanks for the testing. It really looks like a cfq bug which somehow gets triggered on only 64bits. /me does Jens summoning dance.
I think you'll have to keep dancing for the coming two weeks. In bug #8363 Jens mentioned that he would be away until the 9th of July.

Are you sure that the problem is confined to the cfq/NCQ combination only? Overall I found the performance of the anticipatory scheduler astoundingly bad. It consistently shows longer delays (in the 10-20s range) than I would find acceptable on a multi-user system.
cfq and as are similar in many ways; cfq can be seen as a fairer extension of as. Both assume a lot about disk behavior and usage patterns and are heavily optimized toward typical desktop usage. as/cfq behave best if the queue depth is 1. They work with higher queue depths, but some of their assumptions might not hold there. The performance is usually pretty good due to heavy optimization, but that very optimization might result in peculiarities like the ones you're experiencing.

So, Jens isn't available. Nick should know much better what's going on with as. I'll ask him to take a look.
(In reply to comment #113)
> Thanks for the testing. It really looks like a cfq bug which somehow gets
> triggered on only 64bits. /me does Jens summoning dance.

Unfortunately for me, my server uses deadline only, which makes me think Henkjans is, in the end, seeing a completely different problem than mine :-(

I'm compiling a 2.6.21.5 with Peter Zijlstra's 2.6.21-per_bdi_dirty_pages.patch. I'll report my findings here as soon as I find a window to reboot the server.
(In reply to comment #116) Yeah, that's my feeling too. We might even have several different problems mixed in here. We'll just have to crack them one by one.
Hi,

(In reply to comment #76)
> New patches at:
> http://lkml.org/lkml/2007/5/10/161
>
> These do include something to make inter task starvation better.
> The previous set was only geared at inter device starvation issues.

I've now been running a 2.6.21.5 with Peter's patch for a few hours. The machine is only lightly stressed right now (and the mysql caches are still cold, so memory consumption is not at its peak), so I'll perform more tests in the next week to verify what I've found so far (which is good):

1) vfs_cache_pressure at 100, 2.6.21.5 + per-bdi throttling patch
The result is good - not as snappy as I'd want during a large copy, but still usable. No process seems stuck for ages, but there are some short (second or sub-second) moments where everything is stuck (e.g. if you run a top -d 0.5, the screen is not updated on a regular basis).

2) vfs_cache_pressure at 1, 2.6.21.5 + per-bdi throttling patch
The result is at 2.6.17 level. It is the best combination since 2.6.17.

I've taken a vmstat snapshot during test 1), and it shows a "behavior" (I don't know if it was planned or not). At first, there is lots of block activity (high bi and bo):

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
...
[copy started here]
2 1 104 343248 7200 156784 0 0 61796 20956 2217 2476 24 15 28 33
0 1 104 260908 7336 236964 0 0 40044 40596 1865 1668 17 13 14 55
1 4 104 101108 7488 393692 0 0 78540 43944 2199 2079 1 20 20 59
0 4 104 27872 6532 465136 0 0 50388 38516 1933 1997 13 16 20 50
...
[then, a few seconds later, bi decreases while bo stays high (does the system regulate itself to prevent write starvation??), with high iowait]
1 7 104 33164 1736 462888 0 0 58060 45732 2003 2440 1 20 0 79
1 7 104 38504 1812 457064 0 0 7384 43240 1293 1336 2 4 0 94
1 8 104 36804 1988 458664 0 0 1824 45824 1299 1535 3 3 0 94
1 7 104 36484 2096 458648 0 0 64 44120 1188 1389 2 2 0 96
3 7 104 36184 2164 459320 0 0 304 45988 1239 1743 35 4 2 59
1 5 104 27524 2240 470920 0 0 36796 21124 3054 3197 29 15 21 35
[then the inverse phenomenon]
3 5 104 37824 2284 461404 0 0 38188 32964 2071 1931 2 13 27 58
1 2 104 58628 2648 438620 0 0 68652 4016 2348 3355 38 23 10 28
3 1 104 28372 3508 466616 0 0 36188 3788 2033 3539 36 12 15 38
1 2 104 76732 3752 418304 0 0 44232 42072 2263 2911 6 21 36 36
1 3 104 58732 3784 435816 0 0 59844 33684 2067 3155 2 20 35 44
...
[the cycle repeats every 10s or so]

Also, I don't know if it matters, but it seems that all interrupts are sent to processor 1:

db1:~# cat /proc/interrupts
         CPU0       CPU1
  0:       95    3946126   IO-APIC-edge    timer
  1:        0          2   IO-APIC-edge    i8042
  4:        0       6321   IO-APIC-edge    serial
  6:        0          3   IO-APIC-edge    floppy
  8:        0          1   IO-APIC-edge    rtc
  9:        0          0   IO-APIC-fasteoi acpi
 12:        0          4   IO-APIC-edge    i8042
 14:        0         85   IO-APIC-edge    ide0
 16:        0          0   IO-APIC-fasteoi uhci_hcd:usb2
 18:        0          0   IO-APIC-fasteoi uhci_hcd:usb4
 19:        0          0   IO-APIC-fasteoi uhci_hcd:usb3
 23:        0         21   IO-APIC-fasteoi ehci_hcd:usb1
 46:        0    1619638   IO-APIC-fasteoi megaraid
 64:        0      33237   IO-APIC-fasteoi eth0
 65:        0    2209532   IO-APIC-fasteoi eth1
NMI:        0          0
LOC:  3946284    3944609
ERR:        0

Stay tuned for more tests, but per-bdi seems, at least for me, to definitely be the way to go.

Thanks,
Brice
Can you combine the test with shorter queue lengths?

echo N > /sys/block/DEVICE/queue/nr_requests

e.g. N=32 / N=16 / N=5 / N=1

I found that e.g. the AHCI interrupt handler can run quite long with large queue lengths.
(In reply to comment #119)
> Can you combine the test with shorter queue lengths?

Here are some results on my hardware (see the original bug description). Running 2.6.21.5 + per-bdi throttling patch in all tests.

Baseline load on the machine is around 0.3/0.4.
Baseline iowait on the machine is < 2.
Baseline is nr_requests=128 (default) and TCQ depth at 64 (default).
Deadline io scheduler.

This is my original test case:
* A lightly loaded MySQL instance doing lots of writes and fsyncs (I guess), few reads (everything is in RAM). The MySQL process takes about 93% of physical RAM (4G) on x86_64.
* At test start, I drop the caches, create an LVM snapshot, and copy from this snapshot to another block device (another RAID1). (The LVM snapshot was deemed to have no influence, since directly copying the files yields the same results.)
* While doing that I run mtop and vmstat -n 1, and also a 'time ls -al /usr/lib'.
* Pauses are seen in mtop, which no longer refreshes while SQL requests pile up, which means something is starving mysql; at this stage, the ls can take more than 20s to display its results.

1) vfs_cache_pressure=1
TCQ  nr_requests
64   128   between good and bad, load at 6, short sequences of 1s-or-less pauses every 3 or 4 seconds
64   64    bad (load 4.5), iowait between 50 and 75
64   32    bad
64   16    bad
64   1     between good and bad (load 2.5), some really short pauses

2) vfs_cache_pressure=100 (kernel default)
TCQ  nr_requests
64   128   really bad, long (more than 1 or 2s) "pauses" every few seconds or so
64   32    really bad, lots of long pauses
64   16    really bad, lots of long pauses
64   1     between good and bad (load 2.6~3), some short pauses

3) vfs_cache_pressure=1
TCQ  nr_requests
8    128   not that bad
1    128   snappiest configuration, almost no pauses (or unnoticeable ones)

While doing those tests, the new counters /sys/block/DEVICE/queue/writeback_kb and reclaimable_kb for both devices were always at 0... I don't know if that's the intended behavior or not.

Conclusion: I still don't get the perfectness of vanilla 2.6.17.x with baseline settings, but things seem to be going in a good direction. vfs_cache_pressure=1 is still better than vfs_cache_pressure=100, and TCQ seems to play a role. nr_requests doesn't seem to have an impact (except when effectively disabled).

Let me know if someone wants me to perform more tests...

Brice
Thanks for the performance info Brice! We recently upgraded our office to gigabit Ethernet and got some big AMD64 / 3ware boxes for file and vmware servers... only to find them almost useless under any kind of real load. I've built some patched 2.6.21.6 kernels (using the bdi throttling patch you mentioned) to see if our various Debian Etch boxes run better. So far my testing shows a *great* improvement over the stock Debian 2.6.18 kernel on our configurations.

The patch and build process was less than pleasant, since the rolled up 2.6.21 patch I found needed some nommu superblock/backing device patches to get it compiled and working against the current Debian linux-2.6-2.6.21 kernel source. I also had to patch some of the build scripts since the package was written for unstable and I need the kernel to run on Etch. I thought I'd share my combined iowait-hell.patch and the binary Debian kernel packages to assist anyone else who may also be fscked by this issue or interested in testing for themselves.

http://www.awtrey.com/files/iowait-hell/

At this point I'm still screwed because VMware depends on some older interrupt / vector code changed in 2.6.20 (I think). With 2.6.21 I get great disk performance, but complete system lockups at random intervals when running VMware. I'd go back to pre-2.6.18, but some of the hardware isn't supported in prior kernel versions. This is such a huge issue for us that I'm willing to pay thousands of dollars to get a backport of this fix to 2.6.18 so we can actually use our vmware servers. (BTW, If anyone wants the job please shoot me an email to discuss.)

I can't believe an issue this bad has gone unresolved for this long...

Tony
I spoke too soon, at least for some architecture / drive controller combos. The new kernel ran much better on my build box (P4/IDE drives) than the stock 2.6.18, but when I tried the kernel on one of my AMD64 / 3ware boxes I still see the same iowait / pdflush unresponsiveness I was seeing on the 2.6.18. Damn! Could Dwight Spencer (see comment #77 and comment #84) and I be seeing other complications of this issue from running on AMD/3ware? Tony
(In reply to comment #122)
> I spoke too soon, at least for some architecture / drive controller combos.
> The
> new kernel ran much better on my build box (P4/IDE drives) than the stock
> 2.6.18, but when I tried the kernel on one of my AMD64 / 3ware boxes I still
> see the same iowait / pdflush unresponsiveness I was seeing on the 2.6.18.
> Damn!
>
> Could Dwight Spencer (see comment #77 and comment #84) and I be seeing other
> complications of this issue from running on AMD/3ware?

IMHO, the bug that affects us (if it's the same one) has its roots in x86_64+SCSI, since I have never experienced it on any i386 box (with or without SCSI) - though admittedly not under the same test conditions and load. What would be interesting, if you can find an older (vanilla) kernel that is _not_ affected (like 2.6.17 was for me), is to run a git-bisect session. At least it would show whether you are seeing the same issue as Henkjans.
Morning guys.

Hey Tony... I gave up on the 3ware 9500s controller and went directly to booting from 1 drive and using software raid on the other drives. Performance has greatly improved. While I was getting 22MB/s read/write speed on the 3ware 9500s (a pci-x card in a pci slot, so that might be my issue), going software raid (md raid5, kernel 2.6.22-1) has pushed my read speeds up to 100MB/s, and writes vary from 50-70MB/s - an amazing jump.

I still get some iowait while doing large file transfers or heavy disk io (such as installing winxp in a vm), but the system remains highly responsive and the system load only goes up negligibly (always less than 2-3), not creeping up to 12+ as when using the 3ware. I typically use the utility "dstat" to keep an eye on cpu behavior (load/iowait, etc) and disk/network io.

(( That said, I believe I'm not having problems with md software raid5 causing data corruption, but that's an issue for another thread I suspect. Trying sw raid 4 (3 drives) and 1 (2 drives) this morning. ))

I'm considering ordering an lsi/megaraid sata 300 controller for further testing - I'd prefer hardware raid, but I'm not going to put up with xfer rates of 22MB/s. Hopefully the LSI doesn't have the iowait issue. I'm also considering getting a pci-e based controller/board/cpu/memory combo to try, in case pci-x in a pci slot is the issue. Which controller/board/slot combination are you using, Tony?

dwight.

Notes:
- currently running fedoracore 7, compiled 2.6.22-1 kernel
- the iowait issue w/ the 3ware controller existed on fc6, fc7, rhel4 update4, and centos. I didn't try debian/gentoo or any bsd's.
- 2gb memory, motherboard is asus a7n8x, cpu is athlon xp 2800+
Created attachment 12066 [details] 2.6.18 bounce fix Shooting in the dark... Does reversing this attached patch change the picture on x86-64 at all?
Checking /proc/meminfo for Bounce should probably be enough.
(In reply to comment #126)
> Checking /proc/meminfo for Bounce should probably be enough.

Not sure what exactly you mean in comment #125, but I have:

[brice@db1 ~]$ grep Bounce /proc/meminfo
Bounce:              0 kB

Don't know if that matches what you were thinking about...

Thanks,
Brice
Reassuring, I would be surprised if that had broken and we hadn't noticed until now. But there's not many great ideas popping up in here, so I figured it was worth a shot.
I suspect multiple bugs are getting mixed here again. At least Anthony's problem sounds like something different. Besides he seems to run a heavily changed Frankenkernel so he's INVALID anyways.
Sorry I didn't get back yesterday, but I took Dwight's advice and ditched the 3ware card. (Thanks Dwight! I *was* using a 3ware 9550SXU in a pci-x slot and seeing about the same performance you were seeing: 25MB/s reads.)

I reinstalled Debian and my server config on a separate box (4 sata drives, software RAID+LVM) and swapped out the drives last night. I ran some tests with the stock Debian 2.6.18 kernel and got better performance than with the 3ware, but I did see the characteristic pauses toward the middle of a bonnie++ run. After getting my Debian 2.6.21.6 + patch kernel loaded, performance was greatly improved, and even while running bonnie on one volume I was able to bounce around without any really noticeable lag.

Now, as far as this bug being AMD64 only: we develop a portable data analysis tool and we run it on Intel Core Mobile systems (Sony UX series, Panasonic Toughbook series) and see this bug, or one almost exactly like it, on those platforms as well. Once we have processed data for about 20 minutes, the iowait starts to go up and the system starts to hesitate and stutter. It's not swapping to disk. Our app takes a couple of hundred megs of ram, and the rest of the gigabyte or so of ram is available for the kernel to use. Once our run is complete and all our programs have exited, kjournald sits at 100% iowait for about 2-4 minutes writing something. Ironically, this problem is *much* worse on the UX-380, which has a solid state disk, incredible write performance and nanosecond seek times. Considering that I could dd the entire contents of our ram to the SSD in a couple of seconds, I can't even imagine what kjournald is doing for that long.

I am not suggesting anything here, I'm just saying that when I saw the iowait issue on our new AMD64 boxes, the problem behaved exactly like the one we've seen in testing our product and can reproduce fairly reliably on both Intel Core and Core2 CPUs.

Thanks for the help identifying the 3ware issue and for the patches that smoothed out the performance on our servers.
My 2cents: Symptoms described match those of my system - slow/nonexistent reads when heavy write IO in progress leading to collapse in system response. Twin Opteron 2.4GHz, 4GB RAM, 3ware 9550SX with RAID 1, ext3. OSes tried so far: CentOS 4.5 (2.6.9-55.ELsmp), openSUSE 10.2 (2.6.18.2-34-bigsmp), CentOS 5 (2.6.18-8.el5). Latest drivers (compiled from 9.4.1.3 3Ware source) and firmware for the card in use, have tried various changes to nr_requests, elevator, etc and have been able to mitigate the effect slightly but no magic bullet as yet. Kernel debugging isn't my forte, so this is mainly just a 'me too' posting - sorry about that.
Me too! 9650SE - CentOS 2.6.18-8.1.10.el5xen. It's Painful. Using deadline scheduler helped a bit in my case.
the per bdi patches are now upstream, so could you check 2.6.24-rc3, does it improve performance?
Not sure if this is relevant. I see similar behavior (high wa, glitchy interactivity) on my franken-fileserver. I have root on sw md raid1 and a 2TB raid6 array for data: 10 disks total, onboard ICH + 2 PATA PCI controllers + 1 four-port SATA controller. 2 disks are larger than the rest and are used in both the raid6 and the raid1. My PATA disks show up as hdX on this system. The problems existed when I was all PATA; I switched 4 drives to SATA to try to alleviate the problem, but it never helped.

dd if=/dev/zero of=/raid1fs/z bs=1M count=512

dd to the raid6 and wa stays <10% and idle >0%; no glitches, or extremely tiny ones. dd to the raid1 and wa goes to near 100% for 10 seconds, idle goes to 0, and everything gets glitchy until the writes complete and wa goes back down. My main notification of the symptom is that my MP3s (served over NFS) and MythTV streams off that system start glitching badly. Those are just reads, so my big dd write is starving small reads.

dd to a non-raid spare partition on a disk behaves the same as the raid1, so maybe it's best to speak about non-raid, single-disk. But it's easier for me to test raid1 as I don't usually have a non-raid partition. Not sure why the raid6 seems mostly immune when writing to it! Not the main issue here though, so let's ignore it.

Kernel: 2.6.17-1.2139_FC5smp, the final stock fc5 kernel. I will upgrade this box to f8 2.6.23 in the next few weeks. I know the kernel is ancient, but I rarely upgrade this box since almost every time I do, I destroy my 2TB array due to some obscure R6 bug, and it's really painful.

I read this entire bug and have tried a lot of the suggestions:

echo 1 > /proc/sys/vm/vfs_cache_pressure
  no help

echo deadline > /sys/block/*/queue/scheduler    (all disks)
  no help, or perhaps very very tiny help

echo 16 > /sys/block/*/queue/nr_requests    (all disks)
  Glitches seem to get a fair bit better (with scheduler=deadline still). It was set to 128, which appears too high and causes read starvation. With 128 I get long glitches every test run, even on a no-load system; even dstat itself (over ssh) starts to jitter! With 32 I sometimes get short glitches. With 16 I get no glitches unless the server is being loaded by other NFS calls.

No matter what nr_requests is, dstat shows the write going at 56-60MB/s, and wa takes up a constant 93-100% for the duration of the write.

Note: I cannot easily tell whether this is a disk starvation issue or a process scheduler / net issue, as my "glitch" symptom is MP3 streaming over NFS with xmms. So the problem could be a starved NIC (Intel PRO/1000 MT Server PCI card) or NFS process. However, it is clear that the large disk writes cause the starvation, whatever it is. Regardless of the cause, the kernel should not starve anything during large writes; they should be scaled back enough that other tiny writes/reads can break through.

I have reproduced the problem on non-franken servers with just 1 disk and no raid, and on systems with just 2 disks in md raid1. I will admit that having 10 disks on cheap-ish controllers may not be helping the situation.

When I try F8 2.6.23 I will report back. When F8 goes to 2.6.24, I will also report back.
* bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:

> dd to a non-raid spare partition on a disk and it behaves same as
> raid1. So maybe best to speak about non-raid, single-disk. But it's
> easier for me to test raid1 as I don't usually have a non-raid
> partition. Not sure why raid6 seems mostly immune when writing to it!
> Not the main issue here though, so let's ignore it.

Before you upgrade that box, could you try to run a few tests with files in pure ramfs? I.e. mp3s copied to /dev/shm/ and served from there - without any connection to the raid arrays.

I.e. are these delays purely IO-interaction related, or is there any CPU preemption or other CPU scheduling problem involved as well?
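For reference, a sketch of that experiment (the music path is a placeholder, /dev/shm is assumed to already be a tmpfs mount as on most distributions, and the export line is only an illustration - adjust it to your NFS setup):

# copy a few test tracks into RAM-backed tmpfs
mkdir -p /dev/shm/mp3test
cp /data/music/*.mp3 /dev/shm/mp3test/

# serve/play from the tmpfs copy while the usual write load runs, e.g.
# add something like this to /etc/exports and run "exportfs -ra":
#   /dev/shm/mp3test   *(ro,sync,no_subtree_check)

# then generate the big dd write load described above and listen for glitches
dd if=/dev/zero of=/raid1fs/z bs=1M count=512

If playback stays clean when the files never touch the raid arrays, the stall is on the IO side rather than in CPU scheduling.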
I just tried exporting a /dev/shm test dir (bind mount, nfsv4) and shm mounts appear to really mess up nfsv4. I can't get it to mount for the life of me. If I set up a dir in /tmp and bind/export/mount, it works fine. Move it to shm and the client mount hangs, and it really messed up my other v4 exports! I had to restart nfsd a few times to fix it. Any other ideas to test your theory that are more compatible with nfsv4?
While the system was heavily NFS loaded, 5GB read/write, I ran the following script:

#!/usr/bin/perl
open(IN, '/data/Tmp/bigfile') or die;
while (!eof(IN)) {
    $bef = time;
    sysread IN, $buf, 4096;
    $aft = time;
    print "elapsed time to read: " . ($aft - $bef) . " sec\n";
    sleep(3);
}

/data lives on my big RAID6 array (with all my MP3s). Anyhow, I had a few instances where the time was non-zero:

elapsed time to read: 1 sec
elapsed time to read: 0 sec
elapsed time to read: 0 sec
elapsed time to read: 1 sec
elapsed time to read: 0 sec
elapsed time to read: 2 sec
elapsed time to read: 0 sec
elapsed time to read: 0 sec
elapsed time to read: 1 sec

I simultaneously ran the same script, but on a different file, over NFS, and strangely all the times there were zero. Hmm. That's weird.

While running the test I had an ls -l deep in the RAID6 tree that took about 30 secs to complete! And the directory only had 5 files in it!

It seems that whenever wa% goes to near 100 and idle goes to 0%, things start getting ugly. user/sys% never get very high. It's always wa%.

If anyone can suggest other tests, I'll run them.
(In reply to comment #135)
> * bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:
>
> > dd to a non-raid spare partition on a disk and it behaves same as
> > raid1. So maybe best to speak about non-raid, single-disk. But it's
> > easier for me to test raid1 as I don't usually have a non-raid
> > partition. Not sure why raid6 seems mostly immune when writing to it!
> > Not the main issue here though, so let's ignore it.
>
> before you upgrade that box, could you try to run a few tests with files
> in pure ramfs? I.e. mp3s copied to /dev/shm/ and served from there -
> without any connection to the raid arrays.
>
> I.e. are these delays purely IO interaction related - or is there any
> CPU preemption or other CPU scheduling problem involved as well.

I'm still running 2.6.21 + the per-BDI patches (same config as in my comment #116), and I tried to play with the /dev/shm idea. Here is the test:

1) I dropped all caches; here is the 'free' information before running the test:

             total       used       free     shared    buffers     cached
Mem:       4052420    3625700     426720          0        508      17708
-/+ buffers/cache:    3607484     444936
Swap:     11863960      11364   11852596

Notice that there is plenty of free/unused RAM for the tmpfs.

2) I created a 150MB tmpfs containing 500 zero-byte files.

3) I copied a couple of multi-gigabyte files from /dev/sdb to /dev/sda (reminder: LVM2 on top of a high-end LSI hardware RAID card with multiple spindles and 256MB of battery-backed cache).

4) While the copy progresses, I time an ls -al on the tmpfs mountpoint.

And now the results:

- On baseline, ls -al /mnt takes:
cold cache:
real 0m1.387s
user 0m0.004s
sys 0m0.004s
hot cache:
real 0m0.024s
user 0m0.008s
sys 0m0.008s

- While the copy is progressing (hot cache every time), the time to perform the ls increases until the copy is finished:

After a few seconds:
real 0m4.654s
user 0m0.008s
sys 0m0.004s

After about 20 seconds (at about 75% of the copy):
real 0m15.064s
user 0m0.004s
sys 0m0.000s

Near the end of the copy:
real 0m7.998s
user 0m0.004s
sys 0m0.008s

So to me it looks like the disks or lvm are not the cause. I might try 2.6.24 when it is out, to see whether that's better or not. Let me know if someone needs more testing/information.
Comment #138 jives with what I see. I've done further poking around and noticed that while these large reads/writes are occurring, tcp overruns like crazy:

netstat -s
    12178 packets pruned from receive queue because of socket buffer overrun
    162338 packets collapsed in receive queue due to low socket buffer

After some google searches, I applied the following tweaks:

net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=262143
net.core.wmem_default=262143
net.ipv4.tcp_rmem="8192 1048576 16777216"
net.ipv4.tcp_wmem="4096 1048576 16777216"
vm.min_free_kbytes=65536
net.ipv4.tcp_max_syn_backlog=4096

That really helped drastically lower the overruns. In fact, each time I jack up rmem/wmem, the situation gets better: NFS drops fewer packets and xmms skips less. The box has 2GB RAM, so I'm inclined to just keep jacking up these values to see if I can buffer my problem away.

My point is not the networking details; my point is that this, plus comment #138, strongly hints that the disk I/O is causing actual CPU starvation, not just disk starvation. That makes sense if wa% is hogging all the cycles.
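In case anyone wants to reproduce those settings, a quick sketch of applying them (the values are copied verbatim from the list above; sysctl -w applies them immediately, and putting the same key=value lines into /etc/sysctl.conf makes them survive a reboot):

sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.core.rmem_default=262143
sysctl -w net.core.wmem_default=262143
sysctl -w net.ipv4.tcp_rmem="8192 1048576 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 1048576 16777216"
sysctl -w vm.min_free_kbytes=65536
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
# or: add the key=value lines to /etc/sysctl.conf and run "sysctl -p"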
This is really interesting. We're seeing very similar issues on http://spinn3r.com in our web crawler. Basically, we're a very high write-throughput application. We might be seeing two issues here. One is that our application runs out of its 8G of memory (in a sharded database), but the swappiness setting is constantly forcing pages out of memory and onto disk. We've also seen this locking issue on dd runs as well. It sounds like there might be two separate issues in this bug, though.
While setting up a new 64-bit system to replace my old 32-bit server, I am also seeing this issue. I've found that mounting the filesystems with noatime helps with starting apps once they're in cache, but a high write load still seems to slow down reads.

Running "dd if=/dev/zero of=testfile.img bs=8k count=20M" slows down "find /". If I use cfq with NCQ disabled, "dd if=/dev/md5 of=/dev/zero bs=8k" can get reasonable speeds, but enabling NCQ with depth 31 or changing to the deadline scheduler seems to make that slow as well. Adding oflag=direct to the dd write slows down the write, so read speeds increase slightly, but they are still not as fast as when dd isn't running.

I'm running Gentoo 64-bit with kernel 2.6.24-gentoo on an Nvidia MCP67 motherboard, AMD Athlon64 X2 4000+, 2GB RAM and two 400GB SATA drives with ext3 filesystems running on top of LVM2 on top of software RAID 1.
I've mounted the filesystems with SystemRescueCd-0.4.4-beta5 32-bit which is based on 2.6.24 kernel and got similar results to when I am running 64-bit.
It seems there is a "conflict" between the I/O scheduler and filesystem barriers (which have been made a default mount option since 2.6.17/2.6.18). Disabling barriers at mount time - option nobarrier for XFS, something like barrier=0 for ext3 (and another one for reiser) - seems to really help (a factor of 5 to 10 in write speed, and no more lags under high I/O load).

HTH
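For concreteness, a sketch of what that looks like (device names and mount points are just examples; note that running without barriers on drives with write cache enabled trades crash/power-loss safety for speed):

# ext3: mount with write barriers disabled
mount -o barrier=0 /dev/sdb1 /data

# XFS equivalent
mount -o nobarrier /dev/sdc1 /scratch

# or persist it in /etc/fstab, e.g.:
#   /dev/sdb1   /data      ext3   defaults,barrier=0   0 2
#   /dev/sdc1   /scratch   xfs    defaults,nobarrier   0 2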
(In reply to comment #143) > It seems there is a "conflict" between the I/O scheduler and filesystem > barriers (which have been made a default mount option since 2.6.17/2.6.18). > Disbling barriers at mount time by passing option nobarrier for XFS, > something > like barrier=0 for ext3 (and another one for reiser) seems to really help > (factor 5 to 10 in write speed and no more lags on high I/O load) > > HTH > Do we have some basis for this or is this pure speculation?
(In reply to comment #144)
>
> Do we have some basis for this or is this pure speculation?
>

I don't know where the "conflict" resides, but all I'm sure of is that it's the only thing (disabling barriers when the underlying drivers and hardware support it) that has really helped on all my servers in the last 1.5 years... As an example, a copy of the kernel source took more than 7 minutes before; with filesystem barriers disabled it takes only 1'30...

There are also reports suggesting that it might be a step in the right direction here:
http://forums.gentoo.org/viewtopic-t-482731-postdays-0-postorder-asc-start-550.html
We have got one more data point, which is very interesting. It seems that if the data directory where the IO is being done by an application is loop-mounted from a file, the symptoms disappear. Does anybody know what a loop device has (or lacks) that makes 64-bit code work at the same speed as 32-bit?

The following was posted on the Gentoo forums:

At the beginning of this year I built myself a new PC with an Intel E6750 processor, Gigabyte P35DS3R motherboard, Samsung HD501LJ 500Gb disk (370Gb with EXT3 for Linux), Nvidia 8600GT graphics card and 4Gb of memory. I decided to install a 64bit Gentoo Linux, currently with kernel 2.6.24-r3. On first impression the system seemed to be quite quick; the software compiled very fast.

The first evidence that something was not quite right was during playback of mpeg2 files with VLC, where there were frequent sound drops. I didn't experience these drops with my old AMD XP 2800+. It really became very annoying during my Audacity sessions. The loading of a 2-hour MP3 started pretty fast, I guess until the buffer cache was filled; then the IO activity basically ground to a halt. Internet radio streams continued playing, but even switching desktops didn't produce any reaction until the MP3 was nearly loaded into Audacity.

Subsequently I did some experiments in various scenarios (probably not really scientific, but anyway), each time loading a 228Mb MP3 file into Audacity. After completion, Audacity would have filled its data directory with about 3.3Gb of audio data. During each run I collected some log files produced by "vmstat 5". From this data I produced some diagrams with OpenOffice. I don't want to duplicate everything here; please surf over to http://ridderbusch.name/wp/2008/03/20/back-to-32bit-linux/ to see the details, diagrams and also the raw data from "vmstat".

With standard Gentoo 64bit 2.6.24-r3 the load of the MP3 file would take over 8 minutes. From the "vmstat" output you could see that for an extended period the system was waiting for I/O about 90% of the time and otherwise not doing much. With "queue_depth" set to 31 and the CFQ scheduler, the result was slightly better, but not by much: just under 7 minutes (6m50s).

Then I created a 6Gb image file and in turn made XFS and EXT3 file systems in this image, mounted via loopback onto the Audacity data directory, and repeated the MP3 loading. This worked really fast. With XFS it took only 1m22s and with EXT3 1m46s.

As a last test I rebooted with Knoppix 5.3 (2.6.24 32bit kernel), reconfigured Audacity to load the data into /tmp on my Samsung disk, and found that this run completed after only 1m48s.
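A sketch of reproducing the loop-mount experiment from that post (sizes, paths and the choice of ext3 are just examples; mkfs needs -F to write into a regular file):

# create a 6 GB backing file and put a filesystem in it
dd if=/dev/zero of=/home/loopback.img bs=1M count=6144
mkfs.ext3 -F /home/loopback.img

# loop-mount it over the application's data directory
mount -o loop /home/loopback.img /home/user/audacity-data

# ... run the same import/copy test against that directory ...

umount /home/user/audacity-data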
A user profiled the good and bad runs. Can someone please help decode what the profiler's results mean for the kernel? The post is at:
http://forums.gentoo.org/viewtopic-p-5028884.html?sid=7cd6ae6b21ce6aa59b5021e89dfab313#5028884
I've experienced this problem for quite some time, generally both with streaming writes (from an ftp server, etc., or even just the writeback part) and with random seeks, all while watching video or listening to audio.

A strange thing: in KDE, if I open a file off ftp with mplayer, it copies it to disk first. I have 8GB of RAM, and when opening a 350MB file you would expect it to stay cached in RAM; however, after it has finished transferring and mplayer/kaffeine/noatun opens, I often get 2-10 second pauses/stuttering while the writeback is being performed. As I understand it, reading from cache should never need to block during writeback. This leads me to believe that it's not strictly block-device related. If you try to copy any decent-sized file (100MB+) in the background while playing video, the video becomes unwatchable.

I have tried 2.6.18, 22 and 24, on AMD64. I still have this problem with 32-bit kernels, but nowhere near as bad. After reading some of the suggestions here I tried tweaking /proc/sys/vm, to no effect. An effective change, though, is noatime, which helps marginally; however barriers=0 makes a HUGE difference: write speed is up from 20-30MB/s to 65MB/s on a single sata drive off an ATI SB800 southbridge (set to IDE in the BIOS; AHCI seems to do the same thing).

This bug also affects my server (for the past few years), with an Areca SATA RAID controller card (using ARCMSR). I get appalling read and write speeds with XFS on an 8-disk RAID5 array, have tried all sorts of things, and now just live with the poor performance. Unfortunately the Areca drivers aren't in the 2.6.17 kernel.
(In reply to comment #76)
> New patches at:
> http://lkml.org/lkml/2007/5/10/161
>
> These do include something to make inter task starvation better.
> The previous set was only geared at inter device starvation issues.

Hello, just to point out that version 10 of the "per device dirty throttling" patch was released:
http://lwn.net/Articles/249450/
This version builds against .23-rc4-mm1.
(In reply to comment #133) > the per bdi patches are now upstream, so could you check 2.6.24-rc3, does it > improve performance? > Hello Ingo, I've just seen your comment about the upstream integration of the per bdi patches. Are you talking about version 10 (http://lwn.net/Articles/249450/) as I commented in comment #149?
Most of the per-bdi patches were merged in .24-rc1, so every kernel after that should have this. .26 will bring some new features, but nothing earth-shattering: the capability of setting an upper and lower limit on the ratios, as well as a sysfs interface in /sys/class/bdi.
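If I understand that right, once the sysfs interface is in place it should look roughly like this (the min_ratio/max_ratio knob names and the 8:0 device identifier are my assumptions about how the interface will be exposed, so treat this purely as a sketch):

# list the backing devices the kernel knows about
ls /sys/class/bdi/

# e.g. guarantee one device at least 1% of the dirty limit and cap it at 10%
echo 1  > /sys/class/bdi/8:0/min_ratio
echo 10 > /sys/class/bdi/8:0/max_ratio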
I'm wondering if this is in any way related: https://bugzilla.redhat.com/show_bug.cgi?id=444759
I believe I'm running into this problem as well. It happens when I'm making an ISO image from a DVD using K3b. All goes along fine for a while, as the data is just being shoved into the page cache, but when pages start getting flushed to disk, all hell breaks loose.

I have a 3ware 9650SE running a RAID6 on eight Seagate Barracuda ES (7200 RPM) 320 GB drives. I know RAID6 isn't the best for write speeds, but I should be able to get better than 6 MB/s, shouldn't I? Alas, while the ISO image is being flushed to disk, I'm getting only 2-6 MB/s writes, and the number of dirty pages continues to climb. While this is going on, anything else requiring a disk read takes a very long time. Even the time between when I press Enter after entering my password for a root shell login and when the root prompt appears can be >10 seconds.

I have a Core 2 Quad Q6600. While the pages are being written, I watch my CPU utilization graph, and I see intense spikes. The graph hovers mostly around 12% utilization (so like 50% of one core), but occasionally and suddenly it shoots up to 100% (all four cores pegged) for 1-5 seconds at a time. Copying a DVD to an ISO image shouldn't peg all four cores! Additionally, if I watch vmstat while doing the copy, the 100% CPU spikes coincide with vmstat's reporting 0% idle, 93% io (it doesn't ever go higher than 93%).

Once K3b is finished writing the ISO, I still have several GB of dirty pages yet to be written. (I have 8 GB RAM.) If I issue a 'sync' command, the pages get flushed to disk, but still only at ~5 MB/s, though the CPU graph shows nearly idle during the sync.

All this might seem like a problem with the 3ware controller, but here's the odd thing:

$ vmstat 5 & sync ; time ( dd if=/dev/zero of=10G bs=4096 count=2621440 ; sync ) ; kill %1
[1] 22439
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 0 7431424 20 122548 0 0 26 33 5 5 2 0 97 0
1 1 0 2765820 20 4658304 0 0 3 162326 1288 1127 1 25 47 27
2 3 0 208376 20 7132916 0 0 0 305676 2482 13676 1 19 31 49
0 4 0 46312 20 7291344 0 0 0 58018 653 9456 0 4 30 66
0 5 0 42636 20 7296552 0 0 0 28431 451 3844 0 2 26 71
0 5 0 50780 20 7288524 0 0 0 26556 380 1295 0 2 25 73
0 0 0 45368 20 7299224 0 0 0 29407 441 1265 0 1 53 46
1 0 0 47216 20 7296964 0 0 0 87226 815 928 0 2 97 0
0 0 0 46656 20 7297696 0 0 0 84720 848 982 0 3 97 0
1 3 0 42768 20 7294484 0 0 0 57923 576 1671 0 3 62 34
0 5 0 42892 20 7293524 0 0 1 23463 372 1432 0 3 48 49
0 4 0 43072 20 7294312 0 0 1 21345 362 1657 0 2 48 50
0 5 0 43024 20 7294456 0 0 6 22790 370 1350 0 3 31 66
1 3 0 46780 20 7291080 0 0 1 22396 361 1462 0 2 31 67
0 4 0 43216 20 7294132 0 0 0 22608 370 1373 0 2 48 49
2 3 0 46608 20 7291544 0 0 0 21345 363 1351 0 2 50 48
0 5 0 42924 20 7294548 0 0 17 23004 375 1408 0 2 48 49
2 0 0 48400 20 7296148 0 0 2 20556 393 975 0 1 66 32
0 0 0 45044 20 7298744 0 0 0 78870 773 933 0 2 97 1
0 0 0 46324 20 7297772 0 0 0 4994 305 891 0 0 97 2
0 0 0 45276 20 7298896 0 0 0 25167 417 993 0 1 98 1
0 0 0 45744 20 7297992 0 0 0 29529 444 906 0 1 98 1
0 4 0 42936 20 7298160 0 0 0 7604 300 1120 0 1 79 20
0 4 0 43040 20 7295132 0 0 0 18597 335 1477 0 2 27 71
0 4 0 43092 20 7294580 0 0 0 23968 383 2246 0 3 28 69
2621440+0 records in
2621440+0 records out
10737418240 bytes (11 GB) copied, 120.815 s, 88.9 MB/s
0 2 0 54872 20 7289284 0 0 1 27088 471 1400 1 1 45 53
0 2 0 54756 20 7289284 0 0 0 2745 311 1146 0 0 50 49
0 2 0 54748 20 7289284 0 0 0 7869 331 1112 0 0 50 49
0 3 0 54880 20 7289284 0 0 0 6317 302 1244 0 0 50 50
0 3 0 54748 20 7289284 0 0 0 2328 305 1391 0 0 43 57
0 3 0 54880 20 7289284 0 0 0 5502 319 1359 0 0 50 50
0 3 0 54748 20 7289284 0 0 0 14270 372 1452 0 0 50 49
0 2 0 52508 20 7289772 0 0 35 6819 316 1318 1 0 55 45
0 1 0 51640 20 7289772 0 0 0 279078 2192 819 0 3 70 27
0 1 0 51648 20 7289772 0 0 0 418332 3344 619 0 0 75 24

real    2m51.679s
user    0m0.487s
sys     0m14.990s
[1]+  Terminated              vmstat 5

$ echo $[10*1024*1024/(2*60+51)]
61320

Notice how writing is fast at the beginning before the page cache has filled up, then gets very slow for most of the write, but then gets very fast again at the end. I have run this test three times in a row, deleting the 10G file between each run, and the results are always like this.

However, here's the REALLY strange thing. See what happens when I use a 4 GB file instead of a 10 GB file:

$ vmstat 5 & sync ; time ( dd if=/dev/zero of=4G bs=4096 count=1048576 ; sync ) ; kill %1 ; fgrep Dirty /proc/meminfo
[1] 22831
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
3 0 0 7432280 20 122228 0 0 26 40 5 5 2 0 97 0
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 4.4571 s, 964 MB/s
0 2 0 3116964 20 4316616 0 0 1 152428 1194 1339 2 24 52 23
0 2 0 3117560 20 4316536 0 0 0 358446 2877 751 0 4 72 24

real    0m14.994s
user    0m0.164s
sys     0m4.959s
Dirty:               0 kB
[1]+  Terminated              vmstat 5

I threw in the fgrep just so you could see that yes, all 4 GB of dirty pages do get written in this amount of time. So apparently the slowdown only occurs if I manage to fill up my page cache. Something is seriously wrong here.

Supplemental information:

# for each in /sys/block/sda/{device/queue_depth,queue/*} ; do echo "${each}: $(<${each})" ; done
/sys/block/sda/device/queue_depth: 254
/sys/block/sda/queue/hw_sector_size: 512
/sys/block/sda/queue/iosched:
/sys/block/sda/queue/max_hw_sectors_kb: 128
/sys/block/sda/queue/max_sectors_kb: 128
/sys/block/sda/queue/nr_requests: 512
/sys/block/sda/queue/read_ahead_kb: 384
/sys/block/sda/queue/scheduler: noop [deadline]

# lspci -nnvvv -d 13c1:1004
02:00.0 RAID bus controller [0104]: 3ware Inc 9650SE SATA-II RAID [13c1:1004] (rev 01)
        Subsystem: 3ware Inc 9650SE SATA-II RAID [13c1:1004]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at e0000000 (64-bit, prefetchable) [size=32M]
        Region 2: Memory at e5200000 (64-bit, non-prefetchable) [size=4K]
        Region 4: I/O ports at 2000 [size=256]
        Expansion ROM at e5220000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/5 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <128ns, L1 <2us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s L1, Latency L0 <512ns, L1 <64us
                        ClockPM- Suprise- LLActRep+ BwNot-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
        Kernel driver in use: 3w-9xxx
> From Comment 153: It happens when I'm making an ISO image from a DVD using
> K3b. All goes along fine for a while, as the data is just being shoved
> into the page cache, but when pages start getting flushed to disk, all hell
> breaks loose.

By "all hell breaks loose", do you mean a kernel panic?

When I write a DVD with K3b, I get a kernel panic so bad that twice it destroyed my file system. I was afraid to report it over here as I am not using a standard kernel (I am using Red Hat's modified kernel). I reported the issue at http://bugs.centos.org/view.php?id=3002. (Don't pay any attention to the title -- I originally thought the problem was caused by my DVD burner.)

Workaround: use the free edition of "Deep Burner" (http://www.deepburner.com/) under WINE. I can cut DVDs all day long without a single panic.

Can anyone confirm that my issue is related to this one?

Many thanks,
-T
(In reply to comment #154)
> By "all hell breaks loose", do you mean a kernel panic?

I meant what I went on to describe. I haven't experienced any kernel panics due to copying large files.
Part of this problem seems to have magically vanished for me. Now when I'm making an image from a DVD, my disk reads stay reasonably responsive the whole way through, and dirty memory (reported in /proc/meminfo) stays at around 1.5 GB, whereas it used to climb past 5 GB. Also, vmstat shows I/O wait at a super-consistent 19-21% while making the DVD image -- no variation outside of that range at all. However, I'm still seeing the extreme slowdown in flushing pages to disk while the page cache is full. I haven't changed kernel versions, but I have upgraded the firmware in my 3ware 9650SE to version FE9X 4.06.00.004, which was a major version bump over the last release. I have also upgraded to KDE 4.1.2 and K3b 1.95-svn (r856890); not sure if that matters.
I can still reproduce the symptoms with 2.6.28-rc4. All you have to do is start a 'dd' that creates a 6 GB file (anything above your physical RAM, really) and, in parallel, start a network copy of another large file over CIFS or NFS. The whole system stalls for several seconds (nothing can be done with the machine: the mouse freezes, the cursor stops blinking, no new programs can be started), then continues, then stalls again for a few seconds, and so on. Many programs go into the 'D' state, and CPU usage falls drastically if there were programs consuming CPU. Something is horribly wrong with page cache management. Why is the kernel caching so much of this write data and slowing everything down?
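For anyone who wants to try this, here is a minimal shell sketch of the recipe described above. It assumes 4 GB of RAM (hence the 6 GB file), a scratch directory on the local disk, and an NFS mount at /mnt/nfs holding a large file -- all of these are placeholders, not details from the original report.

# local streaming write larger than RAM
dd if=/dev/zero of=/tmp/bigfile bs=1M count=6144 &

# in parallel, pull another large file in over the network
cp /mnt/nfs/large.iso /tmp/large-copy.iso &

# meanwhile, watch the dirty/writeback page counters and try to use the desktop
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'

While the two writers run, interactive latency (opening a terminal, switching windows) is what shows the stalls; the watch output just makes the dirty-page build-up visible.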
I have also been hit hard by this issue on the SATA hard drive behind the ICH8 controller in my Dell Latitude D830. Any large copy from the drive slows the machine to a crawl; copying from the internal hard drive to an external USB 2.0 drive runs at only 8 MB/s. This has been submitted to the Fedora bugzilla as bug #470736.
Similar bug in Ubuntu: https://launchpad.net/bugs/131094
Todd, have you made any progress on this? Just wondering. I might try taking a stab at it when classes end.
Hi Ben,

My problem may or may not be this bug, but my strong suspicion is that it is. It seems that when I interrupt a large file transfer on my RAID card, or interrupt a "dump" backup from my RAID card to my SATA drive, my USB throughput goes to hell. See https://bugzilla.redhat.com/show_bug.cgi?id=469290

It is pretty easy to reproduce: all I have to do is start a "dump" from my primary partition (/dev/sda1) to my removable SATA drive (not eSATA) (/dev/sdb2), let it run for about ten minutes (a full run takes 1hr 45min), then stop it with <Ctrl><C>. After that, a 940 MB copy from /dev/sdb2 to my USB drive slows from 1 min 32 sec to 92 minutes. Rebooting always fixes it.

The problem also occasionally happens when I let the backup run through successfully, but not always.

-T
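A rough shell outline of that reproduction sequence, for reference. The device names come from the comment above; the dump output path, the file copied afterwards, and the ten-minute sleep are placeholders I have filled in, so treat this as a sketch rather than the exact commands used.

# level-0 dump of the primary partition onto the removable SATA disk
dump -0 -f /mnt/sdb2/root.dump /dev/sda1 &
DUMP_PID=$!

sleep 600            # let it run for roughly ten minutes
kill -INT $DUMP_PID  # same effect as pressing Ctrl-C

# then time the copy that normally takes about 1.5 minutes
time cp /mnt/sdb2/some-940MB-file /mnt/usb/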
Created attachment 19346 [details] OProfile of heavily IO loaded system This probably doesn't help much, but I figured I'd take a profile while doing a backup (my reproduction case) just for the hell of it. It doesn't look like there's anything too informative in here, but here it is anyway.
(In reply to comment #161)
> My problem may or may not be this bug, but my strong suspicion is that it
> is. It seems that when I interrupt a large file transfer on my RAID card,
> or interrupt a "dump" backup from my RAID card to my SATA drive, my USB
> throughput goes to hell. See https://bugzilla.redhat.com/show_bug.cgi?id=469290
>
> It is pretty easy to reproduce: all I have to do is start a "dump" from
> my primary partition (/dev/sda1) to my removable SATA drive (not eSATA)
> (/dev/sdb2), let it run for about ten minutes (a full run takes 1hr 45min),
> then stop it with <Ctrl><C>. After that, a 940 MB copy from /dev/sdb2 to my
> USB drive slows from 1 min 32 sec to 92 minutes. Rebooting always fixes it.

It really doesn't sound like you are encountering this particular bug. This bug causes both general system and I/O performance to go to hell while copying large files; it just took me 10 minutes to VT-switch away from and back to GNOME due to it. If you aren't having interactivity issues, then you are seeing something else. Poor I/O performance is potentially justifiable, but the inappropriate thrashing that kernels beyond 2.6.18 have been producing is a far more serious issue.
I am affected by this bug too, on many desktop systems, with Intel and AMD controllers. It is very easy to reproduce. It must be something like the 50 Hz problem on the PDT-11, if anyone remembers that. There must be a threshold of disk throughput above which the problem appears.

-------------

I tried to work around the problem by using a fast SSD, but that makes it even worse. Here are the results of two dd write operations:

# dd if=/dev/zero of=/dev/sda6 bs=1M count=1500
1572864000 bytes (1.6 GB) copied, 57.5956 s, 27.3 MB/s

Desktop responsiveness is poor during this run.

# dd if=/dev/zero of=/dev/sda6 bs=1M count=1500 oflag=direct
1572864000 bytes (1.6 GB) copied, 20.9958 s, 74.9 MB/s

With O_DIRECT, not even Firefox freezes.

-------------

I reverted a patch found in the CentOS sources (linux-2.6-fs-dio-use-kzalloc-to-zero-out-struct-dio.patch) on the Ubuntu Hardy kernel (2.6.24-22-generic). I chose this patch because the problem first appeared on my CentOS system when I updated from kernel-2.6.18-92.1.10.el5 to kernel-2.6.18-92.1.13.el5, and the same patch first appeared in Ubuntu Gutsy, which is where the problem first showed up on my old system. On my CentOS system the slowdown is only minor, but it is there.

Reverting the patch reduces the boot time of my Ubuntu Hardy test installation from 60 seconds to 45 seconds, and every application starts faster. Under high load average the differences are large: starting gimp takes about 140 s with the patched kernel versus about 30 s with the unpatched one, with eight concurrent dd write operations running. This does not solve the problem, but it does make it less severe.

When simply copying a big file in GNOME, however, there is no noticeable improvement: every action takes a very long time, seemingly forever.

-------------

To simulate a slower hard disk, I installed Ubuntu Hardy on a fully encrypted disk, which reduces the write speed from 56 MB/s to 40 MB/s on a traditional drive. There were no differences between the patched and unpatched kernels; both were unusable.
After spending two days poring over code and profiling, I have made no progress on this issue, nor do I have any novel proposals as to the source of the slowdown. It seems I just don't have enough understanding of the relevant subsystems to handle this myself.

This bug is getting extremely unwieldy and seems to be mixing several separate issues. Would anyone be opposed to opening a new bug for the high I/O wait times issue?

While I understand the difficult nature of this issue, the lack of kernel developer involvement in the handling of this bug is extremely worrisome. This was reported nearly 2 years ago and we are still no closer to a resolution despite countless profiles, benchmarks, and bisections. Moreover, at this point one would have to go back eleven kernel releases to get reasonable I/O performance. What could be done to expedite the handling of this bug? Would it help to move discussion onto the LKML? Is there more data that could be collected? Would further bisections help? We have a great amount of data and several proposed fixes which have been shown not to work, yet there is no plan for moving forward.

Could any of the I/O developers suggest novel avenues of inquiry into this slowness? It seems that raw profiles give very little information about the source of the slowness itself. What techniques other than profiling and printks (which have been my bread and butter so far) can be used to pinpoint performance issues like this one?

Any help that could be provided would be greatly appreciated. This bug has frustrated me ever since switching to x86-64 and a resolution is long overdue.

Thanks,
- Ben
At least I mostly ignore it because it has all the signs of one of those catch-all bugs where lots of people report lots of different problems that show vaguely the same symptoms, all mixed into a single bug. In general it's extremely hard to sort out these kinds of bug conglomerations, and the best avenue might be to start from scratch.

Regarding ways to diagnose slowness: if nothing shows up in oprofile, it's likely a problem where something waits. You should use some form of latency profiling, either using profile=sched or using latencytop.
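In case it helps anyone trying this, here is a rough sketch of both suggestions. The option names and paths are from memory (readprofile from util-linux, latencytop requiring CONFIG_LATENCYTOP=y), so verify them against your own tools before relying on this.

# Option 1: scheduler latency profiling
#   boot the kernel with the extra parameter: profile=sched
readprofile -r                                        # reset the profiling counters
# ... reproduce the stall ...
readprofile -m /boot/System.map-$(uname -r) | sort -nr | head -20   # top hit sites

# Option 2: latencytop (kernel built with CONFIG_LATENCYTOP=y)
sysctl kernel.latencytop=1    # the latencytop tool normally enables this itself
latencytop                    # shows what tasks are blocking on, and for how long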
On Thu, 2008-12-25 at 10:32 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> This bug is getting extremely unwieldy and seems to be mixing several
> separate issues. Would anyone be opposed to opening a new bug for the high
> I/O wait times issue?
>
> While I understand the difficult nature of this issue, the lack of kernel
> developer involvement in the handling of this bug is extremely worrisome

These two are closely related -- I have no clue what all this is about anymore. We did fix some of the issues with the per-bdi writeback stuff, but clearly there is more.

But everybody and his dog piling everything that makes his particular thing go slow into this one bug makes for very muddy waters indeed. Furthermore, the linear nature of bugzilla combined with 2+ years of history makes it impossible to even start reading this thing.

So yes, please open new bugs and try to keep issues separate.
I think the original bug was that not having highmem apparently affects memory balance in a way that makes the system unresponsive under heavy IO. I've seen that in a few cases myself, so I'm willing to believe it.

Then after that people started to chime in with lots of different issues.

Hmm, one way to verify the theory above would be to simulate highmem on a 64-bit system and see if it makes the problem go away. I can cook up a patch to do that.
(In reply to comment #168)
> I think the original bug was that not having highmem apparently affects
> memory balance in a way that makes the system unresponsive under heavy IO.
> I've seen that in a few cases myself, so I'm willing to believe it.

This would definitely explain the large number of x86_64 users here, but how did this go unnoticed on other non-x86/highmem architectures?

> Hmm, one way to verify the theory above would be to simulate highmem on a
> 64-bit system and see if it makes the problem go away. I can cook up a
> patch to do that.

I'd be willing to test this.
I have finally opened a new bug (bug #12309) for the initial issue with high iowait times. I tried to restrict the description of the new bug to only known facts about the issue and it would be appreciated if we could keep this one a little bit more on-topic. It would be great if we could pick up discussion on #12309. If anyone has comments which they think might help and are certainly germane to the original issue, feel free to comment on the new bug.
The problem is not restricted to the amd64 architecture. It first appeared for me on a Pentium M with Ubuntu Gutsy (I think that was 2.6.24), and a 32-bit installation does not fix the poor desktop responsiveness.
Created attachment 19758 [details] Program for measuring the issue

After comparing some kernel code, I have found some really interesting facts. I think the poor desktop responsiveness is caused by the changed process scheduler (e.g. tickless kernel / high-resolution timers ...) and not by the disk scheduler. I have written a test program (sorry for the dirty code) which provokes the problem and allows it to be measured.

Here are some results. I executed the tests in recovery mode (kernel parameter "single"), once with 20 processes * 1,000,000 messages and once with 100 processes * 100,000 messages. The values given are the echo time for ~80-90% of the messages / the longest echo time / the test duration.

CentOS 2.6.18-92.el5
 - 20/1M: 4µs / 1s / 38.4s
 - 100/1k: 4µs / 1s / 18.7s

Ubuntu 6.06 - 8.10
 2.6.15-53
 - 20/1M: 3-33µs / 1s / 33.6s
 - 100/1k: 3-40µs / 1s / 17.7s
 2.6.20-17
 - 20/1M: 3µs / 1s / 32s
 - 100/1k: 3-9µs / 1s / 16.0s
 2.6.22-16
 - 20/1M: 3-4µs / 7ms / 51.5s
 - 100/1k: 4µs / 1s / 25.9s
 2.6.24-23
 - 20/1M: 53µs / 64µs / 73s
 - 100/1k: 77-250µs / 41ms / 32.0s
 2.6.27-9
 - 20/1M: 120-200µs / 120ms / 159s
 - 100/1k: 500-1000µs / 1s / 84s

Executing the test with 100/1M under X.org/GNOME provokes the problem. There are no problems on CentOS and Feisty (I could not test Ubuntu 6.06), but I get heavy responsiveness problems with Hardy, Intrepid and Fedora 10. With 2.6.22 (installed on Feisty) the problem sometimes occurs and sometimes does not.
Very nice -- a test case. Could you perhaps post it to linux-kernel/linux-mm with another description of the problem?
I hit the problem with "time ./ProcessSchedulerTest 100 50000" -- I didn't need to go any higher. I am using x86_64 on vanilla 2.6.28, with 4 GB RAM and not much load on the system (about 200 processes without the test case). The symptoms were the same: the mouse froze and a new Konsole window took ages to come up. So it does seem that this has nothing to do with the disk.

Unfortunately, I can't try a 2.6.17 kernel because I upgraded my '/' to ext4. Can anyone else try this test case on 2.6.17 (or anything prior to 2.6.18) on 64-bit?
Just want to mention that I ran the test case as a normal user from a Konsole window.
The test case is excellent, but I haven't seen it pop up on the LKML yet. Thomas, do you intend to send it to the list? I could probably put something together if you are short on time.

Also, we should collectively decide whether or not to migrate the discussion to bug #12309. In my opinion the sheer size of this bug is making discussion a bit unwieldy, and #12309 would therefore be a better venue, but if others disagree I can close it and we can resume here.
(In reply to comment #172)
> I think the poor desktop responsiveness is caused by the changed process
> scheduler (e.g. tickless kernel / high-resolution timers ...) and not by
> the disk scheduler. I have written a test program (sorry for the dirty
> code) which provokes the problem and allows it to be measured.

To clarify, are you proposing that the high I/O wait times and the poor responsiveness are two separate issues?

I just now had to wait for 5 minutes while Evolution was moving messages to a large folder during a kernel build. Interestingly, directly after the kernel build completed, Evolution's rate of progress increased dramatically. During the kernel build, pdflush was always in iotop's top three and total disk I/O throughput was <1 MB/s. Wouldn't this tend to suggest a block layer or memory management issue?
see http://bugzilla.kernel.org/show_bug.cgi?id=12309
My company has a client with this issue. If it would help, I might be able to get the RAID controller in question shipped to someone that can do some debugging. I'll have to ask.
OK, wow... I just solved the problem here. Apparently write caching was disabled on the drive; I'd never used sdparm before. I found a thread on Web Hosting Talk that sounded like this issue and explained the fix:

sdparm -s WCE=1 /dev/sda

And everything's fine. If that helps others, I take no credit; if it doesn't, I'm sorry for adding to the noise.
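If you want to check whether this applies to your setup before changing anything, the drive's write-cache state can usually be read back first. The option names below are the ones I believe sdparm and hdparm use, so double-check them against your versions, and note the power-loss caveat raised in the next comment.

sdparm --get=WCE /dev/sda    # SCSI/SAS (or RAID volumes presented as SCSI): 1 = write cache enabled
hdparm -W /dev/sda           # plain (S)ATA disks: reports whether write-caching is on

# enable it, as in the comment above, only if you accept the risk:
# sdparm -s WCE=1 /dev/sda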
I have the same issue on a machine with an i845E chipset, a P4-1.5 Northwood, 2 GB DDR RAM, a GF6800 video card and an Audigy2 sound card. My main HDD is a 160 GB IDE Seagate. When there is disk activity the system becomes virtually unusable. For example, when I am burning a DVD on the drive attached to a SiI 3512 SATA controller, the CPU load goes from 40% at 7-8x to 98% at 16x. Downloading a Fedora 12 ISO last night at 500 kb/s kept the system busy at 90%! If I start a kernel compile, the CPU load is a stable 100%, which is okay, but switching tabs in Firefox takes 10 seconds and starting any application such as JuK, Dolphin or Konsole takes up to a minute.

I am running Fedora 11 with the 2.6.30.9-96 FC11 i686 PAE kernel. The system has become a bit more responsive (by about 10-20%) since I noticed p4-clockmod was being loaded and shut it down.
Re: comment #180 (using sdparm to enable write caching):

Be careful -- I think this enables the hard drive's internal write cache. On my system, with Hitachi disks, the on-disk cache is not safe across a power loss, and I think that is the case with a lot of hard drives. We can consistently corrupt the disk by pulling the plug during heavy write activity on a non-journaled filesystem (or with databases writing to raw devices). I.e., the disk drive tells the controller that the write completed, but the disk drive lies.

Although it may corrupt the disks during a power failure, turning on the drive's internal write caching does dramatically increase throughput (5x for heavy writes), decrease the CPU load average, and lower the I/O wait percentages.

What I don't understand is this: I'm using a PERC 6 (MegaRAID) disk controller with a battery-backed writeback cache. I would think that it would do most of the caching work and that the drives' internal write caches would make little difference, but that's not the case.
There is no "lying."

It has been documented in the ATA and SCSI specifications for over a decade that FLUSH CACHE (ATA) or SYNCHRONIZE CACHE (SCSI) commands are needed to flush the disk cache to the media. This is exactly the same as the write cache in every operating system on the planet, where dirty data will sit in the page cache for a time before being flushed to disk.

If you want to guarantee your data survives a power failure, either
 * use a filesystem or database that knows to flush the cache, or
 * turn off the writeback cache.

These disk drives are not telling "lies." They are behaving exactly as the decade-old specifications say they should behave.
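To make the first option above concrete: on ext3 (which did not enable them by default the way ext4 does), write barriers are what make the journal commit issue those cache-flush commands. A minimal sketch, assuming the root filesystem is ext3 and that your storage stack passes barriers through (some dm/md setups of that era did not):

# turn on write barriers for an ext3 filesystem
mount -o remount,barrier=1 /

# sanity check -- depending on kernel version the option may show up here,
# and a "barrier-based sync failed" message in dmesg means the device refused them
grep barrier /proc/mounts
dmesg | grep -i barrier

# the other option from the comment above: disable the drive's volatile write cache
# sdparm -s WCE=0 /dev/sda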
I am now on 2.6.32 and the issue persists.
Zenith -- the most satisfying guess as to the bad behavior I was seeing was a combination of ext3 and fsync forcing a blocking flush of the cache that essentially broke the system. That is fixed only with ext4 and btrfs. Can you confirm your filesystem type?
(In reply to comment #183)
> There is no "lying."
>
> It has been documented in the ATA and SCSI specifications for over a decade
> that FLUSH CACHE (ATA) or SYNCHRONIZE CACHE (SCSI) commands are needed to
> flush the disk cache to the media.
>
> If you want to guarantee your data survives a power failure, either
>  * use a filesystem or database that knows to flush the cache, or
>  * turn off the writeback cache.

My point is that companies like Dell are shipping enterprise-class servers where the hard drive's internal cache (not safe across a power loss) is enabled and the Linux operating system/drivers don't issue a SCSI SYNCHRONIZE CACHE command to the disk when flushing cache. If the box loses power during heavy write activity, you have a high probability of disk corruption. I suspect this is common industry practice, and I bet very few people have the expertise to find a problem like this.

For the next system we order, we'll ask Dell whether that version of Linux and its disk drivers issue SCSI SYNCHRONIZE CACHE commands when flushing the Linux I/O cache, and whether a direct-I/O write (bypassing the Linux cache) to a non-filesystem block device causes a Linux I/O cache flush (probably not) or a SCSI SYNCHRONIZE CACHE operation (probably not).

One more note: when we execute a heavy write operation (loading a database) with the hard drive's internal cache enabled, it runs fairly well on our Linux system. With the internal cache disabled, heavy writes slow down dramatically (a factor of 4, even with the disk controller's writeback caching). The slow heavy writes cause the load average on the box to go high and the box becomes unresponsive (but doesn't crash). It sure seems like the disk subsystem getting overloaded with writes causes Linux to have problems.

Ben
(In reply to comment #186)
> My point is that companies like Dell are shipping enterprise-class servers
> where the hard drive's internal cache (not safe across a power loss) is
> enabled and the Linux operating system/drivers don't issue a SCSI
> SYNCHRONIZE CACHE command to the disk when flushing cache.

Barrier support has been there for ages now, and what enterprise distro is crazy enough to turn off barrier support, which BTW is on by default on all popular filesystems? Please stop spreading misinformation.

> One more note: when we execute a heavy write operation (loading a database)
> with the hard drive's internal cache enabled, it runs fairly well on our
> Linux system. With the internal cache disabled, heavy writes slow down
> dramatically (a factor of 4, even with the disk controller's writeback
> caching).

In theory, with enough queue depth, WT should be able to achieve throughput similar to WB, but theory is one thing and reality another. Firmwares are not optimized for such use cases and the queue depth isn't quite enough. This doesn't have much to do with the OS or drivers, so if you think WT performance is critical, please complain to the hardware vendors.

The system being too unresponsive under high I/O load is the topic of this bug, and I think it is mostly resolved in mainline, at least with ext4. This bug entry has become almost impossible to follow, so let's stop adding stuff here. If anyone is experiencing unreasonably high latency under ext4, please open a new bug report which specifically targets ext4. Thanks.
(In reply to comment #185)
> Zenith -- the most satisfying guess as to the bad behavior I was seeing was
> a combination of ext3 and fsync forcing a blocking flush of the cache that
> essentially broke the system. That is fixed only with ext4 and btrfs. Can
> you confirm your filesystem type?

ext4