Bug 7372 - Unusable system (ie slow) when copying large files.
Unusable system (ie slow) when copying large files.
Status: CLOSED OBSOLETE
Product: Memory Management
Classification: Unclassified
Component: Other
i386 Linux
: P2 high
Assigned To: Todd
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2006-10-16 06:54 UTC by Brice Figureau
Modified: 2012-05-12 01:44 UTC (History)
52 users (show)

See Also:
Kernel Version: 2.6.18
Tree: Mainline
Regression: No


Attachments
lsmod output (1.28 KB, text/plain)
2006-10-16 06:55 UTC, Brice Figureau
Details
kernel .config (33.57 KB, text/plain)
2006-10-16 06:57 UTC, Brice Figureau
Details
cat /proc/meminfo (777 bytes, text/plain)
2006-10-16 06:58 UTC, Brice Figureau
Details
cat /proc/interrupts (1.20 KB, text/plain)
2006-10-16 06:59 UTC, Brice Figureau
Details
vmstat -n1 output (4.11 KB, text/plain)
2006-10-16 07:03 UTC, Brice Figureau
Details
iostat -x sda sdb output (3.29 KB, text/plain)
2006-10-16 07:03 UTC, Brice Figureau
Details
gzipped top -H -d 0.5 -b output for a few seconds (16.64 KB, application/x-tar)
2006-10-16 07:05 UTC, Brice Figureau
Details
dmesg output (502.12 KB, text/plain)
2006-10-23 08:14 UTC, Michael Meeks
Details
top -H -i -d 0.5 output under 2.6.19 (75.42 KB, text/plain)
2006-12-07 07:08 UTC, Brice Figureau
Details
vmstat -n 1 output under 2.6.19 (2.82 KB, text/plain)
2006-12-07 07:09 UTC, Brice Figureau
Details
first sysrq-t task list output (gzipped) (10.26 KB, application/x-gzip)
2006-12-08 08:16 UTC, Brice Figureau
Details
second task list (sysrq-t) gzipped (13.61 KB, application/x-gzip)
2006-12-08 08:18 UTC, Brice Figureau
Details
make balance_dirty_pages use nonblocking writeback (499 bytes, patch)
2006-12-08 13:25 UTC, Andrew Morton
Details | Diff
Oprofile reports - running only mysqld (20.11 KB, text/plain)
2006-12-15 08:41 UTC, Brice Figureau
Details
Oprofile report stressed version -running mysqld + cp (18.16 KB, text/plain)
2006-12-15 08:43 UTC, Brice Figureau
Details
Output of dstat while calculating a checksum (88.45 KB, text/plain)
2007-02-05 12:44 UTC, henkjans_bagger
Details
Output of SysRQ during stall (48.58 KB, text/plain)
2007-02-06 13:07 UTC, henkjans_bagger
Details
Output of lspci -nnvvv as requested (19.60 KB, text/plain)
2007-02-13 15:17 UTC, henkjans_bagger
Details
output of dmesg after boot (=identical after stutter) (15.46 KB, text/plain)
2007-02-13 15:19 UTC, henkjans_bagger
Details
trace-it output (172.96 KB, application/x-tar)
2007-02-17 02:27 UTC, John Blair
Details
output of diff /tmp/config.x86 /tmp/config.x86_64 > /tmp/config.diff (1.72 KB, application/octet-stream)
2007-06-14 17:31 UTC, devsk
Details
output of dmesg after boot AMD64 (18.82 KB, text/plain)
2007-06-18 09:21 UTC, henkjans_bagger
Details
Output of hdparm -I (2.44 KB, text/plain)
2007-06-25 13:25 UTC, henkjans_bagger
Details
output of hdparm -I (2.03 KB, text/plain)
2007-06-25 13:26 UTC, henkjans_bagger
Details
2.6.18 bounce fix (536 bytes, patch)
2007-07-17 07:31 UTC, Jens Axboe
Details | Diff
OProfile of heavily IO loaded system (372.96 KB, application/x-Bzip)
2008-12-17 12:22 UTC, Ben Gamari
Details
Program for measuring the issue (5.03 KB, application/octet-stream)
2009-01-12 00:54 UTC, Thomas Pilarski
Details

Description Brice Figureau 2006-10-16 06:54:01 UTC
Most recent kernel where this bug did not occur: don't know, tested only on 2.6.18
Distribution: Debian Etch
Hardware Environment: Dell PowerEdge 2850, x86_64, 2x Xeon 3.2GHz, RAM 4GB,
Megaraid, 2 hardware RAID1 volumes, ext3 fs on top of lvm2.
Software Environment:
mysql 5 is running on the server.

Problem Description:
During our nightly backup, we are copying (with cp) some big (a few GB) mysql
files from on volume to the other. 
During this time the system is almost unusable:
 * mysql doesn't serve new request (ie it can't use the disk)
 * ssh login is almost unpossible (ie very long)
 * top -d 0.5 almost freeze sometimes

This is completely reproducible with any io scheduler (cfq or as are better but
it is marginal).

Steps to reproduce:
1) have a mysql instance that consumes a large part of the RAM (ie 3.5GB over
4GB of physical RAM)
2) issue a cp for some big files
3) watch the server running only the copy, other process seems to be blocked
waiting for the disk to be ready. for instance a simple "ls" finally run about 3
to 4 seconds later after pressing return.

More information:
swappiness is at 20, kernel compiled without preemption, but with
SMP and Hyperthreading enabled.

More information as attachement.

The server is a live mysql server, I can't reboot it whenever I want, so it will
be a little bit complex to diagnose.
Comment 1 Brice Figureau 2006-10-16 06:55:53 UTC
Created attachment 9256 [details]
lsmod output
Comment 2 Brice Figureau 2006-10-16 06:57:53 UTC
Created attachment 9257 [details]
kernel .config
Comment 3 Brice Figureau 2006-10-16 06:58:46 UTC
Created attachment 9258 [details]
cat /proc/meminfo
Comment 4 Brice Figureau 2006-10-16 06:59:47 UTC
Created attachment 9259 [details]
cat /proc/interrupts
Comment 5 Brice Figureau 2006-10-16 07:03:08 UTC
Created attachment 9260 [details]
vmstat -n1 output
Comment 6 Brice Figureau 2006-10-16 07:03:59 UTC
Created attachment 9261 [details]
iostat -x sda sdb output
Comment 7 Brice Figureau 2006-10-16 07:05:27 UTC
Created attachment 9262 [details]
gzipped top -H -d 0.5 -b output for a few seconds
Comment 8 Martin J. Bligh 2006-10-16 07:39:56 UTC
I had the same thing happen to me ... turned out DMA was off on the drive.
Check with hdparm?

Comment 9 Brice Figureau 2006-10-16 07:51:09 UTC
Hum, those are 10k RPM SCSI drives behind a brand new PCI megaraid card, so
hdparm is of no help, and I suppose DMA is activated.
 
Throughput of the storage subsystem by itself is very good, it is just that the
server becomes unusable when two process competes for it.
Comment 10 Brice Figureau 2006-10-16 07:53:51 UTC
Important Side note:
I ran the almost same test on a less powerful server running 2.6.17.13, and it
worked really fine (I was able to use the server for something else beside the
running copy).

So I guess the issue is with 2.6.18.
Comment 11 Andrew Morton 2006-10-16 11:32:28 UTC
Was the "less powerful server" also using the megaraid driver with
hardware RAID and LVM?   We could be seeing an interaction between the
IO scheduler and the device driver layer here.

Also, it'd be interesting to see if

	echo 1 > /proc/sys/vm/dirty_background_ratio
	echo 2 > /proc/sys/vm/dirty_ratio

improves things.

Comment 12 Michael Meeks 2006-10-16 12:18:53 UTC
A simpler recipe that is trivially repeatable for me is:

        mount -o loop $iso /tmp/iso
        sudo /tmp/cold       # fairly typical drop_caches / BLKFLSBUF script
        cp -a /tmp/iso /tmp/foo-slow &
        sleep 1 ; time ls /usr

with parallel cp:
        real: 2.210, 2.020, 2.350  : avg. 2.193
without:
        real: 0.317, 0.377, 0.379  : avg. 0.358

ie. 6x degradation in I/O - of course, the user-visible symptoms in terms of
desktop responsiveness are of long hangs when any I/O is attempted. The hardware
is a fairly standard T60p - SMP kernel, SATA disk, and $iso is just a CD ISO of
your choice.

HTH.
Comment 13 Brice Figureau 2006-10-17 00:51:27 UTC
Andrew,

The "less powerful" server has the same megaraid card, but there is only one
RAID1 volume (also ext3 on top of LVM2), only 2GB of RAM and is running also as
a mail server (along with mysql 5).
The only real differences are:
 1) it runs 2.6.17.13
 2) mysql does not use almost all the physical RAM

The two "echos" didn't change anything to the issue.
My next test will be to run 2.6.18 on the "less powerful" server just to see if
it exhibits the same issue.
My next next test will be to run 2.6.17.x on the server running 2.6.18 (this is
trickier as the server is live, I need to prepare a maintenance window).

I did, like Michael, time ls /usr, just to see. After more than 10s without the
ls running, I stopped the cp. ls ran and printed:
real	0m11.319s
user	0m0.000s
sys	0m0.000s

As you can see, it never ran!

This all makes me thing that it is a process scheduler issue and not an io
scheduler (which makes sense since I tried almost all the io scheduler and the
problem is the same).

Moreover, I'm heading for a business trip for the next 5 days, so I'll perform
all those tests next week.

Many thanks for your help,
Comment 14 Michael Meeks 2006-10-23 08:14:37 UTC
Created attachment 9334 [details]
dmesg output 

Someone instructed me to do: 
echo 1 > /proc/sys/kernel/sysrq # before
echo t > /proc/sysrq-trigger # (during stall)

the trace may be useless, my stall is rather shorter than Brice's - and I
believe I did 1 trace before starting the I/O.
Comment 15 Brice Figureau 2006-10-26 04:34:41 UTC
Hi,

I did the following test:
I took my extra 'less powerfull' server, upgraded it to 2.6.18, and repeated the
cp test while using mysql.

I couldn't observe the same issue as observed on the other server, so I
certainly missed another important factor.
This certainly is a combination of memory usage (the mysql instance is using
3.5GB or physical RAM along as about 5GB VSS out of 4GB total physical RAM), and
i/o load (mysql load is low but constant). There is almost no shared and buffers
used (since mysql uses O_DIRECT for file access). 

Could that be that cp adds more buffers (and things like that) and since there
is not enough physical memory there is a kind of struggle between them?

Next test will be to run 2.6.17.x on my mysql instance and try to reproduce the
issue.

Thanks for your help,
Brice
Comment 16 Brice Figureau 2006-12-07 06:59:35 UTC
Hi,

A quick followup to my issue. 
I performed our monthly maintenance window, so I could restart the server. I
switched from 2.6.18 to 2.6.19.

But alas, it still exhibits the same issue as before, although it feels a little
bit better.

I tried several experiments like running the cp command with nice -n19, but it
didn't really help.

I'm puzzled.

Another experiment, was running 'dd' while running mysql:
$ dd if=/data/bigfile of=/root/bigfile iflag=direct oflag=direct
it doesn't exhibit the problem,
running the same dd without direct, exhibited this usual problem.

I will add the top -H -i output to the bug report taken while running the dd
(without direct), and it shows that all the mysqld threads seems stuck D state
during the whole time dd run (at least at the sampling points (every 500ms)).
I also added a new vmstat output if that matters.

So it seems that a long thread reading (dd) seems to be able to block many
really short writing threads (mysql).

I also tried to ionice -c3 or -c2 -n7 the dd process, and things feel better,
but it is not that best, and as usual incurs long pauses for mysql.

Any ideas?
Comment 17 Brice Figureau 2006-12-07 07:08:34 UTC
Created attachment 9754 [details]
top -H -i -d 0.5 output under 2.6.19

This file contains the top output which shows that the dd process seems to be
'favored' by the system.
Comment 18 Brice Figureau 2006-12-07 07:09:41 UTC
Created attachment 9755 [details]
vmstat -n 1 output under 2.6.19

vmstat output taken while running dd and mysql under 2.6.19
Comment 19 Brice Figureau 2006-12-08 08:16:09 UTC
Created attachment 9761 [details]
first sysrq-t task list output (gzipped)

I followed Micheal Meeks suggestion of using sysrq-t while mysql was frozen.
Michael was suggesting the mysql process could suffer from a lock contention.
I took too successive task trace, each of them while cp was running and mysql
suffered a freeze.

I attached the result to this bug report.
If someone more knowledgeable than me could parse it, that would be great.

Otherwise, I analyzed mysql read/write behaviour in my case, and it is mostly 
(ie over 90%) short writes to log files, because the whole database fits in
memory. So reads are performed from the mysql internal buffer space.

It occured to me that the kernel has not PREEMPT enabled, maybe that's the
cause of those long freeze?
Comment 20 Brice Figureau 2006-12-08 08:18:56 UTC
Created attachment 9762 [details]
second task list (sysrq-t) gzipped

Second gzipped trace, taken a few seconds after the first, but still while
mysql was frozen.

I hope that will help understand this specific issue.
Comment 21 Andrew Morton 2006-12-08 13:25:31 UTC
Created attachment 9766 [details]
make balance_dirty_pages use nonblocking writeback

You may find that this improves things.  It won't fix it,
but it might make it better.
Comment 22 Brice Figureau 2006-12-15 08:41:35 UTC
Created attachment 9828 [details]
Oprofile reports - running only mysqld

Andrew,

I didn't had the opportunity to test the patch you sent earlier, I'll do as
soon as I have a maintenance window on the server.

But I did had time to run a two oprofile session, in case it would help someone
to understand the issue.
Each report contains several different opreport, the first one is a general
one, then only the kernel, then finally kernel with call graph.

The first oprofile (oprofile_normal.txt) report was taken with only mysqld
running on the server (ie normal load). It shows that most of the time is spent
in mysqld (as it should) and that the clear winner in the kernel is:
sync_buffer.

The second oprofile report (oprofile_stressed.txt) was taken while running the
huge cp session along with the regular mysql load.
What's interesting is that most of the time seems spent running cp, and from
the kernel point of view, beside sync_buffer as before is in set_personnality.

I'll let the knowledgeable people interested in this report comment it...
I hope it'll help.
Comment 23 Brice Figureau 2006-12-15 08:43:22 UTC
Created attachment 9829 [details]
Oprofile report stressed version -running mysqld + cp

This one is the report of profiling while running the cp command.
Comment 24 Brice Figureau 2007-01-08 06:47:53 UTC
A quick followup,

I reverted the kernel on this machine to 2.6.17.13 (which I never tested in
production, and because of the ext3 corruption detected during holidays), and
everything is running smoothly during a huge file copy unlike with 2.6.18 or 2.6.19.

This narrows down the change to the zillions of commit between 2.6.17 and 2.6.18
(sorry). 

This is too bad that I can't git bisect because I'm not free to reboot the
server whenever I want.

While I'm running 2.6.17 if someone needs me to run some tests, please say so,
I'll do it happily.

Thanks,
Comment 25 henkjans_bagger 2007-01-27 04:03:47 UTC
It seems that the above bugreport is related to problems I experience with bad 
system performance upon copying/moving/unt(r)aring files, similar to problems 
reported in the gentoo-forum:

http://forums.gentoo.org/viewtopic-t-462052-highlight-.html
http://forums.gentoo.org/viewtopic-t-482731.html

My system is a simple AMD64, with 1 SATA drive, 1GB memory, so nothing special 
(running a kernel-2.6.18-gentoo) Upon copying a big file performing any 
operation that required disk I/O becomes impossible until the copying process 
is finished (throughput of the disk with hdparm is correct) as also shown by 
looking at the output of dstat during calculating a hash of an iso-file. 

-----------------------------------------------------------------------------
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq|_read _writ|_recv _send|__in_ _out_|_int_ _csw_
 32   7  48  12   0   0|  69M    0 |   0     0 |   0     0 |1246   897 
 26   6  46  19   0   1|  60M    0 |   0     0 |   0     0 |1132   933 
-----------------------------------------------------------------------------

Setting swappiness to 20 has no effectm nor does setting /proc/sys/vm/
dirty_background_ratio differently. I've also tried the patch given in this 
topic to see if it would resolve the problem. Unfortunately it does not solve 
the problem here. 

I'm happy to test other suggestions...
Comment 26 henkjans_bagger 2007-02-05 12:44:04 UTC
Created attachment 10296 [details]
Output of dstat while calculating a checksum

As a follow-up on my previous post I now attach the output of dstat while
calculating the checksum of an ISO-DVD with DVD. Simultaneous I try to open
other applications such as a simple terminal window. 

If you look in the line with the process to be started you see that there are a
large number of blocked processes. Something that does not happen when I try to
start an application without applying the heavy load. That does make sense as I
try to open applications during the calculation of the checksum, but they don't
appear. 

However there is one thing that appears to be important for this bug. You see
that roughly in the middle there is a line where 1290M is read by the harddisk,
which is clearly wrong as it should be ~60MB/s. 

That is exactly the point where I see that the whole system briefly stops in
this case very noticable. Also the output from dstat is halted. 
The large numbers are an artefact because it simply adds all reading that
occured over a much longer time (20 seconds in fact). 
Also visible for the network and system colums. Divide those numbers by 20 and
they correspond to the rest. 

As the IO clearly keeps running I think this provides clear proof that there is
a problem with the process scheduling codepath, not the IO scheduling. The only
strange thing is see is that exactly at that point 6 new processes are created,
but that could be an artefact as well. 

I hope this provides some insight into the problem. As said, I'm happy to test
things.
Comment 27 devsk 2007-02-05 13:23:55 UTC
There are too many people with x86_64 suffering from this issue. Can we please
have someone look at the data provided here and make some progress on this issue?
Comment 28 Andi Kleen 2007-02-06 03:36:32 UTC
The threads are blocked in io_schedule. Most likely a driver or block layer
problem.

I had a similar problem some time ago which is also still unsolved,
but the best theory for that was starvation in dirty accounting.
Comment 29 Jens Axboe 2007-02-06 05:55:11 UTC
Limit the queuing depth on the SCSI device, it's likely starvation there. Ala:

echo 4 > /sys/block/sda/device/queue_depth
echo 4 > /sys/block/sdb/device/queue_depth
Comment 30 Anonymous Emailer 2007-02-06 11:23:06 UTC
Reply-To: akpm@linux-foundation.org

On Tue, 6 Feb 2007 03:46:22 -0800
bugme-daemon@bugzilla.kernel.org wrote:

> I had a similar problem some time ago which is also still unsolved,
> but the best theory for that was starvation in dirty accounting.

yes, I expect that's the case.

Could the reporters please always capture the sysrq-T output during the
stalls?   I expect we're stuck in balance_dirty_pages().

People are slowly making changes in this area, but this one is tough and
really the guy who should fix it is me and I just don't seem to have
five minutes to think about the same thing lately :(

Comment 31 devsk 2007-02-06 11:52:40 UTC
the current value of /sys/block/sd[a-d]/device/queue_depth is 1 on my system.
Did you mean to increase it to 4?

Moreover, I can reproduce short freezes (of few seconds length) doing just a
copy of large file over cifs/smbfs, and the file involved is on IDE (/dev/hda5)
disk.

In another experiment I did, I copied a large file (2gig on /dev/hda) over cifs
and md5sum calculation on a file (3.5gig on /dev/sdd) at the same time in k3b.
During this time the system freezes for every 2-3 seconds and thruput of both
the disks falls to 2-5Mbytes/sec compared to normal 50-60Mbytes/sec and md5sum
calculation just crawls and mouse freezes. You can see the dstat output (sorry I
didn't save it, and formatting in bugzilla will be destroyed) at 

http://forums.gentoo.org/viewtopic-p-3888210.html#3888210

These are separate disks on separate channels, one IDE and one sata. Their
thruput should be pretty much to the max, even for concurrent access.

The thing is that the freezes and thruput reduction doesn't happen if network is
taken out of the picture.
Comment 32 henkjans_bagger 2007-02-06 12:28:20 UTC
I've tried the suggestions of Jens to see if setting queuedepth to 4 has an 
effect. It seems to have very little to no effect. While calculating that 
checksum I still have to wait 40s before a simple terminal opens.

More clear proof that it doesn't help is dstat again. I can still see stalls 
happening. In this case the system stalls for roughly 8 seconds while IO 
continues as you can see below. This is with the qeuedepth set to 4 (normally 
it is set to 31)

----------------------------------------
 32   4  49  14   0   0|  66M    0 | 
 28   5  43  23   0   1| 483M 8192B| 
 31   6  47  17   0   1|  60M    0 |  
----------------------------------------

@anonymous emailer...comment #30
In my case it is very difficult to trigger something recognizable like the 
stall above. Most freezes on my system are short....less than a second. Not 
even noticable when looking at the throughput from the HD.

Unfortunately I don't completely understand which info you need and at what 
exact time. Do you have a clear pointer to a document that shows me what to 
provide and how to do this. I do have the CONFIG_MAGIC_SYSRQ set to y in my 
kernel. 
Comment 33 henkjans_bagger 2007-02-06 13:07:37 UTC
Created attachment 10321 [details]
Output of SysRQ during stall

Ignore my previous question...stupid me, Brice already posted how to do it. 
So the attached file shows the output obtained with Sysrq-T.

I set it up by "echo 1 > .." , and already typed " echo t > ..." so that I only
needed to hit enter. Then I repeatedly calculated the checksum from an ISO with
K3B until I saw a clear stall. At that point I hit the enter in the console to
trigger the dump. At that specific point the system stalled for ~10s judging
from dstat. I hope that the dump indeed correspond to that point.
Comment 34 Anonymous Emailer 2007-02-06 13:55:50 UTC
Reply-To: akpm@linux-foundation.org

On Tue, 6 Feb 2007 13:16:58 -0800
bugme-daemon@bugzilla.kernel.org wrote:

> Created an attachment (id=10321)

hm, that's unexpected: everyone is sitting waiting for a read
to complete - nobody is stuck in balance_dirty_pages().

This points more at an elevator problem, or device driver, or
perhaps contention for the IDE bus itself.

Other people in thsi thread _have_ been seeing stalls due to
balance_dirty_pages() throttling, so I suspect we're looking at two
different and unrelated problems here.

So you were reading an ISO from a CDROM drive, yes?  Are all those
stalled processes also trying to read from IDE?  Perhaps on the second
channel of the same controller?

Comment 35 henkjans_bagger 2007-02-06 14:12:13 UTC
No, its not from a CD. It is an iso-image from a DVD that is located on my 
harddisk. I calculate the MD5 checksum within K3B, which you have to do to burn 
it to DVD. I took that as a testcase as it gives the clearest pattern I could 
find.

I have only a single SATA-harddrive in my system, so all other IO-operations 
should access the same HD. 

Note however that I try to open a terminal while this , which I do on a routine 
basis. For that reason that one should come from memory and not from the HD. 
However even that takes +40 seconds to happen. 
Comment 36 Anonymous Emailer 2007-02-06 14:39:14 UTC
Reply-To: akpm@linux-foundation.org


OK.

This actually could be a CPU scheduler problem - we've had problem in the past
where specific operations like this cause the mostly-busy task to trick the
CPU scheduler into giving it a too-high priority.

Please try nicing k3b down, see if that makes the problem go away.  Maybe

	nice -20 k3b ...

Comment 37 henkjans_bagger 2007-02-06 14:45:49 UTC
Sorry for spamming this evening...

I've now tried with an older kernel as requested by the starter of the bug-
report. I compiled the oldest 2.6.17 kernel in the Gentoo tree , which was the 
vanilla 2.6.17.6 version. 

With this older kernel there is no problem to open other applications while the 
checksum is being calculated. Also the system behaves decent under large copy 
operations such as "cp *.iso /dev/null". Which under the 2.6.18 kernel also 
leads to huge slowdowns. 
Comment 38 henkjans_bagger 2007-02-06 15:42:17 UTC
> Please try nicing k3b down, see if that makes the problem go away.  Maybe 
nice -20 k3b 

Are you sure you didn't mean "nice -n 20 k3b" (i thought negative values 
increased the priority). 

If I run k3b with "nice -20 k3b" the behavior is pretty much the same. Very 
long waiting times before an application appears and long stalls visible in 
dstat. If I run k3b with "nice -n 20 k3b" the behavior appears a bit smoother 
most of the time. But also here long waiting times happen. In both commands I 
tried I see the long stalls of ~10 seconds appear in the output of dstat. So 
overall I would say it has little effect.  

It might be relevant: in both cases I see K3B always on top of the list in the 
output of top with priority 34 or 35 and Nice equal to 19. 
Comment 39 Andrew Morton 2007-02-06 23:56:15 UTC
Ingo asks is Hankjans can try the -rt kernel  "I've got function tracing there
so if he 
can reproduce it there i can tell him how to get a proper trace and fix 
it in no time."

That's http://people.redhat.com/~mingo/realtime-preempt/
Comment 40 Brice Figureau 2007-02-07 07:37:24 UTC
henkjans_bagger@hotmail.com @ 2007-02-06 14:45:
>With this older kernel there is no problem to open other applications while the 
>checksum is being calculated. Also the system behaves decent under large copy 
>operations such as "cp *.iso /dev/null". Which under the 2.6.18 kernel also 
>leads to huge slowdowns. 

This match what I was observing on my server.
Can you try to git-bisect between 2.6.17 and 2.6.18 as I suggested in a private
mail ? 

Assuming we suffer from the same issue, it will isolate the patch that
introduced the problem and will allow kernel developper to spot the issue.

I can't do it myself because I can't freely reboot the server (ie I have strict
maintenance schedule every month for these systems, thus I estimated to find the
culprit in not more than 6 months :-P ).

Also, for the kernel developpers reading this bug-report would it help some
sysreq-t (or other debug material) from a non affected kernel (2.6.17) ?

Thanks,
Brice
Comment 41 henkjans_bagger 2007-02-07 10:32:27 UTC
> Ingo asks is Hankjans can try the -rt kernel ...

1) I don't know if it is good or bad news. Although it seems more difficult to 
trigger the stall I can reproduce the problem with the RT-kernel. I've taken 
the Vanilla 2.6.18 kernel and patched it with patch-2.6.18-rt3. Also in this 
case I'm able to get a clear 10s stall in the output from dstat. 

I was not sure which kernel to take, but as I have always been having the 
problem with the 2.6.18 kernel I thought moving to 2.6.20 would add an 
additional unknown. 

So the question: what should I do to provide a proper trace?

2) I've also been trying to narrow down what the last kernel was that did not 
show the problem. It seems 2.6.17.14 does not show the problem of the long 
stalls, while the first kernel from the 2.6.18 branch has this problem. 

Comment 42 henkjans_bagger 2007-02-07 16:05:44 UTC
I've now finally completed running a git-bisect. It took a long time as seeing 
the stall occur was not always easy. 

The output: 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1 is first bad commit

A quick search with google yields the info given below. Seeing that indeed the 
commit is related to libata/Sata I'm pretty sure that I indeed did not make an 
error during the git-bisect.
---
commit 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1 
tree 389d6daa728b2ba1bd8c2180cab705706449f62a 
parent 236ee8c33277ab48671995f26dc68a4639936418 
parent 0dd4b21f517e138ea113db255645fbae1bf5eef3 
author Linus Torvalds <torvalds@g5.osdl.org> Fri, 23 Jun 2006 15:58:44 -0700 
committer Linus Torvalds <torvalds@g5.osdl.org> Fri, 23 Jun 2006 15:58:44 -0700 

    Merge branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/
git/jgarzik/libata-dev
    
    * 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/
libata-dev: (258 commits)
    [libata] conversion to new debug scheme, part 1 of $N
    [PATCH] libata: Add ata_scsi_dev_disabled
    [libata] Add host lock to struct ata_port
    [PATCH] libata: implement per-dev EH action mask eh_info->dev_action[]
    [PATCH] libata-dev: move the CDB-intr DMA blacklisting
    [PATCH] ahci: disable NCQ support on vt8251
    [libata] ahci: add JMicron PCI IDs
    [libata] sata_nv: add PCI IDs
    [libata] ahci: Add NVIDIA PCI IDs.
    [PATCH] libata: convert several bmdma-style controllers to new EH, take #3
    [PATCH] sata_via: convert to new EH, take #3
    [libata] sata_nv: s/spin_lock_irqsave/spin_lock/ in irq handler
    [PATCH] sata_nv: add hotplug support
    [PATCH] sata_nv: convert to new EH
    [PATCH] sata_nv: better irq handlers
    [PATCH] sata_nv: simplify constants
    [PATCH] sata_nv: kill struct nv_host_desc and nv_host
    [PATCH] sata_nv: kill not-working hotplug code
    [libata] Update docs to reflect current driver API
    [PATCH] libata: add host_set->next for legacy two host_sets case, take #3
    ...
---


Comment 43 Brice Figureau 2007-02-08 00:50:04 UTC
>From henkjans_bagger@hotmail.com  2007-02-07 16:05
>I've now finally completed running a git-bisect. It took a long time as seeing 
>the stall occur was not always easy. 
>
>The output: 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1 is first bad commit
>
>A quick search with google yields the info given below. Seeing that indeed the 
>commit is related to libata/Sata I'm pretty sure that I indeed did not make an 
>error during the git-bisect.

Thanks for taking time to bisect the problem.

Unfortunately at first glance, that's not good news for me: my servers are pure
SCSI based (no sata)...

That means to me either:

 1) your problem is different than mine as was suggested earlier
or

 2) the aforementionned patch changed a subtle thing in the SCSI layer (libata
is implemented behind). The only thing it seems to touch in the upper SCSI layer
is the implementation of a new SCSI error handling. Maybe this change "broke"
something regarding the SCSI low-level driver I'm using (megaraid).

Regarding the -rt kernel (I'm no expert, I just read the rt wiki), you have to
make sure Latency Tracing is activated in the kernel configuration (under Kernel
Hacking I guess), and then you can read the offending stack traces from
/proc/latency_trace.

Brice
Comment 44 henkjans_bagger 2007-02-08 10:41:10 UTC
@Brice: I'm afraid somebody has to tell me in a bit more detail what info is 
needed from the rt-kernel. I've compiled that kernel with the given option. 
However what I see happening is that the content of /proc/latency_trace is 
changing all the time (latencies >3ms), so I have no clue what I should exactly 
look for or attach to this bugreport. 

Another disadvantage is that my system completely deadlocks after some time 
with the RT-kernel (only option is reboot via the powerbutton). Something I've 
never seen happening with any other kernel on this machine. 
Comment 45 henkjans_bagger 2007-02-13 11:54:55 UTC
It has now been almost a week ago since I performed a bisect to locate which 
commit introduced the problem in my specific case. See the post in comment 
#c42. 

Is any more information needed from me to be able to solve these seconds long 
stalls during any IO-intensive process? It would be really nice to be able to 
move to a newer kernel than 2.6.17. 

ps) As said: I can reproduce the problem with the RT-kernel as well, but I need 
to know what information I need to provide. 
Comment 46 Andrew Morton 2007-02-13 12:22:23 UTC
I don't have the faintest idea how a sata change could cause this
behaviour, sorry.

Perhaps you could have another go with git-bisect and see if
you can identify the offending change amongst those 20?
Comment 47 henkjans_bagger 2007-02-13 12:49:30 UTC
> Perhaps you could have another go with git-bisect and see if
> you can identify the offending change amongst those 20?

1) Unfortunately I fail to see how I should do that. I did the bisect between 
2.6.17 and 2.6.18. The given first bad commit was the last one in the tree. 
After that I could go no further. 

2) The output with the RT-kernel would not provide the needed input as asked in 
comment #39? As said I could reproduce it there as well.

 
Comment 48 Anonymous Emailer 2007-02-13 13:26:12 UTC
Reply-To: akpm@linux-foundation.org

> On Tue, 13 Feb 2007 12:59:31 -0800 bugme-daemon@bugzilla.kernel.org wrote:
> 1) Unfortunately I fail to see how I should do that. I did the bisect between 
> 2.6.17 and 2.6.18. The given first bad commit was the last one in the tree. 
> After that I could go no further. 

Yeah, I know.  git-bisect is a bit sucky.

I expect that you can bisect it further by doing `git-bisect good' and
`git-bisect bad' against the first and last patches in that series-of-20.

Comment 49 Tejun Heo 2007-02-13 13:45:08 UTC
Henkjans, your kernel might just be stuck in EH thus hanging all IO accesses. 
Does the kernel complain about anything (dmesg) after such stuttering?  Also,
please report the result of 'lspci -nnvvv' and 'dmesg' right after boot and
'dmesg' right after the problem has occurred.  Thanks.
Comment 50 henkjans_bagger 2007-02-13 15:17:27 UTC
Created attachment 10409 [details]
Output of lspci -nnvvv as requested
Comment 51 henkjans_bagger 2007-02-13 15:19:58 UTC
Created attachment 10410 [details]
output of dmesg after boot (=identical after stutter)

Although the stutter is clearly observable in the output of dstat the kernel is
not complaining at all. The output of dmesg after the stutter is identical to
the output of dmesg immediately after boot.
Comment 52 Brice Figureau 2007-02-14 01:34:27 UTC
>> Perhaps you could have another go with git-bisect and see if
>> you can identify the offending change amongst those 20?
>
>1) Unfortunately I fail to see how I should do that. I did the bisect between 
>2.6.17 and 2.6.18. The given first bad commit was the last one in the tree. 
>After that I could go no further. 

I think you'd have to fecth the libata git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git

Then:
git-bisect reset
git-bisect start
git-bisect bad 0dd4b21f517e138ea113db255645fbae1bf5eef3
git-bisect good v2.6.17
(I think that the last known bad commit is 0dd4b21..., but I'm not really fluent
with git...).

>2) The output with the RT-kernel would not provide the needed input as asked in 
>comment #39? As said I could reproduce it there as well.

Make sure you have the different debug RT options set in "Kernel Hacking" in
kernel config (see: http://lwn.net/Articles/146861/).
Then you can read the highest latency from /proc/latency_trace and observe dmesg
as there should be something there.

You can reset it (ie just before your test to make sure you read something
usefull) with:
echo 0 > /proc/sys/kernel/preempt_max_latency

Regarding your crashes with -rt, I unfortunately can't comment. Maybe you should
ask on lklm or a more appropriate list.

I'll perform a test soon (ie when I'll be able to cg-fetch a kernel) myself on
my db server by eliminating the patch you bisected from a 2.6.18 kernel to see
if it solved my problem or not. This will allow us to verify we're suffering the
same exact issue or not.

Maybe it is also time to bring the discussion to lkml to ask if other people
suffer from this issue ?
Comment 53 henkjans_bagger 2007-02-15 00:21:06 UTC
I'm not really convinced that it is usefull to run another bisect. (It would 
take me are very long evening as it is not easy to determine good or bad) If I 
look at the patches that showed up in my original run in comment #42 I can 
exclude 50% as those are drivers that I don't use. 

Basically my system uses ahci and libata. If my problem is the same as what 
Bruce was experiencing the file ahci.c can be excluded. If I then look at the 
patches that have been introduced for files with libata or scsi in their name 
for the commit given in #42 it seems to me that the changes to the files 
involved are closely related: 

ps) I looked at the patches at http://www.kernel.org/git/?p=linux/kernel/git/
torvalds/linux-2.6.git;a=commitdiff;h=6edad161cd4dfe1df772e7a74ab63cab53b5e8c1
Comment 54 John Blair 2007-02-17 02:27:59 UTC
Created attachment 10441 [details]
trace-it output 

Attached you will find trace-it output	of 2.6.19.1-rt15 during a particularly
busy period with an emerge --sync / updatedb /matlab in the background / and a
movie playing. There was some skipping of the audio. But the skipping and
unresponsiveness was not as bad as it sometimes gets after a few days of
uptime.
But maybe you can see from this the culprit function?? 
Btw where is a good tutorial on how to read/interpret the output  of trace-it?

Thanks
Comment 55 Leroy Raymond van Logchem 2007-03-02 03:53:37 UTC
Pages are kept in memory for re-use, which is fast and fine except for:
1) data without re-use value or even single use
2) applications _do not_ advise the kernel how to cache pages related
   to there self generated i/o. POSIX does provide mechanisms to
   properly do so. But the kernel should help these poor apps.

To minimize your MySQL backup cp(1) problem, try this workaround:

cat ./cp_direct.sh 
#!/bin/sh
dd if=$1 of=$2 bs=1M iflag=direct oflag=direct

Combine this with [/etc/sysctl.conf]:

vm.vfs_cache_pressure = 1
vm.dirty_ratio = 2
vm.dirty_background_ratio = 1

This should reduce both the stress on the vm and response latency during
interactive work.

--
Leroy
Comment 56 Ulf Dambacher 2007-03-19 11:06:00 UTC
I'm adding myself to the cc as I seem to be struck with the same bug.
2 dual core xeon running amd64 with 3ware 9550sx sata raid controller (scsi driver)
I tried completely without ide-ata or libata but the problem remains even using
only the scsi subsystem.
The system is idle, so if you have any suggestions. compiling works fast (small
files .-( )
Comment 57 Ulf Dambacher 2007-03-20 13:32:01 UTC
I can confirm that 2.6.17.14 works without showing this bug.
2.6.16.36 seems not to work -at least its worse !
the gentoo kernels for 2.6.18, 2.6.19 and 2.6.20 show this bug. I'm just testing
a vanilla 2.6.20.3 
Comment 58 Simon B 2007-03-21 15:54:45 UTC
*bump* *vote* 
hi
my specs: asus m2n-e , amd64X2 3800+ ,sata hdd , ide cdrom, nvidia driver, 1 
gig ram, 1gig swap , read all the bug reports and stuff, invested 4hours and am 
getting pissed on thisone 
 
I just wanted to emerge hugin and as it eats all the mem during compile it 
renders my system unusable under a serious swapstorm. Cpu is not that busy but 
disk i/o seems to kill it.

This is a issue which should be fixed ASAP!
Comment 59 Ulf Dambacher 2007-03-23 07:53:43 UTC
I did some compiling/booting cycles the last days trying to get a hold of this
bug on my server.
I stripped down my config to the needed things and tested vanilla 2.6.16 to
2.6.20 and was not able to reproduce my error.
Seems I have configured it out somewhere...

I use SMP, scheduler extensions for multicore and hypertheading, CFQ scheduler,
acpi, cpufreq, preemtion, scsi, sata(old) netfilter, e1000, i2c, sensors, usb,
ext2,ext3, vfat , i9xxx sata raid on scsi, and some other
I didn't config libata, numa support, sound.

Now I switched my kernel from old ide/sata to libata and volais: the bug
reappeared and I'm waiting for my ls 4min.

I will try to track it down and git-bisect it.
Comment 60 Ulf Dambacher 2007-03-23 15:05:50 UTC
Hi
I went backward within git every kernel from 2.6.20 to 2.6.16 and found the bug
on each kernel.

My test case: boot, mount my sata partititition and copy from the sata disk a
16GB file to my scsi/raid controller. watch the processes with top while doing so.
run a "time ls -l /usr/share/doc" 30sec. after starting cp.
With old ata it needs up to 30sec, with libata it waits for more than 10min!

you can see via top that every process wich wants disk gets an on an own cpu and
this cpu gets 100% waiting for i/o.
processes without disk access (like the top process) show normal and fast response.
Comment 61 devsk 2007-03-23 17:26:09 UTC
surprisingly, ulf.dambacher@web.de removed me from the cc list for some unknown
reason. More surprisingly, he could actually do that!!

anyway, this seems like a clear ata regression. Its even brought down to a set
of commits. What else is needed for kernel devs to pinpoint the problem? What is
the strategy going forward? Its been 5 months since the bug was reported, and we
haven't seen any serious in-depth research and/or update from any kernel dev.
Comment 62 Brice Figureau 2007-03-24 04:40:57 UTC
 @ulf.dambacher

I will soon test what Ulf discovered (ie libata is the culprit); since libata is
compiled in my kernels, but I don't use it (no SATA drives).
I will report my findings to the bug report.

There is still something that doesn't match: I'm not affected in kernel below
2.6.17 (included), which does not seem to be your case.

Did you had time to git-bisect the issue, to see if you come to the same
conclusion as henkjans_bagger@hotmail.com.

 @devsk
Yes, I do agree with you, we now have clearly identified the culprits, even if
not explained. That's a shame that this bug gets no kernel dev attention. Maybe
we should CC: the libata maintainer and move the bug section to Drivers.
Comment 63 Ulf Dambacher 2007-03-24 06:15:29 UTC
first: my apologies for removing devsk. I mistook his email for a spambot becaus
e I got emails with same timestamp as bugzilla. Won't happen again!

I did some more tests to find the problematic subsystems,
Now I am totally puzzled:

I compiled a kernel with only scsi and usb hd support , no ata or libata.
I booted and did my standart test: BUG

Now I reconfigured a bit to boot the machine with a 32bit kernel from sata. 
My 32bit kernel with scsi and libata does not show the bug and works
fluently-something.

From all this I conclude that
- it is amd64 specific. x86 on the same machine works
- I only see it  with the scsi subsystem and aditional subsystems using scsi
(like usb-storage or libata) - scsi+old ata/sata works
- it does not depend on the io-scheduler
- it is not dependend on preemption
- it is not dependent on acpi

Can one of the kernel gurus please give me a hint on how to do further testing.
I may even give you a live connection if needed...
Currently I have time for testing, but this machine should be online ASAP. 
If I don't find the bug, I have to reinstall 32bit (and I will hate that .-)

Comment 64 Anonymous Emailer 2007-03-24 08:48:16 UTC
Reply-To: akpm@linux-foundation.org


Ulf, what do you mean by "BUG"?  Reading back, I don't see anywhere where
you've actually described the symptoms which you're seeing.

And I have a suspicion that we're dealing with multiple, different problems
in this lengthy report.

Thanks.

Comment 65 Ulf Dambacher 2007-03-24 09:10:52 UTC
@akpm
My Bug shows up as follows:
I copy one 15GB backup file from one (removeable) harddisk to my main harddisk.
While doing so, the system gets unuseable: I can not log in. Every command I
type takes ages to complete. e.g. a simple ls in my home dir or a ls /usr/share
waits for more than 10min. If it ran before and gets data from cache it works
faster. Interactively running programs like top behave normal and fast as long
as they don't need disk access.
speaking of top: it shows that at least the cp process makes one cpu 100%
waiting. I see a process needing hd access gets on one processor which is then
100% waiting.
If I stop the cp process, it takes some time until the system gets responsive again.

Compare this with "normal operation": naturally if the system is on high disk
access other processes work less fast. But a simple ls while doing the cp does
not take more than around 5sec in my 32bit test case. with only scsi and old ata
on 64bit it takes some 3-10sec.

I think it's the same kind of behaviour Brice, who opened this bug, talks about.
Comment 66 Ulf Dambacher 2007-03-25 03:26:16 UTC
I did some more tests.
e.g. I completely removed any preemption, I turned of nmi watchdog. No visible
changes.

The only way I see normal/fast behaviour while doing my "cp" is, if I copy from
old ide/sata to my scsi system.
Trust the leds: source drive slowly glimming, destination drive leds flickering
with slower frequency showing write. If I do read i/o they flicker differently
and out of sync.
top shows %wa of around 20%. ls Commands behave normal/fast

If I try that with libata, I get the destination disk leds on signalling heavy
load, the sources is flickering with slow frequency.
by the out-of-sync flickering you can really see if other jobs get their read
access which happens in between.
And top shows %wa of 100% for at least 2 processes. and you have to wait for ls 

using usb_storage I get nearly the same behaviour as with libata.
Comment 67 Ulf Dambacher 2007-03-25 13:35:29 UTC
I found out why my bug shows up with usb and sata but not with the old ide/sata
system: It is too slow on my computer.
dstat revealed that the cp run with ~40M/s on usb or sata and only 3k/s on old
ide. so there was mutch time for other processes reads
Comment 68 John Helliwell 2007-04-18 13:02:36 UTC
I'm using ( Linux flatline 2.6.20-gentoo #1 PREEMPT x86_64 AMD Athlon(tm) 64
Processor 3000+ AuthenticAMD GNU/Linux )

I can confirm that if I perform a lot of reads using process A, (e.g an AIDE
scan), and a lot of reads using process B (i.e an emerge --metadata), then
process C (e.g adding one line of text to a 20 line text file and saving the
file) performing a small write is blocked and completes in around 30 seconds.

So to summarise I can create a condition on my linux box whereby it takes 30
seconds to extend a file by 3 bytes. 

In addition if multiple processes are performing intensive reads, the IO
throughput acheived by each process is extremely degraded

And this is the clincher ... if I boot into any kernel with version <= 2.6.15 it
works just fine.

Also I would like to say that there is little visibility in this bug as to how
many people are actually affected by this bug, perhaps because they do not post
to bugzilla.kernel.org, but the thread in the gentoo forums about this problem
is now 12 pages long. I just want to make people aware of how many people are
affected (=lots)

Info's to support this bug:
Arch: amd64, glibc2.5
Filesystems: xfs (also i have tried ext3 same problem occurs)
IO scheduler: cfq (I have tried the others, same problem)
Block layer: md raid0
Devices: sata_sil

It's pretty much a show stopper and I hope it can be solved. Is it being taken
seriously enough? There has been no activity on this bug for a while.
Comment 69 Brice Figureau 2007-04-18 14:09:48 UTC
John Helliwell,

Can you try to git-bisect the problem ?
Or ask on the gentoo forum you mention if someone can git-bisect precisely the
issue. 

If that matches what other bug poster have found, we'll forward everything we
know on lkml where it'll get more public scrutiny (and developper eyes too I hope).

Did you try all the mentioned tips, like vfs_cache_pressure and things like that ?

Are you affected with vanilla kernel 2.6.17 too ?
Comment 70 Thomas Tuegel 2007-04-19 11:45:38 UTC
I've been experiencing this issue for some time now.  I'm on an Athlon64 X2, 
1GB RAM, nForce 4 motherboard with both PATA and SATA drives, running Gentoo.  
System compiler is gcc-4.1.2 with glibc-2.5.

Yesterday, a user posted on the Gentoo forums about Ingo Molnar's "Modular 
Scheduler Core and Completely Fair Scheduler" patch 
(http://kerneltrap.org/node/8059).  I applied v3 of this patch 
(http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v3-v2.6.21-rc7.patch) to 
kernel 2.6.21-rc7.  The "stuttering" is present in unpatched 2.6.21-rc7, but is 
not present in 2.6.21-rc7 with Ingo's patch.
Comment 71 Brice Figureau 2007-04-19 13:47:09 UTC
@Thomas Tuegel

This doesn't mean that the scheduler is the culprit. It can just be that using
another scheduler gives less cpu to the various process doing reads, and thus
the process doing writes gets more chance to run (and does more processing).

Applying the cfs scheduler patch might fix the symptom but doesn't cure the disease.

Can some of the gentoo users can try to git-bisect the issue to see if that
matches what others have found ?
Can you please post the gentoo thread URL ?

Thanks,
Brice
Comment 72 Thomas Tuegel 2007-04-19 14:59:44 UTC
@Brice Figureau

Thread starts here: http://forums.gentoo.org/viewtopic-t-482731.html
I think it may have been mentioned before, but I couldn't find it in the 70-odd 
bugzilla posts.

You're right about symptoms vs disease.  It turns out that a process doing 
heavy reads no longer locks down the system, but a process doing heavy writes 
still does.  My apologies for the incorrect post earlier.

There are a few users on the forums now talking about doing git-bisects.  I'll 
be able to divert my attention to that this weekend.
Comment 73 Brice Figureau 2007-04-20 14:36:51 UTC
Just a quick followup to my original bug reports.

I ran 2.6.17 with great success since I discovered the issue in subsequent kernels.
I decided to move to 2.6.20.6 last week, to see if things have changed (who
knows?). Alas nothing new, it is still the same story.

Beside that, running with vfs_cache_pressure=1 seems to really help, not perfect
(as in 2.6.17) but much more usable.

As soon as I have the possibility to reboot the server, I'll try to revert the
culprit commit found by Henkjans.

I'll report my findings.
/me going back to 2.6.17 :-(  
Comment 74 Andi Kleen 2007-04-22 16:57:27 UTC
There is a patchkit on linux-kernel from Peter Zijlstra that might fix this.

http://thread.gmane.org/gmane.linux.kernel/518019

Can people perhaps test it and report if it fixes the problem?

Comment 75 devsk 2007-05-01 15:06:40 UTC
Did anyone try the patches in comment #74?
Comment 76 Peter Zijlstra 2007-05-10 05:55:52 UTC
New patches at:
  http://lkml.org/lkml/2007/5/10/161

These do include something to make inter task starvation better.
The previous set was only geared at inter device starvation issues.
Comment 77 Dwight Spencer 2007-06-12 10:19:31 UTC
I just finished reading over these comments, and see there has been no update since 2007/5/10.  Have the patches listed in c75/76 fixed issue? I've used both kernels 2.6.18 and 2.6.20 (stock kernels from yum updates via fc6) and have the issue of high iowaits during file reads & writes, and during these times requests to the io subsystem would block and wait.  I tried setting vfs_cache_pressure = 1, and that has helped system response time during io, but the iowaits still persist.

platform details:
cpu - amd athlon xp 2800+
memory - 2gb
motherboard - asus a7n8 - includes sata_nv (nvidia), sata_sil.
3ware 9500s controller - kern module 32_9xxx
   - lun 0 - 2 x 320gb seagate baracudas
   - lun 1 - 6 x 500gb seagate baracudas


I'm currently looking for a 2.6.17 kernel rpm (sorry, lowly user here) to install on the system.  If unsuccessful, i'll grab both kernel source for 2.6.17 and 2.6.21, including the patch, and give those a try during the next couple days.  

Comment 78 devsk 2007-06-12 12:29:55 UTC
I have another dimension to add to this problem. I took the 2.6.21 sources from my amd64 install and created a x86 kernel in a chroot install. I did a 'make mrproper', copied the amd64 .config and 'make oldconfig'. Followed by make menuconfig to verify the config. Booted the resulting x86 kernel and the freeze problems are gone.

So, in nutshell, I have two installs: an amd64 and an x86 with the same kernel source. I can easily make the mouse, gkrellm2, ksysguard, and other visual updates freeze for > 5 secs under amd64. Not reproducible even under much more (almost three times) severe IO load under x86. During the tests, the IO waits in x86 stay close to 50-70% range, touching 90s once in a while.

I am staying with x86 for now.

I see that the original reporter as well as many others reporting the issue are indeed using x86_64. So, can you guys please boot an x86 livecd with 2.6.21, mount your FSs and do you tests to verify if you see what I see.

If anyone needs me to post some info, please let me know.
Comment 79 Andi Kleen 2007-06-13 10:08:45 UTC
Does the x86 system run with APIC enabled or disabled? (you can 
see it in /proc/interrupts).  If no try running the 64bit system
with noapic for testing and see if that helps -- or run the 32bit 
with APIC enabled and see if it stalls too.

I recently looked at an unrelated stall on my workstation and one
of the problems seemed to be interrupt starvation for some interrupts.
Changing to PIC mode could possibly change that.
Comment 80 devsk 2007-06-13 10:45:03 UTC
APIC enabled on both. Same .config (well, up to the extent x86_64 and x86 can be same). Same cmdline (except for the install root) to both the kernels.

I have booted the 64-bit with 'noapic nolapic' in the past without any improvement with the problem.

irqbalance (I noticed one difference in x86_64 and x86 .config and that is that there is no in-kernel IRQ balancing in x86_64), running or not, doesn't make any difference.
Comment 81 henkjans_bagger 2007-06-14 11:41:11 UTC
This bug is so utterly annoying that I already downloaded the FreeBSD install-CD's to switch to their kernel as this bug leads to an unworkable system under load.

Based on the last posts of devsk I decided to first try to switch to a 32-bit system. I can confirm the findings of Devsk. The problems I experience with the 64-bit kernel are NOT present when running a 32-bit system. 

Hopefully with the bad commit reported in post #42 this gives a reason to look at that commit. Please have a look at this bug as it is annoying as hell for the people that experience it and it is still present in the latest kernel I've tried (2.6.22-rc3). 
Comment 82 Andrew Morton 2007-06-14 12:51:14 UTC
Subject: Re:  Unusable system (ie slow) when copying large files.

On Thu, 14 Jun 2007 11:38:07 -0700 (PDT)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=7372
> 
> 
> henkjans_bagger@hotmail.com changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |henkjans_bagger@hotmail.com
> 
> 
> 
> 
> ------- Comment #81 from henkjans_bagger@hotmail.com  2007-06-14 11:41 -------
> This bug is so utterly annoying that I already downloaded the FreeBSD
> install-CD's to switch to their kernel as this bug leads to an unworkable
> system under load.
> 
> Based on the last posts of devsk I decided to first try to switch to a 32-bit
> system. I can confirm the findings of Devsk. The problems I experience with the
> 64-bit kernel are NOT present when running a 32-bit system. 
> 
> Hopefully with the bad commit reported in post #42 this gives a reason to look
> at that commit. Please have a look at this bug as it is annoying as hell for
> the people that experience it and it is still present in the latest kernel I've
> tried (2.6.22-rc3). 
> 
> 

Jeff, Tejun: have you guys been following this long-running saga?  It seems
to be hurting a few people and I have a suspicion that multiple unrelated
problems are at play here.

But this reporter at least has bisected it down to an ata commit from a
year ago.

_some_ of these reports (eg #42 and thereabouts, and #68) do I think point
at an ata change.

But some of the others make it look more like the balance_dirty_pages()
starvation thing.  Quite possibly it's an interaction between both,
which means that fixing balance_dirty_pages() starvation will fix it.

Peter, could I ask that when you have time, you spin up a version of your
balance_dirty_pages patchset for the various reporters to try?  Against
2.6.21 would suit, I guess.

Thanks.

Comment 83 devsk 2007-06-14 17:31:39 UTC
Created attachment 11756 [details]
output of diff /tmp/config.x86 /tmp/config.x86_64 > /tmp/config.diff

Is anyone ready to take on the x86 vs. x86_64 angle? I have identical kernel source and .config (only differs in parts where autoconfig removes unsupported options or adds default options for the ARCH) files, and problem only appears in x86_64.
Comment 84 Dwight Spencer 2007-06-15 06:36:33 UTC
Well, i'm going to plead some ignorance here, on the x86 vs. x86_64. I do understand the difference (32/64bit), and I'm running x86 mode.

vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Sempron(tm)   2400+

But regarding the issue of the IOWAITs under high io load, the issue seems isolated to volumes connected via the 3ware controller, both raid1 (sda, pair of 320s) volume and raid5 (6 x 500gb) volume.  Another couple points I'll mention here.
- using xfs, reiserfs, ext2 and ext3, problem persists
- 2.6.18, 2.6.20, problem exists
- yet to test 2.6.17 - you guys want this tested?

write caching is enabled on both 3ware volumes.  When I copy data to these volumes, they will push iowaits up to 80-100%.  however, on the same system, I have an external sata drive mounted, and when I write to that (reading from nfs mounted disk from other machine in all tests), i -don't- get the iowaits on that volume, which is connected via sata_sil, i believe

here are some quick copy tests, WITH the option in vfs_cache_pressure set to 1


>> nfs mount to external sata
  1  34  26  23   2  15|   0    19M|   0     0 |   0     0 |3137  2388 
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  1  39   0  41   3  16|   0    41M|  27M  820k|   0     0 |3896  2576 
  1  35  22  25   2  15|   0    13M|   0     0 |   0     0 |3742  2288 
  0  35  30  17   1  17|   0     0 |  31M  949k|   0     0 |3569  2274 
  1  31  35  19   1  13|   0     0 |   0     0 |   0     0 |3430  2288 
  1  35  20  27   2  15|   0    18M|  29M  889k|   0     0 |3463  2369 
  0  24   0  56   4  16|4096B   42M|   0     0 |   0     0 |3533  2759 
  1  24  36  19   1  18|   0    11M|  27M  837k|   0     0 |3686  2471 
  1  28  32  22   2  16|   0     0 |   0     0 |   0     0 |3571  2400 
  1  31  35  18   2  13|   0     0 |  29M  898k|   0     0 |3410  1936 
  1  35  21  29   1  14|   0    14M|   0     0 |   0     0 |3314  2154 
  1  37   0  43   4  15|   0    40M|  28M  843k|   0     0 |3441  2381 
  1  36  21  25   1  16|   0    15M|   0     0 |   0     0 |3382  2180 
  0  36  30  19   2  14|   0     0 |  28M  877k|   0     0 |3515  2166 
  1  35  35  13   1  15|   0     0 |   0     0 |   0     0 |3816  2257 
  1  37  20  29   1  12|4096B   12M|  31M  942k|   0     0 |3089  2165 
  0  30   0  53   1  16|   0    41M|  26M  793k|   0     0 |3255  2785 

[root@achilles external]# time cp /swraid5b/linux.iso .

real    0m47.415s
user    0m0.078s
sys     0m5.700s
[root@achilles external]# 


>> nfs mount to sda, 3ware 9500 raid1 volume

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  1  36   0  46   1  15|  36k  560k|  32M  995k|   0     0 |3483  2282 
  1  37   0  44   2  16|1280k 3108k|   0     0 |   0     0 |3906  2450 
  0  38   0  47   1  14| 208k  332k|  31M  964k|   0     0 |3381  2426 
  1   5   0  93   0   1| 188k   19M|   0     0 |   0     0 |1222  1093 
  0   1   0  99   0   0|  40k 7544k|2565k   76k|   0     0 |1131   411 
  0   9   0  89   0   2|  16k   21M|   0     0 |   0     0 |1405  1338 
  1  38   0  46   2  13| 100k 2476k|  12M  355k|   0     0 |3482  2387 
  1  40   0  42   2  15|  20k 7492k|   0     0 |   0     0 |3667  2575 
  0  33   0  54   1  12| 116k 6412k|  27M  831k|   0     0 |2858  2233 
  1  24   0  73   0   2|  56k   19M|   0     0 |   0     0 |1635  2359 
  0  17   0  82   0   1| 116k   20M|4922k  147k|   0     0 |1209  2337 
  1  22   0  76   0   1|  52k   14M|   0     0 |   0     0 |1185  2027 
  6  25   0  68   1   0|  14M  116k|1904B 3233B|   0     0 |1635  2662 
  9  29   0  62   0   0|  10M    0 |   0     0 |   0     0 |1599  2637 
  2  18   0  79   0   1|1704k 5828k|3298B 1800B|   0     0 |1165  2043 
 50  19   0  32   0   0|2868k    0 |   0     0 |   0     0 |1130  1800 
 39  14  45   1   0   0|8192B    0 |1898B 2748B|   0     0 |1036  1520 

[root@achilles external]# cd
[root@achilles ~]# time cp /swraid5b/linux.iso .

real    0m57.618s
user    0m0.075s
sys     0m6.151s
[root@achilles ~]#


>> nfsmount to sdb, 3ware 9500 raid5 volume

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  0  30   0  52   1  17|   0  9876k|  23M  683k|   0     0 |3634  2647 
  1  29   0  51   2  17|   0    17M|   0     0 |   0     0 |3983  2899 
  0  32   0  54   1  13|   0    16M|  32M  948k|   0     0 |3562  2684 
  1  30   0  52   1  16|   0    14M|   0     0 |   0     0 |3417  2830 
  1  32   0  48   3  16|   0    16M|  29M  872k|   0     0 |4150  2963 
  1  28   0  52   1  18|   0    12M|   0     0 |   0     0 |3832  2780 
  0  32   0  50   1  17|   0  5815k|  33M  985k|   0     0 |3592  2555 
  1  34   0  48   1  16|   0  4841k|   0     0 |   0     0 |3922  2472 
  1  26   0  56   2  15|   0    16M|  29M  858k|   0     0 |3172  2763 
  1  25   0  57   1  16|   0    12M|   0     0 |   0     0 |3853  2914 
  1  23   0  55   2  19|   0    11M|  31M  919k|   0     0 |3856  2922 
  0  26   0  57   1  16|   0    13M|   0     0 |   0     0 |5678  4563 
  1  34   0  50   2  13|   0    17M|  25M 1060k|   0     0 |5751  4429 
  0  26   0  62   1  11|   0  6936k|   0     0 |   0     0 |3666  3402 
  1  25   0  69   0   5|   0    12M|  16M  708k|   0     0 |3309  3247 
  0  18   0  78   1   3|   0  7480k|   0     0 |   0     0 |1958  2519 
  1  24   0  65   1   9|   0    17M|8940k  452k|   0     0 |4654  4239 
  0  27   0  60   2  11|4096B   10M|   0     0 |   0     0 |4612  4004 
  1  29   0  57   2  11|   0    18M|  15M  785k|   0     0 |5110  4513 
  0  29   0  53   2  16|   0  9352k|   0     0 |   0     0 |6226  5387 
  1  31   0  52   2  14|   0  8316k|  23M 1184k|   0     0 |5949  4908 
  0  27   0  55   4  14|   0    18M|   0     0 |   0     0 |6541  5119 
  0  24   0  64   1  11|4096B 9460k|  20M 1075k|   0     0 |4312  3889 

...

[root@achilles array]# time cp /swraid5b/linux.iso .

real    0m52.801s
user    0m0.074s
sys     0m4.534s
[root@achilles array]# 

Which actually isnt' as bad as I've seen for IOwaits.  When I copy a larger file, such as > 2GB, i seem to end up with higher iowait values after a longer copy process.  I'm still not sure if my issue is with the iowait problem this thread is looking for, or, if i'm having an issue with my 3ware controller.  Searches for '3ware 9500s" and "high iowaits do return results, though on 3ware's site they do indicate this is a problem with the linux kernel, not their issue.



[root@achilles ~]# lsmod|egrep "ata|3w"
sata_nv                22469  0 
sata_sil               15689  1 
libata                104661  2 sata_nv,sata_sil
3w_9xxx                36037  5 
scsi_mod              138093  5 usb_storage,sg,libata,3w_9xxx,sd_mod
[root@achilles ~]# 


[portions of dmesg output]

scsi0 : 3ware 9000 Storage Controller
3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xdf840000, IRQ: 19.
3w-9xxx: scsi0: Firmware FE9X 2.08.00.009, BIOS BE9X 2.03.01.052, Ports: 12.
scsi 0:0:0:0: Direct-Access     AMCC     9500S-12   DISK  2.08 PQ: 0 ANSI: 3
SCSI device sda: 624977920 512-byte hdwr sectors (319989 MB)
sda: Write Protect is off
sda: Mode Sense: 23 00 00 00
SCSI device sda: write cache: enabled, read cache: disabled, doesn't support DPO or FUA
SCSI device sda: 624977920 512-byte hdwr sectors (319989 MB)
sda: Write Protect is off
sda: Mode Sense: 23 00 00 00
SCSI device sda: write cache: enabled, read cache: disabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 sda4 < sda5 >
sd 0:0:0:0: Attached scsi disk sda
scsi 0:0:1:0: Direct-Access     AMCC     9500S-12   DISK  2.08 PQ: 0 ANSI: 3
sdb : very big device. try to use READ CAPACITY(16).
SCSI device sdb: 4882708480 512-byte hdwr sectors (2499947 MB)
sdb: Write Protect is off
sdb: Mode Sense: 23 00 00 00
SCSI device sdb: write cache: enabled, read cache: disabled, doesn't support DPO or FUA
sdb : very big device. try to use READ CAPACITY(16).
SCSI device sdb: 4882708480 512-byte hdwr sectors (2499947 MB)
sdb: Write Protect is off
sdb: Mode Sense: 23 00 00 00
SCSI device sdb: write cache: enabled, read cache: disabled, doesn't support DPO or FUA
 sdb: unknown partition table
sd 0:0:1:0: Attached scsi disk sdb
libata version 2.00 loaded.
sata_sil 0000:01:0b.0: version 2.0
ACPI: PCI Interrupt Link [APC3] enabled at IRQ 18
ACPI: PCI Interrupt 0000:01:0b.0[A] -> Link [APC3] -> GSI 18 (level, high) -> IRQ 20
ata1: SATA max UDMA/100 cmd 0xF8836080 ctl 0xF883608A bmdma 0xF8836000 irq 20
ata2: SATA max UDMA/100 cmd 0xF88360C0 ctl 0xF88360CA bmdma 0xF8836008 irq 20
scsi1 : sata_sil
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: ATA-7, max UDMA/133, 488397168 sectors: LBA48 NCQ (depth 0/32)
ata1.00: ata1: dev 0 multi count 16
ata1.00: configured for UDMA/100
scsi2 : sata_sil
ata2: SATA link down (SStatus 0 SControl 310)
scsi 1:0:0:0: Direct-Access     ATA      WDC WD2500AAJS-0 12.0 PQ: 0 ANSI: 5
SCSI device sdc: 488397168 512-byte hdwr sectors (250059 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
SCSI device sdc: 488397168 512-byte hdwr sectors (250059 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdc: sdc1
sd 1:0:0:0: Attached scsi disk sdc



I have seen comments about people trying *bsd and not having this issue. I've been tempted to try another flavor unix, and might give this a try.  


dwight s.
Comment 85 henkjans_bagger 2007-06-16 09:27:47 UTC
(In reply to comment #82)

> Peter, could I ask that when you have time, you spin up a version of your
> balance_dirty_pages patchset for the various reporters to try?  Against
> 2.6.21 would suit, I guess.

Based on this response I've been trying a number of things today, but unfortunately did not get much further. One of the reasons is that on my system the appearance of the problem while running the 2.6.21 kernel is not as clear-cut as what I saw with 2.6.18.

1) I've applied the patches posted by Peter Zijlstra in comment #76 to the 2.6.21-mm2 kernel to check if it removes the problem. My impression is that the problem is still there with those patches, although less visible then with the clean 2.6.21 kernel. This would be consistent with the change I see going from 2.6.18 to 2.6.22. Again I checked both the 32 and 64 bit kernel, but it is hard to see a clear difference in this case. 

As I can't reach a clear conclusion here it might be good if somebody with a clearer stall of his/her system under 2.6.21 tries this as well.  

2) Seeing that I did not get much further with that approach I tried again to get the kernel running with the rt-patch as requested in comment #39 . I now have this running for both the 32 and 64 bit kernel. My first impression is that the RT-kernel does not really show the problems I experience under the normal 64bit kernel. As I'm not sure I would like to check the latency_trace output to confirm. 

Unfortunately /proc/latency_trace refuses to update with the 64bit RT-kernel. Note that the same kernel-config works fine for the 32-bit RT-kernel. My logs display the following error-message on this:
----
Jun 16 13:56:41 PulsarX2 BUG: at kernel/latency_trace.c:1099 update_out_trace()
Jun 16 13:56:41 PulsarX2 
Jun 16 13:56:41 PulsarX2 Call Trace:
Jun 16 13:56:41 PulsarX2 [<ffffffff8026ccbd>] l_start+0x1ed/0x960
Jun 16 13:56:41 PulsarX2 [<ffffffff80292d19>] obj_size+0x9/0x20
Jun 16 13:56:41 PulsarX2 [<ffffffff80293a67>] poison_obj+0x47/0x60
Jun 16 13:56:41 PulsarX2 [<ffffffff802b7bb1>] seq_read+0x71/0x2d0
Jun 16 13:56:41 PulsarX2 [<ffffffff80292cf9>] obj_offset+0x9/0x20
Jun 16 13:56:41 PulsarX2 [<ffffffff80293c0a>] cache_alloc_debugcheck_after+0x18a/0x1d0
Jun 16 13:56:41 PulsarX2 [<ffffffff80295939>] kmem_cache_alloc+0xe9/0x110
Jun 16 13:56:41 PulsarX2 [<ffffffff802b7c47>] seq_read+0x107/0x2d0
Jun 16 13:56:41 PulsarX2 [<ffffffff80299e19>] vfs_read+0xc9/0x170
Jun 16 13:56:41 PulsarX2 [<ffffffff8029a2e8>] sys_read+0x58/0x90
Jun 16 13:56:41 PulsarX2 [<ffffffff80209d9f>] system_call+0xef/0x15c
Jun 16 13:56:41 PulsarX2 
Jun 16 13:56:41 PulsarX2 INFO: lockdep is turned off.
Jun 16 13:56:41 PulsarX2 ---------------------------
Jun 16 13:56:41 PulsarX2 | preempt count: 00000000 ]
Jun 16 13:56:41 PulsarX2 | 0-level deep critical section nesting:
Jun 16 13:56:41 PulsarX2 ----------------------------------------
Jun 16 13:56:41 PulsarX2 
Jun 16 13:56:41 PulsarX2 CPU0: 0000000000ac1d4a (0000000000ac1d51) ... #4 (0000000000ac1d59) 0000000000ac1d64
Jun 16 13:56:41 PulsarX2 CPU1: 00000000007e4131 (00000000007e413f) ... #2 (00000000007e4131) 00000000007e413f
Jun 16 13:56:41 PulsarX2 CPU1 entries: 2
Jun 16 13:56:41 PulsarX2 first stamp: 0000000000ac1d4a
Jun 16 13:56:41 PulsarX2 last stamp: 0000000000ac1d4a
-----

Any thoughts on what causes this would be appreciated.....

3) I now noticed that a compiler warning is displayed while compiling the kernel for one of the files that was in the bad commit I found. I can't judge if that warning is really relevant for this bug so it might be good that I bring this forward:

----the warning----
drivers/ata/libata-core.c: In function 'ata_dev_configure':
drivers/ata/libata-core.c:1788: warning: comparison of distinct pointer types lacks a cast

----the code that causes this warning (second line) -----
1787	if (ata_device_blacklisted(dev) & ATA_HORKAGE_MAX_SEC_128)
1788		dev->max_sectors = min(ATA_MAX_SECTORS_128, dev->max_sectors);
----



Comment 86 Natalie Protasevich 2007-06-16 11:20:17 UTC
Does anyone find odd that all the interrupts are actually being tied to one cpu? It seems like lots of I/O on the system  but only one processor is serving interrupts?
Comment 87 devsk 2007-06-16 12:12:07 UTC
Why is IRQ_BALANCE not needed on x86_64? Or does the code not work on x86_64?
Comment 88 Tejun Heo 2007-06-18 03:25:43 UTC
Hello, Andrew.  Yeah, I've been watching this bug report.  The only thing related to libata I can think of is NCQ interacting badly with io scheduler but I haven't succeeded in regenerating the reported symptoms or does it explain the x86-64/i386 difference.  I think we're looking at several different problems reported here and it's pretty difficult to track down because the symptom is "being slow".  I'll try again to reproduce the reported problems.

If you're still seeing this problem, please report the kernel boot log (dmesg right after boot) and see whether "echo deadline > /sys/block/sdX/queue/scheduler" changes anything.
Comment 89 Tejun Heo 2007-06-18 03:34:52 UTC
It also could be that the firmware is starving io requests which are far from the current head position.  Please see whether "echo 1 > /sys/block/sdX/device/queue_depth" changes anything.
Comment 90 henkjans_bagger 2007-06-18 09:19:54 UTC
Good to know that this bug still receives attention and having some programming experience I'm sure it is a nightmare to figure out what is going on here. 

1) Regarding comment #88 and #89 
- I've tried both suggestions. Setting different schedulers and reducing the queue-depth does not noticably influence the behavior I experience under AMD64. The problem remains. 

2) As a reminder: the complaint is not just that the system is "being slow". In my case generating heavy IO on a disk makes that it becomes very difficult to start a new task. When I start calculating a checksum for a DVD-image K3B with the 2.6.18 kernel I have to wait until this process has finished until I can do anything else, which is completely unworkable. Something as simple as opening a new terminal-window is impossible even for cases where it should have been catched in memory. 

This problematic behavior has diminished for newer kernels, but it is still clearly there. Now it can take up to 20 seconds to open a new terminal, but it happens in the end qhich is an improvement. (Note that also switching users in an existing terminal can take this long....bizar)

--> with the exact same tests the i386 kernel behaves as expected.
Comment 91 henkjans_bagger 2007-06-18 09:21:14 UTC
Created attachment 11781 [details]
output of dmesg after boot AMD64

This is the output of dmesg under AMD64 with kernel 2.6.21.5 (clean)
Comment 92 Jens Axboe 2007-06-18 09:37:22 UTC
Hmm, I could be wrong, but didn't someone recently discover that x86_64 didn't correctly use good granularity time stamps in the mainline kernel (through some CFS testing). Since this sounds like a scheduler issue and it behaves the way as described on x86 and x86-64 where only the kernel differs, it sounds suspiciously like something in that area.

I'd suggest:

- Try 2.6.22-rc5, see if it reproduces there. IIRC, the bug is fixed there now and it wasn't in -rc4.

Who should this bug get reassigned to? To my knowledge, Robert hasn't done any kernel work in ages.
Comment 93 Jens Axboe 2007-06-18 10:50:16 UTC
http://lkml.org/lkml/2007/5/25/69

This is the patch I'm thinking of, however it's not included in 2.6.22-rc5. So I'd say reproduce with 2.6.22-rc5, then apply that patch and try to reproduce again.
Comment 94 henkjans_bagger 2007-06-18 11:43:22 UTC
@Jens: The 2.6.22-rc5 kernel still seems to have the problem. 
I tried applying the patch indicated in comment #93.
However the kernel refuses to compile with this patch. 

---the output ----
  CC      arch/x86_64/kernel/../../i386/kernel/quirks.o
make[1]: *** No rule to make target `arch/x86_64/kernel/../../i386/kernel/sched-clock.o', needed by `arch/x86_64/kernel/bootflag.o'.  Stop.
make: *** [arch/x86_64/kernel] Error 2
---

So not luck there. Should I try against 2.6.22-rc4-mm2 instead??
Comment 95 devsk 2007-06-18 11:59:55 UTC
Is it possible to backport the fix given by Ingo to 2.6.21? It looks like something that would apply.
Comment 96 devsk 2007-06-18 15:15:10 UTC
I think the patch in #93 can't fix the issues in this bug. A quick glance at 2.6.19 and 2.6.22-rc4, and sched_clock used by x86_64 is same as used by i386 in upto and including 2.6.22-rc4 vanilla. Only thing changing across ARCH is rdtscll but it has same granularity on x86_64 and i386. The bug that was fixed by Ingo was probably introduced recently when the function was separated out in its own C file. Can someone please verify that?
Comment 97 Tejun Heo 2007-06-18 21:24:03 UTC
Bug#8636 sounds pretty similar but not yet sure whether it's the same problem.

  http://bugzilla.kernel.org/show_bug.cgi?id=8636
Comment 98 Jens Axboe 2007-06-19 04:34:37 UTC
WRT comment #96 - it does look like a -mm only patch, so I can only conclude that the issue probably isn't there in mainline.

Testing the CFS patches could be interesting still, since it does smell like a scheduler induced delay.
Comment 99 Brice Figureau 2007-06-19 05:03:05 UTC
Hi,

A quick follow-up:
I'm still running 2.6.17.x because it is the better kernel for the moment for me. I tried 2.6.20.6 a while back, it was better than 2.6.18, but still not that perfect. I couldn't conclude the effect was produce by the kernel, since I also changed the mysql version (with a supposedly better

I'm also using as I reported in my original report arch amd64/x86_64. I never tried i386 since all my userspace is 64 bits.

Like Henkjans said in comment #90, this is not a slowlyness issue, it is more like if the system is stalling for _several_ seconds under high I/O. It is not possible to launch new process, hence my first feelings of a scheduler issue (and that's also why I chooses scheduler as the bug category).

The only thing that helped (when running >= 2.6.18) was using vfs_cache_pressure=1. It helps, but is still far from the "perfectness" of 2.6.17.

Just to remind that I'm running a LSI megaraid RAID card (ie SCSI on the whole device chain).

I will have a maintenance window in about 12 days, so if you want me to change the kernel to something else (sorry no too experimental changes) to try sth different, please let me know.

I didn't try Peter's patch, since it was (at the time) only for mm and seemed rather experimental. (the machine is in production).

Anyway, that's great to see that the bug report is still alive and that developpers are still trying to diagnose/fix the issue.
Thank you for your efforts :-)

Brice

Comment 100 henkjans_bagger 2007-06-23 16:02:25 UTC
Seeing that there seems to be no progress in solving this bug it is probably best that I try to pinpoint the problem more precisely as suggested 
previously in comment #46 (Bug#8636 does nolonger appear related to me)

It would however be helpfull if somebody who actually understands git helps me figure out how to do this to prevent me from wasting a lot of time without getting a more detailed answer. In comment #42 I found the first bad commit to be 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1

This commit includes 258 different commits as it merges branch 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev. If I understand it correctly the culprit should be among those 258 commits. 

So the question is how to run git-bisect only on those 258 commits, so please help. 

===
I think I have to fetch the libata git tree: git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git and then use

git-bisect bad 0dd4b21f517e138ea113db255645fbae1bf5eef3
git-bisect good 236ee8c33277ab48671995f26dc68a4639936418

The problem is that I'm not sure which commit-IDs should be filled in to only bisect over the specified 258 commits! I think it should be the parents given in the header of the found first bad commit in comment #42. Is that correct and if so is it in the correct order??
===
Comment 101 henkjans_bagger 2007-06-24 09:37:59 UTC
The outcome of the new git-bisect attempt is slightly embarrassing....(blush)

---
12fad3f965830d71f6454f02b2af002a64cec4d3 is first bad commit
    [PATCH] ahci: implement NCQ suppport
    Implement NCQ support.
---

So I tried again to enable/disable NCQ as suggested earlier, but this time for the vanilla 2.6.18 kernel as this is the kernel that most clearly shows the problem in my case. And indeed enabling/disabling NCQ has a clear effect for 2.6.18! 

**with echo 1 > /sys/block/sda/device/queue_depth it takes ~7s to open a terminal while simultaneously calculating a checksum. 

** with echo 31 > /sys/block/sda/device/queue_depth it takes >40s to open a terminal under the exact same conditions. 

So conclusion: NCQ plays an important role for the problem I have, BUT those 7s are still much longer then under my 32bit environment. When doing exactly the same test on the 32bit kernel(2.6.21) the terminal opens in 1-2s, so substantially faster then the 7s I get with NCQ disabled!!

I would say that those 7s are still too long....

Comment 102 Brice Figureau 2007-06-24 09:54:52 UTC
(In reply to comment #101)
> The outcome of the new git-bisect attempt is slightly embarrassing....(blush)
> 
> ---
> 12fad3f965830d71f6454f02b2af002a64cec4d3 is first bad commit
>     [PATCH] ahci: implement NCQ suppport
>     Implement NCQ support.
> ---

Just to remind that my server exhibiting the same issue is using SCSI, and as such don't have NCQ, but have TCQ.
I suppose both shares the same code base in the scsi subsystem on which SATA depends.
 
[henkjans comment snipped]
> I would say that those 7s are still too long....

Yes, and this means that the problem is getting worse with TCQ/NCQ enabled, but it is not the root cause.

Are you ready to git-bisect the whole kernel without NCQ enabled, just to see what could be the culprit commit ?

Anyway, thanks for the great help :-)
Comment 103 Andi Kleen 2007-06-24 13:33:21 UTC
Re #102: Brice, NCQ/TCQ is implemented near completely in the disk. But yes
the disk vendors likely use the same code base for it.

#101: queue_length=1 is probably a little too radical; it would be interesting to 
try other values. Also out of curiosity: if you use ionice to improve the
priority of the terminal together with a short queue length does it load faster?

Also it's still a mystery why 64bit behaves differently.
Did you ever have any kernel where that was not the case? 
Comment 104 Tejun Heo 2007-06-24 19:38:28 UTC
(In reply to comment #101)
> So conclusion: NCQ plays an important role for the problem I have, BUT those 7s

Can you give a shot at deadline io scheduler (echo deadline > /sys/block/sdX/queue/scheduler)?  Also, please post the result of 'hdparm -I /dev/sdX".
Comment 105 devsk 2007-06-24 19:42:36 UTC
I use deadline scheduler and it doesn't help with intermittent freezes.
Comment 106 Tejun Heo 2007-06-24 19:58:35 UTC
I'm not sure whether all reports here are caused by the same problem.  Stuttering can be a symptom of many different problems.  Even if you and Henkjans are experiencing the same problem, Henkjans' case might be worse due to faulty NCQ implementation, so I wanna separate that out.
Comment 107 henkjans_bagger 2007-06-25 04:19:55 UTC
(In reply to comment #102)

> Are you ready to git-bisect the whole kernel without NCQ enabled, just to see
> what could be the culprit commit ?

This is indeed what I intend to do, but have first to solve to things

1) With NCQ disabled the effect becomes too subtle for my previous test between good/bad. 
So I have to find a better way to do this....I've been looking in the direction of comment #12
Problem is that I have to find out how to do the drop_caches part for Gentoo. 

2) I'm no longer sure that 2.6.17.14 is the last good kernel. 
The enabling of NCQ for my drive when going to 2.6.18 could have made it seem this way.
For that reason I must test a much older kernel for comparison first.


(In reply to comment #103)

> Also it's still a mystery why 64bit behaves differently.
> Did you ever have any kernel where that was not the case? 
> 
I can't say  what plays a role in the difference between 32/64. 
The only strange thing that I observe lies in the throughput observed when calculating a iso-checksum. 
Under 64-bit the reading spead from HD is 55-60 MB/s which is about max for my setup. 
Under 32-bit the same test shows 30-35 MB/s. (observed with dstat)
 
This clearly plays a big role in how easy it is for the system to respond to an IO-request.
Note that the outcome of hdparm is as it should be for both 32/64 bit

(In reply to comment #104)

> Can you give a shot at deadline io scheduler (echo deadline >
> /sys/block/sdX/queue/scheduler)?  Also, please post the result of 'hdparm -I
> /dev/sdX".

I'll post these tonight when I'm back behind my linux-box, but I have troubles finding a good test (see above)
Note that these observations where most likely with the anticipatory scheduler. I'll test again to be sure. 
Comment 108 henkjans_bagger 2007-06-25 13:22:27 UTC
(In reply to comment #104)
> (In reply to comment #101)
> > So conclusion: NCQ plays an important role for the problem I have, BUT those 7s
> 
> Can you give a shot at deadline io scheduler (echo deadline >
> /sys/block/sdX/queue/scheduler)?  Also, please post the result of 'hdparm -I
> /dev/sdX".
> 
Playing with the schedulers makes life interesting. I calculate the checksum for a DVD-iso and while at 40% into the checksum I try to open a terminal and measure how much seconds it takes. (between each run I drop caches and change scheduler)

Here comes the statistics:

1) The bad 2.6.18 kernel with NCQ enabled 
- cfq:         >43s,>43s,>45s  (after this time the checksum is done)
- anticipatory: 16s, 13s, 14s
- deadline:      5s,  3s,  5s

2) The bad 2.6.18 kernel with NCQ disabled 
- cfq:          15s,  5s,  7s, 8s
- anticipatory: 13s, 13s, 13s
- deadline:      2s,  2s,  4s, 2s

3) The supposedly good 2.6.17.14 kernel with NCQ disabled (by definition)
- cfq:          11s, 12s, 12s
- anticipatory: 14s, 14s, 15s
- deadline:     2s,  2s,  2s

4) For a rough comparison a 32-bit 2.6.21 kernel
- cfq:          2s, 2s
- anticipatory: 3s, 2s
- deadline:     2s, 2s

**) Note that the 32-bit kernel is located on a different HD then the iso-image and the 64-bit kernels, so it might not be a completely fair comparison. 

Conclusion
- the cfq and anticipatory schedular give a pretty bad response under all tests with the 64-bit kernels, while it works fine with a 32-bit kernel. 
- enabling NCQ seems to make the response worse. In particular the cfq scheduler takes a serious hit. 
Comment 109 henkjans_bagger 2007-06-25 13:25:09 UTC
Created attachment 11876 [details]
Output of hdparm -I

This is the output of hdparm -I as requested. Note that this is the drive that contains the 64-bit root-system as well as the iso-image. 

**) In principle this drive should be using AHCI on SATA-II
Comment 110 henkjans_bagger 2007-06-25 13:26:45 UTC
Created attachment 11877 [details]
output of hdparm -I

For completeness also the output of hdparm -I for the drive that contains the 32-bit root-system. 

Note that the bug was already present with only the other SATA drive in the system.
Comment 111 Tejun Heo 2007-06-25 21:17:29 UTC
That's interesting result.  With NCQ, a bit of increased latency is expected because more commands are in flight.  Also, the kernel has no control over in what order the harddrive chooses to execute commands once they're issued to the drive.

Can you put 32bit system on the same disk as 64bit one and repeat the test, preferably on the same region of the disk as the 32bit system (which of course requires reinstallation)?

Another factor to consider is layout of the needed files and the checksumming file on disk - ie. how far they are.  It might be combination of all factors - firmware, disk layout and cfq/as amplifying the effect due to resulting skewed statistics.  cfq allocates time slices to each io contexts and if the unfairness of firmware makes the non-clustered access much more expensive than it should be, cfq can exaggerate the effect by over-charging those IOs accordingly.  Jens, any other ideas?
Comment 112 henkjans_bagger 2007-06-26 14:57:20 UTC
(In reply to comment #111)
> Can you put 32bit system on the same disk as 64bit one and repeat the test,
> preferably on the same region of the disk as the 32bit system (which of course
> requires reinstallation)?

I'm too "attached" to my current 64-bit install to wipe it, so I did the test slightly different. I sequentially mirrored the 32-bit(2.6.21) and the 64-bit(2.6.18) root at the same partition at the end of the disk showing the problems. The test I ran is exactly the same as I ran in comment #108, so numbers can be compared

1) The 64-bit 2.6.18 kernel with NCQ enabled 
- cfq:         >44s,>60s,>55s  (after this time the checksum is done)
- anticipatory: 10s, 22s,  8s
- deadline:      4s,  4s,  5s

2) The 64-bit 2.6.18 kernel with NCQ disabled 
- cfq:           6s,  6s,  5s
- anticipatory: 10s,  9s, 10s
- deadline:      2s,  3s,  2s

3) The 32-bit 2.6.21 kernel with NCQ enabled 
- cfq:          11s, 17s, 15s  
- anticipatory: 10s, 20s,  7s, 9s
- deadline:      4s,  5s,  5s

4) The 32-bit 2.6.21 kernel with NCQ disabled 
- cfq:           4s,  4s,  4s
- anticipatory:  9s, 10s, 16s
- deadline:      2s,  2s,  3s

The pattern in these times seems consistent with the obvious outcome that enabling NCQ in combination with the cfq-schedular results only in very long delays for the 64-bit kernel, not with the 32-bit kernel. 

> Another factor to consider is layout of the needed files and the checksumming
> file on disk - ie. how far they are. 

To check this aspect I also tested the effect of copying the iso-file first to the original root-disk of the 64-bit system. This way the file that is read for calculating the checksum should be much closer then previously, however this clearly has little effect on the observed delays, see:

5) The 64-bit 2.6.18 kernel with NCQ enabled ( iso-file in / )
- cfq:          47s, 54s, 46s  (after this time the checksum is done)
- anticipatory: 13s, 17s, 16s
- deadline:      9s,  5s,  4s

6) The 64-bit 2.6.18 kernel with NCQ disabled ( iso-file in / )
- cfq:           5s,  4s,  4s
- anticipatory: 13s, 13s, 13s
- deadline:      2s,  3s,  2s

So both moving the root-file system to a completely different area of the disk as well as moving the iso into the 64-bit root itself has no noticable effect. So I would say that the actual layout of the disk seems to have little effect. 

For completeness: The partition-scheme
  75MB  /boot 
  1GB   /swap
 30GB   /  (original 64-bit root)
136GB   /home (contains the iso-file somewhere)
 15GB   /  (the 32-bit and 64-bit root in this test)
130GB   /spare


Comment 113 Tejun Heo 2007-06-26 17:34:30 UTC
Thanks for the testing.  It really looks like a cfq bug which somehow gets triggered on only 64bits.  /me does Jens summoning dance.
Comment 114 henkjans_bagger 2007-06-27 01:20:39 UTC
I think you'll have to keep dancing for the coming two weeks.
In bug #8363 Jens mentioned that he would be away until the 9th of July.

Are you sure that the problem is confined to the combination cfq/NCQ only?
Overall I found the performance of the anticipatory scheduler astoundingly bad. It consistently shows longer delays (in the 10-20s range) then I would find acceptable on a multi-user system. 
Comment 115 Tejun Heo 2007-06-27 01:46:45 UTC
cfq and as are similar in many ways.  cfq can be said fairer extension of as.  Both assume a lot about disk behavior and usage pattern and heavily optimized toward typical desktop usage.  as/cfq behave best if queue depth is 1.  They work with higher queue depth but some assumptions there might not hold.  The performance is usually pretty good due to heavy optimization but the very optimization might result in peculiarities like you're experiencing.

So, Jens isn't available. Nick should know much better what's going on with as.  I'll ask him to take a look.
Comment 116 Brice Figureau 2007-06-27 02:02:18 UTC
(In reply to comment #113)
> Thanks for the testing.  It really looks like a cfq bug which somehow gets
> triggered on only 64bits.  /me does Jens summoning dance.

Unfortunately for me, my server uses deadline only, which makes me think
Henkjans is seeing finally a completely different problem than mine :-(

I'm compiling a 2.6.21.5 with Peter Zijlstra 2.6.21-per_bdi_dirty_pages.patch
patch. I'll report my findings here as soon as I find a window to reboot the
server.
Comment 117 Tejun Heo 2007-06-27 02:06:59 UTC
(In reply to comment #116)

Yeah, that's my feeling too.  We might even have yet different problems mixed in here.  We'll just have to crack one by one.
Comment 118 Brice Figureau 2007-07-05 06:22:52 UTC
Hi,

(In reply to comment #76)
> New patches at:
>   http://lkml.org/lkml/2007/5/10/161
> 
> These do include something to make inter task starvation better.
> The previous set was only geared at inter device starvation issues.

I'm now running since a few hours a 2.6.21.5 with Peter's patch.

The machine is right now lightly stressed (and mysql cache are still cold, so memory consumption is not at its peak), so I'll perform more test in the next week to verify what I've found so far (which is good):

1) vfs_cache_pressure at 100, 2.6.21.5+per bdi throttling patch 
Result is good, not as snappier as I'd want during a large copy but still usable. No process seems stuck for agen, but there seems to be some short (second or subsecond) moment where everything is stuck (like if you run a top d 0.5, the screen is not updated on a regular basis).

2) vfs_cache_pressure at 1, 2.6.21.5+per bdi throttling patch
Result is at 2.6.17 level. It is the better combination since 2.6.17.

I've taken a vmstat snapshot during test 1), and it shows a "behavior" (don't know if it was planned or not):

At first, there are lots of block activity (high bi and bo):
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
...
[copy started here]
 2  1    104 343248   7200 156784    0    0 61796 20956 2217 2476 24 15 28 33
 0  1    104 260908   7336 236964    0    0 40044 40596 1865 1668 17 13 14 55
 1  4    104 101108   7488 393692    0    0 78540 43944 2199 2079  1 20 20 59
 0  4    104  27872   6532 465136    0    0 50388 38516 1933 1997 13 16 20 50
...
[then a few seconds after, bi decrease and bo still high (does the system
regulates itself to prevent the write starvation??), high iowait]

 1  7    104  33164   1736 462888    0    0 58060 45732 2003 2440  1 20  0 79
 1  7    104  38504   1812 457064    0    0  7384 43240 1293 1336  2  4  0 94
 1  8    104  36804   1988 458664    0    0  1824 45824 1299 1535  3  3  0 94
 1  7    104  36484   2096 458648    0    0    64 44120 1188 1389  2  2  0 96
 3  7    104  36184   2164 459320    0    0   304 45988 1239 1743 35  4  2 59
 1  5    104  27524   2240 470920    0    0 36796 21124 3054 3197 29 15 21 35

[then the inverse phenomenon]

 3  5    104  37824   2284 461404    0    0 38188 32964 2071 1931  2 13 27 58
 1  2    104  58628   2648 438620    0    0 68652  4016 2348 3355 38 23 10 28
 3  1    104  28372   3508 466616    0    0 36188  3788 2033 3539 36 12 15 38
 1  2    104  76732   3752 418304    0    0 44232 42072 2263 2911  6 21 36 36
 1  3    104  58732   3784 435816    0    0 59844 33684 2067 3155  2 20 35 44
...
[the cycle reproduces every 10s or so]

Also I don't know if that matters, but it seems that all interrupts are sent to processor 1:
db1:~# cat /proc/interrupts 
           CPU0       CPU1       
  0:         95    3946126   IO-APIC-edge      timer
  1:          0          2   IO-APIC-edge      i8042
  4:          0       6321   IO-APIC-edge      serial
  6:          0          3   IO-APIC-edge      floppy
  8:          0          1   IO-APIC-edge      rtc
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          0          4   IO-APIC-edge      i8042
 14:          0         85   IO-APIC-edge      ide0
 16:          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 18:          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:          0         21   IO-APIC-fasteoi   ehci_hcd:usb1
 46:          0    1619638   IO-APIC-fasteoi   megaraid
 64:          0      33237   IO-APIC-fasteoi   eth0
 65:          0    2209532   IO-APIC-fasteoi   eth1
NMI:          0          0 
LOC:    3946284    3944609 
ERR:          0

Stay tuned for more test, but per-bdi seems at least for me definitely the way to go.

Thanks,
Brice
Comment 119 Andi Kleen 2007-07-12 05:07:20 UTC
Can you combine the test with shorter queue lengths? 

echo N > /sys/block/DEVICE/queue/nr_requests

e.g N=32 / N=16 / N=5 / N=1

I found that e.g. the AHCI interrupt handler can run quite long with large
queue lengths.
Comment 120 Brice Figureau 2007-07-16 05:14:29 UTC
(In reply to comment #119)
> Can you combine the test with shorter queue lengths? 

Here are some results on my hardware (see original bug description).
Running 2.6.21.5 + per bdi throttling patch in all tests.

Baseline load on the machine is around 0.3/0.4
Baseline iowait on the machine is < 2
Baseline is nr_requests=128 (default) and TCQ depth at 64 (default)
Deadline io scheduler

This is my original test case:
 * lightly loaded MySQL instance doing lots of write and fsync (I guess), few reads (everything is in RAM).MySQL process takes about 93% of physical RAM (4G) on x86_64.
 * At test start, I drop the caches, create a LVM snapshot, and copy from this snapshot to another block device (another RAID1). (LVM snapshot was deemed to have no influence, since directly copying the files yield the same results)
 * While doing that I run mtop and vmstat -n 1, and also a 'time ls -al /usr/lib'
 * Pauses are seen in mtop, which doesn't refresh anymore and SQL requests are piling, which means something is starving mysql;  at this stage, the ls can take more than 20s to display its results.

1) vfs_cache_pressure=1
TCQ  nr_requests
64   128     between good and bad, load at 6, short sequences of 1s or less pauses every 3 or 4 seconds.
64   64      bad (load 4.5) iowait between 50 and 75
64   32      bad
64   16      bad
64    1      between good and bad (load 2.5), some real short pauses

2) vfs_cache_pressure=100 (kernel default)
TCQ  nr_requests
64   128     really bad, long (more than 1 or 2s) "pauses" every few seconds or so
64    32     really bad, lots of long pauses
64    16     really bad, lots of long pauses
64     1     between good and bad (load 2.6~3), some short pauses

3) vfs_cache_pressure=1
TCQ   nr_requests
8     128    not that bad   
1     128    snappiest configuration, almost no pauses (or unnoticable ones)

While doing those tests, the new counters /sys/block/DEVICE/queue/writeback_kb and reclaimable_kb for both devices where always at 0... I don't know if that's the intended behavior or not.

Conclusion:

I still don't get the perfectness of vanilla 2.6.17.x with baseline settings, but it seems to go in a good direction.
vfs_cache_pressure=1 is still better thant vfs_cache_pressure=100 and TCQ seems to play a role. Nr_requests doesn't seems to have an impact (except when disabled).

Let me know if someone wants me to perform some tests...
Brice
Comment 121 Anthony Awtrey 2007-07-16 21:11:48 UTC
Thanks for the performance info Brice!

We recently upgraded our office to gigabit Ethernet and got some big AMD64 / 3ware boxes for file and vmware servers... only to find them almost useless under any kind of real load. I've built some patched 2.6.21.6 kernels (using the bdi throttling patch you mentioned) to see if our various Debian Etch boxes run better. So far my testing shows a *great* improvement over the stock Debian 2.6.18 kernel on our configurations.

The patch and build process was less than pleasant, since the rolled up 2.6.21 patch I found needed some nommu superblock/backing device patches to get it compiled and working against the current Debian linux-2.6-2.6.21 kernel source. I also had to patch some of the build scripts since the package was written for unstable and I need the kernel to run on Etch. I thought I'd share my combined iowait-hell.patch and the binary Debian kernel packages to assist anyone else who may also be fscked by this issue or interested in testing for themselves.

http://www.awtrey.com/files/iowait-hell/

At this point I'm still screwed because VMware depends on some older interrupt / vector code changed in 2.6.20 (I think). With 2.6.21 I get great disk performance, but complete system lockups at random intervals when running VMware. I'd go back to pre-2.6.18, but some of the hardware isn't supported in prior kernel versions. This is such a huge issue for us that I'm willing to pay thousands of dollars to get a backport of this fix to 2.6.18 so we can actually use our vmware servers. (BTW, If anyone wants the job please shoot me an email to discuss.) I can't believe an issue this bad has gone unresolved for this long...

Tony
Comment 122 Anthony Awtrey 2007-07-17 05:13:17 UTC
I spoke too soon, at least for some architecture / drive controller combos. The new kernel ran much better on my build box (P4/IDE drives) than the stock 2.6.18, but when I tried the kernel on one of my AMD64 / 3ware boxes I still see the same iowait / pdflush unresponsiveness I was seeing on the 2.6.18. Damn!

Could Dwight Spencer (see comment #77 and comment #84) and I be seeing other complications of this issue from running on AMD/3ware?

Tony
Comment 123 Brice Figureau 2007-07-17 05:42:33 UTC
(In reply to comment #122)
> I spoke too soon, at least for some architecture / drive controller combos. The
> new kernel ran much better on my build box (P4/IDE drives) than the stock
> 2.6.18, but when I tried the kernel on one of my AMD64 / 3ware boxes I still
> see the same iowait / pdflush unresponsiveness I was seeing on the 2.6.18.
> Damn!
> 
> Could Dwight Spencer (see comment #77 and comment #84) and I be seeing other
> complications of this issue from running on AMD/3ware?

IMHO, the bug that affects us (if it's the same) has its roots in x86_64+SCSI since on any i386 (with or without SCSI) I never experienced it (ok that wasn't in the same test conditions and under the same load).

What would be interesting if you could find an older (vanilla) kernel _not_ affected (like 2.6.17 was for me) is to run a git-bisect session. At least it would show if you are seeing the same issue as Henkjans.
Comment 124 Dwight Spencer 2007-07-17 05:50:56 UTC
Morning guys 

Hey tony ... 

I gave up on the 3ware 9500s controller and went directly to booting from 1 drive, and using software raid on the other drives.  Performance has greatly improved.  While I was getting 22MB/s read/write speed on the 3ware 9500s (pci-x card in a pci slot, so that might be my issue), going software raid (md5, raid5, kernel 2.6.22-1) has pushed my read speeds up to 100MB/s and writes vary from 50-70MB/s, an amazing jump.

I still get some iowaits while doing large file transfers or disk io (such as installing winxp in a vm), but the system remains highly responsive and system load only goes up to negligibly (always less than a load of 2-3), not creaping up to 12+ load when using the 3ware.

I typically use the utility "dstat" to keep an eye on cpu performance (load/iowait, etc) and disk/network io.  

(( That said, I believe I'm not having problems with md software raid5 causing data corruption, but that's an issue for for another thread i suspect.  Trying sw raid 4 (3drives) and 1(2 drives) this morning. ))

I'm considering ordering a lsi/megaraid sata 300 controller to do further testing on - I'd prefer hardware raid, but i'm not going to put up with xfer rates of 22MB/s.  Hopefully the LSI doesn't have the issue with iowaits.  I'm considering getting a pci-e based controller/board/cpu/memory combo, to give that a try, if pci-x in pci slots is the issue. 

Which controller/board/slot combination are you using, Tony?

dwight.

notes
- currently running fedoracore 7, compiled 2.6.22-1 kernel
- iowait issue w/ 3ware controller existed on both fc6, fc7, rhel4 update4, and centos.   I didn't try debian/gentoo or any bsd's.
- 2gb memory, motherboard is asus a7n8x, cpu is athlon xp 2800+
Comment 125 Jens Axboe 2007-07-17 07:31:14 UTC
Created attachment 12066 [details]
2.6.18 bounce fix

Shooting in the dark... Does reversing this attached patch change the picture on x86-64 at all?
Comment 126 Jens Axboe 2007-07-17 07:32:21 UTC
Checking /proc/meminfo for Bounce should probably be enough.
Comment 127 Brice Figureau 2007-07-17 07:54:26 UTC
(In reply to comment #126)
> Checking /proc/meminfo for Bounce should probably be enough.

Not sure what you exactly mean in comment #125, but I have:
[brice@db1 ~]$ grep Bounce /proc/meminfo 
Bounce:              0 kB

Don't know if that matches what you were thinking about...
Thanks,
Brice

Comment 128 Jens Axboe 2007-07-17 07:57:29 UTC
Reassuring, I would be surprised if that had broken and we hadn't noticed until now. But there's not many great ideas popping up in here, so I figured it was worth a shot.
Comment 129 Andi Kleen 2007-07-18 06:00:48 UTC
I suspect multiple bugs are getting mixed here again. At least Anthony's
problem sounds like something different. Besides he seems to run a heavily
changed Frankenkernel so he's INVALID anyways.
Comment 130 Anthony Awtrey 2007-07-18 07:49:10 UTC
Sorry I didn't get back yesterday, but I took Dwight's advice and ditched the 3ware card. (Thanks Dwight! I *was* using a 3ware 9550SXU in a pci-x slot and seeing about the same performance you were seeing: 25MB/s reads)

I reinstalled Debian and my server config on a seperate box (4 sata software RAID+LVM) and swapped out the drives last night. I ran some tests with the stock Debian 2.6.18 kernel and was getting better performance than with the 3ware, but I did see the characteristic pauses toward the middle of a bonnie++ run. After getting my Debian 2.6.21.6 + patch kernel loaded, the performance was greatly improved and even while running bonnie on one volume, I was able bounce around with any really noticable lag.

Now, as far as this bug being AMD64 only. We develop a portable data analysis tool and we run it on Intel Core Mobile systems (Sony UX series, Panasonic Toughbook series) and see this bug or one almost exactly like it on those platforms as well. Once we have processed data for about 20 minutes, the iowait starts to go up, the system starts to hesitate and stutter. It's not swapping to disk. Our app takes a couple of hundred megs of ram and the rest of the gigabyte or so of ram is available for the kernel to use. Once our run is complete and all our programs have exited, the kjournald is 100% iowait for about 2-4 minutes writing something.

Ironically, this problem is *much* worse on the UX-380 which has a solid state disk, incredible write perfomance and nanosecond seek times. Considering that I could dd the entire contents of our ram to the SSD disk in a couple of seconds, I can't even imagine what kjournald is doing for that long.

I am not suggesting anything here, I'm just saying that when I saw the iowait issue on our new AMD64 boxes, the problem behaved exactly like the one we've seen in testing our product and can reproduce fairly reliably on both Intel Core and Core2 CPUs.

Thanks for the help identifying the 3ware issue and for the patches that smoothed out the performance on our servers.
Comment 131 simon+kernelbugzilla 2007-10-02 07:11:05 UTC
My 2cents:

Symptoms described match those of my system - slow/nonexistent reads when heavy write IO in progress leading to collapse in system response. Twin Opteron 2.4GHz, 4GB RAM, 3ware 9550SX with RAID 1, ext3. OSes tried so far: CentOS 4.5 (2.6.9-55.ELsmp), openSUSE 10.2 (2.6.18.2-34-bigsmp), CentOS 5 (2.6.18-8.el5).

Latest drivers (compiled from 9.4.1.3 3Ware source) and firmware for the card in use, have tried various changes to nr_requests, elevator, etc and have been able to mitigate the effect slightly but no magic bullet as yet.

Kernel debugging isn't my forte, so this is mainly just a 'me too' posting - sorry about that.
Comment 132 justbobet 2007-10-02 10:07:52 UTC
Me too! 9650SE - CentOS 2.6.18-8.1.10.el5xen. It's Painful. Using deadline scheduler helped a bit in my case.
Comment 133 Ingo Molnar 2007-11-17 19:13:07 UTC
the per bdi patches are now upstream, so could you check 2.6.24-rc3, does it improve performance?
Comment 134 Trevor Cordes 2007-11-25 04:34:32 UTC
Not sure if this is relevant.  I see similar behavior (high wa, glitchy interactive) on my franken-fileserver.  I have root on sw md raid1.  I have a 2TB raid6 array for data.  10 disks total, onboard ICH + 2 PATA PCI controllers + 1 4 port SATA controller.  2 disks are larger than the rest and are used in both the r6 and r1.  My PATA disks show up as hdX on this system.  The problems existed when I was all PATA, and I switched 4 drives to SATA to try to alleviate the problem, but it never helped.

dd if=/dev/zero of=/raid1fs/z bs=1M count=512
to raid6 and wa stays <10%, and idle >0%.  No glitches or extremely tiny.

dd to raid1 and wa goes to near 100% for 10 seconds.  Idle goes to 0.  Everything gets glitchy until the writes complete and the wa goes back down.  My main notification of the symptom is my MP3s NFS and MythTV served off that system start glitching badly.  Those are just reads.  So my big dd write is starving small reads.

dd to a non-raid spare partition on a disk and it behaves same as raid1.  So maybe best to speak about non-raid, single-disk.  But it's easier for me to test raid1 as I don't usually have a non-raid partition.  Not sure why raid6 seems mostly immune when writing to it!  Not the main issue here though, so let's ignore it.

2.6.17-1.2139_FC5smp, final stock fc5 kernel.  I will upgrade this box to f8 2.6.23 in the next few weeks.  I know the kernel is ancient, but I rarely upgrade this box as almost every time I do I destroy my 2TB array due to some obscure R6 bug, and it's really painful.

I read this entire bug and have tried a lot of the suggestions.

echo 1 > /proc/sys/vm/vfs_cache_pressure
no help

echo deadline > /sys/block/*/queue/scheduler (all disks)
no help or perhaps very very tiny help

echo 16 > /sys/block/*/queue/nr_requests (all disks)
Glitches seem to get a fair bit better (with scheduler=deadline still).  It was set to 128, which appears to high and causes read starvation.  With 128 I get long glitches every test run, even on an no-load system.  Even dstat itself (over ssh) starts to jitter!  With 32 I sometimes get short glitches.  With 16 I get no glitches unless the server is being loaded by other NFS calls.

No matter what nr_requests is, dstat shows the write going 56-60MB/s and wai takes up a constant 93-100% for the duration of the write.  Note, I cannot easily tell if this is a disk starvation issue or a ps scheduler / net issue as my "glitch" symptom is MP3 streaming over NFS with xmms.  So the problem could be a starved NIC (Intel PRO/1000 MT Server PCI card) or NFS process.  However, it is clear that the large disk writes cause the starvation, whatever it is.

Regardless of the cause, the kernel should not starve anything during large writes.  They should be scaled back enough so that other tiny writes/reads can break through.

I have reproduced the problem on non-franken servers with just 1 disk and no raid, or systems with just 2 disks md raid1.  I will admit having 10 disks on cheap-ish controllers may not be helping the situation.

When I try F8 2.6.23 I will report back.  When F8 goes 2.6.24, I will also report back.
Comment 135 Ingo Molnar 2007-11-25 09:33:40 UTC
* bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:

> dd to a non-raid spare partition on a disk and it behaves same as 
> raid1.  So maybe best to speak about non-raid, single-disk.  But it's 
> easier for me to test raid1 as I don't usually have a non-raid 
> partition.  Not sure why raid6 seems mostly immune when writing to it!  
> Not the main issue here though, so let's ignore it.

before you upgrade that box, could you try to run a few tests with files 
in pure ramfs? I.e. mp3s copied to /dev/shm/ and served from there - 
without any connection to the raid arrays.

I.e. are these delays purely IO interaction related - or is there any 
CPU preemption or other CPU scheduling problem involved as well.

Comment 136 Trevor Cordes 2007-11-26 06:52:40 UTC
I just tried exporting a /dev/shm test dir (bind mount, nfsv4) and shm mounts appear to really mess up nfsv4.  I can't get it to mount for the life of me.  If I setup a dir in /tmp and bind/export/mount then it works fine.  Move to shm and the client mount hangs and it really messed up my other v4 exports!  I had to restart nfsd a few times to fix it.

Any other ideas to test your theory that's more compatible with nfsv4?
Comment 137 Trevor Cordes 2007-11-29 07:14:20 UTC
While the system was heavily NFS loaded, 5GB read/write, I ran the following script:

#!/usr/bin/perl
open(IN,'/data/Tmp/bigfile') or die;
while (!eof(IN)) {
  $bef=time;
  sysread IN, $buf, 4096;
  $aft=time;
  print "elapsed time to read: ".($aft-$bef)." sec\n";
  sleep(3);
}

/data lives on my big RAID6 array (with all my MP3s).

Anyhow, I had a few instances where time was non-zero:

elapsed time to read: 1 sec
elapsed time to read: 0 sec
elapsed time to read: 0 sec
elapsed time to read: 1 sec
elapsed time to read: 0 sec
elapsed time to read: 2 sec
elapsed time to read: 0 sec
elapsed time to read: 0 sec
elapsed time to read: 1 sec

I simultaneously ran the same script, but on a different file, over NFS and strangely, all times were zero.  Hmm.  That's weird.

While running the test I had an ls -l deep in the RAID6 tree that took about 30 secs to complete!  And the directory only had 5 files in it!

It seems whenever the wa% goes to near 100 and idle=0% that things start getting ugly.  The user/sys% never get very high.  It's always wa%.

If anyone can suggest other tests, I'll run them.
Comment 138 Brice Figureau 2007-12-03 03:08:49 UTC
(In reply to comment #135)
> * bugme-daemon@bugzilla.kernel.org <bugme-daemon@bugzilla.kernel.org> wrote:
> 
> > dd to a non-raid spare partition on a disk and it behaves same as 
> > raid1.  So maybe best to speak about non-raid, single-disk.  But it's 
> > easier for me to test raid1 as I don't usually have a non-raid 
> > partition.  Not sure why raid6 seems mostly immune when writing to it!  
> > Not the main issue here though, so let's ignore it.
> 
> before you upgrade that box, could you try to run a few tests with files 
> in pure ramfs? I.e. mp3s copied to /dev/shm/ and served from there - 
> without any connection to the raid arrays.
> 
> I.e. are these delays purely IO interaction related - or is there any 
> CPU preemption or other CPU scheduling problem involved as well.

I'm still running a 2.6.21 + per BDI patches (same config as in my comment #116) and I tried to play with the /dev/shm idea.

Here is the test:

 1) I dropped all caches, here is the 'free' information before running the test:
             total       used       free     shared    buffers     cached
Mem:       4052420    3625700     426720          0        508      17708
-/+ buffers/cache:    3607484     444936
Swap:     11863960      11364   11852596

Notice that there are plenty of free/unused ram for the tmpfs.

 2) I created a 150MB tmpfs containing 500 zero byte files.

 3) I'm copying a multi-gigabyte couple of files from /dev/sdb /dev/sda (remainder, LVM2 on top of high-end LSI hardware RAID card with multiple spindles and 256MB of battery backed cache).

 4) while the copying progresses, I time a ls -al from the tmpfs mountpoint.

And now the results:
 - on baseline, ls -al /mnt is taking:

cold cache:
real    0m1.387s
user    0m0.004s
sys     0m0.004s

hot cache:
real    0m0.024s
user    0m0.008s
sys     0m0.008s

 - while the copy is progressing, (hot cache everytime), the time to perform the ls is increasing, until the copy is finished:

After a few second:
real    0m4.654s
user    0m0.008s
sys     0m0.004s

After about 20 seconds (at about 75% of the copy)
real    0m15.064s
user    0m0.004s
sys     0m0.000s

Near the end of the copy:
real    0m7.998s
user    0m0.004s
sys     0m0.008s

So to me it looks like the disks or lvm are not at cause.

I might try 2.6.24 when it will be out, to see if that's better or not.

Let me know if someone need more testing/information.




Comment 139 Trevor Cordes 2007-12-03 13:39:19 UTC
Comment #138 jives with what I see.

I've done further poking around and noticed that while these large reads/writes are occurring, tcp overruns like crazy.

netstat -s
    12178 packets pruned from receive queue because of socket buffer overrun
    162338 packets collapsed in receive queue due to low socket buffer

After google searches, I did the following tweaks:
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.rmem_default=262143
net.core.wmem_default=262143
net.ipv4.tcp_rmem="8192 1048576 16777216"
net.ipv4.tcp_wmem="4096 1048576 16777216"
vm.min_free_kbytes=65536
net.ipv4.tcp_max_syn_backlog=4096

That really helped drastically lower the overruns.  In fact, each time I jack up the rmem/wmem, the situation gets better and my NFS drops fewer packets and xmms skips less.  The box has 2GB RAM, so I'm inclined to just keep jacking up these values to see if I can buffer my problem away.

My point is not the networking details, my point is this plus comment #138 strongly hint that the disk I/O is causing actual CPU starvation, not just disk starvation.  This makes sense if wa% is hogging all the cycles.
Comment 140 Kevin Burton 2007-12-28 15:29:01 UTC
This is really interesting.

We're seeing very similar issues on http://spinn3r.com in our web crawler.

Basically, we're a very high write throughput application.

We might be seeing two issues here.  One is that our application runs out of 8G of memory (in a sharded database) but the swappiness setting is constantly forcing pages out of memory and onto disk.

We've also seen this locking issue on DDs as well.

It sounds like there might be too issues in this bug though.
Comment 141 Matthew Stapleton 2008-02-02 03:11:28 UTC
While setting up a new 64-bit system to replace my old 32-bit server I am also seeing this issue.  I've found that mounting the filesystems with noatime helps with starting apps once they're in cache but high write load seems still slows down reads.  Running "dd if=/dev/zero of=testfile.img bs=8k count=20M" slows down "find /", and if I use cfq with NCQ disabled "dd if=/dev/md5 of=/dev/zero bs=8k" can get reasonable speeds but enabling NCQ with depth 31 or changing to deadline scheduler seems to make that slow as well.  Adding oflag=direct to the dd write operation slows down the write speed so reading speeds slightly increase in speed but still not as fast as when dd isn't running.

I'm running Gentoo 64-bit with kernel: 2.6.24-gentoo on a Nvidia MCP67 motherboard, AMD Athlon64 X2 4000+, 2GB RAM and two 400GB SATA drives with ext3 filesystems running on top of LVM2 on top of software RAID 1.
Comment 142 Matthew Stapleton 2008-02-03 17:46:02 UTC
I've mounted the filesystems with SystemRescueCd-0.4.4-beta5 32-bit which is based on 2.6.24 kernel and got similar results to when I am running 64-bit.
Comment 143 Stephane 2008-03-01 00:24:43 UTC
It seems there is a "conflict" between the I/O scheduler and filesystem barriers (which have been made a default mount option since 2.6.17/2.6.18). Disbling barriers at mount time by passing option nobarrier for XFS, something like barrier=0 for ext3 (and another one for reiser) seems to really help (factor 5 to 10 in write speed and no more lags on high I/O load)

HTH
Comment 144 devsk 2008-03-01 00:53:22 UTC
(In reply to comment #143)
> It seems there is a "conflict" between the I/O scheduler and filesystem
> barriers (which have been made a default mount option since 2.6.17/2.6.18).
> Disbling barriers at mount time by passing option nobarrier for XFS, something
> like barrier=0 for ext3 (and another one for reiser) seems to really help
> (factor 5 to 10 in write speed and no more lags on high I/O load)
> 
> HTH
> 

Do we have some basis for this or is this pure speculation?
Comment 145 Stephane 2008-03-01 05:35:57 UTC
(In reply to comment #144)
> 
> Do we have some basis for this or is this pure speculation?
> 

I don't know where the "conflict" resides, but all I'm sure is that's it's the only thing (disabling barriers when the underlying drivers and hardware supports it) that has really helped on all my servers in the last 1,5 year...

As an example a copy of the kernel source took more than 7 minutes before, and when disabling filesystem barriers this takes only 1'30 ....

there are also reports suggesting that it might be a step in the right direction here : http://forums.gentoo.org/viewtopic-t-482731-postdays-0-postorder-asc-start-550.html

Comment 146 devsk 2008-03-27 19:07:36 UTC
We have got one more data point which is very interesting. It seems like if the data directory, where the IO is being done by an application, is loop mounted using a file, the symptoms disappear. Does anybody know what does a loop device have (or lack) that makes 64-bit code work at the same speed as 32-bit?

Following was posted on gentoo forums:

In the beginning of this year I build myself a new PC with a Intel E6750 processor, Gigabyte P35DS3R motherboard, Samsung HD501LJ 500Gb disk (370Gb with EXT3 for Linux), Nvidia 8600GT graphics card and 4Gb of memory. I decided to install a 64bit Gentoo Linux, currently with kernel 2.6.24-r3. On first impression the system seems to be quite quick. The software compiled very fast. The first evidence, that something was not quite right, were during playbacks of mpeg2 files with VLC, where there were frequent sound drops. I didn’t experience these drops with my old AMD XP 2800+.

It really became very annoying during my Audacity sessions. The loading of a 2 hours MP3 started pretty fast, I guess until the buffer cache was filled, then the IO activity basically grind to a halt. Internet radio streams continued playing, but even switching desktops didn’t produce any reaction until the MP3 was nearly loaded into Audacity.

Subsequently I did some experiments in various scenarios (probably not really scientific, but anyway), each time loading a 228Mb MP3 file into Audacity. After completion, Audacity would have filled it's data directory with about 3.3Gb audio data. During each run I collected some log files produced by "vmstat 5". From this data I produced some diagrams with OpenOffice.

I don't want to duplicate everything here. Please surf over to http://ridderbusch.name/wp/2008/03/20/back-to-32bit-linux/ to see the details, diagrams and also the raw data from "vmstat".

With standard Gentoo 64bit 2.6.24-r3 the load of the MP3 file would take over 8 minutes. With the "vmstat" output you could see, that for an extended period the system was about 90% of the time waiting for I/O and otherwise not doing much. With "queue_depth" set to 31 and CFQ scheduler, the result was slightly better, but not by much, just under 7 minutes (6m50s).

Then I created a 6Gb image file and in turn made XFS and EXT3 file systems in this image, mounted via loopback on to the Audacity data directory and repeated the MP3 loading. This worked really fast. With XFS fs it took only 1m22s and with EXT3 1m46.

As a last test I rebooted with Knoppix 5.3 (2.6.24 32bit kernel), reconfigured Audacity to load the data into /tmp of my Samsung disk and found, that this run completed after only 1m48s. 
Comment 147 devsk 2008-04-18 18:13:43 UTC
A user profiled the good and bad runs. Can someone please help decode what the profiler's result mean to the kernel? The post is at:

http://forums.gentoo.org/viewtopic-p-5028884.html?sid=7cd6ae6b21ce6aa59b5021e89dfab313#5028884
Comment 148 Chris Seufert 2008-04-23 06:53:48 UTC
I've experienced this problem for quite some time, generally with both streaming writes (from ftp server, etc) (or even just the writeback part), and with random seeks, all while watching video/listening to audio.

Strange thing, with kde, open a file off ftp and open with mplayer, it will copy it to disk, i have 8gb of ram, and opening a 350mb file, you would expect it would stay cached in ram, however after its finished transferring, mplayer/kaffine/noatun opens, often i get 2~10 second pauses/stuttering while the write-back is being performed, as i understand it, you should never need to block reading from cache during writeback. This leads me to believe that its not strictly block device related. If you try and copy any decent sized file (100mb+) in the background of playing video, the video becomes unwatchable.

i have tried 2.6.18, 22, and 24, on AMD64. I still having this problem, with 32 bit kernels, but no where near as bad. After reading some of the suggestions here i have tried tweaking /proc/sys/vm, to no effect, an effective change thou, is noatime, helps marignally, however barriers=0 makes a HUGE difference, write speed is up from 20~30mb/s to 65mb/s, to single sata drive, of ATI SB800 southbridge (in BIOS set to IDE, AHCI seems to do the same thing)

This bug also affects my server (for the past few years), with a Areca SATA Raid controller card (using ARCMSR), i get appauling read and write speeds with XFS, on an 8 disk RAID5 array, have tried all sorts of things, and now just live with the poor performance. Unfortunately the Areca drivers arent in 2.6.17 kernel.
Comment 149 Maxim Doucet 2008-04-28 07:15:41 UTC
(In reply to comment #76)
> New patches at:
>   http://lkml.org/lkml/2007/5/10/161
> 
> These do include something to make inter task starvation better.
> The previous set was only geared at inter device starvation issues.

Hello,

just to point out that version 10 of the "per device dirty throttling" patch was released : http://lwn.net/Articles/249450/

This version allows to build against .23-rc4-mm1
Comment 150 Maxim Doucet 2008-05-26 03:50:10 UTC
(In reply to comment #133)
> the per bdi patches are now upstream, so could you check 2.6.24-rc3, does it
> improve performance?
> 

Hello Ingo, I've just seen your comment about the upstream integration of the per bdi patches. Are you talking about version 10 (http://lwn.net/Articles/249450/) as I commented in comment #149?
Comment 151 Peter Zijlstra 2008-05-26 04:11:20 UTC
Most of the per-bdi patches were merged in .24-rc1 so every kernel after that should have this.

.26 will bring some new features but nothing earth shattering; capability of setting an upper and lower limit on the ratios as well as a sysfs interface in /sys/class/bdi 
Comment 152 simon+kernelbugzilla 2008-06-20 10:32:04 UTC
I'm wondering if this is in any way related:

https://bugzilla.redhat.com/show_bug.cgi?id=444759
Comment 153 Matt Whitlock 2008-09-17 21:20:12 UTC
I believe I'm running into this problem as well.  It happens when I'm making an ISO image from a DVD using K3b.  All goes along fine for a while, as the data is just being shoved into the page cache, but when pages start getting flushed to disk, all hell breaks loose.

I have a 3ware 9650SE running a RAID6 on eight Seagate Barracuda ES (7200 RPM) 320 GB drives.  I know RAID6 isn't the best for write speeds, but I should be able to get better than 6 MB/s, shouldn't I?  Alas, while the ISO image is being flushed to disk, I'm getting only 2-6 MB/s writes, and the number of dirty pages continues to climb.  While this is going on, anything else requiring a disk read takes a very long time.  Even the time between when I press Enter after entering my password for a root shell login and when the root prompt appears can be >10 seconds.

I have a Core 2 Quad Q6600.  While the pages are being written, I watch my CPU utilization graph, and I see intense spikes.  The graph hovers mostly around 12% utilization (so like 50% of one core), but occasionally and suddenly it shoots up to 100% (all four cores pegged) for 1-5 seconds at a time.  Copying a DVD to an ISO image shouldn't peg all four cores!  Additionally, if I watch vmstat while doing the copy, the 100% CPU spikes coincide with vmstat's reporting 0% idle, 93% io (it doesn't ever go higher than 93%).

Once K3b is finished writing the ISO, I still have several GB of dirty pages yet to be written.  (I have 8 GB RAM.)  If I issue a 'sync' command, the pages get flushed to disk, but still only at ~5 MB/s, though the CPU graph shows nearly idle during the sync.

All this might seem like a problem with the 3ware controller, but here's the odd thing:

$ vmstat 5 & sync ; time ( dd if=/dev/zero of=10G bs=4096 count=2621440 ; sync ) ; kill %1
[1] 22439
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0 7431424     20 122548    0    0    26    33    5    5  2  0 97  0
 1  1      0 2765820     20 4658304    0    0     3 162326 1288 1127  1 25 47 27
 2  3      0 208376     20 7132916    0    0     0 305676 2482 13676  1 19 31 49
 0  4      0  46312     20 7291344    0    0     0 58018  653 9456  0  4 30 66
 0  5      0  42636     20 7296552    0    0     0 28431  451 3844  0  2 26 71
 0  5      0  50780     20 7288524    0    0     0 26556  380 1295  0  2 25 73
 0  0      0  45368     20 7299224    0    0     0 29407  441 1265  0  1 53 46
 1  0      0  47216     20 7296964    0    0     0 87226  815  928  0  2 97  0
 0  0      0  46656     20 7297696    0    0     0 84720  848  982  0  3 97  0
 1  3      0  42768     20 7294484    0    0     0 57923  576 1671  0  3 62 34
 0  5      0  42892     20 7293524    0    0     1 23463  372 1432  0  3 48 49
 0  4      0  43072     20 7294312    0    0     1 21345  362 1657  0  2 48 50
 0  5      0  43024     20 7294456    0    0     6 22790  370 1350  0  3 31 66
 1  3      0  46780     20 7291080    0    0     1 22396  361 1462  0  2 31 67
 0  4      0  43216     20 7294132    0    0     0 22608  370 1373  0  2 48 49
 2  3      0  46608     20 7291544    0    0     0 21345  363 1351  0  2 50 48
 0  5      0  42924     20 7294548    0    0    17 23004  375 1408  0  2 48 49
 2  0      0  48400     20 7296148    0    0     2 20556  393  975  0  1 66 32
 0  0      0  45044     20 7298744    0    0     0 78870  773  933  0  2 97  1
 0  0      0  46324     20 7297772    0    0     0  4994  305  891  0  0 97  2
 0  0      0  45276     20 7298896    0    0     0 25167  417  993  0  1 98  1
 0  0      0  45744     20 7297992    0    0     0 29529  444  906  0  1 98  1
 0  4      0  42936     20 7298160    0    0     0  7604  300 1120  0  1 79 20
 0  4      0  43040     20 7295132    0    0     0 18597  335 1477  0  2 27 71
 0  4      0  43092     20 7294580    0    0     0 23968  383 2246  0  3 28 69
2621440+0 records in
2621440+0 records out
10737418240 bytes (11 GB) copied, 120.815 s, 88.9 MB/s
 0  2      0  54872     20 7289284    0    0     1 27088  471 1400  1  1 45 53
 0  2      0  54756     20 7289284    0    0     0  2745  311 1146  0  0 50 49
 0  2      0  54748     20 7289284    0    0     0  7869  331 1112  0  0 50 49
 0  3      0  54880     20 7289284    0    0     0  6317  302 1244  0  0 50 50
 0  3      0  54748     20 7289284    0    0     0  2328  305 1391  0  0 43 57
 0  3      0  54880     20 7289284    0    0     0  5502  319 1359  0  0 50 50
 0  3      0  54748     20 7289284    0    0     0 14270  372 1452  0  0 50 49
 0  2      0  52508     20 7289772    0    0    35  6819  316 1318  1  0 55 45
 0  1      0  51640     20 7289772    0    0     0 279078 2192  819  0  3 70 27
 0  1      0  51648     20 7289772    0    0     0 418332 3344  619  0  0 75 24

real    2m51.679s
user    0m0.487s
sys     0m14.990s
[1]+  Terminated              vmstat 5

$ echo $[10*1024*1024/(2*60+51)]
61320

Notice how writing is fast at the beginning before the page cache has filled up, then gets very slow for most of the write, but then gets very fast again at the end.  I have run this test three times in a row, deleting the 10G file between each run, and the results are always like this.

However, here's the REALLY strange thing.  See what happens when I use a 4 GB file instead of a 10 GB file:

$ vmstat 5 & sync ; time ( dd if=/dev/zero of=4G bs=4096 count=1048576 ; sync ) ; kill %1 ; fgrep Dirty /proc/meminfo
[1] 22831
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 3  0      0 7432280     20 122228    0    0    26    40    5    5  2  0 97  0
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB) copied, 4.4571 s, 964 MB/s
 0  2      0 3116964     20 4316616    0    0     1 152428 1194 1339  2 24 52 23
 0  2      0 3117560     20 4316536    0    0     0 358446 2877  751  0  4 72 24

real    0m14.994s
user    0m0.164s
sys     0m4.959s
Dirty:               0 kB
[1]+  Terminated              vmstat 5

I threw in the fgrep just so you could see that yes, all 4 GB of dirty pages do get written in this amount of time.

So apparently the slowdown only occurs if I manage to fill up my page cache.  Something is seriously wrong here.


Supplemental information:

# for each in /sys/block/sda/{device/queue_depth,queue/*} ; do echo "${each}: $(<${each})" ; done
/sys/block/sda/device/queue_depth: 254
/sys/block/sda/queue/hw_sector_size: 512
/sys/block/sda/queue/iosched:
/sys/block/sda/queue/max_hw_sectors_kb: 128
/sys/block/sda/queue/max_sectors_kb: 128
/sys/block/sda/queue/nr_requests: 512
/sys/block/sda/queue/read_ahead_kb: 384
/sys/block/sda/queue/scheduler: noop [deadline]

# lspci -nnvvv -d 13c1:1004
02:00.0 RAID bus controller [0104]: 3ware Inc 9650SE SATA-II RAID [13c1:1004] (rev 01)
        Subsystem: 3ware Inc 9650SE SATA-II RAID [13c1:1004]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at e0000000 (64-bit, prefetchable) [size=32M]
        Region 2: Memory at e5200000 (64-bit, non-prefetchable) [size=4K]
        Region 4: I/O ports at 2000 [size=256]
        Expansion ROM at e5220000 [disabled] [size=128K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit+ Queue=0/5 Enable-
                Address: 0000000000000000  Data: 0000
        Capabilities: [70] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <128ns, L1 <2us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x8, ASPM L0s L1, Latency L0 <512ns, L1 <64us
                        ClockPM- Suprise- LLActRep+ BwNot-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk- DLActive+ BWMgmt- ABWMgmt-
        Kernel driver in use: 3w-9xxx
Comment 154 Todd 2008-09-20 21:49:00 UTC
> From Comment 153: It happens when I'm making an ISO image from a DVD using
> K3b.  All goes along fine for a while, as the data is just being shoved 
> into the page cache, but when pages start getting flushed to disk, all hell
> breaks loose.

By "all hell breaks loose", do you mean a kernel panic?

When I write a DVD with K3B, I get a kernel panic so bad that twice it 
destroyed my file system.  I was afraid to report it over here as I am
not using a standard kernel (I am using Red Hat's modified kernel).
I reported the issue over at http://bugs.centos.org/view.php?id=3002.
(Don't pay any attention to the title -- I originally thought it was being
caused by my DVD burner.)

Work around: use "Deep Burner" (http://www.deepburner.com/) free edition
with the WINE emulator.  I can cuts DVD's all day long without a single
panic.

Can anyone confirm that my issue is related to this issue?

Many thanks,
-T
Comment 155 Matt Whitlock 2008-09-21 02:36:31 UTC
(In reply to comment #154)
> By "all hell breaks loose", do you mean a kernel panic?

I meant what I went on to describe.  I haven't experienced any kernel panics due to copying large files.
Comment 156 Matt Whitlock 2008-10-19 16:48:29 UTC
Part of this problem seems to have magically vanished for me.  Now when I'm making an image from a DVD, my disk reads stay reasonably responsive the whole way through, and dirty memory (reported in /proc/meminfo) stays at around 1.5 GB, whereas it used to climb past 5 GB.  Also, vmstat shows I/O wait at a super-consistent 19-21% while making the DVD image -- no variation outside of that range at all.

However, I'm still seeing the extreme slowdown in flushing pages to disk while the page cache is full.

I haven't changed kernel versions, but I have upgraded the firmware in my 3ware 9650SE to version FE9X 4.06.00.004, which was a major version bump over the last release.  I have also upgraded to KDE 4.1.2 and K3b 1.95-svn (r856890); not sure if that matters.
Comment 157 devsk 2008-11-10 23:28:34 UTC
I can still reproduce symptoms with 2.6.28-rc4. All you have to do to reproduce this is to start a 'dd' creation of a 6GB (anything above your phys RAM really) file and in parallel start a network copy over of a large file over cifs or nfs. The whole system just stalls (nothing can be done with system, mouse freezes, cursor stops blinking, no new programs can be started) for several seconds, then continues, then stalls again for few seconds and so on. Many programs go into 'D' state and CPU usage falls drastically if there were programs consuming CPU.

Something is horribly wrong with page cache mgmt. Why is kernel caching such random write data and slowing everything down?
Comment 158 Ben Gamari 2008-12-01 20:20:52 UTC
I have also seen this issue horribly on my SATA hard drive behind an ICH8 controller (Dell Latitude D830). Any large copies from the drive slow the machine to a crawl. Copying from my internal hard drive to an external USB 2 drive results in a transfer rate of 8MB/sec. This has been submitted to the Fedora bugzilla as bug #470736.
Comment 159 Luka Renko 2008-12-03 13:44:04 UTC
Similar bug in Ubuntu:
https://launchpad.net/bugs/131094
Comment 160 Ben Gamari 2008-12-11 20:13:08 UTC
Todd, have you made any progress on this? Just wondering. I might try taking a stab at it when classes end.
Comment 161 Todd 2008-12-12 15:26:30 UTC
Hi Ben,

    My problem may or may not be this bug, but my strong suspicion is that it
is.  It seems that when I interrupt a large file transfer on my RAID card or
interrupt a "dump" backup of my RAID card to my SATA drive, my USB goes to
hell.  See https://bugzilla.redhat.com/show_bug.cgi?id=469290

    It is pretty easy to reproduce, all I have to do is start a "dump" from
my primary partition (/dev/sda1) to my removable SATA drive (not eSATA) 
(/dev/sdb2), let it run for about ten minutes (takes 1hr 45min to run), then
stop it with a <Ctrl><C>.  After that, a 940 MB copy from /dev/sdb2 to my
USB will drop from 1 min 32 sec to 92 minutes.  Rebooting always fixes it.

    And, the problem will happen occasionally when I let the backup go
through successfully, but not always.

-T
Comment 162 Ben Gamari 2008-12-17 12:22:56 UTC
Created attachment 19346 [details]
OProfile of heavily IO loaded system

This probably doesn't help much, but I figured I'd take a profile while doing a backup (my reproduction case) just for the hell of it. Doesn't look like anything too informative is in here but here it is anyways.
Comment 163 Ben Gamari 2008-12-17 12:27:56 UTC
(In reply to comment #161)
> Hi Ben,
> 
>     My problem may or may not be this bug, but my strong suspicion is that it
> is.  It seems that when I interrupt a large file transfer on my RAID card or
> interrupt a "dump" backup of my RAID card to my SATA drive, my USB goes to
> hell.  See https://bugzilla.redhat.com/show_bug.cgi?id=469290
> 
>     It is pretty easy to reproduce, all I have to do is start a "dump" from
> my primary partition (/dev/sda1) to my removable SATA drive (not eSATA) 
> (/dev/sdb2), let it run for about ten minutes (takes 1hr 45min to run), then
> stop it with a <Ctrl><C>.  After that, a 940 MB copy from /dev/sdb2 to my
> USB will drop from 1 min 32 sec to 92 minutes.  Rebooting always fixes it.
> 
>     And, the problem will happen occasionally when I let the backup go
> through successfully, but not always.
> 
> -T
> 


It really doesn't sound like you are encountering this particular bug. This bug causes both general system and I/O performance to go to hell while copying large files. It just took me 10 minutes to VT switch away from and back to gnome just now due to this bug. If you aren't having interactivity issues, then you are seeing something else. Poor IO performance is potentially justifiable, but inappropriate thrashing that kernels beyond 2.6.18 have been producing is a far more serious issue.
Comment 164 Thomas Pilarski 2008-12-23 02:04:34 UTC
I am affected on this bug too, on many desktop systems, with Intel and AMD controllers. It's very easy to reproduce this bug.

It must be something like the 50Hz problem in PDT-11, if someone knows it.
There must be a threshold of disc throughput which causes such a problem.
-------------  
I tried to bypass the problem by using a fast SSD, but it's makes the problem even worse. There are the result of two dd writing operations.

# dd if=/dev/zero of=/dev/sda6 bs=1M count=1500
1572864000 Bytes (1,6 GB) kopiert, 57,5956 s, 27,3 MB/s
And poor desktop responsiveness.

# dd if=/dev/zero of=/dev/sda6 bs=1M count=1500 oflag=direct
1572864000 Bytes (1,6 GB) kopiert, 20,9958 s, 74,9 MB/s
And even firefox does not freeze. 
-------------
I have undone a patch found in the CentOS source (linux-2.6-fs-dio-use-kzalloc-to-zero-out-struct-dio.patch) on Ubuntu Hardy kernel (2.6.24-22-generic). I used this patch, because the problem first occurs on my centos system, when I updated from kernel-2.6.18-92.1.10.el5 to kernel-2.6.18-92.1.13.el5. This patch was also used on Ubuntu Gutsy the first time, on which the problem first occurs on my old system. There is only a small slowdown on my centos system, but there is one.

This upachting increases my boot time of my Ubuntu Hardy testing installation from 60 seconds to 45 seconds. Every applications starts faster. On high loadavg there are differences like 140s for patched and 30s for unpatched kernel for stating gimp. This happens only, when executing eight concurrent dd writing operations. This does not solve the problem, but it does not appears less. 
While coping just a big file with gnome, there is no noticeable speed-up. Every action takes a long time and time goes to infinity.
-------------
I tried to simulate a slower harddisk and installed Ubuntu Hardy on a full encrypted disc. Which reduces the write speed from 56MB/s to 40MB/s on a traditional disc. The were no differences between the patched and unpatched kernels. Both were unusable.
Comment 165 Ben Gamari 2008-12-25 10:32:18 UTC
After spending two days pouring over code and profiling, I have made no progress on this issue nor have any novel proposals as to the source of the slowdown. It seems I just don't have the understanding of the relevant subsystems and to handle this myself.

This bug is getting extremely unwieldy and seems to be mixing several separate issues. Would anyone be opposed to opening a new bug for the high I/O wait times issue?

While I understand the difficult nature of this issue, the lack of kernel developer involvement in the handling of this bug is extremely worrisome. This was reported nearly 2 years ago and we still are no closer to a resolution despite countless profiles, benchmarks, and bisections. Moreover, at this point one would have to go back eleven kernel releases to have reasonable I/O performance.

What could be done to expedite the handling of this bug? Would it help to move discussion onto the LKML? Is there more data that could be collected? Would further bisections help? We have a great amount of data and several proposed fixes which have been shown not to work, however there is no plan for moving forward.

Could any of the IO developers suggest novel avenues for inquiry into this slowness? It seems that raw profiles give very little information about the source of the slowness itself. What techniques other than profiling and printks (which have been my bread-and-butter so far) can be used to pinpoint performance issues like this one?

Any help that could be provided would be greatly appreciated. This bug has frustrated me ever since switching to x86-64 and a resolution is long overdue. Thanks,

- Ben
Comment 166 Andi Kleen 2008-12-25 14:04:46 UTC
At least I mostly ignore it because it has all the signs one of those catchall bugs where lots of people report lots of different problems which show vaguely the same sympthoms mixed in a single bug. In general it's extremly hard to sort out these kinds of bug conglomerations and the best avenue might be to start from scratch.

Regarding ways to diagnose slowness: if nothing shows up in oprofile it's
likely some problem where something waits. You should use some form of latency
profilings, either using profile=sched or using latencytop.
Comment 167 Peter Zijlstra 2008-12-26 01:08:18 UTC
On Thu, 2008-12-25 at 10:32 -0800, bugme-daemon@bugzilla.kernel.org
wrote:

> This bug is getting extremely unwieldy and seems to be mixing several separate
> issues. Would anyone be opposed to opening a new bug for the high I/O wait
> times issue?
> 
> While I understand the difficult nature of this issue, the lack of kernel
> developer involvement in the handling of this bug is extremely worrisome

These two are closely related -- I have no clue what all this is about
anymore.

We did fix some of the issues with the per-bdi writeback stuff, but
clearly there is more.

But everybody and his dog piling everything that makes his particular
thing go slow in here makes for very muddy waters indeed.

Furthermore, the linear nature of bugzilla combined with 2+yrs of
history makes is impossible to even start reading this thing.

So yes, please open new bugs and try and keep issues separate.

Comment 168 Andi Kleen 2008-12-27 05:57:27 UTC
I think the original bug was that not having highmem apparently affects memory balance in a way that makes the system unresponsive under heavy IO. I've seen that in a few cases myself so I'm willing to believe it.

Then after that people started to skim in with lots of different issues.

Hmm, one way to verify the theory above would be to simulate highmem on a 64bit 
system and see if it makes the problem away. I can cook up a patch to do that.
Comment 169 Ben Gamari 2008-12-27 15:54:26 UTC
(In reply to comment #168)
> I think the original bug was that not having highmem apparently affects memory
> balance in a way that makes the system unresponsive under heavy IO. I've seen
> that in a few cases myself so I'm willing to believe it.
> 
This would definitely explain the large number of x86_64 users here but how did this go unnoticed on other non-x86/highmem architectures.

> Then after that people started to skim in with lots of different issues.
> 
> Hmm, one way to verify the theory above would be to simulate highmem on a 64bit 
> system and see if it makes the problem away. I can cook up a patch to do that.
> 
I'd be willing to test this
Comment 170 Ben Gamari 2008-12-27 16:05:06 UTC
I have finally opened a new bug (bug #12309) for the initial issue with high iowait times. I tried to restrict the description of the new bug to only known facts about the issue and it would be appreciated if we could keep this one a little bit more on-topic.

It would be great if we could pick up discussion on #12309. If anyone has comments which they think might help and are certainly germane to the original issue, feel free to comment on the new bug.
Comment 171 Thomas Pilarski 2008-12-29 22:56:14 UTC
The problem is not restricted to the amd64 architecture. It appeared in Ubuntu Gusty (I think it was 2.6.24) on a Pentium M the first time. And a 32-Bit installation does not fix the poor desktop responsiveness.
Comment 172 Thomas Pilarski 2009-01-12 00:54:46 UTC
Created attachment 19758 [details]
Program for measuring the issue

After comparing some kernel code, I have found come really interesting fact. I think the poor desktop responsiveness is affected by the changed process scheduler (e.g. tickless kernel / high resolution timer ...) and not by the disc scheduler. I have written a test program (sorry for the dirty code), which enforces the problem and allows to measure it.

Here some fact. I have executed the tests in recovery mode (kernel parameter single) once with 20 processes * 1.000.000 messages and once with 100 processes * 100.000 messages. The result values are echo time of ~80-90% of the messages / longest echo time and test duration.

CentOS
2.6.18-92.el5 - 20/1M 4µs / 1s / 38,4s - 100/1k 4µs / 1s / 18,7s
Ubuntu 6.04 - 8.10
2.6.15-53 - 20/1M 3-33µs / 1s / 33,6s - 100/1k 3-40µs / 1s / 17,7s
2.6.20-17 - 20/1M 3µs / 1s / 32s - 100/1k 3-9µs / 1s / 16,0s
2.6.22-16 - 20/1M 3-4µs / 7ms / 51,5s - 100/1k 4µs / 1s / 25,9s
2.6.24-23 - 20/1M 53µs / 64µs / 73s - 100/1k 77-250µs / 41ms / 32,0s
2.6.27-9 - 20/1M 120-200µs / 120ms / 159s - 100/1k 500-1000µs / 1s / 84s

While executing the test with 100/1M under xorg/Gnome, the problem is enforced. There are no problems on CentOS and Feisty. I could not test it on Ubuntu 6.06. And I have heavy responsiveness problems with Hardy, Intrepid and Fedora 10. With 2.6.22 (installed in Feisty) the problem sometimes occurs and sometimes not.
Comment 173 Andi Kleen 2009-01-12 11:19:33 UTC
Very nice -- a test case. Could you perhaps post it to linux-kernel/linux-mm 
with another description of the problem?
Comment 174 devsk 2009-01-12 15:03:12 UTC
I hit the problem with "time ./ProcessSchedulerTest 100 50000". Didn't need to go any higher. I am using x86_64 on vanilla-2.6.28. 4GB RAM and not much load on the system (about 200 processes without the test case).

symptoms were same. Mouse frozen, new konsole window takes ages to come up. So, it does seem to have nothing to do with disk.

Unfortunately, I can't try 2.6.17 kernel because I upgraded my '/' to ext4.

Can anyone else try this test case in 2.6.17 (anything prior to 2.6.18) on 64-bit?
Comment 175 devsk 2009-01-12 15:04:44 UTC
Just want to mention that I ran the test case as normal user from a konsole window.
Comment 176 Ben Gamari 2009-01-14 06:58:21 UTC
The test case is excellent but I haven't seen it pop up on the LKML yet. Thomas, do you intend on sending it to the list? I could probably put something together if you are short on time.

Also, we should collectively decide to either migrate the discussion to Bug #12309 or not. In my opinion the shear size of this bug is making discussion a bit unwieldy and therefore #12309 would be a better venue, but if others disagree I can close it and we can resume here.
Comment 177 Ben Gamari 2009-01-14 11:48:23 UTC
(In reply to comment #172)
> Created an attachment (id=19758) [details]
> Program for measuring the issue
> 
> After comparing some kernel code, I have found come really interesting fact. I
> think the poor desktop responsiveness is affected by the changed process
> scheduler (e.g. tickless kernel / high resolution timer ...) and not by the
> disc scheduler. I have written a test program (sorry for the dirty code), which
> enforces the problem and allows to measure it.
> 

To clarify, are you proposing that the high I/O wait time and the poor responsiveness problems are two separate issues?

I just now had to wait for 5 minutes while evolution was moving messages to a large folder during a kernel build. Interestingly, directly after the kernel build completed evolution's rate of progress increased dramatically. During the kernel build, pdflush was always in iotop's top three and total disk I/O throughput was <1MB/sec. Wouldn't this tend to suggest a block layer or memory management issue?
Comment 178 Thomas Pilarski 2009-01-15 03:14:01 UTC
see http://bugzilla.kernel.org/show_bug.cgi?id=12309
Comment 179 David Pflug 2009-10-02 08:52:10 UTC
My company has a client with this issue. If it would help, I might be able to get the RAID controller in question shipped to someone that can do some debugging. I'll have to ask.
Comment 180 David Pflug 2009-10-02 08:59:53 UTC
Ok, wow...I just solved the problem here. Apparently write caching was disabled. I'd never used sdparm before. Found a thread on Web Hosting Talk that sounded like this issue and explained that fix.

sdparm -s WCE=1 /dev/sda

And everything's fine. If that helps others, I take no credit. If it doesn't, I'm sorry for adding to the noise.
Comment 181 Zenith88 2009-11-18 19:49:47 UTC
I have the same issue on a machine with i845e chipset, P4-1.5 Northwood, 2GB DDR RAM, GF6800 video and Audigy2 sound card. My main HDD is 160GB IDE Seagate.

When there is disk activity the system becomes virtually unusable.

For example, when I am burning a DVD on the drive attached to SII 3512 SATA controller, the CPU load goes from 40% at 7-8x to 98% at 16x.

Downloading Fedora12 ISO last night at 500 kb/s kept system busy at 90%!

If I start kernel compile, CPU load is stable 100%, which is Okay, but switching tabs in Firefox takes 10 seconds and starting any application like JUK, Dolphin, Konsole takes up to 1 minute.

Running Fedora11 with 2.6.30.9.96 FC11 i686 PAE kernel.

The system has become a bit more responsive (by about 10-20%) since I noticed p4-clockmod was being loaded and shut it down.
Comment 182 bslade 2009-12-18 21:07:44 UTC
Re: comment #180, use sdparm to enable write caching

Be careful, I think this is enabling hard drive internal write caching.  On my system, with Hitachi disks, the on disk cache is not power down safe.  I think this is the case with a lot of hard drives.   We can consistently corrupt the disk by pulling the plug during heavy write activity using a non-journaled filesystem (or databases writing to raw devices).  Ie., the disk drive tells the controller that the write completed, but the disk drive lies.

Although it may corrupt the disks during a power failure, turning on hard drive internal write caching does dramatically increase throughput (x5 for heavy writes), decrease the CPU load average, and lower the wait for I/O percentages.

What I don't understand is, I'm using a Perc 6 Megaraid disk controller with battery backed up writeback cache.   I would think that this would do most of the caching work and the hard drive internal write cache would make little difference, but that's not the case.
Comment 183 Jeff Garzik 2009-12-18 21:15:04 UTC
There is no "lying"

It has been documented in the ATA and SCSI specifications for over a decade that FLUSH CACHE (ATA) or SYNCHRONIZE CACHE (SCSI) commands are needing to flush disk cache to media.

This is exactly the same as the write cache on every operating system in the planet, where dirty data will sit in pagecache for a time, before being flushed to disk.

If you want to guarantee your data survives a power failure, either

* use a filesystem or database that knows to flush the cache
           or
* turn off the writeback cache

These disk drives are not telling "lies."  They are behaving exactly as the decade-old specifications say they should behave.
Comment 184 Zenith88 2010-05-18 23:33:50 UTC
I am now on 2.6.32 and the issue persists.
Comment 185 Michael Meeks 2010-05-19 08:55:35 UTC
Zenith - the most satisfying guess as to the bad behavior I was seeing, was a combination of ext3, and fsync forcing some blocking flush of the cache that essentially broke the system. That is fixed only with ext4 and btrfs. Can you confirm your file-system type ?
Comment 186 bslade 2010-05-19 21:37:32 UTC
(In reply to comment #183)
> There is no "lying"
> 
> It has been documented in the ATA and SCSI specifications for over a decade
> that FLUSH CACHE (ATA) or SYNCHRONIZE CACHE (SCSI) commands are needing to
> flush disk cache to media.
> 
> This is exactly the same as the write cache on every operating system in the
> planet, where dirty data will sit in pagecache for a time, before being flushed
> to disk.
> 
> If you want to guarantee your data survives a power failure, either
> 
> * use a filesystem or database that knows to flush the cache
>            or
> * turn off the writeback cache
> 
> These disk drives are not telling "lies."  They are behaving exactly as the
> decade-old specifications say they should behave.

My point is, companies like DELL are shipping Enterprise class servers where the hard drive internal cache (not power down safe) is enabled and the Linux operating system/drivers don't do a SCSI SYNCHRONIZE CACHE command to disk when flushing cache.  If the box loses power during heavy write activity, you have a high probability of disk corruption.   I suspicious that this is a common industry practice.   I bet very few people have the expertise to find a problem like this.

For the next system we order, we'll ask Dell if that version of Linux & disk drivers support SCSI SYNCHRONIZE CACHE commands when flushing Linux I/O cache, and whether a direct_io write (bypassing Linux cache) to a non-filesystem block device causes a Linux I/O cache flush (probably not) or a SCSI SYNCHRONIZE CACHE operation (probably not).

One more note, when we execute a heavy write operation (loading a database), with the hard drive internal cache enabled, it runs fairly well on our Linux system.   With the hard drive internal cache disabled, heavy writes slow down dramatically (factor of 4 even with disk controller writeback caching).  The slow heavy writes cause the load average on the box goes high and the box becomes unresponsive (but doesn't crash).   It sure seems like the disk subsystem getting overloaded with writes causes Linux to have problems.

Ben
Comment 187 Tejun Heo 2010-05-19 22:41:11 UTC
(In reply to comment #186)
> My point is, companies like DELL are shipping Enterprise class servers where
> the hard drive internal cache (not power down safe) is enabled and the Linux
> operating system/drivers don't do a SCSI SYNCHRONIZE CACHE command to disk when
> flushing cache.

Barrier support has been there for ages now and what enterprise distro is crazy enough to turn off barrier support which BTW is on by default on all popular file systems?  Please stop spreading misinformation.

> One more note, when we execute a heavy write operation (loading a database),
> with the hard drive internal cache enabled, it runs fairly well on our Linux
> system.   With the hard drive internal cache disabled, heavy writes slow down
> dramatically (factor of 4 even with disk controller writeback caching).  The
> slow heavy writes cause the load average on the box goes high and the box
> becomes unresponsive (but doesn't crash).   It sure seems like the disk
> subsystem getting overloaded with writes causes Linux to have problems.

In theory, with enough queue depth, WT should be able to achieve similar throughput as WB, but theory is one thing, reality another. Firmwares are not optimized for such use cases and queue depth isn't quite enough. This doesn't have much to do with OS or drivers. So, if you think the WT performance is critical please complain to hardware vendors.

System being too unresponsive under high IO conditions is the bug topic here which I think is mostly resolved in mainline at least with ext4. This bug entry became almost impossible to track the topic. So, let's stop adding stuff here. If anyone is experiencing unreasonably high latency under ext4, please open a new bug report which specifically targets ext4.

Thanks.
Comment 188 Zenith88 2010-05-23 13:26:15 UTC
(In reply to <a href=https://bugzilla.kernel.org/show_bug.cgi?id=7372#c185>comment #186</a>)

> Zenith - the most satisfying guess as to the bad behavior I was seeing, was a
> combination of ext3, and fsync forcing some blocking flush of the cache that
> essentially broke the system. That is fixed only with ext4 and btrfs. Can you
> confirm your file-system type ?

ext4

Note You need to log in before you can comment on or make changes to this bug.