Bug 8636

Summary: Dump of ext3 runs very slowly
Product: File System    Reporter: David Mathog (mathog)
Component: ext3         Assignee: Jens Axboe (axboe)
Status: CLOSED WILL_NOT_FIX    
Severity: normal CC: akpm, alan, aph, axboe, bgamari, bunk, funtoos, heiko, henkjans_bagger, jgarzik, jmoyer, knt, lenharo, mozilla_bugs, ToddAndMargo, zing
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.21.1 Subsystem:
Regression: No Bisected commit-id:
Attachments: .config used to build 2.6.21.1
output of: dmesg -s 1000000
dmesg -s 100000 from 2.6.17-14mdv system
dumpe2fs from saf03
dumpe2fs from saf02
bonnie and dump results for various kernels on two machines
.config for 2.6.20.14, K8, last kernel with FAST dump
.config for 2.6.21.0rc3 (K8) earliest SLOW dump
Summary table of bonnie and dump speed results
summary of tests, includes 2.6.21rc1
.config for 2.6.18 kernel that ran at 6.7Mb/sec
dump and bonnie tests for different IOSCHED defaults
Always schedule dispatch on request completion
Detect processes issuing I/O within the average seek distance of one another
Fixed version of the previous patch.

Description David Mathog 2007-06-15 14:47:50 UTC
Most recent kernel where this bug did not occur: 
  2.6.17-14mdv
Distribution:
  Mandriva 2007.1
Hardware Environment:
  Athlon64 3700+
  1G RAM
  WDC WD1600SB-01KBA0 (PATA disk)
  Asus A8N5X motherboard

Software Environment:
  dump 0.4b41-2dmv2007.0
Problem Description:

With kernel 2.6.21.1, dump of a mounted ext3 filesystem is
very, very slow: it runs at less than 5 MB/s.  With older kernels
dump runs more than 5 times faster.

Steps to reproduce:
1. Boot the system with choice of kernel.
2. mount the ext3 filesystem (in this case it is "/", including
   all of Mandriva 2007.1 except for "/boot".)
3. dump -0 -f /dev/null /dev/hda5

Results:

kernel        elapsed(s)   rate (KB/s)    dumped (MB)
2.6.17-14mdv  289          26578          7501.07
2.6.21.1     1800           4498          7911.29

A few files were deleted on the disk between tests, which is why
the size isn't quite the same.  The dump on 2.6.21.1 takes so
long I didn't run it to completion again after removing those files,
but it said things like "expected completion in 28 minutes".
I did not round the time, dump really reported 1800s.
Comment 1 David Mathog 2007-06-15 14:51:32 UTC
Created attachment 11763 [details]
.config used to build 2.6.21.1
Comment 2 David Mathog 2007-06-15 14:54:59 UTC
Proof that /dev/null is not rate limiting.

On 2.6.21.1

% dd bs=512 count=1000000 if=/dev/zero of=/dev/null
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 1.24117 seconds, 413 MB/s
Comment 3 David Mathog 2007-06-15 15:11:34 UTC
There are no errors or warnings in /var/log/messages to indicate a problem.

2.6.21.1 dumped slowly even when it was booted failsafe, the network started,
and a single inbound connection was made to run the test.   I also removed all the
cpufreq modules and reran the test; it was no faster.
Comment 4 Andrew Morton 2007-06-16 07:58:53 UTC
It could be that the IDE driver got slower.  Try using hdparm to measure the disk's transfer speed and compare it with the earlier kernel.

Please see http://lkml.org/lkml/2007/5/31/181 - perhaps it's this.

Please attach to this report the output from `dmesg -s 1000000' for
both kernel versions.

Thanks.
Comment 5 Andrew Morton 2007-06-16 08:46:57 UTC
On Sat, 16 Jun 2007 11:36:53 -0400 Jeff Garzik <jeff@garzik.org> wrote:

> Andrew Morton wrote:
> > So I revisited [...]
> > http://bugzilla.kernel.org/show_bug.cgi?id=8636
> > 
> > I don't see why it's not-a-bug.  The second guy (Stephen Clark) had to toss
> > out the FC6 kernel and build his own kernel to fix this regression.
> 
> (I hope you don't mind me copying linux-ide)
> 
> The combined mode removal changed the driver load configuration.  Some 
> distros updated with that knowledge, but apparently Fedora did not.
> 
> Combined mode would explicitly reserve ports for libata in pci/quirks.c, 
> a horrible layering violation that created special case module load 
> order dependencies.
> 
> The change made libata and the old-IDE driver behave like normal Linux 
> drivers.  They can both grab an entire PCI device and drive it 
> correctly, so -- just like every other situation where two drivers can 
> match the same PCI ID -- the one that loads first wins.
> 
> Kernel configs need to be tweaked a bit, due to this change.  That's why 
> it was communicated in advance (but poorly, as it appears from your 
> questions and existing bug reports).  Two common results appear in bug 
> reports:
> 
> 1) One possible result of a complete lack of kconfig tweaking (blindly 
> hitting <enter> during 'make oldconfig') is falling back to the 
> most-compatible configuration, the legacy IDE driver (either libata or 
> old-IDE), with resultant slow performance.
> 
> 2) "no root! panic!" and similar I-cant-find-your-hard-drive results for 
> people with hardcoding root= configurations, for the minority where a 
> device moved from /dev/hdX -> /dev/sdX, or vice versa.
> 
> The damage is hoped to be limited to:
> 
> * Intel ICH5/6[/7?] users with combined mode enabled, which is a 
> not-small subset of all ICH[567] users.
> 
> * Users that did /not/ choose the combined_mode=libata kernel command 
> line option, a popular option for restoring performance /broken/ by 
> running two drivers in tandem [i.e. the old way, recently removed].
> 
> * In the combined mode configuration, one device is /dev/hdX (often the 
> CD-ROM) and one device is /dev/sdX, so only one of those devices will 
> move.  Standard LVM and mount-by-label/uuid found in standard distro 
> installs makes this move seamless for many.
> 
> * ...in distros that hopefully took stock of their compatibility 
> picture, and modified their kernel configs and drivers accordingly. 
> Some distros were defaulting to combined_mode=libata (==libata drives 
> all applicable Intel ICHx IDE devices), others were not.  That affects 
> the decision about kernel config changes.
> 
>       Jeff
> 
> 
> 
> 
Comment 6 Tejun Heo 2007-06-17 22:46:19 UTC
Yeap, it seems like a transfer-mode configuration problem.  Please report the result of 'dmesg -s 1000000'.
Comment 7 David Mathog 2007-06-18 09:34:29 UTC
(In reply to comment #4)
>  Try using hdparm to ensure the
> disk's transfer speed, compare that with the earlier kernel.

hdparm /dev/hda

on both kernels shows:

 multcount    = 16 (on)
 IO_support   =  1 (32-bit)
 unmaskirq    =  1 (on)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    = 256 (on)
 geometry     = 19457/255/63, sectors = 312581808, start = 0

On 2.6.21.1

 hdparm -tT /dev/hda

/dev/hda:
 Timing cached reads:   1622 MB in  2.00 seconds = 811.34 MB/sec
 Timing buffered disk reads:  172 MB in  3.02 seconds =  56.94 MB/sec

on 2.6.17-14mdv

 hdparm -tT /dev/hda 

/dev/hda:
 Timing cached reads:   1700 MB in  2.00 seconds = 850.66 MB/sec
 Timing buffered disk reads:  172 MB in  3.03 seconds =  56.74 MB/sec

So it isn't disk speed per se; it's something related to dump or ext3.
Comment 8 David Mathog 2007-06-18 09:36:08 UTC
Created attachment 11782 [details]
output of: dmesg -s 1000000
Comment 9 David Mathog 2007-06-18 09:42:51 UTC
Created attachment 11783 [details]
dmesg -s 100000 from 2.6.17-14mdv system

This dmesg was actually generated on a second machine.  However, the second machine has hardware identical to the first, its hda5 partition was cloned from the first machine's, and it also showed the slow-dump problem when booted with the 2.6.21.1 kernel but not with this kernel.
Comment 10 Tejun Heo 2007-06-18 09:51:19 UTC
Yeah, nothing seems wrong with the IDE driver.  Can you run bonnie on the filesystem and report the result?
Comment 11 David Mathog 2007-06-18 10:26:17 UTC
(In reply to comment #10)
> Yeah, nothing seems wrong with the IDE driver.  Can you run bonnie on the
> filesystem and report the result?
> 

Hope bonnie++ is OK; that's running now.  I'm using a command like this,
please let me know if it should be different:

bonnie++ -u 500:500 -d /tmp 2>&1 | tee bonnie_2.6.17-14mdv.txt

Here are the results for the older kernel:

 cat bonnie_2.6.17-14mdv.txt
Using uid:500, gid:500.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
saf02.bio.caltec 2G 45099  91 58981  17 22377   5 37811  58 50173   3 226.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
saf02.bio.caltech.edu,2G,45099,91,58981,17,22377,5,37811,58,50173,3,226.5,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

Here are the results for the 2.6.21.1 kernel (just the bits at the end; the
status lines were all the same as above):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
saf03.bio.caltec 2G 20433  68 40938  47 18901  10 45423  82 48388   5 187.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1361  97 +++++ +++ 26532 100  1449  98 +++++ +++  3361  97
saf03.bio.caltech.edu,2G,20433,68,40938,47,18901,10,45423,82,48388,5,187.7,0,16,1361,97,+++++,+++,26532,100,1449,98,+++++,+++,3361,97

Again, these were done on two different machines, but with identical hardware
and a cloned /dev/hda5, just with different boot kernels.  It looks like
under bonnie++ 2.6.21.1 is slower than 2.6.17-14mdv, but while I see a factor of
2.2 (20433 vs. 45099) I don't see a factor of 5 here that would account for the
incredibly slow dump speed.
Comment 12 David Mathog 2007-06-18 11:11:06 UTC
Hmm, this is bizarre.  df -k is showing significantly different disk usage
on 2.6.21.1 versus 2.6.17-14mdv.  I tracked it all the way down to single
directories, all of which (as far as I can tell) look like this:

2.6.21.1

[root@saf03 tmp]# du -k /Pmw/Pmw_1_2/bin
6       /Pmw/Pmw_1_2/bin
[root@saf03 tmp]# ls -alR  /Pmw/Pmw_1_2/bin
/Pmw/Pmw_1_2/bin:
total 7
drwxr-xr-x 2 root root 1024 May 23 11:57 ./
drwxr-xr-x 7 root root 1024 May 23 11:57 ../
-rwxr-xr-x 1 root root 4301 May 16 14:07 bundlepmw.py*
[root@saf03 tmp]# stat /Pmw/Pmw_1_2/bin/bundlepmw.py
  File: `/Pmw/Pmw_1_2/bin/bundlepmw.py'
  Size: 4301            Blocks: 10         IO Block: 1024   regular file
Device: 305h/773d       Inode: 6397963     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-05-16 14:07:51.000000000 -0700
Modify: 2007-05-16 14:07:51.000000000 -0700
Change: 2007-05-23 11:57:44.000000000 -0700

2.6.17-14mdv
[root@saf02 /]# du -k /Pmw/Pmw_1_2/bin
12      /Pmw/Pmw_1_2/bin
[root@saf02 /]# ls -alR  /Pmw/Pmw_1_2/bin
/Pmw/Pmw_1_2/bin:
total 16
drwxr-xr-x 2 root root 4096 May 23 11:57 ./
drwxr-xr-x 7 root root 4096 May 23 11:57 ../
-rwxr-xr-x 1 root root 4301 May 16 14:07 bundlepmw.py*
[root@saf02 /]#  stat /Pmw/Pmw_1_2/bin/bundlepmw.py
  File: `/Pmw/Pmw_1_2/bin/bundlepmw.py'
  Size: 4301            Blocks: 16         IO Block: 4096   regular file
Device: 305h/773d       Inode: 2599978     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-06-15 04:14:48.000000000 -0700
Modify: 2007-05-16 14:07:51.000000000 -0700
Change: 2007-06-15 04:21:22.000000000 -0700

Notice the difference in IO Block, which is 4096 on the older one
and 1024 on the newer one.  Just to be sure that isn't an issue with the
cloning I rebooted the machine running the newer kernel with the older kernel,
and repeated the commands.  The following are all still on saf03 but running
2.6.17-14mdv:

[root@saf03 ~]# du -k /Pmw/Pmw_1_2/bin
6       /Pmw/Pmw_1_2/bin
[root@saf03 ~]# ls -alR  /Pmw/Pmw_1_2/bin
/Pmw/Pmw_1_2/bin:
total 7
drwxr-xr-x 2 root root 1024 May 23 11:57 ./
drwxr-xr-x 7 root root 1024 May 23 11:57 ../
-rwxr-xr-x 1 root root 4301 May 16 14:07 bundlepmw.py*
[root@saf03 ~]# stat /Pmw/Pmw_1_2/bin/bundlepmw.py
  File: `/Pmw/Pmw_1_2/bin/bundlepmw.py'
  Size: 4301            Blocks: 10         IO Block: 4096   regular file
Device: 305h/773d       Inode: 6397963     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-05-16 14:07:51.000000000 -0700
Modify: 2007-05-16 14:07:51.000000000 -0700
Change: 2007-05-23 11:57:44.000000000 -0700

Ah, so there is a cloning issue: the sizes on this machine are the
same no matter which kernel, EXCEPT for the IO Block, which is 4096
with the older kernel and 1024 with the newer one.  So I guess I had better
rerun bonnie++ on the same machine.  Same command as above, on saf03, using
the older kernel.  (IGNORE the saf02 results; there is another factor involved.)

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
saf03.bio.caltec 2G 30306  73 53393  37 19822   7 26392  43 48085   4 208.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1707  95 +++++ +++ +++++ +++  1877  97 +++++ +++  5402  93
saf03.bio.caltech.edu,2G,30306,73,53393,37,19822,7,26392,43,48085,4,208.3,0,16,1707,95,+++++,+++,+++++,+++,1877,97,+++++,+++,5402,93

Apparently when saf03 was set up by the Mandriva 2007.1 installer
the ext3 block size was not set to the default of 4096 but to 1024, whereas when that partition was cloned to saf02 I know I used the default
block size, which turned out to be 4096.  I'll attach dumpe2fs output for the two
systems next.
Comment 13 David Mathog 2007-06-18 11:13:00 UTC
Created attachment 11785 [details]
dumpe2fs from saf03
Comment 14 David Mathog 2007-06-18 11:13:33 UTC
Created attachment 11786 [details]
dumpe2fs from saf02
Comment 15 David Mathog 2007-06-18 11:31:42 UTC
Just double-checked: even on saf02, with its different block size from
saf03, dump is still very slow under 2.6.21.1 (about half an hour; I quit after it
said "done in about 27 minutes") and still fast under 2.6.17-14mdv:

  DUMP: Volume 1 transfer rate: 37528 kB/s
  DUMP: 8218750 blocks (8026.12MB) on 1 volume(s)
Comment 16 David Mathog 2007-06-18 11:38:42 UTC
Let me modify the previous statement slightly.  With the default block size used on /dev/hda5 on saf02, the dump under the older kernel is actually a good bit
faster than it was with the same (old) kernel and the smaller block size
on saf03:

kernel        elapsed(s)   rate (KB/s)    dumped (MB)  machine
2.6.21.1     1800           4498          7911.29      SAF03
2.6.17-14mdv  289          26578          7501.07      SAF03
2.6.17-14mdv  219          37528          8026.12      SAF02

Apparently the default mkfs.ext3 settings work better with dump than
do the settings that Mandriva 2007.1 used.  I don't think this has anything
to do with the really slow dump under 2.6.21.1; it is just a typical
block-size performance difference.
Comment 17 David Mathog 2007-06-18 13:15:56 UTC
On the off chance that this problem had been fixed since 2.6.21.1, I tried 2.6.21.5.  I reused the .config from 2.6.21.1 and did "make oldconfig".
Then completed building the new kernel, installed it, booted to it and...

dump is still slow.  The run was aborted after dump emitted a "done in 27 minutes" message.
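For reference, the rebuild was the usual out-of-tree sequence, roughly as below; the copy path and the install/bootloader step are illustrative, not the exact commands used:

   cp ../linux-2.6.21.1/.config .config   # illustrative path: reuse the 2.6.21.1 config
   make oldconfig                         # accept defaults for any new options
   make                                   # build kernel image and modules
   make modules_install
   make install                           # or copy bzImage/System.map and update the bootloader by hand
   reboot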
Comment 18 Bartlomiej Zolnierkiewicz 2007-06-18 14:15:15 UTC
Hi,

Using git-bisect seems to be the most promising way of identifying the source of this regression (starting "good" kernel == 2.6.17, starting "bad" kernel == 2.6.21, test case == bonnie++).  Once we know which change caused the slowdown it should be (relatively) easy to fix.  Please give git-bisect a try.
Comment 19 David Mathog 2007-06-18 16:24:58 UTC
(In reply to comment #18)
> Using git-bisect seems to be the most promising way in identifying the source
> of this regression (starting "good" kernel == 2.6.17, starting "bad" kernel
> ==
> 2.6.21, test case == bonnie++).

Ugh, this is going to take a while.  OK, I'll try to narrow it down using
trinary search (two machines can be tested at once) and the "kernel-linus-source" Mandriva RPMs.

2.6.21.1 and 2.6.21.5 were compiled for K8; the default is M586.  That's one
more variable.  I don't have a feel for this: what are the odds that changing
the CPU target is playing a role here?  If I have to test both ways
all the way through, this is going to take even longer.  Thanks. 
Comment 20 Bartlomiej Zolnierkiewicz 2007-06-18 17:14:21 UTC
> Ugh, this is going to take a while.  Ok, I'll try to narrow it down using
> trinary search (two machines can be tested at once) using the
> "kernel-linus-source" mandriva rpm's.

I'm not familiar with Mandriva's kernel source RPMs, but it sounds
like this is going to be a manual search instead of a pseudo-automatic
one (with a git tree and git-bisect)?

BTW git-nary-search would be a cool feature request for git ;)

> 2.6.21.1 and 2.6.21.5 were compiled for K8, the default is M586. That's one
> more variable.  I don't have a feel for this, what are the odds that changing
> the CPU target is playing a role in this?  If I have to test both ways
> all the way through this is going to take even longer.  Thanks.

This is quite unlikely, and testing just once (i.e. 2.6.21.5 built for M586) should be enough to verify it.
Comment 21 Tejun Heo 2007-06-18 21:22:56 UTC
This sounds pretty similar to bug#7372.

  http://bugzilla.kernel.org/show_bug.cgi?id=7372

It's a long-running bug which we still don't have much of an idea about.  It looks like scheduling is involved somehow.
Comment 22 devsk 2007-06-18 21:47:24 UTC
David, you don't mention whether you are using x86_64.  Are you?

[ Why is the arch not a field in the bug report summary at the top?  It's one of the most basic (not necessarily most important) things in a kernel report. ]
Comment 23 henkjans_bagger 2007-06-19 01:47:11 UTC
@David: Git-bisect is a pretty powerful tool, but time-consuming. 
Seeing that your last working kernel is 2.6.17.14 (the same as mine) and that Tejun already noticed the similarity with bug#7372 it might be possible to save a lot of time. 

In that bug I performed a git-bisect giving me a bad-commit ID, see http://bugzilla.kernel.org/show_bug.cgi?id=7372#c42 

It might be good to check whether your problem originates from the same commit, as that saves you the trouble/time of doing the git-bisect yourself.  This would potentially be one reboot vs. ~10 for a complete git-bisect. 

You can do this check by reverting the bad commit I found: 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1

For that you use 

	git checkout master
	git revert <bad-commit-id>

See http://www.kernel.org/pub/software/scm/git/docs/v1.0.13/howto/isolate-bugs-with-bisect.txt
Comment 24 David Mathog 2007-06-19 08:00:25 UTC
(In reply to comment #22)
> David, you don't mention if you are using x86_64? Are you?

No, it's the 32-bit version.  Sorry I didn't say so at the beginning.
Comment 25 David Mathog 2007-06-19 10:50:10 UTC
(In reply to comment #23)
> @Daniel: Git-bisect is a pretty powerfull tool, but time-consuming. 

On that general point, is there some easy way to build a kernel where the
build doesn't waste 99% of its time compiling a zillion modules that aren't needed on the test system?  For instance, do an lsmod and just build those?  After many hours of building kernel pieces that aren't needed for the tests,
I'd really like to be able to do:

  make only_active_modules

(making only tens of modules)

instead of

  make all

(making thousands of modules)

For the sort of testing I'm doing to track down this bug it's nuts to build, for instance, the ATM and WiFi modules, which will never be loaded because that hardware doesn't even exist on the test system.  Building only the active modules would cut the build/test time down by a significant factor.
Comment 26 Jeff Garzik 2007-06-19 11:11:45 UTC
You can do
   1. edit .config to remove symbols you do not need
   2. run 'make oldconfig'
   3. repeat as necessary

to eliminate the stuff you don't ever use or need.
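A rough way to script this, along the lines of the lsmod idea above (illustrative only; module names do not always match their CONFIG_* symbols, so some hand-editing of .config is still needed):

   lsmod | awk 'NR>1 {print $1}' > modules-in-use.txt   # modules actually loaded on the test box
   # hand-edit .config: switch unused subsystems (ATM, wireless, ...) from =m/=y to "is not set"
   make oldconfig                                       # let kconfig resolve the remaining dependencies
   make                                                 # only the trimmed module set gets built

(Much later kernels grew "make localmodconfig", which automates exactly this lsmod-based trimming.)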
Comment 27 David Mathog 2007-06-19 11:27:28 UTC
The dump problem seems to have started very early in the 2.6.21 series.

Everything up through 2.6.20.14 does a FAST dump, whereas everything in the 2.6.21 series down to 2.6.21.0.rc3 does a SLOW dump.  I could not find a version earlier than 2.6.21.0.rc3 in the 2.6.21 series.  2.6.20.14 was built both for M586 and for K8
and it was FAST both ways.

I will attach the results next.
Comment 28 David Mathog 2007-06-19 11:28:19 UTC
Created attachment 11798 [details]
bonnie and dump results for various kernels on two machines
Comment 29 David Mathog 2007-06-19 11:30:15 UTC
CRUD, sorry about that.  The last two entries in the previous attachment should say
SLOW, not FAST.  That is, both 2.6.21.0.rc3 and 2.6.21.1 took, or were going to take, 30 minutes to dump the partition, which is SLOW.
Comment 30 David Mathog 2007-06-19 11:33:46 UTC
Created attachment 11799 [details]
.config for 2.6.20.14, K8, last kernel with FAST dump
Comment 31 David Mathog 2007-06-19 11:38:17 UTC
Created attachment 11800 [details]
.config for 2.6.21.0rc3 (K8) earliest SLOW dump
Comment 32 David Mathog 2007-06-19 11:54:04 UTC
Created attachment 11801 [details]
Summary table of bonnie and dump speed results
Comment 33 David Mathog 2007-06-19 13:48:21 UTC
Created attachment 11802 [details]
summary of tests, includes 2.6.21rc1

Found 2.6.20rc1 and tested it; it too is slow.  The updated results
are in this attachment.  Apparently the "slow dump" bug appears between
2.6.20.14 and 2.6.21rc1.
Comment 34 David Mathog 2007-06-19 15:22:42 UTC
In the preceding comment the two "2.6.20rc1" strings should have been "2.6.21rc1".
Comment 35 Tejun Heo 2007-06-20 00:49:03 UTC
Can you get the git tree and bisect further?  2.6.21-rc1 basically contains all the devel patches that went in after the 2.6.20 release, so rc1 is a really big lump.  Thanks.
Comment 36 David Mathog 2007-06-20 08:14:07 UTC
(In reply to comment #35)
> Can you get git tree and bisect further?  2.6.21rc1 basically contains all
> the
> devel patches occurred after 2.6.20 release so rc1 is a really big lump. 

I can try, but I've never used git.  

The kernel sources need to go into /usr/src, good=2.6.20.14, and bad=2.6.21rc1.  So what EXACTLY do I have to type to find the "bisect" point, and what do I enter
after that to download that version into /usr/src so that it can be
built and tested?  I did read the man page, but it's not at all clear to me
what the proper syntax is for kernel versions, especially kernel
versions between major releases, which is where you want me to look.

Thanks.
Comment 37 David Mathog 2007-06-20 14:27:11 UTC
Installed git and downloaded the linux-2.6 git tree which built
/usr/src/linux-2.6 with a .git directory in it.  

cd /usr/src/linux-2.6
git bisect start
git bisect bad v2.6.21-rc1
#the syntax for the previous one was in the man page.
#Now I start guessing...
git bisect good v2.6.20.14
Bad rev input: v2.6.20.14
git bisect good v2.6.20-14
Bad rev input: v2.6.20-14
git bisect good v2.6.20-r14
Bad rev input: v2.6.20-r14
#etc. also without the leading v to no better effect

I give up, how do I tell git the last good one corresponds to kernel
2.6.20.14???

Also, once it sets the next position, what git command extracts that
source tree into /usr/src/testkernel? (Or whatever it will be called)

Thanks
Comment 38 David Mathog 2007-06-20 14:34:28 UTC
Do I have the right git tree?

git tag -l

shows (in part)

v2.6.20
v2.6.20-rc1
v2.6.20-rc2
v2.6.20-rc3
v2.6.20-rc4
v2.6.20-rc5
v2.6.20-rc6
v2.6.20-rc7
v2.6.21
v2.6.21-rc1
v2.6.21-rc2
v2.6.21-rc3

Where are 2.6.20.1 <-> 2.6.20.14???
Comment 39 Bartlomiej Zolnierkiewicz 2007-06-20 16:56:22 UTC
"-stable" releases are maintained in separate git trees created only for a specific release i.e. 2.6.20 (=> 2.6.20.x git tree)

Please just start with 2.6.20.
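Concretely, with the good/bad range established above, the loop looks roughly like this (a sketch; the build/boot/dump-test step is whatever procedure is already in use):

   cd /usr/src/linux-2.6
   git bisect start
   git bisect bad  v2.6.21-rc1    # first known-SLOW kernel
   git bisect good v2.6.20        # the FAST 2.6.20.x stable series branches off this tag
   # git checks out a commit roughly half-way; build it, boot it, run the dump test, then:
   git bisect good                # if dump was fast on that commit
   git bisect bad                 # ...or this, if dump was slow
   # repeat until git names the first bad commit, then:
   git bisect reset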
Comment 40 David Mathog 2007-06-21 17:58:18 UTC
(In reply to comment #39)
> "-stable" releases are maintained in separate git trees created only for a
> specific release i.e. 2.6.20 (=> 2.6.20.x git tree)
> 
> Please just start with 2.6.20.

This is slow work!

Using v2.6.20 as GOOD I've bisected down to within 22 steps
of v2.6.20 starting from v2.6.21-rc1 as BAD.  It's possible that the problem
is within those 22, but it could also be that v2.6.20 in this tree is also
BAD. I had tested release 2.6.20.14 (see above) and it was "GOOD", but I have not explicitly tested what is in this git tree called  v2.6.20.  I assumed

   v2.6.20 < v2.6.20-rc* < 2.6.20.x stable releases (explicitly tested)

Is that assumption valid?  If not I'll have to start over with GOOD closer to the front of this particular git tree.
Comment 41 Bartlomiej Zolnierkiewicz 2007-06-22 01:44:56 UTC
Subject: Re:  Dump of ext3 runs very slowly

On Friday 22 June 2007, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #40 from mathog@caltech.edu  2007-06-21 17:58 -------
> (In reply to comment #39)
> > "-stable" releases are maintained in separate git trees created only for a
> > specific release i.e. 2.6.20 (=> 2.6.20.x git tree)
> > 
> > Please just start with 2.6.20.
> 
> This is slow work!
> 
> Using v2.6.20 as GOOD I've bisected down to within 22 steps
> of v2.6.20 starting from v2.6.21-rc1 as BAD.  It's possible that the problem
> is within those 22, but it could also be that v2.6.20 in this tree is also
> BAD. I had tested release 2.6.20.14 (see above) and it was "GOOD", but I have
> not explicitly tested what is in this git tree called  v2.6.20.  I assumed
> 
>    v2.6.20 < v2.6.20-rc* < 2.6.20.x stable releases (explicitly tested)
> 
> Is that assumption valid?  If not I'll have to start over with GOOD closer to

Nope. ;)

v2.6.20 < v2.6.20-rc*

v2.6.20 < 2.6.20.x

In other words 2.6.20.x is a _separate_ branch based on v2.6.20.

> the front of this particular git tree.
Comment 42 Bartlomiej Zolnierkiewicz 2007-06-22 01:49:20 UTC
Also real life example of using git-bisect command:

http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/
Comment 43 David Mathog 2007-06-22 10:07:10 UTC
(In reply to comment #41)

> 
> v2.6.20 < v2.6.20-rc*
> 
> v2.6.20 < 2.6.20.x
> 
> In other words 2.6.20.x is a _separate_ branch based on v2.6.20.

Yeah, this is becoming less fun by the minute.  

So I backed WAY up and built ebdea46fecae40c4d7effcd33f40918a37a1df4b,
which the .config says is 2.6.18.  The results for that were BAD(ish):

     DUMP: 8789900 blocks (8583.89MB) on 1 volume(s)
     DUMP: finished in 1316 seconds, throughput 6679 kBytes/sec

That is, not as BAD as the usual <5 MB/s, but MUCH worse than the normal
>25 MB/s.  This suggests two bugs.  Unfortunately it also suggests that the
patches in this git tree bear only a tenuous relationship, in their order, to
the ones in the stable releases.  To review: I already tested stable-release
variants of 2.6.20, 2.6.19, 2.6.18, and 2.6.17, and they were all FAST; only
2.6.21-rc1 and above were slow.  In this git tree I'm all the way back at
2.6.18 and it's slow (mostly).  Maybe I happened to hit a bad region, where
some other slowish bug was present?  In any case, I'm going to have to search
both up and down from this point to find where some second bug drops it the
rest of the way to <5 MB/s, and hopefully in the other direction at some point
it becomes fast again; I've not yet actually seen any kernel from this tree which
is fast.  Does this git tree maybe include some sort of debugging machinery
or other cruft in kernels built from it that might slow it down relative to
stable kernels?  I'll include the .config for this BADish kernel next.

Comment 44 David Mathog 2007-06-22 10:09:02 UTC
Created attachment 11851 [details]
.config for 2.6.18 kernel that ran at 6.7Mb/sec
Comment 45 Bartlomiej Zolnierkiewicz 2007-06-22 11:27:58 UTC
Subject: Re:  Dump of ext3 runs very slowly

On Friday 22 June 2007, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #43 from mathog@caltech.edu  2007-06-22 10:07 -------
> (In reply to comment #41)
> 
> > 
> > v2.6.20 < v2.6.20-rc*
> > 
> > v2.6.20 < 2.6.20.x
> > 
> > In other words 2.6.20.x is a _separate_ branch based on v2.6.20.
> 
> Yeah, this is becoming less fun by the minute.  
> 
> So I backed WAY up and built ebdea46fecae40c4d7effcd33f40918a37a1df4b
> which the .config says is 2.6.18.  The results for that were BAD(ish):
> 
>      DUMP: 8789900 blocks (8583.89MB) on 1 volume(s)
>      DUMP: finished in 1316 seconds, throughput 6679 kBytes/sec

This might be some old bug which has been fixed *already*.

> That is, not as BAD as the usual <5Mb/sec, but MUCH worse than the normal
> >25Mb/sec.  This suggests two bugs.  Unfortunately it also suggests that the
> patches in this git tree bear only a tenuous relationship in their order to
> the ones in the stable releases.  To review, I already tested stable releases
> variants of 2.6.20, 2.6.19, 2.6.18, and 2.6.17, and they were all FAST, only
> 2.6.21.rc1 and above were slow.  In this git tree I'm all the way back at
> 2.6.18 and it's slow (mostly).  Maybe I happened to hit a bad region, where
> some other slowish bug was present?  In any case, I'm going to have to search
> both up and down from this point to find the point where some second bug
> drops it the
> rest of the way to <5Mb, and hopefully in the other direction at some point
> it
> becomes fast again, I've not yet actually seen any kernel from this tree
> which
> is fast.  Does this git tree maybe include some sort of debugging methodology
> or other cruft in kernels built with it that might slow it down relative to 
> stable kernels?  I'll include the .config for this BADish kernel next.

Simplified description of the kernel development model:

* There is one main git tree, managed by Linus, where *all* changes are committed.

* From time to time Linus says "this is the 2.6.x kernel" and tags the git tree
  with the appropriate tag.  The 2.6.x.y stable release tree is branched off this tag.

* There is a rule that *only* changes that were committed by Linus to his tree
  can be merged into 2.6.x.y, so a situation where 2.6.20.x contains some
  changes that are not in Linus' kernels > 2.6.20 cannot happen (in theory at
  least).

Let's combine the above with the results so far.  You say that all variants
of 2.6.20.y stable were FAST and only kernels from 2.6.21-rc1 onward are SLOW, so you
don't really need to git-bisect from 2.6.18; you can start from 2.6.20.

PS: I've seen some discussion on the linux-kernel mailing list about a change
in the /proc/sys/vm/dirty_ratio setting which may be related to this issue
(sorry, I don't have time to dig into it deeper ATM).
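For anyone wanting to test that hypothesis quickly, the setting can be read and changed at runtime; the value below is only an example:

   cat /proc/sys/vm/dirty_ratio        # current value, in percent of RAM
   echo 40 > /proc/sys/vm/dirty_ratio  # try a larger value, then rerun the dump test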
Comment 46 Bartlomiej Zolnierkiewicz 2007-06-22 11:45:18 UTC
Subject: Re:  Dump of ext3 runs very slowly

On Friday 22 June 2007, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #44 from mathog@caltech.edu  2007-06-22 10:09 -------
> Created an attachment (id=11851)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=11851&action=view)
> .config for 2.6.18 kernel that ran at 6.7Mb/sec

By comparing this config with the 2.6.20.14 one I discovered this:

-CONFIG_DEFAULT_IOSCHED="cfq"
+CONFIG_DEFAULT_IOSCHED="anticipatory"

So this 2.6.18 config uses the CFQ I/O scheduler by default, while the 2.6.20.14
one uses the AS (anticipatory) I/O scheduler.  I also checked the other "SLOW" configs (for 2.6.21.1
and 2.6.21-rc3) and they also use CFQ by default.

You may try to change I/O scheduler dynamically by using

# echo "anticipatory" > /sys/block/hda/queue/scheduler

in some known-to-be-SLOW kernels (e.g. 2.6.21.1) and see if it helps.
Comment 47 David Mathog 2007-06-22 13:53:53 UTC
(In reply to comment #46)
> By comparing this config with the 2.6.20.14 one I discovered this:
> 
> -CONFIG_DEFAULT_IOSCHED="cfq"
> +CONFIG_DEFAULT_IOSCHED="anticipatory"

Good catch!  

That turns out to be the problem.  The 2.6.21.1 .config that came
with the kernel-linus-source from Mandriva had cfq as the default.  The .config
from earlier kernels used a default of "anticipatory".  The next attachment
shows the results for all 4 IOSCHED modes under 2.6.21.1.
It turns out that "cfq" is the only "slow" IOSCHED mode for the dump test, and
that another I/O test (bonnie++) shows only minimal differences.  So
I guess the question now is: what is the nature of the interaction between "cfq"
and dump of an ext3 file system that results in the 5X loss of throughput to
/dev/null or a network connection?  Or perhaps the scheduling problem is between dump and the pipe?
Comment 48 David Mathog 2007-06-22 13:54:36 UTC
Created attachment 11853 [details]
dump and bonnie tests for different IOSCHED defaults
Comment 49 David Mathog 2007-06-22 14:25:49 UTC
The cfq with dump problem apparently goes way back.  It was accidentally
triggered in the git bisect here:

ebdea46fecae40c4d7effcd33f40918a37a1df4b

and intentionally triggered here:

4522d58275f124105819723e24e912c8e5bf3cdd

and finally, I tested it all the way back at Mandriva's 2.6.17-14mdv kernel,
which normally runs "anticipatory". But...

echo "cfq" > /sys/block/hda/queue/scheduler
dump -0 -f /dev/null /dev/hda5 2>&1 | tee dump.txt
  DUMP: dumping (Pass IV) [regular files]
  DUMP: 16.56% done at 4837 kB/s, finished in 0:25
I.e., slow.

I have an even older system with Ultra320 SCSI disks and kernel 2.6.17.4.
/dev/sdc1 is an ext3 file system.

echo "anticipatory" > /sys/block/sdc/queue/scheduler
dump -0 -f /dev/null /dev/sdc1 2>&1 | tee dump_anticipatory.txt
  DUMP: 77.67% done at 63470 kB/s, finished in 0:01
echo "cfq" >/sys/block/sdc/queue/scheduler
dump -0 -f /dev/null /dev/sdc1 2>&1 | tee dump_anticipatory.txt
  DUMP: 75.87% done at 62212 kB/s, finished in 0:01

So on the oldest kernel I have running here "cfq" is just a hair slower
than "anticipatory".  Either the bug doesn't go that far back or SCSI disks aren't affected the same way PATA ones are. 
Comment 50 Jens Axboe 2007-06-23 03:44:02 UTC
Created attachment 11857 [details]
Always schedule dispatch on request completion

It'd be nice if you could test 2.6.22-rc5 and see if that behaves differently, and 2.6.21 with the attached patch as well.

I'll be gone for the next two weeks, so I cannot look at this issue again before the 9th of July.
Comment 51 David Mathog 2007-07-10 11:36:40 UTC
Jens,
  
The patch you supplied results in this piece of code:

		if (cfq_slice_used(cfqq))
			cfq_slice_expired(cfqd, 0, 1);
		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list)) {
			if (!cfq_arm_slice_timer(cfqd))
				cfq_schedule_dispatch(cfqd);
		}
		cfq_schedule_dispatch(cfqd); /* added by patch */

It does NOT affect the slow dump problem:

dump -0 -f /dev/null /dev/hda5 2>&1 | tee dump1.txt
  DUMP: 15.03% done at 4399 kB/s, finished in 0:28

Your patch would seem in some instances to result in two calls to cfq_schedule_dispatch.  Was that your intention?  Or is this what you intended:

		if (cfq_slice_used(cfqq))
			cfq_slice_expired(cfqd, 0, 1);
		else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list)) {
			(void) cfq_arm_slice_timer(cfqd);
		}
		cfq_schedule_dispatch(cfqd); /* added by patch */

I tried the latter form on 2.6.21.1 as well, just in case, but it did
NOT speed things up either:

 dump -0 -f /dev/null /dev/hda5 2>&1
...
DUMP: 14.66% done at 4292 kB/s, finished in 0:29

Next???

Thanks
Comment 52 henkjans_bagger 2007-08-24 11:30:21 UTC
Sorry to jump in again, but it seems that this bug goes in the same direction as what I experienced in bug #7372: a longstanding problem that gets worse if one uses cfq as the scheduler.  As a reminder: in my case I found that the problem with cfq disappeared when I disabled NCQ for the SATA drive by setting the queue length to 1. 

David does not have a SATA drive, but the similarities between the bug reports are striking.  Assuming that PATA drives also have a queue, could it be that a wrong assumption in the queue processing is causing this?

As a reminder my test-results: http://bugzilla.kernel.org/show_bug.cgi?id=7372#c112
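For reference, on SATA/SCSI devices the queue length mentioned above can be capped at runtime; sda below is illustrative:

   cat /sys/block/sda/device/queue_depth       # current NCQ/TCQ queue depth
   echo 1 > /sys/block/sda/device/queue_depth  # effectively disables queueing for a quick test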
Comment 53 Todd 2008-01-05 22:00:19 UTC
Hi All,

I did the 
    echo "anticipatory" > /sys/block/sda/queue/scheduler
And it fixed my dump problems.  Thank you!

Comment 52: I have two servers out there with this problem.
Both dump to an Ecrix/Exabyte/Tandberg VXA3 tape drive
through an Adaptec 19160 SCSI-U160 LVDS Controller.

Does this help your troubleshooting?

-T
Comment 54 Andrew Haley 2008-01-16 07:44:32 UTC
After "echo anticipatory > /sys/block/sdc/queue/scheduler"

  DUMP: dumping (Pass IV) [regular files]
  DUMP: 4.34% done at 48278 kB/s, finished in 1:50

Incredible.  This was about 5000 kB/s before the change.

I suppose this must have something to do with the way that dump uses multiple processes to overlap its I/O.  Everything I've tried other than dump seems to work perfectly well with cfq.
Comment 55 Jens Axboe 2008-01-16 12:41:12 UTC
If dump is indeed using several threads to do IO in the same location, then that will currently run a lot slower with CFQ. I have some patches that need a bit of testing that will fix that up, basically one set for io context sharing and one for cooperating processes/threads.

Let me see if I can work something out tomorrow.
Comment 56 Todd 2008-02-05 10:17:15 UTC
Hi All,

I used the workaround to cure my slow dump to an eSATA drive.

But, an identically configured server with an Adaptec
19160 SCSI-U160 LVDS Controller and an Ecrix/Exabyte/Tandberg/{insert
owner of the month here} VXA-3 tape drive is still very slow.
The second dump takes up to 10 hours (should take 2).

From my debug report (first of two dumps -- the small one):

   echo anticipatory > /sys/block/sda/queue/scheduler
   anticipatory
   sync; sync
   /sbin/dump -0a -f /dev/nst0 /dev/sda3
      DUMP: Date of this level 0 dump: Mon Feb  4 23:03:13 2008
      DUMP: Dumping /dev/sda3 (/export) to /dev/nst0
   ...
      DUMP: Volume 1 20215300 blocks (19741.50MB)
      DUMP: Volume 1 took 2:10:01
      DUMP: Volume 1 transfer rate: 2591 kB/s

Should be 8000 kB/s (based on a third identical server running
CentOS 4.5).

Does anyone know of a workaround for VXA-3 dumps?

Many thanks,
-T
Comment 57 Todd 2008-03-12 17:18:54 UTC
Hi All,

Alan, over on the CentOS forum, pointed out this workaround to me:

Changed my /boot/grub/grub.conf kernel line to
     	kernel /boot/vmlinuz-2.6.18-53.1.14.el5 ro root=LABEL=/1 rhgb quiet elevator=as

rebooted, and got:

          $ cat /sys/block/sda/queue/scheduler
          noop [anticipatory] deadline cfq

          $ cat /sys/block/sdb/queue/scheduler
          noop [anticipatory] deadline cfq 

Hope this helps in the troubleshooting.

Reference:
     http://www.redhat.com/magazine/008jun05/features/schedulers/

-T
   
Comment 58 Jens Axboe 2008-03-13 04:59:50 UTC
It'll get fixed when enough time is available. This is really dump doing something stupid, but CFQ should cope with it better.
Comment 59 Jeff Moyer 2008-10-27 11:01:23 UTC
Created attachment 18473 [details]
Detect processes issuing I/O within the average seek distance of one another

This patch, based on some of the close cooperator patches that have been merged and reverted in the past, addresses the dump(8) performance issue in my test setup.  I'd sure appreciate it if others could test this and report their results.  Performance results for all workloads are desirable, as we don't want to introduce regressions elsewhere.

Cheers!
Comment 60 Jeff Moyer 2008-11-24 07:37:15 UTC
Created attachment 19002 [details]
Fixed version of the previous patch.

This version of the patch has a couple of fixes from Jens.
Comment 61 Alan 2009-03-23 11:09:45 UTC
Was never applied - what is the status on this ?
Comment 62 Jens Axboe 2009-03-23 12:11:19 UTC
Status is that it should be fixed in the app; it's really behaving in a sub-optimal manner (spreading interleaved IO out to threads is just dumb).  Jeff has a patch for dump that uses CLONE_IO to make it work as fast as it should.  In a better life, dump's misplaced attempt at optimizing the IO pattern would just be killed, imho.  But Jeff's patch is simpler, works, and is available :-)
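For illustration only (this is not Jeff's actual dump patch): the point of CLONE_IO is that the helper processes dump forks can share one I/O context, so CFQ sees a single sequential stream instead of several competing ones.  A minimal sketch, assuming Linux >= 2.6.25 and glibc's clone() wrapper; the worker body is a placeholder:

/* Minimal illustration: spawn a worker that shares the parent's I/O context
 * via CLONE_IO so the I/O scheduler treats both processes as one stream. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_IO
#define CLONE_IO 0x80000000     /* from <linux/sched.h>, in case libc headers lack it */
#endif

static int worker(void *arg)
{
        /* ... read blocks of the filesystem here, as a dump helper would ... */
        return 0;
}

int main(void)
{
        const size_t stksz = 64 * 1024;
        char *stack;
        pid_t pid;

        stack = malloc(stksz);
        if (!stack)
                return 1;

        /* CLONE_IO: share the io_context; SIGCHLD: child is waitable like a fork()ed process. */
        pid = clone(worker, stack + stksz, CLONE_IO | SIGCHLD, NULL);
        if (pid < 0) {
                perror("clone");
                return 1;
        }
        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
}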
Comment 63 Jens Axboe 2009-03-23 12:11:43 UTC
Closing
Comment 64 Jeff Moyer 2009-03-23 12:25:29 UTC
Regarding dump(8), Jens expressed that he would rather see the dump utility modified to share I/O contexts.  I floated a patch for that to the dump-devel mailing list, but heard no response from the maintainer.  I just reposted another version which fixes some compile issues when CLONE_IO isn't present on the system.

I'll note that this problem exists not just for dump(8).  We see similar issues with streaming writes from a single NFS client to an NFS server running with >1 nfsd thread.  Similar issues are also present in at least one iSCSI target mode implementation.  I've also seen yet another instance of this in a system configured with a 3rd party volume manager.

For the NFS server issue, I'm working on a patch to create an I/O context per NFS client, so that multiple threads issuing I/O on behalf of a single client will share it.  We'll see how far that goes in addressing the problem.

For the iSCSI target mode implementation, the authors also modified their code to share I/O contexts, with decent results.

This still leaves other third-party applications that perform I/O in this way.  We can either ignore the problem, or investigate pushing something like what I proposed into CFQ.  I'd have to give more numbers on the additional overhead introduced, I think.  Personally, I think we should address this issue, as there are enough independent cases to verify that this is something people actually do.  If we don't, people will simply change I/O schedulers.