Most recent kernel where this bug did not occur: 2.6.17-14mdv
Distribution: Mandriva 2007.1
Hardware Environment: Athlon64 3700+, 1G RAM, WDC WD1600SB-01KBA0 (PATA disk), Asus A8N5X motherboard
Software Environment: dump 0.4b41-2dmv2007.0

Problem Description:
With kernel 2.6.21.1, dump of a mounted ext3 filesystem is very, very slow; it runs at less than 5 MB/sec. With older kernels dump runs more than 5 times faster.

Steps to reproduce:
1. Boot the system with the chosen kernel.
2. Mount the ext3 filesystem (in this case it is "/", including all of Mandriva 2007.1 except for "/boot").
3. dump -0 -f /dev/null /dev/hda5

Results:

kernel        elapsed(s)  rate (KB/s)  dumped (MB)
2.6.17-14mdv         289        26578      7501.07
2.6.21.1            1800         4498      7911.29

A few files were deleted on the disk between tests, which is why the size isn't quite the same. The dump on 2.6.21.1 takes so long that I didn't run it to completion again after removing those files, but it said things like "expected completion in 28 minutes". I did not round the time; dump really reported 1800 s.
Created attachment 11763 [details] .config used to build 2.6.21.1
Proof that /dev/null is not rate limiting. On 2.6.21.1:

% dd bs=512 count=1000000 if=/dev/zero of=/dev/null
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB) copied, 1.24117 seconds, 413 MB/s
There are no errors or warnings in /var/log/messages to indicate a problem. 2.6.21.1 dumped slowly even when it was booted failsafe, with the network started and a single inbound connection made to run the test. I also removed all the cpufreq modules and reran the test; it was no faster.
It could be that the IDE driver got slower. Try using hdparm to measure the disk's transfer speed and compare that with the earlier kernel. Please see http://lkml.org/lkml/2007/5/31/181 - perhaps it's this. Please attach to this report the output from `dmesg -s 1000000' for both kernel versions. Thanks.
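A minimal comparison along those lines (a sketch only; it assumes the disk is /dev/hda as in this report, with the same commands run under each kernel):

hdparm -i /dev/hda      # lists the supported DMA/UDMA modes, with '*' marking the active one
hdparm -tT /dev/hda     # cached and buffered read timings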
On Sat, 16 Jun 2007 11:36:53 -0400 Jeff Garzik <jeff@garzik.org> wrote:

> Andrew Morton wrote:
> > So I revisited [...]
> > http://bugzilla.kernel.org/show_bug.cgi?id=8636
> >
> > I don't see why it's not-a-bug. The second guy (Stephen Clark) had to toss
> > out the FC6 kernel and build his own kernel to fix this regression.
>
> (I hope you don't mind me copying linux-ide)
>
> The combined mode removal changed the driver load configuration. Some
> distros updated with that knowledge, but apparently Fedora did not.
>
> Combined mode would explicitly reserve ports for libata in pci/quirks.c,
> a horrible layering violation that created special case module load
> order dependencies.
>
> The change made libata and the old-IDE driver behave like normal Linux
> drivers. They can both grab an entire PCI device and drive it
> correctly, so -- just like every other situation where two drivers can
> match the same PCI ID -- the one that loads first wins.
>
> Kernel configs need to be tweaked a bit, due to this change. That's why
> it was communicated in advance (but poorly, as it appears from your
> questions and existing bug reports). Two common results appear in bug
> reports:
>
> 1) One possible result of a complete lack of kconfig tweaking (blindly
> hitting <enter> during 'make oldconfig') is falling back to the
> most-compatible configuration, the legacy IDE driver (either libata or
> old-IDE), with resultant slow performance.
>
> 2) "no root! panic!" and similar I-cant-find-your-hard-drive results for
> people with hardcoded root= configurations, for the minority where a
> device moved from /dev/hdX -> /dev/sdX, or vice versa.
>
> The damage is hoped to be limited to:
>
> * Intel ICH5/6[/7?] users with combined mode enabled, which is a
> not-small subset of all ICH[567] users.
>
> * Users that did /not/ choose the combined_mode=libata kernel command
> line option, a popular option for restoring performance /broken/ by
> running two drivers in tandem [i.e. the old way, recently removed].
>
> * In the combined mode configuration, one device is /dev/hdX (often the
> CD-ROM) and one device is /dev/sdX, so only one of those devices will
> move. Standard LVM and mount-by-label/uuid found in standard distro
> installs makes this move seamless for many.
>
> * ...in distros that hopefully took stock of their compatibility
> picture, and modified their kernel configs and drivers accordingly.
> Some distros were defaulting to combined_mode=libata (==libata drives
> all applicable Intel ICHx IDE devices), others were not. That affects
> the decision about kernel config changes.
>
> Jeff
Yeap, it seems like a transfer mode configuration problem. Please report the result of 'dmesg -s 1000000'.
(In reply to comment #4)
> Try using hdparm to measure the disk's transfer speed and compare that with
> the earlier kernel.

hdparm /dev/hda on both kernels shows:

 multcount    = 16 (on)
 IO_support   = 1 (32-bit)
 unmaskirq    = 1 (on)
 using_dma    = 1 (on)
 keepsettings = 0 (off)
 readonly     = 0 (off)
 readahead    = 256 (on)
 geometry     = 19457/255/63, sectors = 312581808, start = 0

On 2.6.21.1:

hdparm -tT /dev/hda
/dev/hda:
 Timing cached reads:   1622 MB in 2.00 seconds = 811.34 MB/sec
 Timing buffered disk reads:  172 MB in 3.02 seconds = 56.94 MB/sec

On 2.6.17-14mdv:

hdparm -tT /dev/hda
/dev/hda:
 Timing cached reads:   1700 MB in 2.00 seconds = 850.66 MB/sec
 Timing buffered disk reads:  172 MB in 3.03 seconds = 56.74 MB/sec

So it isn't disk speed per se; it's something related to dump or ext3.
Created attachment 11782 [details] output of: dmesg -s 1000000
Created attachment 11783 [details]
dmesg -s 100000 from 2.6.17-14mdv system

This dmesg was actually generated on a second machine. However, the second machine had hardware identical to the first, its hda5 partition had been cloned from that of the first machine, and the second machine also showed the slow dump problem when booted with the 2.6.21.1 kernel, but not with this kernel.
Yeah, nothing seems wrong with the IDE driver. Can you run bonnie on the filesystem and report the result?
(In reply to comment #10)
> Yeah, nothing seems wrong with the IDE driver. Can you run bonnie on the
> filesystem and report the result?

Hope bonnie++ is ok, that's running now. I'm using commands like this, please let me know if it should be different:

bonnie++ -u 500:500 -d /tmp 2>&1 | tee bonnie_2.6.17-14mdv.txt

Here are the results for the older kernel:

cat bonnie_2.6.17-14mdv.txt
Using uid:500, gid:500.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
saf02.bio.caltec 2G 45099  91 58981  17 22377   5 37811  58 50173   3 226.5   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
saf02.bio.caltech.edu,2G,45099,91,58981,17,22377,5,37811,58,50173,3,226.5,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

Here are the results for the 2.6.21.1 kernel (just the bits at the end; the status lines were all the same as above):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
saf03.bio.caltec 2G 20433  68 40938  47 18901  10 45423  82 48388   5 187.7   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1361  97 +++++ +++ 26532 100  1449  98 +++++ +++  3361  97
saf03.bio.caltech.edu,2G,20433,68,40938,47,18901,10,45423,82,48388,5,187.7,0,16,1361,97,+++++,+++,26532,100,1449,98,+++++,+++,3361,97

Again, these were done on two different machines, but with identical hardware and a cloned /dev/hda5, just with different boot kernels. For bonnie++, 2.6.21.1 is slower than 2.6.17-14mdv, but while I see a factor of 2.2 (20433 vs 45099) I don't see a factor of 5 in here to account for the incredibly slow dump speed.
Hmm, this is bizarre. df -k is showing significantly different disk usage on 2.6.21.1 versus 2.6.17-14mdv. I tracked it all the way down to single directories, all of which (as far as I can tell) look like this:

2.6.21.1:

[root@saf03 tmp]# du -k /Pmw/Pmw_1_2/bin
6       /Pmw/Pmw_1_2/bin
[root@saf03 tmp]# ls -alR /Pmw/Pmw_1_2/bin
/Pmw/Pmw_1_2/bin:
total 7
drwxr-xr-x 2 root root 1024 May 23 11:57 ./
drwxr-xr-x 7 root root 1024 May 23 11:57 ../
-rwxr-xr-x 1 root root 4301 May 16 14:07 bundlepmw.py*
[root@saf03 tmp]# stat /Pmw/Pmw_1_2/bin/bundlepmw.py
  File: `/Pmw/Pmw_1_2/bin/bundlepmw.py'
  Size: 4301        Blocks: 10         IO Block: 1024   regular file
Device: 305h/773d   Inode: 6397963     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-05-16 14:07:51.000000000 -0700
Modify: 2007-05-16 14:07:51.000000000 -0700
Change: 2007-05-23 11:57:44.000000000 -0700

2.6.17-14mdv:

[root@saf02 /]# du -k /Pmw/Pmw_1_2/bin
12      /Pmw/Pmw_1_2/bin
[root@saf02 /]# ls -alR /Pmw/Pmw_1_2/bin
/Pmw/Pmw_1_2/bin:
total 16
drwxr-xr-x 2 root root 4096 May 23 11:57 ./
drwxr-xr-x 7 root root 4096 May 23 11:57 ../
-rwxr-xr-x 1 root root 4301 May 16 14:07 bundlepmw.py*
[root@saf02 /]# stat /Pmw/Pmw_1_2/bin/bundlepmw.py
  File: `/Pmw/Pmw_1_2/bin/bundlepmw.py'
  Size: 4301        Blocks: 16         IO Block: 4096   regular file
Device: 305h/773d   Inode: 2599978     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-06-15 04:14:48.000000000 -0700
Modify: 2007-05-16 14:07:51.000000000 -0700
Change: 2007-06-15 04:21:22.000000000 -0700

Notice the difference in IO Block, which is 4096 on the older one and 1024 on the newer one. Just to be sure that isn't an issue with the cloning, I rebooted the machine running the newer kernel into the older kernel and repeated the commands. The following are all still on saf03, but running 2.6.17-14mdv:

[root@saf03 ~]# du -k /Pmw/Pmw_1_2/bin
6       /Pmw/Pmw_1_2/bin
[root@saf03 ~]# ls -alR /Pmw/Pmw_1_2/bin
/Pmw/Pmw_1_2/bin:
total 7
drwxr-xr-x 2 root root 1024 May 23 11:57 ./
drwxr-xr-x 7 root root 1024 May 23 11:57 ../
-rwxr-xr-x 1 root root 4301 May 16 14:07 bundlepmw.py*
[root@saf03 ~]# stat /Pmw/Pmw_1_2/bin/bundlepmw.py
  File: `/Pmw/Pmw_1_2/bin/bundlepmw.py'
  Size: 4301        Blocks: 10         IO Block: 4096   regular file
Device: 305h/773d   Inode: 6397963     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2007-05-16 14:07:51.000000000 -0700
Modify: 2007-05-16 14:07:51.000000000 -0700
Change: 2007-05-23 11:57:44.000000000 -0700

Ah, so there is a cloning issue: the sizes on this machine are the same no matter which kernel, EXCEPT for the IO Block, which is 4096 with the older kernel and 1024 with the newer one. So I guess I better rerun bonnie++ on the same machine. Same command as above, saf03, using the older kernel.
(IGNORE the SAF02 results, there is another factor involved)

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
saf03.bio.caltec 2G 30306  73 53393  37 19822   7 26392  43 48085   4 208.3   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1707  95 +++++ +++ +++++ +++  1877  97 +++++ +++  5402  93
saf03.bio.caltech.edu,2G,30306,73,53393,37,19822,7,26392,43,48085,4,208.3,0,16,1707,95,+++++,+++,+++++,+++,1877,97,+++++,+++,5402,93

Apparently when saf03 was set up by the Mandriva 2007.1 installer the ext3 block size was not set to the default of 4096, but used XXXX instead, whereas when that partition was cloned to saf02 I know I used the default blocksize, which turned out to be 4096. I'll attach dumpe2fs for the two systems next.
Created attachment 11785 [details] dumpe2fs from saf03
Created attachment 11786 [details] dumpe2fs from saf02
Just double-checked: even on saf02, with different block sizes than saf03, dump is still very slow under 2.6.21.1 (about half an hour; I quit after it said "done in about 27 minutes") and still fast under 2.6.17-14mdv:

  DUMP: Volume 1 transfer rate: 37528 kB/s
  DUMP: 8218750 blocks (8026.12MB) on 1 volume(s)
Let me modify the previous statement slightly. With the default block sizes used on /dev/hda5 on saf02, the dump on the older kernel is actually a good bit faster than it was with the same (old) kernel and the smaller block sizes on saf03:

kernel        elapsed(s)  rate (KB/s)  dumped (MB)  machine
2.6.21.1            1800         4498      7911.29  SAF03
2.6.17-14mdv         289        26578      7501.07  SAF03
2.6.17-14mdv         219        37528      8026.12  SAF02

Apparently the default mkfs.ext3 settings work better with dump than the settings Mandriva 2007.1 used. I don't think this has anything to do with the really slow dump under 2.6.21.1; this is just a typical block-size performance difference.
On the off chance that this problem had been fixed since 2.6.21.1 I tried 2.6.21.5. Used .config from 2.6.21.1, did "make oldconfig". Then completed building the new kernel, installed it, booted to it and... dump is still slow. The run was aborted after dump emitted a "done in 27 minutes" message.
Hi,

Using git-bisect seems to be the most promising way to identify the source of this regression (starting "good" kernel == 2.6.17, starting "bad" kernel == 2.6.21, test case == bonnie++). Once we know which change caused the slowdown it should be (relatively) easy to fix it.

Please give git-bisect a try.
(In reply to comment #18)
> Using git-bisect seems to be the most promising way to identify the source
> of this regression (starting "good" kernel == 2.6.17, starting "bad" kernel
> == 2.6.21, test case == bonnie++).

Ugh, this is going to take a while. OK, I'll try to narrow it down using trinary search (two machines can be tested at once) using the "kernel-linus-source" Mandriva rpm's.

2.6.21.1 and 2.6.21.5 were compiled for K8; the default is M586. That's one more variable. I don't have a feel for this: what are the odds that changing the CPU target is playing a role? If I have to test both ways all the way through, this is going to take even longer. Thanks.
> Ugh, this is going to take a while. OK, I'll try to narrow it down using
> trinary search (two machines can be tested at once) using the
> "kernel-linus-source" Mandriva rpm's.

I'm not familiar with Mandriva's kernel source rpms, but it sounds like this is going to be a manual search instead of a pseudo-automatic one (with a git tree and git-bisect)? BTW, git-nary-search would be a cool feature request for git ;)

> 2.6.21.1 and 2.6.21.5 were compiled for K8; the default is M586. That's one
> more variable. I don't have a feel for this: what are the odds that changing
> the CPU target is playing a role? If I have to test both ways all the way
> through, this is going to take even longer. Thanks.

This is quite unlikely, and testing just once, i.e. 2.6.21.5 for M586, should be enough to verify it.
This sounds pretty similar to bug #7372. http://bugzilla.kernel.org/show_bug.cgi?id=7372 It's a long-running bug about which we still don't have much of an idea. It looks like scheduling is involved somehow.
David, you don't mention if you are using x86_64 - are you?

[Why is the arch not a field in the bug report summary on top? It's one of the most basic (not necessarily most important) things in a kernel report.]
@Daniel: Git-bisect is a pretty powerful tool, but time-consuming. Seeing that your last working kernel is 2.6.17.14 (the same as mine) and that Tejun already noticed the similarity with bug #7372, it might be possible to save a lot of time. In that bug I performed a git-bisect which gave me a bad-commit ID, see http://bugzilla.kernel.org/show_bug.cgi?id=7372#c42

It might be good to check whether your problem originates from the same commit, as that saves you the trouble/time of doing the git-bisect yourself. This would potentially be one reboot vs. ~10 for a complete git-bisect. You can do this check by reverting the bad commit I found: 6edad161cd4dfe1df772e7a74ab63cab53b5e8c1

For that you use:

git checkout master
git revert <bad-commit-id>

See http://www.kernel.org/pub/software/scm/git/docs/v1.0.13/howto/isolate-bugs-with-bisect.txt
(In reply to comment #22)
> David, you don't mention if you are using x86_64 - are you?

No, it's the 32-bit version. Sorry I didn't say so at the beginning.
(In reply to comment #23)
> @Daniel: Git-bisect is a pretty powerful tool, but time-consuming.

On that general point, is there some easy way to build a kernel where the build doesn't waste 99% of its time compiling a zillion modules that aren't needed on the test system? For instance, do an lsmod, and just build those? After many hours building kernel pieces that aren't needed for the tests, I'd really like to be able to do:

  make only_active_modules    (making only tens of modules)

instead of

  make all                    (making thousands of modules)

For the sort of testing I'm doing to track down this bug it's nuts to be building, for instance, the atm and WiFi modules, which will never be loaded because that hardware doesn't even exist on the test system. That would cut the build/test time down by a significant factor.
You can do:

1. edit .config to remove symbols you do not need
2. run 'make oldconfig'
3. repeat as necessary to eliminate the stuff you don't ever use or need.
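A rough sketch of that pruning pass from the shell (an illustration only; it assumes the test box already has every module it needs loaded, so lsmod can serve as a checklist for the hand edit):

lsmod | awk 'NR > 1 {print $1}' | sort > /tmp/modules-in-use
# Hand-edit .config: turn off (=n, or "# ... is not set") the driver options
# whose modules do not appear in /tmp/modules-in-use, then let kconfig
# resolve the dependencies and rebuild only what is left:
make oldconfig
make -j2 bzImage modules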
The dump problem seems to have started very early in the 2.6.21 series. Everything up to 2.6.20.14 does a FAST "dump", whereas everything down to 2.6.21.0.rc3 does a SLOW "dump". I could not find an earlier version in the 2.6.21 series than 2.6.21.0.rc3. 2.6.20.14 was built both with m586 and K8 and it was FAST both ways. I will attach the results next.
Created attachment 11798 [details] bonnie and dump results for various kernels on two machines
CRUD, sorry about that. The last two entries in the previous attachment should say SLOW, not FAST. That is, both 2.6.21.0.rc3 and 2.6.21.1 took, or were going to take, 30 minutes to dump the partition, which is SLOW.
Created attachment 11799 [details] .config for 2.6.20.14, K8, last kernel with FAST dump
Created attachment 11800 [details] .config for 2.6.21.0rc3 (K8) earliest SLOW dump
Created attachment 11801 [details] Summary table of bonnie and dump speed results
Created attachment 11802 [details]
summary of tests, includes 2.6.21rc1

Found 2.6.20rc1 and tested it; it too is slow. The updated results are in this attachment. Apparently the "slow dump" bug appears between 2.6.20.14 and 2.6.21rc1.
In the preceding comment the two "2.6.20rc1" strings should have been "2.6.21rc1".
Can you get the git tree and bisect further? 2.6.21-rc1 basically contains all the devel patches that went in after the 2.6.20 release, so rc1 is a really big lump. Thanks.
(In reply to comment #35)
> Can you get the git tree and bisect further? 2.6.21-rc1 basically contains
> all the devel patches that went in after the 2.6.20 release, so rc1 is a
> really big lump.

I can try, but I've never used git. The kernel sources need to go into /usr/src, good=2.6.20.14, and bad=2.6.21rc1, so what EXACTLY do I have to type to find the "bisect" point, and what do I enter after that to get that version into /usr/src so that it can be built and tested? I did read the man page, but it's not at all clear to me what the proper syntax is for kernel versions, especially kernel versions between major releases, which is where you want me to look. Thanks.
Installed git and downloaded the linux-2.6 git tree, which created /usr/src/linux-2.6 with a .git directory in it.

cd /usr/src/linux-2.6
git bisect start
git bisect bad v2.6.21-rc1
#the syntax for the previous one was in the man page.
#Now I start guessing...
git bisect good v2.6.20.14
Bad rev input: v2.6.20.14
git bisect good v2.6.20-14
Bad rev input: v2.6.20-14
git bisect good v2.6.20-r14
Bad rev input: v2.6.20-r14
#etc., also without the leading v, to no better effect

I give up. How do I tell git that the last good one corresponds to kernel 2.6.20.14??? Also, once it sets the next position, what git command extracts that source tree into /usr/src/testkernel? (Or whatever it will be called.) Thanks
Do I have the right git tree? git tag -l shows (in part):

v2.6.20
v2.6.20-rc1
v2.6.20-rc2
v2.6.20-rc3
v2.6.20-rc4
v2.6.20-rc5
v2.6.20-rc6
v2.6.20-rc7
v2.6.21
v2.6.21-rc1
v2.6.21-rc2
v2.6.21-rc3

Where are 2.6.20.1 <-> 2.6.20.14???
"-stable" releases are maintained in separate git trees created only for a specific release i.e. 2.6.20 (=> 2.6.20.x git tree) Please just start with 2.6.20.
(In reply to comment #39)
> "-stable" releases are maintained in separate git trees created only for a
> specific release, i.e. 2.6.20 (=> the 2.6.20.x git tree).
>
> Please just start with 2.6.20.

This is slow work!

Using v2.6.20 as GOOD I've bisected down to within 22 steps of v2.6.20, starting from v2.6.21-rc1 as BAD. It's possible that the problem is within those 22, but it could also be that v2.6.20 in this tree is also BAD. I had tested release 2.6.20.14 (see above) and it was GOOD, but I have not explicitly tested what this git tree calls v2.6.20. I assumed

  v2.6.20 < v2.6.20-rc* < 2.6.20.x stable releases (explicitly tested)

Is that assumption valid? If not, I'll have to start over with GOOD closer to the front of this particular git tree.
Subject: Re: Dump of ext3 runs very slowly

On Friday 22 June 2007, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #40 from mathog@caltech.edu 2007-06-21 17:58 -------
> (In reply to comment #39)
> > "-stable" releases are maintained in separate git trees created only for a
> > specific release, i.e. 2.6.20 (=> the 2.6.20.x git tree).
> >
> > Please just start with 2.6.20.
>
> This is slow work!
>
> Using v2.6.20 as GOOD I've bisected down to within 22 steps of v2.6.20,
> starting from v2.6.21-rc1 as BAD. It's possible that the problem is within
> those 22, but it could also be that v2.6.20 in this tree is also BAD. I had
> tested release 2.6.20.14 (see above) and it was GOOD, but I have not
> explicitly tested what this git tree calls v2.6.20. I assumed
>
>   v2.6.20 < v2.6.20-rc* < 2.6.20.x stable releases (explicitly tested)
>
> Is that assumption valid? If not, I'll have to start over with GOOD closer
> to the front of this particular git tree.

Nope. ;)

v2.6.20 < v2.6.20-rc*

v2.6.20 < 2.6.20.x

In other words 2.6.20.x is a _separate_ branch based on v2.6.20.
Also, a real-life example of using the git-bisect command: http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/
(In reply to comment #41)
> v2.6.20 < v2.6.20-rc*
>
> v2.6.20 < 2.6.20.x
>
> In other words 2.6.20.x is a _separate_ branch based on v2.6.20.

Yeah, this is becoming less fun by the minute.

So I backed WAY up and built ebdea46fecae40c4d7effcd33f40918a37a1df4b, which the .config says is 2.6.18. The results for that were BAD(ish):

  DUMP: 8789900 blocks (8583.89MB) on 1 volume(s)
  DUMP: finished in 1316 seconds, throughput 6679 kBytes/sec

That is, not as BAD as the usual <5 MB/sec, but MUCH worse than the normal >25 MB/sec. This suggests two bugs. Unfortunately it also suggests that the patches in this git tree bear only a tenuous relationship, in their order, to the ones in the stable releases. To review: I already tested stable-release variants of 2.6.20, 2.6.19, 2.6.18, and 2.6.17, and they were all FAST; only 2.6.21-rc1 and above were slow. In this git tree I'm all the way back at 2.6.18 and it's slow (mostly). Maybe I happened to hit a bad region, where some other slowish bug was present? In any case, I'm going to have to search both up and down from this point: down to find where some second bug drops it the rest of the way to <5 MB/s, and hopefully, in the other direction, a point where it becomes fast again; I've not yet actually seen any kernel from this tree which is fast. Does this git tree maybe include some sort of debugging methodology or other cruft in kernels built from it that might slow it down relative to stable kernels? I'll include the .config for this BADish kernel next.
Created attachment 11851 [details] .config for 2.6.18 kernel that ran at 6.7Mb/sec
Subject: Re: Dump of ext3 runs very slowly

On Friday 22 June 2007, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #43 from mathog@caltech.edu 2007-06-22 10:07 -------
> (In reply to comment #41)
> > v2.6.20 < v2.6.20-rc*
> >
> > v2.6.20 < 2.6.20.x
> >
> > In other words 2.6.20.x is a _separate_ branch based on v2.6.20.
>
> Yeah, this is becoming less fun by the minute.
>
> So I backed WAY up and built ebdea46fecae40c4d7effcd33f40918a37a1df4b,
> which the .config says is 2.6.18. The results for that were BAD(ish):
>
>   DUMP: 8789900 blocks (8583.89MB) on 1 volume(s)
>   DUMP: finished in 1316 seconds, throughput 6679 kBytes/sec

This might be some old bug which has been fixed *already*.

> That is, not as BAD as the usual <5 MB/sec, but MUCH worse than the normal >25 MB/sec.
> This suggests two bugs. Unfortunately it also suggests that the patches in
> this git tree bear only a tenuous relationship, in their order, to the ones
> in the stable releases. To review: I already tested stable-release variants
> of 2.6.20, 2.6.19, 2.6.18, and 2.6.17, and they were all FAST; only
> 2.6.21-rc1 and above were slow. In this git tree I'm all the way back at
> 2.6.18 and it's slow (mostly). Maybe I happened to hit a bad region, where
> some other slowish bug was present? In any case, I'm going to have to search
> both up and down from this point: down to find where some second bug drops
> it the rest of the way to <5 MB/s, and hopefully, in the other direction, a
> point where it becomes fast again; I've not yet actually seen any kernel
> from this tree which is fast. Does this git tree maybe include some sort of
> debugging methodology or other cruft in kernels built from it that might
> slow it down relative to stable kernels? I'll include the .config for this
> BADish kernel next.

Simplified description of the kernel development model:

* There is one main git tree, managed by Linus, where *all* changes are
  committed.

* From time to time Linus says "this is the 2.6.x kernel" and tags the git
  tree with an appropriate tag. The 2.6.x.y stable release tree is branched
  off this tag.

* There is a rule that *only* changes that were committed by Linus to his
  tree can be merged into 2.6.x.y, so a situation where 2.6.20.x contains
  some changes that are not in Linus' kernels > 2.6.20 cannot happen (in
  theory at least).

Let's combine the above with the results so far. You say that all variants of 2.6.20.y stable were FAST and only kernels >= 2.6.21-rc1 are SLOW, so you don't really need to git-bisect from 2.6.18: you can start from 2.6.20.

PS I've seen some discussion on the linux-kernel mailing list about a change in the /proc/sys/vm/dirty_ratio setting which may be related to this issue (sorry, I don't have time to dig into it deeper ATM).
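To make the branch relationship concrete, a quick check that can be run in the mainline tree (a sketch, using the v2.6.20 and v2.6.21-rc1 tags already shown by git tag -l earlier):

git rev-list v2.6.20..v2.6.21-rc1 | wc -l
# counts the commits in the range actually being bisected; the 2.6.20.x
# stable-only patches are not in this range, because they sit on a separate
# branch that starts at the v2.6.20 tag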
Subject: Re: Dump of ext3 runs very slowly

On Friday 22 June 2007, bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #44 from mathog@caltech.edu 2007-06-22 10:09 -------
> Created an attachment (id=11851)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=11851&action=view)
> .config for 2.6.18 kernel that ran at 6.7Mb/sec

By comparing this config with the 2.6.20.14 one I discovered this:

-CONFIG_DEFAULT_IOSCHED="cfq"
+CONFIG_DEFAULT_IOSCHED="anticipatory"

So this 2.6.18 config is using the CFQ I/O scheduler by default, while the 2.6.20.14 one uses the AS I/O scheduler. I also checked the other "SLOW" configs (for 2.6.21.1 and 2.6.21-rc3) and they are also using CFQ by default.

You may try to change the I/O scheduler dynamically by doing

# echo "anticipatory" > /sys/block/hda/queue/scheduler

in some known-to-be-SLOW kernel (i.e. 2.6.21.1) and see if it helps.
(In reply to comment #46)
> By comparing this config with the 2.6.20.14 one I discovered this:
>
> -CONFIG_DEFAULT_IOSCHED="cfq"
> +CONFIG_DEFAULT_IOSCHED="anticipatory"

Good catch! That turns out to be the problem. The 2.6.21.1 .config that came with the kernel-linus-source from Mandriva had cfq as the default; the .config from earlier kernels used a default of "anticipatory". The next attachment shows the results for all 4 IOSCHED modes under 2.6.21.1. It turns out that "cfq" is the only "slow" IOSCHED mode for the dump test, and that another IO test (bonnie++) shows only minimal differences.

So I guess now the question is: what is the nature of the interaction between "cfq" and dump of an ext3 filesystem that results in the 5X loss of throughput to /dev/null or a network connection? Or perhaps the scheduling problem is between dump and the pipe?
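One way to run the same comparison for all four schedulers in one go (a sketch only; it assumes /dev/hda and the dump command used throughout this report, and that each pass is allowed to run to completion):

for sched in noop anticipatory deadline cfq; do
    echo "$sched" > /sys/block/hda/queue/scheduler
    echo "=== $sched ==="
    dump -0 -f /dev/null /dev/hda5
done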
Created attachment 11853 [details] dump and bonnie tests for different IOSCHED defaults
The cfq-with-dump problem apparently goes way back. It was accidentally triggered in the git bisect here:

  ebdea46fecae40c4d7effcd33f40918a37a1df4b

and intentionally triggered here:

  4522d58275f124105819723e24e912c8e5bf3cdd

and finally, I tested it all the way back on Mandriva's 2.6.17-14mdv kernel, which normally runs "anticipatory". But...

echo "cfq" > /sys/block/hda/queue/scheduler
dump -0 -f /dev/null /dev/hda5 2>&1 | tee dump.txt
  DUMP: dumping (Pass IV) [regular files]
  DUMP: 16.56% done at 4837 kB/s, finished in 0:25

I.e., slow.

I have an even older system with Ultra320 SCSI disks and kernel 2.6.17.4. /dev/sdc1 is an ext3 file system.

echo "anticipatory" > /sys/block/sdc/queue/scheduler
dump -0 -f /dev/null /dev/sdc1 2>&1 | tee dump_anticipatory.txt
  DUMP: 77.67% done at 63470 kB/s, finished in 0:01

echo "cfq" > /sys/block/sdc/queue/scheduler
dump -0 -f /dev/null /dev/sdc1 2>&1 | tee dump_anticipatory.txt
  DUMP: 75.87% done at 62212 kB/s, finished in 0:01

So on the oldest kernel I have running here, "cfq" is just a hair slower than "anticipatory". Either the bug doesn't go that far back, or SCSI disks aren't affected the same way PATA ones are.
Created attachment 11857 [details]
Always schedule dispatch on request completion

It'd be nice if you could test 2.6.22-rc5 and see if that behaves differently, and 2.6.21 with the attached patch as well. I'll be gone the next two weeks, so I cannot look more at this issue before the 9th of July.
Jens,

The patch you supplied results in this piece of code:

	if (cfq_slice_used(cfqq))
		cfq_slice_expired(cfqd, 0, 1);
	else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list)) {
		if (!cfq_arm_slice_timer(cfqd))
			cfq_schedule_dispatch(cfqd);
	}
	cfq_schedule_dispatch(cfqd);	/* added by patch */

It does NOT affect the slow dump problem:

dump -0 -f /dev/null /dev/hda5 2>&1 | tee dump1.txt
  DUMP: 15.03% done at 4399 kB/s, finished in 0:28

Your patch would seem, in some instances, to result in two calls to cfq_schedule_dispatch. Was that your intention? Or is this what you intended?

	if (cfq_slice_used(cfqq))
		cfq_slice_expired(cfqd, 0, 1);
	else if (sync && RB_EMPTY_ROOT(&cfqq->sort_list)) {
		(void) cfq_arm_slice_timer(cfqd);
	}
	cfq_schedule_dispatch(cfqd);	/* added by patch */

I tried the latter form on 2.6.21.1 as well, just in case, but it did NOT speed things up either:

dump -0 -f /dev/null /dev/hda5 2>&1
...
  DUMP: 14.66% done at 4292 kB/s, finished in 0:29

Next??? Thanks
Sorry to jump in again, but it seems that this bug goes in the same direction as what I experienced in bug #7372: a longstanding problem that gets worse if one uses cfq as the scheduler. As a reminder: in my case I found that the problem with cfq disappeared when I disabled NCQ for the SATA drive by setting the queue length to 1. David does not have a SATA drive, but the similarities between the bug reports are striking. Assuming that PATA drives also have a queue, could it be that a wrong assumption in the queue processing is causing this?

As a reminder, my test results: http://bugzilla.kernel.org/show_bug.cgi?id=7372#c112
Hi All,

I did:

echo "anticipatory" > /sys/block/sda/queue/scheduler

and it fixed my dump problems. Thank you!

Comment 52: I have two servers out there with this problem. Both dump to an Ecrix/Exabyte/Tandberg VXA3 tape drive through an Adaptec 19160 SCSI-U160 LVDS controller. Does this help your troubleshooting?

-T
After "echo anticipatory > /sys/block/sdc/queue/scheduler" DUMP: dumping (Pass IV) [regular files] DUMP: 4.34% done at 48278 kB/s, finished in 1:50 Incredible. This was about 5000 kB/s before the change. I suppose this must be something to do with the way that dump uses multiple processes to overlap its I/O. Everything I've tried other than dump seems to work perfectly well with cfq.
If dump is indeed using several threads to do IO in the same location, then that will currently run a lot slower with CFQ. I have some patches that need a bit of testing that will fix that up, basically one set for io context sharing and one for cooperating processes/threads. Let me see if I can work something out tomorrow.
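One way to check the multiple-process hypothesis from a shell while a slow dump is running (a sketch; dump forks helper processes rather than threads, so they show up as separate PIDs):

ps -C dump -o pid,ppid,args
# several dump processes sharing one parent are the forked I/O helpers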
Hi All,

I used the workaround to cure my slow dump to an eSATA drive. But an identically configured server with an Adaptec 19160 SCSI-U160 LVDS controller and an Ecrix/Exabyte/Tandberg/{insert owner of the month here} VXA-3 tape drive is still very slow. The second dump takes up to 10 hours (it should take 2).

From my debug report (first of two dumps -- the small one):

echo anticipatory > /sys/block/sda/queue/scheduler
anticipatory
sync; sync
/sbin/dump -0a -f /dev/nst0 /dev/sda3
  DUMP: Date of this level 0 dump: Mon Feb  4 23:03:13 2008
  DUMP: Dumping /dev/sda3 (/export) to /dev/nst0
  ...
  DUMP: Volume 1 20215300 blocks (19741.50MB)
  DUMP: Volume 1 took 2:10:01
  DUMP: Volume 1 transfer rate: 2591 kB/s

It should be 8000 kB/s (based on a third identical server running CentOS 4.5).

Does anyone know of a workaround for VXA-3 dumps?

Many thanks,
-T
Hi All,

Alan, over on the CentOS forum, pointed out this workaround to me. I changed my /boot/grub/grub.conf kernel line to:

kernel /boot/vmlinuz-2.6.18-53.1.14.el5 ro root=LABEL=/1 rhgb quiet elevator=as

rebooted, and got:

$ cat /sys/block/sda/queue/scheduler
noop [anticipatory] deadline cfq

$ cat /sys/block/sdb/queue/scheduler
noop [anticipatory] deadline cfq

Hope this helps in the troubleshooting.

Reference: http://www.redhat.com/magazine/008jun05/features/schedulers/

-T
It'll get fixed when enough time is available. This is really dump doing something stupid, but CFQ should cope with it better.
Created attachment 18473 [details]
Detect processes issuing I/O within the average seek distance of one another

This patch, based on some of the close-cooperator patches that have been merged and reverted in the past, addresses the dump(8) performance issue in my test setup. I'd sure appreciate it if others could test this and report their results. Performance results for all workloads are desirable, as we don't want to introduce regressions elsewhere. Cheers!
Created attachment 19002 [details]
Fixed version of the previous patch.

This version of the patch has a couple of fixes from Jens.
This was never applied - what is the status on this?
Status is that it should be fixed in the app; it's really behaving in a sub-optimal manner (spreading interleaved IO out to threads is just dumb). Jeff has a patch for dump that uses CLONE_IO to make it work as fast as it should. In a better life, dump's misplaced attempt at optimizing the IO pattern should just be killed, imho. But Jeff's patch is simpler, works, and is available :-)
Closing
Regarding dump(8), Jens expressed that he would rather see the dump utility modified to share I/O contexts. I floated a patch for that to the dump-devel mailing list, but heard no response from the maintainer. I just reposted another version which fixes some compile issues when CLONE_IO isn't present on the system.

I'll note that this problem exists not just for dump(8). We see similar issues with streaming writes from a single NFS client to an NFS server running with >1 nfsd thread. Similar issues are also present in at least one iSCSI target mode implementation. I've also seen yet another instance of this in a system configured with a 3rd-party volume manager.

For the NFS server issue, I'm working on a patch to create an I/O context per NFS client, so that multiple threads issuing I/O on behalf of a single client will share it. We'll see how far that goes in addressing the problem. For the iSCSI target mode implementation, the authors also modified their code to share I/O contexts, with decent results.

This still leaves other third-party applications that perform I/O in this way. We can either ignore the problem, or investigate pushing something like what I proposed into CFQ. I'd have to give more numbers on the additional overhead introduced, I think. Personally, I think we should address this issue, as there are enough independent cases to verify that this is something people actually do. If we don't, people will simply change I/O schedulers.