Bug 13127

Summary: 2.6.34 CCISS Speed issue
Product: IO/Storage
Reporter: mrblogs
Component: Other
Assignee: Mike Miller (mike.miller)
Status: RESOLVED OBSOLETE
Severity: normal
CC: akpm, alan, biggi, codeguard, ericbouer, leandro, mm-kernel, scameron, ulf.gruene
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.34
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: Performance test result highlighting performance regression between 2.6.28 and 2.6.29 and more recent
             timing harddisks

Description mrblogs 2009-04-17 00:50:09 UTC
I am using Gentoo on an HP DL380 G3 with Intel Xeon (P4) 3.06 GHz processors
and an HP/Compaq Smart Array (cciss) controller in hardware RAID 0 mode, so
Linux just sees one single drive.

hdparm on 2.6.27 shows the read speed at ~130 MB/s.

After upgrading the kernel,

hdparm on 2.6.29 shows the read speed at ~30 MB/s (it varies between 30 and 50 MB/s).

I compared the dmesg output from the two kernels,
and the only difference is that 2.6.29 mentions:
"IRQF_DISABLED is not guaranteed on shared IRQs"

Now I have looked for hours to see if it was a configuration issue, but
to no avail.

Yes, the 2.6.27 kernel was self-built, and it works fine. The 2.6.29 kernel
was built using the .config file from the 2.6.27 build.

I compiled this kernel on two servers of the same spec (well, one has 4 GB
RAM, the other 2 GB) and the same setup, and both responded as above.

On one, I recompiled the kernel for the new version with NO config changes.
On the other, I recompiled with config changes (removing things not relevant
to the machine, etc.). Both end up having this problem.

I have done a diff of the two configs, but there is nothing I can find that
would explain this.

I know that this "IRQF_DISABLED" warning isn't present in 2.6.27 boots, and
that the cciss driver works at full speed there.

I originally thought it was because I had changed the I/O scheduler, but as I
said, on one machine all I did was copy the config file and recompile.

Do you know of anything that could be causing this?

I checked /proc/interrupts on both kernel versions, and IRQ 30 is unshared
(IO-APIC-fasteoi cciss, or something like that); I am just unsure what is
causing the issue.
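
The hdparm figures quoted here and in the later comments are presumably the timed buffered disk reads, i.e. something along these lines (the device node is the usual first cciss logical drive and may differ per setup):

    hdparm -t /dev/cciss/c0d0    # timed buffered (disk) reads
    hdparm -T /dev/cciss/c0d0    # timed cached reads, for comparison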
Comment 1 Andrew Morton 2009-04-17 18:41:59 UTC
Assigned to Mike, marked as a regression.
Comment 2 Mike Miller 2009-04-22 20:26:23 UTC
I'm running some comparisons between 2.6.27.9 and 2.6.29. I see about a 30% drop in buffered disk reads but nothing as drastic as reported here. I don't have the same HW but I would expect about the same behavior if the driver is the culprit.
 
2.6.27.9 - buffered disk reads 150.88MB/sec
2.6.29 --- buffered disk reads 102.80MB/sec
 
The timed cache reads are within 10% of each other. 

What you see in dmesg has no effect on the performance. I've eliminated the IRQF_SHARED flag in my testing and it made no difference. The reason you didn't see the message in 2.6.27.x is that it does not exist in older kernels.

We have changed very little in the main IO path for a long time.
Comment 3 Leandro Oliveira 2009-08-10 13:56:44 UTC
I see the same problem with kernels 2.6.29, 2.6.30 and 2.6.31 running on a DL380 G5. Sample dmesg log below. Read speed is ~20 MB/s.

-= test =-

dd if=/dev/cciss/c0d0p3 of=/dev/null bs=1024 count=1000000
1000000+0 records in
1000000+0 records out
1024000000 bytes (1.0 GB) copied, 95.0507 s, 10.8 MB/s

-= dmesg =-
HP CISS Driver (v 3.6.20)
cciss 0000:06:00.0: PCI INT A -> Link[LNKC] -> GSI 10 (level, low) -> IRQ 10
cciss 0000:06:00.0: irq 35 for MSI/MSI-X
cciss 0000:06:00.0: irq 36 for MSI/MSI-X
cciss 0000:06:00.0: irq 37 for MSI/MSI-X
cciss 0000:06:00.0: irq 38 for MSI/MSI-X
e1000e: Intel(R) PRO/1000 Network Driver - 1.0.2-k2
e1000e: Copyright (c) 1999-2008 Intel Corporation.
e1000e 0000:0e:00.0: PCI INT A -> Link[LNKB] -> GSI 7 (level, low) -> IRQ 7
e1000e 0000:0e:00.0: setting latency timer to 64
e1000e 0000:0e:00.0: irq 39 for MSI/MSI-X
IRQ 37/cciss0: IRQF_DISABLED is not guaranteed on shared IRQs
cciss0: <0x3230> at PCI 0000:06:00.0 IRQ 37 using DAC
      blocks= 1757614684 block_size= 512
      heads=255, sectors=32, cylinders=215394

      blocks= 1757614684 block_size= 512
      heads=255, sectors=32, cylinders=215394

 cciss/c0d0: p1 p2 < p5 > p3
Comment 4 ulf.gruene 2009-10-21 06:36:15 UTC
I can confirm this regression. The problem started between kernel versions 2.6.28 and 2.6.29.

Results from hdparm -t :
2.6.27.35:     86.41 MB/sec
2.6.28.10:     87.10 MB/sec
2.6.29.6:      41.51 MB/sec
2.6.31:        30.51 MB/sec

Server model: ProLiant DL360 G5

Output from dmesg:
HP CISS Driver (v 3.6.20)
cciss 0000:06:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
cciss 0000:06:00.0: irq 57 for MSI/MSI-X
cciss 0000:06:00.0: irq 58 for MSI/MSI-X
cciss 0000:06:00.0: irq 59 for MSI/MSI-X
cciss 0000:06:00.0: irq 60 for MSI/MSI-X
IRQ 59/cciss0: IRQF_DISABLED is not guaranteed on shared IRQs
cciss0: <0x3230> at PCI 0000:06:00.0 IRQ 59 using DAC
      blocks= 286677120 block_size= 512
      heads=255, sectors=32, cylinders=35132

      blocks= 286677120 block_size= 512
      heads=255, sectors=32, cylinders=35132

 cciss/c0d0: p1 p2
      blocks= 286677120 block_size= 512
      heads=255, sectors=32, cylinders=35132

      blocks= 286677120 block_size= 512
      heads=255, sectors=32, cylinders=35132

 cciss/c0d1: p1
Comment 5 Michael Miller 2009-10-21 16:29:32 UTC
Hi,

I see a similar performance drop-off. I have seen it on a DL360 G5 (with the current firmware CD applied) and a DL380 G5 (older firmware).

The figures below are from a DL380 G5: a fresh install of Ubuntu 8.04 with apt-get update; apt-get upgrade; apt-get dist-upgrade run for all current patches, and two 146 GB SAS 10k RPM drives, each in a single-drive RAID-0 logical drive created using the P400 BIOS.

I copied the stock kernel config options and then compiled the various kernels using those options (after running make menuconfig so that new options get their defaults, etc.). For 2.6.31.4 and 2.6.32-rc5 I also built the BNX2 driver into the kernel so as to include its firmware, and enabled the deprecated SYSFS option to ensure that the LVM2 setup worked on boot.

The following was run for each kernel:
dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=1024
dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=2048
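
The raw result blocks below appear to come from a small wrapper around these two commands; a rough reconstruction, not the actual script (the output file name is a guess):

    (
      echo --------------------------
      date
      uname -a
      dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=1024
      dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=2048
      echo --------------------------
    ) >> "$(uname -r).txt" 2>&1    # dd reports its throughput on stderr, hence 2>&1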


In summary, 2.6.28.10 was the last good kernel and 2.6.29.6 showed the performance drop. I have not run through the whole 2.6.29 series; I will try to get that run done tomorrow to see if it is common to the entire 2.6.29 line.

Obviously, a drop from around 90MB/s to 34MB/s is significant.

many thanks,
Mike 

The raw results are below:
--------------------------
Wed Oct 21 16:40:54 BST 2009
Linux uk-ditnas902 2.6.24-24-server #1 SMP Fri Sep 18 16:47:05 UTC 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.83 s, 90.8 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.6072 s, 91.0 MB/s
--------------------------
--------------------------
Wed Oct 21 16:44:06 BST 2009
Linux uk-ditnas902 2.6.24.7-mjm #1 SMP Wed Oct 21 11:24:15 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.8286 s, 90.8 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.5383 s, 91.2 MB/s
--------------------------
--------------------------
Wed Oct 21 16:47:16 BST 2009
Linux uk-ditnas902 2.6.25.20-mjm #1 SMP Wed Oct 21 11:51:14 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.7866 s, 91.1 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.5078 s, 91.4 MB/s
--------------------------
--------------------------
Wed Oct 21 16:50:26 BST 2009
Linux uk-ditnas902 2.6.26.8-mjm #1 SMP Wed Oct 21 12:18:45 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.7988 s, 91.0 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.5338 s, 91.3 MB/s
--------------------------
--------------------------
Wed Oct 21 16:53:35 BST 2009
Linux uk-ditnas902 2.6.27.37-mjm #1 SMP Wed Oct 21 12:46:28 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.8022 s, 91.0 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.5142 s, 91.3 MB/s
--------------------------
--------------------------
Wed Oct 21 16:56:42 BST 2009
Linux uk-ditnas902 2.6.28.10-mjm #1 SMP Wed Oct 21 13:13:52 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.8088 s, 90.9 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.5781 s, 91.1 MB/s
--------------------------
--------------------------
Wed Oct 21 16:59:49 BST 2009
Linux uk-ditnas902 2.6.29.6-mjm #1 SMP Wed Oct 21 13:42:24 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.041 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0189 s, 34.6 MB/s
--------------------------
--------------------------
Wed Oct 21 17:03:54 BST 2009
Linux uk-ditnas902 2.6.30.9-mjm #1 SMP Wed Oct 21 14:11:58 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0455 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0342 s, 34.6 MB/s
--------------------------
--------------------------
Wed Oct 21 17:07:58 BST 2009
Linux uk-ditnas902 2.6.31.4-mjm #1 SMP Wed Oct 21 14:42:28 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0351 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0157 s, 34.6 MB/s
--------------------------
--------------------------
Wed Oct 21 17:12:02 BST 2009
Linux uk-ditnas902 2.6.32-rc5-mjm #1 SMP Wed Oct 21 15:11:32 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0223 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0179 s, 34.6 MB/s
--------------------------
Comment 6 Michael Miller 2009-10-22 09:40:49 UTC
Hi,

I have compiled the 2.6.29 stable kernels and they all show the reduced performance. So the change responsible for the reduced performance must have gone in around the time the 2.6.29 series was forked.

--------------------------
Thu Oct 22 09:56:51 BST 2009
Linux uk-ditnas902 2.6.28.10-mjm #1 SMP Wed Oct 21 13:13:52 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.8292 s, 90.8 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.5529 s, 91.2 MB/s
--------------------------
--------------------------
Thu Oct 22 10:00:02 BST 2009
Linux uk-ditnas902 2.6.29-mjm #1 SMP Wed Oct 21 20:05:22 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0513 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0565 s, 34.6 MB/s
--------------------------
--------------------------
Thu Oct 22 10:04:10 BST 2009
Linux uk-ditnas902 2.6.29.1-mjm #1 SMP Wed Oct 21 17:47:16 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0485 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0786 s, 34.6 MB/s
--------------------------
--------------------------
Thu Oct 22 10:08:18 BST 2009
Linux uk-ditnas902 2.6.29.2-mjm #1 SMP Wed Oct 21 18:14:59 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0697 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0291 s, 34.6 MB/s
--------------------------
--------------------------
Thu Oct 22 10:12:26 BST 2009
Linux uk-ditnas902 2.6.29.3-mjm #1 SMP Wed Oct 21 18:42:37 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0689 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.081 s, 34.6 MB/s
--------------------------
--------------------------
Thu Oct 22 10:16:35 BST 2009
Linux uk-ditnas902 2.6.29.4-mjm #1 SMP Wed Oct 21 19:10:12 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.055 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0636 s, 34.6 MB/s
--------------------------
--------------------------
Thu Oct 22 10:20:43 BST 2009
Linux uk-ditnas902 2.6.29.5-mjm #1 SMP Wed Oct 21 19:37:53 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0474 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0479 s, 34.6 MB/s
--------------------------
--------------------------
Thu Oct 22 10:24:50 BST 2009
Linux uk-ditnas902 2.6.29.6-mjm #1 SMP Wed Oct 21 13:42:24 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0676 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0199 s, 34.6 MB/s
--------------------------
Comment 7 Michael Miller 2009-10-22 15:49:45 UTC
It seems 2.6.29-rc1 also contains the performance regression. So some change in the 2.6.29-rc1 kernel appears to be the culprit.

--------------------------
Thu Oct 22 16:41:12 BST 2009
Linux uk-ditnas902 2.6.29-rc1-mjm #2 SMP Thu Oct 22 16:15:30 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.0283 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.0603 s, 34.6 MB/s
--------------------------

I have done the same tests in some of the VMs we have running (so the cciss Linux driver is not in play), and there the stock 2.6.24-24-virtual Ubuntu 8.04 kernel is slightly slower than a custom-compiled 2.6.30.3 (using pretty much the standard kernel options), which would be expected. Unfortunately, I do not have any non-HP hardware available at the moment to test 2.6.28 vs 2.6.29 and conclusively point at the cciss driver.

But at the moment, 2.6.29-rc1 seems to have introduced the performance decrease this bug relates to.
Comment 8 Michael Miller 2009-10-23 11:38:33 UTC
Hi,

Here are three more data points, from an HP ProLiant DL360 G6 with a single quad-core CPU and 6 GB RAM. These again show the reduced cciss performance when going from kernel 2.6.28 to 2.6.29/2.6.31.

Again, the test was:
dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=1024
dd if=/dev/cciss/c0d1 of=/dev/null bs=1024k count=2048


root@uk-ditnas902:~# cat *txt
--------------------------
Fri Oct 23 12:19:25 BST 2009
Linux uk-ditnas903 2.6.28.10-mjm #1 SMP Wed Oct 21 13:13:52 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 11.7862 s, 91.1 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 23.5063 s, 91.4 MB/s
--------------------------
--------------------------
Thu Jun 25 08:56:40 BST 2009
Linux uk-ditnas903 2.6.29-rc1-mjm #2 SMP Thu Oct 22 16:15:30 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 31.049 s, 34.6 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 62.022 s, 34.6 MB/s
--------------------------
--------------------------
Fri Oct 23 12:27:12 BST 2009
Linux uk-ditnas903 2.6.31.4-mjm #1 SMP Wed Oct 21 14:42:28 BST 2009 x86_64 GNU/Linux
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 29.981 s, 35.8 MB/s
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 59.7466 s, 35.9 MB/s
--------------------------
Comment 9 Michael Miller 2009-10-23 13:44:33 UTC
Posted issue at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/337419
Comment 10 Michael Miller 2009-10-28 16:10:51 UTC
After contacting Mike Miller @ HP and getting very useful help and information, I reran some tests and managed to get improved performance. According to Mike, some read-ahead changes were made to the cciss driver; I am not sure which kernel they went into.

- Setting the cciss driver read ahead to 64 kB using:
    echo 64 > /sys/block/cciss\!c0d1/queue/read_ahead_kb
OR
    blockdev --setra 128 /dev/cciss/c0d1
(--setra counts 512-byte sectors, so 128 sectors = 64 kB) made the most significant improvement. This brought my test results back to pre-2.6.29 levels. From 2.6.29-rc1 onwards the default is 128 kB; prior kernels seemed to use a default of 4096 kB. For my test cases, the 64 kB read_ahead_kb setting is optimal across logical drive stripe sizes and Smart Array cache ratios.
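
To check or apply the same setting across all cciss logical drives, something like the following should do; this is only a sketch, and the 64 kB value and device glob are taken from the results above and will differ per system:

    # show the current read-ahead for each cciss logical drive
    grep . /sys/block/cciss\!c0d*/queue/read_ahead_kb

    # set a 64 kB read-ahead on each of them
    for q in /sys/block/cciss\!c0d*/queue/read_ahead_kb; do
        echo 64 > "$q"
    done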

The 64 kB read ahead was optimal for my test case of a logical drive consisting of either one or two 146 GB SAS 10k RPM drives, and brought performance back to pre-2.6.29 levels.

I have done further testing (results I will try to summarise here shortly) on a variety of RAID levels on 6i and 6404 controllers, which indicates a performance regression between 2.6.28.10 and 2.6.29.2. I intend to rerun my tests on a non-HP server to see whether it is a generic regression or cciss-related.
Comment 11 Michael Miller 2009-10-28 22:00:53 UTC
My additional tests confirm that there is a performance regression when using the cciss driver for various workloads, and that it has been present since 2.6.29-rc1. I have run my test on a Dell 2950 with a PERC 6/i and the performance regression is not present there.

I have attached a spreadsheet with the results of my dd test for various read ahead block sizes, various logical drive configurations and various Smart Array cache ratios (read/write cache ratio). The graphs in the spreadsheet clearly indicate the sequential read test performance differences observed for 2.6.28 vs 2.6.29. 

Anyone using a Smart Array controller should, in my opinion, do thorough testing prior to using a 2.6.29 or more recent kernel for production workloads. At a minimum, the latest 2.6.28 kernel's performance should be compared with that of the more recent kernel being considered.
Comment 12 Michael Miller 2009-10-28 22:08:00 UTC
Created attachment 23569 [details]
Performance test result highlighting performance regression between 2.6.28 and 2.6.29 and more recent

Spreadsheet with results from performance tests highlighting the performance regression between 2.6.28 and 2.6.29 and more recent.

Test runs shown include various logical drive array configurations (RAID 1/5/6), array cache ratios and read-ahead block sizes.
Comment 13 Markus Duft 2010-03-19 15:05:07 UTC
Created attachment 25614 [details]
timing harddisks

Some more data on (hopefully) this issue. It seems as if RAID1 performance on the controller improved in 2.6.31 compared to 2.6.28. On the other hand, the performance of RAID0 single drives on the controller decreased _a lot_.
Comment 14 Stephen M. Cameron 2010-03-23 13:01:04 UTC
Some time ago, I had been tracking down a performance problem with hpsa somewhere between 2.6.29 and 2.6.30, and I git bisected it to this commit:

2f5cb7381b737e24c8046fd4aeab571fb71315f5 is first bad commit
commit 2f5cb7381b737e24c8046fd4aeab571fb71315f5
Author: Jens Axboe 
Date:   Tue Apr 7 08:51:19 2009 +0200

    cfq-iosched: change dispatch logic to deal with single requests at the time
    
    The IO scheduler core calls into the IO scheduler dispatch_request hook
    to move requests from the IO scheduler and into the driver dispatch
    list. It only does so when the dispatch list is empty. CFQ moves several
    requests to the dispatch list, which can cause higher latencies if we
    suddenly have to switch to some important sync IO. Change the logic to
    move one request at the time instead.
    
    This should almost be functionally equivalent to what we did before,
    except that we now honor 'quantum' as the maximum queue depth at the
    device side from any single cfqq. If there's just a single active
    cfqq, we allow up to 4 times the normal quantum.
    
    Signed-off-by: Jens Axboe

It seems to have since been fixed, or at least made somewhat better, somewhere between 2.6.32 and v2.6.33-rc1, though I was not able to git bisect to find the fix: I ran into kernels that wouldn't boot somewhere in there and didn't persevere enough to get past it.

That was with hpsa, though; I didn't try cciss. It might be the same problem.
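
For anyone wanting to repeat that bisection, it follows the standard git bisect workflow; a rough sketch using the version bounds from this comment (build/install steps simplified, the test being the dd/hdparm run from earlier comments):

    git bisect start
    git bisect bad v2.6.30       # kernel that shows the slow reads
    git bisect good v2.6.29      # last known fast kernel
    # git checks out a candidate commit; build and install it:
    make -j"$(nproc)" && make modules_install install
    # after rebooting into it, rerun the read test and mark the result:
    git bisect good              # or: git bisect bad
    # repeat until git reports the first bad commit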
Comment 15 Birgir Haraldsson 2010-06-03 01:24:25 UTC
Seems to me that this is still a problem in 2.6.34.
Comment 16 Stephen M. Cameron 2010-06-03 14:36:03 UTC
Are you using the cfq I/O scheduler (completely fair queuing)?

Does using another i/o scheduler (e.g. deadline) make a difference?

-- steve
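
To check which scheduler is active and try another one at runtime, something along these lines should work (device name as used in the earlier comments):

    cat /sys/block/cciss\!c0d0/queue/scheduler       # the active scheduler is shown in [brackets]
    echo deadline > /sys/block/cciss\!c0d0/queue/scheduler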
Comment 17 Birgir Haraldsson 2010-06-03 15:09:06 UTC
Default was cfq, and I tried deadline, but it was the same.

I use 64 for /sys/block/cciss\!c0d0/queue/read_ahead_kb (~50% better than 128)

cciss0: HP Smart Array E200i Controller
Comment 18 Alan 2012-10-30 16:48:12 UTC
If this is still seen on modern kernels, please re-open/update.