Bug 15177

Summary: Asynchronous writes up to 30% slower than in previous kernel versions
Product: IO/Storage Reporter: Bart Van Assche (bvanassche)
Component: Block Layer Assignee: Jens Axboe (axboe)
Status: RESOLVED OBSOLETE    
Severity: normal CC: alan, florian, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.33-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 12398    
Attachments: The blockdev-perftest script.
Kernel config used for the 2.6.27.39 kernel.
Kernel config used for the 2.6.33-rc3 kernel.
Performance results for the 2.6.27.39 kernel, the SRP protocol and CFQ.
Performance results for the 2.6.27.39 kernel, the SRP protocol and NOOP.
Performance results for the 2.6.33-rc3 kernel, the SRP protocol and CFQ.
Performance results for the 2.6.33-rc3 kernel, the SRP protocol and NOOP.
Performance results for the 2.6.27.39 kernel, the iSER protocol and CFQ.
Performance results for the 2.6.27.39 kernel, the iSER protocol and NOOP.
Performance results for the 2.6.33-rc3 kernel, the iSER protocol and CFQ.
Performance results for the 2.6.33-rc3 kernel, the iSER protocol and NOOP.
Kernel config used for the 2.6.28.10 kernel.
Performance results for the 2.6.28.10 kernel, the SRP protocol and CFQ.
Performance results for the 2.6.28.10 kernel, the SRP protocol and NOOP.
Kernel config used for the 2.6.29.6 kernel.
Performance results for the 2.6.29.6 kernel, the SRP protocol and CFQ.
Performance results for the 2.6.29.6 kernel, the SRP protocol and NOOP.
Kernel config used for the 2.6.30.9 kernel.
Performance results for the 2.6.30.9 kernel, the SRP protocol and CFQ.
Performance results for the 2.6.30.9 kernel, the SRP protocol and NOOP.
Performance results for the 2.6.33-rc7 kernel, the SRP protocol and NOOP.
Performance results for the 2.6.33-rc7 kernel, the SRP protocol and CFQ.

Description Bart Van Assche 2010-01-30 16:38:57 UTC
Created attachment 24794 [details]
The blockdev-perftest script.

While performing regression tests on the SCST-SRP target implementation I noticed that the throughput of asynchronous block I/O with the 2.6.33-rc3 kernel is up to 25% slower than that of the 2.6.27.39 kernel. I noticed that this behavior not only occurs with the SRP initiator but also with the iSER initiator. Behavior was consistent for the NOOP and the CFQ schedulers.

Did I make a mistake while configuring the 2.6.33-rc3 kernel?

The script that has been used for running the performance tests has been attached. That script has been invoked as follows:
$ blockdev-perftest -d -s 30 -m 12 ${initiator_device}; blockdev-perftest -d -s 30 -m 12 ${initiator_device}

Setup details for the initiator system, the system on which the performance tests have been run:
* SRP initiator was loaded with parameter srp_sg_tablesize=128
* Frequency scaling was disabled.
* Runlevel: 3.
* IRQ affinity for mlx4_core: bound to a single core (smp_affinity=1).
* IB HCA: QDR Mellanox ConnectX MT26428
* CPU: Intel Core2 Duo E6750 @ 2.66 GHz.

Setup details for the target system:
* 2.6.30.7 kernel with SCST patches and with kernel debugging disabled.
* OFED 1.5 IB drivers.
* SCST revision compiled in release mode (make debug2release)
  and with SCST_MAX_TGT_DEV_COMMANDS changed from 48 into 128.
* ib_srpt kernel module parameter: thread=0.
* STGT revision 1.0.1.
* 1 GB file residing on a tmpfs filesystem was exported towards the initiator system.
* Frequency scaling was disabled.
* Runlevel: 3.
* IRQ affinity for mlx4-comp-0: not bound to a core (smp_affinity=3).
* IB HCA: QDR Mellanox ConnectX MT26428
* CPU: Intel Core2 Duo E8400 @ 3.00 GHz.
Comment 1 Bart Van Assche 2010-01-30 16:39:42 UTC
Created attachment 24795 [details]
Kernel config used for the 2.6.27.39 kernel.
Comment 2 Bart Van Assche 2010-01-30 16:40:11 UTC
Created attachment 24796 [details]
Kernel config used for the 2.6.33-rc3 kernel.
Comment 3 Bart Van Assche 2010-01-30 16:41:08 UTC
Created attachment 24797 [details]
Performance results for the 2.6.27.39 kernel, the SRP protocol and CFQ.
Comment 4 Bart Van Assche 2010-01-30 16:41:56 UTC
Created attachment 24798 [details]
Performance results for the 2.6.27.39 kernel, the SRP protocol and NOOP.
Comment 5 Bart Van Assche 2010-01-30 16:42:31 UTC
Created attachment 24799 [details]
Performance results for the 2.6.33-rc3 kernel, the SRP protocol and CFQ.
Comment 6 Bart Van Assche 2010-01-30 16:42:52 UTC
Created attachment 24800 [details]
Performance results for the 2.6.33-rc3 kernel, the SRP protocol and NOOP.
Comment 7 Bart Van Assche 2010-01-30 16:44:25 UTC
Created attachment 24801 [details]
Performance results for the 2.6.27.39 kernel, the iSER protocol and CFQ.
Comment 8 Bart Van Assche 2010-01-30 16:44:50 UTC
Created attachment 24802 [details]
Performance results for the 2.6.27.39 kernel, the iSER protocol and NOOP.
Comment 9 Bart Van Assche 2010-01-30 16:45:17 UTC
Created attachment 24803 [details]
Performance results for the 2.6.33-rc3 kernel, the iSER protocol and CFQ.
Comment 10 Bart Van Assche 2010-01-30 16:45:37 UTC
Created attachment 24804 [details]
Performance results for the 2.6.33-rc3 kernel, the iSER protocol and NOOP.
Comment 11 Bart Van Assche 2010-01-30 16:50:45 UTC
(In reply to comment #0)
> * SCST revision compiled in release mode (make debug2release)
The above should have been: SCST revision 1484 from https://scst.svn.sourceforge.net/svnroot/scst/branches/srpt-separate-rx-tx-buffers.
Comment 12 Jens Axboe 2010-01-30 18:50:19 UTC
Bart, thanks for the detailed regression report. Would it be possible for you to try and narrow it down to a specific kernel version? Eg does 2.6.32 run fast or slow, .31, etc.
Comment 13 Bart Van Assche 2010-01-31 19:31:45 UTC
Created attachment 24833 [details]
Kernel config used for the 2.6.28.10 kernel.
Comment 14 Bart Van Assche 2010-01-31 19:32:23 UTC
Created attachment 24834 [details]
Performance results for the 2.6.28.10 kernel, the SRP protocol and CFQ.
Comment 15 Bart Van Assche 2010-01-31 19:32:46 UTC
Created attachment 24835 [details]
Performance results for the 2.6.28.10 kernel, the SRP protocol and NOOP.
Comment 16 Bart Van Assche 2010-01-31 19:33:21 UTC
Created attachment 24836 [details]
Kernel config used for the 2.6.29.6 kernel.
Comment 17 Bart Van Assche 2010-01-31 19:33:56 UTC
Created attachment 24837 [details]
Performance results for the 2.6.29.6 kernel, the SRP protocol and CFQ.
Comment 18 Bart Van Assche 2010-01-31 19:34:17 UTC
Created attachment 24838 [details]
Performance results for the 2.6.29.6 kernel, the SRP protocol and NOOP.
Comment 19 Bart Van Assche 2010-01-31 19:34:57 UTC
Created attachment 24839 [details]
Kernel config used for the 2.6.30.9 kernel.
Comment 20 Bart Van Assche 2010-01-31 19:35:25 UTC
Created attachment 24840 [details]
Performance results for the 2.6.30.9 kernel, the SRP protocol and CFQ.
Comment 21 Bart Van Assche 2010-01-31 19:35:46 UTC
Created attachment 24841 [details]
Performance results for the 2.6.30.9 kernel, the SRP protocol and NOOP.
Comment 22 Bart Van Assche 2010-01-31 19:49:47 UTC
(In reply to comment #12)
> Bart, thanks for the detailed regression report. Would it be possible for you
> to try and narrow it down to a specific kernel version? Eg does 2.6.32 run
> fast or slow, .31, etc.

Apparently there was a performance regression for asynchronous I/O in kernel version 2.6.30 for both NOOP and CFQ and also in 2.6.33-rc3 for CFQ.

This is how asynchronous I/O throughput evolved over kernel versions for the NOOP I/O scheduler (the two columns represent write and read throughput respectively in MB/s):
$ grep -w 4096 scst-srp-*-noop.txt|while read direct_io_results && read prefix bs w1 w2 w3 w ws wi r1 r2 r3 r rs ri; do echo $prefix $w $r; done
scst-srp-2.6.27.39-noop.txt:  946.143 505.922
scst-srp-2.6.28.10-noop.txt: 1146.544 513.158
scst-srp-2.6.29.6-noop.txt:  1003.962 509.874
scst-srp-2.6.30.9-noop.txt:   793.301 546.462
scst-srp-2.6.31.7-noop.txt:   813.216 551.276
scst-srp-2.6.32-noop.txt:     828.657 558.745
scst-srp-2.6.33-rc3-noop.txt: 825.712 963.724

And this is how asynchronous I/O throughput evolved over kernel versions for the CFQ I/O scheduler:
$ grep -w 4096 scst-srp-*-cfq.txt|while read direct_io_results && read prefix bs w1 w2 w3 w ws wi r1 r2 r3 r rs ri; do echo $prefix $w $r; done
scst-srp-2.6.27.39-cfq.txt:  843.048 820.406
scst-srp-2.6.28.10-cfq.txt: 1047.015 767.636
scst-srp-2.6.29.6-cfq.txt:  1015.714 819.344
scst-srp-2.6.30.9-cfq.txt:   790.427 540.332
scst-srp-2.6.31.7-cfq.txt:   823.695 557.837
scst-srp-2.6.32-cfq.txt:     826.158 565.967
scst-srp-2.6.33-rc3-cfq.txt: 717.231 950.252
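
To make the regression points in the two tables above easier to see, the version-to-version change in asynchronous write throughput can be computed as in the following minimal Python sketch. The numbers are copied from the tables; the names WRITE_MBPS and step_changes are illustrative only and are not part of the blockdev-perftest script.

```python
# Asynchronous write throughput in MB/s, copied from the tables above.
WRITE_MBPS = {
    "noop": [
        ("2.6.27.39", 946.143),
        ("2.6.28.10", 1146.544),
        ("2.6.29.6", 1003.962),
        ("2.6.30.9", 793.301),
        ("2.6.31.7", 813.216),
        ("2.6.32", 828.657),
        ("2.6.33-rc3", 825.712),
    ],
    "cfq": [
        ("2.6.27.39", 843.048),
        ("2.6.28.10", 1047.015),
        ("2.6.29.6", 1015.714),
        ("2.6.30.9", 790.427),
        ("2.6.31.7", 823.695),
        ("2.6.32", 826.158),
        ("2.6.33-rc3", 717.231),
    ],
}


def step_changes(series):
    """Return (older, newer, percent change) for each consecutive pair."""
    changes = []
    for (old_ver, old), (new_ver, new) in zip(series, series[1:]):
        changes.append((old_ver, new_ver, 100.0 * (new - old) / old))
    return changes


if __name__ == "__main__":
    for scheduler, series in WRITE_MBPS.items():
        print(scheduler)
        for old_ver, new_ver, pct in step_changes(series):
            print(f"  {old_ver} -> {new_ver}: {pct:+.1f}%")
```

With these inputs the largest drop for both schedulers is the step from 2.6.29.6 to 2.6.30.9 (about 21% for NOOP and 22% for CFQ). CFQ loses a further 13% or so between 2.6.32 and 2.6.33-rc3, and NOOP already loses about 12% between 2.6.28.10 and 2.6.29.6.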
Comment 23 Rafael J. Wysocki 2010-02-01 15:18:10 UTC
It looks like the CFQ write performance dropped ~15% with respect to 2.6.32, although the 2.6.33-rc3 CFQ read performance seems to be the best ever.
Comment 24 Bart Van Assche 2010-02-01 15:26:24 UTC
(In reply to comment #23)
> It looks like the CFQ write performance dropped ~15% with respect to 2.6.32,
> although the 2.6.33-rc3 CFQ read performance seems to be the best ever.

Please note that the attached test results have been obtained using only one access pattern (linear I/O). It is possible that this conclusion can be generalized to other access patterns, but I'm not sure about this.
Comment 25 Bart Van Assche 2010-02-08 18:51:20 UTC
Created attachment 24953 [details]
Performance results for the 2.6.33-rc7 kernel, the SRP protocol and NOOP.
Comment 26 Bart Van Assche 2010-02-08 18:51:53 UTC
Created attachment 24954 [details]
Performance results for the 2.6.33-rc7 kernel, the SRP protocol and CFQ.
Comment 27 Bart Van Assche 2010-02-08 19:14:32 UTC
At least for this specific test, the asynchronous I/O throughput of 2.6.33-rc7 is better than that of 2.6.32 (on average 4% for writes and 70% for reads with the NOOP scheduler). Asynchronous write throughput is not yet as good as that of 2.6.28, though.

Direct I/O throughput with the NOOP scheduler has improved by about 2% for most block sizes compared to 2.6.32.

$ grep -w 4096 scst-srp-*-noop.txt|while read direct_io_results && read prefix bs w1 w2 w3 w ws wi r1 r2 r3 r rs ri; do echo $prefix $w $r; done
scst-srp-2.6.27.39-noop.txt:  946.143 505.922
scst-srp-2.6.28.10-noop.txt: 1146.544 513.158
scst-srp-2.6.29.6-noop.txt:  1003.962 509.874
scst-srp-2.6.30.9-noop.txt:   793.301 546.462
scst-srp-2.6.31.7-noop.txt:   813.216 551.276
scst-srp-2.6.32-noop.txt:     828.657 558.745
scst-srp-2.6.33-rc3-noop.txt: 825.712 963.724
scst-srp-2.6.33-rc7-noop.txt: 865.645 950.904
$ grep -w 4096 scst-srp-*-cfq.txt|while read direct_io_results && read prefix bs w1 w2 w3 w ws wi r1 r2 r3 r rs ri; do echo $prefix $w $r; done
scst-srp-2.6.27.39-cfq.txt:  843.048 820.406
scst-srp-2.6.28.10-cfq.txt: 1047.015 767.636
scst-srp-2.6.29.6-cfq.txt:  1015.714 819.344
scst-srp-2.6.30.9-cfq.txt:   790.427 540.332
scst-srp-2.6.31.7-cfq.txt:   823.695 557.837
scst-srp-2.6.32-cfq.txt:     826.158 565.967
scst-srp-2.6.33-rc3-cfq.txt: 717.231 950.252
scst-srp-2.6.33-rc7-cfq.txt: 858.258 938.176
Comment 28 Rafael J. Wysocki 2010-02-15 20:53:09 UTC
This is not a recent regression; it appears to date back to 2.6.28.
Comment 29 Bart Van Assche 2010-02-15 21:48:29 UTC
(In reply to comment #23)
> It looks like the CFQ write performance dropped ~15% with respect to 2.6.32,
> although the 2.6.33-rc3 CFQ read performance seems to be the best ever.

In the hope that I do not disappoint you: more SRP and iSER throughput measurements showed that the dominating factor in asynchronous read performance is the readahead size. This factor has a much larger influence than any other parameter I tried to vary, including the I/O scheduler.
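
For readers who want to check this factor on their own setup: on Linux the per-device readahead size is exposed in sysfs as /sys/block/<device>/queue/read_ahead_kb. The minimal Python sketch below reads that value. The function name read_ahead_kb and the sysfs_root parameter are illustrative assumptions, not something taken from the test script.

```python
from pathlib import Path


def read_ahead_kb(device, sysfs_root="/sys"):
    """Return the readahead size in KiB for a block device such as 'sdb'.

    `sysfs_root` is only a parameter so the function can be tried
    against a scratch directory; on a real system leave it at '/sys'.
    """
    path = Path(sysfs_root) / "block" / device / "queue" / "read_ahead_kb"
    return int(path.read_text().strip())
```

Calling read_ahead_kb("sdb") would report the current setting for a device named sdb (a placeholder name here). Writing a new value to the same file as root changes the readahead size; blockdev --setra does the same, in units of 512-byte sectors.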
Comment 30 Florian Mickler 2010-09-28 05:18:46 UTC
What is your conclusion from this? 

Should this still be an open regression report from 2.6.28 or do you consider this fixed? Do you have numbers for current kernels? 

I think regularly monitoring these things for every kernel version and maybe even for the late -rc series might be beneficial....
Comment 31 Bart Van Assche 2010-09-28 19:24:54 UTC
(In reply to comment #30)
> What is your conclusion from this? 
> 
> Should this still be an open regression report from 2.6.28 or do you consider
> this fixed? Do you have numbers for current kernels? 

While asynchronous read throughput has improved significantly, asynchronous write throughput for a 4 KB block size with the NOOP scheduler is about 25% lower with the 2.6.33-rc7 kernel than with the 2.6.28.10 kernel (865.645 versus 1146.544 MB/s; put the other way, 2.6.28.10 is about 32% faster). Looks like a regression to me.
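
A percentage like this depends on which kernel is taken as the baseline. The minimal Python sketch below shows both readings, using the 4 KB NOOP write figures from comment 27 (865.645 MB/s for 2.6.33-rc7 and 1146.544 MB/s for 2.6.28.10). The function and constant names are illustrative only.

```python
RC7_NOOP_WRITE = 865.645    # MB/s, 2.6.33-rc7, from comment 27
V28_NOOP_WRITE = 1146.544   # MB/s, 2.6.28.10, from comment 27


def percent_slower(new, old):
    """How much lower `new` is, as a percentage of the old value."""
    return 100.0 * (old - new) / old


def percent_faster(old, new):
    """How much higher `old` is, as a percentage of the new value."""
    return 100.0 * (old - new) / new


if __name__ == "__main__":
    slower = percent_slower(RC7_NOOP_WRITE, V28_NOOP_WRITE)
    faster = percent_faster(V28_NOOP_WRITE, RC7_NOOP_WRITE)
    print(f"2.6.33-rc7 is {slower:.1f}% slower than 2.6.28.10")
    print(f"2.6.28.10 is {faster:.1f}% faster than 2.6.33-rc7")
```

Relative to 2.6.28.10 the drop is about 24.5%; relative to 2.6.33-rc7 the older kernel is about 32% faster. Both describe the same pair of measurements.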

> I think regularly monitoring these things for every kernel version and maybe
> even for the late -rc series might be beneficial....

I agree. I hope that someone who is reading this has the time to set up an infrastructure that allows this test to be run in an automated fashion for every new rc kernel.

Regarding more recent kernels: a quick test indicated that 2.6.35 kernels are at least a few percent slower than 2.6.34 kernels for direct I/O.
Comment 32 Florian Mickler 2010-12-17 20:44:14 UTC
I assume that no miracle occurred between 2.6.33-rc7 and 2.6.34, and thus that 2.6.35 throughput is also lower than that of 2.6.28. You might consider raising this issue on the mailing list to see whether someone is interested in fixing it.