Created attachment 24794 [details]
The blockdev-perftest script.

While performing regression tests on the SCST-SRP target implementation I noticed that the throughput of asynchronous block I/O with the 2.6.33-rc3 kernel is up to 25% lower than with the 2.6.27.39 kernel. This behavior occurs not only with the SRP initiator but also with the iSER initiator, and it was consistent for the NOOP and CFQ schedulers. Did I make a mistake while configuring the 2.6.33-rc3 kernel?

The script that was used for running the performance tests has been attached. It was invoked as follows:

$ blockdev-perftest -d -s 30 -m 12 ${initiator_device}; blockdev-perftest -d -s 30 -m 12 ${initiator_device}

Setup details for the initiator system, the system on which the performance tests were run:
* SRP initiator was loaded with parameter srp_sg_tablesize=128.
* Frequency scaling was disabled.
* Runlevel: 3.
* IRQ affinity for mlx4_core: bound to a single core (smp_affinity=1).
* IB HCA: QDR Mellanox ConnectX MT26428.
* CPU: Intel Core2 Duo E6750 @ 2.66 GHz.

Setup details for the target system:
* 2.6.30.7 kernel with SCST patches and with kernel debugging disabled.
* OFED 1.5 IB drivers.
* SCST revision compiled in release mode (make debug2release) and with SCST_MAX_TGT_DEV_COMMANDS changed from 48 to 128.
* ib_srpt kernel module parameter: thread=0.
* STGT revision 1.0.1.
* A 1 GB file residing on a tmpfs filesystem was exported to the initiator system.
* Frequency scaling was disabled.
* Runlevel: 3.
* IRQ affinity for mlx4-comp-0: not bound to a core (smp_affinity=3).
* IB HCA: QDR Mellanox ConnectX MT26428.
* CPU: Intel Core2 Duo E8400 @ 3.00 GHz.
Created attachment 24795 [details] Kernel config used for the 2.6.27.39 kernel.
Created attachment 24796 [details] Kernel config used for the 2.6.33-rc3 kernel.
Created attachment 24797 [details] Performance results for the 2.6.27.39 kernel, the SRP protocol and CFQ.
Created attachment 24798 [details] Performance results for the 2.6.27.39 kernel, the SRP protocol and NOOP.
Created attachment 24799 [details] Performance results for the 2.6.33-rc3 kernel, the SRP protocol and CFQ.
Created attachment 24800 [details] Performance results for the 2.6.33-rc3 kernel, the SRP protocol and NOOP.
Created attachment 24801 [details] Performance results for the 2.6.27.39 kernel, the iSER protocol and CFQ.
Created attachment 24802 [details] Performance results for the 2.6.27.39 kernel, the iSER protocol and NOOP.
Created attachment 24803 [details] Performance results for the 2.6.33-rc3 kernel, the iSER protocol and CFQ.
Created attachment 24804 [details] Performance results for the 2.6.33-rc3 kernel, the iSER protocol and NOOP.
(In reply to comment #0)
> * SCST revision compiled in release mode (make debug2release)

The above should have been: SCST revision 1484 from https://scst.svn.sourceforge.net/svnroot/scst/branches/srpt-separate-rx-tx-buffers.
Bart, thanks for the detailed regression report. Would it be possible for you to try and narrow it down to a specific kernel version? Eg does 2.6.32 run fast or slow, .31, etc.
Created attachment 24833 [details] Kernel config used for the 2.6.28.10 kernel.
Created attachment 24834 [details] Performance results for the 2.6.28.10 kernel, the SRP protocol and CFQ.
Created attachment 24835 [details] Performance results for the 2.6.28.10 kernel, the SRP protocol and NOOP.
Created attachment 24836 [details] Kernel config used for the 2.6.29.6 kernel.
Created attachment 24837 [details] Performance results for the 2.6.29.6 kernel, the SRP protocol and CFQ.
Created attachment 24838 [details] Performance results for the 2.6.29.6 kernel, the SRP protocol and NOOP.
Created attachment 24839 [details] Kernel config used for the 2.6.30.9 kernel.
Created attachment 24840 [details] Performance results for the 2.6.30.9 kernel, the SRP protocol and CFQ.
Created attachment 24841 [details] Performance results for the 2.6.30.9 kernel, the SRP protocol and NOOP.
(In reply to comment #12)
> Bart, thanks for the detailed regression report. Would it be possible for you
> to try and narrow it down to a specific kernel version? Eg does 2.6.32 run
> fast or slow, .31, etc.

Apparently there was a performance regression for asynchronous I/O in kernel version 2.6.30 for both NOOP and CFQ and also in 2.6.33-rc3 for CFQ.

This is how asynchronous I/O throughput evolved over kernel versions for the NOOP I/O scheduler (the two columns represent write and read throughput respectively, in MB/s):

$ grep -w 4096 scst-srp-*-noop.txt|while read direct_io_results && read prefix bs w1 w2 w3 w ws wi r1 r2 r3 r rs ri; do echo $prefix $w $r; done
scst-srp-2.6.27.39-noop.txt: 946.143 505.922
scst-srp-2.6.28.10-noop.txt: 1146.544 513.158
scst-srp-2.6.29.6-noop.txt: 1003.962 509.874
scst-srp-2.6.30.9-noop.txt: 793.301 546.462
scst-srp-2.6.31.7-noop.txt: 813.216 551.276
scst-srp-2.6.32-noop.txt: 828.657 558.745
scst-srp-2.6.33-rc3-noop.txt: 825.712 963.724

And this is how asynchronous I/O throughput evolved over kernel versions for the CFQ I/O scheduler:

$ grep -w 4096 scst-srp-*-cfq.txt|while read direct_io_results && read prefix bs w1 w2 w3 w ws wi r1 r2 r3 r rs ri; do echo $prefix $w $r; done
scst-srp-2.6.27.39-cfq.txt: 843.048 820.406
scst-srp-2.6.28.10-cfq.txt: 1047.015 767.636
scst-srp-2.6.29.6-cfq.txt: 1015.714 819.344
scst-srp-2.6.30.9-cfq.txt: 790.427 540.332
scst-srp-2.6.31.7-cfq.txt: 823.695 557.837
scst-srp-2.6.32-cfq.txt: 826.158 565.967
scst-srp-2.6.33-rc3-cfq.txt: 717.231 950.252
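To make the version-to-version changes in the figures above easier to read, the relative change between consecutive kernels can be computed from the three-column output. The following is only a sketch: the function name pct_change is made up for illustration, and it assumes input lines of the form "label write read" exactly as printed above.

```shell
#!/bin/sh
# pct_change (hypothetical helper): read "<label> <write MB/s> <read MB/s>"
# lines on stdin and print, for every line after the first, the change
# relative to the previous line in percent.
pct_change() {
    awk '
        NR > 1 {
            printf "%s write %+.1f%% read %+.1f%%\n", $1,
                   ($2 - pw) / pw * 100, ($3 - pr) / pr * 100
        }
        { pw = $2; pr = $3 }   # remember this line as the next baseline
    '
}

# Two NOOP lines copied from the results above.
pct_change <<'EOF'
scst-srp-2.6.29.6-noop.txt: 1003.962 509.874
scst-srp-2.6.30.9-noop.txt: 793.301 546.462
EOF
# prints: scst-srp-2.6.30.9-noop.txt: write -21.0% read +7.2%
```

Piping the complete output of the grep command into pct_change gives one line per kernel transition, which makes the 2.6.29 to 2.6.30 write drop stand out.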
It looks like the CFQ write performance dropped ~15% with respect to 2.6.32, although the 2.6.33-rc3 CFQ read performance seems to be the best ever.
(In reply to comment #23)
> It looks like the CFQ write performance dropped ~15% with respect to 2.6.32,
> although the 2.6.33-rc3 CFQ read performance seems to be the best ever.

Please note that the attached test results have been obtained using only one access pattern (linear I/O). It is possible that this conclusion can be generalized to other access patterns, but I'm not sure about this.
Created attachment 24953 [details] Performance results for the 2.6.33-rc7 kernel, the SRP protocol and NOOP.
Created attachment 24954 [details] Performance results for the 2.6.33-rc7 kernel, the SRP protocol and CFQ.
At least for this specific test the asynchronous I/O throughput of 2.6.33-rc7 is better than that of 2.6.32 (on average 4% for writes and 70% for reads with the NOOP scheduler). Asynchronous I/O throughput is not yet as good as that of 2.6.28 though. Direct I/O throughput with the NOOP scheduler has improved by about 2% for most block sizes compared to 2.6.32.

$ grep -w 4096 scst-srp-*-noop.txt|while read direct_io_results && read prefix bs w1 w2 w3 w ws wi r1 r2 r3 r rs ri; do echo $prefix $w $r; done
scst-srp-2.6.27.39-noop.txt: 946.143 505.922
scst-srp-2.6.28.10-noop.txt: 1146.544 513.158
scst-srp-2.6.29.6-noop.txt: 1003.962 509.874
scst-srp-2.6.30.9-noop.txt: 793.301 546.462
scst-srp-2.6.31.7-noop.txt: 813.216 551.276
scst-srp-2.6.32-noop.txt: 828.657 558.745
scst-srp-2.6.33-rc3-noop.txt: 825.712 963.724
scst-srp-2.6.33-rc7-noop.txt: 865.645 950.904

$ grep -w 4096 scst-srp-*-cfq.txt|while read direct_io_results && read prefix bs w1 w2 w3 w ws wi r1 r2 r3 r rs ri; do echo $prefix $w $r; done
scst-srp-2.6.27.39-cfq.txt: 843.048 820.406
scst-srp-2.6.28.10-cfq.txt: 1047.015 767.636
scst-srp-2.6.29.6-cfq.txt: 1015.714 819.344
scst-srp-2.6.30.9-cfq.txt: 790.427 540.332
scst-srp-2.6.31.7-cfq.txt: 823.695 557.837
scst-srp-2.6.32-cfq.txt: 826.158 565.967
scst-srp-2.6.33-rc3-cfq.txt: 717.231 950.252
scst-srp-2.6.33-rc7-cfq.txt: 858.258 938.176
This is not a recent regression; it appears to be a regression from 2.6.28.
(In reply to comment #23)
> It looks like the CFQ write performance dropped ~15% with respect to 2.6.32,
> although the 2.6.33-rc3 CFQ read performance seems to be the best ever.

In the hope that I do not disappoint you: more SRP and iSER throughput measurements showed that the dominant factor in asynchronous read performance is the readahead size. This factor has a much larger influence than any other parameter I tried to vary, including the I/O scheduler.
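For anyone who wants to repeat the readahead comparison, a sweep over readahead sizes could look roughly like the sketch below. It is not the procedure that was used for the measurements in this report. The function name ra_sweep, the device /dev/sdX, the benchmark command and the output file names are placeholders; the only real interface it relies on is "blockdev --setra", which takes the readahead size in 512-byte sectors. By default the function only prints what it would do.

```shell
#!/bin/sh
# ra_sweep DEV BENCH_CMD SIZE... (hypothetical helper)
# For each readahead SIZE (in 512-byte sectors), set it on DEV with
# "blockdev --setra" and run BENCH_CMD, saving its output to ra-SIZE.txt.
# With DRY_RUN=1 (the default) the commands are only printed.
ra_sweep() {
    dev=$1
    bench=$2
    shift 2
    for ra in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "blockdev --setra $ra $dev"
            echo "$bench"
        else
            blockdev --setra "$ra" "$dev" || return 1
            sh -c "$bench" > "ra-$ra.txt"
        fi
    done
}

# Placeholder device and benchmark command; substitute the real ones.
ra_sweep /dev/sdX "your-async-read-benchmark /dev/sdX" 256 1024 4096
```

Setting DRY_RUN=0 makes the function apply the settings, which requires root privileges. The current value can be checked beforehand with "blockdev --getra".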
What is your conclusion from this? Should this still be an open regression report from 2.6.28 or do you consider this fixed? Do you have numbers for current kernels? I think regularly monitoring these things for every kernel version and maybe even for the late -rc series might be beneficial....
(In reply to comment #30)
> What is your conclusion from this?
>
> Should this still be an open regression report from 2.6.28 or do you consider
> this fixed? Do you have numbers for current kernels?

While asynchronous read throughput has improved significantly, asynchronous write throughput for a 4 KB block size with the NOOP scheduler is about 25% lower with the 2.6.33-rc7 kernel than with the 2.6.28 kernel (865.645 MB/s versus 1146.544 MB/s; put differently, 2.6.28 is about 32% faster). This looks like a regression to me.

> I think regularly monitoring these things for every kernel version and maybe
> even for the late -rc series might be beneficial....

I agree. I hope that someone who is reading this has the time to set up an infrastructure that allows this test to be run in an automated fashion for every new rc kernel.

Regarding more recent kernels: a quick test indicated that 2.6.35 kernels are at least a few percent slower than 2.6.34 kernels for direct I/O.
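As a starting point for such automated monitoring, the comparison step could be reduced to a small check that fails when a new result falls more than a given percentage below a baseline. This is a sketch only: the function name check_regression and the 5% threshold are arbitrary choices for illustration, not part of any existing test infrastructure.

```shell
#!/bin/sh
# check_regression BASELINE NEW THRESHOLD_PCT (hypothetical helper)
# Exit 0 when NEW is at most THRESHOLD_PCT percent below BASELINE;
# otherwise print a message and exit 1.
check_regression() {
    awk -v base="$1" -v new="$2" -v thr="$3" 'BEGIN {
        drop = (base - new) / base * 100
        if (drop > thr) {
            printf "REGRESSION: %.1f%% below baseline\n", drop
            exit 1
        }
        printf "ok: %.1f%% change versus baseline\n", -drop
    }'
}

# 4 KB asynchronous write, NOOP: 2.6.28.10 (baseline) versus 2.6.33-rc7,
# using the figures reported earlier in this thread.
check_regression 1146.544 865.645 5 || echo "flagged"
# prints: REGRESSION: 24.5% below baseline
#         flagged
```

Relative to the 2.6.28.10 figure the drop is about 24.5%; equivalently, 2.6.28.10 is about 32% faster than 2.6.33-rc7 for this case. A wrapper that runs the benchmark on each new rc kernel could call this check once per column and use the exit status to raise an alert.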
I assume that no miracle occurred between 2.6.33-rc7 and 2.6.34, and thus that the 2.6.35 throughput is also lower than that of 2.6.28. You might consider raising this issue on the mailing list to see whether someone is interested in fixing it.