Bug 15241

Summary: performance reduction for sata-nv from 2.6.30 to 2.6.32
Product: IO/Storage
Reporter: M G Berberich (kernel)
Component: Serial ATA
Assignee: Jeff Garzik (jgarzik)
Status: RESOLVED OBSOLETE
Severity: normal
CC: akpm, alan, fengguang.wu, florian, kay, kernel, rjw, tj, wbrana
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32.7
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 13615
Attachments: kernel-log; patch "To ensure a larger blocksize and reduce processing overheads"

Description M G Berberich 2010-02-06 22:56:08 UTC
Created attachment 24931 [details]
kernel-log

Performance of sata-nv (measured by hdparm -t) degrades from about 160 MB/s on 2.6.30 to at most 80 MB/s on 2.6.32.

The system is a Gigabyte GA-M55S-S3 (NForce 550) with an OCZ Vertex, running 64-bit Linux on an AMD Athlon64 X2 5000+ with 4 GByte of RAM. Both a self-compiled stock 2.6.32.7 kernel and the Debian 2.6.32-trunk-amd64 kernel give a maximum of 85 MB/s measured with hdparm -t, while a Debian 2.6.30-2-amd64 kernel gives 160 MB/s.
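(The hdparm test is essentially a timed sequential read of the raw device, i.e. something like

  hdparm -t /dev/sda

with /dev/sda being the SSD.)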

The SATA device seems to be set up properly:
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: OCZ-VERTEX, 1.5, max UDMA/133
ata1.00: configured for UDMA/133

kernel-log and lspci attached.
Comment 1 Tejun Heo 2010-02-08 02:54:34 UTC
Does "echo deadline > /sys/block/sda/queue/scheduler" change anything?
Comment 2 M G Berberich 2010-02-08 09:26:06 UTC
Yes, it gets worse.
After “echo deadline > /sys/block/sda/queue/scheduler” performance decreases to about 60 MB/s.
Comment 3 Tejun Heo 2010-02-08 09:30:09 UTC
How about "echo 1 > /sys/block/sda/device/queue_depth"?
Comment 4 M G Berberich 2010-02-08 09:43:44 UTC
That's better: it gives about 110 MB/s, but it's still way off the 160 MB/s I got with a 2.6.30 kernel.
Comment 5 Tejun Heo 2010-02-09 03:28:14 UTC
How about "dd if=/dev/sda of=/dev/null iflag=direct bs=4M count=1024"?
Comment 6 M G Berberich 2010-02-09 08:35:18 UTC
“dd if=/dev/sda of=/dev/null iflag=direct bs=4M count=1024”
gives about 200 MB/s on both the 2.6.32 and 2.6.30 kernels.

Without “iflag=direct” it's about 80 MB/s on 2.6.32 (I haven't tested on 2.6.30).
Comment 7 Tejun Heo 2010-02-10 01:57:20 UTC
Yeah, well, that pretty much rules out the block layer and the driver.  It's probably readahead or something in the VM layer.  Andrew, can you please take a look?
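A quick first check would be to compare the readahead settings on both kernels (a sketch, assuming the device is /dev/sda):

  blockdev --getra /dev/sda                # readahead window in 512-byte sectors
  cat /sys/block/sda/queue/read_ahead_kb   # the same setting in KiB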

Thanks.
Comment 8 Andrew Morton 2010-02-10 02:07:05 UTC
CC Mister Readahead.

Let's check whether it affects regular files and not just blockdevs:

On a fresh (if possible) filesystem:

dd if=/dev/zero of=foo bs=4M count=1024
echo 3 > /proc/sys/vm/drop_caches 
time dd if=foo of=/dev/null bs=4M

Thanks.
Comment 9 M G Berberich 2010-02-10 08:49:57 UTC
I created a new partition (sda4) and put an ext2 filesystem on it.
(I'm not sure how meaningful this is, because of the mapping layer inside the SSD.)

  # echo 3 > /proc/sys/vm/drop_caches
  # time dd if=foo of=/dev/null bs=4M
  1024+0 records in
  1024+0 records out
  4294967296 bytes (4.3 GB) copied, 22.7644 s, 189 MB/s
  
  real    0m22.785s
  user    0m0.004s
  sys     0m7.184s

So this gives about 190 MB/s (fine).
Then I tested the block device again:

  # echo 3 > /proc/sys/vm/drop_caches
  # time dd if=/dev/sda4 of=/dev/null bs=4M count=1024
  1024+0 records in
  1024+0 records out
  4294967296 bytes (4.3 GB) copied, 53.3936 s, 80.4 MB/s

  real    0m53.402s
  user    0m0.012s
  sys     0m18.189s

Tests were made with Debian's 2.6.32-trunk-amd64 kernel.
Comment 10 Wu Fengguang 2010-02-10 13:25:11 UTC
I tested three kernels and also compared block-device with normal file
reads, and found no big differences.

However, I do remember Vladislav or Ronald encountering a similar
slow-block-device problem. I have CCed them in another email:
Vladislav Bolkhovitin <vst@vlnb.net>
Ronald Moesbergen <intercommit@gmail.com>

2.6.33-rc7

# time dd if=/dev/sda8 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0285 s, 32.2 MB/s
dd if=/dev/sda8 of=/dev/null bs=4M count=100  0.00s user 7.02s system 53% cpu 13.038 total

2.6.32

# time dd if=/dev/sda8 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.2839 s, 31.6 MB/s
dd if=/dev/sda8 of=/dev/null bs=4M count=100  0.00s user 5.29s system 39% cpu 13.316 total

2.6.30-1-amd64

$ time dd if=/dev/sda8 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 12.4972 s, 33.6 MB/s
dd if=/dev/sda8 of=/dev/null bs=4M count=100  0.00s user 2.44s system 19% cpu 12.501 total

$ time dd if=/mm/iso/ghost.gho of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 11.4964 s, 36.5 MB/s
dd if=/mm/iso/ghost.gho of=/dev/null bs=4M count=100  0.00s user 0.66s system 5% cpu 11.501 total
Comment 12 M G Berberich 2010-02-14 15:27:19 UTC
I tested with the “readahead: add blk_run_backing_dev” patch (65a80b4c61f5b5f6eb0f5669c8fb120893bfb388) on 2.6.32, as suggested by Wu.
It does not help: I got about 200 MB/s from the file and about 90 MB/s from /dev/sda.

I made some tests with a “conventional” disk:
 
/dev/sda (SSD):   80 MB/s
file (SSD):      190 MB/s
/dev/sdd (Disk):  90 MB/s
file (Disk):      90 MB/s

Interestingly, reading from the block device is slower on the SSD than on the disk.
Comment 13 Peter Barth 2010-02-25 18:45:41 UTC
I also have an OCZ Vertex on an AMD Athlon64 X2 5000+, with 8 GByte RAM. Running in 64-bit mode with a vanilla 2.6.33 kernel gives me similar results for blockdev vs. file access.

/dev/sdb2 (SSD):   70 MB/s
file (SSD):       150 MB/s
/dev/sda2 (disc):  85 MB/s
file (disc):       80 MB/s

All these tests were made with bs=4M, count=250 and with dropped caches.
So for me, too, the block device on the SSD is slower than on the disk.
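In other words, each number above comes from something like the following, with the device or file name adjusted accordingly:

  echo 3 > /proc/sys/vm/drop_caches
  dd if=/dev/sdb2 of=/dev/null bs=4M count=250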
Comment 14 Peter Barth 2010-02-27 23:28:01 UTC
I'm sorry; the values above actually came from a 2.6.33-rc7 kernel. Today I tested the 2.6.33 release, which gave me somewhat different results.

Instead of 70 MB/s for the blockdev I'm now getting 128 MB/s (and 150 MB/s for the file). Interestingly, I now also get 150 MB/s from the blockdev if the partition is mounted.

A shell session (edited for readability) looks like this:

 # dd if=/dev/sdb2 of=/dev/null bs=4M count=250
   -> 128 MB/s
 # mount /dev/sdb2 /mnt/
 # dd if=/mnt/test_file of=/dev/null bs=4M count=250
   -> 150 MB/s
 # dd if=/dev/sdb2 of=/dev/null bs=4M count=250
   -> 150 MB/s
 # umount /dev/sdb2
 # dd if=/dev/sdb2 of=/dev/null bs=4M count=250
   -> 128 MB/s

I added "echo 3 > /proc/sys/vm/drop_caches" in between just to be sure, and thereby discovered that caching on the blockdev only works for a mounted partition. But I guess this is intended?
Comment 15 Wu Fengguang 2010-02-28 03:44:31 UTC
What's the output of "stat /dev/sdb2" before/after the mount?

I happened to notice yesterday that block_read_full_page() submits IO in buffer heads of the bdev blocksize. The blocksize may be enlarged by the fs with sb_set_blocksize() at mount time.
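One way to observe this (a sketch, untested here) is to check the soft blocksize used for the block device before and after mounting:

  blockdev --getbsz /dev/sdb2   # soft blocksize used for blockdev I/O
  mount /dev/sdb2 /mnt
  blockdev --getbsz /dev/sdb2   # may now report 4096 if the fs enlarged it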
Comment 16 Peter Barth 2010-02-28 09:55:01 UTC
In mounted and unmounted state "stat" gives me the same output:

kuhli ~ # stat /dev/sdb2
  File: „/dev/sdb2“
  Size: 0               Blocks: 0          IO Block: 4096   block special file
Device: eh/14d  Inode: 4109        Links: 1     Device type: 8,12
Access: (0660/brw-rw----)  Uid: (    0/    root)   Gid: (    6/    disk)
Access: 2010-02-27 23:36:46.704302705 +0100
Modify: 2010-02-27 23:36:30.174255460 +0100
Change: 2010-02-27 23:36:30.174255460 +0100
Comment 17 M G Berberich 2010-03-01 21:45:52 UTC
I can confirm the mounted vs. unmounted blockdevice effect (2.6.33 Debian kernel):

    /dev/sda4 mounted:   220 MB/s
    /dev/sda4 unmounted: 142 MB/s
    file on /dev/sda4:   197 MB/s

The stat output does not change on mounting/unmounting.
Comment 18 Wu Fengguang 2010-03-02 01:35:03 UTC
Sorry, it seems that `stat` won't reflect the blocksize (and size) used by the blockdev. The two variables are initialized together in bd_set_size() at blockdev (first) open time. We could do "echo 4096 > /sys/block/sda/queue/logical_block_size" to possibly enlarge the blocksize computed in that function.

Another difference may be that the unmounted case flushes data at blockdev file close() time, which may add some delay.
Comment 19 Wu Fengguang 2010-03-02 01:43:07 UTC
bd_set_size() will pick a sub-page blocksize if the blockdev size is not a multiple of 4096. So what are your /dev/sdb2 and /dev/sda4 sizes? Can you post the "sfdisk -l" output?
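A quick check (sketch): the partition size in 512-byte sectors must be divisible by 8 for a 4096-byte blocksize to be possible, e.g.

  blockdev --getsz /dev/sdb2   # size in 512-byte sectors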
Comment 20 Wu Fengguang 2010-03-02 04:41:22 UTC
Created attachment 25305 [details]
To ensure a larger blocksize and reduce processing overheads.
Comment 21 Peter Barth 2010-03-07 17:00:01 UTC
Hello,

I tested your patch today. (I didn't test the "echo 4096 >
/sys/block/sda/queue/logical_block_size", as that didn't work: the file is read-only.)

With the patch and the disc unmounted I get 145 MB/s (before: 128 MB/s); with the disc mounted I keep getting 150 MB/s.

So the patch increased the blockdev speed, but with the disc mounted it's still a bit faster.
Comment 22 Florian Mickler 2010-12-18 00:40:07 UTC
I just tested 2.6.37-rc5+ with my rotational disk and am getting slightly faster results for the unmounted case (probably within the noise). Can you verify that on current kernels?
Comment 23 Florian Mickler 2010-12-19 17:37:28 UTC
I just realized that my test on a conventional disk is useless, as the comments above seem to suggest the problem only affects 'fast enough' disks (SSDs).
Nonetheless, can someone verify that the bug is still present on current kernels?
Comment 24 Peter Barth 2011-01-09 23:22:23 UTC
I just tested with a 2.6.37 kernel and I still get the same behaviour as noted in comment #14.
Comment 25 wbrana 2011-02-25 21:19:46 UTC
I can probably see this bug with 2.6.37 and a RAID10 of slow disks:

dd if=/dev/md2 of=/dev/null iflag=direct bs=4M count=128
128+0 records in
128+0 records out
536870912 bytes (537 MB) copied, 1.68345 s, 319 MB/s

dd if=/dev/md2 of=/dev/null bs=4M count=128
128+0 records in
128+0 records out
536870912 bytes (537 MB) copied, 2.94772 s, 182 MB/s
Comment 26 wbrana 2011-02-26 11:28:13 UTC
I tried booting the PC without the X server.

dd if=/dev/md2 of=/dev/null bs=4M count=128
354 MB/s 
after the X server is started
182-195 MB/s

My PC has SATA controller: Intel Corporation 82801IB (ICH9) 4 port SATA AHCI Controller (rev 02)
Comment 27 wbrana 2011-02-26 11:54:52 UTC
With the patch from comment #20:

dd if=/dev/md2 of=/dev/null bs=4M count=128
354 MB/s 
after the X server is started
170-185 MB/s

Also, I can't mount an encrypted partition if I boot with the patched kernel:
pam_mount(mount.c:68): crypt_activate_by_passphrase: Operation not permitted
Comment 28 wbrana 2011-02-26 12:27:07 UTC
It seems that performance decreases after logging in to KDE, not after the X server starts. If I stop KDE and the X server, performance isn't recovered.
Comment 29 Tejun Heo 2011-02-26 12:28:03 UTC
Maybe KDE enables ALPM?
Comment 30 wbrana 2011-02-26 13:27:41 UTC
cat /sys/class/scsi_host/host[0-4]/link_power_management_policy
max_performance
max_performance
max_performance
max_performance
max_performance

The performance drop doesn't happen with IceWM.
Comment 31 wbrana 2011-03-01 21:03:20 UTC
I found that one of the following daemons, which are started by KDE, is causing it:
/usr/libexec/udisks-daemon
/usr/libexec/upowerd
/usr/libexec/polkitd
After disabling them (by renaming) and rebooting, the performance drop doesn't happen.
Comment 32 Tejun Heo 2011-03-02 08:25:58 UTC
Kay, u* daemons are slowing down IO accesses.  Any idea on what's going on?  wbrana, can you please use udevadm and/or blktrace to see what's going on?  Which commands are those daemons issuing and how often?

Thanks.
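For example (a sketch; adjust the device name):

  udevadm monitor                             # watch uevents while the slowdown happens
  blktrace -d /dev/md2 -o - | blkparse -i -   # trace the commands actually hitting the device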
Comment 33 Kay Sievers 2011-03-02 12:11:48 UTC

The only known thing that udisks might be doing is polling with open()
every 2 seconds for devices which mark the media as removable.

Does simply killing udisks during runtime change the behavior?

What's in the process list for udisks? We change the name to tell
which devices are polled.
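For example (sketch):

  ps ax | grep udisks-daemon   # the process title shows which devices, if any, are being polled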
Comment 34 wbrana 2011-03-02 12:54:18 UTC
Killing udisks doesn't change the behavior.
The process list shows: "udisks-daemon: not polling any devices"

complete output from udevadm monitor:

monitor will print the received events for:
UDEV - the event which udev sends out after rule processing
KERNEL - the kernel uevent

KERNEL[1299069807.367113] add      /devices/virtual/bdi/253:0 (bdi)
KERNEL[1299069807.367228] add      /devices/virtual/block/dm-0 (block)
UDEV  [1299069807.367349] add      /devices/virtual/bdi/253:0 (bdi)
UDEV  [1299069807.367396] add      /devices/virtual/block/dm-0 (block)
KERNEL[1299069807.367451] change   /devices/virtual/block/dm-0 (block)
KERNEL[1299069807.367480] change   /devices/virtual/block/dm-0 (block)
UDEV  [1299069807.367608] change   /devices/virtual/block/dm-0 (block)
UDEV  [1299069807.369444] change   /devices/virtual/block/dm-0 (block)
KERNEL[1299069807.372051] remove   /devices/virtual/block/dm-0 (block)
UDEV  [1299069807.373665] remove   /devices/virtual/block/dm-0 (block)
KERNEL[1299069807.377309] remove   /devices/virtual/bdi/253:0 (bdi)
KERNEL[1299069807.377411] remove   /devices/virtual/block/dm-0 (block)
UDEV  [1299069807.377417] remove   /devices/virtual/bdi/253:0 (bdi)
UDEV  [1299069807.377520] remove   /devices/virtual/block/dm-0 (block)
KERNEL[1299069807.395488] change   /devices/virtual/block/md23 (block)
KERNEL[1299069807.395495] add      /devices/virtual/bdi/253:0 (bdi)
KERNEL[1299069807.395702] add      /devices/virtual/block/dm-0 (block)
UDEV  [1299069807.395709] add      /devices/virtual/bdi/253:0 (bdi)
KERNEL[1299069807.395891] change   /devices/virtual/block/dm-0 (block)
UDEV  [1299069807.395929] add      /devices/virtual/block/dm-0 (block)
UDEV  [1299069807.507581] change   /devices/virtual/block/md23 (block)
UDEV  [1299069807.712631] change   /devices/virtual/block/dm-0 (block)
Comment 35 wbrana 2011-03-02 14:07:26 UTC
The performance drop also happens when udisks-daemon is disabled and upowerd and polkitd are enabled.
Comment 36 Kay Sievers 2011-03-02 14:38:18 UTC

Might be stuff called by pm-utils. Things in:
  /usr/lib/pm-utils/power.d/sata_alpm
or some other files close to that one. It looks like a mess. :)
Comment 37 Kay Sievers 2011-03-02 14:43:02 UTC

From #udev:
<mbiebl> you can ask him to manually run pm-powersave true|false
<mbiebl> to narrow it down to pm-utils
Comment 38 wbrana 2011-03-02 15:00:03 UTC
Strange results:

pm-powersave true:
dd if=/dev/md2 of=/dev/null bs=4M count=128
128+0 records in
128+0 records out
536870912 bytes (537 MB) copied, 1.51789 s, 354 MB/s
cat /sys/class/scsi_host/host[0-4]/link_power_management_policy
min_power
min_power
min_power
min_power
min_power


pm-powersave false:
dd if=/dev/md2 of=/dev/null bs=4M count=128
128+0 records in
128+0 records out
536870912 bytes (537 MB) copied, 2.87367 s, 187 MB/s
cat /sys/class/scsi_host/host[0-4]/link_power_management_policy
max_performance
max_performance
max_performance
max_performance
max_performance
Comment 39 wbrana 2011-03-02 15:25:18 UTC
I tried changing /sys/class/scsi_host/host[0-4]/link_power_management_policy from the command line; performance isn't changed.
The performance of /dev/sd[a-d] isn't changed by pm-powersave.
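For reference, the manual change was presumably along these lines (a sketch, run as root):

  for h in /sys/class/scsi_host/host[0-4]; do
      echo min_power > $h/link_power_management_policy   # or max_performance
  done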