Created attachment 21608 [details]
I'm writing some files (about 200MB) on a DVD-RW with the pktcdvd driver.
The writing starts with good rate and then it drops down to 16 kB/s, and it can not go back to better rates until the write operation has finished and the system has synced.
Switching /dev/sr0 to anticipatory scheduler does restore normal rates (up to 2700 kB/s).
How to reproduce :
- format a DVD-RW for packet-writing
- # pktsetup 0 /dev/sr0
- # mount -o noatime,nodiratime,rw /dev/pktcdvd/pktcdvd0 /media/pkt
- # dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync
I bisected this behaviour to the following commit:
2f5cb7381b737e24c8046fd4aeab571fb71315f5 is first bad commit
Author: Jens Axboe <email@example.com>
Date: Tue Apr 7 08:51:19 2009 +0200
cfq-iosched: change dispatch logic to deal with single requests at the time
The IO scheduler core calls into the IO scheduler dispatch_request hook
to move requests from the IO scheduler and into the driver dispatch
list. It only does so when the dispatch list is empty. CFQ moves several
requests to the dispatch list, which can cause higher latencies if we
suddenly have to switch to some important sync IO. Change the logic to
move one request at the time instead.
This should almost be functionally equivalent to what we did before,
except that we now honor 'quantum' as the maximum queue depth at the
device side from any single cfqq. If there's just a single active
cfqq, we allow up to 4 times the normal quantum.
Signed-off-by: Jens Axboe <firstname.lastname@example.org>
:040000 040000 be8b0f16d8faba88fc04e877eaa9240ec5d00648 5fd6eb925f065f411ed0d387277e45ba722ea154 M block
I was unable to revert because of too many conflicts.
(please note: the kernel is actually vanilla kernel with reiser4 patches)
Created attachment 21609 [details]
Created attachment 21650 [details]
I don't have a lot of ideas on this one, behaviour should be pretty much identical to before. Can you try with this debug patch? It disables some of the logic that stops queuing, we call that on a different scope now, so it may be related.
Sorry jens, this patch (attachment 21650 [details]) does not help, the rate of writing still drops down to 16 kB/s.
Could you provide a blktrace trace for the two kernels (one that works well, and one that doesn't)? Please also include the kernel versions you tested.
Created attachment 21844 [details]
blktrace of /dev/sr0, dd was slow (16 KB/s) until sr0 was switched to noop scheduler
Sorry for the being so long.
So here is blktrace output for a bad "dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync" session. Writing was very slow until I switched sr0 to noop scheduler.
Kernel version is 2.6.30-rc8
Created attachment 21845 [details]
blktrace of /dev/sr0, dd speed was good
And here is another blktrace of /dev/sr0, same kernel (2.6.30-rc8), cfq scheduler.
In this run, dd speed was good enough, it completed within 14 seconds (about 600 kB/s).
To be precise, the good trace was captured first, and the wrong trace was captured later. No boot between the two blktrace.
If you kill hal, does it make any difference to the performance?
Created attachment 21846 [details]
blktrace of /dev/sr0, no HAL daemon, dd speed was slow
no, killing HAL daemon does not help.
Thanks for providing this information! Just a note, it's much easier for me to consume the binary trace data, as there are utilities than can be run on that data to analyse it.
The Queuing of requests is happening in order in both traces. However, the requests are issued to the device out of sequence in the bad trace, and the I/O times (from D to C) are extremely high when this is going on (around 5 seconds as opposed to 2-23 milliseconds).
This anomaly only happens in the beginning of the run. It eventually works itself out:
(these numbers are printed upon completion)
block d->c time
Laurent, I have one further question. Do you see the same performance for every run under 2.6.29?
One thing I've noticed is that every time we dispatch a request out of order, it is due to a request that has languished on the fifo for too long. Add to that the logic that only takes one request from the fifo at a time and you get some really nasty seeking going on, which is awful for performance of the cdrom drive.
We could allow all of the expired requests to get serviced, but I presume that code has been put into place for a reason. The other approach I came up with was to choose a new next_rq after updating last_position. The reason for that is now that the disk head has moved, the next best request will change, too.
I've not been able to get a system running a new kernel to test this, unfortunately. Laurent, you can feel free to try the patch that I'll attach shortly. However, if it breaks, you get to keep both pieces. ;-)
Now, I still can't explain why the commit that git bisect picked could have caused all of this. Jens, I look forward to hearing your thoughts.
Created attachment 21887 [details]
Choose a new next_rq when updating last_position
This patch will choose a new next_rq for the cfqq associated with the request that is being activated. This is completely untested.
Created attachment 21888 [details]
blktrace of /dev/sr0, kernel 2.6.29
(In reply to comment #10)
> Laurent, I have one further question. Do you see the same performance for
> every run under 2.6.29?
AFAICT, 2.6.29 was fine, copying 8 MB with "dd" regularly takes about 15 to 20 seconds.
(In reply to comment #12)
> Created an attachment (id=21887) [details]
> Choose a new next_rq when updating last_position
> This patch will choose a new next_rq for the cfqq associated with the request
> that is being activated. This is completely untested.
This patch is buggy, I've a Oops on boot:
NULL pointer dereference at ...
EIP is at rb_next+0x3/0x3a
Created attachment 21994 [details]
choose a new next_rq when inserting a request into the dispatch queue
This patch builds and boots and survives stress testing. I'm not sure if it's at all a good idea. Hopefully Jens will have a chance to look at this soon.
Sorry for the very long delay...
Here is the good news: Jeff, your patch (attchment 21994) solved the issue.
I ran the test command 20 times in a row, the problem never showed up:
# while [ $i -le 20 ]; do
# echo == pass $i ==
# dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync
# sleep 15
# (( i += 1 ))
Nevertheless, the performance was better in 2.6.29.
Scheduler Kernel Average time (s) Standard Deviation
----------- --------------- ---------------- -------------------
CFQ 2.6.30 patched 38.19 15.9
CFQ 2.6.29 22.92 5.93
Noop 2.6.29 40.22 4.46
Anticipatory 2.6.29 47.15 9.83
I did some new tests with kernel 2.6.31-rc7. This kernel exhibits the same problem I described. (i.e. writing on a DVD-RW with packet-writing driver slows down to 16 kb/s, it takes several minutes to complete a 8 MB write).
The patch provided by Jeff (attachment 21994 [details]) solves the problem.
Scheduler Kernel Average time (s) Standard Deviation
--------- ---------------- ---------------- -------------------
CFQ 2.6.29 22.92 5.93
CFQ 2.6.31-rc7 patched 27.76 15.78
Does this bug need further investigations ? or a better patch ?
The patch looks straight forward, so I think I can sneak it in for 2.6.31 (it IS a regression). Sorry for the delay in handling this, I'll get on it now.
Jeff, can you send me a proper signed-off patch?
Yes, I'll get that out this morning.
In thinking about this more, I'm not entirely sure this patch is a good idea. Are we sure the patch is addressing the real issue? I never did figure out why there were requests that were stuck on the FIFO that should have been serviced early on in the run. For that matter, I'm not sure that we have complete blktrace runs, either.
Laurent, I hate to be a pain, but could you collect data one more time? Please follow this procedure using the unpatched kernel:
1) start blktrace
2) start your app
3) wait until I/O rates have dropped, and stay down for at least 5 seconds
4) kill blktrace
Then, attach the blktrace binary logs (you can tar them up if you like) to this bugzilla.
If you have time, running the same procedure on a kernel pre-regression would be helpful. If we can compare those logs, that might show what the real problem is.
Thanks in advance!
Created attachment 22866 [details]
blktrace of /dev/sr0, kernel 2.6.31-rc7
Jeff, here are some new traces, running on kernel 2.6.31-rc7.
I had to run the command "dd if=/dev/zero of=... bs=1024K count=8 conv=fdatasync" twice to trigger the problem. The first run was very fast, it completed in about 10 seconds, the second one completed in about 10 minutes...
attachment 21888 [details] was taken on a pre-regression kernel (2.6.29).
I think the first run against 2.6.31-rc7 completed quickly because cfqd->last_position was zero, hence the first request issued was the one closest to the beginning of the cdrom drive. On the second run, we had a leftover cfqd->last_position, hence the best request to choose was one closer to last_position, not the front of the disk. This causes the very first request in the run to sit in the queue until it expires, triggerring backwards seeks.
Now, cfq should recover from this a little better, and the patch I wrote definitely makes that the case. However, it may be worth exploring adding a time stamp to the last_position field. If the timestamp of last_position update is too old, then just reset it to 0. For random workloads, hopefully it doesn't make that big of a difference. For sequential workloads, it should cut down on the seeks and keep I/O from languishing in the queue.
For now, I'll go ahead and submit the one-liner attached here. The timestamp issue can be pursued separately.
(In reply to comment #23)
> On the second run, we had a
> leftover cfqd->last_position, hence the best request to choose was one closer
> to last_position, not the front of the disk. This causes the very first
> request in the run to sit in the queue until it expires, triggerring
Looking at the code, I'm wrong about this. If both requests are behind the last_position, then the request with the lower sector number will be chosen to eliminate further backward seeks.
Here is some data including the end of the first run and the beginning of the second:
7.545015072 Q W 5785280 + 64
7.545018552 I W 5785280 + 64 I: backwards seek of 16448 sectors
7.545028032 D W 5785280 + 64
7.545079592 D W 5801664 + 64
last_position should now be 5801664+64 == 5801728
8.435615479 C W 5785280 + 64 0ms 890ms 890ms
10.967841276 C W 5801664 + 64 0ms 3422ms 3422ms
Now the second run starts:
35.267350838 Q R 5785612 + 4
35.267358998 I R 5785612 + 4
35.267366238 Q R 5785540 + 4
35.267370478 I R 5785540 + 4 I: backwards seek of 76 sectors
35.267373678 Q R 5785484 + 4
35.267377478 I R 5785484 + 4 I: backwards seek of 60 sectors
35.267380358 Q R 5785420 + 4
35.267383878 I R 5785420 + 4 I: backwards seek of 68 sectors
35.267395318 D R 5785612 + 4 D: backwards seek of 16052 sectors
So, I'm not sure why 5785612 was chosen for dispatch, here.
Laurent, I have one more thing for you to try. With and without the patch applied, try setting /sys/block/sr0/queue/iosched/back_seek_max to 0. Then, let me know if you get more consistent or less consistent results.
Here are the results with /sys/block/sr0/queue/iosched/back_seek_max=0.
(the durations are given by the dd command: dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync)
run #1: 13.7682 s
run #2: 10.9875 s
run #3: 575.601 s
run #4: 32.7028 s
I stopped the test here.
Kernel 2.6.30-rc7 patched, dd command issued 20 times:
Average time : 24.9s
Standard Deviation : 8.68s
Thanks a ton, Laurent! Good to know that setting back_seek_max to 0 is not a workaround for older kernels. It looks like you'll want to set that for your device even with the patch as it makes the runtime much more consistent. That's a much better std. dev. than before!
Fixed in 2.6.32-rc1 by
Author: Jeff Moyer <email@example.com>
Date: Fri Sep 11 17:08:59 2009 +0200
cfq: choose a new next_req when a request is dispatched