Bug 13401
| Summary: | pktcdvd writing is really slow with CFQ scheduler (bisected) | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Laurent Riffard (laurent.riffard) |
| Component: | Block Layer | Assignee: | Jens Axboe (axboe) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | alan, florian, jmoyer, rjw |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.30-rc7 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 13070 | | |
Attachments:

* Full dmesg
* .config
* Test patch
* blktrace of /dev/sr0, dd was slow (16 KB/s) until sr0 was switched to noop scheduler
* blktrace of /dev/sr0, dd speed was good
* blktrace of /dev/sr0, no HAL daemon, dd speed was slow
* Choose a new next_rq when updating last_position
* blktrace of /dev/sr0, kernel 2.6.29
* choose a new next_rq when inserting a request into the dispatch queue
* blktrace of /dev/sr0, kernel 2.6.31-rc7
Description
Laurent Riffard
2009-05-28 18:43:44 UTC
Created attachment 21609 [details]
.config
Created attachment 21650 [details]
Test patch
I don't have a lot of ideas on this one; the behaviour should be pretty much identical to before. Can you try this debug patch? It disables some of the logic that stops queuing; we now call that at a different scope, so it may be related.
Sorry Jens, this patch (attachment 21650 [details]) does not help; the write rate still drops to 16 kB/s.
Hi Laurent, could you provide a blktrace for the two kernels (one that works well, and one that doesn't)? Please also include the kernel versions you tested.

Created attachment 21844 [details]
blktrace of /dev/sr0, dd was slow (16 KB/s) until sr0 was switched to noop scheduler
Sorry for taking so long.
Here is the blktrace output for a bad "dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync" session. Writing was very slow until I switched sr0 to the noop scheduler.
The kernel version is 2.6.30-rc8.
Created attachment 21845 [details]
blktrace of /dev/sr0, dd speed was good
And here is another blktrace of /dev/sr0, same kernel (2.6.30-rc8), cfq scheduler.
In this run, dd speed was good enough: it completed within 14 seconds (about 600 kB/s).
To be precise, the good trace was captured first and the bad trace was captured later. There was no reboot between the two blktrace runs.
If you kill hal, does it make any difference to the performance?

Created attachment 21846 [details]
blktrace of /dev/sr0, no HAL daemon, dd speed was slow
No, killing the HAL daemon does not help.
Thanks for providing this information! Just a note: it's much easier for me to consume the binary trace data, as there are utilities that can be run on that data to analyse it.

The queuing of requests happens in order in both traces. However, in the bad trace the requests are issued to the device out of sequence, and the I/O times (from D to C) are extremely high while this is going on (around 5 seconds as opposed to 2-23 milliseconds). This anomaly only happens at the beginning of the run; it eventually works itself out (these numbers are printed upon completion):

    block    d->c time
    -------------------
    8265280: 5404ms
    8264704: 5404ms
    8265344: 5474ms
    8264832: 5514ms
    8265472: 5244ms
    8264960: 5224ms
    8265600: 5234ms
    8265088: 5234ms
    8265728: 5194ms
    8265216: 5184ms
    8265856: 5214ms
    8265408: 5224ms
    8265984: 5244ms
    8265536: 5224ms
    8266112: 5254ms
    8265664: 5294ms
    8266240: 5234ms
    8265792: 5184ms
    8266368: 5244ms
    8265920: 5294ms
    8266496: 5244ms
    8266560: 2603ms
    8266624: 2ms
    8266048: 2951ms
    8266176: 5582ms
    8266304: 5254ms
    8266432: 5244ms
    8266688: 5284ms
    8266752: 2663ms
    8266816: 2ms
    8266880: 2ms
    8266944: 2ms
    8267008: 2ms
    8267072: 2ms

Laurent, I have one further question. Do you see the same performance for every run under 2.6.29?

One thing I've noticed is that every time we dispatch a request out of order, it is due to a request that has languished on the fifo for too long. Add to that the logic that only takes one request from the fifo at a time, and you get some really nasty seeking, which is awful for the performance of the cdrom drive. We could allow all of the expired requests to be serviced, but I presume that code was put into place for a reason.

The other approach I came up with was to choose a new next_rq after updating last_position. The reasoning is that now that the disk head has moved, the next best request will change, too. I've not been able to get a system running a new kernel to test this, unfortunately. Laurent, feel free to try the patch that I'll attach shortly; however, if it breaks, you get to keep both pieces. ;-)

Now, I still can't explain why the commit that git bisect picked could have caused all of this. Jens, I look forward to hearing your thoughts.

Cheers,
Jeff

Created attachment 21887 [details]
Choose a new next_rq when updating last_position
This patch will choose a new next_rq for the cfqq associated with the request that is being activated. This is completely untested.
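For illustration, the idea behind this approach roughly corresponds to a hook like the sketch below. It reuses the helper and field names mentioned in this thread (RQ_CFQQ, cfq_find_next_rq, cfqd->last_position); the surrounding code is an assumption about the 2.6.30/31-era cfq_activate_request, not the literal attachment, which (as a later comment shows) oopsed in practice.

```c
/*
 * Rough sketch only, not the attached patch: re-pick the cfqq's next_rq
 * once the head-position estimate (cfqd->last_position) has advanced.
 */
static void cfq_activate_request(struct request_queue *q, struct request *rq)
{
	struct cfq_data *cfqd = q->elevator->elevator_data;
	struct cfq_queue *cfqq = RQ_CFQQ(rq);

	cfqd->rq_in_driver++;
	/* the head will end up past rq, so the "closest request" changes */
	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);

	if (cfqq)
		cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
}
```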
Created attachment 21888 [details]
blktrace of /dev/sr0, kernel 2.6.29

(In reply to comment #10)
> Laurent, I have one further question. Do you see the same performance for
> every run under 2.6.29?

AFAICT, 2.6.29 was fine; copying 8 MB with "dd" regularly takes about 15 to 20 seconds.

(In reply to comment #12)
> Created an attachment (id=21887) [details]
> Choose a new next_rq when updating last_position
>
> This patch will choose a new next_rq for the cfqq associated with the request
> that is being activated. This is completely untested.

This patch is buggy; I get an Oops on boot:

    NULL pointer dereference at ...
    EIP is at rb_next+0x3/0x3a
    Call Trace:
     cfq_find_next...
     cfq_activate_request...
     elv_next_request...
     ...

Created attachment 21994 [details]
choose a new next_rq when inserting a request into the dispatch queue
Hi, Laurent,
This patch builds and boots and survives stress testing. I'm not sure if it's at all a good idea. Hopefully Jens will have a chance to look at this soon.
Cheers,
Jeff
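For reference, the shape of this second approach (as described by the attachment title) would be roughly as follows. This is a sketch under the same assumptions as the earlier one, not a copy of the attached patch; the point is that the new next_rq is chosen while the dispatched request is still on the cfqq's sort tree.

```c
static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
{
	struct cfq_data *cfqd = q->elevator->elevator_data;
	struct cfq_queue *cfqq = RQ_CFQQ(rq);

	/* pick a new next_rq relative to rq before rq leaves the sort tree */
	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);

	/* then hand the request over to the dispatch list as before */
	cfq_remove_request(rq);
	cfqq->dispatched++;
	elv_dispatch_sort(q, rq);
}
```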
Sorry for the very long delay... Here is the good news: Jeff, your patch (attachment 21994) solved the issue. I ran the test command 20 times in a row and the problem never showed up:

    # sync
    # i=1
    # while [ $i -le 20 ]; do
    #   echo == pass $i ==
    #   dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync
    #   sleep 15
    #   (( i += 1 ))
    # done

Nevertheless, the performance was better in 2.6.29:

    Scheduler     Kernel          Average time (s)  Standard Deviation
    ------------  --------------  ----------------  ------------------
    CFQ           2.6.30 patched  38.19             15.9
    CFQ           2.6.29          22.92              5.93
    Noop          2.6.29          40.22              4.46
    Anticipatory  2.6.29          47.15              9.83

I did some new tests with kernel 2.6.31-rc7. This kernel exhibits the same problem I described (i.e. writing to a DVD-RW with the packet-writing driver slows down to 16 kB/s; it takes several minutes to complete an 8 MB write).
The patch provided by Jeff (attachment 21994 [details]) solves the problem.
    Scheduler  Kernel              Average time (s)  Standard Deviation
    ---------  ------------------  ----------------  ------------------
    CFQ        2.6.29              22.92             5.93
    CFQ        2.6.31-rc7 patched  27.76             15.78
Does this bug need further investigation, or a better patch?
The patch looks straightforward, so I think I can sneak it in for 2.6.31 (it IS a regression). Sorry for the delay in handling this, I'll get on it now. Jeff, can you send me a proper signed-off patch?

Yes, I'll get that out this morning.

In thinking about this more, I'm not entirely sure this patch is a good idea. Are we sure the patch is addressing the real issue? I never did figure out why there were requests stuck on the FIFO that should have been serviced early in the run. For that matter, I'm not sure that we have complete blktrace runs, either. Laurent, I hate to be a pain, but could you collect data one more time? Please follow this procedure using the unpatched kernel:

1) start blktrace
2) start your app
3) wait until I/O rates have dropped and stay down for at least 5 seconds
4) kill blktrace

Then attach the blktrace binary logs (you can tar them up if you like) to this bugzilla. If you have time, running the same procedure on a pre-regression kernel would be helpful. If we can compare those logs, that might show what the real problem is. Thanks in advance!

Created attachment 22866 [details]
blktrace of /dev/sr0, kernel 2.6.31-rc7

Jeff, here are some new traces, running on kernel 2.6.31-rc7. I had to run the command "dd if=/dev/zero of=... bs=1024K count=8 conv=fdatasync" twice to trigger the problem. The first run was very fast, it completed in about 10 seconds; the second one completed in about 10 minutes... attachment 21888 [details] was taken on a pre-regression kernel (2.6.29).

I think the first run against 2.6.31-rc7 completed quickly because cfqd->last_position was zero, hence the first request issued was the one closest to the beginning of the cdrom drive. On the second run, we had a leftover cfqd->last_position, hence the best request to choose was one closer to last_position, not the front of the disk. This causes the very first request in the run to sit in the queue until it expires, triggering backwards seeks.

Now, cfq should recover from this a little better, and the patch I wrote definitely makes that the case. However, it may be worth exploring adding a timestamp to the last_position field: if the last_position update is too old, then just reset it to 0. For random workloads, hopefully it doesn't make that big of a difference. For sequential workloads, it should cut down on the seeks and keep I/O from languishing in the queue. For now, I'll go ahead and submit the one-liner attached here. The timestamp issue can be pursued separately.

(In reply to comment #23)
> On the second run, we had a leftover cfqd->last_position, hence the best
> request to choose was one closer to last_position, not the front of the disk.
> This causes the very first request in the run to sit in the queue until it
> expires, triggering backwards seeks.

Looking at the code, I'm wrong about this. If both requests are behind the last_position, then the request with the lower sector number will be chosen to eliminate further backward seeks.
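As a side note, the cost model that back_seek_max and back_seek_penalty feed into can be illustrated with a small standalone C sketch. This is a simplified model of the heuristic discussed here, not the kernel's cfq_choose_req (which has more cases), and the default values of 16384 and 2 are assumptions; the example sectors come from the trace excerpt below.

```c
#include <stdio.h>

typedef unsigned long long sector_t;

/*
 * Simplified model of CFQ's "closest request" cost: forward seeks cost
 * their distance, short backward seeks (within back_max) cost distance
 * times the penalty, and anything further back is effectively never
 * preferred.  The request with the lowest cost would win.
 */
static sector_t seek_cost(sector_t last, sector_t s,
			  sector_t back_max, unsigned int penalty)
{
	if (s >= last)
		return s - last;
	if (s + back_max >= last)
		return (last - s) * penalty;
	return ~0ULL;
}

int main(void)
{
	sector_t last = 5801728;		/* last_position from the trace below */
	sector_t a = 5785612, b = 5785420;	/* two queued reads from the second run */

	printf("back_seek_max=16384: a=%llu b=%llu\n",
	       seek_cost(last, a, 16384, 2), seek_cost(last, b, 16384, 2));
	printf("back_seek_max=0:     a=%llu b=%llu\n",
	       seek_cost(last, a, 0, 2), seek_cost(last, b, 0, 2));
	return 0;
}
```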
Here is some data including the end of the first run and the beginning of the second:

    7.545015072 Q W 5785280 + 64
    7.545018552 I W 5785280 + 64
      I: backwards seek of 16448 sectors
    7.545028032 D W 5785280 + 64
    7.545079592 D W 5801664 + 64
      last_position should now be 5801664+64 == 5801728
    8.435615479 C W 5785280 + 64 0ms 890ms 890ms
    10.967841276 C W 5801664 + 64 0ms 3422ms 3422ms

Now the second run starts:

    35.267350838 Q R 5785612 + 4
    35.267358998 I R 5785612 + 4
    35.267366238 Q R 5785540 + 4
    35.267370478 I R 5785540 + 4
      I: backwards seek of 76 sectors
    35.267373678 Q R 5785484 + 4
    35.267377478 I R 5785484 + 4
      I: backwards seek of 60 sectors
    35.267380358 Q R 5785420 + 4
    35.267383878 I R 5785420 + 4
      I: backwards seek of 68 sectors
    35.267395318 D R 5785612 + 4
      D: backwards seek of 16052 sectors

So, I'm not sure why 5785612 was chosen for dispatch here.

Laurent, I have one more thing for you to try. With and without the patch applied, try setting /sys/block/sr0/queue/iosched/back_seek_max to 0. Then let me know if you get more consistent or less consistent results. Thanks!

Hi Jeff,

Here are the results with /sys/block/sr0/queue/iosched/back_seek_max=0 (the durations are given by the dd command: dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync).

Kernel 2.6.30-rc7:

    run #1: 13.7682 s
    run #2: 10.9875 s
    run #3: 575.601 s
    run #4: 32.7028 s

I stopped the test here.

Kernel 2.6.30-rc7 patched, dd command issued 20 times:

    Average time: 24.9 s
    Standard deviation: 8.68 s

Thanks a ton, Laurent! Good to know that setting back_seek_max to 0 is not a workaround for older kernels. It looks like you'll want to set that for your device even with the patch, as it makes the runtime much more consistent. That's a much better std. dev. than before! Cheers!

Fixed in 2.6.32-rc1 by:

    commit 06d2188644c85c56d243efab914f368d1d23c4a3
    Author: Jeff Moyer <jmoyer@redhat.com>
    Date:   Fri Sep 11 17:08:59 2009 +0200

        cfq: choose a new next_req when a request is dispatched