Bug 13401
| Summary: | pktcdvd writing is really slow with CFQ scheduler (bisected) | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Laurent Riffard (laurent.riffard) |
| Component: | Block Layer | Assignee: | Jens Axboe (axboe) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | alan, florian, jmoyer, rjw |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.30-rc7 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 13070 | | |
Attachments:

* Full dmesg
* .config
* Test patch
* blktrace of /dev/sr0, dd was slow (16 KB/s) until sr0 was switched to noop scheduler
* blktrace of /dev/sr0, dd speed was good
* blktrace of /dev/sr0, no HAL daemon, dd speed was slow
* Choose a new next_rq when updating last_position
* blktrace of /dev/sr0, kernel 2.6.29
* choose a new next_rq when inserting a request into the dispatch queue
* blktrace of /dev/sr0, kernel 2.6.31-rc7
Description
Laurent Riffard
2009-05-28 18:43:44 UTC
Created attachment 21609 [details]
.config
Created attachment 21650 [details]
Test patch
I don't have a lot of ideas on this one; the behaviour should be pretty much identical to before. Can you try this debug patch? It disables some of the logic that stops queuing; we now call that at a different scope, so it may be related.
Sorry Jens, this patch (attachment 21650 [details]) does not help; the write rate still drops to 16 kB/s.
Hi Laurent, could you provide a blktrace for the two kernels (one that works well, and one that doesn't)? Please also include the kernel versions you tested.

Created attachment 21844 [details]
blktrace of /dev/sr0, dd was slow (16 KB/s) until sr0 was switched to noop scheduler
Sorry for taking so long.
Here is the blktrace output for a bad "dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync" session. Writing was very slow until I switched sr0 to the noop scheduler.
The kernel version is 2.6.30-rc8.
Created attachment 21845 [details]
blktrace of /dev/sr0, dd speed was good
And here is another blktrace of /dev/sr0, same kernel (2.6.30-rc8), cfq scheduler.
In this run, dd speed was good enough: it completed within 14 seconds (about 600 kB/s).
To be precise, the good trace was captured first and the bad trace was captured later. There was no reboot between the two blktrace runs.
If you kill hal, does it make any difference to the performance?

Created attachment 21846 [details]
blktrace of /dev/sr0, no HAL daemon, dd speed was slow
No, killing the HAL daemon does not help.
Thanks for providing this information! Just a note: it's much easier for me to consume the binary trace data, as there are utilities that can be run on that data to analyse it.

The queuing of requests happens in order in both traces. However, in the bad trace the requests are issued to the device out of sequence, and the I/O times (from D to C) are extremely high while this is going on (around 5 seconds as opposed to 2-23 milliseconds). This anomaly only happens at the beginning of the run; it eventually works itself out (these numbers are printed upon completion):

    block    d->c time
    -------------------
    8265280: 5404ms
    8264704: 5404ms
    8265344: 5474ms
    8264832: 5514ms
    8265472: 5244ms
    8264960: 5224ms
    8265600: 5234ms
    8265088: 5234ms
    8265728: 5194ms
    8265216: 5184ms
    8265856: 5214ms
    8265408: 5224ms
    8265984: 5244ms
    8265536: 5224ms
    8266112: 5254ms
    8265664: 5294ms
    8266240: 5234ms
    8265792: 5184ms
    8266368: 5244ms
    8265920: 5294ms
    8266496: 5244ms
    8266560: 2603ms
    8266624: 2ms
    8266048: 2951ms
    8266176: 5582ms
    8266304: 5254ms
    8266432: 5244ms
    8266688: 5284ms
    8266752: 2663ms
    8266816: 2ms
    8266880: 2ms
    8266944: 2ms
    8267008: 2ms
    8267072: 2ms

Laurent, I have one further question. Do you see the same performance for every run under 2.6.29?

One thing I've noticed is that every time we dispatch a request out of order, it is due to a request that has languished on the fifo for too long. Add to that the logic that only takes one request from the fifo at a time, and you get some really nasty seeking, which is awful for the performance of the cdrom drive. We could allow all of the expired requests to be serviced, but I presume that code was put into place for a reason.

The other approach I came up with was to choose a new next_rq after updating last_position. The reasoning is that now that the disk head has moved, the next best request will change, too. I've not been able to get a system running a new kernel to test this, unfortunately. Laurent, feel free to try the patch that I'll attach shortly; however, if it breaks, you get to keep both pieces. ;-)

Now, I still can't explain why the commit that git bisect picked could have caused all of this. Jens, I look forward to hearing your thoughts.

Cheers,
Jeff

Created attachment 21887 [details]
Choose a new next_rq when updating last_position
This patch will choose a new next_rq for the cfqq associated with the request that is being activated. This is completely untested.
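For illustration, the idea behind this approach roughly corresponds to a hook like the sketch below. It reuses the helper and field names mentioned in this thread (RQ_CFQQ, cfq_find_next_rq, cfqd->last_position); the surrounding code is an assumption about the 2.6.30/31-era cfq_activate_request, not the literal attachment, which (as a later comment shows) oopsed in practice.

```c
/*
 * Rough sketch only, not the attached patch: re-pick the cfqq's next_rq
 * once the head-position estimate (cfqd->last_position) has advanced.
 */
static void cfq_activate_request(struct request_queue *q, struct request *rq)
{
	struct cfq_data *cfqd = q->elevator->elevator_data;
	struct cfq_queue *cfqq = RQ_CFQQ(rq);

	cfqd->rq_in_driver++;
	/* the head will end up past rq, so the "closest request" changes */
	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);

	if (cfqq)
		cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
}
```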
Created attachment 21888 [details]
blktrace of /dev/sr0, kernel 2.6.29

(In reply to comment #10)
> Laurent, I have one further question. Do you see the same performance for
> every run under 2.6.29?

AFAICT, 2.6.29 was fine; copying 8 MB with "dd" regularly takes about 15 to 20 seconds.

(In reply to comment #12)
> Created an attachment (id=21887) [details]
> Choose a new next_rq when updating last_position
>
> This patch will choose a new next_rq for the cfqq associated with the request
> that is being activated. This is completely untested.

This patch is buggy; I get an Oops on boot:

    NULL pointer dereference at ...
    EIP is at rb_next+0x3/0x3a
    Call Trace:
     cfq_find_next...
     cfq_activate_request...
     elv_next_request...
     ...

Created attachment 21994 [details]
choose a new next_rq when inserting a request into the dispatch queue
Hi, Laurent,
This patch builds and boots and survives stress testing. I'm not sure if it's at all a good idea. Hopefully Jens will have a chance to look at this soon.
Cheers,
Jeff
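For reference, the shape of this second approach (as described by the attachment title) would be roughly as follows. This is a sketch under the same assumptions as the earlier one, not a copy of the attached patch; the point is that the new next_rq is chosen while the dispatched request is still on the cfqq's sort tree.

```c
static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
{
	struct cfq_data *cfqd = q->elevator->elevator_data;
	struct cfq_queue *cfqq = RQ_CFQQ(rq);

	/* pick a new next_rq relative to rq before rq leaves the sort tree */
	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);

	/* then hand the request over to the dispatch list as before */
	cfq_remove_request(rq);
	cfqq->dispatched++;
	elv_dispatch_sort(q, rq);
}
```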
Sorry for the very long delay... Here is the good news: Jeff, your patch (attachment 21994) solved the issue. I ran the test command 20 times in a row and the problem never showed up:

    # sync
    # i=1
    # while [ $i -le 20 ]; do
    #   echo == pass $i ==
    #   dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync
    #   sleep 15
    #   (( i += 1 ))
    # done

Nevertheless, the performance was better in 2.6.29:

    Scheduler     Kernel          Average time (s)  Standard Deviation
    ------------  --------------  ----------------  ------------------
    CFQ           2.6.30 patched  38.19             15.9
    CFQ           2.6.29          22.92              5.93
    Noop          2.6.29          40.22              4.46
    Anticipatory  2.6.29          47.15              9.83

I did some new tests with kernel 2.6.31-rc7. This kernel exhibits the same problem I described (i.e. writing to a DVD-RW with the packet-writing driver slows down to 16 kB/s; it takes several minutes to complete an 8 MB write).
The patch provided by Jeff (attachment 21994 [details]) solves the problem.
    Scheduler  Kernel              Average time (s)  Standard Deviation
    ---------  ------------------  ----------------  ------------------
    CFQ        2.6.29              22.92             5.93
    CFQ        2.6.31-rc7 patched  27.76             15.78
Does this bug need further investigation, or a better patch?
The patch looks straightforward, so I think I can sneak it in for 2.6.31 (it IS a regression). Sorry for the delay in handling this, I'll get on it now. Jeff, can you send me a proper signed-off patch?

Yes, I'll get that out this morning.

In thinking about this more, I'm not entirely sure this patch is a good idea. Are we sure the patch is addressing the real issue? I never did figure out why there were requests stuck on the FIFO that should have been serviced early in the run. For that matter, I'm not sure that we have complete blktrace runs, either. Laurent, I hate to be a pain, but could you collect data one more time? Please follow this procedure using the unpatched kernel:

1) start blktrace
2) start your app
3) wait until I/O rates have dropped and stay down for at least 5 seconds
4) kill blktrace

Then attach the blktrace binary logs (you can tar them up if you like) to this bugzilla. If you have time, running the same procedure on a pre-regression kernel would be helpful. If we can compare those logs, that might show what the real problem is. Thanks in advance!

Created attachment 22866 [details]
blktrace of /dev/sr0, kernel 2.6.31-rc7

Jeff, here are some new traces, running on kernel 2.6.31-rc7. I had to run the command "dd if=/dev/zero of=... bs=1024K count=8 conv=fdatasync" twice to trigger the problem. The first run was very fast, it completed in about 10 seconds; the second one completed in about 10 minutes... attachment 21888 [details] was taken on a pre-regression kernel (2.6.29).

I think the first run against 2.6.31-rc7 completed quickly because cfqd->last_position was zero, hence the first request issued was the one closest to the beginning of the cdrom drive. On the second run, we had a leftover cfqd->last_position, hence the best request to choose was one closer to last_position, not the front of the disk. This causes the very first request in the run to sit in the queue until it expires, triggering backwards seeks.

Now, cfq should recover from this a little better, and the patch I wrote definitely makes that the case. However, it may be worth exploring adding a timestamp to the last_position field: if the last_position update is too old, then just reset it to 0. For random workloads, hopefully it doesn't make that big of a difference. For sequential workloads, it should cut down on the seeks and keep I/O from languishing in the queue. For now, I'll go ahead and submit the one-liner attached here. The timestamp issue can be pursued separately.

(In reply to comment #23)
> On the second run, we had a leftover cfqd->last_position, hence the best
> request to choose was one closer to last_position, not the front of the disk.
> This causes the very first request in the run to sit in the queue until it
> expires, triggering backwards seeks.

Looking at the code, I'm wrong about this. If both requests are behind the last_position, then the request with the lower sector number will be chosen to eliminate further backward seeks.
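As a side note, the cost model that back_seek_max and back_seek_penalty feed into can be illustrated with a small standalone C sketch. This is a simplified model of the heuristic discussed here, not the kernel's cfq_choose_req (which has more cases), and the default values of 16384 and 2 are assumptions; the example sectors come from the trace excerpt below.

```c
#include <stdio.h>

typedef unsigned long long sector_t;

/*
 * Simplified model of CFQ's "closest request" cost: forward seeks cost
 * their distance, short backward seeks (within back_max) cost distance
 * times the penalty, and anything further back is effectively never
 * preferred.  The request with the lowest cost would win.
 */
static sector_t seek_cost(sector_t last, sector_t s,
			  sector_t back_max, unsigned int penalty)
{
	if (s >= last)
		return s - last;
	if (s + back_max >= last)
		return (last - s) * penalty;
	return ~0ULL;
}

int main(void)
{
	sector_t last = 5801728;		/* last_position from the trace below */
	sector_t a = 5785612, b = 5785420;	/* two queued reads from the second run */

	printf("back_seek_max=16384: a=%llu b=%llu\n",
	       seek_cost(last, a, 16384, 2), seek_cost(last, b, 16384, 2));
	printf("back_seek_max=0:     a=%llu b=%llu\n",
	       seek_cost(last, a, 0, 2), seek_cost(last, b, 0, 2));
	return 0;
}
```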
Here is some data including the end of the first run and the beginning of the second:

    7.545015072 Q W 5785280 + 64
    7.545018552 I W 5785280 + 64
      I: backwards seek of 16448 sectors
    7.545028032 D W 5785280 + 64
    7.545079592 D W 5801664 + 64
      last_position should now be 5801664+64 == 5801728
    8.435615479 C W 5785280 + 64 0ms 890ms 890ms
    10.967841276 C W 5801664 + 64 0ms 3422ms 3422ms

Now the second run starts:

    35.267350838 Q R 5785612 + 4
    35.267358998 I R 5785612 + 4
    35.267366238 Q R 5785540 + 4
    35.267370478 I R 5785540 + 4
      I: backwards seek of 76 sectors
    35.267373678 Q R 5785484 + 4
    35.267377478 I R 5785484 + 4
      I: backwards seek of 60 sectors
    35.267380358 Q R 5785420 + 4
    35.267383878 I R 5785420 + 4
      I: backwards seek of 68 sectors
    35.267395318 D R 5785612 + 4
      D: backwards seek of 16052 sectors

So, I'm not sure why 5785612 was chosen for dispatch here.

Laurent, I have one more thing for you to try. With and without the patch applied, try setting /sys/block/sr0/queue/iosched/back_seek_max to 0. Then let me know if you get more consistent or less consistent results. Thanks!

Hi Jeff,

Here are the results with /sys/block/sr0/queue/iosched/back_seek_max=0 (the durations are given by the dd command: dd if=/dev/zero of=/media/pkt/flexbackup/test bs=1024K count=8 conv=fdatasync).

Kernel 2.6.30-rc7:

    run #1: 13.7682 s
    run #2: 10.9875 s
    run #3: 575.601 s
    run #4: 32.7028 s

I stopped the test here.

Kernel 2.6.30-rc7 patched, dd command issued 20 times:

    Average time: 24.9 s
    Standard deviation: 8.68 s

Thanks a ton, Laurent! Good to know that setting back_seek_max to 0 is not a workaround for older kernels. It looks like you'll want to set that for your device even with the patch, as it makes the runtime much more consistent. That's a much better std. dev. than before! Cheers!

Fixed in 2.6.32-rc1 by:

    commit 06d2188644c85c56d243efab914f368d1d23c4a3
    Author: Jeff Moyer <jmoyer@redhat.com>
    Date:   Fri Sep 11 17:08:59 2009 +0200

        cfq: choose a new next_req when a request is dispatched