Bug 10756 - many premature anticipation timeouts in anticipatory I/O scheduler
Summary: many premature anticipation timeouts in anticipatory I/O scheduler
Status: CLOSED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: All Linux
Importance: P1 normal
Assignee: Jens Axboe
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-05-19 23:29 UTC by Chuanpeng Li
Modified: 2012-05-21 15:30 UTC
CC List: 2 users

See Also:
Kernel Version: 2.6.23
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Chuanpeng Li 2008-05-19 23:29:40 UTC
Latest working kernel version: N/A
Earliest failing kernel version: 2.6.13
Distribution: www.kernel.org
Hardware Environment: IBM eServer: dual 2 GHz Xeon processors; IBM 36 GB SCSI drive
Software Environment: Red Hat 9, gcc 3.2.2
Problem Description:
  Starting from 2.6.13, the switch of the kernel timer frequency HZ from 1000 to 250 results in "default_antic_expire = 1 tick". One tick is 4 ms, but the anticipation timeout can occur anywhere from 0 to 4 ms, because the timer may be started anytime from 0 to 4 ms before the next system timer interrupt. In practice, I have observed anticipation timeouts as short as 100 microseconds using the LTT trace tool. Compared with HZ=1000, the new frequency (HZ=250) causes frequent premature anticipation timeouts and degraded I/O throughput under concurrent I/O workloads. I suggest setting "default_antic_expire" to 2 when its value is calculated as 1 (see source "block/as-iosched.c").
Steps to reproduce: 
  (1) Run a concurrent server with an I/O-bound workload, such as a micro-benchmark that sequentially reads 256 KB from random locations in randomly chosen files.
  (2) Observe that I/O throughput at HZ=250 is 10-15% lower than at HZ=1000.
  (3) At HZ=250, many anticipation timeouts can be observed using trace tools such as LTT.
Comment 1 Anonymous Emailer 2008-05-19 23:46:13 UTC
Reply-To: akpm@linux-foundation.org

(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Mon, 19 May 2008 23:29:41 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10756
> 
> [...]

Interesting.

It's often a bug to do mod_timer(timer, jiffies+1) for this very reason
- the timer can expire anywhere from one full jiffy down to zero seconds
from now, which is a large (effectively infinite) ratio and can have
unpredictable effects.

A probably-suitable-but-dopey fix might be

--- a/block/as-iosched.c~a
+++ a/block/as-iosched.c
@@ -416,6 +416,9 @@ static void as_antic_waitnext(struct as_
 
 	timeout = ad->antic_start + ad->antic_expire;
 
+	if (ad->antic_expire == 1)
+		timeout++;		/* comment goes here */
+
 	mod_timer(&ad->antic_timer, timeout);
 
 	ad->antic_status = ANTIC_WAIT_NEXT;
_

but a) it is unclear what in there prevents `timeout' from referring to
a time which has already passed (say, there was a storm of slow-running
interrupts on this CPU), and b) I bet other IO schedulers have the same
issue.
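
As a purely illustrative sketch (not part of the patch above; the helper name is invented here), the "never arm the timer less than two jiffies out" rule amounts to:

#include <linux/jiffies.h>
#include <linux/timer.h>

/*
 * Hypothetical illustration only.  jiffies + 1 can expire almost
 * immediately because the current jiffy is already partly elapsed when
 * the timer is armed; jiffies + 2 guarantees that at least one complete
 * tick (4 ms at HZ=250) passes before the timer can fire.
 */
static inline void mod_timer_at_least_two_ticks(struct timer_list *timer,
						unsigned long expires)
{
	if (time_before(expires, jiffies + 2))
		expires = jiffies + 2;
	mod_timer(timer, expires);
}
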
Comment 2 Anonymous Emailer 2008-05-20 02:13:18 UTC
Reply-To: jens.axboe@oracle.com

On Mon, May 19 2008, Andrew Morton wrote:
> [...]
> 
> but a) it is unclear what in there prevents `timeout' from referring to
> a time which has already passed (say, there was a storm of slow-running
> interrupts on this CPU), and b) I bet other IO schedulers have the same
> issue.

I have another patch pending that just makes sure the timer addition is
always at least 2, for this very reason. CFQ needs a similar patch; it
currently makes sure it's at least 1 (but it should be 2).
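
In spirit, that clamp amounts to something like the following (a hypothetical sketch, not the pending patch; the function and parameter names are invented for illustration):

#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/timer.h>

/*
 * Hypothetical sketch of the "always add at least 2 jiffies" rule.  A
 * floor of 1 still allows a near-zero wait; a floor of 2 guarantees at
 * least one full tick elapses before the timer can fire.
 */
static void arm_idle_timer(struct timer_list *timer, unsigned long idle_jiffies)
{
	unsigned long sl = max_t(unsigned long, 2, idle_jiffies);

	mod_timer(timer, jiffies + sl);
}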
