Kernel Bug Tracker – Bug 9289
100% iowait on one of cpus in current -git
Last modified: 2007-11-24 11:50:55 UTC
Subject : 100% iowait on one of cpus in current -git
Submitter : Maxim Levitsky <firstname.lastname@example.org>
"Torsten Kaiser" <email@example.com>
References : http://lkml.org/lkml/2007/10/22/20
Handled-By : Fengguang Wu <firstname.lastname@example.org>
Please note that, while I still believe the same bdi or inode changes cause my problem, I am not able to confirm that it breaks in mainline 2.6.24-rc1 or the current git, as these versions do not boot for me.
I can see this behaviour with 2.6.24-rc1-gb4f55508 too: iowait is at 100% most of the time, with no disk activity. It also seems to cause the fan to run more often here on my laptop, but that could be a subjective impression.
I have to state that this problem does not occur on 2.6.23.
I did a git-bisect, and this commit appears to be responsible:
bash-3.00# git bisect bad
2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b is first bad commit
Author: Fengguang Wu <email@example.com>
Date: Tue Oct 16 23:30:43 2007 -0700
writeback: introduce writeback_control.more_io to indicate more io
After dirtying a 100M file, the normal behavior is to start writeback
of all the data after a 30s delay. But sometimes the following happens instead:
- after 30s: ~4M
- after 5s: ~4M
- after 5s: all remaining 92M
Some analysis shows that the internal I/O dispatch queues go like this
(s_io / s_more_io):
1) 100M,1K   0
2) 1K        96M
3) 0         96M
1) initial state, with a 100M file and a 1K file on s_io
2) 4M written, nr_to_write <= 0, so write more
3) 1K written, nr_to_write > 0, no more writes (BUG)
nr_to_write > 0 in (3) fools the upper layer into thinking that all data has
been written out. The big dirty file is actually still sitting in s_more_io. We
cannot simply splice s_more_io back to s_io as soon as s_io becomes empty and
let the loop in generic_sync_sb_inodes() continue: this may starve newly
expired inodes in s_dirty. It is also not an option to draw inodes from both
s_more_io and s_dirty and let the loop go on: this might lead to livelocks,
and might also starve other superblocks in sync time (well, kupdate may still
starve some superblocks; that's another bug).
We have to return when a full scan of s_io completes. So nr_to_write > 0 does
not necessarily mean that "all data has been written". This patch introduces a
flag, writeback_control.more_io, to indicate this situation. With it, the big
dirty file no longer has to wait for the next kupdate invocation 5s later.
Cc: David Chinner <firstname.lastname@example.org>
Cc: Ken Chen <email@example.com>
Signed-off-by: Fengguang Wu <firstname.lastname@example.org>
Signed-off-by: Andrew Morton <email@example.com>
Signed-off-by: Linus Torvalds <firstname.lastname@example.org>
:040000 040000 c818abb6abb893eb2a0aa7ed88b20968dafbe4b5 31149975fe579746f97f59730bc3ebe21b6c10e9 M fs
:040000 040000 98cb3aa7bf523a302e2c3be4379f24694d00f349 a532973a51d01572eb16ade1a36f8ca5d84588ac M include
:040000 040000 fa8736c8c34b7501b115cd74f4c97bf135c8205e 84a0ef8c03f50863dc26f559c3e5f0ee88d56a8d M mm
David Chinner found out what was causing my problems:
So my slowdowns/stalls were not the same bug, but the patch above fixes that other regression for me.
Still, the regression caused by commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b seems to be pending.
Just wanted to report that the 100% iowait behaviour seems to be gone here since updating to 2.6.24-rc3-g2ffbb837. Can anybody confirm this?
Optimistically assuming that the issue is fixed and closing the bug. Please reopen if it reappears.