Bug 29402

Summary: kernel panics while running ffsb scalability workloads on 2.6.38-rc1 through -rc5
Product: File System Reporter: Eric Whitney (enwlinux)
Component: ext4Assignee: fs_ext4 (fs_ext4)
Severity: normal CC: florian, lczerner, maciej.rutecki, rjw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38-rc5 Tree: Mainline
Regression: Yes
Bug Depends on:    
Bug Blocks: 27352    
Attachments: captured console output - spinlock bad magic: ext4lazyinit
captured console output - another back trace of a different sort
Fix ordering patch
hung task trace for ext4lazyinit resulting from first patch
Possible fix - patch simplifies completion counting and avoids races
Possible fix no.2 - simplification to avoid bad ordering
Simplification to avoid bad ordering

Description Eric Whitney 2011-02-18 21:47:14 UTC
Created attachment 48352 [details]
captured console output - spinlock bad magic: ext4lazyinit

The 2.6.38-rc5 kernel can panic while running any one of the ffsb profiles in http://free.linux.hp.com/~enw/ext4/profiles on an ext4 filesystem on a 48 core x86 system. These panics occur most frequently using the 48 or 192 thread versions of those profiles.  The problem has been reproduced on a 16 core x86 system using identical storage, but occurs there at lower frequency.  On average, it takes only two runs of "large_file_creates_threads-192.ffsb" to produce a panic on the 48 core system.

The panics occur more or less equally frequently on a vanilla ext4 filesystem, ext4 filesystem without a journal, and an ext4 filesystem with a journal but mounted with mblk_io_submit.

With various debugging options enabled including spinlock debugging, panics or oopses or BUGS occur in four varieties: protection violation, invalid opcode,
NULL pointer, and spinlock bad magic.  Typically, the first fault triggers a cascade of subsequent oopses, etc.

These panics can be suppressed by using -E lazy_itable_init at mkfs time.  The test system survived two series of 10 ffsb tests beginning with a single mkfs each.  Subsequently, the system survived a run of about 16 hours in which a complete scalability measurement pass was made

Repeated ffsb runs on ext3 and xfs filesystems on 2.6.38-rc* have not produced panics.

Numerous previous ffsb scalability runs on ext4 and 2.6.37 did not produce panics.

The panics can be produced using either HP SmartArray (backplane RAID) or FibreChannel storage with no material difference in the panic backtraces.

Attempted bisection of the bug in 38-rc1 was inconclusive.  Repeatability was lost the earlier in -rc1 I got.  The last clear indication was in the midst of perf changes very early in the release (SHA id beginning with 006b20fe4c, "Merge branch 'perf/urgent' into perf/core").  Preceding that are RCU and GFS patches, plus a small number of x86 patches.

Relatively little useful spinlock debugging information was reported in repeated tests in early 38 rc's - with later rc's, more information gradually became visible (or maybe I was just getting progressively more lucky).

The first attachment contains the partial backtrace that most clearly suggests lazy_itable_init involvement.  The softirq portion of this backtrace tends to look the same across the panics I've seen.
Comment 1 Eric Whitney 2011-02-18 21:54:16 UTC
Created attachment 48362 [details]
captured console output - another back trace of a different sort

This backtrace is also from -rc5, and shows both the typical trace in softirq as well as the cascade of faults mentioned in the description.  In this case, we've got a failure other than spinlock bad magic reported.
Comment 2 Lukas Czerner 2011-02-21 16:52:14 UTC
On Fri, 18 Feb 2011, bugzilla-daemon@bugzilla.kernel.org wrote:
> These panics can be suppressed by using -E lazy_itable_init at mkfs time. 
> The
> test system survived two series of 10 ffsb tests beginning with a single mkfs
> each.  Subsequently, the system survived a run of about 16 hours in which a
> complete scalability measurement pass was made

Hi Eric,

Are you saying that it works fine (without panic) with lazy_itable_init
feature enabled ? I am confused now, because in the first trace
"spinlock bad magic: ext4lazyinit" the ext4lazyinit thread is obviously
running. When the ext4lazyinit thread is running, it means that inode
table was not zeroed, hence kernel thread (ext4lazyinit) is trying to do
the job. Am I missing something ?

Note that on the kernel which supports lazy itable initialization
(/sys/fs/ext4/features/lazy_itable_init exists) the mke2fs (since
v1.41.12-50-g210fd2c) will skip inode table initialization
automatically (it is the same as you would specify -E lazy_itable_init).

Comment 3 Eric Whitney 2011-02-22 04:10:33 UTC
Hi, Lukas:

Unfortunately, I omitted the argument, and that's what's confusing you.   I'm saying that it works fine - no panic - with the feature disabled:  -E lazy_itable_init=0.  (I'm sorry about that.  I was filing just a little too quickly after informing Ted via IRC.  It's been a long hunt.)  Again, it panics easily with lazy_itable_init set to its default, which means without using the switch.

In addition to what I initially reported, I've made more than 80 additional ffsb runs on freshly made ext4, ext4 nojournal, and ext4 mblk_io_submit filesystems over the last three days with lazy_itable_init=0, and there have been no further panics.

For the record, I'm using e2fsprogs 1.41.14 straight from Ted's git tree.

Comment 4 Lukas Czerner 2011-02-22 11:30:34 UTC
On Tue, 22 Feb 2011, bugzilla-daemon@bugzilla.kernel.org wrote:

> Unfortunately, I omitted the argument, and that's what's confusing you.

That's what I was afraid of. Fortunately we know where to look now. It
seems that the problem is really in the lazy init code in ext4. It is
possible that due some locking problem we might be zeroing existing
inode. Did you tried to run fsck on that filesystem after the crash ?

Comment 5 Lukas Czerner 2011-02-22 13:43:10 UTC
Created attachment 48672 [details]
Fix ordering patch
Comment 6 Lukas Czerner 2011-02-22 13:43:24 UTC
Hi Eric,

it might be just a shot in the dark, but would you be willing to try out this patch ? Unfortunately I do not currently have a machine to reproduce the problem.

I *think* that the problem is in the block layer (bio_batch_end_io() and blkdev_issue_zeroout() to be specific) and since lazy init is using that to do the zeroing we can not hit it with lazy init turned off.

Now, the problem I see is that when we are going to wait_for_completion() in blkdev_issue_discard() we check if the bb.done equals issued (number of issued bios). If it equals, we can skip the wait_for_completion() and jump out of the function since there is nothing to wait for. However, there is a ordering problem because bio_batch_end_io() is calling atomic_inc(&bb->done) before complete(), hence it might seem to blkdev_issue_zeroout() that all bios has been completed and exit. At this point when bio_batch_end_io() is going to call complete(bb->wait) bb does not longer exist since it was declared locally in blkdev_issue_zeroout() ==> panic while trying to acquire wait.lock!

(thread 1)                         (thread 2)
bio_batch_end_io()                 blkdev_issue_zeroout()
  if(bb) {                         ...
    if (bb->end_io)                ...
      bb->end_io(bio, err);        ...
    atomic_inc(&bb->done);         ...
    ...                            while (issued != atomic_read(&bb.done))
    ...                            (let issued == bb.done)
    ...                            (do the rest of the function)
    ...                            return ret;
Panic in complete() while trying to acquire spinlock.

I hope it is not complete nonsense :).Please let me know whether the patch helps.

Comment 7 Eric Whitney 2011-02-25 03:08:48 UTC
Hi Lukas:

When running the large_file_creates_threads=192 workload with this patch on an -rc5 kernel, I consistently saw a new failure. The ext4lazyinit thread stalled (tripping hung task detector) along with the flusher and journal threads, and ffsb then ceased to make forward progress.

I think we're just seeing a new variant of the original race, where bio_batch_end_io() sometimes never gets the value of &bb->done it's waiting for:

Comment 8 Eric Whitney 2011-02-25 03:25:46 UTC
(sorry - network hiccup - starting from the top again)

bio_batch_end_io()                  blkdev_issue_zeroout()
  ...                                 wait_for_completion(&wait)
  complete(bb->wait)                  ...
  ...                                 issued != atomic_read(&bb.done)
  ...                                 wait_for_completion(&wait)
  atomic_inc(&bb->done)               ...
                                      <wedge while waiting for the completion
                                       that won't be coming>

At any rate, that's what I think is happening - we're missing the last "inc" and then we're not getting the completion notification we need to finish in blkdev_issue_zeroout().

I'll attach the first hung task trace from the console for reference.

Comment 9 Eric Whitney 2011-02-25 03:27:13 UTC
Created attachment 49042 [details]
hung task trace for ext4lazyinit resulting from first patch
Comment 10 Eric Whitney 2011-02-26 03:50:09 UTC
Created attachment 49172 [details]
Possible fix - patch simplifies completion counting and avoids races

Hi Lukas:

This patch has successfully passed an initial round of tests which 2.6.38-rc* kernels have never passed.  It seemed appropriate to simplify the way outstanding completions were being accounted for to avoid the races, and the code ought to be just a little more efficient on a SMP/NUMA box as well.  At any rate, it's a stab at it - I'd appreciate any comments you might have.  I'll put it through a weekend's worth of test runs next.

Comment 11 Lukas Czerner 2011-02-26 10:48:42 UTC
Hi Eric,

thanks for the patch. I have almost the same one, except you do not
really need "done" variable, but instead just do:

	while (issued--)

Also, there is no need for bio_end_io_t in struct bio_batch, but it does
not affect the results, obviously. Finally, I was able to reproduce the
problem so I'll give it a try as well.

Comment 12 Eric Whitney 2011-02-28 18:07:38 UTC
Hi Lukas:

The quick and dirty patch I attached on 26 Feb successfully passed 36 hours of testing on ext4, ext4 nojournal, and ext4 mblk_io_submit filesystems using the ffsb scalability workloads.  That was a total of 135 ffsb runs.  The original panic appeared on average after two runs.

At any rate, hopefully this is a useful data point.  I'd be happy to test your more polished patch in the same manner.

Comment 13 Lukas Czerner 2011-02-28 18:22:03 UTC
Created attachment 49642 [details]
Possible fix no.2 - simplification to avoid bad ordering

Hi Eric,

here is a patch for testing. The behaviour should be the same as with your patch, to I hope it would work as expected. Thanks for testing.

Comment 14 Eric Whitney 2011-03-03 18:58:15 UTC
Hi Lukas:

Your "Possible fix no. 2" passed 36 hours of extended testing using the ffsb scalability test suite on a modified 2.6.35-rc5 kernel with ext4, ext4 nojournal, and ext4 mblk filesystems.  The patch survived 135 ffsb runs where an average of two runs were sufficient to produce the bug originally. 

I think your fix is good and ought to be submitted.  Please feel free to add my Reported-by/Tested-by if you wish.

I'll follow up and retest once this patch makes it to mainline.

Thanks very much,
Comment 15 Florian Mickler 2011-03-04 20:18:56 UTC
Comment 16 Florian Mickler 2011-03-04 20:19:10 UTC
Patch: https://bugzilla.kernel.org/show_bug.cgi?id=29402#c13
Comment 17 Rafael J. Wysocki 2011-03-06 12:03:24 UTC
Ignore-Patch : https://bugzilla.kernel.org/show_bug.cgi?id=29402#c13
Patch : https://bugzilla.kernel.org/attachment.cgi?id=49642

Handled-By : Lukas Czerner <lczerner@redhat.com>
Comment 18 Lukas Czerner 2011-03-11 08:13:05 UTC
Created attachment 50622 [details]
Simplification to avoid bad ordering

This is new version of the bug fix, which has been tested by Eric Whitney and posted (https://patchwork.kernel.org/patch/627671/).

Quoting Eric:
"No panics, another full pass through the ffsb workloads on the three ext4 variants without trouble, and the system logs are clean as well."
Comment 19 Florian Mickler 2011-03-14 17:25:37 UTC
merged after .38-rc8:
commit 0aeea18964173715a1037034ef6838198f319319
Author: Lukas Czerner <lczerner@redhat.com>
Date:   Fri Mar 11 10:23:53 2011 +0100

    block: fix mis-synchronisation in blkdev_issue_zeroout()