Bug 14141

Summary: order 2 page allocation failures in iwlagn
Product: Memory Management
Reporter: Rafael J. Wysocki (rjw)
Component: Page Allocator
Assignee: Andrew Morton (akpm)
Status: CLOSED CODE_FIX
Severity: normal
CC: detlev.casanova, elendil, florian, james, kvalo, linux-kernel-bugs, linville, max, mick, mister.olli, reinette.chatre, wey-yi.w.guy
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.31-rc7
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 13615
Attachments: dmesg output with iwlagn on kernel 2.6.32
             2.6.33 mem page allocation failure
             dmesg output of massive hang

Description Rafael J. Wysocki 2009-09-06 19:49:35 UTC
Subject    : iwlagn: order 2 page allocation failures
Submitter  : Frans Pop <elendil@planet.nl>
Date       : 2009-09-06 7:40
References : http://marc.info/?l=linux-kernel&m=125222287419691&w=4
Handled-By : Pekka Enberg <penberg@cs.helsinki.fi>

This entry is being used for tracking a regression from 2.6.30.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2009-10-02 17:22:49 UTC
On Friday 02 October 2009, Mel Gorman wrote:
> On Fri, Oct 02, 2009 at 11:11:52AM +0200, Frans Pop wrote:
> > On Thursday 01 October 2009, Rafael J. Wysocki wrote:
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.30 and 2.6.31.  Please verify if it still should
> > > be listed and let me know (either way).
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14141
> > > Subject   : order 2 page allocation failures in iwlagn
> > > Submitter : Frans Pop <elendil@planet.nl>
> > > Date              : 2009-09-06 7:40 (26 days old)
> > > References        : http://marc.info/?l=linux-kernel&m=125222287419691&w=4
> > > Handled-By        : Pekka Enberg <penberg@cs.helsinki.fi>
> > 
> > I'm not sure about this.
> > 
> > The error messages from failed allocations should now be a lot less as a 
> > result of this commit:
> > commit f82a924cc88a5541df1d4b9d38a0968cd077a051
> > Author: Reinette Chatre <reinette.chatre@intel.com>
> > Date:   Thu Sep 17 10:43:56 2009 -0700
> >     iwlwifi: reduce noise when skb allocation fails
> > 
> > That commit is in mainline, and I'm not sure if it is important enough for 
> > a stable update (AFAICT it's not listed for 2.6.31.2).
> > 
> > That commit is mostly cosmetic, but possibly the real regression is not in 
> > iwlagn but in the way memory is freed/defragmented. That aspect was also 
> > reported by Bartlomiej (#14016) and was extensively discussed (without a 
> > clear conclusion) here: http://lkml.org/lkml/2009/8/26/140.
> > 
> > My own feeling is that Bartlomiej is correct and that something has changed 
> > since .29 and that on average we do have less higher order areas available 
> > after the system has been in use for some time, but I can't substantiate 
> > that. I do know that before .30 I had never seen the SKB allocation 
> > errors.
> > 
> > Main problem is that it's hard to deliberately and reproducibly get the 
> > system in a state where the errors occur.
> > 
> 
> Apparently, Karol Lewandowski (cc added) has a reliable
> reproduction case for when the firmware loading problem occurs
> (http://lkml.org/lkml/2009/9/30/242). While it's not the same problem
> exactly, it's probable they're related. I'm hoping the problem commit can be
> identified by his bisection whenever he gets around to it.
> 
> > I certainly do feel that the kernel should try to make sure higher order 
> > allocations remain possible during system use. They are not only needed 
> > shortly after boot: drivers can be loaded/unloaded at any time. OTOH Mel 
> > probably does have a point that really high order GFP_ATOMIC allocations 
> > by drivers make no sense [1].
> > 
> 
> While they don't make sense, I accept that the problem is apparently
> occurring more now than it did so something has changed that is not
> obvious to normal testing. Hopefully Karol will be able to help us out.
> 
> > Anyway, I have no problems with this BR being closed.
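
The claim above that fewer higher-order areas remain free after the system has been in use can be checked, at least roughly, by sampling /proc/buddyinfo over time. The following is a minimal sketch, not a tool used in this report: the function names and the SAMPLE text are made up for illustration, and it only counts free blocks per order (it ignores watermarks, which also affect whether a GFP_ATOMIC request succeeds).

```python
def parse_buddyinfo(text):
    """Map (node, zone) to the list of free-block counts, index = order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone   Normal   c0 c1 c2 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Number of free blocks whose order is >= the requested order."""
    return sum(counts[order:])


# Made-up sample in /proc/buddyinfo layout, used when the file is absent.
SAMPLE = (
    "Node 0, zone      DMA      1      1      0      1      1      1      0      0      1      1      3\n"
    "Node 0, zone   Normal    842    310     12      0      0      0      0      0      0      0      0\n"
)


def load_buddyinfo(path="/proc/buddyinfo"):
    """Read the live file, falling back to SAMPLE on non-Linux systems."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return SAMPLE


if __name__ == "__main__":
    for (node, zone), counts in parse_buddyinfo(load_buddyinfo()).items():
        print(f"node {node} zone {zone}: "
              f"{free_blocks_at_or_above(counts, 2)} free blocks of order >= 2")
```

Running this before and after a workload (for example, the gitk test described later in this report) on two kernel versions would give comparable numbers for order >= 2.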
Comment 2 Reinette Chatre 2009-10-02 21:31:55 UTC
(In reply to comment #1)

> > > Anyway, I have no problems with this BR being closed.

Can this bug be closed?

Thanks
Comment 3 Rafael J. Wysocki 2009-10-02 21:37:32 UTC
Well, let's close it, although I think that the underlying issue is still worth pursuing if anyone has a good test case (unfortunately, nobody does at the moment AFAICS).
Comment 4 Frans Pop 2009-10-05 07:07:17 UTC
References : http://lkml.org/lkml/2009/10/2/86
References : http://lkml.org/lkml/2009/10/5/24

Test case has been found now. See last link above.

BR is related to:
- http://bugzilla.kernel.org/show_bug.cgi?id=14016
- http://bugzilla.kernel.org/show_bug.cgi?id=14265
Comment 5 Rafael J. Wysocki 2009-10-05 22:42:11 UTC
On Monday 05 October 2009, Frans Pop wrote:
> On Monday 05 October 2009, Mel Gorman wrote:
> > On Mon, Oct 05, 2009 at 08:50:58AM +0200, Frans Pop wrote:
> > > On Monday 05 October 2009, Frans Pop wrote:
> > > > I'll dig into this a bit more as it looks like this should be
> > > > reproducible, probably even without the kernel build. Next step is
> > > > to see how .30 behaves in the same situation.
> > >
> > > This looks conclusive. I tested .30 and .32-rc3 from clean reboots and
> > > only starting gitk. I only started music playing in the background
> > > (amarok) from an NFS share to ensure network activity.
> > >
> > > With .32-rc3 I got 4 SKB allocation errors while starting the *second*
> > > gitk instance. And the system was completely frozen with music stopped
> > > until gitk finished loading.
> > >
> > > With .30 I was able to start *three* gitk's (which meant 2 of them got
> > > (partially) swapped out) without any allocation errors. And with the
> > > system remaining relatively responsive. There was a short break in the
> > > music while I started the 2nd instance, but it just continued playing
> > > afterwards. There was also some mild latency in the mouse cursor, but
> > > nothing like the full desktop freeze I get with .32-rc3.
> > >
> > > One thing I should mention: my swap is an LVM volume that's in a VG
> > > that's on a LUKS encrypted partition.
> > >
> > > Does this give you enough info to go on, or should I try a bisection?
> >
> > I'll be trying to reproduce it, but it's unlikely I'll manage to
> > reproduce it reliably as there may be a specific combination of hardware
> > necessary as well. What I'm going to try is writing a module that
> > allocates order-5 every second GFP_ATOMIC and see can I reproduce using
> > scenarios similar to yours but it'll take some time with no guarantee of
> > success. If you could bisect it, it would be fantastic.
> 
> And the winner is:
> 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> Author: David Rientjes <rientjes@google.com>
> Date:   Tue Jun 16 15:32:56 2009 -0700
> 
>     oom: move oom_adj value from task_struct to mm_struct
> 
> I'm confident that the bisection is good. The test case was very reliable 
> while zooming in on the merge from akpm.
Comment 6 Rafael J. Wysocki 2009-10-05 22:44:09 UTC
First-Bad-Commit : 2ff05b2b4eac2e63d345fc731ea151a060247f53
Comment 7 Rafael J. Wysocki 2009-10-06 00:07:53 UTC
On Tuesday 06 October 2009, David Rientjes wrote:
> On Mon, 5 Oct 2009, Frans Pop wrote:
> 
> > And the winner is:
> > 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> > commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> > Author: David Rientjes <rientjes@google.com>
> > Date:   Tue Jun 16 15:32:56 2009 -0700
> > 
> >     oom: move oom_adj value from task_struct to mm_struct
> > 
> > I'm confident that the bisection is good. The test case was very reliable 
> > while zooming in on the merge from akpm.
> > 
> 
> I doubt it for two reasons: (i) this commit was reverted in 0753ba0 since 
> 2.6.31-rc7 and is no longer in the kernel, and (ii) these are GFP_ATOMIC 
> allocations which would be unaffected by oom killer scores.
Comment 8 Rafael J. Wysocki 2009-10-12 21:26:48 UTC
More debugging has been carried out.

References : http://lkml.indiana.edu/hypermail/linux/kernel/0910.1/01395.html
Comment 9 Rafael J. Wysocki 2009-10-26 17:13:14 UTC
*** Bug 14475 has been marked as a duplicate of this bug. ***
Comment 10 John W. Linville 2009-10-26 19:33:26 UTC
*** Bug 14440 has been marked as a duplicate of this bug. ***
Comment 11 Rafael J. Wysocki 2009-11-21 20:10:21 UTC
*** Bug 14655 has been marked as a duplicate of this bug. ***
Comment 12 Rafael J. Wysocki 2009-11-21 20:13:40 UTC
Handled-By : Mel Gorman <mel@csn.ul.ie>
Notify-Also : Karol Lewandowski <karol.k.lewandowski@gmail.com>
Notify-Also : Frans Pop <elendil@planet.nl>
Notify-Also : David Rientjes <rientjes@google.com>
Notify-Also : Pekka Enberg <penberg@cs.helsinki.fi>
Notify-Also : Tobias Oetiker <tobi@oetiker.ch>
Notify-Also : Chris Mason <chris.mason@oracle.com>
Notify-Also : KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Comment 13 Detlev Casanova 2009-12-06 16:18:20 UTC
Hi, Linux 2.6.32 is still affected.
No problem when iwlagn and iwlcore are not loaded, but once they are loaded --> order 2 page allocation failures from all kinds of programs that use a lot of memory (cc1plus and swapper, for instance).

I'll attach a full dmesg. I don't have time to do that now, but could a bisection help?

Cheers,

Detlev.
Comment 14 Detlev Casanova 2009-12-06 16:19:40 UTC
Created attachment 24053
dmesg output with iwlagn on kernel 2.6.32
Comment 15 Reinette Chatre 2009-12-07 17:41:54 UTC
This behavior as seen in iwlagn should disappear when you use the new paged RX for this driver which will be in 2.6.33.
Comment 16 Detlev Casanova 2009-12-26 19:15:31 UTC
Kernel 2.6.33-rc2 still has the same problem. Is the new paged RX already in the kernel? Do I have to activate it somewhere in the kernel config (I don't see that option in the iwlcore module configuration)?

Thanks
Comment 17 Reinette Chatre 2010-01-05 18:10:38 UTC
(In reply to comment #16)
> Kernel 2.6.33-rc2 still has the same problem. Is the new paged RX already in
> the kernel? Do I have to activate it somewhere in the kernel config (I don't
> see that option in the iwlcore module configuration)?
> 

Yes - the paged rx work is now in 2.6.33. Are you seeing order 2 page allocation failures while the driver is running? This is strange since the paged rx work modified the allocations from order 2 to order 1. Could you please pass on a trace?
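
For reference, an order-n allocation is a request for 2^n physically contiguous pages. A small sketch of the arithmetic, assuming 4 KiB pages (the usual x86 value; the helper name is made up for illustration):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86


def order_to_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of one order-`order` allocation (2**order pages)."""
    return (1 << order) * page_size


if __name__ == "__main__":
    for order in (1, 2, 5):
        print(f"order {order}: {order_to_bytes(order) // 1024} KiB contiguous")
```

Under that assumption, moving the RX buffers from order 2 to order 1 halves the contiguous region needed per allocation, from 16 KiB to 8 KiB.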
Comment 18 Detlev Casanova 2010-01-22 20:52:52 UTC
I saw it once with 2.6.33-rc1 (or rc2) but not anymore with rc3.
Comment 19 Reinette Chatre 2010-04-13 23:27:55 UTC
*** Bug 15766 has been marked as a duplicate of this bug. ***
Comment 20 Ritesh Raj Sarraf 2010-05-09 18:13:45 UTC
Created attachment 26304
2.6.33 mem page allocation failure

I think I am hitting this bug: I can trigger it under heavy I/O.

I am running the released 2.6.33 kernel.

[34858.683705] Xorg: page allocation failure. order:1, mode:0x50d0


Full dmesg will be attached to the bug report
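
When comparing dmesg captures such as the attachments on this bug, it can help to tally the failures by order and process name. A minimal sketch, assuming the message format shown in the line above ("<comm>: page allocation failure. order:N, mode:0x..."); the regular expression and function names are illustrative only, and the gfp mode is parsed as a number but not decoded because the flag bits differ between kernel versions:

```python
import re
from collections import Counter

# Optional "[seconds.micros]" timestamp, then "<comm>: page allocation failure."
FAILURE_RE = re.compile(
    r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"(?P<comm>.+?): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def parse_failure(line):
    """Return a dict for a failure line, or None if the line does not match."""
    m = FAILURE_RE.match(line.strip())
    if m is None:
        return None
    ts = m.group("ts")
    return {
        "timestamp": float(ts) if ts else None,
        "comm": m.group("comm"),
        "order": int(m.group("order")),
        "mode": int(m.group("mode"), 16),
    }


def count_by_order(lines):
    """Tally failures per allocation order across an iterable of log lines."""
    hits = (parse_failure(line) for line in lines)
    return Counter(hit["order"] for hit in hits if hit is not None)


if __name__ == "__main__":
    import sys
    for order, n in sorted(count_by_order(sys.stdin).items()):
        print(f"order {order}: {n} failure(s)")
```

Usage would be along the lines of piping a saved dmesg file into the script on standard input.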
Comment 21 mister.olli 2010-09-10 14:45:39 UTC
Hi,

I think I triggered this bug yesterday while shrinking one of my virtual machines in VMware.

I have had the hanging behaviour for some time now; mostly the hangs lasted no longer than 10 seconds. I never managed to find any information in my logs until yesterday, when the hang lasted ~30 minutes.

I think this is why the attached dmesg is incomplete: the kernel ring buffer is too small to capture all of these events, including bootup.

I'm currently running:
dante@phoenix ~ $ uname -a
Linux phoenix 2.6.35-rc4 #8 SMP PREEMPT Mon Jul 19 14:36:36 CEST 2010 x86_64 Intel(R) Core(TM)2 Duo CPU L9600 @ 2.13GHz GenuineIntel GNU/Linux

If there are any bits of information I can provide to help please let me know.
Comment 22 mister.olli 2010-09-10 14:47:49 UTC
Created attachment 29502
dmesg output of massive hang
Comment 23 wey-yi.w.guy 2010-09-10 16:17:36 UTC
Any more detail on how to trigger this? Also, what is the system setup, such as band/channel, traffic type (legacy/HT), ...?

Thanks
Wey
Comment 24 Florian Mickler 2010-12-17 21:50:03 UTC
References: http://lkml.org/lkml/2010/3/2/349

Fixed in 2.6.34-rc1 by: 

commit d9661adfb8e53a7647360140af3b92284cbe52d4
Author: Alan Cox <alan@linux.intel.com>
Date:   Thu Feb 18 16:43:47 2010 +0000

    tty: Keep the default buffering to sub-page units