Subject : iwlagn: order 2 page allocation failures
Submitter : Frans Pop <email@example.com>
Date : 2009-09-06 7:40
References : http://marc.info/?l=linux-kernel&m=125222287419691&w=4
Handled-By : Pekka Enberg <firstname.lastname@example.org>
This entry is being used for tracking a regression from 2.6.30. Please don't
close it until the problem is fixed in the mainline.
On Friday 02 October 2009, Mel Gorman wrote:
> On Fri, Oct 02, 2009 at 11:11:52AM +0200, Frans Pop wrote:
> > On Thursday 01 October 2009, Rafael J. Wysocki wrote:
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > > be listed and let me know (either way).
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14141
> > > Subject : order 2 page allocation failures in iwlagn
> > > Submitter : Frans Pop <email@example.com>
> > > Date : 2009-09-06 7:40 (26 days old)
> > > References :
> > > Handled-By : Pekka Enberg <firstname.lastname@example.org>
> > I'm not sure about this.
> > The error messages from failed allocations should now be a lot less as a
> > result of this commit:
> > commit f82a924cc88a5541df1d4b9d38a0968cd077a051
> > Author: Reinette Chatre <email@example.com>
> > Date: Thu Sep 17 10:43:56 2009 -0700
> > iwlwifi: reduce noise when skb allocation fails
> > That commit is in mainline, and I'm not sure if it is important enough for
> > a stable update (AFAICT it's not listed for 184.108.40.206).
> > That commit is mostly cosmetic, but possibly the real regression is not in
> > iwlagn but in the way memory is freed/defragmented. That aspect was also
> > reported by Bartlomiej (#14016) and was extensively discussed (without a
> > clear conclusion) here: http://lkml.org/lkml/2009/8/26/140.
> > My own feeling is that Bartlomiej is correct and that something has changed
> > since .29 and that on average we do have less higher order areas available
> > after the system has been in use for some time, but I can't substantiate
> > that. I do know that before .30 I had never seen the SKB allocation
> > errors.
> > Main problem is that it's hard to deliberately and reproducibly get the
> > system in a state where the errors occur.
> Apparently, Karol Lewandowski (cc added) has a reliable
> reproduction case for when the firmware loading problem occurs
> (http://lkml.org/lkml/2009/9/30/242). While it's not the same problem
> it's probable they're related. I'm hoping the problem commit can be
> identified by his bisection whenever he gets around to it.
> > I certainly do feel that the kernel should try to make sure higher order
> > allocations remain possible during system use. They are not only needed
> > shortly after boot: drivers can be loaded/unloaded at any time. OTOH Mel
> > probably does have a point that really high order GFP_ATOMIC allocations
> > by drivers make no sense.
> While they don't make sense, I accept that the problem is apparently
> occurring more now than it did so something has changed that is not
> obvious to normal testing. Hopefully Karol will be able to help us out.
> > Anyway, I have no problems with this BR being closed.
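The suspicion above, that fewer higher-order areas remain available after the system has been in use for a while, is the kind of thing /proc/buddyinfo can help substantiate. The following is only a minimal sketch: it assumes the usual buddyinfo layout (one line per node and zone, followed by free-block counts for order 0 upwards), and the helper names `parse_buddyinfo` and `free_blocks_at_or_above` are made up for illustration.

```python
def parse_buddyinfo(text):
    """Parse /proc/buddyinfo-style text into {(node, zone): [count per order]}.

    Assumed line shape: "Node 0, zone   Normal   c0 c1 c2 ...", where cN is the
    number of free blocks of order N (2**N contiguous pages).
    """
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue  # skip anything that does not look like a zone line
        node = int(parts[1].rstrip(","))
        zone = parts[3]
        zones[(node, zone)] = [int(count) for count in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Count free blocks large enough to satisfy an allocation of `order`.

    A larger free block can be split, so every order >= `order` is included.
    """
    return sum(counts[order:])
```

On a running system one could feed it `open("/proc/buddyinfo").read()` at intervals and compare, say, `free_blocks_at_or_above(counts, 2)` between a good and a bad kernel under the same workload. A falling count only hints at fragmentation; it does not show which change caused it.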
(In reply to comment #1)
> > > Anyway, I have no problems with this BR being closed.
Can this bug be closed?
Well, let's close it, although I think that the underlying issue is still worth pursuing if anyone has a good test case (unfortunately, nobody does at the moment AFAICS).
References : http://lkml.org/lkml/2009/10/2/86
References : http://lkml.org/lkml/2009/10/5/24
Test case has been found now. See last link above.
BR is related to:
On Monday 05 October 2009, Frans Pop wrote:
> On Monday 05 October 2009, Mel Gorman wrote:
> > On Mon, Oct 05, 2009 at 08:50:58AM +0200, Frans Pop wrote:
> > > On Monday 05 October 2009, Frans Pop wrote:
> > > > I'll dig into this a bit more as it looks like this should be
> > > > reproducible, probably even without the kernel build. Next step is
> > > > to see how .30 behaves in the same situation.
> > >
> > > This looks conclusive. I tested .30 and .32-rc3 from clean reboots and
> > > only starting gitk. I only started music playing in the background
> > > (amarok) from an NFS share to ensure network activity.
> > >
> > > With .32-rc3 I got 4 SKB allocation errors while starting the *second*
> > > gitk instance. And the system was completely frozen with music stopped
> > > until gitk finished loading.
> > >
> > > With .30 I was able to start *three* gitk's (which meant 2 of them got
> > > (partially) swapped out) without any allocation errors. And with the
> > > system remaining relatively responsive. There was a short break in the
> > > music while I started the 2nd instance, but it just continued playing
> > > afterwards. There was also some mild latency in the mouse cursor, but
> > > nothing like the full desktop freeze I get with .32-rc3.
> > >
> > > One thing I should mention: my swap is an LVM volume that's in a VG
> > > that's on a LUKS encrypted partition.
> > >
> > > Does this give you enough info to go on, or should I try a bisection?
> > I'll be trying to reproduce it, but it's unlikely I'll manage to
> > reproduce it reliably as there may be a specific combination of hardware
> > necessary as well. What I'm going to try is writing a module that
> > allocates order-5 every second GFP_ATOMIC and see can I reproduce using
> > scenarios similar to yours but it'll take some time with no guarantee of
> > success. If you could bisect it, it would be fantastic.
> And the winner is:
> 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> Author: David Rientjes <firstname.lastname@example.org>
> Date: Tue Jun 16 15:32:56 2009 -0700
> oom: move oom_adj value from task_struct to mm_struct
> I'm confident that the bisection is good. The test case was very reliable
> while zooming in on the merge from akpm.
First-Bad-Commit : 2ff05b2b4eac2e63d345fc731ea151a060247f53
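For readers unfamiliar with how a bisection arrives at a "first bad commit", the idea can be sketched as a binary search over an ordered list of commits. This is an illustration only, not how git bisect is implemented: it assumes a linear history, a known-good first entry, a known-bad last entry, and a test that gives the same answer every time. The name `first_bad` and the `is_bad` callback are hypothetical.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumptions: `commits` is ordered oldest to newest, commits[0] is known
    good, commits[-1] is known bad, and once a commit is bad every later
    commit is bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # the regression is at mid or earlier
        else:
            good = mid   # the regression is after mid
    return commits[bad]
```

For example, with ten commits `c0` to `c9` where everything from `c6` onward fails the test, `first_bad` returns `c6` after three test runs. If the test case is not fully deterministic, or the behaviour changes more than once in the range, the result can point at the wrong commit.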
On Tuesday 06 October 2009, David Rientjes wrote:
> On Mon, 5 Oct 2009, Frans Pop wrote:
> > And the winner is:
> > 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> > commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> > Author: David Rientjes <email@example.com>
> > Date: Tue Jun 16 15:32:56 2009 -0700
> > oom: move oom_adj value from task_struct to mm_struct
> > I'm confident that the bisection is good. The test case was very reliable
> > while zooming in on the merge from akpm.
> I doubt it for two reasons: (i) this commit was reverted in 0753ba0 since
> 2.6.31-rc7 and is no longer in the kernel, and (ii) these are GFP_ATOMIC
> allocations which would be unaffected by oom killer scores.
More debugging has been carried out.
References : http://lkml.indiana.edu/hypermail/linux/kernel/0910.1/01395.html
*** Bug 14475 has been marked as a duplicate of this bug. ***
*** Bug 14440 has been marked as a duplicate of this bug. ***
*** Bug 14655 has been marked as a duplicate of this bug. ***
Handled-By : Mel Gorman <firstname.lastname@example.org>
Notify-Also : Karol Lewandowski <email@example.com>
Notify-Also : Frans Pop <firstname.lastname@example.org>
Notify-Also : David Rientjes <email@example.com>
Notify-Also : Pekka Enberg <firstname.lastname@example.org>
Notify-Also : Tobias Oetiker <email@example.com>
Notify-Also : Chris Mason <firstname.lastname@example.org>
Notify-Also : KOSAKI Motohiro <email@example.com>
Hi, Linux 2.6.32 is still affected.
No problem when iwlagn and iwlcore are not loaded, but once they are loaded, order 2 page allocation failures are reported from all kinds of programs that use a lot of memory (cc1plus and swapper, for instance).
I'll attach a full dmesg. I don't have time to do that now, but could a bisection help?
Created attachment 24053 [details]
dmesg output with iwlagn on kernel 2.6.32
This behavior as seen in iwlagn should disappear when you use the new paged RX for this driver which will be in 2.6.33.
Kernel 2.6.33-rc2 still has the same problem. Is the new paged RX already in the kernel? Do I have to activate it somewhere in the kernel config (I don't see that option in the iwlcore module configuration)?
(In reply to comment #16)
> Kernel 2.6.33-rc2 still has the same problem, is the new paged RX already in
> the kernel? Do I have to activate it somewhere in the kernel config (I don't
> see that option in the iwlcore module configuration)?
Yes - the paged rx work is now in 2.6.33. Are you seeing order 2 page allocation failures while the driver is running? This is strange since the paged rx work modified the allocations from order 2 to order 1. Could you please pass on a trace?
I saw it once with 2.6.33-rc1 (or rc2) but not anymore with rc3.
*** Bug 15766 has been marked as a duplicate of this bug. ***
Created attachment 26304 [details]
2.6.33 mem page allocation failure
I think I am hit by this bug: under heavy I/O I can trigger it.
I am running the released 2.6.33 kernel.
[34858.683705] Xorg: page allocation failure. order:1, mode:0x50d0
Full dmesg will be attached to the bug report
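To make the `order:N` field in such a message concrete: an order-N request asks for 2**N physically contiguous pages. The sketch below parses a failure line of the shape quoted above and reports the request size. It assumes 4 KiB pages (typical for x86, but an assumption here), and `parse_allocation_failure` is a hypothetical helper. It deliberately does not decode the `mode` bits, since the GFP flag values differ between kernel versions.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Matches e.g. "[34858.683705] Xorg: page allocation failure. order:1, mode:0x50d0"
_FAILURE_RE = re.compile(
    r"^(?:\[\s*(?P<timestamp>\d+\.\d+)\]\s*)?"
    r"(?P<comm>.+?): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def parse_allocation_failure(line):
    """Return a dict describing the failed request, or None if no match."""
    match = _FAILURE_RE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    return {
        "comm": match.group("comm"),            # task that hit the failure
        "order": order,
        "mode": int(match.group("mode"), 16),   # raw GFP mask, not decoded
        "pages": 1 << order,                    # contiguous pages requested
        "bytes": PAGE_SIZE << order,            # request size in bytes
    }
```

Under the 4 KiB assumption, the order:1 failure above is an 8 KiB contiguous request, and the order 2 failures this bug was opened for are 16 KiB requests.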
I think I triggered this bug yesterday while shrinking one of my virtual machines in VMware.
I have had the hanging behaviour for some time now; mostly the hangs were no longer than 10 seconds. I never managed to find any information in my logs until yesterday. The hang lasted ~30 minutes.
I think this is why the attached dmesg is incomplete: the kernel ring buffer is too small to capture all of these events, including bootup.
I'm currently running:
dante@phoenix ~ $ uname -a
Linux phoenix 2.6.35-rc4 #8 SMP PREEMPT Mon Jul 19 14:36:36 CEST 2010 x86_64 Intel(R) Core(TM)2 Duo CPU L9600 @ 2.13GHz GenuineIntel GNU/Linux
If there are any bits of information I can provide to help please let me know.
Created attachment 29502 [details]
dmesg output of massive hang
Any more detail on how to trigger this? Also, what is the system setup, such as band/channel, traffic type (legacy/HT), ...?
Fixed in 2.6.34-rc1 by:
Author: Alan Cox <firstname.lastname@example.org>
Date: Thu Feb 18 16:43:47 2010 +0000
tty: Keep the default buffering to sub-page units