Kernel Bug Tracker – Bug 9966
kernel BUG at snapshot.c:464, caused by pfn_valid() lying
Last modified: 2008-03-13 15:45:14 UTC
Subject : kernel BUG at kernel/power/snapshot.c:464!
Submitter : Jeff Mahoney <firstname.lastname@example.org>
Date : 2008-02-08 20:03
References : http://lkml.org/lkml/2008/2/8/331
Handled-By : "Rafael J. Wysocki" <email@example.com>
This entry is being used for tracking a regression from 2.6.24. Please don't
close it until the problem is fixed in the mainline.
The source of the bug is that on a K8 NUMA system there is a PFN for which
pfn_valid() returns 'true' and yet it doesn't belong to any zone.
Regression from .24? The oops in Jeff's message says it is from .24 so how
can it be an regression from .24
If it's in .25 it would be probably related to the new vmemmap code, but it isn't
Andi: He specifically says he can't reproduce it in 2.6.24-final...
Rafael: I'm no longer sure this is "my" bug. I see death in swsusp_save...
The oops he posted says
Pid: 3131, comm: cat Not tainted 2.6.24-vanilla #14
Ok perhaps he posted the wrong oops.
Jeff, can you clarify?
Sure, the test kernel was built from a git tree pulled before 2.6.25-rc1 was released, so version hadn't been bumped yet. Unfortunately, I've since deleted that tree, so the only version information I have left is the timestamp of that message. The problem does still exist in 2.6.25-rc4-git1, though.
Would it be practical to carry out a binary search for the commit that caused this regression?
Ah, actually, I must've ended up booting with the wrong kernel at some point. 2.6.24-final does actually fail. I'll spend some more time tracking this down.
Rafael - I'll do the bisect. I had tried it previously and didn't receive good results, but that would be due to the fact that I mistakenly thought 2.6.24 was a known good version for this.
Okay, so this is not a recent regression.
Could you please try to identify the last working kernel?
No, it's not a recent regression. It turns out the last working version is 2.6.21. I'll bisect between 2.6.21 and 2.6.22 to narrow it down.
Ok, bisect's done:
74dfd666de861c97d47bdbd892f6d21b801d0247 is first bad commit
Author: Rafael J. Wysocki <firstname.lastname@example.org>
Date: Sun May 6 14:50:43 2007 -0700
swsusp: do not use page flags
Make swsusp use memory bitmaps instead of page flags for marking 'nosave' and
free pages. This allows us to 'recycle' two page flags that can be used for
other purposes. Also, the memory needed to store the bitmaps is allocated
when necessary (ie. before the suspend) and freed after the resume which is
The patch is designed to minimize the amount of changes and there are some
nice simplifications and optimizations possible on top of it. I am going to
implement them separately in the future.
Signed-off-by: Rafael J. Wysocki <email@example.com>
Acked-by: Pavel Machek <firstname.lastname@example.org>
Signed-off-by: Andrew Morton <email@example.com>
Signed-off-by: Linus Torvalds <firstname.lastname@example.org>
oop, our regressionmeister caused a regression.
Well, this patch only introduced the check that triggers. Later, we added the pfn_valid() check in kernel/power/snapshot.c:mark_nosave_pages() that should catch all problems like this. If it doesn't, there's a bug and it looks like it's been there forever.
I'll try to prepare a workaround for it, but IMHO the original bug should be fixed (which is that pfn_valid() apparently returns 'true' for a page which doesn't belong to any zone).
Created attachment 15178 [details]
Patch to test
Jeff, can you please check if the attached patch helps?
Created attachment 15179 [details]
Another version of the patch
The previous version of the patch may generate too much output. Please test this one instead.
Both patches are buggy. I'll attach a correct one in a while.
Created attachment 15181 [details]
Updated patch to test (hopefully correct)
Please test this patch. It works on my box FWIW, but the unpatched code works on it either.
Ok, that works around the problem. The output I get is:
swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
swsusp: Marking nosave pages: 00000000f7ff0000 - 0000000100000000
PM: Invalid pfn: 1015792
PM: Invalid pfn: 1015793
PM: Invalid pfn: 1015794
PM: Invalid pfn: 1015795
PM: Invalid pfn: 1015796
PM: Invalid pfn: 1015797
PM: Invalid pfn: 1015798
PM: Invalid pfn: 1015799
PM: Invalid pfn: 1015800
PM: Invalid pfn: 1015801
PM: Invalid pfn: 1015802
PM: Invalid pfn: 1015803
PM: Invalid pfn: 1015804
PM: Invalid pfn: 1015805
PM: Invalid pfn: 1015806
PM: Invalid pfn: 1015807
swsusp: Basic memory bitmaps created
swsusp: Basic memory bitmaps freed
Created attachment 15182 [details]
Final workaround patch
One more refinement: print page addresses in the same format as in the other messages.
Please attach full dmesg output from the kernel with this patch applied up to (and including) the point where you try to reproduce the problem.
Created attachment 15183 [details]
Here's the full dmesg output. Note that resume gets run in the initrd, but I've also run in manually to ensure reproduction once the system goes multiuser.
Hm, that's interesting. Apparently, the problem only occurs for exactly 16 pages starting at 0xf7ff0000. It looks like an off-by-one error somewhere or something similar. Well ...
Anyway, thanks for testing. I'll try to push the patch upstream, as I really don't expect the underlying issue to be solved anytime soon. :-)
Hmm... something went very wrong in memory_bm_find_bit, and instead of fixing problem, we workaround and hope it will not bite us?
Is it possible that someone allocated some memory and confused our accounting?
Or maybe we can revert the bitmap stuff until we understand what is going on there?
First, we can't revert the bitmaps stuff, which is as old as 2.6.21 BTW, because the mm people have already taken the page flags released by it.
Second, the thing that went wrong was that the pfn provided to memory_bm_find_bit(), which is a pfn of the page we DO NOT WANT TO save, turned out to be outside of any zone, while pfn_valid(pfn) returned 'true' for it. It is actually SAFE to ignore this error and go on, because we won't be trying to save this page ANYWAY.
IOW, the situation is understood and the workaround is the best thing we can do at the moment.
Andrew, could you help here? It seems like swsusp code is just a messenger here.
Nick, I'm told you are experienced in this area, could you help? This code is going to land in opensuse11 pretty soon...
I think I know what's going on.
The PFNs triggering the error belong to the ACPI data area, which is mapped directly using ioremap(), so the pfn_valid() check doesn't catch them.
Thus I think the patch from Comment #19 is the right thing to do (please note that in this patch the PFNs from the outside of all zones are only ignored while marking the "nosave" pages and we won't try to save any memory from the outside of all zones anyway, so it's all consistent).
Created attachment 15201 [details]
Final patch updated
Well, in fact we can just ignore the result returned by mem_bm_set_bit_check() without printing a message for the "invalid" PFNs (they are only "invalid" from our point of view).
It looks like Rafael has it in hand, but...
Yeah, I believe pfn_valid() is, umm, "approximate" on x86_64 because
making it exact would be took expensive. Andi might be able to
remind us what's going on in there.
Can we get slow_pofn_valid() that actually works? ;-).
Could you elaborate, please?
In architecture independent code, pfn_valid really should be avoided. I think the one thing you can be sure of is that "valid page" -> pfn_valid, but not vice versa.
Can't you avoid relying on pfn_valid if the architecture registers all nosave regions correctly?
Well, in fact that's what the patch from Comment #27 is for. :-)
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a82f7119fd940c1505fc9fdf93d835fa52bc075d