Bug 9966 - kernel BUG at snapshot.c:464, caused by pfn_valid() lying
kernel BUG at snapshot.c:464, caused by pfn_valid() lying
Status: CLOSED CODE_FIX
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
All Linux
: P1 normal
Assigned To: Rafael J. Wysocki
:
Depends on:
Blocks: 7216
  Show dependency treegraph
 
Reported: 2008-02-13 14:35 UTC by Rafael J. Wysocki
Modified: 2008-03-13 15:45 UTC (History)
9 users (show)

See Also:
Kernel Version: commit 74dfd666de861c97d47bdbd892f6d21b801d0247
Tree: Mainline
Regression: Yes


Attachments
Patch to test (2.29 KB, patch)
2008-03-08 11:10 UTC, Rafael J. Wysocki
Details | Diff
Another version of the patch (2.52 KB, patch)
2008-03-08 11:34 UTC, Rafael J. Wysocki
Details | Diff
Updated patch to test (hopefully correct) (2.54 KB, patch)
2008-03-08 14:01 UTC, Rafael J. Wysocki
Details | Diff
Final workaround patch (2.57 KB, patch)
2008-03-08 16:36 UTC, Rafael J. Wysocki
Details | Diff
dmesg output (26.27 KB, text/plain)
2008-03-08 21:47 UTC, Jeff Mahoney
Details
Final patch updated (3.20 KB, patch)
2008-03-10 06:53 UTC, Rafael J. Wysocki
Details | Diff

Description Rafael J. Wysocki 2008-02-13 14:35:30 UTC
Subject         : kernel BUG at kernel/power/snapshot.c:464!
Submitter       : Jeff Mahoney <jeffm@suse.com>
Date            : 2008-02-08 20:03
References      : http://lkml.org/lkml/2008/2/8/331
Handled-By      : "Rafael J. Wysocki" <rjw@sisk.pl>

This entry is being used for tracking a regression from 2.6.24.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2008-02-18 04:09:46 UTC
The source of the bug is that on a K8 NUMA system there is a PFN for which
pfn_valid() returns 'true' and yet it doesn't belong to any zone.
Comment 2 Andi Kleen 2008-03-05 16:03:22 UTC
Regression from .24? The oops in Jeff's message says it is from .24 so how
can it be an regression from .24

If it's in .25 it would be probably related to the new vmemmap code, but it isn't
is it?

Comment 3 Pavel Machek 2008-03-06 01:36:40 UTC
Andi: He specifically says he can't reproduce it in 2.6.24-final...

Rafael: I'm no longer sure this is "my" bug. I see death in swsusp_save...
Comment 4 Andi Kleen 2008-03-06 01:44:10 UTC
The oops he posted says

Pid: 3131, comm: cat Not tainted 2.6.24-vanilla #14

Ok perhaps he posted the wrong oops.
Comment 5 Pavel Machek 2008-03-06 13:26:49 UTC
Jeff, can you clarify?
Comment 6 Jeff Mahoney 2008-03-07 07:25:56 UTC
Sure, the test kernel was built from a git tree pulled before 2.6.25-rc1 was released, so version hadn't been bumped yet. Unfortunately, I've since deleted that tree, so the only version information I have left is the timestamp of that message. The problem does still exist in 2.6.25-rc4-git1, though.
Comment 7 Rafael J. Wysocki 2008-03-07 07:58:16 UTC
Would it be practical to carry out a binary search for the commit that caused this regression?
Comment 8 Jeff Mahoney 2008-03-07 08:52:56 UTC
Ah, actually, I must've ended up booting with the wrong kernel at some point. 2.6.24-final does actually fail. I'll spend some more time tracking this down.

Rafael - I'll do the bisect. I had tried it previously and didn't receive good results, but that would be due to the fact that I mistakenly thought 2.6.24 was a known good version for this.
Comment 9 Rafael J. Wysocki 2008-03-07 09:14:47 UTC
Okay, so this is not a recent regression.

Could you please try to identify the last working kernel?
Comment 10 Jeff Mahoney 2008-03-07 15:18:29 UTC
No, it's not a recent regression. It turns out the last working version is 2.6.21. I'll bisect between 2.6.21 and 2.6.22 to narrow it down.
Comment 11 Jeff Mahoney 2008-03-07 18:39:37 UTC
Ok, bisect's done:
74dfd666de861c97d47bdbd892f6d21b801d0247 is first bad commit
commit 74dfd666de861c97d47bdbd892f6d21b801d0247
Author: Rafael J. Wysocki <rjw@sisk.pl>
Date:   Sun May 6 14:50:43 2007 -0700

    swsusp: do not use page flags
    
    Make swsusp use memory bitmaps instead of page flags for marking 'nosave' and
    free pages.  This allows us to 'recycle' two page flags that can be used for
    other purposes.  Also, the memory needed to store the bitmaps is allocated
    when necessary (ie.  before the suspend) and freed after the resume which is
    more reasonable.
    
    The patch is designed to minimize the amount of changes and there are some
    nice simplifications and optimizations possible on top of it.  I am going to
    implement them separately in the future.
    
    Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
    Acked-by: Pavel Machek <pavel@ucw.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Comment 12 Andrew Morton 2008-03-07 18:53:27 UTC
oop, our regressionmeister caused a regression.
Comment 13 Rafael J. Wysocki 2008-03-08 10:26:48 UTC
Well, this patch only introduced the check that triggers.  Later, we added the pfn_valid() check in kernel/power/snapshot.c:mark_nosave_pages() that should catch all problems like this.  If it doesn't, there's a bug and it looks like it's been there forever.

I'll try to prepare a workaround for it, but IMHO the original bug should be fixed (which is that pfn_valid() apparently returns 'true' for a page which doesn't belong to any zone).
Comment 14 Rafael J. Wysocki 2008-03-08 11:10:24 UTC
Created attachment 15178 [details]
Patch to test

Jeff, can you please check if the attached patch helps?
Comment 15 Rafael J. Wysocki 2008-03-08 11:34:07 UTC
Created attachment 15179 [details]
Another version of the patch

The previous version of the patch may generate too much output.  Please test this one instead.
Comment 16 Rafael J. Wysocki 2008-03-08 12:00:13 UTC
Both patches are buggy.  I'll attach a correct one in a while.
Comment 17 Rafael J. Wysocki 2008-03-08 14:01:14 UTC
Created attachment 15181 [details]
Updated patch to test (hopefully correct)

Please test this patch.  It works on my box FWIW, but the unpatched code works on it either.
Comment 18 Jeff Mahoney 2008-03-08 15:04:29 UTC
Ok, that works around the problem. The output I get is:
swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
swsusp: Marking nosave pages: 00000000f7ff0000 - 0000000100000000
PM: Invalid pfn: 1015792
PM: Invalid pfn: 1015793
PM: Invalid pfn: 1015794
PM: Invalid pfn: 1015795
PM: Invalid pfn: 1015796
PM: Invalid pfn: 1015797
PM: Invalid pfn: 1015798
PM: Invalid pfn: 1015799
PM: Invalid pfn: 1015800
PM: Invalid pfn: 1015801
PM: Invalid pfn: 1015802
PM: Invalid pfn: 1015803
PM: Invalid pfn: 1015804
PM: Invalid pfn: 1015805
PM: Invalid pfn: 1015806
PM: Invalid pfn: 1015807
swsusp: Basic memory bitmaps created
swsusp: Basic memory bitmaps freed
Comment 19 Rafael J. Wysocki 2008-03-08 16:36:39 UTC
Created attachment 15182 [details]
Final workaround patch

One more refinement: print page addresses in the same format as in the other messages.

Please attach full dmesg output from the kernel with this patch applied up to (and including) the point where you try to reproduce the problem.
Comment 20 Jeff Mahoney 2008-03-08 21:47:41 UTC
Created attachment 15183 [details]
dmesg output

Here's the full dmesg output. Note that resume gets run in the initrd, but I've also run in manually to ensure reproduction once the system goes multiuser.
Comment 21 Rafael J. Wysocki 2008-03-09 05:42:18 UTC
Hm, that's interesting.  Apparently, the problem only occurs for exactly 16 pages starting at 0xf7ff0000.  It looks like an off-by-one error somewhere or something similar.  Well ...

Anyway, thanks for testing.  I'll try to push the patch upstream, as I really don't expect the underlying issue to be solved anytime soon. :-)
Comment 22 Pavel Machek 2008-03-09 12:11:23 UTC
Hmm... something went very wrong in memory_bm_find_bit, and instead of fixing problem, we workaround and hope it will not bite us?

Is it possible that someone allocated some memory and confused our accounting?

Or maybe we can revert the bitmap stuff until we understand what is going on there?
Comment 23 Rafael J. Wysocki 2008-03-09 12:59:58 UTC
First, we can't revert the bitmaps stuff, which is as old as 2.6.21 BTW, because the mm people have already taken the page flags released by it.

Second, the thing that went wrong was that the pfn provided to memory_bm_find_bit(), which is a pfn of the page we DO NOT WANT TO save, turned out to be outside of any zone, while pfn_valid(pfn) returned 'true' for it.  It is actually SAFE to ignore this error and go on, because we won't be trying to save this page ANYWAY.

IOW, the situation is understood and the workaround is the best thing we can do at the moment.
Comment 24 Pavel Machek 2008-03-10 06:24:28 UTC
Andrew, could you help here? It seems like swsusp code is just a messenger here.
Comment 25 Pavel Machek 2008-03-10 06:26:09 UTC
Nick, I'm told you are experienced in this area, could you help? This code is going to land in opensuse11 pretty soon...
Comment 26 Rafael J. Wysocki 2008-03-10 06:38:40 UTC
I think I know what's going on.

The PFNs triggering the error belong to the ACPI data area, which is mapped directly using ioremap(), so the pfn_valid() check doesn't catch them.

Thus I think the patch from Comment #19 is the right thing to do (please note that in this patch the PFNs from the outside of all zones are only ignored while marking the "nosave" pages and we won't try to save any memory from the outside of all zones anyway, so it's all consistent).
Comment 27 Rafael J. Wysocki 2008-03-10 06:53:22 UTC
Created attachment 15201 [details]
Final patch updated

Well, in fact we can just ignore the result returned by mem_bm_set_bit_check() without printing a message for the "invalid" PFNs (they are only "invalid" from our point of view).
Comment 28 Anonymous Emailer 2008-03-10 09:12:01 UTC
Reply-To: akpm@linux-foundation.org

It looks like Rafael has it in hand, but...

Yeah, I believe pfn_valid() is, umm, "approximate" on x86_64 because
making it exact would be took expensive.  Andi might be able to
remind us what's going on in there.

Comment 29 Pavel Machek 2008-03-10 15:16:11 UTC
Can we get slow_pofn_valid() that actually works? ;-).
Comment 30 Rafael J. Wysocki 2008-03-10 16:07:05 UTC
Could you elaborate, please?
Comment 31 Nick Piggin 2008-03-10 18:36:01 UTC
In architecture independent code, pfn_valid really should be avoided. I think the one thing you can be sure of is that "valid page" -> pfn_valid, but not vice versa.

Can't you avoid relying on pfn_valid if the architecture registers all nosave regions correctly?
Comment 32 Rafael J. Wysocki 2008-03-11 05:49:31 UTC
Well, in fact that's what the patch from Comment #27 is for. :-)

Note You need to log in before you can comment on or make changes to this bug.