Bug 15685

Summary: Kernel 2.6.33 w/ TuxOnIce fails to suspend (bisected)
Product: Drivers
Component: Video(DRI - non Intel)
Reporter: Nix (nix)
Assignee: Nigel Cunningham (nigel)
Status: RESOLVED UNREPRODUCIBLE
Severity: normal
Priority: P1
CC: glisse, lenb, nigel, rjw
Hardware: All
OS: Linux
Kernel Version: 2.6.33
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 7216
Attachments:
  Kernel config on crashing machine ('make oldconfig' from working 2.6.32 configuration)
  dmesg of suspend-crashing kernel
  dmesg of suspend-crashing kernel

Description Nix 2010-04-02 19:21:25 UTC
Created attachment 25828
Kernel config on crashing machine ('make oldconfig' from working 2.6.32 configuration)

(This is a re-report of fdo bug 26872, because I've seen no evidence that anyone's looking at fdo DRI bugs at all, but bugs reported to the kernel list at least get Rafael's scripts looking at them.)

(Crash persists with 2.6.33.2 and the tip of drm-radeon-testing.)

I found to my unhappiness that suspension locks up solid in the atomic copy/restore phase, on my x86-64 KMS system. There's no need to start X or do anything 3D, I can reproduce this from a framebuffer console login prompt. The fault is plainly Radeon KMS's: compile it out and suspension works fine.

Nothing is logged on the netconsole, even with verbose PM debugging on.

The graphics card is an HD4870, and suspension mostly worked with it in 2.6.32 (there are circumstances in which TuxOnIce does two suspensions without an intervening resume, and those have always caused Radeon KMS to lock up).

My attempts to bisect it were somewhat hampered by *another* suspend-resume bug with similar symptoms (for me, a triple flash of the caps-lock light followed by a spontaneous reboot, at atomic copy/restore time), fixed by commit 9270eb1b496cb002d75f49ef82c9ef4cbd22a5a0. (The log for this commit helpfully didn't mention suspend/resume at all, only the fdo bug number, so my grepping checks were fruitless and I wasted six hours bisecting to a fixed bug. Bah.)

The regression comes from a pair of commits that are part of a major API rework :/ (applying one without the other will not compile, obviously). After these commits, suspension never works, although the failure mode changes a few times.

commit ca262a9998d46196750bb19a9dc4bd465b170ff7
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Tue Dec 8 15:33:32 2009 +0100

    drm/ttm: Rework validation & memory space allocation (V3)

commit 312ea8da049a1830aa50c6e00002e50e30df476e
Author: Jerome Glisse <jglisse@redhat.com>
Date:   Mon Dec 7 15:52:58 2009 +0100

    drm/radeon/kms: Convert radeon to new TTM validation API (V2)

Before these commits, suspension works. Afterwards, instead of suspension I see a quintuple(?) flash of the caps lock light and a hard reboot. I'm not certain what this means: a triple fault?

The second change in behaviour, between abrupt reboot on suspend and hard hang, I haven't yet fully bisected (forty-three reboots in one evening was quite enough), but it lies in the range
2c761270d5520dd84ab0b4e47c24d99ff8503c38..004b35063296b6772fa72404a35b498f1e71e87e. In practice I strongly suspect that this is the same bug, manifesting differently.
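
For a range like this, where both ends fail but in different ways, git's custom bisect terms may be a better fit than good/bad. A minimal sketch, assuming a git new enough to support `--term-old`/`--term-new` (2.7 or later); the function name and the term names `reboot` and `hang` are hypothetical, chosen only for this example:

```shell
# Sketch: bisect a change in failure mode rather than a good/bad split.
# "reboot" and "hang" are hypothetical term names, not anything git predefines.
bisect_failure_mode() {
    old_rev="$1"   # revision that still shows the older symptom (abrupt reboot)
    new_rev="$2"   # revision that already shows the newer symptom (hard hang)

    # Start a bisection using the custom terms instead of good/bad.
    git bisect start --term-old=reboot --term-new=hang || return 1

    # Mark both ends of the range; git then checks out a midpoint to test.
    git bisect reboot "$old_rev" || return 1
    git bisect hang "$new_rev"
}
```

Assuming the first commit of the range above is the end that still reboots, this would be called with the two commit ids in that order, after which each test boot is recorded with `git bisect reboot` or `git bisect hang` until git names the first commit showing the hang.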

Comparison with other Radeon and KMS users indicates that this is almost
certainly r6xx-r7xx-specific. (rv3xx works, r3xx works, Intel works.)

(Well, either that or I'm the only person in the world who's seeing it: I
haven't found anyone else with a r6xx or r7xx who's trying to suspend, yet.)


If there's anything else I can do to help debug a hard lockup like this, please say. I do have a second machine available to debug the first, but if the first is *dead* it's hard to do anything...
Comment 1 Jérôme Glisse 2010-04-02 19:30:00 UTC
Please test the latest 2.6.34-rc; we committed a bunch of fixes in there.
Comment 2 Nix 2010-04-02 20:25:59 UTC
Same failure with 2.6.34 (Linus tree head as of five minutes ago), I'm afraid.
Comment 3 Jérôme Glisse 2010-04-02 20:54:57 UTC
Is this a laptop? If so, please give the full model name.
If not please give motherboard model and CPU model.

Thanks
Comment 4 Nix 2010-04-02 21:02:48 UTC
Not a laptop: I'm suspending to cut my electricity bill (by nearly 30% at last count, ouch).

This is a 2.67GHz Intel Core i7 920 (stepping 4) on an Asus P6T motherboard.
Comment 5 Nix 2010-04-03 11:05:46 UTC
If there was some way to get info out of the system while the freezer is engaged I'd generate some and give you the freeze point, but unfortunately the network card appears to be one of the frozen entities, so netconsole and presumably kgdb-over-Ethernet are unhelpful. (None of my machines have serial ports, only serial-over-USB, so serial-kgdb isn't useful in this situation either.)
Comment 6 Jérôme Glisse 2010-04-03 17:59:08 UTC
I will reproduce your configuration next week. In the meantime, please attach the full dmesg and full Xorg.0.log. Thanks.
Comment 7 Nix 2010-04-03 18:29:24 UTC
OK. It'll be about a day, I'm in the middle of a huge KDE3->KDE4 transition right now.
Comment 8 Nix 2010-04-05 11:37:36 UTC
Created attachment 25864
dmesg of suspend-crashing kernel

Here's the dmesg. Are you sure you want an X log as well? This crash happens if you suspend from the textual login prompt without ever going near X, so it seems as if it would be superfluous to me.
Comment 9 Nix 2010-04-05 11:39:56 UTC
Created attachment 25865
dmesg of suspend-crashing kernel

Sorry, bugzilla being unhelpful: an attachment with an URI to an internal machine that you can't get to is no use to you at all... here's the actual file.
Comment 10 Rafael J. Wysocki 2010-04-05 21:25:08 UTC
First-Bad-Commit : 312ea8da049a1830aa50c6e00002e50e30df476e
Comment 11 Jérôme Glisse 2010-04-07 13:40:11 UTC
OK, it works here. Please test without TuxOnIce: TuxOnIce is known to be problematic and it's out of tree, so I won't have time to fix it.
Comment 12 Nix 2010-04-07 19:25:49 UTC
Er, how do I suspend without tuxonice? I've never used the in-tree swsusp successfully (last time I tried I didn't just get a crash but massive filesystem corruption, so I sort of haven't dared: reporting it would have meant reproducing it, and I didn't want to touch it with a bargepole after *that*).

Ideal would be a way to kick off the freezer and then just unfreeze without suspending: that'd be pretty much risk-free...
Comment 13 Rafael J. Wysocki 2010-04-07 19:43:32 UTC
(In reply to comment #12)
> Er, how do I suspend without tuxonice? I've never used the in-tree swsusp
> successfully (last time I tried I didn't just get a crash but massive
> filesystem corruption, so I sort of haven't dared: reporting it would have
> meant reproducing it, and I didn't want to touch it with a bargepole after
> *that*).

It's kind of strange given that quite a few people use it daily, including everyone using hibernation in the default openSUSE setup.

Anyway,

> Ideal would be a way to kick off the freezer and then just unfreeze without
> suspending: that'd be pretty much risk-free...

# echo core > /sys/power/pm_test
# echo disk > /sys/power/state

should do the trick.
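
The two commands above leave pm_test set to "core", so a later hibernation would still be a test run. A minimal sketch of wrapping them so that pm_test is put back to "none" afterwards; the function name and the PM_ROOT override are hypothetical (PM_ROOT only exists so the function can be tried against a scratch directory), and it assumes a kernel built with CONFIG_PM_DEBUG so that pm_test is present:

```shell
# Sketch: run one pm_test hibernation cycle, then restore normal behaviour.
# PM_ROOT is a hypothetical override; on a real system it is /sys/power.
PM_ROOT="${PM_ROOT:-/sys/power}"

pm_test_cycle() {
    level="$1"   # test level, e.g. "core" as in the recipe above

    # pm_test is only present when the kernel has PM debugging built in.
    if [ ! -w "$PM_ROOT/pm_test" ]; then
        echo "pm_test not available under $PM_ROOT" >&2
        return 1
    fi

    echo "$level" > "$PM_ROOT/pm_test" || return 1

    # With pm_test set, this runs the hibernation steps up to the chosen
    # level and then resumes instead of writing an image.
    echo disk > "$PM_ROOT/state"
    status=$?

    # Reset the test level so the next hibernation is a real one.
    echo none > "$PM_ROOT/pm_test"
    return "$status"
}
```

Run as root, `pm_test_cycle core` would then be the equivalent of the recipe above, with the reset added.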
Comment 14 Nix 2010-04-07 23:17:03 UTC
Oh, *thank* you, that's very useful indeed.

That lets me eliminate virtually everything from consideration, because it works with ToI off (but patched in) and fails with it on! Thus the problem is either a ToI bug (and likely a bug in not very much ToI code; most of the ToI code relates to block writeout, not freezing), or a bug in the core kernel tickled by ToI.


(btw, I'm willing to believe my fs damage when last I tried the core suspension code was my fat-fingering, not the code itself. It plainly wasn't the freezer, whatever it was, as ToI uses that too, and that worked.)
Comment 15 Jérôme Glisse 2010-04-08 06:41:07 UTC
I think the issue is that TuxOnIce assumes something about the graphics driver that is wrong with KMS, something about memory: TTM is messing with memory, and TuxOnIce may miss that and not properly save our memory.
Comment 16 Rafael J. Wysocki 2010-04-08 19:02:23 UTC
Reassigning to Nigel.
Comment 17 Len Brown 2010-04-08 19:25:54 UTC
Nix: can you clarify if the in-kernel hibernate capability
is working reliably and repeatedly or not?

If it is, and TOI is not, then that could be a great help
in isolating the issue..
Comment 18 Nix 2010-04-08 19:31:03 UTC
The recipe given above (freezing then resuming again) works perfectly if ToI is not compiled in, and fails when it is compiled in.

I speculate that ToI is either attempting to save a page that simply isn't there (which would explain the reboot) or is looping saving some page (which would explain the hang). I thought the freezer was entirely generic code, but obviously (given this) this is an oversimplification. I guess I'd better get reading the source rather than speculating wildly :)

Does anyone have a way of getting debugging output out of the freezer process? It's hard to debug something when every means I might use to get info out has been frozen... I might try to exempt the network card from freezing, so netconsole still works. Obviously actually hibernating in this situation would be problematic, but instantly resuming again should be OK. Then I can scatter printk()s around and search for the failure locus more precisely.
Comment 19 Nigel Cunningham 2010-04-08 21:19:46 UTC
Nix, if it is the fact that we save the image in two parts that's causing the problem, that will be easy to establish. Just do

echo 1 > /sys/power/tuxonice/no_pageset2

prior to trying to hibernate.
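
When flipping this setting back and forth between test runs, it may help to read the value back after writing it. A minimal sketch; the function name and the TOI_ROOT override are hypothetical, and treating 0 as "back to the two-part default" is an assumption rather than something stated in this report:

```shell
# Sketch: set the TuxOnIce no_pageset2 knob and confirm what it now holds.
# TOI_ROOT is a hypothetical override for trying this on a scratch directory.
TOI_ROOT="${TOI_ROOT:-/sys/power/tuxonice}"

set_no_pageset2() {
    value="$1"   # 1 as in the command above; 0 is assumed to undo it
    knob="$TOI_ROOT/no_pageset2"

    # The knob only exists on kernels patched with TuxOnIce.
    if [ ! -w "$knob" ]; then
        echo "no such knob: $knob" >&2
        return 1
    fi

    echo "$value" > "$knob" || return 1

    # Read the value back so the log of a test run records the setting used.
    echo "no_pageset2 is now $(cat "$knob")"
}
```

Run as root, `set_no_pageset2 1` before a hibernation attempt matches the command above.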

Given that you're talking about problems with freezing, I don't think that should make a difference. It sounds more like it might be an issue with the fuse work I have in the TuxOnIce patch. Do you have any fuse filesystems mounted when trying to hibernate?

There's a compile time option to enable debugging of the new fuse code, by the way. In the Power Management support menu, there's a new option (when patched with TuxOnIce) called "Filesystem freezer debugging". Enable it to get more info.

Regarding netconsole, it will work through the whole process without you needing to make any modifications. I use it often when testing in a virtual machine.

Nigel

By the way, Jerome: I'm wondering what you mean by TuxOnIce being known to be problematic. I'm not aware of any issues, and will happily be told! :)
Comment 20 Nix 2010-04-09 19:53:45 UTC
When the image is saved in one go, no crash: suspend/resume works perfectly. You've got the failure, all right. Not only does it work when suspending from the fb console, but even insane things like suspending while running X, scorched3d, xine, and a compositing window manager all at the same time work :)

(In hindsight, I wonder if we didn't have a sign of this problem before. Even in 2.6.32 --- and 2.6.31+drm-radeon-testing --- I saw a lockup at freeze time if ToI found that it had to expand its extra pages allowance, which leads to a pageset1-freeze-pageset1-freeze-pageset2 saving pattern (the second freeze crashed). So my extra pages allowance is currently set to the fairly ridiculous figure of 50000 pages, to stop this happening...)

(Why *do* we save the image in two parts? Presumably it's not just to be gratuitously different from in-kernel suspend: from looking at need_pageset2() I suppose it's so that we can suspend when more than half of memory needs to be saved, by saving as much of it as possible before copying, then ditching it, in the hope that this gives us enough space to copy the rest into?)
Comment 21 Nigel Cunningham 2010-04-09 22:29:26 UTC
Thanks for the reply. I'll look into what changed with that commit.

As to why we save the image in two parts, it's because it lets us get a complete image of memory. Without using two parts, we have to spend time freeing memory so we have somewhere to put the atomic copy. That freed memory means we take less time to write the image, but it also makes the system less responsive post-resume, because we usually then have to page back in some or all of this memory that we discarded when hibernating. It makes more sense in most cases to try and save a complete image of memory, perhaps resulting in a slightly longer time for reading and writing the image, but a more responsive system post-resume.

In using two parts, we save pages we know / expect won't change or be needed first, then use the memory they occupied for our atomic copy of the remainder. The trick is in finding memory that won't change or be needed. For a long time, the algorithm was nice and simple (LRU pages), but with the advent of KMS, things are getting a little more complicated - we need a way of finding those pages and putting them in the atomically copied part of the image. I had done that, but I guess the above commit changes things again.
Comment 22 Nigel Cunningham 2010-04-09 22:33:03 UTC
Hey Rafael, if you have this in your regression list, please remove it.

Also, should I be seeking to get some way of marking bugs as applying to TuxOnIce and not mainline? I don't think there'd be much call for it - I searched for bugs mentioning TuxOnIce yesterday and found that every other one apart from this ended up not being in TuxOnIce at all. Nevertheless, thought I'd ask. Perhaps you'd prefer to just close it as invalid and leave Nix and I to discuss it on TuxOnIce-devel.
Comment 23 Nigel Cunningham 2010-04-09 23:09:47 UTC
Hmm... correction to comment 21 - it's LRU pages + process pages that aren't needed (ie not userui).
Comment 24 Rafael J. Wysocki 2010-04-10 19:12:10 UTC
(In reply to comment #22)
> Hey Rafael, if you have this in your regression list, please remove it.

No, it's not in the regression list.

> Also, should I be seeking to get some way of marking bugs as applying to
> TuxOnIce and not mainline? I don't think there'd be much call for it - I
> searched for bugs mentioning TuxOnIce yesterday and found that every other one
> apart from this ended up not being in TuxOnIce at all. Nevertheless, thought
> I'd ask. Perhaps you'd prefer to just close it as invalid and leave Nix and I
> to discuss it on TuxOnIce-devel.

Actually, I'm interested in the outcome because of the ongoing effort to rework
the in-kernel hibernation, so please track it here if that's not a problem.
Comment 25 Nigel Cunningham 2010-05-29 12:03:12 UTC
I've managed to try TuxOnIce + KMS + 2.6.34 today, and can repeatedly hibernate and resume without problems. This is a 32 bit machine though, so YMMV.

Nix, would you be able to check whether it works again with 2.6.34, or with 2.6.34 + a 32 bit kernel?

Regards,

Nigel
Comment 26 Nix 2010-05-30 16:05:05 UTC
It appears to work fine on my x86-64 2.6.34 system. In the last month and a half it looks like someone accidentally fixed this :) I can't see any particularly obvious commits, so it's possible that sheer chance and change has moved things around in memory such that whatever latent bug this was has become once more latent.

I'll raise a ruckus if it comes back again, but for now I think this must be closed, since the testcase has gone unreproducible on me.