Bug 15685
| Summary: | Kernel 2.6.33 w/ TuxOnIce fails to suspend (bisected) | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Nix (nix) |
| Component: | Video(DRI - non Intel) | Assignee: | Nigel Cunningham (nigel) |
| Status: | RESOLVED UNREPRODUCIBLE | | |
| Severity: | normal | CC: | glisse, lenb, nigel, rjw |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.33 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 7216 | | |
| Attachments: | Kernel config on crashing machine ('make oldconfig' from working 2.6.32 configuration); dmesg of suspend-crashing kernel (attachment 25864); dmesg of suspend-crashing kernel (attachment 25865) | | |
Description
Nix
2010-04-02 19:21:25 UTC
Please test the latest 2.6.34-rc; we committed a bunch of fixes in there.

Same failure with 2.6.34 (Linus tree head as of five minutes ago), I'm afraid.

Is this a laptop? If so, please give the full model name. If not, please give the motherboard model and CPU model. Thanks

Not a laptop: I'm suspending to cut my electricity bill (by nearly 30% at last count, ouch). This is a 2.67GHz Intel Core i7 920 (stepping 4) on an Asus P6T motherboard.

If there was some way to get info out of the system while the freezer is engaged I'd generate some and give you the freeze point, but unfortunately the network card appears to be one of the frozen entities, so netconsole and presumably kgdb-over-Ethernet are unhelpful. (None of my machines have serial ports, only serial-over-USB, so serial-kgdb isn't useful in this situation either.)

I will reproduce your configuration next week; in the meantime, please attach the full dmesg and the full Xorg.0.log. Thanks.

OK. It'll be about a day, I'm in the middle of a huge KDE3->KDE4 transition right now.

Created attachment 25864
dmesg of suspend-crashing kernel

Here's the dmesg. Are you sure you want an X log as well? This crash happens if you suspend from the textual login prompt without ever going near X, so it seems as if it would be superfluous to me.

Created attachment 25865
dmesg of suspend-crashing kernel

Sorry, bugzilla being unhelpful: an attachment with a URI to an internal machine that you can't get to is no use to you at all... here's the actual file.

First-Bad-Commit : 312ea8da049a1830aa50c6e00002e50e30df476e

OK, works here. Please test without TuxOnIce: TuxOnIce is known to be problematic and it's out of tree, so I won't have time to fix it.

Er, how do I suspend without tuxonice? I've never used the in-tree swsusp successfully (last time I tried I didn't just get a crash but massive filesystem corruption, so I sort of haven't dared: reporting it would have meant reproducing it, and I didn't want to touch it with a bargepole after *that*).

Ideal would be a way to kick off the freezer and then just unfreeze without suspending: that'd be pretty much risk-free...

(In reply to comment #12)
> Er, how do I suspend without tuxonice? I've never used the in-tree swsusp
> successfully (last time I tried I didn't just get a crash but massive
> filesystem corruption, so I sort of haven't dared: reporting it would have
> meant reproducing it, and I didn't want to touch it with a bargepole after
> *that*).

It's kind of strange given that quite a few people use it daily, including everyone using hibernation in the default openSUSE setup. Anyway,

> Ideal would be a way to kick off the freezer and then just unfreeze without
> suspending: that'd be pretty much risk-free...

# echo core > /sys/power/pm_test
# echo disk > /sys/power/state

should do the trick.

Oh, *thank* you, that's very useful indeed. That lets me eliminate virtually everything from consideration, because it works with ToI off (but patched in) and fails with it on! Thus the problem is either a ToI bug (and likely a bug in not very much ToI code; most of the ToI code relates to block writeout, not freezing), or a bug in the core kernel tickled by ToI.

(btw, I'm willing to believe my fs damage when last I tried the core suspension code was my fat-fingering, not the code itself. It plainly wasn't the freezer, whatever it was, as ToI uses that too, and that worked.)

I think the issue is that TuxOnIce assumes some things about the graphics driver that are wrong with KMS.
Something about memory: TTM is messing with memory, and TuxOnIce may miss that and not properly save our memory. Reassigning to Nigel.

Nix: can you clarify whether the in-kernel hibernate capability is working reliably and repeatedly or not? If it is, and TOI is not, then that could be a great help in isolating the issue.

The recipe given above (freezing then resuming again) works perfectly if ToI is not compiled in, and fails when it is compiled in. I speculate that ToI is either attempting to save a page that simply isn't there (which would explain the reboot) or is looping saving some page (which would explain the hang). I thought the freezer was entirely generic code, but obviously (given this) this is an oversimplification. I guess I'd better get reading the source rather than speculating wildly :)

Does anyone have a way of getting debugging output out of the freezer process? It's hard to debug something when every means I might use to get info out has been frozen... I might try to exempt the network card from freezing, so netconsole still works. Obviously actually hibernating in this situation would be problematic, but instantly resuming again should be OK. Then I can scatter printk()s around and search for the failure locus more precisely.

Nix, if it is the fact that we save the image in two parts that's causing the problem, that will be easy to establish. Just do

echo 1 > /sys/power/tuxonice/no_pageset2

prior to trying to hibernate.

Given that you're talking about problems with freezing, I don't think that should make a difference. It sounds more like it might be an issue with the fuse work I have in the TuxOnIce patch. Do you have any fuse filesystems mounted when trying to hibernate? There's a compile time option to enable debugging of the new fuse code, by the way. In the Power Management support menu, there's a new option (when patched with TuxOnIce) called "Filesystem freezer debugging". Enable it to get more info.

Regarding netconsole, it will work through the whole process without you needing to make any modifications. I use it often when testing in a virtual machine.

Nigel

By the way, Jerome: I'm wondering what you mean by TuxOnIce being known to be problematic. I'm not aware of any issues, and will happily be told! :)

When the image is saved in one go, no crash: suspend/resume works perfectly. You've got the failure, all right. Not only does it work when suspending from the fb console, but even insane things like suspending while running X, scorched3d, xine, and a compositing window manager all at the same time work :)

(In hindsight, I wonder if we didn't have a sign of this problem before. Even in 2.6.32 --- and 2.6.31+drm-radeon-testing --- I saw a lockup at freeze time if ToI found that it had to expand its extra pages allowance, which leads to a pageset1-freeze-pageset1-freeze-pageset2 saving pattern (the second freeze crashed). So my extra pages allowance is currently set to the fairly ridiculous figure of 50000 pages, to stop this happening...)

(Why *do* we save the image in two parts? Presumably it's not just to be gratuitously different from in-kernel suspend: from looking at need_pageset2() I suppose it's so that we can suspend when more than half of memory needs to be saved, by saving as much of it as possible before copying, then ditching it, in the hope that this gives us enough space to copy the rest into?)

Thanks for the reply. I'll look into what changed with that commit.

As to why we save the image in two parts, it's because it lets us get a complete image of memory. Without using two parts, we have to spend time freeing memory so we have somewhere to put the atomic copy. That freed memory means we take less time to write the image, but it also makes the system less responsive post-resume, because we usually then have to page back in some or all of this memory that we discarded when hibernating.
It makes more sense in most cases to try and save a complete image of memory, perhaps resulting in a slightly longer time for reading and writing the image, but a more responsive system post-resume.

In using two parts, we save pages we know / expect won't change or be needed first, then use the memory they occupied for our atomic copy of the remainder. The trick is in finding memory that won't change or be needed. For a long time, the algorithm was nice and simple (LRU pages), but with the advent of KMS, things are getting a little more complicated - we need a way of finding those pages and putting them in the atomically copied part of the image. I had done that, but I guess the above commit changes things again.

Hey Rafael, if you have this in your regression list, please remove it.

Also, should I be seeking to get some way of marking bugs as applying to TuxOnIce and not mainline? I don't think there'd be much call for it - I searched for bugs mentioning TuxOnIce yesterday and found that every other one apart from this ended up not being in TuxOnIce at all. Nevertheless, thought I'd ask. Perhaps you'd prefer to just close it as invalid and leave Nix and I to discuss it on TuxOnIce-devel.

Hmm... correction to comment 21 - it's LRU pages + process pages that aren't needed (ie not userui).

(In reply to comment #22)
> Hey Rafael, if you have this in your regression list, please remove it.

No, it's not in the regression list.

> Also, should I be seeking to get some way of marking bugs as applying to
> TuxOnIce and not mainline? I don't think there'd be much call for it - I
> searched for bugs mentioning TuxOnIce yesterday and found that every other
> one apart from this ended up not being in TuxOnIce at all. Nevertheless,
> thought I'd ask. Perhaps you'd prefer to just close it as invalid and leave
> Nix and I to discuss it on TuxOnIce-devel.

Actually, I'm interested in the outcome because of the ongoing effort to rework the in-kernel hibernation, so please track it here if that's not a problem.

I've managed to try TuxOnIce + KMS + 2.6.34 today, and can repeatedly hibernate and resume without problems. This is a 32 bit machine though, so YMMV. Nix, would you be able to check whether it works again with 2.6.34 or with 2.6.34 + a 32 bit kernel?

Regards,
Nigel

It appears to work fine on my x86-64 2.6.34 system. In the last month and a half it looks like someone accidentally fixed this :) I can't see any particularly obvious commits, so it's possible that sheer chance and change has moved things around in memory such that whatever latent bug this was has become once more latent.

I'll raise a ruckus if it comes back again, but for now I think this must be closed, since the testcase has gone unreproducible on me.
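
For anyone retesting, the two diagnostics used in this report (the freezer-only run via pm_test, and TuxOnIce's no_pageset2 switch) could be wrapped in a couple of shell functions. This is only a rough sketch: the sysfs paths are the ones quoted in the comments above, while the function names and the SYSFS_ROOT override are made up for illustration, and resetting pm_test to "none" afterwards is an addition that was not part of the original recipe. It assumes a kernel with PM debugging enabled (so that pm_test exists) and, for the second function, a TuxOnIce-patched kernel.

```shell
#!/bin/sh
# Sketch of the two diagnostics mentioned in this report.
# SYSFS_ROOT is a hypothetical override (not from the thread) so the
# functions can be exercised against a scratch directory instead of /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Write a value ($2) to a knob ($1) under $SYSFS_ROOT.
# Refuses if the knob is absent or not writable.
write_knob() {
    knob="$SYSFS_ROOT/$1"
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (missing, or not root?)" >&2
        return 1
    fi
    printf '%s\n' "$2" > "$knob"
}

# Freezer-only test from the thread: with pm_test set to "core", the
# hibernation request goes through the freeze/suspend steps and then
# resumes again without writing an image.
freezer_only_test() {
    write_knob power/pm_test core || return 1
    write_knob power/state disk
    status=$?
    # Put pm_test back to its default so later hibernations are real ones.
    write_knob power/pm_test none
    return "$status"
}

# TuxOnIce-only knob from the thread: save the image in one part.
disable_pageset2() {
    write_knob power/tuxonice/no_pageset2 1
}
```

Run as root, `freezer_only_test` on its own repeats the recipe that separated the ToI-on and ToI-off cases; `disable_pageset2` would be called before a real hibernation attempt, as suggested above, to check whether saving the image in two parts is what triggers the failure.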