Bug 6754
Summary: | S3: suspend freeze if tg3 netconsole (both mainline and +acpi-release-20060608) | ||
---|---|---|---|
Product: | ACPI | Reporter: | Henrique de Moraes Holschuh (hmh) |
Component: | Power-Sleep-Wake | Assignee: | acpi_power-sleep-wake |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | acpi-bugzilla |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.17.1 mainline | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | dmesg output, dmidecode output (serial numbers removed), acpidump output, lspci -vv output | ||
Description
Henrique de Moraes Holschuh
2006-06-27 08:43:40 UTC
Created attachment 8422
dmesg output
dmesg output (doesn't show the lockup).
Created attachment 8423
dmidecode output (serial numbers removed)
dmidecode output, serial numbers were removed.
Created attachment 8424
acpidump output
If the report doesn't make it obvious: the very same kernel with the acpi-release-20060608-2.6.17 patches reverted *does not lock up on suspend with netconsole enabled*.

On one hand, I found the message that was causing trouble and would not be sent:

Jun 27 17:28:17 thorin ACPI: PCI interrupt for device 0000:0b:02.0 disabled
Jun 27 17:28:17 thorin ACPI: PCI interrupt for device 0000:0b:00.0 disabled
Jun 27 17:28:17 thorin acpi_bus-0201 [60] bus_set_power : Device is not power manageable
Jun 27 17:28:17 thorin ACPI: PCI interrupt for device 0000:00:1e.2 disabled
Jun 27 17:28:17 thorin ACPI: PCI interrupt for device 0000:00:1d.7 disabled
...

On the other hand, after triggering it twice, I can no longer trigger this bug with a kernel that has more debugging options enabled, so I cannot get further data. I will attach lspci -vv output just in case, but if I don't manage to trigger this bug again, I will have to declare it a heisenbug and close the report.

Created attachment 8426
lspci -vv output
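The ACPI debug lines quoted in this report share a visible layout: a module name and a four-digit number joined by a dash, a number in square brackets, a function name, and the message. A minimal Python sketch for pulling those fields apart is given here, assuming only the layout seen in the quoted lines. The regular expression and the name `parse_acpi_debug` are illustrative and are not part of any kernel tool.

```python
import re

# Assumed layout, inferred only from the lines quoted in this report:
#   [syslog prefix] <module>-<number> [<value>] <function> : <message>
ACPI_DEBUG_RE = re.compile(
    r"(?P<module>[a-z_0-9]+)-(?P<line>\d{4})\s+"
    r"\[(?P<value>\d+)\]\s+"
    r"(?P<function>\w+)\s*(?::\s*)+(?P<message>.*)$"
)


def parse_acpi_debug(line):
    """Return the fields of an ACPI debug line, or None if it does not match."""
    match = ACPI_DEBUG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    fields["value"] = int(fields["value"])
    return fields


sample = ("Jun 27 17:28:17 thorin acpi_bus-0201 [60] bus_set_power : "
          "Device is not power manageable")
print(parse_acpi_debug(sample))
```

Splitting out the bracketed value makes it easier to compare captures from different runs, since that is the part of the line most likely to differ.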
without netconsole the resume works?

Yes, it did (but not *perfectly*). S3 wakeup works, but is quietly hosing something that ends up causing an ext3 deadlock some unspecified (and wildly varying) time later. I had not experienced hangs at S3 resume without netconsole. Anyway, right now (2.6.17 with stack frames, debug information, spinlock debugging, and a few other debugging options) I cannot reproduce the problem with netconsole. I will try to reproduce the netconsole issue a bit more later.

S3 sleep and resume under 2.6.17 and the new ACPI patches is *definitely* not working well on this machine, while 2.6.16 (plus a ton of SATA-related patches) did. There was this netconsole lockup I am not managing to reproduce. Also, once ALSA would not manage to kick the AC97 codec back into gear (while in the other 50 or so suspend-resume cycles I have done already it worked perfectly), and I had to remove and reload the entire ALSA stack to get the codec reattached and running, etc.

The new ACPI patches run a lot of new DSDT code, and the IBM T43 DSDT is *nasty*, doing SMIs everywhere and for everything, poking the PCI buses directly and so on... so it might be IBM's fault. The kernel definitely is assigning different IRQs to different devices when using the new ACPI, and even across different boots of the same kernel (probably because device discovery order when udev fires coldplug is not always the same).

The good news: I can now duplicate the issue anytime I want. The bad news: the issue is there with stock 2.6.17, AND with 2.6.17+acpi-release-20060608-2.6.17, and I have not managed to work out WHY I could not always duplicate it before.

The problem happens during the S3 *suspend* (and not during resume). SysRq+B sometimes works to recover the machine.

Here are the messages right before the softlock:

Stopping tasks: ================|
pnp: Device 00:0a disabled.
ACPI: PCI interrupt for device 0000:0b:00.0 disabled
acpi_bus-0201 <HANG>

Here are the messages *without* netconsole loaded (no softlock):

Stopping tasks: ================|
pnp: Device 00:0a disabled.
ACPI: PCI interrupt for device 0000:0b:00.0 disabled
acpi_bus-0201 [10] bus_set_power : Device is not power manageable
ACPI: PCI interrupt for device 0000:00:1e.2 disabled
ACPI: PCI interrupt for device 0000:00:1d.7 disabled
(same for 1d.3, 1d.2, 1d.1 and 1d.0)
PM: Entering mem sleep
hwsleep-0283 [11] enter_sleep_state: : Entering sleep state [S3]

And since this kernel has all the uncover-code-doing-screwed-up-things options I could enable, it traps this error during resume:

BUG: sleeping function called from invalid context at include/linux/rwsem.h:43, in_atomic():0, irqs_disabled():1...

which is AFAIK known bogosity in stock 2.6.17 ACPI. This bug is also present in a kernel with the acpi-20060608 patches.

The value between [] in the acpi_bus-0201 line, where the hang happens when netconsole is loaded, changes sometimes. So does the one in the hwsleep-0285 line.

I don't know what is wrong with me today, but the BUG line has nothing to do with ACPI; the blame lies with cpufreq.

The S3 resume issues were traced back to suspend2 silently corrupting kernel running state. This does not mean that the tg3 bug isn't there. After 2.6.18 is out and stable enough (like, at 2.6.18.1 or thereabouts), I will run an intensive test to try to reproduce the tg3 netconsole hang *without* suspend2 or any other patches.

tg3 netconsole status for 2.6.18: BUG FIXED. In 2.6.18, netconsole/tg3 simply shuts down (and loses all console messages until late in the resume process, but that's a different issue). It doesn't hang at all anymore, so this bug can be closed as CODE_FIX in 2.6.18. I will do just that.
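
The comparison made in this report between the suspend messages with and without netconsole can also be done mechanically. The following is a rough Python sketch of that idea, not a tool that was used here: it takes the capture without netconsole as the reference and reports the first reference line that the netconsole capture fails to reproduce in full. The function names `normalize` and `first_undelivered` are hypothetical, and the bracketed number is blanked out because the report notes that it changes between runs.

```python
import re


def normalize(line):
    """Blank out bracketed numbers such as [10], which vary between runs."""
    return re.sub(r"\[\d+\]", "[N]", line.strip())


def first_undelivered(reference, captured):
    """Return the first reference line that `captured` does not fully contain.

    A truncated captured line counts as undelivered, which is what a hang
    in the middle of printing a message looks like. Returns None when every
    reference line was captured.
    """
    for index, ref in enumerate(reference):
        if index >= len(captured) or normalize(captured[index]) != normalize(ref):
            return ref
    return None


# Lines copied from the two captures quoted in this report.
without_netconsole = [
    "Stopping tasks: ================|",
    "pnp: Device 00:0a disabled.",
    "ACPI: PCI interrupt for device 0000:0b:00.0 disabled",
    "acpi_bus-0201 [10] bus_set_power : Device is not power manageable",
    "ACPI: PCI interrupt for device 0000:00:1e.2 disabled",
]
with_netconsole = [
    "Stopping tasks: ================|",
    "pnp: Device 00:0a disabled.",
    "ACPI: PCI interrupt for device 0000:0b:00.0 disabled",
    "acpi_bus-0201",  # output stops here when the machine hangs
]

print(first_undelivered(without_netconsole, with_netconsole))
```

Run on the lines above, this points at the acpi_bus-0201 message, matching the observation that the hang occurs while that line is being printed.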