Bug 6754 - S3: suspend freeze if tg3 netconsole (both mainline and +acpi-release-20060608)
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Sleep-Wake
Hardware: i386 Linux
Importance: P2 normal
Assignee: acpi_power-sleep-wake
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-06-27 08:43 UTC by Henrique de Moraes Holschuh
Modified: 2007-04-28 12:49 UTC

See Also:
Kernel Version: 2.6.17.1 mainline
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmesg output (20.00 KB, text/plain)
2006-06-27 08:50 UTC, Henrique de Moraes Holschuh
dmidecode output (serial numbers removed) (12.85 KB, text/plain)
2006-06-27 08:51 UTC, Henrique de Moraes Holschuh
acpidump output (257.45 KB, text/plain)
2006-06-27 08:52 UTC, Henrique de Moraes Holschuh
lspci -vv output (16.38 KB, text/plain)
2006-06-27 13:31 UTC, Henrique de Moraes Holschuh

Description Henrique de Moraes Holschuh 2006-06-27 08:43:40 UTC
Most recent kernel where this bug did not occur: 2.6.17.1 without acpi-release-20060608-2.6.17
Distribution:
Hardware Environment: ThinkPad T43 2687
Software Environment: Debian
Problem Description:

Upon suspend-to-ram with netconsole loaded, the system deadlocks with
this message:

acpi_bus-0201
(i.e. the rest of the acpi_bus-0201 message is NOT printed).

Steps to reproduce:
Load netconsole and verify that it is working over Ethernet (tg3 on PCI Express).
Suspend-to-ram (S3).
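
The two steps above can be sketched as a small script. This is only an illustration, not what the reporter ran: the IP addresses, ports, MAC address, and interface name are hypothetical placeholders, and the helper names (netconsole_param, repro_commands, run) are made up for this sketch. It assumes netconsole is built as a module and that S3 is entered by writing "mem" to /sys/power/state. Checking that messages actually arrive on the receiving host remains a manual step.

```python
"""Sketch of the reproduction steps; all addresses below are placeholders."""
import subprocess


def netconsole_param(src_port, src_ip, dev, dst_port, dst_ip, dst_mac=""):
    """Build the module parameter in the documented netconsole form:
    netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
    """
    return f"netconsole={src_port}@{src_ip}/{dev},{dst_port}@{dst_ip}/{dst_mac}"


def repro_commands(param):
    """Return the commands for the two steps, in order (nothing is executed)."""
    return [
        ["modprobe", "netconsole", param],            # step 1: load netconsole
        ["sh", "-c", "echo mem > /sys/power/state"],  # step 2: suspend-to-ram (S3)
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical values: replace with the real sender and log receiver.
param = netconsole_param(6665, "192.168.0.2", "eth0", 6666, "192.168.0.1",
                         "00:11:22:33:44:55")
run(repro_commands(param))  # dry run: only prints the commands
```

The script defaults to a dry run so that it never suspends the machine by accident; pass dry_run=False as root to actually execute the steps.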
Comment 1 Henrique de Moraes Holschuh 2006-06-27 08:50:44 UTC
Created attachment 8422
dmesg output

dmesg output (doesn't show the lockup).
Comment 2 Henrique de Moraes Holschuh 2006-06-27 08:51:47 UTC
Created attachment 8423
dmidecode output (serial numbers removed)

dmidecode output, serial numbers were removed.
Comment 3 Henrique de Moraes Holschuh 2006-06-27 08:52:40 UTC
Created attachment 8424
acpidump output
Comment 4 Henrique de Moraes Holschuh 2006-06-27 08:56:21 UTC
In case the report doesn't make it obvious, the very same kernel with the
acpi-release-20060608-2.6.17 patches reverted *does not lock up on suspend with
netconsole enabled*.
Comment 5 Henrique de Moraes Holschuh 2006-06-27 13:29:56 UTC
On one hand, I found the message that was causing trouble and was not being
sent:
Jun 27 17:28:17 thorin ACPI: PCI interrupt for device 0000:0b:02.0 disabled 
Jun 27 17:28:17 thorin ACPI: PCI interrupt for device 0000:0b:00.0 disabled 
Jun 27 17:28:17 thorin acpi_bus-0201 [60] bus_set_power         : Device is not power manageable
Jun 27 17:28:17 thorin ACPI: PCI interrupt for device 0000:00:1e.2 disabled 
Jun 27 17:28:17 thorin ACPI: PCI interrupt for device 0000:00:1d.7 disabled 
...

On the other hand, after triggering it twice, I can no longer trigger this
bug with a kernel that has more debugging options enabled to get further data.

I will attach lspci -vv output just in case, but if I don't manage to trigger
this bug again, I will have to declare it a heisenbug and close the report.
Comment 6 Henrique de Moraes Holschuh 2006-06-27 13:31:43 UTC
Created attachment 8426
lspci -vv output
Comment 7 Len Brown 2006-07-01 10:02:37 UTC
Does the resume work without netconsole?
Comment 8 Henrique de Moraes Holschuh 2006-07-01 11:14:23 UTC
Yes, it did (but not *perfectly*).  S3 wakeup works, but is quietly hosing
something that ends up causing an ext3 deadlock some unspecified (and wildly
variant) time later.  I had not experienced hangs at S3 resume without netconsole.

Anyway, right now (2.6.17 with stack frames, debug information, spinlock
debugging, and a few other debugging options) I cannot reproduce the problem with
netconsole.  I will try to reproduce the netconsole issue a bit more later.

S3 sleep and resume under 2.6.17 and the new ACPI patches is *definitely* not
working well on this machine, while 2.6.16 (plus a ton of SATA-related patches)
did.  There was this netconsole lockup that I cannot reproduce. Also,
once ALSA would not manage to kick the AC97 codec back into gear (while in the
other 50 or so suspend-resume cycles I have done already it worked perfectly) and
I had to remove and reload the entire ALSA stack to get the codec reattached and
running, etc.

The new ACPI patches run a lot of new DSDT code, and the IBM T43 DSDT is
*nasty*, doing SMIs everywhere and for everything, poking the PCI buses directly
and so on... so it might be IBM's fault.  The kernel is definitely assigning
different IRQs to different devices when using the new ACPI, and even across
different boots of the same kernel (probably because device discovery order when
udev fires coldplug is not always the same).
Comment 9 Henrique de Moraes Holschuh 2006-07-05 12:40:15 UTC
The good news: now I can duplicate the issue anytime I want.

The bad news: the issue is there with stock 2.6.17, AND with
2.6.17+acpi-release-20060608-2.6.17, and I have not figured out WHY I
could not always duplicate it before.

The problem happens during the S3 *suspend* (and not during resume). SysRq+B
sometimes works to recover the machine.

Here are the messages right before the softlock:
Stopping tasks: ================|
pnp: Device 00:0a disabled.
ACPI: PCI interrupt for device 0000:0b:00.0 disabled
acpi_bus-0201 <HANG>

Here are the messages *without* netconsole loaded (no softlock):
Stopping tasks: ================|
pnp: Device 00:0a disabled.
ACPI: PCI interrupt for device 0000:0b:00.0 disabled
acpi_bus-0201 [10] bus_set_power           : Device is not power manageable
ACPI: PCI interrupt for device 0000:00:1e.2 disabled
ACPI: PCI interrupt for device 0000:00:1d.7 disabled
(same for 1d.3, 1d.2, 1d.1 and 1d.0)
PM: Entering mem sleep
 hwsleep-0283 [11] enter_sleep_state:      : Entering sleep state [S3]
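
The comparison between the two captures can also be done mechanically. The following is a minimal sketch, not part of the original report: the function name first_divergence is hypothetical, the line lists are copied from the messages above, and the reporter's "<HANG>" annotation is dropped so that the last entry holds only the text that was actually received.

```python
def first_divergence(hang_log, good_log):
    """Return (index, hang_line, good_line) for the first differing line,
    or None if the logs agree over their common length."""
    for i, (a, b) in enumerate(zip(hang_log, good_log)):
        if a != b:
            return i, a, b
    return None


# Messages seen with netconsole loaded (suspend hangs mid-line).
with_netconsole = [
    "Stopping tasks: ================|",
    "pnp: Device 00:0a disabled.",
    "ACPI: PCI interrupt for device 0000:0b:00.0 disabled",
    "acpi_bus-0201",
]

# Messages seen without netconsole (suspend completes).
without_netconsole = [
    "Stopping tasks: ================|",
    "pnp: Device 00:0a disabled.",
    "ACPI: PCI interrupt for device 0000:0b:00.0 disabled",
    "acpi_bus-0201 [10] bus_set_power           : Device is not power manageable",
    "ACPI: PCI interrupt for device 0000:00:1e.2 disabled",
    "ACPI: PCI interrupt for device 0000:00:1d.7 disabled",
]

index, hang_line, good_line = first_divergence(with_netconsole, without_netconsole)
# The hang is "mid-message" if the good line starts with what was received.
truncated = good_line.startswith(hang_line)
print(index, repr(hang_line), truncated)
```

A truncated final line, as opposed to a missing one, suggests the hang happens while the message is being emitted rather than between messages.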


And since this kernel has all the uncover-code-doing-screwed-up-things options I
could enable, it traps this error during resume:

BUG: sleeping function called from invalid context at include/linux/rwsem.h:43,
in_atomic():0, irqs_disabled():1

... which is AFAIK known bogosity in stock 2.6.17 ACPI.  This bug is also present
in a kernel with the acpi-20060608 patches.

The value between [] in the acpi_bus-0201 line, where the hang happens when
netconsole is loaded, sometimes changes. So does the one in the hwsleep-0283 line.
Comment 10 Henrique de Moraes Holschuh 2006-07-05 13:54:30 UTC
I don't know what is wrong with me today, but the BUG line has nothing to do
with ACPI; the blame lies with cpufreq.
Comment 11 Henrique de Moraes Holschuh 2006-09-14 05:27:08 UTC
The S3 resume issues were traced back to suspend2 silently corrupting kernel
running state.  This does not mean that the tg3 bug isn't there.

After 2.6.18 is out and stable enough (like, at 2.6.18.1 or thereabouts), I will
run an intensive test to try to reproduce the tg3 netconsole hang *without*
suspend2 or any other patches.
Comment 12 Henrique de Moraes Holschuh 2006-09-29 06:47:15 UTC
tg3 netconsole status for 2.6.18:  BUG FIXED.

In 2.6.18, netconsole/tg3 simply shuts down (and loses all console messages until
late in the resume process -- but that's a different issue).  But it doesn't
hang at all anymore, so this bug can be closed as CODE_FIX in 2.6.18.  I will do
just that.
