Bug 9428

Summary: panic in rt2x00 driver when resuming after hibernate
Product: Power Management Reporter: Marcus Better (marcus)
Component: Hibernation/SuspendAssignee: Ivo van Doorn (IvDoorn)
Status: CLOSED CODE_FIX    
Severity: normal CC: andrey+kernel, bunk, linville
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.24-rc3 Subsystem:
Regression: --- Bisected commit-id:
Bug Depends on:    
Bug Blocks: 7216    
Attachments: Photo of the oops, upper part
Photo of the oops, lower part
Halt TX during suspend
Correctly stop radio during suspend
Ignore device suspend state

Description Marcus Better 2007-11-21 07:42:18 UTC
Most recent kernel where this bug did not occur: unknown
Distribution: Debian testing/unstable
Hardware Environment: LG LE50 Express laptop
Software Environment: i386 kernel, X.org, KDE
Problem Description:

When resuming from hibernate, the suspend image is loaded but shortly afterwards I get a kernel panic.

Steps to reproduce:

1. Boot with the rt2500pci module loaded and wireless interface up. (It is not necessary to associate to an access point.)
2. Hibernate using s2disk.
3. Resume.
Comment 1 Marcus Better 2007-11-21 07:43:16 UTC
Created attachment 13670 [details]
Photo of the oops, upper part
Comment 2 Marcus Better 2007-11-21 07:44:18 UTC
Created attachment 13671 [details]
Photo of the oops, lower part
Comment 3 John W. Linville 2007-11-21 13:42:26 UTC
mac80211 doesn't really have suspend/resume support native to it, so this could be a stack problem.  Still, since this is in the transmit path it may be best to start with the driver team to take a look.
Comment 4 Ivo van Doorn 2007-11-21 14:03:07 UTC
Created attachment 13679 [details]
Halt TX during suspend

Could you try the attached patch to see if that works better?
There seemed to be a race condition where it was possible mac80211
would send out frames during the resume handler just before the ring entries would be allocated.

This patch will call mac80211 to stop all TX queues during suspend, this should prevent mac80211 to send out any frames during resume.
Comment 5 Marcus Better 2007-11-22 00:27:58 UTC
> Could you try the attached patch to see if that works better?

I tried suspend to RAM (not hibernate!), and got a similar panic. I cannot 
make a screenshot at the moment.

Anyone know how to get the oops logged somewhere? I have tried netconsole 
(even built-in) but it doesn't seem to work that early after hibernate. Would 
it help to compile the ethernet driver statically?
Comment 6 Ivo van Doorn 2007-11-22 09:36:45 UTC
You mean you still got the panic after applying the patch?
Does it still point to rt2x00pci_write_tx_data?
Comment 7 Marcus Better 2007-12-05 00:08:30 UTC
(In reply to comment #6)
> You mean you still got the panic after applying the patch?

Yes.

> Does it still point to rt2x00pci_write_tx_data?

Yes. I'll try to get a complete screenshot later.
Comment 8 Ivo van Doorn 2007-12-30 04:27:27 UTC
Created attachment 14229 [details]
Correctly stop radio during suspend

Could you revert the previous patch and try again with this new patch?
I have found several issues in the suspend/resume path that might have caused this and the attached patch should resolve those.
Comment 9 Marcus Better 2008-01-01 07:30:29 UTC
(In reply to comment #8)
> Created an attachment (id=14229) [details]
> Correctly stop radio during suspend
> 
> Could you revert the previous patch and try again with this new patch?

First I get this when booting:

WARNING: at net/mac80211/rx.c:1486 __ieee80211_rx()
Pid: 5, comm: events/0 Not tainted 2.6.24-rc6-lg #2
 [<f8a6e9df>] __ieee80211_rx+0xcff/0xe90 [mac80211]
 [<c0240249>] __lock_acquire+0x4f9/0x1080
 [<c023f826>] trace_hardirqs_on+0x76/0x160
 [<c0448b47>] _spin_unlock_irqrestore+0x37/0x60
 [<f8a600dd>] ieee80211_tasklet_handler+0xed/0x100 [mac80211]
 [<f89bc386>] rt2x00pci_rxdone+0x106/0x1b0 [rt2x00pci]
 [<c02243f3>] tasklet_action+0x33/0x80
 [<c0224312>] __do_softirq+0x62/0xc0
 [<c02243b5>] do_softirq+0x45/0x50
 [<c0224656>] irq_exit+0x76/0x80
 [<c0206a54>] do_IRQ+0x44/0x80
 [<c023f6c8>] mark_held_locks+0x38/0x70
 [<c0204d76>] common_interrupt+0x2e/0x34
 [<c03f32bb>] rt_run_flush+0x4b/0xc0
 [<c0425819>] fib_netdev_event+0x79/0x90
 [<c02375ee>] notifier_call_chain+0x3e/0x70
 [<c03e1860>] linkwatch_event+0x0/0x30
 [<c0237667>] raw_notifier_call_chain+0x17/0x20
 [<c03d65dc>] netdev_state_change+0x1c/0x30
 [<c03e1772>] __linkwatch_run_queue+0xd2/0x1c0
 [<c03e187f>] linkwatch_event+0x1f/0x30
 [<c022f4df>] run_workqueue+0x14f/0x1d0
 [<c022f488>] run_workqueue+0xf8/0x1d0
 [<c022ff22>] worker_thread+0x82/0xe0
 [<c0233110>] autoremove_wake_function+0x0/0x40
 [<c022fea0>] worker_thread+0x0/0xe0
 [<c0232e72>] kthread+0x42/0x70
 [<c0232e30>] kthread+0x0/0x70
 [<c0204f1b>] kernel_thread_helper+0x7/0x1c
 =======================

This is with 2.6.24-rc6, current git.

Then I can suspend to RAM, but it doesn't resume (just gives a blank screen and nothing on the netconsole). However the same thing happens even with init=/bin/sh (and no rt2x00 driver loaded) so there must be some other problem.

Suspend to disk freezes at "Snapshotting system" with these console messages:
swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
swsusp: Basic memory bitmaps created

If I unload the rt2500pci module and then do s2ram, I get this:

BUG: unable to handle kernel paging request at virtual address f89df667
printing eip: c030256e *pde = 37c1f067 *pte = 00000000
Oops: 0000 [#1] PREEMPT
Modules linked in: ipv6 nls_iso8859_1 ntfs kqemu snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device xt_NFQUEUE x_tables firewire_sbp2 loop mousedev pcmcia firmware_class arc4 ecb blkcipher rt2x00pci rt2x00lib rfkill input_polldev snd_hda_intel mac80211 sdhci cfg80211 rtc snd_pcm pcspkr psmouse eeprom_93cx6 mmc_core serio_raw yenta_socket rsrc_nonstatic pcmcia_core evdev snd_timer snd soundcore snd_page_alloc i2c_piix4 video output battery ac power_supply button ext3 jbd mbcache dm_mirror dm_snapshot dm_mod firewire_ohci firewire_core crc_itu_t ide_cd cdrom ehci_hcd ohci_hcd usbcore thermal processor fan

Pid: 2881, comm: s2ram Not tainted (2.6.24-rc6-lg #2)
EIP: 0060:[<c030256e>] EFLAGS: 00010092 CPU: 0
EIP is at strcmp+0xe/0x30
EAX: f89df667 EBX: c062ae98 ECX: 00000000 EDX: c04e4b55
ESI: f89df667 EDI: c04e4b55 EBP: f7dc2aa0 ESP: f70f5d8c
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process s2ram (pid: 2881, ti=f70f4000 task=f7dc2aa0 task.ti=f70f4000)
Stack: c062de68 00000003 c023d669 0002cec8 c062de68 00000000 c0240cd6 c0617c98
       00000002 f7dc2aa0 f7dc2ac8 00000000 c0617e30 00000003 f7dc2aa0 000002e9
       00000002 00000000 f79435b4 f7dc3018 00000001 00000002 c0667e68 c0240249
Call Trace:
 [<c023d669>] count_matching_names+0x69/0xb0
 [<c0240cd6>] __lock_acquire+0xf86/0x1080
 [<c0240249>] __lock_acquire+0x4f9/0x1080
 [<c023f6c8>] mark_held_locks+0x38/0x70
 [<c0240e2f>] lock_acquire+0x5f/0x80
 [<c0293a31>] notify_change+0x251/0x370
 [<c04476b9>] down_write+0x39/0x60
 [<c0293a31>] notify_change+0x251/0x370
 [<c0293a31>] notify_change+0x251/0x370
 [<c027d1be>] do_truncate+0x6e/0xa0
 [<c0448a75>] _spin_unlock+0x25/0x40
 [<c0287340>] may_open+0x1d0/0x250
 [<c028969f>] open_namei+0x6f/0x630
 [<c027c95e>] do_filp_open+0x2e/0x60
 [<c0448a75>] _spin_unlock+0x25/0x40
 [<c027c62b>] get_unused_fd_flags+0xbb/0xe0
 [<c027c9dc>] do_sys_open+0x4c/0xe0
 [<c027caac>] sys_open+0x1c/0x20
 [<c02042de>] sysenter_past_esp+0x5f/0xa5
 =======================
Code: c0 aa 89 d8 8b 74 24 10 8b 5c 24 0c 8b 7c 24 14 8b 6c 24 18 83 c4 1c c3 90 8d 74 26 00 83 ec 08 89 34 24 89 c6 89 7c 24 04 89 d7 <ac> ae 75 08 84 c0 75 f8 31 c0 eb 04 19 c0 0c 01 8b 34 24 8b 7c
EIP: [<c030256e>] strcmp+0xe/0x30 SS:ESP 0068:f70f5d8c
---[ end trace 85d36914d35df281 ]---


The same experiment with 2.6.24-rc4 and your previous patch gave a oops immediately on rmmod, also in count_matching_names but with a somewhat different trace (not captured).
Comment 10 Ivo van Doorn 2008-01-01 07:43:32 UTC
That __ieee80211_rx() warning is a known issue, I have a patch for that sitting in rt2x00.git which I will send upstream within a few days.

As for the other panic, not sure where that comes from, rt2x00  isn't in the backtrace, and neither does not create any sysfs entries itself.
Comment 11 Marcus Better 2008-01-01 07:56:59 UTC
bugme-daemon@bugzilla.kernel.org skrev:
> As for the other panic, not sure where that comes from, rt2x00  isn't in the
> backtrace, and neither does not create any sysfs entries itself.

So, should I just file a separate bug report for this?
Comment 12 Marcus Better 2008-01-01 08:46:12 UTC
> As for the other panic, not sure where that comes from, rt2x00  isn't in the
> backtrace, and neither does not create any sysfs entries itself.

There's still something wrong with rt2x00. If I boot with init=/bin/sh, 
I can hibernate with "echo disk > /sys/power/state". But after loading 
rt2500pci it aborts the hibernation:

swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
swsusp: Basic memory bitmaps created
Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Shrinking memory... done (0 pages freed)
Freed 0 kbytes in 0.09 seconds (0.00 MB/s)
Suspending console(s)
phy0 -> rt2x00lib_suspend: Notice - Going to sleep.
phy0 -> rt2500pci_set_state: Notice - Device failed to enter state 1, 
current device state: bbp 3 and rf 3.
pci_device_suspend(): rt2x00pci_suspend+0x0/0x70 [rt2x00pci]() returns -16
suspend_device(): pci_device_suspend+0x0/0x60() returns -16
Could not suspend device 0000:08:02.0: error -16
hda: UDMA/100 mode selected
hdc: UDMA/33 mode selected
Restarting tasks ... done.
swsusp: Basic memory bitmaps freed
swsusp: Marking nosave pages: 000000000009f000 - 0000000000100000
swsusp: Basic memory bitmaps created
Syncing filesystems ... done.
Freezing user space processes ... (elapsed 0.00 seconds) done.
Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
Shrinking memory... done (0 pages freed)
Freed 0 kbytes in 0.09 seconds (0.00 MB/s)
Suspending console(s)
phy0 -> rt2x00lib_suspend: Notice - Going to sleep.
phy0 -> rt2500pci_set_state: Notice - Device failed to enter state 1, 
current device state: bbp 3 and rf 3.
pci_device_suspend(): rt2x00pci_suspend+0x0/0x70 [rt2x00pci]() returns -16
suspend_device(): pci_device_suspend+0x0/0x60() returns -16
Could not suspend device 0000:08:02.0: error -16
hda: UDMA/100 mode selected
hdc: UDMA/33 mode selected
Restarting tasks ... done.
swsusp: Basic memory bitmaps freed
Comment 13 Ivo van Doorn 2008-03-26 11:54:00 UTC
Created attachment 15451 [details]
Ignore device suspend state

I am really sorry, I had this patches sitting in my test-patch folder, but I completely forgot to attach it to this bug.. :(

Some hardware never seems to accept the "goto sleep" command, since the legacy drivers don't have suspend and resume handlers the entire code for it was basically a educated guess (based on the "enable radio" code).
This patch will only print a warning when the "goto sleep" command fails, and just continues as usual. Perhaps that means the device will not reach a sleep state and consumes more power then it should, but it is equally possible it simply needs some seconds longer to sleep. Anyway, by making the command non-fatal it will not block the rest of the suspend procedure.
Comment 14 Marcus Better 2008-03-27 03:50:10 UTC
> ------- Comment #13 from IvDoorn@gmail.com  2008-03-26 11:54 -------
> Ignore device suspend state

Now at least I don't get a panic, it resumes from hibernate but hangs 
right after entering userspace. There's also a weird speaker beep at 
that point, no idea why. Haven't checked what causes it.

(And suspend is completely borked on this machine, but it doesn't seem 
to be the rt2x00 driver.)
Comment 15 Ivo van Doorn 2008-03-27 08:52:32 UTC
Ok, I'll send the last patch upstream for 2.6.25 today.

You are using latest 2.6.25-rc snapshot for this test, right?
Because there were some known resume issues when RFkill was enabled, and those issues have been fixed in the latest -rc7 release.
Comment 16 Marcus Better 2008-03-27 08:55:36 UTC
> ------- Comment #15 from IvDoorn@gmail.com  2008-03-27 08:52 -------
> You are using latest 2.6.25-rc snapshot for this test, right?

Not quite, it was post-rc6 though. I'll try -rc7 later...
Comment 17 Ivo van Doorn 2008-03-27 09:08:32 UTC
For reference, you need patch:
rt2x00: Add suspend/resume handlers to rt2x00rfkill
Comment 18 Marcus Better 2008-03-27 09:12:03 UTC
> ------- Comment #17 from IvDoorn@gmail.com  2008-03-27 09:08 -------
> For reference, you need patch:
> rt2x00: Add suspend/resume handlers to rt2x00rfkill

Ok, so I've got it already.

Marcus
Comment 19 Ivo van Doorn 2008-03-27 09:20:07 UTC
Ok, if you could try to obtain a panic trace or something like that about the remaining issue during resume, then we can see what is causing that last problem. :)
Comment 20 Shaohua 2008-12-23 22:37:37 UTC
Pretty old bug, should we close it, or can you try if latest kernel fixes the issue?
Comment 21 Marcus Better 2008-12-25 04:55:21 UTC
(In reply to comment #20)
> Pretty old bug, should we close it, or can you try if latest kernel fixes
> the issue?

I don't have the hardware anymore, so cannot test it.
Comment 22 Ivo van Doorn 2009-01-03 11:20:59 UTC
Ok, I'm going to close this one, so far no other users have reported this issue either, so I guess it is fixed.