Bug 13390

Summary: rt2860: kernel freezes completely when turning off WiFi
Product: Drivers
Reporter: Márton Németh (nm127)
Component: network-wireless
Assignee: Greg Kroah-Hartman (greg)
Status: RESOLVED INVALID
Severity: normal
CC: bugspam, dam, dmoerner, greg, linville, meijer.o, neil.stewart, prahal, saturley709739, wrar
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.30-rc7
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
dmesg after turning on WiFi
kernel .config for 2.6.30-rc7
photo of the screen with crash messages
another dmesg without freeze but with "possible recursive locking"
photo with a longer crash log
Panic
Work around the panic via changes in eeepc-laptop.
Work around the panic via changes in eeepc-laptop.
Work around the panic via changes in eeepc-laptop (2.6.31-rc6).
Work around the panic via changes in eeepc-laptop (2.6.32-rc3)
Hopefully a proper fix…
Backport to 2.6.30 (compile-tested only)
(partial revert) Fix for eeepc-laptop (needed by rt2800pci at least)

Description Márton Németh 2009-05-26 19:43:56 UTC
When the wireless device is turned off with Fn+F2 on an EeePC 901, the kernel usually freezes completely.

Steps to reproduce:
1. Boot EeePC 901 with WiFi off
2. On the text console press Fn+F2: the blue WiFi feedback LED turns on and some messages appear in dmesg (see attached).
3. After about 30 seconds press Fn+F2 again: the blue WiFi LED turns off and a crash trace appears on the screen. (Unfortunately this is not saved to the system log, and the EeePC 901 does not have a serial port on which to set up a serial console.)
Comment 1 Márton Németh 2009-05-26 19:44:50 UTC
Created attachment 21564 [details]
dmesg after turning on WiFi
Comment 2 Márton Németh 2009-05-26 19:46:02 UTC
Created attachment 21565 [details]
kernel .config for 2.6.30-rc7
Comment 3 Márton Németh 2009-05-26 19:47:10 UTC
Created attachment 21566 [details]
photo of the screen with crash messages
Comment 4 Márton Németh 2009-05-26 21:20:32 UTC
Created attachment 21568 [details]
another dmesg without freeze but with "possible recursive locking"

[  317.646024] [ INFO: possible recursive locking detected ]
[  317.647010] 2.6.30-rc7 #1
[  317.647986] ---------------------------------------------
[  317.648033] swapper/0 is trying to acquire lock:
[  317.648033]  (pTimer){+.-...}, at: [<c0134a71>] del_timer_sync+0x0/0x83
[  317.648033] 
[  317.648033] but task is already holding lock:
[  317.648033]  (pTimer){+.-...}, at: [<c013420f>] run_timer_softirq+0x113/0x20b
[  317.648033] 
[  317.648033] other info that might help us debug this:
[  317.648033] 1 lock held by swapper/0:
[  317.648033]  #0:  (pTimer){+.-...}, at: [<c013420f>] run_timer_softirq+0x113/0x20b
[  317.648033] 
[  317.648033] stack backtrace:
[  317.648033] Pid: 0, comm: swapper Tainted: G         C 2.6.30-rc7 #1
[  317.648033] Call Trace:
[  317.648033]  [<c034a44d>] ? printk+0xf/0x12
[  317.648033]  [<c014cfea>] __lock_acquire+0xbfa/0x128d
[  317.648033]  [<c0128848>] ? scheduler_tick+0x39/0x1b2
[  317.648033]  [<c014d730>] lock_acquire+0xb3/0xd6
[  317.648033]  [<c0134a71>] ? del_timer_sync+0x0/0x83
[  317.648033]  [<c0134aa5>] del_timer_sync+0x34/0x83
[  317.648033]  [<c0134a71>] ? del_timer_sync+0x0/0x83
[  317.648033]  [<f88dee93>] RTMP_OS_Del_Timer+0x10/0x1a [rt2860sta]
[  317.648033]  [<f88bb23e>] RTMPCancelTimer+0x22/0x4a [rt2860sta]
[  317.648033]  [<f88f30ff>] RT28xxPciMlmeRadioOFF+0x7a/0x261 [rt2860sta]
[  317.648033]  [<f88b5cb4>] MlmePeriodicExec+0xd0/0x427 [rt2860sta]
[  317.648033]  [<f88df1b1>] linux_MlmePeriodicExec+0x13/0x2a [rt2860sta]
[  317.648033]  [<c013428e>] run_timer_softirq+0x192/0x20b
[  317.648033]  [<c013420f>] ? run_timer_softirq+0x113/0x20b
[  317.648033]  [<f88df19e>] ? linux_MlmePeriodicExec+0x0/0x2a [rt2860sta]
[  317.648033]  [<c0130aa9>] __do_softirq+0xb1/0x187
[  317.648033]  [<c0130bb5>] do_softirq+0x36/0x5a
[  317.648033]  [<c0130d2f>] irq_exit+0x38/0x6f
[  317.648033]  [<c0104e31>] do_IRQ+0x71/0x87
[  317.648033]  [<c010392e>] common_interrupt+0x2e/0x34
[  317.648033]  [<c014007b>] ? enqueue_hrtimer+0x29/0x68
[  317.648033]  [<f83f928a>] ? acpi_idle_enter_bm+0x267/0x298 [processor]
[  317.648033]  [<c02cc101>] cpuidle_idle_call+0x65/0x98
[  317.648033]  [<c0102400>] cpu_idle+0x4e/0x7e
[  317.648033]  [<c033cc4f>] rest_init+0x67/0x69
[  317.648033]  [<c04c9811>] start_kernel+0x309/0x30e
[  317.648033]  [<c04c906a>] __init_begin+0x6a/0x6f
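
The call chain in this trace runs from a timer handler (linux_MlmePeriodicExec) through RT28xxPciMlmeRadioOFF into del_timer_sync. A synchronous cancel has to wait for a running handler to finish, so calling it from inside that same handler can never complete. Note that the report names a lock class (pTimer), so the log alone does not show whether the timer being cancelled is the one whose handler is running. The sketch below is a userspace Python model of the pattern only; ModelTimer and its methods are hypothetical names, not kernel or driver APIs.

```python
import threading


class ModelTimer:
    """Userspace stand-in for a kernel timer (hypothetical, not a kernel API)."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.running_in = None  # ident of the thread currently inside the handler

    def fire(self):
        """Run the handler and record which thread is executing it."""
        self.running_in = threading.get_ident()
        try:
            self.handler(self)
        finally:
            self.running_in = None

    def del_sync(self):
        """Model of a synchronous cancel.

        A real synchronous cancel would wait for the handler to return.
        If the caller is the handler itself, that wait cannot end, so the
        model raises instead of hanging.
        """
        if self.running_in == threading.get_ident():
            raise RuntimeError(f"{self.name}: synchronous cancel from own handler")
        return True


def periodic_exec(timer):
    # Same shape as the trace: the periodic handler takes the radio-off
    # path, which cancels timers synchronously.
    timer.del_sync()


def check_self_cancel():
    """Fire a timer whose handler cancels itself and report what happens."""
    timer = ModelTimer("periodic", periodic_exec)
    try:
        timer.fire()
    except RuntimeError as exc:
        return str(exc)
    return "no problem detected"
```

Cancelling a different timer from the handler is fine in this model; only the self-cancel case is flagged.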
Comment 5 Márton Németh 2009-05-26 21:36:12 UTC
Created attachment 21570 [details]
photo with a longer crash log 

I managed to set a smaller console font with the command "consolechars -f /usr/share/consolefonts/lat2-08.psf.gz", so more text is visible in this photo.
Comment 6 Darren Salt 2009-06-15 23:28:24 UTC
Created attachment 21928 [details]
Panic

This is what I captured via netconsole.
Comment 7 Darren Salt 2009-06-15 23:32:19 UTC
The panic doesn't happen if I first bring ra0 down (here, via "ifdown ra0").
Comment 8 Márton Németh 2009-06-16 05:31:26 UTC
(In reply to comment #7)
> The panic doesn't happen if I first bring ra0 down (here, via "ifdown ra0").

What hardware are you using?
Comment 9 Darren Salt 2009-06-16 12:28:44 UTC
> What hardware are you using?

EeePC 901.
Comment 10 Daniel Moerner 2009-06-16 18:33:16 UTC
(In reply to comment #7)
> The panic doesn't happen if I first bring ra0 down (here, via "ifdown ra0").

"ifdown --force ra0" was what originally triggered this bug for me. I suspended the EeePC; on resume the WiFi was not able to scan for networks, so I tried "ifup ra0 && ifdown --force ra0 && ifup ra0", which triggered the panic. I was not able to reproduce this just now on my EeePC 901, though.
Comment 11 Darren Salt 2009-07-14 13:28:00 UTC
# echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove

works fine, removing the device and disposing of ra0. It's then safe to switch off the interface via rfkill.
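
For scripting the step above, a minimal Python sketch of the same sysfs write might look like this. The function names and the sysfs_root parameter are assumptions added for illustration (the parameter only exists so the path logic can be tried against a scratch directory); the PCI address is the one from this report and will differ on other machines. Writing to the real attribute needs root.

```python
from pathlib import Path


def pci_remove_path(address, sysfs_root="/sys"):
    """Return the sysfs 'remove' attribute for a PCI device address."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / address / "remove"


def remove_pci_device(address, sysfs_root="/sys"):
    """Equivalent of: echo 1 > /sys/bus/pci/devices/<address>/remove

    Raises FileNotFoundError if the device (or its attribute) is absent,
    for example because it has already been removed.
    """
    path = pci_remove_path(address, sysfs_root)
    if not path.exists():
        raise FileNotFoundError(f"no such PCI device attribute: {path}")
    path.write_text("1\n")
    return path
```

Calling remove_pci_device("0000:01:00.0") before switching the interface off via rfkill reproduces the order described above.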
Comment 12 Neil 2009-07-21 13:35:37 UTC
Hello. I too am having trouble here, on an eee pc 1000HE. The manual fix in Comment 11 ("echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove") also causes a freeze and requires a hard reboot, but only after a suspend-resume cycle. Also, "ifdown ra0" or "modprobe -r rt2860sta" both freeze the system, but again only after a suspend-resume cycle. It seems something in the suspend-resume process is breaking rt2860sta.

My posting to the debian-eeepc-devel list gives more details:
http://lists.alioth.debian.org/pipermail/debian-eeepc-devel/2009-July/002397.html

[As an aside, how does one re-establish a network connection after running "echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove"?]
Comment 13 Darren Salt 2009-08-02 20:34:21 UTC
Created attachment 22579 [details]
Work around the panic via changes in eeepc-laptop.

This patch is intended only as an EeePC-specific workaround for the panic and should no longer be used once rt2800pci becomes sufficiently useable or rt2860sta (presumably!) is fixed.
Comment 14 Darren Salt 2009-08-06 22:56:21 UTC
Created attachment 22629 [details]
Work around the panic via changes in eeepc-laptop.

Bug fix; won't kill wireless when toggling anything else.
Comment 15 Darren Salt 2009-08-14 15:52:57 UTC
Created attachment 22718 [details]
Work around the panic via changes in eeepc-laptop (2.6.31-rc6).

Updated the workaround patch for 2.6.31-rc6.
Comment 16 o. meijer 2009-09-15 18:18:17 UTC
I tested the patch on kernel 2.6.30, hardware EEEpc 1000 (rt2860sta, same bug), and it works fine. But after suspending to disk (using s2disk), the freeze reoccurs, as described in the bug report. However, using shutdown instead of platform in /etc/uswsusp.conf is a workaround for this.
Comment 17 o. meijer 2009-09-15 18:23:08 UTC
correction, EEEpc 1000 should be EEEpc 1000h
Comment 18 Darren Salt 2009-10-06 02:20:21 UTC
Created attachment 23279 [details]
Work around the panic via changes in eeepc-laptop (2.6.32-rc3)
Comment 19 Greg Kroah-Hartman 2009-10-06 19:33:52 UTC
Can you send this patch to the maintainer of this driver through email, with a Signed-off-by: line as described in the file, Documentation/SubmittingPatches?
Comment 20 Darren Salt 2009-10-06 20:50:28 UTC
Done. I'll submit either/both of the older versions to -stable if this one is accepted as is.

I put the patches here because I'm not sure whether it should be worked around in eeepc-laptop or dealt with in rt2860sta (and then there's rt2800pci, which I've not yet been adequately able to test – basically, I want to see if the patch is needed when that driver's in use).
Comment 21 Darren Salt 2009-10-13 01:40:23 UTC
Created attachment 23368 [details]
Hopefully a proper fix…

Hah. That easy, in the end. Test away :-)

(I'll send it with the proper Signed-Off-By etc. this evening.)
Comment 22 o. meijer 2009-10-13 22:34:46 UTC
I tested rt2860sta with the patch from comment 21. It fixes the bug. Thanks a lot.
Comment 23 Darren Salt 2009-10-14 01:40:04 UTC
Created attachment 23401 [details]
Backport to 2.6.30 (compile-tested only)

Here's my backport of the patch to 2.6.30.

GregKH: if you want this for -stable (should there be another 2.6.30 release), let me know (or you can use the description & S-O-B from the patch for 2.6.31 and .32-rc).
Comment 24 Andrey Rahmatullin 2009-10-16 09:14:14 UTC
The fix from mainline works here.
Comment 25 Alban Browaeys 2009-11-23 15:26:59 UTC
Created attachment 23890 [details]
(partial revert) Fix for eeepc-laptop (needed by rt2800pci at least)

rfkill_switch_all calls the eeepc-laptop rfkill handler, which calls rfkill_unregister (because eeepc-laptop removes the device).
As both rfkill_switch_all and rfkill_unregister take the global rfkill mutex, this cannot work.
Hence the return of the scheduled work for the eeepc hotplug.
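
The locking problem described here can be shown with a small userspace model. The Python sketch below is not kernel code: the function names mirror the ones mentioned in this comment, but their bodies, the log list and the queue are assumptions made up for illustration. A non-blocking acquire is used so the model reports the would-be deadlock instead of hanging.

```python
import queue
import threading

rfkill_global_mutex = threading.Lock()  # non-reentrant, like the global rfkill mutex
hotplug_work = queue.Queue()            # stand-in for a work queue
log = []


def rfkill_unregister():
    """Needs the global mutex; reports instead of blocking if it is held."""
    if not rfkill_global_mutex.acquire(blocking=False):
        log.append("unregister: would deadlock")
        return False
    try:
        log.append("unregister: ok")
        return True
    finally:
        rfkill_global_mutex.release()


def hotplug_remove_device():
    # Removing the device ends with the driver unregistering its rfkill.
    return rfkill_unregister()


def set_block_direct():
    """Handler that removes the device synchronously."""
    return hotplug_remove_device()


def set_block_deferred():
    """Handler that only queues the removal for later."""
    hotplug_work.put(hotplug_remove_device)
    return True


def rfkill_switch_all(set_block):
    """Calls the handler while holding the global mutex."""
    with rfkill_global_mutex:
        return set_block()


def run_pending_work():
    """Run queued work after the mutex has been released."""
    results = []
    while not hotplug_work.empty():
        results.append(hotplug_work.get()())
    return results
```

With set_block_direct the unregister step finds the mutex already held by its own caller; with set_block_deferred the removal runs from run_pending_work once rfkill_switch_all has returned, which is the idea behind scheduling the hotplug as work.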
Comment 26 Alban Browaeys 2009-11-23 15:28:59 UTC
The trace with rt2800pci before the "(partial revert) Fix for eeepc-laptop (needed by rt2800pci at least)" patch was:
[ 2204.613861] Dumping ftrace buffer:
[ 2204.613861]    (ftrace buffer empty)
[ 2218.320610] SysRq : Restore framebuffer console
[ 2228.303896] SysRq : Dump ftrace buffer
[ 2243.235978] SysRq : Show Blocked State
[ 2243.239948]   task                PC stack   pid father
[ 2243.239948] events/0      D c0140031     0     9      2 0x00000000
[ 2243.239948]  c2007e40 00000046 c2002a40 c0140031 c0118874 c022c95d ec803d50 c2007e40
[ 2243.239948]  f6bec8c0 f68f8540 c05793e4 f704e490 f704e620 00000000 f8d46464 00000001
[ 2243.239948]  f704e490 f704e620 ef1d007b 000000d8 00000000 ffffff10 c0126a38 f8d46458
[ 2243.239948] Call Trace:
[ 2243.239948]  [<c0140031>] ? irq_exit+0x31/0x60
[ 2243.239948]  [<c0118874>] ? smp_apic_timer_interrupt+0x54/0x80
[ 2243.239948]  [<c022c95d>] ? sysfs_addrm_finish+0x1ad/0x1f0
[ 2243.239948]  [<c0126a38>] ? mutex_spin_on_owner+0x88/0xb0
[ 2243.239948]  [<c03b2a58>] ? __mutex_lock_slowpath+0xb8/0x110
[ 2243.239948]  [<c03b2ded>] ? mutex_lock+0x1d/0x40
[ 2243.239948]  [<f8d44f6b>] ? rfkill_unregister+0x4b/0xa0 [rfkill]
[ 2243.239948]  [<f8a225d3>] ? wiphy_unregister+0x23/0x180 [cfg80211]
[ 2243.239948]  [<f8b7c0dd>] ? ieee80211_unregister_hw+0xad/0xd0 [mac80211]
[ 2243.239948]  [<f86195b3>] ? rt2x00leds_unregister_led+0x23/0x40 [rt2x00lib]
[ 2243.239948]  [<f8614950>] ? rt2x00lib_remove_dev+0xa0/0xb0 [rt2x00lib]
[ 2243.239948]  [<f86381fa>] ? rt2x00pci_remove+0x2a/0x70 [rt2x00pci]
[ 2243.239948]  [<c02937e6>] ? pci_device_remove+0x16/0x40
[ 2243.239948]  [<c02f1eae>] ? __device_release_driver+0x4e/0xa0
[ 2243.239948]  [<c02f1fad>] ? device_release_driver+0x1d/0x30
[ 2243.239948]  [<c02f1543>] ? bus_remove_device+0x73/0xa0
[ 2243.239948]  [<c02efa0b>] ? device_del+0xfb/0x180
[ 2243.239948]  [<c02efa98>] ? device_unregister+0x8/0x10
[ 2243.239948]  [<c028e77e>] ? pci_stop_bus_device+0x4e/0x60
[ 2243.239948]  [<c028e80a>] ? pci_remove_bus_device+0xa/0x90
[ 2243.239948]  [<f8dc5980>] ? eeepc_rfkill_hotplug+0x60/0x140 [eeepc_laptop]
[ 2243.239948]  [<f8dc5b5a>] ? eeepc_rfkill_set+0x2a/0x50 [eeepc_laptop]
[ 2243.239948]  [<f8d444fd>] ? rfkill_set_block+0x5d/0xc0 [rfkill]
[ 2243.239948]  [<f8d44637>] ? __rfkill_switch_all+0x57/0x70 [rfkill]
[ 2243.239948]  [<f8d446fa>] ? rfkill_switch_all+0x5a/0x60 [rfkill]
[ 2243.239948]  [<f8d4562f>] ? rfkill_op_handler+0xbf/0x180 [rfkill]
[ 2243.239948]  [<c02e0f8f>] ? console_callback+0xbf/0x120
[ 2243.239948]  [<c014f86b>] ? worker_thread+0x13b/0x200
[ 2243.239948]  [<f8d45570>] ? rfkill_op_handler+0x0/0x180 [rfkill]
[ 2243.239948]  [<c01534a0>] ? autoremove_wake_function+0x0/0x50
[ 2243.239948]  [<c014f730>] ? worker_thread+0x0/0x200
[ 2243.239948]  [<c0153184>] ? kthread+0x74/0x80
[ 2243.239948]  [<c0153110>] ? kthread+0x0/0x80
[ 2243.239948]  [<c0103a57>] ? kernel_thread_helper+0x7/0x10
Comment 27 Alban Browaeys 2009-11-24 18:50:44 UTC
By the way, even in my version of the patch I left the mutex in. I had another locking issue (cfg80211 calls flush_work on the global queue in wiphy_unregister while we are already inside a work item on the global queue, i.e. eeepc_hotplug_work; this one should be fixed soon in the wireless-testing tree) and removed the mutexes. Now the cfg80211 issue is fixed, but I wonder whether this ehotk mutex is a problem, or of any use at all, in a workqueue implementation. The mutexes were not there in the previous in-tree workqueue implementation either.
Comment 28 Darren Salt 2009-11-24 23:08:01 UTC
(In reply to comment #25)
> Created an attachment (id=23890) [details]
> (partial revert) Fix for eeepc-laptop (needed by rt2800pci at least)

I don't know what that patch is a partial reversion of (I'm not looking right now), but it's evident from the patch content that it isn't reverting what you claim that it's reverting...
Comment 29 Alban Browaeys 2009-11-25 14:40:32 UTC
This is a revert of the mutex implementation of eeepc-laptop:
http://git.kernel.org/?p=linux/kernel/git/linville/wireless-testing.git;a=commitdiff;h=dcf443b5813074031a45b05ad9c57da98bcae329

Sorry if my explanation was confusing. See the diff linked above for the switch from work items to mutexes in eeepc-laptop. I called it a partial revert because I left the mutexes in my version of the patch while going back to scheduled work items.
I have seen hints that mutexes in work items are bad, so a complete revert that also removes the mutexes looks more and more like the right thing to do.

Were you looking at attachment id 23890?
Thanks for looking into it.
Comment 30 Greg Kroah-Hartman 2012-02-22 21:07:08 UTC
All USB bugs should be sent to the linux-usb@vger.kernel.org mailing 
list, and not entered into bugzilla.  Please bring this issue up there,
if it is still a problem in the latest kernel release.