Bug 85461

Summary: After using alt+sysrq+reisub to restart the machine and booting back up the rt2800pci driver fails to enter state 4 (-5)
Product: Drivers Reporter: Giedrius Statkevičius (giedrius.statkevicius)
Component: network-wireless Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED CODE_FIX    
Severity: normal CC: linville, mikewortin, stf_xl
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.17.0-rc7-next-20140930 Subsystem:
Regression: No Bisected commit-id:
Attachments: The output of `dmesg` of a boot after using alt+sysrq+reisub that shows what is happening when the network card doesn't work (fails to enter state 4 as it says)
enable rt3290 unconditionally
rt3290_enable_disable.patch

Description Giedrius Statkevičius 2014-10-02 17:36:51 UTC
Created attachment 152231 [details]
The output of `dmesg` of a boot after using alt+sysrq+reisub that shows what is happening when the network card doesn't work (fails to enter state 4 as it says)

Sometimes I have to use the alt+sysrq+reisub combination to restart a frozen machine, but after it boots back up (directly after the combination, with no restarts or shutdowns in between) I am not able to use my wireless card.
My dmesg gets littered with:

[   38.195828] ieee80211 phy0: rt2800_wait_wpdma_ready: Error - WPDMA TX/RX busy [0x00000068]
[   39.297104] ieee80211 phy0: rt2800_wait_wpdma_ready: Error - WPDMA TX/RX busy [0x00000068]
[   39.297114] ieee80211 phy0: rt2800pci_set_device_state: Error - Device failed to enter state 4 (-5)
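For anyone triaging similar logs, here is a minimal Python sketch that pulls the function name and message out of these lines and decodes the two numbers in them. Two things in it are assumptions rather than facts taken from this report: that the bracketed value is the WPDMA_GLO_CFG register, and the bit names in `WPDMA_GLO_CFG_FLAGS`, which are recalled from rt2800.h and should be checked against your own kernel tree. The helper names (`parse_line`, `decode_wpdma`, `decode_errno`) are made up for illustration.

```python
import errno
import re

# ASSUMPTION: bit names of WPDMA_GLO_CFG as recalled from rt2800.h.
# Bits 4-5 (0x30) are a burst-size field rather than a flag, so they are skipped.
WPDMA_GLO_CFG_FLAGS = {
    0x01: "ENABLE_TX_DMA",
    0x02: "TX_DMA_BUSY",
    0x04: "ENABLE_RX_DMA",
    0x08: "RX_DMA_BUSY",
    0x40: "TX_WRITEBACK_DONE",
    0x80: "BIG_ENDIAN",
}

# Matches lines such as:
# [   39.297114] ieee80211 phy0: rt2800pci_set_device_state: Error - ...
LINE_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+ieee80211 (?P<phy>\S+): "
    r"(?P<func>\w+): Error - (?P<msg>.*)$"
)


def parse_line(line):
    """Split one dmesg line into time, phy, func and msg; None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = float(record["time"])
    return record


def decode_wpdma(value):
    """List the (assumed) flag names that are set in the register value."""
    return [name for bit, name in WPDMA_GLO_CFG_FLAGS.items() if value & bit]


def decode_errno(code):
    """Map a kernel-style negative error code to its symbolic errno name."""
    return errno.errorcode.get(abs(code), "unknown")


if __name__ == "__main__":
    line = ("[   39.297114] ieee80211 phy0: rt2800pci_set_device_state: "
            "Error - Device failed to enter state 4 (-5)")
    print(parse_line(line)["func"])
    print(decode_wpdma(0x68))
    print(decode_errno(-5))
```

If the assumed bit layout holds, 0x68 has the RX DMA busy bit set and the TX DMA busy bit clear, and -5 is EIO.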

After shutting down the system via `shutdown -h now` and starting up again the card works as expected.
I am attaching the log of dmesg when the issue happens.

What `lspci -v` says about my wireless card:

03:00.0 Network controller: Ralink corp. RT3290 Wireless 802.11n 1T/1R PCIe
	Subsystem: Hewlett-Packard Company Ralink RT3290LE 802.11bgn 1x1 Wi-Fi and Bluetooth 4.0 Combo Adapter
	Flags: bus master, fast devsel, latency 0, IRQ 19
	Memory at d0610000 (32-bit, non-prefetchable) [size=64K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-00-d5-1d-9d-31-17-a4
	Kernel driver in use: rt2800pci
	Kernel modules: rt2800pci
Comment 1 Giedrius Statkevičius 2014-10-02 20:42:17 UTC
Sorry, I forgot to add that this also happens with the vanilla Arch Linux 3.16.3-1 kernel. Only shutting down and starting up again seems to fix it; `sudo reboot` does not.
Comment 2 Giedrius Statkevičius 2014-12-15 13:23:25 UTC
Using connman this happens after a simple restart. NetworkManager somehow only manages to trigger this bug with sysrq+reisub.
Comment 3 Mike Wortin 2017-01-14 19:49:58 UTC
(In reply to Giedrius Statkevičius from comment #2)
> Using connman this happens after a simple restart. NetworkManager somehow
> only manages to trigger this bug with sysrq+reisub.

On my machine this happens after any restart. I'm using NetworkManager, but the bug also reproduces when neither connman, wicd, NetworkManager nor any other network manager is installed.
It also reproduces on these kernels: 4.8.0-2 (Debian), 4.8.13-Arch, 4.8.16 (Fedora), 4.10.0rc3
Comment 4 Stanislaw Gruszka 2017-01-15 17:00:19 UTC
(In reply to Mike Wortin from comment #3)
> This reproduces in this kernels too: 4.8.0-2 (debian) 4.8.13-Arch, 4.8.16
> (fedora), 4.10.0rc3

Did it work on previous versions? If so, what is the latest working kernel version?
Comment 5 Mike Wortin 2017-01-15 19:00:48 UTC
(In reply to Stanislaw Gruszka from comment #4)
> (In reply to Mike Wortin from comment #3)
> > This reproduces in this kernels too: 4.8.0-2 (debian) 4.8.13-Arch, 4.8.16
> > (fedora), 4.10.0rc3
> 
> Does it work before on previous versions ? If so, what is latest working
> kernel version?

Yes, it works on previous versions.
As far as I remember, the latest kernel version that does not have this bug is 3.16 (Debian stable).
Comment 6 Stanislaw Gruszka 2017-01-16 08:15:27 UTC
These are the rt2x00 changes between 3.16 and 4.8:

$ git log v3.16..v4.8 --no-merges --oneline -- drivers/net/wireless/ralink/
2557654 rt2800lib: enable MFP if hw crypt is disabled
57fbcce cfg80211: remove enum ieee80211_band
8b4c000 rt2x00usb: Use usb anchor to manage URB
f36f299 rt2x00: add new rt2800usb device Buffalo WLI-UC-G450
9cc3fdc rt2x00: unterminated strlen of user data
5b45171 net: wireless: rt2x00: Space Required
b2cc2dd net: wireless: rt2x00: Space issue
ac2b335 net: wireless: rt2x00: Fixed Spacing issues
262c741 rt2x00: fix monitor mode regression
50ea05e mac80211: pass block ack session timeout to to driver
7683fe0 rt2x00pci: Disable memory-write-invalidate when the driver exits
952348a rt2x00: type bug in _rt2500usb_register_read()
33aca94 rt2x00: move under ralink vendor directory

$ git log v3.16..v4.8 --no-merges --oneline -- drivers/net/wireless/rt2x00/
33aca94 rt2x00: move under ralink vendor directory
4a733ef mac80211: remove PM-QoS listener
910367e rt2800usb: add usb ID 1b75:3070 for Airlive WT-2000USB
e3abc8f mac80211: allow to transmit A-MSDU within A-MPDU
11ab35e rt2x00: use DECLARE_EWMA
f10746f rt2x00: adjust EEPROM_SIZE for rt2500usb
ed8e0ed rt2800: fix assigning same WCID for different stations
30686bf mac80211: convert HW flags to unsigned long bitmap
9352c19 mac80211: extend get_tkip_seq to all keys
df14046 mac80211: remove support for IFF_PROMISC
01fbd4e rt2800usb: check Autorun mode on FW load only once
ea345c1 rt2x00: add new rt2800usb device DWA 130
7daa54b rt2x00usb: drop rt2x00usb_disable_radio() from rt2800usb_disable_radio()
92d5e24 rt2x00usb: check USB's request error code in rt2800usb_autorun_detect()
e4fcfaf rt2x00usb: initialize the read value in case of failure
4ed20be cfg80211: remove "channel" from survey names
6341e62 kconfig: use bool instead of boolean for type definition attributes
b9d305c rt2x00: use helper to check capability/requirement
dc50a52 Revert "rt2x00: Endless loop on hub port power down"
14bc8bd rt2x00: change REGISTER_TIMEOUT
7a5a735 rt2x00: change REGISTER_BUSY_COUNT for USB
ad92bc9 rt2x00: use timeout in rt2x00usb_vendor_request
87dd2d7 rt2800: calculate tx power temperature compensation on selected chips
a344d67 mac80211: allow drivers to support NL80211_SCAN_FLAG_RANDOM_ADDR
cfd9167 rt2x00: do not align payload on modern H/W
664d6a7 wireless: rt2x00: add new rt2800usb device
e9dc51a rt2x00: tune multi-registers I/O timeout
f853e9b net: wireless: rt2x00: drop owner assignment from platform_drivers
01f7fee rt2800: correct BBP1_TX_POWER_CTRL mask
ac0372a rt2x00: support Ralink 5362.
9baa3c3 PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use
6a06e55 wireless: rt2x00: add new rt2800usb devices
d4150246 drivers/net/wireless/rt2x00/rt2x00dev.c: remove null test before kfree
df6e633 rt2x00: Use dma_zalloc_coherent
19dcb76 rt2x00: do not initialize BCN_OFFSET registers
ddb4055 rt2x00: change order when stop beaconing
88ff2f4 rt2x00: change default MAC_BSSID_DW1_BSS_BCN_NUM
ba08910 rt2x00: change beaconing setup on RT2800
283dafa rt2x00: change beaconing locking
57eaeb6 net: wireless: rt2x00: rt2x00mac.c: Cleaning up uninitialized variables

I do not see any commit among them that could possibly cause the problem. Perhaps the issue was caused by a change in the PCI subsystem, not in the rt2x00 driver.
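A quick way to double-check a `--oneline` listing like the ones above for bus-related changes is to filter the subjects by keyword. The sketch below is only an illustration: `filter_commits` is a hypothetical helper, and `SAMPLE` is a four-line excerpt of the listing, not the full output.

```python
def filter_commits(oneline_log, keywords):
    """Return (hash, subject) pairs whose subject mentions any keyword.

    Matching is case-insensitive; blank lines and "$ ..." prompt lines
    copied along with the output are ignored.
    """
    wanted = [keyword.lower() for keyword in keywords]
    matches = []
    for line in oneline_log.splitlines():
        line = line.strip()
        if not line or line.startswith("$"):
            continue
        commit, _, subject = line.partition(" ")
        if any(keyword in subject.lower() for keyword in wanted):
            matches.append((commit, subject))
    return matches


# Excerpt of the `git log --oneline` output quoted in this comment.
SAMPLE = """\
262c741 rt2x00: fix monitor mode regression
7683fe0 rt2x00pci: Disable memory-write-invalidate when the driver exits
9baa3c3 PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use
df6e633 rt2x00: Use dma_zalloc_coherent
"""

if __name__ == "__main__":
    for commit, subject in filter_commits(SAMPLE, ["pci"]):
        print(commit, subject)
```

Run over the two full listings above, the same "pci" filter would pick out only 7683fe0 and 9baa3c3.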
Comment 7 Stanislaw Gruszka 2017-01-16 08:17:55 UTC
If I understand correctly, the problem does not happen after a power-off/power-on cycle, only after a software reset. So maybe the card is not reset properly by the PCI layer.
Comment 8 Stanislaw Gruszka 2017-01-16 08:22:39 UTC
Created attachment 251941 [details]
enable rt3290 unconditionally

Does the patch make things better?
Comment 9 Mike Wortin 2017-01-25 11:39:39 UTC
I updated the BIOS to the latest version and this bug no longer affects me.
Now my BIOS version is F35 (Insyde H20).
I have a HP 15-r047er laptop.
Comment 10 Stanislaw Gruszka 2017-01-26 09:43:26 UTC
Pity you didn't test the patch before the BIOS update...
Comment 11 Stanislaw Gruszka 2017-01-26 09:46:56 UTC
*** Bug 192581 has been marked as a duplicate of this bug. ***
Comment 12 Stanislaw Gruszka 2017-01-26 10:15:46 UTC
Giedrius, could you test the patch from comment 8?
Comment 13 Giedrius Statkevičius 2017-01-26 11:17:25 UTC
(In reply to Stanislaw Gruszka from comment #12)
> Giedrius, could you test patch from comment 8 ?

I could give it a try this weekend. Due to this bug I actually switched to an Atheros card, but I will put the old card back in just for this.
Comment 14 Stanislaw Gruszka 2017-01-27 14:28:42 UTC
Created attachment 253271 [details]
rt3290_enable_disable.patch

I am not sure the previous patch will be sufficient, so please also test this one, which additionally disables the RT3290 in the PCI remove callback. However, I am not sure whether a reset via sysrq runs the PCI device removal callbacks; if it does not, the RT3290 disable should be done during initialization, before its enabling code.
Comment 15 Giedrius Statkevičius 2017-01-28 17:12:38 UTC
(In reply to Stanislaw Gruszka from comment #8)
> Created attachment 251941 [details]
> enable rt3290 unconditionally
> 
> Does the patch make things better ?

Applied this on next-20170125. With pure next-20170125 the bug is still reproducible on the same hardware, and the same messages are printed as described in the original report. The best way to reproduce it seems to be to get some network activity going and then type alt+sysrq+reisub quickly. With this patch I was unable to reproduce the bug in 10 tries or so.
Comment 16 Giedrius Statkevičius 2017-01-28 17:26:27 UTC
(In reply to Stanislaw Gruszka from comment #14)
> Created attachment 253271 [details]
> rt3290_enable_disable.patch
> 
> Not sure if previous patch will be sufficient, please also test this one,
> which also disable RT3290 PCI remove callback (however I'm not sure if reset
> via sysrq perform PCI devices removal callbacks, if not RT3290 disable
> should be done on initialization, before it's enabling code).

Reproduced the bug on the second or third try with this patch. The patch was applied on next-20170125. The error messages were identical to the ones in the original report.
Comment 17 Giedrius Statkevičius 2017-01-28 17:26:44 UTC
(In reply to Giedrius Statkevičius from comment #16)
> (In reply to Stanislaw Gruszka from comment #14)
> > Created attachment 253271 [details]
> > rt3290_enable_disable.patch
> > 
> > Not sure if previous patch will be sufficient, please also test this one,
> > which also disable RT3290 PCI remove callback (however I'm not sure if
> reset
> > via sysrq perform PCI devices removal callbacks, if not RT3290 disable
> > should be done on initialization, before it's enabling code).
> 
> Reproduced the bug on a second or third try with this patch. The patch was
> applied on next-20170125. Error messages were identical to the ones in the
> original report.

Also, this patch introduces some compilation warnings.
Comment 18 Stanislaw Gruszka 2017-01-29 09:45:00 UTC
(In reply to Giedrius Statkevičius from comment #15)
> alt+sysrq+reisub quickly. With this patch I was unable to reproduce the bug
> in 10 tries or so.

I assume the patch fixes the originally reported bug. I will post it soon.
Comment 19 Stanislaw Gruszka 2017-01-31 10:52:21 UTC
The patch was committed to wireless-drivers-next:

https://git.kernel.org/cgit/linux/kernel/git/kvalo/wireless-drivers-next.git/commit/?id=6715208d0a95ae417203f8e4a7937c1b4c4947f2

I'm closing the bug.

Thanks for reporting and testing, Giedrius!