During a channel change (whether associated to an AP or not) AGC calibration times out and causes the channel set to be considered a failure: kernel: ath: phy1: timeout (100000 us) on reg 0x9860: 0x00048d21 & 0x00000001 != 0x00000000 kernel: ath: phy1: offset calibration failed to complete in 1ms; noisy environment? kernel: ath: phy1: Unable to reset channel, reset status -5 kernel: ath: phy1: Unable to set channel (Note that the "1ms" message is incorrect, it's actually based off AH_WAIT_TIMEOUT. This condition eventually happens even when the timeout is set to several seconds.) Once this happens, the device is unable to associate to an AP. In this state, the log will have entries that look like this: kernel: wlan0: authenticate with 00:21:a0:25:f9:10 kernel: wlan0: Allocated STA 00:21:a0:25:f9:10 kernel: wlan0: capabilities/regulatory prevented using AP HT/VHT configuration, downgraded kernel: wlan0: Inserted STA 00:21:a0:25:f9:10 kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 1/3) kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 2/3) kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 3/3) kernel: wlan0: authentication with 00:21:a0:25:f9:10 timed out kernel: wlan0: Removed STA 00:21:a0:25:f9:10 kernel: wlan0: Destroyed STA 00:21:a0:25:f9:10 With debugging enabled, you'll see the authentication messages being queued, but never sent. After the time out, you'll get a TX complete message for all 3 messages, but this is due to the TX queue being flushed before a reset as wpa_supplicant goes back to scan for another AP. One interesting point in the code is that after the channel change fails, ath_complete_reset is skipped. This results in ath_prepare_reset being called many more times than ath_complete_reset. I don't know enough about the internals to know if the work that these two functions do should be balanced. Hardware Info: 17:00.0 Network controller: Atheros Communications Inc. AR9160 Wireless Network Adapter [AR9001 802.11(a)bgn] (rev 01) Subsystem: Ubiquiti Networks, Inc. SR71-A 802.11abgn Wireless Mini PCI Adapter Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 168, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 18 Region 0: Memory at fd5f0000 (32-bit, non-prefetchable) [size=64K] Capabilities: [44] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=100mA PME(D0+,D1-,D2-,D3hot+,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Kernel driver in use: ath9k Kernel modules: ath9k
Created attachment 96211 [details] Patch to disable fastcc and always cold reset After discussing with Adrian Chadd, I've been testing with the attached patch to eliminate potential issues with Fast Channel Change and warm resets. The issue still occurs with this patch.
I should note that removing/replacing the ath9k module fixes the issue
Created attachment 96301 [details] xmit debug after auth timeout /sys/kernel/debug/ieee80211/phy0/ath9k/xmit after the authentication has timed out, but before the chip is reset to scan
Created attachment 96311 [details] xmit debug after reset /sys/kernel/debug/ieee80211/phy0/ath9k/xmit after the authentication has timed out and after the chip is reset to scan
Think I found it: In ath_reset_internal, before we reset, we call ath_prepare_reset, which calls ath9k_hw_disable_interrupts. If the channel change fails, we never call ath_complete_reset, which calls ath9k_hw_enable_interrupts. ath9k_hw_{enable|disable}_interrupts calls definitely needs to be balanced. Looking at an old log, IER never gets re-enabled after the channel set failure. Every time it tries to enable it, you see "Do not enable IER ref count X" in the log, I've seen this as high as -97 for units that have been on for several weeks. Any suggestions on the proper cleanup? Could we just call ath_complete_reset anyway? Some sort of special handling?
Created attachment 96911 [details] Potential Fix I've successfully tested this patch. This does not fix the root cause of whatever is causing the calibration to fail, but it does clean up if a failure occurs. I also queue a reset since obviously something went wrong and that's our best effort to fix it.
Well, the patch has been merged, but I am not sure if queueing a reset in case a reset fails is correct. We might end up doing endless resets.... Are you still seeing the calibration failure ? Is PowerSave enabled ?
(In reply to comment #7) > Well, the patch has been merged, but I am not sure if queueing a reset in > case > a reset fails is correct. We might end up doing endless resets.... > > Are you still seeing the calibration failure ? Is PowerSave enabled ? Yes, I still see it and have replicated it with both PS on and off.
Hi guys, I also meet the same issue. Platform: Atheros QCA4531 OS: Linux 4.4.60 Error Logs: [ 3521.992941] wlan0: authenticate with 00:0f:c9:55:9a:08 [ 3521.998410] wlan0: Allocated STA 00:0f:c9:55:9a:08 [ 3522.010067] wlan0: Inserted STA 00:0f:c9:55:9a:08 [ 3522.010099] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3) [ 3522.240495] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3) [ 3522.414592] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3) [ 3522.634321] wlan0: authentication with 00:0f:c9:55:9a:08 timed out [ 3522.640752] wlan0: Removed STA 00:0f:c9:55:9a:08 [ 3522.641115] wlan0: Destroyed STA 00:0f:c9:55:9a:08 [ 3643.002766] wlan0: authenticate with 00:0f:c9:55:9a:08 [ 3643.008234] wlan0: Allocated STA 00:0f:c9:55:9a:08 [ 3643.019898] wlan0: Inserted STA 00:0f:c9:55:9a:08 [ 3643.019929] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3) [ 3643.129988] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3) [ 3643.231366] wla366] wla auth to 00:0f:c9:55:9a:08 (try 3/3) [ 3643.337126] wlan0: authentication with 00:0f:c9:55:9a:08 timed out [ 3643.343607] wlan0: Removed STA 00:0f:c9:55:9a:08 [ 3643.343974] wlan0: Destroyed STA 00:0f:c9:55:9a:08 [ 3764.714292] wlan0: authenticate with 00:0f:c9:55:9a:08 [ 3764.719688] wlan0: Allocated STA 00:0f:c9:55:9a:08 [ 3764.731425] wlan0: Inserted STA 00:0f:c9:55:9a:08 [ 3764.731457] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3) [ 3764.861011] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3) [ 3764.962540] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3) [ 3765.058784] wlan0: authentication with 00:0f:c9:55:9a:08 timed out [ 3765.065253] wlan0: Removed STA 00:0f:c9:55:9a:08 [ 3765.065610] wlan0: Destroyed STA 00:0f:c9:55:9a:08 I just want to know whether this bug is still there, or solved? Thanks very much.
(In reply to Tom Liu from comment #9) > Hi guys, > > I also meet the same issue. > Platform: Atheros QCA4531 > OS: Linux 4.4.60 > Error Logs: > [ 3521.992941] wlan0: authenticate with 00:0f:c9:55:9a:08 > [ 3521.998410] wlan0: Allocated STA 00:0f:c9:55:9a:08 > [ 3522.010067] wlan0: Inserted STA 00:0f:c9:55:9a:08 > [ 3522.010099] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3) > [ 3522.240495] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3) > [ 3522.414592] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3) > [ 3522.634321] wlan0: authentication with 00:0f:c9:55:9a:08 timed out > [ 3522.640752] wlan0: Removed STA 00:0f:c9:55:9a:08 > [ 3522.641115] wlan0: Destroyed STA 00:0f:c9:55:9a:08 > [ 3643.002766] wlan0: authenticate with 00:0f:c9:55:9a:08 > [ 3643.008234] wlan0: Allocated STA 00:0f:c9:55:9a:08 > [ 3643.019898] wlan0: Inserted STA 00:0f:c9:55:9a:08 > [ 3643.019929] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3) > [ 3643.129988] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3) > [ 3643.231366] wla366] wla auth to 00:0f:c9:55:9a:08 (try 3/3) > [ 3643.337126] wlan0: authentication with 00:0f:c9:55:9a:08 timed out > [ 3643.343607] wlan0: Removed STA 00:0f:c9:55:9a:08 > [ 3643.343974] wlan0: Destroyed STA 00:0f:c9:55:9a:08 > [ 3764.714292] wlan0: authenticate with 00:0f:c9:55:9a:08 > [ 3764.719688] wlan0: Allocated STA 00:0f:c9:55:9a:08 > [ 3764.731425] wlan0: Inserted STA 00:0f:c9:55:9a:08 > [ 3764.731457] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3) > [ 3764.861011] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3) > [ 3764.962540] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3) > [ 3765.058784] wlan0: authentication with 00:0f:c9:55:9a:08 timed out > [ 3765.065253] wlan0: Removed STA 00:0f:c9:55:9a:08 > [ 3765.065610] wlan0: Destroyed STA 00:0f:c9:55:9a:08 > > I just want to know whether this bug is still there, or solved? Thanks very > much. Sorry, I lost this: Network Controller is AR9531, thanks.