Bug 55771

Summary: ath9k: Calibration Failure Leads To Inability To Associate
Product: Drivers Reporter: Robert Shade (robert.shade)
Component: network-wirelessAssignee: drivers_network-wireless (drivers_network-wireless)
Status: NEW ---    
Severity: normal CC: iwtbavbm, linville, mcgrof, sujith
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.8.4 Subsystem:
Regression: No Bisected commit-id:
Attachments: Patch to disable fastcc and always cold reset
xmit debug after auth timeout
xmit debug after reset
Potential Fix

Description Robert Shade 2013-03-26 13:14:53 UTC
During a channel change (whether associated to an AP or not) AGC calibration times out and causes the channel set to be considered a failure:

kernel: ath: phy1: timeout (100000 us) on reg 0x9860: 0x00048d21 & 0x00000001 != 0x00000000
kernel: ath: phy1: offset calibration failed to complete in 1ms; noisy environment?
kernel: ath: phy1: Unable to reset channel, reset status -5
kernel: ath: phy1: Unable to set channel

(Note that the "1ms" message is incorrect, it's actually based off AH_WAIT_TIMEOUT.  This condition eventually happens even when the timeout is set to several seconds.)

Once this happens, the device is unable to associate to an AP.  In this state, the log will have entries that look like this:

kernel: wlan0: authenticate with 00:21:a0:25:f9:10
kernel: wlan0: Allocated STA 00:21:a0:25:f9:10
kernel: wlan0: capabilities/regulatory prevented using AP HT/VHT configuration, downgraded
kernel: wlan0: Inserted STA 00:21:a0:25:f9:10
kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 1/3)
kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 2/3)
kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 3/3)
kernel: wlan0: authentication with 00:21:a0:25:f9:10 timed out
kernel: wlan0: Removed STA 00:21:a0:25:f9:10
kernel: wlan0: Destroyed STA 00:21:a0:25:f9:10

With debugging enabled, you'll see the authentication messages being queued, but never sent.  After the time out, you'll get a TX complete message for all 3 messages, but this is due to the TX queue being flushed before a reset as wpa_supplicant goes back to scan for another AP.

One interesting point in the code is that after the channel change fails, ath_complete_reset is skipped.  This results in ath_prepare_reset being called many more times than ath_complete_reset.  I don't know enough about the internals to know if the work that these two functions do should be balanced.

Hardware Info:
17:00.0 Network controller: Atheros Communications Inc. AR9160
Wireless Network Adapter [AR9001 802.11(a)bgn] (rev 01)
        Subsystem: Ubiquiti Networks, Inc. SR71-A 802.11abgn Wireless Mini PCI Adapter
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 168, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at fd5f0000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=100mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: ath9k
        Kernel modules: ath9k
Comment 1 Robert Shade 2013-03-26 13:21:02 UTC
Created attachment 96211 [details]
Patch to disable fastcc and always cold reset

After discussing with Adrian Chadd, I've been testing with the attached patch to eliminate potential issues with Fast Channel Change and warm resets.

The issue still occurs with this patch.
Comment 2 Robert Shade 2013-03-26 13:28:05 UTC
I should note that removing/replacing the ath9k module fixes the issue
Comment 3 Robert Shade 2013-03-27 00:51:03 UTC
Created attachment 96301 [details]
xmit debug after auth timeout

/sys/kernel/debug/ieee80211/phy0/ath9k/xmit after the authentication has timed out, but before the chip is reset to scan
Comment 4 Robert Shade 2013-03-27 00:52:13 UTC
Created attachment 96311 [details]
xmit debug after reset

/sys/kernel/debug/ieee80211/phy0/ath9k/xmit after the authentication has timed
out and after the chip is reset to scan
Comment 5 Robert Shade 2013-03-27 18:51:21 UTC
Think I found it:  In ath_reset_internal, before we reset, we call ath_prepare_reset, which calls ath9k_hw_disable_interrupts.  If the channel change fails, we never call ath_complete_reset, which calls ath9k_hw_enable_interrupts.

ath9k_hw_{enable|disable}_interrupts calls definitely needs to be balanced.  Looking at an old log, IER never gets re-enabled after the channel set failure.  Every time it tries to enable it, you see "Do not enable IER ref count X" in the log, I've seen this as high as -97 for units that have been on for several weeks.

Any suggestions on the proper cleanup?  Could we just call ath_complete_reset anyway?  Some sort of special handling?
Comment 6 Robert Shade 2013-04-01 13:51:20 UTC
Created attachment 96911 [details]
Potential Fix

I've successfully tested this patch.  This does not fix the root cause of whatever is causing the calibration to fail, but it does clean up if a failure occurs.  I also queue a reset since obviously something went wrong and that's our best effort to fix it.
Comment 7 Sujith 2013-06-03 02:15:20 UTC
Well, the patch has been merged, but I am not sure if queueing a reset in case a reset fails is correct. We might end up doing endless resets....

Are you still seeing the calibration failure ? Is PowerSave enabled ?
Comment 8 Robert Shade 2013-06-04 03:52:24 UTC
(In reply to comment #7)
> Well, the patch has been merged, but I am not sure if queueing a reset in
> case
> a reset fails is correct. We might end up doing endless resets....
> 
> Are you still seeing the calibration failure ? Is PowerSave enabled ?

Yes, I still see it and have replicated it with both PS on and off.
Comment 9 Tom Liu 2019-10-11 11:01:50 UTC
Hi guys,

I also meet the same issue.
Platform: Atheros QCA4531
OS: Linux 4.4.60
Error Logs:
[ 3521.992941] wlan0: authenticate with 00:0f:c9:55:9a:08
[ 3521.998410] wlan0: Allocated STA 00:0f:c9:55:9a:08
[ 3522.010067] wlan0: Inserted STA 00:0f:c9:55:9a:08
[ 3522.010099] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
[ 3522.240495] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
[ 3522.414592] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3)    
[ 3522.634321] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
[ 3522.640752] wlan0: Removed STA 00:0f:c9:55:9a:08
[ 3522.641115] wlan0: Destroyed STA 00:0f:c9:55:9a:08
[ 3643.002766] wlan0: authenticate with 00:0f:c9:55:9a:08
[ 3643.008234] wlan0: Allocated STA 00:0f:c9:55:9a:08
[ 3643.019898] wlan0: Inserted STA 00:0f:c9:55:9a:08
[ 3643.019929] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
[ 3643.129988] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
[ 3643.231366] wla366] wla auth to 00:0f:c9:55:9a:08 (try 3/3)        
[ 3643.337126] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
[ 3643.343607] wlan0: Removed STA 00:0f:c9:55:9a:08
[ 3643.343974] wlan0: Destroyed STA 00:0f:c9:55:9a:08
[ 3764.714292] wlan0: authenticate with 00:0f:c9:55:9a:08
[ 3764.719688] wlan0: Allocated STA 00:0f:c9:55:9a:08
[ 3764.731425] wlan0: Inserted STA 00:0f:c9:55:9a:08
[ 3764.731457] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
[ 3764.861011] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
[ 3764.962540] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3)
[ 3765.058784] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
[ 3765.065253] wlan0: Removed STA 00:0f:c9:55:9a:08     
[ 3765.065610] wlan0: Destroyed STA 00:0f:c9:55:9a:08

I just want to know whether this bug is still there, or solved? Thanks very much.
Comment 10 Tom Liu 2019-10-11 11:08:40 UTC
(In reply to Tom Liu from comment #9)
> Hi guys,
> 
> I also meet the same issue.
> Platform: Atheros QCA4531
> OS: Linux 4.4.60
> Error Logs:
> [ 3521.992941] wlan0: authenticate with 00:0f:c9:55:9a:08
> [ 3521.998410] wlan0: Allocated STA 00:0f:c9:55:9a:08
> [ 3522.010067] wlan0: Inserted STA 00:0f:c9:55:9a:08
> [ 3522.010099] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
> [ 3522.240495] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
> [ 3522.414592] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3)    
> [ 3522.634321] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
> [ 3522.640752] wlan0: Removed STA 00:0f:c9:55:9a:08
> [ 3522.641115] wlan0: Destroyed STA 00:0f:c9:55:9a:08
> [ 3643.002766] wlan0: authenticate with 00:0f:c9:55:9a:08
> [ 3643.008234] wlan0: Allocated STA 00:0f:c9:55:9a:08
> [ 3643.019898] wlan0: Inserted STA 00:0f:c9:55:9a:08
> [ 3643.019929] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
> [ 3643.129988] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
> [ 3643.231366] wla366] wla auth to 00:0f:c9:55:9a:08 (try 3/3)        
> [ 3643.337126] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
> [ 3643.343607] wlan0: Removed STA 00:0f:c9:55:9a:08
> [ 3643.343974] wlan0: Destroyed STA 00:0f:c9:55:9a:08
> [ 3764.714292] wlan0: authenticate with 00:0f:c9:55:9a:08
> [ 3764.719688] wlan0: Allocated STA 00:0f:c9:55:9a:08
> [ 3764.731425] wlan0: Inserted STA 00:0f:c9:55:9a:08
> [ 3764.731457] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
> [ 3764.861011] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
> [ 3764.962540] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3)
> [ 3765.058784] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
> [ 3765.065253] wlan0: Removed STA 00:0f:c9:55:9a:08     
> [ 3765.065610] wlan0: Destroyed STA 00:0f:c9:55:9a:08
> 
> I just want to know whether this bug is still there, or solved? Thanks very
> much.

Sorry, I lost this: Network Controller is AR9531, thanks.