Bug 55771 - ath9k: Calibration Failure Leads To Inability To Associate
Summary: ath9k: Calibration Failure Leads To Inability To Associate
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-03-26 13:14 UTC by Robert Shade
Modified: 2019-10-11 11:08 UTC (History)
4 users (show)

See Also:
Kernel Version: 3.8.4
Tree: Mainline
Regression: No


Attachments
Patch to disable fastcc and always cold reset (1.43 KB, patch)
2013-03-26 13:21 UTC, Robert Shade
Details | Diff
xmit debug after auth timeout (2.59 KB, application/octet-stream)
2013-03-27 00:51 UTC, Robert Shade
Details
xmit debug after reset (2.59 KB, application/octet-stream)
2013-03-27 00:52 UTC, Robert Shade
Details
Potential Fix (533 bytes, patch)
2013-04-01 13:51 UTC, Robert Shade
Details | Diff

Description Robert Shade 2013-03-26 13:14:53 UTC
During a channel change (whether associated to an AP or not) AGC calibration times out and causes the channel set to be considered a failure:

kernel: ath: phy1: timeout (100000 us) on reg 0x9860: 0x00048d21 & 0x00000001 != 0x00000000
kernel: ath: phy1: offset calibration failed to complete in 1ms; noisy environment?
kernel: ath: phy1: Unable to reset channel, reset status -5
kernel: ath: phy1: Unable to set channel

(Note that the "1ms" message is incorrect, it's actually based off AH_WAIT_TIMEOUT.  This condition eventually happens even when the timeout is set to several seconds.)

Once this happens, the device is unable to associate to an AP.  In this state, the log will have entries that look like this:

kernel: wlan0: authenticate with 00:21:a0:25:f9:10
kernel: wlan0: Allocated STA 00:21:a0:25:f9:10
kernel: wlan0: capabilities/regulatory prevented using AP HT/VHT configuration, downgraded
kernel: wlan0: Inserted STA 00:21:a0:25:f9:10
kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 1/3)
kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 2/3)
kernel: wlan0: send auth to 00:21:a0:25:f9:10 (try 3/3)
kernel: wlan0: authentication with 00:21:a0:25:f9:10 timed out
kernel: wlan0: Removed STA 00:21:a0:25:f9:10
kernel: wlan0: Destroyed STA 00:21:a0:25:f9:10

With debugging enabled, you'll see the authentication messages being queued, but never sent.  After the time out, you'll get a TX complete message for all 3 messages, but this is due to the TX queue being flushed before a reset as wpa_supplicant goes back to scan for another AP.

One interesting point in the code is that after the channel change fails, ath_complete_reset is skipped.  This results in ath_prepare_reset being called many more times than ath_complete_reset.  I don't know enough about the internals to know if the work that these two functions do should be balanced.

Hardware Info:
17:00.0 Network controller: Atheros Communications Inc. AR9160
Wireless Network Adapter [AR9001 802.11(a)bgn] (rev 01)
        Subsystem: Ubiquiti Networks, Inc. SR71-A 802.11abgn Wireless Mini PCI Adapter
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 168, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at fd5f0000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=100mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Kernel driver in use: ath9k
        Kernel modules: ath9k
Comment 1 Robert Shade 2013-03-26 13:21:02 UTC
Created attachment 96211 [details]
Patch to disable fastcc and always cold reset

After discussing with Adrian Chadd, I've been testing with the attached patch to eliminate potential issues with Fast Channel Change and warm resets.

The issue still occurs with this patch.
Comment 2 Robert Shade 2013-03-26 13:28:05 UTC
I should note that removing/replacing the ath9k module fixes the issue
Comment 3 Robert Shade 2013-03-27 00:51:03 UTC
Created attachment 96301 [details]
xmit debug after auth timeout

/sys/kernel/debug/ieee80211/phy0/ath9k/xmit after the authentication has timed out, but before the chip is reset to scan
Comment 4 Robert Shade 2013-03-27 00:52:13 UTC
Created attachment 96311 [details]
xmit debug after reset

/sys/kernel/debug/ieee80211/phy0/ath9k/xmit after the authentication has timed
out and after the chip is reset to scan
Comment 5 Robert Shade 2013-03-27 18:51:21 UTC
Think I found it:  In ath_reset_internal, before we reset, we call ath_prepare_reset, which calls ath9k_hw_disable_interrupts.  If the channel change fails, we never call ath_complete_reset, which calls ath9k_hw_enable_interrupts.

ath9k_hw_{enable|disable}_interrupts calls definitely needs to be balanced.  Looking at an old log, IER never gets re-enabled after the channel set failure.  Every time it tries to enable it, you see "Do not enable IER ref count X" in the log, I've seen this as high as -97 for units that have been on for several weeks.

Any suggestions on the proper cleanup?  Could we just call ath_complete_reset anyway?  Some sort of special handling?
Comment 6 Robert Shade 2013-04-01 13:51:20 UTC
Created attachment 96911 [details]
Potential Fix

I've successfully tested this patch.  This does not fix the root cause of whatever is causing the calibration to fail, but it does clean up if a failure occurs.  I also queue a reset since obviously something went wrong and that's our best effort to fix it.
Comment 7 Sujith 2013-06-03 02:15:20 UTC
Well, the patch has been merged, but I am not sure if queueing a reset in case a reset fails is correct. We might end up doing endless resets....

Are you still seeing the calibration failure ? Is PowerSave enabled ?
Comment 8 Robert Shade 2013-06-04 03:52:24 UTC
(In reply to comment #7)
> Well, the patch has been merged, but I am not sure if queueing a reset in
> case
> a reset fails is correct. We might end up doing endless resets....
> 
> Are you still seeing the calibration failure ? Is PowerSave enabled ?

Yes, I still see it and have replicated it with both PS on and off.
Comment 9 Tom Liu 2019-10-11 11:01:50 UTC
Hi guys,

I also meet the same issue.
Platform: Atheros QCA4531
OS: Linux 4.4.60
Error Logs:
[ 3521.992941] wlan0: authenticate with 00:0f:c9:55:9a:08
[ 3521.998410] wlan0: Allocated STA 00:0f:c9:55:9a:08
[ 3522.010067] wlan0: Inserted STA 00:0f:c9:55:9a:08
[ 3522.010099] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
[ 3522.240495] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
[ 3522.414592] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3)    
[ 3522.634321] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
[ 3522.640752] wlan0: Removed STA 00:0f:c9:55:9a:08
[ 3522.641115] wlan0: Destroyed STA 00:0f:c9:55:9a:08
[ 3643.002766] wlan0: authenticate with 00:0f:c9:55:9a:08
[ 3643.008234] wlan0: Allocated STA 00:0f:c9:55:9a:08
[ 3643.019898] wlan0: Inserted STA 00:0f:c9:55:9a:08
[ 3643.019929] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
[ 3643.129988] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
[ 3643.231366] wla366] wla auth to 00:0f:c9:55:9a:08 (try 3/3)        
[ 3643.337126] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
[ 3643.343607] wlan0: Removed STA 00:0f:c9:55:9a:08
[ 3643.343974] wlan0: Destroyed STA 00:0f:c9:55:9a:08
[ 3764.714292] wlan0: authenticate with 00:0f:c9:55:9a:08
[ 3764.719688] wlan0: Allocated STA 00:0f:c9:55:9a:08
[ 3764.731425] wlan0: Inserted STA 00:0f:c9:55:9a:08
[ 3764.731457] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
[ 3764.861011] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
[ 3764.962540] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3)
[ 3765.058784] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
[ 3765.065253] wlan0: Removed STA 00:0f:c9:55:9a:08     
[ 3765.065610] wlan0: Destroyed STA 00:0f:c9:55:9a:08

I just want to know whether this bug is still there, or solved? Thanks very much.
Comment 10 Tom Liu 2019-10-11 11:08:40 UTC
(In reply to Tom Liu from comment #9)
> Hi guys,
> 
> I also meet the same issue.
> Platform: Atheros QCA4531
> OS: Linux 4.4.60
> Error Logs:
> [ 3521.992941] wlan0: authenticate with 00:0f:c9:55:9a:08
> [ 3521.998410] wlan0: Allocated STA 00:0f:c9:55:9a:08
> [ 3522.010067] wlan0: Inserted STA 00:0f:c9:55:9a:08
> [ 3522.010099] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
> [ 3522.240495] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
> [ 3522.414592] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3)    
> [ 3522.634321] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
> [ 3522.640752] wlan0: Removed STA 00:0f:c9:55:9a:08
> [ 3522.641115] wlan0: Destroyed STA 00:0f:c9:55:9a:08
> [ 3643.002766] wlan0: authenticate with 00:0f:c9:55:9a:08
> [ 3643.008234] wlan0: Allocated STA 00:0f:c9:55:9a:08
> [ 3643.019898] wlan0: Inserted STA 00:0f:c9:55:9a:08
> [ 3643.019929] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
> [ 3643.129988] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
> [ 3643.231366] wla366] wla auth to 00:0f:c9:55:9a:08 (try 3/3)        
> [ 3643.337126] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
> [ 3643.343607] wlan0: Removed STA 00:0f:c9:55:9a:08
> [ 3643.343974] wlan0: Destroyed STA 00:0f:c9:55:9a:08
> [ 3764.714292] wlan0: authenticate with 00:0f:c9:55:9a:08
> [ 3764.719688] wlan0: Allocated STA 00:0f:c9:55:9a:08
> [ 3764.731425] wlan0: Inserted STA 00:0f:c9:55:9a:08
> [ 3764.731457] wlan0: send auth to 00:0f:c9:55:9a:08 (try 1/3)
> [ 3764.861011] wlan0: send auth to 00:0f:c9:55:9a:08 (try 2/3)
> [ 3764.962540] wlan0: send auth to 00:0f:c9:55:9a:08 (try 3/3)
> [ 3765.058784] wlan0: authentication with 00:0f:c9:55:9a:08 timed out
> [ 3765.065253] wlan0: Removed STA 00:0f:c9:55:9a:08     
> [ 3765.065610] wlan0: Destroyed STA 00:0f:c9:55:9a:08
> 
> I just want to know whether this bug is still there, or solved? Thanks very
> much.

Sorry, I lost this: Network Controller is AR9531, thanks.

Note You need to log in before you can comment on or make changes to this bug.