Bug 13199
Summary: | ath5k: nonfunctional, floods of "noise floor calibration failed" on cold boot | ||
---|---|---|---|
Product: | Drivers | Reporter: | Zack Weinberg (zackw) |
Component: | network-wireless | Assignee: | drivers_network-wireless (drivers_network-wireless) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | jrnieder, linville, me, uzytkownik2 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
URL: | http://bugs.debian.org/503544 | ||
Kernel Version: | 2.6.26, 2.6.27, 2.6.28, 2.6.29 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg output
dmesg |
Description
Zack Weinberg
2009-04-27 23:33:22 UTC
On rereading the Debian bug I discovered that I had misremembered the first kernel version with the problem; I switched from madwifi to ath5k as of 2.6.26, not .27, and I was seeing the "noise floor calibration timeout" problem in that version as well. $ sudo lspci -vvxxxx -s 03:00.0 | egrep -v ':( 00)+$' 03:00.0 Ethernet controller: Atheros Communications Inc. AR5212 802.11abg NIC (rev 01) Subsystem: IBM ThinkPad 11a/b/g Wireless LAN Mini Express Adapter (AR5BXB6) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 17 Region 0: Memory at df3f0000 (64-bit, non-prefetchable) [size=64K] Capabilities: [40] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Mask- 64bit- Count=1/1 Enable- Address: 00000000 Data: 0000 Capabilities: [60] Express (v1) Legacy Endpoint, MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM L0s L1 Enabled; RCB 128 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- Capabilities: [90] MSI-X: Enable- Mask- TabSize=1 Vector table: BAR=0 offset=00000000 PBA: BAR=0 offset=00000000 Capabilities: [100] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr- AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn- Capabilities: [140] Virtual Channel <?> Kernel driver in use: ath5k 00: 8c 16 14 10 07 01 10 00 01 00 00 02 10 00 00 00 10: 04 00 3f df 00 00 00 00 00 00 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 01 50 00 00 14 10 8a 05 30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 00 00 40: 01 50 c2 01 00 00 00 00 00 00 00 00 00 00 00 00 50: 05 60 00 00 00 00 00 00 00 00 00 00 00 00 00 00 60: 10 90 11 00 c0 0c 04 05 10 20 0a 00 11 3c 03 00 70: 4b 00 11 10 00 00 00 00 c0 03 00 00 00 00 00 00 90: 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a0: 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 100: 01 00 01 14 00 00 10 00 00 00 00 00 11 20 06 00 110: 00 00 00 00 00 00 00 00 b4 00 00 00 01 00 00 04 120: 0f 03 00 00 00 01 01 02 00 00 00 00 00 00 00 00 140: 02 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 150: 00 00 00 00 01 00 00 80 00 00 00 00 00 00 00 00 700: 76 00 63 01 ff ff ff ff 04 00 00 00 01 3f 00 0f 710: 20 01 01 00 00 00 00 00 aa 13 00 00 80 02 00 00 720: ff ff ff ff 00 00 00 00 10 6b f6 03 10 00 00 08 730: 10 40 00 00 01 40 00 00 ff ff 0f 00 00 00 00 00 On looking at the logs again, it's "noise floor calibration *failed*" that shows up consistently when the wireless card is in a bad state. Here's the syndrome: Apr 27 16:07:31 trurl kernel: [ 7.548170] ath5k 0000:03:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17 Apr 27 16:07:31 trurl kernel: [ 7.548197] ath5k 0000:03:00.0: setting latency timer to 64 Apr 27 16:07:31 trurl kernel: [ 7.548330] ath5k 0000:03:00.0: registered as 'phy0' Apr 27 16:07:31 trurl kernel: [ 7.705777] wmaster0 (ath5k): not using net_device_ops yet Apr 27 16:07:31 trurl kernel: [ 7.734132] wlan0 (ath5k): not using net_device_ops yet Apr 27 16:07:31 trurl kernel: [ 7.734781] Registered led device: ath5k-phy0::rx Apr 27 16:07:31 trurl kernel: [ 7.734817] Registered led device: ath5k-phy0::tx Apr 27 16:07:31 trurl kernel: [ 7.734822] ath5k phy0: Atheros AR5414 chip found (MAC: 0xa3, PHY: 0x61) Apr 27 16:07:36 trurl kernel: [ 20.754245] ath5k phy0: noise floor calibration failed (2412MHz) Apr 27 16:07:37 trurl kernel: [ 21.069019] ath5k phy0: noise floor calibration failed (2417MHz) Apr 27 16:07:37 trurl kernel: [ 21.391169] ath5k phy0: noise floor calibration failed (2422MHz) Apr 27 16:07:37 trurl kernel: [ 21.507012] ath5k phy0: noise floor calibration failed (2427MHz) Apr 27 16:07:42 trurl kernel: [ 25.960569] ath5k phy0: noise floor calibration failed (2412MHz) Apr 27 16:07:46 trurl kernel: [ 30.906901] ath5k phy0: noise floor calibration failed (2412MHz) Apr 27 16:07:47 trurl kernel: [ 31.510759] ath5k phy0: noise floor calibration failed (2417MHz) Apr 27 16:07:47 trurl kernel: [ 31.691464] ath5k phy0: noise floor calibration failed (2422MHz) Apr 27 16:07:52 trurl kernel: [ 36.034843] ath5k phy0: noise floor calibration failed (2412MHz) Apr 27 16:07:52 trurl kernel: [ 36.583076] ath5k phy0: noise floor calibration failed (2412MHz) Apr 27 16:07:53 trurl kernel: [ 37.012978] ath5k phy0: noise floor calibration failed (2412MHz) Apr 27 16:07:53 trurl kernel: [ 37.097252] ath5k phy0: noise floor calibration failed (2412MHz) Apr 27 16:07:53 trurl kernel: [ 37.383880] ath5k phy0: noise floor calibration failed (2417MHz) Apr 27 16:07:53 trurl kernel: [ 37.753787] ath5k phy0: noise floor calibration failed (2422MHz) Apr 27 16:07:53 trurl kernel: [ 37.883195] ath5k phy0: noise floor calibration failed (2427MHz) I have similar problem - however in this case it is after some time of running (a few days) and cold boot helps. dmesg output similar. Should I fill separate bug. (In reply to comment #4) > I have similar problem - however in this case it is after some time of > running > (a few days) and cold boot helps. dmesg output similar. Should I fill > separate > bug. No, that is ok. It would be a useful datapoint if you could both try the wireless-testing kernel (or the latest compat-wireless snapshot, and if the latter please note the git version). These both have numerous fixes to the phy initialization and calibration. (In reply to comment #5) > (In reply to comment #4) > > I have similar problem - however in this case it is after some time of > running > > (a few days) and cold boot helps. dmesg output similar. Should I fill > separate > > bug. > > No, that is ok. It would be a useful datapoint if you could both try the > wireless-testing kernel (or the latest compat-wireless snapshot, and if the > latter please note the git version). These both have numerous fixes to the > phy > initialization and calibration. The wireless-testing kernel seems to failed connect at all. As I forgotten to compile-in debugfs I wasn't able to see ay debug. However in dmesg was nothing. Created attachment 21338 [details]
dmesg output
Apparently with newer kernels the phy0 goes oops after dereferencing NULL.
The message changed to: [360262.093139] wlan0: No ProbeResp from current AP <mac_of_ap> - assume out of range [360263.224020] wlan0: direct probe to AP <mac_of_ap> try 1 [360263.424019] wlan0: direct probe to AP <mac_of_ap> try 2 [360263.623130] wlan0: direct probe to AP <mac_of_ap> try 3 [360263.823128] wlan0: direct probe to AP <mac_of_ap> timed out This just means we sent a probe to the AP and it didn't reply (or we lost the response). Does moving closer to the AP help? (In reply to comment #9) > This just means we sent a probe to the AP and it didn't reply (or we lost the > response). Does moving closer to the AP help? Why would it help when it reports 94-100% signal? I'll try but I'm afraid it is hardware/software problem on this side as restart without moving helps. BTW - AP is 2 m away and I'm separated by wall from gypsum and rock wool (sorry if I translated it incorrectly) with clear path nearly straight. (In reply to comment #9) > This just means we sent a probe to the AP and it didn't reply (or we lost the > response). Does moving closer to the AP help? When it happened I tried to hold it even directly over AP (well - <30 cm). The same results. (In reply to comment #11) > (In reply to comment #9) > > This just means we sent a probe to the AP and it didn't reply (or we lost > the > > response). Does moving closer to the AP help? > > When it happened I tried to hold it even directly over AP (well - <30 cm). > The > same results. 2.6.30-rc8 seems to be ok. I'll wait a several days anyway to check (noise floor occured after few days while calibration in a day). (In reply to comment #12) > (In reply to comment #11) > > (In reply to comment #9) > > > This just means we sent a probe to the AP and it didn't reply (or we lost > the > > > response). Does moving closer to the AP help? > > > > When it happened I tried to hold it even directly over AP (well - <30 cm). > The > > same results. > > 2.6.30-rc8 seems to be ok. I'll wait a several days anyway to check (noise > floor occured after few days while calibration in a day). 2.6.30 seemed to be OK but reproduced with gentoo patchset. Which is a bit strange as it contains: - Patch for alpha systemes (unrelated for x86 as the code is not even compiled) - Patch for live-cd (not enabled) - Fbcondecor patch. As it was not enabled anytime before I guess that problem still holds. No idea about the Gentoo patches, but it sounds like this is resolved in the upstream kernel... (In reply to comment #14) > No idea about the Gentoo patches, but it sounds like this is resolved in the > upstream kernel... You mean? If it was reproduced on 2.6.30 with likely harmless patches... You said "2.6.30 seemed to be OK", but that you reproduced the problem with the gentoo patchset. In that case, contact the gentoo guys...? (In reply to comment #16) > You said "2.6.30 seemed to be OK", but that you reproduced the problem with > the > gentoo patchset. In that case, contact the gentoo guys...? Yes I stated that "2.6.30 seems to be OK". However I reproduced it on gentoo patchset which contained 2 patches which were totaly unrelated as they haven't been even built and one patch which was not related to wifi (and chance for totaly unrelated patch which is causing *exactly( the same problems as previous unpatched buggy kernel is near 0 - at least as far as I know statistics). Anyway reproduced on vanilla 2.6.30 anyway. Ok let's reset, the current problem is this: [360262.093139] wlan0: No ProbeResp from current AP <mac_of_ap> - assume out of range [360263.224020] wlan0: direct probe to AP <mac_of_ap> try 1 [360263.424019] wlan0: direct probe to AP <mac_of_ap> try 2 [360263.623130] wlan0: direct probe to AP <mac_of_ap> try 3 [360263.823128] wlan0: direct probe to AP <mac_of_ap> timed out The original issue of "noise floor calibration failed" etc is no longer happening? If you have another wireless card, a tcpdump of the OTA traffic would help immensely. If not, turning on the maximum debug setting on your wpa_supplicant process would be useful. (In reply to comment #18) > Ok let's reset, the current problem is this: > > [360262.093139] wlan0: No ProbeResp from current AP <mac_of_ap> - assume out > of > range > [360263.224020] wlan0: direct probe to AP <mac_of_ap> try 1 > [360263.424019] wlan0: direct probe to AP <mac_of_ap> try 2 > [360263.623130] wlan0: direct probe to AP <mac_of_ap> try 3 > [360263.823128] wlan0: direct probe to AP <mac_of_ap> timed out > > The original issue of "noise floor calibration failed" etc is no longer > happening? > No as it occured after about week of running. This one occures after ~24h - to fix it I hibernate the computer which fixes the problem but it may 'reset' the timer hiding the problem. > If you have another wireless card, a tcpdump of the OTA traffic would help > immensely. I'm, hardly, in range of eth cable (for many reasons it is not a long-term solution). If it is needed I may run some tcpdump or wireshark if it'll help but what should I look for? What's OTA? > If not, turning on the maximum debug setting on your wpa_supplicant > process would be useful. I'm not using wpa_supplicant directly but I'll try to run it after crash. Anyways will debugging of ath5k help? If yes on what settings? (In reply to comment #19) > > The original issue of "noise floor calibration failed" etc is no longer > > happening? > > > > No as it occured after about week of running. This one occures after ~24h - > to > fix it I hibernate the computer which fixes the problem but it may 'reset' > the > timer hiding the problem. Ok, and when it happened, was the card unresponsive? Occasional noise floor calibration failures do happen and shouldn't be critical, but if the card entirely ceases to function until you suspend/resume, that's a problem. Can you post dmesg from when it happens? And are you in a noisy environment (lots of other APs around, e.g.)? > I'm, hardly, in range of eth cable (for many reasons it is not a long-term > solution). If it is needed I may run some tcpdump or wireshark if it'll help > but what should I look for? What's OTA? Sorry, OTA = over the air, you need tcpdump running on a second wireless card, but nevermind, that will only help debug association failures. > I'm not using wpa_supplicant directly but I'll try to run it after crash. > Anyways will debugging of ath5k help? If yes on what settings? Maybe adding WARN to the failure case to see the stack would be useful. Chances are though it will show we're in scan->config->reset->calibrate which won't tell a whole lot. Created attachment 21998 [details] dmesg (In reply to comment #20) > (In reply to comment #19) > > > The original issue of "noise floor calibration failed" etc is no longer > > > happening? > > > > > > > No as it occured after about week of running. This one occures after ~24h - > to > > fix it I hibernate the computer which fixes the problem but it may 'reset' > the > > timer hiding the problem. > > Ok, and when it happened, was the card unresponsive? Occasional noise floor > calibration failures do happen and shouldn't be critical, but if the card > entirely ceases to function until you suspend/resume, that's a problem. > > Can you post dmesg from when it happens? debug enabled seems to speed the process. It makes it in several minutes. dmesg posted > And are you in a noisy environment > (lots of other APs around, e.g.)? > 6 AP. I discovered all of them are on channels 10/11 which may be a problem. > > I'm, hardly, in range of eth cable (for many reasons it is not a long-term > > solution). If it is needed I may run some tcpdump or wireshark if it'll > help > > but what should I look for? What's OTA? > > Sorry, OTA = over the air, you need tcpdump running on a second wireless > card, > but nevermind, that will only help debug association failures. > OK. Will running on separate computer will help? I have WPA2-psk (I guess not). > > I'm not using wpa_supplicant directly but I'll try to run it after crash. (...) Interface wlan0 set UP - waiting a second for the driver to complete initialization SIOCGIWRANGE: WE(compiled)=22 WE(source)=21 enc_capa=0xf capabilities: key_mgmt 0xf enc 0xf flags 0x0 WEXT: Operstate: linkmode=1, operstate=5 Own MAC address: 00:16:ce:26:39:e6 wpa_driver_wext_set_wpa wpa_driver_wext_set_key: alg=0 key_idx=0 set_tx=0 seq_len=0 key_len=0 wpa_driver_wext_set_key: alg=0 key_idx=1 set_tx=0 seq_len=0 key_len=0 wpa_driver_wext_set_key: alg=0 key_idx=2 set_tx=0 seq_len=0 key_len=0 wpa_driver_wext_set_key: alg=0 key_idx=3 set_tx=0 seq_len=0 key_len=0 wpa_driver_wext_set_countermeasures wpa_driver_wext_set_drop_unencrypted RSN: flushing PMKID list in the driver Setting scan request: 0 sec 100000 usec EAPOL: SUPP_PAE entering state DISCONNECTED EAPOL: KEY_RX entering state NO_KEY_RECEIVE EAPOL: SUPP_BE entering state INITIALIZE EAP: EAP entering state DISABLED Using existing control interface directory. ctrl_interface_group=0 Added interface wlan0 Ignore event for foreign ifindex 5 RTM_NEWLINK: operstate=0 ifi_flags=0x1003 ([UP]) RTM_NEWLINK, IFLA_IFNAME: Interface 'wlan0' added RTM_NEWLINK: operstate=0 ifi_flags=0x1003 ([UP]) RTM_NEWLINK, IFLA_IFNAME: Interface 'wlan0' added Wireless event: cmd=0x8b06 len=8 State: DISCONNECTED -> SCANNING Starting AP scan (broadcast SSID) Trying to get current scan results first without requesting a new scan to speed up initial association Received 0 bytes of scan results (0 BSSes) Cached scan results are empty - not posting Selecting BSS from priority group 5 Try to find WPA-enabled AP Try to find non-WPA AP Selecting BSS from priority group 0 Try to find WPA-enabled AP Try to find non-WPA AP No suitable AP found. Setting scan request: 0 sec 0 usec Starting AP scan (broadcast SSID) ioctl[SIOCSIWSCAN]: Device or resource busy Scan requested (ret=-1) - scan timeout 5 seconds Failed to initiate AP scan. Setting scan request: 10 sec 0 usec EAPOL: disable timer tick Scan timeout - try to get results Received 0 bytes of scan results (0 BSSes) CTRL-EVENT-SCAN-RESULTS Selecting BSS from priority group 5 Try to find WPA-enabled AP Try to find non-WPA AP Selecting BSS from priority group 0 Try to find WPA-enabled AP Try to find non-WPA AP No suitable AP found. Setting scan request: 5 sec 0 usec Starting AP scan (broadcast SSID) Scan requested (ret=0) - scan timeout 5 seconds (...) > > Anyways will debugging of ath5k help? If yes on what settings? > > Maybe adding WARN to the failure case to see the stack would be useful. > Chances are though it will show we're in scan->config->reset->calibrate which > won't tell a whole lot. Sorry - I'm not quite understend. ATM I've changed channel to 1. I'm currently the only one with 1. I found 1 with 7, 2 with 10 and 1 with 11. I'll check if it will help. (In reply to comment #22) > ATM I've changed channel to 1. I'm currently the only one with 1. I found 1 > with 7, 2 with 10 and 1 with 11. I'll check if it will help. Problem occured after the 2 days so it helped somehow. I think I've spotted something else on channel 1 but now I receives 1 on 6, 1 on 7, 2 on 10 and 3 on 11 - so probably it is on shore of range. Reproduced on 2.6.31-rc4. Should I open new bug or should this bug be reopened? (In reply to comment #24) > Reproduced on 2.6.31-rc4. Should I open new bug or should this bug be > reopened? Sorry- I forgot to add. It seems to happen only in rather noisy enviroment (like 9 networks - this time was much faster but thunderstorms are approaching). Hi Zack, Do you still have access to that hardware, and if so are you still experiencing this bug? I'm curious because there is a proposed fix that Debian will probably be using in their next stable update. See http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=15;bug=666969 if you would like to test. Thanks, Jonathan I'm sorry, I haven't had that laptop for nearly two years now. No prob, and thanks for the update. |