Bug 12068

Summary: ath5k: card eventually fails to function, spams dmesg with can't reset hardware errors
Product: Networking
Reporter: Rich Ercolani (rercola)
Component: Wireless
Assignee: Luis Chamberlain (mcgrof)
Status: RESOLVED OBSOLETE
Severity: normal
CC: alan, antoine, bem, chrisahrendt, dispiste, erik.andren, harviecz, mcgrof, me, mozilla_bugs, vitaliy.tokarev, xerofoify
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.3.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments: add a reset lock

Description Rich Ercolani 2008-11-19 16:39:37 UTC
Latest working kernel version: [none]
Earliest failing kernel version: 2.6.27 [tested with .6 as well]
Distribution: Ubuntu
Hardware Environment: ThinkPad T61p
Software Environment: Ubuntu 8.10 x86_64

Problem Description:
Eventually [there is no evident deterministic time period], my wireless card stops working under Linux. dmesg is flooded alternately with WARNINGs from warn_on_slowpath in rate_control_pid_get_rate and the following:
ath5k phy0: failed to reset the MAC Chip
ath5k phy0: can't reset hardware (-5)

Attempting to reload the driver does not help - I just get the above printout again.

Sometimes, when I first boot the machine, it outputs the above messages and then the card recovers, and I can connect to wireless networks - but if this behavior occurs once I've already connected to a wireless network, it will not recover, and will remain this way across driver reloads.

Steps to reproduce:
1) Have an ath5k miniPCI wireless card in my T61p.
2) Load the ath5k driver [in any version I've ever tried - I first tried the ath5k driver in 2.6.24 with the compat-wireless-old snapshots, and it also had this problem].
3) Try connecting to some wireless networks - eventually, it will stop working, and you will need to reboot.
Comment 1 Luis Chamberlain 2008-12-01 15:09:52 UTC
This is fixed in wireless-testing:

Author: Bob Copeland <me@bobcopeland.com>
Date:   Mon Nov 3 22:14:00 2008 -0500

    ath5k: correct handling of rx status fields
    
    ath5k_rx_status fields rs_antenna and rs_more are u8s, but we
    were setting them with bitwise ANDs of 32-bit values.
    
    As a consequence, jumbo frames would not be discarded as intended.
    Then, because the hw rate value of such frames is zero, and, since
    "ath5k: rates cleanup", we do not fall back to the basic rate, such
    packets would trigger the following WARN_ON:
    
    ------------[ cut here ]------------
    WARNING: at net/mac80211/rx.c:2192 __ieee80211_rx+0x4d/0x57e [mac80211]()
    Modules linked in: ath5k af_packet sha256_generic aes_i586 aes_generic cbc loop i915 drm binfmt_misc acpi_cpu
    Pid: 0, comm: swapper Tainted: G        W  2.6.28-rc2-wl #14
    Call Trace:
     [<c0123d1e>] warn_on_slowpath+0x41/0x5b
     [<c012005d>] ? sched_debug_show+0x31e/0x9c6
     [<c012489f>] ? vprintk+0x369/0x389
     [<c0309539>] ? _spin_unlock_irqrestore+0x54/0x58
     [<c011cd8f>] ? try_to_wake_up+0x14f/0x15a
     [<f81918cb>] __ieee80211_rx+0x4d/0x57e [mac80211]
     [<f828872a>] ath5k_tasklet_rx+0x5a1/0x5e4 [ath5k]
     [<c013b9cd>] ? clockevents_program_event+0xd4/0xe3
     [<c01283a9>] tasklet_action+0x94/0xfd
     [<c0127d19>] __do_softirq+0x8c/0x13e
     [<c0127e04>] do_softirq+0x39/0x55
     [<c0128082>] irq_exit+0x46/0x85
     [<c010576c>] do_IRQ+0x9a/0xb2
     [<c010461c>] common_interrupt+0x28/0x30
     [<f80e934a>] ? acpi_idle_enter_bm+0x2ad/0x31b [processor]
     [<c02976bf>] cpuidle_idle_call+0x65/0x9a
     [<c010262c>] cpu_idle+0x76/0xa6
     [<c02fb402>] rest_init+0x62/0x64
    
    Signed-off-by: Bob Copeland <me@bobcopeland.com>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>
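
For readers unfamiliar with the failure mode the commit message describes, here is a minimal sketch of what happens when a masked 32-bit value is stored directly in a u8. It uses Python's ctypes.c_uint8 as a stand-in for the C field. The mask value, the sample status word, and the function names are assumptions made for illustration and are not copied from the driver; the "fixed" variant shows one common way to avoid the truncation, not necessarily what the patch does.

```python
import ctypes

# Illustrative mask: a single flag bit that sits above bit 7 of a 32-bit
# status word. The value is an assumption, not copied from the driver.
STATUS_MORE_MASK = 0x00001000


def store_more_buggy(status_word: int) -> int:
    """Mimic assigning (status & mask) straight into a u8 field."""
    # c_uint8 keeps only the low 8 bits, like an implicit C conversion.
    return ctypes.c_uint8(status_word & STATUS_MORE_MASK).value


def store_more_fixed(status_word: int) -> int:
    """Reduce the masked value to 0 or 1 before storing it in the u8."""
    return ctypes.c_uint8(1 if status_word & STATUS_MORE_MASK else 0).value


status = 0x00001234  # hypothetical descriptor word with the flag set
print(store_more_buggy(status))  # 0: the flag is silently lost
print(store_more_fixed(status))  # 1: the flag survives
```

With the flag lost, a check such as "discard the frame if more is set" never fires, which matches the commit's statement that jumbo frames were not discarded as intended.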
Comment 2 Luis Chamberlain 2008-12-01 15:10:49 UTC
This just needs to be propagated to 2.6.27, if that has not already been done.
Comment 3 Bob Copeland 2008-12-02 09:46:37 UTC
I'm not so sure - can you test the patch and report if it helps (it should apply to .27 with fuzz).  I didn't send it for 2.6.27 since it was a rate cleanup in .28-rc1 that started the above warning.  However if we report packets with hwrate=1 for jumbo frames then it could still trigger a separate warning in the pid rate controller.

Patch is here:
http://bugzilla.kernel.org/show_bug.cgi?id=11901
Comment 4 Rich Ercolani 2008-12-02 16:34:13 UTC
I just pulled latest wireless-testing and ran it - it fixes can't reset hardware (as far as I've seen), but I still get warn_on_slowpath.

Call Trace:
 <IRQ>  [<ffffffff8024e9b4>] warn_on_slowpath+0x64/0x90
 [<ffffffff802e1d4b>] ? __slab_alloc+0x24b/0x260
 [<ffffffff803b353c>] ? map_single+0x1fc/0x280
 [<ffffffff80234059>] ? __phys_addr+0x9/0x50
 [<ffffffff803b383d>] ? swiotlb_map_single_attrs+0x6d/0xf0
 [<ffffffff80232b80>] ? swiotlb_map_single_phys+0x0/0x20
 [<ffffffffa02b0669>] lbm_cw___ieee80211_rx+0x1c9/0x630 [lbm_cw_mac80211]
 [<ffffffff803b31f0>] ? unmap_single+0x140/0x160
 [<ffffffffa02f0c6d>] ath5k_tasklet_rx+0x34d/0x5c0 [ath5k]
 [<ffffffff802546c6>] tasklet_action+0x86/0x110
 [<ffffffff80254d8c>] __do_softirq+0x8c/0x100
 [<ffffffff8021417c>] call_softirq+0x1c/0x30
 [<ffffffff80215875>] do_softirq+0x65/0xa0
 [<ffffffff80254af5>] irq_exit+0x95/0xa0
 [<ffffffff80215b1b>] do_IRQ+0x8b/0x100
 [<ffffffff80212f0e>] ret_from_intr+0x0/0x29
 <EOI>  [<ffffffffa003ac09>] ? acpi_idle_enter_bm+0x287/0x2d7 [processor]
 [<ffffffffa003ac01>] ? acpi_idle_enter_bm+0x27f/0x2d7 [processor]
 [<ffffffff80272e39>] ? tick_nohz_get_sleep_length+0x9/0x30
 [<ffffffff8044bb49>] ? cpuidle_idle_call+0xb9/0x100
 [<ffffffff80210e95>] ? cpu_idle+0x75/0x110
 [<ffffffff804f0536>] ? rest_init+0x66/0x70
Comment 5 Luis Chamberlain 2008-12-02 16:42:59 UTC
This is different, re-opening the bug.
Comment 6 Rich Ercolani 2008-12-10 09:28:00 UTC
No, no, I lied. It doesn't fix can't reset hardware - it even spits out different kinds of can't reset hardware. Great.

Dec 10 11:59:26 eris kernel: [17580.657110] ath5k phy0: can't reset hardware (-11)
Dec 10 11:59:26 eris kernel: [17580.657137] ------------[ cut here ]------------
Dec 10 11:59:26 eris kernel: [17580.657143] WARNING: at /home/rich/linux-backports-modules-2.6.27-2.6.27/debian/build/build-generic/compat-wireless-2.6/net/mac80211/main.c:227 ieee80211_hw_config+0xac/0xb0 [lbm_cw_mac80211]()
Dec 10 11:59:26 eris kernel: [17580.657151] Modules linked in: aes_x86_64 aes_generic binfmt_misc af_packet rfcomm bridge stp bnep sco l2cap bluetooth kvm_intel kvm kqemu ppdev autofs4 ipv6 acpi_cpufreq cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative cpufreq_ondemand freq_table container pci_slot sbs sbshc iptable_filter ip_tables x_tables uvesafb sbp2 parport_pc lp parport pcmcia joydev thinkpad_acpi rfkill arc4 nvram sdhci_pci snd_hda_intel ecb crypto_blkcipher snd_pcm_oss sdhci psmouse snd_mixer_oss evdev serio_raw yenta_socket mmc_core snd_pcm rsrc_nonstatic pcmcia_core ricoh_mmc pcspkr ath5k iTCO_wdt iTCO_vendor_support snd_seq_dummy lbm_cw_mac80211 nvidia(P) snd_seq_oss battery lbm_cw_cfg80211 ac i2c_core led_class snd_seq_midi video output snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device wmi snd soundcore button snd_page_alloc shpchp pci_hotplug intel_agp ext3 jbd mbcache sr_mod cdrom sd_mod crc_t10dif sg ata_piix ata_generic ahci pata_acpi libata scsi_mod ohci1394
Dec 10 11:59:26 eris kernel: ieee1394 uhci_hcd ehci_hcd e1000e usbcore dock thermal processor fan fbcon tileblit font bitblit softcursor fuse
Dec 10 11:59:26 eris kernel: [17580.657353] Pid: 7541, comm: wpa_supplicant Tainted: P        W 2.6.27-9-generic #1
Dec 10 11:59:26 eris kernel: [17580.657358] 
Dec 10 11:59:26 eris kernel: [17580.657359] Call Trace:
Dec 10 11:59:26 eris kernel: [17580.657371]  [<ffffffff8024e9b4>] warn_on_slowpath+0x64/0x90
Dec 10 11:59:26 eris kernel: [17580.657388]  [<ffffffffa0a6d0bf>] ? ath5k_hw_reset+0xf1f/0x1150 [ath5k]
Dec 10 11:59:26 eris kernel: [17580.657397]  [<ffffffff803a73d5>] ? __ratelimit+0xa5/0xf0
Dec 10 11:59:26 eris kernel: [17580.657404]  [<ffffffff80471665>] ? net_ratelimit+0x15/0x20
Dec 10 11:59:26 eris kernel: [17580.657419]  [<ffffffffa0a70292>] ? ath5k_reset+0x232/0x240 [ath5k]
Dec 10 11:59:26 eris kernel: [17580.657445]  [<ffffffffa0a1540c>] ieee80211_hw_config+0xac/0xb0 [lbm_cw_mac80211]
Dec 10 11:59:26 eris kernel: [17580.657476]  [<ffffffffa0a2ce0d>] ieee80211_set_freq+0x8d/0x90 [lbm_cw_mac80211]
Dec 10 11:59:26 eris kernel: [17580.657502]  [<ffffffffa0a16135>] ieee80211_ioctl_siwfreq+0xf5/0x100 [lbm_cw_mac80211]
Dec 10 11:59:26 eris kernel: [17580.657512]  [<ffffffff804ea6b6>] ioctl_standard_call+0x66/0xf0
Dec 10 11:59:26 eris kernel: [17580.657520]  [<ffffffff804ea650>] ? ioctl_standard_call+0x0/0xf0
Dec 10 11:59:26 eris kernel: [17580.657528]  [<ffffffff804ea250>] ? ioctl_private_call+0x0/0xb0
Dec 10 11:59:26 eris kernel: [17580.657535]  [<ffffffff804ea8d0>] wireless_process_ioctl+0xe0/0x170
Dec 10 11:59:26 eris kernel: [17580.657543]  [<ffffffff804ea250>] ? ioctl_private_call+0x0/0xb0
Dec 10 11:59:26 eris kernel: [17580.657550]  [<ffffffff804ea650>] ? ioctl_standard_call+0x0/0xf0
Dec 10 11:59:26 eris kernel: [17580.657557]  [<ffffffff804ea9cf>] wext_ioctl_dispatch+0x6f/0xb0
Dec 10 11:59:26 eris kernel: [17580.657565]  [<ffffffff804eab26>] wext_handle_ioctl+0x46/0x90
Dec 10 11:59:26 eris kernel: [17580.657573]  [<ffffffff80466b2f>] dev_ioctl+0x3cf/0x410
Dec 10 11:59:26 eris kernel: [17580.657581]  [<ffffffff802e97d9>] ? do_sync_write+0xf9/0x140
Dec 10 11:59:26 eris kernel: [17580.657589]  [<ffffffff8026afaf>] ? hrtimer_start+0xdf/0x1b0
Dec 10 11:59:26 eris kernel: [17580.657596]  [<ffffffff80454f01>] sock_ioctl+0x91/0x280
Dec 10 11:59:26 eris kernel: [17580.657604]  [<ffffffff802f8586>] vfs_ioctl+0x36/0xb0
Dec 10 11:59:26 eris kernel: [17580.657610]  [<ffffffff802f8883>] do_vfs_ioctl+0x283/0x2f0
Dec 10 11:59:26 eris kernel: [17580.657616]  [<ffffffff802f8991>] sys_ioctl+0xa1/0xb0
Dec 10 11:59:26 eris kernel: [17580.657625]  [<ffffffff8021285a>] system_call_fastpath+0x16/0x1b
Dec 10 11:59:26 eris kernel: [17580.657630] 
Dec 10 11:59:26 eris kernel: [17580.657634] ---[ end trace a6dd66994c3403c0 ]---
Dec 10 11:59:26 eris kernel: [17580.672186] wlan0: authenticate with AP 00:13:c3:59:48:91
Dec 10 11:59:26 eris kernel: [17580.702284] ath5k phy0: failed to wakeup the MAC Chip
Dec 10 11:59:26 eris kernel: [17580.702288] ath5k phy0: can't reset hardware (-5)
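
The numbers in parentheses in the "can't reset hardware" lines are negative errno values returned by the driver. On Linux, 5 is EIO and 11 is EAGAIN, so the log shows two different failure paths. A small sketch for decoding such codes on the machine that produced the log follows; describe_kernel_error is a hypothetical helper name, and the mapping comes from the local platform's errno table, so it should be run on Linux.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel return code to its errno name and message."""
    number = -code
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{code}: {name} ({os.strerror(number)})"


# The two codes seen in this report.
for code in (-5, -11):
    print(describe_kernel_error(code))
```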
Comment 7 Bob Copeland 2008-12-10 10:13:33 UTC
Yup I've seen this before.  This is ieee80211_hw_config complaining because we return failure from ->config.  I guess we should detect this condition then disable hardware.  At least then it doesn't fill up syslog with a zillion errors.  Fact is, once we're in this state there's not much to be done.  Card is totally hung and only a power cycle (e.g. suspend/resume cycle or reboot) seems to bring it back.
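
A rough sketch of the "detect this condition, then disable hardware" idea, written in Python only for brevity. The Radio class, its fields, and the callables passed to it are all hypothetical and do not correspond to ath5k or mac80211 code. It only illustrates the behaviour described: after the first failed reset the device is marked dead, and later config calls fail without touching the hardware or logging again.

```python
import errno


class Radio:
    """Hypothetical stand-in for the driver state; not ath5k code."""

    def __init__(self, reset_hw, log):
        self.reset_hw = reset_hw  # callable returning 0 or a negative errno
        self.log = log            # callable taking one message string
        self.disabled = False

    def config(self):
        # Once the hardware is marked dead, fail fast without touching
        # it again and without logging another error.
        if self.disabled:
            return -errno.EIO
        ret = self.reset_hw()
        if ret < 0:
            self.log(f"can't reset hardware ({ret}); disabling device")
            self.disabled = True
        return ret


# Simulate a hung card: every reset attempt fails with -EIO.
messages = []
radio = Radio(reset_hw=lambda: -errno.EIO, log=messages.append)
results = [radio.config() for _ in range(1000)]
print(len(messages))  # 1: a single log line instead of a thousand
```

This does nothing to recover the card; as noted, only a power cycle seems to bring it back. It would only keep syslog readable.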
Comment 8 Antoine Pairet 2008-12-14 09:23:55 UTC
*** Bug 12175 has been marked as a duplicate of this bug. ***
Comment 9 Bob Copeland 2008-12-31 06:26:01 UTC
Created attachment 19574 [details]
add a reset lock

Please try this patch.  It won't change the fact that mac80211 will spew errors if config fails for some reason, but it should fix the root cause of hanging the device.
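
The attachment itself is not reproduced in this report, so the following is only a generic sketch of what a reset lock buys you, assuming the goal is that two code paths can never reset the chip at the same time. It is written in Python with threading.Lock as a stand-in; the Device class, its counters, and run_concurrent_resets are hypothetical, and the kind of lock the actual patch uses is not stated here.

```python
import threading
import time


class Device:
    """Hypothetical device whose reset must never run concurrently."""

    def __init__(self):
        self._reset_lock = threading.Lock()
        self._active = 0      # resets currently in progress
        self.max_active = 0   # highest concurrency ever observed
        self.resets_done = 0

    def reset(self):
        # Only one caller at a time gets past this point.
        with self._reset_lock:
            self._active += 1
            self.max_active = max(self.max_active, self._active)
            time.sleep(0.001)  # pretend the hardware takes a moment
            self.resets_done += 1
            self._active -= 1


def run_concurrent_resets(device, threads=8, per_thread=5):
    """Hammer reset() from several threads, as competing callers might."""
    def worker():
        for _ in range(per_thread):
            device.reset()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


dev = Device()
run_concurrent_resets(dev)
print(dev.max_active, dev.resets_done)
```

If the lock is removed from reset(), max_active can exceed 1, which is the analogue of one reset starting while another is still in progress.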
Comment 10 Rich Ercolani 2009-01-31 15:52:45 UTC
Patch seems to have resolved it - I can't swear to it, since the problem is sporadic, but I definitely haven't seen it crop up since I compiled+installed the patched driver a few days ago.
Comment 11 Erik Andrén 2009-12-30 10:01:39 UTC
What's the status of this bug? Is this still an issue with a recent kernel?
Comment 12 Rich Ercolani 2009-12-30 22:43:05 UTC
Works fine for me unpatched in 2.6.31. I think it worked in 2.6.28, but I can't swear to it at present.
Comment 13 Vitaliy Tokarev 2010-01-11 22:13:08 UTC
Fedora 12, 2.6.31.9-174.fc12.i686

The bug still exists on my system. I receive messages like this when downloading big files from the internet:

Jan 12 00:57:56 localhost kernel: ath5k phy0: noise floor calibration timeout (2422MHz)
Jan 12 00:58:46 localhost kernel: ath5k phy0: noise floor calibration timeout (2422MHz)
Jan 12 01:01:04 localhost kernel: ath5k phy0: noise floor calibration timeout (2422MHz)
Jan 12 01:01:06 localhost kernel: ath5k phy0: noise floor calibration timeout (2422MHz)
Jan 12 01:01:27 localhost kernel: ath5k phy0: noise floor calibration timeout (2422MHz)
Jan 12 01:01:28 localhost kernel: ath5k phy0: noise floor calibration timeout (2422MHz)

When this happens, the whole system slows down terribly. The program "top" does not show what is causing it. The mouse cursor begins to move jerkily, and there is a strong decrease in connection speed (about 30-35 KiB/s).

I cannot provide a debug log right now, because I would need to recompile the kernel first.
Comment 14 Tomas Mudrunka 2010-05-03 23:59:06 UTC
I've got the same problem (also Acer Aspire) with 2.6.33:
https://bugzilla.kernel.org/show_bug.cgi?id=ath5k-wakeup
(I created a separate bug because at the beginning it seemed to be destroying the EEPROM...)
Comment 15 Cesar Martinez Izquierdo 2010-05-22 11:25:18 UTC
I see the same symptoms on my LG E500. When the card eventually fails to function, the only fix is to switch the computer off and on again (rebooting does not solve the problem). Removing the module and loading it again makes the system unstable (it eventually hard freezes).

Now I have compiled the ath5k module with debug support. Is there any specific debug info I could get to help solve this problem?

I have also seen that developers are sending lots of patches to ath5k-devel. How can I know whether they are related to this bug? What's the best way to test them: should I apply the patches myself, or do you maintain a repository with these patches already applied?

Thanks for your time.
Comment 16 Tomas Mudrunka 2010-05-23 11:18:47 UTC
To Cesar: It seems I have the same issue here:
https://bugzilla.kernel.org/show_bug.cgi?id=15843
Comment 17 Cesar Martinez Izquierdo 2010-05-24 19:43:04 UTC
To Tomas: It looks similar, although I wouldn't say it is exactly the same problem (at least, according to the dmesg messages).
Comment 18 Cesar Martinez Izquierdo 2010-06-02 19:36:00 UTC
Hello, I have good news for ath5k (and bad news for my PC).
At least in my case, it seems to be a hardware problem, not a software problem.

I've been running some tests with Windows Vista, and I experience the same problem there (after some hours of use, the card stops working and I can only fix it by switching the computer off and on). So I wouldn't say it is an ath5k problem, but a hardware problem.
Comment 19 Tomas Mudrunka 2010-06-02 21:55:36 UTC
Cesar: And what if ath5k really corrupted the card? I saw some EEPROM errors in dmesg when those issues started, and there have been more cases of really corrupted EEPROMs in the past...

What card exactly do you have? (product, manufacturer and lspci -vv please)
I have the same issues with a Wistron CM9, but I guess that such a popular card can't be so buggy...
Comment 20 Cesar Martinez Izquierdo 2010-06-06 13:35:27 UTC
Tomas: The card works fine for some hours before it stops working... I don't think it would work at all if the EEPROM were permanently damaged. However, I'm not an expert in hardware or drivers at all...

The card is an internal one, in an LG E500 laptop.
Dmesg says:
ath5k phy0: Atheros AR2425 chip found (MAC: 0xe2, PHY: 0x70)

lspci -vv:

03:00.0 Ethernet controller: Atheros Communications Inc. AR5001 Wireless Network Adapter [168c:001c] (rev 01)
        Subsystem: Device 1a3b:1026
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: Memory at fe1f0000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [40] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable-
                Address: 00000000  Data: 0000
        Capabilities: [60] Express (v1) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us
                        ClockPM- Suprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        Capabilities: [90] MSI-X: Enable- Mask- TabSize=1
                Vector table: BAR=0 offset=00000000
                PBA: BAR=0 offset=00000000
        Capabilities: [100] Advanced Error Reporting <?>
        Capabilities: [140] Virtual Channel <?>
        Kernel driver in use: ath5k
        Kernel modules: ath5k
Comment 21 Chris Ahrendt 2012-04-23 23:14:32 UTC
Still occurring on the latest kernel.

00:09.0 Ethernet controller: Atheros Communications Inc. Atheros AR5001X+ Wireless Network Adapter (rev 01)
        Subsystem: Netgear Device 5e00
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 168 (2500ns min, 7000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 17
        Region 0: Memory at d4800000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: <access denied>
        Kernel driver in use: ath5k
        Kernel modules: ath5k

uname -a

Linux nautilus 3.3.2-1.fc16.i686 #1 SMP Sat Apr 14 01:11:09 UTC 2012 i686 i686 i386 GNU/Linux

This is on an internal PCI card.

If you need any additional information or traces let me know.
Comment 22 xerofoify 2014-06-25 01:55:19 UTC
This bug is old. I would recommend testing on a newer kernel; the newest as of this writing is 3.15.1.
Cheers, Nick
Comment 23 Alan 2015-02-19 15:16:32 UTC
This bug relates to a very old kernel. Closing as obsolete.