Bug 12080 - ath5k phy0: unable to reset hardware: -11
Summary: ath5k phy0: unable to reset hardware: -11
Status: CLOSED UNREPRODUCIBLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 normal
Assignee: Luis Chamberlain
URL:
Keywords:
Duplicates: 12849
Depends on:
Blocks:
 
Reported: 2008-11-22 11:28 UTC by Joshua Covington
Modified: 2009-07-06 18:18 UTC
CC List: 16 users

See Also:
Kernel Version: 2.6.27.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg with kernel panic backtrace (111.06 KB, application/octet-stream)
2008-12-02 17:02 UTC, Joshua Covington
debug messages 1 (119.35 KB, application/octet-stream)
2008-12-15 15:12 UTC, Joshua Covington
debug messages 2 (69.23 KB, application/octet-stream)
2008-12-15 15:16 UTC, Joshua Covington
card registers when locked (1.69 KB, application/octet-stream)
2008-12-15 15:25 UTC, Joshua Covington
atheros log messages (18.84 KB, application/octet-stream)
2008-12-23 10:44 UTC, Joshua Covington
Use a spinlock around ath5k_hw_reset (2.97 KB, patch)
2008-12-24 15:08 UTC, Bob Copeland
ath5k debug messages debug=0x103f (119.33 KB, application/octet-stream)
2008-12-26 04:10 UTC, Joshua Covington
ath5k debug messages2 debug=0x0033 (121.21 KB, application/octet-stream)
2008-12-26 10:39 UTC, Joshua Covington
fixed patch for latest compat-wireless (2.93 KB, patch)
2009-02-13 14:52 UTC, Joshua Covington

Description Joshua Covington 2008-11-22 11:28:40 UTC
Latest working kernel version:
n/a

Earliest failing kernel version:
up to now


Distribution:
fedora, vanilla

Hardware Environment:
08:04.0 Ethernet controller: Atheros Communications Inc. AR2413 802.11bg NIC (rev 01)
        Subsystem: AMBIT Microsystem Corp. Unknown device 0418
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 168 (2500ns min, 7000ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 21
        Region 0: Memory at c0200000 (32-bit, non-prefetchable) [size=64K]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-
        Kernel driver in use: ath5k_pci
        Kernel modules: ath5k

Software Environment:
installed are
system-config-network-tui-1.5.10-1.fc9.noarch
system-config-network-1.5.10-1.fc9.noarch
kdenetwork-4.1.3-1.fc9.i386
kdenetwork-libs-4.1.3-1.fc9.i386
NetworkManager-gnome-0.7.0-0.11.svn4022.4.fc9.i386
NetworkManager-0.7.0-0.11.svn4022.4.fc9.i386
NetworkManager-glib-0.7.0-0.11.svn4022.4.fc9.i386


Problem Description:
I got thousands of messages like:
ath5k phy0: gain calibration timeout (2412MHz)
ath5k phy0: unable to reset hardware: -11
After removing the driver I cannot reinsert it; only a hardware restart helps. I am still able to make a wireless connection, but if it is lost I need to restart the computer. The messages that I got are:
ath5k_pci 0000:08:04.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
ath5k_pci 0000:08:04.0: registered as 'phy0'                      
ath5k phy0: Atheros AR2413 chip found (MAC: 0x78, PHY: 0x45)      
device-mapper: multipath: version 1.0.5 loaded                    
ath5k phy0: noise floor calibration timeout (2422MHz)             
ath5k phy0: noise floor calibration timeout (2437MHz)             
ath5k phy0: ath5k_chan_set: unable to reset channel (2437 Mhz)    
ath5k phy0: gain calibration timeout (2412MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2412 Mhz)    
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2422 Mhz)    
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: can't reset hardware (-11)                            
ath5k phy0: noise floor calibration timeout (2422MHz)             
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: can't reset hardware (-11)                            
ath5k phy0: gain calibration timeout (2412MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2412 Mhz)    
ath5k phy0: gain calibration timeout (2417MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2417 Mhz)    
ath5k phy0: noise floor calibration failed (2417MHz)              
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2422 Mhz)    
ath5k phy0: gain calibration timeout (2427MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2427 Mhz)    
ath5k phy0: gain calibration timeout (2432MHz)                    
ath5k phy0: gain calibration timeout (2412MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2412 Mhz)    
ath5k phy0: gain calibration timeout (2417MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2417 Mhz)    
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2422 Mhz)    
ath5k phy0: gain calibration timeout (2427MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2427 Mhz)    
ath5k phy0: gain calibration timeout (2432MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2432 Mhz)    
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: can't reset hardware (-11)                            
ath5k phy0: gain calibration timeout (2412MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2412 Mhz)    
ath5k phy0: gain calibration timeout (2417MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2417 Mhz)    
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2422 Mhz)    
ath5k phy0: gain calibration timeout (2427MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2427 Mhz)    
ath5k phy0: gain calibration timeout (2462MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2462 Mhz)    
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2422 Mhz)    
ath5k phy0: noise floor calibration timeout (2422MHz)             
ath5k phy0: gain calibration timeout (2412MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2412 Mhz)    
ath5k phy0: gain calibration timeout (2417MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2417 Mhz)    
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2422 Mhz)    
ath5k phy0: gain calibration timeout (2427MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2427 Mhz)    
ath5k phy0: gain calibration timeout (2432MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2432 Mhz)    
ath5k phy0: gain calibration timeout (2437MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2437 Mhz)    
ath5k phy0: gain calibration timeout (2442MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2442 Mhz)    
ath5k phy0: gain calibration timeout (2447MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2447 Mhz)    
ath5k phy0: gain calibration timeout (2452MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2452 Mhz)    
ath5k phy0: gain calibration timeout (2457MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2457 Mhz)    
ath5k phy0: gain calibration timeout (2412MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2412 Mhz)    
ath5k phy0: gain calibration timeout (2417MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2417 Mhz)    
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2422 Mhz)    
ath5k phy0: gain calibration timeout (2427MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2427 Mhz)    
ath5k phy0: gain calibration timeout (2432MHz)                    
ath5k phy0: ath5k_chan_set: unable to reset channel (2432 Mhz)    
ath5k phy0: noise floor calibration timeout (2422MHz)             
ath5k phy0: noise floor calibration timeout (2422MHz)             
ath5k phy0: noise floor calibration timeout (2422MHz)             
ath5k phy0: noise floor calibration timeout (2422MHz)             
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: unable to reset hardware: -11                         
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: unable to reset hardware: -11                         
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: unable to reset hardware: -11                         
ath5k phy0: gain calibration timeout (2422MHz)                    
ath5k phy0: unable to reset hardware: -11                         
ath5k_pci 0000:08:04.0: PCI INT A disabled                        
ath5k_pci 0000:08:04.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
ath5k_pci 0000:08:04.0: registered as 'phy1'                      
ath5k phy1: Atheros AR2413 chip found (MAC: 0x78, PHY: 0x45)      
ath5k phy1: gain calibration timeout (2412MHz)                    
ath5k phy1: unable to reset hardware: -11                         
ath5k phy1: gain calibration timeout (2412MHz)                    
ath5k phy1: unable to reset hardware: -11                         
ath5k phy1: gain calibration timeout (2412MHz)                    
ath5k phy1: unable to reset hardware: -11                         
ath5k_pci 0000:08:04.0: PCI INT A disabled
ath5k_pci 0000:08:04.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
ath5k_pci 0000:08:04.0: registered as 'phy2'
ath5k phy2: gain calibration timeout (2412MHz)
ath5k phy2: unable to reset hardware: -11
ath5k phy2: Atheros AR2413 chip found (MAC: 0x78, PHY: 0x45)
ath5k phy2: gain calibration timeout (2412MHz)
ath5k phy2: unable to reset hardware: -11
ath5k phy2: gain calibration timeout (2412MHz)
ath5k phy2: unable to reset hardware: -11
ath5k_pci 0000:08:04.0: PCI INT A disabled
ath5k_pci 0000:08:04.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
ath5k_pci 0000:08:04.0: registered as 'phy3'
ath5k phy3: Atheros AR2413 chip found (MAC: 0x78, PHY: 0x45)
ath5k phy3: gain calibration timeout (2412MHz)
ath5k phy3: unable to reset hardware: -11
ath5k phy3: gain calibration timeout (2412MHz)
ath5k phy3: unable to reset hardware: -11
ath5k_pci 0000:08:04.0: PCI INT A disabled
ath5k_pci 0000:08:04.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
ath5k_pci 0000:08:04.0: registered as 'phy4'
ath5k phy4: Atheros AR2413 chip found (MAC: 0x78, PHY: 0x45)
ath5k phy4: gain calibration timeout (2412MHz)
ath5k phy4: unable to reset hardware: -11
ath5k phy4: gain calibration timeout (2412MHz)
ath5k phy4: unable to reset hardware: -11
ath5k_pci 0000:08:04.0: PCI INT A disabled
ath5k_pci 0000:08:04.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
ath5k_pci 0000:08:04.0: registered as 'phy5'
ath5k phy5: Atheros AR2413 chip found (MAC: 0x78, PHY: 0x45)
ath5k phy5: gain calibration timeout (2412MHz)
ath5k phy5: unable to reset hardware: -11
ath5k phy5: gain calibration timeout (2412MHz)
ath5k phy5: unable to reset hardware: -11


Steps to reproduce:
This is on an Acer Aspire 5051awxmi notebook with an Atheros AR5BMB5 (this is what the label says). I think this is an AR5005. madwifi works fine.
Comment 1 Luis Chamberlain 2008-12-01 15:16:47 UTC
Please try compat-wireless to see if this is still an issue for you on wireless-testing:

http://wireless.kernel.org/en/users/Download
Comment 2 Max Bowsher 2008-12-01 20:07:58 UTC
This looks similar to what I'm seeing on the Acer Aspire One, using Ubuntu Intrepid, including the linux-backports-modules-intrepid package, which iiuc is a packaging of compat-wireless. For me it happens only occasionally on initial power-on, but somewhat more frequently after a reboot than after a power cycle.
Comment 3 Bob Copeland 2008-12-02 06:59:55 UTC
I've seen it for quite a while though it is much less frequent these days.  A suspend-resume cycle will also work.  "unable to reset hardware: -11" means the card is fully hung at that point and needs powering down.

It would be helpful if you can consistently reproduce this on the latest compat-wireless.
Comment 4 Joshua Covington 2008-12-02 17:02:01 UTC
Created attachment 19109 [details]
dmesg with kernel panic backtrace

I tried compat-wireless-02-12-2008 and this is the result: a kernel panic. At first it worked fine and gave me a stronger wireless signal, but when I tried to remove the driver for a second time I got lots of kernel panics (see the file), and then I couldn't do anything else. I tried to reinsert the driver two more times, but without luck.
Comment 5 Joshua Covington 2008-12-10 09:37:17 UTC
Hey, it has been more than 10 days and there is no message here. If I can somehow help, let me know.
Comment 6 Bob Copeland 2008-12-10 10:21:59 UTC
The problem is, while I see it happen, I cannot consistently reproduce it.  Are you able to?  We can fix the backtraces but can't yet fix the underlying problem without knowing which sequence is hanging the card.  By the way, AFAICT this is the same bug as http://bugzilla.kernel.org/show_bug.cgi?id=12068.
Comment 7 Joshua Covington 2008-12-10 10:54:59 UTC
Yes, the bug seems very similar to this one (he gets error -5, I get -11).

I cannot reproduce it either. Maybe it is something connected to the ACPI system. Maybe ACPI can try to wake up the card, or just prevent it from entering this sleep state.

Since a power reboot can correct this, maybe you can try to make the card go through a "fake" power reboot so that it can wake up.

These are just ideas :)
Comment 8 Bob Copeland 2008-12-10 11:19:05 UTC
Well, I guess one thing to try is debugging what actually causes the reset.  That can be had by: 

$ echo intr > /debug/ath5k/phy0/debug
$ echo beacon > /debug/ath5k/phy0/debug

That will add a whole lot of debugging messages to the syslog, but it will tell us which interrupt preceded the reset (probably INT_FATAL, which also won't tell us much, but at least that would narrow things a bit).  Or for less debugging output, put a printk directly in ath5k_intr() in the various cases that reschedule restq.
Comment 9 Joshua Covington 2008-12-10 15:17:25 UTC
OK.

I can try it, but where should I look for /debug/ath5k/phy0/debug? Are you talking about the source code? Is there nothing like this in /proc?

If this is in the source, how can I recompile only the ath5k module? I don't want to recompile the whole of compat-wireless, because I don't need the other drivers and it takes about 15 minutes.
Comment 10 Bob Copeland 2008-12-11 07:18:10 UTC
You just need to be sure that ath5k is compiled with debug support, and mount debugfs somewhere (as root: mkdir /debug && mount -t debugfs none /debug).  You can edit config.mk(?) to only build ath5k (and mac80211 etc) if you want to rebuild it.  Without looking, I'm not 100% sure of the details wrt compat-wireless.

Another thought occurs to me - we often hit this during a scan when mac80211 calls ->config() to set the channels.  That could be a coincidence (since this just happens to invoke reset which could generally be broken), or it could be racing with other hardware code.  I'm going to hack up some code to torture test reset, maybe I can get it in the bad state more reliably.
Comment 11 Joshua Covington 2008-12-11 13:20:13 UTC
I'll try it with compat-wireless-2008-12-10 and report back.
Comment 12 Erwin Burema 2008-12-14 02:32:35 UTC
I have also encountered this bug, and I have turned on debugging.
Comment 13 Joshua Covington 2008-12-15 15:12:26 UTC
Created attachment 19316 [details]
debug messages 1

This is the part of the dmesg that contains the debug messages. I tried about 15 times until I managed to 'lock' the card. This is on fedora-kernel-2.6.27.9-69.fc9.i686 with compat-wireless-2008-12-11. I've marked the places in the file that may be of interest.
Comment 14 Joshua Covington 2008-12-15 15:16:17 UTC
Created attachment 19317 [details]
debug messages 2

Here are other debug messages. It is not that easy to lock the card. I've marked some interesting places in the file. The message that looks strange to me is:
wlan0: Failed to config new BSSID to the low-level driver

After this I need to do a hardware reset. Let me know if more info is needed.
Comment 15 Joshua Covington 2008-12-15 15:25:45 UTC
Created attachment 19318 [details]
card registers when locked

These are the card registers when it is 'locked'.
Comment 16 Bob Copeland 2008-12-15 15:40:55 UTC
Ok, well it's definitely doing a config() when setting bssid, then a scan happens.  interrupt before that is TXDESC, looks harmless.  I also changed calib_tim to 1 second and made it always reset; that seemed to worsen things to where I could frequently get lockup.  It can spend 20s trying to calibrate the noise level so wouldn't be surprised if it races with itself in reset.  
Comment 17 Joshua Covington 2008-12-15 15:55:27 UTC
Is there a position in the registers that indicates whether the card is working normally? If so, we could check the corresponding flag before a reset and avoid the endless loop.
Comment 18 Joshua Covington 2008-12-17 10:08:06 UTC
Is there something else I can do to help with the debugging or testing of this?
Comment 19 Bob Copeland 2008-12-17 12:46:38 UTC
For now, no... I think it is clear that reset with config() is problematic, and with some changes I can reproduce it pretty frequently now.  It might just be a matter of locking the sc mutex within config, or scheduling the changes to happen inside restq.
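
The locking idea above can be sketched in user space. This is only an analogy, assuming POSIX threads in place of the kernel's mutex or spinlock; `fake_softc`, `fake_reset`, `reset_worker` and `run_two_reset_paths` are hypothetical names for illustration and are not taken from the ath5k source. Two threads stand in for the config() path and the periodic calibration path, and both go through one lock before touching the (simulated) hardware.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-device state. */
struct fake_softc {
    pthread_mutex_t lock;   /* serializes every path that resets the chip */
    int in_reset;           /* 1 while a reset is touching the hardware */
    int resets_done;        /* number of completed resets */
};

void fake_softc_init(struct fake_softc *sc)
{
    pthread_mutex_init(&sc->lock, NULL);
    sc->in_reset = 0;
    sc->resets_done = 0;
}

/* Single reset entry point shared by all callers. */
void fake_reset(struct fake_softc *sc)
{
    pthread_mutex_lock(&sc->lock);
    /* With the lock held, no other reset can be in progress. */
    assert(sc->in_reset == 0);
    sc->in_reset = 1;
    /* ... the register programming sequence would go here ... */
    sc->in_reset = 0;
    sc->resets_done++;
    pthread_mutex_unlock(&sc->lock);
}

/* Thread body: issue many resets, as a scan or a calibration timer might. */
void *reset_worker(void *arg)
{
    struct fake_softc *sc = arg;
    for (int i = 0; i < 1000; i++)
        fake_reset(sc);
    return NULL;
}

/* Run two competing reset paths; returns the number of completed resets. */
int run_two_reset_paths(struct fake_softc *sc)
{
    pthread_t a, b;

    fake_softc_init(sc);
    pthread_create(&a, NULL, reset_worker, sc);
    pthread_create(&b, NULL, reset_worker, sc);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return sc->resets_done;
}
```

Without the lock, the two workers could interleave inside the reset sequence, which is the kind of race suspected here. Whether the real fix belongs in config() or in restq is a separate question that this sketch does not answer.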
Comment 20 Joshua Covington 2008-12-20 16:27:08 UTC
I was looking into the change log and saw this:

commit	5a3503abfc5a2e51a27c0b28339e04b24cedad60
ath5k: Update interrupt masking code

commit	994d90627030722ff38ef134907c7b3c7d3aebae
ath5k: Clean up eeprom parsing and add missing calibration data

commit	23c401574b16cb2b6d2231ba405ebf85b8c87de5
ath5k: ignore the return value of ath5k_hw_noise_floor_calibration

Maybe the last one can help to resolve the problem. If the card needs to call some reset function because of the noise floor calibration, then this commit makes it ignore the return value. Some new bits from the legacy HAL have also been added. Are these in compat-wireless-2008-12-20.tar.bz2 so that I can test them?
Comment 21 Joshua Covington 2008-12-21 17:20:45 UTC
this is what I get with compat-wireless-2008-12-11 on 2.6.27.9-74.fc9.i686:

ath5k phy0: noise floor calibration timeout (2422MHz)                                                                                    
--- this message repeats 32 times ---
ath5k phy0: noise floor calibration timeout (2422MHz)                                                                                    
ath5k phy0: failed to warm reset the MAC Chip                                                                                            
ath5k phy0: can't reset hardware (-5)                                                                                                    
ath5k phy0: noise floor calibration timeout (2422MHz)                                                                                    
--- this message repeats 40 times ---
ath5k phy0: failed to warm reset the MAC Chip
ath5k phy0: can't reset hardware (-5)

This is a flood of calibration errors, but there is a new one:
failed to warm reset the MAC Chip
Comment 22 Bob Copeland 2008-12-22 12:54:23 UTC
Indeed, I thought 23c401574b16cb2b6d2231ba405ebf85b8c87de5 had gone to 2.6.28 already.  You can still get hangs with the card but then you usually get the -5 error.
Comment 23 Joshua Covington 2008-12-22 13:10:42 UTC
OK, this means that the problem is still unsolved. How can I help? I'm not a programmer, but not a newbie either. Who should/will take care of this?

Is there a difference between error -11 and -5? If the problem is in the reset routine, maybe Atheros should give a hint. I haven't experienced many problems with the madwifi driver based on the proprietary HAL.
Comment 24 Bob Copeland 2008-12-22 13:22:33 UTC
I am working on it... there's no real difference, it comes down to trying to write to a register, trying to read back the result, then the card starts returning junk (all ones).  We return different error codes based on when the read fails.
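
As a rough illustration of the point above, the following C sketch polls a simulated register and reports a different negative errno depending on which stage the read-back fails in. On Linux, errno 11 is EAGAIN and errno 5 is EIO. Everything else is hypothetical and not taken from the ath5k source: the two-stage split, the bit masks, the retry count, and the names `poll_until_clear`, `sketch_reset`, `read_dead` and `read_idle` are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register reader: stands in for an MMIO read. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/*
 * Poll until all bits in mask read back as zero, giving up after
 * max_tries reads. Returns 0 on success, -EAGAIN on timeout.
 */
int poll_until_clear(reg_read_fn rd, void *ctx, uint32_t mask, int max_tries)
{
    assert(rd != NULL && max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        if ((rd(ctx) & mask) == 0)
            return 0;
    }
    return -EAGAIN;
}

/*
 * Two-stage reset sketch (assumed layout): a failure while waking the
 * chip is reported as -EIO, a failure while waiting for calibration to
 * finish is passed through as -EAGAIN.
 */
int sketch_reset(reg_read_fn wake_rd, reg_read_fn cal_rd, void *ctx)
{
    if (poll_until_clear(wake_rd, ctx, 0x1u, 10) != 0)
        return -EIO;
    return poll_until_clear(cal_rd, ctx, 0x2u, 10);
}

/* Test doubles: a hung card reads as all ones, a healthy idle one as zero. */
uint32_t read_dead(void *ctx) { (void)ctx; return 0xffffffffu; }
uint32_t read_idle(void *ctx) { (void)ctx; return 0u; }
```

With these stand-ins, `sketch_reset(read_dead, read_idle, NULL)` yields -5 and `sketch_reset(read_idle, read_dead, NULL)` yields -11: the same kind of dead read, reported under a different code depending on which wait it trips first.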
Comment 25 Joshua Covington 2008-12-22 17:31:23 UTC
Thank you for this.

Actually, after a cold start-up I got these:
ath5k phy0: gain calibration timeout (2412MHz)
ath5k phy0: can't reset hardware (-11)
ath5k phy0: gain calibration timeout (2417MHz)
ath5k phy0: can't reset hardware (-11)
ath5k phy0: gain calibration timeout (2422MHz)
ath5k phy0: can't reset hardware (-11)
ath5k phy0: gain calibration timeout (2427MHz)
ath5k phy0: can't reset hardware (-11)
ath5k phy0: gain calibration timeout (2432MHz)
ath5k phy0: can't reset hardware (-11)

I couldn't make any connection :)

And there is something else:

I have another problem that at first appears to have nothing to do with the Atheros card. Sometimes my screen (!) locks up and only a hardware reset helps. I filed bugs against the X server, the ATI driver, etc., but nothing helps (kernel logs and X server logs show nothing). Everything works fine only if I use the compat-wireless package or the madwifi package. With madwifi I've never experienced such problems.

In the last 2 days I updated from Fedora kernel 2.6.27.9-69 to 2.6.27.9-74, and the problems occur only when I'm using the default (in Fedora 9) ath5k driver. After recompiling ath5k for the latest kernel everything is fine (the lockups occurred every time until I managed to finish the recompilation and load the new ath5k).

This sounds insane, but over the last 4 months I've tried different kernels and ATI drivers/X servers. While the ath5k driver is loaded I get my corrupted display in 75% of all cases. I'm almost 100% sure that this comes from the card, because after inserting compat-wireless-2008-12-11 I have not experienced any lockups, even with my old kernel. Before this I got the lockups in almost 75% of all cases.

Therefore I think that the current ath5k driver hijacks some IRQs in such a way that it totally locks up my machine. Is this possible? I'd like to enable full debugging for this but direct it to a separate file so that I can easily post it here. How can I send the debug output to a separate file?
Comment 26 Joshua Covington 2008-12-23 10:44:20 UTC
Created attachment 19458 [details]
atheros log messages

I hope this can shed some light on this problem. This is from Fedora 9 kernel 2.6.27.9-74 with compat-wireless-2008-12-21.

Actually, the kernel initializes the card (at least some of the cfg80211 routines go through) and then the ath5k driver cannot set it up. This is from a cold start-up.
Comment 27 Bob Copeland 2008-12-24 15:08:34 UTC
Created attachment 19475 [details]
Use a spinlock around ath5k_hw_reset

Please try this patch.  I haven't had a chance to really test it other than to see that it didn't lock up immediately.
Comment 28 Antoine Pairet 2008-12-25 04:22:34 UTC
I applied the patch against compat-wireless-2008-12-21 by running the following in the ath5k directory:

patch -b < bobcopeland.diff

I will give feedback soon.
thank you,
Comment 29 Joshua Covington 2008-12-25 05:14:46 UTC
I still haven't tested it. My problem is that the lockup occurs randomly, and the best way to reproduce it was to remove and reinsert the driver. But it didn't happen every time. Sometimes I had to repeat this more than 15(!) times in order to lock up the card.

So I'm not sure that I can properly test the patch. You said that after modifying something in the reset routine it was easier to lock up the card (Comment #19). Can you post what you modified so that I can directly force the card to reset itself or even lock itself up?

With compat-wireless-2008-12-21 it is a little bit harder to lock up the card, and the problem happens less frequently (less than 50% of the time) than earlier, so it's not that easy to verify the patch. How can I force the card to reset itself?
Comment 30 Joshua Covington 2008-12-26 04:10:30 UTC
Created attachment 19489 [details]
ath5k debug messages debug=0x103f

Here are some debug messages from compat-wireless-2008-12-25 with the patch applied. The debug level was set to 0x103f. I've marked some places in the file that may be of interest (here3 and here5). I removed and reinserted the driver 56(!) consecutive times and could not lock up the card. I think the patch works.

I also got some messages in the log files saying that the AP couldn't authenticate the card. The AP disconnected the card 3 times, but it didn't lock up and eventually made the connection. I think the disconnects happened because the card wasn't ready and couldn't transmit the key correctly. But as I said, no problems so far.

If there is a way to force it to lock up, let me know.
Comment 31 Bob Copeland 2008-12-26 08:41:32 UTC
Ok thanks for testing.  I had changed ath5k_calinterval to a lower number to make the lockup happen sooner.  The other thing you can try is applying the patch on top of vanilla 2.6.28.  I did see one hard lockup but I'm not sure if it was related to the patch or not, so I'll post the patch to ath5k-devel for more testing just to be sure.
Comment 32 Antoine Pairet 2008-12-26 09:24:30 UTC
As far as I am concerned, the patch seems to work pretty well. With compat-wireless-2008-12-21 I observed frequent lockups when waking from hibernation. After hibernation, I sometimes had no wireless connection anymore and the error "can't reset hardware (-11)" appeared in syslog.

Since I applied the patch, I have done approximately 15 cycles (hibernation, then wake-up) and have not observed a single lockup.

If I can do any other tests, I would be glad to help.
Thanks,
Comment 33 Joshua Covington 2008-12-26 10:39:58 UTC
Created attachment 19497 [details]
ath5k debug messages2 debug=0x0033

I set ath5k_calinterval=2 (I think this is low enough) and ran 19 cycles (remove -> insert the driver). No lockups so far. The debug messages (debug=0x0033) are attached.

I think the patch works fine. What about ath5k_calinterval? With the value set to 2 I got neither the -5 error nor the -11 error. Maybe the value should be lowered.
Comment 34 Joshua Covington 2008-12-26 11:35:28 UTC
(In reply to comment #31)
>
> The other thing you can try is applying the patch on top of vanilla 2.6.28.
>

I just had to change the offsets for vanilla 2.6.28. That's enough. I haven't recompiled the kernel, though, because it takes me about 40 minutes, but the patch applies to it.

Here are the new offsets:

@@ -525,6 +525,7 @@ ath5k_pci_probe(struct pci_dev *pdev,
@@ -2664,6 +2665,7 @@ ath5k_reset(struct ath5k_softc *sc, bool stop, bool change_channel)
@@ -2672,7 +2672,11 @@ ath5k_reset(struct ath5k_softc *sc, bool stop, bool change_channel)
@@ -152,6 +152,7 @@ struct ath5k_softc {
@@ -1620,7 +1620,7 @@ int ath5k_hw_rfregs(struct ath5k_hw *ah, struct ieee80211_channel *channel,
Comment 35 Joshua Covington 2008-12-31 05:59:18 UTC
In the last 3-4 days I've experienced various situations that earlier always resulted in the card locking itself up. With the patch this doesn't happen. I managed to make a connection in all cases. Only once did I have to reinsert the driver, but still no lockup occurred.

I've disabled debugging and therefore cannot attach any logs. But as I said, the patch is definitely working (at least in my case).

Thank you, Bob. Can you post the patch in http://bugzilla.kernel.org/show_bug.cgi?id=12068 ? Maybe this will help in their case, too.
Comment 36 Bob Copeland 2008-12-31 06:21:22 UTC
Ok thanks for testing.  Good idea, I'll post it on the other bug too.

I also posted it on ath5k-devel list a few days ago, so far no reports but I'll submit it for upstream in a week or so unless problems arise.
Comment 37 Joshua Covington 2009-01-13 10:09:34 UTC
Any news on whether this has already entered upstream/mainline?
Comment 38 Bob Copeland 2009-01-14 08:04:47 UTC
Sorry, I still haven't sent it yet.  I can reliably lock up my machine (by setting cal_interval to 1 and making it always reset the card).  I spent some time chasing that with netconsole and nmi watchdog but to little avail.  I want it to see some wider testing so perhaps now that the merge window is closed. 
Comment 39 Joshua Covington 2009-02-13 14:52:59 UTC
Created attachment 20240 [details]
fixed patch for latest compat-wireless

This patch has been working for the last month, but I got an error with the latest compat-wireless-20090213. This is the fixed version.

Any plans to queue this for stable? As long as cal_interval is > 1 it solves the problem. If it is set to 1, all of us will get this reset lockup anyway. Until then it is a workable solution.
Comment 40 Bob Copeland 2009-03-17 18:51:55 UTC
*** Bug 12849 has been marked as a duplicate of this bug. ***
Comment 41 John W. Linville 2009-06-11 14:26:11 UTC
Hey, Bob!  What's the story on this?  Do we need this patch?
Comment 42 Bob Copeland 2009-06-11 17:25:30 UTC
I don't think so.  In 2.6.30 we completely rewrote the reset stuff, and it seems to have largely solved these sorts of issues (I for one haven't seen them in a long time).  Consequently I didn't really want to submit this patch because it is a bit cargo-cultish, and I could make the kernel hang with it if I tried hard.  Though if anyone can still get these errors in 2.6.30 then I can forward-port it and put some time back into finding the lockup.
Comment 43 John W. Linville 2009-06-11 17:39:17 UTC
Sure, fine...Joshua, is this working for you on 2.6.30?
Comment 44 Joshua Covington 2009-07-05 01:02:10 UTC
I'm still stuck with 2.6.27.25 because I don't have time to update. I've had the patch applied ever since it was posted. I think you can close this if the whole code has been rewritten. I'll reopen it if I see this again.
