Bug 13000 - ath5k causes kernel panic
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Product: Networking
Classification: Unclassified
Component: Wireless
Hardware: All Linux
Importance: P1 normal
Assigned To: networking_wireless@kernel-bugs.osdl.org
Reported: 2009-04-03 01:35 UTC by Ognjen Maric
Modified: 2009-06-28 20:15 UTC

Kernel Version: 2.6.29
Tree: Mainline
Regression: No


Attachments
a photo of the dump (462.45 KB, image/jpeg)
2009-04-03 01:36 UTC, Ognjen Maric
handle rate control errors with a warning (1023 bytes, patch)
2009-04-09 12:01 UTC, Bob Copeland
Kernel config (56.51 KB, application/octet-stream)
2009-04-10 12:00 UTC, Ognjen Maric
excerpt from /var/log/messages.log (2.64 KB, text/plain)
2009-04-11 15:46 UTC, Ognjen Maric
Kernel stack trace photo (454.69 KB, image/jpeg)
2009-04-22 18:50 UTC, Ognjen Maric

Description Ognjen Maric 2009-04-03 01:35:27 UTC
After switching from madwifi to ath5k, I get kernel panics within minutes of using the ath5k driver. I'm attaching a photo of the panic dump. Here's the output of lspci (under madwifi/2.6.28.5):

05:00.0 Ethernet controller: Atheros Communications Inc. AR242x 802.11abg Wireless PCI Express Adapter (rev 01)
        Subsystem: Askey Computer Corp. Device 7106                                                            
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-  
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-   
        Latency: 0, Cache Line Size: 64 bytes                                                                  
        Interrupt: pin A routed to IRQ 18                                                                      
        Region 0: Memory at f0800000 (64-bit, non-prefetchable) [size=64K]                                     
        Capabilities: [40] Power Management version 2                                                          
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold-)                   
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-                                                    
        Capabilities: [50] MSI: Mask- 64bit- Count=1/1 Enable-                                                 
                Address: 00000000  Data: 0000                                                                  
        Capabilities: [60] Express (v1) Legacy Endpoint, MSI 00                                                
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <128ns, L1 <2us                         
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-                                        
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-                             
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-                                           
                        MaxPayload 128 bytes, MaxReadReq 512 bytes                                             
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-                            
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us             
                        ClockPM- Surprise- LLActRep- BwNot-                                                    
                LnkCtl: ASPM Disabled; RCB 128 bytes Disabled- Retrain- CommClk+                               
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-                                         
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-             
        Capabilities: [90] MSI-X: Enable- Mask- TabSize=1                                                      
                Vector table: BAR=0 offset=00000000                                                            
                PBA: BAR=0 offset=00000000                                                                     
        Capabilities: [100] Advanced Error Reporting                                                           
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-                                   
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-                                   
                AERCap: First Error Pointer: 14, GenCap+ CGenEn- ChkCap+ ChkEn-                                   
        Capabilities: [140] Virtual Channel <?>                                                                   
        Kernel driver in use: ath_pci                                                                             
        Kernel modules: ath_pci
Comment 1 Ognjen Maric 2009-04-03 01:36:42 UTC
Created attachment 20783
a photo of the dump
Comment 2 Bob Copeland 2009-04-04 15:51:52 UTC
(In reply to comment #1)
> Created an attachment (id=20783)
> a photo of the dump

Shot in the dark, but do you get the same with 2.6.29.1?  There was a patch in it that might help.  Also are you using adhoc or managed mode?
Comment 3 Ognjen Maric 2009-04-05 21:30:50 UTC
Tried out 2.6.29.1, unfortunately the system still duly freezes within minutes. I'm using managed mode (with WPA2, if that matters).

If there's something more I can do to help debug the problem, please let me know, as I have no experience with debugging kernel issues.
Comment 4 Bob Copeland 2009-04-06 02:10:50 UTC
Can you post your config?

Also, if you could turn off automatic association with APs, then try to grab a scan with iw, that might help:

$ sudo iw dev wlan0 scan trigger
# do this a few times
$ sudo iw dev wlan0 scan dump >> dump.log
Comment 5 Bob Copeland 2009-04-09 12:01:02 UTC
Created attachment 20904
handle rate control errors with a warning

Can you try this patch and report whether it helps, and if so which warnings it produces?
Comment 6 Ognjen Maric 2009-04-10 11:58:18 UTC
I'm posting my config. The iw scan (both of the commands) fails with a:
command failed: Operation not supported (-95). 


(In reply to comment #5)
> Can you try this patch and report whether it helps, and if so which warnings it
> produces?

I'll try out the patch and post the results.
Comment 7 Ognjen Maric 2009-04-10 12:00:01 UTC
Created attachment 20921
Kernel config
Comment 8 Bob Copeland 2009-04-10 12:08:37 UTC
(In reply to comment #6)
> I'm posting my config. The iw scan (both of the commands) fails with a:
> command failed: Operation not supported (-95). 

Ok, thank you.  That's ok, it probably requires very recent kernel + iw
(wireless-testing and iw from git e.g.).

> (In reply to comment #5)
> > Can you try this patch and report whether it helps, and if so which warnings it
> > produces?
> 
> I'll try out the patch and post the results.

Ok great, thanks!
Comment 9 Ognjen Maric 2009-04-11 15:43:58 UTC
Tried out the patch, unfortunately my kernel still panics very quickly. I'm attaching the warnings I get.
Comment 10 Ognjen Maric 2009-04-11 15:46:01 UTC
Created attachment 20939
excerpt from /var/log/messages.log
Comment 11 Bob Copeland 2009-04-21 18:51:09 UTC
(In reply to comment #9)
> Tried out the patch, unfortunately my kernel still panics very quickly. I'm
> attaching the warnings I get.

So it actually panics after it emits the warning?  Or it just emits the warning?

> Apr 10 15:15:25 ogi-laptop kernel: [   83.710063] minstrel: invalid rate report 1 (n=1)

So minstrel actually has only one available rate, that sounds messed up.  What sounds even more messed up is that it's asking us to send on a rate that isn't supported.  

By any chance does this help:

http://marc.info/?l=linux-wireless&m=124022467014813&w=2
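
For readers following along, here is a minimal sketch of the kind of sanity check being discussed. It is not the attached patch and not real mac80211 or ath5k code. The function name sanitize_rate_index is hypothetical, and it assumes the warning "invalid rate report 1 (n=1)" means that rate index 1 was reported while only one rate (index 0) was available:

```c
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel API): validate a rate index handed back
 * by a rate-control algorithm against the number of rates actually available.
 *
 * Returns the index unchanged if it is in range, falls back to index 0
 * (assumed here to be the lowest rate) with a warning if it is out of range,
 * and returns -1 if there are no rates at all.
 */
int sanitize_rate_index(int reported_idx, int n_rates)
{
    if (n_rates <= 0)
        return -1; /* nothing usable to transmit on */

    if (reported_idx < 0 || reported_idx >= n_rates) {
        /* Warn instead of indexing past the end of the rate table. */
        fprintf(stderr, "invalid rate report %d (n=%d), using index 0\n",
                reported_idx, n_rates);
        return 0;
    }

    return reported_idx;
}
```

Under that assumption, sanitize_rate_index(1, 1) would warn and fall back to index 0 rather than reading past a one-entry rate table. Whether the real driver should fall back, drop the frame, or do something else is a separate design question that this sketch does not settle.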
Comment 12 Ognjen Maric 2009-04-21 19:49:50 UTC
(In reply to comment #11)
> So it actually panics after it emits the warning?  Or it just emits the
> warning?

Yes it panics afterwards, but only later, and not immediately after the warning.

> By any chance does this help:
> 
> http://marc.info/?l=linux-wireless&m=124022467014813&w=2

Will try it out and post the results.
Comment 13 Ognjen Maric 2009-04-22 18:48:52 UTC
Unfortunately, still no cigar. I get the same warning, and the kernel still panics later.

BTW I noticed that the kernel stack trace now looks slightly different from the original one I posted, now ending in ath5k_tx, but I'm not sure exactly when this changed on the way from the original 2.6.29. I'm attaching a photo of the trace.
Comment 14 Ognjen Maric 2009-04-22 18:50:40 UTC
Created attachment 21082
Kernel stack trace photo
Comment 15 Ognjen Maric 2009-05-03 09:39:38 UTC
I installed crda and this resolved the issue for me. No more panics nor warnings in messages.log.

Still, I'm not closing the bug, because I'm not sure this is the intended behaviour - from what I gather, my wireless should still work without crda, just with (possibly) less available channels. So I'm letting someone more knowledgeable decide.
Comment 16 Bob Copeland 2009-05-03 13:53:20 UTC
(In reply to comment #15)
> I installed crda and this resolved the issue for me. No more panics nor
> warnings in messages.log.
> 
> Still, I'm not closing the bug, because I'm not sure this is the intended
> behaviour - from what I gather, my wireless should still work without crda,
> just with (possibly) less available channels. So I'm letting someone more
> knowledgeable decide.

Ahhhh very interesting.  Thank you for tracking this down, this helps a lot.  No, the kernel shouldn't crash without crda.  I'll remove it from my test system and see if I can reproduce.
Comment 17 Ognjen Maric 2009-05-05 10:56:29 UTC
Sorry, scrap that - it's unrelated to crda. I was trying it out on a different AP. Only then I realized that your hunch about rates was spot on. My AP was set to a fixed TX rate of 2 Mbps (duh!), causing the failure.
Comment 18 Bob Copeland 2009-06-03 18:23:41 UTC
Interestingly, if I use hostapd with only a 2 Mbps rate, I eventually get a lockup on the client side, and in a couple of cases the machine running the AP has also panicked.  It seems different from the stack trace above, but I haven't been fully successful in capturing all the relevant logs yet.
Comment 19 Bob Copeland 2009-06-05 14:13:46 UTC
Patch is here:
http://lkml.org/lkml/2009/6/5/269
Comment 20 Ognjen Maric 2009-06-28 20:15:58 UTC
Sorry for taking so long to respond, but I didn't have the time to test this properly. Works like a charm, still running after a couple of hours of regular load. Many thanks.
