Bug 12647 - ath5k locks up machine
Summary: ath5k locks up machine
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
: 12226 12652 (view as bug list)
Depends on:
Blocks:
 
Reported: 2009-02-07 05:20 UTC by David Kozub
Modified: 2009-04-01 14:26 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.29-rc3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
console log (with DETECT_SOFTLOCKUP, no hires timers and notsc) (5.58 KB, text/x-log)
2009-02-07 05:24 UTC, David Kozub
Details
disable beacon miss interrupts (2.09 KB, patch)
2009-02-07 07:22 UTC, Bob Copeland
Details | Diff
content of /proc/kmesg whan i enable all in /dmesg/ath5k.... (6.73 KB, application/octet-stream)
2009-02-12 10:22 UTC, satish
Details
sysrq+p (show regs) after an oops with disabled BMISS and enabled MIB interrupts (2.74 KB, text/plain)
2009-02-16 13:04 UTC, David Kozub
Details
disassembly of ath5k.ko with patch (#2) applied (71 bytes, text/plain)
2009-02-16 23:47 UTC, David Kozub
Details
disassembly of ath5k.ko (original 2.6.29-rc3 rsource) (53 bytes, text/plain)
2009-02-17 08:00 UTC, David Kozub
Details
reset all counters on mib interrupts (964 bytes, patch)
2009-02-24 18:09 UTC, Bob Copeland
Details | Diff
log file "Unhandled kernel unaligned access[" (19.46 KB, application/octet-stream)
2009-02-25 07:09 UTC, satish
Details
soft lockup log - still with original patch (#2) to disable BMISS (5.43 KB, text/plain)
2009-02-25 14:11 UTC, David Kozub
Details
ath5k_hw_update_mib_counters verbose output (4.85 KB, text/x-log)
2009-03-14 16:13 UTC, David Kozub
Details
patch adding some printks into ath5k_hw_update_mib_counters() (2.36 KB, patch)
2009-03-14 16:16 UTC, David Kozub
Details | Diff
ath5k_hw_update_mib_counters verbose output2 (9.52 KB, text/plain)
2009-03-19 13:20 UTC, David Kozub
Details

Description David Kozub 2009-02-07 05:20:48 UTC
Latest working kernel version: N/A
Earliest failing kernel version: 2.6.26
Distribution: Debian GNU/Linux 5.0
Hardware Environment: ALIX 2c3 (AMD Geode LX800), Atheros AR5413
Software Environment: Debian GNU/Linux 5.0 + -rc kernel
Problem Description:
Running with an established wireless connection locks up the machine after approximately a day or two of uptime. (Tested with WEP encryption.)

Without extra debug options and with hires timers enabled the machine is just locked up, without anything appearing on the console. (Magic sysrq is working though.)

Turning on almost all debug options (including DETECT_SOFTLOCKUP), disabling hires timers and supplying "notsc" as a kernel parameter changes the bahaviour to result in a "BUG: soft lockup". This also seems to shorten the typical uptime till bug to ~1 day.

This is reproducible with a range of kernels. I tried many in the range 2.6.26 - 2.6.29-rc3. Running the box without ath5k loaded results in no freezes.

Maybe this is related to #12226.

Steps to reproduce:
Just boot up, establish a wlan connection and wait a day or two.
Comment 1 David Kozub 2009-02-07 05:24:33 UTC
Created attachment 20150 [details]
console log (with DETECT_SOFTLOCKUP, no hires timers and notsc)
Comment 2 Bob Copeland 2009-02-07 07:22:30 UTC
Created attachment 20154 [details]
disable beacon miss interrupts

Does this patch help?  It could be an interrupt storm.
Comment 3 Bob Copeland 2009-02-07 07:39:24 UTC
There was extra stuff in there, the first hunk is all that mattered.

Looking at the trace, we have two cpus handling interrupts at the same time.  There's no locking in the isr itself, and we're writing to hardware within the isr.  Both of those seem like quite bad things.  Still, we should exit when testing is_intr_pending unless there is a storm.
Comment 4 Bob Copeland 2009-02-07 07:42:53 UTC
Also try disabling MIB interrupts from sc->imask at line 2237 in base.c (ath5k_init).
Comment 5 David Kozub 2009-02-07 14:41:08 UTC
Patch applied, also removed AR5K_INT_MIB from sc->imask. Let's see if the box survives the night.

One thing puzzles me though:

> we have two cpus handling interrupts at the same time

I don't see anything in the log implying two cpus. And as the box is an ALIX 2c3 (http://www.pcengines.ch/alix2c3.htm) - an AMD Geode LX800 based thing - it has single CPU with a single core.

(Ok, I think I added a comment like this few hours back, but I don't see it here now... sorry if it shows up twice.)
Comment 6 Bob Copeland 2009-02-07 16:08:07 UTC
Oh -- you're right, I read it wrong.
Comment 7 David Kozub 2009-02-08 03:11:27 UTC
Ok, 16 hours uptime. Still, too early to tell. But I noticed in older logs I do occasionally get messages like:

ath5k phy0: noise floor calibration timeout (2417MHz)

With the new kernel I have not seen it yet, but maybe this is unrelated and I eventually see it with the patched kernel too. Either way, the message itself doesn't seem to immediately do anything, i.e. everything seems to work afterwards.
Comment 8 satish 2009-02-10 01:39:12 UTC
Distribution:kamikaze (openwrt)
Hardware Environment: rb433 ah board ( AR7161 processor)
Wifi chipset -AR5413
This is satish from Indian Institute of Science. I am working on 802.11s
implementation.

We are using routerboard 433ah platform which has atheros processor.
The wifi chip we are using is AR5413. We are using ath5k driver downloaded from
compat wireless.

Our board reboots at higher clock frequency (> 100 MHZ). The processor hangs
approx. five minutes after we bring the wlan0 interface up.
When we ping from the board with higher size data bytes ( Ex: ping -s 128 ),
the board hangs after two pings in higher clock frequency. The same problem
occurs when we ping our board from some other system. But the above issue does
not arise when the clock frequency is 100 MHZ.

So we doubt the above problem is due to the insufficient buffer size. When the
board runs at higher clock freq., the buffer gets filled up immediately.

I, therefore, request you to suggest some parameters with which we can play to
bring the buffer size up. If you think it is not due to the buffer size, please
suggest a solution. The card works fine with madwifi driver. So there is no
issue with the hardware.

Steps to reproduce:
connect to ad hoc network 
ping to any station with large size data bytes ex: ping -s 128
Comment 9 Bob Copeland 2009-02-10 04:25:21 UTC
*** Bug 12652 has been marked as a duplicate of this bug. ***
Comment 10 Bob Copeland 2009-02-10 04:29:30 UTC
*** Bug 12226 has been marked as a duplicate of this bug. ***
Comment 11 Bob Copeland 2009-02-10 05:31:22 UTC
(In reply to comment #8)
> Distribution:kamikaze (openwrt)
> Hardware Environment: rb433 ah board ( AR7161 processor)
> Wifi chipset -AR5413
> This is satish from Indian Institute of Science. I am working on 802.11s
> implementation.
> 
> We are using routerboard 433ah platform which has atheros processor.
> The wifi chip we are using is AR5413. We are using ath5k driver downloaded
> from
> compat wireless.

What is the contents of the git_describe file in compat_wireless?

> I, therefore, request you to suggest some parameters with which we can play
> to
> bring the buffer size up. If you think it is not due to the buffer size,
> please
> suggest a solution. The card works fine with madwifi driver. So there is no
> issue with the hardware.

Can you mount debugfs and then do:

echo 'intr' > /debug/ath5k/phy0/debug

This should show on the console whether you are getting lots of interrupts or not at hang time.

ath5k uses a linked list of descriptors for rx and tx dma and the last entry is self-linked to itself so the hardware theoretically shouldn't run out.  The buffers are large enough for standard ieee80211 packets, but if you want to play with these values:

IEEE80211_MAX_LEN in ath5k.h - size of a dma buffer
ATH_RXBUF,ATH_TXBUF in base.h - number of rx,tx buffers 
Comment 12 David Kozub 2009-02-11 00:38:09 UTC
Ok, I got to uptime 3 days 13 hours, so I'm almost sure the bug was fixed (as the box would have choked by this time w/o the patch).

What I did was applying patch from #2 and removing AR5K_INT_MIB from sc->imask (#4).

Now I wonder what the conclusion would be. What does AR5K_INT_MIB do? Should I try the patch (#2) while keeping AR5K_INT_MIB? Or is the patch + no AR5K_INT_MIB a good solution?
Comment 13 satish 2009-02-12 01:17:25 UTC
i mounted the debugfs added intr to debug file 

It shows up the debug info " _ratelimit: 4035 callbacks suppressed " every five seconds. 

__ratelimit: 4035 callbacks suppressed
__ratelimit: 4032 callbacks suppressed
__ratelimit: 4032 callbacks suppressed
__ratelimit: 4045 callbacks suppressed
__ratelimit: 4026 callbacks suppressed
__ratelimit: 4036 callbacks suppressed
__ratelimit: 4020 callbacks suppressed
__ratelimit: 4034 callbacks suppressed
__ratelimit: 4034 callbacks suppressed
__ratelimit: 4037 callbacks suppressed
__ratelimit: 4030 callbacks suppressed
__ratelimit: 3434 callbacks suppressed
__ratelimit: 4028 callbacks suppressed
__ratelimit: 4044 callbacks suppressed
__ratelimit: 4047 callbacks suppressed
__ratelimit: 4046 callbacks suppressed
__ratelimit: 4026 callbacks suppressed


does it mean it is geeting too many interrupts
Comment 14 satish 2009-02-12 02:17:23 UTC
the content of the git describe file in compat wireless is

v2.6.29-rc3-12724-g64f573f
Comment 15 Bob Copeland 2009-02-12 05:06:53 UTC
(In reply to comment #12)
> Ok, I got to uptime 3 days 13 hours, so I'm almost sure the bug was fixed (as
> the box would have choked by this time w/o the patch).
> 
> What I did was applying patch from #2 and removing AR5K_INT_MIB from
> sc->imask
> (#4).
> 
> Now I wonder what the conclusion would be. What does AR5K_INT_MIB do? Should
> I
> try the patch (#2) while keeping AR5K_INT_MIB? Or is the patch + no
> AR5K_INT_MIB a good solution?
> 

Yeah, please try with MIB interrupts enabled as well.

BMISS interrupts fire when we expect a beacon but don't get one.  However, they can fire erroneously, and right now we do absolutely nothing with them anyway.

MIB interrupts fire when there are updates to the management information base (snmp counters).  While potentially useful, they aren't critical, and for now we don't do noise immunity so the driver doesn't use them either.
Comment 16 Bob Copeland 2009-02-12 05:11:58 UTC
(In reply to comment #13)
> i mounted the debugfs added intr to debug file 
> 
> It shows up the debug info " _ratelimit: 4035 callbacks suppressed " every
> five
> seconds. 
> 
> __ratelimit: 4035 callbacks suppressed
> __ratelimit: 4032 callbacks suppressed
> __ratelimit: 4032 callbacks suppressed
> __ratelimit: 4045 callbacks suppressed
> __ratelimit: 4026 callbacks suppressed
> __ratelimit: 4036 callbacks suppressed
> __ratelimit: 4020 callbacks suppressed
> __ratelimit: 4034 callbacks suppressed
> __ratelimit: 4034 callbacks suppressed
> __ratelimit: 4037 callbacks suppressed
> __ratelimit: 4030 callbacks suppressed
> __ratelimit: 3434 callbacks suppressed
> __ratelimit: 4028 callbacks suppressed
> __ratelimit: 4044 callbacks suppressed
> __ratelimit: 4047 callbacks suppressed
> __ratelimit: 4046 callbacks suppressed
> __ratelimit: 4026 callbacks suppressed
> 
> 
> does it mean it is geeting too many interrupts
> 

Yes -- but you should also be able to see which interrupt.  Can you update the loglevel on your console (e.g. dmesg -n 8) or cat /proc/kmsg?
Comment 17 satish 2009-02-12 10:22:45 UTC
Created attachment 20219 [details]
content of /proc/kmesg whan i enable all in /dmesg/ath5k....
Comment 18 satish 2009-02-12 10:24:48 UTC
> Yes -- but you should also be able to see which interrupt.  Can you update
> the
> loglevel on your console (e.g. dmesg -n 8) or cat /proc/kmsg?
> 

when i add intr alone into the debug/ath5k/phy0/debug, the /proc/kmesg/ is flooded with following info: 
__ratelimit: 4026 callbacks suppressed
<4>__ratelimit: 4026 callbacks suppressed
<7>ath5k phy0: (ath5k_intr:2427): status 0x1/0x800904b5
<7>ath5k phy0: (ath5k_intr:2427): status 0x1/0x800904b5
<7>ath5k phy0: (ath5k_intr:2427): status 0x1/0x800904b5

the previous attachment is the content of /proc/kmesg/ when i enable all indebug/ath5k/phy0/...

 
Comment 19 David Kozub 2009-02-16 12:58:49 UTC
I've got bad news: Without AR5K_INT_MIB the board kept running for ~ 3.5 days, but eventually choke too. :-(

At the time I didn't have a computer connected to the serial, unfortunately, so I don't have kernel log. All I have is magic sys rq + p afterwards, don't know how useful it is, though.

I'll try to catch the panic again, but most probably it's still the same issue.
Comment 20 David Kozub 2009-02-16 13:04:57 UTC
Created attachment 20265 [details]
sysrq+p (show regs) after an oops with disabled BMISS and enabled MIB interrupts
Comment 21 David Kozub 2009-02-16 13:06:52 UTC
(In reply to comment #19)
> I've got bad news: Without AR5K_INT_MIB the board kept running for ~ 3.5
> days,
> but eventually choke too. :-(

Arrrrgh, I mean: With AR5K_INT_MIB ENABLED, of course. (Sorry for the confusion.)
Comment 22 Bob Copeland 2009-02-16 14:27:25 UTC
(In reply to comment #19)
> I've got bad news: Without AR5K_INT_MIB the board kept running for ~ 3.5
> days,
> but eventually choke too. :-(

Well, I'd rather it crash and know why than have a hand-wavy workaround.  

Ok, another softlockup trace would be good.  Can you also add the output of objdump -S drivers/net/wireless/ath5k/ath5k.ko?  Also for the original configuration, if possible.  I'd like to see the mixed C and assembly.

By the way, are you using managed mode (or adhoc, etc)?  I found a deadlock in adhoc mode yesterday.
Comment 23 Bob Copeland 2009-02-16 14:28:23 UTC
(In reply to comment #17)
> Created an attachment (id=20219) [details]
> content of /proc/kmesg whan i enable all in /dmesg/ath5k....
> 

As Nick stated in ath5k-devel email, these are just rx interrupts, so that so far doesn't indicate an issue.  Did changing the DMA size help at all?
Comment 24 David Kozub 2009-02-16 23:32:35 UTC
I tried mounting debugfs and doing an "echo 'intr' > /debug/ath5k/phy0/debug". After that (and after changing loglevel to 8), I get floods of messages similar to what satish gets:

[25621.372008] __ratelimit: 494364 callbacks suppressed
[25621.378924] ath5k phy0: (ath5k_intr:2378): status 0x1000/0x800814b5
[25621.382924] ath5k phy0: (ath5k_intr:2378): status 0x1000/0x800814b5
[25621.385743] ath5k phy0: (ath5k_intr:2378): status 0x1001/0x800814b5
...

Letting the board run overnight, it died quickly with those settings, but there's no panic in the log, the only other thing than "__ratelimit" and "ath5k_intr" is:

[  823.460006] spurious 8259A interrupt: IRQ7.

(Weird, ATH is IRQ9, cat /proc/interrupts shows nothing on IRQ7.)

By dying, I mean: wireless was down and I could not log in via serial either (would not react to anything else than break).

A sysrq+p shows
[25695.716008]  [<d09fb7c2>] ? ath5k_hw_update_mib_counters+0x52/0x130 [ath5k]
on the call stack.

I'll boot w/o debugfs and will try to get the original panic.
Comment 25 David Kozub 2009-02-16 23:43:25 UTC
I'm using managed mode.
Comment 26 David Kozub 2009-02-16 23:47:36 UTC
Created attachment 20271 [details]
disassembly of ath5k.ko with patch (#2) applied
Comment 27 David Kozub 2009-02-17 08:00:25 UTC
Created attachment 20273 [details]
disassembly of ath5k.ko (original 2.6.29-rc3 rsource)
Comment 28 Bob Copeland 2009-02-17 08:25:04 UTC
Ok, th(In reply to comment #24)
> I tried mounting debugfs and doing an "echo 'intr' >
> /debug/ath5k/phy0/debug".
> After that (and after changing loglevel to 8), I get floods of messages
> similar
> to what satish gets:
> 
> [25621.372008] __ratelimit: 494364 callbacks suppressed
> [25621.378924] ath5k phy0: (ath5k_intr:2378): status 0x1000/0x800814b5
> [25621.382924] ath5k phy0: (ath5k_intr:2378): status 0x1000/0x800814b5
> [25621.385743] ath5k phy0: (ath5k_intr:2378): status 0x1001/0x800814b5
> ...
> 
> Letting the board run overnight, it died quickly with those settings, but
> there's no panic in the log, the only other thing than "__ratelimit" and
> "ath5k_intr" is:
> 
> [  823.460006] spurious 8259A interrupt: IRQ7.
> 
> (Weird, ATH is IRQ9, cat /proc/interrupts shows nothing on IRQ7.)
> 
> By dying, I mean: wireless was down and I could not log in via serial either
> (would not react to anything else than break).
> 
> A sysrq+p shows
> [25695.716008]  [<d09fb7c2>] ? ath5k_hw_update_mib_counters+0x52/0x130
> [ath5k]
> on the call stack.
> 
> I'll boot w/o debugfs and will try to get the original panic.
> 

Ok the important thing is status field, which in this case is 0x1000.  Those are defined in the enum ath5k_int, in this case that is AR5K_INT_MIB, so you are getting a MIB storm.  Thus it makes sense that turning off MIB interrupts helps.  Currently, we don't do anything about the MIB counters except reset them (e.g. we don't do automatic noise immunity) so if it keeps firing we may not be fixing the triggering condition.  I'll check to see if there is a way we can more intelligently program the mib interrupts, worse case we can just turn them off again.
Comment 29 satish 2009-02-20 01:59:18 UTC
(In reply to comment #23)
> (In reply to comment #17)
> > Created an attachment (id=20219) [details] [details]
> > content of /proc/kmesg whan i enable all in /dmesg/ath5k....
> > 
> 
> As Nick stated in ath5k-devel email, these are just rx interrupts, so that so
> far doesn't indicate an issue.  Did changing the DMA size help at all?
> 

yes it helped. i changed the DMASIZE to 64B. Now it is not hanging even in higher cpu frequency.   sorry i forgot to update it here.
Comment 30 satish 2009-02-23 22:56:33 UTC
(In reply to comment #23)
> (In reply to comment #17)
> > Created an attachment (id=20219) [details] [details]
> > content of /proc/kmesg whan i enable all in /dmesg/ath5k....
> > 
> 
> As Nick stated in ath5k-devel email, these are just rx interrupts, so that so
> far doesn't indicate an issue.  Did changing the DMA size help at all?
> 

I changed the DMA size to 64B. Now it does not restart when i ping with higher size. But when i do wget from some other system the board reboots displaying the following error:


"ath5k unhandled kernel unaligned access"

Please reply if u have any details abt this error.
Comment 31 Bob Copeland 2009-02-24 09:45:30 UTC
> "ath5k unhandled kernel unaligned access"
> 
> Please reply if u have any details abt this error.

I replied in email but this is unrelated to any hangs, and most likely a bug.  Please post the accompanying logs in the thread on the list, and we can go from there.
Comment 32 Bob Copeland 2009-02-24 09:50:48 UTC
(In reply to comment #28)

> not be fixing the triggering condition.  I'll check to see if there is a way
> we
> can more intelligently program the mib interrupts, worse case we can just
> turn
> them off again.

Just an update, it seems we can use the MIB controller to clear all counters.  I'll post a patch for that soon.  In the meantime, if you want to, you can add some printks to print out the values of some of the counters in ath5k_hw_update_mib_counters in pcu.c (i.e. the return values of ath5k_hw_reg_read(), plus the values in the profile count registers before they are zeroed).  I'd be interested to know which counters are triggering the interrupt -- the interrupt is supposed to happen when the register saturates.
Comment 33 Bob Copeland 2009-02-24 18:09:51 UTC
Created attachment 20356 [details]
reset all counters on mib interrupts

Try this patch with mib interrupts enabled.
Comment 34 satish 2009-02-25 07:09:47 UTC
Created attachment 20363 [details]
log file "Unhandled kernel unaligned access["


The attachment is the /proc/kmsg file with all anabled in /debug/ath5k....

The following message was displayed on the screen when i did wget :
Unhandled kernel unaligned access[#1]:
Cpu 0
$ 0   : 00000000 7fcfbb60 fbd3efbe 802778a0
$ 4   : 802778a0 000000d0 00000000 0000002f
$ 8   : 87b98f80 80064610 87afc000 fffffff8
$12   : 00000000 00000000 00000000 00000000
$16   : fbd3efbe 00000001 000000d0 8014116c
$20   : 00000080 2ac31030 00000000 00000101
$24   : 00000000 2ab9efb0
Comment 35 satish 2009-02-25 07:14:54 UTC
(In reply to comment #31)
> > "ath5k unhandled kernel unaligned access"
> > 
> > Please reply if u have any details abt this error.
> 
> Is there a stack trace to go along with this?  If so, please post it.

>ARM doesn't support unaligned access so it indicates a portability bug
>in ath5k.


Bob,
in ur mail u have sent, u have asked something abt stack trace. how should i check it. Regarding my processor, i am using mips (Atheros AR7161). I think mips also doesn't allow unalligned access. 
Comment 36 Bob Copeland 2009-02-25 07:51:56 UTC
Sorry, I don't know where I thought you were using ARM.  Anyway I opened a new bug to track the unaligned access problem.  Please leave info there:

http://bugzilla.kernel.org/show_bug.cgi?id=12777
Comment 37 David Kozub 2009-02-25 14:11:50 UTC
Created attachment 20367 [details]
soft lockup log - still with original patch (#2) to disable BMISS

Don't know if it's still helpful - but I just run into it. This is essentially the same situation as in #21, where I had no log.

I'll try with the new patch now.
Comment 38 David Kozub 2009-03-01 16:18:22 UTC
Good news, everyone!

The uptime of my ALIX has crawled over 4 days (using the patch from #33). So, I think the patch fixes this bug.

To Bob: If you're still interested, I might try to get some stats as for what counters were triggering the interrupt. Either way, thx for fixing this. :)
Comment 39 Bob Copeland 2009-03-04 09:40:21 UTC
Yeah, it would be good for the changelog if nothing else.  I'll send the patch on to the list for comment and hopefully we can get it upstream soon.
Comment 40 David Kozub 2009-03-14 16:13:41 UTC
Created attachment 20523 [details]
ath5k_hw_update_mib_counters verbose output

I added some printks, here's what I'm getting. The system doesn't die just after that. For the bug to manifest, I'd need to keep it running longer.
Comment 41 David Kozub 2009-03-14 16:16:01 UTC
Created attachment 20524 [details]
patch adding some printks into ath5k_hw_update_mib_counters()

Here's the code I used. It might not be what you want. If you need some other sort of data, let me know. (Now finally having some time to look into that again.)
Comment 42 David Kozub 2009-03-15 02:27:01 UTC
Keeping the board running overnight, it got into a state of continuously spiting the counter values:

[18595.900006] ATH5K: AR5K_FCS_FAIL = 22660

[18595.905786] ATH5K: AR5K_BEACON_CNT = 7539

[18595.909842] ATH5K: AR5K_PROFCNT_TX = 30265

[18595.914163] ATH5K: AR5K_PROFCNT_RX = 439630068

[18595.919523] ATH5K: AR5K_PROFCNT_RXCLR = 453916775

[18595.923523] ATH5K: AR5K_PROFCNT_CYCLE = 2180133112

[18595.925664] ATH5K: AR5K_PROFCNT_RX = 122634

[18595.928645] ATH5K: AR5K_PROFCNT_RXCLR = 67281

[18595.933747] ATH5K: AR5K_PROFCNT_CYCLE = 1130345

[18595.939368] ATH5K: AR5K_PROFCNT_RX = 68585

[18595.943688] ATH5K: AR5K_PROFCNT_RXCLR = 76626

[18595.947688] ATH5K: AR5K_PROFCNT_CYCLE = 1117662

.
.
.
[25497.110947] ATH5K: AR5K_PROFCNT_RX = 513297

[25497.115528] ATH5K: AR5K_PROFCNT_RXCLR = 655197

[25497.119528] ATH5K: AR5K_PROFCNT_CYCLE = 1141604

[25497.122510] ATH5K: AR5K_PROFCNT_RX = 239784

[25497.127091] ATH5K: AR5K_PROFCNT_RXCLR = 148941

[25497.131091] ATH5K: AR5K_PROFCNT_CYCLE = 1144025
Comment 43 Bob Copeland 2009-03-15 19:51:54 UTC
odd, FCS_FAIL is always 49152 in the first attachment.  But that's an usual number, and if it is read-and-clear then it should be a different number each time.
Comment 44 David Kozub 2009-03-19 13:20:11 UTC
Created attachment 20598 [details]
ath5k_hw_update_mib_counters verbose output2

Extract from another try.

My guess it that the box locks up at:

[41273.472006] ATH5K: AR5K_FCS_FAIL = 1865
[41273.476937] ATH5K: AR5K_BEACON_CNT = 6206
...

after that the profile counters start being dumped to the log like crazy. What actually happens is that the box doesn't respond to anything (probably responds to sysrq, couldn't have tried this time), but the profile counters are still being dumped... so the whole log (till I inspected and rebooted the box) is ~ 90M.

This time AR5K_FCS_FAIL takes on different values... But AR5K_BEACON_CNT seems to be stuck at the same value as AR5K_FCS_FAIL was in the previous log: 49152 (an interesting number... 0xC000 - but this might not mean anything).

And just like in the previous log, a register is stuck at 49152 for some time, then it changes, only for the profile registers to go crazy after that.

But just before the final freeze, it takes a different value:
[41273.476937] ATH5K: AR5K_BEACON_CNT = 6206

Also... the logging I hacked in doesn't reveal anything, so there well might be things we're missing.

Do you want me to try to get another log, possibly with a different logging approach?

[You can get the whole log at http://linux.fjfi.cvut.cz/~zub/humel/humel.log - but it doesn't bring anything interesting, IMHO.]
Comment 45 Bob Copeland 2009-03-19 13:35:44 UTC
No that's ok.. heh don't know why I didn't see 49152 == 0xc000, which is definitely a trigger number (top two bits set).  So, I think on your hardware the MIB counters are *not* read-and-clear, or else it is counting those numbers very fast... you could do one last thing and print out the value after it reaches 0xc000 and you have read it, to see if it is back to zero again.

As for why the profile counters start going crazy.. I would just guess the card must be in a bad state so the firmware starts burning cpu, phy may be miscalibrated, etc.

I already submitted a patch upstream to just disable MIB interrupts for now, but when we implement ANI I think the other patch will be useful if your registers aren't getting cleared.  Thanks for your testing!
Comment 46 John W. Linville 2009-04-01 14:25:34 UTC
Closing on basis of comment 45...

Note You need to log in before you can comment on or make changes to this bug.