Bug 32032 - Divide error in minstrel_ht_tx_status followed by hang
Summary: Divide error in minstrel_ht_tx_status followed by hang
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Staging (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Roland Vossen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-03-27 21:46 UTC by Wouter Cloetens
Modified: 2011-05-30 08:35 UTC (History)
4 users (show)

See Also:
Kernel Version: 2.6.38
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
syslog capturing crashes (9.54 KB, text/plain)
2011-03-27 21:46 UTC, Wouter Cloetens
Details
Fix brcm80211 crash by setting ampdu_len and ampdu_ack_len in brcm80211 appropriately (1.13 KB, patch)
2011-04-08 19:27 UTC, Wouter Cloetens
Details | Diff

Description Wouter Cloetens 2011-03-27 21:46:15 UTC
Created attachment 52232 [details]
syslog capturing crashes

Debian Squeeze with updated kernels containing brcm80211 driver.
Reproduced with the following kernel packages:
 linux-image-2.6.37-trunk-686   2.6.37-1~experimental.1
 linux-image-2.6.37.2-686       2.6.37-2
 linux-image-2.6.38-1-686       2.6.38-1
with two different firmware versions for brcm80211, currently:
 firmware-brcm80211             0.29

The system worked reliably with the first kernel until an automated package update occurred. Wireless is now unusable. I have been unable to trace what may have changed to trigger the problem.

After associating, the following crash occurs, followed by a second crash in acpi_idle_enter_bm, which then keeps occurring, rendering the system completely hung in a matter of seconds.

Mar 27 14:31:10 davin kernel: [ 1716.250150] divide error: 0000 [#1] SMP
Mar 27 14:31:10 davin kernel: [ 1716.250305] last sysfs file: /sys/module/aes_generic/initstate
Mar 27 14:31:10 davin kernel: [ 1716.250478] Modules linked in: aes_i586 aes_generic hidp hid acpi_cpufreq mperf cpufreq_conservative cpufreq_powersave cpufreq_userspace cpufreq_stats parport_pc ppdev lp parport sco bridge stp bnep rfcomm l2cap binfmt_misc uinput fuse loop arc4 snd_hda_codec_realtek ecb snd_hda_intel joydev snd_hda_codec snd_hwdep brcm80211(C) snd_pcm mac80211 snd_seq snd_timer snd_seq_device i915 cfg80211 drm_kms_helper tpm_tis snd psmouse tpm btusb uvcvideo bluetooth soundcore i2c_i801 drm rfkill i2c_algo_bit ac tpm_bios video videodev serio_raw i2c_core evdev pcspkr snd_page_alloc battery power_supply button processor ext4 mbcache jbd2 crc16 sd_mod crc_t10dif ahci libahci libata uhci_hcd scsi_mod ehci_hcd usbcore sky2 thermal thermal_sys nls_base [last unloaded: scsi_wait_scan]
Mar 27 14:31:10 davin kernel: [ 1716.253132]
Mar 27 14:31:10 davin kernel: [ 1716.253186] Pid: 0, comm: kworker/0:0 Tainted: G         C   2.6.38-1-686 #1 SAMSUNG ELECTRONICS CO., LTD. NF110/NF210/NF310          /NF110/NF210/NF310 
Mar 27 14:31:10 davin kernel: [ 1716.253637] EIP: 0060:[<f8e7e46f>] EFLAGS: 00010246 CPU: 1
Mar 27 14:31:10 davin kernel: [ 1716.253830] EIP is at minstrel_ht_tx_status+0x462/0x675 [mac80211]
Mar 27 14:31:10 davin kernel: [ 1716.254008] EAX: 0000006c EBX: ef452068 ECX: 00010000 EDX: 00000000
Mar 27 14:31:10 davin kernel: [ 1716.254008] ESI: ef452000 EDI: ef452000 EBP: 00000000 ESP: f68d9f04
Mar 27 14:31:10 davin kernel: [ 1716.254008]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Mar 27 14:31:10 davin kernel: [ 1716.254008] Process kworker/0:0 (pid: 0, ti=f68d8000 task=f684acb0 task.ti=f68d4000)
Mar 27 14:31:10 davin kernel: [ 1716.254008] Stack:
Mar 27 14:31:10 davin kernel: [ 1716.254008]  00000000 ef452000 ef452068 e5ce76c0 00000000 00000000 00000000 ef6f7220
Mar 27 14:31:10 davin kernel: [ 1716.254008]  00000000 00000000 00000000 00000000 00010000 00000000 f8e8224c ef6f7000
Mar 27 14:31:10 davin kernel: [ 1716.254008]  e5ce76d8 e5ce76c0 f8e63205 ef452000 e5ce76c0 ef2ea2c0 e5ce76d8 f5b0c086
Mar 27 14:31:10 davin kernel: [ 1716.254008] Call Trace:
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<f8e63205>] ? ieee80211_tx_status+0x234/0x6b7 [mac80211]
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<f8e62933>] ? ieee80211_tasklet_handler+0x4a/0x8c [mac80211]
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c10347fd>] ? __do_softirq+0x0/0x14f
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c1033cd7>] ? tasklet_action+0x61/0x9f
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c1034898>] ? __do_softirq+0x9b/0x14f
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c10347fd>] ? __do_softirq+0x0/0x14f
Mar 27 14:31:10 davin kernel: [ 1716.254008]  <IRQ>
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c1034781>] ? irq_exit+0x26/0x59
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c1003e63>] ? do_IRQ+0x73/0x84
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c10034b0>] ? common_interrupt+0x30/0x38
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<f85a5cde>] ? acpi_idle_enter_bm+0x247/0x284 [processor]
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c11ded9b>] ? cpuidle_idle_call+0xd4/0x15d
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c1002201>] ? cpu_idle+0x95/0xb0
Mar 27 14:31:10 davin kernel: [ 1716.254008]  [<c1288b52>] ? start_secondary+0x1b8/0x1bd
Mar 27 14:31:10 davin kernel: [ 1716.254008] Code: 89 43 e4 8b 4c 24 08 8b 49 08 81 f9 98 19 00 00 89 4c 24 30 77 0c 8b 44 24 08 c7 00 00 00 00 00 eb 32 0f b7 6e 0a 31 d2 8b 46 1c <f7> f5 8b 14 24 89 c1 8b 44 24 18 03 8c 90 18 f0 e7 f8 b8 40 42
Mar 27 14:31:10 davin kernel: [ 1716.254008] EIP: [<f8e7e46f>] minstrel_ht_tx_status+0x462/0x675 [mac80211] SS:ESP 0068:f68d9f04
Comment 1 John W. Linville 2011-03-28 18:25:45 UTC
Could you do the following?

gdb /lib/modules/$(uname -r)/kernel/net/mac80211/mac80211.ko

Then, issue this command inside of gdb;

l *(minstrel_ht_tx_status+0x462)

Then post the output here?  Thanks!
Comment 2 Wouter Cloetens 2011-03-29 22:10:00 UTC
(In reply to comment #1)
> l *(minstrel_ht_tx_status+0x462)

Easily said... I rebuilt the kernel with CONFIG_DEBUG_INFO enabled and verified that the backtrace is the same.


0x1c493 is in minstrel_ht_tx_status (/home/zombie/local/2.6.38-1/linux-2.6-2.6.38/debian/build/source_i386_none/net/mac80211/rc80211_minstrel_ht.c:172).
167             if (mr->probability < MINSTREL_FRAC(1, 10)) {
168                     mr->cur_tp = 0;
169                     return;
170             }
171
172             usecs = mi->overhead / MINSTREL_TRUNC(mi->avg_ampdu_len);
173             usecs += minstrel_mcs_groups[group].duration[rate];
174             mr->cur_tp = MINSTREL_TRUNC((1000000 / usecs) * mr->probability);
175     }
176

I compiled in debugfs, but the system hangs before I can inspect the value of avg_ampdu_len.
Comment 3 Felix Fietkau 2011-04-01 00:14:33 UTC
My guess is, brcm80211 can report zero for tx_info->status.ampdu_len, which could lead to such an issue.
brcm80211 has a comment there saying this about ampdu_len and ampdu_ack_len:
                /* XXX TODO: Make these accurate. */
Comment 4 Wouter Cloetens 2011-04-08 19:27:04 UTC
Created attachment 53822 [details]
Fix brcm80211 crash by setting ampdu_len and ampdu_ack_len in brcm80211 appropriately

(In reply to comment #3)
> My guess is, brcm80211 can report zero for tx_info->status.ampdu_len, which
> could lead to such an issue.

Spot on. Without understanding the stack, from looking at the ath9k driver, it seems to me that ampdu_len and ampdu_ack_len must correspond to the number of frames and the number of ack frames, whereas brcm80211 is just dumping status of the last frame in both.

Attached patch solves the problem for me.

I have added code in mac80211 to protect against bogus values, but it doesn't get hit, so I'm fairly confident.

Could we move this bug from the NEEDINFO state?
Comment 5 John W. Linville 2011-04-08 19:42:29 UTC
Be my guest! :-)
Comment 6 Wouter Cloetens 2011-04-08 23:13:58 UTC
I'm not volunteering to take ownership of the bug... Can someone in the know review the patch and my analysis?
Comment 7 Roland Vossen 2011-04-11 14:03:15 UTC
Hi Wouter,

thanks for bringing this bug to our attention. I reviewed your proposed solution, but am afraid there is a problem with it: after execution hits the lines in which you set status fields:
    tx_info->status.ampdu_len = tot_mpdu;
    tx_info->status.ampdu_ack_len = suc_mpdu;

and execution continues by making another iteration in the while loop, it then encounters:

	while (p) {
		tx_info = IEEE80211_SKB_CB(p);

So you set the tx_info fields, but then the tx_info pointer gets changed.

I will work on an other fix and keep you posted.

Again, thanks for detecting this !

Bye, 

Roland Vossen
Broadcom
Comment 8 Roland Vossen 2011-04-12 14:56:22 UTC
I posted a fix into Greg KH's development tree, which is 2.6.39-rc1 based. I forwarded the patch email to Wouter.

Thanks, Roland Vossen
Broadcom
Comment 9 Florian Mickler 2011-05-30 07:52:35 UTC
A patch referencing this bug report has been merged in v3.0-rc1:

commit e9c661e08c2a6015c1b7cba1cecefa27a089df71
Author: Roland Vossen <rvossen@broadcom.com>
Date:   Tue Apr 12 14:34:41 2011 +0200

    staging: brcm80211: bugfix for div by zero in minstrel_ht_tx_status

Note You need to log in before you can comment on or make changes to this bug.