Bug 32032
Summary: | Divide error in minstrel_ht_tx_status followed by hang | ||
---|---|---|---|
Product: | Drivers | Reporter: | Wouter Cloetens (zombie) |
Component: | Staging | Assignee: | Roland Vossen (rvossen) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | florian, linville, nbd, rvossen |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.38 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
syslog capturing crashes
Fix brcm80211 crash by setting ampdu_len and ampdu_ack_len in brcm80211 appropriately |
Could you do the following? gdb /lib/modules/$(uname -r)/kernel/net/mac80211/mac80211.ko Then, issue this command inside of gdb; l *(minstrel_ht_tx_status+0x462) Then post the output here? Thanks! (In reply to comment #1) > l *(minstrel_ht_tx_status+0x462) Easily said... I rebuilt the kernel with CONFIG_DEBUG_INFO enabled and verified that the backtrace is the same. 0x1c493 is in minstrel_ht_tx_status (/home/zombie/local/2.6.38-1/linux-2.6-2.6.38/debian/build/source_i386_none/net/mac80211/rc80211_minstrel_ht.c:172). 167 if (mr->probability < MINSTREL_FRAC(1, 10)) { 168 mr->cur_tp = 0; 169 return; 170 } 171 172 usecs = mi->overhead / MINSTREL_TRUNC(mi->avg_ampdu_len); 173 usecs += minstrel_mcs_groups[group].duration[rate]; 174 mr->cur_tp = MINSTREL_TRUNC((1000000 / usecs) * mr->probability); 175 } 176 I compiled in debugfs, but the system hangs before I can inspect the value of avg_ampdu_len. My guess is, brcm80211 can report zero for tx_info->status.ampdu_len, which could lead to such an issue. brcm80211 has a comment there saying this about ampdu_len and ampdu_ack_len: /* XXX TODO: Make these accurate. */ Created attachment 53822 [details] Fix brcm80211 crash by setting ampdu_len and ampdu_ack_len in brcm80211 appropriately (In reply to comment #3) > My guess is, brcm80211 can report zero for tx_info->status.ampdu_len, which > could lead to such an issue. Spot on. Without understanding the stack, from looking at the ath9k driver, it seems to me that ampdu_len and ampdu_ack_len must correspond to the number of frames and the number of ack frames, whereas brcm80211 is just dumping status of the last frame in both. Attached patch solves the problem for me. I have added code in mac80211 to protect against bogus values, but it doesn't get hit, so I'm fairly confident. Could we move this bug from the NEEDINFO state? Be my guest! :-) I'm not volunteering to take ownership of the bug... Can someone in the know review the patch and my analysis? Hi Wouter, thanks for bringing this bug to our attention. I reviewed your proposed solution, but am afraid there is a problem with it: after execution hits the lines in which you set status fields: tx_info->status.ampdu_len = tot_mpdu; tx_info->status.ampdu_ack_len = suc_mpdu; and execution continues by making another iteration in the while loop, it then encounters: while (p) { tx_info = IEEE80211_SKB_CB(p); So you set the tx_info fields, but then the tx_info pointer gets changed. I will work on an other fix and keep you posted. Again, thanks for detecting this ! Bye, Roland Vossen Broadcom I posted a fix into Greg KH's development tree, which is 2.6.39-rc1 based. I forwarded the patch email to Wouter. Thanks, Roland Vossen Broadcom A patch referencing this bug report has been merged in v3.0-rc1: commit e9c661e08c2a6015c1b7cba1cecefa27a089df71 Author: Roland Vossen <rvossen@broadcom.com> Date: Tue Apr 12 14:34:41 2011 +0200 staging: brcm80211: bugfix for div by zero in minstrel_ht_tx_status |
Created attachment 52232 [details] syslog capturing crashes Debian Squeeze with updated kernels containing brcm80211 driver. Reproduced with the following kernel packages: linux-image-2.6.37-trunk-686 2.6.37-1~experimental.1 linux-image-2.6.37.2-686 2.6.37-2 linux-image-2.6.38-1-686 2.6.38-1 with two different firmware versions for brcm80211, currently: firmware-brcm80211 0.29 The system worked reliably with the first kernel until an automated package update occurred. Wireless is now unusable. I have been unable to trace what may have changed to trigger the problem. After associating, the following crash occurs, followed by a second crash in acpi_idle_enter_bm, which then keeps occurring, rendering the system completely hung in a matter of seconds. Mar 27 14:31:10 davin kernel: [ 1716.250150] divide error: 0000 [#1] SMP Mar 27 14:31:10 davin kernel: [ 1716.250305] last sysfs file: /sys/module/aes_generic/initstate Mar 27 14:31:10 davin kernel: [ 1716.250478] Modules linked in: aes_i586 aes_generic hidp hid acpi_cpufreq mperf cpufreq_conservative cpufreq_powersave cpufreq_userspace cpufreq_stats parport_pc ppdev lp parport sco bridge stp bnep rfcomm l2cap binfmt_misc uinput fuse loop arc4 snd_hda_codec_realtek ecb snd_hda_intel joydev snd_hda_codec snd_hwdep brcm80211(C) snd_pcm mac80211 snd_seq snd_timer snd_seq_device i915 cfg80211 drm_kms_helper tpm_tis snd psmouse tpm btusb uvcvideo bluetooth soundcore i2c_i801 drm rfkill i2c_algo_bit ac tpm_bios video videodev serio_raw i2c_core evdev pcspkr snd_page_alloc battery power_supply button processor ext4 mbcache jbd2 crc16 sd_mod crc_t10dif ahci libahci libata uhci_hcd scsi_mod ehci_hcd usbcore sky2 thermal thermal_sys nls_base [last unloaded: scsi_wait_scan] Mar 27 14:31:10 davin kernel: [ 1716.253132] Mar 27 14:31:10 davin kernel: [ 1716.253186] Pid: 0, comm: kworker/0:0 Tainted: G C 2.6.38-1-686 #1 SAMSUNG ELECTRONICS CO., LTD. NF110/NF210/NF310 /NF110/NF210/NF310 Mar 27 14:31:10 davin kernel: [ 1716.253637] EIP: 0060:[<f8e7e46f>] EFLAGS: 00010246 CPU: 1 Mar 27 14:31:10 davin kernel: [ 1716.253830] EIP is at minstrel_ht_tx_status+0x462/0x675 [mac80211] Mar 27 14:31:10 davin kernel: [ 1716.254008] EAX: 0000006c EBX: ef452068 ECX: 00010000 EDX: 00000000 Mar 27 14:31:10 davin kernel: [ 1716.254008] ESI: ef452000 EDI: ef452000 EBP: 00000000 ESP: f68d9f04 Mar 27 14:31:10 davin kernel: [ 1716.254008] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Mar 27 14:31:10 davin kernel: [ 1716.254008] Process kworker/0:0 (pid: 0, ti=f68d8000 task=f684acb0 task.ti=f68d4000) Mar 27 14:31:10 davin kernel: [ 1716.254008] Stack: Mar 27 14:31:10 davin kernel: [ 1716.254008] 00000000 ef452000 ef452068 e5ce76c0 00000000 00000000 00000000 ef6f7220 Mar 27 14:31:10 davin kernel: [ 1716.254008] 00000000 00000000 00000000 00000000 00010000 00000000 f8e8224c ef6f7000 Mar 27 14:31:10 davin kernel: [ 1716.254008] e5ce76d8 e5ce76c0 f8e63205 ef452000 e5ce76c0 ef2ea2c0 e5ce76d8 f5b0c086 Mar 27 14:31:10 davin kernel: [ 1716.254008] Call Trace: Mar 27 14:31:10 davin kernel: [ 1716.254008] [<f8e63205>] ? ieee80211_tx_status+0x234/0x6b7 [mac80211] Mar 27 14:31:10 davin kernel: [ 1716.254008] [<f8e62933>] ? ieee80211_tasklet_handler+0x4a/0x8c [mac80211] Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c10347fd>] ? __do_softirq+0x0/0x14f Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c1033cd7>] ? tasklet_action+0x61/0x9f Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c1034898>] ? __do_softirq+0x9b/0x14f Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c10347fd>] ? __do_softirq+0x0/0x14f Mar 27 14:31:10 davin kernel: [ 1716.254008] <IRQ> Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c1034781>] ? irq_exit+0x26/0x59 Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c1003e63>] ? do_IRQ+0x73/0x84 Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c10034b0>] ? common_interrupt+0x30/0x38 Mar 27 14:31:10 davin kernel: [ 1716.254008] [<f85a5cde>] ? acpi_idle_enter_bm+0x247/0x284 [processor] Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c11ded9b>] ? cpuidle_idle_call+0xd4/0x15d Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c1002201>] ? cpu_idle+0x95/0xb0 Mar 27 14:31:10 davin kernel: [ 1716.254008] [<c1288b52>] ? start_secondary+0x1b8/0x1bd Mar 27 14:31:10 davin kernel: [ 1716.254008] Code: 89 43 e4 8b 4c 24 08 8b 49 08 81 f9 98 19 00 00 89 4c 24 30 77 0c 8b 44 24 08 c7 00 00 00 00 00 eb 32 0f b7 6e 0a 31 d2 8b 46 1c <f7> f5 8b 14 24 89 c1 8b 44 24 18 03 8c 90 18 f0 e7 f8 b8 40 42 Mar 27 14:31:10 davin kernel: [ 1716.254008] EIP: [<f8e7e46f>] minstrel_ht_tx_status+0x462/0x675 [mac80211] SS:ESP 0068:f68d9f04