Bug 42766
Summary: | iwl3945: "Microcode SW error detected" while using Azureus | ||
---|---|---|---|
Product: | Networking | Reporter: | Clemens Eisserer (linuxhippy) |
Component: | Wireless | Assignee: | Stanislaw Gruszka (stf_xl) |
Status: | NEEDINFO --- | ||
Severity: | normal | CC: | alan, dikiy_evrej, florian, pebolle, plaes, stf_xl, tothi |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.2.5- | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg | grep iwl
error dump with linux 3.1.0 iwl crash backtrace iwlegacy_tracing.patch trace output after iwl4965 crash (using iwlegacy_tracing_for_3.5.patch) dmesg after iwl4965 crash (using iwlegacy_tracing_for_3.5.patch) iwl3945_print_rate_table.patch Dmesg output with a backtraces. lspci output |
Is this regression, i.e. errors start to happen on some new kernel version and did not happen before? Please provide logs with debugging enabled as described here: https://bugzilla.kernel.org/show_bug.cgi?id=35932#c2 Except that iwl3945 module should be used. Kernel should be compiled with CONFIG_IWLEGACY_DEBUG=y Hmm, I realized that first Microcode error happens after 3 hours, so gathering so much logs can be problematic, perhaps less verbose options could be used i.e debug=0x00800003. I'm a bit confused about kernel version you mark this bug as 3.2.5- , whereas dmesg you provide apparently comes from some new version i.e. 3.3-rc . 3.2.5- mean does the bug did not happen there? I am running the default kernel for Fedora16, maybe Redhat has backported some patched? Linux localhost.localdomain 3.2.5-3.fc16.i686 #1 SMP Thu Feb 9 02:24:08 UTC 2012 i686 i686 i386 GNU/Linux Yes, wireless stack on fedora kernel-3.2.5 is from 3.3-rc. Do you hit the problem using kernel-3.2.1, which is the latest kernel, which does not include 3.3-rc backport? http://koji.fedoraproject.org/koji/buildinfo?buildID=294921 I'll give it a try with the older kernel versions, will take some time however as the errors only pop up after some time. The problem also shows up with linux-3.1.0, please find the log-output attached. Created attachment 72392 [details]
error dump with linux 3.1.0
running Azureus on fedora's 3.2.6 kernel last night even resulted in a hard system crash - please see the screenshot attached. Created attachment 72402 [details]
iwl crash backtrace
Unfortunately the beginning of the backtrace is not visible, so I can not tell what is exact reason of the crash. I suppose it is related with firmware resets, which are triggered after each microcode error, and is suppose to bring device into operational state after the error. If crash if really caused by fw resets, using fw_restart=0 module option should prevent it. I posted fix that should address crash from comment 10 http://marc.info/?l=linux-wireless&m=133120903025425&w=2 Regarding microcode error, I'll write a patch that will use tracing for longing debug information into memory ring buffer. Once microcode error happen, ring buffer will be dumped into dmesg, hence we will not need to do annoying longing of tons of messages for few hours. A patch referencing this bug report has been merged in Linux v3.3: commit 210787e82a0ac1ffb5d7be1c796f0c51050849ad Author: Stanislaw Gruszka <sgruszka@redhat.com> Date: Thu Mar 8 13:16:01 2012 +0100 iwl3945: fix possible il->txq NULL pointer dereference in delayed works Hi Stanislaw, How is it going with the debugging patch - I still get constant error reports in dmesg. I'm sorry, I was quite busy lastly. I hope I will provide the patch soon. Created attachment 73496 [details]
iwlegacy_tracing.patch
This patch use tracing for logging debug messages. It is for kernel 3.3:
To use it, kernel should be compiled with:
CONFIG_TRACING=y
CONFIG_IWLEGACY_DEBUG=y
Perhaps touse options are needed too, but I'm not sure:
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
To get debug logs plese do:
modprobe -r iwl3945
mount -t debugfs debugfs /sys/kernel/debug/
modprobe iwl3945 debug=0x47833fff
Then use wlan device like usual. When problem will happen, tracing will be off i.e.
cat /sys/kernel/debug/tracing/tracing_on
will show 0. There will be also one line information in dmesg about Microcode error. Then get output of trace file.
cat /sys/kernel/debug/tracing/trace > ~/trace.txt
and attach trace.txt here.
Excuse me, why has this bug been declared as "fixed"? From what I know, for now only debug-code was added to track the issue down - unfourtunatly I haven't found the compile the kernel myself. We fixed crash by commit from comment #13, but main problem reported here - Microcode errors is not fixed. I'm reopening the bug. Actually changing into Needinfo state as I'm waiting for debug logs. Created attachment 76791 [details]
trace output after iwl4965 crash (using iwlegacy_tracing_for_3.5.patch)
last line of the trace log is the "queue stuck".
Created attachment 76801 [details]
dmesg after iwl4965 crash (using iwlegacy_tracing_for_3.5.patch)
i attached the dmesg output belonging to trace.txt
sorry for comment #20 #21 (and the attachments) those things are for bug 44251 :) Clemens, trace buffer is too small, need to be increased. When you will do steps from comment #16, please increase trace buffer size. So module loading commands should looks like: modprobe -r iwl3945 mount -t debugfs debugfs /sys/kernel/debug/ echo 16384 > /sys/kernel/debug/tracing/buffer_size_kb modprobe iwl3945 debug=0x47833fff Thanks for the additional information. I've recently setup the laptop, I should be able to test this tonight... Thanks for your patience. I uploaded a full trace to: http://93.83.133.214/trace.7z Hope it helps. Created attachment 77021 [details]
iwl3945_print_rate_table.patch
Clemens, please get another trace with this patch and debug mask 0x47933fff (8 in the middle changed to 9). Looks like driver provides malformed rate table to firmware, and that's causing error. I would like to see how rate table looks. Thanks.
Above patch if for 3.5, but it should apply on any kernel version. please find the new trace at: http://93.83.133.214/trace2.7z Seems patch from comment 26 was not applied. However this time Microcode error trigger before RATE_SCALE command, so probably error is not caused by wrong rate table, but I'm not sure - I do not have clear culprit. Can you check if disable_hw_scan=0 module parameter workaround this issue? Running into similar issue (microcode error) with 3.9.0-rc4+ (commit 7f42ace3118afedbd1848a349d01a11d9ca13d41 has been applied): [snip] iwl3945 0000:03:00.0: Loaded firmware version: 15.32.2.9 iwl3945 0000:03:00.0: Start IWL Error Log Dump: iwl3945 0000:03:00.0: Status: 0x000202E4, count: 1 iwl3945 0000:03:00.0: Desc Time asrtPC blink2 ilink1 nmiPC Line iwl3945 0000:03:00.0: SYSASSERT (0x5) 0000200505 0x008B6 0x00274 0x0031C 0x04EA2 116 iwl3945 0000:03:00.0: Error Reply type 0x00000005 cmd C_TX (0x1C) seq 0x0286 ser 0x00740000 iwl3945 0000:03:00.0: Command C_TX_PWR_TBL failed: FW Error iwl3945 0000:03:00.0: Error setting Tx power (-5). iwl3945 0000:03:00.0: Can't stop Rx DMA. ieee80211 phy2: Hardware restart was requested [/snip] Are you sure 7f42ace3118afedbd1848a349d01a11d9ca13d41 was applied, microcode errors like it should be fixed by this commit ? Yes, checked it to be sure, kernel is 3.9.0-rc5+ and the commit is indeed included. Here's the error: [28640.984170] iwl3945 0000:03:00.0: Microcode SW error detected. Restarting 0x82000008. [28640.984181] iwl3945 0000:03:00.0: Loaded firmware version: 15.32.2.9 [28640.984217] iwl3945 0000:03:00.0: Start IWL Error Log Dump: [28640.984223] iwl3945 0000:03:00.0: Status: 0x000202E4, count: 1 [28640.984227] iwl3945 0000:03:00.0: Desc Time asrtPC blink2 ilink1 nmiPC Line [28640.984458] iwl3945 0000:03:00.0: SYSASSERT (0x5) 0000001909 0x008B6 0x0035E 0x00320 0x00000 267 [28640.984473] iwl3945 0000:03:00.0: Error Reply type 0x00000001 cmd UNKNOWN (0x88) seq 0x0034 ser 0x00880000 [28640.985011] iwl3945 0000:03:00.0: Can't stop Rx DMA. Is this 3.9 regression ? If so, please revert commit 7f42ace3118afedbd1848a349d01a11d9ca13d41 as well as those two commits: bdb084b22d8aee66c87af5e9c36bd6cf7f3bccfd iwlegacy: more checks for dma mapping errors 96ebbe8d019ab50b34fc3499398ab98de8c2d01b iwlegacy: check for dma mapping errors and check if that helps. If this is not regression or reverting commits do not help, please provide verbose debug logs as described here: https://bugzilla.kernel.org/show_bug.cgi?id=35932#c2 Linux Mint 18, Kernel 4.4.0-47-generic. The Bug is still there. The WiFi-Device becomes then unusable. With the Ubuntu 10.04 with a kernel 2.* there where any bug on this hardware. So, its a regression. I will try the older firmware... Created attachment 244231 [details]
Dmesg output with a backtraces.
Created attachment 244241 [details]
lspci output
It seems to be a bug like a https://bugzilla.kernel.org/show_bug.cgi?id=93041 one. I mean, that the bug I reported in this thread does not relate to the bug in this thread. Please delete... |
Created attachment 72367 [details] dmesg | grep iwl When downloading files using Azureus, I quite often see the following lines in syslog: [26137.862504] iwl3945 0000:02:00.0: Microcode SW error detected. Restarting 0x82000008. [26137.862514] iwl3945 0000:02:00.0: Loaded firmware version: 15.32.2.9 [26137.862550] iwl3945 0000:02:00.0: Start IWL Error Log Dump: [26137.862556] iwl3945 0000:02:00.0: Status: 0x000202E4, count: 1 [26137.862561] iwl3945 0000:02:00.0: Desc Time asrtPC blink2 ilink1 nmiPC Line [26137.862789] iwl3945 0000:02:00.0: SYSASSERT (0x5) 0000204194 0x008B6 0x00274 0x00320 0x04EBE 116 [26137.862792] [26137.862810] iwl3945 0000:02:00.0: Error Reply type 0x00000005 cmd C_TX (0x1C) seq 0x020B ser 0x00740000 [26138.362079] iwl3945 0000:02:00.0: Error sending C_RXON: time out after 500ms. [26138.362089] iwl3945 0000:02:00.0: Error setting new configuration (-110). [26138.370271] iwl3945 0000:02:00.0: Can't stop Rx DMA. [26138.370366] ieee80211 phy0: Hardware restart was requested Please don't hestitate for more specific information.