Bug 42766 - iwl3945: "Microcode SW error detected" while using Azureus
Summary: iwl3945: "Microcode SW error detected" while using Azureus
Status: NEEDINFO
Alias: None
Product: Networking
Classification: Unclassified
Component: Wireless (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Stanislaw Gruszka
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-02-13 19:17 UTC by Clemens Eisserer
Modified: 2016-11-13 02:52 UTC (History)
7 users (show)

See Also:
Kernel Version: 3.2.5-
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg | grep iwl (4.62 KB, text/plain)
2012-02-13 19:17 UTC, Clemens Eisserer
Details
error dump with linux 3.1.0 (5.25 KB, text/plain)
2012-02-15 23:25 UTC, Clemens Eisserer
Details
iwl crash backtrace (574.50 KB, image/jpeg)
2012-02-16 15:45 UTC, Clemens Eisserer
Details
iwlegacy_tracing.patch (3.22 KB, text/plain)
2012-06-02 13:25 UTC, Stanislaw Gruszka
Details
trace output after iwl4965 crash (using iwlegacy_tracing_for_3.5.patch) (27.70 KB, text/plain)
2012-08-04 20:23 UTC, Istvan Toth
Details
dmesg after iwl4965 crash (using iwlegacy_tracing_for_3.5.patch) (248.39 KB, text/plain)
2012-08-04 20:26 UTC, Istvan Toth
Details
iwl3945_print_rate_table.patch (585 bytes, text/plain)
2012-08-07 13:18 UTC, Stanislaw Gruszka
Details
Dmesg output with a backtraces. (151.99 KB, text/plain)
2016-11-12 17:16 UTC, Ilya
Details
lspci output (2.31 KB, text/plain)
2016-11-12 17:16 UTC, Ilya
Details

Description Clemens Eisserer 2012-02-13 19:17:11 UTC
Created attachment 72367 [details]
dmesg | grep iwl

When downloading files using Azureus, I quite often see the following lines in syslog:


[26137.862504] iwl3945 0000:02:00.0: Microcode SW error detected. Restarting 0x82000008.
[26137.862514] iwl3945 0000:02:00.0: Loaded firmware version: 15.32.2.9
[26137.862550] iwl3945 0000:02:00.0: Start IWL Error Log Dump:
[26137.862556] iwl3945 0000:02:00.0: Status: 0x000202E4, count: 1
[26137.862561] iwl3945 0000:02:00.0: Desc       Time       asrtPC  blink2 ilink1  nmiPC   Line
[26137.862789] iwl3945 0000:02:00.0: SYSASSERT     (0x5) 0000204194 0x008B6 0x00274 0x00320 0x04EBE 116
[26137.862792] 
[26137.862810] iwl3945 0000:02:00.0: Error Reply type 0x00000005 cmd C_TX (0x1C) seq 0x020B ser 0x00740000
[26138.362079] iwl3945 0000:02:00.0: Error sending C_RXON: time out after 500ms.
[26138.362089] iwl3945 0000:02:00.0: Error setting new configuration (-110).
[26138.370271] iwl3945 0000:02:00.0: Can't stop Rx DMA.
[26138.370366] ieee80211 phy0: Hardware restart was requested

Please don't hestitate for more specific information.
Comment 1 Stanislaw Gruszka 2012-02-13 19:37:26 UTC
Is this regression, i.e. errors start to happen on some new kernel version and did not happen before?

Please provide logs with debugging enabled as described here: https://bugzilla.kernel.org/show_bug.cgi?id=35932#c2
Except that iwl3945 module should be used. Kernel should be compiled with 
CONFIG_IWLEGACY_DEBUG=y
Comment 2 Stanislaw Gruszka 2012-02-13 19:46:52 UTC
Hmm, I realized that first Microcode error happens after 3 hours, so gathering so much logs can be problematic, perhaps less verbose options could be used i.e
debug=0x00800003.
Comment 3 Stanislaw Gruszka 2012-02-13 19:53:47 UTC
I'm a bit confused about kernel version you mark this bug as 3.2.5- , whereas dmesg you provide apparently comes from some new version i.e. 3.3-rc . 3.2.5- mean does the bug did not happen there?
Comment 4 Clemens Eisserer 2012-02-13 22:10:17 UTC
I am running the default kernel for Fedora16, maybe Redhat has backported some patched?

Linux localhost.localdomain 3.2.5-3.fc16.i686 #1 SMP Thu Feb 9 02:24:08 UTC 2012 i686 i686 i386 GNU/Linux
Comment 5 Stanislaw Gruszka 2012-02-14 11:51:49 UTC
Yes, wireless stack on fedora kernel-3.2.5 is from 3.3-rc. Do you hit the problem using kernel-3.2.1, which is the latest kernel, which does not include 3.3-rc backport? http://koji.fedoraproject.org/koji/buildinfo?buildID=294921
Comment 6 Clemens Eisserer 2012-02-15 20:26:47 UTC
I'll give it a try with the older kernel versions, will take some time however as the errors only pop up after some time.
Comment 7 Clemens Eisserer 2012-02-15 23:24:12 UTC
The problem also shows up with linux-3.1.0, please find the log-output attached.
Comment 8 Clemens Eisserer 2012-02-15 23:25:09 UTC
Created attachment 72392 [details]
error dump with linux 3.1.0
Comment 9 Clemens Eisserer 2012-02-16 15:44:31 UTC
running Azureus on fedora's 3.2.6 kernel last night even resulted in a hard system crash - please see the screenshot attached.
Comment 10 Clemens Eisserer 2012-02-16 15:45:15 UTC
Created attachment 72402 [details]
iwl crash backtrace
Comment 11 Stanislaw Gruszka 2012-02-17 20:46:15 UTC
Unfortunately the beginning of the backtrace is not visible, so I can not tell what is exact reason of the crash. I suppose it is related with firmware resets, which are triggered after each microcode error, and is suppose to bring device into operational state after the error. If crash if really caused by fw resets, using fw_restart=0 module option should prevent it.
Comment 12 Stanislaw Gruszka 2012-03-08 20:33:10 UTC
I posted fix that should address crash from comment 10
http://marc.info/?l=linux-wireless&m=133120903025425&w=2

Regarding microcode error, I'll write a patch that will use tracing for longing debug information into memory ring buffer. Once microcode error happen, ring buffer will be dumped into dmesg, hence we will not need to do annoying longing of tons of messages for few hours.
Comment 13 Florian Mickler 2012-03-24 11:15:41 UTC
A patch referencing this bug report has been merged in Linux v3.3:

commit 210787e82a0ac1ffb5d7be1c796f0c51050849ad
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Thu Mar 8 13:16:01 2012 +0100

    iwl3945: fix possible il->txq NULL pointer dereference in delayed works
Comment 14 Clemens Eisserer 2012-04-04 21:11:07 UTC
Hi Stanislaw,

How is it going with the debugging patch - I still get constant error reports in dmesg.
Comment 15 Stanislaw Gruszka 2012-04-05 15:30:35 UTC
I'm sorry, I was quite busy lastly. I hope I will provide the patch soon.
Comment 16 Stanislaw Gruszka 2012-06-02 13:25:11 UTC
Created attachment 73496 [details]
iwlegacy_tracing.patch

This patch use tracing for logging debug messages. It is for kernel 3.3:

To use it, kernel should be compiled with:

CONFIG_TRACING=y
CONFIG_IWLEGACY_DEBUG=y

Perhaps touse options are needed too, but I'm not sure:

CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y

To get debug logs plese do:

modprobe -r iwl3945
mount -t debugfs debugfs /sys/kernel/debug/
modprobe iwl3945 debug=0x47833fff

Then use wlan device like usual. When problem will happen, tracing will be off i.e.

cat /sys/kernel/debug/tracing/tracing_on

will show 0. There will be also one line information in dmesg about Microcode error. Then get output of trace file.

cat /sys/kernel/debug/tracing/trace > ~/trace.txt

and attach trace.txt here.
Comment 17 Clemens Eisserer 2012-06-13 15:32:36 UTC
Excuse me, why has this bug been declared as "fixed"?

From what I know, for now only debug-code was added to track the issue down - unfourtunatly I haven't found the compile the kernel myself.
Comment 18 Stanislaw Gruszka 2012-06-13 19:31:32 UTC
We fixed crash by commit from comment #13, but main problem reported here - Microcode errors is not fixed. I'm reopening the bug.
Comment 19 Stanislaw Gruszka 2012-06-13 19:32:44 UTC
Actually changing into Needinfo state as I'm waiting for debug logs.
Comment 20 Istvan Toth 2012-08-04 20:23:29 UTC
Created attachment 76791 [details]
trace output after iwl4965 crash (using iwlegacy_tracing_for_3.5.patch)

last line of the trace log is the "queue stuck".
Comment 21 Istvan Toth 2012-08-04 20:26:59 UTC
Created attachment 76801 [details]
dmesg after iwl4965 crash (using iwlegacy_tracing_for_3.5.patch) 

i attached the dmesg output belonging to trace.txt
Comment 22 Istvan Toth 2012-08-04 20:36:01 UTC
sorry for comment #20 #21 (and the attachments)

those things are for bug 44251 :)
Comment 23 Stanislaw Gruszka 2012-08-06 16:36:40 UTC
Clemens, trace buffer is too small, need to be increased. When you will do steps from comment #16, please increase trace buffer size. So module loading commands should looks like:

modprobe -r iwl3945
mount -t debugfs debugfs /sys/kernel/debug/
echo 16384 > /sys/kernel/debug/tracing/buffer_size_kb
modprobe iwl3945 debug=0x47833fff
Comment 24 Clemens Eisserer 2012-08-06 16:42:10 UTC
Thanks for the additional information.

I've recently setup the laptop, I should be able to test this tonight...
Comment 25 Clemens Eisserer 2012-08-06 22:24:36 UTC
Thanks for your patience.

I uploaded a full trace to: http://93.83.133.214/trace.7z

Hope it helps.
Comment 26 Stanislaw Gruszka 2012-08-07 13:18:50 UTC
Created attachment 77021 [details]
iwl3945_print_rate_table.patch

Clemens, please get another trace with this patch and debug mask 0x47933fff (8 in the middle changed to 9). Looks like driver provides malformed rate table to firmware, and that's causing error. I would like to see how rate table looks. Thanks.
Comment 27 Stanislaw Gruszka 2012-08-07 14:37:02 UTC
Above patch if for 3.5, but it should apply on any kernel version.
Comment 28 Clemens Eisserer 2012-08-08 06:11:53 UTC
please find the new trace at: http://93.83.133.214/trace2.7z
Comment 29 Stanislaw Gruszka 2012-08-10 21:48:01 UTC
Seems patch from comment 26 was not applied. However this time Microcode error trigger before RATE_SCALE command, so probably error is not caused by wrong rate table, but I'm not sure - I do not have clear culprit.

Can you check if disable_hw_scan=0 module parameter workaround this issue?
Comment 30 Priit Laes 2013-03-31 06:32:42 UTC
Running into similar issue  (microcode error) with 3.9.0-rc4+ (commit 7f42ace3118afedbd1848a349d01a11d9ca13d41 has been applied):

[snip]
iwl3945 0000:03:00.0: Loaded firmware version: 15.32.2.9
iwl3945 0000:03:00.0: Start IWL Error Log Dump:
iwl3945 0000:03:00.0: Status: 0x000202E4, count: 1
iwl3945 0000:03:00.0: Desc       Time       asrtPC  blink2 ilink1  nmiPC   Line
iwl3945 0000:03:00.0: SYSASSERT     (0x5) 0000200505 0x008B6 0x00274 0x0031C 0x04EA2 116
iwl3945 0000:03:00.0: Error Reply type 0x00000005 cmd C_TX (0x1C) seq 0x0286 ser 0x00740000
iwl3945 0000:03:00.0: Command C_TX_PWR_TBL failed: FW Error
iwl3945 0000:03:00.0: Error setting Tx power (-5).
iwl3945 0000:03:00.0: Can't stop Rx DMA.
ieee80211 phy2: Hardware restart was requested
[/snip]
Comment 31 Stanislaw Gruszka 2013-04-04 11:51:04 UTC
Are you sure 7f42ace3118afedbd1848a349d01a11d9ca13d41 was applied, microcode errors like it should be fixed by this commit ?
Comment 32 Priit Laes 2013-04-06 11:04:18 UTC
Yes, checked it to be sure, kernel is 3.9.0-rc5+ and the commit is indeed included. Here's the error:

[28640.984170] iwl3945 0000:03:00.0: Microcode SW error detected. Restarting 0x82000008.
[28640.984181] iwl3945 0000:03:00.0: Loaded firmware version: 15.32.2.9
[28640.984217] iwl3945 0000:03:00.0: Start IWL Error Log Dump:
[28640.984223] iwl3945 0000:03:00.0: Status: 0x000202E4, count: 1
[28640.984227] iwl3945 0000:03:00.0: Desc       Time       asrtPC  blink2 ilink1  nmiPC   Line
[28640.984458] iwl3945 0000:03:00.0: SYSASSERT     (0x5) 0000001909 0x008B6 0x0035E 0x00320 0x00000 267

[28640.984473] iwl3945 0000:03:00.0: Error Reply type 0x00000001 cmd UNKNOWN (0x88) seq 0x0034 ser 0x00880000
[28640.985011] iwl3945 0000:03:00.0: Can't stop Rx DMA.
Comment 33 Stanislaw Gruszka 2013-04-06 14:42:14 UTC
Is this 3.9 regression ?  If so, please revert commit 7f42ace3118afedbd1848a349d01a11d9ca13d41 as well as those two
commits:

bdb084b22d8aee66c87af5e9c36bd6cf7f3bccfd iwlegacy: more checks for dma mapping errors
96ebbe8d019ab50b34fc3499398ab98de8c2d01b iwlegacy: check for dma mapping errors

and check if that helps. If this is not regression or reverting commits do not help, please provide verbose debug logs as described here:
https://bugzilla.kernel.org/show_bug.cgi?id=35932#c2
Comment 34 Ilya 2016-11-12 17:14:55 UTC
Linux Mint 18, Kernel 4.4.0-47-generic. The Bug is still there. The WiFi-Device becomes then unusable. With the Ubuntu 10.04 with a kernel 2.* there where any bug on this hardware. So, its a regression.

I will try the older firmware...
Comment 35 Ilya 2016-11-12 17:16:25 UTC
Created attachment 244231 [details]
Dmesg output with a backtraces.
Comment 36 Ilya 2016-11-12 17:16:59 UTC
Created attachment 244241 [details]
lspci output
Comment 37 Ilya 2016-11-13 02:52:24 UTC
It seems to be a bug like a https://bugzilla.kernel.org/show_bug.cgi?id=93041 one. I mean, that the bug I reported in this thread does not relate to the bug in this thread. Please delete...

Note You need to log in before you can comment on or make changes to this bug.