Bug 93041

Summary: iwl3945: Stuck queue causes crash
Product: Networking Reporter: Edward Hart (edward.dan.hart)
Component: Wireless Assignee: networking_wireless (networking_wireless)
Status: RESOLVED DOCUMENTED    
Severity: normal CC: brianh, dikiy_evrej, edward.dan.hart, mark_k, nathan.collins, scruffythinking, stf_xl, szg00000, tomi.kyostila
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.16.0-031600-generic Subsystem:
Regression: No Bisected commit-id:
Attachments: Relevant portion of dmesg
Output of lspci -v.
Dmesg output with a backtraces.

Description Edward Hart 2015-02-10 16:49:25 UTC

    
Comment 1 Edward Hart 2015-02-10 17:09:15 UTC
Created attachment 166381 [details]
Relevant portion of dmesg

Firstly, I am not sure what causes this bug (sorry). It appears to occur randomly every two weeks or so.

However, the same things happen every time:

1. a queue becomes stuck
2. a hardware restart is requested
3. iwl3945 fails to set up bootstrap microcode
4. several call traces are output to dmesg
5. wireless connection is lost and the laptop has to be rebooted to reconnect.
Comment 2 Edward Hart 2015-02-10 17:11:08 UTC
Created attachment 166391 [details]
Output of lspci -v.
Comment 3 Edward Hart 2015-02-10 17:28:56 UTC
(In reply to Edward Hart from comment #1)

> Firstly, I am not sure what causes this bug (sorry).

To clarify: I do not know what causes the stuck queue that is part of this bug.
Comment 4 Edward Hart 2015-05-17 09:24:28 UTC
This bug had stopped occurring until I upgraded to Ubuntu 15.04 and now it's happening with the same regularity as before.

Does anyone have any idea how I can get more info or debug this?
Comment 5 Mark 2016-08-18 17:24:15 UTC
Some other people (myself included) are having this problem too:
https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1408963
Comment 6 Edward Hart 2016-08-19 08:44:57 UTC
(In reply to Mark from comment #5)
> Some other people (myself included) are having this problem too:
> https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1408963

Thanks for the link.
Comment 7 J 2016-10-26 19:28:50 UTC
As reported on that launchpad page, the bug has not gone away since the update to Ubuntu 16.10. It may have gotten worse -- people who hadn't experienced the bug in a version or two, and at least one who never had a problem, are now dealing with it. (As am I.)
Comment 8 Ilya 2016-11-13 02:46:59 UTC
Same problem here (with a 2.6* kernel there was no problem, until I upgraded to Mint 18 :D ). dmesg attached.

I tried downgrading the firmware to the old one: iwlwifi-3945-1.ucode

It seems to be a workaround. I haven't seen the bug for more than 6 hours now.
Comment 9 Ilya 2016-11-13 02:48:12 UTC
Created attachment 244271 [details]
Dmesg output with a backtraces.
Comment 10 J 2016-11-18 18:26:08 UTC
Not sure if this will help, but I had this issue for about 18 months on Ubuntu Unity and Ubuntu Gnome, multiple wifi drops a day requiring a reboot. Learned that Mate manages the network stack differently, so I installed Ubuntu Mate 16.10 over Ubuntu Gnome 16.10, and have had only 1 crash in over a week -- and I don't think that was related to the same issue (a number of things crashed at once).

So that may help someone narrow down the issue -- whatever is happening in Unity and Gnome 3 to crash iwl3945, it's not happening on Mate.
Comment 11 Ilya 2016-11-18 22:32:00 UTC
Downgrading the firmware really is a workaround: for 5 days there have been no problems with wifi.
Comment 12 Ilya 2016-11-20 09:33:13 UTC
The celebration was a bit too early :) Yesterday there were some stuck queues. But anyway, they are much, much rarer than they were with the current firmware, iwlwifi-3945-2.ucode.
Comment 13 J 2016-11-20 17:37:30 UTC
(In reply to J from comment #10)
> Not sure if this will help, but I had this issue for about 18 months on
> Ubuntu Unity and Ubuntu Gnome, multiple wifi drops a day requiring a reboot.
> Learned that Mate manages the network stack differently, so I installed
> Ubuntu Mate 16.10 over Ubuntu Gnome 16.10, and have had only 1 crash in over
> a week -- and I don't think that was related to the same issue (a number of
> things crashed at once).
> 
> So that may help someone narrow down the issue -- whatever is happening in
> Unity and Gnome 3 to crash iwl3945, it's not happening on Mate.

Scratch that -- iwl3945 still crashes on MATE, requiring a restart, just not as often. In my case, it occurs about every three days on MATE, as opposed to every three hours on Unity or Gnome 3.

I hear that going back to iwlwifi-3945-1.ucode fixes the problem for some.
Comment 14 Ilya 2016-12-10 14:34:46 UTC
Downgrading to 3.13.* makes things much better...
Comment 15 Brian Havard 2017-02-24 11:50:54 UTC
I'm getting this same problem on an HP/Compaq 6710b running Linux Mint 18.1.
I've tried updating to the latest kernel build I could find, 4.9.11 (amd64) from kernel.ubuntu.com, but it still happens.
Comment 16 Stanislaw Gruszka 2017-02-24 13:05:57 UTC
> iwl3945 0000:03:00.0: BSM uCode verification failed at addr 0x00003800+0 (of
> 900), is 0xa5a5a5a2, s/b 0xf802020

means that we cannot communicate with the hardware via the PCI bus. It's hard to tell what causes that: it can be a hardware problem, an iwl3945 firmware problem, a kernel PCI sublayer problem, or a problem in the iwl3945/mac80211/cfg80211 drivers.
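To see how often this failure has occurred, the kernel log can be searched for the message quoted above. This is a minimal sketch: the function name count_bsm_failures is made up for illustration, and the only pattern used is the "BSM uCode verification failed" text from the dmesg line in this comment.

```shell
#!/bin/sh
# Count log lines that report a failed BSM uCode verification.
# Reads kernel log text on stdin and prints the number of matching lines.
count_bsm_failures() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function usable under "set -e".
    grep -c 'BSM uCode verification failed' || true
}
```

Usage: dmesg | count_bsm_failures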
Comment 17 Stanislaw Gruszka 2017-02-24 13:11:45 UTC
(In reply to Ilya from comment #14)
> Downgrading to 3.13.* makes things much better...

These are the changes to the iwl3945 driver since 3.13:

git log v3.13..HEAD --no-merges --oneline -- drivers/net/wireless/iwlegacy drivers/net/wireless/intel/iwlegacy/

ae3cf47 iwlegacy: make il3945_mac_ops __ro_after_init
6bdf1e0 Makefile: drop -D__CHECK_ENDIAN__ from cflags
1dc8079 iwlegacy: constify local structures
4c73195 iwlegacy: use IS_ENABLED() instead of checking for built-in or module
7947d3e mac80211: Add support for beacon report radio measurement
2cce76c iwlegacy: avoid warning about missing braces
57fbcce cfg80211: remove enum ieee80211_band
84d17a2 iwl4965: Fix more memory leaks in __il4965_up()
c2fd344 iwl4965: Fix a memory leak in error handling code of __il4965_up
fe9b479 iwl4965: Fix a null pointer dereference in il_tx_queue_free and il_cmd_queue_free
fb9693f iwlegacy: Return directly if allocation fails in il_eeprom_init()
50ea05e mac80211: pass block ack session timeout to to driver
9ec855c iwlegacy: 4965-mac: constify il_sensitivity_ranges structure
31ced24 iwlegacy: mark il_adjust_beacon_interval as noinline
3f0267f iwlegacy: cleanup end of il_send_add_sta()
7ac9a36 iwlegacy: move under intel directory
621a5f7 debugfs: Pass bool pointer to debugfs_create_bool()
e3abc8f mac80211: allow to transmit A-MSDU within A-MPDU
8358491 iwlegacy: convert hex_dump_to_buffer() to %*ph
30686bf mac80211: convert HW flags to unsigned long bitmap
df14046 mac80211: remove support for IFF_PROMISC
93803b3 wireless: Use eth_<foo>_addr instead of memset
ff5e568 iwlegacy: 4965-rs: Remove bogus colon after newline from debug message
1a94ace iwl3945: Use setup_timer
af68b87 iwl4965: Use setup_timer
0f791eb4 mac80211: allow channel switch with multiple channel contexts
595a23f iwl4965: fix %d confusingly prefixed with 0x in format string
0d8614b mac80211: replace SMPS hw flags with wiphy feature bits
9baa3c3 PCI: Remove DEFINE_PCI_DEVICE_TABLE macro use
9f0b4cb iwlegacy: use correct structure type name in sizeof
c56ef67 mac80211: support more than one band in scan request
45eeeaf iwlegacy: Convert /n to \n
77be2c5 mac80211: add vif to flush call
997bc71 iwl4965: disable 8K A-MSDU by default
dbdac2b iwlegacy: properly enable power saving
8e67427 iwlegacy: merge reclaim check
59f0118 iwl3945: fix wakeup interrupt
cc01f9b mac80211: remove module handling from rate control ops
631ad70 mac80211: make rate control ops const
c8bf40a wireless: delete non-required instances of include <linux/init.h>
ccbac29 iwlegacy: use ether_addr_equal_64bits
c8aa5ab drivers: net: Mark functions as static in debug.c
0e06b09 drivers: net: Mark functions as static in 4965-debug.c
03a71e0 drivers: net: Mark functions as static in 3945-debug.c
5f5deff iwl3945: do not print RFKILL message
a2f73b6 cfg80211: move regulatory flags to their own variable
8fe02e1 cfg80211: consolidate passive-scan and no-ibss flags

Most of them are cosmetic patches or adjustments to mac80211/cfg80211 interface changes; I do not see anything suspicious there. If this is a regression from 3.13, most likely it was caused by some other kernel change, for example in the PCI sublayer. If the bug is somehow reproducible it could eventually be bisected with git-bisect, but this can be hard, especially if the bug is not easy to reproduce.
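A bisection like the one mentioned above could be started roughly as follows. This is only a sketch: the function name is hypothetical, v3.13 as the good revision comes from comment 14, and HEAD is assumed to be a kernel that shows the bug. Every step still means building and booting the suggested kernel, waiting to see whether the queue gets stuck, and then marking the result with "git bisect good" or "git bisect bad".

```shell
#!/bin/sh
# Start a bisection in a kernel git tree.
# $1: known-good revision (default: v3.13, per comment 14)
# $2: known-bad revision  (default: HEAD, an assumption)
start_iwl3945_bisect() {
    good="${1:-v3.13}"
    bad="${2:-HEAD}"
    git bisect start &&
    git bisect bad "$bad" &&
    git bisect good "$good"
}
```

Usage, from the top of the kernel source tree: start_iwl3945_bisect v3.13 HEAD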
Comment 18 Stanislaw Gruszka 2017-02-24 13:26:01 UTC
Actually this commit is suspicious:

dbdac2b iwlegacy: properly enable power saving

However, power save should not be used; check

iw dev wlan0 get power_save

and if it is enabled, disable it with:

iw dev wlan0 set power_save off

and check if it helps with the problem.
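The two commands above can be combined into one check. This is a minimal sketch, assuming that "iw dev <dev> get power_save" prints a line of the form "Power save: on" or "Power save: off". The function names and the IW override variable are made up for illustration.

```shell
#!/bin/sh
# Command used to talk to the wireless stack; overridable for testing.
IW="${IW:-iw}"

# Read "iw ... get power_save" output on stdin and print "on" or "off".
parse_power_save() {
    awk '/Power save:/ { print $NF; exit }'
}

# Turn power save off on the given device, but only if it is currently on.
ensure_power_save_off() {
    dev="$1"
    state=$("$IW" dev "$dev" get power_save | parse_power_save)
    if [ "$state" = "on" ]; then
        "$IW" dev "$dev" set power_save off
    fi
}
```

Usage (as root): ensure_power_save_off wlan0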
Comment 19 Brian Havard 2017-02-24 23:19:54 UTC
Yes, my laptop shows power save is on after a fresh boot.
I'll turn it off and leave it running for a while to see if that change fixes it.
Comment 20 Brian Havard 2017-02-25 07:07:21 UTC
So far so good. Nearly 8 hours uptime with no failure. Looks like power saving is the likely culprit. Before it was lucky to last 1 hour.
Comment 21 Ilya 2017-05-09 11:51:53 UTC
(In reply to Brian Havard from comment #20)
> So far so good. Nearly 8 hours uptime with no failure. Looks like power
> saving is the likely culprit. Before it was lucky to last 1 hour.

Any news? Are there problems after this workaround? If not, how can we fix it in code?
Comment 22 Stanislaw Gruszka 2017-05-09 15:02:21 UTC
This is a firmware problem, which will not be fixed. Power save was permitted by the "dbdac2b iwlegacy: properly enable power saving" commit to let users save power (for some people PS does not crash the firmware).

What can eventually be done is to add a warning when PS is enabled, as it seems some users enable PS and do not connect it with the problems.
Comment 23 Stanislaw Gruszka 2017-05-09 15:07:51 UTC
(In reply to Stanislaw Gruszka from comment #22)
> What can eventually be done is to add a warning when PS is enabled, as it seems
> some users enable PS and do not connect it with the problems.
Or distributions enable it without any user intervention, instead of staying with the sane default.
Comment 24 Brian Havard 2017-05-12 04:17:57 UTC
After forcing power save off (using an if-up.d script) I've had no more problems with the wifi.
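The actual script was not posted. A hook of this kind might look like the sketch below, assuming the ifupdown convention that scripts in /etc/network/if-up.d/ receive the interface name in $IFACE, and assuming the wireless interface is wlan0 as in comment 18. The function name and the WIFI_IFACE and IW variables are made up for illustration.

```shell
#!/bin/sh
# Sketch of a hook for /etc/network/if-up.d/ (must be executable).
WIFI_IFACE="${WIFI_IFACE:-wlan0}"  # assumption: name of the iwl3945 interface
IW="${IW:-iw}"                     # overridable for testing

# Disable power save when the interface being brought up is the wireless one.
ps_off_hook() {
    # Ignore every other interface (lo, eth0, ...).
    [ "$1" = "$WIFI_IFACE" ] || return 0
    "$IW" dev "$1" set power_save off
}

# ifupdown exports the interface name as IFACE; empty means nothing to do.
ps_off_hook "${IFACE:-}"
```

Whether the hook runs on a given system depends on how the network is managed there, so it is worth confirming afterwards with "iw dev wlan0 get power_save".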
Comment 25 Stanislaw Gruszka 2017-05-15 08:03:01 UTC
A patch adding the warning was posted:
http://marc.info/?l=linux-wireless&m=149483519701866&w=2