Bug 199807

Summary: iwl4965 last firmware version is buggy and should be rolled back
Product: Drivers
Reporter: Ryan Underwood (nemesis)
Component: network-wireless
Assignee: drivers_network-wireless (drivers_network-wireless)
Status: CLOSED WILL_NOT_FIX
Severity: normal
CC: stf_xl
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.15
Subsystem:
Regression: No
Bisected commit-id:
Attachments: boot dmesg, sample dmesg during failure mode

Description Ryan Underwood 2018-05-23 01:32:19 UTC
Created attachment 276139 [details]
boot dmesg

Hi,

The last iwl4965 firmware release is unusable with recent-ish kernels.  Every distribution has a bug report along these lines:

Ubuntu
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1225455

Debian
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=797091

SuSE
https://lists.opensuse.org/opensuse-bugs/2013-09/msg03231.html

Red Hat
https://bugzilla.redhat.com/show_bug.cgi?id=1430053

The common factors are:
iwl4965 0000:03:00.0: Loaded firmware version: 228.61.2.24                                                    
iwl4965 0000:03:00.0: Status: 0x000213E4, count: 5  (or Status: 0x000313E4)

I confirmed that rolling back the microcode to the previous 228.57.2.23 release that is still available on Intel's website completely fixes the problem and also improves performance.

This problem is so widespread and makes this hardware so useless that in the absence of a strong argument for an important fix in the last microcode release (228.61.2.24), we should not be supplying that one to users by default.
Comment 1 Ryan Underwood 2018-05-23 01:34:11 UTC
Created attachment 276141 [details]
sample dmesg during failure mode
Comment 2 Ryan Underwood 2018-05-23 01:37:33 UTC
Also, "completely fixes the problem" is an overstatement; let me rephrase.  The user-visible symptoms (unusably slow performance and dropped connections) are completely fixed when using the older microcode.  Microcode crashes are still visible in the log with the older release, but with the older microcode they don't impact the performance of the link.

Example:
[556283.340519] iwl4965 0000:07:00.0: Microcode SW error detected.  Restarting 0x82000000.
[556283.340530] iwl4965 0000:07:00.0: Loaded firmware version: 228.57.2.23
[556283.340549] iwl4965 0000:07:00.0: Start IWL Error Log Dump:
[556283.340554] iwl4965 0000:07:00.0: Status: 0x000213E4, count: 5
[556283.340701] iwl4965 0000:07:00.0: Desc                                  Time       data1      data2      line
[556283.340708] iwl4965 0000:07:00.0: NMI_INTERRUPT_WDG            (0x0004) 1467024232 0x00000002 0x03630000 208
[556283.340713] iwl4965 0000:07:00.0: pc      blink1  blink2  ilink1  ilink2  hcmd
[556283.340720] iwl4965 0000:07:00.0: 0x0046C 0x04B30 0x004C2 0x006DE 0x04BCC 0x27A001C
[556283.340725] iwl4965 0000:07:00.0: FH register values:
[556283.340743] iwl4965 0000:07:00.0:       FH49_RSCSR_CHNL0_STTS_WPTR_REG: 0X1d1a2700
[556283.340761] iwl4965 0000:07:00.0:      FH49_RSCSR_CHNL0_RBDCB_BASE_REG: 0X01090700
[556283.340778] iwl4965 0000:07:00.0:                FH49_RSCSR_CHNL0_WPTR: 0X000000f0
[556283.340796] iwl4965 0000:07:00.0:       FH49_MEM_RCSR_CHNL0_CONFIG_REG: 0X80809000
[556283.340812] iwl4965 0000:07:00.0:        FH49_MEM_RSSR_SHARED_CTRL_REG: 0X0000003c
[556283.340829] iwl4965 0000:07:00.0:          FH49_MEM_RSSR_RX_STATUS_REG: 0X03630000
[556283.340846] iwl4965 0000:07:00.0:   FH49_MEM_RSSR_RX_ENABLE_ERR_IRQ2DRV: 0X00000000
[556283.340863] iwl4965 0000:07:00.0:              FH49_TSSR_TX_STATUS_REG: 0X07fd0002
[556283.340880] iwl4965 0000:07:00.0:               FH49_TSSR_TX_ERROR_REG: 0X00000000
[556283.406101] iwl4965 0000:07:00.0: Timeout stopping DMA channel 1 [0x07fd0002]
[556283.407444] iwl4965 0000:07:00.0: Can't stop Rx DMA.
[556283.407769] ieee80211 phy13: Hardware restart was requested
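
To compare the two firmware releases on more than one log excerpt, the crash events can be tallied per firmware version. The following is a minimal Python sketch, not part of the driver or any existing tool. It assumes the dmesg text contains lines in the same format as the example above ("Microcode SW error detected" followed by "Loaded firmware version: ..."); the function name is made up for illustration.

```python
import re
from collections import Counter

# Patterns are taken from the log lines quoted in this report.
ERROR_RE = re.compile(r"iwl4965 \S+: Microcode SW error detected")
VERSION_RE = re.compile(r"iwl4965 \S+: Loaded firmware version: (\S+)")


def count_crashes_by_firmware(dmesg_text):
    """Count 'Microcode SW error' events, keyed by the firmware version
    printed in the error dump that follows each event."""
    counts = Counter()
    pending = 0  # errors seen whose firmware version line has not appeared yet
    for line in dmesg_text.splitlines():
        if ERROR_RE.search(line):
            pending += 1
            continue
        match = VERSION_RE.search(line)
        if match and pending:
            # Attribute the pending error(s) to this firmware version.
            counts[match.group(1)] += pending
            pending = 0
    return counts
```

Feeding it the saved output of dmesg (for example the text of the attachments on this bug) gives a per-version crash count. As noted above, the count alone does not show whether the link stayed usable after each crash.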
Comment 3 Stanislaw Gruszka 2018-06-02 11:51:43 UTC
The 228.61.2.24 firmware was released about 10 years ago and has been in use since then. Removing support for it now does not sound like a good idea.

I would check whether your distribution enables PowerSave (disabled by default), which is known to cause firmware crashes. Since the 4.13 kernel there is a warning if PS is enabled:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=438f3d13da5e0714f1add1652865b864a2c36eb7
Comment 4 Ryan Underwood 2018-06-02 15:06:02 UTC
Disabling powersave (via powertop) does not stop these particular crashes with the newer firmware nor the older one.  The difference is just that in the presence of these occasional crashes, the link doesn't slow to a near-halt with the older firmware.

I suspect that few people are still using this hardware with newer kernels.  They are part of a potentially larger set of people that includes those who tried and gave up because it didn't work.
Comment 5 Ryan Underwood 2018-06-02 15:10:21 UTC
I should state that I agree there is risk in rolling it back, so it would be nice to figure out what is going on here.

It is easy to reproduce:
- swcrypto=1
- powersave on or off
- Some USB dongle running hostapd (I have RTL8818AU for now)
- Multiple simultaneous bulk streams - try downloading multiple YouTube videos at once, for example, or moving multiple files to a CIFS mount simultaneously
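
The last step (several bulk streams at once) can be scripted. Below is a minimal Python sketch using only the standard library. The function names are illustrative, and the URLs passed in are left to the tester; any set of large files on a fast server should do.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_and_discard(url):
    """Download one URL, discard the body, and return the number of bytes read."""
    total = 0
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(1 << 16)
            if not chunk:
                break
            total += len(chunk)
    return total


def run_parallel_streams(urls, fetch=fetch_and_discard):
    """Run one transfer per URL concurrently; return bytes transferred per URL."""
    # One worker thread per URL so that all streams are active at the same time.
    with ThreadPoolExecutor(max_workers=max(1, len(urls))) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

Calling run_parallel_streams with a list of several large-file URLs while watching dmesg for "Microcode SW error detected" should show whether the crash triggers under load. The fetch parameter exists only so that a different transfer method can be substituted.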
Comment 6 Stanislaw Gruszka 2018-06-06 15:24:08 UTC
Is this reproducible without swcrypto=1?
Comment 7 Ryan Underwood 2018-06-06 15:39:36 UTC
Without swcrypto=1 the situation is hopeless for other reasons.  I am actually amazed that it is not the default.  However, I have not tested it recently and will do so again.
Comment 8 Stanislaw Gruszka 2018-06-07 08:53:40 UTC
What do you mean by hopeless? Does it not associate with the AP? What encryption/settings are you using?
Comment 9 Ryan Underwood 2018-06-07 14:04:54 UTC
The last time I tried it without swcrypto=1, years ago, the firmware constantly crashed and it was slow.  I have always used WPA2 Personal with TKIP.  What settings do you mean?

By the way, it's a lot easier to try this hardware yourself and see just how bad it is than for me to explain it to you in this text box. :-)
Comment 10 Stanislaw Gruszka 2018-06-08 09:19:35 UTC
I have the hardware and it works flawlessly for me.

I would try AES instead of TKIP. TKIP is flawed anyway.

I wanted to ask you to provide debug logs, but I have lost the willingness to work on this bug. You know the workaround for the problem anyway.
Comment 11 Ryan Underwood 2018-06-08 14:40:36 UTC
Losing interest in a bug that's reported by users in every distribution for years?

I cannot control what some random router supports, and expecting the user to be in charge of that is frankly ridiculous.

This is exactly the kind of problem that provides a poor user experience as soon as a user installs Linux.  If you're too lazy to work on it as you stated, at least leave it open for someone who will.
Comment 12 Ryan Underwood 2018-06-08 14:48:56 UTC
This should be easy enough for you: use the last firmware and download 5 kernel tarballs simultaneously from a fast local mirror.  If the firmware doesn't crash, then something is different about your configuration.