Bug 207557 - Intel dual AC 7260 crashes randomly
Summary: Intel dual AC 7260 crashes randomly
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless-intel
Hardware: Intel Linux
Importance: P1 high
Assignee: Default virtual assignee for network-wireless-intel
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-03 13:05 UTC by JJ Singh
Modified: 2022-08-22 16:54 UTC

See Also:
Kernel Version: 5.6.10
Subsystem:
Regression: No
Bisected commit-id:


Attachments
output of "dmesg | grep iwlwifi" (10.59 KB, text/plain)
2020-05-03 13:05 UTC, JJ Singh
Journalctl logs - part1 (2.75 MB, text/plain)
2021-01-26 22:47 UTC, Przemyslaw Kulczycki
Journalctl logs - part2 (1.37 MB, text/plain)
2021-01-26 22:48 UTC, Przemyslaw Kulczycki

Description JJ Singh 2020-05-03 13:05:54 UTC
Created attachment 288875 [details]
output of "dmesg | grep iwlwifi"

I'm seeing random crashes of the iwlwifi driver. I'm using Fedora 32, but the issue is also reproducible on Ubuntu 20.04. I've searched for a fix for hours without finding one, so filing a bug here is my last resort. My card is an "Intel Corporation Dual Band Wireless-AC 7260".
Here is the output of rfkill:
ID TYPE      DEVICE            SOFT      HARD
 0 wlan      ideapad_wlan      unblocked unblocked
 1 bluetooth ideapad_bluetooth unblocked unblocked
 2 wlan      phy0              unblocked unblocked
 3 bluetooth hci0              unblocked unblocked

Here is the output of "ethtool -i wlp8s0 | grep firmware":

firmware-version: 17.3216344376.0 7260-17.ucode
Comment 1 JJ Singh 2020-05-03 13:12:00 UTC
I forgot to mention that I have to reboot to get wlan working again.
Comment 2 JJ Singh 2020-05-06 20:58:08 UTC
Can we have some info on this? I even tried upgrading to kernel 5.6.10, but the issue is still there! I'm not the only one with this problem; please see https://forum.mxlinux.org/viewtopic.php?t=55392

This is a Firmware/Driver issue.
Comment 3 Paul Ausbeck 2020-08-18 22:57:26 UTC
I'm going to tag along on this bug as I'm likely seeing the same problem.
The crash that I see is not that random: I can easily trigger it by doing
a 1GB curl transfer, which never completes because the crash happens too frequently. As far as I can tell a reboot is necessary to clear the fault.
I use one machine as a testbed for wireless cards as it has a mPCIe slot.
Thus far I've used Intel 5100, 6205, 200AX, and now 7260AC cards in this machine.
I've only had problems with 7260AC cards, of which I have two, both exhibit the same problem.
So far I have found that the problem only occurs when connected at 2.4GHz,
and only with a 40MHz channel; 20MHz connections are OK, with transfer bandwidth of ~10MB/s.
Additionally, it only occurs when connected at 40MHz at boot.
I found that if I connect at 2.4GHz/20MHz and make a test transfer, I can then
reconfigure the AP to force 40MHz bandwidth, after which subsequent transfers
complete successfully at ~24MB/s, which is what I expect from that configuration.
The 5GHz band does not exhibit any firmware crashes but does suffer from highly
variable transfer bandwidth.
When connected at 80MHz bandwidth, throughput maxes out at 30MB/s, but not consistently. Most
of the time I see only half that, 15MB/s, and that's without moving anything.
I'm using Debian Buster with kernel 4.19, though I've also tried 5.6 and 5.7 
backported kernels and backported firmware. Same problem seen with the newer kernels and firmware.
I've looked at a number of wireless cards over the years and I haven't seen this kind of flaky problem before. I hope this report is taken seriously.
I regard this card as particularly important in that it is the newest and most
capable Intel Wireless card available in the mPCIE form factor. I would really like to see it working properly under linux.


qm77 motherboard, 3820QM CPU
Linux imb170 4.19.0-10-amd64 #1 SMP Debian 4.19.132-1 (2020-07-24) x86_64 GNU/Linux
03:00.0 Network controller: Intel Corporation Wireless 7260 (rev bb)
       description: Wireless interface
       product: Wireless 7260
       vendor: Intel Corporation
       physical id: 0
       bus info: pci@0000:03:00.0
       logical name: wlan0
       version: bb
       serial: 00:16:6f:e7:16:2a
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list ethernet physical wireless
       configuration: broadcast=yes driver=iwlwifi driverversion=4.19.0-10-amd64 firmware=17.3216344376.0 ip=192.168.1.126 latency=0 link=yes multicast=yes wireless=IEEE 802.11
       resources: irq:38 memory:f7a00000-f7a01fff
Module = "iwlwifi"

  Attributes:
    coresize            = "249856"
    initsize            = "0"
    initstate           = "live"
    refcnt              = "1"
    taint               = ""
    uevent              = <store method only>

  Parameters:
    11n_disable         = "0"
    amsdu_size          = "0"
    antenna_coupling    = "0"
    bt_coex_active      = "Y"
    d0i3_disable        = "Y"
    d0i3_timeout        = "1000"
    disable_11ac        = "N"
    disable_11ax        = "N"
    fw_monitor          = "N"
    fw_restart          = "Y"
    lar_disable         = "N"
    led_mode            = "0"
    nvm_file            = "(null)"
    power_level         = "0"
    power_save          = "N"
    remove_when_gone    = "N"
    swcrypto            = "0"
    uapsd_disable       = "3"

[  179.224409] iwlwifi 0000:03:00.0: Failed to wake NIC for hcmd
[  179.224451] iwlwifi 0000:03:00.0: Error sending SCAN_OFFLOAD_REQUEST_CMD: enqueue_hcmd failed: -5
[  179.224458] iwlwifi 0000:03:00.0: Scan failed! ret -5
[  182.226400] ------------[ cut here ]------------
[  182.226405] Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)
[  182.226455] WARNING: CPU: 1 PID: 0 at drivers/net/wireless/intel/iwlwifi/pcie/trans.c:2033 iwl_trans_pcie_grab_nic_access+0x1e8/0x220 [iwlwifi]
[  182.226456] Modules linked in: cpufreq_powersave cpufreq_conservative cpufreq_userspace snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ccm algif_aead cbc des_generic intel_rapl arc4 algif_skcipher cmac sha512_ssse3 sha512_generic md4 algif_hash af_alg x86_pkg_temp_thermal intel_powerclamp coretemp iwlmvm kvm_intel i915 kvm mac80211 snd_hda_intel irqbypass snd_hda_codec crct10dif_pclmul crc32_pclmul snd_hda_core ghash_clmulni_intel mei_wdt iwlwifi drm_kms_helper snd_hwdep snd_pcm intel_cstate btusb snd_timer btrtl btbcm snd ppdev evdev btintel pcc_cpufreq intel_uncore mei_me sg drm cfg80211 mei soundcore bluetooth intel_rapl_perf i2c_algo_bit pcspkr iTCO_wdt parport_pc iTCO_vendor_support parport drbg ansi_cprng video button ecdh_generic rfkill nfsd auth_rpcgss nfs_acl lockd grace
[  182.226496]  sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sd_mod crc32c_intel ahci xhci_pci libahci xhci_hcd libata nvme ehci_pci aesni_intel ehci_hcd e1000e aes_x86_64 scsi_mod crypto_simd usbcore cryptd glue_helper nvme_core i2c_i801 lpc_ich mfd_core usb_common thermal fan
[  182.226521] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.0-10-amd64 #1 Debian 4.19.132-1
[  182.226522] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./IMB-170, BIOS P1.90 04/30/2018
[  182.226534] RIP: 0010:iwl_trans_pcie_grab_nic_access+0x1e8/0x220 [iwlwifi]
[  182.226537] Code: 07 e2 49 8d 56 08 bf 00 02 00 00 e8 a2 5e fe e0 e9 33 ff ff ff 89 c6 48 c7 c7 80 f5 ac c0 c6 05 69 46 02 00 01 e8 c2 8a fc e0 <0f> 0b e9 ee fe ff ff 48 8b 7b 30 48 c7 c1 e8 f5 ac c0 31 d2 31 f6
[  182.226539] RSP: 0018:ffff8d3156443e00 EFLAGS: 00010086
[  182.226541] RAX: 0000000000000000 RBX: ffff8d314fb60018 RCX: 0000000000000006
[  182.226542] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff8d31564566b0
[  182.226544] RBP: 0000000000000000 R08: 0000000000000306 R09: 0000000000000004
[  182.226545] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8d314fb6a258
[  182.226547] R13: ffff8d3156443e30 R14: 00000000ffffffff R15: 0000000000000004
[  182.226549] FS:  0000000000000000(0000) GS:ffff8d3156440000(0000) knlGS:0000000000000000
[  182.226550] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  182.226552] CR2: 00007fe3e5625d78 CR3: 000000014c80a006 CR4: 00000000001606e0
[  182.226554] Call Trace:
[  182.226558]  <IRQ>
[  182.226569]  iwl_read_prph+0x32/0x90 [iwlwifi]
[  182.226581]  iwl_trans_pcie_log_scd_error+0x13a/0x210 [iwlwifi]
[  182.226591]  iwl_pcie_txq_stuck_timer+0x46/0x70 [iwlwifi]
[  182.226601]  ? iwl_pcie_clear_cmd_in_flight+0x80/0x80 [iwlwifi]
[  182.226608]  call_timer_fn+0x2b/0x130
[  182.226612]  run_timer_softirq+0x1c7/0x3e0
[  182.226617]  ? tick_sched_timer+0x37/0x70
[  182.226621]  ? __hrtimer_run_queues+0x110/0x280
[  182.226626]  ? recalibrate_cpu_khz+0x10/0x10
[  182.226628]  ? ktime_get+0x3a/0xa0
[  182.226634]  __do_softirq+0xde/0x2d8
[  182.226640]  irq_exit+0xba/0xc0
[  182.226644]  smp_apic_timer_interrupt+0x74/0x140
[  182.226648]  apic_timer_interrupt+0xf/0x20
[  182.226650]  </IRQ>
[  182.226654] RIP: 0010:cpuidle_enter_state+0xb9/0x320
[  182.226656] Code: e8 4c b9 b0 ff 80 7c 24 0b 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 3b 02 00 00 31 ff e8 1e a9 b6 ff fb 66 0f 1f 44 00 00 <48> b8 ff ff ff ff f3 01 00 00 48 2b 1c 24 ba ff ff ff 7f 48 39 c3
[  182.226658] RSP: 0018:ffffa9cf80cd3e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  182.226660] RAX: ffff8d31564620c0 RBX: 0000002a6c74a32e RCX: 000000000000001f
[  182.226662] RDX: 0000002a6c74a32e RSI: 000000002f83de23 RDI: 0000000000000000
[  182.226663] RBP: ffff8d315646a300 R08: 0000000000000002 R09: 0000000000021980
[  182.226664] R10: 0000007a3e62ed4b R11: ffff8d31564610a8 R12: 0000000000000005
[  182.226666] R13: ffffffffa2ab71f8 R14: 0000000000000005 R15: 0000000000000000
[  182.226673]  do_idle+0x228/0x270
[  182.226677]  cpu_startup_entry+0x6f/0x80
[  182.226680]  start_secondary+0x1a4/0x200
[  182.226685]  secondary_startup_64+0xa4/0xb0
[  182.226688] ---[ end trace ed4ef1147e5a66cf ]---
[  182.226694] iwlwifi 0000:03:00.0: iwlwifi transaction failed, dumping registers
[  182.226704] iwlwifi 0000:03:00.0: iwlwifi device config registers:
[  182.226760] iwlwifi 0000:03:00.0: 00000000: 08b18086 00100000 028000bb 00000000 00000004 00000000 00000000 00000000
[  182.226767] iwlwifi 0000:03:00.0: 00000020: 00000000 00000000 00000000 40708086 00000000 000000c8 00000000 00000100
[  182.226773] iwlwifi 0000:03:00.0: iwlwifi device memory mapped registers:
[  182.226812] iwlwifi 0000:03:00.0: 00000000: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
[  182.226819] iwlwifi 0000:03:00.0: 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
[  182.226828] iwlwifi 0000:03:00.0: iwlwifi device AER capability structure:
[  182.226860] iwlwifi 0000:03:00.0: 00000000: 14010001 00100000 00000000 00462031 000031c1 00002000 00000014 40000001
[  182.226866] iwlwifi 0000:03:00.0: 00000020: 0000000f f7a00460 00000000
[  182.226870] iwlwifi 0000:03:00.0: iwlwifi parent port (0000:00:1c.2) config registers:
[  182.226899] iwlwifi 0000:00:1c.2: 00000000: 1e148086 00100007 060400c4 00810010 00000000 00000000 00030300 200000f0
[  182.226907] iwlwifi 0000:00:1c.2: 00000020: f7a0f7a0 0001fff1 00000000 00000000 00000000 00000040 00000000 0010030b
[  182.281442] iwlwifi 0000:03:00.0: Queue 4 is active on fifo 2 and stuck for 10000 ms. SW [138, 143] HW [90, 90] FH TRB=0x05a5a5a5a
Comment 4 Paul Ausbeck 2020-08-19 18:10:49 UTC
I'm adding lspci output for my test machine. As one can see, it's pretty much all Intel.

00:00.0 Host bridge: Intel Corporation 3rd Gen Core processor DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express Root Port 1 (rev c4)
00:1c.2 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 3 (rev c4)
00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation QM77 Express Chipset LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller (rev 04)
01:00.0 Non-Volatile memory controller: Sandisk Corp WD Black 2018/PC SN720 NVMe SSD
03:00.0 Network controller: Intel Corporation Wireless 7260 (rev bb)
04:00.0 Ethernet controller: Intel Corporation 82583V Gigabit Network Connection
Comment 5 Paul Ausbeck 2020-08-19 18:54:22 UTC
After further testing I am revising yesterday's assessment of how band/bandwidth combinations influence the crash. It turns out that the crash can happen on both the 2.4 and 5GHz bands, and at both 20 and 40MHz bandwidths on the 2.4GHz band. It's just less likely to occur at 20MHz than at 40MHz on the 2.4GHz band. It also appears that allowing legacy b rates at the AP increases the frequency of the crash. Configuring the AP to disassociate on low acknowledgement may also increase the crash likelihood.

By disabling legacy b rates and disassociation on low ack at the AP, and by performing enough runs, I was able to get transfer throughput measurements in my test harness to compare the 7260 with other cards that I've tested. My throughput test is a simple 1GB file transfer using curl, remotehost -> /dev/null. The transfer direction is toward the wireless device; the data is pulled from a wired host.

Throughput in MB/s, by band (GHz) / channel bandwidth (MHz):

Card     2.4/20   2.4/40   5.2/80   Notes
N5100     5.1     24.2     5-10*    *Variable, human body modulates BW
N6205    10.1     21.7     22.0     Theoretical max 150/300Mb/s, >1/2 OK
AX200    11.2     21.0     49.6     Theoretical 150/300/866Mb/s, ~1/2 OK
7260AC   11.0     22.2     23.2     MCS 15 OK, MCS 15 OK, VHT-MCS 7-9 poor AC

In my test harness the 7260 card operates at near what I would call the expected throughput on the 2.4GHz band at both 20MHz and 40MHz channel bandwidths. However, I would characterize the 5.2GHz/80MHz AC throughput as less than stellar, less than half what I would expect. I was hoping for something more akin to the performance of the AX200 client.
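
The throughput test described above can be sketched as a small shell function. This is only a minimal sketch, not the harness actually used: the function names and the URL are placeholders, and the conversion assumes 1 MB = 1,000,000 bytes, which the report does not specify.

```shell
#!/bin/sh
# Convert curl's average download speed (bytes per second) to MB/s.
# Assumption: 1 MB = 1,000,000 bytes.
to_mbs() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1000000 }'
}

# Pull a file from a wired host, discard the data, and print the
# average rate in MB/s. $1 is a hypothetical URL for a 1GB test file.
measure_throughput() {
    bps=$(curl -s -o /dev/null -w '%{speed_download}' "$1") || return 1
    to_mbs "$bps"
}

# Example (not run automatically):
#   measure_throughput http://remotehost/1GB.bin
```

If the driver crashes mid-transfer, curl exits with an error and the function returns non-zero instead of printing a rate, so failed runs are easy to tell apart from slow ones.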

The AP that I am currently using is the Linksys ea6350v3, which uses the Qualcomm/Atheros ipq4018 SoC. I'm running OpenWrt on the AP. I like this radio; it is quite consistent, throughput-wise, from one device to another. To clarify, all seven of the ea6350v3's that I have yield similar throughput measurements. For further clarification, I stayed with the geriatric WRT54GL for many years just because I couldn't find a newer device that operated reliably and consistently at a higher level, until the ea6350v3, that is.

root@ea6350f:~# uname -a
Linux ea6350f 4.14.167 #0 SMP Wed Jan 29 16:05:35 2020 armv7l GNU/Linux
root@ea6350f:/etc# cat openwrt_release
DISTRIB_ID='OpenWrt'
DISTRIB_RELEASE='19.07.1'
DISTRIB_REVISION='r10911-c155900f66'
DISTRIB_TARGET='ipq40xx/generic'
DISTRIB_ARCH='arm_cortex-a7_neon-vfpv4'
DISTRIB_DESCRIPTION='OpenWrt 19.07.1 r10911-c155900f66'
DISTRIB_TAINTS=''

The AP kernel log shows the following messages each time the client 7260 crashes:

[436037.379260] ath10k_ahb a000000.wifi: peer-unmap-event: unknown peer id 0
[436086.794546] ath10k_ahb a800000.wifi: 10.4 wmi init: vdevs: 16  peers: 48  tid: 96
[436086.794591] ath10k_ahb a800000.wifi: msdu-desc: 2500  skid: 32
[436086.841915] ath10k_ahb a800000.wifi: wmi print 'P 48/48 V 16 K 144 PH 176 T 186  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[436086.842966] ath10k_ahb a800000.wifi: wmi print 'free: 56528 iram: 23400 sram: 32520'
[436087.141520] ath10k_ahb a800000.wifi: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
Comment 6 Paul Ausbeck 2020-09-06 23:00:59 UTC
I attempted to work around the driver crash by unloading and reloading iwlmvm.ko. This results in an "Error, can not clear persistence bit" message in the kernel message log upon restart. It appears that following the initial crash the 7260 hardware is no longer responding to the kernel driver. Is there some way short of a reboot to reset the 7260 device so that the kernel driver can re-initialize?

Unload and reload sequence:

modprobe -r iwlmvm
modprobe iwlmvm

dmesg output:

[19553.559632] Intel(R) Wireless WiFi driver for Linux
[19553.559633] Copyright(c) 2003- 2015 Intel Corporation
[19553.560176] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-7260-17.ucode
[19553.560300] iwlwifi 0000:03:00.0: loaded firmware version 17.3216344376.0 op_mode iwlmvm
[19553.569183] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0xFFFFFFFF
[19553.569196] iwlwifi 0000:03:00.0: Error, can not clear persistence bit
[19553.587422] ------------[ cut here ]------------
[19553.587423] Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)
[19553.587436] WARNING: CPU: 5 PID: 2212 at drivers/net/wireless/intel/iwlwifi/pcie/trans.c:2033 iwl_trans_pcie_grab_nic_access+0x1e8/0x220 [iwlwifi]
[19553.587437] Modules linked in: iwlmvm(+) iwlwifi mac80211 cfg80211 cpufreq_powersave cpufreq_conservative cpufreq_userspace snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl ccm algif_aead cbc des_generic arc4 algif_skcipher cmac sha512_ssse3 sha512_generic x86_pkg_temp_thermal intel_powerclamp md4 algif_hash coretemp af_alg kvm_intel kvm snd_hda_intel snd_hda_codec btusb irqbypass btrtl crct10dif_pclmul btbcm i915 crc32_pclmul snd_hda_core btintel ghash_clmulni_intel snd_hwdep intel_cstate bluetooth drm_kms_helper snd_pcm intel_uncore drbg snd_timer mei_wdt ansi_cprng snd evdev ppdev drm pcc_cpufreq intel_rapl_perf soundcore pcspkr ecdh_generic sg mei_me iTCO_wdt mei iTCO_vendor_support rfkill i2c_algo_bit parport_pc parport button video nfsd auth_rpcgss nfs_acl lockd grace
[19553.587451]  sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb sd_mod crc32c_intel ahci libahci libata nvme xhci_pci aesni_intel xhci_hcd ehci_pci e1000e ehci_hcd scsi_mod aes_x86_64 crypto_simd usbcore cryptd i2c_i801 glue_helper nvme_core lpc_ich mfd_core usb_common thermal fan [last unloaded: cfg80211]
[19553.587459] CPU: 5 PID: 2212 Comm: modprobe Tainted: G        W         4.19.0-10-amd64 #1 Debian 4.19.132-1
[19553.587459] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./IMB-170, BIOS P1.90 04/30/2018
[19553.587463] RIP: 0010:iwl_trans_pcie_grab_nic_access+0x1e8/0x220 [iwlwifi]
[19553.587464] Code: 09 dd 49 8d 56 08 bf 00 02 00 00 e8 a2 de ff db e9 33 ff ff ff 89 c6 48 c7 c7 80 75 eb c0 c6 05 69 46 02 00 01 e8 c2 0a fe db <0f> 0b e9 ee fe ff ff 48 8b 7b 30 48 c7 c1 e8 75 eb c0 31 d2 31 f6
[19553.587465] RSP: 0018:ffff9b5840d57b40 EFLAGS: 00010086
[19553.587466] RAX: 0000000000000000 RBX: ffff8de5b4840018 RCX: 0000000000000006
[19553.587466] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff8de5d65566b0
[19553.587467] RBP: 0000000000000000 R08: 00000000000004cc R09: 0000000000000004
[19553.587467] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8de5b484a258
[19553.587468] R13: ffff9b5840d57b70 R14: 00000000ffffffff R15: 0000000fffffffe0
[19553.587468] FS:  00007ff1f6988480(0000) GS:ffff8de5d6540000(0000) knlGS:0000000000000000
[19553.587469] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[19553.587469] CR2: 00007ffc6a472888 CR3: 00000002125a8002 CR4: 00000000001606e0
[19553.587470] Call Trace:
[19553.587475]  iwl_read_prph+0x32/0x90 [iwlwifi]
[19553.587479]  iwl_pcie_apm_init+0x15a/0x210 [iwlwifi]
[19553.587482]  iwl_pcie_apm_stop+0x324/0x360 [iwlwifi]
[19553.587485]  iwl_trans_pcie_op_mode_leave+0x86/0x150 [iwlwifi]
[19553.587491]  iwl_op_mode_mvm_start+0x8f0/0x990 [iwlmvm]
[19553.587494]  iwl_opmode_register+0x7c/0xf0 [iwlwifi]
[19553.587496]  ? 0xffffffffc0ac9000
[19553.587501]  iwl_mvm_init+0x34/0x1000 [iwlmvm]
[19553.587503]  do_one_initcall+0x46/0x1c3
[19553.587505]  ? free_unref_page_commit+0x91/0x100
[19553.587507]  ? _cond_resched+0x15/0x30
[19553.587509]  ? kmem_cache_alloc_trace+0x15e/0x1e0
[19553.587512]  do_init_module+0x5a/0x210
[19553.587513]  load_module+0x2167/0x23d0
[19553.587515]  ? __do_sys_finit_module+0xad/0x110
[19553.587516]  __do_sys_finit_module+0xad/0x110
[19553.587518]  do_syscall_64+0x53/0x110
[19553.587519]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[19553.587520] RIP: 0033:0x7ff1f6aa2f59
[19553.587521] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
[19553.587522] RSP: 002b:00007ffc6a475978 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[19553.587523] RAX: ffffffffffffffda RBX: 0000556256d32e10 RCX: 00007ff1f6aa2f59
[19553.587523] RDX: 0000000000000000 RSI: 00005562558893f0 RDI: 0000000000000006
[19553.587523] RBP: 00005562558893f0 R08: 0000000000000000 R09: 0000000000000000
[19553.587524] R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
[19553.587524] R13: 0000556256d32f40 R14: 0000000000040000 R15: 0000556256d32e10
[19553.587525] ---[ end trace bdb25fc91c9d5a21 ]---
[19553.587527] iwlwifi 0000:03:00.0: iwlwifi transaction failed, dumping registers
[19553.587529] iwlwifi 0000:03:00.0: iwlwifi device config registers:
[19553.587572] iwlwifi 0000:03:00.0: 00000000: 08b18086 00100406 028000bb 00000000 00000004 00000000 00000000 00000000
[19553.587575] iwlwifi 0000:03:00.0: 00000020: 00000000 00000000 00000000 40708086 00000000 000000c8 00000000 00000100
[19553.587577] iwlwifi 0000:03:00.0: iwlwifi device memory mapped registers:
[19553.587613] iwlwifi 0000:03:00.0: 00000000: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
[19553.587615] iwlwifi 0000:03:00.0: 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
[19553.587620] iwlwifi 0000:03:00.0: iwlwifi device AER capability structure:
[19553.587648] iwlwifi 0000:03:00.0: 00000000: 14010001 00100000 00000000 00462031 000031c1 00002000 00000014 40000001
[19553.587650] iwlwifi 0000:03:00.0: 00000020: 0000000f f7a00460 00000000
[19553.587652] iwlwifi 0000:03:00.0: iwlwifi parent port (0000:00:1c.2) config registers:
[19553.587669] iwlwifi 0000:00:1c.2: 00000000: 1e148086 00100007 060400c4 00810010 00000000 00000000 00030300 200000f0
[19553.587671] iwlwifi 0000:00:1c.2: 00000020: f7a0f7a0 0001fff1 00000000 00000000 00000000 00000040 00000000 0010030b
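
Both traces in this report share the same signature: CSR_GP_CNTRL reads back as 0xffffffff and the memory mapped registers are all ffffffff, which would be consistent with the device no longer responding on the bus. A minimal sketch for checking a kernel log for that signature follows. The function names are made up for illustration, and the match strings are copied from the logs above, so they may differ on other kernel versions.

```shell
#!/bin/sh
# Succeed if the kernel log on stdin contains the hardware access timeout
# seen in this report (CSR_GP_CNTRL reading back as all ones).
nic_unresponsive() {
    grep -q 'Timeout waiting for hardware access (CSR_GP_CNTRL 0xffffffff)'
}

# Print the number of "Failed to wake NIC" events in the log on stdin.
count_wake_failures() {
    grep -c 'Failed to wake NIC for hcmd'
}

# Example (not run automatically; dmesg may need root):
#   dmesg | nic_unresponsive && echo "device is not responding"
```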
Comment 7 Paul Ausbeck 2020-09-07 04:46:13 UTC
I found that the 7260 can be reset without reboot by removing the device at the pci level and then re-scanning the pci bus to bring the device back, thusly:

cd /sys/bus/pci/devices/0000\:03\:00.0
echo 1 | sudo tee remove
cd /sys/bus/pci
echo 1 | sudo tee rescan

According to the kernel message log, an ASPM configuration problem is discovered on pci rescan. This indicates, at least to me, that the failure may therefore be related to ASPM. Perhaps the device is going to a lower ASPM power level without proper coordination with the host.

sudo dmesg

[19453.872457] pci 0000:03:00.0: [8086:08b1] type 00 class 0x028000
[19453.872534] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00001fff 64bit]
[19453.872797] pci 0000:03:00.0: PME# supported from D0 D3hot D3cold
[19453.873017] pcieport 0000:00:1c.2: ASPM: current common clock configuration is broken, reconfiguring
[19453.884494] pci 0000:03:00.0: BAR 0: assigned [mem 0xf7a00000-0xf7a01fff 64bit]
[19453.884594] iwlwifi 0000:03:00.0: enabling device (0100 -> 0102)
[19453.885478] iwlwifi 0000:03:00.0: firmware: direct-loading firmware iwlwifi-7260-17.ucode
[19453.885897] iwlwifi 0000:03:00.0: loaded firmware version 17.3216344376.0 op_mode iwlmvm
[19453.885923] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 7260, REV=0x144
[19453.904591] iwlwifi 0000:03:00.0: base HW address: 00:16:6f:e7:16:2a
[19454.102848] ieee80211 phy2: Selected rate control algorithm 'iwl-mvm-rs'
[19454.400222] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[19457.578660] wlan0: authenticate with 60:38:e0:87:3a:41
[19457.581658] wlan0: send auth to 60:38:e0:87:3a:41 (try 1/3)
[19457.599202] wlan0: authenticated
[19457.600444] wlan0: associate with 60:38:e0:87:3a:41 (try 1/3)
[19457.607129] wlan0: RX AssocResp from 60:38:e0:87:3a:41 (capab=0x421 status=0 aid=1)
[19457.608029] wlan0: associated
[19457.608560] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
Comment 8 Paul Ausbeck 2020-09-07 17:05:10 UTC
The ASPM message that I previously noted gave me the idea of forcing ASPM via kernel parameter. This had no effect on 7260 connection reliability.
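
For anyone repeating this experiment, the active ASPM policy can be read from sysfs, and lspci shows the link state of the card. A minimal sketch follows. It assumes a kernel built with PCIe ASPM support, so that /sys/module/pcie_aspm/parameters/policy exists, and the helper name is made up for illustration. The comment above does not say which kernel parameter was used; pcie_aspm=force and pcie_aspm=off are the documented values.

```shell
#!/bin/sh
# The policy file lists all policies with the active one in brackets,
# e.g. "[default] performance powersave powersupersave".
# Print the active policy from such a line passed as $1.
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example (not run automatically):
#   active_aspm_policy "$(cat /sys/module/pcie_aspm/parameters/policy)"
#   sudo lspci -vv -s 03:00.0 | grep ASPM    # ASPM fields of the card's link
```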

However, I noticed something that I previously hadn't: the 7260 doesn't like suspending and resuming. After an S3 suspend/resume cycle, curl transfer speed is very low, < 1MB/s, and the transfer doesn't last long before the driver crashes as before. Interestingly, this low-bandwidth, unstable condition survives a PCI device removal and rescan. To clarify, once the machine has been suspended and resumed, the 7260 never works well, or at all, until a complete reboot.

One other thing I noted while looking around for 7260 info, at least one person finds that the 7260 is "infamous":

https://askubuntu.com/questions/645506/what-driver-for-intel-ac-7260-adapter

So given my own experience, and finding that the 7260 "17" firmware hasn't been updated for 2.5+ years, I'm thinking now that further effort on the 7260 likely won't be worth anything.

Lastly, the reason that I looked at the 7260 in the first place is that it has the mPCIe form factor and U.FL RF connectors, vs M.2 and MHF4 for newer Intel cards. Though the newer cards and connectors are smaller, using adapters and extra cables actually requires more room than a properly fitting card. Further, I've recently found some newer mPCIe cards with U.FL connectors: Qualcomm/Atheros QCA6174, Broadcom BCM94352, and what look like Chinese repackaged Intel radios, unbranded 7265AC and unbranded MPE-AX3000H. I've seen the QCA6174 before in a Samsung Ativ Book 9 notebook and I like it. I've also wanted to look at what Broadcom is up to in the client radio space, and looking at these Chinese knockoffs also looks worthwhile. These radios are all cheap enough that I can just buy them all and see what they can do. So right now I'm thinking that unless something new happens on the 7260 front, I'm not likely to look at it any more.
Comment 9 JJ Singh 2020-09-07 17:33:03 UTC
Hi @Paul, I'd like to thank you for your determination and for sharing your findings on the issue. It seems that commenting here won't accomplish anything; you have to raise a ticket with the Intel iwlwifi team. I've already raised such a ticket and I'm actively working with the support team on it. However, because of my job, I'm not able to provide the info they need as quickly as they'd like. I've been in contact with the support team for 2 months, and only last week did they realize it's actually a driver issue. They are still asking me for more information. I'd ask you to raise a support ticket with them on this matter as well. It would also be a huge help to me, as you may be able to provide the required info faster than I can. Feel free to contact me by email if you want to help.
Thanks,
JJ
Comment 10 JJ Singh 2020-09-07 17:42:03 UTC
@Paul I just realized you may not be able to see my mail, you can just comment here, I'll contact you if you are interested.
Thanks,
JJ
Comment 11 Przemyslaw Kulczycki 2021-01-26 22:46:30 UTC
I'm experiencing the same bug on Ubuntu 20.04 with linux 5.4.0-64-generic.
I've raised a bug in Ubuntu's bug tracker about it:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1913350
It contains lots of logs.

Intel Wireless-AC 7260 (rev bb) wifi card crashes randomly.
Sometimes after 1h of using my laptop, and sometimes after resuming from sleep.
It can't work again until I reboot my laptop.
The most relevant error message is:
sty 25 14:42:13 leetbook kernel: iwlwifi 0000:25:00.0: Failed to wake NIC for hcmd
sty 25 14:42:13 leetbook kernel: iwlwifi 0000:25:00.0: Error sending STATISTICS_CMD: enqueue_hcmd failed: -5

The card works fine in Windows 10 so it's not a hardware issue.
This is not the stock card for the HP EliteBook 8470w; I installed it to gain 802.11ac transfer speeds.

The workaround script from Ubuntu bug 1673344 works to fix the wifi without rebooting.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1673344/comments/37
It was also posted on the Kernel's bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=191601
Comment 12 Przemyslaw Kulczycki 2021-01-26 22:47:33 UTC
Created attachment 294873 [details]
Journalctl logs - part1

Comment 13 Przemyslaw Kulczycki 2021-01-26 22:48:09 UTC
Created attachment 294875 [details]
Journalctl logs - part2

Comment 14 Levi Webb 2021-04-02 15:35:38 UTC
I can confirm experiencing this bug over a very long period of time on my personal system with this card. Paul Ausbeck's description of the issue is extremely on-point, although I would add that interference from other devices in addition to high bandwidth usage is a much more reliable way to reproduce the crash.

With older kernel versions, this issue used to simply hang the entire system. This can still happen rarely on my machine, but I have not found a way to reproduce that type of crash. I am currently using `5.11.11-arch1-1`.
Comment 15 Levi Webb 2021-04-02 15:37:40 UTC
I also happened to converge on the same hacky solution for reloading the driver as Paul did, so for any other users stuck with this hardware waiting for Intel's driver team to properly support this product, you can adapt this script to your liking:

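# Watch the kernel log and cycle the card at the PCI level when the driver crashes.
# Must run as root. Change dev to the PCI address of your own card.
dev="0000:06:00.0"
echo "Listening for device crashes"
last=$(date +%s)
inst=0
dmesg -W -Lnever | grep --line-buffered "iwlwifi $dev: Failed to wake NIC for hcmd" | while read -r l
do
    inst=$((inst + 1))
    # Act on every second message, and at most once every 5 seconds
    if [ "$inst" -ge 2 ]
    then
        cur=$(date +%s)
        dif=$((cur - last))
        if [ "$dif" -ge 5 ]
        then
            printf "Detected crash, cycling device..."
            # Remove the device from the PCI bus, then rescan to bring it back
            echo 1 > "/sys/bus/pci/devices/$dev/remove"
            sleep 1
            echo 1 > /sys/bus/pci/rescan
            echo " done"
            last=$(date +%s)
        fi
        inst=0
    fi
done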

When the driver crashes, it takes about 5-10 seconds for it to automatically recover if this script is running as superuser.
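
The PCI address in the script is specific to one machine: the card sits at 03:00.0 and 25:00.0 on other systems in this thread. A minimal sketch for looking it up, assuming lspci names the card "Wireless 7260" as in the lspci output earlier in this report (the function name is made up for illustration):

```shell
#!/bin/sh
# Print the full PCI address (domain:bus:device.function) of the first
# device whose description contains "Wireless 7260".
# Reads `lspci -D` output on stdin.
find_7260_addr() {
    awk '/Wireless 7260/ { print $1; exit }'
}

# Example (not run automatically):
#   dev=$(lspci -D | find_7260_addr)
#   echo "card is at /sys/bus/pci/devices/$dev"
```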
