Bug 199551 - iwlwifi:9260: BUG at drivers/net/wireless/intel/iwlwifi/pcie/rx.c:425 - with more than 16 CPUs
Summary: iwlwifi:9260: BUG at drivers/net/wireless/intel/iwlwifi/pcie/rx.c:425 - with ...
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: Intel Linux
Importance: P1 normal
Assignee: Intel Linux Wireless
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-28 13:21 UTC by wgjak47
Modified: 2018-05-29 06:53 UTC

See Also:
Kernel Version: 4.14 && 4.15
Tree: Mainline
Regression: No


Attachments
the full dmesg message (70.79 KB, text/plain)
2018-04-28 13:21 UTC, wgjak47
the dmidecode info (16.66 KB, text/plain)
2018-04-28 13:22 UTC, wgjak47
the config of my kernel of version 4.15 on gentoo (152.90 KB, text/plain)
2018-04-28 13:23 UTC, wgjak47
the new kernel config (152.90 KB, text/plain)
2018-04-29 13:27 UTC, wgjak47
the dmesg with the new kernel (71.51 KB, text/plain)
2018-04-29 13:27 UTC, wgjak47
the dmesg with the patch when trying to connect to a network (100.67 KB, text/plain)
2018-05-06 06:14 UTC, wgjak47

Description wgjak47 2018-04-28 13:21:13 UTC
Created attachment 275631 [details]
the full dmesg message

I have an Intel® Wireless-AC 9260 device. It does not work, and the kernel logs this message:

[    8.069769] kernel BUG at drivers/net/wireless/intel/iwlwifi/pcie/rx.c:425!

I tested kernel versions 4.14 and 4.15 on Gentoo and kernel 4.15 on Ubuntu 18.04.
I also can't find any wireless device such as wlan0 under /sys/class/net.


tree /sys/class/net:
/sys/class/net
├── bond0 -> ../../devices/virtual/net/bond0
├── bonding_masters
├── br-7aa37a5816b5 -> ../../devices/virtual/net/br-7aa37a5816b5
├── docker0 -> ../../devices/virtual/net/docker0
├── enp6s0 -> ../../devices/pci0000:00/0000:00:01.3/0000:01:00.2/0000:02:03.0/0000:06:00.0/net/enp6s0
├── ifb0 -> ../../devices/virtual/net/ifb0
├── ifb1 -> ../../devices/virtual/net/ifb1
└── lo -> ../../devices/virtual/net/lo
Comment 1 wgjak47 2018-04-28 13:22:09 UTC
Created attachment 275633 [details]
the dmidecode info
Comment 2 wgjak47 2018-04-28 13:23:53 UTC
Created attachment 275635 [details]
the config of my kernel of version 4.15 on gentoo
Comment 3 Emmanuel Grumbach 2018-04-29 06:32:04 UTC
This bug should be fixed by:

commit 7298cba8c7e6d51a7ae78eaee5d7b2aa405c76b5
Author: Shaul Triebitz <shaul.triebitz@intel.com>
Date:   Thu Mar 22 14:14:45 2018 +0200

    [BUGFIX] iwlwifi: pcie: fix race in Rx buffer allocator
    
    Make sure rx_allocator worker is canceled before running
    rx_init routine.
    rx_init frees and re-allocates all rxb's pages.
    rx_allocator worker also allocates pages for the
    used rxb's.
    Running rx_init and rx_allocator simultaniously causes
    kernel panic.
    
    type=bugfix
    fixes=unknown
    ticket=jira:WIFI-8507
    
    Change-Id: Ic974ec345a172f53e5806735e2c59e0218481ff2
    Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
    x-iwlwifi-stack-dev: e8e4ac843fc1e7ea7e7eb74faba7a59e04616542


in our internal tree:
https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/

can you test?

Thanks.
Comment 4 wgjak47 2018-04-29 13:26:51 UTC
I tried backport-iwlwifi with a new kernel config.
The problem still exists:

[    8.045707] iwlwifi 0000:05:00.0: loaded firmware version 34.0.0 op_mode iwlmvm
[    8.183969] iwlwifi 0000:05:00.0: Detected Intel(R) Dual Band Wireless AC 9260, REV=0x324
[    8.189960] kernel BUG at /home/wgjak47/Code/backport-iwlwifi/drivers/net/wireless/intel/iwlwifi/pcie/rx.c:452!
Comment 5 wgjak47 2018-04-29 13:27:18 UTC
Created attachment 275665 [details]
the new kernel config
Comment 6 wgjak47 2018-04-29 13:27:48 UTC
Created attachment 275667 [details]
the dmesg with the new kernel
Comment 7 Jonathan Dunlap 2018-05-03 19:50:03 UTC
I have the same bug! I have the same wifi card in a Gigabyte x470 Gaming 7.

Fedora 28, kernel 4.16.5-300.fc28.x86_64

sudo lspci -nn | grep -i network

05:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
06:00.0 Network controller [0280]: Intel Corporation Wireless-AC 9260 [8086:2526] (rev 29)
---
sudo dmesg | grep -e iwlwifi -e 9260

[    5.949573] iwlwifi 0000:06:00.0: enabling device (0000 -> 0002)
[    5.956596] iwlwifi 0000:06:00.0: Direct firmware load for iwlwifi-9260-th-b0-jf-b0-36.ucode failed with error -2
[    5.956607] iwlwifi 0000:06:00.0: Direct firmware load for iwlwifi-9260-th-b0-jf-b0-35.ucode failed with error -2
[    5.970257] iwlwifi 0000:06:00.0: loaded firmware version 34.0.0 op_mode iwlmvm
[    6.056544] iwlwifi 0000:06:00.0: Detected Intel(R) Dual Band Wireless AC 9260, REV=0x324
>>[    6.062623] kernel BUG at drivers/net/wireless/intel/iwlwifi/pcie/rx.c:425!
[    6.062680] Modules linked in: iwlmvm(+) edac_mce_amd mac80211 kvm_amd kvm snd_hda_codec_realtek snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul snd_usb_audio(+) uvcvideo snd_hda_intel btusb drm_kms_helper iwlwifi btrtl btbcm ghash_clmulni_intel videobuf2_vmalloc btintel snd_hda_codec videobuf2_memops snd_usbmidi_lib videobuf2_v4l2 snd_hda_core drm bluetooth videobuf2_common snd_rawmidi snd_hwdep videodev snd_seq cfg80211 snd_seq_device media joydev snd_pcm ipmi_devintf ipmi_msghandler snd_timer ecdh_generic snd wmi_bmof rfkill soundcore sp5100_tco i2c_piix4 k10temp ccp shpchp acpi_cpufreq binfmt_misc hid_apple mxm_wmi igb crc32c_intel ptp pps_core dca i2c_algo_bit wmi
[    6.063032] RIP: 0010:iwl_pcie_rxq_alloc_rbs+0x182/0x1f0 [iwlwifi]
[    6.063340]  _iwl_pcie_rx_init+0x25c/0x730 [iwlwifi]
[    6.063366]  iwl_pcie_rx_init+0x2b/0x3b0 [iwlwifi]
[    6.063392]  iwl_trans_pcie_start_fw+0x293/0x6b0 [iwlwifi]
[    6.063548]  ? iwl_trans_pcie_start_hw+0x59/0x1b0 [iwlwifi]
[    6.063607]  _iwl_op_mode_start.isra.8+0x47/0xa0 [iwlwifi]
[    6.063652]  iwl_opmode_register+0x6f/0xe0 [iwlwifi]

Original reddit post for other info:
https://www.reddit.com/r/linuxquestions/comments/8gde2w/intel_9560_wifi_driver_active_but_no_interface/
Comment 8 Jonathan Dunlap 2018-05-03 20:23:35 UTC
I've also reported this issue to the Red Hat kernel bug tracker:
https://bugzilla.redhat.com/show_bug.cgi?id=1574679
Comment 9 Hao Wei Tee 2018-05-05 11:47:13 UTC
This seems to be caused in part by this out-of-bounds array access:

// https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/rx.c?id=d3a6f7fb97fc34a38e40cc56392e701598f99863#n948
	num_alloc = queue_size + allocator_pool_size;
	BUILD_BUG_ON(ARRAY_SIZE(trans_pcie->global_table) !=
		     ARRAY_SIZE(trans_pcie->rx_pool));
	for (i = 0; i < num_alloc; i++) {
		struct iwl_rx_mem_buffer *rxb = &trans_pcie->rx_pool[i];

I added some debug printks and it appears num_alloc is 613, while RX_POOL_SIZE is 607.

    struct iwl_trans_pcie {
	    struct iwl_rxq *rxq;
	    struct iwl_rx_mem_buffer rx_pool[RX_POOL_SIZE];
	    struct iwl_rx_mem_buffer *global_table[RX_POOL_SIZE];

Looking at what we've overrun into.. I'm surprised nothing else has gone wrong.

Either that or I'm missing something here.
Comment 10 Hao Wei Tee 2018-05-05 12:13:10 UTC
It appears trans->num_rx_queues is 17 while IWL_MAX_RX_HW_QUEUES is 16. I have no idea if this is normal or not.

If it is, perhaps the fix is simply to limit num_alloc to RX_POOL_SIZE? i.e.

    num_alloc = min_t(int, queue_size + allocator_pool_size, RX_POOL_SIZE);
Comment 11 Hao Wei Tee 2018-05-05 12:40:30 UTC
Applying that gives me another error:

[   26.706154] ------------[ cut here ]------------
[   26.706154] Invalid rxb from HW 0
[   26.706172] WARNING: CPU: 15 PID: 1106 at /home/angelsl/Development/Compile/backport-iwlwifi/drivers/net/wireless/intel/iwlwifi/pcie/rx.c:1377 iwl_pcie_rx_handle+0xa56/0xa90 [iwlwifi]
[   26.706173] Modules linked in: iwlmvm(O+) mac80211(O) iwlwifi(O) cfg80211(O) compat(O) fuse it87(O) hwmon_vid edac_mce_amd kvm_amd snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_codec_hdmi btusb uvcvideo btrtl nls_iso8859_1 nls_cp437 snd_hda_intel btbcm videobuf2_vmalloc vfat btintel irqbypass videobuf2_memops fat snd_usb_audio snd_hda_codec crct10dif_pclmul videobuf2_v4l2 crc32_pclmul ghash_clmulni_intel bluetooth videobuf2_common snd_usbmidi_lib pcbc snd_hda_core igb snd_rawmidi wmi_bmof mxm_wmi snd_hwdep snd_seq_device videodev snd_pcm aesni_intel ptp snd_timer aes_x86_64 crypto_simd ecdh_generic sp5100_tco glue_helper snd pps_core input_leds mousedev media joydev led_class cryptd psmouse dca rfkill i2c_piix4 pcspkr soundcore k10temp ccp rng_core shpchp rtc_cmos gpio_amdpt pinctrl_amd
[   26.706203]  evdev wmi mac_hid acpi_cpufreq ip_tables x_tables ext4 crc16 mbcache jbd2 fscrypto sd_mod hid_generic usbhid hid serio_raw atkbd libps2 ahci xhci_pci libahci xhci_hcd crc32c_intel libata usbcore scsi_mod usb_common i8042 serio amdgpu chash i2c_algo_bit gpu_sched drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm agpgart
[   26.706221] CPU: 15 PID: 1106 Comm: irq/113-iwlwifi Tainted: G        W  O     4.16.6-1-ARCH #1
[   26.706221] Hardware name: Gigabyte Technology Co., Ltd. X470 AORUS GAMING 5 WIFI/X470 AORUS GAMING 5 WIFI-CF, BIOS F3d 04/17/2018
[   26.706228] RIP: 0010:iwl_pcie_rx_handle+0xa56/0xa90 [iwlwifi]
[   26.706229] RSP: 0018:ffff9ab6c33b3db0 EFLAGS: 00010286
[   26.706230] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000202
[   26.706231] RDX: 0000000080000202 RSI: ffffffff9ae680bc RDI: 00000000ffffffff
[   26.706231] RBP: 0000000000000000 R08: 0000000000000028 R09: 0000000000000458
[   26.706232] R10: fffffffffff6e8e7 R11: 0000000000000001 R12: ffff8d74d7810018
[   26.706233] R13: ffff8d74d7810018 R14: ffff8d74d79d0000 R15: ffff8d74d79d0000
[   26.706234] FS:  0000000000000000(0000) GS:ffff8d751efc0000(0000) knlGS:0000000000000000
[   26.706235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   26.706235] CR2: 00007f23acff5f68 CR3: 000000020d00a000 CR4: 00000000003406e0
[   26.706236] Call Trace:
[   26.706242]  ? finish_task_switch+0x85/0x2c0
[   26.706244]  ? __update_idle_core+0x20/0xb0
[   26.706251]  iwl_pcie_irq_msix_handler+0x468/0x4c0 [iwlwifi]
[   26.706254]  ? irq_forced_thread_fn+0x70/0x70
[   26.706255]  ? irq_thread_dtor+0xa0/0xa0
[   26.706256]  irq_thread_fn+0x21/0x50
[   26.706258]  irq_thread+0x142/0x1a0
[   26.706259]  ? wake_threads_waitq+0x30/0x30
[   26.706261]  kthread+0x113/0x130
[   26.706263]  ? kthread_create_on_node+0x70/0x70
[   26.706266]  ret_from_fork+0x22/0x40
[   26.706267] Code: f7 e8 5f ef ff ff e9 18 f6 ff ff 48 c7 c7 f9 51 f1 c0 4d 89 f4 4d 89 fe e8 12 4f 1f d9 89 ee 48 c7 c7 09 52 f1 c0 e8 ba 28 19 d9 <0f> 0b 4c 89 e7 e8 20 6c ff ff e9 1b fc ff ff e8 36 2b 19 d9 48 
[   26.706290] ---[ end trace 3d58ae2f86cd2464 ]---
[   26.706302] iwlwifi 0000:05:00.0: Microcode SW error detected. Restarting 0x1.
[   26.706306] iwlwifi 0000:05:00.0: Not valid error log pointer 0x00000000 for Init uCode
[   26.706337] iwlwifi 0000:05:00.0: SecBoot CPU1 Status: 0x3, CPU2 Status: 0x2412
[   26.706339] iwlwifi 0000:05:00.0: Failed to start INIT ucode: -5
[   26.718591] iwlwifi 0000:05:00.0: Failed to run INIT ucode: -5

I'm going to leave it from here, this is way out of my depth. Hopefully something I said is useful.
Comment 12 Hao Wei Tee 2018-05-05 14:47:20 UTC
OK, I got it.

// https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next.git/tree/drivers/net/wireless/intel/iwlwifi/pcie/trans.c?id=d3a6f7fb97fc34a38e40cc56392e701598f99863#n1610
	max_irqs = min_t(u32, nr_online_cpus + 2, IWL_MAX_RX_HW_QUEUES);

This should be `IWL_MAX_RX_HW_QUEUES - 1`. Later on, if `num_irqs <= nr_online_cpus` (which will be the case on a system with 16 logical CPUs, which will be common on these AMD boards), we create `num_irqs + 1` queues. That causes more out-of-bounds array accesses, because quite a few arrays are defined with size IWL_MAX_RX_HW_QUEUES.

This change alone should be sufficient to fix the issue.
Comment 13 Hao Wei Tee 2018-05-05 15:14:53 UTC
FWIW, patch: https://marc.info/?l=linux-wireless&m=152553304932044
Comment 14 wgjak47 2018-05-06 05:52:34 UTC
(In reply to Hao Wei Tee from comment #13)
> FWIW, patch: https://marc.info/?l=linux-wireless&m=152553304932044

Thank you very much. I changed

     max_irqs = min_t(u32, nr_online_cpus + 2, IWL_MAX_RX_HW_QUEUES);

to
     max_irqs = min_t(u32, nr_online_cpus + 2, IWL_MAX_RX_HW_QUEUES - 1);

It works:
$ dmesg | grep iwlwifi
[    7.963470] Loading modules backported from iwlwifi
[    7.963471] iwlwifi-stack-public:master:6965:d9c7f227
[    8.098173] iwlwifi 0000:05:00.0: enabling device (0000 -> 0002)
[    8.099460] iwlwifi 0000:05:00.0: Direct firmware load for iwl-dbg-cfg.ini failed with error -2
[    8.099475] iwlwifi 0000:05:00.0: Direct firmware load for iwlwifi-9260-th-b0-jf-b0-39.ucode failed with error -2
[    8.099485] iwlwifi 0000:05:00.0: Direct firmware load for iwlwifi-9260-th-b0-jf-b0-38.ucode failed with error -2
[    8.099491] iwlwifi 0000:05:00.0: Direct firmware load for iwlwifi-9260-th-b0-jf-b0-37.ucode failed with error -2
[    8.099498] iwlwifi 0000:05:00.0: Direct firmware load for iwlwifi-9260-th-b0-jf-b0-36.ucode failed with error -2
[    8.099505] iwlwifi 0000:05:00.0: Direct firmware load for iwlwifi-9260-th-b0-jf-b0-35.ucode failed with error -2
[    8.103565] iwlwifi 0000:05:00.0: loaded firmware version 34.0.0 op_mode iwlmvm
[    8.224606] iwlwifi 0000:05:00.0: Detected Intel(R) Dual Band Wireless AC 9260, REV=0x324
[    8.362902] iwlwifi 0000:05:00.0: base HW address: a0:c5:89:f9:4a:75
[    8.431680] iwlwifi 0000:05:00.0 wlp5s0: renamed from wlan0
Comment 15 wgjak47 2018-05-06 06:14:37 UTC
Created attachment 275789 [details]
the dmesg with the patch when trying to connect to a network
Comment 16 wgjak47 2018-05-06 06:41:19 UTC
It seems this is not the only bug. When I try to connect to any network, something appears to time out.
The dmesg:
WARNING: CPU: 3 PID: 3511 at /home/wgjak47/Code/backport-iwlwifi/drivers/net/wireless/intel/iwlwifi/pcie/trans.c:2062 iwl_trans_pcie_grab_nic_access+0x19c/0x230 [iwlwifi]
Comment 17 wgjak47 2018-05-06 06:53:04 UTC
I tried disabling AMD SMT, which reduces the number of CPUs from 16 to 8.
With that, everything works fine. It looks like the iwlwifi driver needs to be tested on PCs with 16 or more CPUs.
Comment 18 Hao Wei Tee 2018-05-06 07:24:39 UTC
Yeah, looks like there are more issues with high-logical-CPU-count systems.. maybe it's still related to the number of receive queues.

I'll see what I can find..
Comment 19 Hao Wei Tee 2018-05-06 08:27:47 UTC
Seems like the MSI-X code doesn't work very well when the number of interrupts given by the OS is 2 or more short of what we ask for. I'm trying to see why, but this is really unfamiliar territory for me.

Anyway, I think a workaround for now is to use module option disable_msix=1. It probably will impact speed, but at least it works.
Comment 20 Hao Wei Tee 2018-05-06 08:52:26 UTC
There is some problem with IWL_SHARED_IRQ_FIRST_RSS handling.
Comment 21 Emmanuel Grumbach 2018-05-06 09:06:08 UTC
We are looking into this.

Many thanks for the report, analysis, patch etc...
Comment 22 Hao Wei Tee 2018-05-06 10:28:14 UTC
> use module option disable_msix=1

Looks like that isn't in the module that is shipped with kernel v4.16.

Oops.
Comment 23 Jonathan Dunlap 2018-05-07 13:35:53 UTC
> the number of CPUs is reduced from 16 to 8... It all works fine
(In reply to Hao Wei Tee from comment #18)
> Yeah, looks like there are more issues with high-logical-CPU-count systems..

Note: the Intel 9260 ships in most of the released AMD X470 boards, which are almost exclusively used with 16-thread CPUs (2700X).
Comment 24 Emmanuel Grumbach 2018-05-07 13:41:33 UTC
point made :)

I can't commit to an ETA for a fix though.
Comment 25 Emmanuel Grumbach 2018-05-08 12:30:35 UTC
Your third version of the fix [1] has been reviewed internally and it looks fine.

We will pick it up.

All, please take that patch and report any further issue you may have.
Leaving the bug open for now.


[1] https://patchwork.kernel.org/patch/10382693/
Comment 26 Emmanuel Grumbach 2018-05-08 12:30:47 UTC
And.... thank you!
Comment 27 Hao Wei Tee 2018-05-08 12:37:34 UTC
(In reply to Emmanuel Grumbach from comment #25)
> Your third version of the fix [1] has been reviewed internally and it looks
> fine.
> 
> We will pick it up.

That's great! Thank you for looking at it :)

> All, please take that patch and report any further issue you may have.
> Leaving the bug open for now.

I think there may still be problems when IWL_SHARED_IRQ_FIRST_RSS is activated -- at least when I tried, it seemed like many packets were getting lost. I didn't manage to figure out why though. (Perhaps it was a mistake on my end? I tested it by just making it request for nr_cpus-2 IRQs.)

But I guess that is a separate, although related, bug from this.
Comment 28 Hao Wei Tee 2018-05-08 12:39:13 UTC
(In reply to Hao Wei Tee from comment #27)
> my end? I tested it by just making it request for nr_cpus-2 IRQs.)

Sorry, that should have been max_irqs-2.
Comment 29 Emmanuel Grumbach 2018-05-10 07:53:00 UTC
Fix is now merged in our internal tree 

https://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/backport-iwlwifi.git/

It'll follow the regular process to upstream kernel.

Thanks!!!
Comment 30 Jonathan Dunlap 2018-05-11 22:46:05 UTC
YES! It works! I cloned your backport repo and used the below steps to install it and now my wifi works flawlessly. Can't wait for this to reach upstream for other fellow folks.

https://github.com/kimduho/linux/wiki/Linux-Back-port-Driver-Installation
Comment 31 Jonathan Dunlap 2018-05-28 19:53:13 UTC
(In reply to Emmanuel Grumbach from comment #29)
> It'll follow the regular process to upstream kernel.

@Emmanuel any updates on upstreaming? Any guess on what version this might land in?
Comment 32 Luca Coelho 2018-05-29 06:53:00 UTC
Thanks for the reminder! Somehow this slipped under my radar and didn't get pushed out.

I will send it out today and we will try to get it into v4.17 (with a Fixes: tag, so it will reach the stable trees as well), but I'm not sure it will still be taken, since we are already at rc7, so it's quite late.

If it doesn't get into v4.17, it will take longer, but it should eventually get there (as part of stable).
