Bug 219375

Summary: iwl_mvm_scan_umac_v14_and_above [iwlmvm] causes soft lockup
Product: Drivers
Component: network-wireless-intel
Reporter: Eric Li (draydere)
Assignee: Default virtual assignee for network-wireless-intel (drivers_network-wireless-intel)
Status: CLOSED CODE_FIX
Severity: high
Priority: P3
CC: draydere, emmanuel.grumbach
Hardware: Intel
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: patch fixing variable type
             iwl wifi dump
             large iwl wifi dump

Description Eric Li 2024-10-10 19:53:34 UTC
This only happens in certain locations at my university, which is really weird. A soft lockup happens, and a kworker on the events_unbound workqueue takes up a whole CPU thread. It seems to be caused by iwlwifi/iwlmvm, and it doesn't happen when I have airplane mode on.

I don't think it's shown in dmesg, but I'm currently using an AX211 WiFi card.

dmesg output

watchdog: BUG: soft lockup - CPU#5 stuck for 78s! [kworker/u24:6:35898]
Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel rfcomm snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib ip_set qrtr bnep uinput snd_ctl_led snd_soc_sof_sdw snd_soc_intel_hda_dsp_common sunrpc snd_sof_probes snd_soc_intel_sof_maxim_common snd_soc_rt715_sdca snd_soc_rt1316_sdw snd_hda_codec_hdmi regmap_sdw_mbq regmap_sdw snd_soc_dmic snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink binfmt_misc soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match vfat snd_soc_acpi fat soundwire_generic_allocation soundwire_bus intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal snd_soc_core intel_powerclamp iwlmvm coretemp snd_compress ac97_bus snd_pcm_dmaengine kvm_intel snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec mac80211 kvm
 snd_hda_core snd_hwdep irqbypass snd_seq libarc4 rapl btusb snd_seq_device processor_thermal_device_pci spi_nor hid_sensor_als dell_laptop iTCO_wdt btrtl intel_cstate intel_pmc_bxt mei_hdcp mei_pxp mtd spi_ljca gpio_ljca i2c_ljca iTCO_vendor_support intel_rapl_msr dell_wmi iwlwifi intel_uncore snd_pcm btintel hid_sensor_trigger processor_thermal_device dell_wmi_ddv pcspkr btbcm processor_thermal_wt_hint dell_smbios hid_sensor_iio_common snd_timer btmtk processor_thermal_rfim dcdbas industrialio_triggered_buffer cfg80211 bluetooth dell_smm_hwmon dell_wmi_sysman firmware_attributes_class ledtrig_audio dell_wmi_descriptor wmi_bmof usb_ljca mei_me snd spi_intel_pci processor_thermal_rapl kfifo_buf spi_intel i2c_i801 industrialio mei rfkill intel_rapl_common soundcore idma64 i2c_smbus processor_thermal_wt_req thunderbolt igen6_edac processor_thermal_power_floor processor_thermal_mbox int3403_thermal intel_skl_int3472_tps68470 tps68470_regulator int340x_thermal_zone intel_pmc_core clk_tps68470 intel_vsec
 nft_reject_inet pmt_telemetry intel_hid int3400_thermal nf_reject_ipv4 pmt_class intel_skl_int3472_discrete acpi_thermal_rel sparse_keymap acpi_pad acpi_tad nf_reject_ipv6 joydev nft_reject nft_masq nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables i2c_dev loop nfnetlink zram xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec hid_sensor_hub intel_ishtp_hid i915 crct10dif_pclmul crc32_pclmul crc32c_intel i2c_algo_bit polyval_clmulni drm_buddy polyval_generic ttm nvme ghash_clmulni_intel nvme_core drm_display_helper ucsi_acpi sha512_ssse3 video hid_multitouch typec_ucsi intel_ish_ipc sha256_ssse3 spi_pxa2xx_platform sha1_ssse3 typec dw_dmac cec intel_ishtp nvme_auth i2c_hid_acpi i2c_hid wmi pinctrl_tigerlake serio_raw ip6_tables ip_tables fuse
CPU: 5 PID: 35898 Comm: kworker/u24:6 Tainted: G             L     6.8.5-301.fc40.x86_64 #1
Hardware name: Dell Inc. XPS 9315/00KRKP, BIOS 1.23.0 08/08/2024
Workqueue: events_unbound cfg80211_wiphy_work [cfg80211]
RIP: 0010:iwl_mvm_scan_umac_v14_and_above+0x4f3/0xde0 [iwlmvm]
Code: 54 24 30 4c 89 54 24 38 4c 89 44 24 40 eb 0f 83 c6 01 40 0f b6 c6 39 e8 0f 83 fa 00 00 00 40 0f b6 c6 48 8d 04 80 49 8d 3c 86 <44> 39 7f 04 75 df 0f b6 47 11 3c 80 74 15 0f b6 14 24 80 fa 80 0f
RSP: 0018:ffffb64c405a3848 EFLAGS: 00000297
RAX: 000000000000002d RBX: ffff9feac1fcd000 RCX: 0000000000000000
RDX: ffff9feac1fcd46d RSI: 00000000cbac4409 RDI: ffff9fede2e60324
RBP: 000000000000011e R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000003
R13: 0000000000000000 R14: ffff9fede2e60270 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff9ff22f740000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc961bdc000 CR3: 000000029b422000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? watchdog_timer_fn+0x1ea/0x270
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x12f/0x2a0
 ? hrtimer_interrupt+0xf8/0x230
 ? __sysvec_apic_timer_interrupt+0x4a/0x140
 ? sysvec_apic_timer_interrupt+0x6d/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
 ? iwl_mvm_scan_umac_v14_and_above+0x4f3/0xde0 [iwlmvm]
 ? iwl_mvm_scan_umac_v14_and_above+0x443/0xde0 [iwlmvm]
 iwl_mvm_reg_scan_start+0x3e7/0x660 [iwlmvm]
 iwl_mvm_mac_hw_scan+0x4e/0x70 [iwlmvm]
 drv_hw_scan+0x9f/0x150 [mac80211]
 __ieee80211_start_scan+0x296/0x750 [mac80211]
 ? cfg80211_scan_6ghz+0x3f2/0xef0 [cfg80211]
 rdev_scan+0x25/0xd0 [cfg80211]
 cfg80211_scan_6ghz+0x48b/0xef0 [cfg80211]
 ? ttwu_do_activate+0x64/0x220
 ? try_to_wake_up+0x233/0x670
 ___cfg80211_scan_done+0x1e3/0x250 [cfg80211]
 cfg80211_wiphy_work+0xab/0xe0 [cfg80211]
 process_one_work+0x16d/0x330
 worker_thread+0x273/0x3c0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xe5/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
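
A rough reading of the register dump above, offered as a sketch rather than a confirmed diagnosis. It assumes the Code: bytes around the faulting instruction decode as "add esi,1; movzx eax,sil; cmp eax,ebp; jae ...; lea rax,[rax+rax*4]; lea rdi,[r14+rax*4]; cmp [rdi+4],r15d". That decode was done by hand and is an assumption. Under it, the loop index is only the low byte of RSI, the bound sits in RBP, and each entry is 20 bytes. The variable names below are illustrative and do not come from the driver source.

```python
# Register values copied from the dump above.
RSI = 0x00000000CBAC4409
RBP = 0x000000000000011E
RAX = 0x000000000000002D
R14 = 0xFFFF9FEDE2E60270
RDI = 0xFFFF9FEDE2E60324

# Assumption from the hand decode: "movzx eax, sil" keeps only the low byte.
index = RSI & 0xFF
# Assumption: "cmp eax, ebp" compares that byte against a 32-bit bound.
bound = RBP & 0xFFFFFFFF
# Assumption: the two lea instructions scale the index by 5, then by 4.
entry_size = 5 * 4

# Consistency checks: the other registers should follow from the index.
rax_matches = (index * 5 == RAX)
rdi_matches = (R14 + index * entry_size == RDI)

# An 8-bit index tops out at 255, so "index >= bound" is never true
# when the bound is larger than that.
exit_reachable = bound <= 0xFF

print(f"index={index} bound={bound} "
      f"rax_matches={rax_matches} rdi_matches={rdi_matches} "
      f"exit_reachable={exit_reachable}")
```

If the decode holds, RAX and RDI are both consistent with an index of 9, while the bound of 286 is out of reach for an 8-bit index. The upper bytes of RSI also suggest the counter had been incremented far more than 286 times.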
Comment 1 Johannes Berg 2024-10-21 13:35:39 UTC
Try the attached patch.
Comment 2 Johannes Berg 2024-10-21 13:36:05 UTC
Created attachment 307031 [details]
patch fixing variable type
Comment 3 Eric Li 2024-10-22 17:59:47 UTC
Thanks. I haven't tested it extensively, but my laptop ran well past the point where it would normally have hit the soft lockup, so I'm inclined to say that it is resolved for me.
Comment 4 Johannes Berg 2024-10-23 07:10:05 UTC
Good, thanks!

I've been wondering if there's another bug somewhere else that _triggers_ this, or if you really have >255 APs reported in reduced neighbor reports. Or maybe if there's an AP bug?

Would you be able to dump the output of "iw <device> scan dump -u" to a file, at the right place where it'd have locked up?
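
Once such a dump is saved to a file, a small script can give a first idea of how many APs it contains. This is a minimal sketch, assuming each scan result in the iw output starts with an unindented line of the form "BSS <bssid>(on <device>)". The function name and the sample text are hypothetical. It only counts BSS entries and does not parse the reduced neighbor reports themselves.

```python
import re

# Assumption: each scan result starts at column 0 with "BSS <bssid>".
# Indented lines such as "\tBSS Load:" do not match.
BSS_LINE = re.compile(r"^BSS ([0-9a-fA-F]{2}(?::[0-9a-fA-F]{2}){5})")


def count_bss_entries(dump_text):
    """Return (total entries, distinct BSSIDs) found in iw scan dump text."""
    bssids = []
    for line in dump_text.splitlines():
        match = BSS_LINE.match(line)
        if match:
            bssids.append(match.group(1).lower())
    return len(bssids), len(set(bssids))


# Hypothetical sample, not taken from the attached dumps.
SAMPLE = (
    "BSS aa:bb:cc:00:00:01(on wlan0)\n"
    "\tfreq: 2412\n"
    "\tBSS Load:\n"
    "\t\t * station count: 3\n"
    "BSS aa:bb:cc:00:00:02(on wlan0) -- associated\n"
    "\tfreq: 5180\n"
    "BSS AA:BB:CC:00:00:01(on wlan0)\n"
    "\tfreq: 5200\n"
)

total, distinct = count_bss_entries(SAMPLE)
print(total, distinct)
```

For a real file, the text can be read with open(path).read() and passed to count_bss_entries().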
Comment 5 Eric Li 2024-10-23 16:19:38 UTC
Created attachment 307059 [details]
iwl wifi dump
Comment 6 Eric Li 2024-10-23 16:27:11 UTC
I attached two iw dumps of very different sizes. I can't reliably reproduce the larger one; the only way I got it was by running "iw" every once in a while.
Comment 7 Eric Li 2024-10-23 16:28:03 UTC
Created attachment 307060 [details]
large iwl wifi dump
Comment 8 Johannes Berg 2024-10-28 10:25:52 UTC
Thanks. That looks like you simply have a LOT of APs: the dump has 156 APs reporting neighbors, though only 2*99 unique neighbors seem to be reported.

That's not quite above 255 yet, but it's plausible that walking around a bit and getting within range of more APs within 30 seconds could find enough to trigger the lockup.
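
Why 255 is the threshold can be sketched with a toy model of a loop whose counter is only 8 bits wide. This is an illustration of the general failure mode, not the driver code. The function name, its parameters, and the step limit are made up for the example, and the mask stands in for the wraparound of a u8 in C.

```python
def iterations_until_exit(n_entries, counter_bits=8, max_steps=100_000):
    """Model `for (i = 0; i < n_entries; i++)` with a counter that wraps.

    Returns the number of iterations, or None if the loop is still
    running after max_steps (a stand-in for "never terminates").
    """
    mask = (1 << counter_bits) - 1
    i = 0
    steps = 0
    while i < n_entries:
        steps += 1
        if steps > max_steps:
            return None
        # An 8-bit counter goes 254, 255, 0, 1, ... and never reaches 256.
        i = (i + 1) & mask
    return steps


print(iterations_until_exit(2 * 99))                 # count from the dump
print(iterations_until_exit(256))                    # one past the u8 range
print(iterations_until_exit(256, counter_bits=32))   # wider counter
```

With 2*99 = 198 entries the loop finishes normally. With any count above 255 the 8-bit counter wraps back to 0 before reaching the bound, so the loop spins until something interrupts it. A wider counter type finishes again.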

I was worried we had a bug elsewhere that caused the code to _think_ there were so many APs when there actually weren't, but this makes it look like that's not the case (here anyway).


So thanks for the report; the commit has since landed, so I'll close the bug.