This only happens in certain locations at my university, which is really weird. A soft lockup happens, and a kworker:events_unbound takes up a whole CPU thread. It seems to be caused by iwlwifi/iwlmvm, and it doesn't happen when I have airplane mode on. I don't think it's shown in dmesg, but I'm currently using an AX211 WiFi card.

dmesg output:
watchdog: BUG: soft lockup - CPU#5 stuck for 78s! [kworker/u24:6:35898]
Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel rfcomm snd_seq_dummy snd_hrtimer nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib ip_set qrtr bnep uinput snd_ctl_led snd_soc_sof_sdw snd_soc_intel_hda_dsp_common sunrpc snd_sof_probes snd_soc_intel_sof_maxim_common snd_soc_rt715_sdca snd_soc_rt1316_sdw snd_hda_codec_hdmi regmap_sdw_mbq regmap_sdw snd_soc_dmic snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel snd_sof_intel_hda_mlink binfmt_misc soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match vfat snd_soc_acpi fat soundwire_generic_allocation soundwire_bus intel_uncore_frequency intel_uncore_frequency_common x86_pkg_temp_thermal snd_soc_core intel_powerclamp iwlmvm coretemp snd_compress ac97_bus snd_pcm_dmaengine kvm_intel snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec mac80211 kvm snd_hda_core snd_hwdep irqbypass snd_seq libarc4 rapl btusb snd_seq_device processor_thermal_device_pci spi_nor hid_sensor_als dell_laptop iTCO_wdt btrtl intel_cstate intel_pmc_bxt mei_hdcp mei_pxp mtd spi_ljca gpio_ljca i2c_ljca iTCO_vendor_support intel_rapl_msr dell_wmi iwlwifi intel_uncore snd_pcm btintel hid_sensor_trigger processor_thermal_device dell_wmi_ddv pcspkr btbcm processor_thermal_wt_hint dell_smbios hid_sensor_iio_common snd_timer btmtk processor_thermal_rfim dcdbas industrialio_triggered_buffer cfg80211 bluetooth dell_smm_hwmon dell_wmi_sysman
 firmware_attributes_class ledtrig_audio dell_wmi_descriptor wmi_bmof usb_ljca mei_me snd spi_intel_pci processor_thermal_rapl kfifo_buf spi_intel i2c_i801 industrialio mei rfkill intel_rapl_common soundcore idma64 i2c_smbus processor_thermal_wt_req thunderbolt igen6_edac processor_thermal_power_floor processor_thermal_mbox int3403_thermal intel_skl_int3472_tps68470 tps68470_regulator int340x_thermal_zone intel_pmc_core clk_tps68470 intel_vsec nft_reject_inet pmt_telemetry intel_hid int3400_thermal nf_reject_ipv4 pmt_class intel_skl_int3472_discrete acpi_thermal_rel sparse_keymap acpi_pad acpi_tad nf_reject_ipv6 joydev nft_reject nft_masq nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables i2c_dev loop nfnetlink zram xe drm_ttm_helper gpu_sched drm_suballoc_helper drm_gpuvm drm_exec hid_sensor_hub intel_ishtp_hid i915 crct10dif_pclmul crc32_pclmul crc32c_intel i2c_algo_bit polyval_clmulni drm_buddy polyval_generic ttm nvme ghash_clmulni_intel nvme_core drm_display_helper ucsi_acpi sha512_ssse3 video hid_multitouch typec_ucsi intel_ish_ipc sha256_ssse3 spi_pxa2xx_platform sha1_ssse3 typec dw_dmac cec intel_ishtp nvme_auth i2c_hid_acpi i2c_hid wmi pinctrl_tigerlake serio_raw ip6_tables ip_tables fuse
CPU: 5 PID: 35898 Comm: kworker/u24:6 Tainted: G L 6.8.5-301.fc40.x86_64 #1
Hardware name: Dell Inc. XPS 9315/00KRKP, BIOS 1.23.0 08/08/2024
Workqueue: events_unbound cfg80211_wiphy_work [cfg80211]
RIP: 0010:iwl_mvm_scan_umac_v14_and_above+0x4f3/0xde0 [iwlmvm]
Code: 54 24 30 4c 89 54 24 38 4c 89 44 24 40 eb 0f 83 c6 01 40 0f b6 c6 39 e8 0f 83 fa 00 00 00 40 0f b6 c6 48 8d 04 80 49 8d 3c 86 <44> 39 7f 04 75 df 0f b6 47 11 3c 80 74 15 0f b6 14 24 80 fa 80 0f
RSP: 0018:ffffb64c405a3848 EFLAGS: 00000297
RAX: 000000000000002d RBX: ffff9feac1fcd000 RCX: 0000000000000000
RDX: ffff9feac1fcd46d RSI: 00000000cbac4409 RDI: ffff9fede2e60324
RBP: 000000000000011e R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000003
R13: 0000000000000000 R14: ffff9fede2e60270 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff9ff22f740000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc961bdc000 CR3: 000000029b422000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? watchdog_timer_fn+0x1ea/0x270
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x12f/0x2a0
 ? hrtimer_interrupt+0xf8/0x230
 ? __sysvec_apic_timer_interrupt+0x4a/0x140
 ? sysvec_apic_timer_interrupt+0x6d/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
 ? iwl_mvm_scan_umac_v14_and_above+0x4f3/0xde0 [iwlmvm]
 ? iwl_mvm_scan_umac_v14_and_above+0x443/0xde0 [iwlmvm]
 iwl_mvm_reg_scan_start+0x3e7/0x660 [iwlmvm]
 iwl_mvm_mac_hw_scan+0x4e/0x70 [iwlmvm]
 drv_hw_scan+0x9f/0x150 [mac80211]
 __ieee80211_start_scan+0x296/0x750 [mac80211]
 ? cfg80211_scan_6ghz+0x3f2/0xef0 [cfg80211]
 rdev_scan+0x25/0xd0 [cfg80211]
 cfg80211_scan_6ghz+0x48b/0xef0 [cfg80211]
 ? ttwu_do_activate+0x64/0x220
 ? try_to_wake_up+0x233/0x670
 ___cfg80211_scan_done+0x1e3/0x250 [cfg80211]
 cfg80211_wiphy_work+0xab/0xe0 [cfg80211]
 process_one_work+0x16d/0x330
 worker_thread+0x273/0x3c0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xe5/0x120
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x31/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>
Try the attached patch.
Created attachment 307031: patch fixing variable type
Thanks. Haven't extensively tested it, but my laptop didn't go into a soft lockup long after it normally would have, so I'm inclined to say that it is resolved for me.
Good, thanks! I've been wondering if there's another bug somewhere else that _triggers_ this, or if you really have >255 APs reported in reduced neighbor reports. Or maybe if there's an AP bug? Would you be able to dump the output of "iw <device> scan dump -u" to a file, at the right place where it'd have locked up?
Created attachment 307059: iwl wifi dump
I attached 2 iw dumps, with very different sizes. The only way I could get the larger dump was by running "iw" every once in a while until it showed up.
Created attachment 307060: large iwl wifi dump
Thanks. It looks like you simply have a LOT of APs: the dump has 156 APs reporting neighbors, though only 2*99 unique neighbors seem to be reported. That's not quite above 255 yet, but it's plausible that walking around a bit and coming within range of more APs within 30 seconds could find enough to trigger the lockup. I was worried we had a bug elsewhere that made the code _think_ there were that many APs when there actually weren't, but this makes it look like that's not the case (here anyway). So thanks for the report; the commit has since landed, so I'll close the bug.
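For anyone reading this later: the ">255" threshold and the "variable type" patch title are consistent with an 8-bit loop index that can never reach a bound larger than 255. The following is only a minimal user-space sketch of that failure mode, not the driver code and not the attached patch. Both function names and the max_steps safety cap are hypothetical, added so the demonstration terminates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from iwlwifi): walk `count` entries using an
 * 8-bit index. `max_steps` is an artificial safety cap so this sketch
 * always returns; a real kernel loop would have no such cap.
 * Returns the number of loop iterations executed.
 */
size_t walk_with_u8_index(size_t count, size_t max_steps)
{
    size_t steps = 0;
    uint8_t i;

    /*
     * i wraps from 255 back to 0, so for count > 255 the condition
     * `i < count` never becomes false and only the cap stops the loop.
     */
    for (i = 0; i < count && steps < max_steps; i++)
        steps++;

    return steps;
}

/* Same walk with an index wide enough to hold any value of `count`. */
size_t walk_with_wide_index(size_t count, size_t max_steps)
{
    size_t steps = 0;
    size_t i;

    for (i = 0; i < count && steps < max_steps; i++)
        steps++;

    return steps;
}
```

With a cap of 100000, walk_with_u8_index(255, 100000) finishes after 255 iterations, while walk_with_u8_index(256, 100000) only stops when it hits the cap. The wide-index variant finishes after 256 iterations. Without a cap, the narrow-index loop would spin forever, which would match a worker stuck on one CPU until the watchdog reports a soft lockup.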