Created attachment 287807 [details] complete logs of kernel NULL pointer dereference, address: 00000000000000d8 Hi, I'm using a 08:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 50) with: ethtool -i wlp8s0 | grep firmware firmware-version: 36.77d01142.0 8265-36.ucode kernel version 5.6.0-0.rc4.git0.1.fc33.x86_64+debug but the bug is present in all earlier kernel versions 5.x.x I tested as an access point with hostapd under Fedora 31 and after some time in the journal logs appears: Mar 05 09:46:33 hostapd[1449]: wlp8s0: STA 2c:dc:ad:34:d8:57 IEEE 802.11: did no t acknowledge authentication response Mar 05 09:46:47 kernel: BUG: kernel NULL pointer dereference, address: 000000000 00000d8 Mar 05 09:46:47 kernel: #PF: supervisor write access in kernel mode Mar 05 09:46:47 kernel: #PF: error_code(0x0002) - not-present page Mar 05 09:46:47 kernel: PGD 0 P4D 0 Mar 05 09:46:47 kernel: Oops: 0002 [#1] SMP PTI Mar 05 09:46:47 kernel: CPU: 0 PID: 2086 Comm: kworker/0:0 Not tainted 5.6.0-0.r c4.git0.1.fc33.x86_64+debug #1 Mar 05 09:46:47 kernel: Hardware name: LENOVO 10A8S08P00/SHARKBAY, BIOS FBKTDBAU S 12/24/2019 Mar 05 09:46:47 kernel: Workqueue: events iwl_mvm_add_new_dqa_stream_wk [iwlmvm] Mar 05 09:46:47 kernel: RIP: 0010:iwl_trans_pcie_txq_enable+0x4d/0x460 [iwlwifi] Mar 05 09:46:47 kernel: Code: bc c7 80 19 00 00 66 89 14 24 f0 48 0f ab 87 80 29 00 00 73 0d 80 3d af 37 04 00 00 0f 84 c6 03 00 00 44 89 c7 e8 03 42 b0 d0 <49> 89 87 d8 00 00 00 4d 85 f6 0f 84 d1 02 00 00 41 0f b6 06 89 44 Mar 05 09:46:47 kernel: RSP: 0018:ffffac44c081fd08 EFLAGS: 00010202 Mar 05 09:46:47 kernel: RAX: 0000000000002710 RBX: 0000000000000000 RCX: 0000000 000000000 Mar 05 09:46:47 kernel: RDX: 3ffffffffffffffe RSI: 000000000000001f RDI: 0000000 000002710 Mar 05 09:46:47 kernel: RBP: ffff8ee008040028 R08: 0000000000002710 R09: 0000000 000000001 Mar 05 09:46:47 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000 00000001f Mar 05 09:46:47 kernel: R13: ffff8ee00567b3e8 R14: 0000000000000000 R15: 0000000 000000000 Mar 05 09:46:47 kernel: FS: 0000000000000000(0000) GS:ffff8ee00e800000(0000) kn lGS:0000000000000000 Mar 05 09:46:47 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 05 09:46:47 kernel: CR2: 00000000000000d8 CR3: 0000000040612003 CR4: 0000000 0001606f0 Mar 05 09:46:47 kernel: Call Trace: Mar 05 09:46:47 kernel: iwl_mvm_enable_txq+0x1d8/0x2b0 [iwlmvm] Mar 05 09:46:47 kernel: iwl_mvm_add_new_dqa_stream_wk+0x203/0x910 [iwlmvm] Mar 05 09:46:47 kernel: process_one_work+0x25d/0x570 Mar 05 09:46:47 kernel: worker_thread+0x55/0x3d0 Mar 05 09:46:47 kernel: ? process_one_work+0x570/0x570 Mar 05 09:46:47 kernel: kthread+0x120/0x140 Mar 05 09:46:47 kernel: ? __kthread_bind_mask+0x60/0x60 Mar 05 09:46:47 kernel: ret_from_fork+0x3a/0x50 Mar 05 09:46:47 kernel: Modules linked in: ebtable_filter ebtables ip6table_filt er ip6_tables ccm 8021q garp mrp bridge stp llc xt_pkttype xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ ipv4 libcrc32c nf_log_ipv4 nf_log_common xt_dscp xt_multiport xt_LOG xt_set xt_l ength xt_mark iptable_mangle ip_set_hash_net ip_set_bitmap_port ip_set_hash_ip i p_set_hash_netport ip_set nfnetlink sunrpc vfat fat intel_rapl_msr intel_rapl_co mmon x86_pkg_temp_thermal snd_hda_codec_realtek intel_powerclamp coretemp kvm_in tel joydev snd_hda_codec_generic kvm iwlmvm irqbypass ledtrig_audio mac80211 lib arc4 crct10dif_pclmul snd_hda_codec_hdmi snd_hda_intel crc32_pclmul snd_intel_ds pcfg snd_hda_codec ghash_clmulni_intel snd_hda_core intel_cstate snd_hwdep intel _uncore snd_seq snd_seq_device intel_rapl_perf iTCO_wdt iTCO_vendor_support iwlw ifi mei_hdcp snd_pcm cfg80211 wmi_bmof i2c_i801 snd_timer snd rfkill soundcore m ei_me mei lpc_ich ip_tables raid1 i915 i2c_algo_bit cec Mar 05 09:46:47 kernel: drm_kms_helper drm r8169 crc32c_intel ums_realtek e1000 e uas usb_storage wmi video Mar 05 09:46:47 kernel: CR2: 00000000000000d8 Mar 05 09:46:47 kernel: ---[ end trace 759169036689fe78 ]--- I can provide more info or test. thanks in advance, Gabriel
Are you able to reproduce this issue? If I supplied you a patch would you be able to apply it and run it?
(In reply to Mark Asselstine from comment #1) > Are you able to reproduce this issue? Yes > If I supplied you a patch would you be able to apply it and run it? Yes, I can apply it to a kernel SRPM and build that under mock. thanks, Gabriel
Can you also do a few checks for me. 1) Ensure CONFIG_DEBUG_INFO is enabled, rebuild the kernel if needed. Reproduce the issue to get an updated Oops, then load the kernel into gdb (gdb vmlinux). Then use the list command to print the line where the issue is occurring (for your previous Ooops this would be 'l * iwl_trans_pcie_txq_enable+0x4d', based on the RIP in the Oops output.). 2) Recompile the kernel with CONFIG_SMP and CONFIG_PREEMPT turned off and see if you can still reproduce the issue. Thanks
(In reply to Mark Asselstine from comment #3) > Can you also do a few checks for me. > > 1) Ensure CONFIG_DEBUG_INFO is enabled, rebuild the kernel if needed. > Reproduce the issue to get an updated Oops, then load the kernel into gdb > (gdb vmlinux). Then use the list command to print the line where the issue > is occurring (for your previous Ooops this would be 'l * > iwl_trans_pcie_txq_enable+0x4d', based on the RIP in the Oops output.). I tried but that printed No symbol "iwl_trans_pcie_txq_enable" in current context. using: (gdb) add-symbol-file /usr/lib/debug/usr/lib/modules/5.6.0-0.rc4.git0.1.local.fc31.x86_64+debug/kernel/drivers/net/wireless/intel/iwlwifi/iwlwifi.ko.debug 0xffffffffc06e5000 -s .data 0xffffffffc0734000 -s .bss 0xffffffffc0738500 l * iwl_trans_pcie_txq_enable+0x4d 0xffffffffc06f3d5d is in iwl_trans_pcie_txq_enable (./include/linux/jiffies.h:371). 366 if (__builtin_constant_p(m)) { 367 if ((int)m < 0) 368 return MAX_JIFFY_OFFSET; 369 return _msecs_to_jiffies(m); 370 } else { 371 return __msecs_to_jiffies(m); 372 } 373 } 374 375 extern unsigned long __usecs_to_jiffies(const unsigned int u); > > 2) Recompile the kernel with CONFIG_SMP and CONFIG_PREEMPT turned off and > see if you can still reproduce the issue. yes thats happens without SMP too > > Thanks If need I can try something else, thanks
Created attachment 287843 [details] logs of kernel NULL pointer dereference, address: 00000000000000d8 one processor
Thanks Gabriello, yes my bad the symbol was of course in the module but you figured it out. I spent time over the weekend looking at your original log as lockdep highlighted a potential deadlock, unfortunately I wasn't able to devise a sensible way to untangle things so hopefully he is responsive to my ping. I will be sure to update the case if I find anything else useful.
Created attachment 287847 [details] debug patch Can you apply the attached 0001-iwlwifi-print-more-debug-info.patch. It looks like the returned txq is invalid and I am still trying to track down why. This should print information to confirm this as well as provide additional info.
Created attachment 287851 [details] log with 0001-iwlwifi-print-more-debug-info.patch applied I applied the patch to 5.6.0-0.rc5 so here is the log when the kernel bug occurred, please let me know if need something else. thanks, Gabriel
Looking like an off-by-one error Mar 10 08:56:43 kernel: Invalid txq. trans_pci:0xffff888405a20ab0, txq_id:31, num:31 txq_id should max out at 30 with a total number of queues of 31. We can see this in other loops in tx.c that iterate over the queues for (i = 0; i < trans->trans_cfg->base_params->num_of_queues; i++) { So we end up getting struct iwl_txq *txq = trans_pcie->txq[txq_id]; returning null. And then txq->wd_timeout = msecs_to_jiffies(wdg_timeout); Causes the Oops. The 0xd8 we see in the oops is the offset of wd_timeout in the iwl_txq struct. So now to look at iwl_mvm_sta_alloc_queue() to see why the queue being selected is out of range. Thanks for running things with the patch, this should help track down the issue.
Created attachment 287861 [details] proposed fix Please drop the debug patch and apply the newly attached 0001-iwlwifi-avoid-off-by-one-error-due-to-queue-count-mi.patch Once you can confirm it I will send it upstream and see what the maintainers think. There is definitely an issue with the range used to find free queues, hopefully they agree with my fix.
Created attachment 287871 [details] first run with the proposed patch Hi, applied the patch, in the first run, appeared: Mar 11 02:24:35 kernel: WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected Mar 11 05:31:46 kernel: WARNING: CPU: 0 PID: 871 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:3715 iwl_mvm_sta_modify_sleep_tx_count+0x259/0x300 [iwlmvm] Mar 11 05:31:46 kernel: WARNING: CPU: 0 PID: 871 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:3715 iwl_mvm_sta_modify_sleep_tx_count+0x259/0x300 [iwlmvm] Mar 11 05:32:03 kernel: WARNING: CPU: 0 PID: 2582 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:533 iwl_mvm_inactivity_check+0xd75/0xf10 [iwlmvm] Mar 11 05:33:02 kernel: WARNING: CPU: 0 PID: 871 at drivers/net/wireless/intel/iwlwifi/mvm/tx.c:1144 iwl_mvm_tx_mpdu+0x580/0x5a0 [iwlmvm] Mar 11 05:53:20 kernel: WARNING: CPU: 0 PID: 2582 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:533 iwl_mvm_inactivity_check+0xd75/0xf10 [iwlmvm] Mar 11 05:32:08 kernel: BUG: scheduling while atomic: kworker/0:1/2582/0x00000002 Mar 11 05:32:08 kernel: BUG: workqueue leaked lock or atomic: kworker/0:1/0x7fffffff/2582 Mar 11 05:32:08 kernel: BUG: sleeping function called from invalid context at kernel/workqueue.c:2290 Mar 11 05:32:08 kernel: BUG: scheduling while atomic: kworker/0:1/2582/0x00000000 Mar 11 05:53:20 kernel: BUG: scheduling while atomic: kworker/0:1/2582/0x00000002 Mar 11 05:53:20 kernel: BUG: workqueue leaked lock or atomic: kworker/0:1/0x7fffffff/2582 Mar 11 05:53:20 kernel: BUG: sleeping function called from invalid context at kernel/workqueue.c:2290 Mar 11 05:53:20 kernel: BUG: scheduling while atomic: kworker/0:1/2582/0x00000000 Mar 11 06:08:09 kernel: BUG: kernel NULL pointer dereference, address: 00000000000000d8 so the still happens and this log don't have the lines after "BUG: kernel NULL pointer dereference" but I attached it because have more and different BUG and WARNING lines than the following attachment thanks, Gabriel
Created attachment 287873 [details] log with proposed patch and BUG: kernel NULL pointer dereference complete Hi, this log shows: Mar 11 07:57:52 kernel: WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected Mar 11 08:16:14 kernel: WARNING: CPU: 0 PID: 352 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:533 iwl_mvm_inactivity_check+0xd75/0xf10 [iwlmvm] Mar 11 08:25:28 kernel: WARNING: CPU: 0 PID: 2090 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:533 iwl_mvm_inactivity_check+0xd75/0xf10 [iwlmvm] Mar 11 08:25:29 kernel: WARNING: CPU: 0 PID: 2090 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:533 iwl_mvm_inactivity_check+0xd75/0xf10 [iwlmvm] Mar 11 08:25:29 kernel: WARNING: CPU: 3 PID: 27 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:533 iwl_mvm_inactivity_check+0xd75/0xf10 [iwlmvm] and Mar 11 09:07:11 kernel: BUG: kernel NULL pointer dereference, address: 00000000000000d8 Mar 11 09:07:12 kernel: BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:38 with: Mar 11 09:07:12 kernel: RIP: 0010:iwl_trans_pcie_txq_enable+0x4e/0x460 [iwlwifi] thanks, Gabriel
I'll have a look tonight, it appears I didn't get all the places where the off-by-one can occur. NOTE the "WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected" issue is as far as I can tell a different issue. Unfortunately I don't know this code well enough to fix this, so I have emailed the person who introduced the lock in the hopes that they can have a look.
Created attachment 287903 [details] reuse queue number fix I am still looking at all the execution paths that are possibly returning an invalid queue number. Looking at the next path I identified what I believe is a code error. The attached 0001-iwlwifi-set-free_queue-to-the-loop-variable-not-a-me.patch addresses this new issue and I have already sent the fix upstream. This fix doesn't prevent the queue number from from being invalid as you are seeing, so it doesn't directly fix your issue but may indirectly prevent you from hitting the failure. Please go ahead and add it to your build, along with the firs fix, and let me know how it goes. I will continue to look at the other code paths tomorrow.
Created attachment 287921 [details] log with 0001-iwlwifi-set-free_queue-to-the-loop-variable-not-a-me.patch and 0001-iwlwifi-avoid-off-by-one-error-due-to-queue-count-mi.patch (In reply to Mark Asselstine from comment #14) > Created attachment 287903 [details] > reuse queue number fix > > I am still looking at all the execution paths that are possibly returning an > invalid queue number. Looking at the next path I identified what I believe > is a code error. The attached > 0001-iwlwifi-set-free_queue-to-the-loop-variable-not-a-me.patch addresses > this new issue and I have already sent the fix upstream. This fix doesn't > prevent the queue number from from being invalid as you are seeing, so it > doesn't directly fix your issue but may indirectly prevent you from hitting > the failure. Please go ahead and add it to your build, along with the firs > fix, and let me know how it goes. I will continue to look at the other code > paths tomorrow. Hi, I appreciate your help in this issue, I applied your two patches but the bug happens with them too Mar 13 08:36:06 kernel: BUG: kernel NULL pointer dereference, address: 00000000000000d8 if "WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected" is a different issue I will file another bug for better tracking of it thanks, Gabriel
As I mentioned in my comment I know there are still code paths that will cause the failure, so I am not surprised you are still seeing the failure. Both of the patches fix real issues but I need to continue analysing the remaining code paths as there is at least one more error here. Yes. Definitely raise a second bug, you can cut and paste the contents of your very first log, lines 2450-2752 I will most likely get another patch to you as I track down the next path that allows the queue number to exceed the max queue number. Should be available in the next 8hrs or so.
Can you please enable the IWLWIFI_DEBUG kconfig and then *before* you perform the reproducer can you please run the following command % echo 0x80000001 > /sys/module/iwlwifi/parameters/debug Once you complete the reproducer there should be additional information which should help me. Thanks.
Created attachment 287931 [details] log with /sys/module/iwlwifi/parameters/debug = 0x80000001 (In reply to Mark Asselstine from comment #17) > Can you please enable the IWLWIFI_DEBUG kconfig and then *before* you > perform the reproducer can you please run the following command > > % echo 0x80000001 > /sys/module/iwlwifi/parameters/debug > > Once you complete the reproducer there should be additional information > which should help me. Thanks. Hi, I attached the log with the info requested. I updated the bug https://bugzilla.kernel.org/show_bug.cgi?id=206199 "WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected at startup" with the contents of my first log lines 2450-2752 thanks, Gabriel
Created attachment 287937 [details] log with /sys/module/iwlwifi/parameters/debug = 0x80000001 and BUG: kernel NULL pointer dereference complete This log have the BUG: kernel NULL pointer dereference: Mar 15 16:19:18 kernel: iwlwifi 0000:08:00.0: I iwl_mvm_remove_inactive_tids Que ue 30 is inactive Mar 15 16:19:18 kernel: iwlwifi 0000:08:00.0: U iwl_mvm_sta_alloc_queue Using re served queue #31 Mar 15 16:19:18 kernel: iwlwifi 0000:08:00.0: U iwl_mvm_sta_alloc_queue Allocati ng queue #31 to sta 10 on tid 7 Mar 15 16:19:18 kernel: iwlwifi 0000:08:00.0: U iwl_mvm_update_txq_mapping Enabl ing TXQ #31 tids=0x80 Mar 15 16:19:18 kernel: BUG: kernel NULL pointer dereference, address: 000000000 00000d8 Mar 15 16:19:18 kernel: #PF: supervisor write access in kernel mode Mar 15 16:19:18 kernel: #PF: error_code(0x0002) - not-present page Mar 15 16:19:19 kernel: PGD 0 P4D 0 Mar 15 16:19:19 kernel: Oops: 0002 [#1] SMP PTI Mar 15 16:19:19 kernel: CPU: 1 PID: 2372 Comm: kworker/1:1 Tainted: G W 5.6.0-0.rc5.git0.1.iwlwififix2.fc31.x86_64+debug #1 Mar 15 16:19:19 kernel: Hardware name: LENOVO 10A8S08P00/SHARKBAY, BIOS FBKTDBAU S 12/24/2019 Mar 15 16:19:19 kernel: Workqueue: events iwl_mvm_add_new_dqa_stream_wk [iwlmvm] Mar 15 16:19:19 kernel: RIP: 0010:iwl_trans_pcie_txq_enable+0x4e/0x460 [iwlwifi] Mar 15 16:19:19 kernel: Code: ac c7 80 19 00 00 66 89 14 24 f0 48 0f ab 87 80 29 00 00 73 0d 80 3d 3e 3e 04 00 00 0f 84 cc 03 00 00 44 89 c7 e8 62 91 aa c0 <49> 89 85 d8 00 00 00 48 85 ed 0f 84 d3 02 00 00 0f b6 45 00 89 44 Mar 15 16:19:19 kernel: RSP: 0018:ffffc90001e97cf0 EFLAGS: 00010202 Mar 15 16:19:19 kernel: RAX: 0000000000002710 RBX: ffff888408100028 RCX: 0000000 000000000 Mar 15 16:19:19 kernel: RDX: 3ffffffffffffffe RSI: 000000000000001f RDI: 0000000 000002710 Mar 15 16:19:19 kernel: RBP: 0000000000000000 R08: 0000000000002710 R09: 0000000 000000001 Mar 15 16:19:19 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000 00000001f Mar 15 16:19:19 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000 000002710 Mar 15 16:19:19 kernel: FS: 0000000000000000(0000) GS:ffff88840ea00000(0000) kn lGS:0000000000000000 Mar 15 16:19:19 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 15 16:19:19 kernel: CR2: 00000000000000d8 CR3: 0000000002612001 CR4: 00000000001606e0 Mar 15 16:19:19 kernel: Call Trace: Mar 15 16:19:19 kernel: iwl_mvm_enable_txq+0x1e1/0x2b0 [iwlmvm] Mar 15 16:19:19 kernel: iwl_mvm_add_new_dqa_stream_wk+0x224/0x940 [iwlmvm] Mar 15 16:19:19 kernel: process_one_work+0x25d/0x570 Mar 15 16:19:19 kernel: worker_thread+0x50/0x3b0 Mar 15 16:19:19 kernel: kthread+0x106/0x140 Mar 15 16:19:19 kernel: ? process_one_work+0x570/0x570 Mar 15 16:19:19 kernel: ? kthread_park+0x90/0x90 Mar 15 16:19:19 kernel: ret_from_fork+0x3a/0x50 Mar 15 16:19:19 kernel: Modules linked in: ebtable_filter ebtables ccm 8021q garp mrp ip6table_filter ip6_tables bridge stp llc xt_pkttype xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nf_log_ipv4 nf_log_common xt_dscp xt_multiport xt_LOG xt_set xt_length xt_mark iptable_mangle ip_set_hash_net ip_set_bitmap_port ip_set_hash_ip ip_set_hash_netport ip_set nfnetlink sunrpc vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iwlmvm irqbypass crct10dif_pclmul mac80211 snd_hda_codec_realtek snd_hda_codec_generic libarc4 ledtrig_audio joydev iTCO_wdt iTCO_vendor_support crc32_pclmul snd_hda_codec_hdmi ghash_clmulni_intel intel_cstate iwlwifi snd_hda_intel mei_hdcp snd_intel_dspcfg snd_hda_codec intel_uncore intel_rapl_perf cfg80211 wmi_bmof snd_hda_core snd_hwdep snd_seq rfkill i2c_i801 snd_seq_device snd_pcm snd_timer snd soundcore mei_me lpc_ich mei ip_tables raid1 i915 i2c_algo_bit cec Mar 15 16:19:19 kernel: drm_kms_helper drm ums_realtek uas crc32c_intel usb_storage e1000e r8169 wmi video Mar 15 16:19:19 kernel: CR2: 00000000000000d8 Mar 15 16:19:19 kernel: ---[ end trace 5f5574f6174f62d1 ]--- Mar 15 16:19:19 kernel: RIP: 0010:iwl_trans_pcie_txq_enable+0x4e/0x460 [iwlwifi] Mar 15 16:19:19 kernel: Code: ac c7 80 19 00 00 66 89 14 24 f0 48 0f ab 87 80 29 00 00 73 0d 80 3d 3e 3e 04 00 00 0f 84 cc 03 00 00 44 89 c7 e8 62 91 aa c0 <49> 89 85 d8 00 00 00 48 85 ed 0f 84 d3 02 00 00 0f b6 45 00 89 44 Mar 15 16:19:19 kernel: RSP: 0018:ffffc90001e97cf0 EFLAGS: 00010202 Mar 15 16:19:19 kernel: RAX: 0000000000002710 RBX: ffff888408100028 RCX: 0000000000000000 Mar 15 16:19:19 kernel: RDX: 3ffffffffffffffe RSI: 000000000000001f RDI: 0000000000002710 Mar 15 16:19:19 kernel: RBP: 0000000000000000 R08: 0000000000002710 R09: 0000000000000001 Mar 15 16:19:19 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000001f Mar 15 16:19:19 kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000002710 Mar 15 16:19:19 kernel: FS: 0000000000000000(0000) GS:ffff88840ea00000(0000) knlGS:0000000000000000 Mar 15 16:19:19 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 15 16:19:19 kernel: CR2: 00000000000000d8 CR3: 0000000002612001 CR4: 00000000001606e0 thanks, Gabriel
I have an Intel 8260 and decided to create a setup I could test with here. Could you share any details of your reproducer? I have a potential fix but I would love to be able to reproduce things here to validate my assumptions.
Created attachment 287939 [details] hostapd configuration file (In reply to Mark Asselstine from comment #20) > I have an Intel 8260 and decided to create a setup I could test with here. > Could you share any details of your reproducer? Yes, the machine is running Fedora 31 x86_64 with hostapd as an access point to several cellphones and 2 smart tv the dns and dhcp servers are running in another machine for reproducing the bug, happens when a smart tv is streaming, so happens under heavy transfer, but not always happens sooner, and also occurs without streaming but I don't know what trigger the bug so I'm attaching my hostapd.conf, if you need any other configuration file or details I can provide them > > I have a potential fix but I would love to be able to reproduce things here > to validate my assumptions. thanks, Gabriel
Created attachment 287955 [details] work in progress fix The logs have been useful and the cause of your issue is very clear. Unfortunately the best fix is not clear. I have attached a version of a fix, but it is untested. I was hoping to have a reproducer here but no luck yet. Please give it a try if you have time. Continue to keep the debug logging on via 0x80000001 as per my previous comment.
Created attachment 287963 [details] 2 logs with the proposed fix and one image from the last state of top (In reply to Mark Asselstine from comment #22) > Created attachment 287955 [details] > work in progress fix > > The logs have been useful and the cause of your issue is very clear. > Unfortunately the best fix is not clear. I have attached a version of a fix, > but it is untested. I was hoping to have a reproducer here but no luck yet. > Please give it a try if you have time. Continue to keep the debug logging on > via 0x80000001 as per my previous comment. I patched the kernel this fix along with 0001-iwlwifi-set-free_queue-to-the-loop-variable-not-a-me.patch In the first run with the patched kernel ran fine from Mar 17 00:06:17 until Mar 17 04:58:59 when the system freeze and nothing interesting at that time shows in the log that is covered by the 001.log 22Mb in size the last Bug was: Mar 17 03:43:47 kernel: BUG: sleeping function called from invalid context at net/core/dev.c:9919 Mar 17 03:43:47 kernel: BUG: scheduling while atomic: kworker/2:1/4052/0x00000002 Mar 17 03:43:47 kernel: BUG: scheduling while atomic: kworker/2:1/4052/0x00000000 the last WARNING was: Mar 17 04:32:29 kernel: WARNING: CPU: 2 PID: 961 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:3713 iwl_mvm_sta_modify_sleep_tx_count+0x259/0x300 [iwlmvm] Mar 17 04:32:30 kernel: WARNING: CPU: 2 PID: 961 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:3713 iwl_mvm_sta_modify_sleep_tx_count+0x259/0x300 [iwlmvm] I know which the system freeze because two minutes after that the system reboot via the watchdog and the system start at Mar 17 05:00:52 and freeze again at Mar 17 20:27:30 with the same situation described above but covered by the 002.log 8Mb in size the last BUG was: Mar 17 20:17:27 kernel: BUG: sleeping function called from invalid context at net/core/dev.c:9919 Mar 17 20:17:27 kernel: BUG: scheduling while atomic: kworker/0:1/4875/0x00000002 Mar 17 20:17:27 kernel: BUG: scheduling while atomic: kworker/0:1/4875/0x00000000 the last WARNING was: Mar 17 07:34:10 kernel: WARNING: CPU: 0 PID: 881 at drivers/net/wireless/intel/iwlwifi/mvm/tx.c:1728 iwl_mvm_rx_tx_cmd+0xb52/0xc80 [iwlmvm] I attached one image from the top running until 20:27:44 the 797 process is the system-journal and the 881 is irq/40-iwlwifi/881 So the patch works I streamed NF to a TV for three hours without problem, but maybe another code path is missed or something diferent happens because the machine freeze, and both times freeze without streaming, so if you need additional info please tell me thanks, Gabriel
Gabriel. Good to hear. This is a new issue. My fix to prevent the use of the non-existent transmit queue seems to have done its job, I will clean it up and send it upstream I didn't get a chance to look at the new trace, at first glance it looks like the same area as the reported deadlock. I will definitely have a look tomorrow. For now you can drop the extra logging, it is providing no useful insight.
Created attachment 287991 [details] log with iwlwifi 0000:08:00.0: No available queues for tid 8 on sta_id 10 Hi, This log have more info I don't know if is sufficient, because the machine freeze at that point and reboots via the watchdog but shows the following: Mar 20 21:40:27 kernel: iwlwifi 0000:08:00.0: Failed to disable queue 6 (ret=-5) Mar 20 21:40:27 kernel: iwlwifi 0000:08:00.0: Failed to free inactive queue 6 (ret=-5) Mar 20 21:40:27 kernel: iwlwifi 0000:08:00.0: No DATA queues available to share Mar 20 21:40:27 kernel: ------------[ cut here ]------------ Mar 20 21:40:27 kernel: WARNING: CPU: 2 PID: 9930 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:1273 iwl_mvm_add_new_dqa_stream_wk+0x787/0x920 [iwlmvm] Mar 20 21:40:27 kernel: Modules linked in: ebtable_filter ebtables ip6table_filter ip6_tables ccm 8021q garp mrp bridge stp llc xt_pkttype xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nf_log_ipv4 nf_log_common xt_dscp xt_multiport xt_LOG xt_set xt_length xt_mark iptable_mangle ip_set_hash_net ip_set_bitmap_port ip_set_hash_ip ip_set_hash_netport ip_set nfnetlink sunrpc vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iwlmvm crct10dif_pclmul snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio mac80211 libarc4 snd_hda_codec_hdmi snd_hda_intel crc32_pclmul snd_intel_dspcfg iwlwifi ghash_clmulni_intel iTCO_wdt iTCO_vendor_support joydev mei_hdcp snd_hda_codec snd_hda_core snd_hwdep intel_cstate cfg80211 snd_seq intel_uncore snd_seq_device snd_pcm intel_rapl_perf wmi_bmof i2c_i801 rfkill lpc_ich snd_timer snd mei_me soundcore mei ip_tables raid1 i915 i2c_algo_bit cec Mar 20 21:40:27 kernel: crc32c_intel ums_realtek uas drm_kms_helper r8169 drm e1000e usb_storage wmi video Mar 20 21:40:27 kernel: CPU: 2 PID: 9930 Comm: kworker/2:0 Tainted: G W 5.6.0-0.rc6.git1.1.iwlwififix3.fc31.x86_64 #1 Mar 20 21:40:27 kernel: Hardware name: LENOVO 10A8S08P00/SHARKBAY, BIOS FBKTDBAUS 12/24/2019 Mar 20 21:40:27 kernel: Workqueue: events iwl_mvm_add_new_dqa_stream_wk [iwlmvm] Mar 20 21:40:27 kernel: RIP: 0010:iwl_mvm_add_new_dqa_stream_wk+0x787/0x920 [iwlmvm] Mar 20 21:40:27 kernel: Code: 0f 85 4b 01 00 00 0f b6 44 24 5c 3c ff 0f 85 3e 01 00 00 49 8b bd e0 f0 ff ff 48 c7 c1 70 6b 97 c0 31 d2 31 f6 e8 49 7e dc ff <0f> 0b 49 8b bd e0 f0 ff ff 44 8b 44 24 04 31 d2 31 f6 44 0f b6 4c Mar 20 21:40:27 kernel: RSP: 0018:ffffc90002d73da8 EFLAGS: 00010286 Mar 20 21:40:27 kernel: RAX: 0000000000000000 RBX: ffff8883d6f71d18 RCX: 0000000000000007 Mar 20 21:40:27 kernel: RDX: 0000000000000002 RSI: ffffffff8116de03 RDI: ffff88840eddc5b0 Mar 20 21:40:27 kernel: RBP: ffff8883cccd8cd0 R08: 0000000000000000 R09: 0000000000000000 Mar 20 21:40:27 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000000ff Mar 20 21:40:27 kernel: R13: ffff8884083d4308 R14: 0000000000000000 R15: ffff8884083d33e8 Mar 20 21:40:27 kernel: FS: 0000000000000000(0000) GS:ffff88840ec00000(0000) knlGS:0000000000000000 Mar 20 21:40:27 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Mar 20 21:40:27 kernel: CR2: 00007fac65ba1f80 CR3: 0000000002612005 CR4: 00000000001606e0 Mar 20 21:40:27 kernel: Call Trace: Mar 20 21:40:27 kernel: process_one_work+0x25d/0x570 Mar 20 21:40:27 kernel: worker_thread+0x50/0x3b0 Mar 20 21:40:27 kernel: kthread+0x106/0x140 Mar 20 21:40:27 kernel: ? process_one_work+0x570/0x570 Mar 20 21:40:27 kernel: ? kthread_park+0x90/0x90 Mar 20 21:40:27 kernel: ret_from_fork+0x3a/0x50 Mar 20 21:40:27 kernel: irq event stamp: 0 Mar 20 21:40:27 kernel: hardirqs last enabled at (0): [<0000000000000000>] 0x0 Mar 20 21:40:27 kernel: hardirqs last disabled at (0): [<ffffffff810e267e>] copy_process+0x73e/0x1ef0 Mar 20 21:40:27 kernel: softirqs last enabled at (0): [<ffffffff810e267e>] copy_process+0x73e/0x1ef0 Mar 20 21:40:27 kernel: softirqs last disabled at (0): [<0000000000000000>] 0x0 Mar 20 21:40:27 kernel: ---[ end trace 6aa5111e4bceb69b ]--- Mar 20 21:40:27 kernel: iwlwifi 0000:08:00.0: No available queues for tid 8 on sta_id 10 Mar 20 21:40:27 kernel: BUG: workqueue leaked lock or atomic: kworker/2:0/0x7fffffff/9930 last function: iwl_mvm_add_new_dqa_stream_wk [iwlmvm] Mar 20 21:40:27 kernel: INFO: lockdep is turned off. Mar 20 21:40:27 kernel: CPU: 2 PID: 9930 Comm: kworker/2:0 Tainted: G W 5.6.0-0.rc6.git1.1.iwlwififix3.fc31.x86_64 #1 Mar 20 21:40:27 kernel: Hardware name: LENOVO 10A8S08P00/SHARKBAY, BIOS FBKTDBAUS 12/24/2019 Mar 20 21:40:27 kernel: Workqueue: events iwl_mvm_add_new_dqa_stream_wk [iwlmvm] Mar 20 21:40:27 kernel: Call Trace: Mar 20 21:40:27 kernel: dump_stack+0x8f/0xd0 Mar 20 21:40:27 kernel: process_one_work.cold+0x41/0x46 Mar 20 21:40:27 kernel: worker_thread+0x50/0x3b0 Mar 20 21:40:27 kernel: kthread+0x106/0x140 Mar 20 21:40:27 kernel: ? process_one_work+0x570/0x570 Mar 20 21:40:27 kernel: ? kthread_park+0x90/0x90 Mar 20 21:40:27 kernel: ret_from_fork+0x3a/0x50 Mar 20 21:40:27 kernel: BUG: scheduling while atomic: kworker/2:0/9930/0x00000000 Mar 20 21:40:27 kernel: INFO: lockdep is turned off. Mar 20 21:40:27 kernel: Modules linked in: ebtable_filter ebtables ip6table_filter ip6_tables ccm 8021q garp mrp bridge stp llc xt_pkttype xt_state xt_conntrack iptable_filter xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nf_log_ipv4 nf_log_common xt_dscp xt_multiport xt_LOG xt_set xt_length xt_mark iptable_mangle ip_set_hash_net ip_set_bitmap_port ip_set_hash_ip ip_set_hash_netport ip_set nfnetlink sunrpc vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iwlmvm crct10dif_pclmul snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio mac80211 libarc4 snd_hda_codec_hdmi snd_hda_intel crc32_pclmul snd_intel_dspcfg iwlwifi ghash_clmulni_intel iTCO_wdt iTCO_vendor_support joydev mei_hdcp snd_hda_codec snd_hda_core snd_hwdep intel_cstate cfg80211 snd_seq intel_uncore snd_seq_device snd_pcm intel_rapl_perf wmi_bmof i2c_i801 rfkill lpc_ich snd_timer snd mei_me soundcore mei ip_tables raid1 i915 i2c_algo_bit cec Mar 20 21:40:27 kernel: crc32c_intel ums_realtek uas drm_kms_helper r8169 drm e1000e usb_storage wmi video Mar 20 21:40:27 kernel: CPU: 2 PID: 9930 Comm: kworker/2:0 Tainted: G W 5.6.0-0.rc6.git1.1.iwlwififix3.fc31.x86_64 #1 Mar 20 21:40:27 kernel: Hardware name: LENOVO 10A8S08P00/SHARKBAY, BIOS FBKTDBAUS 12/24/2019 Mar 20 21:40:27 kernel: Workqueue: 0x0 (events) Mar 20 21:40:27 kernel: Call Trace: Mar 20 21:40:27 kernel: dump_stack+0x8f/0xd0 Mar 20 21:40:27 kernel: __schedule_bug.cold+0x64/0x71 Mar 20 21:40:27 kernel: __schedule+0x822/0x990 Mar 20 21:40:27 kernel: ? process_one_work+0x343/0x570 Mar 20 21:40:27 kernel: schedule+0x5f/0xd0 Mar 20 21:40:27 kernel: worker_thread+0xd1/0x3b0 Mar 20 21:40:27 kernel: kthread+0x106/0x140 Mar 20 21:40:27 kernel: ? process_one_work+0x570/0x570 Mar 20 21:40:27 kernel: ? kthread_park+0x90/0x90 Mar 20 21:40:27 kernel: ret_from_fork+0x3a/0x50 Mar 20 21:40:28 kernel: iwlwifi 0000:08:00.0: Applying debug destination EXTERNAL_DRAM Mar 20 21:40:28 kernel: iwlwifi 0000:08:00.0: Applying debug destination EXTERNAL_DRAM Mar 20 21:40:28 kernel: iwlwifi 0000:08:00.0: FW already configured (0) - re-configuring and I don't know if open separate bugs reports for each of the following: Mar 20 10:00:23 kernel: WARNING: CPU: 0 PID: 893 at drivers/net/wireless/intel/iwlwifi/mvm/sta.c:3713 iwl_mvm_sta_modify_sleep_tx_count+0x259/0x300 [iwlmvm] Mar 20 21:40:27 kernel: WARNING: CPU: 2 PID: 9930 at drivers/net/wireless/intel /iwlwifi/mvm/sta.c:1273 iwl_mvm_add_new_dqa_stream_wk+0x787/0x920 [iwlmvm] Mar 20 21:40:27 kernel: BUG: scheduling while atomic: hostapd/1756/0x00000000 Mar 20 21:39:47 kernel: BUG: scheduling while atomic: kworker/0:3/9856/0x00000002 Mar 20 21:40:27 kernel: iwlwifi 0000:08:00.0: Microcode SW error detected. Restarting 0x2000000. thanks, Gabriel
Gabriello, I am looking into this issue. Once again I understand what is going on iwl_mvm_add_new_dqa_stream_wk -> iwl_mvm_inactivity_check -> starts and rcu read side critical section rcu_read_lock() -> iwl_mvm_free_inactive_queue -> iwl_mvm_remove_sta_queue_marking -> synchronize_net -> might sleep I am just sorting out how to change the code such that this doesn't happen. With all this COVID-19 stuff going on I have been busy and only now getting back to looking at this. Hopefully I will have a patch for you to try out shortly.
Gabriello, Can you close this issue and open a new one with the new log that spits out the sleeping while atomic error? I have emailed you a patch to this latest issue. We can continue the diagnosis and such in the new issue once you create it.
Mark, sure, using the following patches: https://bugzilla.kernel.org/attachment.cgi?id=287903 https://bugzilla.kernel.org/attachment.cgi?id=287955 the bug is resolved. the new bug about: BUG: sleeping function called from invalid context at net/core/dev.c:9919 is at https://bugzilla.kernel.org/show_bug.cgi?id=206971 Thanks, Gabriel
I just ran into this with 5.4.25. I'm in the process of testing the patches although I don't know a way to trigger the issue in a timely fashion.
Hi, along with the two patches referenced in: https://bugzilla.kernel.org/show_bug.cgi?id=206765#c28 get the patch: https://bugzilla.kernel.org/attachment.cgi?id=288779 referenced in: https://bugzilla.kernel.org/show_bug.cgi?id=206971#c6
We made a bunch of fixes in this area which were included in v5.7-rc3, v5.4.36 and v5.6.8. Can you retest with one of these versions?
(In reply to Luca Coelho from comment #31) > We made a bunch of fixes in this area which were included in v5.7-rc3, > v5.4.36 and v5.6.8. Can you retest with one of these versions? Hi, This bug is fixed in my testing with v5.6.8 and newer, currently using v5.7 thanks, Gabriel
Thanks for the report! Closing the bug.