Bug 204141

Summary: iwlwifi - kernel BUG at lib/list_debug.c:54 - WIFI-28668
Product: Drivers
Component: network-wireless
Status: CLOSED CODE_FIX
Severity: normal
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 5.1.16-300.fc30.x86_64
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Tom Seewald (tseewald)
Assignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
CC: chrzaszc, dev, georgmueller, luca, skyler, tomi
Attachments: Full journalctl output for full context; lspci -vvv output

Description Tom Seewald 2019-07-11 15:33:45 UTC
Created attachment 283627
Full journalctl output for full context

Laptop Model: Dell Latitude 7490
Wifi Card: Intel 8265
Distribution: Fedora 30

Problem: A kernel BUG appears to be triggered at random after AP association. The laptop then misbehaves and ultimately has to be powered off by holding the power button down.

For example, the laptop does not fully shut down without physical intervention, and random applications hang completely and cannot be killed.

See journalctl.txt for the full context of the error.

Let me know if you need any additional information.
Comment 1 Tom Seewald 2019-07-11 15:34:19 UTC
Created attachment 283629
lspci -vvv output
Comment 2 Luca Coelho 2019-07-12 10:08:36 UTC
Thanks for reporting.  I have created an internal ticket to track this and will have someone look into this issue ASAP.

This was also discussed in this thread:

https://lkml.org/lkml/2019/5/30/723
Comment 3 Tom Seewald 2019-07-12 14:47:31 UTC
I am unsure if this is relevant, but in my case at least, this has so far occurred only on a network using WPA2-EAP.
Comment 4 Georg Müller 2019-07-15 13:50:44 UTC
There is also a bug report in redhat bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1717115

Looking at the source code, I think a list is being modified without a lock.

The change was introduced by this commit:
iwlwifi: mvm: support mac80211 TXQs model
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cfbc6c4c5b91c7725ef14465b98ac347d31f2334

In the patch, there is one list_del_init() (the one which causes the oops) in iwl_mvm_add_new_dqa_stream_wk(), and one list_add_tail() in iwl_mvm_mac_wake_tx_queue().

These two calls do not share a lock. There is a mutex in iwl_mvm_add_new_dqa_stream_wk(), but nothing in iwl_mvm_mac_wake_tx_queue(). Maybe it would help to take the mutex there as well or, if that is too expensive, to introduce a spinlock for this list?

I am just guessing, but the list_add_tail() looks like the only thing not guarded by the mutex.
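
To illustrate the suggestion above, here is a minimal user-space sketch of the pattern: one thread appends to a shared list, another removes from it, and both take the same lock. It is not the driver code and not necessarily how the problem was eventually fixed. Every name in it (demo_list_add_tail, demo_list_del_init, pending_lock, run_demo, DEMO_ITEMS) is hypothetical, and a pthread mutex merely stands in for whichever lock the driver would use.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct list_head. */
struct demo_node {
    struct demo_node *next, *prev;
};

static void demo_list_init(struct demo_node *n) { n->next = n; n->prev = n; }
static bool demo_list_empty(const struct demo_node *h) { return h->next == h; }

/* Append n just before head, i.e. at the tail of the circular list. */
static void demo_list_add_tail(struct demo_node *n, struct demo_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink n and leave it pointing at itself. */
static void demo_list_del_init(struct demo_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    demo_list_init(n);
}

#define DEMO_ITEMS 10000

static struct demo_node pending;            /* shared list head */
static struct demo_node items[DEMO_ITEMS];
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* Plays the role of the path that adds entries to the list. */
static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < DEMO_ITEMS; i++) {
        pthread_mutex_lock(&pending_lock);
        demo_list_add_tail(&items[i], &pending);
        pthread_mutex_unlock(&pending_lock);
    }
    return NULL;
}

/* Plays the role of the worker that removes entries from the list. */
static void *consumer(void *arg)
{
    int *removed = arg;   /* only this thread touches it until joined */
    while (*removed < DEMO_ITEMS) {
        pthread_mutex_lock(&pending_lock);
        if (!demo_list_empty(&pending)) {
            demo_list_del_init(pending.next);
            (*removed)++;
        }
        pthread_mutex_unlock(&pending_lock);
    }
    return NULL;
}

/* Returns the number of removed entries, or -1 if the list is not empty. */
int run_demo(void)
{
    pthread_t prod, cons;
    int removed = 0;
    int rc;

    demo_list_init(&pending);
    for (int i = 0; i < DEMO_ITEMS; i++)
        demo_list_init(&items[i]);

    rc = pthread_create(&prod, NULL, producer, NULL);
    assert(rc == 0);
    rc = pthread_create(&cons, NULL, consumer, &removed);
    assert(rc == 0);
    (void)rc;

    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    return demo_list_empty(&pending) ? removed : -1;
}
```

If the lock/unlock pair were dropped from producer() only, the add and the delete could interleave and leave next/prev pointers inconsistent. That is the kind of corruption the checks in lib/list_debug.c are designed to catch.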
Comment 5 Tom Seewald 2019-08-13 06:02:40 UTC
Let me know if there is any additional information/logs needed, or if there are patches you'd like us to test.
Comment 6 Tom Seewald 2019-12-21 03:43:27 UTC
I haven't seen this bug in several months so I'm going to assume it's been fixed, but I don't know what fixed it.  I'll set this as resolved for now.
Comment 7 Skyler Hawthorne 2019-12-21 04:52:56 UTC
I also haven't seen it in several months. There must have been a patch that fixed it either directly or indirectly.
Comment 8 Luca Coelho 2019-12-21 09:35:28 UTC
Awesome! Thanks for reporting.