Bug 217528 - ath11k: slub_debug=F output indicates bug in ath11k: corrupting kmalloc-1k
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P3 normal
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
Reported: 2023-06-06 12:31 UTC by py0xc3
Modified: 2023-07-26 07:18 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No

Description py0xc3 2023-06-06 12:31:09 UTC
Please see the full slub_debug=F `journalctl -r`: https://gitlab.com/py0xc31/public-tmp-storage/-/raw/main/slub_debug-F/HIT/slub_debug_HIT.log   
  
Relevant extracts from the above-mentioned `journalctl -r` output:  
  
```  
...  
Jun 05 18:56:20 fedora.domain kernel: Hardware name: LENOVO 21CHCTO1WW/21CHCTO1WW, BIOS R23ET60W (1.30 ) 09/14/2022  
Jun 05 18:56:20 fedora.domain kernel: CPU: 1 PID: 13592 Comm: kworker/u32:6 Tainted: G    B              6.3.5-200.fc38.x86_64 #1  
Jun 05 18:56:20 fedora.domain kernel: Slab 0xffffeffd4d324000 objects=32 used=10 fp=0xffff8fd10c901400 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)  
Jun 05 18:56:20 fedora.domain kernel:   -----------------------------------------------------------------------------  
Jun 05 18:56:20 fedora.domain kernel: BUG kmalloc-1k (Tainted: G B             ): Wrong object count. Counter is 10 but counted were 28  
Jun 05 18:56:20 fedora.domain kernel:   =============================================================================  
Jun 05 18:56:20 fedora.domain kernel: Disabling lock debugging due to kernel taint  
...  
```  
  
```  
...  
Jun 05 18:56:20 fedora.domain kernel: Object 0xffff8fd10c902000 @offset=8192 fp=0xc5d6e3752d901092  
Jun 05 18:56:20 fedora.domain kernel: Slab 0xffffeffd4d324000 objects=32 used=10 fp=0xffff8fd10c901400 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)  
Jun 05 18:56:20 fedora.domain kernel:   -----------------------------------------------------------------------------  
Jun 05 18:56:20 fedora.domain kernel: BUG kmalloc-1k (Not tainted): Freechain corrupt  
Jun 05 18:56:20 fedora.domain kernel:   =============================================================================  
Jun 05 18:56:17 fedora.domain kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting  
Jun 05 18:56:17 fedora.domain kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting  
...  
```  
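To pull the SLUB reports out of a long journal like the one linked above, a small filter can help. The following is a minimal sketch, assuming the `BUG <cache> (<taint state>): <message>` line format seen in the two extracts; the function name `find_slub_bugs` and the regular expression are illustrative only and are not part of any kernel tooling.

```python
import re

# Pattern derived only from the two "BUG kmalloc-1k (...)" lines quoted above;
# other kernel versions may format these lines differently (assumption).
BUG_RE = re.compile(
    r"kernel: BUG (?P<cache>\S+) \((?P<taint>[^)]*)\): (?P<message>.+?)\s*$"
)

def find_slub_bugs(lines):
    """Return (cache, taint state, message) for every SLUB BUG line found."""
    reports = []
    for line in lines:
        match = BUG_RE.search(line)
        if match:
            # Collapse the column padding inside "Tainted: G B             ".
            taint = " ".join(match.group("taint").split())
            reports.append((match.group("cache"), taint, match.group("message")))
    return reports

# Sample lines copied from the extracts in this report.
sample = [
    "Jun 05 18:56:20 fedora.domain kernel: BUG kmalloc-1k (Tainted: G B             ): Wrong object count. Counter is 10 but counted were 28",
    "Jun 05 18:56:20 fedora.domain kernel: BUG kmalloc-1k (Not tainted): Freechain corrupt",
    "Jun 05 18:56:17 fedora.domain kernel: ath11k_pci 0000:02:00.0: Failed to set the requested Country regulatory setting",
]

for report in find_slub_bugs(sample):
    print(report)
```

On a saved log, the same function could be fed the lines of the file instead of `sample`.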

-> After the issues: cat /proc/sys/kernel/tainted -> 32  

-> Normally: cat /proc/sys/kernel/tainted -> 0  
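For readers unfamiliar with the taint value: it is a bitmask, and 32 corresponds to bit 5, the `B` ("bad page") flag that also shows up as `Tainted: G B` in the log above. The sketch below decodes such a value; the table is transcribed from the kernel's tainted-kernels documentation as I understand it, so please verify it against the documentation of your kernel version. The helper name `decode_taint` is hypothetical.

```python
# Bit number -> (letter, meaning). Transcribed from the kernel's
# "Tainted kernels" documentation; treat as an assumption and verify
# against your kernel version.
TAINT_FLAGS = {
    0: ("P", "proprietary module was loaded"),
    1: ("F", "module was force loaded"),
    2: ("S", "kernel running on an out-of-specification system"),
    3: ("R", "module was force unloaded"),
    4: ("M", "processor reported a machine check exception"),
    5: ("B", "bad page referenced or unexpected page flags"),
    6: ("U", "taint requested by userspace"),
    7: ("D", "kernel died recently (OOPS or BUG)"),
    8: ("A", "ACPI table overridden by user"),
    9: ("W", "kernel issued a warning"),
    10: ("C", "staging driver was loaded"),
    11: ("I", "workaround for a platform firmware bug applied"),
    12: ("O", "out-of-tree module was loaded"),
    13: ("E", "unsigned module was loaded"),
    14: ("L", "soft lockup occurred"),
    15: ("K", "kernel has been live patched"),
    16: ("X", "auxiliary taint, defined by distributions"),
    17: ("T", "kernel built with the struct randomization plugin"),
}

def decode_taint(value):
    """Return the (letter, meaning) pairs for all bits set in a taint value."""
    return [TAINT_FLAGS[bit] for bit in sorted(TAINT_FLAGS) if value & (1 << bit)]

# The two values reported above: 32 after the issues, 0 normally.
print(decode_taint(32))
print(decode_taint(0))
```

On a running system, the input would be the integer read from `/proc/sys/kernel/tainted`.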

The problem/bug has been discussed and identified in: https://bugzilla.redhat.com/show_bug.cgi?id=2193110 (mostly today's comments are relevant)  

Also: http://lists.infradead.org/pipermail/ath11k/2023-June/004476.html  

Thanks to Yi Hao for helping to identify the bug!  

-------  
The issues I have experienced (sometimes once a day, sometimes over 30 times a day): full system freezes; applications freezing one after another; error output in TTYs. In the cases where no freeze occurred, shutdown has never worked after an issue occurred, with the exception of the boot that produced the slub_debug=F log above.  
-------  

I have now disabled all debugging and am working again with my preferred settings (including my SELinux confined users as detailed in the Fedora bug report), which we had changed for testing during the Fedora bug report, but now with `module_blacklist=ath11k_pci,ath11k` added. So far, the blacklisting seems to have solved all issues I documented in the Fedora bug report (for now, I have also kept `amdgpu.dcdebugmask=0x10`). Both amd_pstate=passive and amd_pstate=active seem to work now.  

However, I have created a second boot option where ath11* is NOT blacklisted, which I use when I am near known WiFi networks: when the system can stay connected to a known WiFi network throughout the boot, I have no issues either.  

To avoid misunderstandings: the issues also appeared when I did not log in to KDE/GUI but only logged in on a TTY as root (which is not a confined user), while SDDM was still enabled. The issues also appeared when I disabled all SELinux confined user accounts.  

So the issues can be avoided by either staying permanently connected to a WiFi network OR booting with `module_blacklist=ath11k_pci,ath11k`.  

I have derived this assumption (which seems to work out) from the last point in my comment #47 in the Fedora bug report (https://bugzilla.redhat.com/show_bug.cgi?id=2193110#c47).
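To confirm which of the two boot options is active, the `module_blacklist=` entry can be read back from the kernel command line. The following is only a sketch: the helper `blacklisted_modules` is a hypothetical name, and the sample string is the set of parameters quoted in this report rather than a complete command line.

```python
def blacklisted_modules(cmdline):
    """Collect module names from every module_blacklist= entry in a command line."""
    modules = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "module_blacklist" and sep:
            # The value is a comma-separated list of module names.
            modules.extend(name for name in value.split(",") if name)
    return modules

# Parameters as quoted in this report (mitigated boot option).
sample = "amd_pstate=active amdgpu.dcdebugmask=0x10 module_blacklist=ath11k_pci,ath11k"
print(blacklisted_modules(sample))
```

On a running system, the string would come from `/proc/cmdline`; an empty list would indicate the boot option without the blacklist.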
Comment 1 py0xc3 2023-06-06 12:53:52 UTC
Supplement: all details are contained in the `journalctl -r` output above. But to give you a brief summary of the most important information:  

LENOVO 21CHCTO1WW, AMD Ryzen 7 6850U PRO (only integrated AMD graphics).  
Linux version 6.3.5-200.fc38.x86_64 (gcc (GCC) 13.1.1 20230511 (Red Hat 13.1.1-2), GNU ld version 2.39-9.fc38) #1 SMP PREEMPT_DYNAMIC  
Relevant boot parameter/data: vmlinuz-6.3.5-200.fc38.x86_64 root=... ro rootflags=subvol=root rd.luks.uuid=... rhgb quiet amd_pstate=passive amdgpu.dcdebugmask=0x10 slub_debug=F  
BIOS: R23ET60W (1.30 )  
Operating system: Fedora 38 KDE Spin.  

IMPORTANT: I have had these issues for a while, and they were already present in 6.2.X kernels! Unfortunately, I can no longer say when they started (it could also have been before 6.2.X, because the issues started with just a few freezes from time to time, and I had no time to focus on that back then).  

I will try to update the BIOS to 1.35 in the coming days and then test again by removing the blacklisting.
Comment 2 py0xc3 2023-06-06 13:05:10 UTC
The following is from now (so with the issue mitigation/blacklisting in place) and not from the corrupted boot with slub_debug=F (so I am currently on a working boot that is tainted=0 and uses `... amd_pstate=active amdgpu.dcdebugmask=0x10 module_blacklist=ath11k_pci,ath11k`), but it should still contain the relevant data for you. `lspci -mnn` AND `find /lib/firmware/ath11k/ -type f | xargs md5sum`:  
https://gitlab.com/py0xc31/public-tmp-storage/-/raw/main/slub_debug-F/lspci-mnn_find-ath11k-type-f-xargs-md5sum
Comment 3 py0xc3 2023-07-14 14:07:12 UTC
I started testing with kernel 6.4.2 on 7th July (and migrated to 6.4.3 today) without `module_blacklist=ath11k_pci,ath11k` and without `amdgpu.dcdebugmask=0x10`.

So far it seems that kernel 6.4 has solved both issues (that is, the issue of this report as well as the PSR issue that was mitigated by `amdgpu.dcdebugmask=0x10`).

I will keep this report open, but if the issue does not reappear by the end of the month, I will close this report as RESOLVED by kernel 6.4.
Comment 4 py0xc3 2023-07-25 15:10:08 UTC
Since migrating to kernel 6.4.X (starting with 6.4.2; currently I am on 6.4.4), I no longer experience the issue. Some fix/code in 6.4.X solves the bug. Blacklisting is no longer necessary: issue resolved by updating to kernel 6.4+.
