Created attachment 261255 [details] dmesg output with IWLWIFI_DEBUG=y
After a day or two of up-time, the iwlwifi driver reports a microcode SW error after spewing:
  iwlwifi 0000:04:00.0: swiotlb buffer is full (sz: 4096 bytes)
  iwlwifi 0000:04:00.0: DMA: Out of SW-IOMMU space for 4096 bytes
Removing and reinserting the iwl/80211 modules sometimes fixes the issue for a short amount of time. I typically need to reboot to get things working again.
  Kernel: 4.14.6-gentoo
  Firmware: 34.0.1
  wpa_supplicant: 2.6 with KRACK patches
This issue cropped up in conjunction with updating to the 4.14.x kernel and installing a linux-firmware snapshot newer than bf04291309d3169c0ad3b8db52564235bbd08e30 (2017-10-09) which updated the 31 firmware and added 34.
Created attachment 261257 [details] 4.14.6-gentoo kernel config
I'm having the same problems, and it's not just firmware related. I'm running Fedora myself. 4.14.4 was working fine with firmware 31 (while on FC26). Once I updated to FC27, that updated the firmware to 34 and the kernel to 4.14.5. The DMA problems occur after about 8 hours of use. My dmesg is at https://pastebin.com/r7VyQVWM The device: Intel Corporation Wireless 8260 (rev 3a). Since I blamed the f/w for this, I removed 34, leaving myself with 31. (32 and 33 don't work well either; with them I get a constant stream of kernel module failures.) After booting the latest FC27 kernel, 4.14.6-300, however, I ran into the same DMA problems around 8 hours in, even though it had firmware version 31. So, at this point I'm stuck with 4.13.16-202 and firmware 31.
Pawel, you say you're stuck with 4.13.16, why can't you run 4.14.4, if you say that it works fine? If you can confirm that 4.14.4 works fine, it will be much easier to figure out what broke in 4.14.5 or 4.14.6...
The only IOMMU-related change I can see between 4.14.4 and 4.14.6 is this:
  commit ce1079588ebc42a2e3a0d310a2ea1f3a75aa8d49
  Author: Robin Murphy <robin.murphy@arm.com>
  AuthorDate: Thu Sep 28 15:14:01 2017 +0100
  Commit: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  CommitDate: Thu Dec 14 09:52:55 2017 +0100
      iommu/vt-d: Fix scatterlist offset handling
      commit 29a90b70893817e2f2bb3cea40a29f5308e21b21 upstream.
...and I don't see anything in iwlwifi that could have broken this. Could you try to revert this patch and see if it solves the problem?
I was probably wrong on this. A lot of things changed nearly at the same time. I went through my reboot log, and this is what I see (boot times, Europe/Berlin):
  Dec 08 22:25:51 4.13.16-202 things are fine
  Dec 15 17:51:52 4.14.4-200.fc26 fine, but I didn't run this long enough (*)
  Dec 15 19:14:09 4.14.5-300.fc27 crash w .34 f/w
  Dec 16 16:34:30 4.14.4-200.fc26 crash w .34 f/w (dmesg is from this run)
  Dec 17 01:56:28 4.13.16-202.fc26 works fine
  Dec 18 23:21:02 4.14.6-300.fc27 crash w .31 f/w
So, I remember 4.14.4 working for me, but most likely because I didn't run it long enough for the problem to manifest. So, it then seems that it's not the firmware, and that the problem was introduced somewhere between 4.13.16 and 4.14.4. Any pointers on how to test specific commits? I haven't built my own kernels for years :) Thank you!
Given that it takes me 24-48 hours to trigger the problem, tracking down a regression window is taking some time. I'm not sure this is [directly] Intel IOMMU related, as I don't have GART_IOMMU / CALGARY_IOMMU / IOMMU_SUPPORT / INTEL_IOMMU enabled. Just SWIOTLB / IOMMU_HELPER. I think I've confirmed that this isn't a firmware regression. I've reproduced the problem with 31.532993.0, which I believe is the version I was using before I upgraded to 4.14. I've reproduced the issue on 4.14.6 and 4.14.7. I'm currently testing 4.14.2. I don't think I previously ran any version <=4.14.5 long enough to trigger the problem, so we'll see what shakes out. However, I am a bit concerned that this is a 4.14.0 regression. It looks like there was a lot of DMA related code refactoring[1], and that will be harder to bisect. [1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/drivers/net/wireless/intel/iwlwifi/?id=v4.13.16&id2=v4.14.1
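For anyone comparing setups, the options named above can be checked against a kernel config with a small filter. This is only a sketch: `iommu_options` is a made-up helper name, and it simply reads a kernel config from stdin and prints whichever of the options listed in this comment are set.

```shell
#!/bin/sh
# Sketch only: iommu_options is a hypothetical helper, not an existing tool.
# Reads a kernel config on stdin and prints the IOMMU/swiotlb-related
# options (the ones named in this thread) that are enabled.
iommu_options() {
    grep -E '^CONFIG_(GART_IOMMU|CALGARY_IOMMU|IOMMU_SUPPORT|INTEL_IOMMU|SWIOTLB|IOMMU_HELPER)='
}
```

Possible usage, assuming the config lives in the usual place: `iommu_options < /usr/src/linux/.config`, or `zcat /proc/config.gz | iommu_options` if the running kernel was built with CONFIG_IKCONFIG_PROC.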
Can you try to loop over ifup / ifdown? Let's see if that makes the bug happen more easily.
It looks like 4.14.2 is also broken. At this point, is it worth trying to test 4.14.0 or 4.14.1? It doesn't look like there are any iwlwifi/DMA/IOMMU/swiotlb changes in 4.14.1. Is using the backport-iwlwifi repo the preferred way to bisect the 4.14-rc development changes? What bounds should I be looking at?
I don't think that bisecting is the way to go for now. Please check what I wrote in comment 7. Run ifup / ifdown cycles and see if that makes the bug show up more often.
Created attachment 273331 [details] dmesg output from interface cycling
I don't believe Gentoo has ifup/ifdown, so I think I got NetworkManager to do something similar:
  while true; do echo "Toggling $(date)"; nmcli r wifi off; sleep 5; nmcli r wifi on; sleep 10; wget -q http://www.google.com -O /dev/null; done
It looks like it took about 1.5 hours to trigger a failure.
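A slightly extended form of that loop, as a sketch: it stops by itself once the kernel log contains the swiotlb message from the original report, so the time to failure gets recorded. The names `has_swiotlb_error` and `cycle_wifi` are made up for this example, and it assumes the kernel log has no earlier occurrence of the message (fresh boot, or ring buffer cleared with `dmesg -C` as root).

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of any tool.

# Succeeds if the kernel log text on stdin contains the swiotlb failure
# message quoted in the original report.
has_swiotlb_error() {
    grep -q 'swiotlb buffer is full'
}

# Toggle the wifi radio until the error shows up in dmesg, then print
# how long it took. Assumes dmesg is readable by the current user.
cycle_wifi() {
    start=$(date +%s)
    while ! dmesg | has_swiotlb_error; do
        echo "Toggling $(date)"
        nmcli r wifi off; sleep 5
        nmcli r wifi on;  sleep 10
        wget -q http://www.google.com -O /dev/null
    done
    echo "Failure after $(( $(date +%s) - start )) seconds"
}

# Run the loop only when asked explicitly, e.g.: sh cycle.sh run
if [ "${1:-}" = "run" ]; then
    cycle_wifi
fi
```

The timings (5 and 10 seconds) are the ones from the one-liner above; nothing suggests they are tuned, so shorter sleeps may or may not trigger the failure sooner.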
Here we go:
iwlwifi 0000:04:00.0: swiotlb buffer is full (sz: 32268 bytes)
[25119.413636] swiotlb: coherent allocation failed for device 0000:04:00.0 size=32268
[25119.413637] CPU: 3 PID: 1559 Comm: NetworkManager Tainted: P O 4.14.7-gentoo #1
[25119.413638] Hardware name: LENOVO 20ENCTO1WW/20ENCTO1WW, BIOS N1EET73W (1.46 ) 09/28/2017
[25119.413638] Call Trace:
[25119.413642] dump_stack+0x46/0x5a
[25119.413645] swiotlb_alloc_coherent+0x13a/0x160
[25119.413648] iwl_pcie_load_section+0xd2/0x4d0 [iwlwifi]
[25119.413650] ? iwl_trans_pcie_grab_nic_access+0x76/0xe0 [iwlwifi]
[25119.413652] ? iwl_trans_pcie_release_nic_access+0x2d/0x40 [iwlwifi]
[25119.413653] iwl_pcie_load_cpu_sections_8000.isra.19+0xe8/0x290 [iwlwifi]
[25119.413655] iwl_trans_pcie_start_fw+0x42c/0x6b0 [iwlwifi]
[25119.413657] iwl_mvm_load_ucode_wait_alive+0xf6/0x2f0 [iwlmvm]
[25119.413659] ? __schedule+0x186/0x4a0
I guess the problem is on the DMA allocation for the firmware loading... Thanks
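To see which allocation sizes are failing across a longer log, the sizes can be pulled out of the "swiotlb buffer is full" lines with a small filter. This is a sketch; `swiotlb_sizes` is a made-up helper name and the pattern only matches the message format quoted in this thread.

```shell
#!/bin/sh
# Sketch only: swiotlb_sizes is a hypothetical helper.
# Reads kernel log text on stdin and prints the byte count from each
# "swiotlb buffer is full (sz: N bytes)" line, one per line.
swiotlb_sizes() {
    sed -n 's/.*swiotlb buffer is full (sz: \([0-9]*\) bytes).*/\1/p'
}
```

For example, `dmesg | swiotlb_sizes | sort -n | uniq -c` would count how often each size failed, which may help tell the 4096-byte failures from the original report apart from the larger firmware-section allocations seen here.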
Created attachment 273367 [details] Fix candidate
Hi, I think I found the bug. Please apply the attached patch and let me know.
Patch looks good so far. I tried cycling the interface for a few hours and didn't encounter the problem. I've also had a stable wifi connection for over 54 hours, which is the longest so far. I'll keep an eye on it through the end of the week, but things look promising so far.
Good to know. Thanks. I'll leave this open for another day or two before I close it.
Patch queued for 4.15
Reporting back in, I've had stable wifi for ~114 hours, which more than doubles the longest up-time I've had with this bug. So it looks pretty well fixed to me. Thanks for the quick turn-around :)
Thanks for the confirmation. Patch is now on its way for 4.15 and will be backported to 4.14.