|Summary:||iwlwifi: Out of SW-IOMMU space|
|Product:||Drivers||Reporter:||Matthew Turnbull (sparky)|
|Component:||network-wireless||Assignee:||DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)|
|Severity:||normal||CC:||luca, pawel.veselov, rabin|
dmesg output with IWLWIFI_DEBUG=y
4.14.6-gentoo kernel config
dmesg output from interface cycling
Description Matthew Turnbull 2017-12-19 04:16:37 UTC
Comment 1 Matthew Turnbull 2017-12-19 04:17:50 UTC
Created attachment 261257 [details] 4.14.6-gentoo kernel config
Comment 2 Pawel 2017-12-20 12:43:20 UTC
I'm having the same problems, and it's not just firmware related. I'm running Fedora. 4.14.4 was working fine with firmware 31 (while on FC26). Updating to FC27 brought the firmware to 34 and the kernel to 4.14.5. The DMA problems occur after about 8 hours of use. My dmesg is at https://pastebin.com/r7VyQVWM

The device: Intel Corporation Wireless 8260 (rev 3a)

Since I blamed the firmware for this, I removed 34, leaving myself with 31. (32 and 33 don't work well either; with them I get a constant stream of kernel module failures.) After booting the latest FC27 kernel, 4.14.6-300, however, I ran into the same DMA problems around 8 hours in, even though it had firmware version 31. So, at this point I'm stuck with 4.13.16-202 and firmware 31.
Comment 3 Luca Coelho 2017-12-20 16:48:23 UTC
Pawel, you say you're stuck with 4.13.16, why can't you run 4.14.4, if you say that it works fine? If you can confirm that 4.14.4 works fine, it will be much easier to figure out what broke in 4.14.5 or 4.14.6...
Comment 4 Luca Coelho 2017-12-20 16:58:07 UTC
The only IOMMU-related change I can see between 4.14.4 and 4.14.6 is this:

commit ce1079588ebc42a2e3a0d310a2ea1f3a75aa8d49
Author:     Robin Murphy <firstname.lastname@example.org>
AuthorDate: Thu Sep 28 15:14:01 2017 +0100
Commit:     Greg Kroah-Hartman <email@example.com>
CommitDate: Thu Dec 14 09:52:55 2017 +0100

    iommu/vt-d: Fix scatterlist offset handling

    commit 29a90b70893817e2f2bb3cea40a29f5308e21b21 upstream.

...and I don't see anything in iwlwifi that could have broken this. Could you try to revert this patch and see if it solves the problem?
Comment 5 Pawel 2017-12-20 17:40:34 UTC
I was probably wrong on this. A lot of things changed at nearly the same time. I went through my reboot log, and this is what I see (boot times, Europe/Berlin):

Dec 08 22:25:51 4.13.16-202      things are fine
Dec 15 17:51:52 4.14.4-200.fc26  fine, but I didn't run this long enough (*)
Dec 15 19:14:09 4.14.5-300.fc27  crash w .34 f/w
Dec 16 16:34:30 4.14.4-200.fc26  crash w .34 f/w (dmesg is from this run)
Dec 17 01:56:28 4.13.16-202.fc26 works fine
Dec 18 23:21:02 4.14.6-300.fc27  crash w .31 f/w

(*) I remember 4.14.4 working for me, but most likely only because I didn't run it long enough for the problem to manifest. So it seems that it's not the firmware, and that the regression is somewhere between 4.13.16 and 4.14.4.

Any pointers on how to test specific commits? I haven't built my own kernels for years :)

Thank you!
Comment 6 Matthew Turnbull 2017-12-23 23:28:46 UTC
Given that it takes me 24-48 hours to trigger the problem, tracking down a regression window is taking some time.

I'm not sure this is [directly] Intel IOMMU related, as I don't have GART_IOMMU / CALGARY_IOMMU / IOMMU_SUPPORT / INTEL_IOMMU enabled. Just SWIOTLB / IOMMU_HELPER.

I think I've confirmed that this isn't a firmware regression. I've reproduced the problem with 31.532993.0, which I believe is the version I was using before I upgraded to 4.14.

I've reproduced the issue on 4.14.6 and 4.14.7. I'm currently testing 4.14.2. I don't think I previously ran any version <=4.14.5 long enough to trigger the problem, so we'll see what shakes out. However, I am a bit concerned that this is a 4.14.0 regression. It looks like there was a lot of DMA related code refactoring, and that will be harder to bisect.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/drivers/net/wireless/intel/iwlwifi/?id=v4.13.16&id2=v4.14.1
Comment 7 Emmanuel Grumbach 2017-12-24 06:56:11 UTC
Can you try to loop over ifup / ifdown? Let's see if that makes the bug happen more easily.
Comment 8 Matthew Turnbull 2017-12-27 10:14:47 UTC
It looks like 4.14.2 is also broken. At this point, is it worth trying to test 4.14.0 or 4.14.1? It doesn't look like there are any iwlwifi/DMA/IOMMU/swiotlb changes in 4.14.1. Is using the backport-iwlwifi repo the preferred way to bisect the 4.14-rc development changes? What bounds should I be looking at?
Comment 9 Emmanuel Grumbach 2017-12-27 10:17:06 UTC
I don't think that bisecting is the way to go for now. Please check what I wrote in comment 7. Run ifup / ifdown cycles and see if that makes the bug show up more often.
Comment 10 Matthew Turnbull 2017-12-27 20:09:19 UTC
Created attachment 273331 [details] dmesg output from interface cycling

I don't believe Gentoo has ifup/ifdown, so I think I got NetworkManager to do something similar.

while true; do echo "Toggling $(date)"; nmcli r wifi off; sleep 5; nmcli r wifi on; sleep 10; wget -q http://www.google.com -O /dev/null; done

It looks like it took about 1.5 hours to trigger a failure.
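For anyone repeating this test, the loop above could be extended to stop by itself once the failure shows up in the kernel log. This is only a sketch: the function names and the OFF_SECS / ON_SECS variables are made up for illustration, the two match strings are the ones that appear in the dmesg output quoted in this report, and reading dmesg may require root on some systems.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any tool.

# Succeeds (exit 0) if the kernel log text on stdin contains one of the
# swiotlb failure messages seen in this bug report.
has_swiotlb_failure() {
    grep -q -e 'swiotlb buffer is full' \
            -e 'swiotlb: coherent allocation failed'
}

# One off/on cycle, same steps and delays as the one-liner above.
cycle_wifi_once() {
    nmcli radio wifi off
    sleep "${OFF_SECS:-5}"
    nmcli radio wifi on
    sleep "${ON_SECS:-10}"
    # Generate a little traffic; ignore failures if the link is not up yet.
    wget -q http://www.google.com -O /dev/null || true
}

# Keep cycling until the failure appears in dmesg.
# Note: if an old failure is still in the log, this stops immediately.
run_until_failure() {
    count=0
    while ! dmesg | has_swiotlb_failure; do
        count=$((count + 1))
        echo "Cycle $count at $(date)"
        cycle_wifi_once
    done
    echo "swiotlb failure seen after $count cycles"
}
```

Nothing runs until run_until_failure is called, for example after sourcing the file in a root shell.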
Comment 11 Emmanuel Grumbach 2017-12-27 20:20:54 UTC
Here we go:

iwlwifi 0000:04:00.0: swiotlb buffer is full (sz: 32268 bytes)
[25119.413636] swiotlb: coherent allocation failed for device 0000:04:00.0 size=32268
[25119.413637] CPU: 3 PID: 1559 Comm: NetworkManager Tainted: P O 4.14.7-gentoo #1
[25119.413638] Hardware name: LENOVO 20ENCTO1WW/20ENCTO1WW, BIOS N1EET73W (1.46 ) 09/28/2017
[25119.413638] Call Trace:
[25119.413642] dump_stack+0x46/0x5a
[25119.413645] swiotlb_alloc_coherent+0x13a/0x160
[25119.413648] iwl_pcie_load_section+0xd2/0x4d0 [iwlwifi]
[25119.413650] ? iwl_trans_pcie_grab_nic_access+0x76/0xe0 [iwlwifi]
[25119.413652] ? iwl_trans_pcie_release_nic_access+0x2d/0x40 [iwlwifi]
[25119.413653] iwl_pcie_load_cpu_sections_8000.isra.19+0xe8/0x290 [iwlwifi]
[25119.413655] iwl_trans_pcie_start_fw+0x42c/0x6b0 [iwlwifi]
[25119.413657] iwl_mvm_load_ucode_wait_alive+0xf6/0x2f0 [iwlmvm]
[25119.413659] ? __schedule+0x186/0x4a0

I guess the problem is on the DMA allocation for the firmware loading...

Thanks
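To check a saved dmesg for the same symptom, a small filter along these lines may help. It is a sketch under the assumption that the message has exactly the wording shown in the trace above ("swiotlb: coherent allocation failed for device ... size=..."); the function name swiotlb_failures is made up for illustration.

```shell
#!/bin/sh
# Sketch only: prints "<pci-device> <size-in-bytes>" for every
# "swiotlb: coherent allocation failed" line read from stdin.
# Assumes the message format seen in the 4.14.7 trace in this report.
swiotlb_failures() {
    sed -n 's/.*swiotlb: coherent allocation failed for device \([^ ]*\) size=\([0-9]*\).*/\1 \2/p'
}
```

Possible usage: `dmesg | swiotlb_failures`, or `swiotlb_failures < saved-dmesg.txt` for a log captured earlier (the file name is just an example).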
Comment 12 Emmanuel Grumbach 2017-12-31 14:35:27 UTC
Created attachment 273367 [details] Fix candidate

Hi, I think I found the bug. Please apply the attached patch and let me know.
Comment 13 Matthew Turnbull 2018-01-03 15:27:31 UTC
Patch looks good so far. I tried cycling the interface for a few hours and didn't encounter the problem. I've also had a stable wifi connection for over 54 hours, which is the longest so far. I'll keep an eye on it through the end of the week, but things look promising so far.
Comment 14 Emmanuel Grumbach 2018-01-03 16:30:15 UTC
Good to know. Thanks. I'll leave this open for another day or two before I close it.
Comment 15 Emmanuel Grumbach 2018-01-04 15:44:15 UTC
Patch queued for 4.15
Comment 16 Matthew Turnbull 2018-01-05 22:10:58 UTC
Reporting back in, I've had stable wifi for ~114 hours, which more than doubles the longest up-time I've had with this bug. So it looks pretty well fixed to me. Thanks for the quick turn-around :)
Comment 17 Emmanuel Grumbach 2018-01-06 19:17:17 UTC
Thanks for the confirmation. Patch is now on its way for 4.15 and will be backported to 4.14.