Bug 198201

Summary: iwlwifi: Out of SW-IOMMU space
Product: Drivers
Reporter: Matthew Turnbull (sparky)
Component: network-wireless
Assignee: DO NOT USE - assign "network-wireless-intel" component instead (linuxwifi)
Status: CLOSED CODE_FIX
Severity: normal
CC: luca, pawel.veselov, rabin
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.14
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
dmesg output with IWLWIFI_DEBUG=y
4.14.6-gentoo kernel config
dmesg output from interface cycling
Fix candidate

Description Matthew Turnbull 2017-12-19 04:16:37 UTC
Created attachment 261255
dmesg output with IWLWIFI_DEBUG=y

After a day or two of up-time, the iwlwifi driver reports a microcode SW error after spewing:

iwlwifi 0000:04:00.0: swiotlb buffer is full (sz: 4096 bytes)
iwlwifi 0000:04:00.0: DMA: Out of SW-IOMMU space for 4096 bytes

Removing and reinserting the iwl/80211 modules sometimes fixes the issue for a short amount of time. I typically need to reboot to get things working again.
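
To track how often the two messages above show up, a saved kernel log can be searched for them. This is a minimal sketch: the helper name count_swiotlb_errors and the use of a saved dmesg file are assumptions for illustration, not something from the original report.

```shell
# Count kernel log lines that report swiotlb / SW-IOMMU exhaustion.
# Usage: count_swiotlb_errors <saved-dmesg-file>
# (count_swiotlb_errors is a hypothetical helper name.)
count_swiotlb_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # that number is 0, so "|| true" keeps the helper usable under "set -e".
    grep -c -E 'swiotlb buffer is full|Out of SW-IOMMU space' "$1" || true
}
```

For example, after "dmesg > /tmp/dmesg.txt", running "count_swiotlb_errors /tmp/dmesg.txt" prints the number of matching lines.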

Kernel: 4.14.6-gentoo
Firmware: 34.0.1
wpa_supplicant: 2.6 with KRACK patches

This issue cropped up in conjunction with updating to the 4.14.x kernel and installing a linux-firmware snapshot newer than bf04291309d3169c0ad3b8db52564235bbd08e30 (2017-10-09) which updated the 31 firmware and added 34.
Comment 1 Matthew Turnbull 2017-12-19 04:17:50 UTC
Created attachment 261257
4.14.6-gentoo kernel config
Comment 2 Pawel 2017-12-20 12:43:20 UTC
I'm having the same problem, and it's not just firmware related. I'm running Fedora. 4.14.4 was working fine with firmware 31 (on FC26). Once I updated to FC27, the firmware was updated to 34 and the kernel to 4.14.5. The DMA problems occur after about 8 hours of use. My dmesg is at

https://pastebin.com/r7VyQVWM

The device : Intel Corporation Wireless 8260 (rev 3a)

Since I initially blamed the firmware for this, I removed 34, leaving myself with 31. (32 and 33 don't work well either; with them I get a constant stream of kernel module failures.)

After booting the latest FC27 kernel, 4.14.6-300, however, I ran into the same DMA problems around 8 hours in, even though it had firmware version 31.

So, at this point I'm stuck with 4.13.16-202, and firmware 31.
Comment 3 Luca Coelho 2017-12-20 16:48:23 UTC
Pawel, you say you're stuck with 4.13.16, why can't you run 4.14.4, if you say that it works fine?

If you can confirm that 4.14.4 works fine, it will be much easier to figure out what broke in 4.14.5 or 4.14.6...
Comment 4 Luca Coelho 2017-12-20 16:58:07 UTC
The only IOMMU-related change I can see between 4.14.4 and 4.14.6 is this:

commit ce1079588ebc42a2e3a0d310a2ea1f3a75aa8d49
Author:     Robin Murphy <robin.murphy@arm.com>
AuthorDate: Thu Sep 28 15:14:01 2017 +0100
Commit:     Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CommitDate: Thu Dec 14 09:52:55 2017 +0100

    iommu/vt-d: Fix scatterlist offset handling
    
    commit 29a90b70893817e2f2bb3cea40a29f5308e21b21 upstream.

...and I don't see anything in iwlwifi that could have broken this.

Could you try to revert this patch and see if it solves the problem?
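
A revert like this could be done with git in a kernel source checkout before rebuilding. The sketch below is only an illustration: revert_commit is a hypothetical helper name, and the location of the source tree is an assumption.

```shell
# Revert a single commit in a kernel source tree (rebuild afterwards).
# Usage: revert_commit <path-to-tree> <commit-id>
# (revert_commit is a hypothetical helper name.)
revert_commit() {
    tree=$1
    commit=$2
    # --no-edit accepts the default "Revert ..." commit message.
    git -C "$tree" revert --no-edit "$commit"
}
```

Assuming a linux-stable checkout at the affected release in ~/linux-stable, the call would be "revert_commit ~/linux-stable ce1079588ebc42a2e3a0d310a2ea1f3a75aa8d49", followed by a normal kernel rebuild and install.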
Comment 5 Pawel 2017-12-20 17:40:34 UTC
I was probably wrong on this. A lot of things changed at nearly the same time. I went through my reboot log, and this is what I see (boot times, Europe/Berlin):

Dec 08 22:25:51 4.13.16-202 things are fine
Dec 15 17:51:52 4.14.4-200.fc26 fine, but I didn't run this long enough (*)
Dec 15 19:14:09 4.14.5-300.fc27 crash w .34 f/w
Dec 16 16:34:30 4.14.4-200.fc26 crash w .34 f/w (dmesg is from this run)
Dec 17 01:56:28 4.13.16-202.fc26 works fine
Dec 18 23:21:02 4.14.6-300.fc27 crash w .31 f/w

So, I remember 4.14.4 working for me, but most likely because I didn't run it long enough for the problem to manifest.

So it seems that it's not the firmware, but rather something that changed between 4.13.16 and 4.14.4.

Any pointers on how to test specific commits? I haven't built my own kernels for years :) Thank you!
Comment 6 Matthew Turnbull 2017-12-23 23:28:46 UTC
Given that it takes me 24-48 hours to trigger the problem, tracking down a regression window is taking some time.

I'm not sure this is [directly] Intel IOMMU related, as I don't have GART_IOMMU / CALGARY_IOMMU / IOMMU_SUPPORT / INTEL_IOMMU enabled. Just SWIOTLB / IOMMU_HELPER.
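
A quick way to double-check which of these options a given kernel config enables is sketched below. The helper name iommu_config_summary is hypothetical, and the config file path is an assumption that depends on the distribution.

```shell
# Print whether each IOMMU-related option is enabled in a kernel config file.
# Usage: iommu_config_summary <kernel-config-file>
# (iommu_config_summary is a hypothetical helper name.)
iommu_config_summary() {
    config_file=$1
    for opt in GART_IOMMU CALGARY_IOMMU IOMMU_SUPPORT INTEL_IOMMU SWIOTLB IOMMU_HELPER; do
        # Enabled options appear as CONFIG_<name>=y (or =m); disabled ones
        # appear as "# CONFIG_<name> is not set" or are absent entirely.
        if grep -q "^CONFIG_${opt}=[ym]" "$config_file"; then
            echo "${opt}=enabled"
        else
            echo "${opt}=disabled"
        fi
    done
}
```

For example, "iommu_config_summary /usr/src/linux/.config", assuming that is where the running kernel's config lives.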

I think I've confirmed that this isn't a firmware regression. I've reproduced the problem with 31.532993.0, which I believe is the version I was using before I upgraded to 4.14.

I've reproduced the issue on 4.14.6 and 4.14.7. I'm currently testing 4.14.2. I don't think I previously ran any version <=4.14.5 long enough to trigger the problem, so we'll see what shakes out.

However, I am a bit concerned that this is a 4.14.0 regression. It looks like there was a lot of DMA related code refactoring[1], and that will be harder to bisect.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/drivers/net/wireless/intel/iwlwifi/?id=v4.13.16&id2=v4.14.1
Comment 7 Emmanuel Grumbach 2017-12-24 06:56:11 UTC
Can you try to loop over ifup / ifdown?
Let's see if that makes the bug happen more easily.
Comment 8 Matthew Turnbull 2017-12-27 10:14:47 UTC
It looks like 4.14.2 is also broken.

At this point, is it worth trying to test 4.14.0 or 4.14.1? It doesn't look like there are any iwlwifi/DMA/IOMMU/swiotlb changes in 4.14.1.

Is using the backport-iwlwifi repo the preferred way to bisect the 4.14-rc development changes? What bounds should I be looking at?
Comment 9 Emmanuel Grumbach 2017-12-27 10:17:06 UTC
I don't think that bisecting is the way to go for now.

Please check what I wrote in comment 7.

Run ifup / ifdown cycles and see if that makes the bug show up more often.
Comment 10 Matthew Turnbull 2017-12-27 20:09:19 UTC
Created attachment 273331
dmesg output from interface cycling

I don't believe Gentoo has ifup/ifdown, so I think I got NetworkManager to do something similar.

while true; do echo "Toggling $(date)"; nmcli r wifi off; sleep 5; nmcli r wifi on; sleep 10; wget -q http://www.google.com -O /dev/null; done

It looks like it took about 1.5 hours to trigger a failure.
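
A variant of the loop that stops by itself once a new swiotlb message appears could look like the sketch below. It assumes nmcli, wget and a readable dmesg (some systems restrict dmesg to root). The function names and the cycle limit are illustrative, not part of the test that was actually run.

```shell
# Toggle Wi-Fi until a new "swiotlb buffer is full" line shows up in dmesg.
# count_swiotlb_full and cycle_wifi_until_failure are hypothetical names.

count_swiotlb_full() {
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    dmesg | grep -c 'swiotlb buffer is full' || true
}

cycle_wifi_until_failure() {
    max_cycles=${1:-1000}            # assumed default upper bound
    baseline=$(count_swiotlb_full)   # ignore failures logged before the run
    cycle=0
    while [ "$cycle" -lt "$max_cycles" ]; do
        cycle=$((cycle + 1))
        echo "Toggling $(date)"
        nmcli r wifi off; sleep 5
        nmcli r wifi on; sleep 10
        wget -q http://www.google.com -O /dev/null
        if [ "$(count_swiotlb_full)" -gt "$baseline" ]; then
            echo "swiotlb failure seen after $cycle cycles"
            return 0
        fi
    done
    echo "no failure in $max_cycles cycles"
    return 1
}
```

Running "cycle_wifi_until_failure 500" would cycle the radio at most 500 times. Comparing against a baseline count avoids stopping immediately on messages left over from an earlier failure.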
Comment 11 Emmanuel Grumbach 2017-12-27 20:20:54 UTC
Here we go:

 iwlwifi 0000:04:00.0: swiotlb buffer is full (sz: 32268 bytes)
[25119.413636] swiotlb: coherent allocation failed for device 0000:04:00.0 size=32268
[25119.413637] CPU: 3 PID: 1559 Comm: NetworkManager Tainted: P           O    4.14.7-gentoo #1
[25119.413638] Hardware name: LENOVO 20ENCTO1WW/20ENCTO1WW, BIOS N1EET73W (1.46 ) 09/28/2017
[25119.413638] Call Trace:
[25119.413642]  dump_stack+0x46/0x5a
[25119.413645]  swiotlb_alloc_coherent+0x13a/0x160
[25119.413648]  iwl_pcie_load_section+0xd2/0x4d0 [iwlwifi]
[25119.413650]  ? iwl_trans_pcie_grab_nic_access+0x76/0xe0 [iwlwifi]
[25119.413652]  ? iwl_trans_pcie_release_nic_access+0x2d/0x40 [iwlwifi]
[25119.413653]  iwl_pcie_load_cpu_sections_8000.isra.19+0xe8/0x290 [iwlwifi]
[25119.413655]  iwl_trans_pcie_start_fw+0x42c/0x6b0 [iwlwifi]
[25119.413657]  iwl_mvm_load_ucode_wait_alive+0xf6/0x2f0 [iwlmvm]
[25119.413659]  ? __schedule+0x186/0x4a0

I guess the problem is on the DMA allocation for the firmware loading...
Thanks
Comment 12 Emmanuel Grumbach 2017-12-31 14:35:27 UTC
Created attachment 273367
Fix candidate

Hi,

I think I found the bug. Please apply the patch attached and let me know.
Comment 13 Matthew Turnbull 2018-01-03 15:27:31 UTC
Patch looks good so far. I tried cycling the interface for a few hours and didn't encounter the problem. I've also had a stable wifi connection for over 54 hours, which is the longest so far. I'll keep an eye on it through the end of the week, but things look promising so far.
Comment 14 Emmanuel Grumbach 2018-01-03 16:30:15 UTC
Good to know.

Thanks.

I'll leave this open for another day or two before I close it.
Comment 15 Emmanuel Grumbach 2018-01-04 15:44:15 UTC
Patch queued for 4.15
Comment 16 Matthew Turnbull 2018-01-05 22:10:58 UTC
Reporting back in, I've had stable wifi for ~114 hours, which more than doubles the longest up-time I've had with this bug. So it looks pretty well fixed to me.

Thanks for the quick turn-around :)
Comment 17 Emmanuel Grumbach 2018-01-06 19:17:17 UTC
Thanks for the confirmation.

Patch is now on its way for 4.15 and will be backported to 4.14.