Bug 188351

Summary: mwifiex causes skb_over_panic on Surface Pro 4
Product: Networking Reporter: Jeff Farthing (jeff)
Component: WirelessAssignee: networking_wireless (networking_wireless)
Status: NEW ---    
Severity: blocking CC: afzal.arsalan, cavoegele, gbhat, skeirik2
Priority: P1    
Hardware: All   
OS: Linux   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=189151
Kernel Version: 4.9-rc6 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: skb_over_panic on 4.9-rc6
Deletes a seemingly extra call to skb_reserve()
skb_over_panic on Surface 3
Patch including changes to mwifiex tree from 4.9.6 kernel to current linux-next

Description Jeff Farthing 2016-11-22 04:00:56 UTC
Created attachment 245511 [details]
skb_over_panic on 4.9-rc6

After successfully installing Arch Linux on my Surface Pro 4, I'm never able to install anything as using pacman via wifi crashes the system. This occurs on 4.8 and up until 4.9-rc6. In order to actually get any error output, I had to shrink the console font and actually scan the tablet, so I have an image of the error output. This is from 4.9-rc6.
Comment 1 Jeff Farthing 2016-11-22 04:02:15 UTC
Also, FWIW, this does not occur on the 4.4 LTS kernel, nor on the 4.7 kernel that I have on my Archiso.
Comment 2 Arsalan Afzal 2016-11-25 05:15:26 UTC
Can confirm this occurs on my Surface Book which has the same wireless chipset from 4.8.10 to 4.9 RC6. An extremely reproducible manner I've found of triggering this bug using sftp to another ssh server. I get an immediate crash whenever executing get <filename>. Will post a full kernel crash log once I get back from vacation and compile a debug kernel (Thanks Arch).
Comment 3 Jeff Farthing 2016-11-25 17:23:45 UTC
Simply using pacman does it everytime for me.
Comment 4 Stephen Skeirik 2016-11-27 13:11:06 UTC
Created attachment 245971 [details]
Deletes a seemingly extra call to skb_reserve()

I also have a MSP4 laptop, and also too much time on my hands.

Using git blame, I found that skb_reserve() is called twice for DMA alignment in the function mwifiex_11n_aggregate_pkt() in file 11n_aggr.c. These additions occurred in commits:

84b313b35f8158d7ff35d649d18428a7740db49e
e35000ead491d71e59ab6f7458971321e06150a0

The second commit calls a helper function mwifiex_alloc_dma_align_buf() which eventually calls skb_reserve(). It seems that this error alone did not cause the kernel panic, but in commit:

c81396f3da22aa8f1e8fbf7943616a0839c4d63d

a loop bound on a packet processing queue changed:

-               if (skb_tailroom(skb_aggr) < (skb_src->len + LLC_SNAP_LEN))
+               if ((skb_aggr->len + skb_src->len + LLC_SNAP_LEN) >
+                   adapter->tx_buf_size)

I suppose the correct loop bound caused a buffer overflow because reserved memory was not actually backed by real memory... or something. I'm not really sure about all the details. In any case, I just commented out the first call to skb_reserve(), recompiled the kernel, and haven't had any problems since.
Comment 5 Arsalan Afzal 2016-11-29 07:48:19 UTC
Well go figure, that seemed to have done it. Have been running about 48 hours with this patch, and it seems to have done the trick. As an addendum, you probably want to clean up the patch you submitted because I had to the trim the first couple lines and it still complained about something with stopping mid line. Other than that, who knows if this is the optimal way to fix it but it certainly delivers results.
Comment 6 Arsalan Afzal 2016-11-29 07:49:27 UTC
Also I patched this on top of 4.9RC7 for more information using a olddefconfig on the stock Arch Linux kernel.
Comment 7 Chad Voegele 2017-01-06 18:46:19 UTC
Created attachment 250571 [details]
skb_over_panic on Surface 3

This patch is working for me on a Surface 3 (non-pro).

I included the kernel log before the patch was applied.
Comment 8 Chad Voegele 2017-01-30 01:46:57 UTC
This issue reappeared for me today during a Dropbox LAN sync. After a little digging, it appears to be triggered by any high transfer rate activity. Running a test on speedtest.net tripped it immediately. Pings to Google DNS and browsing the web for several minutes did not trigger it though.
Comment 9 Arsalan Afzal 2017-01-30 01:57:32 UTC
(In reply to Chad Voegele from comment #8)
> This issue reappeared for me today during a Dropbox LAN sync. After a little
> digging, it appears to be triggered by any high transfer rate activity.
> Running a test on speedtest.net tripped it immediately. Pings to Google DNS
> and browsing the web for several minutes did not trigger it though.

Is this with or without the above patch? I've been building my kernels with this patch for a couple months now and have not run into this problem since then including with high throughput uploads and downloads.
Comment 10 Chad Voegele 2017-01-31 01:28:52 UTC
Yeah, I have the patch applied. I did a little more testing today. I don't see the issue at school including high throughput. It is repeatable though on my home router (TP Archer C5). I tried both 5GHz and 2.4GHz but that distinction doesn't seem to matter.
Comment 11 Stephen Skeirik 2017-01-31 21:47:27 UTC
So, apparently, some engineers at Marvell found this bug and proposed a fix which is currently available in linux-next (it was added the 19th of this month). You can see the log entry for this commit at the URL below:

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/net/wireless/marvell/mwifiex/11n_aggr.c?id=5f0a221f59ad6b72202ef9c6e232086de8c336f2

I haven't built it and tried it myself, but I recommend you pull this patch (plus any other patches to mwifiex in linux-next) and rebuild. If it works (or doesn't work) for you, I'm we would all be curious to hear about it.
Comment 12 Stephen Skeirik 2017-02-01 08:04:12 UTC
Created attachment 253761 [details]
Patch including changes to mwifiex tree from 4.9.6 kernel to current linux-next

So, curiousity got the better of me, and I made a patch based on the ideas in comment #11 with all changes to mwifiex from v4.9.6 until linux-next HEAD (with the exception that I had to rollback one tiny commit in linux-next because it seemed to require some deep changes outside the mwifiex codebase).

This patch shoud apply cleanly to linux kernel v4.9.6 (which happens to be the current stable version for Arch Linux, if you happen to use that distro).
Comment 13 Ganapathi Bhat 2018-11-25 08:01:32 UTC
Hi Chad,

(In reply to Chad Voegele from comment #10)
> Yeah, I have the patch applied. I did a little more testing today. I don't
> see the issue at school including high throughput. It is repeatable though
> on my home router (TP Archer C5). I tried both 5GHz and 2.4GHz but that
> distinction doesn't seem to matter.
Can you try applying the patch given in comment#11? Let us know if it solved your issue.

Regards,
Ganapathi