Created attachment 245511 [details] skb_over_panic on 4.9-rc6 After successfully installing Arch Linux on my Surface Pro 4, I'm never able to install anything as using pacman via wifi crashes the system. This occurs on 4.8 and up until 4.9-rc6. In order to actually get any error output, I had to shrink the console font and actually scan the tablet, so I have an image of the error output. This is from 4.9-rc6.
Also, FWIW, this does not occur on the 4.4 LTS kernel, nor on the 4.7 kernel that I have on my Archiso.
Can confirm this occurs on my Surface Book which has the same wireless chipset from 4.8.10 to 4.9 RC6. An extremely reproducible manner I've found of triggering this bug using sftp to another ssh server. I get an immediate crash whenever executing get <filename>. Will post a full kernel crash log once I get back from vacation and compile a debug kernel (Thanks Arch).
Simply using pacman does it everytime for me.
Created attachment 245971 [details] Deletes a seemingly extra call to skb_reserve() I also have a MSP4 laptop, and also too much time on my hands. Using git blame, I found that skb_reserve() is called twice for DMA alignment in the function mwifiex_11n_aggregate_pkt() in file 11n_aggr.c. These additions occurred in commits: 84b313b35f8158d7ff35d649d18428a7740db49e e35000ead491d71e59ab6f7458971321e06150a0 The second commit calls a helper function mwifiex_alloc_dma_align_buf() which eventually calls skb_reserve(). It seems that this error alone did not cause the kernel panic, but in commit: c81396f3da22aa8f1e8fbf7943616a0839c4d63d a loop bound on a packet processing queue changed: - if (skb_tailroom(skb_aggr) < (skb_src->len + LLC_SNAP_LEN)) + if ((skb_aggr->len + skb_src->len + LLC_SNAP_LEN) > + adapter->tx_buf_size) I suppose the correct loop bound caused a buffer overflow because reserved memory was not actually backed by real memory... or something. I'm not really sure about all the details. In any case, I just commented out the first call to skb_reserve(), recompiled the kernel, and haven't had any problems since.
Well go figure, that seemed to have done it. Have been running about 48 hours with this patch, and it seems to have done the trick. As an addendum, you probably want to clean up the patch you submitted because I had to the trim the first couple lines and it still complained about something with stopping mid line. Other than that, who knows if this is the optimal way to fix it but it certainly delivers results.
Also I patched this on top of 4.9RC7 for more information using a olddefconfig on the stock Arch Linux kernel.
Created attachment 250571 [details] skb_over_panic on Surface 3 This patch is working for me on a Surface 3 (non-pro). I included the kernel log before the patch was applied.
This issue reappeared for me today during a Dropbox LAN sync. After a little digging, it appears to be triggered by any high transfer rate activity. Running a test on speedtest.net tripped it immediately. Pings to Google DNS and browsing the web for several minutes did not trigger it though.
(In reply to Chad Voegele from comment #8) > This issue reappeared for me today during a Dropbox LAN sync. After a little > digging, it appears to be triggered by any high transfer rate activity. > Running a test on speedtest.net tripped it immediately. Pings to Google DNS > and browsing the web for several minutes did not trigger it though. Is this with or without the above patch? I've been building my kernels with this patch for a couple months now and have not run into this problem since then including with high throughput uploads and downloads.
Yeah, I have the patch applied. I did a little more testing today. I don't see the issue at school including high throughput. It is repeatable though on my home router (TP Archer C5). I tried both 5GHz and 2.4GHz but that distinction doesn't seem to matter.
So, apparently, some engineers at Marvell found this bug and proposed a fix which is currently available in linux-next (it was added the 19th of this month). You can see the log entry for this commit at the URL below: https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/net/wireless/marvell/mwifiex/11n_aggr.c?id=5f0a221f59ad6b72202ef9c6e232086de8c336f2 I haven't built it and tried it myself, but I recommend you pull this patch (plus any other patches to mwifiex in linux-next) and rebuild. If it works (or doesn't work) for you, I'm we would all be curious to hear about it.
Created attachment 253761 [details] Patch including changes to mwifiex tree from 4.9.6 kernel to current linux-next So, curiousity got the better of me, and I made a patch based on the ideas in comment #11 with all changes to mwifiex from v4.9.6 until linux-next HEAD (with the exception that I had to rollback one tiny commit in linux-next because it seemed to require some deep changes outside the mwifiex codebase). This patch shoud apply cleanly to linux kernel v4.9.6 (which happens to be the current stable version for Arch Linux, if you happen to use that distro).
Hi Chad, (In reply to Chad Voegele from comment #10) > Yeah, I have the patch applied. I did a little more testing today. I don't > see the issue at school including high throughput. It is repeatable though > on my home router (TP Archer C5). I tried both 5GHz and 2.4GHz but that > distinction doesn't seem to matter. Can you try applying the patch given in comment#11? Let us know if it solved your issue. Regards, Ganapathi