Bug 84911

Summary: OOPS: kernel BUG at net/core/skbuff.c:99! Bricked my system.
Product: Networking
Reporter: Henrik Asp (solenskiner)
Component: Wireless
Assignee: networking_wireless (networking_wireless)
Status: RESOLVED DUPLICATE
Severity: high
CC: alan, asselsm, solenskiner, stf_xl
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 3.16.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Picture of the Oops.
possibly relevant logs
possibly relevant logs
Picture of second Oops

Description Henrik Asp 2014-09-21 12:08:25 UTC
Created attachment 151131 [details]
Picture of the Oops.

I was not doing anything special at all, just checking Facebook. Out of the blue the kernel oopsed; I took a few photos of it and rebooted. The screen then just went into power-saving mode after showing me some output from systemd.

I tried again, this time appending "1" to the GRUB kernel command line, _guessing_ it was something with the graphics driver or Xorg, which I recently changed from Catalyst to radeonSI, but this time it wouldn't even show me any output from systemd before putting the screen in power-saving mode.

Windows still works. 

Dist: Archlinux
Kernel: version 3.16.2
Filesystems: ext4
Systemd: version 216-3
Xorg: 1.16
Driver: xf86-video-ati version 1:7.4.0-3
Mesa: 10.2.8-1
DRI: ati-dri version 10.2.8-1
Firefox: firefox-kde-opensuse version 32.0.1-1 from Arch User Repository

CPU: AMD FX-8150 8C 3.6G 16M AM3+ 125W
Motherboard: MK Gigabyte GA-990FXA-UD7 ATX, 990FX, AM3+
Memory: Corsair 16GB (2x8192MB) CL9 VENGEANCE
PSU: be quiet! Pure Power L7 730W 
Disks: Plextor SSD PX-128M3-03 128GB
       HD3TB Seagate SATA 6Gb/s 64MB ST3000DM001
       HD3TB Seagate SATA 6Gb/s 64MB ST3000DM001
Wifi: Dynamode WL-700N-RXS 150Mbps Nano 802.11n Wireless USB Adapter Dongle, with chipset Ralink 5370 

Attached is my best attempt to restitch and clean up multiple pictures of the kernel oops message. I did screw up some of the photos, though I hope the relevant information is still readable.

If there is anything else I can provide, don't hesitate to contact me at solenskiner at gmail dot com. I am not subscribed to any lists.
Comment 1 Stanislaw Gruszka 2014-09-22 09:15:26 UTC
Is this issue reproducible for you? Are you using AP mode?
Comment 2 Henrik Asp 2014-09-22 16:43:14 UTC
I just got the machine back up again today, I'm not sure exactly what the problem was but some shotgun debugging worked ;)

I am using AP mode, but I have not been able to reproduce the issue yet.

I found a bunch of related warnings in my logs, which I will attach in case they might be useful.
Comment 3 Henrik Asp 2014-09-22 16:44:18 UTC
Created attachment 151331 [details]
possibly relevant logs
Comment 4 Henrik Asp 2014-09-22 16:45:45 UTC
Created attachment 151341 [details]
possibly relevant logs
Comment 5 Mark Asselstine 2014-09-23 03:31:49 UTC
I am starting to get an idea of what might be going on here and might have a patch for this shortly. Basically, the driver is attempting to align skb.data to a 4-byte boundary; however, the skb lacks the necessary headroom, so skb_push hits the BUG. So we need to check for the necessary headroom before extending skb.data to make it 4-byte aligned and, if we lack the headroom, use another method to set up the byte alignment and avoid the BUG.
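
To illustrate the headroom check described above, here is a minimal userspace sketch. It is not the rt2x00 code and not the proposed patch: `struct toy_skb`, `toy_headroom` and `toy_push_checked` are hypothetical stand-ins that only model the head and data pointers of an skb. The point is that a push which would move data in front of head is refused instead of ending in a BUG.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few sk_buff fields that matter here. */
struct toy_skb {
    unsigned char *head; /* start of the allocated buffer */
    unsigned char *data; /* start of the packet data */
    size_t len;          /* bytes of packet data */
};

/* Bytes available in front of the packet data. */
static size_t toy_headroom(const struct toy_skb *skb)
{
    assert(skb->data >= skb->head);
    return (size_t)(skb->data - skb->head);
}

/*
 * Extend the packet data towards the front by n bytes, but only if the
 * headroom allows it. Returns the new data pointer, or NULL (leaving the
 * buffer untouched) when the push would underrun the buffer.
 */
static unsigned char *toy_push_checked(struct toy_skb *skb, size_t n)
{
    if (toy_headroom(skb) < n)
        return NULL;
    skb->data -= n;
    skb->len += n;
    return skb->data;
}
```

A caller that gets NULL back would then have to fall back to some other way of aligning the payload, which is the "another method" mentioned above.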

Beyond the kernel BUG I would assume the rest of the failure is unrelated. There is a chance that your reboot caused some filesystem issue. Whenever you have these things happen, always try to shut down your system as gracefully as possible. Magic-sysrq 's u b' (sync, remount read-only, reboot) is your friend. But the BUG itself will not have caused any persistent issues.
Comment 6 Henrik Asp 2014-09-23 08:23:52 UTC
That was quick =)

I'm guessing skb would stand for socket kernel buffer or something like that, which would make it a memory management issue, no? I most likely had at least 5 GB free RAM (not discounting buffers) plus 16 GB unused swap. Hope that's relevant.

Yup, the rest of the failure was probably a combination of a series of lazy fuckups of mine. No worries, got it back up easily enough, and didn't lose any critical data AFAIK. It was perplexing not being able to write to disk without any errors (and the *weird* behaviours that caused), but I figured it out when I caught a glimpse of ext complaining about a full journal. Thanks for the tip, I know about the magic key; the kernel didn't answer me so I guessed it was completely down and went for a hard poweroff after a while. My bad.
Comment 7 Stanislaw Gruszka 2014-09-23 11:59:26 UTC
(In reply to Henrik Asp from comment #2)
> I am using AP mode, but I have not been able to reproduce the issue yet.

Someone else reported that updating hostapd caused a similar issue for him. Downgrading hostapd would be the workaround for the bug then.

Beyond that, I can see that you are using a Gigabyte board. There is a known memory corruption issue on boards from the *87* series which can manifest as a similar panic:
https://bugzilla.kernel.org/show_bug.cgi?id=68171

You have a different model, but you can try updating the BIOS too.
Comment 8 Henrik Asp 2014-09-23 19:10:24 UTC
I just got a proper hang on boot; even the network went down. Luckily the kernel synced, remounted and rebooted just fine, so I managed to catch this:

sep 23 18:24:03 Moerta kernel: [UFW ALLOW] IN= OUT=enp4s0 SRC=78.68.188.225 DST=195.67.199.9 LEN=64 TOS=0x00 PREC=0x00 TTL=64 ID=46570 DF PROTO=UDP SPT=11286 DPT=53 LEN=44 
sep 23 18:24:14 Moerta kernel: INFO: task gpg-agent:689 blocked for more than 120 seconds.
sep 23 18:24:14 Moerta kernel:       Not tainted 3.16.3-1-ARCH #1
sep 23 18:24:14 Moerta kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sep 23 18:24:14 Moerta kernel: gpg-agent       D 0000000000000004     0   689    688 0x00000000
sep 23 18:24:14 Moerta kernel:  ffff8800c93f3d60 0000000000000086 ffff8804112432f0 0000000000014580
sep 23 18:24:14 Moerta kernel:  ffff8800c93f3fd8 0000000000014580 ffff8804112432f0 ffffffff810136f7
sep 23 18:24:14 Moerta kernel:  ffff88042fd119c0 0000000000000001 ffff8804112438e8 ffff880417906900
sep 23 18:24:14 Moerta kernel: Call Trace:
sep 23 18:24:14 Moerta kernel:  [<ffffffff810136f7>] ? __switch_to+0x1f7/0x5e0
sep 23 18:24:14 Moerta kernel:  [<ffffffff8109cc9a>] ? finish_task_switch+0x4a/0xf0
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152cc99>] ? __schedule+0x3b9/0x980
sep 23 18:24:14 Moerta kernel:  [<ffffffff810addfe>] ? enqueue_entity+0x24e/0xaa0
sep 23 18:24:14 Moerta kernel:  [<ffffffff811b663f>] ? __mem_cgroup_commit_charge+0x9f/0x3e0
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152d289>] schedule+0x29/0x70
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152c479>] schedule_timeout+0x1b9/0x240
sep 23 18:24:14 Moerta kernel:  [<ffffffff8109fff7>] ? check_preempt_curr+0x57/0x90
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152dd3f>] wait_for_common+0xcf/0x190
sep 23 18:24:14 Moerta kernel:  [<ffffffff810a2ce0>] ? wake_up_process+0x50/0x50
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152de1d>] wait_for_completion+0x1d/0x20
sep 23 18:24:14 Moerta kernel:  [<ffffffff8108a3a9>] flush_work+0xe9/0x170
sep 23 18:24:14 Moerta kernel:  [<ffffffff81088080>] ? work_busy+0x80/0x80
sep 23 18:24:14 Moerta kernel:  [<ffffffff8115c172>] lru_add_drain_all+0x132/0x170
sep 23 18:24:14 Moerta kernel:  [<ffffffff8117f6ef>] SyS_mlock+0x2f/0x140
sep 23 18:24:14 Moerta kernel:  [<ffffffff81531129>] system_call_fastpath+0x16/0x1b
sep 23 18:24:14 Moerta kernel: INFO: task ntpd:742 blocked for more than 120 seconds.
sep 23 18:24:14 Moerta kernel:       Not tainted 3.16.3-1-ARCH #1
sep 23 18:24:14 Moerta kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sep 23 18:24:14 Moerta kernel: ntpd            D 0000000000000002     0   742      1 0x00000004
sep 23 18:24:14 Moerta kernel:  ffff880037b33e90 0000000000000086 ffff88040eeff010 0000000000014580
sep 23 18:24:14 Moerta kernel:  ffff880037b33fd8 0000000000014580 ffff88040eeff010 ffff880412c47700
sep 23 18:24:14 Moerta kernel:  ffff880037b33de8 ffffffff811775ec 00007fb8175af000 ffff880037b33e20
sep 23 18:24:14 Moerta kernel: Call Trace:
sep 23 18:24:14 Moerta kernel:  [<ffffffff811775ec>] ? tlb_flush_mmu_free+0x2c/0x50
sep 23 18:24:14 Moerta kernel:  [<ffffffff811781cd>] ? tlb_finish_mmu+0x4d/0x50
sep 23 18:24:14 Moerta kernel:  [<ffffffff81180072>] ? unmap_region+0xe2/0x130
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152d289>] schedule+0x29/0x70
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152d6f6>] schedule_preempt_disabled+0x16/0x20
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152f075>] __mutex_lock_slowpath+0xe5/0x230
sep 23 18:24:14 Moerta kernel:  [<ffffffff8152f1d7>] mutex_lock+0x17/0x30
sep 23 18:24:14 Moerta kernel:  [<ffffffff8115c072>] lru_add_drain_all+0x32/0x170
sep 23 18:24:14 Moerta kernel:  [<ffffffff8117f935>] SyS_mlockall+0xb5/0x1d0
sep 23 18:24:14 Moerta kernel:  [<ffffffff81531129>] system_call_fastpath+0x16/0x1b

Is this related, or a second issue?
Comment 9 Stanislaw Gruszka 2014-09-25 11:32:53 UTC
If this is a memory corruption issue, this can be related; otherwise not.
Comment 10 Mark Asselstine 2014-09-25 14:51:20 UTC
Agreed. I have prepared a fix for the original BUG. I will send it out to the lists (rt2x00list and possibly net-dev) once I am satisfied with it and attach it here for anyone who cares to try it out.
Comment 11 Mark Asselstine 2014-09-26 02:22:41 UTC
I am still trying hard to find a code path where the headroom would be too small, allowing this failure to occur. Unfortunately, the line just before where your picture cuts off would have had some useful information (it possibly scrolled off your screen). Can you look at the uncropped version of your picture to see if there is a line like the following:

[   16.762115] skbuff: skb_under_panic: text:ffffffff8153a84c len:424 put:24 head:ffff880000191000 data:ffff880000190ff0 tail:0x198 end:0x2c0 dev:wlan0

Have a look and post it here if you do.
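
For anyone reading such a line, the sketch below shows what can be worked out from it. It assumes the put/head/data fields have the layout of the sample line above and that data is printed after the push was applied; `struct under_panic` and the three helper functions are hypothetical names for this illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of interest from an skb_under_panic line (names are hypothetical). */
struct under_panic {
    unsigned int put;        /* bytes the caller tried to push */
    unsigned long long head; /* start of the buffer */
    unsigned long long data; /* data pointer after the push (assumed) */
};

/* Extract put/head/data; returns 0 on success, -1 if the fields are absent. */
static int parse_under_panic(const char *line, struct under_panic *out)
{
    const char *p = strstr(line, "put:");

    if (p == NULL ||
        sscanf(p, "put:%u head:%llx data:%llx",
               &out->put, &out->head, &out->data) != 3)
        return -1;
    return 0;
}

/* Bytes by which data ended up in front of head (0 if there was no underrun). */
static unsigned long long underrun_bytes(const struct under_panic *u)
{
    return u->data < u->head ? u->head - u->data : 0;
}

/* Headroom that was available before the push: put minus the underrun. */
static long long headroom_before_push(const struct under_panic *u)
{
    return (long long)u->put - (long long)underrun_bytes(u);
}
```

For the sample line above, data sits 0x10 = 16 bytes in front of head after a 24-byte push, so under these assumptions only 8 bytes of headroom were available when the push was attempted.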
Comment 12 Henrik Asp 2014-09-27 19:56:56 UTC
No lines missed; there was plenty of empty space left on my screen below where I cropped the picture, so it wouldn't have scrolled off. I also only cropped off the frame of the monitor on top. Gonna double-check... Yup, nothing missing. Sorry.
Comment 13 Mark Asselstine 2014-09-29 03:01:50 UTC
Thanks for checking. As it turns out I believe I have found a code execution path that can lead to all the skb_headroom being used. It will make the patch a bit trickier to put together but I have some ideas and with Stanislaw now the official rt2x00 maintainer, he will have the opportunity to vet my work and make sure it makes sense :).
Comment 14 Stanislaw Gruszka 2014-11-02 11:01:14 UTC
For the record:
http://marc.info/?t=141213514600002&r=1&w=2

In short: we don't know why the skb_under_panic happens.

I'll post the l2pad alignment change (it fixes performance on embedded CPUs), but it is not clear whether it fixes the panic problem. We do not have proof that it does (either a logical explanation or a test confirming it fixes the issue).
Comment 15 Henrik Asp 2014-11-02 12:19:31 UTC
Created attachment 156231 [details]
Picture of second Oops
Comment 16 Henrik Asp 2014-11-02 12:20:02 UTC
Oh crap, I totally forgot. This panic happened again about a week ago. I attached an image.

Which kernel do you guess this patch will land in? I guess I can report back if it happens again after that.
Comment 17 Stanislaw Gruszka 2014-11-02 13:10:09 UTC
The patch will be applied to 3.19 (or to 3.18 if it gets into the fixes stream).

I posted the patch here:
http://marc.info/?l=linux-wireless&m=141493211022464&w=2

Could you apply it and recompile the kernel? If you don't want to recompile the full kernel, you can test the patch using backports: https://backports.wiki.kernel.org/index.php/Main_Page
Comment 18 Stanislaw Gruszka 2014-11-06 16:07:53 UTC
On bug 72471 a user reported that the patch did not fix the problem. I'll continue to debug the issue there. Please close this bug as a duplicate (I do not have permission to do that).
Comment 19 Alan 2014-12-10 19:28:24 UTC
Stanislaw - if you email the bugzilla maintainer address and ask them to look you up in maintainers and give you super powers, you'll be able to close any bug.

*** This bug has been marked as a duplicate of bug 72491 ***