Bug 84911
Summary: | OOPS: kernel BUG at net/core/skbuff.c:99! Bricked my system. | ||
---|---|---|---|
Product: | Networking | Reporter: | Henrik Asp (solenskiner) |
Component: | Wireless | Assignee: | networking_wireless (networking_wireless) |
Status: | RESOLVED DUPLICATE | ||
Severity: | high | CC: | alan, asselsm, solenskiner, stf_xl |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 3.16.2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
Picture of the Oops.
possibly relevant logs
possibly relevant logs
Picture of second Oops |
Is this issue reproducible for you? Are you using AP mode?

I just got the machine back up again today. I'm not sure exactly what the problem was, but some shotgun debugging worked ;) I am using AP mode, but I have not been able to reproduce the issue yet. I found a bunch of related warnings in my logs, which I will attach in case they might be useful.

Created attachment 151331 [details]
possibly relevant logs
Created attachment 151341 [details]
possibly relevant logs
I am starting to get an idea as to what might be going on here and might have a patch for this shortly. Basically, the driver is attempting to align skb.data to a 4-byte boundary; however, the skb lacks the necessary headroom, so skb_push bails accordingly. So we need to check for the necessary headroom before extending skb.data to make it 4-byte aligned, and if we lack the necessary headroom, use another method to set up the byte alignment and avoid the BUG.

Beyond the kernel BUG, I would assume the rest of the failure is unrelated. There is a chance that your reboot caused some filesystem issue. Whenever these things happen, always try to shut down your system as gracefully as possible. Magic-sysrq 's u b' (sync, unmount, reboot) is your friend. But the BUG itself will not have caused any persistent issues.

That was quick =) I'm guessing skb would stand for socket kernel buffer or something like that, which would make it a memory management issue, no? I most likely had at least 5gb free ram (not discounting buffers) plus 16gb unused swap. Hope that's relevant.

Yup, the rest of the failure was probably a combination of a series of lazy fuckups of mine. No worries, got it back up easily enough, and didn't lose any critical data AFAIK. It was perplexing not being able to write to disc without any errors (and the *weird* behaviours that caused), but I figured it out when I caught a glimpse of ext complaining about a full journal.

Thanks for the tip, I know about the magic key; the kernel didn't answer me so I guessed it was completely down and went for a hard poweroff after a while. My bad.

(In reply to Henrik Asp from comment #2)
> I am using AP mode, but I have not been able to reproduce the issue yet.

Someone else reported that updating hostapd caused a similar issue for him. Downgrading hostapd would be the workaround for the bug then. Beyond that, I can see that you are using a Gigabyte board.
There is a known memory corruption issue on boards from the *87* series which can manifest as a similar panic: https://bugzilla.kernel.org/show_bug.cgi?id=68171 You have a different model, but you can try updating the BIOS too.

I just got a proper hang on boot, even the network went down. Luckily the kernel synced, remounted and rebooted just fine, so I managed to catch this:

sep 23 18:24:03 Moerta kernel: [UFW ALLOW] IN= OUT=enp4s0 SRC=78.68.188.225 DST=195.67.199.9 LEN=64 TOS=0x00 PREC=0x00 TTL=64 ID=46570 DF PROTO=UDP SPT=11286 DPT=53 LEN=44
sep 23 18:24:14 Moerta kernel: INFO: task gpg-agent:689 blocked for more than 120 seconds.
sep 23 18:24:14 Moerta kernel: Not tainted 3.16.3-1-ARCH #1
sep 23 18:24:14 Moerta kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sep 23 18:24:14 Moerta kernel: gpg-agent D 0000000000000004 0 689 688 0x00000000
sep 23 18:24:14 Moerta kernel: ffff8800c93f3d60 0000000000000086 ffff8804112432f0 0000000000014580
sep 23 18:24:14 Moerta kernel: ffff8800c93f3fd8 0000000000014580 ffff8804112432f0 ffffffff810136f7
sep 23 18:24:14 Moerta kernel: ffff88042fd119c0 0000000000000001 ffff8804112438e8 ffff880417906900
sep 23 18:24:14 Moerta kernel: Call Trace:
sep 23 18:24:14 Moerta kernel: [<ffffffff810136f7>] ? __switch_to+0x1f7/0x5e0
sep 23 18:24:14 Moerta kernel: [<ffffffff8109cc9a>] ? finish_task_switch+0x4a/0xf0
sep 23 18:24:14 Moerta kernel: [<ffffffff8152cc99>] ? __schedule+0x3b9/0x980
sep 23 18:24:14 Moerta kernel: [<ffffffff810addfe>] ? enqueue_entity+0x24e/0xaa0
sep 23 18:24:14 Moerta kernel: [<ffffffff811b663f>] ? __mem_cgroup_commit_charge+0x9f/0x3e0
sep 23 18:24:14 Moerta kernel: [<ffffffff8152d289>] schedule+0x29/0x70
sep 23 18:24:14 Moerta kernel: [<ffffffff8152c479>] schedule_timeout+0x1b9/0x240
sep 23 18:24:14 Moerta kernel: [<ffffffff8109fff7>] ? check_preempt_curr+0x57/0x90
sep 23 18:24:14 Moerta kernel: [<ffffffff8152dd3f>] wait_for_common+0xcf/0x190
sep 23 18:24:14 Moerta kernel: [<ffffffff810a2ce0>] ? wake_up_process+0x50/0x50
sep 23 18:24:14 Moerta kernel: [<ffffffff8152de1d>] wait_for_completion+0x1d/0x20
sep 23 18:24:14 Moerta kernel: [<ffffffff8108a3a9>] flush_work+0xe9/0x170
sep 23 18:24:14 Moerta kernel: [<ffffffff81088080>] ? work_busy+0x80/0x80
sep 23 18:24:14 Moerta kernel: [<ffffffff8115c172>] lru_add_drain_all+0x132/0x170
sep 23 18:24:14 Moerta kernel: [<ffffffff8117f6ef>] SyS_mlock+0x2f/0x140
sep 23 18:24:14 Moerta kernel: [<ffffffff81531129>] system_call_fastpath+0x16/0x1b
sep 23 18:24:14 Moerta kernel: INFO: task ntpd:742 blocked for more than 120 seconds.
sep 23 18:24:14 Moerta kernel: Not tainted 3.16.3-1-ARCH #1
sep 23 18:24:14 Moerta kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sep 23 18:24:14 Moerta kernel: ntpd D 0000000000000002 0 742 1 0x00000004
sep 23 18:24:14 Moerta kernel: ffff880037b33e90 0000000000000086 ffff88040eeff010 0000000000014580
sep 23 18:24:14 Moerta kernel: ffff880037b33fd8 0000000000014580 ffff88040eeff010 ffff880412c47700
sep 23 18:24:14 Moerta kernel: ffff880037b33de8 ffffffff811775ec 00007fb8175af000 ffff880037b33e20
sep 23 18:24:14 Moerta kernel: Call Trace:
sep 23 18:24:14 Moerta kernel: [<ffffffff811775ec>] ? tlb_flush_mmu_free+0x2c/0x50
sep 23 18:24:14 Moerta kernel: [<ffffffff811781cd>] ? tlb_finish_mmu+0x4d/0x50
sep 23 18:24:14 Moerta kernel: [<ffffffff81180072>] ? unmap_region+0xe2/0x130
sep 23 18:24:14 Moerta kernel: [<ffffffff8152d289>] schedule+0x29/0x70
sep 23 18:24:14 Moerta kernel: [<ffffffff8152d6f6>] schedule_preempt_disabled+0x16/0x20
sep 23 18:24:14 Moerta kernel: [<ffffffff8152f075>] __mutex_lock_slowpath+0xe5/0x230
sep 23 18:24:14 Moerta kernel: [<ffffffff8152f1d7>] mutex_lock+0x17/0x30
sep 23 18:24:14 Moerta kernel: [<ffffffff8115c072>] lru_add_drain_all+0x32/0x170
sep 23 18:24:14 Moerta kernel: [<ffffffff8117f935>] SyS_mlockall+0xb5/0x1d0
sep 23 18:24:14 Moerta kernel: [<ffffffff81531129>] system_call_fastpath+0x16/0x1b

This related or a second issue?
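
For readers following the headroom discussion earlier in this thread, the proposed check (confirm there is enough headroom before extending skb.data for 4-byte alignment, otherwise use another method) can be illustrated with a small userspace model. This is only a sketch, not the rt2x00 patch: SkbModel, align_payload and the "fallback" outcome are hypothetical names invented for illustration, and the model tracks nothing but the head and data addresses.

```python
from dataclasses import dataclass


@dataclass
class SkbModel:
    """Toy stand-in for the head/data bookkeeping of a kernel sk_buff."""
    head: int  # address where the allocated buffer starts
    data: int  # address where the packet payload starts

    def headroom(self) -> int:
        # Free bytes between the start of the buffer and the payload.
        return self.data - self.head

    def push(self, nbytes: int) -> None:
        # Same rule that skb_push() enforces: data may never drop below head.
        if nbytes > self.headroom():
            raise OverflowError("push would move data below head")
        self.data -= nbytes


def align_payload(skb: SkbModel, align: int = 4) -> str:
    """Hypothetical helper: align data only if the headroom allows it."""
    pad = skb.data % align  # bytes data must move down to become aligned
    if pad == 0:
        return "already aligned"
    if pad <= skb.headroom():
        skb.push(pad)
        return "pushed"
    # Not enough headroom: the caller must align some other way
    # instead of pushing and tripping the BUG.
    return "fallback"
```

With head=0x1000 and data=0x1002 the helper pushes two bytes; with head=0x1001 and data=0x1002 only one byte of headroom is available, so it reports "fallback" and leaves the buffer untouched.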
If this is a memory corruption issue, this can be related; otherwise not.

Agreed. I have prepared a fix for the original BUG. I will send it out to the lists (rt2x00list and possibly net-dev) once I am satisfied with it and attach it here for anyone who cares to try it out. I am still trying hard to find a code path where the headroom would not be set large enough to allow this failure to occur. Unfortunately, the line that would be just before where your picture cuts off would have had some useful information (possibly scrolled off your screen). Can you look at the uncropped version of your picture to see if there is a line like the following:

[ 16.762115] skbuff: skb_under_panic: text:ffffffff8153a84c len:424 put:24 head:ffff880000191000 data:ffff880000190ff0 tail:0x198 end:0x2c0 dev:wlan0

Have a look and post it here if you do.

No lines missed; there was plenty of empty space left on my screen below where I cropped the picture, so it wouldn't have scrolled off. I also only cropped off the frame of the monitor on top. Gonna doublecheck... Yup, nothing missing. Sorry.

Thanks for checking. As it turns out, I believe I have found a code execution path that can lead to all of the skb_headroom being used. It will make the patch a bit trickier to put together, but I have some ideas, and with Stanislaw now the official rt2x00 maintainer, he will have the opportunity to vet my work and make sure it makes sense :).

For the record: http://marc.info/?t=141213514600002&r=1&w=2 In short: we don't know why the skb under panic happens. I'll post the l2pad alignment change (it fixes performance on embedded CPUs), but it is not clear whether it fixes the panic problem. We do not have proof for that fix (either a logical explanation or a test confirming it fixes the issue).

Created attachment 156231 [details]
Picture of second Oops
Oh crap, I totally forgot. This panic happened again like a week ago. I attached an image. Which kernel do you guess this patch will land in? I guess I can report back if it happens again after that.

The patch will be applied in 3.19 (or in 3.18 if it gets into the fixes stream). I posted the patch here: http://marc.info/?l=linux-wireless&m=141493211022464&w=2 Could you apply it and recompile the kernel? If you don't want to recompile the full kernel, you can test the patch using backports: https://backports.wiki.kernel.org/index.php/Main_Page

On bug 72471 a user reported that the patch did not fix the problem. I'll continue to debug the issue there. Please close this bug as a duplicate (I do not have permissions to do that).

Stanislaw - if you ask the bugzilla maintainer email address to look you up in maintainers and give you super powers, you'll be able to close any bug

*** This bug has been marked as a duplicate of bug 72491 *** |
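
The example skb_under_panic line quoted in the discussion can be decoded to see how far the push overran the buffer. The sketch below assumes that the data value in the message is printed after the push has already been applied, which is how skb_push is understood to behave here; decode_panic and the field names it returns are hypothetical. The line it parses is the developer's example, not output from the reporter's machine.

```python
import re

# Matches the fields needed from an skb_under_panic / skb_over_panic line.
PANIC_RE = re.compile(
    r"skb_(?P<kind>under|over)_panic: .*?"
    r"len:(?P<len>\d+) put:(?P<put>\d+) "
    r"head:(?P<head>[0-9a-f]+) data:(?P<data>[0-9a-f]+)"
)

# Example line quoted in the discussion (not from the reporter's system).
EXAMPLE = (
    "[ 16.762115] skbuff: skb_under_panic: text:ffffffff8153a84c len:424 "
    "put:24 head:ffff880000191000 data:ffff880000190ff0 tail:0x198 "
    "end:0x2c0 dev:wlan0"
)


def decode_panic(line: str) -> dict:
    """Hypothetical helper: extract the headroom arithmetic from a panic line."""
    m = PANIC_RE.search(line)
    if m is None:
        raise ValueError("not an skb panic line")
    head = int(m["head"], 16)
    data = int(m["data"], 16)
    put = int(m["put"])
    return {
        "kind": m["kind"],
        "put": put,
        # Bytes by which data ended up below head after the push.
        "underrun": head - data,
        # Headroom that was available before the push (assumes data is post-push).
        "headroom_before": data + put - head,
    }
```

For the quoted line, head minus data is 0x10, so a 24-byte push overran the buffer start by 16 bytes, which under the stated assumption means only 8 bytes of headroom were available.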
Created attachment 151131 [details]
Picture of the Oops.

I was not doing anything special at all, just checking facebook. Out of the blue the kernel oopsied; I took a few photos of it and rebooted. The screen just went into power-saving mode after showing me some output from systemd. I tried again, this time appending "1" to the grub kernel command line, _guessing_ it was something with the graphics driver or xorg, which I recently changed from catalyst to radeonSI, but this time it wouldn't even show me any output from systemd before putting the screen in power-saving mode. Windows still works.

Dist: Archlinux
Kernel: version 3.16.2
Filesystems: ext4
Systemd: version 216-3
Xorg: 1.16
Driver: xf86-video-ati version 1:7.4.0-3
Mesa: 10.2.8-1
DRI: ati-dri version 10.2.8-1
Firefox: firefox-kde-opensuse version 32.0.1-1 from Arch User Repository
CPU: AMD FX-8150 8C 3.6G 16M AM3+ 125W
Motherboard: MK Gigabyte GA-990FXA-UD7 ATX, 990FX, AM3+
Memory: Corsair 16GB (2x8192MB) CL9 VENGEANCE
PSU: be quiet! Pure Power L7 730W
Disks: Plextor SSD PX-128M3-03 128GB
HD3TB Seagate SATA 6Gb/s 64MB ST3000DM001
HD3TB Seagate SATA 6Gb/s 64MB ST3000DM001
Wifi: Dynamode WL-700N-RXS 150Mbps Nano 802.11n Wireless USB Adapter Dongle, with chipset Ralink 5370

Attached is my best try to restitch and clean up multiple pictures of the kernel oops message. I did screw up some of the photos, though I hope the relevant information is still readable. If there is anything else I can provide, don't hesitate to contact me at solenskiner at gmail dot com. I am not subscribed to any lists or so.