Bug 72471 - rt2800 oops on skb_push, apparently underflows the skb area
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: network-wireless
Hardware: All Linux
Importance: P1 high
Assignee: drivers_network-wireless@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-03-19 11:29 UTC by Antti S. Lankila
Modified: 2014-12-12 14:50 UTC

See Also:
Kernel Version: 3.14-rc6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Digital camera shot of the oops (1.09 MB, image/jpeg)
2014-03-19 11:29 UTC, Antti S. Lankila
skb_trace.patch (6.15 KB, text/plain)
2014-11-06 16:04 UTC, Stanislaw Gruszka

Description Antti S. Lankila 2014-03-19 11:29:05 UTC
Created attachment 130031
Digital camera shot of the oops

I have an RT2800-based 5 GHz wireless card that serves as an access point with hostapd. The hardware has served me well for years, until a new hostapd was uploaded into the Ubuntu repositories on 6 March 2014. From that day onwards, I have been unable to use rt2800pci for more than about an hour at a time before the system freezes in an oops.

I found Tuomas Räsänen complaining about a crash that smells like the same issue to me: http://permalink.gmane.org/gmane.linux.drivers.rt2x00.user/2460

This bug is a duplicate of Ubuntu bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1289378
Comment 1 Stanislaw Gruszka 2014-03-20 14:10:28 UTC
I asked Tuomas to do some testing but never got an answer:
http://rt2x00.serialmonkey.com/pipermail/users_rt2x00.serialmonkey.com/2014-January/006547.html

It would also be interesting to know which hostap change started to cause this skb headroom underestimation. I'll try to update hostap on my local machine and find out ...
Comment 2 Stanislaw Gruszka 2014-10-01 09:45:53 UTC
I'm not able to reproduce this problem. Could you provide your .config? Also, since you are able to reproduce it, perhaps you can configure kdump and provide a memory dump when the issue happens?
Comment 3 Antti S. Lankila 2014-11-02 12:01:50 UTC
I found a convenient moment to try this with 3.16.0-24-generic kernel from Ubuntu utopic. I can still reproduce the problem. The kernel config is whatever that distribution ships -- I can attach the file if it's useful.

I tried installing and configuring the kdump utilities, but I did not get anything in /var/crash. The system seemed to simply freeze as before. (The system is currently headless.) As far as I understand it, kdump should launch another kernel when an oops happens, mount the filesystem, and write stuff into /var/crash, like the kdump system core or something. But it didn't work out, and I don't currently know why.
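
A minimal pre-flight sketch for the kdump setup described above, not a definitive diagnosis. It assumes the usual Linux interfaces /proc/cmdline and /sys/kernel/kexec_crash_loaded; the function name kdump_preflight is hypothetical, and both paths are parameters only so the check can be tried against sample files. If either check fails, no capture kernel can boot after a panic, which may look like a plain freeze.

```shell
# Hypothetical helper: check the two basic preconditions for kdump.
# $1: kernel command line file (default /proc/cmdline)
# $2: crash-kernel-loaded flag file (default /sys/kernel/kexec_crash_loaded)
kdump_preflight() {
    cmdline="${1:-/proc/cmdline}"
    loaded="${2:-/sys/kernel/kexec_crash_loaded}"
    status=0

    # Memory for the capture kernel is reserved via crashkernel= at boot.
    if grep -q 'crashkernel=' "$cmdline" 2>/dev/null; then
        echo "crashkernel= present on the kernel command line"
    else
        echo "crashkernel= NOT on the kernel command line"
        status=1
    fi

    # The flag file reads 1 once a capture kernel has been loaded.
    if [ "$(cat "$loaded" 2>/dev/null)" = "1" ]; then
        echo "capture kernel loaded"
    else
        echo "capture kernel NOT loaded"
        status=1
    fi

    return "$status"
}
```

Run it as "kdump_preflight" with no arguments on the affected machine; a non-zero exit status means at least one precondition is missing.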
Comment 4 Stanislaw Gruszka 2014-11-02 13:17:08 UTC
Could you apply the patch below and check if it fixes the issue?

http://marc.info/?l=linux-wireless&m=141493211022464&w=2

Regarding kdump, does "echo c > /proc/sysrq-trigger" generate a crash image?
Comment 5 Antti S. Lankila 2014-11-02 15:32:59 UTC
Hmm. I got what appeared to be an immediate system freeze and no dump. I guess that the kdump configuration is not good enough to mount the root filesystem and write the dump there, or something. I suppose I'll have to set up a console to have any hope of debugging this.

I'll have to check this a bit later, I need this machine running most of the time.
Comment 6 Stanislaw Gruszka 2014-11-02 15:48:09 UTC
Please test the patch if you can; there is some hope that it fixes the panic. To avoid compiling a custom kernel, you can use the backports project: https://backports.wiki.kernel.org/index.php/Main_Page
Comment 7 Antti S. Lankila 2014-11-03 21:55:29 UTC
I put the patch into the source tree and rebuilt the 3.16.0-24 kernel, but I don't think it fixed the issue. The system crashed after some 30 minutes of use, as usual.
Comment 8 Stanislaw Gruszka 2014-11-04 11:48:45 UTC
OK, please do not continue the kdump effort for now. I'll prepare a debug patch which uses ftrace to try to find out why we do not get skbs with sufficient headroom.
Comment 9 Stanislaw Gruszka 2014-11-06 16:04:24 UTC
Created attachment 156921
skb_trace.patch

Please apply this patch. It traces skbs in mac80211 and rt2x00. The kernel should be compiled with ftrace support. I'm not sure if that is enabled on Debian; it should be enough to have these options enabled:

CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
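
As a rough sketch, the options listed above can be checked against a kernel config file. This assumes the distribution installs the running kernel's config as /boot/config-$(uname -r), as Debian and Ubuntu do; the function name check_tracing_config is hypothetical.

```shell
# Hypothetical helper: report which of the four options are set to "y"
# in the kernel config file given as $1. Returns non-zero if any is missing.
check_tracing_config() {
    cfg="$1"
    missing=0
    for opt in CONFIG_TRACING CONFIG_GENERIC_TRACER \
               CONFIG_TRACING_SUPPORT CONFIG_FTRACE; do
        # Anchor on "=y" so CONFIG_TRACING does not match CONFIG_TRACING_SUPPORT.
        if grep -q "^${opt}=y" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
            missing=1
        fi
    done
    return "$missing"
}
```

For the running kernel: check_tracing_config "/boot/config-$(uname -r)"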

To gather the trace output, do:

mount -t debugfs debugfs /sys/kernel/debug/
cd /sys/kernel/debug/tracing/
echo 1 > tracing_on
cat trace_pipe > ~/skb_trace.txt

When the problem happens, the kernel will not panic; instead, I make it sleep in skb_panic. This can cause trouble if skb_push was called in atomic context, but the kernel should be able to finish tracing and save the data in skb_trace.txt.

Note that this can gather a lot of data; you can restart tracing if skb_trace.txt becomes too big.
Comment 10 Antti S. Lankila 2014-11-08 10:29:08 UTC
Just a quick confirmation -- skb tracing doesn't need to be enabled from /sys/kernel/debug/events/skb/enable?
Comment 11 Antti S. Lankila 2014-11-08 13:37:48 UTC
I have enabled skb tracing as a working assumption that this is required. I've now been running about 3 hours without encountering a crash. We may have a heisenbug.
Comment 12 Stanislaw Gruszka 2014-11-08 17:36:59 UTC
(In reply to Antti S. Lankila from comment #10)
> /sys/kernel/debug/events/skb/enable?

I didn't plan to enable that, but it will definitely provide more information, which could be crucial to see why the panic happens.

I think the issue is not easy to reproduce, but IMHO tracing should not affect how reproducible it is, unless we have some strange race condition.
Comment 13 Stanislaw Gruszka 2014-11-08 17:41:49 UTC
Did you also apply the patch from comment 4? If so, perhaps the panic will be easier to reproduce without it.
Comment 14 Antti S. Lankila 2014-11-09 09:21:37 UTC
The patch from comment 4 has not been applied. What I did was, I took the Ubuntu 14.10 source package "linux", and did "patch -p1 <comment9.patch" at its root, then the usual "fakeroot debian/rules binary-arch", which spewed out the kernel image deb. I then installed that and rebooted into that kernel and enabled the skb events and left it logging those events. Before going to bed last night, I read your comment where you implied that it's not necessary to enable skb events, so I disabled the skb events for the night. However, there have not been any crashes so far.

Not enabling any events worries me. Right now, the trace_pipe has not written a single event over the course of the whole night. When I had the skb events logging, I got some 1 GB of trace log over about 12 hours. Even if it actually crashed now, are you sure that there would be any trace log to show for it?

My best guess at this point in time is that enabling the tracing via the tracing_on file is sufficient to change some timings in the kernel so that the crash can't happen. Could that be true? Are the tracepoints activated all across the kernel when tracing_on is set and merely filtered out, or are the tracing points selectively patched only into subsystems based on the enabled events? I guess if the crash doesn't occur soon, I'll disable the tracing to see if that is sufficient to restore the crashing behavior.
Comment 15 Antti S. Lankila 2014-11-09 09:46:35 UTC
Okay, I read a bit of the source. I am mixing up the various tracing facilities in my head. We aren't actually using ftrace here, the trace_printk is a completely separate facility and I get that giant warning in my dmesg about using it at boot, so that checks out.

As I think I see it, the trace_printk's are always executed, but they simply don't print anything unless they are enabled. I can see that trace_printk is enabled in options and tracing is on, but I can't see anything written out. I am currently trying to understand why not.
Comment 16 Antti S. Lankila 2014-11-09 09:57:19 UTC
Okay, sorry for the comment spam, figured it out. Turns out that the wireless modules are all split into some linux-image-extra package which I had not installed. So I've been running with my self-compiled kernel image without the new rt2x00lib which now contains the calls to trace_printk according to grep. Let's hope it crashes this time, and I'll leave the skb tracing off.
Comment 17 Antti S. Lankila 2014-11-09 16:21:46 UTC
The crash occurred just now. Due to the size of the file, I can't attach it here. I'm including the full dump from the point of time I started tracing.

https://bacchus.bel.fi/skb-trace.txt.gz

My guess is that only the last few dozen kB are relevant, but here's 50 MB of compressed log anyway...
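
A small sketch for pulling out just the tail of such a compressed log, assuming standard gzip and tail. The function name trace_tail and the 64 kB figure in the usage line are illustrative, not taken from the report.

```shell
# Hypothetical helper: print the last $2 bytes of the gzip-compressed file $1.
# The file is decompressed as a stream, so the full log is never written out.
trace_tail() {
    gzip -dc "$1" | tail -c "$2"
}
```

For example: trace_tail skb-trace.txt.gz 65536 > skb-trace-tail.txt. The first line of the result may be cut off, since the byte count does not respect line boundaries.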
Comment 18 Antti S. Lankila 2014-11-09 17:20:58 UTC
Christ. I just realized something horrible. I did not install the -extra package when I was supposedly testing the comment 4's patch. I will build a kernel with comment 4's patch included, and see what happens.
Comment 19 Antti S. Lankila 2014-11-11 08:57:14 UTC
Okay. I've been running this patch from comment 4 + comment 9 for a few days. No crashes. I logged for the first few days but I don't think it will crash anymore, so I stopped doing that now.
Comment 20 Stanislaw Gruszka 2014-12-12 14:50:22 UTC
Fixed by:

commit cfd9167af14eb4ec21517a32911d460083ee3d59
Author: Stanislaw Gruszka <sgruszka@redhat.com>
Date:   Tue Nov 11 14:28:47 2014 +0100

    rt2x00: do not align payload on modern H/W
