Created attachment 130031
Digital camera shot of the oops
I have an RT2800-based 5 GHz wireless card that serves as an access point with hostapd. The hardware served me well for years, until a new hostapd was uploaded to the Ubuntu repositories on 6.3.2014. Since that day, I have been unable to use rt2800pci for more than about an hour at a time before the system freezes with an oops.
I found Tuomas Räsänen complaining about a crash that smells like the same issue to me: http://permalink.gmane.org/gmane.linux.drivers.rt2x00.user/2460
This bug is a duplicate of Ubuntu bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1289378
I asked Tuomas to do some testing but never got an answer:
It would also be interesting to know which hostap change started to cause this skb headroom underestimation. I'll try to update hostap on my local machine and find out ...
I'm not able to reproduce this problem. Could you provide your .config? Also, since you are able to reproduce it, perhaps you can configure kdump and provide a memory dump when the issue happens?
I found a convenient moment to try this with 3.16.0-24-generic kernel from Ubuntu utopic. I can still reproduce the problem. The kernel config is whatever that distribution ships -- I can attach the file if it's useful.
I tried installing and configuring the kdump utilities, but I did not get anything in /var/crash. The system seemed to simply freeze as before. (The system is currently headless.) As far as I understand it, kdump should launch another kernel when an oops happens, mount the filesystem, and write stuff into /var/crash, like the kdump system core or something. But it didn't work out, and I don't currently know why.
Could you apply the patch below and check whether it fixes the issue?
Regarding kdump, does "echo c > /proc/sysrq-trigger" generate a crash image?
Hmm. I got what appeared to be an immediate system freeze and no dump. I guess that the kdump configuration is not good enough to mount the root filesystem and write the dump there, or something. I suppose I'll have to set up a console to have any hope of debugging this.
I'll have to check this a bit later, I need this machine running most of the time.
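Before triggering a test crash through sysrq, it may help to confirm that a capture kernel is loaded at all. The following is a minimal sketch, not part of the original thread: the function name kdump_ready is hypothetical, and the two path arguments exist only so the check can be run against ordinary files. It assumes the usual /sys/kernel/kexec_crash_loaded and /proc/cmdline interfaces.

```shell
# Sketch: report whether a kdump capture kernel appears to be loaded.
# kdump_ready is a hypothetical helper name; the optional arguments
# override the default paths so the logic can be tried on plain files.
kdump_ready() {
    loaded_file=${1:-/sys/kernel/kexec_crash_loaded}
    cmdline_file=${2:-/proc/cmdline}

    # Memory for the capture kernel is reserved at boot via crashkernel=...
    if ! grep -q 'crashkernel=' "$cmdline_file" 2>/dev/null; then
        echo "no crashkernel= reservation on the kernel command line"
        return 1
    fi

    # kexec_crash_loaded reads 1 once a capture kernel has been loaded.
    if [ "$(cat "$loaded_file" 2>/dev/null)" != "1" ]; then
        echo "crashkernel= is set, but no crash kernel is loaded"
        return 1
    fi

    echo "crash kernel loaded"
}
```

If this reports a problem, a freeze with no dump after "echo c > /proc/sysrq-trigger" is the expected outcome rather than a sign of a deeper fault. It does not check whether the capture kernel can mount the root filesystem.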
Please test the patch if you can; there is some hope that it fixes the panic. To avoid compiling a custom kernel, you can use the backports project: https://backports.wiki.kernel.org/index.php/Main_Page
I put the patch into the source tree and rebuilt the 3.16.0-24 kernel, but I don't think it fixed the issue. The system crashed after some 30 minutes of use, as usual.
OK, please do not continue the kdump effort for now. I'll prepare a debug patch which uses ftrace to try to find out why we do not get skbs with sufficient headroom.
Created attachment 156921
Please apply this patch. It traces skbs in mac80211 and rt2x00. The kernel should be compiled with ftrace support. I'm not sure whether that is enabled on Debian; it should be enough to have these options enabled:
To gather the trace output, do:
mount -t debugfs debugfs /sys/kernel/debug/
cd /sys/kernel/debug/tracing
echo 1 > tracing_on
cat trace_pipe > ~/skb_trace.txt
When the problem happens, the kernel will not panic; instead I make it sleep in skb_panic. This can cause trouble if skb_push was called in atomic context, but the kernel should be able to finish tracing and save the data in skb_trace.txt.
Note that this can gather a lot of data; you can restart tracing if skb_trace.txt becomes too big.
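As an alternative to restarting the capture by hand when the file grows too large, the stream can be cut into fixed-size chunks so that older chunks can be deleted while logging continues. This is only a sketch under assumptions: capture_chunks is a hypothetical helper, and it relies on nothing more than the standard split utility.

```shell
# Sketch: copy a trace stream into chunk files of at most max_lines lines.
# Chunks are named <prefix>aa, <prefix>ab, ... (split's default suffixes).
# capture_chunks is a hypothetical helper, not part of the debug patch.
capture_chunks() {
    src=$1        # e.g. /sys/kernel/debug/tracing/trace_pipe
    prefix=$2     # e.g. "$HOME/skb_trace."
    max_lines=$3  # lines per chunk; choose to suit the available disk space

    split -l "$max_lines" "$src" "$prefix"
}
```

For example, capture_chunks /sys/kernel/debug/tracing/trace_pipe "$HOME/skb_trace." 1000000 would run until interrupted. The events leading up to the problem would then be in the last one or two chunks. With two-letter suffixes the number of chunks may be limited, so a generous line count is the safer choice.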
Just a quick confirmation -- skb tracing doesn't need to be enabled from /sys/kernel/debug/tracing/events/skb/enable?
I have enabled skb tracing on the working assumption that this is required. I've now been running for about 3 hours without encountering a crash. We may have a heisenbug.
(In reply to Antti S. Lankila from comment #10)
I didn't intend for that to be enabled, but it will definitely provide more information, which could be crucial for seeing why the panic happens.
I think the issue is not easy to reproduce, but IMHO tracing should not affect how reproducible it is, unless we have some strange race condition.
Did you also apply the patch from comment 4? If so, perhaps the panic will be easier to reproduce without it.
The patch from comment 4 has not been applied. I took the Ubuntu 14.10 source package "linux", ran "patch -p1 <comment9.patch" at its root, and then the usual "fakeroot debian/rules binary-arch", which produced the kernel image deb. I installed that, rebooted into that kernel, enabled the skb events, and left it logging those events. Before going to bed last night, I read your comment where you implied that it's not necessary to enable skb events, so I disabled the skb events for the night. However, there have not been any crashes so far.
Not enabling any events worries me. Right now, the trace_pipe has not written a single event over the course of the whole night. When I had the skb events logging, I got some 1 GB of trace log over about 12 hours. Even if it actually crashed now, are you sure that there would be any trace log to show for it?
My best guess at this point is that enabling tracing via the tracing_on file is sufficient to change some timings in the kernel so that the crash can't happen. Could that be true? Are the tracepoints activated all across the kernel when tracing_on is set and merely filtered out, or are the tracepoints selectively patched only into subsystems based on the enabled events? If the crash doesn't occur soon, I'll disable tracing to see whether that is sufficient to restore the crashing behavior.
Okay, I read a bit of the source. I am mixing up the various tracing facilities in my head. We aren't actually using ftrace here, the trace_printk is a completely separate facility and I get that giant warning in my dmesg about using it at boot, so that checks out.
As I think I see it, the trace_printk's are always executed, but they simply don't print anything unless they are enabled. I can see that trace_printk is enabled in options and tracing is on, but I can't see anything written out. I am currently trying to understand why not.
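The three settings discussed in these comments (tracing_on, the trace_printk option, and the skb events) can be read back from the tracing directory in one go. The following is a rough sketch; tracing_state is a hypothetical helper, and the directory argument exists only so the function can be pointed at a test tree instead of /sys/kernel/debug/tracing.

```shell
# Sketch: print the tracing settings discussed above.
# tracing_state is a hypothetical helper; pass a directory to override
# the default tracing location.
tracing_state() {
    dir=${1:-/sys/kernel/debug/tracing}

    printf 'tracing_on=%s\n' "$(cat "$dir/tracing_on")"

    # trace_options lists one option per line; a disabled option carries
    # a "no" prefix (for example "notrace_printk").
    if grep -qx 'trace_printk' "$dir/trace_options"; then
        echo 'trace_printk=on'
    else
        echo 'trace_printk=off'
    fi

    # Event tracing is separate from trace_printk; 0 means skb events are off.
    printf 'skb_events=%s\n' "$(cat "$dir/events/skb/enable")"
}
```

If tracing_on is 1 and trace_printk is on but trace_pipe stays empty, the settings themselves are probably not the cause, and it is worth checking whether the code containing the trace_printk calls is actually the code that is loaded.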
Okay, sorry for the comment spam, figured it out. Turns out that the wireless modules are all split into some linux-image-extra package which I had not installed. So I've been running with my self-compiled kernel image without the new rt2x00lib which now contains the calls to trace_printk according to grep. Let's hope it crashes this time, and I'll leave the skb tracing off.
The crash occurred just now. Due to the size of the file, I can't attach it here. I'm including the full dump from the point of time I started tracing.
My guess is that only the last few dozen kB are relevant, but here's 50 MB of compressed log anyway...
Christ. I just realized something horrible. I did not install the -extra package when I was supposedly testing the patch from comment 4. I will build a kernel with the patch from comment 4 included, and see what happens.
Okay. I've been running the patches from comment 4 + comment 9 for a few days. No crashes. I logged for the first few days, but I don't think it will crash anymore, so I have stopped logging.
Author: Stanislaw Gruszka <email@example.com>
Date: Tue Nov 11 14:28:47 2014 +0100
rt2x00: do not align payload on modern H/W