Bug 80201
Summary: | general protection fault: 0000 [#1] SMP (while using HTB) | ||
---|---|---|---|
Product: | Networking | Reporter: | Cenek Zach (cenek.zach) |
Component: | Other | Assignee: | Stephen Hemminger (stephen) |
Status: | NEW --- | ||
Severity: | normal | CC: | alan, eric.dumazet, szg00000, xiyou.wangcong |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | Linux 3.10.41-1.el6.elrepo.x86_64, Linux 3.14.13 vanilla | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Kernel GPF stack trace, Kernel GPF stack trace 2, crash info 2014/08/25, crash in HTB module #1, crash in SFQ module #1, crash in SFQ module #2 | ||
Created attachment 144021 [details]
Kernel GPF stack trace 2
Another similar occurrence; same hardware and configuration. This time a NULL pointer dereference was detected:
... BUG: unable to handle kernel NULL pointer dereference at 0000000000000002 ...
See Kernel GPF stack trace 2.

This kind of setup is not supported. HTB leaves must have a work-conserving qdisc.

Sorry, I do not understand. We have an SFQ qdisc under the HTB leaf; SFQ is work conserving.

It should still not oops either.

Hmm... the bug seems to be triggered in SFQ, as 'perturb 10' uses a timer. When sfq_rehash() is called, the root qdisc is properly held, but sfq_reset() might be called without the qdisc being held, via qdisc_destroy(). Can you check whether not using 'perturb 10' helps?

As we encountered this on our production server, we have moved to the latest (at the time) longterm kernel, vanilla 3.14.13. We have it compiled with debug symbols, so I should be able to provide more info if the problem occurs again (I'm not sure how likely that is, as the versions are quite far apart).

Note that 3.14 will crash the same way if you use 'perturb XX' in your sfq qdisc.

OK, we will be able to test it then. When the crash occurs next time, I will remove the 'perturb 10' option.

We now have 4 servers under load and have observed 3 crashes in 2 days (with the 'perturb 10' option). I set up SFQ without the 'perturb' option on all servers around 15:30, and have observed 1 crash already. I placed the end of the log and the back-trace from the vmcore in the attachments.

Incidentally, does omitting the 'perturb' option disable the rehashing, or is a default value used? 'tc qdisc show' does not show the option any more:

    # tc qdisc show
    qdisc htb 1: dev em1 root refcnt 47 r2q 5000 default 30 direct_packets_stat 9545
    qdisc sfq 10: dev em1 parent 1:1 limit 127p quantum 1514b

Created attachment 148041 [details]
crash info 2014/08/25
I might have a clue. We are changing the limits imposed by HTB every minute using the following command:

    $TC class change dev "$BL_DEVICE" parent 1: classid 1:1 htb rate "$LIMIT" burst "$BL_BURST" cburst "$BL_CBURST" quantum 60000

This command was definitely running in 3 out of 4 cases at the time of the crash (found it in 'crash> bt -a').

Without 'perturb', the rehash timer will be disabled, which means rehashing is disabled too.

Can you try to find where exactly the kernel crashes? I mean, try to map the faulting address to the source code. It should give us some clue as to which pointer was NULL'ed.

Created attachment 148661 [details]
crash in HTB module #1
Created attachment 148671 [details]
crash in SFQ module #1
Created attachment 148681 [details]
crash in SFQ module #2
We have experienced 7 crashes, 3 with and 4 without the perturb option. I found 3 places where the crashes occurred, and uploaded a txt file with details for each case. I will call them, according to the files: HTB#1, SFQ#1 and SFQ#2.

Crash cases:

With the 'perturb 10' option:
    HTB#1: 1x
    SFQ#1: 2x

Without the 'perturb 10' option:
    HTB#1: 1x
    SFQ#1: 1x
    SFQ#2: 2x

However, the crashes without the option were all on a single server (out of four, all running the same kernel and configuration). It might be a coincidence, but it might indicate a corruption somewhere...
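
For the request earlier in the thread to map the faulting address to the source code, one common approach is to take the `function+0xOFFSET/0xLENGTH` token from the oops and hand it to gdb together with an object file built with debug symbols. The following is only a sketch: the helper names `oops_sym_to_gdb` and `lookup` are made up for illustration, and the symbol in the usage note is a placeholder, not a value taken from the attached traces.

```shell
#!/bin/sh
# Turn an oops symbol of the form "func+0xOFF/0xLEN" into a gdb 'list'
# expression. Helper names are hypothetical, chosen for this sketch.
oops_sym_to_gdb() {
    sym="${1%%/*}"                 # drop the "/0xLEN" function-size suffix
    printf 'list *(%s)\n' "$sym"
}

# Print the source lines around a symbol+offset.
# usage: lookup <vmlinux-or-module.ko with debug info> <func+0xOFF/0xLEN>
lookup() {
    gdb -batch -ex "$(oops_sym_to_gdb "$2")" "$1"
}
```

A call would look like `lookup net/sched/sch_sfq.ko 'sfq_dequeue+0x10/0x100'`, where both the object path and the symbol with its offset are placeholders. Inside the crash utility, `dis -l` on the faulting address should give comparable source-line information.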
Created attachment 142971 [details]
Kernel GPF stack trace

Encountered GPF under normal circumstances - no heavy load (CPU, IO, net).

HTB configuration is very simple: 1 HTB class with SFQ qdisc and a filter on source port 80:

    tc qdisc add dev eth0 root handle 1: htb default 30
    tc class add dev eth0 parent 1: classid 1:1 htb rate $LIMIT burst 1500k cburst 1500k
    tc qdisc add dev eth0 parent 1:1 handle 10: sfq perturb 10
    tc filter add dev eth0 protocol ip parent 1:0 prio 1 u32 match ip sport 80 0xffff flowid 1:1

Relevant part of vmcore-dmesg.txt attached.
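
For anyone attempting to reproduce this, the configuration above and the once-a-minute `tc class change` mentioned in the comments can be wrapped in a small script. This is a sketch under assumptions: the `tc` arguments are copied from the report, but the variable defaults (in particular the `100mbit` rate), the `DRY_RUN` switch and the function names are illustrative and not part of the original setup.

```shell
#!/bin/sh
# Sketch of the reported HTB + SFQ setup. The tc arguments are copied from
# the report; DEV, LIMIT, BURST, PERTURB and DRY_RUN defaults are assumptions.
DEV="${DEV:-eth0}"
LIMIT="${LIMIT:-100mbit}"   # placeholder rate, the report does not give one
BURST="${BURST:-1500k}"     # burst/cburst value from the initial setup
PERTURB="${PERTURB-10}"     # set PERTURB= (empty) to omit 'perturb'
DRY_RUN="${DRY_RUN:-1}"     # 1 = print commands only, 0 = run them (root)

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_htb_sfq() {
    run tc qdisc add dev "$DEV" root handle 1: htb default 30
    run tc class add dev "$DEV" parent 1: classid 1:1 htb rate "$LIMIT" burst "$BURST" cburst "$BURST"
    # ${PERTURB:+...} expands to nothing when PERTURB is empty
    run tc qdisc add dev "$DEV" parent 1:1 handle 10: sfq ${PERTURB:+perturb "$PERTURB"}
    run tc filter add dev "$DEV" protocol ip parent 1:0 prio 1 u32 match ip sport 80 0xffff flowid 1:1
}

# The periodic limit update described in the comments.
change_rate() {
    run tc class change dev "$DEV" parent 1: classid 1:1 htb rate "$1" burst "$BURST" cburst "$BURST" quantum 60000
}
```

With the defaults, `setup_htb_sfq` and `change_rate 50mbit` only print the commands (`50mbit` being an arbitrary example rate); setting `DRY_RUN=0` and running as root applies them. Calling `change_rate` in a loop while traffic flows through class 1:1 approximates the condition the reporter suspected.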