Distribution: Debian

Problem Description: rb_erase kernel panic

EIP: [<c01c313b>] rb_erase+0xf6/0x22f SS:ESP 0068:da3c3e34
Kernel panic - not syncing: Fatal exception in interrupt

Full call trace in attachment.

Steps to reproduce:

tc qdisc add dev eth1 root handle 1: htb
tc class add dev eth1 parent 1: classid 1:1 htb rate 36000Kbit
tc class add dev eth1 parent 1:1 classid 1:11 htb rate 28000Kbit prio 0
tc class add dev eth1 parent 1:1 classid 1:15 htb rate 1000Kbit ceil 28000Kbit prio 0
tc class add dev eth1 parent 1:1 classid 1:19 htb rate 1000Kbit ceil 5000Kbit prio 2

For N from 10 to 10k (one class and filter per user):

tc class add dev eth1 parent 1:{11,15,19} classid 1:$N htb rate 1Kbit ceil {$SPEED}Kbit
tc filter add dev eth1 parent 1: protocol ip pref $N handle $N fw flowid 1:$N

Classes and filters for users may be changed or added in real time.
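The per-user step above can be sketched as a small dry-run script. This is only an illustration, not the reporter's actual tooling: the function name emit_user_rules, the example user IDs 100 to 102, the parent 1:11 and the 512Kbit ceil are all hypothetical. It prints the tc commands instead of running them, so it needs neither root nor a real eth1.

```shell
#!/bin/sh
# Dry-run sketch: prints the per-user tc commands from the report
# instead of executing them. Pipe the output to "sh" as root to apply.

DEV=eth1   # interface name taken from the report

# emit_user_rules N PARENT SPEED   (hypothetical helper)
#   N      per-user id, used as classid minor, filter pref and fw mark
#   PARENT minor id of the parent class (11, 15 or 19 in the report)
#   SPEED  per-user ceil in Kbit (assumed value, not from the report)
emit_user_rules() {
    n=$1
    parent=$2
    speed=$3
    echo "tc class add dev $DEV parent 1:$parent classid 1:$n htb rate 1Kbit ceil ${speed}Kbit"
    echo "tc filter add dev $DEV parent 1: protocol ip pref $n handle $n fw flowid 1:$n"
}

# Example: three users under parent class 1:11 with a 512Kbit ceil.
n=100
while [ "$n" -le 102 ]; do
    emit_user_rules "$n" 11 512
    n=$((n + 1))
done
```

Each user gets exactly one class and one fw filter, as described in the report. Changing a user in real time would mean issuing the matching "tc class change" or "tc class del" command against the same classid.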
Created attachment 20498: Call trace from server remote control
Could you add some details, e.g.:
- Is it a vanilla 2.6.26.5 kernel, a Debian build, or one with some other patches? (Could it be replaced with a newer version, BTW?)
- How often does this bug happen, and can it be reproduced reliably?
- Is there any chance of getting the beginning lines of this bug report (before the call trace)?

Thanks,
Jarek P.
1) It is vanilla 2.6.26.5 from kernel.org. Yes, I can replace it with a newer one.
2) This bug has happened only once in half a year of server usage. But even if it happens only once, it is still a bug. =) During the last year there was another bug that annoyed me: http://bugzilla.kernel.org/show_bug.cgi?id=11571 Thank you for the fix.
3) There is no chance of getting the beginning lines. There is nothing in syslog, and I have only a screenshot from iLO (the HP remote console).

For now I wonder what the comment "/* If this triggers, it is a bug in this code, but it need not be fatal */" in the htb code near htb_safe_rb_erase means.
Of course, you are right: this is a bug even if spotted only once, and it's very nice that you've reported it. I was only trying to assess the chances of debugging it. Alas, they don't look so good. =(

This call trace is similar to reports I tracked a long time ago and fixed just in ...2.6.26 (my own earlier bug, BTW). I looked into this a lot at that time, and if something was left, I guess it's really well hidden. So I'll try to look again, but I'm not very optimistic if it's so hard to reproduce.

I'm glad you mentioned bug 11571: as a matter of fact, I wasn't sure that patch worked for you, so I'll add this info to that report.

The htb_safe_rb_erase comment means somebody admitted the possibility of double erasing on some code path, and made it safe for debugging with WARN_ON. But I can't remember any such reports, so it's probably better than expected.

Thanks,
Jarek P.
After looking into this, I think there really is at least one buggy place in sch_htb in kernels 2.6.26 and older which can cause such oopses: htb_destroy_class calls htb_safe_rb_erase without holding sch_tree_lock. Now I wonder why it happened so rarely...

Happily, this place was fixed (BTW) in 2.6.27 by this patch:

"net-sched: sch_htb: move hash and sibling list removal to htb_delete"
commit fbd8f1379aeeb3e44a59302a6b2850636130bb2a
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=fbd8f1379aeeb3e44a59302a6b2850636130bb2a

My recommendation is to upgrade to 2.6.27 or later. Alternatively, you could try applying this patch to 2.6.26, but I haven't tested that, and 2.6.26 is rarely updated now.

So this bug report is really helpful, but my proposal is to close it unless the problem is confirmed again after some time on newer kernels.

Thanks,
Jarek P.
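As a rough aid for the upgrade recommendation above, the following sketch checks whether a kernel release string is 2.6.27 or later. The function name kernel_has_fix is hypothetical, and the check looks only at the version number: it assumes a mainline-style release string such as the output of "uname -r", and it cannot tell whether a distribution has backported the patch to an older kernel.

```shell
#!/bin/sh
# kernel_has_fix VERSION   (hypothetical helper)
# Succeeds if VERSION (e.g. "2.6.26.5" or "2.6.32-5-686") is 2.6.27 or
# later, the release where the thread says the sch_htb fix landed.
kernel_has_fix() {
    ver=${1%%-*}                 # drop a suffix such as "-5-686"
    old_ifs=$IFS
    IFS=.
    set -- $ver                  # split "2.6.26.5" into 2 6 26 5
    IFS=$old_ifs
    major=${1:-0}
    minor=${2:-0}
    patch=${3:-0}
    [ "$major" -gt 2 ] && return 0
    [ "$major" -lt 2 ] && return 1
    [ "$minor" -gt 6 ] && return 0
    [ "$minor" -lt 6 ] && return 1
    [ "$patch" -ge 27 ]
}

# Check the running kernel.
if kernel_has_fix "$(uname -r)"; then
    echo "kernel is 2.6.27 or later"
else
    echo "kernel is older than 2.6.27; consider upgrading"
fi
```

For the kernel from this report, kernel_has_fix 2.6.26.5 fails, while kernel_has_fix 2.6.27 succeeds.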