Bug 11718
Summary: | Can't classificate problem. maybe hrtimer data structures got wrecked | ||
---|---|---|---|
Product: | Networking | Reporter: | Badalian Slava (slavon.net) |
Component: | IPV4 | Assignee: | Arnaldo Carvalho de Melo (acme) |
Status: | CLOSED CODE_FIX | ||
Severity: | high | CC: | ccaputo, jarkao2 |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.27 | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Badalian Slava
2008-10-08 02:12:11 UTC
The bug is still present in 2.6.26.6.

[ 5280.696710] BUG: NMI Watchdog detected LOCKUP<3>e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[ 5280.696710] Tx Queue <0>
[ 5280.696710] TDH <18>
[ 5280.696710] TDT <18>
[ 5280.696710] next_to_use <18>
[ 5280.696710] next_to_clean <6d>
[ 5280.696710] buffer_info[next_to_clean]
[ 5280.696710] time_stamp <4bf406>
[ 5280.696710] next_to_watch <6d>
[ 5280.696710] jiffies <4c02a9>
[ 5280.696710] next_to_watch.status <1>
[ 5280.696710] on CPU3, ip c01fafb0, registers:
[ 5280.696710] Modules linked in: netconsole i2c_i801 e1000e e1000 i2c_core
[ 5280.696710]
[ 5280.696710] Pid: 0, comm: swapper Not tainted (2.6.26.6-fw #1)
[ 5280.696710] EIP: 0060:[<c01fafb0>] EFLAGS: 00000096 CPU: 3
[ 5280.696710] EIP is at rb_insert_color+0x10/0xc0
[ 5280.696710] EAX: f55554a4 EBX: f55554a4 ECX: 00000000 EDX: f55554a4
[ 5280.696710] ESI: f55554a4 EDI: f55554a4 EBP: c202d0d4 ESP: f7c5fe04
[ 5280.696710] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 5280.696710] Process swapper (pid: 0, ti=f7c5e000 task=f7c32940 task.ti=f7c5e000)
[ 5280.696710] Stack: f55554a4 00000000 c202d0cc c202d0d4 c013aa4f f55554a4 c202d0cc c202d0cc
[ 5280.696710] c044b0a0 c013af3a c013d1bd f7c5fe54 d948bc00 000004cc 00000000 00000286
[ 5280.696710] f5555000 ffffffff 00000000 00000000 c02d521e 00000000 f5555000 c02da9d6
[ 5280.696710] Call Trace:
[ 5280.696710] [<c013aa4f>] enqueue_hrtimer+0x5f/0x80
[ 5280.696710] [<c013af3a>] hrtimer_start+0xaa/0x130
[ 5280.696710] [<c013d1bd>] getnstimeofday+0x3d/0xe0
[ 5280.696710] [<c02d521e>] qdisc_watchdog_schedule+0x1e/0x30
[ 5280.696710] [<c02da9d6>] htb_dequeue+0x6a6/0x810
[ 5280.696710] [<c02d409c>] __qdisc_run+0x19c/0x1d0
[ 5280.696710] [<c013b19d>] hrtimer_run_pending+0x1d/0x90
[ 5280.696710] [<c02c7a6e>] net_tx_action+0xbe/0xf0
[ 5280.696710] [<c012a1c2>] __do_softirq+0x82/0x100
[ 5280.696710] [<c012a277>] do_softirq+0x37/0x40
[ 5280.696710] [<c0107120>] do_IRQ+0x40/0x80
[ 5280.696710] [<c01055a3>] common_interrupt+0x23/0x28
[ 5280.696710] [<c010a602>] mwait_idle+0x32/0x40
[ 5280.696710] [<c010a5d0>] mwait_idle+0x0/0x40
[ 5280.696710] [<c01036e8>] cpu_idle+0x48/0xc0
[ 5280.696710] =======================
[ 5280.696710] Code: 03 09 d0 89 03 8b 1c 24 83 c4 0c c3 89 56 04 eb e3 8d 76 00 8d bc 27 00 00 00 00 55 89 d5 57 89 c7 56 53 90 8d b4 26 00 00 00 00 <8b> 1f 83 e3 fc 74 32 8b 03 89 d9 a8 01 75 2a 89 c6 83 e6 fc 8b

[ 6951.841662] BUG: NMI Watchdog detected LOCKUP on CPU3, ip c01fde4c, registers:
[ 6951.841662] Modules linked in: sch_sfq sch_htb netconsole e1000 i2c_i801 e1000e i2c_core
[ 6951.841662]
[ 6951.841662] Pid: 0, comm: swapper Not tainted (2.6.27-fw #1)
[ 6951.841662] EIP: 0060:[<c01fde4c>] EFLAGS: 00000092 CPU: 3
[ 6951.841662] EIP is at __rb_rotate_right+0xc/0x70
[ 6951.841662] EAX: f70c3c68 EBX: f70c3c68 ECX: f70c3c68 EDX: c202c134
[ 6951.841662] ESI: f70c3c68 EDI: f70c3c68 EBP: c202c134 ESP: f785fc2c
[ 6951.841662] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[ 6951.841662] Process swapper (pid: 0, ti=f785e000 task=f7832940 task.ti=f785e000)
[ 6951.841662] Stack: f70c3c68 f70c3c68 f70c3c68 c01fdf41 f70c3c68 00000000 c202c12c c202c134
[ 6951.841662] c013a91f f70c3c68 c202c12c c202212c c045b100 c013ae0a 00000000 c013d63d
[ 6951.841662] 9a011800 00000652 00000001 00000282 00000652 f70c3c68 00000000 00000000
[ 6951.841662] Call Trace:
[ 6951.841662] [<c01fdf41>] rb_insert_color+0x91/0xc0
[ 6951.841662] [<c013a91f>] enqueue_hrtimer+0x5f/0x80
[ 6951.841662] [<c013ae0a>] hrtimer_start+0xaa/0x130
[ 6951.841662] [<c013d63d>] getnstimeofday+0x3d/0xe0
[ 6951.841662] [<c02de83d>] qdisc_watchdog_schedule+0x3d/0x50
[ 6951.841662] [<f88ac343>] htb_dequeue+0x683/0x7b0 [sch_htb]
[ 6951.841662] [<c02ce692>] dev_hard_start_xmit+0x1d2/0x2c0
[ 6951.841662] [<c02dc87a>] __qdisc_run+0x13a/0x1d0
[ 6951.841662] [<c02d0ed7>] dev_queue_xmit+0x227/0x4f0
[ 6951.841662] [<c02f29ff>] ip_finish_output+0x11f/0x280
[ 6951.841662] [<c02f00e0>] ip_forward+0x290/0x310
[ 6951.841662] [<c02efe35>] ip_forward_finish+0x25/0x40
[ 6951.841662] [<c02ee9a2>] ip_rcv_finish+0x122/0x360
[ 6951.841662] [<c02c8cc6>] __alloc_skb+0x36/0x120
[ 6951.841662] [<c02c9d02>] __netdev_alloc_skb+0x22/0x50
[ 6951.841662] [<c02eee20>] ip_rcv+0x0/0x290
[ 6951.841662] [<c02ce064>] netif_receive_skb+0x274/0x4d0
[ 6951.841662] [<c0108b1a>] nommu_map_single+0x2a/0x60
[ 6951.841662] [<f883be39>] e1000_receive_skb+0x49/0x80 [e1000e]
[ 6951.841662] [<f883e84c>] e1000_clean_rx_irq+0x23c/0x300 [e1000e]
[ 6951.841662] [<f883b3ad>] e1000_clean+0x1bd/0x570 [e1000e]
[ 6951.841662] [<c02d03bc>] net_rx_action+0x13c/0x200
[ 6951.841662] [<c0129b72>] __do_softirq+0x82/0x100
[ 6951.841662] [<c0129c27>] do_softirq+0x37/0x40
[ 6951.841662] [<c0106060>] do_IRQ+0x40/0x80
[ 6951.841662] [<c01134c7>] smp_apic_timer_interrupt+0x57/0x90
[ 6951.841662] [<c010457f>] common_interrupt+0x23/0x28
[ 6951.841662] [<c0109aa2>] mwait_idle+0x32/0x40
[ 6951.841662] [<c01026c8>] cpu_idle+0x48/0xe0
[ 6951.841662] =======================
[ 6951.841662] Code: 24 08 83 e0 03 09 d0 89 03 8b 1c 24 83 c4 0c c3 89 56 08 eb e3 8d 76 00 8d bc 27 00 00 00 00 83 ec 0c 89 1c 24 89 c3 89 7c 24 08 <89> d7 89 74 24 04 8b 50 08 8b 30 8b 4a 04 83 e6 fc 85 c9 89 48

Now I get it on 2.6.27 as well!

INFO: This bug is tracked on netdev under the subject "deadlocks if use htb".

Summary of tests.
Jarek's answer:

> Here is my current opinion on this bug:
>
> 1) I'm almost sure it's not a htb, but hrtimers bug (some race),
>
> 2) the htb patches you've tested are not "the proper" way of fixing
> it; I see substantial changes in hrtimers code in the "-tip" tree
> (probably for 2.6.29), which, probably, you'll be advised by
> hrtimers maintainers to try, and I guess, it's not easy on a
> production system,
>
> So, it's up to you:
>
> 1) since these patches work for you, you can stop with testing and
> wait with these patched kernels until 2.6.29 (I can propose this
> #2 patch as a temporary fix then),
>
> 2) for curiosity you could try this patch #4 alone on one box first
> (after reverting at least patch #2), but again: if it works, it
> could be only treated as a temporary hack, and alternative of #2.
>
> Thanks,
> Jarek P.

The problem is temporarily fixed for me (the system has not crashed for 1 week), and I can wait a long time for new kernels, but I can test the hrtimer fixes if anyone is interested.

On Thu, Dec 18, 2008 at 03:42:52AM -0800, bugme-daemon@bugzilla.kernel.org wrote:
...
> Problem temporarily fixed for me (system not crashed for 1 week) and I can wait
> for new kernels a long time, but I can test hrtimer fixes if anyone is interested
> in this.

Sure we are. Here is a link to the patches in the -tip tree:

http://git.kernel.org/?p=linux/kernel/git/mingo/linux-2.6-sched-devel.git;a=history;f=kernel/hrtimer.c;h=b741f850426e5ba8841feca4c730f3da1c65f7b8;hb=HEAD

I mean the top three of Peter Zijlstra's "hrtimer: removing all ur callback modes" patches. They should apply to the current -linus or -net tree, but I didn't try to compile them.

Jarek P.

Per Jarek's suggestion, I ran 2.6.28 plus Peter Zijlstra's "hrtimer: removing all ur callback modes" patches dated 2008-11-25, 2008-12-04 and 2008-12-08. Uptime was 2 days 22 hours before I hit what appears to be an unrelated bug in the IPv6 FIB. (Reported on dev lists with subject 'panic with 2.6.28 while doing "ip -6 route"'.)
Will continue testing with Zijlstra's patches...

I should add that with 2.6.28, without the Zijlstra patches, the system would hang after about an hour.

For the record: this bug is expected to be fixed now:

1) in 2.6.29 tree by above mentioned Peter Zijlstra's changes to hrtimers,

2) in 2.6.28.2 and 2.6.27.13 by a temporary patch to sch_htb:

http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.28.y.git;a=commit;h=e46032840eae03a502638049468edc1167345c9c
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commit;h=9befaf375925471a49159d775b38d42c04e218a1

so this bug report could be closed.

Jarek P.