Hello, the problem is that after some time of operation my router has performance problems.

The problem exists on kernels 2.6.32.7 and 2.6.33-rc6-git5. The problem does not exist on kernel 2.6.32.2.

On 2.6.33-rc6-git5 I get this:

BUG: soft lockup - CPU#1 stuck for 61s! [events/1:28] Modules linked in: CPU 1 Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT RIP: 0010:[<ffffffff810a3d89>] [<ffffffff810a3d89>] kmem_cache_free+0x11b/0x11c RSP: 0018:ffff880028243e50 EFLAGS: 00000292 RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0 RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680 RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0 R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7 FS: 0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task ffff88031f9a4f80) Stack: ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023 <0> 01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040 <0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740 Call Trace: <IRQ> [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235 [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d [<ffffffff81035311>] ? __do_softirq+0xd7/0x196 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 <EOI> [<ffffffff81004599>] ? do_softirq+0x31/0x63 [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 [<ffffffff812f768f>] ?
dst_gc_task+0x0/0x1a7 [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 [<ffffffff8136b08c>] ? schedule+0x82c/0x906 [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc [<ffffffff810479bd>] ? kthread+0x79/0x81 [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 [<ffffffff81047944>] ? kthread+0x0/0x81 [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10 Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 48 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f <c3> 55 48 89 f5 53 48 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10 Call Trace: <IRQ> [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235 [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d [<ffffffff81035311>] ? __do_softirq+0xd7/0x196 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 <EOI> [<ffffffff81004599>] ? do_softirq+0x31/0x63 [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 [<ffffffff8136b08c>] ? schedule+0x82c/0x906 [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc [<ffffffff810479bd>] ? kthread+0x79/0x81 [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 [<ffffffff81047944>] ? kthread+0x0/0x81 [<ffffffff81002cb0>] ? 
kernel_thread_helper+0x0/0x10

And on 2.6.32.7:

Feb 4 13:12:40 TM_01_C1 ------------[ cut here ]------------
Feb 4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x130/0x1d6()
Feb 4 13:12:40 TM_01_C1 Hardware name: X7DCT
Feb 4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Feb 4 13:12:40 TM_01_C1 Modules linked in: oprofile
Feb 4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1
Feb 4 13:12:40 TM_01_C1 Call Trace:
Feb 4 13:12:40 TM_01_C1 <IRQ> [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb 4 13:12:40 TM_01_C1 [<ffffffff81038811>] ? warn_slowpath_common+0x77/0xa3
Feb 4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59
Feb 4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e
Feb 4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8
Feb 4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40
Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb 4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44
Feb 4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6
Feb 4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ? run_timer_softirq+0x1ff/0x29d
Feb 4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7
Feb 4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
Feb 4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
Feb 4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66
Feb 4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ? smp_apic_timer_interrupt+0x87/0x95
Feb 4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ? apic_timer_interrupt+0x13/0x20
Feb 4 13:12:40 TM_01_C1 <EOI> [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0
Feb 4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c
Feb 4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]---

The machine is a BGP router with some iproute traffic management.
The messages before this soft lockup were:

Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
vlan0811: 5 rebuilds is over limit, route caching disabled
Route hash chain too long! Adjust your secret_interval!
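For context, the "Adjust your secret_interval!" message refers to the interval at which these kernels rehash the IPv4 route cache, which is exposed as a sysctl. A minimal sketch of inspecting and tuning it; the value written below is purely illustrative, not a recommendation:

```shell
# Current rehash interval of the IPv4 route cache (seconds).
sysctl net.ipv4.route.secret_interval
# Rehash more frequently so individual hash chains have less time to grow.
sysctl -w net.ipv4.route.secret_interval=60
# The garbage-collection threshold also bounds route cache growth.
sysctl net.ipv4.route.gc_thresh
# Per-CPU route cache statistics, useful to watch while tuning.
cat /proc/net/stat/rt_cache
```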
Also, after this soft lockup there is a strange problem: I cannot set CPU affinity for my NIC. The NIC is then stuck on a random CPU, and changing /proc/irq/xx/smp_affinity does not help.
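For reference, manually pinning a NIC interrupt normally looks like the following (IRQ 30 is eth0 in the /proc/interrupts listing further below; the mask is a hexadecimal bitmap of CPUs). After the lockup this reportedly has no effect:

```shell
# Pin eth0's interrupt (IRQ 30 in /proc/interrupts) to CPU1.
# smp_affinity takes a hexadecimal bitmask of allowed CPUs.
echo 2 > /proc/irq/30/smp_affinity   # 0x2 = CPU1 only
cat /proc/irq/30/smp_affinity        # read back to verify the mask stuck
# Note: an irqbalance daemon, if running, may rewrite this mask.
```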
OK, sorry, additional info: the problem also exists on kernel 2.6.32.

Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
vlan0811: 5 rebuilds is over limit, route caching disabled
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
Route hash chain too long! Adjust your secret_interval!
vlan0811: 13 rebuilds is over limit, route caching disabled
Route hash chain too long! Adjust your secret_interval!

BUG: soft lockup - CPU#3 stuck for 61s! [events/3:30] Modules linked in: CPU 3: Modules linked in: Pid: 30, comm: events/3 Not tainted 2.6.32 #1 X7DCT RIP: 0010:[<ffffffff813095a7>] [<ffffffff813095a7>] sfq_enqueue+0xec/0x34d RSP: 0018:ffff8800282c3cf8 EFLAGS: 00000202 RAX: 0000000000000005 RBX: ffff8802f96bc0e0 RCX: 0000000036116e35 RDX: ffff8803021b1830 RSI: 0000000099ee00c1 RDI: 000000000a3c669d RBP: ffffffff8100b873 R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff81311fd5 R11: ffffffff812edc84 R12: ffff8800282c3c70 R13: ffff8802f96bc000 R14: ffff8802f222eec0 R15: ffffffff8101b143 FS: 0000000000000000(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007f1540b91f70 CR3: 000000031d7e5000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <IRQ> [<ffffffff8130869f>] ? hfsc_enqueue+0x1be/0x727 [<ffffffff812fc534>] ? eth_header+0x24/0x9c [<ffffffff812edf1d>] ?
dev_queue_xmit+0x299/0x3b3 [<ffffffff81319e73>] ? ip_rcv_finish+0x2cf/0x2e9 [<ffffffff8131a10a>] ? ip_rcv+0x27d/0x2c2 [<ffffffff8126a540>] ? e1000_clean_rx_irq+0x22e/0x2de [<ffffffff8126bbdd>] ? e1000_clean+0x70/0x21c [<ffffffff812ed555>] ? net_rx_action+0x71/0x160 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28 [<ffffffff812f1a6f>] ? dst_gc_task+0x0/0x1a7 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28 <EOI> [<ffffffff8100d645>] ? do_softirq+0x31/0x66 [<ffffffff8103dcf4>] ? local_bh_enable_ip+0x78/0x89 [<ffffffff812f1a6f>] ? dst_gc_task+0x0/0x1a7 [<ffffffff812f1b3d>] ? dst_gc_task+0xce/0x1a7 [<ffffffff813620c2>] ? thread_return+0x3e/0x104 [<ffffffff813637c8>] ? _spin_lock_irqsave+0x9/0xe [<ffffffff8104bf54>] ? worker_thread+0x14c/0x1dc [<ffffffff8104f095>] ? autoremove_wake_function+0x0/0x2e [<ffffffff8104be08>] ? worker_thread+0x0/0x1dc [<ffffffff8104ede1>] ? kthread+0x79/0x81 [<ffffffff8100bd7a>] ? child_rip+0xa/0x20 [<ffffffff8104ed68>] ? kthread+0x0/0x81 [<ffffffff8104ed68>] ? kthread+0x0/0x81 [<ffffffff8100bd70>] ? child_rip+0x0/0x20 BUG: soft lockup - CPU#3 stuck for 61s! 
[events/3:30] Modules linked in: CPU 3: Modules linked in: Pid: 30, comm: events/3 Not tainted 2.6.32 #1 X7DCT RIP: 0010:[<ffffffff8126976f>] [<ffffffff8126976f>] e1000_xmit_frame+0x7d9/0x921 RSP: 0018:ffff8800282c3b98 EFLAGS: 00000206 RAX: 0000000000004100 RBX: ffff880308c740c0 RCX: ffff880316682c00 RDX: ffffc900008ff000 RSI: 00000000000000c0 RDI: 00000000032b0000 RBP: ffffffff8100b873 R08: 0000000042000000 R09: ffff88031e07f240 R10: ffff880316682cc0 R11: 0000000000000000 R12: ffff8800282c3b10 R13: ffff88031e07f240 R14: 0000000000000001 R15: ffffffff8101b143 FS: 0000000000000000(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00007f1540b91f70 CR3: 000000031d7e5000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <IRQ> [<ffffffff812695ac>] ? e1000_xmit_frame+0x616/0x921 [<ffffffff812edadc>] ? dev_hard_start_xmit+0x20e/0x2b2 [<ffffffff812fd117>] ? sch_direct_xmit+0x5e/0x156 [<ffffffff812edee1>] ? dev_queue_xmit+0x25d/0x3b3 [<ffffffff813544d4>] ? vlan_dev_hwaccel_hard_start_xmit+0x63/0x7d [<ffffffff812edadc>] ? dev_hard_start_xmit+0x20e/0x2b2 [<ffffffff812fd117>] ? sch_direct_xmit+0x5e/0x156 [<ffffffff812fd2c8>] ? __qdisc_run+0xb9/0xd8 [<ffffffff812edf35>] ? dev_queue_xmit+0x2b1/0x3b3 [<ffffffff81319e73>] ? ip_rcv_finish+0x2cf/0x2e9 [<ffffffff8131a10a>] ? ip_rcv+0x27d/0x2c2 [<ffffffff8126a540>] ? e1000_clean_rx_irq+0x22e/0x2de [<ffffffff8126bbdd>] ? e1000_clean+0x70/0x21c [<ffffffff812ed555>] ? net_rx_action+0x71/0x160 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28 [<ffffffff812f1a6f>] ? dst_gc_task+0x0/0x1a7 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28 <EOI> [<ffffffff8100d645>] ? do_softirq+0x31/0x66 [<ffffffff8103dcf4>] ? local_bh_enable_ip+0x78/0x89 [<ffffffff812f1a6f>] ? dst_gc_task+0x0/0x1a7 [<ffffffff812f1b3d>] ? 
dst_gc_task+0xce/0x1a7 [<ffffffff813620c2>] ? thread_return+0x3e/0x104 [<ffffffff813637c8>] ? _spin_lock_irqsave+0x9/0xe [<ffffffff8104bf54>] ? worker_thread+0x14c/0x1dc [<ffffffff8104f095>] ? autoremove_wake_function+0x0/0x2e [<ffffffff8104be08>] ? worker_thread+0x0/0x1dc [<ffffffff8104ede1>] ? kthread+0x79/0x81 [<ffffffff8100bd7a>] ? child_rip+0xa/0x20 [<ffffffff8104ed68>] ? kthread+0x0/0x81 [<ffffffff8104ed68>] ? kthread+0x0/0x81 [<ffffffff8100bd70>] ? child_rip+0x0/0x20 dst cache overflow
Additional info about the host: this is Gentoo Linux x86_64, with fib_trie, and more than 8000 iproute filters on an interface with hfsc classes.

The machine has these network interfaces:

04:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
05:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

Topology:

Clients <- ibgp -> [ vlan0811@eth0 + vlan0508@eth0 - BGP process - eth1 ] <- ebgp -> Internet

Some fib_trie stats: cat /proc/net/fib_triestat

Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
	Aver depth:     2.58
	Max depth:      6
	Leaves:         297566
	Prefixes:       312528
	Internal nodes: 70105
	  1: 35598  2: 15065  3: 10956  4: 4743  5: 2311  6: 955  7: 365  8: 110  9: 1  16: 1
	Pointers: 570992
Null ptrs: 203322
Total size: 37010 kB
Counters:
---------
gets = 9298381
backtracks = 358013
semantic match passed = 9194516
semantic match miss = 97
null node hit= 4270863
skipped node resize = 0
Local:
	Aver depth:     3.66
	Max depth:      4
	Leaves:         15
	Prefixes:       16
	Internal nodes: 11
	  1: 8  2: 3
	Pointers: 28
Null ptrs: 3
Total size: 3 kB
Counters:
---------
gets = 9358237
backtracks = 9128813
semantic match passed = 64893
semantic match miss = 0
null node hit= 2102
skipped node resize = 0

Meminfo: cat /proc/meminfo

MemTotal:       12309556 kB
MemFree:        11466272 kB
Buffers:            4356 kB
Cached:            28672 kB
SwapCached:            0 kB
Active:           276288 kB
Inactive:          17988 kB
Active(anon):     261404 kB
Inactive(anon):        0 kB
Active(file):      14884 kB
Inactive(file):    17988 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        261252 kB
Mapped:             4056 kB
Shmem:               156 kB
Slab:             527288 kB
SReclaimable:       8100 kB
SUnreclaim:       519188 kB
KernelStack:         992 kB
PageTables:         1604 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6154776 kB
Committed_AS:     269144 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        8944 kB
VmallocChunk:   34359724163 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        3392 kB
DirectMap2M:    12578816 kB

Interrupts: cat /proc/interrupts

           CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
  0:         44       0       0       0       0       3       0       0   IO-APIC-edge      timer
  1:          0       1       0       0       0       0       1       0   IO-APIC-edge      i8042
  9:          0       0       0       0       0       0       0       0   IO-APIC-fasteoi   acpi
 14:          0       0       0       0       0       0       0       0   IO-APIC-edge      ide0
 15:          0       0       0       0       0       0       0       0   IO-APIC-edge      ide1
 20:          0       0       0       0       0       0       0       0   IO-APIC-fasteoi   uhci_hcd:usb3
 21:          0       0       0       0       0       0       0       0   IO-APIC-fasteoi   uhci_hcd:usb6
 22:          0       0       0       0       0       0       0       0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
 23:          0       0       0       0       0       0       0       0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb4
 29:       2020     136     135     135     134     138     136     136   PCI-MSI-edge      ahci
 30:     267279  272260 1178931  268413  274942  274003  270552  268966   PCI-MSI-edge      eth0
 31:     176230  785586  276756  175203  168674  169608  173063  174649   PCI-MSI-edge      eth1
 32:          0       0       0       0       0       1       1       0   PCI-MSI-edge
 33:          0       1       0       0       0       0       0       1   PCI-MSI-edge
 34:          0       0       1       0       0       1       0       0   PCI-MSI-edge
 35:          1       0       0       1       0       0       0       0   PCI-MSI-edge
 36:          0       0       0       0       1       0       0       1   PCI-MSI-edge
 37:          1       0       0       0       0       0       1       0   PCI-MSI-edge
 38:          0       0       0       1       1       0       0       0   PCI-MSI-edge
 39:          0       1       1       0       0       0       0       0   PCI-MSI-edge
NMI:          0       0       0       0       0       0       0       0   Non-maskable interrupts
LOC:    2955414 3213591 4900145 3163096 3132839 3396767 3225617 3510714   Local timer interrupts
SPU:          0       0       0       0       0       0       0       0   Spurious interrupts
PMI:          0       0       0       0       0       0       0       0   Performance monitoring interrupts
PND:          0       0       0       0       0       0       0       0   Performance pending work
RES:        673     768    2452     535     584     674     985     979   Rescheduling interrupts
CAL:        378     276      92      86     341     347      92      89   Function call interrupts
TLB:         27      25      23      26      29      25      28      33   TLB shootdowns
ERR:          0
MIS:          0
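The traffic-management setup described above (hfsc classes with thousands of per-client filters, and sfq leaves matching the sfq_enqueue frames in the 2.6.32 trace) can be sketched with tc. All class IDs, rates, and the filter match below are invented for illustration; the real router repeats the filter line more than 8000 times:

```shell
# Root hfsc qdisc on eth0; unclassified traffic falls into class 1:10.
tc qdisc add dev eth0 root handle 1: hfsc default 10
# Parent class bounding the whole link.
tc class add dev eth0 parent 1: classid 1:1 hfsc sc rate 900mbit ul rate 900mbit
# One per-client leaf class (hypothetical rates).
tc class add dev eth0 parent 1:1 classid 1:10 hfsc sc rate 10mbit ul rate 100mbit
# One per-client u32 filter; the production box has >8000 of these.
tc filter add dev eth0 parent 1: protocol ip u32 match ip dst 10.0.0.2/32 flowid 1:10
# sfq on the leaf, as suggested by sfq_enqueue in the soft-lockup trace.
tc qdisc add dev eth0 parent 1:10 handle 10: sfq perturb 10
```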
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Sat, 6 Feb 2010 14:39:46 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=15239
>
>            Summary: Problem with network performance after a while.
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.33-rc6-git5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: pstaszewski@artcom.pl
>         Regression: No
>
> Hello the problem is that after some time of working my router have some
> performance problems.
>
> The problem exist on kernel 2.6.32.7 and 2.6.33-rc6-git5
> The problem dont exist on 2.6.32.2 kernel.

That would be bad. That's a regression in 2.6.33-rc, and we've backported that regression into 2.6.32.x -stable.

> on 2.6.33-rc6-git5 i have this info:
> BUG: soft lockup - CPU#1 stuck for 61s! > [events/1:28] > Modules linked in: > CPU 1 > Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT > RIP: 0010:[<ffffffff810a3d89>] [<ffffffff810a3d89>] > kmem_cache_free+0x11b/0x11c > RSP: 0018:ffff880028243e50 EFLAGS: 00000292 > RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0 > RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680 > RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c > R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0 > R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7 > FS: 0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task > ffff88031f9a4f80) > Stack: >
ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023 > <0> 01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040 > <0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740 > Call Trace: > <IRQ> > [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 > [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235 > [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c > [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d > [<ffffffff81035311>] ? __do_softirq+0xd7/0x196 > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > <EOI> > [<ffffffff81004599>] ? do_softirq+0x31/0x63 > [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 > [<ffffffff8136b08c>] ? schedule+0x82c/0x906 > [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b > [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d > [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc > [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e > [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc > [<ffffffff810479bd>] ? kthread+0x79/0x81 > [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 > [<ffffffff81047944>] ? kthread+0x0/0x81 > [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10 > Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 48 > 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f <c3> 55 48 89 f5 53 > 48 > 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10 > Call Trace: > <IRQ> [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 > [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235 > [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c > [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d > [<ffffffff81035311>] ? __do_softirq+0xd7/0x196 > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > <EOI> [<ffffffff81004599>] ? 
do_softirq+0x31/0x63 > [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 > [<ffffffff8136b08c>] ? schedule+0x82c/0x906 > [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b > [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d > [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc > [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e > [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc > [<ffffffff810479bd>] ? kthread+0x79/0x81 > [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 > [<ffffffff81047944>] ? kthread+0x0/0x81 > [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10 > > And on 2.6.32.7 > Feb 4 13:12:40 TM_01_C1 ------------[ cut here ]------------ > Feb 4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261 > dev_watchdog+0x130/0x1d6() > Feb 4 13:12:40 TM_01_C1 Hardware name: X7DCT > Feb 4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 > timed > out > Feb 4 13:12:40 TM_01_C1 Modules linked in: oprofile > Feb 4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1 > Feb 4 13:12:40 TM_01_C1 Call Trace: > Feb 4 13:12:40 TM_01_C1 <IRQ> [<ffffffff812fcaf7>] ? > dev_watchdog+0x130/0x1d6 > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6 > Feb 4 13:12:40 TM_01_C1 [<ffffffff81038811>] ? > warn_slowpath_common+0x77/0xa3 > Feb 4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e > Feb 4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8 > Feb 4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40 > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44 > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6 > Feb 4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ? 
run_timer_softirq+0x1ff/0x29d > Feb 4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ? > smp_apic_timer_interrupt+0x87/0x95 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ? > apic_timer_interrupt+0x13/0x20 > Feb 4 13:12:40 TM_01_C1 <EOI> [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c > Feb 4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]--- > > > Machine is a BGP router with some iproute traffic management.
On Monday 08 February 2010, David Miller wrote:
> From: "Rafael J. Wysocki" <rjw@sisk.pl>
> Date: Mon, 8 Feb 2010 23:51:30 +0100
>
> > On Monday 08 February 2010, Andrew Morton wrote:
> >> On Sat, 6 Feb 2010 14:39:46 GMT
> >> bugzilla-daemon@bugzilla.kernel.org wrote:
> >>
> >> > http://bugzilla.kernel.org/show_bug.cgi?id=15239
> >>
> >> post-2.6.32 regression, backported into -stable :(
> >
> > Added, CCing netdev.
>
> No need, we already worked on this and have a fix, see:
>
> http://marc.info/?l=linux-netdev&m=126565759512718&w=2

Great, thanks!
Handled-By: Eric Dumazet <eric.dumazet@gmail.com>
Patch: http://marc.info/?l=linux-netdev&m=126565759512718&w=2
From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Date: Mon, 8 Feb 2010 15:06:24 -0800 (Pacific Standard Time)

>> > on 2.6.33-rc6-git5 i have this info:
>> > BUG: soft lockup - CPU#1 stuck for 61s!
>> > [events/1:28]
>
> so why is this events/1 thing here?

Jesse, this is taken care of already, see:

http://marc.info/?l=linux-netdev&m=126565759512718&w=2
From: Andrew Morton <akpm@linux-foundation.org>
Date: Mon, 8 Feb 2010 14:15:13 -0800

> That would be bad. That's a regression in 2.6.33-rc, and we've
> backported that regression into 2.6.32.x -stable.

Luckily we have a fix already. :-)
Andrew, thanks.

On Mon, 8 Feb 2010, Andrew Morton wrote:

> > Hello the problem is that after some time of working my router have some
> > performance problems.
> >
> > The problem exist on kernel 2.6.32.7 and 2.6.33-rc6-git5
> > The problem dont exist on 2.6.32.2 kernel.
>
> That would be bad. That's a regression in 2.6.33-rc, and we've
> backported that regression into 2.6.32.x -stable.

There were no changes to e1000e from v2.6.32..v2.6.32.7:
# git whatchanged v2.6.32..v2.6.32.7 drivers/net/e1000e

> > on 2.6.33-rc6-git5 i have this info:
> > BUG: soft lockup - CPU#1 stuck for 61s!
> > [events/1:28]

So why is this events/1 thing here? I've seen this kernel thread hang, related to the ipv6 stack, under heavy incoming packet load as the interface comes up. Was the interface just brought up, or was something changed in the rules?

> > Modules linked in: > > CPU 1 > > Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT > > RIP: 0010:[<ffffffff810a3d89>] [<ffffffff810a3d89>] > > kmem_cache_free+0x11b/0x11c > > RSP: 0018:ffff880028243e50 EFLAGS: 00000292 > > RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0 > > RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680 > > RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c > > R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0 > > R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7 > > FS: 0000000000000000(0000) GS:ffff880028240000(0000) > knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > > Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task > ffff88031f9a4f80) > > Stack: > > ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023 > > <0> 01ffc90000000024 ffff88031fbb4000
ffff88031faa4600 0000000000000040 > > <0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740 > > Call Trace: > > <IRQ> > > [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 > > [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235

clean_tx_irq in e1000e has a break-out condition to keep the transmit cleanup from running forever (if another CPU is transmitting), so it seems that something else must have happened. Can kmem_cache_free sleep or stall on a lock?

> > [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c > > [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d > > [<ffffffff81035311>] ? __do_softirq+0xd7/0x196 > > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > > <EOI> > > [<ffffffff81004599>] ? do_softirq+0x31/0x63 > > [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 > > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > > [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 > > [<ffffffff8136b08c>] ? schedule+0x82c/0x906 > > [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b > > [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d > > [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc > > [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e > > [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc > > [<ffffffff810479bd>] ? kthread+0x79/0x81 > > [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 > > [<ffffffff81047944>] ? kthread+0x0/0x81 > > [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10 > > Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 > 48 > > 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f <c3> 55 48 89 f5 > 53 48 > > 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10 > > Call Trace: > > <IRQ> [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 > > [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235 > > [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c > > [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d > > [<ffffffff81035311>] ?
__do_softirq+0xd7/0x196 > > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > > <EOI> [<ffffffff81004599>] ? do_softirq+0x31/0x63 > > [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 > > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > > [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 > > [<ffffffff8136b08c>] ? schedule+0x82c/0x906 > > [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b > > [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d > > [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc > > [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e > > [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc > > [<ffffffff810479bd>] ? kthread+0x79/0x81 > > [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 > > [<ffffffff81047944>] ? kthread+0x0/0x81 > > [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10 > > > > And on 2.6.32.7 > > Feb 4 13:12:40 TM_01_C1 ------------[ cut here ]------------ > > Feb 4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261 > > dev_watchdog+0x130/0x1d6() > > Feb 4 13:12:40 TM_01_C1 Hardware name: X7DCT > > Feb 4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 > timed > > out > > Feb 4 13:12:40 TM_01_C1 Modules linked in: oprofile > > Feb 4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1 the above doesn't match your statement that it is 2.6.32.7 > > Feb 4 13:12:40 TM_01_C1 Call Trace: > > Feb 4 13:12:40 TM_01_C1 <IRQ> [<ffffffff812fcaf7>] ? > dev_watchdog+0x130/0x1d6 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff81038811>] ? > warn_slowpath_common+0x77/0xa3 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e > > Feb 4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? 
netdev_drivername+0x3b/0x40 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ? > run_timer_softirq+0x1ff/0x29d > > Feb 4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ? > > smp_apic_timer_interrupt+0x87/0x95 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ? > apic_timer_interrupt+0x13/0x20 > > Feb 4 13:12:40 TM_01_C1 <EOI> [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0 > > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c > > Feb 4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]--- > > > > > > Machine is a BGP router with some iproute traffic management.

Please send a full dmesg from boot, and also a full lspci -vvv. We sporadically see issues with some people's use of traffic management on some e1000/e1000e devices, but we have no common reproduction case, so we haven't solved the issue yet. I suspect there is a strange interaction with HTB or one of the other queueing disciplines that causes some odd skb or set of skbs to arrive, but that is just a theory.

I see two different bugs above. One is the tx hang; could it be that you didn't have SOFTLOCKUP detection enabled on 2.6.32? A soft lockup could cause a tx hang. The other bug is the soft lockup itself, and I don't know how the driver could be at fault in that listed stack path. We simply can't loop more than tx_ring->count times, which could never take > 60 seconds unless one of the calls we were making was stalling for some reason.
So your traffic is "heavy" forwarding at the time of the failure; do you have any input on the average packet size? I'm trying to figure out how we might reproduce this.

Jesse
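The average packet size Jesse asks about can be estimated from the interface byte and packet counters in /proc/net/dev. A minimal sketch; the helper name and the eth0 example are mine, not from the thread:

```shell
# avg_pkt_size IFACE: reads /proc/net/dev-format text on stdin and prints
# the average RX/TX packet size for IFACE since boot.
avg_pkt_size() {
    awk -v ifc="$1" '$0 ~ (ifc ":") {
        sub(/^[^:]*:/, "", $0)          # drop the "  eth0:" prefix
        split($0, f, /[ \t]+/)
        i = (f[1] == "") ? 2 : 1        # skip empty field from leading blanks
        rx_b = f[i];   rx_p = f[i+1]    # rx bytes, rx packets
        tx_b = f[i+8]; tx_p = f[i+9]    # tx bytes, tx packets
        if (rx_p > 0) printf "avg rx size: %d bytes\n", rx_b / rx_p
        if (tx_p > 0) printf "avg tx size: %d bytes\n", tx_b / tx_p
    }'
}
```

On a live system: `avg_pkt_size eth0 < /proc/net/dev`. Taking the delta of two samples a few seconds apart gives the average size of the traffic currently flowing rather than the since-boot average.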
On Tuesday 09 February 2010 at 08:19 +0100, Paweł Staszewski wrote:
> On 2010-02-08 23:55, David Miller wrote:
> > From: Andrew Morton <akpm@linux-foundation.org>
> > Date: Mon, 8 Feb 2010 14:15:13 -0800
> >
> >> That would be bad. That's a regression in 2.6.33-rc, and we've
> >> backported that regression into 2.6.32.x -stable.
> >
> > Luckily we have a fix already. :-)
>
> Yes - I made the bug report, and afterwards I also wrote here on the list.
>
> And Eric Dumazet helped to solve the problem with the "route cache" - and the
> load locking onto one CPU after the "route cache disabled" message.
>
> After his patch I don't see any errors in dmesg.

OK, but this is a really old bug (present since 2.6.24), not a 2.6.32/2.6.33 regression...

You were just lucky not to hit it before, probably because you tried different settings.
Reply-To: pstaszewski@itcare.pl

On 2010-02-09 08:36, Eric Dumazet wrote:
> On Tuesday, 9 February 2010 at 08:19 +0100, Paweł Staszewski wrote:
>> On 2010-02-08 23:55, David Miller wrote:
>>> From: Andrew Morton <akpm@linux-foundation.org>
>>> Date: Mon, 8 Feb 2010 14:15:13 -0800
>>>
>>>> That would be bad. That's a regression in 2.6.33-rc, and we've
>>>> backported that regression into 2.6.32.x -stable.
>>>
>>> Luckily we have a fix already. :-)
>>
>> Yes - I made a bug report, and afterwards I also wrote here on the list.
>>
>> And Eric Dumazet helped to solve the problem with the route cache, and
>> with the load locking onto one CPU after the "route cache disabled"
>> message.
>>
>> After his patch I don't see any errors in dmesg.
>
> OK, but this is a real old bug (since 2.6.24), not a 2.6.32/2.6.33
> regression...
>
> You were just lucky not to hit it before, probably because you tried
> different settings.

Yes, I think it was all about the change of settings. I started to search for the best settings, with some experiments about what would happen if...

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Reply-To: pstaszewski@itcare.pl

On 2010-02-09 00:06, Brandeburg, Jesse wrote:
> Andrew, thanks.
>
> On Mon, 8 Feb 2010, Andrew Morton wrote:
>>> Hello, the problem is that after some time of working my router has
>>> some performance problems.
>>>
>>> The problem exists on kernels 2.6.32.7 and 2.6.33-rc6-git5.
>>> The problem does not exist on the 2.6.32.2 kernel.
>>
>> That would be bad. That's a regression in 2.6.33-rc, and we've
>> backported that regression into 2.6.32.x -stable.
>
> there were no changes to e1000e from v2.6.32..v2.6.32.7
> # git whatchanged v2.6.32..v2.6.32.7 drivers/net/e1000e
>
>>> on 2.6.33-rc6-git5 i have this info:
>>> BUG: soft lockup - CPU#1 stuck for 61s! [events/1:28]
>
> so why is this events/1 thing here? I've seen this kernel thread related
> to the ipv6 stack hanging when under heavy incoming packet load as the
> interface comes up. Was the interface just brought up, or did something
> change in the rules?
>
>>> Modules linked in:
>>> CPU 1
>>> Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT
>>> RIP: 0010:[<ffffffff810a3d89>] [<ffffffff810a3d89>] kmem_cache_free+0x11b/0x11c
>>> RSP: 0018:ffff880028243e50 EFLAGS: 00000292
>>> RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0
>>> RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680
>>> RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c
>>> R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0
>>> R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7
>>> FS: 0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
>>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task ffff88031f9a4f80)
>>> Stack:
>>> ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023
>>> <0> 01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040
>>> <0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740
>>> Call Trace:
>>> <IRQ>
>>> [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>>> [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>
> clean_tx_irq in e1000e has a break-out condition to keep the transmit
> cleanup from running forever (if another cpu is transmitting), so it
> seems that something else must have happened. Can kmem_cache_free sleep
> or stall on a lock?
>
>>> [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>>> [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>>> [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>>> [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>> [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>> [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>> <EOI>
>>> [<ffffffff81004599>] ? do_softirq+0x31/0x63
>>> [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>>> [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>> [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>>> [<ffffffff8136b08c>] ? schedule+0x82c/0x906
>>> [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
>>> [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
>>> [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
>>> [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
>>> [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
>>> [<ffffffff810479bd>] ? kthread+0x79/0x81
>>> [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
>>> [<ffffffff81047944>] ? kthread+0x0/0x81
>>> [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
>>> Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 48 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f <c3> 55 48 89 f5 53 48 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10
>>> Call Trace:
>>> <IRQ> [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>>> [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>>> [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>>> [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>>> [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>>> [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>> [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>> [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>> <EOI> [<ffffffff81004599>] ? do_softirq+0x31/0x63
>>> [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>>> [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>> [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>>> [<ffffffff8136b08c>] ? schedule+0x82c/0x906
>>> [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
>>> [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
>>> [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
>>> [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
>>> [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
>>> [<ffffffff810479bd>] ? kthread+0x79/0x81
>>> [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
>>> [<ffffffff81047944>] ? kthread+0x0/0x81
>>> [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
>>>
>>> And on 2.6.32.7:
>>> Feb 4 13:12:40 TM_01_C1 ------------[ cut here ]------------
>>> Feb 4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x130/0x1d6()
>>> Feb 4 13:12:40 TM_01_C1 Hardware name: X7DCT
>>> Feb 4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
>>> Feb 4 13:12:40 TM_01_C1 Modules linked in: oprofile
>>> Feb 4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1
>
> the above doesn't match your statement that it is 2.6.32.7
>
>>> Feb 4 13:12:40 TM_01_C1 Call Trace:
>>> Feb 4 13:12:40 TM_01_C1 <IRQ> [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff81038811>] ? warn_slowpath_common+0x77/0xa3
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ? run_timer_softirq+0x1ff/0x29d
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ? smp_apic_timer_interrupt+0x87/0x95
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ? apic_timer_interrupt+0x13/0x20
>>> Feb 4 13:12:40 TM_01_C1 <EOI> [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0
>>> Feb 4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c
>>> Feb 4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]---
>>>
>>> Machine is a BGP router with some iproute traffic management.
>
> please send a full dmesg from boot. also full lspci -vvv.

dmesg and lspci in attached files.

> We sporadically see issues with some people's use of traffic management
> on some e1000/e devices, but we have no common reproduction case, so we
> haven't solved the issue yet. I suspect there is a strange interaction
> with HTB or one of the other queueing disciplines that causes some odd
> skb or set of skbs to arrive, but that is just a theory.

There are about 10000 iproute filters per interface (hfsc classes).

> I see two different bugs above. One is the tx hang: could it be that you
> didn't have SOFTLOCKUP detection enabled on 2.6.32? A softlockup could
> cause a tx hang.
>
> The other bug is the softlockup itself, and I don't know how the driver
> could be at fault in that stack path. We simply can't loop more than
> tx_ring->count times, which could never take > 60 seconds unless one of
> the calls we were making was getting stalled for some reason.
>
> So your traffic is "heavy" forwarding at the time of the failure; do you
> have any input on average packet size?

avg pkt size is about 811 bytes

> Jesse
Reply-To: pstaszewski@itcare.pl

On 2010-02-08 23:55, David Miller wrote:
> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Mon, 8 Feb 2010 14:15:13 -0800
>
>> That would be bad. That's a regression in 2.6.33-rc, and we've
>> backported that regression into 2.6.32.x -stable.
>
> Luckily we have a fix already. :-)

Yes - I made a bug report, and afterwards I also wrote here on the list.

And Eric Dumazet helped to solve the problem with the route cache, and with the load locking onto one CPU after the "route cache disabled" message.

After his patch I don't see any errors in dmesg. But I think there is a little more in my bug report on bugzilla.

Regards
Pawel
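Why route-cache garbage collection can pin one CPU, as described above, can be pictured with a toy model. This is illustrative only, not the actual net/ipv4/route.c or dst_gc_task() code, and the struct and function names below are invented for the sketch:

```c
#include <stddef.h>

/* One slot in a toy route-cache table. */
struct entry {
    int in_use;   /* slot holds a cached route */
    int expired;  /* route is stale and eligible for reclaim */
};

struct gc_stats {
    size_t visited;  /* slots examined this pass */
    size_t freed;    /* stale routes dropped this pass */
};

/*
 * One garbage-collection pass.  The key property: every slot is
 * visited whether or not anything is reclaimable, so the cost of a
 * pass scales with the cache size.  On a busy BGP router with a large
 * cache, back-to-back passes can keep a single CPU saturated, which
 * is consistent with the dst_gc_task frames in the traces above.
 */
static struct gc_stats gc_pass(struct entry *table, size_t n)
{
    struct gc_stats s = { 0, 0 };
    for (size_t i = 0; i < n; i++) {
        s.visited++;
        if (table[i].in_use && table[i].expired) {
            table[i].in_use = 0;  /* drop the expired route */
            s.freed++;
        }
    }
    return s;
}
```

Note that a second pass over the same table still visits every slot even when there is nothing left to free; the work is proportional to table size, not to the amount of garbage, which is why tuning cache-size settings changes the behavior.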
Not a regression according to comment #11.