Bug 15239 - Problem with network performance after a while.
Summary: Problem with network performance after a while.
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-02-06 14:39 UTC by Pawel Staszewski
Modified: 2010-02-09 12:30 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.33-rc6-git5
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Pawel Staszewski 2010-02-06 14:39:41 UTC
Hello the problem is that after some time of working my router have some performance problems.

The problem exist on kernel 2.6.32.7 and 2.6.33-rc6-git5
The problem dont exist on 2.6.32.2 kernel.

on 2.6.33-rc6-git5 i have this info:BUG: soft lockup - CPU#1 stuck for 61s! [events/1:28]
Modules linked in:
CPU 1
Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT
RIP: 0010:[<ffffffff810a3d89>]  [<ffffffff810a3d89>] kmem_cache_free+0x11b/0x11c
RSP: 0018:ffff880028243e50  EFLAGS: 00000292
RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0
RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680
RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c
R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0
R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7
FS:  0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task ffff88031f9a4f80)
Stack:
 ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023
<0> 01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040
<0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740
Call Trace:
 <IRQ>
 [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
 [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
 [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
 [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
 [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
 <EOI>
 [<ffffffff81004599>] ? do_softirq+0x31/0x63
 [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
 [<ffffffff8136b08c>] ? schedule+0x82c/0x906
 [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
 [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
 [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
 [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
 [<ffffffff810479bd>] ? kthread+0x79/0x81
 [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff81047944>] ? kthread+0x0/0x81
 [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 48 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f <c3> 55 48 89 f5 53 48 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10
Call Trace:
 <IRQ>  [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
 [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
 [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
 [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
 [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
 <EOI>  [<ffffffff81004599>] ? do_softirq+0x31/0x63
 [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
 [<ffffffff8136b08c>] ? schedule+0x82c/0x906
 [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
 [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
 [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
 [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
 [<ffffffff810479bd>] ? kthread+0x79/0x81
 [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff81047944>] ? kthread+0x0/0x81
 [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10

And on 2.6.32.7
Feb  4 13:12:40 TM_01_C1 ------------[ cut here ]------------
Feb  4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x130/0x1d6()
Feb  4 13:12:40 TM_01_C1 Hardware name: X7DCT
Feb  4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Feb  4 13:12:40 TM_01_C1 Modules linked in: oprofile
Feb  4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1
Feb  4 13:12:40 TM_01_C1 Call Trace:
Feb  4 13:12:40 TM_01_C1 <IRQ>  [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb  4 13:12:40 TM_01_C1 [<ffffffff81038811>] ? warn_slowpath_common+0x77/0xa3
Feb  4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59
Feb  4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e
Feb  4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8
Feb  4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40
Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb  4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44
Feb  4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6
Feb  4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ? run_timer_softirq+0x1ff/0x29d
Feb  4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7
Feb  4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
Feb  4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
Feb  4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66
Feb  4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ? smp_apic_timer_interrupt+0x87/0x95
Feb  4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ? apic_timer_interrupt+0x13/0x20
Feb  4 13:12:40 TM_01_C1 <EOI>  [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0
Feb  4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c
Feb  4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]---


Machine is a BGP router with some iproute traffic management.
Comment 1 Pawel Staszewski 2010-02-06 14:42:08 UTC
And info before this soft lockup is:

Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
vlan0811: 5 rebuilds is over limit, route caching disabled
Route hash chain too long!
Adjust your secret_interval!
Comment 2 Pawel Staszewski 2010-02-06 14:51:14 UTC
And also after this soft lockup there is a weird thing that i cant make cpu affinity on my nic.
NIC is blocked then on random CPU and change of /proc/irq/xx/smp_affinity is not helping
Comment 3 Pawel Staszewski 2010-02-07 15:03:22 UTC
Ok sorry aditional info:
there is also problem on kernel 2.6.32

Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
vlan0811: 5 rebuilds is over limit, route caching disabled
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
vlan0811: 13 rebuilds is over limit, route caching disabled
Route hash chain too long!
Adjust your secret_interval!

BUG: soft lockup - CPU#3 stuck for 61s! [events/3:30]
Modules linked in:
CPU 3:
Modules linked in:
Pid: 30, comm: events/3 Not tainted 2.6.32 #1 X7DCT
RIP: 0010:[<ffffffff813095a7>]  [<ffffffff813095a7>] sfq_enqueue+0xec/0x34d
RSP: 0018:ffff8800282c3cf8  EFLAGS: 00000202
RAX: 0000000000000005 RBX: ffff8802f96bc0e0 RCX: 0000000036116e35
RDX: ffff8803021b1830 RSI: 0000000099ee00c1 RDI: 000000000a3c669d
RBP: ffffffff8100b873 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff81311fd5 R11: ffffffff812edc84 R12: ffff8800282c3c70
R13: ffff8802f96bc000 R14: ffff8802f222eec0 R15: ffffffff8101b143
FS:  0000000000000000(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f1540b91f70 CR3: 000000031d7e5000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>  [<ffffffff8130869f>] ? hfsc_enqueue+0x1be/0x727
 [<ffffffff812fc534>] ? eth_header+0x24/0x9c
 [<ffffffff812edf1d>] ? dev_queue_xmit+0x299/0x3b3
 [<ffffffff81319e73>] ? ip_rcv_finish+0x2cf/0x2e9
 [<ffffffff8131a10a>] ? ip_rcv+0x27d/0x2c2
 [<ffffffff8126a540>] ? e1000_clean_rx_irq+0x22e/0x2de
 [<ffffffff8126bbdd>] ? e1000_clean+0x70/0x21c
 [<ffffffff812ed555>] ? net_rx_action+0x71/0x160
 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
 [<ffffffff812f1a6f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
 <EOI>  [<ffffffff8100d645>] ? do_softirq+0x31/0x66
 [<ffffffff8103dcf4>] ? local_bh_enable_ip+0x78/0x89
 [<ffffffff812f1a6f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff812f1b3d>] ? dst_gc_task+0xce/0x1a7
 [<ffffffff813620c2>] ? thread_return+0x3e/0x104
 [<ffffffff813637c8>] ? _spin_lock_irqsave+0x9/0xe
 [<ffffffff8104bf54>] ? worker_thread+0x14c/0x1dc
 [<ffffffff8104f095>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff8104be08>] ? worker_thread+0x0/0x1dc
 [<ffffffff8104ede1>] ? kthread+0x79/0x81
 [<ffffffff8100bd7a>] ? child_rip+0xa/0x20
 [<ffffffff8104ed68>] ? kthread+0x0/0x81
 [<ffffffff8104ed68>] ? kthread+0x0/0x81
 [<ffffffff8100bd70>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#3 stuck for 61s! [events/3:30]
Modules linked in:
CPU 3:
Modules linked in:
Pid: 30, comm: events/3 Not tainted 2.6.32 #1 X7DCT
RIP: 0010:[<ffffffff8126976f>]  [<ffffffff8126976f>] e1000_xmit_frame+0x7d9/0x921
RSP: 0018:ffff8800282c3b98  EFLAGS: 00000206
RAX: 0000000000004100 RBX: ffff880308c740c0 RCX: ffff880316682c00
RDX: ffffc900008ff000 RSI: 00000000000000c0 RDI: 00000000032b0000
RBP: ffffffff8100b873 R08: 0000000042000000 R09: ffff88031e07f240
R10: ffff880316682cc0 R11: 0000000000000000 R12: ffff8800282c3b10
R13: ffff88031e07f240 R14: 0000000000000001 R15: ffffffff8101b143
FS:  0000000000000000(0000) GS:ffff8800282c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f1540b91f70 CR3: 000000031d7e5000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>  [<ffffffff812695ac>] ? e1000_xmit_frame+0x616/0x921
 [<ffffffff812edadc>] ? dev_hard_start_xmit+0x20e/0x2b2
 [<ffffffff812fd117>] ? sch_direct_xmit+0x5e/0x156
 [<ffffffff812edee1>] ? dev_queue_xmit+0x25d/0x3b3
 [<ffffffff813544d4>] ? vlan_dev_hwaccel_hard_start_xmit+0x63/0x7d
 [<ffffffff812edadc>] ? dev_hard_start_xmit+0x20e/0x2b2
 [<ffffffff812fd117>] ? sch_direct_xmit+0x5e/0x156
 [<ffffffff812fd2c8>] ? __qdisc_run+0xb9/0xd8
 [<ffffffff812edf35>] ? dev_queue_xmit+0x2b1/0x3b3
 [<ffffffff81319e73>] ? ip_rcv_finish+0x2cf/0x2e9
 [<ffffffff8131a10a>] ? ip_rcv+0x27d/0x2c2
 [<ffffffff8126a540>] ? e1000_clean_rx_irq+0x22e/0x2de
 [<ffffffff8126bbdd>] ? e1000_clean+0x70/0x21c
 [<ffffffff812ed555>] ? net_rx_action+0x71/0x160
 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
 [<ffffffff812f1a6f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
 <EOI>  [<ffffffff8100d645>] ? do_softirq+0x31/0x66
 [<ffffffff8103dcf4>] ? local_bh_enable_ip+0x78/0x89
 [<ffffffff812f1a6f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff812f1b3d>] ? dst_gc_task+0xce/0x1a7
 [<ffffffff813620c2>] ? thread_return+0x3e/0x104
 [<ffffffff813637c8>] ? _spin_lock_irqsave+0x9/0xe
 [<ffffffff8104bf54>] ? worker_thread+0x14c/0x1dc
 [<ffffffff8104f095>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff8104be08>] ? worker_thread+0x0/0x1dc
 [<ffffffff8104ede1>] ? kthread+0x79/0x81
 [<ffffffff8100bd7a>] ? child_rip+0xa/0x20
 [<ffffffff8104ed68>] ? kthread+0x0/0x81
 [<ffffffff8104ed68>] ? kthread+0x0/0x81
 [<ffffffff8100bd70>] ? child_rip+0x0/0x20
dst cache overflow
Comment 4 Pawel Staszewski 2010-02-07 15:26:01 UTC
Aditional info about host:
This is linux gentoo x86_64 - with fib_trie
more than 8000 iproute filters on interface with hfsc classes

Machine have this network interfaces:
04:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
05:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

Topology:
                     
Clients <- ibgp -> [ vlan0811@eth0 + vlan0508@eth0 - BGP process - eth1 ] <- ebgp -> Internet 

Some fibtriestats:
cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
        Aver depth:     2.58
        Max depth:      6
        Leaves:         297566
        Prefixes:       312528
        Internal nodes: 70105
          1: 35598  2: 15065  3: 10956  4: 4743  5: 2311  6: 955  7: 365  8: 110  9: 1  16: 1
        Pointers: 570992
Null ptrs: 203322
Total size: 37010  kB

Counters:
---------
gets = 9298381
backtracks = 358013
semantic match passed = 9194516
semantic match miss = 97
null node hit= 4270863
skipped node resize = 0

Local:
        Aver depth:     3.66
        Max depth:      4
        Leaves:         15
        Prefixes:       16
        Internal nodes: 11
          1: 8  2: 3
        Pointers: 28
Null ptrs: 3
Total size: 3  kB

Counters:
---------
gets = 9358237
backtracks = 9128813
semantic match passed = 64893
semantic match miss = 0
null node hit= 2102
skipped node resize = 0


Meminfo:
cat /proc/meminfo
MemTotal:       12309556 kB
MemFree:        11466272 kB
Buffers:            4356 kB
Cached:            28672 kB
SwapCached:            0 kB
Active:           276288 kB
Inactive:          17988 kB
Active(anon):     261404 kB
Inactive(anon):        0 kB
Active(file):      14884 kB
Inactive(file):    17988 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        261252 kB
Mapped:             4056 kB
Shmem:               156 kB
Slab:             527288 kB
SReclaimable:       8100 kB
SUnreclaim:       519188 kB
KernelStack:         992 kB
PageTables:         1604 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6154776 kB
Committed_AS:     269144 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        8944 kB
VmallocChunk:   34359724163 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        3392 kB
DirectMap2M:    12578816 kB

Interrupts:
cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:         44          0          0          0          0          3          0          0   IO-APIC-edge      timer
  1:          0          1          0          0          0          0          1          0   IO-APIC-edge      i8042
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 14:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide0
 15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide1
 20:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 21:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
 22:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
 23:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb4
 29:       2020        136        135        135        134        138        136        136   PCI-MSI-edge      ahci
 30:     267279     272260    1178931     268413     274942     274003     270552     268966   PCI-MSI-edge      eth0
 31:     176230     785586     276756     175203     168674     169608     173063     174649   PCI-MSI-edge      eth1
 32:          0          0          0          0          0          1          1          0   PCI-MSI-edge
 33:          0          1          0          0          0          0          0          1   PCI-MSI-edge
 34:          0          0          1          0          0          1          0          0   PCI-MSI-edge
 35:          1          0          0          1          0          0          0          0   PCI-MSI-edge
 36:          0          0          0          0          1          0          0          1   PCI-MSI-edge
 37:          1          0          0          0          0          0          1          0   PCI-MSI-edge
 38:          0          0          0          1          1          0          0          0   PCI-MSI-edge
 39:          0          1          1          0          0          0          0          0   PCI-MSI-edge
NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:    2955414    3213591    4900145    3163096    3132839    3396767    3225617    3510714   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:          0          0          0          0          0          0          0          0   Performance monitoring interrupts
PND:          0          0          0          0          0          0          0          0   Performance pending work
RES:        673        768       2452        535        584        674        985        979   Rescheduling interrupts
CAL:        378        276         92         86        341        347         92         89   Function call interrupts
TLB:         27         25         23         26         29         25         28         33   TLB shootdowns
ERR:          0
MIS:          0
Comment 5 Andrew Morton 2010-02-08 22:16:07 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 6 Feb 2010 14:39:46 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=15239
> 
>            Summary: Problem with network performance after a while.
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.33-rc6-git5
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: pstaszewski@artcom.pl
>         Regression: No
> 
> 
> Hello the problem is that after some time of working my router have some
> performance problems.
> 
> The problem exist on kernel 2.6.32.7 and 2.6.33-rc6-git5
> The problem dont exist on 2.6.32.2 kernel.

That would be bad.  That's a regression in 2.6.33-rc, and we've
backported that regression into 2.6.32.x -stable.

> on 2.6.33-rc6-git5 i have this info:BUG: soft lockup - CPU#1 stuck for 61s!
> [events/1:28]
> Modules linked in:
> CPU 1
> Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT
> RIP: 0010:[<ffffffff810a3d89>]  [<ffffffff810a3d89>]
> kmem_cache_free+0x11b/0x11c
> RSP: 0018:ffff880028243e50  EFLAGS: 00000292
> RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0
> RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680
> RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c
> R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0
> R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7
> FS:  0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task
> ffff88031f9a4f80)
> Stack:
>  ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023
> <0> 01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040
> <0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740
> Call Trace:
>  <IRQ>
>  [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>  [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>  [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>  [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>  [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>  <EOI>
>  [<ffffffff81004599>] ? do_softirq+0x31/0x63
>  [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>  [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>  [<ffffffff8136b08c>] ? schedule+0x82c/0x906
>  [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
>  [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
>  [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
>  [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
>  [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
>  [<ffffffff810479bd>] ? kthread+0x79/0x81
>  [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
>  [<ffffffff81047944>] ? kthread+0x0/0x81
>  [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
> Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 48
> 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f <c3> 55 48 89 f5 53
> 48
> 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10
> Call Trace:
>  <IRQ>  [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>  [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>  [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>  [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>  [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>  <EOI>  [<ffffffff81004599>] ? do_softirq+0x31/0x63
>  [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>  [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>  [<ffffffff8136b08c>] ? schedule+0x82c/0x906
>  [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
>  [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
>  [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
>  [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
>  [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
>  [<ffffffff810479bd>] ? kthread+0x79/0x81
>  [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
>  [<ffffffff81047944>] ? kthread+0x0/0x81
>  [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
> 
> And on 2.6.32.7
> Feb  4 13:12:40 TM_01_C1 ------------[ cut here ]------------
> Feb  4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261
> dev_watchdog+0x130/0x1d6()
> Feb  4 13:12:40 TM_01_C1 Hardware name: X7DCT
> Feb  4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0
> timed
> out
> Feb  4 13:12:40 TM_01_C1 Modules linked in: oprofile
> Feb  4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1
> Feb  4 13:12:40 TM_01_C1 Call Trace:
> Feb  4 13:12:40 TM_01_C1 <IRQ>  [<ffffffff812fcaf7>] ?
> dev_watchdog+0x130/0x1d6
> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
> Feb  4 13:12:40 TM_01_C1 [<ffffffff81038811>] ?
> warn_slowpath_common+0x77/0xa3
> Feb  4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59
> Feb  4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e
> Feb  4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8
> Feb  4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40
> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
> Feb  4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44
> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6
> Feb  4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ? run_timer_softirq+0x1ff/0x29d
> Feb  4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7
> Feb  4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66
> Feb  4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ?
> smp_apic_timer_interrupt+0x87/0x95
> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ?
> apic_timer_interrupt+0x13/0x20
> Feb  4 13:12:40 TM_01_C1 <EOI>  [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0
> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c
> Feb  4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]---
> 
> 
> Machine is a BGP router with some iproute traffic management.
Comment 6 Rafael J. Wysocki 2010-02-08 23:05:01 UTC
On Monday 08 February 2010, David Miller wrote:
> From: "Rafael J. Wysocki" <rjw@sisk.pl>
> Date: Mon, 8 Feb 2010 23:51:30 +0100
> 
> > On Monday 08 February 2010, Andrew Morton wrote:
> >> On Sat, 6 Feb 2010 14:39:46 GMT
> >> bugzilla-daemon@bugzilla.kernel.org wrote:
> >> 
> >> > http://bugzilla.kernel.org/show_bug.cgi?id=15239
> >> 
> >> post-2.6.32 regression, backported into -stable :(
> > 
> > Added, CCing netdev.
> 
> No need, we already worked on this and have a fix,
> see:
> 
> http://marc.info/?l=linux-netdev&m=126565759512718&w=2

Great, thanks!
Comment 7 Rafael J. Wysocki 2010-02-08 23:07:37 UTC
Handled-By : Eric Dumazet <eric.dumazet@gmail.com>
Patch : http://marc.info/?l=linux-netdev&m=126565759512718&w=2
Comment 8 David S. Miller 2010-02-08 23:31:33 UTC
From: "Brandeburg, Jesse" <jesse.brandeburg@intel.com>
Date: Mon, 8 Feb 2010 15:06:24 -0800 (Pacific Standard Time)

>> > on 2.6.33-rc6-git5 i have this info:
> 
>> > BUG: soft lockup - CPU#1 stuck for 61s!
>> > [events/1:28]
> 
> so why is this events/1 thing here?

Jesse, this is taken care of already, see:

http://marc.info/?l=linux-netdev&m=126565759512718&w=2
Comment 9 David S. Miller 2010-02-08 23:51:32 UTC
From: Andrew Morton <akpm@linux-foundation.org>
Date: Mon, 8 Feb 2010 14:15:13 -0800

> That would be bad.  That's a regression in 2.6.33-rc, and we've
> backported that regression into 2.6.32.x -stable.

Luckily we have a fix already. :-)
Comment 10 Jesse Brandeburg 2010-02-09 00:20:24 UTC
Andrew, thanks.

On Mon, 8 Feb 2010, Andrew Morton wrote:
> > Hello the problem is that after some time of working my router have some
> > performance problems.
> > 
> > The problem exist on kernel 2.6.32.7 and 2.6.33-rc6-git5
> > The problem dont exist on 2.6.32.2 kernel.
> 
> That would be bad.  That's a regression in 2.6.33-rc, and we've
> backported that regression into 2.6.32.x -stable.

there were no changes to e1000e from v2.6.32..v2.6.32.7
# git whatchanged v2.6.32..v2.6.32.7 drivers/net/e1000e

 
> > on 2.6.33-rc6-git5 i have this info:

> > BUG: soft lockup - CPU#1 stuck for 61s!
> > [events/1:28]

so why is this events/1 thing here?  I've see this kernel thread related 
to ipv6 stack hanging when under heavy incoming packet load as the 
interface comes up.  Was the interface just brought up or something 
changed in rules or something?

> > Modules linked in:
> > CPU 1
> > Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT
> > RIP: 0010:[<ffffffff810a3d89>]  [<ffffffff810a3d89>]
> > kmem_cache_free+0x11b/0x11c
> > RSP: 0018:ffff880028243e50  EFLAGS: 00000292
> > RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0
> > RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680
> > RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c
> > R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0
> > R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7
> > FS:  0000000000000000(0000) GS:ffff880028240000(0000)
> knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task
> ffff88031f9a4f80)
> > Stack:
> >  ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023
> > <0> 01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040
> > <0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740
> > Call Trace:
> >  <IRQ>
> >  [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
> >  [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235

clean_tx_irq in e1000e has a break-out condition to keep the transmit 
cleanup for running forever (if another cpu is transmitting) so it seems 
that something else must have happened.  Can kmem_cache_free sleep or 
stall on a lock?


> >  [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
> >  [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
> >  [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
> >  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
> >  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
> >  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
> >  <EOI>
> >  [<ffffffff81004599>] ? do_softirq+0x31/0x63
> >  [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
> >  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
> >  [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
> >  [<ffffffff8136b08c>] ? schedule+0x82c/0x906
> >  [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
> >  [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
> >  [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
> >  [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
> >  [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
> >  [<ffffffff810479bd>] ? kthread+0x79/0x81
> >  [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
> >  [<ffffffff81047944>] ? kthread+0x0/0x81
> >  [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
> > Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08
> 48
> > 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f <c3> 55 48 89 f5
> 53 48
> > 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10
> > Call Trace:
> >  <IRQ>  [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
> >  [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
> >  [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
> >  [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
> >  [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
> >  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
> >  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
> >  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
> >  <EOI>  [<ffffffff81004599>] ? do_softirq+0x31/0x63
> >  [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
> >  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
> >  [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
> >  [<ffffffff8136b08c>] ? schedule+0x82c/0x906
> >  [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
> >  [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
> >  [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
> >  [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
> >  [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
> >  [<ffffffff810479bd>] ? kthread+0x79/0x81
> >  [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
> >  [<ffffffff81047944>] ? kthread+0x0/0x81
> >  [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
> > 
> > And on 2.6.32.7
> > Feb  4 13:12:40 TM_01_C1 ------------[ cut here ]------------
> > Feb  4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261
> > dev_watchdog+0x130/0x1d6()
> > Feb  4 13:12:40 TM_01_C1 Hardware name: X7DCT
> > Feb  4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0
> timed
> > out
> > Feb  4 13:12:40 TM_01_C1 Modules linked in: oprofile
> > Feb  4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1

the above doesn't match your statement that it is 2.6.32.7

> > Feb  4 13:12:40 TM_01_C1 Call Trace:
> > Feb  4 13:12:40 TM_01_C1 <IRQ>  [<ffffffff812fcaf7>] ?
> dev_watchdog+0x130/0x1d6
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff81038811>] ?
> warn_slowpath_common+0x77/0xa3
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ?
> run_timer_softirq+0x1ff/0x29d
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ?
> > smp_apic_timer_interrupt+0x87/0x95
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ?
> apic_timer_interrupt+0x13/0x20
> > Feb  4 13:12:40 TM_01_C1 <EOI>  [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0
> > Feb  4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c
> > Feb  4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]---
> > 
> > 
> > Machine is a BGP router with some iproute traffic management.

please send a full dmesg from boot.  also full lspci -vvv.

We sporadically see some issue with some people's use of traffic 
management with some e1000/e devices, we have no repro or common 
reproduction case, so haven't solved the issue yet.  I suspect there is a 
strange interaction with HTB or one of the other queueing disciplines that 
causes some odd skb or set of skb's to arrive, but that is just a theory.

I see two different bugs above, one is the tx hang, could it be that you 
didn't have SOFTLOCKUP detection enabled on 2.6.32?  A softlockup could 
cause a tx hang.

The other bug is the softlockup and I don't know how the driver could be 
at fault in that listed stack path.  We simply can't loop more than 
tx_ring->count times, which could never take > 60 seconds unless one of 
the calls we were making was getting stalled for some reason.

So your traffic is "heavy" forwarding at the time of the failure, do you 
have any input on average packet size?  I'm trying to figure out how we 
might reproduce this.

Jesse
Comment 11 Eric Dumazet 2010-02-09 07:37:01 UTC
Le mardi 09 février 2010 à 08:19 +0100, Paweł Staszewski a écrit :
> W dniu 2010-02-08 23:55, David Miller pisze:
> > From: Andrew Morton<akpm@linux-foundation.org>
> > Date: Mon, 8 Feb 2010 14:15:13 -0800
> >
> >    
> >> That would be bad.  That's a regression in 2.6.33-rc, and we've
> >> backported that regression into 2.6.32.x -stable.
> >>      
> > Luckily we have a fix already. :-)
> >    
> 
> Yes - i make bugreport and after i write also here on the list
> 
> And Eric Dumazet help to solve problem with "route cache" - and locking 
> load on one cpu after information about "route cache disabled".
> 
> After his patch i don't see any errors in dmesg.
> 

OK, but this is real old bug (since 2.6.24), not a 2.6.32/2.6.33
regression...

You were just lucky not hitting it before, probably because you tried
different settings.
Comment 12 Anonymous Emailer 2010-02-09 07:53:27 UTC
Reply-To: pstaszewski@itcare.pl

W dniu 2010-02-09 08:36, Eric Dumazet pisze:
> Le mardi 09 février 2010 à 08:19 +0100, Paweł Staszewski a écrit :
>    
>> W dniu 2010-02-08 23:55, David Miller pisze:
>>      
>>> From: Andrew Morton<akpm@linux-foundation.org>
>>> Date: Mon, 8 Feb 2010 14:15:13 -0800
>>>
>>>
>>>        
>>>> That would be bad.  That's a regression in 2.6.33-rc, and we've
>>>> backported that regression into 2.6.32.x -stable.
>>>>
>>>>          
>>> Luckily we have a fix already. :-)
>>>
>>>        
>> Yes - i make bugreport and after i write also here on the list
>>
>> And Eric Dumazet help to solve problem with "route cache" - and locking
>> load on one cpu after information about "route cache disabled".
>>
>> After his patch i don't see any errors in dmesg.
>>
>>      
> OK, but this is real old bug (since 2.6.24), not a 2.6.32/2.6.33
> regression...
>
> You were just lucky not hitting it before, probably because you tried
> different settings.
>
>
>    
Yes i think it was all about change of settings.
I start to search what settings will be the best - with some experiments 
- about what will happen if....



> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
Comment 13 Anonymous Emailer 2010-02-09 07:54:05 UTC
Reply-To: pstaszewski@itcare.pl

W dniu 2010-02-09 08:36, Eric Dumazet pisze:
> Le mardi 09 février 2010 à 08:19 +0100, Paweł Staszewski a écrit :
>    
>> W dniu 2010-02-08 23:55, David Miller pisze:
>>      
>>> From: Andrew Morton<akpm@linux-foundation.org>
>>> Date: Mon, 8 Feb 2010 14:15:13 -0800
>>>
>>>
>>>        
>>>> That would be bad.  That's a regression in 2.6.33-rc, and we've
>>>> backported that regression into 2.6.32.x -stable.
>>>>
>>>>          
>>> Luckily we have a fix already. :-)
>>>
>>>        
>> Yes - i make bugreport and after i write also here on the list
>>
>> And Eric Dumazet help to solve problem with "route cache" - and locking
>> load on one cpu after information about "route cache disabled".
>>
>> After his patch i don't see any errors in dmesg.
>>
>>      
> OK, but this is real old bug (since 2.6.24), not a 2.6.32/2.6.33
> regression...
>
> You were just lucky not hitting it before, probably because you tried
> different settings.
>
>
>    
Yes i think it was all about change of settings.
I start to search what settings will be the best - with some experiments 
- about what will happen if....



> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
Comment 14 Anonymous Emailer 2010-02-09 08:04:47 UTC
Reply-To: pstaszewski@itcare.pl

W dniu 2010-02-09 00:06, Brandeburg, Jesse pisze:
> Andrew, thanks.
>
> On Mon, 8 Feb 2010, Andrew Morton wrote:
>    
>>> Hello the problem is that after some time of working my router have some
>>> performance problems.
>>>
>>> The problem exist on kernel 2.6.32.7 and 2.6.33-rc6-git5
>>> The problem dont exist on 2.6.32.2 kernel.
>>>        
>> That would be bad.  That's a regression in 2.6.33-rc, and we've
>> backported that regression into 2.6.32.x -stable.
>>      
> there were no changes to e1000e from v2.6.32..v2.6.32.7
> # git whatchanged v2.6.32..v2.6.32.7 drivers/net/e1000e
>
>
>    
>>> on 2.6.33-rc6-git5 i have this info:
>>>        
>    
>>> BUG: soft lockup - CPU#1 stuck for 61s!
>>> [events/1:28]
>>>        
> so why is this events/1 thing here?  I've see this kernel thread related
> to ipv6 stack hanging when under heavy incoming packet load as the
> interface comes up.  Was the interface just brought up or something
> changed in rules or something?
>
>    
>>> Modules linked in:
>>> CPU 1
>>> Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT
>>> RIP: 0010:[<ffffffff810a3d89>]  [<ffffffff810a3d89>]
>>> kmem_cache_free+0x11b/0x11c
>>> RSP: 0018:ffff880028243e50  EFLAGS: 00000292
>>> RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0
>>> RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680
>>> RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c
>>> R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0
>>> R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7
>>> FS:  0000000000000000(0000) GS:ffff880028240000(0000)
>>> knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task
>>> ffff88031f9a4f80)
>>> Stack:
>>>   ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023
>>> <0>  01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040
>>> <0>  0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740
>>> Call Trace:
>>>   <IRQ>
>>>   [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>>>   [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>>>        
> clean_tx_irq in e1000e has a break-out condition to keep the transmit
> cleanup for running forever (if another cpu is transmitting) so it seems
> that something else must have happened.  Can kmem_cache_free sleep or
> stall on a lock?
>
>
>    
>>>   [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>>>   [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>>>   [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>>>   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>>   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>>   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>>   <EOI>
>>>   [<ffffffff81004599>] ? do_softirq+0x31/0x63
>>>   [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>>>   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>>   [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>>>   [<ffffffff8136b08c>] ? schedule+0x82c/0x906
>>>   [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
>>>   [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
>>>   [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
>>>   [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
>>>   [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
>>>   [<ffffffff810479bd>] ? kthread+0x79/0x81
>>>   [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
>>>   [<ffffffff81047944>] ? kthread+0x0/0x81
>>>   [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
>>> Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08
>>> 48
>>> 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f<c3>  55 48 89 f5
>>> 53 48
>>> 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10
>>> Call Trace:
>>>   <IRQ>   [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>>>   [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>>>   [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>>>   [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>>>   [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>>>   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>>   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>>   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>>   <EOI>   [<ffffffff81004599>] ? do_softirq+0x31/0x63
>>>   [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>>>   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>>   [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>>>   [<ffffffff8136b08c>] ? schedule+0x82c/0x906
>>>   [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
>>>   [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
>>>   [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
>>>   [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
>>>   [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
>>>   [<ffffffff810479bd>] ? kthread+0x79/0x81
>>>   [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
>>>   [<ffffffff81047944>] ? kthread+0x0/0x81
>>>   [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
>>>
>>> And on 2.6.32.7
>>> Feb  4 13:12:40 TM_01_C1 ------------[ cut here ]------------
>>> Feb  4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261
>>> dev_watchdog+0x130/0x1d6()
>>> Feb  4 13:12:40 TM_01_C1 Hardware name: X7DCT
>>> Feb  4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0
>>> timed
>>> out
>>> Feb  4 13:12:40 TM_01_C1 Modules linked in: oprofile
>>> Feb  4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1
>>>        
> the above doesn't match your statement that it is 2.6.32.7
>
>    
>>> Feb  4 13:12:40 TM_01_C1 Call Trace:
>>> Feb  4 13:12:40 TM_01_C1<IRQ>   [<ffffffff812fcaf7>] ?
>>> dev_watchdog+0x130/0x1d6
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff81038811>] ?
>>> warn_slowpath_common+0x77/0xa3
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ?
>>> run_timer_softirq+0x1ff/0x29d
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ?
>>> smp_apic_timer_interrupt+0x87/0x95
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ?
>>> apic_timer_interrupt+0x13/0x20
>>> Feb  4 13:12:40 TM_01_C1<EOI>   [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c
>>> Feb  4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]---
>>>
>>>
>>> Machine is a BGP router with some iproute traffic management.
>>>        
> please send a full dmesg from boot.  also full lspci -vvv.
>
>    
dmesg and lspci in attached files.
> We sporadically see some issue with some people's use of traffic
> management with some e1000/e devices, we have no repro or common
> reproduction case, so haven't solved the issue yet.  I suspect there is a
> strange interaction with HTB or one of the other queueing disciplines that
> causes some odd skb or set of skb's to arrive, but that is just a theory.
>
>    
there is about 10000 iproute filters per interface. (hfsc classes)
> I see two different bugs above, one is the tx hang, could it be that you
> didn't have SOFTLOCKUP detection enabled on 2.6.32?  A softlockup could
> cause a tx hang.
>
> The other bug is the softlockup and I don't know how the driver could be
> at fault in that listed stack path.  We simply can't loop more than
> tx_ring->count times, which could never take>  60 seconds unless one of
> the calls we were making was getting stalled for some reason.
>
> So your traffic is "heavy" forwarding at the time of the failure, do you
> have any input on average packet size?  I'm trying to figure out how we
> might reproduce this.
>
>    
avg pkt size is about 811 bytes

> Jesse
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
Comment 15 Anonymous Emailer 2010-02-09 08:05:10 UTC
Reply-To: pstaszewski@itcare.pl

W dniu 2010-02-09 00:06, Brandeburg, Jesse pisze:
> Andrew, thanks.
>
> On Mon, 8 Feb 2010, Andrew Morton wrote:
>    
>>> Hello the problem is that after some time of working my router have some
>>> performance problems.
>>>
>>> The problem exist on kernel 2.6.32.7 and 2.6.33-rc6-git5
>>> The problem dont exist on 2.6.32.2 kernel.
>>>        
>> That would be bad.  That's a regression in 2.6.33-rc, and we've
>> backported that regression into 2.6.32.x -stable.
>>      
> there were no changes to e1000e from v2.6.32..v2.6.32.7
> # git whatchanged v2.6.32..v2.6.32.7 drivers/net/e1000e
>
>
>    
>>> on 2.6.33-rc6-git5 i have this info:
>>>        
>    
>>> BUG: soft lockup - CPU#1 stuck for 61s!
>>> [events/1:28]
>>>        
> so why is this events/1 thing here?  I've see this kernel thread related
> to ipv6 stack hanging when under heavy incoming packet load as the
> interface comes up.  Was the interface just brought up or something
> changed in rules or something?
>
>    
>>> Modules linked in:
>>> CPU 1
>>> Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT
>>> RIP: 0010:[<ffffffff810a3d89>]  [<ffffffff810a3d89>]
>>> kmem_cache_free+0x11b/0x11c
>>> RSP: 0018:ffff880028243e50  EFLAGS: 00000292
>>> RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0
>>> RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680
>>> RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c
>>> R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0
>>> R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7
>>> FS:  0000000000000000(0000) GS:ffff880028240000(0000)
>>> knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0
>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task
>>> ffff88031f9a4f80)
>>> Stack:
>>>   ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023
>>> <0>  01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040
>>> <0>  0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740
>>> Call Trace:
>>>   <IRQ>
>>>   [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>>>   [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>>>        
> clean_tx_irq in e1000e has a break-out condition to keep the transmit
> cleanup for running forever (if another cpu is transmitting) so it seems
> that something else must have happened.  Can kmem_cache_free sleep or
> stall on a lock?
>
>
>    
>>>   [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>>>   [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>>>   [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>>>   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>>   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>>   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>>   <EOI>
>>>   [<ffffffff81004599>] ? do_softirq+0x31/0x63
>>>   [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>>>   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>>   [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>>>   [<ffffffff8136b08c>] ? schedule+0x82c/0x906
>>>   [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
>>>   [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
>>>   [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
>>>   [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
>>>   [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
>>>   [<ffffffff810479bd>] ? kthread+0x79/0x81
>>>   [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
>>>   [<ffffffff81047944>] ? kthread+0x0/0x81
>>>   [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
>>> Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08
>>> 48
>>> 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f<c3>  55 48 89 f5
>>> 53 48
>>> 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10
>>> Call Trace:
>>>   <IRQ>   [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>>>   [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>>>   [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>>>   [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>>>   [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>>>   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>>   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>>   [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>>>   <EOI>   [<ffffffff81004599>] ? do_softirq+0x31/0x63
>>>   [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>>>   [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>>>   [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>>>   [<ffffffff8136b08c>] ? schedule+0x82c/0x906
>>>   [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
>>>   [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
>>>   [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
>>>   [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
>>>   [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
>>>   [<ffffffff810479bd>] ? kthread+0x79/0x81
>>>   [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
>>>   [<ffffffff81047944>] ? kthread+0x0/0x81
>>>   [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
>>>
>>> And on 2.6.32.7
>>> Feb  4 13:12:40 TM_01_C1 ------------[ cut here ]------------
>>> Feb  4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261
>>> dev_watchdog+0x130/0x1d6()
>>> Feb  4 13:12:40 TM_01_C1 Hardware name: X7DCT
>>> Feb  4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0
>>> timed
>>> out
>>> Feb  4 13:12:40 TM_01_C1 Modules linked in: oprofile
>>> Feb  4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1
>>>        
> the above doesn't match your statement that it is 2.6.32.7
>
>    
>>> Feb  4 13:12:40 TM_01_C1 Call Trace:
>>> Feb  4 13:12:40 TM_01_C1<IRQ>   [<ffffffff812fcaf7>] ?
>>> dev_watchdog+0x130/0x1d6
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff81038811>] ?
>>> warn_slowpath_common+0x77/0xa3
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ?
>>> run_timer_softirq+0x1ff/0x29d
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ?
>>> smp_apic_timer_interrupt+0x87/0x95
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ?
>>> apic_timer_interrupt+0x13/0x20
>>> Feb  4 13:12:40 TM_01_C1<EOI>   [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0
>>> Feb  4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c
>>> Feb  4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]---
>>>
>>>
>>> Machine is a BGP router with some iproute traffic management.
>>>        
> please send a full dmesg from boot.  also full lspci -vvv.
>
>    
dmesg and lspci in attached files.
> We sporadically see some issue with some people's use of traffic
> management with some e1000/e devices, we have no repro or common
> reproduction case, so haven't solved the issue yet.  I suspect there is a
> strange interaction with HTB or one of the other queueing disciplines that
> causes some odd skb or set of skb's to arrive, but that is just a theory.
>
>    
there is about 10000 iproute filters per interface. (hfsc classes)
> I see two different bugs above, one is the tx hang, could it be that you
> didn't have SOFTLOCKUP detection enabled on 2.6.32?  A softlockup could
> cause a tx hang.
>
> The other bug is the softlockup and I don't know how the driver could be
> at fault in that listed stack path.  We simply can't loop more than
> tx_ring->count times, which could never take>  60 seconds unless one of
> the calls we were making was getting stalled for some reason.
>
> So your traffic is "heavy" forwarding at the time of the failure, do you
> have any input on average packet size?  I'm trying to figure out how we
> might reproduce this.
>
>    
avg pkt size is about 811 bytes

> Jesse
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
Comment 16 Anonymous Emailer 2010-02-09 08:20:43 UTC
Reply-To: pstaszewski@itcare.pl

W dniu 2010-02-08 23:55, David Miller pisze:
> From: Andrew Morton<akpm@linux-foundation.org>
> Date: Mon, 8 Feb 2010 14:15:13 -0800
>
>    
>> That would be bad.  That's a regression in 2.6.33-rc, and we've
>> backported that regression into 2.6.32.x -stable.
>>      
> Luckily we have a fix already. :-)
>    

Yes - i make bugreport and after i write also here on the list

And Eric Dumazet help to solve problem with "route cache" - and locking 
load on one cpu after information about "route cache disabled".

After his patch i don't see any errors in dmesg.


But i think there is a little more in my bugreport on bugzilla.



Regards
Pawel

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
Comment 17 Anonymous Emailer 2010-02-09 08:21:21 UTC
Reply-To: pstaszewski@itcare.pl

W dniu 2010-02-08 23:55, David Miller pisze:
> From: Andrew Morton<akpm@linux-foundation.org>
> Date: Mon, 8 Feb 2010 14:15:13 -0800
>
>    
>> That would be bad.  That's a regression in 2.6.33-rc, and we've
>> backported that regression into 2.6.32.x -stable.
>>      
> Luckily we have a fix already. :-)
>    

Yes - i make bugreport and after i write also here on the list

And Eric Dumazet help to solve problem with "route cache" - and locking 
load on one cpu after information about "route cache disabled".

After his patch i don't see any errors in dmesg.


But i think there is a little more in my bugreport on bugzilla.



Regards
Pawel

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
Comment 18 Rafael J. Wysocki 2010-02-09 12:30:24 UTC
Not a regression according to comment #11.

Note You need to log in before you can comment on or make changes to this bug.