Bug 6682 - BUG: soft lockup detected on CPU#0! / ksoftirqd takes 100% CPU
Summary: BUG: soft lockup detected on CPU#0! / ksoftirqd takes 100% CPU
Status: CLOSED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 normal
Assignee: Arnaldo Carvalho de Melo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-06-12 16:07 UTC by Michal Margula
Modified: 2008-03-24 18:26 UTC (History)
3 users

See Also:
Kernel Version: 2.6.15.6
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description Michal Margula 2006-06-12 16:07:20 UTC
Most recent kernel where this bug did not occur: (unknown)
Distribution: Gentoo
Hardware Environment: 2x Xeon 2.66, 1 GB RAM, NICs: 2 x e1000 and one dual-port
e100. Based on the Intel E7501 architecture (2U rack Intel chassis).
Software Environment: quagga 0.98.6
Problem Description: ksoftirqd/0 takes 100% of the CPU. Further investigation shows
no sign of a network flood or anything similar (and 2 of the 3 NICs are e1000 with
NAPI). Occasionally there are "BUG: soft lockup detected on CPU#0!" messages.


Steps to reproduce:

There is no simple way to reproduce. I think that everything started when we
attached a second provider with BGP support. We are using quagga, which injects
about 186,000 routes into the kernel. When running for a while (at least a few hours,
sometimes a day) we get 100% usage on ksoftirqd/0 and the following messages in the logs:

BUG: soft lockup detected on CPU#0!

Pid: 6506, comm:                zebra
EIP: 0060:[<c027f6fd>] CPU: 0
EIP is at _spin_lock+0x7/0xf
 EFLAGS: 00000286    Not tainted  (2.6.15.6)
EAX: f6203180 EBX: e6fbf000 ECX: 00000000 EDX: f6bec000
ESI: f6203000 EDI: eddb4b80 EBP: fffffff4 DS: 007b ES: 007b
CR0: 8005003b CR2: aca6dff0 CR3: 361ad000 CR4: 000006d0
 [<c02396f9>] dev_queue_xmit+0xe0/0x203
 [<c0250de8>] ip_output+0x1e1/0x237
 [<c024f3f5>] ip_forward+0x181/0x1df
 [<c024e21a>] ip_rcv+0x40c/0x485
 [<c0239bd0>] netif_receive_skb+0x12f/0x165
 [<f885aa4c>] e1000_clean_rx_irq+0x389/0x410 [e1000]
 [<f885a1ca>] e1000_clean+0x94/0x12f [e1000]
 [<c0239d5a>] net_rx_action+0x69/0xf0
 [<c011a305>] __do_softirq+0x55/0xbd
 [<c011a39a>] do_softirq+0x2d/0x31
 [<c011a3f8>] local_bh_enable+0x5a/0x65
 [<c024a0a1>] rt_run_flush+0x5f/0x80
 [<c027623f>] fn_hash_insert+0x352/0x39f
 [<c027364c>] inet_rtm_newroute+0x57/0x62
 [<c02413ed>] rtnetlink_rcv_msg+0x1a8/0x1cb
 [<c0241245>] rtnetlink_rcv_msg+0x0/0x1cb
 [<c0247c1e>] netlink_rcv_skb+0x3a/0x8b
 [<c0247cb1>] netlink_run_queue+0x42/0xc3
 [<c0241245>] rtnetlink_rcv_msg+0x0/0x1cb
 [<c0241245>] rtnetlink_rcv_msg+0x0/0x1cb
 [<c0241227>] rtnetlink_rcv+0x22/0x40
 [<c0241245>] rtnetlink_rcv_msg+0x0/0x1cb
 [<c024764c>] netlink_data_ready+0x17/0x54
 [<c0246a99>] netlink_sendskb+0x1f/0x39
 [<c0247449>] netlink_sendmsg+0x27b/0x28c
 [<c0231467>] sock_sendmsg+0xce/0xe9
 [<c0112b36>] __wake_up+0x27/0x3b
 [<c01a6216>] copy_to_user+0x38/0x42
 [<c01a625a>] copy_from_user+0x3a/0x60
 [<c01a625a>] copy_from_user+0x3a/0x60
 [<c0126be2>] autoremove_wake_function+0x0/0x3a
 [<c0236bcd>] verify_iovec+0x49/0x7f
 [<c02327f2>] sys_sendmsg+0x152/0x1a8
 [<c0147a62>] do_sync_read+0xb8/0xeb
 [<c01a6216>] copy_to_user+0x38/0x42
 [<c0126be2>] autoremove_wake_function+0x0/0x3a
 [<c0122b7a>] getrusage+0x34/0x43
 [<c0168504>] inotify_dentry_parent_queue_event+0x29/0x7c
 [<c01a625a>] copy_from_user+0x3a/0x60
 [<c0232b6b>] sys_socketcall+0x167/0x180
 [<c0102433>] sysenter_past_esp+0x54/0x75

BUG: soft lockup detected on CPU#0!

Pid: 6506, comm:                zebra
EIP: 0060:[<f8952052>] CPU: 0
EIP is at u32_classify+0x52/0x170 [cls_u32]
 EFLAGS: 00000206    Not tainted  (2.6.15.6)
EAX: e2fbd020 EBX: f48649c0 ECX: 00000010 EDX: 29b09d5a
ESI: f48649ec EDI: 00000001 EBP: e2fbd020 DS: 007b ES: 007b
CR0: 8005003b CR2: 08154004 CR3: 361ad000 CR4: 000006d0
 [<f88462fa>] ipt_do_table+0x2de/0x2fd [ip_tables]
 [<f883b523>] ip_nat_fn+0x177/0x185 [iptable_nat]
 [<f88e159f>] ip_refrag+0x23/0x5f [ip_conntrack]
 [<c0244d82>] tc_classify+0x2c/0x3f
 [<f895514b>] htb_classify+0x14b/0x1dd [sch_htb]
 [<f8955638>] htb_enqueue+0x1d/0x13a [sch_htb]
 [<c02396fd>] dev_queue_xmit+0xe4/0x203
 [<c0250de8>] ip_output+0x1e1/0x237
 [<c024f3f5>] ip_forward+0x181/0x1df
 [<c024e21a>] ip_rcv+0x40c/0x485
 [<c0239bd0>] netif_receive_skb+0x12f/0x165
 [<f885aa4c>] e1000_clean_rx_irq+0x389/0x410 [e1000]
 [<f885a1ca>] e1000_clean+0x94/0x12f [e1000]
 [<c0239d5a>] net_rx_action+0x69/0xf0
 [<c011a305>] __do_softirq+0x55/0xbd
 [<c011a39a>] do_softirq+0x2d/0x31
 [<c011a3f8>] local_bh_enable+0x5a/0x65
 [<c024a0a1>] rt_run_flush+0x5f/0x80
 [<c027623f>] fn_hash_insert+0x352/0x39f
 [<c027364c>] inet_rtm_newroute+0x57/0x62
 [<c02413ed>] rtnetlink_rcv_msg+0x1a8/0x1cb
 [<c0241245>] rtnetlink_rcv_msg+0x0/0x1cb
 [<c0247c1e>] netlink_rcv_skb+0x3a/0x8b
 [<c0247cb1>] netlink_run_queue+0x42/0xc3
 [<c0241245>] rtnetlink_rcv_msg+0x0/0x1cb
 [<c0241245>] rtnetlink_rcv_msg+0x0/0x1cb
 [<c0241227>] rtnetlink_rcv+0x22/0x40
 [<c0241245>] rtnetlink_rcv_msg+0x0/0x1cb
 [<c024764c>] netlink_data_ready+0x17/0x54
 [<c0246a99>] netlink_sendskb+0x1f/0x39
 [<c0247449>] netlink_sendmsg+0x27b/0x28c
 [<c0231467>] sock_sendmsg+0xce/0xe9
 [<c0112b36>] __wake_up+0x27/0x3b
 [<c01a625a>] copy_from_user+0x3a/0x60
 [<c01a625a>] copy_from_user+0x3a/0x60
 [<c0126be2>] autoremove_wake_function+0x0/0x3a
 [<c0236bcd>] verify_iovec+0x49/0x7f
 [<c02327f2>] sys_sendmsg+0x152/0x1a8
 [<c0147a62>] do_sync_read+0xb8/0xeb
 [<c0122b7a>] getrusage+0x34/0x43
 [<c011d72f>] update_wall_time+0xa/0x32
 [<c011d9e5>] do_timer+0x33/0xa9
 [<c01a625a>] copy_from_user+0x3a/0x60
 [<c0232b6b>] sys_socketcall+0x167/0x180
 [<c0102433>] sysenter_past_esp+0x54/0x75

It happens on 2.6.15.6. Tonight I will try 2.6.16.16 with FIB_TRIE instead of
FIB_HASH. I am submitting this bug under Networking, but I am not completely sure
it belongs here.
Comment 1 Michal Margula 2006-06-13 16:03:40 UTC
2.6.16.16 with FIB_TRIE did not help.
Changing interface from e1000 to e100 did not help.

It looks like a flood, although there are no other signs of one (no increased pps
rate or bandwidth). Maybe some malformed packets are causing such high usage?
Comment 2 Andrew Morton 2006-06-19 15:16:28 UTC
bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=6682
> 
>            Summary: BUG: soft lockup detected on CPU#0! / ksoftirqd takes
>                     100% CPU
>     Kernel Version: 2.6.15.6
>             Status: NEW
>           Severity: normal
>              Owner: acme@conectiva.com.br
>          Submitter: alchemyx@uznam.net.pl
> 
> [...]

This also happens on 2.6.16.6.

Comment 3 Anonymous Emailer 2006-06-19 17:25:48 UTC
Reply-To: paulmck@us.ibm.com

On Mon, Jun 19, 2006 at 03:20:10PM -0700, Andrew Morton wrote:
> bugme-daemon@bugzilla.kernel.org wrote:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=6682
> > [...]
> > Steps to reproduce:
> > 
> > There is no simple way to reproduce. I think that everything started when we
> > attached a second provider with BGP support. We are using quagga, which injects
> > about 186,000 routes into the kernel. When running for a while (at least a few hours,
> > sometimes a day) we get 100% usage on ksoftirqd/0 and the following messages in the logs:

Is it possible that there is a routing loop, either in the overall
configuration or in some intermediate point in the route injection?
Both CPUs seem to be receiving ethernet packets at the time of the oops.

						Thanx, Paul

> > [...]
> 
> This also happens on 2.6.16.6.

Comment 4 Herbert Xu 2006-06-20 05:32:53 UTC
On Mon, Jun 19, 2006 at 10:20:10PM +0000, Andrew Morton wrote:
>
> >  [<c02396f9>] dev_queue_xmit+0xe0/0x203
> >  [<c0250de8>] ip_output+0x1e1/0x237
> >  [<c024f3f5>] ip_forward+0x181/0x1df
> >  [<c024e21a>] ip_rcv+0x40c/0x485
> >  [<c0239bd0>] netif_receive_skb+0x12f/0x165
> >  [<f885aa4c>] e1000_clean_rx_irq+0x389/0x410 [e1000]
> >  [<f885a1ca>] e1000_clean+0x94/0x12f [e1000]
> >  [<c0239d5a>] net_rx_action+0x69/0xf0
> >  [<c011a305>] __do_softirq+0x55/0xbd
> >  [<c011a39a>] do_softirq+0x2d/0x31
> >  [<c011a3f8>] local_bh_enable+0x5a/0x65
> >  [<c024a0a1>] rt_run_flush+0x5f/0x80

Could you tell us the frequency of route updates on this machine?
Route updates are pretty expensive, especially when a large number
of flows hits your machine right afterwards.

You can monitor this by running ip mon.  You might just be getting
bogus route updates that cause unnecessary flushes of the routing cache.
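
For reference, a minimal way to measure this might look like the following sketch
(the 300-second window and the /tmp paths are arbitrary; /proc/net/stat/rt_cache is
the 2.6 route-cache statistics file):

ip monitor route > /tmp/route.log &     # record routing table changes
sleep 300
kill $!                                 # stop the capture after 5 minutes
awk -v t=300 'END { printf "%.2f route changes/sec\n", NR / t }' /tmp/route.log

# route-cache statistics: a header line, then one row of hex counters per CPU
cat /proc/net/stat/rt_cache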

Cheers,
Comment 5 Michal Margula 2006-06-20 05:57:36 UTC
First of all, at the moment I have 2.6.16.19 and it has been running OK for about 6
days now, but I am still testing.

I ran ip mon for a period of 300 seconds and got:

sauron alchemyx # time ip mon > /tmp/ROUTE
Killed

real    5m1.234s
user    0m0.020s
sys     0m0.056s
sauron alchemyx # wc -l /tmp/ROUTE
953 /tmp/ROUTE

So it is about 3 changes per second. Is that much? A funny thing I noticed:

212.52.159.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
Deleted 212.52.158.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
212.52.158.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
Deleted 212.52.157.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
212.52.157.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
Deleted 212.52.156.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
212.52.156.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
Deleted 212.52.155.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
212.52.155.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
Deleted 212.52.154.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
212.52.154.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
Deleted 212.52.153.0/24 via 217.153.19.17 dev eth0  proto zebra equalize
212.52.153.0/24 via 217.153.19.17 dev eth0  proto zebra equalize

So quagga is doing some unnecessary work, which also puts some load on the routing
table. Am I right?
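
A rough way to see how much of that churn is this delete-and-re-add pattern, using
the /tmp/ROUTE capture from above (a sketch; it simply counts prefixes that show up
in both a "Deleted" line and a plain add line):

grep '^Deleted ' /tmp/ROUTE | awk '{print $2}' | sort -u > /tmp/deleted.prefixes
grep -v '^Deleted ' /tmp/ROUTE | awk '{print $1}' | sort -u > /tmp/added.prefixes
# prefixes that were deleted and then re-added during the capture
comm -12 /tmp/deleted.prefixes /tmp/added.prefixes | wc -l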
Comment 6 Anonymous Emailer 2006-06-20 08:52:07 UTC
Reply-To: Robert.Olsson@data.slu.se


Hello!

Yes, it seems the system is very loaded for some reason

> > sometimes a day) we get 100% usage on ksoftirqd/0 and the following messages in the logs:

as all softirqs are run via ksoftirqd. That's still OK, but why doesn't the
watchdog get any CPU share at all? A mismatch in priorities?

Herbert Xu writes:
> On Mon, Jun 19, 2006 at 10:20:10PM +0000, Andrew Morton wrote:
 > >
 > > >  [<c02396f9>] dev_queue_xmit+0xe0/0x203
 > > >  [<c0250de8>] ip_output+0x1e1/0x237
 > > >  [<c024f3f5>] ip_forward+0x181/0x1df
 > > >  [<c024e21a>] ip_rcv+0x40c/0x485
 > > >  [<c0239bd0>] netif_receive_skb+0x12f/0x165
 > > >  [<f885aa4c>] e1000_clean_rx_irq+0x389/0x410 [e1000]
 > > >  [<f885a1ca>] e1000_clean+0x94/0x12f [e1000]
 > > >  [<c0239d5a>] net_rx_action+0x69/0xf0
 > > >  [<c011a305>] __do_softirq+0x55/0xbd
 > > >  [<c011a39a>] do_softirq+0x2d/0x31
 > > >  [<c011a3f8>] local_bh_enable+0x5a/0x65
 > > >  [<c024a0a1>] rt_run_flush+0x5f/0x80

Normal for a router...

 > Could you tell us the frequency of route updates on this machine?
 > Route updates are pretty expensive especially when a large number
 > of flows hits your machine right afterwards.

Yes, the flush is costly and unfortunately hard to avoid. We discussed this a
bit before...

 > You can monitor this by running ip mon.  You might just be getting
 > bogus route updates causing unnecessary flushes to the routing cache.

I just sampled 10 minutes on one of our routers carrying 2 full BGP feeds. Remember
that Zebra/Quagga keeps just one route set in the kernel. Anyway, during the
10 minutes I looked, I saw 4 insertions/deletions per second on average.

Cheers.
					--ro

 
 

Comment 7 Natalie Protasevich 2007-07-08 11:33:08 UTC
Any updates on this problem?
Thanks.
Comment 8 Michal Margula 2007-07-08 15:21:35 UTC
At this point we are no longer using Linux for BGP and routing to upstream providers.
Since 2.6.16.19 everything was quite OK. There was no "BUG: soft lockup detected
on CPU#0!", but we were still getting 100% load on ksoftirqd. We never
discovered what caused that situation.

We are currently using two Linux boxes for QoS with HTB. The ksoftirqd load there depends
mostly on the number of queues and the amount of traffic. From time to time those
boxes hang (when deleting the root qdisc), usually when there is high traffic.
A combination of a watchdog and dynamic routing helps to minimize the damage, but it
still happens.
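
For what it's worth, the root-qdisc teardown mentioned above is normally a plain
delete followed by a rebuild; whether a single replace operation behaves any better
on these kernels is untested, but the two variants look roughly like this (eth0 and
the handle/rate values are only placeholders):

# current style: drop the whole tree, then rebuild it
tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit

# alternative: swap in a new root qdisc with one command instead of del + add
tc qdisc replace dev eth0 root handle 1: htb default 10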
Comment 9 Natalie Protasevich 2007-07-08 15:48:12 UTC
You might want to check what the other CPU is doing by running top or taking a sysrq-t trace, to see what other processes are on the system and whether there is any lock or resource contention, and also check whether memory is getting too low. Running a profiler is also something to try, to see which user and kernel functions are the biggest contributors.
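
A sketch of what those checks could look like here (assuming sysrq is enabled and,
for the profiler, that the kernel was booted with a profile= parameter; file names
are illustrative):

# per-CPU usage: run top and press '1' to split the CPU line per processor
top

# dump all task states (sysrq-t) into the kernel log and save it
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > /tmp/sysrq-t.txt

# quick look at memory pressure
free -m
vmstat 5 5

# kernel profile (if booted with profile=2): top kernel functions by tick count
readprofile -m /boot/System.map-$(uname -r) | sort -nr | head -20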
Comment 10 Andrew Morton 2007-07-25 20:09:14 UTC
An update from Herbert:

I think there are two problems here:

1) A busy router that gets frequent route updates spends too
much time flushing route caches that it can't use.

2) We have some kind of scheduling problem so that when this
happens it triggers the soft lockup watchdog.

I'm not sure if anything's been done to solve 2), but 1)
certainly hasn't changed.

Having said that, 1) is not an easy problem to solve and
AFAIK nobody is working on it at the moment.
Comment 11 Herbert Xu 2007-07-25 21:15:53 UTC
Oh and of course there is

3) A routing daemon that generates unnecessary route changes.  You should look at eliminating those updates where possible.
Comment 12 Natalie Protasevich 2007-10-22 22:46:42 UTC
Michal, any updates on this? Was the problem resolved for you?
Thanks.
Comment 13 Herbert Xu 2007-10-22 23:04:39 UTC
Natalie, I just checked the routing code and nothing has changed with respect to flushing so the problem still stands.
Comment 14 Michal Margula 2007-10-23 01:25:57 UTC
The boxes are now quite stable - 138 days of uptime (it would be longer, but we had to
rewire our racks to make some room) on both servers with HTB. The kernel currently
in use is 2.6.20.7. So I think it is OK now.

We sometimes see the following messages in dmesg, but maybe that is because of the big load.
They show up on every HTB box, but those boxes have identical hardware configurations.

e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <98>
  TDT                  <98>
  next_to_use          <98>
  next_to_clean        <ed>
buffer_info[next_to_clean]
  time_stamp           <b1a58d51>
  next_to_watch        <ed>
  jiffies              <b1a58e5a>
  next_to_watch.status <1>
e1000: eth1: e1000_clean_tx_irq: Detected Tx Unit Hang
  Tx Queue             <0>
  TDH                  <50>
  TDT                  <50>
  next_to_use          <50>
  next_to_clean        <a5>
buffer_info[next_to_clean]
  time_stamp           <b2135604>
  next_to_watch        <a5>
  jiffies              <b2135730>
  next_to_watch.status <1>
Comment 15 Herbert Xu 2007-10-23 02:11:06 UTC
Michal, do you still run BGP on it? If not, then that is what cured the original problem, rather than any code fixes that may have gone in.
Comment 16 Michal Margula 2007-10-23 03:06:04 UTC
I do run BGP on it, but for other purposes than before; it has only about 20 routes
or so, and those change rarely. So you are right, it may be because we no longer
have huge routing tables.
Comment 17 Natalie Protasevich 2008-03-24 15:51:48 UTC
Michal, any updates on this problem from your side?
If you can't reproduce the problem anymore, maybe we should close the bug now, unless Herbert needs to investigate and work on this more.
Comment 18 Michal Margula 2008-03-24 16:28:19 UTC
No updates unfortunately. I think closing is OK for now.
Comment 19 Herbert Xu 2008-03-24 18:16:46 UTC
You know what, Eric Dumazet actually fixed this bug back in January, so I agree with closing it too.  Thanks!
Comment 20 Natalie Protasevich 2008-03-24 18:26:17 UTC
It is great that this was fixed; closing it then.
