Bug 14470

Summary: freez in TCP stack
Product: Networking
Component: IPV4
Reporter: Vaclav Bilek (kolo)
Assignee: Stephen Hemminger (stephen)
Status: CLOSED CODE_FIX
Severity: high
Priority: P1
CC: alan, bill, davetha, info, jura, kernel.org, kolo, ole, pesmail2003, rdewit, spike, subscribe, tdevelioglu, va, wdaher, yuriy
Hardware: x86-64
OS: Linux
Kernel Version: 2.6.31
Subsystem:
Regression: No
Bisected commit-id:

Description Vaclav Bilek 2009-10-26 12:47:19 UTC
We are hitting kernel panics on Dell R610 servers with e1000e NICs. They usually appear under high network traffic (around 100 Mbit/s), but that is not a rule; it has happened even under low traffic.

The servers are used as a reverse HTTP proxy (varnish).

On 6 identical servers this panic happens approx. 2 times a day, depending on network load. The machine completely freezes until the management watchdog reboots it.


We had to put a serial console on these servers to catch the oops. Is there anything else we can do to debug this?
The RIP is always the same:

RIP: 0010:[<ffffffff814203cc>]  [<ffffffff814203cc>] tcp_xmit_retransmit_queue+0x8c/0x290

The rest of the oops always differs a little; here is an example:

RIP: 0010:[<ffffffff814203cc>]  [<ffffffff814203cc>] tcp_xmit_retransmit_queue+0x8c/0x290
RSP: 0018:ffffc90000003a40  EFLAGS: 00010246
RAX: ffff8807e7420678 RBX: ffff8807e74205c0 RCX: 0000000000000000
RDX: 000000004598a105 RSI: 0000000000000000 RDI: ffff8807e74205c0
RBP: ffffc90000003a80 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8807e74205c0 R14: ffff8807e7420678 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffc90000000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81608000, task ffffffff81631440)
Stack:
 ffffc90000003a60 0000000000000000 4598a105e74205c0 000000004598a101
<0> 000000000000050e ffff8807e74205c0 0000000000000003 0000000000000000
<0> ffffc90000003b40 ffffffff8141ae4a ffff8807e7420678 0000000000000000
Call Trace:
 <IRQ>
 [<ffffffff8141ae4a>] tcp_ack+0x170a/0x1dd0
 [<ffffffff8141c362>] tcp_rcv_state_process+0x122/0xab0
 [<ffffffff81422c6c>] tcp_v4_do_rcv+0xac/0x220
 [<ffffffff813fd02f>] ? nf_iterate+0x5f/0x90
 [<ffffffff81424b26>] tcp_v4_rcv+0x586/0x6b0
 [<ffffffff813fd0c5>] ? nf_hook_slow+0x65/0xf0
 [<ffffffff81406b70>] ? ip_local_deliver_finish+0x0/0x120
 [<ffffffff81406bcf>] ip_local_deliver_finish+0x5f/0x120
 [<ffffffff8140715b>] ip_local_deliver+0x3b/0x90
 [<ffffffff81406971>] ip_rcv_finish+0x141/0x340
 [<ffffffff8140701f>] ip_rcv+0x24f/0x350
 [<ffffffff813e7ced>] netif_receive_skb+0x20d/0x2f0
 [<ffffffff813e7e90>] napi_skb_finish+0x40/0x50
 [<ffffffff813e82f4>] napi_gro_receive+0x34/0x40
 [<ffffffff8133e0c8>] e1000_receive_skb+0x48/0x60
 [<ffffffff81342342>] e1000_clean_rx_irq+0xf2/0x330
 [<ffffffff813410a1>] e1000_clean+0x81/0x2a0
 [<ffffffff81054ce1>] ? ktime_get+0x11/0x50
 [<ffffffff813eaf1c>] net_rx_action+0x9c/0x130
 [<ffffffff81046940>] ? get_next_timer_interrupt+0x1d0/0x210
 [<ffffffff81041bd7>] __do_softirq+0xb7/0x160
 [<ffffffff8100c27c>] call_softirq+0x1c/0x30
 [<ffffffff8100e04d>] do_softirq+0x3d/0x80
 [<ffffffff81041b0b>] irq_exit+0x7b/0x90
 [<ffffffff8100d613>] do_IRQ+0x73/0xe0
 [<ffffffff8100bb13>] ret_from_intr+0x0/0xa
 <EOI>
 [<ffffffff81296e6c>] ? acpi_idle_enter_bm+0x245/0x271
 [<ffffffff81296e62>] ? acpi_idle_enter_bm+0x23b/0x271
 [<ffffffff813c7a08>] ? cpuidle_idle_call+0x98/0xf0
 [<ffffffff8100a104>] ? cpu_idle+0x94/0xd0
 [<ffffffff81468db6>] ? rest_init+0x66/0x70
 [<ffffffff816a082f>] ? start_kernel+0x2ef/0x340
 [<ffffffff8169fd54>] ? x86_64_start_reservations+0x84/0x90
 [<ffffffff8169fe32>] ? x86_64_start_kernel+0xd2/0x100
Code: 00 eb 28 8b 83 d0 03 00 00 41 39 44 24 40 0f 89 00 01 00 00 41 0f b6 cd 41 bd 2f 00 00 00 83 e1 03 0f 84 fc 00 00 00 4d 8b 24 24 <49> 8b 04 24 4d 39 f4 0f 18 08 0f 84 d9 00 00 00 4c 3b a3 b8 01
RIP  [<ffffffff814203cc>] tcp_xmit_retransmit_queue+0x8c/0x290
 RSP <ffffc90000003a40>
CR2: 0000000000000000
---[ end trace d97d99c9ae1d52cc ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G      D    2.6.31 #2
Call Trace:
 <IRQ>  [<ffffffff8103cab0>] panic+0xa0/0x170
 [<ffffffff8100bb13>] ? ret_from_intr+0x0/0xa
 [<ffffffff8103c74e>] ? print_oops_end_marker+0x1e/0x20
 [<ffffffff8100f38e>] oops_end+0x9e/0xb0
 [<ffffffff81025b9a>] no_context+0x15a/0x250
 [<ffffffff81025e2b>] __bad_area_nosemaphore+0xdb/0x1c0
 [<ffffffff813e89e9>] ? dev_hard_start_xmit+0x269/0x2f0
 [<ffffffff81025fae>] bad_area_nosemaphore+0xe/0x10
 [<ffffffff8102639f>] do_page_fault+0x17f/0x260
 [<ffffffff8147eadf>] page_fault+0x1f/0x30
 [<ffffffff814203cc>] ? tcp_xmit_retransmit_queue+0x8c/0x290
 [<ffffffff8141ae4a>] tcp_ack+0x170a/0x1dd0
 [<ffffffff8141c362>] tcp_rcv_state_process+0x122/0xab0
 [<ffffffff81422c6c>] tcp_v4_do_rcv+0xac/0x220
 [<ffffffff813fd02f>] ? nf_iterate+0x5f/0x90
 [<ffffffff81424b26>] tcp_v4_rcv+0x586/0x6b0
 [<ffffffff813fd0c5>] ? nf_hook_slow+0x65/0xf0
 [<ffffffff81406b70>] ? ip_local_deliver_finish+0x0/0x120
 [<ffffffff81406bcf>] ip_local_deliver_finish+0x5f/0x120
 [<ffffffff8140715b>] ip_local_deliver+0x3b/0x90
 [<ffffffff81406971>] ip_rcv_finish+0x141/0x340
 [<ffffffff8140701f>] ip_rcv+0x24f/0x350
 [<ffffffff813e7ced>] netif_receive_skb+0x20d/0x2f0
 [<ffffffff813e7e90>] napi_skb_finish+0x40/0x50
 [<ffffffff813e82f4>] napi_gro_receive+0x34/0x40
 [<ffffffff8133e0c8>] e1000_receive_skb+0x48/0x60
 [<ffffffff81342342>] e1000_clean_rx_irq+0xf2/0x330
 [<ffffffff813410a1>] e1000_clean+0x81/0x2a0
 [<ffffffff81054ce1>] ? ktime_get+0x11/0x50
 [<ffffffff813eaf1c>] net_rx_action+0x9c/0x130
 [<ffffffff81046940>] ? get_next_timer_interrupt+0x1d0/0x210
 [<ffffffff81041bd7>] __do_softirq+0xb7/0x160
 [<ffffffff8100c27c>] call_softirq+0x1c/0x30
 [<ffffffff8100e04d>] do_softirq+0x3d/0x80
 [<ffffffff81041b0b>] irq_exit+0x7b/0x90
 [<ffffffff8100d613>] do_IRQ+0x73/0xe0
 [<ffffffff8100bb13>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff81296e6c>] ? acpi_idle_enter_bm+0x245/0x271
 [<ffffffff81296e62>] ? acpi_idle_enter_bm+0x23b/0x271
 [<ffffffff813c7a08>] ? cpuidle_idle_call+0x98/0xf0
 [<ffffffff8100a104>] ? cpu_idle+0x94/0xd0
 [<ffffffff81468db6>] ? rest_init+0x66/0x70
 [<ffffffff816a082f>] ? start_kernel+0x2ef/0x340
 [<ffffffff8169fd54>] ? x86_64_start_reservations+0x84/0x90
 [<ffffffff8169fe32>] ? x86_64_start_kernel+0xd2/0x100
Comment 1 Andrew Morton 2009-10-28 22:13:50 UTC
On Mon, 26 Oct 2009 08:41:32 -0700
Stephen Hemminger <shemminger@linux-foundation.org> wrote:

> 
> 
> Begin forwarded message:
> 
> Date: Mon, 26 Oct 2009 12:47:22 GMT
> From: bugzilla-daemon@bugzilla.kernel.org
> To: shemminger@linux-foundation.org
> Subject: [Bug 14470] New: freez in TCP stack
> 

Stephen, please retain the bugzilla and reporter email cc's when
forwarding a report to a mailing list.


> http://bugzilla.kernel.org/show_bug.cgi?id=14470
> 
>            Summary: freez in TCP stack
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.31
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: kolo@albatani.cz
>         Regression: No
> 
> 
> We are hitting kernel panics on Dell R610 servers with e1000e NICs. They
> usually appear under high network traffic (around 100 Mbit/s), but that is
> not a rule; it has happened even under low traffic.
> 
> The servers are used as a reverse HTTP proxy (varnish).
> 
> On 6 identical servers this panic happens approx. 2 times a day, depending on
> network load. The machine completely freezes until the management watchdog
> reboots it.
> 

Twice a day on six separate machines.  That ain't no hardware glitch.

Vaclav, are you able to say whether this is a regression?  Did those
machines run 2.6.30 (for example)?

Thanks.

Comment 2 Eric Dumazet 2009-10-29 05:35:15 UTC
Andrew Morton wrote:
> Twice a day on six separate machines.  That ain't no hardware glitch.
> 
> Vaclav, are you able to say whether this is a regression?  Did those
> machines run 2.6.30 (for example)?
> 
> Thanks.
> 


Code: 00 eb 28 8b 83 d0 03 00 00
  41 39 44 24 40    cmp    %eax,0x40(%r12)
  0f 89 00 01 00 00 jns ...
  41 0f b6 cd       movzbl %r13b,%ecx
  41 bd 2f 00 00 00 mov    $0x2f,%r13d
  83 e1 03          and    $0x3,%ecx
  0f 84 fc 00 00 00 je ...
  4d 8b 24 24       mov    (%r12),%r12    skb = skb->next
<>49 8b 04 24       mov    (%r12),%rax     << NULL POINTER dereference >>
  4d 39 f4          cmp    %r14,%r12
  0f 18 08          prefetcht0 (%rax)
  0f 84 d9 00 00 00 je  ...
  4c 3b a3 b8 01    cmp


crash is in 
void tcp_xmit_retransmit_queue(struct sock *sk)
{

<< HERE >> tcp_for_write_queue_from(skb, sk) {

}


Some skb in sk_write_queue has a NULL ->next pointer.

Strange thing is that R14 and RAX are both ffff8807e7420678 (&sk->sk_write_queue).
R14 is the stable value during the loop, while RAX is a scratch register.

I don't have the full disassembly for this function, but I guess we just entered the loop
(or RAX should be really different at this point).

So, maybe the list head itself is corrupted (sk->sk_write_queue.next == NULL),

or a retransmit_skb_hint problem? (Do we forget to set it to NULL in some cases?)
Comment 3 Eric Dumazet 2009-10-29 05:59:45 UTC
Eric Dumazet wrote:
> So, maybe list head itself is corrupted (sk->sk_write_queue->next = NULL)
> 
> or, retransmit_skb_hint problem ? (we forget to set it to NULL in some cases
> ?)
> 

David, what do you think of the following patch?

I wonder if we should reorganize the code to add sanity checks in tcp_unlink_write_queue()
that the skb we delete from the queue is not still referenced.

[PATCH] tcp: clear retrans hints in tcp_send_synack()

There is a small possibility the skb we unlink from write queue 
is still referenced by retrans hints.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fcd278a..b22a72d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2201,6 +2201,7 @@ int tcp_send_synack(struct sock *sk)
 			struct sk_buff *nskb = skb_copy(skb, GFP_ATOMIC);
 			if (nskb == NULL)
 				return -ENOMEM;
+			tcp_clear_all_retrans_hints(tcp_sk(sk));
 			tcp_unlink_write_queue(skb, sk);
 			skb_header_release(nskb);
 			__tcp_add_write_queue_head(sk, nskb);
Comment 4 David S. Miller 2009-10-29 06:02:16 UTC
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Oct 2009 06:59:41 +0100

> David, what do you think of following patch ?
> 
> I wonder if we should reorganize code to add sanity checks in
> tcp_unlink_write_queue()
> that the skb we delete from queue is not still referenced.
> 
> [PATCH] tcp: clear retrans hints in tcp_send_synack()
> 
> There is a small possibility the skb we unlink from write queue 
> is still referenced by retrans hints.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Yes, the first thing I thought of when I saw this crash was the hints.

I'll think this over.
Comment 5 Anonymous Emailer 2009-10-29 06:26:16 UTC
Reply-To: v.bilek@1art.cz

bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=14470
> 
> 
> 
> 
> 
> --- Comment #1 from Andrew Morton <akpm@linux-foundation.org>  2009-10-28
> 22:13:50 ---
> On Mon, 26 Oct 2009 08:41:32 -0700
> Stephen Hemminger <shemminger@linux-foundation.org> wrote:
> 
>>
>> Begin forwarded message:
>>
>> Date: Mon, 26 Oct 2009 12:47:22 GMT
>> From: bugzilla-daemon@bugzilla.kernel.org
>> To: shemminger@linux-foundation.org
>> Subject: [Bug 14470] New: freez in TCP stack
>>
> 
> Stephen, please retain the bugzilla and reporter email cc's when
> forwarding a report to a mailing list.
> 
> 
> Twice a day on six separate machines.  That ain't no hardware glitch.
> 
> Vaclav, are you able to say whether this is a regression?  Did those
> machines run 2.6.30 (for example)?
> 
> Thanks.
> 
I can't say whether it was the same bug, but the symptoms we hit running
2.6.30 were the same (high net load; total freeze).
Comment 6 Yuriy Shkandybin 2009-10-29 07:31:17 UTC
I have been getting it since 2.6.29 on several HP DL180 servers.
2.6.28-6 works well.
Comment 7 David S. Miller 2009-10-29 07:59:52 UTC
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Oct 2009 06:59:41 +0100

> [PATCH] tcp: clear retrans hints in tcp_send_synack()
> 
> There is a small possibility the skb we unlink from write queue 
> is still referenced by retrans hints.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

So, this would only be true if we were dealing with a data
packet here.  We're not, this is a SYN+ACK which happens to
be cloned in the write queue.

The hint SKB pointers can only point to real data packets.

And we're only dealing with data packets once we enter established
state, and when we enter established by definition we have unlinked
and freed up any SYN and SYN+ACK SKBs in the write queue.
Comment 8 Ilpo Järvinen 2009-10-29 12:59:06 UTC
On Thu, 29 Oct 2009, Eric Dumazet wrote:

> 
> 
> Code: 00 eb 28 8b 83 d0 03 00 00
>   41 39 44 24 40    cmp    %eax,0x40(%r12)
>   0f 89 00 01 00 00 jns ...
>   41 0f b6 cd       movzbl %r13b,%ecx
>   41 bd 2f 00 00 00 mov    $0x2f,%r13d
>   83 e1 03          and    $0x3,%ecx
>   0f 84 fc 00 00 00 je ...
>   4d 8b 24 24       mov    (%r12),%r12    skb = skb->next
> <>49 8b 04 24       mov    (%r12),%rax     << NULL POINTER dereference >>
>   4d 39 f4          cmp    %r14,%r12
>   0f 18 08          prefetcht0 (%rax)
>   0f 84 d9 00 00 00 je  ...
>   4c 3b a3 b8 01    cmp
> 
> 
> crash is in 
> void tcp_xmit_retransmit_queue(struct sock *sk)
> {
> 
> << HERE >> tcp_for_write_queue_from(skb, sk) {
> 
> }
> 
> 
> Some skb in sk_write_queue has a NULL ->next pointer
> 
> Strange thing is R14 and RAX = ffff8807e7420678 (&sk->sk_write_queue).
> R14 is the stable value during the loop, while RAX is a scratch register.
> 
> I don't have the full disassembly for this function, but I guess we just 
> entered the loop (or RAX should be really different at this point)
> 
> So, maybe list head itself is corrupted (sk->sk_write_queue->next = NULL)

One more alternative along those lines could perhaps be:

We enter there with an empty write_queue and with the hint being NULL, so we 
take the else branch... and skb_peek then gives us the NULL ptr. However, 
I cannot see how this could happen, as all branches trap with return 
before they reach tcp_xmit_retransmit_queue.
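
To make the scenario above concrete, here is a small userspace sketch of it. It is only a model under stated assumptions: `model_skb`, `model_queue` and every `model_*` helper are hypothetical stand-ins invented for illustration, not kernel code, and the explicit NULL test in the walk exists only so the sketch can report the problem instead of crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's struct sk_buff. */
struct model_skb {
	struct model_skb *next;
	struct model_skb *prev;
};

struct model_queue {
	struct model_skb head;	/* sentinel: an empty queue points at itself */
};

void model_queue_init(struct model_queue *q)
{
	q->head.next = &q->head;
	q->head.prev = &q->head;
}

void model_queue_tail(struct model_queue *q, struct model_skb *skb)
{
	skb->next = &q->head;
	skb->prev = q->head.prev;
	q->head.prev->next = skb;
	q->head.prev = skb;
}

/* Like skb_peek(): the first element, or NULL when the queue is empty. */
struct model_skb *model_peek(struct model_queue *q)
{
	return q->head.next == &q->head ? NULL : q->head.next;
}

/* Start from the hint if there is one, else from the front of the queue. */
struct model_skb *model_walk_start(struct model_queue *q,
				   struct model_skb *hint)
{
	return hint ? hint : model_peek(q);
}

/*
 * Count the skbs from 'skb' to the end of the queue.  Returns -1 at the
 * point where a walk without the NULL test would dereference NULL.
 */
int model_walk_from(struct model_queue *q, struct model_skb *skb)
{
	int n = 0;

	for (;;) {
		if (skb == NULL)
			return -1;	/* added for the sketch only */
		if (skb == &q->head)
			return n;
		n++;
		skb = skb->next;
	}
}
```

With an empty queue and no hint, `model_walk_from(&q, model_walk_start(&q, NULL))` reports -1, which is the case described above; once one skb is queued the same call counts 1.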
 
> or, retransmit_skb_hint problem ? (we forget to set it to NULL in some 
> cases ?)

...I don't understand how a stale reference would lead to a consistent 
NULL ptr crash there rather than hard-to-track corruption most of the 
time and random crashes here and there. Or perhaps we were just very 
lucky to immediately get only those reports which point to the right 
track :-).

...I tried to find what is wrong with it but sadly came up only with
ah-this-is-it-oh-wait-it's-ok type of things.
Comment 9 Eric Dumazet 2009-10-29 14:08:36 UTC
> ...I don't understand how a stale reference would lead to a consistent 
> NULL ptr crash there rather than hard-to-track corruption most of the 
> time and random crashes here and there. Or perhaps we were just very 
> lucky to immediately get only those reports which point to the right 
> track :-).
> 


When an skb is freed and re-allocated, we clear most of its fields
in __alloc_skb():

memset(skb, 0, offsetof(struct sk_buff, tail));

If this skb is then freed again without having been queued anywhere, its skb->next stays NULL.

So if we have a stale reference to a freed skb, we can:

- Get a NULL pointer, or a poisoned value (if SLUB_DEBUG)
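
The effect described above can be reproduced in a userspace sketch. Everything here is an assumption made for illustration: `fake_skb` is a hypothetical four-field stand-in for struct sk_buff, and the "slab" is a single static object that is handed out again after every free, so the stale pointer never refers to memory that was really released.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff; only the layout idea matters. */
struct fake_skb {
	struct fake_skb *next;
	struct fake_skb *prev;
	unsigned int len;
	unsigned char *tail;	/* first field the allocator leaves alone */
};

/* One-object "slab": freeing keeps the memory, allocating reuses it. */
static struct fake_skb slot;

struct fake_skb *fake_alloc_skb(void)
{
	/* same shape as the memset quoted above: clear everything before 'tail' */
	memset(&slot, 0, offsetof(struct fake_skb, tail));
	return &slot;
}

void fake_free_skb(struct fake_skb *skb)
{
	(void)skb;	/* the object stays in the cache untouched */
}

/* Returns 1 if a hint that was never cleared ends up seeing next == NULL. */
int stale_hint_sees_null_next(void)
{
	struct fake_skb head;
	struct fake_skb *skb, *hint;

	/* queue one skb on a circular list and remember it in a hint */
	skb = fake_alloc_skb();
	skb->next = skb->prev = &head;
	head.next = head.prev = skb;
	hint = skb;
	assert(hint->next == &head);

	/* unlink and free it, but forget to clear the hint */
	head.next = head.prev = &head;
	fake_free_skb(skb);

	/* the object is reused elsewhere and freed again without being queued */
	fake_free_skb(fake_alloc_skb());

	return hint->next == NULL;
}
```

In this model the stale hint always reads a NULL `next`, which would match a crash that looks the same every time rather than random corruption.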


Here is a debug patch to check that we don't have stale pointers; maybe this will help?


[PATCH] tcp: check stale pointers in tcp_unlink_write_queue()

In order to track some obscure bug, we check in tcp_unlink_write_queue() that
we don't have stale references to the unlinked skb.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/tcp.h     |    4 ++++
 net/ipv4/tcp.c        |    2 +-
 net/ipv4/tcp_input.c  |    4 ++--
 net/ipv4/tcp_output.c |    8 ++++----
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 740d09b..09da342 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1357,6 +1357,10 @@ static inline void tcp_insert_write_queue_before(struct sk_buff *new,
 
 static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 {
+	WARN_ON(skb == tcp_sk(sk)->retransmit_skb_hint);
+	WARN_ON(skb == tcp_sk(sk)->lost_skb_hint);
+	WARN_ON(skb == tcp_sk(sk)->scoreboard_skb_hint);
+	WARN_ON(skb == sk->sk_send_head);
 	__skb_unlink(skb, &sk->sk_write_queue);
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e0cfa63..328bdb1 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1102,11 +1102,11 @@ out:
 
 do_fault:
 	if (!skb->len) {
-		tcp_unlink_write_queue(skb, sk);
 		/* It is the one place in all of TCP, except connection
 		 * reset, where we can be unlinking the send_head.
 		 */
 		tcp_check_send_head(sk, skb);
+		tcp_unlink_write_queue(skb, sk);
 		sk_wmem_free_skb(sk, skb);
 	}
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ba0eab6..fccc6e9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3251,13 +3251,13 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 		if (!fully_acked)
 			break;
 
-		tcp_unlink_write_queue(skb, sk);
-		sk_wmem_free_skb(sk, skb);
 		tp->scoreboard_skb_hint = NULL;
 		if (skb == tp->retransmit_skb_hint)
 			tp->retransmit_skb_hint = NULL;
 		if (skb == tp->lost_skb_hint)
 			tp->lost_skb_hint = NULL;
+		tcp_unlink_write_queue(skb, sk);
+		sk_wmem_free_skb(sk, skb);
 	}
 
 	if (likely(between(tp->snd_up, prior_snd_una, tp->snd_una)))
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 616c686..196171d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1791,6 +1791,10 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
 	tcp_highest_sack_combine(sk, next_skb, skb);
 
+	/* changed transmit queue under us so clear hints */
+	tcp_clear_retrans_hints_partial(tp);
+	if (next_skb == tp->retransmit_skb_hint)
+		tp->retransmit_skb_hint = skb;
 	tcp_unlink_write_queue(next_skb, sk);
 
 	skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
@@ -1813,10 +1817,6 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 	 */
 	TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS;
 
-	/* changed transmit queue under us so clear hints */
-	tcp_clear_retrans_hints_partial(tp);
-	if (next_skb == tp->retransmit_skb_hint)
-		tp->retransmit_skb_hint = skb;
 
 	tcp_adjust_pcount(sk, next_skb, tcp_skb_pcount(next_skb));
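
The idea behind the debug patch, warning when an skb is unlinked while a hint still points at it, can be sketched in userspace as follows. This is a model, not the kernel code: `hint_skb`, `hint_sock` and the `hint_*` helpers are hypothetical names, and a counter stands in for WARN_ON().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the skb and the socket. */
struct hint_skb {
	struct hint_skb *next;
	struct hint_skb *prev;
};

struct hint_sock {
	struct hint_skb queue;			/* sentinel of the write queue */
	struct hint_skb *retransmit_hint;	/* may point into the queue */
	int warnings;				/* stand-in for WARN_ON() */
};

void hint_sock_init(struct hint_sock *sk)
{
	sk->queue.next = sk->queue.prev = &sk->queue;
	sk->retransmit_hint = NULL;
	sk->warnings = 0;
}

void hint_queue_tail(struct hint_sock *sk, struct hint_skb *skb)
{
	skb->next = &sk->queue;
	skb->prev = sk->queue.prev;
	sk->queue.prev->next = skb;
	sk->queue.prev = skb;
}

/* Unlink 'skb', counting a warning if the hint still references it. */
void hint_unlink(struct hint_sock *sk, struct hint_skb *skb)
{
	if (skb == sk->retransmit_hint)
		sk->warnings++;
	skb->prev->next = skb->next;
	skb->next->prev = skb->prev;
	skb->next = skb->prev = NULL;
}

/* Unlink first and fix the hint afterwards: the check fires once. */
int hint_unlink_then_clear(void)
{
	struct hint_sock sk;
	struct hint_skb a;

	hint_sock_init(&sk);
	hint_queue_tail(&sk, &a);
	sk.retransmit_hint = &a;
	hint_unlink(&sk, &a);
	sk.retransmit_hint = NULL;
	return sk.warnings;
}

/* Clear the hint first, then unlink: the check stays silent. */
int hint_clear_then_unlink(void)
{
	struct hint_sock sk;
	struct hint_skb a;

	hint_sock_init(&sk);
	hint_queue_tail(&sk, &a);
	sk.retransmit_hint = &a;
	sk.retransmit_hint = NULL;
	hint_unlink(&sk, &a);
	return sk.warnings;
}
```

This is why the patch also moves the hint updates in front of the unlink calls: with the check in place, the old ordering would warn even on paths that fix the hint immediately afterwards.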
Comment 10 Herbert Xu 2009-10-30 20:19:08 UTC
Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote:
> 
> One more alternative along those lines could perhaps be:
> 
> We enter with empty write_queue there and with the hint being null, so we 
> take the else branch... and skb_peek then gives us the NULL ptr. However, 
> I cannot see how this could happen as all branches trap with return 
> before the reach tcp_xmit_retransmit_queue.

Why don't we add a WARN_ON in there and see if it triggers?

Thanks,
Comment 11 Vaclav Bilek 2009-11-06 12:58:35 UTC
Confirmed as a regression: 2.6.28.6 runs stable.
Comment 12 David Collins 2009-11-11 19:38:35 UTC
I'm having the exact same issue; 2.6.28.9 seems to be working fine, though. I believe 2.6.29+ has the problem, with crashes under high network load.
Comment 13 Ilpo Järvinen 2009-11-26 21:54:58 UTC
On Thu, 29 Oct 2009, David Miller wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 29 Oct 2009 06:59:41 +0100
> 
> > [PATCH] tcp: clear retrans hints in tcp_send_synack()
> > 
> > There is a small possibility the skb we unlink from write queue 
> > is still referenced by retrans hints.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> So, this would only be true if we were dealing with a data
> packet here.  We're not, this is a SYN+ACK which happens to
> be cloned in the write queue.
> 
> The hint SKB pointers can only point to real data packets.
> 
> And we're only dealing with data packets once we enter established
> state, and when we enter established by definition we have unlinked
> and freed up any SYN and SYN+ACK SKBs in the write queue.

How about this then... Does the original reporter have NFS in use?

[PATCH] tcp: clear hints to avoid a stale one (nfs only affected?)

Eric Dumazet mentioned in a context of another problem:

"Well, it seems NFS reuses its socket, so maybe we miss some
cleaning as spotted in this old patch"

I have not checked under which conditions that actually happens, but
if true, we need to make sure we don't accidentally leave stale
hints behind when the write queue had to be purged (whether reuse
with NFS can actually happen if purging took place is something I'm
not sure of).

...At least it compiles.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/net/tcp.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..6b13faa 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1228,6 +1228,7 @@ static inline void tcp_write_queue_purge(struct sock *sk)
 	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
 		sk_wmem_free_skb(sk, skb);
 	sk_mem_reclaim(sk);
+	tcp_clear_all_retrans_hints(tcp_sk(sk));
 }
 
 static inline struct sk_buff *tcp_write_queue_head(struct sock *sk)
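
A userspace sketch of what the one-line patch is meant to prevent: purging the queue while a hint still points into it. As before, this is a model under assumptions, with hypothetical `purge_*` names, and the `clear_hints` flag exists only to compare the behaviour with and without the proposed change.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the skb and the socket. */
struct purge_skb {
	struct purge_skb *next;
	struct purge_skb *prev;
};

struct purge_sock {
	struct purge_skb queue;			/* sentinel of the write queue */
	struct purge_skb *retransmit_hint;
};

void purge_sock_init(struct purge_sock *sk)
{
	sk->queue.next = sk->queue.prev = &sk->queue;
	sk->retransmit_hint = NULL;
}

void purge_queue_tail(struct purge_sock *sk, struct purge_skb *skb)
{
	skb->next = &sk->queue;
	skb->prev = sk->queue.prev;
	sk->queue.prev->next = skb;
	sk->queue.prev = skb;
}

/* Drop every queued skb; optionally clear the hint as the patch proposes. */
void purge_write_queue(struct purge_sock *sk, int clear_hints)
{
	while (sk->queue.next != &sk->queue) {
		struct purge_skb *skb = sk->queue.next;

		sk->queue.next = skb->next;
		skb->next->prev = &sk->queue;
		skb->next = skb->prev = NULL;	/* models the skb being freed */
	}
	if (clear_hints)
		sk->retransmit_hint = NULL;
}

/* Returns 1 if a hint into the purged queue survives the purge. */
int purge_leaves_stale_hint(int clear_hints)
{
	struct purge_sock sk;
	struct purge_skb a, b;

	purge_sock_init(&sk);
	purge_queue_tail(&sk, &a);
	purge_queue_tail(&sk, &b);
	sk.retransmit_hint = &b;
	purge_write_queue(&sk, clear_hints);
	return sk.retransmit_hint != NULL;
}
```

Without the clearing step, a socket that is reused after the purge would start its next connection with a hint pointing at an skb that is no longer on the queue.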
Comment 14 David S. Miller 2009-11-26 23:37:35 UTC
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 26 Nov 2009 23:54:53 +0200 (EET)

> How about this then... Does the original reporter have NFS in use?
> 
> [PATCH] tcp: clear hints to avoid a stale one (nfs only affected?)

I must be getting old and senile, but I specifically remembered that
we prevented a socket from ever being bound again once it has been
bound one time specifically so we didn't have to deal with issues
like this.

I really don't think it's valid for NFS to reuse the socket structure
like this over and over again.  And that's why only NFS can reproduce
this: the interfaces provided to userland can't actually go through this
sequence after a socket goes down one time all the way to close.

Do we really want to audit each and every odd member of the socket
structure from the generic portion all the way down to INET and
TCP specifics to figure out what needs to get zero'd out?

So much relies upon the one-time full zero out during sock allocation.

Let's fix NFS instead.
Comment 15 Eric Dumazet 2009-11-27 06:17:13 UTC
David Miller wrote:

> I must be getting old and senile, but I specifically remembered that
> we prevented a socket from ever being bound again once it has been
> bound one time specifically so we didn't have to deal with issues
> like this.
> 
> I really don't think it's valid for NFS to reuse the socket structure
> like this over and over again.  And that's why only NFS can reproduce
> this: the interfaces provided to userland can't actually go through this
> sequence after a socket goes down one time all the way to close.
> 
> Do we really want to audit each and every odd member of the socket
> structure from the generic portion all the way down to INET and
> TCP specifics to figure out what needs to get zero'd out?

An audit is always welcome, we might find bugs :)

> 
> So much relies upon the one-time full zero out during sock allocation.
> 
> Let's fix NFS instead.

bugzilla reference : http://bugzilla.kernel.org/show_bug.cgi?id=14580

Trond said :
  NFS MUST reuse the same port because on most servers, the replay cache is keyed
  to the port number. In other words, when we replay an RPC call, the server will
  only recognise it as a replay if it originates from the same port.
  See http://www.connectathon.org/talks96/werme1.html


Please note the socket stays bound to a given local port.

We want to connect() it to a possibly different target, that's all.

In the NFS case the 'other target' is in fact the same target, but this
is a special case of a more general one.

Hmm... if an application wants to keep a local port for itself (not
allowing another one to get this (ephemeral?) port during the 
close()/socket()/bind() window), this is the only way.
The TCP state machine allows this, IMHO.

Google for "tcp AF_UNSPEC connect" to find many references and man pages
for this stuff.

http://kerneltrap.org/Linux/Connect_Specification_versus_Man_Page

How do other Unixes / OSes handle this?
How many applications use this trick?
Comment 16 Vaclav Bilek 2009-11-27 06:32:51 UTC
We do not use NFS,
only the varnish http reverse proxy: http://varnish.projects.linpro.no/
Comment 17 Vaclav Bilek 2009-12-03 09:23:55 UTC
(In reply to comment #16)
> We do not use NFS,
> only the varnish http reverse proxy: http://varnish.projects.linpro.no/

(In reply to comment #9)
Against which release (2.6.31.6?) should that patch be applied?
Comment 18 David Collins 2009-12-04 19:37:25 UTC
I have 2.6.32 running on roughly 50 servers.  So far there haven't been any crashes due to TCP.  There were a couple of changes to tcp_* in 2.6.31.10 & 2.6.32.  One was the following...
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bbf31bf18d34caa87dd01f08bf713635593697f2

Has anyone given 2.6.32 or 2.6.31.10 a try yet?

In the past 14 hours or so, there have been 0 crashes due to TCP.
Comment 19 David Collins 2009-12-07 17:59:22 UTC
Still no problems with 2.6.32. I'm going to put this kernel on 50 more servers today to test it out. I'll let you know how it goes.
Comment 20 Yuriy Shkandybin 2009-12-09 08:19:43 UTC
2.6.32 HP DL180G5
got crash today 
same trace
Comment 21 David Collins 2009-12-10 07:57:39 UTC
(In reply to comment #20)
> 2.6.32 HP DL180G5
> got crash today 
> same trace

Yup, it just happened to me as well. Two of my boxes hit the problem.
Comment 22 info 2010-01-16 22:34:45 UTC
I have the same problem on ~10 boxes. Upgrading to 2.6.32.3 didn't help; they keep crashing at a rate of about 1 server per day.
Comment 23 David Collins 2010-01-19 21:53:06 UTC
(In reply to comment #22)
> I have the same problem on ~10 boxes. Upgrading to 2.6.32.3 didn't help; they
> keep crashing at a rate of about 1 server per day.

Just curious, what kind of boxes are you running, and what kind of network cards?
Comment 24 info 2010-01-20 01:58:13 UTC
They're Supermicro boxes with nvidia network cards.
Comment 25 Vaclav Bilek 2010-01-20 06:01:14 UTC
dell r610; intel 82571EB and 82575 NIC
Comment 26 Petr Sodomka 2010-01-29 21:09:26 UTC
I confirm this bug. I have 2 Ubuntu servers running 2.6.31-17-server, and they freeze about once a day when under high load. It ends with the same stack trace. It happens with servers behind an LVS (Keepalived) load balancer.

Server configuration:

INTEL MB DQ965GF/GUARDFISH/uATX/A,R,IG (Intel Corporation 82566DM Gigabit integrated) with another Intel Corporation 82541PI Gigabit network adapter. I tried both with the same results.
Comment 27 Petr Sodomka 2010-01-31 18:18:24 UTC
I've moved the disks to completely different hardware (Tyan Transport GT20 B2925G20V4H, no e1000s, nvidia chipset and ethernet) and it still hangs. This is what I found on the screen (typed by hand from the screen; does it help without more detail?):

tcp_rcv_state_process
tcp_v4_do_rcv
tcp_v4_rcv
ip_local_deliver_finish
nf_hook_slow
ip_local_deliver_finish
ip_local_deliver_finish
ip_local_deliver
ip_rcv_finish
ip_rcv
netif_receive_skb
process_backlog
net_rx_action
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
native_safe_halt
default_idle
c1e_idle
cpu_idle
start_secondary

Maybe the fact that we all (?) have this problem with servers behind a load balancer is important? I've moved back to 2.6.28-6 as recommended and it seems stable...
Comment 28 info 2010-01-31 19:29:56 UTC
I don't use any balancer in front, just nginx... It seems I have no option other than to go back to 2.6.28 until this problem is fixed in 2.6.31+.

We thought it was caused by netconsole + nvidia, but after removing netconsole from the kernel it still happens with 2.6.32.x.


Jan 31 14:23:21 bstorage37-i [133369.201250] BUG: unable to handle kernel 
Jan 31 14:23:21 bstorage37-i NULL pointer dereference
Jan 31 14:23:21 bstorage37-i at (null)
Jan 31 14:23:21 bstorage37-i [133369.201274] IP:
Jan 31 14:23:21 bstorage37-i [<c060164a>] tcp_xmit_retransmit_queue+0x1b2/0x1dc
Jan 31 14:23:21 bstorage37-i [133369.201295] *pdpt = 0000000021b03001 
Jan 31 14:23:21 bstorage37-i *pde = 0000000000000000 
Jan 31 14:23:21 bstorage37-i 
Jan 31 14:23:21 bstorage37-i [133369.201311] Thread overran stack, or stack corrupted
Jan 31 14:23:21 bstorage37-i [133369.201323] Oops: 0000 [#1] 
Jan 31 14:23:21 bstorage37-i SMP 
Jan 31 14:23:21 bstorage37-i 
Jan 31 14:23:21 bstorage37-i [133369.201336] last sysfs file: /sys/devices/pci0000:00/0000:00:0f.0/0000:07:00.0/0000:08:01.0/0000:09:00.0/class
Jan 31 14:23:21 bstorage37-i [133369.201355] 
Jan 31 14:23:21 bstorage37-i [133369.201363] Pid: 0, comm: swapper Not tainted (2.6.31.6-v03 #2) H8DMU
Jan 31 14:23:21 bstorage37-i [133369.201377] EIP: 0060:[<c060164a>] EFLAGS: 00010246 CPU: 0
Jan 31 14:23:21 bstorage37-i [133369.201390] EIP is at tcp_xmit_retransmit_queue+0x1b2/0x1dc
Jan 31 14:23:21 bstorage37-i [133369.201401] EAX: dc5d08fc EBX: dc5d0880 ECX: 19dc6948 EDX: dc5d08fc
Jan 31 14:23:21 bstorage37-i [133369.201413] ESI: 00000000 EDI: 00000000 EBP: c0805d28 ESP: c0805d0c
Jan 31 14:23:21 bstorage37-i [133369.201426]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jan 31 14:23:21 bstorage37-i [133369.201438] Process swapper (pid: 0, ti=c0804000 task=c080b5a0 task.ti=c0804000)
Jan 31 14:23:21 bstorage37-i [133369.201453] Stack:
Jan 31 14:23:21 bstorage37-i [133369.201457]  00000202
Comment 29 Rob de Wit 2010-02-24 09:56:32 UTC
I confirm this bug on two Supermicro AMD servers with nvidia MCP55 NICs, running 2.6.32.2. Both run apache httpd behind an LVS load balancer, without NFS. Other network traffic would be ssh and snmp. eth0 is shared with an ipmi device.

Both have also crashed while running 2.6.32.8, but I did not have an opportunity to look at the console before rebooting, so I don't know for sure that the crash was in tcp_xmit_retransmit_queue.
Comment 30 Vitaly Ivanov 2010-02-26 16:50:01 UTC
I confirm this bug on 6 Supermicro AMD servers with Intel 80003ES2LAN Gigabit Ethernet NICs, kernel version 2.6.29.
Comment 31 info 2010-02-26 17:20:26 UTC
This problem disappeared when I downgraded the kernel to 2.6.26.8.

supermicro
intel NIC
netconsole enabled.

so, it's not hardware-related.

Waiting for a kernel fix.
Comment 32 Rob de Wit 2010-03-01 13:43:13 UTC
Just an update: both our Supermicros with nVidia NICs seem to stay up with 2.6.28.6.
Comment 33 Petr Sodomka 2010-03-05 10:32:44 UTC
Is anyone trying the latest 2.6.33? I'm testing it on one machine and it has been stable for 7 days now. It's this package: http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.33/ (amd64 version).
Comment 34 info 2010-03-08 00:55:50 UTC
I'm installing 2.6.33 on ~7 boxes; let's see how stable it is.
Comment 35 Yuriy Yevtukhov 2010-03-09 23:40:22 UTC
*** Bug 15487 has been marked as a duplicate of this bug. ***
Comment 36 Yuriy Yevtukhov 2010-03-10 18:25:10 UTC
Any news about 2.6.33?
I compared /net/ipv4/tcp_output.c between 2.6.32.8 and 2.6.34-rc1 and found no differences in tcp_xmit_retransmit_queue(). The only changes I noticed concern the new sysctl variable tcp_cookie_size. There is also nothing in the kernel changelogs about fixes in this area.
Comment 37 Yuriy Yevtukhov 2010-03-13 14:25:32 UTC
I'm not 100% sure, but it seems that disabling SACK hides the problem on 2.6.32.8 (and maybe earlier):
net.ipv4.tcp_sack=0
(by default it is enabled)

Without SACK I haven't seen this bug yet on 27 servers with 2.6.32.8 and 4 servers with 2.6.31.2 (all highly loaded web servers).
Comment 38 Petr Sodomka 2010-03-14 13:54:17 UTC
Today the 2.6.33 server crashed again with the usual symptoms. Going back to 2.6.28-6.
Comment 39 info 2010-03-14 14:25:40 UTC
Petr, did you have tcp_sack=0 when it crashed?
Comment 40 Petr Sodomka 2010-03-14 16:38:12 UTC
No, I had default /proc/sys/net/ipv4/tcp_sack = 1

I was starting to think that 2.6.33 was going to be stable (it survived 14 days, while 2.6.32 always lasted only a few days), so I didn't try that recommendation.

OK, it's an idea. I'm not returning to the older version; I'll try the old, freezing 2.6.31 with tcp_sack=0 and we'll see.
Comment 41 Matthew Hallacy 2010-03-16 09:58:53 UTC
We're seeing the same issue on systems running 2.6.30.8 through 2.6.33 (mostly 2.6.33 now) and pushing 1 Gbit/s+. These are SuperMicro X8DTU and Dell PE850 systems. (This oops is from the SuperMicro flavor with igb.)

tcp_sack = 1

Here's the oops (posting because it differs slightly from the above):

/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1/irq == igb

Mar 16 02:26:04 mobile02 kernel: [48936.189104] BUG: unable to handle kernel NULL pointer dereference at (null)
Mar 16 02:26:04 mobile02 kernel: [48936.189144] IP: [<ffffffff816cf308>] tcp_xmit_retransmit_queue+0x68/0x270
Mar 16 02:26:04 mobile02 kernel: [48936.189181] PGD c35c08067 PUD c35c09067 PMD 0
Mar 16 02:26:04 mobile02 kernel: [48936.189211] Oops: 0000 [#1] SMP
Mar 16 02:26:04 mobile02 kernel: [48936.189236] last sysfs file: /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1/irq
Mar 16 02:26:04 mobile02 kernel: [48936.189280] CPU 15
Mar 16 02:26:04 mobile02 kernel: [48936.189329] Pid: 0, comm: swapper Not tainted 2.6.33 #1 X8DTU/X8DTU
Mar 16 02:26:04 mobile02 kernel: [48936.189360] RIP: 0010:[<ffffffff816cf308>]  [<ffffffff816cf308>] tcp_xmit_retransmit_queue+0x68/0x270
Mar 16 02:26:04 mobile02 kernel: [48936.189409] RSP: 0018:ffff8806555c3b10  EFLAGS: 00010246
Mar 16 02:26:04 mobile02 kernel: [48936.189435] RAX: 000000000adbc3a1 RBX: ffff880b7b649980 RCX: ffff880b7b649c98
Mar 16 02:26:04 mobile02 kernel: [48936.189464] RDX: 0000000000000002 RSI: 000000000000050e RDI: ffff880b7b649980
Mar 16 02:26:04 mobile02 kernel: [48936.189494] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000003
Mar 16 02:26:04 mobile02 kernel: [48936.189523] R10: 000000000adb9981 R11: 0000000000000000 R12: 0000000000000006
Mar 16 02:26:04 mobile02 kernel: [48936.189576] R13: 0000000000000000 R14: ffff880b7b649a48 R15: 0000000000000000
Mar 16 02:26:04 mobile02 kernel: [48936.189607] FS:  0000000000000000(0000) GS:ffff8806555c0000(0000) knlGS:0000000000000000
Mar 16 02:26:04 mobile02 kernel: [48936.189652] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 16 02:26:04 mobile02 kernel: [48936.189689] CR2: 0000000000000000 CR3: 0000000c35c07000 CR4: 00000000000006e0
Mar 16 02:26:04 mobile02 kernel: [48936.189721] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 16 02:26:04 mobile02 kernel: [48936.189750] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 16 02:26:04 mobile02 kernel: [48936.189781] Process swapper (pid: 0, threadinfo ffff880c3ce14000, task ffff88063cdd0cc0)
Mar 16 02:26:04 mobile02 kernel: [48936.189828] Stack:
Mar 16 02:26:04 mobile02 kernel: [48936.189847]  ffff880a8dba6c00 ffff880b7b649c98 0adbc3a1fdbf4c00 000000000000050e
Mar 16 02:26:04 mobile02 kernel: [48936.189881] <0> ffff880b7b649980 0000000000000006 0000000000000000 0000000000000000
Mar 16 02:26:04 mobile02 kernel: [48936.189933] <0> 0000000000000000 ffffffff816c9469 0000000000000005 0000000000000000
Mar 16 02:26:04 mobile02 kernel: [48936.190002] Call Trace:
Mar 16 02:26:04 mobile02 kernel: [48936.190023]  <IRQ>
Mar 16 02:26:04 mobile02 kernel: [48936.190046]  [<ffffffff816c9469>] ? tcp_ack+0x1389/0x2020
Mar 16 02:26:04 mobile02 kernel: [48936.190075]  [<ffffffff816ca625>] ? tcp_validate_incoming+0x105/0x330
Mar 16 02:26:04 mobile02 kernel: [48936.190104]  [<ffffffff816cb60e>] ? tcp_rcv_state_process+0x7e/0xc70
Mar 16 02:26:04 mobile02 kernel: [48936.190133]  [<ffffffff816d2d61>] ? tcp_v4_do_rcv+0xa1/0x230
Mar 16 02:26:04 mobile02 kernel: [48936.190160]  [<ffffffff816d34fa>] ? tcp_v4_rcv+0x60a/0x7e0
Mar 16 02:26:04 mobile02 kernel: [48936.190188]  [<ffffffff816b37ea>] ? ip_local_deliver_finish+0x8a/0x1a0
Mar 16 02:26:04 mobile02 kernel: [48936.190216]  [<ffffffff816b325d>] ? ip_rcv_finish+0x18d/0x3b0
Mar 16 02:26:04 mobile02 kernel: [48936.190243]  [<ffffffff816b36d7>] ? ip_rcv+0x257/0x2e0
Mar 16 02:26:04 mobile02 kernel: [48936.190272]  [<ffffffff8166b080>] ? napi_skb_finish+0x40/0x50
Mar 16 02:26:04 mobile02 kernel: [48936.190330]  [<ffffffff814e3150>] ? igb_poll+0x7d0/0xe50
Mar 16 02:26:04 mobile02 kernel: [48936.190356]  [<ffffffff8166b663>] ? net_rx_action+0x83/0x120
Mar 16 02:26:04 mobile02 kernel: [48936.190387]  [<ffffffff81044fe7>] ? __do_softirq+0xa7/0x130
Mar 16 02:26:04 mobile02 kernel: [48936.190417]  [<ffffffff8105e851>] ? ktime_get+0x61/0xe0
Mar 16 02:26:04 mobile02 kernel: [48936.190445]  [<ffffffff8100330c>] ? call_softirq+0x1c/0x30
Mar 16 02:26:04 mobile02 kernel: [48936.190472]  [<ffffffff8100524d>] ? do_softirq+0x4d/0x80
Mar 16 02:26:04 mobile02 kernel: [48936.190499]  [<ffffffff81044cd5>] ? irq_exit+0x75/0x90
Mar 16 02:26:04 mobile02 kernel: [48936.190525]  [<ffffffff810047ee>] ? do_IRQ+0x6e/0xf0
Mar 16 02:26:04 mobile02 kernel: [48936.190585]  [<ffffffff817a5ed3>] ? ret_from_intr+0x0/0xa
Mar 16 02:26:04 mobile02 kernel: [48936.190611]  <EOI>
Mar 16 02:26:04 mobile02 kernel: [48936.190635]  [<ffffffff8163b3b0>] ? menu_reflect+0x0/0x20
Mar 16 02:26:04 mobile02 kernel: [48936.190665]  [<ffffffff813936a8>] ? acpi_idle_enter_c1+0x8a/0xf3
Mar 16 02:26:04 mobile02 kernel: [48936.190693]  [<ffffffff81393672>] ? acpi_idle_enter_c1+0x54/0xf3
Mar 16 02:26:04 mobile02 kernel: [48936.190724]  [<ffffffff8163b4d8>] ? menu_select+0x108/0x290
Mar 16 02:26:04 mobile02 kernel: [48936.190751]  [<ffffffff8163a5da>] ? cpuidle_idle_call+0xba/0x120
Mar 16 02:26:04 mobile02 kernel: [48936.190779]  [<ffffffff810016fa>] ? cpu_idle+0xaa/0x110
Mar 16 02:26:04 mobile02 kernel: [48936.190807] Code: 05 00 00 4c 8d b3 c8 00 00 00 39 c2 89 54 24 14 78 04 89 44 24 14 48 8d 8b 18 03 00 00 45 31 ed 45 31 ff 48 89 4c 24 08 0f 1f 00 <48> 8b 45 00 49 39 ee 0f 18 08 74 62 48 3b ab 00 02 00 00 48 8d
Mar 16 02:26:04 mobile02 kernel: [48936.191007] RIP  [<ffffffff816cf308>] tcp_xmit_retransmit_queue+0x68/0x270
Mar 16 02:26:04 mobile02 kernel: [48936.191038]  RSP <ffff8806555c3b10>
Mar 16 02:26:04 mobile02 kernel: [48936.191060] CR2: 0000000000000000
Mar 16 02:26:04 mobile02 kernel: [48936.191412] ---[ end trace 3f1fda40fce80ab1 ]---
Mar 16 02:26:04 mobile02 kernel: [48936.191478] Kernel panic - not syncing: Fatal exception in interrupt
Mar 16 02:26:04 mobile02 kernel: [48936.191546] Pid: 0, comm: swapper Tainted: G      D    2.6.33 #1
Mar 16 02:26:04 mobile02 kernel: [48936.191616] Call Trace:
Mar 16 02:26:04 mobile02 kernel: [48936.191677]  <IRQ>  [<ffffffff817a2f4d>] ? panic+0x86/0x159
Mar 16 02:26:04 mobile02 kernel: [48936.191792]  [<ffffffff81002dd3>] ? apic_timer_interrupt+0x13/0x20
Mar 16 02:26:04 mobile02 kernel: [48936.191863]  [<ffffffff8136c060>] ? vgacon_cursor+0x0/0x240
Mar 16 02:26:04 mobile02 kernel: [48936.191930]  [<ffffffff8104037e>] ? kmsg_dump+0x7e/0x140
Mar 16 02:26:04 mobile02 kernel: [48936.191999]  [<ffffffff810066d5>] ? oops_end+0x95/0xa0
Mar 16 02:26:04 mobile02 kernel: [48936.193254]  [<ffffffff81023530>] ? no_context+0x100/0x270
Mar 16 02:26:04 mobile02 kernel: [48936.193321]  [<ffffffff810237f5>] ? __bad_area_nosemaphore+0x155/0x230
Mar 16 02:26:04 mobile02 kernel: [48936.193392]  [<ffffffff8167f497>] ? sch_direct_xmit+0x77/0x1d0
Mar 16 02:26:04 mobile02 kernel: [48936.193461]  [<ffffffff8166bffd>] ? dev_queue_xmit+0x13d/0x5a0
Mar 16 02:26:04 mobile02 kernel: [48936.193539]  [<ffffffff817a60df>] ? page_fault+0x1f/0x30
Mar 16 02:26:04 mobile02 kernel: [48936.193646]  [<ffffffff816cf308>] ? tcp_xmit_retransmit_queue+0x68/0x270
Mar 16 02:26:04 mobile02 kernel: [48936.193714]  [<ffffffff816c9469>] ? tcp_ack+0x1389/0x2020
Mar 16 02:26:04 mobile02 kernel: [48936.193782]  [<ffffffff816ca625>] ? tcp_validate_incoming+0x105/0x330
Mar 16 02:26:04 mobile02 kernel: [48936.193850]  [<ffffffff816cb60e>] ? tcp_rcv_state_process+0x7e/0xc70
Mar 16 02:26:04 mobile02 kernel: [48936.193918]  [<ffffffff816d2d61>] ? tcp_v4_do_rcv+0xa1/0x230
Mar 16 02:26:04 mobile02 kernel: [48936.193984]  [<ffffffff816d34fa>] ? tcp_v4_rcv+0x60a/0x7e0
Mar 16 02:26:04 mobile02 kernel: [48936.194054]  [<ffffffff816b37ea>] ? ip_local_deliver_finish+0x8a/0x1a0
Mar 16 02:26:04 mobile02 kernel: [48936.194125]  [<ffffffff816b325d>] ? ip_rcv_finish+0x18d/0x3b0
Mar 16 02:26:04 mobile02 kernel: [48936.194191]  [<ffffffff816b36d7>] ? ip_rcv+0x257/0x2e0
Mar 16 02:26:04 mobile02 kernel: [48936.194257]  [<ffffffff8166b080>] ? napi_skb_finish+0x40/0x50
Mar 16 02:26:04 mobile02 kernel: [48936.194348]  [<ffffffff814e3150>] ? igb_poll+0x7d0/0xe50
Mar 16 02:26:04 mobile02 kernel: [48936.194414]  [<ffffffff8166b663>] ? net_rx_action+0x83/0x120
Mar 16 02:26:04 mobile02 kernel: [48936.194482]  [<ffffffff81044fe7>] ? __do_softirq+0xa7/0x130
Mar 16 02:26:04 mobile02 kernel: [48936.194579]  [<ffffffff8105e851>] ? ktime_get+0x61/0xe0
Mar 16 02:26:04 mobile02 kernel: [48936.194649]  [<ffffffff8100330c>] ? call_softirq+0x1c/0x30
Mar 16 02:26:04 mobile02 kernel: [48936.194719]  [<ffffffff8100524d>] ? do_softirq+0x4d/0x80
Comment 42 Andrew Morton 2010-03-18 21:05:43 UTC
On Wed, 02 Dec 2009 22:24:46 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Thu, 26 Nov 2009 23:54:53 +0200 (EET)
> 
> > [PATCH] tcp: clear hints to avoid a stale one (nfs only affected?)
> 
> Ok, since Linus just released 2.6.32 I'm tossing this into net-next-2.6
> so it gets wider exposure.
> 
> I still want to see test results from the bug reporter, and if it fixes
> things we can toss this into -stable too.

Despite my request to take this to email, quite a few people have been
jumping onto this report via bugzilla:
http://bugzilla.kernel.org/show_bug.cgi?id=14470

Bit of a pita, but it'd be worth someone taking a look to ensure that
we're all talking about the same bug.
Comment 43 info 2010-03-19 15:56:01 UTC
Confirmed the same bug with 2.6.33:

tcp_sack = 1

Mar 19 23:42:55 bstorage25-i [644676.050103] EIP: [<c141c0f1>] 
Mar 19 23:42:55 bstorage25-i tcp_xmit_retransmit_queue+0x1a8/0x1d2
Mar 19 23:42:55 bstorage25-i SS:ESP 0068:c1631d28
Mar 19 23:42:55 bstorage25-i [644676.050103] CR2: 0000000000000000
Mar 19 23:42:55 bstorage25-i [644676.052710] ---[ end trace b193123ded1c81f0 ]---
Mar 19 23:42:55 bstorage25-i [644676.052943] Kernel panic - not syncing: Fatal exception in interrupt
Mar 19 23:42:55 bstorage25-i [644676.053147] Pid: 0, comm: swapper Tainted: G      D    2.6.33-v01 #1
Mar 19 23:42:55 bstorage25-i [644676.053350] Call Trace:

I'm going to set tcp_sack = 0 to see if anything changes.
Comment 44 Ilpo Järvinen 2010-03-19 16:34:47 UTC
On Thu, 18 Mar 2010, Andrew Morton wrote:

> On Wed, 02 Dec 2009 22:24:46 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> > Date: Thu, 26 Nov 2009 23:54:53 +0200 (EET)
> > 
> > > [PATCH] tcp: clear hints to avoid a stale one (nfs only affected?)
> > 
> > Ok, since Linus just released 2.6.32 I'm tossing this into net-next-2.6
> > so it gets wider exposure.
> > 
> > I still want to see test results from the bug reporter, and if it fixes
> > things we can toss this into -stable too.
> 
> Despite my request to take this to email, quite a few people have been
> jumping onto this report via bugzilla:
> http://bugzilla.kernel.org/show_bug.cgi?id=14470
> 
> Bit of a pita, but it'd be worth someone taking a look to ensure that
> we're all talking about the same bug.

Could someone try this debug patch:

http://marc.info/?l=linux-kernel&m=126624014117610&w=2

It should prevent crashing too.
Comment 45 David Collins 2010-03-19 20:36:45 UTC
I've got a new kernel rolled out on over 200 servers with tcp_sack set to 0, and we haven't had any stability issues in over 72 hours.  We would have had at least 10 servers kernel panic by now.

I'll take a look at that patch and might give it a try if I have time.
Comment 46 Bill Wheatley 2010-03-25 15:33:18 UTC
The issue is still happening in stable 2.6.33.1

Is there a patch that's in the works for one of the next kernel revisions?
Comment 47 David Collins 2010-03-25 17:12:19 UTC
(In reply to comment #46)
> The issue is still happening in stable 2.6.33.1
> 
> Is there a patch that's in the works for one of the next kernel revisions?

Bill,
Have you tried disabling tcp_sack?

echo 0 > /proc/sys/net/ipv4/tcp_sack

... or add it to sysctl.conf.

... or here is a little patch to keep it disabled.

--- linux-2.6.33.1/net/ipv4/tcp_input.c.old	2010-03-19 18:39:43.000000000 -0500
+++ linux-2.6.33.1/net/ipv4/tcp_input.c	2010-03-19 18:39:29.000000000 -0500
@@ -74,7 +74,7 @@
 
 int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
-int sysctl_tcp_sack __read_mostly = 1;
+int sysctl_tcp_sack __read_mostly = 0;
 int sysctl_tcp_fack __read_mostly = 1;
 int sysctl_tcp_reordering __read_mostly = TCP_FASTRETRANS_THRESH;
 int sysctl_tcp_ecn __read_mostly = 2;
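
For anyone who wants to verify that the workaround is actually in effect on a box, here is a minimal user-space sketch that reads an integer sysctl from a file such as /proc/sys/net/ipv4/tcp_sack. The helper names parse_sysctl_int and read_sysctl_int are hypothetical, made up for this illustration; they are not part of any kernel or libc API.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse sysctl text such as "1\n" into an int.
 * Returns 0 on success, -1 if the text is not a single integer. */
static int parse_sysctl_int(const char *text, int *out)
{
    char *end = NULL;
    long value;

    assert(text != NULL && out != NULL);

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || end == text)
        return -1;                      /* overflow or no digits at all */
    if (value < INT_MIN || value > INT_MAX)
        return -1;

    /* Only trailing whitespace may follow the number. */
    while (*end == ' ' || *end == '\t' || *end == '\n')
        end++;
    if (*end != '\0')
        return -1;

    *out = (int)value;
    return 0;
}

/* Hypothetical helper: read the first line of `path` and parse it.
 * Returns 0 on success, -1 if the file cannot be read or parsed. */
static int read_sysctl_int(const char *path, int *out)
{
    char buf[64];
    char *line;
    FILE *f;

    assert(path != NULL && out != NULL);

    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    line = fgets(buf, sizeof buf, f);
    fclose(f);
    if (line == NULL)
        return -1;
    return parse_sysctl_int(buf, out);
}
```

Calling read_sysctl_int("/proc/sys/net/ipv4/tcp_sack", &v) should leave v at 0 once SACK has been disabled and at 1 with the default setting.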
Comment 48 Bill Wheatley 2010-04-05 17:03:39 UTC
Won't turning off TCP selective acknowledgement result in other performance penalties?
Comment 49 Taylan Develioglu 2010-04-08 09:50:37 UTC
Same problem on 2.6.31.12 and 2.6.32.8 on 4 machines. Disabling SACK hides the problem.
Comment 50 Ivan Zahariev (famzah) 2010-06-07 09:27:58 UTC
Same problem here:
* 2.6.32.13-grsec - PANICs
* 2.6.32.9-grsec - PANICs
* 2.6.27.4-grsec - OK

Hardware is Supermicro X7DVL with Intel Xeon CPUs and Intel Ethernet Controllers.

Disabling SACK resolves the problem.
Comment 51 Yuriy Yevtukhov 2010-09-20 14:39:27 UTC
Problem seems to be solved by:
commit dc6330590fbd5fad17f06663c5f0bed834054b2b
Author: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Date:   Mon Jul 19 01:16:18 2010 +0000

    tcp: fix crash in tcp_xmit_retransmit_queue
    
    commit 45e77d314585869dfe43c82679f7e08c9b35b898 upstream.
    
    It can happen that there are no packets in queue while calling
    tcp_xmit_retransmit_queue(). tcp_write_queue_head() then returns
    NULL and that gets deref'ed to get sacked into a local var.
    
    There is no work to do if no packets are outstanding so we just
    exit early.
    
    This oops was introduced by 08ebd1721ab8fd (tcp: remove tp->lost_out
    guard to make joining diff nicer).
    
    Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
    Reported-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de>
    Tested-by: Lennart Schulte <lennart.schulte@nets.rwth-aachen.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

The fix is in 2.6.32.17 and in all other active branches released after Jul 19 (you can search the corresponding changelog).
I enabled SACK and have had no crashes for several weeks.

Maybe someone else can test too.
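
The failure mode described in the commit message above can be modelled in user space. What follows is only a sketch, not the kernel code: the struct and function names are hypothetical, the write queue is reduced to a circular list with a sentinel, and the guard is assumed to test an outstanding-packet counter (my reading of "no packets are outstanding"). It shows why a head lookup that returns NULL on an empty queue slips past a loop that only compares against the sentinel, and how exiting early avoids the dereference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued packet. */
struct node {
    struct node *next;
    unsigned int sacked;       /* stand-in for the per-packet SACK flags */
};

/* Hypothetical stand-in for the write queue: circular list with sentinel. */
struct queue {
    struct node sentinel;
    struct node *tail;
    unsigned int packets_out;  /* packets sent but not yet acknowledged */
};

static void queue_init(struct queue *q)
{
    q->sentinel.next = &q->sentinel;
    q->sentinel.sacked = 0;
    q->tail = &q->sentinel;
    q->packets_out = 0;
}

static void queue_append(struct queue *q, struct node *n)
{
    n->next = &q->sentinel;
    q->tail->next = n;
    q->tail = n;
    q->packets_out++;
}

/* Like the head lookup in the commit message: NULL when the queue is empty. */
static struct node *queue_head(struct queue *q)
{
    return q->sentinel.next == &q->sentinel ? NULL : q->sentinel.next;
}

/* Walks the queue and returns the number of packets visited. */
static int retransmit_walk(struct queue *q)
{
    int visited = 0;

    assert(q != NULL);

    /* Early exit: nothing outstanding means nothing to retransmit.
     * Without this check an empty queue gives n == NULL below; NULL is
     * not the sentinel, so the loop body would run and n->sacked would
     * dereference a NULL pointer. */
    if (q->packets_out == 0)
        return 0;

    for (struct node *n = queue_head(q); n != &q->sentinel; n = n->next) {
        unsigned int sacked = n->sacked;
        (void)sacked;          /* the real code inspects these flags */
        visited++;
    }
    return visited;
}
```

On an empty queue retransmit_walk returns 0 without touching any packet, which matches the "exit early" behaviour the commit describes; with the guard removed, the same call would fault on the first loop iteration.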