Bug 10097

Summary: SMP BUG in __nf_conntrack_find
Product: Networking Reporter: Christian Casteyde (casteyde.christian)
Component: Netfilter/IptablesAssignee: networking_netfilter-iptables (networking_netfilter-iptables)
Status: CLOSED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: bunk
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.25-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 9832    
Attachments: fix smp_processor_id warning

Description Christian Casteyde 2008-02-25 10:44:07 UTC
Latest working kernel version: 2.6.24.2
Earliest failing kernel version: 2.6.24-rc3 (not checked before)
Distribution: Bluewhite64
Hardware Environment: Athlon X2 4200

Software Environment:
samba 3.0, 2.6.25-rc3 kernel + HR + tickless + kernel SMP debugging

Problem Description:
The Samba smbd daemon triggers regularly the following BUG with 2.6.25-rc3:

BUG: using smp_processor_id() in preemptible [00000000] code: nmbd/3167
caller is __nf_conntrack_find+0x119/0x150
Pid: 3167, comm: nmbd Not tainted 2.6.25-rc3 #1

Call Trace:
 [<ffffffff8038f3f4>] debug_smp_processor_id+0xc4/0xd0
 [<ffffffff80555d79>] __nf_conntrack_find+0x119/0x150
 [<ffffffff80555dc9>] nf_conntrack_find_get+0x19/0x80
 [<ffffffff80556914>] nf_conntrack_in+0x1a4/0x5a0
 [<ffffffff8020bd33>] ? restore_args+0x0/0x30
 [<ffffffff8059d596>] ipv4_conntrack_local+0x66/0x70
 [<ffffffff80554362>] nf_iterate+0x62/0xa0
 [<ffffffff80567050>] ? dst_output+0x0/0x10
 [<ffffffff80554406>] nf_hook_slow+0x66/0xe0
 [<ffffffff80567050>] ? dst_output+0x0/0x10
 [<ffffffff80568825>] __ip_local_out+0xa5/0xb0
 [<ffffffff80568841>] ip_local_out+0x11/0x30
 [<ffffffff80568ac1>] ip_push_pending_frames+0x261/0x3e0
 [<ffffffff80587153>] udp_push_pending_frames+0x233/0x3d0
 [<ffffffff8058860f>] udp_sendmsg+0x30f/0x710
 [<ffffffff802328b0>] ? default_wake_function+0x0/0x10
 [<ffffffff8058f895>] inet_sendmsg+0x45/0x80
 [<ffffffff80531fcf>] sock_sendmsg+0xdf/0x110
 [<ffffffff80251270>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff802374c7>] ? hrtick_resched+0x77/0x90
 [<ffffffff8025e2b5>] ? trace_hardirqs_on+0xd5/0x160
 [<ffffffff80531735>] ? sockfd_lookup_light+0x45/0x80
 [<ffffffff805323da>] sys_sendto+0xea/0x120
 [<ffffffff80626bcb>] ? _spin_unlock_irq+0x2b/0x60
 [<ffffffff8025e2b5>] ? trace_hardirqs_on+0xd5/0x160
 [<ffffffff80626bd6>] ? _spin_unlock_irq+0x36/0x60
 [<ffffffff8020b6db>] system_call_after_swapgs+0x7b/0x80

Steps to reproduce:
Start smbd with the forementionned kernel instrumented for SMP and kernel debugging and hr + tickless enabled.
Comment 1 Anonymous Emailer 2008-02-25 11:12:26 UTC
Reply-To: akpm@linux-foundation.org

On Mon, 25 Feb 2008 10:44:08 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=10097
> 
>            Summary: SMP BUG in __nf_conntrack_find
>            Product: Networking
>            Version: 2.5
>      KernelVersion: 2.6.25-rc3
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Netfilter/Iptables
>         AssignedTo: networking_netfilter-iptables@kernel-bugs.osdl.org
>         ReportedBy: casteyde.christian@free.fr
> 
> 
> Latest working kernel version: 2.6.24.2
> Earliest failing kernel version: 2.6.24-rc3 (not checked before)
> Distribution: Bluewhite64
> Hardware Environment: Athlon X2 4200
> 
> Software Environment:
> samba 3.0, 2.6.25-rc3 kernel + HR + tickless + kernel SMP debugging
> 
> Problem Description:
> The Samba smbd daemon triggers regularly the following BUG with 2.6.25-rc3:
> 
> BUG: using smp_processor_id() in preemptible [00000000] code: nmbd/3167
> caller is __nf_conntrack_find+0x119/0x150
> Pid: 3167, comm: nmbd Not tainted 2.6.25-rc3 #1
> 
> Call Trace:
>  [<ffffffff8038f3f4>] debug_smp_processor_id+0xc4/0xd0
>  [<ffffffff80555d79>] __nf_conntrack_find+0x119/0x150
>  [<ffffffff80555dc9>] nf_conntrack_find_get+0x19/0x80
>  [<ffffffff80556914>] nf_conntrack_in+0x1a4/0x5a0
>  [<ffffffff8020bd33>] ? restore_args+0x0/0x30
>  [<ffffffff8059d596>] ipv4_conntrack_local+0x66/0x70
>  [<ffffffff80554362>] nf_iterate+0x62/0xa0
>  [<ffffffff80567050>] ? dst_output+0x0/0x10
>  [<ffffffff80554406>] nf_hook_slow+0x66/0xe0
>  [<ffffffff80567050>] ? dst_output+0x0/0x10
>  [<ffffffff80568825>] __ip_local_out+0xa5/0xb0
>  [<ffffffff80568841>] ip_local_out+0x11/0x30
>  [<ffffffff80568ac1>] ip_push_pending_frames+0x261/0x3e0
>  [<ffffffff80587153>] udp_push_pending_frames+0x233/0x3d0
>  [<ffffffff8058860f>] udp_sendmsg+0x30f/0x710
>  [<ffffffff802328b0>] ? default_wake_function+0x0/0x10
>  [<ffffffff8058f895>] inet_sendmsg+0x45/0x80
>  [<ffffffff80531fcf>] sock_sendmsg+0xdf/0x110
>  [<ffffffff80251270>] ? autoremove_wake_function+0x0/0x40
>  [<ffffffff802374c7>] ? hrtick_resched+0x77/0x90
>  [<ffffffff8025e2b5>] ? trace_hardirqs_on+0xd5/0x160
>  [<ffffffff80531735>] ? sockfd_lookup_light+0x45/0x80
>  [<ffffffff805323da>] sys_sendto+0xea/0x120
>  [<ffffffff80626bcb>] ? _spin_unlock_irq+0x2b/0x60
>  [<ffffffff8025e2b5>] ? trace_hardirqs_on+0xd5/0x160
>  [<ffffffff80626bd6>] ? _spin_unlock_irq+0x36/0x60
>  [<ffffffff8020b6db>] system_call_after_swapgs+0x7b/0x80
> 
> Steps to reproduce:
> Start smbd with the forementionned kernel instrumented for SMP and kernel
> debugging and hr + tickless enabled.
> 

Presumably this is in NF_CT_STAT_INC().  I wonder what caused it to start
happening.

Guys, this probably means that the developers who tested this change aren't
enabling the debug options which all kernel developers _should_ be enabling
when "testing" their code!  Documentation/SubmitChecklist has a handy list.

Should NF_CT_STAT_INC() be using local_inc()?
Comment 2 Patrick McHardy 2008-02-25 12:03:48 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10097
> 
> 
> 
> 
> 
> ------- Comment #1 from anonymous@kernel-bugs.osdl.org  2008-02-25 11:12
> -------
> Reply-To: akpm@linux-foundation.org
> 
> On Mon, 25 Feb 2008 10:44:08 -0800 (PST) bugme-daemon@bugzilla.kernel.org
> wrote:
> 
>> http://bugzilla.kernel.org/show_bug.cgi?id=10097
>>
>>            Summary: SMP BUG in __nf_conntrack_find
>>            Product: Networking
>>            Version: 2.5
>>      KernelVersion: 2.6.25-rc3
>>           Platform: All
>>         OS/Version: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: Netfilter/Iptables
>>         AssignedTo: networking_netfilter-iptables@kernel-bugs.osdl.org
>>         ReportedBy: casteyde.christian@free.fr
>>
>>
>> Latest working kernel version: 2.6.24.2
>> Earliest failing kernel version: 2.6.24-rc3 (not checked before)
>> Distribution: Bluewhite64
>> Hardware Environment: Athlon X2 4200
>>
>> Software Environment:
>> samba 3.0, 2.6.25-rc3 kernel + HR + tickless + kernel SMP debugging
>>
>> Problem Description:
>> The Samba smbd daemon triggers regularly the following BUG with 2.6.25-rc3:
>>
>> BUG: using smp_processor_id() in preemptible [00000000] code: nmbd/3167
>> caller is __nf_conntrack_find+0x119/0x150
>> Pid: 3167, comm: nmbd Not tainted 2.6.25-rc3 #1
>>
>> Call Trace:
>>  [<ffffffff8038f3f4>] debug_smp_processor_id+0xc4/0xd0
>>  [<ffffffff80555d79>] __nf_conntrack_find+0x119/0x150
>>  [<ffffffff80555dc9>] nf_conntrack_find_get+0x19/0x80
>>  [<ffffffff80556914>] nf_conntrack_in+0x1a4/0x5a0
>>  [<ffffffff8020bd33>] ? restore_args+0x0/0x30
>>  [<ffffffff8059d596>] ipv4_conntrack_local+0x66/0x70
>>  [<ffffffff80554362>] nf_iterate+0x62/0xa0
>>  [<ffffffff80567050>] ? dst_output+0x0/0x10
>>  [<ffffffff80554406>] nf_hook_slow+0x66/0xe0
>>  [<ffffffff80567050>] ? dst_output+0x0/0x10
>>  [<ffffffff80568825>] __ip_local_out+0xa5/0xb0
>>  [<ffffffff80568841>] ip_local_out+0x11/0x30
>>  [<ffffffff80568ac1>] ip_push_pending_frames+0x261/0x3e0
>>  [<ffffffff80587153>] udp_push_pending_frames+0x233/0x3d0
>>  [<ffffffff8058860f>] udp_sendmsg+0x30f/0x710
>>  [<ffffffff802328b0>] ? default_wake_function+0x0/0x10
>>  [<ffffffff8058f895>] inet_sendmsg+0x45/0x80
>>  [<ffffffff80531fcf>] sock_sendmsg+0xdf/0x110
>>  [<ffffffff80251270>] ? autoremove_wake_function+0x0/0x40
>>  [<ffffffff802374c7>] ? hrtick_resched+0x77/0x90
>>  [<ffffffff8025e2b5>] ? trace_hardirqs_on+0xd5/0x160
>>  [<ffffffff80531735>] ? sockfd_lookup_light+0x45/0x80
>>  [<ffffffff805323da>] sys_sendto+0xea/0x120
>>  [<ffffffff80626bcb>] ? _spin_unlock_irq+0x2b/0x60
>>  [<ffffffff8025e2b5>] ? trace_hardirqs_on+0xd5/0x160
>>  [<ffffffff80626bd6>] ? _spin_unlock_irq+0x36/0x60
>>  [<ffffffff8020b6db>] system_call_after_swapgs+0x7b/0x80
>>
>> Steps to reproduce:
>> Start smbd with the forementionned kernel instrumented for SMP and kernel
>> debugging and hr + tickless enabled.
>>
> 
> Presumably this is in NF_CT_STAT_INC().  I wonder what caused it to start
> happening.
> 
> Guys, this probably means that the developers who tested this change aren't
> enabling the debug options which all kernel developers _should_ be enabling
> when "testing" their code!  Documentation/SubmitChecklist has a handy list.

I guess the reporter is using preemptible RCU, which wasn't upstream
when I tested this.

> Should NF_CT_STAT_INC() be using local_inc()?


No, it needs to disable BHs to avoid corrupting the counters.
I have the fix queued for this, will send upstream soon.
Comment 3 Christian Casteyde 2008-02-25 13:20:57 UTC
Yes, I've also activated Premptible RCU (I was willing to test since I'm quite lucky finding problems).

I usually run -rc3 and following kernels with all debug options activated (except kobject because it floods my logs) to test them, and I indeed did it for 2.6.24 and hadn't seen this problem. So it the NF code has not changed, premptible RCU is a good guess. I can check without it if you want - just tell me.

Btw, the problem occurs also with other processes (khttp, dnsmasq, etc), not just Samba (quite logical).
Comment 4 Patrick McHardy 2008-02-25 13:34:36 UTC
Created attachment 14987 [details]
fix smp_processor_id warning

This is the patch I have queued, please try whether it fixes the problem for you.
Comment 5 Christian Casteyde 2008-02-25 14:38:43 UTC
The patch seems to be OK. I didn't manage to reproduce with it.
For the record, I also tried without the patch and with non premptible RCU, the problem doesn't appear either.
Thanks
Comment 6 Patrick McHardy 2008-02-25 18:37:35 UTC
Thanks for testing this (and for updating the bugreport :))
Comment 7 Adrian Bunk 2008-03-16 08:40:20 UTC
fixed by commit 4e29e9ec7e0707d3925f5dcc29af0d3f04e49833