Bug 33502

Summary: Caught 64-bit read from uninitialized memory in __alloc_skb
Product: Networking    Reporter: Christian Casteyde (casteyde.christian)
Component: IPV4    Assignee: Stephen Hemminger (stephen)
Status: CLOSED UNREPRODUCIBLE    
Severity: normal    
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.39-rc3 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: .config file used to reproduce

Description Christian Casteyde 2011-04-17 19:29:38 UTC
Acer Aspire 1511LMi
Athlon 64 3 GHz in 64-bit mode
Slackware 64 13.1

Since 2.6.39-rc3 with kmemcheck enabled, I get the following warning:
...
pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0fffff: excluding 0xc0000-0xfffff
pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: excluding 0x60000000-0x60ffffff
pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: excluding 0xa0000000-0xa0ffffff
udev: renamed network interface wlan0 to eth1
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001b0bb800)
00b00b1b0088ffff0000000000000000cafe1dea20009b0000299a3100000000
 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
 ^

Pid: 1511, comm: udevd Not tainted 2.6.39-rc3 #1 Acer,Inc. Aspire 1510  /Aspire 1510
RIP: 0010:[<ffffffff810c2f0c>]  [<ffffffff810c2f0c>] __kmalloc_track_caller+0xbc/0x1d0
RSP: 0018:ffff88001d3a7a18  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000000284f
RDX: 000000000000284e RSI: ffff88001fe5b160 RDI: ffffffff8177e39a
RBP: ffff88001d3a7a48 R08: 0000000000000000 R09: ffff88001b931100
R10: 0000000000000000 R11: 0000000000000003 R12: ffff88001b0bb800
R13: ffff88001f803840 R14: 00000000000004d0 R15: ffffffff814769c6
FS:  00007f6ee81f1700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88001d0b3938 CR3: 000000001d38b000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff8147ccf2>] __alloc_skb+0x72/0x190
 [<ffffffff814769c6>] sock_alloc_send_pskb+0x236/0x3a0
 [<ffffffff81476b40>] sock_alloc_send_skb+0x10/0x20
 [<ffffffff81523c18>] unix_dgram_sendmsg+0x298/0x770
 [<ffffffff814715f3>] sock_sendmsg+0xe3/0x110
 [<ffffffff81472603>] sys_sendmsg+0x243/0x3c0
 [<ffffffff815e7238>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff
Adding 506012k swap on /dev/sda1.  Priority:-1 extents:1 across:506012k 
EXT4-fs (sda2): re-mounted. Opts: (null)
EXT4-fs (sda2): re-mounted. Opts: (null)
EXT3-fs: barriers not enabled
kjournald starting.  Commit interval 5 seconds
EXT3-fs (sda3): using internal journal
EXT3-fs (sda3): mounted filesystem with writeback data mode
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
ADDRCONF(NETDEV_UP): eth1: link is not ready
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
...

I cannot build 2.6.39-rc2, and on 2.6.39-rc1 I used to get another warning, but not this one.
Comment 1 Andrew Morton 2011-04-18 22:39:45 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun, 17 Apr 2011 19:29:39 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=33502
> 
>            Summary: Caught 64-bit read from uninitialized memory in
>                     __alloc_skb
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.39-rc3
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: casteyde.christian@free.fr
>         Regression: Yes
> 
> 
> Acer Aspire 1511LMi
> Athlon 64 3GHz in 64bits mode
> Slackware 64 13.1
> 
> Since 2.6.39-rc3 with kmemcheck enabled, I get the following warning:
> ...
> pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0fffff: excluding 0xc0000-0xfffff
> pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: excluding 0x60000000-0x60ffffff
> pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: excluding 0xa0000000-0xa0ffffff
> udev: renamed network interface wlan0 to eth1
> WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001b0bb800)
> 00b00b1b0088ffff0000000000000000cafe1dea20009b0000299a3100000000
>  u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
>  ^
> 
> Pid: 1511, comm: udevd Not tainted 2.6.39-rc3 #1 Acer,Inc. Aspire 1510  /Aspire 1510
> RIP: 0010:[<ffffffff810c2f0c>]  [<ffffffff810c2f0c>] __kmalloc_track_caller+0xbc/0x1d0
> RSP: 0018:ffff88001d3a7a18  EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000000284f
> RDX: 000000000000284e RSI: ffff88001fe5b160 RDI: ffffffff8177e39a
> RBP: ffff88001d3a7a48 R08: 0000000000000000 R09: ffff88001b931100
> R10: 0000000000000000 R11: 0000000000000003 R12: ffff88001b0bb800
> R13: ffff88001f803840 R14: 00000000000004d0 R15: ffffffff814769c6
> FS:  00007f6ee81f1700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff88001d0b3938 CR3: 000000001d38b000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
>  [<ffffffff8147ccf2>] __alloc_skb+0x72/0x190
>  [<ffffffff814769c6>] sock_alloc_send_pskb+0x236/0x3a0
>  [<ffffffff81476b40>] sock_alloc_send_skb+0x10/0x20
>  [<ffffffff81523c18>] unix_dgram_sendmsg+0x298/0x770
>  [<ffffffff814715f3>] sock_sendmsg+0xe3/0x110
>  [<ffffffff81472603>] sys_sendmsg+0x243/0x3c0
>  [<ffffffff815e7238>] system_call_fastpath+0x16/0x1b
>  [<ffffffffffffffff>] 0xffffffffffffffff

hum.  I wonder if kmemcheck is disliking prefetchw()?
Comment 2 Eric Dumazet 2011-04-19 02:51:38 UTC
On Monday 18 April 2011 at 15:38 -0700, Andrew Morton wrote:
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Sun, 17 Apr 2011 19:29:39 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=33502
> > 
> >            Summary: Caught 64-bit read from uninitialized memory in
> >                     __alloc_skb
> >            Product: Networking
> >            Version: 2.5
> >     Kernel Version: 2.6.39-rc3
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: IPV4
> >         AssignedTo: shemminger@linux-foundation.org
> >         ReportedBy: casteyde.christian@free.fr
> >         Regression: Yes
> > 
> > 
> > Acer Aspire 1511LMi
> > Athlon 64 3GHz in 64bits mode
> > Slackware 64 13.1
> > 
> > Since 2.6.39-rc3 with kmemcheck enabled, I get the following warning:
> > ...
> > pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0fffff: excluding 0xc0000-0xfffff
> > pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: excluding 0x60000000-0x60ffffff
> > pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: excluding 0xa0000000-0xa0ffffff
> > udev: renamed network interface wlan0 to eth1
> > WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001b0bb800)
> > 00b00b1b0088ffff0000000000000000cafe1dea20009b0000299a3100000000
> >  u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
> >  ^
> > 
> > Pid: 1511, comm: udevd Not tainted 2.6.39-rc3 #1 Acer,Inc. Aspire 1510  /Aspire 1510
> > RIP: 0010:[<ffffffff810c2f0c>]  [<ffffffff810c2f0c>] __kmalloc_track_caller+0xbc/0x1d0
> > RSP: 0018:ffff88001d3a7a18  EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000000284f
> > RDX: 000000000000284e RSI: ffff88001fe5b160 RDI: ffffffff8177e39a
> > RBP: ffff88001d3a7a48 R08: 0000000000000000 R09: ffff88001b931100
> > R10: 0000000000000000 R11: 0000000000000003 R12: ffff88001b0bb800
> > R13: ffff88001f803840 R14: 00000000000004d0 R15: ffffffff814769c6
> > FS:  00007f6ee81f1700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: ffff88001d0b3938 CR3: 000000001d38b000 CR4: 00000000000006f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
> >  [<ffffffff8147ccf2>] __alloc_skb+0x72/0x190
> >  [<ffffffff814769c6>] sock_alloc_send_pskb+0x236/0x3a0
> >  [<ffffffff81476b40>] sock_alloc_send_skb+0x10/0x20
> >  [<ffffffff81523c18>] unix_dgram_sendmsg+0x298/0x770
> >  [<ffffffff814715f3>] sock_sendmsg+0xe3/0x110
> >  [<ffffffff81472603>] sys_sendmsg+0x243/0x3c0
> >  [<ffffffff815e7238>] system_call_fastpath+0x16/0x1b
> >  [<ffffffffffffffff>] 0xffffffffffffffff
> 
> hum.  I wonder if kmemcheck is disliking prefetchw()?

Nope, prefetchw() is OK versus kmemcheck.

This is in __kmalloc_track_caller(), not in networking stuff.

CC Christoph Lameter

I guess this_cpu_cmpxchg16b() is the offender.

A disassembly of __kmalloc_track_caller() would help, but I feel it's the
read of s->cpu_slab->freelist.

It seems to be at address 0xffff88001b0bb800, which contains
0xffff88001b0bb000, but kmemcheck thinks it's not initialized.

It's located in the percpu zone; maybe kmemcheck has a problem with it?

alloc_kmem_cache_cpus() calls __alloc_percpu(), so this must have been
zeroed at the very beginning of the kmem_cache's life.


Hmm, looking at mm/slub.c, I wonder what prevents "object" from pointing
to a now freed and unreachable zone of memory. (Say we are interrupted,
object given to interrupt handler and this one wants to change page bits
to trap access)
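
For reference, the suspected read is SLUB's free-pointer lookup. A minimal
sketch of the helper as it appears in the patches later in this thread
(mm/slub.c); the comment is added here for illustration:

static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
	/*
	 * SLUB stores the freelist link inside the free object itself,
	 * at s->offset bytes in, so walking the freelist means reading
	 * memory that is by definition free, which is exactly the kind
	 * of access kmemcheck instruments.
	 */
	return *(void **)(object + s->offset);
}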
Comment 3 Eric Dumazet 2011-04-19 03:09:54 UTC
On Tuesday 19 April 2011 at 04:51 +0200, Eric Dumazet wrote:

> Hmm, looking at mm/slub.c, I wonder what prevents "object" from pointing
> to a now freed and unreachable zone of memory. (Say we are interrupted,
> object given to interrupt handler and this one wants to change page bits
> to trap access)


Yes, I suspect this whole business is not compatible with kmemcheck, or
with DEBUG_PAGEALLOC.

get_freepointer(s, object) can access freed memory, and kmemcheck
triggers the fault, while this_cpu_cmpxchg_double() would presumably
detect the change of tid and would not perform the freelist/tid change.
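
To make that sequence concrete, here is a minimal userspace model of the
lockless fastpath being discussed. It is a sketch, not kernel code: the real
slab_alloc() uses this_cpu_cmpxchg_double() on the (freelist, tid) pair; a
single-word CAS on tid stands in for it below, and all names are illustrative.

#include <stdatomic.h>
#include <stdio.h>

struct object { struct object *next; };

struct cpu_slab {
	struct object *_Atomic freelist;
	atomic_ulong tid;	/* bumped by every operation on this cpu */
};

static struct object *model_alloc(struct cpu_slab *c)
{
	for (;;) {
		unsigned long tid = atomic_load(&c->tid);
		struct object *object = atomic_load(&c->freelist);

		if (!object)
			return NULL;	/* the kernel falls back to __slab_alloc() */

		/*
		 * Speculative read of object->next: if we were interrupted
		 * above and "object" was allocated and freed meanwhile, this
		 * load touches memory kmemcheck considers uninitialized (or
		 * that DEBUG_PAGEALLOC has unmapped).
		 */
		struct object *next = object->next;

		/*
		 * Stand-in for this_cpu_cmpxchg_double(): commit only if tid
		 * is unchanged, i.e. nothing happened under us; a stale
		 * "next" read above is simply discarded by the retry.
		 */
		if (atomic_compare_exchange_strong(&c->tid, &tid, tid + 1)) {
			atomic_store(&c->freelist, next);
			return object;
		}
	}
}

int main(void)
{
	struct object objs[3] = { { &objs[1] }, { &objs[2] }, { NULL } };
	struct cpu_slab c = { objs, 0 };

	while (model_alloc(&c))
		;	/* drain the three-object freelist */
	puts("freelist drained");
	return 0;
}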
Comment 4 Eric Dumazet 2011-04-19 03:20:49 UTC
On Tuesday 19 April 2011 at 05:09 +0200, Eric Dumazet wrote:
> On Tuesday 19 April 2011 at 04:51 +0200, Eric Dumazet wrote:
> 
> > Hmm, looking at mm/slub.c, I wonder what prevents "object" from pointing
> > to a now freed and unreachable zone of memory. (Say we are interrupted,
> > object given to interrupt handler and this one wants to change page bits
> > to trap access)
> 
> 
> Yes, I suspect this whole business is not kmemcheck compatable, or
> DEBUG_PAGEALLOC
> 
> get_freepointer(s, object) can access to freed memory and kmemcheck
> triggers the fault, while this_cpu_cmpxchg_double() would presumably
> detect a change of tid and would not perform the freelist/tid change.
> 
> 
> 

Christian, please try the following patch. Thanks.

diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..84febe9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 	}
 }
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#if defined(CONFIG_CMPXCHG_LOCAL) && \
+	!defined(CONFIG_KMEMCHECK) && !defined(DEBUG_PAGEALLOC)
+#define SLUB_USE_CMPXCHG_DOUBLE
+#endif
+
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 #ifdef CONFIG_PREEMPT
 /*
  * Calculate the next globally unique transaction for disambiguiation
@@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char *n,
 
 void init_kmem_cache_cpus(struct kmem_cache *s)
 {
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	int cpu;
 
 	for_each_possible_cpu(cpu)
@@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 		page->inuse--;
 	}
 	c->page = NULL;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	c->tid = next_tid(c->tid);
 #endif
 	unfreeze_slab(s, page, tail);
@@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 {
 	void **object;
 	struct page *new;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -1819,7 +1824,7 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	c->tid = next_tid(c->tid);
 	local_irq_restore(flags);
 #endif
@@ -1858,7 +1863,7 @@ new_slab:
 	}
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	return NULL;
@@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 {
 	void **object;
 	struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long tid;
 #else
 	unsigned long flags;
@@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_save(flags);
 #else
 redo:
@@ -1910,7 +1915,7 @@ redo:
 	 */
 	c = __this_cpu_ptr(s->cpu_slab);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	/*
 	 * The transaction ids are globally unique per cpu and per operation on
 	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
@@ -1927,7 +1932,7 @@ redo:
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 		/*
 		 * The cmpxchg will only match if there was no additional
 		 * operation and if we are on the right processor.
@@ -1954,7 +1959,7 @@ redo:
 		stat(s, ALLOC_FASTPATH);
 	}
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 
@@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 {
 	void *prior;
 	void **object = (void *)x;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -2070,7 +2075,7 @@ checks_ok:
 
 out_unlock:
 	slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	return;
@@ -2084,7 +2089,7 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	stat(s, FREE_SLAB);
@@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct kmem_cache *s,
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long tid;
 #else
 	unsigned long flags;
@@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct kmem_cache *s,
 
 	slab_free_hook(s, x);
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_save(flags);
 
 #else
@@ -2136,7 +2141,7 @@ redo:
 	 */
 	c = __this_cpu_ptr(s->cpu_slab);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	tid = c->tid;
 	barrier();
 #endif
@@ -2144,7 +2149,7 @@ redo:
 	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
 		set_freepointer(s, object, c->freelist);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				c->freelist, tid,
@@ -2160,7 +2165,7 @@ redo:
 	} else
 		__slab_free(s, page, x, addr);
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 }
@@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
 			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	/*
 	 * Must align to double word boundary for the double cmpxchg instructions
 	 * to work.
Comment 5 Anonymous Emailer 2011-04-19 18:09:58 UTC
Reply-To: cl@linux.com

On Tue, 19 Apr 2011, Eric Dumazet wrote:

> get_freepointer(s, object) can access to freed memory and kmemcheck
> triggers the fault, while this_cpu_cmpxchg_double() would presumably
> detect a change of tid and would not perform the freelist/tid change.

Sounds right. The new lockless patchset for slub that uses a locked
cmpxchg16b will make this behavior even more common since it will do more
speculative accesses.
Comment 7 Anonymous Emailer 2011-04-19 18:11:07 UTC
Reply-To: cl@linux.com

On Tue, 19 Apr 2011, Eric Dumazet wrote:

>       }
>  }
>
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#if defined(CONFIG_CMPXCHG_LOCAL) && \
> +     !defined(CONFIG_KMEMCHECK) && !defined(DEBUG_PAGEALLOC)
> +#define SLUB_USE_CMPXCHG_DOUBLE
> +#endif
> +
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>  #ifdef CONFIG_PREEMPT
>  /*

Ugh... Isn't there some way to indicate to kmemcheck that a speculative
access is occurring?
Comment 9 Christian Casteyde 2011-04-19 20:12:57 UTC
In response to comment #3:
I didn't manage to apply the patch on -rc3 (nor on -rc4).
Comment 10 Eric Dumazet 2011-04-19 20:18:25 UTC
On Tuesday 19 April 2011 at 12:10 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, Eric Dumazet wrote:
> 
> >     }
> >  }
> >
> > -#ifdef CONFIG_CMPXCHG_LOCAL
> > +#if defined(CONFIG_CMPXCHG_LOCAL) && \
> > +   !defined(CONFIG_KMEMCHECK) && !defined(DEBUG_PAGEALLOC)
> > +#define SLUB_USE_CMPXCHG_DOUBLE
> > +#endif
> > +
> > +#ifdef SLUB_USE_CMPXCHG_DOUBLE
> >  #ifdef CONFIG_PREEMPT
> >  /*
> 
> Ugg.. Isnt there some way to indicate to kmemcheck that a speculative
> access is occurring?

Yes, here is a totally untested patch (only compiled here), to keep
kmemcheck & cmpxchg_double together.



diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..43c08c7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -261,6 +261,14 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object)
 	return *(void **)(object + s->offset);
 }
 
+static inline void *get_freepointer_mark(struct kmem_cache *s, void *object)
+{
+	void *ptr = object + s->offset;
+
+	kmemcheck_mark_initialized(ptr, sizeof(void *));
+	return *(void **)ptr;
+}
+
 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
 {
 	*(void **)(object + s->offset) = fp;
@@ -1540,7 +1548,11 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 	}
 }
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(DEBUG_PAGEALLOC)
+#define SLUB_USE_CMPXCHG_DOUBLE
+#endif
+
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 #ifdef CONFIG_PREEMPT
 /*
  * Calculate the next globally unique transaction for disambiguiation
@@ -1604,7 +1616,7 @@ static inline void note_cmpxchg_failure(const char *n,
 
 void init_kmem_cache_cpus(struct kmem_cache *s)
 {
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	int cpu;
 
 	for_each_possible_cpu(cpu)
@@ -1643,7 +1655,7 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 		page->inuse--;
 	}
 	c->page = NULL;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	c->tid = next_tid(c->tid);
 #endif
 	unfreeze_slab(s, page, tail);
@@ -1780,7 +1792,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 {
 	void **object;
 	struct page *new;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -1819,7 +1831,7 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	c->tid = next_tid(c->tid);
 	local_irq_restore(flags);
 #endif
@@ -1858,7 +1870,7 @@ new_slab:
 	}
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	return NULL;
@@ -1887,7 +1899,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 {
 	void **object;
 	struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long tid;
 #else
 	unsigned long flags;
@@ -1896,7 +1908,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_save(flags);
 #else
 redo:
@@ -1910,7 +1922,7 @@ redo:
 	 */
 	c = __this_cpu_ptr(s->cpu_slab);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	/*
 	 * The transaction ids are globally unique per cpu and per operation on
 	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
@@ -1927,7 +1939,7 @@ redo:
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 		/*
 		 * The cmpxchg will only match if there was no additional
 		 * operation and if we are on the right processor.
@@ -1943,7 +1955,8 @@ redo:
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				object, tid,
-				get_freepointer(s, object), next_tid(tid)))) {
+				get_freepointer_mark(s, object),
+				next_tid(tid)))) {
 
 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;
@@ -1954,7 +1967,7 @@ redo:
 		stat(s, ALLOC_FASTPATH);
 	}
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 
@@ -2034,7 +2047,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 {
 	void *prior;
 	void **object = (void *)x;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -2070,7 +2083,7 @@ checks_ok:
 
 out_unlock:
 	slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	return;
@@ -2084,7 +2097,7 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	stat(s, FREE_SLAB);
@@ -2113,7 +2126,7 @@ static __always_inline void slab_free(struct kmem_cache *s,
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long tid;
 #else
 	unsigned long flags;
@@ -2121,7 +2134,7 @@ static __always_inline void slab_free(struct kmem_cache *s,
 
 	slab_free_hook(s, x);
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_save(flags);
 
 #else
@@ -2136,7 +2149,7 @@ redo:
 	 */
 	c = __this_cpu_ptr(s->cpu_slab);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	tid = c->tid;
 	barrier();
 #endif
@@ -2144,7 +2157,7 @@ redo:
 	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
 		set_freepointer(s, object, c->freelist);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				c->freelist, tid,
@@ -2160,7 +2173,7 @@ redo:
 	} else
 		__slab_free(s, page, x, addr);
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 }
@@ -2354,7 +2367,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
 			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	/*
 	 * Must align to double word boundary for the double cmpxchg instructions
 	 * to work.
Comment 11 Anonymous Emailer 2011-04-19 21:19:11 UTC
Reply-To: cl@linux.com

On Tue, 19 Apr 2011, Eric Dumazet wrote:

> > Ugg.. Isnt there some way to indicate to kmemcheck that a speculative
> > access is occurring?
>
> Yes, here is a totally untested patch (only compiled here), to keep
> kmemcheck & cmpxchg_double together.

OK, looks somewhat saner. Why does DEBUG_PAGEALLOC need this? Pages are
never freed as long as a page is frozen, and a page needs to be frozen to
be allocated from. I don't see how we could reference a free page.
Comment 13 Eric Dumazet 2011-04-20 05:04:58 UTC
On Tuesday 19 April 2011 at 16:18 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, Eric Dumazet wrote:
> 
> > > Ugg.. Isnt there some way to indicate to kmemcheck that a speculative
> > > access is occurring?
> >
> > Yes, here is a totally untested patch (only compiled here), to keep
> > kmemcheck & cmpxchg_double together.
> 
> Ok looks somewhat saner. Why does DEBUG_PAGEALLOC need this? Pages are
> never freed as long as a page is frozen and a page needs to be frozen to
> be allocated from. I dont see how we could reference a free page.

We run lockless with IRQs enabled; are you sure this cannot happen at this point:

CPU0 - Receive an IRQ
CPU0 - Allocate this same object, consuming the page(s) containing this object
CPU0 - Pass this object/memory to another cpu (CPU1)
CPU1 - Free the memory and free the page(s)
CPU0 - Return from IRQ
CPU0 - Access now-unreachable memory?
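
Mapped onto the allocation fastpath from the patches in this thread, the
window in that sequence sits between reading c->freelist and dereferencing
the object. An annotated sketch, not a verbatim quote of mm/slub.c; the
comments are editorial:

	tid = c->tid;
	barrier();
	object = c->freelist;
	/*
	 * <-- the window: an IRQ here can allocate this very object,
	 *     pass it to CPU1, which frees it together with its page(s)
	 *     (the sequence above), and then we resume and ...
	 */
	if (unlikely(!this_cpu_cmpxchg_double(
			s->cpu_slab->freelist, s->cpu_slab->tid,
			object, tid,
			get_freepointer(s, object),	/* ... read the freed page */
			next_tid(tid))))
		goto redo;	/* tid moved: the stale read is discarded */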
Comment 14 Eric Dumazet 2011-04-20 06:05:03 UTC
On Wednesday 20 April 2011 at 08:56 +0300, Pekka Enberg wrote:
> On 4/19/11 11:17 PM, Eric Dumazet wrote:
> > On Tuesday 19 April 2011 at 12:10 -0500, Christoph Lameter wrote:
> >> On Tue, 19 Apr 2011, Eric Dumazet wrote:
> >>
> >>>           }
> >>>   }
> >>>
> >>> -#ifdef CONFIG_CMPXCHG_LOCAL
> >>> +#if defined(CONFIG_CMPXCHG_LOCAL)&&  \
> >>> + !defined(CONFIG_KMEMCHECK)&&  !defined(DEBUG_PAGEALLOC)
> >>> +#define SLUB_USE_CMPXCHG_DOUBLE
> >>> +#endif
> >>> +
> >>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
> >>>   #ifdef CONFIG_PREEMPT
> >>>   /*
> >> Ugg.. Isnt there some way to indicate to kmemcheck that a speculative
> >> access is occurring?
> > Yes, here is a totally untested patch (only compiled here), to keep
> > kmemcheck & cmpxchg_double together.
> 
> Looks good to me! Can we get someone to test it as well?

I tested the first patch, but I was not able to trigger the fault even
without it. I'll try a bit more today if time permits...

It would be nice if Christian could reproduce the initial problem easily
and give us the steps...
Comment 15 Pekka Enberg 2011-04-20 06:38:00 UTC
On 4/19/11 11:17 PM, Eric Dumazet wrote:
> On Tuesday 19 April 2011 at 12:10 -0500, Christoph Lameter wrote:
>> On Tue, 19 Apr 2011, Eric Dumazet wrote:
>>
>>>     }
>>>   }
>>>
>>> -#ifdef CONFIG_CMPXCHG_LOCAL
>>> +#if defined(CONFIG_CMPXCHG_LOCAL)&&  \
>>> +   !defined(CONFIG_KMEMCHECK)&&  !defined(DEBUG_PAGEALLOC)
>>> +#define SLUB_USE_CMPXCHG_DOUBLE
>>> +#endif
>>> +
>>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>>   #ifdef CONFIG_PREEMPT
>>>   /*
>> Ugg.. Isnt there some way to indicate to kmemcheck that a speculative
>> access is occurring?
> Yes, here is a totally untested patch (only compiled here), to keep
> kmemcheck & cmpxchg_double together.

Looks good to me! Can we get someone to test it as well?
Comment 16 Pekka Enberg 2011-04-20 07:50:15 UTC
On 4/20/11 10:45 AM, casteyde.christian@free.fr wrote:
> Indeed I can reproduce each time I boot, however I didn't managed to apply
> the
> patch to test it on -rc3 (neither -rc4), only the first hunk passed.

That's odd. Eric?

> I simply boot on this laptop with kmemcheck activated to get the warning,
> with
> vanilla -rc3. I've also tested -rc4 yersterday (without the patch therefore),
> the warning was not the same (it's a bit later, I will post the dmesg output
> on
> bugzilla tonight).

That's expected. You're unlikely to see the problem in the exact same 
place across different kernel configs and versions.

			Pekka
Comment 17 Eric Dumazet 2011-04-20 08:10:26 UTC
On Wednesday 20 April 2011 at 10:49 +0300, Pekka Enberg wrote:
> On 4/20/11 10:45 AM, casteyde.christian@free.fr wrote:
> > Indeed I can reproduce each time I boot, however I didn't managed to apply
> the
> > patch to test it on -rc3 (neither -rc4), only the first hunk passed.
> 
> That's odd. Eric?
> 

I don't know; my patches were based on linux-2.6, and I pull the tree very
often ;)

> > I simply boot on this laptop with kmemcheck activated to get the warning,
> with
> > vanilla -rc3. I've also tested -rc4 yersterday (without the patch
> therefore),
> > the warning was not the same (it's a bit later, I will post the dmesg
> output on
> > bugzilla tonight).
> 
> That's expected. You're unlikely to see the problem in the exact same 
> place across different kernel configs and versions.
> 

Christian, please send us your .config

Thanks
Comment 18 Pekka Enberg 2011-04-20 08:22:04 UTC
On 4/20/11 11:09 AM, Eric Dumazet wrote:
> On Wednesday 20 April 2011 at 10:49 +0300, Pekka Enberg wrote:
>> On 4/20/11 10:45 AM, casteyde.christian@free.fr wrote:
>>> Indeed I can reproduce each time I boot, however I didn't managed to apply
>>> the
>>> patch to test it on -rc3 (neither -rc4), only the first hunk passed.
>>
>> That's odd. Eric?
>>
>
> I dont know, my patches were linux-2.6 based, and I pull tree very
> often ;)

Btw, the patch applies fine here on top of 2.6.39-rc4.
Comment 19 Christian Casteyde 2011-04-20 08:41:56 UTC
Quoting Eric Dumazet <eric.dumazet@gmail.com>:


> > Looks good to me! Can we get someone to test it as well?
>
> I tested first patch, but was not able to trigger the fault anyway
> without it. I'll try a bit more today if time permits...
>
> It would be nice if Christian can reproduce the initial problem easily
> and give us the steps...
>
>
>
>

Indeed, I can reproduce it each time I boot; however, I didn't manage to
apply the patch to test it on -rc3 (nor on -rc4): only the first hunk applied.

I simply boot this laptop with kmemcheck activated to get the warning, with
vanilla -rc3. I also tested -rc4 yesterday (therefore without the patch);
the warning was not the same (it appears a bit later; I will post the dmesg
output on bugzilla tonight).

For info, my laptop is single-core, if that matters; it is preemptible,
with SLUB, 64-bit, on an old CPU (an Athlon64 3000, pre-LAHF, in 64-bit mode).
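
For reference, the reproduction setup described in this report boils down to
something like the following sketch; the kmemcheck boot parameter is from
Documentation/kmemcheck.txt of that era, and the option values are assumptions
based on comments 19 and 20, not a tested recipe:

# .config fragment
CONFIG_SLUB=y
CONFIG_KMEMCHECK=y
CONFIG_PREEMPT=y	# comment 20: adding this also triggered the splat elsewhere

# Boot with kmemcheck enabled (kmemcheck=0 off, 1 on, 2 one-shot):
#   kmemcheck=1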
Comment 20 Eric Dumazet 2011-04-20 09:08:19 UTC
I had one splat (before applying any patch) after adding CONFIG_PREEMPT to
my build:

[   84.629936] WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88011a7376f0)
[   84.629939] 5868ba1a0188ffff5868ba1a0188ffff7078731a0188ffffe0e1181a0188ffff
[   84.629952]  i i i i i i i i i i i i i i i i u u u u u u u u u u u u u u u u
[   84.629963]                                  ^
[   84.629964] 
[   84.629966] Pid: 2060, comm: 05-wait_for_sys Not tainted 2.6.39-rc4-00237-ge3de956-dirty #550 HP ProLiant BL460c G6
[   84.629969] RIP: 0010:[<ffffffff810f0e80>]  [<ffffffff810f0e80>] kmem_cache_alloc+0x50/0x190
[   84.629977] RSP: 0018:ffff88011a0b9988  EFLAGS: 00010286
[   84.629979] RAX: 0000000000000000 RBX: ffff88011a739a98 RCX: 0000000000620780
[   84.629980] RDX: 0000000000620740 RSI: 0000000000017400 RDI: ffffffff810db8f5
[   84.629982] RBP: ffff88011a0b99b8 R08: ffff88011a261dd8 R09: 0000000000000009
[   84.629983] R10: 0000000000000002 R11: ffff88011fffce00 R12: ffff88011b00a600
[   84.629985] R13: ffff88011a7376f0 R14: 00000000000000d0 R15: ffff88011abbb0c0
[   84.629987] FS:  0000000000000000(0000) GS:ffff88011fc00000(0063) knlGS:00000000f76ff6c0
[   84.629989] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
[   84.629991] CR2: ffff88011998dab8 CR3: 000000011a13f000 CR4: 00000000000006f0
[   84.629992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   84.629994] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
[   84.629995]  [<ffffffff810db8f5>] anon_vma_prepare+0x55/0x190
[   84.630000]  [<ffffffff810cfe90>] handle_pte_fault+0x470/0x6e0
[   84.630002]  [<ffffffff810d1654>] handle_mm_fault+0x124/0x1a0
[   84.630005]  [<ffffffff8147a510>] do_page_fault+0x160/0x560
[   84.630008]  [<ffffffff81477fcf>] page_fault+0x1f/0x30
[   84.630013]  [<ffffffff810b31f8>] generic_file_aio_read+0x528/0x740
[   84.630017]  [<ffffffff810f5461>] do_sync_read+0xd1/0x110
[   84.630021]  [<ffffffff810f5c36>] vfs_read+0xc6/0x160
[   84.630023]  [<ffffffff810f60b0>] sys_read+0x50/0x90
[   84.630025]  [<ffffffff8147f4e9>] sysenter_dispatch+0x7/0x27
[   84.630030]  [<ffffffffffffffff>] 0xffffffffffffffff

ffffffff810f0e6d:       0f 84 b5 00 00 00       je     ffffffff810f0f28 <kmem_cache_alloc+0xf8>
ffffffff810f0e73:       49 63 44 24 20          movslq 0x20(%r12),%rax
ffffffff810f0e78:       48 8d 4a 40             lea    0x40(%rdx),%rcx
ffffffff810f0e7c:       49 8b 34 24             mov    (%r12),%rsi
>>ffffffff810f0e80:       49 8b 5c 05 00          mov    0x0(%r13,%rax,1),%rbx
ffffffff810f0e85:       4c 89 e8                mov    %r13,%rax
ffffffff810f0e88:       e8 83 19 0e 00          callq  ffffffff811d2810 <this_cpu_cmpxchg16b_emu>
ffffffff810f0e8d:       0f 1f 40 00             nopl   0x0(%rax)
ffffffff810f0e91:       84 c0                   test   %al,%al
ffffffff810f0e93:       74 c1                   je     ffffffff810f0e56 <kmem_cache_alloc+0x26>


So this is definitely the access to object->next that triggers the fault.

The problem is that my (2nd) patch only reduces the window of occurrence of
the bug, since we can be preempted/interrupted between the
kmemcheck_mark_initialized() and the access to object->next:

The preempt/irq could allocate the object and free it again.

So KMEMCHECK would require us to block IRQs to be safe.

Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did in my first patch.

Christoph, please reconsider the first version of the patch (disabling
SLUB_CMPXCHG_DOUBLE if either KMEMCHECK or DEBUG_PAGEALLOC is defined).

Thanks
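
To illustrate, here is the get_freepointer_mark() helper from comment 10
again, re-annotated with where the remaining window sits; the comment is
added here for illustration and is not part of the patch:

static inline void *get_freepointer_mark(struct kmem_cache *s, void *object)
{
	void *ptr = object + s->offset;

	kmemcheck_mark_initialized(ptr, sizeof(void *));
	/*
	 * Preemption or an IRQ landing here can still allocate and free
	 * the object (and its page), so the load below again hits memory
	 * that kmemcheck/DEBUG_PAGEALLOC has re-armed.
	 */
	return *(void **)ptr;
}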
Comment 21 Eric Dumazet 2011-04-20 10:03:15 UTC
On Wednesday 20 April 2011 at 11:07 +0200, Eric Dumazet wrote:
> I had one splat (before applying any patch), adding CONFIG_PREEMPT to my
> build :
> 
> [   84.629936] WARNING: kmemcheck: Caught 64-bit read from uninitialized
> memory (ffff88011a7376f0)
> [   84.629939]
> 5868ba1a0188ffff5868ba1a0188ffff7078731a0188ffffe0e1181a0188ffff
> [   84.629952]  i i i i i i i i i i i i i i i i u u u u u u u u u u u u u u u
> u
> [   84.629963]                                  ^
> [   84.629964] 
> [   84.629966] Pid: 2060, comm: 05-wait_for_sys Not tainted
> 2.6.39-rc4-00237-ge3de956-dirty #550 HP ProLiant BL460c G6
> [   84.629969] RIP: 0010:[<ffffffff810f0e80>]  [<ffffffff810f0e80>]
> kmem_cache_alloc+0x50/0x190
> [   84.629977] RSP: 0018:ffff88011a0b9988  EFLAGS: 00010286
> [   84.629979] RAX: 0000000000000000 RBX: ffff88011a739a98 RCX:
> 0000000000620780
> [   84.629980] RDX: 0000000000620740 RSI: 0000000000017400 RDI:
> ffffffff810db8f5
> [   84.629982] RBP: ffff88011a0b99b8 R08: ffff88011a261dd8 R09:
> 0000000000000009
> [   84.629983] R10: 0000000000000002 R11: ffff88011fffce00 R12:
> ffff88011b00a600
> [   84.629985] R13: ffff88011a7376f0 R14: 00000000000000d0 R15:
> ffff88011abbb0c0
> [   84.629987] FS:  0000000000000000(0000) GS:ffff88011fc00000(0063)
> knlGS:00000000f76ff6c0
> [   84.629989] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> [   84.629991] CR2: ffff88011998dab8 CR3: 000000011a13f000 CR4:
> 00000000000006f0
> [   84.629992] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [   84.629994] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7:
> 0000000000000400
> [   84.629995]  [<ffffffff810db8f5>] anon_vma_prepare+0x55/0x190
> [   84.630000]  [<ffffffff810cfe90>] handle_pte_fault+0x470/0x6e0
> [   84.630002]  [<ffffffff810d1654>] handle_mm_fault+0x124/0x1a0
> [   84.630005]  [<ffffffff8147a510>] do_page_fault+0x160/0x560
> [   84.630008]  [<ffffffff81477fcf>] page_fault+0x1f/0x30
> [   84.630013]  [<ffffffff810b31f8>] generic_file_aio_read+0x528/0x740
> [   84.630017]  [<ffffffff810f5461>] do_sync_read+0xd1/0x110
> [   84.630021]  [<ffffffff810f5c36>] vfs_read+0xc6/0x160
> [   84.630023]  [<ffffffff810f60b0>] sys_read+0x50/0x90
> [   84.630025]  [<ffffffff8147f4e9>] sysenter_dispatch+0x7/0x27
> [   84.630030]  [<ffffffffffffffff>] 0xffffffffffffffff
> 
> ffffffff810f0e6d:       0f 84 b5 00 00 00       je     ffffffff810f0f28
> <kmem_cache_alloc+0xf8>
> ffffffff810f0e73:       49 63 44 24 20          movslq 0x20(%r12),%rax
> ffffffff810f0e78:       48 8d 4a 40             lea    0x40(%rdx),%rcx
> ffffffff810f0e7c:       49 8b 34 24             mov    (%r12),%rsi
> >>ffffffff810f0e80:       49 8b 5c 05 00          mov   
> 0x0(%r13,%rax,1),%rbx
> ffffffff810f0e85:       4c 89 e8                mov    %r13,%rax
> ffffffff810f0e88:       e8 83 19 0e 00          callq  ffffffff811d2810
> <this_cpu_cmpxchg16b_emu>
> ffffffff810f0e8d:       0f 1f 40 00             nopl   0x0(%rax)
> ffffffff810f0e91:       84 c0                   test   %al,%al
> ffffffff810f0e93:       74 c1                   je     ffffffff810f0e56
> <kmem_cache_alloc+0x26>
> 
> 
> So this is definitely the access to object->next that triggers the fault.
> 
> Problem is : my (2nd) patch only reduces the window of occurrence of the bug,
> since we can
> be preempted/interrupted between the kmemcheck_mark_initialized() and access
> to object->next :
> 
> The preempt/irq could allocate the object and free it again.
> 
> So KMEMCHECK would require us to block IRQ to be safe.
> 
> Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did in
> my first patch.
> 
> Christoph, please reconsider first version of patch (disabling
> SLUB_CMPXCHG_DOUBLE if
> either KMEMCHECK or DEBUG_PAGEALLOC is defined)

Or this alternate one: keep the cmpxchg_double() stuff but block IRQs in
slab_alloc() if KMEMCHECK or DEBUG_PAGEALLOC is enabled.

I tested it and got no kmemcheck splat

Thanks

[PATCH] slub: block IRQ in slab_alloc() if KMEMCHECK or DEBUG_PAGEALLOC

Christian Casteyde reported a KMEMCHECK splat in slub code.

The problem is that we are now lockless and allow IRQs in slab_alloc(): the
object we take from the freelist can be allocated and freed right before we
try to read object->next.

The same problem can happen with DEBUG_PAGEALLOC.

Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Cc: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   16 ++++++++++++----
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..9c34a2b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1881,7 +1881,13 @@ debug:
  * If not then __slab_alloc is called for slow processing.
  *
  * Otherwise we can simply pick the next object from the lockless free list.
+ * Since we access object->next, we must disable irqs if KMEMCHECK or
+ * DEBUG_PAGEALLOC is defined, since object could be allocated and freed
+ * under us.
  */
+#if !defined(CONFIG_CMPXCHG_LOCAL) || defined(CONFIG_KMEMCHECK) || defined(CONFIG_DEBUG_PAGEALLOC)
+#define MASK_IRQ_IN_SLAB_ALLOC
+#endif
 static __always_inline void *slab_alloc(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr)
 {
@@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 #ifdef CONFIG_CMPXCHG_LOCAL
 	unsigned long tid;
-#else
+#endif
+#ifdef MASK_IRQ_IN_SLAB_ALLOC
 	unsigned long flags;
 #endif
 
 	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifdef MASK_IRQ_IN_SLAB_ALLOC
 	local_irq_save(flags);
-#else
+#endif
+#ifdef CONFIG_CMPXCHG_LOCAL
 redo:
 #endif
 
@@ -1954,7 +1962,7 @@ redo:
 		stat(s, ALLOC_FASTPATH);
 	}
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifdef MASK_IRQ_IN_SLAB_ALLOC
 	local_irq_restore(flags);
 #endif
Comment 22 Eric Dumazet 2011-04-20 14:27:00 UTC
On Wednesday 20 April 2011 at 09:05 -0500, Christoph Lameter wrote:
> On Wed, 20 Apr 2011, Eric Dumazet wrote:
> 
> > > Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did
> in my first patch.
> 
> Ok your first patch seems to be the sanest approach.
> 
> >  {
> > @@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct
> kmem_cache *s,
> >     struct kmem_cache_cpu *c;
> >  #ifdef CONFIG_CMPXCHG_LOCAL
> >     unsigned long tid;
> > -#else
> > +#endif
> > +#ifdef MASK_IRQ_IN_SLAB_ALLOC
> >     unsigned long flags;
> >  #endif
> >
> 
> Yea well that does not bring us much.

Well, we'd keep the fast free path?

Only slab_alloc() would have to disable IRQs, for ~20 instructions.
Comment 23 Eric Dumazet 2011-04-20 15:01:57 UTC
On Wednesday 20 April 2011 at 09:42 -0500, Christoph Lameter wrote:

> Avoiding the irq handling yields the savings that improve the fastpath. if
> you do both then there is only a regression left. So lets go with
> disabling the CMPXCHG_LOCAL.

OK, let's do that then.

Thanks

[PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC

Christian Casteyde reported a KMEMCHECK splat in slub code.

The problem is that we are now lockless and allow IRQs in slab_alloc(): the
object we take from the freelist can be allocated and freed right before we
try to read object->next.

The same problem can happen with DEBUG_PAGEALLOC.

Just don't use cmpxchg_double() if either CONFIG_KMEMCHECK or
CONFIG_DEBUG_PAGEALLOC is defined.

Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Cc: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   45 +++++++++++++++++++++++++--------------------
 1 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..f31ab2c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
 	}
 }
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \
+				     !defined(CONFIG_DEBUG_PAGEALLOC)
+#define SLUB_USE_CMPXCHG_DOUBLE
+#endif
+
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 #ifdef CONFIG_PREEMPT
 /*
  * Calculate the next globally unique transaction for disambiguiation
@@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char *n,
 
 void init_kmem_cache_cpus(struct kmem_cache *s)
 {
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	int cpu;
 
 	for_each_possible_cpu(cpu)
@@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 		page->inuse--;
 	}
 	c->page = NULL;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	c->tid = next_tid(c->tid);
 #endif
 	unfreeze_slab(s, page, tail);
@@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 {
 	void **object;
 	struct page *new;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -1819,7 +1824,7 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	c->tid = next_tid(c->tid);
 	local_irq_restore(flags);
 #endif
@@ -1858,7 +1863,7 @@ new_slab:
 	}
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	return NULL;
@@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 {
 	void **object;
 	struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long tid;
 #else
 	unsigned long flags;
@@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_save(flags);
 #else
 redo:
@@ -1910,7 +1915,7 @@ redo:
 	 */
 	c = __this_cpu_ptr(s->cpu_slab);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	/*
 	 * The transaction ids are globally unique per cpu and per operation on
 	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
@@ -1927,7 +1932,7 @@ redo:
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 		/*
 		 * The cmpxchg will only match if there was no additional
 		 * operation and if we are on the right processor.
@@ -1954,7 +1959,7 @@ redo:
 		stat(s, ALLOC_FASTPATH);
 	}
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 
@@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 {
 	void *prior;
 	void **object = (void *)x;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -2070,7 +2075,7 @@ checks_ok:
 
 out_unlock:
 	slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	return;
@@ -2084,7 +2089,7 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 	stat(s, FREE_SLAB);
@@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct kmem_cache *s,
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	unsigned long tid;
 #else
 	unsigned long flags;
@@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct kmem_cache *s,
 
 	slab_free_hook(s, x);
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_save(flags);
 
 #else
@@ -2136,7 +2141,7 @@ redo:
 	 */
 	c = __this_cpu_ptr(s->cpu_slab);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	tid = c->tid;
 	barrier();
 #endif
@@ -2144,7 +2149,7 @@ redo:
 	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
 		set_freepointer(s, object, c->freelist);
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				c->freelist, tid,
@@ -2160,7 +2165,7 @@ redo:
 	} else
 		__slab_free(s, page, x, addr);
 
-#ifndef CONFIG_CMPXCHG_LOCAL
+#ifndef SLUB_USE_CMPXCHG_DOUBLE
 	local_irq_restore(flags);
 #endif
 }
@@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s)
 	BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
 			SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
 
-#ifdef CONFIG_CMPXCHG_LOCAL
+#ifdef SLUB_USE_CMPXCHG_DOUBLE
 	/*
 	 * Must align to double word boundary for the double cmpxchg instructions
 	 * to work.
Comment 24 Anonymous Emailer 2011-04-20 15:04:35 UTC
Reply-To: cl@linux.com

On Wed, 20 Apr 2011, Eric Dumazet wrote:

> On Tuesday 19 April 2011 at 16:18 -0500, Christoph Lameter wrote:
> > On Tue, 19 Apr 2011, Eric Dumazet wrote:
> >
> > > > Ugg.. Isnt there some way to indicate to kmemcheck that a speculative
> > > > access is occurring?
> > >
> > > Yes, here is a totally untested patch (only compiled here), to keep
> > > kmemcheck & cmpxchg_double together.
> >
> > Ok looks somewhat saner. Why does DEBUG_PAGEALLOC need this? Pages are
> > never freed as long as a page is frozen and a page needs to be frozen to
> > be allocated from. I dont see how we could reference a free page.
>
> We run lockless and IRQ enabled, are you sure we cannot at this point :
>
> CPU0 - Receive an IRQ
> CPU0 - Allocate this same object consume the page(s) containing this object
> CPU0 - Pass this object/memory to another cpu1
> CPU1 - Free memory and free the page(s)
> CPU0 - Return from IRQ
> CPU0 - Access now unreachable memory ?

True this can occur.
Comment 26 Anonymous Emailer 2011-04-20 15:06:08 UTC
Reply-To: cl@linux.com

On Wed, 20 Apr 2011, Eric Dumazet wrote:

> > Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did in
> my first patch.

Ok your first patch seems to be the sanest approach.

>  {
> @@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct
> kmem_cache *s,
>       struct kmem_cache_cpu *c;
>  #ifdef CONFIG_CMPXCHG_LOCAL
>       unsigned long tid;
> -#else
> +#endif
> +#ifdef MASK_IRQ_IN_SLAB_ALLOC
>       unsigned long flags;
>  #endif
>

Yea well that does not bring us much.
Comment 28 Eric Dumazet 2011-04-20 15:31:11 UTC
On Wednesday 20 April 2011 at 10:17 -0500, Christoph Lameter wrote:
> On Wed, 20 Apr 2011, Eric Dumazet wrote:
> 
> > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
> 
> Acked-by: Christoph Lameter <cl@linux.com>
> 
> >
> > -#ifdef CONFIG_CMPXCHG_LOCAL
> > +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \
> > +                                !defined(CONFIG_DEBUG_PAGEALLOC)
> > +#define SLUB_USE_CMPXCHG_DOUBLE
> 
> #undef CONFIG_CMPXCHG_LOCAL instead? Would reduce patch size but its kind
> of hacky.
> 

Well, it's definitely hacky :)

I chose a local definition (instead of playing Kconfig games), referring to
cmpxchg_double().
Comment 29 Eric Dumazet 2011-04-20 15:34:56 UTC
On Wednesday 20 April 2011 at 17:15 +0200, Vegard Nossum wrote:

>  Thanks, guys. I wonder: Is it possible to make a reproducible test-case 
>  (e.g. loadable module) for this?
> 

It is not easy to code a specific test case; just use a regular workload
(and network traffic to get interrupts). Use CONFIG_PREEMPT to trigger
preemptions.

>  Also, pardon my ignorance, but can you explain why this is a bug with 
>  kmemcheck/page-alloc debug and not without them?

Without them, we can read object->next without triggering a fault.
We can read garbage data (if the object is in use, it was overwritten with
user data), but this doesn't matter because cmpxchg_double() won't perform
its work (since tid will have changed).
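
A sketch of that distinction at the point of the speculative read; the
annotations are editorial, and the names follow the patches above:

	next = get_freepointer(s, object);	/* "object" may be stale here */
	/*
	 * Plain kernel: this load succeeds and may return garbage, but the
	 * garbage is never used, because the cmpxchg below fails on the
	 * changed tid and we retry.
	 *
	 * kmemcheck / DEBUG_PAGEALLOC: the load itself faults (tracked or
	 * unmapped page), before the tid check can reject the stale value.
	 */
	if (unlikely(!this_cpu_cmpxchg_double(
			s->cpu_slab->freelist, s->cpu_slab->tid,
			object, tid,
			next, next_tid(tid))))
		goto redo;	/* stale "next" discarded */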
Comment 30 Anonymous Emailer 2011-04-20 15:43:26 UTC
Reply-To: cl@linux.com

On Wed, 20 Apr 2011, Eric Dumazet wrote:

> On Wednesday 20 April 2011 at 09:05 -0500, Christoph Lameter wrote:
> > On Wed, 20 Apr 2011, Eric Dumazet wrote:
> >
> > > > Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I
> did in my first patch.
> >
> > Ok your first patch seems to be the sanest approach.
> >
> > >  {
> > > @@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct
> kmem_cache *s,
> > >   struct kmem_cache_cpu *c;
> > >  #ifdef CONFIG_CMPXCHG_LOCAL
> > >   unsigned long tid;
> > > -#else
> > > +#endif
> > > +#ifdef MASK_IRQ_IN_SLAB_ALLOC
> > >   unsigned long flags;
> > >  #endif
> > >
> >
> > Yea well that does not bring us much.
>
> Well, we keep the fast free path ?
>
> only slab_alloc() would have to disable irqs for ~20 instructions.

Avoiding the irq handling yields the savings that improve the fastpath. if
you do both then there is only a regression left. So lets go with
disabling the CMPXCHG_LOCAL.
Comment 32 Anonymous Emailer 2011-04-20 16:07:43 UTC
Reply-To: vegardno@ifi.uio.no

 On Wed, 20 Apr 2011 17:01:27 +0200, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wednesday 20 April 2011 at 09:42 -0500, Christoph Lameter wrote:
>
>> Avoiding the irq handling yields the savings that improve the 
>> fastpath. if
>> you do both then there is only a regression left. So lets go with
>> disabling the CMPXCHG_LOCAL.
>
> OK, let's do that then.
>
> Thanks
>
> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or 
> DEBUG_PAGEALLOC
>
> Christian Casteyde reported a KMEMCHECK splat in slub code.
>
> Problem is now we are lockless and allow IRQ in slab_alloc(), the 
> object
> we manipulate from freelist can be allocated and freed right before 
> we
> try to read object->next.
>
> Same problem can happen with DEBUG_PAGEALLOC
>
> Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
> CONFIG_DEBUG_PAGEALLOC is defined.
>
> Reported-by: Christian Casteyde <casteyde.christian@free.fr>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
> Cc: Vegard Nossum <vegardno@ifi.uio.no>
> Cc: Christoph Lameter <cl@linux.com>

 Thanks, guys. I wonder: Is it possible to make a reproducible test-case 
 (e.g. loadable module) for this?

 Also, pardon my ignorance, but can you explain why this is a bug with 
 kmemcheck/page-alloc debug and not without them?

 Thanks,


 Vegard
Comment 33 Anonymous Emailer 2011-04-20 16:18:26 UTC
Reply-To: cl@linux.com

On Wed, 20 Apr 2011, Eric Dumazet wrote:

> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC

Acked-by: Christoph Lameter <cl@linux.com>

>
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \
> +                                  !defined(CONFIG_DEBUG_PAGEALLOC)
> +#define SLUB_USE_CMPXCHG_DOUBLE

#undef CONFIG_CMPXCHG_LOCAL instead? It would reduce the patch size, but it's
kind of hacky.
Comment 35 Christian Casteyde 2011-04-20 19:47:03 UTC
In response to comment #23
With this patch on 2.6.39-rc4 I get the following warning:

b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
ADDRCONF(NETDEV_UP): eth1: link is not ready
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
ADDRCONF(NETDEV_UP): eth1: link is not ready
eth1: direct probe to 2e:94:8b:66:86:88 (try 1/3)
eth1: direct probe responded
eth1: authenticate with 2e:94:8b:66:86:88 (try 1)
eth1: authenticated
eth1: associate with 2e:94:8b:66:86:88 (try 1)
eth1: RX AssocResp from 2e:94:8b:66:86:88 (capab=0x411 status=0 aid=1)
eth1: associated
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
eth1: no IPv6 routers present
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001b000c00)
0008001b0088ffff0010000000000000c055530000eaffff0010000000000000
 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
 ^

Pid: 1874, comm: dhcpcd Not tainted 2.6.39-rc4 #4 Acer,Inc. Aspire 1510  /Aspire 1510
RIP: 0010:[<ffffffff81263d9c>]  [<ffffffff81263d9c>] memmove+0x2c/0x190
RSP: 0018:ffff88001c17f8a0  EFLAGS: 00010206
RAX: ffff88001b000c00 RBX: ffff88001b070400 RCX: 0000000000000020
RDX: 000000000000016f RSI: ffff88001b000c00 RDI: ffff88001b000c00
RBP: ffff88001c17f8f8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000001
FS:  00007fd7df758700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88001d23a4b8 CR3: 000000001ba46000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff815c7584>] ieee80211_skb_resize+0x94/0x120
 [<ffffffff815c97a4>] ieee80211_xmit+0xa4/0x280
 [<ffffffff815caf86>] ieee80211_subif_start_xmit+0x3e6/0x910
 [<ffffffff81486654>] dev_hard_start_xmit+0x384/0x6a0
 [<ffffffff8149d344>] sch_direct_xmit+0xd4/0x260
 [<ffffffff81486b4e>] dev_queue_xmit+0x1de/0x710
 [<ffffffff8156af23>] packet_sendmsg+0xa73/0xbe0
 [<ffffffff81471813>] sock_sendmsg+0xe3/0x110
 [<ffffffff81472154>] sys_sendto+0x134/0x180
 [<ffffffff815e74f8>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

Indeed, I get the same warning without the patch as well.
I've added my .config file as an attachment below.
Comment 36 Christian Casteyde 2011-04-20 19:48:03 UTC
Created attachment 54852 [details]
.config file used to reproduce
Comment 37 Eric Dumazet 2011-04-20 19:55:29 UTC
On Wednesday, 20 April 2011 at 21:36 +0200, Christian Casteyde wrote:

> I'm not sure it's the same problem anyway as I've said previously.
> I append my config file used to build this kernel.

This is not the same problem; it is a known false positive from skb
reallocation (skb_reserve() doesn't mark memory as initialized).
Comment 38 Christian Casteyde 2011-04-20 20:16:15 UTC
On Wednesday, 20 April 2011 at 17:01:27, Eric Dumazet wrote:
> On Wednesday, 20 April 2011 at 09:42 -0500, Christoph Lameter wrote:
> > Avoiding the irq handling yields the savings that improve the fastpath.
> > if you do both then there is only a regression left. So lets go with
> > disabling the CMPXCHG_LOCAL.
> 
> OK, let's do that then.
> 
> Thanks
> 
> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
> 
> Christian Casteyde reported a KMEMCHECK splat in slub code.
> 
> Problem is now we are lockless and allow IRQ in slab_alloc(), the object
> we manipulate from freelist can be allocated and freed right before we
> try to read object->next.
> 
> Same problem can happen with DEBUG_PAGEALLOC
> 
> Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
> CONFIG_DEBUG_PAGEALLOC is defined.
> 
> Reported-by: Christian Casteyde <casteyde.christian@free.fr>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
> Cc: Vegard Nossum <vegardno@ifi.uio.no>
> Cc: Christoph Lameter <cl@linux.com>
> ---
>  mm/slub.c |   45 +++++++++++++++++++++++++--------------------
>  1 files changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 94d2a33..f31ab2c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s,
> struct page *page, int tail) }
>  }
> 
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \
> +                                  !defined(CONFIG_DEBUG_PAGEALLOC)
> +#define SLUB_USE_CMPXCHG_DOUBLE
> +#endif
> +
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>  #ifdef CONFIG_PREEMPT
>  /*
>   * Calculate the next globally unique transaction for disambiguiation
> @@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char
> *n,
> 
>  void init_kmem_cache_cpus(struct kmem_cache *s)
>  {
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       int cpu;
> 
>       for_each_possible_cpu(cpu)
> @@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s,
> struct kmem_cache_cpu *c) page->inuse--;
>       }
>       c->page = NULL;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       c->tid = next_tid(c->tid);
>  #endif
>       unfreeze_slab(s, page, tail);
> @@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t
> gfpflags, int node, {
>       void **object;
>       struct page *new;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       unsigned long flags;
> 
>       local_irq_save(flags);
> @@ -1819,7 +1824,7 @@ load_freelist:
>       c->node = page_to_nid(c->page);
>  unlock_out:
>       slab_unlock(c->page);
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       c->tid = next_tid(c->tid);
>       local_irq_restore(flags);
>  #endif
> @@ -1858,7 +1863,7 @@ new_slab:
>       }
>       if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
>               slab_out_of_memory(s, gfpflags, node);
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>       return NULL;
> @@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct
> kmem_cache *s, {
>       void **object;
>       struct kmem_cache_cpu *c;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       unsigned long tid;
>  #else
>       unsigned long flags;
> @@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct
> kmem_cache *s, if (slab_pre_alloc_hook(s, gfpflags))
>               return NULL;
> 
> -#ifndef CONFIG_CMPXCHG_LOCAL
> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_save(flags);
>  #else
>  redo:
> @@ -1910,7 +1915,7 @@ redo:
>        */
>       c = __this_cpu_ptr(s->cpu_slab);
> 
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       /*
>        * The transaction ids are globally unique per cpu and per operation on
>        * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
> @@ -1927,7 +1932,7 @@ redo:
>               object = __slab_alloc(s, gfpflags, node, addr, c);
> 
>       else {
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>               /*
>                * The cmpxchg will only match if there was no additional
>                * operation and if we are on the right processor.
> @@ -1954,7 +1959,7 @@ redo:
>               stat(s, ALLOC_FASTPATH);
>       }
> 
> -#ifndef CONFIG_CMPXCHG_LOCAL
> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
> 
> @@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct
> page *page, {
>       void *prior;
>       void **object = (void *)x;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       unsigned long flags;
> 
>       local_irq_save(flags);
> @@ -2070,7 +2075,7 @@ checks_ok:
> 
>  out_unlock:
>       slab_unlock(page);
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>       return;
> @@ -2084,7 +2089,7 @@ slab_empty:
>               stat(s, FREE_REMOVE_PARTIAL);
>       }
>       slab_unlock(page);
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>       stat(s, FREE_SLAB);
> @@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct
> kmem_cache *s, {
>       void **object = (void *)x;
>       struct kmem_cache_cpu *c;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       unsigned long tid;
>  #else
>       unsigned long flags;
> @@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct
> kmem_cache *s,
> 
>       slab_free_hook(s, x);
> 
> -#ifndef CONFIG_CMPXCHG_LOCAL
> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_save(flags);
> 
>  #else
> @@ -2136,7 +2141,7 @@ redo:
>        */
>       c = __this_cpu_ptr(s->cpu_slab);
> 
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       tid = c->tid;
>       barrier();
>  #endif
> @@ -2144,7 +2149,7 @@ redo:
>       if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
>               set_freepointer(s, object, c->freelist);
> 
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>               if (unlikely(!this_cpu_cmpxchg_double(
>                               s->cpu_slab->freelist, s->cpu_slab->tid,
>                               c->freelist, tid,
> @@ -2160,7 +2165,7 @@ redo:
>       } else
>               __slab_free(s, page, x, addr);
> 
> -#ifndef CONFIG_CMPXCHG_LOCAL
> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>  }
> @@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct
> kmem_cache *s) BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
>                       SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
> 
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       /*
>        * Must align to double word boundary for the double cmpxchg
>  instructions
>        * to work.


I applied this patch to 2.6.39-rc4 (sorry for the last test; it did indeed apply
correctly), and I still get the following:

ADDRCONF(NETDEV_UP): eth1: link is not ready
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
ADDRCONF(NETDEV_UP): eth1: link is not ready
eth1: direct probe to 2e:94:8b:66:86:88 (try 1/3)
eth1: direct probe responded
eth1: authenticate with 2e:94:8b:66:86:88 (try 1)
eth1: authenticated
eth1: associate with 2e:94:8b:66:86:88 (try 1)
eth1: RX AssocResp from 2e:94:8b:66:86:88 (capab=0x411 status=0 aid=1)
eth1: associated
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
eth1: no IPv6 routers present
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001b000c00)
0008001b0088ffff0010000000000000c055530000eaffff0010000000000000
 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
 ^

Pid: 1874, comm: dhcpcd Not tainted 2.6.39-rc4 #4 Acer,Inc. Aspire 1510  /Aspire 1510
RIP: 0010:[<ffffffff81263d9c>]  [<ffffffff81263d9c>] memmove+0x2c/0x190
RSP: 0018:ffff88001c17f8a0  EFLAGS: 00010206
RAX: ffff88001b000c00 RBX: ffff88001b070400 RCX: 0000000000000020
RDX: 000000000000016f RSI: ffff88001b000c00 RDI: ffff88001b000c00
RBP: ffff88001c17f8f8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000001
FS:  00007fd7df758700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88001d23a4b8 CR3: 000000001ba46000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff815c7584>] ieee80211_skb_resize+0x94/0x120
 [<ffffffff815c97a4>] ieee80211_xmit+0xa4/0x280
 [<ffffffff815caf86>] ieee80211_subif_start_xmit+0x3e6/0x910
 [<ffffffff81486654>] dev_hard_start_xmit+0x384/0x6a0
 [<ffffffff8149d344>] sch_direct_xmit+0xd4/0x260
 [<ffffffff81486b4e>] dev_queue_xmit+0x1de/0x710
 [<ffffffff8156af23>] packet_sendmsg+0xa73/0xbe0
 [<ffffffff81471813>] sock_sendmsg+0xe3/0x110
 [<ffffffff81472154>] sys_sendto+0x134/0x180
 [<ffffffff815e74f8>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff


As I said previously, I'm not sure it's the same problem anyway.
I'm appending the .config file used to build this kernel.
Comment 40 Eric Dumazet 2011-04-20 20:33:04 UTC
On Wednesday, 20 April 2011 at 21:55 +0200, Eric Dumazet wrote:
> On Wednesday, 20 April 2011 at 21:36 +0200, Christian Casteyde wrote:
> 
> > I'm not sure it's the same problem anyway as I've said previously.
> > I append my config file used to build this kernel.
> 
> This is not the same problem, and is a known false positive from skb
> reallocation. (skb_reserve() doesnt mark memory as initialized) 
> 
> 

Please try the following patch. It's a bit tricky, because the network stack
has different functions to fill bytes in an skb and to change its
pointers/offsets.

An alternative would be to change pskb_expand_head() to not copy the zone
between skb->head and skb->data (it might contain uninitialized data).

Thanks


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 79aafbb..3f2cba4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1252,6 +1252,7 @@ static inline int skb_tailroom(const struct sk_buff *skb)
  */
 static inline void skb_reserve(struct sk_buff *skb, int len)
 {
+	kmemcheck_mark_initialized(skb->data, len);
 	skb->data += len;
 	skb->tail += len;
 }
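To see why skb_reserve() leaves bytes that kmemcheck considers uninitialized, here is a toy model of the buffer layout (a userspace sketch with made-up types, not the real sk_buff):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct toy_skb {
        unsigned char *head;    /* start of the allocated buffer */
        unsigned char *data;    /* start of packet data */
        unsigned char *tail;    /* end of packet data */
};

/* Like skb_reserve(): opens headroom by advancing data and tail.  Nothing
 * is ever written to [head, data), so a checker sees those bytes as
 * uninitialized. */
static void toy_reserve(struct toy_skb *skb, int len)
{
        skb->data += len;
        skb->tail += len;
}

int main(void)
{
        unsigned char *buf = malloc(256);
        unsigned char *bigger = malloc(512);
        struct toy_skb skb = { buf, buf, buf };

        toy_reserve(&skb, 32);          /* headroom: buf[0..31] never written */
        memset(skb.data, 0xab, 64);     /* fill some packet data */
        skb.tail += 64;

        /* A resize that copies everything from head, as pskb_expand_head()
         * does, also copies the 32 untouched headroom bytes -- the read
         * reported above in ieee80211_skb_resize().  A userspace checker
         * such as Valgrind flags this memmove() the same way. */
        memmove(bigger, skb.head, skb.tail - skb.head);

        printf("copied %td bytes; the first 32 were never initialized\n",
               skb.tail - skb.head);
        free(buf);
        free(bigger);
        return 0;
}

The patch above sidesteps this by marking the skipped headroom as initialized at skb_reserve() time, before data and tail are advanced.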
Comment 41 Christian Casteyde 2011-04-21 21:12:43 UTC
I didn't manage to reproduce it with both patches on 2.6.39-rc4, except that I once hit another bug, which I reported in https://bugzilla.kernel.org/show_bug.cgi?id=33802.
Comment 42 Eric Dumazet 2011-05-05 06:18:51 UTC
On Wednesday, 20 April 2011 at 17:01 +0200, Eric Dumazet wrote:
> On Wednesday, 20 April 2011 at 09:42 -0500, Christoph Lameter wrote:
> 
> > Avoiding the irq handling yields the savings that improve the fastpath. if
> > you do both then there is only a regression left. So lets go with
> > disabling the CMPXCHG_LOCAL.
> 
> OK, let's do that then.
> 
> Thanks
> 
> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
> 
> Christian Casteyde reported a KMEMCHECK splat in slub code.
> 
> Problem is now we are lockless and allow IRQ in slab_alloc(), the object
> we manipulate from freelist can be allocated and freed right before we
> try to read object->next.
> 
> Same problem can happen with DEBUG_PAGEALLOC
> 
> Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
> CONFIG_DEBUG_PAGEALLOC is defined.
> 
> Reported-by: Christian Casteyde <casteyde.christian@free.fr>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
> Cc: Vegard Nossum <vegardno@ifi.uio.no>
> Cc: Christoph Lameter <cl@linux.com>
> ---
>  mm/slub.c |   45 +++++++++++++++++++++++++--------------------
>  1 files changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 94d2a33..f31ab2c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s, struct
> page *page, int tail)
>       }
>  }
>  
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \
> +                                  !defined(CONFIG_DEBUG_PAGEALLOC)
> +#define SLUB_USE_CMPXCHG_DOUBLE
> +#endif
> +
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>  #ifdef CONFIG_PREEMPT
>  /*
>   * Calculate the next globally unique transaction for disambiguiation
> @@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char *n,
>  
>  void init_kmem_cache_cpus(struct kmem_cache *s)
>  {
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       int cpu;
>  
>       for_each_possible_cpu(cpu)
> @@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s,
> struct kmem_cache_cpu *c)
>               page->inuse--;
>       }
>       c->page = NULL;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       c->tid = next_tid(c->tid);
>  #endif
>       unfreeze_slab(s, page, tail);
> @@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t
> gfpflags, int node,
>  {
>       void **object;
>       struct page *new;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       unsigned long flags;
>  
>       local_irq_save(flags);
> @@ -1819,7 +1824,7 @@ load_freelist:
>       c->node = page_to_nid(c->page);
>  unlock_out:
>       slab_unlock(c->page);
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       c->tid = next_tid(c->tid);
>       local_irq_restore(flags);
>  #endif
> @@ -1858,7 +1863,7 @@ new_slab:
>       }
>       if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
>               slab_out_of_memory(s, gfpflags, node);
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>       return NULL;
> @@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct
> kmem_cache *s,
>  {
>       void **object;
>       struct kmem_cache_cpu *c;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       unsigned long tid;
>  #else
>       unsigned long flags;
> @@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct
> kmem_cache *s,
>       if (slab_pre_alloc_hook(s, gfpflags))
>               return NULL;
>  
> -#ifndef CONFIG_CMPXCHG_LOCAL
> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_save(flags);
>  #else
>  redo:
> @@ -1910,7 +1915,7 @@ redo:
>        */
>       c = __this_cpu_ptr(s->cpu_slab);
>  
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       /*
>        * The transaction ids are globally unique per cpu and per operation on
>        * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
> @@ -1927,7 +1932,7 @@ redo:
>               object = __slab_alloc(s, gfpflags, node, addr, c);
>  
>       else {
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>               /*
>                * The cmpxchg will only match if there was no additional
>                * operation and if we are on the right processor.
> @@ -1954,7 +1959,7 @@ redo:
>               stat(s, ALLOC_FASTPATH);
>       }
>  
> -#ifndef CONFIG_CMPXCHG_LOCAL
> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>  
> @@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct
> page *page,
>  {
>       void *prior;
>       void **object = (void *)x;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       unsigned long flags;
>  
>       local_irq_save(flags);
> @@ -2070,7 +2075,7 @@ checks_ok:
>  
>  out_unlock:
>       slab_unlock(page);
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>       return;
> @@ -2084,7 +2089,7 @@ slab_empty:
>               stat(s, FREE_REMOVE_PARTIAL);
>       }
>       slab_unlock(page);
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>       stat(s, FREE_SLAB);
> @@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct kmem_cache
> *s,
>  {
>       void **object = (void *)x;
>       struct kmem_cache_cpu *c;
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       unsigned long tid;
>  #else
>       unsigned long flags;
> @@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct kmem_cache
> *s,
>  
>       slab_free_hook(s, x);
>  
> -#ifndef CONFIG_CMPXCHG_LOCAL
> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_save(flags);
>  
>  #else
> @@ -2136,7 +2141,7 @@ redo:
>        */
>       c = __this_cpu_ptr(s->cpu_slab);
>  
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       tid = c->tid;
>       barrier();
>  #endif
> @@ -2144,7 +2149,7 @@ redo:
>       if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
>               set_freepointer(s, object, c->freelist);
>  
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>               if (unlikely(!this_cpu_cmpxchg_double(
>                               s->cpu_slab->freelist, s->cpu_slab->tid,
>                               c->freelist, tid,
> @@ -2160,7 +2165,7 @@ redo:
>       } else
>               __slab_free(s, page, x, addr);
>  
> -#ifndef CONFIG_CMPXCHG_LOCAL
> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>       local_irq_restore(flags);
>  #endif
>  }
> @@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct
> kmem_cache *s)
>       BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
>                       SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
>  
> -#ifdef CONFIG_CMPXCHG_LOCAL
> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>       /*
>        * Must align to double word boundary for the double cmpxchg
>  instructions
>        * to work.
> 


Is anybody taking the patch and sending it upstream?

Thanks
Comment 43 Pekka Enberg 2011-05-05 06:23:23 UTC
On 5/5/11 9:18 AM, Eric Dumazet wrote:
> On Wednesday, 20 April 2011 at 17:01 +0200, Eric Dumazet wrote:
>> On Wednesday, 20 April 2011 at 09:42 -0500, Christoph Lameter wrote:
>>
>>> Avoiding the irq handling yields the savings that improve the fastpath. if
>>> you do both then there is only a regression left. So lets go with
>>> disabling the CMPXCHG_LOCAL.
>>
>> OK, let's do that then.
>>
>> Thanks
>>
>> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
>>
>> Christian Casteyde reported a KMEMCHECK splat in slub code.
>>
>> Problem is now we are lockless and allow IRQ in slab_alloc(), the object
>> we manipulate from freelist can be allocated and freed right before we
>> try to read object->next.
>>
>> Same problem can happen with DEBUG_PAGEALLOC
>>
>> Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
>> CONFIG_DEBUG_PAGEALLOC is defined.
>>
>> Reported-by: Christian Casteyde <casteyde.christian@free.fr>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Pekka Enberg <penberg@cs.helsinki.fi>
>> Cc: Vegard Nossum <vegardno@ifi.uio.no>
>> Cc: Christoph Lameter <cl@linux.com>
>> ---
>>   mm/slub.c |   45 +++++++++++++++++++++++++--------------------
>>   1 files changed, 25 insertions(+), 20 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 94d2a33..f31ab2c 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s,
>> struct page *page, int tail)
>>      }
>>   }
>>
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \
>> +                                  !defined(CONFIG_DEBUG_PAGEALLOC)
>> +#define SLUB_USE_CMPXCHG_DOUBLE
>> +#endif
>> +
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>   #ifdef CONFIG_PREEMPT
>>   /*
>>    * Calculate the next globally unique transaction for disambiguiation
>> @@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char *n,
>>
>>   void init_kmem_cache_cpus(struct kmem_cache *s)
>>   {
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      int cpu;
>>
>>      for_each_possible_cpu(cpu)
>> @@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s,
>> struct kmem_cache_cpu *c)
>>              page->inuse--;
>>      }
>>      c->page = NULL;
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      c->tid = next_tid(c->tid);
>>   #endif
>>      unfreeze_slab(s, page, tail);
>> @@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t
>> gfpflags, int node,
>>   {
>>      void **object;
>>      struct page *new;
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      unsigned long flags;
>>
>>      local_irq_save(flags);
>> @@ -1819,7 +1824,7 @@ load_freelist:
>>      c->node = page_to_nid(c->page);
>>   unlock_out:
>>      slab_unlock(c->page);
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      c->tid = next_tid(c->tid);
>>      local_irq_restore(flags);
>>   #endif
>> @@ -1858,7 +1863,7 @@ new_slab:
>>      }
>>      if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
>>              slab_out_of_memory(s, gfpflags, node);
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      local_irq_restore(flags);
>>   #endif
>>      return NULL;
>> @@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct
>> kmem_cache *s,
>>   {
>>      void **object;
>>      struct kmem_cache_cpu *c;
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      unsigned long tid;
>>   #else
>>      unsigned long flags;
>> @@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct
>> kmem_cache *s,
>>      if (slab_pre_alloc_hook(s, gfpflags))
>>              return NULL;
>>
>> -#ifndef CONFIG_CMPXCHG_LOCAL
>> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>>      local_irq_save(flags);
>>   #else
>>   redo:
>> @@ -1910,7 +1915,7 @@ redo:
>>       */
>>      c = __this_cpu_ptr(s->cpu_slab);
>>
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      /*
>>       * The transaction ids are globally unique per cpu and per operation on
>>       * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
>> @@ -1927,7 +1932,7 @@ redo:
>>              object = __slab_alloc(s, gfpflags, node, addr, c);
>>
>>      else {
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>              /*
>>               * The cmpxchg will only match if there was no additional
>>               * operation and if we are on the right processor.
>> @@ -1954,7 +1959,7 @@ redo:
>>              stat(s, ALLOC_FASTPATH);
>>      }
>>
>> -#ifndef CONFIG_CMPXCHG_LOCAL
>> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>>      local_irq_restore(flags);
>>   #endif
>>
>> @@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct
>> page *page,
>>   {
>>      void *prior;
>>      void **object = (void *)x;
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      unsigned long flags;
>>
>>      local_irq_save(flags);
>> @@ -2070,7 +2075,7 @@ checks_ok:
>>
>>   out_unlock:
>>      slab_unlock(page);
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      local_irq_restore(flags);
>>   #endif
>>      return;
>> @@ -2084,7 +2089,7 @@ slab_empty:
>>              stat(s, FREE_REMOVE_PARTIAL);
>>      }
>>      slab_unlock(page);
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      local_irq_restore(flags);
>>   #endif
>>      stat(s, FREE_SLAB);
>> @@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct
>> kmem_cache *s,
>>   {
>>      void **object = (void *)x;
>>      struct kmem_cache_cpu *c;
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      unsigned long tid;
>>   #else
>>      unsigned long flags;
>> @@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct
>> kmem_cache *s,
>>
>>      slab_free_hook(s, x);
>>
>> -#ifndef CONFIG_CMPXCHG_LOCAL
>> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>>      local_irq_save(flags);
>>
>>   #else
>> @@ -2136,7 +2141,7 @@ redo:
>>       */
>>      c = __this_cpu_ptr(s->cpu_slab);
>>
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      tid = c->tid;
>>      barrier();
>>   #endif
>> @@ -2144,7 +2149,7 @@ redo:
>>      if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
>>              set_freepointer(s, object, c->freelist);
>>
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>              if (unlikely(!this_cpu_cmpxchg_double(
>>                              s->cpu_slab->freelist, s->cpu_slab->tid,
>>                              c->freelist, tid,
>> @@ -2160,7 +2165,7 @@ redo:
>>      } else
>>              __slab_free(s, page, x, addr);
>>
>> -#ifndef CONFIG_CMPXCHG_LOCAL
>> +#ifndef SLUB_USE_CMPXCHG_DOUBLE
>>      local_irq_restore(flags);
>>   #endif
>>   }
>> @@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct
>> kmem_cache *s)
>>      BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE <
>>                      SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu));
>>
>> -#ifdef CONFIG_CMPXCHG_LOCAL
>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE
>>      /*
>>       * Must align to double word boundary for the double cmpxchg
>>   instructions
>>       * to work.
>>
>
>
> Is anybody taking the patch and sending it upstream ?

Oh, sorry, it got lost in my inbox. I can take care of it.

			Pekka
Comment 44 Eric Dumazet 2011-05-05 06:51:28 UTC
On Thursday, 5 May 2011 at 09:22 +0300, Pekka Enberg wrote:

> Oh, sorry, it got lost in my inbox. I can take care of it.
> 

Thanks, Pekka!
Comment 45 Anonymous Emailer 2011-05-05 18:40:47 UTC
Reply-To: cl@linux.com

On Thu, 5 May 2011, Pekka Enberg wrote:

> > > -#ifdef CONFIG_CMPXCHG_LOCAL
> > > +#ifdef SLUB_USE_CMPXCHG_DOUBLE
> > >           /*
> > >            * Must align to double word boundary for the double cmpxchg
> > > instructions
> > >            * to work.
> > >
> >
> >
> > Is anybody taking the patch and sending it upstream ?
>
> Oh, sorry, it got lost in my inbox. I can take care of it.

Combining this one with the patch I sent to remove the #ifdeffery would
make this much simpler.
Comment 47 Eric Dumazet 2011-05-05 18:48:46 UTC
On Thursday, 5 May 2011 at 13:40 -0500, Christoph Lameter wrote:
> On Thu, 5 May 2011, Pekka Enberg wrote:
> 
> > > > -#ifdef CONFIG_CMPXCHG_LOCAL
> > > > +#ifdef SLUB_USE_CMPXCHG_DOUBLE
> > > >         /*
> > > >          * Must align to double word boundary for the double cmpxchg
> > > > instructions
> > > >          * to work.
> > > >
> > >
> > >
> > > Is anybody taking the patch and sending it upstream ?
> >
> > Oh, sorry, it got lost in my inbox. I can take care of it.
> 
> Combining this one with the patch I sent to remove the #ifdeffery would
> make this much simpler.

I must have missed it; could you please resend it?

Thanks
Comment 48 Anonymous Emailer 2011-05-05 19:05:46 UTC
Reply-To: cl@linux.com

On Thu, 5 May 2011, Eric Dumazet wrote:

> > Combining this one with the patch I sent to remove the #ifdeffery would
> > make this much simpler.
>
> I must missed it, could you please resend it ?


Subject: slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery

Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used
everywhere.

There may be performance implications since:

A. We now have to manage a transaction ID for all arches

B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced
to a very short irqoff section.

There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback
case we only have to do one disable and enable like before.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/slub_def.h |    2 -
 mm/slub.c                |   56 -----------------------------------------------
 2 files changed, 58 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2011-05-04 09:33:08.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h	2011-05-04 09:42:05.000000000 -0500
@@ -37,9 +37,7 @@ enum stat_item {

 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to next available object */
-#ifdef CONFIG_CMPXCHG_LOCAL
 	unsigned long tid;	/* Globally unique transaction id */
-#endif
 	struct page *page;	/* The slab from which we are allocating */
 	int node;		/* The node of the page (or -1 for debug) */
 #ifdef CONFIG_SLUB_STATS
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-05-04 09:41:59.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-05-04 09:48:11.000000000 -0500
@@ -1540,7 +1540,6 @@ static void unfreeze_slab(struct kmem_ca
 	}
 }

-#ifdef CONFIG_CMPXCHG_LOCAL
 #ifdef CONFIG_PREEMPT
 /*
  * Calculate the next globally unique transaction for disambiguiation
@@ -1600,17 +1599,12 @@ static inline void note_cmpxchg_failure(
 	stat(s, CMPXCHG_DOUBLE_CPU_FAIL);
 }

-#endif
-
 void init_kmem_cache_cpus(struct kmem_cache *s)
 {
-#ifdef CONFIG_CMPXCHG_LOCAL
 	int cpu;

 	for_each_possible_cpu(cpu)
 		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
-#endif
-
 }
 /*
  * Remove the cpu slab
@@ -1643,9 +1637,7 @@ static void deactivate_slab(struct kmem_
 		page->inuse--;
 	}
 	c->page = NULL;
-#ifdef CONFIG_CMPXCHG_LOCAL
 	c->tid = next_tid(c->tid);
-#endif
 	unfreeze_slab(s, page, tail);
 }

@@ -1780,7 +1772,6 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
-#ifdef CONFIG_CMPXCHG_LOCAL
 	unsigned long flags;

 	local_irq_save(flags);
@@ -1792,7 +1783,6 @@ static void *__slab_alloc(struct kmem_ca
 	 */
 	c = this_cpu_ptr(s->cpu_slab);
 #endif
-#endif

 	/* We handle __GFP_ZERO in the caller */
 	gfpflags &= ~__GFP_ZERO;
@@ -1819,10 +1809,8 @@ load_freelist:
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
-#ifdef CONFIG_CMPXCHG_LOCAL
 	c->tid = next_tid(c->tid);
 	local_irq_restore(flags);
-#endif
 	stat(s, ALLOC_SLOWPATH);
 	return object;

@@ -1858,9 +1846,7 @@ new_slab:
 	}
 	if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
 		slab_out_of_memory(s, gfpflags, node);
-#ifdef CONFIG_CMPXCHG_LOCAL
 	local_irq_restore(flags);
-#endif
 	return NULL;
 debug:
 	if (!alloc_debug_processing(s, c->page, object, addr))
@@ -1887,20 +1873,12 @@ static __always_inline void *slab_alloc(
 {
 	void **object;
 	struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
 	unsigned long tid;
-#else
-	unsigned long flags;
-#endif

 	if (slab_pre_alloc_hook(s, gfpflags))
 		return NULL;

-#ifndef CONFIG_CMPXCHG_LOCAL
-	local_irq_save(flags);
-#else
 redo:
-#endif

 	/*
 	 * Must read kmem_cache cpu data via this cpu ptr. Preemption is
@@ -1910,7 +1888,6 @@ redo:
 	 */
 	c = __this_cpu_ptr(s->cpu_slab);

-#ifdef CONFIG_CMPXCHG_LOCAL
 	/*
 	 * The transaction ids are globally unique per cpu and per operation on
 	 * a per cpu queue. Thus they can be guarantee that the cmpxchg_double
@@ -1919,7 +1896,6 @@ redo:
 	 */
 	tid = c->tid;
 	barrier();
-#endif

 	object = c->freelist;
 	if (unlikely(!object || !node_match(c, node)))
@@ -1927,7 +1903,6 @@ redo:
 		object = __slab_alloc(s, gfpflags, node, addr, c);

 	else {
-#ifdef CONFIG_CMPXCHG_LOCAL
 		/*
 		 * The cmpxchg will only match if there was no additional
 		 * operation and if we are on the right processor.
@@ -1948,16 +1923,9 @@ redo:
 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;
 		}
-#else
-		c->freelist = get_freepointer(s, object);
-#endif
 		stat(s, ALLOC_FASTPATH);
 	}

-#ifndef CONFIG_CMPXCHG_LOCAL
-	local_irq_restore(flags);
-#endif
-
 	if (unlikely(gfpflags & __GFP_ZERO) && object)
 		memset(object, 0, s->objsize);

@@ -2034,11 +2002,9 @@ static void __slab_free(struct kmem_cach
 {
 	void *prior;
 	void **object = (void *)x;
-#ifdef CONFIG_CMPXCHG_LOCAL
 	unsigned long flags;

 	local_irq_save(flags);
-#endif
 	slab_lock(page);
 	stat(s, FREE_SLOWPATH);

@@ -2070,9 +2036,7 @@ checks_ok:

 out_unlock:
 	slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
 	local_irq_restore(flags);
-#endif
 	return;

 slab_empty:
@@ -2084,9 +2048,7 @@ slab_empty:
 		stat(s, FREE_REMOVE_PARTIAL);
 	}
 	slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
 	local_irq_restore(flags);
-#endif
 	stat(s, FREE_SLAB);
 	discard_slab(s, page);
 	return;
@@ -2113,20 +2075,11 @@ static __always_inline void slab_free(st
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
 	unsigned long tid;
-#else
-	unsigned long flags;
-#endif

 	slab_free_hook(s, x);

-#ifndef CONFIG_CMPXCHG_LOCAL
-	local_irq_save(flags);
-
-#else
 redo:
-#endif

 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -2136,15 +2089,12 @@ redo:
 	 */
 	c = __this_cpu_ptr(s->cpu_slab);

-#ifdef CONFIG_CMPXCHG_LOCAL
 	tid = c->tid;
 	barrier();
-#endif

 	if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
 		set_freepointer(s, object, c->freelist);

-#ifdef CONFIG_CMPXCHG_LOCAL
 		if (unlikely(!irqsafe_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				c->freelist, tid,
@@ -2153,16 +2103,10 @@ redo:
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
-#else
-		c->freelist = object;
-#endif
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr);

-#ifndef CONFIG_CMPXCHG_LOCAL
-	local_irq_restore(flags);
-#endif
 }

 void kmem_cache_free(struct kmem_cache *s, void *x)
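For reference, the transaction ids mentioned in point A follow the scheme visible in the quoted slub code: each cpu's tid starts at its cpu number and advances in steps of the cpu count rounded up to a power of two (when CONFIG_PREEMPT is set), so the per-cpu sequences stay disjoint and a tid mismatch reveals an intervening operation or a cpu migration. A small standalone sketch, assuming four cpus for illustration:

#include <stdio.h>

#define NR_CPUS         4       /* assumption for this sketch */
#define TID_STEP        4       /* roundup_pow_of_two(NR_CPUS) */

static unsigned long init_tid(int cpu)
{
        return cpu;
}

static unsigned long next_tid(unsigned long tid)
{
        return tid + TID_STEP;
}

int main(void)
{
        /* Print the first few tids per cpu: disjoint, monotonic sequences. */
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                unsigned long tid = init_tid(cpu);

                printf("cpu %d:", cpu);
                for (int i = 0; i < 4; i++) {
                        printf(" %lu", tid);
                        tid = next_tid(tid);
                }
                printf("\n");
        }
        return 0;
}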
Comment 50 Anonymous Emailer 2011-05-09 20:21:57 UTC
Reply-To: penberg@kernel.org

On Wed, 20 Apr 2011, Eric Dumazet wrote:
> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
>
> Christian Casteyde reported a KMEMCHECK splat in slub code.
>
> Problem is now we are lockless and allow IRQ in slab_alloc(), the object
> we manipulate from freelist can be allocated and freed right before we
> try to read object->next.
>
> Same problem can happen with DEBUG_PAGEALLOC
>
> Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
> CONFIG_DEBUG_PAGEALLOC is defined.

Christoph, Eric, is this still relevant after commit 1759415 ("slub: 
Remove CONFIG_CMPXCHG_LOCAL ifdeffery") in slab/next of slab.git?

 			Pekka
Comment 51 Anonymous Emailer 2011-05-09 20:44:29 UTC
Reply-To: penberg@kernel.org

On Mon, 9 May 2011, Christoph Lameter wrote:
> There is still an issue and now you can no longer fix the thing through
> CONFIG_CMPXCHG_LOCAL.
>
> It needs to be legal for slub to deref the counter even if the object has
> been freed.

We can use kmemcheck_mark_initialized() for that.
Comment 52 Anonymous Emailer 2011-05-09 21:04:34 UTC
Reply-To: cl@linux.com

On Mon, 9 May 2011, Pekka Enberg wrote:

> On Wed, 20 Apr 2011, Eric Dumazet wrote:
> > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
> >
> > Christian Casteyde reported a KMEMCHECK splat in slub code.
> >
> > Problem is now we are lockless and allow IRQ in slab_alloc(), the object
> > we manipulate from freelist can be allocated and freed right before we
> > try to read object->next.
> >
> > Same problem can happen with DEBUG_PAGEALLOC
> >
> > Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
> > CONFIG_DEBUG_PAGEALLOC is defined.
>
> Christoph, Eric, is this still relevant after commit 1759415 ("slub: Remove
> CONFIG_CMPXCHG_LOCAL ifdeffery") in slab/next of slab.git?

There is still an issue and now you can no longer fix the thing through
CONFIG_CMPXCHG_LOCAL.

It needs to be legal for slub to deref the counter even if the object has
been freed.
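One conceivable way to make that dereference legal under kmemcheck, along the lines of the kmemcheck_mark_initialized() suggestion above, is sketched here. This is a hypothetical illustration, not a patch from this thread; get_freepointer_safe() is a made-up name modeled on slub's get_freepointer(), and it assumes slub's struct kmem_cache with its free-pointer offset field:

#include <linux/kmemcheck.h>

/* Hypothetical sketch: declare the pointer-sized word initialized before
 * the lockless fastpath reads it.  A stale value read here is rejected by
 * the subsequent cmpxchg, so hiding it from kmemcheck loses nothing. */
static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
{
        void *p = object + s->offset;

        kmemcheck_mark_initialized(p, sizeof(void *));
        return *(void **)p;
}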
Comment 54 Eric Dumazet 2011-05-10 08:43:43 UTC
Le lundi 09 mai 2011 à 15:04 -0500, Christoph Lameter a écrit :
> On Mon, 9 May 2011, Pekka Enberg wrote:
> 
> > On Wed, 20 Apr 2011, Eric Dumazet wrote:
> > > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
> > >
> > > Christian Casteyde reported a KMEMCHECK splat in slub code.
> > >
> > > Problem is now we are lockless and allow IRQ in slab_alloc(), the object
> > > we manipulate from freelist can be allocated and freed right before we
> > > try to read object->next.
> > >
> > > Same problem can happen with DEBUG_PAGEALLOC
> > >
> > > Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
> > > CONFIG_DEBUG_PAGEALLOC is defined.
> >
> > Christoph, Eric, is this still relevant after commit 1759415 ("slub: Remove
> > CONFIG_CMPXCHG_LOCAL ifdeffery") in slab/next of slab.git?
> 
> There is still an issue and now you can no longer fix the thing through
> CONFIG_CMPXCHG_LOCAL.
> 
> It needs to be legal for slub to deref the counter even if the object has
> been freed.
> 

I am trying to follow things, but honestly I am lost.

Isn't commit 1759415e63 planned for 2.6.40?
( ref :
http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=1759415e630e5db0dd2390df9f94892cbfb9a8a2 )

How shall we fix things for 2.6.39? I thought my patch was OK for that.

It's a bit hard to work with you on this stuff; for a report I made ages
ago, I find it incredible it's not yet fixed in linux-2.6.

Christoph, is your plan to make SLUB not compatible with
CONFIG_DEBUG_PAGEALLOC? Luckily we still have SLAB ;)

I am a bit surprised by commit 1759415e63. It's obviously wrong for
DEBUG_PAGEALLOC. Sure, KMEMCHECK could be handled differently (since we
are !SMP in this case).

???
Comment 55 Pekka Enberg 2011-05-10 09:48:09 UTC
On 5/10/11 11:43 AM, Eric Dumazet wrote:
> I am trying to follow things but honestly I am lost.
> Isnt commit 1759415e63 planned for 2.6.40 ?
> ( ref :
>
> http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=1759415e630e5db0dd2390df9f94892cbfb9a8a2
> )

Yes, it's for 2.6.40.

> How shall we fix things for 2.6.39 ? I thought my patch was OK for that.
>

It's so late in the release cycle that I think the best option is to fix 
it in 2.6.40 and backport it to -stable together with the above commit.

> Its a bit hard to work with you on this stuff, for a report I made ages
> ago, I find it incredible its not yet fixed in linux-2.6.

It's not incredible, I simply managed to miss your patch. Sorry about that.

                     Pekka
Comment 56 Eric Dumazet 2011-05-10 10:03:58 UTC
Le mardi 10 mai 2011 à 12:47 +0300, Pekka Enberg a écrit :
> On 5/10/11 11:43 AM, Eric Dumazet wrote:
> > I am trying to follow things but honestly I am lost.
> > Isnt commit 1759415e63 planned for 2.6.40 ?
> > ( ref :
> >
> http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=1759415e630e5db0dd2390df9f94892cbfb9a8a2
> )
> 
> Yes, it's for 2.6.40.
> 
> > How shall we fix things for 2.6.39 ? I thought my patch was OK for that.
> >
> 
> It's so late in the release cycle that I think the best option is to fix 
> it in 2.6.40 and backport it to -stable together with the above commit.
> 
> > Its a bit hard to work with you on this stuff, for a report I made ages
> > ago, I find it incredible its not yet fixed in linux-2.6.
> 
> It's not incredible, I simply managed to miss your patch. Sorry about that.
> 

Pekka, my words were probably too strong, sorry for that.

What I meant is I don't understand how Christoph expects to solve this
problem if irqsafe_cpu_cmpxchg_double() is used everywhere.
Comment 57 Pekka Enberg 2011-05-10 10:04:07 UTC
On 5/10/11 11:43 AM, Eric Dumazet wrote:
> Le lundi 09 mai 2011 à 15:04 -0500, Christoph Lameter a écrit :
>> On Mon, 9 May 2011, Pekka Enberg wrote:
>>
>>> On Wed, 20 Apr 2011, Eric Dumazet wrote:
>>>> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
>>>>
>>>> Christian Casteyde reported a KMEMCHECK splat in slub code.
>>>>
>>>> Problem is now we are lockless and allow IRQ in slab_alloc(), the object
>>>> we manipulate from freelist can be allocated and freed right before we
>>>> try to read object->next.
>>>>
>>>> Same problem can happen with DEBUG_PAGEALLOC
>>>>
>>>> Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
>>>> CONFIG_DEBUG_PAGEALLOC is defined.
>>> Christoph, Eric, is this still relevant after commit 1759415 ("slub: Remove
>>> CONFIG_CMPXCHG_LOCAL ifdeffery") in slab/next of slab.git?
>> There is still an issue and now you can no longer fix the thing through
>> CONFIG_CMPXCHG_LOCAL.
>>
>> It needs to be legal for slub to deref the counter even if the object has
>> been freed.
>>
> I am trying to follow things but honestly I am lost.
>
> Isnt commit 1759415e63 planned for 2.6.40 ?
> ( ref :
>
> http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=1759415e630e5db0dd2390df9f94892cbfb9a8a2
> )
>
> How shall we fix things for 2.6.39 ? I thought my patch was OK for that.
>
>
> Its a bit hard to work with you on this stuff, for a report I made ages
> ago, I find it incredible its not yet fixed in linux-2.6.
>
Can't we fix the issue by putting kmemcheck_mark_initialized() into
set_freepointer()?
Comment 58 Pekka Enberg 2011-05-10 10:11:13 UTC
Hi Eric,

On 5/10/11 1:03 PM, Eric Dumazet wrote:
> What I meant is I dont understand how Christoph expect to solve this
> problem if irqsafe_cpu_cmpxchg_double() is used everywhere.

Do you see any problem with annotating the freelist pointers with 
kmemcheck_mark_initialized() in set_freepointer()?

The #ifdef removal was requested by Linus et al when we fixed a bug that 
was causing problems during boot on certain CPUs. So even though it 
invalidates your fix, it's the way to go forward.

			Pekka
Comment 59 Eric Dumazet 2011-05-10 10:17:42 UTC
Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit :

> Can't we fix the issue by putting kmemcheck_mark_initialized() to 
> set_freepointer()?

This would solve the kmemcheck problem, not DEBUG_PAGEALLOC.
Comment 60 Pekka Enberg 2011-05-10 10:20:00 UTC
On 5/10/11 1:17 PM, Eric Dumazet wrote:
> Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit :
>
>> Can't we fix the issue by putting kmemcheck_mark_initialized() to
>> set_freepointer()?
>
> This would solve kmemcheck problem, not DEBUG_PAGEALLOC

Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB.
Comment 61 Eric Dumazet 2011-05-10 11:53:06 UTC
Le mardi 10 mai 2011 à 13:19 +0300, Pekka Enberg a écrit :
> On 5/10/11 1:17 PM, Eric Dumazet wrote:
> > Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit :
> >
> >> Can't we fix the issue by putting kmemcheck_mark_initialized() to
> >> set_freepointer()?
> >
> > This would solve kmemcheck problem, not DEBUG_PAGEALLOC
> 
> Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB.


Maybe have a special version of "irqsafe_cpu_cmpxchg_double" and make it
use something like:

local_irq_save(flags);
if (probe_kernel_address(...)) {
	res = false;
} else {
	res = unsafe_cpu_cmpxchg_double(...);
}
local_irq_restore(flags);

Just an idea; I don't have time to cook a patch.
Comment 62 Anonymous Emailer 2011-05-10 12:25:19 UTC
Reply-To: vegardno@ifi.uio.no

 On Tue, 10 May 2011 13:19:35 +0300, Pekka Enberg 
 <penberg@cs.helsinki.fi> wrote:
> On 5/10/11 1:17 PM, Eric Dumazet wrote:
>> Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit :
>>
>>> Can't we fix the issue by putting kmemcheck_mark_initialized() to
>>> set_freepointer()?
>>
>> This would solve kmemcheck problem, not DEBUG_PAGEALLOC
>
> Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB.

 Hi,

 Disclaimer: I don't know how the algorithm is supposed to work, so 
 please excuse me if this makes no sense at all. But here goes:

 Presumably the problem is that the page can get freed, and that with 
 DEBUG_PAGEALLOC, the page will therefore not be present and subsequently 
 trigger a page fault when doing this cmpxchg() on the possibly freed 
 object.

 Regardless of DEBUG_PAGEALLOC or kmemcheck, what happens if the page
 gets freed, then allocated again for a completely different purpose in
 another part of the kernel, and the new user of the page by chance
 writes the same "tid" number that the cmpxchg() is expecting?

 I guess I'm asking: Are we relying on the fact that this has a 
 negligible probability of happening? Or did I just misunderstand what 
 the algorithm does, and it is in fact provable (in theory) that nothing 
 wrong can happen?

 Thanks,


 Vegard
Comment 63 Anonymous Emailer 2011-05-10 16:33:36 UTC
Reply-To: cl@linux.com

On Tue, 10 May 2011, Pekka Enberg wrote:

> On 5/10/11 1:17 PM, Eric Dumazet wrote:
> > Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit :
> >
> > > Can't we fix the issue by putting kmemcheck_mark_initialized() to
> > > set_freepointer()?
> >
> > This would solve kmemcheck problem, not DEBUG_PAGEALLOC
>
> Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB.

Well, back to the #ifdef then? Or does DEBUG_PAGEALLOC have some override
mechanism so that we can do a speculative memory access?
Comment 65 Anonymous Emailer 2011-05-10 16:39:49 UTC
Reply-To: cl@linux.com

On Tue, 10 May 2011, Vegard Nossum wrote:

> Presumably the problem is that the page can get freed, and that with
> DEBUG_PAGEALLOC, the page will therefore not be present and subsequently
> trigger a page fault when doing this cmpxchg() on the possibly freed object.

The problem is not the cmpxchg. The cmpxchg is occurring on the per cpu
structure for the slab and that remains even if the page is freed.

The problem is the speculative fetch of the address of the following
object from a pointer into the page. The cmpxchg will fail in that case
because the TID was incremented and the result of the address fetch will
be discarded.

> Regardless of DEBUG_PAGEALLOC or kmemcheck, what happens if the page gets
> freed, then allocated again for a completely different purpose in another
> part
> of the kernel, and new user of the page by chance writes the same "tid"
> number
> that the cmpxchg() is expecting?

The tid is not stored in the page struct but in a per cpu structure.

Doing an explicit check whether this is an illegal address in the
DEBUG_PAGEALLOC case and redoing the loop will address the issue.

So

#ifdef CONFIG_DEBUG_PAGEALLOC
	if (illegal_page_alloc_address(object))
		goto redo;
#endif

before the cmpxchg should do the trick.
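To make the failure mode concrete, the allocation-side fastpath being described is, condensed (an illustrative sketch, not the verbatim mm/slub.c code of the time), roughly:

redo:
	c = __this_cpu_ptr(s->cpu_slab);
	tid = c->tid;			/* per cpu transaction id */
	barrier();
	object = c->freelist;		/* per cpu data, always mapped */

	/*
	 * Speculative fetch: 'object' may point into a slab page that
	 * an interrupt has meanwhile freed (and that DEBUG_PAGEALLOC
	 * has unmapped).  The value read here is discarded if the
	 * cmpxchg below fails, but the access itself can still fault.
	 */
	next_object = get_freepointer(s, object);

	if (unlikely(!irqsafe_cpu_cmpxchg_double(
			s->cpu_slab->freelist, s->cpu_slab->tid,
			object, tid,
			next_object, next_tid(tid))))
		goto redo;	/* tid was incremented: retry */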
Comment 67 Eric Dumazet 2011-05-10 17:15:12 UTC
Le mardi 10 mai 2011 à 11:39 -0500, Christoph Lameter a écrit :

> #ifdef CONFIG_DEBUG_PAGE_ALLOC
>       if (illegal_page_alloc-address(object))
>               goto redo;
> #endif
> 
> before the cmpxchg should do the trick.
> 

Again, it won't work...

You can have an IRQ right after the check and before the cmpxchg.

This interrupt can allocate this block of memory, free it, and unmap the
page from memory.

cmpxchg() reads unmapped memory -> BUG
Comment 68 Anonymous Emailer 2011-05-10 17:31:10 UTC
Reply-To: cl@linux.com

On Tue, 10 May 2011, Eric Dumazet wrote:

> Le mardi 10 mai 2011 à 11:39 -0500, Christoph Lameter a écrit :
>
> > #ifdef CONFIG_DEBUG_PAGE_ALLOC
> >     if (illegal_page_alloc-address(object))
> >             goto redo;
> > #endif
> >
> > before the cmpxchg should do the trick.
> >
>
> Again, it wont work...
>
> You can have an IRQ right after the check and before cmpxchg

OK, I guess we then also need to disable IRQs if CONFIG_DEBUG_PAGEALLOC is set?

The cmpxchg is not the problem. The problem is the following expression
which retrieves the pointer to the next available object from the object
on the page:

get_freepointer(s, object)

In the CONFIG_DEBUG_PAGEALLOC case we could disable interrupts, then do
the check, then fetch the pointer, and then re-enable interrupts.

All of this can occur before the cmpxchg.

> This interrupt can allocate this block of memory, free it, and unmap
> page from memory.
>
> cmpxchg() reads unmapped memory -> BUG

The cmpxchg is not accessing any memory on the page.
Comment 70 Anonymous Emailer 2011-05-10 17:43:58 UTC
Reply-To: cl@linux.com

Draft for a patch


Subject: slub: Make CONFIG_DEBUG_PAGEALLOC work with new fastpath

The fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGEALLOC
may have marked as invalid, to retrieve the pointer to the next free object.

Probe that address before dereferencing the pointer to the page.
All of that needs to occur with interrupts disabled since an interrupt
could cause the page status to change (as pointed out by Eric).

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-05-10 12:35:30.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-05-10 12:38:53.000000000 -0500
@@ -261,6 +261,27 @@ static inline void *get_freepointer(stru
 	return *(void **)(object + s->offset);
 }

+static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
+{
+	void *p;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (probe_kernel_address(object + s->offset, p))
+		p = NULL;	/* Invalid */
+	else
+		p = get_freepointer(s, object);
+
+	local_irq_restore(flags);
+#else
+	p = get_freepointer(s, object);
+#endif
+	return p;
+}
+
 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
 {
 	*(void **)(object + s->offset) = fp;
@@ -1933,7 +1954,7 @@ redo:
 		if (unlikely(!irqsafe_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				object, tid,
-				get_freepointer(s, object), next_tid(tid)))) {
+				get_freepointer_safe(s, object), next_tid(tid)))) {

 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;
Comment 72 Eric Dumazet 2011-05-10 18:06:25 UTC
Le mardi 10 mai 2011 à 12:43 -0500, Christoph Lameter a écrit :
> Draft for a patch
> 
> 
> Subject: slub: Make CONFIG_PAGE_ALLOC work with new fastpath
> 
> Fastpath can do a speculative access to a page that CONFIG_PAGE_ALLOC may
> have
> marked as invalid to retrieve the pointer to the next free object.
> 
> Probe that address before dereferencing the pointer to the page.
> All of that needs to occur with interrupts disabled since an interrupt
> could cause the page status to change (as pointed out by Eric).
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> ---
>  mm/slub.c |   23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c  2011-05-10 12:35:30.000000000 -0500
> +++ linux-2.6/mm/slub.c       2011-05-10 12:38:53.000000000 -0500
> @@ -261,6 +261,27 @@ static inline void *get_freepointer(stru
>       return *(void **)(object + s->offset);
>  }
> 
> +static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> +{
> +     void *p;
> +
> +#ifdef CONFIG_PAGE_ALLOC
> +     unsigned long flags;
> +
> +     local_irq_save(flags);
> +
> +     if (probe_kernel_address(object))
> +             p = NULL;       /* Invalid */
> +     else
> +             p = get_freepointer(s, object);
> +
> +     local_irq_restore(flags);
> +#else
> +     p = get_freepointer(s, object);
> +#endif
> +     return p;
> +}
> +
>  static inline void set_freepointer(struct kmem_cache *s, void *object, void
>  *fp)
>  {
>       *(void **)(object + s->offset) = fp;
> @@ -1933,7 +1954,7 @@ redo:
>               if (unlikely(!irqsafe_cpu_cmpxchg_double(
>                               s->cpu_slab->freelist, s->cpu_slab->tid,
>                               object, tid,
> -                             get_freepointer(s, object), next_tid(tid)))) {
> +                             get_freepointer_safe(s, object),
> next_tid(tid)))) {
> 
>                       note_cmpxchg_failure("slab_alloc", s, tid);
>                       goto redo;


Really this won't work, Stephen.

You have to disable IRQs _before_ even fetching 'object'.

Or else, you can have an IRQ, allocate this object, and pass it to another cpu.

This other cpu can free the object and unmap the page right after you did
the probe_kernel_address() (successfully), and before your cpu does:

p = get_freepointer(s, object); << BUG >>

I really don't understand your motivation to keep the buggy commit.
Comment 73 Anonymous Emailer 2011-05-10 18:07:48 UTC
Reply-To: cl@linux.com

There is a simpler version, and I think we can get away without disabling
interrupts. The value that we get from the read does not matter, since the
TID will not match.


Subject: slub: Make CONFIG_DEBUG_PAGEALLOC work with new fastpath

The fastpath can do a speculative access to a page that CONFIG_DEBUG_PAGEALLOC
may have marked as invalid, to retrieve the pointer to the next free object.

Probe that address before dereferencing the pointer to the page.

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-05-10 12:54:00.000000000 -0500
+++ linux-2.6/mm/slub.c	2011-05-10 13:04:18.000000000 -0500
@@ -261,6 +261,18 @@ static inline void *get_freepointer(stru
 	return *(void **)(object + s->offset);
 }

+static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
+{
+	void *p;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+	probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
+#else
+	p = get_freepointer(s, object);
+#endif
+	return p;
+}
+
 static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
 {
 	*(void **)(object + s->offset) = fp;
@@ -1943,7 +1955,7 @@ redo:
 		if (unlikely(!irqsafe_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				object, tid,
-				get_freepointer(s, object), next_tid(tid)))) {
+				get_freepointer_safe(s, object), next_tid(tid)))) {

 			note_cmpxchg_failure("slab_alloc", s, tid);
 			goto redo;
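For context, probe_kernel_read() is the existing kernel helper

	long probe_kernel_read(void *dst, const void *src, size_t size);

which copies size bytes from a possibly faulting kernel address without oopsing, returning 0 on success and -EFAULT if the source faults. The draft above deliberately ignores the return value: if the read faults, p is left uninitialized, but in that case the tid has necessarily changed, so the subsequent cmpxchg fails and the garbage value is discarded. A variant that makes this explicit could read (illustrative only, not part of the posted patch):

	if (probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p)))
		p = NULL;	/* faulted: value irrelevant, cmpxchg will fail */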
Comment 74 Anonymous Emailer 2011-05-10 19:06:13 UTC
Reply-To: cl@linux.com

On Tue, 10 May 2011, Christoph Lameter wrote:

> > This other cpu can free the object and unmap page right after you did
> > the probe_kernel_address(object) (successfully), and before your cpu :
> >
> > p = get_freepointer(s, object); << BUG >>
>
> If the other cpu frees the object and unmaps the page then
> get_freepointer_safe() can obtain an arbitrary value since the TID was
> incremented. We will restart the loop and discard the value retrieved.

OK, I forgot the element there of a different cpu. A different cpu cannot
unmap or free the page, since the page is in a frozen state while we
allocate from it.

The page is only handled by the cpu it was assigned to until the cpu which
froze it releases it.

The only case that we need to protect against here is the case when an
interrupt or reschedule causes the *same* cpu to release the page. In that
case the TID must have been incremented.
Comment 76 Anonymous Emailer 2011-05-10 19:28:46 UTC
Reply-To: cl@linux.com

On Tue, 10 May 2011, Eric Dumazet wrote:

> > +   else
> > +           p = get_freepointer(s, object);
> > +
> > +   local_irq_restore(flags);
> > +#else
> > +   p = get_freepointer(s, object);
> > +#endif
> > +   return p;
> > +}
> > +
> >  static inline void set_freepointer(struct kmem_cache *s, void *object,
> void *fp)
> >  {
> >     *(void **)(object + s->offset) = fp;
> > @@ -1933,7 +1954,7 @@ redo:
> >             if (unlikely(!irqsafe_cpu_cmpxchg_double(
> >                             s->cpu_slab->freelist, s->cpu_slab->tid,
> >                             object, tid,
> > -                           get_freepointer(s, object), next_tid(tid)))) {
> > +                           get_freepointer_safe(s, object),
> next_tid(tid)))) {
> >
> >                     note_cmpxchg_failure("slab_alloc", s, tid);
> >                     goto redo;
>
>
> Really this wont work Stephen

I am not Stephen.

> You have to disable IRQ _before_ even fetching 'object'

The object pointer is being obtained from a per cpu structure and
not from the page. What is the problem with fetching the object pointer?

> Or else, you can have an IRQ, allocate this object, pass to another cpu.

If that occurs then TID is being incremented and we will restart the loop
getting the new object pointer from the per cpu structure. The object
pointer that we were considering is irrelevant.

> This other cpu can free the object and unmap page right after you did
> the probe_kernel_address(object) (successfully), and before your cpu :
>
> p = get_freepointer(s, object); << BUG >>

If the other cpu frees the object and unmaps the page then
get_freepointer_safe() can obtain an arbitrary value since the TID was
incremented. We will restart the loop and discard the value retrieved.
Comment 78 Eric Dumazet 2011-05-10 19:32:57 UTC
Le mardi 10 mai 2011 à 13:28 -0500, Christoph Lameter a écrit :
> On Tue, 10 May 2011, Eric Dumazet wrote:
> 
> > > + else
> > > +         p = get_freepointer(s, object);
> > > +
> > > + local_irq_restore(flags);
> > > +#else
> > > + p = get_freepointer(s, object);
> > > +#endif
> > > + return p;
> > > +}
> > > +
> > >  static inline void set_freepointer(struct kmem_cache *s, void *object,
> void *fp)
> > >  {
> > >   *(void **)(object + s->offset) = fp;
> > > @@ -1933,7 +1954,7 @@ redo:
> > >           if (unlikely(!irqsafe_cpu_cmpxchg_double(
> > >                           s->cpu_slab->freelist, s->cpu_slab->tid,
> > >                           object, tid,
> > > -                         get_freepointer(s, object), next_tid(tid)))) {
> > > +                         get_freepointer_safe(s, object),
> next_tid(tid)))) {
> > >
> > >                   note_cmpxchg_failure("slab_alloc", s, tid);
> > >                   goto redo;
> >
> >
> > Really this wont work Stephen
> 
> I am not Stephen.
> 

Yes, sorry Christoph.

> > You have to disable IRQ _before_ even fetching 'object'
> 
> The object pointer is being obtained from a per cpu structure and
> not from the page. What is the problem with fetching the object pointer?
> 
> > Or else, you can have an IRQ, allocate this object, pass to another cpu.
> 
> If that occurs then TID is being incremented and we will restart the loop
> getting the new object pointer from the per cpu structure. The object
> pointer that we were considering is irrelevant.
> 

The problem is not restarting the loop, but avoiding access to an invalid
memory area.

> > This other cpu can free the object and unmap page right after you did
> > the probe_kernel_address(object) (successfully), and before your cpu :
> >
> > p = get_freepointer(s, object); << BUG >>
> 
> If the other cpu frees the object and unmaps the page then
> get_freepointer_safe() can obtain an arbitrary value since the TID was
> incremented. We will restart the loop and discard the value retrieved.
> 



In the current code I see:

tid = c->tid;
barrier();
object = c->freelist;

There is no guarantee c->tid is fetched before c->freelist by the cpu.

You need rmb() here.

I claim all this would be far simpler if we disabled IRQs before fetching
c->tid and c->freelist in the DEBUG_PAGEALLOC case.

You would not even need to use the magic probe_kernel_read().

Why do you try so _hard_ to optimize this, I really wonder.
Nobody is able to read this code anymore and prove it's correct.
Comment 79 Anonymous Emailer 2011-05-10 19:39:26 UTC
Reply-To: cl@linux.com

On Tue, 10 May 2011, Eric Dumazet wrote:

> > > You have to disable IRQ _before_ even fetching 'object'
> >
> > The object pointer is being obtained from a per cpu structure and
> > not from the page. What is the problem with fetching the object pointer?
> >
> > > Or else, you can have an IRQ, allocate this object, pass to another cpu.
> >
> > If that occurs then TID is being incremented and we will restart the loop
> > getting the new object pointer from the per cpu structure. The object
> > pointer that we were considering is irrelevant.
> >
>
> Problem is not restart the loop, but avoiding accessing a non valid
> memory area.

Yes and could you please explain clearly what the problem is?

>
> > > This other cpu can free the object and unmap page right after you did
> > > the probe_kernel_address(object) (successfully), and before your cpu :
> > >
> > > p = get_freepointer(s, object); << BUG >>
> >
> > If the other cpu frees the object and unmaps the page then
> > get_freepointer_safe() can obtain an arbitrary value since the TID was
> > incremented. We will restart the loop and discard the value retrieved.
> >
>
>
>
> In current code I see :
>
> tid = c->tid;
> barrier();
> object = c->freelist;
>
> There is no guarantee c->tid is fetched before c->freelist by cpu.
>
> You need rmb() here.

Nope. This is not processor to processor concurrency. this_cpu operations
only deal with concurrency issues on the same processor. I.e. interrupts
and preemption.

> I claim all this would be far more simple disabling IRQ before fetching
> c->tid and c->freelist, in DEBUG_PAGE_ALLOC case.
>
> You would not even need to use magic probe_kernel_read()
>
>
> Why do you try so _hard_ trying to optimize this, I really wonder.
> Nobody is able to read this code anymore and prove its correct.

Optimizing? You think about this as a concurrency issue between multiple
cpus. That is fundamentally wrong. This is dealing with access to per cpu
data, and the concurrency issues are only with code running on the *same*
cpu.
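As background for the barrier()-versus-rmb() exchange: in the kernel, barrier() is only a compiler barrier, essentially

	#define barrier() __asm__ __volatile__("" : : : "memory")

It prevents the compiler from reordering or caching the two loads across it, but emits no CPU fence instruction. Per the argument above, that is sufficient here because the only concurrency is with interrupts/preemption on the same cpu, and a CPU observes its own memory accesses in program order; rmb()/smp_rmb() would only matter for ordering as seen from another processor.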
Comment 81 Christian Casteyde 2011-05-10 20:01:43 UTC
Just for information: the computer on which I got these warnings indeed has only one core. Moreover, I cannot reproduce the warning anymore with rc7.

However, I got another warning (far more reproducible), reported in https://bugzilla.kernel.org/show_bug.cgi?id=34882 (I do not know whether it is the same issue; I reported it separately anyway).

The current one may no longer appear either because I installed a new Slackware in 32-bit mode instead of the 64-bit one I used before, or because I removed a memory module that failed last week (reducing RAM to 512 MB, though the machine is faster now), or of course simply because it is rc7.
Comment 82 Eric Dumazet 2011-05-10 20:06:58 UTC
Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit :

> Optimizing? You think about this as concurrency issue between multiple
> cpus. That is fundamentally wrong. This is dealing with access to per cpu
> data and the concurrency issues are only with code running on the *same*
> cpu.
> 

If you enable irqs, then this object can be allocated by _this_ cpu and
given to another one.

Another cpu can free the page, forcing you to call a very expensive
function that might give an obsolete result as soon as it returns.

Maybe I am just tired tonight; this seems very obvious, so I must be
missing something.
Comment 83 Anonymous Emailer 2011-05-10 20:33:58 UTC
Reply-To: cl@linux.com

On Tue, 10 May 2011, Eric Dumazet wrote:

> Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit :
>
> > Optimizing? You think about this as concurrency issue between multiple
> > cpus. That is fundamentally wrong. This is dealing with access to per cpu
> > data and the concurrency issues are only with code running on the *same*
> > cpu.
> >
>
> If you enable irqs, then this object can be allocated by _this_ cpu and
> given to another one.

That will cause the tid to be incremented.

> Another cpu can free the page, forcing you to call a very expensive
> function, that might give obsolete result as soon it returns.

No, the other cpu cannot free the page, since the page is pinned by
the current cpu (see PageFrozen()).

> Maybe I am just tired tonight, this seems very obvious, I must miss
> something.

Yeah, you are way off, thinking about cpu-to-cpu concurrency issues that do
not apply here.
Comment 85 Eric Dumazet 2011-05-10 20:46:26 UTC
Le mardi 10 mai 2011 à 15:33 -0500, Christoph Lameter a écrit :
> On Tue, 10 May 2011, Eric Dumazet wrote:
> 
> > Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit :
> >
> > > Optimizing? You think about this as concurrency issue between multiple
> > > cpus. That is fundamentally wrong. This is dealing with access to per cpu
> > > data and the concurrency issues are only with code running on the *same*
> > > cpu.
> > >
> >
> > If you enable irqs, then this object can be allocated by _this_ cpu and
> > given to another one.
> 
> That will cause an incrementing of the tid.
> 
> > Another cpu can free the page, forcing you to call a very expensive
> > function, that might give obsolete result as soon it returns.
> 
> No the other cpu cannot free the page since the page is pinned by
> the current cpu (see PageFrozen()).
> 

What happens then? The other cpu calls kfree() on the last non-freed object
for this slab, and yet the page stays frozen? How is this page going to be
freed at all?

> > Maybe I am just tired tonight, this seems very obvious, I must miss
> > something.
> 
> Yeah you are way off thinking about cpu to cpu concurrency issues that do
> not apply here.

I fail to understand how the current cpu can assert page ownership if IRQs
are enabled; it seems obvious that it cannot.
Comment 86 Anonymous Emailer 2011-05-10 21:22:32 UTC
Reply-To: cl@linux.com

On Tue, 10 May 2011, Eric Dumazet wrote:

> > No the other cpu cannot free the page since the page is pinned by
> > the current cpu (see PageFrozen()).
> >
>
> What happens then ? Other cpu calls kfree() on last nonfreed object for
> this slab, and yet the page stay frozen ? How this page is going to be
> freed at all ?

Yes, the page stays frozen. The freed objects are used to replenish the
per cpu freelist when it becomes empty.

The page is going to be freed when a kmalloc() finds that the per cpu
freelist is empty and that the freelist of the page is also empty. Then
interrupts are disabled, the old page is unfrozen and a new
page is acquired for allocation.

> > > Maybe I am just tired tonight, this seems very obvious, I must miss
> > > something.
> >
> > Yeah you are way off thinking about cpu to cpu concurrency issues that do
> > not apply here.
>
> I fail to understand how current cpu can assert page ownership, if IRQs
> are enabled, this seems obvious it cannot.

The cpu sets a page flag called PageFrozen() and points a per cpu pointer
to the page.
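Condensing the lifecycle just described into pseudocode (an illustrative sketch; unfreeze_slab() and get_new_slab() are simplified stand-ins, not the exact mm/slub.c functions, and the real refill path is more involved):

	/* Refill, reached from kmalloc() when both freelists are empty. */
	local_irq_save(flags);
	if (!c->freelist && !c->page->freelist) {
		unfreeze_slab(s, c->page);	/* others may free it now */
		c->page = get_new_slab(s);	/* frozen for this cpu */
		c->freelist = c->page->freelist;
	}
	local_irq_restore(flags);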
Comment 88 Eric Dumazet 2011-05-11 03:12:59 UTC
Le mardi 10 mai 2011 à 16:22 -0500, Christoph Lameter a écrit :
> On Tue, 10 May 2011, Eric Dumazet wrote:
> 
> > > No the other cpu cannot free the page since the page is pinned by
> > > the current cpu (see PageFrozen()).
> > >
> >
> > What happens then ? Other cpu calls kfree() on last nonfreed object for
> > this slab, and yet the page stay frozen ? How this page is going to be
> > freed at all ?
> 
> Yes the page stays frozen. The freed objects are used to replenish the
> percpu free list when it becomes empty.
> 
> The page is going to be freed when a kmalloc() finds that the per cpu
> freelist is empty and that the freelist of the page is also empty. Then
> interrupts are disabled, the old page is unfrozen and a new
> page is acquired for allocation.
> 
> > > > Maybe I am just tired tonight, this seems very obvious, I must miss
> > > > something.
> > >
> > > Yeah you are way off thinking about cpu to cpu concurrency issues that do
> > > not apply here.
> >
> > I fail to understand how current cpu can assert page ownership, if IRQs
> > are enabled, this seems obvious it cannot.
> 
> The cpu sets a page flag called PageFrozen() and points a per cpu pointer
> to the page.
> 
> 

So, if I understand you, there is no problem at all and no patch is even
needed? I can start a stress test and you guarantee there won't be a
crash?

Sorry, it's 5:11 in the morning here ;)
Comment 89 Anonymous Emailer 2011-05-12 14:37:17 UTC
Reply-To: cl@linux.com

On Wed, 11 May 2011, Eric Dumazet wrote:

> > The cpu sets a page flag called PageFrozen() and points a per cpu pointer
> > to the page.
> >
> >
>
> So, if I understand you, there is no problem at all and no patch even
> needed ? I can start a stress test and you guarantee there wont be a
> crash ?
>
> Sorry, its 5h11 in the morning here ;)

There is a problem if an interrupt or a preemption occurs and there is no
object left on the page. Then the current page will be unfrozen and a new
page put into place for allocation. The old page may then be freed by some other
process on another processor before we continue the interrupted
slab_alloc().

When slab_alloc() resumes in this scenario, it will ultimately see
that the tid was incremented and so the cmpxchg will fail. But before we
do the cmpxchg we determine the pointer to the next object, and for that
we access the old page. The access must not cause a page fault (which it
currently does with CONFIG_DEBUG_PAGEALLOC). That is why we need the patch introducing

	get_freepointer_safe()

The result does not matter since we will repeat the cmpxchg loop.
Comment 91 Christian Casteyde 2011-07-13 18:53:16 UTC
Update: still present in 3.0-rc7:

WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff8801c34ffa00)
00fc4fc30188ffff210000000000000001000000000000000000000000000000
 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
 ^

Pid: 2629, comm: akonadi_nepomuk Not tainted 3.0.0-rc7 #20 Acer Aspire 7750G/JE70_HR
RIP: 0010:[<ffffffff811203a2>]  [<ffffffff811203a2>] __kmalloc_node_track_caller+0x112/0x1a0
RSP: 0018:ffff8801a0125a38  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000003a560
RDX: 000000000003a558 RSI: 00000000001d3ec0 RDI: ffffffff81a8cb01
RBP: ffff8801a0125a78 R08: 0000000000000040 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8801c34ffa00
R13: ffff8801c7404a00 R14: 00000000000004d0 R15: 00000000ffffffff
FS:  00007f93c32b6780(0000) GS:ffff8801c7800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8801c6642a80 CR3: 00000001a026c000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff8166dcee>] __alloc_skb+0x7e/0x1a0
 [<ffffffff81667ced>] sock_alloc_send_pskb+0x22d/0x3b0
 [<ffffffff81667e80>] sock_alloc_send_skb+0x10/0x20
 [<ffffffff8172897c>] unix_stream_sendmsg+0x32c/0x4f0
 [<ffffffff81661b58>] sock_aio_write+0x138/0x150
 [<ffffffff811250eb>] do_sync_readv_writev+0xcb/0x110
 [<ffffffff811253af>] do_readv_writev+0xcf/0x1e0
 [<ffffffff81125508>] vfs_writev+0x48/0x60
 [<ffffffff8112566f>] sys_writev+0x4f/0xa0
 [<ffffffff817dadfb>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff
Comment 92 Christian Casteyde 2011-09-01 16:59:45 UTC
Update: cannot reproduce on 3.1-rc4
Closing this bug.