Acer Aspire 1511LMi Athlon 64 3GHz in 64bits mode Slackware 64 13.1 Since 2.6.39-rc3 with kmemcheck enabled, I get the following warning: ... pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0fffff: excluding 0xc0000-0xfffff pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: excluding 0x60000000-0x 60ffffff pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: excluding 0xa0000000-0x a0ffffff udev: renamed network interface wlan0 to eth1 WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001b0bb800) 00b00b1b0088ffff0000000000000000cafe1dea20009b0000299a3100000000 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u ^ Pid: 1511, comm: udevd Not tainted 2.6.39-rc3 #1 Acer,Inc. Aspire 1510 /Aspire 1510 RIP: 0010:[<ffffffff810c2f0c>] [<ffffffff810c2f0c>] __kmalloc_track_caller+0xbc/0x1d0 RSP: 0018:ffff88001d3a7a18 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000000284f RDX: 000000000000284e RSI: ffff88001fe5b160 RDI: ffffffff8177e39a RBP: ffff88001d3a7a48 R08: 0000000000000000 R09: ffff88001b931100 R10: 0000000000000000 R11: 0000000000000003 R12: ffff88001b0bb800 R13: ffff88001f803840 R14: 00000000000004d0 R15: ffffffff814769c6 FS: 00007f6ee81f1700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff88001d0b3938 CR3: 000000001d38b000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 [<ffffffff8147ccf2>] __alloc_skb+0x72/0x190 [<ffffffff814769c6>] sock_alloc_send_pskb+0x236/0x3a0 [<ffffffff81476b40>] sock_alloc_send_skb+0x10/0x20 [<ffffffff81523c18>] unix_dgram_sendmsg+0x298/0x770 [<ffffffff814715f3>] sock_sendmsg+0xe3/0x110 [<ffffffff81472603>] sys_sendmsg+0x243/0x3c0 [<ffffffff815e7238>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff Adding 506012k swap on /dev/sda1. Priority:-1 extents:1 across:506012k EXT4-fs (sda2): re-mounted. Opts: (null) EXT4-fs (sda2): re-mounted. Opts: (null) EXT3-fs: barriers not enabled kjournald starting. Commit interval 5 seconds EXT3-fs (sda3): using internal journal EXT3-fs (sda3): mounted filesystem with writeback data mode b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23) ADDRCONF(NETDEV_UP): eth1: link is not ready b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23) ... I cannot build 2.6.39-rc2, and in 2.6.39-rc1 I used to have another warning but not this one.
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Sun, 17 Apr 2011 19:29:39 GMT bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=33502 > > Summary: Caught 64-bit read from uninitialized memory in > __alloc_skb > Product: Networking > Version: 2.5 > Kernel Version: 2.6.39-rc3 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: IPV4 > AssignedTo: shemminger@linux-foundation.org > ReportedBy: casteyde.christian@free.fr > Regression: Yes > > > Acer Aspire 1511LMi > Athlon 64 3GHz in 64bits mode > Slackware 64 13.1 > > Since 2.6.39-rc3 with kmemcheck enabled, I get the following warning: > ... > pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0fffff: excluding > 0xc0000-0xfffff > pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: > excluding > 0x60000000-0x > 60ffffff > pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: > excluding > 0xa0000000-0x > a0ffffff > udev: renamed network interface wlan0 to eth1 > WARNING: kmemcheck: Caught 64-bit read from uninitialized memory > (ffff88001b0bb800) > 00b00b1b0088ffff0000000000000000cafe1dea20009b0000299a3100000000 > u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u > ^ > > Pid: 1511, comm: udevd Not tainted 2.6.39-rc3 #1 Acer,Inc. Aspire 1510 > /Aspire > 1510 > RIP: 0010:[<ffffffff810c2f0c>] [<ffffffff810c2f0c>] > __kmalloc_track_caller+0xbc/0x1d0 > RSP: 0018:ffff88001d3a7a18 EFLAGS: 00010246 > RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000000284f > RDX: 000000000000284e RSI: ffff88001fe5b160 RDI: ffffffff8177e39a > RBP: ffff88001d3a7a48 R08: 0000000000000000 R09: ffff88001b931100 > R10: 0000000000000000 R11: 0000000000000003 R12: ffff88001b0bb800 > R13: ffff88001f803840 R14: 00000000000004d0 R15: ffffffff814769c6 > FS: 00007f6ee81f1700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: ffff88001d0b3938 CR3: 000000001d38b000 CR4: 00000000000006f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 > [<ffffffff8147ccf2>] __alloc_skb+0x72/0x190 > [<ffffffff814769c6>] sock_alloc_send_pskb+0x236/0x3a0 > [<ffffffff81476b40>] sock_alloc_send_skb+0x10/0x20 > [<ffffffff81523c18>] unix_dgram_sendmsg+0x298/0x770 > [<ffffffff814715f3>] sock_sendmsg+0xe3/0x110 > [<ffffffff81472603>] sys_sendmsg+0x243/0x3c0 > [<ffffffff815e7238>] system_call_fastpath+0x16/0x1b > [<ffffffffffffffff>] 0xffffffffffffffff hum. I wonder if kmemcheck is disliking prefetchw()?
Le lundi 18 avril 2011 à 15:38 -0700, Andrew Morton a écrit : > (switched to email. Please respond via emailed reply-to-all, not via the > bugzilla web interface). > > On Sun, 17 Apr 2011 19:29:39 GMT > bugzilla-daemon@bugzilla.kernel.org wrote: > > > https://bugzilla.kernel.org/show_bug.cgi?id=33502 > > > > Summary: Caught 64-bit read from uninitialized memory in > > __alloc_skb > > Product: Networking > > Version: 2.5 > > Kernel Version: 2.6.39-rc3 > > Platform: All > > OS/Version: Linux > > Tree: Mainline > > Status: NEW > > Severity: normal > > Priority: P1 > > Component: IPV4 > > AssignedTo: shemminger@linux-foundation.org > > ReportedBy: casteyde.christian@free.fr > > Regression: Yes > > > > > > Acer Aspire 1511LMi > > Athlon 64 3GHz in 64bits mode > > Slackware 64 13.1 > > > > Since 2.6.39-rc3 with kmemcheck enabled, I get the following warning: > > ... > > pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0fffff: excluding > > 0xc0000-0xfffff > > pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: > excluding > > 0x60000000-0x > > 60ffffff > > pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: > excluding > > 0xa0000000-0x > > a0ffffff > > udev: renamed network interface wlan0 to eth1 > > WARNING: kmemcheck: Caught 64-bit read from uninitialized memory > > (ffff88001b0bb800) > > 00b00b1b0088ffff0000000000000000cafe1dea20009b0000299a3100000000 > > u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u > > ^ > > > > Pid: 1511, comm: udevd Not tainted 2.6.39-rc3 #1 Acer,Inc. Aspire 1510 > /Aspire > > 1510 > > RIP: 0010:[<ffffffff810c2f0c>] [<ffffffff810c2f0c>] > > __kmalloc_track_caller+0xbc/0x1d0 > > RSP: 0018:ffff88001d3a7a18 EFLAGS: 00010246 > > RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000000284f > > RDX: 000000000000284e RSI: ffff88001fe5b160 RDI: ffffffff8177e39a > > RBP: ffff88001d3a7a48 R08: 0000000000000000 R09: ffff88001b931100 > > R10: 0000000000000000 R11: 0000000000000003 R12: ffff88001b0bb800 > > R13: ffff88001f803840 R14: 00000000000004d0 R15: ffffffff814769c6 > > FS: 00007f6ee81f1700(0000) GS:ffffffff81a1b000(0000) > knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > > CR2: ffff88001d0b3938 CR3: 000000001d38b000 CR4: 00000000000006f0 > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 > > [<ffffffff8147ccf2>] __alloc_skb+0x72/0x190 > > [<ffffffff814769c6>] sock_alloc_send_pskb+0x236/0x3a0 > > [<ffffffff81476b40>] sock_alloc_send_skb+0x10/0x20 > > [<ffffffff81523c18>] unix_dgram_sendmsg+0x298/0x770 > > [<ffffffff814715f3>] sock_sendmsg+0xe3/0x110 > > [<ffffffff81472603>] sys_sendmsg+0x243/0x3c0 > > [<ffffffff815e7238>] system_call_fastpath+0x16/0x1b > > [<ffffffffffffffff>] 0xffffffffffffffff > > hum. I wonder if kmemcheck is disliking prefetchw()? Nope, prefetchw() is OK versus kmemcheck. This is in __kmalloc_track_caller(), not in networking stuff. CC Christoph Lameter I guess this_cpu_cmpxchg16b() is the offender. A disassembly of __kmalloc_track_caller() would help, but I feel its the read of s->cpu_slab->freelist It seems to be at address 0xffff88001b0bb800 and contains 0xffff88001b0bb000 but kmemcheck thinks its not initialized. Its located in percpu zone, maybe kmemcheck has a problem with it ? alloc_kmem_cache_cpus() does a call to __alloc_percpu(), so this must have been zeroed at the very beginning of kmem_cache life. 
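For readers without the -rc3 tree at hand, the code under discussion is SLUB's lockless allocation fastpath, enabled by CONFIG_CMPXCHG_LOCAL in 2.6.39. The sketch below is a simplified paraphrase reconstructed from the hunks quoted later in this thread; it is not verbatim kernel source, the slow path and error handling are omitted, and the function name is invented to mark it as a sketch.

static __always_inline void *slab_alloc_sketch(struct kmem_cache *s,
					       gfp_t gfpflags, int node,
					       unsigned long addr)
{
	void **object;
	struct kmem_cache_cpu *c;
	unsigned long tid;

redo:
	/* Per-cpu state is read with IRQs enabled and no lock held. */
	c = __this_cpu_ptr(s->cpu_slab);
	tid = c->tid;			/* snapshot the transaction id */
	barrier();
	object = c->freelist;		/* the s->cpu_slab->freelist read */

	if (unlikely(!object || !node_match(c, node)))
		object = __slab_alloc(s, gfpflags, node, addr, c);
	else {
		/*
		 * get_freepointer() dereferences 'object' speculatively. If an
		 * interrupt allocated and freed this object in the meantime,
		 * the load touches memory that kmemcheck considers
		 * uninitialized (or that DEBUG_PAGEALLOC may have unmapped).
		 */
		if (unlikely(!this_cpu_cmpxchg_double(
				s->cpu_slab->freelist, s->cpu_slab->tid,
				object, tid,
				get_freepointer(s, object), next_tid(tid)))) {
			note_cmpxchg_failure("slab_alloc", s, tid);
			goto redo;	/* tid moved on: stale value discarded */
		}
	}
	return object;
}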
Hmm, looking at mm/slub.c, I wonder what prevents "object" from pointing to a now-freed and unreachable zone of memory. (Say we are interrupted, the object is handed to the interrupt handler, and that path then wants to change page bits to trap accesses.)
On Tuesday 19 April 2011 at 04:51 +0200, Eric Dumazet wrote: > Hmm, looking at mm/slub.c, I wonder what prevents "object" from pointing > to a now freed and unreachable zone of memory. (Say we are interrupted, > object given to interrupt handler and this one wants to change page bits > to trap access) Yes, I suspect this whole business is not compatible with kmemcheck or DEBUG_PAGEALLOC: get_freepointer(s, object) can access freed memory and kmemcheck triggers the fault, while this_cpu_cmpxchg_double() would presumably detect a change of tid and would not perform the freelist/tid change.
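For reference, the helper named here is tiny; its body, visible as context lines in the get_freepointer_mark() patch later in this thread, is just a load of the free pointer stored inside the object itself, which is exactly the access kmemcheck ends up flagging.

/* mm/slub.c (quoted as context in a patch later in this thread): the free
 * pointer lives inside the object at offset s->offset, so fetching the next
 * free object means dereferencing an object that may already have been
 * handed out and freed again. */
static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
	return *(void **)(object + s->offset);
}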
Le mardi 19 avril 2011 à 05:09 +0200, Eric Dumazet a écrit : > Le mardi 19 avril 2011 à 04:51 +0200, Eric Dumazet a écrit : > > > Hmm, looking at mm/slub.c, I wonder what prevents "object" from pointing > > to a now freed and unreachable zone of memory. (Say we are interrupted, > > object given to interrupt handler and this one wants to change page bits > > to trap access) > > > Yes, I suspect this whole business is not kmemcheck compatable, or > DEBUG_PAGEALLOC > > get_freepointer(s, object) can access to freed memory and kmemcheck > triggers the fault, while this_cpu_cmpxchg_double() would presumably > detect a change of tid and would not perform the freelist/tid change. > > > Christian, please try following patch, thanks diff --git a/mm/slub.c b/mm/slub.c index 94d2a33..84febe9 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail) } } -#ifdef CONFIG_CMPXCHG_LOCAL +#if defined(CONFIG_CMPXCHG_LOCAL) && \ + !defined(CONFIG_KMEMCHECK) && !defined(DEBUG_PAGEALLOC) +#define SLUB_USE_CMPXCHG_DOUBLE +#endif + +#ifdef SLUB_USE_CMPXCHG_DOUBLE #ifdef CONFIG_PREEMPT /* * Calculate the next globally unique transaction for disambiguiation @@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char *n, void init_kmem_cache_cpus(struct kmem_cache *s) { -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE int cpu; for_each_possible_cpu(cpu) @@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) page->inuse--; } c->page = NULL; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE c->tid = next_tid(c->tid); #endif unfreeze_slab(s, page, tail); @@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, { void **object; struct page *new; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long flags; local_irq_save(flags); @@ -1819,7 +1824,7 @@ load_freelist: c->node = page_to_nid(c->page); unlock_out: slab_unlock(c->page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE c->tid = next_tid(c->tid); local_irq_restore(flags); #endif @@ -1858,7 +1863,7 @@ new_slab: } if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) slab_out_of_memory(s, gfpflags, node); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif return NULL; @@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, { void **object; struct kmem_cache_cpu *c; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long tid; #else unsigned long flags; @@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, if (slab_pre_alloc_hook(s, gfpflags)) return NULL; -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_save(flags); #else redo: @@ -1910,7 +1915,7 @@ redo: */ c = __this_cpu_ptr(s->cpu_slab); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * The transaction ids are globally unique per cpu and per operation on * a per cpu queue. Thus they can be guarantee that the cmpxchg_double @@ -1927,7 +1932,7 @@ redo: object = __slab_alloc(s, gfpflags, node, addr, c); else { -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * The cmpxchg will only match if there was no additional * operation and if we are on the right processor. 
@@ -1954,7 +1959,7 @@ redo: stat(s, ALLOC_FASTPATH); } -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif @@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, { void *prior; void **object = (void *)x; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long flags; local_irq_save(flags); @@ -2070,7 +2075,7 @@ checks_ok: out_unlock: slab_unlock(page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif return; @@ -2084,7 +2089,7 @@ slab_empty: stat(s, FREE_REMOVE_PARTIAL); } slab_unlock(page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif stat(s, FREE_SLAB); @@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct kmem_cache *s, { void **object = (void *)x; struct kmem_cache_cpu *c; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long tid; #else unsigned long flags; @@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct kmem_cache *s, slab_free_hook(s, x); -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_save(flags); #else @@ -2136,7 +2141,7 @@ redo: */ c = __this_cpu_ptr(s->cpu_slab); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE tid = c->tid; barrier(); #endif @@ -2144,7 +2149,7 @@ redo: if (likely(page == c->page && c->node != NUMA_NO_NODE)) { set_freepointer(s, object, c->freelist); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, @@ -2160,7 +2165,7 @@ redo: } else __slab_free(s, page, x, addr); -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif } @@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s) BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE < SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu)); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * Must align to double word boundary for the double cmpxchg instructions * to work.
Reply-To: cl@linux.com On Tue, 19 Apr 2011, Eric Dumazet wrote: > get_freepointer(s, object) can access to freed memory and kmemcheck > triggers the fault, while this_cpu_cmpxchg_double() would presumably > detect a change of tid and would not perform the freelist/tid change. Sounds right. The new lockless patchset for slub that uses a locked cmpxchg16b will make this behavior even more common since it will do more speculative accesses.
Reply-To: cl@linux.com On Tue, 19 Apr 2011, Eric Dumazet wrote: > } > } > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#if defined(CONFIG_CMPXCHG_LOCAL) && \ > + !defined(CONFIG_KMEMCHECK) && !defined(DEBUG_PAGEALLOC) > +#define SLUB_USE_CMPXCHG_DOUBLE > +#endif > + > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > #ifdef CONFIG_PREEMPT > /* Ugg.. Isnt there some way to indicate to kmemcheck that a speculative access is occurring?
In response to comment #3: I didn't manage to apply the patch on -rc3 (nor on -rc4).
Le mardi 19 avril 2011 à 12:10 -0500, Christoph Lameter a écrit : > On Tue, 19 Apr 2011, Eric Dumazet wrote: > > > } > > } > > > > -#ifdef CONFIG_CMPXCHG_LOCAL > > +#if defined(CONFIG_CMPXCHG_LOCAL) && \ > > + !defined(CONFIG_KMEMCHECK) && !defined(DEBUG_PAGEALLOC) > > +#define SLUB_USE_CMPXCHG_DOUBLE > > +#endif > > + > > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > > #ifdef CONFIG_PREEMPT > > /* > > Ugg.. Isnt there some way to indicate to kmemcheck that a speculative > access is occurring? Yes, here is a totally untested patch (only compiled here), to keep kmemcheck & cmpxchg_double together. diff --git a/mm/slub.c b/mm/slub.c index 94d2a33..43c08c7 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -261,6 +261,14 @@ static inline void *get_freepointer(struct kmem_cache *s, void *object) return *(void **)(object + s->offset); } +static inline void *get_freepointer_mark(struct kmem_cache *s, void *object) +{ + void *ptr = object + s->offset; + + kmemcheck_mark_initialized(ptr, sizeof(void *)); + return *(void **)ptr; +} + static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) { *(void **)(object + s->offset) = fp; @@ -1540,7 +1548,11 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail) } } -#ifdef CONFIG_CMPXCHG_LOCAL +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(DEBUG_PAGEALLOC) +#define SLUB_USE_CMPXCHG_DOUBLE +#endif + +#ifdef SLUB_USE_CMPXCHG_DOUBLE #ifdef CONFIG_PREEMPT /* * Calculate the next globally unique transaction for disambiguiation @@ -1604,7 +1616,7 @@ static inline void note_cmpxchg_failure(const char *n, void init_kmem_cache_cpus(struct kmem_cache *s) { -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE int cpu; for_each_possible_cpu(cpu) @@ -1643,7 +1655,7 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) page->inuse--; } c->page = NULL; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE c->tid = next_tid(c->tid); #endif unfreeze_slab(s, page, tail); @@ -1780,7 +1792,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, { void **object; struct page *new; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long flags; local_irq_save(flags); @@ -1819,7 +1831,7 @@ load_freelist: c->node = page_to_nid(c->page); unlock_out: slab_unlock(c->page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE c->tid = next_tid(c->tid); local_irq_restore(flags); #endif @@ -1858,7 +1870,7 @@ new_slab: } if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) slab_out_of_memory(s, gfpflags, node); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif return NULL; @@ -1887,7 +1899,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, { void **object; struct kmem_cache_cpu *c; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long tid; #else unsigned long flags; @@ -1896,7 +1908,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, if (slab_pre_alloc_hook(s, gfpflags)) return NULL; -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_save(flags); #else redo: @@ -1910,7 +1922,7 @@ redo: */ c = __this_cpu_ptr(s->cpu_slab); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * The transaction ids are globally unique per cpu and per operation on * a per cpu queue. 
Thus they can be guarantee that the cmpxchg_double @@ -1927,7 +1939,7 @@ redo: object = __slab_alloc(s, gfpflags, node, addr, c); else { -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * The cmpxchg will only match if there was no additional * operation and if we are on the right processor. @@ -1943,7 +1955,8 @@ redo: if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, object, tid, - get_freepointer(s, object), next_tid(tid)))) { + get_freepointer_mark(s, object), + next_tid(tid)))) { note_cmpxchg_failure("slab_alloc", s, tid); goto redo; @@ -1954,7 +1967,7 @@ redo: stat(s, ALLOC_FASTPATH); } -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif @@ -2034,7 +2047,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, { void *prior; void **object = (void *)x; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long flags; local_irq_save(flags); @@ -2070,7 +2083,7 @@ checks_ok: out_unlock: slab_unlock(page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif return; @@ -2084,7 +2097,7 @@ slab_empty: stat(s, FREE_REMOVE_PARTIAL); } slab_unlock(page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif stat(s, FREE_SLAB); @@ -2113,7 +2126,7 @@ static __always_inline void slab_free(struct kmem_cache *s, { void **object = (void *)x; struct kmem_cache_cpu *c; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long tid; #else unsigned long flags; @@ -2121,7 +2134,7 @@ static __always_inline void slab_free(struct kmem_cache *s, slab_free_hook(s, x); -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_save(flags); #else @@ -2136,7 +2149,7 @@ redo: */ c = __this_cpu_ptr(s->cpu_slab); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE tid = c->tid; barrier(); #endif @@ -2144,7 +2157,7 @@ redo: if (likely(page == c->page && c->node != NUMA_NO_NODE)) { set_freepointer(s, object, c->freelist); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, @@ -2160,7 +2173,7 @@ redo: } else __slab_free(s, page, x, addr); -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif } @@ -2354,7 +2367,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s) BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE < SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu)); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * Must align to double word boundary for the double cmpxchg instructions * to work.
Reply-To: cl@linux.com On Tue, 19 Apr 2011, Eric Dumazet wrote: > > Ugg.. Isnt there some way to indicate to kmemcheck that a speculative > > access is occurring? > > Yes, here is a totally untested patch (only compiled here), to keep > kmemcheck & cmpxchg_double together. OK, looks somewhat saner. Why does DEBUG_PAGEALLOC need this? Pages are never freed as long as a page is frozen, and a page needs to be frozen to be allocated from. I don't see how we could reference a free page.
On Tuesday 19 April 2011 at 16:18 -0500, Christoph Lameter wrote: > On Tue, 19 Apr 2011, Eric Dumazet wrote: > > > > Ugg.. Isnt there some way to indicate to kmemcheck that a speculative > > > access is occurring? > > > > Yes, here is a totally untested patch (only compiled here), to keep > > kmemcheck & cmpxchg_double together. > > Ok looks somewhat saner. Why does DEBUG_PAGEALLOC need this? Pages are > never freed as long as a page is frozen and a page needs to be frozen to > be allocated from. I dont see how we could reference a free page. We run lockless and with IRQs enabled; are you sure the following cannot happen at this point?
CPU0 - receives an IRQ
CPU0 - in the IRQ handler, allocates this same object, consuming the page(s) containing it
CPU0 - passes this object/memory to another CPU (CPU1)
CPU1 - frees the memory, and the page(s) are freed too
CPU0 - returns from the IRQ
CPU0 - accesses the now-unreachable memory?
Le mercredi 20 avril 2011 à 08:56 +0300, Pekka Enberg a écrit : > On 4/19/11 11:17 PM, Eric Dumazet wrote: > > Le mardi 19 avril 2011 à 12:10 -0500, Christoph Lameter a écrit : > >> On Tue, 19 Apr 2011, Eric Dumazet wrote: > >> > >>> } > >>> } > >>> > >>> -#ifdef CONFIG_CMPXCHG_LOCAL > >>> +#if defined(CONFIG_CMPXCHG_LOCAL)&& \ > >>> + !defined(CONFIG_KMEMCHECK)&& !defined(DEBUG_PAGEALLOC) > >>> +#define SLUB_USE_CMPXCHG_DOUBLE > >>> +#endif > >>> + > >>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE > >>> #ifdef CONFIG_PREEMPT > >>> /* > >> Ugg.. Isnt there some way to indicate to kmemcheck that a speculative > >> access is occurring? > > Yes, here is a totally untested patch (only compiled here), to keep > > kmemcheck& cmpxchg_double together. > > Looks good to me! Can we get someone to test it as well? I tested first patch, but was not able to trigger the fault anyway without it. I'll try a bit more today if time permits... It would be nice if Christian can reproduce the initial problem easily and give us the steps...
On 4/19/11 11:17 PM, Eric Dumazet wrote: > Le mardi 19 avril 2011 à 12:10 -0500, Christoph Lameter a écrit : >> On Tue, 19 Apr 2011, Eric Dumazet wrote: >> >>> } >>> } >>> >>> -#ifdef CONFIG_CMPXCHG_LOCAL >>> +#if defined(CONFIG_CMPXCHG_LOCAL)&& \ >>> + !defined(CONFIG_KMEMCHECK)&& !defined(DEBUG_PAGEALLOC) >>> +#define SLUB_USE_CMPXCHG_DOUBLE >>> +#endif >>> + >>> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >>> #ifdef CONFIG_PREEMPT >>> /* >> Ugg.. Isnt there some way to indicate to kmemcheck that a speculative >> access is occurring? > Yes, here is a totally untested patch (only compiled here), to keep > kmemcheck& cmpxchg_double together. Looks good to me! Can we get someone to test it as well?
On 4/20/11 10:45 AM, casteyde.christian@free.fr wrote: > Indeed I can reproduce each time I boot, however I didn't managed to apply > the > patch to test it on -rc3 (neither -rc4), only the first hunk passed. That's odd. Eric? > I simply boot on this laptop with kmemcheck activated to get the warning, > with > vanilla -rc3. I've also tested -rc4 yersterday (without the patch therefore), > the warning was not the same (it's a bit later, I will post the dmesg output > on > bugzilla tonight). That's expected. You're unlikely to see the problem in the exact same place across different kernel configs and versions. Pekka
On Wednesday 20 April 2011 at 10:49 +0300, Pekka Enberg wrote: > On 4/20/11 10:45 AM, casteyde.christian@free.fr wrote: > > Indeed I can reproduce each time I boot, however I didn't managed to apply > the > > patch to test it on -rc3 (neither -rc4), only the first hunk passed. > > That's odd. Eric? > I don't know; my patches were linux-2.6 based, and I pull the tree very often ;) > > I simply boot on this laptop with kmemcheck activated to get the warning, > with > > vanilla -rc3. I've also tested -rc4 yersterday (without the patch > therefore), > > the warning was not the same (it's a bit later, I will post the dmesg > output on > > bugzilla tonight). > > That's expected. You're unlikely to see the problem in the exact same > place across different kernel configs and versions. > Christian, please send us your .config Thanks
On 4/20/11 11:09 AM, Eric Dumazet wrote: > Le mercredi 20 avril 2011 à 10:49 +0300, Pekka Enberg a écrit : >> On 4/20/11 10:45 AM, casteyde.christian@free.fr wrote: >>> Indeed I can reproduce each time I boot, however I didn't managed to apply >>> the >>> patch to test it on -rc3 (neither -rc4), only the first hunk passed. >> >> That's odd. Eric? >> > > I dont know, my patches were linux-2.6 based, and I pull tree very > often ;) Btw, the patch applies fine here on top of 2.6.39-rc4.
Quoting Eric Dumazet <eric.dumazet@gmail.com>: > > Looks good to me! Can we get someone to test it as well? > > I tested first patch, but was not able to trigger the fault anyway > without it. I'll try a bit more today if time permits... > > It would be nice if Christian can reproduce the initial problem easily > and give us the steps... > > > > Indeed I can reproduce it each time I boot; however, I didn't manage to apply the patch to test it on -rc3 (nor on -rc4), only the first hunk applied. I simply boot this laptop with kmemcheck activated to get the warning, with vanilla -rc3. I also tested -rc4 yesterday (therefore without the patch); the warning was not the same (it appears a bit later, I will post the dmesg output on bugzilla tonight). For info, my laptop is single-core if that matters, it is preemptible, with SLUB, 64-bit, on an old CPU (Athlon64 3000, pre-lahf, in 64-bit mode).
I had one splat (before applying any patch), adding CONFIG_PREEMPT to my build : [ 84.629936] WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88011a7376f0) [ 84.629939] 5868ba1a0188ffff5868ba1a0188ffff7078731a0188ffffe0e1181a0188ffff [ 84.629952] i i i i i i i i i i i i i i i i u u u u u u u u u u u u u u u u [ 84.629963] ^ [ 84.629964] [ 84.629966] Pid: 2060, comm: 05-wait_for_sys Not tainted 2.6.39-rc4-00237-ge3de956-dirty #550 HP ProLiant BL460c G6 [ 84.629969] RIP: 0010:[<ffffffff810f0e80>] [<ffffffff810f0e80>] kmem_cache_alloc+0x50/0x190 [ 84.629977] RSP: 0018:ffff88011a0b9988 EFLAGS: 00010286 [ 84.629979] RAX: 0000000000000000 RBX: ffff88011a739a98 RCX: 0000000000620780 [ 84.629980] RDX: 0000000000620740 RSI: 0000000000017400 RDI: ffffffff810db8f5 [ 84.629982] RBP: ffff88011a0b99b8 R08: ffff88011a261dd8 R09: 0000000000000009 [ 84.629983] R10: 0000000000000002 R11: ffff88011fffce00 R12: ffff88011b00a600 [ 84.629985] R13: ffff88011a7376f0 R14: 00000000000000d0 R15: ffff88011abbb0c0 [ 84.629987] FS: 0000000000000000(0000) GS:ffff88011fc00000(0063) knlGS:00000000f76ff6c0 [ 84.629989] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b [ 84.629991] CR2: ffff88011998dab8 CR3: 000000011a13f000 CR4: 00000000000006f0 [ 84.629992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 84.629994] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 [ 84.629995] [<ffffffff810db8f5>] anon_vma_prepare+0x55/0x190 [ 84.630000] [<ffffffff810cfe90>] handle_pte_fault+0x470/0x6e0 [ 84.630002] [<ffffffff810d1654>] handle_mm_fault+0x124/0x1a0 [ 84.630005] [<ffffffff8147a510>] do_page_fault+0x160/0x560 [ 84.630008] [<ffffffff81477fcf>] page_fault+0x1f/0x30 [ 84.630013] [<ffffffff810b31f8>] generic_file_aio_read+0x528/0x740 [ 84.630017] [<ffffffff810f5461>] do_sync_read+0xd1/0x110 [ 84.630021] [<ffffffff810f5c36>] vfs_read+0xc6/0x160 [ 84.630023] [<ffffffff810f60b0>] sys_read+0x50/0x90 [ 84.630025] [<ffffffff8147f4e9>] sysenter_dispatch+0x7/0x27 [ 84.630030] [<ffffffffffffffff>] 0xffffffffffffffff ffffffff810f0e6d: 0f 84 b5 00 00 00 je ffffffff810f0f28 <kmem_cache_alloc+0xf8> ffffffff810f0e73: 49 63 44 24 20 movslq 0x20(%r12),%rax ffffffff810f0e78: 48 8d 4a 40 lea 0x40(%rdx),%rcx ffffffff810f0e7c: 49 8b 34 24 mov (%r12),%rsi >>ffffffff810f0e80: 49 8b 5c 05 00 mov 0x0(%r13,%rax,1),%rbx ffffffff810f0e85: 4c 89 e8 mov %r13,%rax ffffffff810f0e88: e8 83 19 0e 00 callq ffffffff811d2810 <this_cpu_cmpxchg16b_emu> ffffffff810f0e8d: 0f 1f 40 00 nopl 0x0(%rax) ffffffff810f0e91: 84 c0 test %al,%al ffffffff810f0e93: 74 c1 je ffffffff810f0e56 <kmem_cache_alloc+0x26> So this is definitely the access to object->next that triggers the fault. Problem is : my (2nd) patch only reduces the window of occurrence of the bug, since we can be preempted/interrupted between the kmemcheck_mark_initialized() and access to object->next : The preempt/irq could allocate the object and free it again. So KMEMCHECK would require us to block IRQ to be safe. Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did in my first patch. Christoph, please reconsider first version of patch (disabling SLUB_CMPXCHG_DOUBLE if either KMEMCHECK or DEBUG_PAGEALLOC is defined) Thanks
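To make the remaining window concrete, here is the helper from the second patch (quoted in full earlier in this thread), annotated with where preemption can still slip in; the comment in the middle is an illustration of the race described above, not part of the kernel source.

static inline void *get_freepointer_mark(struct kmem_cache *s, void *object)
{
	void *ptr = object + s->offset;

	kmemcheck_mark_initialized(ptr, sizeof(void *));
	/*
	 * IRQs/preemption are still enabled here: between the marking above
	 * and the load below, an interrupt can allocate this object, pass it
	 * to another CPU, and have it freed again, so kmemcheck (and
	 * DEBUG_PAGEALLOC) can still legitimately trip on the dereference.
	 */
	return *(void **)ptr;
}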
Le mercredi 20 avril 2011 à 11:07 +0200, Eric Dumazet a écrit : > I had one splat (before applying any patch), adding CONFIG_PREEMPT to my > build : > > [ 84.629936] WARNING: kmemcheck: Caught 64-bit read from uninitialized > memory (ffff88011a7376f0) > [ 84.629939] > 5868ba1a0188ffff5868ba1a0188ffff7078731a0188ffffe0e1181a0188ffff > [ 84.629952] i i i i i i i i i i i i i i i i u u u u u u u u u u u u u u u > u > [ 84.629963] ^ > [ 84.629964] > [ 84.629966] Pid: 2060, comm: 05-wait_for_sys Not tainted > 2.6.39-rc4-00237-ge3de956-dirty #550 HP ProLiant BL460c G6 > [ 84.629969] RIP: 0010:[<ffffffff810f0e80>] [<ffffffff810f0e80>] > kmem_cache_alloc+0x50/0x190 > [ 84.629977] RSP: 0018:ffff88011a0b9988 EFLAGS: 00010286 > [ 84.629979] RAX: 0000000000000000 RBX: ffff88011a739a98 RCX: > 0000000000620780 > [ 84.629980] RDX: 0000000000620740 RSI: 0000000000017400 RDI: > ffffffff810db8f5 > [ 84.629982] RBP: ffff88011a0b99b8 R08: ffff88011a261dd8 R09: > 0000000000000009 > [ 84.629983] R10: 0000000000000002 R11: ffff88011fffce00 R12: > ffff88011b00a600 > [ 84.629985] R13: ffff88011a7376f0 R14: 00000000000000d0 R15: > ffff88011abbb0c0 > [ 84.629987] FS: 0000000000000000(0000) GS:ffff88011fc00000(0063) > knlGS:00000000f76ff6c0 > [ 84.629989] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b > [ 84.629991] CR2: ffff88011998dab8 CR3: 000000011a13f000 CR4: > 00000000000006f0 > [ 84.629992] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 84.629994] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: > 0000000000000400 > [ 84.629995] [<ffffffff810db8f5>] anon_vma_prepare+0x55/0x190 > [ 84.630000] [<ffffffff810cfe90>] handle_pte_fault+0x470/0x6e0 > [ 84.630002] [<ffffffff810d1654>] handle_mm_fault+0x124/0x1a0 > [ 84.630005] [<ffffffff8147a510>] do_page_fault+0x160/0x560 > [ 84.630008] [<ffffffff81477fcf>] page_fault+0x1f/0x30 > [ 84.630013] [<ffffffff810b31f8>] generic_file_aio_read+0x528/0x740 > [ 84.630017] [<ffffffff810f5461>] do_sync_read+0xd1/0x110 > [ 84.630021] [<ffffffff810f5c36>] vfs_read+0xc6/0x160 > [ 84.630023] [<ffffffff810f60b0>] sys_read+0x50/0x90 > [ 84.630025] [<ffffffff8147f4e9>] sysenter_dispatch+0x7/0x27 > [ 84.630030] [<ffffffffffffffff>] 0xffffffffffffffff > > ffffffff810f0e6d: 0f 84 b5 00 00 00 je ffffffff810f0f28 > <kmem_cache_alloc+0xf8> > ffffffff810f0e73: 49 63 44 24 20 movslq 0x20(%r12),%rax > ffffffff810f0e78: 48 8d 4a 40 lea 0x40(%rdx),%rcx > ffffffff810f0e7c: 49 8b 34 24 mov (%r12),%rsi > >>ffffffff810f0e80: 49 8b 5c 05 00 mov > 0x0(%r13,%rax,1),%rbx > ffffffff810f0e85: 4c 89 e8 mov %r13,%rax > ffffffff810f0e88: e8 83 19 0e 00 callq ffffffff811d2810 > <this_cpu_cmpxchg16b_emu> > ffffffff810f0e8d: 0f 1f 40 00 nopl 0x0(%rax) > ffffffff810f0e91: 84 c0 test %al,%al > ffffffff810f0e93: 74 c1 je ffffffff810f0e56 > <kmem_cache_alloc+0x26> > > > So this is definitely the access to object->next that triggers the fault. > > Problem is : my (2nd) patch only reduces the window of occurrence of the bug, > since we can > be preempted/interrupted between the kmemcheck_mark_initialized() and access > to object->next : > > The preempt/irq could allocate the object and free it again. > > So KMEMCHECK would require us to block IRQ to be safe. > > Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did in > my first patch. 
> > Christoph, please reconsider first version of patch (disabling > SLUB_CMPXCHG_DOUBLE if > either KMEMCHECK or DEBUG_PAGEALLOC is defined) Or this alternate one : keep cmpxchg_double() stuff but block IRQ in slab_alloc() if KMEMCHECK or DEBUG_PAGEALLOC I tested it and got no kmemcheck splat Thanks [PATCH] slub: block IRQ in slab_alloc() if KMEMCHECK or DEBUG_PAGEALLOC Christian Casteyde reported a KMEMCHECK splat in slub code. Problem is now we are lockless and allow IRQ in slab_alloc, the object we manipulate from freelist can be allocated and freed right before we try to read object->next Same problem can happen with DEBUG_PAGEALLOC Reported-by: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Vegard Nossum <vegardno@ifi.uio.no> Cc: Christoph Lameter <cl@linux.com> --- mm/slub.c | 16 ++++++++++++---- 1 files changed, 12 insertions(+), 4 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 94d2a33..9c34a2b 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1881,7 +1881,13 @@ debug: * If not then __slab_alloc is called for slow processing. * * Otherwise we can simply pick the next object from the lockless free list. + * Since we access object->next, we must disable irqs if KMEMCHECK or + * DEBUG_PAGEALLOC is defined, since object could be allocated and freed + * under us. */ +#if !defined(CONFIG_CMPXCHG_LOCAL) || defined(CONFIG_KMEMCHECK) || defined(CONFIG_DEBUG_PAGEALLOC) +#define MASK_IRQ_IN_SLAB_ALLOC +#endif static __always_inline void *slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, unsigned long addr) { @@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, struct kmem_cache_cpu *c; #ifdef CONFIG_CMPXCHG_LOCAL unsigned long tid; -#else +#endif +#ifdef MASK_IRQ_IN_SLAB_ALLOC unsigned long flags; #endif if (slab_pre_alloc_hook(s, gfpflags)) return NULL; -#ifndef CONFIG_CMPXCHG_LOCAL +#ifdef MASK_IRQ_IN_SLAB_ALLOC local_irq_save(flags); -#else +#endif +#ifdef CONFIG_CMPXCHG_LOCAL redo: #endif @@ -1954,7 +1962,7 @@ redo: stat(s, ALLOC_FASTPATH); } -#ifndef CONFIG_CMPXCHG_LOCAL +#ifdef MASK_IRQ_IN_SLAB_ALLOC local_irq_restore(flags); #endif
Le mercredi 20 avril 2011 à 09:05 -0500, Christoph Lameter a écrit : > On Wed, 20 Apr 2011, Eric Dumazet wrote: > > > > Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did > in my first patch. > > Ok your first patch seems to be the sanest approach. > > > { > > @@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct > kmem_cache *s, > > struct kmem_cache_cpu *c; > > #ifdef CONFIG_CMPXCHG_LOCAL > > unsigned long tid; > > -#else > > +#endif > > +#ifdef MASK_IRQ_IN_SLAB_ALLOC > > unsigned long flags; > > #endif > > > > Yea well that does not bring us much. Well, we keep the fast free path ? only slab_alloc() would have to disable irqs for ~20 instructions.
Le mercredi 20 avril 2011 à 09:42 -0500, Christoph Lameter a écrit : > Avoiding the irq handling yields the savings that improve the fastpath. if > you do both then there is only a regression left. So lets go with > disabling the CMPXCHG_LOCAL. OK, let's do that then. Thanks [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC Christian Casteyde reported a KMEMCHECK splat in slub code. Problem is now we are lockless and allow IRQ in slab_alloc(), the object we manipulate from freelist can be allocated and freed right before we try to read object->next. Same problem can happen with DEBUG_PAGEALLOC Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or CONFIG_DEBUG_PAGEALLOC is defined. Reported-by: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Vegard Nossum <vegardno@ifi.uio.no> Cc: Christoph Lameter <cl@linux.com> --- mm/slub.c | 45 +++++++++++++++++++++++++-------------------- 1 files changed, 25 insertions(+), 20 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 94d2a33..f31ab2c 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail) } } -#ifdef CONFIG_CMPXCHG_LOCAL +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \ + !defined(CONFIG_DEBUG_PAGEALLOC) +#define SLUB_USE_CMPXCHG_DOUBLE +#endif + +#ifdef SLUB_USE_CMPXCHG_DOUBLE #ifdef CONFIG_PREEMPT /* * Calculate the next globally unique transaction for disambiguiation @@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char *n, void init_kmem_cache_cpus(struct kmem_cache *s) { -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE int cpu; for_each_possible_cpu(cpu) @@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) page->inuse--; } c->page = NULL; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE c->tid = next_tid(c->tid); #endif unfreeze_slab(s, page, tail); @@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, { void **object; struct page *new; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long flags; local_irq_save(flags); @@ -1819,7 +1824,7 @@ load_freelist: c->node = page_to_nid(c->page); unlock_out: slab_unlock(c->page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE c->tid = next_tid(c->tid); local_irq_restore(flags); #endif @@ -1858,7 +1863,7 @@ new_slab: } if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) slab_out_of_memory(s, gfpflags, node); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif return NULL; @@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, { void **object; struct kmem_cache_cpu *c; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long tid; #else unsigned long flags; @@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, if (slab_pre_alloc_hook(s, gfpflags)) return NULL; -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_save(flags); #else redo: @@ -1910,7 +1915,7 @@ redo: */ c = __this_cpu_ptr(s->cpu_slab); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * The transaction ids are globally unique per cpu and per operation on * a per cpu queue. 
Thus they can be guarantee that the cmpxchg_double @@ -1927,7 +1932,7 @@ redo: object = __slab_alloc(s, gfpflags, node, addr, c); else { -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * The cmpxchg will only match if there was no additional * operation and if we are on the right processor. @@ -1954,7 +1959,7 @@ redo: stat(s, ALLOC_FASTPATH); } -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif @@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page, { void *prior; void **object = (void *)x; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long flags; local_irq_save(flags); @@ -2070,7 +2075,7 @@ checks_ok: out_unlock: slab_unlock(page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif return; @@ -2084,7 +2089,7 @@ slab_empty: stat(s, FREE_REMOVE_PARTIAL); } slab_unlock(page); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif stat(s, FREE_SLAB); @@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct kmem_cache *s, { void **object = (void *)x; struct kmem_cache_cpu *c; -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE unsigned long tid; #else unsigned long flags; @@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct kmem_cache *s, slab_free_hook(s, x); -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_save(flags); #else @@ -2136,7 +2141,7 @@ redo: */ c = __this_cpu_ptr(s->cpu_slab); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE tid = c->tid; barrier(); #endif @@ -2144,7 +2149,7 @@ redo: if (likely(page == c->page && c->node != NUMA_NO_NODE)) { set_freepointer(s, object, c->freelist); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, @@ -2160,7 +2165,7 @@ redo: } else __slab_free(s, page, x, addr); -#ifndef CONFIG_CMPXCHG_LOCAL +#ifndef SLUB_USE_CMPXCHG_DOUBLE local_irq_restore(flags); #endif } @@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s) BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE < SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu)); -#ifdef CONFIG_CMPXCHG_LOCAL +#ifdef SLUB_USE_CMPXCHG_DOUBLE /* * Must align to double word boundary for the double cmpxchg instructions * to work.
Reply-To: cl@linux.com On Wed, 20 Apr 2011, Eric Dumazet wrote: > Le mardi 19 avril 2011 à 16:18 -0500, Christoph Lameter a écrit : > > On Tue, 19 Apr 2011, Eric Dumazet wrote: > > > > > > Ugg.. Isnt there some way to indicate to kmemcheck that a speculative > > > > access is occurring? > > > > > > Yes, here is a totally untested patch (only compiled here), to keep > > > kmemcheck & cmpxchg_double together. > > > > Ok looks somewhat saner. Why does DEBUG_PAGEALLOC need this? Pages are > > never freed as long as a page is frozen and a page needs to be frozen to > > be allocated from. I dont see how we could reference a free page. > > We run lockless and IRQ enabled, are you sure we cannot at this point : > > CPU0 - Receive an IRQ > CPU0 - Allocate this same object consume the page(s) containing this object > CPU0 - Pass this object/memory to another cpu1 > CPU1 - Free memory and free the page(s) > CPU0 - Return from IRQ > CPU0 - Access now unreachable memory ? True this can occur.
Reply-To: cl@linux.com On Wed, 20 Apr 2011, Eric Dumazet wrote: > > Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I did in > my first patch. Ok your first patch seems to be the sanest approach. > { > @@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct > kmem_cache *s, > struct kmem_cache_cpu *c; > #ifdef CONFIG_CMPXCHG_LOCAL > unsigned long tid; > -#else > +#endif > +#ifdef MASK_IRQ_IN_SLAB_ALLOC > unsigned long flags; > #endif > Yea well that does not bring us much.
On Wednesday 20 April 2011 at 10:17 -0500, Christoph Lameter wrote: > On Wed, 20 Apr 2011, Eric Dumazet wrote: > > > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC > > Acked-by: Christoph Lameter <cl@linux.com> > > > > > -#ifdef CONFIG_CMPXCHG_LOCAL > > +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \ > > + !defined(CONFIG_DEBUG_PAGEALLOC) > > +#define SLUB_USE_CMPXCHG_DOUBLE > > #undef CONFIG_CMPXCHG_LOCAL instead? Would reduce patch size but its kind > of hacky. > Well, it's definitely hacky :) I chose a local definition (instead of playing Kconfig games), referring to cmpxchg_double().
On Wednesday 20 April 2011 at 17:15 +0200, Vegard Nossum wrote: > Thanks, guys. I wonder: Is it possible to make a reproducible test-case > (e.g. loadable module) for this? > It is not easy to code a specific test case; just use a regular workload (and network traffic to get interrupts). Use CONFIG_PREEMPT to trigger preemptions. > Also, pardon my ignorance, but can you explain why this is a bug with > kmemcheck/page-alloc debug and not without them? Without them, we can read object->next without triggering a fault. We may read garbage data (if the object is in use, it was overwritten with user data), but this does not matter because cmpxchg_double() won't perform its work (since the tid will have changed).
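To see why the garbage value is harmless, here is a minimal single-threaded user-space model of the tid guard. Everything in it (the struct, the cmpxchg_double_model() stand-in, the simulated interrupt) is invented for illustration and is not a kernel API; the only point is that a stale speculative read is thrown away when the {freelist, tid} pair no longer matches.

#include <stdbool.h>
#include <stdio.h>

struct cpu_slab {
	void **freelist;        /* next free object */
	unsigned long tid;      /* bumped on every alloc/free */
};

/* Stand-in for the 16-byte cmpxchg: succeeds only if BOTH the freelist
 * pointer and the tid are unchanged since they were snapshotted. */
static bool cmpxchg_double_model(struct cpu_slab *c,
				 void **old_fl, unsigned long old_tid,
				 void **new_fl, unsigned long new_tid)
{
	if (c->freelist != old_fl || c->tid != old_tid)
		return false;   /* someone changed the pair: caller must retry */
	c->freelist = new_fl;
	c->tid = new_tid;
	return true;
}

int main(void)
{
	void *objA[4] = { NULL };       /* objA[0] plays the role of object->next */
	void *objB[4] = { NULL };
	struct cpu_slab c = { .freelist = objA, .tid = 0 };

	/* Fastpath: snapshot freelist and tid (IRQs would still be enabled). */
	void **object = c.freelist;
	unsigned long tid = c.tid;

	/* Simulated interrupt: the object is allocated, overwritten with user
	 * data, and the per-cpu freelist/tid move on underneath us. */
	objA[0] = (void *)0xdeadbeefUL;
	c.freelist = objB;
	c.tid = 2;

	/* Speculative read of object->next: garbage, but never used, because
	 * the compare-exchange below fails (freelist and tid both changed). */
	void *next = *object;

	if (!cmpxchg_double_model(&c, object, tid, next, tid + 1))
		printf("pair changed: stale read %p discarded, fastpath retries\n", next);
	return 0;
}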
Reply-To: cl@linux.com On Wed, 20 Apr 2011, Eric Dumazet wrote: > Le mercredi 20 avril 2011 à 09:05 -0500, Christoph Lameter a écrit : > > On Wed, 20 Apr 2011, Eric Dumazet wrote: > > > > > > Then, just disable SLUB_CMPXCHG_DOUBLE if KMEMCHECK is defined, as I > did in my first patch. > > > > Ok your first patch seems to be the sanest approach. > > > > > { > > > @@ -1889,16 +1895,18 @@ static __always_inline void *slab_alloc(struct > kmem_cache *s, > > > struct kmem_cache_cpu *c; > > > #ifdef CONFIG_CMPXCHG_LOCAL > > > unsigned long tid; > > > -#else > > > +#endif > > > +#ifdef MASK_IRQ_IN_SLAB_ALLOC > > > unsigned long flags; > > > #endif > > > > > > > Yea well that does not bring us much. > > Well, we keep the fast free path ? > > only slab_alloc() would have to disable irqs for ~20 instructions. Avoiding the irq handling yields the savings that improve the fastpath. if you do both then there is only a regression left. So lets go with disabling the CMPXCHG_LOCAL.
Reply-To: vegardno@ifi.uio.no On Wed, 20 Apr 2011 17:01:27 +0200, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le mercredi 20 avril 2011 à 09:42 -0500, Christoph Lameter a écrit : > >> Avoiding the irq handling yields the savings that improve the >> fastpath. if >> you do both then there is only a regression left. So lets go with >> disabling the CMPXCHG_LOCAL. > > OK, let's do that then. > > Thanks > > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or > DEBUG_PAGEALLOC > > Christian Casteyde reported a KMEMCHECK splat in slub code. > > Problem is now we are lockless and allow IRQ in slab_alloc(), the > object > we manipulate from freelist can be allocated and freed right before > we > try to read object->next. > > Same problem can happen with DEBUG_PAGEALLOC > > Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or > CONFIG_DEBUG_PAGEALLOC is defined. > > Reported-by: Christian Casteyde <casteyde.christian@free.fr> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Pekka Enberg <penberg@cs.helsinki.fi> > Cc: Vegard Nossum <vegardno@ifi.uio.no> > Cc: Christoph Lameter <cl@linux.com> Thanks, guys. I wonder: Is it possible to make a reproducible test-case (e.g. loadable module) for this? Also, pardon my ignorance, but can you explain why this is a bug with kmemcheck/page-alloc debug and not without them? Thanks, Vegard
Reply-To: cl@linux.com On Wed, 20 Apr 2011, Eric Dumazet wrote: > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC Acked-by: Christoph Lameter <cl@linux.com> > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \ > + !defined(CONFIG_DEBUG_PAGEALLOC) > +#define SLUB_USE_CMPXCHG_DOUBLE #undef CONFIG_CMPXCHG_LOCAL instead? It would reduce the patch size, but it's kind of hacky.
In response to comment #23 With this patch on 2.6.39-rc4 I get the following warning: b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23) ADDRCONF(NETDEV_UP): eth1: link is not ready b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23) ADDRCONF(NETDEV_UP): eth1: link is not ready eth1: direct probe to 2e:94:8b:66:86:88 (try 1/3) eth1: direct probe responded eth1: authenticate with 2e:94:8b:66:86:88 (try 1) eth1: authenticated eth1: associate with 2e:94:8b:66:86:88 (try 1) eth1: RX AssocResp from 2e:94:8b:66:86:88 (capab=0x411 status=0 aid=1) eth1: associated ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready eth1: no IPv6 routers present WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001b000c00) 0008001b0088ffff0010000000000000c055530000eaffff0010000000000000 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u ^ Pid: 1874, comm: dhcpcd Not tainted 2.6.39-rc4 #4 Acer,Inc. Aspire 1510 /Aspire 1510 RIP: 0010:[<ffffffff81263d9c>] [<ffffffff81263d9c>] memmove+0x2c/0x190 RSP: 0018:ffff88001c17f8a0 EFLAGS: 00010206 RAX: ffff88001b000c00 RBX: ffff88001b070400 RCX: 0000000000000020 RDX: 000000000000016f RSI: ffff88001b000c00 RDI: ffff88001b000c00 RBP: ffff88001c17f8f8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000001 FS: 00007fd7df758700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffff88001d23a4b8 CR3: 000000001ba46000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 [<ffffffff815c7584>] ieee80211_skb_resize+0x94/0x120 [<ffffffff815c97a4>] ieee80211_xmit+0xa4/0x280 [<ffffffff815caf86>] ieee80211_subif_start_xmit+0x3e6/0x910 [<ffffffff81486654>] dev_hard_start_xmit+0x384/0x6a0 [<ffffffff8149d344>] sch_direct_xmit+0xd4/0x260 [<ffffffff81486b4e>] dev_queue_xmit+0x1de/0x710 [<ffffffff8156af23>] packet_sendmsg+0xa73/0xbe0 [<ffffffff81471813>] sock_sendmsg+0xe3/0x110 [<ffffffff81472154>] sys_sendto+0x134/0x180 [<ffffffff815e74f8>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff Indeed I get the same also without the patch. I add my config file as attachment below.
Created attachment 54852 [details] .config file used to reproduce
On Wednesday 20 April 2011 at 21:36 +0200, Christian Casteyde wrote: > I'm not sure it's the same problem anyway as I've said previously. > I append my config file used to build this kernel. This is not the same problem; it is a known false positive from skb reallocation (skb_reserve() doesn't mark memory as initialized).
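For context, the pre-patch skb_reserve() (its body appears as context lines in the fix Eric posts a few messages below) only advances two pointers and never writes the reserved headroom bytes, so a later wholesale copy of the buffer during skb reallocation reads bytes that were never initialized, which is what kmemcheck reports here.

/* include/linux/skbuff.h before the fix quoted below: reserving headroom
 * just moves the pointers; the skipped bytes are never written, so they
 * remain "uninitialized" as far as kmemcheck is concerned until something
 * (here, the buffer copy during skb reallocation) reads them. */
static inline void skb_reserve(struct sk_buff *skb, int len)
{
	skb->data += len;
	skb->tail += len;
}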
Le mercredi 20 avril 2011 17:01:27, Eric Dumazet a écrit : > Le mercredi 20 avril 2011 à 09:42 -0500, Christoph Lameter a écrit : > > Avoiding the irq handling yields the savings that improve the fastpath. > > if you do both then there is only a regression left. So lets go with > > disabling the CMPXCHG_LOCAL. > > OK, let's do that then. > > Thanks > > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC > > Christian Casteyde reported a KMEMCHECK splat in slub code. > > Problem is now we are lockless and allow IRQ in slab_alloc(), the object > we manipulate from freelist can be allocated and freed right before we > try to read object->next. > > Same problem can happen with DEBUG_PAGEALLOC > > Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or > CONFIG_DEBUG_PAGEALLOC is defined. > > Reported-by: Christian Casteyde <casteyde.christian@free.fr> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Pekka Enberg <penberg@cs.helsinki.fi> > Cc: Vegard Nossum <vegardno@ifi.uio.no> > Cc: Christoph Lameter <cl@linux.com> > --- > mm/slub.c | 45 +++++++++++++++++++++++++-------------------- > 1 files changed, 25 insertions(+), 20 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 94d2a33..f31ab2c 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s, > struct page *page, int tail) } > } > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \ > + !defined(CONFIG_DEBUG_PAGEALLOC) > +#define SLUB_USE_CMPXCHG_DOUBLE > +#endif > + > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > #ifdef CONFIG_PREEMPT > /* > * Calculate the next globally unique transaction for disambiguiation > @@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char > *n, > > void init_kmem_cache_cpus(struct kmem_cache *s) > { > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > int cpu; > > for_each_possible_cpu(cpu) > @@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s, > struct kmem_cache_cpu *c) page->inuse--; > } > c->page = NULL; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > c->tid = next_tid(c->tid); > #endif > unfreeze_slab(s, page, tail); > @@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t > gfpflags, int node, { > void **object; > struct page *new; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > unsigned long flags; > > local_irq_save(flags); > @@ -1819,7 +1824,7 @@ load_freelist: > c->node = page_to_nid(c->page); > unlock_out: > slab_unlock(c->page); > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > c->tid = next_tid(c->tid); > local_irq_restore(flags); > #endif > @@ -1858,7 +1863,7 @@ new_slab: > } > if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) > slab_out_of_memory(s, gfpflags, node); > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > return NULL; > @@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct > kmem_cache *s, { > void **object; > struct kmem_cache_cpu *c; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > unsigned long tid; > #else > unsigned long flags; > @@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct > kmem_cache *s, if (slab_pre_alloc_hook(s, gfpflags)) > return NULL; > > -#ifndef CONFIG_CMPXCHG_LOCAL > +#ifndef SLUB_USE_CMPXCHG_DOUBLE > local_irq_save(flags); > #else > redo: > @@ 
-1910,7 +1915,7 @@ redo: > */ > c = __this_cpu_ptr(s->cpu_slab); > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > /* > * The transaction ids are globally unique per cpu and per operation on > * a per cpu queue. Thus they can be guarantee that the cmpxchg_double > @@ -1927,7 +1932,7 @@ redo: > object = __slab_alloc(s, gfpflags, node, addr, c); > > else { > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > /* > * The cmpxchg will only match if there was no additional > * operation and if we are on the right processor. > @@ -1954,7 +1959,7 @@ redo: > stat(s, ALLOC_FASTPATH); > } > > -#ifndef CONFIG_CMPXCHG_LOCAL > +#ifndef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > > @@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct > page *page, { > void *prior; > void **object = (void *)x; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > unsigned long flags; > > local_irq_save(flags); > @@ -2070,7 +2075,7 @@ checks_ok: > > out_unlock: > slab_unlock(page); > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > return; > @@ -2084,7 +2089,7 @@ slab_empty: > stat(s, FREE_REMOVE_PARTIAL); > } > slab_unlock(page); > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > stat(s, FREE_SLAB); > @@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct > kmem_cache *s, { > void **object = (void *)x; > struct kmem_cache_cpu *c; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > unsigned long tid; > #else > unsigned long flags; > @@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct > kmem_cache *s, > > slab_free_hook(s, x); > > -#ifndef CONFIG_CMPXCHG_LOCAL > +#ifndef SLUB_USE_CMPXCHG_DOUBLE > local_irq_save(flags); > > #else > @@ -2136,7 +2141,7 @@ redo: > */ > c = __this_cpu_ptr(s->cpu_slab); > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > tid = c->tid; > barrier(); > #endif > @@ -2144,7 +2149,7 @@ redo: > if (likely(page == c->page && c->node != NUMA_NO_NODE)) { > set_freepointer(s, object, c->freelist); > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > if (unlikely(!this_cpu_cmpxchg_double( > s->cpu_slab->freelist, s->cpu_slab->tid, > c->freelist, tid, > @@ -2160,7 +2165,7 @@ redo: > } else > __slab_free(s, page, x, addr); > > -#ifndef CONFIG_CMPXCHG_LOCAL > +#ifndef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > } > @@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct > kmem_cache *s) BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE < > SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu)); > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > /* > * Must align to double word boundary for the double cmpxchg > instructions > * to work. 
I applied this patch to 2.6.39-rc4 (sorry for the last test, it did in fact apply correctly), and I still get the following:

ADDRCONF(NETDEV_UP): eth1: link is not ready
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
ADDRCONF(NETDEV_UP): eth1: link is not ready
eth1: direct probe to 2e:94:8b:66:86:88 (try 1/3)
eth1: direct probe responded
eth1: authenticate with 2e:94:8b:66:86:88 (try 1)
eth1: authenticated
eth1: associate with 2e:94:8b:66:86:88 (try 1)
eth1: RX AssocResp from 2e:94:8b:66:86:88 (capab=0x411 status=0 aid=1)
eth1: associated
ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
eth1: no IPv6 routers present
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff88001b000c00)
0008001b0088ffff0010000000000000c055530000eaffff0010000000000000
 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
 ^
Pid: 1874, comm: dhcpcd Not tainted 2.6.39-rc4 #4 Acer,Inc. Aspire 1510/Aspire 1510
RIP: 0010:[<ffffffff81263d9c>]  [<ffffffff81263d9c>] memmove+0x2c/0x190
RSP: 0018:ffff88001c17f8a0  EFLAGS: 00010206
RAX: ffff88001b000c00 RBX: ffff88001b070400 RCX: 0000000000000020
RDX: 000000000000016f RSI: ffff88001b000c00 RDI: ffff88001b000c00
RBP: ffff88001c17f8f8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000001
FS:  00007fd7df758700(0000) GS:ffffffff81a1b000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff88001d23a4b8 CR3: 000000001ba46000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff815c7584>] ieee80211_skb_resize+0x94/0x120
 [<ffffffff815c97a4>] ieee80211_xmit+0xa4/0x280
 [<ffffffff815caf86>] ieee80211_subif_start_xmit+0x3e6/0x910
 [<ffffffff81486654>] dev_hard_start_xmit+0x384/0x6a0
 [<ffffffff8149d344>] sch_direct_xmit+0xd4/0x260
 [<ffffffff81486b4e>] dev_queue_xmit+0x1de/0x710
 [<ffffffff8156af23>] packet_sendmsg+0xa73/0xbe0
 [<ffffffff81471813>] sock_sendmsg+0xe3/0x110
 [<ffffffff81472154>] sys_sendto+0x134/0x180
 [<ffffffff815e74f8>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff

I'm not sure it's the same problem anyway, as I said previously. I'm attaching the config file used to build this kernel.
Le mercredi 20 avril 2011 à 21:55 +0200, Eric Dumazet a écrit :
> Le mercredi 20 avril 2011 à 21:36 +0200, Christian Casteyde a écrit :
>
> > I'm not sure it's the same problem anyway, as I said previously.
> > I'm attaching the config file used to build this kernel.
>
> This is not the same problem; it is a known false positive from skb
> reallocation (skb_reserve() doesn't mark memory as initialized).
>
> Please try the following patch.

It's a bit tricky, because the network stack has different functions to fill bytes in the skb and to change its pointers/offsets.

An alternative would be to change pskb_expand_head() so that it does not copy the zone between skb->head and skb->data (which might contain uninitialized data).

Thanks

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 79aafbb..3f2cba4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1252,6 +1252,7 @@ static inline int skb_tailroom(const struct sk_buff *skb)
  */
 static inline void skb_reserve(struct sk_buff *skb, int len)
 {
+	kmemcheck_mark_initialized(skb->data, len);
 	skb->data += len;
 	skb->tail += len;
 }
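For context, the pattern behind this false positive looks roughly like the sketch below. Only skb_reserve(), ieee80211_skb_resize() and memmove() come from the patch and the trace above; the helper name, sizes and call flow are made up for illustration.

#include <linux/skbuff.h>
#include <linux/string.h>

/* Hypothetical helper, illustration only: headroom is reserved but never
 * written before the frame is handed down the stack. */
static struct sk_buff *build_frame(const void *payload, unsigned int len,
				   unsigned int headroom)
{
	struct sk_buff *skb = alloc_skb(len + headroom, GFP_ATOMIC);

	if (!skb)
		return NULL;
	/* Moves skb->data/skb->tail forward; the bytes between skb->head
	 * and skb->data are left uninitialized. */
	skb_reserve(skb, headroom);
	memcpy(skb_put(skb, len), payload, len);
	return skb;
}

When such a frame later goes through ieee80211_skb_resize() and pskb_expand_head(), the whole area from skb->head to skb->tail, headroom included, gets copied, which is the "uninitialized read" kmemcheck reports. Marking the reserved bytes as initialized in skb_reserve(), as the patch above does, silences that report.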
I haven't managed to reproduce it with both patches applied on 2.6.39-rc4, except that I once hit another bug, which I reported in https://bugzilla.kernel.org/show_bug.cgi?id=33802.
Le mercredi 20 avril 2011 à 17:01 +0200, Eric Dumazet a écrit : > Le mercredi 20 avril 2011 à 09:42 -0500, Christoph Lameter a écrit : > > > Avoiding the irq handling yields the savings that improve the fastpath. if > > you do both then there is only a regression left. So lets go with > > disabling the CMPXCHG_LOCAL. > > OK, let's do that then. > > Thanks > > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC > > Christian Casteyde reported a KMEMCHECK splat in slub code. > > Problem is now we are lockless and allow IRQ in slab_alloc(), the object > we manipulate from freelist can be allocated and freed right before we > try to read object->next. > > Same problem can happen with DEBUG_PAGEALLOC > > Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or > CONFIG_DEBUG_PAGEALLOC is defined. > > Reported-by: Christian Casteyde <casteyde.christian@free.fr> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Pekka Enberg <penberg@cs.helsinki.fi> > Cc: Vegard Nossum <vegardno@ifi.uio.no> > Cc: Christoph Lameter <cl@linux.com> > --- > mm/slub.c | 45 +++++++++++++++++++++++++-------------------- > 1 files changed, 25 insertions(+), 20 deletions(-) > > diff --git a/mm/slub.c b/mm/slub.c > index 94d2a33..f31ab2c 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s, struct > page *page, int tail) > } > } > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#if defined(CONFIG_CMPXCHG_LOCAL) && !defined(CONFIG_KMEMCHECK) && \ > + !defined(CONFIG_DEBUG_PAGEALLOC) > +#define SLUB_USE_CMPXCHG_DOUBLE > +#endif > + > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > #ifdef CONFIG_PREEMPT > /* > * Calculate the next globally unique transaction for disambiguiation > @@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char *n, > > void init_kmem_cache_cpus(struct kmem_cache *s) > { > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > int cpu; > > for_each_possible_cpu(cpu) > @@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s, > struct kmem_cache_cpu *c) > page->inuse--; > } > c->page = NULL; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > c->tid = next_tid(c->tid); > #endif > unfreeze_slab(s, page, tail); > @@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t > gfpflags, int node, > { > void **object; > struct page *new; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > unsigned long flags; > > local_irq_save(flags); > @@ -1819,7 +1824,7 @@ load_freelist: > c->node = page_to_nid(c->page); > unlock_out: > slab_unlock(c->page); > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > c->tid = next_tid(c->tid); > local_irq_restore(flags); > #endif > @@ -1858,7 +1863,7 @@ new_slab: > } > if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) > slab_out_of_memory(s, gfpflags, node); > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > return NULL; > @@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct > kmem_cache *s, > { > void **object; > struct kmem_cache_cpu *c; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > unsigned long tid; > #else > unsigned long flags; > @@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct > kmem_cache *s, > if (slab_pre_alloc_hook(s, gfpflags)) > return NULL; > > -#ifndef CONFIG_CMPXCHG_LOCAL > +#ifndef SLUB_USE_CMPXCHG_DOUBLE > local_irq_save(flags); > #else 
> redo: > @@ -1910,7 +1915,7 @@ redo: > */ > c = __this_cpu_ptr(s->cpu_slab); > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > /* > * The transaction ids are globally unique per cpu and per operation on > * a per cpu queue. Thus they can be guarantee that the cmpxchg_double > @@ -1927,7 +1932,7 @@ redo: > object = __slab_alloc(s, gfpflags, node, addr, c); > > else { > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > /* > * The cmpxchg will only match if there was no additional > * operation and if we are on the right processor. > @@ -1954,7 +1959,7 @@ redo: > stat(s, ALLOC_FASTPATH); > } > > -#ifndef CONFIG_CMPXCHG_LOCAL > +#ifndef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > > @@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct > page *page, > { > void *prior; > void **object = (void *)x; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > unsigned long flags; > > local_irq_save(flags); > @@ -2070,7 +2075,7 @@ checks_ok: > > out_unlock: > slab_unlock(page); > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > return; > @@ -2084,7 +2089,7 @@ slab_empty: > stat(s, FREE_REMOVE_PARTIAL); > } > slab_unlock(page); > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > stat(s, FREE_SLAB); > @@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct kmem_cache > *s, > { > void **object = (void *)x; > struct kmem_cache_cpu *c; > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > unsigned long tid; > #else > unsigned long flags; > @@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct kmem_cache > *s, > > slab_free_hook(s, x); > > -#ifndef CONFIG_CMPXCHG_LOCAL > +#ifndef SLUB_USE_CMPXCHG_DOUBLE > local_irq_save(flags); > > #else > @@ -2136,7 +2141,7 @@ redo: > */ > c = __this_cpu_ptr(s->cpu_slab); > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > tid = c->tid; > barrier(); > #endif > @@ -2144,7 +2149,7 @@ redo: > if (likely(page == c->page && c->node != NUMA_NO_NODE)) { > set_freepointer(s, object, c->freelist); > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > if (unlikely(!this_cpu_cmpxchg_double( > s->cpu_slab->freelist, s->cpu_slab->tid, > c->freelist, tid, > @@ -2160,7 +2165,7 @@ redo: > } else > __slab_free(s, page, x, addr); > > -#ifndef CONFIG_CMPXCHG_LOCAL > +#ifndef SLUB_USE_CMPXCHG_DOUBLE > local_irq_restore(flags); > #endif > } > @@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct > kmem_cache *s) > BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE < > SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu)); > > -#ifdef CONFIG_CMPXCHG_LOCAL > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > /* > * Must align to double word boundary for the double cmpxchg > instructions > * to work. > Is anybody taking the patch and sending it upstream ? Thanks
On 5/5/11 9:18 AM, Eric Dumazet wrote: > Le mercredi 20 avril 2011 à 17:01 +0200, Eric Dumazet a écrit : >> Le mercredi 20 avril 2011 à 09:42 -0500, Christoph Lameter a écrit : >> >>> Avoiding the irq handling yields the savings that improve the fastpath. if >>> you do both then there is only a regression left. So lets go with >>> disabling the CMPXCHG_LOCAL. >> >> OK, let's do that then. >> >> Thanks >> >> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC >> >> Christian Casteyde reported a KMEMCHECK splat in slub code. >> >> Problem is now we are lockless and allow IRQ in slab_alloc(), the object >> we manipulate from freelist can be allocated and freed right before we >> try to read object->next. >> >> Same problem can happen with DEBUG_PAGEALLOC >> >> Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or >> CONFIG_DEBUG_PAGEALLOC is defined. >> >> Reported-by: Christian Casteyde<casteyde.christian@free.fr> >> Signed-off-by: Eric Dumazet<eric.dumazet@gmail.com> >> Cc: Andrew Morton<akpm@linux-foundation.org> >> Cc: Pekka Enberg<penberg@cs.helsinki.fi> >> Cc: Vegard Nossum<vegardno@ifi.uio.no> >> Cc: Christoph Lameter<cl@linux.com> >> --- >> mm/slub.c | 45 +++++++++++++++++++++++++-------------------- >> 1 files changed, 25 insertions(+), 20 deletions(-) >> >> diff --git a/mm/slub.c b/mm/slub.c >> index 94d2a33..f31ab2c 100644 >> --- a/mm/slub.c >> +++ b/mm/slub.c >> @@ -1540,7 +1540,12 @@ static void unfreeze_slab(struct kmem_cache *s, >> struct page *page, int tail) >> } >> } >> >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#if defined(CONFIG_CMPXCHG_LOCAL)&& !defined(CONFIG_KMEMCHECK)&& \ >> + !defined(CONFIG_DEBUG_PAGEALLOC) >> +#define SLUB_USE_CMPXCHG_DOUBLE >> +#endif >> + >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> #ifdef CONFIG_PREEMPT >> /* >> * Calculate the next globally unique transaction for disambiguiation >> @@ -1604,7 +1609,7 @@ static inline void note_cmpxchg_failure(const char *n, >> >> void init_kmem_cache_cpus(struct kmem_cache *s) >> { >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> int cpu; >> >> for_each_possible_cpu(cpu) >> @@ -1643,7 +1648,7 @@ static void deactivate_slab(struct kmem_cache *s, >> struct kmem_cache_cpu *c) >> page->inuse--; >> } >> c->page = NULL; >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> c->tid = next_tid(c->tid); >> #endif >> unfreeze_slab(s, page, tail); >> @@ -1780,7 +1785,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t >> gfpflags, int node, >> { >> void **object; >> struct page *new; >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> unsigned long flags; >> >> local_irq_save(flags); >> @@ -1819,7 +1824,7 @@ load_freelist: >> c->node = page_to_nid(c->page); >> unlock_out: >> slab_unlock(c->page); >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> c->tid = next_tid(c->tid); >> local_irq_restore(flags); >> #endif >> @@ -1858,7 +1863,7 @@ new_slab: >> } >> if (!(gfpflags& __GFP_NOWARN)&& printk_ratelimit()) >> slab_out_of_memory(s, gfpflags, node); >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> local_irq_restore(flags); >> #endif >> return NULL; >> @@ -1887,7 +1892,7 @@ static __always_inline void *slab_alloc(struct >> kmem_cache *s, >> { >> void **object; >> struct kmem_cache_cpu *c; >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> unsigned long tid; >> #else >> unsigned long flags; >> @@ -1896,7 +1901,7 @@ static __always_inline void *slab_alloc(struct >> kmem_cache *s, >> if 
(slab_pre_alloc_hook(s, gfpflags)) >> return NULL; >> >> -#ifndef CONFIG_CMPXCHG_LOCAL >> +#ifndef SLUB_USE_CMPXCHG_DOUBLE >> local_irq_save(flags); >> #else >> redo: >> @@ -1910,7 +1915,7 @@ redo: >> */ >> c = __this_cpu_ptr(s->cpu_slab); >> >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> /* >> * The transaction ids are globally unique per cpu and per operation on >> * a per cpu queue. Thus they can be guarantee that the cmpxchg_double >> @@ -1927,7 +1932,7 @@ redo: >> object = __slab_alloc(s, gfpflags, node, addr, c); >> >> else { >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> /* >> * The cmpxchg will only match if there was no additional >> * operation and if we are on the right processor. >> @@ -1954,7 +1959,7 @@ redo: >> stat(s, ALLOC_FASTPATH); >> } >> >> -#ifndef CONFIG_CMPXCHG_LOCAL >> +#ifndef SLUB_USE_CMPXCHG_DOUBLE >> local_irq_restore(flags); >> #endif >> >> @@ -2034,7 +2039,7 @@ static void __slab_free(struct kmem_cache *s, struct >> page *page, >> { >> void *prior; >> void **object = (void *)x; >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> unsigned long flags; >> >> local_irq_save(flags); >> @@ -2070,7 +2075,7 @@ checks_ok: >> >> out_unlock: >> slab_unlock(page); >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> local_irq_restore(flags); >> #endif >> return; >> @@ -2084,7 +2089,7 @@ slab_empty: >> stat(s, FREE_REMOVE_PARTIAL); >> } >> slab_unlock(page); >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> local_irq_restore(flags); >> #endif >> stat(s, FREE_SLAB); >> @@ -2113,7 +2118,7 @@ static __always_inline void slab_free(struct >> kmem_cache *s, >> { >> void **object = (void *)x; >> struct kmem_cache_cpu *c; >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> unsigned long tid; >> #else >> unsigned long flags; >> @@ -2121,7 +2126,7 @@ static __always_inline void slab_free(struct >> kmem_cache *s, >> >> slab_free_hook(s, x); >> >> -#ifndef CONFIG_CMPXCHG_LOCAL >> +#ifndef SLUB_USE_CMPXCHG_DOUBLE >> local_irq_save(flags); >> >> #else >> @@ -2136,7 +2141,7 @@ redo: >> */ >> c = __this_cpu_ptr(s->cpu_slab); >> >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> tid = c->tid; >> barrier(); >> #endif >> @@ -2144,7 +2149,7 @@ redo: >> if (likely(page == c->page&& c->node != NUMA_NO_NODE)) { >> set_freepointer(s, object, c->freelist); >> >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> if (unlikely(!this_cpu_cmpxchg_double( >> s->cpu_slab->freelist, s->cpu_slab->tid, >> c->freelist, tid, >> @@ -2160,7 +2165,7 @@ redo: >> } else >> __slab_free(s, page, x, addr); >> >> -#ifndef CONFIG_CMPXCHG_LOCAL >> +#ifndef SLUB_USE_CMPXCHG_DOUBLE >> local_irq_restore(flags); >> #endif >> } >> @@ -2354,7 +2359,7 @@ static inline int alloc_kmem_cache_cpus(struct >> kmem_cache *s) >> BUILD_BUG_ON(PERCPU_DYNAMIC_EARLY_SIZE< >> SLUB_PAGE_SHIFT * sizeof(struct kmem_cache_cpu)); >> >> -#ifdef CONFIG_CMPXCHG_LOCAL >> +#ifdef SLUB_USE_CMPXCHG_DOUBLE >> /* >> * Must align to double word boundary for the double cmpxchg >> instructions >> * to work. >> > > > Is anybody taking the patch and sending it upstream ? Oh, sorry, it got lost in my inbox. I can take care of it. Pekka
Le jeudi 05 mai 2011 à 09:22 +0300, Pekka Enberg a écrit : > Oh, sorry, it got lost in my inbox. I can take care of it. > Thanks Pekka !
Reply-To: cl@linux.com On Thu, 5 May 2011, Pekka Enberg wrote: > > > -#ifdef CONFIG_CMPXCHG_LOCAL > > > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > > > /* > > > * Must align to double word boundary for the double cmpxchg > > > instructions > > > * to work. > > > > > > > > > Is anybody taking the patch and sending it upstream ? > > Oh, sorry, it got lost in my inbox. I can take care of it. Combining this one with the patch I sent to remove the #ifdeffery would make this much simpler.
Le jeudi 05 mai 2011 à 13:40 -0500, Christoph Lameter a écrit : > On Thu, 5 May 2011, Pekka Enberg wrote: > > > > > -#ifdef CONFIG_CMPXCHG_LOCAL > > > > +#ifdef SLUB_USE_CMPXCHG_DOUBLE > > > > /* > > > > * Must align to double word boundary for the double cmpxchg > > > > instructions > > > > * to work. > > > > > > > > > > > > > Is anybody taking the patch and sending it upstream ? > > > > Oh, sorry, it got lost in my inbox. I can take care of it. > > Combining this one with the patch I sent to remove the #ifdeffery would > make this much simpler. I must missed it, could you please resend it ? Thanks
Reply-To: cl@linux.com On Thu, 5 May 2011, Eric Dumazet wrote: > > Combining this one with the patch I sent to remove the #ifdeffery would > > make this much simpler. > > I must missed it, could you please resend it ? Subject: slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used everywhere. There may be performance implications since: A. We now have to manage a transaction ID for all arches B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced to a very short irqoff section. There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback case we only have to do one disable and enable like before. Signed-off-by: Christoph Lameter <cl@linux.com> --- include/linux/slub_def.h | 2 - mm/slub.c | 56 ----------------------------------------------- 2 files changed, 58 deletions(-) Index: linux-2.6/include/linux/slub_def.h =================================================================== --- linux-2.6.orig/include/linux/slub_def.h 2011-05-04 09:33:08.000000000 -0500 +++ linux-2.6/include/linux/slub_def.h 2011-05-04 09:42:05.000000000 -0500 @@ -37,9 +37,7 @@ enum stat_item { struct kmem_cache_cpu { void **freelist; /* Pointer to next available object */ -#ifdef CONFIG_CMPXCHG_LOCAL unsigned long tid; /* Globally unique transaction id */ -#endif struct page *page; /* The slab from which we are allocating */ int node; /* The node of the page (or -1 for debug) */ #ifdef CONFIG_SLUB_STATS Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-05-04 09:41:59.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-05-04 09:48:11.000000000 -0500 @@ -1540,7 +1540,6 @@ static void unfreeze_slab(struct kmem_ca } } -#ifdef CONFIG_CMPXCHG_LOCAL #ifdef CONFIG_PREEMPT /* * Calculate the next globally unique transaction for disambiguiation @@ -1600,17 +1599,12 @@ static inline void note_cmpxchg_failure( stat(s, CMPXCHG_DOUBLE_CPU_FAIL); } -#endif - void init_kmem_cache_cpus(struct kmem_cache *s) { -#ifdef CONFIG_CMPXCHG_LOCAL int cpu; for_each_possible_cpu(cpu) per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu); -#endif - } /* * Remove the cpu slab @@ -1643,9 +1637,7 @@ static void deactivate_slab(struct kmem_ page->inuse--; } c->page = NULL; -#ifdef CONFIG_CMPXCHG_LOCAL c->tid = next_tid(c->tid); -#endif unfreeze_slab(s, page, tail); } @@ -1780,7 +1772,6 @@ static void *__slab_alloc(struct kmem_ca { void **object; struct page *new; -#ifdef CONFIG_CMPXCHG_LOCAL unsigned long flags; local_irq_save(flags); @@ -1792,7 +1783,6 @@ static void *__slab_alloc(struct kmem_ca */ c = this_cpu_ptr(s->cpu_slab); #endif -#endif /* We handle __GFP_ZERO in the caller */ gfpflags &= ~__GFP_ZERO; @@ -1819,10 +1809,8 @@ load_freelist: c->node = page_to_nid(c->page); unlock_out: slab_unlock(c->page); -#ifdef CONFIG_CMPXCHG_LOCAL c->tid = next_tid(c->tid); local_irq_restore(flags); -#endif stat(s, ALLOC_SLOWPATH); return object; @@ -1858,9 +1846,7 @@ new_slab: } if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit()) slab_out_of_memory(s, gfpflags, node); -#ifdef CONFIG_CMPXCHG_LOCAL local_irq_restore(flags); -#endif return NULL; debug: if (!alloc_debug_processing(s, c->page, object, addr)) @@ -1887,20 +1873,12 @@ static __always_inline void *slab_alloc( { void **object; struct kmem_cache_cpu *c; -#ifdef CONFIG_CMPXCHG_LOCAL unsigned long tid; -#else - unsigned long flags; -#endif if (slab_pre_alloc_hook(s, gfpflags)) return NULL; -#ifndef 
CONFIG_CMPXCHG_LOCAL - local_irq_save(flags); -#else redo: -#endif /* * Must read kmem_cache cpu data via this cpu ptr. Preemption is @@ -1910,7 +1888,6 @@ redo: */ c = __this_cpu_ptr(s->cpu_slab); -#ifdef CONFIG_CMPXCHG_LOCAL /* * The transaction ids are globally unique per cpu and per operation on * a per cpu queue. Thus they can be guarantee that the cmpxchg_double @@ -1919,7 +1896,6 @@ redo: */ tid = c->tid; barrier(); -#endif object = c->freelist; if (unlikely(!object || !node_match(c, node))) @@ -1927,7 +1903,6 @@ redo: object = __slab_alloc(s, gfpflags, node, addr, c); else { -#ifdef CONFIG_CMPXCHG_LOCAL /* * The cmpxchg will only match if there was no additional * operation and if we are on the right processor. @@ -1948,16 +1923,9 @@ redo: note_cmpxchg_failure("slab_alloc", s, tid); goto redo; } -#else - c->freelist = get_freepointer(s, object); -#endif stat(s, ALLOC_FASTPATH); } -#ifndef CONFIG_CMPXCHG_LOCAL - local_irq_restore(flags); -#endif - if (unlikely(gfpflags & __GFP_ZERO) && object) memset(object, 0, s->objsize); @@ -2034,11 +2002,9 @@ static void __slab_free(struct kmem_cach { void *prior; void **object = (void *)x; -#ifdef CONFIG_CMPXCHG_LOCAL unsigned long flags; local_irq_save(flags); -#endif slab_lock(page); stat(s, FREE_SLOWPATH); @@ -2070,9 +2036,7 @@ checks_ok: out_unlock: slab_unlock(page); -#ifdef CONFIG_CMPXCHG_LOCAL local_irq_restore(flags); -#endif return; slab_empty: @@ -2084,9 +2048,7 @@ slab_empty: stat(s, FREE_REMOVE_PARTIAL); } slab_unlock(page); -#ifdef CONFIG_CMPXCHG_LOCAL local_irq_restore(flags); -#endif stat(s, FREE_SLAB); discard_slab(s, page); return; @@ -2113,20 +2075,11 @@ static __always_inline void slab_free(st { void **object = (void *)x; struct kmem_cache_cpu *c; -#ifdef CONFIG_CMPXCHG_LOCAL unsigned long tid; -#else - unsigned long flags; -#endif slab_free_hook(s, x); -#ifndef CONFIG_CMPXCHG_LOCAL - local_irq_save(flags); - -#else redo: -#endif /* * Determine the currently cpus per cpu slab. @@ -2136,15 +2089,12 @@ redo: */ c = __this_cpu_ptr(s->cpu_slab); -#ifdef CONFIG_CMPXCHG_LOCAL tid = c->tid; barrier(); -#endif if (likely(page == c->page && c->node != NUMA_NO_NODE)) { set_freepointer(s, object, c->freelist); -#ifdef CONFIG_CMPXCHG_LOCAL if (unlikely(!irqsafe_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, @@ -2153,16 +2103,10 @@ redo: note_cmpxchg_failure("slab_free", s, tid); goto redo; } -#else - c->freelist = object; -#endif stat(s, FREE_FASTPATH); } else __slab_free(s, page, x, addr); -#ifndef CONFIG_CMPXCHG_LOCAL - local_irq_restore(flags); -#endif } void kmem_cache_free(struct kmem_cache *s, void *x)
Reply-To: penberg@kernel.org On Wed, 20 Apr 2011, Eric Dumazet wrote: > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC > > Christian Casteyde reported a KMEMCHECK splat in slub code. > > Problem is now we are lockless and allow IRQ in slab_alloc(), the object > we manipulate from freelist can be allocated and freed right before we > try to read object->next. > > Same problem can happen with DEBUG_PAGEALLOC > > Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or > CONFIG_DEBUG_PAGEALLOC is defined. Christoph, Eric, is this still relevant after commit 1759415 ("slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery") in slab/next of slab.git? Pekka
Reply-To: penberg@kernel.org On Mon, 9 May 2011, Christoph Lameter wrote: > There is still an issue and now you can no longer fix the thing through > CONFIG_CMPXCHG_LOCAL. > > It needs to be legal for slub to deref the counter even if the object has > been freed. We can use kmemcheck_mark_initialized() for that.
Reply-To: cl@linux.com On Mon, 9 May 2011, Pekka Enberg wrote: > On Wed, 20 Apr 2011, Eric Dumazet wrote: > > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC > > > > Christian Casteyde reported a KMEMCHECK splat in slub code. > > > > Problem is now we are lockless and allow IRQ in slab_alloc(), the object > > we manipulate from freelist can be allocated and freed right before we > > try to read object->next. > > > > Same problem can happen with DEBUG_PAGEALLOC > > > > Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or > > CONFIG_DEBUG_PAGEALLOC is defined. > > Christoph, Eric, is this still relevant after commit 1759415 ("slub: Remove > CONFIG_CMPXCHG_LOCAL ifdeffery") in slab/next of slab.git? There is still an issue and now you can no longer fix the thing through CONFIG_CMPXCHG_LOCAL. It needs to be legal for slub to deref the counter even if the object has been freed.
Le lundi 09 mai 2011 à 15:04 -0500, Christoph Lameter a écrit :
> On Mon, 9 May 2011, Pekka Enberg wrote:
>
> > On Wed, 20 Apr 2011, Eric Dumazet wrote:
> > > [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC
> > >
> > > Christian Casteyde reported a KMEMCHECK splat in slub code.
> > >
> > > Problem is now we are lockless and allow IRQ in slab_alloc(), the object
> > > we manipulate from freelist can be allocated and freed right before we
> > > try to read object->next.
> > >
> > > Same problem can happen with DEBUG_PAGEALLOC
> > >
> > > Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or
> > > CONFIG_DEBUG_PAGEALLOC is defined.
> >
> > Christoph, Eric, is this still relevant after commit 1759415 ("slub: Remove
> > CONFIG_CMPXCHG_LOCAL ifdeffery") in slab/next of slab.git?
>
> There is still an issue and now you can no longer fix the thing through
> CONFIG_CMPXCHG_LOCAL.
>
> It needs to be legal for slub to deref the counter even if the object has
> been freed.

I am trying to follow things, but honestly I am lost.

Isn't commit 1759415e63 planned for 2.6.40?
(ref: http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=1759415e630e5db0dd2390df9f94892cbfb9a8a2 )

How shall we fix things for 2.6.39? I thought my patch was OK for that.

It's a bit hard to work with you on this stuff; for a report I made ages ago, I find it incredible it's not yet fixed in linux-2.6.

Christoph, is your plan to make SLUB incompatible with CONFIG_DEBUG_PAGEALLOC? Luckily we still have SLAB ;)

I am a bit surprised by commit 1759415e63. It's obviously wrong for DEBUG_PAGEALLOC. Surely KMEMCHECK could be handled differently (since we are !SMP in this case)?
On 5/10/11 11:43 AM, Eric Dumazet wrote: > I am trying to follow things but honestly I am lost. > Isnt commit 1759415e63 planned for 2.6.40 ? > ( ref : > > http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=1759415e630e5db0dd2390df9f94892cbfb9a8a2 > ) Yes, it's for 2.6.40. > How shall we fix things for 2.6.39 ? I thought my patch was OK for that. > It's so late in the release cycle that I think the best option is to fix it in 2.6.40 and backport it to -stable together with the above commit. > Its a bit hard to work with you on this stuff, for a report I made ages > ago, I find it incredible its not yet fixed in linux-2.6. It's not incredible, I simply managed to miss your patch. Sorry about that. Pekka
Le mardi 10 mai 2011 à 12:47 +0300, Pekka Enberg a écrit : > On 5/10/11 11:43 AM, Eric Dumazet wrote: > > I am trying to follow things but honestly I am lost. > > Isnt commit 1759415e63 planned for 2.6.40 ? > > ( ref : > > > http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=1759415e630e5db0dd2390df9f94892cbfb9a8a2 > ) > > Yes, it's for 2.6.40. > > > How shall we fix things for 2.6.39 ? I thought my patch was OK for that. > > > > It's so late in the release cycle that I think the best option is to fix > it in 2.6.40 and backport it to -stable together with the above commit. > > > Its a bit hard to work with you on this stuff, for a report I made ages > > ago, I find it incredible its not yet fixed in linux-2.6. > > It's not incredible, I simply managed to miss your patch. Sorry about that. > Pekka, my word was probably too strong, sorry for that. What I meant is I dont understand how Christoph expect to solve this problem if irqsafe_cpu_cmpxchg_double() is used everywhere.
On 5/10/11 11:43 AM, Eric Dumazet wrote: > Le lundi 09 mai 2011 à 15:04 -0500, Christoph Lameter a écrit : >> On Mon, 9 May 2011, Pekka Enberg wrote: >> >>> On Wed, 20 Apr 2011, Eric Dumazet wrote: >>>> [PATCH v4] slub: dont use cmpxchg_double if KMEMCHECK or DEBUG_PAGEALLOC >>>> >>>> Christian Casteyde reported a KMEMCHECK splat in slub code. >>>> >>>> Problem is now we are lockless and allow IRQ in slab_alloc(), the object >>>> we manipulate from freelist can be allocated and freed right before we >>>> try to read object->next. >>>> >>>> Same problem can happen with DEBUG_PAGEALLOC >>>> >>>> Just dont use cmpxchg_double() if either CONFIG_KMEMCHECK or >>>> CONFIG_DEBUG_PAGEALLOC is defined. >>> Christoph, Eric, is this still relevant after commit 1759415 ("slub: Remove >>> CONFIG_CMPXCHG_LOCAL ifdeffery") in slab/next of slab.git? >> There is still an issue and now you can no longer fix the thing through >> CONFIG_CMPXCHG_LOCAL. >> >> It needs to be legal for slub to deref the counter even if the object has >> been freed. >> > I am trying to follow things but honestly I am lost. > > Isnt commit 1759415e63 planned for 2.6.40 ? > ( ref : > > http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=1759415e630e5db0dd2390df9f94892cbfb9a8a2 > ) > > How shall we fix things for 2.6.39 ? I thought my patch was OK for that. > > > Its a bit hard to work with you on this stuff, for a report I made ages > ago, I find it incredible its not yet fixed in linux-2.6. > Can't we fix the issue by putting kmemcheck_mark_initialized() to set_freepointer()?
Hi Eric, On 5/10/11 1:03 PM, Eric Dumazet wrote: > What I meant is I dont understand how Christoph expect to solve this > problem if irqsafe_cpu_cmpxchg_double() is used everywhere. Do you see any problem with annotating the freelist pointers with kmemcheck_mark_initialized() in set_freepointer()? The #ifdef removal was requested by Linus et al when we fixed a bug that was causing problems during boot on certain CPUs. So even though it invalidates your fix, it's the way to go forward. Pekka
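For what it's worth, the annotation being proposed would presumably look something like the sketch below. The set_freepointer() body is taken from the patches quoted earlier in this thread; the placement of the kmemcheck_mark_initialized() call is the assumption under discussion, not an existing fix.

#include <linux/kmemcheck.h>

static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
	*(void **)(object + s->offset) = fp;
	/* Tell kmemcheck the freelist-pointer slot now holds defined data,
	 * so the speculative read in the lockless fastpath is not reported. */
	kmemcheck_mark_initialized(object + s->offset, sizeof(void *));
}

As the next message points out, this would only silence the kmemcheck report; it does nothing for the DEBUG_PAGEALLOC case, where the page itself may have been unmapped.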
Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit : > Can't we fix the issue by putting kmemcheck_mark_initialized() to > set_freepointer()? This would solve kmemcheck problem, not DEBUG_PAGEALLOC
On 5/10/11 1:17 PM, Eric Dumazet wrote: > Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit : > >> Can't we fix the issue by putting kmemcheck_mark_initialized() to >> set_freepointer()? > > This would solve kmemcheck problem, not DEBUG_PAGEALLOC Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB.
Le mardi 10 mai 2011 à 13:19 +0300, Pekka Enberg a écrit : > On 5/10/11 1:17 PM, Eric Dumazet wrote: > > Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit : > > > >> Can't we fix the issue by putting kmemcheck_mark_initialized() to > >> set_freepointer()? > > > > This would solve kmemcheck problem, not DEBUG_PAGEALLOC > > Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB. Maybe have a special version of "irqsafe_cpu_cmpxchg_double" and make it use something like : local_irq_save(flags); if (probe_kernel_address(...)) { res = false; } else { res = unsafe_cpu_cmpxchg_double(...) } local_irq_restore(flags); Just the idea, I dont have time to cook a patch.
Reply-To: vegardno@ifi.uio.no On Tue, 10 May 2011 13:19:35 +0300, Pekka Enberg <penberg@cs.helsinki.fi> wrote: > On 5/10/11 1:17 PM, Eric Dumazet wrote: >> Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit : >> >>> Can't we fix the issue by putting kmemcheck_mark_initialized() to >>> set_freepointer()? >> >> This would solve kmemcheck problem, not DEBUG_PAGEALLOC > > Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB. Hi, Disclaimer: I don't know how the algorithm is supposed to work, so please excuse me if this makes no sense at all. But here goes: Presumably the problem is that the page can get freed, and that with DEBUG_PAGEALLOC, the page will therefore not be present and subsequently trigger a page fault when doing this cmpxchg() on the possibly freed object. Regardless of DEBUG_PAGEALLOC or kmemcheck, what happens if the page gets freed, then allocated again for a completely different purpose in another part of the kernel, and new user of the page by chance writes the same "tid" number that the cmpxchg() is expecting? I guess I'm asking: Are we relying on the fact that this has a negligible probability of happening? Or did I just misunderstand what the algorithm does, and it is in fact provable (in theory) that nothing wrong can happen? Thanks, Vegard
Reply-To: cl@linux.com On Tue, 10 May 2011, Pekka Enberg wrote: > On 5/10/11 1:17 PM, Eric Dumazet wrote: > > Le mardi 10 mai 2011 à 13:03 +0300, Pekka Enberg a écrit : > > > > > Can't we fix the issue by putting kmemcheck_mark_initialized() to > > > set_freepointer()? > > > > This would solve kmemcheck problem, not DEBUG_PAGEALLOC > > Oh, right. Christoph? We need to support DEBUG_PAGEALLOC with SLUB. Well back to the #ifdef then? Or has DEBUG_PAGEALLOC some override mechanism that we can do a speculative memory access?
Reply-To: cl@linux.com On Tue, 10 May 2011, Vegard Nossum wrote: > Presumably the problem is that the page can get freed, and that with > DEBUG_PAGEALLOC, the page will therefore not be present and subsequently > trigger a page fault when doing this cmpxchg() on the possibly freed object. The problem is not the cmpxchg. The cmpxchg is occurring on the per cpu structure for the slab and that remains even if the page is freed. The problem is the speculative fetch of the address of the following object from a pointer into the page. The cmpxchg will fail in that case because the TID was incremented and the result of the address fetch will be discarded. > Regardless of DEBUG_PAGEALLOC or kmemcheck, what happens if the page gets > freed, then allocated again for a completely different purpose in another > part > of the kernel, and new user of the page by chance writes the same "tid" > number > that the cmpxchg() is expecting? The tid is not stored in the page struct but in a per cpu structure. Doing an explicit check if this is an illegal address in the PAGE_ALLOC case and redoing the loop will address the issue. So #ifdef CONFIG_DEBUG_PAGE_ALLOC if (illegal_page_alloc-address(object)) goto redo; #endif before the cmpxchg should do the trick.
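For reference, the per-cpu transaction id machinery being referred to, assuming the 2.6.39-era definitions in mm/slub.c (init_tid() and next_tid() also appear in the patches quoted earlier), is roughly:

/*
 * Condensed from mm/slub.c of this era; shown only to illustrate why a stale
 * tid makes the cmpxchg_double fail and the fastpath retry.
 */
#ifdef CONFIG_PREEMPT
/* tids start at the cpu number and step by a power of two >= NR_CPUS,
 * so two cpus can never generate the same tid. */
#define TID_STEP  roundup_pow_of_two(CONFIG_NR_CPUS)
#else
#define TID_STEP  1
#endif

static inline unsigned long next_tid(unsigned long tid)
{
	return tid + TID_STEP;
}

static inline unsigned int init_tid(int cpu)
{
	return cpu;
}

Since the tid lives in the per-cpu structure rather than in the page, another user scribbling over a freed page cannot forge a matching tid; only the owning cpu ever advances it.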
Le mardi 10 mai 2011 à 11:39 -0500, Christoph Lameter a écrit :
> #ifdef CONFIG_DEBUG_PAGE_ALLOC
> 	if (illegal_page_alloc-address(object))
> 		goto redo;
> #endif
>
> before the cmpxchg should do the trick.

Again, it won't work...

You can have an IRQ right after the check and before the cmpxchg.

That interrupt can allocate this block of memory, free it, and unmap the page from memory.

cmpxchg() reads unmapped memory -> BUG
Reply-To: cl@linux.com On Tue, 10 May 2011, Eric Dumazet wrote: > Le mardi 10 mai 2011 à 11:39 -0500, Christoph Lameter a écrit : > > > #ifdef CONFIG_DEBUG_PAGE_ALLOC > > if (illegal_page_alloc-address(object)) > > goto redo; > > #endif > > > > before the cmpxchg should do the trick. > > > > Again, it wont work... > > You can have an IRQ right after the check and before cmpxchg Ok guess then we also need to disable irq if CONFIG_PAGE_ALLOC is set? The cmpxchg is not the problem. The problem is the following expression which retrieves the pointer to the next available object from the object on the page: get_freepointer(s, object) In the CONFIG_PAGE_ALLOC case we could disable interrupts, then do the check, then fetch the pointer and then reenable interrupts. All of this can occur before the cmpxchg. > This interrupt can allocate this block of memory, free it, and unmap > page from memory. > > cmpxchg() reads unmapped memory -> BUG The cmpxchg is not accessing any memory on the page.
Reply-To: cl@linux.com Draft for a patch Subject: slub: Make CONFIG_PAGE_ALLOC work with new fastpath Fastpath can do a speculative access to a page that CONFIG_PAGE_ALLOC may have marked as invalid to retrieve the pointer to the next free object. Probe that address before dereferencing the pointer to the page. All of that needs to occur with interrupts disabled since an interrupt could cause the page status to change (as pointed out by Eric). Signed-off-by: Christoph Lameter <cl@linux.com> --- mm/slub.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-05-10 12:35:30.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-05-10 12:38:53.000000000 -0500 @@ -261,6 +261,27 @@ static inline void *get_freepointer(stru return *(void **)(object + s->offset); } +static inline void *get_freepointer_safe(struct kmem_cache *s, void *object) +{ + void *p; + +#ifdef CONFIG_PAGE_ALLOC + unsigned long flags; + + local_irq_save(flags); + + if (probe_kernel_address(object)) + p = NULL; /* Invalid */ + else + p = get_freepointer(s, object); + + local_irq_restore(flags); +#else + p = get_freepointer(s, object); +#endif + return p; +} + static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) { *(void **)(object + s->offset) = fp; @@ -1933,7 +1954,7 @@ redo: if (unlikely(!irqsafe_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, object, tid, - get_freepointer(s, object), next_tid(tid)))) { + get_freepointer_safe(s, object), next_tid(tid)))) { note_cmpxchg_failure("slab_alloc", s, tid); goto redo;
Le mardi 10 mai 2011 à 12:43 -0500, Christoph Lameter a écrit : > Draft for a patch > > > Subject: slub: Make CONFIG_PAGE_ALLOC work with new fastpath > > Fastpath can do a speculative access to a page that CONFIG_PAGE_ALLOC may > have > marked as invalid to retrieve the pointer to the next free object. > > Probe that address before dereferencing the pointer to the page. > All of that needs to occur with interrupts disabled since an interrupt > could cause the page status to change (as pointed out by Eric). > > Signed-off-by: Christoph Lameter <cl@linux.com> > --- > mm/slub.c | 23 ++++++++++++++++++++++- > 1 file changed, 22 insertions(+), 1 deletion(-) > > Index: linux-2.6/mm/slub.c > =================================================================== > --- linux-2.6.orig/mm/slub.c 2011-05-10 12:35:30.000000000 -0500 > +++ linux-2.6/mm/slub.c 2011-05-10 12:38:53.000000000 -0500 > @@ -261,6 +261,27 @@ static inline void *get_freepointer(stru > return *(void **)(object + s->offset); > } > > +static inline void *get_freepointer_safe(struct kmem_cache *s, void *object) > +{ > + void *p; > + > +#ifdef CONFIG_PAGE_ALLOC > + unsigned long flags; > + > + local_irq_save(flags); > + > + if (probe_kernel_address(object)) > + p = NULL; /* Invalid */ > + else > + p = get_freepointer(s, object); > + > + local_irq_restore(flags); > +#else > + p = get_freepointer(s, object); > +#endif > + return p; > +} > + > static inline void set_freepointer(struct kmem_cache *s, void *object, void > *fp) > { > *(void **)(object + s->offset) = fp; > @@ -1933,7 +1954,7 @@ redo: > if (unlikely(!irqsafe_cpu_cmpxchg_double( > s->cpu_slab->freelist, s->cpu_slab->tid, > object, tid, > - get_freepointer(s, object), next_tid(tid)))) { > + get_freepointer_safe(s, object), > next_tid(tid)))) { > > note_cmpxchg_failure("slab_alloc", s, tid); > goto redo; Really this wont work Stephen You have to disable IRQ _before_ even fetching 'object' Or else, you can have an IRQ, allocate this object, pass to another cpu. This other cpu can free the object and unmap page right after you did the probe_kernel_address(object) (successfully), and before your cpu : p = get_freepointer(s, object); << BUG >> I really dont understand your motivation to keep the buggy commit.
Reply-To: cl@linux.com There is a simpler version and we can get away without interrupt disable I think. The value that we get from the read does not matter since the TID will not match. Subject: slub: Make CONFIG_PAGE_ALLOC work with new fastpath Fastpath can do a speculative access to a page that CONFIG_PAGE_ALLOC may have marked as invalid to retrieve the pointer to the next free object. Probe that address before dereferencing the pointer to the page. Signed-off-by: Christoph Lameter <cl@linux.com> --- mm/slub.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-05-10 12:54:00.000000000 -0500 +++ linux-2.6/mm/slub.c 2011-05-10 13:04:18.000000000 -0500 @@ -261,6 +261,18 @@ static inline void *get_freepointer(stru return *(void **)(object + s->offset); } +static inline void *get_freepointer_safe(struct kmem_cache *s, void *object) +{ + void *p; + +#ifdef CONFIG_DEBUG_PAGEALLOC + probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p)); +#else + p = get_freepointer(s, object); +#endif + return p; +} + static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) { *(void **)(object + s->offset) = fp; @@ -1943,7 +1955,7 @@ redo: if (unlikely(!irqsafe_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, object, tid, - get_freepointer(s, object), next_tid(tid)))) { + get_freepointer_safe(s, object), next_tid(tid)))) { note_cmpxchg_failure("slab_alloc", s, tid); goto redo;
Reply-To: cl@linux.com On Tue, 10 May 2011, Christoph Lameter wrote: > > This other cpu can free the object and unmap page right after you did > > the probe_kernel_address(object) (successfully), and before your cpu : > > > > p = get_freepointer(s, object); << BUG >> > > If the other cpu frees the object and unmaps the page then > get_freepointer_safe() can obtain an arbitrary value since the TID was > incremented. We will restart the loop and discard the value retrieved. Ok. Forgot the element there of a different cpu. A different cpu cannot unmap the page or free the page since the page is in a frozen state while we allocate from it. The page is only handled by the cpu it was assigned to until the cpu which froze it releases it. The only case that we need to protect against here is the case when an interrupt or reschedule causes the *same* cpu to release the page. In that case the TID must have been incremented.
Reply-To: cl@linux.com On Tue, 10 May 2011, Eric Dumazet wrote: > > + else > > + p = get_freepointer(s, object); > > + > > + local_irq_restore(flags); > > +#else > > + p = get_freepointer(s, object); > > +#endif > > + return p; > > +} > > + > > static inline void set_freepointer(struct kmem_cache *s, void *object, > void *fp) > > { > > *(void **)(object + s->offset) = fp; > > @@ -1933,7 +1954,7 @@ redo: > > if (unlikely(!irqsafe_cpu_cmpxchg_double( > > s->cpu_slab->freelist, s->cpu_slab->tid, > > object, tid, > > - get_freepointer(s, object), next_tid(tid)))) { > > + get_freepointer_safe(s, object), > next_tid(tid)))) { > > > > note_cmpxchg_failure("slab_alloc", s, tid); > > goto redo; > > > Really this wont work Stephen I am not Stephen. > You have to disable IRQ _before_ even fetching 'object' The object pointer is being obtained from a per cpu structure and not from the page. What is the problem with fetching the object pointer? > Or else, you can have an IRQ, allocate this object, pass to another cpu. If that occurs then TID is being incremented and we will restart the loop getting the new object pointer from the per cpu structure. The object pointer that we were considering is irrelevant. > This other cpu can free the object and unmap page right after you did > the probe_kernel_address(object) (successfully), and before your cpu : > > p = get_freepointer(s, object); << BUG >> If the other cpu frees the object and unmaps the page then get_freepointer_safe() can obtain an arbitrary value since the TID was incremented. We will restart the loop and discard the value retrieved.
Le mardi 10 mai 2011 à 13:28 -0500, Christoph Lameter a écrit : > On Tue, 10 May 2011, Eric Dumazet wrote: > > > > + else > > > + p = get_freepointer(s, object); > > > + > > > + local_irq_restore(flags); > > > +#else > > > + p = get_freepointer(s, object); > > > +#endif > > > + return p; > > > +} > > > + > > > static inline void set_freepointer(struct kmem_cache *s, void *object, > void *fp) > > > { > > > *(void **)(object + s->offset) = fp; > > > @@ -1933,7 +1954,7 @@ redo: > > > if (unlikely(!irqsafe_cpu_cmpxchg_double( > > > s->cpu_slab->freelist, s->cpu_slab->tid, > > > object, tid, > > > - get_freepointer(s, object), next_tid(tid)))) { > > > + get_freepointer_safe(s, object), > next_tid(tid)))) { > > > > > > note_cmpxchg_failure("slab_alloc", s, tid); > > > goto redo; > > > > > > Really this wont work Stephen > > I am not Stephen. > Yes, sorry Christoph. > > You have to disable IRQ _before_ even fetching 'object' > > The object pointer is being obtained from a per cpu structure and > not from the page. What is the problem with fetching the object pointer? > > > Or else, you can have an IRQ, allocate this object, pass to another cpu. > > If that occurs then TID is being incremented and we will restart the loop > getting the new object pointer from the per cpu structure. The object > pointer that we were considering is irrelevant. > Problem is not restart the loop, but avoiding accessing a non valid memory area. > > This other cpu can free the object and unmap page right after you did > > the probe_kernel_address(object) (successfully), and before your cpu : > > > > p = get_freepointer(s, object); << BUG >> > > If the other cpu frees the object and unmaps the page then > get_freepointer_safe() can obtain an arbitrary value since the TID was > incremented. We will restart the loop and discard the value retrieved. > In current code I see : tid = c->tid; barrier(); object = c->freelist; There is no guarantee c->tid is fetched before c->freelist by cpu. You need rmb() here. I claim all this would be far more simple disabling IRQ before fetching c->tid and c->freelist, in DEBUG_PAGE_ALLOC case. You would not even need to use magic probe_kernel_read() Why do you try so _hard_ trying to optimize this, I really wonder. Nobody is able to read this code anymore and prove its correct.
Reply-To: cl@linux.com On Tue, 10 May 2011, Eric Dumazet wrote: > > > You have to disable IRQ _before_ even fetching 'object' > > > > The object pointer is being obtained from a per cpu structure and > > not from the page. What is the problem with fetching the object pointer? > > > > > Or else, you can have an IRQ, allocate this object, pass to another cpu. > > > > If that occurs then TID is being incremented and we will restart the loop > > getting the new object pointer from the per cpu structure. The object > > pointer that we were considering is irrelevant. > > > > Problem is not restart the loop, but avoiding accessing a non valid > memory area. Yes and could you please explain clearly what the problem is? > > > > This other cpu can free the object and unmap page right after you did > > > the probe_kernel_address(object) (successfully), and before your cpu : > > > > > > p = get_freepointer(s, object); << BUG >> > > > > If the other cpu frees the object and unmaps the page then > > get_freepointer_safe() can obtain an arbitrary value since the TID was > > incremented. We will restart the loop and discard the value retrieved. > > > > > > In current code I see : > > tid = c->tid; > barrier(); > object = c->freelist; > > There is no guarantee c->tid is fetched before c->freelist by cpu. > > You need rmb() here. Nope. This is not processor to processor concurrency. this_cpu operations only deal with concurrency issues on the same processor. I.e. interrupts and preemption. > I claim all this would be far more simple disabling IRQ before fetching > c->tid and c->freelist, in DEBUG_PAGE_ALLOC case. > > You would not even need to use magic probe_kernel_read() > > > Why do you try so _hard_ trying to optimize this, I really wonder. > Nobody is able to read this code anymore and prove its correct. Optimizing? You think about this as concurrency issue between multiple cpus. That is fundamentally wrong. This is dealing with access to per cpu data and the concurrency issues are only with code running on the *same* cpu.
Just for info, the computer on which I got these warnings indeed has only one core. Moreover, I cannot reproduce the warning anymore with rc7. However, I got another warning (far more reproducible), reported in https://bugzilla.kernel.org/show_bug.cgi?id=34882 (I don't know if it's the same issue; I reported it separately anyway). The current one may not appear anymore either because I reinstalled Slackware in 32-bit mode (it was 64-bit before), or because I removed a memory module that failed last week (the machine is down to 512 MB now, but faster), or simply because it's rc7, of course.
Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit : > Optimizing? You think about this as concurrency issue between multiple > cpus. That is fundamentally wrong. This is dealing with access to per cpu > data and the concurrency issues are only with code running on the *same* > cpu. > If you enable irqs, then this object can be allocated by _this_ cpu and given to another one. Another cpu can free the page, forcing you to call a very expensive function, that might give obsolete result as soon it returns. Maybe I am just tired tonight, this seems very obvious, I must miss something.
Reply-To: cl@linux.com On Tue, 10 May 2011, Eric Dumazet wrote: > Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit : > > > Optimizing? You think about this as concurrency issue between multiple > > cpus. That is fundamentally wrong. This is dealing with access to per cpu > > data and the concurrency issues are only with code running on the *same* > > cpu. > > > > If you enable irqs, then this object can be allocated by _this_ cpu and > given to another one. That will cause an incrementing of the tid. > Another cpu can free the page, forcing you to call a very expensive > function, that might give obsolete result as soon it returns. No the other cpu cannot free the page since the page is pinned by the current cpu (see PageFrozen()). > Maybe I am just tired tonight, this seems very obvious, I must miss > something. Yeah you are way off thinking about cpu to cpu concurrency issues that do not apply here.
Le mardi 10 mai 2011 à 15:33 -0500, Christoph Lameter a écrit :
> On Tue, 10 May 2011, Eric Dumazet wrote:
>
> > Le mardi 10 mai 2011 à 14:38 -0500, Christoph Lameter a écrit :
> >
> > > Optimizing? You think about this as concurrency issue between multiple
> > > cpus. That is fundamentally wrong. This is dealing with access to per cpu
> > > data and the concurrency issues are only with code running on the *same*
> > > cpu.
> >
> > If you enable irqs, then this object can be allocated by _this_ cpu and
> > given to another one.
>
> That will cause an incrementing of the tid.
>
> > Another cpu can free the page, forcing you to call a very expensive
> > function, that might give obsolete result as soon it returns.
>
> No the other cpu cannot free the page since the page is pinned by
> the current cpu (see PageFrozen()).

What happens then? The other cpu calls kfree() on the last non-freed object of this slab, and yet the page stays frozen? How is this page ever going to be freed?

> > Maybe I am just tired tonight, this seems very obvious, I must miss
> > something.
>
> Yeah you are way off thinking about cpu to cpu concurrency issues that do
> not apply here.

I fail to understand how the current cpu can assert page ownership if IRQs are enabled; it seems obvious that it cannot.
Reply-To: cl@linux.com On Tue, 10 May 2011, Eric Dumazet wrote: > > No the other cpu cannot free the page since the page is pinned by > > the current cpu (see PageFrozen()). > > > > What happens then ? Other cpu calls kfree() on last nonfreed object for > this slab, and yet the page stay frozen ? How this page is going to be > freed at all ? Yes the page stays frozen. The freed objects are used to replenish the percpu free list when it becomes empty. The page is going to be freed when a kmalloc() finds that the per cpu freelist is empty and that the freelist of the page is also empty. Then interrupts are disabled, the old page is unfrozen and a new page is acquired for allocation. > > > Maybe I am just tired tonight, this seems very obvious, I must miss > > > something. > > > > Yeah you are way off thinking about cpu to cpu concurrency issues that do > > not apply here. > > I fail to understand how current cpu can assert page ownership, if IRQs > are enabled, this seems obvious it cannot. The cpu sets a page flag called PageFrozen() and points a per cpu pointer to the page.
Le mardi 10 mai 2011 à 16:22 -0500, Christoph Lameter a écrit : > On Tue, 10 May 2011, Eric Dumazet wrote: > > > > No the other cpu cannot free the page since the page is pinned by > > > the current cpu (see PageFrozen()). > > > > > > > What happens then ? Other cpu calls kfree() on last nonfreed object for > > this slab, and yet the page stay frozen ? How this page is going to be > > freed at all ? > > Yes the page stays frozen. The freed objects are used to replenish the > percpu free list when it becomes empty. > > The page is going to be freed when a kmalloc() finds that the per cpu > freelist is empty and that the freelist of the page is also empty. Then > interrupts are disabled, the old page is unfrozen and a new > page is acquired for allocation. > > > > > Maybe I am just tired tonight, this seems very obvious, I must miss > > > > something. > > > > > > Yeah you are way off thinking about cpu to cpu concurrency issues that do > > > not apply here. > > > > I fail to understand how current cpu can assert page ownership, if IRQs > > are enabled, this seems obvious it cannot. > > The cpu sets a page flag called PageFrozen() and points a per cpu pointer > to the page. > > So, if I understand you, there is no problem at all and no patch even needed ? I can start a stress test and you guarantee there wont be a crash ? Sorry, its 5h11 in the morning here ;)
Reply-To: cl@linux.com On Wed, 11 May 2011, Eric Dumazet wrote: > > The cpu sets a page flag called PageFrozen() and points a per cpu pointer > > to the page. > > > > > > So, if I understand you, there is no problem at all and no patch even > needed ? I can start a stress test and you guarantee there wont be a > crash ? > > Sorry, its 5h11 in the morning here ;) There is a problem if an interrupt or a preemption occurs and there is no object left on the page. Then the current page will be unfrozen and a new page put into place for allocation. The old page may then be freed by some other process on another processor before we continue the interrupted slab_alloc(). When slab_alloc() resumes in this scenario then it will ultimately see that the tid was incremented and so the cmpxchg will fail. But before we do the cmpxchgwe determine the pointer to the next object. And for that we access the old page. The access must not cause a page fault (which it currently does with CONFIG_DEBUG_PAGEALLOC). That is why we need the patch introducing get_freepointer_safe() The result does not matter since we will repeat the cmpxchg loop.
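Putting the pieces of this explanation together, the allocation fastpath behaves roughly as in the condensed sketch below. It is pieced together from the slab_alloc() hunks quoted earlier in the thread; the slow-path branch and error handling are elided, so it is not compilable as-is.

/* Condensed walkthrough of the lockless allocation fastpath. */
static void *slab_alloc_fastpath_sketch(struct kmem_cache *s)
{
	struct kmem_cache_cpu *c;
	unsigned long tid;
	void *object, *next;

redo:
	c = __this_cpu_ptr(s->cpu_slab);
	tid = c->tid;		/* remember the per-cpu transaction id */
	barrier();
	object = c->freelist;

	/*
	 * Speculative read of the next-object pointer. If an interrupt hit
	 * after the loads above, this slab may already have been unfrozen
	 * and its page freed; with DEBUG_PAGEALLOC that page can even be
	 * unmapped, so the read must go through get_freepointer_safe()
	 * (probe_kernel_read) to avoid a fault. The value read may be junk.
	 */
	next = get_freepointer_safe(s, object);

	/*
	 * The double cmpxchg succeeds only if freelist and tid are both
	 * unchanged, i.e. nothing happened on this cpu in between. On
	 * failure the possibly-junk 'next' is discarded and we retry.
	 */
	if (unlikely(!irqsafe_cpu_cmpxchg_double(s->cpu_slab->freelist,
						 s->cpu_slab->tid,
						 object, tid,
						 next, next_tid(tid))))
		goto redo;

	return object;
}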
Update: still present in 3.0-rc7:

WARNING: kmemcheck: Caught 64-bit read from uninitialized memory (ffff8801c34ffa00)
00fc4fc30188ffff210000000000000001000000000000000000000000000000
 u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u u
 ^
Pid: 2629, comm: akonadi_nepomuk Not tainted 3.0.0-rc7 #20 Acer Aspire 7750G/JE70_HR
RIP: 0010:[<ffffffff811203a2>]  [<ffffffff811203a2>] __kmalloc_node_track_caller+0x112/0x1a0
RSP: 0018:ffff8801a0125a38  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000010 RCX: 000000000003a560
RDX: 000000000003a558 RSI: 00000000001d3ec0 RDI: ffffffff81a8cb01
RBP: ffff8801a0125a78 R08: 0000000000000040 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8801c34ffa00
R13: ffff8801c7404a00 R14: 00000000000004d0 R15: 00000000ffffffff
FS:  00007f93c32b6780(0000) GS:ffff8801c7800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8801c6642a80 CR3: 00000001a026c000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
 [<ffffffff8166dcee>] __alloc_skb+0x7e/0x1a0
 [<ffffffff81667ced>] sock_alloc_send_pskb+0x22d/0x3b0
 [<ffffffff81667e80>] sock_alloc_send_skb+0x10/0x20
 [<ffffffff8172897c>] unix_stream_sendmsg+0x32c/0x4f0
 [<ffffffff81661b58>] sock_aio_write+0x138/0x150
 [<ffffffff811250eb>] do_sync_readv_writev+0xcb/0x110
 [<ffffffff811253af>] do_readv_writev+0xcf/0x1e0
 [<ffffffff81125508>] vfs_writev+0x48/0x60
 [<ffffffff8112566f>] sys_writev+0x4f/0xa0
 [<ffffffff817dadfb>] system_call_fastpath+0x16/0x1b
 [<ffffffffffffffff>] 0xffffffffffffffff
Update: cannot reproduce on 3.1-rc4. Closing this bug.