Bug 189181 - BUG: unable to handle kernel NULL pointer dereference in mem_cgroup_node_nr_lru_pages
Summary: BUG: unable to handle kernel NULL pointer dereference in mem_cgroup_node_nr_lru_pages
Status: NEW
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Slab Allocator
Hardware: Intel Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-26 15:10 UTC by Marek Marczykowski-Górecki
Modified: 2016-12-01 10:58 UTC
CC List: 1 user

See Also:
Kernel Version: 4.8.10
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Full console log (247.75 KB, text/plain)
2016-11-26 15:10 UTC, Marek Marczykowski-Górecki
signature.asc (182 bytes, application/pgp-signature)
2016-12-01 02:34 UTC, Marek Marczykowski-Górecki
signature.asc (182 bytes, application/pgp-signature)
2016-12-01 10:58 UTC, Marek Marczykowski-Górecki

Description Marek Marczykowski-Górecki 2016-11-26 15:10:16 UTC
Created attachment 245931 [details]
Full console log

Shortly after system startup sometimes (about 1/30 times) I get this:

[   15.665196] BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
[   15.665213] IP: [<ffffffff8122d520>] mem_cgroup_node_nr_lru_pages+0x20/0x40
[   15.665225] PGD 0 
[   15.665230] Oops: 0000 [#1] SMP
[   15.665235] Modules linked in: fuse xt_nat xen_netback xt_REDIRECT nf_nat_redirect ip6table_filter ip6_tables xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack intel_rapl x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcspkr dummy_hcd udc_core u2mfn(O) xen_blkback xenfs xen_privcmd xen_blkfront
[   15.665285] CPU: 0 PID: 60 Comm: kswapd0 Tainted: G           O    4.8.10-12.pvops.qubes.x86_64 #1
[   15.665292] task: ffff880011863b00 task.stack: ffff880011868000
[   15.665297] RIP: e030:[<ffffffff8122d520>]  [<ffffffff8122d520>] mem_cgroup_node_nr_lru_pages+0x20/0x40
[   15.665307] RSP: e02b:ffff88001186bc70  EFLAGS: 00010293
[   15.665311] RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
[   15.665317] RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
[   15.665322] RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
[   15.665327] R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
[   15.665332] R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
[   15.665343] FS:  0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
[   15.665351] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.665358] CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
[   15.665366] Stack:
[   15.665371]  ffff88001186bc98 ffffffff811e0dda 00000000000002eb 0000000000000080
[   15.665384]  ffffffff81c6f6a0 ffff88001186bd70 ffffffff811c36d9 0000000000000000
[   15.665397]  ffff88001186bcb0 ffff88001186bcb0 ffff88001186bcc0 000000000000abc5
[   15.665410] Call Trace:
[   15.665419]  [<ffffffff811e0dda>] count_shadow_nodes+0x9a/0xa0
[   15.665428]  [<ffffffff811c36d9>] shrink_slab.part.42+0x119/0x3e0
[   15.666049]  [<ffffffff811c83ec>] shrink_node+0x22c/0x320
[   15.666049]  [<ffffffff811c928c>] kswapd+0x32c/0x700
[   15.666049]  [<ffffffff811c8f60>] ? mem_cgroup_shrink_node+0x180/0x180
[   15.666049]  [<ffffffff810c1b08>] kthread+0xd8/0xf0
[   15.666049]  [<ffffffff817a3abf>] ret_from_fork+0x1f/0x40
[   15.666049]  [<ffffffff810c1a30>] ? kthread_create_on_node+0x190/0x190
[   15.666049] Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 <4a> 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
[   15.666049] RIP  [<ffffffff8122d520>] mem_cgroup_node_nr_lru_pages+0x20/0x40
[   15.666049]  RSP <ffff88001186bc70>
[   15.666049] CR2: 0000000000000400
[   15.666049] ---[ end trace 100494b9edbdfc4d ]---

After this, there is another "unable to handle kernel paging request", I guess because of the do_exit() in kswapd0, then a lot of soft lockups, and the system is unusable (see the full console log attached).

This is running in a PV domU on Xen 4.7.0 (the same also happens on Xen 4.6.3), and kernel 4.8.7 shows the same problem. Previously, v4.4.31 worked without any issue.
Comment 1 Andrew Morton 2016-11-29 22:56:45 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 26 Nov 2016 15:10:16 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=189181
> 
>             Bug ID: 189181
>            Summary: BUG: unable to handle kernel NULL pointer dereference
>                     in mem_cgroup_node_nr_lru_pages
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.8.10
>           Hardware: Intel
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Slab Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: marmarek@mimuw.edu.pl
>         Regression: No
> 
> [... full report quoted in the original mail; see the Description above ...]
Comment 2 Johannes Weiner 2016-11-30 18:21:54 UTC
Hi Michael,

On Wed, Nov 30, 2016 at 06:00:40PM +0100, Michal Hocko wrote:
> > > [... oops output quoted; see the Description above ...]
> 
> I cannot generate code similar to yours, but the above suggests that we
> are getting a NULL memcg. This would suggest global reclaim, which
> count_shadow_nodes misinterprets because it does
> 
>       if (memcg_kmem_enabled()) {
>               pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>                                                    LRU_ALL_FILE);
>       } else {
>               pages = node_page_state(NODE_DATA(sc->nid), NR_ACTIVE_FILE) +
>                       node_page_state(NODE_DATA(sc->nid), NR_INACTIVE_FILE);
>       }
> 
> this might be a race with kmem enabling, AFAICS. Anyway, I believe that
> the above check needs to be extended to require sc->memcg != NULL.

Yep, my locally built code looks very different from the report, but
it's clear that memcg is NULL. I didn't see the race you mention, but
it makes sense to me: shrink_slab() is supposed to filter memcg-aware
shrinkers based on whether we have a memcg or not, but it only does it
when kmem accounting is enabled; if it's disabled, the shrinker should
also use its non-memcg behavior. However, nothing prevents a memcg
with kmem from onlining between the filter and the shrinker run.

> diff --git a/mm/workingset.c b/mm/workingset.c
> index 617475f529f4..0f07522c5c0e 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -348,7 +348,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>       shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
>       local_irq_enable();
>  
> -     if (memcg_kmem_enabled()) {
> +     if (memcg_kmem_enabled() && sc->memcg) {
>               pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>                                                    LRU_ALL_FILE);
>       } else {

If we do that, I'd remove the racy memcg_kmem_enabled() check
altogether and just check for whether we have a memcg or not.

What do you think, Vladimir?
Comment 3 Balbir Singh 2016-12-01 00:33:19 UTC
On 01/12/16 05:30, Michal Hocko wrote:
> On Wed 30-11-16 13:16:53, Johannes Weiner wrote:
>> Hi Michael,
>>
>> On Wed, Nov 30, 2016 at 06:00:40PM +0100, Michal Hocko wrote:
> [...]
>>> diff --git a/mm/workingset.c b/mm/workingset.c
>>> index 617475f529f4..0f07522c5c0e 100644
>>> --- a/mm/workingset.c
>>> +++ b/mm/workingset.c
>>> @@ -348,7 +348,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>>>     shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
>>>     local_irq_enable();
>>>  
>>> -   if (memcg_kmem_enabled()) {
>>> +   if (memcg_kmem_enabled() && sc->memcg) {
>>>             pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
>>>                                                  LRU_ALL_FILE);
>>>     } else {
>>
>> If we do that, I'd remove the racy memcg_kmem_enabled() check
>> altogether and just check for whether we have a memcg or not.
> 
> But that would make this a memcg aware shrinker even when kmem is not
> enabled...
> 
> But now that I am looking into the code
> shrink_slab:
>               if (memcg_kmem_enabled() &&
>                   !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>                       continue;
> 
> this should be taken care of already. So sc->memcg should indeed be
> sufficient. So unless I am missing something, I will respin my local
> patch and post it later, after the reporter has some time to test the
> current one.
>  

I did a quick disassembly of the code.

R9 and RDI are NULL, and the faulting instruction seems to be

mov rsi, [rdi+r9*8+0x400]

RDI is NULL and sc->memcg is NULL, which indicates global reclaim.

The check referred to earlier

                /*
                 * If kernel memory accounting is disabled, we ignore
                 * SHRINKER_MEMCG_AWARE flag and call all shrinkers
                 * passing NULL for memcg.
                 */
                if (memcg_kmem_enabled() &&
                    !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
                        continue;

So we do pass NULL for memcg.

A check for sc->memcg should be enough in count_shadow_nodes, and a
VM_BUG_ON for memcg == NULL in mem_cgroup_node_nr_lru_pages would be nice.

Balbir Singh.
Comment 4 Marek Marczykowski-Górecki 2016-12-01 02:34:37 UTC
Created attachment 246521 [details]
signature.asc

On Wed, Nov 30, 2016 at 07:30:17PM +0100, Michal Hocko wrote:
> On Wed 30-11-16 13:16:53, Johannes Weiner wrote:
> > Hi Michael,
> > 
> > On Wed, Nov 30, 2016 at 06:00:40PM +0100, Michal Hocko wrote:
> [...]
> > > diff --git a/mm/workingset.c b/mm/workingset.c
> > > index 617475f529f4..0f07522c5c0e 100644
> > > --- a/mm/workingset.c
> > > +++ b/mm/workingset.c
> > > @@ -348,7 +348,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > >   shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
> > >   local_irq_enable();
> > >  
> > > - if (memcg_kmem_enabled()) {
> > > + if (memcg_kmem_enabled() && sc->memcg) {
> > >           pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
> > >                                                LRU_ALL_FILE);
> > >   } else {
> > [... Johannes's reply trimmed; see Comment 2 ...]
> 
> [... Michal's reply trimmed; see Comment 3 ...]

The above patch seems to help. At least the problem hasn't occurred for
the last ~40 VM startups.

>  
> > What do you think, Vladimir?
>
Comment 5 Marek Marczykowski-Górecki 2016-12-01 10:58:18 UTC
Created attachment 246541 [details]
signature.asc

On Thu, Dec 01, 2016 at 08:02:13AM +0100, Michal Hocko wrote:
> On Thu 01-12-16 03:24:54, Marek Marczykowski-Górecki wrote:
> > On Wed, Nov 30, 2016 at 07:30:17PM +0100, Michal Hocko wrote:
> > > [... earlier exchange and patch trimmed; see Comment 4 ...]
> > 
> > The above patch seems to help. At least the problem hasn't occurred for
> > the last ~40 VM startups.
> 
> I will consider this as
> Tested-by: Marek Marczykowski-Górecki <marmarek@mimuw.edu.pl>
> 
> OK? Thanks for the report and testing!

Yes.
