Created attachment 278997
dmesg and kernel config

I'm using Xubuntu 18.04 and I noticed that, under memory pressure, the script
from https://github.com/pixelb/ps_mem.git (HEAD
1ed0bc5519d889d58235f2c35db01e4ede0d8231) is causing a kernel BUG and locking a
CPU. The following appears in dmesg:

BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0

After this BUG, the computer's performance is greatly degraded: some programs
do not close, some fail to open, and some fail to work properly. For example,
bash fails to autocomplete.

Steps to reproduce:

1) Be under memory pressure. Using dd to write a large file to /dev/shm works
for this;
2) Run the script from https://github.com/pixelb/ps_mem.git

Expected result: the script prints its information and the system keeps working
normally;

Observed result: the script is killed, the kernel BUG happens, a CPU gets stuck
and the computer presents problems.

I did not observe this with 4.17.19. I'll bisect and see if I can find which
commit is causing this.

I'm sorry if I'm reporting to the wrong product and component.
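For anyone who wants to exercise the same kernel path without the Python script: the oops later in this thread shows the crash in show_smaps_rollup(), reached through an ordinary read(). A minimal sketch in C, assuming (this is an assumption, not something the report states) that reading /proc/<pid>/smaps_rollup is what the script does when it trips the bug. The helper name read_rollup_swap_kb is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read a smaps_rollup-style file and return the value
 * of its "Swap:" line in kB. Returns -1 if the file cannot be opened or no
 * "Swap:" line is found. Simply reading the file is what makes the kernel
 * walk the process's VMAs.
 */
long read_rollup_swap_kb(const char *path)
{
	char line[256];
	long kb = -1;
	FILE *f;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* "SwapPss:" does not match because of the colon. */
		if (strncmp(line, "Swap:", 5) == 0) {
			if (sscanf(line + 5, "%ld", &kb) != 1)
				kb = -1;
			break;
		}
	}

	fclose(f);
	return kb;
}
```

Calling read_rollup_swap_kb("/proc/self/smaps_rollup") from a small main() while /dev/shm is being filled would be one way to try this; whether it reproduces the BUG on a given kernel is untested here.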
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Vlastimil, it looks like your August 21 smaps changes are failing.
This one is pretty urgent, please.

Leonardo (yes?): thanks for reporting. Very helpful.

On Thu, 11 Oct 2018 18:13:31 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=201377
>
>             Bug ID: 201377
>            Summary: Kernel BUG under memory pressure: unable to handle
>                     kernel NULL pointer dereference at 00000000000000f0
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.19-rc7
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: akpm@linux-foundation.org
>           Reporter: leozinho29_eu@hotmail.com
>         Regression: No
>
> Created attachment 278997
>   --> https://bugzilla.kernel.org/attachment.cgi?id=278997&action=edit
> dmesg and kernel config
(cc linux-mm, argh)

On Fri, 12 Oct 2018 15:55:33 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> Vlastimil, it looks like your August 21 smaps changes are failing.
> This one is pretty urgent, please.
>
> Leonardo (yes?): thanks for reporting. Very helpful.
On 10/13/18 12:56 AM, Andrew Morton wrote:
>> Vlastimil, it looks like your August 21 smaps changes are failing.
>> This one is pretty urgent, please.

Thanks, will look in a few hours. Glad that there will be rc8...
On 10/13/18 2:57 PM, Vlastimil Babka wrote:
> On 10/13/18 12:56 AM, Andrew Morton wrote:
>>> Vlastimil, it looks like your August 21 smaps changes are failing.
>>> This one is pretty urgent, please.
>
> Thanks, will look in few hours. Glad that there will be rc8...

I think I found it, and it seems the bug was there all the time for smaps_rollup.
Dunno why it was hit only now. Please test?

----8<----
From 948be25ee1bdddca8244d1a055fbf812022571e7 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Sun, 14 Oct 2018 08:59:44 +0200
Subject: [PATCH] mm: /proc/pid/smaps_rollup: fix NULL pointer deref in
 smaps_pte_range

Leonardo reports an apparent regression in 4.19-rc7:

 BUG: unable to handle kernel NULL pointer dereference at 00000000000000f0
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 3 PID: 6032 Comm: python Not tainted 4.19.0-041900rc7-lowlatency #201810071631
 Hardware name: LENOVO 80UG/Toronto 4A2, BIOS 0XCN45WW 08/09/2018
 RIP: 0010:smaps_pte_range+0x32d/0x540
 Code: 80 00 00 00 00 74 a9 48 89 de 41 f6 40 52 40 0f 85 04 02 00 00 49 2b 30 48 c1 ee 0c 49 03 b0 98 00 00 00 49 8b 80 a0 00 00 00 <48> 8b b8 f0 00 00 00 e8 b7 ef ec ff 48 85 c0 0f 84 71 ff ff ff a8
 RSP: 0018:ffffb0cbc484fb88 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: 0000560ddb9e9000 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000560ddb9e9 RDI: 0000000000000001
 RBP: ffffb0cbc484fbc0 R08: ffff94a5a227a578 R09: ffff94a5a227a578
 R10: 0000000000000000 R11: 0000560ddbbe7000 R12: ffffe903098ba728
 R13: ffffb0cbc484fc78 R14: ffffb0cbc484fcf8 R15: ffff94a5a2e9cf48
 FS:  00007f6dfb683740(0000) GS:ffff94a5aaf80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00000000000000f0 CR3: 000000011c118001 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  __walk_page_range+0x3c2/0x6f0
  walk_page_vma+0x42/0x60
  smap_gather_stats+0x79/0xe0
  ? gather_pte_stats+0x320/0x320
  ? gather_hugetlb_stats+0x70/0x70
  show_smaps_rollup+0xcd/0x1c0
  seq_read+0x157/0x400
  __vfs_read+0x3a/0x180
  ? security_file_permission+0x93/0xc0
  ? security_file_permission+0x93/0xc0
  vfs_read+0x8f/0x140
  ksys_read+0x55/0xc0
  __x64_sys_read+0x1a/0x20
  do_syscall_64+0x5a/0x110
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Decoded code matched to local compilation+disassembly points to
smaps_pte_entry():

        } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
                                                        && pte_none(*pte))) {
                page = find_get_entry(vma->vm_file->f_mapping,
                                                linear_page_index(vma, addr));

Here, vma->vm_file is NULL. mss->check_shmem_swap should be false in that case,
however for smaps_rollup, smap_gather_stats() can set the flag true for one vma
and leave it true for subsequent vma's where it should be false.

To fix, reset the check_shmem_swap flag to false. There's also related bug
which sets mss->swap to shmem_swapped, which in the context of smaps_rollup
overwrites any value accumulated from previous vma's. Fix that as well.

Note that the report suggests a regression between 4.17.19 and 4.19-rc7,
which makes the 4.19 series ending with commit 258f669e7e88 ("mm:
/proc/pid/smaps_rollup: convert to single value seq_file") suspicious. But the
mss was reused for rollup since 493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
so let's play it safe with the stable backport.

Fixes: 493b0e9d945f ("mm: add /proc/pid/smaps_rollup")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=201377
Reported-by: Leonardo Mueller <leozinho29_eu@hotmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
---
 fs/proc/task_mmu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5ea1d64cb0b4..a027473561c6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -713,6 +713,8 @@ static void smap_gather_stats(struct vm_area_struct *vma,
 	smaps_walk.private = mss;
 
 #ifdef CONFIG_SHMEM
+	/* In case of smaps_rollup, reset the value from previous vma */
+	mss->check_shmem_swap = false;
 	if (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)) {
 		/*
 		 * For shared or readonly shmem mappings we know that all
@@ -728,7 +730,7 @@ static void smap_gather_stats(struct vm_area_struct *vma,
 
 		if (!shmem_swapped || (vma->vm_flags & VM_SHARED) ||
 					!(vma->vm_flags & VM_WRITE)) {
-			mss->swap = shmem_swapped;
+			mss->swap += shmem_swapped;
 		} else {
 			mss->check_shmem_swap = true;
 			smaps_walk.pte_hole = smaps_pte_hole;
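To make the two problems described in the commit message easier to follow, here is a small userspace model in C. It is only a sketch: the toy_* structures and functions are hypothetical stand-ins, not kernel code, and they reduce each vma to the few properties the patch talks about. The model counts how often the page walk would have touched a NULL vm_file instead of actually crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct toy_vma {
	bool has_file;               /* models vma->vm_file != NULL */
	bool is_shmem;               /* models shmem_mapping(...) being true */
	bool shared_or_ro;           /* models VM_SHARED set or VM_WRITE clear */
	unsigned long shmem_swapped; /* models the swapped-out shmem count */
};

/* Hypothetical, simplified stand-in for struct mem_size_stats. */
struct toy_stats {
	unsigned long swap;
	bool check_shmem_swap;
	unsigned long bad_derefs;    /* walks that would deref a NULL vm_file */
};

/* Models smap_gather_stats() for one vma, with or without the fix. */
static void toy_gather(const struct toy_vma *vma, struct toy_stats *mss,
		       bool fixed)
{
	if (fixed)
		mss->check_shmem_swap = false; /* reset added by the patch */

	if (vma->has_file && vma->is_shmem) {
		if (!vma->shmem_swapped || vma->shared_or_ro) {
			if (fixed)
				mss->swap += vma->shmem_swapped;
			else
				mss->swap = vma->shmem_swapped; /* overwrites */
		} else {
			mss->check_shmem_swap = true;
		}
	}

	/* Models the pte walk using vma->vm_file when the flag is set. */
	if (mss->check_shmem_swap && !vma->has_file)
		mss->bad_derefs++;
}

/* Models smaps_rollup: one stats struct reused across all vmas. */
struct toy_stats toy_rollup(const struct toy_vma *vmas, size_t n, bool fixed)
{
	struct toy_stats mss = { 0 };
	size_t i;

	assert(vmas != NULL || n == 0);
	for (i = 0; i < n; i++)
		toy_gather(&vmas[i], &mss, fixed);
	return mss;
}
```

With a sequence such as shared shmem, private writable shmem with swapped pages, then an anonymous vma, the unfixed model leaves the flag set for the anonymous vma (the NULL vm_file case), while the fixed model resets it per vma and accumulates swap across shared shmem mappings. The per-pte swap accounting done during the walk is deliberately left out of the model.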
On 10/14/18 8:07 PM, Leonardo Soares Müller wrote:
> This patch applied on 4.19-rc7 corrected the problem to me and the
> script is no longer triggering the kernel bug.

Great! Can we add your Tested-by: then?

> I completely skipped 4.18 because there were multiple regressions
> affecting my computer. 4.19-rc6 and 4.19-rc7 have most regressions fixed
> but then this issue appeared.
>
> The first kernel version released I found with this problem is 4.18-rc4,

OK, that confirms the smaps_rollup problem is indeed older than my
rewrite. Unless it's a typo and you mean 4.19-rc4 since you "skipped 4.18".

> but bisecting between 4.18-rc3 and 4.18-rc4 failed: on boot there was
> one message starting with [UNSUPP] and with something about "Arbitrary
> File System".
I did mean 4.18 (eighteen); that is right. While I skipped 4.18 for normal use,
I also tested with 4.18 when this issue appeared, and noticed that the issue
has existed since 4.18-rc4.

Yes, you can add my Tested-by, as this patch solved the issue for me: there are
no problems with the kernel and the script runs normally.

Thank you.

On 14/10/2018 17:14, Vlastimil Babka wrote:
> Great! Can we add your Tested-by: then?
>
> OK, that confirms the smaps_rollup problem is indeed older than my
> rewrite. Unless it's a typo and you mean 4.19-rc4 since you "skipped 4.18".
This patch, applied on 4.19-rc7, corrected the problem for me, and the
script no longer triggers the kernel bug.

I completely skipped 4.18 because there were multiple regressions
affecting my computer. 4.19-rc6 and 4.19-rc7 have most of those regressions
fixed, but then this issue appeared.

The first released kernel version in which I found this problem is 4.18-rc4,
but bisecting between 4.18-rc3 and 4.18-rc4 failed: on boot there was
one message starting with [UNSUPP] and with something about "Arbitrary
File System".

On 14/10/2018 04:17, Vlastimil Babka wrote:
> I think I found it, and it seems the bug was there all the time for
> smaps_rollup.
> Dunno why it was hit only now. Please test?