Bug 202149

Summary: NULL Pointer Dereference in __split_huge_pmd on PPC64LE
Product: Memory Management
Reporter: Matt Corallo (kernel)
Component: Other
Assignee: Andrew Morton (akpm)
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.19.13
Regression: No

Description Matt Corallo 2019-01-04 22:49:52 UTC
The kernel is actually 4.19.13 plus this commit to fix mpt3sas: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23c3828aa2f84edec7020c7397a22931e7a879e1 . I also saw this fault on an earlier 4.19 kernel with a different version of the mpt3sas fix patched in. The config is roughly Debian's default config, but with 4K pages instead of the default 64K.

[ 9531.579895] Unable to handle kernel paging request for data at address 0x00000000
[ 9531.579918] Faulting instruction address: 0xc000000000076c64
[ 9531.579930] Oops: Kernel access of bad area, sig: 11 [#1]
[ 9531.579948] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 9531.579960] Modules linked in: binfmt_misc veth xt_nat tap nft_chain_nat_ipv4 nft_chain_route_ipv4 tun btrfs zstd_compress zstd_decompress xxhash ipip tunnel4 ip_tunnel ipt_MASQUERADE nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_DSCP xt_dscp nft_counter xt_tcpudp nft_compat nf_tables nfnetlink amdgpu chash gpu_sched ast snd_hda_codec_hdmi ttm drm_kms_helper snd_hda_intel snd_hda_codec drm sg snd_hda_core snd_hwdep snd_pcm uas drm_panel_orientation_quirks syscopyarea sysfillrect snd_timer sysimgblt fb_sys_fops tg3 mpt3sas snd i2c_algo_bit ofpart ipmi_powernv opal_prd ipmi_devintf soundcore ipmi_msghandler powernv_flash libphy mtd raid_class scsi_transport_sas at24 ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto sd_mod raid10 raid456 crc32c_generic libcrc32c async_raid6_recov
[ 9531.580142]  async_memcpy async_pq evdev hid_generic usbhid hid raid6_pq async_xor xor async_tx raid1 raid0 multipath linear md_mod usb_storage dm_crypt dm_mod algif_skcipher af_alg ecb xts xhci_pci vmx_crypto xhci_hcd usbcore nvme nvme_core usb_common
[ 9531.580219] CPU: 9 PID: 4762 Comm: rustc Not tainted 4.19.0-2-powerpc64le #1 Debian 4.19.13-1
[ 9531.580250] NIP:  c000000000076c64 LR: c00000000037ec38 CTR: c0000000000471e0
[ 9531.580280] REGS: c0000001a4f6f840 TRAP: 0300   Not tainted  (4.19.0-2-powerpc64le Debian 4.19.13-1)
[ 9531.580311] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24202848  XER: 00000000
[ 9531.580337] CFAR: c00000000037ec34 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0 
               GPR00: c00000000037ec38 c0000001a4f6fac0 c0000000010a5800 c0000008b1a2ec00 
               GPR04: c0000001a0ccaf80 0000000000000800 c000000001202e60 c000000001202de0 
               GPR08: 0000000000000009 c00a000006833280 c00a000000000000 c0000000010b9fd8 
               GPR12: 0000000000002000 c000000fffff9600 00003fff40000000 0001000000000000 
               GPR16: e61fffffffffffff fffffffffffffe7f 0000000000000001 c00a0000065f48a8 
               GPR20: c0000001a0ccaf80 0002000000000000 c0000008b1a2ec00 c000000001202de0 
               GPR24: c00a000019a20000 c0000008b1a2ec00 c0000001a0ccaf80 c0000006f001c5b0 
               GPR28: c00a000006833280 c000000001202e68 00003fff3e000000 0000000000000000 
[ 9531.580483] NIP [c000000000076c64] radix__pgtable_trans_huge_withdraw+0x94/0x160
[ 9531.580506] LR [c00000000037ec38] __split_huge_pmd+0x588/0xcc0
[ 9531.580524] Call Trace:
[ 9531.580541] [c0000001a4f6fac0] [c0000001a4f6fb10] 0xc0000001a4f6fb10 (unreliable)
[ 9531.580572] [c0000001a4f6faf0] [c00000000037ebbc] __split_huge_pmd+0x50c/0xcc0
[ 9531.580605] [c0000001a4f6fbb0] [c00000000032aeb8] move_page_tables+0x438/0xd30
[ 9531.580637] [c0000001a4f6fcc0] [c00000000032b8fc] move_vma+0x14c/0x370
[ 9531.580669] [c0000001a4f6fd60] [c00000000032c0a8] sys_mremap+0x588/0x670
[ 9531.580702] [c0000001a4f6fe30] [c00000000000b9e4] system_call+0x5c/0x70
[ 9531.580732] Instruction dump:
[ 9531.580760] 0b0a0000 e9060000 e9470000 7d294030 7d2907b4 79291f24 7d2900d0 7d292038 
[ 9531.580797] 7929a402 79293664 7d2a4a14 ebe90010 <e95f0000> 7fbf5040 419e0064 7c0802a6 
[ 9531.580837] ---[ end trace 21ba871647464d8b ]---
Comment 1 Andrew Morton 2019-01-05 01:05:03 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Fri, 04 Jan 2019 22:49:52 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=202149
> 
>             Bug ID: 202149
>            Summary: NULL Pointer Dereference in __split_huge_pmd on
>                     PPC64LE

I think that trace is pointing at the ppc-specific
pgtable_trans_huge_withdraw()?

>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.19.13
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: akpm@linux-foundation.org
>           Reporter: kernel@bluematt.me
>         Regression: No
Comment 2 Aneesh Kumar KV 2019-01-08 07:26:45 UTC
Andrew Morton <akpm@linux-foundation.org> writes:

> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Fri, 04 Jan 2019 22:49:52 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202149
>> 
>>             Bug ID: 202149
>>            Summary: NULL Pointer Dereference in __split_huge_pmd on
>>                     PPC64LE
>
> I think that trace is pointing at the ppc-specific
> pgtable_trans_huge_withdraw()?
>

That is correct. 

Matt,
Can you share the .config used for the kernel? Does this happen only
with 4K page size?

-aneesh
Comment 3 Aneesh Kumar KV 2019-01-09 11:50:53 UTC
Matt Corallo <kernel@bluematt.me> writes:

> .config follows. I have not tested with 64K pages as, sadly, I have a 
> large BTRFS volume that was formatted on x86, and am thus stuck with 4K 
> pages. Note that this is roughly the Debian kernel, so it has whatever 
> patches Debian defaults to applying, a list of which follows.
>

What test are you running? I tried a 4K page size config on P9 and am
running the LTP test suite there. I also tried a few THP mremap tests.
Nothing hit this.

root@:~/tests/ltp/testcases/kernel/mem/thp# getconf  PAGESIZE
4096
root@ltc-boston123:~/tests/ltp/testcases/kernel/mem/thp# grep thp /proc/vmstat 
thp_fault_alloc 641141
thp_fault_fallback 0
thp_collapse_alloc 90
thp_collapse_alloc_failed 0
thp_file_alloc 0
thp_file_mapped 0
thp_split_page 1
thp_split_page_failed 0
thp_deferred_split_page 641150
thp_split_pmd 24
thp_zero_page_alloc 1
thp_zero_page_alloc_failed 0
thp_swpout 0
thp_swpout_fallback 0
root@:~/tests/ltp/testcases/kernel/mem/thp# 
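
For comparing counters such as thp_split_pmd before and after a test run, a
small helper can pull a single value out of /proc/vmstat-style text. This is
only a sketch: the function name vmstat_counter is made up for illustration,
and it assumes the "name value" one-per-line layout shown in the output above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in vmstat-style text ("name value" per line).
 * Hypothetical helper, not part of any existing tool.
 * Returns the value, or -1 if the name is missing or has no number.
 */
long long vmstat_counter(const char *text, const char *name)
{
	size_t name_len = strlen(name);
	const char *line = text;

	assert(name_len > 0);

	while (line && *line) {
		/* Require a space after the name so "thp_split_page"
		 * does not match "thp_split_page_failed". */
		if (strncmp(line, name, name_len) == 0 &&
		    line[name_len] == ' ') {
			long long value;

			if (sscanf(line + name_len + 1, "%lld", &value) == 1)
				return value;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

Reading /proc/vmstat into a buffer and calling vmstat_counter(buf,
"thp_split_pmd") before and after a workload gives the number of PMD splits
that workload caused.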

-aneesh
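
For reference, a THP mremap test of the kind mentioned above might look
roughly like the sketch below. It is not the LTP test and not a known
reproducer for this oops. The function name thp_mremap_once and the 2 MiB
huge page size are assumptions, and whether the kernel actually backs the
region with a huge page depends on the THP settings. The idea is to move a
range that starts one base page into a huge-page-aligned region, so the move
cannot simply carry whole huge PMDs across.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_SZ (2UL << 20)	/* assumed THP size: 2 MiB */

/* Returns 0 if the data survived the move, -1 on any failure. */
int thp_mremap_once(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t map_len = 3 * HUGE_SZ;
	unsigned char *raw, *aligned, *src, *dst, *moved;
	size_t i;
	int ret = 0;

	assert(page > 0 && (size_t)page < HUGE_SZ);

	/* Over-allocate so a huge-page-aligned window fits inside. */
	raw = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;

	aligned = (unsigned char *)(((uintptr_t)raw + HUGE_SZ - 1) &
				    ~(uintptr_t)(HUGE_SZ - 1));

	/* Best effort: ask for THP, then fault the memory in. */
	(void)madvise(aligned, 2 * HUGE_SZ, MADV_HUGEPAGE);
	memset(aligned, 0xA5, 2 * HUGE_SZ);

	/* Source range starts one base page into the aligned window. */
	src = aligned + page;

	/* Reserve a destination so MREMAP_FIXED forces a real move. */
	dst = mmap(NULL, HUGE_SZ, PROT_NONE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED) {
		munmap(raw, map_len);
		return -1;
	}

	moved = mremap(src, HUGE_SZ, HUGE_SZ,
		       MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (moved == MAP_FAILED) {
		munmap(dst, HUGE_SZ);
		munmap(raw, map_len);
		return -1;
	}

	/* The pattern written before the move must still be there. */
	for (i = 0; i < HUGE_SZ; i++) {
		if (moved[i] != 0xA5) {
			ret = -1;
			break;
		}
	}

	munmap(moved, HUGE_SZ);
	munmap(raw, map_len);	/* the moved-out hole is skipped */
	return ret;
}
```

Calling thp_mremap_once() in a loop and watching thp_split_pmd in
/proc/vmstat shows whether the split path is being exercised at all.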
Comment 4 Matt Corallo 2019-01-09 13:38:32 UTC
It's normal daily usage on a workstation (TALOS 2). I've seen it at least twice, both times in rustc, though I've run rustc more times than I can count. Note that the program that triggered it was running in lxc and it only happened after upgrading to 4.19.

Comment 5 Aneesh Kumar KV 2019-01-21 12:35:47 UTC
Can you test this patch?

From e511e79af9a314854848ea8fda9dfa6d7e07c5e4 Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Date: Mon, 21 Jan 2019 16:43:17 +0530
Subject: [PATCH] arch/powerpc/radix: Fix kernel crash with mremap

With support for split pmd lock, we use the pmd page's pmd_huge_pte pointer to
store the deposited page table. In those configs, when we move page tables we
need to make sure we move the deposited page table to the right pmd page.
Otherwise this can result in a crash when we withdraw the deposited page table,
because we can find pmd_huge_pte NULL.

c0000000004a1230 __split_huge_pmd+0x1070/0x1940
c0000000004a0ff4 __split_huge_pmd+0xe34/0x1940 (unreliable)
c0000000004a4000 vma_adjust_trans_huge+0x110/0x1c0
c00000000042fe04 __vma_adjust+0x2b4/0x9b0
c0000000004316e8 __split_vma+0x1b8/0x280
c00000000043192c __do_munmap+0x13c/0x550
c000000000439390 sys_mremap+0x220/0x7e0
c00000000000b488 system_call+0x5c/0x70

Fixes: 675d995297d4 ("powerpc/book3s64: Enable split pmd ptlock.")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 92eaea164700..86e62384256d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1262,8 +1262,6 @@ static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
 					 struct spinlock *old_pmd_ptl,
 					 struct vm_area_struct *vma)
 {
-	if (radix_enabled())
-		return false;
 	/*
 	 * Archs like ppc64 use pgtable to store per pmd
 	 * specific information. So when we switch the pmd,
Comment 6 Matt Corallo 2019-02-14 20:25:10 UTC
Hey, sorry for the delay on this. I had some apparently-unrelated hangs that I believe were due to mpt3sas instability, and at the risk of speaking too soon for a bug I couldn't reliably reproduce, this patch appears to have resolved it, thanks!

Comment 7 mpe 2019-02-15 04:06:05 UTC
Matt Corallo <kernel@bluematt.me> writes:
> Hey, sorry for the delay on this. I had some apparently-unrelated
> hangs that I believe were due to mpt3sas instability, and at the risk
> of speaking too soon for a bug I couldn't reliably reproduce, this
> patch appears to have resolved it, thanks!

Thanks.

For the archives it went upstream in a slightly different form as:

  579b9239c1f3 ("powerpc/radix: Fix kernel crash with mremap()")

  https://git.kernel.org/torvalds/c/579b9239c1f3


cheers
