Bug 216529
| Summary: | [fstests generic/048] BUG: Kernel NULL pointer dereference at 0x00000069, filemap_release_folio+0x88/0xb0 | | |
|---|---|---|---|
| Product: | File System | Reporter: | Zorro Lang (zlang) |
| Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
| Status: | NEW --- | | |
| Severity: | normal | | |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 6.0.0-rc6+ | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Zorro Lang
2022-09-25 11:55:29 UTC
On Sun, Sep 25, 2022 at 11:55:29AM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=216529
>
> Hit a panic on ppc64le, by running generic/048 with 1k block size:

Hmm, does this reproduce reliably for you? I test with a 1k block size on x86_64 as a proxy for 4k block sizes on PPC64, where the blocksize < pagesize... and this isn't reproducing for me on x86, and I don't have access to a PPC64LE system.

Ritesh, is this something you can take a look at? Thanks!

- Ted

(In reply to Theodore Tso from comment #1)
> On Sun, Sep 25, 2022 at 11:55:29AM +0000, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=216529
> >
> > Hit a panic on ppc64le, by running generic/048 with 1k block size:
>
> Hmm, does this reproduce reliably for you? I test with a 1k block
> size on x86_64 as a proxy for 4k block sizes on PPC64, where the blocksize
> < pagesize... and this isn't reproducing for me on x86, and I don't
> have access to a PPC64LE system.

Hi Ted,

Yes, it's reproducible for me, I just reproduced it again on another ppc64le (P8) machine [1]. But it's not easy to reproduce by running generic/048 (maybe there's a better way to reproduce it). And this time the call trace is a little different, so it might be a folio [mm] related bug? Maybe I should cc the linux-mm list to get more people checking?

Thanks,
Zorro

[ 1254.857035] run fstests generic/048 at 2022-09-26 12:12:26
[ 1257.651002] EXT4-fs (sda3): mounted filesystem with ordered data mode. Quota mode: none.
[ 1257.666754] EXT4-fs (sda3): shut down requested (1)
[ 1257.666773] Aborting journal on device sda3-8.
[ 1257.696046] EXT4-fs (sda3): unmounting filesystem.
[ 1259.216580] EXT4-fs (sda3): mounted filesystem with ordered data mode. Quota mode: none.
[ 1273.042962] restraintd[2251]: *** Current Time: Mon Sep 26 12:12:45 2022 Localwatchdog at: Wed Sep 28 11:54:44 2022
[ 1333.319238] restraintd[2251]: *** Current Time: Mon Sep 26 12:13:45 2022 Localwatchdog at: Wed Sep 28 11:54:44 2022
[ 1394.828503] restraintd[2251]: *** Current Time: Mon Sep 26 12:14:47 2022 Localwatchdog at: Wed Sep 28 11:54:44 2022
[ 1403.799008] BUG: Kernel NULL pointer dereference at 0x00000062
[ 1403.799218] Faulting instruction address: 0xc00000000068edfc
[ 1403.799228] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1403.799233] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[ 1403.799241] Modules linked in: ext4 mbcache jbd2 bonding tls rfkill sunrpc pseries_rng drm fuse drm_panel_orientation_quirks xfs libcrc32c sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp vmx_crypto
[ 1403.799280] CPU: 4 PID: 82 Comm: kswapd0 Kdump: loaded Not tainted 6.0.0-rc7 #1
[ 1403.799293] NIP: c00000000068edfc LR: c00000000068f2a8 CTR: 0000000000000000
[ 1403.799300] REGS: c00000000a44b560 TRAP: 0380 Not tainted (6.0.0-rc7)
[ 1403.799308] MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28028244 XER: 00000001
[ 1403.799327] CFAR: c00000000068ede4 IRQMASK: 0
[ 1403.799327] GPR00: c00000000068f2a8 c00000000a44b800 c000000002cf1700 c00c0000001c0bc0
[ 1403.799327] GPR04: c00000000a44b860 0000000000000002 00000003fb290000 c000000002de7dc8
[ 1403.799327] GPR08: 0000000ae4f08f42 0000000000000000 c00c0000001c0bc0 0000000000008000
[ 1403.799327] GPR12: 00000003fb290000 c00000000ffcc080 c000000000194288 c0000003fff9c480
[ 1403.799327] GPR16: c000000069d30050 0000000000000007 0000000000000000 0000000000000000
[ 1403.799327] GPR20: 0000000000000001 c00000000a44b8f8 c00000000146bad8 5deadbeef0000100
[ 1403.799327] GPR24: 5deadbeef0000122 c000000069d30000 c00000000a44bc00 c00000000a44b8e8
[ 1403.799327] GPR28: c00000000a44b860 c00c0000001c0bc0 0000000000000002 0000000000000002
[ 1403.799413] NIP [c00000000068edfc] drop_buffers.constprop.0+0x4c/0x1c0
[ 1403.799423] LR [c00000000068f2a8] try_to_free_buffers+0x128/0x150
[ 1403.799431] Call Trace:
[ 1403.799434] [c00000000a44b840] [c00000000a44bc00] 0xc00000000a44bc00
[ 1403.799443] [c00000000a44b890] [c0000000004986f8] filemap_release_folio+0x88/0xb0
[ 1403.799452] [c00000000a44b8b0] [c0000000004c51b0] shrink_active_list+0x490/0x750
[ 1403.799462] [c00000000a44b9b0] [c0000000004c9f78] shrink_lruvec+0x3f8/0x430
[ 1403.799470] [c00000000a44baa0] [c0000000004ca1e4] shrink_node_memcgs+0x234/0x290
[ 1403.799478] [c00000000a44bb10] [c0000000004ca3b4] shrink_node+0x174/0x6b0
[ 1403.799486] [c00000000a44bbc0] [c0000000004cace0] balance_pgdat+0x3f0/0x970
[ 1403.799494] [c00000000a44bd20] [c0000000004cb430] kswapd+0x1d0/0x450
[ 1403.799501] [c00000000a44bdc0] [c0000000001943c8] kthread+0x148/0x150
[ 1403.799510] [c00000000a44be10] [c00000000000cbe4] ret_from_kernel_thread+0x5c/0x64
[ 1403.799520] Instruction dump:
[ 1403.799525] fbc1fff0 f821ffc1 7c7d1b78 7c9c2378 ebc30028 7fdff378 48000018 60000000
[ 1403.799540] 60000000 ebff0008 7c3ef840 41820048 <815f0060> e93f0000 5529077c 7d295378
[ 1403.799554] ---[ end trace 0000000000000000 ]---
[ 1403.806330] [-- MARK -- Mon Sep 26 16:15:00 2022]
[ 1415.093395] EXT4-fs (sda3): shut down requested (2)
[ 1415.093410] Aborting journal on device sda3-8.
[ 1429.107188] EXT4-fs (sda3): unmounting filesystem.
[ 1429.926262] EXT4-fs (sda3): recovery complete
[ 1429.983938] EXT4-fs (sda3): mounted filesystem with ordered data mode. Quota mode: none.
[ 1429.988189] EXT4-fs (sda3): unmounting filesystem.
[ 1430.166549] EXT4-fs (sda3): mounted filesystem with ordered data mode. Quota mode: none.
[ 1453.015796] restraintd[2251]: *** Current Time: Mon Sep 26 12:15:45 2022 Localwatchdog at: Wed Sep 28 11:54:44 2022
[ 1454.708150] EXT4-fs (sda5): unmounting filesystem.
[ 1455.225112] EXT4-fs (sda3): unmounting filesystem.
[ 1456.128026] EXT4-fs (sda3): mounted filesystem with ordered data mode. Quota mode: none.
[ 1456.139102] EXT4-fs (sda3): unmounting filesystem.
[ 1456.396367] EXT4-fs (sda5): mounted filesystem with ordered data mode. Quota mode: none.
[ 1462.317449] EXT4-fs (sda3): mounted filesystem with ordered data mode. Quota mode: none.
[ 1462.326680] EXT4-fs (sda3): unmounting filesystem.
[ 1462.427320] EXT4-fs (sda5): unmounting filesystem.
[ 1463.259690] EXT4-fs (sda5): mounted filesystem with ordered data mode. Quota mode: none.

> > Ritesh, is this something you can take a look at? Thanks!
> >
> > - Ted

On Tue, Sep 27, 2022 at 12:47:02AM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=216529
>
> Yes, it's reproducible for me, I just reproduced it again on another ppc64le
> (P8) machine [1]. But it's not easy to reproduce by running generic/048
> (maybe there's a better way to reproduce it).

Can you give a rough percentage of how often it reproduces? e.g., does it reproduce 10% of the time? 50% of the time? 2-3 times after 100 tries, so 2-3%? etc.

If it reproduces but rarely, it'll be a lot harder to try to bisect.

Something perhaps to try is to enable KASAN, since both stack traces seem to involve a null pointer dereference while trying to free buffers. Maybe that will give us some hints towards the cause....

Thanks,

- Ted

On 22/09/26 01:02AM, Theodore Ts'o wrote:
> On Sun, Sep 25, 2022 at 11:55:29AM +0000, bugzilla-daemon@kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=216529
> >
> > Hit a panic on ppc64le, by running generic/048 with 1k block size:
>
> Hmm, does this reproduce reliably for you? I test with a 1k block
> size on x86_64 as a proxy for 4k block sizes on PPC64, where the blocksize
> < pagesize... and this isn't reproducing for me on x86, and I don't
> have access to a PPC64LE system.
>
> Ritesh, is this something you can take a look at? Thanks!

I was away for some personal work for the last few days, but I am back to work from today.
Sure, I will take a look at this and will get back.

I did give this test a couple of runs though, but wasn't able to reproduce it. But let me try a few more things along with more iterations. Will update accordingly.

-ritesh

On 22/09/27 11:40PM, Ritesh Harjani (IBM) wrote:
> On 22/09/26 01:02AM, Theodore Ts'o wrote:
> > On Sun, Sep 25, 2022 at 11:55:29AM +0000, bugzilla-daemon@kernel.org wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=216529
> > >
> > > Hit a panic on ppc64le, by running generic/048 with 1k block size:
> >
> > Hmm, does this reproduce reliably for you? I test with a 1k block
> > size on x86_64 as a proxy for 4k block sizes on PPC64, where the blocksize
> > < pagesize... and this isn't reproducing for me on x86, and I don't
> > have access to a PPC64LE system.
> >
> > Ritesh, is this something you can take a look at? Thanks!
>
> I was away for some personal work for the last few days, but I am back to work
> from today. Sure, I will take a look at this and will get back.
>
> I did give this test a couple of runs though, but wasn't able to reproduce
> it. But let me try a few more things along with more iterations. Will update
> accordingly.

I thought I had updated this. But I guess I forgot to update on this mail thread...

I tested this for quite some time in a loop and also gave it an overnight run, but I couldn't hit this issue. I had kept a low-memory guest, so that we could see more reclaim activity (which I also confirmed with perf trace, checking that we were going through that path while the test was running).

I am not sure whether this could be a timing issue or what. Maybe if you could share your defconfig, I could give it a try on my setup.

-ritesh
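
A side note on the second oops, for anyone reading the trace: the bracketed word in the instruction dump, <815f0060>, can be decoded by hand to check that it agrees with the reported fault address. The sketch below is only an illustration and was not posted by anyone in this thread. It assumes the bracketed word is the faulting instruction and that it is a D-form load (primary opcode 32 is lwz in the Power ISA). decode_ppc_d_form is a hypothetical helper name, and the GPR31 value is copied from the register dump.

```python
def decode_ppc_d_form(word: int) -> dict:
    """Split a 32-bit PowerPC D-form instruction word into its fields.

    Hypothetical helper for illustration; it only handles the D-form
    layout (6-bit opcode, 5-bit RT, 5-bit RA, 16-bit signed displacement).
    """
    opcode = (word >> 26) & 0x3F   # primary opcode, top 6 bits
    rt = (word >> 21) & 0x1F       # target register
    ra = (word >> 16) & 0x1F       # base register
    d = word & 0xFFFF              # 16-bit displacement
    if d & 0x8000:                 # sign-extend the displacement
        d -= 0x10000
    return {"opcode": opcode, "rt": rt, "ra": ra, "d": d}


# Assumption: the word shown in <...> in the instruction dump is the
# instruction at the faulting NIP.
FAULTING_WORD = 0x815F0060

# Value of GPR31 taken from the register dump in the oops.
GPR31 = 0x0000000000000002

fields = decode_ppc_d_form(FAULTING_WORD)
effective_address = GPR31 + fields["d"]

print(fields)
print(hex(effective_address))
```

If those assumptions hold, the word decodes to opcode 32 with RT=10, RA=31 and a displacement of 0x60, i.e. a load from 0x60 past r31. With r31 = 0x2 that gives 0x62, matching the "NULL pointer dereference at 0x00000062" line. This only shows the trace is self-consistent; it does not say where the bad pointer came from.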