Bug 215696 - Kernel Oops since kernel-5.17 on dual socket Intel Xeon Gold servers - kernel NULL pointer dereference
Status: RESOLVED CODE_FIX
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P1 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-03-17 11:20 UTC by Jirka Hladky
Modified: 2022-04-25 21:42 UTC

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:


Description Jirka Hladky 2022-03-17 11:20:48 UTC
Since kernel 5.17 (tested with rc2, rc4, rc7, and rc8) we have been seeing kernel oopses on dual-socket Intel Xeon Gold servers (2x Xeon Gold 6126 CPU).

Below are a backtrace and the dmesg log.

I have trouble creating a simple reproducer: the oops happens at random places while preparing the NAS benchmark to be run. The preparation creates a bunch of directories, compiles the benchmark, and starts trial runs.

Could you please help narrow down the problem?

The reports below were created with Fedora-Rawhide-20220316.n.0 running the 5.17.0-0.rc8.123.fc37 kernel, with the following setting:
echo 1 > /proc/sys/kernel/panic_on_oops

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/5.17.0-0.rc8.123.fc37.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 48
        DATE: Thu Mar 17 02:49:40 CET 2022
      UPTIME: 00:02:50
LOAD AVERAGE: 0.32, 0.10, 0.03
       TASKS: 608
    NODENAME: gold-2s-c.tpb.lab.eng.brq.redhat.com
     RELEASE: 5.17.0-0.rc8.123.fc37.x86_64
     VERSION: #1 SMP PREEMPT Mon Mar 14 18:11:49 UTC 2022
     MACHINE: x86_64  (2600 Mhz)
      MEMORY: 94.7 GB
       PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)


crash> bt
PID: 2480   TASK: ffff9e8f76cb8000  CPU: 26  COMMAND: "umount"
#0 [ffffae00cacbfbb8] machine_kexec at ffffffffbb068980
#1 [ffffae00cacbfc08] __crash_kexec at ffffffffbb1a300a
#2 [ffffae00cacbfcc8] crash_kexec at ffffffffbb1a4045
#3 [ffffae00cacbfcd0] oops_end at ffffffffbb02c410
#4 [ffffae00cacbfcf0] page_fault_oops at ffffffffbb076a38
#5 [ffffae00cacbfd68] exc_page_fault at ffffffffbbd0b7c1
#6 [ffffae00cacbfd90] asm_exc_page_fault at ffffffffbbe00ace
   [exception RIP: kernfs_remove+7]
   RIP: ffffffffbb421f67  RSP: ffffae00cacbfe48  RFLAGS: 00010246
   RAX: 0000000000000001  RBX: ffffffffbce31e58  RCX: 0000000080200018
   RDX: 0000000080200019  RSI: ffffdfbd44161640  RDI: 0000000000000000
   RBP: ffffffffbce31e58   R8: 0000000000000000   R9: 0000000080200018
   R10: ffff9e8f05859e80  R11: ffff9e9443b1bd98  R12: ffff9ea057f1d000
   R13: ffffffffbce31e60  R14: dead000000000122  R15: dead000000000100
   ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#7 [ffffae00cacbfe58] rdt_kill_sb at ffffffffbb05074b
#8 [ffffae00cacbfea8] deactivate_locked_super at ffffffffbb36ce1f
#9 [ffffae00cacbfec0] cleanup_mnt at ffffffffbb39176e
#10 [ffffae00cacbfee8] task_work_run at ffffffffbb10703c
#11 [ffffae00cacbff08] exit_to_user_mode_prepare at ffffffffbb17a399
#12 [ffffae00cacbff28] syscall_exit_to_user_mode at ffffffffbbd0bde8
#13 [ffffae00cacbff38] do_syscall_64 at ffffffffbbd071a6
#14 [ffffae00cacbff50] entry_SYSCALL_64_after_hwframe at ffffffffbbe0007c
   RIP: 00007f442c75126b  RSP: 00007ffc82d66fe8  RFLAGS: 00000202
   RAX: 0000000000000000  RBX: 000055bd4cc37090  RCX: 00007f442c75126b
   RDX: 0000000000000001  RSI: 0000000000000001  RDI: 000055bd4cc3b950
   RBP: 000055bd4cc371a8   R8: 0000000000000000   R9: 0000000000000073
   R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000000001
   R13: 000055bd4cc3b950  R14: 000055bd4cc372c0  R15: 000055bd4cc37090
   ORIG_RAX: 00000000000000a6  CS: 0033  SS: 002b

[2] dmesg
[  172.776553] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  172.783513] #PF: supervisor read access in kernel mode
[  172.788652] #PF: error_code(0x0000) - not-present page
[  172.793793] PGD 0 P4D 0  
[  172.796330] Oops: 0000 [#1] PREEMPT SMP PTI
[  172.800519] CPU: 26 PID: 2480 Comm: umount Kdump: loaded Not tainted 5.17.0-0.rc8.123.fc37.x86_64 #1
[  172.809645] Hardware name: Supermicro Super Server/X11DDW-L, BIOS 2.0b 03/07/2018
[  172.817123] RIP: 0010:kernfs_remove+0x7/0x50
[  172.821397] Code: e8 be e7 2c 00 48 89 df e8 b6 8c f0 ff 48 c7 c3 f4 ff ff ff 48 89 d8 5b 5d 41 5c 41 5d 41 5e c3 cc 66 90 0f 1f 44 00 00 55 53 <48> 8b 47 08 48 89 fb 48 85 c0 48 0f 44 c7 48 8b 68 50 48 83 c5 60
[  172.840141] RSP: 0018:ffffae00cacbfe48 EFLAGS: 00010246
[  172.845367] RAX: 0000000000000001 RBX: ffffffffbce31e58 RCX: 0000000080200018
[  172.852501] RDX: 0000000080200019 RSI: ffffdfbd44161640 RDI: 0000000000000000
[  172.859632] RBP: ffffffffbce31e58 R08: 0000000000000000 R09: 0000000080200018
[  172.866764] R10: ffff9e8f05859e80 R11: ffff9e9443b1bd98 R12: ffff9ea057f1d000
[  172.873899] R13: ffffffffbce31e60 R14: dead000000000122 R15: dead000000000100
[  172.881033] FS:  00007f442c53c800(0000) GS:ffff9e9429000000(0000) knlGS:0000000000000000
[  172.889117] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  172.894861] CR2: 0000000000000008 CR3: 000000010ba96006 CR4: 00000000007706e0
[  172.901997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  172.909127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  172.916261] PKRU: 55555554
[  172.918974] Call Trace:
[  172.921427]  <TASK>
[  172.923533]  rdt_kill_sb+0x29b/0x350
[  172.927112]  deactivate_locked_super+0x2f/0xa0
[  172.931559]  cleanup_mnt+0xee/0x180
[  172.935051]  task_work_run+0x5c/0x90
[  172.938629]  exit_to_user_mode_prepare+0x229/0x230
[  172.943424]  syscall_exit_to_user_mode+0x18/0x40
[  172.948043]  do_syscall_64+0x46/0x80
[  172.951623]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  172.956675] RIP: 0033:0x7f442c75126b
[  172.960271] Code: cb 1b 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 91 1b 0e 00 f7 d8
[  172.979017] RSP: 002b:00007ffc82d66fe8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a6
[  172.986584] RAX: 0000000000000000 RBX: 000055bd4cc37090 RCX: 00007f442c75126b
[  172.993715] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 000055bd4cc3b950
[  173.000849] RBP: 000055bd4cc371a8 R08: 0000000000000000 R09: 0000000000000073
[  173.007980] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000001
[  173.015115] R13: 000055bd4cc3b950 R14: 000055bd4cc372c0 R15: 000055bd4cc37090
[  173.022249]  </TASK>
[  173.024440] Modules linked in: rfkill intel_rapl_msr intel_rapl_common isst_if_common irdma skx_edac nfit libnvdimm ice x86_pkg_temp_thermal intel_powerclamp coretemp ib_uverbs iTCO_wdt intel_pmc_bxt ib_core iTCO_vendor_support kvm_intel ipmi_ssif kvm irqbypass rapl acpi_ipmi intel_cstate i40e joydev mei_me ioatdma i2c_i801 intel_uncore lpc_ich i2c_smbus mei intel_pch_thermal dca ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter fuse zram xfs crct10dif_pclmul ast crc32_pclmul crc32c_intel drm_vram_helper drm_ttm_helper ttm wmi ghash_clmulni_intel
[  173.073900] CR2: 0000000000000008
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-03-19 12:15:17 UTC
You are likely better off reporting the issue by mail, as explained in https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html; I assume you won't reach the right people here, and I suspect hardly anybody will see a bug reported against this particular component. Also: some developers don't care about distro kernels, even if they are patched only lightly. So you will increase your chances by reproducing this with a vanilla kernel.
Comment 2 Artem S. Tashkinov 2022-03-21 12:36:40 UTC
Would be great if you tried to bisect it.
Comment 3 Jirka Hladky 2022-03-21 23:42:39 UTC
Thorsten, thanks a lot for the hint! 

I have started a mailing thread here:
https://lore.kernel.org/lkml/CAE4VaGDZr_4wzRn2___eDYRtmdPaGGJdzu_LCSkJYuY9BEO3cw@mail.gmail.com/
Comment 4 Jirka Hladky 2022-03-21 23:52:31 UTC
(In reply to Artem S. Tashkinov from comment #2)
> Would be great if you tried to bisect it.

I will try that later this week when I get access to the server.
Comment 5 Jirka Hladky 2022-03-30 22:21:52 UTC
I have found the commit causing the trouble (shown below). Any ideas on how to fix it?


$ git bisect visualize 
commit 393c3714081a53795bbff0e985d24146def6f57f (refs/bisect/bad)
Author: Minchan Kim <minchan@kernel.org>
Date:   Thu Nov 18 15:00:08 2021 -0800

    kernfs: switch global kernfs_rwsem lock to per-fs lock
    
    The kernfs implementation has big lock granularity(kernfs_rwsem) so
    every kernfs-based(e.g., sysfs, cgroup) fs are able to compete the
    lock. It makes trouble for some cases to wait the global lock
    for a long time even though they are totally independent contexts
    each other.
    
    A general example is process A goes under direct reclaim with holding
    the lock when it accessed the file in sysfs and process B is waiting
    the lock with exclusive mode and then process C is waiting the lock
    until process B could finish the job after it gets the lock from
    process A.
    
    This patch switches the global kernfs_rwsem to per-fs lock, which
    put the rwsem into kernfs_root.
    
    Suggested-by: Tejun Heo <tj@kernel.org>
    Acked-by: Tejun Heo <tj@kernel.org>
    Signed-off-by: Minchan Kim <minchan@kernel.org>
    Link: https://lore.kernel.org/r/20211118230008.2679780-1-minchan@kernel.org
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Comment 6 Jirka Hladky 2022-04-25 21:42:12 UTC
The issue is fixed by this patch:

https://lore.kernel.org/all/YmLznjFdpblHzZiM@google.com/

Fixes: 393c3714081a (kernfs: switch global kernfs_rwsem lock to per-fs lock)
Reported-by: Jirka Hladky <jhladky@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 fs/kernfs/dir.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 61a8edc4ba8b..e205fde7163a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1406,7 +1406,12 @@ static void __kernfs_remove(struct kernfs_node *kn)
  */
 void kernfs_remove(struct kernfs_node *kn)
 {
-	struct kernfs_root *root = kernfs_root(kn);
+	struct kernfs_root *root;
+
+	if (!kn)
+		return;
+
+	root = kernfs_root(kn);
 
 	down_write(&root->kernfs_rwsem);
 	__kernfs_remove(kn);
-- 

I'm going to close this BZ.
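
For readers who want to see the shape of the fix outside the kernel tree, here is a minimal userspace sketch of the same NULL-guard pattern. Everything in it is a hypothetical stand-in: `mock_node`, `mock_root`, `mock_node_root()`, `mock_remove()`, and `mock_selfcheck()` are illustrative names, not kernel definitions, and the struct layout is an assumption (two 4-byte counters ahead of the parent pointer), chosen because it would be consistent with the faulting address 0x8 seen in the oops when the node pointer (RDI) is 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real kernel definitions. */
struct mock_root {
    int removals;               /* counts removals performed under this root */
};

struct mock_node {
    int count;                  /* assumed 4-byte counter */
    int active;                 /* assumed 4-byte counter */
    struct mock_node *parent;   /* lands at offset 8 when int is 4 bytes */
    struct mock_root *root;
};

/*
 * Root lookup in the style of the unpatched code path: it dereferences kn
 * unconditionally, so calling it with NULL reads from address 0 + 8.
 */
static struct mock_root *mock_node_root(struct mock_node *kn)
{
    if (kn->parent)
        kn = kn->parent;
    return kn->root;
}

/*
 * Guarded removal, mirroring the patch: bail out before the root lookup
 * when kn is NULL. Returns 0 for the NULL no-op case, 1 otherwise.
 */
static int mock_remove(struct mock_node *kn)
{
    struct mock_root *root;

    if (!kn)
        return 0;

    root = mock_node_root(kn);
    root->removals++;           /* stands in for lock + remove + unlock */
    return 1;
}

/* Small self-check exercising both paths. */
static void mock_selfcheck(void)
{
    struct mock_root root = { 0 };
    struct mock_node kn = { 0, 0, NULL, &root };

    assert(mock_remove(NULL) == 0);   /* NULL is a harmless no-op */
    assert(mock_remove(&kn) == 1);    /* a valid node is removed */
    assert(root.removals == 1);
}
```

The point of the sketch is only the ordering: the NULL check has to come before anything that looks up the root through the node, because that lookup is itself the first dereference.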
