Hi,

After upgrading a backup NAS server from kernel 5.16.16 to 6.0.6, the system began to freeze (a hardware reset is needed to reboot). Inspecting the log, I found that a NULL pointer dereference occurs just before the freeze:

BUG: kernel NULL pointer dereference, address: 000000000000000c
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 6 PID: 1571474 Comm: kworker/6:1 Tainted: G W 6.0.6-arch1-1 #1 a46cc4b882cfc11c3bbb09d6a0fab3dcad53b5c2
Hardware name: Gigabyte Technology Co., Ltd. B550M DS3H/B550M DS3H, BIOS F13 07/08/2021
Workqueue: events nfsd_file_gc_worker [nfsd]
RIP: 0010:nfsd_file_lru_cb+0x36/0x1f0 [nfsd]
Code: 53 48 89 f7 8b 50 48 48 8d 58 f8 83 fa 01 0f 87 a4 00 00 00 48 8b 50 20 48 85 d2 74 20 f6 42 44 02 74 1a 48 8b 92 d8 00 00 00 <f7> 42 0c 00 00 00 18 74 0a 0f 1f 44 00 00 e9 bf 00 00 00 f0 48 0f
RSP: 0018:ffffaab09780fd68 EFLAGS: 00010202
RAX: ffff8e614c6935b8 RBX: ffff8e614c6935b0 RCX: ffffaab09780fe50
RDX: 0000000000000000 RSI: ffff8e61459bf388 RDI: ffff8e61459bf388
RBP: ffff8e61459bf380 R08: ffffaab09780fe50 R09: ffffaab09780fe48
R10: ffff8e61459bf380 R11: 0000000000000000 R12: ffff8e614c6935b8
R13: ffffffffc08c3020 R14: ffff8e61459bf388 R15: ffff8e614c6935b8
FS: 0000000000000000(0000) GS:ffff8e7ffe380000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000000c CR3: 000000069dc34000 CR4: 0000000000350ee0
Call Trace:
<TASK>
__list_lru_walk_one+0xb6/0x1d0
? nfsd_file_key_hashfn+0x60/0x60 [nfsd 0d4f7161ec4af5d335a43572ddfe34915b30f27a]
list_lru_walk_node+0x72/0x150
? nfsd_file_key_hashfn+0x60/0x60 [nfsd 0d4f7161ec4af5d335a43572ddfe34915b30f27a]
nfsd_file_gc_worker+0x201/0x310 [nfsd 0d4f7161ec4af5d335a43572ddfe34915b30f27a]
process_one_work+0x1c7/0x380
worker_thread+0x51/0x390
? rescuer_thread+0x3b0/0x3b0
kthread+0xde/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
Modules linked in: tun rpcsec_gss_krb5 veth nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rpcrdma rdma_cm iw_cm ib_cm ib_core iscsi_target_mod target_core_mod intel_rapl_msr intel_rapl_common gigabyte_wmi wmi_bmof edac_mce_amd kvm_amd amdgpu kvm irqbypass snd_hda_codec_realtek crct10dif_pclmul crc32_pclmul snd_hda_codec_generic polyval_clmulni ledtrig_audio snd_hda_codec_hdmi polyval_generic gf128mul snd_hda_intel ghash_clmulni_intel snd_intel_dspcfg aesni_intel snd_intel_sdw_acpi sp5100_tco crypto_simd snd_hda_codec cryptd pcspkr rapl k10temp i2c_piix4 snd_hda_core gpu_sched drm_buddy snd_hwdep drm_ttm_helper snd_pcm ccp ttm r8169 snd_timer rng_core snd realtek drm_display_helper mdio_devres soundcore cec libphy mousedev wmi video gpio_amdpt gpio_generic mac_hid acpi_cpufreq nls_iso8859_1 vfat fat bridge stp llc cfg80211 rfkill nfsd auth_rpcgss nfs_acl lockd grace tcp_htcp dm_multipath sunrpc fuse bpf_preload ip_tables x_tables ext4 crc16 mbcache jbd2 usbhid dm_thin_pool dm_persistent_data libcrc32c crc32c_generic dm_bio_prison dm_bufio dm_mod raid1 md_mod nvme nvme_core crc32c_intel nvme_common xhci_pci xhci_pci_renesas
CR2: 000000000000000c
---[ end trace 0000000000000000 ]---

I don't know whether there is a memory leak, but I noticed that the freeze generally occurs after heavy NFS load (nightly backup) and only when just a few megabytes of memory are free (128 GB system with 60 to 80 GB occupied by cache at the time of the freeze). Perhaps it happens at the moment the GC starts (nfsd_file_gc_worker)?

The system runs nfs-utils 2.6.2, and the main workload is generated by Longhorn v1.3.2 backup jobs, so it goes through NFS v4.

Mandraxx.
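As a cross-check on the trace above, the faulting address can be reproduced from the register dump and the marked byte in the Code: line. The Python sketch below is only an illustration, not a general disassembler. It handles just the one instruction form seen here (opcode f7 /0, i.e. TEST r/m32, imm32, with an 8-bit displacement), and it assumes there is no REX prefix because the bytes before the marker (48 8b 92 d8 00 00 00) already form a complete instruction. The helper names parse_registers, faulting_bytes and decode_test_disp8 are made up for this example.

```python
import re

# Excerpt of the oops above: only the lines this sketch needs.
OOPS = """\
BUG: kernel NULL pointer dereference, address: 000000000000000c
Code: 53 48 89 f7 8b 50 48 48 8d 58 f8 83 fa 01 0f 87 a4 00 00 00 48 8b 50 20 48 85 d2 74 20 f6 42 44 02 74 1a 48 8b 92 d8 00 00 00 <f7> 42 0c 00 00 00 18 74 0a 0f 1f 44 00 00 e9 bf 00 00 00 f0 48 0f
RAX: ffff8e614c6935b8 RBX: ffff8e614c6935b0 RCX: ffffaab09780fe50
RDX: 0000000000000000 RSI: ffff8e61459bf388 RDI: ffff8e61459bf388
CR2: 000000000000000c CR3: 000000069dc34000 CR4: 0000000000350ee0
"""

# ModRM r/m encoding for the low eight registers (no REX prefix assumed).
REG_NAMES = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]


def parse_registers(text):
    """Return {register name: value} for every 'NAME: <16 hex digits>' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the Code: bytes starting at the <xx> marker (the faulting RIP)."""
    tokens = re.search(r"Code: (.*)", text).group(1).split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [int(tok.strip("<>"), 16) for tok in tokens[start:]]


def decode_test_disp8(insn):
    """Decode 'f7 /0 disp8 imm32' only; anything else raises ValueError."""
    opcode, modrm, disp8 = insn[0], insn[1], insn[2]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if opcode != 0xF7 or reg != 0 or mod != 1 or rm == 4:
        raise ValueError("instruction form not handled by this sketch")
    if disp8 >= 0x80:  # disp8 is a signed byte
        disp8 -= 0x100
    imm32 = int.from_bytes(bytes(insn[3:7]), "little")
    return REG_NAMES[rm], disp8, imm32


regs = parse_registers(OOPS)
base, disp, imm = decode_test_disp8(faulting_bytes(OOPS))
print(f"test dword [{base}{disp:+#x}], {imm:#x}")
print(f"{base} = {regs[base]:#x}, computed address = {regs[base] + disp:#x}, "
      f"CR2 = {regs['CR2']:#x}")
```

Here the base register is RDX, which the dump shows as zero, so the read at offset 0xc lands on address 0xc and matches CR2. That fits a NULL pointer having been loaded into RDX just before the test. Which structure field it corresponds to in the nfsd source is not established by this calculation.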
Regarding the differences between kernels 5.16.16 and 6.0.x, I wonder whether this commit could have changed the behavior of __list_lru_walk_one and introduced the bug: https://github.com/torvalds/linux/commit/5abc1e37afa0335c52608d640fd30910b2eeda21 Mandraxx.
This is not my area of expertise, but there were recently a few fixes in the NFS code. It might be wise to test whether the issue still occurs with the latest code; ideally, test mainline.
Yes, I had that in mind. I first encountered the bug with 6.0.6 and upgraded to 6.0.8, with the same behavior (the crash occurred after 6 days of uptime). I just downgraded my server to 5.15.79 (LTS) to check that it becomes stable again.
Can you try v6.1-rc? There have been some recent fixes in that area.
I suspect what is happening is the nfsd_file being examined in nfsd_file_lru_cb() is getting freed elsewhere, and the resulting reuse of that memory triggers a bad pointer dereference.
Hi, Sorry, I have not had much time since our last contact. First of all, I wish you a happy new year ;-) I just upgraded my server from 5.15.79 LTS (which has been stable for 2 months now) to v6.1.4: let's see if it is stable again.
Hi, The server has been stable for 24 days now with kernel v6.1.4. It usually crashed after 14 to 15 days with 6.0.x. So, I think the issue is fixed. Thank you for your help.