Bug 205563 - nfs4_open_prepare causes unable to handle kernel paging request at ffffffffffffffb0
Summary: nfs4_open_prepare causes unable to handle kernel paging request at ffffffffff...
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: NFS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-11-18 17:43 UTC by harald.schilly
Modified: 2021-05-31 12:39 UTC (History)
4 users (show)

See Also:
Kernel Version: 5.0.0-1025-gcp
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Linux 5.4.39 messages (`dmesg` excerpt) (108.32 KB, text/plain)
2021-04-08 14:06 UTC, Paul Menzel
Details

Description harald.schilly 2019-11-18 17:43:09 UTC
This kernel crash happens on client nodes, mounting shares from an NFS server version 4.2. I tried to find it here in bugzilla, but wasn't able to. I don't know how to reproduce it, it seems to happen randomly.

$ cat /proc/version
Linux version 5.0.0-1025-gcp (buildd@lcy01-amd64-001) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #26~18.04.1-Ubuntu SMP Mon Nov 11 13:09:18 UTC 2019


$ nfsstat -m
/storage/projects/5 from 10.101.129.72:/
 Flags: rw,sync,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=1200,retrans=2,sec=sys,clientaddr=10.240.0.19,local_lock=none,addr=10.101.129.72

[and a couple more, same options]


kernel panic:

[162137.872207] BUG: unable to handle kernel paging request at ffffffffffffffb0
[162137.879425] #PF error: [normal kernel read fault]
[162137.884355] PGD 5c0212067 P4D 5c0212067 PUD 5c0214067 PMD 0
[162137.890322] Oops: 0000 [#1] SMP PTI
[162137.894118] CPU: 0 PID: 713192 Comm: kworker/u8:1 Tainted: P        W  OE     5.0.0-1025-gcp #26~18.04.1-Ubuntu
[162137.904680] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[162137.914354] Workqueue: rpciod rpc_async_schedule [sunrpc]
[162137.920009] RIP: 0010:nfs4_open_prepare+0x84/0x1f0 [nfsv4]
[162137.925744] Code: 44 24 20 00 00 00 00 48 8d b3 a8 00 00 00 4c 89 e7 e8 a0 f7 ff ff 5b 41 5c 41 5d 41 5e 5d c3 48 8b 83 80 03 00 00 48 8b 40 38 <48> 8b 78 b0 48 85 ff 74 10 8b 43 2c 8b 57 34 21 c2 39 d0 0f 84 d6
[162137.944829] RSP: 0018:ffffab1a4a1a3de8 EFLAGS: 00010246
[162137.950279] RAX: 0000000000000000 RBX: ffff9d99edd81400 RCX: 0000000000000004
[162137.957639] RDX: 0000000000008000 RSI: 0000000000000001 RDI: ffff9d9d1f605b40
[162137.965003] RBP: ffffab1a4a1a3e08 R08: 0000646f69637072 R09: 8080808080808080
[162137.972355] R10: ffffffffa9603d00 R11: fefefefefefefeff R12: ffff9d9cb402de00
[162137.979727] R13: 0000000000000004 R14: ffff9d9f331aa400 R15: ffffffffc04d9a40
[162137.987260] FS:  0000000000000000(0000) GS:ffff9d9f65a00000(0000) knlGS:0000000000000000
[162137.995615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[162138.001705] CR2: ffffffffffffffb0 CR3: 0000000637698001 CR4: 00000000001606f0
[162138.009067] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[162138.016420] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[162138.023773] Call Trace:
[162138.026648]  rpc_prepare_task+0x1f/0x30 [sunrpc]
[162138.031517]  __rpc_execute+0x7a/0x3f0 [sunrpc]
[162138.036317]  rpc_async_schedule+0x12/0x20 [sunrpc]
[162138.041361]  process_one_work+0x1fd/0x400
[162138.045685]  worker_thread+0x34/0x410
[162138.049744]  kthread+0x121/0x140
[162138.053356]  ? process_one_work+0x400/0x400
[162138.057822]  ? kthread_park+0xb0/0xb0
[162138.061702]  ret_from_fork+0x35/0x40
[162138.065478] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache xt_u32 xt_set xt_multiport iptable_mangle iptable_raw ip_set_hash_ip ip_set_hash_net ip_set ipip tunnel4 ip_tunnel veth ip6table_nat nf_nat_ipv6 xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_comment xt_mark ipt_MASQUERADE xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat br_netfilter bridge stp llc overlay aufs zfs(POE) zunicode(POE) zavl(POE) icp(POE) zlua(POE) zcommon(POE) znvpair(POE) spl(OE) ip6table_filter ip6_tables iptable_filter bpfilter binfmt_misc nls_iso8859_1 input_leds serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi sunrpc ip_tables x_tables btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64
[162138.065547]  crypto_simd cryptd glue_helper virtio_net psmouse net_failover failover
[162138.161736] CR2: ffffffffffffffb0
[162138.165273] ---[ end trace 3547cb652b75e419 ]---
[162138.170310] RIP: 0010:nfs4_open_prepare+0x84/0x1f0 [nfsv4]
[162138.176139] Code: 44 24 20 00 00 00 00 48 8d b3 a8 00 00 00 4c 89 e7 e8 a0 f7 ff ff 5b 41 5c 41 5d 41 5e 5d c3 48 8b 83 80 03 00 00 48 8b 40 38 <48> 8b 78 b0 48 85 ff 74 10 8b 43 2c 8b 57 34 21 c2 39 d0 0f 84 d6
[162138.196198] RSP: 0018:ffffab1a4a1a3de8 EFLAGS: 00010246
[162138.201639] RAX: 0000000000000000 RBX: ffff9d99edd81400 RCX: 0000000000000004
[162138.209016] RDX: 0000000000008000 RSI: 0000000000000001 RDI: ffff9d9d1f605b40
[162138.216508] RBP: ffffab1a4a1a3e08 R08: 0000646f69637072 R09: 8080808080808080
[162138.223941] R10: ffffffffa9603d00 R11: fefefefefefefeff R12: ffff9d9cb402de00
[162138.231435] R13: 0000000000000004 R14: ffff9d9f331aa400 R15: ffffffffc04d9a40
[162138.238891] FS:  0000000000000000(0000) GS:ffff9d9f65a00000(0000) knlGS:0000000000000000
[162138.247191] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[162138.253549] CR2: ffffffffffffffb0 CR3: 0000000637698001 CR4: 00000000001606f0
[162138.260916] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[162138.268261] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[162138.276039] Kernel panic - not syncing: Fatal exception
[162138.282680] Kernel Offset: 0x27000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[162138.293691] Rebooting in 10 seconds..
[162148.220763] ACPI MEMORY or I/O RESET_REG.
Comment 1 Trond Myklebust 2019-11-19 12:20:18 UTC
Please reproduce without any ZFS or other proprietary modules loaded.
Comment 2 Jasper Mattsson 2019-12-23 10:14:49 UTC
Here's my reproduction without ZFS or proprietary modules. There are other out-of-tree drivers though: vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE). 

/proc/version:
Linux version 5.4.6-arch1-1 (linux@archlinux) (gcc version 9.2.0 (GCC)) #1 SMP PREEMPT Sat, 21 Dec 2019 16:34:41 +0000

Mounts:

redacted.redacted.com:/srv/export on /mnt/nasroot type nfs4 (rw,nosuid,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=192.168.1.8,local_lock=none,addr=192.168.1.20)
redacted.redacted.com:/mnt/cryptroot/data on /mnt/nascrypt type nfs4 (rw,nosuid,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=192.168.1.8,local_lock=none,addr=192.168.1.20)


[ 8157.118266] BUG: unable to handle page fault for address: ffffffffffffffb0
[ 8157.118269] #PF: supervisor read access in kernel mode
[ 8157.118270] #PF: error_code(0x0000) - not-present page
[ 8157.118270] PGD 53a20f067 P4D 53a20f067 PUD 53a211067 PMD 0
[ 8157.118272] Oops: 0000 [#1] PREEMPT SMP PTI
[ 8157.118274] CPU: 6 PID: 19255 Comm: kworker/u16:0 Tainted: G           OE     5.4.6-arch1-1 #1
[ 8157.118275] Hardware name: System manufacturer System Product Name/Z170 PRO GAMING, BIOS 3501 06/23/2017
[ 8157.118288] Workqueue: rpciod rpc_async_schedule [sunrpc]
[ 8157.118301] RIP: 0010:nfs4_get_valid_delegation+0x7/0x30 [nfsv4]
[ 8157.118303] Code: 6f fd 5c e5 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 f0 80 4f 48 08 c3 0f 1f 44 00 00 0f 1f 44 00 00 41 54 <4c> 8b 67 b0 31 f6 4c 89 e7 e8 3b fa ff ff 84 c0 b8 00 00 00 00 4c
[ 8157.118304] RSP: 0018:ffffb59d883f7dd8 EFLAGS: 00010202
[ 8157.118305] RAX: ffff9729768a3780 RBX: ffff972c14ec4800 RCX: 0000000000000000
[ 8157.118305] RDX: 0000000000028000 RSI: 0000000000000001 RDI: 0000000000000000
[ 8157.118306] RBP: ffff973150409c00 R08: 0000646f69637072 R09: 8080808080808080
[ 8157.118307] R10: ffff972bff432a2c R11: 0000000000000018 R12: ffff97314f848400
[ 8157.118307] R13: 0000000000000004 R14: 0000000000004281 R15: ffffffffc058e8b0
[ 8157.118308] FS:  0000000000000000(0000) GS:ffff973156b80000(0000) knlGS:0000000000000000
[ 8157.118309] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8157.118310] CR2: ffffffffffffffb0 CR3: 000000038613a006 CR4: 00000000003606e0
[ 8157.118311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8157.118311] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8157.118312] Call Trace:
[ 8157.118320]  nfs4_open_prepare+0x85/0x1f0 [nfsv4]
[ 8157.118329]  __rpc_execute+0x7c/0x3a0 [sunrpc]
[ 8157.118338]  rpc_async_schedule+0x29/0x40 [sunrpc]
[ 8157.118340]  process_one_work+0x1e2/0x3b0
[ 8157.118342]  worker_thread+0x4a/0x3d0
[ 8157.118343]  kthread+0xfb/0x130
[ 8157.118345]  ? process_one_work+0x3b0/0x3b0
[ 8157.118346]  ? kthread_park+0x90/0x90
[ 8157.118348]  ret_from_fork+0x35/0x40
[ 8157.118350] Modules linked in: tun fuse rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache intel_rapl_msr intel_rapl_common amdgpu ath9k ath9k_common x86_pkg_temp_thermal intel_powerclamp ath9k_hw coretemp gpu_sched snd_usb_audio nls_iso8859_1 i2c_algo_bit snd_hda_codec_hdmi nls_cp437 ttm vfat joydev mei_hdcp iTCO_wdt kvm_intel snd_hda_intel ath fat snd_usbmidi_lib snd_intel_nhlt drm_kms_helper iTCO_vendor_support btusb mac80211 snd_hda_codec snd_rawmidi btrtl kvm btbcm snd_seq_device drm snd_hda_core mc irqbypass btintel eeepc_wmi snd_hwdep asus_wmi intel_cstate battery bluetooth intel_uncore cfg80211 wacom input_leds sparse_keymap wmi_bmof r8169 intel_rapl_perf agpgart snd_pcm mxm_wmi realtek syscopyarea ecdh_generic sysfillrect snd_timer rfkill mousedev ecc e1000e libphy snd libarc4 sysimgblt mei_me i2c_i801 mei fb_sys_fops soundcore evdev mac_hid vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) auth_rpcgss sg sunrpc crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2
[ 8157.118372]  dm_crypt dm_mod hid_generic usbhid hid uas usb_storage sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ahci serio_raw libahci aesni_intel libata xhci_pci crypto_simd cryptd scsi_mod glue_helper xhci_hcd i8042 wmi atkbd libps2 serio
[ 8157.118380] CR2: ffffffffffffffb0
[ 8157.118381] ---[ end trace 91c52b50d9ef8bfb ]---
[ 8157.118390] RIP: 0010:nfs4_get_valid_delegation+0x7/0x30 [nfsv4]
[ 8157.118392] Code: 6f fd 5c e5 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 f0 80 4f 48 08 c3 0f 1f 44 00 00 0f 1f 44 00 00 41 54 <4c> 8b 67 b0 31 f6 4c 89 e7 e8 3b fa ff ff 84 c0 b8 00 00 00 00 4c
[ 8157.118393] RSP: 0018:ffffb59d883f7dd8 EFLAGS: 00010202
[ 8157.118393] RAX: ffff9729768a3780 RBX: ffff972c14ec4800 RCX: 0000000000000000
[ 8157.118394] RDX: 0000000000028000 RSI: 0000000000000001 RDI: 0000000000000000
[ 8157.118395] RBP: ffff973150409c00 R08: 0000646f69637072 R09: 8080808080808080
[ 8157.118395] R10: ffff972bff432a2c R11: 0000000000000018 R12: ffff97314f848400
[ 8157.118396] R13: 0000000000000004 R14: 0000000000004281 R15: ffffffffc058e8b0
[ 8157.118397] FS:  0000000000000000(0000) GS:ffff973156b80000(0000) knlGS:0000000000000000
[ 8157.118398] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8157.118398] CR2: ffffffffffffffb0 CR3: 000000038613a006 CR4: 00000000003606e0
[ 8157.118399] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8157.118399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Comment 3 Paul Menzel 2021-04-08 08:17:25 UTC
Reproduced on Linux 5.4.39 with XFS on the remote server.
Comment 4 bfields 2021-04-08 13:47:11 UTC
(In reply to Paul Menzel from comment #3)
> Reproduced on Linux 5.4.39 with XFS on the remote server.

Could we see the message from the kernel logs?
Comment 5 Paul Menzel 2021-04-08 14:06:54 UTC
Created attachment 296295 [details]
Linux 5.4.39 messages (`dmesg` excerpt)

Sure, please find them attached.
Comment 6 Bruno Santos 2021-05-31 12:15:19 UTC
The crash pattern seems similar to one that occurs with kernels 4.18.0-193.el8.x86_64 (CentOS 8 and RHEL 8) and 5.4.0-33.37 on Ubuntu Focal:
* https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1885010
* https://access.redhat.com/solutions/5606431 (am unable to see the solution)

The patch with the commit comment "nfs: fix NULL deference in nfs4_get_valid_delegation" seems to be what fixes this issue, as mentioned on the bug tracker for Ubuntu's Linux kernels and on the following Oracle's kernel patch logs: https://linux.oracle.com/errata/ELSA-2020-2427.html

And I quote:

  [4.18.0-193.5.1_2]
    - [fs] nfs: fix NULL deference in nfs4_get_valid_delegation ('J. Bruce Fields') [1837969 1831553]
Comment 7 Paul Menzel 2021-05-31 12:39:11 UTC
The referenced commit was added in Linux 5.7-rc6 [1].

    $ git describe 29fe839976266bc7c55b927360a1daae57477723
    v5.7-rc5-2-g29fe83997626

It was applied to the Linux 5.4 stable series in v5.4.42 (commit d1538d8d). So, it was not applied yet, when we experienced it.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=29fe839976266bc7c55b927360a1daae57477723

Note You need to log in before you can comment on or make changes to this bug.