Bug 219508 - commit b571cfcb9dcac187c6d967987792d37cb0688610 introduces CPU soft lockup on nfs write operations on virtual machines
Summary: commit b571cfcb9dcac187c6d967987792d37cb0688610 introduces CPU soft lockup on...
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: NFS (show other bugs)
Hardware: Intel Linux
: P3 blocking
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-11-18 12:25 UTC by Sebastien Harnist
Modified: 2024-12-09 05:27 UTC (History)
1 user (show)

See Also:
Kernel Version: 6.11 / 6.12
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output (13.78 KB, text/plain)
2024-11-18 12:25 UTC, Sebastien Harnist
Details

Description Sebastien Harnist 2024-11-18 12:25:11 UTC
Created attachment 307226 [details]
dmesg output

Hi,

the change from commit b571cfcb9dcac187c6d967987792d37cb0688610 / https://github.com/torvalds/linux/commit/b571cfcb9dcac187c6d967987792d37cb0688610 appears to be introducing a bug on the nfs stack that leads to a soft lockup when high nfs workloads are being written to a mounted nfs share within a virtual machine.
In my case it happens in a QEMU virtual machine, currently checking on how to reproduce systematically.

Igor Raits who initially reported this via email (please see https://lore.kernel.org/linux-nfs/CA+9S74gER-UFWp7fV8GnsUKMV5T32-UiKDPq4dYtW9XzG3tstw@mail.gmail.com/#R) confirmed to me that when removing this particular commit the issue does not reappear in kernel 6.11.9.

The issue is difficult to reproduce as it will sometimes take time to happen.

Please see attached dmesg, journal as below:

Nov 17 15:41:58 coreos kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 78s! [kworker/u16:7:101418]
Nov 17 15:41:58 coreos kernel: CPU#2 Utilization every 4s during lockup:
Nov 17 15:41:59 coreos kernel:         #1: 100% system,          0% softirq,          0% hardirq,          0% idle
Nov 17 15:41:59 coreos kernel:         #2: 100% system,          1% softirq,          1% hardirq,          0% idle
Nov 17 15:41:59 coreos kernel:         #3: 100% system,          0% softirq,          1% hardirq,          0% idle
Nov 17 15:41:59 coreos kernel:         #4: 100% system,          0% softirq,          0% hardirq,          0% idle
Nov 17 15:41:59 coreos kernel:         #5: 100% system,          0% softirq,          1% hardirq,          0% idle
Nov 17 15:41:59 coreos kernel: Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace veth xt_nat xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon i2c_piix4 qxl drm_ttm_helper i2c_smbus ttm joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_net net_failover failover ata_generic pata_acpi nvme_tcp nvme_fabrics nvme_keyring nvme_core serio_raw nvme_auth be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua
Nov 17 15:41:59 coreos kernel:  ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
Nov 17 15:41:59 coreos kernel: CPU: 2 UID: 0 PID: 101418 Comm: kworker/u16:7 Tainted: G             L     6.11.5-300.fc41.x86_64 #1
Nov 17 15:41:59 coreos kernel: Tainted: [L]=SOFTLOCKUP
Nov 17 15:41:59 coreos kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Nov 17 15:41:59 coreos kernel: Workqueue: writeback wb_workfn (flush-0:279)
Nov 17 15:41:59 coreos kernel: RIP: 0010:nfs_page_group_unlock+0x28/0x40 [nfs]
Nov 17 15:41:59 coreos kernel: Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb 48 8b 7f 50 48 39 df 74 05 e8 65 ff ff ff 48 8d 7b 38 f0 80 63 38 bf 48 8b 43 38 <f6> c4 10 75 06 5b c3 cc cc cc cc be 06 00 00 00 5b e9 e2 79 f1 dd
Nov 17 15:41:59 coreos kernel: RSP: 0018:ffffb8d1c9e4f820 EFLAGS: 00000206
Nov 17 15:41:59 coreos kernel: RAX: 0000000000000027 RBX: ffff97ae029a0800 RCX: ffffffffffffffe0
Nov 17 15:42:00 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff97ae029a0838
Nov 17 15:42:00 coreos kernel: RBP: ffff97ae029a0838 R08: 0000000000000018 R09: 0000000000000000
Nov 17 15:42:00 coreos kernel: R10: 0000000000000000 R11: 000000000003dc8c R12: ffff97ae029a0800
Nov 17 15:42:00 coreos kernel: R13: fffff1c5c01a0000 R14: 0000000000000001 R15: 0000000000000000
Nov 17 15:42:00 coreos kernel: FS:  0000000000000000(0000) GS:ffff97b22a500000(0000) knlGS:0000000000000000
Nov 17 15:42:00 coreos kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 17 15:42:00 coreos kernel: CR2: 00003c020063e000 CR3: 000000028a42a001 CR4: 0000000000370ef0
Nov 17 15:42:00 coreos kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 17 15:42:00 coreos kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 17 15:42:00 coreos kernel: Call Trace:
Nov 17 15:42:00 coreos kernel:  <IRQ>
Nov 17 15:42:08 coreos kernel:  ? watchdog_timer_fn.cold+0x233/0x311
Nov 17 15:42:08 coreos kernel:  ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 17 15:42:11 coreos kernel:  ? __hrtimer_run_queues+0x113/0x280
Nov 17 15:42:15 coreos kernel:  ? kvm_clock_get_cycles+0x18/0x30
Nov 17 15:42:19 coreos kernel:  ? hrtimer_interrupt+0xfa/0x210
Nov 17 15:42:21 coreos kernel:  ? __sysvec_apic_timer_interrupt+0x52/0x100
Nov 17 15:42:25 coreos kernel:  ? sysvec_apic_timer_interrupt+0x6c/0x90
Nov 17 15:42:44 coreos kernel:  </IRQ>
Nov 17 15:43:02 coreos kernel:  <TASK>

Thank for you attention and happy to test workarounds or suggestions.

Best,
  Sebastien

Note You need to log in before you can comment on or make changes to this bug.