Bug 219199

Summary: btrfs core dump in ordered-data.c:983
Product: File System Reporter: Andreas Jahn (jahn-andi)
Component: btrfsAssignee: BTRFS virtual assignee (fs_btrfs)
Status: RESOLVED CODE_FIX    
Severity: normal CC: fdmanana
Priority: P3    
Hardware: Intel   
OS: Linux   
Kernel Version: 6.10.5 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: System Description

Description Andreas Jahn 2024-08-26 10:16:58 UTC
Created attachment 306778 [details]
System Description

I'm utilizing a setup with two libvirt/qemu/kvm VMs. The diskimages for these VMs are located on a dedicated btrfs subvolume.

Since kernel 6.10.5 btrfs dumps a core within hours uptime when 
virtual machines running with > 50% CPU utilization.

The identical VM setup under kernel 6.10.3 and 6.10.4 was stable.

Please see also https://bugzilla.kernel.org/show_bug.cgi?id=219187

===================================================
Aug 20 15:02:23 cantor.console.lan kernel: assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
Aug 20 15:02:23 cantor.console.lan kernel: ------------[ cut here ]------------
Aug 20 15:02:23 cantor.console.lan kernel: kernel BUG at fs/btrfs/ordered-data.c:983!
Aug 20 15:02:23 cantor.console.lan kernel: Oops: invalid opcode: 0000 [#2] PREEMPT SMP PTI
Aug 20 15:02:23 cantor.console.lan kernel: CPU: 5 PID: 40501 Comm: worker Tainted: P      D    OE      6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b0>
Aug 20 15:02:23 cantor.console.lan kernel: Hardware name: Dell Inc. Precision 7540/0XMC3F, BIOS 1.32.0 04/01/2024
Aug 20 15:02:23 cantor.console.lan kernel: RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Aug 20 15:02:23 cantor.console.lan kernel: Code: 50 46 75 c0 e8 79 93 24 e9 0f 0b b9 d7 03 00 00 48 c7 c2 b0 b8 74 c0 48 c7 c6 40 48 75 c0 48 c7 c7 50 46 75 c0 e8 58 93 24 e9 <0f>>
Aug 20 15:02:23 cantor.console.lan kernel: RSP: 0018:ffffbb3f8415ba88 EFLAGS: 00010246
Aug 20 15:02:23 cantor.console.lan kernel: RAX: 0000000000000054 RBX: ffff9ebcedb14a88 RCX: 0000000000000000
Aug 20 15:02:23 cantor.console.lan kernel: RDX: 0000000000000000 RSI: ffff9ec27baa7800 RDI: ffff9ec27baa7800
Aug 20 15:02:23 cantor.console.lan kernel: RBP: ffff9ebd5802dc00 R08: 0000000000000000 R09: ffffbb3f8415b948
Aug 20 15:02:23 cantor.console.lan kernel: R10: ffffbb3f8415b940 R11: 0000000000000003 R12: ffff9ebb082bd000
Aug 20 15:02:23 cantor.console.lan kernel: R13: 0000000000000001 R14: ffff9ebb510adcc0 R15: 0000000000000000
Aug 20 15:02:23 cantor.console.lan kernel: FS:  00007f86f3e006c0(0000) GS:ffff9ec27ba80000(0000) knlGS:0000000000000000
Aug 20 15:02:23 cantor.console.lan kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 20 15:02:23 cantor.console.lan kernel: CR2: 00007f2d74869010 CR3: 00000002c44c4002 CR4: 00000000003726f0
Aug 20 15:02:23 cantor.console.lan kernel: Call Trace:
Aug 20 15:02:23 cantor.console.lan kernel:  <TASK>
Aug 20 15:02:23 cantor.console.lan kernel:  ? __die_body.cold+0x14/0x24
Aug 20 15:02:23 cantor.console.lan kernel:  ? die+0x2e/0x50
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_trap+0xca/0x110
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_error_trap+0x6a/0x90
Aug 20 15:02:23 cantor.console.lan kernel:  ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
Aug 20 15:02:23 cantor.console.lan kernel:  ? exc_invalid_op+0x50/0x70
Aug 20 15:02:23 cantor.console.lan kernel:  ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
Aug 20 15:02:23 cantor.console.lan kernel:  ? asm_exc_invalid_op+0x1a/0x20
Aug 20 15:02:23 cantor.console.lan kernel:  ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
Aug 20 15:02:23 cantor.console.lan kernel:  ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
Aug 20 15:02:23 cantor.console.lan kernel:  btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
Aug 20 15:02:23 cantor.console.lan kernel:  ? __seccomp_filter+0x31d/0x4f0
Aug 20 15:02:23 cantor.console.lan kernel:  __x64_sys_fdatasync+0x4f/0x90
Aug 20 15:02:23 cantor.console.lan kernel:  do_syscall_64+0x82/0x160
Aug 20 15:02:23 cantor.console.lan kernel:  ? __x64_sys_futex+0x10e/0x1d0
Aug 20 15:02:23 cantor.console.lan kernel:  ? syscall_exit_to_user_mode+0x72/0x220
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_syscall_64+0x8e/0x160
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_syscall_64+0x8e/0x160
Aug 20 15:02:23 cantor.console.lan kernel:  ? __futex_wait+0x150/0x1b0
Aug 20 15:02:23 cantor.console.lan kernel:  ? futex_wake+0x85/0x190
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_futex+0x125/0x190
Aug 20 15:02:23 cantor.console.lan kernel:  ? __x64_sys_futex+0x10e/0x1d0
Aug 20 15:02:23 cantor.console.lan kernel:  ? syscall_exit_to_user_mode+0x72/0x220
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_syscall_64+0x8e/0x160
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_futex+0xcb/0x190
Aug 20 15:02:23 cantor.console.lan kernel:  ? __x64_sys_futex+0x10e/0x1d0
Aug 20 15:02:23 cantor.console.lan kernel:  ? futex_wake+0x85/0x190
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_futex+0x125/0x190
Aug 20 15:02:23 cantor.console.lan kernel:  ? __x64_sys_futex+0x10e/0x1d0
Aug 20 15:02:23 cantor.console.lan kernel:  ? syscall_exit_to_user_mode+0x72/0x220
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_syscall_64+0x8e/0x160
Aug 20 15:02:23 cantor.console.lan kernel:  ? syscall_exit_to_user_mode+0x72/0x220
Aug 20 15:02:23 cantor.console.lan kernel:  ? do_syscall_64+0x8e/0x160
Aug 20 15:02:23 cantor.console.lan kernel:  ? switch_fpu_return+0x4f/0xd0
Aug 20 15:02:23 cantor.console.lan kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
Aug 20 15:02:23 cantor.console.lan kernel: RIP: 0033:0x7f882530911a
Aug 20 15:02:23 cantor.console.lan kernel: Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 44 c3 0f 1f 00 48 83 ec 18 89 7c 24 0c e8 c1 58 f8 ff 8b 7c 24 0c 89 c2 b8 4b 00 00 00 0f 05 <48>>
Aug 20 15:02:23 cantor.console.lan kernel: RSP: 002b:00007f86f3dff590 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Aug 20 15:02:23 cantor.console.lan kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f882530911a
Aug 20 15:02:23 cantor.console.lan kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000f
Aug 20 15:02:23 cantor.console.lan kernel: RBP: 000055f5262127e0 R08: 0000000000000000 R09: 000055f525f9ee14
Aug 20 15:02:23 cantor.console.lan kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 000055f525f9ed80
Aug 20 15:02:23 cantor.console.lan kernel: R13: 000055f4f9c8bb11 R14: 00007f86ff3ff280 R15: 00007f86f3600000
Aug 20 15:02:23 cantor.console.lan kernel:  </TASK>
Aug 20 15:02:23 cantor.console.lan kernel: Modules linked in: binfmt_misc vhost_net vhost vhost_iotlb tap tun nfsd auth_rpcgss nfs_acl lockd grace sunrpc uhid rfcomm snd_seq_dummy>
Aug 20 15:02:23 cantor.console.lan kernel:  snd_sof_intel_hda_common snd_sof_intel_hda_mlink mbcache snd_sof_intel_hda jbd2 snd_sof_pci snd_sof_xtensa_dsp snd_sof mac80211 snd_sof>
Aug 20 15:02:23 cantor.console.lan kernel:  dell_wmi_descriptor intel_wmi_thunderbolt wmi_bmof mxm_wmi e1000e rfkill i2c_mux snd_timer processor_thermal_rfim spi_intel i2c_smbus m>
Aug 20 15:02:23 cantor.console.lan kernel:  intel_lpss t10_pi typec scsi_common idma64 i2c_hid_acpi battery i2c_hid video wmi pinctrl_cannonlake serio_raw btrfs blake2b_generic li>
Aug 20 15:02:23 cantor.console.lan kernel: ---[ end trace 0000000000000000 ]---
Aug 20 15:02:23 cantor.console.lan kernel: RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Aug 20 15:02:23 cantor.console.lan kernel: Code: 50 46 75 c0 e8 79 93 24 e9 0f 0b b9 d7 03 00 00 48 c7 c2 b0 b8 74 c0 48 c7 c6 40 48 75 c0 48 c7 c7 50 46 75 c0 e8 58 93 24 e9 <0f>>
Aug 20 15:02:23 cantor.console.lan kernel: RSP: 0018:ffffbb3f8429fb30 EFLAGS: 00010246
Aug 20 15:02:23 cantor.console.lan kernel: RAX: 0000000000000054 RBX: ffff9ebcedb16cc8 RCX: 0000000000000000
Aug 20 15:02:23 cantor.console.lan kernel: RDX: 0000000000000000 RSI: ffff9ec27b827800 RDI: ffff9ec27b827800
Aug 20 15:02:23 cantor.console.lan kernel: RBP: ffff9ebb7f87b600 R08: 0000000000000000 R09: ffffbb3f8429f9f0
Aug 20 15:02:23 cantor.console.lan kernel: R10: ffffbb3f8429f9e8 R11: 0000000000000003 R12: ffff9ebb082bd000
Aug 20 15:02:23 cantor.console.lan kernel: R13: 0000000000000001 R14: ffff9ebb510add80 R15: 0000000000000000
Aug 20 15:02:23 cantor.console.lan kernel: FS:  00007f86f3e006c0(0000) GS:ffff9ec27ba80000(0000) knlGS:0000000000000000
Aug 20 15:02:23 cantor.console.lan kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 20 15:02:23 cantor.console.lan kernel: CR2: 00007f2d74869010 CR3: 00000002c44c4002 CR4: 00000000003726f0
===================================================
Comment 1 Filipe David Manana 2024-08-29 15:17:44 UTC
So this is a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=219187

Working on a fix for it, I think I know what's happening.
I'll let you known when a patch is available for testing.
Thanks.
Comment 2 Filipe David Manana 2024-08-30 10:10:50 UTC
Here's the fix:

https://lore.kernel.org/linux-btrfs/717029440fe379747b9548a9c91eb7801bc5a813.1724972507.git.fdmanana@suse.com/

That applies only against 6.11-rc releases. For a version that applies to the latest 6.10 stable release (6.10.7):

https://gist.githubusercontent.com/fdmanana/e032236302bf5e65fb9728fca98c5eb0/raw/456f1c2c30c338964085d6ba016dcf330aa10fb4/gistfile1.txt
Comment 3 Andreas Jahn 2024-08-30 19:49:31 UTC
Many thanks for fixing this bug so quickly - highly appreciated!

I will test it during our next maintenance window on Monday, 2024-09-02.
Comment 4 Andreas Jahn 2024-09-03 17:21:53 UTC
Current Status: I applied your patch to the oS Tumbleweed 6.10.7-1-default kernel and started a rather heavy work load run approx. 19h ago.
Four (instead of two) VM's wrote during this time data sets to disk more than three times bigger as "usual".
Before starting the test with the btrfs patch applied I checked the plain oS 6.10.5-1-default kernel again and got the 'btrfs core dump in ordered-data.c:983' after 12 minutes runtime,


Test Result: The system with the patched btrfs is still running stable (as expected).


Conclusion: From my point of view this bug is fixed and our admin team started to roll out the patched 6.10.7 kernel to all our servers.


Thanks again for your swift support and for providing a "special" patch for Kernel 6.10.7 to us. Well done!