Bug 218902

Summary: Kernel crash/reboot on cifs copy
Product: File System
Reporter: dufgrinder
Component: Samba/SMB
Assignee: fs_samba-smb
Status: NEW
Severity: blocking
CC: kernelbugzilla.jim, pc, regressions, smfrench, toracat
Priority: P3
Hardware: Intel
OS: Linux
Kernel Version: 6.8.4
Subsystem:
Regression: Yes
Bisected commit-id:

Description dufgrinder 2024-05-27 20:24:38 UTC
The computer crashes when copying files from a local drive to a CIFS-mounted file system.

This bug has existed since kernel 6.8.4 and is still present in the latest kernel, 6.9.

Steps to reproduce (on AlmaLinux 9):

Boot with kernel 6.9.1-2 or 6.8.4.

Open a terminal and copy some files to the CIFS-mounted directory. (The copy can be performed either with cp * ~/COMMUN or using Thunar.)

--> The computer freezes and reboots, producing an /boot/initram.....kdump.img file

The bug is not present in kernel 5.14.0-427 or in 6.7.9.
The bug does not affect read operations from the CIFS-mounted directory.

Mount options:
//192.168.1.112/commun /home/olivier/COMMUN cifs noauto,x-systemd.automount,user,nosuid,gid=100,uid=1026,credentials=/home/olivier/passwd 0 0
-----------------------------------------------------------
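For readers less familiar with fstab syntax, here is a minimal Python sketch that splits the entry above into its six fields and its option list. The function name parse_fstab_entry is hypothetical, and the sketch assumes no escaped spaces (\040) in any field, which holds for this entry.

```python
def parse_fstab_entry(line):
    """Split one fstab line into its six whitespace-separated fields."""
    device, mountpoint, fstype, options, dump, passno = line.split()
    opts = {}
    for item in options.split(","):
        # "uid=1026" -> key "uid", value "1026"; bare flags map to True
        key, _, value = item.partition("=")
        opts[key] = value or True
    return {
        "device": device,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": opts,
        "dump": int(dump),
        "passno": int(passno),
    }


entry = parse_fstab_entry(
    "//192.168.1.112/commun /home/olivier/COMMUN cifs "
    "noauto,x-systemd.automount,user,nosuid,gid=100,uid=1026,"
    "credentials=/home/olivier/passwd 0 0"
)
print(entry["fstype"], sorted(entry["options"]))
```

Note that noauto combined with x-systemd.automount means the share is mounted on first access rather than at boot, and that no SMB dialect (vers=) is given, so the client negotiates one with the server.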

The /var/log/messages file contains the following:
May 19 09:41:12 deyme18 kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cifsd:1959]
May 19 09:41:12 deyme18 kernel: Modules linked in: snd_seq_dummy(E) snd_hrtimer(E) nls_utf8(E) cifs(E) cifs_arc4(E) nls_ucs2_utils(E) rdma_cm(E) iw_cm(E) ib_cm(E) ib_core(E) cifs_md4(E) dns_resolver(E) netfs(E) snd_hda_codec_hdmi(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) sunrpc(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) nf_tables(E) libcrc32c(E) nfnetlink(E) snd_sof_pci_intel_skl(E) snd_sof_intel_hda_common(E) soundwire_intel(E) soundwire_generic_allocation(E) snd_sof_intel_hda_mlink(E) soundwire_cadence(E) snd_sof_intel_hda(E) snd_sof_pci(E) snd_sof_xtensa_dsp(E) snd_sof(E) snd_sof_utils(E) soundwire_bus(E) snd_soc_avs(E) snd_soc_hda_codec(E) snd_soc_skl(E) snd_soc_hdac_hda(E) snd_hda_ext_core(E) snd_soc_sst_ipc(E) snd_soc_sst_dsp(E) snd_soc_acpi_intel_match(E) snd_soc_acpi(E) snd_soc_core(E) vfat(E) fat(E) intel_rapl_msr(E) iwlmvm(E)
May 19 09:41:12 deyme18 kernel: snd_compress(E) intel_rapl_common(E) snd_pcm_dmaengine(E) ac97_bus(E) snd_hda_intel(E) intel_uncore_frequency(E) intel_uncore_frequency_common(E) intel_tcc_cooling(E) snd_intel_dspcfg(E) snd_intel_sdw_acpi(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_hda_codec(E) coretemp(E) jc42(E) mac80211(E) kvm_intel(E) libarc4(E) uvcvideo(E) regmap_i2c(E) snd_hda_core(E) i915(E) ee1004(E) iTCO_wdt(E) btusb(E) iwlwifi(E) intel_pmc_bxt(E) kvm(E) uvc(E) videobuf2_vmalloc(E) snd_hwdep(E) iTCO_vendor_support(E) videobuf2_memops(E) btrtl(E) snd_seq(E) snd_seq_device(E) btintel(E) btbcm(E) irqbypass(E) videobuf2_v4l2(E) btmtk(E) snd_pcm(E) videodev(E) cfg80211(E) rapl(E) videobuf2_common(E) bluetooth(E) intel_cstate(E) mc(E) intel_uncore(E) mei_me(E) intel_gtt(E) snd_timer(E) drm_buddy(E) i2c_i801(E) pcspkr(E) i2c_algo_bit(E) i2c_smbus(E) drm_display_helper(E) snd(E) mei(E) soundcore(E) ttm(E) rfkill(E) intel_pch_thermal(E) joydev(E) cec(E) intel_pmc_core(E) drm_kms_helper(E) intel_xhci_usb_role_switch(E) intel_hid(E)
May 19 09:41:12 deyme18 kernel: sparse_keymap(E) intel_vsec(E) pmt_telemetry(E) acpi_pad(E) pmt_class(E) drm(E) ext4(E) mbcache(E) jbd2(E) hid_logitech_hidpp(E) hid_logitech_dj(E) rtsx_pci_sdmmc(E) mmc_core(E) ahci(E) nvme(E) crct10dif_pclmul(E) crc32_pclmul(E) libahci(E) crc32c_intel(E) polyval_clmulni(E) polyval_generic(E) nvme_core(E) libata(E) r8169(E) rtsx_pci(E) ghash_clmulni_intel(E) t10_pi(E) video(E) wmi(E) serio_raw(E) fuse(E)
May 19 09:41:12 deyme18 kernel: CPU: 2 PID: 1959 Comm: cifsd Kdump: loaded Tainted: G E 6.8.4-1.el9.elrepo.x86_64 #1
May 19 09:41:12 deyme18 kernel: Hardware name: PC Specialist LTD N2x0WU /N2x0WU , BIOS 1.07.18 02/15/2019
May 19 09:41:12 deyme18 kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x72/0x2d0
May 19 09:41:12 deyme18 kernel: Code: 08 0f 92 c2 8b 45 00 0f b6 d2 c1 e2 08 30 e4 09 d0 a9 00 01 ff ff 0f 85 f2 01 00 00 85 c0 74 12 0f b6 45 00 84 c0 74 0a f3 90 <0f> b6 45 00 84 c0 75 f6 b8 01 00 00 00 66 89 45 00 5b 5d 41 5c 41
May 19 09:41:12 deyme18 kernel: RSP: 0018:ffffad7d40d3bd60 EFLAGS: 00000202
May 19 09:41:12 deyme18 kernel: RAX: 0000000000000001 RBX: ffffffffc1ede988 RCX: 0000000000000000
May 19 09:41:12 deyme18 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc1ede988
May 19 09:41:12 deyme18 kernel: RBP: ffffffffc1ede988 R08: 0000000000000000 R09: ffff9f5a4b100b40
May 19 09:41:12 deyme18 kernel: R10: ffff9f5a259eb1c8 R11: 01d9e3d51959a6a5 R12: ffff9f5a03db0038
May 19 09:41:12 deyme18 kernel: R13: ffff9f5a516f4ec0 R14: ffff9f5a259eb000 R15: 0000000000000000
May 19 09:41:12 deyme18 kernel: FS: 0000000000000000(0000) GS:ffff9f5d5ec80000(0000) knlGS:0000000000000000
May 19 09:41:12 deyme18 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 19 09:41:12 deyme18 kernel: CR2: 000055fa37ada5c0 CR3: 00000002d7a1e001 CR4: 00000000003706f0
May 19 09:41:12 deyme18 kernel: Call Trace:
May 19 09:41:12 deyme18 kernel: <IRQ>
May 19 09:41:12 deyme18 kernel: ? watchdog_timer_fn+0x261/0x2f0
May 19 09:41:12 deyme18 kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
May 19 09:41:12 deyme18 kernel: ? __hrtimer_run_queues+0x10f/0x2b0
May 19 09:41:12 deyme18 kernel: ? hrtimer_interrupt+0x106/0x240
May 19 09:41:12 deyme18 kernel: ? __sysvec_apic_timer_interrupt+0x6b/0x180
May 19 09:41:12 deyme18 kernel: ? sysvec_apic_timer_interrupt+0x9d/0xd0
May 19 09:41:12 deyme18 kernel: </IRQ>
May 19 09:41:12 deyme18 kernel: <TASK>
May 19 09:41:12 deyme18 kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
May 19 09:41:12 deyme18 kernel: ? native_queued_spin_lock_slowpath+0x72/0x2d0
May 19 09:41:12 deyme18 kernel: _raw_spin_lock+0x30/0x40
May 19 09:41:12 deyme18 kernel: __cifs_put_smb_ses+0x53/0x440 [cifs]
May 19 09:41:12 deyme18 kernel: smb2_find_smb_tcon+0x61/0xd0 [cifs]
May 19 09:41:12 deyme18 kernel: smb2_handle_cancelled_mid+0x42/0x90 [cifs]
May 19 09:41:12 deyme18 kernel: __release_mid+0x8a/0xb0 [cifs]
May 19 09:41:12 deyme18 kernel: cifs_demultiplex_thread+0x2fc/0x790 [cifs]
May 19 09:41:12 deyme18 kernel: ? __pfx_cifs_demultiplex_thread+0x10/0x10 [cifs]
May 19 09:41:12 deyme18 kernel: kthread+0xee/0x120
May 19 09:41:12 deyme18 kernel: ? __pfx_kthread+0x10/0x10
May 19 09:41:12 deyme18 kernel: ret_from_fork+0x2d/0x50
May 19 09:41:12 deyme18 kernel: ? __pfx_kthread+0x10/0x10
May 19 09:41:12 deyme18 kernel: ret_from_fork_asm+0x1b/0x30
May 19 09:41:12 deyme18 kernel: </TASK>
May 19 09:41:23 deyme18 systemd-logind[719]: The system will reboot now!
May 19 09:41:23 deyme18 systemd-logind[719]: System is rebooting.
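To check whether other machines hit the same lockup, the watchdog lines can be pulled out of /var/log/messages with a short script. This is a minimal sketch: the regular expression is written against the single line format seen in the log above, and find_soft_lockups is a hypothetical helper name.

```python
import re

# Pattern based on the watchdog line in the log above.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Yield (cpu, seconds, task, pid) for each soft-lockup report."""
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            yield int(m["cpu"]), int(m["secs"]), m["task"], int(m["pid"])


sample = (
    "May 19 09:41:12 deyme18 kernel: watchdog: BUG: soft lockup - "
    "CPU#2 stuck for 26s! [cifsd:1959]"
)
print(list(find_soft_lockups([sample])))
```

To scan a real log, pass an open file object instead of the sample list, for example find_soft_lockups(open("/var/log/messages")).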
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-05-30 05:55:15 UTC
(In reply to dufgrinder from comment #0)
> 
> This bug exists since kernel 6.8.4

So 6.8.3 works fine? Or did you mean 6.7.y worked fine? Then a bisection would be good (and might be required): https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
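For a first-time bisection, the opening commands can be sketched as below. The tags v6.7 and v6.8 are an assumption: the report only establishes that 6.7.9 is good and 6.8.4 is bad, so if 6.8.3 turns out to work, the range would instead be v6.8.3 to v6.8.4 in the stable tree. The helper bisect_commands is hypothetical and only prints the commands; it does not run git.

```python
import shlex


def bisect_commands(good, bad):
    """Return the git commands that start a bisection between two revisions."""
    return [
        ["git", "bisect", "start"],
        ["git", "bisect", "good", good],
        ["git", "bisect", "bad", bad],
    ]


# Assumed range; adjust once the last good and first bad versions are known.
for cmd in bisect_commands("v6.7", "v6.8"):
    print(shlex.join(cmd))
```

After these commands git checks out a commit to test. Build and boot that kernel, repeat the CIFS copy, then run git bisect good or git bisect bad depending on the result, until git names the first bad commit.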
Comment 2 Akemi Yagi 2024-05-31 19:19:22 UTC
According to the ELRepo bug tracker:

https://elrepo.org/bugs/view.php?id=1454

There is one other person who reported what looks like the same issue. kernel 6.7.9-1 works fine.
Comment 3 dufgrinder 2024-05-31 19:56:12 UTC
Yes, the ELRepo reporter is me.
The bug appears on two of my computers that were recently updated to AlmaLinux 9.4.

I'll try to do the bisection, but it will be my first on a kernel!

KR, Olivier
Comment 4 Akemi Yagi 2024-06-03 19:24:00 UTC
Just for info, the other reporter on the ELRepo's bug tracker runs CentOS Stream 9.
Comment 5 Paulo Alcantara 2024-06-06 16:59:20 UTC
FWIW, there is a patch [1] that has been sent to the mailing list that is a good candidate for fixing this deadlock.

[1] https://lore.kernel.org/r/20240606161313.25521-1-ematsumiya@suse.de
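The trace in the description shows cifsd spinning in _raw_spin_lock inside __cifs_put_smb_ses, called from smb2_find_smb_tcon. One plausible reading, stated here as an assumption and not verified against the kernel source or the patch, is that the caller already holds the lock that the callee tries to take. The following user-space Python analogy illustrates that pattern; ses_lock, put_session, find_tcon_buggy and find_tcon_fixed are hypothetical names, not kernel code.

```python
import threading

# Non-reentrant lock, standing in for a kernel spinlock.
ses_lock = threading.Lock()


def put_session(timeout=0.1):
    """Hypothetical release helper that needs ses_lock itself."""
    if not ses_lock.acquire(timeout=timeout):
        return False  # in the kernel this would spin forever
    try:
        return True
    finally:
        ses_lock.release()


def find_tcon_buggy():
    with ses_lock:
        # Lock is still held here, so put_session cannot take it.
        return put_session()


def find_tcon_fixed():
    with ses_lock:
        pass  # the lookup would happen under the lock
    # Lock has been dropped before calling the helper.
    return put_session()


print(find_tcon_buggy(), find_tcon_fixed())
```

The timeout exists only so the demonstration terminates. A spinlock has no timeout, which would match the soft lockup reported by the watchdog.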
Comment 6 Akemi Yagi 2024-06-07 10:13:23 UTC
Using the patch provided in Comment 5, I have built a kernel-ml package set, kernel-ml-6.9.3-1.1.el9.elrepo. 

https://toracat.org/test/kernel/bug218902/

Can you test and see if the issue is fixed?
Comment 7 Jim 2024-06-08 22:30:33 UTC
(In reply to Akemi Yagi from comment #6)
> Using the patch provided in Comment 5, I have built a kernel-ml package set,
> kernel-ml-6.9.3-1.1.el9.elrepo. 
> 
> https://toracat.org/test/kernel/bug218902/
> 
> Can you test and see if the issue is fixed?

I get the same crash with this kernel with CentOS Stream 9
Comment 8 Akemi Yagi 2024-06-08 22:44:34 UTC
> I get the same crash with this kernel with CentOS Stream 9

Apparently, the candidate patch did not fix the issue. 

(dufgrinder also reported the same result.)
Comment 9 dufgrinder 2024-06-09 14:54:27 UTC
Yes, crash confirmed with the candidate fix (kernel-ml-6.9.3-1.1.el9.elrepo) on AlmaLinux 9.4.
Comment 10 Steve French 2024-06-10 19:14:49 UTC
Anyone try it on 6.10-rc3 yet?
Comment 11 Akemi Yagi 2024-06-10 21:45:59 UTC
We (elrepo) just built kernel-ml-6.10.0-0.rc3.el9.elrepo. We'll make it available for testing.
Comment 12 Akemi Yagi 2024-06-10 22:01:49 UTC
kernel-ml-6.10.0-0.rc3.el9.elrepo.x86_64 is available here:

https://elrepo.org/people/akemi/testing/el9/kernel/218902/x86_64/
Comment 13 Jim 2024-06-11 00:38:13 UTC
kernel-ml-6.10.0-0.rc3.el9.elrepo.x86_64 worked for me with CentOS Stream 9. Writing to a CIFS mount no longer crashes my system.
Comment 14 Akemi Yagi 2024-06-11 00:47:18 UTC
That's good news. kernel-ml-6.10 GA version is expected to be out in mid July.
Comment 15 Paulo Alcantara 2024-06-11 01:02:28 UTC
Steve, you need to figure out which commit fixed the copy regression so it can be backported to v6.8.y and v6.9.y -- if it hasn't been backported yet. Note that they reported two different oopses. The one from the description seems to be fixed by [1].

[1] https://lore.kernel.org/r/20240606161313.25521-1-ematsumiya@suse.de
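Finding the commit that fixed a problem is a bisection with the roles reversed, which git supports through custom terms. The sketch below only builds and prints the commands. The range v6.9 to v6.10-rc3 is an assumption: the thread reports 6.9.1 and 6.9.3 as broken and 6.10-rc3 as working, but mainline v6.9 itself was not tested here. The helper fix_bisect_commands is hypothetical.

```python
import shlex


def fix_bisect_commands(broken, fixed):
    """Return git commands that bisect for the first *fixed* commit."""
    return [
        ["git", "bisect", "start", "--term-old", "broken", "--term-new", "fixed"],
        ["git", "bisect", "broken", broken],
        ["git", "bisect", "fixed", fixed],
    ]


# Assumed range; v6.9 itself was not tested in this report.
for cmd in fix_bisect_commands("v6.9", "v6.10-rc3"):
    print(shlex.join(cmd))
```

At each step the tester marks the kernel with git bisect broken if the CIFS copy still crashes, or git bisect fixed if it completes.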
Comment 16 dufgrinder 2024-06-11 18:09:24 UTC
Hi,

kernel-ml-6.10.0-0.rc3.el9.elrepo.x86_64 worked for me using AlmaLinux 9.4 on two laptops (an Intel i5-8250U and an old Intel Pentium B970).

Tested with the same kind of copy to the CIFS-mounted directory.
Comment 17 Jim 2024-08-24 03:24:33 UTC
I think this can be closed.

CIFS writes still work with kernel-ml-6.10.6-1.el9.elrepo.x86_64 (CentOS Stream 9).
Comment 18 dufgrinder 2024-08-24 06:50:36 UTC
I confirm that CIFS copy works correctly now.

This bug can be closed.