Bug 217170
| Summary: | amdgpu failed to resume involving AMD IOMMU with 6.2.2-301 kernel resulting in a black screen | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Matt Fagnani (matt.fagnani) |
| Component: | IOMMU | Assignee: | drivers_iommu |
| Status: | RESOLVED PATCH_ALREADY_AVAILABLE | | |
| Severity: | normal | CC: | regressions |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 6.2.2-301.fc38 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Attachments: | The kernel log for a boot when I clicked Sleep in sddm, tried to resume the system, and the problem happened. | | |
Description
Matt Fagnani
2023-03-10 03:10:41 UTC
Please report here: https://gitlab.freedesktop.org/drm/amd/-/issues

It's the primary hub for AMD driver development.

(In reply to Artem S. Tashkinov from comment #1)
> Please report here: https://gitlab.freedesktop.org/drm/amd/-/issues

Matt afaics did that already. And I'm not even sure if that's the right move (it might though), as this might be another IOMMU issue. But with a vendor kernel this will be hard to get traction anyway. Might be wise to test 6.2.3 (likely out today or tomorrow; rc is available; contains the same patches, but a lot of other fixes as well) and mainline. Packages with them are (or in the case of 6.2.3: will be) available in these repositories: https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories

Thanks. I reported this problem at https://gitlab.freedesktop.org/drm/amd/-/issues/2454 and https://bugzilla.redhat.com/show_bug.cgi?id=2177111

I reported this problem here against the IOMMU subsystem since the problem doesn't happen if the AMD IOMMU is disabled with amd_iommu=off on the kernel command line, and Alex Deucher requested that I report the previous black screen during boot problem here against the IOMMU subsystem: https://gitlab.freedesktop.org/drm/amd/-/issues/2319#note_1699814

Should I email iommu@lists.linux.dev or another mailing list about this problem?

The latest Fedora Rawhide build kernel-6.3.0-0.rc1.20230309git6a98c9cae232.18.fc39.x86_64 has this resume problem. kernel-6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39.x86_64 is the first Rawhide kernel without the black screen during boot problem https://gitlab.freedesktop.org/drm/amd/-/issues/2319 and it has this failure to resume problem. The previous build kernel-6.3.0-0.rc0.20230223gita5c95ca18a98.4.fc39.x86_64 had the black screen during boot, so I'm unsure how to test such kernels for this resume problem since it's necessary to use amdgpu and have the IOMMU enabled for it to happen.
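When comparing boots with and without the workaround mentioned above, it can help to confirm that a given boot really had amd_iommu=off on its command line. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and it only recognizes the literal amd_iommu=off token used in this report, not other amd_iommu= values.

```python
def amd_iommu_disabled(cmdline: str) -> bool:
    """Return True if the literal token amd_iommu=off appears in cmdline.

    Assumption: only the exact form used in this report is recognized.
    """
    return "amd_iommu=off" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On a running Linux system, `amd_iommu_disabled(read_cmdline())` reports whether the current boot used the workaround. The same function can be applied to the "Command line:" entry of a saved kernel log.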
The 6.3 kernels had a warning while suspending involving amdgpu which wasn't shown with 6.2.2.

Mar 10 02:21:24 kernel: ------------[ cut here ]------------
Mar 10 02:21:24 kernel: WARNING: CPU: 2 PID: 1393 at kernel/workqueue.c:3167 __flush_work.isra.0+0x270/0x280
Mar 10 02:21:24 kernel: Modules linked in: snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nf_log_syslog nft_log nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc iwlmvm mac80211 uvcvideo edac_mce_amd libarc4 kvm_amd btusb btrtl snd_ctl_led uvc iwlwifi btbcm snd_hda_codec_realtek ccp btintel videobuf2_vmalloc videobuf2_memops snd_hda_codec_generic btmtk videobuf2_v4l2 snd_hda_codec_hdmi ledtrig_audio videobuf2_common hp_wmi snd_hda_intel kvm snd_intel_dspcfg bluetooth sparse_keymap platform_profile snd_intel_sdw_acpi irqbypass cfg80211 snd_hda_codec videodev vfat wmi_bmof fat mc pcspkr snd_hda_core snd_hwdep i2c_piix4 rfkill fam15h_power k10temp snd_seq snd_seq_device snd_pcm snd_timer snd soundcore i2c_scmi wireless_hotkey acpi_cpufreq joydev loop zram amdgpu hid_logitech_hidpp crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic i2c_algo_bit drm_ttm_helper ttm iommu_v2
Mar 10 02:21:24 kernel: ghash_clmulni_intel drm_buddy r8169 sha512_ssse3 wdat_wdt gpu_sched sp5100_tco drm_display_helper cec video wmi hid_multitouch hid_logitech_dj serio_raw scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse dm_multipath
Mar 10 02:21:24 kernel: CPU: 2 PID: 1393 Comm: kworker/u8:10 Not tainted 6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39.x86_64 #1
Mar 10 02:21:24 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
Mar 10 02:21:24 kernel: Workqueue: events_unbound async_run_entry_fn
Mar 10 02:21:24 kernel: RIP: 0010:__flush_work.isra.0+0x270/0x280
Mar 10 02:21:24 kernel: Code: 8b 04 25 80 22 03 00 48 89 44 24 40 48 8b 73 30 8b 4b 28 e9 e3 fe ff ff 40 30 f6 4c 8b 3e e9 21 fe ff ff 0f 0b e9 3a ff ff ff <0f> 0b e9 33 ff ff ff e8 04 d2 e3 00 0f 1f 40 00 90 90 90 90 90 90
Mar 10 02:21:24 kernel: RSP: 0018:ffff98a4c3de7ca8 EFLAGS: 00010246
Mar 10 02:21:24 kernel: RAX: 0000000000000000 RBX: ffff8d3350680340 RCX: 0000000000000000
Mar 10 02:21:24 kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff98a4c3de7cf0
Mar 10 02:21:24 kernel: RBP: ffff8d3350680340 R08: 745e72736d647564 R09: ffff8d3386ae3c74
Mar 10 02:21:24 kernel: R10: 000000000000000f R11: fefefefefefefeff R12: 0000000000000001
Mar 10 02:21:24 kernel: R13: ffff98a4c3de7ca8 R14: 0000000000000001 R15: ffff8d33789e4f28
Mar 10 02:21:24 kernel: FS: 0000000000000000(0000) GS:ffff8d3437500000(0000) knlGS:0000000000000000
Mar 10 02:21:24 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 10 02:21:24 kernel: CR2: 0000562f5c082158 CR3: 00000001459ca000 CR4: 00000000001506e0
Mar 10 02:21:24 kernel: Call Trace:
Mar 10 02:21:24 kernel: <TASK>
Mar 10 02:21:24 kernel: __cancel_work_timer+0xff/0x190
Mar 10 02:21:24 kernel: ? wait_for_completion+0x37/0x160
Mar 10 02:21:24 kernel: ? preempt_count_add+0x6a/0xa0
Mar 10 02:21:24 kernel: drm_kms_helper_poll_disable+0x1e/0x40
Mar 10 02:21:24 kernel: amdgpu_device_suspend+0x9e/0x180 [amdgpu]
Mar 10 02:21:24 kernel: pci_pm_suspend+0x7b/0x170
Mar 10 02:21:24 kernel: ? __pfx_pci_pm_suspend+0x10/0x10
Mar 10 02:21:24 kernel: dpm_run_callback+0x8c/0x1e0
Mar 10 02:21:24 kernel: __device_suspend+0x10a/0x560
Mar 10 02:21:24 kernel: async_suspend+0x1a/0x70
Mar 10 02:21:24 kernel: async_run_entry_fn+0x30/0x130
Mar 10 02:21:24 kernel: process_one_work+0x1c7/0x3d0
Mar 10 02:21:24 kernel: worker_thread+0x4d/0x380
Mar 10 02:21:24 kernel: ? __pfx_worker_thread+0x10/0x10
Mar 10 02:21:24 kernel: kthread+0xe9/0x110
Mar 10 02:21:24 kernel: ? __pfx_kthread+0x10/0x10
Mar 10 02:21:24 kernel: ret_from_fork+0x2c/0x50
Mar 10 02:21:24 kernel: </TASK>
Mar 10 02:21:24 kernel: ---[ end trace 0000000000000000 ]---

I reported this problem to the IOMMU subsystem mailing list at https://lore.kernel.org/all/4a3b225c-2ffd-e758-4de1-447375e34cad@bell.net/T/#u

Vasant Hegde and Felix Kuehling explained the details of the problem in amdgpu there. Thorsten Leemhuis added the problem to regzbot.
https://lore.kernel.org/all/4a3b225c-2ffd-e758-4de1-447375e34cad@bell.net/T/#m52dfb8f457727ce725aad66e5e7db4e8afa46fad
https://linux-regtracking.leemhuis.info/regzbot/regression/217170/

I built 6.3-rc2 after applying Felix's patch at https://lore.kernel.org/stable/20230314175359.1747662-1-Felix.Kuehling@amd.com/

amdgpu resumed normally 5/5 times with 6.3-rc2 + the patch. Felix's patch fixed the problem. Thanks.

Felix Kuehling's patch to fix this problem was pulled into the mainline branch on 2023-3-17 and is in 6.3-rc3: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=master&id=f3921a9a641483784448fb982b2eb738b383d9b9

6.3.0-0.rc3.30.fc39 didn't have this problem when resuming a few times.

Felix's patch is queued for the 6.2 branch at https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/?id=d68ccb83abd757877de8c7f344fa43c05b81760f
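For anyone checking other kernel logs for the suspend warning quoted earlier in this report, here is a minimal sketch that pulls each "cut here" ... "end trace" block out of a log and lists the reliable call-trace frames (those not prefixed with "?"). It assumes the log has one entry per line with a "kernel:" prefix, as in journalctl output. The names extract_warnings and summarize are hypothetical helpers, not an existing tool.

```python
import re

CUT_HERE = "[ cut here ]"
END_TRACE = "[ end trace"
# Matches frames such as "kernel: amdgpu_device_suspend+0x9e/0x180 [amdgpu]".
# Group 1 is the "? " marker for unreliable frames, group 2 the symbol name.
FRAME = re.compile(r"kernel:\s+(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize(block):
    """Return (warning headline, reliable call-trace symbols) for one block."""
    headline = next(
        (line.split("kernel: ", 1)[-1] for line in block if "WARNING:" in line),
        "",
    )
    frames = []
    in_trace = False
    for line in block:
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME.search(line)
            if match and not match.group(1):
                frames.append(match.group(2))
    return headline, frames


def extract_warnings(lines):
    """Yield a summary for each cut-here ... end-trace block in lines."""
    block = None
    for line in lines:
        if CUT_HERE in line:
            block = []
        elif block is not None:
            if END_TRACE in line:
                yield summarize(block)
                block = None
            else:
                block.append(line)
```

Applied to the warning above, the frames start at __cancel_work_timer and pass through drm_kms_helper_poll_disable and amdgpu_device_suspend, which makes it easy to compare against traces from other kernels.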