Created attachment 278087 [details]
dmesg

[ 269.610488] PM: suspend entry (deep)
[ 269.610489] PM: Syncing filesystems ... done.
[ 269.632274] Freezing user space processes ... (elapsed 0.041 seconds) done.
[ 269.673340] OOM killer disabled.
[ 269.673341] Freezing remaining freezable tasks ...
[ 289.675276] Freezing of tasks failed after 20.001 seconds (0 tasks refusing to freeze, wq_busy=1):
[ 289.675280] Showing busy workqueues and worker pools:
[ 289.675283] workqueue events: flags=0x0
[ 289.675286]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[ 289.675293]     in-flight: 267:process_unsol_events [snd_hda_core] process_unsol_events [snd_hda_core], 5:hda_jackpoll_work [snd_hda_codec] BAR(5)
[ 289.675326] workqueue pm: flags=0x4
[ 289.675329]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/0
[ 289.675335]     in-flight: 266:pm_runtime_work
[ 289.675356] workqueue xfs-reclaim/sda5: flags=0xc
[ 289.675359]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
[ 289.675364]     delayed: xfs_reclaim_worker [xfs]
[ 289.675431] workqueue xfs-sync/sda5: flags=0x4
[ 289.675434]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
[ 289.675439]     delayed: xfs_log_worker [xfs]
[ 289.675505] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=1s workers=6 idle: 492 128 441
[ 289.675584] Restarting kernel threads ... done.
[ 289.676063] OOM killer enabled.
[ 289.676063] Restarting tasks ... done.
[ 289.676911] PM: suspend exit

After that I see multiple traces like this:

[ 491.156115] INFO: task kworker/0:2:266 blocked for more than 120 seconds.
[ 491.156130]       Tainted: G           O      4.18.5-arch1-1-ARCH #1
[ 491.156131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 491.156132] kworker/0:2     D    0   266      2 0x80000080
[ 491.156136] Workqueue: pm pm_runtime_work
[ 491.156137] Call Trace:
[ 491.156138]  ? __schedule+0x29b/0x8b0
[ 491.156140]  ? preempt_count_add+0x68/0xa0
[ 491.156141]  ? drain_workqueue+0x120/0x120
[ 491.156142]  schedule+0x32/0x90
[ 491.156143]  __cancel_work_timer+0x17e/0x190
[ 491.156144]  ? drain_workqueue+0x120/0x120
[ 491.156147]  ? hda_codec_runtime_resume+0x50/0x50 [snd_hda_codec]
[ 491.156150]  hda_codec_runtime_suspend+0x20/0xb0 [snd_hda_codec]
[ 491.156151]  __rpm_callback+0x75/0x1b0
[ 491.156153]  ? __switch_to_asm+0x40/0x70
[ 491.156156]  ? hda_codec_runtime_resume+0x50/0x50 [snd_hda_codec]
[ 491.156158]  rpm_callback+0x1f/0x70
[ 491.156161]  ? hda_codec_runtime_resume+0x50/0x50 [snd_hda_codec]
[ 491.156162]  rpm_suspend+0x13a/0x640
[ 491.156163]  ? finish_task_switch+0x83/0x2c0
[ 491.156165]  ? __switch_to_asm+0x40/0x70
[ 491.156166]  pm_runtime_work+0x7d/0xa0
[ 491.156168]  process_one_work+0x1eb/0x3c0
[ 491.156169]  worker_thread+0x2d/0x3d0
[ 491.156171]  ? process_one_work+0x3c0/0x3c0
[ 491.156172]  kthread+0x112/0x130
[ 491.156173]  ? kthread_flush_work_fn+0x10/0x10
[ 491.156175]  ret_from_fork+0x35/0x40

All of them mentioning snd_hda_<something>.

Full dmesg output attached.
Looks like the HD-audio on the discrete AMD GPU had trouble:

[ 93.478837] snd_hda_intel 0000:67:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x003f0900

This indicates that something odd had already happened. Likely a problem with vga_switcheroo and runtime PM.

Lukas, any clue?
Also, is this a regression? If so, which kernel version did work?
@Aleksandr Mezin: Please add log_buf_len=10M to the kernel command line so that the dmesg output is not truncated.

The HDA codec on the AMD GPU is unresponsive after the first system sleep cycle:

[ 93.478837] snd_hda_intel 0000:67:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x003f0900
[ 135.705372] snd_hda_intel 0000:67:00.1: No response from codec, disabling MSI: last cmd=0x003f0900

The question is why. Perhaps the BIOS disables the codec on system sleep? Could you attach an acpidump and the output of "lspci -vv" after the first system sleep cycle?

The issue exposes a bug in hda_jackpoll_work() for unresponsive cards wherein the work item deadlocks because it tries to cancel itself:

hda_jackpoll_work()
  snd_hda_jack_poll_all()
    jack_detect_update()
      read_pin_sense()
        codec_read()
          codec_exec_verb()
            snd_hda_bus_reset()
              snd_hda_bus_reset_codecs()
                /* FIXME: maybe a better way needed for forced reset */
                cancel_delayed_work_sync(&codec->jackpoll_work);

The reason why hda_codec_runtime_suspend() hangs is that it also tries to cancel jackpoll_work, so it's just a victim of the combination of the unresponsive card and the hda_jackpoll_work() deadlock.
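To make the deadlock more concrete, here is a minimal, hypothetical sketch (demo names only, not the actual HDA code) of a delayed work handler that calls cancel_delayed_work_sync() on its own work struct. The sync cancel waits for the running work item to finish, but the running work item is the caller, so it waits on itself forever:

#include <linux/jiffies.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static struct delayed_work demo_work;

static void demo_work_fn(struct work_struct *work)
{
	/*
	 * Deadlock: this function runs as demo_work, and
	 * cancel_delayed_work_sync() blocks until demo_work completes,
	 * which never happens while it sits here waiting for itself.
	 */
	cancel_delayed_work_sync(&demo_work);
}

static int __init demo_init(void)
{
	INIT_DELAYED_WORK(&demo_work, demo_work_fn);
	schedule_delayed_work(&demo_work, msecs_to_jiffies(100));
	return 0;
}
module_init(demo_init);

MODULE_LICENSE("GPL");

This mirrors the cancel_delayed_work_sync(&codec->jackpoll_work) call at the end of the chain above, which is reached from within jackpoll_work itself.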
(In reply to Lukas Wunner from comment #3)
> The issue exposes a bug in hda_jackpoll_work() for unresponsive cards
> wherein the work item deadlocks because it tries to cancel itself:

Yeah, that looks like a side effect of the unexpected communication error.

A patch like the one below may avoid the deadlock (but it's totally untested):

--- a/sound/pci/hda/hda_codec.c
+++ b/sound/pci/hda/hda_codec.c
@@ -3935,7 +3935,8 @@ void snd_hda_bus_reset_codecs(struct hda_bus *bus)
 
 	list_for_each_codec(codec, bus) {
 		/* FIXME: maybe a better way needed for forced reset */
-		cancel_delayed_work_sync(&codec->jackpoll_work);
+		if (current_work() != &codec->jackpoll_work.work)
+			cancel_delayed_work_sync(&codec->jackpoll_work);
 #ifdef CONFIG_PM
 		if (hda_codec_is_power_on(codec)) {
 			hda_call_codec_suspend(codec);

Of course, this doesn't fix the root cause, though.
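For reference, current_work() returns the work item that the calling kworker is currently executing (or NULL when called outside a workqueue worker), so the guard above simply skips the synchronous cancel when jackpoll_work is in the middle of cancelling itself. The same guard can be written as a generic pattern; a rough sketch with a hypothetical helper name:

#include <linux/workqueue.h>

/*
 * Hypothetical helper (sketch only): cancel a delayed work synchronously,
 * unless the caller *is* that work, in which case waiting for it to
 * finish would deadlock.
 */
static void cancel_dwork_unless_self(struct delayed_work *dwork)
{
	if (current_work() == &dwork->work)
		return;	/* running as dwork itself: a sync cancel would never return */
	cancel_delayed_work_sync(dwork);
}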
I could reproduce the issue artificially and confirmed that the change in comment 4 works around it, so it's now queued in the for-linus branch for the next pull request. It doesn't address the root cause, though -- why the communication error happens in the first place.
It seems there are multiple suspend/resume-related issues with Vega (https://bugzilla.kernel.org/show_bug.cgi?id=200531, https://bugzilla.kernel.org/show_bug.cgi?id=201015; on 4.18 even a monitor switching to standby can cause the driver to hang), so probably there's no bug in snd_hda itself, but rather something is wrong in the "display code".

Also, I've never seen this bug (the snd_hda hang during suspend) on 4.19-rc, where a monitor switching to standby no longer kills the amdgpu driver. And currently I can't reproduce it even on 4.18.5.