Bug 200937

Summary: Suspend fails apparently because of snd_hda_*
Product: Drivers Reporter: Aleksandr Mezin (mezin.alexander)
Component: Sound(ALSA)Assignee: Jaroslav Kysela (perex)
Status: RESOLVED OBSOLETE    
Severity: normal CC: lukas, tiwai
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.18.5 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg

Description Aleksandr Mezin 2018-08-26 00:18:08 UTC
Created attachment 278087 [details]
dmesg

[  269.610488] PM: suspend entry (deep)
[  269.610489] PM: Syncing filesystems ... done.
[  269.632274] Freezing user space processes ... (elapsed 0.041 seconds) done.
[  269.673340] OOM killer disabled.
[  269.673341] Freezing remaining freezable tasks ... 
[  289.675276] Freezing of tasks failed after 20.001 seconds (0 tasks refusing to freeze, wq_busy=1):
[  289.675280] Showing busy workqueues and worker pools:
[  289.675283] workqueue events: flags=0x0
[  289.675286]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[  289.675293]     in-flight: 267:process_unsol_events [snd_hda_core] process_unsol_events [snd_hda_core], 5:hda_jackpoll_work [snd_hda_codec] BAR(5)
[  289.675326] workqueue pm: flags=0x4
[  289.675329]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/0
[  289.675335]     in-flight: 266:pm_runtime_work
[  289.675356] workqueue xfs-reclaim/sda5: flags=0xc
[  289.675359]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
[  289.675364]     delayed: xfs_reclaim_worker [xfs]
[  289.675431] workqueue xfs-sync/sda5: flags=0x4
[  289.675434]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
[  289.675439]     delayed: xfs_log_worker [xfs]
[  289.675505] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=1s workers=6 idle: 492 128 441
[  289.675584] Restarting kernel threads ... done.
[  289.676063] OOM killer enabled.
[  289.676063] Restarting tasks ... done.
[  289.676911] PM: suspend exit

After that I see multiple traces like this:

[  491.156115] INFO: task kworker/0:2:266 blocked for more than 120 seconds.
[  491.156130]       Tainted: G           O      4.18.5-arch1-1-ARCH #1
[  491.156131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  491.156132] kworker/0:2     D    0   266      2 0x80000080
[  491.156136] Workqueue: pm pm_runtime_work
[  491.156137] Call Trace:
[  491.156138]  ? __schedule+0x29b/0x8b0
[  491.156140]  ? preempt_count_add+0x68/0xa0
[  491.156141]  ? drain_workqueue+0x120/0x120
[  491.156142]  schedule+0x32/0x90
[  491.156143]  __cancel_work_timer+0x17e/0x190
[  491.156144]  ? drain_workqueue+0x120/0x120
[  491.156147]  ? hda_codec_runtime_resume+0x50/0x50 [snd_hda_codec]
[  491.156150]  hda_codec_runtime_suspend+0x20/0xb0 [snd_hda_codec]
[  491.156151]  __rpm_callback+0x75/0x1b0
[  491.156153]  ? __switch_to_asm+0x40/0x70
[  491.156156]  ? hda_codec_runtime_resume+0x50/0x50 [snd_hda_codec]
[  491.156158]  rpm_callback+0x1f/0x70
[  491.156161]  ? hda_codec_runtime_resume+0x50/0x50 [snd_hda_codec]
[  491.156162]  rpm_suspend+0x13a/0x640
[  491.156163]  ? finish_task_switch+0x83/0x2c0
[  491.156165]  ? __switch_to_asm+0x40/0x70
[  491.156166]  pm_runtime_work+0x7d/0xa0
[  491.156168]  process_one_work+0x1eb/0x3c0
[  491.156169]  worker_thread+0x2d/0x3d0
[  491.156171]  ? process_one_work+0x3c0/0x3c0
[  491.156172]  kthread+0x112/0x130
[  491.156173]  ? kthread_flush_work_fn+0x10/0x10
[  491.156175]  ret_from_fork+0x35/0x40

All of them mentioning snd_hda_<something>

Full dmesg output attached.
Comment 1 Takashi Iwai 2018-08-26 06:13:47 UTC
Looks like the HD-audio on the discrete AMD GPU had a trouble
  [   93.478837] snd_hda_intel 0000:67:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x003f0900

This indicates already something odd happened.  Likely a problem of vga_switcheroo and runtime PM.

Lukas, any clue?
Comment 2 Takashi Iwai 2018-08-26 07:17:22 UTC
Also, is this a regression?  If yes, which kernel did work?
Comment 3 Lukas Wunner 2018-08-26 07:43:55 UTC
@Aleksandr Mezin: Please add log_buf_len=10M to the kernel command line so that dmesg output is not truncated.

The HDA codec on the AMD GPU is unresponsive after the first system sleep cycle:

[   93.478837] snd_hda_intel 0000:67:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x003f0900
[  135.705372] snd_hda_intel 0000:67:00.1: No response from codec, disabling MSI: last cmd=0x003f0900

Question is why.  Perhaps the BIOS disables the codec on system sleep?  Could you attach an acpidump and the output of "lspci -vv" after the first system sleep cycle?

The issue exposes a bug in hda_jackpoll_work() for unresponsive cards wherein the work item deadlocks because it tries to cancel itself:

hda_jackpoll_work()
  snd_hda_jack_poll_all()
    jack_detect_update()
      read_pin_sense()
        codec_read()
          codec_exec_verb()
            snd_hda_bus_reset()
              snd_hda_bus_reset_codecs()
                /* FIXME: maybe a better way needed for forced reset */
                cancel_delayed_work_sync(&codec->jackpoll_work);

The reason why hda_codec_runtime_suspend() hangs is because it also tries to cancel jackpoll_work,
so it's just a victim of the combination of unresponsive card plus hda_jackpoll_work() deadlock.
Comment 4 Takashi Iwai 2018-08-27 06:54:41 UTC
(In reply to Lukas Wunner from comment #3)
> The issue exposes a bug in hda_jackpoll_work() for unresponsive cards
> wherein the work item deadlocks because it tries to cancel itself:

Yeah, that looks like a side-effect by the unexpected communication error.
A patch like below may avoid the deadlock (but totally untested):

--- a/sound/pci/hda/hda_codec.c
+++ b/sound/pci/hda/hda_codec.c
@@ -3935,7 +3935,8 @@ void snd_hda_bus_reset_codecs(struct hda_bus *bus)
 
        list_for_each_codec(codec, bus) {
                /* FIXME: maybe a better way needed for forced reset */
-               cancel_delayed_work_sync(&codec->jackpoll_work);
+               if (current_work() != &codec->jackpoll_work.work)
+                       cancel_delayed_work_sync(&codec->jackpoll_work);
 #ifdef CONFIG_PM
                if (hda_codec_is_power_on(codec)) {
                        hda_call_codec_suspend(codec);


Of course, this doesn't fix the root cause, though.
Comment 5 Takashi Iwai 2018-09-03 07:47:33 UTC
I could reproduce the issue artificially and confirmed that the change in comment 4 works around it, so it's queued in for-linus branch for the next pull request.

It doesn't address the root cause -- why it hits the communication error, though.
Comment 6 Aleksandr Mezin 2018-09-05 06:29:27 UTC
Seems that there are multiple suspend/resume related issues with Vega https://bugzilla.kernel.org/show_bug.cgi?id=200531, https://bugzilla.kernel.org/show_bug.cgi?id=201015 (on 4.18 even a monitor switching to standby can cause the driver to hang), so probably there's no bug in snd_hda, but something is wrong in "display code".

Also, I've never seen this bug (snd_hda hang during suspend) on 4.19-rc (where monitor switching to standby doesn't kill amdgpu driver anymore). And currently I can't reproduce it even on 4.18.5