Bug 208511 - snd_hda_intel driver runs into errors on some hardware devices
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Sound(ALSA)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: Jaroslav Kysela
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-10 09:22 UTC by Stefan Gottwald
Modified: 2021-01-24 19:28 UTC

See Also:
Kernel Version: 5.4.37
Tree: Mainline
Regression: Yes


Attachments
dmesg output of 5.4-rc4 kernel test on Dell Wyse 5070 (90.72 KB, text/plain)
2020-07-10 11:10 UTC, Stefan Gottwald
dmesg with disabled HDMI audio support (74.97 KB, text/plain)
2020-07-10 14:59 UTC, Stefan Gottwald
kernel log with drm.debug=0x07 (112.50 KB, application/gzip)
2020-07-10 15:04 UTC, Stefan Gottwald
Kernel config for the 5.8-rc4 kernel (240.79 KB, text/plain)
2020-07-10 15:07 UTC, Stefan Gottwald
Testing patch (disable the audio controller's runtime pm on geminilake platforms) (1.85 KB, application/mbox)
2020-07-13 12:06 UTC, Hui Wang

Description Stefan Gottwald 2020-07-10 09:22:00 UTC
With the update from 5.4.36 to 5.4.37, the snd_hda_intel driver fails to work on Dell Wyse 5070 and Futro S740 hardware (other hardware may also be affected, but these are the two I tested).

The errors appear in the kernel log with the following messages:

Jul 07 13:37:17 wyse5070 kernel: snd_hda_intel 0000:00:0e.0: azx_get_response timeout, switching to polling mode: last cmd=0x20bf8100
Jul 07 13:37:18 wyse5070 kernel: snd_hda_intel 0000:00:0e.0: No response from codec, disabling MSI: last cmd=0x20bf8100
Jul 07 13:37:19 wyse5070 kernel: snd_hda_intel 0000:00:0e.0: No response from codec, resetting bus: last cmd=0x20bf8100
Jul 07 13:37:20 wyse5070 kernel: snd_hda_intel 0000:00:0e.0: No response from codec, resetting bus: last cmd=0x20bf8100
Jul 07 13:37:22 wyse5070 kernel: snd_hda_intel 0000:00:0e.0: No response from codec, resetting bus: last cmd=0x20170500
Jul 07 13:37:23 wyse5070 kernel: snd_hda_intel 0000:00:0e.0: No response from codec, resetting bus: last cmd=0x20270500
Jul 07 13:37:24 wyse5070 kernel: snd_hda_intel 0000:00:0e.0: No response from codec, resetting bus: last cmd=0x20370500
.
.
.
Jul 07 13:39:40 wyse5070 kernel: snd_hda_intel 0000:00:0e.0: azx_get_response timeout, switching to single_cmd mode: last cmd=0x20170503
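
For readers who want to decode the "last cmd" values in the log above, here is a minimal sketch. It assumes the usual HD-audio command word layout for 12-bit verbs: codec address in bits 31:28, node ID in bits 27:20, verb in bits 19:8, payload in bits 7:0. The helper name decode_hda_cmd is hypothetical and not part of any kernel tool.

```python
def decode_hda_cmd(cmd: int) -> dict:
    """Split a 32-bit HD-audio command word into its fields.

    Assumes the 12-bit-verb layout: codec address in bits 31:28,
    node ID in bits 27:20, verb in bits 19:8, payload in bits 7:0.
    """
    return {
        "codec": (cmd >> 28) & 0xF,
        "nid": (cmd >> 20) & 0xFF,
        "verb": (cmd >> 8) & 0xFFF,
        "payload": cmd & 0xFF,
    }


# Three of the command words quoted in the kernel log of this report.
for raw in (0x20BF8100, 0x20170500, 0x20170503):
    f = decode_hda_cmd(raw)
    print(f"{raw:#010x}: codec={f['codec']} nid={f['nid']:#04x} "
          f"verb={f['verb']:#05x} payload={f['payload']:#04x}")
```

Under this layout all three words address codec 2. Verb 0x705 is the set-power-state verb in the HD-audio specification, so the last two commands look like power-state requests (payload 0 for D0, 3 for D3) that the codec never answered.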

This happens with Ubuntu and with our IGEL system. It also appears to be somewhat timing related, since it does not occur every time on Ubuntu, which made debugging quite annoying.

After some testing and searching for the commit causing the issue, I found this one:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/sound?h=v5.4.37&id=daafdf87b89830ac30b9349c6d38401a07726489

Reverting only this commit fixed the issue on the affected hardware, including with newer kernels (up to 5.4.48).
Comment 1 Hui Wang 2020-07-10 09:43:11 UTC
According to the log, your machine is an Intel Gemini Lake platform. I remember we have seen many audio issues on this platform before, and most of them had something to do with the i915 driver (HDMI audio).

Like this one: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1826868


So could you please check whether all affected machines are Gemini Lake machines? Could you also test whether the problem is still reproducible with the latest mainline kernel, such as 5.8-rc4:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.8-rc4/
Comment 2 Stefan Gottwald 2020-07-10 11:10:33 UTC
Created attachment 290203
dmesg output of 5.4-rc4 kernel test on Dell Wyse 5070

The Dell Wyse 5070 has a J5005 and the Futro S740 has a J4105; both are Gemini Lake platforms. However, the issue was also present when using a DP to DVI adapter, so it does not seem to be directly related to DP/HDMI audio. I know it could still be an issue in this area.

The tests with kernel 5.8-rc4 were also done with a DP to DVI adapter, so no DP/HDMI audio was involved. With 5.8-rc4 the system runs into the same issue.
Comment 3 Hui Wang 2020-07-10 13:29:37 UTC
You could disable the HDMI audio temporarily by adding "snd_hda_intel.probe_mask=1" to the boot args or, within Ubuntu, by editing /etc/modprobe.d/alsa-base.conf and adding the line "options snd-hda-intel probe_mask=1".

After booting up, you could verify that the HDMI codec is disabled with "ls /proc/asound/card0/"; there should be only codec#0 in that folder. Then you could run the test, so we can see whether the issue is still reproducible after disabling the HDMI codec.
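
The codec check described above can also be scripted. This is a minimal sketch; the function name list_codecs is hypothetical, and the default path simply assumes the HDA controller is card0, as in the comment above.

```python
from pathlib import Path
from typing import List


def list_codecs(card_dir: str = "/proc/asound/card0") -> List[str]:
    """Return the codec#N entries found in an ALSA card directory, sorted."""
    root = Path(card_dir)
    if not root.is_dir():
        # No such card (or ALSA not loaded): report no codecs.
        return []
    return sorted(p.name for p in root.iterdir() if p.name.startswith("codec#"))


if __name__ == "__main__":
    codecs = list_codecs()
    print("codecs present:", codecs)
    # With probe_mask=1 applied, only codec#0 is expected to remain.
    print("only codec#0 present:", codecs == ["codec#0"])
```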
Comment 4 Stefan Gottwald 2020-07-10 14:59:54 UTC
Created attachment 290211
dmesg with disabled HDMI audio support

I tested the 5.8-rc4 kernel with HDMI audio disabled, and the problem does not occur there. Attached is the dmesg of the test with the added kernel parameter "snd_hda_intel.probe_mask=1".

So this seems to be HDMI audio related. However, I remember experimenting with the snd_hda_intel parameters earlier on a 5.4.x kernel; there, probe_mask=1 additionally needed single_cmd=1 to work, and even then it was not always stable (but this was on the Futro S740 with the J4105 CPU).

So yes, the option seems to help, but I don't know how reliably.
Comment 5 Takashi Iwai 2020-07-10 15:03:44 UTC
Could you give your kernel config?  Just to make sure that you've enabled the proper drivers, especially for HDMI codecs.
Comment 6 Stefan Gottwald 2020-07-10 15:04:17 UTC
Created attachment 290213
kernel log with drm.debug=0x07

I did another run with drm.debug=0x07 active to possibly see a connection to the HDMI part. This seems to have changed some timings, as I needed 3 tries to hit the error case again, so the whole thing seems to have a timing component.

Attached is the compressed log from the run with error.
Comment 7 Stefan Gottwald 2020-07-10 15:07:45 UTC
Created attachment 290215
Kernel config for the 5.8-rc4 kernel

Hope this helps and this is the config you wanted.
Comment 8 Hui Wang 2020-07-13 00:20:53 UTC
@Stefan,

If you remove snd_hda_intel.probe_mask=1 and add snd_hda_intel.power_save_controller=0 instead, does it help? Maybe we could add a workaround by setting power_save_controller for Gemini Lake platforms.
Comment 9 Stefan Gottwald 2020-07-13 05:59:22 UTC
I tested the snd_hda_intel.power_save_controller=0 setting and it also works. That was to be expected, as the reverted commit was in the power-save area.
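
To confirm which snd_hda_intel options are actually in effect on a running system, the module parameters can be read back from sysfs. This is a minimal sketch; the helper name read_module_param is hypothetical, and it assumes the module exposes its parameters under /sys/module/snd_hda_intel/parameters/ (boolean parameters read back as Y or N).

```python
from pathlib import Path
from typing import Optional


def read_module_param(name: str, module: str = "snd_hda_intel",
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return the current value of a loaded module's parameter.

    Returns None if the module is not loaded or the parameter is not
    exposed (or not readable) in sysfs.
    """
    path = Path(sysfs_root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for param in ("power_save", "power_save_controller", "probe_mask"):
        print(param, "=", read_module_param(param))
```

With power_save_controller=0 on the kernel command line, power_save_controller would be expected to read N.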
Comment 10 Hui Wang 2020-07-13 12:06:09 UTC
Created attachment 290251
Testing patch (disable the audio controller's runtime pm on geminilake platforms)

Could you please test this patch?

thx.
Comment 11 Stefan Gottwald 2020-07-14 12:52:28 UTC
Sorry for the delay. I tested the patch a few minutes ago, and it solves the problem on both the Dell Wyse 5070 and the Futro S740.
Comment 12 Takashi Iwai 2020-07-14 13:33:12 UTC
It's OK to merge the fix as a temporary workaround in the short term, but I'd like to get this addressed properly on the i915 side (or somewhere in HD-audio, if applicable).

Kai, could Intel take a look?
Comment 13 Kai Vehmanen 2020-07-14 16:57:39 UTC
Hmm, this looks tricky. 5.8-rc2 had a few related patches in the i915 driver (forcing codec wake), but that does not seem to help. The patch to disable PM will also keep the display active, so if merged, it will impact PM outside audio as well. Could Stefan try the i915.disable_power_well=0 i915 driver parameter? This will have a PM hit as well. I'll go through the logs in this bug more thoroughly tomorrow.
Comment 14 Stefan Gottwald 2020-07-14 20:00:42 UTC
Hi, I will try i915.disable_power_well=0 on Thursday, as I don't have access to the hardware tomorrow. If you need additional logs with other drm.debug settings, just let me know and I will see whether I can provide them.
Comment 15 Hui Wang 2020-07-14 23:40:30 UTC
Thanks Kai.

But you said "The patch to disable PM will also keep the display active, so if merged, it will impact PM also outside audio." If you are referring to the i915 display_power_get/put, this patch will not impact that, since it only disables the controller's PM, not the codec's PM; the codec's runtime PM still works as before.

And it would be best if we could fix the root cause of this issue instead of relying on a workaround like my patch.

thx.
Comment 16 Kai Vehmanen 2020-07-15 12:58:42 UTC
Stefan, ok great! Test w/ normal drm.debug is ok. The Gemini Lake platform has had its own type of issues with HDMI probe, as GLK supports a much lower display clock (CDCLK), and we can hit some unique clocking constraint issues on the HDA link between audio and display.

One additional thing to try is this old Chrome patch (it basically limits the lowest CDCLK and thus avoids the transitions to the lowest clock):
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1185899/
The original Chrome issue has since been fixed upstream, and the above patch is no longer used. But in case there are some board-specific issues with clock stability, this is worth a shot.

Hui, ack, I stand corrected. Indeed, with this hw, keeping controller up won't block i915.

Takashi, with the above note from Hui, I think it's ok to merge this patch. Unless the above tests reveal something about the behaviour, I fear debugging the hw is going to be a time-consuming task and will likely require access to the specific hw and/or a way to reproduce the issue. Currently our test benches (audio and graphics) don't hit this.
Comment 17 Stefan Gottwald 2020-07-15 13:15:39 UTC
Hi Kai.

I got a report from a customer who hit the same issue with an Intel NUC, but I will not have access to that hardware. It may still be interesting for you in case you can get one of these devices more easily.

The dmesg says this about the system:
Intel(R) Client Systems NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0053.2019.1015.1510 10/15/2019
CPU0: Intel(R) Celeron(R) J4005 CPU @ 2.00GHz (family: 0x6, model: 0x7a, stepping: 0x1)

I will also try the Chromium patch tomorrow to see whether it solves the problem.
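
When triaging similar reports, the CPU line from dmesg can be checked against the Gemini Lake model number. The sketch below assumes that family 0x6, model 0x7a identifies Goldmont Plus (Gemini Lake), as listed in the kernel's intel-family.h; the function name is_gemini_lake is hypothetical.

```python
import re

# Assumption: family 6, model 0x7A is Goldmont Plus (Gemini Lake).
GLK_FAMILY, GLK_MODEL = 0x6, 0x7A

_CPU_RE = re.compile(r"family:\s*(0x[0-9a-fA-F]+),\s*model:\s*(0x[0-9a-fA-F]+)")


def is_gemini_lake(dmesg_line: str) -> bool:
    """Return True if a dmesg 'CPU0: ...' line reports a Gemini Lake CPU."""
    m = _CPU_RE.search(dmesg_line)
    if m is None:
        return False
    family, model = (int(g, 16) for g in m.groups())
    return (family, model) == (GLK_FAMILY, GLK_MODEL)


# The CPU line quoted from the customer's NUC in this comment.
line = ("CPU0: Intel(R) Celeron(R) J4005 CPU @ 2.00GHz "
        "(family: 0x6, model: 0x7a, stepping: 0x1)")
print(is_gemini_lake(line))
```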
Comment 18 Stefan Gottwald 2020-07-16 13:59:46 UTC
Hi,

i915.disable_power_well=0 does not fix the problem and does not seem to have any influence on the issue.

I also tried the Chrome patch, and it solves the issue too.
Comment 19 Kai Vehmanen 2020-07-20 15:59:39 UTC
Thanks Stefan for testing! I'll try to find hw where I could reproduce the issue. But until then, I recommend merging Hui's patch.
Comment 20 Hui Wang 2020-07-30 01:05:18 UTC
Recently, users reported that the audio jack can no longer detect plug/unplug events. Both machines have the alc662 codec, and reverting commit 9a6418487b566 (ALSA: hda: call runtime_allow() for all hda controllers) makes the jack detection work again.

It looks like "call runtime_allow() for all hda controllers" is too aggressive, so let me revert this patch.
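
Whether runtime PM is currently allowed for the audio controller can be read from the device's sysfs power attributes: power/control reads "auto" when runtime PM is allowed and "on" when it is forbidden. This is a minimal sketch; the helper name runtime_pm_state is hypothetical, and the default PCI address 0000:00:0e.0 is taken from the kernel log in this report and will differ on other machines.

```python
from pathlib import Path
from typing import Optional


def runtime_pm_state(pci_addr: str = "0000:00:0e.0",
                     sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Read the runtime-PM sysfs attributes of one PCI device.

    Returns a dict with 'control' and 'runtime_status'; a value is None
    when the corresponding file cannot be read.
    """
    power_dir = Path(sysfs_root) / pci_addr / "power"

    def read(name: str) -> Optional[str]:
        try:
            return (power_dir / name).read_text().strip()
        except OSError:
            return None

    return {"control": read("control"),
            "runtime_status": read("runtime_status")}


if __name__ == "__main__":
    print(runtime_pm_state())
```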
