Bug 201901 - slow resume: 5200 ms snd_hda_codec_hdmi, snd_hda_codec_realtek - Dell XPS 13 9360
Summary: slow resume: 5200 ms snd_hda_codec_hdmi, snd_hda_codec_realtek - Dell XPS 13...
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Sound(ALSA) (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Takashi Iwai
URL:
Keywords:
Depends on:
Blocks: 178231
  Show dependency tree
 
Reported: 2018-12-06 02:18 UTC by Len Brown
Modified: 2019-09-27 00:56 UTC (History)
5 users (show)

See Also:
Kernel Version: 4.18
Subsystem:
Regression: No
Bisected commit-id:


Attachments
otcpl-dell-9360-kbl_mem.html.zip (476.55 KB, application/octet-stream)
2018-12-06 02:23 UTC, Len Brown
Details
dmesg from boot up (72.48 KB, text/plain)
2019-01-10 02:14 UTC, Chen Yu
Details
issue.def (423 bytes, text/plain)
2019-04-19 13:55 UTC, Todd Brandt
Details
issue.def (487 bytes, text/plain)
2019-05-07 00:38 UTC, Todd Brandt
Details
issue.def (384 bytes, text/plain)
2019-05-07 00:43 UTC, Todd Brandt
Details
Test fix patch (2.34 KB, patch)
2019-07-15 19:48 UTC, Takashi Iwai
Details | Diff
Summary of 100 suspend cycles on Linux-5.2 on Dell 9360 (40.43 KB, text/html)
2019-07-16 06:02 UTC, Len Brown
Details
Summary of 100 suspend cycles on Linux-5.2 + patch, on Dell 9360 (40.23 KB, text/html)
2019-07-16 06:07 UTC, Len Brown
Details
Summary of 2,000 suspend to mem cycles on Dell 9360 (770.38 KB, text/html)
2019-07-17 05:35 UTC, Len Brown
Details
Summary of 2,000 suspend to mem cycles on Dell 9360 (698.08 KB, text/html)
2019-07-17 05:38 UTC, Len Brown
Details
Dell 9360 sound delay full callgraph (611.13 KB, application/gzip)
2019-07-17 23:48 UTC, Todd Brandt
Details
Dell 9360 sound delay full callgraph (with takashi patch) (404.00 KB, application/gzip)
2019-07-18 00:02 UTC, Todd Brandt
Details
Patch to sync after each write (414 bytes, patch)
2019-07-18 15:36 UTC, Takashi Iwai
Details | Diff
A patch to log the communication stall (1001 bytes, patch)
2019-07-24 22:12 UTC, Takashi Iwai
Details | Diff

Description Len Brown 2018-12-06 02:18:13 UTC
The Dell XPS 13 9360 is popular with Linux users, including Linus Torvalds.
Dell ships Ubuntu Linux pre-installed on this model.

When we run suspend/resume endurance tests on this machine (2000 cycles),
we find that about 50 times (2.5%) that sound adds about 5 seconds to resume.

This happens in both mem and freeze (s2idle) modes.

azx_rirb_get_response() calls msleep() many times.
There is a pattern -- 31ms, 15ms many times followed by 50ms, and that repeats.

There are many other machines that also see this same problem.  It is not always 5000 ms.  more often is is just over 1,000 ms -- which still makes it the slowest device in the system to resume.
Comment 1 Len Brown 2018-12-06 02:23:22 UTC
Created attachment 279873 [details]
otcpl-dell-9360-kbl_mem.html.zip

5200 ms delay in resume from suspend to mem
Comment 2 Takashi Iwai 2018-12-06 07:11:22 UTC
It's already odd that it needs msleep() there.  It implies that the hardware doesn't return the response for a verb properly by some reason.

Could you give dmesg output?
Comment 3 Takashi Iwai 2018-12-06 07:15:42 UTC
Also tracepoint should show which verb took too long.  I guess it's about some widget power up/down things.
Comment 4 Len Brown 2019-01-10 02:04:34 UTC
When you open the attachment in comment 1 in a browser, in the uppper right corner is a button [dmesg], that if you press, will open a tab with the dmesg.

Unfortunately, it is the dmesg associated only with the suspend resume, let me dig up one that goes back to boot.
Comment 5 Chen Yu 2019-01-10 02:14:25 UTC
Created attachment 280375 [details]
dmesg from boot up
Comment 6 Todd Brandt 2019-04-19 13:55:05 UTC
Created attachment 282413 [details]
issue.def
Comment 7 Todd Brandt 2019-05-07 00:38:23 UTC
Created attachment 282655 [details]
issue.def
Comment 8 Todd Brandt 2019-05-07 00:43:45 UTC
Created attachment 282657 [details]
issue.def
Comment 9 Todd Brandt 2019-05-07 03:39:13 UTC
update, in 5.1.0-rc7 I'm seeing this issue on 15 of the 16 machines we're currently testing. I've changed the issue.def file to reflect that it's not Dell specific. Here's a list of machines with the percentage of tests where the issue was seen.

5.1.0-rc7+	otcpl-amberlake-y	mem-x3163	515	16.28%
5.1.0-rc7+	otcpl-amberlake-y	freeze-x2641	397	15.03%
5.1.0-rc7+	otcpl-dell-p3520	freeze-x2031	331	16.30%
5.1.0-rc7+	otcpl-dell-p5510-xeon-1	freeze-x2164	264	12.20%
5.1.0-rc7+	otcpl-dell-9360-kbl	freeze-x2216	261	11.78%
5.1.0-rc7+	otcpl-dell-p3520	mem-x1773	259	14.61%
5.1.0-rc7+	otcpl-whiskeylake-u	freeze-x2465	253	10.26%
5.1.0-rc7+	otcpl-hp-elite-x2-kbl	mem-x2044	243	11.89%
5.1.0-rc7+	otcpl-yoga-920-kblr	freeze-x2112	180	8.52%
5.1.0-rc7+	otcpl-dell-p5510-xeon-1	mem-x2052	179	8.72%
5.1.0-rc7+	otcpl-coffelake-u	reeze-x1953	168	8.60%
5.1.0-rc7+	otcpl-yoga-920-kblr	mem-x1965	163	8.30%
5.1.0-rc7+	otcpl-whiskeylake-u	mem-x2728	154	5.65%
5.1.0-rc7+	otcpl-hp-elite-x2-kbl	freeze-x2411	148	6.14%
5.1.0-rc7+	otcpl-dell-9360-kbl	mem-x2178	132	6.06%
5.1.0-rc7+	otcpl-asus-e300-apl	mem-x1930	69	3.58%
5.1.0-rc7+	otcpl-Galaxy-Book-10	freeze-x735	57	7.76%
5.1.0-rc7+	otcpl-nuc-coffeelake	freeze-x2873	55	1.91%
5.1.0-rc7+	otcpl-asus-e300-apl	freeze-x2022	50	2.47%
5.1.0-rc7+	otcpl-macbook-pro	freeze-x1955	47	2.40%
5.1.0-rc7+	otcpl-dell-p5510-xeon-2	freeze-x327	39	11.93%
5.1.0-rc7+	otcpl-macbook-pro	mem-x566	9	1.59%
5.1.0-rc7+	otcpl-nuc-coffeelake	mem-x72		8	11.11%
5.1.0-rc7+	otcpl-hp-x360-bsw	mem-x1140	0	0%
5.1.0-rc7+	otcpl-hp-x360-bsw	freeze-x1259	0	0%
5.1.0-rc7+	otcpl-coffeelake-h	mem-x2029	0	0%
5.1.0-rc7+	otcpl-coffeelake-h	freeze-x2542	0	0%
Comment 10 Len Brown 2019-07-11 00:10:19 UTC
Takashi?
Comment 11 Takashi Iwai 2019-07-13 17:56:16 UTC
Well, the comment 9 doesn't show anything over 1000ms like the original report says.  And unfortunately I cannot reproduce the issue on any of my machines.  Maybe Skylake is way too old to show such a bug.  It definitely depends on the platform, and it's basically only chip vendors who may know the real culprit.

Or, possibly the problem might be related with the interaction with graphics.  You can try to disable the codec binding with probe_mask option of snd-hda-intel module so that it will bind only with the analog onboard codec.
Comment 12 Takashi Iwai 2019-07-13 17:59:27 UTC
(In reply to Takashi Iwai from comment #11)
> Or, possibly the problem might be related with the interaction with
> graphics.  You can try to disable the codec binding with probe_mask option
> of snd-hda-intel module so that it will bind only with the analog onboard
> codec.

To make clear my point: if disabling the HDMI codec makes the delay going away, it implies that it's GPU who causes the problem.  The HDMI audio codec is merely a virtually implemented PCI device as a shadow of GPU HDMI output function.  That is, only Intel may know a secret :)
Comment 13 Takashi Iwai 2019-07-15 19:47:19 UTC
BTW, how the tests were performed?  With the external monitor connected for HDMI/DP audio, or without that?

In the latter case -- i.e. without connection to HDMI/DP monitor -- we can skip the forced resume for the HDMI codec.  Currently the HD-audio codec system resume callback forcibly powers up the device even if it wasn't used at suspend time.  This is because the jack detection has to be performed at resume.
But for i915 gfx, we manage the jack detection via the audio component notification mechanism, and the report will be done in anyway, hence the forced resume can be skipped.

Below is a (totally untested) patch to achieve that.  It'd be appreciated if you can see whether it has any positive effect.
Comment 14 Takashi Iwai 2019-07-15 19:48:10 UTC
Created attachment 283713 [details]
Test fix patch
Comment 15 Todd Brandt 2019-07-15 20:12:20 UTC
I will try your patch and test ou the (In reply to Takashi Iwai from comment #11)
> Well, the comment 9 doesn't show anything over 1000ms like the original
> report says.  And unfortunately I cannot reproduce the issue on any of my
> machines.  Maybe Skylake is way too old to show such a bug.  It definitely
> depends on the platform, and it's basically only chip vendors who may know
> the real culprit.
> 
> Or, possibly the problem might be related with the interaction with
> graphics.  You can try to disable the codec binding with probe_mask option
> of snd-hda-intel module so that it will bind only with the analog onboard
> codec.

The Comment 9 data is meant to convey that this issue occurred on many of the test runs for most of the machines we test on. e.g. on the amber lake there were 3163 suspend/resume runs, 16% of them (515) had this sound issue where the delay took more than 1000ms. I didn't include any timelines as they all look substantially similar to the first one attached.

Also, none of these machines have external monitors connected via HDMI, they're all laptops with attached screens. For each test I forcibly wakeup the screen prior to suspend so that we get the full performance hit.

I will try both your probe_mask experiment (I assume you mean snd_hda_intel.probe_mask=0) and the attached patch. Thanks for the help!
Comment 16 Takashi Iwai 2019-07-15 20:47:19 UTC
OK, in that case, better to test my patch at first.  If this helps, it implies that the problem is about the HDMI audio codec, and the patch should be good enough as a workaround for 99.9% use cases.
Comment 17 Len Brown 2019-07-16 06:02:59 UTC
Created attachment 283731 [details]
Summary of 100 suspend cycles on Linux-5.2 on Dell 9360

hdaudio resumedslowly 17 times out of 100 suspend cycles on the Dell 9360 running Linux 5.2
Comment 18 Len Brown 2019-07-16 06:07:38 UTC
Created attachment 283733 [details]
Summary of 100 suspend cycles on Linux-5.2 + patch, on Dell 9360

hdaudio resumed slowly 5 times out of 100 suspend/resume cycles on the Dell 9360, when running Linux-5.2 plus the patch in comment #14

So the patch removed 2/3's of the issues, but 1/3 remain.
Comment 19 Takashi Iwai 2019-07-16 06:55:06 UTC
Thanks, that shows that the problem isn't specific to HDMI codec.

The original report showed the errors like:

[   65.231934] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to polling mode: last cmd=0x20170500
[   66.239639] snd_hda_intel 0000:00:1f.3: No response from codec, disabling MSI: last cmd=0x20170500
[   67.247928] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to single_cmd mode: last cmd=0x20170500
[   67.471338] snd_hda_codec_realtek hdaudioC0D0: Unable to sync register 0x2b8000. -5
[   67.473960] snd_hda_codec_realtek hdaudioC0D0: Unable to sync register 0x2b8000. -5

This clearly indicates the communication error on HD-audio bus, and switching to single_cmd mode means that the everything got screwed up.  Also, since both Realtek and HDMI codecs suffer, it implies that the problem is rather in HD-audio controller side.

A similar error was reported in the past and fixed for Broxton in commit 6639484ddaf6 and applied more widely by commit 7e31a0159461, then a4b4793f640b.
We need to make sure that this workaround is applied at the hardware reset at resume.

Also, we need to check whether the similar pattern appears in each long-time resume case on different hardware.  I guess so, but need confirmation.

In anyway, it's something about HD-audio controller reset, supposedly, and we need a help from Intel.  It's the very chip-specific stuff, and I have too little information about it, unfortunately.
Comment 20 Takashi Iwai 2019-07-16 07:03:35 UTC
Meanwhile I'm going to submit & merge the patch to upstream, as it's basically a "good thing" (although it doesn't address the whole issue here).
Comment 21 Todd Brandt 2019-07-16 20:04:30 UTC
Takashi, I've just completed a 16 hour stress test run on our amber lake machine using your patch (otcpl-amberlake-y and 5.2.0). The occurrence of the issue dropped from 10% to 2.6% of the total test runs (~3000 runs). It also reduced the maximum resume delay from 5 seconds to 2 seconds, so it definitely does have an effect.
Comment 22 Len Brown 2019-07-17 05:35:47 UTC
Created attachment 283759 [details]
Summary of 2,000 suspend to mem cycles on Dell 9360

In 2,000 freeze cycles on the 9360 with the patch, I didn't see this issue once.
This makes me think I may have failed to reboot into the patch kernel for my "after" test of 100 iterations above.
Comment 23 Len Brown 2019-07-17 05:38:47 UTC
Created attachment 283761 [details]
Summary of 2,000 suspend to mem cycles on Dell 9360

2,000 suspend to mem cycles on Dell 9360 with patch, this issue not seen.
Comment 24 Len Brown 2019-07-17 05:41:01 UTC
(In reply to Len Brown from comment #22)

> Summary of 2,000 suspend to mem cycles on Dell 9360

typo -- that one was 2,000 *freeze* (s2idle) iterations, not mem. The next one is mem.
Comment 25 Len Brown 2019-07-17 06:23:32 UTC
Takashi,
To reproduce, do this:

git clone git://github.com/01org/pm-graph
cd pm-graph
sudo make install 

cd some-output-dir

sleepgraph -m freeze -dev -rtcwake 5 -multi 100 0

and about 15-minutes later, in the results directory, see summary.html

you can replace "freeze" with "mem" to get ACPI S3, which also sees this issue
Comment 26 Takashi Iwai 2019-07-17 14:38:36 UTC
Thanks, I'll give it a try on a few machines here and see whether I can reproduce the issue.
Comment 27 Todd Brandt 2019-07-17 23:48:32 UTC
Created attachment 283807 [details]
Dell 9360 sound delay full callgraph

This is a full callgraph trace of the hdaudioC0D0 and hdaudioC0D2 devices during suspend and resume. Both are well over a second in resume. view it by unzipping it and loading into a browser. You can expand & collapse the call tree by clicking the plus/minus icons. This is a run from an unaltered 5.2.0 upstream kernel build.
Comment 28 Todd Brandt 2019-07-18 00:02:51 UTC
Created attachment 283809 [details]
Dell 9360 sound delay full callgraph (with takashi patch)

This is a full callgraph trace of the hdaudioC0D0 and hdaudioC0D2 devices during suspend and resume. Both are well over a second in resume. view it by unzipping it and loading into a browser. You can expand & collapse the call tree by clicking the plus/minus icons. This is run from a 5.2.0 upstream kernel built with takashi's proposed patch fix. This test was the worst from a series of 100 runs.
Comment 29 Takashi Iwai 2019-07-18 15:35:36 UTC
OK, this really looks like some stall at CORB/RIRB communication over HD-audio bus.  It looks spontaneous, and hard to guess what's going wrong there.

Maybe a big band-aid would be to sync each write operation, as we do for AMD and other HD-audio controllers.  The patch below does it for Skylake and later.  Please give it a try (together with my previous patch -- which was already on the way to Linus tree today).

I tried the tests on my machines, but they didn't show the too long delay.  A slight delay was seen once, but not reproducible, unfortunately.  Maybe this really depends on the machine, possibly CPU speed or whatever.
Comment 30 Takashi Iwai 2019-07-18 15:36:42 UTC
Created attachment 283815 [details]
Patch to sync after each write
Comment 31 Takashi Iwai 2019-07-19 08:36:53 UTC
Maybe it's better to enable the same workaround for BXT and GLK, too.
For that, please replace AZX_DCAPS_INTEL_BROXTON definition in hda_intel.c like:

#define AZX_DCAPS_INTEL_BROXTON  AZX_DCAPS_INTEL_SKYLAKE
Comment 32 Todd Brandt 2019-07-19 20:07:04 UTC
ok, I applied both patches and included the linking of the broxton config to skylake, I believe that covers all machines in our lab ATM. I'll do a full 48 hour soak (24 freeze/24 mem) this weekend and post the results. Thanks!
Comment 33 Takashi Iwai 2019-07-20 20:44:33 UTC
Thanks, that sounds promising!

I'll prepare a proper patch and submit/merge to upstream once after the confirmed results.
I suppose you won't mind if I put both of you to Reported-by tag?
Comment 34 Todd Brandt 2019-07-24 18:28:02 UTC
Hi Takashi, I finished the full 48 hour soak and the problem appears to have been fully resolved. This is the patch I applied on top of the attached patch:

diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c
index 50f86f458918..acc123229899 100644
--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -313,11 +313,10 @@ enum {
 
 #define AZX_DCAPS_INTEL_SKYLAKE \
 	(AZX_DCAPS_INTEL_PCH_BASE | AZX_DCAPS_PM_RUNTIME |\
+	 AZX_DCAPS_SYNC_WRITE |\
 	 AZX_DCAPS_SEPARATE_STREAM_TAG | AZX_DCAPS_I915_COMPONENT)
 
-#define AZX_DCAPS_INTEL_BROXTON \
-	(AZX_DCAPS_INTEL_PCH_BASE | AZX_DCAPS_PM_RUNTIME |\
-	 AZX_DCAPS_SEPARATE_STREAM_TAG | AZX_DCAPS_I915_COMPONENT)
+#define AZX_DCAPS_INTEL_BROXTON  AZX_DCAPS_INTEL_SKYLAKE
 
 /* quirks for ATI SB / AMD Hudson */
 #define AZX_DCAPS_PRESET_ATI_SB \

Here's the data synopsis, the table shows each machine, the mode used, the iteration count, the total number of occurrences of the error N (1 second or longer hdaudio resume), and the error rate in percent of the iteration count.

machine               mode-count    N   rate(N/count)
-------------------------------------------------
macbook-pro           freeze-x2342  49  2.09%
macbook-pro           mem-x318      7   2.20%
whiskeylake-u         mem-x2688     0   0%
whiskeylake-u         freeze-x2867  0   0%
NUC-coffelake         mem-x2705     0   0%
NUC-coffelake         freeze-x3084  0   0%
HPx360 braswell       mem-x1183     0   0%
HPx360 braswell       freeze-x1304  0   0%
geminilake-rvp        mem-x54       0   0%
geminilake-rvp        freeze-x49    0   0%
dell-p5510-xeon-2     mem-x731      0   0%
dell-p5510-xeon-2     freeze-x1470  0   0%
dell-p5510-xeon-1     mem-x2574     0   0%
dell-p5510-xeon-1     freeze-x2608  0   0%
dell-p3520            mem-x2457     0   0%
dell-p3520            freeze-x2144  0   0%
dell-9360-kbl         mem-x2843     0   0%
dell-9360-kbl         freeze-x2688  0   0%
coffelake-u           mem-x2453     0   0%
coffelake-u           freeze-x2233  0   0%
coffelake-h           mem-x2190     0   0%
coffelake-h           freeze-x2614  0   0%
Asus-E300-apollolake  mem-x2052     0   0%
Asus-E300-apollolake  freeze-x2115  0   0%
amberlake-y           mem-x3263     0   0%
amberlake-y           freeze-x2817  0   0%
Galaxy-Book-10        freeze-x271   0   0%

Almost every machine architecture covered by the second patch appears to have been helped. The macbook pro still has a few, it has an i7-4850HQ Haswell chip but the resume call is "hdaudioC0D0 {snd_hda_codec_cirrus}", so the issue might be completely different.

I think both patches have a profound effect and should go upstream, thanks!

P.S. yes you can put me on the reported-by, tested-by, or acked-by fields.
Comment 35 Takashi Iwai 2019-07-24 22:11:45 UTC
Thanks, good to hear!  The patch was submitted shortly ago, and now I merged it to sound.git tree with your reported-and-tested-by tag.
Once after linux-next test passes, I'm going to push the fixes to Linus in this week.

The Mac's have Cirrus codecs, and these might behave slightly differently, yes.
Possibly a patch like below can record the verb triggers the codec communication stall.
Comment 36 Takashi Iwai 2019-07-24 22:12:29 UTC
Created attachment 283943 [details]
A patch to log the communication stall
Comment 37 Takashi Iwai 2019-07-24 22:13:55 UTC
FYI, the fix commit is 2756d9143aa517b97961e85412882b8ce31371a6 in sound.git tree:
  https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/commit/?id=2756d9143aa517b97961e85412882b8ce31371a6
Comment 38 Takashi Iwai 2019-09-03 12:57:20 UTC
The fix was merged to mainline 5.3 and stable trees.  Let's close.
Comment 39 Len Brown 2019-09-05 00:29:44 UTC
The issue with cirrus (on a macbook) persists.

The issue with realtek seems to be gone from most machines,
but there are a couple of machines that still see this issue,
albeit at a much lower rate than before the patch above:

5.3.0-rc6	lkp-bdw-nuc1		mem-x2000	85	4.25%
5.3.0-rc6	lkp-bdw-nuc1		freeze-x2000	52	2.60%
5.3.0-rc6	lkp-ivb-toshiba1	mem-x2000	12	0.60%
5.3.0-rc6	lkp-bdwy-lenovo1	freeze-x2000	7	0.35%
5.3.0-rc6	lkp-ivb-toshiba1	freeze-x2000	5	0.25%
Comment 40 Takashi Iwai 2019-09-05 09:13:52 UTC
The MacBook is very hard to debug because the actual hardware implementation is never shown.  They do always something weird non-standard, so it's WONTFIX unless you get the vendor-specific information from Apple.

For the rest, unless you face to the fallback to single_cmd mode (which is a fatal error), it's also likely WONTFIX.  That is some flakiness on the hardware where the initialization required so long.  It's no generic problem and has to be addressed at each, if any...
Comment 41 Todd Brandt 2019-09-27 00:56:33 UTC
Patch is now in kernel, commit id is here:

2756d9143aa517b97961e85412882b8ce31371a6

closing this

Note You need to log in before you can comment on or make changes to this bug.