Bug 201901
Summary: | slow resume: 5200 ms snd_hda_codec_hdmi, snd_hda_codec_realtek - Dell XPS 13 9360 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Len Brown (lenb) |
Component: | Sound(ALSA) | Assignee: | Takashi Iwai (tiwai) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | joe.konno, lenb, rui.zhang, todd.e.brandt, yu.c.chen |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.18 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 178231 | ||
Attachments: |
otcpl-dell-9360-kbl_mem.html.zip
dmesg from boot up issue.def issue.def issue.def Test fix patch Summary of 100 suspend cycles on Linux-5.2 on Dell 9360 Summary of 100 suspend cycles on Linux-5.2 + patch, on Dell 9360 Summary of 2,000 suspend to mem cycles on Dell 9360 Summary of 2,000 suspend to mem cycles on Dell 9360 Dell 9360 sound delay full callgraph Dell 9360 sound delay full callgraph (with takashi patch) Patch to sync after each write A patch to log the communication stall |
Description
Len Brown
2018-12-06 02:18:13 UTC
Created attachment 279873 [details]
otcpl-dell-9360-kbl_mem.html.zip
5200 ms delay in resume from suspend to mem
It's already odd that it needs msleep() there. It implies that the hardware doesn't return the response for a verb properly by some reason. Could you give dmesg output? Also tracepoint should show which verb took too long. I guess it's about some widget power up/down things. When you open the attachment in comment 1 in a browser, in the uppper right corner is a button [dmesg], that if you press, will open a tab with the dmesg. Unfortunately, it is the dmesg associated only with the suspend resume, let me dig up one that goes back to boot. Created attachment 280375 [details]
dmesg from boot up
Created attachment 282413 [details]
issue.def
Created attachment 282655 [details]
issue.def
Created attachment 282657 [details]
issue.def
update, in 5.1.0-rc7 I'm seeing this issue on 15 of the 16 machines we're currently testing. I've changed the issue.def file to reflect that it's not Dell specific. Here's a list of machines with the percentage of tests where the issue was seen. 5.1.0-rc7+ otcpl-amberlake-y mem-x3163 515 16.28% 5.1.0-rc7+ otcpl-amberlake-y freeze-x2641 397 15.03% 5.1.0-rc7+ otcpl-dell-p3520 freeze-x2031 331 16.30% 5.1.0-rc7+ otcpl-dell-p5510-xeon-1 freeze-x2164 264 12.20% 5.1.0-rc7+ otcpl-dell-9360-kbl freeze-x2216 261 11.78% 5.1.0-rc7+ otcpl-dell-p3520 mem-x1773 259 14.61% 5.1.0-rc7+ otcpl-whiskeylake-u freeze-x2465 253 10.26% 5.1.0-rc7+ otcpl-hp-elite-x2-kbl mem-x2044 243 11.89% 5.1.0-rc7+ otcpl-yoga-920-kblr freeze-x2112 180 8.52% 5.1.0-rc7+ otcpl-dell-p5510-xeon-1 mem-x2052 179 8.72% 5.1.0-rc7+ otcpl-coffelake-u reeze-x1953 168 8.60% 5.1.0-rc7+ otcpl-yoga-920-kblr mem-x1965 163 8.30% 5.1.0-rc7+ otcpl-whiskeylake-u mem-x2728 154 5.65% 5.1.0-rc7+ otcpl-hp-elite-x2-kbl freeze-x2411 148 6.14% 5.1.0-rc7+ otcpl-dell-9360-kbl mem-x2178 132 6.06% 5.1.0-rc7+ otcpl-asus-e300-apl mem-x1930 69 3.58% 5.1.0-rc7+ otcpl-Galaxy-Book-10 freeze-x735 57 7.76% 5.1.0-rc7+ otcpl-nuc-coffeelake freeze-x2873 55 1.91% 5.1.0-rc7+ otcpl-asus-e300-apl freeze-x2022 50 2.47% 5.1.0-rc7+ otcpl-macbook-pro freeze-x1955 47 2.40% 5.1.0-rc7+ otcpl-dell-p5510-xeon-2 freeze-x327 39 11.93% 5.1.0-rc7+ otcpl-macbook-pro mem-x566 9 1.59% 5.1.0-rc7+ otcpl-nuc-coffeelake mem-x72 8 11.11% 5.1.0-rc7+ otcpl-hp-x360-bsw mem-x1140 0 0% 5.1.0-rc7+ otcpl-hp-x360-bsw freeze-x1259 0 0% 5.1.0-rc7+ otcpl-coffeelake-h mem-x2029 0 0% 5.1.0-rc7+ otcpl-coffeelake-h freeze-x2542 0 0% Takashi? Well, the comment 9 doesn't show anything over 1000ms like the original report says. And unfortunately I cannot reproduce the issue on any of my machines. Maybe Skylake is way too old to show such a bug. It definitely depends on the platform, and it's basically only chip vendors who may know the real culprit. Or, possibly the problem might be related with the interaction with graphics. You can try to disable the codec binding with probe_mask option of snd-hda-intel module so that it will bind only with the analog onboard codec. (In reply to Takashi Iwai from comment #11) > Or, possibly the problem might be related with the interaction with > graphics. You can try to disable the codec binding with probe_mask option > of snd-hda-intel module so that it will bind only with the analog onboard > codec. To make clear my point: if disabling the HDMI codec makes the delay going away, it implies that it's GPU who causes the problem. The HDMI audio codec is merely a virtually implemented PCI device as a shadow of GPU HDMI output function. That is, only Intel may know a secret :) BTW, how the tests were performed? With the external monitor connected for HDMI/DP audio, or without that? In the latter case -- i.e. without connection to HDMI/DP monitor -- we can skip the forced resume for the HDMI codec. Currently the HD-audio codec system resume callback forcibly powers up the device even if it wasn't used at suspend time. This is because the jack detection has to be performed at resume. But for i915 gfx, we manage the jack detection via the audio component notification mechanism, and the report will be done in anyway, hence the forced resume can be skipped. Below is a (totally untested) patch to achieve that. It'd be appreciated if you can see whether it has any positive effect. Created attachment 283713 [details]
Test fix patch
I will try your patch and test ou the (In reply to Takashi Iwai from comment #11) > Well, the comment 9 doesn't show anything over 1000ms like the original > report says. And unfortunately I cannot reproduce the issue on any of my > machines. Maybe Skylake is way too old to show such a bug. It definitely > depends on the platform, and it's basically only chip vendors who may know > the real culprit. > > Or, possibly the problem might be related with the interaction with > graphics. You can try to disable the codec binding with probe_mask option > of snd-hda-intel module so that it will bind only with the analog onboard > codec. The Comment 9 data is meant to convey that this issue occurred on many of the test runs for most of the machines we test on. e.g. on the amber lake there were 3163 suspend/resume runs, 16% of them (515) had this sound issue where the delay took more than 1000ms. I didn't include any timelines as they all look substantially similar to the first one attached. Also, none of these machines have external monitors connected via HDMI, they're all laptops with attached screens. For each test I forcibly wakeup the screen prior to suspend so that we get the full performance hit. I will try both your probe_mask experiment (I assume you mean snd_hda_intel.probe_mask=0) and the attached patch. Thanks for the help! OK, in that case, better to test my patch at first. If this helps, it implies that the problem is about the HDMI audio codec, and the patch should be good enough as a workaround for 99.9% use cases. Created attachment 283731 [details]
Summary of 100 suspend cycles on Linux-5.2 on Dell 9360
hdaudio resumedslowly 17 times out of 100 suspend cycles on the Dell 9360 running Linux 5.2
Created attachment 283733 [details] Summary of 100 suspend cycles on Linux-5.2 + patch, on Dell 9360 hdaudio resumed slowly 5 times out of 100 suspend/resume cycles on the Dell 9360, when running Linux-5.2 plus the patch in comment #14 So the patch removed 2/3's of the issues, but 1/3 remain. Thanks, that shows that the problem isn't specific to HDMI codec. The original report showed the errors like: [ 65.231934] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to polling mode: last cmd=0x20170500 [ 66.239639] snd_hda_intel 0000:00:1f.3: No response from codec, disabling MSI: last cmd=0x20170500 [ 67.247928] snd_hda_intel 0000:00:1f.3: azx_get_response timeout, switching to single_cmd mode: last cmd=0x20170500 [ 67.471338] snd_hda_codec_realtek hdaudioC0D0: Unable to sync register 0x2b8000. -5 [ 67.473960] snd_hda_codec_realtek hdaudioC0D0: Unable to sync register 0x2b8000. -5 This clearly indicates the communication error on HD-audio bus, and switching to single_cmd mode means that the everything got screwed up. Also, since both Realtek and HDMI codecs suffer, it implies that the problem is rather in HD-audio controller side. A similar error was reported in the past and fixed for Broxton in commit 6639484ddaf6 and applied more widely by commit 7e31a0159461, then a4b4793f640b. We need to make sure that this workaround is applied at the hardware reset at resume. Also, we need to check whether the similar pattern appears in each long-time resume case on different hardware. I guess so, but need confirmation. In anyway, it's something about HD-audio controller reset, supposedly, and we need a help from Intel. It's the very chip-specific stuff, and I have too little information about it, unfortunately. Meanwhile I'm going to submit & merge the patch to upstream, as it's basically a "good thing" (although it doesn't address the whole issue here). Takashi, I've just completed a 16 hour stress test run on our amber lake machine using your patch (otcpl-amberlake-y and 5.2.0). The occurrence of the issue dropped from 10% to 2.6% of the total test runs (~3000 runs). It also reduced the maximum resume delay from 5 seconds to 2 seconds, so it definitely does have an effect. Created attachment 283759 [details]
Summary of 2,000 suspend to mem cycles on Dell 9360
In 2,000 freeze cycles on the 9360 with the patch, I didn't see this issue once.
This makes me think I may have failed to reboot into the patch kernel for my "after" test of 100 iterations above.
Created attachment 283761 [details]
Summary of 2,000 suspend to mem cycles on Dell 9360
2,000 suspend to mem cycles on Dell 9360 with patch, this issue not seen.
(In reply to Len Brown from comment #22) > Summary of 2,000 suspend to mem cycles on Dell 9360 typo -- that one was 2,000 *freeze* (s2idle) iterations, not mem. The next one is mem. Takashi, To reproduce, do this: git clone git://github.com/01org/pm-graph cd pm-graph sudo make install cd some-output-dir sleepgraph -m freeze -dev -rtcwake 5 -multi 100 0 and about 15-minutes later, in the results directory, see summary.html you can replace "freeze" with "mem" to get ACPI S3, which also sees this issue Thanks, I'll give it a try on a few machines here and see whether I can reproduce the issue. Created attachment 283807 [details]
Dell 9360 sound delay full callgraph
This is a full callgraph trace of the hdaudioC0D0 and hdaudioC0D2 devices during suspend and resume. Both are well over a second in resume. view it by unzipping it and loading into a browser. You can expand & collapse the call tree by clicking the plus/minus icons. This is a run from an unaltered 5.2.0 upstream kernel build.
Created attachment 283809 [details]
Dell 9360 sound delay full callgraph (with takashi patch)
This is a full callgraph trace of the hdaudioC0D0 and hdaudioC0D2 devices during suspend and resume. Both are well over a second in resume. view it by unzipping it and loading into a browser. You can expand & collapse the call tree by clicking the plus/minus icons. This is run from a 5.2.0 upstream kernel built with takashi's proposed patch fix. This test was the worst from a series of 100 runs.
OK, this really looks like some stall at CORB/RIRB communication over HD-audio bus. It looks spontaneous, and hard to guess what's going wrong there. Maybe a big band-aid would be to sync each write operation, as we do for AMD and other HD-audio controllers. The patch below does it for Skylake and later. Please give it a try (together with my previous patch -- which was already on the way to Linus tree today). I tried the tests on my machines, but they didn't show the too long delay. A slight delay was seen once, but not reproducible, unfortunately. Maybe this really depends on the machine, possibly CPU speed or whatever. Created attachment 283815 [details]
Patch to sync after each write
Maybe it's better to enable the same workaround for BXT and GLK, too. For that, please replace AZX_DCAPS_INTEL_BROXTON definition in hda_intel.c like: #define AZX_DCAPS_INTEL_BROXTON AZX_DCAPS_INTEL_SKYLAKE ok, I applied both patches and included the linking of the broxton config to skylake, I believe that covers all machines in our lab ATM. I'll do a full 48 hour soak (24 freeze/24 mem) this weekend and post the results. Thanks! Thanks, that sounds promising! I'll prepare a proper patch and submit/merge to upstream once after the confirmed results. I suppose you won't mind if I put both of you to Reported-by tag? Hi Takashi, I finished the full 48 hour soak and the problem appears to have been fully resolved. This is the patch I applied on top of the attached patch: diff --git a/sound/pci/hda/hda_intel.c b/sound/pci/hda/hda_intel.c index 50f86f458918..acc123229899 100644 --- a/sound/pci/hda/hda_intel.c +++ b/sound/pci/hda/hda_intel.c @@ -313,11 +313,10 @@ enum { #define AZX_DCAPS_INTEL_SKYLAKE \ (AZX_DCAPS_INTEL_PCH_BASE | AZX_DCAPS_PM_RUNTIME |\ + AZX_DCAPS_SYNC_WRITE |\ AZX_DCAPS_SEPARATE_STREAM_TAG | AZX_DCAPS_I915_COMPONENT) -#define AZX_DCAPS_INTEL_BROXTON \ - (AZX_DCAPS_INTEL_PCH_BASE | AZX_DCAPS_PM_RUNTIME |\ - AZX_DCAPS_SEPARATE_STREAM_TAG | AZX_DCAPS_I915_COMPONENT) +#define AZX_DCAPS_INTEL_BROXTON AZX_DCAPS_INTEL_SKYLAKE /* quirks for ATI SB / AMD Hudson */ #define AZX_DCAPS_PRESET_ATI_SB \ Here's the data synopsis, the table shows each machine, the mode used, the iteration count, the total number of occurrences of the error N (1 second or longer hdaudio resume), and the error rate in percent of the iteration count. machine mode-count N rate(N/count) ------------------------------------------------- macbook-pro freeze-x2342 49 2.09% macbook-pro mem-x318 7 2.20% whiskeylake-u mem-x2688 0 0% whiskeylake-u freeze-x2867 0 0% NUC-coffelake mem-x2705 0 0% NUC-coffelake freeze-x3084 0 0% HPx360 braswell mem-x1183 0 0% HPx360 braswell freeze-x1304 0 0% geminilake-rvp mem-x54 0 0% geminilake-rvp freeze-x49 0 0% dell-p5510-xeon-2 mem-x731 0 0% dell-p5510-xeon-2 freeze-x1470 0 0% dell-p5510-xeon-1 mem-x2574 0 0% dell-p5510-xeon-1 freeze-x2608 0 0% dell-p3520 mem-x2457 0 0% dell-p3520 freeze-x2144 0 0% dell-9360-kbl mem-x2843 0 0% dell-9360-kbl freeze-x2688 0 0% coffelake-u mem-x2453 0 0% coffelake-u freeze-x2233 0 0% coffelake-h mem-x2190 0 0% coffelake-h freeze-x2614 0 0% Asus-E300-apollolake mem-x2052 0 0% Asus-E300-apollolake freeze-x2115 0 0% amberlake-y mem-x3263 0 0% amberlake-y freeze-x2817 0 0% Galaxy-Book-10 freeze-x271 0 0% Almost every machine architecture covered by the second patch appears to have been helped. The macbook pro still has a few, it has an i7-4850HQ Haswell chip but the resume call is "hdaudioC0D0 {snd_hda_codec_cirrus}", so the issue might be completely different. I think both patches have a profound effect and should go upstream, thanks! P.S. yes you can put me on the reported-by, tested-by, or acked-by fields. Thanks, good to hear! The patch was submitted shortly ago, and now I merged it to sound.git tree with your reported-and-tested-by tag. Once after linux-next test passes, I'm going to push the fixes to Linus in this week. The Mac's have Cirrus codecs, and these might behave slightly differently, yes. Possibly a patch like below can record the verb triggers the codec communication stall. Created attachment 283943 [details]
A patch to log the communication stall
FYI, the fix commit is 2756d9143aa517b97961e85412882b8ce31371a6 in sound.git tree: https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/commit/?id=2756d9143aa517b97961e85412882b8ce31371a6 The fix was merged to mainline 5.3 and stable trees. Let's close. The issue with cirrus (on a macbook) persists. The issue with realtek seems to be gone from most machines, but there are a couple of machines that still see this issue, albeit at a much lower rate than before the patch above: 5.3.0-rc6 lkp-bdw-nuc1 mem-x2000 85 4.25% 5.3.0-rc6 lkp-bdw-nuc1 freeze-x2000 52 2.60% 5.3.0-rc6 lkp-ivb-toshiba1 mem-x2000 12 0.60% 5.3.0-rc6 lkp-bdwy-lenovo1 freeze-x2000 7 0.35% 5.3.0-rc6 lkp-ivb-toshiba1 freeze-x2000 5 0.25% The MacBook is very hard to debug because the actual hardware implementation is never shown. They do always something weird non-standard, so it's WONTFIX unless you get the vendor-specific information from Apple. For the rest, unless you face to the fallback to single_cmd mode (which is a fatal error), it's also likely WONTFIX. That is some flakiness on the hardware where the initialization required so long. It's no generic problem and has to be addressed at each, if any... Patch is now in kernel, commit id is here: 2756d9143aa517b97961e85412882b8ce31371a6 closing this |