Bug 201615 - [Regression][bisected] commit 37a3a98ef601f89100e3bb657fb0e190b857028c breaks dpm for my hybrid Intel/AMD laptop's dGPU
Summary: [Regression][bisected] commit 37a3a98ef601f89100e3bb657fb0e190b857028c breaks...
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Sound(ALSA) (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: Jaroslav Kysela
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-03 21:52 UTC by taijian
Modified: 2018-11-06 12:36 UTC (History)
2 users (show)

See Also:
Kernel Version: 4.19-rc4+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
reversing this commit makes my dGPU power down again (7.62 KB, application/mbox)
2018-11-03 21:52 UTC, taijian
Details
Test fix patch (548 bytes, patch)
2018-11-04 21:28 UTC, Takashi Iwai
Details | Diff
dmesg of kernel 4.19 with both patches applied (79.06 KB, text/plain)
2018-11-04 23:34 UTC, taijian
Details
Revised test patch (474 bytes, patch)
2018-11-05 07:28 UTC, Takashi Iwai
Details | Diff
Fix patch (proeprly formatted) (1.51 KB, patch)
2018-11-05 11:01 UTC, Takashi Iwai
Details | Diff

Description taijian 2018-11-03 21:52:02 UTC
Created attachment 279303 [details]
reversing this commit makes my dGPU power down again

I have an Alienware 15R3 with hybrid Intel/AMD (RX470) graphics. I noticed in the development of the 4.19 kernel that at some point dpm for my dGPU started to break - as in: it wouldn't power down anymore, when not in use. I finally got around to properly bisecting the issue and finally identified: 

From 37a3a98ef601f89100e3bb657fb0e190b857028c Mon Sep 17 00:00:00 2001
From: Takashi Iwai <tiwai@suse.de>
Date: Mon, 10 Sep 2018 16:20:25 +0200
Subject: ALSA: hda - Enable runtime PM only for discrete GPU

as the commit that broke dpm on my system. Any kernel past this commit does not power down the dGPU when it is idle, reverting this commit restores dpm. 
I'll be happy to assist in further investigations.
Comment 1 Takashi Iwai 2018-11-04 08:05:45 UTC
Check whether azx_vs_gpu_bound gets called on your system, and see which client_id value is passed.  The flag for runtime PM for discrete GPU must be enabled (cleared) there.
Comment 2 taijian 2018-11-04 10:43:38 UTC
Hi Takashi, 
thank you for your prompt answer. I am sorry though, you'll have be be a little more ELI5 with me - I do know how to bisect a kernel bug, but other than that I'll need some handholding to do the debugging that you need. Back over here (https://bugs.freedesktop.org/show_bug.cgi?id=106597) Lukas Wunner did a very good job leading me through the necessary steps (just to give you an idea of my level of ...competence...).

So, where would I look for azx_vs_gpu_bound? 
Should I look in both the 'vanilla' kernel and the one where I reverted your patch?

Thank you for your help!
Comment 3 Takashi Iwai 2018-11-04 21:27:56 UTC
For checking the call of azx_vs_gpu_bound(), just put a debug print like:

--- a/sound/pci/hda/hda_intel.c
+++ b/sound/pci/hda/hda_intel.c
@@ -1330,6 +1330,7 @@ static void azx_vs_gpu_bound(struct pci_dev *pci,
        struct azx *chip = card->private_data;
        struct hda_intel *hda = container_of(chip, struct hda_intel, chip);
 
+       dev_info(&pci->dev, "XXX azx_vs_gpu_bound: client_id=%d\n", client_id);
        if (client_id == VGA_SWITCHEROO_DIS)
                hda->need_eld_notify_link = 0;
        setup_vga_switcheroo_runtime_pm(chip);

... onto the original code (no revert), and reboot and check the kernel message.  I guess this print won't be triggered.

If my guess is correct, a patch below might fix the issue.  Please check it out.
Comment 4 Takashi Iwai 2018-11-04 21:28:22 UTC
Created attachment 279309 [details]
Test fix patch
Comment 5 taijian 2018-11-04 23:34:04 UTC
Created attachment 279317 [details]
dmesg of kernel 4.19 with both patches applied

OK, seems that you were right on the first count - I'm not seeing that debug message printed in the dmesg. However, even with your proposed patch applied, I'm still getting this:

       │ File: /sys/kernel/debug/dri/1/amdgpu_pm_info
───────┼──────────────────────────────────────────────
   1   │ Clock Gating Flags Mask: 0x3fbcf
   2   │         Graphics Medium Grain Clock Gating: On
   3   │         Graphics Medium Grain memory Light Sleep: On
   4   │         Graphics Coarse Grain Clock Gating: On
   5   │         Graphics Coarse Grain memory Light Sleep: On
   6   │         Graphics Coarse Grain Tree Shader Clock Gating: Off
   7   │         Graphics Coarse Grain Tree Shader Light Sleep: Off
   8   │         Graphics Command Processor Light Sleep: On
   9   │         Graphics Run List Controller Light Sleep: On
  10   │         Graphics 3D Coarse Grain Clock Gating: Off
  11   │         Graphics 3D Coarse Grain memory Light Sleep: Off
  12   │         Memory Controller Light Sleep: On
  13   │         Memory Controller Medium Grain Clock Gating: On
  14   │         System Direct Memory Access Light Sleep: Off
  15   │         System Direct Memory Access Medium Grain Clock Gating: On
  16   │         Bus Interface Medium Grain Clock Gating: Off
  17   │         Bus Interface Light Sleep: On
  18   │         Unified Video Decoder Medium Grain Clock Gating: On
  19   │         Video Compression Engine Medium Grain Clock Gating: On
  20   │         Host Data Path Light Sleep: On
  21   │         Host Data Path Medium Grain Clock Gating: On
  22   │         Digital Right Management Medium Grain Clock Gating: Off
  23   │         Digital Right Management Light Sleep: Off
  24   │         Rom Medium Grain Clock Gating: On
  25   │         Data Fabric Medium Grain Clock Gating: Off
  26   │ 
  27   │ GFX Clocks and Power:
  28   │         0 MHz (MCLK)
  29   │         0 MHz (SCLK)
  30   │         300 MHz (PSTATE_SCLK)
  31   │         300 MHz (PSTATE_MCLK)
  32   │         900 mV (VDDGFX)
  33   │         10.57 W (average GPU)
  34   │ 
  35   │ GPU Temperature: 49 C
  36   │ GPU Load: 0 %
  37   │ 
  38   │ UVD: Disabled
  39   │ 
  40   │ VCE: Disabled
Comment 6 Takashi Iwai 2018-11-05 07:28:30 UTC
Thanks for testing.  I see the problem in my test patch.  vga_switcheroo_ready() isn't suitable there, and we should call it there unconditionally.

The revised patch is below.  Please give it a try.
Comment 7 Takashi Iwai 2018-11-05 07:28:59 UTC
Created attachment 279319 [details]
Revised test patch
Comment 8 taijian 2018-11-05 09:39:58 UTC
OK, with the revised patch from comment 7 and your debug print from earlier applied on top of 4.20-rc1 I am now getting a kernel print:

> $ dmesg | grep -i azx
> [14.320229] snd_hda_intel 0000:01:00.1: XXX azx_vs_gpu_bound: client_id=1

Also:

File: /sys/kernel/debug/dri/1/amdgpu_pm_info
───────┼───────────────────────────────────────
   1   │ Clock Gating Flags Mask: 0x3fbcf
   2   │         Graphics Medium Grain Clock Gating: On
   3   │         Graphics Medium Grain memory Light Sleep: On
   4   │         Graphics Coarse Grain Clock Gating: On
   5   │         Graphics Coarse Grain memory Light Sleep: On
   6   │         Graphics Coarse Grain Tree Shader Clock Gating: Off
   7   │         Graphics Coarse Grain Tree Shader Light Sleep: Off
   8   │         Graphics Command Processor Light Sleep: On
   9   │         Graphics Run List Controller Light Sleep: On
  10   │         Graphics 3D Coarse Grain Clock Gating: Off
  11   │         Graphics 3D Coarse Grain memory Light Sleep: Off
  12   │         Memory Controller Light Sleep: On
  13   │         Memory Controller Medium Grain Clock Gating: On
  14   │         System Direct Memory Access Light Sleep: Off
  15   │         System Direct Memory Access Medium Grain Clock Gating: On
  16   │         Bus Interface Medium Grain Clock Gating: Off
  17   │         Bus Interface Light Sleep: On
  18   │         Unified Video Decoder Medium Grain Clock Gating: On
  19   │         Video Compression Engine Medium Grain Clock Gating: On
  20   │         Host Data Path Light Sleep: On
  21   │         Host Data Path Medium Grain Clock Gating: On
  22   │         Digital Right Management Medium Grain Clock Gating: Off
  23   │         Digital Right Management Light Sleep: Off
  24   │         Rom Medium Grain Clock Gating: On
  25   │         Data Fabric Medium Grain Clock Gating: Off
  26   │ 
  27   │ PX asic powered off

I guess that counts as a success?
Comment 9 Takashi Iwai 2018-11-05 10:59:50 UTC
Looks good, yeah.  I'm going to submit the proper patch (attached below) for review to the upstream.  Thanks!
Comment 10 Takashi Iwai 2018-11-05 11:01:35 UTC
Created attachment 279323 [details]
Fix patch (proeprly formatted)
Comment 11 taijian 2018-11-05 13:07:25 UTC
Thank you for the quick resolution!
Comment 12 Takashi Iwai 2018-11-06 12:36:59 UTC
The fix was merged to sound git tree and will be likely included in 4.20-rc2, then backported to 4.19.y stable eventually later.

Note You need to log in before you can comment on or make changes to this bug.