Bug 212077

Summary: AMD GPU discrete card memory at highest frequency even while not in use
Product: Drivers
Reporter: Bat Malin (bat_malin)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: high
CC: alexdeucher
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.11.3
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Dmesg
Picture of memory status
Picture of memory status (new)
Dmesg (new)
possible fix

Description Bat Malin 2021-03-05 19:44:20 UTC
Created attachment 295677 [details]
Dmesg

[    1.240847] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240850] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240851] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240852] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240853] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240854] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240855] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240856] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240857] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
[    1.240858] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
Dmesg attached
Comment 1 Bat Malin 2021-03-05 19:50:12 UTC
Created attachment 295679 [details]
Picture of memory status
Comment 2 Alex Deucher 2021-03-05 19:51:47 UTC
Should be fixed with this patch:
https://patchwork.freedesktop.org/patch/422999/
Comment 3 Bat Malin 2021-03-06 19:45:09 UTC
Thank you Alex!
Comment 4 Bat Malin 2021-03-08 19:43:07 UTC
Issue not fixed in kernel 5.11.4
Comment 5 Bat Malin 2021-03-10 20:44:32 UTC
Issue still present in 5.11.5
[    1.335057] amdgpu: Clock is not in range of specified clock range for watermark from DAL!  Using highest water mark set.
Comment 6 Bat Malin 2021-03-10 20:55:44 UTC
No change in the code of 5.12-rc2...

for (i = 0; i < dep_mclk_table->count; i++) {
	for (j = 0; j < dep_sclk_table->count; j++) {
		valid_entry = false;
		for (k = 0; k < watermarks->num_wm_sets; k++) {
			if (dep_sclk_table->entries[i].clk / 10 >= watermarks->wm_clk_ranges[k].wm_min_eng_clk_in_khz &&
			    dep_sclk_table->entries[i].clk / 10 < watermarks->wm_clk_ranges[k].wm_max_eng_clk_in_khz &&
			    dep_mclk_table->entries[i].clk / 10 >= watermarks->wm_clk_ranges[k].wm_min_mem_clk_in_khz &&
			    dep_mclk_table->entries[i].clk / 10 < watermarks->wm_clk_ranges[k].wm_max_mem_clk_in_khz) {
				valid_entry = true;
				table->DisplayWatermark[i][j] = watermarks->wm_clk_ranges[k].wm_set_id;
				break;
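
The range check quoted above can be tried out in isolation. The following is a minimal standalone sketch, not the kernel code and not the patch from comment 2: `struct wm_range`, `find_wm_set`, `lookup_div10`, `lookup_scaled` and the sample ranges are all hypothetical. It assumes that the dependency-table clocks are stored in 10 kHz units (the dmesg excerpts in this report list the top engine clock both as 1275000 and as 127500, which would be consistent with that). Under that assumption, dividing the table value by 10 before comparing it with a kHz range can leave every entry unmatched, which would be one way to end up at the "Using highest water mark set" fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one watermark clock range (kHz). */
struct wm_range {
	uint32_t min_khz;   /* inclusive lower bound */
	uint32_t max_khz;   /* exclusive upper bound */
	int set_id;         /* watermark set used inside this range */
};

/* Return the set_id of the first range containing clk_khz, or -1 if none. */
static int find_wm_set(const struct wm_range *ranges, size_t n, uint32_t clk_khz)
{
	assert(ranges != NULL || n == 0);
	for (size_t k = 0; k < n; k++) {
		if (clk_khz >= ranges[k].min_khz && clk_khz < ranges[k].max_khz)
			return ranges[k].set_id;
	}
	return -1;
}

/* Comparison in the style of the quoted code: table value divided by 10. */
static int lookup_div10(const struct wm_range *r, size_t n, uint32_t clk_10khz)
{
	return find_wm_set(r, n, clk_10khz / 10);
}

/* Assumed-unit comparison: a 10 kHz value scaled up to kHz first. */
static int lookup_scaled(const struct wm_range *r, size_t n, uint32_t clk_10khz)
{
	return find_wm_set(r, n, clk_10khz * 10);
}

/* Made-up sample ranges, for illustration only. */
static const struct wm_range sample_ranges[] = {
	{ 200000,  700000, 0 },
	{ 700000, 1300000, 1 },
};
```

With these sample ranges, `lookup_div10(sample_ranges, 2, 127500)` finds no matching range, while `lookup_scaled(sample_ranges, 2, 127500)` lands in set 1.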
Comment 7 Bat Malin 2021-03-13 20:05:58 UTC
Code not fixed in 5.11.6
Comment 8 Bat Malin 2021-03-17 19:11:34 UTC
Code fixed in 5.11.7
Thank you!
Comment 9 Bat Malin 2021-03-17 19:19:04 UTC
Code fixed, but the GPU is still running at the highest possible clock.
Comment 10 Bat Malin 2021-03-17 19:21:47 UTC
Created attachment 295905 [details]
Picture of memory status (new)
Comment 11 Bat Malin 2021-03-17 19:23:23 UTC
Created attachment 295907 [details]
Dmesg (new)
Comment 12 Bat Malin 2021-03-19 19:44:35 UTC
An older kernel, e.g. 5.10.23, initializes these values for the discrete card:
[    1.038643] [drm] DM_PPLIB: values for Engine clock
[    1.038645] [drm] DM_PPLIB:	 214000
[    1.038646] [drm] DM_PPLIB:	 603000
[    1.038646] [drm] DM_PPLIB:	 958000
[    1.038647] [drm] DM_PPLIB:	 1060000
[    1.038647] [drm] DM_PPLIB:	 1128000
[    1.038647] [drm] DM_PPLIB:	 1182000
[    1.038648] [drm] DM_PPLIB:	 1230000
[    1.038648] [drm] DM_PPLIB:	 1275000
[    1.038649] [drm] DM_PPLIB: Validation clocks:
[    1.038649] [drm] DM_PPLIB:    engine_max_clock: 127500
[    1.038649] [drm] DM_PPLIB:    memory_max_clock: 175000
[    1.038650] [drm] DM_PPLIB:    level           : 8
[    1.038651] [drm] DM_PPLIB: values for Memory clock
[    1.038651] [drm] DM_PPLIB:	 300000
[    1.038651] [drm] DM_PPLIB:	 625000
[    1.038652] [drm] DM_PPLIB:	 1750000
[    1.038652] [drm] DM_PPLIB: Validation clocks:
[    1.038652] [drm] DM_PPLIB:    engine_max_clock: 127500
[    1.038653] [drm] DM_PPLIB:    memory_max_clock: 175000
[    1.038653] [drm] DM_PPLIB:    level           : 8
and for the integrated card:
[    1.469248] [drm] DM_PPLIB: values for F clock
[    1.469250] [drm] DM_PPLIB:	 400000 in kHz, 2874 in mV
[    1.469251] [drm] DM_PPLIB:	 933000 in kHz, 3224 in mV
[    1.469252] [drm] DM_PPLIB:	 1067000 in kHz, 3924 in mV
[    1.469253] [drm] DM_PPLIB:	 1200000 in kHz, 4074 in mV
[    1.469256] [drm] DM_PPLIB: values for DCF clock
[    1.469257] [drm] DM_PPLIB:	 300000 in kHz, 2874 in mV
[    1.469258] [drm] DM_PPLIB:	 600000 in kHz, 3224 in mV
[    1.469259] [drm] DM_PPLIB:	 626000 in kHz, 3924 in mV
[    1.469260] [drm] DM_PPLIB:	 654000 in kHz, 4074 in mV
[    1.469553] [drm] Display Core initialized with v3.2.104!


The new kernel, 5.11.7, initializes them only for the integrated card:
[    1.992374] kernel: [drm] DM_PPLIB: values for F clock
[    1.992377] kernel: [drm] DM_PPLIB:         400000 in kHz, 2874 in mV
[    1.992379] kernel: [drm] DM_PPLIB:         933000 in kHz, 3224 in mV
[    1.992381] kernel: [drm] DM_PPLIB:         1067000 in kHz, 3924 in mV
[    1.992382] kernel: [drm] DM_PPLIB:         1200000 in kHz, 4074 in mV
[    1.992385] kernel: [drm] DM_PPLIB: values for DCF clock
[    1.992387] kernel: [drm] DM_PPLIB:         300000 in kHz, 2874 in mV
[    1.992388] kernel: [drm] DM_PPLIB:         600000 in kHz, 3224 in mV
[    1.992390] kernel: [drm] DM_PPLIB:         626000 in kHz, 3924 in mV
[    1.992391] kernel: [drm] DM_PPLIB:         654000 in kHz, 4074 in mV
So I think this is related: the new kernel driver can't initialize the values for the discrete card.
Please fix.
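
The comparison above boils down to checking which DM_PPLIB tables show up in each boot log. Here is a minimal sketch of that check, assuming the dmesg output has already been read into a string. The helper names `log_has` and `has_discrete_clock_tables` are hypothetical; the marker strings are copied from the excerpts above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the dmesg text contains the marker substring. */
static bool log_has(const char *dmesg, const char *marker)
{
	assert(marker != NULL);   /* the caller must supply a marker */
	return dmesg != NULL && strstr(dmesg, marker) != NULL;
}

/* True if both the engine and the memory clock tables were logged. */
static bool has_discrete_clock_tables(const char *dmesg)
{
	return log_has(dmesg, "DM_PPLIB: values for Engine clock") &&
	       log_has(dmesg, "DM_PPLIB: values for Memory clock");
}
```

Fed the 5.10.23 excerpt, `has_discrete_clock_tables` returns true; fed the 5.11.7 excerpt, which only contains the "F clock" and "DCF clock" tables, it returns false.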
Comment 13 Alex Deucher 2021-03-24 21:32:07 UTC
Created attachment 296035 [details]
possible fix

This patch should fix it.
Comment 14 Bat Malin 2021-03-26 20:25:45 UTC
Thank you, Alex, for your engagement! Could you please include the patch in the upcoming 5.11.11 release so that I can test it? Sorry, but I am not allowed to compile a kernel on this machine.
Comment 15 Bat Malin 2021-04-07 18:21:25 UTC
Issue fixed in 5.11.12. The card now even consumes less power (~1.07 W less).

Before:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      756.00 mV 
edge:         +35.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:        8.14 W  (cap =  60.00 W)

After:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      756.00 mV 
edge:         +38.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:        7.07 W  (cap =  60.00 W)
 
Thank you!
Comment 16 Bat Malin 2021-04-08 18:31:09 UTC
After a reboot it is even better:
amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:      756.00 mV 
edge:         +35.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:        6.22 W  (cap =  60.00 W)