Bug 201763

Summary: amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
Product: Drivers
Reporter: Rogério Brito (rbrito)
Component: Video (DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: normal
CC: fin4478, vyanitskiy
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.18.10
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg log with kernel 4.18.10
AMD wip kernel config with 1000Hz timer for Ryzen 5 1600 desktop PC
dmesg log of kernel 4.19 with error messages amdgpu
corresponding Xorg log to dmesg with error messages from amdgpu

Description Rogério Brito 2018-11-22 06:41:47 UTC
Dear developers,

I have a notebook that is giving me a lot of trouble, especially as it is the newest computer that I have at my disposal where I can perform any actual work.

This notebook has VERY frequent lockups (hard freezes, where the screen displays something where I am working, but the keyboard does not respond, the mouse doesn't either etc.).

The only way that I can resume any work is by forcibly shutting it down (by pushing the power button for many seconds) and, of course, having all previous unsaved work lost (including a previous report of this bug with the bugzilla interface). :-(

I have not discovered any particular pattern that triggers the problem (over the past year, at least), but I do see some error messages in my dmesg logs. I would like to start by sharing some of the more salient ones and, perhaps, zero in on loose ends that may help everybody with similar hardware.

The notebook in question is a Dell Inspiron 5548 with a Core i7-5500U and two graphics cards (or so I am told): one integrated with the CPU and the other a discrete AMD GPU.

The userspace that I am using is a Debian testing (soon to be Debian 10) with Debian's kernel 4.18.0-2-amd64 (which is actually a kernel 4.18.10-2). I can, of course, test any other kernels for the sake of getting things fixed. Just let me know and I will do my best to fix things.

BTW, no kernel that I have ever used on this machine has ever worked perfectly, even the prebuilt, unpatched kernels that Ubuntu compiles daily.

With that being said, when the kernel boots up, there are some red messages on the dmesg log that are related to this AMD GPU (the entire dmesg log will be attached) with some prominent lines being:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
$ dmesg | grep -E -i "drm|amdgpu|i965"
[   13.054583] fb: switching to inteldrmfb from EFI VGA
[   13.059874] [drm] Replacing VGA console driver
[   13.060448] [drm] ACPI BIOS requests an excessive sleep of 10000 ms, using 1500 ms instead
[   13.061122] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[   13.061126] [drm] Driver supports precise vblank timestamp query.
[   13.067596] [drm] Initialized i915 1.6.0 20180514 for 0000:00:02.0 on minor 0
[   13.071479] fbcon: inteldrmfb (fb0) is primary device
[   13.475640] [drm] amdgpu kernel modesetting enabled.
[   14.237446] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[   14.238730] amdgpu 0000:04:00.0: enabling device (0100 -> 0103)
[   14.240876] [drm] initializing kernel modesetting (TOPAZ 0x1002:0x6900 0x1028:0x0643 0x00).
[   14.240888] [drm] register mmio base: 0xC2000000
[   14.240888] [drm] register mmio size: 262144
[   14.240898] [drm] probing gen 2 caps for device 8086:9c98 = 5323c42/0
[   14.240899] [drm] probing mlw for device 8086:9c98 = 5323c42
[   14.240901] [drm] add ip block number 0 <vi_common>
[   14.240901] [drm] add ip block number 1 <gmc_v7_0>
[   14.240902] [drm] add ip block number 2 <iceland_ih>
[   14.240902] [drm] add ip block number 3 <powerplay>
[   14.240903] [drm] add ip block number 4 <gfx_v8_0>
[   14.240904] [drm] add ip block number 5 <sdma_v2_4>
[   14.240905] amdgpu 0000:04:00.0: kfd not supported on this ASIC
[   14.262237] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[   14.346850] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_mc.bin
[   14.347897] amdgpu 0000:04:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
[   14.348877] amdgpu 0000:04:00.0: GTT: 256M 0x0000000000000000 - 0x000000000FFFFFFF
[   14.350661] [drm] Detected VRAM RAM=2048M, BAR=256M
[   14.351494] [drm] RAM width 64bits DDR3
[   14.358584] [drm] amdgpu: 2048M of VRAM memory ready
[   14.359442] [drm] amdgpu: 3072M of GTT memory ready.
[   14.360290] [drm] GART: num cpu pages 65536, num gpu pages 65536
[   14.361807] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[   14.363257] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[   14.364522] [drm] Driver supports precise vblank timestamp query.
[   14.365629] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_pfp.bin
[   14.366113] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_me.bin
[   14.366262] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_ce.bin
[   14.366264] [drm] Chained IB support enabled!
[   14.366414] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_rlc.bin
[   14.381603] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_mec.bin
[   14.384759] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_sdma.bin
[   14.386476] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_sdma1.bin
[   14.461717] amdgpu 0000:04:00.0: firmware: direct-loading firmware amdgpu/topaz_smc.bin
[   14.465233] amdgpu: [powerplay] can't get the mac of 5
[   14.467411] amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[   14.477284] [drm] Initialized amdgpu 3.26.0 20150101 for 0000:04:00.0 on minor 1
[   21.824595] amdgpu: [powerplay] VI should always have 2 performance levels
[   21.871775] amdgpu 0000:04:00.0: GPU pci config reset
[   22.574755] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[   22.577374] amdgpu: [powerplay] can't get the mac of 5
[   22.578782] amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[   29.802702] amdgpu: [powerplay] VI should always have 2 performance levels
[   29.848230] amdgpu 0000:04:00.0: GPU pci config reset
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Well, that's it. Please let me know whatever information you want me to get and I will post it here.

Reiterating: I can compile kernels and/or run other programs to diagnose anything that you want me to, in order to fix these issues.


Thanks in advance,

Rogério Brito.
Comment 1 Rogério Brito 2018-11-22 06:46:02 UTC
Created attachment 279599 [details]
dmesg log with kernel 4.18.10
Comment 2 Michel Dänzer 2018-11-22 15:52:43 UTC
From the dmesg output, it looks like the AMD GPU is powered off most of the time. Do the freezes happen when you explicitly use it for something, e.g. for a game via DRI_PRIME=1?
Comment 3 fin4478 2018-11-25 12:10:48 UTC
Before reporting, test with the latest drivers. That means these:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.21-wip
https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers

Use the bionic version of the oibaf PPA with Debian testing. Google how to use PPAs in Debian. Disable vsync in the Xfce compositor settings, and disable Thunar thumbnails too.
You can use my kernel config as a base and add more drivers for your hardware with make xconfig.
Comment 4 fin4478 2018-11-25 12:11:59 UTC
Created attachment 279641 [details]
AMD wip kernel config with 1000Hz timer for Ryzen 5 1600 desktop PC
Comment 5 Rogério Brito 2019-02-28 22:31:19 UTC
Dear Michel,

First of all, sorry for the late reply. I had a really bad start of the year (death in family, complications caused by that, health problems, fire at home and also recovering from that hard hit etc.)

So, I'm really sorry for the late reply.

(In reply to Michel Dänzer from comment #2)
> From the dmesg output, it looks like the AMD GPU is powered off most of the
> time. Do the freezes happen when you explicitly use it for something, e.g.
> for a game via DRI_PRIME=1?

I never play games (really, the only game that I played in the last few years was 2048 on a browser), but I guess that other applications may use the discrete AMD GPU that this notebook has.

I just set the DRI_PRIME variable now in my .bash_profile file and I will observe if I still get the lock ups. OTOH, while opening terminal sessions (I live by them), I just observed the following in my dmesg logs:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[ 4335.591693] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 4335.594671] amdgpu: [powerplay] can't get the mac of 5
[ 4335.595690] amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[ 4341.181479] amdgpu: [powerplay] VI should always have 2 performance levels
[ 4341.231068] amdgpu 0000:04:00.0: GPU pci config reset
[ 4433.700699] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 4433.705976] amdgpu: [powerplay] can't get the mac of 5
[ 4433.707025] amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[ 4439.230380] amdgpu: [powerplay] VI should always have 2 performance levels
[ 4439.276205] amdgpu 0000:04:00.0: GPU pci config reset
[ 4843.838487] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 4843.842649] amdgpu: [powerplay] can't get the mac of 5
[ 4843.844046] amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[ 4849.072890] amdgpu: [powerplay] VI should always have 2 performance levels
[ 4849.121352] amdgpu 0000:04:00.0: GPU pci config reset
[ 4954.354975] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 4954.358935] amdgpu: [powerplay] can't get the mac of 5
[ 4954.360287] amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[ 4960.173664] amdgpu: [powerplay] VI should always have 2 performance levels
[ 4960.219082] amdgpu 0000:04:00.0: GPU pci config reset
[ 4982.871619] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 4982.874760] amdgpu: [powerplay] can't get the mac of 5
[ 4982.875794] amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[ 4988.077968] amdgpu: [powerplay] VI should always have 2 performance levels
[ 4988.126289] amdgpu 0000:04:00.0: GPU pci config reset
[ 5023.317917] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 5023.321614] amdgpu: [powerplay] can't get the mac of 5
[ 5023.322918] amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[ 5029.036045] amdgpu: [powerplay] VI should always have 2 performance levels
[ 5029.081469] amdgpu 0000:04:00.0: GPU pci config reset
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
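
The excerpt shows the same five-line pattern repeating. A minimal sketch for measuring how long each cycle lasts, assuming (this is an interpretation, not something the log states) that each "PCIE GART of 256M enabled" line marks the GPU waking up and the next "GPU pci config reset" line marks it going back down. The function name and both marker constants are made up for illustration:

```python
import re

# Matches the "[ 4335.591693] message" shape of the dmesg lines quoted above.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

# Assumed cycle markers, copied from this machine's log (not a general rule).
POWER_UP = "PCIE GART of 256M enabled"
POWER_DOWN = "GPU pci config reset"


def power_cycles(lines):
    """Return (start, end) timestamp pairs, one per power-up/reset cycle."""
    cycles = []
    start = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip separators and anything without a timestamp
        stamp, message = float(match.group(1)), match.group(2)
        if POWER_UP in message:
            start = stamp
        elif POWER_DOWN in message and start is not None:
            cycles.append((start, stamp))
            start = None
    return cycles
```

Fed the lines above, it finds six cycles, each lasting between five and six seconds. To run it against a live log, pass it the output of dmesg split into lines.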

I'm using, as you may expect, Debian's testing distribution (I can give you the precise details), upgraded almost daily (when I am not up-to-date, I am 1 or 2 days late due to weekends, when I have to take care of my son).

I observed a few details more with respect to the bug:

1 - The problem of freezes has always occurred when I am using the GUI and clicking or typing something. It is my (wild) guess that the problem occurs when many interrupts happen, but I have no way to prove it.

   I have not yet seen the freezes when I leave the computer running scripts to perform some long job (say, reencoding some lecture videos that I get from youtube to make them smaller) with ffmpeg, even if it takes many days of uninterrupted computation (and heat being generated).

   OTOH, if I am interacting with it with a mouse intensely (say, with a program like scantailor or some other programs), switching windows or editing some texts in Emacs, then I get freezes in just a few hours (say, 3 or 4 hours).

   In fact, I hope that it doesn't occur while I am typing this report (crossing fingers and copying the contents to Emacs so that I can paste them back in case it freezes).

2 - The problem isn't detected by Dell's built-in UEFI system diagnostics application (as I said, it seems to happen when I interact with the computer and the screen is being constantly updated).

3 - I discovered that whatever bug this is, it actually doesn't *completely* freeze the computer, since at least the sound card keeps playing sound in a loop (not intentionally on my part; it is probably replaying the samples that are already in the sound card's memory).

I recorded a few (short) videos of the problem that I see and I uploaded them to YouTube:

  * https://www.youtube.com/watch?v=6o7Fl8kqtwg
  * https://www.youtube.com/watch?v=9zPluvySdIM

If you have any idea, please let me know.

Even if the freezes have nothing to do with the video card, I would like to have the GPU messages (which, as you mention, may be indicative of something) fixed, in the hope that it fixes things for other users who may not take the initiative to report the problem to the developers.

As a last resort, I may end up selling this computer (even though the money will not be sufficient to buy one with similar specs). :-(



Thanks,

Rogério Brito.
Comment 6 Rogério Brito 2019-02-28 22:33:55 UTC
Oh, I forgot to say that the kernel that I am using is currently identified as:

    Linux zatz 4.19.0-2-amd64 #1 SMP Debian 4.19.16-1 (2019-01-17) x86_64 GNU/Linux

I can report the versions of the graphics stack once I know what is relevant. I can also try to stress test anything here.


Thanks once again,

Rogério Brito.
Comment 7 Michel Dänzer 2019-03-01 16:14:34 UTC
(In reply to Rogério Brito from comment #5)
> First of all, sorry for the late reply. I had a really bad start of
> the year (death in family, complications caused by that, health problems,
> fire at home and also recovering from that hard hit etc.)

Nothing to apologize for, I hope things are (getting) better for you now!


> (In reply to Michel Dänzer from comment #2)
> > From the dmesg output, it looks like the AMD GPU is powered off most of the
> > time. Do the freezes happen when you explicitly use it for something, e.g.
> > for a game via DRI_PRIME=1?
> 
> I never play games (really, the only game that I played in the last few
> years was 2048 on a browser), but I guess that other applications may use
> the discrete AMD GPU that this notebook has.

The AMD GPU should only be used if you explicitly choose to, by setting DRI_PRIME=1 or maybe using a corresponding setting of your desktop environment. Maybe the AMD GPU is only getting powered up accidentally, and the freezes happen due to something going wrong while powering it up/down.
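
To try the AMD GPU for a single program instead of exporting the variable for the whole session, the variable can be set for just one child process. A minimal sketch; both helper names are hypothetical:

```python
import os
import subprocess


def prime_env(base_env=None):
    """Return a copy of the environment with DRI_PRIME=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DRI_PRIME"] = "1"
    return env


def run_on_discrete_gpu(command):
    """Run one command with DRI_PRIME=1; the caller's environment is untouched."""
    return subprocess.run(command, env=prime_env(), check=False)
```

For example, run_on_discrete_gpu(["glxinfo", "-B"]) would start only that one program with the variable set (glxinfo is just an example and has to be installed separately), so a freeze that follows can be tied to a deliberate use of the GPU.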

Please attach the corresponding Xorg log file, preferably captured after dmesg has at least two instances of

 [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).


You could also try modprobe.blacklist=amdgpu on the kernel command line, to see if the freezes happen even if the amdgpu driver never initializes the AMD GPU.
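
After rebooting with that parameter, it is worth confirming that it took effect. A minimal sketch that reads /proc/cmdline and /proc/modules; the helper names are hypothetical, and the check only says whether the module is named on the command line and whether it is currently loaded:

```python
def blacklisted_modules(cmdline):
    """Return the set of modules named in modprobe.blacklist= parameters."""
    modules = set()
    for token in cmdline.split():
        if token.startswith("modprobe.blacklist="):
            value = token.split("=", 1)[1]
            modules.update(name for name in value.split(",") if name)
    return modules


def loaded_modules(proc_modules_text):
    """Return the module names listed in the text of /proc/modules."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def read_text(path):
    """Read a small text file such as /proc/cmdline."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

The expected result is that "amdgpu" is in blacklisted_modules(read_text("/proc/cmdline")) and not in loaded_modules(read_text("/proc/modules")).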
Comment 8 Rogério Brito 2019-03-30 02:22:07 UTC
Dear Michel,

(In reply to Michel Dänzer from comment #7)
> (In reply to Rogério Brito from comment #5)
> > First of all, sorry for the late reply. I had a really bad start of
> > the year (death in family, complications caused by that, health problems,
> > fire at home and also recovering from that hard hit etc.)
> 
> Nothing to apologize for, I hope things are (getting) better for you now!

Things are slowly getting better now (still working on fixing things related to the fire at home).

> > (In reply to Michel Dänzer from comment #2)
> > > From the dmesg output, it looks like the AMD GPU is powered off most of the
> > > time. Do the freezes happen when you explicitly use it for something, e.g.
> > > for a game via DRI_PRIME=1?
> > 
> > I never play games (really, the only game that I played in the last few
> > years was 2048 on a browser), but I guess that other applications may use
> > the discrete AMD GPU that this notebook has.
> 
> The AMD GPU should only be used if you explicitly choose to, by setting
> DRI_PRIME=1 or maybe using a corresponding setting of your desktop
> environment. Maybe the AMD GPU is only getting powered up accidentally, and
> the freezes happen due to something going wrong while powering it up/down.

Nice to know that. I may have mentioned this before, but I put DRI_PRIME=1 in my .bash_profile file. I notice that when I open/close Firefox, I get one instance of:

------------
[drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
amdgpu: [powerplay] can't get the mac of 5
amdgpu: [powerplay] VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
------------

> Please attach the corresponding Xorg log file, preferably captured after
> dmesg has at least two instances of
> 
>  [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).

OK, I am attaching both a dmesg log and the corresponding Xorg log, captured as I am writing this (I just performed a cold boot, to rule things out), but the Xorg log doesn't contain anything after the first 50 seconds or so...

I can turn on some debug options, if you want me to.

> You could also try modprobe.blacklist=amdgpu on the kernel command line, to
> see if the freezes happen even if the amdgpu driver never initializes the
> AMD GPU.

OK, I will do that after I finish this message.


Thanks,

Rogério Brito.
Comment 9 Rogério Brito 2019-03-30 02:23:30 UTC
Created attachment 282069 [details]
dmesg log of kernel 4.19 with error messages amdgpu
Comment 10 Rogério Brito 2019-03-30 02:24:37 UTC
Created attachment 282071 [details]
corresponding Xorg log to dmesg with error messages from amdgpu
Comment 11 Rogério Brito 2020-11-10 21:46:42 UTC
Dear Michel and other people,

Since the last time that I reported this bug, the lock ups have not happened anymore.

OTOH, the messages on the dmesg log persist. I can include newer logs (but I don't think that many things have changed since then). Just as a reminder, here is what I'm getting:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(...)
[468341.815312] amdgpu: can't get the mac of 5
[468341.816323] amdgpu: VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[468347.792326] amdgpu: VI should always have 2 performance levels
(...)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Of course, the cause may be something else earlier in the logs. Since the first time that I reported the issue, I upgraded the BIOS/Firmware from Dell's site, but I'm guessing that it is very conservative and only includes an "updated" (not so much) microcode for the CPU vulnerabilities of all these years.

I'm running an up-to-date Debian testing distribution, but I can perform any (non-destructive :-)) tests that you want me to.


Thanks,

Rogério Brito.
Comment 12 Vadim Yanitskiy 2020-11-19 10:43:13 UTC
Hello all,

I have a DELL Inspiron 5547 with Radeon R7 M265:

03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445 / 530/535 / 620/625 Mobile]

I never experienced any lockups with it, most likely because I was running a quite old kernel most of the time.  About a year ago, I started to use Arch Linux (thus more or less recent kernels), and started to see these messages too:

[ 3916.822707] amdgpu: can't get the mac of 5
[ 3916.824691] amdgpu: VBIOS did not find boot engine clock value in dependency table. Using Memory DPM level 0!
[ 3923.082543] amdgpu: VI should always have 2 performance levels

It's not like they indicate any problems, the GPU actually works: with hashcat and proprietary OpenCL run-time on top of open source amdgpu driver I get nearly the same performance as under Windows; and even OpenGL/Vulkan rendering seems to work (although performance is significantly worse compared to Intel Graphics).  Even though I use Intel Graphics most of the time, I was always interested to investigate the cause of those warnings.

I had a quick look at the kernel's code, and from what I can see they are all related to the power management (powerplay).  I patched and compiled my own kernel to get a bit more information, and here is what I managed to understand:

> [ 3916.822707] amdgpu: can't get the mac of 5

According to 'drivers/gpu/drm/amd/powerplay/inc/smumgr.h', the 'mac 5' corresponds to SMU_MAX_LEVELS_VDDGFX.  This value is neither handled in iceland_get_mac_definition(), nor is it defined in 'drivers/gpu/drm/amd/powerplay/inc/smu71.h'.  For other GPU families this constant is used in '*_Discrete_DpmTable', while in 'SMU71_Discrete_DpmTable' I could not find anything related to VDDGFX.  Therefore I guess this GPU family (Iceland, SMU71) does not support this kind of power control.

> [57695.583784] amdgpu: VBIOS did not find boot engine clock value in
> dependency table. Using Memory DPM level 0!

This is something I would love to investigate further, but unfortunately have no time.  The warning itself comes from iceland_populate_smc_boot_level() defined in 'drivers/gpu/drm/amd/powerplay/smumgr/iceland_smumgr.c'.  This function attempts to get initial clock levels for Graphics DPM and Memory DPM from VBIOS.

Since we see only one warning, it successfully gets the clock value for Graphics DPM, but not for Memory DPM.  The function attempts to find the value 'data->vbios_boot_state.mclk_bootup_value' in the table 'data->dpm_table.mclk_table', which in turn is populated by iceland_populate_all_memory_levels().  I need to add some more debug statements to see the contents of this table and the value that the function attempts to find in it.

> [ 3923.082543] amdgpu: VI should always have 2 performance levels

I patched the kernel to provide more details in this message, so:

> [ 5312.502812] amdgpu: VI should always have 2 performance levels, however 1
> was detected

This one comes from smu7_apply_state_adjust_rules() defined in 'drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c'.  As far as I can see, the code is able to handle values !=2, and in some places I see checks like ==1, so most likely this warning can be safely ignored.

As a conclusion, I would say that none of those warnings is critical.

P.S. I am not a kernel developer, nor am I familiar with the amdgpu code base.  Just had some spare time :)

Best regards,
Vadim.
Comment 13 Vadim Yanitskiy 2020-11-19 14:23:09 UTC
Here we go:

[  582.721066] amdgpu: iceland_populate_all_memory_levels(): mclk_table has 3 entries
[  582.721081] amdgpu: iceland_populate_all_memory_levels(): dpm_levels[0] is 30000
[  582.721095] amdgpu: iceland_populate_all_memory_levels(): dpm_levels[1] is 60000
[  582.721110] amdgpu: iceland_populate_all_memory_levels(): dpm_levels[2] is 90000
[  582.722669] amdgpu: VBIOS did not find boot engine clock value (29900) in dependency table. Using Memory DPM level 0!

As can be seen, the driver falls back to level 0, which is very close to the requested value (29900 vs 30000).  Looks like a bug in VBIOS, because AFAIU, value 29900 comes from there (see smu7_dpm_patch_boot_state() in 'drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c').  In any case, this does not look critical to me either.
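
The mismatch can be reproduced with the numbers from the debug output. This is a minimal Python model of the behaviour described in this thread (an exact-match search that falls back to level 0), not the kernel's actual code; the function names are made up, and nearest_level() is a hypothetical alternative added only for comparison:

```python
MCLK_TABLE = [30000, 60000, 90000]  # dpm_levels[0..2] from the debug output
BOOT_MCLK = 29900                   # value printed in the patched warning


def find_boot_level(table, value):
    """Exact-match lookup: return the index of value, or None if absent."""
    for index, level in enumerate(table):
        if level == value:
            return index
    return None


def boot_level_with_fallback(table, value):
    """Model of the logged behaviour: use level 0 when the lookup misses."""
    index = find_boot_level(table, value)
    return 0 if index is None else index


def nearest_level(table, value):
    """Hypothetical alternative: index of the entry closest to value."""
    return min(range(len(table)), key=lambda i: abs(table[i] - value))
```

With these inputs the exact match misses, the fallback picks level 0, and a nearest-match search would also pick level 0, since 29900 is only 100 below the first entry. Under that reading the warning changes nothing about which level ends up selected.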

Best regards,
Vadim.