Bug 211875 - CPU frequency scaling lost after "WARNING: CPU: 2 PID: 2358578 at smu8_send_msg_to_smc_with_parameter+0xfe/0x140 [amdgpu]"
Summary: CPU frequency scaling lost after "WARNING: CPU: 2 PID: 2358578 at smu8_send_m...
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel) (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-02-21 13:09 UTC by Erhard F.
Modified: 2021-09-28 22:49 UTC (History)
2 users (show)

See Also:
Kernel Version: 5.11
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (kernel 5.11, A10-9700E) (92.00 KB, text/plain)
2021-02-21 13:09 UTC, Erhard F.
Details
kernel .config (kernel 5.11, A10-9700E) (106.65 KB, text/plain)
2021-02-21 13:10 UTC, Erhard F.
Details
kernel .config (kernel 5.12-rc1, A10-9700E) (107.65 KB, text/plain)
2021-03-02 09:11 UTC, Erhard F.
Details
dmesg (kernel 5.12-rc5, A10-9700E) (66.42 KB, text/plain)
2021-03-30 09:03 UTC, Erhard F.
Details
kernel .config (kernel 5.12-rc5, A10-9700E) (109.45 KB, text/plain)
2021-03-30 09:05 UTC, Erhard F.
Details
dmesg (kernel 4.14.228, A10-9700E) (54.10 KB, text/plain)
2021-04-07 22:18 UTC, Erhard F.
Details
dmesg (kernel 5.13-rc1, A10-9700E) (90.37 KB, text/plain)
2021-05-11 02:17 UTC, Erhard F.
Details
kernel .config (kernel 5.13-rc1, A10-9700E) (107.62 KB, text/plain)
2021-05-11 02:22 UTC, Erhard F.
Details

Description Erhard F. 2021-02-21 13:09:39 UTC
Created attachment 295381 [details]
dmesg (kernel 5.11, A10-9700E)

I get this on kernel 5.10.17 and it happens on 5.11 too (though a lot less often). Leaving the machine running for some hours sooner or later I get this "smu8_send_msg_to_smc_with_parameter(0x0011, 0x460) timed out after 3234846 us" in kernel dmesg and after that CPU clock frequency is hampered to the 800 MHz base clock of the A10-9700E. Setting cpu manually to performance or back to ondemand does not change a thing, clock stays down at 800 MHz. The machine does not crash and is still usable but well ... slow..

[...]
------------[ cut here ]------------
smu8_send_msg_to_smc_with_parameter(0x0011, 0x460) timed out after 3234846 us
WARNING: CPU: 2 PID: 2358578 at smu8_send_msg_to_smc_with_parameter+0xfe/0x140 [amdgpu]
Modules linked in: fuse rfkill input_leds joydev ext4 crc16 mbcache jbd2 hid_generic usbhid hid crc32_generic crc32_pclmul f2fs evdev amdgpu aesni_intel libaes crypto_simd cryptd snd_hda_codec_hdmi glue_helper drm_ttm_helper snd_hda_codec_realtek ttm snd_hda_codec_generic ledtrig_audio led_class r8169 fam15h_power realtek gpu_sched k10temp mdio_devres ehci_pci ehci_hcd i2c_piix4 libphy snd_hda_intel snd_intel_dspcfg i2c_algo_bit snd_hda_codec drm_kms_helper snd_hwdep cfbfillrect snd_hda_core syscopyarea xhci_pci cfbimgblt snd_pcm xhci_hcd sysfillrect sysimgblt snd_timer usbcore fb_sys_fops cfbcopyarea snd usb_common fb soundcore font fbdev video acpi_cpufreq button processor nfsd zram nct6775 auth_rpcgss zsmalloc hwmon_vid jc42 lockd hwmon grace sunrpc drm drm_panel_orientation_quirks nfs_ssc backlight
CPU: 2 PID: 2358578 Comm: kworker/2:0 Not tainted 5.11.0-gentoo #2
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A320M-HDV R3.0, BIOS P3.10 06/26/2019
Workqueue: events dm_irq_work_func [amdgpu]
RIP: 0010:smu8_send_msg_to_smc_with_parameter+0xfe/0x140 [amdgpu]
Code: c0 48 c7 c7 a0 4b a8 c0 48 89 c1 48 f7 ea 48 89 c8 44 89 f1 48 c1 f8 3f 48 c1 fa 07 48 29 c2 49 89 d0 44 89 ea e8 db a8 e4 d5 <0f> 0b eb b3 41 bc ea ff ff ff eb ab 48 8b 7d 60 be c0 01 00 00 48
RSP: 0018:ffffb25b4015bb48 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 00009c65e3a5a2a5 RCX: ffffa23c0f512d58
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffffa23c0f512d50
RBP: ffffa23908ceec00 R08: 0000000000000000 R09: ffffb25b4015b980
R10: 0000000000000003 R11: ffffffff96e57e48 R12: 00000000ffffffc2
R13: 0000000000000011 R14: 0000000000000460 R15: ffffa2390e53128c
FS:  0000000000000000(0000) GS:ffffa23c0f500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f99db4fbbc0 CR3: 0000000134f00000 CR4: 00000000001506e0
Call Trace:
 smum_send_msg_to_smc_with_parameter+0xbe/0xf0 [amdgpu]
 smu8_set_power_state_tasks+0x1d6/0x360 [amdgpu]
 phm_set_power_state+0x44/0x68 [amdgpu]
 psm_adjust_power_state_dynamic+0xb6/0x1b8 [amdgpu]
 ? amdgpu_fence_wait_empty+0x80/0xd0 [amdgpu]
 pp_dpm_dispatch_tasks+0x35/0x58 [amdgpu]
 dm_pp_apply_display_requirements+0x197/0x1c0 [amdgpu]
 dce11_update_clocks+0x8f/0xe8 [amdgpu]
 dc_commit_updates_for_stream+0xc2c/0x1980 [amdgpu]
 ? dc_link_detect_helper+0x516/0xe90 [amdgpu]
 dm_set_dpms_off+0x85/0xd0 [amdgpu]
 handle_hpd_irq+0xf1/0x148 [amdgpu]
 dm_irq_work_func+0x44/0x50 [amdgpu]
 process_one_work+0x1b8/0x318
 worker_thread+0x4b/0x3a0
 ? rescuer_thread+0x360/0x360
 kthread+0x11c/0x138
 ? __kthread_bind_mask+0x58/0x58
 ret_from_fork+0x22/0x30
---[ end trace 0fa0eb110cfd5fb7 ]---
amdgpu: smu8_send_msg_to_smc_with_parameter(0x026e) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x002f) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x0012) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x0013) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x0011) aborted; SMU still servicing msg (0x0011)
[drm] amdgpu_dm_irq_schedule_work FAILED src 2
[drm] amdgpu_dm_irq_schedule_work FAILED src 2
amdgpu: smu8_send_msg_to_smc_with_parameter(0x026e) aborted; SMU still servicing msg (0x0011)
[drm] amdgpu_dm_irq_schedule_work FAILED src 2
amdgpu: smu8_send_msg_to_smc_with_parameter(0x002f) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x0012) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x0013) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x0011) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x026e) aborted; SMU still servicing msg (0x0011)
[...]

Some data about the machine:
 $ inxi -b
System:    Kernel: 5.11.0-gentoo x86_64 bits: 64 Desktop: MATE 1.24.0 Distro: Gentoo Base System release 2.7 
Machine:   Type: Desktop Mobo: ASRock model: A320M-HDV R3.0 serial: <superuser/root required> 
           UEFI [Legacy]: American Megatrends v: P3.10 date: 06/26/2019 
CPU:       Info: Quad Core AMD A10-9700E RADEON R7 10 COMPUTE CORES 4C+6G [MCP] speed: 2485 MHz min/max: 800/3000 MHz 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] driver: amdgpu v: kernel 
           Display: x11 server: X.Org 1.20.10 driver: amdgpu,ati unloaded: fbdev,modesetting resolution: 1920x1080~60Hz 
           OpenGL: renderer: AMD CARRIZO (DRM 3.40.0 5.11.0-gentoo LLVM 11.0.0) v: 4.6 Mesa 20.2.6 
Network:   Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet driver: r8169
Comment 1 Erhard F. 2021-02-21 13:10:45 UTC
Created attachment 295383 [details]
kernel .config (kernel 5.11, A10-9700E)
Comment 2 Erhard F. 2021-02-21 19:17:02 UTC
------------[ cut here ]------------
smu8_send_msg_to_smc_with_parameter(0x0011, 0x460) timed out after 3236309 us
WARNING: CPU: 1 PID: 403 at smu8_send_msg_to_smc_with_parameter+0xfe/0x140 [amdgpu]
Modules linked in: dm_crypt nhpoly1305_avx2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 algif_skcipher dm_mod fuse input_leds joydev hid_generic usbhid hid rfkill raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx ext4 crc16 mbcache jbd2 md_mod amdgpu evdev crc32_generic crc32_pclmul f2fs aesni_intel libaes crypto_simd cryptd glue_helper snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio led_class drm_ttm_helper snd_hda_codec_hdmi k10temp ttm ehci_pci gpu_sched fam15h_power ehci_hcd i2c_algo_bit r8169 i2c_piix4 snd_hda_intel snd_intel_dspcfg realtek mdio_devres xhci_pci libphy snd_hda_codec drm_kms_helper snd_hwdep xhci_hcd snd_hda_core cfbfillrect syscopyarea cfbimgblt sysfillrect usbcore sysimgblt snd_pcm fb_sys_fops usb_common cfbcopyarea snd_timer fb snd font fbdev soundcore video acpi_cpufreq button processor nfsd nct6775 hwmon_vid zram auth_rpcgss zsmalloc lockd jc42 grace drm hwmon drm_panel_orientation_quirks sunrpc
 backlight nfs_ssc
CPU: 1 PID: 403 Comm: X Not tainted 5.11.0-gentoo #2
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A320M-HDV R3.0, BIOS P3.10 06/26/2019
RIP: 0010:smu8_send_msg_to_smc_with_parameter+0xfe/0x140 [amdgpu]
Code: c0 48 c7 c7 a0 5b ac c0 48 89 c1 48 f7 ea 48 89 c8 44 89 f1 48 c1 f8 3f 48 c1 fa 07 48 29 c2 49 89 d0 44 89 ea e8 db 98 e0 cf <0f> 0b eb b3 41 bc ea ff ff ff eb ab 48 8b 7d 60 be c0 01 00 00 48
RSP: 0018:ffff9fc60749b788 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 00000d3740b628cb RCX: ffff9d430f492d58
RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff9d430f492d50
RBP: ffff9d400f23a800 R08: 0000000000000000 R09: ffff9fc60749b5c0
R10: 0000000000000003 R11: ffffffff90e57e48 R12: 00000000ffffffc2
R13: 0000000000000011 R14: 0000000000000460 R15: ffff9d405006988c
FS:  00007ff812a7b980(0000) GS:ffff9d430f480000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff7e8fef000 CR3: 000000010f26e000 CR4: 00000000001506e0
Call Trace:
 smum_send_msg_to_smc_with_parameter+0xbe/0xf0 [amdgpu]
 smu8_set_power_state_tasks+0x1d6/0x360 [amdgpu]
 phm_set_power_state+0x44/0x68 [amdgpu]
 psm_adjust_power_state_dynamic+0xb6/0x1b8 [amdgpu]
 ? amdgpu_fence_wait_empty+0x80/0xd0 [amdgpu]
 pp_dpm_dispatch_tasks+0x35/0x58 [amdgpu]
 dm_pp_apply_display_requirements+0x197/0x1c0 [amdgpu]
 dce11_update_clocks+0x8f/0xe8 [amdgpu]
 dc_commit_state+0x1d2/0xa70 [amdgpu]
 amdgpu_dm_atomic_commit_tail+0x3ee/0x2378 [amdgpu]
 ? free_one_page+0x4e/0xc0
 ? bw_calcs+0x656/0x1df0 [amdgpu]
 ? _cond_resched+0x11/0x38
 ? __ww_mutex_lock.constprop.0+0x2c/0x6f8
 ? kfree+0x9f/0x3a0
 ? ttm_bo_move_to_lru_tail+0x1c/0xc0 [ttm]
 ? ttm_eu_backoff_reservation+0x3e/0x58 [ttm]
 ? amdgpu_bo_gpu_offset+0x65/0xb0 [amdgpu]
 ? dm_plane_helper_prepare_fb+0x127/0x248 [amdgpu]
 ? _cond_resched+0x11/0x38
 ? wait_for_completion_timeout+0x31/0xe0
 ? wait_for_completion_interruptible+0x2e/0x128
 commit_tail+0x8d/0x120 [drm_kms_helper]
 drm_atomic_helper_commit+0x111/0x138 [drm_kms_helper]
 drm_atomic_connector_commit_dpms+0xd4/0xf0 [drm]
 drm_mode_obj_set_property_ioctl+0x183/0x3a0 [drm]
 ? drm_connector_set_obj_prop+0x80/0x80 [drm]
 drm_connector_property_set_ioctl+0x32/0x50 [drm]
 drm_ioctl_kernel+0xa5/0xf0 [drm]
 drm_ioctl+0x20a/0x398 [drm]
 ? drm_connector_set_obj_prop+0x80/0x80 [drm]
 amdgpu_drm_ioctl+0x44/0x78 [amdgpu]
 __x64_sys_ioctl+0x81/0xa0
 do_syscall_64+0x33/0x48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7ff812686cc7
Code: 3c 1c 48 f7 d8 49 39 c4 72 b9 e8 24 ff ff ff 85 c0 78 be 5b 4c 89 e0 5d 41 5c c3 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 71 41 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd6465ce88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffd6465cec0 RCX: 00007ff812686cc7
RDX: 00007ffd6465cec0 RSI: 00000000c01064ab RDI: 000000000000000c
RBP: 00000000c01064ab R08: 00007ffd64740080 R09: 00007ff811fa7d10
R10: 00007ff811fa7d20 R11: 0000000000000246 R12: 0000555c25ecedd0
R13: 000000000000000c R14: 0000000000000000 R15: 0000555c24e06540
---[ end trace c56a82515c29af45 ]---
Comment 3 Erhard F. 2021-03-02 09:11:59 UTC
Created attachment 295573 [details]
kernel .config (kernel 5.12-rc1, A10-9700E)

Same for kernel v5.12-rc1. Backtrace looks slightly different with KASAN.

[...]
------------[ cut here ]------------
smu8_send_msg_to_smc_with_parameter(0x0011, 0x460) timed out after 3350731 us
WARNING: CPU: 3 PID: 598651 at smu8_send_msg_to_smc_with_parameter+0x180/0x1e0 [amdgpu]
Modules linked in: rfkill dm_crypt nhpoly1305_avx2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 algif_skcipher input_leds led_class joydev hid_generic dm_mod usbhid hid raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx md_mod amdgpu f2fs crc32_pclmul evdev ext4 crc16 mbcache jbd2 aesni_intel libaes crypto_simd cryptd gpu_sched i2c_algo_bit drm_ttm_helper ttm snd_hda_codec_generic snd_hda_codec_hdmi drm_kms_helper cfbfillrect snd_hda_intel syscopyarea cfbimgblt snd_intel_dspcfg sysfillrect sysimgblt xhci_pci fb_sys_fops snd_hda_codec ehci_pci cfbcopyarea snd_hwdep xhci_hcd snd_hda_core snd_pcm ehci_hcd fb i2c_piix4 snd_timer usbcore k10temp fam15h_power font fbdev snd soundcore usb_common 8250 8250_base serial_core video acpi_cpufreq button processor nfsd zram auth_rpcgss zsmalloc nct6775 hwmon_vid lockd jc42 hwmon grace drm drm_panel_orientation_quirks fuse backlight configfs sunrpc nfs_ssc efivarfs
CPU: 3 PID: 598651 Comm: kworker/3:0 Not tainted 5.12.0-rc1-Excavator #2
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A320M-HDV R3.0, BIOS P3.10 06/26/2019
Workqueue: events dm_irq_work_func [amdgpu]
RIP: 0010:smu8_send_msg_to_smc_with_parameter+0x180/0x1e0 [amdgpu]
Code: 20 48 c7 c7 a0 c5 16 c1 48 89 c1 48 f7 ea 48 89 c8 44 89 f1 48 c1 f8 3f 48 c1 fa 07 48 29 c2 49 89 d0 44 89 ea e8 22 db e9 d2 <0f> 0b eb ac 41 bc ea ff ff ff eb a4 4c 89 ef e8 dc 54 5d d2 48 8b
RSP: 0018:ffffc900090174d0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000011 RCX: 0000000000000000
RDX: 0000000000000027 RSI: 0000000000000004 RDI: fffff52001202e8c
RBP: ffff88810abc6800 R08: ffffffff931ecfbe R09: ffff8883d39ab09b
R10: ffffed107a735613 R11: 0000000000000001 R12: 00000000ffffffc2
R13: 0000000000000011 R14: 0000000000000460 R15: ffffffffc114fce0
FS:  0000000000000000(0000) GS:ffff8883d3980000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6d79d34000 CR3: 00000001163d6000 CR4: 00000000001506e0
Call Trace:
 smum_send_msg_to_smc_with_parameter+0x13c/0x170 [amdgpu]
 smu8_set_power_state_tasks+0x34c/0x560 [amdgpu]
 phm_set_power_state+0xb2/0xf0 [amdgpu]
 ? phm_power_down_asic+0x60/0x60 [amdgpu]
 ? smu8_apply_state_adjust_rules+0x1bf/0x290 [amdgpu]
 psm_adjust_power_state_dynamic+0x12c/0x300 [amdgpu]
 ? psm_set_user_performance_state+0xd0/0xd0 [amdgpu]
 ? amdgpu_fence_wait_empty+0x24f/0x380 [amdgpu]
 ? mutex_lock_io_nested+0xac0/0xac0
 ? memcpy+0x39/0x60
 hwmgr_handle_task+0xf1/0x180 [amdgpu]
 ? hwmgr_resume+0x70/0x70 [amdgpu]
 ? amdgpu_fence_fallback+0x20/0x20 [amdgpu]
 ? program_stutter_watermark+0x1b4/0x1d0 [amdgpu]
 pp_dpm_dispatch_tasks+0x40/0x60 [amdgpu]
 ? pp_get_power_limit+0xe0/0xe0 [amdgpu]
 amdgpu_pm_compute_clocks.part.0+0x18a/0xa00 [amdgpu]
 ? wait_for_completion_killable_timeout+0x200/0x200
 ? amdgpu_dpm_get_vrefresh+0xd0/0xd0 [amdgpu]
 ? phm_store_dal_configuration_data+0x165/0x190 [amdgpu]
 ? pp_get_current_clocks+0x2d0/0x2d0 [amdgpu]
 ? pp_get_current_clocks+0x2d0/0x2d0 [amdgpu]
 dm_pp_apply_display_requirements+0x3ae/0x3f0 [amdgpu]
 dce11_update_clocks+0x149/0x1e0 [amdgpu]
 ? dce11_pplib_apply_display_requirements+0x470/0x470 [amdgpu]
 dc_commit_updates_for_stream+0x243b/0x3c20 [amdgpu]
 ? __mutex_lock+0x2e9/0xb80
 ? dm_set_dpms_off+0xc0/0x190 [amdgpu]
 ? dc_check_update_surfaces_for_stream+0x1130/0x1130 [amdgpu]
 ? dmub_reg_value_burst_set_pack+0x240/0x240 [amdgpu]
 ? dal_gpio_create_irq+0x20/0x20 [amdgpu]
 ? kasan_quarantine_put+0xd1/0x1f0
 ? trace_hardirqs_on+0x1c/0x100
 dm_set_dpms_off+0x120/0x190 [amdgpu]
 ? dm_handle_hpd_rx_irq+0x270/0x270 [amdgpu]
 ? clk_mgr_optimize_pwr_state+0x8e/0xd0 [amdgpu]
 handle_hpd_irq+0x1ed/0x250 [amdgpu]
 ? handle_hpd_rx_irq+0x2f0/0x2f0 [amdgpu]
 ? lock_release+0xb2/0x430
 ? __schedule+0x5f8/0x10e0
 ? lockdep_hardirqs_on_prepare+0xe/0x200
 ? _raw_spin_unlock_irq+0x1f/0x30
 ? handle_hpd_rx_irq+0x2f0/0x2f0 [amdgpu]
 dm_irq_work_func+0x77/0x90 [amdgpu]
 process_one_work+0x4f3/0x9b0
 ? pwq_dec_nr_in_flight+0x110/0x110
 ? rwlock_bug.part.0+0x50/0x50
 worker_thread+0x2ce/0x6c0
 ? process_one_work+0x9b0/0x9b0
 kthread+0x1d9/0x200
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
irq event stamp: 0
hardirqs last  enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff930690fa>] copy_process+0x8fa/0x31e0
softirqs last  enabled at (0): [<ffffffff930690fa>] copy_process+0x8fa/0x31e0
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace 6e48f7e48c11b7c0 ]---
amdgpu: smu8_send_msg_to_smc_with_parameter(0x026e) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x002f) aborted; SMU still servicing msg (0x0011)
amdgpu: smu8_send_msg_to_smc_with_parameter(0x0012) aborted; SMU still servicing msg (0x0011)
Comment 4 Alex Deucher 2021-03-02 17:16:56 UTC
If this is a regression, can you bisect?  https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
Comment 5 Erhard F. 2021-03-02 18:42:52 UTC
I don't know. This is a new machine and have seen this since 5.10.x which was the first kernel I tried on it.
Comment 6 Erhard F. 2021-03-30 09:03:45 UTC
Created attachment 296145 [details]
dmesg (kernel 5.12-rc5, A10-9700E)

Gave kernel 5.12-rc5 a try but the issue stays.

Looks like I really need to find an older kernel where this does not happen and start a bisect from there. This will take some time...
Comment 7 Erhard F. 2021-03-30 09:05:56 UTC
Created attachment 296147 [details]
kernel .config (kernel 5.12-rc5, A10-9700E)
Comment 8 Erhard F. 2021-04-07 22:18:11 UTC
Created attachment 296273 [details]
dmesg (kernel 4.14.228, A10-9700E)

Traced the issue back to kernel v4.14.228 which is still affected (v4.19.184 and v5.4.109 too). On v4.14.228 no stack trace like in recent kernels but these messages:

[...]
[28541.868617] amdgpu: [powerplay] min_core_set_clock not set
[28542.483228] cz_send_msg_to_smc_async (0x0011) failed
[28543.097905] cz_send_msg_to_smc_async (0x026e) failed
[28543.712424] cz_send_msg_to_smc_async (0x002f) failed
[28543.712719] amdgpu: [powerplay] min_core_set_clock not set
[28544.330105] cz_send_msg_to_smc_async (0x0011) failed
[28544.947054] cz_send_msg_to_smc_async (0x026e) failed
[28545.564013] cz_send_msg_to_smc_async (0x002f) failed
[28545.564251] amdgpu: [powerplay] min_core_set_clock not set
[28546.179695] cz_send_msg_to_smc_async (0x0011) failed
[28546.794880] cz_send_msg_to_smc_async (0x026e) failed
[28547.409986] cz_send_msg_to_smc_async (0x002f) failed

Apart from that the machine behaves the same after these "cz_send_msg_to_smc_async (0x002f) failed" - CPU permanently downclocked to 800 MHz, desktop 'freezing' issues with display going black after some time. Access via ssh still works.
Comment 9 Erhard F. 2021-04-08 21:55:44 UTC
(In reply to Alex Deucher from comment #4)
> If this is a regression, can you bisect? 
> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
Sorry but my bisecting efforts came to a halt. Last kernel I was able to boot was  4.14.228. And this one still has the issue.

Kernels 4.13.16, 4.12.14, 4.11.12, 4.10.17, 4.9.264 I was able to build but they don't boot into desktop or even console. Just getting a "no signal" message von my monitor after the kernel has booted, and some of these kernels do a reboot.

Don't know how to proceed from here.
Comment 10 Erhard F. 2021-05-11 02:17:29 UTC
Created attachment 296715 [details]
dmesg (kernel 5.13-rc1, A10-9700E)

Some locking stacktrace on kernel v5.13-rc1 before the usual "smu8_send_msg_to_smc_with_parameter" messages:

[...]
================================
WARNING: inconsistent lock state
5.13.0-rc1-bdver4 #9 Not tainted
--------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
X/591 [HC1[1]:SC0[0]:HE0:SE1] takes:
ffffffffbe37d160 (fs_reclaim){?.+.}-{0:0}, at: fs_reclaim_acquire+0xf7/0x160
{HARDIRQ-ON-W} state was registered at:
  lock_acquire+0x1a0/0x6e0
  fs_reclaim_acquire+0x117/0x160
  kmem_cache_alloc_trace+0x3b/0x320
  init_rescuer+0x80/0x330
  workqueue_init+0x12f/0x2fd
  kernel_init_freeable+0x305/0x584
  kernel_init+0x8/0x116
  ret_from_fork+0x22/0x30
irq event stamp: 68143332
hardirqs last  enabled at (68143331): [<ffffffffbd800baf>] asm_sysvec_apic_timer_interrupt+0xf/0x20
hardirqs last disabled at (68143332): [<ffffffffbd774184>] common_interrupt+0x14/0xa0
softirqs last  enabled at (68143304): [<ffffffffbc0d0cd9>] irq_exit_rcu+0x119/0x1a0
softirqs last disabled at (68143299): [<ffffffffbc0d0cd9>] irq_exit_rcu+0x119/0x1a0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(fs_reclaim);
  <Interrupt>
    lock(fs_reclaim);

 *** DEADLOCK ***

5 locks held by X/591:
 #0: ffff88810397fa20 (crtc_ww_class_acquire){+.+.}-{0:0}, at: set_property_atomic+0xb2/0x2f0 [drm]
 #1: ffff88812bc00538 (crtc_ww_class_mutex){+.+.}-{3:3}, at: modeset_lock+0x364/0x500 [drm]
 #2: ffff88812bc1cb58 (&adev->dm.dc_lock){+.+.}-{3:3}, at: amdgpu_dm_atomic_commit_tail+0xa12/0x9870 [amdgpu]
 #3: ffff8881538d6880 (&hwmgr->smu_lock){+.+.}-{3:3}, at: pp_dpm_dispatch_tasks+0x50/0x90 [amdgpu]
 #4: ffff8881538d6910 (&hwmgr->msg_lock){+.+.}-{3:3}, at: smum_send_msg_to_smc_with_parameter+0x1bf/0x300 [amdgpu]

stack backtrace:
CPU: 1 PID: 591 Comm: X Not tainted 5.13.0-rc1-bdver4 #9
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A320M-HDV R3.0, BIOS P3.10 06/26/2019
Call Trace:
 <IRQ>
 dump_stack+0xa5/0xe6
 mark_lock.cold+0x145/0x14f
 ? lock_chain_count+0x20/0x20
 ? mark_lock+0xee/0x2fd0
 ? lock_chain_count+0x20/0x20
 ? lock_chain_count+0x20/0x20
 ? lock_chain_count+0x20/0x20
 ? rcu_read_lock_sched_held+0x3a/0x70
 __lock_acquire+0x146a/0x5d40
 ? debug_check_no_locks_held+0xa0/0xa0
 ? mark_lock+0xee/0x2fd0
 ? debug_check_no_locks_held+0xa0/0xa0
 lock_acquire+0x1a0/0x6e0
 ? fs_reclaim_acquire+0xf7/0x160
 ? lock_release+0x730/0x730
 ? amdgpu_dm_irq_handler+0x1ad/0x780 [amdgpu]
 ? lock_downgrade+0x6e0/0x6e0
 fs_reclaim_acquire+0x117/0x160
 ? fs_reclaim_acquire+0xf7/0x160
 ? amdgpu_dm_irq_handler+0x2d1/0x780 [amdgpu]
 kmem_cache_alloc_trace+0x3b/0x320
 amdgpu_dm_irq_handler+0x2d1/0x780 [amdgpu]
 amdgpu_irq_dispatch+0x280/0x590 [amdgpu]
 ? amdgpu_irq_add_id+0x2c0/0x2c0 [amdgpu]
 ? drm_print_bits+0x190/0x190 [drm]
 ? __lock_acquire+0xd5b/0x5d40
 amdgpu_ih_process+0x1c4/0x390 [amdgpu]
 ? amdgpu_irq_disable_all+0x300/0x300 [amdgpu]
 amdgpu_irq_handler+0x22/0x210 [amdgpu]
 ? amdgpu_irq_disable_all+0x300/0x300 [amdgpu]
 __handle_irq_event_percpu+0x248/0x600
 handle_irq_event+0xfa/0x260
 ? handle_irq_event_percpu+0x140/0x140
 ? lapic_next_event+0x53/0x80
 ? clockevents_program_event+0x1c7/0x280
 handle_edge_irq+0x1f8/0xa90
 __common_interrupt+0x70/0x150
 common_interrupt+0x76/0xa0
 </IRQ>
 asm_common_interrupt+0x1b/0x40
RIP: 0010:delay_halt_mwaitx+0x31/0x40
Code: 00 65 48 03 05 a8 83 22 43 31 d2 48 89 d1 0f 01 fa bb ff ff ff ff b8 f0 00 00 00 b9 02 00 00 00 48 39 de 48 0f 46 de 0f 01 fb <5b> c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 83 ec 08 48 89 f8
RSP: 0018:ffff88810397ec38 EFLAGS: 00000297
RAX: 00000000000000f0 RBX: 0000000000000bb3 RCX: 0000000000000002
RDX: 0000000000000000 RSI: 0000000000000bb3 RDI: 00002c4339a8e963
RBP: 0000000000000bb3 R08: 0000000000000000 R09: ffffffffbe61f63f
R10: 0000000000000001 R11: 0000000000000001 R12: ffff8881538d6800
R13: 00000000000001d0 R14: 0000000000000000 R15: ffffed102a71ad3e
 delay_halt+0x36/0x60
 phm_wait_for_register_unequal+0xd5/0x240 [amdgpu]
 ? amdgpu_device_wreg.part.0+0x2ae/0x350 [amdgpu]
 smu8_send_msg_to_smc_with_parameter+0x1e9/0x380 [amdgpu]
 smum_send_msg_to_smc_with_parameter+0x215/0x300 [amdgpu]
 smu8_set_power_state_tasks+0x691/0xe10 [amdgpu]
 ? debug_check_no_locks_held+0xa0/0xa0
 phm_set_power_state+0xcc/0x130 [amdgpu]
 ? phm_power_down_asic+0x90/0x90 [amdgpu]
 psm_adjust_power_state_dynamic+0x172/0x570 [amdgpu]
 ? psm_set_user_performance_state+0x1c0/0x1c0 [amdgpu]
 ? mutex_lock_io_nested+0xfd0/0xfe0
 ? memcpy+0x39/0x60
 ? psm_set_states+0x109/0x190 [amdgpu]
 hwmgr_handle_task+0x10a/0x1f0 [amdgpu]
 ? hwmgr_resume+0xc0/0xc0 [amdgpu]
 ? amdgpu_debugfs_fence_info_show+0x450/0x450 [amdgpu]
 pp_dpm_dispatch_tasks+0x5e/0x90 [amdgpu]
 ? pp_get_power_limit+0x250/0x250 [amdgpu]
 amdgpu_pm_compute_clocks.part.0+0x245/0x1500 [amdgpu]
 ? wait_for_completion_killable_timeout+0x2c0/0x2c0
 ? amdgpu_dpm_get_vrefresh+0x1c0/0x1c0 [amdgpu]
 ? phm_store_dal_configuration_data+0x2d0/0x3f0 [amdgpu]
 ? pp_get_current_clocks+0x6e0/0x6e0 [amdgpu]
 ? pp_get_current_clocks+0x6e0/0x6e0 [amdgpu]
 dm_pp_apply_display_requirements+0x92c/0xc60 [amdgpu]
 dce11_update_clocks+0x27f/0x3d0 [amdgpu]
 ? dce11_pplib_apply_display_requirements+0xbb0/0xbb0 [amdgpu]
 dc_commit_state+0x422/0x27a0 [amdgpu]
 ? amdgpu_dm_atomic_commit_tail+0xa12/0x9870 [amdgpu]
 ? unwind_next_frame+0x15a6/0x1cf0
 ? dc_allow_idle_optimizations+0x1c0/0x1c0 [amdgpu]
 ? drm_atomic_helper_calc_timestamping_constants+0x15a/0x1f0 [drm_kms_helper]
 amdgpu_dm_atomic_commit_tail+0xa47/0x9870 [amdgpu]
 ? mark_lock+0xee/0x2fd0
 ? dm_crtc_helper_atomic_check+0x960/0x960 [amdgpu]
 ? mark_lock+0xee/0x2fd0
 ? lock_chain_count+0x20/0x20
 ? lock_chain_count+0x20/0x20
 ? __lock_acquire+0xd5b/0x5d40
 ? check_chain_key+0x24a/0x580
 ? __lock_acquire+0xd5b/0x5d40
 ? mark_lock+0xee/0x2fd0
 ? debug_check_no_locks_held+0xa0/0xa0
 ? mark_lock+0xee/0x2fd0
 ? lock_chain_count+0x20/0x20
 ? __lock_acquire+0xd5b/0x5d40
 ? check_chain_key+0x24a/0x580
 ? wait_for_completion_timeout+0x19d/0x220
 ? mark_held_locks+0x9e/0xe0
 ? rwlock_bug.part.0+0x90/0x90
 ? lockdep_hardirqs_on_prepare+0x273/0x3f0
 ? _raw_spin_unlock_irq+0x1f/0x30
 ? trace_hardirqs_on+0x1c/0x110
 ? _raw_spin_unlock_irq+0x1f/0x30
 ? wait_for_completion_timeout+0x19d/0x220
 ? wait_for_completion_io+0x210/0x210
 ? hpet_msi_interrupt_handler+0x50/0x50
 ? mark_held_locks+0x9e/0xe0
 ? drm_crtc_commit_wait+0x29/0x60 [drm]
 ? drm_atomic_helper_wait_for_dependencies+0x429/0x670 [drm_kms_helper]
 commit_tail+0x1a9/0x300 [drm_kms_helper]
 drm_atomic_helper_commit+0x216/0x290 [drm_kms_helper]
 drm_atomic_connector_commit_dpms+0x2d7/0x4f0 [drm]
 ? drm_modeset_acquire_init+0x1e9/0x390 [drm]
 ? set_property_atomic+0xb2/0x2f0 [drm]
 set_property_atomic+0x1be/0x2f0 [drm]
 ? drm_mode_object_get+0x110/0x110 [drm]
 ? lock_acquire+0x1a0/0x6e0
 drm_mode_obj_set_property_ioctl+0x339/0x4c0 [drm]
 ? drm_connector_set_obj_prop+0x160/0x160 [drm]
 drm_connector_property_set_ioctl+0xee/0x180 [drm]
 ? drm_connector_set_obj_prop+0x160/0x160 [drm]
 ? drm_is_current_master+0x65/0x120 [drm]
 ? drm_connector_set_obj_prop+0x160/0x160 [drm]
 ? drm_ioctl_permit+0x132/0x160 [drm]
 drm_ioctl_kernel+0x1c8/0x260 [drm]
 ? drm_ioctl_permit+0x160/0x160 [drm]
 ? debug_check_no_locks_held+0xa0/0xa0
 drm_ioctl+0x419/0x890 [drm]
 ? drm_connector_set_obj_prop+0x160/0x160 [drm]
 ? drm_ioctl_kernel+0x260/0x260 [drm]
 ? lock_downgrade+0x6e0/0x6e0
 ? lockdep_hardirqs_on_prepare+0x273/0x3f0
 ? _raw_spin_unlock_irqrestore+0x42/0x50
 ? trace_hardirqs_on+0x1c/0x110
 amdgpu_drm_ioctl+0xc9/0x1a0 [amdgpu]
 __x64_sys_ioctl+0x11b/0x170
 do_syscall_64+0x61/0x80
 ? lockdep_hardirqs_on_prepare+0x273/0x3f0
 ? do_syscall_64+0x6e/0x80
 ? do_syscall_64+0x6e/0x80
 ? lockdep_hardirqs_on_prepare+0x273/0x3f0
 ? do_syscall_64+0x6e/0x80
 ? do_syscall_64+0x6e/0x80
 ? do_syscall_64+0x6e/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7acfa36937
Code: 3c 1c 48 f7 d8 4c 39 e0 77 b9 e8 24 ff ff ff 85 c0 78 be 5b 4c 89 e0 5d 41 5c c3 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 01 45 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffcba76bd48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f7acfa36937
RDX: 00007ffcba76bd98 RSI: 00000000c01064ab RDI: 000000000000000c
RBP: 00000000c01064ab R08: 0000562d03ee9cf8 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000562d040468b0
R13: 00007ffcba76bd98 R14: 0000000000000000 R15: 0000562d027ab540
------------[ cut here ]------------
Comment 11 Erhard F. 2021-05-11 02:22:46 UTC
Created attachment 296717 [details]
kernel .config (kernel 5.13-rc1, A10-9700E)
Comment 12 Erhard F. 2021-09-28 22:49:55 UTC
Sold the hardware, can no longer test nor reproduce.

Note You need to log in before you can comment on or make changes to this bug.