Bug 210849

Summary: Black screen after resume from long suspend. Open/Close lid. AMDGPU
Product: Drivers
Reporter: xrootware
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: NEW
Severity: blocking
CC: alexdeucher, jvdelisle2
Priority: P1
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.10.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: cut of dmesg

Description xrootware 2020-12-22 11:15:59 UTC
Created attachment 294295
cut of dmesg

The bug occurs on kernels newer than 5.9 on Fedora 33: the screen stays black after resuming from a long suspend (closing and reopening the lid).

A similar error is reported at https://bugzilla.redhat.com/show_bug.cgi?id=1884180

A hard reset (holding the power button for 5 seconds) is required to boot normally again.

System encountered a non-fatal error in drm_dev_register()

WARNING: CPU: 2 PID: 323 at drivers/gpu/drm/drm_mode_config.c:617 drm_mode_config_validate+0x178/0x200 [drm] 


FS:  00007f051daef4c0(0000) GS:ffff8a3fb8a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f22b257b56c CR3: 000000012ed8a000 CR4: 00000000003506f0
Call Trace:
 ? generic_reg_get+0x1d/0x30 [amdgpu]
 drm_get_last_vbltimestamp+0x8a/0xa0 [drm]
 drm_reset_vblank_timestamp+0x4b/0xb0 [drm]
 drm_crtc_vblank_on+0x7b/0x130 [drm]
 amdgpu_dm_atomic_commit_tail+0xd00/0x24c0 [amdgpu]
 commit_tail+0x94/0x130 [drm_kms_helper]
 drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
 drm_atomic_helper_set_config+0x70/0xb0 [drm_kms_helper]
 drm_mode_setcrtc+0x1d3/0x6f0 [drm]
 ? avc_has_extended_perms+0x18d/0x3e0
 ? drm_mode_getcrtc+0x180/0x180 [drm]
 drm_ioctl_kernel+0x86/0xd0 [drm]
 drm_ioctl+0x20f/0x3a0 [drm]
 ? drm_mode_getcrtc+0x180/0x180 [drm]
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 __x64_sys_ioctl+0x83/0xb0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f0521a5938b
Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3
RSP: 002b:00007ffc9abfb628 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc9abfb660 RCX: 00007f0521a5938b
RDX: 00007ffc9abfb660 RSI: 00000000c06864a2 RDI: 0000000000000009
RBP: 00000000c06864a2 R08: 0000000000000000 R09: 0000556b0af88830
R10: 0000000000000000 R11: 0000000000000246 R12: 0000556b083821e0
R13: 0000000000000009 R14: 00007f0504006650 R15: 0000556b0afa2980

BOOT_IMAGE=(hd0,gpt5)/boot/vmlinuz-5.10.0-0.rc4.78.fc34.x86_64 root=UUID=1ede796f-5c84-4fb0-8a9e-c97e1b102ec4 ro resume=UUID=1dad1a43-257
Comment 1 xrootware 2020-12-22 11:18:30 UTC
Kernel 5.7.0 works fine
Comment 2 xrootware 2020-12-22 11:20:13 UTC
A similar bug is reported at https://bugzilla.redhat.com/show_bug.cgi?id=1881889
Comment 3 xrootware 2020-12-22 11:25:40 UTC
5.10.1 built from source gives the same result.
Comment 4 xrootware 2020-12-23 18:36:21 UTC
I have new error messages that I think are linked to it:
drm failed to load unicode id
amdgpu: RAP: optional rap ta ucode is not available
This is on kernel 5.10.2.
Comment 5 JerryD 2020-12-27 03:03:20 UTC
There is an old, closed bug report related to /usr/lib/firmware/amdgpu/raven_dmcu.bin. This file was being loaded incorrectly or has a problem. To work around this, I renamed the file to raven_dmcu.bin.old so that it would not load. The problem would return every time a firmware update supplied a new raven_dmcu.bin.

I had no problems until recently when the kernel was bumped from 5.8.18-200.fc32.x86_64 to the first instance of 5.9.

The previous fix was accomplished by uninstalling the kernel, renaming the troublesome file, and then reinstalling the kernel.

I took another look today and see there is a new raven_dmcu.bin so I will repeat the procedure again.  It may not be the same problem, but I will give it a try.
Comment 6 JerryD 2020-12-27 03:31:59 UTC
I tried the fix in Comment 5 and it does not work so I am sticking with the 5.8.18 kernel until this gets ironed out.
Comment 7 JerryD 2020-12-30 20:53:39 UTC
[  296.606452] PM: resume devices took 10.437 seconds
[  296.606454] ------------[ cut here ]------------
[  296.606455] Component: resume devices, time: 10437
[  296.606465] WARNING: CPU: 2 PID: 3357 at kernel/power/suspend_test.c:53 suspend_test_finish+0x74/0x80
[  296.606465] Modules linked in: ccm uinput rfcomm nf_conntrack_netlink xt_CHECKSUM xt_addrtype xt_MASQUERADE xt_conntrack br_netfilter ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set overlay nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat rtw88_8822be rtw88_8822b edac_mce_amd kvm_amd rtw88_pci snd_hda_codec_realtek snd_hda_codec_hdmi rtw88_core snd_hda_codec_generic ledtrig_audio kvm btusb snd_hda_intel btrtl snd_intel_dspcfg btbcm btintel irqbypass snd_hda_codec uvcvideo rapl mac80211 videobuf2_vmalloc bluetooth hp_wmi snd_hda_core pcspkr
[  296.606504]  videobuf2_memops videobuf2_v4l2 snd_hwdep hid_sensor_accel_3d wmi_bmof sparse_keymap snd_seq hid_sensor_rotation hid_sensor_magn_3d videobuf2_common hid_sensor_gyro_3d snd_seq_device hid_sensor_incl_3d hid_sensor_trigger snd_pcm videodev hid_sensor_iio_common industrialio_triggered_buffer kfifo_buf ecdh_generic cfg80211 industrialio ecc joydev snd_timer mc snd k10temp i2c_piix4 soundcore rfkill libarc4 hp_accel i2c_scmi hp_wireless lis3lv02d acpi_cpufreq zram ip_tables mmc_block amdgpu hid_sensor_hub hid_multitouch rtsx_pci_sdmmc mmc_core hid_logitech_hidpp iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper cec crct10dif_pclmul drm crc32_pclmul ccp crc32c_intel ghash_clmulni_intel serio_raw nvme rtsx_pci nvme_core wmi video i2c_hid pinctrl_amd hid_logitech_dj fuse
[  296.606537] CPU: 2 PID: 3357 Comm: systemd-sleep Not tainted 5.9.16-200.fc33.x86_64 #1
[  296.606538] Hardware name: HP HP ENVY x360 Convertible 15-bq1xx/83C6, BIOS F.21 04/29/2019
[  296.606542] RIP: 0010:suspend_test_finish+0x74/0x80
[  296.606545] Code: e8 03 00 00 29 c1 e8 78 5d 9f 00 41 81 fc 10 27 00 00 77 04 5d 41 5c c3 44 89 e2 48 89 ee 48 c7 c7 c8 80 38 a8 e8 81 03 9f 00 <0f> 0b 5d 41 5c c3 cc cc cc cc cc cc 0f 1f 44 00 00 0f b6 05 58 d9
[  296.606546] RSP: 0018:ffffbdf382e8fdd8 EFLAGS: 00010296
[  296.606548] RAX: 0000000000000026 RBX: 0000000000000001 RCX: ffff975857298d08
[  296.606549] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff975857298d00
[  296.606550] RBP: ffffffffa8388011 R08: 000000450f1f561c R09: ffffffffa9404be4
[  296.606551] R10: 000000000000058a R11: 0000000000021ee0 R12: 00000000000028c5
[  296.606552] R13: 0000000000000000 R14: ffffbdf382e8fe08 R15: 0000000000000000
[  296.606554] FS:  00007f1e3fd5f000(0000) GS:ffff975857280000(0000) knlGS:0000000000000000
[  296.606555] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  296.606556] CR2: 00005584bb97e928 CR3: 000000018e8f6000 CR4: 00000000003506e0
[  296.606557] Call Trace:
[  296.606564]  suspend_devices_and_enter+0x1a2/0x7f0
[  296.606569]  pm_suspend.cold+0x329/0x374
[  296.606572]  state_store+0x71/0xd0
[  296.606577]  kernfs_fop_write+0xce/0x1b0
[  296.606581]  vfs_write+0xc7/0x210
[  296.606584]  ksys_write+0x4f/0xc0
[  296.606587]  do_syscall_64+0x33/0x40
[  296.606591]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  296.606593] RIP: 0033:0x7f1e40d26297
[  296.606597] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  296.606598] RSP: 002b:00007ffe985f7fa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  296.606600] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f1e40d26297
[  296.606601] RDX: 0000000000000007 RSI: 00005584bb97d920 RDI: 0000000000000004
[  296.606601] RBP: 00005584bb97d920 R08: 0000000000000001 R09: 00007f1e40df8a60
[  296.606602] R10: 0000000000000070 R11: 0000000000000246 R12: 0000000000000007
[  296.606603] R13: 00005584bb978650 R14: 0000000000000007 R15: 00007f1e40df9720
[  296.606605] ---[ end trace 6fa811f71ae3a8ac ]---
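
For context on the trace above: the warning at kernel/power/suspend_test.c:53 appears to fire when a suspend or resume phase runs longer than about 10 seconds, which is consistent with the 10.437 seconds in the first line (R12 = 0x28c5 = 10437). The following is a minimal Python sketch for pulling such timings out of a saved dmesg log. The 10000 ms threshold is an assumption inferred from this trace, and `slow_pm_phases` is a hypothetical helper name, not an existing tool.

```python
import re

# Assumed threshold in milliseconds; inferred from the trace above, not a documented value.
THRESHOLD_MS = 10_000

# Matches lines such as "[  296.606452] PM: resume devices took 10.437 seconds"
PM_LINE = re.compile(r"PM: (?P<phase>.+?) took (?P<sec>\d+)\.(?P<ms>\d{3}) seconds")


def slow_pm_phases(dmesg_text, threshold_ms=THRESHOLD_MS):
    """Return (phase, milliseconds) for every PM phase slower than the threshold."""
    slow = []
    for match in PM_LINE.finditer(dmesg_text):
        ms = int(match["sec"]) * 1000 + int(match["ms"])
        if ms > threshold_ms:
            slow.append((match["phase"], ms))
    return slow


# Sample line copied from the log above.
sample = "[  296.606452] PM: resume devices took 10.437 seconds"
print(slow_pm_phases(sample))
```

Running this over the output of several suspend cycles would show whether the slow phase is always "resume devices" or whether other phases are affected too.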
Comment 8 JerryD 2021-01-01 18:04:12 UTC
In the 5.8.18 kernel, which does not hang for me, I also get a segfault:

[  224.767359] Call Trace:
[  224.767529]  amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
[  224.767535]  backlight_suspend+0x6a/0x80
[  224.767538]  ? brightness_store+0x50/0x50
[  224.767541]  dpm_run_callback+0x4f/0x140
[  224.767544]  __device_suspend+0x11c/0x4a0
[  224.767547]  dpm_suspend+0x117/0x250
[  224.767549]  dpm_suspend_start+0x77/0x80
[  224.767554]  suspend_devices_and_enter+0xe6/0x7f0
[  224.767557]  pm_suspend.cold+0x333/0x38c
[  224.767560]  state_store+0x71/0xd0
[  224.767565]  kernfs_fop_write+0xce/0x1b0
[  224.767569]  vfs_write+0xc7/0x1f0
[  224.767572]  ksys_write+0x4f/0xc0
[  224.767576]  do_syscall_64+0x4d/0x90
[  224.767580]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens coming out of suspend. So pretty sure amdgpu is the culprit.
Comment 9 xrootware 2021-01-05 20:43:19 UTC
Adding 'no_console_suspend' to the boot options fixes the problem completely in my case. Maybe this is only a temporary fix, but I hope it will be fixed properly.
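
To confirm that the option actually reached the running kernel, one could check the kernel command line. This is a small Python sketch; the helper names `has_kernel_param` and `running_cmdline` and the example command line are made up for illustration, and /proc/cmdline is assumed to be readable (Linux only).

```python
from pathlib import Path


def has_kernel_param(cmdline, name):
    """Return True if `name` appears as a bare flag or as name=value."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text().strip()


# Hypothetical command line, used only to demonstrate the check.
example = "BOOT_IMAGE=(hd0,gpt5)/boot/vmlinuz ro quiet no_console_suspend"
print(has_kernel_param(example, "no_console_suspend"))
```

On a live system the check would be `has_kernel_param(running_cmdline(), "no_console_suspend")`. How the option gets added depends on the bootloader setup.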
Comment 10 JerryD 2021-02-05 04:35:14 UTC
This bug has basically disabled two different machines I have that use AMD graphics cards. One, a workstation running Fedora 33, will lose the display if I do not log on right away after a power up. The defaults that blank the screen, or just a simple lock screen sequence, cause the monitor to shut off to save energy. At this point the display cannot be recovered since the driver loses itself. It is a real shame, as this system was working perfectly for over a year before this happened. The guilty patch should be reverted immediately.

I can remote into the system from elsewhere using ssh and command a reboot to bring it back. The underlying kernel is running and completely functional. The device driver AMDGPU is 'F'ed up. In my many, many years of using Fedora and other Linux distros, this is by far the worst cluster bug I have ever seen.
Comment 11 Alex Deucher 2021-02-05 15:40:21 UTC
Can you bisect?
Comment 12 JerryD 2021-02-05 17:46:55 UTC
I have done some bisection with the kernel by uninstalling packages and rolling back. I tried this with my best guess of which packages to roll back, using the package manager (dnf here). In the recent past I have seen issues with the amdgpu firmware and was able to work around the bug by hiding the suspect file, removing the kernel, and reinstalling the kernel. This was needed because the firmware is folded into the boot image when the kernel is installed.

I have not been successful with this bug yet.

Firmware vs mesa/dri code? If someone could point me to the correct source repository, I would be happy to attempt a raw build and install to see if the latest development branches have fixed it, and if not, I could try starting a bisection on it.
Comment 13 Alex Deucher 2021-02-05 18:37:27 UTC
most likely kernel.

https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
Comment 14 JerryD 2021-02-05 19:01:41 UTC
(In reply to Alex Deucher from comment #13)
> most likely kernel.
> 
> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

OK, I have done kernel builds before with no problems. I will start back about a year and go from there.
Comment 15 JerryD 2021-02-06 16:52:40 UTC
Dropped back to kernel-5.7.0 and everything works fine: lid-close suspend and resume both complete with no errors on my laptop. I made no other changes, so this confirms the kernel is the focus.
Comment 16 JerryD 2021-02-09 04:40:06 UTC
  kernel-5.8.1-301.fc33  amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
  kernel-5.8.0-1.fc33  amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
  ***** breakage somewhere between these two *********
  kernel-5.7.17-200.fc32 clean
  kernel-5.7.15-200.fc32 clean
  kernel-5.7.10-201.fc32 clean 
  kernel-5.7.0-301.fc33  clean

See also https://bugzilla.redhat.com/show_bug.cgi?id=1881889.
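
The list above can be reduced to a good/bad boundary mechanically. This is a rough Python sketch, assuming that ordering by the upstream x.y.z part of the package version is good enough; `version_key` and `breakage_window` are hypothetical helper names.

```python
# Results copied from the list above: (package version, True if the amdgpu trace appeared).
results = [
    ("5.8.1-301.fc33", True),
    ("5.8.0-1.fc33", True),
    ("5.7.17-200.fc32", False),
    ("5.7.15-200.fc32", False),
    ("5.7.10-201.fc32", False),
    ("5.7.0-301.fc33", False),
]


def version_key(version):
    """Sort key built from the upstream x.y.z part only (the package release is ignored)."""
    upstream = version.split("-")[0]
    return tuple(int(part) for part in upstream.split("."))


def breakage_window(results):
    """Return (newest clean version, oldest bad version), or None if no boundary exists."""
    ordered = sorted(results, key=lambda item: version_key(item[0]))
    for (older, older_bad), (newer, newer_bad) in zip(ordered, ordered[1:]):
        if not older_bad and newer_bad:
            return older, newer
    return None


print(breakage_window(results))
```

The resulting pair narrows the search to the 5.7 to 5.8 development cycle, which is the range a git bisect along the lines of the guide linked in comment 13 would then have to cover.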
Comment 17 JerryD 2021-02-13 01:30:15 UTC
Possibly fixed in this commit:

https://github.com/torvalds/linux/commit/a81bfdf8bf5396824d7d139560180854cb599b06
Comment 18 JerryD 2021-03-08 03:01:09 UTC
(In reply to JerryD from comment #17)
> Possible fixed in this commit:
> 
> https://github.com/torvalds/linux/commit/
> a81bfdf8bf5396824d7d139560180854cb599b06

No, not fixed. I have tried up to:

5.11.3-50.fc33.x86_64

Exact same failure. I have had to clear off a bunch of files to make room on this laptop to start doing kernel builds in hopes of identifying the breakage.