Bug 210849
| Summary: | Black screen after resume from long suspend. Open/Close lid. AMDGPU | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | xrootware |
| Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
| Status: | NEW --- | | |
| Severity: | blocking | CC: | alexdeucher, jvdelisle2 |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 5.10.1 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | cut of dmesg | | |
Description
xrootware
2020-12-22 11:15:59 UTC
Kernel 5.7.0 works fine.

More bugs like this: https://bugzilla.redhat.com/show_bug.cgi?id=1881889

5.10.1 built from source has the same result.

There are new error messages linked with it; I think drm failed to load the ucode:
amdgpu: RAP: optional rap ta ucode is not available
kernel 5.10.2

There is an old bug report that has been closed related to /usr/lib/firmware/amdgpu/raven_dmcu.bin. This file was being incorrectly loaded or has a problem. To correct this, I renamed the file to raven_dmcu.bin.old so that it would not load. The problem would return every time a firmware update came through that supplied a new raven_dmcu.bin. I had no problems until recently, when the kernel was bumped from 5.8.18-200.fc32.x86_64 to the first instance of 5.9. The previous fix was accomplished by uninstalling the kernel, renaming the troublesome file, and then reinstalling the kernel. I took another look today and see there is a new raven_dmcu.bin, so I will repeat the procedure again. It may not be the same problem, but I will give it a try.

I tried the fix in Comment 5 and it does not work, so I am sticking with the 5.8.18 kernel until this gets ironed out.
[ 296.606452] PM: resume devices took 10.437 seconds
[ 296.606454] ------------[ cut here ]------------
[ 296.606455] Component: resume devices, time: 10437
[ 296.606465] WARNING: CPU: 2 PID: 3357 at kernel/power/suspend_test.c:53 suspend_test_finish+0x74/0x80
[ 296.606465] Modules linked in: ccm uinput rfcomm nf_conntrack_netlink xt_CHECKSUM xt_addrtype xt_MASQUERADE xt_conntrack br_netfilter ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set overlay nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat rtw88_8822be rtw88_8822b edac_mce_amd kvm_amd rtw88_pci snd_hda_codec_realtek snd_hda_codec_hdmi rtw88_core snd_hda_codec_generic ledtrig_audio kvm btusb snd_hda_intel btrtl snd_intel_dspcfg btbcm btintel irqbypass snd_hda_codec uvcvideo rapl mac80211 videobuf2_vmalloc bluetooth hp_wmi snd_hda_core pcspkr
[ 296.606504] videobuf2_memops videobuf2_v4l2 snd_hwdep hid_sensor_accel_3d wmi_bmof sparse_keymap snd_seq hid_sensor_rotation hid_sensor_magn_3d videobuf2_common hid_sensor_gyro_3d snd_seq_device hid_sensor_incl_3d hid_sensor_trigger snd_pcm videodev hid_sensor_iio_common industrialio_triggered_buffer kfifo_buf ecdh_generic cfg80211 industrialio ecc joydev snd_timer mc snd k10temp i2c_piix4 soundcore rfkill libarc4 hp_accel i2c_scmi hp_wireless lis3lv02d acpi_cpufreq zram ip_tables mmc_block amdgpu hid_sensor_hub hid_multitouch rtsx_pci_sdmmc mmc_core hid_logitech_hidpp iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper cec crct10dif_pclmul drm crc32_pclmul ccp crc32c_intel ghash_clmulni_intel serio_raw nvme rtsx_pci nvme_core wmi video i2c_hid pinctrl_amd hid_logitech_dj fuse
[ 296.606537] CPU: 2 PID: 3357 Comm: systemd-sleep Not tainted 5.9.16-200.fc33.x86_64 #1
[ 296.606538] Hardware name: HP HP ENVY x360 Convertible 15-bq1xx/83C6, BIOS F.21 04/29/2019
[ 296.606542] RIP: 0010:suspend_test_finish+0x74/0x80
[ 296.606545] Code: e8 03 00 00 29 c1 e8 78 5d 9f 00 41 81 fc 10 27 00 00 77 04 5d 41 5c c3 44 89 e2 48 89 ee 48 c7 c7 c8 80 38 a8 e8 81 03 9f 00 <0f> 0b 5d 41 5c c3 cc cc cc cc cc cc 0f 1f 44 00 00 0f b6 05 58 d9
[ 296.606546] RSP: 0018:ffffbdf382e8fdd8 EFLAGS: 00010296
[ 296.606548] RAX: 0000000000000026 RBX: 0000000000000001 RCX: ffff975857298d08
[ 296.606549] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff975857298d00
[ 296.606550] RBP: ffffffffa8388011 R08: 000000450f1f561c R09: ffffffffa9404be4
[ 296.606551] R10: 000000000000058a R11: 0000000000021ee0 R12: 00000000000028c5
[ 296.606552] R13: 0000000000000000 R14: ffffbdf382e8fe08 R15: 0000000000000000
[ 296.606554] FS: 00007f1e3fd5f000(0000) GS:ffff975857280000(0000) knlGS:0000000000000000
[ 296.606555] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 296.606556] CR2: 00005584bb97e928 CR3: 000000018e8f6000 CR4: 00000000003506e0
[ 296.606557] Call Trace:
[ 296.606564] suspend_devices_and_enter+0x1a2/0x7f0
[ 296.606569] pm_suspend.cold+0x329/0x374
[ 296.606572] state_store+0x71/0xd0
[ 296.606577] kernfs_fop_write+0xce/0x1b0
[ 296.606581] vfs_write+0xc7/0x210
[ 296.606584] ksys_write+0x4f/0xc0
[ 296.606587] do_syscall_64+0x33/0x40
[ 296.606591] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 296.606593] RIP: 0033:0x7f1e40d26297
[ 296.606597] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 296.606598] RSP: 002b:00007ffe985f7fa8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 296.606600] RAX: ffffffffffffffda RBX: 0000000000000007 RCX: 00007f1e40d26297
[ 296.606601] RDX: 0000000000000007 RSI: 00005584bb97d920 RDI: 0000000000000004
[ 296.606601] RBP: 00005584bb97d920 R08: 0000000000000001 R09: 00007f1e40df8a60
[ 296.606602] R10: 0000000000000070 R11: 0000000000000246 R12: 0000000000000007
[ 296.606603] R13: 00005584bb978650 R14: 0000000000000007 R15: 00007f1e40df9720
[ 296.606605] ---[ end trace 6fa811f71ae3a8ac ]---

In the 5.8.18 kernel, which does not hang on me, I also have a segfault:

[ 224.767359] Call Trace:
[ 224.767529] amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
[ 224.767535] backlight_suspend+0x6a/0x80
[ 224.767538] ? brightness_store+0x50/0x50
[ 224.767541] dpm_run_callback+0x4f/0x140
[ 224.767544] __device_suspend+0x11c/0x4a0
[ 224.767547] dpm_suspend+0x117/0x250
[ 224.767549] dpm_suspend_start+0x77/0x80
[ 224.767554] suspend_devices_and_enter+0xe6/0x7f0
[ 224.767557] pm_suspend.cold+0x333/0x38c
[ 224.767560] state_store+0x71/0xd0
[ 224.767565] kernfs_fop_write+0xce/0x1b0
[ 224.767569] vfs_write+0xc7/0x1f0
[ 224.767572] ksys_write+0x4f/0xc0
[ 224.767576] do_syscall_64+0x4d/0x90
[ 224.767580] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens coming out of suspend, so I am pretty sure amdgpu is the culprit.

'no_console_suspend' in the boot options fixes the problem completely in my case. Maybe this is a temporary fix, but I hope it will be fixed properly.

This bug has basically disabled two different machines I have that use AMD graphics cards. One, a workstation running Fedora 33, will lose the display if I do not log on right away after a power up. The defaults that blank the screen, or just a simple lock screen sequence, basically cause the monitor to shut off to save energy. At this point the display cannot be recovered since the driver loses itself. It is a real shame, as this system was working perfectly for over a year before this happened. The guilty patch should be reverted immediately. I can remote into the system from elsewhere using ssh and command a reboot to bring it back. The underlying kernel is running and completely functional. The device driver AMDGPU is 'F'ed up. In my many, many years of using Fedora and other Linux distros, this is by far the worst cluster bug I have ever seen.

Can you bisect?

I have done some bisection with the kernel by uninstalling packages and rolling back. I tried this with my best guess of which packages to roll back using the package manager (dnf here). In the recent past I have seen issues with the amdgpu firmware and was able to disable the bug by hiding the suspect file, removing the kernel, and reinstalling the kernel. This was needed because the firmware is folded into the boot image when the kernel is installed. I was not successful with this bug here yet.

Firmware vs mesa/dri code? If someone could point me to the correct source repository, I would be happy to attempt a raw build and install to see if the latest development branches have fixed it, and if not I could try starting a bisection on it.

most likely kernel.

https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

(In reply to Alex Deucher from comment #13)
> most likely kernel.
>
> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

OK, I have done kernel builds before with no problems. I will start back about a year and go from there.

Dropped back to kernel-5.7.0 and everything works fine on my laptop: lid-close suspend and return from suspend work with no errors. I made no other changes, so this confirms the kernel is the focus.

kernel-5.8.1-301.fc33 amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
kernel-5.8.0-1.fc33 amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
***** breakage somewhere between these two *********
kernel-5.7.17-200.fc32 clean
kernel-5.7.15-200.fc32 clean
kernel-5.7.10-201.fc32 clean
kernel-5.7.0-301.fc33 clean

See also https://bugzilla.redhat.com/show_bug.cgi?id=1881889.
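
The boundary in the kernel package list above can also be computed mechanically, which may help once more builds have been tested. The following is only a minimal sketch: the `RESULTS` table is copied from the results reported in this thread, and `version_key` and `find_boundary` are hypothetical helper names, not part of any existing tool. The printed `git bisect` line assumes the upstream tags `v5.7` and `v5.8` bracket the same range as the Fedora packages, which has not been verified here.

```python
import re

# Results as reported in this thread:
# True = clean, False = amdgpu_dm_backlight_update_status trace seen.
RESULTS = {
    "kernel-5.8.1-301.fc33": False,
    "kernel-5.8.0-1.fc33": False,
    "kernel-5.7.17-200.fc32": True,
    "kernel-5.7.15-200.fc32": True,
    "kernel-5.7.10-201.fc32": True,
    "kernel-5.7.0-301.fc33": True,
}


def version_key(package):
    """Turn 'kernel-5.7.17-200.fc32' into (5, 7, 17) for numeric sorting."""
    match = re.match(r"kernel-(\d+)\.(\d+)\.(\d+)-", package)
    if match is None:
        raise ValueError(f"unrecognized package name: {package}")
    return tuple(int(part) for part in match.groups())


def find_boundary(results):
    """Return (last_good, first_bad); either side is None if it is missing."""
    ordered = sorted(results, key=version_key)
    first_bad = next((p for p in ordered if not results[p]), None)
    # Only clean kernels older than the first failing one count as "last good".
    goods = [
        p for p in ordered
        if results[p]
        and (first_bad is None or version_key(p) < version_key(first_bad))
    ]
    last_good = goods[-1] if goods else None
    return last_good, first_bad


last_good, first_bad = find_boundary(RESULTS)
print("last good:", last_good)
print("first bad:", first_bad)
# Assumption: upstream tags v5.7 (good) and v5.8 (bad) bracket the regression.
print("suggested: git bisect start v5.8 v5.7")
```

Sorting on the numeric tuple rather than the raw string matters here, since a plain string sort would place 5.7.10 before 5.7.0-301 incorrectly in some cases and 5.7.17 before 5.7.2.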
Possibly fixed in this commit:

https://github.com/torvalds/linux/commit/a81bfdf8bf5396824d7d139560180854cb599b06

(In reply to JerryD from comment #17)
> Possibly fixed in this commit:
>
> https://github.com/torvalds/linux/commit/a81bfdf8bf5396824d7d139560180854cb599b06

No, not fixed. I have tried up to 5.11.3-50.fc33.x86_64. Exact same failure.

I have had to clear off a bunch of files to make room on this laptop to start doing kernel builds in hopes of identifying the breakage.
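
When testing many kernel builds, it can help to check each boot's dmesg for the two symptoms reported in this thread instead of reading the log by eye. The sketch below is an illustration only: `check_dmesg` is a hypothetical helper, the 10000 ms default mirrors the 10 second limit that triggers the warning in kernel/power/suspend_test.c, and `SAMPLE` combines lines copied from the two separate traces posted above.

```python
import re

RESUME_RE = re.compile(r"PM: resume devices took (\d+)\.(\d{3}) seconds")
BACKLIGHT_SYMBOL = "amdgpu_dm_backlight_update_status"


def check_dmesg(text, slow_ms=10_000):
    """Report the two symptoms from this thread found in a dmesg dump."""
    # Convert every "took S.mmm seconds" match to milliseconds.
    resume_ms = [int(s) * 1000 + int(ms) for s, ms in RESUME_RE.findall(text)]
    return {
        "slow_resume": any(t > slow_ms for t in resume_ms),
        "backlight_trace": BACKLIGHT_SYMBOL in text,
        "resume_ms": resume_ms,
    }


# Lines copied from the two traces in this thread (different kernels).
SAMPLE = """\
[ 296.606452] PM: resume devices took 10.437 seconds
[ 296.606455] Component: resume devices, time: 10437
[ 224.767529] amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
"""

report = check_dmesg(SAMPLE)
print(report)
```

To use it on a real boot, save the log with `dmesg > boot.log` after a suspend/resume cycle and pass the file contents to `check_dmesg`. A clean result does not prove the kernel is good, since a black screen can occur without either message being logged.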