Created attachment 283693 [details] dmesg After updating from 5.1 to 5.2.1 in about 5-10 minutes of watching a Youtube video in Firefox I now get complete lock-up of video output and inability to shutdown using power button. Using "magic keys" allows me to reboot and get kernel log via `journalctl -b -1 -k`, here is relevant part: BUG: kernel NULL pointer dereference, address: 00000000000002b4 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 2 PID: 8200 Comm: kworker/u16:1 Tainted: G IO 5.2.1-1383.gd5bbc26-HSF #1 openSUSE Tumbleweed (unreleased) Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS F14e 09/09/2014 Workqueue: events_unbound commit_work RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu] Code: 04 00 00 49 8b bc 02 80 02 00 00 48 8b 07 48 8b 40 50 e8 ed 88 a8 d6 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2 8b 8e 10 01 00 00 bf 04 00 00 RSP: 0018:ffffa5568b1b7c00 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002 RDX: ffffffffc07fbd50 RSI: 0000000000000000 RDI: ffff8e9ee9500000 RBP: ffff8e9d90618000 R08: 0000000000000001 R09: 0000000000000000 R10: ffffa5568b1b7c30 R11: 0000000000000000 R12: ffff8e9ee9500000 R13: ffff8e9ededb4448 R14: ffff8e9e47e10c00 R15: ffff8e9ededa0000 FS: 0000000000000000(0000) GS:ffff8e9eee000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002b4 CR3: 00000003bcf46000 CR4: 00000000000406e0 Call Trace: dc_commit_state+0x79/0xb0 [amdgpu] amdgpu_dm_atomic_commit_tail+0x3c0/0xdb0 [amdgpu] ? finish_task_switch+0x74/0x300 ? __switch_to+0x152/0x4e0 ? __switch_to_asm+0x34/0x70 ? __lock_acquire+0x3c8/0x7a0 ? find_held_lock+0x32/0x90 ? find_held_lock+0x32/0x90 ? sched_clock+0x5/0x10 ? mark_held_locks+0x2d/0x80 ? preempt_count_sub+0x98/0xe0 ? _raw_spin_unlock_irq+0x3a/0x50 ? 
wait_for_completion_timeout+0xe9/0x110 ? commit_tail+0x3c/0x70 commit_tail+0x3c/0x70 process_one_work+0x271/0x5f0 worker_thread+0x4a/0x3d0 ? process_one_work+0x5f0/0x5f0 kthread+0x118/0x140 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x27/0x50 Modules linked in: af_packet ts_bm xt_pkttype xt_string nf_nat_ftp nf_conntrack_ftp xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq zram snd_pcm_oss rfcomm snd_mixer_oss it87 hwmon_vid bnep msr rc_avermedia tuner_simple tuner_types amd64_edac_mod tuner tda7432 edac_mce_amd btusb tvaudio kvm_amd ath9k btrtl btbcm msp3400 ath9k_common btintel bluetooth ath9k_hw kvm irqbypass ath bttv tea575x joydev tveeprom videobuf_dma_sg snd_usb_audio videobuf_core snd_usbmidi_lib rc_core snd_rawmidi snd_hda_codec_realtek mac80211 snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi v4l2_common snd_seq_device snd_hda_intel videodev sp5100_tco pcspkr snd_hda_codec wmi_bmof mxm_wmi amdgpu fam15h_power k10temp media i2c_piix4 cfg80211 r8169 snd_hda_core gpu_sched realtek snd_hwdep libphy ttm rfkill snd_pcm mac_hid hid_generic usbhid uas usb_storage ohci_pci serio_raw sd_mod ehci_pci ohci_hcd xhci_pci ehci_hcd xhci_hcd wmi exfat(O) l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc vhba(O) uinput sg nbd dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ecryptfs CR2: 00000000000002b4 ---[ end trace 0633d97cb3f2d2d6 ]--- RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu] Code: 04 00 00 49 8b bc 02 80 02 00 00 48 8b 07 48 8b 40 50 e8 ed 88 a8 d6 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2 8b 8e 10 
01 00 00 bf 04 00 00 RSP: 0018:ffffa5568b1b7c00 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002 RDX: ffffffffc07fbd50 RSI: 0000000000000000 RDI: ffff8e9ee9500000 RBP: ffff8e9d90618000 R08: 0000000000000001 R09: 0000000000000000 R10: ffffa5568b1b7c30 R11: 0000000000000000 R12: ffff8e9ee9500000 R13: ffff8e9ededb4448 R14: ffff8e9e47e10c00 R15: ffff8e9ededa0000 FS: 0000000000000000(0000) GS:ffff8e9eee000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002b4 CR3: 00000003bcf46000 CR4: 00000000000406e0
Do you mind posting a dmesg log with drm=debug=4 as part of your boot parameters?

An xorg log would be good too if applicable.

I'm curious to know what the actual sequence / system setup is for reproducing this, as this isn't really a typical sequence. I think you'd run into other NULL pointer dereferences even if this one were guarded.

I think the stream itself is NULL and it shouldn't be in the context.
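As a rough cross-check of the "stream is NULL" reading, the oops from the original report can be decoded by hand. The sketch below is only an illustration: the OOPS excerpt is copied from the report, the helper names are made up for this example, and it assumes the x86-64 SysV calling convention, in which RSI carries the second argument (taken here to be the stream pointer passed to dc_stream_log).

```python
import re

# Excerpt of the oops from the original report (Code: line trimmed to the
# bytes around the fault marker).
OOPS = """\
RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu]
Code: 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2
RDX: ffffffffc07fbd50 RSI: 0000000000000000 RDI: ffff8e9ee9500000
CR2: 00000000000002b4
"""


def faulting_bytes(oops):
    """Return the instruction bytes starting at the <..> fault marker."""
    tokens = re.search(r"^Code: (.*)$", oops, re.M).group(1).split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return bytes(int(tok.strip("<>"), 16) for tok in tokens[start:])


def register(oops, name):
    """Return the 64-bit value printed for a register such as RSI or CR2."""
    return int(re.search(rf"\b{name}: ([0-9a-f]{{16}})", oops).group(1), 16)


insn = faulting_bytes(OOPS)
# Opcode 0x8b with ModRM 0x86 is "mov eax, [rsi + disp32]".
assert insn[0] == 0x8B and insn[1] == 0x86
disp = int.from_bytes(insn[2:6], "little")
rsi = register(OOPS, "RSI")
cr2 = register(OOPS, "CR2")

print(f"load from [rsi + {disp:#x}], rsi = {rsi:#x}, cr2 = {cr2:#x}")
print("fault address matches rsi + disp:", rsi + disp == cr2)
```

The bytes at the fault marker encode a 32-bit load from [rsi+0x2b4]. With RSI equal to zero the access lands at 0x2b4, which is the address reported in CR2, so the trace is consistent with a NULL pointer in RSI rather than a corrupted one.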
Created attachment 283695 [details] dmesg with "drm=debug=4"
Created attachment 283697 [details] kernel build config
Created attachment 283699 [details] amdgpu parameters

These don't seem to change anything about the hang. With larger scheduling limits (max_num_of_queues_per_device, sched_hw_submission, sched_jobs) the hang may happen sooner, but I'm not sure.
Created attachment 283701 [details] X.log

amdgpu has TearFree and VariableRefresh enabled (though the LCDs don't support variable refresh). Dual-screen setup with two 1080p 60 fps LCDs (one VA, one TN), recently overclocked to ~73 and ~72 fps via CVT-1.2 modelines on both Linux and Windows.
Created attachment 283703 [details] lsmem
Created attachment 283705 [details] lspci -vv
Created attachment 283707 [details] lspci -t -PP -q -k -v
(In reply to Nicholas Kazlauskas from comment #1)
> Do you mind posting an dmesg log with drm=debug=4 as part of your boot
> parameters?
>
> An xorg log would be good too if applicable.
>
> I'm curious to know what the actual sequence / system setup is for
> reproducing this as this isn't really a typical sequence. I think you'd run
> into other NULL pointer dereferences even if this one is guarded.
>
> I think the stream itself is NULL and it shouldn't be in the context.

I don't think that putting 'drm=debug=4' into the boot command line has changed anything, but here's some more data.

I also recently stumbled into another baffling regression (bug #203703, from 5.0 to 5.1) concerning network packet scheduling (the fq_codel qdisc) that halts the affected Ethernet device. It also produces a repeatable kernel trace on random network activity unless the qdisc is changed to the dumb "pfifo_fast" early on, similar to how this bug produces the same repeatable amdgpu trace on some random GPU activity. Weird.
Thanks for all the logs. I meant drm.debug=4 actually, the drm=debug=4 was a typo on my part - sorry!
Created attachment 283709 [details] /proc/interrupts
Created attachment 283741 [details] dmesg with "drm.debug=4"

Here's the actual debug dmesg. The pci subsystem uses 'pci=x=y' syntax, so I assumed the same form would be valid for drm.

The first hang with debug enabled happened after >16 hours of uptime and >30 minutes of video. Right when I wanted to upload that dump, it crashed again before Firefox even had a chance to render a single page, which happened to be the same Youtube page everything had hung on, because Firefox starts at the last opened page. So after >30 minutes the first time, it took less than a second to hang again. This dump is from that second hang.

I haven't tried launching a local video player or a 3D app. Without opening Youtube or other video in Firefox, ordinary 2D non-accelerated desktop work doesn't seem to trigger it.
Created attachment 283745 [details] tail -n 2000 from dmesg with "drm.debug=5"

drm.debug=4 seems to produce only 1 new relevant line: "[drm:dc_commit_state [amdgpu]] dc_commit_state: 2 streams", so I tried increasing it. drm.debug=5 creates a horrible stream of messages that bogs down the system with I/O load from journald, but it did write some more at the moment of the hang. I'm not going any further than that, though.
(In reply to Nicholas Kazlauskas from comment #10)
> Thanks for all the logs.
>
> I meant drm.debug=4 actually, the drm=debug=4 was a typo on my part - sorry!

So, I've got all I could on this. Could this be related to my recent LCD overclock? I haven't tried going back to 60 fps yet.

The cvt executable and modes/xf86cvt.c in the X server haven't been updated for years and can't produce CVT-1.2 modes or any useful "reduced blanking" modes, so I had to go for things like:
https://github.com/kevinlekiller/cvt_modeline_calculator_12
and
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899066

On Windows I had to use https://www.monitortests.com/forum/Thread-Custom-Resolution-Utility-CRU because the AMD driver refuses to use custom modes that it generates itself, nagging with "unsupported" (yeah, right…) "errors".
Thanks for the logs. I don't think this is related to your overclock.

Since this behavior wasn't previously observed during our 5.2 testing, I think that either a patch got lost or changed during the submission process, or something from 5.3 was backported into 5.2 that shouldn't have been.

I don't think it's necessarily setup specific.
(In reply to Nicholas Kazlauskas from comment #15)
> Thanks for the logs. I don't think this is related to your overclock.
>
> Since this behavior wasn't previously observed during our 5.2 testing I
> think that either a patch got lost or changed during the submission process,
> or something from 5.3 was backported into 5.2 that shouldn't have been.
>
> I don't think it's necessairly setup specific.

Does that mean that you were able to reproduce it? If so, is there any known workaround or ETA for the fix? Is rc1 of 5.3 affected? Any plans to backport to 5.2.x?
I was facing the same issue: complete video output stop, and the X server process became unresponsive.

I did a hardware switch the day before:
GPU: PNY GTX 1060 -> Asus Vega 56
Mainboard: Asus Z370P -> MSI Z390A Pro

A friend suggested that I install some packages to enhance GPU support; one of them was "xf86-video-amdgpu".

It seems that package was responsible for the issues. Removing it fixed the issue without any other (notable) effects.

Some more info for context:
X: X.Org X Server 1.20.5
Desktop: plasmashell 5.16.3
Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
(In reply to Yann HN from comment #17) > A friend suggested me to install some packages to enhance the GPU Support, > one of them was "xf86-video-amdgpu". > > Seams like that package was responsible for the issues. > Removing it fixed the issue without any other (notable) effects. Did you get the same amdgpu_dm_atomic_commit_tail => dc_commit_state => dc_stream_log NULL pointer dereference as reported here? If yes, this is a kernel driver bug, xf86-video-amdgpu just triggers it / the Xorg modesetting driver avoids it somehow. If not, please file your own report at https://bugs.freedesktop.org/enter_bug.cgi?product=xorg&component=Driver/AMDgpu and attach the corresponding Xorg log file and output of dmesg.
(In reply to Michel Dänzer from comment #18) > (In reply to Yann HN from comment #17) > > A friend suggested me to install some packages to enhance the GPU Support, > > one of them was "xf86-video-amdgpu". > > > > Seams like that package was responsible for the issues. > > Removing it fixed the issue without any other (notable) effects. > > Did you get the same amdgpu_dm_atomic_commit_tail => dc_commit_state => > dc_stream_log NULL pointer dereference as reported here? > > If yes, this is a kernel driver bug, xf86-video-amdgpu just triggers it / > the Xorg modesetting driver avoids it somehow. > > If not, please file your own report at > https://bugs.freedesktop.org/enter_bug.cgi?product=xorg&component=Driver/ > AMDgpu and attach the corresponding Xorg log file and output of dmesg. Yes, i re installed the package and was able to reproduce the error pretty fast, here the whole stack trace(package being the source of the issue confirmed): Jul 25 17:38:12 arch-workstation kernel: BUG: kernel NULL pointer dereference, address: 00000000000002b4 Jul 25 17:38:12 arch-workstation kernel: #PF: supervisor read access in kernel mode Jul 25 17:38:12 arch-workstation kernel: #PF: error_code(0x0000) - not-present page Jul 25 17:38:12 arch-workstation kernel: PGD 0 P4D 0 Jul 25 17:38:12 arch-workstation kernel: Oops: 0000 [#1] PREEMPT SMP PTI Jul 25 17:38:12 arch-workstation kernel: CPU: 3 PID: 296 Comm: kworker/u24:4 Not tainted 5.2.2-arch1-1-ARCH #1 Jul 25 17:38:12 arch-workstation kernel: Hardware name: Micro-Star International Co., Ltd. 
MS-7B98/Z390-A PRO (MS-7B98), BIOS 1.60 03/21/2019 Jul 25 17:38:12 arch-workstation kernel: Workqueue: events_unbound commit_work [drm_kms_helper] Jul 25 17:38:12 arch-workstation kernel: RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu] Jul 25 17:38:12 arch-workstation kernel: Code: 04 00 00 49 8b bc 02 80 02 00 00 48 8b 07 48 8b 40 50 e8 1d 35 f7 cd b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2 8b 8e 10 01 00 00 bf 04 00> Jul 25 17:38:12 arch-workstation kernel: RSP: 0018:ffff9ced83f5faf0 EFLAGS: 00010202 Jul 25 17:38:12 arch-workstation kernel: RAX: 0000000000000000 RBX: ffff8b9687199000 RCX: 0000000000000002 Jul 25 17:38:12 arch-workstation kernel: RDX: ffffffffc1112710 RSI: 0000000000000000 RDI: ffff8b9687199000 Jul 25 17:38:12 arch-workstation kernel: RBP: ffff8b95c7868000 R08: ffff8b95c7868000 R09: 0000000000000000 Jul 25 17:38:12 arch-workstation kernel: R10: ffff8b95c7868000 R11: 0000000000000018 R12: 0000000000000001 Jul 25 17:38:12 arch-workstation kernel: R13: ffff9ced83f5fd58 R14: ffff8b967420cff0 R15: 0000000000000000 Jul 25 17:38:12 arch-workstation kernel: FS: 0000000000000000(0000) GS:ffff8b968d8c0000(0000) knlGS:0000000000000000 Jul 25 17:38:12 arch-workstation kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 25 17:38:12 arch-workstation kernel: CR2: 00000000000002b4 CR3: 000000080c284006 CR4: 00000000003606e0 Jul 25 17:38:12 arch-workstation kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Jul 25 17:38:12 arch-workstation kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Jul 25 17:38:12 arch-workstation kernel: Call Trace: Jul 25 17:38:12 arch-workstation kernel: dc_commit_state+0x9a/0x5a0 [amdgpu] Jul 25 17:38:12 arch-workstation kernel: ? dm_plane_helper_cleanup_fb+0xa3/0x120 [amdgpu] Jul 25 17:38:12 arch-workstation kernel: amdgpu_dm_atomic_commit_tail+0xc5d/0x1a10 [amdgpu] Jul 25 17:38:12 arch-workstation kernel: ? 
__switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x34/0x70 Jul 25 17:38:12 arch-workstation kernel: ? __switch_to_asm+0x40/0x70 Jul 25 17:38:12 arch-workstation kernel: ? _raw_spin_unlock_irq+0x1d/0x30 Jul 25 17:38:12 arch-workstation kernel: ? finish_task_switch+0x84/0x2d0 Jul 25 17:38:12 arch-workstation kernel: ? preempt_schedule_common+0x32/0x80 Jul 25 17:38:12 arch-workstation kernel: ? 
commit_tail+0x3c/0x70 [drm_kms_helper] Jul 25 17:38:12 arch-workstation kernel: commit_tail+0x3c/0x70 [drm_kms_helper] Jul 25 17:38:12 arch-workstation kernel: process_one_work+0x1d1/0x3e0 Jul 25 17:38:12 arch-workstation kernel: worker_thread+0x4a/0x3d0 Jul 25 17:38:12 arch-workstation kernel: kthread+0xfb/0x130 Jul 25 17:38:12 arch-workstation kernel: ? process_one_work+0x3e0/0x3e0 Jul 25 17:38:12 arch-workstation kernel: ? kthread_park+0x90/0x90 Jul 25 17:38:12 arch-workstation kernel: ret_from_fork+0x35/0x40 Jul 25 17:38:12 arch-workstation kernel: Modules linked in: fuse xt_nat xt_tcpudp veth xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo iptable_nat xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defra> Jul 25 17:38:12 arch-workstation kernel: snd_usbmidi_lib ppdev iTCO_wdt snd_hda_codec iTCO_vendor_support snd_rawmidi snd_seq_device media snd_hda_core agpgart snd_hwdep syscopyarea snd_pcm aesni_intel sysfillrect snd_timer aes_x86_64 c> Jul 25 17:38:12 arch-workstation kernel: CR2: 00000000000002b4 Jul 25 17:38:12 arch-workstation kernel: ---[ end trace 8659bfc7daefd7ef ]--- Jul 25 17:38:12 arch-workstation kernel: RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu] Jul 25 17:38:12 arch-workstation kernel: Code: 04 00 00 49 8b bc 02 80 02 00 00 48 8b 07 48 8b 40 50 e8 1d 35 f7 cd b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2 8b 8e 10 01 00 00 bf 04 00> Jul 25 17:38:12 arch-workstation kernel: RSP: 0018:ffff9ced83f5faf0 EFLAGS: 00010202
I haven't been able to reproduce this on my setup yet with xf86-video-amdgpu on Arch's 5.2.2 kernel. I don't see anything really missing between that and staging that could affect this issue.

It would probably help to have a dmesg log with drm.debug=0x54 - this will enable DRM atomic state debug prints. You'll probably need to increase your log buffer size to get the state relevant to the crash, e.g. (84 is 0x54 in decimal):

" log_buf_len=64M drm.debug=84 "
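Since drm.debug is a bitmask, it may help to see which message categories each value used in this thread switches on. The following is a minimal sketch; the bit-to-category table is an assumption taken from the kernel's description of the drm.debug module parameter, and the function name is made up for this example.

```python
# Assumed bit values of the drm.debug categories, per the kernel's
# description of the drm.debug module parameter.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
    0x100: "DP",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


# The three values that appear in this thread: 4, 5 and 0x54 (decimal 84).
for value in (4, 5, 0x54):
    names = ", ".join(decode_drm_debug(value))
    print(f"drm.debug={value} ({value:#x}): {names}")
```

Under that table, 4 enables only the KMS messages, 5 adds the very chatty CORE messages on top, and 0x54 combines KMS, ATOMIC and STATE, which is why it yields the atomic state dumps without the flood seen with 5.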
Facing the same issue (Vega64). I captured a dmesg (drm.debug=0x54) with lockup and uploaded it here: https://nognu.de/p/dmesg_amdgpu.txt Thanks!
Thanks for the log! I can reproduce the issue now by emulating the sequence using IGT. It doesn't seem to show up in desktop usage for me.
(In reply to Nicholas Kazlauskas from comment #22)
> Thanks for the log!
>
> I can reproduce the issue now by emulating the sequence using IGT. It
> doesn't seem to show up in desktop usage for me.

Indeed. I tried using the modesetting X11 driver and got a bunch of errors in Xorg.0.log about being unable to do "page flips", so I've set `PageFlip false` for it, and for amdgpu I've set `EnablePageFlip false` and removed 'TearFree true' (why isn't it always on by default?), just in case. With that, there have been no hangs for about 24 hours, even with a lot of Youtube in Firefox and even with amdgpu.

There seem to be a lot of patches for AMD GPUs queued for 5.2.5; any chance the complete fix is among them?
This should be fixed with the series linked below: https://patchwork.freedesktop.org/series/64505/ But it still needs review and backporting to older kernels.
(In reply to Nicholas Kazlauskas from comment #24) > This should be fixed with the series linked below: > > https://patchwork.freedesktop.org/series/64505/ > > But it still needs review and backporting to older kernels. So, I've patched my 5.2.5 kernel package with that set and re-enabled page flipping. So far, everything seems fine. When it's merged and released, this issue may be closed. Thanks !
Created attachment 284083 [details] dmesg_2019-08-02-amdgpu_fail_on_patched_5.2.5 (In reply to Nicholas Kazlauskas from comment #24) > This should be fixed with the series linked below: > > https://patchwork.freedesktop.org/series/64505/ > > But it still needs review and backporting to older kernels. Celebration might have been premature. Hours later I've got another freeze with different error in amdgpu. Only this time, mouse cursor was movable over frozen frame right until I tried switching VT. Here's trace: BUG: unable to handle page fault for address: 0000000800000184 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 2 PID: 21044 Comm: kworker/u16:0 Tainted: G W IO 5.2.5-1396.g79b6a9c-HSF #1 openSUSE Tumbleweed (unreleased) Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS F14e 09/09/2014 Workqueue: events_unbound commit_work RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2e6/0xd60 [amdgpu] Code: ff 48 89 de 48 8b b8 40 43 01 00 e8 94 3b 09 00 49 8b 54 24 08 48 89 9d 30 fe ff ff 8b 82 00 09 00 00 85 c0 0f 85 fb fd ff ff <80> bb 80 01 00 00 01 0f 86 a0 00 00 00 48 b9 00 00 00 00 01 00 00 RSP: 0018:ffff98198b837c30 EFLAGS: 00010202 RAX: 0000000000000023 RBX: 0000000800000004 RCX: ffff8aca7b146f18 RDX: ffff8acc2a2d9000 RSI: ffffffffc0994f00 RDI: 0000000000000002 RBP: ffff98198b837e10 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8aca97bf3540 R13: ffff8acc114b1000 R14: ffff8acc035da000 R15: 0000000000000006 FS: 0000000000000000(0000) GS:ffff8acc2e000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000800000184 CR3: 00000003747c2000 CR4: 00000000000406e0 Call Trace: ? mark_held_locks+0x2d/0x80 ? _raw_spin_unlock_irq+0x3a/0x50 ? finish_task_switch+0xa2/0x300 ? __lock_acquire+0x3c3/0x7c0 ? find_held_lock+0x32/0x90 ? find_held_lock+0x32/0x90 ? sched_clock+0x5/0x10 ? 
mark_held_locks+0x2d/0x80 ? preempt_count_sub+0x98/0xe0 ? _raw_spin_unlock_irq+0x3a/0x50 ? wait_for_completion_timeout+0xe9/0x110 ? commit_tail+0x3c/0x70 commit_tail+0x3c/0x70 process_one_work+0x271/0x5f0 worker_thread+0x4a/0x3d0 ? process_one_work+0x5f0/0x5f0 kthread+0x118/0x140 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x27/0x50 Modules linked in: r8169 binfmt_misc af_packet ts_bm xt_pkttype xt_string nf_nat_ftp nf_conntrack_ftp xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss zram snd_mixer_oss bnep it87 hwmon_vid msr joydev amd64_edac_mod edac_mce_amd btusb btrtl btbcm rc_avermedia btintel kvm_amd tuner_simple tuner_types bluetooth snd_usb_audio tuner kvm tda7432 snd_usbmidi_lib snd_rawmidi irqbypass tvaudio msp3400 snd_seq_device ath9k bttv ath9k_common ath9k_hw tea575x tveeprom ath videobuf_dma_sg videobuf_core rc_core v4l2_common pcspkr wmi_bmof videodev mxm_wmi mac80211 fam15h_power k10temp sp5100_tco media amdgpu i2c_piix4 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio snd_hda_intel cfg80211 snd_hda_codec snd_hda_core realtek gpu_sched libphy snd_hwdep ttm rfkill snd_pcm mac_hid hid_generic usbhid uas usb_storage ohci_pci serio_raw sd_mod ohci_hcd ehci_pci ehci_hcd xhci_pci xhci_hcd wmi exfat(O) l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc vhba(O) uinput sg nbd dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ecryptfs [last unloaded: r8169] CR2: 0000000800000184 ---[ end trace 7da703104c8acbc9 ]--- RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2e6/0xd60 [amdgpu] Code: ff 48 89 de 48 8b b8 40 43 01 00 e8 
94 3b 09 00 49 8b 54 24 08 48 89 9d 30 fe ff ff 8b 82 00 09 00 00 85 c0 0f 85 fb fd ff ff <80> bb 80 01 00 00 01 0f 86 a0 00 00 00 48 b9 00 00 00 00 01 00 00 RSP: 0018:ffff98198b837c30 EFLAGS: 00010202 RAX: 0000000000000023 RBX: 0000000800000004 RCX: ffff8aca7b146f18 RDX: ffff8acc2a2d9000 RSI: ffffffffc0994f00 RDI: 0000000000000002 RBP: ffff98198b837e10 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8aca97bf3540 R13: ffff8acc114b1000 R14: ffff8acc035da000 R15: 0000000000000006 FS: 0000000000000000(0000) GS:ffff8acc2e000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000800000184 CR3: 00000003747c2000 CR4: 00000000000406e0 How ironic for it to manifest again during discussion video on Youtube about recent "JoJo" Part 5 finale's "perpetually trapped in the repeating nightmare of a frozen time" theme…
Created attachment 284153 [details] dmesg_2019-08-04-amdgpu-new_dereference-with-shadowprimary So, I've been using explicitly disabled "EnablePageFlip" and "TearFree" options as workaround for the original dereference but then decided to try out "ShadowPrimary" during fiddling with mvtools' motion-interpolation optimization in mpv, since page flipping is disabled anyway. But the result was ANOTHER null pointer dereference mere seconds after login: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 1 PID: 3272 Comm: X:cs0 Tainted: G IO 5.2.5-1407.g79b6a9c-HSF #1 openSUSE Tumbleweed Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS F14e 09/09/2014 RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu] Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 4c 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c 24 08 31 f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92 RSP: 0018:ffffafc2478aba10 EFLAGS: 00010246 RAX: ffff98742e20e670 RBX: ffff98742e20e658 RCX: ffff98744fc66040 RDX: ffff98744fc66000 RSI: ffff98742e20e638 RDI: ffff9873a295f800 RBP: ffff987459e00000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffffafc2478abb58 R14: ffff98744fc66000 R15: ffffafc2478abb58 FS: 00007f3ee03d7700(0000) GS:ffff98746de00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 00000003f27aa000 CR4: 00000000000406e0 Call Trace: amdgpu_cs_vm_handling+0x308/0x440 [amdgpu] amdgpu_cs_ioctl+0x154/0xa10 [amdgpu] ? amdgpu_cs_vm_handling+0x440/0x440 [amdgpu] drm_ioctl_kernel+0xaa/0xf0 drm_ioctl+0x208/0x385 ? amdgpu_cs_vm_handling+0x440/0x440 [amdgpu] ? _raw_spin_unlock_irqrestore+0x59/0x70 ? preempt_count_sub+0x98/0xe0 ? 
_raw_spin_unlock_irqrestore+0x46/0x70 amdgpu_drm_ioctl+0x49/0x80 [amdgpu] do_vfs_ioctl+0x3ed/0x720 ? __fget+0xf9/0x1b0 ksys_ioctl+0x5e/0x90 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x66/0xc0 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f3ee641c7c7 Code: 00 00 90 48 8b 05 d1 86 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 86 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007f3ee03d6a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f3ee03d6a70 RCX: 00007f3ee641c7c7 RDX: 00007f3ee03d6a70 RSI: 00000000c0186444 RDI: 000000000000000e RBP: 00000000c0186444 R08: 00007f3ee03d6b80 R09: 0000000000000020 R10: 00007f3ee03d6b80 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000e R14: 000055d55e6f8bf0 R15: 000055d55e6f91a8 Modules linked in: af_packet xt_pkttype xt_string nf_nat_ftp nf_conntrack_ftp xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss msr bnep it87 hwmon_vid zram amd64_edac_mod edac_mce_amd kvm_amd kvm rc_avermedia tuner_simple tuner_types irqbypass tuner tda7432 btusb btrtl btbcm btintel tvaudio msp3400 bluetooth snd_usb_audio ath9k joydev bttv ath9k_common snd_usbmidi_lib tea575x ath9k_hw tveeprom snd_rawmidi videobuf_dma_sg mxm_wmi wmi_bmof pcspkr ath videobuf_core snd_seq_device k10temp fam15h_power rc_core snd_hda_codec_realtek v4l2_common snd_hda_codec_generic sp5100_tco snd_hda_codec_hdmi ledtrig_audio mac80211 amdgpu videodev media i2c_piix4 snd_hda_intel cfg80211 snd_hda_codec r8169 snd_hda_core realtek snd_hwdep libphy snd_pcm 
gpu_sched rfkill ttm mac_hid hid_generic usbhid uas usb_storage ohci_pci serio_raw sd_mod ehci_pci ohci_hcd ehci_hcd xhci_pci xhci_hcd wmi exfat(O) l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc vhba(O) uinput sg nbd dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ecryptfs CR2: 0000000000000008 ---[ end trace a7f0ed14134a76ad ]--- RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu] Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 4c 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c 24 08 31 f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92 RSP: 0018:ffffafc2478aba10 EFLAGS: 00010246 RAX: ffff98742e20e670 RBX: ffff98742e20e658 RCX: ffff98744fc66040 RDX: ffff98744fc66000 RSI: ffff98742e20e638 RDI: ffff9873a295f800 RBP: ffff987459e00000 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffffafc2478abb58 R14: ffff98744fc66000 R15: ffffafc2478abb58 FS: 00007f3ee03d7700(0000) GS:ffff98746de00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 00000003f27aa000 CR4: 00000000000406e0
I experienced issues after upgrading the kernel from 5.1 to 5.2 on my notebook with a 2500U. I tried the kernel boot parameter iommu=soft and that fixed it.
(In reply to vr00m from comment #28)
> I experienced issues after upgrading kernel from 5.1 to 5.2 on my notebook
> with 2500 U. I tried kernel boot param iommu=soft and that fixed it.

I've encountered this issue with kernel 5.2 (tried 5.2.8 just now) and also have a Ryzen 5 2500U notebook (Huawei Matebook D 14" (AMD)), running Manjaro. The login screen appears fine, but after that I get a black screen. I know nothing's locked up because I was able to launch GZDoom by typing blind in the Whisker menu and heard the sounds of Doom, or at least of the title screen.
(In reply to Sergey Kondakov from comment #27) > Created attachment 284153 [details] > dmesg_2019-08-04-amdgpu-new_dereference-with-shadowprimary > > So, I've been using explicitly disabled "EnablePageFlip" and "TearFree" > options as workaround for the original dereference but then decided to try > out "ShadowPrimary" during fiddling with mvtools' motion-interpolation > optimization in mpv, since page flipping is disabled anyway. But the result > was ANOTHER null pointer dereference mere seconds after login: > BUG: kernel NULL pointer dereference, address: 0000000000000008 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 0 P4D 0 > Oops: 0000 [#1] PREEMPT SMP NOPTI > CPU: 1 PID: 3272 Comm: X:cs0 Tainted: G IO > 5.2.5-1407.g79b6a9c-HSF #1 openSUSE Tumbleweed > Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS > F14e 09/09/2014 > RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu] > Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 4c > 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c 24 08 31 > f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92 > RSP: 0018:ffffafc2478aba10 EFLAGS: 00010246 > RAX: ffff98742e20e670 RBX: ffff98742e20e658 RCX: ffff98744fc66040 > RDX: ffff98744fc66000 RSI: ffff98742e20e638 RDI: ffff9873a295f800 > RBP: ffff987459e00000 R08: 0000000000000000 R09: 0000000000000001 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > R13: ffffafc2478abb58 R14: ffff98744fc66000 R15: ffffafc2478abb58 > FS: 00007f3ee03d7700(0000) GS:ffff98746de00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000008 CR3: 00000003f27aa000 CR4: 00000000000406e0 > Call Trace: > amdgpu_cs_vm_handling+0x308/0x440 [amdgpu] > amdgpu_cs_ioctl+0x154/0xa10 [amdgpu] > ? amdgpu_cs_vm_handling+0x440/0x440 [amdgpu] > drm_ioctl_kernel+0xaa/0xf0 > drm_ioctl+0x208/0x385 > ? 
amdgpu_cs_vm_handling+0x440/0x440 [amdgpu] > ? _raw_spin_unlock_irqrestore+0x59/0x70 > ? preempt_count_sub+0x98/0xe0 > ? _raw_spin_unlock_irqrestore+0x46/0x70 > amdgpu_drm_ioctl+0x49/0x80 [amdgpu] > do_vfs_ioctl+0x3ed/0x720 > ? __fget+0xf9/0x1b0 > ksys_ioctl+0x5e/0x90 > __x64_sys_ioctl+0x16/0x20 > do_syscall_64+0x66/0xc0 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x7f3ee641c7c7 > Code: 00 00 90 48 8b 05 d1 86 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff > ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff > 73 01 c3 48 8b 0d a1 86 0c 00 f7 d8 64 89 01 48 > RSP: 002b:00007f3ee03d6a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 > RAX: ffffffffffffffda RBX: 00007f3ee03d6a70 RCX: 00007f3ee641c7c7 > RDX: 00007f3ee03d6a70 RSI: 00000000c0186444 RDI: 000000000000000e > RBP: 00000000c0186444 R08: 00007f3ee03d6b80 R09: 0000000000000020 > R10: 00007f3ee03d6b80 R11: 0000000000000246 R12: 0000000000000000 > R13: 000000000000000e R14: 000055d55e6f8bf0 R15: 000055d55e6f91a8 > Modules linked in: af_packet xt_pkttype xt_string nf_nat_ftp > nf_conntrack_ftp xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack > ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security > iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack > nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables > scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables > x_tables bpfilter snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq > snd_pcm_oss snd_mixer_oss msr bnep it87 hwmon_vid zram amd64_edac_mod > edac_mce_amd kvm_amd kvm rc_avermedia tuner_simple tuner_types irqbypass > tuner tda7432 btusb btrtl btbcm btintel tvaudio msp3400 bluetooth > snd_usb_audio ath9k joydev bttv ath9k_common snd_usbmidi_lib tea575x > ath9k_hw tveeprom snd_rawmidi videobuf_dma_sg mxm_wmi wmi_bmof pcspkr ath > videobuf_core snd_seq_device k10temp fam15h_power rc_core > snd_hda_codec_realtek v4l2_common 
snd_hda_codec_generic > sp5100_tco snd_hda_codec_hdmi ledtrig_audio mac80211 amdgpu videodev media > i2c_piix4 snd_hda_intel cfg80211 snd_hda_codec r8169 snd_hda_core realtek > snd_hwdep libphy snd_pcm gpu_sched rfkill ttm mac_hid hid_generic usbhid uas > usb_storage ohci_pci serio_raw sd_mod ehci_pci ohci_hcd ehci_hcd xhci_pci > xhci_hcd wmi exfat(O) l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel > udp_tunnel pppox ppp_generic slhc vhba(O) uinput sg nbd dm_multipath > scsi_dh_rdac scsi_dh_emc scsi_dh_alua ecryptfs > CR2: 0000000000000008 > ---[ end trace a7f0ed14134a76ad ]--- > RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu] > Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 4c > 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c 24 08 31 > f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92 > RSP: 0018:ffffafc2478aba10 EFLAGS: 00010246 > RAX: ffff98742e20e670 RBX: ffff98742e20e658 RCX: ffff98744fc66040 > RDX: ffff98744fc66000 RSI: ffff98742e20e638 RDI: ffff9873a295f800 > RBP: ffff987459e00000 R08: 0000000000000000 R09: 0000000000000001 > R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 > R13: ffffafc2478abb58 R14: ffff98744fc66000 R15: ffffafc2478abb58 > FS: 00007f3ee03d7700(0000) GS:ffff98746de00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000000000008 CR3: 00000003f27aa000 CR4: 00000000000406e0 Sergey, I tried to reproduce you latest issue on Ellsmere (Polaris 10) with "ShadowPrimary" enabled flip disabled and didn't observe any crash. In case you built your own kernel can you give me the output of this command - Run gdb on amdgpu.ko gdb drivers/gpu/drm/amd/amdgpu/amdgpu.ko Then do - list *(amdgpu_vm_update_directories+0xe7)
(In reply to Andrey Grodzovsky from comment #30) > (In reply to Sergey Kondakov from comment #27) > > Sergey, I tried to reproduce you latest issue on Ellsmere (Polaris 10) with > "ShadowPrimary" enabled flip disabled and didn't observe any crash. > In case you built your own kernel can you give me the output of this command > - > > Run gdb on amdgpu.ko > gdb drivers/gpu/drm/amd/amdgpu/amdgpu.ko > > Then do - > list *(amdgpu_vm_update_directories+0xe7) The crash may take a while (hours) to manifest and requires some video-watching via Firefox and/or mpv (with the '--opengl-pbo' option on the opengl-hq profile). It also may or may not need VAAPI to be used ('--hwdec=vaapi-copy' in the case of mpv). My kernel is built on the OBS build server, so I had to enable debuginfo packaging and rebuild it; the debuginfo package then used up a mind-boggling 5.1 GB of space, leaving me with a measly ~400 MB on / ! After that I managed to get this: 0x2e127 is in amdgpu_vm_update_directories (../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1191). where line #1191 is: struct amdgpu_bo *bo = parent->base.bo, *pbo; But it is a different build of the kernel, so I don't know if this is even relevant. I'm not going to keep this monstrosity around. You may check out the packages at https://build.opensuse.org/package/binaries/home:X0F:HSF:Kernel/kernel-HSF/standard - they have pretty much all kernel modules that x86_64 supports, so it should run anywhere.
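For anyone else who wants to run the same gdb lookup on their own oops, here is a minimal Python sketch that pulls the symbol and offset out of a kernel `RIP:` line and prints the matching `list *(symbol+0xoffset)` command. The helper names `parse_rip` and `gdb_list_command` are made up for this illustration, and the regular expression only assumes the `RIP: 0010:symbol+0xoff/0xsize [module]` layout seen in the traces in this report.

```python
import re

# Matches the "symbol+0xoffset/0xsize [module]" part of a kernel oops RIP line.
# Userspace RIP lines such as "RIP: 0033:0x7f3ee641c7c7" have no symbol and
# are deliberately not matched.
RIP_RE = re.compile(
    r"RIP:\s+[0-9a-f]{4}:(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_rip(line):
    """Return symbol, offset, size and module from an oops RIP line, or None."""
    match = RIP_RE.search(line)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),
    }


def gdb_list_command(info):
    """Build the gdb command that maps symbol+offset to a source line."""
    return "list *({}+{:#x})".format(info["symbol"], info["offset"])


if __name__ == "__main__":
    rip = "RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu]"
    print(gdb_list_command(parse_rip(rip)))
```

The printed command is only meaningful when gdb is pointed at an amdgpu.ko with debuginfo from the very same build that produced the oops; with a different build the offset can land on a different source line.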
Just got exactly the same 0010:amdgpu_vm_update_directories+0xe7/0x260 dereference immediately on login, even with PageFlip & TearFree disabled and ShadowPrimary NOT enabled. Even all the addresses are the same as before. So now I'm not sure what actually triggers it. However, my setup is as non-default as it gets. amdgpu has these parameters: cik_support=1 si_support=1 msi=1 sched_policy=1 compute_multipipe=1 gartsize=1024 vm_fragment_size=9 max_num_of_queues_per_device=65536 sched_hw_submission=32 sched_jobs=1024 job_hang_limit=8000 halt_if_hws_hang=1 vm_fault_stop=0 vm_update_mode=3 vm_size=20 disp_priority=2 deep_color=1 gpu_recovery=1. irqbalance is enabled with interval=1, and rtirq has this: RTIRQ_NAME_LIST="timer rtc snd drm amdgpu radeon i915 nvidia usb i8042 ahci" RTIRQ_HIGH_LIST="watchdogd oom_reaper rcu_preempt rcu_sched rcu_bh rcub rcuc gfx sdma ksoftirqd khugepaged" RTIRQ_PRIO_HIGH=80 RTIRQ_PRIO_DECR=2 RTIRQ_PRIO_LOW=50 RTIRQ_RESET_ALL=0 to boost amdgpu's processes to the highest RT/FIFO priorities in the hope of avoiding video stuttering and audio x-runs under full load. Transparent hugepages are enabled in an attempt to spare the crappy AMD FX's TLB cache and MMU (hence the vm_fragment_size=9). Maybe it's the non-default vm_update_mode that does it. And a few kernel versions back the default GART size of 256MB was triggering some kind of fault, probably a stall and reset; maybe it even still does, but I'm not going to check. Or maybe it's all irrelevant.
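To spell out the "hence the vm_fragment_size=9" remark: as far as I understand the amdgpu module parameter description, vm_fragment_size is given in bits on top of a 4 KiB page, so the fragment covers 4 KiB shifted left by that many bits. The Python sketch below only does that arithmetic; `fragment_bytes` and the constants are names chosen for this illustration, and the 4 KiB base page is an assumption of the sketch.

```python
PAGE_SIZE = 4096  # bytes; base page size assumed by this sketch
X86_64_HUGE_PAGE = 2 * 1024 * 1024  # 2 MiB transparent huge page on x86_64


def fragment_bytes(vm_fragment_size_bits):
    """Size in bytes of a VM fragment for a given vm_fragment_size value."""
    return PAGE_SIZE << vm_fragment_size_bits


if __name__ == "__main__":
    for bits in (4, 9):
        print(bits, fragment_bytes(bits) // 1024, "KiB")
    # True when the fragment size lines up with a 2 MiB huge page.
    print(fragment_bytes(9) == X86_64_HUGE_PAGE)
```

With 9 bits the fragment works out to 2 MiB, which is the size of a transparent huge page on x86_64; whether that alignment actually helps this hardware is not something the arithmetic can answer.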
(In reply to Sergey Kondakov from comment #26) > Created attachment 284083 [details] > dmesg_2019-08-02-amdgpu_fail_on_patched_5.2.5 > > (In reply to Nicholas Kazlauskas from comment #24) > > This should be fixed with the series linked below: > > > > https://patchwork.freedesktop.org/series/64505/ > > > > But it still needs review and backporting to older kernels. > > Celebration might have been premature. Hours later I've got another freeze > with different error in amdgpu. Only this time, mouse cursor was movable > over frozen frame right until I tried switching VT. Here's trace: > BUG: unable to handle page fault for address: 0000000800000184 > #PF: supervisor read access in kernel mode > #PF: error_code(0x0000) - not-present page > PGD 0 P4D 0 > Oops: 0000 [#1] PREEMPT SMP NOPTI > CPU: 2 PID: 21044 Comm: kworker/u16:0 Tainted: G W IO > 5.2.5-1396.g79b6a9c-HSF #1 openSUSE Tumbleweed (unreleased) > Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS > F14e 09/09/2014 > Workqueue: events_unbound commit_work > RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2e6/0xd60 [amdgpu] Are you able to consistently reproduce this issue? Is it the same setup and same conditions as before? I haven't been able to see it in my testing at least.
(In reply to Nicholas Kazlauskas from comment #33) > I(In reply to Sergey Kondakov from comment #26) > > Created attachment 284083 [details] > > dmesg_2019-08-02-amdgpu_fail_on_patched_5.2.5 > > > > (In reply to Nicholas Kazlauskas from comment #24) > > > This should be fixed with the series linked below: > > > > > > https://patchwork.freedesktop.org/series/64505/ > > > > > > But it still needs review and backporting to older kernels. > > > > Celebration might have been premature. Hours later I've got another freeze > > with different error in amdgpu. Only this time, mouse cursor was movable > > over frozen frame right until I tried switching VT. Here's trace: > > BUG: unable to handle page fault for address: 0000000800000184 > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x0000) - not-present page > > PGD 0 P4D 0 > > Oops: 0000 [#1] PREEMPT SMP NOPTI > > CPU: 2 PID: 21044 Comm: kworker/u16:0 Tainted: G W IO > > 5.2.5-1396.g79b6a9c-HSF #1 openSUSE Tumbleweed (unreleased) > > Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, > BIOS > > F14e 09/09/2014 > > Workqueue: events_unbound commit_work > > RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2e6/0xd60 [amdgpu] > > Are you able to consistently reproduce this issue? Is it the same setup and > same conditions as before? I haven't been able to see it in my testing at > least. Yes, just having PageFlip enabled in amdgpu guarantees it. Changing anything other than PageFlip doesn't seem to affect it. Forcing TearFree on with PageFlip disabled may also trigger it, I think. You may try my previously linked kernel build in your testing but I doubt that it has something specific for it. It may be not reproducible with modesetting X driver because it fails to engage page flipping on init and throws a bunch of errors about it in Xorg.0.log. 
For some reason I'm unable to use modesetting X driver at all, even with page flipping disabled, it draws only mouse cursor on black background instead of sddm login screen. So I have to use amdgpu with PageFlip and TearFree explicitly disabled. But then another, rarer 0010:amdgpu_vm_update_directories+0xe7/0x260 dereference may happen regardless (which I suspect is connected with vm_update_mode option, unlike the first one). By the way, is there any disadvantage in forcing TearFree to be always on when it works ? Like additional frame of latency or something like that ?
Do you mind posting your compositor settings in plasma? That would certainly influence flip timing and submission and I haven't been able to reproduce the issue with the settings I'm using.
(In reply to Nicholas Kazlauskas from comment #35) > Do you mind posting your compositor settings in plasma? That would certainly > influence flip timing and submission and I haven't been able to reproduce > the issue with the settings I'm using. Sure. They are also quite funky: ~/.config/kwinrc: [Compositing] AnimationSpeed=2 Backend=OpenGL Enabled=true GLColorCorrection=true GLCore=true GLPlatformInterface=glx GLPreferBufferSwap=c GLTextureFilter=2 HiddenPreviews=5 OpenGLIsUnsafe=false OpenGLIsUnsafe0=false OpenGLIsUnsafe1=false UnredirectFullscreen=false WindowsBlockCompositing=false XRenderSmoothScale=false However, I run LXQt with this in startup /usr/local/bin/kwin.sh script: export __GL_YIELD=USLEEP export KWIN_TRIPLE_BUFFER=0 export KWIN_USE_BUFFER_AGE=1 export KWIN_OPENGL_INTERFACE=egl export KWIN_DIRECT_GL=1 export KWIN_FORCE_LANCZOS=1 export KWIN_PERSISTENT_VBO=1 export KWIN_EFFECTS_FORCE_ANIMATIONS=1 … if [ -z "$WAYLAND_DISPLAY" ]; then export WINDOWMANAGER="env mesa_glthread=true nice -n -5 ionice -c 2 -n 0 -t chrt -v -r 5 kwin_x11 $KWIN_OPTIONS" exec /etc/X11/xinit/xinitrc return 0 else export WINDOWMANAGER="env mesa_glthread=true nice -n -5 ionice -c 2 -n 0 -t chrt -v -r 5 kwin_wayland" export QT_QPA_PLATFORM=wayland-egl export GDK_BACKEND=wayland export CLUTTER_BACKEND=wayland export SDL_VIDEODRIVER=wayland return 0 fi X is run by /usr/local/bin/Xhp: nice -n -10 ionice -c 2 -n 0 -t chrt -v -r 10 X "$@" It hangs the system, by the way, if RT limit is not set by sched_rt_runtime_us Here's ~/.drirc, just in case: <driconf> <device screen="0" driver="radeonsi"> <application name="Default"> <option name="allow_glsl_relaxed_es" value="true" /> <option name="radeonsi_enable_sisched" value="true" /> <option name="allow_glsl_builtin_const_expression" value="true" /> <option name="mesa_glthread" value="true" /> <option name="radeonsi_enable_nir" value="true" /> <option name="allow_glsl_extension_directive_midshader" value="true" /> <option 
name="allow_rgb10_configs" value="true" /> <option name="allow_glsl_cross_stage_interpolation_mismatch" value="true" /> <option name="radeonsi_assume_no_z_fights" value="true" /> <option name="allow_glsl_builtin_variable_redeclaration" value="true" /> <option name="allow_glsl_layout_qualifier_on_function_parameters" value="true" /> <option name="adaptive_sync" value="true" /> <option name="radeonsi_commutative_blend_add" value="true" /> <option name="allow_higher_compat_version" value="true" /> </application> </device> </driconf> Some things from tuned.conf: governor=schedutil transparent_hugepages=always /sys/kernel/mm/ksm/sleep_millisecs=250 /sys/kernel/mm/transparent_hugepage/shmem_enabled=advise /sys/kernel/mm/transparent_hugepage/defrag=defer+madvise /sys/kernel/mm/transparent_hugepage/khugepaged/defrag=0 /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan=512 /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs=1000 /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs=10000 dev.hpet.max-user-freq=4096 vm.zone_reclaim_mode=0 kernel.sched_autogroup_enabled=0 kernel.sched_latency_ns=1000000 kernel.sched_min_granularity_ns=100000 kernel.sched_wakeup_granularity_ns=1000 kernel.sched_nr_migrate=256 kernel.sched_migration_cost_ns=125 kernel.sched_cfs_bandwidth_slice_us=100 kernel.sched_tunable_scaling=1 kernel.sched_rt_period_us=1000000 kernel.sched_rt_runtime_us=900000 kernel.sched_rr_timeslice_ms=3 Originally the issue manifested with GLPreferBufferSwap=n and without double-buffering & EGL enforcement, I've made those in hope to compensate for disabled TearFree and PageFlip. Please, answer the question about TearFree, if you can. I've been trying to find out since its creation and wasn't able to get even a hint. Can it really be just this perfect thing that everyone should have all the time, unless buged ?
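On the remark that the system hangs when the RT limit is not set: the two sysctls from the tuned.conf excerpt, kernel.sched_rt_period_us and kernel.sched_rt_runtime_us, together bound how much of each period realtime tasks may consume. A small Python sketch of that arithmetic follows; `rt_share` and `non_rt_reserve_us` are hypothetical helper names, and a negative runtime is treated as "no throttling", which is how the kernel documentation describes -1.

```python
def rt_share(runtime_us, period_us):
    """Fraction of each period that realtime tasks are allowed to run."""
    if runtime_us < 0:
        # -1 disables RT throttling: realtime tasks may use the whole period.
        return 1.0
    return runtime_us / period_us


def non_rt_reserve_us(runtime_us, period_us):
    """Microseconds per period left over for non-realtime tasks."""
    if runtime_us < 0:
        return 0
    return period_us - runtime_us


if __name__ == "__main__":
    period = 1000000   # kernel.sched_rt_period_us from tuned.conf above
    runtime = 900000   # kernel.sched_rt_runtime_us from tuned.conf above
    print(rt_share(runtime, period))
    print(non_rt_reserve_us(runtime, period))
```

With the values above, realtime tasks (including an X server started under chrt) get at most 90% of every second, so ordinary tasks keep a 100 ms slice; without a limit, a spinning realtime task can starve everything else on that CPU.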
(In reply to Sergey Kondakov from comment #34) > By the way, is there any disadvantage in forcing TearFree to be always on > when it works ? Like additional frame of latency or something like that ? The TearFree option is there to deal with compositors that do not support sync to vblank. The ddx allocates another front buffer and then that buffer is updated synchronized with vblank with the data from the real front buffer. So it uses an additional buffer.
(In reply to Alex Deucher from comment #37) > (In reply to Sergey Kondakov from comment #34) > > By the way, is there any disadvantage in forcing TearFree to be always on > > when it works ? Like additional frame of latency or something like that ? > > The TearFree option is there to deal with compositors that do not support > sync to vblank. The ddx allocates another front buffer and then that buffer > is updated synchronized with vblank with the data from the real front > buffer. So it uses an additional buffer. Thanks ! It's a shame, I've already begun believing in "The Silver Bullet of VSync". And it's completely "software" GPU-agnostic function, so alternatives like Wayland would have to just reimplement it the same way ? It always adds a buffer or "smart-enough" compositor can opt-out ? Or "the correct fix for latency" with TF is disabling vsync everywhere (such as kwin's GLPreferBufferSwap=n) else and let it handle it ? No matter how I previously tried, nothing other than TearFree guaranteed actual lack of tearing in all times in simple 2x1080p configuration but there is abundance of buffering as it is in apps and a compositor + latency of LCD displays. I'm sure, you're aware of https://gitlab.freedesktop.org/xorg/xserver/issues/244 too. Strange that "the magic" of TF isn't done directly in compositors or kernel then.
(In reply to Sergey Kondakov from comment #38) > > Thanks ! It's a shame, I've already begun believing in "The Silver Bullet of > VSync". And it's completely "software" GPU-agnostic function, so > alternatives like Wayland would have to just reimplement it the same way ? > It always adds a buffer or "smart-enough" compositor can opt-out ? Or "the > correct fix for latency" with TF is disabling vsync everywhere (such as > kwin's GLPreferBufferSwap=n) else and let it handle it ? > > No matter how I previously tried, nothing other than TearFree guaranteed > actual lack of tearing in all times in simple 2x1080p configuration but > there is abundance of buffering as it is in apps and a compositor + latency > of LCD displays. I'm sure, you're aware of > https://gitlab.freedesktop.org/xorg/xserver/issues/244 too. Strange that > "the magic" of TF isn't done directly in compositors or kernel then. Here is your issue: "simple 2x1080p". Multiple displays are really hard to deal with. The display timings may be different, the blanking periods may not align, etc. X uses a single surface for each multi-display desktop, so when you are updating multiple displays, if the timings are not aligned, one display will show older content. For this to work smoothly, you really need the compositor to have each display using its own set of buffers and doing vsynced rendering to each display separately.
(In reply to Alex Deucher from comment #39) > (In reply to Sergey Kondakov from comment #38) > Here is your issue: "simple 2x1080p" > > multiple display are really hard to deal with. The display timing may be > different, the blanking periods may not align, etc. X uses a single surface > for each multi-display desktopso when you are updating multiple displays, if > the timings are not aligned, one display will show older content. For this > to work smoothly, you really need the compositor to have each display using > it's own set of buffers and doing vsynced rendering to each display > separately. It's a little bit strange to call 2x1080p on AMD's fancy 5-port GPU (+ possible DP multiplexing) "my issue". If anything is an issue with AMD's modern output controllers, it's the lack of an analogue signal on the DVI port for my proper 1280x1024@89 CRT monitor with its majestic >10k:1 contrast. Timing on the two outputs is definitely different, though. I still cannot fathom how it is that all outputs are still lumped together like that. Anyway, I was searching on my suspicion about kwin's vsync behaviour and stumbled on this treat: https://bugs.kde.org/show_bug.cgi?id=395632#c45 - a new kwin developer is working on that and on multi-threaded per-output vsync _right now_, and wants testers. Surely, this new version of kwin will "blow up" the kernel module with this page-flipping bug ! And he would really benefit from your advice. Then we might not even need TearFree anywhere anymore ! https://phabricator.kde.org/T11071 - quite a lot of progress already. It aims to make double-only per-output mandatory vsync via GLX_OML_sync_control. 
Right now `qdbus-qt5 org.kde.KWin /KWin supportInformation` says: … maxFpsInterval: 16666666 refreshRate: 0 vBlankTime: 6000000 glStrictBinding: false glStrictBindingFollowsDriver: true … Screens ======= Multi-Head: no Active screen follows mouse: no Number of Screens: 2 Screen 0: --------- Name: DVI-D-0 Geometry: 0,0,1920x1080 Scale: 1 Refresh Rate: 72.9249 Screen 1: --------- Name: HDMI-A-0 Geometry: 1920,0,1920x1080 Scale: 1 Refresh Rate: 71.8263 glxgears shows the proper FPS (~72.923) but, judging by that bug, kwin is either mistiming updates or "cutting out" some frames. It would not tear if it let apps render at their own pace and then limited its own output to 60, would it ? And I'm as clueless as those bug reporters on how to check its real rate on the currently released version.
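To put numbers on the figures in that supportInformation output, here is a small Python sketch that converts refresh rates to frame periods and estimates how often the vblanks of the two outputs drift back into alignment. It treats both displays as ideal fixed-rate clocks, which is a simplification; the function names are invented for this example.

```python
def period_ns(refresh_hz):
    """Frame period in nanoseconds for a given refresh rate in Hz."""
    return 1e9 / refresh_hz


def rate_hz(interval_ns):
    """Refresh rate in Hz that corresponds to a frame interval in nanoseconds."""
    return 1e9 / interval_ns


def realign_seconds(rate_a_hz, rate_b_hz):
    """Time for two free-running vblank clocks to drift by one full frame."""
    return 1.0 / abs(rate_a_hz - rate_b_hz)


if __name__ == "__main__":
    dvi, hdmi = 72.9249, 71.8263   # refresh rates reported by kwin above
    print(period_ns(dvi), period_ns(hdmi))
    print(rate_hz(16666666))       # maxFpsInterval reported by kwin above
    print(realign_seconds(dvi, hdmi))
```

Under these assumptions the two outputs differ by roughly 0.2 ms per frame, so their vblanks line up only about once a second, and the reported maxFpsInterval corresponds to about 60 Hz rather than to either display's actual rate.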
(In reply to Sergey Kondakov from comment #40) > > I little bit strange to call 2x1080p on AMD's fancy 5-port GPU (+ possible > DP multiplexing) "my issue". It's a limitation of desktop environments on Linux in general. Other OSes handle this differently, but in general, regardless of OS, it's a hard problem to solve. If you have multiple displays running at different refresh rates how do you update them without tearing and also without non-synchronized content on some of the displays?
(In reply to Alex Deucher from comment #37) > The TearFree option is there to deal with compositors that do not support > sync to vblank. Also for cases where a compositor cannot prevent tearing, in particular with rotation and other transforms via RandR. > The ddx allocates another front buffer and then that buffer > is updated synchronized with vblank with the data from the real front > buffer. So it uses an additional buffer. Right, that's one of the main reasons why TearFree isn't enabled in all cases by default, another one being that it can incur (at least) one refresh cycle output latency. (In reply to Sergey Kondakov from comment #38) > It's a shame, I've already begun believing in "The Silver Bullet of VSync". If TearFree was a silver bullet, it would be enabled by default in all cases. :) (It already is in cases where a compositor cannot prevent tearing, per above) > And it's completely "software" GPU-agnostic function, so alternatives like > Wayland would have to just reimplement it the same way ? More like the other way around actually; I consider TearFree sort of a poor man's Wayland compositor. The latter generally handle this better, or are at least in a better position to handle it. > It always adds a buffer or "smart-enough" compositor can opt-out ? Currently the former. It would be possible to eliminate the additional buffers while a compositor / other fullscreen client is using page flipping, but I never got around to implementing that. > Or "the correct fix for latency" with TF is disabling vsync everywhere (such > as kwin's GLPreferBufferSwap=n) else and let it handle it ? I doubt that'll help for latency, and will waste energy generating frames which are never visible. > Strange that "the magic" of TF isn't done directly in compositors or kernel > then. Compositors can do so, with some exceptions per above, but in cases where they can prevent tearing, they're generally preferable to TearFree. The kernel cannot do this transparently though. 
(In reply to Alex Deucher from comment #39) > multiple display are really hard to deal with. The display timing may be > different, the blanking periods may not align, etc. X uses a single surface > for each multi-display desktopso when you are updating multiple displays, if > the timings are not aligned, one display will show older content. For this > to work smoothly, you really need the compositor to have each display using > it's own set of buffers and doing vsynced rendering to each display > separately. That wouldn't necessarily make much, if any, visible difference though, as the displays can still show inconsistent contents sometimes if their timings aren't synchronized, and tearing within a display can be avoided even with a single scanout buffer. The main benefit of separate scanout buffers is that the application can re-use buffers for rendering new frames earlier, but OTOH there's the overhead cost of compositing (because the client buffers can't be used directly for page flipping like this). This is pretty much how TearFree works, BTW (except that, due to the Xorg architecture, it can't actually allow a compositor / other fullscreen client to re-use buffers earlier when sync-to-vblank is enabled for the client). (In reply to Sergey Kondakov from comment #40) > Anyway, I was searching on my suspicion about kwin's vsync > behaviour and stumbled on this treat: > https://bugs.kde.org/show_bug.cgi?id=395632#c45 - new kwin developer working > on that and multi-threaded per-output vsync _right now_, wants testers. That's not possible with the current Xorg architecture. It only allows flipping all displays (connected to a single GPU) together. The way forward to solve this is Wayland.
(In reply to Nicholas Kazlauskas from comment #24) > This should be fixed with the series linked below: > > https://patchwork.freedesktop.org/series/64505/ > > But it still needs review and backporting to older kernels. That patch series (applied to mainline 5.2.x) appears to fix the hangs on my RX 560 while playing video with vaapi acceleration. It would be great if this could be back-ported.
(In reply to Alex Deucher from comment #41) > (In reply to Sergey Kondakov from comment #40) > > > > > I little bit strange to call 2x1080p on AMD's fancy 5-port GPU (+ possible > > DP multiplexing) "my issue". > > It's a limitation of desktop environments on Linux in general. Other OSes > handle this differently, but in general, regardless of OS, it's a hard > problem to solve. If you have multiple displays running at different > refresh rates how do you update them without tearing and also without > non-synchronized content on some of the displays? I guess, you always would have to have at least 2 (currently_rendering/next_shown and previously_rendered/currently_shown) buffers in an app, 2 per each viewport in compositor, 2 in each of system's video output controllers (what if a viewport shares several outputs or vice versa ?; "crtc" on GPU but I would prefer a tendency to simplification and de-specialization of GPUs by replacing them with separate general-purpose vector processors, rasterization or BVH ASICs, FPGA codec accelerators, output controllers, all with wider faster common system bus and RAM instead of GPU-daughterboard-on-CPU's-MB monstrosities) and 2 in each display's controller (the latter is especially a problem because of slow scalers wanting to do their bad scaling and other in-display transformations while adding unpredictable unknown latency). And then make them all work asynchronously with their own safe timeframes for flipping. (In reply to Michel Dänzer from comment #42) >… > The way forward to solve this is Wayland. Thanks for detailed explanations, stuff like that should be in manuals. As for Wayland, I even managed to launch Wayland LXQt sessions with kwin via sddm where most things work. But something made me postpone transition to it indefinitively, don't remember what exactly. 
But right now at least 2 reasons would be: custom display modes (I want my 72-73 "free" fps on 60 fps almost-trash-level displays, and my CRT has to have its non-standard modes defined manually) and per-display colour correction (with auto-generated and custom profiles). (In reply to Tom Seewald from comment #43) > (In reply to Nicholas Kazlauskas from comment #24) > > This should be fixed with the series linked below: > > > > https://patchwork.freedesktop.org/series/64505/ > > > > But it still needs review and backporting to older kernels. > > That patch series (applied to mainline 5.2.x) appears to fix the hangs on my > RX 560 while playing video with vaapi acceleration. > > It would be great if this could be back-ported. Unfortunately, it didn't fix the page-flip-triggered dereference for me. Do you have page-flip-related errors in the Xorg log with the "modesetting" X driver ? With and without the patches ?
> Unfortunately, it didn't fix the page flip-triggered dereference for me. Do > you have page flip related errors in Xorg log on "modesetting" X driver ? > With and without it ? I don't believe so, glancing over my Xorg.0.log and Xorg.0.log.old I don't see any errors about page flipping. I just use the standard modesetting driver for Xorg.
Will these patches[1] be back ported to 5.2/5.3 or will we need to wait until this hopefully lands in 5.4? [1] https://patchwork.freedesktop.org/series/64505/
I've been getting this crash about once a week. Would be nice if something were done here.
Vega 64.
RX 480. Applied patch, haven't had any spurious crashes since. Using patchset since kernel 5.2.14, now using it on 5.3. Haven't had any suspend/wake crashes yet, either, but that may be unrelated. Will continue applying it to successive 5.3 kernels until it is officially backported, and will report if there are any further crashes.
(In reply to Christopher Snowhill from comment #49) > RX 480. Applied patch, haven't had any spurious crashes since. Using > patchset since kernel 5.2.14, now using it on 5.3. Haven't had any > suspend/wake crashes yet, either, but that may be unrelated. > > Will continue applying it to successive 5.3 kernels until it is officially > backported, and will report if there are any further crashes. I also built 5.3 with these patches, almost just as it came out: https://patchwork.freedesktop.org/series/64505/ https://patchwork.freedesktop.org/series/64614/ https://patchwork.freedesktop.org/series/65192/ No fails on X11's amdgpu so far BUT I've changed both TearFree and vm_update_mode options to defaults (but pci=big_root_window that makes BAR=VRAM is still active), so it may be just worked around and not completely gone, will try vm_update_mode=3 later. Would be nice to have some clue about what vm_* options actually entail for OpenCL, compute-shader and general rendering performance. I just set them for whatever, code in amdgpu_vm.c goes high above my head. Modesetting X11 driver behaves weirdly for me: enabling PageFlip in it still gives me errors and in both cases it just draws the black screen with movable cursor above it instead of sddm greet-screen. But amdgpu works, so, fine.
Patches have been sent to stable and should land soon.
(In reply to Alex Deucher from comment #51) > Patches have been sent to stable and should land soon. Thanks ! However, it seems that not all is well, after all: using vm_update_mode=3 has resulted in an immediate 'RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260' dereference hang before sddm could draw anything, so the second one is not fixed yet. I will use vm_update_mode=0 for now to make sure that the offending code is never touched.
Created attachment 285209 [details] dmesg_2019-09-26-amdgpu-old_dereference_on_patched_5.3.1 After about a day of uptime my patched 5.3.1 hanged during hours-long Youtube video with dereference that is almost identical to the original one: BUG: unable to handle page fault for address: 00000008000001b4 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 2 PID: 396 Comm: kworker/u16:2 Tainted: G W IO 5.3.1-1482.g27a0123-HSF #1 openSUSE Tumbleweed Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS F14e 09/09/2014 Workqueue: events_unbound commit_work RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2ee/0xfd0 [amdgpu] … Call Trace: ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 ? _raw_spin_unlock_irq+0x29/0x50 ? trace_hardirqs_on+0x2c/0xf0 ? _raw_spin_unlock_irq+0x3a/0x50 ? finish_task_switch+0xa3/0x2e0 ? finish_task_switch+0x75/0x2e0 ? __switch_to+0x152/0x4e0 ? __switch_to_asm+0x34/0x70 ? __schedule+0x353/0x900 ? wait_for_completion_timeout+0x31/0x110 ? _raw_spin_unlock_irq+0x29/0x50 ? preempt_count_sub+0x9b/0xd0 ? _raw_spin_unlock_irq+0x3a/0x50 ? wait_for_completion_timeout+0xe9/0x110 ? commit_tail+0x3c/0x70 commit_tail+0x3c/0x70 process_one_work+0x271/0x5b0 worker_thread+0x4a/0x3d0 ? process_one_work+0x5b0/0x5b0 kthread+0x118/0x140 ? kthread_create_worker_on_cpu+0x70/0x70 ret_from_fork+0x27/0x50 … [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:47:crtc-0] hw_done or flip_done timed out Could this be due to these additional patches ? https://patchwork.freedesktop.org/series/64614/ https://patchwork.freedesktop.org/series/65192/ Or the fact that I patched kwin-5.16.5 with https://phabricator.kde.org/T11071 and added KWIN_USE_INTEL_SWAP_EVENT=1 & KWIN_USE_BUFFER_AGE=3, so it works with tighter timings now ? Or any of these ? 
options amdgpu cik_support=1 si_support=1 msi=1 disp_priority=2 dpm=1 runpm=1 sched_policy=1 compute_multipipe=1 vm_fragment_size=9 gartsize=1024 max_num_of_queues_per_device=65536 sched_hw_submission=32 sched_jobs=1024 job_hang_limit=8000 halt_if_hws_hang=1 vm_fault_stop=0 vm_update_mode=0 deep_color=1 gpu_recovery=1 lockup_timeout=2500,5000,8000,1000 ras_enable=1 mcbp=1 queue_preemption_timeout_ms=48 mes=1 hws_gws_support=1 discovery=1
(In reply to Sergey Kondakov from comment #53) > Or any of these ? > options amdgpu cik_support=1 si_support=1 msi=1 disp_priority=2 dpm=1 > runpm=1 sched_policy=1 compute_multipipe=1 vm_fragment_size=9 gartsize=1024 > max_num_of_queues_per_device=65536 sched_hw_submission=32 sched_jobs=1024 > job_hang_limit=8000 halt_if_hws_hang=1 vm_fault_stop=0 vm_update_mode=0 > deep_color=1 gpu_recovery=1 lockup_timeout=2500,5000,8000,1000 ras_enable=1 > mcbp=1 queue_preemption_timeout_ms=48 mes=1 hws_gws_support=1 discovery=1 remove all of those. You should use the defaults unless you are specifically debugging something.
(In reply to Sergey Kondakov from comment #53) > Created attachment 285209 [details] > dmesg_2019-09-26-amdgpu-old_dereference_on_patched_5.3.1 > > After about a day of uptime my patched 5.3.1 hanged during hours-long > Youtube video with dereference that is almost identical to the original one: I don't believe the patches[1] have landed in a stable kernel release yet, at least going by the 5.3.1 change log[2] I don't see any reference to them. [1] https://patchwork.freedesktop.org/series/64505/ [2] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.3.1
(In reply to Alex Deucher from comment #54) > (In reply to Sergey Kondakov from comment #53) > > Or any of these ? > > options amdgpu cik_support=1 si_support=1 msi=1 disp_priority=2 dpm=1 > > runpm=1 sched_policy=1 compute_multipipe=1 vm_fragment_size=9 gartsize=1024 > > max_num_of_queues_per_device=65536 sched_hw_submission=32 sched_jobs=1024 > > job_hang_limit=8000 halt_if_hws_hang=1 vm_fault_stop=0 vm_update_mode=0 > > deep_color=1 gpu_recovery=1 lockup_timeout=2500,5000,8000,1000 ras_enable=1 > > mcbp=1 queue_preemption_timeout_ms=48 mes=1 hws_gws_support=1 discovery=1 > > remove all of those. You should use the defaults unless you are > specifically debugging something. Then you may consider that I am "specifically debugging" THIS. Because when I ask these questions here or on freedesktop.org, I specifically hope for a factual response from people with actual understanding and experience of how it works and of what the proper way to debug would be, without guesswork, based on knowledge that would compensate for the lack of meaningful documentation and one of the highest entry barriers in software (even a corporate monstrosity like Intel still can't figure out GPUs, a market dominated by 2 oligopolists that run it with impunity however they feel like, after all). This third dereference would be really hard to debug, though, because there are no clear reproduction steps, UNLESS you KNOW where and how to look as a developer. Or are you all just going to ignore the presence of kernel-crashing code because it "may" (or may not) not be triggered by your defaults? So, can you actually tell which code path may result in this or, better yet, test it yourself so that things like that just would not go into releases?
The original dereference is triggered by the mere presence of PageFlip, which is on by default, so blindly running developer defaults (you can see what exactly I think about them here: https://bugzilla.kernel.org/show_bug.cgi?id=203703#c9 and c11) didn't help anyone much now, did it? Or can you at least explain what exactly each of these options does, what the desired and undesired consequences may be, and how your consensus about defaults came to be? A short summary (but not as short as modinfo) or links to mailing list discussions, maybe? Because my goals (as they are for any desktop user) are: minimal guaranteed latency (meaning full aggressive preemption, lowest scheduling granularity and strict RT priorities) of audio/video/input/network pipelines under stress load, in that specific order of priority, with working fast fail-over or recovery instead of hangs and reboots. If I were using defaults, I would still be sitting on a 3.3 GHz (instead of 4 GHz + 2.4 GHz for MMU & cache) FX CPU, non-ECC RAM run by the literally retarded AMD FX MMU (you KNOW the one, the laughing stock of 2011-2017 x86 CPUs!) at slow default JEDEC timings, a ~200W (instead of down-clocked and/or under-volted 90-120W) RX580 GPU (that would, no doubt, fry itself at some point like my previous 6870 did) with slow memory timings, sluggish non-patched kwin, 64ms of audio latency (instead of 8-12ms) and a whole bunch of random hangs/drops in audio, video stuttering and input delays/skips due to scheduling priorities that are all over the place by default. So, no, thank you very much, on that. And YOU should NOT be testing exclusively on defaults either.
(In reply to Tom Seewald from comment #55) > (In reply to Sergey Kondakov from comment #53) > > Created attachment 285209 [details] > > dmesg_2019-09-26-amdgpu-old_dereference_on_patched_5.3.1 > > > > After about a day of uptime my patched 5.3.1 hanged during hours-long > > Youtube video with dereference that is almost identical to the original > one: > > I don't believe the patches[1] have landed in a stable kernel release yet, > at least going by the 5.3.1 change log[2] I don't see any reference to them. > > [1] https://patchwork.freedesktop.org/series/64505/ > [2] https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.3.1 They seem to be in queue for 5.3.2: https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/?id=7f2f9d496c3b8809143f1fc14e8cb093cc981d78 BUT those only address #1 (PageFlip) dereference, NOT #2 (when vm_update_mode not 0) and #3 !
Sergey, instead of throwing tantrums, why can't you just do what you are asked? You present an extremely convoluted set of driver config params and demand that we resolve the bug with those parameters in place. This needlessly complicates the failure scenario, which in turn introduces a lot of unknowns. Alex asks you to simplify the settings so that there are fewer unknowns in the system and it's easier for us to try and figure out what goes wrong while we inspect the code. So please, bring the parameters back to default, as this is the most well-tested configuration and gives a baseline, and also please provide addr2line output for 0010:amdgpu_dm_atomic_commit_tail+0x2ee so we can get a better idea of where in the code the NULL ptr happened.
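For anyone unsure how to turn a token like amdgpu_dm_atomic_commit_tail+0x2ee/0xfd0 into a source line, here is a minimal sketch, assuming a kernel build with debug info is at hand. The helper names (`parse_oops_location`, `gdb_list_expr`) are made up for illustration. The script only splits the `symbol+offset/size` token out of a RIP or call-trace line and prints the `list *(symbol+0xoffset)` expression that can be typed into gdb against the module's debug object; it does not run gdb or addr2line itself.

```python
import re

# Matches the "symbol+0xoffset/0xsize" token found in oops RIP and call-trace lines.
_LOCATION = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def parse_oops_location(line):
    """Return (symbol, offset, size) from an oops line, or None if no token is present."""
    match = _LOCATION.search(line)
    if match is None:
        return None
    return (match["symbol"], int(match["offset"], 16), int(match["size"], 16))


def gdb_list_expr(symbol, offset):
    """Build the gdb expression that lists the source around symbol+offset."""
    return f"list *({symbol}+{offset:#x})"


if __name__ == "__main__":
    # RIP line taken from the oops posted in this thread.
    rip = "RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2ee/0xfd0 [amdgpu]"
    symbol, offset, _size = parse_oops_location(rip)
    print(gdb_list_expr(symbol, offset))
```

The printed expression is meant to be pasted into a gdb session opened on the amdgpu module object that matches the running kernel; which file that is depends on how the distribution packages its debug info.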
I encounter this error once a week on average on my Radeon 7 (Vega 20). Great to see you guys actively working on it. When 5.3.2 is released on Arch, I'll keep using it for a week or two and report back whether I encounter the issue again or not. Thanks! @Sergey You could revert to defaults just for the duration of testing/debugging. It'll surely make things easier for developers, and you can still go back to your settings once the issue is fixed. Great settings nonetheless. Do these kernel parameters really improve the power performance of the RX 580, or did you need to do something in addition too? By the way, I used an RX 580 on default Arch Linux settings (so most likely kernel defaults) for a year and it was fine, so you probably don't have to worry about frying it. Now I'm using a Radeon 7, while the RX 580 is still alive in a different Windows-based computer.
(In reply to Andrey Grodzovsky from comment #57) > Sergey, instead of throwing tantrums why can't you just do what you are > asked ? You present an extremely convoluted set of driver config params and > demand from us resolving the bug with those parameters in place. This > introduces unneeded complication of the failure scenario which in turn > introduces a lot of unknowns. Alex asks you to simplify the settings so less > unknows are in the system so it's easier for us to try and figure out what > goes wrong while we inspect the code. > So please, bring the parameters back to default as this is the most well > tested configuration and gives a baseline and also please provide addr2line > for 0010:amdgpu_dm_atomic_commit_tail+0x2ee so we can get a better idea > where in code the NULL ptr happened. And how about instead of knowingly pushing untested code with known fatal errors you stop taking QA notes from FGLRX in the first place and do your own full testing ? You do realize that I, as all others, paid for that card to your employer, right ? And people don't buy your top cards, RX[4-5][7-8]0, VEGAs and so on, to use them as expensive bare output controllers. DO NOT SHOOT THE MESSENGER. What you ignore from me, others will get one way or another, most of which would be incapable to even report it and just resort to cursing you and sell the hardware, going on Nvidia & Intel combo forever instead. Do you have any idea how many times in my life I've heard "at least it's without hassle" spiel about all (yes, all) AMD stuff from "normal people" ? I don't demand from you resolving this personally for me and whatever I might configure. But I do demand you not pushing untested code, hide it under parameters that limit all cards to bare minimum and then use it as an excuse to continue not to test it. And then silently expect me to work as your QA as if I trained on how to debug kernel-level code and telepathically know what might be on your minds. 
What else, should I be expected to whip out a chip programmer and write custom asm code for your mystery chips by myself? I don't have a laboratory or a dedicated debug station. _Regarding this notion of "testing on defaults"_. Maybe I was not clear on that: dereference #3 happened just once, after about a full day of uptime. The machine has sometimes run for more than a week straight without issues. So, defaulting will not show any difference on my end unless I run both configs for no less than 2 weeks of pure uptime each without shutting down the machine. And it would still be useless guesswork that would not produce any more pointers on what exactly goes wrong; at best, it will just repeat or not. However, you, as the developers of that code and trained experts, can use what little data there is to recheck the exact offending code that no one else has a clue about. You can also fully reproduce my configuration (including the exact packages of my kernel with debug-info) and work with full data of your own, since you are not willing to test all your code paths regularly as a rule. I will try to figure out what the hell this "addr2line" is, but it will probably involve installing gigabytes of debug symbols on an SSD that has no space for them, so… it will take a while. By the way, what happened with my answer about #2? You know, the `list *(amdgpu_vm_update_directories+0xe7)` part, which was a real time-consuming pain to get, with: 0x2e127 is in amdgpu_vm_update_directories (../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1191). where line #1191 is: struct amdgpu_bo *bo = parent->base.bo, *pbo; Have you even seen it? Was it the right thing? Any thoughts on the cause for this one? Should I do the same for #3? Will it also go into a void of silence? (In reply to Damian Nowak from comment #58) > I encounter this error once a week on average on my Radeon 7 (Vega 20). > Great on see you guys actively working on it.
When 5.3.2 releases to Arch, > I'll keep using it for a week or two and report back whether I encounter an > issue again or not. Thanks! > > @Sergey You could revert to defaults just for the duration of > testing/debugging. It'll sure make things easier for developers, and you can > still go back to your settings once the issue is fixed. Great settings > nonetheless, do these kernel parameters really improve the power performance > of RX 580, or did you need to do something in addition too? By the way, I > used RX 580 on default Arch Linux settings (so most likely kernel defaults) > for a year and it was fine so you probably don't have to worry about frying > it. Now I'm using Radeon 7, while RX 580 is still alive in a different > Windows-based computer. Ok, I can. But what's next ? How exactly does that would give any more data ? What exactly should I do after booting the machine ? Power ? No, the custom hacked GPU BIOS does. Although, after fiddling with voltages, I just left them on auto-defaults, where driver/firmware uses built-in per-card "chip quality" as multiplier for defaults, and limited frequency to 1300. Power-draw increases exponentially with frequency and after 1300 it increases ridiculous on RX580's 14nm chips. I also made fans never stop and act more aggressively but not to the point of out-noising the case and CPU fans. And I tightened memory timings too. 90-120W are numbers from MSI Afterburner, mostly about 90W and rarely 120W in some specific loads. Pre-RX cards, the whole 2008-2015 generation of AMD GPU chips (and chipsets, for that matter), especially mobile ones, are well known to be self-destructive. And not long ago my 6870 has joined them. Ironically, default firmware settings on commercial GPUs are not safe, at least not on those generations. They are balanced by the manufacturers to barely survive warranty periods. That's why pre-overclocked cards, or any chips, is not a product that anyone should be exited about. 
AMD chips are known as "the stoves" for a reason, but device manufacturers bring them from "inefficient" to "half-dead". The price is good, though. With the software parameters I mainly try to balance latency and CPU time, remove sources of stuttering, do proper prioritization during CPU & I/O contention, and enable features that can be safely enabled, so that when I run my live test/install distro build on unknown hardware, I can test and/or use it fully without redoing and customizing the whole damn thing. But it's more guesswork with the GPU than with everything else. Unfortunately, developers in general are not big fans of the "multi-task desktop user experience" on last-gen ("last" being "older than the one in the laboratory") hardware.
(In reply to Sergey Kondakov from comment #59) > > And how about instead of knowingly pushing untested code with known fatal > errors you stop taking QA notes from FGLRX in the first place and do your > own full testing ? You do realize that I, as all others, paid for that card > to your employer, right ? And people don't buy your top cards, > RX[4-5][7-8]0, VEGAs and so on, to use them as expensive bare output > controllers. This. If this were just a free project with volunteers giving their time, many of us who occasionally throw a tantrum towards AMD wouldn't be. But, some of us are throwing money at AMD to try to have a stable system again, and keep getting regressions introduced that are either fixed very slowly, or not at all. I'm here, because I was running an R9 390, and kernel 4.19 introduced a regression that causes a complete boot failure. Others confirmed the same. See https://bugs.freedesktop.org/show_bug.cgi?id=108781 (As I explain way below, this is still unfixed in 5.3.) On that bug, I'm asked by an amd.com developer to bisect. I run into hundreds, or even a thousand, commits that don't even compile, and only a later commit fixes that issue. Fun, thanks for pushing those, guys. I finally achieve a bisected commit, where 0d9988910989 causes a boot hang and the one previous to it doesn't. Upon being told this shouldn't have to do with the bug I've posted, I do discover that this bug causes a black screen boot hang, but it's a different bug! I then go on to document that I've found between 3 and 5 crashing commits in the new 4.19 commits. So, how am I supposed to bisect this garbage, when a lot doesn't even compile, and there are multiple bugs popping in and out of existence causing the same symptom? Boot crashes with black screen, and I'm supposed to know to mark that commit as good because it's a different bug causing the same issue? 
I ask the AMD devs to tell me exactly which card they use in testing (if any, at all) so I can just buy that and be done with this. No response. So, I pay AMD more money and buy a RX 580, which is mostly a downgrade from the R9 390. Get frequent crashes from that as well. So, I just decide to buy a Vega 64. I don't need the extra power, I just want to run a stable machine. Since AMD devs aren't saying what card I could use that they do, in a hope that they might fix crashes before they push them, I figure the latest and greatest might be getting more attention. All goes well until this regression is introduced. I go back to try my R9 390, and guess what? The same bug introduced in kernel 4.19 is still there in 5.3! AMD's just ignored it, and hasn't bothered to try to reproduce it themselves and try to untangle the mess of spaghetti. Since running a custom kernel with the patchset, I haven't had this crash, but come on guys! Couldn't AMD have a bank of 50 computers running different cards, constantly running the latest unpushed code and going through different stress tests? Hey, Jim, monitor #14 and #36 keep crashing, let's look into it.....
It might look like I'm just ranting. That's not the reason I posted my comment. I'm trying to give feedback to AMD about how bad so many customer experiences are right now, and have been for quite some time, and how there should be easy and affordable (for AMD) ways to make it better.
Kernel: 5.3.8 OS: Arch Linux x86_64 I was able to eliminate the crash mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=204181#c27 and https://bugzilla.kernel.org/show_bug.cgi?id=204181#c34 by removing "amdgpu.vm_update_mode=3" from the kernel parameters. This, however, reintroduced https://bugs.freedesktop.org/show_bug.cgi?id=102322 as mentioned on https://wiki.archlinux.org/index.php/AMDGPU#Freezes_with_%22[drm]_IP_block:gmc_v8_0_is_hung!%22_kernel_error. "BUG: kernel NULL pointer dereference, address: 0000000000000008" seems to happen most frequently while browsing the internet using icecat.
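To double-check which amdgpu options are actually present on the kernel command line before and after a change like this, a small sketch follows. `amdgpu_cmdline_params` is an illustrative helper, not part of any tool, and it only looks at `amdgpu.name=value` tokens as they would appear in /proc/cmdline; options set through a modprobe.d `options amdgpu ...` line are not covered.

```python
def amdgpu_cmdline_params(cmdline):
    """Collect amdgpu.<name>=<value> tokens from a kernel command line string."""
    prefix = "amdgpu."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params


if __name__ == "__main__":
    # Example command line; on a live system the string would come from /proc/cmdline.
    example = "root=/dev/sda1 rw amdgpu.vm_update_mode=3 quiet"
    print(amdgpu_cmdline_params(example))
```

Comparing the resulting dictionaries from the two boots makes it easy to confirm that only the intended parameter changed.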
Does the crash in comment #0 actually happen in dc_stream_log?
I encountered this crash twice today on a slightly modified Arch 5.7.2 kernel. Attaching a crash log.
Oh, there is no file attach here. So I'll just paste the whole thing in this response: Jun 18 19:53:40 mrgency kernel: general protection fault, probably for non-canonical address 0x486df9363c7dd76e: 0000 [#1] SMP NOPTI Jun 18 19:53:40 mrgency kernel: CPU: 7 PID: 15075 Comm: kworker/u64:1 Not tainted 5.7.2-6-tkg-pds #1 Jun 18 19:53:40 mrgency kernel: Hardware name: Micro-Star International Co., Ltd MS-7C02/B450 TOMAHAWK (MS-7C02), BIOS 1.D0 11/07/2019 Jun 18 19:53:40 mrgency kernel: Workqueue: events_unbound commit_work [drm_kms_helper] Jun 18 19:53:40 mrgency kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x24c/0x2040 [amdgpu] Jun 18 19:53:40 mrgency kernel: Code: 8b 4f 08 8b 81 e0 02 00 00 41 ff c5 44 39 e8 0f 87 4d ff ff ff 48 83 bd 60 fd ff ff 00 0f 84 01 01 00 00 48 8b bd 60 fd ff ff <80> bf b0 01 00 00 01 0f 86 aa 00 00 00 31 c0 48 b9 00 00 00 00 01 Jun 18 19:53:40 mrgency kernel: RSP: 0018:ffffaa9109057b70 EFLAGS: 00010202 Jun 18 19:53:40 mrgency kernel: RAX: 0000000000000006 RBX: ffff916786f2c800 RCX: ffff916a4c049800 Jun 18 19:53:40 mrgency kernel: RDX: ffff916a4c0ce800 RSI: ffffffffc14dd198 RDI: 486df9363c7dd76e Jun 18 19:53:40 mrgency kernel: RBP: ffffaa9109057e60 R08: 0000000000000001 R09: 0000000000000001 Jun 18 19:53:40 mrgency kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9169ce131800 Jun 18 19:53:40 mrgency kernel: R13: 0000000000000006 R14: 0000000000000000 R15: ffff91680248b780 Jun 18 19:53:40 mrgency kernel: FS: 0000000000000000(0000) GS:ffff916a4e9c0000(0000) knlGS:0000000000000000 Jun 18 19:53:40 mrgency kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jun 18 19:53:40 mrgency kernel: CR2: 000010a7e8949000 CR3: 000000028b26c000 CR4: 00000000003406e0 Jun 18 19:53:40 mrgency kernel: Call Trace: Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x40/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x34/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x40/0x70 Jun 18 19:53:40 mrgency kernel: ? 
__switch_to_asm+0x34/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x40/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x34/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x40/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x34/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x40/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x34/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x40/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x34/0x70 Jun 18 19:53:40 mrgency kernel: ? take_other_rq_task+0x9d/0x3e0 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x40/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x34/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x40/0x70 Jun 18 19:53:40 mrgency kernel: ? __switch_to_asm+0x34/0x70 Jun 18 19:53:40 mrgency kernel: ? timerqueue_add+0x65/0xb0 Jun 18 19:53:40 mrgency kernel: ? enqueue_hrtimer+0x3c/0x90 Jun 18 19:53:40 mrgency kernel: ? hrtimer_start_range_ns+0x1a2/0x2f0 Jun 18 19:53:40 mrgency kernel: ? __schedule+0x202/0x9d0 Jun 18 19:53:40 mrgency kernel: ? psi_task_change+0x84/0xc0 Jun 18 19:53:40 mrgency kernel: ? usleep_range+0x80/0x80 Jun 18 19:53:40 mrgency kernel: ? _cond_resched+0x16/0x40 Jun 18 19:53:40 mrgency kernel: ? __wait_for_common+0x3b/0x160 Jun 18 19:53:40 mrgency kernel: commit_tail+0x92/0x120 [drm_kms_helper] Jun 18 19:53:40 mrgency kernel: process_one_work+0x1e6/0x3b0 Jun 18 19:53:40 mrgency kernel: worker_thread+0x50/0x410 Jun 18 19:53:40 mrgency kernel: ? process_one_work+0x3b0/0x3b0 Jun 18 19:53:40 mrgency kernel: kthread+0x122/0x140 Jun 18 19:53:40 mrgency kernel: ? 
__kthread_bind_mask+0x60/0x60 Jun 18 19:53:40 mrgency kernel: ret_from_fork+0x22/0x40 Jun 18 19:53:40 mrgency kernel: Modules linked in: fuse rfcomm tun uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev cmac algif_hash algif_skcipher af_alg bnep btusb btrtl btbcm btintel bluetooth ecdh_generic rfkill ecc crc16 hid_steam snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device mc mousedev joydev input_leds nls_iso8859_1 nls_cp437 vfat amdgpu hid_generic wmi_bmof fat edac_mce_amd dm_mod snd_hda_codec_realtek kvm_amd snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio kvm snd_hda_intel gpu_sched snd_intel_dspcfg i2c_algo_bit irqbypass snd_hda_codec ttm crct10dif_pclmul crc32_pclmul snd_hda_core ghash_clmulni_intel drm_kms_helper snd_hwdep snd_pcm usbhid hid cec aesni_intel r8169 snd_timer rc_core sp5100_tco snd crypto_simd syscopyarea realtek sysfillrect cryptd glue_helper sysimgblt pcspkr ccp libphy i2c_piix4 k10temp soundcore fb_sys_fops tpm_crb wmi tpm_tis tpm_tis_core tpm pinctrl_amd rng_core gpio_amdpt evdev mac_hid acpi_cpufreq drm sg crypto_user Jun 18 19:53:40 mrgency kernel: agpgart ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq crc32c_intel xhci_pci sr_mod cdrom xhci_hcd Jun 18 19:53:40 mrgency kernel: ---[ end trace 28969089457f0e4d ]--- Jun 18 19:53:40 mrgency kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x24c/0x2040 [amdgpu] Jun 18 19:53:40 mrgency kernel: Code: 8b 4f 08 8b 81 e0 02 00 00 41 ff c5 44 39 e8 0f 87 4d ff ff ff 48 83 bd 60 fd ff ff 00 0f 84 01 01 00 00 48 8b bd 60 fd ff ff <80> bf b0 01 00 00 01 0f 86 aa 00 00 00 31 c0 48 b9 00 00 00 00 01 Jun 18 19:53:40 mrgency kernel: RSP: 0018:ffffaa9109057b70 EFLAGS: 00010202 Jun 18 19:53:40 mrgency kernel: RAX: 0000000000000006 RBX: ffff916786f2c800 RCX: ffff916a4c049800 Jun 18 19:53:40 mrgency kernel: RDX: ffff916a4c0ce800 RSI: ffffffffc14dd198 RDI: 486df9363c7dd76e Jun 18 19:53:40 mrgency kernel: RBP: ffffaa9109057e60 R08: 
0000000000000001 R09: 0000000000000001 Jun 18 19:53:40 mrgency kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff9169ce131800 Jun 18 19:53:40 mrgency kernel: R13: 0000000000000006 R14: 0000000000000000 R15: ffff91680248b780 Jun 18 19:53:40 mrgency kernel: FS: 0000000000000000(0000) GS:ffff916a4e9c0000(0000) knlGS:0000000000000000 Jun 18 19:53:40 mrgency kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jun 18 19:53:40 mrgency kernel: CR2: 000010a7e8949000 CR3: 000000028b26c000 CR4: 00000000003406e0
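The "probably for non-canonical address" wording in the log above can be checked by hand. A minimal sketch, assuming 4-level paging (48-bit virtual addresses), where an address is canonical only if bits 63 down to 47 are all equal; `is_canonical` is an illustrative helper, not a kernel API. In this log RDI holds 486df9363c7dd76e, and the marked instruction bytes `<80> bf b0 01 00 00 01` decode to a byte compare at offset 0x1b0 from RDI, so the access goes through a garbage pointer rather than a NULL one.

```python
def is_canonical(addr, va_bits=48):
    """True if a 64-bit address is canonical for the given virtual address width."""
    # Bits 63..(va_bits - 1) must be all zeros or all ones.
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


if __name__ == "__main__":
    rdi = 0x486DF9363C7DD76E          # RDI from the general protection fault above
    print(is_canonical(rdi + 0x1B0))  # address the faulting instruction touches
    print(is_canonical(0x00000008000001B4))  # fault address from the 5.3.1 report
```

A canonical but unmapped address, such as the 00000008000001b4 reported on 5.3.1, produces a page fault instead, which is consistent with the different headline in that oops.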
Created attachment 290589 [details] drm/amd/display: Clear dm_state for fast updates Alright, the bug patch I mentioned in the last comment seems to be good after a few hours of testing. Please try out this patch and see if it fixes the issue for the rest of you. In the meantime, I'm doing more extended tests on this patch to confirm it works well enough before posting it on LKML. Nicholas, I haven't tested your commit since I was too busy with this. I'll try it out if this one fails though. Also, can you please review this patch to confirm that I'm not doing anything wrong here?
(In reply to mnrzk from comment #66) > Created attachment 290589 [details] > drm/amd/display: Clear dm_state for fast updates > > Alright, the bug patch I mentioned in the last comment seems to be good > after a few hours of testing. > > Please try out this patch and see if it fixes the issue for the rest of > you. > > In the meantime, I'm doing more extended tests on this patch to confirm it > works well enough before posting it on LKML. > > Nicholas, I haven't tested your commit since I was too busy with this. I'll > try it out if this one fails though. > > Also, can you please review this patch to confirm that I'm not doing > anything wrong here? Oh my god, I just responded to the wrong thread by accident, so sorry.
Hello, I get similar system freeze and I know exactly how to reproduce it (on my machine): just visit https://www.unrealengine.com/ with Firefox 106.0.3 and you get the freeze. It happens also in others websites but more randomly. If you need more info I can give you, many thanks. Info about my graphic card: 07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller]) Subsystem: PC Partner Limited / Sapphire Technology Device 3001 uname -a Linux arch-tower 6.0.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Sat, 29 Oct 2022 14:08:39 +0000 x86_64 GNU/Linux The log from journalctl: Nov 09 19:30:29 arch-tower kernel: BUG: kernel NULL pointer dereference, address: 0000000000000020 Nov 09 19:30:29 arch-tower kernel: #PF: supervisor read access in kernel mode Nov 09 19:30:29 arch-tower kernel: #PF: error_code(0x0000) - not-present page Nov 09 19:30:29 arch-tower kernel: PGD 0 P4D 0 Nov 09 19:30:29 arch-tower kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI Nov 09 19:30:29 arch-tower kernel: CPU: 7 PID: 220976 Comm: firefox:cs0 Not tainted 6.0.6-arch1-1 #1 a46cc4b882cfc11c3bbb09d6a0fab3dcad53b5c2 Nov 09 19:30:29 arch-tower kernel: Hardware name: System manufacturer System Product Name/PRIME A320M-K, BIOS 5207 08/30/2019 Nov 09 19:30:29 arch-tower kernel: RIP: 0010:amdgpu_sa_bo_free+0x57/0x150 [amdgpu] Nov 09 19:30:29 arch-tower kernel: Code: 00 00 4c 8b 60 20 48 89 d5 4c 89 e7 e8 22 fd 4b c3 48 85 ed 0f 84 a4 00 00 00 48 8b 45 30 a8 01 0f 85 98 00 00 00 48 8b 45 08 <48> 8b 40 20 48 85 c0 74 0c 48 89 ef e8 48 1e 6c c3 84 c0 75 77 4c Nov 09 19:30:29 arch-tower kernel: RSP: 0018:ffffb2c98cedfa70 EFLAGS: 00010246 Nov 09 19:30:29 arch-tower kernel: RAX: 0000000000000000 RBX: ffff948784158e30 RCX: 0000000080800078 Nov 09 19:30:29 arch-tower kernel: RDX: 0000000000000001 RSI: ffff948784158e30 RDI: ffff94878b1e62f0 Nov 09 19:30:29 arch-tower kernel: RBP: ffff948784158d98 R08: 0000000000000000 R09: 0000000080800078 
Nov 09 19:30:29 arch-tower kernel: R10: 0000000000000008 R11: 0000000010000000 R12: ffff94878b1e62f0 Nov 09 19:30:29 arch-tower kernel: R13: ffff94878b1e9628 R14: 00000000fffffff4 R15: 0000000000000001 Nov 09 19:30:29 arch-tower kernel: FS: 00007f494b1ff6c0(0000) GS:ffff9488969c0000(0000) knlGS:0000000000000000 Nov 09 19:30:29 arch-tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Nov 09 19:30:29 arch-tower kernel: CR2: 0000000000000020 CR3: 0000000154e50000 CR4: 00000000003506e0 Nov 09 19:30:29 arch-tower kernel: Call Trace: Nov 09 19:30:29 arch-tower kernel: <TASK> Nov 09 19:30:29 arch-tower kernel: amdgpu_job_free+0x55/0xe0 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2] Nov 09 19:30:29 arch-tower kernel: amdgpu_cs_ioctl+0x506/0x1f30 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2] Nov 09 19:30:29 arch-tower kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2] Nov 09 19:30:29 arch-tower kernel: drm_ioctl_kernel+0xcd/0x170 Nov 09 19:30:29 arch-tower kernel: drm_ioctl+0x231/0x410 Nov 09 19:30:29 arch-tower kernel: ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2] Nov 09 19:30:29 arch-tower kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2] Nov 09 19:30:29 arch-tower kernel: __x64_sys_ioctl+0x94/0xd0 Nov 09 19:30:29 arch-tower kernel: do_syscall_64+0x5f/0x90 Nov 09 19:30:29 arch-tower kernel: ? do_futex+0xde/0x1b0 Nov 09 19:30:29 arch-tower kernel: ? __x64_sys_futex+0x92/0x1d0 Nov 09 19:30:29 arch-tower kernel: ? syscall_exit_to_user_mode+0x1b/0x40 Nov 09 19:30:29 arch-tower kernel: ? do_syscall_64+0x6b/0x90 Nov 09 19:30:29 arch-tower kernel: ? do_syscall_64+0x6b/0x90 Nov 09 19:30:29 arch-tower kernel: ? syscall_exit_to_user_mode+0x1b/0x40 Nov 09 19:30:29 arch-tower kernel: ? do_syscall_64+0x6b/0x90 Nov 09 19:30:29 arch-tower kernel: ? do_syscall_64+0x6b/0x90 Nov 09 19:30:29 arch-tower kernel: ? 
syscall_exit_to_user_mode+0x1b/0x40
Nov 09 19:30:29 arch-tower kernel: ? do_syscall_64+0x6b/0x90
Nov 09 19:30:29 arch-tower kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Nov 09 19:30:29 arch-tower kernel: RIP: 0033:0x7f494b515c0f
Nov 09 19:30:29 arch-tower kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Nov 09 19:30:29 arch-tower kernel: RSP: 002b:00007f494b1fe9c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 09 19:30:29 arch-tower kernel: RAX: ffffffffffffffda RBX: 00007f494b1feb38 RCX: 00007f494b515c0f
Nov 09 19:30:29 arch-tower kernel: RDX: 00007f494b1fea80 RSI: 00000000c0186444 RDI: 0000000000000018
Nov 09 19:30:29 arch-tower kernel: RBP: 00007f494b1fea80 R08: 00007f494b1feb80 R09: 00007f494b1fea60
Nov 09 19:30:29 arch-tower kernel: R10: 00007f491f8cbd00 R11: 0000000000000246 R12: 00000000c0186444
Nov 09 19:30:29 arch-tower kernel: R13: 0000000000000018 R14: 00007f494b1feb38 R15: 0000000000000002
Nov 09 19:30:29 arch-tower kernel: </TASK>
Nov 09 19:30:29 arch-tower kernel: Modules linked in: iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter uas usb_storage ccm dm_crypt cbc encrypted_keys trusted asn1_encoder tee tpm nls_iso8859_1 vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd eeepc_wmi snd_hda_codec_realtek asus_wmi kvm_amd snd_hda_codec_generic sparse_keymap platform_profile ledtrig_audio snd_hda_codec_hdmi video wmi_bmof kvm snd_hda_intel snd_intel_dspcfg irqbypass snd_intel_sdw_acpi mt7601u crct10dif_pclmul snd_hda_codec crc32_pclmul polyval_clmulni snd_hda_core polyval_generic mac80211 snd_hwdep gf128mul r8169 ghash_clmulni_intel snd_pcm realtek aesni_intel mdio_devres snd_timer crypto_simd ccp cryptd mousedev libarc4 sp5100_tco snd joydev rapl libphy pcspkr k10temp soundcore i2c_piix4 rng_core gpio_amdpt mac_hid cfg80211 gpio_generic wmi acpi_cpufreq rfkill dm_multipath dm_mod crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid sr_mod
Nov 09 19:30:29 arch-tower kernel: crc32c_intel xhci_pci cdrom xhci_pci_renesas amdgpu drm_ttm_helper ttm gpu_sched drm_buddy drm_display_helper cec
Nov 09 19:30:29 arch-tower kernel: CR2: 0000000000000020
Nov 09 19:30:29 arch-tower kernel: ---[ end trace 0000000000000000 ]---
Nov 09 19:30:29 arch-tower kernel: RIP: 0010:amdgpu_sa_bo_free+0x57/0x150 [amdgpu]
Nov 09 19:30:29 arch-tower kernel: Code: 00 00 4c 8b 60 20 48 89 d5 4c 89 e7 e8 22 fd 4b c3 48 85 ed 0f 84 a4 00 00 00 48 8b 45 30 a8 01 0f 85 98 00 00 00 48 8b 45 08 <48> 8b 40 20 48 85 c0 74 0c 48 89 ef e8 48 1e 6c c3 84 c0 75 77 4c
Nov 09 19:30:29 arch-tower kernel: RSP: 0018:ffffb2c98cedfa70 EFLAGS: 00010246
Nov 09 19:30:29 arch-tower kernel: RAX: 0000000000000000 RBX: ffff948784158e30 RCX: 0000000080800078
Nov 09 19:30:29 arch-tower kernel: RDX: 0000000000000001 RSI: ffff948784158e30 RDI: ffff94878b1e62f0
Nov 09 19:30:29 arch-tower kernel: RBP: ffff948784158d98 R08: 0000000000000000 R09: 0000000080800078
Nov 09 19:30:29 arch-tower kernel: R10: 0000000000000008 R11: 0000000010000000 R12: ffff94878b1e62f0
Nov 09 19:30:29 arch-tower kernel: R13: ffff94878b1e9628 R14: 00000000fffffff4 R15: 0000000000000001
Nov 09 19:30:29 arch-tower kernel: FS: 00007f494b1ff6c0(0000) GS:ffff9488969c0000(0000) knlGS:0000000000000000
Nov 09 19:30:29 arch-tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 09 19:30:29 arch-tower kernel: CR2: 0000000000000020 CR3: 0000000154e50000 CR4: 00000000003506e0
Nov 09 19:30:29 arch-tower kernel: note: firefox:cs0[220976] exited with preempt_count 1
Nov 09 19:30:54 arch-tower kernel: watchdog: BUG: soft lockup - CPU#8 stuck for 27s! [Renderer:202012]
Nov 09 19:30:54 arch-tower kernel: Modules linked in: iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter uas usb_storage ccm dm_crypt cbc encrypted_keys trusted asn1_encoder tee tpm nls_iso8859_1 vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd eeepc_wmi snd_hda_codec_realtek asus_wmi kvm_amd snd_hda_codec_generic sparse_keymap platform_profile ledtrig_audio snd_hda_codec_hdmi video wmi_bmof kvm snd_hda_intel snd_intel_dspcfg irqbypass snd_intel_sdw_acpi mt7601u crct10dif_pclmul snd_hda_codec crc32_pclmul polyval_clmulni snd_hda_core polyval_generic mac80211 snd_hwdep gf128mul r8169 ghash_clmulni_intel snd_pcm realtek aesni_intel mdio_devres snd_timer crypto_simd ccp cryptd mousedev libarc4 sp5100_tco snd joydev rapl libphy pcspkr k10temp soundcore i2c_piix4 rng_core gpio_amdpt mac_hid cfg80211 gpio_generic wmi acpi_cpufreq rfkill dm_multipath dm_mod crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid sr_mod
Nov 09 19:30:54 arch-tower kernel: crc32c_intel xhci_pci cdrom xhci_pci_renesas amdgpu drm_ttm_helper ttm gpu_sched drm_buddy drm_display_helper cec
Nov 09 19:30:54 arch-tower kernel: CPU: 8 PID: 202012 Comm: Renderer Tainted: G D 6.0.6-arch1-1 #1 a46cc4b882cfc11c3bbb09d6a0fab3dcad53b5c2
Nov 09 19:30:54 arch-tower kernel: Hardware name: System manufacturer System Product Name/PRIME A320M-K, BIOS 5207 08/30/2019
Nov 09 19:30:54 arch-tower kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x21f/0x2d0
Nov 09 19:30:54 arch-tower kernel: Code: 41 8d 4d 01 41 c1 e4 10 c1 e1 12 44 09 e1 89 c8 c1 e8 10 66 87 43 02 89 c2 c1 e2 10 81 fa ff ff 00 00 77 3b 31 d2 eb 02 f3 90 <8b> 03 66 85 c0 75 f7 89 c6 66 31 f6 39 f1 0f 84 87 00 00 00 c6 03
Nov 09 19:30:54 arch-tower kernel: RSP: 0000:ffffb2c9a003f7e0 EFLAGS: 00000202
Nov 09 19:30:54 arch-tower kernel: RAX: 0000000000240101 RBX: ffff94878b1e62f0 RCX: 0000000000240000
Nov 09 19:30:54 arch-tower kernel: RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff94878b1e62f0
Nov 09 19:30:54 arch-tower kernel: RBP: ffff948896a33b80 R08: ffffb2c9a003f7c8 R09: 0000000000000040
Nov 09 19:30:54 arch-tower kernel: R10: 0000000000200000 R11: ffffb2c9a003fb80 R12: 0000000000000000
Nov 09 19:30:54 arch-tower kernel: R13: 0000000000000008 R14: ffff94878b1e6518 R15: ffff94878b1e62f0
Nov 09 19:30:54 arch-tower kernel: FS: 00007efd265fc6c0(0000) GS:ffff948896a00000(0000) knlGS:0000000000000000
Nov 09 19:30:54 arch-tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 09 19:30:54 arch-tower kernel: CR2: 00007efd04b68400 CR3: 0000000138d60000 CR4: 00000000003506e0
Nov 09 19:30:54 arch-tower kernel: Call Trace:
Nov 09 19:30:54 arch-tower kernel: <TASK>
Nov 09 19:30:54 arch-tower kernel: _raw_spin_lock+0x29/0x30
Nov 09 19:30:54 arch-tower kernel: amdgpu_sa_bo_new+0xd5/0x560 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: ? update_sd_lb_stats.constprop.0+0x10f/0x910
Nov 09 19:30:54 arch-tower kernel: ? select_task_rq_fair+0x161/0x1a60
Nov 09 19:30:54 arch-tower kernel: amdgpu_ib_get+0x43/0x90 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_job_alloc_with_ib+0x5b/0x80 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_copy_buffer+0xc2/0x230 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_ttm_copy_mem_to_mem+0x396/0x770 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_bo_move+0x151/0x6d0 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: ttm_bo_handle_move_mem+0xa8/0x170 [ttm 3393e9853c224a250513194a7cd094617e0e2b51]
Nov 09 19:30:54 arch-tower kernel: ttm_bo_validate+0x10c/0x160 [ttm 3393e9853c224a250513194a7cd094617e0e2b51]
Nov 09 19:30:54 arch-tower kernel: amdgpu_bo_fault_reserve_notify+0xbf/0x150 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_gem_fault+0x89/0x100 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: __do_fault+0x36/0x110
Nov 09 19:30:54 arch-tower kernel: do_fault+0x2a2/0x420
Nov 09 19:30:54 arch-tower kernel: __handle_mm_fault+0x668/0xf70
Nov 09 19:30:54 arch-tower kernel: handle_mm_fault+0xb2/0x290
Nov 09 19:30:54 arch-tower kernel: do_user_addr_fault+0x1be/0x6a0
Nov 09 19:30:54 arch-tower kernel: exc_page_fault+0x74/0x170
Nov 09 19:30:54 arch-tower kernel: asm_exc_page_fault+0x26/0x30
Nov 09 19:30:54 arch-tower kernel: RIP: 0033:0x7efd4a16c7d5
Nov 09 19:30:54 arch-tower kernel: Code: fc ff 0f 1f 00 f3 0f 1e fa 48 89 f8 48 83 fa 20 0f 82 af 00 00 00 c5 fe 6f 06 48 83 fa 40 0f 87 3e 01 00 00 c5 fe 6f 4c 16 e0 <c5> fe 7f 07 c5 fe 7f 4c 17 e0 c5 f8 77 c3 66 66 2e 0f 1f 84 00 00
Nov 09 19:30:54 arch-tower kernel: RSP: 002b:00007efd265f9698 EFLAGS: 00010246
Nov 09 19:30:54 arch-tower kernel: RAX: 00007efd04b68400 RBX: 00007efd21d36908 RCX: 00000000ffffffc0
Nov 09 19:30:54 arch-tower kernel: RDX: 0000000000000040 RSI: 00007efd3ab85c00 RDI: 00007efd04b68400
Nov 09 19:30:54 arch-tower kernel: RBP: 00007efd21d35000 R08: 0000000000040000 R09: 00007efd21d36918
Nov 09 19:30:54 arch-tower kernel: R10: 00007efd16c47c00 R11: 00007efd2396f000 R12: 0000000000000040
Nov 09 19:30:54 arch-tower kernel: R13: 0000000000000400 R14: 0000000000000000 R15: 00007efd21d35000
Nov 09 19:30:54 arch-tower kernel: </TASK>
Nov 09 19:30:54 arch-tower kernel: watchdog: BUG: soft lockup - CPU#11 stuck for 27s! [MediaPD~oder #1:283921]
Nov 09 19:30:54 arch-tower kernel: Modules linked in: iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter uas usb_storage ccm dm_crypt cbc encrypted_keys trusted asn1_encoder tee tpm nls_iso8859_1 vfat fat intel_rapl_msr intel_rapl_common edac_mce_amd eeepc_wmi snd_hda_codec_realtek asus_wmi kvm_amd snd_hda_codec_generic sparse_keymap platform_profile ledtrig_audio snd_hda_codec_hdmi video wmi_bmof kvm snd_hda_intel snd_intel_dspcfg irqbypass snd_intel_sdw_acpi mt7601u crct10dif_pclmul snd_hda_codec crc32_pclmul polyval_clmulni snd_hda_core polyval_generic mac80211 snd_hwdep gf128mul r8169 ghash_clmulni_intel snd_pcm realtek aesni_intel mdio_devres snd_timer crypto_simd ccp cryptd mousedev libarc4 sp5100_tco snd joydev rapl libphy pcspkr k10temp soundcore i2c_piix4 rng_core gpio_amdpt mac_hid cfg80211 gpio_generic wmi acpi_cpufreq rfkill dm_multipath dm_mod crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid sr_mod
Nov 09 19:30:54 arch-tower kernel: crc32c_intel xhci_pci cdrom xhci_pci_renesas amdgpu drm_ttm_helper ttm gpu_sched drm_buddy drm_display_helper cec
Nov 09 19:30:54 arch-tower kernel: CPU: 11 PID: 283921 Comm: MediaPD~oder #1 Tainted: G D L 6.0.6-arch1-1 #1 a46cc4b882cfc11c3bbb09d6a0fab3dcad53b5c2
Nov 09 19:30:54 arch-tower kernel: Hardware name: System manufacturer System Product Name/PRIME A320M-K, BIOS 5207 08/30/2019
Nov 09 19:30:54 arch-tower kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x6d/0x2d0
Nov 09 19:30:54 arch-tower kernel: Code: 00 77 7d f0 0f ba 2b 08 0f 92 c2 8b 03 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 77 59 85 c0 74 0e 8b 03 84 c0 74 08 f3 90 <8b> 03 84 c0 75 f8 b8 01 00 00 00 66 89 03 65 48 ff 05 c5 24 11 7d
Nov 09 19:30:54 arch-tower kernel: RSP: 0018:ffffb2c9832c37c8 EFLAGS: 00000202
Nov 09 19:30:54 arch-tower kernel: RAX: 0000000000240101 RBX: ffff94878b1e62f0 RCX: 0000000000000001
Nov 09 19:30:54 arch-tower kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff94878b1e62f0
Nov 09 19:30:54 arch-tower kernel: RBP: ffff94879ac18a30 R08: ffffb2c9832c37b0 R09: 0000000000000040
Nov 09 19:30:54 arch-tower kernel: R10: 0000000000000000 R11: ffff9488238cb9d8 R12: ffffb2c9832c38d8
Nov 09 19:30:54 arch-tower kernel: R13: 0000000000000100 R14: ffff94878b1e6518 R15: ffff94878b1e62f0
Nov 09 19:30:54 arch-tower kernel: FS: 00007f49237fa6c0(0000) GS:ffff948896ac0000(0000) knlGS:0000000000000000
Nov 09 19:30:54 arch-tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 09 19:30:54 arch-tower kernel: CR2: 00007efd142bbfe0 CR3: 0000000154e50000 CR4: 00000000003506e0
Nov 09 19:30:54 arch-tower kernel: Call Trace:
Nov 09 19:30:54 arch-tower kernel: <TASK>
Nov 09 19:30:54 arch-tower kernel: _raw_spin_lock+0x29/0x30
Nov 09 19:30:54 arch-tower kernel: amdgpu_sa_bo_new+0xd5/0x560 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: ? __switch_to_asm+0x3e/0x60
Nov 09 19:30:54 arch-tower kernel: ? finish_task_switch.isra.0+0x90/0x2d0
Nov 09 19:30:54 arch-tower kernel: ? __schedule+0x34b/0x11c0
Nov 09 19:30:54 arch-tower kernel: ? update_sd_lb_stats.constprop.0+0x10f/0x910
Nov 09 19:30:54 arch-tower kernel: amdgpu_ib_get+0x43/0x90 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_job_alloc_with_ib+0x5b/0x80 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: ? kmem_cache_alloc_trace+0x15d/0x320
Nov 09 19:30:54 arch-tower kernel: amdgpu_vm_sdma_prepare+0x2b/0x70 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_vm_update_range+0x1c0/0x770 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_vm_bo_update+0x300/0x5a0 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_gem_va_ioctl+0x54f/0x590 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: ? amdgpu_gem_va_map_flags+0x80/0x80 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: drm_ioctl_kernel+0xcd/0x170
Nov 09 19:30:54 arch-tower kernel: drm_ioctl+0x231/0x410
Nov 09 19:30:54 arch-tower kernel: ? amdgpu_gem_va_map_flags+0x80/0x80 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu 3b0071ba2e7e576c543138f03ed9b8249042cca2]
Nov 09 19:30:54 arch-tower kernel: __x64_sys_ioctl+0x94/0xd0
Nov 09 19:30:54 arch-tower kernel: do_syscall_64+0x5f/0x90
Nov 09 19:30:54 arch-tower kernel: ? syscall_exit_to_user_mode+0x1b/0x40
Nov 09 19:30:54 arch-tower kernel: ? do_syscall_64+0x6b/0x90
Nov 09 19:30:54 arch-tower kernel: ? exc_page_fault+0x74/0x170
Nov 09 19:30:54 arch-tower kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Nov 09 19:30:54 arch-tower kernel: RIP: 0033:0x7f494b515c0f
Nov 09 19:30:54 arch-tower kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Nov 09 19:30:54 arch-tower kernel: RSP: 002b:00007f49237f7f70 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 09 19:30:54 arch-tower kernel: RAX: ffffffffffffffda RBX: 00007f491e7f23c0 RCX: 00007f494b515c0f
Nov 09 19:30:54 arch-tower kernel: RDX: 00007f49237f8010 RSI: 00000000c0286448 RDI: 0000000000000018
Nov 09 19:30:54 arch-tower kernel: RBP: 00007f49237f8010 R08: 000000010c400000 R09: 000000000000000e
Nov 09 19:30:54 arch-tower kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00000000c0286448
Nov 09 19:30:54 arch-tower kernel: R13: 0000000000000018 R14: 0000000000330000 R15: 0000000000000005
Nov 09 19:30:54 arch-tower kernel: </TASK>
-- Boot c2c099679ad94b7190cf2380e047c12b --
Nov 09 19:31:40 arch-tower kernel: Linux version 6.0.6-arch1-1 (linux@archlinux) (gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0) #1 SMP PREEMPT_DYNAMIC Sat, 29 Oct 2022 14:08:39 +0000
Nov 09 19:31:40 arch-tower kernel: Command line: initrd=\amd-ucode.img initrd=\initramfs-linux.img root=PARTUUID=48a0f342-0f7d-7a4e-a0fd-7ccf9a7950fd resume=PARTUUID=fd3f4977-4b82-6945-8257-e60d5214141b rw acpi=on radeon.si_support=0 radeon.cik_support=0 amdgpu.cik_support=1 amdgpu.si_support=1
Hello, I'm encountering a similar regression on the 6.1.1 kernel (not present in this form on 6.0.12, although the system occasionally freezes as well). When connecting a Thinkpad USB-C Dock with two monitors to my Ryzen 3500U Thinkpad, the system freezes with a null-pointer dereference in amdgpu.

Kernel: Linux version 6.1.1-arch1-1 (linux@archlinux) (gcc (GCC) 12.2.0, GNU ld (GNU Binutils) 2.39.0) #1 SMP PREEMPT_DYNAMIC Wed, 21 Dec 2022 22:27:55 +0000

Graphics controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso/Raven 2 [Radeon Vega Series / Radeon Vega Mobile Series] (rev d2)

Output from journalctl:

Dec 23 09:42:29 kevin-t495 kernel: usb 2-1.3.3: New USB device found, idVendor=17ef, idProduct=a395, bcdDevice=60.70
Dec 23 09:42:29 kevin-t495 kernel: usb 2-1.3.3: New USB device strings: Mfr=10, Product=11, SerialNumber=0
Dec 23 09:42:29 kevin-t495 kernel: usb 2-1.3.3: Product: USB2.0 Hub
Dec 23 09:42:29 kevin-t495 kernel: usb 2-1.3.3: Manufacturer: Lenovo
Dec 23 09:42:29 kevin-t495 kernel: [drm] Downstream port present 1, type 2
Dec 23 09:42:29 kevin-t495 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Dec 23 09:42:29 kevin-t495 kernel: #PF: supervisor read access in kernel mode
Dec 23 09:42:29 kevin-t495 kernel: #PF: error_code(0x0000) - not-present page
Dec 23 09:42:29 kevin-t495 kernel: PGD 0 P4D 0
Dec 23 09:42:29 kevin-t495 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Dec 23 09:42:29 kevin-t495 kernel: CPU: 4 PID: 998 Comm: sway Not tainted 6.1.1-arch1-1 #1 9bd09188b430be630e611f984454e4f3c489be77
Dec 23 09:42:29 kevin-t495 kernel: Hardware name: LENOVO 20NKS01Y00/20NKS01Y00, BIOS R12ET61W(1.31 ) 07/28/2022
Dec 23 09:42:29 kevin-t495 kernel: RIP: 0010:drm_dp_atomic_find_time_slots+0x61/0x2a0 [drm_display_helper]
Dec 23 09:42:29 kevin-t495 kernel: Code: 00 00 00 48 8b 85 60 05 00 00 48 63 80 88 00 00 00 3b 43 28 0f 8d ce 01 00 00 48 8b 53 30 48 8d 04 80 48 8d 04 c2 48 8b 40 18 <48> 8b 40 08 4d 8d 65 38 8b 88 90 00 00 00 b8 01 00 00 00 d3 e0 41
Dec 23 09:42:29 kevin-t495 kernel: RSP: 0018:ffffa526c0eef780 EFLAGS: 00010293
Dec 23 09:42:29 kevin-t495 kernel: RAX: 0000000000000000 RBX: ffff9555ef407200 RCX: 0000000000000214
Dec 23 09:42:29 kevin-t495 kernel: RDX: ffff9555c4124800 RSI: ffff9555429ba540 RDI: ffff9555ef407200
Dec 23 09:42:29 kevin-t495 kernel: RBP: ffff9555cfc76000 R08: 0000000000000001 R09: ffff9555c4242050
Dec 23 09:42:29 kevin-t495 kernel: R10: ffffa526c0eef8a0 R11: 000000004cb505a0 R12: 026d60dce16e8423
Dec 23 09:42:29 kevin-t495 kernel: R13: ffff95554cb505a0 R14: ffff9555429ba540 R15: 0000000000000214
Dec 23 09:42:29 kevin-t495 kernel: FS: 00007fb56378b980(0000) GS:ffff9557f0b00000(0000) knlGS:0000000000000000
Dec 23 09:42:29 kevin-t495 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 23 09:42:29 kevin-t495 kernel: CR2: 0000000000000008 CR3: 000000010ae2a000 CR4: 00000000003506e0
Dec 23 09:42:29 kevin-t495 kernel: Call Trace:
Dec 23 09:42:29 kevin-t495 kernel: <TASK>
Dec 23 09:42:29 kevin-t495 kernel: compute_mst_dsc_configs_for_link+0x31d/0x9d0 [amdgpu 895e2b3772442c7d04dbf61a65c8a3690bb074b6]
Dec 23 09:42:29 kevin-t495 kernel: ? cm_helper_translate_curve_to_degamma_hw_format+0x5f0/0x5f0 [amdgpu 895e2b3772442c7d04dbf61a65c8a3690bb074b6]
Dec 23 09:42:29 kevin-t495 kernel: ? fill_plane_buffer_attributes+0x355/0x530 [amdgpu 895e2b3772442c7d04dbf61a65c8a3690bb074b6]
Dec 23 09:42:29 kevin-t495 kernel: compute_mst_dsc_configs_for_state+0x1e1/0x250 [amdgpu 895e2b3772442c7d04dbf61a65c8a3690bb074b6]
Dec 23 09:42:29 kevin-t495 kernel: amdgpu_dm_atomic_check+0xf81/0x1230 [amdgpu 895e2b3772442c7d04dbf61a65c8a3690bb074b6]
Dec 23 09:42:29 kevin-t495 kernel: drm_atomic_check_only+0x537/0xba0
Dec 23 09:42:29 kevin-t495 kernel: drm_mode_atomic_ioctl+0x750/0xbb0
Dec 23 09:42:29 kevin-t495 kernel: ? drm_property_add_enum+0x180/0x180
Dec 23 09:42:29 kevin-t495 kernel: ? idr_alloc+0x3a/0x70
Dec 23 09:42:29 kevin-t495 kernel: ? drm_atomic_set_property+0xbc0/0xbc0
Dec 23 09:42:29 kevin-t495 kernel: drm_ioctl_kernel+0xcd/0x170
Dec 23 09:42:29 kevin-t495 kernel: drm_ioctl+0x1eb/0x450
Dec 23 09:42:29 kevin-t495 kernel: ? drm_atomic_set_property+0xbc0/0xbc0
Dec 23 09:42:29 kevin-t495 kernel: amdgpu_drm_ioctl+0x4e/0x90 [amdgpu 895e2b3772442c7d04dbf61a65c8a3690bb074b6]
Dec 23 09:42:29 kevin-t495 kernel: __x64_sys_ioctl+0x94/0xd0
Dec 23 09:42:29 kevin-t495 kernel: do_syscall_64+0x5f/0x90
Dec 23 09:42:29 kevin-t495 kernel: ? syscall_exit_to_user_mode+0x1b/0x40
Dec 23 09:42:29 kevin-t495 kernel: ? do_syscall_64+0x6b/0x90
Dec 23 09:42:29 kevin-t495 kernel: ? do_syscall_64+0x6b/0x90
Dec 23 09:42:29 kevin-t495 kernel: ? syscall_exit_to_user_mode+0x1b/0x40
Dec 23 09:42:29 kevin-t495 kernel: ? do_syscall_64+0x6b/0x90
Dec 23 09:42:29 kevin-t495 kernel: ? do_syscall_64+0x6b/0x90
Dec 23 09:42:29 kevin-t495 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Dec 23 09:42:29 kevin-t495 kernel: RIP: 0033:0x7fb5645dec0f
Dec 23 09:42:29 kevin-t495 kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Dec 23 09:42:29 kevin-t495 kernel: RSP: 002b:00007ffd3850b740 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Dec 23 09:42:29 kevin-t495 kernel: RAX: ffffffffffffffda RBX: 000055c223c95960 RCX: 00007fb5645dec0f
Dec 23 09:42:29 kevin-t495 kernel: RDX: 00007ffd3850b7e0 RSI: 00000000c03864bc RDI: 000000000000000d
Dec 23 09:42:29 kevin-t495 kernel: RBP: 00007ffd3850b7e0 R08: 0000000000000003 R09: 0000000000000003
Dec 23 09:42:29 kevin-t495 kernel: R10: 000055c222b65010 R11: 0000000000000246 R12: 00000000c03864bc
Dec 23 09:42:29 kevin-t495 kernel: R13: 000000000000000d R14: 000055c223c6cba0 R15: 000055c223bfcab0
Dec 23 09:42:29 kevin-t495 kernel: </TASK>
Dec 23 09:42:29 kevin-t495 kernel: Modules linked in: cdc_ether usbnet r8152 mii rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast ccm cmac algif_hash algif_skcipher af_alg nft_fib_inet nft_>
Dec 23 09:42:29 kevin-t495 kernel: snd_pci_acp6x iwlwifi snd_pci_acp5x snd_hda_core rapl snd_rn_pci_acp3x vfat think_lmi realtek ecdh_generic snd_hwdep fat snd_acp_config ucsi_acpi typec_ucsi pcspkr mdio_devres psmouse snd_soc_acpi firmware_attributes_c>
Dec 23 09:42:29 kevin-t495 kernel: CR2: 0000000000000008
Dec 23 09:42:29 kevin-t495 kernel: ---[ end trace 0000000000000000 ]---
Dec 23 09:42:29 kevin-t495 kernel: RIP: 0010:drm_dp_atomic_find_time_slots+0x61/0x2a0 [drm_display_helper]
Dec 23 09:42:29 kevin-t495 kernel: Code: 00 00 00 48 8b 85 60 05 00 00 48 63 80 88 00 00 00 3b 43 28 0f 8d ce 01 00 00 48 8b 53 30 48 8d 04 80 48 8d 04 c2 48 8b 40 18 <48> 8b 40 08 4d 8d 65 38 8b 88 90 00 00 00 b8 01 00 00 00 d3 e0 41
Dec 23 09:42:29 kevin-t495 kernel: RSP: 0018:ffffa526c0eef780 EFLAGS: 00010293
Dec 23 09:42:29 kevin-t495 kernel: RAX: 0000000000000000 RBX: ffff9555ef407200 RCX: 0000000000000214
Dec 23 09:42:29 kevin-t495 kernel: RDX: ffff9555c4124800 RSI: ffff9555429ba540 RDI: ffff9555ef407200
Dec 23 09:42:29 kevin-t495 kernel: RBP: ffff9555cfc76000 R08: 0000000000000001 R09: ffff9555c4242050
Dec 23 09:42:29 kevin-t495 kernel: R10: ffffa526c0eef8a0 R11: 000000004cb505a0 R12: 026d60dce16e8423
Dec 23 09:42:29 kevin-t495 kernel: R13: ffff95554cb505a0 R14: ffff9555429ba540 R15: 0000000000000214
Dec 23 09:42:29 kevin-t495 kernel: FS: 00007fb56378b980(0000) GS:ffff9557f0b00000(0000) knlGS:0000000000000000
Dec 23 09:42:29 kevin-t495 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 23 09:42:29 kevin-t495 kernel: CR2: 0000000000000008 CR3: 000000010ae2a000 CR4: 00000000003506e0
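
For anyone comparing traces across the reports in this thread, here is a minimal sketch (not part of the original report) that pulls the fault address and the faulting symbol out of `journalctl -k` output like the above. The function name `summarize_oops` and the `SAMPLE` list are hypothetical; the two sample lines are copied from the log in this comment, and the regular expressions only assume the "BUG: kernel NULL pointer dereference" and kernel-mode "RIP: 0010:" line formats visible here.

```python
import re

# "BUG: kernel NULL pointer dereference, address: <16 hex digits>"
NULL_DEREF_RE = re.compile(
    r"BUG: kernel NULL pointer dereference, address: (?P<addr>[0-9a-f]{16})"
)
# Kernel-mode RIP line: "RIP: 0010:symbol+0xOFF/0xSIZE [module ...]"
RIP_RE = re.compile(
    r"RIP: 0010:(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)[^\]]*\])?"
)


def summarize_oops(lines):
    """Return a dict for the first NULL dereference and the RIP line that follows it."""
    address = None
    for line in lines:
        match = NULL_DEREF_RE.search(line)
        if match:
            address = int(match.group("addr"), 16)
            continue
        match = RIP_RE.search(line)
        if match and address is not None:
            return {
                "address": address,
                "symbol": match.group("symbol"),
                "offset": int(match.group("offset"), 16),
                "module": match.group("module"),  # None for built-in code
            }
    return None  # no NULL-dereference oops in the input


# Two lines copied from the journalctl output in this comment.
SAMPLE = [
    "Dec 23 09:42:29 kevin-t495 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008",
    "Dec 23 09:42:29 kevin-t495 kernel: RIP: 0010:drm_dp_atomic_find_time_slots+0x61/0x2a0 [drm_display_helper]",
]

if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
```

For this log the summary matches the register dump: RAX is 0 and the marked instruction bytes `<48> 8b 40 08` load from offset 8 of RAX, which agrees with the reported fault address 0000000000000008.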