Bug 204181 - NULL pointer dereference regression in amdgpu
Summary: NULL pointer dereference regression in amdgpu
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-15 10:11 UTC by Sergey Kondakov
Modified: 2019-08-19 15:11 UTC
CC List: 10 users

See Also:
Kernel Version: 5.2.1
Tree: Mainline
Regression: No


Attachments
dmesg (164.81 KB, text/plain)
2019-07-15 10:11 UTC, Sergey Kondakov
dmesg with "drm=debug=4" (171.17 KB, text/plain)
2019-07-15 15:43 UTC, Sergey Kondakov
kernel build config (227.25 KB, application/x-config)
2019-07-15 15:43 UTC, Sergey Kondakov
amdgpu parameters (2.83 KB, text/plain)
2019-07-15 15:45 UTC, Sergey Kondakov
X.log (53.91 KB, text/plain)
2019-07-15 15:48 UTC, Sergey Kondakov
lsmem (10.04 KB, text/plain)
2019-07-15 15:50 UTC, Sergey Kondakov
lspci -vv (57.29 KB, text/plain)
2019-07-15 15:50 UTC, Sergey Kondakov
lspci -t -PP -q -k -v (2.57 KB, text/plain)
2019-07-15 15:53 UTC, Sergey Kondakov
/proc/interrupts (5.40 KB, text/plain)
2019-07-15 15:59 UTC, Sergey Kondakov
dmesg with "drm.debug=4" (237.09 KB, text/plain)
2019-07-16 15:29 UTC, Sergey Kondakov
tail -n 2000 from dmesg with "drm.debug=5" (214.64 KB, application/octet-stream)
2019-07-16 16:36 UTC, Sergey Kondakov
dmesg_2019-08-02-amdgpu_fail_on_patched_5.2.5 (181.40 KB, text/plain)
2019-08-02 02:21 UTC, Sergey Kondakov
dmesg_2019-08-04-amdgpu-new_dereference-with-shadowprimary (175.65 KB, text/plain)
2019-08-04 05:17 UTC, Sergey Kondakov

Description Sergey Kondakov 2019-07-15 10:11:49 UTC
Created attachment 283693 [details]
dmesg

After updating from 5.1 to 5.2.1, within about 5-10 minutes of watching a YouTube video in Firefox I now get a complete lock-up of video output and cannot shut down using the power button. Using the "magic keys" lets me reboot and get the kernel log via `journalctl -b -1 -k`; here is the relevant part:
BUG: kernel NULL pointer dereference, address: 00000000000002b4
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 8200 Comm: kworker/u16:1 Tainted: G          IO      5.2.1-1383.gd5bbc26-HSF #1 openSUSE Tumbleweed (unreleased)
Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS F14e 09/09/2014
Workqueue: events_unbound commit_work
RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu]
Code: 04 00 00 49 8b bc 02 80 02 00 00 48 8b 07 48 8b 40 50 e8 ed 88 a8 d6 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2 8b 8e 10 01 00 00 bf 04 00 00
RSP: 0018:ffffa5568b1b7c00 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
RDX: ffffffffc07fbd50 RSI: 0000000000000000 RDI: ffff8e9ee9500000
RBP: ffff8e9d90618000 R08: 0000000000000001 R09: 0000000000000000
R10: ffffa5568b1b7c30 R11: 0000000000000000 R12: ffff8e9ee9500000
R13: ffff8e9ededb4448 R14: ffff8e9e47e10c00 R15: ffff8e9ededa0000
FS:  0000000000000000(0000) GS:ffff8e9eee000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002b4 CR3: 00000003bcf46000 CR4: 00000000000406e0
Call Trace:
 dc_commit_state+0x79/0xb0 [amdgpu]
 amdgpu_dm_atomic_commit_tail+0x3c0/0xdb0 [amdgpu]
 ? finish_task_switch+0x74/0x300
 ? __switch_to+0x152/0x4e0
 ? __switch_to_asm+0x34/0x70
 ? __lock_acquire+0x3c8/0x7a0
 ? find_held_lock+0x32/0x90
 ? find_held_lock+0x32/0x90
 ? sched_clock+0x5/0x10
 ? mark_held_locks+0x2d/0x80
 ? preempt_count_sub+0x98/0xe0
 ? _raw_spin_unlock_irq+0x3a/0x50
 ? wait_for_completion_timeout+0xe9/0x110
 ? commit_tail+0x3c/0x70
 commit_tail+0x3c/0x70
 process_one_work+0x271/0x5f0
 worker_thread+0x4a/0x3d0
 ? process_one_work+0x5f0/0x5f0
 kthread+0x118/0x140
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x27/0x50
Modules linked in: af_packet ts_bm xt_pkttype xt_string nf_nat_ftp nf_conntrack_ftp xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq zram snd_pcm_oss rfcomm snd_mixer_oss it87 hwmon_vid bnep msr rc_avermedia tuner_simple tuner_types amd64_edac_mod tuner tda7432 edac_mce_amd btusb tvaudio kvm_amd ath9k btrtl btbcm msp3400 ath9k_common btintel bluetooth ath9k_hw kvm irqbypass ath bttv tea575x joydev tveeprom videobuf_dma_sg snd_usb_audio videobuf_core snd_usbmidi_lib rc_core snd_rawmidi snd_hda_codec_realtek mac80211 snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi v4l2_common snd_seq_device
 snd_hda_intel videodev sp5100_tco pcspkr snd_hda_codec wmi_bmof mxm_wmi amdgpu fam15h_power k10temp media i2c_piix4 cfg80211 r8169 snd_hda_core gpu_sched realtek snd_hwdep libphy ttm rfkill snd_pcm mac_hid hid_generic usbhid uas usb_storage ohci_pci serio_raw sd_mod ehci_pci ohci_hcd xhci_pci ehci_hcd xhci_hcd wmi exfat(O) l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc vhba(O) uinput sg nbd dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ecryptfs
CR2: 00000000000002b4
---[ end trace 0633d97cb3f2d2d6 ]---
RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu]
Code: 04 00 00 49 8b bc 02 80 02 00 00 48 8b 07 48 8b 40 50 e8 ed 88 a8 d6 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2 8b 8e 10 01 00 00 bf 04 00 00
RSP: 0018:ffffa5568b1b7c00 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
RDX: ffffffffc07fbd50 RSI: 0000000000000000 RDI: ffff8e9ee9500000
RBP: ffff8e9d90618000 R08: 0000000000000001 R09: 0000000000000000
R10: ffffa5568b1b7c30 R11: 0000000000000000 R12: ffff8e9ee9500000
R13: ffff8e9ededb4448 R14: ffff8e9e47e10c00 R15: ffff8e9ededa0000
FS:  0000000000000000(0000) GS:ffff8e9eee000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002b4 CR3: 00000003bcf46000 CR4: 00000000000406e0
Comment 1 Nicholas Kazlauskas 2019-07-15 13:07:54 UTC
Do you mind posting a dmesg log with drm=debug=4 as part of your boot parameters?

An xorg log would be good too if applicable.

I'm curious to know what the actual sequence / system setup is for reproducing this as this isn't really a typical sequence. I think you'd run into other NULL pointer dereferences even if this one is guarded.

I think the stream itself is NULL and it shouldn't be in the context.
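
The register dump in the description is consistent with a NULL stream. A minimal sketch of the arithmetic, assuming the bytes at the `<8b>` marker (`8b 86 b4 02 00 00`) hand-decode as `mov eax, [rsi+0x2b4]`; the helper name `disp32` is hypothetical, and RSI carrying the second integer argument follows the x86-64 System V calling convention:

```python
# Values copied from the oops in the description.
RSI = 0x0000000000000000  # base register of the faulting load
CR2 = 0x00000000000002B4  # faulting address reported by the CPU
FAULT_BYTES = bytes.fromhex("8b86b4020000")  # bytes starting at <8b> in "Code:"


def disp32(insn: bytes) -> int:
    """Return the little-endian 32-bit displacement that follows opcode + ModRM.

    Assumption: the instruction has a one-byte opcode, one ModRM byte, no SIB.
    """
    return int.from_bytes(insn[2:6], "little", signed=True)


offset = disp32(FAULT_BYTES)
effective = (RSI + offset) & 0xFFFFFFFFFFFFFFFF

print(hex(offset))       # 0x2b4
print(effective == CR2)  # True
```

With RSI equal to zero, the load lands exactly on CR2, so the faulting access looks like a read of a field at offset 0x2b4 through a NULL pointer passed as the second argument.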
Comment 2 Sergey Kondakov 2019-07-15 15:43:03 UTC
Created attachment 283695 [details]
dmesg with "drm=debug=4"
Comment 3 Sergey Kondakov 2019-07-15 15:43:30 UTC
Created attachment 283697 [details]
kernel build config
Comment 4 Sergey Kondakov 2019-07-15 15:45:40 UTC
Created attachment 283699 [details]
amdgpu parameters

These don't seem to change anything about the hang, although with larger scheduling limits (max_num_of_queues_per_device, sched_hw_submission, sched_jobs) the hang may happen sooner; I'm not sure.
Comment 5 Sergey Kondakov 2019-07-15 15:48:51 UTC
Created attachment 283701 [details]
X.log

amdgpu has TearFree and VariableRefresh enabled (though the LCDs don't support variable refresh). Dual-screen setup with two 60 fps 1080p LCDs (one VA, one TN), recently overclocked to ~73 and ~72 fps via CVT-1.2 modelines on both Linux and Windows.
Comment 6 Sergey Kondakov 2019-07-15 15:50:36 UTC
Created attachment 283703 [details]
lsmem
Comment 7 Sergey Kondakov 2019-07-15 15:50:56 UTC
Created attachment 283705 [details]
lspci -vv
Comment 8 Sergey Kondakov 2019-07-15 15:53:09 UTC
Created attachment 283707 [details]
lspci -t -PP -q -k -v
Comment 9 Sergey Kondakov 2019-07-15 15:56:11 UTC
(In reply to Nicholas Kazlauskas from comment #1)
> Do you mind posting a dmesg log with drm=debug=4 as part of your boot
> parameters?
> 
> An xorg log would be good too if applicable.
> 
> I'm curious to know what the actual sequence / system setup is for
> reproducing this as this isn't really a typical sequence. I think you'd run
> into other NULL pointer dereferences even if this one is guarded.
> 
> I think the stream itself is NULL and it shouldn't be in the context.

I don't think that putting 'drm=debug=4' on the boot command line has changed anything, but here's some more data. I also recently stumbled into another baffling regression (bug #203703, from 5.0 to 5.1) concerning network packet scheduling (the fq_codel qdisc) that halts the affected Ethernet device. It also produces a repeatable kernel trace on random network activity unless the qdisc is changed to the dumb "pfifo_fast" early on, similar to how this bug produces the same repeatable amdgpu trace on some random GPU activity. Weird.
Comment 10 Nicholas Kazlauskas 2019-07-15 15:58:19 UTC
Thanks for all the logs.

I meant drm.debug=4 actually, the drm=debug=4 was a typo on my part - sorry!
Comment 11 Sergey Kondakov 2019-07-15 15:59:08 UTC
Created attachment 283709 [details]
/proc/interrupts
Comment 12 Sergey Kondakov 2019-07-16 15:29:06 UTC
Created attachment 283741 [details]
dmesg with "drm.debug=4"

Here's the actual debug dmesg. The pci subsystem uses 'pci=x=y' syntax, so it didn't occur to me that the same form wouldn't be valid for drm.

Just as I was about to upload the first debug dump, from a hang that happened after >16 hours of uptime and >30 minutes of video, it crashed again before Firefox even had a chance to render a single page. That page happened to be the same YouTube page everything had hung on, because Firefox starts at the last opened page. So after taking >30 minutes the first time, it took less than a second to hang again. This dump is from that second hang.

I haven't tried launching a local video player or a 3D app. Without opening YouTube or another video in Firefox, ordinary 2D non-accelerated desktop use doesn't seem to trigger it.
Comment 13 Sergey Kondakov 2019-07-16 16:36:41 UTC
Created attachment 283745 [details]
tail -n 2000 from dmesg with "drm.debug=5"

drm.debug=4 seems to produce only one new relevant line:
"[drm:dc_commit_state [amdgpu]] dc_commit_state: 2 streams"
so I tried increasing it. drm.debug=5 creates a horrible stream that bogs down the system with I/O load from journald, but it did write some more at the moment of the hang. I'm not going any further than that, though.
Comment 14 Sergey Kondakov 2019-07-16 16:52:30 UTC
(In reply to Nicholas Kazlauskas from comment #10)
> Thanks for all the logs.
> 
> I meant drm.debug=4 actually, the drm=debug=4 was a typo on my part - sorry!

So, I've got all I could on this.

Could this be related to my recent LCD overclock ? I haven't tried going back to 60 fps yet.
The cvt executable and modes/xf86cvt.c in the X server haven't been updated for years and can't produce CVT-1.2 modes or any useful "reduced blanking" modes, so I had to go for things like:
https://github.com/kevinlekiller/cvt_modeline_calculator_12 and
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=899066
On Windows I had to use https://www.monitortests.com/forum/Thread-Custom-Resolution-Utility-CRU because the AMD driver refuses to use custom modes that it itself generates, nagging about an "unsupported" (yeah, right…) "error".
Comment 15 Nicholas Kazlauskas 2019-07-16 16:55:14 UTC
Thanks for the logs. I don't think this is related to your overclock.

Since this behavior wasn't previously observed during our 5.2 testing I think that either a patch got lost or changed during the submission process, or something from 5.3 was backported into 5.2 that shouldn't have been.

I don't think it's necessarily setup specific.
Comment 16 Sergey Kondakov 2019-07-24 18:33:04 UTC
(In reply to Nicholas Kazlauskas from comment #15)
> Thanks for the logs. I don't think this is related to your overclock.
> 
> Since this behavior wasn't previously observed during our 5.2 testing I
> think that either a patch got lost or changed during the submission process,
> or something from 5.3 was backported into 5.2 that shouldn't have been.
> 
> I don't think it's necessarily setup specific.

That means that you were able to reproduce it ? If so, any known workaround or ETA on the fix ? Is rc1 of 5.3 affected ? Any plans on backport to 5.2.x ?
Comment 17 Yann HN 2019-07-25 10:52:44 UTC
I was facing the same issue: video output stopped completely and the X server process became unresponsive.

I had switched hardware the day before.
GPU: PNY GTX 1060 -> Asus Vega 56
Mainboard: Asus Z370P -> MSI Z390A Pro

A friend suggested that I install some packages to enhance GPU support;
one of them was "xf86-video-amdgpu".

Seems like that package was responsible for the issues.
Removing it fixed the issue without any other (notable) effects.

Some more info for context:
X: X.Org X Server 1.20.5
Desktop: plasmashell 5.16.3
Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
Comment 18 Michel Dänzer 2019-07-25 14:21:55 UTC
(In reply to Yann HN from comment #17)
> A friend suggested that I install some packages to enhance GPU support;
> one of them was "xf86-video-amdgpu".
> 
> Seems like that package was responsible for the issues.
> Removing it fixed the issue without any other (notable) effects.

Did you get the same amdgpu_dm_atomic_commit_tail => dc_commit_state => dc_stream_log NULL pointer dereference as reported here?

If yes, this is a kernel driver bug, xf86-video-amdgpu just triggers it / the Xorg modesetting driver avoids it somehow.

If not, please file your own report at https://bugs.freedesktop.org/enter_bug.cgi?product=xorg&component=Driver/AMDgpu and attach the corresponding Xorg log file and output of dmesg.
Comment 19 Yann HN 2019-07-25 15:42:40 UTC
(In reply to Michel Dänzer from comment #18)
> (In reply to Yann HN from comment #17)
> > A friend suggested that I install some packages to enhance GPU support;
> > one of them was "xf86-video-amdgpu".
> > 
> > Seems like that package was responsible for the issues.
> > Removing it fixed the issue without any other (notable) effects.
> 
> Did you get the same amdgpu_dm_atomic_commit_tail => dc_commit_state =>
> dc_stream_log NULL pointer dereference as reported here?
> 
> If yes, this is a kernel driver bug, xf86-video-amdgpu just triggers it /
> the Xorg modesetting driver avoids it somehow.
> 
> If not, please file your own report at
> https://bugs.freedesktop.org/enter_bug.cgi?product=xorg&component=Driver/
> AMDgpu and attach the corresponding Xorg log file and output of dmesg.

Yes, I reinstalled the package and was able to reproduce the error pretty fast. Here is the whole stack trace (confirming the package as the trigger of the issue):

Jul 25 17:38:12 arch-workstation kernel: BUG: kernel NULL pointer dereference, address: 00000000000002b4
Jul 25 17:38:12 arch-workstation kernel: #PF: supervisor read access in kernel mode
Jul 25 17:38:12 arch-workstation kernel: #PF: error_code(0x0000) - not-present page
Jul 25 17:38:12 arch-workstation kernel: PGD 0 P4D 0 
Jul 25 17:38:12 arch-workstation kernel: Oops: 0000 [#1] PREEMPT SMP PTI
Jul 25 17:38:12 arch-workstation kernel: CPU: 3 PID: 296 Comm: kworker/u24:4 Not tainted 5.2.2-arch1-1-ARCH #1
Jul 25 17:38:12 arch-workstation kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B98/Z390-A PRO (MS-7B98), BIOS 1.60 03/21/2019
Jul 25 17:38:12 arch-workstation kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Jul 25 17:38:12 arch-workstation kernel: RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu]
Jul 25 17:38:12 arch-workstation kernel: Code: 04 00 00 49 8b bc 02 80 02 00 00 48 8b 07 48 8b 40 50 e8 1d 35 f7 cd b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2 8b 8e 10 01 00 00 bf 04 00>
Jul 25 17:38:12 arch-workstation kernel: RSP: 0018:ffff9ced83f5faf0 EFLAGS: 00010202
Jul 25 17:38:12 arch-workstation kernel: RAX: 0000000000000000 RBX: ffff8b9687199000 RCX: 0000000000000002
Jul 25 17:38:12 arch-workstation kernel: RDX: ffffffffc1112710 RSI: 0000000000000000 RDI: ffff8b9687199000
Jul 25 17:38:12 arch-workstation kernel: RBP: ffff8b95c7868000 R08: ffff8b95c7868000 R09: 0000000000000000
Jul 25 17:38:12 arch-workstation kernel: R10: ffff8b95c7868000 R11: 0000000000000018 R12: 0000000000000001
Jul 25 17:38:12 arch-workstation kernel: R13: ffff9ced83f5fd58 R14: ffff8b967420cff0 R15: 0000000000000000
Jul 25 17:38:12 arch-workstation kernel: FS:  0000000000000000(0000) GS:ffff8b968d8c0000(0000) knlGS:0000000000000000
Jul 25 17:38:12 arch-workstation kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 25 17:38:12 arch-workstation kernel: CR2: 00000000000002b4 CR3: 000000080c284006 CR4: 00000000003606e0
Jul 25 17:38:12 arch-workstation kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jul 25 17:38:12 arch-workstation kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jul 25 17:38:12 arch-workstation kernel: Call Trace:
Jul 25 17:38:12 arch-workstation kernel:  dc_commit_state+0x9a/0x5a0 [amdgpu]
Jul 25 17:38:12 arch-workstation kernel:  ? dm_plane_helper_cleanup_fb+0xa3/0x120 [amdgpu]
Jul 25 17:38:12 arch-workstation kernel:  amdgpu_dm_atomic_commit_tail+0xc5d/0x1a10 [amdgpu]
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x34/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? __switch_to_asm+0x40/0x70
Jul 25 17:38:12 arch-workstation kernel:  ? _raw_spin_unlock_irq+0x1d/0x30
Jul 25 17:38:12 arch-workstation kernel:  ? finish_task_switch+0x84/0x2d0
Jul 25 17:38:12 arch-workstation kernel:  ? preempt_schedule_common+0x32/0x80
Jul 25 17:38:12 arch-workstation kernel:  ? commit_tail+0x3c/0x70 [drm_kms_helper]
Jul 25 17:38:12 arch-workstation kernel:  commit_tail+0x3c/0x70 [drm_kms_helper]
Jul 25 17:38:12 arch-workstation kernel:  process_one_work+0x1d1/0x3e0
Jul 25 17:38:12 arch-workstation kernel:  worker_thread+0x4a/0x3d0
Jul 25 17:38:12 arch-workstation kernel:  kthread+0xfb/0x130
Jul 25 17:38:12 arch-workstation kernel:  ? process_one_work+0x3e0/0x3e0
Jul 25 17:38:12 arch-workstation kernel:  ? kthread_park+0x90/0x90
Jul 25 17:38:12 arch-workstation kernel:  ret_from_fork+0x35/0x40
Jul 25 17:38:12 arch-workstation kernel: Modules linked in: fuse xt_nat xt_tcpudp veth xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo iptable_nat xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack nf_defrag_ipv6 nf_defra>
Jul 25 17:38:12 arch-workstation kernel:  snd_usbmidi_lib ppdev iTCO_wdt snd_hda_codec iTCO_vendor_support snd_rawmidi snd_seq_device media snd_hda_core agpgart snd_hwdep syscopyarea snd_pcm aesni_intel sysfillrect snd_timer aes_x86_64 c>
Jul 25 17:38:12 arch-workstation kernel: CR2: 00000000000002b4
Jul 25 17:38:12 arch-workstation kernel: ---[ end trace 8659bfc7daefd7ef ]---
Jul 25 17:38:12 arch-workstation kernel: RIP: 0010:dc_stream_log+0x6/0xb0 [amdgpu]
Jul 25 17:38:12 arch-workstation kernel: Code: 04 00 00 49 8b bc 02 80 02 00 00 48 8b 07 48 8b 40 50 e8 1d 35 f7 cd b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 53 <8b> 86 b4 02 00 00 48 89 f3 48 89 f2 8b 8e 10 01 00 00 bf 04 00>
Jul 25 17:38:12 arch-workstation kernel: RSP: 0018:ffff9ced83f5faf0 EFLAGS: 00010202
Comment 20 Nicholas Kazlauskas 2019-07-25 15:50:25 UTC
I haven't been able to reproduce this on my setup yet with xf86-video-amdgpu on Arch's 5.2.2 kernel. I don't see anything really missing between that and staging that could affect this issue.

It would probably help to have a dmesg log with drm.debug=0x54 - this will enable DRM atomic state debug prints.

You'll probably need to increase your log buffer size to get the state relevant to the crash.

ie: " log_buf_len=64M drm.debug=84 "
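
For reference, `0x54` and `84` are the same value, and `drm.debug` is a bitmask. A small sketch that decodes it, assuming the category bits from the kernel's `include/drm/drm_print.h` as of the 5.x series; the table and the function name `decode_drm_debug` are written out by hand for illustration and are not taken from this report:

```python
from typing import List

# Assumed category bits (include/drm/drm_print.h, 5.x kernels).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}


def decode_drm_debug(mask: int) -> List[str]:
    """List the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


print(0x54 == 84)              # True
print(decode_drm_debug(0x54))  # ['KMS', 'ATOMIC', 'STATE']
print(decode_drm_debug(4))     # ['KMS']
print(decode_drm_debug(5))     # ['CORE', 'KMS']
```

Under that assumption, the earlier `drm.debug=4` logs only had KMS messages enabled, while `0x54` adds the atomic and state prints requested here.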
Comment 21 Frank Steinborn 2019-07-26 12:23:43 UTC
Facing the same issue (Vega64). I captured a dmesg (drm.debug=0x54) with lockup and uploaded it here:

https://nognu.de/p/dmesg_amdgpu.txt

Thanks!
Comment 22 Nicholas Kazlauskas 2019-07-26 16:02:22 UTC
Thanks for the log!

I can reproduce the issue now by emulating the sequence using IGT. It doesn't seem to show up in desktop usage for me.
Comment 23 Sergey Kondakov 2019-07-30 21:41:57 UTC
(In reply to Nicholas Kazlauskas from comment #22)
> Thanks for the log!
> 
> I can reproduce the issue now by emulating the sequence using IGT. It
> doesn't seem to show up in desktop usage for me.

Indeed. I tried using the modesetting X11 driver and got a bunch of errors in Xorg.0.log about being unable to do "page flips", so I set `PageFlip false` for it and `EnablePageFlip false` for amdgpu, and removed 'TearFree true' (why isn't it always on by default ?) just in case. No hangs for about 24 hours, even with a lot of YouTube in Firefox and even with amdgpu.

There seem to be a lot of patches for AMD GPUs queued for 5.2.5, any chance of the complete fix among them ?
Comment 24 Nicholas Kazlauskas 2019-07-31 16:28:15 UTC
This should be fixed with the series linked below:

https://patchwork.freedesktop.org/series/64505/

But it still needs review and backporting to older kernels.
Comment 25 Sergey Kondakov 2019-08-01 06:13:54 UTC
(In reply to Nicholas Kazlauskas from comment #24)
> This should be fixed with the series linked below:
> 
> https://patchwork.freedesktop.org/series/64505/
> 
> But it still needs review and backporting to older kernels.

So, I've patched my 5.2.5 kernel package with that set and re-enabled page flipping. So far, everything seems fine. When it's merged and released, this issue may be closed. Thanks !
Comment 26 Sergey Kondakov 2019-08-02 02:21:31 UTC
Created attachment 284083 [details]
dmesg_2019-08-02-amdgpu_fail_on_patched_5.2.5

(In reply to Nicholas Kazlauskas from comment #24)
> This should be fixed with the series linked below:
> 
> https://patchwork.freedesktop.org/series/64505/
> 
> But it still needs review and backporting to older kernels.

Celebration might have been premature. Hours later I got another freeze with a different error in amdgpu. Only this time, the mouse cursor was movable over the frozen frame right until I tried switching VT. Here's the trace:
BUG: unable to handle page fault for address: 0000000800000184
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 2 PID: 21044 Comm: kworker/u16:0 Tainted: G        W IO      5.2.5-1396.g79b6a9c-HSF #1 openSUSE Tumbleweed (unreleased)
Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS F14e 09/09/2014
Workqueue: events_unbound commit_work
RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2e6/0xd60 [amdgpu]
Code: ff 48 89 de 48 8b b8 40 43 01 00 e8 94 3b 09 00 49 8b 54 24 08 48 89 9d 30 fe ff ff 8b 82 00 09 00 00 85 c0 0f 85 fb fd ff ff <80> bb 80 01 00 00 01 0f 86 a0 00 00 00 48 b9 00 00 00 00 01 00 00
RSP: 0018:ffff98198b837c30 EFLAGS: 00010202
RAX: 0000000000000023 RBX: 0000000800000004 RCX: ffff8aca7b146f18
RDX: ffff8acc2a2d9000 RSI: ffffffffc0994f00 RDI: 0000000000000002
RBP: ffff98198b837e10 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8aca97bf3540
R13: ffff8acc114b1000 R14: ffff8acc035da000 R15: 0000000000000006
FS:  0000000000000000(0000) GS:ffff8acc2e000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000800000184 CR3: 00000003747c2000 CR4: 00000000000406e0
Call Trace:
 ? mark_held_locks+0x2d/0x80
 ? _raw_spin_unlock_irq+0x3a/0x50
 ? finish_task_switch+0xa2/0x300
 ? __lock_acquire+0x3c3/0x7c0
 ? find_held_lock+0x32/0x90
 ? find_held_lock+0x32/0x90
 ? sched_clock+0x5/0x10
 ? mark_held_locks+0x2d/0x80
 ? preempt_count_sub+0x98/0xe0
 ? _raw_spin_unlock_irq+0x3a/0x50
 ? wait_for_completion_timeout+0xe9/0x110
 ? commit_tail+0x3c/0x70
 commit_tail+0x3c/0x70
 process_one_work+0x271/0x5f0
 worker_thread+0x4a/0x3d0
 ? process_one_work+0x5f0/0x5f0
 kthread+0x118/0x140
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x27/0x50
Modules linked in: r8169 binfmt_misc af_packet ts_bm xt_pkttype xt_string nf_nat_ftp nf_conntrack_ftp xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss zram snd_mixer_oss bnep it87 hwmon_vid msr joydev amd64_edac_mod edac_mce_amd btusb btrtl btbcm rc_avermedia btintel kvm_amd tuner_simple tuner_types bluetooth snd_usb_audio tuner kvm tda7432 snd_usbmidi_lib snd_rawmidi irqbypass tvaudio msp3400 snd_seq_device ath9k bttv ath9k_common ath9k_hw tea575x tveeprom ath videobuf_dma_sg videobuf_core rc_core v4l2_common pcspkr wmi_bmof videodev mxm_wmi mac80211 fam15h_power k10temp sp5100_tco
 media amdgpu i2c_piix4 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio snd_hda_intel cfg80211 snd_hda_codec snd_hda_core realtek gpu_sched libphy snd_hwdep ttm rfkill snd_pcm mac_hid hid_generic usbhid uas usb_storage ohci_pci serio_raw sd_mod ohci_hcd ehci_pci ehci_hcd xhci_pci xhci_hcd wmi exfat(O) l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc vhba(O) uinput sg nbd dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ecryptfs [last unloaded: r8169]
CR2: 0000000800000184
---[ end trace 7da703104c8acbc9 ]---
RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2e6/0xd60 [amdgpu]
Code: ff 48 89 de 48 8b b8 40 43 01 00 e8 94 3b 09 00 49 8b 54 24 08 48 89 9d 30 fe ff ff 8b 82 00 09 00 00 85 c0 0f 85 fb fd ff ff <80> bb 80 01 00 00 01 0f 86 a0 00 00 00 48 b9 00 00 00 00 01 00 00
RSP: 0018:ffff98198b837c30 EFLAGS: 00010202
RAX: 0000000000000023 RBX: 0000000800000004 RCX: ffff8aca7b146f18
RDX: ffff8acc2a2d9000 RSI: ffffffffc0994f00 RDI: 0000000000000002
RBP: ffff98198b837e10 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8aca97bf3540
R13: ffff8acc114b1000 R14: ffff8acc035da000 R15: 0000000000000006
FS:  0000000000000000(0000) GS:ffff8acc2e000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000800000184 CR3: 00000003747c2000 CR4: 00000000000406e0

How ironic for it to manifest again during discussion video on Youtube about recent "JoJo" Part 5 finale's "perpetually trapped in the repeating nightmare of a frozen time" theme…
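
The trace in this comment differs from the original one in that the base register is not zero. A rough sketch of the check, assuming the bytes at the `<80>` marker (`80 bb 80 01 00 00 01`) hand-decode as `cmp byte [rbx+0x180], 1`; the `classify` helper and its thresholds (4 KiB page, 48-bit canonical upper half) are illustrative assumptions, not kernel definitions:

```python
PAGE_SIZE = 4096
KERNEL_HALF_START = 0xFFFF800000000000  # upper canonical half, 48-bit x86-64

# Values copied from the oops in this comment.
RBX = 0x0000000800000004
CR2 = 0x0000000800000184
DISP = int.from_bytes(bytes.fromhex("80010000"), "little")  # disp32 after ModRM


def classify(value: int) -> str:
    """Very coarse guess at what kind of value a register holds."""
    if value < PAGE_SIZE:
        return "NULL plus a small offset"
    if value >= KERNEL_HALF_START:
        return "kernel-half address"
    return "neither NULL nor a kernel address"


print(hex(RBX + DISP), RBX + DISP == CR2)  # 0x800000184 True
print(classify(RBX))                       # neither NULL nor a kernel address
print(classify(0xFFFF8ACC2A2D9000))        # kernel-half address (RDX, for contrast)
```

So this fault appears to come from a garbage pointer value in RBX rather than a plain NULL, which may be why the kernel reports "unable to handle page fault" here instead of "NULL pointer dereference".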
Comment 27 Sergey Kondakov 2019-08-04 05:17:02 UTC
Created attachment 284153 [details]
dmesg_2019-08-04-amdgpu-new_dereference-with-shadowprimary

So, I've been using explicitly disabled "EnablePageFlip" and "TearFree" options as a workaround for the original dereference, but then decided to try out "ShadowPrimary" while fiddling with mvtools' motion-interpolation optimization in mpv, since page flipping is disabled anyway. The result was ANOTHER NULL pointer dereference mere seconds after login:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 3272 Comm: X:cs0 Tainted: G          IO      5.2.5-1407.g79b6a9c-HSF #1 openSUSE Tumbleweed
Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS F14e 09/09/2014
RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu]
Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 4c 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c 24 08 31 f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92
RSP: 0018:ffffafc2478aba10 EFLAGS: 00010246
RAX: ffff98742e20e670 RBX: ffff98742e20e658 RCX: ffff98744fc66040
RDX: ffff98744fc66000 RSI: ffff98742e20e638 RDI: ffff9873a295f800
RBP: ffff987459e00000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffafc2478abb58 R14: ffff98744fc66000 R15: ffffafc2478abb58
FS:  00007f3ee03d7700(0000) GS:ffff98746de00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000003f27aa000 CR4: 00000000000406e0
Call Trace:
 amdgpu_cs_vm_handling+0x308/0x440 [amdgpu]
 amdgpu_cs_ioctl+0x154/0xa10 [amdgpu]
 ? amdgpu_cs_vm_handling+0x440/0x440 [amdgpu]
 drm_ioctl_kernel+0xaa/0xf0
 drm_ioctl+0x208/0x385
 ? amdgpu_cs_vm_handling+0x440/0x440 [amdgpu]
 ? _raw_spin_unlock_irqrestore+0x59/0x70
 ? preempt_count_sub+0x98/0xe0
 ? _raw_spin_unlock_irqrestore+0x46/0x70
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 do_vfs_ioctl+0x3ed/0x720
 ? __fget+0xf9/0x1b0
 ksys_ioctl+0x5e/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x66/0xc0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f3ee641c7c7
Code: 00 00 90 48 8b 05 d1 86 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a1 86 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007f3ee03d6a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f3ee03d6a70 RCX: 00007f3ee641c7c7
RDX: 00007f3ee03d6a70 RSI: 00000000c0186444 RDI: 000000000000000e
RBP: 00000000c0186444 R08: 00007f3ee03d6b80 R09: 0000000000000020
R10: 00007f3ee03d6b80 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000e R14: 000055d55e6f8bf0 R15: 000055d55e6f91a8
Modules linked in: af_packet xt_pkttype xt_string nf_nat_ftp nf_conntrack_ftp xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss msr bnep it87 hwmon_vid zram amd64_edac_mod edac_mce_amd kvm_amd kvm rc_avermedia tuner_simple tuner_types irqbypass tuner tda7432 btusb btrtl btbcm btintel tvaudio msp3400 bluetooth snd_usb_audio ath9k joydev bttv ath9k_common snd_usbmidi_lib tea575x ath9k_hw tveeprom snd_rawmidi videobuf_dma_sg mxm_wmi wmi_bmof pcspkr ath videobuf_core snd_seq_device k10temp fam15h_power rc_core snd_hda_codec_realtek v4l2_common snd_hda_codec_generic
 sp5100_tco snd_hda_codec_hdmi ledtrig_audio mac80211 amdgpu videodev media i2c_piix4 snd_hda_intel cfg80211 snd_hda_codec r8169 snd_hda_core realtek snd_hwdep libphy snd_pcm gpu_sched rfkill ttm mac_hid hid_generic usbhid uas usb_storage ohci_pci serio_raw sd_mod ehci_pci ohci_hcd ehci_hcd xhci_pci xhci_hcd wmi exfat(O) l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel pppox ppp_generic slhc vhba(O) uinput sg nbd dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ecryptfs
CR2: 0000000000000008
---[ end trace a7f0ed14134a76ad ]---
RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu]
Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 4c 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c 24 08 31 f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92
RSP: 0018:ffffafc2478aba10 EFLAGS: 00010246
RAX: ffff98742e20e670 RBX: ffff98742e20e658 RCX: ffff98744fc66040
RDX: ffff98744fc66000 RSI: ffff98742e20e638 RDI: ffff9873a295f800
RBP: ffff987459e00000 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffafc2478abb58 R14: ffff98744fc66000 R15: ffffafc2478abb58
FS:  00007f3ee03d7700(0000) GS:ffff98746de00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000003f27aa000 CR4: 00000000000406e0
Comment 28 vr00m 2019-08-07 17:43:34 UTC
I experienced issues after upgrading kernel from 5.1 to 5.2 on my notebook with 2500 U. I tried kernel boot param iommu=soft and that fixed it.
Comment 29 bl0rp 2019-08-14 06:43:51 UTC
(In reply to vr00m from comment #28)
> I experienced issues after upgrading kernel from 5.1 to 5.2 on my notebook
> with 2500 U. I tried kernel boot param iommu=soft and that fixed it.


I've encountered this issue with kernel 5.2 (tried 5.2.8 just now) and also have a Ryzen 5 2500U notebook (Huawei Matebook D 14" (AMD)). Running Manjaro. The login screen appears fine, but after that, black screen. I know nothing's locked up because I was able to launch GZDoom from typing in the dark in the whisker menu and heard the sounds of Doom, or at least the title screen.
Comment 30 Andrey Grodzovsky 2019-08-14 19:06:17 UTC
(In reply to Sergey Kondakov from comment #27)
> Created attachment 284153 [details]
> dmesg_2019-08-04-amdgpu-new_dereference-with-shadowprimary
> 
> So, I've been using explicitly disabled "EnablePageFlip" and "TearFree"
> options as workaround for the original dereference but then decided to try
> out "ShadowPrimary" during fiddling with mvtools' motion-interpolation
> optimization in mpv, since page flipping is disabled anyway. But the result
> was ANOTHER null pointer dereference mere seconds after login:
> BUG: kernel NULL pointer dereference, address: 0000000000000008
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0 
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 1 PID: 3272 Comm: X:cs0 Tainted: G          IO     
> 5.2.5-1407.g79b6a9c-HSF #1 openSUSE Tumbleweed
> Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS
> F14e 09/09/2014
> RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu]
> Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 4c
> 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c 24 08 31
> f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92
> RSP: 0018:ffffafc2478aba10 EFLAGS: 00010246
> RAX: ffff98742e20e670 RBX: ffff98742e20e658 RCX: ffff98744fc66040
> RDX: ffff98744fc66000 RSI: ffff98742e20e638 RDI: ffff9873a295f800
> RBP: ffff987459e00000 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: ffffafc2478abb58 R14: ffff98744fc66000 R15: ffffafc2478abb58
> FS:  00007f3ee03d7700(0000) GS:ffff98746de00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 00000003f27aa000 CR4: 00000000000406e0
> Call Trace:
>  amdgpu_cs_vm_handling+0x308/0x440 [amdgpu]
>  amdgpu_cs_ioctl+0x154/0xa10 [amdgpu]
>  ? amdgpu_cs_vm_handling+0x440/0x440 [amdgpu]
>  drm_ioctl_kernel+0xaa/0xf0
>  drm_ioctl+0x208/0x385
>  ? amdgpu_cs_vm_handling+0x440/0x440 [amdgpu]
>  ? _raw_spin_unlock_irqrestore+0x59/0x70
>  ? preempt_count_sub+0x98/0xe0
>  ? _raw_spin_unlock_irqrestore+0x46/0x70
>  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
>  do_vfs_ioctl+0x3ed/0x720
>  ? __fget+0xf9/0x1b0
>  ksys_ioctl+0x5e/0x90
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x66/0xc0
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f3ee641c7c7
> Code: 00 00 90 48 8b 05 d1 86 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff
> ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff
> 73 01 c3 48 8b 0d a1 86 0c 00 f7 d8 64 89 01 48
> RSP: 002b:00007f3ee03d6a08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00007f3ee03d6a70 RCX: 00007f3ee641c7c7
> RDX: 00007f3ee03d6a70 RSI: 00000000c0186444 RDI: 000000000000000e
> RBP: 00000000c0186444 R08: 00007f3ee03d6b80 R09: 0000000000000020
> R10: 00007f3ee03d6b80 R11: 0000000000000246 R12: 0000000000000000
> R13: 000000000000000e R14: 000055d55e6f8bf0 R15: 000055d55e6f91a8
> Modules linked in: af_packet xt_pkttype xt_string nf_nat_ftp
> nf_conntrack_ftp xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack
> ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security
> iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables
> scsi_transport_iscsi ip6table_filter ip6_tables iptable_filter ip_tables
> x_tables bpfilter snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq
> snd_pcm_oss snd_mixer_oss msr bnep it87 hwmon_vid zram amd64_edac_mod
> edac_mce_amd kvm_amd kvm rc_avermedia tuner_simple tuner_types irqbypass
> tuner tda7432 btusb btrtl btbcm btintel tvaudio msp3400 bluetooth
> snd_usb_audio ath9k joydev bttv ath9k_common snd_usbmidi_lib tea575x
> ath9k_hw tveeprom snd_rawmidi videobuf_dma_sg mxm_wmi wmi_bmof pcspkr ath
> videobuf_core snd_seq_device k10temp fam15h_power rc_core
> snd_hda_codec_realtek v4l2_common snd_hda_codec_generic
>  sp5100_tco snd_hda_codec_hdmi ledtrig_audio mac80211 amdgpu videodev media
> i2c_piix4 snd_hda_intel cfg80211 snd_hda_codec r8169 snd_hda_core realtek
> snd_hwdep libphy snd_pcm gpu_sched rfkill ttm mac_hid hid_generic usbhid uas
> usb_storage ohci_pci serio_raw sd_mod ehci_pci ohci_hcd ehci_hcd xhci_pci
> xhci_hcd wmi exfat(O) l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel
> udp_tunnel pppox ppp_generic slhc vhba(O) uinput sg nbd dm_multipath
> scsi_dh_rdac scsi_dh_emc scsi_dh_alua ecryptfs
> CR2: 0000000000000008
> ---[ end trace a7f0ed14134a76ad ]---
> RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu]
> Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 4c
> 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c 24 08 31
> f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92
> RSP: 0018:ffffafc2478aba10 EFLAGS: 00010246
> RAX: ffff98742e20e670 RBX: ffff98742e20e658 RCX: ffff98744fc66040
> RDX: ffff98744fc66000 RSI: ffff98742e20e638 RDI: ffff9873a295f800
> RBP: ffff987459e00000 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> R13: ffffafc2478abb58 R14: ffff98744fc66000 R15: ffffafc2478abb58
> FS:  00007f3ee03d7700(0000) GS:ffff98746de00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 00000003f27aa000 CR4: 00000000000406e0

Sergey, I tried to reproduce your latest issue on Ellesmere (Polaris 10) with "ShadowPrimary" enabled and page flipping disabled, and didn't observe any crash.
In case you built your own kernel can you give me the output of this command -

Run gdb on amdgpu.ko
gdb drivers/gpu/drm/amd/amdgpu/amdgpu.ko

Then do - 
list *(amdgpu_vm_update_directories+0xe7)
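
For anyone repeating this on a different oops, the gdb expression can be derived mechanically from the "RIP:" line. A minimal sketch in Python follows; the helper name gdb_list_command and the regular expression are made up for illustration and are not part of any kernel tooling. It assumes the RIP line has the symbol+offset/size form shown in the traces above, and it returns None for user-space RIP lines that carry only a raw address.

```python
import re

# Matches the symbolic part of a kernel oops "RIP:" line, for example
# "RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu]".
# The pattern is an illustrative assumption, not an official format spec.
RIP_RE = re.compile(
    r"RIP:\s+[0-9a-f]{4}:(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def gdb_list_command(rip_line):
    """Return (module, gdb command) for a symbolic RIP line, else None."""
    match = RIP_RE.search(rip_line)
    if match is None:
        # For example a user-space RIP such as "RIP: 0033:0x7f3ee641c7c7".
        return None
    command = "list *({}+{})".format(match.group("symbol"), match.group("offset"))
    return match.group("module"), command


line = "RIP: 0010:amdgpu_vm_update_directories+0xe7/0x260 [amdgpu]"
print(gdb_list_command(line))
```

The returned command is meant to be typed into gdb after loading the module object that matches the running kernel; the result is only meaningful if the .ko comes from exactly the same build and carries debug info.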
Comment 31 Sergey Kondakov 2019-08-15 22:05:22 UTC
(In reply to Andrey Grodzovsky from comment #30)
> (In reply to Sergey Kondakov from comment #27)
> 
> Sergey, I tried to reproduce your latest issue on Ellesmere (Polaris 10) with
> "ShadowPrimary" enabled and page flipping disabled, and didn't observe any crash.
> In case you built your own kernel can you give me the output of this command
> -
> 
> Run gdb on amdgpu.ko
> gdb drivers/gpu/drm/amd/amdgpu/amdgpu.ko
> 
> Then do - 
> list *(amdgpu_vm_update_directories+0xe7)

The crash may take a while (hours) to manifest and requires some video-watching via Firefox and/or mpv (with '--opengl-pbo' option on opengl-hq profile). It also may or may not need VAAPI to be used ('--hwdec=vaapi-copy' in case of mpv).

My kernel is built on the OBS build server, so I had to enable debuginfo packaging and rebuild it. The debuginfo package then used up a mind-boggling 5.1 GB of space, leaving me with a measly ~400 MB on /! After that I managed to get this:
0x2e127 is in amdgpu_vm_update_directories (../drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:1191).
where line #1191 is:
struct amdgpu_bo *bo = parent->base.bo, *pbo;

But it is a different build of the kernel, so I don't know if this is even relevant. I'm not going to keep this monstrosity around. You may check out the packages at https://build.opensuse.org/package/binaries/home:X0F:HSF:Kernel/kernel-HSF/standard - they have pretty much all kernel modules that x86_64 supports, so it should run anywhere.
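
Independently of which build gdb is pointed at, the oops itself can be cross-checked. The "Code:" line marks the first byte of the faulting instruction with angle brackets. The sketch below (Python, with a hypothetical helper name split_code_line) splits the dump at that marker. The bytes at the marker, 4d 8b 6c 24 08, appear by hand decoding to be a 64-bit load from 8 bytes past R12 into R13; that decoding is an assumption here and is worth confirming with objdump or the kernel tree's scripts/decodecode. With R12 = 0 in the register dump, the load address would be 0x8, which matches CR2.

```python
# "Code:" line copied from the oops in this report.
CODE_LINE = (
    "Code: 89 08 48 8d 4a 40 48 89 48 08 48 89 42 40 48 8b 78 f0 c6 40 10 00 "
    "4c 8b a7 80 06 00 00 4d 85 e4 74 08 4d 8b a4 24 40 04 00 00 <4d> 8b 6c "
    "24 08 31 f6 49 8b 95 80 06 00 00 48 85 d2 74 0f 48 8b 92"
)


def split_code_line(code_line):
    """Split an oops code dump into (bytes before, bytes from) the <..> marker."""
    before, after = [], []
    seen_marker = False
    for token in code_line.split():
        if token in ("Code:", ">"):
            continue  # skip the label and any quote prefix
        if token.startswith("<") and token.endswith(">"):
            seen_marker = True
            token = token[1:-1]
        (after if seen_marker else before).append(int(token, 16))
    return bytes(before), bytes(after)


before, after = split_code_line(CODE_LINE)
faulting = after[:5]
print(len(before), " ".join("{:02x}".format(b) for b in faulting))

# Values taken from the register dump in the oops.
r12 = 0x0
cr2 = 0x8
# Assumption: the last byte of the faulting instruction is its 8-bit displacement.
displacement = faulting[4]
print(r12 + displacement == cr2)
```

If the decoding holds, the crash is a read through a NULL structure pointer at field offset 8, which is at least consistent with a pointer chase like the one on the reported source line, though the line number comes from a different build.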
Comment 32 Sergey Kondakov 2019-08-17 05:13:28 UTC
Just got exactly the same 0010:amdgpu_vm_update_directories+0xe7/0x260 dereference immediately on login even with PageFlip & TearFree disabled and ShadowPrimary NOT enabled. Even with all the same addresses as before. So, now I'm not sure about what actually triggers it. However, my setup is as non-default as it gets:
amdgpu has these parameters: cik_support=1 si_support=1 msi=1 sched_policy=1 compute_multipipe=1 gartsize=1024 vm_fragment_size=9 max_num_of_queues_per_device=65536 sched_hw_submission=32 sched_jobs=1024 job_hang_limit=8000 halt_if_hws_hang=1 vm_fault_stop=0 vm_update_mode=3 vm_size=20 disp_priority=2 deep_color=1 gpu_recovery=1
irqbalance is enabled with interval=1 and rtirq has this:
RTIRQ_NAME_LIST="timer rtc snd drm amdgpu radeon i915 nvidia usb i8042 ahci"
RTIRQ_HIGH_LIST="watchdogd oom_reaper rcu_preempt rcu_sched rcu_bh rcub rcuc gfx sdma ksoftirqd khugepaged"
RTIRQ_PRIO_HIGH=80
RTIRQ_PRIO_DECR=2
RTIRQ_PRIO_LOW=50
RTIRQ_RESET_ALL=0
to boost amdgpu's processes to the highest RT/FIFO priorities in the hope of avoiding video stuttering and audio x-runs under full load. Transparent hugepages are enabled in an attempt to spare the crappy AMD FX's TLB cache and MMU (hence the vm_fragment_size=9).

Maybe it's the non-default vm_update_mode that does it. Also, a few kernel versions back the default GART size of 256MB was triggering some kind of fault, probably a stall and reset; maybe it still does, but I'm not going to check. Or maybe it's all irrelevant.
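
To narrow down which of these options matter, it can help to confirm what the loaded module actually ended up with, since the kernel exposes module parameters under /sys/module/<module>/parameters/. A rough sketch in Python follows; the function names are made up for illustration. It assumes the requested options are given as a space-separated name=value string like the one above, and it compares values as plain strings, so a parameter that sysfs prints in a different form (for example Y/N for booleans) will show up as a difference even when it is effectively the same.

```python
from pathlib import Path


def parse_option_string(options):
    """Parse 'name=value name=value ...' into a dict of strings."""
    return dict(item.split("=", 1) for item in options.split())


def read_module_parameters(module, sysfs_root="/sys/module"):
    """Read current parameter values of a loaded module from sysfs."""
    params_dir = Path(sysfs_root) / module / "parameters"
    values = {}
    if not params_dir.is_dir():
        return values  # module not loaded, or it has no parameters
    for entry in sorted(params_dir.iterdir()):
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            values[entry.name] = None  # not readable by this user
    return values


def differing(requested, actual):
    """Return {name: (requested, actual)} for every requested value not matched."""
    return {
        name: (value, actual.get(name))
        for name, value in requested.items()
        if actual.get(name) != value
    }


requested = parse_option_string("vm_update_mode=3 vm_size=20 gartsize=1024 vm_fragment_size=9")
actual = read_module_parameters("amdgpu")
for name, (want, got) in differing(requested, actual).items():
    print("{}: requested {}, sysfs reports {}".format(name, want, got))
```

Dropping one non-default option at a time and re-checking with something like this would show whether vm_update_mode alone is enough to trigger the dereference.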
Comment 33 Nicholas Kazlauskas 2019-08-19 13:39:26 UTC
(In reply to Sergey Kondakov from comment #26)
> Created attachment 284083 [details]
> dmesg_2019-08-02-amdgpu_fail_on_patched_5.2.5
> 
> (In reply to Nicholas Kazlauskas from comment #24)
> > This should be fixed with the series linked below:
> > 
> > https://patchwork.freedesktop.org/series/64505/
> > 
> > But it still needs review and backporting to older kernels.
> 
> Celebration might have been premature. Hours later I've got another freeze
> with different error in amdgpu. Only this time, mouse cursor was movable
> over frozen frame right until I tried switching VT. Here's trace:
> BUG: unable to handle page fault for address: 0000000800000184
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0 
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 2 PID: 21044 Comm: kworker/u16:0 Tainted: G        W IO     
> 5.2.5-1396.g79b6a9c-HSF #1 openSUSE Tumbleweed (unreleased)
> Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS
> F14e 09/09/2014
> Workqueue: events_unbound commit_work
> RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2e6/0xd60 [amdgpu]

Are you able to consistently reproduce this issue? Is it the same setup and same conditions as before? I haven't been able to see it in my testing at least.
Comment 34 Sergey Kondakov 2019-08-19 15:11:14 UTC
(In reply to Nicholas Kazlauskas from comment #33)
> (In reply to Sergey Kondakov from comment #26)
> > Created attachment 284083 [details]
> > dmesg_2019-08-02-amdgpu_fail_on_patched_5.2.5
> > 
> > (In reply to Nicholas Kazlauskas from comment #24)
> > > This should be fixed with the series linked below:
> > > 
> > > https://patchwork.freedesktop.org/series/64505/
> > > 
> > > But it still needs review and backporting to older kernels.
> > 
> > Celebration might have been premature. Hours later I've got another freeze
> > with different error in amdgpu. Only this time, mouse cursor was movable
> > over frozen frame right until I tried switching VT. Here's trace:
> > BUG: unable to handle page fault for address: 0000000800000184
> > #PF: supervisor read access in kernel mode
> > #PF: error_code(0x0000) - not-present page
> > PGD 0 P4D 0 
> > Oops: 0000 [#1] PREEMPT SMP NOPTI
> > CPU: 2 PID: 21044 Comm: kworker/u16:0 Tainted: G        W IO     
> > 5.2.5-1396.g79b6a9c-HSF #1 openSUSE Tumbleweed (unreleased)
> > Hardware name: Gigabyte Technology Co., Ltd. GA-990XA-UD3/GA-990XA-UD3, BIOS
> > F14e 09/09/2014
> > Workqueue: events_unbound commit_work
> > RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2e6/0xd60 [amdgpu]
> 
> Are you able to consistently reproduce this issue? Is it the same setup and
> same conditions as before? I haven't been able to see it in my testing at
> least.

Yes, just having PageFlip enabled in amdgpu guarantees it. Changing anything other than PageFlip doesn't seem to affect it. Forcing TearFree on with PageFlip disabled may also trigger it, I think. You may try my previously linked kernel build in your testing, but I doubt that it has anything specific to this issue.

It may not be reproducible with the modesetting X driver because it fails to engage page flipping on init and throws a bunch of errors about it in Xorg.0.log. For some reason I'm unable to use the modesetting X driver at all: even with page flipping disabled, it draws only the mouse cursor on a black background instead of the sddm login screen. So I have to use amdgpu with PageFlip and TearFree explicitly disabled. But then another, rarer 0010:amdgpu_vm_update_directories+0xe7/0x260 dereference may happen regardless (which I suspect is connected with the vm_update_mode option, unlike the first one).

By the way, is there any disadvantage in forcing TearFree to be always on when it works? Like an additional frame of latency or something like that?
