Bug 207383

Summary: [Regression] 5.7 amdgpu/polaris11 gpf: amdgpu_atomic_commit_tail
Product: Drivers Reporter: Duncan (1i5t5.duncan)
Component: Video (DRI - non Intel)    Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX    
Severity: blocking CC: absusrex, akanar, akpm, alexdeucher, anthony.ruhier, barry, bugzKernel, chancuan66, christian.koenig, code, fabianm88, harry.wentland, jeremy, jlp.bugs, kees, kode54, laser.eyess.trackers, maciej.stanczew+b, mnrzk, mphantomx, nicholas.kazlauskas, pmenzel+bugzilla.kernel.org, rtmasura+kernel, stan, strzol, sunpeng.li, tyson, zzyxpaw
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.7-rc1 - 5.7 - 5.8-rc5+, fixed in 5.8.0, 5.7.13, 5.4.56 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: kernel config
automated boot-time dmesg dump
Partial git bisect log
Updated partial git bisect log
Another partial git bisect log update
Partial git bisect log update #3
KASAN Use-after-free
Possible bug fix #1
0001-drm-amd-display-Force-add-all-CRTCs-to-state-when-us.patch
drm/amd/display: Clear dm_state for fast updates

Description Duncan 2020-04-21 09:51:33 UTC
Created attachment 288649 [details]
kernel config

5.7-rc1 and rc2 regression from kernel 5.6.0

After starting X/plasma on 5.7-rc1 and rc2, the system runs for anywhere from a few seconds to a few hours, then the display freezes.  The pointer remains movable and audio continues to play for some seconds, but they eventually stop as well.  The kernel remains alive at least enough to reboot with SysRq-b; not sure whether the preceding SysRq keys have any effect or not.
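For readers unfamiliar with the shorthand: "SRQ" here is the magic SysRq key. A short sketch of checking and enabling it, assuming a typical Linux console with root (paths per the kernel's sysrq documentation; whether your distro enables it by default is an assumption to verify):

```shell
# Check which SysRq functions are enabled (bitmask; 1 = all, 0 = none)
cat /proc/sys/kernel/sysrq
# Enable all functions while debugging a hang like this one (root required)
echo 1 | sudo tee /proc/sys/kernel/sysrq
# Then at the frozen console:
#   Alt+SysRq+s  sync filesystems
#   Alt+SysRq+e  terminate processes
#   Alt+SysRq+b  immediate reboot, no sync
```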

Sometimes, but not always, there's a gpf left in the log, appearing to confirm it's amdgpu (the -dirty is simply a patch making mounts noatime by default):

Apr 20 03:25:55 h2 kernel: general protection fault, probably for non-canonical address 0xc1316515e40a92f6: 0000 [#1] SMP
Apr 20 03:25:55 h2 kernel: CPU: 3 PID: 3921 Comm: kworker/u16:5 Tainted: G                T 5.7.0-rc2-dirty #194
Apr 20 03:25:55 h2 kernel: Hardware name: Gigabyte Technology Co., Ltd. GA-990FXA-UD3/GA-990FXA-UD3, BIOS F6 03/30/2012
Apr 20 03:25:55 h2 kernel: Workqueue: events_unbound commit_work
Apr 20 03:25:55 h2 kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x102d/0x1fd8
Apr 20 03:25:55 h2 kernel: Code: 48 89 9d a0 fc ff ff 8b 90 e0 02 00 00 85 d2 0f 85 26 f1 ff ff 48 8b 85 e0 fc ff ff 48 89 85 a0 fc ff ff 48 8b b5 e0 fc ff ff <80> be b0 01 00 00 01 0f 86 b4 00 00 00 31 c0 48 b9 00 00 00 00 01
Apr 20 03:25:55 h2 kernel: RSP: 0018:ffffc9000216bad0 EFLAGS: 00010286
Apr 20 03:25:55 h2 kernel: RAX: ffff88842a6e1000 RBX: ffff8883d1d5b800 RCX: ffff8884283db200
Apr 20 03:25:55 h2 kernel: RDX: ffff8884283db2e0 RSI: c1316515e40a92f6 RDI: 0000000000000002
Apr 20 03:25:55 h2 kernel: RBP: ffffc9000216be50 R08: 0000000000000001 R09: 0000000000000001
Apr 20 03:25:55 h2 kernel: R10: 0000000000030000 R11: 0000000000000000 R12: 0000000000000000
Apr 20 03:25:55 h2 kernel: R13: 0000000000000005 R14: ffff88842bb76000 R15: ffff88841c08cc00
Apr 20 03:25:55 h2 kernel: FS:  0000000000000000(0000) GS:ffff88842ecc0000(0000) knlGS:0000000000000000
Apr 20 03:25:55 h2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 20 03:25:55 h2 kernel: CR2: 000078617de4fffc CR3: 000000040ca0e000 CR4: 00000000000406e0
Apr 20 03:25:55 h2 kernel: Call Trace:
Apr 20 03:25:55 h2 kernel:  ? 0xffffffff81000000
Apr 20 03:25:55 h2 kernel:  ? __switch_to_asm+0x34/0x70
Apr 20 03:25:55 h2 kernel:  ? __switch_to_asm+0x40/0x70
Apr 20 03:25:55 h2 kernel:  ? __switch_to_asm+0x34/0x70
Apr 20 03:25:55 h2 kernel:  ? __switch_to_asm+0x40/0x70
Apr 20 03:25:55 h2 kernel:  ? commit_tail+0x8e/0x120
Apr 20 03:25:55 h2 kernel:  ? process_one_work+0x1a9/0x300
Apr 20 03:25:55 h2 kernel:  ? worker_thread+0x45/0x3b8
Apr 20 03:25:55 h2 kernel:  ? kthread+0xf3/0x130
Apr 20 03:25:55 h2 kernel:  ? process_one_work+0x300/0x300
Apr 20 03:25:55 h2 kernel:  ? __kthread_create_on_node+0x180/0x180
Apr 20 03:25:55 h2 kernel:  ? ret_from_fork+0x22/0x40
Apr 20 03:25:55 h2 kernel: ---[ end trace 33869116def8e8ad ]---
Apr 20 03:25:55 h2 kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x102d/0x1fd8
Apr 20 03:25:55 h2 kernel: Code: 48 89 9d a0 fc ff ff 8b 90 e0 02 00 00 85 d2 0f 85 26 f1 ff ff 48 8b 85 e0 fc ff ff 48 89 85 a0 fc ff ff 48 8b b5 e0 fc ff ff <80> be b0 01 00 00 01 0f 86 b4 00 00 00 31 c0 48 b9 00 00 00 00 01
Apr 20 03:25:55 h2 kernel: RSP: 0018:ffffc9000216bad0 EFLAGS: 00010286
Apr 20 03:25:55 h2 kernel: RAX: ffff88842a6e1000 RBX: ffff8883d1d5b800 RCX: ffff8884283db200
Apr 20 03:25:55 h2 kernel: RDX: ffff8884283db2e0 RSI: c1316515e40a92f6 RDI: 0000000000000002
Apr 20 03:25:55 h2 kernel: RBP: ffffc9000216be50 R08: 0000000000000001 R09: 0000000000000001
Apr 20 03:25:55 h2 kernel: R10: 0000000000030000 R11: 0000000000000000 R12: 0000000000000000
Apr 20 03:25:55 h2 kernel: R13: 0000000000000005 R14: ffff88842bb76000 R15: ffff88841c08cc00
Apr 20 03:25:55 h2 kernel: FS:  0000000000000000(0000) GS:ffff88842ecc0000(0000) knlGS:0000000000000000
Apr 20 03:25:55 h2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 20 03:25:55 h2 kernel: CR2: 000078617de4fffc CR3: 000000040ca0e000 CR4: 00000000000406e0

That's it.  Nothing in the log since boot before, and the next entry is after reboot.

gcc version 9.3.0 on Gentoo.  AMD fx6100 on the Gigabyte board in the log above.    
xorg-server 1.20.8, mesa 20.0.4, xf86-video-amdgpu 19.1.0, linux-firmware 20200413

kernel config attached
Comment 1 Duncan 2020-04-21 09:57:59 UTC
Created attachment 288651 [details]
automated boot-time dmesg dump
Comment 2 Duncan 2020-04-21 10:04:28 UTC
I build kernels from git and can apply testing patches as necessary.  I may bisect, but haven't yet; it'd take a while and may not be reliable, as the trigger time is variable.  Plus of course I can't do anything I don't want interrupted while attempting to bisect.  So I'm hoping the Polaris 11 identification, the log, and pinning it to v5.6..v5.7-rc1 is enough.
Comment 3 Duncan 2020-04-23 04:59:01 UTC
CCed the two addresses from MAINTAINERS that Bugzilla would let me add.  It wouldn't let me add amd-gfx@ or david1.zhou@, and Alex's gmail address according to Bugzilla isn't what's in MAINTAINERS.
Comment 4 Duncan 2020-04-27 19:24:54 UTC
Still there with 5.7-rc3, altho /maybe/ it's not triggering as quickly.  It took 13 hours to trigger this time, and I'd almost decided it was fixed since it had been triggering sooner than that, but that could simply be luck.  Rebooted to rc3 again.  We'll see...
Comment 5 Duncan 2020-04-27 19:42:32 UTC
Well, that didn't take long.  Four konsole terminals open to do (various aspects of) a system update.  Just a few seconds after I entered the (git-based) sync command, display-FREEZE!

Back on 5.6.0 now.  I'll probably test again with rc4, perhaps earlier if I see a set of drm/amdgpu updates in mainline git.
Comment 6 Alex Deucher 2020-04-27 19:43:50 UTC
Can you bisect?
Comment 7 Duncan 2020-05-01 08:20:43 UTC
Bisecting, but it's slow going when the bug can take 12+ hours to trigger, and even then I can't be sure a "good" is actually so.

So far (at 5.6.0-01623-g12ab316ce, ~7 bisect steps to go, under 100 commits "after"), the first few steps were all "good".  The one I'm currently testing obviously isn't "bad" in terms of this bug yet, but it does display a nasty buffer-sync issue with off-frame read-outs and eventual firefox crashes when trying to play 4k@30fps youtube.  That's a bit of a struggle with this kit but usually OK (it's 4k@60fps that's the real problem in firefox/chromium, tho that tends to be fine without the browser overhead in mpv/smplayer/vlc).

But I hadn't seen that issue with the full 5.7-rc1 thru rc3, so it was apparently already fixed by rc1.  And so far no incidents of this bug (full system or full graphics lockups with a gpf in amdgpu_dm_atomic_commit_tail) during the bisect.
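For reference, the bisect mechanics being described can be sketched against a small throwaway repo. Everything below is hypothetical demo data; in the real case the repo is the kernel tree and each good/bad verdict means building, rebooting, and using the desktop for hours:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name Demo
# 20 commits; pretend commit 13 introduced the regression
for i in $(seq 1 20); do
    echo "$i" > file
    git add file
    git commit -qm "commit $i"
done
first=$(git rev-list --max-parents=0 HEAD)
# bad = HEAD, good = the first commit; git picks midpoints from there
git bisect start HEAD "$first"
# Automate the verdict: exit 0 = good, nonzero = bad.  In the real case
# this step is "run plasma until it crashes or survives long enough".
git bisect run sh -c 'test "$(cat file)" -lt 13' >/dev/null 2>&1
bad=$(git show -s --format=%s refs/bisect/bad)
git bisect log > bisect.log   # the "partial git bisect log" attachments
echo "$bad"                   # → commit 13
```

The variable trigger time is exactly what makes this dangerous: one false "good" verdict sends the bisect up the wrong side of the tree, which is why each step here needs days of confirmation.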
Comment 8 Duncan 2020-05-01 08:28:50 UTC
Hmm.  I don't think I've mentioned on this bug yet that I'm running dual 4K TVs as monitors.  So it may only trigger with dual displays, and two 4K displays means the card is pumping a lot more pixels than most, too.
Comment 9 Duncan 2020-05-02 16:03:21 UTC
I'm not there yet, but it's starting to look like a possibly dud bisect: everything is showing good so far.  Maybe I didn't wait long enough for the bug to trigger at some step and I'm running up the wrong side of the tree, or maybe it's not drm after all (I thought I'd try something new and limit the bisect paths to drivers/gpu/drm/ and include/drm/, but that may have been a critical mistake).  Right now there are only 3-4 even remotely reasonable candidates (out of 14 left to test... the rest being mediatek or similar):

4064b9827
Peter Xu
mm: allow VM_FAULT_RETRY for multiple times

6bfef2f91
Jason Gunthorpe
mm/hmm: remove HMM_FAULT_SNAPSHOT

17ffdc482
Christoph Hellwig
mm: simplify device private page handling in hmm_range_fault

And maybe (but I'm neither EFI nor 32-bit)

72e0ef0e5
Mikel Rychliski
PCI: Use ioremap(), not phys_to_virt() for platform ROM


Meanwhile, user-side I've gotten vulkan/mesa/etc updates recently.  I'm considering checking out linus-master/HEAD again, doing a pull, and seeing if by chance either the last week's kernel updates or the user-side updates have eliminated the problem.  If not, I can come back and finish the bisect (or try just reverting those four on current linus-master/HEAD) before starting a new clean bisect if necessary.  I've just saved the bisect log and current pointer.
Comment 10 Duncan 2020-05-03 15:10:59 UTC
(In reply to Duncan from comment #9)
> I'm not there yet but it's starting to look like a possibly dud bisect:
> everything showing good so far

Good but not ideal news!

I did get an apparent graphics crash at the bisect-point above, but it didn't dump anything in the log this time, and the behavior was a bit different from usual for this bug: audio continued playing longer, and I was able to confirm SysRq-e termination via the audio and cpu fan, and SysRq-s sync via the sata-activity LED.

So I'm not sure whether it's the same bug or a different one; I'm bisecting pre-rc1 code after all, so other bugs aren't unlikely.

So I've rebooted to the same bisect step to try again, with any luck getting that gpf dump in the log to confirm it's the same bug this time.

If it *is* the same bug, it looks like I avoided a dud bisect after all; it just happened to be all good until almost the very end.  I'm only a few steps away from pinning it down, and it's almost certainly one of the commits listed in comment #9. =:^)

> Meanwhile, user-side I've gotten vulkan/mesa/etc updates recently.  I'm
> considering checking out linus-master/HEAD again, doing a pull, and seeing
> if by chance either the last week's kernel updates or the user-side updates
> have eliminated the problem.

Been there, done that, still had the bug, with gpf-log-dump confirmation.  Back to the bisect.
Comment 11 Duncan 2020-05-05 04:23:26 UTC
(In reply to Duncan from comment #10)
> I did get an apparent graphics crash at the bisect-point above, but it
> didn't dump anything in the log this time

Got a gpf dump with amdgpu_atomic_commit_tail, confirming it's the same bug.  Still a couple of bisect steps to go, but the EFI candidate is out now, leaving only three (plus mediatek and nouveau, and an amdgpu commit that says it was a doc fix only).  The current round is testing between 406 and the 6bf/17f pair, so I should eliminate at least one of the three this round:

4064b9827
Peter Xu
mm: allow VM_FAULT_RETRY for multiple times

6bfef2f91
Jason Gunthorpe
mm/hmm: remove HMM_FAULT_SNAPSHOT

17ffdc482
Christoph Hellwig
mm: simplify device private page handling in hmm_range_fault
Comment 12 Duncan 2020-05-06 17:46:44 UTC
OK, bisect says:

4064b9827
Peter Xu
mm: allow VM_FAULT_RETRY for multiple times

... which came in via akpm and touches drivers/gpu/drm/ttm/ttm_bo_vm.c.

But I'm not entirely confident in that result ATM.  Among other things, I had set ZSWAP_DEFAULT_ON for 5.7, and I had zswap configured but not active previously, so that could be it too.  I'm not typically under enough memory pressure to trigger it, but...

Luckily a revert patch generated with git show -R still applies cleanly on current master (5.7.0-rc4-00029-gdc56c5acd, tho I've only built it, not rebooted to test it yet), so I can now test both the commit revert and the changed zswap options.

So I'm confirming still.

But perhaps take another look at that commit and see if there's some way that allowing unlimited VM_FAULT_RETRY could leave drm, at least on amdgpu, eternally stalled; that does seem to fit the symptoms, whether unlimited VM_FAULT_RETRY is the cause or not.
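The revert-and-retest step mentioned here (generating a reverse patch with git show -R, then applying it without rewriting history) can be sketched in a throwaway repo. The repo and commits below are hypothetical stand-ins for the kernel tree and the suspect commit:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q demo && cd demo
git config user.email demo@example.com
git config user.name Demo
echo "v1" > file; git add file; git commit -qm "base"
echo "v2" > file; git commit -qam "suspect commit"
# Generate a reverse patch from the suspect commit and apply it to the
# tree; the commit itself stays in history, only its change is undone.
git show -R HEAD > revert.patch
git apply revert.patch
cat file   # → v1
```

Applying the reverse patch to the working tree (rather than `git revert`, which creates a new commit) keeps the test kernel's version string clean while checking whether the suspect commit is really the culprit.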
Comment 13 Duncan 2020-05-06 22:06:49 UTC
Well, so much for /that/ bisect!  It took me a few hours, but then I had the graphics stall twice in a few minutes... with the above commit reverted AND with memory compression off.

So it's back to square one, except I know that my originally chosen new memory compression options aren't involved.  New bisect time.
Comment 14 Duncan 2020-06-03 00:04:49 UTC
Unfortunately the bug's still there in 5.7 release. =:^(

Not properly bisected yet: after the first bisect failure I needed something reasonably stable for a while, as I had about a dozen live-git kde-plasma userspace bugs to track down and report.  Kernel 5.6.0-07388-gf365ab31e has been exactly that, stable for me, for weeks now (built May 6), and the bug definitely triggered in 5.7-rc1, so it's got to be between those.  With the unrelated userspace side mostly fixed now, and this kernelspace bug now known to remain unfixed in the normal development cycle, maybe I can get back to bisecting it.
Comment 15 Duncan 2020-06-21 07:01:42 UTC
Bug's in v5.8-rc1-226-g4333a9b0b too.
Comment 16 rtmasura+kernel 2020-06-22 15:20:33 UTC
Reporting I've had the same issue with kernel 5.7.2 and 5.7.4:

Jun 22 07:10:24 abiggun kernel: general protection fault, probably for non-canonical address 0xd3d74027d6d8fad4: 0000 [#1] PREEMPT SMP NOPTI
Jun 22 07:10:24 abiggun kernel: CPU: 0 PID: 32680 Comm: kworker/u12:9 Not tainted 5.7.4-arch1-1 #1
Jun 22 07:10:24 abiggun kernel: Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 1102    08/24/2010
Jun 22 07:10:24 abiggun kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Jun 22 07:10:24 abiggun kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2aa/0x2310 [amdgpu]
Jun 22 07:10:24 abiggun kernel: Code: 4f 08 8b 81 e0 02 00 00 41 83 c5 01 44 39 e8 0f 87 46 ff ff ff 48 83 bd f0 fc ff ff 00 0f 84 03 01 00 00 48 8b bd f0 f>
Jun 22 07:10:24 abiggun kernel: RSP: 0018:ffffb0cc421abaf8 EFLAGS: 00010286
Jun 22 07:10:24 abiggun kernel: RAX: 0000000000000006 RBX: ffffa21b8e16c400 RCX: ffffa21cab9c8800
Jun 22 07:10:24 abiggun kernel: RDX: ffffa21ca7326200 RSI: ffffffffc10de1a0 RDI: d3d74027d6d8fad4
Jun 22 07:10:24 abiggun kernel: RBP: ffffb0cc421abe60 R08: 0000000000000001 R09: 0000000000000001
Jun 22 07:10:24 abiggun kernel: R10: 00000000000002be R11: 00000000001c57a1 R12: 0000000000000000
Jun 22 07:10:24 abiggun kernel: R13: 0000000000000006 R14: ffffa218e4959800 R15: ffffa219e5b12780
Jun 22 07:10:24 abiggun kernel: FS:  0000000000000000(0000) GS:ffffa21cbfc00000(0000) knlGS:0000000000000000
Jun 22 07:10:24 abiggun kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 07:10:24 abiggun kernel: CR2: 00007fec2b573008 CR3: 0000000344bd8000 CR4: 00000000000006f0
Jun 22 07:10:24 abiggun kernel: Call Trace:
Jun 22 07:10:24 abiggun kernel:  ? cpumask_next_and+0x19/0x20
Jun 22 07:10:24 abiggun kernel:  ? update_sd_lb_stats.constprop.0+0x115/0x8f0
Jun 22 07:10:24 abiggun kernel:  ? __update_load_avg_cfs_rq+0x277/0x2f0
Jun 22 07:10:24 abiggun kernel:  ? update_load_avg+0x58f/0x660
Jun 22 07:10:24 abiggun kernel:  ? update_curr+0x108/0x1f0
Jun 22 07:10:24 abiggun kernel:  ? __switch_to_asm+0x34/0x70
Jun 22 07:10:24 abiggun kernel:  ? __switch_to_asm+0x40/0x70
Jun 22 07:10:24 abiggun kernel:  ? __switch_to_asm+0x34/0x70
Jun 22 07:10:24 abiggun kernel:  ? __switch_to_asm+0x40/0x70
Jun 22 07:10:24 abiggun kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 22 07:10:24 abiggun kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
Jun 22 07:10:24 abiggun kernel:  process_one_work+0x1da/0x3d0
Jun 22 07:10:24 abiggun kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 22 07:10:24 abiggun kernel:  worker_thread+0x4d/0x3e0
Jun 22 07:10:24 abiggun kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 22 07:10:24 abiggun kernel:  kthread+0x13e/0x160
Jun 22 07:10:24 abiggun kernel:  ? __kthread_bind_mask+0x60/0x60
Jun 22 07:10:24 abiggun kernel:  ret_from_fork+0x22/0x40
Jun 22 07:10:24 abiggun kernel: Modules linked in: snd_usb_audio snd_usbmidi_lib snd_rawmidi hid_plantronics mc vhost_net vhost tap vhost_iotlb snd_seq_dumm>
Jun 22 07:10:24 abiggun kernel:  crypto_simd cryptd glue_helper xts dm_crypt hid_generic usbhid hid raid456 libcrc32c crc32c_generic async_raid6_recov async>
Jun 22 07:10:24 abiggun kernel: ---[ end trace 536cfe34e3c36293 ]---
Jun 22 07:10:24 abiggun kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2aa/0x2310 [amdgpu]
Jun 22 07:10:24 abiggun kernel: Code: 4f 08 8b 81 e0 02 00 00 41 83 c5 01 44 39 e8 0f 87 46 ff ff ff 48 83 bd f0 fc ff ff 00 0f 84 03 01 00 00 48 8b bd f0 f>
Jun 22 07:10:25 abiggun kernel: RSP: 0018:ffffb0cc421abaf8 EFLAGS: 00010286
Jun 22 07:10:25 abiggun kernel: RAX: 0000000000000006 RBX: ffffa21b8e16c400 RCX: ffffa21cab9c8800
Jun 22 07:10:25 abiggun kernel: RDX: ffffa21ca7326200 RSI: ffffffffc10de1a0 RDI: d3d74027d6d8fad4
Jun 22 07:10:25 abiggun kernel: RBP: ffffb0cc421abe60 R08: 0000000000000001 R09: 0000000000000001
Jun 22 07:10:25 abiggun kernel: R10: 00000000000002be R11: 00000000001c57a1 R12: 0000000000000000
Jun 22 07:10:25 abiggun kernel: R13: 0000000000000006 R14: ffffa218e4959800 R15: ffffa219e5b12780
Jun 22 07:10:25 abiggun kernel: FS:  0000000000000000(0000) GS:ffffa21cbfc00000(0000) knlGS:0000000000000000
Jun 22 07:10:25 abiggun kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 07:10:25 abiggun kernel: CR2: 00007fec2b573008 CR3: 0000000344bd8000 CR4: 00000000000006f0
Comment 17 Duncan 2020-06-22 17:44:59 UTC
(In reply to rtmasura+kernel from comment #16)
> Reporting I've had the same issue with kernel 5.7.2 and 5.7.4:

Thanks!

> Jun 22 07:10:24 abiggun kernel: Hardware name: System manufacturer System
> Product Name/Crosshair IV Formula, BIOS 1102    08/24/2010

So socket AM3 from 2010, slightly older than my AM3+ from 2012.  Both are PCIe-2.0.

What's your CPU and GPU?

As above, my GPU is Polaris 11 (AMD Radeon RX 460, Arctic Islands/GCN4 series, PCIe 3), with an AMD FX-6100 CPU.

Guessing the bug is specific to certain gpu-series code, or there'd be more people howling, so what you're running for a gpu is significant.  It's /possible/ it's specific to people running a PCIe mismatch as well (note my PCIe 3 gpu card on a PCIe 2 mobo).
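One way to examine the PCIe-mismatch hypothesis is the negotiated link speed in the LnkSta line of `lspci -vv` output for the GPU slot. A sketch that extracts it; the sample line below is hypothetical, not captured from either reporter's system:

```shell
# On a live system you would feed this from:  sudo lspci -vv -s <gpu-bdf>
# Hypothetical LnkSta line for a PCIe 3 card negotiated down on a PCIe 2 slot:
lnksta='LnkSta: Speed 5GT/s (downgraded), Width x16 (ok)'
# Pull out the speed field (5GT/s = PCIe 2 rate, 8GT/s = PCIe 3 rate)
speed=$(printf '%s\n' "$lnksta" | sed -n 's/.*Speed \([^ ,]*\).*/\1/p')
echo "$speed"   # → 5GT/s
```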

> Jun 22 07:10:24 abiggun kernel: Workqueue: events_unbound commit_work
> [drm_kms_helper]
> 0010:amdgpu_dm_atomic_commit_tail+0x2aa/0x2310 [amdgpu]

That's the bit of the dump I understand, similar to mine...

If you can find a quicker/more-reliable way to trigger the crash, it'd sure be helpful for bisecting.  Also, if you're running a bad kernel enough to tell (not just back to 5.6 after finding 5.7 bad), does it reliably dump-log before the reboot for you?  I'm back to a veerrry--sloowww second bisect attempt.  My current kernel, for instance, has crashed three times now, so it's obviously bugged, but nothing has been dumped in the log on the way down yet, so I can't guarantee it's the _same_ bug (the bisect is in pre-rc1 code, so chances of a different bug are definitely non-zero).  Given the bad results on the first bisect, I'm trying to confirm each bisect-bad with a log-dump and each bisect-good with at least 3-4 days without a crash.  But this one's in between right now: frequent crashing but no log-dump to confirm it's the same bug.
Comment 18 rtmasura+kernel 2020-06-22 17:57:25 UTC
lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 Northbridge only single slot PCI-e GFX Hydra part (rev 02)
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU)
00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GFX port 0)
00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 0)
00:07.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 3)
00:0b.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD990 PCI to PCI bridge (PCI Express GFX2 port 0)
00:0d.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP2 Port 0)
00:11.0 RAID bus controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [RAID5 mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller (rev 42)
00:14.2 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) (rev 40)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller (rev 40)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
00:16.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:16.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control
02:00.0 PCI bridge: PLX Technology, Inc. PEX 8624 24-lane, 6-Port PCI Express Gen 2 (5.0 GT/s) Switch [ExpressLane] (rev bb)
03:04.0 PCI bridge: PLX Technology, Inc. PEX 8624 24-lane, 6-Port PCI Express Gen 2 (5.0 GT/s) Switch [ExpressLane] (rev bb)
03:05.0 PCI bridge: PLX Technology, Inc. PEX 8624 24-lane, 6-Port PCI Express Gen 2 (5.0 GT/s) Switch [ExpressLane] (rev bb)
03:06.0 PCI bridge: PLX Technology, Inc. PEX 8624 24-lane, 6-Port PCI Express Gen 2 (5.0 GT/s) Switch [ExpressLane] (rev bb)
03:08.0 PCI bridge: PLX Technology, Inc. PEX 8624 24-lane, 6-Port PCI Express Gen 2 (5.0 GT/s) Switch [ExpressLane] (rev bb)
03:09.0 PCI bridge: PLX Technology, Inc. PEX 8624 24-lane, 6-Port PCI Express Gen 2 (5.0 GT/s) Switch [ExpressLane] (rev bb)
04:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
04:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
06:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
06:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
07:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
07:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
09:00.0 VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P4000] (rev a1)
09:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
0a:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)
0b:00.0 SATA controller: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
0b:00.1 IDE interface: JMicron Technology Corp. JMB363 SATA/IDE Controller (rev 03)
0c:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1470 (rev c3)
0d:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1471
0e:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c3)
0e:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
                                                                        
A few notes on that: the AMD Vega 56 is used for this PC; the Quadro P4000 is disabled on my system and passed through to VMs.

I haven't found any way to trigger it; it seems completely random.  I sat down this morning to update a VM (not the one with the nvidia passthrough) and it froze, with nothing really graphical going on other than normal KDE stuff.


lscpu:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          6
On-line CPU(s) list:             0-5
Thread(s) per core:              1
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      16
Model:                           10
Model name:                      AMD Phenom(tm) II X6 1090T Processor
Stepping:                        0
CPU MHz:                         3355.192
BogoMIPS:                        6421.46
Virtualization:                  AMD-V
L1d cache:                       384 KiB
L1i cache:                       384 KiB
L2 cache:                        3 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-5
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_
                                 opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor cx16 po
                                 pcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt cpb hw_pstate vmmcall
                                  npt lbrv svm_lock nrip_save pausefilter


I would be happy to help with any testing, just let me know what information you need.
Comment 19 Duncan 2020-06-22 19:36:29 UTC
(In reply to rtmasura+kernel from comment #18)
> 09:00.0 VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P4000]
> (rev a1)

> 0e:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c3)

> A few notes on that: The AMD Vega56 is used for this PC, the Quadro P4000 is
> disabled on my system and passed through to VMs. 

So newer graphics, Vega56/gcn5 compared to my gcn4.

No VMs at all here so that can be excluded as a factor (unless it's a minor trigger similar to my zooming or video play).

> I haven't found any way to trigger it. Seems completely random. Sat down
> this morning to update a VM (not the one with the nvidia passthrough) and it
> froze, wasn't any real graphical things going on other than normal KDE
> stuff. 

KDE/Plasma here too.  I think kwin exercises OpenGL a bit more than some WMs, in part because it's a compositor as well.  The bug most often hits here when playing video or using kwin's zoom effect, both of which exercise the graphics a bit.

So the triggers being mostly kde/kwin could lower the population hitting it, and kde/kwin could well be a factor, based on both of us running it.

> Model name:                      AMD Phenom(tm) II X6 1090T Processor

So compared to here: newer graphics (gcn5 vs. gcn4), older cpu (Phenom II vs. FX).

So we know gcn4 and gcn5 are affected, and a PCIe 2 bus with a PCIe 3 card plus kde/kwin are the common-factor possible triggers so far.

> I would be happy to help with any testing, just let me know what information
> you need.

If you happen to run anything besides KDE/Plasma on X, duplicating (or failing to duplicate) the bug on non-kde and/or on wayland would be useful info.  I only run KDE Plasma on X here.  Well, that and the CLI (on the amdgpu drm framebuffer), more than some people, but not enough that I'd have expected to see the bug there, and I haven't.
Comment 20 rtmasura+kernel 2020-06-22 20:00:46 UTC
I have XFCE4 installed as well, I'll give it a test and let you know in 24 hours; a GPF should have happened by then
Comment 21 rtmasura+kernel 2020-06-23 15:36:40 UTC
OK. I've uninstalled the vast majority of KDE and am using a vanilla XFCE4. It's been about 12 hours on 5.7.4-arch1-1 and I have yet to have a crash. It is looking like it may be something with KDE.
Comment 22 Duncan 2020-06-23 23:41:25 UTC
(In reply to rtmasura+kernel from comment #21)
> OK. I've uninstalled the vast majority of KDE and am using a vanilla XFCE4.
> It's been about 12 hours on 5.7.4-arch1-1 and I have yet to have a crash. It
> is looking like it may be something with KDE.

Note that it is possible to run kwin (kwin_x11 being the actual executable) on another desktop, or conversely, a different WM on plasma.  To run kwin and have it replace the existing WM, simply type kwin_x11 --replace in the xfce runner or a terminal window (it can be done from a different VT as well, but then you have to feed kwin the display information too).  Presumably other WMs have a similar command-line option.

I've never actually done it on a non-plasma desktop (tho I run live-git plasma and frameworks, so I must always be prepared to restart kwin or various other plasma components, to the point that I have non-kde-invoked shortcuts set up to do it).  But I /think/ kwin would continue to use the configuration set up on kde: the various window rules, configured kwin keyboard shortcuts and effects, etc.

That could prove whether it's actually kwin triggering or not (tho it's a kernel bug regardless).  I suspect the proof is academic at this point, given that you've demonstrated the trigger does appear to be kde/plasma related, at least; IMO kwin triggering is a reasonably safe assumption.  But it does explain why the bug isn't widely reported: with plasma the apparent biggest trigger, and the bug limited to specific, now older, generations of hardware, few people, even among those running the latest kernels, are going to see it.

Meanwhile, I actually got a log-dump on the 4th crash of the kernel at that bisect step, confirming it is indeed this bug, and have advanced a bisect step.  But git says I still have ~11 steps, 1000+ commits, so it's still well too large to start trying to pick out candidate buggy commits from the remainder.  Slow going indeed.  At this rate a full bisect and fix could well land after the 5.8 release, giving us two full bad release cycles and kernels before a fix.  Not good. =:^(
Comment 23 rtmasura+kernel 2020-06-24 08:55:18 UTC
Yeah, over 24 hours and still stable. And glad I could help, I rarely have anything I can give back to the community.

And wow, that much work. Truly, we all do appreciate your work, but I don't think most of us understand how much. Thank you from all of us :)
Comment 24 rtmasura+kernel 2020-06-27 04:37:18 UTC
I've been up and stable on XFCE4 since that last message, but just crashed today with a bit of a different error. This happened after I turned on a screen tear fix:

xfconf-query -c xfwm4 -p /general/vblank_mode -s glx

I also didn't reboot to activate it, I just hot loaded it with:

xfwm4 --replace --vblank=glx &

Don't think that changes anything, but just in case. Not sure if it's related, but I had a game idling on my monitor while I was cooking, and it's the first time I had played it. It was The Battle for Wesnoth. Anyway, here's the log:


Jun 26 21:08:03 abiggun kernel: general protection fault, probably for non-canonical address 0x3b963e011fb9f84: 0000 [#1] PREEMPT SMP NOPTI
Jun 26 21:08:03 abiggun kernel: CPU: 4 PID: 362093 Comm: kworker/u12:1 Not tainted 5.7.4-arch1-1 #1
Jun 26 21:08:03 abiggun kernel: Hardware name: System manufacturer System Product Name/Crosshair IV Formula, BIOS 1102    08/24/2010
Jun 26 21:08:03 abiggun kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Jun 26 21:08:03 abiggun kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2aa/0x2310 [amdgpu]
Jun 26 21:08:03 abiggun kernel: Code: 4f 08 8b 81 e0 02 00 00 41 83 c5 01 44 39 e8 0f 87 46 ff ff ff 48 83 bd f0 fc ff ff 00 0f 84 03 01 00 00 48 8b bd f0 fc ff ff <80> bf b0 01 00 00 01 0f 86 ac 00 00 00 48 b9 00 00 00 00 01 00 00
Jun 26 21:08:03 abiggun kernel: RSP: 0018:ffff993cc4037af8 EFLAGS: 00010206
Jun 26 21:08:03 abiggun kernel: RAX: 0000000000000006 RBX: ffff931ae09c0800 RCX: ffff931bfe478000
Jun 26 21:08:03 abiggun kernel: RDX: ffff931bf2dd2600 RSI: ffffffffc10a51a0 RDI: 03b963e011fb9f84
Jun 26 21:08:03 abiggun kernel: RBP: ffff993cc4037e60 R08: 0000000000000001 R09: 0000000000000001
Jun 26 21:08:03 abiggun kernel: R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000000
Jun 26 21:08:03 abiggun kernel: R13: 0000000000000006 R14: ffff931bd0450c00 R15: ffff931b3574dc80
Jun 26 21:08:03 abiggun kernel: FS:  0000000000000000(0000) GS:ffff931c3fd00000(0000) knlGS:0000000000000000
Jun 26 21:08:03 abiggun kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 26 21:08:03 abiggun kernel: CR2: 00007fe602dc0008 CR3: 0000000418080000 CR4: 00000000000006e0
Jun 26 21:08:03 abiggun kernel: Call Trace:
Jun 26 21:08:03 abiggun kernel:  ? tomoyo_write_self+0x100/0x1d0
Jun 26 21:08:03 abiggun kernel:  ? __switch_to_asm+0x34/0x70
Jun 26 21:08:03 abiggun kernel:  ? __switch_to_asm+0x40/0x70
Jun 26 21:08:03 abiggun kernel:  ? __switch_to_asm+0x34/0x70
Jun 26 21:08:03 abiggun kernel:  ? __switch_to_asm+0x40/0x70
Jun 26 21:08:03 abiggun kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 26 21:08:03 abiggun kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
Jun 26 21:08:03 abiggun kernel:  process_one_work+0x1da/0x3d0
Jun 26 21:08:03 abiggun kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 26 21:08:03 abiggun kernel:  worker_thread+0x4d/0x3e0
Jun 26 21:08:03 abiggun kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 26 21:08:03 abiggun kernel:  kthread+0x13e/0x160
Jun 26 21:08:03 abiggun kernel:  ? __kthread_bind_mask+0x60/0x60
Jun 26 21:08:03 abiggun kernel:  ret_from_fork+0x22/0x40
Jun 26 21:08:03 abiggun kernel: Modules linked in: snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device mc hid_plantronics macvtap macvlan vhost_net vhost tap vhost_iotlb fuse xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc rfkill tun lm92 hwmon_vid input_leds amdgpu squashfs nouveau loop edac_mce_amd kvm_amd ccp rng_core mxm_wmi snd_hda_codec_via gpu_sched snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio kvm ttm snd_hda_intel snd_intel_dspcfg wmi_bmof snd_hda_codec drm_kms_helper snd_hda_core pcspkr sp5100_tco k10temp snd_hwdep snd_pcm cec i2c_piix4 joydev rc_core mousedev igb syscopyarea snd_timer sysfillrect snd sysimgblt i2c_algo_bit dca fb_sys_fops soundcore asus_atk0110 evdev mac_hid wmi drm crypto_user agpgart ip_tables x_tables ext4 crc16 mbcache jbd2 ecb crypto_simd cryptd
Jun 26 21:08:03 abiggun kernel:  glue_helper xts hid_generic usbhid hid dm_crypt raid456 libcrc32c crc32c_generic async_raid6_recov async_memcpy async_pq async_xor xor async_tx ohci_pci raid6_pq md_mod ehci_pci ehci_hcd ohci_hcd xhci_pci xhci_hcd ata_generic pata_acpi pata_jmicron vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio dm_mod
Jun 26 21:08:03 abiggun kernel: ---[ end trace 4e7c8ad2195077a2 ]---
Jun 26 21:08:03 abiggun kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2aa/0x2310 [amdgpu]
Jun 26 21:08:03 abiggun kernel: Code: 4f 08 8b 81 e0 02 00 00 41 83 c5 01 44 39 e8 0f 87 46 ff ff ff 48 83 bd f0 fc ff ff 00 0f 84 03 01 00 00 48 8b bd f0 fc ff ff <80> bf b0 01 00 00 01 0f 86 ac 00 00 00 48 b9 00 00 00 00 01 00 00
Jun 26 21:08:03 abiggun kernel: RSP: 0018:ffff993cc4037af8 EFLAGS: 00010206
Jun 26 21:08:03 abiggun kernel: RAX: 0000000000000006 RBX: ffff931ae09c0800 RCX: ffff931bfe478000
Jun 26 21:08:03 abiggun kernel: RDX: ffff931bf2dd2600 RSI: ffffffffc10a51a0 RDI: 03b963e011fb9f84
Jun 26 21:08:03 abiggun kernel: RBP: ffff993cc4037e60 R08: 0000000000000001 R09: 0000000000000001
Jun 26 21:08:03 abiggun kernel: R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000000
Jun 26 21:08:03 abiggun kernel: R13: 0000000000000006 R14: ffff931bd0450c00 R15: ffff931b3574dc80
Jun 26 21:08:03 abiggun kernel: FS:  0000000000000000(0000) GS:ffff931c3fd00000(0000) knlGS:0000000000000000
Jun 26 21:08:03 abiggun kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 26 21:08:03 abiggun kernel: CR2: 00007fe602dc0008 CR3: 0000000418080000 CR4: 00000000000006e0
Jun 26 21:08:23 abiggun Thunar[3946]: 2020-06-27T04:08:23.137Z - debug: [REPOSITORY] fetch request: /cytrus.json
Jun 26 21:08:23 abiggun Thunar[3946]: 2020-06-27T04:08:23.138Z - debug: [REPOSITORY] request: /cytrus.json
Jun 26 21:08:23 abiggun Thunar[3946]: { repository: 'https://launcher.cdn.ankama.com' }
Jun 26 21:08:23 abiggun Thunar[3946]: 2020-06-27T04:08:23.155Z - debug: [REPOSITORY] fetchJson: Parsing data for /cytrus.json
Jun 26 21:08:23 abiggun Thunar[3946]: 2020-06-27T04:08:23.156Z - debug: [REGISTRY] update
Jun 26 21:08:23 abiggun Thunar[3946]: 2020-06-27T04:08:23.156Z - debug: [REGISTRY] Parse repository Data
Jun 26 21:08:40 abiggun audit[241624]: ANOM_ABEND auid=1000 uid=1000 gid=985 ses=2 subj==unconfined pid=241624 comm="GpuWatchdog" exe="/opt/google/chrome/chrome" sig=11 res=1
Jun 26 21:08:40 abiggun kernel: GpuWatchdog[241650]: segfault at 0 ip 0000556ef31897ad sp 00007f11132a95d0 error 6 in chrome[556eeeadc000+785b000]
Jun 26 21:08:40 abiggun kernel: Code: 00 79 09 48 8b 7d b0 e8 f1 95 6c fe c7 45 b0 aa aa aa aa 0f ae f0 41 8b 84 24 e0 00 00 00 89 45 b0 48 8d 7d b0 e8 f3 5a ba fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e
Jun 26 21:08:40 abiggun audit: BPF prog-id=71 op=LOAD
Jun 26 21:08:40 abiggun audit: BPF prog-id=72 op=LOAD
Jun 26 21:08:40 abiggun systemd[1]: Started Process Core Dump (PID 362491/UID 0).
Jun 26 21:08:40 abiggun audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-coredump@4-362491-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jun 26 21:08:45 abiggun systemd-coredump[362492]: Process 241624 (chrome) of user 1000 dumped core.
                                                  
                                                  Stack trace of thread 241650:
                                                  #0  0x0000556ef31897ad n/a (chrome + 0x62b07ad)
                                                  #1  0x0000556ef17e5c93 n/a (chrome + 0x490cc93)
                                                  #2  0x0000556ef17f7199 n/a (chrome + 0x491e199)
                                                  #3  0x0000556ef17ad6cf n/a (chrome + 0x48d46cf)
                                                  #4  0x0000556ef17f795c n/a (chrome + 0x491e95c)
                                                  #5  0x0000556ef17d08b9 n/a (chrome + 0x48f78b9)
                                                  #6  0x0000556ef180ea1b n/a (chrome + 0x4935a1b)
                                                  #7  0x0000556ef184ae78 n/a (chrome + 0x4971e78)
                                                  #8  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #9  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241624:
                                                  #0  0x00007f1117c2a05f __poll (libc.so.6 + 0xf505f)
                                                  #1  0x00007f11190c663b n/a (libxcb.so.1 + 0xc63b)
                                                  #2  0x00007f11190c845b xcb_wait_for_special_event (libxcb.so.1 + 0xe45b)
                                                  #3  0x00007f11128cd381 n/a (libGLX_mesa.so.0 + 0x57381)
                                                  #4  0x00007f11128c132b n/a (libGLX_mesa.so.0 + 0x4b32b)
                                                  #5  0x0000556ef295706e n/a (chrome + 0x5a7e06e)
                                                  #6  0x0000556ef2955cb8 n/a (chrome + 0x5a7ccb8)
                                                  #7  0x0000556ef17e5c93 n/a (chrome + 0x490cc93)
                                                  #8  0x0000556ef17f7199 n/a (chrome + 0x491e199)
                                                  #9  0x0000556ef17ad999 n/a (chrome + 0x48d4999)
                                                  #10 0x0000556ef17f795c n/a (chrome + 0x491e95c)
                                                  #11 0x0000556ef17d08b9 n/a (chrome + 0x48f78b9)
                                                  #12 0x0000556ef59a9ed9 n/a (chrome + 0x8ad0ed9)
                                                  #13 0x0000556ef13329b4 n/a (chrome + 0x44599b4)
                                                  #14 0x0000556ef139addd n/a (chrome + 0x44c1ddd)
                                                  #15 0x0000556ef1330901 n/a (chrome + 0x4457901)
                                                  #16 0x0000556eeede80ce ChromeMain (chrome + 0x1f0f0ce)
                                                  #17 0x00007f1117b5c002 __libc_start_main (libc.so.6 + 0x27002)
                                                  #18 0x0000556eeeadc6aa _start (chrome + 0x1c036aa)
                                                  
                                                  Stack trace of thread 241636:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241642:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241644:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241643:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 359981:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241651:
                                                  #0  0x00007f1117c34f3e epoll_wait (libc.so.6 + 0xfff3e)
                                                  #1  0x0000556ef192ea1a n/a (chrome + 0x4a55a1a)
                                                  #2  0x0000556ef192c227 n/a (chrome + 0x4a53227)
                                                  #3  0x0000556ef18588d0 n/a (chrome + 0x497f8d0)
                                                  #4  0x0000556ef17f795c n/a (chrome + 0x491e95c)
                                                  #5  0x0000556ef17d08b9 n/a (chrome + 0x48f78b9)
                                                  #6  0x0000556ef1809624 n/a (chrome + 0x4930624)
                                                  #7  0x0000556ef180ea1b n/a (chrome + 0x4935a1b)
                                                  #8  0x0000556ef184ae78 n/a (chrome + 0x4971e78)
                                                  #9  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #10 0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241655:
                                                  #0  0x00007f1119245158 pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0 + 0x10158)
                                                  #1  0x0000556ef1846f60 n/a (chrome + 0x496df60)
                                                  #2  0x0000556ef18475b0 n/a (chrome + 0x496e5b0)
                                                  #3  0x0000556ef17ad716 n/a (chrome + 0x48d4716)
                                                  #4  0x0000556ef17f795c n/a (chrome + 0x491e95c)
                                                  #5  0x0000556ef17d08b9 n/a (chrome + 0x48f78b9)
                                                  #6  0x0000556ef180ea1b n/a (chrome + 0x4935a1b)
                                                  #7  0x0000556ef184ae78 n/a (chrome + 0x4971e78)
                                                  #8  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #9  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241656:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 242011:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241646:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241657:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241658:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 351071:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 351072:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 359972:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241659:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 361357:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241647:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241652:
                                                  #0  0x00007f1119245158 pthread_cond_timedwait@@GLIBC_2.3.2 (libpthread.so.0 + 0x10158)
                                                  #1  0x0000556ef1846f60 n/a (chrome + 0x496df60)
                                                  #2  0x0000556ef18475b0 n/a (chrome + 0x496e5b0)
                                                  #3  0x0000556ef1809c6a n/a (chrome + 0x4930c6a)
                                                  #4  0x0000556ef180a54c n/a (chrome + 0x493154c)
                                                  #5  0x0000556ef180a234 n/a (chrome + 0x4931234)
                                                  #6  0x0000556ef184ae78 n/a (chrome + 0x4971e78)
                                                  #7  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #8  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241653:
                                                  #0  0x00007f1117c34f3e epoll_wait (libc.so.6 + 0xfff3e)
                                                  #1  0x0000556ef192ea1a n/a (chrome + 0x4a55a1a)
                                                  #2  0x0000556ef192c227 n/a (chrome + 0x4a53227)
                                                  #3  0x0000556ef18588d0 n/a (chrome + 0x497f8d0)
                                                  #4  0x0000556ef17f795c n/a (chrome + 0x491e95c)
                                                  #5  0x0000556ef17d08b9 n/a (chrome + 0x48f78b9)
                                                  #6  0x0000556ef180ea1b n/a (chrome + 0x4935a1b)
                                                  #7  0x0000556ef184ae78 n/a (chrome + 0x4971e78)
                                                  #8  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #9  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241660:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241661:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241662:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241665:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241666:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241663:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  Stack trace of thread 241667:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x00007f111158e3bc n/a (radeonsi_dri.so + 0x4ae3bc)
                                                  #2  0x00007f111158cdb8 n/a (radeonsi_dri.so + 0x4acdb8)
                                                  #3  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #4  0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  [Threads 241664, 241851, 241852 and 241853 show the same pthread_cond_wait -> radeonsi_dri.so trace as thread 241667 above.]

                                                  Stack trace of thread 245560:
                                                  #0  0x00007f1119244e32 pthread_cond_wait@@GLIBC_2.3.2 (libpthread.so.0 + 0xfe32)
                                                  #1  0x0000556ef1846e48 n/a (chrome + 0x496de48)
                                                  #2  0x0000556ef18475d9 n/a (chrome + 0x496e5d9)
                                                  #3  0x0000556ef184739f n/a (chrome + 0x496e39f)
                                                  #4  0x0000556ef17ad751 n/a (chrome + 0x48d4751)
                                                  #5  0x0000556ef17f795c n/a (chrome + 0x491e95c)
                                                  #6  0x0000556ef17d08b9 n/a (chrome + 0x48f78b9)
                                                  #7  0x0000556ef180ea1b n/a (chrome + 0x4935a1b)
                                                  #8  0x0000556ef184ae78 n/a (chrome + 0x4971e78)
                                                  #9  0x00007f111923e422 start_thread (libpthread.so.0 + 0x9422)
                                                  #10 0x00007f1117c34bf3 __clone (libc.so.6 + 0xffbf3)
                                                  
                                                  [Threads 241862, 361354, 361028, 241902, 361345, 361358, 241645, 241637, 241638, 241639, 241640, 241641, 241750, 241855, 309100 and 359991 all show the same pthread_cond_wait -> radeonsi_dri.so trace as thread 241667 above.]
Comment 25 rtmasura+kernel 2020-06-27 04:38:47 UTC
Same kernel (5.7.4) and I'll try to reproduce it, and if it happens I'll turn off the screen tear and try to reproduce again

Let me know if there's anything I can provide you
Comment 26 rtmasura+kernel 2020-06-27 05:16:46 UTC
and just got another crash, only watching a video in chrome. Guess the chrome bit at the end might be more important than I thought

I *think* I've turned off glx for xfwm... we'll see. My computer had been showing video in chrome every day without issues before today. I hadn't updated since last week either; no changes in the system.
Comment 27 rtmasura+kernel 2020-06-27 06:08:43 UTC
and another crash, chrome's good at causing them (watching youtube). Used -s "" for the setting which I think should set it to 'auto', and what I assumed was default. I've changed that to -s "off" to see if that helps.
Comment 28 Duncan 2020-06-27 07:07:39 UTC
(In reply to rtmasura+kernel from comment #27)
> and another crash, chrome's good at causing them (watching youtube). Used -s
> "" for the setting which I think should set it to 'auto', and what I assumed
> was default. I've changed that to -s "off" to see if that helps.

You just added those updates as I was typing a comment pointing out the chrome/chromium involvement in your bug; bugzilla warned of a mid-air collision!  Chrom(e|ium) has new vulkan accel code and very likely exercises some of the same relatively new amdgpu kernel code kwin does, so both of them triggering the bug wouldn't surprise me at all.

As it happens I switched back to firefox during the 5.6 kernel cycle, so haven't seen chromium's interaction with the (kernel 5.7) bug myself, but once I saw it in that trace I said to myself I bet that's his trigger!


FWIW I advanced a couple more bisect steps pretty quickly, as it was triggering while I tried to complete system updates (which on gentoo of course means building the packages), but then I hit an apparently good kernel, and uptime says 3 days now, something I've not seen in a while!  Only thing is, I finished those updates and the next couple days were pretty calm, so I've not been stressing the system to the same extent, either.  Given the problems I got myself into on the first bisect run, I'm going to run on this kernel a bit longer before I mark this bisect step good.  If it reaches a week and I've done either a good system update or some heavy 4k@60 youtube on firefox, I'll call it good, but I'm not ready to yet.

The good news is, in a couple more bisect steps I'll be down to a practical number of remaining commits to report here, and if they have the time, a dev with a practiced eye should be able to narrow it down by say 3/4 (two steps ahead of my bisect), leaving something actually practical to examine more closely.  After that, my bisect will no longer be the only bottleneck, if the bug's big enough to get dev priority time, of course.  If not, I'll just have to keep plugging away at the bisect...
Comment 29 zzyxpaw 2020-06-27 22:26:41 UTC
Just hit this on Archlinux with linux-5.7.6 on a Vega 64. So far I've had three crashes, mostly occurring within the first few minutes of uptime. I'm not running kwin or chrome, just a light window manager (bspwm) and compton.

During the first two, steam's fossilize was running, which led me to suspect it was triggered by an interaction with that. However, the third crash happened before I even managed to start steam, so either I'm just lucky or my system is good at triggering this. @Duncan I'm not sure if you want to muddle your bisect results with a different system configuration, but I'm happy to help test commits if that would be helpful.

I've noticed the call traces reported in the kernel log are slightly different for each crash; I'm not sure if they're likely to be useful or not. Here's at least the one from my first crash:

Jun 27 14:04:40 erebor kernel: general protection fault, probably for non-canonical address 0x5dda9795528973db: 0000 [#1] PREEMPT SMP NOPTI
Jun 27 14:04:40 erebor kernel: CPU: 14 PID: 193610 Comm: kworker/u32:14 Tainted: G           OE     5.7.6-arch1-1 #1
Jun 27 14:04:40 erebor kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Pro4, BIOS P4.90 06/14/2018
Jun 27 14:04:40 erebor kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
Jun 27 14:04:40 erebor kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2aa/0x2310 [amdgpu]
Jun 27 14:04:40 erebor kernel: Code: 4f 08 8b 81 e0 02 00 00 41 83 c5 01 44 39 e8 0f 87 46 ff ff ff 48 83 bd f0 fc ff ff 00 0f 84 03 01 00 00 48 8b bd f0 fc ff ff <80> bf b0 01 00 00 01 0f 86 ac 00 00>
Jun 27 14:04:40 erebor kernel: RSP: 0018:ffffbcec0a4afaf8 EFLAGS: 00010206
Jun 27 14:04:40 erebor kernel: RAX: 0000000000000006 RBX: ffff9b71dbaed000 RCX: ffff9b7472e4b800
Jun 27 14:04:40 erebor kernel: RDX: ffff9b72504ea400 RSI: ffffffffc13181e0 RDI: 5dda9795528973db
Jun 27 14:04:40 erebor kernel: RBP: ffffbcec0a4afe60 R08: 0000000000000001 R09: 0000000000000001
Jun 27 14:04:40 erebor kernel: R10: 0000000000000082 R11: 00000000000730e2 R12: 0000000000000000
Jun 27 14:04:40 erebor kernel: R13: 0000000000000006 R14: ffff9b71dbaed800 R15: ffff9b71a8fdb580
Jun 27 14:04:40 erebor kernel: FS:  0000000000000000(0000) GS:ffff9b747ef80000(0000) knlGS:0000000000000000
Jun 27 14:04:40 erebor kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 27 14:04:40 erebor kernel: CR2: 000056460ce164b0 CR3: 0000000341c86000 CR4: 00000000003406e0
Jun 27 14:04:40 erebor kernel: Call Trace:
Jun 27 14:04:40 erebor kernel:  ? __erst_read+0x160/0x1d0
Jun 27 14:04:40 erebor kernel:  ? __switch_to_asm+0x34/0x70
Jun 27 14:04:40 erebor kernel:  ? __switch_to_asm+0x40/0x70
Jun 27 14:04:40 erebor kernel:  ? __switch_to_asm+0x34/0x70
Jun 27 14:04:40 erebor kernel:  ? __switch_to_asm+0x40/0x70
Jun 27 14:04:40 erebor kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 27 14:04:40 erebor kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
Jun 27 14:04:40 erebor kernel:  process_one_work+0x1da/0x3d0
Jun 27 14:04:40 erebor kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 27 14:04:40 erebor kernel:  worker_thread+0x4d/0x3e0
Jun 27 14:04:40 erebor kernel:  ? rescuer_thread+0x3f0/0x3f0
Jun 27 14:04:40 erebor kernel:  kthread+0x13e/0x160
Jun 27 14:04:40 erebor kernel:  ? __kthread_bind_mask+0x60/0x60
Jun 27 14:04:40 erebor kernel:  ret_from_fork+0x22/0x40
Jun 27 14:04:40 erebor kernel: Modules linked in: snd_seq_midi snd_seq_dummy snd_seq_midi_event snd_hrtimer snd_seq fuse ccm 8021q garp mrp stp llc snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_de>
Jun 27 14:04:40 erebor kernel:  blake2b_generic libcrc32c crc32c_generic xor uas usb_storage raid6_pq crc32c_intel xhci_pci xhci_hcd
Jun 27 14:04:40 erebor kernel: ---[ end trace cb5c0d96dd991657 ]---
Jun 27 14:04:40 erebor kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x2aa/0x2310 [amdgpu]
Jun 27 14:04:40 erebor kernel: Code: 4f 08 8b 81 e0 02 00 00 41 83 c5 01 44 39 e8 0f 87 46 ff ff ff 48 83 bd f0 fc ff ff 00 0f 84 03 01 00 00 48 8b bd f0 fc ff ff <80> bf b0 01 00 00 01 0f 86 ac 00 00>
Jun 27 14:04:40 erebor kernel: RSP: 0018:ffffbcec0a4afaf8 EFLAGS: 00010206
Jun 27 14:04:40 erebor kernel: RAX: 0000000000000006 RBX: ffff9b71dbaed000 RCX: ffff9b7472e4b800
Jun 27 14:04:40 erebor kernel: RDX: ffff9b72504ea400 RSI: ffffffffc13181e0 RDI: 5dda9795528973db
Jun 27 14:04:40 erebor kernel: RBP: ffffbcec0a4afe60 R08: 0000000000000001 R09: 0000000000000001
Jun 27 14:04:40 erebor kernel: R10: 0000000000000082 R11: 00000000000730e2 R12: 0000000000000000
Jun 27 14:04:40 erebor kernel: R13: 0000000000000006 R14: ffff9b71dbaed800 R15: ffff9b71a8fdb580
Jun 27 14:04:40 erebor kernel: FS:  0000000000000000(0000) GS:ffff9b747ef80000(0000) knlGS:0000000000000000
Jun 27 14:04:40 erebor kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 27 14:04:40 erebor kernel: CR2: 000056460ce164b0 CR3: 0000000341c86000 CR4: 00000000003406e0
Comment 30 mnrzk 2020-06-28 01:12:58 UTC
I've been looking at this bug for a while now and I'll try to share what I've found about it.

In some conditions, when amdgpu_dm_atomic_commit_tail calls dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct dm_atomic_state* with a garbage context pointer.

I've also found that this bug exclusively occurs when commit_work is on the workqueue. After forcing drm_atomic_helper_commit to run all of the commits synchronously, without adding them to the workqueue, the issue seems to have disappeared. The system was stable for at least 1.5 hours before I manually shut it down (whereas it had usually crashed within 30-45 minutes).

Perhaps there's some sort of race condition occurring after commit_work is queued?
Comment 31 Duncan 2020-06-28 10:48:15 UTC
(In reply to mnrzk from comment #30)
> In some conditions, when amdgpu_dm_atomic_commit_tail calls
> dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
> dm_atomic_state* with an garbage context pointer.

Good! Someone with the bug who can actually read and work the code, now. Portends well for a fix.  =:^)

> I've also found that this bug exclusively occurs when commit_work is on the
> workqueue. After forcing drm_atomic_helper_commit to run all of the commits
> without adding to the workqueue and running the OS, the issue seems to have
> disappeared.

I see it always with the workqueue too, but not being a dev I simply assumed that was how it was; I had no idea it could be taken off the workqueue.

> The system was stable for at least 1.5 hours before I manually
> shut it down (meanwhile it has usually crashed within 30-45 minutes).

You're seeing a crash much faster than I am.  I believe my longest uptime before a crash with the telltale trace was something like two and a half days, with the obvious implications for bisect good since it's always a gamble that I've simply not tested long enough.

> Perhaps there's some sort of race condition occurring after commit_work is
> queued?

Agreed, FWIW, tho you've taken it farther than I could, not being able to work with code much beyond bisect or modifying an existing patch here or there.
Comment 32 Duncan 2020-06-28 15:30:47 UTC
Created attachment 289911 [details]
Partial git bisect log

(In reply to zzyxpaw from comment #29)
> @Duncan I'm not sure if you want to muddle your
> bisect results with a different system configuration, but I'm happy to help
> test commits if that would be helpful.

Here's my current git bisect log you can replay.

I believe that should leave you at v5.6-rc2-245-gcf6c26ec7, which I'm going to build and boot to as soon as I post this.

But if your system's as good at triggering the bug as you suggest, try deleting that last good before the replay as I'm only ~98% sure about it given a potential trigger-time of days on my system.  That should leave you at 7be97138e which you can try triggering it with.  If your system's reliably triggering within minutes and it doesn't trigger on that, you can confirm my bisect good and go from there.

Note that if you're building with gcc-10.x you'll likely need a couple of patches that were committed later in the 5.7 cycle, depending on whether they were applied before or after whatever you're testing.  If you're building with gcc-9.3 (and presumably earlier) they shouldn't be necessary.

a9a3ed1ef and e78d334a5 are the commits in question.  One was necessary to build with gcc-10 at all, the other to get past a boot-time crash when built with gcc-10.  Only one of them applies at cf6c26ec7 (I don't remember which), but both were necessary for 7be97138e.

At my somewhat limited git skill level it was easiest to redirect a git show of each commit to a patchfile, apply the patch on top of whatever git bisect gave me, and git reset --hard to clean up the patches before the next git bisect good/bad.  I guess git cherry-pick would be the usual way to apply them, but I'm not entirely sure how that interacts with git bisect, so applying the patches on top was the easier way for me, particularly given that I already have scripts to automate patch application for my local default-to-noatime patch.
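That patch-on-top workflow can be sketched end to end in a throwaway repository. Everything here is illustrative (made-up file names; the second commit stands in for a needed fix such as a9a3ed1ef); substitute the real SHAs from the bisect:

```shell
# Demo of the "git show > patchfile, git apply, git reset --hard" workflow.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email you@example.com && git config user.name you

# Two commits: the second stands in for a fix the bisect checkout lacks.
echo base > f.txt && git add f.txt && git commit -qm base
echo fix >> f.txt && git commit -qam fix
FIX=$(git rev-parse HEAD)

git checkout -q HEAD~1         # stand-in for whatever 'git bisect' checked out
git show "$FIX" > fix.patch    # capture the fix as a plain patchfile
git apply fix.patch            # overlay it on the bisect checkout
grep -q fix f.txt && echo "patched"

git reset -q --hard            # drop the overlay before the next good/bad verdict
grep -q fix f.txt || echo "clean again"
```

git apply ignores the commit-message header that git show emits, so redirecting git show to a file works as a quick patch extractor; the reset --hard afterwards keeps the bisect tree pristine for the next step.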
Comment 33 Michel Dänzer 2020-06-29 07:39:39 UTC
(In reply to rtmasura+kernel from comment #24)
> xfwm4 --replace --vblank=glx &

FWIW, I recommend

 xfwm4 --vblank=xpresent

instead. --vblank=glx is less efficient and relies on rather exotic GLX functionality which can be quirky with Mesa.
Comment 34 mnrzk 2020-06-29 22:09:23 UTC
Has anyone tried 5.8-rc3? I've been testing it out for the past 3 hours and it seems stable to me. Also, there were some amdgpu drm fixes pushed between rc2 and rc3 which could have fixed it.

Could someone else experiencing this bug test 5.8-rc3 and see if it's fixed?

I have some debug code and kernel options which may have interfered with my testing so I wouldn't exactly say the bug is fixed based on my findings.
Comment 35 Duncan 2020-07-01 19:08:44 UTC
(In reply to mnrzk from comment #34)
> Has anyone tried 5.8-rc3? I've been testing it out for the past 3 hours and
> it seems stable to me.

I have now (well, v5.8.0-rc3-00017-g7c30b859a).  Unfortunately got a freeze with our familiar trace fairly quickly (building kde updates at the time) so it's not fixed yet.  =:^(
Comment 36 Duncan 2020-07-04 19:57:49 UTC
Created attachment 290093 [details]
Updated partial git bisect log

Updated partial git bisect log.  Looks like 226 commits including merges.

There appear to be four Linus-level merge-trees, one of which appears to be the majority of the remaining commits:

8c1b724dd kvm (medium).  No kvm here so that /should/ be out.

f14a9532e tip (single commit).  sparse warning, x86: bitops.h.  Says generated code shouldn't be affected.

7f218319c integrity (small). Shouldn't be.

6cad420cc akpm (the majority).  Very likely in this tree.  The current bisect step is the first code commit (as opposed to tree merge) step and (if I'm reading things right) appears to split this one, much of this tree on one side, the rest of it and everything else on the other.

Notice, no drm tree, tho whatever buggy commit it is obviously affects drm/amdgpu.
Comment 37 mnrzk 2020-07-04 20:13:11 UTC
>Notice, no drm tree, tho whatever buggy commit it is obviously affects
>drm/amdgpu.

Yeah, I kind of noticed that while I was just skimming through the commit history. Perhaps it's possible that the issue has existed for a while but became much more apparent since 5.7?

Whatever it is, keep up the good work; maybe you'll find some sort of clue while bisecting.
Comment 38 Duncan 2020-07-05 16:58:49 UTC
Created attachment 290101 [details]
Another partial git bisect log update

Just as I was thinking that step was going to be bisect good... it wasn't.  Confirmed with the usual commit_tail trace in the log.

(In reply to Duncan from comment #36)
> 6cad420cc akpm (the majority).  Very likely in this tree.

Definitely this tree/pull.  No merge but 113 commits remaining *at* this step (not _after_), all with signed-off-by both Andrew and Linus so it's all the akpm tree.  We know the tree, now.

FWIW for anyone relatively new to the bug who skipped some of the first comments, my bad first bisect attempt ended up in akpm as well.  I haven't checked if it was the same pull altho I'd guess so.  However, at that time I was only testing commits with drm in the path (including several that went in via the akpm tree not the drm tree, one of which that bisect ultimately pointed me at), and I suspect that's what did me in.

So I strongly suspect that while it's the akpm tree, it's *NOT* the one remaining candidate with the drm-path in it (4064b9827), thus explaining why the first bisect ended up pointing at a drm-path commit that I tested by reverting, only to still have the bug.  I tried a shortcut and it ended up a rabbit trail. =:^(

Other than that, 113 candidate commits left (well, 112 if we subtract that one) is still too many (for me) to guess at or really to even just list here.  Two more steps should bring it down to 28ish, three to 14ish, and maybe I can start guessing then.  With luck I'll get a couple more bad ones right away and narrow it down quickly.
Comment 39 Duncan 2020-07-05 22:08:37 UTC
Created attachment 290105 [details]
Partial git bisect log update #3

(In reply to Duncan from comment #38)
> With luck I'll get a couple more bad ones
> right away and narrow it down quickly.

And so it is.  28 candidates ATM, several of which are OCFS2 or spelling fixes, neither of which should affect this bug.  Excluding those there are eleven left; the penultimate one looks to be a good candidate:

5f2d5026b mm/Makefile: disable KCSAN for kmemleak
b0d14fc43 mm/kmemleak.c: use address-of operator on section symbols
667c79016 revert "topology: add support for node_to_mem_node() to determine the fallback node"
3202fa62f slub: relocate freelist pointer to middle of object
1ad53d9fa slub: improve bit diffusion for freelist ptr obfuscation
bbd4e305e mm/slub.c: replace kmem_cache->cpu_partial with wrapped APIs
4c7ba22e4 mm/slub.c: replace cpu_slab->partial with wrapped APIs
c537338c0 fs_parse: remove pr_notice() about each validation
630f289b7 asm-generic: make more kernel-space headers mandatory
98c985d7d kthread: mark timer used by delayed kthread works as IRQ safe
4054ab64e tools/accounting/getdelays.c: fix netlink attribute length

My gut says it's 98c "kthread: mark ... delayed kthread... IRQ safe".  Not a coder but the comment talks about delayed kthreads, we always see the workqueue in the traces, and mnrzk observes in comment #30 that forcing drm_atomic_helper_commit to run directly instead of using the workqueue seems to eliminate the freeze.  If it's called from the amdgpu code and that commit changes the IRQ-safety assumptions the amdgpu code was depending on in the workqueue, where the unqueued context is automatically IRQ-safe...

Still could be wrong, but at 11 real candidates it's a 9% chance even simply statistically, and it sure seems to fit.  Anyway, if it /is/ correct, the next few bisect steps should be bisect bad and thus go faster, narrowing it down even further.

Regardless, we're down far enough that someone that can actually read code might be able to take a look at that and the others now, so my bisect shouldn't be the /entire/ bottleneck any longer.
Comment 41 Duncan 2020-07-06 23:57:14 UTC
(In reply to Alex Deucher from comment #40)
> Does this patch help?

Booted to v5.7 with it applied now.  We'll see.  Since the bug can take awhile to trigger on my hardware, if the patch fixes it I won't know for days, and won't be /sure/ for, say, a week, which is the reason bisecting was taking so long.

(It wouldn't apply to current 5.8-rc4-plus-an-s390-pull.  Too tired to figure out why ATM but if it's because it was there already, hopefully it was pulled in after v5.8-rc3 as I tested that and got the same graphics freeze with the characteristic trace, so if the patch was already in v5.8-rc3, it does /not/ fix the bug.)

As for bisecting, I've hard-crashed twice on the current step, apparently with a different bug, so while _this_ bug hasn't seemed to trigger yet, I haven't gotten the necessary confidence that it's a bisect-good.  So hopefully this patch /does/ fix it, and I can put this entirely too frustrating bug-bisect behind me!
Comment 42 Duncan 2020-07-07 00:37:46 UTC
(In reply to Alex Deucher from comment #40)
> Does this patch help?

No.  v5.7 with the patch applied gave me the same graphics freeze, with the usual log trace confirming it's _this_ bug.

Sigh, back to the bisect. =:^(
Comment 44 Duncan 2020-07-07 11:01:36 UTC
(In reply to Christopher Snowhill from comment #43)
> What about this patch?
> 
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-5.
> 8&id=6eb3cf2e06d22b2b08e6b0ab48cb9c05a8e1a107

I see that in mainline as of 5.8-rc4 which I just triggered this bug on, so no, that doesn't fix it.


As for the bisect, now that I'm down to just a few commits, I woke up a couple hours ago with the idea to just try patch-reverting them on top of 5.7 or current 5.8-rc, thus eliminating the apparently unrelated kernel-panics I've twice triggered at the current bisect step.  I delayed that to try this patch, to no avail, but that's what I'm going to try now (the bisect step is after all a pre-rc1 state, so other bugs are to be expected).  For all I know some of the reverts won't apply to current due to either being already reverted or more code changes since, but we'll see how it goes.
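In git terms that plan is a loop of `git revert --no-edit` on top of a tag, skipping whatever conflicts.  A minimal scratch-repo illustration (the repository and the single suspect commit here are synthetic; against the kernel tree the loop would run over the eleven candidate IDs quoted in comment #39):

```shell
# Scratch-repo sketch of the revert-the-suspects approach: build a tiny
# history, then revert the suspect commit on top of the tip.
set -e
cd "$(mktemp -d)"
git init -q .
g() { git -c user.email=revert@example.com -c user.name=revert "$@"; }
echo good > file; git add file; g commit -q -m "known-good state"
echo bad >> file; git add file; g commit -q -m "suspect change"
suspect=$(git rev-parse --short HEAD)
# Revert each suspect on top of the tip; on a conflict, note it and skip,
# the same way two of the eleven kernel candidates had to be skipped.
for c in $suspect; do
  g revert --no-edit "$c" >/dev/null 2>&1 \
    || { echo "conflict on $c, skipped"; g revert --abort; }
done
tail -n1 file              # back to the known-good content
```

The advantage over bisecting is that everything else in the tree stays at the release tag, so unrelated pre-rc1 crashers can't muddy the test.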
Comment 45 Fabian Möller 2020-07-07 12:43:29 UTC
(In reply to Christopher Snowhill from comment #43)
> What about this patch?
> 
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-5.
> 8&id=6eb3cf2e06d22b2b08e6b0ab48cb9c05a8e1a107

Applying 6eb3cf2e06d22b2b08e6b0ab48cb9c05a8e1a107 to v5.7.7 fixed the issue for a RX5700/Navi10 under Wayland for me. 
It still produces the following log, which might be related to https://bugzilla.kernel.org/show_bug.cgi?id=206349.

------------[ cut here ]------------
WARNING: CPU: 2 PID: 1176 at arch/x86/kernel/fpu/core.c:109 kernel_fpu_end+0x19/0x20
Modules linked in: fuse xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 br_netfilter overlay wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libblake2s blake2s_x86_64 ip6_udp_tunnel udp_tunnel libcurve25519_generic libchacha libblake2s_generic af_packet rfkill msr amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper drm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi wmi_bmof mxm_wmi snd_hda_intel nls_iso8859_1 agpgart igb deflate nls_cp437 snd_intel_dspcfg sp5100_tco ptp mousedev vfat efi_pstore fb_sys_fops evdev fat snd_hda_codec edac_mce_amd pstore mac_hid syscopyarea watchdog pps_core sysfillrect edac_core snd_hda_core sysimgblt dca i2c_piix4 backlight crc32_pclmul i2c_algo_bit ghash_clmulni_intel efivars k10temp snd_hwdep i2c_core thermal wmi pinctrl_amd tiny_power_button button acpi_cpufreq sch_fq_codel snd_pcm_oss
 snd_mixer_oss snd_pcm snd_timer snd soundcore atkbd libps2 serio loop cpufreq_ondemand tun tap macvlan bridge stp llc vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) kvm_amd kvm irqbypass efivarfs ip_tables x_tables ipv6 nf_defrag_ipv6 crc_ccitt autofs4 xfs libcrc32c crc32c_generic dm_crypt algif_skcipher af_alg input_leds led_class hid_generic usbhid hid ahci xhci_pci libahci crc32c_intel xhci_hcd libata aesni_intel libaes crypto_simd nvme usbcore cryptd scsi_mod glue_helper nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul usb_common crct10dif_common rtc_cmos dm_snapshot dm_bufio dm_mod
CPU: 2 PID: 1176 Comm: systemd-logind Tainted: G           OE     5.7.7 #1-NixOS
Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F10c 11/08/2019
RIP: 0010:kernel_fpu_end+0x19/0x20
Code: 90 e9 db 9b 14 00 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 8a 05 5c 1a 3e 66 84 c0 74 09 65 c6 05 50 1a 3e 66 00 c3 <0f> 0b eb f3 0f 1f 00 0f 1f 44 00 00 8b 15 cd 59 57 01 31 f6 e8 2e
RSP: 0018:ffffb9dbc1417660 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000121b
RDX: 0000000000000001 RSI: ffff90b685fa1cd4 RDI: 000000000002f980
RBP: ffff90b685fa0000 R08: 0000000000000000 R09: 0000000000000040
R10: ffffb9dbc14175b0 R11: ffffb9dbc14170a0 R12: 0000000000000001
R13: ffff90b685fa1da8 R14: 0000000000000006 R15: ffff90b5158f8400
FS:  00007fd7ab143880(0000) GS:ffff90b6bea80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000e81091ef030 CR3: 00000007d80c4000 CR4: 0000000000340ee0
Call Trace:
 dcn20_validate_bandwidth+0x2c/0x40 [amdgpu]
 dc_commit_updates_for_stream+0xad7/0x1930 [amdgpu]
 ? amdgpu_display_get_crtc_scanoutpos+0x85/0x190 [amdgpu]
 amdgpu_dm_atomic_commit_tail+0xb4c/0x1fc0 [amdgpu]
 commit_tail+0x94/0x130 [drm_kms_helper]
 drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
 drm_client_modeset_commit_atomic+0x1c9/0x200 [drm]
 drm_client_modeset_commit_locked+0x50/0x150 [drm]
 __drm_fb_helper_restore_fbdev_mode_unlocked+0x59/0xc0 [drm_kms_helper]
 drm_fb_helper_set_par+0x3c/0x50 [drm_kms_helper]
 fb_set_var+0x175/0x370
 ? update_load_avg+0x78/0x630
 ? update_curr+0x69/0x1a0
 fbcon_blank+0x20d/0x270
 do_unblank_screen+0xaa/0x150
 complete_change_console+0x54/0xd0
 vt_ioctl+0x126f/0x1320
 tty_ioctl+0x372/0x8c0
 ksys_ioctl+0x87/0xc0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4e/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fd7ab75d1c7
Code: 00 00 90 48 8b 05 b9 9c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 9c 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe9138fec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd7ab75d1c7
RDX: 0000000000000001 RSI: 0000000000005605 RDI: 0000000000000015
RBP: 0000000000000015 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffe9138ff38
R13: 0000000000000006 R14: 00007ffe91390050 R15: 00007ffe91390048
---[ end trace eefc00b763354df8 ]---
Comment 46 Duncan 2020-07-07 15:27:53 UTC
(In reply to Fabian Möller from comment #45)
> Applying 6eb3cf2e06d22b2b08e6b0ab48cb9c05a8e1a107 to v5.7.7 fixed the issue
> for a RX5700/Navi10 under Wayland for me. 

Polaris11 uses a different code path or needs an additional fix?  (Less likely, maybe X/plasma/kwin makes the kernel calls differently?)

Progress for all and a fix for some in any case! =:^)
Comment 47 Duncan 2020-07-07 19:05:12 UTC
(In reply to Duncan from comment #39)
> 28 candidates ATM, several of which are OCFS2 or spelling
> fixes neither of which should affect this bug.  Excluding those there are
> eleven left; the penultimate (next to last) one looks to be a good candidate:
> 
> 5f2d5026b mm/Makefile: disable KCSAN for kmemleak
> b0d14fc43 mm/kmemleak.c: use address-of operator on section symbols
> 667c79016 revert "topology: add support for node_to_mem_node() to determine
> the fallback node"
> 3202fa62f slub: relocate freelist pointer to middle of object
> 1ad53d9fa slub: improve bit diffusion for freelist ptr obfuscation
> bbd4e305e mm/slub.c: replace kmem_cache->cpu_partial with wrapped APIs
> 4c7ba22e4 mm/slub.c: replace cpu_slab->partial with wrapped APIs
> c537338c0 fs_parse: remove pr_notice() about each validation
> 630f289b7 asm-generic: make more kernel-space headers mandatory
> 98c985d7d kthread: mark timer used by delayed kthread works as IRQ safe
> 4054ab64e tools/accounting/getdelays.c: fix netlink attribute length

(... and comment #44)
> [I]dea to just try patch-reverting them on top of
> 5.7 or current 5.8-rc, thus eliminating the apparently unrelated
> kernel-panics I've twice triggered at the current bisect step.

[Again noting that on my polaris11 the bug doesn't seem to be fixed, despite comment #45 saying it is on his navi10 with a patch/commit that I can see in 5.8-rc4+.]

So I tried this with the 11 above commits against 5.8.0-rc4-00025-gbfe91da29, which previously tested as triggering the freeze for me.  Of the 11, nine reverted cleanly and I simply noted and skipped the other two (3202fa62f and 630f289b7) for the moment.  The patched kernel built successfully and I'm booted to it now.  I just completed a system update (on gentoo so built from source), which doesn't always trigger the freeze, but seems to do so with a reasonable number of package updates perhaps 50% of the time on kernels with this bug.  No freeze.

I'll now try some 4k youtube in firefox, the other stressor that sometimes seems to trigger it here, and perhaps combine that with an unnecessary rebuild (since my system's already current) of something big like qtwebengine.  If that doesn't trigger a freeze I'll stay booted to this thing another few days and try some more, before being confident enough to declare that one of those nine commits triggers the bug on my hardware and reverting them eliminates it.

Assuming it is one of those 9 commits (down from 28, as I quoted above, at my last completed auto-bisect step) I'll reset and try manually bisecting on the 9.  It's looking good so far, but other kernels have looked good at this stage and then ultimately frozen with the telltale gpf log, so it remains to be seen.

Meanwhile, nice to be on a current development kernel and well past rc1 stage, again. =:^)  Bisect-testing otherwise long-stale pre-rc1 kernels with other kernel-crasher bugs to complicate things is *not* my definition of fun!
Comment 48 Duncan 2020-07-08 00:25:09 UTC
(In reply to Duncan from comment #47)
> > [I]dea to just try patch-reverting them on top of
> > 5.7 or current 5.8-rc, thus eliminating the apparently unrelated
> > kernel-panics I've twice triggered at the current bisect step.
> 
> So I tried this with the 11 above commits against
> 5.8.0-rc4-00025-gbfe91da29, which previously tested as triggering the freeze
> for me.  Of the 11, nine clean-reversed and I simply noted and skipped the
> other two (3202fa62f and 630f289b7) for the moment.  The patched kernel
> successfully built and I'm booted to it now.

Bah, humbug!  Got a freeze and the infamous logged trace on that too!  I was hoping to demonstrably prove it to be in those nine!  I proved it *NOT* to be!

Well, there's still the two commits to look at that wouldn't cleanly simple-revert.  Maybe I'll get lucky and it's just an ordering thing, since I applied them out of order compared to the original commit order, and they'll simple-revert on top of the others.  Otherwise I'll have to actually look and see if I can make sense of it and revert manually, maybe/maybe-not doable for a non-coder, or try on 5.7 instead of 5.8-rc.

If not them, maybe I'll just have to declare defeat on the bisect and hope for a fix without it.  Last resort there's the buy-my-way-out solution, tho of course that leaves others without that option in a bind.  But given the hours I've put into this (hours I've only had thanks to COVID work suspension), at some point you just gotta cut your losses and declare defeat.

But we're not there yet.  There's still the two to look at first, and the middle-ground 5.7 to try all 11 against.  Hopefully...
Comment 49 Christopher Snowhill 2020-07-08 01:25:35 UTC
One possibility that I hadn't considered when I was originally testing this. I use the GNOME 3 desktop on Arch, and have two monitors, one 3840x2160@60Hz, one 1920x1080@60Hz, both DisplayPort. One thing I haven't enabled since I switched back from my Nvidia GTX 960 backup card was Variable Refresh Rate, which I had previously enabled in my Xorg configuration.

I never experienced crashes like these on page flips on my Nvidia card, and am awaiting a crash any day now on the RX 480, assuming I haven't magically configured it away with the deletion of that Xorg config snippet which did nothing but enable VRR.
Comment 50 rtmasura+kernel 2020-07-08 20:16:01 UTC
I have 3 monitors, 2 1080p and one 1440p. Happens when I use vblank_mode glx or xpresent, off and I'm stable.
Comment 51 rtmasura+kernel 2020-07-08 20:17:12 UTC
that didn't read well, with vblank_mode off for XFWM I don't have this issue at all.
Comment 52 Michel Dänzer 2020-07-09 07:45:43 UTC
(In reply to rtmasura+kernel from comment #51)
> that didn't read well, with vblank_mode off for XFWM I don't have this issue
> at all.

That just avoids the problem by not doing any page flips.
Comment 53 Stratos Zolotas 2020-07-10 07:23:09 UTC
Hi everyone.

Don't know if it helps. I'm getting a similar issue on Opensuse Tumbleweed with kernel 5.7.7. Reverting to kernel 5.7.5 makes things stable for me. My GPU is RX580.

    Ιουλ 09 21:17:39.718030 teras.baskin.cywn kernel: general protection fault, probably for non-canonical address 0x3e9478a9ecb3abc8: 0000 [#1] SMP NOPTI
    Ιουλ 09 21:17:39.718200 teras.baskin.cywn kernel: CPU: 1 PID: 141 Comm: kworker/u16:3 Tainted: G           O      5.7.7-1-default #1 openSUSE Tumbleweed (unreleased)
    Ιουλ 09 21:17:39.718239 teras.baskin.cywn kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./970A-DS3P, BIOS FD 02/26/2016
    Ιουλ 09 21:17:39.718273 teras.baskin.cywn kernel: Workqueue: events_unbound commit_work [drm_kms_helper]
    Ιουλ 09 21:17:39.718306 teras.baskin.cywn kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x273/0x10f0 [amdgpu]
    Ιουλ 09 21:17:39.718339 teras.baskin.cywn kernel: Code: 43 08 8b 90 e0 02 00 00 41 83 c6 01 44 39 f2 0f 87 3a ff ff ff 48 83 bd a0 fd ff ff 00 0f 84 03 01 00 00 48 8b bd a0 fd ff ff <80> bf b0 01 00 00 01 0f 86 ac 00 00 00 48 b9 00 00 00 00 01 00 00
    Ιουλ 09 21:17:39.718368 teras.baskin.cywn kernel: RSP: 0018:ffffb7cf4037bbe0 EFLAGS: 00010202
    Ιουλ 09 21:17:39.718400 teras.baskin.cywn kernel: RAX: ffff8fb2a5e11800 RBX: ffff8fb28f2c2880 RCX: ffff8fb10ff8ec00
    Ιουλ 09 21:17:39.718442 teras.baskin.cywn kernel: RDX: 0000000000000006 RSI: ffffffffc0b7f530 RDI: 3e9478a9ecb3abc8
    Ιουλ 09 21:17:39.718482 teras.baskin.cywn kernel: RBP: ffffb7cf4037be68 R08: 0000000000000001 R09: 0000000000000001
    Ιουλ 09 21:17:39.718519 teras.baskin.cywn kernel: R10: 000000000000014d R11: 0000000000000018 R12: ffff8fb2a35e7400
    Ιουλ 09 21:17:39.718547 teras.baskin.cywn kernel: R13: 0000000000000000 R14: 0000000000000006 R15: ffff8fb112001000
    Ιουλ 09 21:17:39.718584 teras.baskin.cywn kernel: FS:  0000000000000000(0000) GS:ffff8fb2aec40000(0000) knlGS:0000000000000000
    Ιουλ 09 21:17:39.718620 teras.baskin.cywn kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Ιουλ 09 21:17:39.718652 teras.baskin.cywn kernel: CR2: 00007f364b9bc000 CR3: 000000042a2ae000 CR4: 00000000000406e0
    Ιουλ 09 21:17:39.718683 teras.baskin.cywn kernel: Call Trace:
    Ιουλ 09 21:17:39.718715 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.718750 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.718784 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.718810 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.718840 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.718868 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.718894 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.718921 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.718946 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.718972 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.718999 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719026 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719062 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719088 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719122 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719149 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719177 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719203 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719229 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719254 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719280 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719306 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719333 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719359 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719383 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719408 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719433 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719465 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719490 teras.baskin.cywn kernel:  ? __switch_to_asm+0x40/0x70
    Ιουλ 09 21:17:39.719515 teras.baskin.cywn kernel:  ? __switch_to+0x152/0x380
    Ιουλ 09 21:17:39.719545 teras.baskin.cywn kernel:  ? __switch_to_asm+0x34/0x70
    Ιουλ 09 21:17:39.719572 teras.baskin.cywn kernel:  ? __schedule+0x1fe/0x560
    Ιουλ 09 21:17:39.719605 teras.baskin.cywn kernel:  ? usleep_range+0x80/0x80
    Ιουλ 09 21:17:39.719637 teras.baskin.cywn kernel:  ? _cond_resched+0x16/0x40
    Ιουλ 09 21:17:39.719664 teras.baskin.cywn kernel:  ? __wait_for_common+0x3b/0x160
    Ιουλ 09 21:17:39.719690 teras.baskin.cywn kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
    Ιουλ 09 21:17:39.719727 teras.baskin.cywn kernel:  process_one_work+0x1e3/0x3b0
    Ιουλ 09 21:17:39.719760 teras.baskin.cywn kernel:  worker_thread+0x46/0x340
    Ιουλ 09 21:17:39.719795 teras.baskin.cywn kernel:  ? process_one_work+0x3b0/0x3b0
    Ιουλ 09 21:17:39.719830 teras.baskin.cywn kernel:  kthread+0x115/0x140
    Ιουλ 09 21:17:39.719888 teras.baskin.cywn kernel:  ? __kthread_bind_mask+0x60/0x60
    Ιουλ 09 21:17:39.719920 teras.baskin.cywn kernel:  ret_from_fork+0x22/0x40
    Ιουλ 09 21:17:39.719952 teras.baskin.cywn kernel: Modules linked in: rfcomm fuse af_packet vboxnetadp(O) vboxnetflt(O) cmac algif_hash vboxdrv(O) algif_skcipher af_alg bnep dmi_sysfs msr it87 hwmon_vid squashfs xfs nls_iso8859_1 nls_cp437 vfat fat loop edac_mce_amd uvcvideo kvm_amd pktcdvd ccp videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 kvm videobuf2_common snd_hda_codec_realtek snd_hda_codec_generic snd_usb_audio irqbypass snd_hda_codec_hdmi ledtrig_audio videodev btusb snd_usbmidi_lib btrtl snd_rawmidi btbcm snd_hda_intel btintel snd_seq_device snd_intel_dspcfg crct10dif_pclmul crc32_pclmul mc ghash_clmulni_intel bluetooth joydev snd_hda_codec aesni_intel ecdh_generic crypto_simd rfkill ecc cryptd snd_hda_core glue_helper efi_pstore fam15h_power pcspkr k10temp sp5100_tco snd_hwdep i2c_piix4 snd_pcm r8169 snd_timer realtek snd libphy soundcore tiny_power_button button acpi_cpufreq tcp_bbr sch_fq hid_logitech_hidpp hid_logitech_dj uas usb_storage hid_generic usbhid btrfs sr_mod cdrom blake2b_generic libcrc32c xor amdgpu
    Ιουλ 09 21:17:39.720011 teras.baskin.cywn kernel:  ohci_pci amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper raid6_pq crc32c_intel ata_generic syscopyarea xhci_pci sysfillrect sysimgblt fb_sys_fops xhci_hcd cec ohci_hcd rc_core ehci_pci ehci_hcd drm pata_atiixp usbcore sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua efivarfs
    Ιουλ 09 21:17:39.720045 teras.baskin.cywn kernel: ---[ end trace 573bd378072b1ec2 ]---
    Ιουλ 09 21:17:39.720078 teras.baskin.cywn kernel: RIP: 0010:amdgpu_dm_atomic_commit_tail+0x273/0x10f0 [amdgpu]
    Ιουλ 09 21:17:39.720105 teras.baskin.cywn kernel: Code: 43 08 8b 90 e0 02 00 00 41 83 c6 01 44 39 f2 0f 87 3a ff ff ff 48 83 bd a0 fd ff ff 00 0f 84 03 01 00 00 48 8b bd a0 fd ff ff <80> bf b0 01 00 00 01 0f 86 ac 00 00 00 48 b9 00 00 00 00 01 00 00
    Ιουλ 09 21:17:39.720133 teras.baskin.cywn kernel: RSP: 0018:ffffb7cf4037bbe0 EFLAGS: 00010202
    Ιουλ 09 21:17:39.720163 teras.baskin.cywn kernel: RAX: ffff8fb2a5e11800 RBX: ffff8fb28f2c2880 RCX: ffff8fb10ff8ec00
    Ιουλ 09 21:17:39.720193 teras.baskin.cywn kernel: RDX: 0000000000000006 RSI: ffffffffc0b7f530 RDI: 3e9478a9ecb3abc8
    Ιουλ 09 21:17:39.720241 teras.baskin.cywn kernel: RBP: ffffb7cf4037be68 R08: 0000000000000001 R09: 0000000000000001
    Ιουλ 09 21:17:39.720294 teras.baskin.cywn kernel: R10: 000000000000014d R11: 0000000000000018 R12: ffff8fb2a35e7400
    Ιουλ 09 21:17:39.720322 teras.baskin.cywn kernel: R13: 0000000000000000 R14: 0000000000000006 R15: ffff8fb112001000
    Ιουλ 09 21:17:39.720351 teras.baskin.cywn kernel: FS:  0000000000000000(0000) GS:ffff8fb2aec40000(0000) knlGS:0000000000000000
    Ιουλ 09 21:17:39.720380 teras.baskin.cywn kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Ιουλ 09 21:17:39.720409 teras.baskin.cywn kernel: CR2: 00007f364b9bc000 CR3: 000000042a2ae000 CR4: 00000000000406e0
    Ιουλ 09 21:19:54.107989 teras.baskin.cywn kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:49:crtc-1] hw_done or flip_done timed out
    Ιουλ 09 21:20:04.348029 teras.baskin.cywn kernel: [drm:do_aquire_global_lock.isra.0 [amdgpu]] *ERROR* [CRTC:49:crtc-1] hw_done or flip_done timed out
Comment 54 Paul Menzel 2020-07-10 07:36:15 UTC
(In reply to Stratos Zolotas from comment #53)

> Don't know if it helps. I'm getting a similar issue on Opensuse Tumbleweed
> with kernel 5.7.7. Reverting to kernel 5.7.5 makes things stable for me. My
> GPU is RX580.

[…]

Thank you for your report. How quickly can you reproduce it? If you could bisect the issue to pinpoint the culprit commit between 5.7.5 and 5.7.7, that’d be great. Maybe even open a separate bug report, in case they are unrelated. They can always be marked as duplicates later.
Comment 55 Stratos Zolotas 2020-07-10 08:10:10 UTC
(In reply to Paul Menzel from comment #54)

> Thank you for your report. How quickly can you reproduce it? If you could
> bisect the issue to pinpoint the culprit commit between 5.7.5 and 5.7.7,
> that’d be great. Maybe open even a separate bug report, in case they are
> unrelated. They can always be marked as duplicates later.

If you guide me on what to do I can report back in some hours (not on that system now). I had 4 crashes yesterday with kernel 5.7.7 in 3 hours doing daily stuff (not gaming or something like that). System was unresponsive, ssh to the box worked but reboot from console hangs also, only ALT+SysRq+B reboots the system. I booted with the previous kernel (5.7.5) and was stable for over 6-7 hours.
Comment 56 Duncan 2020-07-10 10:55:02 UTC
Some notes and a question (last * point):

* There seem to be two, and it's now looking like three, near-identical bugs or variants of the same bug, all with the very similar amdgpu_dm_atomic_commit_tail/events_unbound commit_work log trace.

1) Until now all the reports seemed to start by 5.7.0 and presumably between 5.6.0 and 5.7-rc1, which was when I first saw it.  But now, comment #53 is reporting an origin with 5.7.6 or 5.7.7 while 5.7.5 was fine.  That's on rx580, which wikipedia says is polaris20.

2) Of the other two, one is reported fixed (on an rx5700/navi10) by commit 6eb3cf2e0 which we were asked to try above, and which made it into 5.8-rc4, while...

3) My older rx460/polaris11, started with a pull shortly before 5.7-rc1 (that I've been unable to properly bisect to, once for sure and it's looking like twice, much to my frustration!) and continues all the way thru today's almost 5.8-rc5 -- the 6eb commit didn't help.

Seems the vega/navi graphics either started later (your 5.7.5 good, 5.7.7 bad) or are fixed by 6eb, while my older polaris started earlier and isn't fixed by 6eb.

BTW Stratos, that 6eb commit appears to be in the fresh 5.7.8 as well.  Seeing if the bug is still there would thus be interesting.

* Chris mentioned variable-refresh-rate/VRR in comment #49.  He was wondering if turning it OFF helped him as he had done so when migrating cards and hadn't seen the problem on his rx480 after that.

I hadn't messed with VRR here on my rx460/polaris11, because I'm running dual 4k TVs as monitors and didn't think they supported it, yet I was the OP, so at least on rx460 having VRR off doesn't seem to help.  But just for kicks I did try turning it on yesterday while back on a stable 5.6.0, and then booted to today's near-5.8-rc5 to test it.  Still got the graphics freeze.  So that didn't appear to affect the bug here on my rx460 anyway.

Interestingly enough, tho, and quite aside from this bug (maybe it's all in my head), despite thinking VRR shouldn't be available here and expecting no difference, turning it on /does/ seem to make things smoother.  Now I'm wondering if even without actual VRR, turning it on helps something stay in sync better, at least on my hardware.  <shrug>  Tho it doesn't seem to affect how the bug triggers, maybe that'll be the hint necessary for the devs to figure out what's different about the bug on my rx460 compared to the newer stuff, thus helping them fix the older stuff too.

* Now the question: Anybody with this bug that is **NOT** running multi-monitor when it triggers?  Seems all I've seen are multi-monitor, but someone could have simply not mentioned (or I just missed it) that they're seeing it on single-monitor too.  (If you are running multi-monitor you don't need to post a reply just for this, as that seems to be the reported default.  But having explicit confirmation of whether it affects single-monitor or not could be helpful.)
Comment 57 Anthony Ruhier 2020-07-10 11:25:29 UTC
To give some precision about the kernel version range, I'm staying on 5.6.19 for a while, which doesn't have the issue. It's pretty bad though, as it's EOL.

Only the 5.7 branch has it. So it's something that wasn't backported.

I also have multimonitors, with one with VRR, though I don't know if VRR changes anything in my case as the 2 other screens don't support it.
Comment 58 Anthony Ruhier 2020-07-10 14:31:04 UTC
Sorry, I forgot to say that I have a vega64.
Comment 59 Chan Cuan 2020-07-12 05:20:47 UTC
(In reply to Paul Menzel from comment #54)
> (In reply to Stratos Zolotas from comment #53)
> 
> > Don't know if it helps. I'm getting a similar issue on Opensuse Tumbleweed
> > with kernel 5.7.7. Reverting to kernel 5.7.5 makes things stable for me. My
> > GPU is RX580.
> 
> […]
> 
> Thank you for your report. How quickly can you reproduce it? If you could
> bisect the issue to pinpoint the culprit commit between 5.7.5 and 5.7.7,
> that’d be great. Maybe open even a separate bug report, in case they are
> unrelated. They can always be marked as duplicates later.

I am running the same setup as the comment. RX 580, Tumbleweed, have both kernels 5.7.5 and 5.7.7. On 5.7.7, it happens almost immediately after login. However, reverting to 5.7.5 does NOT stabilise, and the same problem arises somewhere between 1 to 10 minutes.

I didn't have this issue prior to installing the 5.7.7 kernel though...
Comment 60 Stratos Zolotas 2020-07-12 05:47:27 UTC
(In reply to Chan Cuan from comment #59)

> I didn't have this issue prior to installing the 5.7.7 kernel though...

To make things look stranger... I have an inexplicable development with this issue. When it appeared I was in the middle of upgrading some components on my system. I replaced my AMD FX-8350 with an AMD Ryzen 5 3600X and my Gigabyte GA-970a-ds3p motherboard with a Gigabyte X570 UD (along with new RAM dimms, going from 16GB to 32GB). The RX580 stayed the same and the OS is also the same (disks moved to the new motherboard, no re-install). Guess what... running with 5.7.7 for 48 hours now without issues... the problem has disappeared. I suspect a very rare combination of things, maybe not even in the amdgpu driver itself... With 5.7.7 on my "old" configuration, I had the crash almost immediately after login, like in the above comment.
Comment 61 Christopher Snowhill 2020-07-12 07:47:04 UTC
It may be worth noting that I also haven't experienced this crash lately, and one of the things I did recently was update my motherboard BIOS, which included an update from AGESA 1.0.0.4 release 2, to 1.0.0.6am4.
Comment 62 Duncan 2020-07-14 23:36:23 UTC
(In reply to Duncan from comment #48)
> (In reply to Duncan from comment #47)
> > So I tried [patch-reverting] with the 11 above commits against
> > 5.8.0-rc4-00025-gbfe91da29, which previously tested as triggering the
> freeze
> > for me.  Of the 11, nine clean-reversed and I simply noted and skipped the
> > other two (3202fa62f and 630f289b7) for the moment.  The patched kernel
> > successfully built and I'm booted to it now.
> 
> Bah, humbug!  Got a freeze and the infamous logged trace on that too

After taking a few days' discouragement-break I'm back at trying to pin it down.  The quote above left two candidate commits, 3202fa62f and 630f289b7, neither of which would cleanly revert, as later commits were preventing that.

630f289b7 is a few lines changed in many files, so I'm focusing on the simpler 3202fa62f first.  Turns out the reason 320... wasn't reverting was two additional fixes to it that landed before v5.7.  Since they carried Fixes: 320... tags they were easy enough to find and patch-revert, after which patch-reverting 320... itself worked against a current v5.8-rc5-8-g0dc589da8.  I first tested without the reverts to be sure it was still triggering this bug for me, and just confirmed it was: freeze with the telltale log dump.

So for me at least, v5.8-rc5 is bad (just updated the version field to reflect that).

Meanwhile I've applied the three 320-and-followups revert-patches to v5.8-rc5-8-g0dc589da8 and just did the rebuild with them applied.  Now to reboot to it and see if it still has our bug.  If no, great, pinned down.  If yes, there's still that 630... commit to try to test.
Comment 63 Duncan 2020-07-15 16:49:23 UTC
(In reply to Duncan from comment #62)
> I've applied the three 320-and-followups revert-patches to
> v5.8-rc5-8-g0dc589da8 and just did the rebuild with them applied.
> Now to reboot to it and see if it still has our bug.

NB: The 3202fa62f followups are cbfc35a48 and 89b83f282.  That should let anyone else with git and kernel building skills try reverting the three.
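The follow-ups can be located like that because kernel convention records them as `Fixes:` trailers in the commit message.  A minimal scratch-repo sketch of the idea (the repo and commit subjects here are synthetic; against a real kernel clone the equivalent would be something like `git log --oneline --grep='Fixes: 3202fa62f'`):

```shell
# Sketch of chasing follow-up fixes via Fixes: trailers, on a scratch repo:
# create an "original" commit, then a fix that names it in a trailer.
set -e
cd "$(mktemp -d)"
git init -q .
g() { git -c user.email=fixes@example.com -c user.name=fixes "$@"; }
g commit -q --allow-empty -m "slub: relocate freelist pointer"
orig=$(git rev-parse --short HEAD)
g commit -q --allow-empty -m "slub: follow-up fix

Fixes: $orig (\"slub: relocate freelist pointer\")"
# --grep searches the whole commit message, so trailers naming the original
# commit ID turn up its follow-up fixes.
git log --format=%s --grep="Fixes: $orig"
```

Reverting a commit without also reverting the commits that fix it is what produces the conflicts seen earlier, so the trailer search matters before any revert attempt.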

Still too early (by days) to call it nailed down as I've had it take 2-3 days to trigger, but no gfx freeze here yet on that v5.8-rc5+ with 320-and-followups reverted so far, despite playing 4k video to try to trigger it as it has previously on affected kernels.  I'll be trying update builds (gentoo) later today or tomorrow, another previous trigger, so we'll see how it goes.

But initial results are good enough to let others know that may want to try it...
Comment 64 Anthony Ruhier 2020-07-15 17:12:59 UTC
(In reply to Duncan from comment #63)
> (In reply to Duncan from comment #62)
> > I've applied the three 320-and-followups revert-patches to
> > v5.8-rc5-8-g0dc589da8 and just did the rebuild with them applied.
> > Now to reboot to it and see if it still has our bug.
> 
> NB: The 3202fa62f followups are cbfc35a48 and 89b83f282.  That should let
> anyone else with git and kernel building skills try reverting the three.
> 
> Still too early (by days) to call it nailed down as I've had it take 2-3
> days to trigger, but no gfx freeze here yet on that v5.8-rc5+ with
> 320-and-followups reverted so far, despite playing 4k video to try to
> trigger it as it has previously on affected kernels.  I'll be trying update
> builds (gentoo) later today or tomorrow, another previous trigger, so we'll
> see how it goes.
> 
> But initial results are good enough to let others know that may want to try
> it...

Thanks a lot, I'm also trying on my side.
Comment 65 Duncan 2020-07-16 02:12:52 UTC
(In reply to Duncan from comment #63)
> NB: The 3202fa62f followups are cbfc35a48 and 89b83f282.  That should let
> anyone else with git and kernel building skills try reverting the three.
> 
> Still too early (by days) to call it nailed down as I've had it take 2-3
> days to trigger, but no gfx freeze here yet on that v5.8-rc5+ with
> 320-and-followups reverted so far, despite playing 4k video to try to
> trigger it as it has previously on affected kernels.  I'll be trying update
> builds (gentoo) later today or tomorrow, another previous trigger, so we'll
> see how it goes.

I'm still not saying for sure, but that's actually looking like the culprit.

Today's gentoo update included a dep of qtwebengine, which changed ABI, so qtwebengine needed to be rebuilt on top of it, and qtwebengine is chromium-based.  And as anyone who's built chromium (or firefox for that matter) can tell you, at least on older FX-based hardware, it's several hours of near-constant 100% on all cores.

While rebuilding qtwebengine (at a batch-nice of +19 so it doesn't interfere too badly with anything else I want to run), I was playing youtube videos at 1080p, not normally a problem by themselves (tho 4k can be, especially 4k60) but with qtwebengine building at the same time...

No freezes.

I'm going to run with the 320 commit and followups reverted a few more days before declaring it for sure the culprit, and I'm watching for Anthony's results as well, but the bug's sure doing a convincing job of hiding ATM if that commit isn't the culprit!

I'd say it's time to start reviewing the amdgpu code to see what relocating the slub freelist pointer to the middle of the object (which is what the 320 commit did, according to its git log explanation) could tickle when the work goes on the work queue to run later, since that's consistently the scenario the logs show, and what mnrzk confirmed by forcing it /not/ to go to the work queue in comment #30.

Hopefully we can still get and confirm a proper codefix by 5.8.0 release. =:^)
Comment 66 Paul Menzel 2020-07-16 06:37:10 UTC
Kees, Andrew, do you have an idea how commit 3202fa62fb ("slub: relocate freelist pointer to middle of object") could cause a regression?
Comment 67 Anthony Ruhier 2020-07-16 09:35:57 UTC
No freeze for me either, and I compiled firefox yesterday, which usually triggers a freeze on 5.7; nothing yet. That's some really good news if it stays true, thanks a lot Duncan!

FYI, I applied the revert on 5.7.8, I didn't want to run on 5.8.
Comment 68 Stratos Zolotas 2020-07-16 10:24:56 UTC
(In reply to Stratos Zolotas from comment #60)
> 
> To make things looks more strange... I have a non-explicable development
> with this issue. When it appeared to me I was in the middle of upgrading
> some components on my system. I replaced my AMD FX-8350 with one AMD Ryzen 5
> 3600X and my Gigabyte GA-970a-ds3p motherboard with one Gigabyte X570 UD
> (along with new RAM dimms from 16GB to 32GB). RX580 stayed the same and also
> OS is the same (disks moved to the new motherboard, no re-install). Guess
> what... running with 5.7.7 for 48 hours now without issues.... problem has
> disappeared. I suspect a very rare combination of things maybe even not in
> the amdgpu driver itself... With 5.7.7 on my "old" configuration, I had the
> crash almost immediately after login like in the above comment.

Just to report that I got the issue after some days with my new hardware setup, so it is still there. Hope you guys pinpoint it soon!
Comment 69 Anthony Ruhier 2020-07-16 10:30:08 UTC
(In reply to Stratos Zolotas from comment #68)
> (In reply to Stratos Zolotas from comment #60)
> > 
> > To make things looks more strange... I have a non-explicable development
> > with this issue. When it appeared to me I was in the middle of upgrading
> > some components on my system. I replaced my AMD FX-8350 with one AMD Ryzen
> 5
> > 3600X and my Gigabyte GA-970a-ds3p motherboard with one Gigabyte X570 UD
> > (along with new RAM dimms from 16GB to 32GB). RX580 stayed the same and
> also
> > OS is the same (disks moved to the new motherboard, no re-install). Guess
> > what... running with 5.7.7 for 48 hours now without issues.... problem has
> > disappeared. I suspect a very rare combination of things maybe even not in
> > the amdgpu driver itself... With 5.7.7 on my "old" configuration, I had the
> > crash almost immediately after login like in the above comment.
> 
> Just to report that got the issue after some days with my new hardware
> setup, so it is still there, hope you guys pinpoint it soon!

You're talking about having the bug with 5.7.7 vanilla, right? Not with the revert of the commits quoted above?
Comment 70 Stratos Zolotas 2020-07-16 10:32:26 UTC
(In reply to Anthony Ruhier from comment #69)

> 
> You're talking about having the bug with 5.7.7 vanilla, right? Not with the
> revert of the commits quoted above?

Yes! It seemed to have "disappeared" with the hardware change, but it probably just took a while to reappear. I'm on the vanilla kernel, correct.
Comment 71 Anthony Ruhier 2020-07-17 12:39:43 UTC
Just to give some news, I can confirm that I haven't had any freeze since Wednesday. Usually, when my system just idled, it would quickly trigger the bug. That or doing something CPU intensive (like compiling firefox). But nothing since I reverted the 3 commits.

Really good job Duncan! Thanks a lot for your debug!

MB chipset: x470 
CPU: ryzen 2700x
GPU: vega64
Comment 72 Vinicius 2020-07-20 02:20:13 UTC
Confirming that reverting 3202fa62f, cbfc35a48 and 89b83f282, fixed my polaris10 too.

Tested with 5.7.8 and 5.7.9, Radeon RX 570.
Comment 73 Jeremy Kescher 2020-07-21 16:40:10 UTC
Confirming as well that 3202fa62f, cbfc35a48 and 89b83f282 are the commits that cause this regression.

Tested with 5.7.9, Radeon RX 480.
Comment 74 Paul Menzel 2020-07-21 16:57:12 UTC
I sent a message to the LKML and amd-gfx list [1], asking Kees and Andrew on how to proceed.

[1]: https://lkml.org/lkml/2020/7/21/729
     "[Regression] hangs caused by commit 3202fa62fb (slub: relocate freelist pointer to middle of object)"
Comment 75 Kees Cook 2020-07-21 19:32:30 UTC
Hi!

First, let me say sorry for all the work my patch has caused! It seems like it might be tickling another (previously dormant) bug in the gpu driver.


(In reply to mnrzk from comment #30)
> I've been looking at this bug for a while now and I'll try to share what
> I've found about it.
> 
> In some conditions, when amdgpu_dm_atomic_commit_tail calls
> dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
> dm_atomic_state* with an garbage context pointer.
> 
> I've also found that this bug exclusively occurs when commit_work is on the
> workqueue. After forcing drm_atomic_helper_commit to run all of the commits
> without adding to the workqueue and running the OS, the issue seems to have
> disappeared. The system was stable for at least 1.5 hours before I manually
> shut it down (meanwhile it has usually crashed within 30-45 minutes).
> 
> Perhaps there's some sort of race condition occurring after commit_work is
> queued?

If it helps to explain what's happening in 3202fa62f: the kernel memory allocator moved its freelist pointer from offset 0 to the middle of the object. That means that when memory is freed, the allocator writes 8 bytes into the object to join the newly freed memory into the allocator's freelist. That always happened, but after 3202fa62f it writes them in the middle, not at offset 0. If the work queue is trying to use freed memory, and before it didn't notice the first 8 bytes getting written, now it appears to notice the overwrite... but that still means something is freeing memory before it should.

Finding that might be a real trick. :( However, if you've suffered through all those bisections, I wonder if you can try one other thing, which is to compile the kernel with KASAN:

CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_OUTLINE=y
CONFIG_KASAN_STACK=y
CONFIG_KASAN_VMALLOC=y

This will make things _slow_, which might mean the use-after-free race may never trigger. *However* it's possible that it'll catch a bad behavior before it even needs to get hit in a race that triggers the behavior you're seeing. (And note that swapping CONFIG_KASAN_OUTLINE=y for CONFIG_KASAN_INLINE=y might speed things up, but the kernel image gets bigger).

I'm going to try to read the work queue code for the driver and see if anything obvious stands out...
Comment 76 mnrzk 2020-07-21 20:33:54 UTC
(In reply to Kees Cook from comment #75)
> Hi!
> 
> First, let me say sorry for all the work my patch has caused! It seems like
> it might be tickling another (previously dormant) bug in the gpu driver.
> 
> 
> (In reply to mnrzk from comment #30)
> > I've been looking at this bug for a while now and I'll try to share what
> > I've found about it.
> > 
> > In some conditions, when amdgpu_dm_atomic_commit_tail calls
> > dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
> > dm_atomic_state* with an garbage context pointer.
> > 
> > I've also found that this bug exclusively occurs when commit_work is on the
> > workqueue. After forcing drm_atomic_helper_commit to run all of the commits
> > without adding to the workqueue and running the OS, the issue seems to have
> > disappeared. The system was stable for at least 1.5 hours before I manually
> > shut it down (meanwhile it has usually crashed within 30-45 minutes).
> > 
> > Perhaps there's some sort of race condition occurring after commit_work is
> > queued?
> 
> If it helps to explain what's happening in 3202fa62f, the kernel memory
> allocator is moving it's free pointer from offset 0 to the middle of the
> object. That means that when the memory is freed, it writes 8 bytes to join
> the newly freed memory into the allocator's freelist. That always happened,
> but after 3202fa62f it began writing it in the middle, not offset 0. If the
> work queue is trying to use freed memory, and before it didn't notice the
> first 8 bytes getting written, now it appears to notice the overwrite... but
> that still means something is freeing memory before it should.
> 
> Finding that might be a real trick. :( However, if you've suffered through
> all those bisections, I wonder if you can try one other thing, which is to
> compile the kernel with KASAN:
> 
> CONFIG_KASAN=y
> CONFIG_KASAN_GENERIC=y
> CONFIG_KASAN_OUTLINE=y
> CONFIG_KASAN_STACK=y
> CONFIG_KASAN_VMALLOC=y
> 
> This will make things _slow_, which might mean the use-after-free race may
> never trigger. *However* it's possible that it'll catch a bad behavior
> before it even needs to get hit in a race that triggers the behavior you're
> seeing. (And note that swapping CONFIG_KASAN_OUTLINE=y for
> CONFIG_KASAN_INLINE=y might speed things up, but the kernel image gets
> bigger).
> 
> I'm going to try to read the work queue code for the driver and see if
> anything obvious stands out...

Actually this makes perfect sense, struct dm_atomic_state* dm_state has
two components, base (a struct containing a struct drm_atomic_state*) and
context (a struct dc_state*). Reading through the code of
amdgpu_dm_atomic_commit_tail, I see that dm_state->base is never used.

If my understanding is correct, base would have previously been filled with
the freelist pointer (since it's the first 8 bytes). Now since the freelist
pointer is being put in the middle (rounded to the nearest sizeof(void*),
 or 8 bytes), it's being put in the last 8 bytes of *dm_state
(or dm_state->context).

I'll place a void* for padding in the middle of struct dm_atomic_state* and
if my hypothesis is correct, the padding will be filled with garbage data
instead of context and the bug should be fixed. Of course, there would
still be a use-after-free bug in the code which may cause other issues in
the future so I wouldn't really consider it a solution.

Regarding KASAN, I've tried compiling the kernel with KASAN enabled and
from my experience, the bug did not trigger after actively using the system
for 3 hours and leaving it on for 12 hours. This was almost a month ago
though so maybe I'll try again with different KASAN options (i.e.
CONFIG_KASAN_INLINE=y). If anyone has any more tips on getting KASAN to run
faster, I'll be glad to hear them.
Comment 77 Kees Cook 2020-07-21 20:49:36 UTC
(Midair collision... you saw the same about the structure layout as I did. Here's my comment...)

(In reply to mnrzk from comment #30)
> I've been looking at this bug for a while now and I'll try to share what
> I've found about it.
> 
> In some conditions, when amdgpu_dm_atomic_commit_tail calls
> dm_atomic_get_new_state, dm_atomic_get_new_state returns a struct
> dm_atomic_state* with an garbage context pointer.

It looks like when amdgpu_dm_atomic_commit_tail() walks the private objects list with for_each_new_private_obj_in_state(), it'll return the first object's state when the function pointer tables match. This is a struct dm_atomic_state allocation, which is 16 bytes:

struct drm_private_state {
        struct drm_atomic_state *state;
};

struct dm_atomic_state {
        struct drm_private_state base;
        struct dc_state *context;
};

If struct dm_atomic_state is being freed early, this would match the behavior seen: before 3202fa62f, .base.state would be overwritten with a freelist pointer. After 3202fa62f, .context will be overwritten.

In looking for all "kfree(.*state" patterns in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c, I see a few suspicious things, maybe. dm_crtc_destroy_state() and amdgpu_dm_connector_funcs_reset() do an explicit kfree(state) -- should they use dm_atomic_destroy_state() instead? Or nothing at all, since I'd expect "state" to be managed by the drm layer via the .atomic_destroy_state callback?


> I've also found that this bug exclusively occurs when commit_work is on the
> workqueue. After forcing drm_atomic_helper_commit to run all of the commits
> without adding to the workqueue and running the OS, the issue seems to have
> disappeared. The system was stable for at least 1.5 hours before I manually
> shut it down (meanwhile it has usually crashed within 30-45 minutes).

Is this the async call to "commit_work" in drm_atomic_helper_commit()?

There's a big warning in there:

        /*
         * Everything below can be run asynchronously without the need to grab
         * any modeset locks at all under one condition: It must be guaranteed
         * that the asynchronous work has either been cancelled (if the driver
         * supports it, which at least requires that the framebuffers get
         * cleaned up with drm_atomic_helper_cleanup_planes()) or completed
         * before the new state gets committed on the software side with
         * drm_atomic_helper_swap_state().
         ...

I'm not sure how to determine if amdgpu_dm.c is doing this correctly?

I can't tell what can interfere with drm_atomic_helper_commit() -- I would guess the race is between that and something else causing a kfree(), but I don't know the APIs here at all...
Comment 78 Kees Cook 2020-07-21 20:56:28 UTC
(In reply to mnrzk from comment #76)
> If my understanding is correct, base would have previously been filled with
> the freelist pointer (since it's the first 8 bytes). Now since the freelist
> pointer is being put in the middle (rounded to the nearest sizeof(void*),
>  or 8 bytes), it's being put in the last 8 bytes of *dm_state
> (or dm_state->context).
> 
> I'll place a void* for padding in the middle of struct dm_atomic_state* and
> if my hypothesis is correct, the padding will be filled with garbage data
> instead of context and the bug should be fixed. Of course, there would
> still be a use-after-free bug in the code which may cause other issues in
> the future so I wouldn't really consider it a solution.

Agreed: that should make it disappear again, but as you say, it's just kicking the problem down the road since now the failing condition is losing a race with kfree()+kmalloc()+new contents.

And if you want to detect without crashing, you can just zero the padding at init time and report when it's non-NULL at workqueue run time... I wonder if KASAN can run in a mode where the allocation/freeing tracking happens, but without the heavy checking instrumentation? Then when the corruption is detected, it could dump a traceback about who did the early kfree()... hmmm.
Comment 79 mnrzk 2020-07-21 21:16:03 UTC
(In reply to Kees Cook from comment #78)
> (In reply to mnrzk from comment #76)
> > If my understanding is correct, base would have previously been filled with
> > the freelist pointer (since it's the first 8 bytes). Now since the freelist
> > pointer is being put in the middle (rounded to the nearest sizeof(void*),
> >  or 8 bytes), it's being put in the last 8 bytes of *dm_state
> > (or dm_state->context).
> > 
> > I'll place a void* for padding in the middle of struct dm_atomic_state* and
> > if my hypothesis is correct, the padding will be filled with garbage data
> > instead of context and the bug should be fixed. Of course, there would
> > still be a use-after-free bug in the code which may cause other issues in
> > the future so I wouldn't really consider it a solution.
> 
> Agreed: that should make it disappear again, but as you say, it's just
> kicking the problem down the road since now the failing condition is losing
> a race with kfree()+kmalloc()+new contents.
> 
> And if you want to detect without crashing, you can just zero the padding at
> init time and report when it's non-NULL at workqueue run time... I wonder if
> KASAN can run in a mode where the allocation/freeing tracking happens, but
> without the heavy checking instrumentation? Then when the corruption is
> detected, it could dump a traceback about who did the early kfree()... hmmm.

So far I've been testing it by passing my GPU to my VM via vfio-pci and
attaching kgdb to the guest. To test if the context was invalid, I added
a check to make sure the context pointer wasn't garbage data (by checking
if dc_state was not null and the upper 16 bits were set on dc_state).

I wonder if there's any way to set a watchpoint to see where exactly the
dm_atomic_state gets filled with garbage data.

Also, since I'm not too familiar with freelists, do freelist pointers look
like regular pointers? On a system with a 48-bit
virtual address space, regular pointers would be something like
0xffffXXXXXXXXXXXX. I've noticed that the data being inserted never
followed this format. Is this something valuable to note or is that just
the nature of freelist pointers?
Comment 80 Kees Cook 2020-07-22 02:03:15 UTC
(In reply to mnrzk from comment #79)
> I wonder if there's any way to set a watchpoint to see where exactly the
> dm_atomic_state gets filled with garbage data.

mm/slub.c's set_freepointer() (via several possible paths through slab_free()) writes the pointer. What you really want to know is "who called kfree() before this tried to read from here?". 

> Also, since I'm not too familiar with freelists, do freelist pointers look
> like regular pointers? On a regular pointer on a system with a 48-bit
> virtual address space, regular pointers would be something like
> 0xffffXXXXXXXXXXXX. I've noticed that the data being inserted never
> followed this format. Is this something valuable to note or is that just
> the nature of freelist pointers?

With CONFIG_SLAB_FREELIST_HARDENED=y the contents will be randomly permuted on a per-slab basis. Without it, they'll look like a "regular" kernel heap pointer (0xffff....). You may have much more exciting failure modes without CONFIG_SLAB_FREELIST_HARDENED since the pointer will actually be valid. :P
Comment 81 Kees Cook 2020-07-22 02:05:24 UTC
I assume this is the change, BTW:

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
index d61186ff411d..2b8da2b17a5d 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
@@ -424,6 +424,8 @@ struct dm_crtc_state {
 struct dm_atomic_state {
        struct drm_private_state base;
 
+       /* This will be overwritten by the freelist pointer during kfree() */
+       void *padding;
        struct dc_state *context;
 };
Comment 82 mnrzk 2020-07-22 03:37:14 UTC
(In reply to Kees Cook from comment #81)
> I assume this is the change, BTW:
> 
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
> index d61186ff411d..2b8da2b17a5d 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
> @@ -424,6 +424,8 @@ struct dm_crtc_state {
>  struct dm_atomic_state {
>         struct drm_private_state base;
>  
> +       /* This will be overwritten by the freelist pointer during kfree() */
> +       void *padding;
>         struct dc_state *context;
>  };

Yeah that's exactly the change I made, save for the comment of course.

I just got around to actually testing it and it appears to still crash.
Either my hypothesis was wrong or I'm doing something wrong here.

Do you have any ideas?
Comment 83 Christian König 2020-07-22 07:27:17 UTC
Instead of working around the bug I think we should concentrate on nailing the root cause.

I suggest inserting a use-after-free check into just that structure. In other words, add a field "magic_number", fill it with 0xdeadbeef on allocation and set it to zero before the kfree().

A simple BUG_ON(ptr->magic_number != 0xdeadbeef) should yield results rather quickly.

Then just add printk()s before the kfree() to figure out why we have this use-after-free race.
Comment 84 Nicholas Kazlauskas 2020-07-22 13:04:51 UTC
We don't manually free the dm_state from amdgpu, that should be handled by the DRM core.

It should generally only be freed once it's no longer in use by the DRM core, i.e. once the state has been swapped and we drop the reference on the old state at the end of commit tail.

If DRM private objects work the same as regular DRM objects - which from my impression they should - then they should be NULL until they've been acquired for a new state as needed.

This turns out to happen on almost every commit in our current code. I think most commits that touch planes or CRTCs would end up doing this.

I kind of wonder if we're keeping the old dm_state pointer that was freed in the case where it isn't duplicated and for whatever reason it isn't actually NULL.

Based on the above discussion I guess we're probably not doing a use after free on the dc_state itself.

There's been other bugs with private objects in the past with DRM that didn't exist with the regular objects that I'd almost consider finding an alternative solution here and not keeping an old vs new dc_state just to avoid using them in the first place.
Comment 85 mnrzk 2020-07-23 00:48:56 UTC
(In reply to Christian König from comment #83)
> Instead of working around the bug I think we should concentrate on nailing
> the root cause.
> 
> I suggest to insert an use after free check into just that structure. In
> other words add a field "magic_number" will it with 0xdeadbeef on allocation
> and set it to zero before the kfree().
> 
> A simple BUG_ON(ptr->magic_number != 0xdeadbeef) should yield results rather
> quickly.
> 
> Then just add printk()s before the kfree() to figure out why we have this
> use after free race.

Fair point, I was just trying to confirm my hypothesis.

I realised why the test failed: adding 8 bytes of padding to the middle
made the struct 24 bytes. Since the freelist pointer is placed at the
middle (12 bytes), aligned up to the nearest 8 bytes, it ended up at an
offset of 16 bytes (context).

After making the padding an array of 2 void* and initialising it to
{0xDEADBEEFCAFEF00D, 0x1BADF00D1BADC0DE}, the padding was eventually
corrupted with the context being left intact and therefore, no crashing.

GDB output of dm_struct:
{
    base = {state = 0xffff888273884c00},
    padding = {0xdeadbeefcafef00d, 0x513df83afd3ad7b2},
    context = 0xffff88824e680000
}

That said, I still don't know the root cause of the bug, I'll see
if I can use KASAN or something to figure out what exactly freed
dm_state. If anyone is more familiar with this code has any advice
for me, please let me know.
Comment 86 mnrzk 2020-07-23 05:46:18 UTC
Created attachment 290475 [details]
KASAN Use-after-free

Good news, I got KASAN to spit out a use-after-free bug report.

Here's the KASAN bug report, I'm currently trying to understand
what's going on here.

Hopefully someone else can figure something out from this.
Comment 87 mnrzk 2020-07-23 21:30:10 UTC
Good news, I wrote a patch that fixed this bug on my machine and submitted
it to the Linux kernel mailing list [1].

I've tested this for almost 12 hours with KASAN enabled and 3 hours with
all debugging options disabled while watching videos and there have been no
crashes. The longest it's taken for the bug to occur in the past for me was
about 1 hour.

To anyone experiencing this bug, please test out the patch and report on
whether or not it works. I think we'll need some Tested-bys in the LKML
thread and in here before we can consider this bug fixed.

[1] https://lkml.org/lkml/2020/7/23/1123
Comment 88 mnrzk 2020-07-23 21:34:17 UTC
Created attachment 290485 [details]
Possible bug fix #1

(In reply to mnrzk from comment #87)
> Good news, I wrote a patch that fixed this bug on my machine and submitted
> it to the Linux kernel mailing list [1].
> 
> I've tested this for almost 12 hours with KASAN enabled and 3 hours with
> all debugging options disabled while watching videos and there have been no
> crashes. The longest it's taken for the bug to occur in the past for me was
> about 1 hour.
> 
> To anyone experiencing this bug, please test out the patch and report on
> whether on not it works. I think we'll need some Tested-bys in the LKML
> thread and in here before we can consider this bug fixed.
> 
> [1] https://lkml.org/lkml/2020/7/23/1123

For convenience, I'll attach the patch here as well.
Comment 89 Christian König 2020-07-24 07:18:29 UTC
(In reply to mnrzk from comment #87)
> Good news, I wrote a patch that fixed this bug on my machine and submitted
> it to the Linux kernel mailing list [1].

You should probably send it to the amd-gfx@lists.freedesktop.org mailing list as well if you haven't already done so.

I'm not an expert on the DC state stuff, so Harry or Alex need to validate this patch. But offhand it looks like a nice catch to me.

Good work :)
Comment 90 mnrzk 2020-07-24 07:24:06 UTC
(In reply to Christian König from comment #89)
> (In reply to mnrzk from comment #87)
> > Good news, I wrote a patch that fixed this bug on my machine and submitted
> > it to the Linux kernel mailing list [1].
> 
> You should probably send it to the amd-gfx@lists.freedesktop.org mailing
> list as well if you haven't already done so.
> 
> I'm not an expert on the DC state stuff, so Harry or Alex need to validate
> this patch. But of hand it looks like a nice catch to me.
> 
> Good work :)

After further testing, it seems the patch only delayed the issue: an hour
and a half after I submitted it, my system crashed.

I mentioned this on the LKML thread but I forgot to mention it here.

I have a suspicion that the same state is being committed twice. I'll have
to investigate this further though. Once I determine if it is, I'll report
back on here and perhaps that will help with a bug fix.
Comment 91 laser.eyess.trackers 2020-07-24 19:08:41 UTC
I wanted to comment on this bug because I believe I have been experiencing it, based on a bug report I filed with amdgpu [1]. As of 5.7.8 on Arch Linux I am no longer experiencing this bug regularly. Usually I could trigger it every 1-3 days. The biggest change I made was turning off adaptive sync (VRR, FreeSync, etc.) in my window manager. Now it's been almost a week and I haven't seen it. Right now I am on 5.7.9 and will keep running as long as possible until it crashes again, if it crashes again.


I see some discussion here about race conditions between memory allocations and atomic commits, and while I don't understand most of it, would I be correct in assuming that variable frame timing would exacerbate this bug? If so, I believe that is exactly what I am experiencing. I'd love to help test patches for this as they come in, but for now I want to add that VRR is an important part of the equation for this bug for me.


The bug report linked in [1] has more of my set up but all I'll say here is that I also have a multimonitor setup, each one supports VRR and they are at varying resolutions/refresh rates; two at 1440p 144Hz, one at 4k 60Hz.


1. https://gitlab.freedesktop.org/drm/amd/-/issues/1216
Comment 92 Nicholas Kazlauskas 2020-07-24 21:00:07 UTC
This sounds very similar to a bug I fixed a year ago but that issue was with freeing the dc_state.

https://bugzilla.kernel.org/show_bug.cgi?id=204181

1. Client requests non-blocking Commit #1, has a new dc_state #1,
state is swapped, commit tail is deferred to work queue

2. Client requests non-blocking Commit #2, has a new dc_state #2,
state is swapped, commit tail is deferred to work queue

3. Commit #2 work starts before Commit #1, commit tail finishes,
atomic state is cleared, dc_state #1 is freed

4. Commit #1 work starts after Commit #2, uses dc_state #1, NULL pointer deref.

This issue was fixed, but it occurred under similar conditions - heavy system load and frequent pageflipping.

However, in the case of dm_state things can't be solved in the same manner. Commit #2 can't free Commit #1's commit - only the commit tail for Commit #1 can free it along with the IOCTL caller.

I don't know if this is going down any of the deadlock paths in DRM core because that might trigger strange behavior as well with clearing/putting the dm_state.

If someone who can reproduce this issue can produce a dmesg log with the DRM IOCTLs logged (I think drm.debug=0x54 should work) then I should be able to examine the IOCTL sequence in more detail.
Comment 93 mnrzk 2020-07-25 02:38:07 UTC
(In reply to Nicholas Kazlauskas from comment #92)
> This sounds very similar to a bug I fixed a year ago but that issue was with
> freeing the dc_state.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=204181
> 
> 1. Client requests non-blocking Commit #1, has a new dc_state #1,
> state is swapped, commit tail is deferred to work queue
> 
> 2. Client requests non-blocking Commit #2, has a new dc_state #2,
> state is swapped, commit tail is deferred to work queue
> 
> 3. Commit #2 work starts before Commit #1, commit tail finishes,
> atomic state is cleared, dc_state #1 is freed
> 
> 4. Commit #1 work starts after Commit #2, uses dc_state #1, NULL pointer
> deref.
> 
> This issue was fixed, but it occurred under similar conditions - heavy
> system load and frequent pageflipping.
> 
> However, in the case of dm_state things can't be solved in the same manner.
> Commit #2 can't free Commit #1's commit - only the commit tail for Commit #1
> can free it along with the IOCTL caller.
> 
> I don't know if this is going down any of the deadlock paths in DRM core
> because that might trigger strange behavior as well with clearing/putting
> the dm_state.
> 
> If someone who can reproduce this issue can produce a dmesg log with the DRM
> IOCTLs logged (I think drm.debug=0x54 should work) then I should be able to
> examine the IOCTL sequence in more detail.

Yes, this actually seems quite similar to that bug. Perhaps it's something
like that bug but with dm_state instead?

Also, some more observations I've made:
While dm_state is encountering a use-after-free bug, it does not seem like
state as a whole is. The KASAN bug report only states that reading from
dm_state is invalid, but the same cannot be said about state.

Furthermore, dm_state seems to be used by two separate commits and is freed
when one of them completes. This creates a race between the two commits:
if one commit finishes before the other has called dm_atomic_get_new_state,
the result is a use-after-free.

I think the bug works something like this. Keep in mind that I haven't
worked with this code outside of this bug report so there may be a few misconceptions:

1. Client requests non-blocking Commit #1, has a new dm_state #1,
state is swapped, commit tail is deferred to work queue

2. Client requests non-blocking Commit #2, has a new dm_state #2,
state is swapped, commit tail is deferred to work queue

3. Commit #2 work starts before Commit #1, commit tail finishes,
atomic state is cleared, dm_state #1 is freed

4. Commit #1 work starts after Commit #2, uses dm_state #1 (use-after-free),
reads a bad context pointer, and dereferences a freelist pointer instead.

So I would agree that this is very similar to the dc_state bug (I even
based that explanation on yours). Perhaps that bug you fixed also
affected dm_state as a whole but only caused an issue with dc_state at the time?
Comment 94 mnrzk 2020-07-26 06:47:10 UTC
I just got this interesting log w/ drm.debug=0x54 right before a crash:

[  971.537862] [drm:drm_atomic_state_init [drm]] Allocated atomic state 00000000cac2d51a
[  971.537909] [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:47:crtc-0] 00000000dc3e08a2 state to 00000000cac2d51a
[  971.537938] [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:45:plane-5] 00000000ab054dfb state to 00000000cac2d51a
[  971.537963] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:103] for [PLANE:45:plane-5] state 00000000ab054dfb
[  971.537988] [drm:drm_atomic_check_only [drm]] checking 00000000cac2d51a
[  971.538064] [drm:drm_atomic_get_private_obj_state [drm]] Added new private object 00000000da817c3e state 000000001743c8e6 to 00000000cac2d51a
[  971.538211] [drm:drm_atomic_nonblocking_commit [drm]] committing 00000000cac2d51a nonblocking
[  971.538898] [drm:drm_atomic_state_init [drm]] Allocated atomic state 00000000cc027c4b
[  971.538941] [drm:drm_atomic_get_crtc_state [drm]] Added [CRTC:49:crtc-1] 00000000992fcbd2 state to 00000000cc027c4b
[  971.538968] [drm:drm_atomic_get_plane_state [drm]] Added [PLANE:44:plane-4] 000000009d6970b1 state to 00000000cc027c4b
[  971.538992] [drm:drm_atomic_set_fb_for_plane [drm]] Set [FB:103] for [PLANE:44:plane-4] state 000000009d6970b1
[  971.539017] [drm:drm_atomic_check_only [drm]] checking 00000000cc027c4b
[  971.539108] [drm:drm_atomic_get_private_obj_state [drm]] Added new private object 00000000da817c3e state 0000000057153d72 to 00000000cc027c4b
[  971.539140] [drm:drm_atomic_nonblocking_commit [drm]] committing 00000000cc027c4b nonblocking
[  971.544942] [drm:drm_atomic_state_default_clear [drm]] Clearing atomic state 00000000cc027c4b
[  971.544977] [drm:__drm_atomic_state_free [drm]] Freeing atomic state 00000000cc027c4b

and then my debugger detected a use-after-free while 00000000cac2d51a was being
committed.

Basically the sequence of events is as follows:

1. Non-blocking commit #1 (00000000cac2d51a) was requested, allocated, and is
deferred to workqueue.

2. Non-blocking commit #2 (00000000cc027c4b) was requested, allocated, and is
deferred to workqueue.

3. Commit #2 starts and completes before commit #1 is started, dm_state is
freed.

4. Commit #1 starts after commit #2 and uses the dm_state pointer that
commit #2 already freed.

And from every instance of this bug I have seen, it has been due to page-flipping.

So Nicholas, it seems your observation was correct; the sequence of events is
very similar to how you've described the other bug.

Perhaps we'll have to look into the page-flipping code to figure out what exactly
is going on.
Comment 95 Nicholas Kazlauskas 2020-07-26 18:40:55 UTC
Created attachment 290583 [details]
0001-drm-amd-display-Force-add-all-CRTCs-to-state-when-us.patch

So the sequence looks like the following:

1. Non-blocking commit #1 requested, checked, swaps state and deferred to work queue.

2. Non-blocking commit #2 requested, checked, swaps state and deferred to work queue.

Commits #1 and #2 don't touch any of the same core DRM objects (CRTCs, Planes, Connectors) so Commit #2 does not stall for Commit #1. DRM Private Objects have always been avoided in stall checks, so we have no safety from DRM core in this regard.

3. Due to system load commit #2 executes first and finishes its commit tail work. At the end of commit tail, as part of DRM core, it calls drm_atomic_state_put().

Since this was the pageflip IOCTL we likely already dropped the reference on the state held by the IOCTL itself. So it's going to actually free at this point.

This eventually calls drm_atomic_state_clear() which does the following:

obj->funcs->atomic_destroy_state(obj, state->private_objs[i].state);

Note that it clears "state" here. Commit sets "state" to the following:

state->private_objs[i].state = old_obj_state;
obj->state = new_obj_state;

Since Commit #1 swapped first this means Commit #2 actually does free Commit #1's private object.

4. Commit #1 then executes and we get a use after free.

Same bug, it's just this was never corrupted before by the slab changes. It's been sitting dormant for 5.0~5.8.

Attached is a patch that might help resolve this.
Comment 96 mnrzk 2020-07-26 19:55:46 UTC
(In reply to Nicholas Kazlauskas from comment #95)
> Created attachment 290583 [details]
> 0001-drm-amd-display-Force-add-all-CRTCs-to-state-when-us.patch
> 
> So the sequence looks like the following:
> 
> 1. Non-blocking commit #1 requested, checked, swaps state and deferred to
> work queue.
> 
> 2. Non-blocking commit #2 requested, checked, swaps state and deferred to
> work queue.
> 
> Commits #1 and #2 don't touch any of the same core DRM objects (CRTCs,
> Planes, Connectors) so Commit #2 does not stall for Commit #1. DRM Private
> Objects have always been avoided in stall checks, so we have no safety from
> DRM core in this regard.
> 
> 3. Due to system load commit #2 executes first and finishes its commit tail
> work. At the end of commit tail, as part of DRM core, it calls
> drm_atomic_state_put().
> 
> Since this was the pageflip IOCTL we likely already dropped the reference on
> the state held by the IOCTL itself. So it's going to actually free at this
> point.
> 
> This eventually calls drm_atomic_state_clear() which does the following:
> 
> obj->funcs->atomic_destroy_state(obj, state->private_objs[i].state);
> 
> Note that it clears "state" here. Commit sets "state" to the following:
> 
> state->private_objs[i].state = old_obj_state;
> obj->state = new_obj_state;

What line number roughly does that happen on? I can't seem to find that 
anywhere in amdgpu_dm.c

> 
> Since Commit #1 swapped first this means Commit #2 actually does free Commit
> #1's private object.
> 
> 4. Commit #1 then executes and we get a use after free.
> 
> Same bug, it's just this was never corrupted before by the slab changes.
> It's been sitting dormant for 5.0~5.8.
> 
> Attached is a patch that might help resolve this.

I actually just started testing my own patch, but I'll apply your patch
and see if it works though.

My patch is based on how you solved bug 204181 [1] and instead of setting
the new dc_state to the old dc_state, it frees the dm_state and removes
the associated private object.

If I understand correctly, if dm_state is set to NULL (i.e. new state
cannot be found), commit_tail retains the current state and context.
Since dm_state only contains the context (which is unused), I don't see
why freeing the state and clearing the private object beforehand would
be an issue.

I would attach the patch but I'll need to clean up my code first. If the
patch works for the next few hours, I'll clean it up and attach it.

[1] https://patchwork.freedesktop.org/patch/320797/
Comment 97 mnrzk 2020-07-26 22:52:17 UTC
Created attachment 290591 [details]
drm/amd/display: Clear dm_state for fast updates

Alright, the patch I mentioned in the last comment seems to be holding up
after a few hours of testing.

Please try out this patch and see if it fixes the issue for the rest of
you.

In the meantime, I'm doing more extended tests on this patch to confirm it
works well enough before posting it on LKML.

Nicholas, I haven't tested your commit since I was too busy with this. I'll
try it out if this one fails though.

Also, can you please review this patch to confirm that I'm not doing
anything wrong here?
Comment 98 Nicholas Kazlauskas 2020-07-26 23:30:30 UTC
As much as I'd like to remove the DRM private object from the state instead of just carrying it over I'd really rather not be hacking around behavior from the DRM core itself.

Maybe there's value in adding these as DRM helpers in the case where a driver explicitly wants to remove something from the state. My guess as to why these don't exist today is because they can be bug prone since the core implicitly adds some objects (like CRTCs when you add a plane and CRTCs when you add connectors) but I don't see any technical limitation for not exposing this.
Comment 99 mnrzk 2020-07-26 23:52:08 UTC
(In reply to Nicholas Kazlauskas from comment #98)
> As much as I'd like to remove the DRM private object from the state instead
> of just carrying it over I'd really rather not be hacking around behavior
> from the DRM core itself.
> 
> Maybe there's value in adding these as DRM helpers in the case where a
> driver explicitly wants to remove something from the state. My guess as to
> why these don't exist today is because they can be bug prone since the core
> implicitly adds some objects (like CRTCs when you add a plane and CRTCs when
> you add connectors) but I don't see any technical limitation for not
> exposing this.

I'm a little bit confused, is there anything particularly illegal or
discouraged about the patch I sent? If so, how should I correct it?

Should I create some sort of DRM helper for deleting a private object and
use that to delete the state's associated private object?
Comment 100 mnrzk 2020-07-27 06:11:14 UTC
I posted the patch on the LKML [1] just now so I can get the other
reviewers' input on it. I think it's safe to say that it's working now due
to how much I've tested it but I will test more over the coming days just
to be safe.

If anyone else can test this patch and give their Tested-by in the LKML
thread, or just comment in here about it, please do.

Aside from the description, this patch is identical to the one I just
attached.

Nicholas, sorry but I wasn't quite sure if you were giving a suggestion in
that comment earlier. Please tell me if you have any suggestions or
concerns with this patch.

[1] https://lkml.org/lkml/2020/7/27/64
Comment 101 Duncan 2020-07-27 16:55:06 UTC
(In reply to Nicholas Kazlauskas from comment #95)
> Created attachment 290583 [details]
> 0001-drm-amd-display-Force-add-all-CRTCs-to-state-when-us.patch

Just booted to 5.8-rc7 with this patched in locally (and the g320+ reverts /not/ patched in).  So testing, but noting again that the bug can take a couple days to trigger on my hardware, so while verifying bug-still-there /might/ be fast, verifying that it's /not/ there will take awhile.

If this still bugs on me (and barring other developments first) I'll try mnrzk's patch in place of this one.  Even if it's not permanent, getting it into 5.8 as a temporary fix and doing something better for 5.9 would buy us some time to develop and test the more permanent fix.
Comment 102 Duncan 2020-07-28 02:29:00 UTC
(In reply to Duncan from comment #101)
> (In reply to Nicholas Kazlauskas from comment #95)
> > 0001-drm-amd-display-Force-add-all-CRTCs-to-state-when-us.patch
> 
> Just booted to 5.8-rc7 with this patched in locally (and the g320+ reverts
> /not/ patched in).  So testing, but noting again that the bug can take a
> couple days to trigger on my hardware, so while verifying bug-still-there
> /might/ be fast, verifying that it's /not/ there will take awhile.

So far I've been building system updates, so heavy CPU load, while playing only moderate FHD video.  No freezes, but I have seen a bit of the predicted judder.

I suspect the synchronization is preventing the freezes, and the judder hasn't been /bad/.  But with different-refresh monitors (mine are both 60 Hz 4k bigscreen TVs, so same refresh), or with 4k video, particularly 4k60 which my system already struggles with, or possibly even with both monitors at, say, 120 Hz, the judder would be noticeably worse.  The 4k30 and 4k60 youtube tests will probably have to wait for tomorrow, tho, as I've been up near 24 hours now...
Comment 103 mnrzk 2020-07-28 03:21:54 UTC
(In reply to Nicholas Kazlauskas from comment #95)
> Created attachment 290583 [details]
> 0001-drm-amd-display-Force-add-all-CRTCs-to-state-when-us.patch
> 
> [full analysis trimmed -- see comment #95]

So I just got around to testing this patch and so far, not very promising.

Right now I can't comment on if the bug in question was resolved but this
just introduced some new critical bugs for me.

I first tried this on my bare metal system w/ my RX 480 and it boots into
lightdm just fine. As soon as I log in and start up XFCE however, one of my
two monitors goes black (monitor reports being asleep) but my cursor seems
to drift into the other monitor just fine. So after that, I check the
display settings and both monitors are detected. So I tried re-enabling the
off monitor and then both monitors work fine.

After that, another bug: I now have two cursors, one only works on my right
monitor and the other only stays in one position.

At this point, I recompiled and remade the initramfs, and sure enough, same
issues. This time, however, changing the display settings didn't "fix" the
issue with one monitor being blank; the off monitor activated, but the
previously working one just froze.

I also tried this on my VM passing through my GPU w/ vfio-pci; similar
issues. Lightdm worked fine but when I started KDE Plasma, it started
flashing white and one of my monitors just became blank. This time, I
couldn't enable the blank display from the settings, it just didn't show
up. Xrandr only showed one output as well; switching HDMI outputs still
only lets me use the monitor on the "working" HDMI port.

I don't exactly know how I would go about debugging this since there's just
too many bugs to count. I also don't know if it would be worth it at all.

Do you have any idea why this would occur? This patch only seems to force
synchronisation, I don't quite know why it would break my system so much.
Comment 104 mnrzk 2020-07-28 03:39:18 UTC
(In reply to mnrzk from comment #103)
> [quoted text trimmed -- see comment #103]

This just gets even weirder the more I test it out. Swapping the two
monitors (i.e. swapping the HDMI ports used for each monitor) seems to fix
the issue completely on my VM (at least from 1 minute of testing), but on
the host it fixes some of the issues (my cursor still disappears on one of
my monitors).
Comment 105 Duncan 2020-07-28 07:14:06 UTC
(In reply to Duncan from comment #102)
> (In reply to Duncan from comment #101)
> > (In reply to Nicholas Kazlauskas from comment #95)
> > > 0001-drm-amd-display-Force-add-all-CRTCs-to-state-when-us.patch
> > 
> > Just booted to 5.8-rc7 with this patched in 
> 
> So far building system updates so heavy cpu load while playing only moderate
> FHD video.  No freezes but I have seen a bit of the predicted judder.
> 
> The 4k30 and 4k60 youtube tests will probably have to wait for tomorrow, tho,
> as I've been up near 24 now...

Still up...  Here's the promised 4k youtube-in-firefox tests.

4k is a bit more stuttery than normal with the patch, but not near as bad as I expected it to be.  I can normally run 4k60 at 80-85% normal speed with occasional stutters but without freezing the video entirely until I drop the speed down again as I often have to do if I try running over that.  With the patch I was doing 70-75%.  So there's definitely some effect on 4k60.  Switching to the performance cpufreq governor from my default conservative, as usual, helps a bit, but not a lot, maybe 5%, tho the frame-freezes seem to recover a bit better on performance.  In addition to long video freezes at the full 4k60 100%, even normally I'll sometimes get tab-crashes depending on the video.  I didn't have any for this test but then I'm so used to not being able to run at full-speed that I didn't try it for long.

I can normally run 4k30 videos without much problem on default conservative.  With the patch I was still getting some stuttering at 30fps on conservative, but it pretty much cleared up with on-demand.  I did just have a tab-crash at 4k30, something I very rarely if ever see normally on 4k30, it normally takes 4k60 to trigger them, so it's definitely affecting it.

But... other than slowing down the usable 4k fps, I'm not seeing any of the judder artifacts on the work (non-video-playing) monitor that I saw during the build testing, with its high system load but relatively low video load (FHD only).  That surprised me.  I expected to see more of that with the more demanding video.  But apparently it's tied to CPU or memory load, not video load.

But nothing like the problems mnrzk's seeing with the patch, at all.  Both monitors running fine in text mode, login, startx to plasma, running fine there too.  Hardware cursor's fine. <shrug>  The only thing I'm seeing is some slowdown and judder, as described above.
Comment 106 Duncan 2020-07-29 02:33:15 UTC
(In reply to Duncan from comment #101)
> (In reply to Nicholas Kazlauskas from comment #95)
> > Created attachment 290583 [details]
> > 0001-drm-amd-display-Force-add-all-CRTCs-to-state-when-us.patch
> 
> Just booted to 5.8-rc7 with this patched in locally (and the g320+ reverts
> /not/ patched in).  So testing, but noting again that the bug can take a
> couple days to trigger on my hardware

This doesn't seem to trigger the bug at all for me, tho there's the expected slowdown/judder from force-syncing all CRTCs as detailed in my last couple comments.  But with the more serious side effects mnrzk is seeing with it, it's clearly not useful as an even temporary mainline candidate.

I'll be testing mnrzk's patch now.  Hopefully it'll be good enough for the quickly approaching 5.8, tho dev consensus seems to be that a deeper rework is needed longer-term.
Comment 107 Paul Menzel 2020-07-29 06:41:53 UTC
Everyone seeing this, it’d be great, if you tested

    [PATCH] drm/amd/display: Clear dm_state for fast updates

and reported any noticeable performance regressions.
Comment 108 Duncan 2020-07-29 16:02:54 UTC
(In reply to Paul Menzel from comment #107)
> Everyone seeing this, it’d be great, if you tested
> 
>     [PATCH] drm/amd/display: Clear dm_state for fast updates

I've been testing it for... ~12 hours now and so far... nothing unusual to report. =:^)

Everything seems to be working normally including 4k video and update builds.  The only two caveats are that there wasn't anything /too/ heavy in the update pipeline to build, and it has only been 12 hours, while sometimes this bug took two days to bite on my setup.  But so far, so good, and now that I'm posting this, if the bug's going to bite it's likely to be right after I hit submit, so let's see! =:^)
Comment 109 zzyxpaw 2020-07-29 16:37:47 UTC
I've been testing mnrzk's patch for about 12 hours as well, so far so good. No obvious performance degradation has appeared, at least that I can discern just by "feel". My testing has been interrupted a couple times by the new power-off-on-overtemperature feature while attempting to test heavier loads.
Comment 110 Nicholas Kazlauskas 2020-07-29 16:45:23 UTC
That's in line with expectations, I think.

That patch shouldn't cause any performance or stuttering impacts and it should resolve the protection fault.

If there were issues with the patch I would expect to see them within the first few pageflips from booting into desktop.

For now I'll give my Reviewed-by on the patch on the mailing list and get it merged in.
Comment 111 mnrzk 2020-07-29 20:32:41 UTC
Yeah, no noticeable performance impact on my end either. I don't really
see why it would cause a performance impact either. I could run a benchmark
to compare but I don't really know what to benchmark specifically.
Comment 112 Duncan 2020-07-31 16:38:13 UTC
(In reply to Paul Menzel from comment #107)
> Everyone seeing this, it’d be great, if you tested
> 
>     [PATCH] drm/amd/display: Clear dm_state for fast updates

For the record, with no reported problems that's in 5.8-post-rc7 now as fde9f39ac, merged into the drm tree with merge-commit 887c909dd, which in turn was merged into mainline on Thursday July 30 with merge-commit d8b9faec5.

Thanks, everyone. =:^)

Close the bug on 5.8.0 release?
Comment 113 laser.eyess.trackers 2020-08-02 01:40:14 UTC
I have been using this patch for about 24 hours now, and there has not been any noticeable performance degradation. I have not experienced any crashes, but it was much harder for me to get this crash (1-3 days, if I was lucky), so I'm not sure what that means. At the very least nothing is worse.
Comment 114 Jeremy Kescher 2020-08-02 13:06:50 UTC
(In reply to Duncan from comment #108)
> (In reply to Paul Menzel from comment #107)
> > Everyone seeing this, it’d be great, if you tested
> > 
> >     [PATCH] drm/amd/display: Clear dm_state for fast updates
> 


It fixes the issue for me. My system would, without any patches, crash in a matter of minutes (perhaps a mix of 144 Hz and 60 Hz monitors causes this crash to happen faster?), but it has been running for multiple hours on intense workloads now, without any hiccups or anything.
Comment 115 Duncan 2020-08-03 13:51:06 UTC
So 5.8.0 has been out for a few hours with the patch-fix, and I see Greg K-H has it applied to the 5.7 stable tree as well as 5.4 LTS (the bug was in 5.4 but latent, not exposed until developments in 5.7), so they should be covered in their next releases.
Comment 116 Duncan 2020-08-05 16:10:53 UTC
For those not on 5.8 yet, Mazin's patch is in the 5.7.13 stable and 5.4.56 LTS releases.

As far as I'm concerned (and lacking any NAKs to my previous question about closing) there's no further reason to leave the bug open, so I'm closing.  The bugzilla.kernel.org installation has some confusing custom resolution choices that haven't been documented in the status help link (bug #13851, filed years ago, as implied by the bug number compared to this one).  I don't know whether CODE_FIX or PATCH_ALREADY_AVAILABLE is more appropriate, as both seem to apply equally, so I'll leave it at the default CODE_FIX.

Thanks again to everyone who confirmed the bug and/or worked on fixes and testing.
Comment 117 Duncan 2020-08-17 05:45:32 UTC
For those on stable-series 5.4 and/or interested in related bugs...

FWIW, there's an (apparently different) atomic_commit_tail bug reported against 5.4.58 now, bug #208913, with the patch for this bug (which went into 5.4.56 after hitting a late 5.8-rc) originally listed as a potential trigger.

But the filer closed the bug and moved it to the gitlab instance on freedesktop.org https://gitlab.freedesktop.org/drm/amd/-/issues/1263 , where he said reverting the patch didn't cure his issue, so there's something else going on there.

Just posting this here as related, in case anyone here wants to follow it, since I came across it while checking on a different (not graphics-related) bug in 5.9-rc1.  With any luck, however, the similar bug will help get a better longer term fix for both bugs, since the patch in 5.8 (backported to 5.7 and 5.4) was seen as a temporary bandaid, not a permanent fix.
Comment 118 Christopher Snowhill 2021-01-06 06:36:03 UTC
Now experiencing this attempting to run Luxmark with literally any OpenCL runtime on my RX 480, be it ROCm, Clover, or AMDGPU Pro 20.45. Goody gumdrops, wlroots added support for adaptive sync / variable refresh rate 6 months ago, but Wayfire still hasn't added an option to control it, so it may or may not be in the mix.

Happens with both 5.10.4 and 5.4.86 kernels.
Comment 119 Duncan 2021-01-06 12:05:17 UTC
(In reply to Christopher Snowhill from comment #118)
> Now experiencing this attempting to run Luxmark with literally any OpenCL
> runtime on my RX 480, be it ROCm, Clover, or AMDGPU Pro 20.45. Goody
> gumdrops, wlroots added support for adaptive sync / variable refresh rate 6
> months ago, but Wayfire still hasn't added an option to control it, so it
> may or may not be in the mix.
> 
> Happens with both 5.10.4 and 5.4.86 kernels.

FWIW: While I'd normally be on 5.11-rc by now, I've been busy switching to wayland (on plasma/kwin), along with the usual disruption accompanying such a major upgrade, and thus have been focused on userspace bugs.

So did you try earlier 5.10 releases, 5.10.0-5.10.3, and were they fine?  Because I'm still on 5.10.0 and haven't had an issue since this bug was originally resolved.

Maybe I better get with the program and try current 5.11-rcs...
Comment 120 Duncan 2021-01-06 18:59:53 UTC
On Wed, 06 Jan 2021 12:05:17 +0000
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=207383
> 
> --- Comment #119 from Duncan (1i5t5.duncan@cox.net) ---
> (In reply to Christopher Snowhill from comment #118)
> > Now experiencing this attempting to run Luxmark with literally any
> > OpenCL runtime on my RX 480

> > Happens with both 5.10.4 and 5.4.86 kernels.  

> So did you try earlier 5.10 releases, 5.10.0-5.10.3, and were they
> fine? Because I'm still on 5.10.0 and haven't had an issue since this
> bug was originally resolved.
> 
> Maybe I better get with the program and try current 5.11-rcs...

On current 5.11-rc2+ (rc2-00142-g9f1abbe to be exact) now (5 hours
since rebooting to new kernel) and not seeing anything unusual, tho on
my system the bug did sometimes take a couple days to trigger. Also I
was on X before and am now on wayland (plasma/kwin), however that might
or might not affect it.

5.11 is reported to be pretty big amdgpu-wise, tho most of it is
register headers for new silicon.  But it might be worth testing since
it's there /to/ test.

(The below's probably obvious and some is repeat from earlier comments,
but in case it mentions something or triggers an association you
previously missed ...)

If you haven't, check out the possibly related bug I
mentioned in comment #117, which has a similar atomic-commit-tail print
and was originally suspected to be triggered by the fix for this
one, tho reverting it didn't fix it, so...  As of last nite the last
comment on the freedesktop.org issue for it was reported as three
months old, without a fix because they haven't been able to
definitively bisect.

But (if you don't believe it puts you at too much risk) you might
consider trying a 5.4 release previous to 5.4.56, which got the fix
for this bug.  5.4.86, which you just tested, still shows it.  5.4.58 is
what the reporter of that bug/issue says he was on, and 5.4.56
introduced the patch for this bug, thus making that fix suspect, tho he
says reverting it didn't help.  So try 5.4.55 or earlier and see if you
can reproduce.  That's where I'd start, anyway, since you've already
replicated on 5.4.86.  Then, if that's clean, bisect between it and
current; if not, try all the way back to 5.4.0.  Because
there's way less backported patches to test so if you can definitively
say it wasn't in 5.4.x and appeared in 5.4.y you've just cut the
problem and bisect space /tremendously/, compared to everything in
mainline since 5.4-stable's branch.

What does not help (as I found out with my bisect struggles on this
bug) is that the change may well be in some other part of the kernel.
So don't just look at amdgpu or all of the drm tree, as you could
easily miss it, just as I did on the first bisect with this bug, because
it was a hardening patch introduced thru Andrew's tree that triggered it
(tho the bug had actually existed for quite some time and was simply
revealed/triggered by that patch).

The other thing is I don't have freesync monitors (they're both 75-inch
UHD/4K@60 TVs), so if that's your trigger I'd be unlikely to see it.
Comment 121 akanar 2021-07-06 08:47:20 UTC
Is the linked issue possibly related to the issue discussed in this thread? If it is you may at least have a 100% consistent and easy way to test.

https://github.com/prusa3d/PrusaSlicer/issues/6677