Bug 90741

Summary: Radeon: System pauses on TAHITI
Product: Drivers Reporter: Gustaw Smolarczyk (wielkiegie)
Component: Video(DRI - non Intel)Assignee: drivers_video-dri
Status: NEW ---    
Severity: normal CC: abortretryfail, alexdeucher, bugs, damien, deathsimple, Dieter, jonjon.arnearne, julien.isorce, szg00000
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 3.18.0 3.18.1 Subsystem:
Regression: No Bisected commit-id:
Attachments: /proc/interrupts on 3.17.4
Restore old fence wait behavior
Use exclusive wait to put the waiter at the end of the waitqueue
Working patch
Final version.
Another approach
Clear irqs before processing.. Patch on top of 'another approach'
Print refcounts on sw irq, panic on disabled.
dmesg after "Print refcounts..." patch
dmesg after disabling radeon_fence_schedule_check()
dmesg with refcount on sw irq
dmesg with refcount on sw irq
Use fence_add/remove_callback
flush hdp harder
reshuffle evergreen_dma_fence_ring_emit

Description Gustaw Smolarczyk 2015-01-04 00:35:37 UTC
The system freezes for 1-3 seconds from time to time. No message is shown in dmesg. Some of the programs report high display delays in their profiles (maybe glXSwapBuffers?). The pause is system-wide, probably due to whole-OpenGL stack (glamor).

This seems to be a 3.18.x regression. The problem does not occur on 3.17.4.

During normal usage the problem is very rare. However, it occurs very often while playing OpenGL games.

For games like Minecraft, the pauses are less likely while not moving around. I believe that the game does not upload many buffers then. So the issue may be due to buffer management.

I am using mesa from git, libdrm from git, xserver 1.16.2.901, xf86-video-ati 7.5. The problem occurs both with compositing enabled and disabled (kwin).
Comment 1 Gustaw Smolarczyk 2015-01-04 14:07:58 UTC
I have managed to bisect this regression:

f2c24b83ae90292d315aa7ac029c6ce7929e01aa is the first bad commit
commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa
Author: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Date:   Wed Apr 2 17:14:48 2014 +0200

    drm/ttm: flip the switch, and convert to dma_fence
    
    Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>

:040000 040000 634b9d842856f9a253b8250b6b61423a8b19bf7d 82569dbc605b725ecefac7457b25599158af5fb6 M      drivers
:040000 040000 6ff7fd43b2671ac388b81ec3c9ab47f4a2d39162 0c9d3986f193c83b952b05e7ec1f1c777a44e88e M      include

I am not familiar with the kernel code so I cannot say whether it's a real regression or is just a problem with a userspace.
Comment 2 Michel Dänzer 2015-01-05 09:01:31 UTC
Does it still happen with a 3.19-rc kernel?
Comment 3 Maarten Lankhorst 2015-01-05 09:14:16 UTC
Ok so that commit introduces the bad behavior, does commit
39e7f6f84b3a3aa4520504473f2e2bac1f949ffa "drm/radeon: use rcu waits in some ioctls"
or commit "drm/ttm: use rcu in core ttm" change the bad behavior in any way?
Comment 4 Gustaw Smolarczyk 2015-01-05 19:15:41 UTC
There seems to be no substantial difference in 3.19-rc2, 39e7f6f84b3a3aa4520504473f2e2bac1f949ffa or 472db7ab3093bf2a2999f6b5aa64a030466d6f92.

The only thing (that may be subjective) is that there seems to be a bit less pauses in final 3.18 and later. (did not bisect that, may be totally subjective)

Also of importance can be that I am using 2 GPUs (discrete AMD TAHITI and integrated Intel gen7 (IVB)) in randr 1.4 offloading configuration:

xrandr --setprovideroutputsource Intel radeon
xrandr --output HDMI2 --mode 1280x1024 --rate 60 --right-of DVI-0

Where:
  DVI-0 is my main monitor and is connected to TAHITI
  HDMI2 is my secondary monitor and is connected to IVB
  The OpenGL window is on the main monitor

If that may be the source of these problems, I could disable the second monitor along with the iGPU in order to test that hypothesis.
Comment 5 Michel Dänzer 2015-01-06 02:17:26 UTC
(In reply to Gustaw Smolarczyk from comment #4)
> If that may be the source of these problems, I could disable the second
> monitor along with the iGPU in order to test that hypothesis.

That would be interesting.
Comment 6 Maarten Lankhorst 2015-01-06 11:14:15 UTC
I think that right now i915 doesn't use the fence stuff yet so it's probably not causing the problem, still it would be useful to check.
Comment 7 Gustaw Smolarczyk 2015-01-06 12:55:23 UTC
Unfortunately, it was a false lead. Disabling iGPU didn't help in any way on 3.18.1.

Additionally, I take back the "less pauses in 3.18.0 and later" thing. 3.18.1 was horrible just now (same as any kernel after the commit bisected).

Is there any way I can help investigating it more?
Comment 8 Christian König 2015-01-06 13:47:41 UTC
Just a shot into the dark, but can you please post the content of /proc/interrupts.
Comment 9 Gustaw Smolarczyk 2015-01-06 13:50:51 UTC
Created attachment 162601 [details]
/proc/interrupts on 3.17.4

I have attached the contents of /proc/interrupts on 3.17.4. Should I attach one with a faulty kernel too?
Comment 10 Christian König 2015-01-06 14:23:19 UTC
(In reply to Gustaw Smolarczyk from comment #9)
> Created attachment 162601 [details]
> /proc/interrupts on 3.17.4
> 
> I have attached the contents of /proc/interrupts on 3.17.4. Should I attach
> one with a faulty kernel too?

No, all I wanted to check is that interrupts work as expected. And that indeed seems to be the case here:

 26:      40191      12200       8939       6450      22186      15970       9307       6821   PCI-MSI-edge      radeon
Comment 11 Maarten Lankhorst 2015-01-20 08:34:55 UTC
Created attachment 163871 [details]
Restore old fence wait behavior

Ok the only difference I can see is the fence wait function being different in that commit. The new one should be correct, so lets see if it that's what causes it, or if it's the introduction of timeouts in ttm.

Does this hack change things?
Comment 12 Gustaw Smolarczyk 2015-01-20 20:15:11 UTC
This hack, when applied on top of 3.19-rc5, seems to fix the problem.
Comment 13 Maarten Lankhorst 2015-01-21 09:24:39 UTC
Created attachment 164121 [details]
Use exclusive wait to put the waiter at the end of the waitqueue

Ok seems I made a thinko. wait_event_interruptible_timeout and wait_event_timeout add items to the front of the queue instead of the back, this may result in fences never signalling.

Can you test the attached patch?
Comment 14 Maarten Lankhorst 2015-01-21 09:50:10 UTC
Created attachment 164131 [details]
Working patch
Comment 15 Maarten Lankhorst 2015-01-21 14:47:11 UTC
Created attachment 164141 [details]
Final version.

Can you test if this thing fixes it? Forget the other patch, this one should fix the bug.
Comment 16 Gustaw Smolarczyk 2015-01-21 18:02:28 UTC
Unfortunately, the last patch doesn't seem to help in any way. Should I test the other two too?
Comment 17 Maarten Lankhorst 2015-01-22 19:30:56 UTC
Created attachment 164421 [details]
Another approach

Don't need to test anything but final version, can you test this version though? I want to see if the bug's in the wait code or something with the interrupt handling..

If this version fails too, can you uncomment the sw_irq_get and sw_irq_put calls added in the patch?
Comment 18 Gustaw Smolarczyk 2015-01-22 19:48:20 UTC
Only the second version (with the 2 lines uncommented) of the new patch does fix the problem.
Comment 19 Maarten Lankhorst 2015-01-23 19:19:23 UTC
Created attachment 164511 [details]
Clear irqs before processing.. Patch on top of 'another approach'

Can you try with the irq_put/get lines commented, and this patch applied? Lets see if clearing irqs before processing helps..

Also there is a line in radeon_fence.c with "radeon_irq_kms_sw_irq_get_delayed", can you add a WARN_ON_ONCE above that line after the else? To check if that path is taken...
Comment 20 Gustaw Smolarczyk 2015-01-23 22:58:51 UTC
I will try that on Sunday since I have no access to the radeonsi hardware at the moment.
Comment 21 Jon Arne Jørgensen 2015-01-24 12:57:51 UTC
Hi,
I'm the reporter of bug 90861, which migth be related to this bug.

I'm now trying to compile a kernel with the patch from Maarten (attachment 164421 [details]) to see if it changes anything, but I can't get it to compile:

[...snip...]
  Building modules, stage 2.
  MODPOST 238 modules
  OBJCOPY arch/x86/boot/compressed/vmlinux.bin
  GZIP    arch/x86/boot/compressed/vmlinux.bin.gz
ERROR: "__tracepoint_fence_enable_signal" [drivers/gpu/drm/radeon/radeon.ko] undefined!
make[2]: *** [__modpost] Error 1
make[1]: *** [modules] Error 2
make[1]: *** Waiting for unfinished jobs....
[...snip...]

I've tried to enable everythoing I could find related to tracepoint in the kernel config, but with no success.

Any pointers?
Comment 22 Gustaw Smolarczyk 2015-01-25 19:55:38 UTC
@Maarten Lankhorst

The "Clear irqs..." patch doesn't seem to help.

Also, the WARN_ON_ONCE() doesn't seem to be hit (I have added 'WARN_ON_ONCE(1);' in the else block). The dmesg doesn't have anything related to it. Am I right that the WARN_ON_ONCE line should appear in dmesg in case it is reached?

@Jon Arne Jørgensen

About patch compilation: I have applied it on top of 3.19-rc5. Are you by any chance using 3.18?
Comment 23 Jon Arne Jørgensen 2015-01-25 20:03:00 UTC
@Gustaw:

~/Development/c/kernel/src% git describe --long --tags                 
v3.19-rc5-0-gec6f34e

So, that should be 3.19-rc5 :)
Comment 24 Jon Arne Jørgensen 2015-01-25 23:00:35 UTC
I just commented out the trace_fence_enable_signal call to get it to compile. Figured I don't need the tracepoint debug to test the patch.

I also uncommented the sw_irq_get and sw_irq_put calls, and can report that the patch seems to fix bug 90861.

I guess this is proof that the bug is related.

I didn't try the patch with sw_irq_get/set disabled. Just ask, and I will give it a try.
Comment 25 Maarten Lankhorst 2015-01-28 12:21:36 UTC
Oke, can you test without the calls to sw_irq_get/put?
Comment 26 Jon Arne Jørgensen 2015-01-28 12:24:15 UTC
I've tested, and can confirm that they are needed for the patch to fix the crash.
Comment 27 Maarten Lankhorst 2015-01-28 15:34:41 UTC
Created attachment 165071 [details]
Print refcounts on sw irq, panic on disabled.

Can you test this patch on top of 'another approach', keep the sw_irq_get/put commented?
Also enable CONFIG_FENCE_TRACE please.

And then print the entire dmesg..
Comment 28 Gustaw Smolarczyk 2015-01-28 19:43:20 UTC
Created attachment 165091 [details]
dmesg after "Print refcounts..." patch

I have applied this patch on top of original "another approach" one.

I think the pauses are now shorter, but more frequent.

The contents of dmesg has been attached. I have taken it from /var/log/messages, since the in-kernel dmesg buffer was overwritten many times over. I truncated the messages a bit, since I stopped testing a game at ~80s of uptime.
Comment 29 Maarten Lankhorst 2015-01-29 12:14:34 UTC
Thanks, that rules out 1 possibility. There is probably no bug in refcounting irqs. :P
Comment 30 Maarten Lankhorst 2015-01-29 12:21:32 UTC
Can you empty the contents of the radeon_fence_schedule_check function in radeon_fence.c?

It should make the pauses worse, giving me a chance to find them in the log. :-)
Comment 31 Gustaw Smolarczyk 2015-01-29 12:35:49 UTC
Do you mean on top of the latest patch?
Comment 32 Maarten Lankhorst 2015-01-29 13:29:37 UTC
You can keep print refcounts or remove it, it doesn't matter much either way. I just want to have at least 'another approach' applied.
Comment 33 Gustaw Smolarczyk 2015-01-29 13:38:43 UTC
Created attachment 165161 [details]
dmesg after disabling radeon_fence_schedule_check()

I chose to leave the refcount printing patch as it is.

The pauses now are much much longer (~10-20s).

The 2h time difference in the messages log is due to local time not being set correctly and ntp-client daemon fixing it after some time.
Comment 34 Jon Arne Jørgensen 2015-02-05 09:10:53 UTC
Created attachment 165901 [details]
dmesg with refcount on sw irq

Hi, Maarten I finaly got around to append the patch you posted in comment 27.
Not sure if you still need this.

I'm also not sure if I should stick with my original bug report (bug 90861) to not clutter this one?

PS. I had to compile the radeon driver into the kernel to get it to compile.
If I compile the radeon driver as module, the compile will break with this error:
ERROR: "__tracepoint_fence_enable_signal" [drivers/gpu/drm/radeon/radeon.ko] undefined!

But I guess that is another bug report :)
Comment 35 Jon Arne Jørgensen 2015-02-05 09:13:22 UTC
Created attachment 165911 [details]
dmesg with refcount on sw irq

Hm, seems that auto-detect mime failed with the last attachment?
Comment 36 Maarten Lankhorst 2015-02-12 08:35:18 UTC
Ugh.. can you test "another approach" again, but change the list_add to list_add_tail ? Looks like I messed it up there..
Comment 37 Maarten Lankhorst 2015-02-12 09:47:56 UTC
Never mind, that won't matter much..
Comment 38 Maarten Lankhorst 2015-02-12 09:49:13 UTC
Created attachment 166571 [details]
Use fence_add/remove_callback
Comment 39 Gustaw Smolarczyk 2015-02-12 09:53:56 UTC
Should I apply it on top of 3.19.0? Or is it an extension of some previous patch?
Comment 40 Maarten Lankhorst 2015-02-12 14:35:35 UTC
It replaces the previous attempts to fix the bug.
Comment 41 Gustaw Smolarczyk 2015-02-12 18:28:33 UTC
Unfortunately, still no good.
Comment 42 Damien Grassart 2015-02-23 11:48:18 UTC
Hi, is there any progress on this bug? The current kernel makes Dota 2 unplayable on my Radeon HD 7870. I can also confirm that the "another approach" patch with the two lines uncommented fixes the problem (as Gustaw already mentioned above).

Is there anything I can do or provide that might help resolve this?
Comment 43 Maarten Lankhorst 2015-03-02 08:29:57 UTC
After applying attachment 166571 [details] can you add a usleep(50000); in radeon_fence_enable_signaling after the line "radeon_irq_kms_sw_irq_get(rdev, fence->ring);" but before the line "if (radeon_fence_activity(rdev, fence->ring))"?

It's a hack, to confirm if my suspicions are correct. It will probably make everything a lot slower, but I only care if the pauses are gone or not..
Comment 44 Gustaw Smolarczyk 2015-03-02 09:15:57 UTC
I think you meant udelay(50000). (I just used msleep(50) and got a nice lock-up since the call was in an atomic context).

Just a sec.
Comment 45 Gustaw Smolarczyk 2015-03-02 09:27:07 UTC
(In the end I had to use mdelay(50))

The performance was really traumatic (8-11fps). But on a positive side, there seemed to be no pauses. There was one moment when fps dropped to 0 (for ~1/3 of a second), but it may have been due to completely different reason.
Comment 46 Maarten Lankhorst 2015-03-02 09:46:28 UTC
Ok what about udelay(500), should be a smaller pause?
Comment 47 Gustaw Smolarczyk 2015-03-02 09:55:54 UTC
With 500us delay, the performance is nearly back to how it was (~70fps). And the pauses are still gone.
Comment 48 Maarten Lankhorst 2015-03-02 10:03:56 UTC
Created attachment 168571 [details]
flush hdp harder

No idea if this does anything, adding a bunch of flushes instead of doing a sleep.. pauses still occur?

If not can you check if it still works with only the chunk in radeon_fence_enable_signalling?
Comment 49 Gustaw Smolarczyk 2015-03-02 10:11:02 UTC
(had to remove "robj->" from the patch).

Doesn't help.
Comment 50 Maarten Lankhorst 2015-03-02 10:19:41 UTC
Ok never mind that approach then..

in radeon_fence.c, add a WARN_ON((int)atomic_read(&rdev->irq.ring_int[fence->ring]) <= 0); after the sw_irq_get, it should never fire.

in si.c you have a few lines like this:

	if (atomic_read(&rdev->irq.ring_int[RADEON_RING_TYPE_GFX_INDEX])) {
		DRM_DEBUG("si_irq_set: sw int gfx\n");
		cp_int_cntl |= TIME_STAMP_INT_ENABLE;
	}
	if (atomic_read(&rdev->irq.ring_int[CAYMAN_RING_TYPE_CP1_INDEX])) {
		DRM_DEBUG("si_irq_set: sw int cp1\n");
		cp_int_cntl1 |= TIME_STAMP_INT_ENABLE;
	}
	if (atomic_read(&rdev->irq.ring_int[CAYMAN_RING_TYPE_CP2_INDEX])) {
		DRM_DEBUG("si_irq_set: sw int cp2\n");
		cp_int_cntl2 |= TIME_STAMP_INT_ENABLE;
	}

Can you first change all of the conditions to atomic_read(..) || 1 so it always sets the irqs? That should fix the hangs at some cpu overhead..

If that works and you no longer get hangs, can you remove the || 1 for RADEON_RING_TYPE_GFX_INDEX, see if the hangs are still gone?
Comment 51 Gustaw Smolarczyk 2015-03-02 10:21:25 UTC
Should I apply it on top of attachment 166571 [details]?
Comment 52 Maarten Lankhorst 2015-03-02 10:22:05 UTC
Yes, and also add a WARN_ON(!rdev->ih.enabled); to radeon_fence_enable_signaling, just to be thorough..
Comment 53 Gustaw Smolarczyk 2015-03-02 10:41:33 UTC
At first, I added "|| 1" only to GFX, CP1 and CP2 (I omitted the two DMA conditions). This did NOT fix the pauses.

However, after converting these final two DMA conditions, the pauses are gone. Reverting GFX condition didn't change that.

Also - is there an easy way to check whether the WARN_ON have been fired or not?
Comment 54 Maarten Lankhorst 2015-03-02 10:54:13 UTC
Just look if there's some output in dmesg or not.

To verify, only having interrupts enabled on the DMA conditions fixes your bug?
Comment 55 Maarten Lankhorst 2015-03-02 11:00:12 UTC
Do you happen to know if it's DMA or DMA1?
Comment 56 Maarten Lankhorst 2015-03-02 11:04:36 UTC
Created attachment 168581 [details]
reshuffle evergreen_dma_fence_ring_emit

Try this patch perhaps, and remove the || 1's?
Comment 57 Gustaw Smolarczyk 2015-03-02 11:05:39 UTC
Will try when I return home.
Comment 58 Gustaw Smolarczyk 2015-03-02 17:09:40 UTC
This patch doesn't help.

I have been experimenting with forcing interrupts on only some rings. In the end I have found that by adding "|| 1" only to R600_RING_TYPE_DMA_INDEX ring I have removed the pauses.
Comment 59 Maarten Lankhorst 2015-03-02 18:46:56 UTC
I was afraid of that, that's the ring used for eviction.
Comment 60 Maarten Lankhorst 2015-03-02 18:49:41 UTC
can you add a if (fence->ring == R600_RING_TYPE_DMA_INDEX) udelay(50); to radeon_fence_enable_signaling? after the irq_get again. If that works try udelay(5), if it doesn't udelay(500) ?
Comment 61 Gustaw Smolarczyk 2015-03-02 19:19:56 UTC
5us is ok.
Comment 62 Gustaw Smolarczyk 2015-03-02 19:24:21 UTC
Even 1us works.
Comment 63 Maarten Lankhorst 2015-03-02 20:02:13 UTC
What about changing the udelay to a RREG32(DMA_CNTL + DMA0_REGISTER_OFFSET); ?
Comment 64 Gustaw Smolarczyk 2015-03-02 20:23:20 UTC
Yes, the fix still works.
Comment 65 Maarten Lankhorst 2015-03-02 20:27:58 UTC
So attachment 166571 [details] and the rreg32 addition is enough?
Comment 66 Gustaw Smolarczyk 2015-03-02 20:29:37 UTC
It seems that way. I hope others can check that too.
Comment 67 Maarten Lankhorst 2015-03-02 20:33:56 UTC
what about a rmb(); there instead?
Comment 68 Gustaw Smolarczyk 2015-03-02 20:40:57 UTC
rmb(); doesn't seem to be enough.
Comment 69 Maarten Lankhorst 2015-03-02 20:48:21 UTC
Hm full mb(); ?
Comment 70 Gustaw Smolarczyk 2015-03-02 21:07:20 UTC
mb(); is ok.
Comment 71 Maarten Lankhorst 2015-03-02 22:00:50 UTC
Thanks for helping investigate, I will send the attachment as a bugfix to bug 90861, that bug is specifically about the waitqueue ordering in radeon_fence_default_wait. That one was really caused by me.

The bug here just happened to be triggered more reliably because there is a lot less wait time between sw_irq_get and radeon_fence_read. It needs some kind of barrier but I'm not sure which one. The radeon dev Alex Deucher will look into it and talk with some other people about the proper way to fix this.

Thanks for all your help, I'll cc you when I send my bugfix.
Comment 72 Alex Deucher 2015-03-02 22:11:46 UTC
Does reading any arbitrary register also work?  E.g., SRBM_STATUS
Comment 73 Michel Dänzer 2015-03-03 01:44:24 UTC
(In reply to Alex Deucher from comment #72)
> Does reading any arbitrary register also work?  E.g., SRBM_STATUS

I think that would indeed be the right thing to do here, to prevent the PCIe bridge from delaying the register writes.

Probably better to also check the fence status again after enabling the IRQ, to handle the case where the hardware might signal the fence after we check its status but before we enable the IRQ.
Comment 74 Alex Deucher 2015-03-03 01:47:52 UTC
posting reads implemented here:
http://cgit.freedesktop.org/~agd5f/linux/log/?h=posting-read
Comment 75 Maarten Lankhorst 2015-03-03 06:34:22 UTC
Michel, enabling irqs before checking the fence signaling was exactly what I was doing already, I'm hoping the attachment plus the branch is enough. :)
Comment 76 Dieter Nützel 2015-03-03 17:59:59 UTC
Hello Maarten,

is https://bugzilla.kernel.org/attachment.cgi?id=168581
needed or obsolete with Alex's patches?
Comment 77 Gustaw Smolarczyk 2015-03-03 18:52:39 UTC
No pauses after I applied attachment 166571 [details] and "drm/radeon: do a posting read in si_set_irq" from posting-read branch on top of v3.19. So I guess these commits combined fix this issue.

@Dieter: I did not need to apply this patch in order to fix this issue.
Comment 78 Dieter Nützel 2015-03-03 21:50:06 UTC
(In reply to Gustaw Smolarczyk from comment #77)
> No pauses after I applied attachment 166571 [details] and "drm/radeon: do a
> posting read in si_set_irq" from posting-read branch on top of v3.19. So I
> guess these commits combined fix this issue.
> 
> @Dieter: I did not need to apply this patch in order to fix this issue.

Thanks Gustaw!

In the meantime I have it running on r600g/NI/Turks XT with:

drm-next-4.1-wip
+
HDMI fixes from 'audio-fixes'
+
' posting-read'
drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch
drm-radeon-do-a-posting-read-in-r600_set_irq.patch (better more than less...;-)
+
attachment 166571 [details]
drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch
+
drm-radeon-fix-DRM_IOCTL_RADEON_CS-oops.patch

Cheers,
   Dieter
Comment 79 Maarten Lankhorst 2015-03-04 09:21:00 UTC
Yeah all patches are obsolete except attachment 166571 [details] and the posting read patches..