Bug 90741
Summary: | Radeon: System pauses on TAHITI | ||
---|---|---|---|
Product: | Drivers | Reporter: | Gustaw Smolarczyk (wielkiegie) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | NEW --- | ||
Severity: | normal | CC: | abortretryfail, alexdeucher, bugs, damien, deathsimple, Dieter, jonjon.arnearne, julien.isorce, szg00000 |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 3.18.0 3.18.1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
/proc/interrupts on 3.17.4
Restore old fence wait behavior Use exclusive wait to put the waiter at the end of the waitqueue Working patch Final version. Another approach Clear irqs before processing.. Patch on top of 'another approach' Print refcounts on sw irq, panic on disabled. dmesg after "Print refcounts..." patch dmesg after disabling radeon_fence_schedule_check() dmesg with refcount on sw irq dmesg with refcount on sw irq Use fence_add/remove_callback flush hdp harder reshuffle evergreen_dma_fence_ring_emit |
Description
Gustaw Smolarczyk
2015-01-04 00:35:37 UTC
I have managed to bisect this regression: f2c24b83ae90292d315aa7ac029c6ce7929e01aa is the first bad commit commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa Author: Maarten Lankhorst <maarten.lankhorst@canonical.com> Date: Wed Apr 2 17:14:48 2014 +0200 drm/ttm: flip the switch, and convert to dma_fence Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com> :040000 040000 634b9d842856f9a253b8250b6b61423a8b19bf7d 82569dbc605b725ecefac7457b25599158af5fb6 M drivers :040000 040000 6ff7fd43b2671ac388b81ec3c9ab47f4a2d39162 0c9d3986f193c83b952b05e7ec1f1c777a44e88e M include I am not familiar with the kernel code so I cannot say whether it's a real regression or is just a problem with a userspace. Does it still happen with a 3.19-rc kernel? Ok so that commit introduces the bad behavior, does commit 39e7f6f84b3a3aa4520504473f2e2bac1f949ffa "drm/radeon: use rcu waits in some ioctls" or commit "drm/ttm: use rcu in core ttm" change the bad behavior in any way? There seems to be no substantial difference in 3.19-rc2, 39e7f6f84b3a3aa4520504473f2e2bac1f949ffa or 472db7ab3093bf2a2999f6b5aa64a030466d6f92. The only thing (that may be subjective) is that there seems to be a bit less pauses in final 3.18 and later. (did not bisect that, may be totally subjective) Also of importance can be that I am using 2 GPUs (discrete AMD TAHITI and integrated Intel gen7 (IVB)) in randr 1.4 offloading configuration: xrandr --setprovideroutputsource Intel radeon xrandr --output HDMI2 --mode 1280x1024 --rate 60 --right-of DVI-0 Where: DVI-0 is my main monitor and is connected to TAHITI HDMI2 is my secondary monitor and is connected to IVB The OpenGL window is on the main monitor If that may be the source of these problems, I could disable the second monitor along with the iGPU in order to test that hypothesis. (In reply to Gustaw Smolarczyk from comment #4) > If that may be the source of these problems, I could disable the second > monitor along with the iGPU in order to test that hypothesis. That would be interesting. I think that right now i915 doesn't use the fence stuff yet so it's probably not causing the problem, still it would be useful to check. Unfortunately, it was a false lead. Disabling iGPU didn't help in any way on 3.18.1. Additionally, I take back the "less pauses in 3.18.0 and later" thing. 3.18.1 was horrible just now (same as any kernel after the commit bisected). Is there any way I can help investigating it more? Just a shot into the dark, but can you please post the content of /proc/interrupts. Created attachment 162601 [details]
/proc/interrupts on 3.17.4
I have attached the contents of /proc/interrupts on 3.17.4. Should I attach one with a faulty kernel too?
(In reply to Gustaw Smolarczyk from comment #9) > Created attachment 162601 [details] > /proc/interrupts on 3.17.4 > > I have attached the contents of /proc/interrupts on 3.17.4. Should I attach > one with a faulty kernel too? No, all I wanted to check is that interrupts work as expected. And that indeed seems to be the case here: 26: 40191 12200 8939 6450 22186 15970 9307 6821 PCI-MSI-edge radeon Created attachment 163871 [details]
Restore old fence wait behavior
Ok the only difference I can see is the fence wait function being different in that commit. The new one should be correct, so lets see if it that's what causes it, or if it's the introduction of timeouts in ttm.
Does this hack change things?
This hack, when applied on top of 3.19-rc5, seems to fix the problem. Created attachment 164121 [details]
Use exclusive wait to put the waiter at the end of the waitqueue
Ok seems I made a thinko. wait_event_interruptible_timeout and wait_event_timeout add items to the front of the queue instead of the back, this may result in fences never signalling.
Can you test the attached patch?
Created attachment 164131 [details]
Working patch
Created attachment 164141 [details]
Final version.
Can you test if this thing fixes it? Forget the other patch, this one should fix the bug.
Unfortunately, the last patch doesn't seem to help in any way. Should I test the other two too? Created attachment 164421 [details]
Another approach
Don't need to test anything but final version, can you test this version though? I want to see if the bug's in the wait code or something with the interrupt handling..
If this version fails too, can you uncomment the sw_irq_get and sw_irq_put calls added in the patch?
Only the second version (with the 2 lines uncommented) of the new patch does fix the problem. Created attachment 164511 [details]
Clear irqs before processing.. Patch on top of 'another approach'
Can you try with the irq_put/get lines commented, and this patch applied? Lets see if clearing irqs before processing helps..
Also there is a line in radeon_fence.c with "radeon_irq_kms_sw_irq_get_delayed", can you add a WARN_ON_ONCE above that line after the else? To check if that path is taken...
I will try that on Sunday since I have no access to the radeonsi hardware at the moment. Hi, I'm the reporter of bug 90861, which migth be related to this bug. I'm now trying to compile a kernel with the patch from Maarten (attachment 164421 [details]) to see if it changes anything, but I can't get it to compile: [...snip...] Building modules, stage 2. MODPOST 238 modules OBJCOPY arch/x86/boot/compressed/vmlinux.bin GZIP arch/x86/boot/compressed/vmlinux.bin.gz ERROR: "__tracepoint_fence_enable_signal" [drivers/gpu/drm/radeon/radeon.ko] undefined! make[2]: *** [__modpost] Error 1 make[1]: *** [modules] Error 2 make[1]: *** Waiting for unfinished jobs.... [...snip...] I've tried to enable everythoing I could find related to tracepoint in the kernel config, but with no success. Any pointers? @Maarten Lankhorst The "Clear irqs..." patch doesn't seem to help. Also, the WARN_ON_ONCE() doesn't seem to be hit (I have added 'WARN_ON_ONCE(1);' in the else block). The dmesg doesn't have anything related to it. Am I right that the WARN_ON_ONCE line should appear in dmesg in case it is reached? @Jon Arne Jørgensen About patch compilation: I have applied it on top of 3.19-rc5. Are you by any chance using 3.18? @Gustaw: ~/Development/c/kernel/src% git describe --long --tags v3.19-rc5-0-gec6f34e So, that should be 3.19-rc5 :) I just commented out the trace_fence_enable_signal call to get it to compile. Figured I don't need the tracepoint debug to test the patch. I also uncommented the sw_irq_get and sw_irq_put calls, and can report that the patch seems to fix bug 90861. I guess this is proof that the bug is related. I didn't try the patch with sw_irq_get/set disabled. Just ask, and I will give it a try. Oke, can you test without the calls to sw_irq_get/put? I've tested, and can confirm that they are needed for the patch to fix the crash. Created attachment 165071 [details]
Print refcounts on sw irq, panic on disabled.
Can you test this patch on top of 'another approach', keep the sw_irq_get/put commented?
Also enable CONFIG_FENCE_TRACE please.
And then print the entire dmesg..
Created attachment 165091 [details]
dmesg after "Print refcounts..." patch
I have applied this patch on top of original "another approach" one.
I think the pauses are now shorter, but more frequent.
The contents of dmesg has been attached. I have taken it from /var/log/messages, since the in-kernel dmesg buffer was overwritten many times over. I truncated the messages a bit, since I stopped testing a game at ~80s of uptime.
Thanks, that rules out 1 possibility. There is probably no bug in refcounting irqs. :P Can you empty the contents of the radeon_fence_schedule_check function in radeon_fence.c? It should make the pauses worse, giving me a chance to find them in the log. :-) Do you mean on top of the latest patch? You can keep print refcounts or remove it, it doesn't matter much either way. I just want to have at least 'another approach' applied. Created attachment 165161 [details]
dmesg after disabling radeon_fence_schedule_check()
I chose to leave the refcount printing patch as it is.
The pauses now are much much longer (~10-20s).
The 2h time difference in the messages log is due to local time not being set correctly and ntp-client daemon fixing it after some time.
Created attachment 165901 [details] dmesg with refcount on sw irq Hi, Maarten I finaly got around to append the patch you posted in comment 27. Not sure if you still need this. I'm also not sure if I should stick with my original bug report (bug 90861) to not clutter this one? PS. I had to compile the radeon driver into the kernel to get it to compile. If I compile the radeon driver as module, the compile will break with this error: ERROR: "__tracepoint_fence_enable_signal" [drivers/gpu/drm/radeon/radeon.ko] undefined! But I guess that is another bug report :) Created attachment 165911 [details]
dmesg with refcount on sw irq
Hm, seems that auto-detect mime failed with the last attachment?
Ugh.. can you test "another approach" again, but change the list_add to list_add_tail ? Looks like I messed it up there.. Never mind, that won't matter much.. Created attachment 166571 [details]
Use fence_add/remove_callback
Should I apply it on top of 3.19.0? Or is it an extension of some previous patch? It replaces the previous attempts to fix the bug. Unfortunately, still no good. Hi, is there any progress on this bug? The current kernel makes Dota 2 unplayable on my Radeon HD 7870. I can also confirm that the "another approach" patch with the two lines uncommented fixes the problem (as Gustaw already mentioned above). Is there anything I can do or provide that might help resolve this? After applying attachment 166571 [details] can you add a usleep(50000); in radeon_fence_enable_signaling after the line "radeon_irq_kms_sw_irq_get(rdev, fence->ring);" but before the line "if (radeon_fence_activity(rdev, fence->ring))"?
It's a hack, to confirm if my suspicions are correct. It will probably make everything a lot slower, but I only care if the pauses are gone or not..
I think you meant udelay(50000). (I just used msleep(50) and got a nice lock-up since the call was in an atomic context). Just a sec. (In the end I had to use mdelay(50)) The performance was really traumatic (8-11fps). But on a positive side, there seemed to be no pauses. There was one moment when fps dropped to 0 (for ~1/3 of a second), but it may have been due to completely different reason. Ok what about udelay(500), should be a smaller pause? With 500us delay, the performance is nearly back to how it was (~70fps). And the pauses are still gone. Created attachment 168571 [details]
flush hdp harder
No idea if this does anything, adding a bunch of flushes instead of doing a sleep.. pauses still occur?
If not can you check if it still works with only the chunk in radeon_fence_enable_signalling?
(had to remove "robj->" from the patch). Doesn't help. Ok never mind that approach then.. in radeon_fence.c, add a WARN_ON((int)atomic_read(&rdev->irq.ring_int[fence->ring]) <= 0); after the sw_irq_get, it should never fire. in si.c you have a few lines like this: if (atomic_read(&rdev->irq.ring_int[RADEON_RING_TYPE_GFX_INDEX])) { DRM_DEBUG("si_irq_set: sw int gfx\n"); cp_int_cntl |= TIME_STAMP_INT_ENABLE; } if (atomic_read(&rdev->irq.ring_int[CAYMAN_RING_TYPE_CP1_INDEX])) { DRM_DEBUG("si_irq_set: sw int cp1\n"); cp_int_cntl1 |= TIME_STAMP_INT_ENABLE; } if (atomic_read(&rdev->irq.ring_int[CAYMAN_RING_TYPE_CP2_INDEX])) { DRM_DEBUG("si_irq_set: sw int cp2\n"); cp_int_cntl2 |= TIME_STAMP_INT_ENABLE; } Can you first change all of the conditions to atomic_read(..) || 1 so it always sets the irqs? That should fix the hangs at some cpu overhead.. If that works and you no longer get hangs, can you remove the || 1 for RADEON_RING_TYPE_GFX_INDEX, see if the hangs are still gone? Should I apply it on top of attachment 166571 [details]?
Yes, and also add a WARN_ON(!rdev->ih.enabled); to radeon_fence_enable_signaling, just to be thorough.. At first, I added "|| 1" only to GFX, CP1 and CP2 (I omitted the two DMA conditions). This did NOT fix the pauses. However, after converting these final two DMA conditions, the pauses are gone. Reverting GFX condition didn't change that. Also - is there an easy way to check whether the WARN_ON have been fired or not? Just look if there's some output in dmesg or not. To verify, only having interrupts enabled on the DMA conditions fixes your bug? Do you happen to know if it's DMA or DMA1? Created attachment 168581 [details]
reshuffle evergreen_dma_fence_ring_emit
Try this patch perhaps, and remove the || 1's?
Will try when I return home. This patch doesn't help. I have been experimenting with forcing interrupts on only some rings. In the end I have found that by adding "|| 1" only to R600_RING_TYPE_DMA_INDEX ring I have removed the pauses. I was afraid of that, that's the ring used for eviction. can you add a if (fence->ring == R600_RING_TYPE_DMA_INDEX) udelay(50); to radeon_fence_enable_signaling? after the irq_get again. If that works try udelay(5), if it doesn't udelay(500) ? 5us is ok. Even 1us works. What about changing the udelay to a RREG32(DMA_CNTL + DMA0_REGISTER_OFFSET); ? Yes, the fix still works. So attachment 166571 [details] and the rreg32 addition is enough?
It seems that way. I hope others can check that too. what about a rmb(); there instead? rmb(); doesn't seem to be enough. Hm full mb(); ? mb(); is ok. Thanks for helping investigate, I will send the attachment as a bugfix to bug 90861, that bug is specifically about the waitqueue ordering in radeon_fence_default_wait. That one was really caused by me. The bug here just happened to be triggered more reliably because there is a lot less wait time between sw_irq_get and radeon_fence_read. It needs some kind of barrier but I'm not sure which one. The radeon dev Alex Deucher will look into it and talk with some other people about the proper way to fix this. Thanks for all your help, I'll cc you when I send my bugfix. Does reading any arbitrary register also work? E.g., SRBM_STATUS (In reply to Alex Deucher from comment #72) > Does reading any arbitrary register also work? E.g., SRBM_STATUS I think that would indeed be the right thing to do here, to prevent the PCIe bridge from delaying the register writes. Probably better to also check the fence status again after enabling the IRQ, to handle the case where the hardware might signal the fence after we check its status but before we enable the IRQ. posting reads implemented here: http://cgit.freedesktop.org/~agd5f/linux/log/?h=posting-read Michel, enabling irqs before checking the fence signaling was exactly what I was doing already, I'm hoping the attachment plus the branch is enough. :) Hello Maarten, is https://bugzilla.kernel.org/attachment.cgi?id=168581 needed or obsolete with Alex's patches? No pauses after I applied attachment 166571 [details] and "drm/radeon: do a posting read in si_set_irq" from posting-read branch on top of v3.19. So I guess these commits combined fix this issue.
@Dieter: I did not need to apply this patch in order to fix this issue.
(In reply to Gustaw Smolarczyk from comment #77) > No pauses after I applied attachment 166571 [details] and "drm/radeon: do a > posting read in si_set_irq" from posting-read branch on top of v3.19. So I > guess these commits combined fix this issue. > > @Dieter: I did not need to apply this patch in order to fix this issue. Thanks Gustaw! In the meantime I have it running on r600g/NI/Turks XT with: drm-next-4.1-wip + HDMI fixes from 'audio-fixes' + ' posting-read' drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch drm-radeon-do-a-posting-read-in-r600_set_irq.patch (better more than less...;-) + attachment 166571 [details] drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch + drm-radeon-fix-DRM_IOCTL_RADEON_CS-oops.patch Cheers, Dieter Yeah all patches are obsolete except attachment 166571 [details] and the posting read patches..
|