Bug 85421 - radeon stalled, GPU lockup, reset and failed on resume; crashed by firefox.
Summary: radeon stalled, GPU lockup, reset and failed on resume; crashed by firefox.
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_video-dri
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-01 23:21 UTC by Hin-Tak Leung
Modified: 2021-09-24 18:28 UTC
CC List: 10 users

See Also:
Kernel Version: 3.16.3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
/var/log/messages from radeon 0000:00:01.0: ring 0 stalled to reboot. (347.88 KB, application/octet-stream)
2014-10-01 23:21 UTC, Hin-Tak Leung
var log message, another crash, 4 days later. (266.23 KB, text/plain)
2014-10-07 21:41 UTC, Hin-Tak Leung
whole dmesg from vt with 3.16.6 when it crashed. (162.46 KB, text/plain)
2014-11-09 09:48 UTC, Hin-Tak Leung
/var/log/message, another GPU crash under mesa 10.3.3 (910.35 KB, text/plain)
2014-11-21 22:33 UTC, Hin-Tak Leung
screen corruption just before suspend & GPU crash on resume (126.94 KB, image/png)
2015-02-09 16:43 UTC, Hin-Tak Leung
the part of /var/log/messages about GPU lock up and oops in 3.18.9-200.fc21.x86_64 (439.62 KB, application/octet-stream)
2015-03-17 05:21 UTC, Hin-Tak Leung
output of: sudo journalctl -b -1 --all --no-pager (371.48 KB, text/plain)
2015-03-23 14:50 UTC, abandoned account
dmesg (330.92 KB, text/plain)
2015-03-24 15:39 UTC, abandoned account
instant blanking without recovery when radeon.lockup_timeout=20 (341.04 KB, text/plain)
2015-03-24 16:19 UTC, abandoned account
same error as OP (dmesg) (333.47 KB, text/plain)
2015-03-25 09:41 UTC, abandoned account

Description Hin-Tak Leung 2014-10-01 23:21:23 UTC
Created attachment 152191
/var/log/messages from radeon 0000:00:01.0: ring 0 stalled to reboot.

I was away from the computer when the radeon DRI driver crashed. I had left a fair number of firefox windows/tabs open, some of which may have had videos (from the BBC news web site) and animated gifs (from another web site) playing. It crashed about 5-10 minutes after I walked away, and I was aware of it because the laptop blipped.

# lspci | grep VGA
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Mullins [Radeon R3 Graphics]

Some excerpts from the attached log:

...
 [ 8770.250116] radeon 0000:00:01.0: ring 0 stalled for more than 10012msec
 [ 8770.250128] radeon 0000:00:01.0: GPU lockup (waiting for 0x0000000000056034 last fence id 0x0000000000056031 on ring 0)
 [ 8770.298635] radeon 0000:00:01.0: Saved 14196 dwords of commands on ring 0.
 [ 8770.298663] radeon 0000:00:01.0: GPU softreset: 0x0000000C
...
 [ 8770.313299] radeon 0000:00:01.0: GPU reset succeeded, trying to resume
...
 [ 8770.339724] [drm] ring test on 0 succeeded in 3 usecs
 [ 8770.518568] [drm:cik_ring_test] *ERROR* radeon: ring 1 test failed (scratch(0x3010C)=0xCAFEDEAD)
 [ 8770.752885] [drm:cik_sdma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
 [ 8770.752892] [drm:cik_resume] *ERROR* cik startup failed on resume
 [ 8780.753181] radeon 0000:00:01.0: ring 0 stalled for more than 10001msec
 [ 8780.753193] radeon 0000:00:01.0: GPU lockup (waiting for 0x00000000000560f7 last fence id 0x0000000000056031 on ring 0)
 [ 8780.753199] [drm:cik_ib_test] *ERROR* radeon: fence wait failed (-35).
 [ 8780.753209] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35).
 [ 8780.753215] radeon 0000:00:01.0: ib ring test failed (-35).
 [ 8780.762131] radeon 0000:00:01.0: GPU softreset: 0x0000000C
...
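
For triaging logs like the excerpt above, a small script can pull out the stalled ring, the stall duration, and the gap between the fence being waited for and the last fence that signalled. This is only a sketch: the function name `summarize_lockups` and the `fence_gap` field are made up for illustration, and the patterns assume the exact message wording shown in this excerpt (other kernel versions may word the lockup line differently).

```python
import re

# Patterns modelled on the two message shapes quoted in the excerpt.
# Assumption: other kernel versions may phrase these lines differently,
# in which case they are simply not matched.
STALL_RE = re.compile(r"ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"GPU lockup \(waiting for (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def summarize_lockups(log_text):
    """Return one dict per 'GPU lockup' line found in log_text."""
    events = []
    stall_ms = None  # duration from the most recent 'stalled' line, if any
    for line in log_text.splitlines():
        match = STALL_RE.search(line)
        if match:
            stall_ms = int(match.group(2))
            continue
        match = LOCKUP_RE.search(line)
        if match:
            waiting_for = int(match.group(1), 16)
            last_signalled = int(match.group(2), 16)
            events.append({
                "ring": int(match.group(3)),
                "stall_ms": stall_ms,
                # How far the awaited fence is ahead of the last one seen.
                "fence_gap": waiting_for - last_signalled,
            })
            stall_ms = None
    return events


SAMPLE = """\
[ 8770.250116] radeon 0000:00:01.0: ring 0 stalled for more than 10012msec
[ 8770.250128] radeon 0000:00:01.0: GPU lockup (waiting for 0x0000000000056034 last fence id 0x0000000000056031 on ring 0)
"""

if __name__ == "__main__":
    for event in summarize_lockups(SAMPLE):
        print(event)
```

Run on the first two lines of the excerpt, this reports ring 0, a 10012 ms stall and a fence gap of 3.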


The kernel is largely the fedora 3.16.3-200 one, built from the koji srpm, but with the additional patch from
https://bugzilla.kernel.org/show_bug.cgi?id=71051#c8

drv ati 7.4.0, mesa 10.2.8, glamor from git 347ef4.
Comment 1 Hin-Tak Leung 2014-10-07 21:41:29 UTC
Created attachment 152841
var log message, another crash, 4 days later.

Same description as the last one: from the ring stall to the reboot. This time it happened while I was running mplayer (with -vo xv, I think), in addition to firefox.

From now on I am using some additional patches, picked by looking at 3.16.3..3.17-rc7, besides the sleep patch: basically anything that affects gpu/radeon and applies cleanly to 3.16.3.

0001-drm-radeon-disable-gfx-cgcg-on-cik.patch
0001-drm-radeon-cik-Read-back-SDMA-WPTR-register-after-wr.patch
0001-drm-radeon-don-t-reset-dma-on-NI-SI-init.patch
0001-drm-radeon-don-t-reset-dma-on-r6xx-evergreen-init.patch
0001-drm-radeon-don-t-reset-sdma-on-CIK-init.patch
0001-drm-radeon-cik-use-a-separate-counter-for-CP-init-ti.patch

The sdma one might be relevant? Also the counter one - the ring test failed on ring 1 and 3?
Comment 2 Hin-Tak Leung 2014-10-08 12:28:38 UTC
The patches listed above don't fix the problem. I had another GPU lock-up while just switching windows, from a file upload dialog in firefox (on another bugzilla elsewhere) to a terminal, to change the permissions of the file being uploaded...
Comment 3 Alex Deucher 2014-10-13 15:41:26 UTC
GPU lockups are usually caused by a problem with the command buffers generated in the usermode acceleration drivers in mesa.  I would suggest trying a newer version of mesa.
Comment 4 Hin-Tak Leung 2014-10-13 16:09:42 UTC
(In reply to Alex Deucher from comment #3)
> GPU lockups are usually caused by a problem with the command buffers
> generated in the usermode acceleration drivers in mesa.  I would suggest
> trying a newer version of mesa.

How recent should I try? I am already using mesa 10.2.8 and libdrm 2.4.58.
libdrm seems to be a more recent install, on 5th Oct, but mesa 10.2.8 was already on when the lock-up happened (twice). mesa 10.3 was released around the same time as 10.2.8.
Comment 5 Alex Deucher 2014-10-13 16:37:15 UTC
(In reply to Hin-Tak Leung from comment #4)
> (In reply to Alex Deucher from comment #3)
> > GPU lockups are usually caused by a problem with the command buffers
> > generated in the usermode acceleration drivers in mesa.  I would suggest
> > trying a newer version of mesa.
> 
> How recent should I try? I am already using mesa 10.2.8 and libdrm 2.4.58 .
> libdrm seems to be a more recent install on 5th Oct, but mesa 10.2.8 was
> already on when the lock-up happened (twice). mesa 10.3 was released around
> the same time as 10.2.8 .

Just try a newer or older version and see if it helps.  If so, try and bisect to narrow down what change on the mesa side caused the problem.
Comment 6 Hin-Tak Leung 2014-10-13 21:28:59 UTC
(In reply to Alex Deucher from comment #5)
> Just try a newer or older version and see if it helps.  If so, try and
> bisect to narrow down what change on the mesa side caused the problem.

Unfortunately it doesn't happen often or reproducibly enough to do a git bisect... This is a new machine which I put linux on exactly a month ago, and things stabilised perhaps around when I put mesa 10.2.8 on, on 25th Sept. It is my "main" machine now, and it locked up twice in 18 days, which is frequent enough to be troublesome but not frequent enough to bisect or to go back and forth between versions...

I do think it is a kernel problem though, as it seems to be accompanied by X and gnome-shell segfaulting. I still have the core dump from X if that helps?
Comment 7 Alan 2014-10-23 15:50:11 UTC
Reproducible case seems to be using firefox to review/edit an object you've uploaded to Shapeways. Another one is to load a fairly curved shape into OpenScad and hit F5 to view then rotate it.
Comment 8 Michel Dänzer 2014-10-28 09:52:47 UTC
(In reply to Alan from comment #7)
> Reproducible case seems to be using firefox to review/edit an object you've
> uploaded to Shapeways. Another one is to load a fairly curved shape into
> OpenScad and hit F5 to view then rotate it.

Those seem sufficiently different from the scenario described in this report that they should be tracked separately.
Comment 9 Hin-Tak Leung 2014-10-29 01:06:33 UTC
FWIW, I am glad I haven't had a lock-up since the last time I wrote (over two weeks ago). All but two of the patches I mentioned in comment 1 have been integrated, so I dropped them with my current kernel 3.16.6- (and I haven't actively upgraded/downgraded anything else); so it looks like improvements are being made. I hope I don't see that error again :-).
Comment 10 Hin-Tak Leung 2014-11-09 09:48:42 UTC
Created attachment 157061
whole dmesg from vt with 3.16.6 when it crashed.

This time it crashed while I was running just a few terminals and a qemu/kvm window. I was switching terminals (in gnome-shell) to type something (I forget what; maybe just ls -l to check on the growth of the VM's disk image). I had had something running for a few hours inside the VM, which was minimized: if you need to know, just the gcc testsuite, from an ssh session into it, so the VM wasn't using much of its graphics capability.

This is the whole dmesg since boot; so should have all hardware info, history, etc if those are important.

gnome-shell died but I still seemed to have a VT or two, so I just ran dmesg, waited a bit to see that the drm was not coming back, and rebooted.

Am upgrading to mesa 10.2.9 (from 10.2.8) and also to kernel 3.17.2-200 (and dropped all those patches since they were merged) and hoping not to see this problem again.
Comment 11 Hin-Tak Leung 2014-11-09 09:51:13 UTC
This time I didn't have firefox running. Just a few terminals and qemu/kvm. The gcc testsuite inside the vm is demanding enough that I didn't want to run anything else.
Comment 12 Hin-Tak Leung 2014-11-21 22:33:10 UTC
Created attachment 158451
/var/log/message, another GPU crash under mesa 10.3.3

Fedora shipped mesa 10.3.3 
http://koji.fedoraproject.org/koji/buildinfo?buildID=593648
and it upgraded my custom-built 10.2.9. Bad idea!

The GPU crashed again, the first time resuming from a suspend. I had been suspending/resuming under 10.2.9 happily for two+ weeks and was generally happy with it for that period. Though it looks like I upgraded from kernel 3.17.2-200 to 3.17.3-200 yesterday and had not needed to suspend since then.

This time the log is interesting in that an hour into using the newer 10.3.3, I have a pile of:

Nov 21 13:47:47 localhost kernel: radeon 0000:00:01.0: GPU fault detected: 146 0x02690004
Nov 21 13:47:47 localhost kernel: radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00007D93
Nov 21 13:47:47 localhost kernel: radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09000004
Nov 21 13:47:47 localhost kernel: VM fault (0x04, vmid 4) at page 32147, write from 'CB0' (0x43423000) (0)

though it looks like I continued to use the machine for another hour, suspended, and then the GPU crashed on resume. Oh, and I was running just firefox (plus a few terminals) in gnome-shell classic mode in gnome 2.12 copr.
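
The fields in the "VM fault" summary line can be recovered from the two register values printed just before it, which helps when comparing faults across logs. The sketch below shows one way to do that. The bit positions are an assumption, inferred from the log lines in this comment rather than taken from the driver source, and `decode_vm_fault` and `decode_client_tag` are hypothetical helper names.

```python
def decode_vm_fault(addr, status):
    """Split a fault address/status pair into the fields the kernel prints.

    Assumption: the bit layout below is inferred from the quoted log lines,
    not copied from the radeon driver source.
    """
    return {
        "protections": status & 0xFF,       # printed as 0x04
        "mc_id": (status >> 12) & 0xFF,     # trailing number in parentheses
        "access": "write" if (status >> 24) & 0x1 else "read",
        "vmid": (status >> 25) & 0xF,
        "page": addr,                       # the kernel prints this in decimal
    }


def decode_client_tag(tag):
    """Turn a 32-bit client tag such as 0x43423000 into its ASCII name."""
    return tag.to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    # Values from the fault quoted in this comment.
    print(decode_vm_fault(0x00007D93, 0x09000004))
    print(decode_client_tag(0x43423000))
```

For the values above this gives vmid 4, a write access, page 32147 and the client tag 'CB0', matching the kernel's own summary line.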
Comment 13 Jorn Amundsen 2014-11-28 20:30:22 UTC
I also experience similar hangs, on an F20 system with mesa 10.3.3. In total, I have F20 installed on four systems: two using the nvidia kernel driver, one with i915 and one with radeon. Only the radeon system experiences hangs. I get:

Nov 28 10:17:15 t kernel: radeon 0000:01:00.0: ring 5 stalled for more than 10000msec
Nov 28 10:17:15 t kernel: radeon 0000:01:00.0: GPU lockup (waiting for 0x0000000000000004 last fence id 0x0000000000000002 on ring 5)
Nov 28 10:17:15 t kernel: [drm:uvd_v1_0_ib_test] *ERROR* radeon: fence wait failed (-35).
Nov 28 10:17:15 t kernel: [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on ring 5 (-35).
Nov 28 10:17:15 t kernel: [drm:si_dpm_set_power_state] *ERROR* si_set_sw_state failed
Nov 28 10:17:15 t kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x06c24804
Nov 28 10:17:15 t kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x000125B6
Nov 28 10:17:15 t kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048004
Nov 28 10:17:15 t kernel: VM fault (0x04, vmid 1) at page 75190, read from TC (72)
Nov 28 10:17:15 t kernel: radeon 0000:01:00.0: GPU fault detected: 146 0x04a33d04
Nov 28 10:17:15 t kernel: radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00012EA5
Nov 28 10:17:15 t kernel: radeon 0000:01:00.0:  VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0303D004
Nov 28 10:17:15 t kernel: VM fault (0x04, vmid 1) at page 77477, write from DMA1 (61)
Nov 28 10:17:42 t sh: abrt-watch-log: Warning, '/usr/bin/abrt-dump-xorg' did not process its input
...

I cannot pinpoint one specific action which triggers the hang. The first time it was firefox, the next time Konsole, and other times it simply hangs a while after locking the screen. I am considering downgrading mesa. Let me know if there are other measures I can take to help resolve this bug.
Comment 14 Alex Deucher 2014-12-01 02:13:35 UTC
Please make sure your version of mesa has this patch:
http://cgit.freedesktop.org/mesa/mesa/commit/?id=ae4536b4f71cbe76230ea7edc7eb4d6041e651b4
Comment 15 Hin-Tak Leung 2014-12-01 03:07:49 UTC
FWIW, so far my best experience seems to be with 10.2.9, with which I haven't had a lockup yet (I had two(?) lockups with 10.2.8, and one within a few hours of upgrading to 10.3.3). So far I have spent about 6 weeks under 10.2.8, and 3 weeks with 10.2.9.
Comment 16 Hin-Tak Leung 2014-12-05 00:35:26 UTC
I finally had a lock-up with mesa 10.2.9. Looking at the logs, I had a fair number of GPU faults which I did not notice: first briefly, for a few seconds, and then about 4 hours later for an extended period of half an hour. I believe in the 2nd period I was watching a video with mplayer (the first period might have been a trial run of the same video). I suspended the machine to RAM; on waking up, the screen flashed a few times between black and the last desktop view, with some corruption in the desktop view. The mouse pointer still responded to movement, but clicking no longer worked, nor did the keyboard (trying to switch to a vt to shutdown/reboot got no response).

Anyway, I am onto mesa 10.3.4 now, which includes http://cgit.freedesktop.org/mesa/mesa/commit/?id=ae4536b4f71cbe76230ea7edc7eb4d6041e651b4 . I hope this gets fixed properly though, since the change looks like it just band-aided over something.
Comment 17 Jorn Amundsen 2014-12-05 07:47:57 UTC
I rebuilt the Fedora 20 RPMs from the F20 v-10.3.3-1 mesa.spec file, after applying http://cgit.freedesktop.org/mesa/mesa/commit/?id=ae4536b4f71cbe76230ea7edc7eb4d6041e651b4, creating new RPMs v.10.3.3-2. Now I am up and running with the patched Mesa 10.3.3-2 and the 3.17.4-200 kernel.
Comment 18 Hin-Tak Leung 2014-12-10 22:35:29 UTC
(In reply to Alex Deucher from comment #14)
> Please make sure your version of mesa has this patch:
> http://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=ae4536b4f71cbe76230ea7edc7eb4d6041e651b4

This seems no good /insufficient. I just had a lock-up with 10.3.5, which includes it. With kernel 3.17.6-200.fc20.x86_64, if that means anything. Also it looks like I upgraded firefox to v34 (from v33) 5 days ago. I was merely opening a few more tabs on firefox when it happened. Though 20 minutes before then my computer came out of a suspend, and before the suspend, I was using kvm and virtualbox a bit.

Switching VT was still possible so I was able to reboot cleanly.

The failure message seems slightly different, so just in case it means anything,

...
[71241.232157] radeon 0000:00:01.0: ring 0 stalled for more than 10002msec
[71241.232173] radeon 0000:00:01.0: GPU lockup (waiting for 0x000000000052e910 last fence id 0x000000000052e90d on ring 0)
[71241.232337] radeon 0000:00:01.0: failed to get a new IB (-35)
[71241.232347] [drm:radeon_cs_ib_fill] *ERROR* Failed to get ib !
[71241.279772] radeon 0000:00:01.0: Saved 15657 dwords of commands on ring 0.
...
[71252.356774] [drm:cik_ring_test] *ERROR* radeon: ring 1 test failed (scratch(0x3010C)=0xCAFEDEAD)
[71252.718837] [drm:cik_ring_test] *ERROR* radeon: ring 2 test failed (scratch(0x3010C)=0xCAFEDEAD)
[71252.836977] [drm:cik_sdma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)
[71252.836992] [drm:cik_resume] *ERROR* cik startup failed on resume
[71252.837260] [drm] ib test on ring 0 succeeded in 0 usecs
[71252.837790] [drm] ib test on ring 6 succeeded
[71252.838167] [drm] ib test on ring 7 succeeded
[71254.210168] [drm:radeon_dp_link_train_cr] *ERROR* displayport link status failed
[71254.210182] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery failed
[71257.654395] radeon 0000:00:01.0: still active bo inside vm
[71257.765448] radeon 0000:00:01.0: still active bo inside vm
[71258.526881] radeon 0000:00:01.0: still active bo inside vm
[71265.473102] radeon 0000:00:01.0: couldn't schedule ib
...

I can supply the dmesg if needed.

Seeing as the patch does not work/insufficient, and my best experience so far is 10.2.9 (lasted 3 weeks, without the patch), my worst experience is 10.3.3 (less than a day), and 10.3.4/10.3.5 (patch included) lasted a week, I am going back to 10.2.9, and adding the patch to it. If the patch improves 10.2.9 the way it did from 10.3.3 -> 10.3.4/10.3.5, i.e. makes 10.2.9 last a few months, I'd be happy enough.
Comment 19 Jorn Amundsen 2014-12-12 09:54:28 UTC
(In reply to Hin-Tak Leung from comment #18)
> (In reply to Alex Deucher from comment #14)
> > Please make sure your version of mesa has this patch:
> > http://cgit.freedesktop.org/mesa/mesa/commit/
> > ?id=ae4536b4f71cbe76230ea7edc7eb4d6041e651b4
> 
> This seems no good /insufficient. I just had a lock-up with 10.3.5, which
...
> Seeing as the patch does not work/insufficient, and my best experience so
> far is 10.2.9 (lasted 3 weeks, without the patch), my worst experience is
> 10.3.3 (less than a day), and 10.3.4/10.3.5 (patch included) lasted a week,
> I am going back to 10.2.9, and adding the patch to it. If the patch improves
> 10.2.9 the way it did from 10.3.3 -> 10.3.4/10.3.5, i.e. makes 10.2.9 last a
> few months, I'd be happy enough.

Hi Hin-Tak, I would just like to add that I have been running for one week without problems after patching Mesa 10.3.3 with the patch in Comment #14. Without the patch, it hung every 10-15 minutes.

--joern
Comment 20 Hin-Tak Leung 2014-12-16 17:18:15 UTC
mesa 10.4.0 also crashed on me, on the first day (fedora provides it, so I thought I'd let my 10.2.9 upgrade and have a go). First there were some errors in dmesg (while running mplayer; not as many as under 10.3.3); then on suspend/resume, the screen lit up again but did not show anything (i.e. busy on resume, I guess).

So my experience so far is that if I ever see any errors in dmesg, I should reboot as soon as is convenient, instead of trying to suspend/resume to continue, as it will not survive a suspend/resume.

I am going back to 10.2.9 patched, until the next mesa that's not 10.3.5 and 10.4.0...
Comment 21 Michel Dänzer 2014-12-17 02:19:17 UTC
If there really is a significant difference in stability between Mesa 10.2.y and 10.3.y, it would be interesting if you could isolate which change between them made the difference.

However, from your description so far, I'm afraid the difference might just be coincidence, because we don't understand yet what triggers the problem, so it happens 'randomly'.
Comment 22 Hin-Tak Leung 2014-12-17 20:07:45 UTC
Am just following the advice Alex gave in comment 3 (try different versions of mesa) and reporting on my experience.

It would appear that my problem is orthogonal to what the commit "radeonsi: Disable asynchronous DMA except for PIPE_BUFFER" was trying to address. That commit was in 10.3.4/10.3.5 and 10.4.0, and *not* in 10.2.9; but I have had crashes with 10.3.5 after ~5 days of use, with 10.4.0 within a day, and with 10.2.9 only after nearly 4 weeks.

10.2.8: two crashes in 6 weeks.
10.2.9: crash after almost 4 weeks.

10.3.3: crash within first day
10.3.4: insufficient data - used it only for a day or two before 10.3.5
10.3.5: crashed after 5 days
10.4.0: crash within first day

My crash-free time with 10.3.x/10.4.0 is measured in days if not hours, but with 10.2.x in weeks. I'll continue to switch to a newer mesa as it comes out, and if I get burned, go back to the longest crash-free version until another mesa version comes out.

I think there is a bug with xv (so probably either mesa or glamor; it does not seem to be sensitive to which version) because some videos play skewed, e.g. playing a square as:

                   ----------------
                  /              /
                 /              /
                /              /
               /              / 
              /              /
             ----------------

It happens only to certain specific videos (vdpau gl and x11 are fine), so I am not sure whether it is a bug in mplayer's use of xv, glamor's implementation of xv, or what. It seems to happen to videos with "Movie-Aspect is 1.xx:1 - prescaling to correct movie aspect." when played, but not all such videos are played badly.

I am mentioning this, just in case digging further on that video playing problem might help fix the crash...
Comment 23 Michel Dänzer 2014-12-18 06:03:54 UTC
(In reply to Hin-Tak Leung from comment #22)
> My crash-free days with 10.3.x/10.4.0 are measured in days if not hours, but
> with 10.2.x is in weeks. I'll continue to switch to a newer mesa as it comes
> out, and if I get burned, go back to the longest crash-free version until
> another mesa version comes out.

So apparently there was a change between 10.2 and 10.3 which significantly decreased stability on your system. Without isolating that change, it's unlikely that we can reverse the effect unless we get lucky and an independent change happens to help.


> I am mentioning this, just in case digging further on that video playing
> problem might help fix the crash...

That seems unlikely. Please report the Xv problem at https://bugs.freedesktop.org/enter_bug.cgi?product=xorg , component Driver/glamor.
Comment 24 Hin-Tak Leung 2014-12-18 19:40:15 UTC
I had a quick look at 10.2.x vs 10.3.x (specifically, just doing "git log mesa-10.2.8..mesa-10.3.3 | grep '^commit' | wc -l" and vice versa), and it is more like they diverged from their most recent common ancestor by 300 and 3000 commits respectively. Though doing a grep 'cherry-pick' says about 250-300 of those are cherry-picked, so the actual difference might be 50 vs 2700 commits, which is still a big bunch to look at what made 10.3.x unstable. It would be easier if the crashes are more "reproducible".

The corrupted video playback issue was filed as:
https://bugs.freedesktop.org/show_bug.cgi?id=87455
Just in case it is of interest.
Comment 25 Michel Dänzer 2014-12-19 03:23:55 UTC
(In reply to Hin-Tak Leung from comment #24)
> [...] which is still a big bunch to look at what made 10.3.x
> unstable.

That's what git bisect is for. It can isolate a change with the minimum number of tests required (approximately log2 of the number of commits between the known good and bad).


> It would be easier if the crashes are more "reproducible".

Indeed, so if you do try to bisect it, it's important that you test each commit long enough to be sure it's 'more stable' before declaring it as good.
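
To put a number on "approximately log2": the sketch below estimates how many builds a bisect over the commit counts mentioned in comment 24 would need, and how long that could take if each step had to be observed for weeks. The per-step observation time is an assumption (28 days, roughly the longest crash-free run reported for 10.2.9 in this thread), and the function names are made up for illustration.

```python
import math


def bisect_steps(commit_count):
    """Approximate number of builds needed to bisect commit_count commits."""
    if commit_count < 1:
        raise ValueError("need at least one commit between good and bad")
    return math.ceil(math.log2(commit_count))


def bisect_duration_days(commit_count, days_per_test):
    """Upper bound on elapsed days if every step needs the full test window.

    Assumption: each step is observed for days_per_test before it is judged;
    in practice a step that crashes early can be marked bad sooner.
    """
    return bisect_steps(commit_count) * days_per_test


if __name__ == "__main__":
    for commits in (50, 300, 2700, 3000):
        print(commits, "commits ->", bisect_steps(commits), "steps")
    print("2700 commits at 28 days per step:",
          bisect_duration_days(2700, 28), "days at most")
```

So even the larger range needs only about a dozen builds; the expensive part is the observation time per build, not the number of commits.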
Comment 26 Hin-Tak Leung 2015-01-24 01:08:48 UTC
I had 10.2.9 + patch for 24 days before it locked up on the 10th; so 10.2.9 (with or without the extra patch) is still by far the best. I had 10.4.1 for a few days (about 4) before upgrading to 10.4.2; so far I have been on 10.4.2 for 10 days now and it is good enough.

The crash with 10.2.9 + patch was with kernel 3.17.7-300.fc21 . I booted to 3.17.8-300.fc21 after that and spent 8 days in it, and another 6 in 3.18.3-200.fc21; so a newer kernel might be contributing too.

I'll write again if I can go beyond a month without a crash.
Comment 27 Hin-Tak Leung 2015-02-09 16:43:38 UTC
Created attachment 166191
screen corruption just before suspend & GPU crash on resume

See screenshot.

I had been playing a few videos with mplayer -vo xv http://*.mp4 from the BBC web site, on and off; then I did some browsing and noticed a few firefox tabs were corrupted (as shown; only about 3 of them were). I then checked and saw some messages in dmesg:

[14452.823499] radeon 0000:00:01.0: GPU fault detected: 146 0x02050004
[14452.823512] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00015A90
[14452.823516] radeon 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05000004
[14452.823521] VM fault (0x04, vmid 2) at page 88720, write from 'CB0' (0x43423000) (0)

So I suspended anyway; and the GPU crashed on resume. I was able to switch to a vt to reboot safely (not always the case), so if there is anything, say, that I can look at while the GPU is still stuck, please let me know.

Been on 10.4.3 since 29 Jan (just a routine fedora upgrade) and never had a problem with 10.4.2 before that, so I guess 10.4.2/10.4.3 is at least as good as 10.2.9 + patch.

kernel is 3.18.6-200.fc21.x86_64.

Oh! I rebooted from 3.18.5-200.fc21 and libdrm-2.4.58-3.fc21.* -> 2.4.59-4.fc21.*
So if the screen corruption or the lock-up is a regression from 25 days of goodness under 10.4.2/10.4.3 (and perhaps even a month if you include 10.4.1), then either the kernel or libdrm might be it.
Comment 28 Hin-Tak Leung 2015-02-09 16:48:48 UTC
So a quick summary is that later 10.4.x (excluding 10.4.0, which locked up within the first day) is about as stable as 10.2.9 + patch. None of the 10.3.x versions I tried was any good.

I'll just continue with 10.4.x and upgrade as they are available, and hope not to report too often about further lock ups.
Comment 29 Hin-Tak Leung 2015-02-14 15:29:44 UTC
5 days later, another crash. This time it was a hard lock-up: the X server crashed and the screen was back-lit but blank; there were no warnings (nothing interesting in dmesg since boot), and I couldn't switch vt either. I was not doing anything interesting, just reading web mail in firefox, and it just suddenly went blank.

I haven't made any interesting change since the last report; but since I booted with libdrm-2.4.58-3.fc21.* -> 2.4.59-4.fc21.* on the 7th, had a crash on the 9th (2 days later) and another 5 days after that, having been lock-up-free for almost a whole month before, I should have a look at what changed in libdrm-2.4.58-3.fc21.* -> 2.4.59-4.fc21.* .

Will write again if libdrm turns out to be interesting.
Comment 30 Hin-Tak Leung 2015-02-24 23:48:51 UTC
Had another hard lock-up under 10.4.3/10.4.4, just using mplayer and not running firefox. It was right before a scheduled reboot to upgrade to 10.4.4, so the system had 10.4.4 installed but 10.4.3 was probably still cached; that's probably not a good idea.

So that's 10 days since last crash.
Comment 31 Hin-Tak Leung 2015-03-17 05:16:45 UTC
According to the log, I apparently had a GPU crash on 1st March on suspend, which I did not notice.

With kernel 3.18.9-200.fc21.x86_64, besides the GPU lock-up, the kernel also oops'ed. Maybe it is a good thing? :

Mar 17 00:35:43 localhost kernel: [12172.039701] WARNING: CPU: 1 PID: 1938 at drivers/gpu/drm/radeon/radeon_object.c:84 radeon_ttm_bo_destroy+0xf1/0x100 [radeon]()
...

Mar 17 00:35:43 localhost kernel: [12172.039816] CPU: 1 PID: 1938 Comm: gnome-shell Not tainted 3.18.9-200.fc21.x86_64 #1
Mar 17 00:35:43 localhost kernel: [12172.039820] Hardware name: TOSHIBA SATELLITE C50D-B/ZBWAE, BIOS 1.30 06/06/2014
Mar 17 00:35:43 localhost kernel: [12172.039824]  0000000000000000 00000000a817c8ac ffff880233173b68 ffffffff8175b71c
Mar 17 00:35:43 localhost kernel: [12172.039830]  0000000000000000 0000000000000000 ffff880233173ba8 ffffffff81098eb1
Mar 17 00:35:43 localhost kernel: [12172.039835]  0000000000000002 ffff880170180058 ffff880170180000 ffffffffffffffff
Mar 17 00:35:43 localhost kernel: [12172.039841] Call Trace:
Mar 17 00:35:43 localhost kernel: [12172.039853]  [<ffffffff8175b71c>] dump_stack+0x46/0x58
Mar 17 00:35:43 localhost kernel: [12172.039862]  [<ffffffff81098eb1>] warn_slowpath_common+0x81/0xa0
Mar 17 00:35:43 localhost kernel: [12172.039868]  [<ffffffff81098fca>] warn_slowpath_null+0x1a/0x20
Mar 17 00:35:43 localhost kernel: [12172.039901]  [<ffffffffa011c091>] radeon_ttm_bo_destroy+0xf1/0x100 [radeon]
Mar 17 00:35:43 localhost kernel: [12172.039919]  [<ffffffffa00a41b6>] ttm_bo_release_list+0xa6/0x1a0 [ttm]
Mar 17 00:35:43 localhost kernel: [12172.039933]  [<ffffffffa00a4575>] ttm_bo_release+0x105/0x250 [ttm]
Mar 17 00:35:43 localhost kernel: [12172.039948]  [<ffffffffa00a46e9>] ttm_bo_unref+0x29/0x30 [ttm]
Mar 17 00:35:43 localhost kernel: [12172.039980]  [<ffffffffa011c559>] radeon_bo_unref+0x39/0x70 [radeon]
Mar 17 00:35:43 localhost kernel: [12172.040017]  [<ffffffffa013182b>] radeon_gem_object_free+0x4b/0x70 [radeon]
Mar 17 00:35:43 localhost kernel: [12172.040112]  [<ffffffffa00443e7>] drm_gem_object_free+0x27/0x40 [drm]
Mar 17 00:35:43 localhost kernel: [12172.040151]  [<ffffffffa0044970>] drm_gem_object_handle_unreference_unlocked+0x120/0x130 [drm]
Mar 17 00:35:43 localhost kernel: [12172.040182]  [<ffffffffa0044a26>] drm_gem_handle_delete+0xa6/0x100 [drm]
Mar 17 00:35:43 localhost kernel: [12172.040204]  [<ffffffffa00450c5>] drm_gem_close_ioctl+0x25/0x30 [drm]
Mar 17 00:35:43 localhost kernel: [12172.040224]  [<ffffffffa0045a9f>] drm_ioctl+0x1df/0x680 [drm]
Mar 17 00:35:43 localhost kernel: [12172.040236]  [<ffffffff811cc452>] ? unmap_region+0xe2/0x130
Mar 17 00:35:43 localhost kernel: [12172.040264]  [<ffffffffa00fb04c>] radeon_drm_ioctl+0x4c/0x80 [radeon]
Mar 17 00:35:43 localhost kernel: [12172.040283]  [<ffffffff812282b0>] do_vfs_ioctl+0x2d0/0x4b0
Mar 17 00:35:43 localhost kernel: [12172.040289]  [<ffffffff81228511>] SyS_ioctl+0x81/0xa0
Mar 17 00:35:43 localhost kernel: [12172.040297]  [<ffffffff81139b86>] ? __audit_syscall_exit+0x1f6/0x2a0
Mar 17 00:35:43 localhost kernel: [12172.040310]  [<ffffffff81762129>] system_call_fastpath+0x12/0x17
Mar 17 00:35:43 localhost kernel: [12172.040314] ---[ end trace 844a94b2a6ea5f19 ]---
Comment 32 Hin-Tak Leung 2015-03-17 05:21:49 UTC
Created attachment 170901
the part of /var/log/messages about GPU lock up and oops in 3.18.9-200.fc21.x86_64

gz'ed.
Comment 33 abandoned account 2015-03-23 14:50:17 UTC
Created attachment 171791
output of: sudo journalctl -b -1 --all --no-pager

Hi. I am getting something similar and I think it may be reproducible. Since I upgraded to kernel 4 and also upgraded to:
libdrm-git (2.4.60.17.g8dff7a0-1)
xf86-video-ati-git (7.5.0.r34.g6291baa-1)
mesa-dri-git 10.6.0_devel.68990-1  (i don't know what commit, but it's from 1-2 days ago, currently recompiling with latest commit to retry)
mesa-git 10.6.0_devel.68990-1
mesa-libgl-git 10.6.0_devel.68990-1
mesa-vaapi-git 10.6.0_devel.68990-1
mesa-vdpau-git 10.6.0_devel.68990-1
opencl-mesa-git 10.6.0_devel.68990-1


I got one random screen freeze (no blanking though), and the system was locked up, while viewing a youtube video in chromium and using the volume buttons (nothing in the logs, probably because the system froze).

And this one, which is probably reproducible, happened the next day: I plugged in my webcam, and as soon as vlc attempted to display it (even though it looked like a black picture) the screen froze, then went black (I can't remember if the backlight was on or not). I then unplugged the webcam and shut down (Ctrl+Alt+F2 to switch to a non-gfx virtual terminal console, then pressed the power button once), and so could see the log on the next boot (journalctl -b -1):

Mar 23 15:06:15 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 10483msec
Mar 23 15:06:15 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x0000000000012c15 last fence id 0x0000000000012cdc on ring 0)

(full log included in attachment)
the above messages are the beginning of when it got blank

I don't know if this is a new issue because I haven't tried my webcam for at least 1-2 months before this. But I've only recently (1-2 days ago) upgraded to kernel 4, from 3.19.

Apparently, after updating mesa (to try next), the new version is 10.6.0_devel.67962-1, but the old one (with the above error) was 10.6.0_devel.68990-1.
using manjaro linux.
brb
Comment 34 abandoned account 2015-03-23 23:08:01 UTC
There have been no further lockups for me so far. I still got the v4l2_release stack dump when closing vlc (or was it when I unplugged the webcam? I forget), but that seems to be a different issue, pasted here: https://bugzilla.kernel.org/show_bug.cgi?id=81581#c2
Comment 35 abandoned account 2015-03-24 14:29:53 UTC
I retried with the sole purpose of reproducing:
I plugged in the (FSC) webcam while vlc was running, opened Media->Capture Device, selected video device name /dev/video0 (I think), and under Advanced Options set Width: 800, Height: 600 (this worked last time, but probably at 640x480). I hit OK, then Play. At that point a 640x480 black vlc window appeared and everything froze; the mouse was still moving (I think). After a few seconds the screen went black (I forget whether the backlight stayed on). I then pressed Ctrl+Alt+Del, which is how I saved the log. I didn't unplug the USB webcam this time, so there are no v4l2_release stack traces.

Mar 24 15:15:21 manji kernel: ehci-pci 0000:00:12.2: restoring config space at offset 0x4 (was 0x2b00000, writing 0x2b00012)
Mar 24 15:15:21 manji kernel: ehci-pci 0000:00:12.2: PME# disabled
Mar 24 15:15:21 manji kernel: ehci-pci 0000:00:12.2: enabling bus mastering
Mar 24 15:15:21 manji kernel: device: 'ep_81': device_add
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 10163msec
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea19 last fence id 0x000000000002ea20 on ring 0)
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: Saved 226 dwords of commands on ring 0.
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: GPU softreset: 0x0000000D
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS               = 0xF5702828
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0xFC000005
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000840
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   SRBM_STATUS2              = 0x00000000
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x400C0000
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00048002
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x80268647
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_00D034_DMA_STATUS_REG   = 0x44483106
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: GRBM_SOFT_RESET=0x00007F6B
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: SRBM_SOFT_RESET=0x00100100
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS               = 0x00003828
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0x00000007
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000040
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   SRBM_STATUS2              = 0x00000000
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x00000000
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: GPU reset succeeded, trying to resume
Mar 24 15:15:32 manji kernel: [drm] Found smc ucode version: 0x00011100
Mar 24 15:15:32 manji kernel: [drm] PCIE GART of 1024M enabled (table at 0x0000000000274000).
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: WB enabled
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff8804099a5c00
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: fence driver on ring 3 use gpu addr 0x0000000020000c0c and cpu addr 0xffff8804099a5c0c
Mar 24 15:15:32 manji kernel: radeon 0000:00:01.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc90005d32118
Mar 24 15:15:32 manji kernel: [drm] ring test on 0 succeeded in 1 usecs
Mar 24 15:15:32 manji kernel: [drm] ring test on 3 succeeded in 3 usecs
Mar 24 15:15:32 manji kernel: [drm] ring test on 5 succeeded in 1 usecs
Mar 24 15:15:33 manji kernel: [drm] UVD initialized successfully.
Mar 24 15:15:33 manji kernel: [drm:radeon_dp_link_train] *ERROR* displayport link status failed
Mar 24 15:15:33 manji kernel: [drm:radeon_dp_link_train] *ERROR* clock recovery failed
Mar 24 15:15:33 manji kernel: [drm] ib test on ring 0 succeeded in 0 usecs
Mar 24 15:15:33 manji kernel: [drm] ib test on ring 3 succeeded in 0 usecs
Mar 24 15:15:33 manji kernel: i2c i2c-8: master_xfer[0] W, addr=0x50, len=1
Mar 24 15:15:33 manji kernel: i2c i2c-8: master_xfer[1] R, addr=0x50, len=8
Mar 24 15:15:33 manji kernel: [drm] ib test on ring 5 succeeded
Mar 24 15:15:35 manji kernel: r8169 0000:01:00.0 net0: link down
Mar 24 15:15:41 manji kernel: r8169 0000:01:00.0: PME# enabled
Mar 24 15:15:44 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 10490msec
Mar 24 15:15:44 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:44 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 10990msec
Mar 24 15:15:44 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:45 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 11490msec
Mar 24 15:15:45 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:45 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 11990msec
Mar 24 15:15:45 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:46 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 12490msec
Mar 24 15:15:46 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:46 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 12990msec
Mar 24 15:15:46 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:47 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 13490msec
Mar 24 15:15:47 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:47 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 13990msec
Mar 24 15:15:47 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:48 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 14490msec
Mar 24 15:15:48 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:48 manji kernel: radeon 0000:00:01.0: ring 0 stalled for more than 14990msec
Mar 24 15:15:48 manji kernel: radeon 0000:00:01.0: GPU lockup (current fence id 0x000000000002ea3a last fence id 0x000000000002ea76 on ring 0)
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: Saved 1906 dwords of commands on ring 0.
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: GPU softreset: 0x00000009
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS               = 0xF5702828
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0xFC000005
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000840
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   SRBM_STATUS2              = 0x00000000
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x400C0000
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00048002
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x80268647
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: GRBM_SOFT_RESET=0x00007F6B
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: SRBM_SOFT_RESET=0x00000100
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS               = 0x00003828
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS_SE0           = 0x00000007
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   GRBM_STATUS_SE1           = 0x00000007
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   SRBM_STATUS               = 0x20000040
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   SRBM_STATUS2              = 0x00000000
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x00000000
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: GPU reset succeeded, trying to resume
Mar 24 15:15:49 manji kernel: [drm] Found smc ucode version: 0x00011100
Mar 24 15:15:49 manji kernel: [drm] PCIE GART of 1024M enabled (table at 0x0000000000274000).
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: WB enabled
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: fence driver on ring 0 use gpu addr 0x0000000020000c00 and cpu addr 0xffff8804099a5c00
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: fence driver on ring 3 use gpu addr 0x0000000020000c0c and cpu addr 0xffff8804099a5c0c
Mar 24 15:15:49 manji kernel: radeon 0000:00:01.0: fence driver on ring 5 use gpu addr 0x0000000000072118 and cpu addr 0xffffc90005d32118
Mar 24 15:15:49 manji kernel: [drm] ring test on 0 succeeded in 1 usecs
Mar 24 15:15:49 manji kernel: [drm] ring test on 3 succeeded in 3 usecs
Mar 24 15:15:49 manji kernel: [drm] ring test on 5 succeeded in 1 usecs
Mar 24 15:15:49 manji kernel: [drm] UVD initialized successfully.
Mar 24 15:15:49 manji kernel: [drm:radeon_dp_link_train] *ERROR* displayport link status failed
Mar 24 15:15:49 manji kernel: [drm:radeon_dp_link_train] *ERROR* clock recovery failed
Mar 24 15:15:49 manji kernel: [drm] ib test on ring 0 succeeded in 0 usecs
Mar 24 15:15:49 manji kernel: [drm] ib test on ring 3 succeeded in 0 usecs
Mar 24 15:15:49 manji kernel: i2c i2c-8: master_xfer[0] W, addr=0x50, len=1
Mar 24 15:15:49 manji kernel: i2c i2c-8: master_xfer[1] R, addr=0x50, len=8
Mar 24 15:15:49 manji kernel: [drm] ib test on ring 5 succeeded
Mar 24 15:15:50 manji systemd[1]: Starting Getty on tty2...
Mar 24 15:15:50 manji systemd[1]: Started Getty on tty2.
Mar 24 15:15:50 manji acpid[2529]: client 3299[1000:100] has disconnected
Mar 24 15:15:50 manji systemd[1]: Received SIGINT.
...
at this point rebooting was in progress
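
To see what the soft reset actually changed, the register dumps printed before and after each reset can be diffed. Below is a minimal Python sketch, assuming the dump lines keep the "NAME = 0xVALUE" layout shown in the log above; `split_dumps` and `changed_registers` are made-up helper names. A dump is considered finished as soon as a register name shows up a second time, which is a simplification that happens to fit this log.

```python
import re

# Matches lines such as
#   "... radeon 0000:00:01.0:   GRBM_STATUS               = 0xF5702828"
# but not "GRBM_SOFT_RESET=0x00007F6B" (no spaces around '=').
REG_RE = re.compile(r"radeon \S+:\s+(\w+)\s+= (0x[0-9A-Fa-f]{8})")


def split_dumps(log_text):
    """Group register lines into consecutive dumps (name -> value).

    A repeated register name is taken as the start of the next dump.
    """
    dumps, current = [], {}
    for name, value in REG_RE.findall(log_text):
        if name in current:
            dumps.append(current)
            current = {}
        current[name] = int(value, 16)
    if current:
        dumps.append(current)
    return dumps


def changed_registers(before, after):
    """Return {name: (old, new)} for registers whose value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


if __name__ == "__main__":
    sample = (
        "kernel: radeon 0000:00:01.0:   GRBM_STATUS               = 0xF5702828\n"
        "kernel: radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x80268647\n"
        "kernel: radeon 0000:00:01.0: GRBM_SOFT_RESET=0x00007F6B\n"
        "kernel: radeon 0000:00:01.0:   GRBM_STATUS               = 0x00003828\n"
        "kernel: radeon 0000:00:01.0:   R_008680_CP_STAT          = 0x00000000\n"
    )
    before, after = split_dumps(sample)[:2]
    for name, (old, new) in changed_registers(before, after).items():
        print(f"{name}: {old:#010x} -> {new:#010x}")
```

In the log above, for example, R_008680_CP_STAT goes from 0x80268647 to 0x00000000 across both resets, yet ring 0 stalls again about 12 seconds after the first one.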

I had also added  radeon.hard_reset=1  to the kernel command line (since my last dmesg), because I saw in another bug that it helped someone, but apparently it does not help me.

Mar 24 14:40:57 manji kernel: Linux version 4.0.0-rc5-gbc465aa (emacs@manji) (gcc version 4.9.2 20150304 (prerelease) (GCC) ) #56 SMP Mon Mar 23 14:50:12 CET 2015
Mar 24 14:40:57 manji kernel: Command line: BOOT_IMAGE=/vmlinuz-linux-git root=UUID=bfa4ab6e-19a3-4601-ba2b-267c55841c73 rw cryptdevice=/dev/disk/by-uuid/70c08890-417a-497d-b6ab-c0d0357a63e2:cryptManjaro:allow-discards ipv6.disable=1 pnp.debug=1 loglevel=9 log_buf_len=10M printk.always_kmsg_dump=y printk.time=y mminit_loglevel=0 memory_corruption_check=1 fbcon=scrollback:4096k fbcon=font:ProFont6x11 apic=debug earlyprintk=vga dynamic_debug.verbose=1 "dyndbg=file arch/x86/kernel/apic/* +pflmt ; file drivers/video/* +pflmt ; file drivers/acpi/* +pflmt" i8042.debug acpi_backlight=vendor radeon.hard_reset=1

I am willing to test patches or try any suggestions; I have time.
Cheers
Comment 36 abandoned account 2015-03-24 15:39:03 UTC
Created attachment 172001 [details]
dmesg

OK, I can always reproduce this (tried three times) just by opening the webcam in vlc twice: the second time fails.

1. Plug in the USB webcam (and never unplug it).
2. Reboot (not needed, but hey).
3. modprobe uvcvideo
4. Run vlc: Media->Open Capture Device, /dev/video0, Play.
5. That works OK; now exit vlc.
6. Do steps 4 and 5 again. This time the screen freezes, the mouse keeps moving, and the vlc window shows a black picture instead of the actual webcam image.
7. After about 10 seconds the screen blanks, with the backlight off.
8. A few seconds later I can press Ctrl+Alt+F2 to switch to virtual terminal 2 (non-graphical) and then Ctrl+Alt+Del to reboot, which saves the log (included in the attachment).

Steps 1 and 3 can be done in either order (ignoring step 2, that is).
Comment 37 abandoned account 2015-03-24 16:19:52 UTC
Created attachment 172011 [details]
instant blanking without recovery when radeon.lockup_timeout=20

I accidentally added the kernel parameter radeon.lockup_timeout=20, which made the screen blank in Firefox without locking up or freezing first. It remained blank, but I was of course still able to Ctrl+Alt+Del.
This tells me that the (soft?) GPU reset isn't working. Is there a way to make it work?
If the reset doesn't work normally, that explains why the GPU can't come back to life after a real lockup unless a warm boot happens. If the reset worked, I assume it would be able to recover from the issue described in this thread.

The log for this case is attached (search for: 20msec ).
I guess the problem is these lines:
Mar 24 17:00:42 manji kernel: [drm:radeon_dp_link_train] *ERROR* displayport link status failed
Mar 24 17:00:42 manji kernel: [drm:radeon_dp_link_train] *ERROR* clock recovery failed

By the way, the parameter I actually wanted was radeon.lockup_timeout=20000, for 20 sec.
By now I have also added some extra kernel parameters:
nohz=on rcu_nocbs=1-3 pcie_aspm=force radeon.audio=0 radeon.lockup_timeout=20000 radeon.test=0 radeon.agpmode=-1 radeon.benchmark=0 radeon.tv=0 radeon.hard_reset=1 radeon.aspm=1 radeon.msi=1 radeon.pcie_gen2=-1 radeon.no_wb=1 radeon.dynclks=1 radeon.r4xx_atom=0 radeonfb radeon.fastfb=1 radeon.modeset=1 radeon.dpm=1 radeon.runpm=1
These still allow me to reproduce the issue described in the previous comment.
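
Since radeon.lockup_timeout is given in milliseconds (20 vs. 20000 above), a small check of the kernel command line can catch this kind of slip. This is only a sketch: `radeon_params` and `check_lockup_timeout` are made-up helpers, the 1000 ms threshold is an arbitrary assumption, and a value of 0 is left alone on the assumption that it means "disabled" rather than "zero milliseconds".

```python
import shlex


def radeon_params(cmdline):
    """Collect radeon.<name>=<value> options from a kernel command line."""
    params = {}
    # shlex keeps quoted options such as "dyndbg=file ... +pflmt" in one token.
    for token in shlex.split(cmdline):
        if token.startswith("radeon.") and "=" in token:
            key, value = token[len("radeon."):].split("=", 1)
            params[key] = value
    return params


def check_lockup_timeout(cmdline, min_ms=1000):
    """Return a warning string if lockup_timeout looks like seconds, else None.

    min_ms is an arbitrary threshold chosen for this sketch.
    """
    value = radeon_params(cmdline).get("lockup_timeout")
    if value is None:
        return None
    timeout_ms = int(value)
    if 0 < timeout_ms < min_ms:
        return (
            f"radeon.lockup_timeout={timeout_ms} is only {timeout_ms} ms; "
            f"did you mean {timeout_ms * 1000}?"
        )
    return None


if __name__ == "__main__":
    # On a running system the command line can be read from /proc/cmdline.
    cmdline = "root=/dev/sda1 rw radeon.lockup_timeout=20 radeon.hard_reset=1"
    print(radeon_params(cmdline))
    print(check_lockup_timeout(cmdline))
```

Reading /proc/cmdline and passing its contents to check_lockup_timeout would have flagged both the 20 and the 4 that I typed by mistake.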
Comment 38 Michel Dänzer 2015-03-25 03:50:46 UTC
Emanuel, please file your own report for the webcam related hang.
Comment 39 abandoned account 2015-03-25 09:41:47 UTC
Created attachment 172311 [details]
same error as OP (dmesg)

Sorry, I thought it was the same bug.

I actually did get the same error as the OP once (*ERROR* radeon: failed testing IB on GFX ring (-35)) with radeon.lockup_timeout=4  (which was intended to be 4000).

My apologies though.
Comment 40 Hin-Tak Leung 2015-04-26 02:11:53 UTC
I have been using mesa-10.4.7 since 02 Apr 2015 (and haven't had a crash since March 17; most of that time was on mesa 10.4.6, I think). So 10.4.7 itself is certainly as good as the end of the 10.2.x series.

FYI, the video playback issues I mentioned in comment 22 were filed and fixed:

https://bugs.freedesktop.org/show_bug.cgi?id=87455
corrupted xv video playback

https://bugzilla.redhat.com/show_bug.cgi?id=1213021
gnome-shell and mutter mis-use PAspect for XSetWMNormalHints()

So at the moment, my X is working as well as I hoped it would. If I don't get a GPU lockup in another month (when I upgrade to fc22, which may break something), I'd probably say the problem has somehow disappeared.

Michel: I see you were the one who fixed the X glamor bug, so thanks!

Emanuel: I really don't think that filing bugs while using so many experimental versions (kernel, mesa, etc.) is constructive; development code is what it is. Can you at least try to find out which of the experimental components causes your problem, since you can crash "reliably"?
Comment 41 Paul Dufresne 2021-09-24 18:28:17 UTC
I discovered on Ubuntu that from 17.04 onward (since it uses glamor rather than EXA), I get a similar bug on resume on an RS780C (Radeon 3100 card [2008]).
The problem went away when using EXA acceleration, on Mint 20.2 at least.

Details in: https://bugs.launchpad.net/bugs/1944991
