Bug 198669

Summary: Driver crash at radeon_ring_backup+0xd3/0x140 [radeon]
Product: Drivers Reporter: roger (roger)
Component: Video (DRI - non Intel)    Assignee: drivers_video-dri
Status: RESOLVED OBSOLETE    
Severity: high CC: airlied, christian.koenig, kai.heng.feng
Priority: P1    
Hardware: All   
OS: Linux   
URL: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1746232
Kernel Version: 4.13.0-32-generic x86_64 Subsystem:
Regression: No Bisected commit-id:
Attachments: Possible fix

Description roger@beardandsandals.co.uk 2018-02-04 17:39:49 UTC
This is a resilience bug in the driver. When trying to recover from a GPU stall, radeon_ring_backup causes a paging violation.

[  488.507091] BUG: unable to handle kernel paging request at ffffb406c1891ffc
[  488.507176] IP: radeon_ring_backup+0xd3/0x140 [radeon]

The GPU stall is caused by a hardware problem triggered by vibration. N.B. This bug is not about the hardware problem; it is about the driver's resilience when trying to recover from it.

It is very similar to bug #62721, reported 4 years ago. However, this is occurring with the driver in the 4.13.0 kernel.

Here is the dmesg output.

[  139.457873] rfkill: input handler disabled
[  468.102340] radeon 0000:02:00.0: ring 0 stalled for more than 10256msec
[  468.102346] radeon 0000:02:00.0: GPU lockup (current fence id 0x0000000000001bdb last fence id 0x0000000000001bdc on ring 0)

... Similar lines removed

[  487.558156] radeon 0000:02:00.0: ring 0 stalled for more than 29712msec
[  487.558161] radeon 0000:02:00.0: GPU lockup (current fence id 0x0000000000001bdb last fence id 0x0000000000001bdc on ring 0)
[  488.070157] radeon 0000:02:00.0: ring 0 stalled for more than 30224msec
[  488.070162] radeon 0000:02:00.0: GPU lockup (current fence id 0x0000000000001bdb last fence id 0x0000000000001bdc on ring 0)
[  488.507091] BUG: unable to handle kernel paging request at ffffb406c1891ffc
[  488.507176] IP: radeon_ring_backup+0xd3/0x140 [radeon]
[  488.507195] PGD 236d37067 
[  488.507196] P4D 236d37067 
[  488.507207] PUD 0 

[  488.507234] Oops: 0000 [#1] SMP PTI
[  488.507248] Modules linked in: rfcomm bnep bonding binfmt_misc btusb btrtl btbcm btintel intel_powerclamp joydev coretemp kvm_intel kvm input_leds bluetooth ecdh_generic arc4 ath9k ath9k_common ath9k_hw ath mac80211 irqbypass snd_seq_midi snd_seq_midi_event intel_cstate snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_rawmidi cfg80211 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep serio_raw snd_pcm snd_seq snd_seq_device snd_timer snd lpc_ich shpchp i7core_edac mac_hid i5500_temp soundcore tpm_infineon asus_atk0110 nfsd auth_rpcgss nfs_acl lockd grace sunrpc parport_pc ppdev lp parport ip_tables x_tables autofs4 amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper hid_generic syscopyarea uas sysfillrect usbhid sysimgblt firewire_ohci fb_sys_fops usb_storage pata_acpi hid
[  488.507500]  psmouse firewire_core r8169 drm crc_itu_t mii
[  488.507523] CPU: 7 PID: 2073 Comm: gnome-shell Tainted: G          I     4.13.0-32-generic #35-Ubuntu
[  488.507554] Hardware name: System manufacturer System Product Name/P6T SE, BIOS 0403    05/19/2009
[  488.507584] task: ffff9e0cb6191600 task.stack: ffffb402c3724000
[  488.507619] RIP: 0010:radeon_ring_backup+0xd3/0x140 [radeon]
[  488.507639] RSP: 0018:ffffb402c3727c00 EFLAGS: 00010246
[  488.507658] RAX: ffff9e0c6f300000 RBX: 0000000000037ba1 RCX: 0000000000000000
[  488.507682] RDX: 0000000000000000 RSI: ffffb406c1891ffc RDI: 00000000000dee84
[  488.507707] RBP: ffffb402c3727c28 R08: 00000000000269a8 R09: 00000000000b2c44
[  488.507731] R10: ffffdc5046bd0000 R11: ffff9e0cfffd1d00 R12: ffffb402c3727c68
[  488.507756] R13: ffff9e0ceafc9538 R14: ffff9e0ceafc9558 R15: 00000000ffffffff
[  488.507780] FS:  00007fdc82a60ac0(0000) GS:ffff9e0cf73c0000(0000) knlGS:0000000000000000
[  488.507808] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  488.507828] CR2: ffffb406c1891ffc CR3: 00000001f632a000 CR4: 00000000000006e0
[  488.507853] Call Trace:
[  488.507875]  radeon_gpu_reset+0xc0/0x330 [radeon]
[  488.507895]  ? dma_fence_wait_timeout+0x38/0xf0
[  488.507912]  ? reservation_object_wait_timeout_rcu+0x14f/0x2d0
[  488.507946]  radeon_gem_handle_lockup.part.4+0xe/0x20 [radeon]
[  488.507979]  radeon_gem_wait_idle_ioctl+0x9c/0x100 [radeon]
[  488.508012]  ? radeon_gem_busy_ioctl+0x80/0x80 [radeon]
[  488.508040]  drm_ioctl_kernel+0x5d/0xb0 [drm]
[  488.508063]  drm_ioctl+0x31b/0x3d0 [drm]
[  488.508091]  ? radeon_gem_busy_ioctl+0x80/0x80 [radeon]
[  488.508111]  ? futex_wake+0x8f/0x180
[  488.508134]  radeon_drm_ioctl+0x4f/0x90 [radeon]
[  488.508153]  do_vfs_ioctl+0xa5/0x610
[  488.509723]  ? entry_SYSCALL_64_after_hwframe+0x118/0x168
[  488.511292]  ? entry_SYSCALL_64_after_hwframe+0x111/0x168
[  488.512853]  ? entry_SYSCALL_64_after_hwframe+0x10a/0x168
[  488.514405]  ? entry_SYSCALL_64_after_hwframe+0x103/0x168
[  488.515956]  ? entry_SYSCALL_64_after_hwframe+0xfc/0x168
[  488.517499]  ? entry_SYSCALL_64_after_hwframe+0xf5/0x168
[  488.519040]  ? entry_SYSCALL_64_after_hwframe+0xee/0x168
[  488.520572]  ? entry_SYSCALL_64_after_hwframe+0xe7/0x168
[  488.522095]  ? entry_SYSCALL_64_after_hwframe+0xe0/0x168
[  488.523598]  SyS_ioctl+0x79/0x90
[  488.525088]  ? entry_SYSCALL_64_after_hwframe+0xa1/0x168
[  488.526579]  entry_SYSCALL_64_fastpath+0x33/0xa3
[  488.528065] RIP: 0033:0x7fdc7fb65ef7
[  488.529553] RSP: 002b:00007ffd23064578 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  488.531081] RAX: ffffffffffffffda RBX: 00007ffd230645c0 RCX: 00007fdc7fb65ef7
[  488.532581] RDX: 00007ffd230645c0 RSI: 0000000040086464 RDI: 000000000000000c
[  488.534078] RBP: 00007ffd230645c0 R08: 0000000000000000 R09: 0000000800000000
[  488.535556] R10: 00007ffd230645d0 R11: 0000000000000246 R12: 0000000040086464
[  488.537031] R13: 000000000000000c R14: 00007ffd230646e8 R15: 000055d8a7b62200
[  488.538511] Code: 48 85 c0 49 89 04 24 74 62 8d 53 ff 48 8d 3c 95 04 00 00 00 31 d2 eb 04 49 8b 04 24 49 8b 76 08 41 8d 4f 01 45 89 ff 4a 8d 34 be <8b> 36 89 34 10 41 23 4e 54 48 83 c2 04 48 39 d7 41 89 cf 75 d8 
[  488.541900] RIP: radeon_ring_backup+0xd3/0x140 [radeon] RSP: ffffb402c3727c00
[  488.543656] CR2: ffffb406c1891ffc
[  488.552481] ---[ end trace e6e07e03d7738a24 ]---

Various versions of this crash seem to have been reported over the last few years, but none were successfully closed.

For further diagnostics see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1746232

Roger
Comment 1 Christian König 2018-02-04 18:26:32 UTC
What does radeon_ring_backup+0xd3 resolve to on your system?
Comment 2 roger@beardandsandals.co.uk 2018-02-04 20:55:15 UTC
Looking at the debug files.

radeon_ring_backup resolves to 0x33430, so +0xd3 is 0x33503.

The line info gives this:

radeon_ring.c                                323             0x334f4
radeon_ring.c                                324             0x33508
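
For reference, 0x33503 falls between the two entries above, i.e. on radeon_ring.c line 323. In the 4.13-era source that pair of lines is the copy loop at the end of radeon_ring_backup(), which also matches the faulting "mov (%rsi),%esi" in the Code: dump of the oops. Paraphrased from memory here, so worth verifying against the exact tree:

/* Tail of radeon_ring_backup() in drivers/gpu/drm/radeon/radeon_ring.c,
 * paraphrased from the 4.13-era source (line numbers approximate).
 * 'ptr' was read back from the hardware just above this loop; if that
 * read returns garbage, the first ring->ring[ptr] dereference runs off
 * the end of the ring mapping and faults. */
for (i = 0; i < size; ++i) {
	(*data)[i] = ring->ring[ptr++];   /* line ~323: the faulting read */
	ptr &= ring->ptr_mask;            /* line ~324: mask applied only after use */
}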
Comment 3 Christian König 2018-02-05 12:16:07 UTC
Created attachment 274001 [details]
Possible fix

The attached patch is a shot in the dark, but please give it a try.
Comment 4 roger@beardandsandals.co.uk 2018-02-05 22:03:56 UTC
Well, it moved the problem. It crashed somewhere else in the driver with some message about scratch. Sorry, I cannot tell you what it was because I screwed up the save of the kernel message buffer, and now I cannot get the thing to glitch again. My normal method of stamping on the floor next to the system is not working. I think I might have overdone it and now the thing has bedded in. Going to leave it powered off overnight and try again in the morning.
Comment 5 roger@beardandsandals.co.uk 2018-02-06 14:05:50 UTC
My best guess is that the error came from:

r600.c:2848:            DRM_ERROR("radeon: ring %d test failed (scratch(0x%04X)=0x%08X)\n",


I cannot reproduce the mechanical hardware failure. I don't want to clobber the system any harder and risk damaging a disk.

I assume this is being called from the GPU reset path.
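
For context, that DRM_ERROR is printed by the r600 ring test, which is indeed run from the reset path: it asks the CP to write a magic value into a borrowed scratch register and then polls the register until the value shows up. A rough sketch of its shape, paraphrased from memory of drivers/gpu/drm/radeon/r600.c (register and packet details are approximate, not authoritative):

/* Sketch of r600_ring_test(), from memory; details approximate. */
uint32_t scratch, tmp = 0;
unsigned i;
int r;

r = radeon_scratch_get(rdev, &scratch);       /* borrow a scratch register */
WREG32(scratch, 0xCAFEDEAD);                  /* "not written yet" marker  */
radeon_ring_lock(rdev, ring, 3);
radeon_ring_write(ring, PACKET3(PACKET3_SET_CONFIG_REG, 1));
radeon_ring_write(ring, ((scratch - PACKET3_SET_CONFIG_REG_OFFSET) >> 2));
radeon_ring_write(ring, 0xDEADBEEF);          /* the CP should write this  */
radeon_ring_unlock_commit(rdev, ring, false);
for (i = 0; i < rdev->usec_timeout; i++) {    /* poll for the CP's write   */
	tmp = RREG32(scratch);
	if (tmp == 0xDEADBEEF)
		break;
	udelay(1);
}
if (i >= rdev->usec_timeout)
	/* a wedged CP (or a bus that answers every read with 0xFFFFFFFF)
	 * can never satisfy the poll, so the reset path ends up here */
	DRM_ERROR("radeon: ring %d test failed (scratch(0x%04X)=0x%08X)\n",
		  ring->idx, scratch, tmp);
radeon_scratch_free(rdev, scratch);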
Comment 6 Christian König 2018-02-06 14:12:02 UTC
Well the issue is triggered by the driver reading nonsense values from the hardware.

For example, we ask the hardware what the last good position on a 16k ring buffer is and get 0xffffffff as the result (or something like that), which obviously can't be correct.

My patch mitigated that by clamping the value to a valid range, but if you read nonsense values from the hardware because it has a loose connection and acts strangely under vibration, then I basically can't guarantee anything.
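
The attachment itself is not reproduced in this thread, but a minimal sketch of that kind of clamping (my reading of the description above, not the actual contents of attachment 274001) would be masking the read-back pointer into the valid ring range before it is first used as an index:

/* Hypothetical sketch of the clamping described above; not the actual
 * patch. In radeon_ring_backup(), mask the pointer read back from the
 * hardware *before* the copy loop, instead of only after the first
 * increment, so a bogus 0xffffffff stays inside the ring mapping. */
if (ring->rptr_save_reg)
	ptr = RREG32(ring->rptr_save_reg);
else
	ptr = le32_to_cpu(*ring->next_rptr_cpu_addr);
ptr &= ring->ptr_mask;                    /* clamp to the valid range */

for (i = 0; i < size; ++i) {
	(*data)[i] = ring->ring[ptr++];
	ptr &= ring->ptr_mask;
}

Of course, this only keeps the backup from faulting; the backed-up data is still whatever garbage the failing hardware handed us.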
Comment 7 roger@beardandsandals.co.uk 2018-02-06 15:19:20 UTC
The original point I made in the bug report was that this bug is not about the mechanical hardware glitch. It is about the driver being in what is obviously a failure mode and attempting a recovery that itself fails and leaves the system in an unusable state. The error recovery paths of any driver should be its most resilient components, especially when the driver controls part of the system's primary user interface.

To pose another question: why, when the driver has the information to tell it that the GPU is irrevocably stalled, does it attempt a soft restart and leave the system in an unusable state?
Comment 8 Christian König 2018-02-06 15:53:14 UTC
(In reply to roger@beardandsandals.co.uk from comment #7)
> The original point I made in the bug report was that this bug is not about
> the mechanical hardware glitch. It is about the driver being in what is
> obviously a failure mode and attempting a recovery that itself fails and
> leaves the system in an unusable state.

You are missing the point. The driver fails to recover because the hardware is buggy and not because there is any problem with the recovery routine.

In other words, we read back an impossible value from the hardware, and that is why the system is failing.

I can handle this impossible value at this code location, but as you found out yourself, it then just fails at the next location.

There are simply hundreds or even thousands of locations where the assumption is that the hardware works correctly, and we don't handle the case of getting nonsense values.
Comment 9 roger@beardandsandals.co.uk 2018-02-06 21:39:40 UTC
I think we have to agree to differ on this one. You seem to be focussing on the software interface between the GPU and the driver.

What follows is my personal opinion.

The most likely cause of this kind of mechanical issue is the signal path between the video interface hardware and the outside world, either a dry joint or a mechanical fault in the cable or cable connectors. I can only reiterate what I said in my previous post. The driver has sufficient information to determine that a hard failure has occurred, and that failure is probably not in the GPU itself. I would like to see the driver do a hard reset of the card with rigorous error checking. If it cannot reset the GPU in graphical mode, it should try to set the display hardware into a basic console mode.
Comment 10 Christian König 2018-02-07 08:22:50 UTC
(In reply to roger@beardandsandals.co.uk from comment #9)
> The most likely cause of this kind of mechanical issue is the signal path
> between the video interface hardware and the outside world, either a dry
> joint or a mechanical fault in the cable or cable connectors.

That is what I absolutely agree about.

> The driver has sufficient
> information to determine that a hard failure has occurred, and that failure
> is probably not in the GPU itself. I would like to see the driver do a
> hard reset of the card with rigorous error checking. If it cannot reset the
> GPU in graphical mode, it should try to set the display hardware into a basic
> console mode.

And that is the part you don't seem to understand. The driver is trying exactly what you are describing.

We detect a problem because of a timeout, i.e. the hardware doesn't respond within a given time frame to the commands we send it.

What we then do is query the hardware for how far it has proceeded in execution, and the hardware answers with a nonsense value. In other words, bits are set in the response that should never be set.

This is a clear indicator that the PCIe transaction for the register read aborted because the device doesn't respond any more.

The most likely cause of that is that the bus interface in the ASIC locked up because of an electrical problem (I think the ESD protection kicked in) and the only way to get out of that is a hard reset of the system.

What we can try to do is prevent further failures like the crash you described by checking the values read back from the hardware. That way you can at least access the box over the network or blindly shut it down with keyboard shortcuts.
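
A minimal sketch of what that value checking could look like (a hypothetical helper, not the driver's actual API): an aborted PCIe read is returned to the CPU as all ones, so reads can be screened for that pattern before the result is trusted.

/* Hypothetical sketch, not actual radeon code: screen a register read
 * for the all-ones pattern a dead PCIe device returns on an aborted
 * transaction. Caveat: 0xffffffff can occasionally be a legitimate
 * register value, so a real implementation would confirm against a
 * second register with a known-good value before declaring the GPU dead. */
static int radeon_checked_rreg32(struct radeon_device *rdev, uint32_t reg,
				 uint32_t *val)
{
	*val = RREG32(reg);
	if (*val == 0xffffffff) {
		dev_err(rdev->dev,
			"reg 0x%04x reads back all ones; GPU appears to have fallen off the bus\n",
			reg);
		return -ENODEV;   /* let the reset path bail out cleanly */
	}
	return 0;
}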
Comment 11 roger@beardandsandals.co.uk 2018-02-07 09:12:08 UTC
Yes, I take your point. I was speculating on insufficient information. My apologies.

The solution you propose is essentially what I have already been doing. Logging in over the network already works with the unpatched driver. I have not had any luck with keyboard shortcuts; it looks like xwayland/xserver does not know that a problem has occurred and still has hold of the keyboard and mouse. This is an obscure problem and probably not worth spending much time on, especially as I no longer seem to be able to reproduce it!


Thank you for your patience.
Comment 12 roger@beardandsandals.co.uk 2018-02-07 09:16:42 UTC
On 7 February 2018 08:23:06 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=198669
>
> --- Comment #10 from Christian König (christian.koenig@amd.com) ---
> [...]


Yes, I take your point. I was speculating on insufficient information. My apologies. The solution you propose sounds great.

Thank you for your patience.
Comment 13 roger@beardandsandals.co.uk 2018-02-07 12:45:33 UTC
You can ignore comment 12. I thought the email reply had not worked, so I posted a revised version directly. Comment 11 is the correct one.
Comment 14 Dave Airlie 2018-12-03 03:55:01 UTC
Should we at least push this patch to improve resilience a little?
Comment 15 roger@beardandsandals.co.uk 2018-12-03 08:14:06 UTC
For information: I eventually tracked the hardware fault to bad solder flow in the area of the DVI-D socket. I still stand by my original comments about usability. To me, a recovery process that leaves 99.9% of end users clueless about how to safely restart their system is not a good outcome from an end-user perspective. This is my last word on this topic.
Comment 16 Christian König 2018-12-03 11:12:27 UTC
(In reply to Dave Airlie from comment #14)
> Should we at least push this patch to improve resilience a little?

We could, but I don't see much value in that. To really gain anything, we would need to code the software in a way which also works if the hardware is damaged.

That is possible, but I grepped a bit over the source, and in this particular case we would need to manually audit 2201 register accesses so that they also work when the hardware suddenly goes up in flames.

That is totally unrealistic, and just fixing this one case doesn't give us much.
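
One cheaper middle ground (a suggestion of this note, not something proposed in the thread) would be a single early-out check at the top of the reset path using the kernel's existing pci_device_is_present() helper, which reads the vendor ID and treats an all-ones answer as "device gone":

#include <linux/pci.h>

/* Hypothetical early-out for the reset path, not actual radeon code.
 * pci_device_is_present() reads the device's vendor ID and returns
 * false if the read comes back as all ones, i.e. the device no longer
 * answers on the bus. */
static bool radeon_gpu_still_on_bus(struct radeon_device *rdev)
{
	if (!pci_device_is_present(rdev->pdev)) {
		dev_err(rdev->dev,
			"GPU no longer answers on the bus; skipping reset\n");
		return false;
	}
	return true;
}

This would not fix the hundreds of unguarded register accesses, but it would stop the reset path from acting on values derived from all-ones reads in the common fell-off-the-bus case.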