Bug 198669
Summary: | Driver crash at radeon_ring_backup+0xd3/0x140 [radeon] | |
---|---|---|---
Product: | Drivers | Reporter: | roger (roger)
Component: | Video (DRI - non Intel) | Assignee: | drivers_video-dri
Status: | RESOLVED OBSOLETE | |
Severity: | high | CC: | airlied, christian.koenig, kai.heng.feng
Priority: | P1 | |
Hardware: | All | |
OS: | Linux | |
URL: | https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1746232 | |
Kernel Version: | 4.13.0-32-generic x86_64 | Subsystem: |
Regression: | No | Bisected commit-id: |
Attachments: | Possible fix | |
Description
roger@beardandsandals.co.uk
2018-02-04 17:39:49 UTC
What does radeon_ring_backup+0xd3 resolve to on your system?

Looking at the debug files: radeon_ring_backup resolves to 0x33430, so +0xd3 is 0x33503. The line info gives this:

radeon_ring.c 323 0x334f4
radeon_ring.c 324 0x33508

Created attachment 274001 [details]
Possible fix
The attached patch is a shot in the dark, but please give it a try.
Well, it moved the problem. It crashed somewhere else in the driver with some message about scratch. Sorry, I cannot tell you what it was because I screwed up the save of the kernel message buffer, and now I cannot get the thing to glitch again. My normal method of stamping on the floor next to the system is not working. I think I might have overdone it and now the thing has bedded in. Going to leave it powered off overnight and try again in the morning.

My best guess is the error came from r600.c:2848:

DRM_ERROR("radeon: ring %d test failed (scratch(0x%04X)=0x%08X)\n",

I cannot reproduce the mechanical hardware failure. I don't want to clobber the system any harder and risk damaging a disk. I assume this is being called from the GPU reset path.

Well, the issue is triggered by the driver reading nonsense values from the hardware. E.g. we ask the hardware what the last good position on a 16k ring buffer is and get 0xffffffff as a result (or something like this), which obviously can't be correct. My patch mitigated that by clamping the value to a valid range, but if you read nonsense values from the hardware because the hardware has a loose connection and acts strangely on vibrations, then I basically can't guarantee anything.

The original point I made in the bug report was that this bug is not about the mechanical hardware glitch. It is about the driver being in what is obviously a failure mode and attempting a recovery that fails and leaves the system in an unusable state. The error recovery paths of any driver should be its most resilient components, especially when the driver controls part of the primary user interface. To pose another question: why, when the driver has the information to tell it that the GPU is irrevocably stalled, does it attempt a soft restart and leave the system in an unusable state?

(In reply to roger@beardandsandals.co.uk from comment #7)
> The original point I made in the bug report was that this bug is not about
> the mechanical hardware glitch. It is about the driver being in what is
> obviously a failure mode and attempting a recovery that fails and leaves
> the system in an unusable state.

You are missing the point. The driver fails to recover because the hardware is buggy, not because there is any problem with the recovery routine. In other words, we read back an impossible value from the hardware and that is why the system is failing. I mean, I can handle this impossible value at this code location, but as you actually figured out yourself, it then fails at the next location. There are simply hundreds or even thousands of locations where the assumption is that the hardware works correctly, and we don't handle the case of getting nonsense values.

I think we have to agree to differ on this one. You seem to be focusing on the software interface between the GPU and the driver. What follows is my personal opinion. The most likely cause of this kind of mechanical issue is the signal path between the video interface hardware and the outside world, either a dry joint or a mechanical fault in the cable or cable connectors. I can only reiterate what I said in my previous post: the driver has sufficient information to determine that a hard failure has occurred, and that failure is probably not in the GPU itself. I would like to see the driver doing a hard reset of the card with rigorous error checking. If it cannot reset the GPU in graphical mode, it should try to set the display hardware into a basic console mode.
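As a rough illustration of the clamping idea Christian describes above, here is a minimal, self-contained C sketch. The struct fields and helper name are hypothetical; this is not the attached patch or the actual radeon code, just the general technique of refusing to trust a ring position that lies outside the ring:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the ring state a driver tracks. */
struct ring_state {
	uint32_t size_dw;   /* ring size in dwords (power of two)       */
	uint32_t ptr_mask;  /* size_dw - 1, used to wrap ring positions */
};

/*
 * Sanitize a read pointer reported by the hardware.  A healthy GPU
 * always reports a position inside the ring; a value such as
 * 0xffffffff means the register read returned garbage, so clamp it
 * to the valid range instead of indexing past the buffer.
 */
static uint32_t sanitize_rptr(const struct ring_state *ring, uint32_t hw_rptr)
{
	if (hw_rptr >= ring->size_dw) {
		fprintf(stderr, "bogus read pointer 0x%08x, clamping\n", hw_rptr);
		hw_rptr &= ring->ptr_mask;
	}
	return hw_rptr;
}

int main(void)
{
	struct ring_state ring = { .size_dw = 16384, .ptr_mask = 16383 };

	/* A dead bus typically answers with all bits set. */
	printf("clamped to 0x%08x\n", sanitize_rptr(&ring, 0xffffffffu));
	return 0;
}
```

As the discussion below makes clear, such a clamp only moves the failure to the next place that trusts the hardware; it does not recover the GPU.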
(In reply to roger@beardandsandals.co.uk from comment #9)
> The most likely cause of this kind of mechanical issue is the signal path
> between the video interface hardware and the outside world, either a dry
> joint or a mechanical fault in the cable or cable connectors.

That is what I absolutely agree about.

> The driver has sufficient information to determine that a hard failure has
> occurred, and that failure is probably not in the GPU itself. I would like
> to see the driver doing a hard reset of the card with rigorous error
> checking. If it cannot reset the GPU in graphical mode, it should try to
> set the display hardware into a basic console mode.

And that is the part you don't seem to understand. The driver is trying exactly what you are describing.

We detect a problem because of a timeout, e.g. the hardware doesn't respond in a given time frame to commands we send to it. What we do then is query the hardware for how far it proceeded in the execution, and the hardware answers with a nonsense value. In other words, bits are set in the response which should never be set. This is a clear indicator that the PCIe transaction for the register read aborted because the device doesn't respond any more.

The most likely cause of that is that the bus interface in the ASIC locked up because of an electrical problem (I think the ESD protection kicked in), and the only way to get out of that is a hard reset of the system.

What we can try to do is prevent further failures like the crash you described by checking the values read from the hardware. This way you can at least access the box over the network or blindly shut it down with keyboard shortcuts.

Yes, I take your point. I was speculating on insufficient information. My apologies. The solution you propose is essentially what I have already been doing. Logging in over a network already works with the unpatched driver. I have not had any luck with keyboard shortcuts; it looks like Xwayland/the X server does not know that a problem has occurred and still has hold of the keyboard and mouse. This is an obscure problem and probably not worth spending much time on, especially as I no longer seem to be able to reproduce it! Thank you for your patience.

On 7 February 2018 08:23:06 bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198669
>
> --- Comment #10 from Christian König (christian.koenig@amd.com) ---
> (In reply to roger@beardandsandals.co.uk from comment #9)
>> The most likely cause of this kind of mechanical issue is the signal path
>> between the video interface hardware and the outside world, either a dry
>> joint or a mechanical fault in the cable or cable connectors.
>
> That is what I absolutely agree about.
>
>> The driver has sufficient information to determine that a hard failure has
>> occurred, and that failure is probably not in the GPU itself. I would like
>> to see the driver doing a hard reset of the card with rigorous error
>> checking. If it cannot reset the GPU in graphical mode, it should try to
>> set the display hardware into a basic console mode.
>
> And that is the part you don't seem to understand. The driver is trying
> exactly what you are describing.
>
> We detect a problem because of a timeout, e.g. the hardware doesn't respond
> in a given time frame to commands we send to it.
>
> What we do then is query the hardware for how far it proceeded in the
> execution, and the hardware answers with a nonsense value. In other words,
> bits are set in the response which should never be set.
>
> This is a clear indicator that the PCIe transaction for the register read
> aborted because the device doesn't respond any more.
>
> The most likely cause of that is that the bus interface in the ASIC locked
> up because of an electrical problem (I think the ESD protection kicked in),
> and the only way to get out of that is a hard reset of the system.
>
> What we can try to do is prevent further failures like the crash you
> described by checking the values read from the hardware. This way you can
> at least access the box over the network or blindly shut it down with
> keyboard shortcuts.

Yes, I take your point. I was speculating on insufficient information. My apologies. The solution you propose sounds great. Thank you for your patience.

You can ignore comment 11. I thought the email reply had not worked, so I posted a revised version directly. Comment 10 is the correct one.

Should we at least push this patch to improve resilience a little?

For information, I eventually tracked the hardware fault to bad solder flow in the area of the DVI-D socket. I still stick by my original comments about usability. To me, an outcome of a recovery process that leaves 99.9% of end users clueless about how to safely restart their system is not a good outcome from an end-user perspective. This is my last word on this topic.

(In reply to Dave Airlie from comment #14)
> Should we at least push this patch to improve resilience a little?

We could, but I don't see much value in that. E.g. we would need to code the software in a way which also works if the hardware is damaged. That is possible, but I grepped a bit over the source, and in this particular case we would need to manually audit 2201 register accesses so that they also work when the hardware suddenly goes up in flames. That is totally unrealistic, and just fixing this one case doesn't give us much.
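To make the "impossible value" check described above concrete, here is a small hedged C sketch of the kind of test a recovery path could perform before trusting the hardware again. The helper and register-accessor names are made up for illustration and are not the radeon driver's API; the only assumption carried over from the discussion is that an aborted PCIe read completes as all ones:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* An aborted PCIe read usually completes as all ones. */
#define DEAD_READ 0xffffffffu

/* Hypothetical register accessor; in a real driver this would be an MMIO read. */
static uint32_t read_reg32(uint32_t offset)
{
	(void)offset;
	return DEAD_READ;   /* simulate a device that has fallen off the bus */
}

/*
 * Decide whether the device is still answering on the bus before
 * attempting a soft reset.  If the status register comes back with
 * every bit set, the read almost certainly aborted, and further
 * recovery should be skipped so the rest of the system stays usable.
 */
static bool device_responding(uint32_t status_reg)
{
	return read_reg32(status_reg) != DEAD_READ;
}

int main(void)
{
	if (!device_responding(0x8010))   /* arbitrary example register offset */
		fprintf(stderr, "GPU not responding; skipping soft reset\n");
	return 0;
}
```

This kind of check catches only the one read it guards, which is exactly the objection raised above: auditing every register access in the driver this way is impractical.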