Bug 95911

Summary: Recursive error in radeon device driver module after resume from hibernation
Product: Drivers Reporter: gitne
Component: Video(DRI - non Intel)Assignee: drivers_video-dri
Status: NEW ---    
Severity: high CC: alexdeucher, christian.koenig, deathsimple, gitne, szg00000
Priority: P1    
Hardware: x86-64   
OS: Linux   
See Also: https://bugzilla.redhat.com/show_bug.cgi?id=1207874
Kernel Version: 3.18.x and later Subsystem:
Regression: No Bisected commit-id:
Attachments: Kernel log
cpuinfo
meminfo
lspci
iomem
ioports
Kernel version 3.19.5 log + log data before resume/crash
possible fix

Description gitne 2015-04-01 00:49:27 UTC
Created attachment 172841 [details]
Kernel log

Description of problem:
Recursive error in radeon device driver module after resume from hibernation. See attached kernel log for details.

Version-Release number of selected component (if applicable):
Linux 3.18.9-200.fc21.x86_64 #1 SMP Mon Mar 9 15:10:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux (and later)

Steps to reproduce:
1. Run X.Org X Server 1.16.3 with radeon X.Org driver 7.5.0 (plus GNOME 3.14 or whatever else you like) on AMD A10-7800 Radeon R7
2. Suspend to disk with i.e. "systemctl hibernate".
3. Resume from hibernation.

Actual results:
Current login session is killed and X.Org server recovers into AIGLX software rendering.

Expected results:
No error in radeon device driver module.

Additional info:
See attached kernel log, cpuinfo, meminfo, and lspci. Booting with radeon default kernel boot parameters.
Comment 1 gitne 2015-04-01 00:50:41 UTC
Created attachment 172851 [details]
cpuinfo
Comment 2 gitne 2015-04-01 00:53:42 UTC
Created attachment 172861 [details]
meminfo
Comment 3 gitne 2015-04-01 00:54:12 UTC
Created attachment 172871 [details]
lspci
Comment 4 gitne 2015-04-01 00:54:44 UTC
Created attachment 172881 [details]
iomem
Comment 5 gitne 2015-04-01 00:55:04 UTC
Created attachment 172891 [details]
ioports
Comment 6 Michel Dänzer 2015-04-01 02:13:07 UTC
Is this a regression compared to older kernel versions? If yes, can you bisect or at least narrow down which kernel version is the last good / first bad one?
Comment 7 gitne 2015-04-27 11:36:51 UTC
(In reply to Michel Dänzer from comment #6)
> Is this a regression compared to older kernel versions? If yes, can you
> bisect or at least narrow down which kernel version is the last good / first
> bad one?

Unfortunately, I cannot tell if this bug is a regression or new. I have started to experience this bug only with my new piece of hardware AMD Radeon R7. I have not tested specifically on a range different kernel versions either, nor intend to bisect anything. Sorry, but I am using my machine to do actual work and do not have time to do this kind of testing nor have the required knowledge to compile the kernel. So no, I am not a kernel developer. If I were, I would probably fix this and submit a patch but I am not.

The AMD APU Radeon R7 should be a common piece of hardware by now, and very similar to any other APU Radeon R* so testing should be relatively easy for kernel developers.

All I can say is that I have went through all the F21 kernel versions with the same annoying effect. Since, Fedora kernel versions are somewhat lagging a bit behind the upstream kernel versions, and hibernation not being a No. 1 test feature, plus the R7 being a relatively new piece of hardware makes me assume that this has not been fixed in the latest kernel versions either. I would really love hibernation to work because it is a very practical and time saving feature.
Comment 8 Michel Dänzer 2015-04-30 09:25:07 UTC
Can you attach the kernel log from before and across suspend/resume as well?

Christian, any ideas?
Comment 9 gitne 2015-04-30 09:59:03 UTC
(In reply to Michel Dänzer from comment #8)
> Can you attach the kernel log from before and across suspend/resume as well?

I can try but there seems to be nothing worth noting before resume. Is there any kernel boot parameter or a radeon module parameter that I can tune to get a more verbose kernel log? I'll try running on my latest F21 kernel version 3.19.5 but I doubt I'll see any improvement, judging by the changelog.

Best regards
Comment 10 Christian König 2015-04-30 11:51:19 UTC
(In reply to Michel Dänzer from comment #8)
> Can you attach the kernel log from before and across suspend/resume as well?
> 
> Christian, any ideas?

Only possibility I can see is a bug in the BOs reference counting or a race between closing the handle and suspending.
Comment 11 Michel Dänzer 2015-05-01 00:56:40 UTC
(In reply to gitne from comment #9)
> I can try but there seems to be nothing worth noting before resume.

At the very least, there's lots of radeon / drm initialization messages. Please just attach all (non-sensitive parts of) it.
Comment 12 gitne 2015-05-01 06:41:03 UTC
Created attachment 175411 [details]
Kernel version 3.19.5 log + log data before resume/crash

You can clearly see the moment of hibernation based on the break in the relative monotonic timestamp list. Please pay attention especially to the failed ring 1, 2, and 3 tests by the drm device driver:
[21546.321630] xxxxx kernel: [drm:cik_ring_test [radeon]] *ERROR* radeon: ring 1 test failed (scratch(0x3010C)=0xCAFEDEAD)
[21546.473738] xxxxx kernel: [drm:cik_ring_test [radeon]] *ERROR* radeon: ring 2 test failed (scratch(0x3010C)=0xCAFEDEAD)
[21546.588708] xxxxx kernel: [drm:cik_sdma_ring_test [radeon]] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD)

Please also note:
[    3.662536] xxxxx kernel: ACPI Error: [\_SB_.ALIB] Namespace lookup failure, AE_NOT_FOUND (20141107/psargs-359)
[    3.662639] xxxxx kernel: ACPI Error: Method parse/execution failed [\_SB_.PCI0.VGA_.ATC0] (Node ffff88042f0c9a78), AE_NOT_FOUND (20141107/psparse-536)
[    3.662709] xxxxx kernel: ACPI Error: Method parse/execution failed [\_SB_.PCI0.VGA_.ATCS] (Node ffff88042f0c9a50), AE_NOT_FOUND (20141107/psparse-536)

I am not an expert but this error seems to be linked to the drm device driver.
Comment 13 gitne 2015-05-06 16:39:07 UTC
This bug seems to be linked to bugs bz77181 and bz60827 too.
Mantas Mikulėnas has determined that git commit 4474f3a91f95 was the last known good to work. archiesix has determined that this bug persists even since kernel version 3.9.11.

It's a pity that actually *users* have to do the digging for this kind of information. Its all there but kernel developers are obviously too tired or too lazy to do actual work after they have spent countless hours bragging about how genius they are in delivering fucked up work. If you can't do it, don't touch it.
Oh and another "secret" has been revealed: The bug is caused by ring test failures. Wow! Who could have thought of that!

@Michel Dänzer
To answer your previous question: Yes, the bug is a regression, or sort of.
Comment 14 Michel Dänzer 2015-05-07 01:17:48 UTC
(In reply to gitne from comment #13)
> Mantas Mikulėnas has determined that git commit 4474f3a91f95 was the last
> known good to work.

So commit f2ba57b5eab8817d86d0f108fdf1878e51dc0a37 ("drm/radeon: UVD bringup v8") is the first bad one? In that case, I think you should be able to work around it by removing the /lib/firmware/radeon/*_uvd.bin files (and re-generating the initrd if necessary)


> It's a pity that actually *users* have to do the digging for this kind of
> information. Its all there but kernel developers are obviously too tired or
> too lazy to do actual work after they have spent countless hours bragging
> about how genius they are in delivering fucked up work.

Your rudeness makes me highly doubt that you're Japanese, despite your e-mail address. Would you like us to respond in kind?

We could spend all our time fixing bugs, but then there wouldn't be any support for new hardware or features. It's a trade-off. So yes, we do need users' help for testing and tracking down bugs.


> Oh and another "secret" has been revealed: The bug is caused by ring test
> failures. Wow! Who could have thought of that!

A ring test failure is a symptom which can be caused by many things.
Comment 15 gitne 2015-05-12 13:59:15 UTC
(In reply to Michel Dänzer from comment #14)
>> Mantas Mikulėnas has determined that git commit 4474f3a91f95 was the last
>> known good to work.
> 
> So commit f2ba57b5eab8817d86d0f108fdf1878e51dc0a37 ("drm/radeon: UVD bringup
> v8") is the first bad one? In that case, I think you should be able to work
> around it by removing the /lib/firmware/radeon/*_uvd.bin files (and
> re-generating the initrd if necessary)

Okay, I have tried it and recreated the initramfs-*.img file with mkinitrd. I am not sure I have done it correctly because Fedora's documentation on this matter is rather non-existent and mkinitrd's man page is rather doll, just mentioning that it uses dracut internally. So, this is the best I can offer.

Anyhow, this measure does not work around this issue either because it causes the system to hang after a suspend or hibernate command is issued. This may be due to the fact that I do not have a KAVERI_uvd.bin firmware file. I do not know if such a file would be required. Or, perhaps it does exist but under a different name. Maybe something like RV7xx_uvd.bin? Would simply soft-linking to the adequate file and then recreating the initrd help? I do not know how the firmware loading mechanism works.

>> It's a pity that actually *users* have to do the digging for this kind of
>> information. Its all there but kernel developers are obviously too tired or
>> too lazy to do actual work after they have spent countless hours bragging
>> about how genius they are in delivering fucked up work.
> 
> Your rudeness makes me highly doubt that you're Japanese, despite your
> e-mail address. Would you like us to respond in kind?

Nobody cares who I am, even if I were from Mars. The point is that this is a serious bug (even always reproducible) and all I hear at conferences, events, and in the media is how incredibility genius Linux kernel developers are supposed to be, but when one offers advice on how to make things better by making them more user friendly, or when it comes to fixing bugs and really delivering a quality product that would make a lot more people to switch over from proprietary software one gets shouted and talked down - or even worse - gets humiliated by getting called a technical "dump ass" (which of course is not true). Yes, most kernel developers do really suffer from broken egos, just like many doctors suffer from the delusion of being a god over life and death. But, enough said on that.
 
> We could spend all our time fixing bugs, but then there wouldn't be any
> support for new hardware or features. It's a trade-off. So yes, we do
> need users' help for testing and tracking down bugs.

Yes, I know. However, prioritizing fixing bugs is what makes software great and foremost usable. There is no replacement for quality in products, whatever they might be. It is bad enough that the industry prioritizes developing new features over product support and quality. The open-source community should not follow down this path either if it is serious about offering a working alternative.

>> Oh and another "secret" has been revealed: The bug is caused by ring test
>> failures. Wow! Who could have thought of that!
> 
> A ring test failure is a symptom which can be caused by many things.

Yeah, so why do I have the feeling that it isn't adequately addressed?
If hunting down bugs with mean users, at the current state of the kernel is really such a big problem then maybe the kernel needs a different design that makes it more user friendly. To be honest, the current design of the kernel is overly messy (compared to other kernels) and because of this, makes debugging really difficult. This is a fact. But again, if one dares to mention this then it is perceived by the responsible like pure blasphemy.
Comment 16 Alex Deucher 2015-05-27 19:26:29 UTC
Created attachment 178141 [details]
possible fix

Does the attached patch help?
Comment 17 gitne 2015-06-08 23:05:34 UTC
(In reply to Alex Deucher from comment #16)
> Created attachment 178141 [details]
> possible fix
> 
> Does the attached patch help?

Thank you for proposing a possible solution. Unfortunately, I do not know if this patch is going to help because I am not a kernel developer. I do not build nor intend to build any kernels. However, I can test fully functional Fedora RPM packages.

Nevertheless, I have tested kernel version 3.19.7. As it turns out, hibernation and wake up seem to work properly if no Firefox is running. Latest versions of Firefox always use some sort of hardware accelerated graphics. This makes me assume that the radeon device driver (or the device itself) is put into some inconsistent or invalid state from user-space, meaning, input values to kernel-space are not checked properly by the device driver. This invalid state can probably be saved but cannot be restored on wake up because it is - well - invalid by definition. So, although I am not a kernel developer I pretty much doubt that the proposed patch is going to fix this problem because it seems to put the device just into a deeper sleep mode. This change has no impact on any previously saved invalid state, regardless while saving or while restoring.
But, as always, I may be wrong.

Can you build a test package?
Comment 18 gitne 2015-08-04 22:43:15 UTC
Just FYI:
I have just tested kernel version 4.0.8. Unfortunately, the problem still persists. Additionally, since kernel version 3.15.7 the machine does not power down after swapping the hibernation data to disk.

While trying to resolve this problem perhaps via a workaround, I have tried forcing the kernel to use BIOS routines where possible, just to rule out the possibility of any hardware bugs that may have been worked around in the BIOS. This did not improve the situation either. So, I still assume that the radeon device driver is or can be put into invalid state from user space.
Comment 19 gitne 2015-08-18 07:23:34 UTC
Just tested kernel version 4.1.4, and I have also started playing with some of the radeon module parameters. It turns out that when I set radeon.msi=0 everything works as expected. No ring test failures anymore and hibernation works perfect too. However, when I set radeon.msi to 1 or -1 (auto) I get exactly the same behavior as described earlier.

As it turns out the message signaled interrupts code in the radeon device driver module is broken. I have had similar behavior with the Radeon Mobility 2600 HD (R630). The difference was that hibernation worked but when radeon.msi was set to 1 or -1 (auto) the radeon device driver module caused initially seemingly random lockups, which over time, as interrupts got generated from input devices, the lockups have been released for a couple hundred milliseconds until another interrupt had to be handled by the radeon device driver module. So, I would say that the message signaling interrupt code still needs some major work.
Comment 20 Alex Deucher 2015-08-18 15:01:40 UTC
Does this help?

diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 5751446..1b4ac44 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -435,14 +435,14 @@ static int radeon_pmops_freeze(struct device *dev)
 {
        struct pci_dev *pdev = to_pci_dev(dev);
        struct drm_device *drm_dev = pci_get_drvdata(pdev);
-       return radeon_suspend_kms(drm_dev, false, true);
+       return radeon_suspend_kms(drm_dev, true, true);
 }
 
 static int radeon_pmops_thaw(struct device *dev)
 {
        struct pci_dev *pdev = to_pci_dev(dev);
        struct drm_device *drm_dev = pci_get_drvdata(pdev);
-       return radeon_resume_kms(drm_dev, false, true);
+       return radeon_resume_kms(drm_dev, true, true);
 }
 
 static int radeon_pmops_runtime_suspend(struct device *dev)
Comment 21 Michel Dänzer 2015-08-19 02:53:16 UTC
(In reply to Alex Deucher from comment #20)
> Does this help?

I'm afraid you're wasting your time here, since the bug reporter refuses to build a kernel for testing fixes. Anyway, FWIW:


> diff --git a/drivers/gpu/drm/radeon/radeon_drv.c
> b/drivers/gpu/drm/radeon/radeon_drv.c
> index 5751446..1b4ac44 100644
> --- a/drivers/gpu/drm/radeon/radeon_drv.c
> +++ b/drivers/gpu/drm/radeon/radeon_drv.c
> @@ -435,14 +435,14 @@ static int radeon_pmops_freeze(struct device *dev)
>  {
>         struct pci_dev *pdev = to_pci_dev(dev);
>         struct drm_device *drm_dev = pci_get_drvdata(pdev);
> -       return radeon_suspend_kms(drm_dev, false, true);
> +       return radeon_suspend_kms(drm_dev, true, true);
>  }
>  
>  static int radeon_pmops_thaw(struct device *dev)
>  {
>         struct pci_dev *pdev = to_pci_dev(dev);
>         struct drm_device *drm_dev = pci_get_drvdata(pdev);
> -       return radeon_resume_kms(drm_dev, false, true);
> +       return radeon_resume_kms(drm_dev, true, true);
>  }
>  
>  static int radeon_pmops_runtime_suspend(struct device *dev)

The freeze callback change makes sense to me, but the thaw callback should probably enable the device, not disable it?
Comment 22 Michel Dänzer 2015-08-19 02:57:02 UTC
(In reply to Michel Dänzer from comment #21)
> The freeze callback change makes sense to me, but the thaw callback should
> probably enable the device, not disable it?

Never mind, I misread the patch.
Comment 23 gitne 2015-09-02 23:57:47 UTC
(In reply to Alex Deucher from comment #20)
> Does this help?
>
> diff --git a/drivers/gpu/drm/radeon/radeon_drv.c
> b/drivers/gpu/drm/radeon/radeon_drv.c
> index 5751446..1b4ac44 100644
> --- a/drivers/gpu/drm/radeon/radeon_drv.c
> +++ b/drivers/gpu/drm/radeon/radeon_drv.c
> @@ -435,14 +435,14 @@ static int radeon_pmops_freeze(struct device *dev)
> {
> struct pci_dev *pdev = to_pci_dev(dev);
> struct drm_device *drm_dev = pci_get_drvdata(pdev);
> - return radeon_suspend_kms(drm_dev, false, true);
> + return radeon_suspend_kms(drm_dev, true, true);
> }
>
> static int radeon_pmops_thaw(struct device *dev)
> {
> struct pci_dev *pdev = to_pci_dev(dev);
> struct drm_device *drm_dev = pci_get_drvdata(pdev);
> - return radeon_resume_kms(drm_dev, false, true);
> + return radeon_resume_kms(drm_dev, true, true);
> }
>
> static int radeon_pmops_runtime_suspend(struct device *dev)

Thank you for digging into this. Again, I do not have the capacity to build a Linux kernel. Could you do this for me? Then, I could also test it on my older machine, which could potentially greatly improve support for many other machines at the same time.

So far, radeon.msi=0 works for me as a workaround. I do not know how much of a performance penalty this incurs but things seem to work smoothly. To quantify any penalty, some exact measurements are required. However, since radeon.msi=1 does not work smoothly, it is going to be difficult to assess any quantifiable results (and be able to compare them to radeon.msi=0) in this case. Can you give me a hint as to what benchmarks I could/should run?