Bug 216917

Summary: hibernation regression since 6.0.18 (Ryzen-5650U incl. Radeon GPU)
Product: Drivers Reporter: kolAflash (kolAflash)
Component: Video(DRI - non Intel) Assignee: drivers_video-dri
Status: NEEDINFO ---    
Severity: normal CC: alexdeucher, jrf, mario.limonciello, regressions
Priority: P1    
Hardware: AMD   
OS: Linux   
Kernel Version: 6.0.18 Subsystem:
Regression: No Bisected commit-id:
Attachments: 6.1.4 dmesg after hibernation

Description kolAflash 2023-01-11 18:04:49 UTC
Since Linux-6.0.18 hibernation isn't working anymore. 6.0.17 was working fine.

When doing
  systemctl start hibernate.target
the screen turns black, but the system doesn't power down.
Same problem for "platform" and "shutdown" in /etc/systemd/sleep.conf => HibernateMode.

Force rebooting the system makes GRUB skip the boot menu (like normal when waking from hibernation), but the system just boots up freshly, not restoring the old state.


= System =
Model: HP EliteBook 845 G8 (notebook)
CPU+GPU: Ryzen 5650U incl. Radeon GPU
OS: openSUSE-15.4
Kernel: compiled from kernel.org
The HP EliteBook 845 G8 uses s0ix/s2idle.


Sadly I don't know how to provide helpful logs. After reboot there's nothing helpful in /var/log/messages
Just this:
2023-01-11T18:45:51.208584+01:00 myhost systemd[1]: Reached target Sleep.
2023-01-11T18:45:51.224615+01:00 myhost systemd[1]: Starting Hibernate...
2023-01-11T18:45:51.253804+01:00 myhost systemd-sleep[1998]: INFO: running /usr/lib/systemd/system-sleep/grub2.sleep for hibernate
2023-01-11T18:45:51.253875+01:00 myhost systemd-sleep[1998]: INFO: Running prepare-grub ..
2023-01-11T18:45:51.322933+01:00 myhost systemd-sleep[1998]:   running kernel is grub menu entry openSUSE Leap 15.4 (vmlinuz-6.0.18-v6.0.18-myhost)
2023-01-11T18:45:51.323010+01:00 myhost systemd-sleep[1998]:   preparing boot-loader: selecting entry openSUSE Leap 15.4, kernel /boot/6.0.18-v6.0.18-myhost
2023-01-11T18:45:51.331546+01:00 myhost systemd-sleep[1998]:   running /usr/sbin/grub2-once "openSUSE Leap 15.4"
2023-01-11T18:45:51.585503+01:00 myhost systemd-sleep[1998]:     time needed for sync: 0.0 seconds, time needed for grub: 0.2 seconds.
2023-01-11T18:45:51.585585+01:00 myhost systemd-sleep[1998]: INFO: Done.
2023-01-11T18:45:51.586273+01:00 myhost systemd-sleep[1996]: Entering sleep state 'hibernate'...
2023-01-11T18:45:51.588025+01:00 myhost kernel: [   39.640758][ T1996] PM: hibernation: hibernation entry
Comment 1 kolAflash 2023-01-11 18:50:42 UTC
I've narrowed the problem down to somewhere between 6fc4c0cd9 (last known good) and v6.0.18

Linux-6.1.4 is fine.
(I just can't use it productively because of https://gitlab.freedesktop.org/drm/amd/-/issues/2171 )
Comment 2 Mario Limonciello (AMD) 2023-01-11 19:18:17 UTC
If it's between 6fc4c0cd9 and v6.0.18 a bisect would be best, but my first educated guess would be:

306df163069e ("drm/amdgpu: make display pinning more flexible (v2)")

If you revert that does it start working again?

It's peculiar that 6.1.4 is fine. That fix is also in 6.1.4, so we might need something else on 6.0.y.

> (I just can't use it productively because of
> https://gitlab.freedesktop.org/drm/amd/-/issues/2171 )

Yeah; hopefully that's fixed soon.
Comment 3 kolAflash 2023-01-11 21:36:06 UTC
Perfect guess!
Indeed 306df163069e is broken and its predecessor is fine.
Reverting 306df163069e on v6.0.18 also made the problem disappear.

Last good:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=306df163069e78160e7a534b892c5cd6fefdd537^

First bad:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=306df163069e78160e7a534b892c5cd6fefdd537


Just wanted to say THANK YOU for all your help in the last couple of months!!!
Both of my Ryzen notebooks wouldn't work as great as they do without you and Alex.
Comment 4 Mario Limonciello (AMD) 2023-01-11 21:45:55 UTC
> Perfect guess!

OK.. so we need to find out why this works in 6.1.y and not in 6.0.y.  There are some fairly severe bugs it fixed.

Is it 100% failure rate on 6.0.y?

Since you mentioned that you couldn't effectively use 6.1.y because of the MST issue, are you only finding it on 6.0.y when connected to a dock or anything else unique?

> Sadly I don't know how to provide helpful logs. After reboot there's nothing
> helpful in /var/log/messages

Can you check /var/lib/systemd/pstore?  Perhaps there was a kernel crash that got saved into NVRAM and restored by systemd on the next boot.

> Just wanted to say THANK YOU for all your help in the last couple of months!!!

:)
Comment 5 Alex Deucher 2023-01-11 21:46:49 UTC
Can you attach your dmesg output?
Comment 6 kolAflash 2023-01-11 22:23:30 UTC
Created attachment 303585
6.1.4 dmesg after hibernation

(In reply to Mario Limonciello (AMD) from comment #4)
> [...]
> Is it 100% failure rate on 6.0.y?

Yes.


> Since you mentioned that you couldn't effectively use 6.1.y because of the
> MST issue, are you only finding it on 6.0.y when connected to a dock or
> anything else unique?

No.

Happens with dock, with simple USB-C power (no dock) and on battery.


> > Sadly I don't know how to provide helpful logs. After reboot there's nothing
> > helpful in /var/log/messages
> 
> Can you check /var/lib/systemd/pstore?  Perhaps there was a kernel crash
> that got saved into NVRAM and restored by systemd on the next boot.

Sadly that file doesn't exist.
There are some files in /sys/fs/pstore/. But nothing from today.


(In reply to Alex Deucher from comment #5)
> Can you attach your dmesg output?

I don't know how to get logs (including dmesg) when hibernation has failed.
As said, after reboot there's nothing new in /var/log/messages

Instead I attached dmesg after hibernation with v6.1.4. Is that helpful?



Another thing:
Is it important that my SWAP is a file /swap on an ext4 partition inside a LUKS partition?
Comment 7 Alex Deucher 2023-01-11 22:36:57 UTC
Do you still have the problem with:
CONFIG_DRM_FBDEV_EMULATION=n
in your .config?

Does reverting a6250bdb6c4677ee77d699b338e077b900f94c0c fix it?
Comment 8 kolAflash 2023-01-11 23:19:56 UTC
(In reply to Alex Deucher from comment #7)
> do you still have the problem with:
> CONFIG_DRM_FBDEV_EMULATION=n
> in your .config?

The problem unfortunately still exists with CONFIG_DRM_FBDEV_EMULATION=n
(and I get a black screen on the virtual console)


> Does reverting a6250bdb6c4677ee77d699b338e077b900f94c0c fix it?

No. That also doesn't help.


I'm sorry.
Anything else I can try?
Comment 9 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-01-12 13:18:12 UTC
FWIW, I just wanted to add this to the regression tracking, but 6.0.y is EOL now; and it seems 6.1.y works. Greg might do another fixup release, but maybe investigating this further is not worth it.
Comment 10 kolAflash 2023-01-12 17:24:39 UTC
Looks like the display issue with linux-6.1.y is well on its way to being fixed.
Hibernation still works fine with the latest revert-commit by Mario & Wayne, which I tested here.
https://gitlab.freedesktop.org/drm/amd/-/issues/2171#note_1720281

So from my point of view this bug isn't relevant anymore. At least as long as it doesn't appear on newer kernels again.
Comment 11 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-01-13 09:16:46 UTC
Just for the record, if someone cares or lands here some time in the future:

There is another report about hibernation problems with Ryzen CPUs in 6.0.18 here: https://lore.kernel.org/all/2d59ed2b-ba8f-6695-9764-fd3b109acd4c@mailbox.org/

Bisection result included (drm/amdgpu: make display pinning more flexible (v2)).
Comment 12 Rainer Fiebig 2023-01-16 16:14:12 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #9)
> FWIW, I just wanted to add this to the regression tracking, but 6.0.y is EOL
> now; and it seems 6.1.y works. Greg might do another fixup release, but
> maybe investigating this further is not worth it.

I beg to differ. Longterm kernels 5.15.87/88 and probably all other LTS kernels to which commit 306df163069e78160e7a534b892c5cd6fefdd537 has been backported are also affected. As "hibernate" is a basic, reliable feature and a hard reset (as in my case) may result in data loss, I only see two possibilities: either revert the commit in the longterm kernels or try to find out quickly what makes it work for them.

The diff between 6.0.18 and 6.1.4 (where it was introduced) shows that only 86 files in /drivers/gpu/drm/amd/amdgpu have been modified. Probably only a few of them are relevant in this matter. So for the experts it should not be too hard to figure out a solution.
Comment 13 Mario Limonciello (AMD) 2023-01-16 16:15:21 UTC
Can we please confirm it's actually broken in 5.15.y before going through that effort?
Comment 14 Mario Limonciello (AMD) 2023-01-16 16:16:56 UTC
> and a hard reset (as in my case) may result

Sorry - specifically that reverting the backported commit fixes your case.  If so, yeah then we should see if there is anything else obvious to backport to help it.
Comment 15 Rainer Fiebig 2023-01-16 16:20:23 UTC
(In reply to Mario Limonciello (AMD) from comment #13)
> Can we please confirm it's actually broken in 5.15.y before going through
> that effort?

I have tested this with 5.15.87/88. Error messages and symptoms were the same as with 6.0.18. Spared me the bisecting this time, though.
Comment 16 Alex Deucher 2023-01-16 16:21:26 UTC
(In reply to Rainer Fiebig from comment #15)
> (In reply to Mario Limonciello (AMD) from comment #13)
> > Can we please confirm it's actually broken in 5.15.y before going through
> > that effort?
> 
> I have tested this with 5.15.87/88. Error messages and symptoms were the
> same as with 6.0.18. Spared me the bisecting this time, though.

Can you verify that reverting the change in 5.15.y fixes it?
Comment 17 Rainer Fiebig 2023-01-16 16:41:19 UTC
(In reply to Alex Deucher from comment #16)
> (In reply to Rainer Fiebig from comment #15)
> > (In reply to Mario Limonciello (AMD) from comment #13)
> > > Can we please confirm it's actually broken in 5.15.y before going through
> > > that effort?
> > 
> > I have tested this with 5.15.87/88. Error messages and symptoms were the
> > same as with 6.0.18. Spared me the bisecting this time, though.
> 
> Can you verify that reverting the change in 5.15.y fixes it?

Will do it.
Comment 18 Rainer Fiebig 2023-01-16 17:02:38 UTC
(In reply to Alex Deucher from comment #16)
> (In reply to Rainer Fiebig from comment #15)
> > (In reply to Mario Limonciello (AMD) from comment #13)
> > > Can we please confirm it's actually broken in 5.15.y before going through
> > > that effort?
> > 
> > I have tested this with 5.15.87/88. Error messages and symptoms were the
> > same as with 6.0.18. Spared me the bisecting this time, though.
> 
> Can you verify that reverting the change in 5.15.y fixes it?

Alright, I do confirm that reverting commit 306df163069e78160e7a534b892c5cd6fefdd537 ("drm/amdgpu: make display pinning more flexible (v2)") solves the problem with hibernate and resume in 5.15.88.

To me it seems that this patch cannot be backported in an isolated fashion.
Comment 19 Mario Limonciello (AMD) 2023-01-16 17:49:30 UTC
Assuming it's within amdgpu and not DRM helpers it's still ~800 commits to sift through. Even though 6.0.y is EOL now, I think it would be easier to check the missing commit(s) from there to backport.  We can worry about 5.15.y after that.

Can you see if this series from 6.1 on top of 6.0.19 helps?

https://patchwork.freedesktop.org/series/106027/
Comment 20 Rainer Fiebig 2023-01-16 18:12:25 UTC
(In reply to Mario Limonciello (AMD) from comment #19)
> Assuming it's within amdgpu and not DRM helpers it's still ~800 commits to
> sift through. Even though 6.0.y is EOL now, I think it would be easier to
> check the missing commit(s) from there to backport.  We can worry about
> 5.15.y after that.
> 
> Can you see if this series from 6.1 on top of 6.0.19 helps?
> 
> https://patchwork.freedesktop.org/series/106027/

Yes, but may take a while.
Comment 21 Rainer Fiebig 2023-01-16 20:59:30 UTC
(In reply to Mario Limonciello (AMD) from comment #19)
> Assuming it's within amdgpu and not DRM helpers it's still ~800 commits to
> sift through. Even though 6.0.y is EOL now, I think it would be easier to
> check the missing commit(s) from there to backport.  We can worry about
> 5.15.y after that.
> 
> Can you see if this series from 6.1 on top of 6.0.19 helps?
> 
> https://patchwork.freedesktop.org/series/106027/

No, those patches didn't help. Hibernate was always fine but resume always failed in the same way as described in my original mail to "stable".

Note that I'm not going to test 800 commits in this manner. ;)

So long!
Comment 22 Mario Limonciello (AMD) 2023-01-16 21:02:26 UTC
Thanks for trying.

Another idea that might be feasible for identifying it: do a proper bisect between v6.0 and v6.1, but manually apply
'306df163069e78160e7a534b892c5cd6fefdd537 ("drm/amdgpu: make display pinning more flexible (v2)")' at each test point.
Comment 23 Alex Deucher 2023-01-16 21:08:35 UTC
I'll just revert it.  It is more important for kernels with the drm_buddy changes.
Comment 24 Rainer Fiebig 2023-01-17 07:43:02 UTC
(In reply to Alex Deucher from comment #23)
> I'll just revert it.  It is more important for kernels with the
> drm_buddy changes.

Right thing to do for now, I guess. If I can find a way to identify the commit(s) between 6.0.19 and 6.1 that fix the problem, I'll report it here. Thanks.

Rainer
Comment 25 Rainer Fiebig 2023-01-17 15:34:06 UTC
(In reply to Alex Deucher from comment #23)
> I'll just revert it.  It is more important for kernels with the
> drm_buddy changes.

Would the following be equivalent to what you intended with your commit?
Looks a bit awkward but hibernate/resume work with it for 6.0.19 (and a Ryzen 5600G):


uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
					    uint32_t domain)
{
	if (domain == (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT)) {
		domain = AMDGPU_GEM_DOMAIN_VRAM;
		if ((adev->asic_type == CHIP_CARRIZO) || (adev->asic_type == CHIP_STONEY)) {
			if (adev->gmc.real_vram_size <= AMDGPU_SG_THRESHOLD)
				domain = AMDGPU_GEM_DOMAIN_GTT;
		}
	}
	return domain;
}


Let me know whether this is worth pursuing. I could then test it with 5.15.88 and 6.1.6.
Comment 26 Alex Deucher 2023-01-17 15:46:55 UTC
(In reply to Rainer Fiebig from comment #25)
> (In reply to Alex Deucher from comment #23)
> > I'll just revert it.  It is more important for kernels with the
> > drm_buddy changes.
> 
> Would the following be equivalent to what you intended with your commit?
> Looks a bit awkward but hibernate/resume work with it for 6.0.19 (and a
> Ryzen 5600G):
> 
> 
> uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
>                                           uint32_t domain)
> {
>       if (domain == (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT)) {
>               domain = AMDGPU_GEM_DOMAIN_VRAM;
>               if ((adev->asic_type == CHIP_CARRIZO) || (adev->asic_type == CHIP_STONEY)) {
>                       if (adev->gmc.real_vram_size <= AMDGPU_SG_THRESHOLD)
>                               domain = AMDGPU_GEM_DOMAIN_GTT;
>               }
>       }
>       return domain;
> }
> 
> 
> Let me know whether this is worth pursuing. I could then test it with
> 5.15.88 and 6.1.6.

Nope.  What my patch does is allow display buffers to be in either system memory (GTT) or carve out (VRAM) depending on what is available.  Without the patch, the driver picks either VRAM or GTT depending on how much VRAM is available on the system.  This can lead to memory exhaustion in some cases with multiple large resolution monitors depending on memory fragmentation.

What your patch does is just always use VRAM unless the chip is Carrizo or Stoney.  So it is effectively just reverting the commit (depending on how much VRAM your system has).
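
For anyone landing here later, the point about the function from comment 25 can be checked with a small standalone model. This is only a sketch under stated assumptions: the `model_*` / `MODEL_*` names are stand-ins invented for illustration, not the kernel's definitions, and the flag values and the 256 MiB threshold are assumed rather than taken from this report.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the kernel's domain flags; the values are assumptions. */
#define MODEL_DOMAIN_GTT  0x2u
#define MODEL_DOMAIN_VRAM 0x4u
/* Stand-in for AMDGPU_SG_THRESHOLD; 256 MiB is an assumption. */
#define MODEL_SG_THRESHOLD (256ull << 20)

/* Hypothetical, reduced chip list: only the distinction the function needs. */
enum model_chip { MODEL_CHIP_CARRIZO, MODEL_CHIP_STONEY, MODEL_CHIP_OTHER };

/* Hypothetical reduction of struct amdgpu_device to the two fields used. */
struct model_device {
	enum model_chip asic_type;
	uint64_t real_vram_size;
};

/* Same logic as the function in comment 25, on the stand-in types. */
uint32_t model_preferred_domain(const struct model_device *dev, uint32_t domain)
{
	if (domain == (MODEL_DOMAIN_VRAM | MODEL_DOMAIN_GTT)) {
		domain = MODEL_DOMAIN_VRAM;
		if (dev->asic_type == MODEL_CHIP_CARRIZO ||
		    dev->asic_type == MODEL_CHIP_STONEY) {
			if (dev->real_vram_size <= MODEL_SG_THRESHOLD)
				domain = MODEL_DOMAIN_GTT;
		}
	}
	return domain;
}

/* Exercises the cases discussed in the thread. */
void model_self_check(void)
{
	const uint32_t both = MODEL_DOMAIN_VRAM | MODEL_DOMAIN_GTT;
	struct model_device other = { MODEL_CHIP_OTHER, 64ull << 20 };
	struct model_device small_cz = { MODEL_CHIP_CARRIZO, 128ull << 20 };
	struct model_device big_st = { MODEL_CHIP_STONEY, 512ull << 20 };

	/* Any other chip: VRAM, no matter how small the carve-out is. */
	assert(model_preferred_domain(&other, both) == MODEL_DOMAIN_VRAM);
	/* Carrizo/Stoney at or below the threshold: GTT. */
	assert(model_preferred_domain(&small_cz, both) == MODEL_DOMAIN_GTT);
	/* Carrizo/Stoney above the threshold: VRAM. */
	assert(model_preferred_domain(&big_st, both) == MODEL_DOMAIN_VRAM);
	/* A request naming a single domain passes through unchanged. */
	assert(model_preferred_domain(&other, MODEL_DOMAIN_GTT) == MODEL_DOMAIN_GTT);
}
```

Calling `model_self_check()` from a `main()` runs the four cases. The first assertion is the relevant one here: a request that allows both domains is always collapsed to a single domain, so the either-VRAM-or-GTT placement that the commit introduced is lost.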
Comment 27 Rainer Fiebig 2023-01-17 16:57:18 UTC
(In reply to Alex Deucher from comment #26)
> (In reply to Rainer Fiebig from comment #25)
> > (In reply to Alex Deucher from comment #23)
> > > I'll just revert it.  It is more important for kernels with the
> > > drm_buddy changes.
> > 
> > Would the following be equivalent to what you intended with your commit?
> > Looks a bit awkward but hibernate/resume work with it for 6.0.19 (and a
> > Ryzen 5600G):
> > 
> > 
> > uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
> >                                           uint32_t domain)
> > {
> >       if (domain == (AMDGPU_GEM_DOMAIN_VRAM | AMDGPU_GEM_DOMAIN_GTT)) {
> >               domain = AMDGPU_GEM_DOMAIN_VRAM;
> >               if ((adev->asic_type == CHIP_CARRIZO) || (adev->asic_type == CHIP_STONEY)) {
> >                       if (adev->gmc.real_vram_size <= AMDGPU_SG_THRESHOLD)
> >                               domain = AMDGPU_GEM_DOMAIN_GTT;
> >               }
> >       }
> >       return domain;
> > }
> > 
> > 
> > Let me know whether this is worth pursuing. I could then test it with
> > 5.15.88 and 6.1.6.
> 
> Nope.  What my patch does is allow display buffers to be in either system
> memory (GTT) or carve out (VRAM) depending on what is available.  Without
> the patch, the driver picks either VRAM or GTT depending on how much VRAM is
> available on the system.  This can lead to memory exhaustion in some cases
> with multiple large resolution monitors depending on memory fragmentation.
> 
> What your patch does is just always use VRAM unless the chip is Carrizo or
> Stoney.  So it is effectively just reverting the commit (depending on how
> much VRAM your system has).

I see. Thanks a lot for the explanation!