Bug 211277

Summary: sometimes crash at s2ram-wake (Ryzen 3500U): amdgpu, drm, commit_tail, amdgpu_dm_atomic_commit_tail
Product: Drivers
Reporter: kolAflash (kolAflash)
Component: Video (DRI - non Intel)
Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX
Severity: normal
CC: alexdeucher, arhamjain, jamesz, linus.kardell, me, perry_yuan, ted437, xiehuanjun, youling257
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.10.4
Subsystem:
Regression: No
Bisected commit-id:
Attachments: kern.log
BIOS update history (just in case someone has a clue if something looks suspicious and this might not be a Linux problem)
Kernel log
to fix suspend/resume hung issue
AMDGPU fence info
all kernel messages with ip_block_mask=0x0ff (Debian kernel 5.10.0-6)
dmesg via SSH, running amd-drm-next-5.14-2021-05-12 without ip_block_mask=0x0ff and with Xorg
/var/log/kern.log running amd-drm-next-5.14-2021-05-12 (ae30d41eb) with Xorg
publickey - me@jeromec.com - fa4f4559.asc
signature.asc
A workaround for suspend/resume hung issue
publickey - me@jeromec.com - fa4f4559.asc
signature.asc
journalctl of amdgpu trace
Fix for S3 hung issue
publickey - EmailAddress(s=me@jeromec.com) - 0xFA4F4559.asc
signature.asc
dmesg5.15.txt
config-5.15.0-rc2-android-x86_64+
backport patch for 5.10 stable.
Linux kernel make .config

Description kolAflash 2021-01-19 10:25:30 UTC
I'm currently on Debian-11-Testing (Bullseye). For a few weeks now, the system has sometimes (not always) failed to wake up from suspend.
Most of the time suspend works. But about 1 in 10 times it crashes.

I attached /var/log/kern.log which holds plenty of information about the crash. Looks like the crash happened in amdgpu_dm.c:7273 (amdgpu_dm_atomic_commit_tail, Linux-5.10.4).


I'm pretty sure this behavior didn't appear a few months ago. So I guess a recent change is causing it. This may either be:


1. an updated package by Debian-Testing

Indeed, I'm pretty sure the problem didn't appear before Linux-5.9. So maybe this is being caused by a change between Linux-5.8 and Linux-5.9.
I'll try going back to Linux-5.8 in the next few days.


2. a BIOS update
In November 2020 I installed the BIOS update sp110770.exe.
Before I was using sp107599.exe.
You can find the BIOS history attached.
I'll also see if I can test a BIOS downgrade in the next days.
Comment 1 kolAflash 2021-01-19 10:25:58 UTC
Created attachment 294747 [details]
kern.log
Comment 2 kolAflash 2021-01-19 10:27:11 UTC
Created attachment 294749 [details]
BIOS update history (just in case someone has a clue if something looks suspicious and this might not be a Linux problem)
Comment 3 kolAflash 2021-01-22 04:42:48 UTC
I searched through my journalctl log.

I set up the whole system in May 2020 with Linux-5.6.7.
(journalctl has everything back to that date)



The bug has appeared as follows since October, starting with Linux-5.8. So Linux-5.8 was also affected (contradicting my original post).

I used the system nearly every day and always use s2ram (never shutting down, only rebooting when needed for updates).
So this can be seen statistically.

- 2020-10-21 with Linux-5.8.14 (Debian 5.8.0-3, installed after 2020-09-26)
- 2020-12-11 with Linux-5.9.11 (Debian 5.9.0-4, installed 2020-12-04)
- 2020-12-25 with Linux-5.9.11
- 2021-01-13 with Linux-5.10.4 (Debian 5.10.0-1, installed 2021-01-10)
- 2021-01-16 with Linux-5.10.4
- 2021-01-19 with Linux-5.10.4

So the bug didn't appear with Linux <= 5.7.
And the bug's frequency increased with Linux-5.10.



In parallel I'm still trying to rule out other factors. (BIOS updates, other software changes, ...)
Something significant might be that Debian used GCC-9 for Linux-5.7, while GCC-10 has been used starting with Linux-5.8.
Comment 4 Jerome C 2021-01-24 19:23:19 UTC
I too have a Ryzen 5 3500U and see random resumes where screen updates are very slow (1 frame change every 1-2 minutes), which makes it look like the system has crashed. In the kernel logs I see a bunch of "flip_done timed out" and "amdgpu_dm_atomic_commit_tail" errors.

This never happened for me between 5.4.6 - 5.9.14. I first noticed it on 5.10.4 and never suspended on 5.10.0 - 5.10.3, so my guess is that it's an issue introduced sometime in 5.10.0 - 5.10.3.

Do you have the kernel parameter "init_on_free=1" set, or "CONFIG_INIT_ON_FREE_DEFAULT_ON=y" in your kernel config? If so, try setting the kernel parameter "init_on_free=0". So far (for me, and still testing) it has resumed every time.

I think it's an issue with amdgpu and the kernel parameter "init_on_free=1" or kernel config "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", which zeroes memory on free/deallocation.

The kernel parameter "init_on_alloc=1" or kernel config "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" works fine for me.
Comment 5 Jerome C 2021-01-27 02:11:30 UTC
Created attachment 294879 [details]
Kernel log

Unfortunately it crashed again although I've noticed it's been crashing a lot less (4-5 days) since I set kernel parameter "init_on_free=0".

I've attached a kernel log for 5.10.10
Comment 6 kolAflash 2021-01-30 10:25:23 UTC
(In reply to Jerome C from comment #4)
> [...]
> Do you have kernel parameter set "init_on_free=1" or in your kernel config
> "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", [...]

I'm using the Debian-11 (Testing / Bullseye) standard kernel.

$ grep -i init_on_free /boot/config-5.10.0-2-amd64 
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
Comment 7 Jerome C 2021-01-30 10:41:59 UTC
ok, you have it turned off already

Weird thing happened this morning... I woke my laptop up and it was slow screen updates... I just closed my laptop lid, frustrated... I noticed it suspended again... I open my laptop again and it resumed

I looked in my kernel logs and saw the error messages from the first resume


NOTE: only copied the error messages
> [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:62:crtc-0] flip_done timed out
> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CONNECTOR:73:eDP-1] flip_done timed out
> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:52:plane-3] flip_done timed out


but on the second resume... no warnings or errors

I think it's a bug somewhere between suspension and resuming
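
To compare a good resume with a bad one, it can help to count how often the timeout shows up in a saved kernel log. The following is only a minimal sketch: the function name count_flip_timeouts is made up for this example, and it assumes the log is either piped in or available as a file.

```shell
#!/bin/sh
# count_flip_timeouts (hypothetical helper): print the number of lines that
# contain "flip_done timed out", reading the given file or stdin.
count_flip_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the function's exit status clean in that case.
    grep -c 'flip_done timed out' "$@" || true
}
```

Possible usage: "journalctl -k -b | count_flip_timeouts" for the current boot, or "count_flip_timeouts /var/log/kern.log" for a saved log. With more than one file argument, grep prints one count per file.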
Comment 8 Jerome C 2021-01-31 13:11:56 UTC
I've tried kernel 5.11-rc5 and same issue occurs there.

For now I've downgraded kernel to 5.9.14 ( will update it to 5.9.16 ) until this issue is fixed

What I've mentioned in comment 4 isn't really helping I think

Sometimes the issue happens frequently in a day but then other times it could be a few days before it happens again
Comment 9 kolAflash 2021-02-21 00:17:14 UTC
I've been on Linux-5.7 since 2021-01-26 and have woken the notebook at least once a day since then, without the bug appearing.
So it's clearly a regression in the kernel somewhere between 5.7 and 5.10 and probably between 5.7 and 5.8.

And it's definitely not a BIOS issue, because I haven't changed anything about the BIOS since the problem last appeared with Kernel-5.10.

Regards,
kolAflash
Comment 10 Alex Deucher 2021-02-21 14:40:45 UTC
Can you bisect?  https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
Comment 11 kolAflash 2021-02-25 22:28:28 UTC
(In reply to Alex Deucher from comment #10)
> Can you bisect? 
> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

I will try to.

But it will definitely need some time and may not be possible at all, because the bug cannot be reproduced completely deterministically.
Comment 12 kolAflash 2021-03-05 15:02:17 UTC
I've tried doing a bisect using this script. Unfortunately I couldn't reproduce the bug this way. So bisecting will take a lot longer.

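#!/bin/bash
# Run 20 suspend/resume cycles.
for i in {0..19}; do
  echo -e "\n${i}"
  # Arm the RTC alarm 15 seconds ahead without suspending ("--mode no").
  /usr/sbin/rtcwake --seconds 15 --mode no
  # Suspend to RAM; the RTC alarm wakes the machine again.
  systemctl start suspend.target
  # Give the system time to suspend and resume before the next cycle.
  sleep 15
done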
Comment 13 Jerome C 2021-03-05 15:13:15 UTC
(In reply to kolAflash from comment #12)
> I've tried doing a bisect using this script. Unfortunately I couldn't
> reproduce the bug this way. So bisecting will take a lot longer.
> 
> for i in {0..19}; do
>   echo -e "\n${i}"
>   /usr/sbin/rtcwake --seconds 15 --mode no
>   systemctl start suspend.target
>   sleep 15
> done

Hiya

I did some testing myself recently, and unfortunately 20 tests were not enough for me. I found that it could take 50 - 100 resumes before a failure, and too often things looked fine with fewer than 50, so I capped my runs at 150 resumes. I then tested kernels between 5.10.4 and 5.11-rc5 ( I didn't use 5.10.0 to 5.10.3 ) and found that this commit

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a10aad137326d137a969fc6cc3555992b99ff9fc

was causing the issue for me
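
A longer test run of that kind could be sketched as below. This is an illustration under stated assumptions, not the exact procedure used here: suspend_cycles, SUSPEND_CMD and LOG_CMD are made-up names, the default suspend command "rtcwake -s 7 -m mem" (suspend to RAM, wake via RTC alarm after 7 seconds) needs root, and the check assumes a fresh boot whose kernel log does not yet contain an earlier "flip_done timed out" message.

```shell
#!/bin/sh
# suspend_cycles (hypothetical helper): suspend and resume up to N times and
# stop at the first cycle after which the kernel log shows a flip_done timeout.
# SUSPEND_CMD and LOG_CMD are made-up overrides so the loop can be dry-run.
suspend_cycles() {
    cycles=${1:-150}
    i=1
    while [ "$i" -le "$cycles" ]; do
        echo "cycle $i"
        # Default: suspend to RAM and wake via the RTC alarm after 7 seconds.
        ${SUSPEND_CMD:-rtcwake -s 7 -m mem} || return 2
        # Default: read the kernel ring buffer of the current boot.
        if ${LOG_CMD:-dmesg} | grep -q 'flip_done timed out'; then
            echo "failure detected in cycle $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $cycles cycles"
}
```

Running "suspend_cycles 150" as root would then either report the first failing cycle or confirm that all 150 resumes passed. A hard hang on resume would of course stop the script as well, so the last printed cycle number is what matters in that case.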
Comment 14 kolAflash 2021-03-07 15:43:04 UTC
(In reply to Jerome C from comment #13)

I don't get how you got to your results.
There's no straight path from 5.10.4 to 5.11-rc5, as they are on different branches (5.10.y and master).

Nevertheless, your result may be reasonable from the point of view of the git history. I'm not sure about the commit ID a10aad137, but it has a completely identical twin commit, c6d2b0fbb (also removing AMD_PG_SUPPORT_VCN_DPG from that expression).
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c6d2b0fbb893d5c7dda405aa0e7bcbecf1c75f98
And c6d2b0fbb has been applied between v5.10-rc2 and v5.10-rc3 (a10aad137 is only in master).

So if c6d2b0fbb (a.k.a. a10aad137) is responsible, this explains why I started noticing the problem when Debian-Testing went from Linux-5.9 to Linux-5.10.

I'm now running a 5.10.21 kernel where I reverted c6d2b0fbb. And I'll try using this kernel for at least one week and also run some iterative tests with it. 



Regarding reproduction in general:

I really wonder what triggers this bug. I didn't go as far as running more than 50 tests (sleep-wake iterations). In particular, I didn't try more than 50 because the bug definitely appeared more often under "natural" (non-testing) circumstances.

Some test series I did which are hard to make sense of statistically:
I tried 20 tests and nothing happened. A few minutes later I decided to try 50 more tests and it directly failed on the first one. So I had to reboot, tried again 50 tests and nothing happened. Afterwards I put my notebook into s2ram and when I woke it the next day it immediately crashed.



By the way the two times it crashed recently (see above) happened with a kernel I compiled from clean kernel.org sources. Also I never experienced the bug with a clean 5.8.18 compiled from kernel.org running with the same system for about a week. So I'm quite convinced it's nothing Debian specific.
Comment 15 kolAflash 2021-03-11 13:55:08 UTC
(In reply to Alex Deucher from comment #10)
> Can you bisect? 
> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

I've done several s2ram-wakeup cycles (100 automatic and about three manual wakeups/day) with the kernel I compiled on 2021-03-07.

It's based on 5.10.21 with c6d2b0fbb reverted. (as suggested by Jerome)
Result: No crashes.
This looks very promising!

@Alex
Can I help with anything else to solve this?




I also compiled 5.10.21 without reverting c6d2b0fbb, tested it for a few hours and got three wakeup-crashes.
Comment 16 kolAflash 2021-05-12 23:43:10 UTC
@Alex
Any progress on this?

If there's no perfect way to fix this, what about an option to turn on/off this behaviour?  
A module option that can be changed at runtime would be ideal. So it can be set right before suspending. But a kernel boot parameter would be fine too.


P.S.
Would someone be so kind and set this bug to "confirmed"?
Comment 17 Alex Deucher 2021-05-13 02:20:57 UTC
I don't think we've been able to reproduce it.  That said, we did double check the programming sequences and I believe it may be fixed with these patches:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/71efc8701a47aa9e3de74bab06020da81757893f
https://gitlab.freedesktop.org/agd5f/linux/-/commit/a8f768874aaf751738a2e0350bf2e70085f93ace
Comment 18 Jerome C 2021-05-13 06:26:49 UTC
(In reply to Alex Deucher from comment #17)
> I don't think we've been able to reproduce it.  That said, we did double
> check the programming sequences and I believe it may be fixed with these
> patches:
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/
> 71efc8701a47aa9e3de74bab06020da81757893f
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/
> a8f768874aaf751738a2e0350bf2e70085f93ace

I've tried these two commits and the issue is still there unfortunately
Comment 19 James Zhu 2021-05-18 15:17:36 UTC
Created attachment 296841 [details]
to fix suspend/resume hung issue

Hi @kolAflash and @jeromec, can you help check whether this patch fixes the issue? We can't reproduce it on our side. Thanks! James
Comment 20 Jerome C 2021-05-18 16:13:36 UTC
(In reply to James Zhu from comment #19)
> Created attachment 296841 [details]
> to fix suspend/resume hung issue
> 
> Hi @kolAflash and @jeromec, Can you help check if this patch can fix the
> issue? Since we can't reproduce at our side. Thanks! James

no, this doesn't work for me.

I'm curious how exactly you are trying to reproduce this

I start Xorg using the command "startx"

Xorg is running with LXQT

I start "Konsole" a gui terminal and execute the following

"for i in $(seq 1 150); do echo $i; sudo rtcwake -s 7 -m mem; done"
Comment 21 James Zhu 2021-05-18 16:41:32 UTC
Hi Jeromec, to isolate the cause, can you help run two experiments separately?
1. Run suspend/resume without launching Xorg, just in text mode.
2. Disable video acceleration (VCN IP). I need you to share the whole dmesg log after loading the amdgpu driver. I think running modprobe with ip_block_mask=0x0ff should disable the VCN IP for VCN1 (dmesg will contain a message telling you whether the VCN IP is disabled or not).

Thanks!
James
Comment 22 kolAflash 2021-05-18 17:18:53 UTC
@James
What do you mean by video acceleration?
Is this about 3D / DRI acceleration like in video games?
Or do you mean just "video" playback (movie, mp4, webm, h264, vp8, ...) acceleration?

And I don't completely understand what ip_block_mask=0x0ff is supposed to do.
I just rebooted with that kernel parameter added and 3D acceleration (DRI) is still working.


----


I'm planning to run these kernels in the next few days:

1. Current Debian testing Linux-5.10.0-6 with ip_block_mask=0x0ff, Xorg and 3D acceleration in daily use.

2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg and with 3D acceleration in daily use.

3. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg, but without 3D acceleration** in daily use.

4. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff and without Xorg, doing some standby cycles for testing.

If I encounter any crash I'll post the whole dmesg starting with the boot output.


----


*
amd-drm-next-5.14-2021-05-12
https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-drm-next-5.14-2021-05-12
ae30d41eb


**
Is there something special I should do to turn off acceleration?
Or should I just not start any application doing 3D / DRI acceleration?
(the latter one might be difficult - I got to keep an eye on every application like Firefox, Atom, VLC, KWin/KDE window manager, ... not to use DRI)
Comment 23 James Zhu 2021-05-18 18:04:12 UTC
Hi kolAflash, 
The VCN IP is for video acceleration (video playback). If the VCN IP doesn't handle the suspend/resume process properly, we do observe other IP blocks being affected. In your case the error is related to the display IP (dm). ip_block_mask=0xff (in GRUB it should be amdgpu.ip_block_mask=0x0ff) can disable the VCN IP while the amdgpu driver loads, so this experiment can tell whether this dm error is caused by the VCN IP or not.
Sometimes /sys/kernel/debug/dri/0/amdgpu_fence_info can provide useful information, if there is a chance to dump it.
These experiments can help identify which IP causes the issue, so we can find an expert in that area to continue the triage. Your current report is case 2, so it can be replaced with
2. amd-drm-next-5.14-2021-05-12* with ip_block_mask=0x0ff, with Xorg and without 3D acceleration in daily use.
I suggest you execute your test plan in the order 4->3->2->1.
Thanks!
James
Comment 24 Jerome C 2021-05-18 18:21:06 UTC
(In reply to James Zhu from comment #21)
> Hi Jeromec, to isolate the cause, can you help run two experiments
> separately?
> 1. To run suspend/resume without launching Xorg, just on text mode.
> 2. To disable video acceleration (VCN IP). I  need you share me the whole
> dmesg log after loading amdgpu driver. I think basically running modprobe
> with ip_block_mask=0x0ff should disable vcn ip for VCN1.(you can find words
> in dmesg to tell you if vcn ip is disabled or not).
> 
> Thanks!
> James

1) In text mode with VCN enabled, the suspend issues are still there.
2) I see the message confirming that VCN is disabled. In text mode with VCN disabled, the suspend issues are gone. After starting Xorg with VCN disabled, the suspend issues are also gone.

I'll gather those logs soon ( tomorrow sometime )
Comment 25 Jerome C 2021-05-18 18:21:36 UTC
I forgot to mention... I'm on kernel 5.13.4
Comment 26 Jerome C 2021-05-18 18:48:54 UTC
(In reply to Jerome C from comment #25)
> I forgot to mention... I'm on kernel 5.13.4

5.12.4 I mean
Comment 27 James Zhu 2021-05-18 19:33:15 UTC
Hi Jeromec, thanks for your feedback, can you also add drm.debug=0x1ff modprobe? I need log: case 1 dmesg and /sys/kernel/debug/dri/0/amdgpu_fence_info (if you can). James.
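
For anyone collecting the same data, a minimal sketch of gathering both pieces in one step follows. The function name collect_gpu_debug and the FENCE_INFO override are made up for this example, and it assumes debugfs is mounted at /sys/kernel/debug and that the commands are run as root.

```shell
#!/bin/sh
# collect_gpu_debug (hypothetical helper): save the amdgpu fence info and the
# kernel log of the current boot into a directory (default: current dir).
# FENCE_INFO is a made-up override for the debugfs path.
collect_gpu_debug() {
    out_dir=${1:-.}
    fence_file=${FENCE_INFO:-/sys/kernel/debug/dri/0/amdgpu_fence_info}
    mkdir -p "$out_dir" || return 1
    if [ -r "$fence_file" ]; then
        cat "$fence_file" > "$out_dir/amdgpu_fence_info.txt"
    else
        echo "cannot read $fence_file (is debugfs mounted, are you root?)" >&2
    fi
    # Reading the kernel log may require root on some systems.
    dmesg > "$out_dir/dmesg.txt" 2>/dev/null || echo "dmesg not readable" >&2
}
```

Calling "collect_gpu_debug ./vcn-on-debug-on" right after a failed resume would leave amdgpu_fence_info.txt and dmesg.txt in that directory, one directory per boot configuration.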
Comment 28 Jerome C 2021-05-19 19:44:31 UTC
Created attachment 296877 [details]
AMDGPU fence info

(In reply to James Zhu from comment #27)
> Hi Jeromec, thanks for your feedback, can you also add drm.debug=0x1ff
> modprobe? I need log: case 1 dmesg and
> /sys/kernel/debug/dri/0/amdgpu_fence_info (if you can). James.

I've tested text mode and gui/drm mode with "drm.debug=0x1ff" set and found no crashes... when "drm.debug=0x1ff" is unset... the crashes/timeouts are back... I think this is why you're unable to reproduce the problem...

I've never known debug option(s) to remove issue(s)... oh well

I've added the contents of the file "/sys/kernel/debug/dri/0/amdgpu_fence_info".

The file contains 4 different boot states ( vcn on/off, drm debug on/off ) clearly marked/separated in the attached file

I'm using 5.12.5 now but I also tried this on 5.12.4. Usually the crashes happen within 50 suspensions/resumes but today I left it to do over 2000 suspensions/resumes just to make sure...

I know you asked for a log but I spent so much time on this ( other things too ), it wasn't on my mind so I'll get that by Friday, if you still need it of course

thanks
Comment 29 James Zhu 2021-05-19 20:02:15 UTC
Hi Jeromec, I think turning on debug changes the timing a little. A log without debug info can't give me any help. The amdgpu_fence_info looks good for all cases. This issue is possibly device-specific.
Comment 30 kolAflash 2021-05-20 09:31:03 UTC
Created attachment 296891 [details]
all kernel messages with ip_block_mask=0x0ff (Debian kernel 5.10.0-6)

Also crashes with ip_block_mask=0x0ff
Tested with the current Debian Testing kernel 5.10.0-6.

I attached all kernel messages from /var/log/messages from boot to crash.
I think that should be the dmesg output.
Comment 31 Jerome C 2021-05-20 09:40:38 UTC
(In reply to kolAflash from comment #30)
> Created attachment 296891 [details]
> all kernel messages with ip_block_mask=0x0ff (Debian kernel 5.10.0-6)
> 
> Also crashes with ip_block_mask=0x0ff
> Tested with the current Debian Testing kernel 5.10.0-6.
> 
> I attached all kernel messages from /var/log/messages from boot to crash.
> I think that should be the dmesg output.

hiya, you may not know this, but you need to use "amdgpu.ip_block_mask=0x0ff" and not "ip_block_mask=0x0ff"

"ip_block_mask=0x0ff" will only apply to the Linux kernel itself

"amdgpu.ip_block_mask=0x0ff" will apply to the amdgpu module

I can see in your kernel logs that VCN is still enabled
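
A quick way to catch the missing module prefix before running a long test is to inspect the kernel command line. The sketch below is an illustration only: check_ip_block_mask is a made-up name, it looks at the command line alone (a parameter set through a modprobe configuration file would not show up here), and the dmesg message about VCN remains the authoritative check.

```shell
#!/bin/sh
# check_ip_block_mask (hypothetical helper): report how ip_block_mask appears
# on a kernel command line (default: the running kernel's /proc/cmdline).
check_ip_block_mask() {
    cmdline=${1:-$(cat /proc/cmdline)}
    # Pad with spaces so a parameter at the start or end still matches.
    case " $cmdline " in
        *" amdgpu.ip_block_mask="*) echo "amdgpu.ip_block_mask is set" ;;
        *" ip_block_mask="*) echo "bare ip_block_mask: not passed to amdgpu" ;;
        *) echo "ip_block_mask is not set" ;;
    esac
}
```

Running "check_ip_block_mask" without arguments after a reboot shows which of the three cases applies to the current boot.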
Comment 32 kolAflash 2021-05-20 18:34:58 UTC
Created attachment 296901 [details]
dmesg via SSH, running amd-drm-next-5.14-2021-05-12 without ip_block_mask=0x0ff and with Xorg

(In reply to Jerome C from comment #31)
> [...]
> hiya, you may not know this but use in "amdgpu.ip_block_mask=0x0ff" and not
> "ip_block_mask=0x0ff"
> [...]
> I can see in your kernel logs that VCN is still enabled

Ooops you're right.
I know someone wrote that before. But it seems I somehow missed it while editing my Grub parameters.

I'll give it another try!


----


In the meantime I performed test number 2.

> 2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg [...]

This time the crash was very different!

After some minutes (about 3) the graphical screen actually turned back on.
I'm pretty sure that didn't happen with the other kernels I tested.
(never tested amd-drm-next-5.14-2021-05-12 before)

Nevertheless everything graphical is lagging extremely. If I move the mouse or do anything else it takes more than 10 seconds until something happens on the screen.

On the other hand, SSH access works smoothly, and I was able to save the dmesg output. (see attachment)
Unlocking the screen via SSH (loginctl) or starting graphical programs (DISPLAY=:0 xterm) works, but is extremely slow too. (> 10 seconds waiting)
Comment 33 Jerome C 2021-05-20 18:42:11 UTC
(In reply to kolAflash from comment #32)
> In the meanwhile I performed test number 2.
> 
> > 2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg
> [...]
> 
> This time the crash was very different!
> 
> After some minutes (about 3) the graphical screen actually turned back on.
> I'm pretty sure that didn't happen with the other kernels I tested.
> (never tested amd-drm-next-5.14-2021-05-12 before)
> 
> Nevertheless everything graphical is lagging extremely. If I move the mouse
> or do anything else it takes more than 10 seconds until something happens on
> the screen.
> 
> On the other hand SSH access is smoothly possible. And I was able to save
> the dmesg output. (see attachment)
> Unlocking the screen via SSH (loginctl) or starting graphical programs
> (DISPLAY=:0 xterm) works, but is extremely slow too. (> 10 seconds waiting)

I experienced this lag too, although I didn't try the SSH thing ( I don't have it set up )
Comment 34 Jerome C 2021-06-28 09:01:10 UTC
Using 5.13.0 now and the issue is still here

(In reply to kolAflash from comment #32)
> Created attachment 296901 [details]
> dmesg via SSH, running amd-drm-next-5.14-2021-05-12 without
> ip_block_mask=0x0ff and with Xorg
> 
> (In reply to Jerome C from comment #31)
> > [...]
> > hiya, you may not know this but use in "amdgpu.ip_block_mask=0x0ff" and not
> > "ip_block_mask=0x0ff"
> > [...]
> > I can see in your kernel logs that VCN is still enabled
> 
> Ooops you're right.
> I know someone wrote that before. But it seems I somehow missed it while
> editing my Grub parameters.
> 
> I'll give it another try!
> 
> 
> ----
> 
> 
> In the meanwhile I performed test number 2.
> 
> > 2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg
> [...]
> 
> This time the crash was very different!
> 
> After some minutes (about 3) the graphical screen actually turned back on.
> I'm pretty sure that didn't happen with the other kernels I tested.
> (never tested amd-drm-next-5.14-2021-05-12 before)
> 
> Nevertheless everything graphical is lagging extremely. If I move the mouse
> or do anything else it takes more than 10 seconds until something happens on
> the screen.
> 
> On the other hand SSH access is smoothly possible. And I was able to save
> the dmesg output. (see attachment)
> Unlocking the screen via SSH (loginctl) or starting graphical programs
> (DISPLAY=:0 xterm) works, but is extremely slow too. (> 10 seconds waiting)

Do you have any updates since you corrected the kernel parameter?
Comment 35 kolAflash 2021-08-04 12:43:27 UTC
Created attachment 298193 [details]
/var/log/kern.log running amd-drm-next-5.14-2021-05-12 (ae30d41eb) with Xorg

Sorry for the long delay.
I've tested:


1. Current Debian-11 testing Linux-5.10.0-8 with amdgpu.ip_block_mask=0x0ff while running Xorg.
Result: everything ok


2. amd-drm-next-5.14-2021-05-12* (ae30d41eb) without any special kernel options while running Xorg.
Result:
- crashes
- also the screen starts flickering about every 10 seconds after second resume
  - flickering also happens when using a8f768874^ (before the first fix-commit by Alex D.)
- log attached: 5.12.0-rc7-original-ae30d41eb_crash.txt


3. Upstream Linux-5.14.0-rc4.
Result: Still broken.


----


*
amd-drm-next-5.14-2021-05-12
https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-drm-next-5.14-2021-05-12
ae30d41eb
Comment 36 Jerome C 2021-08-04 13:24:38 UTC
I've been watching linux-next and noticed that this commit 

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/gpu/drm/amd?id=65660ad349fd947feb16b45ff9231f2ceaf44318

was posted on linux-next some time between 5.10 and 5.11 (I don't remember exactly), but it keeps getting pushed back and has not been mainlined...

I think this is why the issues are still here, and nobody from AMD has responded here since comment 29
Comment 37 James Zhu 2021-08-25 12:00:20 UTC
Hi Jerome and kolAflash,
would you mind taking your original test configuration and adding pci=noats to the boot parameters? For example:
linux	/boot/vmlinuz-5.4.0-54-generic root=UUID=803844cc-7291-4056-bd04-f1b43b54ed97 ro  pci=noats
see if this helps.
Thanks!
James
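
To make such a parameter persistent on a GRUB-based system, one common approach is to append it to GRUB_CMDLINE_LINUX_DEFAULT. The sketch below assumes a Debian-style setup with /etc/default/grub and update-grub; add_kernel_param is a made-up helper that only rewrites text and never touches the real file by itself. It does not check whether the parameter is already present.

```shell
#!/bin/sh
# add_kernel_param (hypothetical helper): read /etc/default/grub-style text on
# stdin and append one parameter to GRUB_CMDLINE_LINUX_DEFAULT, printing the
# result on stdout. The parameter must not contain "/" or "&" (sed syntax).
add_kernel_param() {
    param=$1
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/"
}
```

A cautious workflow would be "add_kernel_param pci=noats < /etc/default/grub > grub.new", then reviewing grub.new, copying it back as root, running update-grub, rebooting, and confirming that pci=noats shows up in /proc/cmdline.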
Comment 38 Jerome C 2021-08-25 16:53:10 UTC
Created attachment 298471 [details]
publickey - me@jeromec.com - fa4f4559.asc

Hi James,

With "pci=noats" set, suspend and resume work fine

I did see some errors ( something about device not added ) in the kernel
log from "kfd" but I guess that's related to PCIe ATS being disabled
with the kernel parameter set

Thanks

Jerome

Comment 39 Jerome C 2021-08-25 16:53:10 UTC
Created attachment 298473 [details]
signature.asc
Comment 40 James Zhu 2021-08-25 17:09:37 UTC
Hi Jerome,
Yes, you are right. Turning off ATS affects the IOMMU, and KFD needs the IOMMU enabled. KFD supports the compute engine. It won't affect 3D and video acceleration. After I confirm whether ATS/IOMMU causes the issue, I will find the right person to fix it.
Thanks!
James
Comment 41 kolAflash 2021-09-01 13:34:27 UTC
I can confirm Jerome's result.

Bug is gone with pci=noats.
(Debian-11 kernel 5.10.0-8-amd64)

I ran 50 suspend/standby rounds.
Also I used the notebook for 2 days and suspended it multiple times without issues.
Comment 42 James Zhu 2021-09-02 12:59:51 UTC
Hi Jerome and kolAflash,

Thanks for the confirmation. I have a workaround for this issue. But I hope I can find the root cause or a better workaround.

James
Comment 43 kolAflash 2021-09-02 14:20:53 UTC
(In reply to James Zhu from comment #42)
> Hi Jerome and kolAflash,
> 
> Thanks for confirmation. I have a workaround for this issue. But I wish I
> can find the root cause or better workaround.

Thanks too for your help James!

For me personally the situation is quite fine with pci=noats.
I'm sometimes using Qemu/KVM and VirtualBox. But no need for absolute bleeding edge VM performance. So I'll probably be fine with pci=noats.

However, I'd love to contribute to a fix for all users without kernel parameter stuff.
(including a fix in longterm Linux-5.10 for Debian)
So just tell me if I can help by doing more tests, sending logs, ... :-)
Comment 44 James Zhu 2021-09-02 21:24:11 UTC
Created attachment 298651 [details]
A workaround for suspend/resume hung issue

The VCN block passed all ring tests; usually the VCN goes idle within 1 second. Somehow it affected the later AMD IOMMU device resume, which is controlled by the KFD resume. This workaround gates the VCN block immediately once the ring test has passed.
It can fix the suspend/resume hang issue.

Hi kolAflash,
Please help check the WA in your setup. I will continue working on root cause.
thanks!
James
Comment 45 Jerome C 2021-09-03 06:52:28 UTC
Created attachment 298653 [details]
publickey - me@jeromec.com - fa4f4559.asc

Unfortunately this failed after 138 suspend/resume cycles


Thanks

Jerome

Comment 46 Jerome C 2021-09-03 06:52:29 UTC
Created attachment 298655 [details]
signature.asc
Comment 47 James Zhu 2021-09-03 11:54:41 UTC
Hi Jerome,
Thanks! I knew it would not be easy to judge whether this issue is fixed, since it occurs quite randomly. On my setup, this WA passed 5 runs of up to 300 suspend/resume cycles and 1 run of up to 3800 suspend/resume cycles.
But I doubt that it addresses the root cause, so I treated it as a WA. It seems it is not a WA for all systems, though.
James
Comment 48 Anthony Rabbito 2021-09-03 12:12:15 UTC
I'm also facing consistent crashes when waking up from the screen saver on a Radeon VII. This became more apparent with 5.14.0-rc7 and has made its way into 5.14.0. After the screens blank, waking up from sleep typically leaves artifacts on one screen, another screen will be frozen, and a third screen allows me to unlock via SDDM. I will attach kernel logs of a trace captured while this happens. Please let me know if I can assist in any way.
Comment 49 Anthony Rabbito 2021-09-03 12:13:33 UTC
Created attachment 298661 [details]
journalctl of amdgpu trace

(In reply to Anthony Rabbito from comment #48)
Comment 50 James Zhu 2021-09-03 12:23:52 UTC
Hi Anthony, 
Can you try the suggestion from comment #37 and see if it helps? But from the log that you attached, this is a different issue: the GFX hardware has lots of ECC errors, which cause a gfx ring timeout. After that, GPU recovery is triggered and, unfortunately, the screen goes blank. I think you need to create another ticket for your case.
Best Regards!
James
Comment 51 Arham Jain 2021-09-04 17:41:03 UTC
I can confirm that the issue I was having after trying to wake after suspend (Ryzen 3500u, Linux 5.14 RC7) has vanished after adding pci=noats to my boot parameters a few days ago. I've had this issue on every kernel since 5.10 (5.4 and 5.9 were fine for me for several months each, not sure what I used in between). Thank you so much James for posting this (and trying to fix it)!
Comment 52 James Zhu 2021-09-07 02:00:10 UTC
Created attachment 298691 [details]
Fix for S3 hung issue

Hi Jerome and kolAflash,

I think the IOMMU device init is put in the wrong place during resume. I've attached a patch. Please confirm if it works.
Thanks!
James
Comment 53 Anthony Rabbito 2021-09-07 02:32:01 UTC
Thanks for chiming in, James! A few things I've observed: since adding 'pci=noats', the graphic artifacts seem to happen way less. I did observe one lockup which required me to hard shut down the computer. This was a wake from suspend scenario.

I used to deal with somewhat similar issues here -- https://bugs.freedesktop.org/show_bug.cgi?id=110674 Not sure if that's of any use. Let me know if a fresh bug is warranted.
Comment 54 Jerome C 2021-09-07 06:27:06 UTC
Created attachment 298693 [details]
publickey - EmailAddress(s=me@jeromec.com) - 0xFA4F4559.asc

Hi James,

After 900 (600 on LLVM, 300 on GCC) suspend/resume cycles using kernel 5.14.1, compiled by LLVM 12.0.1 (LLVM_IAS unset during compiling) and again by GCC 11.1.0, there is no crash on resume, awesome. It usually fails between 1-150 suspend/resume cycles.

BRING ON THE RYZEN 6000 SERIES APU

Thanks

Jerome
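
The comment does not say how the suspend/resume cycles were driven. One possible way to automate such a test is sketched below, assuming util-linux rtcwake is available and run as root; suspend_cycles, SUSPEND_CMD and SETTLE_SECS are names invented for this sketch.

```shell
# Hypothetical stress loop: run COUNT suspend-to-RAM cycles, arming the RTC
# alarm so the machine wakes itself up after WAKE_AFTER seconds.
# SUSPEND_CMD can be overridden to dry-run the loop without suspending.
suspend_cycles() {
    count=$1
    wake_after=${2:-20}     # seconds to stay suspended per cycle
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i/$count"
        ${SUSPEND_CMD:-rtcwake} -m mem -s "$wake_after" || return 1
        # Give the desktop and GPU a moment to settle before the next cycle.
        sleep "${SETTLE_SECS:-10}"
        i=$((i + 1))
    done
}

# Example (as root): suspend_cycles 150 20
```

If the machine hangs on resume, the last "cycle N/M" line that reached a log or an SSH session shows how many cycles it survived.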





Comment 55 Jerome C 2021-09-07 06:27:07 UTC
Created attachment 298695 [details]
signature.asc
Comment 56 Jerome C 2021-09-07 07:47:52 UTC
damn, sorry for the ugly message layout replies

I didn't realize my e-mail provider was doing that
Comment 57 James Zhu 2021-09-07 11:02:13 UTC
(In reply to Anthony Rabbito from comment #53)
> Thanks for chiming in James! Few things I've observed since adding
> 'pci=noats' the graphic artifacts seem to happen way less. I did observe one
> lockup which required me to hard shut down the computer. This was a wake
> from suspend scenario. 
> 
> I used to deal with somwhat similar issues here --
> https://bugs.freedesktop.org/show_bug.cgi?id=110674 not sure if that's of
> any use. Let me know if a fresh bug is warranted.

Hi Anthony,

The S3 hang issue here always comes with the error: AMD-Vi: Event logged [IO_PAGE_FAULT...]. Bug 110674 doesn't have gfx ECC errors. Your case does have lots of them.
Can you share the whole dmesg after you added pci=noats?
Regards!
James
Comment 58 youling257 2021-09-20 10:47:13 UTC
"drm/amdgpu: move iommu_resume before ip init/resume" causes resume from suspend-to-disk to fail on my amdgpu 3400G.
Comment 59 James Zhu 2021-09-20 11:34:33 UTC
(In reply to youling257 from comment #58)
> drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk
> resume failed on my amdgpu 3400g.

Can you share the whole dmesg log? Regards! James
Comment 60 youling257 2021-09-20 14:43:08 UTC
Created attachment 298889 [details]
dmesg5.15.txt

(In reply to James Zhu from comment #59)
> (In reply to youling257 from comment #58)
> > drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk
> > resume failed on my amdgpu 3400g.
> 
> Can you share whole demsg log? Regards! James

When resume fails I have to force a shutdown, so how can I capture dmesg?
I only have the dmesg from the boot log.
Comment 61 James Zhu 2021-09-20 14:57:00 UTC
(In reply to youling257 from comment #60)
> Created attachment 298889 [details]
> dmesg5.15.txt
> 
> (In reply to James Zhu from comment #59)
> > (In reply to youling257 from comment #58)
> > > drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk
> > > resume failed on my amdgpu 3400g.
> > 
> > Can you share whole demsg log? Regards! James
> 
> when resume failed have to force shutdown, how to output dmesg?
> only has boot log dmesg.

After reboot, you can find it in /var/log/kern.log and /var/log/syslog, based on the timestamp. You can just attach kern.log.
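
A small sketch of pulling only the relevant lines out of such a log, assuming a syslog-style kern.log as on Debian; gpu_log_lines is a made-up helper name, and the pattern list is only a guess at what is useful for this bug.

```shell
# Hypothetical helper: keep only kernel messages from the GPU driver
# (amdgpu/amdkfd) and the AMD IOMMU (AMD-Vi, IO_PAGE_FAULT) in a log file.
gpu_log_lines() {
    grep -E 'amdgpu|amdkfd|AMD-Vi|IO_PAGE_FAULT' "$1"
}

# Example: gpu_log_lines /var/log/kern.log > amdgpu-resume.log
# With a persistent systemd journal, `journalctl -k -b -1` prints the
# kernel log of the previous boot and can be filtered the same way.
```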
Comment 62 youling257 2021-09-20 15:00:23 UTC
(In reply to James Zhu from comment #61)
> (In reply to youling257 from comment #60)
> > Created attachment 298889 [details]
> > dmesg5.15.txt
> > 
> > (In reply to James Zhu from comment #59)
> > > (In reply to youling257 from comment #58)
> > > > drm/amdgpu: move iommu_resume before ip init/resume cause suspend to
> disk
> > > > resume failed on my amdgpu 3400g.
> > > 
> > > Can you share whole demsg log? Regards! James
> > 
> > when resume failed have to force shutdown, how to output dmesg?
> > only has boot log dmesg.
> 
> after reboot, you can find under /var/log/kern.log and /var/log/syslog based
> on timestamp. you can just attach kern.log

My userspace is Android-x86, running with Linux 5.15 and Mesa 21 on amdgpu, so there is no /var/log.
I ran git bisect between Linux 5.15-rc1 and -rc2; the bad commit is "drm/amdgpu: move iommu_resume before ip init/resume".
Comment 63 James Zhu 2021-09-20 21:07:48 UTC
(In reply to youling257 from comment #62)
> (In reply to James Zhu from comment #61)
> > (In reply to youling257 from comment #60)
> > > Created attachment 298889 [details]
> > > dmesg5.15.txt
> > > 
> > > (In reply to James Zhu from comment #59)
> > > > (In reply to youling257 from comment #58)
> > > > > drm/amdgpu: move iommu_resume before ip init/resume cause suspend to
> > disk
> > > > > resume failed on my amdgpu 3400g.
> > > > 
> > > > Can you share whole demsg log? Regards! James
> > > 
> > > when resume failed have to force shutdown, how to output dmesg?
> > > only has boot log dmesg.
> > 
> > after reboot, you can find under /var/log/kern.log and /var/log/syslog
> based
> > on timestamp. you can just attach kern.log
> 
> my userspace is androidx86, running androidx86 with linux 5.15 and mesa21 on
> amdgpu, no /var/log.
> git bisect linux kernel 5.15rc1 and rc2, bad commit is drm/amdgpu: move
> iommu_resume before ip init/resume.

Can you check the CONFIG_HSA_AMD setting in .config? By the way, see if the link below helps you dump the error messages during resume: https://stackoverflow.com/questions/9682306/android-how-to-get-kernel-logs-after-kernel-panic
Comment 64 youling257 2021-09-21 03:56:08 UTC
Created attachment 298899 [details]
config-5.15.0-rc2-android-x86_64+

CONFIG_HSA_AMD=y
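
For reference, a minimal sketch of how a single option can be looked up in a kernel config file; kernel_config_value is a name invented here, and the file locations in the comments are the usual ones but may differ per distribution.

```shell
# Hypothetical helper: print how option $2 is set in kernel config file $1.
# Matches both "CONFIG_FOO=..." and the "# CONFIG_FOO is not set" form.
kernel_config_value() {
    grep -E "^($2=|# $2 is not set)" "$1"
}

# Example, in a kernel build tree:
#   kernel_config_value .config CONFIG_HSA_AMD
# For an installed Debian kernel the config is usually /boot/config-$(uname -r).
```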
Comment 65 youling257 2021-09-21 04:04:58 UTC
(In reply to James Zhu from comment #63)
> (In reply to youling257 from comment #62)
> > (In reply to James Zhu from comment #61)
> > > (In reply to youling257 from comment #60)
> > > > Created attachment 298889 [details]
> > > > dmesg5.15.txt
> > > > 
> > > > (In reply to James Zhu from comment #59)
> > > > > (In reply to youling257 from comment #58)
> > > > > > drm/amdgpu: move iommu_resume before ip init/resume cause suspend
> to
> > > disk
> > > > > > resume failed on my amdgpu 3400g.
> > > > > 
> > > > > Can you share whole demsg log? Regards! James
> > > > 
> > > > when resume failed have to force shutdown, how to output dmesg?
> > > > only has boot log dmesg.
> > > 
> > > after reboot, you can find under /var/log/kern.log and /var/log/syslog
> > based
> > > on timestamp. you can just attach kern.log
> > 
> > my userspace is androidx86, running androidx86 with linux 5.15 and mesa21
> on
> > amdgpu, no /var/log.
> > git bisect linux kernel 5.15rc1 and rc2, bad commit is drm/amdgpu: move
> > iommu_resume before ip init/resume.
> 
> Can you check CONFIG_HSA_AMD setting in .config? By the way , see if the
> below link help you dump the  error message during resume.
> https://stackoverflow.com/questions/9682306/android-how-to-get-kernel-logs-
> after-kernel-panic

Do you see the kernel command line in my dmesg? "memmap=1M!5M ramoops.mem_size=1048576 ramoops.ecc=1 ramoops.mem_address=0x00500000 ramoops.console_size=16384 ramoops.ftrace_size=16384 ramoops.pmsg_size=16384 ramoops.record_size=32768".

If a kernel panic causes a reboot, I can get /sys/fs/pstore/console-ramoops-0 and /sys/fs/pstore/pmsg-ramoops-0.
But when resume fails, I have to press the power button to force a shutdown, and nothing is recorded.
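
For the panic-and-reboot case, where pstore records do exist, collecting them could look like the sketch below. save_pstore is a hypothetical helper, and the destination path in the example is arbitrary. It does not help with the hard hang described above, where no records are written.

```shell
# Hypothetical helper: copy all records from pstore directory $1 into $2.
# Returns non-zero when the pstore directory holds no records.
save_pstore() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    found=1
    for f in "$src"/*; do
        [ -f "$f" ] || continue       # skips the unexpanded glob when empty
        cp "$f" "$dst"/ && found=0
    done
    return "$found"
}

# Example: save_pstore /sys/fs/pstore /data/pstore-dump
```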
Comment 66 youling257 2021-09-21 04:53:00 UTC
Video recording of the failed resume: https://drive.google.com/drive/folders/1bWMC4ByGvudC9zBk-9Xgamz-shir0pqX?usp=sharing
Comment 67 James Zhu 2021-09-21 14:32:11 UTC
(In reply to youling257 from comment #66)
> resume failed record video,
> https://drive.google.com/drive/folders/1bWMC4ByGvudC9zBk-9Xgamz-
> shir0pqX?usp=sharing

Can you try applying this patch? https://lore.kernel.org/all/20210920163922.313113287@linuxfoundation.org/
Comment 68 youling257 2021-09-21 17:43:22 UTC
(In reply to James Zhu from comment #67)
> (In reply to youling257 from comment #66)
> > resume failed record video,
> > https://drive.google.com/drive/folders/1bWMC4ByGvudC9zBk-9Xgamz-
> > shir0pqX?usp=sharing
> 
> Can you try apply this patch: 
> https://lore.kernel.org/all/20210920163922.313113287@linuxfoundation.org/?

Linux kernel 5.15-rc1 is good: resume from suspend-to-disk succeeds.
Linux kernel 5.15-rc2 is bad: resume from suspend-to-disk fails.
With "drm/amdgpu: move iommu_resume before ip init/resume" reverted, resume from suspend-to-disk succeeds.

Linux kernel 5.15-rc2 already has "drm/amdkfd: separate kfd_iommu_resume from kfd_resume", so why do you suggest I apply the patch?
Comment 70 James Zhu 2021-09-21 18:02:02 UTC
My mistake. Can you try adding pci=noats to the boot parameters?
Comment 71 youling257 2021-09-21 18:29:31 UTC
(In reply to James Zhu from comment #70)
> My mistaake. Can you try add pci=noats in boot parameters?

No help, resume still fails.
Comment 72 Jerome C 2021-09-22 13:59:43 UTC
Hi James,

I noticed the patch that you asked us to try in comment 52 was also submitted to kernel 5.14.7.

Tested it, all is good for now.

Thanks

Jerome
Comment 74 kolAflash 2021-11-23 09:31:47 UTC
@James Zhu

Tested 5.15.2 for over a week and more than 50 standby-wakeups.
No problems!
Thanks :-)

I would be happy about a patch for the 5.10 longterm kernel.
The bug became a problem with v5.10-rc3 (see comment 14), just before Debian made 5.10-longterm the Debian-11 kernel. So it would be great if I and probably other Debian-11 users could finally use that AMD GPU without workarounds.
Comment 75 James Zhu 2021-11-23 13:28:23 UTC
(In reply to kolAflash from comment #74)
> @James Zhu
> 
> Tested 5.15.2 for over a week and more than 50 standby-wakeups.
> No problems!
> Thanks :-)
> 
> I would be happy about a patch for the 5.10 longterm kernel.
> The bug became a problem with v5.10-rc3 (see comment 14), just before Debian
> made 5.10-longterm the Debian-11 kernel. So it would be great if I and
> probably other Debian-11 users could finally use that AMD GPU without
> workarounds.

Hi @Alex Deucher, Can you help on this request? thanks! James
Comment 76 Alex Deucher 2021-11-23 20:44:05 UTC
(In reply to James Zhu from comment #75)
> (In reply to kolAflash from comment #74)
> > @James Zhu
> > 
> > Tested 5.15.2 for over a week and more than 50 standby-wakeups.
> > No problems!
> > Thanks :-)
> > 
> > I would be happy about a patch for the 5.10 longterm kernel.
> > The bug became a problem with v5.10-rc3 (see comment 14), just before
> Debian
> > made 5.10-longterm the Debian-11 kernel. So it would be great if I and
> > probably other Debian-11 users could finally use that AMD GPU without
> > workarounds.
> 
> Hi @Alex Deucher, Can you help on this request? thanks! James

I cc'ed stable with the patches so they should show up in 5.10 assuming they apply cleanly.  If not, can you look at what it would take to backport them?
Comment 77 James Zhu 2021-11-24 03:22:56 UTC
Created attachment 299697 [details]
backport patch for 5.10 stable.

Hi @kolAflash, before I send them out for public review, could you help test them? Thanks so much! James
Comment 78 kolAflash 2021-11-25 18:34:43 UTC
(In reply to James Zhu from comment #77)
> Created attachment 299697 [details]
> backport patch for 5.10 stable.
> 
> Hi @kolAflash, before I send out them to public for review,. could you help
> take a test? Thanks so much! James

Thanks for the patch! :-)

make is currently running and I'll conduct some tests in the next days.
Comment 79 kolAflash 2021-11-25 18:58:10 UTC
@James

Got this when compiling with Linux-5.10.81:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c: In function ‘kgd2kfd_device_init’:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:754:6: error: implicit declaration of function ‘kgd2kfd_resume_iommu’; did you mean ‘kgd2kfd_resume_mm’? [-Werror=implicit-function-declaration]
  754 |  if (kgd2kfd_resume_iommu(kfd))
      |      ^~~~~~~~~~~~~~~~~~~~
      |      kgd2kfd_resume_mm


Patching 5.10.81 went without problems:

$ patch -p1 -i ../../backport_patch/0001-drm-amdkfd-separate-kfd_iommu_resume-from-kfd_resume.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
patching file drivers/gpu/drm/amd/amdkfd/kfd_device.c

$ patch -p1 -i ../../backport_patch/0002-drm-amdgpu-add-amdgpu_amdkfd_resume_iommu.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h

$ patch -p1 -i ../../backport_patch/0003-drm-amdgpu-move-iommu_resume-before-ip-init-resume.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

$ patch -p1 -i ../../backport_patch/0004-drm-amdgpu-init-iommu-after-amdkfd-device-init.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

$ patch -p1 -i ../../backport_patch/0005-drm-amdkfd-fix-boot-failure-when-iommu-is-disabled-i.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
patching file drivers/gpu/drm/amd/amdkfd/kfd_device.c
Comment 80 James Zhu 2021-11-25 20:48:21 UTC
Hi @kolAflash,
I applied those patches on
(https://github.com/gregkh/linux.git  linux-5.10.y  f884bb85b8d877d4e0c670403754813a7901705b)
(https://github.com/gregkh/linux.git  linux-5.12.y  0e6f651912bdd027a6d730b68d6d1c3f4427c0ae).
I didn't see any compile issue.

Can you share your .config?

James
Comment 81 kolAflash 2021-11-26 04:04:04 UTC
Created attachment 299721 [details]
Linux kernel make .config

@James

Compiling v5.10.80 (f884bb85b8d877d4e0c670403754813a7901705b) with the provided patch results in the same error.

I attached my Linux kernel make .config.

Compilation platform is Debian-11.1.0.
Comment 82 James Zhu 2021-11-26 16:37:11 UTC
Hi @kolAflash,

I don't have any issue with your .config on Ubuntu 20.04.

From the source code, it should be fine.

$ grep -rn  "kgd2kfd_resume_iommu"  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
309:int kgd2kfd_resume_iommu(struct kfd_dev *kfd);

$ grep -rn  "amdgpu_amdkfd.h\|kgd2kfd_resume_iommu"  drivers/gpu/drm/amd/amdkfd/kfd_device.c
31:#include "amdgpu_amdkfd.h"
604:	kfd->pci_atomic_requested = amdgpu_amdkfd_have_atomics_support(kgd);
>>>>792:        if (kgd2kfd_resume_iommu(kfd))
940:int kgd2kfd_resume_iommu(struct kfd_dev *kfd)


It looks like we are using different 5.10 trees. Should we use 5.10 stable as the base for these backport patches?
>>>>754 |  if (kgd2kfd_resume_iommu(kfd))
      |      ^~~~~~~~~~~~~~~~~~~~
      |      kgd2kfd_resume_mm
Best Regards!
James
Comment 83 kolAflash 2021-11-27 12:14:49 UTC
Hi James,

(In reply to James Zhu from comment #82)
> [...]
> $ grep -rn  "amdgpu_amdkfd.h\|kgd2kfd_resume_iommu" 
> drivers/gpu/drm/amd/amdkfd/kfd_device.c
> 31:#include "amdgpu_amdkfd.h"
> 604:  kfd->pci_atomic_requested = amdgpu_amdkfd_have_atomics_support(kgd);
> >>>>792:        if (kgd2kfd_resume_iommu(kfd))
> 940:int kgd2kfd_resume_iommu(struct kfd_dev *kfd)

the line numbers you're quoting are for Linux v5.12.19 (0e6f651912bdd027a6d730b68d6d1c3f4427c0ae) + the attachment-299697 patch.


> Looks we are using different 5.10, should we use 5.10 stable for adding this
> backport patches?. 
> >>>>754 |  if (kgd2kfd_resume_iommu(kfd))
>       |      ^~~~~~~~~~~~~~~~~~~~
>       |      kgd2kfd_resume_mm

I'm testing with Linux v5.10.80 (f884bb85b8d877d4e0c670403754813a7901705b) + the attachment-299697 patch.
And there it's line number 754.
Comment 84 kolAflash 2021-11-27 13:03:17 UTC
@James

I was able to compile!

Looks like this was some fault of mine.
(I'm usually building out of source directory and did something wrong...)

Now I'm testing the current v5.10.82 with the provided attachment 299697 [details] patches.
Comment 85 kolAflash 2021-11-29 19:21:52 UTC
(In reply to James Zhu from comment #77)
> Created attachment 299697 [details]
> backport patch for 5.10 stable.
> 
> Hi @kolAflash, before I send out them to public for review,. could you help
> take a test? Thanks so much! James

Works excellent!

Tested with Linux-5.10.82 on Debian-11.
Comment 86 James Zhu 2021-11-29 19:53:23 UTC
Hi @kolAflash, thanks so much for your effort on this verification!
Would you mind also applying those patches on 5.12 stable to check?
They should merge automatically.  Thanks! James
Comment 87 kolAflash 2021-12-04 22:29:05 UTC
(In reply to James Zhu from comment #86)
> Hi @kolAflash, thanks so much for your effort on this verification!
> Would you mind help apply those patches on 5.12 stable to check also?
> it should be automatically merged.  Thanks! James

I've been testing Linux-5.12.19 with the patch from attachment 299697 [details] since 2021-12-02.
So far everything works fine.
Comment 88 kolAflash 2022-01-23 13:54:55 UTC
Debian-11 just got a kernel security update, giving me Linux-5.10.92.

https://snapshot.debian.org/package/linux-signed-amd64/5.10.92%2B1/#linux-image-5.10.0-11-amd64_5.10.92-1

Since rebooting into that kernel I got no more crashes after waking from s2ram.
(not using pci=noats or any other workarounds)


Conclusion: Everything fixed!
Thanks a lot to everyone involved :-)