Bug 211277
Description
kolAflash
2021-01-19 10:25:30 UTC
Created attachment 294747 [details]
kern.log
Created attachment 294749 [details]
BIOS update history (just in case someone has a clue if something looks suspicious and this might not be a Linux problem)
I searched through my journalctl log.
I set up the whole system in May 2020 with Linux-5.6.7. (journalctl has everything back to that date)

The bug has appeared as follows since October and Linux-5.8. So Linux-5.8 was also affected (contradicting my original post). I used the system nearly every day and always use s2ram (never shutting down, only rebooting when needed for updates). So this can be seen statistically.
- 2020-10-21 with Linux-5.8.14 (Debian 5.8.0-3, installed after 2020-09-26)
- 2020-12-11 with Linux-5.9.11 (Debian 5.9.0-4, installed 2020-12-04)
- 2020-12-25 with Linux-5.9.11
- 2021-01-13 with Linux-5.10.4 (Debian 5.10.0-1, installed 2021-01-10)
- 2021-01-16 with Linux-5.10.4
- 2021-01-19 with Linux-5.10.4

So the bug didn't appear with Linux <= 5.7. And the bug's frequency increased with Linux-5.10.

In parallel I'm still trying to rule out other factors. (BIOS updates, other software changes, ...)
Something significant might be that Debian used GCC-9 for Linux-5.7, and starting with Linux-5.8 GCC-10 was used.

I too have a Ryzen 5 3500U and random resumes where the screen updates are very slow (1 frame change every 1-2 minutes), which looks like it has crashed, and in the kernel logs I see a bunch of "flip_done timed out" and "amdgpu_dm_atomic_commit_tail" errors.
This never happened for me between 5.4.6 - 5.9.14. I have noticed this since 5.10.4 and never suspended on 5.10.0 - 5.10.3, so my guess is that the issue appeared sometime in 5.10.0 - 5.10.3.

Do you have the kernel parameter "init_on_free=1" set, or "CONFIG_INIT_ON_FREE_DEFAULT_ON=y" in your kernel config? If so, try setting the kernel parameter "init_on_free=0". So far (for me, and still testing) it has resumed every time.
I think it's an issue with amdgpu and the kernel parameter "init_on_free=1" or kernel config "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", which zeroes memory on free/deallocation.
The kernel parameter "init_on_alloc=1" or kernel config "CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y" works fine for me.

Created attachment 294879 [details]
Kernel log
Unfortunately it crashed again although I've noticed it's been crashing a lot less (4-5 days) since I set kernel parameter "init_on_free=0".
I've attached a kernel log for 5.10.10
(In reply to Jerome C from comment #4)
> [...]
> Do you have kernel parameter set "init_on_free=1" or in your kernel config
> "CONFIG_INIT_ON_FREE_DEFAULT_ON=y", [...]

I'm using the Debian-11 (Testing / Bullseye) standard kernel.

$ grep -i init_on_free /boot/config-5.10.0-2-amd64
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set

ok, you have it turned off already
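The build-time option only sets the default; an explicit init_on_free= boot parameter overrides it, so both places need to be looked at. A minimal sketch of that check follows. The function name is hypothetical, and it only understands the 0/1 spellings used in this thread.

```shell
# Print "on" or "off" for init_on_free.
# $1 = kernel command line text, $2 = path to a kernel config file.
init_on_free_state() (
    set -f                      # no glob expansion while splitting the command line
    state=
    for word in $1; do
        case $word in
            init_on_free=1) state=on ;;    # a later occurrence overrides an earlier one
            init_on_free=0) state=off ;;
        esac
    done
    if [ -z "$state" ]; then
        # No boot parameter given: fall back to the build-time default.
        if grep -q '^CONFIG_INIT_ON_FREE_DEFAULT_ON=y' "$2" 2>/dev/null; then
            state=on
        else
            state=off
        fi
    fi
    echo "$state"
)
```

Typical use would be: init_on_free_state "$(cat /proc/cmdline)" "/boot/config-$(uname -r)"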
Weird thing happened this morning... I woke my laptop up and it had slow screen updates... I just closed my laptop lid, frustrated... I noticed it suspended again... I opened my laptop again and it resumed
I looked in my kernel logs and saw the error messages from the first resume
NOTE: only copied the error messages
> [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR*
> [CRTC:62:crtc-0] flip_done timed out
> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR*
> [CRTC:62:crtc-0] flip_done timed out
> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR*
> [CONNECTOR:73:eDP-1] flip_done timed out
> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR*
> [PLANE:52:plane-3] flip_done timed out
but on the second resume... no warnings or errors
I think it's a bug somewhere between suspension and resuming
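To compare a bad resume with a good one, it can help to summarise which display objects reported the timeout. A small sketch, assuming the kernel log is piped in on stdin and the messages look like the ones quoted above; the function name is made up for illustration, and grep -o is a common extension rather than strict POSIX.

```shell
# Print one line per "[TYPE:id:name]" tag followed by how often it timed out.
flip_timeout_summary() {
    # Keep only the tag plus message, drop the message, then count per tag.
    grep -o '\[[A-Z]*:[0-9]*:[^]]*] flip_done timed out' |
        sed 's/ flip_done timed out$//' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example: dmesg | flip_timeout_summary (or journalctl -k -b | flip_timeout_summary on systemd machines). An empty result means no such message was logged.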
I've tried kernel 5.11-rc5 and the same issue occurs there.
For now I've downgraded the kernel to 5.9.14 (will update it to 5.9.16) until this issue is fixed.
What I mentioned in comment 4 isn't really helping, I think.
Sometimes the issue happens frequently in a day, but other times it could be a few days before it happens again.

I've been on Linux-5.7 since 2021-01-26. And I have woken up the notebook at least once a day since then.
So it's clearly a regression in the kernel somewhere between 5.7 and 5.10 and probably between 5.7 and 5.8.
And it's definitely not a BIOS issue, because I haven't changed anything about the BIOS since the problem last appeared with Kernel-5.10.
Regards,
kolAflash

(In reply to Alex Deucher from comment #10)
> Can you bisect?
> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

I will try to. But it will definitely need some time and may not be possible at all, because the bug cannot be reproduced completely deterministically.

I've tried doing a bisect using this script. Unfortunately I couldn't reproduce the bug this way. So bisecting will take a lot longer.

for i in {0..19}; do
  echo -e "\n${i}"
  /usr/sbin/rtcwake --seconds 15 --mode no
  systemctl start suspend.target
  sleep 15
done

(In reply to kolAflash from comment #12)
> I've tried doing a bisect using this script. Unfortunately I couldn't
> reproduce the bug this way. So I bisecting will take a lot longer.
>
> for i in {0..19}; do
>   echo -e "\n${i}"
>   /usr/sbin/rtcwake --seconds 15 --mode no
>   systemctl start suspend.target
>   sleep 15
> done

Hiya
I did some testing myself recently and unfortunately doing 20 tests was not enough for me. I found that it could take 50 - 100 resumes before it would fail, so I capped mine at 150 resumes; there were too many times where things looked fine for me with fewer than 50.
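A rough way to see why 20 cycles can look clean: if one resume in 50 fails on average (the lower end of the range above) and failures are independent, the chance of n clean resumes in a row is (1 - p)^n. Both the 1-in-50 figure and the independence are assumptions for illustration only. The sketch below puts that arithmetic next to a longer test loop that stops at the first resume showing the error signature; the function names are hypothetical, and rtcwake and dmesg typically need root.

```shell
# Chance of n clean resumes in a row if each one fails independently
# with probability p. $1 = p, $2 = n.
chance_all_clean() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", (1 - p) ^ n }'
}

# Succeeds if the kernel log text on stdin contains the error signature.
resume_looks_broken() {
    grep -q 'flip_done timed out'
}

# Suspend to RAM up to $1 times (default 150) and stop at the first bad resume.
# Nothing is suspended until this function is called explicitly.
run_cycles() {
    cycles=${1:-150}
    i=1
    while [ "$i" -le "$cycles" ]; do
        echo "cycle $i of $cycles"
        rtcwake -m mem -s 15 || return 2      # sleep, wake via RTC after 15 seconds
        if dmesg | resume_looks_broken; then
            echo "error signature present after cycle $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no error signature after $cycles cycles"
}
```

With these assumptions, chance_all_clean 0.02 20 prints 0.67 and chance_all_clean 0.02 150 prints 0.05, so a clean run of 20 says little while a clean run of 150 is much stronger evidence. Because dmesg covers the whole boot, run_cycles should be started on a freshly booted system, otherwise an older error would stop it at the first cycle.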
Afterwards I tested kernels between 5.10.4 and 5.11-rc5 (I didn't use 5.10.0 to 5.10.3) and found that this commit
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=a10aad137326d137a969fc6cc3555992b99ff9fc
was causing the issue for me.

(In reply to Jerome C from comment #13)
I don't get how you got to your results. There's no straight path from 5.10.4 to 5.11-rc5, as they are on different branches (5.10.y and master).

Nevertheless, your result may be reasonable from the point of view of the git history.
I'm not sure about the commit ID a10aad137, but it has a completely identical twin commit c6d2b0fbb (also removing AMD_PG_SUPPORT_VCN_DPG from that expression).
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c6d2b0fbb893d5c7dda405aa0e7bcbecf1c75f98
And c6d2b0fbb was applied between v5.10-rc2 and v5.10-rc3 (a10aad137 is only in master).
So if c6d2b0fbb (a.k.a. a10aad137) is responsible, this explains why I started noticing the problem when Debian-Testing went from Linux-5.9 to Linux-5.10.

I'm now running a 5.10.21 kernel where I reverted c6d2b0fbb. And I'll try using this kernel for at least one week and also run some iterative tests with it.

Regarding reproduction in general:
I really wonder what triggers this bug. I didn't go so far as to run more than 50 tests (sleep-wake iterations). In particular I didn't try more than 50 because the bug definitely appeared more often when it happened under "natural" (non-testing) circumstances.
Some test series I did which are hard to make sense of statistically:
I tried 20 tests and nothing happened. A few minutes later I decided to try 50 more tests and it failed right on the first one. So I had to reboot, tried 50 tests again and nothing happened. Afterwards I put my notebook into s2ram and when I woke it the next day it immediately crashed.

By the way, the two times it crashed recently (see above) happened with a kernel I compiled from clean kernel.org sources.
Also I never experienced the bug with a clean 5.8.18 compiled from kernel.org, running on the same system for about a week. So I'm quite convinced it's nothing Debian specific.

(In reply to Alex Deucher from comment #10)
> Can you bisect?
> https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

I've done several s2ram-wakeup cycles (100 automatic and about three manual wakeups/day) with the kernel I compiled on 2021-03-07. It's based on 5.10.21 with c6d2b0fbb reverted. (as suggested by Jerome)
Result: No crashes.
This looks very promising!
@Alex
Can I help with anything else to solve this?

I also compiled 5.10.21 without reverting c6d2b0fbb, tested it for a few hours and got three wakeup-crashes.

@Alex
Any progress on this?
If there's no perfect way to fix this, what about an option to turn this behaviour on/off? A module option that can be changed at runtime would be ideal, so it can be set right before suspending. But a kernel boot parameter would be fine too.
P.S. Would someone be so kind as to set this bug to "confirmed"?

I don't think we've been able to reproduce it. That said, we did double check the programming sequences and I believe it may be fixed with these patches:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/71efc8701a47aa9e3de74bab06020da81757893f
https://gitlab.freedesktop.org/agd5f/linux/-/commit/a8f768874aaf751738a2e0350bf2e70085f93ace

(In reply to Alex Deucher from comment #17)
> I don't think we've been able to reproduce it. That said, we did double
> check the programming sequences and I believe it may be fixed with these
> patches:
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/71efc8701a47aa9e3de74bab06020da81757893f
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/a8f768874aaf751738a2e0350bf2e70085f93ace

I've tried these two commits and the issue is still there, unfortunately.

Created attachment 296841 [details]
to fix suspend/resume hung issue
Hi @kolAflash and @jeromec, Can you help check if this patch can fix the issue? Since we can't reproduce at our side. Thanks! James
(In reply to James Zhu from comment #19)
> Created attachment 296841 [details]
> to fix suspend/resume hung issue
>
> Hi @kolAflash and @jeromec, Can you help check if this patch can fix the
> issue? Since we can't reproduce at our side. Thanks! James

No, this doesn't work for me.
I'm curious how exactly you are trying to reproduce this.
I start Xorg using the command "startx".
Xorg is running with LXQT.
I start "Konsole", a GUI terminal, and execute the following:
"for i in $(seq 1 150); do echo $i; sudo rtcwake -s 7 -m mem; done"

Hi Jeromec,
to isolate the cause, can you help run two experiments separately?
1. Run suspend/resume without launching Xorg, just in text mode.
2. Disable video acceleration (VCN IP). I need you to share the whole dmesg log after loading the amdgpu driver. I think running modprobe with ip_block_mask=0x0ff should disable the VCN IP for VCN1. (You can find words in dmesg that tell you whether the VCN IP is disabled or not.)
Thanks!
James

@James
What do you mean by video acceleration? Is this about 3D / DRI acceleration like in video games? Or do you mean just "video" playback (movie, mp4, webm, h264, vp8, ...) acceleration?
And I don't completely understand what ip_block_mask=0x0ff is supposed to do. I just rebooted with that kernel parameter added and 3D acceleration (DRI) is still working.

----

I'm planning to run these kernels in the next days:
1. Current Debian testing Linux-5.10.0-6 with ip_block_mask=0x0ff, Xorg and 3D acceleration in daily use.
2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg and with 3D acceleration in daily use.
3. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg, but without 3D acceleration** in daily use.
4. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff and without Xorg, doing some standby cycles for testing.

If I encounter any crash I'll post the whole dmesg starting with the boot output.

----

* amd-drm-next-5.14-2021-05-12
https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-drm-next-5.14-2021-05-12
ae30d41eb

** Is there something special I should do to turn off acceleration? Or should I just not start any application doing 3D / DRI acceleration? (The latter might be difficult. I would have to keep an eye on every application like Firefox, Atom, VLC, KWin/KDE window manager, ... so that it doesn't use DRI.)

Hi kolAflash,
The VCN IP is for video acceleration (for video playback). If the VCN IP doesn't handle the suspend/resume process properly, we do observe other IP blocks being affected. In your case it is display IP (dm) related.
ip_block_mask=0xff (in grub it should be amdgpu.ip_block_mask=0x0ff) can disable the VCN IP during amdgpu driver loading, so this experiment can tell whether this dm error is caused by the VCN IP or not.
Sometimes /sys/kernel/debug/dri/0/amdgpu_fence_info can provide some useful information if there is a chance to dump it.
These experiments can help identify which IP causes the issue, so we can find an expert in that area to continue the triage.
Your current report is case 2, so it can be replaced with
2. amd-drm-next-5.14-2021-05-12* with ip_block_mask=0x0ff, with Xorg and without 3D acceleration in daily use.
I suggest you execute your test plan in the order 4->3->2->1.
Thanks!
James

(In reply to James Zhu from comment #21)
> Hi Jeromec, to isolate the cause, can you help run two experiments
> separately?
> 1. To run suspend/resume without launching Xorg, just on text mode.
> 2. To disable video acceleration (VCN IP). I need you share me the whole
> dmesg log after loading amdgpu driver. I think basically running modprobe
> with ip_block_mask=0x0ff should disable vcn ip for VCN1.(you can find words
> in dmesg to tell you if vcn ip is disabled or not).
>
> Thanks!
> James

1) In text mode, VCN enabled: the suspend issues are still there.
2) I see the message confirming that VCN is disabled.
In text mode, VCN disabled: the suspend issues are gone.
After starting Xorg, VCN disabled: the suspend issues are gone.
I'll gather the logs soon (sometime tomorrow).

I forgot to mention... I'm on kernel 5.13.4

(In reply to Jerome C from comment #25)
> I forgot to mention... I'm on kernel 5.13.4

5.12.4 I mean

Hi Jeromec,
thanks for your feedback. Can you also add the modprobe option drm.debug=0x1ff? I need logs: the case 1 dmesg and /sys/kernel/debug/dri/0/amdgpu_fence_info (if you can).
James.

Created attachment 296877 [details]
AMDGPU fence info

(In reply to James Zhu from comment #27)
> Hi Jeromec, thanks for your feedback, can you also add drm.debug=0x1ff
> modprobe? I need log: case 1 dmesg and
> /sys/kernel/debug/dri/0/amdgpu_fence_info (if you can). James.

I've tested text mode and gui/drm mode with "drm.debug=0x1ff" set and found no crashes... When "drm.debug=0x1ff" is unset, the crashes/timeouts are back... I think this is why you're unable to reproduce the problem... I've never known debug option(s) to remove issue(s)... oh well

I've added the contents of the file "/sys/kernel/debug/dri/0/amdgpu_fence_info". The file contains 4 different boot states (vcn on/off, drm debug on/off), clearly marked/separated in the attached file.

I'm using 5.12.5 now but I also tried this on 5.12.4.
Usually the crashes happen within 50 suspends/resumes, but today I left it to do over 2000 suspends/resumes just to make sure...

I know you asked for a log, but I spent so much time on this (and other things too) that it wasn't on my mind, so I'll get that by Friday, if you still need it of course.
thanks

Hi Jeromec,
I think turning on debug changes the timing a little bit. A log without debug info can't give me any help. The amdgpu_fence_info looks good for all cases. This issue is possibly device specific.

Created attachment 296891 [details]
all kernel messages with ip_block_mask=0x0ff (Debian kernel 5.10.0-6)
Also crashes with ip_block_mask=0x0ff
Tested with the current Debian Testing kernel 5.10.0-6.
I attached all kernel messages from /var/log/messages from boot to crash.
I think that should be the dmesg output.
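The kernel command line is a list of space-separated words, and a parameter meant for a loadable module is normally written with the module name in front (amdgpu.ip_block_mask=0x0ff rather than ip_block_mask=0x0ff). A small sketch to check which spelling actually reached the running kernel; the function name is hypothetical.

```shell
# Succeeds if the exact word $1 appears in the command line text $2.
cmdline_has() (
    set -f                      # no glob expansion while splitting the command line
    for word in $2; do
        [ "$word" = "$1" ] && exit 0
    done
    exit 1
)
```

For example: cmdline_has amdgpu.ip_block_mask=0x0ff "$(cat /proc/cmdline)" && echo "prefixed form is set"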
(In reply to kolAflash from comment #30)
> Created attachment 296891 [details]
> all kernel messages with ip_block_mask=0x0ff (Debian kernel 5.10.0-6)
>
> Also crashes with ip_block_mask=0x0ff
> Tested with the current Debian Testing kernel 5.10.0-6.
>
> I attached all kernel messages from /var/log/messages from boot to crash.
> I think that should be the dmesg output.

hiya, you may not know this, but use "amdgpu.ip_block_mask=0x0ff" and not "ip_block_mask=0x0ff"
"ip_block_mask=0x0ff" will only apply to linux
"amdgpu.ip_block_mask=0x0ff" will only apply to the amdgpu module
I can see in your kernel logs that VCN is still enabled

Created attachment 296901 [details]
dmesg via SSH, running amd-drm-next-5.14-2021-05-12 without ip_block_mask=0x0ff and with Xorg

(In reply to Jerome C from comment #31)
> [...]
> hiya, you may not know this but use in "amdgpu.ip_block_mask=0x0ff" and not
> "ip_block_mask=0x0ff"
> [...]
> I can see in your kernel logs that VCN is still enabled

Oops, you're right.
I know someone wrote that before. But it seems I somehow missed it while editing my Grub parameters.

I'll give it another try!

----

In the meantime I performed test number 2.

> 2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg [...]

This time the crash was very different!

After some minutes (about 3) the graphical screen actually turned back on. I'm pretty sure that didn't happen with the other kernels I tested. (never tested amd-drm-next-5.14-2021-05-12 before)

Nevertheless everything graphical is lagging extremely. If I move the mouse or do anything else it takes more than 10 seconds until something happens on the screen.

On the other hand SSH access works smoothly. And I was able to save the dmesg output. (see attachment)
Unlocking the screen via SSH (loginctl) or starting graphical programs (DISPLAY=:0 xterm) works, but is extremely slow too. (> 10 seconds waiting)

(In reply to kolAflash from comment #32)
> In the meanwhile I performed test number 2.
>
> > 2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg
> [...]
>
> This time the crash was very different!
>
> After some minutes (about 3) the graphical screen actually turned back on.
> I'm pretty sure that didn't happen with the other kernels I tested.
> (never tested amd-drm-next-5.14-2021-05-12 before)
>
> Nevertheless everything graphical is lagging extremely. If I move the mouse
> or do anything else it takes more than 10 seconds until something happens on
> the screen.
>
> On the other hand SSH access is smoothly possible. And I was able to save
> the dmesg output. (see attachment)
> Unlocking the screen via SSH (loginctl) or starting graphical programs
> (DISPLAY=:0 xterm) works, but is extremely slow too. (> 10 seconds waiting)

I experienced this lag too, although I didn't try the SSH thing (I don't have it set up).
Using 5.13.0 now and the issue is still here.

(In reply to kolAflash from comment #32)
> Created attachment 296901 [details]
> dmesg via SSH, running amd-drm-next-5.14-2021-05-12 without
> ip_block_mask=0x0ff and with Xorg
>
> (In reply to Jerome C from comment #31)
> > [...]
> > hiya, you may not know this but use in "amdgpu.ip_block_mask=0x0ff" and not
> > "ip_block_mask=0x0ff"
> > [...]
> > I can see in your kernel logs that VCN is still enabled
>
> Ooops you're right.
> I know someone wrote that before. But it seems I somehow missed it while
> editing my Grub parameters.
>
> I'll give it another try!
>
> ----
>
> In the meanwhile I performed test number 2.
>
> > 2. amd-drm-next-5.14-2021-05-12* without ip_block_mask=0x0ff, with Xorg
> [...]
>
> This time the crash was very different!
>
> After some minutes (about 3) the graphical screen actually turned back on.
> I'm pretty sure that didn't happen with the other kernels I tested.
> (never tested amd-drm-next-5.14-2021-05-12 before)
>
> Nevertheless everything graphical is lagging extremely. If I move the mouse
> or do anything else it takes more than 10 seconds until something happens on
> the screen.
>
> On the other hand SSH access is smoothly possible. And I was able to save
> the dmesg output. (see attachment)
> Unlocking the screen via SSH (loginctl) or starting graphical programs
> (DISPLAY=:0 xterm) works, but is extremely slow too. (> 10 seconds waiting)

Do you have any updates since you corrected the kernel parameter?

Created attachment 298193 [details]
/var/log/kern.log running amd-drm-next-5.14-2021-05-12 (ae30d41eb) with Xorg

Sorry for the long delay.

I've tested:
1. Current Debian-11 testing Linux-5.10.0-8 with amdgpu.ip_block_mask=0x0ff while running Xorg.
Result: everything ok
2. amd-drm-next-5.14-2021-05-12* (ae30d41eb) without any special kernel options while running Xorg.
Result:
- crashes
- also the screen starts flickering about every 10 seconds after the second resume
- flickering also happens when using a8f768874^ (before the first fix-commit by Alex D.)
- log attached: 5.12.0-rc7-original-ae30d41eb_crash.txt
3. Upstream Linux-5.14.0-rc4.
Result: Still broken.

----

* amd-drm-next-5.14-2021-05-12
https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-drm-next-5.14-2021-05-12
ae30d41eb

I've been watching linux-next and noticed that this commit
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/gpu/drm/amd?id=65660ad349fd947feb16b45ff9231f2ceaf44318
was posted on linux-next sometime between 5.10 and 5.11 (I don't remember exactly), but it keeps getting pushed back and not mainlined... I think this is why the issues are still here, and nobody from AMD has responded here since comment 29.

Hi Jerome and kolAflash,
would you mind taking your original test configuration and adding pci=noats to the boot parameters? For example:
linux /boot/vmlinuz-5.4.0-54-generic root=UUID=803844cc-7291-4056-bd04-f1b43b54ed97 ro pci=noats
See if this helps.
Thanks!
James

Created attachment 298471 [details]
publickey - me@jeromec.com - fa4f4559.asc

Hi James,
With "pci=noats" set, suspend and resume work fine.
I did see some errors (something about a device not being added) in the kernel log from "kfd", but I guess that's related to PCIe ATS being disabled by the kernel parameter.
Thanks
Jerome

On 21/02/2021 00:17, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=211277
>
> --- Comment #9 from kolAflash (kolAflash@kolahilft.de) ---
> I'm on Linux-5.7 now since 2021-01-26.
> And I woke up the notebook at least once a day since then.
> So it's clearly a regression in the kernel somewhere between 5.7 and 5.10 and
> probably between 5.7 and 5.8.
>
> And it's definitely not a BIOS issue, because I changed anything about the
> BIOS since the problem appeared last time with Kernel-5.10.
>
> Regards,
> kolAflash
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You are on the CC list for the bug.

Created attachment 298473 [details]
signature.asc
Hi Jerome,
Yes, you are right. Turning off ATS will affect the iommu. KFD needs the iommu enabled. KFD supports the compute engine. It won't affect 3D and video acceleration. After I confirm whether ats/iommu causes the issue, I will find the right person to fix it.
Thanks!
James

I can confirm Jerome's result. The bug is gone with pci=noats. (Debian-11 kernel 5.10.0-8-amd64)
I ran 50 suspend/standby rounds. Also I used the notebook for 2 days and suspended it multiple times without issues.

Hi Jerome and kolAflash,
Thanks for the confirmation. I have a workaround for this issue. But I wish I could find the root cause or a better workaround.
James

(In reply to James Zhu from comment #42)
> Hi Jerome and kolAflash,
>
> Thanks for confirmation. I have a workaround for this issue. But I wish I
> can find the root cause or better workaround.

Thanks too for your help James!

For me personally the situation is quite fine with pci=noats.
I'm sometimes using Qemu/KVM and VirtualBox. But I have no need for absolute bleeding edge VM performance. So I'll probably be fine with pci=noats.

However, I'd love to contribute to a fix for all users without kernel parameter stuff. (including a fix in longterm Linux-5.10 for Debian)
So just tell me if I can help by doing more tests, sending logs, ... :-)

Created attachment 298651 [details]
A workaround for suspend/resume hung issue
The VCN block passed all ring tests; usually the VCN will go idle within 1 sec. Somehow it affected the later amd iommu device resume, which is controlled by kfd resume. This workaround gates the VCN block immediately once the ring test has passed.
It can fix the suspend/resume hung issue.
Hi kolAflash,
Please help check the WA in your setup. I will continue working on root cause.
thanks!
James
Created attachment 298653 [details]
publickey - me@jeromec.com - fa4f4559.asc

Unfortunately this failed after 138 susp/resu
Thanks
Jerome

On 02/09/2021 22:24, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=211277
>
> --- Comment #44 from James Zhu (jamesz@amd.com) ---
> Created attachment 298651 [details]
> --> https://bugzilla.kernel.org/attachment.cgi?id=298651&action=edit
> A workaround for suspend/resume hung issue
>
> The VCN block passed all ring tests, usually the vcn will get into idle
> within 1 sec. Somehow it affected later amd iommu device resume which is
> controlled by kfd resume. This workaround is to gate vcn block immediately
> when ring test passed.
> It can fix the suspend/resume hung issue.
>
> Hi kolAflash,
> Please help check the WA in your setup. I will continue working on root
> cause.
> thanks!
> James
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You are on the CC list for the bug.

Created attachment 298655 [details]
signature.asc
Hi Jerome,
Thanks! I knew it is not easy to judge whether this issue is fixed, since it occurs quite randomly. On my setup, this WA passed 5 runs of up to 300 suspend/resume cycles and 1 run of up to 3800 suspend/resume cycles. But I doubt that it addresses the root cause, so I treated it as a WA. It seems it is not a WA for all systems, though.
James

I'm also facing consistent crashes on wake-up from the screen saver on a Radeon VII. This became more apparent with 5.14.0-rc7 and has made its way to 5.14.0. After the screens blank, waking up from sleep typically leaves artifacts on one screen, another screen will be frozen, and a third screen allows me to unlock out of SDDM. I will attach kernel logs of a trace while this happens. Please let me know if I can assist in any way.

Created attachment 298661 [details]
journalctl of amdgpu trace

(In reply to Anthony Rabbito from comment #48)
Hi Anthony,
Can you try Comment #37 and see if it helps? But from the log that you attached, it is a different issue: the GFX hw has lots of ECC errors, which cause a gfx ring timeout. After that the GPU recovery is triggered; unfortunately, the screen blank came up. I think you need to create another ticket for your case.
Best Regards!
James

I can confirm that the issue I was having when trying to wake after suspend (Ryzen 3500u, Linux 5.14 RC7) has vanished after adding pci=noats to my boot parameters a few days ago.
I've had this issue on every kernel since 5.10 (5.4 and 5.9 were fine for me for several months each, not sure what I used in between).
Thank you so much James for posting this (and trying to fix it)!

Created attachment 298691 [details]
Fix for S3 hung issue
Hi Jerome and kolAflash,
I think iommu device init is done at the wrong place during resume. I have attached a patch. Please confirm if it works.
Thanks!
James
Thanks for chiming in James! A few things I've observed: since adding 'pci=noats' the graphic artifacts seem to happen way less. I did observe one lockup which required me to hard shut down the computer. This was a wake from suspend scenario.

I used to deal with somewhat similar issues here -- https://bugs.freedesktop.org/show_bug.cgi?id=110674 not sure if that's of any use. Let me know if a fresh bug is warranted.

Created attachment 298693 [details]
publickey - EmailAddress(s=me@jeromec.com) - 0xFA4F4559.asc

Hi James,
After 900 (600 on LLVM, 300 on GCC) susp/resu using kernel 5.14.1 compiled by LLVM 12.0.1 (LLVM_IAS is unset during compiling) and again by GCC 11.1.0, there is no crash on resume, awesome.
It usually fails between 1-150 susp/resu
BRING ON THE RYZEN 6000 SERIES APU
Thanks
Jerome

-------- Original Message --------
On 7 Sep 2021, 03:00, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=211277
>
> --- Comment #52 from James Zhu (jamesz@amd.com) ---
> Created attachment 298691 [details]
> --> https://bugzilla.kernel.org/attachment.cgi?id=298691&action=edit
> Fix for S3 hung issue
>
> Hi Jerome and kolAflash,
>
> I think iommu device init is put at wrong place during the resume. I attache
> a patch. Please confirm if it works.
> Thanks!
> James
>
> --
> You may reply to this email to add a comment.
>
> You are receiving this mail because:
> You are on the CC list for the bug.

Created attachment 298695 [details]
signature.asc
damn, sorry for the ugly message layout in my replies
I didn't realize my e-mail provider was doing that

(In reply to Anthony Rabbito from comment #53)
> Thanks for chiming in James! Few things I've observed since adding
> 'pci=noats' the graphic artifacts seem to happen way less. I did observe one
> lockup which required me to hard shut down the computer. This was a wake
> from suspend scenario.
>
> I used to deal with somwhat similar issues here --
> https://bugs.freedesktop.org/show_bug.cgi?id=110674 not sure if that's of
> any use. Let me know if a fresh bug is warranted.

Hi Anthony,
The S3 hung issue here always comes with the error: AMD-Vi: Event logged [IO_PAGE_FAULT...]
Bug 110674 doesn't have gfx ECC errors. Your case does have lots of them. Can you share the whole dmesg after you added pci=noats?
Regards!
James

"drm/amdgpu: move iommu_resume before ip init/resume" causes resume from suspend to disk to fail on my amdgpu 3400G.

(In reply to youling257 from comment #58)
> drm/amdgpu: move iommu_resume before ip init/resume cause suspend to disk
> resume failed on my amdgpu 3400g.

Can you share the whole dmesg log?
Regards!
James

Created attachment 298889 [details]
dmesg5.15.txt

(In reply to James Zhu from comment #59)
> [...]
> Can you share whole demsg log? Regards! James

When resume fails I have to force a shutdown, so how can I output dmesg?
I only have the boot log dmesg.

(In reply to youling257 from comment #60)
> [...]
> when resume failed have to force shutdown, how to output dmesg?
> only has boot log dmesg.

After reboot, you can find it under /var/log/kern.log and /var/log/syslog, based on the timestamp. You can just attach kern.log.

(In reply to James Zhu from comment #61)
> [...]
> after reboot, you can find under /var/log/kern.log and /var/log/syslog based
> on timestamp. you can just attach kern.log

My userspace is androidx86: I'm running androidx86 with Linux 5.15 and mesa21 on amdgpu, and there is no /var/log.
I ran git bisect between Linux kernel 5.15rc1 and rc2; the bad commit is "drm/amdgpu: move iommu_resume before ip init/resume".

(In reply to youling257 from comment #62)
> [...]
> my userspace is androidx86, running androidx86 with linux 5.15 and mesa21 on
> amdgpu, no /var/log.
> git bisect linux kernel 5.15rc1 and rc2, bad commit is drm/amdgpu: move
> iommu_resume before ip init/resume.

Can you check the CONFIG_HSA_AMD setting in .config?
By the way, see if the link below helps you dump the error messages during resume.
https://stackoverflow.com/questions/9682306/android-how-to-get-kernel-logs-after-kernel-panic

Created attachment 298899 [details]
config-5.15.0-rc2-android-x86_64+
CONFIG_HSA_AMD=y
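
For anyone else asked to report this setting, here is a minimal sketch of reading a single option out of a kernel config file. `config_value` is a hypothetical helper name, not an existing tool, and the file to pass is whichever config belongs to the kernel under test (for example the attached config or the .config in the build directory).

```shell
# Hypothetical helper: print the value of one option from a kernel config file.
# Prints "unset" when the option is missing or only appears as a
# "# CONFIG_... is not set" comment.
config_value() {
    # $1 = option name, e.g. CONFIG_HSA_AMD
    # $2 = path to the config file
    value=$(sed -n "s/^$1=//p" "$2" | head -n 1)
    printf '%s\n' "${value:-unset}"
}
```

Called as `config_value CONFIG_HSA_AMD .config`, it prints `y` for a config that contains the line reported above.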
(In reply to James Zhu from comment #63)
> Can you check CONFIG_HSA_AMD setting in .config? By the way, see if the
> below link help you dump the error message during resume.
> https://stackoverflow.com/questions/9682306/android-how-to-get-kernel-logs-after-kernel-panic

Did you see the kernel command line in my dmesg?
"memmap=1M!5M ramoops.mem_size=1048576 ramoops.ecc=1 ramoops.mem_address=0x00500000 ramoops.console_size=16384 ramoops.ftrace_size=16384 ramoops.pmsg_size=16384 ramoops.record_size=32768"
If the kernel panics and reboots, I can get /sys/fs/pstore/console-ramoops-0 and /sys/fs/pstore/pmsg-ramoops-0.
But when resume fails, I have to press the power button to force a shutdown, and nothing gets recorded.
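
A small sketch of collecting whatever pstore did keep after a reboot, in case it helps others with a similar ramoops setup. The default path /sys/fs/pstore is the one mentioned above; `dump_pstore` is a hypothetical helper name. As described in the comment, it only helps when something was actually recorded, which was not the case after the forced power-off.

```shell
# Hypothetical helper: print every record found in a pstore directory,
# each preceded by a header line with its file name.
dump_pstore() {
    dir=${1:-/sys/fs/pstore}   # defaults to the usual pstore mount point
    found=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip when the glob matched nothing
        found=1
        printf '===== %s =====\n' "${f##*/}"
        cat "$f"
    done
    if [ "$found" -eq 0 ]; then
        echo "no pstore records in $dir" >&2
        return 1
    fi
}
```

Redirecting the output to a file, for example `dump_pstore > pstore-dump.txt`, gives a single file that can be attached to the bug.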
Video recording of the failed resume:
https://drive.google.com/drive/folders/1bWMC4ByGvudC9zBk-9Xgamz-shir0pqX?usp=sharing

(In reply to youling257 from comment #66)
> resume failed record video,
> https://drive.google.com/drive/folders/1bWMC4ByGvudC9zBk-9Xgamz-shir0pqX?usp=sharing

Can you try applying this patch?
https://lore.kernel.org/all/20210920163922.313113287@linuxfoundation.org/

(In reply to James Zhu from comment #67)
> Can you try apply this patch:
> https://lore.kernel.org/all/20210920163922.313113287@linuxfoundation.org/?

Linux 5.15-rc1 is good: resume from suspend to disk succeeds.
Linux 5.15-rc2 is bad: suspend to disk fails.
Reverting "drm/amdgpu: move iommu_resume before ip init/resume" makes resume from suspend to disk succeed again.
Linux 5.15-rc2 already has "drm/amdkfd: separate kfd_iommu_resume from kfd_resume", so why do you suggest that I apply the patch?

My mistake. Can you try adding pci=noats to the boot parameters?

(In reply to James Zhu from comment #70)
> My mistake. Can you try add pci=noats in boot parameters?

No help, resume still fails.
Hi James,

I noticed that the patches you asked us to try in comment 52 were also submitted to kernel 5.14.7.

Tested it, all is good for now.

Thanks
Jerome

(In reply to Jerome C from comment #72)
> Hi James,
>
> I noticed the patch that you asked us to try from comment 52 were also
> submitted to kernel 5.14.7
>
> tested it, all is good for now

Pleased to hear that :-)
I'm just compiling 5.15.2 to run a test myself.

@James
Will those patches be backported to the Linux-5.10 LTS kernel?

master and Linux-5.15
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f02abeb0779700c308e661a412451b38962b8a0b
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8066008482e533e91934bee49765bf8b4a7c40db
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fefc01f042f44ede373ee66773b8238dd8fdcb55

Linux-5.14.7
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fe232886fb710a4bf0532f61ebdb87463a780e7e
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=413a8644966a9b4709b114bdb102f64f505d57ef
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=64ca7170c9b17042dc63828b56681aaea88ca38e

@James Zhu

Tested 5.15.2 for over a week and more than 50 standby-wakeups.
No problems!
Thanks :-)

I would be happy about a patch for the 5.10 longterm kernel.
The bug became a problem with v5.10-rc3 (see comment 14), just before Debian made 5.10-longterm the Debian-11 kernel. So it would be great if I and probably other Debian-11 users could finally use that AMD GPU without workarounds.

(In reply to kolAflash from comment #74)
> @James Zhu
>
> Tested 5.15.2 for over a week and more than 50 standby-wakeups.
> No problems!
> Thanks :-)
>
> I would be happy about a patch for the 5.10 longterm kernel.
> The bug became a problem with v5.10-rc3 (see comment 14), just before Debian
> made 5.10-longterm the Debian-11 kernel. So it would be great if I and
> probably other Debian-11 users could finally use that AMD GPU without
> workarounds.

Hi @Alex Deucher, can you help with this request? Thanks!
James

(In reply to James Zhu from comment #75)
> Hi @Alex Deucher, Can you help on this request? thanks! James

I cc'ed stable with the patches so they should show up in 5.10 assuming they apply cleanly. If not, can you look at what it would take to backport them?

Created attachment 299697 [details]
backport patch for 5.10 stable.
Hi @kolAflash, before I send them out for public review, could you help test them? Thanks so much! James
(In reply to James Zhu from comment #77)
> Created attachment 299697 [details]
> backport patch for 5.10 stable.
>
> Hi @kolAflash, before I send out them to public for review, could you help
> take a test? Thanks so much! James

Thanks for the patch! :-)
make is currently running and I'll conduct some tests in the next days.

@James
Got this when compiling with Linux-5.10.81:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c: In function ‘kgd2kfd_device_init’:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:754:6: error: implicit declaration of function ‘kgd2kfd_resume_iommu’; did you mean ‘kgd2kfd_resume_mm’? [-Werror=implicit-function-declaration]
  754 |  if (kgd2kfd_resume_iommu(kfd))
      |      ^~~~~~~~~~~~~~~~~~~~
      |      kgd2kfd_resume_mm

Patching 5.10.81 was without problems:

$ patch -p1 -i ../../backport_patch/0001-drm-amdkfd-separate-kfd_iommu_resume-from-kfd_resume.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
patching file drivers/gpu/drm/amd/amdkfd/kfd_device.c
$ patch -p1 -i ../../backport_patch/0002-drm-amdgpu-add-amdgpu_amdkfd_resume_iommu.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
$ patch -p1 -i ../../backport_patch/0003-drm-amdgpu-move-iommu_resume-before-ip-init-resume.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
$ patch -p1 -i ../../backport_patch/0004-drm-amdgpu-init-iommu-after-amdkfd-device-init.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
$ patch -p1 -i ../../backport_patch/0005-drm-amdkfd-fix-boot-failure-when-iommu-is-disabled-i.patch
patching file drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
patching file drivers/gpu/drm/amd/amdkfd/kfd_device.c

Hi @kolAflash, I applied those patches on
(https://github.com/gregkh/linux.git linux-5.10.y f884bb85b8d877d4e0c670403754813a7901705b)
(https://github.com/gregkh/linux.git linux-5.12.y 0e6f651912bdd027a6d730b68d6d1c3f4427c0ae).
I didn't see any compile issue. Can you share your .config?
James

Created attachment 299721 [details]
Linux kernel make .config
@James
Compiling v5.10.80 (f884bb85b8d877d4e0c670403754813a7901705b) with the provided patch results in the same error.
I attached my Linux kernel make .config.
Compilation platform is Debian-11.1.0.
Hi @kolAflash, I don't have an issue with your .config on Ubuntu 20.04.
From the source code, it should be fine.

$ grep -rn "kgd2kfd_resume_iommu" drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
309:int kgd2kfd_resume_iommu(struct kfd_dev *kfd);
$ grep -rn "amdgpu_amdkfd.h\|kgd2kfd_resume_iommu" drivers/gpu/drm/amd/amdkfd/kfd_device.c
31:#include "amdgpu_amdkfd.h"
604: kfd->pci_atomic_requested = amdgpu_amdkfd_have_atomics_support(kgd);
>>>>792: if (kgd2kfd_resume_iommu(kfd))
940:int kgd2kfd_resume_iommu(struct kfd_dev *kfd)

It looks like we are using different 5.10 trees. Should we use 5.10 stable for adding these backport patches?

>>>>754 |  if (kgd2kfd_resume_iommu(kfd))
        |      ^~~~~~~~~~~~~~~~~~~~
        |      kgd2kfd_resume_mm

Best Regards!
James

Hi James,

(In reply to James Zhu from comment #82)
> [...]
> $ grep -rn "amdgpu_amdkfd.h\|kgd2kfd_resume_iommu"
> drivers/gpu/drm/amd/amdkfd/kfd_device.c
> 31:#include "amdgpu_amdkfd.h"
> 604: kfd->pci_atomic_requested = amdgpu_amdkfd_have_atomics_support(kgd);
> >>>>792: if (kgd2kfd_resume_iommu(kfd))
> 940:int kgd2kfd_resume_iommu(struct kfd_dev *kfd)

the line numbers you're quoting are for Linux v5.12.19 (0e6f651912bdd027a6d730b68d6d1c3f4427c0ae) + the attachment-299697 patch.

> Looks we are using different 5.10, should we use 5.10 stable for adding this
> backport patches?.
> >>>>754 | if (kgd2kfd_resume_iommu(kfd))
> | ^~~~~~~~~~~~~~~~~~~~
> | kgd2kfd_resume_mm

I'm testing with Linux v5.10.80 (f884bb85b8d877d4e0c670403754813a7901705b) + the attachment-299697 patch. And there it's line number 754.

@James
I was able to compile!
Looks like this was some fault of mine.
(I'm usually building out of source directory and did something wrong...)
Now I'm testing the current v5.10.82 with the provided attachment 299697 [details] patches.
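
For mix-ups like this one, a quick sanity check before starting a long build is to confirm that the tree being compiled really contains the patched declaration. The sketch below reuses the header path and symbol name from the grep output earlier in this thread; `has_resume_iommu_decl` is a hypothetical helper name, and it only checks that the text is present, not that the build actually uses that tree.

```shell
# Hypothetical helper: report whether a kernel source tree contains the
# kgd2kfd_resume_iommu declaration added by the backport patches.
has_resume_iommu_decl() {
    # $1 = top of the kernel source tree (defaults to the current directory)
    hdr="${1:-.}/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h"
    if grep -q 'kgd2kfd_resume_iommu' "$hdr" 2>/dev/null; then
        echo "patched: kgd2kfd_resume_iommu found in $hdr"
    else
        echo "NOT patched: kgd2kfd_resume_iommu missing from $hdr" >&2
        return 1
    fi
}
```

Running `has_resume_iommu_decl /path/to/linux` against the directory that make is pointed at returns non-zero when the patches were applied to a different copy of the source.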
(In reply to James Zhu from comment #77)
> Created attachment 299697 [details]
> backport patch for 5.10 stable.
>
> Hi @kolAflash, before I send out them to public for review, could you help
> take a test? Thanks so much! James

Works excellent!
Tested with Linux-5.10.82 on Debian-11.

Hi @kolAflash, thanks so much for your effort on this verification!
Would you mind helping to apply those patches on 5.12 stable to check as well?
They should merge automatically. Thanks!
James

(In reply to James Zhu from comment #86)
> Hi @kolAflash, thanks so much for your effort on this verification!
> Would you mind help apply those patches on 5.12 stable to check also?
> it should be automatically merged. Thanks! James

I've been testing Linux-5.12.19 with the patch from attachment 299697 [details] since 2021-12-02.
So far everything works fine.

Debian-11 just got a kernel security update, giving me Linux-5.10.92.
https://snapshot.debian.org/package/linux-signed-amd64/5.10.92%2B1/#linux-image-5.10.0-11-amd64_5.10.92-1
Since rebooting into that kernel I got no more crashes after waking from s2ram (not using pci=noats or any other workarounds).

Conclusion: Everything fixed!
Thanks a lot to everyone involved :-)