Bug 215315

Summary: [REGRESSION BISECTED] amdgpu crashes system suspend - NUC8i7HVKVA
Product: Drivers
Component: Video(DRI - non Intel)
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.15-rc1, 5.15, 5.16-rc4, 5.16-rc5, 5.14.10, 5.16-rc6, 5.16-rc8
Subsystem:
Regression: Yes
Bisected commit-id:
Reporter: Len Brown (lenb)
Assignee: Christian König (christian.koenig)
CC: alexdeucher, andrey.grodzovsky, guchun.chen
Attachments: photograph of screen upon failure
another photograph of screen upon failure
dmesg for 5.16-rc8+ failure
screen shot for 5.16-rc8+ failure
dmesg across "init 1" mitigation
dmesg -w output for 5.16.0-rc8-00077-gd1587f7bfe9a
dmesg from failure
another dmesg, repeated failure using same kernel as above

Description Len Brown 2021-12-12 23:08:28 UTC
My Intel NUC8i7HVKVA has an AMD GPU.

Until 5.15-rc1, this machine was rock solid in suspend stress testing -- never crashing after hundreds of hours of back-to-back suspend cycles.

Until this patch went upstream:

commit f7d6779df642720e22bffd449e683bb8690bd3bf (refs/bisect/bad)
Author: Guchun Chen <guchun.chen@amd.com>
Date:   Fri Aug 27 18:31:41 2021 +0800

    drm/amdgpu: stop scheduler when calling hw_fini (v2)
    
    This gurantees no more work on the ring can be submitted
    to hardware in suspend/resume case, otherwise a potential
    race will occur and the ring will get no chance to stay
    empty before suspend.
    
    v2: Call drm_sched_resubmit_job before drm_sched_start to
    restart jobs from the pending list.
    
    Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Suggested-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Guchun Chen <guchun.chen@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org

My bisection shows that the kernel built from the commit just before this patch handles over 1,000 back-to-back "freeze" system suspend cycles.  With this patch present, the system may crash before completing 100 cycles, and lasts a few hundred cycles at most.

This crash is present in all following upstream rc's, including 5.15-rc4.

When I revert this patch from 5.15-rc4, stability returns.

Usually, the crash manifests as a black screen and a system that does not respond to ping.  The only way to recover is a long press of the power button to remove power, followed by a cold reboot.

I have witnessed the crash occur, and the "ubuntu color themed" screen enters some sort of reverse video mode.  In this weird color mode, I've seen a text window oscillate between scrolling and un-scrolling for a line -- sort of like it is going back in time, but then changes its mind.  There is no response to keyboard, mouse, or network input.
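
For readers who want to try a comparable endurance run, here is a minimal sketch of a back-to-back "freeze" suspend loop. It is not the reporter's setup (a later comment mentions sleepgraph); it assumes the util-linux rtcwake tool is installed and run as root, and the names freeze_cycle_command and run_cycles are hypothetical helpers invented for this illustration.

```python
import subprocess
from typing import Callable, List


def freeze_cycle_command(wake_after_s: int) -> List[str]:
    """Build an rtcwake command for one suspend-to-idle ("freeze") cycle.

    Assumption: rtcwake from util-linux is available; -m selects the
    suspend mode and -s sets the RTC wake alarm in seconds.
    """
    return ["rtcwake", "-m", "freeze", "-s", str(wake_after_s)]


def run_cycles(count: int,
               wake_after_s: int = 10,
               runner: Callable = subprocess.run,
               log: Callable[[str], None] = print) -> int:
    """Run up to `count` suspend cycles and return how many completed.

    The cycle number is logged before each suspend, so after a hard hang
    the last logged line shows roughly where the machine died.
    """
    completed = 0
    for i in range(1, count + 1):
        log(f"cycle {i}/{count}")
        result = runner(freeze_cycle_command(wake_after_s))
        if getattr(result, "returncode", 0) != 0:
            # rtcwake itself failed; stop rather than spin.
            break
        completed += 1
    return completed
```

Calling run_cycles(1000) as root would attempt 1,000 cycles. Since the failure described here hangs the machine, the loop would normally never observe the crash itself; the logged cycle count is the useful evidence, so it is worth sending the log to a remote host or a file that is synced to disk.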
Comment 1 Len Brown 2021-12-12 23:13:37 UTC
* This crash is present in all following upstream rc's, including 5.16-rc4 (ie. latest upstream kernel tree)
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2021-12-13 06:04:22 UTC
[TLDR: adding this regression to regzbot; most of this mail is compiled
from a few template paragraphs some of you might have seen already.]

Hi, this is your Linux kernel regression tracker speaking.

Top-posting for once, to make this easily accessible to everyone.

Thanks for the report.

Adding the regression mailing list to the list of recipients, as it
should be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced f7d6779df642720e22bffd449e683bb8690bd3bf
#regzbot title drm: amdgpu: NUC8i7HVKVA crashes during system suspend
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215315
#regzbot ignore-activity

Reminder: when fixing the issue, please add a 'Link:' tag with the URL
to the report (the parent of this mail), then regzbot will automatically
mark the regression as resolved once the fix lands in the appropriate
tree. For more details about regzbot see footer.

Sending this to everyone that got the initial report, to make all aware
of the tracking. I also hope that messages like this motivate people to
directly get at least the regression mailing list and ideally even
regzbot involved when dealing with regressions, as messages like this
wouldn't be needed then.

Don't worry, I'll send further messages wrt this regression just to
the lists (with a tag in the subject so people can filter them away), as
long as they are intended just for regzbot. With a bit of luck no such
messages will be needed anyway.

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat).

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately
therefore I sometimes will get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply. That's in everyone's interest, as
what I wrote above might be misleading to everyone reading this; any
suggestion I gave thus might send someone reading this down the wrong
rabbit hole, which none of us wants.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CC'd on
all further activities wrt this regression.


On 13.12.21 00:08, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215315
> 
>             Bug ID: 215315
>            Summary: [REGRESSION BISECTED] amdgpu crashes system suspend -
>                     NUC8i7HVKVA
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.15-rc1, 5.15, 5.16-rc4
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Video(DRI - non Intel)
>           Assignee: drivers_video-dri@kernel-bugs.osdl.org
>           Reporter: lenb@kernel.org
>         Regression: No
Comment 3 Alex Deucher 2021-12-13 14:29:36 UTC
Can you get a kernel log when the crash happens?  Please include the dmesg output of the system in general as well.
Comment 4 Len Brown 2021-12-13 16:28:40 UTC
I do not have a serial console on this system,
and since GFX kbd and mouse are frozen upon the crash,
and it requires a hard power cycle to regain control of the machine,
I can't get a kernel log of the crash.
Comment 5 Len Brown 2021-12-13 16:33:20 UTC
I have confirmed that this issue is still present in 5.16-rc5
(it failed in 2 hours)

I have confirmed that proper function returns upon reverting the commit on top of 5.16-rc5
(still working after 8 hours of testing)
Comment 6 Len Brown 2021-12-15 21:23:01 UTC
This commit is present in 5.14.10

Author: Guchun Chen <guchun.chen@amd.com>
Date:   Fri Aug 27 18:31:41 2021 +0800

    drm/amdgpu: stop scheduler when calling hw_fini (v2)
    
    [ Upstream commit f7d6779df642720e22bffd449e683bb8690bd3bf ]
    
    This gurantees no more work on the ring can be submitted
    to hardware in suspend/resume case, otherwise a potential
    race will occur and the ring will get no chance to stay
    empty before suspend.
    
    v2: Call drm_sched_resubmit_job before drm_sched_start to
    restart jobs from the pending list.
    
    Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Suggested-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Guchun Chen <guchun.chen@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Sasha Levin <sashal@kernel.org>
Comment 7 Len Brown 2021-12-17 16:21:04 UTC
5.14.9 does not fail
5.14.10 fails.

The failure in 5.14.stable bisects to this commit:

920e3c77f13084206949730f0d0d0a797425c4e7 is the first bad commit
commit 920e3c77f13084206949730f0d0d0a797425c4e7
Author: Guchun Chen <guchun.chen@amd.com>
Date:   Fri Aug 27 18:31:41 2021 +0800

    drm/amdgpu: stop scheduler when calling hw_fini (v2)
    
    [ Upstream commit f7d6779df642720e22bffd449e683bb8690bd3bf ]
    
    This gurantees no more work on the ring can be submitted
    to hardware in suspend/resume case, otherwise a potential
    race will occur and the ring will get no chance to stay
    empty before suspend.
    
    v2: Call drm_sched_resubmit_job before drm_sched_start to
    restart jobs from the pending list.
    
    Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Suggested-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Guchun Chen <guchun.chen@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org
    Signed-off-by: Sasha Levin <sashal@kernel.org>

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
 1 file changed, 8 insertions(+)
Comment 8 Alex Deucher 2021-12-17 16:24:02 UTC
Does a newer kernel or Linus' master have the same issue or is it specific to the 5.14 branch?
Comment 9 Andrey Grodzovsky 2021-12-17 17:07:18 UTC
(In reply to Len Brown from comment #4)
> I do not have a serial console on this system,
> and since GFX kbd and mouse are frozen upon the crash,
> and it requires a hard power cycle to regain control of the machine,
> I can't get a kernel log of the crash.

Are you able to take a screenshot of your screen when this happens? In case your system reboots automatically (which doesn't sound like it), you can boot with kernel.panic_on_oops=0 in GRUB.
Comment 10 Len Brown 2021-12-20 12:16:55 UTC
re: comment #8

I have verified that this issue is still present in latest upstream (5.16-rc6)
and that reverting the offending patch still fixes the issue.
Comment 11 Len Brown 2021-12-20 12:25:17 UTC
Re: comment #9

Yes, I have a couple of screen shots taken with my phone upon the failure, and will attach them below.  There is just the output of sleepgraph in a window -- I don't see any system output.  Is there some way to configure the screen so that I would see system messages when the GUI is stopped?

I have been at my desk several times upon the failure, and after a few minutes of "reverse video" garbled screen, it goes black and there seems to be no way other than power cycle to get any response out of the system after that.

Apparently the system will sit in this state forever, until power goes out -- at least I've observed it do so overnight...
Comment 12 Len Brown 2021-12-20 12:34:41 UTC
Created attachment 300095 [details]
photograph of screen upon failure
Comment 13 Len Brown 2021-12-20 12:41:25 UTC
Created attachment 300097 [details]
another photograph of screen upon failure
Comment 14 Guchun Chen 2021-12-21 03:02:15 UTC
(In reply to Len Brown from comment #13)
> Created attachment 300097 [details]
> another photograph of screen upon failure

Hi Brown,

Your screenshot is blurry, and it shows no dmesg output. A dmesg log would help us identify the exact kernel error, if there is one. Could you please repeat your test, following the steps below?

1. Open a console and start your test.
2. Open another console and run "dmesg -w", which prints the kernel log in real time.
3. Once the issue happens, send us the log, especially the part just before the error occurs.
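
Because this failure tends to end in a hard power cycle, log lines that are still buffered when the machine hangs can be lost. The following is a minimal sketch, assuming Python 3 on the test machine and permission to read the kernel log, that follows "dmesg -w" and forces every line to disk as it arrives. The function names save_lines_durably and follow_dmesg are hypothetical, chosen only for this illustration.

```python
import os
import subprocess
from typing import Iterable


def save_lines_durably(lines: Iterable[str], path: str) -> int:
    """Append each line to `path`, flushing and fsyncing after every line.

    Returns the number of lines written. fsync per line is slow, but it
    keeps the tail of the log on disk if the machine hangs right after.
    """
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            out.write(line if line.endswith("\n") else line + "\n")
            out.flush()
            os.fsync(out.fileno())
            written += 1
    return written


def follow_dmesg(path: str) -> int:
    """Stream `dmesg -w` (follow mode) into `path` until interrupted."""
    proc = subprocess.Popen(["dmesg", "-w"],
                            stdout=subprocess.PIPE, text=True)
    try:
        return save_lines_durably(proc.stdout, path)
    finally:
        proc.terminate()
```

Running follow_dmesg("dmesg-follow.log") in a second console, or over ssh from another machine, would cover step 2 above; the resulting file is what to attach in step 3.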
Comment 15 Len Brown 2022-01-06 02:13:49 UTC
Good news and bad news...

I tried to reproduce this with the latest upstream 5.16-rc8-75acfdb6fd922598a408a0d864486aeb167c1a97
and things may actually be worse now.

As you requested, I ran dmesg -w, and I did it via tee to a file.
Also, I was logged into the machine via wired ethernet
doing the same dmesg -w.

What I observed is that the screen became garbled after a few suspend cycles,
and the mouse would move a box across the screen, but was unable to
select anything.  The attached usb keyboard appeared to be ineffective.

However, the ssh session with dmesg -w running still ran.

Attached is a screen shot, and the saved dmesg, which match.
Comment 16 Len Brown 2022-01-06 02:14:51 UTC
Created attachment 300231 [details]
dmesg for 5.16-rc8+ failure
Comment 17 Len Brown 2022-01-06 02:16:43 UTC
Created attachment 300232 [details]
screen shot for 5.16-rc8+ failure
Comment 18 Len Brown 2022-01-06 14:48:54 UTC
I have confirmed that reverting the offending patch:

(drm/amdgpu: stop scheduler when calling hw_fini (v2))

allows 5.16-rc8 to pass a 10-hour suspend endurance test, with video intact.

(the 5.16-rc8 failure described in comment #15 occurred in under an hour)
Comment 19 Len Brown 2022-01-07 23:54:00 UTC
Obviously, the offending patch should be reverted before
this regression infects yet another release.

But to better understand the failure, I reproduced it again using
5.16-rc8-048 as in comment #15.

What I found is that after the screen gets garbled...
The mouse motion works, whether via dedicated USB mouse
or from the trackpoint on my USB thinkpad keyboard.
However, all it does is move a 1" square box on the display.
Mouse buttons appear to have no effect.

Also, the text output on the video console is frozen,
even though sleepgraph is still running, and "dmesg -w"
is still showing progress in a window on another computer
via ssh.

The first time I found that I had control from the remote machine,
I tried "sudo reboot" and the network disconnected, but the system
failed to reboot.  This time I did an "init 1" to kill the window system.
Console text output came back to the video display, and I was able
to ^D on the console to restart the GUI, which seemed to then work fine --
without rebooting the kernel.
Comment 20 Len Brown 2022-01-08 00:03:44 UTC
Created attachment 300237 [details]
dmesg across "init 1" mitigation

here is the dmesg across the "init 1" path taken in the previous comments.

And a correction -- the mouse seems to have subsequently stopped working
after the window system restart.  "sudo reboot" from the ssh session
again resulted in a frozen screen with the GUI still up.  ping worked,
ssh failed, and only power button hard-cycle recovered the machine.
Comment 21 Len Brown 2022-01-08 16:17:42 UTC
still broken in latest upstream: 5.16.0-rc8-00077-gd1587f7bfe9a

Overnight run completed 486 iterations.
screen was garbled, and frozen -- no mouse or kbd input.
ping worked, but not ssh.

dmesg -w was running to a file, and at the end there were some amdgpu complaints:

[11341.478526] amdgpu 0000:01:00.0: amdgpu: PCI CONFIG reset
[11383.468593] amdgpu 0000:01:00.0: amdgpu: PCI CONFIG reset
[11424.046117] amdgpu 0000:01:00.0: amdgpu: PCI CONFIG reset
[11465.831379] amdgpu 0000:01:00.0: amdgpu: PCI CONFIG reset
[11507.250829] amdgpu 0000:01:00.0: amdgpu: PCI CONFIG reset
[11514.337785] amdgpu 0000:01:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on sdma0 (-110).
[11514.474335] [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110).
[11523.587731] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=207738, emitted seq=207741
[11523.588253] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[11523.588738] amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
[11523.749865] amdgpu 0000:01:00.0: amdgpu: PCI CONFIG reset
[11523.771829] amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
[11524.027065] amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
[11524.027081] amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
[11524.027167] amdgpu 0000:01:00.0: amdgpu: GPU reset(3) succeeded!
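
When scanning a long dmesg file for lines like the ring timeout above, a small parser can help. This is a sketch only: the regular expression is written against the exact message format shown in this excerpt and may not match other kernel versions, and treating "emitted seq minus signaled seq" as the number of submissions still outstanding is an interpretation, not something stated in the log. parse_ring_timeout is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Matches the amdgpu_job_timedout message format seen in this report.
TIMEOUT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\].*ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line: str) -> Optional[Tuple[float, str, int]]:
    """Return (timestamp, ring name, emitted - signaled), or None.

    The third element is the gap between the last emitted and the last
    signaled sequence number on that ring.
    """
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    gap = int(m["emitted"]) - int(m["signaled"])
    return (float(m["ts"]), m["ring"], gap)
```

Applied to the sdma0 timeout line in the excerpt, the parser reports a gap of 3 between signaled seq=207738 and emitted seq=207741; the "PCI CONFIG reset" lines do not match and yield None.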
Comment 22 Len Brown 2022-01-08 16:19:21 UTC
Created attachment 300238 [details]
dmesg -w output for 5.16.0-rc8-00077-gd1587f7bfe9a

for context, here is the dmesg file that contained the amdgpu messages in previous comment
Comment 23 Len Brown 2022-01-09 05:14:24 UTC
Created attachment 300244 [details]
dmesg from failure

another dmesg through failure, same kernel as above
Comment 24 Len Brown 2022-01-09 05:19:57 UTC
Created attachment 300245 [details]
another dmesg, repeated failure using same kernel as above
Comment 25 Len Brown 2022-01-10 18:10:10 UTC
I have confirmed with a 10-hour suspend endurance test that Linux 5.16 works.

Closed.


commit df5bc0aa7ff6e2e14cb75182b4eda20253c711d4
Author: Len Brown <len.brown@intel.com>
Date:   Sun Jan 9 13:11:37 2022 -0500

    Revert "drm/amdgpu: stop scheduler when calling hw_fini (v2)"
    
    This reverts commit f7d6779df642720e22bffd449e683bb8690bd3bf.
    
    This bisected regression has impacted suspend-resume stability
    since 5.15-rc1. It regressed -stable via 5.14.10.
    
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=215315
    Fixes: f7d6779df64 ("drm/amdgpu: stop scheduler when calling hw_fini (v2)")
    Cc: Guchun Chen <guchun.chen@amd.com>
    Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Cc: Christian Koenig <christian.koenig@amd.com>
    Cc: Alex Deucher <alexander.deucher@amd.com>
    Cc: <stable@vger.kernel.org> # 5.14+
    Signed-off-by: Len Brown <len.brown@intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>