Bug 212133

Summary: Kernel 5.11 crashes when it boots, it produces black screen.
Product: Platform Specific/Hardware
Reporter: rollingrock (david.capiz)
Component: x86-64
Assignee: Joerg Roedel (joro)
Status: RESOLVED CODE_FIX
Severity: blocking
CC: alexdeucher, bp, david.capiz, fkrueger, im_dracula, joro, jshand2013, rslnrdnk666, savelov, turtle
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.11
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Dmesg output for kernel 5.10.19 in my computer.

Description rollingrock 2021-03-08 15:34:42 UTC
Kernel 5.11 crashes when it boots on my computer and produces a black screen. I don't hear the fans (as I usually do when my computer boots), the computer stays frozen, and I have to power it off and on again with the power button. The details of my computer are listed below.

With kernel 5.10 I use the kernel parameter amdgpu.dc=0 to avoid the black screen, but this does NOT solve the problem with kernel 5.11. Perhaps it is not a graphics problem; I don't know.

I've tried to reach a console through ctrl+alt+F2, and also ctrl+fn+alt+F2, fn+alt+F2, and alt+F2, with and without amdgpu.dc=0, and the result is the same: black screen. I've also tried these kernel parameters: video=DP-0:1366x768@60 video=DP-1:1366x768@60 video=DP-2:1366x768@60. The result is the same: black screen. I've also restored the default BIOS configuration and the result is again the same: black screen.

Details of my computer:
Computer: HP Laptop 15-db0xxx
CPU: Dual Core AMD A6-9225 RADEON R4 5 COMPUTE CORES 2C+3G (-MCP-) @ 2x 2.6GHz
GPU/graphics card: AMD Stoney [Radeon R2/R3/R4/R5 Graphics]
Graphics driver: amdgpu
Bootloader: GRUB
GNU/Linux distribution: Manjaro
Display manager: sddm
Desktop: KDE Plasma
Comment 1 rollingrock 2021-03-09 20:07:21 UTC
UPDATE: After the new update of Manjaro today, I have kernel 5.11.2 now. This kernel still crashes and produces a black screen in my computer, but now, if I use the kernel parameter amdgpu.dc=0, I hear the fans while booting, and I see in my router the light of the wired connection of my computer, so I suppose that in this case kernel 5.11.2 is booting in my computer, but it still produces a black screen, so I cannot see anything and I have to reboot my computer. In the case of kernel 5.10 now I've seen that I don't need to continue using the kernel parameter amdgpu.dc=0.
Comment 2 Borislav Petkov 2021-03-09 21:53:30 UTC
Do you have the latest firmware for the amdgpu driver? It should be a package called something "amdgpu...linux...firmware" or so on your distro.

Do you see anything when you boot with "nomodeset" on the kernel command line?

Can you upload dmesg from a working kernel?

Thx.
Comment 3 Ryan 2021-03-10 03:48:05 UTC
This is also impacting me on a DELL Inspiron 3180 11" which has the AMD A9 A420e processor with Radeon R5 graphics. I am running Fedora 33 on x86_64 architecture.

I tested 'nomodeset', amdgpu.dc=0 and amdgpu.dc=1, each on its own and combined with 'nomodeset'. It makes no difference: I can get the grub boot menu, and once the kernel is selected it switches to a blank screen with a white cursor (usually it will briefly flick to this screen and then flick to the DELL logo). I have LUKS encryption enabled on this laptop, and it doesn't even reach the prompt to decrypt the disk on 5.11 kernels. I can't do anything except press the power button (ctrl+alt+del and trying to switch ttys does nothing).

Here is my dmesg from the latest 5.10.22 build kernel:

https://paste.centos.org/view/8f3d9782 (expires in 23hrs)
Comment 4 Ryan 2021-03-10 03:49:15 UTC
apologies, cpu is AMD A9-9420e
Comment 5 Borislav Petkov 2021-03-10 05:18:08 UTC
+ Alex.

Interesting, you both are using StoneyRidge-something GPUs. Not that it means something right now as we don't know where it fails during boot and debugging laptops is always a PITA, but still...

The only thing I can think of is if you guys could bisect it, see "man git-bisect".

You take v5.10 upstream and verify that it works, then you take v5.11 and check whether it fails. If it does, you do the bisection as described in the manpage. Hopefully, we can pinpoint the faulty commit this way, *if* it is a kernel commit causing this or not something else during boot. But we don't know yet because there's no way to connect serial to those laptops or otherwise see debug output during boot. :-\
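
For anyone following along, the bisection described above could be started roughly as follows. This is only a sketch: it assumes a local clone of the mainline kernel tree, and the helper name start_bisect and the example path are illustrative, not something from this thread.

```shell
# Start a bisection between a known-good and a known-bad kernel ref.
# $1 = path to a kernel git tree, $2 = known-good ref, $3 = known-bad ref
start_bisect() {
    git -C "$1" bisect start &&
    git -C "$1" bisect bad "$3" &&
    git -C "$1" bisect good "$2"
}

# Example (hypothetical path):
#   start_bisect ~/src/linux v5.10 v5.11
# Then build and boot the commit git checks out, and report the result:
#   git -C ~/src/linux bisect good    # that kernel boots
#   git -C ~/src/linux bisect bad     # that kernel hangs on a black screen
# Repeat until git names the first bad commit, then clean up with:
#   git -C ~/src/linux bisect reset
```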

I guess one other thing you could try is to temporarily blacklist the amdgpu driver and see if it boots to text console then. If it explodes due to something else, then at least you will see it and perhaps even be able to take a picture...

HTH.
Comment 6 Alex Deucher 2021-03-10 05:57:46 UTC
Is this still an issue with 5.11.3?
Comment 7 Ryan 2021-03-10 06:13:53 UTC
(In reply to Alex Deucher from comment #6)
> Is this still an issue with 5.11.3?

Fedora's test week original kernel was 5.11.3. It had the same problem, as did 5.11.4.
Comment 8 Alex Deucher 2021-03-10 06:36:47 UTC
Can you get a dmesg output from 5.11 with or without the GPU driver loaded?  You can append `modprobe.blacklist=amdgpu 3` to the kernel command line in grub to disable the amdgpu driver and boot into runlevel 3 (no GUI).  From there you can get the dmesg output and try and manually load the amdgpu driver (modprobe amdgpu).  Bisecting would be ideal: https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html
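
If the machine does reach a text console this way, the interesting part of the log can be cut down before attaching it. A minimal sketch, assuming the graphics messages are tagged with "amdgpu" or "drm" (the helper name filter_dmesg and the output file names are made up for illustration):

```shell
# Keep only the dmesg lines matching a pattern.
# $1 = optional extended regex; defaults to graphics-related messages.
# Reads log text on stdin so it also works on a saved dmesg file.
filter_dmesg() {
    grep -Ei "${1:-amdgpu|drm}"
}

# Example usage (dmesg may need root on some distributions):
#   dmesg | filter_dmesg > dmesg-5.11-before.txt
#   modprobe amdgpu
#   dmesg | filter_dmesg > dmesg-5.11-after.txt
```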
Comment 9 Ryan 2021-03-10 06:52:47 UTC
(In reply to Alex Deucher from comment #8)
> Can you get a dmesg output from 5.11 with or without the GPU driver loaded? 
> You can append `modprobe.blacklist=amdgpu 3` to the kernel command line in
> grub to disable the amdgpu driver and boot into runlevel 3 (no GUI).  From
> there you can get the dmesg output and try and manually load the amdgpu
> driver (modprobe amdgpu).

I did this, I still end up on a black screen with a white cursor in the top left. I will try doing the git bisect later when I have some time
Comment 10 Alex Deucher 2021-03-10 06:54:40 UTC
(In reply to Ryan from comment #9)
> I did this, I still end up on a black screen with a white cursor in the top
> left. I will try doing the git bisect later when I have some time

So it doesn't boot even with amdgpu blacklisted?
Comment 11 Ryan 2021-03-10 06:56:26 UTC
nope, I still had to revert to the 5.10.22 kernel to get it to boot again. oh I'm also using EFI boot btw, so even after telling it to boot in text mode in grub it still seems to be trying to use the framebuffer.
Comment 12 Borislav Petkov 2021-03-10 07:10:41 UTC
You can try to narrow it down to see which kernel caused it by booting v5.11 upstream, v5.11.1, v5.11.2 etc the stable ones. And then you can go back and try v5.11-rc7, 5.11-rc6... you get the idea. This way you can slim down the range you'd need to bisect. The usual experience I have is that -rc1 is almost always introducing the regression, which would mean, v5.10 would boot on your machine and v5.11-rc1 would not. But you'd have to experiment to actually see which two release candidates are good and bad and start bisecting there.

Don't hesitate to ask if there are questions still open.

HTH.
Comment 13 Ryan 2021-03-10 07:18:38 UTC
(In reply to Borislav Petkov from comment #12)
> You can try to narrow it down to see which kernel caused it by booting v5.11
> upstream, v5.11.1, v5.11.2 etc the stable ones. And then you can go back and
> try v5.11-rc7, 5.11-rc6... you get the idea. This way you can slim down the
> range you'd need to bisect. The usual experience I have is that -rc1 is
> almost always introducing the regression, which would mean, v5.10 would boot
> on your machine and v5.11-rc1 would not. But you'd have to experiment to
> actually see which two release candidates are good and bad and start
> bisecting there.
> 
> Don't hesitate to ask if there are questions still open.
> 
> HTH.

Thanks for that, I was thinking it would probably require going back up the tree to pin it down. I will try and get to it soon, currently at work, might not have enough free time to work through this until the weekend.
Comment 14 Borislav Petkov 2021-03-10 08:35:43 UTC
No worries, whenever. Good luck! :)
Comment 15 Ryan 2021-03-11 05:04:07 UTC
So I decided to start with the lowest possible option (RC1) and work up from there if it was working, I've compiled 5.11-RC1 and I still have the problem, so that significantly reduces the testing footprint i guess :-D So as far as it goes, 5.11-RC1 = BAD, 5.10.x = Good. When I have more time, I will try a git bisect...
Comment 16 Borislav Petkov 2021-03-11 10:43:05 UTC
Yap, that confirms my speculation that -rc1 broke it. That's a good start.
Comment 17 rollingrock 2021-03-12 21:07:56 UTC
Created attachment 295821 [details]
Dmesg output for kernel 5.10.19 in my computer.

I've tried to get dmesg output for kernels 5.11 and 5.10 on my computer. It was only possible for kernel 5.10, and the attached file is that output. I hope that it's useful.
Comment 18 Ryan 2021-03-13 06:23:07 UTC
not sure if this will help narrow it down, but I have been able to manipulate 5.11 into almost booting by passing the kernel arg acpi=off. I am at least able to get to a point where it gets to systemd-udev after initialising the gpu and then it stops processing any further beyond that. I saw this suggestion from someone with a Dell latitude in the Fedora test day results page who was having issues booting, however they have an Intel i5 processor in that laptop.
Comment 19 Ryan 2021-03-13 06:26:08 UTC
it was a Dell Latitude E5570
Comment 20 rollingrock 2021-03-13 20:07:08 UTC
(In reply to Ryan from comment #19)
> it was a Dell Latitude E5570

When I pass the kernel arg acpi=off to kernel 5.11, an emergency console appears after some system messages. When I use lsmod I see that the graphics modules amdgpu and drm are not loaded at the point where the boot process has stopped. 'dmesg | less' shows the same thing: the graphics initialization that you can see in the dmesg output of kernel 5.10 is missing.
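
The lsmod check described above can be scripted. A minimal sketch, assuming the usual lsmod layout with the module name in the first column (the helper name module_loaded is illustrative):

```shell
# Succeed if the named module appears in lsmod-style text read from stdin.
# $1 = module name
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit (found ? 0 : 1) }'
}

# Example usage from the emergency console:
#   for mod in amdgpu drm; do
#       if lsmod | module_loaded "$mod"; then
#           echo "$mod: loaded"
#       else
#           echo "$mod: not loaded"
#       fi
#   done
```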
Comment 21 John Shand 2021-03-14 09:42:39 UTC
(In reply to Alex Deucher from comment #10)
> (In reply to Ryan from comment #9)
> > I did this, I still end up on a black screen with a white cursor in the top
> > left. I will try doing the git bisect later when I have some time
> 
> So it doesn't boot even with amdgpu blacklisted?

i've even tried booting with nomodeset and it still didn't work
Comment 22 rollingrock 2021-03-15 23:46:40 UTC
I've found more kernel parameters whose results are very similar to the results of acpi=off . These are the parameters:
noapic
nosmp
maxcpus=0
acpi=noirq
They're explained in https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html . These kernel parameters may give more clues.
Comment 23 Joerg Roedel 2021-03-16 12:42:13 UTC
Can you please test whether a kernel built from the branch below fixes the issue?

Branch: https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=iommu-fixes
Comment 24 Ryan 2021-03-17 04:52:51 UTC
(In reply to Joerg Roedel from comment #23)
> Can you please test whether a kernel built from the branch below fixes the
> issue?
> 
> Branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=iommu-fixes

Amazing, works for me :) Thanks 👍
Comment 25 John Shand 2021-03-17 06:45:38 UTC
Created attachment 295889 [details]
attachment-14492-0.html

Does this fix change anything in regards to the safe function of the os

Comment 26 Ryan 2021-03-17 07:14:34 UTC
(In reply to John Shand from comment #25)
> Created attachment 295889 [details]
> attachment-14492-0.html
> 
> Does this fix change anything in regards to the safe function of the os

If you read the commit history on the branch, it takes all the work up to 5.12-rc3 and updates the AMD IOMMU handling so that it's actually evaluated properly. It ensures that the kernel properly sets IOMMU to disabled for the STONEY chipset. The other commits clean up some handling variables when IOMMU is disabled. As far as it goes, this just makes sure that when IOMMU is disabled, it gets disabled properly. I believe that it should have no bearing on the 'safe function' of the OS; it makes sure that when required or requested, IOMMU is disabled properly.
Comment 27 Joerg Roedel 2021-03-17 09:00:45 UTC
(In reply to John Shand from comment #25)
> Created attachment 295889 [details]
> attachment-14492-0.html
> 
> Does this fix change anything in regards to the safe function of the os

No, the patches fix two issues in IOMMU initialization code. The safe function of the OS is in no way harmed by these changes.
Comment 28 Ruslan Rudenko 2021-03-17 16:38:21 UTC
(In reply to Joerg Roedel from comment #27)
> (In reply to John Shand from comment #25)
> > Created attachment 295889 [details]
> > attachment-14492-0.html
> > 
> > Does this fix change anything in regards to the safe function of the os
> 
> No, the patches fix two issues in IOMMU initialization code. The safe
> function of the OS is in no way harmed by these changes.

Will it be possible to include those patches into mainline 5.12-rc4 kernel, I wonder?
Comment 29 Joerg Roedel 2021-03-17 16:41:58 UTC
(In reply to Ruslan Rudenko from comment #28)

> Will it be possible to include those patches into mainline 5.12-rc4 kernel,
> I wonder?

Possible. I plan to send them upstream asap. They are also tagged for stable, so once in mainline Linux they will also get backported to the next 5.11 stable release.
Comment 30 Ruslan Rudenko 2021-03-17 16:48:30 UTC
(In reply to Joerg Roedel from comment #29)
> (In reply to Ruslan Rudenko from comment #28)
> 
> > Will it be possible to include those patches into mainline 5.12-rc4 kernel,
> > I wonder?
> 
> Possible. I plan to send them upstream asap. They are also tagged for
> stable, so once in mainline Linux they will also get backported to the next
> 5.11 stable release.

This is awesome news. I thank you very much for your work!
Comment 31 Ryan 2021-03-18 03:43:11 UTC
a workaround for those of us running affected AMD hardware:

set kernel boot option `iommu=soft`

as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt this pushes IOMMU handling to software instead of trying hardware IOMMU.
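
A quick way to confirm the workaround is actually in effect after booting is to look for the option on the running kernel's command line. A minimal sketch (the helper name cmdline_has is made up for illustration):

```shell
# Succeed if a parameter appears as a whole word on a kernel command line.
# $1 = parameter, e.g. iommu=soft
# $2 = command line text; defaults to the running kernel's /proc/cmdline
cmdline_has() {
    line=${2:-$(cat /proc/cmdline)}
    case " $line " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage:
#   if cmdline_has iommu=soft; then
#       echo "iommu=soft is active"
#   else
#       echo "iommu=soft is NOT on the command line"
#   fi
```

To make the option persistent, Fedora's grubby tool can add it with `grubby --update-kernel=ALL --args="iommu=soft"`; on GRUB setups such as Manjaro's it is typically added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub before regenerating the GRUB configuration. Neither step is taken from this thread, so check your distribution's documentation.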
Comment 32 Ruslan Rudenko 2021-03-18 04:07:13 UTC
(In reply to Ryan from comment #31)
> a workaround for those of us running affected AMD hardware:
> 
> set kernel boot option `iommu=soft`
> 
> as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
> this pushes IOMMU handling to software instead of trying hardware IOMMU.

Should this workaround be effective in trying current 5.12-rc kernels as well?
Comment 33 Ryan 2021-03-18 04:25:31 UTC
(In reply to Ruslan Rudenko from comment #32)
> (In reply to Ryan from comment #31)
> > a workaround for those of us running affected AMD hardware:
> > 
> > set kernel boot option `iommu=soft`
> > 
> > as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
> > this pushes IOMMU handling to software instead of trying hardware IOMMU.
> 
> Should this workaround be effective in trying current 5.12-rc kernels as
> well?

I believe it will, but I haven't tested it as I am running Fedora 33 and the latest available kernel release is 5.11.7 (soon to be in testing updates), So unless I specifically compile a 5.12 kernel (pre-patch) or run one of the Fedora 34 development kernels I can't confirm. If you already have a 5.12 kernel installed, you should be able to test it.
Comment 34 Ryan 2021-03-18 05:16:16 UTC
correction: 5.12 is a Fedora 35 development kernel

Meanwhile I am in the process of compiling 5.12-rc3. I will let you know the results with iommu=soft although I imagine it will be much the same as 5.11, please stand by
Comment 35 Ryan 2021-03-18 06:44:53 UTC
I can confirm i can boot fine with 5.12-rc3 with `iommu=soft` set
Comment 36 rollingrock 2021-03-18 09:55:38 UTC
(In reply to Joerg Roedel from comment #23)
> Can you please test whether a kernel built from the branch below fixes the
> issue?
> 
> Branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=iommu-fixes

It works for me too! :-) Thank you very much!!
Comment 37 Ruslan Rudenko 2021-03-18 10:00:25 UTC
(In reply to Ryan from comment #33)
> (In reply to Ruslan Rudenko from comment #32)
> > (In reply to Ryan from comment #31)
> > > a workaround for those of us running affected AMD hardware:
> > > 
> > > set kernel boot option `iommu=soft`
> > > 
> > > as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
> > > this pushes IOMMU handling to software instead of trying hardware IOMMU.
> > 
> > Should this workaround be effective in trying current 5.12-rc kernels as
> > well?
> 
> I believe it will, but I haven't tested it as I am running Fedora 33 and the
> latest available kernel release is 5.11.7 (soon to be in testing updates),
> So unless I specifically compile a 5.12 kernel (pre-patch) or run one of the
> Fedora 34 development kernels I can't confirm. If you already have a 5.12
> kernel installed, you should be able to test it.

Got it, will try to test it this or next evening while installing latest vanilla 5.12-rc kernel from mainline.
Comment 38 Borislav Petkov 2021-03-18 10:02:16 UTC
joro, please add the commit IDs here for future reference once they're available. Assigning to you for closing afterwards.

Thx to everyone involved!
Comment 39 Ruslan Rudenko 2021-03-18 14:51:13 UTC
(In reply to Ruslan Rudenko from comment #37)
> (In reply to Ryan from comment #33)
> > (In reply to Ruslan Rudenko from comment #32)
> > > (In reply to Ryan from comment #31)
> > > > a workaround for those of us running affected AMD hardware:
> > > > 
> > > > set kernel boot option `iommu=soft`
> > > > 
> > > > as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
> > > > this pushes IOMMU handling to software instead of trying hardware IOMMU.
> > > 
> > > Should this workaround be effective in trying current 5.12-rc kernels as
> > > well?
> > 
> > I believe it will, but I haven't tested it as I am running Fedora 33 and the
> > latest available kernel release is 5.11.7 (soon to be in testing updates),
> > So unless I specifically compile a 5.12 kernel (pre-patch) or run one of the
> > Fedora 34 development kernels I can't confirm. If you already have a 5.12
> > kernel installed, you should be able to test it.
> 
> Got it, will try to test it this or next evening while installing latest
> vanilla 5.12-rc kernel from mainline.

Gentlemen, I can confirm that with this "iommu=soft" I successfully booted into latest vanilla 5.12-rc3 kernel in my Fedora Rawhide setup with Stoney chip. Thanks to everyone for their hard work and tips!
Comment 40 John Shand 2021-03-18 20:20:30 UTC
i can also confirm the iommu=soft works.  it is nice to know you have choices on how to run iommu
Comment 41 rollingrock 2021-03-19 20:29:28 UTC
With the kernel option iommu=soft kernel 5.11 works perfectly in the Manjaro GNU/Linux system of my AMD-based HP laptop. It's a good trick while kernels are being corrected :-)
Comment 42 Ryan 2021-03-21 22:14:30 UTC
Looks like 5.12-rc4 contains the iommu-fixes. These have not been backported to 5.11 yet though.
Comment 43 John Shand 2021-03-22 05:57:45 UTC
i've been using the iommu=soft for a few days and i have noticed that the opensuse os runs more smoothly
Comment 44 Ryan 2021-03-24 23:17:40 UTC
As expected (after reading the changelog and noting the fixes have been backported) 5.11.9 fixes this issue. Thanks Joerg :)
Comment 45 Azamat S. Kalimoulline 2021-03-27 19:31:29 UTC
I have the same issue with the 5.11.9 kernel, but on the Renoir architecture. I have an AMD Ryzen 5 PRO 4650U with Radeon Graphics. It gets stuck in the same way, at loading the initial ramdisk. `modprobe.blacklist=amdgpu 3` didn't help it boot; it gets stuck the same way. The same goes for iommu=off and acpi=off.
Comment 46 Azamat S. Kalimoulline 2021-03-27 19:32:16 UTC
5.10.26 boots fine.
Comment 47 Azamat S. Kalimoulline 2021-03-27 19:35:31 UTC
I boot via EFI and I have no option to boot without it.
Comment 48 Joerg Roedel 2021-04-06 09:32:47 UTC
(In reply to Azamat S. Kalimoulline from comment #45)
> I have same issue in 5.11.9 kernel, but on Renoir architecture. I have AMD
> Ryzen 5 PRO 4650U with Radeon Graphics. Same stuck on loading initial
> ramdisk. `modprobe.blacklist=amdgpu 3` didn't help to boot. Same stuck. Also
> iommu=off and acpi=off too.

This is another issue, please open a separate bug report for it.
Comment 49 Joerg Roedel 2021-04-06 09:38:07 UTC
Fixes are upstream now as:

072a03e0a0b1 iommu/amd: Move Stoney Ridge check to detect_ivrs()
9f81ca8d1fd6 iommu/amd: Don't call early_amd_iommu_init() when AMD IOMMU is disabled
4b8ef157ca83 iommu/amd: Keep track of amd_iommu_irq_remap state


They are also backported to the 5.11 stable kernel for 5.11.9.
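
Based on the statement above, one rough way to tell whether a 5.11.y stable kernel already carries the backports is to compare its version against 5.11.9. The sketch below assumes a `sort` that supports `-V` (version sort) and only handles the 5.11.y series; the function name is hypothetical, and other series (for example 5.12-rc kernels) are deliberately rejected rather than guessed at.

```shell
# Return 0 if a 5.11.y version is 5.11.9 or later, 1 if it is older,
# and 2 if the version is not part of the 5.11.y series at all.
# $1 = kernel version string, e.g. the output of `uname -r`
stable_5_11_has_fix() {
    ver=${1%%-*}               # drop any suffix such as "-1-MANJARO"
    case "$ver" in
        5.11|5.11.*) ;;        # only the 5.11.y series is handled here
        *) return 2 ;;
    esac
    # The smaller of the two versions sorts first.
    lowest=$(printf '%s\n%s\n' "$ver" "5.11.9" | sort -V | head -n 1)
    [ "$lowest" = "5.11.9" ]
}

# Example usage:
#   if stable_5_11_has_fix "$(uname -r)"; then
#       echo "this 5.11.y kernel should include the IOMMU fixes"
#   fi
```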

Closing.
Comment 50 Azamat S. Kalimoulline 2023-05-10 16:54:27 UTC
Please check https://bugzilla.kernel.org/show_bug.cgi?id=216340 for details. My problem was solved with page_poison=0.