Bug 212133
Summary: | Kernel 5.11 crashes when it boots, it produces black screen. | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | rollingrock (david.capiz) |
Component: | x86-64 | Assignee: | Joerg Roedel (joro) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | alexdeucher, bp, david.capiz, fkrueger, im_dracula, joro, jshand2013, rslnrdnk666, savelov, turtle |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.11 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Dmesg output for kernel 5.10.19 in my computer. |
Description
rollingrock
2021-03-08 15:34:42 UTC
UPDATE: After today's Manjaro update, I now have kernel 5.11.2. This kernel still crashes and produces a black screen on my computer, but now, if I use the kernel parameter amdgpu.dc=0, I hear the fans while booting and I see the light for my computer's wired connection on my router, so I suppose that in this case kernel 5.11.2 is booting. It still produces a black screen, though, so I cannot see anything and I have to reboot my computer. With kernel 5.10, I've now seen that I no longer need the kernel parameter amdgpu.dc=0.

Do you have the latest firmware for the amdgpu driver? It should be a package called something like "amdgpu...linux...firmware" on your distro. Do you see anything when you boot with "nomodeset" on the kernel command line? Can you upload dmesg from a working kernel? Thx.

This is also impacting me on a DELL Inspiron 3180 11" which has the AMD A9 A420e processor with Radeon R5 graphics. I am running Fedora 33 on x86_64 architecture. I tested 'nomodeset', amdgpu.dc=0 and amdgpu.dc=1, both on their own and with 'nomodeset'. It makes no difference: I can get the grub boot menu, and then once the kernel is selected it switches to a blank screen with a white cursor (usually it will briefly flick to this screen and then flick to the DELL logo). I have LUKS encryption enabled on this laptop, and it doesn't even reach the prompt to decrypt the disk on 5.11 kernels. I can't do anything except press the power button (ctrl+alt+del and trying to switch ttys do nothing). Here is my dmesg from the latest 5.10.22 build kernel: https://paste.centos.org/view/8f3d9782 (expires in 23hrs)

Apologies, the CPU is an AMD A9-9420e.

+ Alex. Interesting, you both are using StoneyRidge-something GPUs. Not that it means something right now as we don't know where it fails during boot and debugging laptops is always a PITA, but still... The only thing I can think of is if you guys could bisect it, see "man git-bisect".
You take v5.10 upstream and verify that it works, then you take v5.11 and check whether it fails. If it does, you do the bisection as described in the manpage. Hopefully, we can pinpoint the faulty commit this way, *if* it is a kernel commit causing this and not something else during boot. But we don't know yet because there's no way to connect serial to those laptops or otherwise see debug output during boot. :-\

I guess one other thing you could try is to temporarily blacklist the amdgpu driver and see if it boots to text console then. If it explodes due to something else, then at least you will see it and perhaps even be able to take a picture... HTH.

Is this still an issue with 5.11.3?

(In reply to Alex Deucher from comment #6)
> Is this still an issue with 5.11.3?

Fedora's test week original kernel was 5.11.3. It had the same problem, as did 5.11.4.

Can you get a dmesg output from 5.11 with or without the GPU driver loaded? You can append `modprobe.blacklist=amdgpu 3` to the kernel command line in grub to disable the amdgpu driver and boot into runlevel 3 (no GUI). From there you can get the dmesg output and try and manually load the amdgpu driver (modprobe amdgpu). Bisecting would be ideal: https://www.kernel.org/doc/html/latest/admin-guide/bug-bisect.html

(In reply to Alex Deucher from comment #8)
> Can you get a dmesg output from 5.11 with or without the GPU driver loaded?
> You can append `modprobe.blacklist=amdgpu 3` to the kernel command line in
> grub to disable the amdgpu driver and boot into runlevel 3 (no GUI). From
> there you can get the dmesg output and try and manually load the amdgpu
> driver (modprobe amdgpu).

I did this, I still end up on a black screen with a white cursor in the top left. I will try doing the git bisect later when I have some time.

(In reply to Ryan from comment #9)
> I did this, I still end up on a black screen with a white cursor in the top
> left.
> I will try doing the git bisect later when I have some time

So it doesn't boot even with amdgpu blacklisted?

Nope, I still had to revert to the 5.10.22 kernel to get it to boot again. Oh, I'm also using EFI boot btw, so even after telling it to boot in text mode in grub it still seems to be trying to use the framebuffer.

You can try to narrow it down to see which kernel caused it by booting v5.11 upstream, v5.11.1, v5.11.2 etc the stable ones. And then you can go back and try v5.11-rc7, 5.11-rc6... you get the idea. This way you can slim down the range you'd need to bisect. The usual experience I have is that -rc1 is almost always introducing the regression, which would mean, v5.10 would boot on your machine and v5.11-rc1 would not. But you'd have to experiment to actually see which two release candidates are good and bad and start bisecting there.

Don't hesitate to ask if there are questions still open.

HTH.

(In reply to Borislav Petkov from comment #12)
> You can try to narrow it down to see which kernel caused it by booting v5.11
> upstream, v5.11.1, v5.11.2 etc the stable ones. And then you can go back and
> try v5.11-rc7, 5.11-rc6... you get the idea. This way you can slim down the
> range you'd need to bisect. The usual experience I have is that -rc1 is
> almost always introducing the regression, which would mean, v5.10 would boot
> on your machine and v5.11-rc1 would not. But you'd have to experiment to
> actually see which two release candidates are good and bad and start
> bisecting there.
>
> Don't hesitate to ask if there are questions still open.
>
> HTH.

Thanks for that, I was thinking it would probably require going back up the tree to pin it down. I will try and get to it soon, currently at work, might not have enough free time to work through this until the weekend.

No worries, whenever. Good luck!
:)

So I decided to start with the lowest possible option (RC1) and work up from there if it was working. I've compiled 5.11-RC1 and I still have the problem, so that significantly reduces the testing footprint I guess :-D

So as far as it goes, 5.11-RC1 = BAD, 5.10.x = Good. When I have more time, I will try a git bisect...

Yap, that confirms my speculation that -rc1 broke it. That's a good start.

Created attachment 295821 [details]
Dmesg output for kernel 5.10.19 in my computer.
I've tried to get dmesg output for kernels 5.11 and 5.10 on my computer. It was only possible for kernel 5.10, and the attached file is that output. I hope that it's useful.
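
The bisection suggested earlier in this thread, with v5.10 reported good and v5.11-rc1 reported bad, could be started roughly as sketched below. This is only a sketch: `start_bisect` is a hypothetical helper name (not part of git), it assumes it is run inside a clone of the mainline kernel tree, and the `GIT` variable exists only so the commands can be dry-run with `GIT=echo`.

```shell
#!/bin/sh
# Sketch: mark the two endpoints reported in this thread and let git pick
# the first commit to test. GIT defaults to the real git binary; setting
# GIT=echo prints the commands instead of running them (dry run).
GIT=${GIT:-git}

# start_bisect GOOD BAD  (hypothetical helper, not a git command)
start_bisect() {
    good=$1   # last tag known to boot, e.g. v5.10
    bad=$2    # first tag known to fail, e.g. v5.11-rc1
    "$GIT" bisect start || return 1
    "$GIT" bisect bad "$bad" || return 1
    "$GIT" bisect good "$good" || return 1
}
```

After `start_bisect v5.10 v5.11-rc1`, git checks out a commit in between. Build and boot that kernel, report the result with `git bisect good` or `git bisect bad`, and repeat; each round halves the remaining range. `git bisect log` shows the progress so far and `git bisect reset` returns to the original checkout once the first bad commit is printed.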
Not sure if this will help narrow it down, but I have been able to manipulate 5.11 into almost booting by passing the kernel arg acpi=off. I am at least able to get to a point where it gets to systemd-udev after initialising the GPU, and then it stops processing any further beyond that. I saw this suggestion from someone with a Dell Latitude in the Fedora test day results page who was having issues booting; however, they have an Intel i5 processor in that laptop.

it was a Dell Latitude E5570

(In reply to Ryan from comment #19)
> it was a Dell Latitude E5570

When I pass the kernel arg acpi=off to kernel 5.11, an emergency console appears after some system messages. When I use lsmod, I see that the graphics modules amdgpu and drm are not loaded at the point where the boot process has stopped, and 'dmesg | less' shows the same: the graphics initialization that you can see in the dmesg output of kernel 5.10 is missing.

(In reply to Alex Deucher from comment #10)
> (In reply to Ryan from comment #9)
> > I did this, I still end up on a black screen with a white cursor in the top
> > left. I will try doing the git bisect later when I have some time
>
> So it doesn't boot even with amdgpu blacklisted?

I've even tried booting with nomodeset and it still didn't work.

I've found more kernel parameters whose results are very similar to the results of acpi=off. These are the parameters: noapic, nosmp, maxcpus=0, acpi=noirq. They're explained in https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html and may give more clues.

Can you please test whether a kernel built from the branch below fixes the issue?

Branch: https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=iommu-fixes

(In reply to Joerg Roedel from comment #23)
> Can you please test whether a kernel built from the branch below fixes the
> issue?
>
> Branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=iommu-fixes

Amazing, works for me :) Thanks 👍

Created attachment 295889 [details]
attachment-14492-0.html

Does this fix change anything in regards to the safe function of the os

(In reply to John Shand from comment #25)
> Created attachment 295889 [details]
> attachment-14492-0.html
>
> Does this fix change anything in regards to the safe function of the os

If you read the commit history on the branch, it takes all the work up to 5.12-rc3 and updates the AMD IOMMU handling so that it's actually evaluated properly. It ensures that the kernel properly sets IOMMU to disabled for the STONEY chipset. The other commits clean up some handling variables when IOMMU is disabled.

As far as it goes, this just makes sure that when IOMMU is disabled, it gets disabled properly. I believe that it should have no bearing on the 'safe function' of the OS; it makes sure that when required or requested, IOMMU is disabled properly.

(In reply to John Shand from comment #25)
> Created attachment 295889 [details]
> attachment-14492-0.html
>
> Does this fix change anything in regards to the safe function of the os

No, the patches fix two issues in IOMMU initialization code. The safe function of the OS is in no way harmed by these changes.
(In reply to Joerg Roedel from comment #27)
> (In reply to John Shand from comment #25)
> > Created attachment 295889 [details]
> > attachment-14492-0.html
> >
> > Does this fix change anything in regards to the safe function of the os
>
> No, the patches fix two issues in IOMMU initialization code. The safe
> function of the OS is in no way harmed by these changes.

Will it be possible to include those patches into mainline 5.12-rc4 kernel, I wonder?

(In reply to Ruslan Rudenko from comment #28)
> Will it be possible to include those patches into mainline 5.12-rc4 kernel,
> I wonder?

Possible. I plan to send them upstream asap. They are also tagged for stable, so once in mainline Linux they will also get backported to the next 5.11 stable release.

(In reply to Joerg Roedel from comment #29)
> (In reply to Ruslan Rudenko from comment #28)
>
> > Will it be possible to include those patches into mainline 5.12-rc4 kernel,
> > I wonder?
>
> Possible. I plan to send them upstream asap. They are also tagged for
> stable, so once in mainline Linux they will also get backported to the next
> 5.11 stable release.

This is awesome news. I thank you very much for your work!

a workaround for those of us running affected AMD hardware:

set kernel boot option `iommu=soft`

as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
this pushes IOMMU handling to software instead of trying hardware IOMMU.

(In reply to Ryan from comment #31)
> a workaround for those of us running affected AMD hardware:
>
> set kernel boot option `iommu=soft`
>
> as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
> this pushes IOMMU handling to software instead of trying hardware IOMMU.

Should this workaround be effective in trying current 5.12-rc kernels as well?
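
For anyone wanting to keep the `iommu=soft` workaround across reboots instead of typing it at the boot menu each time, one piece of that is adding the option to an existing kernel command line without duplicating it. A minimal sketch, assuming a plain space-separated command line; `add_kernel_param` is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# Sketch: print CMDLINE with PARAM appended, unless PARAM is already there.
# add_kernel_param CMDLINE PARAM  (hypothetical helper)
add_kernel_param() {
    cmdline=$1
    param=$2
    # Pad with spaces so only whole, space-separated options match.
    case " $cmdline " in
        *" $param "*) printf '%s\n' "$cmdline" ;;         # already present
        "  ")         printf '%s\n' "$param" ;;           # empty command line
        *)            printf '%s\n' "$cmdline $param" ;;  # append
    esac
}
```

The resulting string would typically go into `GRUB_CMDLINE_LINUX` in `/etc/default/grub`, followed by regenerating the GRUB configuration; the exact file and regeneration command depend on the distribution. Once a kernel containing the fix is installed, the option can be removed again.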
(In reply to Ruslan Rudenko from comment #32)
> (In reply to Ryan from comment #31)
> > a workaround for those of us running affected AMD hardware:
> >
> > set kernel boot option `iommu=soft`
> >
> > as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
> > this pushes IOMMU handling to software instead of trying hardware IOMMU.
>
> Should this workaround be effective in trying current 5.12-rc kernels as
> well?

I believe it will, but I haven't tested it as I am running Fedora 33 and the latest available kernel release is 5.11.7 (soon to be in testing updates), so unless I specifically compile a 5.12 kernel (pre-patch) or run one of the Fedora 34 development kernels I can't confirm. If you already have a 5.12 kernel installed, you should be able to test it.

correction: 5.12 is a Fedora 35 development kernel

Meanwhile I am in the process of compiling 5.12-rc3. I will let you know the results with iommu=soft, although I imagine it will be much the same as 5.11. Please stand by.

I can confirm I can boot fine with 5.12-rc3 with `iommu=soft` set.

(In reply to Joerg Roedel from comment #23)
> Can you please test whether a kernel built from the branch below fixes the
> issue?
>
> Branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/joro/linux.git/log/?h=iommu-fixes

It works for me too! :-) Thank you very much!!

(In reply to Ryan from comment #33)
> (In reply to Ruslan Rudenko from comment #32)
> > (In reply to Ryan from comment #31)
> > > a workaround for those of us running affected AMD hardware:
> > >
> > > set kernel boot option `iommu=soft`
> > >
> > > as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
> > > this pushes IOMMU handling to software instead of trying hardware IOMMU.
> >
> > Should this workaround be effective in trying current 5.12-rc kernels as
> > well?
>
> I believe it will, but I haven't tested it as I am running Fedora 33 and the
> latest available kernel release is 5.11.7 (soon to be in testing updates),
> So unless I specifically compile a 5.12 kernel (pre-patch) or run one of the
> Fedora 34 development kernels I can't confirm. If you already have a 5.12
> kernel installed, you should be able to test it.

Got it, will try to test it this or next evening while installing latest vanilla 5.12-rc kernel from mainline.

joro, please add the commit IDs here for future reference once they're available. Assigning to you for closing afterwards. Thx to everyone involved!

(In reply to Ruslan Rudenko from comment #37)
> (In reply to Ryan from comment #33)
> > (In reply to Ruslan Rudenko from comment #32)
> > > (In reply to Ryan from comment #31)
> > > > a workaround for those of us running affected AMD hardware:
> > > >
> > > > set kernel boot option `iommu=soft`
> > > >
> > > > as per https://www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt
> > > > this pushes IOMMU handling to software instead of trying hardware IOMMU.
> > >
> > > Should this workaround be effective in trying current 5.12-rc kernels as
> > > well?
> >
> > I believe it will, but I haven't tested it as I am running Fedora 33 and the
> > latest available kernel release is 5.11.7 (soon to be in testing updates),
> > So unless I specifically compile a 5.12 kernel (pre-patch) or run one of the
> > Fedora 34 development kernels I can't confirm. If you already have a 5.12
> > kernel installed, you should be able to test it.
>
> Got it, will try to test it this or next evening while installing latest
> vanilla 5.12-rc kernel from mainline.

Gentlemen, I can confirm that with this "iommu=soft" I successfully booted into latest vanilla 5.12-rc3 kernel in my Fedora Rawhide setup with Stoney chip. Thanks to everyone for their hard work and tips!

I can also confirm that iommu=soft works.
It is nice to know you have choices on how to run iommu.

With the kernel option iommu=soft, kernel 5.11 works perfectly in the Manjaro GNU/Linux system of my AMD-based HP laptop. It's a good trick while kernels are being corrected :-)

Looks like 5.12-rc4 contains the iommu-fixes. These have not been backported to 5.11 yet though.

I've been using iommu=soft for a few days and I have noticed that the openSUSE OS runs more smoothly.

As expected (after reading the changelog and noting the fixes have been backported) 5.11.9 fixes this issue. Thanks Joerg :)

I have same issue in 5.11.9 kernel, but on Renoir architecture. I have AMD Ryzen 5 PRO 4650U with Radeon Graphics. Same stuck on loading initial ramdisk. `modprobe.blacklist=amdgpu 3` didn't help to boot. Same stuck. Also iommu=off and acpi=off too. 5.10.26 boots fine. I boot via efi and I have no option boot without it.

(In reply to Azamat S. Kalimoulline from comment #45)
> I have same issue in 5.11.9 kernel, but on Renoir architecture. I have AMD
> Ryzen 5 PRO 4650U with Radeon Graphics. Same stuck on loading initial
> ramdisk. `modprobe.blacklist=amdgpu 3` didn't help to boot. Same stuck. Also
> iommu=off and acpi=off too.

This is another issue, please open a separate bug report for it.

Fixes are upstream now as:

072a03e0a0b1 iommu/amd: Move Stoney Ridge check to detect_ivrs()
9f81ca8d1fd6 iommu/amd: Don't call early_amd_iommu_init() when AMD IOMMU is disabled
4b8ef157ca83 iommu/amd: Keep track of amd_iommu_irq_remap state

They are also backported to the 5.11 stable kernel for 5.11.9. Closing.

Please check https://bugzilla.kernel.org/show_bug.cgi?id=216340 for details. My problem solved with page_poison=0.
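
Going by the closing comment that the Stoney Ridge fixes were backported for 5.11.9, a quick way to tell whether a 5.11.y version string is new enough might look like the sketch below. `has_stoney_fix` is a hypothetical helper name; it only understands 5.11.y versions and knows nothing about distribution kernels that may carry the patches under an older version number.

```shell
#!/bin/sh
# Sketch: exit status 0 if VERSION is 5.11.N with N >= 9, 1 if N < 9,
# and 2 if VERSION is not a 5.11.y version (no answer given).
# has_stoney_fix VERSION  (hypothetical helper)
has_stoney_fix() {
    case $1 in
        5.11.*) ;;
        *) return 2 ;;
    esac
    patch=${1#5.11.}          # strip the "5.11." prefix
    patch=${patch%%[!0-9]*}   # keep leading digits, drop suffixes like "-200.fc33"
    [ -n "$patch" ] && [ "$patch" -ge 9 ]
}
```

For the running kernel this would be called as `has_stoney_fix "$(uname -r)"`. A status of 2 simply means the helper cannot judge that version; for mainline, the thread reports the fixes as present from 5.12-rc4.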