Bug 216624 - The system freezes when it reaches the screen to ask password for LUKS
Status: NEEDINFO
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 blocking
Assignee: drivers_video-dri
Reported: 2022-10-25 11:28 UTC by Alireza Haghshenas
Modified: 2022-11-16 14:33 UTC
Kernel Version: 5.19.16
Regression: Yes

Description Alireza Haghshenas 2022-10-25 11:28:10 UTC
This is a desktop computer with an AMD Zen 4 CPU and an NVIDIA RTX 4090, so all components are recently released. I do not experience this problem on my Framework laptop (11th Gen Intel) with the same kernel version and with LUKS enabled.

The system works fine on 5.19.15 and I keep using the system by booting into that kernel.

I cannot collect more detailed logs: the system freezes before the hard drives are unlocked, so none of the logs are written.

Operating System: Fedora Linux 36
KDE Plasma Version: 5.25.5
KDE Frameworks Version: 5.99.0
Qt Version: 5.15.6
Kernel Version: 5.19.15-201.fc36.x86_64 (64-bit)
Graphics Platform: X11
Processors: 32 × AMD Ryzen 9 7950X 16-Core Processor
Memory: 30.5 GiB of RAM
Graphics Processor: NVIDIA Graphics Device/PCIe/SSE2
Manufacturer: ASUS
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-10-26 06:20:12 UTC
Well, this apparently is with a distro kernel and hence should be reported to the distro. But it's likely an upstream regression that should be fixed, so it's not totally wrong to file it here. That said, 5.19.y is EOL now. You might want to check if 6.0.y is broken as well (I expect it to show up in updates-testing for Fedora soon).
Comment 2 Alireza Haghshenas 2022-10-26 06:40:23 UTC
Thank you. I'll see if I can collect more information.
Comment 3 Artem S. Tashkinov 2022-10-27 08:51:23 UTC
There are not that many commits between the two, so you may want to perform regression testing using git bisect anyway. Whatever commit exists in 5.19.16 may exist in newer kernels as well.

I'd simply try installing 6.0.x kernel first, e.g. https://bodhi.fedoraproject.org/updates/FEDORA-2022-6cc700823d
Comment 4 Alireza Haghshenas 2022-10-30 04:37:41 UTC
Unfortunately the Kernel 6.0.5-200.fc36 has the same problem. This will be my first time compiling and debugging the Linux Kernel. I'll try to find some resources, but I'll also appreciate it if you share any links.

Thanks
Comment 5 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-10-30 05:22:03 UTC
(In reply to Alireza Haghshenas from comment #4)
> Unfortunately the Kernel 6.0.5-200.fc36 has the same problem.

In that case you might want to check first if current mainline shows the same problem, for example by using the kernels from rawhide or those from here:
https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories

If it shows the problem, the easiest way to find the cause is likely a bisection between 5.19.15 and 5.19.16. It won't be fixed in 5.19.y, but the change that causes your problem is likely the same in newer branches.
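
As a rough illustration of why a bisection is practical here, the following minimal Python sketch mimics the search that git bisect performs. The commit names, the history length of about 200 commits, and the culprit position are made-up placeholders, not the real contents of the 5.19.16 changelog; in practice "is_bad" means building and booting a kernel and checking whether the LUKS prompt hangs.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str],
              is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Return the first bad commit and the number of commits that were tested.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # lo: last known good, hi: first known bad
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1
        if is_bad(commits[mid]):
            hi = mid  # culprit is at mid or earlier
        else:
            lo = mid  # culprit is after mid
    return commits[hi], tested


# Hypothetical linear history between the two tags (placeholder names).
commits = ["v5.19.15"] + [f"commit-{i}" for i in range(1, 200)] + ["v5.19.16"]
culprit_index = 137  # assumption, purely for the demonstration

culprit, builds = first_bad(
    commits, lambda name: commits.index(name) >= culprit_index
)
print(f"first bad commit: {culprit}, kernels built and tested: {builds}")
```

The number of test builds grows with the logarithm of the number of commits, so even a few hundred commits need only a handful of compile-and-boot cycles.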

And make sure not to use Nvidia's proprietary driver in any of those tests; most kernel devs don't care about problems when it's loaded.

This guide will help you to create a configuration that compiles relatively quickly; it's old (I'm working on something newer, but it's not finished yet), but should still work:
http://www.h-online.com/open/features/Good-and-quick-kernel-configuration-creation-1403046.html
Comment 6 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-10-30 06:17:44 UTC
FWIW, I briefly looked into the changelog of 5.19.16 (https://lwn.net/Articles/911274/) and have no real idea what might cause this, except the dwc3 changes that are known to cause problems. It's a very wild guess, but maybe try to disconnect any USB devices and see if it helps.
Comment 7 Alireza Haghshenas 2022-10-30 09:09:15 UTC
Thank you very much for the links and for looking into the changes. I disconnected all USB devices except the keyboard's dongle (a Logitech unified receiver) and I still have the same problem. I cannot test by removing that dongle as I have no way of typing my LUKS password.

I'll follow your links and see if I can find the cause.

Thanks
Comment 8 Alireza Haghshenas 2022-10-31 06:02:00 UTC
Some interesting observations:

I installed the mainline vanilla and rawhide kernels. So now I have:

kernel.x86_64        5.19.14-200.fc36                                   @updates
kernel.x86_64        5.19.15-201.fc36                                   @updates
kernel.x86_64        5.19.16-200.fc36                                   @updates
kernel.x86_64        6.0.5-200.fc36                                     @updates
kernel.x86_64        6.1.0-0.rc2.20221028git23758867219c.24.fc38        @rawhide

I should actually have two 6.1.0-0.rc2 kernels, with slight differences in behavior. Please bear with me.

The behavior is the same in all kernels after 5.19.15: they hang at the LUKS password screen. But I made an additional observation as well.
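
To make the "after 5.19.15" boundary explicit, here is a small Python sketch that sorts the installed kernels listed above into "boots" and "hangs" by comparing version numbers. The helper name kernel_key is an illustrative choice, and the classification only restates what was observed with the monitor on the NVIDIA card.

```python
import re
from typing import Dict, Tuple


def kernel_key(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a release such as '5.19.16-200.fc36'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


# Last kernel reported to work when the monitor is on the NVIDIA card.
LAST_GOOD = kernel_key("5.19.15-201.fc36")

installed = [
    "5.19.14-200.fc36",
    "5.19.15-201.fc36",
    "5.19.16-200.fc36",
    "6.0.5-200.fc36",
    "6.1.0-0.rc2.20221028git23758867219c.24.fc38",
]

# Tuples compare element by element, so (6, 0, 5) > (5, 19, 15).
status: Dict[str, str] = {
    release: "hangs" if kernel_key(release) > LAST_GOOD else "boots"
    for release in installed
}

for release, state in status.items():
    print(f"{release}: {state}")
```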

I plugged the monitor into the HDMI port on the motherboard (which is connected to the integrated AMD graphics). There is no hang at the LUKS screen; all kernels work fine here. But there is one interesting difference:

All kernels starting with 5.19.16 switch to a high-resolution display, while 5.19.15 stays on a low-resolution display. I cannot tell whether this is because 5.19.15 has been recompiled with the NVIDIA module or because of a real change in the kernel.

When connected to the onboard graphics, none of the Kernels load the GUI, which is expected: it is configured to use NVidia, but I'm connected to the onboard AMD.

When trying to install the NVIDIA drivers into the kernels, I got two different behaviours:

The Fedora and vanilla kernels show this error:

Error: An NVidia Kernel module 'nvidia-uvm' appears to already be loaded in your kernel. This may be because it is in use (for example, by an X server, a CUDA program, or the NVidia Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading. Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver. If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occurred that has corrupted an NVidia kernel module's usage count, for which the simplest remedy is to reboot your computer.

The rawhide kernels allow me to install the driver with just a warning that it is already installed (which is normal for the NVIDIA driver).

So the main interesting observation is that the kernels that hang when the monitor is connected to the NVIDIA card behave differently when it is connected to the AMD graphics: they switch the monitor to high resolution at the LUKS password entry. Also, the problem can be seen in the upstream kernel as well as in the Fedora ones.

I will proceed to read your other link to learn to bisect to the commit that caused this.

Thanks.
Comment 9 Mario Limonciello (AMD) 2022-10-31 19:13:46 UTC
To start out - we only focus on open source upstreamed kernel code.

So please remove nvidia.ko associated code from the system, and let's make sure you can reproduce this with nouveau.

If this issue only happens with nvidia.ko, you should report it to NVIDIA.
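
One way to confirm which driver a test boot actually used is to look at the list of loaded modules. The Python sketch below parses text in the format of /proc/modules, where the module name is the first field of each line. The sample text and its sizes are made up for illustration, and the two helper names are hypothetical.

```python
from typing import Iterable, Set


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Return the module names from text in /proc/modules format."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def gpu_driver_in_use(modules: Iterable[str]) -> str:
    """Classify the NVIDIA-related driver state from a set of module names."""
    names = set(modules)
    # The proprietary stack shows up as nvidia, nvidia_uvm, nvidia_drm, ...
    if any(n == "nvidia" or n.startswith("nvidia_") for n in names):
        return "proprietary"
    if "nouveau" in names:
        return "nouveau"
    return "none"


# Made-up sample; on a running system read the real file instead:
#     with open("/proc/modules") as fh: text = fh.read()
sample = (
    "nvidia_uvm 1400832 0 - Live 0x0000000000000000 (POE)\n"
    "nvidia 56000000 2 nvidia_uvm, Live 0x0000000000000000 (POE)\n"
    "xfs 1900000 1 - Live 0x0000000000000000\n"
)

print(gpu_driver_in_use(loaded_modules(sample)))
```

A result of "proprietary" means the test was not run on open source kernel code alone and would need to be repeated.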

If for some reason nouveau is unusable for your hardware, then please either:
1) remove the NVIDIA card from the system and let's see it reproduced with the integrated graphics.
or
2) boot with modprobe.blacklist=nouveau on the kernel command line so that it can use the framebuffer provided by the firmware.
Comment 10 Alireza Haghshenas 2022-11-12 21:18:09 UTC
For some reason, I'm not able to remove the driver.
I have reported this to Nvidia and Fedora.

I'm waiting for Fedora 37 (set to be released on Nov 15th) to do a clean re-install of Fedora. If that works properly, I will report my result and close this ticket.

Thanks
Comment 11 Alireza Haghshenas 2022-11-15 20:50:48 UTC
UPDATE: I have the same problem with the Fedora 37 live image, booting from USB: it starts correctly when the monitor is connected to the integrated AMD graphics, but not when connected to the NVIDIA card.

So the same problem exists without Nvidia's proprietary driver.
Comment 12 Mario Limonciello (AMD) 2022-11-16 02:36:30 UTC
Then it sounds like you're looking at a nouveau bug.  You may be able to boot with "nomodeset" to get by.
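
To verify that a workaround parameter really reached the kernel, the command line of the running system can be inspected. The Python sketch below parses a string in the format of /proc/cmdline and checks for "nomodeset" or for nouveau in a modprobe.blacklist= list. The sample command line (including the root= value) is invented, and the helper names are hypothetical.

```python
from typing import Dict, Optional


def cmdline_params(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def nouveau_kms_disabled(cmdline: str) -> bool:
    """True if nomodeset is set or nouveau is listed in modprobe.blacklist=."""
    params = cmdline_params(cmdline)
    blacklisted = (params.get("modprobe.blacklist") or "").split(",")
    return "nomodeset" in params or "nouveau" in blacklisted


# Invented sample; on a running system read the real file instead:
#     with open("/proc/cmdline") as fh: text = fh.read()
sample = (
    "BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.19.16-200.fc36.x86_64 "
    "root=/dev/mapper/example-root ro rhgb quiet nomodeset"
)

print(nouveau_kms_disabled(sample))
```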
Comment 13 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-11-16 14:33:16 UTC
FWIW, the nouveau developers might not see this report here, they want bugs filed here:
https://gitlab.freedesktop.org/drm/nouveau/-/issues

I'd suggest you file it there and make it obvious that it's a regression between 5.19.15 and 5.19.16 (it is, isn't it?) that continues to exist with 6.0.

A bisection would be ideal to find the cause, but with a bit of luck the nouveau developers have an idea what might be wrong.
