Bug 217637

Summary: unable to boot when monitor is attached
Product: Linux
Reporter: primalmotion (primalmotion)
Component: Kernel
Assignee: Virtual assignee for kernel bugs (linux-kernel)
Status: NEW
Severity: normal
CC: bagasdotme, jonathon.hall, pmenzel+bugzilla.kernel.org
Priority: P3
Hardware: Intel
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: photo of the trace; another picture of the issue; dmesg

Description primalmotion 2023-07-06 01:41:32 UTC
Created attachment 304554 [details]
photo of the trace

In the latest 6.3 and 6.4, it is impossible for me to boot my laptop if my DELL U2720Q monitor is plugged in (USB-C). I have to unplug it, then boot. As soon as the first second of boot went through, I can plug in my monitor and there is no issue afterward. There is no issue waking up after suspend. Only when it boots.

See the attached pictures of the trace. The trace itself seems random (at least to me :)). I tried several things, like removing any attached USB devices from the monitor built-in USB-hub, but that does not change anything. (there is a keyboard and trackpad attached).
Comment 1 primalmotion 2023-07-06 01:42:06 UTC
Created attachment 304555 [details]
another picture of the issue
Comment 2 Bagas Sanjaya 2023-07-06 02:59:20 UTC
(In reply to primalmotion from comment #0)
> Created attachment 304554 [details]
> photo of the trace
> 
> In the latest 6.3 and 6.4, it is impossible for me to boot my laptop if my
> DELL U2720Q monitor is plugged in (USB-C). I have to unplug it, then boot.
> As soon as the first second of boot went through, I can plug in my monitor
> and there is no issue afterward. There is no issue waking up after suspend.
> Only when it boots.
> 
> See the attached pictures of the trace. The trace itself seems random (at
> least to me :)). I tried several things, like removing any attached USB
> devices from the monitor built-in USB-hub, but that does not change
> anything. (there is a keyboard and trackpad attached).

Do you have this issue on v6.1? Can you attach dmesg instead?
Comment 3 primalmotion 2023-07-06 15:52:20 UTC
No the issue does not happen with 6.1. I'm not sure how to get the dmesg since it panics immediately after loading the kernel
Comment 4 Bagas Sanjaya 2023-07-06 23:56:33 UTC
(In reply to primalmotion from comment #3)
> No the issue does not happen with 6.1. I'm not sure how to get the dmesg
> since it panics immediately after loading the kernel

Then can you please bisect between v6.1 and v6.3?
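
For readers unfamiliar with the request: bisecting is a binary search over an ordered list of kernel builds. The sketch below is only an illustration of the idea, not the `git bisect` tool itself. The version list and the `is_bad` callback are hypothetical stand-ins for "boot this kernel with the monitor attached and see whether it panics".

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first version for which is_bad() is true.

    Assumes versions are ordered oldest to newest and that once a version
    is bad, every later one is bad too (the usual bisection assumption).
    """
    if not versions:
        raise ValueError("no versions to test")
    # Verify both endpoints first; bisecting is meaningless otherwise.
    if is_bad(versions[0]):
        raise ValueError(f"oldest version {versions[0]} is already bad")
    if not is_bad(versions[-1]):
        raise ValueError(f"newest version {versions[-1]} is not bad")

    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Hypothetical example: pretend the boot failure first appears in 6.3.
candidates = ["6.1", "6.2", "6.3", "6.4"]
result = first_bad(candidates, lambda v: v in {"6.3", "6.4"})
print(result)
```

Checking the endpoints before searching is a deliberate choice: if the supposedly good version also fails, the cause is probably outside the range being bisected.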
Comment 5 Bagas Sanjaya 2023-07-07 00:06:25 UTC
Can you also attach lspci and lsusb?
Comment 6 primalmotion 2023-07-07 00:57:49 UTC
> Then can you please bisect between v6.1 and v6.3?

The bisect is going to take a long time, and I'm not sure when I'll have the time to do it. I'll keep you posted.


> Can you also attach lspci and lsusb?

lspci: 


00:00.0 Host bridge: Intel Corporation Device 9b51
00:02.0 VGA compatible controller: Intel Corporation Comet Lake UHD Graphics (rev 04)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Comet Lake Thermal Subsytem
00:14.0 USB controller: Intel Corporation Comet Lake PCH-LP USB 3.1 xHCI Host Controller
00:14.2 RAM memory: Intel Corporation Comet Lake PCH-LP Shared SRAM
00:15.0 Serial bus controller: Intel Corporation Serial IO I2C Host Controller
00:1c.0 PCI bridge: Intel Corporation Device 02be (rev f0)
00:1c.7 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #8 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Comet Lake PCI Express Root Port #13 (rev f0)
00:1f.0 ISA bridge: Intel Corporation Comet Lake PCH-LP LPC Premium Controller/eSPI Controller
00:1f.3 Audio device: Intel Corporation Comet Lake PCH-LP cAVS
00:1f.4 SMBus: Intel Corporation Comet Lake PCH-LP SMBus Host Controller
00:1f.5 Serial bus controller: Intel Corporation Comet Lake SPI (flash) Controller
01:00.0 Network controller: Qualcomm Atheros AR9462 Wireless Network Adapter (rev 01)
02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO


lsusb: 

Bus 002 Device 002: ID 05e3:0749 Genesys Logic, Inc. SD Card Reader and Writer
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 081: ID 0451:82ff Texas Instruments, Inc. 
Bus 001 Device 079: ID 445a:1424 DZTECH DZ65RGBV3
Bus 001 Device 083: ID 0763:410e M-Audio AIR 192 14
Bus 001 Device 082: ID 05e3:0608 Genesys Logic, Inc. Hub
Bus 001 Device 080: ID 05ac:0265 Apple, Inc. Magic Trackpad 2
Bus 001 Device 078: ID 05e3:0608 Genesys Logic, Inc. Hub
Bus 001 Device 077: ID 0451:8442 Texas Instruments, Inc. 
Bus 001 Device 071: ID 04ca:300d Lite-On Technology Corp. Atheros AR3012 Bluetooth
Bus 001 Device 031: ID 20a0:42b2 Clay Logic Nitrokey 3
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Comment 7 Paul Menzel 2023-07-13 04:14:47 UTC
You seem to use Arch Linux, which might provide an easy way to test mainline releases and release candidates, so you can avoid building the Linux kernel yourself.
Comment 8 primalmotion 2023-08-23 23:09:20 UTC
Sorry it took so long, but I've finally started working on the bisect.
I will try to find the first packaged version in Arch that breaks, then bisect from there. I'll report back when I'm done.
Comment 9 primalmotion 2023-08-24 02:45:53 UTC
So... I went back as far as 6.0.10, and the panic now happens on every single version I tested. I'm a bit at a loss.
Comment 10 Paul Menzel 2023-08-24 05:35:02 UTC
Thank you for running the tests.

I lost track, can you please list the working and non-working Linux kernels?

Lastly, please attach the output of `dmesg` of a successful boot, where you plug in the monitor later.
Comment 11 primalmotion 2023-08-24 16:29:36 UTC
I'm sorry if I was unclear. What I meant is that I could not find any working kernel version. I tried most of the versions back to 6.0.10, and they all crash the same way now. Something else must have changed that triggers this crash in the kernel, but what? I tried downgrading linux-firmware to a version from April (I'm sure everything was working fine back then), but that did not help.
Comment 12 primalmotion 2023-08-24 16:30:06 UTC
Created attachment 304933 [details]
dmesg

This is the dmesg of a successful boot
Comment 13 primalmotion 2023-08-24 18:35:10 UTC
After talking to PureBoot's maintainer, it seems PureBoot may be the culprit. They have had several reports from other users who were unable to boot their devices with a 4K display attached. The maintainer plugged their laptop into a 4K TV and encountered the crash. I have downgraded PureBoot and will try again tonight when I have access to the offending monitor, then report back.
Comment 14 primalmotion 2023-08-25 00:43:47 UTC
OK, so downgrading to the previous version of PureBoot fixes the issue with the latest kernel version on Arch (6.4.11).

I'm not sure whether this kernel panic still counts as a valid kernel bug if it is caused by PureBoot, so I'll let you decide whether to close the issue. If you need more information from me, feel free to ask.

Thank you!
Comment 15 Paul Menzel 2023-08-26 06:24:02 UTC
I’d say that if the (system) firmware configures the hardware incorrectly, all bets are off. It’d be nice if the PureBoot folks could chime in and share their analysis. If they don’t, I’d close this issue for now.
Comment 16 Jonathon Hall 2023-08-29 12:35:10 UTC
Hi all, I'm the PureBoot developer at Purism.

This looks like a bug in PureBoot/Heads that originated from https://github.com/osresearch/heads/pull/1378. I've validated a fix in PureBoot; it will go out in PureBoot 28 (this week if there are no troubles in testing), and I'll open a PR upstream after that.

I think this can be closed here; any kernel change would only serve to defend against buggy firmware.

It appears that with a 4K display, the framebuffer memory isn't being properly marked as reserved one way or another. When booting with a 4K display, garbage briefly appears and is then overwritten by the framebuffer console. I believe this indicates that Linux is allocating memory within the framebuffer for general use, and the framebuffer console then overwrites and corrupts it.

memtest86+ also shows this. The test patterns show up on the display when the test reaches the framebuffer's location in memory, the framebuffer display overwrites part of them, and memtest86+ correctly reports this as a failure.

I'm not sure why this only occurs with 4K (memtest86+ passes with 1080p), but I don't plan on doing a deeper dive to find out, since we are moving to a different graphics initialization method (https://github.com/osresearch/heads/pull/1403, same is being done for Librems in PB 28). I've now validated this method with a 4K framebuffer.

Thank you all for testing and investigating this, and let me know if there is any other info I can provide.
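
The failure mode described in this comment, where the framebuffer overlaps memory that the OS treats as general-purpose RAM, can be illustrated with a small overlap check. This is a minimal sketch under stated assumptions, not a diagnosis of the PureBoot bug. The memory map, the framebuffer base address, the 16 MiB reserved size, and the tightly packed 4-bytes-per-pixel layout are all made up for illustration. The example map is deliberately built so that a 1080p framebuffer fits inside the reserved range while a 4K one does not; whether anything like this happened in PureBoot is not established in this thread.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    start: int  # first byte address (inclusive)
    end: int    # last byte address (inclusive)
    kind: str   # "usable" or "reserved"


def framebuffer_span(base: int, width: int, height: int,
                     bytes_per_pixel: int = 4) -> Tuple[int, int]:
    """Return (start, end) inclusive, assuming a tightly packed linear framebuffer."""
    size = width * height * bytes_per_pixel
    return base, base + size - 1


def usable_overlaps(memory_map: List[Region], fb_start: int,
                    fb_end: int) -> List[Tuple[int, int]]:
    """Return the parts of the framebuffer that fall inside usable RAM."""
    hits = []
    for region in memory_map:
        if region.kind != "usable":
            continue
        lo = max(region.start, fb_start)
        hi = min(region.end, fb_end)
        if lo <= hi:
            hits.append((lo, hi))
    return hits


# Made-up memory map: 16 MiB reserved at FB_BASE, usable RAM on both sides.
MIB = 1024 * 1024
FB_BASE = 0x8000_0000
example_map = [
    Region(0x0010_0000, FB_BASE - 1, "usable"),
    Region(FB_BASE, FB_BASE + 16 * MIB - 1, "reserved"),
    Region(FB_BASE + 16 * MIB, 0xFFFF_FFFF, "usable"),
]

fhd = usable_overlaps(example_map, *framebuffer_span(FB_BASE, 1920, 1080))
uhd = usable_overlaps(example_map, *framebuffer_span(FB_BASE, 3840, 2160))
print("1080p overlaps:", fhd)
print("4K overlaps:", [(hex(a), hex(b)) for a, b in uhd])
```

Any non-empty result means the OS could hand out pages that the display hardware is also scanning out, which is consistent with the corruption and the memtest86+ failures described above.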