Bug 217776 - System (Xeon Nvidia) hangs at boot terminal after kernel 6.4.7
Status: RESOLVED ANSWERED
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P3 normal
Assignee: other_other
URL: https://gitlab.freedesktop.org/drm/no...
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-08 16:16 UTC by Peter Bottomley
Modified: 2023-08-31 09:52 UTC
CC List: 3 users

See Also:
Kernel Version: 6.4.7
Subsystem: drm/nouveau/disp
Regression: Yes
Bisected commit-id: f01f7f27ca06411fed0dc5b118c767591f9436ae


Attachments
attachment-8334-0.html (1.82 KB, text/html)
2023-08-09 11:25 UTC, peter
attachment-4825-0.html (314 bytes, text/html)
2023-08-10 14:45 UTC, peter

Description Peter Bottomley 2023-08-08 16:16:34 UTC
Kernel 6.4.6, compiled from source, works fine on my desktop with an Intel Xeon CPU and Nvidia graphics (see below for system specs).

Kernels 6.4.7 and 6.4.8, also compiled from source with identical configs, hang with a frozen boot terminal screen well into the boot sequence (e.g. whilst running /etc/profile). The system may still be running, as a sound is emitted when the power button is pressed, which is the only way to escape from the hang.

The issue seems to be specific to the hardware of this desktop, as the problem kernels boot through to completion on other machines.

A test with a different build (from Porteus) of kernel 6.5-rc4 did not hang, but kernel 6.4.7 from the same builder hung just like my build.

I apologise that I cannot provide any detailed diagnostics, but I can put diagnostics into /etc/profile and provide screenshots if requested.

Forum thread with more details and screenshots:
https://forum.puppylinux.com/viewtopic.php?p=95733#p95733

Computer Profile:
 Machine                    Dell Inc. Precision WorkStation T5400   (version: Not Specified)
 Mainboard                  Dell Inc. 0RW203 (version: NA)
 • BIOS                     Dell Inc. A11 | Date: 04/30/2012 | Type: Legacy
 • CPU                      Intel(R) Xeon(R) CPU E5450 @ 3.00GHz (4 cores)
 • RAM                      Total: 7955 MB | Used: 1555 MB (19.5%) | Actual Used: 775 MB (9.7%)
 Graphics                   Resolution: 1366x768 pixels | Display Server: X.Org 21.1.8
 • device-0                 NVIDIA Corporation GT218 [NVS 300] [10de:10d8] (rev a2)
 Audio                      ALSA
 • device-0                 Intel Corporation 631xESB/632xESB High Definition Audio Controller [8086:269a] (rev 09)
 • device-1                 NVIDIA Corporation High Definition Audio Controller [10de:0be3] (rev a1)
 Network                    wlan1
 • device-0                 Ethernet: Broadcom Inc. and subsidiaries NetXtreme BCM5754 Gigabit Ethernet PCI Express [14e4:167a] (rev 02)
Comment 1 Artem S. Tashkinov 2023-08-09 06:22:34 UTC
Your best chance of getting it fixed is performing regression testing:

https://docs.kernel.org/admin-guide/bug-bisect.html
Comment 2 Bagas Sanjaya 2023-08-09 08:38:49 UTC
(In reply to Peter Bottomley from comment #0)

Do you use Nouveau or NVIDIA driver?

Also, can you attach dmesg and system log output (like journalctl)?
Comment 3 peter 2023-08-09 11:25:00 UTC
Created attachment 304803
attachment-8334-0.html

On 09/08/2023 09:38, bugzilla-daemon@kernel.org wrote:
> --- Comment #2 from Bagas Sanjaya (bagasdotme@gmail.com) ---
> Do you use Nouveau or NVIDIA driver?
>
> Also, can you attach dmesg and system log output (like journalctl)?
>
Nouveau driver.

Sadly, the system never gets to the point where dmesg can be run. I'll see if
I can capture it before the system freezes.

6.4.9 has the same problem as 6.4.7 and 6.4.8.

I'm pretty sure it is a graphics problem of some sort.

Given that each kernel build takes c. 30 mins, the bug-bisect
regression testing suggestion is challenging, to say the least!
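
A rough sense of the bisection cost can be worked out as follows. This is a minimal sketch, assuming a bisection needs about log2(N) build-and-boot rounds for N candidate commits. The count of 200 commits is a hypothetical placeholder, not the real number of changes between 6.4.6 and 6.4.7, and the function names are made up for illustration; only the 30-minute build time comes from this comment.

```python
import math


def bisect_rounds(candidate_commits: int) -> int:
    """Approximate number of build-and-test rounds a bisection needs.

    Each round halves the remaining range, so roughly ceil(log2(N))
    rounds are needed for N candidate commits.
    """
    if candidate_commits < 1:
        raise ValueError("need at least one candidate commit")
    return math.ceil(math.log2(candidate_commits))


def bisect_minutes(candidate_commits: int, build_minutes: float = 30.0) -> float:
    """Estimated total build time, ignoring reboot and test time."""
    return bisect_rounds(candidate_commits) * build_minutes


if __name__ == "__main__":
    commits = 200  # hypothetical count, not taken from the stable tree
    print(bisect_rounds(commits), "rounds,", bisect_minutes(commits), "minutes")
```

Under these assumptions a range of around 200 commits needs about 8 rounds, i.e. roughly four hours of building, not counting reboots.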
Comment 4 Artem S. Tashkinov 2023-08-09 14:29:00 UTC
Boot without any options that hide kernel output, including "quiet".

If you see any kernel messages, try to capture them, e.g. take a photo and upload it here.
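
As a sketch of what removing such options means, the snippet below drops them from a kernel command line string. The set of options is an assumption: "quiet" is the one named above, "loglevel=" is another kernel parameter that limits console output, and "splash" and "rhgb" are commonly used by distributions to show a boot splash. The function name is hypothetical, and the bootloader on the affected system may use other options.

```python
# Options assumed to hide boot messages; adjust for the bootloader in use.
HIDING_FLAGS = {"quiet", "splash", "rhgb"}
HIDING_PREFIXES = ("loglevel=",)


def verbose_cmdline(cmdline: str) -> str:
    """Return the kernel command line without output-hiding options.

    Splits on whitespace, so quoted values containing spaces are not handled.
    """
    kept = [
        token
        for token in cmdline.split()
        if token not in HIDING_FLAGS and not token.startswith(HIDING_PREFIXES)
    ]
    return " ".join(kept)


if __name__ == "__main__":
    # Example command line; the real one is set in the bootloader config.
    print(verbose_cmdline("root=/dev/sda1 ro quiet splash loglevel=3"))
```

The resulting string is what would be passed to the kernel from the bootloader entry for the failing kernel.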
Comment 5 Bagas Sanjaya 2023-08-10 01:21:01 UTC
(In reply to peter from comment #3)

Can you also open an issue on the gitlab.freedesktop.org tracker [1]?

[1]: https://gitlab.freedesktop.org/drm/nouveau/-/issues
Comment 6 peter 2023-08-10 08:05:45 UTC
On 10/08/2023 02:21, bugzilla-daemon@kernel.org wrote:
> Can you also open issue at gitlab.freedesktop.org tracker [1]?
>
> [1]: https://gitlab.freedesktop.org/drm/nouveau/-/issues
>
https://lore.kernel.org/all/20230806213107.GFZNARG6moWpFuSJ9W@fat_crate.local/

identifies the cause of the issue,

which apparently comes from:

drm/nouveau/disp: PIOR DP uses GPIO for HPD, not PMGR AUX interrupts

https://cgit.freedesktop.org/drm-misc/commit/?h=drm-misc-fixes&id=2b5d1c29f6c4cb19369ef92881465e5ede75f4ef

which is a patch to:

..../drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c

Does this need to be reported further?
Comment 7 peter 2023-08-10 14:45:19 UTC
Created attachment 304813
attachment-4825-0.html

6.4.9 built with uconn.c from 6.4.6 builds, boots and runs fine.
Thanks everybody.
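
The workaround amounts to overwriting one file in the newer source tree with its copy from the 6.4.6 tree before building. A minimal sketch, assuming both trees are unpacked side by side: the helper name and the directory names in the usage note are hypothetical, and only the relative path of uconn.c is taken from this report.

```python
import shutil
from pathlib import Path

# Path of the file inside the kernel source tree, as given in this report.
UCONN = "drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c"


def restore_from_old_tree(old_tree: str, new_tree: str, rel_path: str = UCONN) -> Path:
    """Copy rel_path from old_tree over the same file in new_tree."""
    src = Path(old_tree) / rel_path
    dst = Path(new_tree) / rel_path
    if not src.is_file():
        raise FileNotFoundError(src)
    # Keep the replaced file next to it so the change is easy to undo.
    if dst.exists():
        shutil.copy2(dst, dst.with_name(dst.name + ".orig"))
    shutil.copy2(src, dst)
    return dst
```

For example, restore_from_old_tree("linux-6.4.6", "linux-6.4.9") would be run before the usual kernel build in the newer tree.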
Comment 8 Artem S. Tashkinov 2023-08-17 10:26:21 UTC
Moved here:

https://gitlab.freedesktop.org/drm/nouveau/-/issues/255
Comment 9 Artem S. Tashkinov 2023-08-18 11:23:31 UTC
Commit 1b254b791d7b7dea6e8adc887fbbd51746d8bb27 should fix this.
Comment 10 Peter Bottomley 2023-08-19 06:45:32 UTC
Let's hope so.
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/gpu/drm/nouveau?h=next-20230818&id=1b254b791d7b7dea6e8adc887fbbd51746d8bb27

says:
It might not fix all regressions from commit 2b5d1c29f6c4
("drm/nouveau/disp: PIOR DP uses GPIO for HPD, not PMGR AUX interrupts"),
but at least it fixes a memory corruption in error handling related to
that commit.
Comment 11 Peter Bottomley 2023-08-19 09:17:30 UTC
Sadly, I built 6.4.11 with the patched drivers/gpu/drm/nouveau/nouveau_connector.c
and it still hangs on boot.
Comment 12 Artem S. Tashkinov 2023-08-19 09:48:18 UTC
To communicate with nouveau developers please use their bug tracker instead.

Your comments here are not sent to anyone.

Thank you.
Comment 13 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-08-31 09:45:41 UTC
FWIW, does the latest 6.4.y now work for you? It might, thanks to this change:
https://lore.kernel.org/all/20230821194136.393887865@linuxfoundation.org/
Comment 14 Peter Bottomley 2023-08-31 09:52:50 UTC
Sadly no: neither 6.4.12 nor 6.5 boots.

If I revert /drivers/gpu/drm/nouveau/nvkm/engine/disp/uconn.c to its 6.4.6 version
and build the kernel with that, then it boots.

https://gitlab.freedesktop.org/drm/nouveau/-/issues/255
is where the issue is being tracked.
