Bug 216859 - PCI bridge to bus boot hang at enumeration
Summary: PCI bridge to bus boot hang at enumeration
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
Reported: 2022-12-28 08:37 UTC by Zeno R.R. Davatz
Modified: 2023-02-27 15:00 UTC
Kernel Version: 6.1-rc1
Regression: Yes


Attachments
Screenshot of the PCI enumeration hang (1.43 MB, image/jpeg)
2022-12-28 08:37 UTC, Zeno R.R. Davatz
Video of the PCI enumeration hang (2.78 MB, video/quicktime)
2022-12-28 08:39 UTC, Zeno R.R. Davatz
lspci -vv of Kernel 6.0 (26.99 KB, text/plain)
2022-12-28 08:44 UTC, Zeno R.R. Davatz
dmesg output of Kernel 6.0 (80.42 KB, text/plain)
2022-12-28 08:47 UTC, Zeno R.R. Davatz
boot with ignore_loglevel initcall_debug (4.13 MB, image/jpeg)
2023-01-06 16:31 UTC, Zeno R.R. Davatz
First bisect hang (4.24 MB, image/jpeg)
2023-01-10 10:48 UTC, Zeno R.R. Davatz
.config of Zeno (124.69 KB, text/plain)
2023-01-14 16:46 UTC, Zeno R.R. Davatz
revert 145eed48de27 (nvidia only) (945 bytes, patch)
2023-01-24 21:18 UTC, Bjorn Helgaas
sudo lspci -vvv (62.27 KB, text/plain)
2023-01-25 09:56 UTC, Zeno R.R. Davatz
Patch by Dave that works great (5.43 KB, application/mbox)
2023-02-06 08:54 UTC, Zeno R.R. Davatz

Description Zeno R.R. Davatz 2022-12-28 08:37:52 UTC
Created attachment 303492 [details]
Screenshot of the PCI enumeration hang

With Kernel 6.1-rc1 the enumeration process stopped working for me,
see attachments.

The enumeration works fine with Kernel 6.0 and below.

The same problem still exists with v6.1 and v6.2-rc1.

Please see attachments.
Comment 1 Zeno R.R. Davatz 2022-12-28 08:39:31 UTC
Created attachment 303493 [details]
Video of the PCI enumeration hang

If you look at this short video, the boot process does NOT continue. It just hangs in the state shown at the end of the video.
Comment 2 Zeno R.R. Davatz 2022-12-28 08:44:57 UTC
Created attachment 303494 [details]
lspci -vv of Kernel 6.0
Comment 3 Zeno R.R. Davatz 2022-12-28 08:47:47 UTC
Created attachment 303495 [details]
dmesg output of Kernel 6.0
Comment 4 Bruno Moreira-Guedes 2022-12-30 18:45:05 UTC
Hello, Zeno, besides happening after PCI enumeration, do you have any reason why you think this might be related to the PCI enumeration itself?

I happened to notice that other subsystems produce messages after that. I'm not saying it means we can rule out the possibility of being PCI-related, just that there might be other possibilities here.

Also, are you able to reproduce this bug on another system?
Comment 5 Zeno R.R. Davatz 2023-01-01 20:48:36 UTC
(In reply to Bruno Moreira-Guedes from comment #4)
> Hello, Zeno, besides happening after PCI enumeration, do you have any reason
> why you think this might be related to the PCI enumeration itself?

No, just what I see right before that hang.

> I happened to notice that other subsystems produce messages after that. I'm
> not saying it means we can rule out the possibility of being PCI-related,
> just that there might be other possibilities here.

Sure, that is totally possible.

> Also, are you able to reproduce this bug on another system?

No, currently not testing on another system.
Comment 6 Zeno R.R. Davatz 2023-01-06 16:29:41 UTC
Happy New Year!

Please find attached the output with the parameter 

"ignore_loglevel initcall_debug"

as requested by Bjorn.
Comment 7 Zeno R.R. Davatz 2023-01-06 16:31:25 UTC
Created attachment 303538 [details]
boot with ignore_loglevel initcall_debug

Boot with "ignore_loglevel initcall_debug"
Comment 8 Zeno R.R. Davatz 2023-01-06 16:37:15 UTC
And this is the video:

https://www.youtube.com/watch?v=uzhUxKteVJM

done with "ignore_loglevel initcall_debug"
Comment 9 Bjorn Helgaas 2023-01-06 16:49:14 UTC
Marking as a regression since this worked in v6.0 and fails in v6.1-rc1.

Thanks for the photo and video.  I have no idea what happened.  It looks like console output stopped in the middle of a line.  Log from v6.0 in comment #3 shows:

  pcieport 0000:00:1c.5: saving config space at offset 0x1c (reading 0xb0b0)

but the photo shows it stopped at:

  pcieport 0000:00:1c.5: saving config space at offset 0x1c (readin

If you want to slow down the output to make the video more readable, you can add "boot_delay=20" (or more if necessary).  But I doubt that will show us anything useful.

The only thing I can think of is to bisect it, which I'm sorry to say is a lot of work: https://docs.kernel.org/admin-guide/bug-bisect.html
Comment 10 Zeno R.R. Davatz 2023-01-06 17:09:00 UTC
Thanks!

Bisecting itself is not the problem. The bigger problem is that after every failed boot I have to repair the file system with "fsck.jfs /dev/sda2", and that takes time ;( because I have to boot from a USB stick.

How many bisects do you think I will have to do?
Comment 11 Bjorn Helgaas 2023-01-06 17:18:29 UTC
Looks like it would be about 13 builds/boots:

  11:17:06 ~/linux (wip/bjorn-junk)$ git bisect start
  11:17:11 ~/linux (wip/bjorn-junk|BISECTING)$ git bisect bad v6.1-rc1
  11:17:20 ~/linux (wip/bjorn-junk|BISECTING)$ git bisect good v6.0
  Bisecting: 6125 revisions left to test after this (roughly 13 steps)
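
The "roughly 13 steps" figure is in line with plain halving: each build/boot test cuts the candidate range about in half, and log2(6125) is a little under 13. A minimal sketch of that arithmetic follows; `bisect_steps` is a hypothetical helper, and git's own estimate may be computed differently.

```shell
# Hypothetical helper: count how many doublings are needed to cover
# N candidate revisions, i.e. the ceiling of log2(N).
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 6125   # prints 13
```

The same helper gives 12 for 3084 and 11 for 1522, so every good/bad verdict removes about one step from the remaining work.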
Comment 12 Zeno R.R. Davatz 2023-01-06 17:27:50 UTC
Ok, thanks for the info!
Comment 13 Zeno R.R. Davatz 2023-01-10 10:16:08 UTC
Ok, now testing:
sudo git checkout e0e492cebef25c13fc29b174f01b5178662f1652
Comment 14 Zeno R.R. Davatz 2023-01-10 10:48:17 UTC
Created attachment 303567 [details]
First bisect hang
Comment 15 Zeno R.R. Davatz 2023-01-10 10:57:19 UTC
I am trying to continue with bisecting but it suggests the same commit again:

/usr/src/linux> sudo git bisect bad e0e492cebef25c13fc29b174f01b5178662f1652
/usr/src/linux> sudo git bisect good v6.0
e0e492cebef25c13fc29b174f01b5178662f1652 was both good and bad

Any hints?
Comment 16 Zeno R.R. Davatz 2023-01-10 11:03:44 UTC
Running this again:

sudo git bisect bad v6.1-rc1

now gives me:

[18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Comment 17 Zeno R.R. Davatz 2023-01-10 11:51:56 UTC
that one was also bad. Now testing:

1c2daf52185bbc91421f0e84e6bf2706bb350cce
Comment 18 Zeno R.R. Davatz 2023-01-10 12:13:28 UTC
Ok these ones work:
v6.0
1c2daf52185bbc91421f0e84e6bf2706bb350cce

these do not work and all result in the errors in the screenshots above:
18fd049731e67651009f316195da9281b756f2cf
e0e492cebef25c13fc29b174f01b5178662f1652
v6.1-rc1

I am now running on:
6.0.0-03077-g1c2daf52185b #117 SMP PREEMPT_DYNAMIC (uname -a)
Comment 19 Bjorn Helgaas 2023-01-10 14:38:22 UTC
You're doing it right; it's just a laborious process.  Here's what I see when plugging in your test results:

  08:25:22 ~/linux (main|BISECTING)$ git bisect bad v6.1-rc1
  08:25:29 ~/linux (main|BISECTING)$ git bisect good v6.0
  Bisecting: 6125 revisions left to test after this (roughly 13 steps)
  [18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
  08:25:43 ~/linux ((18fd049731e6...)|BISECTING)$ git bisect bad
  Bisecting: 3084 revisions left to test after this (roughly 12 steps)
  [1c2daf52185bbc91421f0e84e6bf2706bb350cce] Merge tag 'tag-chrome-platform-for-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
  08:26:30 ~/linux ((1c2daf52185b...)|BISECTING)$ git bisect good
  Bisecting: 1522 revisions left to test after this (roughly 11 steps)
  [7e6739b9336e61fe23ca4e2c8d1fda8f19f979bf] Merge tag 'drm-next-2022-10-05' of git://anongit.freedesktop.org/drm/drm

So far you've tested these:

  GOOD v6.0
  BAD  v6.1-rc1
    => problem is between v6.0 and v6.1-rc1
  BAD  18fd049731e6 ("Merge tag 'arm64-upstream' ...)
    => problem is between v6.0 and 18fd049731e6
  GOOD 1c2daf52185b ("Merge tag 'tag-chrome-platform-for-v6.1' ...)
    => problem is between 1c2daf52185b and 18fd049731e6
  ???  7e6739b9336e ("Merge tag 'drm-next-2022-10-05' ...)

To continue the bisect, 7e6739b9336e would be the next kernel to test.

You don't need to report all the intermediate steps unless you want to.  At any point, "git bisect log" will show you the kernels you've tested and the results.  You can just attach that log when you get to the end.
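
Because the log records every verdict, it can also serve as a checkpoint if a session gets confused partway through. A minimal sketch, assuming it is run inside the kernel tree during an active bisect; the function names and the log file path are arbitrary choices for illustration.

```shell
# Save the verdicts recorded so far to a file.
# $1 = path of the file to write
save_bisect_state() {
    git bisect log > "$1"
}

# Abandon the current session and rebuild it from a saved log.
# $1 = path of a file written by save_bisect_state
restore_bisect_state() {
    git bisect reset
    git bisect replay "$1"
}

# Typical use (not run here):
#   save_bisect_state ~/bisect.log
#   ...after a mistake...
#   restore_bisect_state ~/bisect.log
```

The saved file is plain text, so a wrong verdict can be removed with an editor before replaying it.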
Comment 20 Zeno R.R. Davatz 2023-01-10 18:05:33 UTC
this one was good: 7e6739b9336e
Comment 21 Zeno R.R. Davatz 2023-01-11 08:22:32 UTC
this one was bad: 1b929c02afd3
Comment 22 Zeno R.R. Davatz 2023-01-11 08:25:16 UTC
now testing: a09476668e30 Merge tag 'char-misc-6.1-rc1'
Comment 23 Zeno R.R. Davatz 2023-01-11 08:56:58 UTC
was bad, now testing:
[188943a15638ceb91f960e072ed7609b2d7f2a55] Merge tag 'fs-for_v6.1-rc1'
Comment 24 Zeno R.R. Davatz 2023-01-11 09:23:29 UTC
was bad, now testing:
[ff6862c23d2e83d12d1759bf4337d41248fb4dc8] Merge tag 'arm-drivers-6.1'
Comment 25 Zeno R.R. Davatz 2023-01-11 09:49:46 UTC
was bad, now testing:
[f7c91bf65388547f61888b7a67169966fc698ce1] ASoC: SOF: mediatek: mt8195: Add pcm_pointer callback
Comment 26 Zeno R.R. Davatz 2023-01-11 10:10:06 UTC
this one was good!

now testing: 
[40285e64c5654c956505dad34ed2ee4be163b1f0] Merge tag 'arm-defconfig-6.1'
Comment 27 Zeno R.R. Davatz 2023-01-11 10:50:17 UTC
this one was bad. 

Now testing:
[02f2e785c4834828876a4701926416157dfd7b26] Merge branch 'for-next'
Comment 28 Zeno R.R. Davatz 2023-01-11 15:56:33 UTC
This one was good.

now testing:
[ffc79c2097fd5e954d99dfeaaa5c0437a27a1ece] Merge tag 'aspeed-6.1-defconfig'
Comment 29 Zeno R.R. Davatz 2023-01-11 16:42:17 UTC
this one was good.

now testing:
[d488b28502d7c22b1b50f0543da119748e575919] Fix PM disable depth imbalance in probe
Comment 30 Zeno R.R. Davatz 2023-01-11 17:08:36 UTC
this one was good.

Now testing:
[e66372ecb80dc5179c7abb880229c7452e813d15] ARM: 9246/1: dump: show page table level name
Comment 31 Zeno R.R. Davatz 2023-01-11 18:10:03 UTC
this one was good.

now testing:
f0c8d7468af0 ASoC: rockchip: i2s: use regmap_read_poll_timeout_atomic to poll I2S_CLR
Comment 32 Zeno R.R. Davatz 2023-01-12 10:02:48 UTC
this one was good.

Now testing:
[7782aae498b92f124267b366293100d121fe0f56] Merge tag 'for-linus'
Comment 33 Zeno R.R. Davatz 2023-01-12 10:35:26 UTC
This one was bad!

Now testing:
[833477fce7a14d43ae4c07f8ddc32fa5119471a2] Merge tag 'sound-6.1-rc1'
Comment 34 Zeno R.R. Davatz 2023-01-12 10:45:49 UTC
this one was bad.

Now testing:
[86a4d29e75540e20f991e72f17aa51d0e775a397] Merge tag 'asoc-v6.1'
Comment 35 Zeno R.R. Davatz 2023-01-12 11:10:33 UTC
the full monty:

/usr/src/linux> git bisect log
git bisect start
# good: [7e6739b9336e61fe23ca4e2c8d1fda8f19f979bf] Merge tag 'drm-next-2022-10-05' of git://anongit.freedesktop.org/drm/drm
git bisect good 7e6739b9336e61fe23ca4e2c8d1fda8f19f979bf
# bad: [9abf2313adc1ca1b6180c508c25f22f9395cc780] Linux 6.1-rc1
git bisect bad e0e492cebef25c13fc29b174f01b5178662f1652
# good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
git bisect good 45eb8ae5370d5df1ee8236f45df3f29103ba6e12
# bad: [18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect bad 18fd049731e67651009f316195da9281b756f2cf
# bad: [1b929c02afd37871d5afb9d498426f83432e71c2] Linux 6.2-rc1
git bisect bad 1b929c02afd37871d5afb9d498426f83432e71c2
# bad: [9abf2313adc1ca1b6180c508c25f22f9395cc780] Linux 6.1-rc1
git bisect bad e0e492cebef25c13fc29b174f01b5178662f1652
# good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
git bisect good 45eb8ae5370d5df1ee8236f45df3f29103ba6e12
# bad: [a09476668e3016ea4a7b0a7ebd02f44e0546c12c] Merge tag 'char-misc-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad a09476668e3016ea4a7b0a7ebd02f44e0546c12c
# bad: [188943a15638ceb91f960e072ed7609b2d7f2a55] Merge tag 'fs-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad 188943a15638ceb91f960e072ed7609b2d7f2a55
# bad: [188943a15638ceb91f960e072ed7609b2d7f2a55] Merge tag 'fs-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad 188943a15638ceb91f960e072ed7609b2d7f2a55
# bad: [ff6862c23d2e83d12d1759bf4337d41248fb4dc8] Merge tag 'arm-drivers-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad ff6862c23d2e83d12d1759bf4337d41248fb4dc8
# good: [f7c91bf65388547f61888b7a67169966fc698ce1] ASoC: SOF: mediatek: mt8195: Add pcm_pointer callback
git bisect good f7c91bf65388547f61888b7a67169966fc698ce1
# bad: [40285e64c5654c956505dad34ed2ee4be163b1f0] Merge tag 'arm-defconfig-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad 40285e64c5654c956505dad34ed2ee4be163b1f0
# good: [02f2e785c4834828876a4701926416157dfd7b26] Merge branch 'for-next' into for-linus
git bisect good 02f2e785c4834828876a4701926416157dfd7b26
# good: [ffc79c2097fd5e954d99dfeaaa5c0437a27a1ece] Merge tag 'aspeed-6.1-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/joel/bmc into arm/defconfig
git bisect good ffc79c2097fd5e954d99dfeaaa5c0437a27a1ece
# good: [d488b28502d7c22b1b50f0543da119748e575919] Fix PM disable depth imbalance in probe
git bisect good d488b28502d7c22b1b50f0543da119748e575919
# good: [e66372ecb80dc5179c7abb880229c7452e813d15] ARM: 9246/1: dump: show page table level name
git bisect good e66372ecb80dc5179c7abb880229c7452e813d15
# good: [f0c8d7468af0001b80b0c86802ee28063f800987] ASoC: rockchip: i2s: use regmap_read_poll_timeout_atomic to poll I2S_CLR
git bisect good f0c8d7468af0001b80b0c86802ee28063f800987
# bad: [7782aae498b92f124267b366293100d121fe0f56] Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
git bisect bad 7782aae498b92f124267b366293100d121fe0f56
# bad: [833477fce7a14d43ae4c07f8ddc32fa5119471a2] Merge tag 'sound-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect bad 833477fce7a14d43ae4c07f8ddc32fa5119471a2
# good: [86a4d29e75540e20f991e72f17aa51d0e775a397] Merge tag 'asoc-v6.1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
git bisect good 86a4d29e75540e20f991e72f17aa51d0e775a397
# first bad commit: [833477fce7a14d43ae4c07f8ddc32fa5119471a2] Merge tag 'sound-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Comment 36 Zeno R.R. Davatz 2023-01-12 11:11:48 UTC
(In reply to Zeno R.R. Davatz from comment #34)
> this one was bad.
> 
> Now testing:
> [86a4d29e75540e20f991e72f17aa51d0e775a397] Merge tag 'asoc-v6.1'

this one was good btw.
Comment 37 Zeno R.R. Davatz 2023-01-13 10:19:50 UTC
Booting commit 833477fce7a1 with sound disabled (CONFIG_SOUND unset) did not help. Same hang.
Comment 38 Krzysztof Wilczyński 2023-01-14 14:47:14 UTC
Hi Zeno,

Sorry that you are having all these problems! Also, thank you so much for doing the Git bisect, especially since this can be very time-consuming.

Do you apply any patches atop the vanilla kernel? Any custom build flags?

A favour to ask of you: would you mind sharing the last known working .config for your bespoke kernel build and the previous known broken .config?

Also, do you have a .config that is the smallest possible set of features enabled that would be a reproducer on your platform? That said, what is the platform/system you are booting the kernel on? If you don't mind me asking.

Thank you for your help in advance!

Krzysztof
Comment 39 Zeno R.R. Davatz 2023-01-14 16:46:29 UTC
Created attachment 303602 [details]
.config of Zeno
Comment 40 Zeno R.R. Davatz 2023-01-14 16:51:59 UTC
Sure. For every new kernel I always do:

1. git pull
2. git checkout v6.1
3. make oldconfig
4. make -j9
5. copy bzImage to /boot/
6. lilo -v
7. reboot
8. So far this has always worked. ;)

I am a Funtoo Linux user. No fancy desktop, just i3.

For all the new options I normally say "N".

I never apply patches and always do a git pull after Linus does the official release. I never test RC releases.
Comment 41 Zeno R.R. Davatz 2023-01-19 12:21:44 UTC
As requested by Bjorn, a boot with

"ignore_loglevel initcall_debug boot_delay=100"

https://youtu.be/gNJQpze9cFg
Comment 42 Zeno R.R. Davatz 2023-01-19 17:33:05 UTC
As requested by Bjorn

boot_delay=1000 pcie_ports=compat

https://youtu.be/3tPRG43R2fg
Comment 43 Bjorn Helgaas 2023-01-23 17:16:30 UTC
Sorry, my mistake, "boot_delay=" doesn't do anything unless you set CONFIG_BOOT_PRINTK_DELAY=y in your .config.  I'll propose a documentation patch for this.  You may also need to supply "lpj=3200194" (based on your v6.0 dmesg).
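
Whether a given build honours "boot_delay=" can be checked in the .config before rebooting. A minimal sketch; `boot_delay_supported` is a hypothetical helper name, and only the option name comes from the comment above.

```shell
# Succeeds only if the given kernel config enables the option that
# "boot_delay=" depends on.
# $1 = path to a kernel .config (defaults to ./.config)
boot_delay_supported() {
    grep -q '^CONFIG_BOOT_PRINTK_DELAY=y$' "${1:-.config}"
}

# Typical use from the top of the kernel tree (not run here):
#   boot_delay_supported .config || echo "boot_delay= will be ignored"
```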
Comment 44 Zeno R.R. Davatz 2023-01-24 12:00:57 UTC
here we go:

boot_delay=40 pcie_ports=compat lpj=3200194

https://youtu.be/tbhLd0w0JRI
Comment 45 Bjorn Helgaas 2023-01-24 12:15:46 UTC
Great, thanks!  Unfortunately I didn't learn anything.

1) Can you add "initcall_debug" and "ignore_loglevel" again?  Maybe bump up the delay to 60 or 80?

2) Is there any non-essential hardware (plug-in cards, USB devices, etc) that you could remove?
Comment 46 Zeno R.R. Davatz 2023-01-24 13:16:30 UTC
1. here we go:

boot_delay=60 initall_debug ignore_log level pcie_ports=compat lpj=3200194

https://www.youtube.com/watch?v=PlXwqw7UiJ4

2. I have a USB keyboard and a USB headset plugged in. Plus a monitor attached via DVI.
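
A parameter only takes effect if its name is spelled the way the kernel expects, so checking the string before rebooting can save a boot cycle. A minimal sketch; `has_param` is a hypothetical helper, not a kernel or bootloader tool, and the list of wanted names is taken from the requests in this thread.

```shell
# Succeeds if the parameter string contains the named parameter,
# either bare ("name") or with a value ("name=value").
# $1 = full parameter string, $2 = parameter name to look for
has_param() {
    for p in $1; do
        case "$p" in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}

cmdline='boot_delay=60 initall_debug ignore_log level pcie_ports=compat lpj=3200194'
for want in boot_delay initcall_debug ignore_loglevel pcie_ports lpj; do
    has_param "$cmdline" "$want" || echo "missing: $want"
done
```

For the string used in this boot, the loop reports initcall_debug and ignore_loglevel as missing.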
Comment 47 Bjorn Helgaas 2023-01-24 19:30:11 UTC
Beautiful video, thank you!  A typo ("initall_debug" instead of "initcall_debug") means that part didn't work, but nevertheless, I think we learned something important.  Here's what I saw:

  2:56 162.495191 pci 0000:00:1c.0:  bridge window [io  0xd000-0xdfff]
  ...
  3:01 167.149191 nvidiafb 0000:05:00.0: vgaarb: deactivate vga console
  3:01 162.495191 pci 0000:00:1c.0:  bridge window [io  0xd000-0xdfff]

As soon as nvidiafb deactivated the VGA console, the screen was replaced with old contents from about 5 seconds in the past, and the machine seemed hung.

From your bisect:

              good 86a4d29e7554  # Merge tag 'asoc-v6.1'
  first bad commit 833477fce7a1  # Merge tag 'sound-6.1-rc1'

833477fce7a1 (the bad commit) includes 145eed48de27 ("fbdev: Remove conflicting devices on PCI bus"), but 86a4d29e7554 (the good one) does not.  Can you try reverting 145eed48de27?
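
Whether a merge already carries a given commit can be checked with `git merge-base --is-ancestor`, which exits 0 when the first commit is reachable from the second. A minimal sketch, assuming it is run inside the kernel tree; `merge_contains` is a hypothetical wrapper name.

```shell
# Report whether commit $1 is reachable from commit $2.
# Note: any non-zero exit (including an unknown commit) is reported
# as "does not contain".
merge_contains() {
    if git merge-base --is-ancestor "$1" "$2"; then
        echo "$2 contains $1"
    else
        echo "$2 does not contain $1"
    fi
}

# With the commits from this bisect (not run here):
#   merge_contains 145eed48de27 833477fce7a1
#   merge_contains 145eed48de27 86a4d29e7554
```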

If that still doesn't work, try unsetting CONFIG_FB_NVIDIA=y and adding "nosmp" to the kernel parameters.
Comment 48 Zeno R.R. Davatz 2023-01-24 21:00:50 UTC
(In reply to Bjorn Helgaas from comment #47)
> Beautiful video, thank you!  A typo ("initall_debug" instead of
> "initcall_debug") means that part didn't work, but nevertheless, I think we
> learned something important.  Here's what I saw:

Sorry, will try again tomorrow ;)
 
> 833477fce7a1 (the bad commit) includes 145eed48de27 ("fbdev: Remove
> conflicting devices on PCI bus"), but 86a4d29e7554 (the good one) does not. 
> Can you try reverting 145eed48de27?

You mean “git checkout 145eed48de2”?

> If that still doesn't work, try unsetting CONFIG_FB_NVIDIA=y and adding
> "nosmp" to the kernel parameters.

Ok.
Comment 49 Bjorn Helgaas 2023-01-24 21:18:27 UTC
Created attachment 303642 [details]
revert 145eed48de27 (nvidia only)

I think there's a good chance this is the issue, so I would probably try the attached patch:

  $ git checkout v6.1
  $ patch -p1 < revert-145eed48de27

and build as usual.
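
Before modifying the tree, it can be worth confirming that the patch applies cleanly. The sketch below swaps in `git apply --check` for `patch` to get a dry run; `apply_if_clean` is a hypothetical helper name, and the file name in the usage line is the one from the commands above.

```shell
# Apply a -p1 style patch only if it would apply without conflicts.
# $1 = path to the patch file
apply_if_clean() {
    if git apply --check "$1"; then
        git apply "$1"
    else
        echo "patch does not apply cleanly: $1" >&2
        return 1
    fi
}

# In the kernel tree, after "git checkout v6.1" (not run here):
#   apply_if_clean revert-145eed48de27
```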
Comment 50 Krzysztof Wilczyński 2023-01-24 21:26:59 UTC
Hi Zeno,

Thank you for taking the time to capture these videos. They are much appreciated!

Aside from Bjorn looking at the captured footage, in the meantime I attempted to reproduce the issue locally, both under QEMU and on bare metal. To test different versions (6.0.19 and 6.1-rc1), I had to adjust the kernel configuration you provided ever so slightly to make the kernel work on the test hardware I have here.

Nevertheless, I was unable to reproduce the problem locally.

That said, the machines on which I attempted to reproduce the problem all have a single GPU: an integrated GPU from either Intel or AMD (APU). At the moment, I do not have an Nvidia GPU to try, sadly.

However, in light of what Bjorn was able to find concerning the VGA console being disabled at some point during boot, two questions:

- Have you attempted to remote into the system despite the screen being black/appearing as hung? Would access via a network, for example, using SSH, work?

- Do you have any means to hook the serial console to your machine so we could use it together with the earlyprintk and such to capture logs? I wonder if the boot process continues after the VGA console becomes disabled.

Simply put, I am curious whether the VGA console being disabled really indicates a system freeze or crash, or whether the system completes the boot process without issues.

That said, Bjorn is suggesting reverting a patch, and if the revert works, then you don't need to do anything else.

Krzysztof
Comment 51 Zeno R.R. Davatz 2023-01-25 07:02:03 UTC
(In reply to Bjorn Helgaas from comment #49)
> Created attachment 303642 [details]
> revert 145eed48de27 (nvidia only)
> 
> I think there's a good chance this is the issue, so I would probably try the
> attached patch:
> 
>   $ git checkout v6.1
>   $ patch -p1 < revert-145eed48de27
> 
> and build as usual.

Bingo! This worked! Please let me know when your patch is in mainline so I can pull again and test. Hopefully your patch will make it into v6.2 ;).

Thanks for the coaching!

Please let me know if you need anything else.

Best
Zeno
Comment 52 Krzysztof Wilczyński 2023-01-25 08:06:19 UTC
Hello,

[...]
> > I think there's a good chance this is the issue, so I would probably try the
> > attached patch:
> > 
> >   $ git checkout v6.1
> >   $ patch -p1 < revert-145eed48de27
> > 
> > and build as usual.
> 
> Bingo! This worked! Please let me know when your patch is in mainline so I
> can pull again and test. Hopefully your patch will make it into v6.2 ;).

Thank you for testing!
 
> Please let me know if you need anything else.

Just purely my curiosity: I wonder if you have had some time to confirm whether the machine completely freezes or if you can still remote into it regardless of the console going blank.

Krzysztof
Comment 53 Zeno R.R. Davatz 2023-01-25 08:13:12 UTC
(In reply to Krzysztof Wilczyński from comment #52)
 
> > Please let me know if you need anything else.
> 
> Just purely my curiosity: I wonder if you have had some time to confirm
> whether the machine completely freezes or if you can still remote into it
> regardless of the console going blank.

Sure, I can test that, which commit do you want me to recompile and test?

Please tell me the commit I should checkout.
Comment 54 Krzysztof Wilczyński 2023-01-25 08:22:49 UTC
Hello,

> > > Please let me know if you need anything else.
> > 
> > Just purely my curiosity: I wonder if you have had some time to confirm
> > whether the machine completely freezes or if you can still remote into it
> > regardless of the console going blank.
> 
> Sure, I can test that, which commit do you want me to recompile and test?
> 
> Please tell me the commit I should checkout.

You can drop the revert you applied from Bjorn and then try. However, only if you have both the time to do it and the ability to see if the machine comes up and responds to network traffic, allowing you to even potentially remote into it (via SSH or such).

I am curious about this to judge the severity of the problem: a complete system freeze, or simply a degraded VGA console. The former has a more severe impact than the latter, especially for things like servers.

Thank you for your help in advance!

Krzysztof
Comment 55 Zeno R.R. Davatz 2023-01-25 08:36:14 UTC
(In reply to Krzysztof Wilczyński from comment #54)

> You can drop the revert you applied from Bjorn and then try. However, only
> if you have both the time to do it and the ability to see if the machine
> comes up and responds to network traffic, allowing you to even potentially
> remote into it (via SSH or such).

I did:

1. sudo git stash
2. sudo make -j9
3. rebooted

Yes, I can ssh into the machine, which only appears not to boot because the screen hangs.
Comment 56 Krzysztof Wilczyński 2023-01-25 09:46:04 UTC
Hello,

> > You can drop the revert you applied from Bjorn and then try. However, only
> > if you have both the time to do it and the ability to see if the machine
> > comes up and responds to network traffic, allowing you to even potentially
> > remote into it (via SSH or such).
> 
> I did:
> 
> 1. sudo git stash
> 2. sudo make -j9
> 3. rebooted
> 
> Yes, I can ssh into the machine, which only appears not to boot because the
> screen hangs.

Good! Thank you for confirming this for me. Much appreciated!

Given that you can connect to it, would you be able to grab lspci -vvv and dmesg for us? For posterity, so we have all the details here, aside from the video captures.

Thank you!

Krzysztof
Comment 57 Zeno R.R. Davatz 2023-01-25 09:56:39 UTC
Created attachment 303644 [details]
sudo lspci -vvv
Comment 58 Zeno R.R. Davatz 2023-02-06 08:54:56 UTC
Created attachment 303697 [details]
Patch by Dave that works great
Comment 59 Zeno R.R. Davatz 2023-02-06 08:56:22 UTC
Thank you Bjorn and Dave! Pleasure and honor working with you! This patch worked!

~/.backup> uname -a
Linux zenogentoo 6.2.0-rc7-00002-gd2d11f342b17-dirty #144 SMP PREEMPT_DYNAMIC Mon Feb  6 09:42:58 CET 2023 x86_64 Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux

All fine now!

Best
Zeno
Comment 60 Zeno R.R. Davatz 2023-02-27 14:34:30 UTC
This works now with the 6.2 release! Thank you everybody!
Comment 61 Bjorn Helgaas 2023-02-27 15:00:06 UTC
Fixed by:

https://git.kernel.org/linus/04119ab1a49f ("nvidiafb: detect the hardware support before removing console.")

which appeared in v6.2.

Thank you very much for pushing all the way through this, Zeno.
