Bug 216859
Summary: | PCI bridge to bus boot hang at enumeration | ||
---|---|---|---|
Product: | Drivers | Reporter: | Zeno R.R. Davatz (zdavatz) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | bjorn, brunodout.dev, jmennius, kw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 6.1-rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Screenshot of the PCI enumeration hang
Video of the PCI numeration hang
lspci -vv of Kernel 6.0
dmesg output of Kernel 6.0
boot with ignore_loglevel initcall_debug
First bisect hang
.config of Zeno
revert 145eed48de27 (nvidia only)
sudo lspci -vvv
Patch by Dave that works great |
Created attachment 303493 [details]
Video of the PCI numeration hang
If you look at this short movie, the boot process does NOT continue. It just hangs there with that status at the end of the video.
Created attachment 303494 [details]
lspci -vv of Kernel 6.0
Created attachment 303495 [details]
dmesg output of Kernel 6.0
Hello, Zeno, besides happening after PCI enumeration, do you have any reason why you think this might be related to the PCI enumeration itself?
I happened to notice that other subsystems produce messages after that. I'm not saying it means we can rule out the possibility of being PCI-related, just that there might be other possibilities here.
Also, are you able to reproduce this bug on another system?
(In reply to Bruno Moreira-Guedes from comment #4)
> Hello, Zeno, besides happening after PCI enumeration, do you have any reason
> why you think this might be related to the PCI enumeration itself?
No, just what I see right before that hang.
> I happened to notice that other subsystems produce messages after that. I'm
> not saying it means we can rule out the possibility of being PCI-related,
> just that there might be other possibilities here.
Sure, that is totally possible.
> Also, are you able to reproduce this bug on another system?
No, currently not testing on another system.
Happy New Year! Please find attached the output with the parameter "ignore_loglevel initcall_debug" as requested by Bjorn.
Created attachment 303538 [details]
boot with ignore_loglevel initcall_debug
Boot with "ignore_loglevel initcall_debug"
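A boot parameter only takes effect when it is spelled exactly, so it can help to confirm after boot which of the requested parameters actually reached the kernel. The following is a minimal sketch, not part of the original report; the function name missing_params is hypothetical, and it only compares whole space-separated tokens.

```shell
#!/bin/sh
# Print every required parameter that is absent from a kernel command line.
# Matching is done on whole, space-separated tokens, so a misspelled
# parameter is reported as missing. (missing_params is a hypothetical helper.)
missing_params() {
    cmdline=$1
    shift
    for want in "$@"; do
        case " $cmdline " in
            *" $want "*) ;;        # present as a whole token
            *) echo "$want" ;;     # absent or misspelled
        esac
    done
}

# On a running Linux system the booted command line is in /proc/cmdline.
if [ -r /proc/cmdline ]; then
    missing_params "$(cat /proc/cmdline)" ignore_loglevel initcall_debug
fi
```

No output from the last command means both parameters were passed as intended.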
And this is the video: https://www.youtube.com/watch?v=uzhUxKteVJM done with "ignore_loglevel initcall_debug"
Marking as a regression since this worked in v6.0 and fails in v6.1-rc1.
Thanks for the photo and video. I have no idea what happened. It looks like console output stopped in the middle of a line. Log from v6.0 in comment #3 shows:
pcieport 0000:00:1c.5: saving config space at offset 0x1c (reading 0xb0b0)
but the photo shows it stopped at:
pcieport 0000:00:1c.5: saving config space at offset 0x1c (readin
If you want to slow down the output to make the video more readable, you can add "boot_delay=20" (or more if necessary). But I doubt that will show us anything useful.
The only thing I can think of is to bisect it, which I'm sorry to say is a lot of work: https://docs.kernel.org/admin-guide/bug-bisect.html
Thanks! The bisecting is not the problem. What is more of a problem is, that after every failed boot I have to fix the file system with "fsck.jfs /dev/sda2" and that takes time ;(, because I have to boot from a USB stick. How many bisects do you think I will have to do?
Looks like it would be about 13 builds/boots:
11:17:06 ~/linux (wip/bjorn-junk)$ git bisect start
11:17:11 ~/linux (wip/bjorn-junk|BISECTING)$ git bisect bad v6.1-rc1
11:17:20 ~/linux (wip/bjorn-junk|BISECTING)$ git bisect good v6.0
Bisecting: 6125 revisions left to test after this (roughly 13 steps)
Ok, thanks for the info!
Ok, now testing: sudo git checkout e0e492cebef25c13fc29b174f01b5178662f1652
Created attachment 303567 [details]
First bisect hang
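The "roughly 13 steps" figure reported by git follows from bisection halving the candidate range on every build/boot. As a rough back-of-the-envelope sketch (this is ceil(log2(n)), an approximation and not git's own estimator; bisect_steps is a hypothetical name):

```shell
#!/bin/sh
# Smallest k such that 2^k >= n, i.e. ceil(log2(n)).
# This approximates the number of builds a bisect needs for n candidate
# revisions; git's internal estimate is computed slightly differently.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

bisect_steps 6125   # prints 13
```

Each test result removes about half of the remaining revisions, which is why the count drops by one step per build.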
I am trying to continue with bisecting but it suggests the same commit again:
/usr/src/linux> sudo git bisect bad e0e492cebef25c13fc29b174f01b5178662f1652
/usr/src/linux> sudo git bisect good v6.0
e0e492cebef25c13fc29b174f01b5178662f1652 was both good and bad
any hints?
Running this again: sudo git bisect bad v6.1-rc1 now gives me:
[18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
that one was also bad. Now testing: 1c2daf52185bbc91421f0e84e6bf2706bb350cce
Ok these ones work:
v6.0
1c2daf52185bbc91421f0e84e6bf2706bb350cce
these do not work and all result in the errors in the screenshots above:
18fd049731e67651009f316195da9281b756f2cf
e0e492cebef25c13fc29b174f01b5178662f1652
v6.1-rc1
I am now running on: 6.0.0-03077-g1c2daf52185b #117 SMP PREEMPT_DYNAMIC (uname -a)
You're doing it right; it's just a laborious process. Here's what I see when plugging in your test results:
08:25:22 ~/linux (main|BISECTING)$ git bisect bad v6.1-rc1
08:25:29 ~/linux (main|BISECTING)$ git bisect good v6.0
Bisecting: 6125 revisions left to test after this (roughly 13 steps)
[18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
08:25:43 ~/linux ((18fd049731e6...)|BISECTING)$ git bisect bad
Bisecting: 3084 revisions left to test after this (roughly 12 steps)
[1c2daf52185bbc91421f0e84e6bf2706bb350cce] Merge tag 'tag-chrome-platform-for-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
08:26:30 ~/linux ((1c2daf52185b...)|BISECTING)$ git bisect good
Bisecting: 1522 revisions left to test after this (roughly 11 steps)
[7e6739b9336e61fe23ca4e2c8d1fda8f19f979bf] Merge tag 'drm-next-2022-10-05' of git://anongit.freedesktop.org/drm/drm
So far you've tested these:
GOOD v6.0
BAD v6.1-rc1
=> problem is between v6.0 and v6.1-rc1
BAD 18fd049731e6 ("Merge tag 'arm64-upstream' ...)
=> problem is between v6.0 and 18fd049731e6
GOOD 1c2daf52185b ("Merge tag 'tag-chrome-platform-for-v6.1' ...)
=> problem is between 1c2daf52185b and 18fd049731e6
??? 7e6739b9336e ("Merge tag 'drm-next-2022-10-05' ...)
To continue the bisect, 7e6739b9336e would be the next kernel to test. You don't need to report all the intermediate steps unless you want to. At any point, "git bisect log" will show you the kernels you've tested and the results. You can just attach that log when you get to the end.
this one was good: 7e6739b9336e
this one was bad: 1b929c02afd3
now testing: a09476668e30 Merge tag 'char-misc-6.1-rc1
was bad, now testing: [188943a15638ceb91f960e072ed7609b2d7f2a55] Merge tag 'fs-for_v6.1-rc1'
was bad, now testing: [ff6862c23d2e83d12d1759bf4337d41248fb4dc8] Merge tag 'arm-drivers-6.1'
was bad, now testing: [f7c91bf65388547f61888b7a67169966fc698ce1] ASoC: SOF: mediatek: mt8195: Add pcm_pointer callback
this one was good! now testing: [40285e64c5654c956505dad34ed2ee4be163b1f0] Merge tag 'arm-defconfig-6.1'
this one was bad. Now testing: [02f2e785c4834828876a4701926416157dfd7b26] Merge branch 'for-next'
This one was good. now testing: [ffc79c2097fd5e954d99dfeaaa5c0437a27a1ece] Merge tag 'aspeed-6.1-defconfig'
this one was good. now testing: [d488b28502d7c22b1b50f0543da119748e575919] Fix PM disable depth imbalance in probe
this one was good. Now testing: [e66372ecb80dc5179c7abb880229c7452e813d15] ARM: 9246/1: dump: show page table level name
this one was good. now testing: f0c8d7468af0 ASoC: rockchip: i2s: use regmap_read_poll_timeout_atomic to poll I2S_CLR
this one was good. Now testing: [7782aae498b92f124267b366293100d121fe0f56] Merge tag 'for-linus'
This one was bad! Now testing: [833477fce7a14d43ae4c07f8ddc32fa5119471a2] Merge tag 'sound-6.1-rc1'
this one was bad.
Now testing: [86a4d29e75540e20f991e72f17aa51d0e775a397] Merge tag 'asoc-v6.1'
the full monty:
/usr/src/linux> git bisect log
git bisect start
# good: [7e6739b9336e61fe23ca4e2c8d1fda8f19f979bf] Merge tag 'drm-next-2022-10-05' of git://anongit.freedesktop.org/drm/drm
git bisect good 7e6739b9336e61fe23ca4e2c8d1fda8f19f979bf
# bad: [9abf2313adc1ca1b6180c508c25f22f9395cc780] Linux 6.1-rc1
git bisect bad e0e492cebef25c13fc29b174f01b5178662f1652
# good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
git bisect good 45eb8ae5370d5df1ee8236f45df3f29103ba6e12
# bad: [18fd049731e67651009f316195da9281b756f2cf] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect bad 18fd049731e67651009f316195da9281b756f2cf
# bad: [1b929c02afd37871d5afb9d498426f83432e71c2] Linux 6.2-rc1
git bisect bad 1b929c02afd37871d5afb9d498426f83432e71c2
# bad: [9abf2313adc1ca1b6180c508c25f22f9395cc780] Linux 6.1-rc1
git bisect bad e0e492cebef25c13fc29b174f01b5178662f1652
# good: [4fe89d07dcc2804c8b562f6c7896a45643d34b2f] Linux 6.0
git bisect good 45eb8ae5370d5df1ee8236f45df3f29103ba6e12
# bad: [a09476668e3016ea4a7b0a7ebd02f44e0546c12c] Merge tag 'char-misc-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad a09476668e3016ea4a7b0a7ebd02f44e0546c12c
# bad: [188943a15638ceb91f960e072ed7609b2d7f2a55] Merge tag 'fs-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad 188943a15638ceb91f960e072ed7609b2d7f2a55
# bad: [188943a15638ceb91f960e072ed7609b2d7f2a55] Merge tag 'fs-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
git bisect bad 188943a15638ceb91f960e072ed7609b2d7f2a55
# bad: [ff6862c23d2e83d12d1759bf4337d41248fb4dc8] Merge tag 'arm-drivers-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad ff6862c23d2e83d12d1759bf4337d41248fb4dc8
# good: [f7c91bf65388547f61888b7a67169966fc698ce1] ASoC: SOF: mediatek: mt8195: Add pcm_pointer callback
git bisect good f7c91bf65388547f61888b7a67169966fc698ce1
# bad: [40285e64c5654c956505dad34ed2ee4be163b1f0] Merge tag 'arm-defconfig-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad 40285e64c5654c956505dad34ed2ee4be163b1f0
# good: [02f2e785c4834828876a4701926416157dfd7b26] Merge branch 'for-next' into for-linus
git bisect good 02f2e785c4834828876a4701926416157dfd7b26
# good: [ffc79c2097fd5e954d99dfeaaa5c0437a27a1ece] Merge tag 'aspeed-6.1-defconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/joel/bmc into arm/defconfig
git bisect good ffc79c2097fd5e954d99dfeaaa5c0437a27a1ece
# good: [d488b28502d7c22b1b50f0543da119748e575919] Fix PM disable depth imbalance in probe
git bisect good d488b28502d7c22b1b50f0543da119748e575919
# good: [e66372ecb80dc5179c7abb880229c7452e813d15] ARM: 9246/1: dump: show page table level name
git bisect good e66372ecb80dc5179c7abb880229c7452e813d15
# good: [f0c8d7468af0001b80b0c86802ee28063f800987] ASoC: rockchip: i2s: use regmap_read_poll_timeout_atomic to poll I2S_CLR
git bisect good f0c8d7468af0001b80b0c86802ee28063f800987
# bad: [7782aae498b92f124267b366293100d121fe0f56] Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
git bisect bad 7782aae498b92f124267b366293100d121fe0f56
# bad: [833477fce7a14d43ae4c07f8ddc32fa5119471a2] Merge tag 'sound-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect bad 833477fce7a14d43ae4c07f8ddc32fa5119471a2
# good: [86a4d29e75540e20f991e72f17aa51d0e775a397] Merge tag 'asoc-v6.1' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
git bisect good 86a4d29e75540e20f991e72f17aa51d0e775a397
# first bad commit: [833477fce7a14d43ae4c07f8ddc32fa5119471a2] Merge tag 'sound-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
(In reply to Zeno R.R. Davatz from comment #34)
> this one was bad.
>
> Now testing:
> [86a4d29e75540e20f991e72f17aa51d0e775a397] Merge tag 'asoc-v6.1'
this one was good btw.
Booting without Sound [CONFIG_SOUND] into the commit 833477fce7a1 did not help. Same hang.
Hi Zeno,
Sorry that you are having all these problems! Also, thank you so much for doing the Git bisect, especially since this can be very time-consuming.
Do you apply any patches atop the vanilla kernel? Any custom build flags?
A favour to ask of you: would you mind sharing the last known working .config for your bespoke kernel build and the previous known broken .config? Also, do you have a .config that is the smallest possible set of features enabled that would be a reproducer on your platform?
That said, what is the platform/system you are booting the kernel on? If you don't mind me asking.
Thank you for your help in advance!
Krzysztof
Created attachment 303602 [details]
.config of Zeno
sure. So for every new Kernel I always do
1. git pull
2. git checkout v6.1
3. make oldconfig
4. make -j9
5. copy bzImage to /boot/
6. lilo -v
7. reboot
8. So far this has always worked. ;)
I am a Linux Funtoo User. No fancy desktop, just i3. For all the new options I normally say "N"
I never apply patches and always do a git pull after Linus does the official release. I never test RC releases.
as requested by Bjorn a boot with "ignore_loglevel initcall_debug boot_delay=100" https://youtu.be/gNJQpze9cFg
As requested by Bjorn boot_delay=1000 pcie_ports=compat https://youtu.be/3tPRG43R2fg
Sorry, my mistake, "boot_delay=" doesn't do anything unless you set CONFIG_BOOT_PRINTK_DELAY=y in your .config. I'll propose a documentation patch for this. You may also need to supply "lpj=3200194" (based on your v6.0 dmesg).
here we go: boot_delay=40 pcie_ports=compat lpj=3200194 https://youtu.be/tbhLd0w0JRI
Great, thanks! Unfortunately I didn't learn anything.
1) Can you add "initcall_debug" and "ignore_loglevel" again? Maybe bump up the delay to 60 or 80?
2) Is there any non-essential hardware (plug-in cards, USB devices, etc) that you could remove?
1. here we go: boot_delay=60 initall_debug ignore_log level pcie_ports=compat lpj=3200194 https://www.youtube.com/watch?v=PlXwqw7UiJ4
2. I have a USB keyboard and a USB headset plugged in. Plus a monitor attached via DVI.
Beautiful video, thank you! A typo ("initall_debug" instead of "initcall_debug") means that part didn't work, but nevertheless, I think we learned something important. Here's what I saw:
2:56 162.495191 pci 0000:00:1c.0: bridge window [io 0xd000-0xdfff]
...
3:01 167.149191 nvidiafb 0000:05:00.0: vgaarb: deactivate vga console
3:01 162.495191 pci 0000:00:1c.0: bridge window [io 0xd000-0xdfff]
As soon as nvidiafb deactivated the VGA console, the screen was replaced with old contents from about 5 seconds in the past, and the machine seemed hung.
From your bisect:
good 86a4d29e7554 # Merge tag 'asoc-v6.1'
first bad commit 833477fce7a1 # Merge tag 'sound-6.1-rc1'
833477fce7a1 (the bad commit) includes 145eed48de27 ("fbdev: Remove conflicting devices on PCI bus"), but 86a4d29e7554 (the good one) does not. Can you try reverting 145eed48de27?
If that still doesn't work, try unsetting CONFIG_FB_NVIDIA=y and adding "nosmp" to the kernel parameters.
1. (In reply to Bjorn Helgaas from comment #47)
> Beautiful video, thank you! A typo ("initall_debug" instead of
> "initcall_debug") means that part didn't work, but nevertheless, I think we
> learned something important. Here's what I saw:
Sorry, will try again tomorrow ;)
> 833477fce7a1 (the bad commit) includes 145eed48de27 ("fbdev: Remove
> conflicting devices on PCI bus"), but 86a4d29e7554 (the good one) does not.
> Can you try reverting 145eed48de27?
You mean “git checkout 145eed48de2”?
> If that still doesn't work, try unsetting CONFIG_FB_NVIDIA=y and adding
> "nosmp" to the kernel parameters.
Ok.
Created attachment 303642 [details]
revert 145eed48de27 (nvidia only)
I think there's a good chance this is the issue, so I would probably try the attached patch:
$ git checkout v6.1
$ patch -p1 < revert-145eed48de27
and build as usual.
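The observation that 833477fce7a1 includes 145eed48de27 while 86a4d29e7554 does not can be checked in a local kernel checkout with git merge-base --is-ancestor. The following is only a sketch; contains_commit is a hypothetical helper, and any git error (for example an unknown commit) is also reported as "does not contain".

```shell
#!/bin/sh
# Report whether commit $2 is an ancestor of (contained in) commit $3
# in the repository at path $1. contains_commit is a hypothetical helper.
contains_commit() {
    if git -C "$1" merge-base --is-ancestor "$2" "$3"; then
        echo "$3 contains $2"
    else
        echo "$3 does not contain $2"
    fi
}

# Example against a kernel tree (the path is the one used in this thread):
# contains_commit /usr/src/linux 145eed48de27 833477fce7a1
# contains_commit /usr/src/linux 145eed48de27 86a4d29e7554
```

This kind of check narrows a "first bad commit" that is a merge down to the individual changes the merge brought in.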
Hi Zeno,
Thank you for taking the time to capture these videos. They are much appreciated!
Aside from Bjorn looking at the captured footage, in the meantime, I attempted to reproduce the issue locally, both under QEMU and on bare metal. To test different versions (6.0.19 and 6.1-rc1) I had to adjust the kernel configuration you provided ever so slightly to make the kernel work successfully on the test hardware I have here.
Nevertheless, I was unable to reproduce the problem locally. That said, the machines on which I attempted to reproduce the problem all have a single GPU, an iGPU, and either an Intel or AMD (APU). At the moment, I do not have an Nvidia GPU to try, sadly.
However, in light of what Bjorn was able to find concerning the VGA console being disabled at some point during boot, two questions:
- Have you attempted to remote into the system despite the screen being black/appearing as hung? Would access via a network, for example, using SSH, work?
- Do you have any means to hook the serial console to your machine so we could use it together with the earlyprintk and such to capture logs?
I wonder if the boot process continues after the VGA console becomes disabled. Simply put, I am curious if the VGA console being disabled is not endemic to a system freeze or crash and whether it completes the boot process without issues.
That said, Bjorn is suggesting reverting a patch, and if the revert works, then you don't need to do anything else.
Krzysztof
(In reply to Bjorn Helgaas from comment #49)
> Created attachment 303642 [details]
> revert 145eed48de27 (nvidia only)
>
> I think there's a good chance this is the issue, so I would probably try the
> attached patch:
>
> $ git checkout v6.1
> $ patch -p1 < revert-145eed48de27
>
> and build as usual.
Bingo! This worked! Please let me know when your patch is in mainline so I can pull again and test. Hopefully your patch will make it into v6.2 ;).
Thanks for the coaching!
Please let me know if you need anything else.
Best
Zeno
Hello,
[...]
> > I think there's a good chance this is the issue, so I would probably try the
> > attached patch:
> >
> > $ git checkout v6.1
> > $ patch -p1 < revert-145eed48de27
> >
> > and build as usual.
>
> Bingo! This worked! Please let me know when your patch is in mainline so I
> can pull again and test. Hopefully your patch will make it into v6.2 ;).
Thank you for testing!
> Please let me know if you need anything else.
Just purely my curiosity: I wonder if you have had some time to confirm whether the machine completely freezes or if you can still remote into it regardless of the console going blank.
Krzysztof
(In reply to Krzysztof Wilczyński from comment #52)
> > Please let me know if you need anything else.
>
> Just purely my curiosity: I wonder if you have had some time to confirm
> whether the machine completely freezes or if you can still remote into it
> regardless of the console going blank.
Sure, I can test that, which commit do you want me to recompile and test?
Please tell me the commit I should checkout.
Hello,
> > > Please let me know if you need anything else.
> >
> > Just purely my curiosity: I wonder if you have had some time to confirm
> > whether the machine completely freezes or if you can still remote into it
> > regardless of the console going blank.
>
> Sure, I can test that, which commit do you want me to recompile and test?
>
> Please tell me the commit I should checkout.
You can drop the revert you applied from Bjorn and then try. However, only if you have both the time to do it and the ability to see if the machine comes up and responds to network traffic, allowing you to even potentially remote into it (via SSH or such).
I am curious about this to judge the severity of this problem: a complete system freeze or simply a degraded VGA console. The former has a more severe impact than the latter, especially for things like servers, etc.
Thank you for your help in advance!
Krzysztof
(In reply to Krzysztof Wilczyński from comment #54)
> You can drop the revert you applied from Bjorn and then try. However, only
> if you have both the time to do it and the ability to see if the machine
> comes up and responds to network traffic, allowing you to even potentially
> remote into it (via SSH or such).
I did:
1. sudo git stash
2. sudo make -j9
3. rebooted
Yes, I can ssh into the machine; it only appears not to boot because the screen hangs.
Hello,
> > You can drop the revert you applied from Bjorn and then try. However, only
> > if you have both the time to do it and the ability to see if the machine
> > comes up and responds to network traffic, allowing you to even potentially
> > remote into it (via SSH or such).
>
> I did:
>
> 1. sudo git stash
> 2. sudo make -j9
> 3. rebooted
>
> Yes, I can ssh into the machine; it only appears not to boot because the
> screen hangs.
Good! Thank you for confirming this for me. Much appreciated!
Given that you can connect to it, would you be able to grab lspci -vvv and dmesg for us? For posterity, so we have all the details here, aside from the video captures.
Thank you!
Krzysztof
Created attachment 303644 [details]
sudo lspci -vvv
Created attachment 303697 [details]
Patch by Dave that works great
Thank you Bjorn and Dave! Pleasure and honor working with you!
This patch worked!
~/.backup> uname -a
Linux zenogentoo 6.2.0-rc7-00002-gd2d11f342b17-dirty #144 SMP PREEMPT_DYNAMIC Mon Feb 6 09:42:58 CET 2023 x86_64 Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz GenuineIntel GNU/Linux
All fine now!
Best
Zeno
This works now with 6.2 release! Thank you everybody!
Fixed by: https://git.kernel.org/linus/04119ab1a49f ("nvidiafb: detect the hardware support before removing console.") which appeared in v6.2.
Thank you very much for pushing all the way through this, Zeno. |
Created attachment 303492 [details]
Screenshot of the PCI enumeration hang
With Kernel 6.1-rc1 the enumeration process stopped working for me, see attachments. The enumeration works fine with Kernel 6.0 and below.
The same problem still exists with v6.1 and v6.2-rc1. Please see attachments.