Bug 215925
Summary: | PCIe regression on Raspberry Pi Compute Module 4 (CM4) breaks booting | ||
---|---|---|---|
Product: | Drivers | Reporter: | Cyril Brulebois (kibi) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | aurelien, bjorn, f.fainelli, jim2101024, kibi, pbrobinson, stefan.wahren |
Priority: | P1 | ||
Hardware: | ARM | ||
OS: | Linux | ||
Kernel Version: | v5.17-rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
Backtrace of the panic at boot-up
dmesg with distro kernel (5.10.106-1) dmesg with mainline kernel (v5.16-9949-g88db8458086b) lspci -vvv on v5.16.0+ lspci -vvv on 5.10.y Moving the ASPM settings back into brcm_pcie_linkup Moving the ASPM settings back into brcm_pcie_linkup (v2) printk sprinkle=on, v5.18-rc7, with PCIe→quad-USB extension board Add pcie node to the CM4 IO DTS dtdiff between unpatched v5.18-rc7 DTB and second patch (without resets line) dt overlay to workaround the issue |
Description
Cyril Brulebois
2022-04-30 19:53:47 UTC
Created attachment 300861 [details]
Backtrace of the panic at boot-up
Thanks for providing the backtrace, we'll look into it now. Could you attach a good log just for comparison? Thanks! Created attachment 300970 [details]
dmesg with distro kernel (5.10.106-1)
dmesg for successful boot with the distribution-provided kernel (we backported the DTB).
One more thing, sorry, could you also attach the output of lspci -vvv of a successful boot? I would like to see what sort of L0/L1 we have here. Thanks! Oh wait, so the PCIe is always down in your case, no devices connected? Created attachment 300971 [details]
dmesg with mainline kernel (v5.16-9949-g88db8458086b)
dmesg for successful boot with the last ok mainline commit (distro .config for v5.16.y + oldconfig + bindeb-pkg)
OK, so in 5.10 the PCIe link was down, and in 5.16 is it UP, that makes more sense to me, thanks. For the avoidance of doubt: I'm getting the failure to boot with or without any PCIe device connected. The first dmesg I attached was without the PCIe→quad-USB extension board (v5.10.y) – I had left the IO board in a “keep it simple” state. Seeing your question about lspci -vvv while (re)building the mainline kernel, I plugged the PCIe→quad-USB extension board, and you'll see it in the second dmesg (v5.16+). I'm attaching the lspci output in a second. Created attachment 300972 [details]
lspci -vvv on v5.16.0+
lspci -vvv on v5.16.0+, with PCIe→quad-USB extension board connected
Thanks, this is very helpful, it does not look like the end point has ASPM support at all, so that may explain why we are having these issues, I will circle back with Jim Quinlan and we should have a tentative patch shortly. Created attachment 300973 [details]
lspci -vvv on 5.10.y
lspci -vvv on 5.10.y, with PCIe→quad-USB extension board connected
Alright, that sounds great! I'm happy to perform any additional testing/reporting you folks might need or find useful. Created attachment 300974 [details]
Moving the ASPM settings back into brcm_pcie_linkup
Could you try the attached patch?
Created attachment 300976 [details]
Moving the ASPM settings back into brcm_pcie_linkup (v2)
v2: missing declaration fixed
Hoping for the best, I tried applying that on top of v5.18-rc7 directly, but the screen remains entirely black after the rainbow disappears. The same happens with or without the PCIe→quad-USB extension board plugged in. I'll try and apply the patch on top of v5.17-rc1 (first release that shows the regression), to see if something else might have sneaked in since then, and might interfere with the tentative fix. Still with v5.18-rc7+ (= the tentative patch), I can also confirm the same as in my original report: this isn't just a display issue, I'm not getting anything on the serial console. Going back to v5.16.0+ (last ok commit in mainline), or to the distro kernel, the boot is fine, and everything shows up on the serial console and on the screen (so that's not an electrical issue or software configuration error). Not better with v5.17-rc1+, I'm going back to the last commit that showed the panic (before everything was replaced by a black screen), which was 87c71931633bd15e9cfd51d4a4d9cd685e8cdb55 according to my notes. Testing 87c71931633bd15e9cfd51d4a4d9cd685e8cdb55 again, and the tentative patch on top of it, then comparing both traces, they seem equivalent. This was tested without the PCIe→quad-USB extension board plugged in. Summing it up, it looks to me that the patch doesn't change much: - still getting the same trace when testing commits after the change was introduced (when the trace was still visible upon boot failure); - still getting black screens when testing it on top of v5.17-rc1 and higher (after some extra merges came up and made the trace disappear entirely). Hello, we are on this as much as we can be w/o having a CM4.. FWIW, someone raised this issue before in personal email to myself and a couple of others but they dropped off. I could not reproduce the error with an RP4. There is an error in the commit you have identified(830aa6f29f07a4e2); the probe function is somehow missing a call to brcm_pcie_linkup(pcie) after invoking brcm_pcie_setup(). However, this omission is corrected two commits later in 93e41f3fca3d "PCI: brcmstb: Add control of subdevice voltage regulators" I believe that something in 93e41f3fca3d is the real issue. From the backtrace, it appears that an error occurred trying to read the DEV-ID of the endpoint device. This can happen when the RC-EP link has not been established. Our RC controller is unforgiving; a bad access does not return 0xffffffff but rather causes a CPU abort. I am not sure if you can actually observe any added printk() statements -- can you? If so, please, if you have the bandwidth, sprinkle them in the brcm_pcie_add_bus() function in pcie-brcmstb.c. This is where the pcie linkup should happen. Knowing what goes on in this function would yield insight. Regards, Jim Hi Cyril, is it possible for you to configure a serial UART for the CM4, as opposed to using the HDMI? The former will not chop off half of the bootlog. This is what I have for my RPi4 running upstream Linux: config.txt: enable_uart=1 enable_jtag_gpio=1 #arm_64bit=1 upstream_kernel=1 force_turbo=1 cmdline.txt: console=ttyS1,115200 earlycon Note that I am using a Serial Hat board on my RPi4 so your serial device may not be ttyS1. Cheers, Jim Thanks, those instructions are super useful. In earlier comments, I mentioned I had a serial console set up already (and that's via the GPIO 40-pin header — pins 6, 8, 10 — plus a CP210x converter). I didn't check all combinations, but adding those helps: - enable_jtag_gpio=1 - force_turbo=1 (probably changes default settings for core_freq) - and of course earlycon I'm now getting a trace on my console with the most recent builds (version-wise): v5.18-rc7, so I'll go add some printk() all over the place and report back. It's going to be easier to process than a photograph in any cases… With sprinkler=on in both brcm_pcie_add_bus() and pci_subdev_regulators_add_bus() for good measure, having split the conditions to pinpoint things easily, with v5.18-rc7 as a basis, I'm having hard-to-reproduce results since the serial console is getting somewhat eaten in some cases, meaning some debug messages aren't seen, and the trace is missing 1/2 or even 3/4 of the function calls. It might be due some frequency issue, I'm told the serial console can be a little annoying… That being said, I'm sometimes lucky enough to have a somewhat clean output, which is the case for the trace I'm attaching. This is with the PCIe→quad-USB extension board plugged in. Relevant parts before the Asynchronous SError Interrupt: [ 1.901644] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges: [ 1.908686] brcm-pcie fd500000.pcie: No bus range found for /scb/pcie@7d500000, using [bus 00-ff] [ 1.917838] brcm-pcie fd500000.pcie: MEM 0x0600000000..0x0603ffffff -> 0x00f8000000 [ 1.926023] brcm-pcie fd500000.pcie: IB MEM 0x0000000000..0x003fffffff -> 0x0400000000 [ 1.934741] kibi -- main: cond2 [ 1.937919] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00 [ 1.944254] pci_bus 0000:00: root bus resource [bus 00-ff] [ 1.949797] pci_bus 0000:00: root bus resource [mem 0x600000000-0x603ffffff] (bus address [0xf8000000-0xfbffffff]) [ 1.960277] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400 [ 1.966393] pci 0000:00:00.0: PME# supported from D0 D3hot [ 1.972958] kibi -- main: cond1 Subtitles: - kibi's my nickname, easy to grep for - main = brcm_pcie_add_bus() - cond2 = we're returning 0 from: if (!bus->parent) - cond1 = we're returning 0 from: if (!dev->of_node) For the avoidance of doubt, I had split conditions the following way: if (!dev->of_node) { printk(KERN_ERR "kibi -- main: cond1\n"); return 0; } if (!bus->parent) { printk(KERN_ERR "kibi -- main: cond2\n"); return 0; } if (!pci_is_root_bus(bus->parent)) { printk(KERN_ERR "kibi -- main: cond3\n"); return 0; } Created attachment 300982 [details]
printk sprinkle=on, v5.18-rc7, with PCIe→quad-USB extension board
Hi Cryil, First, thanks a lot for helping me with this -- from what I can see the CM4 IO board is available to purchase but the CM4 itself is out of stock. Can you please do the following to change the source code of brcm_pcie_add_bus(): (a) Put this as the first line of brcm_pcie_add_bus(): printk(KERN_ERR "kibi: bn:%d c1:%d c2:%d c3:%d c4:%d \n", (int)bus->number, !!dev->of_node, !!bus->parent, bus->parent ? !!pci_is_root_bus(bus->parent) : -1, !!dev->driver_data ); (b) In the same function, replace this if (!dev->of_node || !bus->parent || !pci_is_root_bus(bus->parent)) return 0; with this if (bus->number != 1) return 0; I get the following (kibi) output, yours will be different: kibi: bn:0 c1:1 c2:0 c3:-1 c4:0 kibi: bn:1 c1:1 c2:1 c3:1 c4:0 Many thanks, Jim Quinlan Broadcom STB PS Of course, please remove the existing "kibi" printk conditionals in this function. Hi Cyril, I've just noticed something: the RPi4 has a subnode in the DTS that is missing in the CM4 DTS. Specifically, the RPi4 has "usb@0,0": pcie@7d500000 { compatible = "brcm,bcm2711-pcie"; /* ... */ pci@0,0 { device_type = "pci"; #address-cells = <0x3>; #size-cells = <0x2>; ranges; reg = <0x0 0x0 0x0 0x0 0x0>; usb@0,0 { reg = <0x0 0x0 0x0 0x0 0x0>; resets = <0x2e 0x0>; }; }; and the CM4 DTB has pcie@7d500000 { compatible = "brcm,bcm2711-pcie"; /* ... */ pci@0,0 { device_type = "pci"; #address-cells = <0x3>; #size-cells = <0x2>; ranges; reg = <0x0 0x0 0x0 0x0 0x0>; }; } If you could add the following to the CM4 DTS file and compile it to a DTB file, we might not require a code change at all: pcie@7d500000 { compatible = "brcm,bcm2711-pcie"; /* ... */ pci@0,0 { device_type = "pci"; #address-cells = <0x3>; #size-cells = <0x2>; ranges; reg = <0x0 0x0 0x0 0x0 0x0>; /* Generic PCIe subnode */ pci-ep@0,0 { reg = <0x10000 0x0 0x0 0x0 0x0>; }; }; } Could you please try the above and let me know? Thanks, Jim Quinlan Broadcom STB Hi Jim, That looks great! I'm not seeing any pci@0,0 node in the CM4 DTB (arch/arm64/boot/dts/broadcom/bcm2711-rpi-cm4-io.dtb), so I went for adding the whole node, copying the &pcie0 section from the Pi 4 B DTS (arch/arm/boot/dts/bcm2711-rpi-4-b.dts) to the CM4 IO DTS (arch/arm/boot/dts/bcm2711-rpi-cm4-io.dts). I'm not sure whether it makes more sense there or in the CM4 DTSI (arch/arm/boot/dts/bcm2711-rpi-cm4.dtsi). Tested successfully with v5.18-rc7 (keeping my patched “sprinkle” kernel, deploying only the patched DTB, but that patch was only about adding printk): - without anything plugged on the PCIe slot; - with the PCIe→quad-USB extension board, with both a keyboard and a USB stick on it. Everything looks good: - boot is fine up to userspace, ssh, etc.; - keyboard works, - lsblk sees the storage, etc. I didn't put much work into the commit message itself, only included some metadata (missing parts: problem it fixes, why it's needed, pointer to v5.17-rc1, cc: stable for v5.17.y, etc.), but I'm attaching a draft commit for you to review/expand as you see fit. Please let me know if you need any more tests/feedback, Cyril. Created attachment 300985 [details]
Add pcie node to the CM4 IO DTS
Hi Cyril, sorry for being late at the party, but bugzilla isn't the usual place to report Rpi issues. There is a patch which should fix the HDMI issue: https://marc.info/?l=linux-arm-kernel&m=165117078030443&w=2 The suggested pci DT patch doesn't look acceptable. There is no USB on the IO board. Except of the why should a DT entry fix a driver regression. It already worked before. Best regards Hi Cyril, can you please test behavior of your patch without the following line: resets = <&reset RASPBERRYPI_FIRMWARE_RESET_ID_USB>; Best regards Hi Stefan, The modified patch seems to work just fine as well: - unused PCIe slot: boots fine; - PCIe→quad-USB extension board with keyboard and USB mass storage: boots fine, and both devices work fine. The dtdiff output (between the v5.18-rc7 DTB and the modified one) is also much less noisy (no phandle renumbering in other places), I'm attaching it. Regarding the patch at: https://marc.info/?l=linux-arm-kernel&m=165117078030443&w=2 I've tried it, along with the unpatched DTB v5.18-rc7, and I'm still only getting output (including the trace) via the serial console only. To make sure the serial console doesn't interfere: - I've reverted all settings mentioned by Jim that I didn't have at the beginning (enable_jtag_gpio=1, force_turbo=1 in config.txt; earlycon in cmdline.txt); - I even removed console=ttyS1,115200; but I'm still getting a black screen. Best regards, Cyril. Created attachment 300989 [details]
dtdiff between unpatched v5.18-rc7 DTB and second patch (without resets line)
Cyril (and Stefan), Your submitted (DTS PCIe) patch may work but it is improper. Cyril, I believe that if you remove the "usb@0,0" subnode your patch will still fix the problem and also address Stefan's objection. Note that doing the following command wget -q https://github.com/raspberrypi/firmware/blob/master/boot/bcm2711-rpi-cm4.dtb?raw=true -O -|dtc -Idtb -Odts 2>/dev/null | grep -A6 pci@0,0 will show that this DTB has the pci@0,0 node. Why is this different from the DTS you folks are using? Regards, Jim Quinlan Broadcom STB Jim, (Just for the avoidance of doubt, I'm merely following instructions/requests for tests, I'm not going to pretend I know what I'm doing.) Applying just the following diff, inlining and limiting whitespaces for brevity: +pci@0,0 { + #address-cells = <0x03>; + #size-cells = <0x02>; + device_type = "pci"; + ranges; + reg = <0x00 0x00 0x00 0x00 0x00>; +}; I can confirm the system boots fine, with or without something on the PCIe slot. Both keyboard and USB mass storage work fine when connected through the PCIe→quad-USB extension board. Regards, Cyril. Cyril, I think I know enough about this issue to make a pullreq from upstream Linux, and I can copy you on the email. But what I need is for you to apply the pullreq's commits to your CM4 build and test them, and to let us know. I can help you for any questions you have. IOtherwise, I can't fix this until I get my own CM4. Are you okay with that? Note that if I don't submit something soon a number of my commits in Linux will be reverted, assuming that has not already happened. Thanks, Jim Sure thing, I'm happy to be cc'ed and to test anything, be it isolated patches, branches, etc. Cheers, Cyril. Created attachment 301403 [details]
dt overlay to workaround the issue
I confirm that just adding the pci@0,0 node fixes the reported issue. And it doesn't look that wrong given the PCIE slot exists, even if unpopulated.
To avoid having to mangle the existing .dtb file until the issue is fixed upstream, one option is to use a DT overlay like the one attached. Compile it with dtc, copy the .dtbo to the overlays/ directory and add it to config.txt.
Has this problem been resolved? If so, can somebody add a link to the commit that fixed it? If not, what needs to be done to make progress on it? Hi Bjorn, Yes, you solved this by reverting the whole patchset via: - 7894025c783ca36394d3afe49c8cfb4c830b82fe - 212942609d83b591f5a2f2691df122d13aa3a87d - 420be2f7ebe60c9ba3e332f5290017cd168e2bf8 - f4fd559de3434c44bed1d2912bd0c75cfa42898b which landed in v5.19-rc1. The patchset was resubmitted later on, and I've run some checks to ensure this issue didn't come back. I'm not familiar with Linux workflows, but looks like something that can be closed at this time. Thank you, Cyril! I'll close it as resolved. Please don't hesitate to open a new bug report if you see it or other issues in the future! |