Bug 215925

Summary: PCIe regression on Raspberry Pi Compute Module 4 (CM4) breaks booting
Product: Drivers Reporter: Cyril Brulebois (kibi)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: RESOLVED CODE_FIX    
Severity: normal CC: aurelien, bjorn, f.fainelli, jim2101024, kibi, pbrobinson, stefan.wahren
Priority: P1    
Hardware: ARM   
OS: Linux   
Kernel Version: v5.17-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Backtrace of the panic at boot-up
dmesg with distro kernel (5.10.106-1)
dmesg with mainline kernel (v5.16-9949-g88db8458086b)
lspci -vvv on v5.16.0+
lspci -vvv on 5.10.y
Moving the ASPM settings back into brcm_pcie_linkup
Moving the ASPM settings back into brcm_pcie_linkup (v2)
printk sprinkle=on, v5.18-rc7, with PCIe→quad-USB extension board
Add pcie node to the CM4 IO DTS
dtdiff between unpatched v5.18-rc7 DTB and second patch (without resets line)
dt overlay to workaround the issue

Description Cyril Brulebois 2022-04-30 19:53:47 UTC
Catching up with latest kernel releases in Debian, it turned out that my
Raspberry Pi Compute Module 4, mounted on an official Compute Module 4 IO Board,
and booting from an SD card, no longer boots: this means a black screen on the
HDMI output, and no output on the serial console.

Trying various releases, I confirmed that v5.16 was fine, and v5.17-rc1 was the
first (pre)release that wasn't.

After some git bisect, it turns out the cause seems to be the following commit
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=830aa6f29f07a4e2f1a947dfa72b3ccddb46dd21):

```
commit 830aa6f29f07a4e2f1a947dfa72b3ccddb46dd21
Author: Jim Quinlan <jim2101024@gmail.com>
Date:   Thu Jan 6 11:03:27 2022 -0500

    PCI: brcmstb: Split brcm_pcie_setup() into two funcs
```

Starting with this commit, the kernel panics early (before 0.30 seconds), with
an `Asynchronous SError Interrupt`. The backtrace references various
`brcm_pcie_*` functions; I can share a picture or try and transcribe it
manually if that helps (nothing on the serial console…).

This commit is part of a branch that was ultimately merged as
d0a231f01e5b25bacd23e6edc7c979a18a517b2b; starting with this commit, there's
not even a backtrace anymore, the screen stays black after the usual “boot-up
rainbow”, and there's still nothing on the serial console.

I confirmed that 88db8458086b1dcf20b56682504bdb34d2bca0e2 (on the master side)
was still booting properly, and that 87c71931633bd15e9cfd51d4a4d9cd685e8cdb55
(from the branch being merged into master) is the last commit showing the
panic.

Since d0a231f01e5b25bacd23e6edc7c979a18a517b2b is a merge commit that includes
conflict resolutions in drivers/pci/controller/pcie-brcmstb.c, I suppose this
could be consistent with the initial panic being “upgraded” into an even more
serious issue.

I've also verified that latest master (v5.18-rc4-396-g57ae8a492116) is still
affected by this issue.

The regular Raspberry Pi 4 B doesn't seem to be affected by this issue: the
exact same image on the same SD card (with latest master) boots fine on it.
Comment 1 Cyril Brulebois 2022-04-30 19:59:57 UTC
Created attachment 300861 [details]
Backtrace of the panic at boot-up
Comment 2 Florian Fainelli 2022-05-11 20:43:57 UTC
Thanks for providing the backtrace, we'll look into it now.
Comment 3 Florian Fainelli 2022-05-16 16:44:04 UTC
Could you attach a good log just for comparison? Thanks!
Comment 4 Cyril Brulebois 2022-05-16 16:57:10 UTC
Created attachment 300970 [details]
dmesg with distro kernel (5.10.106-1)

dmesg for successful boot with the distribution-provided kernel (we backported the DTB).
Comment 5 Florian Fainelli 2022-05-16 16:58:37 UTC
One more thing, sorry, could you also attach the output of lspci -vvv of a successful boot? I would like to see what sort of L0/L1 we have here. Thanks!
Comment 6 Florian Fainelli 2022-05-16 17:10:27 UTC
Oh wait, so the PCIe is always down in your case, no devices connected?
Comment 7 Cyril Brulebois 2022-05-16 17:13:54 UTC
Created attachment 300971 [details]
dmesg with mainline kernel (v5.16-9949-g88db8458086b)

dmesg for successful boot with the last ok mainline commit (distro .config for v5.16.y + oldconfig + bindeb-pkg)
Comment 8 Florian Fainelli 2022-05-16 17:15:48 UTC
OK, so in 5.10 the PCIe link was down, and in 5.16 is it UP, that makes more sense to me, thanks.
Comment 9 Cyril Brulebois 2022-05-16 17:18:38 UTC
For the avoidance of doubt: I'm getting the failure to boot with or without any PCIe device connected.

The first dmesg I attached was without the PCIe→quad-USB extension board (v5.10.y) – I had left the IO board in a “keep it simple” state.

Seeing your question about lspci -vvv while (re)building the mainline kernel, I plugged the PCIe→quad-USB extension board, and you'll see it in the second dmesg (v5.16+). I'm attaching the lspci output in a second.
Comment 10 Cyril Brulebois 2022-05-16 17:19:36 UTC
Created attachment 300972 [details]
lspci -vvv on v5.16.0+

lspci -vvv on v5.16.0+, with PCIe→quad-USB extension board connected
Comment 11 Florian Fainelli 2022-05-16 17:21:36 UTC
Thanks, this is very helpful, it does not look like the end point has ASPM support at all, so that may explain why we are having these issues, I will circle back with Jim Quinlan and we should have a tentative patch shortly.
Comment 12 Cyril Brulebois 2022-05-16 17:23:14 UTC
Created attachment 300973 [details]
lspci -vvv on 5.10.y

lspci -vvv on 5.10.y, with PCIe→quad-USB extension board connected
Comment 13 Cyril Brulebois 2022-05-16 17:25:03 UTC
Alright, that sounds great!

I'm happy to perform any additional testing/reporting you folks might need or find useful.
Comment 14 Florian Fainelli 2022-05-16 17:28:38 UTC
Created attachment 300974 [details]
Moving the ASPM settings back into brcm_pcie_linkup

Could you try the attached patch?
Comment 15 Cyril Brulebois 2022-05-16 17:54:13 UTC
Created attachment 300976 [details]
Moving the ASPM settings back into brcm_pcie_linkup (v2)

v2: missing declaration fixed
Comment 16 Cyril Brulebois 2022-05-16 17:58:12 UTC
Hoping for the best, I tried applying that on top of v5.18-rc7 directly, but the screen remains entirely black after the rainbow disappears.

The same happens with or without the PCIe→quad-USB extension board plugged in.

I'll try and apply the patch on top of v5.17-rc1 (first release that shows the regression), to see if something else might have sneaked in since then, and might interfere with the tentative fix.
Comment 17 Cyril Brulebois 2022-05-16 18:07:54 UTC
Still with v5.18-rc7+ (= the tentative patch), I can also confirm the same as in my original report: this isn't just a display issue, I'm not getting anything on the serial console.

Going back to v5.16.0+ (last ok commit in mainline), or to the distro kernel, the boot is fine, and everything shows up on the serial console and on the screen (so that's not an electrical issue or software configuration error).
Comment 18 Cyril Brulebois 2022-05-16 18:19:39 UTC
Not better with v5.17-rc1+, I'm going back to the last commit that showed the panic (before everything was replaced by a black screen), which was 87c71931633bd15e9cfd51d4a4d9cd685e8cdb55 according to my notes.
Comment 19 Cyril Brulebois 2022-05-16 19:05:47 UTC
Testing 87c71931633bd15e9cfd51d4a4d9cd685e8cdb55 again, and the tentative patch on top of it, then comparing both traces, they seem equivalent.

This was tested without the PCIe→quad-USB extension board plugged in.

Summing it up, it looks to me that the patch doesn't change much:
 - still getting the same trace when testing commits after the change was introduced (when the trace was still visible upon boot failure);
 - still getting black screens when testing it on top of v5.17-rc1 and higher (after some extra merges came up and made the trace disappear entirely).
Comment 20 Jim Quinlan 2022-05-16 19:48:15 UTC
Hello, we are on this as much as we can be w/o having a CM4..  FWIW, someone raised this issue before in personal email to myself and a couple of others but they dropped off.  I could not reproduce the error with an RP4.

There is an error in the commit you have identified(830aa6f29f07a4e2); the probe function is somehow missing a call to brcm_pcie_linkup(pcie) after invoking brcm_pcie_setup().  However, this omission is corrected two commits later in 

93e41f3fca3d  "PCI: brcmstb: Add control of subdevice voltage regulators"

I believe that something in 93e41f3fca3d is the real issue.  From the backtrace, it appears that an error occurred trying to read the DEV-ID of the endpoint device.  This can happen when the RC-EP link has not been established.  Our RC controller is unforgiving; a bad access does not return 0xffffffff but rather causes a CPU abort.

I am not sure if you can actually observe any added printk() statements -- can you?  If so, please, if you have the bandwidth, sprinkle them in the brcm_pcie_add_bus() function in pcie-brcmstb.c. This is where the pcie linkup should happen.  Knowing what goes on in this function would yield insight.

Regards, Jim
Comment 21 Jim Quinlan 2022-05-16 21:10:21 UTC
Hi Cyril, is it possible for you to configure a serial UART for the CM4, as opposed to using the HDMI?  The former will not chop off half of the bootlog.  This is what I have for my RPi4 running upstream Linux:

config.txt:
enable_uart=1
enable_jtag_gpio=1
#arm_64bit=1
upstream_kernel=1
force_turbo=1

cmdline.txt:
console=ttyS1,115200 earlycon

Note that I am using a Serial Hat board on my RPi4 so your serial device may not be ttyS1.

Cheers, Jim
Comment 22 Cyril Brulebois 2022-05-17 06:06:15 UTC
Thanks, those instructions are super useful.

In earlier comments, I mentioned I had a serial console set up already (and that's via the GPIO 40-pin header — pins 6, 8, 10 — plus a CP210x converter). I didn't check all combinations, but adding those helps:

 - enable_jtag_gpio=1
 - force_turbo=1 (probably changes default settings for core_freq)
 - and of course earlycon

I'm now getting a trace on my console with the most recent builds (version-wise): v5.18-rc7, so I'll go add some printk() all over the place and report back. It's going to be easier to process than a photograph in any cases…
Comment 23 Cyril Brulebois 2022-05-17 07:07:01 UTC
With sprinkler=on in both brcm_pcie_add_bus() and pci_subdev_regulators_add_bus() for good measure, having split the conditions to pinpoint things easily, with v5.18-rc7 as a basis, I'm having hard-to-reproduce results since the serial console is getting somewhat eaten in some cases, meaning some debug messages aren't seen, and the trace is missing 1/2 or even 3/4 of the function calls. It might be due some frequency issue, I'm told the serial console can be a little annoying…

That being said, I'm sometimes lucky enough to have a somewhat clean output, which is the case for the trace I'm attaching. This is with the PCIe→quad-USB extension board plugged in.

Relevant parts before the Asynchronous SError Interrupt:

[    1.901644] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    1.908686] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    1.917838] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x0603ffffff -> 0x00f8000000
[    1.926023] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x003fffffff -> 0x0400000000
[    1.934741] kibi -- main: cond2
[    1.937919] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    1.944254] pci_bus 0000:00: root bus resource [bus 00-ff]
[    1.949797] pci_bus 0000:00: root bus resource [mem 0x600000000-0x603ffffff] (bus address [0xf8000000-0xfbffffff])
[    1.960277] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    1.966393] pci 0000:00:00.0: PME# supported from D0 D3hot
[    1.972958] kibi -- main: cond1

Subtitles:
 - kibi's my nickname, easy to grep for
 - main = brcm_pcie_add_bus()
 - cond2 = we're returning 0 from: if (!bus->parent)
 - cond1 = we're returning 0 from: if (!dev->of_node)

For the avoidance of doubt, I had split conditions the following way:

	if (!dev->of_node) {
		printk(KERN_ERR "kibi -- main: cond1\n");
		return 0;
	}

	if (!bus->parent) {
		printk(KERN_ERR "kibi -- main: cond2\n");
		return 0;
	}

	if (!pci_is_root_bus(bus->parent)) {
		printk(KERN_ERR "kibi -- main: cond3\n");
		return 0;
	}
Comment 24 Cyril Brulebois 2022-05-17 07:08:30 UTC
Created attachment 300982 [details]
printk sprinkle=on, v5.18-rc7, with PCIe→quad-USB extension board
Comment 25 Jim Quinlan 2022-05-17 15:07:46 UTC
Hi Cryil,

First, thanks a lot for helping me with this -- from what I can see the CM4 IO board is available to purchase but the CM4 itself is out of stock.

Can you please do the following to change the source code of brcm_pcie_add_bus():

(a) Put this as the first line of brcm_pcie_add_bus():

        printk(KERN_ERR "kibi: bn:%d c1:%d c2:%d c3:%d c4:%d \n",
               (int)bus->number,
               !!dev->of_node,
               !!bus->parent,
               bus->parent ? !!pci_is_root_bus(bus->parent) : -1,
               !!dev->driver_data
                );

(b) In the same function, replace this

        if (!dev->of_node || !bus->parent || !pci_is_root_bus(bus->parent))
                return 0;

with this

        if (bus->number != 1)
                return 0;


I get the following (kibi) output, yours will be different:

    kibi: bn:0 c1:1 c2:0 c3:-1 c4:0
    kibi: bn:1 c1:1 c2:1 c3:1 c4:0

Many thanks,
Jim Quinlan
Broadcom STB

PS Of course, please remove the existing "kibi" printk conditionals in this function.
Comment 26 Jim Quinlan 2022-05-17 19:31:32 UTC
Hi Cyril,

I've just noticed something: the RPi4 has a subnode in the DTS that is missing in the CM4 DTS.  Specifically, the RPi4 has "usb@0,0":

    pcie@7d500000 {
        compatible = "brcm,bcm2711-pcie";
        /* ... */
        pci@0,0 {
            device_type = "pci";
            #address-cells = <0x3>;
            #size-cells = <0x2>;
            ranges;
            reg = <0x0 0x0 0x0 0x0 0x0>;

            usb@0,0 {
                reg = <0x0 0x0 0x0 0x0 0x0>;
                resets = <0x2e 0x0>;
            };
        };

and the CM4 DTB has

   pcie@7d500000 {
        compatible = "brcm,bcm2711-pcie";
        /* ... */
        pci@0,0 {
            device_type = "pci";
            #address-cells = <0x3>;
            #size-cells = <0x2>;
            ranges;
            reg = <0x0 0x0 0x0 0x0 0x0>;
        };
   }

If you could add the following to the CM4 DTS file and compile it to a DTB file, we might not require a code change at all:

  pcie@7d500000 {
        compatible = "brcm,bcm2711-pcie";
        /* ... */
        pci@0,0 {
            device_type = "pci";
            #address-cells = <0x3>;
            #size-cells = <0x2>;
            ranges;
            reg = <0x0 0x0 0x0 0x0 0x0>;

            /* Generic PCIe subnode */
            pci-ep@0,0 {
                reg = <0x10000 0x0 0x0 0x0 0x0>;
            };

        };
   }

Could you please try the above and let me know?

Thanks,
Jim Quinlan
Broadcom STB
Comment 27 Cyril Brulebois 2022-05-17 22:19:07 UTC
Hi Jim,

That looks great!

I'm not seeing any pci@0,0 node in the CM4 DTB (arch/arm64/boot/dts/broadcom/bcm2711-rpi-cm4-io.dtb), so I went for adding the whole node, copying the &pcie0 section from the Pi 4 B DTS (arch/arm/boot/dts/bcm2711-rpi-4-b.dts) to the CM4 IO DTS (arch/arm/boot/dts/bcm2711-rpi-cm4-io.dts). I'm not sure whether it makes more sense there or in the CM4 DTSI (arch/arm/boot/dts/bcm2711-rpi-cm4.dtsi).

Tested successfully with v5.18-rc7 (keeping my patched “sprinkle” kernel, deploying only the patched DTB, but that patch was only about adding printk):

- without anything plugged on the PCIe slot;
- with the PCIe→quad-USB extension board, with both a keyboard and a USB stick on it.

Everything looks good:

- boot is fine up to userspace, ssh, etc.;
- keyboard works,
- lsblk sees the storage, etc.

I didn't put much work into the commit message itself, only included some metadata (missing parts: problem it fixes, why it's needed, pointer to v5.17-rc1, cc: stable for v5.17.y, etc.), but I'm attaching a draft commit for you to review/expand as you see fit.

Please let me know if you need any more tests/feedback,
Cyril.
Comment 28 Cyril Brulebois 2022-05-17 22:19:44 UTC
Created attachment 300985 [details]
Add pcie node to the CM4 IO DTS
Comment 29 Stefan Wahren 2022-05-18 05:41:25 UTC
Hi Cyril,

sorry for being late at the party, but bugzilla isn't the usual place to report Rpi issues.

There is a patch which should fix the HDMI issue:
https://marc.info/?l=linux-arm-kernel&m=165117078030443&w=2

The suggested pci DT patch doesn't look acceptable. There is no USB on the IO board. Except of the why should a DT entry fix a driver regression. It already worked before.

Best regards
Comment 30 Stefan Wahren 2022-05-18 05:53:44 UTC
Hi Cyril,

can you please test behavior of your patch without the following line:

resets = <&reset RASPBERRYPI_FIRMWARE_RESET_ID_USB>;

Best regards
Comment 31 Cyril Brulebois 2022-05-18 08:16:33 UTC
Hi Stefan,

The modified patch seems to work just fine as well:

- unused PCIe slot: boots fine;
- PCIe→quad-USB extension board with keyboard and USB mass storage: boots fine, and both devices work fine.

The dtdiff output (between the v5.18-rc7 DTB and the modified one) is also much less noisy (no phandle renumbering in other places), I'm attaching it.


Regarding the patch at:
https://marc.info/?l=linux-arm-kernel&m=165117078030443&w=2

I've tried it, along with the unpatched DTB v5.18-rc7, and I'm still only getting output (including the trace) via the serial console only.

To make sure the serial console doesn't interfere:

- I've reverted all settings mentioned by Jim that I didn't have at the beginning (enable_jtag_gpio=1, force_turbo=1 in config.txt; earlycon in cmdline.txt);
- I even removed console=ttyS1,115200;

but I'm still getting a black screen.


Best regards,
Cyril.
Comment 32 Cyril Brulebois 2022-05-18 08:17:53 UTC
Created attachment 300989 [details]
dtdiff between unpatched v5.18-rc7 DTB and second patch (without resets line)
Comment 33 Jim Quinlan 2022-05-18 12:58:32 UTC
Cyril (and Stefan),

Your submitted (DTS PCIe) patch may work but it is improper.  

Cyril, I believe that if you remove the "usb@0,0" subnode your patch will still fix the problem and also address Stefan's objection.

Note that doing the following command

wget -q https://github.com/raspberrypi/firmware/blob/master/boot/bcm2711-rpi-cm4.dtb?raw=true -O -|dtc -Idtb -Odts 2>/dev/null | grep -A6 pci@0,0

will show that this DTB has the pci@0,0 node.  Why is this different from the DTS you folks are using?


Regards,
Jim Quinlan
Broadcom STB
Comment 34 Cyril Brulebois 2022-05-18 13:40:29 UTC
Jim,

(Just for the avoidance of doubt, I'm merely following instructions/requests for tests, I'm not going to pretend I know what I'm doing.)

Applying just the following diff, inlining and limiting whitespaces for brevity:

+pci@0,0 {
+  #address-cells = <0x03>;
+  #size-cells = <0x02>;
+  device_type = "pci";
+  ranges;
+  reg = <0x00 0x00 0x00 0x00 0x00>;
+};

I can confirm the system boots fine, with or without something on the PCIe slot. Both keyboard and USB mass storage work fine when connected through the PCIe→quad-USB extension board.

Regards,
Cyril.
Comment 35 Jim Quinlan 2022-05-18 15:46:42 UTC
Cyril,

I think I know enough about this issue to make a pullreq from upstream Linux, and I can copy you on the email.  But what I need is for you to apply the pullreq's commits to your CM4 build and test them, and to let us know.  I can help you for any questions you have.

IOtherwise, I can't fix this until I get my own CM4.

Are you okay with that?

Note that if I don't submit something soon a number of my commits in Linux will be reverted, assuming that has not already happened.

Thanks,
Jim
Comment 36 Cyril Brulebois 2022-05-18 16:13:35 UTC
Sure thing, I'm happy to be cc'ed and to test anything, be it isolated patches, branches, etc.

Cheers,
Cyril.
Comment 37 Aurelien Jarno 2022-07-12 20:34:02 UTC
Created attachment 301403 [details]
dt overlay to workaround the issue

I confirm that just adding the pci@0,0 node fixes the reported issue. And it doesn't look that wrong given the PCIE slot exists, even if unpopulated.

To avoid having to mangle the existing .dtb file until the issue is fixed upstream, one option is to use a DT overlay like the one attached. Compile it with dtc, copy the .dtbo to the overlays/ directory and add it to config.txt.
Comment 38 Bjorn Helgaas 2023-01-17 21:48:00 UTC
Has this problem been resolved?  If so, can somebody add a link to the commit that fixed it?  If not, what needs to be done to make progress on it?
Comment 39 Cyril Brulebois 2023-01-17 21:58:02 UTC
Hi Bjorn,

Yes, you solved this by reverting the whole patchset via:
- 7894025c783ca36394d3afe49c8cfb4c830b82fe
- 212942609d83b591f5a2f2691df122d13aa3a87d
- 420be2f7ebe60c9ba3e332f5290017cd168e2bf8
- f4fd559de3434c44bed1d2912bd0c75cfa42898b

which landed in v5.19-rc1.

The patchset was resubmitted later on, and I've run some checks to ensure this issue didn't come back. I'm not familiar with Linux workflows, but looks like something that can be closed at this time.
Comment 40 Bjorn Helgaas 2023-01-17 22:00:39 UTC
Thank you, Cyril!  I'll close it as resolved.  Please don't hesitate to open a new bug report if you see it or other issues in the future!