Created attachment 304062 [details]
dmesg with master (b2bc47e9b201)

Hi,

This bug can be tricky to reproduce, since hitting or dodging it seems very much dependent on the actual chips and revisions of all involved components.

The general setup is:
- Raspberry Pi Compute Module 4
- Raspberry Pi Compute Module 4 IO Board (carrier board)
- Something plugged onto the PCIe slot

At the moment, I'm able to reproduce this issue reliably with:
- Compute Module 4 including eMMC (Compute Module 4 Lite, without eMMC, using the exact same operating system image on an SD card, doesn't trigger the issue).
- SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S (PCE6U1C-R02, VER 006 looks very similar, but definitely includes different chips on its PCB, and doesn't trigger the issue).

With either v6.1.20 as packaged by Debian, or with a local master build (as of b2bc47e9b201), plus a Debian testing userspace, I'm hitting the following kernel panic:

```
[ 1.914315] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 1.914317] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc4+ #1
[ 1.914322] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 1.914324] Call trace:
[ 1.914326] dump_backtrace+0xa8/0x138
[ 1.914333] show_stack+0x20/0x38
[ 1.914336] dump_stack_lvl+0x48/0x60
[ 1.914345] dump_stack+0x18/0x28
[ 1.914350] panic+0x378/0x398
[ 1.914355] nmi_panic+0xb4/0xc0
[ 1.914359] arm64_serror_panic+0x78/0x90
[ 1.914363] do_serror+0x30/0x70
[ 1.914367] el1h_64_error_handler+0x30/0x48
[ 1.914371] el1h_64_error+0x64/0x68
[ 1.914375] pci_generic_config_read+0x44/0xe8
[ 1.914380] pci_bus_read_config_dword+0x98/0x140
[ 1.914386] pci_bus_generic_read_dev_vendor_id+0x3c/0x1c0
[ 1.914390] pci_scan_single_device+0xa8/0x118
[ 1.914393] pci_scan_slot+0x6c/0x1e0
[ 1.914396] pci_scan_child_bus_extend+0x50/0x2e0
[ 1.914399] pci_scan_bridge_extend+0x31c/0x5a8
[ 1.914403] pci_scan_child_bus_extend+0x1c4/0x2e0
[ 1.914406] pci_scan_root_bus_bridge+0x6c/0xf8
[ 1.914409] pci_host_probe+0x20/0xd0
[ 1.914413] brcm_pcie_probe+0x294/0x618
[ 1.914419] platform_probe+0x70/0xe8
[ 1.914426] really_probe+0x18c/0x3d8
[ 1.914429] __driver_probe_device+0x84/0x198
[ 1.914434] driver_probe_device+0x44/0x120
[ 1.914437] __driver_attach+0xfc/0x210
[ 1.914441] bus_for_each_dev+0x7c/0xe8
[ 1.914445] driver_attach+0x2c/0x40
[ 1.914448] bus_add_driver+0x118/0x228
[ 1.914452] driver_register+0x68/0x138
[ 1.914456] __platform_driver_register+0x30/0x48
[ 1.914461] brcm_pcie_driver_init+0x24/0x38
[ 1.914468] do_one_initcall+0x4c/0x238
[ 1.914472] kernel_init_freeable+0x21c/0x3f0
[ 1.914479] kernel_init+0x2c/0x1f8
[ 1.914483] ret_from_fork+0x10/0x20
```

Full dmesg captured from b2bc47e9b201 is attached, I'll follow up with a very similar trace using v6.1.20.

Serial logging implemented this way, should that matter:
- "earlycon console=ttyS1,115200" on the kernel command line;
- "enable_jtag_gpio=1" and "force_turbo=1" in config.txt (consumed by the bootloader);
- and pins 6, 8, 10 on the pin header hooked up on a cp210x-based serial adapter.

Reminder: there was some discussion around the possible need for a subnode in the DTB when I filed the PCIe regression a while back (https://bugzilla.kernel.org/show_bug.cgi?id=215925).

I'm happy to test any patches and provide any input you folks might need.

Cheers,
Cyril.
Created attachment 304063 [details] dmesg with v6.1.20
I've hit the same bug when booting from SD card and with different PCIe adapters (2 SATA, 1 NVMe) in the PCIe slot on the official CM4 IO board. I can also help with testing. Thanks!
Hi,

I'm the Broadcom STB PCIe driver maintainer (which covers RPi). I believe the issue here is that a lot of the cheap x1/x4 cards out there have their clkreq# signal unattached. The current driver assumes that the clkreq# line is connected and working, as it is on most of our non-RPi STB boards. You may want to look at the clkreq# pin (12) on your card; I'm guessing you will not see a PCB trace line connected.

The driver must be modified to allow this. Note that this is not a regression; the driver has always behaved this way for cards like this.

Our PCIe HW has to be deliberately set into one of three clkreq# modes: "none", "aspm", and "l1ss". Right now the default is "aspm". Once the mode is set, it is unsafe to change it dynamically. The Raspbian folks use an unofficial property "brcm,l1ss" which puts the PCIe HW into L1SS mode. Although using this mode gets around the error, it is a mode that is more apt for L1SS-capable cards.

I'm working on new commit(s) that I will propose here for testing, so that we can be sure we are looking at the same issue. Then I will submit them upstream. I appreciate anyone who can test for me: HankB, Cyril, ...

Regards,
Jim Quinlan
Broadcom STB
Created attachment 304217 [details] serial log capture, PCIe/SATA adapter kernel panic Capture referenced in 2023-05-04 comment
Hi Jim,

I have been performing further testing on this. My setup is:
* CM4 Lite (with up-to-date EEPROM: 2023-01-11)
* Official IO Board
* Debian Bookworm install (including Gnome) on an SD card
* Two different PCIe/NVME adapters ("PCENVME-N01 VER0006S" PCIe 3.0 x1 and "NFHK Model: N-M2X1 Ver. 1.0" PCIe 3.0 x1)
* One PCIe/SATA adapter, SI-PEX40156, ASmedia 1064 chip, PCIe 3.0 x1.

The system works well with either NVME card. The former has been in use for several weeks now without any apparent problems (booting from the NVME SSD). The latter was briefly tested with your v5 patch series.

The PCIe/SATA card results in the characteristic kernel panic, with results captured using a serial connection and the earlycon boot parameter. A full capture is attached to the previous message.

If there is anything I can do to help this along, please speak up. I'm thrilled that a CM4 can boot and run from NVME and really appreciate the effort that has gone into this.

Thanks!
I assume these plug-in cards work fine in other systems, so I'm dubious that the problem is the cards. Comment #3 suggests that "brcm,enable-l1ss" from [1] avoids the problem. Cyril and Hank, have you tried that series with that property in the DT? If so, what are your observations? It's not clear whether it's safe to use that property in general. A hardware engineer said defaulting to that configuration was a bad idea and "asking for trouble" [2], but I don't know what's behind that or what sort of trouble could ensue. If it is safe, and if it turns out to avoid this issue, that would be great. [1] https://lore.kernel.org/r/20230428223500.23337-1-jim2101024@gmail.com [2] https://lore.kernel.org/r/CA+-6iNxO6y_y5En2Q7YHgDGh=v4a-8E1Qbr2VL0NpWNNJqRf-g@mail.gmail.com
I have tried the V5 patch series with a DTB provided by Cyril that implements "brcm,enable-l1ss" and found no change in behavior. In other words, the NVME SSD still worked (booting from SD card) and the kernel still panicked with the PCIe/SATA card in the slot.
Just to double-check, I assume you mean the *v4* series (not v5) at https://lore.kernel.org/r/20230428223500.23337-1-jim2101024@gmail.com ?
Regarding the double-check: yes, Hank and I are definitely using the *v4* series, applied on top of the documented base commit (76f598ba7d8e2bfb4855b5298caedd5af0c374a8).

Regarding the "brcm,enable-l1ss" property: before upgrading the EEPROM on my CM4 devices, it made a difference for me (see first table on https://lore.kernel.org/all/20230502231558.5zt5tyxczd22ppjz@mraw.org/#t, comparing lines by pairs: 1 & 2, 3 & 4, 5 & 6); after upgrading, it made no apparent difference (see second table, same mail).

For the avoidance of doubt, I tested this by setting that property alongside "brcm,enable-ssc" directly in the DTSI (for testing purposes only, knowing it would only be used on CM4 devices):

    --- a/arch/arm/boot/dts/bcm2711.dtsi
    +++ b/arch/arm/boot/dts/bcm2711.dtsi
    @@ -584,6 +584,7 @@ IRQ_TYPE_LEVEL_HIGH>,
     dma-ranges = <0x02000000 0x0 0x00000000 0x0 0x00000000 0x0 0xc0000000>;
     brcm,enable-ssc;
    + brcm,enable-l1ss;
     };

     genet: ethernet@7d580000 {

The resulting "arch/arm64/boot/dts/broadcom/bcm2711-rpi-cm4-io.dtb" is what I deployed on my test systems, and what I shared with Hank; to be deployed under "/boot/firmware/", replacing the original file.

If Hank needs to double check whether the test with the property set was indeed done with the proper DTB, here are the sha1sums for both:
- before (original code, without it): ac4ca46963aa967e7cd54d066937d6a092f35d70
- after (updated code, with it): c652ccf2eeb5652c28a2e36c396831957e1536a3

On a running system, this can also be verified by checking whether "/proc/device-tree/scb/pcie@7d500000/brcm,enable-l1ss" is absent or present.

Cheers,
Cyril.
Confirming the DTB provided by Cyril (on the running system):

hbarta@cm4deb:~$ ls -l /proc/device-tree/scb/pcie@7d500000/brcm,enable-l1ss
-r--r--r-- 1 root root 0 May 4 15:00 /proc/device-tree/scb/pcie@7d500000/brcm,enable-l1ss
hbarta@cm4deb:~$ sha1sum /boot/firmware/bcm2711-rpi-cm4-io.dtb
c652ccf2eeb5652c28a2e36c396831957e1536a3 /boot/firmware/bcm2711-rpi-cm4-io.dtb
hbarta@cm4deb:~$
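For anyone repeating this verification on several test systems, the two checks (property present under /proc/device-tree, DTB checksum matching the patched build) can be scripted. The following is only a minimal sketch: the paths and the checksum are the ones quoted in this thread, while the helper names `sha1_of` and `check_dtb` are made up for illustration.

```python
import hashlib
from pathlib import Path

# Values quoted earlier in this thread; adjust them for other boards or DTBs.
PATCHED_DTB_SHA1 = "c652ccf2eeb5652c28a2e36c396831957e1536a3"
PROPERTY_PATH = Path("/proc/device-tree/scb/pcie@7d500000/brcm,enable-l1ss")
DTB_PATH = Path("/boot/firmware/bcm2711-rpi-cm4-io.dtb")


def sha1_of(path: Path) -> str:
    """Return the hex SHA-1 digest of a file, as `sha1sum` would print it."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_dtb(dtb: Path = DTB_PATH, prop: Path = PROPERTY_PATH) -> dict:
    """Report whether the property is exposed and whether the DTB is the patched one."""
    return {
        "property_present": prop.exists(),
        "dtb_is_patched": dtb.exists() and sha1_of(dtb) == PATCHED_DTB_SHA1,
    }


if __name__ == "__main__":
    print(check_dtb())
```

Both values should be true on a system booted with the patched DTB; a mismatch between the two suggests the file on disk is not the one the running kernel was booted with.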
Created attachment 304235 [details] Traces with v5 and PCIe/SATA adapter minicom.2023-05-09-0930.cap is using the original DTB (ac4ca46963aa967e7cd54d066937d6a092f35d70). Kernel image kindly provided by Cyril
Created attachment 304236 [details] Trace with V5, PCIe/SATA and patched DTB minicom.2023-05-09-1010.cap is captured using the patched DTB. c652ccf2eeb5652c28a2e36c396831957e1536a3 /boot/firmware/bcm2711-rpi-cm4-io.dtb
This happens on my Pi 400 too. No change in kernel or DTB. The device is on bus 2, device 002: an ASMedia Technology SATA bridge (asm1035e).
The patch series (v6) was partially merged in the following branch: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/log/?h=controller/brcmstb The interesting commit for this particular bug is included, you might want to give that branch a spin.
Hey folks,

I only got today's email from Bugzilla; I did not see any email from 5/4 to now. Perhaps gmail was placing them in the spam folder, although I've never really had that problem.

Please, when you post, attach two logs: first the "control experiment" log, which is from the commit preceding my patch series; then apply the patches and send that log. It is paramount that you have *everything* else exactly the same between the two tests.

At any rate, it appears that there is a SATA card that panics, correct? I see this line in "sata.minicom.2023-05-03.1651.cap":

[ 3.702650] brcm-pcie fd500000.pcie: uni-dir CLKREQ# for L0s, L1 ASPM

This means that the settings made by the driver to the PCIe core are exactly the same as the defaults in place before the patch series was applied. Hence my admonition about including the before and after logs with all other variables frozen.

Note that with these SATA devices you need to add external power both to the SATA drive AND to the card; do not assume that the power supplied by the CM4 board is enough for the card. I've just seen an example of this with a USB card on the CM4.

I happen to have an Asmedia card, but it is 1b21:0612, not 1b21:1064. I'll fire it up when I get a moment.

Regards,
Jim Quinlan
Broadcom STB/CM
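To make the before/after comparison requested above less error-prone, the brcm-pcie messages of the two captures can be compared mechanically. This is a rough sketch under stated assumptions: it only assumes that driver messages contain the string "brcm-pcie" and carry a dmesg-style timestamp, as the line quoted above does; the function names are hypothetical and not part of any existing tool.

```python
from __future__ import annotations

import re

# Leading "[    3.702650] " timestamp, removed so that identical messages
# from two different boots compare equal.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def brcm_pcie_lines(log_text: str) -> list[str]:
    """Return the brcm-pcie driver messages of a dmesg capture, without timestamps."""
    lines = []
    for line in log_text.splitlines():
        if "brcm-pcie" in line:
            lines.append(TIMESTAMP.sub("", line.strip()))
    return lines


def differing_lines(control_log: str, patched_log: str) -> tuple[list[str], list[str]]:
    """Return (messages only in the control log, messages only in the patched log)."""
    control = brcm_pcie_lines(control_log)
    patched = brcm_pcie_lines(patched_log)
    only_control = [line for line in control if line not in patched]
    only_patched = [line for line in patched if line not in control]
    return only_control, only_patched
```

If both returned lists are empty, the driver reported the same settings in both boots, which is the situation described above for the SATA capture.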
One more thing: make sure you update your CM4 EEPROM blob to the latest version. Cyril was using an old version -- actually a pre-release version -- and when he updated his CM4 EEPROM image the errors went away.
I faced a similar issue on the CM4, but I can't reproduce it anymore. I'm using a USB card with a Renesas upd720201 chip, and when I tried to run lspci on a freshly compiled kernel it yielded a kernel panic.

[ 31.293515] SError Interrupt on CPU2, code 0x00000000bf000002 -- SError
[ 31.293527] CPU: 2 PID: 750 Comm: lspci Tainted: G C 6.1.64-v8-VFIO_ENABLED+ #2
[ 31.293533] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 31.293536] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 31.293541] pc : pci_generic_config_read+0x44/0xc0
[ 31.293557] lr : pci_generic_config_read+0x2c/0xc0
[ 31.293563] sp : ffffffc008dabbd0
[ 31.293564] x29: ffffffc008dabbd0 x28: ffffff8040f06d80 x27: 0000000000000000
[ 31.293573] x26: 000000000000000f x25: ffffff8040f06d80 x24: 0000000000000040
[ 31.293578] x23: 0000000000000040 x22: ffffffc008dabca4 x21: ffffffdb2851f0b8
[ 31.293584] x20: ffffffc008dabc24 x19: 0000000000000004 x18: 0000000000000000
[ 31.293589] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 31.293593] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 31.293598] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffdb274c55fc
[ 31.293603] x8 : ffffffc008dabb18 x7 : 0000000000000000 x6 : 000000000000000b
[ 31.293607] x5 : ffffff8041229340 x4 : ffffffc008ae0000 x3 : ffffffc008ae9000
[ 31.293612] x2 : 0000000000008000 x1 : 00000000deaddead x0 : ffffffc008ae8000
[ 31.293619] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 31.293622] CPU: 2 PID: 750 Comm: lspci Tainted: G C 6.1.64-v8-VFIO_ENABLED+ #2
[ 31.293626] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 31.293629] Call trace:
[ 31.293630] dump_backtrace.part.0+0xec/0x100
[ 31.293637] show_stack+0x20/0x30
[ 31.293640] dump_stack_lvl+0x88/0xb4
[ 31.293649] dump_stack+0x18/0x34
[ 31.293655] panic+0x1a0/0x370
[ 31.293662] nmi_panic+0xb4/0xbc
[ 31.293667] arm64_serror_panic+0x78/0x84
[ 31.293671] is_valid_bugaddr+0x0/0x30
[ 31.293675] el1h_64_error_handler+0x38/0x50
[ 31.293679] el1h_64_error+0x64/0x68
[ 31.293683] pci_generic_config_read+0x44/0xc0
[ 31.293688] pci_user_read_config_dword+0x80/0x120
[ 31.293694] pci_read_config+0xec/0x2a4
[ 31.293699] sysfs_kf_bin_read+0x74/0x94
[ 31.293704] kernfs_fop_read_iter+0xa8/0x1b4
[ 31.293707] vfs_read+0x214/0x2c0
[ 31.293712] ksys_pread64+0x84/0xd0
[ 31.293716] __arm64_sys_pread64+0x28/0x34
[ 31.293720] invoke_syscall+0x50/0x120
[ 31.293727] el0_svc_common.constprop.0+0x68/0x124
[ 31.293732] do_el0_svc+0x34/0xd0
[ 31.293738] el0_svc+0x30/0x94
[ 31.293741] el0t_64_sync_handler+0xf4/0x120
[ 31.293745] el0t_64_sync+0x18c/0x190
[ 31.293750] SMP: stopping secondary CPUs
[ 32.363059] SMP: failed to stop secondary CPUs 0,2
[ 32.363063] Kernel Offset: 0x1b1ee00000 from 0xffffffc008000000
[ 32.363065] PHYS_OFFSET: 0x0
[ 32.363067] CPU features: 0x80000,2013c080,0000421b
[ 32.363070] Memory Limit: none
[ 32.620285] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

I'm using this version of the kernel: 6.1.64-v8. The error went away after some restarts, so I'm not sure if I can reproduce it consistently, or whether it's the same error.
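One quick way to compare such a panic with the one in the original report is to look at the frame listed right after `el1h_64_error` in the call trace, which is the kernel code that was interrupted when the SError was taken. The sketch below is only an illustration: the regular expression assumes the `symbol+0xoffset/0xsize` frame format seen in the traces in this thread, and the function name `interrupted_frame` is made up.

```python
import re
from typing import Optional

# Matches call-trace frames such as "pci_generic_config_read+0x44/0xc0"
# and captures the symbol name and the offset into the function.
FRAME = re.compile(r"([A-Za-z_][\w.]*)\+(0x[0-9a-f]+)/0x[0-9a-f]+")


def interrupted_frame(trace: str, entry: str = "el1h_64_error") -> Optional[str]:
    """Return "symbol+offset" for the frame listed right after the exception entry."""
    frames = FRAME.findall(trace)
    for index, (symbol, _offset) in enumerate(frames[:-1]):
        if symbol == entry:
            next_symbol, next_offset = frames[index + 1]
            return f"{next_symbol}+{next_offset}"
    return None
```

Applied to both traces in this thread, it returns `pci_generic_config_read+0x44`. Since an SError is asynchronous and the two kernels are different builds, a match is a hint that the same config-space read path is involved, not proof of the same root cause.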
Hello Anne,

This could be related, but I cannot tell for sure. My fix has yet to be accepted by upstream Linux; however, you should be okay on Raspbian.

Regards,
Jim Quinlan
Broadcom STB/CM
This issue happens on my Raspberry Pi Compute Module 4 as well. Early this year the board worked very well, but when I tried to reinstall it one month ago, it failed at boot.

The issue happens when:
* booting with a WaveShare PCIe SATA card plugged in

I tried upgrading the EEPROM, which did not help. I then looked at the boot order in the EEPROM config (boot.conf), and the boot order is fine.

Since this board worked at least as of Feb 2023, I concluded that the problem is related to the kernel version. So I downloaded an old version of the Lite OS here: https://downloads.raspberrypi.com/raspios_lite_arm64/images/raspios_lite_arm64-2021-11-08/

Make sure the PCIe board is not plugged in, install this OS, and then use "sudo rpi-update <HASH>" to upgrade the kernel to the last v5.15.92 one. The HASH can be found at this link: https://github.com/raspberrypi/rpi-firmware/commits/master/?after=7ca14294c4bf09fda8d138f9987cd031ced61f7c+69

Once the upgrade is done, reboot the Pi. Before plugging the PCIe board back in, make sure to finish all changes related to the kernel or boot (for example, enabling the memory cgroup); otherwise the asynchronous SError will happen again.

That's all. This may help some people.
Hello 3Rivers (and Anne, if it applies),

The context of this bug is running the RPi4 + CM under upstream Linux or Debian. This bug report is not intended to cover issues with Raspbian OS, although there may be commonality.

Regards,
Jim Quinlan
Broadcom STB/CM
Hi Jim,

Sure. I am trying to say that it's not about Raspbian, Debian, or the EEPROM, so people don't have to waste time working on those. Until your fix is accepted by upstream Linux, downgrading the kernel should be a quick win for this issue. And thanks for the fix.