Bug 217276 - Kernel panic - not syncing: Asynchronous SError Interrupt (brcm_pcie_probe), with Raspberry Pi CM4 + PCIe setups
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: ARM Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-30 23:43 UTC by Cyril Brulebois
Modified: 2023-12-27 23:44 UTC
CC List: 8 users

See Also:
Kernel Version: master
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg with master (b2bc47e9b201) (15.52 KB, text/plain)
2023-03-30 23:43 UTC, Cyril Brulebois
dmesg with v6.1.20 (15.45 KB, text/plain)
2023-03-30 23:51 UTC, Cyril Brulebois
serial log capture, PCIe/SATA adapter kernel panic (16.39 KB, text/plain)
2023-05-04 13:18 UTC, HankB
Traces with v5 and PCIe/SATA adapter (16.15 KB, text/plain)
2023-05-09 15:17 UTC, HankB
Trace with V5, PCIe/SATA and patched DTB (16.35 KB, text/plain)
2023-05-09 15:19 UTC, HankB

Description Cyril Brulebois 2023-03-30 23:43:13 UTC
Created attachment 304062 [details]
dmesg with master (b2bc47e9b201)

Hi,

This bug can be tricky to reproduce, since hitting or dodging it seems very much dependent on the actual chips and revisions of all involved components.

The general setup is:

- Raspberry Pi Compute Module 4
- Raspberry Pi Compute Module 4 IO Board (carrier board)
- Something plugged onto the PCIe slot

At the moment, I'm able to reproduce this issue reliably with:

- Compute Module 4 including eMMC (Compute Module 4 Lite, without eMMC, using the exact same operating system image on an SD card, doesn't trigger the issue).
- SupaHub PCIe-to-multiple-USB adapter, reference PCE6U1C-R02, VER 006S (PCE6U1C-R02, VER 006 looks very similar, but definitely includes different chips on its PCB, and doesn't trigger the issue).

With either v6.1.20 as packaged by Debian, or with a local master build (as of b2bc47e9b201), plus a Debian testing userspace, I'm hitting the following kernel panic:

```
[    1.914315] Kernel panic - not syncing: Asynchronous SError Interrupt
[    1.914317] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc4+ #1
[    1.914322] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[    1.914324] Call trace:
[    1.914326]  dump_backtrace+0xa8/0x138
[    1.914333]  show_stack+0x20/0x38
[    1.914336]  dump_stack_lvl+0x48/0x60
[    1.914345]  dump_stack+0x18/0x28
[    1.914350]  panic+0x378/0x398
[    1.914355]  nmi_panic+0xb4/0xc0
[    1.914359]  arm64_serror_panic+0x78/0x90
[    1.914363]  do_serror+0x30/0x70
[    1.914367]  el1h_64_error_handler+0x30/0x48
[    1.914371]  el1h_64_error+0x64/0x68
[    1.914375]  pci_generic_config_read+0x44/0xe8
[    1.914380]  pci_bus_read_config_dword+0x98/0x140
[    1.914386]  pci_bus_generic_read_dev_vendor_id+0x3c/0x1c0
[    1.914390]  pci_scan_single_device+0xa8/0x118
[    1.914393]  pci_scan_slot+0x6c/0x1e0
[    1.914396]  pci_scan_child_bus_extend+0x50/0x2e0
[    1.914399]  pci_scan_bridge_extend+0x31c/0x5a8
[    1.914403]  pci_scan_child_bus_extend+0x1c4/0x2e0
[    1.914406]  pci_scan_root_bus_bridge+0x6c/0xf8
[    1.914409]  pci_host_probe+0x20/0xd0
[    1.914413]  brcm_pcie_probe+0x294/0x618
[    1.914419]  platform_probe+0x70/0xe8
[    1.914426]  really_probe+0x18c/0x3d8
[    1.914429]  __driver_probe_device+0x84/0x198
[    1.914434]  driver_probe_device+0x44/0x120
[    1.914437]  __driver_attach+0xfc/0x210
[    1.914441]  bus_for_each_dev+0x7c/0xe8
[    1.914445]  driver_attach+0x2c/0x40
[    1.914448]  bus_add_driver+0x118/0x228
[    1.914452]  driver_register+0x68/0x138
[    1.914456]  __platform_driver_register+0x30/0x48
[    1.914461]  brcm_pcie_driver_init+0x24/0x38
[    1.914468]  do_one_initcall+0x4c/0x238
[    1.914472]  kernel_init_freeable+0x21c/0x3f0
[    1.914479]  kernel_init+0x2c/0x1f8
[    1.914483]  ret_from_fork+0x10/0x20
```

Full dmesg captured from b2bc47e9b201 is attached, I'll follow up with a very similar trace using v6.1.20.
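
For anyone comparing their own panic against the trace above, here is a minimal Python sketch that pulls the function names out of a call trace and reports the first frame below the SError handling path. The `HANDLER_FRAMES` set is an assumption built only from the frames visible in this report, and the helper names are made up for illustration. Since the SError is asynchronous, the reported frame is where the CPU was when the error was taken, which may not be the exact access that caused it.

```python
import re

# Matches frames such as "[    1.914375]  pci_generic_config_read+0x44/0xe8".
FRAME_RE = re.compile(r"^\[\s*\d+\.\d+\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")

# Frames that belong to the panic/exception path rather than the interrupted
# code. Assumption: taken from the trace in this report, not an exhaustive list.
HANDLER_FRAMES = {
    "dump_backtrace", "show_stack", "dump_stack_lvl", "dump_stack",
    "panic", "nmi_panic", "arm64_serror_panic", "do_serror",
    "el1h_64_error_handler", "el1h_64_error",
}


def parse_frames(log_text):
    """Return the function names of all call-trace frames, in order."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(1))
    return frames


def first_interrupted_frame(frames):
    """Return the first frame that is not part of the error-handling path."""
    for name in frames:
        if name not in HANDLER_FRAMES:
            return name
    return None


SAMPLE = """\
[    1.914324] Call trace:
[    1.914363]  do_serror+0x30/0x70
[    1.914367]  el1h_64_error_handler+0x30/0x48
[    1.914371]  el1h_64_error+0x64/0x68
[    1.914375]  pci_generic_config_read+0x44/0xe8
[    1.914380]  pci_bus_read_config_dword+0x98/0x140
"""

print(first_interrupted_frame(parse_frames(SAMPLE)))  # pci_generic_config_read
```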

Serial logging is set up as follows, should that matter:

- "earlycon console=ttyS1,115200" on the kernel command line;
- "enable_jtag_gpio=1" and "force_turbo=1" in config.txt (consumed by the bootloader);
- and pins 6, 8, 10 on the pin header hooked up to a cp210x-based serial adapter.


Reminder: there was some discussion around the possible need for a subnode in the DTB when I filed the PCIe regression a while back (https://bugzilla.kernel.org/show_bug.cgi?id=215925).

I'm happy to test any patches and provide any input you folks might need.


Cheers,
Cyril.
Comment 1 Cyril Brulebois 2023-03-30 23:51:03 UTC
Created attachment 304063 [details]
dmesg with v6.1.20
Comment 2 HankB 2023-03-31 02:51:00 UTC
I've hit the same bug when booting from SD card and with different PCIe adapters (2 SATA, 1 NVMe) in the PCIe slot on the official CM4 IO board. I can also help with testing.

Thanks!
Comment 3 Jim Quinlan 2023-03-31 18:51:18 UTC
Hi, 

I'm the Broadcom STB PCIe driver maintainer (which covers RPi).

I believe the issue here is that a lot of the cheap x1/x4 cards out there have their clkreq# signal unattached.  The current driver assumes that the clkreq# line is connected and working, as it is on most of our non-RPi STB boards.  You may want to look at the clkreq# pin (12) on your card; I'm guessing you will not see a PCB trace line connected.

The driver must be modified to allow this.  Note that this is not a regression; the driver has always behaved this way for cards like this.

Our PCIe HW has to be deliberately set into one of three clkreq# modes: "none",
"aspm", and "l1ss".  Right now the default is "aspm".  Once the mode is set it is unsafe for the mode to be changed dynamically.

The Raspbian folks use an unofficial property "brcm,l1ss" which puts the PCIe HW into L1SS mode.  Although using this mode gets around the error, it is a mode that is more apt for L1SS-capable cards.

I'm working on new commit(s) and will propose them here for testing so that we can be sure we are looking at the same issue.  Then I will submit them upstream.  I appreciate anyone that can test for me, HankB, Cyril, ...

Regards,
Jim Quinlan
Broadcom STB
Comment 4 HankB 2023-05-04 13:18:40 UTC
Created attachment 304217 [details]
serial log capture, PCIe/SATA adapter kernel panic

Capture referenced in 2023-05-04 comment
Comment 5 HankB 2023-05-04 13:20:57 UTC
Hi Jim,
I have been performing further testing on this. My setup is

* CM4 Lite (with up-to-date EEPROM: 2023-01-11)
* Official IO Board
* Debian Bookworm install (including Gnome) on an SD card
* Two different PCIe/NVME Adapters ("PCENVME-N01 VER0006S" PCIe 3.0 x1 and "NFHK Model: N-M2X1 Ver. 1.0" PCIe 3.0 x1)
* One PCIe/SATA adapter, SI-PEX40156, ASmedia 1064 chip, PCIe 3.0 x1.

The system works well with either NVME card. The former has been in use for several weeks now without any apparent problems (booting from the NVME SSD.) The latter was briefly tested with your v5 patch series.

The PCIe/SATA card results in the characteristic kernel panic with results captured using a serial connection and the earlycon boot parameter. A full capture is attached to previous message.

If there is anything I can do to help this along, please speak up. I'm thrilled that a CM4 can boot and run from NVME and really appreciate the effort that has gone into this.

Thanks!
Comment 6 Bjorn Helgaas 2023-05-04 14:33:45 UTC
I assume these plug-in cards work fine in other systems, so I'm dubious that the problem is the cards.

Comment #3 suggests that "brcm,enable-l1ss" from [1] avoids the problem.  Cyril and Hank, have you tried that series with that property in the DT?  If so, what are your observations?

It's not clear whether it's safe to use that property in general.  A hardware engineer said defaulting to that configuration was a bad idea and "asking for trouble" [2], but I don't know what's behind that or what sort of trouble could ensue.

If it is safe, and if it turns out to avoid this issue, that would be great.

[1] https://lore.kernel.org/r/20230428223500.23337-1-jim2101024@gmail.com
[2] https://lore.kernel.org/r/CA+-6iNxO6y_y5En2Q7YHgDGh=v4a-8E1Qbr2VL0NpWNNJqRf-g@mail.gmail.com
Comment 7 HankB 2023-05-04 15:17:42 UTC
I have tried the V5 patch series with a DTB provided by Cyril that implements "brcm,enable-l1ss" and found no change in behavior. In other words, the NVME SSD still worked (booting from SD card) and the kernel still panicked with the PCIe/SATA card in the slot.
Comment 8 Bjorn Helgaas 2023-05-04 15:55:20 UTC
Just to double-check, I assume you mean the *v4* series (not v5) at https://lore.kernel.org/r/20230428223500.23337-1-jim2101024@gmail.com ?
Comment 9 Cyril Brulebois 2023-05-04 17:06:06 UTC
Regarding the double-check: yes, Hank and I are definitely using the *v4* series, applied on top of documented base commit (76f598ba7d8e2bfb4855b5298caedd5af0c374a8).

Regarding the "brcm,enable-l1ss" property: before upgrading the EEPROM on my CM4 devices, it made a difference for me (see first table on https://lore.kernel.org/all/20230502231558.5zt5tyxczd22ppjz@mraw.org/#t, comparing lines by pairs: 1 & 2, 3 & 4, 5 & 6); after upgrading, it made no apparent differences (see second table, same mail).


For the avoidance of doubt, I tested this by setting that property alongside "brcm,enable-ssc" directly in the DTSI (for testing purposes only, knowing it would only be used on CM4 devices):

--- a/arch/arm/boot/dts/bcm2711.dtsi
+++ b/arch/arm/boot/dts/bcm2711.dtsi
@@ -584,6 +584,7 @@ IRQ_TYPE_LEVEL_HIGH>,
                        dma-ranges = <0x02000000 0x0 0x00000000 0x0 0x00000000
                                      0x0 0xc0000000>;
                        brcm,enable-ssc;
+                       brcm,enable-l1ss;
                };
 
                genet: ethernet@7d580000 {

The resulting "arch/arm64/boot/dts/broadcom/bcm2711-rpi-cm4-io.dtb" is what I deployed on my test systems, and what I shared with Hank; to be deployed under "/boot/firmware/", replacing the original file.

If Hank needs to double check whether the test with the property set was indeed done with the proper DTB, here are the SHA-1 checksums for both:

- before (original code, without it): ac4ca46963aa967e7cd54d066937d6a092f35d70
- after (updated code, with it):      c652ccf2eeb5652c28a2e36c396831957e1536a3

On a running system, this can also be verified by checking whether "/proc/device-tree/scb/pcie@7d500000/brcm,enable-l1ss" is absent or present.
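
The two checks described above (checksum of the deployed DTB, presence of the property in the live device tree) can be scripted. The following is only a sketch: the two checksums and the paths are the ones quoted in this comment, while the function names are made up for illustration.

```python
import hashlib
from pathlib import Path

# Checksums quoted in this thread for bcm2711-rpi-cm4-io.dtb.
KNOWN_DTBS = {
    "ac4ca46963aa967e7cd54d066937d6a092f35d70": "original (without brcm,enable-l1ss)",
    "c652ccf2eeb5652c28a2e36c396831957e1536a3": "patched (with brcm,enable-l1ss)",
}


def sha1_of(path):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify_dtb(path):
    """Map a DTB file to one of the two builds from this thread, or 'unknown'."""
    return KNOWN_DTBS.get(sha1_of(path), "unknown")


def l1ss_enabled(node="/proc/device-tree/scb/pcie@7d500000"):
    """True if the boolean property exists under the PCIe node of the live DT."""
    return (Path(node) / "brcm,enable-l1ss").exists()


if __name__ == "__main__":
    dtb = "/boot/firmware/bcm2711-rpi-cm4-io.dtb"
    if Path(dtb).exists():
        print(dtb, "->", identify_dtb(dtb))
    print("brcm,enable-l1ss present:", l1ss_enabled())
```

A boolean device tree property shows up as an empty file, so checking for existence is enough; the file's content is not meaningful.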

Cheers,
Cyril.
Comment 10 HankB 2023-05-04 20:03:09 UTC
Confirming the DTB provided by Cyril (on the running system):

hbarta@cm4deb:~$ ls -l /proc/device-tree/scb/pcie@7d500000/brcm,enable-l1ss
-r--r--r-- 1 root root 0 May  4 15:00 /proc/device-tree/scb/pcie@7d500000/brcm,enable-l1ss
hbarta@cm4deb:~$ sha1sum /boot/firmware/bcm2711-rpi-cm4-io.dtb
c652ccf2eeb5652c28a2e36c396831957e1536a3  /boot/firmware/bcm2711-rpi-cm4-io.dtb
hbarta@cm4deb:~$
Comment 11 HankB 2023-05-09 15:17:35 UTC
Created attachment 304235 [details]
Traces with v5 and PCIe/SATA adapter

minicom.2023-05-09-0930.cap is using the original DTB (ac4ca46963aa967e7cd54d066937d6a092f35d70).

Kernel image kindly provided by Cyril
Comment 12 HankB 2023-05-09 15:19:15 UTC
Created attachment 304236 [details]
Trace with V5, PCIe/SATA and patched DTB

minicom.2023-05-09-1010.cap is captured using the patched DTB.
c652ccf2eeb5652c28a2e36c396831957e1536a3  /boot/firmware/bcm2711-rpi-cm4-io.dtb
Comment 13 kernel 2023-08-30 15:24:54 UTC
This happens on my Pi 400 too.
No change in kernel or DTB.
Using bus 2, device 002: ASMedia Technology SATA bridge asm1035e
Comment 14 Cyril Brulebois 2023-08-30 15:30:54 UTC
The patch series (v6) was partially merged in the following branch:

https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git/log/?h=controller/brcmstb

The interesting commit for this particular bug is included, you might want to give that branch a spin.
Comment 15 Jim Quinlan 2023-08-30 16:03:38 UTC
Hey folks, 

I only got today's email from Bugzilla; I did not see any email from 5/4 to current.  Perhaps gmail was placing them in the spam folder, although I've never really had that problem.

Please, when you post, attach two logs:  the "control experiment" log, which is from the commit preceding my patch-series.  Then apply the patches and send that log. It is paramount that you have *everything* else exactly the same between the two tests.
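
To make such a before/after comparison less error-prone, a small Python sketch like the one below can strip the kernel timestamps from two captured logs and diff only the PCIe-related lines. The function names and the default `"pcie"` filter keyword are illustrative choices, not something the driver or Bugzilla prescribes.

```python
import difflib
import re

# Leading "[    3.702650] " style timestamps differ between boots, so drop them.
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def normalize(log_text, keyword="pcie"):
    """Return timestamp-free log lines that mention the keyword (case-insensitive)."""
    lines = []
    for line in log_text.splitlines():
        line = TIMESTAMP_RE.sub("", line).rstrip()
        if keyword.lower() in line.lower():
            lines.append(line)
    return lines


def compare_logs(control, patched, keyword="pcie"):
    """Unified diff of the filtered lines; an empty list means no difference."""
    return list(difflib.unified_diff(
        normalize(control, keyword), normalize(patched, keyword),
        fromfile="control", tofile="patched", lineterm="", n=0))


control = "[    3.702650] brcm-pcie fd500000.pcie: uni-dir CLKREQ# for L0s, L1 ASPM\n"
patched = "[    3.698112] brcm-pcie fd500000.pcie: uni-dir CLKREQ# for L0s, L1 ASPM\n"
print(compare_logs(control, patched))  # [] because only the timestamps differ
```

In practice the two strings would be read from the two serial captures; passing `keyword="brcm-pcie"` narrows the comparison to the host controller's own messages.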

At any rate, it appears that there is a SATA card that panics, correct?  I see this line in "sata.minicom.2023-05-03.1651.cap":


[    3.702650] brcm-pcie fd500000.pcie: uni-dir CLKREQ# for L0s, L1 ASPM

This means that the settings made by the driver to the PCIe core are exactly the same as they are done by default before the patch series was applied.  Hence my admonition about including the before and after logs with all other variables frozen.

Note that with these SATA devices you need to add external power both to the SATA drive AND to the card; do not assume that the power supplied by the CM4 board is enough for the card.  I've just seen an example of this with a USB card on the CM4.

I happen to have an Asmedia card but it is 1b21:0612 not 1b21:1064.  I'll fire it up when I get a moment.
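
For reporters who want to state the exact vendor:device ID of their card in the same `1b21:1064` form, here is a short Python sketch that reads the IDs from sysfs. It assumes a system where the card enumerates without panicking (another machine, for instance); the function names are made up for illustration, and the `root` parameter exists mainly so the code can be exercised against a test directory.

```python
from pathlib import Path


def read_hex(path):
    """Parse a sysfs attribute such as '0x1b21' into an integer."""
    return int(Path(path).read_text().strip(), 16)


def list_pci_ids(root="/sys/bus/pci/devices"):
    """Return {bus address: 'vvvv:dddd'} for every PCI device under root."""
    ids = {}
    base = Path(root)
    if not base.is_dir():
        return ids
    for dev in sorted(base.iterdir()):
        try:
            vendor = read_hex(dev / "vendor")
            device = read_hex(dev / "device")
        except (OSError, ValueError):
            # Skip entries without readable vendor/device attributes.
            continue
        ids[dev.name] = f"{vendor:04x}:{device:04x}"
    return ids


if __name__ == "__main__":
    for address, pci_id in list_pci_ids().items():
        print(address, pci_id)
```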

Regards,
Jim Quinlan
Broadcom STB/CM
Comment 16 Jim Quinlan 2023-08-31 16:25:44 UTC
One more thing: make sure you update your CM4 eeprom blob to the latest version.  Cyril was using an old version -- actually a pre-release version -- and when he updated his CM4 eeprom image the errors went away.
Comment 17 Anne Macedo 2023-12-03 05:06:31 UTC
I faced a similar issue on the CM4, but I can't reproduce it anymore. I'm using a USB card with a Renesas upd720201 chip, and when I tried to run lspci on a freshly compiled kernel, it triggered a kernel panic.

[   31.293515] SError Interrupt on CPU2, code 0x00000000bf000002 -- SError
[   31.293527] CPU: 2 PID: 750 Comm: lspci Tainted: G         C         6.1.64-v8-VFIO_ENABLED+ #2
[   31.293533] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[   31.293536] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   31.293541] pc : pci_generic_config_read+0x44/0xc0
[   31.293557] lr : pci_generic_config_read+0x2c/0xc0
[   31.293563] sp : ffffffc008dabbd0
[   31.293564] x29: ffffffc008dabbd0 x28: ffffff8040f06d80 x27: 0000000000000000
[   31.293573] x26: 000000000000000f x25: ffffff8040f06d80 x24: 0000000000000040
[   31.293578] x23: 0000000000000040 x22: ffffffc008dabca4 x21: ffffffdb2851f0b8
[   31.293584] x20: ffffffc008dabc24 x19: 0000000000000004 x18: 0000000000000000
[   31.293589] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   31.293593] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[   31.293598] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffdb274c55fc
[   31.293603] x8 : ffffffc008dabb18 x7 : 0000000000000000 x6 : 000000000000000b
[   31.293607] x5 : ffffff8041229340 x4 : ffffffc008ae0000 x3 : ffffffc008ae9000
[   31.293612] x2 : 0000000000008000 x1 : 00000000deaddead x0 : ffffffc008ae8000
[   31.293619] Kernel panic - not syncing: Asynchronous SError Interrupt
[   31.293622] CPU: 2 PID: 750 Comm: lspci Tainted: G         C         6.1.64-v8-VFIO_ENABLED+ #2
[   31.293626] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[   31.293629] Call trace:
[   31.293630]  dump_backtrace.part.0+0xec/0x100
[   31.293637]  show_stack+0x20/0x30
[   31.293640]  dump_stack_lvl+0x88/0xb4
[   31.293649]  dump_stack+0x18/0x34
[   31.293655]  panic+0x1a0/0x370
[   31.293662]  nmi_panic+0xb4/0xbc
[   31.293667]  arm64_serror_panic+0x78/0x84
[   31.293671]  is_valid_bugaddr+0x0/0x30
[   31.293675]  el1h_64_error_handler+0x38/0x50
[   31.293679]  el1h_64_error+0x64/0x68
[   31.293683]  pci_generic_config_read+0x44/0xc0
[   31.293688]  pci_user_read_config_dword+0x80/0x120
[   31.293694]  pci_read_config+0xec/0x2a4
[   31.293699]  sysfs_kf_bin_read+0x74/0x94
[   31.293704]  kernfs_fop_read_iter+0xa8/0x1b4
[   31.293707]  vfs_read+0x214/0x2c0
[   31.293712]  ksys_pread64+0x84/0xd0
[   31.293716]  __arm64_sys_pread64+0x28/0x34
[   31.293720]  invoke_syscall+0x50/0x120
[   31.293727]  el0_svc_common.constprop.0+0x68/0x124
[   31.293732]  do_el0_svc+0x34/0xd0
[   31.293738]  el0_svc+0x30/0x94
[   31.293741]  el0t_64_sync_handler+0xf4/0x120
[   31.293745]  el0t_64_sync+0x18c/0x190
[   31.293750] SMP: stopping secondary CPUs
[   32.363059] SMP: failed to stop secondary CPUs 0,2
[   32.363063] Kernel Offset: 0x1b1ee00000 from 0xffffffc008000000
[   32.363065] PHYS_OFFSET: 0x0
[   32.363067] CPU features: 0x80000,2013c080,0000421b
[   32.363070] Memory Limit: none
[   32.620285] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

I'm using this version of the kernel - 6.1.64-v8.

The error went away after some restarts, so I'm not sure if I can reproduce it consistently or if it's the same error.
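
As a side note on the `code 0x00000000bf000002` value in the first line of that log: it can be split into fields with a few lines of Python. This is a sketch based on my understanding of the Arm ESR_ELx layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for SError an IDS flag in bit 24); the constant and function names are made up for illustration, so please check the result against the architecture manual before relying on it.

```python
ESR_EC_SHIFT = 26
ESR_EC_MASK = 0x3F
ESR_IL = 1 << 25              # instruction length bit
ESR_ISS_MASK = (1 << 25) - 1  # bits 24:0
ESR_EC_SERROR = 0x2F          # exception class for SError interrupts
ESR_IDS = 1 << 24             # SError only: syndrome is implementation defined


def decode_esr(esr):
    """Split a 32-bit ESR value into exception class, IL and ISS fields."""
    ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK
    iss = esr & ESR_ISS_MASK
    is_serror = ec == ESR_EC_SERROR
    return {
        "ec": ec,
        "is_serror": is_serror,
        "il": bool(esr & ESR_IL),
        "iss": iss,
        # Only meaningful for SError; None for other exception classes.
        "impl_defined": bool(iss & ESR_IDS) if is_serror else None,
    }


fields = decode_esr(0xBF000002)
print(hex(fields["ec"]), fields["is_serror"], fields["impl_defined"])
# 0x2f True True
```

If that reading is right, the IDS bit being set means the remaining syndrome bits are implementation defined, so the trailing `0x2` cannot be interpreted without the documentation for the CPU core.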
Comment 18 Jim Quinlan 2023-12-03 21:29:51 UTC
Hello Anne, 
This could be related but I cannot tell for sure.  My fix has yet to be accepted by upstream Linux; however, you should be okay on Raspbian.
Regards,
Jim Quinlan
Broadcom STB/CM
Comment 19 3Rivers 2023-12-27 15:22:06 UTC
This issue happens on my Raspberry Pi Compute Module 4 as well.
The board worked very well early this year, but when I tried to reinstall it a month ago, it failed at boot.

The issue happens when:
* booting with a WaveShare PCIe SATA card attached

I tried upgrading the EEPROM, which did not help.
I then looked at the boot order in the EEPROM config (boot.conf), and the boot order is fine.

Since this board was working at least as of Feb 2023, I suspected the kernel version.

So I downloaded an old version of the Lite OS from here: https://downloads.raspberrypi.com/raspios_lite_arm64/images/raspios_lite_arm64-2021-11-08/

Make sure the PCIe board is not plugged in, install this OS, and then use "sudo rpi-update <HASH>" to upgrade the kernel to the last v5.15.92 one.
The HASH can be found at this link:
https://github.com/raspberrypi/rpi-firmware/commits/master/?after=7ca14294c4bf09fda8d138f9987cd031ced61f7c+69

Once the upgrade is done, reboot the Pi. Before plugging the PCIe board back in, make sure to finish all changes related to the kernel or boot (for example, enabling the memory cgroup); otherwise the Async SError will happen again.

That's all.
This may help some people.
Comment 20 Jim Quinlan 2023-12-27 16:22:26 UTC
Hello 3Rivers (and Anne if it applies),
The context of this bug is running the RPi4 + CM under upstream or Debian Linux.  This bug report does not intend to cover issues with Raspbian OS, although there may be commonality.

Regards,
Jim Quinlan
Broadcom STB/CM
Comment 21 3Rivers 2023-12-27 23:44:35 UTC
Hi, Jim, 

Sure. I am trying to say that it's not about Raspbian, Debian, or the EEPROM, so people don't have to waste time working on those.

Until your fix is accepted by upstream Linux, downgrading the kernel should be a quick win for this issue.

And thanks for the fix.
