Bug 216094
Description
Hajo Noerenberg
2022-06-07 07:29:03 UTC
Created attachment 301113 [details]
lspci Linux version 3.2.0-4-kirkwood mach-kirkwood/pci.c: HDDs ok
Created attachment 301114 [details]
lspci Linux version 3.16.0-0.bpo.4-kirkwood - with DTB -> mvebu-pci: HDDs fail
Created attachment 301115 [details]
lspci Linux version 3.16.0-0.bpo.4-kirkwood - without DTB -> mach-kirkwood/pci.c: HDDs ok
Created attachment 301116 [details]
lspci Linux version 5.10.0-11-marvell (Debian bullseye) mvebu-pci: HDDs fail
Created attachment 301117 [details]
lspci Linux version 5.17.0-1-marvell mvebu-pci: HDDs fail
I added lspci from a more recent 5.17.0 kernel. The 6281 is detected as a v1 root port (first time, yeah!), but the HDDs continue to fail.
Created attachment 301121 [details]
dmesg Linux version 3.2.0-4-kirkwood - kw pci
Created attachment 301122 [details]
dmesg Linux version 3.16.0-0.bpo.4-kirkwood - kw pci
Created attachment 301123 [details]
dmesg Linux version 5.17.0-1-marvell - mvebu pci
Created attachment 301124 [details]
U-Boot log with pci/ahci #define DEBUG 1
Just to clarify the dmesg output:
6121-port0 is a SATA-II ST3500418AS (except with U-Boot, where I attached a 3TB EFRX)
6121-port1 is a SATA-I ST3250310NS (always works)
6121-port2 is (unused) PATA
SoC-port0 is a WD3202ABYS (always works)
SoC-port1 is a ST500NM0011 (always works)
(This is the current setup; during the investigation I think I have had every tangible HDD on every available port :-) )

Please let me know if it would help to patch the mvebu-pci module in any way for debug purposes.

For Pali: PCI/ASPM with 5.17.0:
root@nas440:~# grep -oE pci.* /proc/cmdline
pci=nomsi
-> HDDs still fail
root@nas440:~# grep -oE pci.* /proc/cmdline
pcie_aspm=off
-> HDDs still fail

I would suggest setting the flag that disables the READ_LOG_EXT command for that controller. I suspect it should be set for all PATA controllers. I only have experience with an old PPC macMini, and that ATA command causes hangs for it. However, in that case it's fine in the end since it hits a timeout, says "interrupt lost" and continues, so it only slows down boot by 10 seconds or so.

Hmm, I misread and this is a proper SATA controller, so it would be weird for it to fail for this command. Still, it might be one of the easier things to try anyway...

The DTB file can be found at my project page: https://github.com/hn/seagate-blackarmor-nas (kirkwood-blackarmor-nas440.dtb, source linux-nas440.diff or u-boot-2022.04-nas440.diff)

Created attachment 301162 [details]
dmesg Linux version 5.17.6pci-marvell - pci debug, mvebu-pci module
dmesg log with CONFIG_PCI_DEBUG=y and mvebu-pci as a module (CONFIG_PCI_MVEBU=m).
I wonder if the missing IRQ could be significant. With Kirkwood-PCI (HDDs working) the bridge had an IRQ assigned, with pci-mvebu not ("pcieport 0000:00:01.0: assign IRQ: got 0").
On Monday 13 June 2022 08:08:36 bugzilla-daemon@kernel.org wrote:
> I wonder if the missing IRQ could be significant. With Kirkwood-PCI (HDDs
> working) the bridge had an IRQ assigned, with pci-mvebu not ("pcieport
> 0000:00:01.0: assign IRQ: got 0").

Hello! mvebu PCIe Root Port does not provide interrupt support because it is not implemented in mainline kernel (yet). Patches for this support are prepared in branch pci-mvebu of my git repo: https://git.kernel.org/pub/scm/linux/kernel/git/pali/linux.git/

But I do not think this is the root cause.

Hi Pali, I compiled and started your pci-mvebu branch, unfortunately it didn't change anything (HDDs still fail).

For the INTx interrupts and other things it would probably be necessary to change the dts(i) files, i.e. port your changes from armada-*.dtsi to kirkwood-6281.dtsi. But I don't know (yet) exactly how to do that and if it would be worth the effort.

Created attachment 301220 [details]
Kirkwood DTS PCIe interrupt patch
I looked into the Kirkwood documentation and it seems that the SoC PCIe INTx interrupt is 9 and the SoC PCIe summary interrupt is 44. Attached is a patch for the Kirkwood DTS files that defines them.
Created attachment 301225 [details]
dmesg Linux 5.16.0rc1-palimvebu with patch 301220

Hi Pali, I applied your patch (attachment 301220 [details]).

The good: something changed.
The bad: now both HDDs (6121 port 0 and port1) do NOT work anymore (even the SATA-1 hard disk, which worked correctly before, does not work anymore).

This may be a slight indication that the error of the HDDs has something to do with the interrupt handling after all. But this is just a guess from me.

Created attachment 301226 [details]
lspci Linux 5.16.0rc1-palimvebu with patch 301220
On Monday 20 June 2022 08:43:35 bugzilla-daemon@kernel.org wrote: > Hi Pali, I applied your patch (attachment 301220 [details]). > > The good: something changed. > The bad: now both HDDs (6121 port 0 and port1) do NOT work anymore (even the > SATA-1 hard disk, which worked correctly before, does not work anymore). That patch only changes PCIe DTS nodes and does not touch on-board Marvell disk controller. So it should not have any effect on that second HDD which is not connected via PCIe. > This may be a slight indication that the error of the HDDs has something to > do > with the interrupt handling after all. But this is just a guess from me. Can you check in /proc/interrupts that PCIe interrupt counts are increasing during usage of disk connected via PCIe? >> The good: something changed. >> The bad: now both HDDs (6121 port 0 and port1) do NOT work anymore (even the >> SATA-1 hard disk, which worked correctly before, does not work anymore). > > That patch only changes PCIe DTS nodes and does not touch on-board > Marvell disk controller. So it should not have any effect on that second > HDD which is not connected via PCIe. > Sorry, I expressed it in a misleading way: your patch does not change the sata_mv HDDs, they always work, with and without your patch. Your patch only changes (in my case) the SATA-I HDD connected to port1 of the 6121 controller (which is connected via PCIe): 5.16.0rc1-pali-mvebu without DTS-301220-patch (same behaviour as vanilla 3.16..5.x): 6121-port0 = SATA-II ST3500418AS => "failed to IDENTIFY" 6121-port1 = SATA-I ST3250310NS => works 5.16.0rc1-pali-mvebu with DTS-301220-patch: 6121-port0 = SATA-II ST3500418AS => "failed to IDENTIFY" 6121-port1 = SATA-I ST3250310NS => "failed to IDENTIFY" >> This may be a slight indication that the error of the HDDs has something to >> do >> with the interrupt handling after all. But this is just a guess from me. 
> > Can you check in /proc/interrupts that PCIe interrupt counts are > increasing during usage of disk connected via PCIe? > Since the HDDs are not detected, there is no block device with which I could generate disk traffic or interrupts. According to my understanding the interrupts could possibly only be generated during the detection phase, but they are zero: root@nas440:~# cat /proc/interrupts CPU0 17: 41238521 bridge-interrupt-ctrl 2 Edge orion_event 25: 420 interrupt-controller@20200 29 Edge mv64xxx_i2c 26: 865 interrupt-controller@20200 33 Edge ttyS0 28: 0 bridge-interrupt-ctrl 3 Edge f1020300.watchdog-timer 29: 0 interrupt-controller@20200 22 Edge f1030000.crypto 30: 473966 interrupt-controller@20200 19 Edge ehci_hcd:usb1 31: 310380 interrupt-controller@20200 46 Edge f1072004.mdio-bus 32: 0 interrupt-controller@20200 53 Edge f1010300.rtc 33: 9175 interrupt-controller@20200 21 Edge sata_mv[f1080000.sata] 34: 2 interrupt-controller@20200 5 Edge f1060800.xor 35: 2 interrupt-controller@20200 7 Edge f1060900.xor 36: 0 f1010100.gpio 29 Edge Reset 37: 0 f1010140.gpio 17 Edge Power 38: 49658 interrupt-controller@20200 11 Edge eth0 40: 0 interrupt-controller@20200 44 Edge pcie0.0 41: 0 mvebu-rp 0 Edge pciehp 42: 0 mvebu-INTx 0 Level ahci[0000:01:00.0] Err: 0 With the old mach-kirkwood/pcie.c driver (= both SATA1/2 HDDs working with the 6121 controller) both the Host Bridge and the 88SE6121 SATA controller were connected to IRQ9 ("pin A routed to IRQ 9", see attachment "lspci Linux version 3.2.0-4-kirkwood"). With newer kernels they have different IRQs in the 40+ range. On Thursday 23 June 2022 12:35:51 bugzilla-daemon@kernel.org wrote: > Since the HDDs are not detected, there is no block device with which I could > generate disk traffic or interrupts. According to my understanding the > interrupts could possibly only be generated during the detection phase Exactly. I would expect that during detection phase there is some interrupt from controller over PCIe. 
> but they are zero: > > root@nas440:~# cat /proc/interrupts > CPU0 > 17: 41238521 bridge-interrupt-ctrl 2 Edge orion_event > 25: 420 interrupt-controller@20200 29 Edge mv64xxx_i2c > 26: 865 interrupt-controller@20200 33 Edge ttyS0 > 28: 0 bridge-interrupt-ctrl 3 Edge f1020300.watchdog-timer > 29: 0 interrupt-controller@20200 22 Edge f1030000.crypto > 30: 473966 interrupt-controller@20200 19 Edge ehci_hcd:usb1 > 31: 310380 interrupt-controller@20200 46 Edge f1072004.mdio-bus > 32: 0 interrupt-controller@20200 53 Edge f1010300.rtc > 33: 9175 interrupt-controller@20200 21 Edge > sata_mv[f1080000.sata] > 34: 2 interrupt-controller@20200 5 Edge f1060800.xor > 35: 2 interrupt-controller@20200 7 Edge f1060900.xor > 36: 0 f1010100.gpio 29 Edge Reset > 37: 0 f1010140.gpio 17 Edge Power > 38: 49658 interrupt-controller@20200 11 Edge eth0 > 40: 0 interrupt-controller@20200 44 Edge pcie0.0 > 41: 0 mvebu-rp 0 Edge pciehp > 42: 0 mvebu-INTx 0 Level ahci[0000:01:00.0] > Err: 0 This output should be from the new "non-working" kernel, right? > With the old mach-kirkwood/pcie.c driver (= both SATA1/2 HDDs working with > the > 6121 controller) both the Host Bridge and the 88SE6121 SATA controller were > connected to IRQ9 ("pin A routed to IRQ 9", see attachment "lspci Linux > version > 3.2.0-4-kirkwood"). With newer kernels they have different IRQs in the 40+ > range. IRQ numbers are dynamically assigned by kernel, they may change during kernel versions and even during reboot (initialization of drivers is asynchronous and sometimes one driver can ask for assigning IRQ number faster than other driver). So could you provide also /proc/interrupts output from "working" kernel including assigned IRQ numbers which you see in lspci (in case they changes between reboot)? Created attachment 301263 [details]
lspci Linux 5.16.0rc1-palimvebu without patch 301220
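The request to check whether the PCIe interrupt counts increase can be done mechanically by comparing two saved copies of /proc/interrupts from the same boot. The following is only a minimal sketch: it assumes a single-CPU layout like the one on this NAS (one count column), and the helper names `parse_interrupts` and `changed_rows` are made up for illustration, not part of any kernel tooling.

```python
import re

# One row looks like: " 40:   59  interrupt-controller@20200   9 Edge   pcie0.0"
# Assumption: exactly one CPU column, as on the single-core Kirkwood SoC.
ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(.*\S)\s*$")


def parse_interrupts(text):
    """Return {linux_irq: (count, rest_of_line)} for every numeric IRQ row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # the "CPU0" header and the "Err:" row do not match
            rows[int(match.group(1))] = (int(match.group(2)), match.group(3))
    return rows


def changed_rows(before, after):
    """Return (irq, delta, label) for rows whose count differs between snapshots."""
    old = parse_interrupts(before)
    new = parse_interrupts(after)
    result = []
    for irq, (count, label) in sorted(new.items()):
        delta = count - old.get(irq, (0, ""))[0]
        if delta:
            result.append((irq, delta, label))
    return result
```

Calling `changed_rows(before_text, after_text)` on a snapshot taken before and one taken after a rescan or some disk traffic lists only the rows that moved. Because Linux IRQ numbers can change between boots, the comparison is only meaningful within one boot.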
>> root@nas440:~# cat /proc/interrupts >> CPU0 >> 17: 41238521 bridge-interrupt-ctrl 2 Edge orion_event >> 25: 420 interrupt-controller@20200 29 Edge mv64xxx_i2c >> 26: 865 interrupt-controller@20200 33 Edge ttyS0 >> 28: 0 bridge-interrupt-ctrl 3 Edge f1020300.watchdog-timer >> 29: 0 interrupt-controller@20200 22 Edge f1030000.crypto >> 30: 473966 interrupt-controller@20200 19 Edge ehci_hcd:usb1 >> 31: 310380 interrupt-controller@20200 46 Edge f1072004.mdio-bus >> 32: 0 interrupt-controller@20200 53 Edge f1010300.rtc >> 33: 9175 interrupt-controller@20200 21 Edge >> sata_mv[f1080000.sata] >> 34: 2 interrupt-controller@20200 5 Edge f1060800.xor >> 35: 2 interrupt-controller@20200 7 Edge f1060900.xor >> 36: 0 f1010100.gpio 29 Edge Reset >> 37: 0 f1010140.gpio 17 Edge Power >> 38: 49658 interrupt-controller@20200 11 Edge eth0 >> 40: 0 interrupt-controller@20200 44 Edge pcie0.0 >> 41: 0 mvebu-rp 0 Edge pciehp >> 42: 0 mvebu-INTx 0 Level ahci[0000:01:00.0] >> Err: 0 > > This output should be from the new "non-working" kernel, right? > Yes, the above is 5.16.0rc1-pali-mvebu with DTS-301220-patch (both 6121 ports not working). > So could you provide also /proc/interrupts output from "working" kernel > including assigned IRQ numbers which you see in lspci (in case they > changes between reboot)? 
Depends on what you mean by "working" ;-)

5.16.0rc1-pali-mvebu without DTS-301220-patch (6121-port1=SATA-I working and port0=SATA-II not working):

root@nas440:~# cat /proc/interrupts
         CPU0
 17:    66162  bridge-interrupt-ctrl        2 Edge   orion_event
 25:      396  interrupt-controller@20200  29 Edge   mv64xxx_i2c
 26:     2473  interrupt-controller@20200  33 Edge   ttyS0
 28:        0  bridge-interrupt-ctrl        3 Edge   f1020300.watchdog-timer
 29:        0  interrupt-controller@20200  22 Edge   f1030000.crypto
 30:     1687  interrupt-controller@20200  19 Edge   ehci_hcd:usb1
 31:      432  interrupt-controller@20200  46 Edge   f1072004.mdio-bus
 32:        0  interrupt-controller@20200  53 Edge   f1010300.rtc
 33:     3550  interrupt-controller@20200  21 Edge   sata_mv[f1080000.sata]
 34:        2  interrupt-controller@20200   5 Edge   f1060800.xor
 35:        2  interrupt-controller@20200   7 Edge   f1060900.xor
 36:        0  f1010100.gpio               29 Edge   Reset
 37:        0  f1010140.gpio               17 Edge   Power
 38:      206  interrupt-controller@20200  11 Edge   eth0
 40:       59  interrupt-controller@20200   9 Edge   ahci[0000:01:00.0]
Err:        0

lspci can be seen in https://bugzilla.kernel.org/attachment.cgi?id=301263 .

There is currently no way to get port0=SATA-II working with kernels >= 3.16 that use pci-mvebu; this is 3.2.0-4-kirkwood with mach-kirkwood/pci.c (both 6121 port1=SATA-I and port0=SATA-II working):

root@wheezy:~# cat /proc/interrupts
        CPU0
  1:    2677  orion_irq  orion_tick
  5:       2  orion_irq  mv_xor.0
  6:       2  orion_irq  mv_xor.1
  7:       2  orion_irq  mv_xor.2
  8:       2  orion_irq  mv_xor.3
  9:     355  orion_irq  ahci
 11:      36  orion_irq  eth0
 19:     979  orion_irq  ehci_hcd:usb1
 21:    3809  orion_irq  sata_mv
 22:       6  orion_irq  mv_crypto
 28:      52  orion_irq  mvsdio
 33:    1840  orion_irq  serial
 46:      24  orion_irq  mv643xx_eth
 53:       0  orion_irq  rtc-mv
102:       1  -          mvsdio cd
Err:       0

lspci is identical to https://bugzilla.kernel.org/attachment.cgi?id=301113 .

There is an issue with the (newer) 88SE91xx family of Marvell SATA controllers: https://bugzilla.kernel.org/show_bug.cgi?id=42679 .
I do not know if this might be relevant for the (old) 88SE6121 as well (at least the error message "failed to IDENTIFY" is identical). Some time ago I tested this quirk (adapted for the 6121) in one of the 5.x kernels, but it did not help.

I updated my pci-mvebu branch and added definitions of INTx and summary interrupts for all platforms which use the pci-mvebu.c driver (kirkwood, dove, a370, axp, a375, a380, a385, a39x). I split the features into different commits, so that it is easy to test just the functionality which adds INTx support and the "big" summary interrupt support separately.

Hajo, could you test my branch again? And if there is some regression (e.g. a disk which works without my patches, but does not with them), could you identify the commit which broke it?

Hi Pali, this is what I've tested so far:

|     88SE6121      |
| port0=  | port1=  |
| SATA-2  | SATA-1  | Loader/Kernel
---------------------------------------------------------------------------
| works   | works   | U-Boot 2022.04 with PCI-bindings patch 20220328
| works   | works   | 3.2.0-4-kirkwood => mach-kirkwood/pci.c
| works   | works   | 3.16.0-0.bpo.4-kirkwood without DTB => kirkwood/pci.c
| fails   | works   | 3.16.0-0.bpo.4-kirkwood with DTB => mvebu-pci
| fails   | works   | 5.17.0-1-marvell Debian bullseye => mvebu-pci
| fails   | works   | 5.16.0rc1-pali-mvebu-20220222
| fails   | fails   | 5.16.0rc1-pali-mvebu-20220222 with patch DTS-301220
| fails   | fails   | 5.16.0rc1-pali-mvebu-20220627
| fails   | works   | 5.16.0rc1-pali-mvebu-20220627 without summary int
|         |         | (revert commit 304aaac07620bbedbcafd40f8de2a108ac9f3ab5)

=> With summary interrupt enabled, both SATA-1/2 HDDs do not work, there seems to be a fundamental problem (side note: with "summary int" 44, do you mean "PEX0Err" from the documentation (i.e. 44 is calculated by 32+12?)).

=> SATA-2 HDDs do not work, even with latest pci-mvebu-20220627.
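The arithmetic behind the side note (44 = 32 + 12) can be written down as a small sketch. It assumes that the main interrupt controller numbers its sources across a low and a high 32-bit cause register, so that a bit in the high register gets an ID offset by 32. The function name `kirkwood_hwirq` is hypothetical and only serves the illustration.

```python
def kirkwood_hwirq(register, bit):
    """Return the flat interrupt ID for a bit in the "low" or "high" cause register.

    Assumption: two 32-bit cause registers, low covering IDs 0..31 and
    high covering IDs 32..63.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in 0..31")
    offsets = {"low": 0, "high": 32}
    return offsets[register] + bit


print(kirkwood_hwirq("low", 9))    # PCIe INTx source, prints 9
print(kirkwood_hwirq("high", 12))  # "PEX0Err" source, prints 44
```

These two IDs are the same numbers that show up as the hardware IRQ column of the interrupt-controller@20200 rows for pcie0.0 in /proc/interrupts.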
On Friday 01 July 2022 08:00:17 bugzilla-daemon@kernel.org wrote: > => With summary interrupt enabled, both SATA-1/2 HDDs do not work, there > seems > to be a fundamental problem (side note: with "summary int" 44, do you mean > "PEX0Err" from the documentation (i.e. 44 is calculated by 32+12?)). Yes. At least on Armada 385 is PEX0Err superset of events including also PEX0INT and act as summary interrupt source. PEX0INT triggers only INTx sources. So on Armada 385 it is needed to disable PEX0INT source when PEX0Err is enabled. On Kirkwood PEX0INT is 9 in low register and PEX0Err is 12 in high register (so 32+12 is ID). So it looks like that on Kirkwood PEX0Err is not superset of PEX0INT. I will adjust patches to reflect this. So seems that when both err and intx are enabled on A385 then intx are not reported via err source. Now I updated pci-mvebu branch with new code to always use intx source for intx interrupts. Hi Pali, with latest pci-mvebu SATA-1 is working again. SATA-2 sadly still fails with "failed to IDENTIFY": | 88SE6121 | | port0= | port1= | | SATA-2 | SATA-1 | Loader/Kernel --------------------------------------------------------------------------- | fails | works | 5.16.0rc1-pali-mvebu-20220701 Disk activity on port1=SATA-1 synchronously increases both interrupt 40 (9) and 43 (value 643 in this screenshot): root@nas440:~# cat /proc/interrupts CPU0 17: 347575 bridge-interrupt-ctrl 2 Edge orion_event 25: 396 interrupt-controller@20200 29 Edge mv64xxx_i2c 26: 4465 interrupt-controller@20200 33 Edge ttyS0 28: 0 bridge-interrupt-ctrl 3 Edge f1020300.watchdog-timer 29: 0 interrupt-controller@20200 22 Edge f1030000.crypto 30: 4900 interrupt-controller@20200 19 Edge ehci_hcd:usb1 31: 2552 interrupt-controller@20200 46 Edge f1072004.mdio-bus 32: 0 interrupt-controller@20200 53 Edge f1010300.rtc 33: 3969 interrupt-controller@20200 21 Edge sata_mv[f1080000.sata] 34: 2 interrupt-controller@20200 5 Edge f1060800.xor 35: 2 interrupt-controller@20200 7 Edge 
f1060900.xor 36: 0 f1010100.gpio 29 Edge Reset 37: 0 f1010140.gpio 17 Edge Power 38: 2110 interrupt-controller@20200 11 Edge eth0 40: 643 interrupt-controller@20200 9 Edge pcie0.0 41: 0 interrupt-controller@20200 44 Edge pcie0.0 42: 0 mvebu-rp 0 Edge pciehp 43: 643 mvebu-INTx 0 Level ahci[0000:01:00.0] Err: 0 root@nas440:~# lspci -vv -nn 0001:00:01.0 PCI bridge [0604]: Marvell Technology Group Ltd. 88F6281 [Kirkwood] ARM SoC [11ab:6281] (rev 03) (prog-if 00 [Normal decode]) Device tree node: /sys/firmware/devicetree/base/mbus@f1000000/pcie@82000000/pcie@1,0 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 44 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 [...] 0001:01:00.0 IDE interface [0101]: Marvell Technology Group Ltd. 88SE6111/6121 SATA II / PATA Controller [11ab:6121] (rev b2) (prog-if 8f [PCI native mode controller, supports both channels switched to ISA compatibility mode, supports bus mastering]) Subsystem: Marvell Technology Group Ltd. 88SE6111/6121 1/2 port SATA II + 1 port PATA Controller [11ab:6121] Physical Slot: 1 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 45 Region 0: I/O ports at 10010 [size=8] [...] Side note: after unloading and re-loading the pci-mvebu module, I get a "Unable to handle kernel paging request at virtual address bf0770cc" error when calling "cat /proc/interrupts". I /think/ I never observed this problem before. > Hi Pali, with latest pci-mvebu SATA-1 is working again Perfect! 
> SATA-2 sadly still fails with "failed to IDENTIFY":

I thought that we would see some PCIe error here, but sadly the error interrupt (source id 44) was not triggered:

> 41: 0 interrupt-controller@20200 44 Edge pcie0.0

Anyway, on the Root Port you do not have the PCIe AER driver registered:

> 42: 0 mvebu-rp 0 Edge pciehp

Could you check if you have enabled AER support during kernel compilation? With patches from my pci-mvebu.c branch, AER support for mvebu should work and if there is some PCIe issue, it should be printed into dmesg.

> Disk activity on port1=SATA-1 synchronously increases both interrupt 40 (9)
> and 43

That is correct, so legacy INTx interrupts for SATA port1 are working fine.

Anyway, in the output you have:

> cat /proc/interrupts
> 42: 0 mvebu-rp 0 Edge pciehp
> 43: 643 mvebu-INTx 0 Level ahci[0000:01:00.0]

and

> lspci -vv -nn
> 0001:00:01.0 PCI bridge [0604] ...
> Interrupt: pin A routed to IRQ 44
> 0001:01:00.0 IDE interface [0101]:
> Interrupt: pin A routed to IRQ 45

These IRQ numbers do not match. Did you capture the lspci output **after** module unloading and re-loading? Also, the PCI domain number changed from 0000: to 0001:. This is IIRC a known issue which happens after module reloading. I sent a patch for this a few days ago: https://lore.kernel.org/linux-pci/20220702204737.7719-1-pali@kernel.org/

> Side note: after unloading and re-loading the pci-mvebu module, I get a
> "Unable to handle kernel paging request at virtual address bf0770cc" error
> when calling "cat /proc/interrupts".

Based on the above observation (IRQ numbers after reloading were allocated after the "gap") I think that module unloading has not released the IRQs. Could you try the patch below to see if it helps?
diff --git a/drivers/pci/controller/pci-mvebu.c b/drivers/pci/controller/pci-mvebu.c index cf0ebcac8757..fee2d40bcf08 100644 --- a/drivers/pci/controller/pci-mvebu.c +++ b/drivers/pci/controller/pci-mvebu.c @@ -2063,6 +2063,11 @@ static int mvebu_pcie_remove(struct platform_device *pdev) /* Clear all interrupt causes. */ mvebu_writel(port, ~PCIE_INT_ALL_MASK, PCIE_INT_CAUSE_OFF); + if (port->intx_irq > 0) + devm_free_irq(dev, port->intx_irq, port); + if (port->error_irq > 0) + devm_free_irq(dev, port->error_irq, port); + /* Remove IRQ domains. */ if (port->intx_irq_domain) irq_domain_remove(port->intx_irq_domain); > Could you check if you have enabled AER support during kernel compilation? > With > patches from my pci-mvebu.c branch, AER support for mvebu should work and if > there is some PCIe issue, it should be printed into dmesg. I recompiled with CONFIG_PCIEAER=y, but I do not see any AER errors in dmesg. > Anyway, in output you have: > >> cat /proc/interrupts >> 42: 0 mvebu-rp 0 Edge pciehp >> 43: 643 mvebu-INTx 0 Level ahci[0000:01:00.0] > > and > >> lspci -vv -nn >> 0001:00:01.0 PCI bridge [0604] ... >> Interrupt: pin A routed to IRQ 44 >> 0001:01:00.0 IDE interface [0101]: >> Interrupt: pin A routed to IRQ 45 > > This does not match IRQ numbers. Have you put lspci output **after** module > unloading and re-loading? > I don't remember. With the new AER-enabled kernel after a fresh boot IRQs are aligned: root@nas440:~# cat /proc/interrupts [...] 40: 59 interrupt-controller@20200 9 Edge pcie0.0 41: 0 interrupt-controller@20200 44 Edge pcie0.0 42: 0 mvebu-rp 0 Edge aerdrv, pciehp 43: 59 mvebu-INTx 0 Level ahci[0000:01:00.0] root@nas440:~# lspci -vv -nn 00:01.0 PCI bridge [0604]: Marvell Technology Group Ltd. 
88F6281 [Kirkwood] ARM SoC [11ab:6281] (rev 03) (prog-if 00 [Norm Device tree node: /sys/firmware/devicetree/base/mbus@f1000000/pcie@82000000/pcie@1,0 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 42 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0 [...] Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 4a000001 01000004 01080000 2b010000 RootCmd: CERptEn- NFERptEn- FERptEn- RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd- FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0 ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000 [...] 01:00.0 IDE interface [0101]: Marvell Technology Group Ltd. 88SE6111/6121 SATA II / PATA Controller [11ab:6121] (rev b2) ( Subsystem: Marvell Technology Group Ltd. 88SE6111/6121 1/2 port SATA II + 1 port PATA Controller [11ab:6121] Physical Slot: 1 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 32 bytes Interrupt: pin A routed to IRQ 43 [...] 
Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ AERCap: First Error Pointer: 1f, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn- MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap- HeaderLog: 00000000 00000000 00000000 00000000 Kernel driver in use: ahci Kernel modules: ahci > Also PCI domain number changed from 0000: to 0001. This is IIRC known issue > which happens after module reloading. I sent patch for this few days ago > https://lore.kernel.org/linux-pci/20220702204737.7719-1-pali@kernel.org/ I did not include this patch yet. >> Side note: after unloading and re-loading the pci-mvebu module, I get a >> "Unable to handle kernel paging request at virtual address bf0770cc" error >> when calling "cat /proc/interrupts". > > Based on the above observation (IRQ numbers after reloading were allocated > after the "gap") I think that module unloading have not released IRQs. Could > you try patch below if it helps? I included this patch (I bravely added "struct device *dev = &pdev->dev;" in function top to make it compile), but the problem still persists: after rmmod && modprobe, lspci shows IRQ44 and 45, and "cat /proc/interrupts" crashes. I wonder why the old mach-kirkwood/pcie.c, which looks like a very simple implementation, works and the current mvebu PCI driver, which seems much more sophisticated, has trouble driving SATA-2 HDDs. > and "cat /proc/interrupts" crashes Something which would be needed to debug :-( Probably related to module unloading or host bridge unbinding. 
> I wonder why the old mach-kirkwood/pcie.c, which looks like a very simple
> implementation, works and the current mvebu PCI driver, which seems much more
> sophisticated, has trouble driving SATA-2 HDDs.

I thought that there would be some PCIe related error and that AER would report it.

As a last chance, could you please provide the output of the /proc/ioports and /proc/iomem files from v3.16 with DTS (non-working) and without DTS (working)? This is to check whether there is some error in the memory mapping.

But if you have a disk which works without any issue and there is no AER error, I have a feeling that the issue is not PCIe or pci-mvebu.c related. It looks like it could be SATA/AHCI controller related. Could ATA people help here?

Created attachment 301359 [details]
iomem ioports Linux version 3.16.0-0.bpo.4-kirkwood with/without DTB
>> I wonder why the old mach-kirkwood/pcie.c, which looks like a very simple >> implementation, works and the current mvebu PCI driver, which seems much >> more >> sophisticated, has trouble driving SATA-2 HDDs. > > I thought that there is some PCIe related error and AER reports it. > > As a last chance, could you please provide output of /proc/ioports and > /proc/iomem files from v3.16 with DTS (non-working) and witout DTS (working) > versions? To compare if there is not some error in memory mapping. > Please see attachment 301359 [details]. > But if you have disk which works without any issue and there is no AER error > I > have feeling that issue is not PCIe or pci-mvebu.c related. Looks like that > it > could be sata/ahci controller related. > Well, that's absolutely possible. My conclusion is based on the fact that under kernel 3.16. the identical AHCI/ATA module is used, but depending on the (non-)inclusion of the DTB a different PCIe driver. But the problem can also be due to other parts of the kernel, or the way the AHCI driver is integrated with/without DTB. I am not able to assess these things. It looks like that PCI IO ports with DT/pci-mvebu.c version starts at 0x00010000 but in non-DT version starts at 0x00001000. Could you try following patch which could move start PCI IO address? diff --git a/drivers/pci/controller/pci-mvebu.c b/drivers/pci/controller/pci-mvebu.c index 629e9701ddf4..3269ce1daa1d 100644 --- a/drivers/pci/controller/pci-mvebu.c +++ b/drivers/pci/controller/pci-mvebu.c @@ -1937,7 +1939,7 @@ static int mvebu_pcie_parse_request_resources(struct mvebu_pcie *pcie) if (resource_size(&pcie->io) != 0) { pcie->realio.flags = pcie->io.flags; - pcie->realio.start = PCIBIOS_MIN_IO; + pcie->realio.start = 0x0; pcie->realio.end = min_t(resource_size_t, IO_SPACE_LIMIT - SZ_64K, resource_size(&pcie->io) - 1); > Could you try following patch which could move start PCI IO address?
> - pcie->realio.start = PCIBIOS_MIN_IO;
> + pcie->realio.start = 0x0;
> pcie->realio.end = min_t(resource_size_t,
Only the very first line changed:
root@nas440:~# cat /proc/ioports
00000000-000effff : PCI I/O
00010000-00010fff : PCI Bus 0000:01
00010000-0001000f : 0000:01:00.0
00010000-0001000f : ahci
00010010-00010017 : 0000:01:00.0
00010010-00010017 : ahci
00010018-0001001f : 0000:01:00.0
00010018-0001001f : ahci
00010020-00010023 : 0000:01:00.0
00010020-00010023 : ahci
00010024-00010027 : 0000:01:00.0
00010024-00010027 : ahci
The other lines stay at 0x100xx -- and SATA-2 HDDs still fail.
(kernel is 5.16.0rc1-pali-mvebu)
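To double-check a listing like the one above, the ranges can be parsed and tested against the bus window. This is a rough sketch only: it works on pasted /proc/ioports text, ignores the nesting indentation (which is lost in this report anyway), and the helper names `parse_ioports`, `range_size` and `ranges_outside` are invented for the example.

```python
import re

# A line looks like "00010010-00010017 : ahci"; the name may itself contain colons.
LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.+?)\s*$")


def parse_ioports(text):
    """Return a list of (start, end, name) tuples, with start/end as integers."""
    ranges = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            ranges.append((int(match.group(1), 16), int(match.group(2), 16),
                           match.group(3)))
    return ranges


def range_size(entry):
    """Size of a range in ports; both ends are inclusive."""
    start, end, _name = entry
    return end - start + 1


def ranges_outside(ranges, child_name, window_name):
    """Return the ranges named child_name that do not fit inside window_name."""
    windows = [r for r in ranges if r[2] == window_name]
    if not windows:
        raise ValueError("window not found: " + window_name)
    w_start, w_end, _ = windows[0]
    return [r for r in ranges
            if r[2] == child_name and not (w_start <= r[0] and r[1] <= w_end)]
```

For the listing above, `ranges_outside(parse_ioports(text), "ahci", "PCI Bus 0000:01")` should return an empty list if every ahci range sits inside the bus window, and `range_size` gives the size of each region for comparison with the "I/O ports at ... [size=...]" lines in lspci.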
> the problem still persists: after rmmod && modprobe, lspci shows IRQ44 and
> 45, and "cat /proc/interrupts" crashes.
I was able to reproduce this issue on A385 as well. It happens only sometimes, but I think I found the root cause: interrupt mappings must be disposed of before removing the domain. Could you try the following patch? It helped on A385.
diff --git a/drivers/pci/controller/pci-mvebu.c b/drivers/pci/controller/pci-mvebu.c
index 31f53a019b8f..951030052358 100644
--- a/drivers/pci/controller/pci-mvebu.c
+++ b/drivers/pci/controller/pci-mvebu.c
@@ -1713,8 +1713,15 @@ static int mvebu_pcie_remove(struct platform_device *pdev)
 		mvebu_writel(port, ~PCIE_INT_ALL_MASK, PCIE_INT_CAUSE_OFF);
 		/* Remove IRQ domains. */
-		if (port->intx_irq_domain)
+		if (port->intx_irq_domain) {
+			int virq, j;
+			for (j = 0; j < PCI_NUM_INTX; j++) {
+				virq = irq_find_mapping(port->intx_irq_domain, j);
+				if (virq > 0)
+					irq_dispose_mapping(virq);
+			}
 			irq_domain_remove(port->intx_irq_domain);
+		}
 		/* Free config space for emulated root bridge. */
 		pci_bridge_emul_cleanup(&port->bridge);
Created attachment 301380 [details]
U-Boot log with more io/region info
> I was able to reproduce this issue also on A385, happens only sometimes, but > I > think I found the root cause. Interrupt mappings must be disposed prior > removeing domain. Could you try following patch? I helped for A385. > Yes, it helps. "cat /proc/interrupts" does not crash anymore and IRQs are increasing just by 1 (before: 2, see comment 33). 40: 59 interrupt-controller@20200 9 Edge pcie0.0 41: 0 interrupt-controller@20200 44 Edge pcie0.0 42: 0 mvebu-rp 0 Edge aerdrv, pciehp 43: 59 mvebu-INTx 0 Level ahci[0000:01:00.0] /* rmmod && insmod */ 40: 118 interrupt-controller@20200 9 Edge pcie0.0 41: 0 interrupt-controller@20200 44 Edge pcie0.0 43: 0 mvebu-rp 0 Edge aerdrv, pciehp 44: 59 mvebu-INTx 0 Level ahci[0001:01:00.0] I had to manually adjust the patch, because in my source file there is this part before the call to pci_bridge_emul_cleanup: if (port->rp_irq_domain) irq_domain_remove(port->rp_irq_domain); if (port->error_irq > 0) del_timer_sync(&port->link_irq_timer); I /think/ you have to apply the same logic to rp_irq_domain to stop the IRQ increase completely. Can you please have a look at attachment 301380 [details] and possibly 301124. Within U-Boot both HDDs work and maybe there is some hint (ioport location and/or size?) which helps. > Yes, it helps. Perfect, thank you for testing! I will send patch to linux-pci ASAP. > I /think/ you have to apply the same logic to rp_irq_domain to stop the IRQ > increase completely. Now it is in my pci-mvebu branch. I have looked at io ports output but I do not see what could be wrong here. Current linux configuration seems to be OK. If PCI BARs are configured incorrectly then you would not be able to access IO or MEM of SATA controller and so no disk would work. But you have at least one working disk, so in my opinion there is some ATA/AHCI related issue, not PCIe. Anyway, cannot be this IDENTIFY problem similar to one which was observed in sata_mv? 
https://lists.denx.de/pipermail/u-boot/2022-March/479294.html
https://lists.denx.de/pipermail/u-boot/2021-August/456705.html

That IDENTIFY command needs to be called two times.

> Anyway, could this IDENTIFY problem be similar to the one that was observed
> in sata_mv?

If I understood correctly, the problem observed in sata_mv occurred primarily on cold boot. The problem described in this issue makes no difference between cold and warm boot, and even if the drive was successfully detected by U-Boot, the Linux kernel cannot subsequently detect it.

> https://lists.denx.de/pipermail/u-boot/2022-March/479294.html
> https://lists.denx.de/pipermail/u-boot/2021-August/456705.html
>
> That IDENTIFY command needs to be called two times.

Anyway, I've patched drivers/ata/libata-core.c like this:

     if (ap->ops->read_id)
         err_mask = ap->ops->read_id(dev, &tf, id);
     else
         err_mask = ata_do_dev_read_id(dev, &tf, id);
    +
    + if (err_mask) {
    +     ata_dev_warn(dev, "CHECK: read_id error, may_readagain=%d\n", may_readagain);
    +
    +     if (may_readagain) {
    +         may_readagain = 0;
    +         ata_dev_warn(dev, "CHECK: read_id retry, may_readagain=%d\n", may_readagain);
    +         ata_eh_thaw_port(ap); /* need to unfreeze port after failed cmd */
    +         goto retry;
    +     }
    + }

Which results in:

    [ 53.201652] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    [ 58.339290] ata3.00: qc timeout (cmd 0xec)
    [ 58.343493] ata3.00: CHECK: read_id error, may_readagain=1
    [ 58.348602] ata3.00: CHECK: read_id retry, may_readagain=0
    [ 68.579391] ata3.00: qc timeout (cmd 0xec)
    [ 68.583595] ata3.00: CHECK: read_id error, may_readagain=0
    [ 68.588695] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)

To my limited understanding, this sends ATA_CMD_ID_ATA (0xEC) two times, but it still fails.

I do not have more ideas. Try to forward this issue to the linux-ide@vger.kernel.org mailing list. Maybe other people have an idea whether there could be an issue in the ahci driver.
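When comparing kernels or patch variants, it can help to reduce a long dmesg to the number of IDENTIFY timeouts per ATA device. The following is only a sketch: the function name identify_timeouts is made up for illustration, and it assumes the log lines look like the "qc timeout (cmd 0xec)" lines quoted above.

```shell
# Count IDENTIFY (cmd 0xec) timeouts per ATA device in a kernel log on stdin.
# Prints one "device count" pair per line, e.g. for use as: dmesg | identify_timeouts
identify_timeouts() {
    grep -E 'ata[0-9]+\.[0-9]+: qc timeout \(cmd 0xec\)' \
        | sed -E 's/.*(ata[0-9]+\.[0-9]+): qc timeout.*/\1/' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

A saved log works as well, for example "identify_timeouts < dmesg-5.17.txt" (the file name is a placeholder).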
And when you write to the linux-ide mailing list, please CC me, so I can keep track of this issue.

Hey guys,

This is an interesting situation, but disks are complex devices and can be mean sometimes. I am glad you posted this on the mailing list, where it caught my attention. There might be many reasons for things to break, but you skipped the basic steps that would help troubleshoot the issue faster.

Both affected disks are really nasty models. Their vendor has released updates which in this case have not been applied. Your disks may be degraded and your data may be at risk.

The Barracuda 7200.12 with CC38 firmware can be updated to CC49, as can be seen here:
https://www.seagate.com/gb/en/support/kb/barracuda-720012-firmware-update-213891en

The ES.2 with SN04 firmware can be updated to SN06, as can be seen here:
https://www.seagate.com/gb/en/support/kb/firmware-update-for-st3250310ns-st3500320ns-st3750330ns-st31000340ns-207963en

I recently examined 2 failed ones (1TB) at work, and one of them had an extraordinary fault that I have never seen again on an HDD!

Please update them on an x86 PC. You can do it on the NAS too, but it is highly advised to use the vendor tools on supported platforms just to be safe. After updating them, please repeat all tests :)

If the problem remains, try to disable NCQ. After that, there is only one easy thing left to do: force the device to SATA I mode using a jumper on the affected disk(s). This has been the case for the first VIA SATA I hosts like VT8237, VT8237R, ... and there is no software workaround for that situation. More details can be seen at this page:
https://ata.wiki.kernel.org/index.php/Sata_via

Waiting for news on this!

Discussion about this issue is now on the linux-ide mailing list:
https://lore.kernel.org/linux-ide/db6b48b7-d69a-564b-24f0-75fbd6a9e543@noerenberg.de/t/#u

I do not think that a firmware upgrade or disabling NCQ has anything to do with the failed IDENTIFY command, because the simple AHCI implementation in U-Boot can detect the disk without any problems.

FYI: The problem is the pci-mvebu driver.
I have two different hw setups on Marvell Kirkwood. One with a working setup and a broken driver. I need more time and tests to investigate this.

Hans Ulli, can you explain very briefly why you think it is pci-mvebu, or in which part of the driver there might be a problem? I share this thought (mainly because it worked under ancient kernels with kirkwood-pci), but cannot verify it due to my complete lack of knowledge in this area. It would be great if this mystery were solved after years, or rather almost decades :) Let me know if I can help (by testing something).

Hi Hajo .. it is not, my bad.

I am testing v6.3 with a buildroot userspace on different platforms. For pci-mvebu on armv5 these are a Pogoplug V4 Mobile and an Iomega iConnect. I discovered some differences in the output of lspci when using uclibc-ng as libc; with glibc and musl it is OK. xhci, an external controller on PCI, is missing on the Pogoplug. This took some time to discover, after I reported this error here.

On armv7 (Linksys WRT3200ACM) this driver works too, sort of. I can use mwlwifi from https://github.com/kaloz/mwlwifi: after rebasing it for the v6.3 kernel I can load this driver and allocate IO/IRQ, but not actually activate the interface. I assume some callback is missing in the driver.
A lot of the PCI/DMA/mac80211 API changed from v5.4 to v6.3.

I have also tested with a dual network mini-PCI card. This is my output:

buildroot ~ # cat /proc/interrupts
        CPU0
 17:    8514  bridge-interrupt-ctrl        2 Edge   orion_event
 26:       2  interrupt-controller@20200   5 Edge   f1060800.xor
 27:       2  interrupt-controller@20200   7 Edge   f1060900.xor
 28:    2452  interrupt-controller@20200  33 Edge   ttyS0
 29:    1665  interrupt-controller@20200  46 Edge   f1072004.mdio-bus
 30:       0  interrupt-controller@20200  11 Edge   eth0
 31:     147  mvebu-INTx                   3 Level  eth2, eth1
 32:      30  interrupt-controller@20200  19 Edge   ehci_hcd:usb1
 33:       0  interrupt-controller@20200  53 Edge   f1010300.rtc
 34:       0  interrupt-controller@20200  29 Edge   mv64xxx_i2c
 35:       0  bridge-interrupt-ctrl        3 Edge   f1020300.watchdog-timer
 36:       0  interrupt-controller@20200  22 Edge   f1030000.crypto
 37:       0  f1010140.gpio                3 Edge   OTB Button
 38:       0  f1010100.gpio               12 Edge   Reset
Err:       0

Both eth1 and eth2 are working.

How many SATA *and* PATA ports do you have besides the two from the SoC? I counted 3, which is odd.

Hmm, some site (without a datasheet) tells me this is true. I need to get a picture and summarize your output; there is a lot of garbage in your output.

I missed something: did you compile with CONFIG_PATA_MARVELL support?

Hajo, can you please post the output of

    ls /sys/bus/pci/devices/*/

from the working and the non-working kernel version? I need only the directory entries and not the contents of every file.

Created attachment 304372 [details]
ls /sys/bus/pci/devices/*/ Linux 3.2.0-4-kirkwood: HDDs ok
Created attachment 304373 [details]
ls /sys/bus/pci/devices/*/ Linux 6.2.0-rc5: HDDs fail
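To compare listings like the two attached above without diffing them by eye, the entry names can be written out as one "device/entry" pair per line. This is a minimal sketch, not what was used for the attachments; the function name pci_sysfs_entries and the output file names are assumptions.

```shell
# List only the entry names below each PCI device directory, one
# "device/entry" per line, without reading any file contents.
# The optional argument (default /sys/bus/pci/devices) allows pointing the
# function at a saved copy of the tree.
pci_sysfs_entries() {
    base=${1:-/sys/bus/pci/devices}
    for dev in "$base"/*/; do
        [ -d "$dev" ] || continue
        name=$(basename "$dev")
        ls -1 "$dev" | sed "s|^|$name/|"
    done
}
```

Running "pci_sysfs_entries > entries-$(uname -r).txt" under each kernel and then "diff -u" on the two files shows which attributes exist under only one of the drivers.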
Hi Ulli,

the SoC has 2 SATA ports (they always work, with all HDDs and speeds).

SATA-2 and SATA-3 hard disks connected to an 88SE6121 (AHCI) controller, wired via PCIe to the 88F6281 SoC, fail to operate ("failed to IDENTIFY" ... "qc timeout") when the pci-mvebu driver (Kernel 3.16 .. 5.10 Debian) is in use.

The 88SE6121 also has a PATA port, which at least to my knowledge isn't wired on the PCB. CONFIG_PATA_MARVELL does not work:
https://marc.info/?l=linux-ide&m=167474771722812&w=2

I uploaded "/sys/bus/pci/devices/*/" as attachments to this bug as you requested.

Created attachment 306531 [details]
bootlog Linux 6.10.0-rc6
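For anyone who wants to retry the earlier suggestions (disable NCQ, limit the link to SATA I speed) without touching the drive jumper, libata has a documented boot parameter, libata.force. The sketch below is an assumption-laden example: port ID 3 is taken from "ata3" in the logs above and may differ on other boards, a software link limit is not identical to the jumper, and, as noted earlier in the thread, it is doubtful that this affects the failing IDENTIFY at all. The helper name show_libata_force is made up.

```shell
# Assumed addition to the kernel command line (e.g. via U-Boot bootargs):
#   libata.force=3:1.5Gbps,3:noncq

# Print the libata.force setting, if any, from a kernel command line.
# With no argument the running kernel's /proc/cmdline is read; the optional
# file argument only exists so a saved command line can be checked too.
show_libata_force() {
    grep -oE 'libata\.force=[^ ]+' "${1:-/proc/cmdline}"
}
```

After rebooting, "show_libata_force" confirms that the parameter actually reached the kernel, in the same way as the "grep -oE pci.* /proc/cmdline" checks earlier in this bug.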
Hi all!

Glad to hear that we have some progress on this matter! I have my ix4-200d (with OpenWRT 23.05.2) waiting to be updated... just point me to how! I don't like the idea of having two 2TB disks in trays 1 and 2 just "sleeping" without doing anything... ;-)

Thank you all for the effort!

Regards,
Gabriel from Buenos Aires, Argentina