Bug 209833

Summary: [ASPM - Compex WLE900VX card] BAR error updating
Product: Drivers
Reporter: vtolkm
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: high
CC: bjorn, pali, toke
Priority: P1
Hardware: ARM
OS: Linux
URL: https://lore.kernel.org/r/20200430080625.26070-5-pali@kernel.org
Kernel Version: all up to and incl. next-20201029
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
* accumulated kernel log messages
* requested lspci output
* klog and lspci output with ASPM removed
* patch proposed on ML
* testing patch

Description vtolkm 2020-10-23 17:17:28 UTC
Created attachment 293153 [details]
accumulated kernel log messages

device: Turris Omnia (armv7l)

lspci >
00:01.0 PCI bridge: Marvell Technology Group Ltd. Device 6820 (rev 04)
00:02.0 PCI bridge: Marvell Technology Group Ltd. Device 6820 (rev 04)
00:03.0 PCI bridge: Marvell Technology Group Ltd. Device 6820 (rev 04)
02:00.0 Network controller: Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter (rev ff)
03:00.0 Network controller: Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) (rev 01)

----

Since kernel 5.5.x the following error appears:

BAR 0: error updating

It appears to be a regression from 5.4, where the error does not exhibit; see the enclosed accumulated kernel logs.

It may or may not cause issues for the Atheros drivers, as mentioned in bug 209751.

I tried to narrow it down to a particular commit in the 5.5 branch, but unfortunately the kernel fails to compile at the intermediate commits.
Comment 1 Bjorn Helgaas 2020-10-23 17:36:55 UTC
Can you attach the output of "sudo lspci -vv" and "sudo lspci -vvb", please?
Comment 2 vtolkm 2020-10-23 17:43:12 UTC
Created attachment 293155 [details]
requested lspci output

attached as requested
Comment 3 Bjorn Helgaas 2020-10-23 18:01:45 UTC
02:00.0 Network controller: Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter (rev ff) (prog-if ff)
        !!! Unknown header type 7f

Suggests the device is powered off or otherwise not responding to config accesses.
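
To illustrate why this output points at failed config accesses: a config read that gets no response normally comes back as all ones, so the header type byte reads 0xff, and masking off the multi-function flag (bit 7) leaves 0x7f, which is not a defined header type. The revision and prog-if bytes read ff for the same reason. A minimal Python sketch of that decoding, assuming the standard config header offsets; the helper name is made up for illustration:

```python
# Sketch only: decode a few bytes of a PCI config header that reads as all ones.
# Offsets follow the standard PCI config header layout; decode_header is a
# hypothetical helper, not part of lspci or the kernel.

REVISION_ID = 0x08   # revision ID byte
PROG_IF = 0x09       # programming interface byte
HEADER_TYPE = 0x0E   # header type byte (bit 7 = multi-function flag)


def decode_header(config: bytes) -> dict:
    """Return the fields lspci prints as 'rev', 'prog-if' and header type."""
    raw_type = config[HEADER_TYPE]
    return {
        "rev": config[REVISION_ID],
        "prog_if": config[PROG_IF],
        "header_type": raw_type & 0x7F,          # drop the multi-function flag
        "multi_function": bool(raw_type & 0x80),
    }


# A device that does not answer config reads yields 0xff for every byte.
dead = decode_header(bytes([0xFF] * 64))
print(f"rev {dead['rev']:02x}, prog-if {dead['prog_if']:02x}, "
      f"header type {dead['header_type']:02x}")
```

The same all-ones pattern would show up for any device that has dropped off the bus, so on its own it does not say why the device stopped responding.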

Interesting that shortly before, we tried to reconfigure ASPM clock config:

00:02.0 is the Root Port to [bus 02].

pci 0000:02:00.0: [168c:003c] type 00 class 0x028000
pci 0000:00:02.0: ASPM: current common clock configuration is inconsistent, reconfiguring

The 5.4.72 dmesg doesn't mention this ASPM reconfiguration.  Does it make any difference if you boot with "pcie_aspm=off" or rebuild with CONFIG_PCIEASPM turned off?
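
When comparing test boots, it helps to confirm that the parameter really reached the kernel. A small Python sketch that looks for "pcie_aspm=off" on the kernel command line; the function names are hypothetical, and reading /proc/cmdline assumes a Linux system with procfs mounted:

```python
# Sketch only: check whether "pcie_aspm=off" was passed on the kernel
# command line. Function names are made up for illustration.
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """True if the exact token pcie_aspm=off appears among the boot parameters."""
    return "pcie_aspm=off" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the running kernel's command line, or '' if it cannot be read."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    print("pcie_aspm=off present:", aspm_disabled_on_cmdline(read_cmdline()))
```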
Comment 4 vtolkm 2020-10-23 18:52:15 UTC
Created attachment 293157 [details]
klog and lspci output with ASPM removed

Just for clarification - the lspci output posted previously is from 5.9.1

---

>Suggests the device is powered off or otherwise not responding to config
>accesses.

It is powered on but not loading firmware, which is a separate matter that I will have to look into.

ath10k_pci 0000:02:00.0: assign IRQ: got 53
ath10k_pci 0000:02:00.0: enabling device (0140 -> 0142)
ath10k_pci 0000:02:00.0: enabling bus mastering
ath10k_pci 0000:02:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
ath10k_pci 0000:02:00.0: Failed to find firmware-N.bin (N between 2 and 6) from ath10k/QCA988X/hw2.0: -2
ath10k_pci 0000:02:00.0: could not fetch firmware files (-2)
ath10k_pci 0000:02:00.0: could not probe fw (-2)
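
For readers of the log: the trailing -2 is a negated errno value, and errno 2 is ENOENT (no such file or directory), so the driver simply did not find a firmware file. The Python sketch below decodes the value and lists the file names the message refers to. The /lib/firmware root and the newest-first ordering are assumptions made for illustration, not taken from the log:

```python
# Sketch only: decode the "-2" from the ath10k messages and list the firmware
# files they refer to. FIRMWARE_ROOT and the ordering are assumptions.
import errno
import os
from pathlib import Path

FIRMWARE_ROOT = Path("/lib/firmware")   # assumption: common firmware location
FW_DIR = "ath10k/QCA988X/hw2.0"         # directory named in the log


def describe_kernel_errno(code: int) -> str:
    """Turn a negative kernel return value such as -2 into a readable string."""
    return f"{errno.errorcode[-code]}: {os.strerror(-code)}"


def candidate_firmware_files(lo: int = 2, hi: int = 6) -> list:
    """firmware-N.bin for N between lo and hi, highest N first (assumed order)."""
    return [FIRMWARE_ROOT / FW_DIR / f"firmware-{n}.bin"
            for n in range(hi, lo - 1, -1)]


print(describe_kernel_errno(-2))
for path in candidate_firmware_files():
    print(path, "present" if path.exists() else "missing")
```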

---

> The 5.4.72 dmesg doesn't mention this ASPM reconfiguration.  Does it make any
> difference if you boot with "pcie_aspm=off" or rebuild with CONFIG_PCIEASPM
> turned off?

Recompiled 5.9.1 with 

# CONFIG_PCIEASPM is not set

and the issue no longer occurs. The klog and lspci output are enclosed.
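
To double-check which way a given build was configured, the .config text can be inspected for the symbol. A minimal Python sketch; the function name is hypothetical, and obtaining the config text (from the build tree, or from /proc/config.gz where the kernel provides it) is left to the reader:

```python
# Sketch only: look up a Kconfig symbol in the text of a kernel .config file.
# kconfig_value is a made-up helper name.
import re


def kconfig_value(config_text: str, symbol: str):
    """Return the symbol's value (e.g. 'y'), or None if it is unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {symbol} is not set":
            return None
        match = re.fullmatch(rf"{re.escape(symbol)}=(.*)", line)
        if match:
            return match.group(1)
    return None


without_aspm = "CONFIG_PCI=y\n# CONFIG_PCIEASPM is not set\n"
with_aspm = "CONFIG_PCI=y\nCONFIG_PCIEASPM=y\n"
print(kconfig_value(without_aspm, "CONFIG_PCIEASPM"))
print(kconfig_value(with_aspm, "CONFIG_PCIEASPM"))
```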

Should this bug then be closed (will not fix), since ASPM is not suitable for this device?
Comment 5 Bjorn Helgaas 2020-10-23 20:09:30 UTC
Great news that turning off CONFIG_PCIEASPM avoids the issue.  I think the common clock reconfiguration involves retraining the link, so we may be doing that wrong or not allowing enough time for the link to come back up. 
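
As a rough model of the retraining step mentioned above, the following Python sketch sets the Retrain Link bit in Link Control and then polls the Link Training bit in Link Status until it clears or a timeout expires. The two bit values match the kernel's PCI_EXP_LNKCTL_RL and PCI_EXP_LNKSTA_LT definitions; everything else (the FakeLink class, the timeout, the poll interval) is a hypothetical simulation and not the kernel's actual code:

```python
# Sketch only: simulate "retrain the link, then wait for training to finish".
# FakeLink stands in for a Root Port's registers; the timeout is arbitrary.
import time
from typing import Optional

PCI_EXP_LNKCTL_RL = 0x0020   # Link Control: Retrain Link
PCI_EXP_LNKSTA_LT = 0x0800   # Link Status: Link Training in progress


class FakeLink:
    """Hypothetical link that finishes training after a number of status reads."""

    def __init__(self, polls_until_up: Optional[int]):
        self.lnkctl = 0
        self.lnksta = 0
        self._polls_until_up = polls_until_up   # None = link never comes back
        self._remaining = None

    def write_lnkctl(self, value: int) -> None:
        self.lnkctl = value & ~PCI_EXP_LNKCTL_RL   # RL always reads back as 0
        if value & PCI_EXP_LNKCTL_RL:
            self.lnksta |= PCI_EXP_LNKSTA_LT
            self._remaining = self._polls_until_up

    def read_lnksta(self) -> int:
        if self.lnksta & PCI_EXP_LNKSTA_LT and self._remaining is not None:
            self._remaining -= 1
            if self._remaining <= 0:
                self.lnksta &= ~PCI_EXP_LNKSTA_LT
        return self.lnksta


def retrain_link(link: FakeLink, timeout_s: float = 1.0,
                 poll_interval_s: float = 0.001) -> bool:
    """Request retraining; return True if training ends before the timeout."""
    link.write_lnkctl(link.lnkctl | PCI_EXP_LNKCTL_RL)
    deadline = time.monotonic() + timeout_s
    while True:
        if not (link.read_lnksta() & PCI_EXP_LNKSTA_LT):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


print(retrain_link(FakeLink(polls_until_up=3)))
print(retrain_link(FakeLink(polls_until_up=None), timeout_s=0.05))
```

Note that a cleared training bit only says the link finished training; it does not prove that the device behind it answers config accesses again, which is the symptom seen here.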

We should not close this issue.  We should find and fix whatever is wrong with Linux ASPM support.  If it's related to a hardware defect, this may involve a runtime quirk to disable ASPM on this platform.  But the goal is that it should be safe to set CONFIG_PCIEASPM=y for all platforms.
Comment 6 Bjorn Helgaas 2020-10-27 18:05:58 UTC
If it's practical for you to bisect this to a specific commit, that would save a lot of debug effort.  If you hit a commit that doesn't build, you can avoid testing that commit (see the git-bisect man page, "Bisect skip").
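
If the bisection is ever automated with "git bisect run", the same skip idea is expressed through the script's exit code: 0 marks the commit good, 125 tells git the commit cannot be tested and should be skipped, and any other code from 1 to 127 marks it bad. A hedged Python sketch of that mapping follows. The make command is an assumption, and deciding whether the bug is present still needs a boot on the affected board, so that part is left as an input:

```python
# Sketch only: map build/test results to the exit codes "git bisect run"
# understands. The make command line is an assumption; bug detection on the
# real hardware is not implemented here.
import subprocess

GOOD, BAD, SKIP = 0, 1, 125


def bisect_exit_code(build_ok: bool, bug_present: bool) -> int:
    """Skip commits that do not build; otherwise report good or bad."""
    if not build_ok:
        return SKIP
    return BAD if bug_present else GOOD


def kernel_builds(make_cmd=("make", "-j4", "zImage")) -> bool:
    """Run the (assumed) build command and report whether it succeeded."""
    return subprocess.run(list(make_cmd)).returncode == 0


print(bisect_exit_code(build_ok=False, bug_present=False))
```

For a manual bisection, which is the more likely route when every step needs a boot on the router, "git bisect skip" at a non-building commit has the same effect.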
Comment 7 vtolkm 2020-10-28 16:51:51 UTC
Apologies, it is clear that my earlier statement

> It appears to be a regression from 5.4 where the error does not exhibit

does not hold true; in fact, ASPM never worked for the Compex WLE900VX card. Incidentally, I stumbled upon this patch discussion: https://lore.kernel.org/r/20200430080625.26070-5-pali@kernel.org

where it is mentioned

> Currently the aardvark driver trains link in PCIe gen2 mode. This may
> cause some buggy gen1 cards (such as Compex WLE900VX) to be unstable or
> even not detected. Moreover when ASPM code tries to retrain link second
> time, these cards may stop responding and link goes down. If gen1 is
> used this does not happen.
Comment 8 vtolkm 2020-10-29 16:37:58 UTC
Created attachment 293301 [details]
patch proposed on ML

For posterity: this patch has been proposed on the mailing list but did not remedy the issue. It was tested on two different Turris Omnia nodes, each with a WLE900VX card.

Another user commented on the mailing list:

>I just tried sticking an MT76-based WiFi card into the third PCI slot, and
>that doesn't come up either when I enable PCIEASPM
Comment 9 vtolkm 2020-11-03 17:26:59 UTC
Created attachment 293419 [details]
testing patch

For posterity: another (testing) patch has been posted on the mailing list. It does not remedy the issue either, but it led to the conclusion that this issue differs from the one mentioned previously for the aardvark driver.

---

One user reported:

> I have WLE200/WLE900/MT76 in those three slots, which makes slot 1 and 3
> work, while slot 2 craps out. If I remove the MT76 card (as it was
> originally), neither of slots 1 and 2 work...

---

On my node the PCI slots are occupied slightly differently:

* right slot (next to the CPU) hosts a SSD
* centre slot hosts WLE900VX
* left slot (over the SIM card slot) hosts the WLE200N2

A common denominator with the other user is that the WLE900VX sits in the centre slot. In theory that particular PCI slot could be buggy or faulty; this cannot be ruled out until the cards are tested in a different slot arrangement.

The Turris Omnia has a particular feature set for its PCI slots:

* 1 x miniPCIe (1.2) | mSATA -> slot located next to the CPU
* 1 x miniPCIe (1.2) no USB support -> centre slot
* 1 x miniPCIe (1.2) with USB (2.0) support -> slot located farthest from the CPU

---

During the mailing-list discussion it was revealed that:

>aardvark and mvebu have one very strong connection: they are the only two
>drivers making use of the PCI Bridge emulation logic in
>drivers/pci/pci-bridge-emul.c

>the PCI Bridge seen by Linux is *not* a real HW bridge. It is faked by the
>pci-bridge-emul code

---

With only two Turris Omnia users currently in the test loop, it is not possible to discern whether the issue is specific to the Turris Omnia (its PCI slot) and/or the WLE900VX card, to the Armada 385 SoC, or affects any device that uses pci-mvebu.c.
Comment 10 Pali Rohár 2021-04-17 18:18:16 UTC
This is an issue in (all?) Atheros AR9xxx and QCA9xxx chips, which fail to reset/initialize when a PCIe Hot Reset or PCIe Link Retraining is issued.

Because enabling PCIe ASPM requires PCIe Link Retraining, it triggers this issue.

Patch with quirk for these chips:
https://lore.kernel.org/linux-pci/20210326124326.21163-1-pali@kernel.org/T/#u