Bug 209833
Summary: | [ASPM - Compex WLE900VX card] BAR error updating | ||
---|---|---|---|
Product: | Drivers | Reporter: | vtolkm |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | NEW --- | ||
Severity: | high | CC: | bjorn, pali, toke |
Priority: | P1 | ||
Hardware: | ARM | ||
OS: | Linux | ||
URL: | https://lore.kernel.org/r/20200430080625.26070-5-pali@kernel.org | ||
Kernel Version: | all up to and incl. next-20201029 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
* accumulated kernel log messages
* requested lspci output
* klog and lspci output with ASPM removed
* patch proposed on ML
* testing patch |
Description
vtolkm
2020-10-23 17:17:28 UTC
Can you attach the output of "sudo lspci -vv" and "sudo lspci -vvb", please?

Created attachment 293155 [details]
requested lspci output

attached as requested
  02:00.0 Network controller: Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter (rev ff) (prog-if ff)
          !!! Unknown header type 7f

Suggests the device is powered off or otherwise not responding to config accesses.

Interesting that shortly before, we tried to reconfigure ASPM clock config:
00:02.0 is the Root Port to [bus 02].

  pci 0000:02:00.0: [168c:003c] type 00 class 0x028000
  pci 0000:00:02.0: ASPM: current common clock configuration is inconsistent, reconfiguring

The 5.4.72 dmesg doesn't mention this ASPM reconfiguration. Does it make any difference if you boot with "pcie_aspm=off" or rebuild with CONFIG_PCIEASPM turned off?

Created attachment 293157 [details]
klog and lspci output with ASPM removed

Just for clarification: the lspci output posted previously is from 5.9.1.

---
>Suggests the device is powered off or otherwise not responding to config
>accesses.

It is powered on but not loading firmware, which is a separate matter that I will have to look into.

  ath10k_pci 0000:02:00.0: assign IRQ: got 53
  ath10k_pci 0000:02:00.0: enabling device (0140 -> 0142)
  ath10k_pci 0000:02:00.0: enabling bus mastering
  ath10k_pci 0000:02:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
  ath10k_pci 0000:02:00.0: Failed to find firmware-N.bin (N between 2 and 6) from ath10k/QCA988X/hw2.0: -2
  ath10k_pci 0000:02:00.0: could not fetch firmware files (-2)
  ath10k_pci 0000:02:00.0: could not probe fw (-2)

---
> The 5.4.72 dmesg doesn't mention this ASPM reconfiguration. Does it make any
> difference if you boot with "pcie_aspm=off" or rebuild with CONFIG_PCIEASPM
> turned off?

Recompiled 5.9.1 with

  # CONFIG_PCIEASPM is not set

and the issue no longer appears. Enclosed klog and lspci output.

Shall this bug then be closed (will not fix), with ASPM considered unsuitable for the device?

Great news that turning off CONFIG_PCIEASPM avoids the issue.
I think the common clock reconfiguration involves retraining the link, so we may be doing that wrong or not allowing enough time for the link to come back up.

We should not close this issue. We should find and fix whatever is wrong with Linux ASPM support. If it's related to a hardware defect, this may involve a runtime quirk to disable ASPM on this platform. But the goal is that it should be safe to set CONFIG_PCIEASPM=y for all platforms.

If it's practical for you to bisect this to a specific commit, that would save a lot of debug effort. If you hit a commit that doesn't build, you can avoid testing that commit (see the git-bisect man page, "Bisect skip").

Apologies, it is clear that my earlier statement

> It appears to be a regression from 5.4 where the error does not exhibit

does not hold true; in fact ASPM never worked for the Compex WLE900VX card.

Incidentally, I stumbled upon this patch discussion

https://lore.kernel.org/r/20200430080625.26070-5-pali@kernel.org

where it is mentioned

> Currently the aardvark driver trains link in PCIe gen2 mode. This may cause
> some buggy gen1 cards (such as Compex WLE900VX) to be unstable or even not
> detected. Moreover when ASPM code tries to retrain link second time, these
> cards may stop responding and link goes down. If gen1 is used this does not
> happen.

Created attachment 293301 [details]
patch proposed on ML

For posterity: this patch was proposed on the mailing list but did not remedy the issue. It was tested on two different (Turris Omnia) nodes, each with a WLE900VX card.

Another user commented on the mailing list

>I just tried sticking an MT76-based WiFi card into the third PCI slot, and
>that doesn't come up either when I enable PCIEASPM

Created attachment 293419 [details]
testing patch

For posterity: another (testing) patch was introduced on the mailing list. It does not remedy the issue either, but it led to the conclusion that the issue is dissimilar to the one mentioned previously for the aardvark driver.
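The lspci symptom quoted earlier in this report ("rev ff", "prog-if ff", "Unknown header type 7f") is what one would expect when config reads come back as all ones, which is how a device that does not answer config accesses typically shows up. As a minimal sketch, and not part of any tool mentioned in this report, the following Python helper scans saved "lspci -vv" output for such devices. The function name is made up for illustration, and the sample text is partly invented (the healthy 01:00.0 entry) and partly modelled on the output quoted above.

```python
import re

# Start of an lspci device block, e.g. "02:00.0 Network controller: ..."
# (optionally with a PCI domain prefix, "0000:02:00.0"). Detail lines inside
# a block are indented, so they never match this pattern.
DEVICE_LINE = re.compile(
    r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$"
)


def find_unresponsive_devices(lspci_text):
    """Return addresses of devices whose block looks like all-ones config reads.

    Hypothetical helper: a device is flagged if its header line carries
    "(rev ff)" or its block contains "Unknown header type 7f".
    """
    suspects = []
    current = None
    for line in lspci_text.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            current = match.group(1)
            flagged = "(rev ff)" in match.group(2)
        else:
            flagged = current is not None and "Unknown header type 7f" in line
        if flagged and current not in suspects:
            suspects.append(current)
    return suspects


# Illustrative input only: 01:00.0 is an invented healthy device, 02:00.0
# follows the lines quoted in this report.
sample = (
    "01:00.0 Example device: made-up healthy entry (rev 01)\n"
    "\tControl: I/O- Mem+ BusMaster+\n"
    "\n"
    "02:00.0 Network controller: Qualcomm Atheros QCA986x/988x 802.11ac "
    "Wireless Network Adapter (rev ff) (prog-if ff)\n"
    "\t!!! Unknown header type 7f\n"
)

print(find_unresponsive_devices(sample))  # ['02:00.0']
```

A check like this only reports what lspci already printed; it says nothing about why the device stopped responding.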
---
One user reported:

> I have WLE200/WLE900/MT76 in those three slots, which makes slot 1 and 3
> work, while slot 2 craps out. If I remove the MT76 card (as it was
> originally), neither of slots 1 and 2 work...

---
On my node the PCI slots are occupied slightly differently:

* right slot (next to the CPU) hosts an SSD
* centre slot hosts the WLE900VX
* left slot (over the SIM card slot) hosts the WLE200N2

A common denominator with the other user is the deployment of the WLE900VX in the centre slot. Theoretically that particular PCI slot could therefore be buggy/faulty; this cannot be ruled out until the PCI devices are tested in a different slot arrangement.

The Turris Omnia has a particular feature set for its PCI slots:

* 1 x miniPCIe (1.2) | mSATA -> slot located next to the CPU
* 1 x miniPCIe (1.2) no USB support -> centre slot
* 1 x miniPCIe (1.2) with USB (2.0) support -> slot located farthest from the CPU

---
During the mailing-list discussion it was revealed that:

>aardvark and mvebu have one very strong connection: they are the only two
>drivers making use of the PCI Bridge emulation logic in
>drivers/pci/pci-bridge-emul.c

>the PCI Bridge seen by Linux is *not* a real HW bridge. It is faked by the
>pci-bridge-emul code

---
With only two Turris Omnia users currently in the test loop, it is not possible to discern whether the issue is particular to the Turris Omnia (PCI slot) and/or the WLE900VX card, to the Armada 385 SoC, or even to any other device that uses pci-mvebu.c.

This is an issue in (all?) Atheros AR9xxx and QCA9xxx chips, which fail to reset/initialize when a PCIe Hot Reset or PCIe Link Retraining is issued. As enabling PCIe ASPM requires PCIe Link Retraining, it triggers this issue. Patch with a quirk for these chips:

https://lore.kernel.org/linux-pci/20210326124326.21163-1-pali@kernel.org/T/#u
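
To screen a boot log for the combination described in this report (an Atheros device on the bus, plus a Root Port on which the ASPM code reported a common clock reconfiguration), something along the following lines may help. This is a rough sketch in Python that assumes kernel log lines in the format quoted earlier in this report. The function name is hypothetical, and matching on the vendor ID 168c alone is an over-approximation: the exact list of affected device IDs is in the linked patch and is not reproduced here.

```python
import re

# Enumeration line, e.g. "pci 0000:02:00.0: [168c:003c] type 00 class 0x028000"
ENUM_LINE = re.compile(r"\bpci (\S+): \[([0-9a-f]{4}):([0-9a-f]{4})\] type")
# ASPM line, e.g. "pci 0000:00:02.0: ASPM: current common clock configuration
# is inconsistent, reconfiguring"
ASPM_LINE = re.compile(
    r"\bpci (\S+): ASPM: current common clock configuration is inconsistent"
)

# PCI vendor ID seen for the card in this report; used here as a coarse filter
# (assumption: vendor match only, no per-device list).
ATHEROS_VENDOR_ID = "168c"


def screen_kernel_log(log_text):
    """Return (atheros_devices, reconfigured_ports) found in a kernel log.

    atheros_devices is a list of (address, "vendor:device") tuples,
    reconfigured_ports a list of addresses that logged the ASPM message.
    """
    atheros_devices = []
    reconfigured_ports = []
    for line in log_text.splitlines():
        enum = ENUM_LINE.search(line)
        if enum:
            if enum.group(2) == ATHEROS_VENDOR_ID:
                ids = enum.group(2) + ":" + enum.group(3)
                atheros_devices.append((enum.group(1), ids))
            continue
        aspm = ASPM_LINE.search(line)
        if aspm:
            reconfigured_ports.append(aspm.group(1))
    return atheros_devices, reconfigured_ports


# Log lines taken from the excerpts quoted in this report.
sample_log = (
    "pci 0000:02:00.0: [168c:003c] type 00 class 0x028000\n"
    "pci 0000:00:02.0: ASPM: current common clock configuration is "
    "inconsistent, reconfiguring\n"
    "ath10k_pci 0000:02:00.0: enabling device (0140 -> 0142)\n"
)

devices, ports = screen_kernel_log(sample_log)
print(devices)  # [('0000:02:00.0', '168c:003c')]
print(ports)    # ['0000:00:02.0']
```

The sketch does not check that the reconfigured port is actually the parent of the Atheros device; that relationship still has to be read from lspci or the bus numbers in the log.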