Bug 189361
Summary: | PCIe packet switch enumeration issue on linux kernel 4.2 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jao Ching Chen (jcchen) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | bjorn, blake.moore |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 4.2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
The file contains the complete dmesg logs from kernel 3.19 and kernel 4.2, and the output of lspci -vv.
attachment-26488-0.html
lspci output for linux kernel 4.9
dmesg output linux kernel 4.9
Test patch #1
Test patch #2
lspci -vv dump
lspci -vv of non working pcie hierarchy
lspci -vv of working pcie hierarchy (pci=pcie_scan_all is used here)
3.5.0 dmesg log (working)
3.5.0 lspci -vv log (working)
4.2.0 dmesg log (non working)
4.2.0 lspci -vv log (non working)
4.8.0-34 dmesg log (with patch applied)
4.8.0-34 lspci -vv log (patch applied)
instrument has_secondary_link
4.8.0-34.36 dmesg
4.8.0-34.36 lspci -vv |
Hello,
To add to this, the problem we are facing is that when a wireless card (WLE200NX) is connected behind a Pericom PCI-to-PCIe bridge on any Linux kernel newer than 4.1.2, the card is still listed by 'lspci' but cannot be initialized. We noticed this in 'dmesg':
-ath9k: "cant derive routing for PCI INT A"
-ath9k: "PCI INT A: no GSI - using ISA IRQ 7"
-ath: phy0: "Mac chip Rev 0xfffc0.f is not supported by this driver"
-ath: phy0: "Unable to initialize device"
-ath9k: "probe of 0000:08:00.0 failed with error -95"
After much testing, we discovered that the config registers are being altered during boot of the newer kernels. This doesn't seem to be the case in any kernel older than 4.1.2. On systems running 4.2 or newer, we viewed the config registers of the devices from the BIOS and they were what they should be; however, as soon as the OS is booted with the newer kernel, the registers are different or misconfigured somehow.
The Pericom PCI-to-PCIe bridge (PI7C9X110) is set to Reverse mode and is connected to a Swidge (PI7C9X442SL), which is then connected to the wireless card (WLE200NX).
Can you please test this with a current kernel, e.g., v4.9? v4.2 was released over a year ago, and I don't want to debug a problem that has already been fixed. If the problem still occurs with v4.9, please attach the complete dmesg log and "lspci -vv" output.
Created attachment 248731 [details]
attachment-26488-0.html
Dear customer,
Many thanks for your e-mail. I am out of the office until 2017-01-02 inclusive and will answer your e-mail as soon as possible after my return. MEN is closed for company holidays; however, in extremely urgent matters please call the technical support hotline at +49 911 99335100 or write an e-mail to support@men.de. I wish you a merry Christmas and a happy new year.
Blake Moore, Customer Support, MEN Mikro Elektronik GmbH
Hi,
I tried to update my kernel from v4.2 to v4.9, but the update failed. Could you provide a link to download the installation image for kernel v4.9?
It looks like you're running Ubuntu? Try here: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9/
Created attachment 249611 [details]
lspci output for linux kernel 4.9
LSPCI output
Created attachment 249621 [details]
dmesg output linux kernel 4.9
Hello,
I can confirm this is still happening on 4.9. dmesg shows the same errors about being unable to initialize the hardware. I have attached the dmesg and lspci outputs. Thanks!
Created attachment 249651 [details]
Test patch #1
I think this was probably broken by d0751b98dfa3 ("PCI: Add dev->has_secondary_link to track downstream PCIe links"). We made that change for an unusual PCIe topology where a Root Port is connected to a Downstream Port (in most cases a Root Port is connected to an Upstream Port). Basically it assumes a Root Port is the upstream end of a Link and uses that to anchor the hierarchy of Links below it.
The PI7C9X111SL (01:00.0) apparently has PCIe on one side and PCI/PCI-X on the other and can be configured with either as primary.
In terms of the PCIe to PCI/PCI-X Bridge Spec, r1.0, sec 1.1, if the primary interface is PCIe, this would be a "PCI Express to PCI/PCI-X Bridge", and it should include a PCI Express Capability with a Device/Port Type of "PCI Express to PCI/PCI-X Bridge".
Your system looks like this:
00:1e.0 Intel 82801 PCI Bridge to [bus 01-0a]
01:00.0 Pericom PI7C9X111SL PCIe-to-PCI Reversible Bridge to [bus 02-0a]
02:00.0 Pericom Device 8608 [PCIe Upstream Port] to [bus 03-0a]
03:01.0 Pericom Device 8608 [PCIe Downstream Port] to [bus 0a]
In this case, I think the PI7C9X111SL is configured with PCI on its primary interface (bus 01) and PCIe on its secondary (bus 02), which would make it (per sec 1.2) a "PCI/PCI-X to PCI Express Bridge" or simply a "reverse bridge". Sec A.6 says it should have a PCI Express Capability with a Device/Port Type of "PCI/PCI-X to PCI Express Bridge". It may also include (as the PI7C9X111SL does) a PCI-X Capability if it supports PCI-X on its primary interface.
I think the problem is that here there's no Root Port, and the first PCIe component we see is the Upstream Port at 02:00.0. After d0751b98dfa3, we think the secondary interface of 02:00.0 (bus 03) is a Link (but in fact it is the switch internal logic), since the upstream bridge is not PCIe and has no Link. When we scan bus 03, there's nothing at 03:00.0 and only_one_child() thinks we don't need to look for more devices, so we don't find the Downstream Ports at 03:01.0, etc. If this is what's happening, booting with "pci=pcie_scan_all" should be a workaround.
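To make the reasoning above easier to follow, here is a rough model of the decision in Python. It is only a sketch of the behavior described in this comment, not the kernel code: the `Bridge` class and the functions `infer_secondary_link` and `device_numbers_to_scan` are hypothetical names, and a `pcie_type` of `None` stands for a bridge with no visible PCIe Capability.

```python
from dataclasses import dataclass
from typing import List, Optional

ROOT_PORT = "root_port"
UPSTREAM_PORT = "upstream_port"
DOWNSTREAM_PORT = "downstream_port"


@dataclass
class Bridge:
    name: str
    pcie_type: Optional[str]          # None: no PCIe Capability visible
    parent: Optional["Bridge"] = None
    has_secondary_link: bool = False


def infer_secondary_link(bridge: Bridge) -> bool:
    """Guess whether the secondary side of a bridge is a PCIe Link."""
    if bridge.pcie_type is None:
        return False                  # not PCIe as far as we can tell
    if bridge.pcie_type == ROOT_PORT:
        return True                   # a Root Port anchors the hierarchy
    if bridge.pcie_type in (UPSTREAM_PORT, DOWNSTREAM_PORT):
        # Links and switch-internal buses alternate going downstream, so a
        # switch port is assumed to have a Link below it if its parent does not.
        parent = bridge.parent
        return parent is not None and not parent.has_secondary_link
    return False


def device_numbers_to_scan(bridge: Bridge, scan_all: bool = False) -> List[int]:
    """Device numbers probed on the bridge's secondary bus."""
    if bridge.has_secondary_link and not scan_all:
        return [0]                    # a Link normally leads only to device 0
    return list(range(32))


# Topology from this report: the reverse bridge shows no PCIe Capability.
reverse_bridge = Bridge("01:00.0", pcie_type=None)
reverse_bridge.has_secondary_link = infer_secondary_link(reverse_bridge)
upstream = Bridge("02:00.0", UPSTREAM_PORT, parent=reverse_bridge)
upstream.has_secondary_link = infer_secondary_link(upstream)

print(upstream.has_secondary_link)                            # True (the wrong guess)
print(1 in device_numbers_to_scan(upstream))                  # False: 03:01.0 is skipped
print(1 in device_numbers_to_scan(upstream, scan_all=True))   # True: the workaround
```

In this model the Upstream Port is wrongly treated as having a Link below it because its parent reports none, so only device 0 is probed on bus 03; the `scan_all` flag plays the role of "pci=pcie_scan_all".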
Per spec, I think the PI7C9X111SL should always have a PCIe Capability, with either Device Type 7 (PCIe to PCI/PCI-X Bridge) or 8 (PCI/PCI-X to PCIe Bridge), so this looks like a hardware/configuration defect to me.
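As a small aid for reading the "lspci -vv" dumps, the Device/Port Type is a 4-bit field (bits 7:4) of the PCI Express Capabilities register. The sketch below decodes it; the function name `device_port_type` and the sample register values are made up for illustration and are not taken from the attached logs.

```python
# Device/Port Type encodings from the PCI Express Capabilities register.
DEVICE_PORT_TYPES = {
    0x0: "PCI Express Endpoint",
    0x1: "Legacy PCI Express Endpoint",
    0x4: "Root Port",
    0x5: "Upstream Port",
    0x6: "Downstream Port",
    0x7: "PCI Express to PCI/PCI-X Bridge",
    0x8: "PCI/PCI-X to PCI Express Bridge",
}

TYPE_MASK = 0x00F0   # bits 7:4 of the 16-bit register
TYPE_SHIFT = 4


def device_port_type(pcie_caps_reg: int) -> str:
    """Return the Device/Port Type name for a PCI Express Capabilities value."""
    code = (pcie_caps_reg & TYPE_MASK) >> TYPE_SHIFT
    return DEVICE_PORT_TYPES.get(code, f"other/reserved (0x{code:x})")


# Hypothetical register values: capability version 2 in bits 3:0.
print(device_port_type(0x0072))   # forward bridge, type 7
print(device_port_type(0x0082))   # reverse bridge, type 8
```

A bridge whose capability list contains no PCI Express Capability at all, as in the PI7C9X111SL dump, has no such register to decode, which is why the kernel cannot tell that a Link exists below it.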
Can you try the attached patch? It's just a starting point, because other things will be broken. For example, without the PCIe Capability, we don't have any visibility or control of the upstream end of the link between 01:00.0 and 02:00.0, so ASPM, hotplug, etc. won't work.
Hi,
As we understand the design, the PI7C9X111SL has a PCIe Capability register with Device Type 7 (PCIe to PCI/PCI-X bridge) or Device Type 8 (PCI/PCI-X to PCIe bridge). May I ask how you reached the conclusion that the PI7C9X111SL does not show the correct Device Type in its PCIe Capability register?
Thanks, JC
Hello,
I can confirm that the workaround works on 4.9. The issue I am having now is applying the patch. It fails because there is no quirks.c file on the system. I have located a quirks.c online that could be added to our system, but I am unsure whether this is the correct way to go about it. I could just download the file, place it at /usr/src/linux-headers-4.9.0-040900/drivers/pci/quirks.c, and then apply the patch, but I would like to confirm this is acceptable before going forward. Thanks!
(In reply to Jao Ching Chen from comment #10)
> May I ask how you reached the conclusion that PI7C9X111SL
> does not show the correct Device Type in PCIe Capability Register?
You and Blake are apparently attaching information from different systems. The 3.19 and 4.2 lspci output that you attached in attachment 246381 [details] shows a PI7C9X111SL with no PCIe Capability. Blake's lspci output in attachment 249611 [details] shows two PI7C9X110 devices: a PCIe-to-PCI bridge at 04:00.0 leading to a PCI-to-PCIe bridge at 05:0f.0. Both of those do have a PCIe Capability.
Created attachment 249861 [details]
Test patch #2
Hi Blake, if you don't have a drivers/pci/quirks.c, there's something wrong with your tree. Maybe you don't have a full Linux kernel source tree installed.
The PI7C9X110 devices in your system do have PCIe Capabilities, so this patch (based on v4.10-rc) should be much better than the first one.
JC, I don't think this will help on your system, since your PI7C9X111SL device doesn't seem to have a PCIe Capability (at least not one that we can see). Maybe your device isn't configured correctly, or maybe we'll need some sort of quirk for your case.
Hello,
OK, I will try the new test patch. I think my other problem was that I was trying to apply the patch to a running system. I went back to a lower kernel version (4.2), downloaded 4.9, unpacked it, and was able to apply the patch the right way. Now I am compiling the 4.9 kernel to see if it works without the manually added pci=pcie_scan_all kernel parameter. Please let me know if I am goofing somewhere; I will let you know how it goes. Thanks!
Hello,
Unfortunately I have been unsuccessful. Here is what I did:
- Downloaded the kernel 4.9 tar.xz from kernel.org
- Unpacked the tarball to /usr/src
- Changed to the directory of the unpacked tarball
- Applied Test Patch 2 (renamed to test2.patch) with "patch -p1 </root/test2.patch"; it completed successfully
- Ran "make menuconfig" and, using the defaults, saved the .config
- Cleaned the source tree with "make-kpkg clean"
- Compiled the kernel with "fakeroot make-kpkg --initrd kernel_image"
- When this finished, installed the image with "dpkg -i /name/of/.deb/file"
- Rebooted
The problem is that during boot it drops to initramfs and locks up. Just before that, it shows an error regarding "pkcs#7 signature not signed with a trusted key". I am not sure what I am doing wrong, but I will try again and see if I am missing a step somewhere. The good news is that the workaround works. Thanks!
Sorry, I don't know the details of building a Debian kernel. Maybe there's a hint somewhere in the Debian kernel web. Maybe a module signature problem?
Hello,
I was attempting this on 15.10. I'm going to install a newer Ubuntu version and then attempt the kernel compile again.
Hello Bjorn,
I was finally able to add your patch to the 4.8 kernel and recompile it. Now the bridge is working as expected and the wireless device is working as well. Given this news, will it be possible to see this patch added to future kernel releases? Thanks!
Hi Blake, thanks a lot for testing this!
I think the patch (comment #13) is pretty clearly correct, so I'll package it up and merge it for either v4.10 or v4.11. Based on the lspci output from Jao Ching (https://bugzilla.kernel.org/attachment.cgi?id=246381), it will *not* solve the issue with the PI7C9X111SL.
I posted the proposed patch.
JC, I'm stuck on the problem with your PI7C9X111SL. If your device that lacks the PCIe Capability is just a lab prototype or something that is incorrectly configured, maybe we don't need to do anything in the kernel, and you can use "pci=pcie_scan_all" as an internal workaround. But if we're going to find that device in the field without a PCIe Capability, we probably should do a quirk or something to work around the problem. Can you help out with this? I can't do anything more without more information about that device and why it doesn't have the PCIe Capability.
Hi Bjorn,
By default, the PI7C9X111SL has a PCIe capability. However, the PCIe capability can be removed via an external EEPROM. Previously, on some platforms, we encountered an issue where the system hangs when it finds that a PI7C9X111SL reverse bridge is plugged into a PCI root complex. To work around this issue, we suggested that customers remove the PCIe capability via the external EEPROM.
It seems that you only added the Vendor/Device ID for the PI7C9X111SL part in the patch. We actually have two other parts that are similar to the PI7C9X111SL: the PI7C9X130 and the PI7C9X110. Could you add their IDs to the patch and update the latest Linux kernel? Their IDs are 12d8/e130 and 12d8/e110.
By the way, would you consider making "pci=pcie_scan_all" the default kernel setting? If so, the patch should not be necessary.
Thanks, JC
Hello Bjorn,
I have noticed something. I apply your patch to the necessary *.c file, then build the kernel, and everything works for a time. Then, after an unspecified time, when I boot into the kernel that was created, the patch no longer works and I have to use the kernel parameter again. I have no clue what is causing this, but this is the second time it has happened. The patch works and then, randomly, it doesn't anymore, as if the patch was never applied. Have you experienced anything like this before? Thanks!
(In reply to Jao Ching Chen from comment #21)
> By default, PI7C9X111SL has PCIe capability. However, the PCIe capability
> can be removed by external eeprom. Before, in some platforms, we encountered
> one issue - the system will hang when it finds that PI7C9X111SL reverse
> bridge is plugged into a PCI root complex. In order to workaround this
> issue, we suggest customers to remove PCIe capability by external eeprom.
I'm dubious about this workaround. Removing the PCIe capability is a pretty heavy-handed fix. What sort of hang did you encounter? I assume it's something to do with Linux accessing the PCIe capability. I think it would be better to root-cause that issue and keep the PCIe capability.
I don't feel good about adding a quirk to work around a workaround for the hang. Without the actual PCIe capability, too many other things will be broken. The quirk merely sets dev->has_secondary_link. But now we have a new case ("non-PCIe device with dev->has_secondary_link") we didn't have before, and that adds the potential for bugs.
> It seems that you only add Vendor/Device ID for PI7C9X111SL part in the
> patch. Actually, we have another two parts which are similar to
> PI7C9X111SL. They are PI7C9X130 and PI7C9X110. Could you help to add their
> IDs to the patch and update the latest linux kernel? Their IDs are 12d8/e130
> and 12d8/e110.
>
> By the way, will you consider to put "pci=pcie_scan_all" as default kernel
> setting? If so, the patch should be not necessary.
We'd have to research why we don't do pcie_scan_all in the first place. I assume the current code has some benefit, and I don't want to change that just to work around an issue on these particular parts.
(In reply to Blake Moore from comment #22)
> I have noticed something. I would apply your patch to the necessary *.c
> file, then build the kernel, and everything will work for a time. Then after
> an unspecified time frame, when I boot into the kernel that was created, the
> patch no longer works and I have to use the kernel parameter again. I have no
> clue what is causing this, but this is the 2nd time its happened now. The
> patch works and then randomly, it doesnt anymore, as if the patch was never
> applied.
>
> Have you experienced anything like this before?
No, I have not. My guess is that either you're accidentally booting a different kernel that doesn't contain the patch, or there's something wrong with the PI7C9X110 reset path such that it no longer advertises the PCIe capability. You should be able to look for the latter via "lspci -vv".
Created attachment 252591 [details]
lspci -vv dump
Hello Bjorn,
I have added the attachment containing the lspci -vv dump. It looks like you might be right about the 9x110 losing its PCIe capability, as I no longer see it advertised. Do you have any idea what may cause this capability to randomly stop presenting itself? I could attempt another recompile with the patch and carefully monitor any changes to see what might be causing this. Thanks
Don't bother recompiling yet; I don't think that's where the problem is. Linux retains no state between boots, so if it looks different after a reboot, it's probably the result of something in hardware or firmware.
Can you please attach the complete dmesg and "lspci -vv" output for a boot where it works and also for a boot where it fails (four files total)? I assume these are all on the same machine, so no hardware should be changing.
Created attachment 252631 [details]
lspci -vv of non working pcie hierarchy
Created attachment 252641 [details]
lspci -vv of working pcie hierarchy (pci=pcie_scan_all is used here)
Hello Bjorn,
I have added the two dump files. On the working one, I did use pci=pcie_scan_all. Please let me know if anything else is needed. Thanks very much!
Per your comment #18, I assumed that with the comment #13 patch, your system was working without "pci=pcie_scan_all". The question is what changes to make the system not work. So I'd like to see:
- The complete dmesg log of a boot where it works (without using "pci=pcie_scan_all"),
- The complete "lspci -vv" output from that boot,
- The complete dmesg log of a non-working boot (again, without using "pci=pcie_scan_all"),
- The complete "lspci -vv" output from this boot.
The lspci outputs you attached in comments #28 and #29 were captured differently (one has CR/LF line endings and space indentation, the other has LF line endings and tab indentation). Please capture both the same way.
Created attachment 252841 [details]
3.5.0 dmesg log (working)
Created attachment 252851 [details]
3.5.0 lspci -vv log (working)
Created attachment 252861 [details]
4.2.0 dmesg log (non working)
Created attachment 252871 [details]
4.2.0 lspci -vv log (non working)
Hello Bjorn,
I have attached the text files you requested: two log files from 3.5 and two from 4.2. Neither kernel has had the patch applied. I also have log files from 4.8, where the patch was applied but later stopped working, if you need those as well. I appreciate all the assistance! Blake
The patch is already merged upstream and appeared in v4.10-rc5: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=51ebfc92b72b
That seems pretty obviously correct to me, so the question now is why your system works with the patch sometimes but not always. If you could attach your logs from v4.8 with the patch applied, that would be great.
Created attachment 252911 [details]
4.8.0-34 dmesg log (with patch applied)
Created attachment 252921 [details]
4.8.0-34 lspci -vv log (patch applied)
Hello Bjorn,
I have attached logs for 4.8.0-34 where the patch was applied and working when the kernel was compiled and installed. Thanks!
Created attachment 252931 [details]
instrument has_secondary_link
The dmesg in comment #38 still shows a problem. Maybe the patch was not actually applied. Please make sure it is applied (the one from http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=51ebfc92b72b), and also apply this debug patch.
Here's what's wrong in the comment #38 dmesg. We have this host bridge:
ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-3e])
pci_bus 0000:00: root bus resource [mem 0x7da00000-0xfeafffff window]
pci_bus 0000:00: root bus resource [bus 00-3e]
And we found these devices:
00:1c.6 [8086:1e1c] Intel PCIe Root Port bridge to [bus 04-0a]
04:00.0 [12d8:e110] PI7C9X110 PCIe to PCI/PCI-X bridge to [bus 05-0a]
05:0d.0 [12d8:e110] PI7C9X110 PCI/PCI-X to PCIe bridge to [bus 06-0a]
06:00.0 [12d8:400a] PI7C9X442SL Upstream Port bridge to [bus 07-0a]
08:00.0 [168c:002a] Qualcomm Atheros AR928X NIC
Note that we did not find a bridge from bus 07 to bus 08. Based on https://bugzilla.kernel.org/attachment.cgi?id=252841, I think that bridge should be this one:
07:01.0: [12d8:400a] PI7C9X442SL Downstream Port bridge to [bus 08]
In comment #38 we didn't enumerate it, probably because there's no device 0, and only_one_child() thinks we don't need to look for any other device numbers.
We do find the Atheros device later by brute-force scanning all bus numbers, but that breaks resource assignment, so the device doesn't work:
pci 0000:08:00.0: reg 0x10: [mem 0x90500000-0x9050ffff 64bit]
pci 0000:08:00.0: can't claim BAR 0 [mem 0x90500000-0x9050ffff 64bit]: address conflict with PCI Bus 0000:00 [mem 0x7da00000-0xfeafffff window]
pci 0000:08:00.0: BAR 0: assigned [mem 0x104000000-0x10400ffff 64bit]
ath9k 0000:08:00.0: Failed to initialize device
Please apply both patches and attach the dmesg log.
Created attachment 253201 [details]
4.8.0-34.36 dmesg
Created attachment 253211 [details]
4.8.0-34.36 lspci -vv
Hello Bjorn,
I have attached log files from the newly compiled 4.8.0-34.36 kernel with both patches applied (the debug patch you attached and the patch from http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=51ebfc92b72b). Currently the module is working and is detecting wireless signals. Thanks!
The comment #42 dmesg log looks fine. It doesn't have any of the issues I mentioned in comment #41. I can't explain the difference between the logs from comment #38 and comment #42, unless comment #38 really didn't have the patch applied or there's some hardware issue that changed the PCIe capability.
I'm going to close this because comment #42 appears to show the problem is solved and we don't have enough information to indicate another kernel problem. If you see a problem with a v4.10-rc5 or newer kernel, please open a new problem report and attach a complete dmesg log and "lspci -vv" output.
As mentioned above, the fix we know about is already merged upstream and appeared in v4.10-rc5: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=51ebfc92b72b
Hello Bjorn,
No problem! Your assistance is greatly appreciated! |
Created attachment 246381 [details]
The file contains the complete dmesg logs from kernel 3.19 and kernel 4.2, and the output of lspci -vv.
We have two products: a PCI-to-PCIe reverse bridge (9x111, ID 12d8:e111) and a PCIe packet switch (2G808, ID 12d8:8608, with one upstream port and seven downstream ports). When I plugged the 9x111, the 2G808, and a PCIe Wi-Fi card into a system running Linux kernel 3.19, the Wi-Fi card worked fine. However, with the same hardware setup and the kernel upgraded to 4.2, the Wi-Fi card does not work. After checking, I found that the seven downstream ports have disappeared and the Wi-Fi card's resources are cleared (config offset 10h is cleared). Without resources, the Wi-Fi card cannot work normally.