Bug 202425
| Summary: | 3w-9xxx: 3ware 9650SE-2LP RAID controller not working on AMD Ryzen system | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | robert.smith51 |
| Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
| Status: | NEW --- | | |
| Severity: | normal | CC: | bjorn, bvanassche, cts.cobra, kernel.org, kernel, okaya, public-t.b, robert.spindler |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | >= 3.11 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Attachments:
dmesg excerpt
dmesg
sudo lspci -vvv
Quirk patch
PCIe config Windows 10
Windows 10 PCI configuration (decoded by lspci)
dmesg pci=earlydump
BIOS PCI configuration (pci=earlydump decoded by lspci)
ls -vvv -t -nn for Asus Prime-X470 PRO with Ryzen 7 3700X
ls -vvv -nn for Asus Prime-X470 PRO with Ryzen 7 3700X
Can you bisect this issue to identify the commit that introduced this regression?

(In reply to Bart Van Assche from comment #1)
> Can you bisect this issue to identify the commit that introduced this
> regression?

I found it. The commit is: 60db3a4d8cc9073cf56264785197ba75ee1caca4 ("PCI: Enable PCIe Extended Tags if supported"). Reverting it solves the issue and makes the RAID controller accessible.

On 1/30/19 7:42 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202425
>
> --- Comment #2 from robert.smith51@protonmail.com ---
> (In reply to Bart Van Assche from comment #1)
>> Can you bisect this issue to identify the commit that introduced this
>> regression?
>
> I found it. The commit is: 60db3a4d8cc9073cf56264785197ba75ee1caca4 ("PCI:
> Enable PCIe Extended Tags if supported").
>
> Reverting it solves the issue and makes the RAID controller accessible.

Hi Sinan,

According to Robert Smith commit 60db3a4d8cc9 ("PCI: Enable PCIe Extended Tags if supported") (v4.11) introduced a regression. Can you have a look at https://bugzilla.kernel.org/show_bug.cgi?id=202425 ?

Thanks,
Bart.

Robert, would you mind attaching the complete dmesg and "sudo lspci -vvv" output, please? Sorry for the inconvenience, and thanks very much for doing all the work of a bisection.

I marked this as a regression and am trying to move it to the Drivers/PCI category (bugzilla isn't completely cooperating).

Created attachment 280923 [details]
dmesg
Created attachment 280925 [details]
sudo lspci -vvv
Created attachment 280955 [details]
Quirk patch
I have added a quirk for what I think is the AMD PCIe root port, and this solves the issue on kernel 4.20.6 and, presumably, previous versions.
Alternatively, adding a quirk for the RAID controller itself ("PCI_VENDOR_ID_3WARE, 0x1004") also works.
Thanks for testing a quirk! Your patch in comment #7 does correspond to the Root Port. The complete path (from comment #6) is:

00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
    Bus: primary=00, secondary=01, subordinate=09, sec-latency=0
    Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
        DevCap: ExtTag+
        DevCtl: ExtTag+
01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43c6 (rev 01) (prog-if 00 [Normal decode])
    Bus: primary=01, secondary=02, subordinate=09, sec-latency=0
    Capabilities: [80] Express (v2) Upstream Port, MSI 00
        DevCap: ExtTag+
        DevCtl: ExtTag+
02:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43c7 (rev 01) (prog-if 00 [Normal decode])
    Bus: primary=02, secondary=05, subordinate=05, sec-latency=0
    Capabilities: [80] Express (v2) Downstream Port (Slot+), MSI 00
        DevCap: ExtTag+
        DevCtl: ExtTag+
05:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID PCIe (rev 01)
    Capabilities: [70] Express (v1) Legacy Endpoint, MSI 00
        DevCap: ExtTag-
        DevCtl: ExtTag-

60db3a4d8cc9 ("PCI: Enable PCIe Extended Tags if supported") doesn't affect the 3ware controller itself, since it doesn't advertise Extended Tag support, but it would enable Extended Tags in all the upstream devices.

ExtTag+ means the function can generate 8-bit tags as a requester. The 3ware controller claims it is not capable of generating 8-bit tags, so I suspect the problem is that 60db3a4d8cc9 enabled the Root Port to generate 8-bit tags, which means MMIO from the driver could use 8-bit tags, and the 3ware controller couldn't handle them.

It niggles at me that I would expect the 3ware controller to log some kind of malformed TLP or similar error, and I don't see anything like that.

Anyway, I suspect the problem is with the 3ware controller, not the AMD root port or switch, so I would lean toward that version of the quirk. What do you think, Sinan?
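For anyone who wants to repeat the check in the comment above on their own machine, the ExtTag flags can be pulled out of a saved "lspci -vvv" dump mechanically. The following is only a minimal sketch: the function name `ext_tag_status` and the `SAMPLE` text are illustrative assumptions, and the sample is an abbreviated, hand-written excerpt modelled on the path quoted above, not the attached dump. It assumes that ExtTag+/ExtTag- appears either on the DevCap:/DevCtl: line or on one of its continuation lines.

```python
import re

# Device header, e.g. "05:00.0 RAID bus controller: ..." (optional domain prefix).
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# Indented register label, e.g. "        DevCap: ..." or "    Capabilities: ...".
LABEL_RE = re.compile(r"^\s+(\w+):")
EXT_TAG_RE = re.compile(r"\bExtTag([+-])")


def ext_tag_status(lspci_text):
    """Map each device address to {"DevCap": bool, "DevCtl": bool} for ExtTag."""
    status = {}
    device = None
    label = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            device = header.group(1)
            status[device] = {}
            label = None
            continue
        if device is None:
            continue
        found_label = LABEL_RE.match(line)
        if found_label:
            # Remember the most recent label so continuation lines are attributed to it.
            label = found_label.group(1)
        flag = EXT_TAG_RE.search(line)
        if flag and label in ("DevCap", "DevCtl"):
            status[device][label] = flag.group(1) == "+"
    return status


# Hand-written, abbreviated sample (assumption: not the real attachment).
SAMPLE = "\n".join([
    "00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] PCIe GPP Bridge",
    "    Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00",
    "        DevCap: ExtTag+",
    "        DevCtl: ExtTag+",
    "05:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID PCIe (rev 01)",
    "    Capabilities: [70] Express (v1) Legacy Endpoint, MSI 00",
    "        DevCap: ExtTag-",
    "        DevCtl: ExtTag-",
])

if __name__ == "__main__":
    for address, flags in ext_tag_status(SAMPLE).items():
        print(address, flags)
```

A device whose DevCap entry is False while an upstream port's DevCtl entry is True matches the situation described in this comment: the endpoint does not advertise Extended Tags, but the ports above it have them enabled.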
If the issue is with the 3ware controller, limiting the quirk to 3ware would be a better solution than disabling the Extended Tags feature altogether, which would impact performance.

(In reply to Bjorn Helgaas from comment #8)
> 60db3a4d8cc9 ("PCI: Enable PCIe Extended Tags if supported") doesn't affect
> the 3ware controller itself, since it doesn't advertise Extended Tag
> support, but it would enable Extended Tags in all the upstream devices.

The RAID controller was previously working fine on an Intel mainboard with the same 4.20 kernel. That board was PCIe 2.0, but as far as I understand, Extended Tags should have been activated with that hardware, too. That would suggest that this is either a problem with the AMD Root Port or with the combination of the Root Port and the 3ware controller. By the way, the controller works fine under Windows 10.

Extended Tags would have been enabled on the Intel parts of the path if they advertised support for it, which they probably did ("lspci -vv" would show for sure).

If we could get a hex dump of config space under Windows 10, we could see whether it enables Extended Tags. You *might* be able to get that from Device Manager. If not, I'm pretty sure the free trial of AIDA64 (http://www.aida64.com/) can do it.

I'm a little queasy about our assumption that it's always safe to enable Extended Tags. Maybe we're reading the spec wrong.

Created attachment 280959 [details]
PCIe config Windows 10

(In reply to Bjorn Helgaas from comment #11)
> If we could get a hex dump of config space under Windows 10, we could see
> whether it enables Extended Tags. You *might* be able to get that from
> Device Manager. If not, I'm pretty sure the free trial of AIDA64
> (http://www.aida64.com/) can do it.

Here you go. As for the Intel parts, I do not have those anymore, so unfortunately I cannot verify whether Extended Tags were actually enabled.
Created attachment 281009 [details]
Windows 10 PCI configuration (decoded by lspci)

I tweaked lspci so it can decode the config space dump from AIDA64, and this is the result of running it on the Windows 10 AIDA64 dump from comment #12.

Under Windows, Extended Tags are disabled on the path to the 3ware controller:

    00:01.3 Root Port to [bus 01-09]
    01:00.2 Switch Upstream Port to [bus 02-09]
    02:03.0 Switch Downstream Port to [bus 05]
    05:00.0 3ware 9650SE SATA-II RAID

Extended Tags are *enabled* on the path to the NVIDIA GP104:

    00:03.1 Root Port to [bus 0a]
    0a:00.0 NVIDIA GP104 GPU
    0a:00.1 NVIDIA GP104 Audio

I guess we can't be certain whether this configuration was done by the BIOS or by Windows. Booting Linux with "pci=earlydump" would show how BIOS configured things.

If I understand you correctly, whether or not the kernel has the quirk active should not matter for the "pci=earlydump". Is that correct? Because the dmesg output from the 4.11 kernel with the bug looks broken as far as the PCIe config space is concerned.

"pci=earlydump" dumps config space before Linux changes anything. The question "pci=earlydump" would answer is whether Windows did anything with Extended Tag configuration. It could be that the BIOS enabled Extended Tags for the NVIDIA card but not for 3ware, and Windows did nothing. On the other hand, it could be that Windows enabled Extended Tags for NVIDIA but not for 3ware. If that's the case, it's a pretty good hint that we need to figure out the reason it left them disabled because it probably also applies to Linux.

I'm curious if we should have an option for the endpoint driver to disable extended tags. But then it works on intel, I keep forgetting.

Created attachment 281025 [details]
dmesg pci=earlydump
This could be an issue with the AMD hardware, which is supported by the fact that it does not seem to happen with Intel, although there is no guarantee that it never does.
Or this is an issue with the BIOS, but then one would again expect it to pop up on Intel machines, too.
A third option is that this is an issue with the 3ware controller. I can try to get my hands on some other old PCIe hardware without support for Extended Tags. If that also causes an issue, the problem does not lie with the RAID controller.
Lastly, the Linux implementation might not be correct w.r.t. the spec.
Anyway, here is the dmesg with pci=earlydump on a 4.20.6 kernel with the controller quirk active.
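The two Extended Tag bits can also be read straight from raw config-space bytes, such as those recovered from a "pci=earlydump" boot, by walking the standard PCI capability list. The sketch below is illustrative only. It assumes the 256 bytes of a function's config space are already available as a Python bytes object (parsing the hex text out of dmesg is not shown), and it uses synthetic buffers built by a hypothetical helper, `make_config`, rather than data from the attachment. The capability offsets 0x58 and 0x70 are borrowed from the lspci excerpt in comment #8; the register offsets and masks follow the PCI Express definitions and are named after the corresponding Linux `pci_regs.h` constants.

```python
import struct

PCI_STATUS = 0x06               # 16-bit Status register
PCI_STATUS_CAP_LIST = 0x10      # Status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34      # pointer to the first capability
PCI_CAP_ID_EXP = 0x10           # PCI Express capability ID
PCI_EXP_DEVCAP = 0x04           # Device Capabilities, 32 bits
PCI_EXP_DEVCAP_EXT_TAG = 0x20   # bit 5: Extended Tag Field Supported
PCI_EXP_DEVCTL = 0x08           # Device Control, 16 bits
PCI_EXP_DEVCTL_EXT_TAG = 0x0100 # bit 8: Extended Tag Field Enable


def find_capability(cfg, cap_id):
    """Return the offset of capability cap_id in cfg, or None if absent."""
    (status,) = struct.unpack_from("<H", cfg, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    ptr = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a corrupted, looping list
    while ptr and ptr not in seen and ptr + 1 < len(cfg):
        seen.add(ptr)
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xFC
    return None


def ext_tag_bits(cfg):
    """Return (supported, enabled) for Extended Tags, or None without a PCIe capability."""
    cap = find_capability(cfg, PCI_CAP_ID_EXP)
    if cap is None:
        return None
    (devcap,) = struct.unpack_from("<I", cfg, cap + PCI_EXP_DEVCAP)
    (devctl,) = struct.unpack_from("<H", cfg, cap + PCI_EXP_DEVCTL)
    return (bool(devcap & PCI_EXP_DEVCAP_EXT_TAG),
            bool(devctl & PCI_EXP_DEVCTL_EXT_TAG))


def make_config(cap_offset, devcap, devctl):
    """Hypothetical helper: build a synthetic 256-byte config space for the demo."""
    cfg = bytearray(256)
    struct.pack_into("<H", cfg, PCI_STATUS, PCI_STATUS_CAP_LIST)
    cfg[PCI_CAPABILITY_LIST] = cap_offset
    cfg[cap_offset] = PCI_CAP_ID_EXP
    cfg[cap_offset + 1] = 0  # end of capability list
    struct.pack_into("<I", cfg, cap_offset + PCI_EXP_DEVCAP, devcap)
    struct.pack_into("<H", cfg, cap_offset + PCI_EXP_DEVCTL, devctl)
    return bytes(cfg)


if __name__ == "__main__":
    # Synthetic stand-ins: a port with ExtTag supported and enabled,
    # and an endpoint that neither supports nor enables it.
    print(ext_tag_bits(make_config(0x58, PCI_EXP_DEVCAP_EXT_TAG, PCI_EXP_DEVCTL_EXT_TAG)))
    print(ext_tag_bits(make_config(0x70, 0, 0)))
```

Running the same decoding over every function in a dump from before Linux touches the hardware would show which paths the firmware left with Extended Tags disabled.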
Created attachment 281027 [details]
BIOS PCI configuration (pci=earlydump decoded by lspci)

I manually edited the earlydump from comment #18 so lspci could read it (I lost interest before making lspci smart enough to read it directly), and this is the result. The Extended Tag configuration is the same as in comment #13 (disabled for 3ware, enabled for NVIDIA), which means the BIOS apparently did that configuration and Windows didn't change anything. I'll seek clarification from the PCI-SIG.

I tried verifying the problem on a 4.20.6 kernel without the quirk with one PCIe USB card and an Ethernet card (TP-Link TG-3468), neither of which supports Extended Tags. Extended Tags were active along the path, just like with the RAID controller, but both cards worked fine without any problems. So, this would suggest that the problem lies at least partially with the 3ware controller.

It has been some time. Is there anything actually happening with this issue?

Hi, this issue also applies to the Asus Prime-X470 PRO with Ryzen 7 3700X as well as Ryzen 5 3600, with the 4.18 kernels from Debian 9 backports. Are there currently any plans to work around or fix this issue in the kernel? Or are there options to solve this, maybe by using kernel boot parameters to disable extended tags entirely or at least to exclude 3ware cards from extended tags? Kind regards.

BTW, I'm not familiar with the details of PCIe extended tags, but I've read in https://lore.kernel.org/patchwork/patch/807638/ that PCIe 1.x devices might need to be handled differently from PCIe 2+ devices. The 3ware devices of the 9650SE series are, as far as I know, specified/marketed as PCIe 1.1 devices. This might explain why devices of the 9750 series (PCIe 2) seem to be more compatible with the Ryzen platform: the 9750 works on the X370 Taichi with Ryzen, while the 9650 does not (rumors from Asrock customer support are that pre-Ryzen AM4 CPUs with embedded graphics support the 9650 on the Taichi X370 in some slots).
Created attachment 285213 [details]
ls -vvv -t -nn for Asus Prime-X470 PRO with Ryzen 7 3700X
Created attachment 285215 [details]
ls -vvv -nn for Asus Prime-X470 PRO with Ryzen 7 3700X
I've attached output of
ls -vvv -t -nn
ls -vvv -nn
for Asus Prime-X470 PRO with Ryzen 7 3700X
I hope this helps to fix this issue on such systems as well.
# Best output I was able to get so far is:
PCI Parity Error: clearing.
PCI Abort: clearing.
Controller Queue Error: clearing.
# Not fully readable due to the screen resolution, but it was something like the following:
No Valid response during init connection.
initconnection failed while checking SRL.
Compatibility check failed during reset sequence.
We think that this is an issue with AMD hardware rather than the code, since the PCI spec says that it should be safe to enable extended tags on both old and new hardware. We quirked several AMD root complexes, hoping that this issue wouldn't appear on newer AMD hardware. Both Bjorn and I tried to reach out to AMD engineers and we didn't get any response. You might have a better relationship with them. Can you please ask them for clarification through their channels? The code works on all other hardware except AMD, so we need real proof that the code is broken before reworking it. This code won't change without an AMD response.

Sinan, just to be clear: Have any of the quirks actually been added to the mainline kernel? Or if not, is there a chance this might still happen?

Yes, there are a number of AMD root port quirks for extended tags in the kernel: https://elixir.bootlin.com/linux/latest/source/drivers/pci/quirks.c#L4904 It is up to Bjorn whether he wants to accept more.

FYI, the quirks are not preventing my specific problem. It is still happening with the 5.3.11 kernel. I guess I'll continue to maintain my own branch with a quirk for the RAID controller.

I think we need to fix this in mainline. What quirk are you maintaining, Robert? Is it the one from comment #7?

I'm using the one only for the controller in order to interfere with as little as possible:
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_3WARE, 0x1004, quirk_no_ext_tags);
The problem persists with Kernel 5.4.0.0 from Debian backports.

robert.smith51@protonmail.com: Could you please provide some information on where to apply your fixup line? I would prefer to use your low-impact patch rather than my own uninformed patching attempt, which got a kernel running on my systems and works after a fashion but seems to somehow affect the entire PCIe configuration.

I hope the fix for the 3ware 9650SE gets into the mainline kernel. Otherwise, I'm afraid it might become a problem to install newer Linux distros on such hardware (first install an old release, then upgrade while retaining old/patched kernels for booting). For me, upgrading the RAID controller in these Ryzen systems to the extended-tags-compatible 9750 is not an option, as the 9750 needs x8 slots and there are only x1 slots left available for RAID controllers in the systems. Switching to other RAID controller brands is currently also not a real option due to the system landscape, including spare parts storage.

(In reply to robert.smith51 from comment #31)
> I'm using the one only for the controller in order to interfere with as
> little as possible:
>
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_3WARE, 0x1004, quirk_no_ext_tags);

Do you see any progress in getting this or any other fix into the mainline kernel? Or is there any other fix for this issue that might work somehow? Neither old Linux kernels nor Windows 7/10 caused this issue, and a beta BIOS offering to disable ten-bit extended tags did not solve it either when using the affected kernels. Kind regards

(In reply to robert.smith51 from comment #31)
> I'm using the one only for the controller in order to interfere with as
> little as possible:
>
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_3WARE, 0x1004, quirk_no_ext_tags);

Do you have any more information on how to achieve this? I have now hit the same issue with an Aorus B450 M mainboard and a Ryzen 3600 CPU with a 3ware controller. It worked for years on Intel processors (Xeon and desktop); now I have this issue on Ryzen.
I also hit this issue on jan21. My system runs a 5.3.x kernel, and I could use the quirk patch from robert.smith51 to patch my current kernel. It would be great if this went into the current kernel.

After a "processor upgrade" on my Proxmox machine I have hit the same problem. Can you please give me some outlook on whether the proposed fix will be implemented in the mainline kernel? In case it will take longer or not be done at all, can you give me some guidance on how I can implement the quirk from robert.smith51 on my running system? Is there a way to tweak the system without recompiling the kernel? I fear breaking the whole system by recompiling the kernel of my Proxmox server.

Is anyone working on this? Can we have some feedback?

Well, just an update that I fixed all Ryzen issues with a few patches inside the pve kernel and the installation of zenpower for sensors. The kernel build was difficult and failed many times, but in the end, victory. I will update the thread with the procedure for patching the kernel later.

root@proxmara1:~# uname -a
Linux proxmara1 5.11.15-1 #1 SMP 5.11.15-1 (Fri, 16 Apr 2021 12:14:57 +0000) x86_64 GNU/Linux

The actual build needs ~40GB of space and ~4h on a Ryzen 3600.

Just ran into the same bug on 5.10.77. Is there any update on getting the patch mainlined?

(In reply to cts.cobra from comment #38)
> Well just an update that i fixed all ryzen issues with (few patches inside
> pve kernel) and installation of zenpower for sensors.

Could you share your patches, or are they already in the mainline kernel? I think not, as I just tried 6.7.4 and it still failed. After I patched the kernel source with the

> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_3WARE, 0x1004, quirk_no_ext_tags);

line, it works as expected.

Can somebody please post the patch to the mailing lists (linux-pci@vger.kernel.org and cc: the 3ware driver maintainers) so we can try to move this forward? See the hints at https://www.kernel.org/doc/html/latest/process/submitting-patches.html for how to do this.

I posted the patch to the mailing list with a cc to aradford@gmail.com, which I found in the /drivers/scsi/3w-9xxx.[hc] file.
Created attachment 280811 [details]
dmesg excerpt

The 3ware 9650SE-2LP controller used to work fine on all kernel versions with my previous Intel Sandy Bridge (i7 2600k/P67) system. I replaced the mainboard and CPU with a Ryzen 7 2700x on an Asus ROG Strix X470-F mainboard while keeping the same OS installation as before. Afterwards, the RAID controller was no longer properly initialised and was inaccessible as a block device. Instead, the driver produces error messages during the boot process (see below).

This does not appear to be a hardware issue, because I was able to get the controller working with vanilla kernels of the 4.10 series (up to and including 4.10.17) as well as all of the versions of the 4.9 LTS kernel that I have tested. It does not work with any 4.11 kernels or later, up to and including 4.20.4.