Bug 202425 - 3w-9xxx: 3ware 9650SE-2LP RAID controller not working on AMD Ryzen system
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
Reported: 2019-01-27 17:08 UTC by robert.smith51
Modified: 2021-04-17 17:40 UTC
Kernel Version: >= 4.11
Tree: Mainline
Regression: Yes


Attachments
dmesg excerpt (2.28 KB, text/plain), 2019-01-27 17:08 UTC, robert.smith51
dmesg (121.56 KB, text/plain), 2019-02-02 04:17 UTC, robert.smith51
sudo lspci -vvv (92.12 KB, text/plain), 2019-02-02 04:18 UTC, robert.smith51
Quirk patch (501 bytes, patch), 2019-02-04 22:02 UTC, robert.smith51
PCIe config Windows 10 (61.56 KB, text/plain), 2019-02-05 02:55 UTC, robert.smith51
Windows 10 PCI configuration (decoded by lspci) (64.60 KB, text/plain), 2019-02-06 01:13 UTC, Bjorn Helgaas
dmesg pci=earlydump (128.88 KB, text/plain), 2019-02-06 17:46 UTC, robert.smith51
BIOS PCI configuration (pci=earlydump decoded by lspci) (64.47 KB, text/plain), 2019-02-06 20:16 UTC, Bjorn Helgaas
ls -vvv -t -nn for Asus Prime-X470 PRO with Ryzen 7 3700X (3.43 KB, text/plain), 2019-09-27 15:13 UTC, public-t.b
ls -vvv -nn for Asus Prime-X470 PRO with Ryzen 7 3700X (86.80 KB, text/plain), 2019-09-27 15:19 UTC, public-t.b

Description robert.smith51 2019-01-27 17:08:46 UTC
Created attachment 280811 [details]
dmesg excerpt

The 3ware 9650SE-2LP controller used to work fine on all kernel versions with my previous Intel Sandy Bridge (i7 2600k/P67) system. I replaced the mainboard and CPU with a Ryzen 7 2700x on an Asus ROG Strix X470-F mainboard while keeping the same OS installation as before. Since then, the RAID controller is no longer properly initialised and is inaccessible as a block device. Instead, the driver produces error messages during the boot process (see below).

This does not appear to be a hardware issue, because I was able to get the controller working with vanilla kernels of the 4.10 series (up to and including 4.10.17) as well as every version of the 4.9 LTS kernel that I have tested. It does not work with any kernel from 4.11 onwards, up to and including 4.20.4.
Comment 1 Bart Van Assche 2019-01-27 17:50:13 UTC
Can you bisect this issue to identify the commit that introduced this regression?
Comment 2 robert.smith51 2019-01-31 03:42:55 UTC
(In reply to Bart Van Assche from comment #1)
> Can you bisect this issue to identify the commit that introduced this
> regression?

I found it. The commit is: 60db3a4d8cc9073cf56264785197ba75ee1caca4 ("PCI: Enable PCIe Extended Tags if supported").

Reverting it solves the issue and makes the RAID controller accessible.
Comment 3 Bart Van Assche 2019-01-31 04:25:13 UTC
On 1/30/19 7:42 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202425
> 
> --- Comment #2 from robert.smith51@protonmail.com ---
> (In reply to Bart Van Assche from comment #1)
>> Can you bisect this issue to identify the commit that introduced this
>> regression?
> 
> I found it. The commit is: 60db3a4d8cc9073cf56264785197ba75ee1caca4 ("PCI:
> Enable PCIe Extended Tags if supported").
> 
> Reverting it solves the issue and makes the RAID controller accessible.

Hi Sinan,

According to Robert Smith commit 60db3a4d8cc9 ("PCI: Enable PCIe
Extended Tags if supported") (v4.11) introduced a regression. Can you
have a look at https://bugzilla.kernel.org/show_bug.cgi?id=202425 ?

Thanks,

Bart.
Comment 4 Bjorn Helgaas 2019-02-01 21:42:43 UTC
Robert, would you mind attaching the complete dmesg and "sudo lspci -vvv" output, please?  Sorry for the inconvenience, and thanks very much for doing all the work of a bisection.  I marked this as a regression and am trying to move it to the Drivers/PCI category (bugzilla isn't completely cooperating).
Comment 5 robert.smith51 2019-02-02 04:17:27 UTC
Created attachment 280923 [details]
dmesg
Comment 6 robert.smith51 2019-02-02 04:18:16 UTC
Created attachment 280925 [details]
sudo lspci -vvv
Comment 7 robert.smith51 2019-02-04 22:02:39 UTC
Created attachment 280955 [details]
Quirk patch

I have added a quirk for what I think is the AMD PCIe root port, and this solves the issue on kernel 4.20.6 and, presumably, previous versions.

Alternatively, adding a quirk for the RAID controller itself ("PCI_VENDOR_ID_3WARE, 0x1004") also works.
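
The controller-specific quirk mentioned above relies on the kernel's fixup tables: a DECLARE_PCI_FIXUP_EARLY(vendor, device, hook) entry causes the hook to run early during enumeration for every device whose IDs match. The following user-space C sketch is only a toy model of that matching, not kernel code. The struct and function names prefixed with model_ are hypothetical, PCI_ANY_ID is narrowed to 16 bits for simplicity, and the host_no_ext_tags flag is an assumption standing in for whatever state the real hook changes. In the kernel tree such an entry would be placed after the definition of the hook it names, in drivers/pci/quirks.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_VENDOR_ID_3WARE 0x13c1 /* 3ware vendor ID, as in Linux's pci_ids.h */
#define PCI_ANY_ID          0xffff /* wildcard; simplified to 16 bits in this model */

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct model_dev {
    uint16_t vendor;
    uint16_t device;
    bool host_no_ext_tags; /* assumed effect of the quirk: Extended Tags off */
};

typedef void (*model_hook)(struct model_dev *);

/* One table entry, modelling what a DECLARE_PCI_FIXUP_EARLY line registers. */
struct model_fixup {
    uint16_t vendor;
    uint16_t device;
    model_hook hook;
};

/* Models the quirk hook: record that Extended Tags must not be used. */
void model_quirk_no_ext_tags(struct model_dev *dev)
{
    dev->host_no_ext_tags = true;
}

/* The single entry corresponds to the 3ware 9650SE (device ID 0x1004). */
const struct model_fixup model_fixups[] = {
    { PCI_VENDOR_ID_3WARE, 0x1004, model_quirk_no_ext_tags },
};

/* Run every hook whose vendor/device IDs match; return how many ran. */
int model_run_fixups(struct model_dev *dev)
{
    int ran = 0;

    assert(dev != NULL);
    for (size_t i = 0; i < sizeof model_fixups / sizeof model_fixups[0]; i++) {
        const struct model_fixup *f = &model_fixups[i];

        if ((f->vendor == dev->vendor || f->vendor == PCI_ANY_ID) &&
            (f->device == dev->device || f->device == PCI_ANY_ID)) {
            f->hook(dev);
            ran++;
        }
    }
    return ran;
}
```

Calling model_run_fixups() on a device with vendor 0x13c1 and device 0x1004 runs the hook once; any other device is left untouched, which is why a controller-specific entry is the narrower of the two quirk variants.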
Comment 8 Bjorn Helgaas 2019-02-04 22:45:25 UTC
Thanks for testing a quirk!  Your patch in comment #7 does correspond to the Root Port.  The complete path (from comment #6) is:

00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge (prog-if 00 [Normal decode])
        Bus: primary=00, secondary=01, subordinate=09, sec-latency=0
        Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                DevCap: ExtTag+ 
                DevCtl: ExtTag+ 

01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43c6 (rev 01) (prog-if 00 [Normal decode])
        Bus: primary=01, secondary=02, subordinate=09, sec-latency=0
        Capabilities: [80] Express (v2) Upstream Port, MSI 00
                DevCap: ExtTag+
                DevCtl: ExtTag+ 

02:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43c7 (rev 01) (prog-if 00 [Normal decode])
        Bus: primary=02, secondary=05, subordinate=05, sec-latency=0
        Capabilities: [80] Express (v2) Downstream Port (Slot+), MSI 00
                DevCap: ExtTag+ 
                DevCtl: ExtTag+ 

05:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID PCIe (rev 01)
        Capabilities: [70] Express (v1) Legacy Endpoint, MSI 00
                DevCap: ExtTag- 
                DevCtl: ExtTag-

60db3a4d8cc9 ("PCI: Enable PCIe Extended Tags if supported") doesn't affect the 3ware controller itself, since it doesn't advertise Extended Tag support, but it would enable Extended Tags in all the upstream devices.

ExtTag+ means the function can generate 8-bit tags as a requester.  The 3ware controller claims it is not capable of generating 8-bit tags, so I suspect the problem is that 60db3a4d8cc9 enabled the Root Port to generate 8-bit tags, which means MMIO from the driver could use 8-bit tags, and the 3ware controller couldn't handle them.

It niggles at me that I would expect the 3ware controller to log some kind of malformed TLP or similar error, and I don't see anything like that.

Anyway, I suspect the problem is with the 3ware controller, not the AMD root port or switch, so I would lean toward that version of the quirk.  What do you think, Sinan?
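
As a rough illustration of the ExtTag+/ExtTag- flags in the lspci output above, the following C sketch decodes the two bits involved. The bit positions are the ones defined for the PCIe Device Capabilities and Device Control registers; the function names are made up for this example and are not kernel APIs. The tag counts assume the usual 5-bit default tag field versus the 8-bit extended one, ignoring phantom functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device Capabilities bit 5: Extended Tag Field Supported. */
#define DEVCAP_EXT_TAG 0x00000020u
/* Device Control bit 8: Extended Tag Field Enable. */
#define DEVCTL_EXT_TAG 0x0100u

static_assert(DEVCAP_EXT_TAG == (1u << 5), "DevCap ExtTag is bit 5");
static_assert(DEVCTL_EXT_TAG == (1u << 8), "DevCtl ExtTag is bit 8");

/* True if the function advertises 8-bit tags (lspci "DevCap: ExtTag+"). */
bool ext_tag_supported(uint32_t devcap)
{
    return (devcap & DEVCAP_EXT_TAG) != 0;
}

/* True if the function may generate 8-bit tags (lspci "DevCtl: ExtTag+"). */
bool ext_tag_enabled(uint16_t devctl)
{
    return (devctl & DEVCTL_EXT_TAG) != 0;
}

/* Distinct tag values a requester can use: 2^8 with Extended Tags, else 2^5. */
unsigned int max_tags(uint16_t devctl)
{
    return ext_tag_enabled(devctl) ? 256u : 32u;
}
```

In terms of the path listed above, the Root Port and both switch ports would return true from ext_tag_enabled(), while the 3ware endpoint returns false from ext_tag_supported().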
Comment 9 Sinan Kaya 2019-02-04 22:50:08 UTC
If the issue is with the 3ware controller, limiting the quirk to 3ware would be a better solution than disabling the (performance-impacting) extended tags feature altogether.
Comment 10 robert.smith51 2019-02-04 23:55:51 UTC
(In reply to Bjorn Helgaas from comment #8)

> 60db3a4d8cc9 ("PCI: Enable PCIe Extended Tags if supported") doesn't affect
> the 3ware controller itself, since it doesn't advertise Extended Tag
> support, but it would enable Extended Tags in all the upstream devices.

The RAID controller was previously working fine on an Intel mainboard with the same 4.20 kernel. That board was PCIe 2.0, but as far as I understand, Extended Tags should have been enabled with that hardware, too. That would suggest that this is either a problem with the AMD Root Port or with the combination of the Root Port and the 3ware controller.

By the way, the controller works fine under Windows 10.
Comment 11 Bjorn Helgaas 2019-02-05 00:49:10 UTC
Extended Tags would have been enabled on the Intel parts of the path if they advertised support for it, which they probably did ("lspci -vv" would show for sure).

If we could get a hex dump of config space under Windows 10, we could see whether it enables Extended Tags.  You *might* be able to get that from Device Manager.  If not, I'm pretty sure the free trial of AIDA64 (http://www.aida64.com/) can do it.

I'm a little queasy about our assumption that it's always safe to enable Extended Tags.  Maybe we're reading the spec wrong.
Comment 12 robert.smith51 2019-02-05 02:55:42 UTC
Created attachment 280959 [details]
PCIe config Windows 10

(In reply to Bjorn Helgaas from comment #11)

> If we could get a hex dump of config space under Windows 10, we could see
> whether it enables Extended Tags.  You *might* be able to get that from
> Device Manager.  If not, I'm pretty sure the free trial of AIDA64
> (http://www.aida64.com/) can do it.

Here you go.

As for the Intel parts, I do not have those anymore, so unfortunately I cannot verify whether Extended Tags were actually enabled.
Comment 13 Bjorn Helgaas 2019-02-06 01:13:46 UTC
Created attachment 281009 [details]
Windows 10 PCI configuration (decoded by lspci)

I tweaked lspci so it can decode the config space dump from AIDA64, and this is the result of running it on the Windows 10 AIDA64 dump from comment #12.

Under Windows, Extended Tags are disabled on the path to the 3ware controller:
  00:01.3 Root Port              to [bus 01-09]
  01:00.2 Switch Upstream Port   to [bus 02-09]
  02:03.0 Switch Downstream Port to [bus 05]
  05:00.0 3ware 9650SE SATA-II RAID

Extended Tags are *enabled* on the path to the NVIDIA GP104:
  00:03.1 Root Port              to [bus 0a]
  0a:00.0 NVIDIA GP104 GPU
  0a:00.1 NVIDIA GP104 Audio

I guess we can't be certain whether this configuration was done by the BIOS or by Windows.  Booting Linux with "pci=earlydump" would show how BIOS configured things.
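
For readers who want to check a raw config-space dump themselves, here is a minimal C sketch of the decoding step: it walks the standard capability list to find the PCI Express capability and then tests the Extended Tag Field Enable bit in Device Control. The register offsets follow the PCI/PCIe layout; the function names are hypothetical, and the sketch assumes the dump is a plain byte array of one function's config space (at least the first 256 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS             0x06   /* 16-bit Status register */
#define PCI_STATUS_CAP_LIST    0x10   /* Status bit 4: capability list present */
#define PCI_CAPABILITY_LIST    0x34   /* pointer to the first capability */
#define PCI_CAP_ID_EXP         0x10   /* capability ID of PCI Express */
#define PCI_EXP_DEVCTL         0x08   /* Device Control, relative to the capability */
#define PCI_EXP_DEVCTL_EXT_TAG 0x0100 /* Extended Tag Field Enable */

/* Read a little-endian 16-bit register from the dump. */
uint16_t cfg_read16(const uint8_t *cfg, size_t off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Return the offset of the PCI Express capability, or 0 if there is none. */
size_t find_pcie_cap(const uint8_t *cfg, size_t len)
{
    assert(cfg != NULL);
    if (len < 0x40 || !(cfg_read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;

    size_t pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;

    /* The ttl counter guards against a looping (corrupt) capability list. */
    for (int ttl = 48; ttl > 0 && pos >= 0x40 && pos + 1 < len; ttl--) {
        if (cfg[pos] == PCI_CAP_ID_EXP)
            return pos;
        pos = cfg[pos + 1] & 0xFC;
    }
    return 0;
}

/* 1 if Extended Tags are enabled, 0 if disabled, -1 if not a PCIe function. */
int ext_tag_state(const uint8_t *cfg, size_t len)
{
    size_t cap = find_pcie_cap(cfg, len);

    if (cap == 0 || cap + PCI_EXP_DEVCTL + 1 >= len)
        return -1;
    return (cfg_read16(cfg, cap + PCI_EXP_DEVCTL) & PCI_EXP_DEVCTL_EXT_TAG) ? 1 : 0;
}
```

Running ext_tag_state() over the dump of each hop (Root Port, Switch Upstream Port, Switch Downstream Port, endpoint) gives the same per-path view as the summary above.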
Comment 14 robert.smith51 2019-02-06 02:31:46 UTC
If I understand you correctly, whether or not the kernel has the quirk active should not matter for "pci=earlydump". Is that correct? I ask because the dmesg output from the 4.11 kernel with the bug looks broken as far as the PCIe config space is concerned.
Comment 15 Bjorn Helgaas 2019-02-06 14:08:55 UTC
"pci=earlydump" dumps config space before Linux changes anything.

The question "pci=earlydump" would answer is whether Windows did anything with Extended Tag configuration.  It could be that the BIOS enabled Extended Tags for the NVIDIA card but not for 3ware, and Windows did nothing.

On the other hand, it could be that Windows enabled Extended Tags for NVIDIA but not for 3ware.  If that's the case, it's a pretty good hint that we need to figure out the reason it left them disabled because it probably also applies to Linux.
Comment 16 Sinan Kaya 2019-02-06 16:33:37 UTC
I'm curious if we should have an option for the endpoint driver to disable extended tags.
Comment 17 Sinan Kaya 2019-02-06 16:33:59 UTC
But then it works on Intel, I keep forgetting.
Comment 18 robert.smith51 2019-02-06 17:46:44 UTC
Created attachment 281025 [details]
dmesg pci=earlydump

It seems that this could be an issue with the AMD hardware, which is supported by the fact that it does not seem to happen with Intel, although there is no guarantee that it never happens there.

Or this is an issue with the BIOS, but then one would again expect it to pop up on Intel machines, too.

A third option is that this is an issue with the 3ware controller. I can try to get my hands on some other old PCIe hardware without support for Extended Tags. If that also causes an issue, the problem does not lie with the RAID controller.

Lastly, the Linux implementation might not be correct w.r.t. the spec.

Anyway, here is the dmesg with pci=earlydump on a 4.20.6 kernel with the controller quirk active.
Comment 19 Bjorn Helgaas 2019-02-06 20:16:19 UTC
Created attachment 281027 [details]
BIOS PCI configuration (pci=earlydump decoded by lspci)

I manually edited the earlydump from comment #18 so lspci could read it (I lost interest before making lspci smart enough to read it directly), and this is the result.

The Extended Tag configuration is the same as in comment #13 (disabled for 3ware, enabled for NVIDIA), which means the BIOS apparently did that configuration and Windows didn't change anything.

I'll seek clarification from the PCI-SIG.
Comment 20 robert.smith51 2019-02-09 21:28:50 UTC
I tried to reproduce the problem on a 4.20.6 kernel without the quirk, using a PCIe USB card and an Ethernet card (TP-Link TG-3468), neither of which supports Extended Tags. Extended Tags were enabled along the path, just like with the RAID controller, but both cards worked without any problems. So, this would suggest that the problem lies at least partially with the 3ware controller.
Comment 21 robert.smith51 2019-09-16 19:51:21 UTC
It has been some time. Is there anything actually happening with this issue?
Comment 22 public-t.b 2019-09-26 20:10:20 UTC
Hi,
this issue also applies to the Asus Prime-X470 PRO with a Ryzen 7 3700X as well as a Ryzen 5 3600, using the 4.18 kernels from Debian 9 backports.
Are there currently any plans to work around or fix this issue in the kernel?
Or are there options to solve this, maybe by using kernel boot parameters to disable extended tags entirely, or at least to exclude 3ware cards from extended tags?
Kind regards.
Comment 23 public-t.b 2019-09-26 22:23:22 UTC
BTW, I'm not familiar with the details of PCIe extended tags, but I've read in https://lore.kernel.org/patchwork/patch/807638/ that PCIe 1.x devices might need to be handled differently from PCIe 2+ devices.

The 3ware 9650SE series devices are, as far as I know, specified/marketed as PCIe 1.1 devices.

This might explain why the 9750 series (PCIe 2) seems to be more compatible with the Ryzen platform: the 9750 works on the X370 Taichi with Ryzen, while the 9650 does not. (Rumors from ASRock customer support are that pre-Ryzen AM4 CPUs with embedded graphics support the 9650 on the Taichi X370 in some slots.)
Comment 24 public-t.b 2019-09-27 15:13:57 UTC
Created attachment 285213 [details]
ls -vvv -t -nn for Asus Prime-X470 PRO with Ryzen 7 3700X
Comment 25 public-t.b 2019-09-27 15:19:36 UTC
Created attachment 285215 [details]
ls -vvv -nn for Asus Prime-X470 PRO with Ryzen 7 3700X

I've attached the output of
lspci -vvv -t -nn
lspci -vvv -nn
for the Asus Prime-X470 PRO with Ryzen 7 3700X.

I hope this helps with fixing this issue on such systems as well.
# Best output I was able to get so far is:
PCI Parity Error: clearing.
PCI Abort: clearing.
Controller Queue Error: clearing.
# Not fully readable due to screen resolution, but it was something like the following:
No Valid response during init connection.
initconnection failed while checking SRL.
Compatibility check failed during reset sequence.
Comment 26 Sinan Kaya 2019-09-29 18:01:29 UTC
We think that this is an issue with AMD hardware rather than the code since the PCI spec says that it should be safe to enable extended tags in both old and new hardware.

We quirked several AMD root complexes hoping that this issue wouldn't appear on newer AMD hardware.

Both Bjorn and I tried to reach out to AMD engineers and we didn't get any response.

You might have a better relationship with them. Can you please ask them for clarification through their channels?

Code works in all other hardware except AMD so we need real proof that code is broken before reworking it.

This code won't change without an AMD response.
Comment 27 robert.smith51 2019-10-12 20:55:58 UTC
Sinan, just to be clear: Have any of the quirks actually been added to the mainline kernel? Or if not, is there a chance this might still happen?
Comment 28 Sinan Kaya 2019-10-12 21:03:02 UTC
Yes, there are a number of AMD root port quirks for extended tags in the kernel.

https://elixir.bootlin.com/linux/latest/source/drivers/pci/quirks.c#L4904

It is up to Bjorn whether he wants to accept more.
Comment 29 robert.smith51 2019-11-16 22:16:48 UTC
FYI, the quirks are not preventing my specific problem. It is still happening with the 5.3.11 kernel. I guess I'll continue to maintain my own branch with a quirk for the RAID controller.
Comment 30 Bjorn Helgaas 2019-11-18 23:29:43 UTC
I think we need to fix this in mainline.  What quirk are you maintaining, Robert?  Is it the one from comment #7?
Comment 31 robert.smith51 2019-11-25 18:38:44 UTC
I'm using the one that targets only the controller, in order to interfere with as little as possible:


> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_3WARE, 0x1004, quirk_no_ext_tags);
Comment 32 public-t.b 2020-03-07 14:26:10 UTC
The problem persists with Kernel 5.4.0.0 from Debian backports.

robert.smith51@protonmail.com:
Could you please provide some information on where to apply your fixup line?
I would prefer to use your low-impact patch rather than my own uninformed patching attempt, which got a kernel running on my systems and works after a fashion, but seems to somehow affect the entire PCIe configuration.

I hope the fix for the 3ware 9650SE gets into the mainline kernel.
Currently I'm afraid it might otherwise be a problem to install newer Linux distros on such hardware (first install an old release, then upgrade while retaining old/patched kernels for booting).
For me, upgrading the RAID controller in these Ryzen systems to the Extended Tags compatible 9750 is not an option, as the 9750 needs x8 slots and only x1 slots are left available for RAID controllers in these systems. Switching to another RAID controller brand is currently also not a real option because of the existing system landscape, including the spare parts inventory.
Comment 33 public-t.b 2020-07-30 10:57:03 UTC
(In reply to robert.smith51 from comment #31)
> I'm using the one only for the controller in order to interfere with as
> little as possible:
> 
> 
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_3WARE, 0x1004, quirk_no_ext_tags);

Do you see any progress in getting this or any other fix into the mainline kernel?
Or is there any other fix for this issue that might work somehow?
Neither old Linux kernels nor Windows 7/10 caused this issue, and a beta BIOS offering an option to disable 10-bit extended tags did not solve it either when using the affected kernels.

Kind regards
Comment 34 cts.cobra 2021-03-18 19:21:53 UTC
(In reply to robert.smith51 from comment #31)
> I'm using the one only for the controller in order to interfere with as
> little as possible:
> 
> 
> > +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_3WARE, 0x1004, quirk_no_ext_tags);

Do you have any more information on how to achieve this? I hit the same issue now with an Aorus B450 M mainboard and a Ryzen 3600 CPU with a 3ware controller. It worked for years on Intel processors (Xeon and desktop); now I have this issue on Ryzen.
Comment 35 J.Wedekind 2021-03-19 21:49:48 UTC
I also hit this issue on jan21. My system runs a 5.3.x kernel, and I was able to use the quirk patch from robert.smith51 to patch my current kernel.

It would be great if this went into the current kernel.
Comment 36 Robert Spindler 2021-03-26 12:44:47 UTC
After a "processor upgrade" on my Proxmox machine I have hit the same problem.
Can you please give me some outlook on whether the proposed fix will be implemented in the mainline kernel?
In case it will take longer or not be done at all, can you give me some guidance on how I can apply the quirk from robert.smith51 to my running system?
Is there a way to tweak the system without recompiling the kernel? I am afraid of breaking the whole system by recompiling the kernel of my Proxmox server.
Comment 37 cts.cobra 2021-04-16 00:01:42 UTC
Is anyone working on this? Can we have some feedback?
Comment 38 cts.cobra 2021-04-17 17:40:12 UTC
Well, just an update: I fixed all Ryzen issues with a few patches to the PVE kernel and the installation of zenpower for sensors.

The kernel build was difficult and failed many times, but in the end, victory.

I will update the thread with the procedure for patching the kernel later.

root@proxmara1:~# uname -a
Linux proxmara1 5.11.15-1 #1 SMP 5.11.15-1 (Fri, 16 Apr 2021 12:14:57 +0000) x86_64 GNU/Linux

The actual build needs ~40 GB of space and ~4 h on a Ryzen 3600.
