Bug 216094 - pci-mvebu: SATA HDDs via 88SE6121 AHCI fail with Marvell 88F6281 PCIe
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: ARM Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL: https://github.com/hn/seagate-blackar...
Keywords:
Depends on:
Blocks:
 
Reported: 2022-06-07 07:29 UTC by Hajo Noerenberg
Modified: 2023-06-05 07:37 UTC

See Also:
Kernel Version: 3.16 ... 5.10
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci Linux version 3.2.0-4-kirkwood mach-kirkwood/pci.c: HDDs ok (4.54 KB, text/plain)
2022-06-07 07:29 UTC, Hajo Noerenberg
lspci Linux version 3.16.0-0.bpo.4-kirkwood - with DTB -> mvebu-pci: HDDs fail (3.15 KB, text/plain)
2022-06-07 07:30 UTC, Hajo Noerenberg
lspci Linux version 3.16.0-0.bpo.4-kirkwood - without DTB -> mach-kirkwood/pci.c: HDDs ok (4.89 KB, text/plain)
2022-06-07 07:30 UTC, Hajo Noerenberg
lspci Linux version 5.10.0-11-marvell (Debian bullseye) mvebu-pci: HDDs fail (5.61 KB, text/plain)
2022-06-07 07:31 UTC, Hajo Noerenberg
lspci Linux version 5.17.0-1-marvell mvebu-pci: HDDs fail (4.35 KB, text/plain)
2022-06-07 12:08 UTC, Hajo Noerenberg
dmesg Linux version 3.2.0-4-kirkwood - kw pci (21.12 KB, text/plain)
2022-06-08 07:53 UTC, Hajo Noerenberg
dmesg Linux version 3.16.0-0.bpo.4-kirkwood - kw pci (13.80 KB, text/plain)
2022-06-08 07:54 UTC, Hajo Noerenberg
dmesg Linux version 5.17.0-1-marvell - mvebu pci (29.03 KB, text/plain)
2022-06-08 07:54 UTC, Hajo Noerenberg
U-Boot log with pci/ahci #define DEBUG 1 (15.10 KB, text/plain)
2022-06-08 07:56 UTC, Hajo Noerenberg
dmesg Linux version 5.17.6pci-marvell - pci debug, mvebu-pci module (26.29 KB, text/plain)
2022-06-13 08:08 UTC, Hajo Noerenberg
Kirkwoord DTS PCIe interrupt patch (3.43 KB, patch)
2022-06-19 08:25 UTC, Pali Rohár
dmesg Linux 5.16.0rc1-palimvebu with patch 301220 (36.14 KB, text/plain)
2022-06-20 08:43 UTC, Hajo Noerenberg
lspci Linux 5.16.0rc1-palimvebu with patch 301220 (5.74 KB, text/plain)
2022-06-20 08:44 UTC, Hajo Noerenberg
lspci Linux 5.16.0rc1-palimvebu without patch 301220 (5.69 KB, text/plain)
2022-06-23 13:22 UTC, Hajo Noerenberg
iomem ioports Linux version 3.16.0-0.bpo.4-kirkwood with/without DTB (2.97 KB, text/plain)
2022-07-07 12:46 UTC, Hajo Noerenberg
U-Boot log with more io/region info (3.68 KB, text/plain)
2022-07-09 15:46 UTC, Hajo Noerenberg
ls /sys/bus/pci/devices/*/ Linux 3.2.0-4-kirkwood: HDDs ok (4.13 KB, text/plain)
2023-06-05 07:28 UTC, Hajo Noerenberg
ls /sys/bus/pci/devices/*/ Linux 6.2.0-rc5: HDDs fail (5.77 KB, text/plain)
2023-06-05 07:29 UTC, Hajo Noerenberg

Description Hajo Noerenberg 2022-06-07 07:29:03 UTC
I would like to continue the SATA-related topic started with Pali Rohár on the U-Boot mailing list [1]. I have analysed the issue further and come to the conclusion that it is related to the PCIe subsystem:

SATA-2 and SATA-3 hard disks connected to a 88SE6121 (AHCI) controller, wired via PCIe to the 88F6281 SoC fail to operate ("failed to IDENTIFY" ... "qc timeout") when the pci-mvebu driver (Kernel 3.16 .. 5.10 Debian) is in use.

More details:

- The problem does not occur with the old mach-kirkwood/pcie.c driver (2.6 up to 3.16 kernels booted without DTB): all SATA-2/3 hard disks work correctly. A 3.16 kernel is especially useful for comparison, because the ATA/AHCI drivers stay identical while either PCIe driver can be selected: without DTB -> mach-kirkwood -> SATA-2/3 HDDs work; with DTB -> mach-mvebu -> HDDs fail.

- The problem is specific to SATA-2/3 HDDs. Very old SATA-1-only HDDs work without problems. This might be related to the available data lanes, DMA or other bandwidth-related things -- I can only guess. Interestingly it does not help to limit SATA speed (libata.force=1.5G ...) with SATA-2/3 HDDs, only 'pure' SATA-1 HDDs work with pci-mvebu.

- The problem was identified with the Seagate Blackarmor NAS440 hardware. Forum posts show that other users experience similar problems with the (very similar) Iomega ix4-200d NAS [2].

- Within patched U-Boot [3] all HDDs (SATA-1/2/3) always work. The same holds for the 88F6281 SoC onboard SATA ports (sata_mv, not connected via PCIe).

- The mach-kirkwood driver operates the 6281 as class "Host bridge [0600]" with Cap "Express (v1) Root Port", the mach-mvebu driver as class "PCI bridge [0604]" with "Express (v2) Root Port" [4][5][6][7]. Notably the v1/v2, cache line size 32/64 or the missing interrupt route might be a key difference.

From the sources I see that all PCI drivers (mach-kirkwood, mach-mvebu and U-Boot) do various unconventional 'magic' things (rewriting the PCI class of the root complex, changing capabilities, host emulation and so on). This is the point where I currently get lost and ask for your help.

Kind regards,
Hajo

[1] https://lists.denx.de/pipermail/u-boot/2022-March/479197.html
[2] https://forum.doozan.com/read.php?2,94079,95519#msg-95519
[3] https://lists.denx.de/pipermail/u-boot/2022-March/479227.html
[4] lspci Linux version 3.2.0-4-kirkwood mach-kirkwood/pci.c: HDDs ok
[5] lspci Linux version 3.16.0-0.bpo.4-kirkwood - with DTB -> mvebu-pci: HDDs fail
[6] lspci Linux version 3.16.0-0.bpo.4-kirkwood - without DTB -> mach-kirkwood/pci.c: HDDs ok
[7] lspci Linux version 5.10.0-11-marvell (Debian bullseye) mvebu-pci: HDDs fail
Comment 1 Hajo Noerenberg 2022-06-07 07:29:30 UTC
Created attachment 301113 [details]
lspci Linux version 3.2.0-4-kirkwood mach-kirkwood/pci.c: HDDs ok
Comment 2 Hajo Noerenberg 2022-06-07 07:30:03 UTC
Created attachment 301114 [details]
lspci Linux version 3.16.0-0.bpo.4-kirkwood - with DTB -> mvebu-pci: HDDs fail
Comment 3 Hajo Noerenberg 2022-06-07 07:30:27 UTC
Created attachment 301115 [details]
lspci Linux version 3.16.0-0.bpo.4-kirkwood - without DTB -> mach-kirkwood/pci.c: HDDs ok
Comment 4 Hajo Noerenberg 2022-06-07 07:31:09 UTC
Created attachment 301116 [details]
lspci Linux version 5.10.0-11-marvell (Debian bullseye) mvebu-pci: HDDs fail
Comment 5 Hajo Noerenberg 2022-06-07 12:08:42 UTC
Created attachment 301117 [details]
lspci Linux version 5.17.0-1-marvell mvebu-pci: HDDs fail
Comment 6 Hajo Noerenberg 2022-06-07 12:12:04 UTC
I added lspci from a more recent 5.17.0 kernel. The 6281 is detected as a v1 root port (first time, yeah!), but the HDDs continue to fail
Comment 7 Hajo Noerenberg 2022-06-08 07:53:39 UTC
Created attachment 301121 [details]
dmesg Linux version 3.2.0-4-kirkwood - kw pci
Comment 8 Hajo Noerenberg 2022-06-08 07:54:22 UTC
Created attachment 301122 [details]
dmesg Linux version 3.16.0-0.bpo.4-kirkwood - kw pci
Comment 9 Hajo Noerenberg 2022-06-08 07:54:57 UTC
Created attachment 301123 [details]
dmesg Linux version 5.17.0-1-marvell - mvebu pci
Comment 10 Hajo Noerenberg 2022-06-08 07:56:34 UTC
Created attachment 301124 [details]
U-Boot log with pci/ahci #define DEBUG 1
Comment 11 Hajo Noerenberg 2022-06-08 08:39:31 UTC
Just to clarify the dmesg output:
6121-port0 is a SATA-II ST3500418AS (except with U-Boot I attached a 3TB EFRX)
6121-port1 is a SATA-I ST3250310NS (always works)
6121-port2 is (unused) PATA
SoC-port0 is a WD3202ABYS (always works)
SoC-port1 is a ST500NM0011 (always works)

(Current setup, during the investigation I think I have had every tangible HDD on every available port :-) )

Please let me know if it would help to patch the mvebu-pci module in any way for debug purposes.

For Pali:

PCI/ASPM with 5.17.0:

root@nas440:~# grep -oE pci.* /proc/cmdline 
pci=nomsi
-> HDDs still fail

root@nas440:~# grep -oE pci.* /proc/cmdline 
pcie_aspm=off
-> HDDs still fail
Comment 12 Reimar D 2022-06-08 08:56:17 UTC
I would suggest setting the flag that disables the READ_LOG_EXT command for that controller.
I suspect it should be set for all PATA controllers.
I only have experience with an old PPC macMini, and that ATA command causes hangs for it. However, in that case it's fine in the end since it hits a timeout, says "interrupt lost" and continues, so it only slows down boot by 10 seconds or so.
Comment 13 Reimar D 2022-06-08 08:59:03 UTC
Hmm, I misread and this is a proper SATA controller, so it would be weird for it to fail for this command. Still, might be one of the easier things to try anyway...
Comment 14 Hajo Noerenberg 2022-06-08 10:14:24 UTC
The DTB file can be found at my project page: https://github.com/hn/seagate-blackarmor-nas (kirkwood-blackarmor-nas440.dtb, source linux-nas440.diff or u-boot-2022.04-nas440.diff)
Comment 15 Hajo Noerenberg 2022-06-13 08:08:36 UTC
Created attachment 301162 [details]
dmesg Linux version 5.17.6pci-marvell - pci debug, mvebu-pci module

dmesg log with CONFIG_PCI_DEBUG=y and mvebu-pci as a module (CONFIG_PCI_MVEBU=m).

I wonder if the missing IRQ could be significant. With Kirkwood-PCI (HDDs working) the bridge had an IRQ assigned, with pci-mvebu not ("pcieport 0000:00:01.0: assign IRQ: got 0").
Comment 16 Pali Rohár 2022-06-13 08:23:02 UTC
On Monday 13 June 2022 08:08:36 bugzilla-daemon@kernel.org wrote:
> I wonder if the missing IRQ could be significant. With Kirkwood-PCI (HDDs
> working) the bridge had an IRQ assigned, with pci-mvebu not ("pcieport
> 0000:00:01.0: assign IRQ: got 0").

Hello! The mvebu PCIe Root Port does not provide interrupt support because
it is not implemented in the mainline kernel (yet).

Patches for this support are prepared in branch pci-mvebu of my git
repo: https://git.kernel.org/pub/scm/linux/kernel/git/pali/linux.git/

But I do not think this is the root cause.
Comment 17 Hajo Noerenberg 2022-06-17 08:08:27 UTC
Hi Pali, I compiled and started your pci-mvebu branch, unfortunately it didn't change anything (HDDs still fail).

For the INTx interrupts and other things it would probably be necessary to change the dts(i) files, i.e. port your changes from armada-*.dtsi to kirkwood-6281.dtsi. But I don't know (yet) exactly how to do that and if it would be worth the effort.
Comment 18 Pali Rohár 2022-06-19 08:25:09 UTC
Created attachment 301220 [details]
Kirkwoord DTS PCIe interrupt patch

I looked into the Kirkwood documentation and it seems that the SoC PCIe INTx interrupt is 9 and the SoC PCIe summary interrupt is 44. Attached is a patch for the Kirkwood DTS files that defines them.
Comment 19 Hajo Noerenberg 2022-06-20 08:43:35 UTC
Created attachment 301225 [details]
dmesg Linux 5.16.0rc1-palimvebu with patch 301220

Hi Pali, I applied your patch (attachment 301220 [details]).

The good: something changed.
The bad: now both HDDs (6121 port 0 and port1) do NOT work anymore (even the SATA-1 hard disk, which worked correctly before, does not work anymore).

This may be a slight indication that the error of the HDDs has something to do with the interrupt handling after all. But this is just a guess from me.
Comment 20 Hajo Noerenberg 2022-06-20 08:44:12 UTC
Created attachment 301226 [details]
lspci Linux 5.16.0rc1-palimvebu with patch 301220
Comment 21 Pali Rohár 2022-06-23 11:32:58 UTC
On Monday 20 June 2022 08:43:35 bugzilla-daemon@kernel.org wrote:
> Hi Pali, I applied your patch (attachment 301220 [details]).
> 
> The good: something changed.
> The bad: now both HDDs (6121 port 0 and port1) do NOT work anymore (even the
> SATA-1 hard disk, which worked correctly before, does not work anymore).

That patch only changes PCIe DTS nodes and does not touch on-board
Marvell disk controller. So it should not have any effect on that second
HDD which is not connected via PCIe.

> This may be a slight indication that the error of the HDDs has something to
> do with the interrupt handling after all. But this is just a guess from me.

Can you check in /proc/interrupts that PCIe interrupt counts are
increasing during usage of disk connected via PCIe?
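
One way to do this comparison without eyeballing the counters, as a minimal Python sketch. It assumes a single-CPU /proc/interrupts layout like the one on the 88F6281, and the function names parse_interrupts and changed are made up for illustration.

```python
import re


def parse_interrupts(text):
    """Map (IRQ label, description) -> count for a single-CPU /proc/interrupts dump."""
    counts = {}
    for line in text.splitlines():
        # Example line: " 40:   59  interrupt-controller@20200   9 Edge  pcie0.0"
        m = re.match(r"\s*(\w+):\s+(\d+)\s*(.*)$", line)
        if m:
            irq, count, rest = m.groups()
            # Collapse runs of whitespace so two snapshots compare reliably.
            counts[(irq, " ".join(rest.split()))] = int(count)
    return counts


def changed(before_text, after_text):
    """Return {(irq, description): delta} for every counter that moved."""
    before = parse_interrupts(before_text)
    after = parse_interrupts(after_text)
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] != before.get(key, 0)
    }
```

Calling changed() on two reads of /proc/interrupts taken a few seconds apart, while the disk is probed or accessed, would list only the interrupt lines whose counters moved.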
Comment 22 Hajo Noerenberg 2022-06-23 12:35:51 UTC
>> The good: something changed.
>> The bad: now both HDDs (6121 port 0 and port1) do NOT work anymore (even the
>> SATA-1 hard disk, which worked correctly before, does not work anymore).
> That patch only changes PCIe DTS nodes and does not touch on-board
> Marvell disk controller. So it should not have any effect on that second
> HDD which is not connected via PCIe.
> 

Sorry, I expressed it in a misleading way: your patch does not change the sata_mv HDDs, they always work, with and without your patch.

Your patch only changes (in my case) the SATA-I HDD connected to port1 of the 6121 controller (which is connected via PCIe):

5.16.0rc1-pali-mvebu without DTS-301220-patch (same behaviour as vanilla 3.16..5.x):
6121-port0 = SATA-II ST3500418AS => "failed to IDENTIFY"
6121-port1 = SATA-I ST3250310NS  => works

5.16.0rc1-pali-mvebu with DTS-301220-patch:
6121-port0 = SATA-II ST3500418AS => "failed to IDENTIFY"
6121-port1 = SATA-I ST3250310NS  => "failed to IDENTIFY"

>> This may be a slight indication that the error of the HDDs has something to
>> do with the interrupt handling after all. But this is just a guess from me.
> 
> Can you check in /proc/interrupts that PCIe interrupt counts are
> increasing during usage of disk connected via PCIe?
> 

Since the HDDs are not detected, there is no block device with which I could generate disk traffic or interrupts. According to my understanding the interrupts could possibly only be generated during the detection phase, but they are zero:

root@nas440:~# cat /proc/interrupts 
           CPU0       
 17:   41238521  bridge-interrupt-ctrl   2 Edge      orion_event
 25:        420  interrupt-controller@20200  29 Edge      mv64xxx_i2c
 26:        865  interrupt-controller@20200  33 Edge      ttyS0
 28:          0  bridge-interrupt-ctrl   3 Edge      f1020300.watchdog-timer
 29:          0  interrupt-controller@20200  22 Edge      f1030000.crypto
 30:     473966  interrupt-controller@20200  19 Edge      ehci_hcd:usb1
 31:     310380  interrupt-controller@20200  46 Edge      f1072004.mdio-bus
 32:          0  interrupt-controller@20200  53 Edge      f1010300.rtc
 33:       9175  interrupt-controller@20200  21 Edge      sata_mv[f1080000.sata]
 34:          2  interrupt-controller@20200   5 Edge      f1060800.xor
 35:          2  interrupt-controller@20200   7 Edge      f1060900.xor
 36:          0  f1010100.gpio  29 Edge      Reset
 37:          0  f1010140.gpio  17 Edge      Power
 38:      49658  interrupt-controller@20200  11 Edge      eth0
 40:          0  interrupt-controller@20200  44 Edge      pcie0.0
 41:          0  mvebu-rp   0 Edge      pciehp
 42:          0  mvebu-INTx   0 Level     ahci[0000:01:00.0]
Err:          0

With the old mach-kirkwood/pcie.c driver (= both SATA1/2 HDDs working with the 6121 controller) both the Host Bridge and the 88SE6121 SATA controller were connected to IRQ9 ("pin A routed to IRQ 9", see attachment "lspci Linux version 3.2.0-4-kirkwood"). With newer kernels they have different IRQs in the 40+ range.
Comment 23 Pali Rohár 2022-06-23 12:48:17 UTC
On Thursday 23 June 2022 12:35:51 bugzilla-daemon@kernel.org wrote:
> Since the HDDs are not detected, there is no block device with which I could
> generate disk traffic or interrupts. According to my understanding the
> interrupts could possibly only be generated during the detection phase

Exactly. I would expect that during detection phase there is some
interrupt from controller over PCIe.

> but they are zero:
> 
> root@nas440:~# cat /proc/interrupts 
>            CPU0       
>  17:   41238521  bridge-interrupt-ctrl   2 Edge      orion_event
>  25:        420  interrupt-controller@20200  29 Edge      mv64xxx_i2c
>  26:        865  interrupt-controller@20200  33 Edge      ttyS0
>  28:          0  bridge-interrupt-ctrl   3 Edge      f1020300.watchdog-timer
>  29:          0  interrupt-controller@20200  22 Edge      f1030000.crypto
>  30:     473966  interrupt-controller@20200  19 Edge      ehci_hcd:usb1
>  31:     310380  interrupt-controller@20200  46 Edge      f1072004.mdio-bus
>  32:          0  interrupt-controller@20200  53 Edge      f1010300.rtc
>  33:       9175  interrupt-controller@20200  21 Edge      sata_mv[f1080000.sata]
>  34:          2  interrupt-controller@20200   5 Edge      f1060800.xor
>  35:          2  interrupt-controller@20200   7 Edge      f1060900.xor
>  36:          0  f1010100.gpio  29 Edge      Reset
>  37:          0  f1010140.gpio  17 Edge      Power
>  38:      49658  interrupt-controller@20200  11 Edge      eth0
>  40:          0  interrupt-controller@20200  44 Edge      pcie0.0
>  41:          0  mvebu-rp   0 Edge      pciehp
>  42:          0  mvebu-INTx   0 Level     ahci[0000:01:00.0]
> Err:          0

This output should be from the new "non-working" kernel, right?

> With the old mach-kirkwood/pcie.c driver (= both SATA1/2 HDDs working with
> the 6121 controller) both the Host Bridge and the 88SE6121 SATA controller were
> connected to IRQ9 ("pin A routed to IRQ 9", see attachment "lspci Linux
> version 3.2.0-4-kirkwood"). With newer kernels they have different IRQs in the
> 40+ range.

IRQ numbers are dynamically assigned by the kernel; they may change between
kernel versions and even between reboots (initialization of drivers is
asynchronous and sometimes one driver can ask for an IRQ number
faster than another driver).

So could you also provide the /proc/interrupts output from a "working" kernel,
including the assigned IRQ numbers which you see in lspci (in case they
change between reboots)?
Comment 24 Hajo Noerenberg 2022-06-23 13:22:26 UTC
Created attachment 301263 [details]
lspci Linux 5.16.0rc1-palimvebu without patch 301220
Comment 25 Hajo Noerenberg 2022-06-23 13:25:17 UTC
>> root@nas440:~# cat /proc/interrupts 
>>            CPU0       
>>  17:   41238521  bridge-interrupt-ctrl   2 Edge      orion_event
>>  25:        420  interrupt-controller@20200  29 Edge      mv64xxx_i2c
>>  26:        865  interrupt-controller@20200  33 Edge      ttyS0
>>  28:          0  bridge-interrupt-ctrl   3 Edge      f1020300.watchdog-timer
>>  29:          0  interrupt-controller@20200  22 Edge      f1030000.crypto
>>  30:     473966  interrupt-controller@20200  19 Edge      ehci_hcd:usb1
>>  31:     310380  interrupt-controller@20200  46 Edge      f1072004.mdio-bus
>>  32:          0  interrupt-controller@20200  53 Edge      f1010300.rtc
>>  33:       9175  interrupt-controller@20200  21 Edge      sata_mv[f1080000.sata]
>>  34:          2  interrupt-controller@20200   5 Edge      f1060800.xor
>>  35:          2  interrupt-controller@20200   7 Edge      f1060900.xor
>>  36:          0  f1010100.gpio  29 Edge      Reset
>>  37:          0  f1010140.gpio  17 Edge      Power
>>  38:      49658  interrupt-controller@20200  11 Edge      eth0
>>  40:          0  interrupt-controller@20200  44 Edge      pcie0.0
>>  41:          0  mvebu-rp   0 Edge      pciehp
>>  42:          0  mvebu-INTx   0 Level     ahci[0000:01:00.0]
>> Err:          0
> 
> This output should be from the new "non-working" kernel, right?
> 
Yes, the above is 5.16.0rc1-pali-mvebu with DTS-301220-patch (both 6121 ports not working).

> So could you provide also /proc/interrupts output from "working" kernel
> including assigned IRQ numbers which you see in lspci (in case they
> changes between reboot)?
> 

Depends on what you mean by "working" ;-)

5.16.0rc1-pali-mvebu without DTS-301220-patch (6121-port1=SATA-I working and port0=SATA-II not working):

root@nas440:~# cat /proc/interrupts
           CPU0
 17:      66162  bridge-interrupt-ctrl   2 Edge      orion_event
 25:        396  interrupt-controller@20200  29 Edge      mv64xxx_i2c
 26:       2473  interrupt-controller@20200  33 Edge      ttyS0
 28:          0  bridge-interrupt-ctrl   3 Edge      f1020300.watchdog-timer
 29:          0  interrupt-controller@20200  22 Edge      f1030000.crypto
 30:       1687  interrupt-controller@20200  19 Edge      ehci_hcd:usb1
 31:        432  interrupt-controller@20200  46 Edge      f1072004.mdio-bus
 32:          0  interrupt-controller@20200  53 Edge      f1010300.rtc
 33:       3550  interrupt-controller@20200  21 Edge      sata_mv[f1080000.sata]
 34:          2  interrupt-controller@20200   5 Edge      f1060800.xor
 35:          2  interrupt-controller@20200   7 Edge      f1060900.xor
 36:          0  f1010100.gpio  29 Edge      Reset
 37:          0  f1010140.gpio  17 Edge      Power
 38:        206  interrupt-controller@20200  11 Edge      eth0
 40:         59  interrupt-controller@20200   9 Edge      ahci[0000:01:00.0]
Err:          0

lspci can be seen in https://bugzilla.kernel.org/attachment.cgi?id=301263 .

There is currently no way to get port0=SATA-II working with kernel >3.16 (pci-mvebu);
this is 3.2.0-4-kirkwood with mach-kirkwood/pci.c (both 6121 port1=SATA-I and port0=SATA-II working):

root@wheezy:~# cat /proc/interrupts
           CPU0
  1:       2677  orion_irq  orion_tick
  5:          2  orion_irq  mv_xor.0
  6:          2  orion_irq  mv_xor.1
  7:          2  orion_irq  mv_xor.2
  8:          2  orion_irq  mv_xor.3
  9:        355  orion_irq  ahci
 11:         36  orion_irq  eth0
 19:        979  orion_irq  ehci_hcd:usb1
 21:       3809  orion_irq  sata_mv
 22:          6  orion_irq  mv_crypto
 28:         52  orion_irq  mvsdio
 33:       1840  orion_irq  serial
 46:         24  orion_irq  mv643xx_eth
 53:          0  orion_irq  rtc-mv
102:          1         -  mvsdio cd
Err:          0

lspci is identical to https://bugzilla.kernel.org/attachment.cgi?id=301113 .
Comment 26 Hajo Noerenberg 2022-06-23 13:51:43 UTC
There is an issue with the (newer) 88SE91xx family of Marvell SATA controllers: https://bugzilla.kernel.org/show_bug.cgi?id=42679 .

I do not know if this might be relevant for the (old) 88SE6121 as well (at least the error message "failed to IDENTIFY" is identical).

Some time ago I tested this quirk (adapted for the 6121) in one of the 5.x kernels, but it did not help.
Comment 27 Pali Rohár 2022-06-27 18:27:37 UTC
I updated my pci-mvebu branch and added definitions of INTx and summary interrupts for all platforms which use the pci-mvebu.c driver (kirkwood, dove, a370, axp, a375, a380, a385, a39x). I split the features into different commits, so that INTx support and the "big" summary interrupt support can easily be tested separately.

Hajo, could you test my branch again? And if there is some regression (e.g. a disk which works without my patches, but does not with them), could you identify the commit which broke it?
Comment 28 Hajo Noerenberg 2022-07-01 08:00:17 UTC
Hi Pali, this is what I've tested so far:

|    88SE6121     |
| port0= | port1= | 
| SATA-2 | SATA-1 | Loader/Kernel
---------------------------------------------------------------------------
| works  | works  | U-Boot 2022.04 with PCI-bindings patch 20220328
| works  | works  | 3.2.0-4-kirkwood => mach-kirkwood/pci.c
| works  | works  | 3.16.0-0.bpo.4-kirkwood without DTB => kirkwood/pci.c
| fails  | works  | 3.16.0-0.bpo.4-kirkwood with DTB => mvebu-pci
| fails  | works  | 5.17.0-1-marvell Debian bullseye => mvebu-pci
| fails  | works  | 5.16.0rc1-pali-mvebu-20220222
| fails  | fails  | 5.16.0rc1-pali-mvebu-20220222 with patch DTS-301220
| fails  | fails  | 5.16.0rc1-pali-mvebu-20220627
| fails  | works  | 5.16.0rc1-pali-mvebu-20220627 without summary int (revert commit 304aaac07620bbedbcafd40f8de2a108ac9f3ab5)

=> With summary interrupt enabled, both SATA-1/2 HDDs do not work, there seems to be a fundamental problem (side note: with "summary int" 44, do you mean "PEX0Err" from the documentation (i.e. 44 is calculated by 32+12?)).

=> SATA-2 HDDs do not work, even with latest pci-mvebu-20220627.
Comment 29 Pali Rohár 2022-07-01 08:51:58 UTC
On Friday 01 July 2022 08:00:17 bugzilla-daemon@kernel.org wrote:
> => With summary interrupt enabled, both SATA-1/2 HDDs do not work, there
> seems to be a fundamental problem (side note: with "summary int" 44, do you
> mean "PEX0Err" from the documentation (i.e. 44 is calculated by 32+12?)).

Yes. At least on Armada 385, PEX0Err is a superset of events that also
includes PEX0INT, and it acts as the summary interrupt source. PEX0INT
triggers only INTx sources. So on Armada 385 the PEX0INT source has to be
disabled when PEX0Err is enabled.

On Kirkwood, PEX0INT is bit 9 in the low register and PEX0Err is bit 12 in
the high register (so the ID is 32+12 = 44).

So it looks like PEX0Err is not a superset of PEX0INT on Kirkwood.
I will adjust the patches to reflect this.
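
The numbering can be written down as a tiny Python sketch. It assumes 32 cause bits per register, which is what the 32+12 calculation implies; the helper name hwirq is hypothetical.

```python
# Assumption: each interrupt cause register holds 32 bits; the low register
# has index 0 and the high register has index 1.
BITS_PER_CAUSE_REGISTER = 32


def hwirq(register_index, bit):
    """Return the flat interrupt ID for a bit in a given cause register."""
    return register_index * BITS_PER_CAUSE_REGISTER + bit


PEX0INT = hwirq(0, 9)   # low register, bit 9
PEX0ERR = hwirq(1, 12)  # high register, bit 12
```

These are the same two IDs (9 and 44) that appear in the hardware IRQ column of the /proc/interrupts listings in this report.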
Comment 30 Pali Rohár 2022-07-01 17:27:54 UTC
So it seems that when both err and intx are enabled on A385, intx interrupts are not reported via the err source.

I have now updated the pci-mvebu branch with new code that always uses the intx source for intx interrupts.
Comment 31 Hajo Noerenberg 2022-07-05 07:34:05 UTC
Hi Pali, with latest pci-mvebu SATA-1 is working again. SATA-2 sadly still fails with "failed to IDENTIFY":

|    88SE6121     |
| port0= | port1= | 
| SATA-2 | SATA-1 | Loader/Kernel
---------------------------------------------------------------------------
| fails  | works  | 5.16.0rc1-pali-mvebu-20220701

Disk activity on port1=SATA-1 synchronously increases both interrupt 40 (9) and 43 (value 643 in this screenshot):

root@nas440:~# cat /proc/interrupts 
           CPU0       
 17:     347575  bridge-interrupt-ctrl   2 Edge      orion_event
 25:        396  interrupt-controller@20200  29 Edge      mv64xxx_i2c
 26:       4465  interrupt-controller@20200  33 Edge      ttyS0
 28:          0  bridge-interrupt-ctrl   3 Edge      f1020300.watchdog-timer
 29:          0  interrupt-controller@20200  22 Edge      f1030000.crypto
 30:       4900  interrupt-controller@20200  19 Edge      ehci_hcd:usb1
 31:       2552  interrupt-controller@20200  46 Edge      f1072004.mdio-bus
 32:          0  interrupt-controller@20200  53 Edge      f1010300.rtc
 33:       3969  interrupt-controller@20200  21 Edge      sata_mv[f1080000.sata]
 34:          2  interrupt-controller@20200   5 Edge      f1060800.xor
 35:          2  interrupt-controller@20200   7 Edge      f1060900.xor
 36:          0  f1010100.gpio  29 Edge      Reset
 37:          0  f1010140.gpio  17 Edge      Power
 38:       2110  interrupt-controller@20200  11 Edge      eth0
 40:        643  interrupt-controller@20200   9 Edge      pcie0.0
 41:          0  interrupt-controller@20200  44 Edge      pcie0.0
 42:          0  mvebu-rp   0 Edge      pciehp
 43:        643  mvebu-INTx   0 Level     ahci[0000:01:00.0]
Err:          0

root@nas440:~# lspci -vv -nn
0001:00:01.0 PCI bridge [0604]: Marvell Technology Group Ltd. 88F6281 [Kirkwood] ARM SoC [11ab:6281] (rev 03) (prog-if 00 [Normal decode])
        Device tree node: /sys/firmware/devicetree/base/mbus@f1000000/pcie@82000000/pcie@1,0
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 44
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
[...]

0001:01:00.0 IDE interface [0101]: Marvell Technology Group Ltd. 88SE6111/6121 SATA II / PATA Controller [11ab:6121] (rev b2) (prog-if 8f [PCI native mode controller, supports both channels switched to ISA compatibility mode, supports bus mastering])
        Subsystem: Marvell Technology Group Ltd. 88SE6111/6121 1/2 port SATA II + 1 port PATA Controller [11ab:6121]
        Physical Slot: 1
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 45
        Region 0: I/O ports at 10010 [size=8]
[...]


Side note: after unloading and re-loading the pci-mvebu module, I get a "Unable to handle kernel paging request at virtual address bf0770cc" error when calling "cat /proc/interrupts". I /think/ I never observed this problem before.
Comment 32 Pali Rohár 2022-07-05 10:30:24 UTC
> Hi Pali, with latest pci-mvebu SATA-1 is working again

Perfect!

> SATA-2 sadly still fails with "failed to IDENTIFY":

I thought that we would see some PCIe error here, but sadly the error interrupt (source id 44) was not triggered:

>  41:          0  interrupt-controller@20200  44 Edge      pcie0.0

Anyway, there is no PCIe AER driver registered on the Root Port:

>  42:          0  mvebu-rp   0 Edge      pciehp

Could you check if you have enabled AER support during kernel compilation? With patches from my pci-mvebu.c branch, AER support for mvebu should work and if there is some PCIe issue, it should be printed into dmesg.

> Disk activity on port1=SATA-1 synchronously increases both interrupt 40 (9) and 43

That is correct, so legacy INTx interrupts for SATA port1 are working fine.

Anyway, in output you have:

> cat /proc/interrupts
>  42:          0  mvebu-rp   0 Edge      pciehp
>  43:        643  mvebu-INTx   0 Level     ahci[0000:01:00.0]

and

> lspci -vv -nn
> 0001:00:01.0 PCI bridge [0604] ...
>         Interrupt: pin A routed to IRQ 44
> 0001:01:00.0 IDE interface [0101]:
>         Interrupt: pin A routed to IRQ 45

This does not match IRQ numbers. Have you put lspci output **after** module unloading and re-loading?
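
A rough Python sketch of this consistency check, assuming that `lspci -vv` prints "routed to IRQ N" under each device and that /proc/interrupts lists only IRQs with a registered handler. All function names are hypothetical.

```python
import re

# Matches "00:01.0" as well as "0001:00:01.0" (with PCI domain prefix).
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def lspci_irqs(lspci_text):
    """Return {device address: IRQ number} from `lspci -vv` output."""
    irqs = {}
    device = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            device = header.group(1)
        routed = re.search(r"routed to IRQ (\d+)", line)
        if routed and device:
            irqs[device] = int(routed.group(1))
    return irqs


def registered_irqs(interrupts_text):
    """Return the set of numeric IRQ labels found in /proc/interrupts."""
    return {int(m.group(1)) for m in re.finditer(r"^\s*(\d+):", interrupts_text, re.M)}


def stale_irqs(lspci_text, interrupts_text):
    """Devices whose lspci IRQ number has no entry in /proc/interrupts."""
    known = registered_irqs(interrupts_text)
    return {dev: irq for dev, irq in lspci_irqs(lspci_text).items() if irq not in known}
```

A non-empty result from stale_irqs() only says that the two outputs disagree, for example because they were captured at different times; it does not by itself point to a driver bug.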

Also, the PCI domain number changed from 0000: to 0001:. This is IIRC a known issue which happens after module reloading. I sent a patch for this a few days ago: https://lore.kernel.org/linux-pci/20220702204737.7719-1-pali@kernel.org/

> Side note: after unloading and re-loading the pci-mvebu module, I get a
> "Unable to handle kernel paging request at virtual address bf0770cc" error
> when calling "cat /proc/interrupts".

Based on the above observation (IRQ numbers after reloading were allocated after the "gap") I think that module unloading has not released the IRQs. Could you check whether the patch below helps?

diff --git a/drivers/pci/controller/pci-mvebu.c b/drivers/pci/controller/pci-mvebu.c
index cf0ebcac8757..fee2d40bcf08 100644
--- a/drivers/pci/controller/pci-mvebu.c
+++ b/drivers/pci/controller/pci-mvebu.c
@@ -2063,6 +2063,11 @@ static int mvebu_pcie_remove(struct platform_device *pdev)
 		/* Clear all interrupt causes. */
 		mvebu_writel(port, ~PCIE_INT_ALL_MASK, PCIE_INT_CAUSE_OFF);
 
+		if (port->intx_irq > 0)
+			devm_free_irq(dev, port->intx_irq, port);
+		if (port->error_irq > 0)
+			devm_free_irq(dev, port->error_irq, port);
+
 		/* Remove IRQ domains. */
 		if (port->intx_irq_domain)
 			irq_domain_remove(port->intx_irq_domain);
Comment 33 Hajo Noerenberg 2022-07-06 09:24:40 UTC
> Could you check if you have enabled AER support during kernel compilation?
> With patches from my pci-mvebu.c branch, AER support for mvebu should work and
> if there is some PCIe issue, it should be printed into dmesg.

I recompiled with CONFIG_PCIEAER=y, but I do not see any AER errors in dmesg.
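
For reference, a small Python sketch of how the option could be checked in a kernel config file before rebuilding. Where the config text comes from (for example /boot/config-<version>, or /proc/config.gz if the kernel provides it) is an assumption about the system, and config_value is a hypothetical helper.

```python
def config_value(config_text, option):
    """Return 'y', 'm' or another value for a kernel config option.

    Returns 'n' if the option is explicitly marked as not set, and None
    if the option does not appear in the text at all.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == "# " + option + " is not set":
            return "n"
    return None
```

With the config text loaded into a string, config_value(text, "CONFIG_PCIEAER") would be expected to return "y" for the rebuilt kernel described here.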

> Anyway, in output you have:
> 
>> cat /proc/interrupts
>>  42:          0  mvebu-rp   0 Edge      pciehp
>>  43:        643  mvebu-INTx   0 Level     ahci[0000:01:00.0]
> 
> and
> 
>> lspci -vv -nn
>> 0001:00:01.0 PCI bridge [0604] ...
>>         Interrupt: pin A routed to IRQ 44
>> 0001:01:00.0 IDE interface [0101]:
>>         Interrupt: pin A routed to IRQ 45
> 
> This does not match IRQ numbers. Have you put lspci output **after** module
> unloading and re-loading?
> 

I don't remember. With the new AER-enabled kernel after a fresh boot IRQs are aligned:

root@nas440:~# cat /proc/interrupts 
[...]
 40:         59  interrupt-controller@20200   9 Edge      pcie0.0
 41:          0  interrupt-controller@20200  44 Edge      pcie0.0
 42:          0  mvebu-rp   0 Edge      aerdrv, pciehp
 43:         59  mvebu-INTx   0 Level     ahci[0000:01:00.0]

root@nas440:~# lspci -vv -nn
00:01.0 PCI bridge [0604]: Marvell Technology Group Ltd. 88F6281 [Kirkwood] ARM SoC [11ab:6281] (rev 03) (prog-if 00 [Norm
        Device tree node: /sys/firmware/devicetree/base/mbus@f1000000/pcie@82000000/pcie@1,0
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 42
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
[...]
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 4a000001 01000004 01080000 2b010000
                RootCmd: CERptEn- NFERptEn- FERptEn-
                RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
                         FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
                ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
[...]
01:00.0 IDE interface [0101]: Marvell Technology Group Ltd. 88SE6111/6121 SATA II / PATA Controller [11ab:6121] (rev b2) (
        Subsystem: Marvell Technology Group Ltd. 88SE6111/6121 1/2 port SATA II + 1 port PATA Controller [11ab:6121]
        Physical Slot: 1
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 43
[...]
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 1f, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Kernel driver in use: ahci
        Kernel modules: ahci
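
The IRQ comparison between the two outputs above can be automated. This is a minimal sketch, assuming Python 3 and the single-CPU /proc/interrupts layout shown here. The function names are hypothetical, and the sample strings are abridged copies of the fresh-boot output in this comment.

```python
import re

# Abridged copies of the fresh-boot output shown in this comment.
PROC_INTERRUPTS = """\
 40:         59  interrupt-controller@20200   9 Edge      pcie0.0
 41:          0  interrupt-controller@20200  44 Edge      pcie0.0
 42:          0  mvebu-rp   0 Edge      aerdrv, pciehp
 43:         59  mvebu-INTx   0 Level     ahci[0000:01:00.0]
"""

LSPCI = """\
00:01.0 PCI bridge [0604]: Marvell Technology Group Ltd. 88F6281 [Kirkwood] ARM SoC
        Interrupt: pin A routed to IRQ 42
01:00.0 IDE interface [0101]: Marvell Technology Group Ltd. 88SE6111/6121
        Interrupt: pin A routed to IRQ 43
"""


def handlers_from_interrupts(text):
    """Return (irq, handler_name) pairs from /proc/interrupts text."""
    pairs = []
    for line in text.splitlines():
        # IRQ number, per-CPU counts, then chip, hwirq, trigger type, handlers.
        m = re.match(r"\s*(\d+):(?:\s+\d+)+\s+\S.*?\b(?:Edge|Level)\b\s+(.*)$", line)
        if m:
            pairs.extend((int(m.group(1)), n.strip()) for n in m.group(2).split(","))
    return pairs


def irqs_from_lspci(text):
    """Return {pci_address: irq} from 'lspci -vv' text; domain 0000 is added if omitted."""
    irqs, device = {}, None
    for line in text.splitlines():
        m = re.match(r"((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", line)
        if m:
            addr = m.group(1)
            device = addr if addr.count(":") == 2 else "0000:" + addr
            continue
        m = re.search(r"routed to IRQ (\d+)", line)
        if m and device:
            irqs[device] = int(m.group(1))
    return irqs


def irq_mismatches(interrupts_text, lspci_text):
    """List (pci_address, irq_in_proc_interrupts, irq_in_lspci) where the two disagree."""
    lspci = irqs_from_lspci(lspci_text)
    bad = []
    for irq, name in handlers_from_interrupts(interrupts_text):
        m = re.search(r"\[([0-9a-f:.]+)\]$", name)  # e.g. ahci[0000:01:00.0]
        if m and m.group(1) in lspci and lspci[m.group(1)] != irq:
            bad.append((m.group(1), irq, lspci[m.group(1)]))
    return bad


print(irq_mismatches(PROC_INTERRUPTS, LSPCI))  # prints [] for the fresh-boot sample
```

Only handlers that carry a PCI address in brackets are checked, so aerdrv and pciehp are ignored by this sketch.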

> Also PCI domain number changed from 0000: to 0001. This is IIRC known issue
> which happens after module reloading. I sent patch for this few days ago
> https://lore.kernel.org/linux-pci/20220702204737.7719-1-pali@kernel.org/

I did not include this patch yet.

>> Side note: after unloading and re-loading the pci-mvebu module, I get a
>> "Unable to handle kernel paging request at virtual address bf0770cc" error
>> when calling "cat /proc/interrupts".
> 
> Based on the above observation (IRQ numbers after reloading were allocated
> after the "gap") I think that module unloading have not released IRQs. Could
> you try patch below if it helps?

I included this patch (I bravely added "struct device *dev = &pdev->dev;" at the top of the function to make it compile), but the problem still persists: after rmmod && modprobe, lspci shows IRQ44 and 45, and "cat /proc/interrupts" crashes.

I wonder why the old mach-kirkwood/pcie.c, which looks like a very simple implementation, works and the current mvebu PCI driver, which seems much more sophisticated, has trouble driving SATA-2 HDDs.
Comment 34 Pali Rohár 2022-07-06 18:18:30 UTC
> and "cat /proc/interrupts" crashes

This is something that would need to be debugged :-( It is probably related to module unloading or host bridge unbinding.

> I wonder why the old mach-kirkwood/pcie.c, which looks like a very simple
> implementation, works and the current mvebu PCI driver, which seems much more
> sophisticated, has trouble driving SATA-2 HDDs.

I thought that there was some PCIe-related error and that AER would report it.

As a last resort, could you please provide the contents of /proc/ioports and /proc/iomem from v3.16 with DTS (non-working) and without DTS (working)? I would like to compare them to check for an error in the memory mapping.

But if you have a disk which works without any issue and there is no AER error, I have a feeling that the issue is not related to PCIe or pci-mvebu.c. It looks like it could be related to the SATA/AHCI controller. Could the ATA people help here?
Comment 35 Hajo Noerenberg 2022-07-07 12:46:51 UTC
Created attachment 301359 [details]
iomem ioports Linux version 3.16.0-0.bpo.4-kirkwood with/without DTB
Comment 36 Hajo Noerenberg 2022-07-07 12:57:50 UTC
>> I wonder why the old mach-kirkwood/pcie.c, which looks like a very simple
>> implementation, works and the current mvebu PCI driver, which seems much
>> more sophisticated, has trouble driving SATA-2 HDDs.
> 
> I thought that there is some PCIe related error and AER reports it.
> 
> As a last chance, could you please provide output of /proc/ioports and
> /proc/iomem files from v3.16 with DTS (non-working) and without DTS (working)
> versions? To compare if there is not some error in memory mapping.
> 

Please see attachment 301359 [details].

> But if you have disk which works without any issue and there is no AER error
> I have feeling that issue is not PCIe or pci-mvebu.c related. Looks like that
> it could be sata/ahci controller related.
> 

Well, that's absolutely possible. My conclusion is based on the fact that under kernel 3.16 the identical AHCI/ATA module is used in both cases, but a different PCIe driver is used depending on whether the DTB is included. However, the problem could also lie in other parts of the kernel, or in the way the AHCI driver is integrated with or without the DTB. I am not able to assess these things.
Comment 37 Pali Rohár 2022-07-07 13:03:51 UTC
It looks like the PCI I/O ports start at 0x00010000 with the DT/pci-mvebu.c
version, but at 0x00001000 with the non-DT version.

Could you try the following patch, which could move the start of the PCI I/O address range?

diff --git a/drivers/pci/controller/pci-mvebu.c b/drivers/pci/controller/pci-mvebu.c
index 629e9701ddf4..3269ce1daa1d 100644
--- a/drivers/pci/controller/pci-mvebu.c
+++ b/drivers/pci/controller/pci-mvebu.c
@@ -1937,7 +1939,7 @@ static int mvebu_pcie_parse_request_resources(struct mvebu_pcie *pcie)
 
 	if (resource_size(&pcie->io) != 0) {
 		pcie->realio.flags = pcie->io.flags;
-		pcie->realio.start = PCIBIOS_MIN_IO;
+		pcie->realio.start = 0x0;
 		pcie->realio.end = min_t(resource_size_t,
 					 IO_SPACE_LIMIT - SZ_64K,
 					 resource_size(&pcie->io) - 1);
Comment 38 Hajo Noerenberg 2022-07-07 17:01:34 UTC
> Could you try following patch which could move start PCI IO address?
> -               pcie->realio.start = PCIBIOS_MIN_IO;
> +               pcie->realio.start = 0x0;
>                 pcie->realio.end = min_t(resource_size_t,
Only the very first line changed:

root@nas440:~# cat /proc/ioports 
00000000-000effff : PCI I/O
  00010000-00010fff : PCI Bus 0000:01
    00010000-0001000f : 0000:01:00.0
      00010000-0001000f : ahci
    00010010-00010017 : 0000:01:00.0
      00010010-00010017 : ahci
    00010018-0001001f : 0000:01:00.0
      00010018-0001001f : ahci
    00010020-00010023 : 0000:01:00.0
      00010020-00010023 : ahci
    00010024-00010027 : 0000:01:00.0
      00010024-00010027 : ahci

The other lines stay at 0x100xx -- and SATA-2 HDDs still fail.

(kernel is 5.16.0rc1-pali-mvebu)
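
The observation above (the top-level window moved to 0x0, but the bus range stayed at 0x10000) can be read off /proc/ioports mechanically. This is a minimal sketch, assuming Python 3 and that the file was read as root (otherwise the kernel shows zeroed addresses). The function names are hypothetical, and the sample is an abridged copy of the output in this comment.

```python
import re

# Abridged copy of the /proc/ioports output shown in this comment.
IOPORTS = """\
00000000-000effff : PCI I/O
  00010000-00010fff : PCI Bus 0000:01
    00010000-0001000f : 0000:01:00.0
      00010000-0001000f : ahci
"""


def parse_ioports(text):
    """Return (depth, start, end, name) tuples; depth counts two-space indents."""
    entries = []
    for line in text.splitlines():
        m = re.match(r"( *)([0-9a-f]+)-([0-9a-f]+) : (.*)$", line)
        if m:
            depth = len(m.group(1)) // 2
            entries.append((depth, int(m.group(2), 16), int(m.group(3), 16),
                            m.group(4).strip()))
    return entries


def first_child_offset(entries):
    """Offset of the first nested range from the start of the top-level window."""
    top = next(e for e in entries if e[0] == 0)
    child = next(e for e in entries if e[0] == 1)
    return child[1] - top[1]


print(hex(first_child_offset(parse_ioports(IOPORTS))))  # prints 0x10000
```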
Comment 39 Pali Rohár 2022-07-09 14:21:43 UTC
> the problem still persists: after rmmod && modprobe, lspci shows IRQ44 and
> 45, and "cat /proc/interrupts" crashes.

I was able to reproduce this issue on A385 as well. It happens only sometimes, but I think I found the root cause: interrupt mappings must be disposed of before removing the domain. Could you try the following patch? It helped on A385.

diff --git a/drivers/pci/controller/pci-mvebu.c b/drivers/pci/controller/pci-mvebu.c
index 31f53a019b8f..951030052358 100644
--- a/drivers/pci/controller/pci-mvebu.c
+++ b/drivers/pci/controller/pci-mvebu.c
@@ -1713,8 +1713,15 @@ static int mvebu_pcie_remove(struct platform_device *pdev)
 		mvebu_writel(port, ~PCIE_INT_ALL_MASK, PCIE_INT_CAUSE_OFF);
 
 		/* Remove IRQ domains. */
-		if (port->intx_irq_domain)
+		if (port->intx_irq_domain) {
+			int virq, j;
+			for (j = 0; j < PCI_NUM_INTX; j++) {
+				virq = irq_find_mapping(port->intx_irq_domain, j);
+				if (virq > 0)
+					irq_dispose_mapping(virq);
+			}
 			irq_domain_remove(port->intx_irq_domain);
+		}
 
 		/* Free config space for emulated root bridge. */
 		pci_bridge_emul_cleanup(&port->bridge);
Comment 40 Hajo Noerenberg 2022-07-09 15:46:04 UTC
Created attachment 301380 [details]
U-Boot log with more io/region info
Comment 41 Hajo Noerenberg 2022-07-09 16:07:05 UTC
> I was able to reproduce this issue also on A385, happens only sometimes, but
> I think I found the root cause. Interrupt mappings must be disposed prior
> removing domain. Could you try following patch? It helped for A385.
> 
Yes, it helps. "cat /proc/interrupts" does not crash anymore and the IRQ numbers now increase by just 1 per reload (before: 2, see comment 33).

 40:         59  interrupt-controller@20200   9 Edge      pcie0.0
 41:          0  interrupt-controller@20200  44 Edge      pcie0.0
 42:          0  mvebu-rp   0 Edge      aerdrv, pciehp
 43:         59  mvebu-INTx   0 Level     ahci[0000:01:00.0]
/* rmmod && insmod */
 40:        118  interrupt-controller@20200   9 Edge      pcie0.0
 41:          0  interrupt-controller@20200  44 Edge      pcie0.0
 43:          0  mvebu-rp   0 Edge      aerdrv, pciehp
 44:         59  mvebu-INTx   0 Level     ahci[0001:01:00.0]

I had to manually adjust the patch, because in my source file there is this part before the call to pci_bridge_emul_cleanup:

                if (port->rp_irq_domain)
                        irq_domain_remove(port->rp_irq_domain);

                if (port->error_irq > 0)
                        del_timer_sync(&port->link_irq_timer);

I /think/ you have to apply the same logic to rp_irq_domain to stop the IRQ increase completely.
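
The remaining IRQ number drift can be quantified by diffing two /proc/interrupts snapshots. This is a minimal sketch, assuming Python 3 and the single-CPU layout shown above. The function names are hypothetical, and the two samples are copied from the before/after output in this comment.

```python
import re

# Snapshots copied from this comment: before and after "rmmod && insmod".
BEFORE = """\
 40:         59  interrupt-controller@20200   9 Edge      pcie0.0
 41:          0  interrupt-controller@20200  44 Edge      pcie0.0
 42:          0  mvebu-rp   0 Edge      aerdrv, pciehp
 43:         59  mvebu-INTx   0 Level     ahci[0000:01:00.0]
"""

AFTER = """\
 40:        118  interrupt-controller@20200   9 Edge      pcie0.0
 41:          0  interrupt-controller@20200  44 Edge      pcie0.0
 43:          0  mvebu-rp   0 Edge      aerdrv, pciehp
 44:         59  mvebu-INTx   0 Level     ahci[0001:01:00.0]
"""


def irq_by_source(text):
    """Map (irq chip name, hardware IRQ) to the Linux IRQ number."""
    result = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):(?:\s+\d+)+\s+(\S+)\s+(\d+)\s+(?:Edge|Level)\b", line)
        if m:
            result[(m.group(2), int(m.group(3)))] = int(m.group(1))
    return result


def irq_drift(before, after):
    """Return {source: delta} for sources whose Linux IRQ number changed."""
    b, a = irq_by_source(before), irq_by_source(after)
    return {src: a[src] - b[src] for src in b if src in a and a[src] != b[src]}


print(irq_drift(BEFORE, AFTER))
```

For the two snapshots above, only the mvebu-rp and mvebu-INTx sources move, each by one, while the two interrupt-controller@20200 lines keep their numbers.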

Could you please have a look at attachment 301380 [details] and possibly 301124? Within U-Boot both HDDs work, and maybe there is some hint there (ioport location and/or size?) that helps.
Comment 42 Pali Rohár 2022-07-09 16:13:19 UTC
> Yes, it helps.

Perfect, thank you for testing! I will send patch to linux-pci ASAP.

> I /think/ you have to apply the same logic to rp_irq_domain to stop the IRQ
> increase completely.

Now it is in my pci-mvebu branch.
Comment 43 Pali Rohár 2022-07-09 16:24:02 UTC
I have looked at the I/O ports output, but I do not see what could be wrong here. The current Linux configuration seems to be OK.

If the PCI BARs were configured incorrectly, you would not be able to access the I/O or MEM regions of the SATA controller, and so no disk would work. But you have at least one working disk, so in my opinion this is an ATA/AHCI-related issue, not a PCIe one.
Comment 44 Pali Rohár 2022-07-11 18:49:36 UTC
Anyway, could this IDENTIFY problem be similar to the one that was observed in sata_mv?

https://lists.denx.de/pipermail/u-boot/2022-March/479294.html
https://lists.denx.de/pipermail/u-boot/2021-August/456705.html

That IDENTIFY command needs to be called two times.
Comment 45 Hajo Noerenberg 2022-07-19 08:08:49 UTC
> Anyway, cannot be this IDENTIFY problem similar to one which was observed in
> sata_mv?
> 
If I understood correctly, the problem observed in sata_mv occurred primarily on cold boot. The problem described in this issue makes no difference between cold and warm boot, and even if the drive was successfully detected by U-Boot, the Linux kernel cannot subsequently detect it.

> https://lists.denx.de/pipermail/u-boot/2022-March/479294.html
> https://lists.denx.de/pipermail/u-boot/2021-August/456705.html
> 
> That IDENTIFY command needs to be called two times.
> 
Anyway, I've patched drivers/ata/libata-core.c like this:

        if (ap->ops->read_id)
                err_mask = ap->ops->read_id(dev, &tf, id);
        else
                err_mask = ata_do_dev_read_id(dev, &tf, id);
+
+       if (err_mask) {
+               ata_dev_warn(dev, "CHECK: read_id error, may_readagain=%d\n", may_readagain);
+
+               if (may_readagain) {
+                       may_readagain = 0;
+                       ata_dev_warn(dev, "CHECK: read_id retry, may_readagain=%d\n", may_readagain);
+                       ata_eh_thaw_port(ap); /* need to unfreeze port after failed cmd */
+                       goto retry;
+               }
+       }

Which results in:

[   53.201652] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   58.339290] ata3.00: qc timeout (cmd 0xec)
[   58.343493] ata3.00: CHECK: read_id error, may_readagain=1
[   58.348602] ata3.00: CHECK: read_id retry, may_readagain=0
[   68.579391] ata3.00: qc timeout (cmd 0xec)
[   68.583595] ata3.00: CHECK: read_id error, may_readagain=0
[   68.588695] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)

To my limited understanding, this sends ATA_CMD_ID_ATA (0xEC) two times, but it still fails.
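
The time each IDENTIFY attempt waited before timing out can be read from the timestamps in the log above. This is a minimal sketch, assuming Python 3. The function name is hypothetical, and the "read_id retry" marker exists only with the CHECK debug patch shown in this comment.

```python
import re

# Copied from the dmesg excerpt in this comment.
DMESG = """\
[   53.201652] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   58.339290] ata3.00: qc timeout (cmd 0xec)
[   58.343493] ata3.00: CHECK: read_id error, may_readagain=1
[   58.348602] ata3.00: CHECK: read_id retry, may_readagain=0
[   68.579391] ata3.00: qc timeout (cmd 0xec)
[   68.583595] ata3.00: CHECK: read_id error, may_readagain=0
[   68.588695] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
"""


def identify_waits(text):
    """Seconds between each attempt's start marker and the following 'qc timeout'."""
    waits, started = [], None
    for line in text.splitlines():
        m = re.match(r"\[\s*(\d+\.\d+)\]\s+(.*)$", line)
        if not m:
            continue
        stamp, msg = float(m.group(1)), m.group(2)
        if "SATA link up" in msg or "read_id retry" in msg:
            started = stamp  # an IDENTIFY attempt begins around here
        elif "qc timeout" in msg and started is not None:
            waits.append(round(stamp - started, 3))
            started = None
    return waits


print(identify_waits(DMESG))
```

For this log the first attempt waits roughly 5.1 s and the retry roughly 10.2 s. That would be consistent with libata lengthening the IDENTIFY timeout between attempts, but this interpretation is an assumption and was not verified in this thread.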
Comment 46 Pali Rohár 2022-07-19 18:16:36 UTC
I do not have any more ideas. Try forwarding this issue to the linux-ide@vger.kernel.org mailing list. Maybe other people will have an idea whether there could be an issue in the ahci driver.
Comment 47 Pali Rohár 2022-11-26 14:37:56 UTC
And when you write to linux-ide mailing list, please CC me, so I can keep track of this issue.
Comment 48 DE 2023-01-27 21:38:01 UTC
Hey guys,

This is an interesting situation, but disks are complex devices and can be mean sometimes. Luckily you posted this on the mailing list and it caught my attention.

There might be many reasons for things to break, but you skipped the basic steps that would have helped to troubleshoot the issue faster.

Both affected disks are really nasty models. Their vendor has released firmware updates which in this case have not been applied. Your disks may be degraded and your data may be at risk.

The .12 having CC38 fw can be updated to CC49 as can be seen here:
https://www.seagate.com/gb/en/support/kb/barracuda-720012-firmware-update-213891en

The ES.2 having SN04 fw can be updated to SN06 as can be seen here:
https://www.seagate.com/gb/en/support/kb/firmware-update-for-st3250310ns-st3500320ns-st3750330ns-st31000340ns-207963en
I recently examined two failed ones (1TB) at work, and one of them had an extraordinary fault that I had never encountered on an HDD before!

Please update them on an x86 PC. You can do it on the NAS too, but it is highly advised to use vendor tools on supported platforms just to be safe.

Upon updating them, please repeat all tests :)

If the problem remains, try to disable NCQ. After that, there is only one easy thing left to do: force the device into SATA I mode using a jumper on the affected disk(s). This was necessary for the first VIA SATA I hosts like VT8237, VT8237R, ..., and there is no software workaround for that situation. More details can be found on this page:
https://ata.wiki.kernel.org/index.php/Sata_via

Waiting for news on this!
Comment 49 Pali Rohár 2023-01-30 20:07:50 UTC
Discussion about this issue is now on linux-ide mailing list:
https://lore.kernel.org/linux-ide/db6b48b7-d69a-564b-24f0-75fbd6a9e543@noerenberg.de/t/#u

I do not think that a firmware upgrade or disabling NCQ has anything to do with the failed IDENTIFY command, because the simple AHCI implementation in U-Boot can detect the disk without any problems.
Comment 50 Hans Ulli Kroll 2023-05-03 15:23:09 UTC
FYI

The problem is the pci-mvebu driver.
I have two different hw setups on Marvell Kirkwood.
One with a working setup and a broken driver.

I need more time and tests to investigate this.
Comment 51 Hajo Noerenberg 2023-05-06 10:24:17 UTC
Hans Ulli,

can you explain very briefly why you think it is pci-mvebu or in which part of the driver there might be a problem? I share this thought (mainly because it worked under ancient kernels with kirkwood-pci), but cannot verify it due to my complete lack of knowledge in this area.

It would be great if this mystery is solved after years or rather almost decades :)

Let me know if I can help (by testing something).
Comment 52 Hans Ulli Kroll 2023-05-24 18:53:21 UTC
Hi Hajo

.. it is not, my bad.
I ran tests on v6.3 with a buildroot userspace on different platforms.

Here for pci-mvebu

armv5 Pogoplug V4 mobile and Iomega Iconnect

I discovered some differences in the output of lspci when using uclibc-ng as libc; with glibc and musl it is OK.
xhci, an external controller on PCI, is missing on the Pogoplug.

This took some time to discover, after I reported this error here.

On armv7 (Linksys WRT3200ACM) this driver works too, sort of.
I can use mwlwifi from here
https://github.com/kaloz/mwlwifi
after I rebased it onto the v6.3 kernel.
I can load this driver and it allocates IO/IRQ, but I cannot actually activate the interface. I assume some callback is missing in the driver.
A lot of the PCI/DMA/MAC80211 API changed between v5.4 and v6.3.

I've also tested with a dual network minipci card.
This is my output
buildroot ~ # cat /proc/interrupts 
           CPU0       
 17:       8514  bridge-interrupt-ctrl   2 Edge      orion_event
 26:          2  interrupt-controller@20200   5 Edge      f1060800.xor
 27:          2  interrupt-controller@20200   7 Edge      f1060900.xor
 28:       2452  interrupt-controller@20200  33 Edge      ttyS0
 29:       1665  interrupt-controller@20200  46 Edge      f1072004.mdio-bus
 30:          0  interrupt-controller@20200  11 Edge      eth0
 31:        147  mvebu-INTx   3 Level     eth2, eth1
 32:         30  interrupt-controller@20200  19 Edge      ehci_hcd:usb1
 33:          0  interrupt-controller@20200  53 Edge      f1010300.rtc
 34:          0  interrupt-controller@20200  29 Edge      mv64xxx_i2c
 35:          0  bridge-interrupt-ctrl   3 Edge      f1020300.watchdog-timer
 36:          0  interrupt-controller@20200  22 Edge      f1030000.crypto
 37:          0  f1010140.gpio   3 Edge      OTB Button
 38:          0  f1010100.gpio  12 Edge      Reset
Err:          0
Both eth1 and eth2 are working

How many SATA *and* PATA ports do you have besides the two from the SoC?
I counted 3, which is odd.
Hmm,
some site, without a datasheet, tells me this is true.
I need to get a picture
and to summarize your output; there is a lot of garbage in it.
Comment 53 Hans Ulli Kroll 2023-05-24 18:59:39 UTC
I missed something:
Did you compile with CONFIG_PATA_MARVELL support?
Comment 54 Hans Ulli Kroll 2023-06-01 15:09:23 UTC
Hajo

can you please post the output of
ls /sys/bus/pci/devices/*/ 

from the working and non-working kernel versions.

I need only the directory entries and not the contents of every file
Comment 55 Hajo Noerenberg 2023-06-05 07:28:55 UTC
Created attachment 304372 [details]
ls /sys/bus/pci/devices/*/ Linux 3.2.0-4-kirkwood: HDDs ok
Comment 56 Hajo Noerenberg 2023-06-05 07:29:55 UTC
Created attachment 304373 [details]
ls /sys/bus/pci/devices/*/ Linux 6.2.0-rc5: HDDs fail
Comment 57 Hajo Noerenberg 2023-06-05 07:37:06 UTC
Hi Ulli,

the SoC has 2 SATA ports (they always work, with all HDDs and speeds). SATA-2 and SATA-3 hard disks connected to a 88SE6121 (AHCI) controller, wired via PCIe to the 88F6281 SoC fail to operate ("failed to IDENTIFY" ... "qc timeout") when the pci-mvebu driver (Kernel 3.16 .. 5.10 Debian) is in use. The 88SE6121 also has a PATA port, which at least to my knowledge isn't wired on the PCB.

CONFIG_PATA_MARVELL does not work: https://marc.info/?l=linux-ide&m=167474771722812&w=2

I uploaded "/sys/bus/pci/devices/*/" as attachments to this bug as you requested.
