Bug 212695 - ASMedia ASM1062 needs MPS = 256 quirk
Summary: ASMedia ASM1062 needs MPS = 256 quirk
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL: https://lore.kernel.org/linux-pci/202...
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-16 13:51 UTC by Marek Behún
Modified: 2021-10-12 20:21 UTC (History)

See Also:
Kernel Version: all
Tree: Mainline
Regression: No


Attachments
lspci-without-quirk.txt (4.30 KB, text/plain)
2021-04-16 13:51 UTC, Marek Behún
lspci-with-quirk.txt (4.30 KB, text/plain)
2021-04-16 13:52 UTC, Marek Behún
config.txt (120.95 KB, text/plain)
2021-04-16 13:52 UTC, Marek Behún
lspci-nnvv ASRockRack TRX40D8-2N2T (160.38 KB, text/plain)
2021-10-12 19:58 UTC, Laurent GUERBY

Description Marek Behún 2021-04-16 13:51:45 UTC
Created attachment 296405 [details]
lspci-without-quirk.txt

Submitting this as a bug here, as requested by Bjorn Helgaas on the linux-pci mailing list: https://lore.kernel.org/linux-pci/20210317224549.GA93134@bjorn-Precision-5520/

The ASMedia ASM1062 SATA controller causes an External Abort on
controllers which support Max Payload Size >= 512. It happens with
Aardvark PCIe controller (tested on Turris MOX) and also with DesignWare
controller (armada8k, tested on CN9130-CRB):

  ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
  ata1.00: ATA-9: WDC WD40EFRX-68WT0N0, 80.00A80, max UDMA/133
  ata1.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 32), AA
  ERROR:   Unhandled External Abort received on 0x80000000 at EL3!
  ERROR:    exception reason=1 syndrome=0x92000210
  PANIC at PC : 0x00000000040273bc

Limiting Max Payload Size to 256 bytes solves this problem.
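
For readers unfamiliar with how Max Payload Size is represented, here is a minimal sketch of the encoding used in the PCIe Device Capabilities (bits 2:0) and Device Control (bits 7:5) registers, where the size in bytes is 128 << code. The function names are hypothetical and chosen for illustration only; this is not the kernel quirk itself, just the arithmetic behind "MPS = 256".

```python
# Sketch of the PCIe Max Payload Size field encoding (bytes = 128 << code).
# All names below are illustrative assumptions, not kernel APIs.

DEVCTL_MPS_SHIFT = 5                        # Device Control bits 7:5 hold MPS
DEVCTL_MPS_MASK = 0x7 << DEVCTL_MPS_SHIFT   # 0x00e0


def mps_code_to_bytes(code):
    """Decode a 3-bit MPS field value (0..5) into a size in bytes."""
    if not 0 <= code <= 5:
        raise ValueError("MPS code must be in 0..5")
    return 128 << code


def mps_bytes_to_code(size):
    """Encode a payload size (128..4096, power of two) into the 3-bit field."""
    if size < 128 or size > 4096 or size & (size - 1):
        raise ValueError("MPS must be a power of two between 128 and 4096")
    return size.bit_length() - 8            # 128 -> 0, 256 -> 1, 512 -> 2, ...


def clamp_mps(negotiated, limit=256):
    """Apply a per-device ceiling, as a 256-byte quirk would."""
    return min(negotiated, limit)


def devctl_with_mps(devctl, size):
    """Return a 16-bit Device Control value with its MPS field set to size."""
    cleared = devctl & ~DEVCTL_MPS_MASK & 0xFFFF
    return cleared | (mps_bytes_to_code(size) << DEVCTL_MPS_SHIFT)
```

For example, a hierarchy that would otherwise settle on 512 bytes ends up with clamp_mps(512) == 256, which is field value 1.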

Attaching lspci-without-quirk.txt, lspci-with-quirk.txt and config.txt.

We first noticed this bug after commit 8a3ebd8de328 ("PCI: aardvark: Implement emulated root PCI bridge config space"). It would therefore seem that this may be a problem of this controller only, but:
- it also happens on CN9130-CRB (which is also a Marvell SOC, but with DesignWare PCIe controller)
- this commit caused this issue by finally allowing kernel to set 512 byte payload size
- it is possible that this card has this issue with every PCIe controller capable of 512 byte payload size, and nobody noticed because most platforms don't support 512 byte payload size - at least most x86 systems do not
Comment 1 Marek Behún 2021-04-16 13:52:18 UTC
Created attachment 296407 [details]
lspci-with-quirk.txt
Comment 2 Marek Behún 2021-04-16 13:52:34 UTC
Created attachment 296409 [details]
config.txt
Comment 3 Laurent GUERBY 2021-10-10 19:55:05 UTC
The ASRockRack TRX40D8-2N2T and X470D4U have an ASMedia ASM1062 SATA controller that controls two ports. On these motherboards, SATA SSDs plugged into those ports drop out after a while. The same SSDs plugged into other (non-ASM1062) ports have no issue. Swapping cables does not resolve the issue on the ASM1062.

25:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)

On both motherboards the PCI bridge has DevCap: MaxPayload 512 bytes so it looks like it may be the same issue.
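
To check this on other systems, the following is a minimal sketch that pulls the MaxPayload values out of saved "lspci -vv" text. It assumes the usual layout: a device header line starting in column 0, indented "DevCap:" and "DevCtl:" labels, and the DevCtl MaxPayload value possibly sitting on a continuation line. The function name max_payloads is hypothetical.

```python
import re

# Sketch: collect DevCap/DevCtl MaxPayload per device from `lspci -vv` text.
# Assumes the common lspci layout; adjust the patterns if yours differs.

_MPS_RE = re.compile(r"MaxPayload (\d+) bytes")
_WANTED_LABEL_RE = re.compile(r"^\s+(DevCap|DevCtl):")
_ANY_LABEL_RE = re.compile(r"^\s+\w+:")


def max_payloads(lspci_text):
    """Return {device_address: {"DevCap": bytes, "DevCtl": bytes}}."""
    result = {}
    device = None   # address from the most recent unindented header line
    label = None    # "DevCap" or "DevCtl" while inside that register's lines
    for line in lspci_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # New device header, e.g. "25:00.0 SATA controller ..."
            device = line.split()[0]
            label = None
            continue
        wanted = _WANTED_LABEL_RE.match(line)
        if wanted:
            label = wanted.group(1)
        elif _ANY_LABEL_RE.match(line):
            label = None  # some other register (LnkCap, DevCap2, ...)
        found = _MPS_RE.search(line)
        if found and device and label:
            result.setdefault(device, {})[label] = int(found.group(1))
    return result
```

A device whose upstream bridge reports a DevCap of 512 bytes while the ASM1062's DevCtl also shows 512 would be a candidate for the problem described in this bug.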

Log below of SATA SSD being removed.

[125768.573175] ata1.00: exception Emask 0x0 SAct 0x400040 SErr 0x0 action 0x6 frozen
[125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED
[125768.573219] ata1.00: cmd 61/00:30:88:31:3e/01:00:0d:00:00/40 tag 6 ncq dma 131072 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573246] ata1.00: status: { DRDY }
[125768.573256] ata1.00: failed command: WRITE FPDMA QUEUED
[125768.573270] ata1.00: cmd 61/10:b0:88:32:3e/00:00:0d:00:00/40 tag 22 ncq dma 8192 out
                         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573303] ata1.00: status: { DRDY }
[125768.573313] ata1: hard resetting link
[125768.573340] ata2.00: exception Emask 0x0 SAct 0x8010000 SErr 0x0 action 0x6 frozen
[125768.573368] ata2.00: failed command: WRITE FPDMA QUEUED
[125768.573384] ata2.00: cmd 61/00:80:28:31:3e/01:00:0d:00:00/40 tag 16 ncq dma 131072 out
                         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573418] ata2.00: status: { DRDY }
[125768.573428] ata2.00: failed command: WRITE FPDMA QUEUED
[125768.573443] ata2.00: cmd 61/00:d8:28:32:3e/01:00:0d:00:00/40 tag 27 ncq dma 131072 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573998] ata2.00: status: { DRDY }
[125768.574470] ata2: hard resetting link
[125778.573335] ata2: softreset failed (1st FIS failed)
[125778.573806] ata2: hard resetting link
[125778.574223] ata1: softreset failed (1st FIS failed)
[125778.574802] ata1: hard resetting link
[125788.573341] ata2: softreset failed (1st FIS failed)
[125788.573812] ata2: hard resetting link
[125788.574225] ata1: softreset failed (1st FIS failed)
[125788.574826] ata1: hard resetting link
[125823.573875] ata2: softreset failed (1st FIS failed)
[125823.574266] ata2: limiting SATA link speed to 3.0 Gbps
[125823.574572] ata2: hard resetting link
[125823.574896] ata1: softreset failed (1st FIS failed)
[125823.576037] ata1: limiting SATA link speed to 3.0 Gbps
[125823.576479] ata1: hard resetting link
[125828.573692] ata2: softreset failed (1st FIS failed)
[125828.574091] ata2: reset failed, giving up
[125828.574407] ata2.00: disabled
[125828.574733] sd 1:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[125828.574740] ata1: softreset failed (1st FIS failed)
[125828.575059] sd 1:0:0:0: [sdb] tag#16 Sense Key : Not Ready [current] 
[125828.575598] ata1: reset failed, giving up
[125828.576081] sd 1:0:0:0: [sdb] tag#16 Add. Sense: Logical unit not ready, hard reset required
[125828.576588] ata1.00: disabled
Comment 4 Pali Rohár 2021-10-12 18:58:08 UTC
This issue was fixed (worked around) by the following commit (in 5.15), which forces MPS to 256 bytes:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b12d93e9958e028856cbcb061b6e64728ca07755

It was also backported to the following stable kernel versions:
5.14.6, 5.13.19, 5.10.67, 5.4.148, 4.19.207, 4.14.247, 4.9.283 and 4.4.284
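
As a rough aid, here is a minimal sketch that checks a kernel release string against the versions listed above. The function name and table are hypothetical; series not listed above are reported as unfixed, and a distribution kernel may carry the patch regardless of its version number, so treat the result as a hint only.

```python
import re

# Sketch: does a kernel release string fall at or after a fixed version?
# The table mirrors the versions listed in this comment; nothing else is assumed.

FIXED_STABLE = {
    (5, 14): 6,
    (5, 13): 19,
    (5, 10): 67,
    (5, 4): 148,
    (4, 19): 207,
    (4, 14): 247,
    (4, 9): 283,
    (4, 4): 284,
}
MAINLINE_FIX = (5, 15)


def has_asm1062_mps_fix(release):
    """Return True if `release` (e.g. from `uname -r`) should contain the fix."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError("unrecognized kernel release: %r" % release)
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    if series >= MAINLINE_FIX:
        return True
    needed = FIXED_STABLE.get(series)
    return needed is not None and patch >= needed
```

For instance, has_asm1062_mps_fix("5.4.148") is True while has_asm1062_mps_fix("5.4.147") is False.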
Comment 5 Pali Rohár 2021-10-12 19:02:22 UTC
Laurent, please provide the "lspci -nn -vv" output, update the kernel to a patched version, and check whether the issue is still there. Ideally, also check whether any PCIe AER messages were logged before or after the failure.
Comment 6 Laurent GUERBY 2021-10-12 19:57:41 UTC
Pali, thanks for all the info. I've pinged my kernel provider to update to 5.4.148 or later, and I'll report here whether it fixes my issue. It happens after a few hours of heavy I/O, so I'll be able to tell after a day or so if it's fixed.

In lspci-nnvv.txt, the problematic ASMedia controller is 49:00.0.
Comment 7 Laurent GUERBY 2021-10-12 19:58:33 UTC
Created attachment 299191 [details]
lspci-nnvv ASRockRack TRX40D8-2N2T
Comment 8 Laurent GUERBY 2021-10-12 20:00:48 UTC
"grep -i aer /var/log/kern.log" returns only init info. The only errors are the ATA errors at the time of the failure:

Oct  9 02:59:33 p kernel: [125768.573175] ata1.00: exception Emask 0x0 SAct 0x400040 SErr 0x0 action 0x6 frozen
Oct  9 02:59:33 p kernel: [125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED
Oct  9 02:59:33 p kernel: [125768.573219] ata1.00: cmd 61/00:30:88:31:3e/01:00:0d:00:00/40 tag 6 ncq dma 131072 out
Oct  9 02:59:33 p kernel: [125768.573219]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  9 02:59:33 p kernel: [125768.573246] ata1.00: status: { DRDY }
Oct  9 02:59:33 p kernel: [125768.573256] ata1.00: failed command: WRITE FPDMA QUEUED
Oct  9 02:59:33 p kernel: [125768.573270] ata1.00: cmd 61/10:b0:88:32:3e/00:00:0d:00:00/40 tag 22 ncq dma 8192 out
Oct  9 02:59:33 p kernel: [125768.573270]          res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  9 02:59:33 p kernel: [125768.573303] ata1.00: status: { DRDY }
Oct  9 02:59:33 p kernel: [125768.573313] ata1: hard resetting link
Comment 9 Pali Rohár 2021-10-12 20:21:58 UTC
Ok! If the crash happens again and you are able to, please capture the lspci output again. Comparing the PCIe registers before the crash (which you have already posted) and after the crash could bring some new information...
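
One possible way to do that comparison is sketched below, using Python's standard difflib on two saved lspci dumps. The function name changed_lspci_lines and the "before-crash"/"after-crash" labels are hypothetical.

```python
import difflib

# Sketch: list only the lines that differ between two saved lspci dumps.


def changed_lspci_lines(before_text, after_text):
    """Return removed ("-...") and added ("+...") lines, in diff order."""
    diff = list(difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile="before-crash",
        tofile="after-crash",
        n=0,            # no context lines, only the changes themselves
        lineterm="",
    ))
    # The first two entries are the "---"/"+++" file headers (absent when the
    # inputs are identical); hunk headers start with "@@" and are dropped too.
    return [line for line in diff[2:] if line[:1] in ("+", "-")]
```

Register fields such as DevSta, LnkSta or the AER status bits that flip between the two captures would show up as "-"/"+" pairs in the returned list.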
