Created attachment 296405 [details] lspci-without-quirk.txt

Submitting as a bug here as requested by Bjorn Helgaas on the linux-pci mailing list:
https://lore.kernel.org/linux-pci/20210317224549.GA93134@bjorn-Precision-5520/

The ASMedia ASM1062 SATA controller causes an External Abort on PCIe controllers which support Max Payload Size >= 512 bytes. It happens with the Aardvark PCIe controller (tested on Turris MOX) and also with the DesignWare controller (armada8k, tested on CN9130-CRB):

ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: ATA-9: WDC WD40EFRX-68WT0N0, 80.00A80, max UDMA/133
ata1.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 32), AA
ERROR: Unhandled External Abort received on 0x80000000 at EL3!
ERROR: exception reason=1 syndrome=0x92000210
PANIC at PC : 0x00000000040273bc

Limiting Max Payload Size to 256 bytes solves this problem.

Attaching lspci-without-quirk.txt, lspci-with-quirk.txt and also config.txt.

We first noticed this bug after commit 8a3ebd8de328 ("PCI: aardvark: Implement emulated root PCI bridge config space"). It would therefore seem that this may be a problem of the Aardvark controller only, but:
- it also happens on CN9130-CRB (which is also a Marvell SoC, but with a DesignWare PCIe controller)
- this commit caused the issue by finally allowing the kernel to set a 512 byte payload size
- it is possible that this card has this issue with every PCIe controller capable of a 512 byte payload size, and nobody noticed because most platforms don't support a 512 byte payload size; at least most x86 systems do not
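For readers unfamiliar with how the payload sizes above are represented: PCIe stores Max Payload Size as a 3-bit field, where value n means 128 << n bytes (so 256 bytes is 1 and 512 bytes is 2). The following is a minimal sketch of that encoding. The helper names are hypothetical, and `path_mps` is a deliberate simplification (the smallest supported size wins); the kernel's real policy depends on its configuration.

```python
def decode_mps(field: int) -> int:
    """Convert a 3-bit Max Payload Size field value to bytes (128 << field)."""
    if not 0 <= field <= 5:
        # Encodings 6 and 7 are reserved.
        raise ValueError(f"reserved or invalid MPS field: {field}")
    return 128 << field


def encode_mps(size: int) -> int:
    """Convert a payload size in bytes back to its 3-bit field value."""
    for field in range(6):
        if 128 << field == size:
            return field
    raise ValueError(f"not a valid Max Payload Size: {size}")


def path_mps(*supported_bytes: int) -> int:
    """Simplified assumption: a payload size is only safe if every
    device on the path supports it, so take the minimum."""
    return min(supported_bytes)


if __name__ == "__main__":
    print(decode_mps(2))       # a field value of 2 is 512 bytes
    print(encode_mps(256))     # 256 bytes is field value 1
    print(path_mps(512, 256))  # one side limited to 256 caps the path
```

Under this simplified model, treating the ASM1062 as supporting at most 256 bytes keeps a 512-byte-capable host controller from ever using 512-byte payloads with it.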
Created attachment 296407 [details] lspci-with-quirk.txt
Created attachment 296409 [details] config.txt
The ASRockRack TRX40D8-2N2T and X470D4U motherboards have an ASMedia ASM1062 SATA controller that controls two ports. On these motherboards, SATA SSDs plugged into those ports are lost after a while. The same SSDs have no issue when plugged into other (non-ASM1062) ports. Swapping cables does not resolve the issue on the ASM1062.

25:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)

On both motherboards the PCI bridge has DevCap: MaxPayload 512 bytes, so it looks like it may be the same issue.

Log of the SATA SSDs being dropped:

[125768.573175] ata1.00: exception Emask 0x0 SAct 0x400040 SErr 0x0 action 0x6 frozen
[125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED
[125768.573219] ata1.00: cmd 61/00:30:88:31:3e/01:00:0d:00:00/40 tag 6 ncq dma 131072 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573246] ata1.00: status: { DRDY }
[125768.573256] ata1.00: failed command: WRITE FPDMA QUEUED
[125768.573270] ata1.00: cmd 61/10:b0:88:32:3e/00:00:0d:00:00/40 tag 22 ncq dma 8192 out
                         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573303] ata1.00: status: { DRDY }
[125768.573313] ata1: hard resetting link
[125768.573340] ata2.00: exception Emask 0x0 SAct 0x8010000 SErr 0x0 action 0x6 frozen
[125768.573368] ata2.00: failed command: WRITE FPDMA QUEUED
[125768.573384] ata2.00: cmd 61/00:80:28:31:3e/01:00:0d:00:00/40 tag 16 ncq dma 131072 out
                         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573418] ata2.00: status: { DRDY }
[125768.573428] ata2.00: failed command: WRITE FPDMA QUEUED
[125768.573443] ata2.00: cmd 61/00:d8:28:32:3e/01:00:0d:00:00/40 tag 27 ncq dma 131072 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573998] ata2.00: status: { DRDY }
[125768.574470] ata2: hard resetting link
[125778.573335] ata2: softreset failed (1st FIS failed)
[125778.573806] ata2: hard resetting link
[125778.574223] ata1: softreset failed (1st FIS failed)
[125778.574802] ata1: hard resetting link
[125788.573341] ata2: softreset failed (1st FIS failed)
[125788.573812] ata2: hard resetting link
[125788.574225] ata1: softreset failed (1st FIS failed)
[125788.574826] ata1: hard resetting link
[125823.573875] ata2: softreset failed (1st FIS failed)
[125823.574266] ata2: limiting SATA link speed to 3.0 Gbps
[125823.574572] ata2: hard resetting link
[125823.574896] ata1: softreset failed (1st FIS failed)
[125823.576037] ata1: limiting SATA link speed to 3.0 Gbps
[125823.576479] ata1: hard resetting link
[125828.573692] ata2: softreset failed (1st FIS failed)
[125828.574091] ata2: reset failed, giving up
[125828.574407] ata2.00: disabled
[125828.574733] sd 1:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[125828.574740] ata1: softreset failed (1st FIS failed)
[125828.575059] sd 1:0:0:0: [sdb] tag#16 Sense Key : Not Ready [current]
[125828.575598] ata1: reset failed, giving up
[125828.576081] sd 1:0:0:0: [sdb] tag#16 Add. Sense: Logical unit not ready, hard reset required
[125828.576588] ata1.00: disabled
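The "DevCap: MaxPayload 512 bytes" check mentioned above can be automated by scanning saved "lspci -vv" output. The sketch below assumes the usual lspci text layout (an unindented header line per device, indented "DevCap:" and "DevCtl:" register lines, with DevCtl's MaxPayload on a continuation line). SAMPLE is made-up illustrative text, not taken from the attachments in this report, and the function name is hypothetical.

```python
import re

# Illustrative text only; NOT copied from the attachments in this report.
SAMPLE = """\
00:01.0 Example device [ffff]: Hypothetical Vendor Example Controller
\t\tDevCap:\tMaxPayload 512 bytes, PhantFunc 0
\t\tDevCtl:\tCorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
\t\t\tMaxPayload 256 bytes, MaxReadReq 512 bytes
"""

LABEL = re.compile(r"^\s+(\w+):")          # e.g. "DevCap:" on an indented line
MPS = re.compile(r"MaxPayload (\d+) bytes")


def max_payload_by_device(text: str) -> dict:
    """Map each device slot to its DevCap (supported) and DevCtl
    (currently configured) MaxPayload values in bytes."""
    devices = {}
    slot = label = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: start of a new device block.
            slot, label = line.split()[0], None
            devices[slot] = {}
            continue
        m = LABEL.match(line)
        if m:
            label = m.group(1)  # continuation lines keep the previous label
        hit = MPS.search(line)
        if hit and slot is not None and label in ("DevCap", "DevCtl"):
            devices[slot][label] = int(hit.group(1))
    return devices


if __name__ == "__main__":
    for slot, regs in max_payload_by_device(SAMPLE).items():
        print(slot, regs)
```

DevCap shows what the device claims to support, while DevCtl shows what the kernel actually configured, so the two can differ.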
This issue was fixed (worked around) by the following commit (in 5.15), which forces MPS to 256 bytes:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b12d93e9958e028856cbcb061b6e64728ca07755

It was also backported to stable kernel versions 5.14.6, 5.13.19, 5.10.67, 5.4.148, 4.19.207, 4.14.247, 4.9.283 and 4.4.284.
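To check whether a given kernel release string should contain the workaround, the version list above can be turned into a small lookup. This is a sketch with a hypothetical helper name. It only knows the upstream versions listed in this comment, so a vendor kernel that backported the patch separately would be reported as unpatched.

```python
import re

# Minimum patch level per stable series, as listed in the comment above.
FIXED_STABLE = {
    (5, 14): 6,
    (5, 13): 19,
    (5, 10): 67,
    (5, 4): 148,
    (4, 19): 207,
    (4, 14): 247,
    (4, 9): 283,
    (4, 4): 284,
}
MAINLINE_FIX = (5, 15)  # first mainline release with the commit


def has_mps_workaround(release: str) -> bool:
    """Return True if an upstream kernel with this release string
    (e.g. the output of `uname -r`) includes the ASM1062 MPS fix."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    if series >= MAINLINE_FIX:
        return True
    needed = FIXED_STABLE.get(series)
    # Series that never received the backport (e.g. 5.12) are unpatched.
    return needed is not None and patch >= needed


if __name__ == "__main__":
    for rel in ("5.4.147", "5.4.157", "5.12.19", "5.15.0"):
        print(rel, has_mps_workaround(rel))
```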
Laurent, please provide the "lspci -nn -vv" output, update the kernel to a patched version, and check whether the issue is still there. Ideally, also check whether any PCIe AER message was logged before or after the failure.
Pali, thanks for all the info. I've pinged my kernel provider to update to 5.4.148 or later, and I'll report here whether it fixes my issue. It happens after a few hours of heavy I/O, so I'll be able to tell after a day or so if it's fixed. See lspci-nnvv.txt; the problematic ASMedia controller is 49:00.0.
Created attachment 299191 [details] lspci-nnvv ASRockRack TRX40D8-2N2T
"grep -i aer /var/log/kern.log" returns only init info. The only errors are the ATA errors at the time of the failure:

Oct 9 02:59:33 p kernel: [125768.573175] ata1.00: exception Emask 0x0 SAct 0x400040 SErr 0x0 action 0x6 frozen
Oct 9 02:59:33 p kernel: [125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED
Oct 9 02:59:33 p kernel: [125768.573219] ata1.00: cmd 61/00:30:88:31:3e/01:00:0d:00:00/40 tag 6 ncq dma 131072 out
Oct 9 02:59:33 p kernel: [125768.573219] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 9 02:59:33 p kernel: [125768.573246] ata1.00: status: { DRDY }
Oct 9 02:59:33 p kernel: [125768.573256] ata1.00: failed command: WRITE FPDMA QUEUED
Oct 9 02:59:33 p kernel: [125768.573270] ata1.00: cmd 61/10:b0:88:32:3e/00:00:0d:00:00/40 tag 22 ncq dma 8192 out
Oct 9 02:59:33 p kernel: [125768.573270] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 9 02:59:33 p kernel: [125768.573303] ata1.00: status: { DRDY }
Oct 9 02:59:33 p kernel: [125768.573313] ata1: hard resetting link
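A plain "grep -i aer" also matches any word that merely contains "aer". A slightly stricter way to separate AER messages from libata errors is sketched below. The function name and both patterns are assumptions made for illustration: the first matches "AER" as a whole word, the second matches libata port prefixes such as "ata1:" or "ata1.00:".

```python
import re

AER = re.compile(r"\bAER\b", re.IGNORECASE)  # "AER" as a whole word
ATA = re.compile(r"\bata\d+(?:\.\d+)?:")     # "ata1:" or "ata1.00:"


def classify_log_lines(lines):
    """Split kernel log lines into (aer_lines, ata_lines)."""
    aer, ata = [], []
    for line in lines:
        if AER.search(line):
            aer.append(line)
        elif ATA.search(line):
            ata.append(line)
    return aer, ata


if __name__ == "__main__":
    # Two lines taken from the log in this comment.
    log = [
        "Oct 9 02:59:33 p kernel: [125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED",
        "Oct 9 02:59:33 p kernel: [125768.573313] ata1: hard resetting link",
    ]
    aer, ata = classify_log_lines(log)
    print(len(aer), "AER lines;", len(ata), "ATA lines")
```

To scan a whole log, pass an open file object (for example the kern.log mentioned above) as `lines`.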
Ok! If the crash happens again and you are able to, please provide new lspci output. Comparing the PCIe registers from before the crash (which you have already posted) with those after the crash could bring some new information...
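The before/after comparison suggested here can be done with a plain text diff of two saved "lspci -nn -vv" outputs. Below is a minimal sketch using Python's standard difflib. The function name is hypothetical and the strings in the usage example are placeholders, not real register dumps.

```python
import difflib


def lspci_diff(before: str, after: str) -> list:
    """Return unified-diff lines between two saved lspci outputs.
    An empty list means the outputs are identical."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Placeholder text, only to show the shape of the result.
    before = "header\nMaxPayload 512 bytes"
    after = "header\nMaxPayload 256 bytes"
    for line in lspci_diff(before, after):
        print(line)
```

Lines starting with "-" appear only in the first output and lines starting with "+" only in the second, so any register that changed across the failure shows up as a "-"/"+" pair.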
Created attachment 300021 [details] lspci-nnvv ASRockRack TRX40D8-2N2T with 5.4.157

After 7 days of running 5.4.157, no issues so far. New lspci output attached, but it looks identical to the older one.