Created attachment 296405 [details] lspci-without-quirk.txt

Submitting as a bug here as requested by Bjorn Helgaas on the linux-pci mailing list:
https://lore.kernel.org/linux-pci/20210317224549.GA93134@bjorn-Precision-5520/

The ASMedia ASM1062 SATA controller causes an External Abort on PCIe controllers which support Max Payload Size >= 512 bytes. It happens with the Aardvark PCIe controller (tested on Turris MOX) and also with the DesignWare controller (armada8k, tested on CN9130-CRB):

ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: ATA-9: WDC WD40EFRX-68WT0N0, 80.00A80, max UDMA/133
ata1.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 32), AA
ERROR: Unhandled External Abort received on 0x80000000 at EL3!
ERROR: exception reason=1 syndrome=0x92000210
PANIC at PC : 0x00000000040273bc

Limiting Max Payload Size to 256 bytes solves this problem.

Attaching lspci-without-quirk.txt, lspci-with-quirk.txt and also config.txt.

We first noticed this bug after commit 8a3ebd8de328 ("PCI: aardvark: Implement emulated root PCI bridge config space"). It would therefore seem that this may be a problem specific to the Aardvark controller, but:
- it also happens on CN9130-CRB (which is also a Marvell SoC, but with a DesignWare PCIe controller)
- this commit caused the issue by finally allowing the kernel to set a 512 byte payload size
- it is possible that this card has this issue with every PCIe controller capable of a 512 byte payload size, and nobody noticed because most platforms don't support a 512 byte payload size (at least most x86 systems do not)
Created attachment 296407 [details] lspci-with-quirk.txt
Created attachment 296409 [details] config.txt
The ASRockRack TRX40D8-2N2T and X470D4U have an ASMedia ASM1062 SATA controller that controls two ports. On these motherboards, SATA SSDs plugged into those ports are lost after a while. The same SSDs have no issue when plugged into other (non-ASM1062) ports. Swapping cables does not resolve the issue on the ASM1062.

25:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)

On both motherboards the PCI bridge has DevCap: MaxPayload 512 bytes, so it looks like it may be the same issue. Log of the SATA SSDs being removed below.

[125768.573175] ata1.00: exception Emask 0x0 SAct 0x400040 SErr 0x0 action 0x6 frozen
[125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED
[125768.573219] ata1.00: cmd 61/00:30:88:31:3e/01:00:0d:00:00/40 tag 6 ncq dma 131072 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573246] ata1.00: status: { DRDY }
[125768.573256] ata1.00: failed command: WRITE FPDMA QUEUED
[125768.573270] ata1.00: cmd 61/10:b0:88:32:3e/00:00:0d:00:00/40 tag 22 ncq dma 8192 out
                         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573303] ata1.00: status: { DRDY }
[125768.573313] ata1: hard resetting link
[125768.573340] ata2.00: exception Emask 0x0 SAct 0x8010000 SErr 0x0 action 0x6 frozen
[125768.573368] ata2.00: failed command: WRITE FPDMA QUEUED
[125768.573384] ata2.00: cmd 61/00:80:28:31:3e/01:00:0d:00:00/40 tag 16 ncq dma 131072 out
                         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573418] ata2.00: status: { DRDY }
[125768.573428] ata2.00: failed command: WRITE FPDMA QUEUED
[125768.573443] ata2.00: cmd 61/00:d8:28:32:3e/01:00:0d:00:00/40 tag 27 ncq dma 131072 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573998] ata2.00: status: { DRDY }
[125768.574470] ata2: hard resetting link
[125778.573335] ata2: softreset failed (1st FIS failed)
[125778.573806] ata2: hard resetting link
[125778.574223] ata1: softreset failed (1st FIS failed)
[125778.574802] ata1: hard resetting link
[125788.573341] ata2: softreset failed (1st FIS failed)
[125788.573812] ata2: hard resetting link
[125788.574225] ata1: softreset failed (1st FIS failed)
[125788.574826] ata1: hard resetting link
[125823.573875] ata2: softreset failed (1st FIS failed)
[125823.574266] ata2: limiting SATA link speed to 3.0 Gbps
[125823.574572] ata2: hard resetting link
[125823.574896] ata1: softreset failed (1st FIS failed)
[125823.576037] ata1: limiting SATA link speed to 3.0 Gbps
[125823.576479] ata1: hard resetting link
[125828.573692] ata2: softreset failed (1st FIS failed)
[125828.574091] ata2: reset failed, giving up
[125828.574407] ata2.00: disabled
[125828.574733] sd 1:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[125828.574740] ata1: softreset failed (1st FIS failed)
[125828.575059] sd 1:0:0:0: [sdb] tag#16 Sense Key : Not Ready [current]
[125828.575598] ata1: reset failed, giving up
[125828.576081] sd 1:0:0:0: [sdb] tag#16 Add. Sense: Logical unit not ready, hard reset required
[125828.576588] ata1.00: disabled
This issue was fixed (worked around) in the following commit (in 5.15), which forces MPS to 256 bytes:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b12d93e9958e028856cbcb061b6e64728ca07755
It was also backported to these stable kernel versions: 5.14.6, 5.13.19, 5.10.67, 5.4.148, 4.19.207, 4.14.247, 4.9.283 and 4.4.284
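To check a running kernel against the versions listed above, something like the following sketch could be used. It only encodes the release numbers from this comment; the function name `has_mps_quirk` is made up, and distribution kernels may carry or drop patches independently of their version string, so treat the answer as a hint rather than proof.

```python
import re
from typing import Optional

# First stable release in each series that carries the backport,
# taken from the list in the comment above.
FIRST_FIXED = {
    (5, 14): 6,
    (5, 13): 19,
    (5, 10): 67,
    (5, 4): 148,
    (4, 19): 207,
    (4, 14): 247,
    (4, 9): 283,
    (4, 4): 284,
}
MAINLINE_FIXED = (5, 15)  # the quirk is in mainline from 5.15 on


def has_mps_quirk(release: str) -> Optional[bool]:
    """Return True/False if the version answers the question, None if unknown."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    if series >= MAINLINE_FIXED:
        return True
    first = FIRST_FIXED.get(series)
    if first is None:
        return None  # series not in the backport list: cannot tell from the version
    return patch >= first


if __name__ == "__main__":
    import platform
    print(platform.release(), has_mps_quirk(platform.release()))
```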
Laurent, please provide the "lspci -nn -vv" output, update the kernel to a patched version and check whether the issue is still there. Ideally, also check whether any PCIe AER message was logged before or after the failure.
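When reading such an lspci log, the two values of interest are the supported payload size (the DevCap line) and the configured one (in the DevCtl section). A rough sketch for pulling both out of saved output follows. It assumes the usual `lspci -vv` layout, where each device starts on an unindented line and the sizes are printed as "MaxPayload N bytes"; the function name `max_payload_by_device` is hypothetical.

```python
import re
from typing import Dict, Optional

MPS_RE = re.compile(r"MaxPayload (\d+) bytes")
LABEL_RE = re.compile(r"\s+(\w+):")  # indented "DevCap:", "DevCtl:", "DevSta:", ...


def max_payload_by_device(lspci_text: str) -> Dict[str, Dict[str, Optional[int]]]:
    """Map each device slot to its DevCap (supported) and DevCtl (configured) MPS."""
    result: Dict[str, Dict[str, Optional[int]]] = {}
    slot = None
    section = None
    for line in lspci_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: start of a new device, first token is the slot.
            slot = line.split()[0]
            result[slot] = {"DevCap": None, "DevCtl": None}
            section = None
            continue
        if slot is None:
            continue
        label = LABEL_RE.match(line)
        if label:
            section = label.group(1)  # continuation lines keep the previous section
        mps = MPS_RE.search(line)
        if mps and section in ("DevCap", "DevCtl"):
            result[slot][section] = int(mps.group(1))
    return result


if __name__ == "__main__":
    import sys
    for dev, mps in max_payload_by_device(sys.stdin.read()).items():
        if mps["DevCap"] is not None:
            print(dev, "supported:", mps["DevCap"], "configured:", mps["DevCtl"])
```

Run as `lspci -nn -vv | python3 script.py` (as root, so the capability registers are readable). A device on a patched kernel would be expected to show a configured value of 256 or less.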
Pali, thanks for all the info. I've pinged my kernel provider to update to 5.4.148 or later, and I'll report here whether it fixes my issue. It happens after a few hours of heavy I/O, so I'll be able to tell after a day or so if it's fixed. In lspci-nnvv.txt the problematic ASMedia controller is 49:00.0.
Created attachment 299191 [details] lspci-nnvv ASRockRack TRX40D8-2N2T
"grep -i aer /var/log/kern.log" returns only init info. The only errors are the ATA errors at the time of the failure:

Oct 9 02:59:33 p kernel: [125768.573175] ata1.00: exception Emask 0x0 SAct 0x400040 SErr 0x0 action 0x6 frozen
Oct 9 02:59:33 p kernel: [125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED
Oct 9 02:59:33 p kernel: [125768.573219] ata1.00: cmd 61/00:30:88:31:3e/01:00:0d:00:00/40 tag 6 ncq dma 131072 out
Oct 9 02:59:33 p kernel: [125768.573219] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 9 02:59:33 p kernel: [125768.573246] ata1.00: status: { DRDY }
Oct 9 02:59:33 p kernel: [125768.573256] ata1.00: failed command: WRITE FPDMA QUEUED
Oct 9 02:59:33 p kernel: [125768.573270] ata1.00: cmd 61/10:b0:88:32:3e/00:00:0d:00:00/40 tag 22 ncq dma 8192 out
Oct 9 02:59:33 p kernel: [125768.573270] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 9 02:59:33 p kernel: [125768.573303] ata1.00: status: { DRDY }
Oct 9 02:59:33 p kernel: [125768.573313] ata1: hard resetting link
Ok! If the crash happens again and you are able to, please provide new lspci output. Comparing the PCIe registers in the output from before the crash (which you have already posted) with the output from after the crash could bring some new information...
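One simple way to do that comparison is a plain text diff of the two saved lspci outputs, so that only the register lines that changed are shown. A minimal sketch using Python's standard difflib module follows; the function name `diff_lspci` and the file labels are placeholders.

```python
import difflib
from typing import List


def diff_lspci(before: str, after: str, context: int = 0) -> List[str]:
    """Return unified-diff lines between two saved lspci outputs."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="lspci-before-crash",
            tofile="lspci-after-crash",
            n=context,      # number of unchanged context lines around each change
            lineterm="",
        )
    )


if __name__ == "__main__":
    import sys
    # Usage: python3 script.py before.txt after.txt
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        for line in diff_lspci(f_before.read(), f_after.read(), context=2):
            print(line)
```

An empty result means the two captures are textually identical; any `-`/`+` pairs point at registers (for example DevSta or LnkSta bits) that changed across the failure.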
Created attachment 300021 [details] lspci-nnvv ASRockRack TRX40D8-2N2T with 5.4.157

After 7 days of running 5.4.157 there has been no issue so far. New lspci output attached, but it looks identical to the older one.
Unfortunately it didn't last, and even with 5.15.64-1-pve I have the failure.

I purchased a PCIe storage card:
https://www.startech.com/fr-fr/cartes-additionelles-et-peripheriques/8p6g-pcie-sata-card

Unluckily for me it also has ASM1062 controllers on it:

51:00.0 0106: 1b21:0612 (rev 02)
52:00.0 0106: 1b21:0612 (rev 02)
53:00.0 0106: 1b21:0612 (rev 02)
54:00.0 0106: 1b21:0612 (rev 02)

Disks plugged into it disappear with the same error after a few hours.
Created attachment 303083 [details] lspci post crash
Hi,

It seems that this problem also affects the ASM1061, and I'm guessing this patch only applies to the ASM1062? Even worse, in my case the affected machine completely freezes. Is there any way to have this patch applied other than recompiling a custom kernel?

Mar 27 21:51:40 pve2 kernel: [ 1349.324899] ata7.00: exception Emask 0x73 SAct 0x1c000 SErr 0xffffffff action 0xe frozen
Mar 27 21:51:40 pve2 kernel: [ 1349.324928] ata7.00: irq_stat 0xffffffff, unknown FIS 00000000 00000000 00000000 00000000, host bus
Mar 27 21:51:40 pve2 kernel: [ 1349.324946] ata7: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Mar 27 21:51:40 pve2 kernel: [ 1349.324974] ata7.00: failed command: WRITE FPDMA QUEUED
Mar 27 21:51:40 pve2 kernel: [ 1349.324987] ata7.00: cmd 61/58:70:80:72:e7/00:00:8f:00:00/40 tag 14 ncq dma 45056 out
Mar 27 21:51:40 pve2 kernel: [ 1349.324987] res 40/00:80:30:73:e7/00:00:8f:00:00/40 Emask 0x72 (host bus error)
Mar 27 21:51:40 pve2 kernel: [ 1349.325019] ata7.00: status: { DRDY }
Mar 27 21:51:40 pve2 kernel: [ 1349.325032] ata7.00: failed command: WRITE FPDMA QUEUED
Mar 27 21:51:40 pve2 kernel: [ 1349.325047] ata7.00: cmd 61/58:78:d8:72:e7/00:00:8f:00:00/40 tag 15 ncq dma 45056 out
Mar 27 21:51:40 pve2 kernel: [ 1349.325047] res 40/00:80:30:73:e7/00:00:8f:00:00/40 Emask 0x72 (host bus error)
Mar 27 21:51:40 pve2 kernel: [ 1349.325079] ata7.00: status: { DRDY }
Mar 27 21:51:40 pve2 kernel: [ 1349.325094] ata7.00: failed command: WRITE FPDMA QUEUED
Mar 27 21:51:40 pve2 kernel: [ 1349.325109] ata7.00: cmd 61/d0:80:30:73:e7/04:00:8f:00:00/40 tag 16 ncq dma 630784 out
Mar 27 21:51:40 pve2 kernel: [ 1349.325109] res 40/00:80:30:73:e7/00:00:8f:00:00/40 Emask 0x72 (host bus error)
Mar 27 21:51:40 pve2 kernel: [ 1349.325140] ata7.00: status: { DRDY }
Mar 27 21:51:40 pve2 kernel: [ 1349.325157] ata7: hard resetting link
Mar 27 21:51:40 pve2 kernel: [ 1349.375033] ahci 0000:05:00.0: AHCI controller unavailable!
Mar 27 21:51:41 pve2 kernel: [ 1349.675685] ata8.00: exception Emask 0x73 SAct 0x800000 SErr 0xffffffff action 0xe frozen
Mar 27 21:51:41 pve2 kernel: [ 1349.675709] ata8.00: irq_stat 0xffffffff, unknown FIS 00000000 00000000 00000000 00000000, host bus
Mar 27 21:51:41 pve2 kernel: [ 1349.675726] ata8: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Mar 27 21:51:41 pve2 kernel: [ 1349.675753] ata8.00: failed command: WRITE FPDMA QUEUED
Mar 27 21:51:41 pve2 kernel: [ 1349.675766] ata8.00: cmd 61/b8:b8:f0:70:e7/01:00:8f:00:00/40 tag 23 ncq dma 225280 out
Mar 27 21:51:41 pve2 kernel: [ 1349.675766] res 40/00:bc:f0:70:e7/00:00:8f:00:00/40 Emask 0x72 (host bus error)
Mar 27 21:51:41 pve2 kernel: [ 1349.675793] ata8.00: status: { DRDY }
Mar 27 21:51:41 pve2 kernel: [ 1349.675806] ata8: hard resetting link
Mar 27 21:51:41 pve2 kernel: [ 1349.725797] ahci 0000:05:00.0: AHCI controller unavailable!
Mar 27 21:51:43 pve2 kernel: [ 1351.199208] ata7: failed to resume link (SControl FFFFFFFF)
Mar 27 21:51:43 pve2 kernel: [ 1351.750470] ata7: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Mar 27 21:51:48 pve2 kernel: [ 1356.911883] ata7: hard resetting link
Mar 27 21:51:48 pve2 kernel: [ 1356.952522] ahci 0000:05:00.0: AHCI controller unavailable!
Mar 27 21:51:48 pve2 kernel: [ 1357.253183] ata8: failed to resume link (SControl FFFFFFFF)
Mar 27 21:51:49 pve2 kernel: [ 1357.804463] ata8: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Mar 27 21:51:54 pve2 kernel: [ 1363.056031] ata8: hard resetting link
Mar 27 21:51:54 pve2 kernel: [ 1363.096719] ahci 0000:05:00.0: AHCI controller unavailable!
Mar 27 21:51:56 pve2 kernel: [ 1364.610195] ata7: failed to resume link (SControl FFFFFFFF)
Mar 27 21:51:56 pve2 kernel: [ 1365.161462] ata7: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[MACHINE IS FROZEN]