Bug 212695

Summary: ASMedia ASM1062 needs MPS = 256 quirk
Product: Drivers
Reporter: Marek Behún (kabel)
Component: PCI
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
CC: clement, kabel, laurent, pali
Priority: P1
Hardware: All
OS: Linux
URL: https://lore.kernel.org/linux-pci/20210317115924.31885-1-kabel@kernel.org/
Kernel Version: all
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
lspci-without-quirk.txt
lspci-with-quirk.txt
config.txt
lspci-nnvv ASRockRack TRX40D8-2N2T
lspci-nnvv ASRockRack TRX40D8-2N2T with 5.4.157
lspci post crash

Description Marek Behún 2021-04-16 13:51:45 UTC
Created attachment 296405
lspci-without-quirk.txt

Submitting this as a bug here, as requested by Bjorn Helgaas on the linux-pci mailing list: https://lore.kernel.org/linux-pci/20210317224549.GA93134@bjorn-Precision-5520/

The ASMedia ASM1062 SATA controller causes an External Abort on PCIe
host controllers which support Max Payload Size >= 512. It happens with
the Aardvark PCIe controller (tested on Turris MOX) and also with the
DesignWare controller (armada8k, tested on CN9130-CRB):

  ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
  ata1.00: ATA-9: WDC WD40EFRX-68WT0N0, 80.00A80, max UDMA/133
  ata1.00: 7814037168 sectors, multi 0: LBA48 NCQ (depth 32), AA
  ERROR:   Unhandled External Abort received on 0x80000000 at EL3!
  ERROR:    exception reason=1 syndrome=0x92000210
  PANIC at PC : 0x00000000040273bc

Limiting Max Payload Size to 256 bytes solves this problem.

Attaching lspci-without-quirk.txt, lspci-with-quirk.txt and also config.txt.

We first noticed this bug after commit 8a3ebd8de328 ("PCI: aardvark: Implement emulated root PCI bridge config space"). It would therefore seem that this may be a problem with the Aardvark controller only, but:
- it also happens on CN9130-CRB (which is also a Marvell SoC, but with a DesignWare PCIe controller)
- that commit triggered the issue by finally allowing the kernel to set a 512-byte payload size
- it is possible that this card has this issue with every PCIe controller capable of a 512-byte payload size, and nobody noticed because most platforms don't support a 512-byte payload size (at least most x86 systems do not)
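
For reference, Max Payload Size is configured through a 3-bit field (bits 7:5) of the PCIe Device Control register, encoded as 128 << n bytes. The following is only a minimal Python sketch of that encoding and of clamping a register value to 256 bytes. The helper names (decode_mps, encode_mps, clamp_mps) are made up for illustration, and this is not how the kernel implements the quirk.

```python
# Sketch only: models the Max_Payload_Size field of the PCIe Device Control
# register (bits 7:5, payload bytes = 128 << n). Helper names are hypothetical.
DEVCTL_MPS_MASK = 0x00E0
DEVCTL_MPS_SHIFT = 5


def decode_mps(devctl: int) -> int:
    """Return the payload size in bytes encoded in a Device Control value."""
    return 128 << ((devctl & DEVCTL_MPS_MASK) >> DEVCTL_MPS_SHIFT)


def encode_mps(nbytes: int) -> int:
    """Return the Device Control field bits for a payload size in bytes."""
    if nbytes < 128 or nbytes > 4096 or nbytes & (nbytes - 1):
        raise ValueError("MPS must be a power of two between 128 and 4096")
    return (nbytes.bit_length() - 8) << DEVCTL_MPS_SHIFT


def clamp_mps(devctl: int, limit: int = 256) -> int:
    """Lower the MPS field to `limit` bytes if it is larger; keep other bits."""
    if decode_mps(devctl) <= limit:
        return devctl
    return (devctl & ~DEVCTL_MPS_MASK) | encode_mps(limit)


if __name__ == "__main__":
    before = 0x2840  # example value with the MPS field set to 512 bytes
    after = clamp_mps(before)
    print(decode_mps(before), "->", decode_mps(after))
```

A register value whose field already encodes 128 or 256 bytes is returned unchanged, so the clamp only ever lowers the setting.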
Comment 1 Marek Behún 2021-04-16 13:52:18 UTC
Created attachment 296407
lspci-with-quirk.txt
Comment 2 Marek Behún 2021-04-16 13:52:34 UTC
Created attachment 296409
config.txt
Comment 3 Laurent GUERBY 2021-10-10 19:55:05 UTC
ASRockRack TRX40D8-2N2T and X470D4U have an ASMedia ASM1062 SATA controller that controls two ports. On these motherboards we lose, after a while, the SATA SSDs plugged into those ports. The same SSDs have no issue when plugged into other (non-ASM1062) ports. Swapping cables does not resolve the issue on the ASM1062.

25:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)

On both motherboards the PCI bridge has DevCap: MaxPayload 512 bytes so it looks like it may be the same issue.
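
A quick way to compare the supported and the configured payload size is to pull both values out of "lspci -vv" output. This is a rough Python sketch: the function name max_payload is made up, and the SAMPLE text is only an illustration of the usual DevCap/DevCtl layout, not an excerpt from the attachments on this bug.

```python
import re


def max_payload(lspci_text: str) -> dict:
    """Extract supported (DevCap) and configured (DevCtl) MaxPayload in bytes."""
    cap = re.search(r"DevCap:\s+MaxPayload (\d+) bytes", lspci_text)
    # The configured value is the first MaxPayload printed after "DevCtl:".
    ctl = re.search(r"DevCtl:.*?MaxPayload (\d+) bytes", lspci_text, re.S)
    return {
        "supported": int(cap.group(1)) if cap else None,
        "configured": int(ctl.group(1)) if ctl else None,
    }


# Illustrative sample only (assumed layout of the Express capability block).
SAMPLE = """\
\t\tDevCap:\tMaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
\t\tDevCtl:\tCorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
\t\t\tMaxPayload 256 bytes, MaxReadReq 512 bytes
"""

if __name__ == "__main__":
    print(max_payload(SAMPLE))
```

If "configured" is larger than 256 on the path to the ASM1062, the MPS limit is presumably not in effect on that system.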

Log of the SATA SSDs being removed:

[125768.573175] ata1.00: exception Emask 0x0 SAct 0x400040 SErr 0x0 action 0x6 frozen
[125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED
[125768.573219] ata1.00: cmd 61/00:30:88:31:3e/01:00:0d:00:00/40 tag 6 ncq dma 131072 out
                         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573246] ata1.00: status: { DRDY }
[125768.573256] ata1.00: failed command: WRITE FPDMA QUEUED
[125768.573270] ata1.00: cmd 61/10:b0:88:32:3e/00:00:0d:00:00/40 tag 22 ncq dma 8192 out
                         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573303] ata1.00: status: { DRDY }
[125768.573313] ata1: hard resetting link
[125768.573340] ata2.00: exception Emask 0x0 SAct 0x8010000 SErr 0x0 action 0x6 frozen
[125768.573368] ata2.00: failed command: WRITE FPDMA QUEUED
[125768.573384] ata2.00: cmd 61/00:80:28:31:3e/01:00:0d:00:00/40 tag 16 ncq dma 131072 out
                         res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573418] ata2.00: status: { DRDY }
[125768.573428] ata2.00: failed command: WRITE FPDMA QUEUED
[125768.573443] ata2.00: cmd 61/00:d8:28:32:3e/01:00:0d:00:00/40 tag 27 ncq dma 131072 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[125768.573998] ata2.00: status: { DRDY }
[125768.574470] ata2: hard resetting link
[125778.573335] ata2: softreset failed (1st FIS failed)
[125778.573806] ata2: hard resetting link
[125778.574223] ata1: softreset failed (1st FIS failed)
[125778.574802] ata1: hard resetting link
[125788.573341] ata2: softreset failed (1st FIS failed)
[125788.573812] ata2: hard resetting link
[125788.574225] ata1: softreset failed (1st FIS failed)
[125788.574826] ata1: hard resetting link
[125823.573875] ata2: softreset failed (1st FIS failed)
[125823.574266] ata2: limiting SATA link speed to 3.0 Gbps
[125823.574572] ata2: hard resetting link
[125823.574896] ata1: softreset failed (1st FIS failed)
[125823.576037] ata1: limiting SATA link speed to 3.0 Gbps
[125823.576479] ata1: hard resetting link
[125828.573692] ata2: softreset failed (1st FIS failed)
[125828.574091] ata2: reset failed, giving up
[125828.574407] ata2.00: disabled
[125828.574733] sd 1:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[125828.574740] ata1: softreset failed (1st FIS failed)
[125828.575059] sd 1:0:0:0: [sdb] tag#16 Sense Key : Not Ready [current] 
[125828.575598] ata1: reset failed, giving up
[125828.576081] sd 1:0:0:0: [sdb] tag#16 Add. Sense: Logical unit not ready, hard reset required
[125828.576588] ata1.00: disabled
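
As a reading aid for the log above: the SAct value on the exception lines appears to be a bitmask with one bit per outstanding NCQ tag, which is consistent with the tags printed on the failed commands (0x400040 has bits 6 and 22 set; 0x8010000 has bits 16 and 27 set). A minimal Python sketch of that decoding, with a made-up function name:

```python
def sact_tags(sact: int) -> list:
    """Return the NCQ tag numbers whose bits are set in a 32-bit SAct mask."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


if __name__ == "__main__":
    # Values taken from the ata1.00 and ata2.00 exception lines above.
    for value in (0x400040, 0x8010000):
        print(hex(value), sact_tags(value))
```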
Comment 4 Pali Rohár 2021-10-12 18:58:08 UTC
This issue was worked around in the following commit (in 5.15), which forces MPS to 256 bytes:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b12d93e9958e028856cbcb061b6e64728ca07755

It was also backported to these stable kernel versions:
5.14.6, 5.13.19, 5.10.67, 5.4.148, 4.19.207, 4.14.247, 4.9.283 and 4.4.284
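
To check a running kernel against the versions listed above, something like the following Python sketch could be used. The function name has_mps_quirk and the table are illustrative. The sketch returns None for series that are not in the list, because vendor kernels may or may not carry the backport.

```python
import re
from typing import Optional

# First mainline release with the quirk, and the first fixed release of each
# stable series, as listed in this comment.
FIRST_MAINLINE = (5, 15)
STABLE_BACKPORTS = {
    (5, 14): 6, (5, 13): 19, (5, 10): 67, (5, 4): 148,
    (4, 19): 207, (4, 14): 247, (4, 9): 283, (4, 4): 284,
}


def has_mps_quirk(release: str) -> Optional[bool]:
    """True/False if the release is known to (not) contain the quirk, else None."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        return None
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    if series >= FIRST_MAINLINE:
        return True
    if series in STABLE_BACKPORTS:
        return patch >= STABLE_BACKPORTS[series]
    return None  # unknown series: cannot tell from the version alone


if __name__ == "__main__":
    for rel in ("5.4.147", "5.4.157", "5.15.64-1-pve", "5.12.3"):
        print(rel, has_mps_quirk(rel))
```

The result only says whether the quirk should be present in that release, not whether the symptoms are gone on a given system.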
Comment 5 Pali Rohár 2021-10-12 19:02:22 UTC
Laurent, please provide the "lspci -nn -vv" output, update the kernel to a patched version, and check whether the issue is still there. Ideally, also check whether any PCIe AER message was logged before or after the failure.
Comment 6 Laurent GUERBY 2021-10-12 19:57:41 UTC
Pali, thanks for all the info. I've pinged my kernel provider to update to 5.4.148 or later, and I'll report here whether it fixes my issue. It happens after a few hours of heavy I/O, so I'll be able to tell after a day or so if it's fixed.

In lspci-nnvv.txt, the problematic ASMedia controller is 49:00.0.
Comment 7 Laurent GUERBY 2021-10-12 19:58:33 UTC
Created attachment 299191
lspci-nnvv ASRockRack TRX40D8-2N2T
Comment 8 Laurent GUERBY 2021-10-12 20:00:48 UTC
grep -i aer /var/log/kern.log returns only init info. The only errors are the ATA ones at the time of the failure:

Oct  9 02:59:33 p kernel: [125768.573175] ata1.00: exception Emask 0x0 SAct 0x400040 SErr 0x0 action 0x6 frozen
Oct  9 02:59:33 p kernel: [125768.573204] ata1.00: failed command: WRITE FPDMA QUEUED
Oct  9 02:59:33 p kernel: [125768.573219] ata1.00: cmd 61/00:30:88:31:3e/01:00:0d:00:00/40 tag 6 ncq dma 131072 out
Oct  9 02:59:33 p kernel: [125768.573219]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  9 02:59:33 p kernel: [125768.573246] ata1.00: status: { DRDY }
Oct  9 02:59:33 p kernel: [125768.573256] ata1.00: failed command: WRITE FPDMA QUEUED
Oct  9 02:59:33 p kernel: [125768.573270] ata1.00: cmd 61/10:b0:88:32:3e/00:00:0d:00:00/40 tag 22 ncq dma 8192 out
Oct  9 02:59:33 p kernel: [125768.573270]          res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  9 02:59:33 p kernel: [125768.573303] ata1.00: status: { DRDY }
Oct  9 02:59:33 p kernel: [125768.573313] ata1: hard resetting link
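
For what it's worth, the "cmd" field in these lines can be decoded by hand. libata prints it as command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and for FPDMA (NCQ) commands the sector count is carried in the feature bytes while the tag sits in the upper bits of nsect. The Python sketch below assumes 512-byte logical sectors, and the function name is made up. With that assumption it reproduces the "tag 6 ncq dma 131072" and "tag 22 ncq dma 8192" values shown in the log.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def decode_ncq_cmd(cmd: str) -> dict:
    """Decode opcode, NCQ tag and transfer length from a libata 'cmd' field."""
    command, low, high, _device = cmd.split("/")
    feature, nsect = (int(x, 16) for x in low.split(":")[:2])
    hob_feature = int(high.split(":")[0], 16)
    # For NCQ, the 16-bit sector count is hob_feature:feature (0 means 65536).
    sectors = (hob_feature << 8 | feature) or 65536
    return {
        "opcode": int(command, 16),
        "tag": nsect >> 3,  # tag is carried in bits 7:3 of the count field
        "sectors": sectors,
        "bytes": sectors * SECTOR_BYTES,
    }


if __name__ == "__main__":
    print(decode_ncq_cmd("61/00:30:88:31:3e/01:00:0d:00:00/40"))
    print(decode_ncq_cmd("61/10:b0:88:32:3e/00:00:0d:00:00/40"))
```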
Comment 9 Pali Rohár 2021-10-12 20:21:58 UTC
Ok! If the crash happens again and you are able to, please provide new lspci output. Comparing the PCIe registers in the output from before the crash (which you have already posted) with the output from after it could bring some new information.
Comment 10 Laurent GUERBY 2021-12-14 15:00:09 UTC
Created attachment 300021
lspci-nnvv ASRockRack TRX40D8-2N2T with 5.4.157

After 7 days of running 5.4.157, no issue so far. New lspci output attached, but it looks identical to the older one.
Comment 11 Laurent GUERBY 2022-10-25 11:51:08 UTC
Unfortunately it didn't last, and even with 5.15.64-1-pve I still get the failure.

I purchased a PCIe storage card:

https://www.startech.com/fr-fr/cartes-additionelles-et-peripheriques/8p6g-pcie-sata-card

Unluckily for me it also has ASM1062 on it:

51:00.0 0106: 1b21:0612 (rev 02)
52:00.0 0106: 1b21:0612 (rev 02)
53:00.0 0106: 1b21:0612 (rev 02)
54:00.0 0106: 1b21:0612 (rev 02)

And disks plugged into it disappear with the same error after a few hours.
Comment 12 Laurent GUERBY 2022-10-25 11:55:07 UTC
Created attachment 303083
lspci post crash
Comment 13 Clément Fiere 2023-03-27 19:18:35 UTC
Hi,

It seems that this problem also affects the ASM1061, and I'm guessing this patch only applies to the ASM1062?
Even worse, in my case the affected machine completely freezes.

Is there any way to have this patch applied other than recompiling a custom kernel?

Mar 27 21:51:40 pve2 kernel: [ 1349.324899] ata7.00: exception Emask 0x73 SAct 0x1c000 SErr 0xffffffff action 0xe frozen
Mar 27 21:51:40 pve2 kernel: [ 1349.324928] ata7.00: irq_stat 0xffffffff, unknown FIS 00000000 00000000 00000000 00000000, host bus
Mar 27 21:51:40 pve2 kernel: [ 1349.324946] ata7: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Mar 27 21:51:40 pve2 kernel: [ 1349.324974] ata7.00: failed command: WRITE FPDMA QUEUED
Mar 27 21:51:40 pve2 kernel: [ 1349.324987] ata7.00: cmd 61/58:70:80:72:e7/00:00:8f:00:00/40 tag 14 ncq dma 45056 out
Mar 27 21:51:40 pve2 kernel: [ 1349.324987]          res 40/00:80:30:73:e7/00:00:8f:00:00/40 Emask 0x72 (host bus error)
Mar 27 21:51:40 pve2 kernel: [ 1349.325019] ata7.00: status: { DRDY }
Mar 27 21:51:40 pve2 kernel: [ 1349.325032] ata7.00: failed command: WRITE FPDMA QUEUED
Mar 27 21:51:40 pve2 kernel: [ 1349.325047] ata7.00: cmd 61/58:78:d8:72:e7/00:00:8f:00:00/40 tag 15 ncq dma 45056 out
Mar 27 21:51:40 pve2 kernel: [ 1349.325047]          res 40/00:80:30:73:e7/00:00:8f:00:00/40 Emask 0x72 (host bus error)
Mar 27 21:51:40 pve2 kernel: [ 1349.325079] ata7.00: status: { DRDY }
Mar 27 21:51:40 pve2 kernel: [ 1349.325094] ata7.00: failed command: WRITE FPDMA QUEUED
Mar 27 21:51:40 pve2 kernel: [ 1349.325109] ata7.00: cmd 61/d0:80:30:73:e7/04:00:8f:00:00/40 tag 16 ncq dma 630784 out
Mar 27 21:51:40 pve2 kernel: [ 1349.325109]          res 40/00:80:30:73:e7/00:00:8f:00:00/40 Emask 0x72 (host bus error)
Mar 27 21:51:40 pve2 kernel: [ 1349.325140] ata7.00: status: { DRDY }
Mar 27 21:51:40 pve2 kernel: [ 1349.325157] ata7: hard resetting link
Mar 27 21:51:40 pve2 kernel: [ 1349.375033] ahci 0000:05:00.0: AHCI controller unavailable!
Mar 27 21:51:41 pve2 kernel: [ 1349.675685] ata8.00: exception Emask 0x73 SAct 0x800000 SErr 0xffffffff action 0xe frozen
Mar 27 21:51:41 pve2 kernel: [ 1349.675709] ata8.00: irq_stat 0xffffffff, unknown FIS 00000000 00000000 00000000 00000000, host bus
Mar 27 21:51:41 pve2 kernel: [ 1349.675726] ata8: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Mar 27 21:51:41 pve2 kernel: [ 1349.675753] ata8.00: failed command: WRITE FPDMA QUEUED
Mar 27 21:51:41 pve2 kernel: [ 1349.675766] ata8.00: cmd 61/b8:b8:f0:70:e7/01:00:8f:00:00/40 tag 23 ncq dma 225280 out
Mar 27 21:51:41 pve2 kernel: [ 1349.675766]          res 40/00:bc:f0:70:e7/00:00:8f:00:00/40 Emask 0x72 (host bus error)
Mar 27 21:51:41 pve2 kernel: [ 1349.675793] ata8.00: status: { DRDY }
Mar 27 21:51:41 pve2 kernel: [ 1349.675806] ata8: hard resetting link
Mar 27 21:51:41 pve2 kernel: [ 1349.725797] ahci 0000:05:00.0: AHCI controller unavailable!
Mar 27 21:51:43 pve2 kernel: [ 1351.199208] ata7: failed to resume link (SControl FFFFFFFF)
Mar 27 21:51:43 pve2 kernel: [ 1351.750470] ata7: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Mar 27 21:51:48 pve2 kernel: [ 1356.911883] ata7: hard resetting link
Mar 27 21:51:48 pve2 kernel: [ 1356.952522] ahci 0000:05:00.0: AHCI controller unavailable!
Mar 27 21:51:48 pve2 kernel: [ 1357.253183] ata8: failed to resume link (SControl FFFFFFFF)
Mar 27 21:51:49 pve2 kernel: [ 1357.804463] ata8: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Mar 27 21:51:54 pve2 kernel: [ 1363.056031] ata8: hard resetting link
Mar 27 21:51:54 pve2 kernel: [ 1363.096719] ahci 0000:05:00.0: AHCI controller unavailable!
Mar 27 21:51:56 pve2 kernel: [ 1364.610195] ata7: failed to resume link (SControl FFFFFFFF)
Mar 27 21:51:56 pve2 kernel: [ 1365.161462] ata7: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
[MACHINE IS FROZEN]
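
One detail that stands out in this log is the all-ones register values (SErr 0xffffffff, irq_stat 0xffffffff, SStatus/SControl FFFFFFFF). Reads that return all ones usually mean the device did not answer on the bus at all, which would fit the "AHCI controller unavailable!" messages. Below is a small Python sketch for flagging such lines in a kernel log. The function name and the choice of register names to look for are assumptions made for illustration.

```python
import re

# Register names as they appear in the libata messages above, followed by an
# all-ones 32-bit value with or without a 0x prefix.
ALL_ONES = re.compile(r"\b(SErr|irq_stat|SStatus|SControl)\s+(?:0x)?f{8}\b", re.I)


def all_ones_registers(line: str) -> list:
    """Return the names of registers reported as 0xffffffff on a log line."""
    return [m.group(1) for m in ALL_ONES.finditer(line)]


if __name__ == "__main__":
    lines = [
        "ata7.00: exception Emask 0x73 SAct 0x1c000 SErr 0xffffffff action 0xe frozen",
        "ata7: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)",
        "ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
    ]
    for line in lines:
        print(all_ones_registers(line), line)
```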