Bug 206123
| Summary: | aacraid (PM8068) and iommu=nobypass Frozen PHB error on ppc64 | | |
|---|---|---|---|
| Product: | SCSI Drivers | Reporter: | gyakovlev |
| Component: | AACRAID | Assignee: | scsi_drivers-aacraid |
| Status: | RESOLVED PATCH_ALREADY_AVAILABLE | | |
| Severity: | normal | CC: | cam, oohall, sagar.biradar, tpearson |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/ | | |
| Kernel Version: | 5.4.8 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | full dmesg | | |
**Description** (gyakovlev, 2020-01-08 05:59:58 UTC)

Created attachment 286681: full dmesg
Also, I have to add that attaching disks directly to the controller ports (bypassing the backplane) makes no difference.

**Timothy Pearson (comment #3):**

If I'm decoding this right, the EEH is caused by a PCIe configuration space write, triggering a correctable error in the PCIe core. I have no way of knowing if the address reported is valid (I suspect it is not), but it would be 0x0.

**Oliver O'Halloran (comment #4):**

(In reply to Timothy Pearson from comment #3)
> If I'm decoding this right, the EEH is caused by a PCIe configuration space
> write, triggering a correctable error in the PCIe core. [...]

```
$ pest 8300b03800000000 8000000000000000
Transaction type: DMA Read Response
Invalid MMIO Address
TCE Page Fault
TCE Access Fault
LEM Bit Number 56
Requestor 00:0.0
MSI Data 0x0000
Fault Address = 0x0000000000000000
```

A TCE fault makes more sense given that it doesn't happen when bypass is enabled. I'm leaning towards this being a driver bug, but it could be a powerpc IOMMU specific issue. I'll investigate.

Bug 207359 may potentially be a duplicate of this one, so perhaps some of the info there could be useful.
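For anyone trying to reproduce this: the fault only appears once the IOMMU bypass window is disabled, i.e. when all adapter DMA is forced through the TCE tables rather than the 64-bit direct-mapping window. A minimal sketch of setting that up, assuming a GRUB-managed boot with an existing `GRUB_CMDLINE_LINUX="..."` line (petitboot and other loaders take the same kernel argument):

```sh
# Add iommu=nobypass to the kernel command line. This disables the 64-bit
# bypass window, forcing device DMA through the IOMMU/TCE tables, which is
# the configuration this bug reproduces under.
sed -i 's/^GRUB_CMDLINE_LINUX="/&iommu=nobypass /' /etc/default/grub
grub-mkconfig -o /boot/grub/grub.cfg && reboot

# After reboot, confirm the bypass window is actually off (compare the log
# excerpt later in this thread), then watch for EEH freezes under load:
dmesg | grep -i 'bypass window'   # expect "PowerNV: IOMMU bypass window disabled."
dmesg | grep EEH
```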
**gyakovlev:**

(In reply to Oliver O'Halloran from comment #4)
> A TCE fault makes more sense given that it doesn't happen when bypass is
> enabled. [...]

Tried linux 5.6.10; it now happens right at boot, but at least the controller reset seems to be working now. Before, a reboot was needed to access the disks again.

```
[May  6 01:10] PowerNV: IOMMU bypass window disabled.
...
[   24.609683] Adaptec aacraid driver 1.2.1[50983]-custom
[   24.609784] aacraid 0002:01:00.0: enabling device (0140 -> 0142)
[   24.628036] aacraid: Comm Interface type3 enabled
...
[   25.661962] EEH: Recovering PHB#2-PE#fd
[   25.662010] EEH: PE location: UOPWR.A100034-Node0-Builtin SAS, PHB location: N/A
[   25.662097] EEH: Frozen PHB#2-PE#fd detected
[   25.662145] EEH: Call Trace:
[   25.662186] EEH: [(____ptrval____)] __eeh_send_failure_event+0x60/0x110
[   25.662282] EEH: [(____ptrval____)] eeh_dev_check_failure+0x360/0x5f0
[   25.662373] EEH: [(____ptrval____)] eeh_check_failure+0x98/0x100
[   25.666794] EEH: [(____ptrval____)] aac_src_check_health+0x8c/0xc0
[   25.669770] EEH: [(____ptrval____)] aac_command_thread+0x718/0x930
[   25.672745] EEH: [(____ptrval____)] kthread+0x180/0x190
[   25.675719] EEH: [(____ptrval____)] ret_from_kernel_thread+0x5c/0x6c
[   25.678722] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[   25.681822] EEH: Notify device drivers to shutdown
[   25.684910] EEH: Beginning: 'error_detected(IO frozen)'
[   25.688007] PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->error_detected(IO frozen)
[   25.688011] aacraid 0002:01:00.0: aacraid: PCI error detected 2
[   25.695317] PCI 0002:01:00.0#00fd: EEH: aacraid driver reports: 'need reset'
[   25.695320] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[   25.695325] EEH: Collect temporary log
[   25.695354] EEH: of node=0002:01:00.0
[   25.695358] EEH: PCI device/vendor: 028d9005
[   25.695361] EEH: PCI cmd/status register: 00100146
[   25.695362] EEH: PCI-E capabilities and status follow:
[   25.695376] EEH: PCI-E 00: 00020010 000081a2 00002950 00437083
[   25.695387] EEH: PCI-E 10: 10820000 00000000 00000000 00000000
[   25.695389] EEH: PCI-E 20: 00000000
[   25.695391] EEH: PCI-E AER capability register set follows:
[   25.695404] EEH: PCI-E AER 00: 30020001 00000000 00400000 00462030
[   25.695415] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
[   25.695426] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[   25.695430] EEH: PCI-E AER 30: 00000000 00000000
[   25.695432] PHB4 PHB#2 Diag-data (Version: 1)
[   25.695434] brdgCtl:    00000002
[   25.695436] RootSts:    00000040 00402000 e0820008 00100107 00000000
[   25.695438] PhbSts:     0000001c00000000 0000001c00000000
[   25.695440] Lem:        0000000000000080 0000000000000000 0000000000000080
[   25.695443] PhbErr:     0000020000000000 0000020000000000 2148000098000240 a008400000000000
[   25.695445] RxeTceErr:  6000000000000000 2000000000000000 40000000000000fd 0000000000000000
[   25.695450] PE[0fd] A/B: 8000b03800000000 8000000000000000
[   25.695453] EEH: Reset without hotplug activity
...
aacraid 0002:01:00.0: enabling device (0140 -> 0142)
[ 1392.284584276,3] PHB#0002[0:2]: brdgCtl = 00000002
[ 1392.284685636,3] PHB#0002[0:2]: deviceStatus = 00000040
[ 1392.284739080,3] PHB#0002[0:2]: slotStatus = 00402000
[ 1392.284804382,3] PHB#0002[0:2]: linkStatus = e0820008
[ 1392.284857805,3] PHB#0002[0:2]: devCmdStatus = 00100107
[ 1392.284899389,3] PHB#0002[0:2]: devSecStatus = 00000000
[ 1392.284948786,3] PHB#0002[0:2]: rootErrorStatus = 00000000
[ 1392.285006352,3] PHB#0002[0:2]: corrErrorStatus = 00000000
[ 1392.285055882,3] PHB#0002[0:2]: uncorrErrorStatus = 00000000
[ 1392.285113499,3] PHB#0002[0:2]: devctl = 00000040
[ 1392.285162880,3] PHB#0002[0:2]: devStat = 00000000
[ 1392.285224300,3] PHB#0002[0:2]: tlpHdr1 = 00000000
[ 1392.285285888,3] PHB#0002[0:2]: tlpHdr2 = 00000000
[ 1392.285355027,3] PHB#0002[0:2]: tlpHdr3 = 00000000
[ 1392.285404499,3] PHB#0002[0:2]: tlpHdr4 = 00000000
[ 1392.285473783,3] PHB#0002[0:2]: sourceId = 00000000
[ 1392.285523293,3] PHB#0002[0:2]: nFir = 0000000000000000
[ 1392.285599065,3] PHB#0002[0:2]: nFirMask = 0030001c00000000
[ 1392.285658870,3] PHB#0002[0:2]: nFirWOF = 0000000000000000
[ 1392.285718721,3] PHB#0002[0:2]: phbPlssr = 0000001c00000000
[ 1392.285778426,3] PHB#0002[0:2]: phbCsr = 0000001c00000000
[ 1392.285834260,3] PHB#0002[0:2]: lemFir = 0000000000000080
[ 1392.285894227,3] PHB#0002[0:2]: lemErrorMask = 0000000000000000
[ 1392.285954146,3] PHB#0002[0:2]: lemWOF = 0000000000000080
[ 1392.286017988,3] PHB#0002[0:2]: phbErrorStatus = 0000020000000000
[ 1392.286085562,3] PHB#0002[0:2]: phbFirstErrorStatus = 0000020000000000
[ 1392.286145499,3] PHB#0002[0:2]: phbErrorLog0 = 2148000098000240
[ 1392.286205500,3] PHB#0002[0:2]: phbErrorLog1 = a008400000000000
[ 1392.286265282,3] PHB#0002[0:2]: phbTxeErrorStatus = 0000000000000000
[ 1392.286328808,3] PHB#0002[0:2]: phbTxeFirstErrorStatus = 0000000000000000
[ 1392.286388242,3] PHB#0002[0:2]: phbTxeErrorLog0 = 0000000000000000
[ 1392.286448308,3] PHB#0002[0:2]: phbTxeErrorLog1 = 0000000000000000
[ 1392.286508132,3] PHB#0002[0:2]: phbRxeArbErrorStatus = 0000000000000000
[ 1392.286568068,3] PHB#0002[0:2]: phbRxeArbFrstErrorStatus = 0000000000000000
[ 1392.286623656,3] PHB#0002[0:2]: phbRxeArbErrorLog0 = 0000000000000000
[ 1392.286683206,3] PHB#0002[0:2]: phbRxeArbErrorLog1 = 0000000000000000
[ 1392.286743009,3] PHB#0002[0:2]: phbRxeMrgErrorStatus = 0000000000000000
[ 1392.286802898,3] PHB#0002[0:2]: phbRxeMrgFrstErrorStatus = 0000000000000000
[ 1392.286862689,3] PHB#0002[0:2]: phbRxeMrgErrorLog0 = 0000000000000000
[ 1392.286922435,3] PHB#0002[0:2]: phbRxeMrgErrorLog1 = 0000000000000000
[ 1392.286982236,3] PHB#0002[0:2]: phbRxeTceErrorStatus = 6000000000000000
[ 1392.287042233,3] PHB#0002[0:2]: phbRxeTceFrstErrorStatus = 2000000000000000
[ 1392.287101957,3] PHB#0002[0:2]: phbRxeTceErrorLog0 = 40000000000000fd
[ 1392.287161569,3] PHB#0002[0:2]: phbRxeTceErrorLog1 = 0000000000000000
[ 1392.287221038,3] PHB#0002[0:2]: phbPblErrorStatus = 0000000000000000
[ 1392.287280741,3] PHB#0002[0:2]: phbPblFirstErrorStatus = 0000000000000000
[ 1392.287336316,3] PHB#0002[0:2]: phbPblErrorLog0 = 0000000000000000
[ 1392.287407731,3] PHB#0002[0:2]: phbPblErrorLog1 = 0000000000000000
[ 1392.287479365,3] PHB#0002[0:2]: phbPcieDlpErrorLog1 = 0000000000000000
[ 1392.287550878,3] PHB#0002[0:2]: phbPcieDlpErrorLog2 = 0000000000000000
[ 1392.287622331,3] PHB#0002[0:2]: phbPcieDlpErrorStatus = 0000000000000000
[ 1392.287682208,3] PHB#0002[0:2]: phbRegbErrorStatus = 0040000000000000
[ 1392.287741819,3] PHB#0002[0:2]: phbRegbFirstErrorStatus = 0000000000000000
[ 1392.287801590,3] PHB#0002[0:2]: phbRegbErrorLog0 = 4800003c00000000
[ 1392.287861285,3] PHB#0002[0:2]: phbRegbErrorLog1 = 0000000000000200
[ 1392.287921850,3] PHB#0002[0:2]: PEST[0fd] = 8000b03800000000 8000000000000000
EEH: Beginning: 'slot_reset'
PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->slot_reset()
aacraid 0002:01:00.0: aacraid: PCI error - slot reset
PCI 0002:01:00.0#00fd: EEH: aacraid driver reports: 'recovered'
EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'
PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->resume()
```

**Sagar Biradar:**

Hi @gyakovlev@gentoo.org,

Is this issue still observed? I tried to reproduce it, but so far no luck; I haven't run into this issue. Could you please confirm whether it still persists?

Thanks,
Sagar

**Oliver O'Halloran (comment #8):**

Can you see if this patch fixes it?

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/

**gyakovlev:**

Hi! Thanks. Yes, I will test again sometime this week, sooner rather than later.

**gyakovlev (comment #10):**

(In reply to Oliver O'Halloran from comment #8)
> Can you see if this patch fixes it?
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/

OK, I applied this patch to linux-5.4.63.

Looks very good so far: the kernel booted with 'iommu=nobypass' and I don't see any problems with aacraid yet; it works. I can write to all 8 SAS disks in parallel and can't trigger the error.

I'll try to generate torture/heavy random IO on the disks a bit later. Also, I may give linux-5.8.8 a try.
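For readers following along, a sketch of one way to apply the linked fix to a stable tree. The mbox URL relies on patchwork's convention of appending `mbox/` to a patch page; the tree and file names are illustrative:

```sh
# Fetch the fix from patchwork as a raw mbox and apply it to the tree
# under test, then rebuild with the existing configuration.
cd linux-5.4.63
wget -O iommu-fixup.mbox \
    "https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/mbox/"
git am iommu-fixup.mbox          # in a git checkout
# patch -p1 < iommu-fixup.mbox   # equivalent for a plain source tarball
make olddefconfig && make -j"$(nproc)"
```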
**Sagar Biradar:**

(In reply to gyakovlev from comment #10)
> OK, I applied this patch to linux-5.4.63.
> Looks very good so far [...]

Hi, could you please post your findings with the heavy IO load? Also, thanks to Oliver for adding the reference to the potential patch. Appreciate it.

Thanks,
Sagar

**gyakovlev (comment #12):**

Applied the patch to linux-5.4.64, booted with iommu=nobypass, and ran some stress-ng tests across all drives. Looks good so far: I wrote close to 1TB of test data without a sign of a problem, and performance is excellent.

Also booted linux-5.8.8 (without the patch); just a tiny bit of IO triggers the error, the controller does not recover after the reset, and everything hangs. I have not tested a patched 5.8.8.

The conclusion is that the patch Oliver linked definitely helps, and the system is stable and performant. Hopefully it'll make it into 5.4 and 5.8; I'll be patching manually in the meantime.

Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>
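For reference, a load generator along the lines of the testing described above. Device names, sizes, and durations are illustrative placeholders, and the `dd` loop writes directly to raw disks, so it must only be pointed at scratch devices:

```sh
# Parallel sequential writes across all eight SAS disks (DESTRUCTIVE:
# overwrites the named block devices). 100 GB per disk, ~800 GB total,
# roughly matching the "close to 1TB" run reported in this thread.
for disk in /dev/sd{a..h}; do
    dd if=/dev/zero of="$disk" bs=1M count=100000 oflag=direct &
done
wait

# Heavier mixed I/O with stress-ng against a filesystem on the array;
# --hdd starts N workers doing write/read/sync patterns.
stress-ng --hdd 8 --hdd-bytes 128G --timeout 30m --metrics-brief
```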
**Sagar Biradar:**

(In reply to gyakovlev from comment #12)
> The conclusion is that the patch Oliver linked definitely helps, and the
> system is stable and performant. [...]

Thanks for the prompt response, Georgy. Appreciate it.

Does that mean you will patch it one more time on 5.8.8, and based on the result we can consider closing this BZ?

Sagar

**gyakovlev:**

Hi Sagar, testing on 5.8 is a bit problematic for me, because some things on that system require a 5.4 kernel. The patch applies to 5.8, and I assume it'll work just fine, but I have no plans to test it. I meant that I'll keep patching my kernels until the fix is backported to the release versions of Linux.

Thanks all for your help and attention; from my point of view this bug can definitely be closed.

**Sagar Biradar:**

Hi Georgy, thanks for your response and efforts on this. Also, thanks to Oliver for pointing to the right patch. I am closing this one since we are no longer seeing the issue.

Sagar

**Sagar Biradar:**

Hi Georgy, I cannot resolve and mark this BZ "CLOSED" since it is not assigned to me. Could you please mark it closed, since you are the reporter?

Thanks,
Sagar

**gyakovlev:**

Sure, closing. Thanks again.