Bug 206123
| Summary: | aacraid (PM8068) and iommu=nobypass Frozen PHB error on ppc64 | | |
|---|---|---|---|
| Product: | SCSI Drivers | Reporter: | gyakovlev |
| Component: | AACRAID | Assignee: | scsi_drivers-aacraid |
| Status: | RESOLVED PATCH_ALREADY_AVAILABLE | | |
| Severity: | normal | CC: | cam, oohall, sagar.biradar, tpearson |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/ | | |
| Kernel Version: | 5.4.8 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | full dmesg | | |
**Description** (gyakovlev, 2020-01-08 05:59:58 UTC)

Created attachment 286681: full dmesg
Also, I have to add that attaching disks directly to the controller ports (bypassing the backplane) makes no difference.

**Timothy Pearson (comment #3):**

If I'm decoding this right, the EEH is caused by a PCIe configuration space write, triggering a correctable error in the PCIe core. I have no way of knowing if the address reported is valid (I suspect it is not), but it would be 0x0.

**Oliver O'Halloran (comment #4):**

(In reply to Timothy Pearson from comment #3)
> If I'm decoding this right, the EEH is caused by a PCIe configuration space
> write, triggering a correctable error in the PCIe core. [...]

```
$ pest 8300b03800000000 8000000000000000
Transaction type: DMA Read Response
Invalid MMIO Address
TCE Page Fault
TCE Access Fault
LEM Bit Number 56
Requestor 00:0.0
MSI Data 0x0000
Fault Address = 0x0000000000000000
```

A TCE fault makes more sense given that it doesn't happen when bypass is enabled. I'm leaning towards this being a driver bug, but it could be a powerpc IOMMU specific issue. I'll investigate.

Bug 207359 may potentially be a duplicate of this one, so perhaps some of the info there could be useful.
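For anyone trying to reproduce this: the fault only appears once the IOMMU bypass window is disabled, i.e. when all adapter DMA is forced through the TCE tables rather than the 64-bit direct-mapping window. A minimal sketch of setting that up, assuming a GRUB-managed boot with an existing `GRUB_CMDLINE_LINUX="..."` line (petitboot and other loaders take the same kernel argument):

```sh
# Add iommu=nobypass to the kernel command line. This disables the 64-bit
# bypass window, forcing device DMA through the IOMMU/TCE tables, which is
# the configuration this bug reproduces under.
sed -i 's/^GRUB_CMDLINE_LINUX="/&iommu=nobypass /' /etc/default/grub
grub-mkconfig -o /boot/grub/grub.cfg && reboot

# After reboot, confirm the bypass window is actually off (compare the log
# excerpt later in this thread), then watch for EEH freezes under load:
dmesg | grep -i 'bypass window'   # expect "PowerNV: IOMMU bypass window disabled."
dmesg | grep EEH
```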
**gyakovlev:**

(In reply to Oliver O'Halloran from comment #4)
> A TCE fault makes more sense given that it doesn't happen when bypass is
> enabled. [...]

Tried linux 5.6.10; it now happens right at boot, but at least the controller reset seems to be working now. Before, a reboot was needed to access the disks again.

```
[May  6 01:10] PowerNV: IOMMU bypass window disabled.
...
[   24.609683] Adaptec aacraid driver 1.2.1[50983]-custom
[   24.609784] aacraid 0002:01:00.0: enabling device (0140 -> 0142)
[   24.628036] aacraid: Comm Interface type3 enabled
...
[   25.661962] EEH: Recovering PHB#2-PE#fd
[   25.662010] EEH: PE location: UOPWR.A100034-Node0-Builtin SAS, PHB location: N/A
[   25.662097] EEH: Frozen PHB#2-PE#fd detected
[   25.662145] EEH: Call Trace:
[   25.662186] EEH: [(____ptrval____)] __eeh_send_failure_event+0x60/0x110
[   25.662282] EEH: [(____ptrval____)] eeh_dev_check_failure+0x360/0x5f0
[   25.662373] EEH: [(____ptrval____)] eeh_check_failure+0x98/0x100
[   25.666794] EEH: [(____ptrval____)] aac_src_check_health+0x8c/0xc0
[   25.669770] EEH: [(____ptrval____)] aac_command_thread+0x718/0x930
[   25.672745] EEH: [(____ptrval____)] kthread+0x180/0x190
[   25.675719] EEH: [(____ptrval____)] ret_from_kernel_thread+0x5c/0x6c
[   25.678722] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[   25.681822] EEH: Notify device drivers to shutdown
[   25.684910] EEH: Beginning: 'error_detected(IO frozen)'
[   25.688007] PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->error_detected(IO frozen)
[   25.688011] aacraid 0002:01:00.0: aacraid: PCI error detected 2
[   25.695317] PCI 0002:01:00.0#00fd: EEH: aacraid driver reports: 'need reset'
[   25.695320] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[   25.695325] EEH: Collect temporary log
[   25.695354] EEH: of node=0002:01:00.0
[   25.695358] EEH: PCI device/vendor: 028d9005
[   25.695361] EEH: PCI cmd/status register: 00100146
[   25.695362] EEH: PCI-E capabilities and status follow:
[   25.695376] EEH: PCI-E 00: 00020010 000081a2 00002950 00437083
[   25.695387] EEH: PCI-E 10: 10820000 00000000 00000000 00000000
[   25.695389] EEH: PCI-E 20: 00000000
[   25.695391] EEH: PCI-E AER capability register set follows:
[   25.695404] EEH: PCI-E AER 00: 30020001 00000000 00400000 00462030
[   25.695415] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
[   25.695426] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[   25.695430] EEH: PCI-E AER 30: 00000000 00000000
[   25.695432] PHB4 PHB#2 Diag-data (Version: 1)
[   25.695434] brdgCtl:    00000002
[   25.695436] RootSts:    00000040 00402000 e0820008 00100107 00000000
[   25.695438] PhbSts:     0000001c00000000 0000001c00000000
[   25.695440] Lem:        0000000000000080 0000000000000000 0000000000000080
[   25.695443] PhbErr:     0000020000000000 0000020000000000 2148000098000240 a008400000000000
[   25.695445] RxeTceErr:  6000000000000000 2000000000000000 40000000000000fd 0000000000000000
[   25.695450] PE[0fd] A/B: 8000b03800000000 8000000000000000
[   25.695453] EEH: Reset without hotplug activity
...
aacraid 0002:01:00.0: enabling device (0140 -> 0142)
[ 1392.284584276,3] PHB#0002[0:2]: brdgCtl = 00000002
[ 1392.284685636,3] PHB#0002[0:2]: deviceStatus = 00000040
[ 1392.284739080,3] PHB#0002[0:2]: slotStatus = 00402000
[ 1392.284804382,3] PHB#0002[0:2]: linkStatus = e0820008
[ 1392.284857805,3] PHB#0002[0:2]: devCmdStatus = 00100107
[ 1392.284899389,3] PHB#0002[0:2]: devSecStatus = 00000000
[ 1392.284948786,3] PHB#0002[0:2]: rootErrorStatus = 00000000
[ 1392.285006352,3] PHB#0002[0:2]: corrErrorStatus = 00000000
[ 1392.285055882,3] PHB#0002[0:2]: uncorrErrorStatus = 00000000
[ 1392.285113499,3] PHB#0002[0:2]: devctl = 00000040
[ 1392.285162880,3] PHB#0002[0:2]: devStat = 00000000
[ 1392.285224300,3] PHB#0002[0:2]: tlpHdr1 = 00000000
[ 1392.285285888,3] PHB#0002[0:2]: tlpHdr2 = 00000000
[ 1392.285355027,3] PHB#0002[0:2]: tlpHdr3 = 00000000
[ 1392.285404499,3] PHB#0002[0:2]: tlpHdr4 = 00000000
[ 1392.285473783,3] PHB#0002[0:2]: sourceId = 00000000
[ 1392.285523293,3] PHB#0002[0:2]: nFir = 0000000000000000
[ 1392.285599065,3] PHB#0002[0:2]: nFirMask = 0030001c00000000
[ 1392.285658870,3] PHB#0002[0:2]: nFirWOF = 0000000000000000
[ 1392.285718721,3] PHB#0002[0:2]: phbPlssr = 0000001c00000000
[ 1392.285778426,3] PHB#0002[0:2]: phbCsr = 0000001c00000000
[ 1392.285834260,3] PHB#0002[0:2]: lemFir = 0000000000000080
[ 1392.285894227,3] PHB#0002[0:2]: lemErrorMask = 0000000000000000
[ 1392.285954146,3] PHB#0002[0:2]: lemWOF = 0000000000000080
[ 1392.286017988,3] PHB#0002[0:2]: phbErrorStatus = 0000020000000000
[ 1392.286085562,3] PHB#0002[0:2]: phbFirstErrorStatus = 0000020000000000
[ 1392.286145499,3] PHB#0002[0:2]: phbErrorLog0 = 2148000098000240
[ 1392.286205500,3] PHB#0002[0:2]: phbErrorLog1 = a008400000000000
[ 1392.286265282,3] PHB#0002[0:2]: phbTxeErrorStatus = 0000000000000000
[ 1392.286328808,3] PHB#0002[0:2]: phbTxeFirstErrorStatus = 0000000000000000
[ 1392.286388242,3] PHB#0002[0:2]: phbTxeErrorLog0 = 0000000000000000
[ 1392.286448308,3] PHB#0002[0:2]: phbTxeErrorLog1 = 0000000000000000
[ 1392.286508132,3] PHB#0002[0:2]: phbRxeArbErrorStatus = 0000000000000000
[ 1392.286568068,3] PHB#0002[0:2]: phbRxeArbFrstErrorStatus = 0000000000000000
[ 1392.286623656,3] PHB#0002[0:2]: phbRxeArbErrorLog0 = 0000000000000000
[ 1392.286683206,3] PHB#0002[0:2]: phbRxeArbErrorLog1 = 0000000000000000
[ 1392.286743009,3] PHB#0002[0:2]: phbRxeMrgErrorStatus = 0000000000000000
[ 1392.286802898,3] PHB#0002[0:2]: phbRxeMrgFrstErrorStatus = 0000000000000000
[ 1392.286862689,3] PHB#0002[0:2]: phbRxeMrgErrorLog0 = 0000000000000000
[ 1392.286922435,3] PHB#0002[0:2]: phbRxeMrgErrorLog1 = 0000000000000000
[ 1392.286982236,3] PHB#0002[0:2]: phbRxeTceErrorStatus = 6000000000000000
[ 1392.287042233,3] PHB#0002[0:2]: phbRxeTceFrstErrorStatus = 2000000000000000
[ 1392.287101957,3] PHB#0002[0:2]: phbRxeTceErrorLog0 = 40000000000000fd
[ 1392.287161569,3] PHB#0002[0:2]: phbRxeTceErrorLog1 = 0000000000000000
[ 1392.287221038,3] PHB#0002[0:2]: phbPblErrorStatus = 0000000000000000
[ 1392.287280741,3] PHB#0002[0:2]: phbPblFirstErrorStatus = 0000000000000000
[ 1392.287336316,3] PHB#0002[0:2]: phbPblErrorLog0 = 0000000000000000
[ 1392.287407731,3] PHB#0002[0:2]: phbPblErrorLog1 = 0000000000000000
[ 1392.287479365,3] PHB#0002[0:2]: phbPcieDlpErrorLog1 = 0000000000000000
[ 1392.287550878,3] PHB#0002[0:2]: phbPcieDlpErrorLog2 = 0000000000000000
[ 1392.287622331,3] PHB#0002[0:2]: phbPcieDlpErrorStatus = 0000000000000000
[ 1392.287682208,3] PHB#0002[0:2]: phbRegbErrorStatus = 0040000000000000
[ 1392.287741819,3] PHB#0002[0:2]: phbRegbFirstErrorStatus = 0000000000000000
[ 1392.287801590,3] PHB#0002[0:2]: phbRegbErrorLog0 = 4800003c00000000
[ 1392.287861285,3] PHB#0002[0:2]: phbRegbErrorLog1 = 0000000000000200
[ 1392.287921850,3] PHB#0002[0:2]: PEST[0fd] = 8000b03800000000 8000000000000000
EEH: Beginning: 'slot_reset'
PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->slot_reset()
aacraid 0002:01:00.0: aacraid: PCI error - slot reset
PCI 0002:01:00.0#00fd: EEH: aacraid driver reports: 'recovered'
EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'
PCI 0002:01:00.0#00fd: EEH: Invoking aacraid->resume()
```

**Sagar Biradar:**

Hi @gyakovlev@gentoo.org,

Is this issue still observed? I tried to reproduce it, but so far no luck; I haven't run into this issue. Could you please confirm whether it still persists?

Thanks,
Sagar

**Oliver O'Halloran (comment #8):**

Can you see if this patch fixes it?

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/

**gyakovlev:**

Hi! Thanks. Yes, I will test again sometime this week, sooner rather than later.

**gyakovlev (comment #10):**

(In reply to Oliver O'Halloran from comment #8)
> Can you see if this patch fixes it?
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/

OK, I applied this patch to linux-5.4.63.

Looks very good so far: the kernel booted with 'iommu=nobypass' and I don't see any problems with aacraid yet; it works. I can write to all 8 SAS disks in parallel and can't trigger the error.

I'll try to generate torture/heavy random IO on the disks a bit later. Also, I may give linux-5.8.8 a try.
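For readers following along, a sketch of one way to apply the linked fix to a stable tree. The mbox URL relies on patchwork's convention of appending `mbox/` to a patch page; the tree and file names are illustrative:

```sh
# Fetch the fix from patchwork as a raw mbox and apply it to the tree
# under test, then rebuild with the existing configuration.
cd linux-5.4.63
wget -O iommu-fixup.mbox \
    "https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200908015106.79661-1-aik@ozlabs.ru/mbox/"
git am iommu-fixup.mbox          # in a git checkout
# patch -p1 < iommu-fixup.mbox   # equivalent for a plain source tarball
make olddefconfig && make -j"$(nproc)"
```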
**Sagar Biradar:**

(In reply to gyakovlev from comment #10)
> OK, I applied this patch to linux-5.4.63.
> Looks very good so far [...]

Hi, could you please post your findings with the heavy IO load? Also, thanks to Oliver for adding the reference to the potential patch. Appreciate it.

Thanks,
Sagar

**gyakovlev (comment #12):**

Applied the patch to linux-5.4.64, booted with iommu=nobypass, and ran some stress-ng tests across all drives. Looks good so far: I wrote close to 1TB of test data without a sign of a problem, and performance is excellent.

Also booted linux-5.8.8 (without the patch); just a tiny bit of IO triggers the error, the controller does not recover after the reset, and everything hangs. I have not tested a patched 5.8.8.

The conclusion is that the patch Oliver linked definitely helps, and the system is stable and performant. Hopefully it'll make it into 5.4 and 5.8; I'll be patching manually in the meantime.

Tested-by: Georgy Yakovlev <gyakovlev@gentoo.org>
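For reference, a load generator along the lines of the testing described above. Device names, sizes, and durations are illustrative placeholders, and the `dd` loop writes directly to raw disks, so it must only be pointed at scratch devices:

```sh
# Parallel sequential writes across all eight SAS disks (DESTRUCTIVE:
# overwrites the named block devices). 100 GB per disk, ~800 GB total,
# roughly matching the "close to 1TB" run reported in this thread.
for disk in /dev/sd{a..h}; do
    dd if=/dev/zero of="$disk" bs=1M count=100000 oflag=direct &
done
wait

# Heavier mixed I/O with stress-ng against a filesystem on the array;
# --hdd starts N workers doing write/read/sync patterns.
stress-ng --hdd 8 --hdd-bytes 128G --timeout 30m --metrics-brief
```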
**Sagar Biradar:**

(In reply to gyakovlev from comment #12)
> The conclusion is that the patch Oliver linked definitely helps, and the
> system is stable and performant. [...]

Thanks for the prompt response, Georgy. Appreciate it.

Does that mean you will patch it one more time on 5.8.8, and based on the result we can consider closing this BZ?

Sagar

**gyakovlev:**

Hi Sagar, testing on 5.8 is a bit problematic for me, because some things on that system require a 5.4 kernel. The patch applies to 5.8, and I assume it'll work just fine, but I have no plans to test it. I meant that I'll keep patching my kernels until the fix is backported to the release versions of Linux.

Thanks all for your help and attention; from my point of view this bug can definitely be closed.

**Sagar Biradar:**

Hi Georgy, thanks for your response and efforts on this. Also, thanks to Oliver for pointing to the right patch. I am closing this one since we are no longer seeing the issue.

Sagar

**Sagar Biradar:**

Hi Georgy, I cannot resolve and mark this BZ "CLOSED" since it is not assigned to me. Could you please mark it closed, since you are the reporter?

Thanks,
Sagar

**gyakovlev:**

Sure, closing. Thanks again.