Bug 187231
Summary: | kernel panic during hpsa MSI plus tg3 MSI | |
---|---|---|---
Product: | IO/Storage | Reporter: | Patrick Schaaf (kernelorg)
Component: | SCSI | Assignee: | linux-scsi (linux-scsi)
Status: | RESOLVED UNREPRODUCIBLE | |
Severity: | normal | CC: | don.brace
Priority: | P1 | |
Hardware: | All | |
OS: | Linux | |
Kernel Version: | 4.8.6 | Subsystem: |
Regression: | No | Bisected commit-id: |
Attachments: | kernel 4.8.6 .config; Patch to correct resets | |
Description
Patrick Schaaf
2016-11-07 13:53:06 UTC
Created attachment 243811
Patch to correct resets
I will be uploading this patch to linux-scsi this week.
I am attaching the patch in case you would like to test this patch now.
Thanks Don for the reaction!

Right now, on the box that had that panic and the worst resetting/reset issues (see the other bug I linked), I'm back to 3.14.79, and want to stay there for another 24 to 36 hours, to see that this issue was not present with that kernel series.

What would your patch help with? Specifically the panic potential in case a logical device reset is ongoing? Or should it affect / remedy the mysterious (to me) "resetting logical" events in the first place?

I'm willing to test patches on that box starting Thursday, but I'd like to understand a bit better what we are dealing with here.

(In reply to Patrick Schaaf from comment #2)
> Thanks Don for the reaction!
>
> Right now, on the box that had that panic and the worst resetting/reset
> issues (see the other bug I linked), I'm back to 3.14.79, and want to stay
> there for another 24 to 36 hours, to see that this issue was not present
> with that kernel series.
>
> What would your patch help with? Specifically the panic potential in case a
> logical device reset is ongoing? Or should it affect / remedy the mysterious
> (to me) "resetting logical" events in the first place?
>
> I'm willing to test patches on that box starting Thursday, but I'd like to
> understand a bit better what we are dealing with here.

The specific issue that this patch addresses is that during a reset, complete_scsi_command returns without having called scsi_done, which causes the OS to offline the disk (after two more occurrences). But this code path is not often followed, so the issue does not happen with all resets.

There are some other recently applied patches that should also be tested. From git format-patch:

0457-scsi-hpsa-Check-for-null-device-pointers.patch
* This checks for a NULL device that can happen if the OS off-lines the disk because of the aforementioned reset issue.

0460-scsi-hpsa-Check-for-null-devices-in-ioaccel-submissi.patch
0462-scsi-hpsa-correct-call-to-hpsa_do_reset.patch
* Fine tunes resets into LOGICAL/Physical resets.

A patch I still have pending on linux-scsi:

0464-hpsa-add-generate-controller-NMI-on-lockup.patch
* This patch just adds more granularity on lock-up detection.

It would be nice to know why the reset is happening in the first place.

That problematic box, which showed the kernel panic with 4.8.6, and the resetting/reset-up-to-20-seconds pauses several times a day with both 4.8 and 4.4.x, has now been running on 3.14.79 (with the same kvm load as before) for 30 hours, without any such HPSA resetting symptoms, or untoward pauses in the VMs that I could otherwise notice in monitoring.

So somehow 3.14 does not trigger these episodes, or so it seems.

After almost 4 days, my problematic box, downgraded to 3.14.79, finally made some noise, like this:

2016-11-11T03:31:10.608539+01:00 kvm3f kernel: [320020.727691] hpsa 0000:03:00.0: Abort request on C0:B0:T0:L0
2016-11-11T03:31:10.608555+01:00 kvm3f kernel: [320020.728175] hpsa 0000:03:00.0: cp ffff8868f2c17000 is reported invalid (probably means target device no longer present)
2016-11-11T03:31:10.608557+01:00 kvm3f kernel: [320020.728796] hpsa 0000:03:00.0: cp ffff8868f2c17000 is reported invalid (probably means target device no longer present)
2016-11-11T03:31:10.608558+01:00 kvm3f kernel: [320020.729389] hpsa 0000:03:00.0: FAILED abort on device C0:B0:T0:L0
2016-11-11T03:31:10.608560+01:00 kvm3f kernel: [320020.729708] hpsa 0000:03:00.0: resetting device 0:0:0:0
2016-11-11T03:31:26.968534+01:00 kvm3f kernel: [320037.081397] hpsa 0000:03:00.0: device is ready.

So, maybe there is a somewhat weirdly faulty drive in that array, which otherwise does not show any (SMART / ILO logs) symptoms...
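For anyone wanting to quantify these pauses from their own logs, here is a minimal sketch that pairs each "resetting device" line with the following "device is ready" line and reports the gap, using the kernel timestamps in square brackets. The function name reset_stalls and the regular expression are my own illustration, not part of any hpsa tooling, and the sketch assumes the message wording shown in the 3.14.79 excerpt above; other kernel versions word these messages differently (for example "resetting logical"), so the string matches would need adjusting.

```python
import re

# Kernel timestamp in brackets, then "hpsa <pci address>: <message>".
# The pattern is an assumption based on the log excerpt quoted in this bug.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+hpsa\s+(?P<pci>\S+):\s+(?P<msg>.*)"
)


def reset_stalls(lines):
    """Return (pci_address, seconds) for each reset/ready pair found."""
    pending = {}  # pci address -> timestamp of the last "resetting device"
    stalls = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an hpsa kernel message
        ts, pci, msg = float(m["ts"]), m["pci"], m["msg"]
        if msg.startswith("resetting device"):
            pending[pci] = ts
        elif msg.startswith("device is ready") and pci in pending:
            stalls.append((pci, ts - pending.pop(pci)))
    return stalls


# The two relevant lines from the excerpt above.
SAMPLE = [
    "2016-11-11T03:31:10.608560+01:00 kvm3f kernel: [320020.729708] "
    "hpsa 0000:03:00.0: resetting device 0:0:0:0",
    "2016-11-11T03:31:26.968534+01:00 kvm3f kernel: [320037.081397] "
    "hpsa 0000:03:00.0: device is ready.",
]

for pci, seconds in reset_stalls(SAMPLE):
    print(f"{pci}: reset took {seconds:.2f} s")
```

On the two sample lines this reports a stall of roughly 16 seconds, in line with the up-to-20-second pauses mentioned earlier. Feeding it a whole syslog file (for example via open(path) as the lines argument) should give a rough count of how often and how long the controller was held in reset.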
After several more such Abort request / reset sequences with 3.14.79, two days ago the box _finally_ announced that one of its 8 drives has a SMART "predictive failure"; after swapping that drive for a spare, the symptoms are no longer seen.

This is the third or fourth time, over the last year, that I've seen Gen9 servers with P440ar cards behave that way.

Anyway, my immediate test case is gone, so I'll close this as RESOLVED / unreproducible...