Bug 187221

Summary: HPSA resetting logical / reset logical
Product: IO/Storage
Component: SCSI
Reporter: Patrick Schaaf (kernelorg)
Assignee: linux-scsi (linux-scsi)
Status: NEW
Severity: normal
Priority: P1
CC: vsudoblog
Hardware: Intel
OS: Linux
Kernel Version: 4.4.x, 4.8.x
Subsystem:
Regression: No
Bisected commit-id:

Description Patrick Schaaf 2016-11-07 13:39:48 UTC
I have about 20 HP DL 380 (some 360) servers, from Gen7 to Gen9, using the HPSA driver with various smartarray controllers.

For a long time I've been running mainline 3.14 kernels, without any issues. Some time ago I updated to mainline 4.4.x, up to the most recent 4.4.30.

Now I noticed, especially on one server, but in the logs on 6 of them, the following kind of message:

2016-11-06T22:09:50.227592+01:00 HOST kernel: [68853.338610] hpsa 0000:03:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-5 SSDSmartPathCap- En- Exp=1
2016-11-06T22:10:18.713759+01:00 HOST kernel: [68881.832436] hpsa 0000:03:00.0: scsi 0:1:0:0: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-5 SSDSmartPathCap- En- Exp=1

I see such messages, _usually_ only with 1 second between resetting/reset, on machines with the following controller+controller firmware variants:
1 P410i 5.14
1 P420i 5.42
2 P440ar 3.02
1 P440ar 3.56
1 P440ar 4.02

The machine whose log lines are shown above is a P440ar with firmware 3.02. On that machine, unlike the others, the reset sometimes takes up to 20 seconds, and all I/O stalls in the meantime.
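To compare reset durations across machines, the two hpsa messages can be paired up by PCI address and SCSI device. The following is only a minimal sketch, assuming the syslog format shown above; the function name reset_durations and the SAMPLE list are made up for illustration, and the regular expression is fitted to just these two messages. It uses the kernel uptime stamp in square brackets, so it does not work across a reboot.

```python
import re

# Fitted to the two hpsa messages quoted in this report (an assumption:
# other kernel versions may word them differently).
LINE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+hpsa\s+(?P<pci>\S+):\s+"
    r"scsi\s+(?P<dev>\d+:\d+:\d+:\d+):\s+"
    r"(?P<event>resetting logical|reset logical\s+completed successfully)"
)


def reset_durations(lines):
    """Return (pci, scsi_device, seconds) for each completed reset."""
    pending = {}   # (pci, dev) -> uptime at which the reset started
    results = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        key = (m["pci"], m["dev"])
        uptime = float(m["uptime"])
        if m["event"].startswith("resetting"):
            pending[key] = uptime
        elif key in pending:
            started = pending.pop(key)
            results.append((key[0], key[1], round(uptime - started, 3)))
    return results


# The two log lines from this report, shortened after the message text.
SAMPLE = [
    "2016-11-06T22:09:50.227592+01:00 HOST kernel: [68853.338610] "
    "hpsa 0000:03:00.0: scsi 0:1:0:0: resetting logical  Direct-Access",
    "2016-11-06T22:10:18.713759+01:00 HOST kernel: [68881.832436] "
    "hpsa 0000:03:00.0: scsi 0:1:0:0: reset logical  completed successfully "
    "Direct-Access",
]

if __name__ == "__main__":
    for pci, dev, seconds in reset_durations(SAMPLE):
        print(f"{pci} scsi {dev}: reset took {seconds:.3f} s")
```

For the pair of lines quoted above, this reports a reset of roughly 28.5 seconds.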

I also tested with 4.8.x kernels and saw the same symptoms there. I'm fairly sure that I did not see these with 3.14 kernels. This morning I rebooted the most problematic box into 3.14.79, and so far it has been silent. I'll report if that changes.

Apart from these log lines, there is nothing strange to be found - no ILO or IML notifications visible, no other kernel messages, no drive failures, SMART alerts, or performance regressions...

Comment 1 Patrick Schaaf 2016-11-16 06:19:09 UTC
Some more info on my problematic machine / further diagnosing is in https://bugzilla.kernel.org/show_bug.cgi?id=187231

Summary: at least with the P440ar controllers, such 10-30 second "logical reset" episodes eventually reveal an underlying faulty drive, and go away when that drive is replaced.

But there is no up-front information in the "logical reset" that would permit pinpointing the drive on the first round.