Bug 187221 - HPSA resetting logical / reset logical
Summary: HPSA resetting logical / reset logical
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: Intel Linux
Importance: P1 normal
Assignee: linux-scsi@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-07 13:39 UTC by Patrick Schaaf
Modified: 2021-03-29 02:20 UTC

See Also:
Kernel Version: 4.4.x, 4.8.x
Subsystem:
Regression: No
Bisected commit-id:


Description Patrick Schaaf 2016-11-07 13:39:48 UTC
I have about 20 HP DL 380 (some 360) servers, from Gen7 to Gen9, using the HPSA driver with various smartarray controllers.

For a long time I've been running mainline 3.14 kernels, without any issues. Some time ago I updated to mainline 4.4.x, up to the most recent 4.4.30.

Now I noticed, especially on one server, but in the logs on 6 of them, the following kind of message:

2016-11-06T22:09:50.227592+01:00 HOST kernel: [68853.338610] hpsa 0000:03:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-5 SSDSmartPathCap- En- Exp=1
2016-11-06T22:10:18.713759+01:00 HOST kernel: [68881.832436] hpsa 0000:03:00.0: scsi 0:1:0:0: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-5 SSDSmartPathCap- En- Exp=1

I see such messages, _usually_ with only 1 second between "resetting logical" and "reset logical completed", on machines with the following controller and firmware combinations (number of machines, controller model, firmware version):
1 P410i 5.14
1 P420i 5.42
2 P440ar 3.02
1 P440ar 3.56
1 P440ar 4.02

The machine whose log lines are shown above is a P440ar with firmware 3.02. Unlike the other machines, it sometimes takes up to 20 seconds to complete the reset, and all I/O stalls in the meantime.
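
To compare stall times across machines, the "resetting logical" / "reset logical completed" pairs can be matched up and timed from the kernel timestamps in brackets. The following is only a rough sketch: the regular expression is inferred from the two sample lines quoted above, not from the hpsa driver source, and `reset_durations` is a hypothetical helper name.

```python
import re

# Pattern inferred from the two sample log lines in this report only;
# other hpsa versions or syslog formats may need a different expression.
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \S+ kernel: \[\s*(?P<uptime>\d+\.\d+)\] "
    r"hpsa (?P<pci>\S+): scsi (?P<dev>[\d:]+): "
    r"(?P<event>resetting logical|reset logical\s+completed successfully)"
)


def reset_durations(lines):
    """Yield (pci, dev, start_timestamp, seconds) for each completed reset.

    Durations are taken from the kernel uptime stamps in square brackets,
    so they are unaffected by syslog delivery delays.
    """
    pending = {}  # (pci, dev) -> (syslog timestamp, uptime at reset start)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        key = (m["pci"], m["dev"])
        uptime = float(m["uptime"])
        if m["event"].startswith("resetting"):
            pending[key] = (m["ts"], uptime)
        elif key in pending:
            ts, start = pending.pop(key)
            yield key[0], key[1], ts, uptime - start
```

Feeding it the lines of a syslog file, for example `for r in reset_durations(open(path)): print(r)` with `path` pointing at wherever kernel messages are logged, lists one entry per reset. For the two lines quoted above it reports roughly 28.5 seconds.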

I also tested with 4.8.x kernels and saw the same symptoms there. I'm fairly sure that I did not see these with 3.14 kernels. This morning I rebooted the most problematic box into 3.14.79, and so far it has been silent. I'll report if that changes.

Apart from these log lines, there is nothing strange to be found - no ILO or IML notifications visible, no other kernel messages, no drive failures, SMART alerts, or performance regressions...
Comment 1 Patrick Schaaf 2016-11-16 06:19:09 UTC
Some more info on my problematic machine / further diagnosing is in https://bugzilla.kernel.org/show_bug.cgi?id=187231

Summary: at least with the P440ar controllers, such 10-30 second "logical reset" episodes eventually reveal an underlying faulty drive, and go away when that drive is replaced.

But there is no up-front information in the "logical reset" that would permit pinpointing the drive on the first round.
