Bug 187231 - kernel panic during hpsa MSI plus tg3 MSI
Summary: kernel panic during hpsa MSI plus tg3 MSI
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: linux-scsi@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-07 13:53 UTC by Patrick Schaaf
Modified: 2016-11-16 06:17 UTC
CC List: 1 user

See Also:
Kernel Version: 4.8.6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel 4.8.6 .config (72.93 KB, application/octet-stream)
2016-11-07 13:53 UTC, Patrick Schaaf
Patch to correct resets (2.17 KB, patch)
2016-11-07 15:25 UTC, Don

Description Patrick Schaaf 2016-11-07 13:53:06 UTC
Created attachment 243801 [details]
kernel 4.8.6 .config

I'm not sure whether this is a SCSI / HPSA bug or a networking / tg3 driver bug; both show up in the stack dump. As the trigger seems to be HPSA, I'm reporting it as a SCSI issue here...

I've been recently attempting to run mainline 4.8.x kernels, most recently 4.8.6, on our production HP DL 380 Intel servers.

On several of them there is a related issue, reported in https://bugzilla.kernel.org/show_bug.cgi?id=187221, where the HPSA driver on some of the hosts sometimes resets the logical device. I had already seen that with 4.4.x kernels, and saw it again with 4.8.6.

Now, specifically with 4.8.6, on the box which has the worst of these symptoms, I _additionally_ experienced multiple full kernel panics. The same box (with the same hpsa reset symptoms) had been running 4.4.x kernels before without such kernel panics. The panics then happened multiple times, with about a day in between.

On the last round I had the ILO SSH console running under screen with logging enabled, and was able to retrieve the following panic backtrace:

[187283.903173] hpsa 0000:03:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-5 SSDSmartPathCap- En- Exp=1               
[187314.331375] sd 0:1:0:0: rejecting I/O to offline device                     
[187314.413441] sd 0:1:0:0: rejecting I/O to offline device                     
[187314.854183] sd 0:1:0:0: rejecting I/O to offline device                     
... lots of these ...
[187328.991285] sd 0:1:0:0: rejecting I/O to offline device                     
[187328.991389] sd 0:1:0:0: rejecting I/O to offline device                     
[187329.190166] sd 0:1:0:0: rejecting I/O to offline device                     
[187329.271304]  ffff88bd1a7e8000 ffff88bd1a7be500 ffff88bd7f483eb8 ffffffff8143493f
[187329.271304] Call Trace:                                                     
[187329.271310]  <IRQ>                                                          
[187329.271310]  [<ffffffffa002e332>] ? tg3_poll_msix+0xc2/0x160 [tg3]          
[187329.271311]  [<ffffffff8143493f>] do_hpsa_intr_msi+0x8f/0x1c0               
[187329.271314]  [<ffffffff81148c46>] __handle_irq_event_percpu+0x66/0xe0       
[187329.271315]  [<ffffffff81148cde>] handle_irq_event_percpu+0x1e/0x50         
[187329.271316]  [<ffffffff81148d37>] handle_irq_event+0x27/0x50                
[187329.271318]  [<ffffffff8114bda5>] handle_edge_irq+0x65/0x140                
[187329.271320]  [<ffffffff81057255>] handle_irq+0x15/0x20                      
[187329.271321]  [<ffffffff81057086>] do_IRQ+0x46/0xd0                          
[187329.271324]  [<ffffffff816dc4fc>] common_interrupt+0x7c/0x7c                
[187329.271325]  <EOI>                                                          
[187329.271338] Code: 53 48 89 fb 48 83 ec 28 4c 8b a7 5c 02 00 00 4c 8b bf 40 02 00 00 4c 8b b7 38 02 00 00 4c 8b af 4c 02 00 00 49 8b 04 24 4c 89 e7 <48> 8b 80 98 00 00 00 48 89 45 c0 49 8b 87 d0 01 00 00 48 89 45
[187329.271339] RIP  [<ffffffff81431417>] complete_scsi_command+0x37/0x8c0      
[187329.271339]  RSP <ffff88bd7f483e38>                                         
[187329.271339] CR2: 0000000000000098                                           
[187329.271341] ---[ end trace 52898916f0da5c53 ]---                            
[187329.273413] Kernel panic - not syncing: Fatal exception in interrupt        
[187330.308465] Shutting down cpus with NMI                                     
[187330.308471] Kernel Offset: disabled                                         
[187330.919173] Rebooting in 300 seconds..  

I'll attach my kernel .config.

As this is a production system, and so far the panics have only hit under our usual production load (webserver and DB kvm machines), there's not much testing or bisecting I can do, but I didn't want to leave the issue unreported, either.

Hope this helps somebody. If there is any more info I can provide, just ask what would be useful.

(I'm back to running 4.4.x)
Comment 1 Don 2016-11-07 15:25:11 UTC
Created attachment 243811 [details]
Patch to correct resets

I will be uploading this patch to linux-scsi this week.

I am attaching the patch here in case you would like to test it now.
Comment 2 Patrick Schaaf 2016-11-07 15:45:01 UTC
Thanks Don for the reaction!

Right now, on the box that had that panic and the worst resetting/reset issues (see the other bug I linked), I'm back to 3.14.79, and want to stay there for another 24 to 36 hours, to see that this issue was not present with that kernel series.

What would your patch help with? Specifically the panic potential in case a logical device reset is ongoing? Or should it affect / remedy the mysterious (to me) "resetting logical" events in the first place?

I'm willing to test patches on that box starting Thursday, but I'd like to understand a bit better what we are dealing with here.
Comment 3 Don 2016-11-07 16:16:05 UTC
(In reply to Patrick Schaaf from comment #2)
> Thanks Don for the reaction!
> 
> Right now, on the box that had that panic and the worst resetting/reset
> issues (see the other bug I linked), I'm back to 3.14.79, and want to stay
> there for another 24 to 36 hours, to see that this issue was not present
> with that kernel series.
> 
> What would your patch help with? Specifically the panic potential in case a
> logical device reset is ongoing? Or should it affect / remedy the mysterious
> (to me) "resetting logical" events in the first place?
> 
> I'm willing to test patches on that box starting Thursday, but I'd like to
> understand a bit better what we are dealing with here.

The specific issue that this patch addresses is that, during a reset, complete_scsi_command returns without having called scsi_done, which causes the OS to offline the disk (after two more occurrences). But this code path is not often taken, so the issue does not happen with all resets.
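
For illustration only -- this is not the actual hpsa source or the attached patch, and the names ("hpsa_cmd", "reset_pending", "complete_cmd_sketch") are hypothetical stand-ins -- a minimal C sketch of the failure mode described above and the shape of the fix:

#include <scsi/scsi.h>          /* DID_RESET */
#include <scsi/scsi_cmnd.h>     /* struct scsi_cmnd */

struct hpsa_cmd {               /* hypothetical stand-in for the driver's command struct */
        struct scsi_cmnd *scsi_cmd;
        bool reset_pending;
};

static void complete_cmd_sketch(struct hpsa_cmd *c)
{
        struct scsi_cmnd *cmd = c->scsi_cmd;

        if (c->reset_pending) {
                /*
                 * The bug: returning here without calling cmd->scsi_done()
                 * leaves the command dangling; after a few such occurrences
                 * the midlayer gives up and offlines the disk.
                 *
                 * The fix: always complete the command, e.g. with DID_RESET,
                 * so the midlayer can retry it once the reset finishes.
                 */
                cmd->result = DID_RESET << 16;
                cmd->scsi_done(cmd);
                return;
        }
        /* normal completion path continues here ... */
}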

There are some other recently applied patches that should also be tested.

From git format-patch:
0457-scsi-hpsa-Check-for-null-device-pointers.patch
    * This checks for a NULL device, which can happen if the OS
      off-lines the disk because of the aforementioned reset issue
      (a minimal sketch of such a check follows below).
0460-scsi-hpsa-Check-for-null-devices-in-ioaccel-submissi.patch
0462-scsi-hpsa-correct-call-to-hpsa_do_reset.patch
    * Fine-tunes resets into logical/physical resets.

A patch I still have pending on linux-scsi
0464-hpsa-add-generate-controller-NMI-on-lockup.patch
    * This patch just adds more granularity to lock-up detection.
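
For the NULL-device check mentioned for 0457 above, a rough C sketch of what such a guard looks like in a queuecommand path (illustrative only; the function and variable names are approximations, not the actual patch):

#include <scsi/scsi.h>          /* DID_NO_CONNECT */
#include <scsi/scsi_cmnd.h>     /* struct scsi_cmnd */
#include <scsi/scsi_device.h>   /* struct scsi_device (hostdata) */
#include <scsi/scsi_host.h>     /* struct Scsi_Host */

/* Sketch only: fail cleanly if the OS has already offlined the device. */
static int queuecommand_sketch(struct Scsi_Host *sh, struct scsi_cmnd *cmd)
{
        void *dev = cmd->device->hostdata;  /* driver's per-device state; NULL once offlined */

        if (!dev) {
                cmd->result = DID_NO_CONNECT << 16;
                cmd->scsi_done(cmd);
                return 0;
        }
        /* ... normal command submission would follow ... */
        return 0;
}

Without such a guard, dereferencing a NULL per-device pointer in interrupt context faults at a small offset, which would be consistent with the CR2: 0000000000000098 seen in the panic above.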

It would be nice to know why the reset is happening in the first place.
Comment 4 Patrick Schaaf 2016-11-08 16:01:26 UTC
That problematic box, which showed the kernel panic with 4.8.6 and the resetting/reset pauses of up to 20 seconds several times a day with both 4.8 and 4.4.x, has now been running on 3.14.79 (with the same kvm load as before) for 30 hours, without any such HPSA resetting symptoms or untoward pauses in the VMs that I could otherwise notice in monitoring.

So somehow 3.14 does not trigger these episodes, or so it seems.
Comment 5 Patrick Schaaf 2016-11-11 09:54:23 UTC
After almost 4 days, my problematic box, downgraded to 3.14.79, finally made some noise, like this:

2016-11-11T03:31:10.608539+01:00 kvm3f kernel: [320020.727691] hpsa 0000:03:00.0: Abort request on C0:B0:T0:L0
2016-11-11T03:31:10.608555+01:00 kvm3f kernel: [320020.728175] hpsa 0000:03:00.0: cp ffff8868f2c17000 is reported invalid (probably means target device no longer present)
2016-11-11T03:31:10.608557+01:00 kvm3f kernel: [320020.728796] hpsa 0000:03:00.0: cp ffff8868f2c17000 is reported invalid (probably means target device no longer present)
2016-11-11T03:31:10.608558+01:00 kvm3f kernel: [320020.729389] hpsa 0000:03:00.0: FAILED abort on device C0:B0:T0:L0
2016-11-11T03:31:10.608560+01:00 kvm3f kernel: [320020.729708] hpsa 0000:03:00.0: resetting device 0:0:0:0
2016-11-11T03:31:26.968534+01:00 kvm3f kernel: [320037.081397] hpsa 0000:03:00.0: device is ready.

So, maybe there is a somewhat weirdly faulty drive in that array, which otherwise does not show any symptoms (in SMART or the ILO logs)...
Comment 6 Patrick Schaaf 2016-11-16 06:17:38 UTC
After several more such Abort request / reset sequences with 3.14.79, two days ago the box _finally_ announced that one of its 8 drives had a SMART "predictive failure"; after swapping that drive for a spare, the symptoms are no longer seen.

This is the third or fourth time, over the last year, that I've seen Gen9 servers with P440ar cards behave that way.

Anyway, my immediate test case is gone, so I'll close this as RESOLVED / unreproducible...
