Bug 204815

Summary: qla2xxx: firmware is not responding to mailbox commands
Product: SCSI Drivers Reporter: Roman Bolshakov (r.bolshakov)
Component: QLOGIC QLA2XXX Assignee: scsi_drivers-qla2xxx
Status: NEW ---    
Severity: high CC: mwilck
Priority: P1    
Hardware: PPC-64   
OS: Linux   
Kernel Version: 5.2-rc1 up to 5.3-rc8 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: firmware times out on 5.3-rc8
firmware behaves properly

Description Roman Bolshakov 2019-09-11 15:54:28 UTC
Created attachment 284925 [details]
firmware times out on 5.3-rc8

I'm using QLogic HBAs (QLE2560 and QLE2742) inside pseries guests on a ppc64le/POWER8 hypervisor, and they have been unusable since commit f8f97b0c5b7f7 ("scsi: qla2xxx: Cleanups for NVRAM/Flash read/write path").

The firmware stops responding to mailbox commands shortly after system boot completes.
That also triggers an EEH on the pseries machine, and the driver doesn't handle the EEH properly because the firmware is effectively unavailable. I disabled EEH inside the guest because it caused a deadlock in the host kernel.

The issue is fixed in linux-next by commit edbd56472a63 ("scsi: qla2xxx: qla2x00_alloc_fw_dump: set ha->eft"). I think it should be included in 5.3 if possible. It cherry-picks cleanly onto master.

The logs of 5.3-rc8 (bad.log) and 5.3-rc8 with edbd56472a63 (good.log) are attached.
Comment 1 Roman Bolshakov 2019-09-11 15:55:47 UTC
Created attachment 284927 [details]
firmware behaves properly
Comment 2 Martin Wilck 2019-09-13 15:04:02 UTC
Nice to hear that edbd56472a63 fixes your problem, but it was meant to fix a28d9e4ef997 ("scsi: qla2xxx: Add support for multiple fwdump templates/segments"), which was applied (directly) after f8f97b0c5b7f7.

Maybe your problem has been caused by a28d9e4ef997?
Comment 3 Roman Bolshakov 2019-09-13 16:03:11 UTC
Hi Martin,

I can't tell for sure, because f8f97b0c5b7f7 introduces a regression fixed in 1710ac17547ac8b ("scsi: qla2xxx: Fix read offset in qla24xx_load_risc_flash()"). 

Here's the possible timeline:
1. f8f97b0c5b7f7 ("scsi: qla2xxx: Cleanups for NVRAM/Flash read/write path") introduces a regression that prevents successful ISP firmware checksum validation, and the kernel panics shortly after.
2. a28d9e4ef997 ("scsi: qla2xxx: Add support for multiple fwdump templates/segments") introduces a regression that causes EEH and system lockup on POWER8 or makes firmware unavailable (this bug).
3. 1710ac17547ac8 ("scsi: qla2xxx: Fix read offset in qla24xx_load_risc_flash()") fixes f8f97b0c5b7f7, but the fix depends on both #1 and #2.
4. edbd56472a63 ("scsi: qla2xxx: qla2x00_alloc_fw_dump: set ha->eft") fixes a28d9e4ef997. 

It's not possible to bisect between #1 and #3 because of the panic introduced in #1. And firmware works reliably only after #4.
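
To make the masking effect concrete, here is a rough Python sketch of the timeline. It is only a toy model built from the four commits listed above: the function names (observed_state, walk_history) are made up for illustration, and it assumes that the panic from #1 hides the firmware timeout from #2 whenever both are present.

```python
# Toy model of the timeline above; not real kernel or git tooling.
# Commits in the order they landed, per the list in this comment.
HISTORY = [
    "f8f97b0c5b7f7",   # 1: NVRAM/Flash cleanups (introduces the panic)
    "a28d9e4ef997",    # 2: multiple fwdump templates (introduces this bug)
    "1710ac17547ac8",  # 3: fixes 1
    "edbd56472a63",    # 4: fixes 2
]


def observed_state(applied):
    """Return the symptom a tester would see with these commits applied.

    Assumption: the panic from #1 happens first and therefore hides
    the firmware timeout from #2.
    """
    applied = set(applied)
    if "f8f97b0c5b7f7" in applied and "1710ac17547ac8" not in applied:
        return "panic"
    if "a28d9e4ef997" in applied and "edbd56472a63" not in applied:
        return "firmware timeout"
    return "ok"


def walk_history():
    """Pair each point in history (tip commit) with its observed symptom."""
    return [
        (HISTORY[n - 1] if n else "base", observed_state(HISTORY[:n]))
        for n in range(len(HISTORY) + 1)
    ]


if __name__ == "__main__":
    for tip, state in walk_history():
        print(f"{tip:>15}: {state}")
```

In this model every point from #1 up to (but not including) #3 reports "panic", so a bisect for the firmware timeout has no testable commit between the last good one and #3.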

And I think it's important to include your fix in 5.3; otherwise qla2xxx is broken in the release.

Thanks,
Roman