Bug 204815 - qla2xxx: firmware is not responding to mailbox commands
Summary: qla2xxx: firmware is not responding to mailbox commands
Status: NEW
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: QLOGIC QLA2XXX
Hardware: PPC-64 Linux
Importance: P1 high
Assignee: scsi_drivers-qla2xxx
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-11 15:54 UTC by Roman Bolshakov
Modified: 2019-09-13 16:03 UTC

See Also:
Kernel Version: 5.2-rc1 up to 5.3-rc8
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
firmware times out on 5.3-rc8 (49.35 KB, application/x-xz)
2019-09-11 15:54 UTC, Roman Bolshakov
firmware behaves properly (49.61 KB, application/x-xz)
2019-09-11 15:55 UTC, Roman Bolshakov

Description Roman Bolshakov 2019-09-11 15:54:28 UTC
Created attachment 284925
firmware times out on 5.3-rc8

I'm using QLogic HBAs (QLE2560 and QLE2742) inside pseries guests on a ppc64le/POWER8 hypervisor, and they have been unusable since commit f8f97b0c5b7f7 ("scsi: qla2xxx: Cleanups for NVRAM/Flash read/write path").

The firmware stops responding to mailbox commands shortly after system boot completes.
That also triggers an EEH on the pseries machine, and the driver doesn't handle the EEH properly because the firmware is effectively unavailable. I disabled EEH inside the guest because it caused a deadlock in the host kernel.

The issue is fixed in linux-next by commit edbd56472a63 ("scsi: qla2xxx: qla2x00_alloc_fw_dump: set ha->eft"). I think it should be included in 5.3 if possible. It cherry-picks cleanly onto master.

The logs of 5.3-rc8 (bad.log) and of 5.3-rc8 with edbd56472a63 applied (good.log) are attached.
Comment 1 Roman Bolshakov 2019-09-11 15:55:47 UTC
Created attachment 284927
firmware behaves properly
Comment 2 Martin Wilck 2019-09-13 15:04:02 UTC
Nice to hear that edbd56472a63 fixes your problem, but it was meant to fix a28d9e4ef997 ("scsi: qla2xxx: Add support for multiple fwdump templates/segments"), which was applied (directly) after f8f97b0c5b7f7.

Maybe your problem has been caused by a28d9e4ef997?
Comment 3 Roman Bolshakov 2019-09-13 16:03:11 UTC
Hi Martin,

I can't tell for sure, because f8f97b0c5b7f7 introduces a regression fixed in 1710ac17547ac8b ("scsi: qla2xxx: Fix read offset in qla24xx_load_risc_flash()"). 

Here's the possible timeline:
1. f8f97b0c5b7f7 ("scsi: qla2xxx: Cleanups for NVRAM/Flash read/write path") introduces a regression which prevents successful ISP firmware checksum validation, and the kernel panics shortly after.
2. a28d9e4ef997 ("scsi: qla2xxx: Add support for multiple fwdump templates/segments") introduces a regression which causes an EEH and system lockup on POWER8, or makes the firmware unavailable (this bug).
3. 1710ac17547ac8 ("scsi: qla2xxx: Fix read offset in qla24xx_load_risc_flash()") fixes f8f97b0c5b7f7, but the fix depends on both #1 and #2.
4. edbd56472a63 ("scsi: qla2xxx: qla2x00_alloc_fw_dump: set ha->eft") fixes a28d9e4ef997. 

It's not possible to bisect between #1 and #3 because of the panic introduced in #1, and the firmware works reliably only after #4.
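
To make the overlap in that timeline easier to follow, here is a minimal Python sketch that tracks which regressions are present after each commit. It is only a model of the four steps listed above: the defect labels `boot_panic` and `fw_unresponsive` are hypothetical names chosen for illustration, and the sketch assumes the commits land in exactly the listed order and ignores any unrelated commits in between.

```python
# Simplified model of the timeline above, oldest commit first.
# Each entry is (abbreviated commit id, defect introduced, defect fixed).
# The defect labels are hypothetical names used only in this sketch.
HISTORY = [
    ("f8f97b0c5b7f7", "boot_panic", None),        # step 1
    ("a28d9e4ef997", "fw_unresponsive", None),    # step 2
    ("1710ac17547ac8", None, "boot_panic"),       # step 3
    ("edbd56472a63", None, "fw_unresponsive"),    # step 4
]


def defects_after_each_commit(history):
    """Return (commit, defects present once that commit is applied) pairs."""
    active = set()
    states = []
    for commit, introduces, fixes in history:
        if introduces is not None:
            active.add(introduces)
        if fixes is not None:
            active.discard(fixes)
        states.append((commit, frozenset(active)))
    return states


if __name__ == "__main__":
    for commit, defects in defects_after_each_commit(HISTORY):
        print(commit, sorted(defects) or "no known defects")
```

In this model the panic is present at steps 1 and 2, so a kernel built at either point cannot be classified as good or bad for the mailbox problem. Only the mailbox regression remains at step 3, and no known defect remains at step 4.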

I think it's important to include your fix in 5.3; otherwise qla2xxx will be broken in the release.

Thanks,
Roman
