Bug 199887 - Fibre login failure on older adapters
Summary: Fibre login failure on older adapters
Status: NEW
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: QLOGIC QLA2XXX
Hardware: x86-64 Linux
Importance: P1 high
Assignee: scsi_drivers-qla2xxx
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-30 11:59 UTC by Jur van der Burg
Modified: 2022-09-17 20:50 UTC
CC List: 3 users

See Also:
Kernel Version: 4.14.44
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
kernel log with error logging enabled (298.74 KB, text/plain)
2018-05-30 11:59 UTC, Jur van der Burg
kinda fix (1.89 KB, patch)
2022-08-28 19:54 UTC, Pavel Kankovsky

Description Jur van der Burg 2018-05-30 11:59:19 UTC
Created attachment 276269
kernel log with error logging enabled

I have a problem bringing up a Fibre Channel connection when using older QLogic controllers. Two boards that I tried have issues: a PCI board with a SP202 chipset and a PCI-X board with an ISP2312 chipset. PCI-e boards with an ISP2432 chipset are okay.

The problem is that the connection does not come up correctly, and the kernel logfile is flooded with messages like this as soon as the link is up:

May 24 14:16:02 venus kernel: [  474.647874] qla2xxx [0000:08:03.0]-500a:7: LOOP UP detected (2 Gbps).
May 24 14:16:02 venus kernel: [  475.153551] qla2xxx [0000:08:03.0]-5046:7: Async-gnlist failed - hdl=12 portid=d20200 status=30 mb0=4006 mb1=4000 mb2=0 mb6=0 mb7=0.
May 24 14:16:02 venus kernel: [  475.153589] qla2xxx [0000:08:03.0]-5046:7: Async-gnlist failed - hdl=13 portid=d20201 status=30 mb0=4006 mb1=4000 mb2=0 mb6=0 mb7=0.
May 24 14:16:02 venus kernel: [  475.153602] qla2xxx [0000:08:03.0]-5046:7: Async-gnlist failed - hdl=14 portid=d20202 status=30 mb0=4006 mb1=4000 mb2=0 mb6=0 mb7=0.
May 24 14:16:02 venus kernel: [  475.153615] qla2xxx [0000:08:03.0]-5046:7: Async-gnlist failed - hdl=15 portid=d20600 status=30 mb0=4006 mb1=4000 mb2=0 mb6=0 mb7=0.
May 24 14:16:02 venus kernel: [  475.153628] qla2xxx [0000:08:03.0]-5046:7: Async-gnlist failed - hdl=16 portid=d20700 status=30 mb0=4006 mb1=4000 mb2=0 mb6=0 mb7=0.
May 24 14:16:02 venus kernel: [  475.153641] qla2xxx [0000:08:03.0]-5046:7: Async-gnlist failed - hdl=17 portid=d20a00 status=30 mb0=4006 mb1=4000 mb2=0 mb6=0 mb7=0.
May 24 14:16:02 venus kernel: [  475.153653] qla2xxx [0000:08:03.0]-5046:7: Async-gnlist failed - hdl=18 portid=d20a01 status=30 mb0=4006 mb1=4000 mb2=0 mb6=0 mb7=0.

Kernels 4.4.128 and 4.9.6 have no issues. I see this problem starting with the 4.14.44 kernel; I'm not sure in which kernel it was introduced. The 4.16.11 kernel behaves slightly differently in that I get the error messages once per second, so that version does not flood the logfile but still fails.

I created an extended logfile with ql2xextended_error_logging enabled (see attachment).

The logfile was created on a SAN with multiple systems and one SAN storage controller. However, the problem can also be reproduced when only a single switch is connected to the adapter and nothing else.

Jur van der Burg
Comment 1 Jur van der Burg 2018-06-20 14:04:36 UTC
I have some more info.

First of all, the problem started with kernel V4.11, most likely this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=726b85487067d7f5b23495bc33c484b8517c4074

There is an issue in qla_isr.c: the routines qla2x00_mbx_iocb_entry and qla24xx_logio_entry always return success, even in case of an error.
In qla24xx_async_gnl_sp_done the status should be checked before calculating the amount of data transferred from the value returned in mbox 1, because that value is an error code (in this case 4006) if the request failed. Skipping the check leads to a lot of bogus messages like this:

  May 29 08:19:58 venus kernel: [  148.973869] qla2xxx [0000:08:03.0]-28e8:4: qla24xx_async_gnl_sp_done 00:00:00:00:00:00:00:00 00:00:00 state 0/0 lid 0
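
The check described above can be sketched as follows. This is a minimal model, not driver code: struct gnl_result, gnl_entry_count() and the entry size are hypothetical names and values introduced only for illustration. It assumes the mailbox 1 value is a byte count on success and must not be used as one on failure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a completed name-list request. */
struct gnl_result {
    int      status;  /* 0 on success, non-zero on failure */
    uint16_t mb1;     /* byte count on success, error detail otherwise */
};

/*
 * Return the number of complete entries in the response buffer.
 * When the request failed, mb1 is not a length, so report 0 entries
 * instead of walking a buffer that was never filled in.
 */
static size_t gnl_entry_count(const struct gnl_result *res, size_t entry_size)
{
    if (res->status != 0 || entry_size == 0)
        return 0;
    return res->mb1 / entry_size;
}
```

With the status tested first, a failed request yields zero entries, so no bogus all-zero port names would be logged for it.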

Correcting that leaves this fatal error in the case of an ISP23xx type adapter, which is key to the problem:

  Jun 20 08:24:11 venus kernel: [  609.676194] qla2xxx [0000:04:07.0]-5046:3: Async-gnlist failed - hdl=2 portid=010000 status=30 mb0=4006 mb1=4000 mb2=0 mb6=0 mb7=0.

This error (4006) in response to the command MBC_PORT_NODE_NAME_LIST translates to MBS_COMMAND_PARAMETER_ERROR and is only returned by an ISP23xx adapter; an ISP24xx works fine. So it looks like the old adapters do not understand the command or need other parameters. This may have something to do with the firmware, which is currently 3.03.28 (the latest I could find). Note that the command MBC_GET_PORT_DATABASE fails in the same way.
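
For reading the mb0 values in these log lines, a small lookup helper may be handy. It is only a sketch: mbs_name() is a hypothetical helper, not a driver function. The log prints the values in hexadecimal without a prefix. The 0x4006, 0x4007 and 0x4008 names are the ones mentioned in this report; the 0x4000 entry is an assumption based on the driver headers as I recall them.

```c
#include <stdint.h>

/* Map a mailbox 0 status value to its symbolic name (hypothetical helper). */
static const char *mbs_name(uint16_t mb0)
{
    switch (mb0) {
    case 0x4000: return "MBS_COMMAND_COMPLETE";        /* assumption, see text */
    case 0x4006: return "MBS_COMMAND_PARAMETER_ERROR";
    case 0x4007: return "MBS_PORT_ID_USED";
    case 0x4008: return "MBS_LOOP_ID_USED";
    default:     return "unknown";
    }
}
```

Any value not listed falls through to "unknown" rather than being guessed.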

It works with the pre-v4.11 kernels because these commands are never given to the device.

It's a pity that there is no technical manual available that documents the various calls that the device understands; it would make troubleshooting much easier.
Comment 2 Matthew Whitehead 2019-04-29 22:09:19 UTC
I have reproduced this problem with similar hardware, an ISP2300. The problem is not present using the 4.9.171 kernel, and it is present on the 4.14.114 kernel, so the suspicion that it was introduced in 4.11 is plausible.

I will attempt to test it on a similar ISP6312 next.
Comment 3 Michael Graham 2021-12-29 03:43:11 UTC
I think I ran into this today, trying to set up an ISP2100 based controller on the server in my basement. No error messages as far as I could tell, but on the 5.X kernel I was using the adapter just would not report any info about the attached drives. I tried out an old version of openSUSE using kernel 4.4.104 and it worked with no special configuration on my part (I did try a current version of openSUSE first, in which it was also broken). I'm fine using an old version of Linux for how I'm using this server for the time being, but it would be nice if there were a fix for newer kernels.
Comment 4 Pavel Kankovsky 2022-08-28 19:54:34 UTC
Created attachment 301697
kinda fix

I did some experiments with my old QLA2340 (ISP2312, fw 3.03.28) and the most recent stable kernel, i.e. 5.19.4.

"Async-gnlist" failures seem to be survivable and I decided to ignore them for the time being. In fact, the old driver in 4.9.325 was able to work without MBC_PORT_NODE_NAME_LIST. There was a function issuing that command, namely qla2x00_get_node_name_list(), but AFAICT it was never called.

"Async-gpdb" failures are a real problem because they trigger session deletion (qla24xx_handle_gpdb_event() gets an invalid zero login state).

As far as I can tell, the new asynchronous implementation provides correct parameters to MBC_GET_PORT_DATABASE (compare qla24xx_async_gpdb() with qla2x00_get_port_database(), HAS_EXTENDED_IDS is true for ISP2312) but
1. the adapter cannot handle the request when it receives it via the IOCB interface, and
2. the driver would not be able to handle returned data anyway because their format is completely different on old non-IS_FWI2_CAPABLE adapters (compare qla24xx_handle_gpdb_event() with the final part of qla2x00_get_port_database()).

I tried replacing the new code with a small wrapper around a call to the old qla2x00_get_port_database() sending the request synchronously via the mbox interface... and it worked! The driver was able to finish logins and access available FC targets. See the attached patch.
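
The idea of routing old adapters to the synchronous path can be modeled like this. It is not the attached patch, only a self-contained sketch of the decision: struct hba_model, enum gpdb_path and choose_gpdb_path() are hypothetical names, and the fwi2_capable flag merely stands in for the driver's IS_FWI2_CAPABLE test.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real driver structures are far larger. */
enum gpdb_path {
    GPDB_ASYNC_IOCB,  /* new asynchronous request via the IOCB interface */
    GPDB_SYNC_MBOX    /* old synchronous request via the mbox interface */
};

struct hba_model {
    bool fwi2_capable;  /* stands in for IS_FWI2_CAPABLE() */
};

/*
 * Pick how to fetch the port database. Old adapters can neither accept
 * the request via an IOCB nor return data in the newer format, so they
 * are sent down the synchronous mailbox path.
 */
static enum gpdb_path choose_gpdb_path(const struct hba_model *hba)
{
    return hba->fwi2_capable ? GPDB_ASYNC_IOCB : GPDB_SYNC_MBOX;
}
```

In the real driver the synchronous branch would still have to feed its result into the login state machine, which this model leaves out entirely.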

That said, it is a horrible hack done by someone almost totally ignorant of the inner workings of the driver. There is absolutely no guarantee. It might crash your kernel. It might fail to handle some (newly connected?) remote ports. It might brick your adapter. It might wipe all your disk arrays. It might summon the Elder Gods. You have been warned.
Comment 5 Pavel Kankovsky 2022-09-17 20:50:46 UTC
Some additional findings:

1. It turns out qla2x00_get_node_name_list() was introduced in 3.5 and was called from qla_target.c until 3.11, when the call was removed. The function then remained unused until its own removal in 4.11.

I have not tested whether it would work on an old HBA, but it is far from certain (its result was an array of "struct qla_port_24xx_data", corresponding to "struct get_name_list" in recent versions), and even if it did, it would not help much (there seem to be two variants of MBC_PORT_NODE_NAME_LIST: the old function invoked the variant providing less data, while the current code needs the variant providing more data, "struct get_name_list_extended").

2. The driver is sometimes unable to relogin when an old HBA reconnects to the fabric because "Async-login" keeps failing with 4007, i.e. MBS_PORT_ID_USED. It turns out qla24xx_handle_plogi_done_event expects an offending loopid in ea->iop[1] but qla2x00_mbx_iocb_entry stores the value in ea->data[1].

(A similar problem occurs during the handling of 4008, i.e. MBS_LOOP_ID_USED, when qla24xx_handle_plogi_done_event expects an offending portid in ea->iop[1] but it is not stored anywhere. But the driver seems to be able to recover in this case.)

3. Newer HBAs seem to use the same command (MBC_LOGIN_FABRIC_PORT) for both fabric and private loop port login but old HBAs need a different command (MBC_LOGIN_LOOP_PORT) in the latter case. See qla2x00_local_device_login.
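
The field mismatch from item 2 can be reproduced in a tiny model. Everything here is hypothetical and reduced for illustration: struct event_model, its field widths and the three helpers are assumptions, not the driver's definitions. Only the array names data and iop come from the description above.

```c
#include <stdint.h>

/* Hypothetical, reduced event structure; only the two arrays matter here. */
struct event_model {
    uint16_t data[2];
    uint32_t iop[2];
};

/* Models the completion path: the conflicting loop ID lands in data[1]. */
static void store_conflict(struct event_model *ea, uint16_t loop_id)
{
    ea->data[1] = loop_id;
}

/* Models the handler as described: it looks for the loop ID in iop[1]. */
static uint16_t read_conflict_as_handler(const struct event_model *ea)
{
    return (uint16_t)ea->iop[1];
}

/* One possible repair: read the field that was actually written. */
static uint16_t read_conflict_fixed(const struct event_model *ea)
{
    return ea->data[1];
}
```

In this model the handler always sees 0 regardless of what was stored, which would match the symptom of a relogin that keeps failing with the same status. Whether the better fix is to change the writer or the reader is a driver design question that the sketch does not answer.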
