Bug 219128 - lpfc driver reports failed messages
Summary: lpfc driver reports failed messages
Status: ASSIGNED
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI (show other bugs)
Hardware: Intel Linux
: P3 normal
Assignee: Xavier
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-08-06 03:15 UTC by Xavier
Modified: 2024-09-12 23:34 UTC (History)
2 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output log for upstream kernel 6.10 (4.64 MB, text/plain)
2024-08-06 03:25 UTC, Xavier
Details
kernel 6.11.0-rc1+ with "lpfc.lpfc_enable_fc4_type=1" log (245.08 KB, text/plain)
2024-08-16 02:56 UTC, Xavier
Details

Description Xavier 2024-08-06 03:15:56 UTC
After installing Ubuntu 24.04 in a SAN disk via "Emulex LPe35000 32Gb 1-Port Fibre Channel Adapter", the lpfc driver show many errors, like:

lpfc 0000:41:00.0: 53: [ 250.627110] 0:(0):0220 FDMI cmd failed FS_RJT Data: x211
lpfc 0000:41:00.0: 0:(0):2754 PRLI failure DID:030E00 Status:x9/x32a00, data: x4 x0 x80000000
lpfc 0000:41:00.0: 25: [ 247.537454] 0:(0):9024 FCP command xa3 failed: x0 SNS x0 x0 Data: x8 xfdc x0 x0 x0
lpfc 0000:27:00.0: 0:(0):2753 PLOGI failure DID:040F00 Status:x3/x103
kernel: rport-13:0-5: blocked FC remote port time out: removing rport

Reproduce Steps:
  1. Install Ubunut 24.04 - can intall successfully
  2. Boot into OS
  3. Check /var/log/syslog, found lpfc relevance errors.
  4. Update to the upstream latest v6.10 kernel, the lpfc errors still there.

 Expected results:
 No lpfc driver related failures or errors.

 Actual results:
 Many lpfc driver failed messages was recorded.

Configuration:
LENOVO SR630V3 - Intel EGS
XCC: ESX323N
UEFI:ESE123F
UEFI Setting:Default

os:ubuntu-24.04-live-server-amd64-GA-0426
cpu: Intel(R) Xeon(R) Platinum 8460Y+*2
mem:M321R2GA3BB6-CQKEG*2
Emulex LPe35000 32Gb 1-Port Fibre Channel Adapter fw:  14.2.673.40
Intel Ethernet Server Adapter I350-T4 fw:  1.3450.0
SAN - Lenovo DM7000

Additional info:
  1. Although there're those failures, we can read/write date to SAN disks.
  2. Intel Birch Stream platform has this lpfc driver issues as well.
Comment 1 Xavier 2024-08-06 03:25:58 UTC
Created attachment 306673 [details]
dmesg output log for upstream kernel 6.10

Errors still exist with 6.10 kernel, the attachment file is the kernel boot log with parameter "lpfc.lpfc_log_verbose=4115 log_buf_len=32M"
Comment 2 Xavier 2024-08-06 03:27:14 UTC
Link Ubuntu launchpad issue here at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2073571
Comment 3 justin.tee 2024-08-06 16:57:14 UTC
Hi Xavier,

Broadcom (Emulex) will have a look at this reported bug.

Please reassign to justin.tee@broadcom.com , and we will have a look at the dmesg log.

Thanks,
Justin
Comment 4 Xavier 2024-08-07 02:06:11 UTC
Hi, Justin
Thanks for your quick reply.
I found I don't have permission to change the bug assignee even I am the reporter.

Regards,
--Xavier
Comment 5 justin.tee 2024-08-07 16:41:54 UTC
(In reply to Xavier from comment #4)
> I found I don't have permission to change the bug assignee even I am the
> reporter.
Okay, no worries.  Please find below a reply based on dmesg log provided.

(In reply to Xavier from comment #1)
> Created attachment 306673 [details]
> dmesg output log for upstream kernel 6.10
> 
> Errors still exist with 6.10 kernel, the attachment file is the kernel boot
> log with parameter "lpfc.lpfc_log_verbose=4115 log_buf_len=32M"

1.) PRLI failure log messages.

[   12.754100] lpfc 0000:27:00.0: 0:(0):0108 No retry ELS command x18001420 to remote NPORT x30700 Retried:1 Error:x9/b0000 IoTag x2f60 nflags x80020000
[   12.754102] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:030700 Status:x9/xb0000, data: x4 x0 x80020000
[   12.859313] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:020100 Status:x9/xb0000, data: x4 x0 x80020000
[   13.029796] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:051400 Status:x9/xb0000, data: x4 x0 x80000000
…
[   13.127530] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:030F00 Status:x9/xb0000, data: x4 x0 x80020000
[   13.127948] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:051000 Status:x9/xb0000, data: x4 x0 x80000000
[   13.129275] lpfc 0000:27:00.0: 0:(0):0108 No retry ELS command x18001420 to remote NPORT x51601 Retried:1 Error:x9/b2c00 IoTag x2f48 nflags x80000000
[   13.129277] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:051601 Status:x9/xb2c00, data: x4 x0 x80000000
[   13.131082] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:051201 Status:x9/xb2c00, data: x4 x0 x80000000
…

These log messages indicate that several ports sent LS_RJT with reason code 0x0B (Command Unsupported) to our sent NVME-PRLI.  This is most likely due to a target that does not support NVME.

2.) FDMI cmd failed messages.

[ 2236.658100] lpfc 0000:27:00.0: 0:(0):0220 FDMI cmd failed FS_RJT Data: x211

These log messages are benign and do not affect normal HBA operations.  It is related to a proprietary management tool, which if interested, we can provide further information in a non-public setting.

3.) FCP command failed messages.

[   12.501657] lpfc 0000:27:00.0: 0:(0):9024 FCP command xa0 failed: x0 SNS x0 x0 Data: x8 xff0 x0 x0 x0
[   12.526335] lpfc 0000:27:00.0: 0:(0):9024 FCP command x12 failed: x0 SNS x0 x0 Data: x8 xa x0 x0 x0

These are Report_LUNs and Inquiry SCSI commands during FCP discovery getting logged with underruns.  This is part of normal HBA operations and can be ignored because returned status in the SCSI commands are SAM_STAT_GOOD = 0x00.

4.) PLOGI failure messages.

[    9.707863] lpfc 0000:27:00.0: 0:(0):0116 Xmit ELS command x3 to remote NPORT x30b00 I/O tag: x2fdd, port state:xd rpi xd fc_flag:x208114
…
[    9.707936] lpfc 0000:27:00.0: 0:(0):0112 ELS command x3 received from NPORT x30b00 refcnt 3 Data: x20 x208154 x41000 x41000
[    9.707938] lpfc 0000:27:00.0: 0:(0):3178 PLOGI confirm: ndlp x30b00 x40000 x3: new_ndlp x0 x0 x0
[    9.707941] lpfc 0000:27:00.0: 0:(0):0211 DSM in event x0 on NPort x30b00 in state 1 rpi xd Data: x40000 x30000
[    9.707943] lpfc 0000:27:00.0: 0:(0):0114 PLOGI chkparm OK Data: x30b00 x1 x40000 xd x20 x208154
[    9.707944] lpfc 0000:27:00.0: 0:(0):2819 Abort outstanding I/O on NPort x30b00 Data: x40000 x1 xd
…
[   10.481216] lpfc 0000:27:00.0: 0:(0):0102 PLOGI completes to NPort x030b00 IoTag x2fdd Data: x3 x3 x103 x0 x1d
[   10.481218] lpfc 0000:27:00.0: 0:(0):0108 No retry ELS command x3 to remote NPORT x30b00 Retried:0 Error:x3/103 IoTag x2fdd nflags x1080000
[   10.481219] lpfc 0000:27:00.0: 0:(0):2753 PLOGI failure DID:030B00 Status:x3/x103

This particular PLOGI was “failed” out because the lpfc driver originally sent a PLOGI to port_id 0x030B00, but did not receive a response.  Rather port_id 0x030B00, sent a PLOGI of its own instead.  Thus, this PLOGI failure message indicates that this HBA’s port sent PLOGI was canceled.

In fact, 0x030B00 is an initiator port and not a target port.  So, this is expected behavior:

[   10.484676] lpfc 0000:27:00.0: 0:(0):0103 PRLI completes to NPort x030b00 Data: x0 x41000 x7 x1 x0
[   10.484678] lpfc 0000:27:00.0: 0:(0):0211 DSM in event x7 on NPort x30b00 in state 4 rpi xd Data: x80080000 x30000
[   10.484680] lpfc 0000:27:00.0: 0:(0):6028 FCP NPR PRLI Cmpl Init 1 Target 0 EIP 0 AccCode x1


>   1. Although there're those failures, we can read/write date to SAN disks.
Good, because the lpfc driver logs indicate successful SCSI-PRLIs and eventual successful SCSI discovery:

[   12.464025] lpfc 0000:27:00.0: 0:(0):0103 PRLI completes to NPort x051500 Data: x0 x41000 x7 x0 x0
[   12.464027] lpfc 0000:27:00.0: 0:(0):0211 DSM in event x7 on NPort x51500 in state 4 rpi x18 Data: x80000000 x30008
[   12.464661] lpfc 0000:27:00.0: 0:(0):0212 DSM out state 7 on NPort x51500 rpi x18 Data: x80000000 x30008
[   12.466486] scsi 14:0:0:0: Direct-Access     LENOVO   DE_Series        0881 PQ: 1 ANSI: 5
Comment 6 Xavier 2024-08-09 03:12:20 UTC
Hi, Justin
Thanks for your clear clarifications. It sounds All of that failures can be ignore. But I still have some concerns as follows. Especially, you know, our testing will check kernel boot messages to filter all failed messages and treat them as errors default, unless there are some official documents point that errors is normal.

(In reply to justin.tee from comment #5)
> (In reply to Xavier from comment #4)
> > I found I don't have permission to change the bug assignee even I am the
> > reporter.
> Okay, no worries.  Please find below a reply based on dmesg log provided.
> 
> (In reply to Xavier from comment #1)
> > Created attachment 306673 [details]
> > dmesg output log for upstream kernel 6.10
> > 
> > Errors still exist with 6.10 kernel, the attachment file is the kernel boot
> > log with parameter "lpfc.lpfc_log_verbose=4115 log_buf_len=32M"
> 
> 1.) PRLI failure log messages.
> 
> [   12.754100] lpfc 0000:27:00.0: 0:(0):0108 No retry ELS command x18001420
> to remote NPORT x30700 Retried:1 Error:x9/b0000 IoTag x2f60 nflags x80020000
> [   12.754102] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:030700
> Status:x9/xb0000, data: x4 x0 x80020000
> [   12.859313] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:020100
> Status:x9/xb0000, data: x4 x0 x80020000
> [   13.029796] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:051400
> Status:x9/xb0000, data: x4 x0 x80000000
> …
> [   13.127530] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:030F00
> Status:x9/xb0000, data: x4 x0 x80020000
> [   13.127948] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:051000
> Status:x9/xb0000, data: x4 x0 x80000000
> [   13.129275] lpfc 0000:27:00.0: 0:(0):0108 No retry ELS command x18001420
> to remote NPORT x51601 Retried:1 Error:x9/b2c00 IoTag x2f48 nflags x80000000
> [   13.129277] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:051601
> Status:x9/xb2c00, data: x4 x0 x80000000
> [   13.131082] lpfc 0000:27:00.0: 0:(0):2754 PRLI failure DID:051201
> Status:x9/xb2c00, data: x4 x0 x80000000
> …
> 
> These log messages indicate that several ports sent LS_RJT with reason code
> 0x0B (Command Unsupported) to our sent NVME-PRLI.  This is most likely due
> to a target that does not support NVME.
> 

But our target SAN storage has not any NVMEs disks. In this circumstance, are they a normal behavior if the above failed messages popped out? How do we need to do to make that failed information disappear in the kernel boot message?

> 2.) FDMI cmd failed messages.
> 
> [ 2236.658100] lpfc 0000:27:00.0: 0:(0):0220 FDMI cmd failed FS_RJT Data:
> x211
> 
> These log messages are benign and do not affect normal HBA operations.  It
> is related to a proprietary management tool, which if interested, we can
> provide further information in a non-public setting.

OK, I'll contact you after other type of failed message are clarified.

> 
> 3.) FCP command failed messages.
> 
> [   12.501657] lpfc 0000:27:00.0: 0:(0):9024 FCP command xa0 failed: x0 SNS
> x0 x0 Data: x8 xff0 x0 x0 x0
> [   12.526335] lpfc 0000:27:00.0: 0:(0):9024 FCP command x12 failed: x0 SNS
> x0 x0 Data: x8 xa x0 x0 x0
> 
> These are Report_LUNs and Inquiry SCSI commands during FCP discovery getting
> logged with underruns.  This is part of normal HBA operations and can be
> ignored because returned status in the SCSI commands are SAM_STAT_GOOD =
> 0x00.

Again, Is there a way to block those failed messages, like module parameters?

> 
> 4.) PLOGI failure messages.
> 
> [    9.707863] lpfc 0000:27:00.0: 0:(0):0116 Xmit ELS command x3 to remote
> NPORT x30b00 I/O tag: x2fdd, port state:xd rpi xd fc_flag:x208114
> …
> [    9.707936] lpfc 0000:27:00.0: 0:(0):0112 ELS command x3 received from
> NPORT x30b00 refcnt 3 Data: x20 x208154 x41000 x41000
> [    9.707938] lpfc 0000:27:00.0: 0:(0):3178 PLOGI confirm: ndlp x30b00
> x40000 x3: new_ndlp x0 x0 x0
> [    9.707941] lpfc 0000:27:00.0: 0:(0):0211 DSM in event x0 on NPort x30b00
> in state 1 rpi xd Data: x40000 x30000
> [    9.707943] lpfc 0000:27:00.0: 0:(0):0114 PLOGI chkparm OK Data: x30b00
> x1 x40000 xd x20 x208154
> [    9.707944] lpfc 0000:27:00.0: 0:(0):2819 Abort outstanding I/O on NPort
> x30b00 Data: x40000 x1 xd
> …
> [   10.481216] lpfc 0000:27:00.0: 0:(0):0102 PLOGI completes to NPort
> x030b00 IoTag x2fdd Data: x3 x3 x103 x0 x1d
> [   10.481218] lpfc 0000:27:00.0: 0:(0):0108 No retry ELS command x3 to
> remote NPORT x30b00 Retried:0 Error:x3/103 IoTag x2fdd nflags x1080000
> [   10.481219] lpfc 0000:27:00.0: 0:(0):2753 PLOGI failure DID:030B00
> Status:x3/x103
> 
> This particular PLOGI was “failed” out because the lpfc driver originally
> sent a PLOGI to port_id 0x030B00, but did not receive a response.  Rather
> port_id 0x030B00, sent a PLOGI of its own instead.  Thus, this PLOGI failure
> message indicates that this HBA’s port sent PLOGI was canceled.
> 
> In fact, 0x030B00 is an initiator port and not a target port.  So, this is
> expected behavior:

Once more, How can we avoid the PLOGI failed logs occurred in kernel boot log.

> 
> [   10.484676] lpfc 0000:27:00.0: 0:(0):0103 PRLI completes to NPort x030b00
> Data: x0 x41000 x7 x1 x0
> [   10.484678] lpfc 0000:27:00.0: 0:(0):0211 DSM in event x7 on NPort x30b00
> in state 4 rpi xd Data: x80080000 x30000
> [   10.484680] lpfc 0000:27:00.0: 0:(0):6028 FCP NPR PRLI Cmpl Init 1 Target
> 0 EIP 0 AccCode x1
> 
> 
> >   1. Although there're those failures, we can read/write date to SAN disks.
> Good, because the lpfc driver logs indicate successful SCSI-PRLIs and
> eventual successful SCSI discovery:
> 
> [   12.464025] lpfc 0000:27:00.0: 0:(0):0103 PRLI completes to NPort x051500
> Data: x0 x41000 x7 x0 x0
> [   12.464027] lpfc 0000:27:00.0: 0:(0):0211 DSM in event x7 on NPort x51500
> in state 4 rpi x18 Data: x80000000 x30008
> [   12.464661] lpfc 0000:27:00.0: 0:(0):0212 DSM out state 7 on NPort x51500
> rpi x18 Data: x80000000 x30008
> [   12.466486] scsi 14:0:0:0: Direct-Access     LENOVO   DE_Series       
> 0881 PQ: 1 ANSI: 5
Comment 7 justin.tee 2024-08-09 16:26:24 UTC
(In reply to Xavier from comment #6)
May we remove “lpfc.lpfc_log_verbose=4115” from the kernel command line parameters and replace with the following?

“lpfc.lpfc_enable_fc4_type=1”

The lpfc_enable_fc4_type parameter will instruct the lpfc driver into SCSI-only mode and I’m hoping without the extraneous NVME logic it might help with the LOG_TRACE_EVENT feature from sounding off too much automated verbose logging.

Information about the LOG_TRACE_EVENT feature can be found from this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/commit/?id=372c187b8a705c0b52c230653ac7bc38ef62182f
Comment 8 Xavier 2024-08-16 02:56:31 UTC
Created attachment 306743 [details]
kernel 6.11.0-rc1+ with "lpfc.lpfc_enable_fc4_type=1" log

Appended "lpfc.lpfc_enable_fc4_type=1" in kernel command line, lpfc driver failure messages still exist.
Comment 9 justin.tee 2024-08-19 20:36:01 UTC
(In reply to Xavier from comment #8)
> Created attachment 306743 [details]
> kernel 6.11.0-rc1+ with "lpfc.lpfc_enable_fc4_type=1" log
> 
> Appended "lpfc.lpfc_enable_fc4_type=1" in kernel command line, lpfc driver
> failure messages still exist.

These are all benign log messages with regards to normal lpfc driver operation.  Please see below explanation for all cases.

1.)
[   68.221534] lpfc 0000:41:00.0: 0:(0):2753 PLOGI failure DID:011300
Status:x3/x31420002
[   68.223771] lpfc 0000:41:00.0: 0:(0):2753 PLOGI failure DID:030100
Status:x3/x31420002

The lpfc driver detected that the sent PLOGIs to port_ids 0x011300 and 0x030100
timed out with no responses.  This is not an lpfc driver issue.  The port_id
0x011300 and 0x030100 simply did not respond to a PLOGI.


2.) 
[   68.238426] lpfc 0000:41:00.0: 0:(0):2754 PRLI failure DID:010300
Status:x3/x103, data: x2 x0 x80040080

The lpfc driver sent a PRLI to port_id 0x010300 first, but right after port_id
0x010300 sent a PRLI of its own instead.  The lpfc driver LS_ACC’ed the
received PRLI, and this PRLI failure message is merely logging that the lpfc
driver silently aborted the first sent PRLI because port_id 0x010300 sent a
PRLI of its own instead, and we have already LS_ACC’ed it.


3.) 
[  131.338583] lpfc 0000:41:00.0: 0:(0):2755 ADISC failure DID:010300
Status:x3/x31420002

The lpfc driver sent an ADISC to port_id 0x010300, but never received a
response from port_id 0x010300.


4.) 
[ 1042.692998] lpfc 0000:41:00.0: 0:(0):2755 ADISC failure DID:040E00
Status:x3/x103
[ 1048.020992] lpfc 0000:41:00.0: 0:(0):2755 ADISC failure DID:040B00
Status:x3/x103

The lpfc driver sent an ADISC to port_ids 0x040E00 and 0x040B00, but instead
received LOGOs from both 0x040E00 and 0x040B00.  Thus, these ADISC failure
messages are merely logging that the lpfc driver silently aborted the
previously sent ADISCs due to receipt of LOGOs from the port_ids instead.
Comment 10 justin.tee 2024-08-19 20:36:55 UTC
(In reply to justin.tee from comment #9)
> (In reply to Xavier from comment #8)
> > Created attachment 306743 [details]
> > kernel 6.11.0-rc1+ with "lpfc.lpfc_enable_fc4_type=1" log
> > 
> > Appended "lpfc.lpfc_enable_fc4_type=1" in kernel command line, lpfc driver
> > failure messages still exist.
> 
> These are all benign log messages with regards to normal lpfc driver
> operation.  Please see below explanation for all cases.

Perhaps the word “failure” is not appropriate in log messages that report informational SAN activity meant for a sysadmin.  We will plan to post a version update for lpfc that reduces the amount of default SAN activity information.  I will keep you updated when this new version is posted.
Comment 11 justin.tee 2024-09-12 23:34:37 UTC
(In reply to justin.tee from comment #10)
> Perhaps the word “failure” is not appropriate in log messages that report
> informational SAN activity meant for a sysadmin.  We will plan to post a
> version update for lpfc that reduces the amount of default SAN activity
> information.  I will keep you updated when this new version is posted.

An updated lpfc version 14.4.0.5 has been posted.

Of particular interest may be the patch called, “[PATCH 6/8] lpfc: Revise TRACE_EVENT log flag severities from KERN_ERR to KERN_WARNING”

https://lore.kernel.org/linux-scsi/20240912232447.45607-1-justintee8345@gmail.com/T/#m0c5162512ab2bd5cabf2135596cfc64de068f3f9

Note You need to log in before you can comment on or make changes to this bug.