Bug 59601

Summary: commit 97dec564fd4948e0e560869c80b76e166ca2a83e breaks communication with XYRATEX disk shelves
Product: SCSI Drivers Reporter: Jack Hill (jackhill)
Component: QLOGIC QLA2XXX Assignee: scsi_drivers-qla2xxx
Status: RESOLVED CODE_FIX    
Severity: normal CC: jackhill, saurav.kashyap
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: >2.6.38-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: dmesg output for good kernel with extended error reporting
dmesg output for bad kernel with extended error reporting
Patch for dumping the incoming packet to the driver.
dmesg output with packet dumps
Properly-set-the-tagging-for-commands

Description Jack Hill 2013-06-11 13:53:48 UTC
My setup:

I have a

Fibre Channel: QLogic Corp. ISP2322-based 2Gb Fibre Channel to PCI-X HBA (rev 03)

Connected via 2Gb Fibre Channel to a NetApp-branded disk shelf, model RA-1402. The shelf has dual XYRATEX P/N: 106-00101+C0 Fibre Channel controllers and 14 SATA disks attached through Fibre Channel-to-SATA converters.

Prior to commit 97dec564fd4948e0e560869c80b76e166ca2a83e (as determined with git bisect), Linux was able to see all the disks as well as the two XYRATEX controllers, and could read and write to them normally.

After the problem commit, neither the XYRATEX controllers nor the disks are visible.

Here is an excerpt of the dmesg output from a working kernel:

"""
[    2.708225] QLogic Fibre Channel HBA Driver: 8.03.01-k6
[    2.708302]   alloc irq_desc for 19 on node -1
[    2.708304]   alloc kstat_irqs on node -1
[    2.708309] qla2xxx 0000:03:07.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[    2.708388] qla2xxx 0000:03:07.0: Found an ISP2322, irq 19, iobase 0xffffc900021fe000
[    2.709059] qla2xxx 0000:03:07.0: Configuring PCI space...
[    2.709415] qla2xxx 0000:03:07.0: Configure NVRAM parameters...
[    2.801997] qla2xxx 0000:03:07.0: Verifying loaded RISC code...
[    2.813685] FDC 0 is a post-1991 82077
[    2.832538] qla2xxx 0000:03:07.0: firmware: requesting ql2322_fw.bin
[    2.972588] qla2xxx 0000:03:07.0: Allocated (1180 KB) for firmware dump...
[    3.032548] scsi4 : qla2xxx
[    3.032854] qla2xxx 0000:03:07.0: 
[    3.032855]  QLogic Fibre Channel HBA Driver: 8.03.01-k6
[    3.032856]   QLogic QLE2360 - PCI-Express to 2Gb FC, Single Channel
[    3.032856]   ISP2322: PCI-X (66 MHz) @ 0000:03:07.0 hdma+, host#=4, fw=3.03.28 IPX
[    3.313904] qla2xxx 0000:03:07.0: LIP reset occurred (f8f7).
[    3.345574] qla2xxx 0000:03:07.0: LIP occurred (f8f7).
[    3.452223] qla2xxx 0000:03:07.0: LIP reset occurred (f7f7).
[    3.482427] qla2xxx 0000:03:07.0: LIP occurred (f7f7).
[    3.544781] qla2xxx 0000:03:07.0: LOOP UP detected (2 Gbps).
[    3.856407] scsi 4:0:0:0: Enclosure         XYRATEX  RS-1402-SA-XNS1  3033 PQ: 0 ANSI: 3
[    3.859343] scsi 4:0:1:0: Enclosure         XYRATEX  RS-1402-SA-XNS1  3033 PQ: 0 ANSI: 3
[    3.863781] scsi 4:0:2:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.866454] scsi 4:0:3:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.869147] scsi 4:0:4:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.871834] scsi 4:0:5:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.874515] scsi 4:0:6:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.877196] scsi 4:0:7:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.879867] scsi 4:0:8:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.882539] scsi 4:0:9:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.885218] scsi 4:0:10:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.888035] scsi 4:0:11:0: Direct-Access     ST332082 0AS           SX .AAE PQ: 0 ANSI: 3
[    3.890712] scsi 4:0:12:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.893391] scsi 4:0:13:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.896142] scsi 4:0:14:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
[    3.898820] scsi 4:0:15:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
"""
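
To compare which targets a good and a bad kernel detect, the device lines in an excerpt like the one above can be pulled out with a short script. This is only a sketch: the function name `scsi_devices` and the regular expression are assumptions based on the line format shown in this excerpt, not anything provided by the driver or the kernel.

```python
import re

# Matches lines such as:
# [    3.856407] scsi 4:0:0:0: Enclosure         XYRATEX  RS-1402-SA-XNS1  3033 PQ: 0 ANSI: 3
# The pattern is an assumption derived from the excerpt above.
SCSI_LINE = re.compile(
    r"scsi (?P<addr>\d+:\d+:\d+:\d+): (?P<type>\S+)\s+(?P<ident>.*?)\s+PQ: \d+ ANSI: \d+"
)

def scsi_devices(dmesg_text):
    """Return (address, device type, identification string) for each SCSI device line."""
    devices = []
    for line in dmesg_text.splitlines():
        match = SCSI_LINE.search(line)
        if match:
            devices.append((match["addr"], match["type"], match["ident"]))
    return devices

# Three lines copied from the excerpt above; the first is not a device line.
sample = """\
[    3.032548] scsi4 : qla2xxx
[    3.856407] scsi 4:0:0:0: Enclosure         XYRATEX  RS-1402-SA-XNS1  3033 PQ: 0 ANSI: 3
[    3.863781] scsi 4:0:2:0: Direct-Access     HITACHI  HDS725050KLA36SX AB0A PQ: 0 ANSI: 3
"""

for addr, dev_type, ident in scsi_devices(sample):
    print(addr, dev_type, ident)
```

Running it over the full dmesg output of each kernel and comparing the two lists should show which enclosures and disks are missing on the bad kernel.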

It would be super if the fix for this could be backported to 3.2.

Let me know if you need more information or would like me to test anything.
Comment 1 Saurav Kashyap 2013-06-11 18:23:45 UTC
Hi Jack,
Please provide the driver logs for both the good and bad cases with ql2xextended_error_logging=1. The commit you mentioned doesn't affect 2G cards.

Have you tried reverting the commit? Did that resolve the problem?

Thanks,
~Saurav
Comment 2 Jack Hill 2013-06-17 14:04:52 UTC
Created attachment 104971 [details]
dmesg output for good kernel with extended error reporting
Comment 3 Jack Hill 2013-06-17 14:05:31 UTC
Created attachment 104981 [details]
dmesg output for bad kernel with extended error reporting
Comment 4 Jack Hill 2013-06-17 14:14:03 UTC
Hi,

I have attached dmesg output from good and bad kernels with extended error logging.

Reverting the commit solved the problem. I was not able to revert the commit on 3.10-rc4 because it resulted in conflicts, and I'm not familiar enough with that code to resolve them by hand.

Best,
Jack
Comment 5 Saurav Kashyap 2013-06-19 18:29:46 UTC
Hi Jack,
I am seeing "FCP I/O protocol failure (0x8/0x2)" messages in the failed logs. We need more data on what is coming back to the driver. I am attaching a patch that will dump that packet. Apply that patch, enable ql2xextended_error_logging and share the logs.
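
For reference, one way to tally these messages in a saved log is sketched below. It assumes only the message text quoted above; the function name `count_protocol_failures` is hypothetical, and the script does not interpret the two hexadecimal values, it just counts each pair.

```python
import re
from collections import Counter

# Pattern built from the message quoted in this comment; real log lines may
# carry extra prefixes, which search-style matching tolerates.
FAILURE = re.compile(
    r"FCP I/O protocol failure \((0x[0-9a-fA-F]+)/(0x[0-9a-fA-F]+)\)"
)

def count_protocol_failures(log_text):
    """Count how often each pair of hex values appears in the failure message."""
    return Counter(FAILURE.findall(log_text))

# Illustrative input: one unrelated driver line and the quoted message twice.
sample = "\n".join([
    "qla2xxx 0000:03:07.0: LOOP UP detected (2 Gbps).",
    "FCP I/O protocol failure (0x8/0x2)",
    "FCP I/O protocol failure (0x8/0x2)",
])

for codes, count in count_protocol_failures(sample).items():
    print(codes, count)
```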

thanks,
~Saurav
Comment 6 Saurav Kashyap 2013-06-19 18:31:38 UTC
Created attachment 105401 [details]
Patch for dumping the incoming packet to the driver.

Apply this patch, enable ql2xextended_error_logging and share the logs. This dumps the packet coming to the driver.

Thanks,
~Saurav
Comment 7 Jack Hill 2013-07-02 22:13:28 UTC
Created attachment 106661 [details]
dmesg output with packet dumps

I have attached the dmesg output after applying the patch you provided.
Comment 8 Jack Hill 2013-07-02 22:15:42 UTC
Also, I think the commit that I claimed introduced the problem after my bisect run was the wrong one; it appears to be the last good commit. I think the one that introduces the bug is ff2fc42e74e43721310bff710416230aae6ce0b9.

Sorry about that,
Jack
Comment 9 Saurav Kashyap 2013-07-11 11:09:35 UTC
Created attachment 106870 [details]
Properly-set-the-tagging-for-commands

Hi Jack,
Try this patch and see whether it resolves the issue.

thanks,
~Saurav
Comment 10 Jack Hill 2013-07-11 14:14:56 UTC
Saurav,

Yes, in my initial testing that patch does resolve the issue.

Thanks,
Jack

P.S. I set the kernel version field because Bugzilla was not letting me submit this comment with it empty.
Comment 11 Saurav Kashyap 2013-07-19 08:31:08 UTC
Hi Jack,
This patch has been submitted upstream: http://marc.info/?l=linux-scsi&m=137365649318663&w=2. Please close this BZ.

thanks,
~Saurav
Comment 12 Jack Hill 2013-07-19 15:54:26 UTC
Closing this bug since Saurav has submitted the patch upstream.