Bug 11646 - QLA2xxx: Kernel deadlock on high load somewhere after 2.6.20
Summary: QLA2xxx: Kernel deadlock on high load somewhere after 2.6.20
Status: RESOLVED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: All Linux
Importance: P1 high
Assignee: linux-scsi@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-09-25 06:55 UTC by peter gervai
Modified: 2014-07-29 20:22 UTC
CC List: 12 users

See Also:
Kernel Version: 2.6.32
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Some further logfile around a lockup (32.57 KB, application/x-compressed-tar)
2008-09-30 00:49 UTC, peter gervai
syslog when starting up multipath (17.93 KB, application/octet-stream)
2009-02-27 01:50 UTC, peter gervai

Description peter gervai 2008-09-25 06:55:15 UTC
Latest working kernel version: 2.6.20
Earliest failing kernel version: known 2.6.24
Distribution: Debian stable
Hardware Environment: QLogic Corp. QLA2422 Fibre Channel Adapter (rev 02), IBM (intel based) HS21 blade server, external SAN storage [IBM DS4200], optional full multipath (happens with or without), further details on specified requests
Software Environment: multipathd handling dm devices, lvm2, xfs

Problem Description:
The machines go dead under heavy IO load. "Going dead" may mean a rare complete crash, more often infinite resource wait states, or stuck udev threads all over.

Diagnosis was pretty hard; many components were checked, and it finally boiled down to the qla2xxx driver.
It seems that somewhere after 2.6.20 the driver developed a problem with high loads, where it:
- starts to see (or generate) link downs without reason
- tries to handle these by logging thousands of "try to dump firmware" messages, while
- somehow screwing up IRQ handling, because more often than not even eth0 starts complaining about transmit timeouts, and the kernel often says "..no IRQ handler for vector"
- never recovers. I've seen many messages like:
== mailbox command timeout
== performing isp recovery
== loop up 4gbps
== SNS scan failed - assuming zero entry result
== scsi: abort command issued ...
then often
== FC remote port time out
== SCSI DEVICE RESET ISSUED
and it sometimes ends with a stack trace and the happy message
== RIP 0x10
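
One rough way to check whether a given syslog shows the message sequence above is to count the known symptom strings. The sketch below is only an illustration: the patterns are derived from the messages listed in this report, while the labels, the function name and the sample lines are made up for the example.

```python
import re
from collections import Counter

# Patterns derived from the messages listed in this report.
# The labels on the left are hypothetical names chosen for this sketch.
SYMPTOMS = {
    "mailbox_timeout": re.compile(r"mailbox command timeout", re.I),
    "isp_recovery": re.compile(r"performing isp (error )?recovery", re.I),
    "loop_up": re.compile(r"loop up", re.I),
    "sns_scan_failed": re.compile(r"sns scan failed", re.I),
    "abort_issued": re.compile(r"abort command issued", re.I),
    "rport_timeout": re.compile(r"fc remote port time out", re.I),
    "device_reset": re.compile(r"device reset issued", re.I),
}


def count_symptoms(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in SYMPTOMS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Illustrative input only; real input would be the lines of a syslog file.
SAMPLE = [
    "kernel: qla2xxx 0000:08:01.0: Mailbox command timeout occured. Issuing ISP abort.",
    "kernel: qla2xxx 0000:08:01.0: Performing ISP error recovery - ha= ffff810223d0c530.",
    "kernel: qla2xxx 0000:08:01.0: LOOP UP detected (4 Gbps).",
    "kernel:  rport-0:0-0: blocked FC remote port time out: removing target and saving binding",
    "kernel: APIC error on CPU0: 00(40)",
]

if __name__ == "__main__":
    for label, n in sorted(count_symptoms(SAMPLE).items()):
        print(label, n)
```

A high count of mailbox timeouts and ISP recoveries on a machine whose links are physically fine would match the pattern described here; the counts alone do not say what caused it.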

Diagnosis is hard because I cannot easily force a crash: even bonnie++ survives multiple runs without problems, but a busy postgres can usually crash it in a few hours.

After we changed and upgraded almost everything in both paths and nothing helped (including a kernel upgrade to the latest official one), I went back to 2.6.20 and the problem disappeared. It is not easy to tell when it broke, because I cannot just start playing with live servers and I cannot make it crash on a test server. But if you have any tests which should crash it, I can try them on a different (testing) machine.


Steps to reproduce: I wish I knew. Loads of IO in an unknown pattern make it die in a few hours, or days.

I can provide any info you ask for that I'm able to pry out of the machines, kernel [logs], etc. For most crashes I only have screenshots of the remote console, since the crash killed all disk IO.
Comment 1 Seokmann Ju 2008-09-25 07:10:18 UTC
There have been many updates/changes applied to the qla2xxx module since 2.6.20.
Does it happen with the later 2.6.26.5 or the latest 2.6.27 kernels?
Yes, please provide the /var/log/messages file from the system.
Comment 2 peter gervai 2008-09-25 08:00:51 UTC
Many updates: unfortunately I realised. 

2.6.26.5: yes, it is broken as well. I have no knowledge of released 2.6.27. :-)

Messages: alright, I'll attach some, but please realise that when the IO is blocked there is no log. :-) I have some screenshots of the crashes, but most of them are not even relevant; they only show processes stuck in D state for much too long.
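
For reference, processes in D state (uninterruptible sleep) can be listed from /proc while the machine is still responsive. This is a minimal sketch, assuming a Linux /proc layout where the third field of /proc/<pid>/stat is the state letter; the function names are made up for the example.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so locate it via the first "(" and the last ")".
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left].strip())
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    found = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A process that shows up here only once is normal; the same processes appearing in every run over several minutes is what the screenshots show.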
Comment 3 peter gervai 2008-09-25 08:04:52 UTC
Okay Bugzilla doesn't let me, so please get it from
http://foobar.grin.hu/tmp/
Comment 4 peter gervai 2008-09-26 06:48:10 UTC
I have put some screenshots of the crashes in the same location, though I am not sure anything useful can be pried out of them.

Can I provide any more info? Was there anything useful? Or any idea how I could deliberately crash it (so I can try any kernel versions/patches you like on a test machine)?
Comment 5 Seokmann Ju 2008-09-26 06:59:41 UTC
One thing, could you redirect console output to serial port so that we could grab as much information as the kernel provides?
If I understood correctly, this helps in the situation where the system gets locked up or hung.
Hope it will provide further direction for us to go.
Comment 6 peter gervai 2008-09-27 01:17:08 UTC
Unfortunately no. 

The system in question does not have a physical serial port; it is said to have a serial-over-IP feature, which in fact does not work (and it's a pretty stupid thing anyway, since you have to telnet(!) in and capture the output somehow; but the connection breaks after 1-2 minutes anyway).

I've tried netconsole but unfortunately [and naturally] it dies along with eth0. 

I was thinking about a USB serial port, but it probably requires working IRQs as well.

But, as I mentioned, I moved the live system back to 2.6.20 to prevent further lockups, and I do not really have a way to kill the test system manually. (So far I've tried 2-3 runs of bonnie++ and tiobench; neither locked it up, but I'll try to run them in an endless loop and see what happens.)

Which kernel version has, in your opinion, a good chance of containing the change? I see there was a big version change somewhere; if you could point out the kernel version, I'd try to shoot around it.
Comment 7 peter gervai 2008-09-30 00:49:37 UTC
Created attachment 18108
Some further logfile around a lockup

I tried and failed to manually lock up a kernel (tried 10 bonnie++ and tiobench runs), but another one (2.6.24 with the openvz patch; I believe openvz shouldn't really matter to the qla2xxx driver, but you're free to disagree) locked up. Maybe there's something useful in the logs (the log drive wasn't stuck, no OOPS).

The links seem to go down, but they most probably did not, since all other servers went on completely fine at the time. The links seemed to stay down [and IO locked up], but came up again after a reboot. I do not believe the links were _really_ down at all, so this is, by my guess, the same problem. Everything went to syslog.

Plus I attached some proc/interrupts and ps tree and whatnot.
Comment 8 peter gervai 2008-10-01 15:40:22 UTC
Hm, I got some logs which contain messages like

Oct  2 00:23:05 galamb kernel: [139240.696070] qla2xxx 0000:08:01.1: RISC paused -- HCCR=0, Dumping firmware!
Oct  2 00:23:05 galamb kernel: [139240.696097] qla2xxx 0000:08:01.1: Firmware has been previously dumped (ffffc20000bcc000) -- ignoring request...
Oct  2 00:23:05 galamb kernel: [139241.494343] scsi(4): dpc: sched qla2x00_abort_isp ha = ffff81007bd84460
Oct  2 00:23:05 galamb kernel: [139241.494350] qla2xxx 0000:08:01.1: Performing ISP error recovery - ha= ffff81007bd84460.
Oct  2 00:23:05 galamb kernel: [139241.530998] scsi(4): **** Load RISC code ****
Oct  2 00:23:05 galamb kernel: [139241.547277] scsi(4): Verifying Checksum of loaded RISC code.
Oct  2 00:23:05 galamb kernel: [139241.564201] scsi(4): Checksum OK, start firmware.
Oct  2 00:23:06 galamb kernel: [139241.747606] scsi(4): Issue init firmware.
Oct  2 00:23:06 galamb kernel: [139242.296514] scsi(4): Asynchronous P2P MODE received.
Oct  2 00:23:06 galamb kernel: [139242.316473] scsi(4): Asynchronous LOOP UP (4 Gbps).
Oct  2 00:23:06 galamb kernel: [139242.316479] qla2xxx 0000:08:01.1: LOOP UP detected (4 Gbps).
Oct  2 00:23:06 galamb kernel: [139242.336435] scsi(4): Asynchronous PORT UPDATE.
Oct  2 00:23:06 galamb kernel: [139242.336440] scsi(4): Port database changed ffff 0006 0000.
Oct  2 00:23:06 galamb kernel: [139242.356395] scsi(4): Asynchronous PORT UPDATE ignored 0000/0004/0600.
Oct  2 00:23:06 galamb kernel: [139242.376358] scsi(4): Asynchronous PORT UPDATE ignored 0000/0007/0b00.
Oct  2 00:23:06 galamb kernel: [139242.396353] scsi(4): F/W Ready - OK 
Oct  2 00:23:06 galamb kernel: [139242.416315] scsi(4): fw_state=3 curr time=100d44784.
Oct  2 00:23:06 galamb kernel: [139242.416321] qla2x00_restart_isp(): Start configure loop, status = 0
Oct  2 00:23:06 galamb kernel: [139242.436258] scsi(4): Configure loop -- dpc flags =0x4080048
Oct  2 00:23:06 galamb kernel: [139242.456218] scsi(4): RSCN queue entry[0] = [00/000000].
Oct  2 00:23:06 galamb kernel: [139242.456223] scsi(4): device_resync: rscn overflow.
Oct  2 00:23:06 galamb kernel: [139242.492382] scsi(4): fcport-0 - port retry count: 2 remaining
Oct  2 00:23:06 galamb kernel: [139242.492406] scsi(4): RFT_ID exiting normally.
Oct  2 00:23:06 galamb kernel: [139242.512366] scsi(4): RFF_ID exiting normally.
Oct  2 00:23:06 galamb kernel: [139242.532324] scsi(4): RNN_ID exiting normally.
Oct  2 00:23:06 galamb kernel: [139242.556047] scsi(4): RSNN_NN exiting normally.
Oct  2 00:23:07 galamb kernel: [139242.632113] scsi(4): GID_PT entry - nn 200100e08bba4036 pn 210100e08bba4036 portid=010400.
Oct  2 00:23:07 galamb kernel: [139242.655856] scsi(4): GID_PT entry - nn 200400a0b8263784 pn 200500a0b8263785 portid=011300.
Oct  2 00:23:07 galamb kernel: [139242.731982] scsi(4): GPSC ext entry - fpn 200400c0dd0daf7b speeds=6000 speed=2000.
Oct  2 00:23:07 galamb kernel: [139242.755684] scsi(4): GPSC ext entry - fpn 201300c0dd0daf7b speeds=e000 speed=2000.
Oct  2 00:23:07 galamb kernel: [139242.775629] qla24xx_fabric_logout(4): failed to complete IOCB -- completion status (31)  ioparam=a/0.
Oct  2 00:23:07 galamb kernel: [139242.775634] scsi(4): device wrap (011300)
Oct  2 00:23:07 galamb kernel: [139242.775639] scsi(4): Trying Fabric Login w/loop id 0x0081 for port 011300.
Oct  2 00:23:07 galamb kernel: [139242.831751] qla2xxx 0000:08:01.1: iIDMA adjusted to 4 GB/s on 200500a0b8263785.
Oct  2 00:23:07 galamb kernel: [139242.831787] scsi(4): LOOP READY
Oct  2 00:23:07 galamb kernel: [139242.831789] qla2x00_restart_isp(): Configure loop done, status = 0x0
Oct  2 00:23:07 galamb kernel: [139242.833926] qla2xxx 0000:08:01.1: scsi(4:0:0:6): Mid-layer underflow detected (40000 of 40000 bytes)...returning error status.
Oct  2 00:23:07 galamb kernel: [139242.843912] qla2xxx 0000:08:01.1: scsi(4:0:0:3): Mid-layer underflow detected (10000 of 10000 bytes)...returning error status.

under 2.6.24+openvz. It was repeatedly generated by asking LVM to move a whole physical volume (PV) to another one, which caused a constant, medium rate dataflow in both directions. The link went up later, and the move so far did not crash the machine.

It may be important to mention that FC#0 is link down (really), FC#1 is active. When FC1 reports link down, mailbox timeouts, etc, FC0 logs _lots_ of firmware dump requests (thousands), which I guess could eventually crash the machine (but so far didn't).
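
To put a number on "thousands", the firmware dump requests can be counted per PCI device. The sketch below is only an illustration: the two message texts are copied from the log excerpt in this comment, while the function name and the assumption that every relevant line carries the "qla2xxx <pci-address>:" prefix are mine.

```python
import re
from collections import Counter

# Matches the two firmware-dump messages quoted in this comment and
# captures the PCI address that precedes them.
DUMP_RE = re.compile(
    r"qla2xxx ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r".*(?:Dumping firmware|Firmware has been previously dumped)"
)


def count_dump_requests(lines):
    """Return a Counter mapping PCI address to number of dump messages."""
    counts = Counter()
    for line in lines:
        m = DUMP_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts


# Sample lines taken from the excerpt above, plus one unrelated line.
SAMPLE = [
    "Oct  2 00:23:05 galamb kernel: [139240.696070] qla2xxx 0000:08:01.1: "
    "RISC paused -- HCCR=0, Dumping firmware!",
    "Oct  2 00:23:05 galamb kernel: [139240.696097] qla2xxx 0000:08:01.1: "
    "Firmware has been previously dumped (ffffc20000bcc000) -- ignoring request...",
    "Oct  2 00:23:06 galamb kernel: [139242.316479] qla2xxx 0000:08:01.1: "
    "LOOP UP detected (4 Gbps).",
]

if __name__ == "__main__":
    for device, n in count_dump_requests(SAMPLE).items():
        print(device, n)
```

Running it over the full syslog would show which of the two ports produces the flood and how fast it grows.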

If anyone requests I can provide the full syslog (not as an attachment though).
Comment 9 Seokmann Ju 2008-10-03 07:42:22 UTC
Yes, please forward the syslog to us.
From the log in comment #8, the RISC attempted to dump firmware right after the RISC pause.
The dump image might contain clues explaining what was going on at that point.
Could you forward the firmware dump to us?
Here are the steps to get the dump:
---
When a firmware dump is performed, a message similar to:

       Firmware dump saved to temp buffer (1/adcdabcd)

will be logged by the driver.

To retrieve the dump (do this *BEFORE* you unload the driver and
before the machine is reset), go to a console and type the following:

       $ wget ftp://ftp.qlogic.com/outgoing/linux/beta/8.x/test/qla_dmp.sh
       $ chmod 755 qla_dmp.sh
       $ ./qla_dmp.sh <host_no>

The value passed to qla_dmp.sh should be the same as the first integer
in the 'saved to temp buffer' string (in this example, 1).  If the
operation was successful, a message like the following should be
displayed:

       Firmware dumped to file fw_dump_1_20041217_023222.txt.gz

Send us the file and we can have the firmware folks take a look to see
what's going on.
---
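
The host number needed for qla_dmp.sh can also be picked out of the log automatically. A minimal sketch, assuming the message has exactly the form quoted above; the function name is made up, and the meaning of the hexadecimal value after the slash is not stated here, so the code ignores it.

```python
import re

# Form of the message as quoted in the instructions above.
SAVED_RE = re.compile(
    r"Firmware dump saved to temp buffer \((\d+)/[0-9a-fA-F]+\)"
)


def dump_host_numbers(lines):
    """Return the host numbers of all 'saved to temp buffer' messages."""
    hosts = []
    for line in lines:
        m = SAVED_RE.search(line)
        if m:
            hosts.append(int(m.group(1)))
    return hosts


if __name__ == "__main__":
    sample = ["Firmware dump saved to temp buffer (1/adcdabcd)"]
    for host_no in dump_host_numbers(sample):
        # This is the value to pass as <host_no> to qla_dmp.sh.
        print(host_no)
```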
Comment 10 peter gervai 2008-10-06 12:21:05 UTC
Sent by email.
Comment 11 Seokmann Ju 2008-10-07 13:38:48 UTC
Thanks.
Below is the feedback from our firmware folks.
---
Looks like you have a bad part.

b.txt (fw_dump_4) showed us reading a register to go into a jmp table.  I should
have read '1' (was at the time of the dump), but looks like I got 0.
RISC then paused on the parity error.
---

So, please go ahead and contact QLogic services to get it serviced or replaced.
Comment 12 peter gervai 2008-10-07 13:52:29 UTC
Okay, but _which_ part do you mean? One HBA? As I've mentioned, the problem happened on multiple machines, and they have dual HBAs. (Or can one bad HBA mess up the others? How can I tell which one is bad? Is there a way to test for this particular problem?)

From the reply I guess they are talking about the dumps, and #4 was the second card of the machine in question. But originally this wasn't the server I had the most problems with; that one usually locks up completely on newer kernels, and the reboot clears the firmware dumps you mentioned. So if machine #3 has a bad HBA #2, why did machine #1 lock up every 30 minutes? Still not clear to me.
Comment 13 Seokmann Ju 2008-10-07 14:27:01 UTC
OK. I see your point.
Could you provide feedback on the questions that I've raised in the email?
Let's continue to narrow down the problem further.
One thing: we would need to have the console output redirected to serial, as it gives us the most accurate clues.
Comment 14 Seokmann Ju 2008-10-13 04:45:12 UTC
If you have any updates, please let us know.
Comment 15 peter gervai 2008-10-21 00:13:52 UTC
Just to note that I segmented one live server from the others in question and it did not help [the separated one keeps crashing / locking up], now I am trying to freeze the test server (which runs a stock kernel instead of the openvz one), but probably due to the lack of real server load it's hard, and so far the freezes were total. I am about to create a crashdump kernel, maybe I can catch a glimpse of what happens. Stand by...

By the way I have some weird "RISC paused / firmware dumped" case on the live machine, where it happens right after reboot, the system goes on just fine but it shouldn't happen anyway I guess. I'll send the dumps by email.
Comment 16 Daniel Bakken 2008-11-19 14:10:08 UTC
I have experienced this bug on IBM HS21 Blades running Debian Lenny/2.6.22 connected to IBM DS3400 storage via qlogic switch. The crashes occurred during cp and rsync operations from one array to another.

I solved the problem by replacing the Linux qla2xxx module with the official qlogic RHEL/SUSE driver and hacking it to work as a module in Debian. The mailbox timeouts stopped after switching drivers. This suggests a bug in the current Linux qla2xxx driver- NOT a hardware problem.

Here is syslog output from a typical crash:

Jan 27 19:41:53 hqhost kernel: qla2xxx 0000:08:01.0: Mailbox command timeout occured. Issuing ISP abort.
Jan 27 19:41:53 hqhost kernel: qla2xxx 0000:08:01.0: Performing ISP error recovery - ha= ffff810223d0c530.
Jan 27 19:41:53 hqhost kernel: qla2xxx 0000:08:01.0: LOOP UP detected (4 Gbps).
Jan 27 19:41:54 hqhost kernel: qla2xxx 0000:08:01.0: SNS scan failed -- assuming zero-entry result...
Jan 27 19:41:54 hqhost kernel: APIC error on CPU0: 00(40)
Jan 27 19:41:54 hqhost kernel: qla2xxx 0000:08:01.0: scsi(0:0:1): Abort command issued -- 0 9a776 2002.
Jan 27 19:42:28 hqhost kernel:  rport-0:0-0: blocked FC remote port time out: removing target and saving binding
Jan 27 19:42:28 hqhost kernel:  rport-0:0-4: blocked FC remote port time out: removing target and saving binding
Jan 27 19:42:28 hqhost kernel:  rport-0:0-5: blocked FC remote port time out: removing target and saving binding
Jan 27 19:42:28 hqhost kernel: qla2xxx 0000:08:01.0: scsi(0:0:0): DEVICE RESET ISSUED.
Jan 27 19:42:28 hqhost kernel: APIC error on CPU5: 00(40)
Jan 27 19:42:28 hqhost kernel: sd 0:0:0:0: [sda] Synchronizing SCSI cache
Comment 17 Csillag Tamas 2008-11-23 11:21:26 UTC
I also suffer from this bug:

Linux version 2.6.24-1-pve (root@oahu) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #1 SMP PREEMPT Fri Oct 24 11:34:13 CEST 2008 (Ubuntu 2.6.24-4.6-server)

Nov 22 07:35:37 somehost kernel: qla2xxx 0000:08:01.0: Mailbox command timeout occured. Issuing ISP abort.
Nov 22 07:35:37 somehost kernel: qla2xxx 0000:08:01.0: Performing ISP error recovery - ha= ffff8101618f0468.
Nov 22 07:35:38 somehost kernel: qla2xxx 0000:08:01.0: LOOP UP detected (2 Gbps).
Nov 22 07:35:38 somehost kernel: qla2xxx 0000:08:01.0: SNS scan failed -- assuming zero-entry result...
Nov 22 07:35:38 somehost kernel: APIC error on CPU1: 00(40)
Nov 22 07:35:38 somehost kernel: qla2xxx 0000:08:01.0: scsi(1:0:1): Abort command issued -- 0 1acc93 2002.
Nov 22 07:36:12 somehost kernel:  rport-1:0-4: blocked FC remote port time out: saving binding
Nov 22 07:36:13 somehost kernel: qla2xxx 0000:08:01.0: scsi(1:0:1): DEVICE RESET ISSUED.
Nov 22 07:36:37 somehost kernel:  rport-1:0-0: blocked FC remote port time out: removing rport
Nov 22 07:36:37 somehost kernel:  rport-1:0-1: blocked FC remote port time out: removing rport
Nov 22 07:36:37 somehost kernel:  rport-1:0-2: blocked FC remote port time out: removing rport
Nov 22 07:36:37 somehost kernel:  rport-1:0-3: blocked FC remote port time out: removing rport

This is a HS21 with a Qlogic card:
08:01.0 Fibre Channel: QLogic Corp. QLA2422 Fibre Channel Adapter (rev 02)
08:01.1 Fibre Channel: QLogic Corp. QLA2422 Fibre Channel Adapter (rev 02)

I am using a DS4700 and the other machines work fine at the same time.

Another machine connected to the same fibre channel switch (the one which works
fine) has debugging mode enabled. I include its logs here in case they give a hint
as to what event drove the HS21 machine crazy:

(log from a HS20 2.6.24.7 stock kernel)
06:01.0 Fibre Channel: QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA (rev 02)
06:01.1 Fibre Channel: QLogic Corp. ISP2312-based 2Gb Fibre Channel to PCI-X HBA (rev 02)

2008-11-22_06:35:37.55516 kern.warn: scsi(1): Asynchronous RSCR UPDATE.
2008-11-22_06:35:37.55520 kern.info: scsi(1): RSCN database changed -- 0001 0600.
2008-11-22_06:35:38.18991 kern.warn: scsi(1): qla2x00_loop_resync()
2008-11-22_06:35:38.18998 kern.warn: scsi(1): F/W Ready - OK
2008-11-22_06:35:38.23641 kern.warn: scsi(1): fw_state=3 curr time=69fc0e46.
2008-11-22_06:35:38.23646 kern.warn: scsi(1): Configure loop -- dpc flags =0x40000a0
2008-11-22_06:35:38.48711 kern.warn: scsi(1): RSCN queue entry[30] = [00/010600].
2008-11-22_06:35:38.48713 kern.warn: scsi(1): GID_PT entry - nn 200000112593fc1c pn 210000112593fc1c portid=010100.
2008-11-22_06:35:38.48716 kern.warn: scsi(1): GID_PT entry - nn 200000112593f89c pn 210000112593f89c portid=010200.
2008-11-22_06:35:38.48718 kern.warn: scsi(1): GID_PT entry - nn 200000145e241c2c pn 210000145e241c2c portid=010400.
2008-11-22_06:35:38.48719 kern.warn: scsi(1): GID_PT entry - nn 2000001b3205b641 pn 2100001b3205b641 portid=010600.
2008-11-22_06:35:38.48720 kern.warn: scsi(1): GID_PT entry - nn 2000001b32056b41 pn 2100001b32056b41 portid=010700.
2008-11-22_06:35:38.48721 kern.warn: scsi(1): GID_PT entry - nn 200400a0b8293358 pn 202400a0b8293358 portid=010f00.
2008-11-22_06:35:38.48723 kern.warn: scsi(1): device wrap (010f00)
2008-11-22_06:35:38.48725 kern.warn: scsi(1): Trying Fabric Login w/loop id 0x0083 for port 010600.
2008-11-22_06:35:38.48726 kern.warn: scsi(1): LOOP READY
2008-11-22_06:35:38.48727 kern.warn: scsi(1): qla2x00_loop_resync - end
2008-11-22_06:35:38.54288 kern.warn: scsi(1): Asynchronous RSCR UPDATE.
2008-11-22_06:35:38.54294 kern.info: scsi(1): RSCN database changed -- 0001 0600.
2008-11-22_06:35:39.18405 kern.warn: scsi(1): qla2x00_loop_resync()
2008-11-22_06:35:39.18412 kern.warn: scsi(1): F/W Ready - OK
2008-11-22_06:35:39.21856 kern.warn: scsi(1): fw_state=3 curr time=69fc0f3d.
2008-11-22_06:35:39.21862 kern.warn: scsi(1): Configure loop -- dpc flags =0x40000a0
2008-11-22_06:35:39.25176 kern.warn: scsi(1): RSCN queue entry[31] = [00/010600].
2008-11-22_06:35:39.27638 kern.warn: scsi(1): GID_PT entry - nn 200000112593fc1c pn 210000112593fc1c portid=010100.
2008-11-22_06:35:39.29344 kern.warn: scsi(1): GID_PT entry - nn 200000112593f89c pn 210000112593f89c portid=010200.
2008-11-22_06:35:39.30982 kern.warn: scsi(1): GID_PT entry - nn 200000145e241c2c pn 210000145e241c2c portid=010400.
2008-11-22_06:35:39.32600 kern.warn: scsi(1): GID_PT entry - nn 2000001b3205b641 pn 2100001b3205b641 portid=010600.
2008-11-22_06:35:39.34160 kern.warn: scsi(1): GID_PT entry - nn 2000001b32056b41 pn 2100001b32056b41 portid=010700.
2008-11-22_06:35:39.35693 kern.warn: scsi(1): GID_PT entry - nn 200400a0b8293358 pn 202400a0b8293358 portid=010f00.
2008-11-22_06:35:39.35699 kern.warn: scsi(1): device wrap (010f00)
2008-11-22_06:35:39.38467 kern.warn: scsi(1): Trying Fabric Login w/loop id 0x0083 for port 010600.
2008-11-22_06:35:39.39794 kern.warn: scsi(1): LOOP READY
2008-11-22_06:35:39.39801 kern.warn: scsi(1): qla2x00_loop_resync - end
2008-11-22_06:36:43.23921 kern.warn: scsi(1): Asynchronous RSCR UPDATE.
2008-11-22_06:36:43.23925 kern.info: scsi(1): RSCN database changed -- 0001 0600.
2008-11-22_06:36:44.17780 kern.warn: scsi(1): qla2x00_loop_resync()
2008-11-22_06:36:44.17787 kern.warn: scsi(1): F/W Ready - OK
2008-11-22_06:36:44.19975 kern.warn: scsi(1): fw_state=3 curr time=69fc4eb4.
2008-11-22_06:36:44.19981 kern.warn: scsi(1): Configure loop -- dpc flags =0x40000a0
2008-11-22_06:36:44.22040 kern.warn: scsi(1): RSCN queue entry[0] = [00/010600].
2008-11-22_06:36:44.23846 kern.warn: scsi(1): GID_PT entry - nn 200000112593fc1c pn 210000112593fc1c portid=010100.
2008-11-22_06:36:44.24914 kern.warn: scsi(1): GID_PT entry - nn 200000112593f89c pn 210000112593f89c portid=010200.
2008-11-22_06:36:44.25921 kern.warn: scsi(1): GID_PT entry - nn 200000145e241c2c pn 210000145e241c2c portid=010400.
2008-11-22_06:36:44.26910 kern.warn: scsi(1): GID_PT entry - nn 2000001b3205b641 pn 2100001b3205b641 portid=010600.
2008-11-22_06:36:44.28411 kern.warn: scsi(1): GID_PT entry - nn 2000001b32056b41 pn 2100001b32056b41 portid=010700.
2008-11-22_06:36:44.28776 kern.warn: scsi(1): GID_PT entry - nn 200400a0b8293358 pn 202400a0b8293358 portid=010f00.
2008-11-22_06:36:44.28783 kern.warn: scsi(1): device wrap (010f00)
2008-11-22_06:36:44.30339 kern.warn: scsi(1): Trying Fabric Login w/loop id 0x0083 for port 010600.
2008-11-22_06:36:44.31129 kern.warn: scsi(1): LOOP READY
2008-11-22_06:36:44.31136 kern.warn: scsi(1): qla2x00_loop_resync - end
2008-11-22_06:36:44.34305 kern.warn: scsi(1): Asynchronous RSCR UPDATE.
2008-11-22_06:36:44.34306 kern.info: scsi(1): RSCN database changed -- 0001 0600.
2008-11-22_06:36:45.06317 kern.warn: scsi(1): Asynchronous RSCR UPDATE.
2008-11-22_06:36:45.06319 kern.info: scsi(1): RSCN database changed -- 0001 0600.
2008-11-22_06:36:45.17397 kern.warn: scsi(1): qla2x00_loop_resync()
2008-11-22_06:36:45.17404 kern.warn: scsi(1): F/W Ready - OK 
2008-11-22_06:36:45.18965 kern.warn: scsi(1): fw_state=3 curr time=69fc4fad.
2008-11-22_06:36:45.18972 kern.warn: scsi(1): Configure loop -- dpc flags =0x40000a0
2008-11-22_06:36:45.20585 kern.warn: scsi(1): RSCN queue entry[1] = [00/010600].
2008-11-22_06:36:45.20592 kern.warn: scsi(1): Skipping duplicate RSCN queue entry found at [2].
2008-11-22_06:36:45.22195 kern.warn: scsi(1): RSCN queue entry[2] = [00/010600].
2008-11-22_06:36:45.23816 kern.warn: scsi(1): GID_PT entry - nn 200000112593fc1c pn 210000112593fc1c portid=010100.
2008-11-22_06:36:45.24727 kern.warn: scsi(1): GID_PT entry - nn 200000112593f89c pn 210000112593f89c portid=010200.
2008-11-22_06:36:45.25633 kern.warn: scsi(1): GID_PT entry - nn 200000145e241c2c pn 210000145e241c2c portid=010400.
2008-11-22_06:36:45.26791 kern.warn: scsi(1): GID_PT entry - nn 2000001b3205b641 pn 2100001b3205b641 portid=010600.
2008-11-22_06:36:45.27679 kern.warn: scsi(1): GID_PT entry - nn 2000001b32056b41 pn 2100001b32056b41 portid=010700.
2008-11-22_06:36:45.28572 kern.warn: scsi(1): GID_PT entry - nn 200400a0b8293358 pn 202400a0b8293358 portid=010f00.
2008-11-22_06:36:45.28578 kern.warn: scsi(1): device wrap (010f00)
2008-11-22_06:36:45.30173 kern.warn: scsi(1): Trying Fabric Login w/loop id 0x0083 for port 010600.
2008-11-22_06:36:45.30980 kern.warn: scsi(1): LOOP READY
2008-11-22_06:36:45.30985 kern.warn: scsi(1): qla2x00_loop_resync - end
2008-11-22_06:37:45.99458 kern.warn: scsi(1): Asynchronous RSCR UPDATE.
2008-11-22_06:37:45.99466 kern.info: scsi(1): RSCN database changed -- 0001 0600.
2008-11-22_06:37:46.17424 kern.warn: scsi(1): qla2x00_loop_resync()
2008-11-22_06:37:46.17431 kern.warn: scsi(1): F/W Ready - OK 
2008-11-22_06:37:46.19055 kern.warn: scsi(1): fw_state=3 curr time=69fc8b3f.
2008-11-22_06:37:46.19062 kern.warn: scsi(1): Configure loop -- dpc flags =0x40000a0
2008-11-22_06:37:46.20666 kern.warn: scsi(1): RSCN queue entry[3] = [00/010600].
2008-11-22_06:37:46.22193 kern.warn: scsi(1): GID_PT entry - nn 200000112593fc1c pn 210000112593fc1c portid=010100.
2008-11-22_06:37:46.23109 kern.warn: scsi(1): GID_PT entry - nn 200000112593f89c pn 210000112593f89c portid=010200.
2008-11-22_06:37:46.24022 kern.warn: scsi(1): GID_PT entry - nn 200000145e241c2c pn 210000145e241c2c portid=010400.
2008-11-22_06:37:46.24921 kern.warn: scsi(1): GID_PT entry - nn 2000001b32056b41 pn 2100001b32056b41 portid=010700.
2008-11-22_06:37:46.25818 kern.warn: scsi(1): GID_PT entry - nn 200400a0b8293358 pn 202400a0b8293358 portid=010f00.
2008-11-22_06:37:46.25825 kern.warn: scsi(1): device wrap (010f00)
2008-11-22_06:37:46.27411 kern.warn: scsi(1): LOOP READY
2008-11-22_06:37:46.27418 kern.warn: scsi(1): qla2x00_loop_resync - end
2008-11-22_06:37:47.08468 kern.warn: scsi(1): Asynchronous RSCR UPDATE.
2008-11-22_06:37:47.08470 kern.info: scsi(1): RSCN database changed -- 0001 0600.
2008-11-22_06:37:47.17403 kern.warn: scsi(1): qla2x00_loop_resync()
2008-11-22_06:37:47.17410 kern.warn: scsi(1): F/W Ready - OK 
2008-11-22_06:37:47.18983 kern.warn: scsi(1): fw_state=3 curr time=69fc8c39.
2008-11-22_06:37:47.18994 kern.warn: scsi(1): Configure loop -- dpc flags =0x40000a0
2008-11-22_06:37:47.20616 kern.warn: scsi(1): RSCN queue entry[4] = [00/010600].
2008-11-22_06:37:47.22248 kern.warn: scsi(1): GID_PT entry - nn 200000112593fc1c pn 210000112593fc1c portid=010100.
2008-11-22_06:37:47.23159 kern.warn: scsi(1): GID_PT entry - nn 200000112593f89c pn 210000112593f89c portid=010200.
2008-11-22_06:37:47.24107 kern.warn: scsi(1): GID_PT entry - nn 200000145e241c2c pn 210000145e241c2c portid=010400.
2008-11-22_06:37:47.24995 kern.warn: scsi(1): GID_PT entry - nn 2000001b3205b641 pn 2100001b3205b641 portid=010600.
2008-11-22_06:37:47.25876 kern.warn: scsi(1): GID_PT entry - nn 2000001b32056b41 pn 2100001b32056b41 portid=010700.
2008-11-22_06:37:47.26760 kern.warn: scsi(1): GID_PT entry - nn 200400a0b8293358 pn 202400a0b8293358 portid=010f00.
2008-11-22_06:37:47.26766 kern.warn: scsi(1): device wrap (010f00)
2008-11-22_06:37:47.28333 kern.warn: scsi(1): Trying Fabric Login w/loop id 0x0083 for port 010600.
2008-11-22_06:37:47.29128 kern.warn: scsi(1): LOOP READY
2008-11-22_06:37:47.29135 kern.warn: scsi(1): qla2x00_loop_resync - end

I downgraded the buggy machine as a workaround to an earlier kernel hoping it
will fix the problems outlined here.

Regards,
   cstamas
--
Csillag Tamas (cstamas)
http://digitus.itk.ppke.hu/~cstamas
Comment 18 Zoltan Kiss 2009-02-22 16:54:42 UTC
Hi, I also suffer from this bug.

Here is the qla section of /var/log/messages:
Feb 22 21:28:36 bafs1 kernel: [71282.592558] qla2xxx 0000:08:01.0: Mailbox command timeout occured. Issuing ISP abort.
Feb 22 21:28:36 bafs1 kernel: [71282.592611] qla2xxx 0000:08:01.0: Performing ISP error recovery - ha= ffff81012e5105f8.
Feb 22 21:28:37 bafs1 kernel: [71283.546086] qla2xxx 0000:08:01.0: LOOP UP detected (4 Gbps).
Feb 22 21:28:37 bafs1 kernel: [71283.685581] qla2xxx 0000:08:01.0: scsi(0:0:2): Abort command issued -- 0 2457f0 2002.
Feb 22 21:29:06 bafs1 kernel: [71325.253361] qla2xxx 0000:08:01.0: scsi(0:0:2): DEVICE RESET ISSUED.
Feb 22 21:29:36 bafs1 kernel: [71412.254597] qla2xxx 0000:08:01.0: Mailbox command timeout occured. Issuing ISP abort.
Feb 22 21:29:36 bafs1 kernel: [71412.254597] qla2xxx 0000:08:01.0: Performing ISP error recovery - ha= ffff81012e5105f8.
Feb 22 21:29:37 bafs1 kernel: [71414.455676] qla2xxx 0000:08:01.0: LOOP UP detected (4 Gbps).
Feb 22 21:29:37 bafs1 kernel: [71414.716347] qla2xxx 0000:08:01.0: scsi(0:0:2): DEVICE RESET FAILED: Task management failed.
Feb 22 21:29:37 bafs1 kernel: [71414.716347] qla2xxx 0000:08:01.0: scsi(0:0:2): TARGET RESET ISSUED.
Feb 22 21:30:07 bafs1 kernel: [71488.627662] qla2xxx 0000:08:01.0: Mailbox command timeout occured. Issuing ISP abort.
Feb 22 21:30:07 bafs1 kernel: [71488.627662] qla2xxx 0000:08:01.0: Performing ISP error recovery - ha= ffff81012e5105f8.


This happens on IBM BladeCenter HS21 Blade, Debian Lenny, stock kernel: 2.6.26-1-amd64 #1 SMP Sat Jan 10 17:57:00 UTC 2009 x86_64 GNU/Linux

My storage is IBM TS DS4300.

Now I am compiling the 2.6.20 kernel with the RHEL drivers; I hope that helps.

Regards,
Zoltan Kiss
Bardi Auto - Hungary
Comment 19 peter gervai 2009-02-27 01:50:29 UTC
Created attachment 20377
syslog when starting up multipath

Maybe this contains some info, because the debug values don't say anything to me. To prevent things from crashing we actually keep one path down by switching off one (qlogic fibre) switch. This way we only have occasional crashes (but since I cannot conjure up an external console, I'm stuck at that point).

Someone accidentally switched the switch on, activating the multipath. I'm attaching what happened. (It goes on and on further; the end position of the log is arbitrary.)
Comment 20 Seokmann Ju 2009-02-27 02:28:40 UTC
From the log in #19, multipathd caused repeated interventions/interruptions to the target, so that no stable path to it remained available.
---
Feb 26 23:50:42 fred multipathd: sdh: directio checker reports path is down
Feb 26 23:50:42 fred multipathd: checker failed path 8:112 in map multi_3_db
Feb 26 23:50:42 fred multipathd: multi_3_db: remaining active paths: 1
Feb 26 23:50:42 fred kernel: [14020652.739423] device-mapper: multipath: Failing path 8:112.
Feb 26 23:50:42 fred multipathd: sdj: directio checker reports path is down
Feb 26 23:50:42 fred multipathd: checker failed path 8:144 in map mpath7
Feb 26 23:50:42 fred multipathd: mpath7: remaining active paths: 0
---


And that, in turn, triggered timeout events followed by command aborts, as below.
---
Feb 26 23:50:41 fred kernel: [14020651.786979] qla2xxx_eh_abort(3): aborting sp ffff81003c1360c0 from RISC. pid=334462.
Feb 26 23:50:41 fred kernel: [14020651.787849] scsi(3): ABORT status detected 0x5-0x0.
Feb 26 23:50:41 fred kernel: [14020651.788110] qla2xxx 0000:08:01.0: scsi(3:0:3): Abort command issued -- 1 51a7e 2002.
Feb 26 23:50:41 fred kernel: [14020651.847108] qla2xxx_eh_abort(3): aborting sp ffff81003c136dc0 from RISC. pid=334463.
Feb 26 23:50:41 fred kernel: [14020651.847973] scsi(3): ABORT status detected 0x5-0x0.
Feb 26 23:50:41 fred kernel: [14020651.848242] qla2xxx 0000:08:01.0: scsi(3:0:9): Abort command issued -- 1 51a7f 2002.
---
Comment 21 peter gervai 2009-02-27 08:17:19 UTC
#20, enlighten me in the internals, please. Can multipathd actually _cause_ driver timeouts? 

As much as I understood directio checker does nothing more than reads first and last sector of the device by direct IO (O_DIRECT) calls, and if that fails it fires an uevent towards userspace to handle the situation.  Can it do anything "below", like the qla driver, or the scsi device itself?
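
To make the question concrete, here is what a direct IO read of one block looks like from userspace. This is not multipathd's code, just a minimal sketch of the kind of read described; the function name and the 4096-byte block size are assumptions, and O_DIRECT is taken from the os module where the platform provides it.

```python
import mmap
import os

# Assumed block size; O_DIRECT needs the length and the buffer address to be
# aligned to the device's logical block size, and 4096 covers common cases.
BLOCK = 4096


def direct_read_ok(path, offset=0):
    """Return True if one block can be read from path using O_DIRECT."""
    flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        return False  # cannot even open the device: treat as "down"
    try:
        # An anonymous mmap is page-aligned, which satisfies O_DIRECT.
        buf = mmap.mmap(-1, BLOCK)
        try:
            return os.preadv(fd, [buf], offset) > 0
        finally:
            buf.close()
    except OSError:
        return False  # the read itself failed
    finally:
        os.close(fd)


if __name__ == "__main__":
    # /dev/sdh is the path named in the multipathd log; reading it needs root.
    print(direct_read_ok("/dev/sdh"))
```

A read like this only submits an ordinary request through the block layer; it has no side channel into the HBA driver, which is the point of the question.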

I tried 'tur' checker in the past with mixed results, and I'm not sure it meant the path was really down that much or the checker failed, but directio seemed the most generic.

I thought it was the other way around, i.e. qla timeouts making multipathd cry.
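
For readers unfamiliar with what such a check involves, here is a minimal Python sketch of the kind of read described above: open the device with O_DIRECT and read one block from its start. It is not multipathd's actual implementation; the function name `read_first_block`, the 4096-byte block size, and the `direct` switch are assumptions made for illustration. The read is an ordinary I/O request that travels through the block layer to the driver.

```python
import mmap
import os


def read_first_block(path, block_size=4096, direct=True):
    """Read one block from the start of `path`; return True if data came back.

    With direct=True the file is opened with O_DIRECT (Linux only), so the
    read bypasses the page cache and has to be served by the device itself.
    The name and defaults are illustrative, not taken from multipath-tools.
    """
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # Linux-specific; needs an aligned buffer
    try:
        fd = os.open(path, flags)
    except OSError:
        return False
    try:
        # Anonymous mmap memory is page-aligned, which satisfies O_DIRECT.
        buf = mmap.mmap(-1, block_size)
        try:
            return os.readv(fd, [buf]) > 0
        finally:
            buf.close()
    except OSError:
        return False
    finally:
        os.close(fd)
```

On a real system this might be called as `read_first_block("/dev/sdh")` (needs read access to the device); a `False` result only says the read failed, not why.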
Comment 22 Seokmann Ju 2009-02-27 10:29:03 UTC
Sorry for the confusion.
I misread the log without having a clear understanding of its layout.

Could you send the log with the 'ql2xextended_error_logging' parameter turned on?
From the information in the log file at #19, I am not sure where the failure started or which command caused it.
Comment 23 Csillag Tamas 2009-03-03 11:00:16 UTC
Well, I am not sure if the multipath issues and the original problems reported are related.
Comment 24 Daniel Bakken 2009-03-04 08:14:04 UTC
I do not use multipathd, and the qlogic timeouts still crash my system. I believe the multipathd errors Seokmann Ju pointed out are caused by the qlogic driver. Notice how the timestamps for the qlogic events are before the multipath errors.
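
The ordering argument can be checked mechanically. The following is a small sketch, assuming classic syslog-formatted lines like the ones pasted in this thread; the helper name `first_seen` and the way messages are classified (by the substrings "qla2xxx" and "multipath") are illustrative choices, not an established tool.

```python
import re

# Matches syslog lines such as:
# "Feb 26 23:50:41 fred kernel: [14020651.786979] qla2xxx_eh_abort(3): ..."
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+(\S+?):\s+(.*)$")


def first_seen(lines):
    """Return {source: earliest HH:MM:SS} for qla2xxx and multipath messages.

    Times are compared as strings, which is only valid within a single day.
    """
    earliest = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        clock, tag, msg = m.groups()
        if tag == "multipathd" or "multipath" in msg:
            source = "multipath"
        elif "qla2xxx" in msg:
            source = "qla2xxx"
        else:
            continue
        if source not in earliest or clock < earliest[source]:
            earliest[source] = clock
    return earliest


sample = [
    "Feb 26 23:50:42 fred multipathd: sdh: directio checker reports path is down",
    "Feb 26 23:50:42 fred kernel: [14020652.739423] device-mapper: multipath: Failing path 8:112.",
    "Feb 26 23:50:41 fred kernel: [14020651.786979] qla2xxx_eh_abort(3): aborting sp ffff81003c1360c0 from RISC. pid=334462.",
]
print(first_seen(sample))
```

For the three sample lines, taken from the excerpt in comment 20, the qla2xxx abort is stamped 23:50:41 and the first multipath message 23:50:42.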
Comment 25 mike 2009-03-31 16:02:09 UTC
Are there any updates on this bug?  We are planning on installing 4 new database servers in the next couple of weeks using Debian Lenny amd64 (2.6.26 kernel) on servers with dual Qlogic 2460 HBAs using multipathd, connecting to an EMC CLARiiON SAN.

I came across this bug, but couldn't gauge how big of a concern this should be for us.  Is it recommended to use a kernel version of 2.6.20 or older at this point or is the behavior seen in this bug a rare/special case?
Comment 26 Ronan Guilfoyle 2009-05-12 09:03:28 UTC
I had a similar crash last night.
I'm running Ubuntu 8.04, 2.6.24-23-server.

This is an IBM HS21 (Type 8853), with Q-Logic FC adapter.

The system runs MySQL in a heartbeat controlled failover pair with databases on the SAN.

This error caused a failover, but not all of the MySQL transactions were written to disk.  The backup came up fine but with out-of-date data, causing the slave to fail because it could not read the correct log position.

I'd appreciate any help with (or links to) replacing integrated qla2xxx drivers with official rpm ones.

Thanks,
Ronan Guilfoyle

Syslog below:

May 11 20:17:01 DB1 /USR/SBIN/CRON[7036]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
May 11 20:22:35 DB1 kernel: [3036897.727183] APIC error on CPU1: 00(40)
May 11 20:23:48 DB1 kernel: [3036970.605923] qla2xxx 0000:08:01.1: Mailbox command timeout occured. Issuing ISP abort.
May 11 20:23:48 DB1 kernel: [3036970.605930] qla2xxx 0000:08:01.1: Performing ISP error recovery - ha= ffff810222830460.
May 11 20:23:50 DB1 kernel: [3036972.456812] qla2xxx 0000:08:01.1: LOOP UP detected (4 Gbps).
May 11 20:23:50 DB1 kernel: [3036972.716091] qla2xxx 0000:08:01.1: SNS scan failed -- assuming zero-entry result...
May 11 20:23:50 DB1 kernel: [3036972.756011] APIC error on CPU6: 00(40)
May 11 20:23:50 DB1 kernel: [3036972.775938] qla2xxx 0000:08:01.1: scsi(1:1:10): Abort command issued -- 0 5fe51b 2002.
May 11 20:24:23 DB1 kernel: [3037005.515447]  rport-1:0-0: blocked FC remote port time out: saving binding
May 11 20:24:23 DB1 kernel: [3037005.515512]  rport-1:0-1: blocked FC remote port time out: saving binding
May 11 20:24:23 DB1 kernel: [3037006.019805] qla2xxx 0000:08:01.1: scsi(1:1:10): DEVICE RESET ISSUED.
May 11 20:24:53 DB1 kernel: [3037035.942217] qla2xxx 0000:08:01.1: Mailbox command timeout occured. Issuing ISP abort.
May 11 20:24:53 DB1 kernel: [3037035.942223] qla2xxx 0000:08:01.1: Performing ISP error recovery - ha= ffff810222830460.
May 11 20:24:55 DB1 kernel: [3037037.801242] qla2xxx 0000:08:01.1: LOOP UP detected (4 Gbps).
May 11 20:28:30 DB1 syslogd 1.5.0#1ubuntu1: restart.
Comment 27 Somsak Sriprayoonsakul 2009-07-19 14:25:27 UTC
Hi, we are having much the same problem with much the same log, and we found something similar, with a workaround, posted at

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/268242

Could this possibly be the same bug?

Anyway, we have added pci=nomsi as suggested in the above bug report. Will report here again whether it helps (or not).
Comment 28 Ronan Guilfoyle 2009-07-20 08:26:49 UTC
I used the kernel command line 'pci=nomsi' and the problem has not been seen since.
Three servers that all showed this problem have now been running fine for over 6 weeks.  This may not be a fix, but it appears to be a perfectly good workaround for me.
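
To confirm that a running machine was actually booted with the workaround, the kernel command line can be inspected via /proc/cmdline. Below is a minimal sketch; the helper name `pci_options` is made up for illustration, and it only parses the `pci=` token (which may carry several comma-separated options), it does not verify what the driver ended up doing.

```python
def pci_options(cmdline):
    """Return the set of options passed via pci=... on a kernel command line."""
    opts = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            opts.update(o for o in token[len("pci="):].split(",") if o)
    return opts


def booted_with_nomsi(path="/proc/cmdline"):
    """True if the running kernel was booted with pci=nomsi (Linux only)."""
    with open(path) as f:
        return "nomsi" in pci_options(f.read())
```

Whether MSI is really off for a given device can additionally be cross-checked in /proc/interrupts, where MSI vectors are listed with a PCI-MSI type.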
Comment 29 Csillag Tamas 2010-01-28 23:35:43 UTC
I upgraded to this kernel @ 2010-01-18:
Linux somehost 2.6.24-9-pve #1 SMP PREEMPT Tue Nov 17 09:34:41 CET 2009 x86_64 GNU/Linux

and I experienced this issue again.

In some forum I got the idea to upgrade the card's firmware (ISP2422):

$ md5sum qlgc_flash_image_multiboot143.img
1f43310c7bb24db53d561b60080b5211  qlgc_flash_image_multiboot143.img

I have not had a problem since I did the upgrade (2010-01-20).

YMMV

--
Regards, 
  cstamas
Comment 30 Andrew Vasquez 2010-01-29 00:46:17 UTC
Csillag,

Prior to the FW update, were you seeing the failures while
using the 'pci=nomsi' workaround?  What were the firmware
versions used in your testing -- before and after?
Comment 31 Csillag Tamas 2010-01-31 22:04:39 UTC
Dear Andrew,

Now that is a good question. I remember that when I experienced this problem the only solution that helped me was the kernel downgrade. As far as I remember I tried the nomsi workaround but it did not help (but I am *not* 100% sure about this).

Is there a way to get the firmware version info from a live system (maybe without a reboot)?

For the old one I can get it from another server which is from the same order as this one.

For the new:
I do not know if this info is sufficient:
-rwxr-xr-x 1 root root 1048576 Dec 11  2007 i24af143.bin
72ed710f260788aec4f725659bf54dcd  i24af143.bin

this is the file from the floppy used for the flashing.

If this does not help I can schedule a reboot and get it from the boot screen.

Thanks
--
Regards,
  CSILLAG Tamas
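
One possible way to read the version without a reboot is through sysfs. The sketch below assumes that the running qla2xxx driver exposes a `fw_version` attribute under /sys/class/scsi_host/hostN/; that is an assumption not confirmed anywhere in this thread, and the value reported there may be the firmware the driver loaded rather than the image stored in flash. The function name `qla_firmware_versions` is illustrative.

```python
import glob
import os


def qla_firmware_versions(sysfs_root="/sys/class/scsi_host"):
    """Return {host_name: fw_version string} for hosts exposing the attribute.

    Hosts without a readable fw_version file (other drivers, older kernels)
    are silently skipped.
    """
    versions = {}
    pattern = os.path.join(sysfs_root, "host*", "fw_version")
    for attr in sorted(glob.glob(pattern)):
        host = os.path.basename(os.path.dirname(attr))
        try:
            with open(attr) as f:
                versions[host] = f.read().strip()
        except OSError:
            continue
    return versions
```

An empty result means no host exposed the attribute, in which case the boot screen or a vendor tool remains the fallback.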
Comment 32 Bernd Zeimetz 2010-03-03 09:37:28 UTC
IBM x3950 machines crash badly enough due to this bug that they reboot instantly after loading the qla2xxx module.

Feb 24 10:33:51 dbsrv01 kernel: [   64.184483] qla2xxx 0000:02:01.0: Performing ISP error recovery - ha= ffff81086b4e85f8.
Feb 24 10:33:51 dbsrv01 kernel: [   64.324785] scsi(1): **** Load RISC code ****
Feb 24 10:33:52 dbsrv01 kernel: [   64.366386] scsi(1): Verifying Checksum of loaded RISC code.
Feb 24 10:33:52 dbsrv01 kernel: [   64.605869] scsi(1): Checksum OK, start firmware.
Feb 24 10:33:52 dbsrv01 kernel: [   65.357677] scsi(1): Issue init firmware.
Feb 24 10:33:55 dbsrv01 kernel: [   71.130990] scsi(2): Loop Down - aborting the queues before time expire
Feb 24 10:33:56 dbsrv01 kernel: [   73.202082] qla2x00_mailbox_command(2): timeout calling abort_isp
Feb 24 10:33:56 dbsrv01 kernel: [   73.238667] qla2x00_mailbox_command(2): timeout calling abort_isp
Feb 24 10:33:56 dbsrv01 kernel: [   73.281349] qla2xxx 0000:10:01.0: Mailbox command timeout occured. Issuing ISP abort.
Feb 24 10:33:56 dbsrv01 kernel: [   73.333347] qla2xxx 0000:10:01.0: Performing ISP error recovery - ha= ffff81105ccf05f8.
Feb 24 10:34:12 dbsrv01 kernel: [   95.516679] qla2xxx 0000:02:01.0: Cable is unplugged...
Feb 24 10:34:12 dbsrv01 kernel: [   95.516679] scsi(1): fw_state=4 curr time=ffff208e.
Feb 24 10:34:12 dbsrv01 kernel: [   95.516679] scsi(1): Firmware ready **** FAILED ****.
Feb 24 10:34:12 dbsrv01 kernel: [   95.516679] qla2x00_restart_isp(): Configure loop done, status = 0x0
Feb 24 10:34:13 dbsrv01 kernel: [   95.516679] qla2xxx 0000:02:01.0: ISP System Error - mbx1=65h mbx2=2h mbx3=8080h.
Feb 24 10:34:13 dbsrv01 kernel: [   95.516679] qla2xxx 0000:02:01.0: Firmware dump saved to temp buffer (1/ffffc20007f84000).
Feb 24 10:34:13 dbsrv01 kernel: [   95.516679] qla2x00_abort_isp(1): exiting.
Feb 24 10:34:13 dbsrv01 kernel: [   95.516679] qla2x00_mailbox_command(1): finished abort_isp
Feb 24 10:34:13 dbsrv01 kernel: [   95.516679] qla2x00_mailbox_command(1): finished abort_isp
Feb 24 10:34:13 dbsrv01 kernel: [   95.545239] qla2x00_mailbox_command(1): **** FAILED. mbx0=69, mbx1=8023, mbx2=ffff, cmd=69 ****
Feb 24 10:34:13 dbsrv01 kernel: [   95.613508] qla2x00_get_firmware_state(1): failed=100.
Feb 24 10:34:13 dbsrv01 kernel: [   95.620441] scsi(1): fw_state=8023 curr time=ffff2118.
Feb 24 10:34:13 dbsrv01 kernel: [   95.625500] scsi(1): Firmware ready **** FAILED ****.
Feb 24 10:34:13 dbsrv01 kernel: [   95.687879] scsi(1): qla2x00_loop_resync - end
Feb 24 10:34:13 dbsrv01 kernel: [   96.232463] scsi(1): dpc: sched qla2x00_abort_isp ha = ffff81086b4e85f8
Feb 24 10:34:13 dbsrv01 kernel: [   96.232463] qla2xxx 0000:02:01.0: Performing ISP error recovery - ha= ffff81086b4e85f8.
Feb 24 10:34:13 dbsrv01 kernel: [   96.236463] Calgary: DMA error on Calgary PHB 0x2, 0x02010000@CSR 0x00008000@PLSSR


Running the kernel with pci=nomsi seems to work, although we haven't tested it under load yet. The issue still happens in Debian's 2.6.32 but, interestingly, not in the kernels from Red Hat; I guess they still ship this patch: http://launchpadlibrarian.net/17517188/linux-2.6-scsi-qla2xxx-disable-msi-x-by-default.patch
It's a bit disappointing that this bug is still not handled properly by upstream - it's pretty much impossible to use recent, non-patched kernels on a lot of larger IBM machines together with QLogic hardware.
Comment 33 Csillag Tamas 2010-03-03 09:59:55 UTC
Dear Bernd Zeimetz,

Can you tell us what version your qlogic bios is?

If it's not recent you can try to upgrade:
http://www-947.ibm.com/systems/support/supportsite.wss/selectproduct?familyind=5305593&typeind=0&osind=0&continue.x=18&continue.y=13&brandind=5000008&oldbrand=5000008&oldfamily=5305593&oldtype=0&taskind=2&matrix=Y&psid=bm#UpdateXpress%20System%20Pack

It seems that this was the problem for me.
Comment 34 Bernd Zeimetz 2010-03-03 10:45:41 UTC
bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=11646
> 
> 
> 
> 
> 
> --- Comment #33 from Csillag Tamas <cstamas@digitus.itk.ppke.hu>  2010-03-03
> 09:59:55 ---
> Dear Bernd Zeimetz,
> 
> Can you tell us what version your qlogic bios is?

The QLogic BIOS was upgraded a few days ago to the latest version from QLogic,
without any change in the behaviour.

--------------------------------------------------------------------------------
HBA Instance 0: QLA2460 Port 1
--------------------------------------------------------------------------------
Product Identifier               : DS4000 FC 4Gb PCI-X Single Port HBA
Misc. Information                : PW=15W;PCI=66MHZ;PCI-X=266MHz
EFI Driver Version               : 2.04
Firmware Version                 : 4.06.02
BIOS Version                     : 2.10
FCode Version                    : 2.04

--------------------------------------------------------------------------------
HBA Instance 1: QLA2460 Port 1
--------------------------------------------------------------------------------
Product Identifier               : DS4000 FC 4Gb PCI-X Single Port HBA
Misc. Information                : PW=15W;PCI=66MHZ;PCI-X=266MHz
EFI Driver Version               : 2.04
Firmware Version                 : 4.06.02
BIOS Version                     : 2.10
FCode Version                    : 2.04
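
When comparing several machines, a report in this layout is easy to turn into structured data. The sketch below assumes exactly the format pasted above ("HBA Instance N: ..." headers followed by "Key : Value" lines); the function name `parse_hba_report` and the "Description" key are made up for illustration and are not part of any QLogic tool.

```python
import re

HEADER_RE = re.compile(r"^HBA Instance (\d+):\s*(.*)$")


def parse_hba_report(text):
    """Parse the report into {instance_number: {field_name: value}}."""
    hbas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER_RE.match(line)
        if m:
            current = {"Description": m.group(2)}
            hbas[int(m.group(1))] = current
        elif current is not None and " : " in line:
            key, value = line.split(" : ", 1)
            current[key.strip()] = value.strip()
    return hbas
```

With the output above, `parse_hba_report(report)[0]["Firmware Version"]` gives the firmware string of the first instance, which makes it simple to diff versions between hosts.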
Comment 35 Ninad 2010-08-31 06:21:54 UTC
Dear Bernd Zeimetz,

Has there been a resolution yet on this issue?
When you said "pci=nomsi seems to work" - has the problem got resolved for you using pci-nomsi?
I am using Oracle VM 2.2.1 and, although we do not see the Mailbox command timeout message (in fact we have not got many messages at all in /var/log/messages), we have observed IO stalling on a few or most of our LUNs configured as ocfs2 filesystems.
The ports do not show as down (from the SAN logs we have checked), and when LUNs stall for one server, other servers have no problem writing to those LUNs (they are shared with other servers, being ocfs2).

But for some reason I have a feeling that our problem could well have the same cause as what you are facing, hence I request some feedback from you.

Thanks,
Ninad
Comment 36 Bernd Zeimetz 2010-08-31 10:45:16 UTC
On 08/31/2010 08:22 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> Has there been a resolution yet on this issue?
> When you said "pci=nomsi seems to work" - has the problem got resolved for
> you
> using pci-nomsi?

pci=nomsi makes the machine work fine, indeed.
See Debian bug #572322 for details - the Debian kernel now ships a patch that
allows disabling msi(-x) for the Qlogic cards.

Cheers,

Bernd
Comment 37 Alan 2012-10-30 15:12:24 UTC
If this is still seen on modern kernels then please re-open/update
Comment 38 Ravshan DM 2014-07-29 19:59:54 UTC
I had this issue reproduced in my environment: 2.6.36.4 #848 SMP Thu Jul 17 19:55:17 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

It took time to figure out the root cause, which turned out to be a bad SFP (all of this on a qla2462 HBA). Once I replaced it, the FC switch recognized the FC port(s) immediately and all LUNs re-appeared on the host. I hope this info helps.
Comment 39 Alan 2014-07-29 20:22:34 UTC
Thanks
