Bug 5775

Summary: when a SCSI device is plugged in again, the kernel with dm-multipath panicked
Product: IO/Storage
Reporter: Luckey (sunjw)
Component: LVM2/DM
Assignee: Alasdair G Kergon (agk)
Status: CLOSED CODE_FIX
Severity: high
CC: andrew.vasquez, protasnb
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.14.2 smp bigmem
Subsystem:
Regression: ---
Bisected commit-id:

Description Luckey 2005-12-22 19:50:13 UTC
Most recent kernel where this bug did not occur:
none
Distribution:
centos 4.2
Hardware Environment:
SAN storage; the HBA card is driven by the qla2xxx and qla2300 modules
Software Environment:
kernel 2.6.14.2 smp bigmem
device-mapper.1.01.05
multipath-tools-0.4.6
udev-058-1
Problem Description:

I create the dm device as:
create: 3600d0230006927de000001618fecaf00
[size=476 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1]
 \_ 0:0:0:0 sda 8:0  [undef] [ready]
\_ round-robin 0 [prio=1]
 \_ 1:0:0:0 sdb 8:16 [undef] [ready]

Then I try the command: dd if=/dev/dm-0 of=/dev/null
I see that only the device /dev/sda is read, which is expected.
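(For reference, a rough sketch of how a map like the one above can be created and
checked. This is only a hedged approximation: the exact multipath-tools 0.4.6
command options may differ, and the WWID and device names are the ones from this report.)

# Sketch only -- assumes the device-mapper multipath modules and tools are installed.
modprobe dm_multipath                  # multipath target
modprobe dm_round_robin                # round-robin path selector
multipath -v2                          # scan paths and build the map shown above
multipath -ll                          # list the map (3600d0230006927de...) and its paths
dmsetup ls --target multipath          # confirm the dm node, e.g. dm-0
dd if=/dev/dm-0 of=/dev/null bs=1M     # drive reads through the map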

But when I pull out the HBA related to /dev/sdb for about 1 minute and
then plug it in again,
the kernel panics.

The messages are as follows:
Dec 22 05:25:07 nd02 kernel: qla2300 0000:07:01.1: LIP reset occured (f823).
Dec 22 05:25:07 nd02 kernel: qla2300 0000:07:01.1: LIP occured (f823).
Dec 22 05:25:07 nd02 kernel: qla2300 0000:07:01.1: LOOP DOWN detected (2).
Dec 22 05:25:42 nd02 kernel:  rport-1:0-1: blocked FC remote port time out: 
removing target
Dec 22 05:25:42 nd02 multipathd: 8:16: readsector0 checker reports path is down
Dec 22 05:25:42 nd02 multipathd: checker failed path 8:16 in map 
3600d0230006927de000001618fecaf00
Dec 22 05:25:42 nd02 kernel: device-mapper: dm-multipath: Failing path 8:16.
Dec 22 05:25:42 nd02 multipathd: 3600d0230006927de000001618fecaf00: remaining 
active paths: 1
Dec 22 05:25:43 nd02 multipathd: remove sdb path checker
Dec 22 05:25:43 nd02 kernel: Synchronizing SCSI cache for disk sdb:
Dec 22 05:25:43 nd02 kernel: FAILED
Dec 22 05:25:43 nd02 kernel:   status = 0, message = 00, host = 1, driver = 00
Dec 22 05:26:12 nd02 kernel:   <6>qla2300 0000:07:01.1: LIP reset occured 
(f8f7).
Dec 22 05:26:12 nd02 kernel: qla2300 0000:07:01.1: LIP occured (f8f7).
Dec 22 05:26:12 nd02 kernel: qla2300 0000:07:01.1: LOOP UP detected (2 Gbps).
Dec 22 05:26:13 nd02 kernel:   Vendor: TOYOU     Model: NetStor DA9220F   Rev: 
342R
Dec 22 05:26:13 nd02 kernel:   Type:   Direct-Access                      ANSI 
SCSI revision: 03
Dec 22 05:26:13 nd02 kernel: SCSI device sdc: 999950336 512-byte hdwr sectors 
(511975 MB)
Dec 22 05:26:13 nd02 kernel: SCSI device sdc: drive cache: write back
Dec 22 05:26:13 nd02 kernel: SCSI device sdc: 999950336 512-byte hdwr sectors 
(511975 MB)
Dec 22 05:26:14 nd02 kernel: SCSI device sdc: drive cache: write back
Dec 22 05:26:14 nd02 kernel:  sdc:
Dec 22 05:26:14 nd02 kernel: Attached scsi disk sdc at scsi1, channel 0, id 0, 
lun 0
Dec 22 05:26:14 nd02 kernel: Attached scsi generic sg1 at scsi1, channel 0, id 
0, lun 0,  type 0
Dec 22 05:26:14 nd02 scsi.agent[4098]: disk 
at /devices/pci0000:00/0000:00:02.0/0000:05:1d.0/0000:07:01.1/host1/rport-1:0-
1/target1:0:0/1:0:0:0

--------------------->> Everything above this point is fine.

Dec 22 05:26:14 nd02 kernel:   Vendor: TOYOU     Model: NetStor DA9220F   Rev: 
342R
Dec 22 05:26:14 nd02 kernel:   Type:   Direct-Access                      ANSI 
SCSI revision: 03
Dec 22 05:26:14 nd02 kernel: error 1
Dec 22 05:26:14 nd02 kernel: scsi: Unexpected response from host 1 channel 0 
id 0 lun 0 while scanning, scan aborted
Dec 22 05:26:14 nd02 kernel: Badness in kref_get at lib/kref.c:32
Dec 22 05:26:14 nd02 kernel:  [<c01d9c6a>] kref_get+0x3f/0x41
Dec 22 05:26:14 nd02 kernel:  [<c01d92e0>] kobject_get+0x17/0x1e
Dec 22 05:26:14 nd02 kernel:  [<c019d896>] sysfs_getlink+0x38/0xfa
Dec 22 05:26:14 nd02 kernel:  [<c019d999>] sysfs_follow_link+0x41/0x59
Dec 22 05:26:14 nd02 kernel:  [<c016ee2f>] generic_readlink+0x2a/0x85
Dec 22 05:26:14 nd02 kernel:  [<c017fc72>] __mark_inode_dirty+0x52/0x1a8
Dec 22 05:26:14 nd02 kernel:  [<c012332f>] current_fs_time+0x59/0x67
Dec 22 05:26:14 nd02 kernel:  [<c0177e61>] update_atime+0x67/0x8c
Dec 22 05:26:14 nd02 kernel:  [<c01674b5>] sys_readlink+0x7e/0x82
Dec 22 05:26:14 nd02 kernel:  [<c0103af3>] sysenter_past_esp+0x54/0x75
Dec 22 05:26:17 nd02 kernel: Unable to handle kernel paging requestBadness in 
kref_get at lib/kref.c:32
Dec 22 05:26:17 nd02 kernel:  [<c01d9c6a>] kref_get+0x3f/0x41
Dec 22 05:26:17 nd02 kernel:  [<c01d92e0>] kobject_get+0x17/0x1e
Dec 22 05:26:17 nd02 kernel:  [<c019d896>] sysfs_getlink+0x38/0xfa
Dec 22 05:26:17 nd02 kernel:  [<c019d999>] sysfs_follow_link+0x41/0x59
Dec 22 05:26:17 nd02 kernel:  [<c016ee2f>] generic_readlink+0x2a/0x85
Dec 22 05:26:17 nd02 kernel:  [<c012332f>] current_fs_time+0x59/0x67
Dec 22 05:26:17 nd02 kernel:  [<c0177e61>] update_atime+0x67/0x8c
Dec 22 05:26:17 nd02 kernel:  [<c01674b5>] sys_readlink+0x7e/0x82
Dec 22 05:26:17 nd02 kernel:  [<c0103af3>] sysenter_past_esp+0x54/0x75
Dec 22 05:26:17 nd02 kernel:  at virtual address 00200200
Dec 22 05:26:17 nd02 kernel:  printing eip:
Dec 22 05:26:17 nd02 kernel: c02583a1
Dec 22 05:26:17 nd02 kernel: *pde = 37eb7001
Dec 22 05:26:17 nd02 kernel: Oops: 0002 [#1]
Dec 22 05:26:17 nd02 kernel: SMP
Dec 22 05:26:17 nd02 kernel: Modules linked in: dm_round_robin dm_multipath 
binfmt_misc dm_mirror dm_mod video thermal proces
sor fan button battery ac uhci_hcd usbcore hw_random shpchp pci_hotplug e1000 
qla2300 qla2xxx scsi_transport_fc sd_mod
Dec 22 05:26:17 nd02 kernel: CPU:    1
Dec 22 05:26:17 nd02 kernel: EIP:    0060:[<c02583a1>]    Not tainted VLI
Dec 22 05:26:17 nd02 kernel: EFLAGS: 00010002   (2.6.14.2smp)
Dec 22 05:26:17 nd02 kernel: EIP is at scsi_device_dev_release+0x3d/0x113
Dec 22 05:26:17 nd02 kernel: eax: 00100100   ebx: c2c03194   ecx: 00200200   
edx: 00000286
Dec 22 05:26:17 nd02 kernel: esi: c2c03008   edi: c2c03000   ebp: c229d814   
esp: d326fe68
Dec 22 05:26:17 nd02 kernel: ds: 007b   es: 007b   ss: 0068
Dec 22 05:26:17 nd02 kernel: Process udev (pid: 4108, threadinfo=d326e000 
task=f6e3ca30)
Dec 22 05:26:17 nd02 kernel: Stack: 7f7e7d7c c2c0320c c0371b08 c0371b20 
c229d88c c01d935e c2c03194 c2c03224
Dec 22 05:26:17 nd02 kernel:        c01d9362 c03754b8 c2c0320c c01d9c9e 
c2c0320c c019d854 c03754b8 ef1ac000
Dec 22 05:26:17 nd02 kernel:        c0365040 00000000 c01d938a c2c03224 
c01d9362 c019d927 c2c0320c c03754b8
Dec 22 05:26:17 nd02 kernel: Call Trace:
Dec 22 05:26:17 nd02 kernel:  [<c01d935e>] kobject_cleanup+0x77/0x7b
Dec 22 05:26:17 nd02 kernel:  [<c01d9362>] kobject_release+0x0/0xa
Dec 22 05:26:17 nd02 kernel:  [<c01d9c9e>] kref_put+0x32/0x84
Dec 22 05:26:17 nd02 kernel:  [<c019d854>] sysfs_get_target_path+0x73/0x7d
Dec 22 05:26:17 nd02 kernel:  [<c01d938a>] kobject_put+0x1e/0x22
Dec 22 05:26:17 nd02 kernel:  [<c01d9362>] kobject_release+0x0/0xa
Dec 22 05:26:17 nd02 kernel:  [<c019d927>] sysfs_getlink+0xc9/0xfa
Dec 22 05:26:17 nd02 kernel:  [<c019d999>] sysfs_follow_link+0x41/0x59
Dec 22 05:26:17 nd02 kernel:  [<c016ee2f>] generic_readlink+0x2a/0x85
Dec 22 05:26:17 nd02 kernel:  [<c017fc72>] __mark_inode_dirty+0x52/0x1a8
Dec 22 05:26:17 nd02 kernel:  [<c012332f>] current_fs_time+0x59/0x67
Dec 22 05:26:17 nd02 kernel:  [<c0177e61>] update_atime+0x67/0x8c
Dec 22 05:26:17 nd02 kernel:  [<c01674b5>] sys_readlink+0x7e/0x82
Dec 22 05:26:17 nd02 kernel:  [<c0103af3>] sysenter_past_esp+0x54/0x75
Dec 22 05:26:17 nd02 kernel: Code: ff ff 8d bb 6c fe ff ff 8d 75 ec 8b 40 2c 
e8 cc df 0a 00 83 86 34 01 00 00 01 8d b3 74 fe
ff ff 8b 4e 04 89 c2 8b 83 74 fe ff ff <89> 01 89 48 04 c7 46 04 00 02 20 00 
8d b3 7c fe ff ff 8b 83 7c
Dec 22 05:26:19 nd02 multipathd: sdc: path checker registered
In summary:
Kernel with the dm_multipath module loaded:
	when I plug the HBA card back in, the kernel finds a new device
named "/dev/sdc", which was originally "/dev/sdb",
	and then the kernel panics.

Kernel without the dm_multipath module loaded:
	when I plug the HBA card back in, the kernel finds a new device named as
the old one, "/dev/sdb",
	and everything works fine.

What is going wrong, and where might the bug be?

Steps to reproduce:
create the dm-multipath device,
pull out one HBA card for 1 minute,
and plug it in again (a shell approximation is sketched below).
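(The sketch below is only an approximation and relies on assumptions: the physical HBA
pull cannot be scripted, so the path is dropped and rescanned through the SCSI sysfs
interface instead, with the SCSI address 1:0:0:0 and host1 taken from the log above.)

# Sketch only -- assumes the multipath map from the description already exists.
dd if=/dev/dm-0 of=/dev/null bs=1M &                    # keep reads flowing through dm-0
echo 1 > /sys/class/scsi_device/1:0:0:0/device/delete   # drop the sdb path, like a cable pull
sleep 60                                                # leave the path out for ~1 minute
echo "- - -" > /sys/class/scsi_host/host1/scan          # rescan host1, as on re-plug
# On the affected kernel the LUN comes back as a new node (sdc) and the
# subsequent udev readlink triggers the oops shown above.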
Comment 1 Andrew Morton 2006-01-20 00:05:45 UTC
bugme-daemon@bugzilla.kernel.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5775
> 
>             Summary: when a scsi device is plugged in again, the kernel with
>                      dm-multipath paniced

This looks like a qlogic driver crash to me.

Comment 2 Andrew Vasquez 2006-01-20 16:13:54 UTC
I'm trying to lay to rest the final rport/device_model API change requirements 
for qla2xxx.  I've uploaded a small patchset:

http://marc.theaimsgroup.com/?l=linux-scsi&m=113779768321616&w=2
http://marc.theaimsgroup.com/?l=linux-scsi&m=113779768230038&w=2
http://marc.theaimsgroup.com/?l=linux-scsi&m=113779768230735&w=2

which should address the last of the known qla2xxx issues.
 
Please try them out with 2.6.16-rc1.
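(For anyone retracing this, a rough sketch of applying such a patch series before
rebuilding; the patch file names below are placeholders for the three postings linked above.)

# Sketch only -- placeholder file names, standard patch/build steps.
cd linux-2.6.16-rc1
patch -p1 < ../qla2xxx-fix-1.patch
patch -p1 < ../qla2xxx-fix-2.patch
patch -p1 < ../qla2xxx-fix-3.patch
make oldconfig && make bzImage modules && make modules_install install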
Comment 3 Luckey 2006-01-23 07:25:43 UTC
I've applied all three patches to kernel 2.6.16-rc1.
It is better: the kernel oops when multipath does failback no longer occurs.
But other problems still exist.

1. The recovery time of host A's disk I/O is too long (about 33 seconds)
while host B reboots, where hosts A and B share their SAN storage
through HBAs and an FC switch. Is this related to the FC switch?

2. When the dm-multipath device of host A (e.g. /dev/dm-0) is read heavily
(for example with dd) and host B reboots, the SCSI device on host A
(e.g. /dev/sda) is lost. If host B reboots more than once, eventually all SCSI
devices are lost. But if there is no heavy read, no device is lost.
I use the path_checker "readsector0",
and the path polling_interval is 1 second in my test.
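(A hedged sketch of the multipath.conf fragment these settings correspond to; the
exact defaults-section keywords in multipath-tools 0.4.6 may differ from later
releases, so treat this as illustrative only.)

# Sketch only -- writes a minimal /etc/multipath.conf with the settings described above.
cat > /etc/multipath.conf <<'EOF'
defaults {
        polling_interval  1             # run the path checker every second
        path_checker      readsector0   # test each path by reading sector 0
}
EOF
# Restart multipathd afterwards so the new settings take effect.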

The logs of file /var/log/messages are:

Jan 23 23:16:47 nd06 kernel: qla2300 0000:07:01.1: scsi(1:2:0): Abort command 
issued -- 137967 2002.
Jan 23 23:16:47 nd06 kernel: sd 1:0:2:0: scsi: Device offlined - not ready 
after error recovery
Jan 23 23:16:47 nd06 kernel: sd 1:0:2:0: scsi: Device offlined - not ready 
after error recovery
Jan 23 23:16:47 nd06 kernel: sd 1:0:2:0: rejecting I/O to offline device
Jan 23 23:16:47 nd06 kernel: device-mapper: dm-multipath: Failing path 8:32.
Jan 23 23:16:47 nd06 kernel: sd 1:0:2:0: SCSI error: return code = 0x20000
Jan 23 23:16:47 nd06 kernel: end_request: I/O error, dev sdc, sector 401358016
Jan 23 23:16:47 nd06 kernel: end_request: I/O error, dev sdc, sector 401358024
Jan 23 23:16:47 nd06 multipathd: 8:32: readsector0 checker reports path is down
Jan 23 23:16:47 nd06 multipathd: checker failed path 8:32 in map 
STOYOU___NetStor_DA9220F0000013158674E00
Jan 23 23:16:47 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
remaining active paths: 1
Jan 23 23:16:47 nd06 kernel: sd 0:0:0:0: SCSI error: return code = 0x20000
Jan 23 23:16:47 nd06 kernel: end_request: I/O error, dev sda, sector 401357568
Jan 23 23:16:47 nd06 kernel: device-mapper: dm-multipath: Failing path 8:0.
Jan 23 23:16:47 nd06 kernel: end_request: I/O error, dev sda, sector 401357576
Jan 23 23:16:48 nd06 multipathd: 8:0: mark as failed
Jan 23 23:16:48 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
Entering recovery mode: max_retries=100
Jan 23 23:16:48 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
remaining active paths: 0
Jan 23 23:16:49 nd06 multipathd: 8:0: readsector0 checker reports path is up
Jan 23 23:16:49 nd06 multipathd: 8:0: reinstated
Jan 23 23:16:49 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
queue_if_no_path enabled
Jan 23 23:16:49 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
Recovered to normal mode
Jan 23 23:16:49 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
remaining active paths: 1

3. Does dm-multipath's failback fail when the disk I/O includes writes?
Comment 4 Andrew Vasquez 2006-01-24 08:59:00 UTC
> I've patched all three patches to the kernel 2.6.16-rc1;
> It is better. Kernel oops when multipath does failback do not exist now.
> But other problems exist yet.

OK, good.

> 1. The recovery time of host A's disk I/O is too long (about 33 seconds)
> while host B reboots, where hosts A and B share their SAN storage
> through HBAs and an FC switch. Is this related to the FC switch?
>

As I mentioned to you in off-line emails -- have you verified with the
storage box manufacturer that your configuration is valid (given the
active/passive nature of the box)?

> 2. When the dm-multipath device of host A (e.g. /dev/dm-0) is read heavily
> (for example with dd) and host B reboots, the SCSI device on host A
> (e.g. /dev/sda) is lost. If host B reboots more than once, eventually all SCSI
> devices are lost. But if there is no heavy read, no device is lost.
> I use the path_checker "readsector0",
> and the path polling_interval is 1 second in my test.
>
> The logs of file /var/log/messages are:
>
> Jan 23 23:16:47 nd06 kernel: qla2300 0000:07:01.1: scsi(1:2:0): Abort command 
> issued -- 137967 2002.
> Jan 23 23:16:47 nd06 kernel: sd 1:0:2:0: scsi: Device offlined - not ready 
> after error recovery

In our last email, I noted some of the behaviour being exhibited by the storage
device when I/O was coming from both hosts in the fabric to the storage box --
ABORTs (ABTS) being needed to 'unstick' commands from the box, and LOGOs being
issued by the box in response to commands:

--- 8< ---

From: "Junwei Sun" <sunjw@onewaveinc.com>
To: "andrew.vasquez" <andrew.vasquez@qlogic.com>
Subject: Re: Re: How to use the firmware of qlogic driver
Message-ID: <SERVERTsmSWzUqQID0E0000b9e3@mail.onewaveinc.com>

>Ok, so that could explain why some of the initiators are being
>logged out when I/O is being initiated from another host -- are you
>ensuring that I/O is going down the active (controller) path?
>
>What mechanism is in place to have the controller switch the 'active'
>path to the standby path?  Often, especially in failover
>configurations, a storage box needs a TUR (test-unit-ready) or some
>other CDB to force the controller to switch active paths.  This is
>especially so with active-standby configurations.

OK, let me try to describe my topology configuration:

     host nd09  host nd10
        /     \/      \
       /      /\       \
   FC-switch A  FC-switch B
      /                \
     /                  \
    |p1  |p2       |p3   |p4
 controller A    controller B
       \               /
        \             /
        RAID5    storage

Each controller has two independent host channels to the RAID, which are in the same loop.
When the two controllers work in Active-Standby mode -- for example, controller A is
currently the active one -- then ports p3/p4 are connected to p1/p2 through
a so-called internal hub in the RAID box. That is to say,
p1 and p3 are on the same channel, and p2 and p4 are on the same one.
All ports on both controllers carry traffic even though controller B itself does nothing.

Actually, the two HBAs on each host work in Active-Active mode, not just failover.

Is there any problem?  And what's your opinion?  Thanks!

>> 1.
>> I execute the command "dd if=/dev/sda of=/dev/null" on host nd10; after about 1 minute,
>> I run the command "rmmod qla2xxx; modprobe qla2xxx" on host nd09.
>> I watch the I/O status with the command "vmstat 1".
>> It takes about 33 seconds for the dd I/O to recover after the "rmmod...".
>> The question is: how can I reduce this I/O recovery time when this situation occurs?
>
>Perhaps if I had /var/log/messages from nd10 during the unload on
>nd09, I might be able to provide additional insight, but as it stands,
>I can only guess that the storage is still logging nd10 out.
During nd09's module reload, nd10 gets no new messages in /var/log/messages.
When I run "fdisk -l" on nd10, new messages are then added to that file.
>
>Have you verified the validity of your topology with the storage
>vendor?  I'm still unclear on the exact topology (not just the
>components, but how you've attached those components to the storage).
>At least on nd10, I can see the HBAs are coming up in LOOP (FCAL)
>topology -- how exactly are those HBAs (on nd10) attached to the
>storage box, as there doesn't appear to be any FC switch here?  Is
>nd09 attached to the storage box via an FC switch on the secondary
>(standby) controller?
see above.
>
>> 2.
>> When I bind the two disk paths "/dev/sda" and "/dev/sdb" together into one device "/dev/dm-0",
>> and I execute "dd if=/dev/dm-0 of=/dev/null" on host nd10, I do a qla2xxx module reload on another host.
>> Sometimes one of the devices sda/sdb is lost again; the messages are:
>> 
>> Jan 18 17:39:55 nd09 kernel: qla2xxx_eh_abort(0): aborting sp c3d66e00 from RISC. pid=65007 
sp->state=2
>> Jan 18 17:39:55 nd09 kernel: scsi(0): ABORT status detected 0x5-0x0.
>
>The storage box is hung up trying to process the scsi_cmnd 65007, the
>driver sends an ABTS to the storage box to abort the command, the ABTS
>appears to complete, and the command is returned to the midlayer with
>a DID_ABORT status.
>
>> Jan 18 17:39:55 nd09 kernel: scsi(0:1:0): status_entry: Port Down pid=65008, compl status=0x29, 
port state=0x4
>> Jan 18 17:39:55 nd09 kernel: scsi(0): Port login retry: 220000d023000001, id = 0x0070 retry 
cnt=30
>> Jan 18 17:39:56 nd09 kernel: scsi(0): fcport-1 - port retry count: 29 remaining
>
>But it seems yet another I/O is returning to hba0 on nd09 with a
>failed status and a secondary indicator that the storage port has just
>logged the HBA out.
>
>
>> Jan 18 17:39:56 nd09 kernel: qla2xxx 0000:07:01.0: scsi(0:1:0): Abort command issued -- fdef 
2002.
>> 
>> --> What does the line above mean?
>
>That the storage port cannot handle commands (concurrently) on both
>controller ports -- hence the active/standby qualifications of the
>storage.
>
>> Jan 18 17:39:56 nd09 kernel: sd 0:0:1:0: scsi: Device offlined - not ready after error recovery
>> Jan 18 17:39:56 nd09 kernel: sd 0:0:1:0: rejecting I/O to offline device
>> Jan 18 17:39:56 nd09 kernel: device-mapper: dm-multipath: Failing path 8:0.
>> Jan 18 17:39:56 nd09 multipathd: 8:0: readsector0 checker reports path is down
>> 
>> 3.
>> Currently I need a stable kernel without this device-loss problem, so what's your suggestion?
>
>Ensure you have a valid configuration with the storage box in
>question.  The LOOP topology on one controller and F_PORT (via switch)
>on the other is questionable.  More notably, the problem with your
>redundancy testing across two hosts (each connected to one of the
>storage ports) is that I/Os can be sent to the storage simultaneously,
>which is not typically tolerated in an active/standby configuration.
>If the storage were active-active, then it could handle concurrent I/Os
>down both controllers and coalesce data in the backend.
>
>I'll add the fix (along with the compilation issue with 'static') to
>my next set of patches for upstream -- thanks for helping out with
>that.  But, at this point, the driver is doing all it can in an
>attempt to maintain its connection with the storage -- even with the
>continual LOGOs being sent.  Unfortunately, it seems the storage box is
>unable to operate properly in the topology configuration you are
>attempting.
>
>Regards,
>Andrew Vasquez

Best regards!
Luckey

--- 8< ---

> Jan 23 23:16:47 nd06 kernel: sd 1:0:2:0: scsi: Device offlined - not ready 
> after error recovery
> Jan 23 23:16:47 nd06 kernel: sd 1:0:2:0: rejecting I/O to offline device
> Jan 23 23:16:47 nd06 kernel: device-mapper: dm-multipath: Failing path 8:32.
> Jan 23 23:16:47 nd06 kernel: sd 1:0:2:0: SCSI error: return code = 0x20000
> Jan 23 23:16:47 nd06 kernel: end_request: I/O error, dev sdc, sector 401358016
> Jan 23 23:16:47 nd06 kernel: end_request: I/O error, dev sdc, sector 401358024
> Jan 23 23:16:47 nd06 multipathd: 8:32: readsector0 checker reports path is down
> Jan 23 23:16:47 nd06 multipathd: checker failed path 8:32 in map 
> STOYOU___NetStor_DA9220F0000013158674E00
> Jan 23 23:16:47 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
> remaining active paths: 1
> Jan 23 23:16:47 nd06 kernel: sd 0:0:0:0: SCSI error: return code = 0x20000
> Jan 23 23:16:47 nd06 kernel: end_request: I/O error, dev sda, sector 401357568
> Jan 23 23:16:47 nd06 kernel: device-mapper: dm-multipath: Failing path 8:0.
> Jan 23 23:16:47 nd06 kernel: end_request: I/O error, dev sda, sector 401357576
> Jan 23 23:16:48 nd06 multipathd: 8:0: mark as failed
> Jan 23 23:16:48 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
> Entering recovery mode: max_retries=100
> Jan 23 23:16:48 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
> remaining active paths: 0
> Jan 23 23:16:49 nd06 multipathd: 8:0: readsector0 checker reports path is up
> Jan 23 23:16:49 nd06 multipathd: 8:0: reinstated
> Jan 23 23:16:49 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
> queue_if_no_path enabled
> Jan 23 23:16:49 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
> Recovered to normal mode
> Jan 23 23:16:49 nd06 multipathd: STOYOU___NetStor_DA9220F0000013158674E00: 
> remaining active paths: 1

It appears that we've at least moved beyond the HBA issues and are still left with the
behaviours being exhibited by the storage box in your topology.  The HBA is performing
its standard recovery given the actions of the storage -- ABTS for stuck commands and
PLOGIs in response to the unexpected LOGOs from the storage during heavy I/O.
Comment 5 Natalie Protasevich 2007-07-22 17:33:10 UTC
Any updates on this issue?
Thanks.
Comment 6 Anonymous Emailer 2007-07-23 09:00:49 UTC
Reply-To: sunjunwei2@163.com

Hi bugme-daemon,

No; kernel 2.6.16.22 appears to work well. Thanks.

Best regards!
Junwei Sun
Comment 7 Alasdair G Kergon 2007-11-13 04:30:21 UTC
I'm going to mark this one resolved.  Reopen it (or create a new bug) if more work on this is needed.  (BTW The component wasn't really Storage/DM - someone might like to change it.)