Bug 93951 - Multipath hangs if Active ALUA path stops responding (timeouts)
Summary: Multipath hangs if Active ALUA path stops responding (timeouts)
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: linux-scsi@vger.kernel.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-27 16:48 UTC by Shawn Swanson
Modified: 2015-02-27 16:48 UTC
CC List: 0 users

See Also:
Kernel Version: 3.10.0-123.el7.x86_64
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Shawn Swanson 2015-02-27 16:48:24 UTC
We are observing multipath hanging when the Active/Optimized path of an Active/Optimized + Standby ALUA SAS LUN stops responding to SCSI commands (constant timeouts).

To summarize our setup, the storage device is connected via SAS and supports ALUA Active/Optimized and Standby states.  A typical user configuration would have 2 paths to each LUN.  One path would be Active/Optimized and the other path would be in the Standby state.  
 
Below are the current multipath.conf settings that were used to collect the latest set of logs. Note that some of the settings were added in an attempt to work around the issue but did not seem to have any effect.
  
devices {
    device {
        vendor "XXXXX"
        product "XXXXX Vol"
        path_checker tur
        prio alua
        path_grouping_policy group_by_prio
        features "2 pg_init_retries 50"
        hardware_handler "1 alua"
        failback immediate
        rr_weight priorities
        no_path_retry 5
        dev_loss_tmo 60
        path_selector "service-time 0"
    }
}


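If it is useful, the configuration actually in effect can be checked against the file above; a quick check, assuming a standard device-mapper-multipath install where these commands are available:

    # dump the merged configuration seen by the running multipathd
    multipathd -k"show config"

    # or print multipath's internal configuration (built-ins plus multipath.conf)
    multipath -t
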
When everything is operating nominally, "multipath -ll" shows the expected 2 paths to each LUN, with the Active/Optimized path at priority 50 and the Standby path at priority 1. Example:

mpathi (36000d31001108300000000000000003b) dm-9 XXXXX,XXXXX Vol  
size=500G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| `- 1:0:1:7  sdr 65:16 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 1:0:0:7  sdh 8:112 active ready running

/var/log/messages shows the path_checker running every ~5 seconds as expected:
Feb 24 13:18:22 localhost multipathd: sdr: path state = running
Feb 24 13:18:22 localhost multipathd: sdr: get_state
Feb 24 13:18:22 localhost multipathd: 65:16: tur checker starting up
Feb 24 13:18:22 localhost multipathd: 65:16: tur checker finished, state up
Feb 24 13:18:22 localhost multipathd: sdr: state = up
Feb 24 13:18:22 localhost multipathd: mpathi: disassemble map [3 queue_if_no_path pg_init_retries 50 1 alua 2 1 service-time 0 1 2 65:16 50 1 service-time 0 1 2 8:112 1 1 ]
Feb 24 13:18:22 localhost multipathd: mpathi: disassemble status [2 0 1 0 2 1 A 0 1 2 65:16 A 0 0 1 E 0 1 2 8:112 A 0 0 1 ]
Feb 24 13:18:22 localhost multipathd: sdr: mask = 0x8
Feb 24 13:18:22 localhost multipathd: sdr: path state = running
Feb 24 13:18:22 localhost multipathd: reported target port group is 61447
Feb 24 13:18:22 localhost multipathd: aas = 00 [active/optimized]
Feb 24 13:18:22 localhost multipathd: sdr: alua prio = 50
....
Feb 24 13:18:22 localhost multipathd: sdh: path state = running
Feb 24 13:18:22 localhost multipathd: sdh: get_state
Feb 24 13:18:22 localhost multipathd: 8:112: tur checker starting up
Feb 24 13:18:22 localhost multipathd: 8:112: tur checker finished, state up
Feb 24 13:18:22 localhost multipathd: sdh: state = up
Feb 24 13:18:22 localhost multipathd: mpathi: disassemble map [3 queue_if_no_path pg_init_retries 50 1 alua 2 1 service-time 0 1 2 65:16 50 1 service-time 0 1 2 8:112 1 1 ]
Feb 24 13:18:22 localhost multipathd: mpathi: disassemble status [2 0 1 0 2 1 A 0 1 2 65:16 A 0 0 1 E 0 1 2 8:112 A 0 0 1 ]
Feb 24 13:18:22 localhost multipathd: sdh: mask = 0x8
Feb 24 13:18:22 localhost multipathd: sdh: path state = running
Feb 24 13:18:22 localhost multipathd: reported target port group is 61461
Feb 24 13:18:22 localhost multipathd: aas = 02 [standby]
Feb 24 13:18:22 localhost multipathd: sdh: alua prio = 1
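
For completeness, the target port group and asymmetric access state that the alua prioritizer logs above can also be queried directly per path with sg_rtpg from sg3_utils (assuming sg3_utils is installed; the device names are the ones from the example above):

    # decode REPORT TARGET PORT GROUPS on the Active/Optimized path
    sg_rtpg --decode /dev/sdr

    # and on the Standby path
    sg_rtpg --decode /dev/sdh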

Now if the Active/Optimized path suddenly stops responding to I/O, causing continuous timeouts, multipath hangs, including the path checker. The hang is observed in that all multipath and multipathd commands hang, and there are no longer any path checker log messages. The messages below occur over and over again, showing the active path not responding to I/O.

Feb 24 13:39:42 localhost kernel: scsi target1:0:1: enclosure_logical_id(0x5b083fe0ead0a900), slot(4)
Feb 24 13:39:46 localhost kernel: sd 1:0:1:7: task abort: SUCCESS scmd(ffff8804661f4540)
Feb 24 13:39:46 localhost kernel: sd 1:0:1:7: attempting task abort! scmd(ffff8804661f7480)
Feb 24 13:39:46 localhost kernel: sd 1:0:1:7: [sdr] CDB: 
Feb 24 13:39:46 localhost kernel: Write(10): 2a 00 00 01 e1 c0 00 04 00 00
Feb 24 13:39:46 localhost kernel: scsi target1:0:1: handle(0x000a), sas_address(0x5000d31001108307), phy(4)
Feb 24 13:39:46 localhost kernel: scsi target1:0:1: enclosure_logical_id(0x5b083fe0ead0a900), slot(4)
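
If it would help with debugging, we can capture where multipathd is blocked while the hang is in progress; for example, using generic kernel facilities (assuming sysrq is enabled and running as root):

    # dump stack traces of all tasks in uninterruptible sleep to the kernel log
    echo w > /proc/sysrq-trigger

    # or inspect the kernel stacks of the multipathd threads directly
    for t in /proc/$(pidof multipathd)/task/*; do echo "== $t"; cat $t/stack; done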

The hang seems to persist indefinitely until the devices are offlined through some other means, at which point multipath resumes execution.
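
One way to offline a stuck path by hand (possibly equivalent to the "other means" mentioned above, assuming nothing storage-specific is required) is through sysfs, e.g. for the non-responsive path sdr from the example:

    # force the non-responsive path device offline (SCSI sysfs state attribute)
    echo offline > /sys/block/sdr/device/state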

Our expectation is that the Active/Optimized path would eventually fail and I/O would then be attempted down the Standby path. Alternatively, we would expect the path checker to keep running, in which case the TUR on the Active/Optimized path would fail and the Standby path would transition to Active/Optimized (priority 50). Is this a false expectation, and/or why does multipath hang during this time?
