Bug 210203 - Linux boot sometimes hangs around scsi_try_target_reset on a system with 10 x 1.92T SATA SSDs
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: SCSI
Hardware: All Linux
Importance: P1 normal
Assignee: linux-scsi@vger.kernel.org
Reported: 2020-11-15 06:42 UTC by wangyugui@e16-tech.com
Modified: 2020-11-17 14:58 UTC
Kernel Version: 5.4.76 and other 5.4.x
Regression: No

Attachments
sysrq-l-t-1.log (446.68 KB, text/plain)
2020-11-15 06:42 UTC, wangyugui@e16-tech.com
sysrq-l-t-2.log (444.91 KB, text/plain)
2020-11-15 06:43 UTC, wangyugui@e16-tech.com

Description wangyugui@e16-tech.com 2020-11-15 06:42:06 UTC
Created attachment 293673
sysrq-l-t-1.log

Linux boot sometimes hangs around scsi_try_target_reset on a system with 10 x 1.92T SATA SSDs.

console output:
	A start job is running for dev-disk-by\x2dlabel-OS_T604.device
		*** always waiting here

Login was not possible, so I could only gather SysRq output (saved in sysrq-l-t-1.log and sysrq-l-t-2.log):
- sysrq l command
- sysrq t command

From the SysRq output, it seems to be a problem around scsi_try_target_reset():

[  744.388123] scsi_eh_9       S    0   604      2 0x80004000
[  744.388124] Call Trace:
[  744.388125]  __schedule+0x285/0x6e0
[  744.388126]  ? scsi_try_target_reset+0x90/0x90
[  744.388127]  schedule+0x2f/0xa0
[  744.388128]  scsi_error_handler+0x1c4/0x500
[  744.388129]  ? scsi_eh_get_sense+0x220/0x220
[  744.388130]  kthread+0x112/0x130
[  744.388131]  ? kthread_park+0x80/0x80
[  744.388132]  ret_from_fork+0x1f/0x40
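
One way to sift the attached SysRq-t dumps is to pull out the per-task header lines and group tasks by state. The following is only a minimal sketch, assuming every header has the same layout as the scsi_eh_9 line above (command name, state letter, free stack, PID, parent PID, flags). The names TASK_RE, parse_tasks and tasks_in_state are made up for this illustration and are not part of any tool.

```python
import re

# Task header as it appears in the trace above, for example:
# "[  744.388123] scsi_eh_9       S    0   604      2 0x80004000"
# Assumed field order: comm, state, free stack, pid, ppid, flags.
TASK_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<comm>\S.*?)\s+(?P<state>[A-Za-z])\s+"
    r"(?P<free>\d+)\s+(?P<pid>\d+)\s+(?P<ppid>\d+)\s+0x(?P<flags>[0-9a-fA-F]+)\s*$"
)


def parse_tasks(lines):
    """Return one dict per task header line; all other lines are skipped."""
    tasks = []
    for line in lines:
        m = TASK_RE.match(line.rstrip("\n"))
        if m:
            tasks.append({
                "comm": m["comm"],
                "state": m["state"],
                "pid": int(m["pid"]),
                "ppid": int(m["ppid"]),
            })
    return tasks


def tasks_in_state(tasks, state):
    """Filter parsed tasks by their single-letter state (e.g. 'D' or 'S')."""
    return [t for t in tasks if t["state"] == state]


if __name__ == "__main__":
    # Hypothetical usage against one of the attached logs.
    with open("sysrq-l-t-1.log", errors="replace") as fh:
        for task in tasks_in_state(parse_tasks(fh), "D"):
            print(task["pid"], task["comm"])
```

Tasks in state D (uninterruptible sleep) are usually the first candidates to inspect for a hang. Note that scsi_eh_9 above is in state S, and that call-trace entries prefixed with "?" are addresses found on the stack that the unwinder could not confirm as part of the call chain.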


Frequency: about 10% of boots
   It has only happened on one server, which uses btrfs RAID0 across 10 SSDs.
   It has not yet happened on two other servers running the same kernel.

OS: CentOS 7.8/7.9
kernel version:
	5.4.76, 5.4.74, 5.4.73, and other 5.4.x

# lsscsi
[14:0:0:0]   disk    ATA      MK1920GFDKU      HPG0  /dev/sda
[14:0:1:0]   disk    ATA      MK1920GFDKU      HPG0  /dev/sdb
[14:0:2:0]   disk    ATA      MK1920GFDKU      HPG0  /dev/sdc
[14:0:3:0]   disk    ATA      MK1920GFDKU      HPG0  /dev/sdd
[14:0:4:0]   disk    ATA      MK1920GFDKU      HPG0  /dev/sde
[14:0:6:0]   disk    ATA      SAMSUNG MZ7KM1T9 003Q  /dev/sdf
[14:0:7:0]   disk    ATA      SAMSUNG MZ7KM1T9 003Q  /dev/sdg
[14:0:8:0]   disk    ATA      SAMSUNG MZ7KM1T9 003Q  /dev/sdh
[14:0:9:0]   disk    ATA      SAMSUNG MZ7KM1T9 003Q  /dev/sdi
[14:0:10:0]  disk    ATA      SAMSUNG MZ7KM1T9 003Q  /dev/sdj
Comment 1 wangyugui@e16-tech.com 2020-11-15 06:43:51 UTC
Created attachment 293675
sysrq-l-t-2.log
Comment 2 wangyugui@e16-tech.com 2020-11-15 08:22:11 UTC
Or maybe it is a problem in megaraid_sas.ko?

This is a Dell PERC H730P (Broadcom / LSI MegaRAID SAS-3 3108) with the latest firmware from Dell.

# modinfo megaraid_sas
filename:       /lib/modules/5.4.76-1.2.el7.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko.xz
description:    Broadcom MegaRAID SAS Driver
author:         megaraidlinux.pdl@broadcom.com
version:        07.710.50.00-rc1
Comment 3 wangyugui@e16-tech.com 2020-11-17 14:58:21 UTC
I tested it with 5.10-rc4 too.

5.10-rc4 always failed to boot on CentOS 7.9, but with x-systemd.device-timeout=120 it boots, so we can log in and investigate further.

blkid output shows that megaraid_sas works well,

so it may be a problem in systemd or D-Bus on CentOS 7.9? Although it is not yet fully resolved, let's close this for now.
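
For reference, x-systemd.device-timeout= is a mount option that systemd reads from the options field of an /etc/fstab entry. The report does not show the fstab line that was changed, so the sketch below is only an illustration: the helper name add_mount_option and the mount point /mnt/data are hypothetical, while the label OS_T604 and the btrfs filesystem type come from the description above.

```python
def add_mount_option(fstab_line, option):
    """Return fstab_line with `option` set in its fourth (options) field.

    Comment lines and lines with fewer than four fields are returned unchanged.
    An existing option with the same key (text before '=') is replaced.
    """
    fields = fstab_line.split()
    if len(fields) < 4 or fields[0].startswith("#"):
        return fstab_line
    key = option.split("=", 1)[0]
    opts = [o for o in fields[3].split(",") if o.split("=", 1)[0] != key]
    opts.append(option)
    fields[3] = ",".join(opts)
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical entry: only the label and filesystem type are from the report.
    entry = "LABEL=OS_T604 /mnt/data btrfs defaults 0 0"
    print(add_mount_option(entry, "x-systemd.device-timeout=120"))
```

With this option systemd waits up to 120 seconds for the device to appear before giving up on the mount, which matches the "A start job is running for dev-disk-by\x2dlabel-OS_T604.device" wait seen on the console.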
