Bug 199663 - Can't find root with scsi_blk_mq
Summary: Can't find root with scsi_blk_mq
Status: RESOLVED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Other (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: io_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-09 05:27 UTC by Randy DuCharme
Modified: 2018-05-19 04:12 UTC (History)
1 user (show)

See Also:
Kernel Version: 4.16.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output (84.75 KB, text/plain)
2018-05-11 22:36 UTC, Randy DuCharme
Details
lspci -k output (6.84 KB, text/plain)
2018-05-11 22:37 UTC, Randy DuCharme
Details

Description Randy DuCharme 2018-05-09 05:27:38 UTC
I'm not sure if this is the correct group for this:

The system fails to boot as of 4.16.7 with kernel cmd line args scsi_mod.use_blk_mq=y ( or 1 ) .  It it reports  ERROR: device 'UUID=xxx...' not found. Skipping fsck  and drops me into an emergency shell.  It also does the same with the 4.17-rc4 kernel source from which I captured this...

ERROR Failed to mount the real root device.
Bailing out, you are on your own. Good luck.
sh: can't access tty: job control turned off
[rootfs ]# [  131.408363] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
[  131.408363]
[  131.408518] CPU: 6 PID: 1 Comm: sh Not tainted 4.17.0-rc4-ARCH-EXPERIMENTAL #1
[  131.408670] Hardware name: To be filled by O.E.M.  To be filled by O.E.M./CROSSHAIR V FORMULA-Z, BIOS 2201 03/23/3015
[  131.408848] Call Trace:
[  131.408894]  dump_stack+0x5c/0x80
[  131.408952]  panic+0xe7/0x247
[  131.409004]  do_exit.cold.4+0x1e/0xd3
[  131.409067]  ?  do_group_exit+0x33/0xa0
[  131.409131]  ?  __x64_sys_exit_group+0x14/0x20
[  131.409205]  ?  do_syscall_64+0x5b/0x170
[  131.409271]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  131.409400]  Kernel Offset: 0x10000000 from 0xffffffff81000000 (relocation range: 0xffffffff8000000000-0xffffffffbfffffff)
[  131.409582]  ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000

Kernels prior to 4.16.7 behaved just fine.  The 4.16 and 4.17 kernels boot fine with scsi_mod.use_blk_mq=n.  In the emergency shell I can mount any of the partitions on any of the devices.

System is the Crosshair V (as above) with 32G of RAM, FX9590 processor, LSI-9260-8I RAID controller is installed but the primariy disk is a 2T Seagate Barracuda on the first SATA port.  

I've not done any kernel debugging myself but I may if for nothing more that academic reasons.  Wish I were able to be of more help.
Comment 1 Lei Ming 2018-05-11 13:17:00 UTC
Could you provide your root disk info?  And the whole dmesg info.

Like which kind of drive, what is the driver, ...
Comment 2 Randy DuCharme 2018-05-11 22:36:47 UTC
Created attachment 275933 [details]
dmesg output
Comment 3 Randy DuCharme 2018-05-11 22:37:27 UTC
Created attachment 275935 [details]
lspci -k output
Comment 4 Randy DuCharme 2018-05-11 22:38:06 UTC
(In reply to Lei Ming from comment #1)
> Could you provide your root disk info?  And the whole dmesg info.
> 
> Like which kind of drive, what is the driver, ...

I discovered I could get it to boot by manually mounting /dev/sdc3 to /new_root/ and exiting the emergency shell.  It takes nearly 30 seconds however for any of the sdX devices to become available and visible in /dev with blk-mq enabled.  I've tried preloading all modules with grub, and in the initramfs but that didn't seem to help any.  Odd.  

System's primary drive is a Seagate 2TB Barracuda and I have 2 RAID-0 arrays configured on an LSI9260-8I controller with 3 Western Digital Raptors in each.

I've attached the dmesg -T and lspci -k outputs hoping it'll help
Comment 5 Lei Ming 2018-05-12 02:28:54 UTC
From the following dmesg log you provided, looks everything is fine.

Maybe some blk-mq sysfs related change breaks userspace, could you test
if commit bf9ae8c5325c0070d0 is good?

If yes, please test commit fa70d2e2c4a0a54ced9 further and see if it is
introduced by Mike's change.


[Fri May 11 17:24:06 2018] scsi 0:2:0:0: Direct-Access     LSI      MR9260-8i        2.13 PQ: 0 ANSI: 5
[Fri May 11 17:24:06 2018] scsi 0:2:1:0: Direct-Access     LSI      MR9260-8i        2.13 PQ: 0 ANSI: 5
[Fri May 11 17:24:10 2018] random: crng init done
[Fri May 11 17:24:20 2018] scsi 1:0:0:0: Direct-Access     ATA      ST2000DM001-1CH1 CC29 PQ: 0 ANSI: 5
[Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
[Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] 4096-byte physical blocks
[Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] 2927099904 512-byte logical blocks: (1.50 TB/1.36 TiB)
[Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] 4096-byte physical blocks
[Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] Write Protect is off
[Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] 2927099904 512-byte logical blocks: (1.50 TB/1.36 TiB)
[Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] Write Protect is off
[Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] 4096-byte physical blocks
[Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] Mode Sense: 1f 00 00 08
[Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] Write Protect is off
[Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 00 08
[Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[Fri May 11 17:24:20 2018]  sdb: sdb1
[Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] Attached SCSI disk
[Fri May 11 17:24:20 2018] scsi 3:0:0:0: CD-ROM            PIONEER  BD-RW   BDR-209D 1.20 PQ: 0 ANSI: 5
[Fri May 11 17:24:20 2018]  sda: sda1
[Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] Attached SCSI disk
[Fri May 11 17:24:20 2018]  sdc: sdc1 sdc2 sdc3 sdc4
[Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] Attached SCSI disk
[Fri May 11 17:24:20 2018] sr 3:0:0:0: [sr0] scsi3-mmc drive: 125x/125x writer cd/rw xa/form2 cdda tray
[Fri May 11 17:24:20 2018] cdrom: Uniform CD-ROM driver Revision: 3.20
[Fri May 11 17:24:20 2018] sr 3:0:0:0: Attached scsi CD-ROM sr0
[Fri May 11 17:24:21 2018] ata5: SATA link down (SStatus 0 SControl 300)
[Fri May 11 17:24:21 2018] ata7: SATA link down (SStatus 0 SControl 300)
[Fri May 11 17:24:28 2018] EXT4-fs (sdc3): mounted filesystem with ordered data mode. Opts: (null)
Comment 6 Randy DuCharme 2018-05-12 02:53:13 UTC
(In reply to Lei Ming from comment #5)
> From the following dmesg log you provided, looks everything is fine.
> 
> Maybe some blk-mq sysfs related change breaks userspace, could you test
> if commit bf9ae8c5325c0070d0 is good?
> 
> If yes, please test commit fa70d2e2c4a0a54ced9 further and see if it is
> introduced by Mike's change.
> 
> 
> [Fri May 11 17:24:06 2018] scsi 0:2:0:0: Direct-Access     LSI     
> MR9260-8i        2.13 PQ: 0 ANSI: 5
> [Fri May 11 17:24:06 2018] scsi 0:2:1:0: Direct-Access     LSI     
> MR9260-8i        2.13 PQ: 0 ANSI: 5
> [Fri May 11 17:24:10 2018] random: crng init done
> [Fri May 11 17:24:20 2018] scsi 1:0:0:0: Direct-Access     ATA     
> ST2000DM001-1CH1 CC29 PQ: 0 ANSI: 5
> [Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] 3907029168 512-byte logical
> blocks: (2.00 TB/1.82 TiB)
> [Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] 4096-byte physical blocks
> [Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] 2927099904 512-byte logical
> blocks: (1.50 TB/1.36 TiB)
> [Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] 4096-byte physical blocks
> [Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] Write Protect is off
> [Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] Mode Sense: 00 3a 00 00
> [Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] Write cache: enabled, read
> cache: enabled, doesn't support DPO or FUA
> [Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] 2927099904 512-byte logical
> blocks: (1.50 TB/1.36 TiB)
> [Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] Write Protect is off
> [Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] 4096-byte physical blocks
> [Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] Mode Sense: 1f 00 00 08
> [Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] Write Protect is off
> [Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] Mode Sense: 1f 00 00 08
> [Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] Write cache: enabled, read
> cache: enabled, doesn't support DPO or FUA
> [Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] Write cache: enabled, read
> cache: enabled, doesn't support DPO or FUA
> [Fri May 11 17:24:20 2018]  sdb: sdb1
> [Fri May 11 17:24:20 2018] sd 0:2:1:0: [sdb] Attached SCSI disk
> [Fri May 11 17:24:20 2018] scsi 3:0:0:0: CD-ROM            PIONEER  BD-RW  
> BDR-209D 1.20 PQ: 0 ANSI: 5
> [Fri May 11 17:24:20 2018]  sda: sda1
> [Fri May 11 17:24:20 2018] sd 0:2:0:0: [sda] Attached SCSI disk
> [Fri May 11 17:24:20 2018]  sdc: sdc1 sdc2 sdc3 sdc4
> [Fri May 11 17:24:20 2018] sd 1:0:0:0: [sdc] Attached SCSI disk
> [Fri May 11 17:24:20 2018] sr 3:0:0:0: [sr0] scsi3-mmc drive: 125x/125x
> writer cd/rw xa/form2 cdda tray
> [Fri May 11 17:24:20 2018] cdrom: Uniform CD-ROM driver Revision: 3.20
> [Fri May 11 17:24:20 2018] sr 3:0:0:0: Attached scsi CD-ROM sr0
> [Fri May 11 17:24:21 2018] ata5: SATA link down (SStatus 0 SControl 300)
> [Fri May 11 17:24:21 2018] ata7: SATA link down (SStatus 0 SControl 300)
> [Fri May 11 17:24:28 2018] EXT4-fs (sdc3): mounted filesystem with ordered
> data mode. Opts: (null)


I'll surely test but will not be able to until Sunday afternoon.  I'll not be back at the machine until then.  Will post my findings.  Thanks!!!
Comment 7 Randy DuCharme 2018-05-16 04:33:33 UTC
So far nothing I've done with those commits has been very helpful.  I've made some filesystem changes to get some more space so I can install some more tools.  I'm a fairly seasoned C/C++ developer (nearly 20 years) but have little experience with Linux kernel hacking (save a little Android work) and debugging but I guess I'll try to jump into it and see if I can figure it out. Clearly it's not a widespread issue. Any links or resources you can point me to to "fast track" getting up to speed with Linux kernel internals and debugging would be welcome.  I've got a few and found a few more.  This'll likely disappear before I get anywhere with it tho'.
Comment 8 Lei Ming 2018-05-17 02:16:55 UTC
(In reply to Randy DuCharme from comment #7)
> So far nothing I've done with those commits has been very helpful.  I've
> made some filesystem changes to get some more space so I can install some
> more tools.  I'm a fairly seasoned C/C++ developer (nearly 20 years) but
> have little experience with Linux kernel hacking (save a little Android
> work) and debugging but I guess I'll try to jump into it and see if I can
> figure it out. Clearly it's not a widespread issue. Any links or resources
> you can point me to to "fast track" getting up to speed with Linux kernel
> internals and debugging would be welcome.  I've got a few and found a few
> more.  This'll likely disappear before I get anywhere with it tho'.

Actually, one approach I suggest is to use 'git bisect' to find which
commit causes this issue. And it is pretty effective.

Thanks,
Comment 9 Randy DuCharme 2018-05-19 04:12:06 UTC
So, it would seem I've wasted some people's time.  Problem is not a kernel problem at all.  I migrated the system to a larger hard disk and somehow in the process resolved the issue - yet it's odd that Arch release kernels prior to 4.16.7 didn't exhibit the problem on the other disk.  I wish I knew what happened but many apologies for the wasted bandwidth and time.  I'm marking this resolved/invalid.

Note You need to log in before you can comment on or make changes to this bug.