Bug 218459 - MD RAID1 hangs during boot (when starting MD arrays)
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: i386 Linux
Importance: P3 normal
Assignee: io_md
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-04 21:07 UTC by Matthew Perkowski
Modified: 2024-03-10 01:38 UTC
CC List: 3 users

See Also:
Kernel Version: 6.7.3
Subsystem:
Regression: Yes
Bisected commit-id: 1b0a2d950ee2a54aa04fb31ead32144be0bbf690


Attachments
Screenshot of hung task message (855.00 KB, image/png)
2024-02-04 21:07 UTC, Matthew Perkowski

Description Matthew Perkowski 2024-02-04 21:07:50 UTC
Created attachment 305822
Screenshot of hung task message

The root volume is an MD RAID1 array consisting of two SATA devices. Boot seems to hang indefinitely (screenshot of log attached). Bisection identified commit 1b0a2d950ee2a54aa04fb31ead32144be0bbf690 as the first appearance of the problem. All kernels I've tried prior to that commit start the array and then mount the volume without issue.
Comment 1 The Linux kernel's regression tracker (Thorsten Leemhuis) 2024-02-06 14:47:49 UTC
There is a patch that was meant to fix something in 1b0a2d950ee2 but was never applied: https://lore.kernel.org/all/20231221071109.1562530-3-linan666@huaweicloud.com/

I just asked what's up there and pointed developers here.
Comment 2 Matthew Perkowski 2024-02-06 15:06:19 UTC
I'm quite happy to provide additional information or try out fixes myself when I am able to do so. I'll see if I can give that patch a try in the near future and report back as to whether it seems to help.
Comment 3 Song Liu 2024-02-06 16:59:37 UTC
We had some discussions on that patch set. Matthew, could you please try with that set and see whether it fixes the problem?

https://patchwork.kernel.org/project/linux-raid/list/?series=812045
Comment 4 Matthew Perkowski 2024-02-06 18:00:48 UTC
The patch did not seem to affect the problem. The boot process hung as it did before, eventually indicating a hung task with the same stack context that I had previously observed.
Comment 5 Nan 2024-02-07 01:51:59 UTC
It seems that mddev_suspend_and_lock is waiting for I/O to complete. Are any other processes hung?

Can you provide commands for triggering this issue? I will try to replicate this issue in my environment.
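
One way to answer the "other processes hung" question is to look for tasks in uninterruptible sleep (state `D`), which is the state the kernel's hung task detector reports on. The following is only a minimal sketch, assuming a userspace with Python is still reachable (for example on a reproduction setup where the root file system is not on the affected array). The helper names `parse_stat` and `blocked_tasks` are made up for illustration.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left])
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc="/proc"):
    """List (pid, comm) for entries directly under proc that are in state 'D'."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "mdstat"
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited mid-scan or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Note that this only walks the top-level PID directories, so individual threads under `/proc/<pid>/task/` are not inspected.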
Comment 6 Matthew Perkowski 2024-02-07 16:30:13 UTC
I'm not sure. It's happening during boot, presumably when the md driver is loaded but before the root file system is mounted (which is on one of the md volumes itself). As such, I don't have many straightforward ways to extract more information about the system's state at that point. I'll give it some thought and see if I can find a way to glean more. Perhaps I'll try to reproduce the whole configuration on a different set of hardware to rule out issues involving drivers or other parts of the kernel. Knowing that it's apparently a matter of I/O that never completes might give me a direction to look, too.

I'll try to provide more information soon if I'm able to gather any.
Comment 7 Song Liu 2024-02-14 01:41:18 UTC
Hi Matthew, 

Could you please check whether patches 1/14 through 5/14 of this set fix this issue?

https://patchwork.kernel.org/project/linux-raid/list/?series=822030

They should apply on the linux-6.7.y branch of the stable tree. Alternatively, you can use the md-6.7-fix branch from the md tree:

https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-6.7-fix

Thanks,
Song
Comment 8 Song Liu 2024-02-14 02:45:44 UTC
Actually, it is probably not enough. I will test more. 

Thanks,
Song
Comment 9 Song Liu 2024-02-15 00:29:42 UTC
OK, now md-6.7-fix passes my tests. 

Matthew, could you please give it a try?

https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=md-6.7-fix 

Thanks,
Song
Comment 10 Matthew Perkowski 2024-02-15 18:02:37 UTC
I built and tested your md-6.7-fix on my hardware and experienced no problems. System booted normally.
Comment 11 Matthew Perkowski 2024-03-03 00:11:26 UTC
Confirmed that I'm no longer experiencing the issue as of 6.7.7.
