Bug 218200

Summary: MD assemble raid 6 lockup, IO almost impossible
Product: IO/Storage
Reporter: Janpieter Sollie (janpieter.sollie)
Component: MD
Assignee: io_md
Status: RESOLVED CODE_FIX
Severity: normal
CC: sam, song, yukuai3
Priority: P3
Hardware: All
OS: Linux
Kernel Version: 6.7-rc3
Regression: Yes
Bisected commit-id: bc08041b32abe6c9824f78735bac22018eabfc06
Attachments: mdadm -D output of raid-6 drives
cat /proc/mdstat
lsblk output
kernel .config
kernel dmesg
bisect kernel config
decoded stacktrace with ubsan
git bisect log

Description Janpieter Sollie 2023-11-28 12:15:28 UTC
Created attachment 305492 [details]
mdadm -D output of raid-6 drives

This is an issue I encountered in -rc2 as well, but my focus back then was on another md problem, so I'm coming back to this one:
> - /proc/mdstat shows my raid-6 arrays are up-and running.
> - top shows a load of 3.0, but no system, idle, waiting, whatsoever.
> - arrays are very slow
The load disappears when stopping all 3 raid-6 arrays.
When re-assembling them one by one, the load rises again, 0 -> 3.
When writing "check" into md/sync_action, I get this failure:
> bash: echo: write error: Device or resource busy
I can read a few MB from the array (dd if=/dev/md0 of=/dev/null), but the result on 4 SAS drives is 500 KB/s (I did that twice with different skip= values to avoid the spin-up delay).  When using dd to benchmark individual members of the array, the speed is ~150 MB/s.
I did not verify the integrity of the data.
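For reference, the commands behind the observations above were roughly the following (a sketch; /dev/md0 and /dev/sdX are placeholders for the devices listed in the attachments, and the bs/count/skip values are arbitrary):

# trigger a check - on 6.7-rc3 this fails with "Device or resource busy"
echo check > /sys/block/md0/md/sync_action

# sequential read from the array: ~500 KB/s here
dd if=/dev/md0 of=/dev/null bs=1M count=64 skip=4096

# the same read from an individual member: ~150 MB/s
dd if=/dev/sdX of=/dev/null bs=1M count=64 skip=4096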
I built the kernel with
> - basic lock checking
> - detect soft lockups, hard lockups, hung tasks
> - verbose driver info
> - verbose scsi info
... but nothing appears in dmesg.
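For reference, those options roughly correspond to the following .config symbols (my mapping, not an exact list; the attached .config is authoritative):

grep -E 'CONFIG_(DEBUG_SPINLOCK|DEBUG_MUTEXES|SOFTLOCKUP_DETECTOR|HARDLOCKUP_DETECTOR|DETECT_HUNG_TASK|DEBUG_DRIVER|SCSI_LOGGING)=y' .config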
In the attachments:
- mdadm output of 3 raid-6 drives
- cat /proc/mdstat
- lsblk -o name,fstype,uuid 
- dmesg with extra scsi + verbose dmesg output
- kernel .config
Comment 1 Janpieter Sollie 2023-11-28 12:17:17 UTC
Created attachment 305493 [details]
cat /proc/mdstat
Comment 2 Janpieter Sollie 2023-11-28 12:18:12 UTC
Created attachment 305494 [details]
lsblk output
Comment 3 Janpieter Sollie 2023-11-28 12:20:36 UTC
Created attachment 305495 [details]
kernel .config
Comment 4 Janpieter Sollie 2023-11-28 12:21:15 UTC
Created attachment 305496 [details]
kernel dmesg
Comment 5 Kuai Yu 2023-11-30 02:17:36 UTC
Thanks for the report! However, we can't determine the root cause yet based on your description.

Can you reproduce this problem reliably? If so, can you provide the exact reproduction steps? And it'll be great if you can bisect to find the blamed commit.
Comment 6 Janpieter Sollie 2023-11-30 13:28:24 UTC
(In reply to Kuai Yu from comment #5)
> Thanks for the report! However, we can't determine the root cause yet
> based on your description.
> 
I was afraid of that: I have no clue what's going on either.  
A lot of work in the md block layer has happened since v6.6
> Can you reproduce this problem reliably?
Yes
> If so, can you provide the exact reproduction steps? 
- boot a v6.7-rc2 kernel
- start an md array, the load goes +1
- start a 2nd md array, the load goes +1
...
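A minimal command sequence for the above (a sketch; the md names are placeholders for the arrays in /proc/mdstat):

mdadm --assemble /dev/md0
cat /proc/loadavg     # gradually rises towards 1.00
mdadm --assemble /dev/md1
cat /proc/loadavg     # now rises towards 2.00
mdadm --stop /dev/md0
mdadm --stop /dev/md1
cat /proc/loadavg     # drops back towards 0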
I also tried to disable the raid-6 journal, but that didn't solve it.
I know inactive arrays don't cause the extra load,
and neither do read-only ones,
but as soon as I turn an array read-write, the load gradually rises to 1.00.
I also have a raid-0 array; no issue there either.
If needed, I can also try to create a new raid-6 array, to see whether recreating it would solve the problem.
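The read-only/read-write toggle I mention above is simply (again with a placeholder array name):

mdadm --readonly /dev/md0      # load stays down
mdadm --readwrite /dev/md0     # load gradually climbs to 1.00 again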
> And it'll be great if you can bisect to find the blamed
> commit.
I can try, but it would be great to have a hint on how to bisect only md-related commits; the commit list between 6.6 and 6.7 is quite extensive.
Comment 7 Janpieter Sollie 2023-11-30 20:50:30 UTC
This might be related (but not necessarily; let me know if it's another bug):
When removing the journal device (/dev/nvme2n1p2):
> - fail device
> - remove device
> - zero-superblock device
> - set consistency policy to resync
and then trying to re-add it using:
> mdadm --manage /dev/md3 --readonly --add-journal /dev/nvme2n1p2
> mdadm: Failed to hot add /dev/nvme2n1p2 as journal, please try restart
> /dev/md3.
and dmesg says:
> md: nvme2n1p2 does not have a valid v1.2 superblock, not importing!
> md: md_import_device returned -22
Even re-assembling didn't help.
I rebooted into 6.6 and the journal was back online, no issue at all, without even re-adding it (so the add command on 6.7 was partially completed despite the error).
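For the record, the removal sequence above was roughly the following (a sketch from memory; the sysfs path assumes the array is md3):

mdadm --manage /dev/md3 --fail /dev/nvme2n1p2
mdadm --manage /dev/md3 --remove /dev/nvme2n1p2
mdadm --zero-superblock /dev/nvme2n1p2
echo resync > /sys/block/md3/md/consistency_policy
# and then the re-add that fails on 6.7:
mdadm --manage /dev/md3 --readonly --add-journal /dev/nvme2n1p2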
Comment 8 Song Liu 2023-11-30 23:53:52 UTC
Hi Janpieter,

(In reply to Janpieter Sollie from comment #6)
> (In reply to Kuai Yu from comment #5)
> > Thanks for the report! However, we can't determine the root cause yet
> > based on your description.
> > 
> I was afraid of that: I have no clue what's going on either.  
> A lot of work in the md block layer has happened since v6.6
> > Can you reproduce this problem reliably?
> Yes
> > If so, can you provide the exact reproduction steps? 
> - boot a v6.7-rc2 kernel
> - start an md array, the load goes +1
> - start a 2nd md array, the load goes +1

Do we know what is causing the load to go up? top should show it. 

Also, do we have some file system on the array? IIRC, you have some bcachefs?

Thanks!
Comment 9 Kuai Yu 2023-12-01 01:31:21 UTC
(In reply to Janpieter Sollie from comment #6)
> > Can you reproduce this problem reliably?
> Yes
> > If so, can you provide the exact reproduction steps? 
> - boot a v6.7-rc2 kernel
> - start an md array, the load goes +1
> - start a 2nd md array, the load goes +1
> ...

I tried to start several raid6 arrays backed by ramdisks in my VM, but I can't reproduce this.

> > And it'll be great if you can bisect to find the blamed
> > commit.
> I can try, but it would be great to have a hint for how to bisect only
> md-related commits. the commit list of 6.6 - 6.7 is quite exhaustive

You can try this:

git bisect start drivers/md/md.[ch] drivers/md/raid5*
git bisect good v6.6
git bisect bad v6.7-rc2

Bisecting: 19 revisions left to test after this (roughly 4 steps)
[2e82248b70f48a98f3720cdaa8ea7a4e7f8bd760] md: replace is_md_suspended() with 'mddev->suspended' in md_check_recovery()
Comment 10 Janpieter Sollie 2023-12-01 09:20:41 UTC
(In reply to Song Liu from comment #8)
> Hi Janpieter,
> ... 
> Do we know what is causing the load to go up? top should show it. 

top shows nothing: no process taking CPU time, no user load, no system load, no iowait, ... absolutely nothing suspicious.
In the upper right corner:
> load average: 3.13, 3.06, 3.02
And the other load statistics:
> %Cpu(s):  0.0 us,  0.0 sy,  0.0 ni, 100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
So I thought there would be some locking issue, but that's clearly not the case, as lock checking would have detected it.
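For what it's worth, the load average also counts tasks in uninterruptible (D-state) sleep, which never show up as CPU time; one way to look for those would be:

ps -eo state,pid,comm | awk '$1 == "D"'    # list tasks stuck in D state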

> Also, do we have some file system on the array? IIRC, you have some bcachefs?

I'm indeed using bcachefs, but at the time of reporting the filesystems are not mounted.  The arrays are up, but nothing is mounted on them.

> 
> Thanks!

Let me know if I can provide more info.  In the meantime, I'll try Kuai Yu's hints about bisecting it.
Comment 11 Janpieter Sollie 2023-12-01 10:37:40 UTC
Created attachment 305524 [details]
bisect kernel config

I created a special config for bisecting the kernel;
booting it already showed some issues with md (but mostly in net).
I'll post the decoded stacktrace in the next attachment.
Comment 12 Janpieter Sollie 2023-12-01 10:39:09 UTC
Created attachment 305525 [details]
decoded stacktrace with ubsan
Comment 13 Janpieter Sollie 2023-12-01 20:15:36 UTC
So I performed a bisect following Kuai's suggestions.
I compiled the kernel with the md subsystem as modules, and used the following procedure to install the modules:
- stop all arrays
- modprobe -r raid456 md-mod
- make modules_install
To verify functionality (roughly scripted in the sketch below):
1) stop all arrays, wait until the load drops below 1.00
2) mdadm --assemble /dev/md0, watch top
3) after assembly, if the load drops back to < 1.00 -> GOOD
4) if the load reaches 1.00, stop the array; if the load then drops back to < 1.00 -> BAD
5) otherwise, assume a third-party failure, reboot the PC and try steps 1-4 again
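Roughly, each bisect step then looked like this (a sketch; the array name and the load check are simplified):

# in the kernel tree, after "git bisect good/bad" checked out the next candidate
make -j"$(nproc)" modules
mdadm --stop --scan              # stop all arrays
modprobe -r raid456 md-mod
make modules_install
mdadm --assemble /dev/md0
watch -n 5 cat /proc/loadavg     # watch whether the load creeps up to 1.00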

According to git bisect, the commit introducing the issue is bc08041b32abe6c9824f78735bac22018eabfc06

I'll attach the git bisect log
Comment 14 Janpieter Sollie 2023-12-01 20:16:20 UTC
Created attachment 305526 [details]
git bisect log
Comment 15 Kuai Yu 2023-12-02 02:37:07 UTC
Thanks for the bisect, that helps a lot!

I think I probably know what is happening now... The steps are as follows:

1) someone sets MD_RECOVERY_NEEDED first;
2) the daemon thread grabs reconfig_mutex, then clears MD_RECOVERY_NEEDED and queues new sync work;
3) in md_start_sync:
   a) it sees that there are spares that can be added/removed, then suspends the array;
   b) remove_and_add_spares() may not be called, or may be called without really adding/removing spares;
   c) it resumes the array, which sets MD_RECOVERY_NEEDED again!
4) the daemon thread releases reconfig_mutex;
5) steps 2-4 loop forever, so mddev_suspend() gets called very often; that's why IO is so slow.
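If you want to double-check this from userspace, function tracing on mddev_suspend() should show it firing over and over while the array is otherwise idle (assuming ftrace is enabled and the symbol appears in available_filter_functions; on older setups the tracefs path may be /sys/kernel/debug/tracing instead):

cd /sys/kernel/tracing
echo mddev_suspend > set_ftrace_filter
echo function > current_tracer
cat trace_pipe                   # a steady stream of mddev_suspend hits = the loop above
echo nop > current_tracer        # stop tracing afterwards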
Comment 16 Kuai Yu 2023-12-02 02:41:42 UTC
Hi Janpieter Sollie, this might not be the final fix, but can you help confirm my analysis with the following patch?

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4b1e8007dd15..43ae90725947 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -515,7 +515,6 @@ void mddev_resume(struct mddev *mddev)
        percpu_ref_resurrect(&mddev->active_io);
        wake_up(&mddev->sb_wait);

-       set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
        md_wakeup_thread(mddev->thread);
        md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
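To test it, reusing your module reload procedure from comment 13 should be enough, for example (a sketch, assuming md and raid456 are built as modules):

# in the patched kernel tree
make -j"$(nproc)" modules
mdadm --stop --scan
modprobe -r raid456 md-mod
make modules_install
mdadm --assemble --scan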
Comment 17 Janpieter Sollie 2023-12-02 11:36:47 UTC
Seems to work!
(In reply to Kuai Yu from comment #16)
> Hi Janpieter Sollie, this might not be the final fix, but can you help
> confirm my analysis with the following patch?
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 4b1e8007dd15..43ae90725947 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -515,7 +515,6 @@ void mddev_resume(struct mddev *mddev)
>         percpu_ref_resurrect(&mddev->active_io);
>         wake_up(&mddev->sb_wait);
> 
> -       set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
>         md_wakeup_thread(mddev->thread);
>         md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape
> */

Indeed, that fixes it: both the performance and the load issues.
Comment 18 Kuai Yu 2023-12-04 01:02:36 UTC
(In reply to Janpieter Sollie from comment #17)
> Seems to work!
> (In reply to Kuai Yu from comment #16)
> > Hi Janpieter Sollie, this might not be the final fix, but can you help
> > confirm my analysis with the following patch?
> > 
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 4b1e8007dd15..43ae90725947 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -515,7 +515,6 @@ void mddev_resume(struct mddev *mddev)
> >         percpu_ref_resurrect(&mddev->active_io);
> >         wake_up(&mddev->sb_wait);
> > 
> > -       set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> >         md_wakeup_thread(mddev->thread);
> >         md_wakeup_thread(mddev->sync_thread); /* possibly kick off a
> reshape
> > */
> 
> Indeed, that fixes it: both the performance and the load issues.

Got it! I will work on a formal fix based on this soon.

Thanks