Bug 216343 - XFS: no space left in xlog cause system hang
Summary: XFS: no space left in xlog cause system hang
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: ARM Linux
Importance: P1 high
Assignee: FileSystem/XFS Default Virtual Assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-08-09 11:46 UTC by zhoukete
Modified: 2023-09-21 06:59 UTC
CC List: 0 users

See Also:
Kernel Version: 5.10.38
Subsystem:
Regression: No
Bisected commit-id:


Attachments
stack (3.94 KB, text/plain)
2022-08-09 11:46 UTC, zhoukete

Description zhoukete 2022-08-09 11:46:23 UTC
Created attachment 301539 [details]
stack

1. Cannot log in with ssh; the system hung and nothing could be done.
2. dmesg reported 'audit: audit_backlog=41349 > audit_backlog_limit=8192'.
3. I sent sysrq-crash and got a vmcore file. I don't know how to reproduce it.

Following is my analysis of the vmcore:

The reason tty cannot log in is that pid 2021571 holds the acct_process mutex, and 2021571 cannot release the mutex because it is waiting for the xlog to release space. See the stack info in the attachment stack.txt.

So I tried to figure out what happened to the xlog.

crash> struct xfs_ail.ail_target_prev,ail_target,ail_head 0xffff00ff884f1000 
  ail_target_prev = 0xe9200058600
  ail_target = 0xe9200058600
  ail_head = {
    next = 0xffff0340999a0a80, 
    prev = 0xffff020013c66b40
  }

There are 112 log items in the AIL list:
crash> list 0xffff0340999a0a80 | wc -l
112 

79 of them are inode log items, 30 of them are buf log items.
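
(For reference, a breakdown like this can be read out of the dump by listing li_ops for each entry; a possible way to do it, assuming the li_ops symbol names xfs_inode_item_ops and xfs_buf_item_ops used further down:)

crash> list -s xfs_log_item.li_ops 0xffff0340999a0a80 | grep -c xfs_inode_item_ops
crash> list -s xfs_log_item.li_ops 0xffff0340999a0a80 | grep -c xfs_buf_item_ops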

crash> xfs_log_item.li_flags,li_lsn 0xffff0340999a0a80 -x 
  li_flags = 0x1
  li_lsn = 0xe910005cc00 ===> first item lsn

crash> xfs_log_item.li_flags,li_lsn ffff020013c66b40 -x
  li_flags = 0x1
  li_lsn = 0xe9200058600 ===> last item lsn
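
(For reference: xfs_lsn_t packs the log cycle number in the high 32 bits and the log block number in the low 32 bits, so the first AIL item, which bounds the log tail, sits at cycle 0xe91 while the push target is at cycle 0xe92. The two halves can be split with crash's eval command, for example:)

crash> eval 0xe910005cc00 >> 32
crash> eval 0xe910005cc00 & 0xffffffff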

crash> xfs_log_item.li_buf 0xffff0340999a0a80
 li_buf = 0xffff0200125b7180

crash> xfs_buf.b_flags 0xffff0200125b7180 -x
 b_flags = 0x110032  (XBF_WRITE|XBF_ASYNC|XBF_DONE|_XBF_INODES|_XBF_PAGES) 

crash> xfs_buf.b_state 0xffff0200125b7180 -x
  b_state = 0x2 (XFS_BSTATE_IN_FLIGHT)

crash> xfs_buf.b_last_error,b_retries,b_first_retry_time 0xffff0200125b7180 -x
  b_last_error = 0x0
  b_retries = 0x0
  b_first_retry_time = 0x0 
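
(If the write really were still in flight, the buffer's I/O accounting should show it; assuming the 5.10 struct xfs_buf layout, these fields are worth checking too:)

crash> xfs_buf.b_hold,b_io_remaining,b_io_error 0xffff0200125b7180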

The buffer flags show the I/O has completed (XBF_DONE is set).
Reviewing xfs_buf_ioend(): when a write completes without error, XBF_DONE is set and xfs_buf_inode_iodone() is called, which removes the log items from the AIL and then releases xlog space by moving the tail_lsn forward.
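
(For orientation, a heavily condensed paraphrase of the write-completion path as I read it in 5.10's fs/xfs/xfs_buf.c; this is a sketch, not a verbatim excerpt, and details may vary between 5.10.y releases:)

	/* xfs_buf_ioend(), write side: on success the buffer is marked
	 * done, then the type-specific completion runs */
	if (!bp->b_error) {
		bp->b_flags &= ~XBF_WRITE_FAIL;
		bp->b_flags |= XBF_DONE;
	}
	if (bp->b_flags & _XBF_INODES)
		xfs_buf_inode_iodone(bp);	/* pulls the attached inode log
						 * items off the AIL via
						 * xfs_trans_ail_delete(), which
						 * lets tail_lsn move forward */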

But this item is still in the AIL list, even though b_last_error = 0 and XBF_WRITE is set.

The buf log items show the same picture as the inode log items:

crash> list -s xfs_log_item.li_buf 0xffff0340999a0a80
ffff033f8d7c9de8
  li_buf = 0x0
crash> xfs_buf_log_item.bli_buf  ffff033f8d7c9de8
  bli_buf = 0xffff0200125b4a80
crash> xfs_buf.b_flags 0xffff0200125b4a80 -x
  b_flags = 0x100032 (XBF_WRITE|XBF_ASYNC|XBF_DONE|_XBF_PAGES) 

I think it should be impossible for an item to have XBF_DONE set and b_last_error = 0 and yet still be in the AIL.

Is my analysis correct? 
Why can't the xlog release space?
Comment 1 Dave Chinner 2022-08-14 23:54:51 UTC
[cc Amir, the 5.10 stable XFS maintainer]

Comment 2 Amir Goldstein 2022-08-15 16:12:20 UTC
On Mon, Aug 15, 2022 at 2:54 AM Dave Chinner <david@fromorbit.com> wrote:
>
> [cc Amir, the 5.10 stable XFS maintainer]
>
> On Tue, Aug 09, 2022 at 11:46:23AM +0000, bugzilla-daemon@kernel.org wrote:
> > [...]
> > Is my analysis correct?

I don't think so.
I think this buffer write is in-flight.

> > Why can't the xlog release space?

Not sure if space cannot be released or it just takes a long time.
There are several AIL/CIL improvements in the upstream kernel, and
none of them are going to land in 5.10.y.

The reported kernel version 5.10.38 has almost no upstream fixes
at all, but I don't think any of the fixes in 5.10.y are relevant to
this case anyway.

If this hang happens often with your workload, I suggest using
a newer kernel and/or formatting xfs with a larger log to meet
the demands of your workload.
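
(For example, something like the following at mkfs time; the device name is a placeholder, the size is only an illustration, and the internal log tops out around 2GiB:)

  mkfs.xfs -l size=1024m /dev/sdX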

Thanks,
Amir.
Comment 3 zhoukete 2022-08-16 06:56:26 UTC
(In reply to Amir Goldstein from comment #2)
> 
> I don't think so.
> I think this buffer write is in-flight.

But then how do we explain that the XBF_DONE flag is set? Does the XBF_DONE flag mean the I/O has completed?


> 
> Not sure if space cannot be released or just takes a lot of time.

crash> ps -m 2006840
[0 14:35:38.418] [UN]  PID: 2006840  TASK: ffff035f80de8f80  CPU: 51  COMMAND: "onestor-peon" 

pid 2006840 has hung for 14 hours. It is waiting for the xfs buf I/O to finish,
so I think it is a case of 'cannot be released', not 'takes a lot of time'.

crash> bt 2006840
PID: 2006840  TASK: ffff035f80de8f80  CPU: 51  COMMAND: "onestor-peon"
 #0 [ffff80002a873580] __switch_to at ffff800010016dec
 #1 [ffff80002a8735b0] __schedule at ffff800010a99d60
 #2 [ffff80002a873650] schedule at ffff800010a9a1cc
 #3 [ffff80002a873670] schedule_timeout at ffff800010a9ee00
 #4 [ffff80002a873710] __down at ffff800010a9d2ec
 #5 [ffff80002a8737a0] down at ffff8000100ea30c
 #6 [ffff80002a8737c0] xfs_buf_lock at ffff800008f5bd4c [xfs]
 #7 [ffff80002a8737f0] xfs_buf_find at ffff800008f5c3f0 [xfs]
 #8 [ffff80002a8738b0] xfs_buf_get_map at ffff800008f5c7d4 [xfs]
 #9 [ffff80002a873930] xfs_buf_read_map at ffff800008f5cf10 [xfs]
#10 [ffff80002a8739a0] xfs_trans_read_buf_map at ffff800008f9c198 [xfs]
#11 [ffff80002a873a20] xfs_da_read_buf at ffff800008f2b51c [xfs]
#12 [ffff80002a873aa0] xfs_da3_node_read at ffff800008f2b5a4 [xfs]
#13 [ffff80002a873ac0] xfs_da3_node_lookup_int at ffff800008f2c778 [xfs]
#14 [ffff80002a873b60] xfs_dir2_node_removename at ffff800008f38964 [xfs]
#15 [ffff80002a873c30] xfs_dir_removename at ffff800008f2fc6c [xfs]
#16 [ffff80002a873c90] xfs_remove at ffff800008f77f70 [xfs]
#17 [ffff80002a873d00] xfs_vn_unlink at ffff800008f72744 [xfs]
#18 [ffff80002a873d50] vfs_unlink at ffff8000103264e8
#19 [ffff80002a873d90] do_unlinkat at ffff80001032ad44
#20 [ffff80002a873e20] __arm64_sys_unlinkat at ffff80001032ae30 

> There are several AIL/CIL improvements in upstream kernel and
> none of them are going to land in 5.10.y.
> The reported kernel version 5.10.38 has almost no upstream fixes
> at all, but I don't think that any of the fixes in 5.10.y are relevant for
> this case anyway.
> 
> If this hang happens often with your workload, I suggest using
> a newer kernel and/or formatting xfs with a larger log to meet
> the demands of your workload.
> 

It has happened only once.
Comment 4 Amir Goldstein 2022-08-16 14:32:16 UTC
On Tue, Aug 16, 2022 at 11:57 AM <bugzilla-daemon@kernel.org> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=216343
>
> --- Comment #3 from zhoukete@126.com ---
> But then how do we explain that the XBF_DONE flag is set? Does the
> XBF_DONE flag mean the I/O has completed?
>

According to xfs_trans_dirty_buf(), I think it can mean an uptodate and
dirty buffer.
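
(The lines I mean, condensed from my reading of 5.10's fs/xfs/xfs_trans_buf.c; a paraphrase, not a verbatim copy:)

	/* xfs_trans_dirty_buf(): a buffer dirtied in a transaction is
	 * marked XBF_DONE regardless of any write I/O completing */
	bp->b_flags |= XBF_DONE;
	...
	bip->bli_flags |= XFS_BLI_DIRTY | XFS_BLI_LOGGED;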

> > Not sure if space cannot be released or it just takes a long time.
>
> pid 2006840 has hung for 14 hours. It is waiting for the xfs buf I/O to
> finish, so I think it is a case of 'cannot be released', not 'takes a
> lot of time'.

Maybe the hardware never returned with a response?
Hard to say. Maybe someone else has ideas.

Thanks,
Amir.
Comment 5 zhoukete 2022-08-17 10:05:42 UTC
(In reply to Amir Goldstein from comment #4)

> 
> According to xfs_trans_dirty_buf() I think it could mean uptodate and
> dirty buffer.
> 

When I reviewed the xfs_trans_dirty_buf() code, I found that the buffer attached to the inode log item has a NULL b_log_item:

crash> xfs_log_item.li_buf,li_ops 0xffff0340999a0a80 -x
  li_buf = 0xffff0200125b7180
  li_ops = 0xffff800008faec60 <xfs_inode_item_ops> 
crash> xfs_buf.b_log_item 0xffff0200125b7180
  b_log_item = 0x0

and only the buf log item's buffer has b_log_item set:

crash> xfs_log_item.li_buf,li_ops ffff033f8d7c9de8 -x
  li_buf = 0x0
  li_ops = 0xffff800008fae8d8 <xfs_buf_item_ops> 
crash> xfs_buf_log_item.bli_buf  ffff033f8d7c9de8
  bli_buf = 0xffff0200125b4a80
crash> xfs_buf.b_log_item 0xffff0200125b4a80
  b_log_item = 0xffff033f8d7c9de8
crash> xfs_buf_log_item.bli_flags 0xffff033f8d7c9de8
  bli_flags = 2     (XFS_BLI_DIRTY)
crash> xfs_buf_log_item.bli_item.li_flags  ffff033f8d7c9de8
  bli_item.li_flags = 1  (XFS_LI_IN_AIL)

So the buf log item's buffer has XBF_DONE set because of xfs_trans_dirty_buf(), but the inode log item's buffer never goes through xfs_trans_dirty_buf(), because its b_log_item == 0x0.

Do you know why XBF_DONE is set on the inode log item's buffer?

> 
> Maybe the hardware never returned with a response?
> Hard to say. Maybe someone else has ideas.
> 

If we can prove that XBF_DONE does not mean the I/O is done, I think this issue may be caused by a hardware error.

I found this error message in dmesg:
[ 9824.111366] mpt3sas_cm0: issue target reset: handle = (0x0034)  

Maybe it tells us mpt3sas lost I/O requests earlier.
Comment 6 Amir Goldstein 2022-08-17 13:15:28 UTC
On Wed, Aug 17, 2022 at 1:19 PM <bugzilla-daemon@kernel.org> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=216343
>
> --- Comment #5 from zhoukete@126.com ---
> [...]
> Do you know why XBF_DONE is set on the inode log item's buffer?
>

#define XBF_DONE        (1u << 5) /* all pages in the buffer uptodate */

Buffer uptodate does not mean that it is not dirty.
I am not sure about the rest of your analysis.

> If we can prove that XBF_DONE does not mean the I/O is done, I think
> this issue may be caused by a hardware error.
>
> I found this error message in dmesg:
> [ 9824.111366] mpt3sas_cm0: issue target reset: handle = (0x0034)
>
> Maybe it tells us mpt3sas lost I/O requests earlier.
>

Yes, maybe it does.

Anyway, if your hardware had errors, could it be that your
filesystem is shutting down?

If it does, you may be hit by the bug fixed by
84d8949e7707 ("xfs: hold buffer across unpin and potential shutdown processing"),
but I am not sure if all the conditions of that bug match your case.

If you did get hit by this bug, you may consider upgrading to v5.10.135,
which has the fix.

Thanks,
Amir.
Comment 7 zhoukete 2022-08-18 08:23:07 UTC
(In reply to Amir Goldstein from comment #6)
> Anyway, if your hardware had errors, could it be that your
> filesystem is shutting down?
>
> If it does, you may be hit by the bug fixed by
> 84d8949e7707 ("xfs: hold buffer across unpin and potential shutdown
> processing"), but I am not sure if all the conditions of that bug match
> your case.

I think xfs is not in a shutdown state:
crash> xfs_mount.m_flags 0xffff00ff85169000 -x
  m_flags = 0x100  (XFS_MOUNT_ATTR2)

As for 84d8949e7707, I will try to backport it to 5.10.38.
Thanks
Comment 8 zhoukete 2023-09-21 06:58:20 UTC
I have found the cause of the problem. Applying patch d43aaf1685aa471f0593685c9f54d53e3af3cf3f solves it.
