Bug 216007

Summary: XFS hangs in iowait when extracting large number of files
Product: File System
Reporter: Peter Pavlisko (bugzkernelorg8392)
Component: XFS
Assignee: FileSystem/XFS Default Virtual Assignee (filesystem_xfs)
Status: RESOLVED CODE_FIX
Severity: normal
CC: bugzilla, clockwork80, holger, jack, mgorman
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.15.32
Regression: No
Attachments:
    output from dmesg after echo w > /proc/sysrq-trigger
    kernel config file
    contents of the /proc/slabinfo
    .config file of a kernel unaffected by this bug
    Patch to always allocate at least one page

Description Peter Pavlisko 2022-05-20 11:56:06 UTC
Created attachment 301008 [details]
output from dmesg after echo w > /proc/sysrq-trigger

Overview:

When I try to extract an uncompressed tar archive (2.6 million files, 760.3 GiB in size) onto a newly created (empty) XFS file system, the process hangs in iowait indefinitely after the first few tens of gigabytes have been extracted. One CPU core is 100% occupied with iowait, the other CPU core is idle (on a 2-core Intel Celeron G1610T).

My kernel is compiled with my own .config file. When I try this with a more "standard" kernel, the problem is not reproducible.

Steps to Reproduce:

1) compile the kernel with the attached .config

2) reboot with this kernel

3) create a new XFS filesystem on a spare drive (just mkfs.xfs -f <dev>)

4) mount this new file system

5) try to extract a large amount of data there

Actual results:

After 20-40 GiB written, the process hangs in iowait indefinitely, never finishing the archive extraction.

Expected Results:

Archive extraction continues smoothly until done.

Build Date & Hardware:

2022-05-01 on HP ProLiant MicroServer Gen8, 4GB ECC RAM

Additional Information:

No other filesystem tested with the same archive on the same hardware before or after this (ext2, ext3, ext4, reiserfs3, jfs, nilfs2, f2fs, btrfs, zfs) has shown this behavior. When I downgraded the kernel to 5.10.109, XFS started working again. Kernel 5.15 and later versions seem to be affected; I tried 5.17.1, 5.17.6 and 5.18.0-rc7, and they all hang after a few minutes.

No error is reported to the system log or to dmesg when the process hangs. No error shows on stdout or stderr of the tar process either.

This is not an SMR problem. None of the disks present in the test setup are SMR. All are CMR, and while they are certainly not brand new, they are all in good working condition.

Attached is the dmesg output after issuing this command:

echo w > /proc/sysrq-trigger

More can be found here: https://forums.gentoo.org/viewtopic-p-8709116.html
Comment 1 Peter Pavlisko 2022-05-20 11:56:36 UTC
Created attachment 301009 [details]
kernel config file
Comment 2 Chris Murphy 2022-05-20 20:46:58 UTC
Please see https://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F and supplement the bug report with the missing information. Thanks.
Comment 3 Dave Chinner 2022-05-20 23:05:26 UTC
On Fri, May 20, 2022 at 11:56:06AM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=216007
> 
>             Bug ID: 216007
>            Summary: XFS hangs in iowait when extracting large number of
>                     files
>            Product: File System
>            Version: 2.5
>     Kernel Version: 5.15.32
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: XFS
>           Assignee: filesystem_xfs@kernel-bugs.kernel.org
>           Reporter: bugzkernelorg8392@araxon.sk
>         Regression: No
> 
> Created attachment 301008 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=301008&action=edit
> output from dmesg after echo w > /proc/sysrq-trigger
> 
> Overview:
> 
> When I try to extract an uncompressed tar archive (2.6 milion files, 760.3
> GiB
> in size) on newly created (empty) XFS file system, after first low tens of
> gigabytes extracted the process hangs in iowait indefinitely. One CPU core is
> 100% occupied with iowait, the other CPU core is idle (on 2-core Intel
> Celeron
> G1610T).
> 
> I have kernel compiled with my .config file. When I try this with a more
> "standard" kernel, the problem is not reproducible.
> 
> Steps to Reproduce:
> 
> 1) compile the kernel with the attached .config
> 
> 2) reboot with this kernel
> 
> 3) create a new XFS filesystem on a spare drive (just mkfs.xfs -f <dev>)
> 
> 4) mount this new file system
> 
> 5) try to extract large amount of data there
> 
> Actual results:
> 
> After 20-40 GiB written, the process hangs in iowait indefinitely, never
> finishing the archive extraction.

[  805.233836] task:tar             state:D stack:    0 pid: 2492 ppid:  2491 flags:0x00004000
[  805.233840] Call Trace:
[  805.233841]  <TASK>
[  805.233842]  __schedule+0x1c9/0x510
[  805.233846]  ? lock_timer_base+0x5c/0x80
[  805.233850]  schedule+0x3f/0xa0
[  805.233853]  schedule_timeout+0x7c/0xf0
[  805.233858]  ? init_timer_key+0x30/0x30
[  805.233862]  io_schedule_timeout+0x47/0x70
[  805.233866]  congestion_wait+0x79/0xd0
[  805.233872]  ? wait_woken+0x60/0x60
[  805.233876]  xfs_buf_alloc_pages+0xd0/0x1b0
[  805.233881]  xfs_buf_get_map+0x259/0x300
[  805.233886]  ? xfs_buf_item_init+0x150/0x160
[  805.233892]  xfs_trans_get_buf_map+0xa9/0x120
[  805.233897]  xfs_ialloc_inode_init+0x129/0x2d0
[  805.233901]  ? xfs_ialloc_ag_alloc+0x1df/0x630
[  805.233904]  xfs_ialloc_ag_alloc+0x1df/0x630
[  805.233908]  xfs_dialloc+0x1b4/0x720
[  805.233912]  xfs_create+0x1d7/0x450
[  805.233917]  xfs_generic_create+0x114/0x2d0
[  805.233922]  path_openat+0x510/0xe10
[  805.233925]  do_filp_open+0xad/0x150
[  805.233929]  ? xfs_blockgc_clear_iflag+0x93/0xb0
[  805.233932]  ? xfs_iunlock+0x52/0x90
[  805.233937]  do_sys_openat2+0x91/0x150
[  805.233942]  __x64_sys_openat+0x4e/0x90
[  805.233946]  do_syscall_64+0x43/0x90
[  805.233952]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  805.233959] RIP: 0033:0x7f763ccc9572
[  805.233962] RSP: 002b:00007ffef1391530 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  805.233966] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f763ccc9572
[  805.233969] RDX: 00000000000809c1 RSI: 000055b1d5b19270 RDI: 0000000000000004
[  805.233971] RBP: 0000000000000180 R08: 000000000000c0c0 R09: 000055b1d5b145f0
[  805.233973] R10: 0000000000000180 R11: 0000000000000246 R12: 0000000000000000
[  805.233974] R13: 00000000000809c1 R14: 000055b1d5b19270 R15: 000055b1d59d2248
[  805.233977]  </TASK>

It's waiting on memory allocation, which is probably waiting on IO
completion somewhere to clean dirty pages. This suggests there's a
problem with the storage hardware, a problem with the storage stack
below XFS, or an issue with memory cleaning/reclaim stalling and not
making progress.
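
For reference, the retry pattern visible in the trace above (xfs_buf_alloc_pages() asking the bulk allocator for the buffer's pages and calling congestion_wait() when a call makes no progress) can be modelled in userspace roughly as follows; bulk_alloc(), the one-second sleep and the other names below are illustrative stand-ins, not the real kernel API:

/*
 * Minimal userspace model of the retry pattern in the trace above: ask a
 * bulk allocator to fill an array of page slots, retry immediately when a
 * call makes progress, and back off (congestion_wait() in the kernel) when
 * it does not.  All names here are stand-ins, not kernel code.
 */
#include <stdio.h>
#include <unistd.h>

#define WANT 4

/* Counts already-populated slots; fills empty ones only while memory is "available". */
static int bulk_alloc(void **pages, int want, int memory_available)
{
        int filled = 0;

        for (int i = 0; i < want; i++) {
                if (pages[i]) {
                        filled++;               /* slot populated by an earlier call */
                } else if (memory_available) {
                        pages[i] = (void *)1;   /* pretend allocation */
                        filled++;
                }
        }
        return filled;
}

int main(void)
{
        void *pages[WANT] = { (void *)1 };      /* array already partially populated */
        int filled = 1;

        for (;;) {
                int last = filled;

                filled = bulk_alloc(pages, WANT, 0);    /* free memory exhausted */
                if (filled == WANT)
                        break;                  /* got everything we asked for */
                if (filled != last)
                        continue;               /* made progress, retry at once */

                /* No progress: back off and retry.  If the allocator never
                 * triggers reclaim on the caller's behalf, this never ends. */
                printf("no progress (filled=%d/%d), backing off\n", filled, WANT);
                sleep(1);
        }
        return 0;
}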

> Expected Results:
> 
> Archive extraction continues smoothly until done.
> 
> Build Date & Hardware:
> 
> 2022-05-01 on HP ProLiant MicroServer Gen8, 4GB ECC RAM
> 
> Additional Information:
> 
> No other filesystem tested with the same archive on the same hardware before
> or
> after this (ext2, ext3, ext4, reiserfs3, jfs, nilfs2, f2fs, btrfs, zfs) has
> shown this behavior. When I downgraded the kernel to 5.10.109, the XFS
> started
> working again. Kernel versions higher than 5.15 seem to be affected, I tried
> 5.17.1, 5.17.6 and 5.18.0-rc7, they all hang up after a few minutes.

Doesn't actually look like an XFS problem from the evidence
supplied, though.

What sort of storage subsystem does this machine have? If it's a
spinning disk then you've probably just filled memory 

> More could be found here: https://forums.gentoo.org/viewtopic-p-8709116.html

Oh, wait:

"I compiled a more mainstream version of
sys-kernel/gentoo-sources-5.15.32-r1 (removed my .config file and
let genkernel to fill it with default options) and lo and behold, in
this kernel I could not make it go stuck anymore.
[....]
However, after I altered my old kernel config to contain these
values and rebooting, I'm still triggering the bug. It may not be a
XFS issue after all."

From the evidence presented, I'd agree that this doesn't look like an
XFS problem, either.

Cheers,

Dave.
Comment 4 Peter Pavlisko 2022-05-21 05:14:36 UTC
> What sort of storage subsystem does this machine have? If it's a spinning
> disk then you've probably just filled memory

Yes, all the disks are classic spinning CMR disks. But, out of all the file systems tried, only XFS is doing this on the test machine. I can trigger this behavior every time. And kernels 5.10 and below still work, even with my non-standard .config.

Here is the memory situation when it is stuck:

ftp-back ~ # free
               total        used        free      shared  buff/cache   available
Mem:         3995528      175872       69240         416     3750416     3763584
Swap:        2097084           0     2097084

I will try to put together another test machine on Monday, without using any of the components from this one, just to rule out a hardware bug.

This may not be an XFS bug, but so far only XFS seems to suffer from it.
Comment 5 Dave Chinner 2022-05-21 22:31:59 UTC
On Sat, May 21, 2022 at 05:14:36AM +0000, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=216007
> 
> --- Comment #4 from Peter Pavlisko (bugzkernelorg8392@araxon.sk) ---
> > What sort of storage subsystem does this machine have? If it's a spinning
> > disk then you've probably just filled memory
> 
> Yes, all the disks are classic spinning CMR disks. But, out of all file
> systems
> tried, only XFS is doing this on the test machine. I can trigger this
> behavior
> every time. And kernels from 5.10 and bellow still work, even with my
> non-standard .config.
> 
> Here is the memory situation when it is stuck:
> 
> ftp-back ~ # free
>                total        used        free      shared  buff/cache  
>                available
> Mem:         3995528      175872       69240         416     3750416    
> 3763584

Doesn't tell us a whole lot except for "no free memory to allocate
without reclaim". /proc/meminfo, /proc/vmstat and /proc/slabinfo
would tell us a lot more.

Also, knowing if you've tweaked things like dirty ratios, etc would
also be helpful...

> This may not be a XFS bug, but so far only XFS seems to suffer from it.

Not that uncommon, really. XFS puts a different load on the memory
allocation/reclaim and cache subsystems compared to other
filesystems, so XFS tends to trip over bugs that others don't.

Cheers,

Dave.
Comment 6 Peter Pavlisko 2022-05-23 08:29:14 UTC
(In reply to Chris Murphy from comment #2)
> Please see
> https://xfs.org/index.php/XFS_FAQ#Q:
> _What_information_should_I_include_when_reporting_a_problem.3F and
> supplement the bug report with the missing information. Thanks.

certainly!

> kernel version (uname -a)

5.15.32-gentoo-r1

> xfsprogs version (xfs_repair -V)

xfs_repair version 5.14.2

> number of CPUs

1 CPU, 2 cores (Intel Celeron G1610T @ 2.30GHz)

> contents of /proc/meminfo

(at the time of iowait hangup)

MemTotal:        3995528 kB
MemFree:           29096 kB
MemAvailable:    3749216 kB
Buffers:           19984 kB
Cached:          3556248 kB
SwapCached:            0 kB
Active:            62888 kB
Inactive:        3560968 kB
Active(anon):        272 kB
Inactive(anon):    47772 kB
Active(file):      62616 kB
Inactive(file):  3513196 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2097084 kB
SwapFree:        2097084 kB
Dirty:                28 kB
Writeback:             0 kB
AnonPages:         47628 kB
Mapped:            19540 kB
Shmem:               416 kB
KReclaimable:     199964 kB
Slab:             286472 kB
SReclaimable:     199964 kB
SUnreclaim:        86508 kB
KernelStack:        1984 kB
PageTables:         1648 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4094848 kB
Committed_AS:     117448 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        2764 kB
VmallocChunk:          0 kB
Percpu:              832 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:       20364 kB
DirectMap2M:     4139008 kB

> contents of /proc/mounts

/dev/root / ext4 rw,noatime 0 0
devtmpfs /dev devtmpfs rw,nosuid,relatime,size=10240k,nr_inodes=498934,mode=755 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev,size=799220k,nr_inodes=819200,mode=755 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
cgroup_root /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755 0 0
openrc /sys/fs/cgroup/openrc cgroup rw,nosuid,nodev,noexec,relatime,release_agent=/lib/rc/sh/cgroup-release-agent.sh,name=openrc 0 0
none /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
shm /dev/shm tmpfs rw,nosuid,nodev,noexec,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
/dev/sdc1 /mnt/test xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0

> contents of /proc/partitions

major minor  #blocks  name

   1        0      16384 ram0
   1        1      16384 ram1
   1        2      16384 ram2
   1        3      16384 ram3
   1        4      16384 ram4
   1        5      16384 ram5
   1        6      16384 ram6
   1        7      16384 ram7
   1        8      16384 ram8
   1        9      16384 ram9
   1       10      16384 ram10
   1       11      16384 ram11
   1       12      16384 ram12
   1       13      16384 ram13
   1       14      16384 ram14
   1       15      16384 ram15
   8        0 1953514584 sda
   8        1     131072 sda1
   8        2    2097152 sda2
   8        3 1951285336 sda3
   8       16 1953514584 sdb
   8       17     131072 sdb1
   8       18    2097152 sdb2
   8       19 1951285336 sdb3
   8       32  976762584 sdc
   8       33  976761560 sdc1
  11        0    1048575 sr0
   9        3 1951285248 md3
   9        2    2097088 md2
   9        1     131008 md1

> RAID layout (hardware and/or software)

software RAID1 on system disks (/dev/sda + /dev/sdb, ext4)
no RAID on xfs disk (/dev/sdc, xfs)

> LVM configuration

no LVM

> type of disks you are using

CMR 3.5" SATA disks

/dev/sda and /dev/sdb: WD Red 2TB (WDC WD20EFRX-68EUZN0)
/dev/sdc: Seagate Constellation ES 1TB (MB1000GCEEK)

> write cache status of drives

not sure - how do I find out?

> size of BBWC and mode it is running in

no HW RAID (the drive controller is set to AHCI mode)
no battery backup

> xfs_info output on the filesystem in question

meta-data=/dev/sdc1              isize=512    agcount=4, agsize=61047598 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0
data     =                       bsize=4096   blocks=244190390, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=119233, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

> dmesg output showing all error messages and stack traces

already an attachment to this ticket
Comment 7 Peter Pavlisko 2022-05-23 08:31:20 UTC
Furthermore, I was able to do a git bisect on this repository:

git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

The offending commit is this one:

9ba0889e2272294bfbb5589b1b180ad2e782b2a4 is the first bad commit
commit 9ba0889e2272294bfbb5589b1b180ad2e782b2a4
Author: Dave Chinner <david@fromorbit.com>
Date:   Tue Jun 8 09:19:22 2021 -0700

    xfs: drop the AGI being passed to xfs_check_agi_freecount

    From: Dave Chinner <dchinner@redhat.com>

    Stephen Rothwell reported this compiler warning from linux-next:

    fs/xfs/libxfs/xfs_ialloc.c: In function 'xfs_difree_finobt':
    fs/xfs/libxfs/xfs_ialloc.c:2032:20: warning: unused variable 'agi' [-Wunused-variable]
     2032 |  struct xfs_agi   *agi = agbp->b_addr;

    Which is fallout from agno -> perag conversions that were done in
    this function. xfs_check_agi_freecount() is the only user of "agi"
    in xfs_difree_finobt() now, and it only uses the agi to get the
    current free inode count. We hold that in the perag structure, so
    there's not need to directly reference the raw AGI to get this
    information.

    The btree cursor being passed to xfs_check_agi_freecount() has a
    reference to the perag being operated on, so use that directly in
    xfs_check_agi_freecount() rather than passing an AGI.

    Fixes: 7b13c5155182 ("xfs: use perag for ialloc btree cursors")
    Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>

 fs/xfs/libxfs/xfs_ialloc.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)
Comment 8 Peter Pavlisko 2022-05-23 10:02:33 UTC
(In reply to Peter Pavlisko from comment #7)
> Furthermore, I was able to do a git bisect on this repository:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
> 
> The offending commit is this one:
> 
> 9ba0889e2272294bfbb5589b1b180ad2e782b2a4 is the first bad commit
> commit 9ba0889e2272294bfbb5589b1b180ad2e782b2a4
> Author: Dave Chinner <david@fromorbit.com>
> Date:   Tue Jun 8 09:19:22 2021 -0700
> 
>     xfs: drop the AGI being passed to xfs_check_agi_freecount
> 
>     From: Dave Chinner <dchinner@redhat.com>
> 
>     Stephen Rothwell reported this compiler warning from linux-next:
> 
>     fs/xfs/libxfs/xfs_ialloc.c: In function 'xfs_difree_finobt':
>     fs/xfs/libxfs/xfs_ialloc.c:2032:20: warning: unused variable 'agi'
> [-Wunused-variable]
>      2032 |  struct xfs_agi   *agi = agbp->b_addr;
> 
>     Which is fallout from agno -> perag conversions that were done in
>     this function. xfs_check_agi_freecount() is the only user of "agi"
>     in xfs_difree_finobt() now, and it only uses the agi to get the
>     current free inode count. We hold that in the perag structure, so
>     there's not need to directly reference the raw AGI to get this
>     information.
> 
>     The btree cursor being passed to xfs_check_agi_freecount() has a
>     reference to the perag being operated on, so use that directly in
>     xfs_check_agi_freecount() rather than passing an AGI.
> 
>     Fixes: 7b13c5155182 ("xfs: use perag for ialloc btree cursors")
>     Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
>     Signed-off-by: Dave Chinner <dchinner@redhat.com>
>     Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
>     Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>     Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> 
>  fs/xfs/libxfs/xfs_ialloc.c | 28 +++++++++++++---------------
>  1 file changed, 13 insertions(+), 15 deletions(-)

I'm sorry, I made a mistake somewhere while bisecting... This is not the commit.

Disregard the quoted message, please. I will try bisecting again.
Comment 9 Holger Hoffstätte 2022-05-23 10:28:23 UTC
Shot in the dark: can you try the latest plain 5.15.x + 9a5280b312e2
aka: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=9a5280b312e2e7898b6397b2ca3cfd03f67d7be1

I suspect your stack traces are not really representative of what's
going on..
Comment 10 Jan Kara 2022-05-24 07:54:12 UTC
(In reply to Peter Pavlisko from comment #6)
> (In reply to Chris Murphy from comment #2)
> > number of CPUs
> 
> 1 CPU, 2 cores (Intel Celeron G1610T @ 2.30GHz)

OK, not much CPU power :)
 
> > contents of /proc/meminfo
> 
> (at the time of iowait hangup)
> 
> MemTotal:        3995528 kB
> MemFree:           29096 kB
> MemAvailable:    3749216 kB
> Buffers:           19984 kB
> Cached:          3556248 kB
> SwapCached:            0 kB
> Active:            62888 kB
> Inactive:        3560968 kB
> Active(anon):        272 kB
> Inactive(anon):    47772 kB
> Active(file):      62616 kB
> Inactive(file):  3513196 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:       2097084 kB
> SwapFree:        2097084 kB
> Dirty:                28 kB
> Writeback:             0 kB

Interestingly, there is basically no dirty memory or memory under writeback. Practically everything is on the Inactive(file) list, which should be very easy to reclaim. But given that you say no progress is made in reclaiming memory, and the traces show the process hung waiting for a memory allocation, it may be that the page cache pages are pinned in memory by extra references or something like that. The question is: what could XFS be doing so that clean page cache cannot be reclaimed? I have no good answer to that...

...

>    8        0 1953514584 sda
>    8        1     131072 sda1
>    8        2    2097152 sda2
>    8        3 1951285336 sda3
>    8       16 1953514584 sdb
>    8       17     131072 sdb1
>    8       18    2097152 sdb2
>    8       19 1951285336 sdb3
>    8       32  976762584 sdc
>    8       33  976761560 sdc1
>   11        0    1048575 sr0
>    9        3 1951285248 md3
>    9        2    2097088 md2
>    9        1     131008 md1

Actually, that is not much overall memory compared to the disk sizes or the size of the unpacked archive.

You've mentioned another kernel config does not exhibit the problem. Can you perhaps post it here for comparison?
Comment 11 Peter Pavlisko 2022-05-24 10:00:05 UTC
I bisected the issue again, double checking every step, and I arrived at another commit. This looks more promising:

c9fa563072e13337713a441cf30171feb4e96e6d is the first bad commit
commit c9fa563072e13337713a441cf30171feb4e96e6d
Author: Dave Chinner <dchinner@redhat.com>
Date:   Tue Jun 1 13:40:36 2021 +1000

    xfs: use alloc_pages_bulk_array() for buffers

    Because it's more efficient than allocating pages one at a time in a
    loop.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>

 fs/xfs/xfs_buf.c | 62 ++++++++++++++++++++++----------------------------------
 1 file changed, 24 insertions(+), 38 deletions(-)
Comment 12 Jan Kara 2022-05-24 10:49:14 UTC
Interesting. It looks like a bug in the bulk allocator, in that it is not able to make progress when reclaim is needed. From a quick look I don't see kswapd being kicked or direct reclaim happening from the bulk allocator, so that could be the reason. Adding Mel to CC; maybe he has some ideas.
Comment 13 Peter Pavlisko 2022-05-24 10:52:13 UTC
(In reply to Dave Chinner from comment #5)
> Doesn't tell us a whole lot except for "no free memory to allocate
> without reclaim". /proc/meminfo, /proc/vmstat and /proc/slabinfo
> would tell us a lot more.

Note: all values in this comment are from kernel version 5.13.0-rc2+ (commit c9fa563072e1), recorded at the time of the endless iowait.

> /proc/meminfo

MemTotal:        3996092 kB
MemFree:           42176 kB
MemAvailable:    3668964 kB
Buffers:           15684 kB
Cached:          3400932 kB
SwapCached:            0 kB
Active:            58840 kB
Inactive:        3405468 kB
Active(anon):        268 kB
Inactive(anon):    47836 kB
Active(file):      58572 kB
Inactive(file):  3357632 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2097084 kB
SwapFree:        2097084 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         47700 kB
Mapped:            19504 kB
Shmem:               412 kB
KReclaimable:     266460 kB
Slab:             404388 kB
SReclaimable:     266460 kB
SUnreclaim:       137928 kB
KernelStack:        1904 kB
PageTables:         1672 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     4095128 kB
Committed_AS:     117700 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        2732 kB
VmallocChunk:          0 kB
Percpu:              896 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:       20364 kB
DirectMap2M:     4139008 kB

> /proc/vmstat

nr_free_pages 10607
nr_zone_inactive_anon 11978
nr_zone_active_anon 67
nr_zone_inactive_file 839409
nr_zone_active_file 14643
nr_zone_unevictable 0
nr_zone_write_pending 0
nr_mlock 0
nr_bounce 0
nr_free_cma 0
nr_inactive_anon 11978
nr_active_anon 67
nr_inactive_file 839409
nr_active_file 14643
nr_unevictable 0
nr_slab_reclaimable 66615
nr_slab_unreclaimable 34482
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 130210
workingset_refault_anon 0
workingset_refault_file 1345
workingset_activate_anon 0
workingset_activate_file 24
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 208053
nr_anon_pages 11931
nr_mapped 4876
nr_file_pages 854155
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 103
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 8243
nr_dirtied 9075788
nr_written 9075787
nr_kernel_misc_reclaimable 0
nr_foll_pin_acquired 0
nr_foll_pin_released 0
nr_kernel_stack 1904
nr_page_table_pages 418
nr_swapcached 0
nr_dirty_threshold 171292
nr_dirty_background_threshold 85541
pgpgin 36279336
pgpgout 36908703
pswpin 0
pswpout 0
pgalloc_dma 1473
pgalloc_dma32 18299092
pgalloc_normal 325690
pgalloc_movable 0
allocstall_dma 0
allocstall_dma32 0
allocstall_normal 0
allocstall_movable 5
pgskip_dma 0
pgskip_dma32 0
pgskip_normal 35
pgskip_movable 0
pgfree 18637453
pgactivate 20445
pgdeactivate 266
pglazyfree 0
pgfault 397160
pgmajfault 585
pglazyfreed 0
pgrefill 34
pgreuse 87282
pgsteal_kswapd 17283321
pgsteal_direct 1048
pgscan_kswapd 17292764
pgscan_direct 1048
pgscan_direct_throttle 0
pgscan_anon 13
pgscan_file 17293799
pgsteal_anon 0
pgsteal_file 17284369
pginodesteal 0
slabs_scanned 212096
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 11222
kswapd_high_wmark_hit_quickly 48
pageoutrun 11290
pgrotated 8478
drop_pagecache 0
drop_slab 0
oom_kill 0
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 0
unevictable_pgs_scanned 0
unevictable_pgs_rescued 0
unevictable_pgs_mlocked 0
unevictable_pgs_munlocked 0
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
swap_ra 0
swap_ra_hit 0
direct_map_level2_splits 1
direct_map_level3_splits 0
nr_unstable 0

> /proc/slabinfo

too big, I will upload it as an attachment

> Also, knowing if you've tweaked things like dirty ratios, etc would
> also be helpful...

I didn't tweak any sysctl values, but I can't say whether the defaults in Gentoo are the same as in other Linux flavors.
Comment 14 Peter Pavlisko 2022-05-24 10:53:00 UTC
Created attachment 301026 [details]
contents of the /proc/slabinfo
Comment 15 Peter Pavlisko 2022-05-24 11:21:35 UTC
(In reply to Holger Hoffstätte from comment #9)
> Shot in the dark: can you try the latest plain 5.15.x + 9a5280b312e2
> aka:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/
> xfs?id=9a5280b312e2e7898b6397b2ca3cfd03f67d7be1
> 
> I suspect your stack traces are not really representative of what's
> going on..

I just tried commit 9a5280b312e2 (5.18.0-rc2+). The bug is still there.
Comment 16 Peter Pavlisko 2022-05-24 11:48:10 UTC
(In reply to Jan Kara from comment #10)
> You've mentioned another kernel config does not exhibit the problem. Can you
> perhaps post it here for comparison?

certainly... but I'm afraid that the differences between the two .config files are too big to be of any use

I'll attach it shortly
Comment 17 Peter Pavlisko 2022-05-24 11:49:25 UTC
Created attachment 301027 [details]
.config file of a kernel unaffected by this bug
Comment 18 Mel Gorman 2022-05-25 17:13:34 UTC
Created attachment 301044 [details]
Patch to always allocate at least one page

Hi Peter,

Could you try the attached patch against 5.18 please? I was unable to reproduce the problem but I think what's happening is that an array for receiving a bulk allocation is partially populated and the bulk allocator is returning without allocating at least one page. Allocating even one page should hit the path where kswapd is woken.
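
To make that concrete, here is a hedged, userspace-style sketch of the idea behind the attached patch; fast_path_alloc(), reclaiming_alloc() and the rest are illustrative stand-ins rather than the real mm API. The point is that slots which were already populated before the call must not count as progress, and that when nothing new can be taken from the free lists the allocator should fall back to a single-page allocation, which is the path where kswapd gets woken:

/*
 * Userspace-style sketch of "always allocate at least one page".
 * All names are stand-ins; this illustrates the idea, not the actual
 * kernel patch.
 */
#include <stdbool.h>
#include <stdio.h>

static void *fast_path_alloc(void)  { return NULL; }              /* free lists empty */
static void *reclaiming_alloc(void) { static char p; return &p; } /* "wakes kswapd" */

static int bulk_alloc_fixed(void **pages, int want)
{
        int filled = 0;
        bool allocated_new = false;

        for (int i = 0; i < want; i++) {
                if (pages[i]) {                 /* slot populated by an earlier call */
                        filled++;
                        continue;
                }
                pages[i] = fast_path_alloc();
                if (pages[i]) {
                        filled++;
                        allocated_new = true;
                }
        }

        if (!allocated_new) {
                /* The fix: attempt at least one page via the slow, reclaiming
                 * path so the caller sees progress (or a real failure) instead
                 * of spinning forever on a partially populated array. */
                for (int i = 0; i < want; i++) {
                        if (!pages[i]) {
                                pages[i] = reclaiming_alloc();
                                if (pages[i])
                                        filled++;
                                break;
                        }
                }
        }
        return filled;
}

int main(void)
{
        void *pages[4] = { (void *)1 };         /* partially populated array */

        /* Without the fallback this would return 1 (no new page); with it, 2. */
        printf("filled = %d\n", bulk_alloc_fixed(pages, 4));
        return 0;
}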
Comment 19 Peter Pavlisko 2022-05-26 04:04:51 UTC
(In reply to Mel Gorman from comment #18)
> Created attachment 301044 [details]
> Patch to always allocate at least one page
> 
> Hi Peter,
> 
> Could you try the attached patch against 5.18 please? I was unable to
> reproduce the problem but I think what's happening is that an array for
> receiving a bulk allocation is partially populated and the bulk allocator is
> returning without allocating at least one page. Allocating even one page
> should hit the path where kswapd is woken.

Hi Mel,

I tried this patch and it does indeed work with 5.18.0-rc7. Without the patch it freezes, after I apply the patch the archive extracts flawlessly.
Comment 20 Mel Gorman 2022-05-26 08:51:10 UTC
(In reply to Peter Pavlisko from comment #19)
> (In reply to Mel Gorman from comment #18)
> > Created attachment 301044 [details]
> > Patch to always allocate at least one page
> > 
> > Hi Peter,
> > 
> > Could you try the attached patch against 5.18 please? I was unable to
> > reproduce the problem but I think what's happening is that an array for
> > receiving a bulk allocation is partially populated and the bulk allocator
> is
> > returning without allocating at least one page. Allocating even one page
> > should hit the path where kswapd is woken.
> 
> Hi Mel,
> 
> I tried this patch and it does indeed work with 5.18.0-rc7. Without the
> patch it freezes, after I apply the patch the archive extracts flawlessly.

Thanks Peter, I'll prepare a proper patch and post it today. You won't be cc'd, as I only have the bugzilla email alias for you, but I'll post a lore.kernel.org link here.
Comment 21 Peter Pavlisko 2022-05-26 09:16:59 UTC
(In reply to Mel Gorman from comment #20)
> (In reply to Peter Pavlisko from comment #19)
> > (In reply to Mel Gorman from comment #18)
> > > Created attachment 301044 [details]
> > > Patch to always allocate at least one page
> > > 
> > > Hi Peter,
> > > 
> > > Could you try the attached patch against 5.18 please? I was unable to
> > > reproduce the problem but I think what's happening is that an array for
> > > receiving a bulk allocation is partially populated and the bulk allocator
> > is
> > > returning without allocating at least one page. Allocating even one page
> > > should hit the path where kswapd is woken.
> > 
> > Hi Mel,
> > 
> > I tried this patch and it does indeed work with 5.18.0-rc7. Without the
> > patch it freezes, after I apply the patch the archive extracts flawlessly.
> 
> Thanks Peter, I'll prepare a proper patch and post it today. You won't be
> cc'd as I only have the bugzilla email alias for you but I'll post a
> lore.kernel.org link here.

Thank you very much.

I don't know if this is the proper place to discuss this, but I am curious about the cause. Was it an issue with the way XFS is calling the allocator in a for(;;) loop when it does not get the expected result? Or was it an issue with the allocator itself not working in some obscure edge case? Or was it my .config, stretching the kernel ability to function in a bad way?
Comment 22 Mel Gorman 2022-05-26 10:26:54 UTC
(In reply to Mel Gorman from comment #20)
> (In reply to Peter Pavlisko from comment #19)
> > (In reply to Mel Gorman from comment #18)
> > > Created attachment 301044 [details]
> > > Patch to always allocate at least one page
> > > 
> > > Hi Peter,
> > > 
> > > Could you try the attached patch against 5.18 please? I was unable to
> > > reproduce the problem but I think what's happening is that an array for
> > > receiving a bulk allocation is partially populated and the bulk allocator
> > is
> > > returning without allocating at least one page. Allocating even one page
> > > should hit the path where kswapd is woken.
> > 
> > Hi Mel,
> > 
> > I tried this patch and it does indeed work with 5.18.0-rc7. Without the
> > patch it freezes, after I apply the patch the archive extracts flawlessly.
> 
> Thanks Peter, I'll prepare a proper patch and post it today. You won't be
> cc'd as I only have the bugzilla email alias for you but I'll post a
> lore.kernel.org link here.

https://lore.kernel.org/all/20220526091210.GC3441@techsingularity.net/
Comment 23 Mel Gorman 2022-05-26 10:37:31 UTC
(In reply to Peter Pavlisko from comment #21)
> (In reply to Mel Gorman from comment #20)
> > (In reply to Peter Pavlisko from comment #19)
> > > (In reply to Mel Gorman from comment #18)
> > > > Created attachment 301044 [details]
> > > > Patch to always allocate at least one page
> > > > 
> > > > Hi Peter,
> > > > 
> > > > Could you try the attached patch against 5.18 please? I was unable to
> > > > reproduce the problem but I think what's happening is that an array for
> > > > receiving a bulk allocation is partially populated and the bulk
> allocator
> > > is
> > > > returning without allocating at least one page. Allocating even one
> page
> > > > should hit the path where kswapd is woken.
> > > 
> > > Hi Mel,
> > > 
> > > I tried this patch and it does indeed work with 5.18.0-rc7. Without the
> > > patch it freezes, after I apply the patch the archive extracts
> flawlessly.
> > 
> > Thanks Peter, I'll prepare a proper patch and post it today. You won't be
> > cc'd as I only have the bugzilla email alias for you but I'll post a
> > lore.kernel.org link here.
> 
> Thank you very much.
> 
> I don't know if this is the proper place to discuss this, but I am curious
> about the cause. Was it an issue with the way XFS is calling the allocator
> in a for(;;) loop when it does not get the expected result? Or was it an
> issue with the allocator itself not working in some obscure edge case? Or
> was it my .config, stretching the kernel ability to function in a bad way?

I think blaming XFS would be excessive even though the API says it only attempts to allocate the requested number of pages with no guarantee the exact number will be returned. A glance at the implementation would show it was trying to return at least one page and the code flow of XFS hints that the XFS developers expected that some progress would generally be made or kswapd would be woken as appropriate. The original intention was that the caller did not necessarily need all the pages but that's not true for XFS or NFS. While I could have altered XFS, it would have encouraged boiler-plate code to be created for NFS or any other user of the API that requires the exact number of pages to be returned.

The allocator was working as intended but not necessarily working as desired for callers.

I don't think your .config is at fault. While I could not reproduce the problem, I could force it to occur by injecting failures. It's possible that the problem is easier to trigger on your particular machine, but the corner case has existed since 5.13. It's an interesting coincidence that a similar problem was found on NFS at roughly the same time, which might indicate that something else changed to make this corner case easier to trigger, but the corner case was always there.
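
For illustration, the caller-side boiler-plate that keeping the old "best effort, possibly zero pages" contract would have pushed into XFS, NFS and any other user needing the exact count might look roughly like the sketch below; bulk_alloc(), single_page_alloc() and the other names are stand-ins for this toy model, not kernel code:

/*
 * Toy model of the per-caller fallback the allocator-side fix avoids:
 * whenever a bulk call makes no progress, allocate one page the slow
 * way (which would run reclaim in the kernel) and go around again.
 */
#include <stdio.h>

/* Toy bulk allocator: counts populated slots but never fills new ones,
 * modelling an exhausted fast path. */
static int bulk_alloc(void **pages, int want)
{
        int filled = 0;

        for (int i = 0; i < want; i++)
                if (pages[i])
                        filled++;
        return filled;
}

/* Toy slow-path allocation: in the kernel this would wake kswapd or
 * enter direct reclaim before returning a page. */
static void *single_page_alloc(void)
{
        static char pool[64];
        static int next;

        return &pool[next++];                   /* distinct dummy "pages" */
}

static int bulk_alloc_exact(void **pages, int want)
{
        int filled = 0;

        while (filled < want) {
                int last = filled;

                filled = bulk_alloc(pages, want);
                if (filled != last || filled == want)
                        continue;               /* the bulk call made progress */

                /* Bulk call stalled: grab one page the slow way so reclaim
                 * runs, then try the bulk call again. */
                for (int i = 0; i < want; i++) {
                        if (!pages[i]) {
                                pages[i] = single_page_alloc();
                                filled++;
                                break;
                        }
                }
        }
        return filled;
}

int main(void)
{
        void *pages[4] = { 0 };

        printf("filled = %d\n", bulk_alloc_exact(pages, 4));
        return 0;
}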
Comment 24 Jordi 2022-06-04 16:25:19 UTC
Hi

I can reproduce the issue with an ext4 filesystem. I am currently checking with different kernels from the Gentoo distro:
5.15.41
5.17.1
5.17.5
5.18.1-r2 (the patch should already be applied here)
Comment 25 Jordi 2022-06-05 07:51:13 UTC
I have applied the patch to a vanilla 5.18.1 kernel and the issue is not resolved on ext4.

Here's the output of dmesg:

[Sun Jun  5 09:44:51 2022] sysrq: Show Blocked State
[Sun Jun  5 09:44:51 2022] task:kworker/u24:11  state:D stack:    0 pid:  145 ppid:     2 flags:0x00004000
[Sun Jun  5 09:44:51 2022] Workqueue: writeback wb_workfn (flush-8:32)
[Sun Jun  5 09:44:51 2022] Call Trace:
[Sun Jun  5 09:44:51 2022]  <TASK>
[Sun Jun  5 09:44:51 2022]  __schedule+0x26d/0xe80
[Sun Jun  5 09:44:51 2022]  ? __blk_flush_plug+0xe1/0x140
[Sun Jun  5 09:44:51 2022]  schedule+0x46/0xa0
[Sun Jun  5 09:44:51 2022]  io_schedule+0x3d/0x60
[Sun Jun  5 09:44:51 2022]  blk_mq_get_tag+0x115/0x2b0
[Sun Jun  5 09:44:51 2022]  ? destroy_sched_domains_rcu+0x20/0x20
[Sun Jun  5 09:44:51 2022]  __blk_mq_alloc_requests+0x157/0x290
[Sun Jun  5 09:44:51 2022]  blk_mq_submit_bio+0x188/0x520
[Sun Jun  5 09:44:51 2022]  submit_bio_noacct_nocheck+0x252/0x2a0
[Sun Jun  5 09:44:51 2022]  ext4_io_submit+0x1b/0x40
[Sun Jun  5 09:44:51 2022]  ext4_writepages+0x4fc/0xe80
[Sun Jun  5 09:44:51 2022]  ? ata_qc_issue+0x101/0x1b0
[Sun Jun  5 09:44:51 2022]  ? __ata_scsi_queuecmd+0x17a/0x380
[Sun Jun  5 09:44:51 2022]  ? update_load_avg+0x77/0x610
[Sun Jun  5 09:44:51 2022]  do_writepages+0xbc/0x1a0
[Sun Jun  5 09:44:51 2022]  ? __schedule+0x275/0xe80
[Sun Jun  5 09:44:51 2022]  __writeback_single_inode+0x2c/0x180
[Sun Jun  5 09:44:51 2022]  writeback_sb_inodes+0x204/0x490
[Sun Jun  5 09:44:51 2022]  __writeback_inodes_wb+0x47/0xd0
[Sun Jun  5 09:44:51 2022]  wb_writeback.isra.0+0x15d/0x1a0
[Sun Jun  5 09:44:51 2022]  wb_workfn+0x23d/0x3a0
[Sun Jun  5 09:44:51 2022]  process_one_work+0x1b0/0x300
[Sun Jun  5 09:44:51 2022]  worker_thread+0x48/0x3c0
[Sun Jun  5 09:44:51 2022]  ? rescuer_thread+0x380/0x380
[Sun Jun  5 09:44:51 2022]  kthread+0xd4/0x100
[Sun Jun  5 09:44:51 2022]  ? kthread_complete_and_exit+0x20/0x20
[Sun Jun  5 09:44:51 2022]  ret_from_fork+0x22/0x30
[Sun Jun  5 09:44:51 2022]  </TASK>
[Sun Jun  5 09:44:51 2022] task:tar             state:D stack:    0 pid:11576 ppid: 11566 flags:0x00000000
[Sun Jun  5 09:44:51 2022] Call Trace:
[Sun Jun  5 09:44:51 2022]  <TASK>
[Sun Jun  5 09:44:51 2022]  __schedule+0x26d/0xe80
[Sun Jun  5 09:44:51 2022]  ? __mod_timer+0x1c2/0x3b0
[Sun Jun  5 09:44:51 2022]  schedule+0x46/0xa0
[Sun Jun  5 09:44:51 2022]  schedule_timeout+0x7c/0xf0
[Sun Jun  5 09:44:51 2022]  ? fprop_fraction_percpu+0x2b/0x70
[Sun Jun  5 09:44:51 2022]  ? init_timer_key+0x30/0x30
[Sun Jun  5 09:44:51 2022]  io_schedule_timeout+0x47/0x70
[Sun Jun  5 09:44:51 2022]  balance_dirty_pages_ratelimited+0x235/0xa10
[Sun Jun  5 09:44:51 2022]  generic_perform_write+0x142/0x200
[Sun Jun  5 09:44:51 2022]  ext4_buffered_write_iter+0x76/0x100
[Sun Jun  5 09:44:51 2022]  new_sync_write+0x104/0x180
[Sun Jun  5 09:44:51 2022]  vfs_write+0x1e6/0x270
[Sun Jun  5 09:44:51 2022]  ksys_write+0x5a/0xd0
[Sun Jun  5 09:44:51 2022]  do_syscall_64+0x3b/0x90
[Sun Jun  5 09:44:51 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[Sun Jun  5 09:44:51 2022] RIP: 0033:0x7f994fe7c483
[Sun Jun  5 09:44:51 2022] RSP: 002b:00007ffe79956648 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[Sun Jun  5 09:44:51 2022] RAX: ffffffffffffffda RBX: 0000000000001084 RCX: 00007f994fe7c483
[Sun Jun  5 09:44:51 2022] RDX: 0000000000001084 RSI: 000055f5bb5a9a00 RDI: 0000000000000003
[Sun Jun  5 09:44:51 2022] RBP: 000055f5bb5a9a00 R08: 00007f994fe7c260 R09: 0000000000000000
[Sun Jun  5 09:44:51 2022] R10: 00000000000001a4 R11: 0000000000000246 R12: 0000000000000003
[Sun Jun  5 09:44:51 2022] R13: 0000000000000003 R14: 000055f5bb5b5320 R15: 000055f5bb5a9a00
[Sun Jun  5 09:44:51 2022]  </TASK>
Comment 26 Jan Kara 2022-06-06 07:49:15 UTC
Jordi, this is a very different issue (different filesystem, different stack trace). Please don't hijack bugs like this. As far as I can see, your stack traces look very standard and we are just waiting for the disk to complete the IO. If the IO does not progress, please file a separate bug against ext4. Thanks!

And I'm closing this bug as it is already fixed.