Bug 217965 - ext4(?) regression since 6.5.0 on sata hdd
Summary: ext4(?) regression since 6.5.0 on sata hdd
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All
OS: Linux
Importance: P3 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-02 08:10 UTC by Ivan Ivanich
Modified: 2024-03-23 08:53 UTC
CC List: 12 users

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Kernel config (35.29 KB, application/gzip)
2023-10-02 08:56 UTC, Ivan Ivanich
perf test logs (20.51 KB, text/plain)
2023-11-16 04:33 UTC, Eyal Lebedinsky
detailed perf log (4.88 MB, text/plain)
2023-11-17 01:39 UTC, Eyal Lebedinsky
more perf reports (4.88 MB, text/plain)
2023-11-17 23:53 UTC, Eyal Lebedinsky
more perf reports (4.88 MB, text/plain)
2023-11-17 23:57 UTC, Eyal Lebedinsky

Description Ivan Ivanich 2023-10-02 08:10:21 UTC
Since kernel 6.5.x and 6.6-rc* I'm getting weird kworker flush activity when building openwrt from sources.
91 root      20   0       0      0      0 R  99,7   0,0  18:06.57 kworker/u16:4+flush-8:16

The OpenWrt sources reside on a SATA HDD with an ext4 fs; I've been using this setup for the last 5 years. The problem is that since the 6.5 kernels, after the OpenWrt kernel patch make step (https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=scripts/patch-kernel.sh;h=c2b7e7204952f93946a6075d546cbeae32c2627f;hb=HEAD, which probably involves a lot of copy and write operations), kworker/u16:4+flush-8:16 uses 100% of one core for a while (5-15 minutes), even after I cancel the OpenWrt build.

I tried moving the OpenWrt sources folder to the SSD drive where my system resides and running the OpenWrt build from there, and I get no issues with kworker flush CPU usage. I also have no such behavior with 6.4.x and older kernels, so it looks like a regression to me; I'm not sure whether this is an fs, vfs or even block subsystem issue.

This is how it looks in perf:
Samples: 320K of event 'cycles:P', Event count (approx.): 363448649248
  Children      Self  Command          Shared Object                        Symbol
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ret_from_fork_asm
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ret_from_fork
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] kthread
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] worker_thread
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] process_one_work
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] wb_workfn
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] wb_writeback
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] __writeback_inodes_wb
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] writeback_sb_inodes
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] __writeback_single_inode
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] do_writepages
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_writepages
+   12,40%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_do_writepages
+   12,39%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_map_blocks
+   12,39%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_ext_map_blocks
+   12,38%     0,00%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_mb_new_blocks
+   12,38%     0,93%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_mb_regular_allocator
+    9,42%     0,00%  cc1              [unknown]                            [.] 0000000000000000
+    5,42%     0,53%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_mb_scan_aligned
+    4,88%     0,69%  kworker/u16:2+f  [kernel.vmlinux]                     [k] mb_find_extent
+    3,99%     3,95%  kworker/u16:2+f  [kernel.vmlinux]                     [k] mb_find_order_for_block
+    3,51%     0,61%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_mb_load_buddy_gfp
+    2,95%     0,01%  cc1              [kernel.vmlinux]                     [k] asm_exc_page_fault
+    2,67%     0,18%  kworker/u16:2+f  [kernel.vmlinux]                     [k] pagecache_get_page
+    2,41%     0,40%  kworker/u16:2+f  [kernel.vmlinux]                     [k] __filemap_get_folio
+    2,33%     2,10%  cc1              cc1                                  [.] cpp_get_token_1
+    2,12%     0,05%  cc1              [kernel.vmlinux]                     [k] exc_page_fault
+    2,07%     0,04%  cc1              [kernel.vmlinux]                     [k] do_user_addr_fault
+    1,81%     0,52%  kworker/u16:2+f  [kernel.vmlinux]                     [k] filemap_get_entry
     1,80%     1,71%  cc1              cc1                                  [.] ht_lookup_with_hash
+    1,77%     0,08%  cc1              [kernel.vmlinux]                     [k] handle_mm_fault
+    1,65%     0,14%  cc1              [kernel.vmlinux]                     [k] __handle_mm_fault
     1,60%     1,49%  cc1              cc1                                  [.] _cpp_lex_direct
+    1,54%     0,73%  kworker/u16:2+f  [kernel.vmlinux]                     [k] ext4_mb_good_group
+    1,49%     1,46%  cc1              cc1                                  [.] ggc_internal_alloc
+    1,28%     0,05%  cc1              [kernel.vmlinux]                     [k] do_anonymous_page
+    1,28%     0,04%  cc1              [kernel.vmlinux]                     [k] entry_SYSCALL_64
Comment 1 Ivan Ivanich 2023-10-02 08:56:06 UTC
Created attachment 305179 [details]
Kernel config
Comment 2 Ivan Ivanich 2023-10-02 08:57:02 UTC
Backtrace from echo l > /proc/sysrq-trigger:


[ 3128.330756] CPU: 2 PID: 311853 Comm: patch Tainted: P           O       6.6.0-rc4-x86_64 #1
[ 3128.330758] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016
[ 3128.330759] RIP: 0010:mb_find_order_for_block+0x68/0xa0
[ 3128.330762] Code: d0 02 00 00 42 0f b7 04 00 48 01 f0 48 89 c6 41 89 f9 c1 e0 03 83 e0 38 41 d3 f9 48 83 e6 f8 44 01 c8 48 98 48 0f a3 06 73 27 <0f> b7 42 30 83 c1 01 83 c0 01 39 c1 7f 17 48 8b 42 18 48 8b 72 08
[ 3128.330763] RSP: 0018:ffffc9000e363630 EFLAGS: 00000207
[ 3128.330765] RAX: 0000000000000036 RBX: ffffc9000e363698 RCX: 000000000000000c
[ 3128.330766] RDX: ffffc9000e363738 RSI: ffff88810a7e2ff8 RDI: 0000000000006000
[ 3128.330767] RBP: 0000000000007ff0 R08: 0000000000000018 R09: 0000000000000006
[ 3128.330767] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000006000
[ 3128.330768] R13: 0000000000006000 R14: 000000000000000d R15: ffffc9000e363738
[ 3128.330769] FS:  00007f237c609740(0000) GS:ffff88841fa80000(0000) knlGS:0000000000000000
[ 3128.330771] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3128.330772] CR2: 00005576c2964000 CR3: 0000000261c76004 CR4: 00000000001706e0
[ 3128.330773] Call Trace:
[ 3128.330774]  <NMI>
[ 3128.330774]  ? nmi_cpu_backtrace+0x88/0xf0
[ 3128.330776]  ? nmi_cpu_backtrace_handler+0xc/0x20
[ 3128.330778]  ? nmi_handle+0x53/0xf0
[ 3128.330780]  ? default_do_nmi+0x40/0x220
[ 3128.330782]  ? exc_nmi+0xf3/0x120
[ 3128.330784]  ? end_repeat_nmi+0x16/0x67
[ 3128.330787]  ? mb_find_order_for_block+0x68/0xa0
[ 3128.330788]  ? mb_find_order_for_block+0x68/0xa0
[ 3128.330790]  ? mb_find_order_for_block+0x68/0xa0
[ 3128.330792]  </NMI>
[ 3128.330792]  <TASK>
[ 3128.330793]  mb_find_extent+0x149/0x200
[ 3128.330795]  ext4_mb_scan_aligned+0xd3/0x140
[ 3128.330798]  ext4_mb_regular_allocator+0xcca/0xe50
[ 3128.330800]  ext4_mb_new_blocks+0x8c6/0xe80
[ 3128.330802]  ? ext4_find_extent+0x3c1/0x410
[ 3128.330803]  ? release_pages+0x13a/0x400
[ 3128.330805]  ext4_ext_map_blocks+0x626/0x15c0
[ 3128.330807]  ? filemap_get_folios_tag+0x1c6/0x1f0
[ 3128.330809]  ? mpage_prepare_extent_to_map+0x462/0x4a0
[ 3128.330811]  ext4_map_blocks+0x18b/0x5f0
[ 3128.330813]  ? kmem_cache_alloc+0x278/0x410
[ 3128.330814]  ? ext4_alloc_io_end_vec+0x19/0x50
[ 3128.330816]  ext4_do_writepages+0x619/0xa90
[ 3128.330818]  ext4_writepages+0xa8/0x180
[ 3128.330819]  do_writepages+0xc6/0x190
[ 3128.330821]  ? __find_get_block+0x1b2/0x2a0
[ 3128.330824]  filemap_fdatawrite_wbc+0x5e/0x80
[ 3128.330827]  __filemap_fdatawrite_range+0x57/0x80
[ 3128.330829]  ext4_rename+0x753/0xbf0
[ 3128.330831]  vfs_rename+0xa7e/0xbe0
[ 3128.330834]  ? do_renameat2+0x4be/0x660
[ 3128.330836]  do_renameat2+0x4be/0x660
[ 3128.330838]  __x64_sys_renameat+0x44/0x60
[ 3128.330839]  do_syscall_64+0x3f/0x90
[ 3128.330841]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3128.330843] RIP: 0033:0x7f237c66160a
[ 3128.330844] Code: 48 8b 15 21 e8 15 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 08 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 e9 e7 15 00 f7
[ 3128.330845] RSP: 002b:00007ffcf85ca528 EFLAGS: 00000202 ORIG_RAX: 0000000000000108
[ 3128.330847] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f237c66160a
[ 3128.330848] RDX: 0000000000000004 RSI: 00005576c295a091 RDI: 0000000000000004
[ 3128.330848] RBP: 0000000000000004 R08: 00000000000081a4 R09: 0000000000000000
[ 3128.330849] R10: 00005576c2959ad1 R11: 0000000000000202 R12: 00005576c2959ad1
[ 3128.330850] R13: 00005576c295a091 R14: 00007ffcf85ca630 R15: 0000000000000000
[ 3128.330851]  </TASK>
[ 3128.330852] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x62/0xb0
[ 4648.228980] sysrq: Show backtrace of all active CPUs
[ 4648.228983] NMI backtrace for cpu 1
[ 4648.228985] CPU: 1 PID: 526431 Comm: bash Tainted: P           O       6.6.0-rc4-x86_64 #1
[ 4648.228987] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016
[ 4648.228987] Call Trace:
[ 4648.228989]  <TASK>
[ 4648.228991]  dump_stack_lvl+0x36/0x50
[ 4648.228995]  nmi_cpu_backtrace+0xc0/0xf0
[ 4648.228997]  ? __pfx_nmi_raise_cpu_backtrace+0x10/0x10
[ 4648.228999]  nmi_trigger_cpumask_backtrace+0xc8/0xf0
[ 4648.229000]  __handle_sysrq+0xd2/0x190
[ 4648.229002]  write_sysrq_trigger+0x23/0x40
[ 4648.229004]  proc_reg_write+0x57/0xa0
[ 4648.229006]  vfs_write+0xcb/0x410
[ 4648.229008]  ? filp_flush+0x4d/0x80
[ 4648.229010]  ksys_write+0x62/0xe0
[ 4648.229011]  do_syscall_64+0x3f/0x90
[ 4648.229013]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 4648.229016] RIP: 0033:0x7f3d2d70e640
[ 4648.229017] Code: 87 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 80 3d 29 0d 0d 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89
[ 4648.229019] RSP: 002b:00007ffd39f5e4a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 4648.229021] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3d2d70e640
[ 4648.229022] RDX: 0000000000000002 RSI: 000055fbbed84ba0 RDI: 0000000000000001
[ 4648.229023] RBP: 00007f3d2d7d85c0 R08: 0000000000000030 R09: 00007f3d2d7d7b00
[ 4648.229024] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 4648.229025] R13: 000055fbbed84ba0 R14: 0000000000000002 R15: 00007f3d2d7d5f40
[ 4648.229026]  </TASK>
[ 4648.229027] Sending NMI from CPU 1 to CPUs 0,2-7:
[ 4648.229030] NMI backtrace for cpu 3 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229033] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229036] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229039] NMI backtrace for cpu 2
[ 4648.229041] CPU: 2 PID: 71621 Comm: kworker/u16:1 Tainted: P           O       6.6.0-rc4-x86_64 #1
[ 4648.229042] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016
[ 4648.229043] Workqueue: writeback wb_workfn (flush-8:16)
[ 4648.229047] RIP: 0010:mb_find_order_for_block+0x66/0xa0
[ 4648.229050] Code: 8b 80 d0 02 00 00 42 0f b7 04 00 48 01 f0 48 89 c6 41 89 f9 c1 e0 03 83 e0 38 41 d3 f9 48 83 e6 f8 44 01 c8 48 98 48 0f a3 06 <73> 27 0f b7 42 30 83 c1 01 83 c0 01 39 c1 7f 17 48 8b 42 18 48 8b
[ 4648.229052] RSP: 0018:ffffc90001f076c8 EFLAGS: 00000203
[ 4648.229053] RAX: 0000000000000138 RBX: ffffc90001f07730 RCX: 0000000000000006
[ 4648.229054] RDX: ffffc90001f077d0 RSI: ffff88810a7e2f80 RDI: 0000000000004e30
[ 4648.229055] RBP: 0000000000007ff0 R08: 000000000000000c R09: 0000000000000138
[ 4648.229056] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881049f7800
[ 4648.229056] R13: 0000000000000001 R14: 0000000000007ff0 R15: ffffc90001f077d0
[ 4648.229057] FS:  0000000000000000(0000) GS:ffff88841fa80000(0000) knlGS:0000000000000000
[ 4648.229059] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4648.229060] CR2: 00007f23b3ffc3a0 CR3: 000000023ba2a001 CR4: 00000000001706e0
[ 4648.229060] Call Trace:
[ 4648.229061]  <NMI>
[ 4648.229062]  ? nmi_cpu_backtrace+0x88/0xf0
[ 4648.229064]  ? nmi_cpu_backtrace_handler+0xc/0x20
[ 4648.229065]  ? nmi_handle+0x53/0xf0
[ 4648.229068]  ? default_do_nmi+0x40/0x220
[ 4648.229070]  ? exc_nmi+0xf3/0x120
[ 4648.229072]  ? end_repeat_nmi+0x16/0x67
[ 4648.229075]  ? mb_find_order_for_block+0x66/0xa0
[ 4648.229077]  ? mb_find_order_for_block+0x66/0xa0
[ 4648.229079]  ? mb_find_order_for_block+0x66/0xa0
[ 4648.229081]  </NMI>
[ 4648.229081]  <TASK>
[ 4648.229081]  mb_find_extent+0xd7/0x200
[ 4648.229084]  ext4_mb_scan_aligned+0xd3/0x140
[ 4648.229086]  ? ext4_get_group_info+0x4b/0x60
[ 4648.229088]  ext4_mb_regular_allocator+0xcca/0xe50
[ 4648.229090]  ? mempool_alloc+0x58/0x180
[ 4648.229093]  ext4_mb_new_blocks+0x8c6/0xe80
[ 4648.229094]  ? ext4_find_extent+0x3c1/0x410
[ 4648.229096]  ext4_ext_map_blocks+0x626/0x15c0
[ 4648.229099]  ? filemap_get_folios_tag+0x1c6/0x1f0
[ 4648.229100]  ? mpage_prepare_extent_to_map+0x462/0x4a0
[ 4648.229102]  ext4_map_blocks+0x18b/0x5f0
[ 4648.229104]  ? kmem_cache_alloc+0x278/0x410
[ 4648.229106]  ? ext4_alloc_io_end_vec+0x19/0x50
[ 4648.229109]  ext4_do_writepages+0x619/0xa90
[ 4648.229111]  ext4_writepages+0xa8/0x180
[ 4648.229113]  do_writepages+0xc6/0x190
[ 4648.229115]  __writeback_single_inode+0x2c/0x1a0
[ 4648.229117]  writeback_sb_inodes+0x1e8/0x450
[ 4648.229119]  __writeback_inodes_wb+0x47/0xe0
[ 4648.229121]  wb_writeback.isra.0+0x159/0x1b0
[ 4648.229123]  wb_workfn+0x269/0x380
[ 4648.229125]  ? __schedule+0x2ca/0x1110
[ 4648.229127]  ? __mod_timer+0x116/0x370
[ 4648.229130]  process_one_work+0x12f/0x240
[ 4648.229132]  worker_thread+0x2f0/0x410
[ 4648.229134]  ? __pfx_worker_thread+0x10/0x10
[ 4648.229136]  kthread+0xe3/0x110
[ 4648.229138]  ? __pfx_kthread+0x10/0x10
[ 4648.229139]  ret_from_fork+0x2f/0x50
[ 4648.229141]  ? __pfx_kthread+0x10/0x10
[ 4648.229142]  ret_from_fork_asm+0x1b/0x30
[ 4648.229145]  </TASK>
[ 4648.229145] NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229149] NMI backtrace for cpu 4 skipped: idling at intel_idle+0x62/0xb0
[ 4648.229151] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x62/0xb0
Comment 3 Theodore Tso 2023-10-02 15:22:43 UTC
What sort of information can you give about the ext4 file system where this is happening?  How much free space is there?  Can you run "df" and "dumpe2fs -h" on the system so we can see which file system features were enabled?  Also, can you specify the kind of block device?

Differential debugging would be useful.  For example, can you try copying the file system image to different hardware and see if you can reproduce the behavior using the same kernel build workload?  What if you copy the file system image to a USB-attached SSD, and see if you can reproduce the behavior?  What if you attach the USB-attached SSD and use a freshly formatted ext4 file system?  Does it reproduce then?  If the file system is nearly full, what if you delete a lot of unused files to provide a lot more free space, and see if it reproduces then?
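
For example, something along these lines (mount point and device are placeholders):

$ df /path/to/the/filesystem
$ sudo dumpe2fs -h /dev/sdXN
$ lsblk -d -o NAME,ROTA,MODEL,SIZE   # ROTA=1 indicates a rotational (spinning) disk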
Comment 4 Ivan Ivanich 2023-10-02 15:55:08 UTC
(In reply to Theodore Tso from comment #3)
> What sort of information can you give about the ext4 file system where this
> is happening?  How much free space is there?  Can you run "df" and
> "dumpe2fs -h" on the system so we can see which file system features were
> enabled?  Also, can you specify the kind of block device?
> 
> Differential debugging would be useful.  For example, can you try copying
> the file system image to different hardware and see if you can reproduce
> the behavior using the same kernel build workload?  What if you copy the
> file system image to a USB-attached SSD, and see if you can reproduce the
> behavior?  What if you attach the USB-attached SSD and use a freshly
> formatted ext4 file system?  Does it reproduce then?  If the file system is
> nearly full, what if you delete a lot of unused files to provide a lot
> more free space, and see if it reproduces then?

mount |grep sdb2
/dev/sdb2 on /mnt/sdb2 type ext4 (rw,relatime,stripe=32752)

df /mnt/sdb2/
Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/sdb2      2667140496 2125391752 539022500  80% /mnt/sdb2


dumpe2fs -h /dev/sdb2
dumpe2fs 1.47.0 (5-Feb-2023)
Filesystem volume name:   <none>
Last mounted on:          /mnt/sdb2
Filesystem UUID:          3bd82ee7-f08a-4a23-8e50-2df3786b0858
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              169369600
Block count:              677467392
Reserved block count:     677465
Overhead clusters:        10682268
Free blocks:              135437186
Free inodes:              168057645
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      862
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              32752
Flex block group size:    16
Filesystem created:       Wed Jun 25 13:10:17 2014
Last mount time:          Mon Oct  2 18:46:12 2023
Last write time:          Mon Oct  2 18:46:12 2023
Mount count:              23
Maximum mount count:      -1
Last checked:             Thu Sep 28 14:37:21 2023
Check interval:           0 (<none>)
Lifetime writes:          84 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      a90aaa2d-577c-4181-83f9-696dee7ea92d
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Total journal size:       128M
Total journal blocks:     32768
Max transaction length:   32768
Fast commit length:       0
Journal sequence:         0x002a8661
Journal start:            9556

I don't have physical access to this machine, so I can't do any manipulations to the hardware or attach USB drives, etc.

The device is:
Model Family:     Toshiba 3.5" DT01ACA... Desktop HDD
Device Model:     TOSHIBA DT01ACA300
Serial Number:    14KP58HGS
LU WWN Device Id: 5 000039 ff4d7b718
Firmware Version: MX6OABB0
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct  2 18:53:26 2023 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled
Comment 5 Ojaswin Mujoo 2023-10-03 06:50:12 UTC
Hi Ivan, 

Thanks for sharing the additional information. I can see that you are passing stripe=32752, and in the backtrace I do see some functions that indicate a striped allocation was taking place. There were some changes to the allocator in the 6.5 kernel, if I'm not wrong, but I'll need to try to replicate this to confirm whether the issue is related to those, and will update here.

Meanwhile, can you provide some more info on the following:

1. I want to check whether this issue is related to the striped allocation code path or if it's also seen in the usual path. To check that, would it be possible for you to run the same workload without passing the "stripe=32752" option at mount time?

2. How often do you see this issue? Is it noticeable 100% of the time?

Regards,
ojaswin
Comment 6 Ivan Ivanich 2023-10-03 06:57:11 UTC
Hi Ojaswin,
1. Actually, I'm not passing the stripe option in my fstab; it seems it's added automatically. Could you please clarify how I can get rid of it?

corresponding fstab entry

UUID=3bd82ee7-f08a-4a23-8e50-2df3786b0858       /mnt/sdb2       ext4            defaults,user_xattr   0 2

2. The issue is reproducible/visible 100% of the time when building OpenWrt on kernels >= 6.5, including 6.6.
Comment 7 Ivan Ivanich 2023-10-03 11:04:59 UTC
Removed the stripe setting with tune2fs -f -E stride=0 /dev/sdb2 and it fixed the issue. I still have no clue why it was enabled.
Thanks
Comment 8 Ojaswin Mujoo 2023-10-03 13:42:04 UTC
Hey Ivan, thanks! 

It seems the stripe mount option is automatically added if the filesystem was made using the stride or stripe-width options at mkfs time.
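
For example, a filesystem created with something like the following (values and device are purely illustrative):

$ mkfs.ext4 -E stride=16,stripe_width=32 /dev/sdXN

records a RAID stride/stripe width in the superblock (the "RAID stride" field you can see in dumpe2fs -h or tune2fs -l), and ext4 then uses that to set stripe= automatically at mount time.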

Can you kindly perform the below steps as well to further help narrow this issue down:

1. Mount FS with the usual options (including using stripe=32752). 
2. Once mounted, run:

$ echo "0" > /sys/fs/ext4/sdb2/mb_best_avail_max_trim_order

This might need sudo. Basically, this disables one of the allocator features that was added in 6.5.

3. Run the workload again and see if the issue is still visible.

Regards,
ojaswin
Comment 9 Ivan Ivanich 2023-10-03 14:23:49 UTC
With the mount option stripe=32752 and mb_best_avail_max_trim_order disabled, the issue is still visible.

cat /sys/fs/ext4/sdb2/mb_best_avail_max_trim_order
0

mount|grep sdb2
/dev/sdb2 on /mnt/sdb2 type ext4 (rw,relatime,stripe=32752)

4687 root      20   0       0      0      0 R 100,0   0,0   0:25.73 kworker/u16:1+flush-8:16
Comment 10 Ojaswin Mujoo 2023-10-04 12:44:30 UTC
Hi Ivan, 

So unfortunately I'm not yet able to replicate it at my end. While I try that, I wanted to check if you can give a few things a try.

So it seems like the CPU is stuck in mb_find_order_for_block() called from mb_find_extent(). I do see a while loop in mb_find_order_for_block(), but it's not obvious whether it's stuck there and, if so, why.

If possible can you:

1. Recompile the kernel with CONFIG_DEBUG_INFO=y, CONFIG_SOFTLOCKUP_DETECTOR=y and CONFIG_HARDLOCKUP_DETECTOR=y, which might provide more information in dmesg when the lockup happens.

2. Replicate it once more and note the RIP value in the trace of stuck CPU, for example in the above trace it was mb_find_order_for_block+0x68 for CPU2.

3. Run the following from the kernel's source dir to get the corresponding line number (CONFIG_DEBUG_INFO needed):

$ ./scripts/faddr2line vmlinux mb_find_order_for_block+0x68

Maybe you can share the code you see in and around those lines as well as the exact kernel version.

This will help pinpoint the location where the code might be stuck (for example in a loop), which can help root cause this.

Also, you mentioned that the CPU gets stuck at 100% util for 10-15 mins; does it ever come back to normal or does it stay stuck?

Regards,
ojaswin
Comment 11 Ivan Ivanich 2023-10-04 17:23:03 UTC
Hi Ojaswin,

Tried it, but unfortunately it gives me nothing useful:

[  619.894677] RIP: 0010:mb_find_order_for_block+0x50/0xa0

./scripts/faddr2line vmlinux mb_find_order_for_block+0x50
mb_find_order_for_block+0x50/0xa0:
fs mm ??:0

All required kernel options are enabled:
zcat /proc/config.gz |grep -i 'CONFIG_DEBUG_INFO\|CONFIG_SOFTLOCKUP_DETECTOR\|CONFIG_HARDLOCKUP_DETECTOR'|grep -v '#'
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER=y

>Also, you mentioned that the CPU gets stuck at 100% util for 10-15mins, does
>it ever come back to normal or does it stay stuck? 

Yes, it comes back to normal after quite some time.
Comment 12 Ojaswin Mujoo 2023-10-11 07:53:39 UTC
Hey Ivan, 

So I used the kernel v6.6-rc1 and the same config you provided, as well as mounted an HDD on my VM. Then I followed the steps here to build openwrt [1]. However, I'm still unable to replicate the 100% CPU utilization in a kworker/flush thread (I do get .

Since you have the config options enabled and we didn't see them trigger any warning, and given that things get back to normal after a few minutes, this indicates that it's not a lockup/deadlock. We also see that on a faster SSD the issue doesn't appear, so this might even have something to do with a lot of IOs being queued up on the slower disk, causing us to notice the delay. Maybe we are waiting a lot more on some spinlock, which could explain the CPU utilization.

Since I'm unable to replicate it, I'll have to ask you for some more info to get to the bottom of this. More specifically, can you kindly provide the following:

For the kernel with this issue: 

1. Replicate the 100% util in one terminal window.
2. Once the 100% util is hit, in another terminal run the following command:

$ iostat -x /dev/<dev> 2  (run this for 20 to 30 seconds)  
$ perf record -ag sleep 20
$ echo l > /proc/sysrq-trigger
$ uname -a

3. Repeat the above for a kernel where the issue is not seen. 

Kindly share the sysrq backtrace, iostat output, perf.data and the uname output for both runs here so that I can take a closer look at what is causing the unusual utilization.

[1] https://github.com/openwrt/openwrt#quickstart


Regards,
ojaswin
Comment 13 Ivan Ivanich 2023-10-11 16:07:48 UTC
Affected one:
iostat: https://pastebin.com/mau5VYnV
perf: https://drive.google.com/file/d/15FO5YsvYrFnKtJI9hfl8qPG7oI8SgLeI/view?usp=sharing
backtrace: https://pastebin.com/AgjYXhjR
uname -a:
Linux gentoo-desktop 6.5.7-gentoo-x86_64 #1 SMP PREEMPT_DYNAMIC Wed Oct 11 14:40:24 EEST 2023 x86_64 Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz GenuineIntel GNU/Linux

I don't have 6.4.16 kernel at the moment, will compile it later and provide relevant info.
Comment 14 Ivan Ivanich 2023-10-11 22:23:58 UTC
Not affected 6.4.16:
iostat: https://pastebin.com/6XadeAuh
perf: https://drive.google.com/file/d/1neAq73hoztfTU87NyCHWGAQhCKQfILOH/view?usp=sharing
backtrace: https://pastebin.com/zdnWgx4M
uname -a:
Linux gentoo-desktop 6.4.16-gentoo-x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 12 00:09:02 EEST 2023 x86_64 Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz GenuineIntel GNU/Linux
Comment 15 Eduard Kohler 2023-10-17 10:45:29 UTC
I encounter the same issue on a stock Fedora 37 kernel:
# uname -a
Linux xxx.net 6.4.16-100.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Sep 13 18:22:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

The issue seems to be present also on btrfs partitions, as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2242391

In my case, I have an EXT4 partition over an mdadm raid 1 array of two HDDs.

# mdadm --misc --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Sun Feb 24 21:40:33 2013
        Raid Level : raid1
        Array Size : 3906118928 (3.64 TiB 4.00 TB)
     Used Dev Size : 3906118928 (3.64 TiB 4.00 TB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Tue Oct 17 12:05:22 2023
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

              Name : xxx.net:0  (local to host xxx.net)
              UUID : 25070e74:b2ac1695:46ee20f9:7e8d1e05
            Events : 402440

    Number   Major   Minor   RaidDevice State
       3       8       17        0      active sync   /dev/sdb1
       2       8       33        1      active sync   /dev/sdc1

dumpe2fs -h /dev/md0p1
dumpe2fs 1.46.5 (30-Dec-2021)
Filesystem volume name:   <none>
Last mounted on:          /home
Filesystem UUID:          4df09a03-d8ef-4cef-a64d-cefde4823fc2
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              244137984
Block count:              976529232
Reserved block count:     48826457
Overhead clusters:        15373546
Free blocks:              545717412
Free inodes:              243040909
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      791
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              32745
Flex block group size:    16
Filesystem created:       Sun Feb 24 22:13:52 2013
Last mount time:          Mon Oct 16 21:06:57 2023
Last write time:          Mon Oct 16 21:06:57 2023
Mount count:              6
Maximum mount count:      -1
Last checked:             Sat Oct  7 16:27:02 2023
Check interval:           0 (<none>)
Lifetime writes:          238 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       13509593
Default directory hash:   half_md4
Directory Hash Seed:      af37e79f-0457-4318-9c5d-840dc2f60bce
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Total journal size:       128M
Total journal blocks:     32768
Max transaction length:   32768
Fast commit length:       0
Journal sequence:         0x003bc716
Journal start:            1

# df /dev/md0p1
Filesystem      1K-blocks       Used  Available Use% Mounted on
/dev/md0p1     3844622744 1661757340 1987543192  46% /home

# mount |grep md0
/dev/md0p1 on /home type ext4 (rw,relatime,seclabel,stripe=32745)

In order to trigger the 100% CPU on one core, a simple
# for i in {0001..0200}; do echo "some text" > "file_${i}.txt"; done
is enough. This brings the kworker/flush thread to 100% for around 90 seconds. The time depends on the number of created files. Trying to delete these files is more or less impossible.

On the same setup I have a small SSD used for system installation. The same test doesn't bring the kworker/flush behaviour.

# dumpe2fs -h /dev/sda3
dumpe2fs 1.46.5 (30-Dec-2021)
Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          589968d6-2283-4903-8699-1b23f5c341a5
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              848640
Block count:              3390208
Reserved block count:     169510
Free blocks:              1823000
Free inodes:              735742
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      827
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8160
Inode blocks per group:   510
Flex block group size:    16
Filesystem created:       Sun Apr  6 00:00:12 2014
Last mount time:          Mon Oct 16 21:06:53 2023
Last write time:          Mon Oct 16 21:06:47 2023
Mount count:              437
Maximum mount count:      -1
Last checked:             Sun Apr  6 00:00:12 2014
Check interval:           0 (<none>)
Lifetime writes:          1791 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       130778
Default directory hash:   half_md4
Directory Hash Seed:      7d17214b-9585-4370-8c06-4236e449fa7f
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Total journal size:       128M
Total journal blocks:     32768
Max transaction length:   32768
Fast commit length:       0
Journal sequence:         0x0114205c
Journal start:            8194

# mount |grep sda3
/dev/sda3 on / type ext4 (rw,noatime,seclabel)
Comment 16 Eduard Kohler 2023-10-17 10:52:23 UTC
Oops, not the right kernel version.
The 6.4.16 one is the one that is working. The one with the issue (and actually all of the 6.5 kernels I tested) is:
Linux version 6.5.4-100.fc37.x86_64 (mockbuild@bkernel01.iad2.fedoraproject.org) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Tue Sep 19 13:13:25 UTC 2023
Comment 17 Ojaswin Mujoo 2023-10-20 10:24:17 UTC
Hey Ivan, Eduard,

Thanks for all the data. The iostat output suggests that we are breaking up the writes much more, as seen in the wareq-sz field, which is much lower in the regressed kernel. Further, since the issue only appears when striping is used (i.e. in RAID scenarios), it seems like it has something to do with RAID.

Eduard's reproducer seems much easier to try, so let me see if I'm able to replicate this over a RAID setup of HDDs.

Regards,
ojaswin
Comment 18 Eduard Kohler 2023-10-21 10:07:50 UTC
Hi Ojaswin,

If you want to replicate something similar to my setup, I'd better give you more info about it, because it's not a desktop computer.

It's a very slow system by current standards, and I think that may explain why there aren't more people complaining about this issue. In Ivan's case it seems that after a short while it recovered, even with a lot of files. His setup seems more powerful than mine. In my case, pulling a 75MB container with podman didn't recover after 2 hours of 100% CPU.

The system has 4G RAM:
# free
               total        used        free      shared  buff/cache   available
Mem:         3977448      615760      410332       14580     2951356     3045304
Swap:        1564668           0     1564668

and an AMD G-T40E processor:
# cat /proc/cpuinfo 
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 20
model		: 2
model name	: AMD G-T40E Processor
stepping	: 0
microcode	: 0x5000119
cpu MHz		: 801.798
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt hw_pstate vmmcall arat npt lbrv svm_lock nrip_save pausefilter
bugs		: fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 1999.82
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor	: 1
vendor_id	: AuthenticAMD
cpu family	: 20
model		: 2
model name	: AMD G-T40E Processor
stepping	: 0
microcode	: 0x5000119
cpu MHz		: 799.941
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt hw_pstate vmmcall arat npt lbrv svm_lock nrip_save pausefilter
bugs		: fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 1999.82
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

The underlying raid1 hardware changed since 2013 from 1T -> 2T -> 4T HDDs. The file system grew accordingly. During the 2T -> 4T step the disk layout was converted from MBR to GPT:
# gdisk -l /dev/sdb
GPT fdisk (gdisk) version 1.0.9

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdb: 7814037168 sectors, 3.6 TiB
Model: ST4000VN008-2DR1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): CD6E68D9-1A0F-40F7-9755-FDE136159F92
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 1537100 sectors (750.5 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7812502048   3.6 TiB     8300  primary

And again the raid1 md0 partition table:
# gdisk -l /dev/md0
GPT fdisk (gdisk) version 1.0.9

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/md0: 7812237857 sectors, 3.6 TiB
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): 46C6A662-69D1-492D-A31F-1121048F9FE3
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7812237823
Partitions will be aligned on 2048-sector boundaries
Total free space is 3933 sectors (1.9 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7812235904   3.6 TiB     8300  Linux filesystem

Last one is the EXT4 filesystem.
Comment 19 Eyal Lebedinsky 2023-11-13 22:46:47 UTC
I seem to have a similar issue, on a PC with raid6, which was discussed here
   https://marc.info/?l=linux-raid&m=169922587226984
which ended up pointing me here.

My impression is that it is not a Fedora-specific issue (I am on f38), so here is the correct place to post.

I have seen no progress for 4 weeks; is there any update? It looks like people were getting close to understanding the cause, maybe even to a solution.
Comment 20 Ojaswin Mujoo 2023-11-14 05:55:58 UTC
Hey Eyal,

I was continuing to look into this but had to set it aside for a bit, mainly since I got caught up with some other work, plus I don't seem to have the right hardware to trigger this either.

I see that you are facing a similar issue and have some more information in your post. Thanks for that; let me take some more time to go through it and think about the issue.

Btw, would it be possible for you to do a git bisect of the issue to see which is the bad commit? Since I'm not able to replicate it on my setup, that could be really helpful.
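
Roughly, that would look something like the following (assuming v6.4 as the last good and v6.5 as the first bad release):

$ git bisect start
$ git bisect bad v6.5      # first release where the issue is seen
$ git bisect good v6.4     # last known good release
(build and boot the commit git checks out, run the reproducer, then mark it
with "git bisect good" or "git bisect bad"; repeat until git reports the
first bad commit, and finish with "git bisect reset")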

Regards,
ojaswin
Comment 21 Eyal Lebedinsky 2023-11-14 06:47:06 UTC
Hi Ojaswin,

Thanks for the info.

This is a 24x7 server, so rebooting often is not something I would consider.

However, if a specific kernel is of interest, which I can install from the Fedora repositories, then I could probably give it a test.

I will give 6.4.15 a try though.
Comment 22 Eyal Lebedinsky 2023-11-14 08:22:03 UTC
I now ran a test on 6.4.15. I copied a segment of the tree I have issues with.
    7.2GB in 77,972 files.

It took just over 9 minutes to sync/drain the dirty cache, and the array ran like this
(I am watching the kB_wrtn/s column):

         Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
18:15:01 md127           423.00         4.00      7001.20         0.00         40      70012          0
18:15:11 md127           112.30        14.80      9689.60         0.00        148      96896          0
18:15:21 md127            28.70         3.60       964.80         0.00         36       9648          0
18:15:31 md127           321.90         4.00      1340.40         0.00         40      13404          0
18:15:41 md127            18.20         0.00       123.20         0.00          0       1232          0
18:15:51 md127            26.80        10.00      2062.00         0.00        100      20620          0
18:16:01 md127             2.60         0.00        56.40         0.00          0        564          0
18:16:11 md127             2.80         0.00        82.40         0.00          0        824          0
18:16:21 md127             6.10         0.00       150.80         0.00          0       1508          0
18:16:31 md127            86.00        16.80      4978.00         0.00        168      49780          0
18:16:41 md127            48.90        10.80      4505.60         0.00        108      45056          0
18:16:51 md127            36.10         6.80      1498.00         0.00         68      14980          0
18:17:01 md127             3.20         0.40        64.40         0.00          4        644          0
18:17:11 md127             3.40         0.80        84.80         0.00          8        848          0
18:17:21 md127             3.70         0.80       283.60         0.00          8       2836          0
18:17:31 md127            32.90         6.00      1950.00         0.00         60      19500          0
18:17:41 md127           907.50        24.00     15638.40         0.00        240     156384          0
18:17:51 md127           698.00         9.60     10476.40         0.00         96     104764          0
18:18:01 md127           146.80         0.40      4796.00         0.00          4      47960          0
18:18:11 md127           163.00         0.80      1328.80         0.00          8      13288          0
18:18:21 md127           149.50         1.20      1140.00         0.00         12      11400          0
18:18:31 md127            46.50         1.60       545.20         0.00         16       5452          0
18:18:41 md127           335.90        14.40      9770.00         0.00        144      97700          0
18:18:51 md127           119.90         4.40      3868.80         0.00         44      38688          0
18:19:01 md127           315.70        25.60     11422.80         0.00        256     114228          0
18:19:11 md127           285.80         1.20      4142.40         0.00         12      41424          0
18:19:21 md127           239.00         4.40      3888.00         0.00         44      38880          0
18:19:31 md127           116.30         1.20      1752.80         0.00         12      17528          0
18:19:41 md127           166.60         8.00      8886.00         0.00         80      88860          0
18:19:51 md127           357.20         5.60      6146.80         0.00         56      61468          0
18:20:01 md127           118.10         1.20      1670.40         0.00         12      16704          0
18:20:11 md127            66.00         6.40       802.80         0.00         64       8028          0
18:20:21 md127           139.40         4.40      1641.20         0.00         44      16412          0
18:20:31 md127            44.40         2.40       632.40         0.00         24       6324          0
18:20:41 md127           183.50        13.60      4048.40         0.00        136      40484          0
18:20:51 md127           114.40        28.40      8346.40         0.00        284      83464          0
18:21:01 md127           152.20        27.20     15576.00         0.00        272     155760          0
18:21:11 md127            75.30         1.60      1146.40         0.00         16      11464          0
18:21:21 md127           236.20        18.40     13491.20         0.00        184     134912          0
18:21:31 md127           226.70         3.60      3237.60         0.00         36      32376          0
18:21:41 md127           152.70         4.00      2048.00         0.00         40      20480          0
18:21:51 md127            92.10         7.60      1907.20         0.00         76      19072          0
18:22:01 md127           142.50         8.00      1921.60         0.00         80      19216          0
18:22:11 md127           137.30         5.60      2307.60         0.00         56      23076          0
18:22:21 md127           124.10         6.00      1511.20         0.00         60      15112          0
18:22:31 md127           105.90         4.00      1888.80         0.00         40      18888          0
18:22:41 md127           166.30         2.00      2073.20         0.00         20      20732          0
18:22:51 md127           649.90         6.80     10615.20         0.00         68     106152          0
18:23:01 md127          1253.70         8.80    175698.40         0.00         88    1756984          0
18:23:11 md127            88.40         2.00     66840.40         0.00         20     668404          0
18:23:21 md127             0.00         0.00         0.00         0.00          0          0          0

Is writing at a few MB/s reasonable for an array with disks that can top 200MB/s?
Note the burst at the end when a significant chunk of the data was written out.

Now running on 6.5.10 I see a very different story:

         Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
18:45:53 md127             2.60         0.00        10.40         0.00          0        104          0
18:46:03 md127             5.40         0.00        21.60         0.00          0        216          0
18:46:13 md127             5.40         0.00        21.60         0.00          0        216          0
18:46:23 md127             5.30         0.00        21.60         0.00          0        216          0
18:46:33 md127             5.20         0.00        20.80         0.00          0        208          0
18:46:43 md127             5.30         0.00        23.60         0.00          0        236          0
18:46:53 md127             5.10         0.00        44.00         0.00          0        440          0
18:47:03 md127             2.70         0.00        21.60         0.00          0        216          0
18:47:13 md127             5.30         0.00        26.80         0.00          0        268          0
18:47:23 md127             5.40         0.00        25.20         0.00          0        252          0
18:47:33 md127             5.20         0.00        21.60         0.00          0        216          0
18:47:43 md127             5.30         0.00        21.20         0.00          0        212          0
18:47:53 md127             5.30         0.00        22.40         0.00          0        224          0
18:48:03 md127             5.40         0.00        24.00         0.00          0        240          0
18:48:13 md127             2.60         0.00        11.20         0.00          0        112          0
18:48:23 md127             5.30         0.00        22.40         0.00          0        224          0
18:48:33 md127             5.30         0.00        23.20         0.00          0        232          0
18:48:43 md127             5.20         0.00        24.40         0.00          0        244          0
18:48:53 md127             5.30         0.00        22.00         0.00          0        220          0
18:49:03 md127             6.20         3.60        23.20         0.00         36        232          0
18:49:13 md127             3.40         0.00        20.80         0.00          0        208          0
18:49:23 md127             5.20         0.00        20.80         0.00          0        208          0
18:49:33 md127             5.30         0.00        22.00         0.00          0        220          0
18:49:43 md127             5.30         0.00        21.60         0.00          0        216          0
18:49:53 md127             5.20         0.00        33.20         0.00          0        332          0

I expect this to take much longer to complete, possibly hours (20% drained in the last 30m).

In both cases the flusher thread was running at 100% CPU.

I do have the full logs though.

HTH
Comment 23 Rouven 2023-11-14 10:29:44 UTC
My opinion on this is as follows.

a.) This OpenWrt runs on what? Kernel 6.5 would not run well on an older embedded processor.

b.) The controller chip of the HDD might be the problem; if it is not damaged, the driver for the controller may not have all the functions the new kernel version has. If the controller is not damaged, try to install the newest BIOS for that controller, and maybe you need an updated driver.
Comment 24 Eyal Lebedinsky 2023-11-15 04:06:33 UTC
In case it helps.

Did a copy of a small part of my files tree.
     368 MB total size
  19,232 files

Checked how long it takes to write out the dirty blocks:

kernel-6.4.15    0m50s
kernel-6.5.7    20m
kernel-6.5.10   same

In all cases the flusher thread was running at 100% CPU from shortly after the copy.

At the start meminfo showed about 380MB Dirty size.
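
For reference, the test boils down to something like this (paths are placeholders for the subtree I copied):

$ cp -a /some/src/subtree /data1/test-copy
$ grep Dirty /proc/meminfo        # roughly 380 MB dirty right after the copy
$ time sync                       # how long writeback takes to drain it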

About https://bugzilla.kernel.org/show_bug.cgi?id=217965#c23

I do not see how it is relevant to me.

I also doubt a "new kernel version" would be released with such user-visible changes without due notice.
Comment 25 Ojaswin Mujoo 2023-11-15 17:27:57 UTC
Hey Eyal,

Would it be possible for you to try this experiment on an ext4 FS mounted with
the -o nodelalloc option? This should allocate the blocks immediately instead of doing so during writeback.

Further, would you be able to provide the perf data file along with kallsyms or something similar that can be used to resolve the symbols? If not, the following output will also do:

perf report --no-children --stdio -i perf.data
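
If sharing the raw perf.data turns out to be easier, "perf archive" (where available in your perf build) can bundle the build-id symbol files needed to resolve it on another machine:

$ sudo perf archive perf.data     # produces perf.data.tar.bz2 alongside perf.data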

Meanwhile I'll continue to look.

Thanks!
Comment 26 Eyal Lebedinsky 2023-11-15 22:39:39 UTC
Ojaswin,

For my understanding: the (dirty) buffers hold data blocks that need to be written. Clearing dirty buffers does not involve fs activity,
which was already performed during the copy. Is this correct? If so, then why are we talking about ext4?
    I expect that my understanding is incorrect.
The copy itself is very fast, completing in a few seconds.

I assume that the requested mount option is safe. Is it safe to use this way
   $ sudo mount -o remount,nodelalloc /data1
or should I also add the other current options, 'noatime' (from fstab) and 'stripe=640' (automatic)?
or should I reboot with this option in fstab?

Below is what I did now, is this what you need?

$ sudo perf record -p 1366946 sleep 60
[ perf record: Woken up 37 times to write data ]
[ perf record: Captured and wrote 9.079 MB perf.data (237339 samples) ]

$ sudo perf report --no-children --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 237K of event 'cycles:P'
# Event count (approx.): 258097341907
#
# Overhead  Command          Shared Object      Symbol                                    
# ........  ...............  .................  ..........................................
#
    53.90%  kworker/u16:4+f  [kernel.kallsyms]  [k] ext4_mb_good_group
    13.71%  kworker/u16:4+f  [kernel.kallsyms]  [k] ext4_get_group_info
     7.39%  kworker/u16:4+f  [kernel.kallsyms]  [k] ext4_mb_find_good_group_avg_frag_lists
     6.96%  kworker/u16:4+f  [kernel.kallsyms]  [k] __rcu_read_unlock
     5.01%  kworker/u16:4+f  [kernel.kallsyms]  [k] ext4_mb_scan_aligned
     4.51%  kworker/u16:4+f  [kernel.kallsyms]  [k] __rcu_read_lock
     1.70%  kworker/u16:4+f  [kernel.kallsyms]  [k] mb_find_order_for_block
     0.89%  kworker/u16:4+f  [kernel.kallsyms]  [k] xas_descend
     0.85%  kworker/u16:4+f  [kernel.kallsyms]  [k] filemap_get_entry
     0.81%  kworker/u16:4+f  [kernel.kallsyms]  [k] ext4_mb_regular_allocator
     0.77%  kworker/u16:4+f  [kernel.kallsyms]  [k] ext4_mb_load_buddy_gfp
     0.54%  kworker/u16:4+f  [kernel.kallsyms]  [k] xas_load
     0.50%  kworker/u16:4+f  [kernel.kallsyms]  [k] ext4_mb_unload_buddy
     0.42%  kworker/u16:4+f  [kernel.kallsyms]  [k] _raw_read_unlock
     0.40%  kworker/u16:4+f  [kernel.kallsyms]  [k] mb_find_extent
     0.40%  kworker/u16:4+f  [kernel.kallsyms]  [k] _raw_spin_trylock
     0.25%  kworker/u16:4+f  [kernel.kallsyms]  [k] __filemap_get_folio
     0.24%  kworker/u16:4+f  [kernel.kallsyms]  [k] _raw_read_lock
     0.13%  kworker/u16:4+f  [kernel.kallsyms]  [k] xas_start
     0.12%  kworker/u16:4+f  [kernel.kallsyms]  [k] mb_find_buddy
     0.10%  kworker/u16:4+f  [kernel.kallsyms]  [k] pagecache_get_page
     0.08%  kworker/u16:4+f  [kernel.kallsyms]  [k] folio_mark_accessed
     0.05%  kworker/u16:4+f  [kernel.kallsyms]  [k] folio_test_hugetlb
     0.04%  kworker/u16:4+f  [kernel.kallsyms]  [k] _raw_spin_unlock
     0.02%  kworker/u16:4+f  [kernel.kallsyms]  [k] __cond_resched
     0.02%  kworker/u16:4+f  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
     0.01%  kworker/u16:4+f  [kernel.kallsyms]  [k] gen8_irq_handler
     0.01%  kworker/u16:4+f  [kernel.kallsyms]  [k] _raw_spin_lock
     0.01%  kworker/u16:4+f  [kernel.kallsyms]  [k] native_irq_return_iret
     0.01%  kworker/u16:4+f  [kernel.kallsyms]  [k] native_write_msr
     0.01%  kworker/u16:4+f  [kernel.kallsyms]  [k] timekeeping_advance
     0.01%  kworker/u16:4+f  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     0.01%  kworker/u16:4+f  [kernel.kallsyms]  [k] native_read_msr
<then many zero% items>
Comment 27 Ojaswin Mujoo 2023-11-16 03:15:33 UTC
Hey Eyal,

So the way most file systems handle their writes is:

1. Data is written to memory buffers aka pagecache 

2. When writeback/flush kicks in, FS tries to group adjacent data together and
allocates disk blocks for it 

3. Finally, send the data to lower levels like blocks layer -> raid -> scsi etc
for the actual write.

The practice of delaying the actual allocation till writeback/flush is known as
delayed allocation or delalloc in ext4 and is on by default (other FSes might
have different names).  This is why the ext4 allocation related functions
(ext4_mb_regular_allocator etc) show up in your perf report of the flusher
thread.

With delalloc, we are sending bigger requests to the ext4 allocator since we try
to group together buffers. With nodelalloc we disable this so fs block
allocation happens when we are dirtying the buffers (in step 1 above) and we
only allocate as much as that write asked for thus sending smaller requests at
a time. Since with delalloc we see that your flusher seemed to be taking a lot
of time in ext4 allocation routines, I wanted to check if a change in
allocation pattern via nodelalloc could help us narrow down the issue.

Using:

$ sudo mount -o remount,nodelalloc /data1

should be safe and preserve your other mount options so you can give it a try.
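
To double-check that it took effect, and to go back to the default once you are done testing, something like this should do:

$ mount | grep /data1                      # options should now include nodelalloc
$ sudo mount -o remount,delalloc /data1    # restore the default delayed allocation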

Lastly, thanks for the perf report; however, I'm sorry, I forgot to mention that I
was actually looking for the call graph, which can be collected as follows:

$ sudo perf record -p 1234 -g sleep 60

Can you please share the report of the above command.

Thanks!  
Ojaswin
Comment 28 Eyal Lebedinsky 2023-11-16 04:05:00 UTC
$ sudo perf record -p 1531839 -g sleep 60
[ perf record: Woken up 191 times to write data ]
[ perf record: Captured and wrote 47.856 MB perf.data (238391 samples) ]

$ sudo perf report --no-children --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 238K of event 'cycles:P'
# Event count (approx.): 262920477580
#
# Overhead  Command          Shared Object      Symbol                                    
# ........  ...............  .................  ..........................................
#
    58.48%  kworker/u16:5+f  [kernel.kallsyms]  [k] ext4_mb_good_group
            |          
            |--56.07%--ext4_mb_good_group
            |          |          
            |           --55.85%--ext4_mb_find_good_group_avg_frag_lists
            |                     ext4_mb_regular_allocator
            |                     ext4_mb_new_blocks
            |                     ext4_ext_map_blocks
            |                     ext4_map_blocks
            |                     ext4_do_writepages
            |                     ext4_writepages
            |                     do_writepages
            |                     __writeback_single_inode
            |                     writeback_sb_inodes
            |                     __writeback_inodes_wb
            |                     wb_writeback
            |                     wb_workfn
            |                     process_one_work
            |                     worker_thread
            |                     kthread
            |                     ret_from_fork
            |                     ret_from_fork_asm
            |          
             --2.16%--ext4_get_group_info
                       ext4_mb_good_group
                       |          
                        --2.16%--ext4_mb_find_good_group_avg_frag_lists
                                  ext4_mb_regular_allocator
                                  ext4_mb_new_blocks
                                  ext4_ext_map_blocks
                                  ext4_map_blocks
                                  ext4_do_writepages
                                  ext4_writepages
                                  do_writepages
                                  __writeback_single_inode
                                  writeback_sb_inodes
                                  __writeback_inodes_wb
                                  wb_writeback
                                  wb_workfn
                                  process_one_work
                                  worker_thread
                                  kthread
                                  ret_from_fork
                                  ret_from_fork_asm

    15.69%  kworker/u16:5+f  [kernel.kallsyms]  [k] ext4_get_group_info
            |          
             --15.10%--ext4_get_group_info
                       |          
                        --15.05%--ext4_mb_good_group
                                  |          
                                   --15.01%--ext4_mb_find_good_group_avg_frag_lists
                                             ext4_mb_regular_allocator
                                             ext4_mb_new_blocks
                                             ext4_ext_map_blocks
                                             ext4_map_blocks
                                             ext4_do_writepages
                                             ext4_writepages
                                             do_writepages
                                             __writeback_single_inode
                                             writeback_sb_inodes
                                             __writeback_inodes_wb
                                             wb_writeback
                                             wb_workfn
                                             process_one_work
                                             worker_thread
                                             kthread
                                             ret_from_fork
                                             ret_from_fork_asm

     9.51%  kworker/u16:5+f  [kernel.kallsyms]  [k] ext4_mb_find_good_group_avg_frag_lists
            |          
             --9.26%--ext4_mb_find_good_group_avg_frag_lists
                       ext4_mb_regular_allocator
                       ext4_mb_new_blocks
                       ext4_ext_map_blocks
                       ext4_map_blocks
                       ext4_do_writepages
                       ext4_writepages
                       do_writepages
                       __writeback_single_inode
                       writeback_sb_inodes
                       __writeback_inodes_wb
                       wb_writeback
                       wb_workfn
                       process_one_work
                       worker_thread
                       kthread
                       ret_from_fork
                       ret_from_fork_asm

     6.63%  kworker/u16:5+f  [kernel.kallsyms]  [k] __rcu_read_unlock
            |          
            |--4.56%--__rcu_read_unlock
            |          |          
            |           --4.54%--ext4_get_group_info
            |                     |          
            |                      --4.53%--ext4_mb_good_group
            |                                |          
            |                                 --4.51%--ext4_mb_find_good_group_avg_frag_lists
            |                                           ext4_mb_regular_allocator
            |                                           ext4_mb_new_blocks
            |                                           ext4_ext_map_blocks
            |                                           ext4_map_blocks
            |                                           ext4_do_writepages
            |                                           ext4_writepages
            |                                           do_writepages
            |                                           __writeback_single_inode
            |                                           writeback_sb_inodes
            |                                           __writeback_inodes_wb
            |                                           wb_writeback
            |                                           wb_workfn
            |                                           process_one_work
            |                                           worker_thread
            |                                           kthread
            |                                           ret_from_fork
            |                                           ret_from_fork_asm
            |          
             --2.08%--ext4_get_group_info
                       ext4_mb_good_group
                       |          
                        --2.07%--ext4_mb_find_good_group_avg_frag_lists
                                  ext4_mb_regular_allocator
                                  ext4_mb_new_blocks
                                  ext4_ext_map_blocks
                                  ext4_map_blocks
                                  ext4_do_writepages
                                  ext4_writepages
                                  do_writepages
                                  __writeback_single_inode
                                  writeback_sb_inodes
                                  __writeback_inodes_wb
                                  wb_writeback
                                  wb_workfn
                                  process_one_work
                                  worker_thread
                                  kthread
                                  ret_from_fork
                                  ret_from_fork_asm
     4.32%  kworker/u16:5+f  [kernel.kallsyms]  [k] __rcu_read_lock
            |          
             --4.30%--__rcu_read_lock
                       |          
                        --4.29%--ext4_get_group_info
                                  |          
                                   --4.27%--ext4_mb_good_group
                                             |          
                                              --4.26%--ext4_mb_find_good_group_avg_frag_lists
                                                        ext4_mb_regular_allocator
                                                        ext4_mb_new_blocks
                                                        ext4_ext_map_blocks
                                                        ext4_map_blocks
                                                        ext4_do_writepages
                                                        ext4_writepages
                                                        do_writepages
                                                        __writeback_single_inode
                                                        writeback_sb_inodes
                                                        __writeback_inodes_wb
                                                        wb_writeback
                                                        wb_workfn
                                                        process_one_work
                                                        worker_thread
                                                        kthread
                                                        ret_from_fork
                                                        ret_from_fork_asm

     1.98%  kworker/u16:5+f  [kernel.kallsyms]  [k] ext4_mb_scan_aligned
            |          
             --1.98%--ext4_mb_scan_aligned
                       ext4_mb_regular_allocator
                       ext4_mb_new_blocks
                       ext4_ext_map_blocks
                       ext4_map_blocks
                       ext4_do_writepages
                       ext4_writepages
                       do_writepages
                       __writeback_single_inode
                       writeback_sb_inodes
                       __writeback_inodes_wb
                       wb_writeback
                       wb_workfn
                       process_one_work
                       worker_thread
                       kthread
                       ret_from_fork
                       ret_from_fork_asm

     0.50%  kworker/u16:5+f  [kernel.kallsyms]  [k] mb_find_order_for_block
     0.44%  kworker/u16:5+f  [kernel.kallsyms]  [k] ext4_mb_load_buddy_gfp
     0.38%  kworker/u16:5+f  [kernel.kallsyms]  [k] xas_descend
     0.33%  kworker/u16:5+f  [kernel.kallsyms]  [k] ext4_mb_regular_allocator
     0.30%  kworker/u16:5+f  [kernel.kallsyms]  [k] filemap_get_entry
     0.24%  kworker/u16:5+f  [kernel.kallsyms]  [k] xas_load
     0.18%  kworker/u16:5+f  [kernel.kallsyms]  [k] ext4_mb_unload_buddy
     0.16%  kworker/u16:5+f  [kernel.kallsyms]  [k] mb_find_extent
     0.15%  kworker/u16:5+f  [kernel.kallsyms]  [k] _raw_spin_trylock
     0.14%  kworker/u16:5+f  [kernel.kallsyms]  [k] _raw_read_unlock
     0.10%  kworker/u16:5+f  [kernel.kallsyms]  [k] _raw_read_lock
     0.08%  kworker/u16:5+f  [kernel.kallsyms]  [k] __filemap_get_folio
     0.07%  kworker/u16:5+f  [kernel.kallsyms]  [k] mb_find_buddy
     0.05%  kworker/u16:5+f  [kernel.kallsyms]  [k] xas_start
     0.03%  kworker/u16:5+f  [kernel.kallsyms]  [k] pagecache_get_page
     0.03%  kworker/u16:5+f  [kernel.kallsyms]  [k] _raw_spin_unlock
     0.02%  kworker/u16:5+f  [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
     0.02%  kworker/u16:5+f  [kernel.kallsyms]  [k] folio_mark_accessed
     0.02%  kworker/u16:5+f  [kernel.kallsyms]  [k] folio_test_hugetlb
     0.01%  kworker/u16:5+f  [kernel.kallsyms]  [k] native_write_msr
     0.01%  kworker/u16:5+f  [kernel.kallsyms]  [k] _raw_spin_lock
     0.01%  kworker/u16:5+f  [kernel.kallsyms]  [k] __cond_resched
     0.01%  kworker/u16:5+f  [kernel.kallsyms]  [k] timekeeping_advance
     0.01%  kworker/u16:5+f  [kernel.kallsyms]  [k] native_irq_return_iret
     0.00%  kworker/u16:5+f  [kernel.kallsyms]  [k] gen8_irq_handler
Comment 29 Eyal Lebedinsky 2023-11-16 04:31:05 UTC
I now repeated the test with the new option.

The result is that very few dirty buffers are recorded, but the rsync that usually takes a few seconds now stalls for minutes at a time.

The machine also stutters, for example when I open a small file on /data1.

The following perf run is of the rsync.

$ sudo sysctl vm.drop_caches=3
$ sudo mount -o remount,nodelalloc /data1
$ (d="`date`" ; sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/13/file.1.data /data1/no-backup/old-backups/tapes-again/13/ ; echo "`date`: started $d")
$ sudo perf record -p 1531839 -g sleep 60
$ sudo perf report --no-children --stdio -i perf.data

I will add the logs as an attachment (I hope).
Comment 30 Eyal Lebedinsky 2023-11-16 04:33:35 UTC
Created attachment 305411 [details]
perf lest logs

perf log for rsync; see comment #29.
Comment 31 Ojaswin Mujoo 2023-11-16 19:33:44 UTC
Hey Eyal,

Thanks for running the tests. So it definitely feels like there is something going on on the ext4 allocator side. Ideally we shouldn't be taking this much time finding blocks in the FS, especially since the older kernels complete it quickly, so it can't be a low-free-space issue.

I spent a good amount of time today trying to replicate it, to no avail, hence I'll have to ask you for some more data. Thanks to the perf report you've shared, I have an idea of some kernel areas we can probe to understand what's truly going on. I'll share the perf probe and perf record commands below. Please let me know if it's not possible to run these on the affected distro, so we can think of some other way :)

** Adding probe points **

sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"

sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr iter->bb_group"

sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len ac->ac_2order"

** Recording using the probe point **

sudo perf record -e probe:ext4_mb_find_good_group_avg_frag_lists -e probe:ext4_mb_find_good_group_avg_frag_lists_L18 -e probe:ext4_mb_good_group_L20 -e ext4:ext4_mballoc_alloc -g -p <pid> sleep 20

** Saving the output **

sudo perf script -i perf.data > perf-probe.log
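
** Cleaning up (optional) **

Once the data is captured, the probes can be removed again; a sketch (perf probe's --del accepts the probe names/globs added above):

sudo perf probe --del 'probe:ext4_mb_*'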

I tried these commands on the mainline kernel, but I hope they've not changed too much for your kernel. You *might* need to change the line numbers of the probes if that's the case, and the event names in the record command would then change accordingly. Please let me know if you need any more info from my end to get this done; I'd be glad to help.

Thanks again and regards,
ojaswin
Comment 32 Eyal Lebedinsky 2023-11-17 00:01:28 UTC
Looks like a basic misunderstanding on my side, or maybe more preparation is needed? What am I missing?

$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"
Failed to find the path for the kernel: Invalid ELF file
  Error: Failed to add events.

$ uname -a
Linux e7.eyal.emu.id.au 6.5.10-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov  2 19:59:55 UTC 2023 x86_64 GNU/Linux

$ ls -l /boot/*6.5.10*
-rw------- 1 root root  8727059 Nov  2 11:00 /boot/System.map-6.5.10-200.fc38.x86_64
-rw-r--r-- 1 root root   264245 Nov  2 11:00 /boot/config-6.5.10-200.fc38.x86_64
-rw------- 1 root root 39368262 Nov  9 10:55 /boot/initramfs-6.5.10-200.fc38.x86_64.img
lrwxrwxrwx 1 root root       46 Nov  9 10:55 /boot/symvers-6.5.10-200.fc38.x86_64.xz -> /lib/modules/6.5.10-200.fc38.x86_64/symvers.xz
-rwxr-xr-x 1 root root 14551752 Nov  2 11:00 /boot/vmlinuz-6.5.10-200.fc38.x86_64

I strace'd the command, and toward the end I see this (I have vmlinuz, not vmlinux - related?).
These are the first failed openat() calls in the trace.

openat(AT_FDCWD, "vmlinux", O_RDONLY)   = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/boot/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/boot/vmlinux-6.5.10-200.fc38.x86_64", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/boot/vmlinux-6.5.10-200.fc38.x86_64", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/modules/6.5.10-200.fc38.x86_64/build/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/lib/modules/6.5.10-200.fc38.x86_64/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/boot/vmlinux-6.5.10-200.fc38.x86_64.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/root/.debug/.build-id/d3/6a8ae40f88e310401726e275aa1940b279babd", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
openat(AT_FDCWD, "/root/.debug/.build-id/d3/6a8ae40f88e310401726e275aa1940b279babd/kallsyms", O_RDONLY) = 3

Earlier I see this open success:

newfstatat(AT_FDCWD, "/lib/modules/6.5.10-200.fc38.x86_64/vmlinuz", {st_mode=S_IFREG|0755, st_size=14551752, ...}, 0) = 0

which is identical to the one in /boot.

Should I boot a different kernel type?

Remember that I am not a dev...
Comment 33 Ojaswin Mujoo 2023-11-17 00:44:38 UTC
Hey Eyal,

Yes, you are right: it is not able to find a vmlinux with debug symbols. You can either install the debuginfo package for the corresponding kernel, which should place the vmlinux in the required location, or simply put a "vmlinux" with debug symbols in the same directory where you are running perf probe.

Also, once this error is resolved you *might* encounter something like "Error: Failed to show lines." To fix it you can install the kernel source code package, or just use the -s flag to point to the source code. For example, this is how I use it:

sudo perf probe -s ~/workspace/9p -a "ext4_mb_find_good_group_avg_frag_lists order"
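
On Fedora, for example, the debuginfo for the running kernel can usually be pulled in like this (a sketch; package naming may vary by release):

sudo dnf debuginfo-install kernel-$(uname -r)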

Let me know if it still doesn't work.

Thanks!
Ojaswin
Comment 34 Eyal Lebedinsky 2023-11-17 01:39:21 UTC
Created attachment 305414 [details]
detailed perf log

Attachment for comment #34
Comment 35 Eyal Lebedinsky 2023-11-17 01:40:56 UTC
$ sudo dnf debuginfo-install kernel
Installing:
 kernel-debuginfo
Installing dependencies:
 kernel-debuginfo-common-x86_64

$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"

$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr iter->bb_group"

$ sudo perf record -e probe:ext4_mb_find_good_group_avg_frag_lists -e probe:ext4_mb_find_good_group_avg_frag_lists_L18 -e probe:ext4_mb_good_group_L20 -e ext4:ext4_mballoc_alloc -g -p 2789843 sleep 20
[ perf record: Woken up 1395 times to write data ]
Warning:
Processed 24640612 events and lost 22 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 4725.791 MB perf.data (19187380 samples) ]

$ sudo perf script -i perf.data > perf-probe.log
Warning:
Processed 24640612 events and lost 22 chunks!

Check IO/CPU overload!

$ ls -l perf-probe.log
-rw-r--r-- 1 eyal eyal 39024390135 Nov 17 12:31 perf-probe.log

I am limited to uploading much less than this as an attachment
(file size limit: 5120 KB).

I managed to include only this many lines from the head:

$ wc -l perf-probe-head.log
51594 perf-probe-head.log

See attachment https://bugzilla.kernel.org/attachment.cgi?id=305414
Comment 36 Ojaswin Mujoo 2023-11-17 15:39:32 UTC
Hey Eyal,

So the trace data has given me an idea of what's going on. Basically, in ext4 we maintain lists of FS block groups (BGs), where each list holds BGs based on the order of their free blocks (a BG with 64 free blocks goes in the order 6 list, one with 640 free blocks goes in the order 9 list, etc.). In our case, we are trying to allocate a stripe's worth of blocks at a time, i.e. 640 blocks or roughly 2.5 MB, and ext4 looks at the order 9 list to find a BG fit to satisfy our request.

Unfortunately there seem to be a lot of BGs in the order 9 list (> 1000), but most of them don't have enough free blocks to satisfy the request, so we keep looping and calling ext4_mb_good_group() on each of them to see if any is good enough. Once we do find a good enough BG, due to striping we actually look for blocks which are specifically aligned to the stripe size, and once we don't find any we just start looping through the list again from the beginning (!!).
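
If anyone wants to eyeball how full and fragmented the block groups are while this happens, ext4 exposes a per-group summary; a sketch (the md127 device name is from this thread, and the file may not exist on every build):

cat /proc/fs/ext4/md127/mb_groups | head -20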
  
Although I have a good idea now, I'm not able to point my finger at the exact change in 6.5 that might have caused this. We did change the allocator to some extent, and it might be related to this, but we need to dig a bit deeper to confirm.

Would it be possible to share the same perf record again? This time I'm adding a few more probes and removing -g so we can fit more within the 5 MB limit, and I'm also including the commands for Linux 6.4 so we can compare what's changed:

Linux 6.5+:

Probe adding commands:

sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"
sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr iter->bb_group"
sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len ac->ac_2order"
sudo perf probe -a "ext4_mb_scan_aligned:26 i max"

Record command:

perf record -e probe:ext4_mb_find_good_group_avg_frag_lists_L18 -e probe:ext4_mb_good_group_L20 -e probe:ext4_mb_find_good_group_avg_frag_lists -e probe:ext4_mb_scan_aligned_L26 -e ext4:ext4_mballoc_alloc -p <pid> sleep 20


Linux 6.4.x:

Probe adding commands:

sudo perf probe -a "ext4_mb_choose_next_group_cr1:25 i iter->bb_group"
sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len ac->ac_2order"
sudo perf probe -a "ext4_mb_scan_aligned:26 i max"

Record command:

sudo perf record -e probe:ext4_mb_choose_next_group_cr1_L25 -e probe:ext4_mb_good_group_L20 -e probe:ext4_mb_scan_aligned_L26 -e ext4:ext4_mballoc_alloc -p <pid> sleep 20

Thanks again for all your help on this!

Regards,
ojaswin
Comment 37 Eyal Lebedinsky 2023-11-17 23:52:17 UTC
Finally got it.

Kernel 6.5 was easy. The kernel 6.4 debuginfo could not be found with 'dnf' and I searched a long time for it, finally locating it on koji.

Then the third probe was wrong, and after some learning I discovered
    $ sudo perf probe --line "ext4_mb_scan_aligned"
     21         while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
     22                 if (!mb_test_bit(i, bitmap)) {
     23                         max = mb_find_extent(e4b, i, sbi->s_stripe, &ex);
     24                         if (max >= sbi->s_stripe) {
     25                                 ac->ac_found++;
     26                                 ex.fe_logical = 0xDEADF00D; /* debug value */
                                        ac->ac_b_ex = ex;
     28                                 ext4_mb_use_best_found(ac, e4b);
     29                                 break;
                                }
                        }
     32                 i += sbi->s_stripe;
                }
 
and decided to pick line 24 (not the original 26). Is this right?

I will now upload the reports.
Comment 38 Eyal Lebedinsky 2023-11-17 23:53:47 UTC
Created attachment 305417 [details]
more perf reports

for comment #37
Comment 39 Eyal Lebedinsky 2023-11-17 23:57:18 UTC
Created attachment 305418 [details]
more perf reports

for comment #37
Comment 40 Ojaswin Mujoo 2023-11-18 12:10:04 UTC
Hey Eyal,

Thanks for the data, the perf probes you added were correct!

I see that the problem is as I suspected: we keep trying to find aligned blocks in ext4 when probably none exist. Aligned allocation is currently only done, as an optimization, when the stripe mount option is passed. We don't seem to fall back to normal allocation if aligned allocation doesn't work, and this causes the very long, seemingly infinite looping.

I can try to work on a patchset that fixes this; however, as a temporary fix you can continue with the stripe mount option turned off for ext4. This will instruct ext4 to just use normal allocation rather than aligned allocation.

One point to note is that -o stripe=xyz is sometimes automatically added during mount even when we don't pass it. You can look at comments #6, #7 and #8 in this bug for more info. To confirm it's off, you can look at the /proc/fs/ext4/<dev>/options file, which lists all the currently active mount options; you shouldn't see stripe there.
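
For example, something along these lines should show whether a stripe size is in effect (the device name is just an illustration):

grep stripe /proc/fs/ext4/md127/options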

Further, this is not something that changed between 6.4 and 6.5; however, it seems the allocator changes in 6.5 made it even more difficult to come out of this loop, thus prolonging the time taken to flush.

Also, I just wanted to check if you have any non-production setup where you'd be open to compiling a kernel with patches, to see if we are able to fix the issue.

Lastly, thank you so much for all the probe data and logs, it's been a huge help :)

Regards,
ojaswin
Comment 41 Eyal Lebedinsky 2023-11-18 13:17:59 UTC
OK, so how do I disable the stripe=640 that was automatically added?
Is there a mount option to disable it?
I would rather not futz with tune2fs.

RE kernel build: I used to do this often, maybe 30 years ago, but on a workstation, not on the server. I expect the build to be simple, but it is the risk of a bad patch that worries me. I would rather not do that.

BTW, while you are looking at the cache issue, what are your thoughts on the fact that the array shows (in iostat -x) a very high w_await while the member disks show much lower values? Is that normal?

         Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
16:04:08 md127            0.20      0.80     0.00   0.00   10.50     4.00    9.80    739.20     0.00   0.00 1495.44    75.43    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00   14.66   4.51
16:04:08 sdb              2.30    187.20    44.50  95.09   30.74    81.39    6.80    345.20    79.70  92.14   29.72    50.76    0.00      0.00     0.00   0.00    0.00     0.00    0.60   14.33    0.28   1.26
16:04:08 sdc              1.90    138.80    32.80  94.52   16.63    73.05    6.60    295.60    67.50  91.09   13.06    44.79    0.00      0.00     0.00   0.00    0.00     0.00    0.60   11.33    0.12   1.05
16:04:08 sdd              2.30    194.40    46.30  95.27    3.70    84.52    2.90    138.80    32.00  91.69    3.34    47.86    0.00      0.00     0.00   0.00    0.00     0.00    0.60    4.67    0.02   0.54
16:04:08 sde              1.80    204.00    49.20  96.47   14.89   113.33    4.10    302.40    71.70  94.59   13.27    73.76    0.00      0.00     0.00   0.00    0.00     0.00    0.60    6.17    0.08   1.66
16:04:08 sdf              1.90     97.20    22.40  92.18   14.63    51.16    4.70    212.00    48.50  91.17   19.70    45.11    0.00      0.00     0.00   0.00    0.00     0.00    0.60   11.50    0.13   1.98
16:04:08 sdg              2.00    212.00    51.00  96.23    7.80   106.00    4.70    279.20    65.30  93.29    8.32    59.40    0.00      0.00     0.00   0.00    0.00     0.00    0.60    8.00    0.06   1.86
16:04:08 sdh              2.20    213.20    51.10  95.87   21.73    96.91    4.80    315.60    74.30  93.93   27.77    65.75    0.00      0.00     0.00   0.00    0.00     0.00    0.60   14.33    0.19   2.05
Comment 42 Ojaswin Mujoo 2023-11-20 06:11:04 UTC
Hey Eyal,

stripe size can be set to 0 by remounting the FS as follows:

mount -o remount,stripe=0 <dev>

This should allow us to bypass aligned allocation.

Regarding building the kernel, sure I understand your concerns.

Regarding the utilization of the RAID array, I don't think I have enough experience with RAID devices to say whether it is normal behavior or not. Still, I'll try to look into it a bit more to see if I can think of something.

Regards,
ojaswin
Comment 43 Eyal Lebedinsky 2023-11-20 08:48:13 UTC
Thanks Ojaswin,

The mount option did the trick, things are drastically faster now.
I am now finishing the actual copy that I was unable to complete.

The reason I ask about the raid is that it writes so slowly that I wonder whether it
is just the looping thread or maybe there is something more.
The large difference between what the array reports and what the members show (in iostat -x) is beyond what I expected from the array's write amplification.

I will keep watching this thread.

Regards,
  Eyal
Comment 44 Eduard Kohler 2023-11-27 17:31:53 UTC
Hi Ojaswin,

I'm still interested in a patch for this. I may find a way to give it a try. This aligned allocation presumably works for most people, otherwise more of them would be here.
Also, as the FS is automagically mounted with a stripe value (based on the RAID info), it would be nice to avoid future misbehaviours like this.
Comment 45 Ojaswin Mujoo 2023-12-03 07:33:50 UTC
Hi Eduard,

Sure. I'm on vacation this week; before that I found some time to try to replicate this in a simulated environment, and I will send a patch once I'm back.

Regards
Ojaswin
Comment 46 Ojaswin Mujoo 2023-12-15 11:27:29 UTC
Hello,

I've posted a patch that, as per my understanding of the issue, should be able to fix the halt in kernels above 6.5.

As I was not able to completely replicate the issue, I relied on the probes and other logs collected to root-cause it, so I'd really appreciate it if the folks here could help test this on the setups where the issue was being observed.

Link to patch: https://lore.kernel.org/linux-ext4/cover.1702455010.git.ojaswin@linux.ibm.com/T/#t

Regards,
ojaswin
Comment 47 carlos 2023-12-23 00:56:55 UTC
I've posted an easy reproducer here:
https://marc.info/?l=linux-raid&m=170327844709957&w=2

You can adjust the vm settings to make sure the whole tree fits in the RAM cache, depending on the amount of RAM in your machine.
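
(My assumption is that the "vm settings" meant here are the dirty-writeback thresholds; a sketch with purely illustrative values:)

sudo sysctl vm.dirty_background_ratio=80 vm.dirty_ratio=90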

By default the FS is mounted with stripe=1280 because it's on a raid6. Remounting with stripe=0 works around the problem. Excellent!

Also, this affects all 6.* kernels, as I said in the above post.
Comment 48 Andreas Dilger 2023-12-23 01:48:39 UTC
Independent of the fixes to the mballoc code to improve the allocation performance, I'm wondering about the "RAID stride" values in use here.
The "stride" value is intended to be the size of one complete set of
disks (e.g. 128KiB chunk size * 8 data disks = 1MiB).  The filesystem
doesn't see the parity disks, so the number of those disks does not
matter to ext4.


    > RAID stride:              32752
    
    > In my case, I have an EXT4 partition over an mdadm raid 1 array of two HDD.
    > RAID stride:              32745


It seems in all these cases that the stripe/stride is strange.  I can't
see any value to setting stride to (almost) 128MB, especially not on a
RAID-1 system.   Were these values automatically generated by mke2fs,
or entered manually?  If manually, why was that value chosen?   If there
is something unclear in the documentation it should be fixed, and the
same if there is something wrong in mke2fs detection of the geometry.

    > By default the FS is mounted with stripe=1280 because it's on a raid6.
    > Remounting with stripe=0 works around the problem. Excellent!

Carlos, how many data disks in this system?  Do you have 5x 256KiB or
10x 128KiB *data* disks, plus 2 *parity* disks?
Comment 49 Eduard Kohler 2023-12-23 12:46:30 UTC
Hi Ojaswin,

I gave your patch a try and indeed it fixes the issue.
I first compiled a 'vanilla' 6.5.13 Fedora kernel (latest available on F37) and tested it, just to have a baseline. As expected, the issue kicked in.
I then added your patch, recompiled, and the issue was gone.

Andreas, I am positive that I never set/changed any stride value on my own, mostly because I have no idea what it is and how it interacts with the FS or RAID. I don't know whether it can be changed or is set only once. So it was chosen by the tool I used to create the partition, the raid, or the ext4 FS.
Comment 50 Eduard Kohler 2023-12-23 14:22:16 UTC
After Andreas' comment, I took a look at the RAID wiki and changed the stripe/stride values:

tune2fs -E stride=32,stripe-width=32 /dev/md0p1

which may or may not be optimal for a RAID1.

I am running a 6.5.12 kernel (the Fedora release, not the one I compiled earlier),
and it also makes the issue go away.
Comment 51 carlos 2023-12-23 15:47:13 UTC
(In reply to Andreas Dilger from comment #48)
>     > By default the FS is mounted with stripe=1280 because it's on a raid6.
>     > Remounting with stripe=0 works around the problem. Excellent!
> 
> Carlos, how many data disks in this system?  Do you have 5x 256KiB or
> 10x 128KiB *data* disks, plus 2 *parity* disks?

There are 10 data disks. The stride, determined automatically by mke2fs, is
correct since the chunk is 512KiB.
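
For reference, with those numbers the usual derivation works out as follows (a sketch, assuming 4 KiB filesystem blocks and mirroring the tune2fs form used in comment #50; the device path is an example):

# stride       = RAID chunk size / FS block size = 512 KiB / 4 KiB = 128
# stripe-width = stride * data disks             = 128 * 10        = 1280
tune2fs -E stride=128,stripe-width=1280 /dev/md0

which matches the stripe=1280 that the filesystem is mounted with.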
Comment 52 Matthew Stapleton 2023-12-29 20:35:19 UTC
I'm a bit late reporting this, but it might be related to this bug report.

Since noticing this 100% CPU flush problem on my computer with a non-raid Samsung SSD 990 PRO NVMe and kernel 6.6.3 (and now 6.6.8), I decided to do a git bisect and seem to have tracked the problem down to commit 7e170922f06bf46effa7c57f6035fc463d6edc7e, if that helps, although I haven't tested whether disabling CR1_5 also resolves the problem.  Running "tar xf linux-6.6.8.tar.xz" is enough to trigger 100% CPU for a few seconds after tar is finished on my system, although it doesn't noticeably trigger the problem on my server with raid 6 and hard drives.  It also looks like my ext4 filesystem is mounted with stripe=32733 according to /proc/mounts, although dumpe2fs -h doesn't show any stripe settings on the filesystem.
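
For anyone wanting to repeat this kind of bisect, the basic flow is roughly the following (a sketch; the good/bad versions are only examples based on this report):

git bisect start
git bisect bad v6.6.3       # first version where the 100% CPU flush was seen
git bisect good v6.4        # last version known to behave
# build and boot the commit git suggests, run the tar test, then mark it:
git bisect good             # or: git bisect bad
# repeat until git names the first bad commit, then:
git bisect reset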

I'll try the patch and see if it fixes the problem.
Comment 53 Matthew Stapleton 2023-12-29 20:51:41 UTC
Oh, I just rechecked and I do have RAID stride set to 32733 on my filesystem, which I think I did to optimise for SSD blocks.  Also, another test which makes the flush take a really long time to complete on my system is "cp -ai linux-6.6.8 linux-6.6.8_orig".
Comment 54 Matthew Stapleton 2023-12-29 21:05:56 UTC
(In reply to Ojaswin Mujoo from comment #46)

It looks like that patch has fixed the problem for my system.  Thanks.  I still haven't checked if disabling CR1_5 from 7e170922f06bf46effa7c57f6035fc463d6edc7e also resolves the problem, although I'm still planning to try that out as well.
Comment 55 Matthew Stapleton 2023-12-29 22:11:01 UTC
It looks like reversing the changes made in 7e170922f06bf46effa7c57f6035fc463d6edc7e also fixed the problem (I didn't notice before that the names were changed in the very next commit as well).  

In fs/ext4/mballoc.c:
Commenting out new_cr = CR_GOAL_LEN_SLOW in ext4_mb_choose_next_group_goal_fast wasn't enough to fix it, so I then did the following as well, which fixed the problem:
- commented out the call to ext4_mb_choose_next_group_best_avail in ext4_mb_choose_next_group
- commented out the code to "Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW"
Comment 56 Matthew Stapleton 2023-12-29 22:13:09 UTC
Correction: it was new_cr = CR_BEST_AVAIL_LEN that I commented out.
Comment 57 Matthew Stapleton 2023-12-29 23:07:29 UTC
Also, it's a bit hard to tell, but even with the patch from https://lore.kernel.org/linux-ext4/cover.1702455010.git.ojaswin@linux.ibm.com/T/#t , I think flush might still be using more cpu compared to when I disable ext4_mb_choose_next_group_best_avail even though it isn't stalling anymore.
Comment 58 Matthew Stapleton 2023-12-29 23:22:38 UTC
Maybe it's just because different code is being executed, but running the tests a few more times, top does seem to show less kernel CPU usage while extracting thunderbird-115.5.0.source.tar.xz (https://archive.mozilla.org/pub/thunderbird/releases/115.5.0/source/thunderbird-115.5.0.source.tar.xz) with CR_BEST_AVAIL_LEN disabled, even though the actual extract operation takes the same amount of time with both patch methods and all the CPU usage goes away in both cases once tar is finished.
Comment 59 Ojaswin Mujoo 2024-01-02 04:44:08 UTC
Hi Matthew,

Thanks for testing out the patch and for sharing the results with and without CR_BEST_AVAIL_LEN commented out. Would it be possible for you to add the perf probes as mentioned in comment 36 and share the result? That'd be helpful in understanding why CR_BEST_AVAIL_LEN is getting stuck. Basically, I am interested in seeing how the scan happens in CR_GOAL_LEN_FAST vs CR_BEST_AVAIL_LEN.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217965#c36

Also, another thing I'd request you to try is, instead of commenting out the code can you run:

$ echo "0" > /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order

and rerun the test. This disables most of the code but keeps some of the initial logic. It might help narrow down whether the issue is in the initial logic or in the disabled code.
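
(If it helps, the current value can be noted first and restored afterwards; md127 is just an example device name:)

cat /sys/fs/ext4/md127/mb_best_avail_max_trim_order       # note the current value
echo 0 | sudo tee /sys/fs/ext4/md127/mb_best_avail_max_trim_order
# ...run the test, then put the old value back, e.g.:
echo 3 | sudo tee /sys/fs/ext4/md127/mb_best_avail_max_trim_order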

Thanks again,
ojaswin
Comment 60 Matthew Stapleton 2024-01-02 06:44:15 UTC
It looks like echo "0" > /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order stops flush from getting stuck, but the flush kworker still uses more cpu during the write operation.  Hopefully I can get the perf results over the next few days.
Comment 61 Ojaswin Mujoo 2024-01-04 05:38:36 UTC
Hi Matthew, thanks for confirming. So, as pointed out in comment 9 [1], for that reporter the above steps to disable the CR_BEST_AVAIL_LEN code did not fix the issue. My suspicion is that this issue can occur in either the CR_GOAL_LEN_FAST or the CR_BEST_AVAIL_LEN lookup, depending on the block groups being searched. Probably for you it was occurring during the CR_BEST_AVAIL_LEN lookup, and hence disabling that code fixed it.

Further, as Carlos pointed out above, they are able to see this in all 6.* kernels, which means this was happening before CR_BEST_AVAIL_LEN was introduced; however, it seems to be much easier to trigger on 6.5+ kernels. Now, assuming the above theory is correct, in cases where this is triggered from CR_GOAL_LEN_FAST it should happen easily on pre-6.5 kernels as well, but it doesn't, which makes me think there might be some other related changes in 6.5 that make it easier to trigger.

I'll try to play around a bit more with this. Also, as for the higher CPU usage, how high are we talking about? CR_BEST_AVAIL_LEN does add some extra cycles in exchange for generally faster allocation on fragmented filesystems; however, since you have disabled it, we shouldn't ideally be seeing that. Also, does the CPU utilization consistently drop when you comment out that code?

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217965#c9
Comment 62 Matthew Stapleton 2024-01-04 22:20:41 UTC
With top at 1-second intervals, during the thunderbird source extract, I can see the flush process at around 70% to 100% for two 1-second intervals during the tar extract.  I don't think this much CPU usage was present with CR_BEST_AVAIL_LEN commented out.  Hopefully the perf probes will show where the real usage is.  I still haven't tried out the perf probes, but I'm still planning on doing that, hopefully sometime in the next few days.
Comment 63 Matthew Stapleton 2024-01-06 02:11:24 UTC
I can't seem to trigger the problem now, even with all the patches disabled and /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order set to the default of 3, so unfortunately I won't be able to help with this problem for now.

According to perf record -ag sleep 5, at the moment most of the ext4 CPU usage is in ext4_get_group_desc and ext4_get_group_info, both called from ext4_mb_prefetch, with some usage in ext4_mb_good_group called from ext4_mb_find_good_group_avg_frag_lists.  Both ext4_mb_prefetch and ext4_mb_find_good_group_avg_frag_lists were called from ext4_mb_regular_allocator.  But this is all fairly low CPU usage at the moment, and it all stops shortly after the file operation is completed.
Comment 64 Matthew Stapleton 2024-01-06 03:18:01 UTC
After patching the ext4_mb_choose_next_group set of functions with noinline_for_stack, perf now includes those in the report (perf report --stdio -i perf.data).  The usage in ext4_mb_regular_allocator is only about 1.39% (compared to 17.59% for ext4_get_group_desc and ext4_get_group_info), with ext4_mb_choose_next_group_goal_fast using 0.74% and ext4_mb_choose_next_group_best_avail using 0.58%, which indicates ext4_mb_choose_next_group_best_avail isn't stalling for me like it was before.
Comment 65 Matthew Stapleton 2024-01-07 02:56:44 UTC
I'm not sure if this will be worth looking into, but regarding the CPU usage, I've done some more testing and "perf report --stdio -i perf.data" shows around 2 to 3 times as much CPU time since the commit that added the CR_BEST_AVAIL_LEN feature: 7e170922f06bf.  With /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order set to 0 it goes a bit lower, but is still at least 2 times the CPU.  With perf record -ag sleep 25, which is the time the extract takes to run, it goes from about 2% to 5%, so fairly small, but when measuring with 5-second intervals this can go from around 5% to 15%.  This CPU usage is concentrated in ext4_get_group_desc and ext4_get_group_info, and within these functions perf picks up more CPU time in __rcu_read_lock and __rcu_read_unlock.
Comment 66 Matteo Italia 2024-02-07 11:44:41 UTC
I think I'm experiencing the same problem: a relatively big (1.5T), relatively empty (716 G used) ext4 partition on a slow-ish SATA magnetic HDD (ST3000DM007-1WY10G), stripe=32747, 6.5.0-14-generic kernel from the Ubuntu 22.04 HWE stack. During the installation of COMSOL Multiphysics 6.2 (which essentially untars ~16 GB of stuff) I got a kworker flush process stuck at 100% CPU for multiple hours, with the IO cache filling my 32 GB of RAM. I tried the various parameter tweaks suggested here with no particular change of pace (although maybe the pending IO still proceeded according to the old values?). I run the bfq scheduler on this disk, but I tried switching to pretty much all other schedulers mid-process with no perceptible change.

After ~3 hours of agonizingly slow progress (with disk write performance for the installation often in the tens of KB/s) I decided to abort the installation. I rebooted into the Ubuntu GA kernel (5.15.0-92-generic) without touching any specific parameter and restarted the installation; this time it completed in ~10 minutes.
Comment 67 denk 2024-03-12 20:27:43 UTC
Hi,

I just stumbled upon this issue, which seems to be the same one I am suffering from on a Raspberry Pi 5. The OS is Arch Linux ARM with linux-rpi 6.6.20, running on an NVMe SSD. After reading this issue, I checked the ext4 mount options and found stripe=8191. As a first test I downgraded linux-rpi to 6.1.69 and my issue of a CPU-hogging kworker process when pulling and creating Docker containers did not happen anymore. With kernel 6.6.20 and all earlier 6.6.x versions, Docker operations were really slow; in some cases pulling and recreating a container took over 1000 seconds. With kernel 6.1.69 these operations are back to normal speed.
As a next step I would like to try to set stripe=0, but I am a bit unsure what the right approach is. Is it ok, for a first test, to remount the running system partition with mount -o remount,stripe=0 <dev> without any risk? And to change to stripe=0 permanently, is it ok and sufficient to run tune2fs -f -E stride=0 <dev> on the mounted system partition without any risk? I would be thankful for a helping hand here, so as to not do anything wrong or dangerous.

I hope, that this works around my issue for now, until a fix has been found for this. Thank you very much for looking into this!

Kind regards
denk
Comment 68 Ojaswin Mujoo 2024-03-17 11:58:17 UTC
Hello Denk,

Both of the commands you mentioned "mount -o remount,stripe=0 <dev>" and "tune2fs -f -E stride=0 <dev>" should be okay to run and aren't dangerous for your filesystem. You can always take a backup to be extra cautious ;)

That being said, this issue was fixed by the following patch [1], which I believe landed in Linux kernel 6.7, so anything above that should help you avoid the issue as well.

Regards,
Ojaswin  

[1] https://lore.kernel.org/linux-ext4/cover.1702455010.git.ojaswin@linux.ibm.com
Comment 69 Eyal Lebedinsky 2024-03-18 04:45:26 UTC
Seeing this last comment (#68), I wondered if the fix is included in my latest kernel. Or do I have a different problem?

After just updating my fedora 38 to kernel 6.7.9 I did NOT do the remount and repeated my original problematic test.
    $ uname -a
    Linux e7.eyal.emu.id.au 6.7.9-100.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar  6 19:31:16 UTC 2024 x86_64 GNU/Linux
    $ mount|grep data1
    /dev/md127 on /data1 type ext4 (rw,noatime,stripe=640)

I rsync a directory (262GB, 8,911,139 files) from an SSD to a raid6 (7x12TB => 60TB fs). Both are ext4.
    $ sudo time rsync -aHSK --stats --progress /data/no-backup/old-backups/tapes /data1/no-backup/really-not/old-backups/

What I see is that once meminfo shows the limit (4GB) was reached, the kworker kicks in at 100% CPU.
At that point iostat shows activity on the array dropping, from about 160MB/s to very little (with an occasional burst of a few MB/s).

```
10:12:33 2024-03-18
10:12:33 Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
[trimmed]
10:17:13 md127             0.00         0.00         0.00         0.00          0          0          0
10:17:23 md127            34.40       137.60         0.00         0.00       1376          0          0  <<< rsync starts
10:17:33 md127          2346.30       747.20      8638.00         0.00       7472      86380          0
10:17:43 md127          7067.10       431.60    133644.40         0.00       4316    1336444          0
10:17:53 md127          1692.80       578.80      7015.20         0.00       5788      70152          0
10:18:03 md127          2439.20       169.60     32071.20         0.00       1696     320712          0
10:18:13 md127           274.00         4.00      2242.00         0.00         40      22420          0
10:18:23 md127          3172.70        17.60     56828.00         0.00        176     568280          0
10:18:33 md127           416.20         0.80      1664.80         0.00          8      16648          0
10:18:43 md127            18.70         0.40        76.40         0.00          4        764          0
10:18:53 md127             6.50         0.00        30.80         0.00          0        308          0
10:19:03 md127             4.80         0.00        40.00         0.00          0        400          0
10:19:13 md127             5.70         0.00        63.60         0.00          0        636          0
10:19:23 md127             2.60         0.00        54.80         0.00          0        548          0
10:19:33 md127             7.40         0.00       243.20         0.00          0       2432          0
10:19:43 md127             5.20         0.00        75.60         0.00          0        756          0
10:19:53 md127             3.80         0.00        20.40         0.00          0        204          0
10:20:03 md127             2.00         0.00        13.20         0.00          0        132          0
10:20:13 md127             3.90         0.00        29.20         0.00          0        292          0
10:20:23 md127             3.80         0.00        19.60         0.00          0        196          0

At the same time meminfo shows:
2024-03-18 10:17:04 Dirty:     11220 kB  Buffers:    829988 kB  MemFree:    670576 kB
2024-03-18 10:17:14 Dirty:     10784 kB  Buffers:    830016 kB  MemFree:    631500 kB
2024-03-18 10:17:24 Dirty:    750616 kB  Buffers:    875592 kB  MemFree:    654236 kB
2024-03-18 10:17:34 Dirty:   2757048 kB  Buffers:    972948 kB  MemFree:    600636 kB
2024-03-18 10:17:44 Dirty:   2855196 kB  Buffers:   1046736 kB  MemFree:    551940 kB
2024-03-18 10:17:54 Dirty:   4104524 kB  Buffers:   1127200 kB  MemFree:    538136 kB
2024-03-18 10:18:04 Dirty:   4390504 kB  Buffers:   1155588 kB  MemFree:    600828 kB
2024-03-18 10:18:14 Dirty:   4518280 kB  Buffers:   1161916 kB  MemFree:    580176 kB
2024-03-18 10:18:24 Dirty:   4356952 kB  Buffers:   1185872 kB  MemFree:    543072 kB
2024-03-18 10:18:34 Dirty:   4559504 kB  Buffers:   1196396 kB  MemFree:    518872 kB
2024-03-18 10:18:44 Dirty:   4567212 kB  Buffers:   1197060 kB  MemFree:    606572 kB
2024-03-18 10:18:54 Dirty:   4567592 kB  Buffers:   1197084 kB  MemFree:    611440 kB
... and stays there until ...

I then killed the copy (14,296MB copied). The writing to the array remained very low, the kernel thread stayed at 100%
and meminfo drained very slowly. Access to the array is now slow with some hiccups.

2024-03-18 10:35:24 Dirty:   4484720 kB  Buffers:   4984308 kB  MemFree:    820532 kB   <<< rsync killed
2024-03-18 10:35:34 Dirty:   4484436 kB  Buffers:   4984348 kB  MemFree:    851288 kB
2024-03-18 10:35:44 Dirty:   4483992 kB  Buffers:   4984368 kB  MemFree:    817516 kB
2024-03-18 10:35:54 Dirty:   4483780 kB  Buffers:   4984400 kB  MemFree:    803156 kB
2024-03-18 10:36:04 Dirty:   4483704 kB  Buffers:   4984460 kB  MemFree:    809956 kB
2024-03-18 10:36:14 Dirty:   4479416 kB  Buffers:   4984496 kB  MemFree:    832980 kB
2024-03-18 10:36:24 Dirty:   4474312 kB  Buffers:   4984528 kB  MemFree:    881464 kB
2024-03-18 10:36:34 Dirty:   4474260 kB  Buffers:   4984568 kB  MemFree:    840444 kB
2024-03-18 10:36:44 Dirty:   4474132 kB  Buffers:   4984600 kB  MemFree:    843524 kB
2024-03-18 10:36:54 Dirty:   4474292 kB  Buffers:   4984640 kB  MemFree:    841004 kB
2024-03-18 10:37:04 Dirty:   4474052 kB  Buffers:   4984680 kB  MemFree:    834148 kB
2024-03-18 10:37:14 Dirty:   4473688 kB  Buffers:   4984712 kB  MemFree:    853200 kB
2024-03-18 10:37:24 Dirty:   4473448 kB  Buffers:   4984752 kB  MemFree:    782540 kB
2024-03-18 10:37:34 Dirty:   4473288 kB  Buffers:   4984776 kB  MemFree:    786100 kB
2024-03-18 10:37:44 Dirty:   3871768 kB  Buffers:   4984972 kB  MemFree:    846020 kB
2024-03-18 10:37:54 Dirty:   3871612 kB  Buffers:   4985020 kB  MemFree:    826664 kB
2024-03-18 10:38:04 Dirty:   3871736 kB  Buffers:   4985052 kB  MemFree:    826084 kB
2024-03-18 10:38:14 Dirty:   3871184 kB  Buffers:   4985100 kB  MemFree:    876572 kB
2024-03-18 10:38:24 Dirty:   3870936 kB  Buffers:   4985140 kB  MemFree:    918944 kB
2024-03-18 10:38:34 Dirty:   3648080 kB  Buffers:   4985256 kB  MemFree:    901336 kB
2024-03-18 10:38:44 Dirty:   3556612 kB  Buffers:   4985316 kB  MemFree:    902532 kB
2024-03-18 10:38:54 Dirty:   3551636 kB  Buffers:   4985364 kB  MemFree:    837816 kB
2024-03-18 10:39:04 Dirty:   3551968 kB  Buffers:   4985468 kB  MemFree:    823392 kB
2024-03-18 10:39:14 Dirty:   2835648 kB  Buffers:   4985656 kB  MemFree:    629428 kB
...
2024-03-18 11:05:25 Dirty:   2737096 kB  Buffers:   4993860 kB  MemFree:    599424 kB   <<< 30m later
2024-03-18 11:35:25 Dirty:   2573748 kB  Buffers:   5001184 kB  MemFree:    612288 kB   <<< again
2024-03-18 12:05:26 Dirty:   2432572 kB  Buffers:   5007704 kB  MemFree:    663928 kB   <<< again
2024-03-18 12:35:27 Dirty:   2145348 kB  Buffers:   3707492 kB  MemFree:    588464 kB   <<< again
2024-03-18 13:05:27 Dirty:   2017848 kB  Buffers:   3718936 kB  MemFree:    585500 kB   <<< again
2024-03-18 13:35:28 Dirty:   1822436 kB  Buffers:   3746824 kB  MemFree:    565560 kB   <<< again
2024-03-18 14:05:29 Dirty:   1595088 kB  Buffers:   3799124 kB  MemFree:    544504 kB   <<< again
2024-03-18 14:35:29 Dirty:   1498416 kB  Buffers:   3816868 kB  MemFree:   3883524 kB   <<< again
2024-03-18 15:05:30 Dirty:   1387140 kB  Buffers:   3835824 kB  MemFree:   3266060 kB   <<< again
...
2024-03-18 15:32:51 Dirty:   1284940 kB  Buffers:   3850936 kB  MemFree:   3088904 kB   <<< finally
2024-03-18 15:33:01 Dirty:    933268 kB  Buffers:   3851144 kB  MemFree:   3098840 kB
2024-03-18 15:33:11 Dirty:     51956 kB  Buffers:   3851248 kB  MemFree:   3095456 kB
2024-03-18 15:33:21 Dirty:     51968 kB  Buffers:   3851284 kB  MemFree:   3059212 kB
2024-03-18 15:33:31 Dirty:     52032 kB  Buffers:   3851308 kB  MemFree:   3085352 kB
2024-03-18 15:33:41 Dirty:       172 kB  Buffers:   3851336 kB  MemFree:   3090912 kB
2024-03-18 15:33:51 Dirty:        64 kB  Buffers:   3851368 kB  MemFree:   3030584 kB
```

So over 5 hours to copy this small part (14GB of 262GB) of the data.

Is this expected? Is this a fundamental "feature" of ext4? Or of raid6?
When I did run "sudo mount -o remount,stripe=0 /data1" the copy progressed nicely with good
write speed.

I have logs of the progress of this test at 10s intervals.

Regards
Comment 70 Ojaswin Mujoo 2024-03-20 06:46:18 UTC
Hi Eyal,

My bad, seems like the patch landed in 6.8-rc1 as seen here [1].

[1] https://lore.kernel.org/linux-ext4/20240110203521.GA1013681@mit.edu/

Regards,
ojaswin
Comment 71 denk 2024-03-23 08:53:54 UTC
(In reply to Ojaswin Mujoo from comment #68)

Hello Ojaswin,

Thank you very much! I changed stripe to 0 last weekend with tune2fs -f -E stripe_width=0 <dev>, as tune2fs -f -E stride=0 <dev> did not change stripe to 0. Since then my system is back to normal performance, and so far no further issues.
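
In case it's useful to others, the change can be verified roughly like this (the device path is only an example):

# superblock values recorded by mke2fs/tune2fs
sudo dumpe2fs -h /dev/nvme0n1p2 | grep -Ei 'stride|stripe'
# options currently in effect for the mounted filesystem
grep stripe /proc/fs/ext4/nvme0n1p2/options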
Thank you very much for all the work on ext4!
Best
denk
