Since kernel 6.5.x and 6.6-rc* I'm getting weird kworker flush activity when building OpenWrt from sources:

91 root 20 0 0 0 0 R 99,7 0,0 18:06.57 kworker/u16:4+flush-8:16

The OpenWrt sources reside on a SATA HDD with an ext4 filesystem; I have been using this setup for the last 5 years. The problem is that since the 6.5 kernels, after the OpenWrt kernel patch step of the build (https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=scripts/patch-kernel.sh;h=c2b7e7204952f93946a6075d546cbeae32c2627f;hb=HEAD, which probably involves a lot of copy and write operations), kworker/u16:4+flush-8:16 uses 100% of one core for a while (5-15 minutes), even after I cancel the OpenWrt build. I tried moving the OpenWrt sources folder to the SSD drive where my system resides and running the build from there, and I get no issues with kworker flush CPU usage. I also see no such behavior with 6.4.x and older kernels, so it looks like a regression to me; I'm not sure whether this is a fs, vfs or even block subsystem issue.

This is how it looks in perf:

Samples: 320K of event 'cycles:P', Event count (approx.): 363448649248
Children Self Command Shared Object Symbol
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] ret_from_fork_asm
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] ret_from_fork
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] kthread
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] worker_thread
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] process_one_work
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] wb_workfn
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] wb_writeback
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] __writeback_inodes_wb
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] writeback_sb_inodes
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] __writeback_single_inode
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] do_writepages
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] ext4_writepages
+ 12,40% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] ext4_do_writepages
+ 12,39% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] ext4_map_blocks
+ 12,39% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] ext4_ext_map_blocks
+ 12,38% 0,00% kworker/u16:2+f [kernel.vmlinux] [k] ext4_mb_new_blocks
+ 12,38% 0,93% kworker/u16:2+f [kernel.vmlinux] [k] ext4_mb_regular_allocator
+ 9,42% 0,00% cc1 [unknown] [.] 0000000000000000
+ 5,42% 0,53% kworker/u16:2+f [kernel.vmlinux] [k] ext4_mb_scan_aligned
+ 4,88% 0,69% kworker/u16:2+f [kernel.vmlinux] [k] mb_find_extent
+ 3,99% 3,95% kworker/u16:2+f [kernel.vmlinux] [k] mb_find_order_for_block
+ 3,51% 0,61% kworker/u16:2+f [kernel.vmlinux] [k] ext4_mb_load_buddy_gfp
+ 2,95% 0,01% cc1 [kernel.vmlinux] [k] asm_exc_page_fault
+ 2,67% 0,18% kworker/u16:2+f [kernel.vmlinux] [k] pagecache_get_page
+ 2,41% 0,40% kworker/u16:2+f [kernel.vmlinux] [k] __filemap_get_folio
+ 2,33% 2,10% cc1 cc1 [.] cpp_get_token_1
+ 2,12% 0,05% cc1 [kernel.vmlinux] [k] exc_page_fault
+ 2,07% 0,04% cc1 [kernel.vmlinux] [k] do_user_addr_fault
+ 1,81% 0,52% kworker/u16:2+f [kernel.vmlinux] [k] filemap_get_entry
  1,80% 1,71% cc1 cc1 [.] ht_lookup_with_hash
+ 1,77% 0,08% cc1 [kernel.vmlinux] [k] handle_mm_fault
+ 1,65% 0,14% cc1 [kernel.vmlinux] [k] __handle_mm_fault
  1,60% 1,49% cc1 cc1 [.] _cpp_lex_direct
+ 1,54% 0,73% kworker/u16:2+f [kernel.vmlinux] [k] ext4_mb_good_group
+ 1,49% 1,46% cc1 cc1 [.] ggc_internal_alloc
+ 1,28% 0,05% cc1 [kernel.vmlinux] [k] do_anonymous_page
+ 1,28% 0,04% cc1 [kernel.vmlinux] [k] entry_SYSCALL_64
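For reference, the 8:16 in the kworker's flush-8:16 name is the major:minor number of the block device being flushed, so it can be mapped back to a disk; a quick sketch (lsblk from util-linux assumed available; 8:16 is how /dev/sdb usually shows up on a SATA setup like this one):

$ lsblk -o NAME,MAJ:MIN,TYPE,MOUNTPOINT
$ ls -l /dev/sdb      # brw-rw---- 1 root disk 8, 16 ... -> flush-8:16 is this disk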
Created attachment 305179 [details] Kernel config
Backtrace echo l > /proc/sysrq-trigger: [ 3128.330756] CPU: 2 PID: 311853 Comm: patch Tainted: P O 6.6.0-rc4-x86_64 #1 [ 3128.330758] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016 [ 3128.330759] RIP: 0010:mb_find_order_for_block+0x68/0xa0 [ 3128.330762] Code: d0 02 00 00 42 0f b7 04 00 48 01 f0 48 89 c6 41 89 f9 c1 e0 03 83 e0 38 41 d3 f9 48 83 e6 f8 44 01 c8 48 98 48 0f a3 06 73 27 <0f> b7 42 30 83 c1 01 83 c0 01 39 c1 7f 17 48 8b 42 18 48 8b 72 08 [ 3128.330763] RSP: 0018:ffffc9000e363630 EFLAGS: 00000207 [ 3128.330765] RAX: 0000000000000036 RBX: ffffc9000e363698 RCX: 000000000000000c [ 3128.330766] RDX: ffffc9000e363738 RSI: ffff88810a7e2ff8 RDI: 0000000000006000 [ 3128.330767] RBP: 0000000000007ff0 R08: 0000000000000018 R09: 0000000000000006 [ 3128.330767] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000006000 [ 3128.330768] R13: 0000000000006000 R14: 000000000000000d R15: ffffc9000e363738 [ 3128.330769] FS: 00007f237c609740(0000) GS:ffff88841fa80000(0000) knlGS:0000000000000000 [ 3128.330771] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3128.330772] CR2: 00005576c2964000 CR3: 0000000261c76004 CR4: 00000000001706e0 [ 3128.330773] Call Trace: [ 3128.330774] <NMI> [ 3128.330774] ? nmi_cpu_backtrace+0x88/0xf0 [ 3128.330776] ? nmi_cpu_backtrace_handler+0xc/0x20 [ 3128.330778] ? nmi_handle+0x53/0xf0 [ 3128.330780] ? default_do_nmi+0x40/0x220 [ 3128.330782] ? exc_nmi+0xf3/0x120 [ 3128.330784] ? end_repeat_nmi+0x16/0x67 [ 3128.330787] ? mb_find_order_for_block+0x68/0xa0 [ 3128.330788] ? mb_find_order_for_block+0x68/0xa0 [ 3128.330790] ? mb_find_order_for_block+0x68/0xa0 [ 3128.330792] </NMI> [ 3128.330792] <TASK> [ 3128.330793] mb_find_extent+0x149/0x200 [ 3128.330795] ext4_mb_scan_aligned+0xd3/0x140 [ 3128.330798] ext4_mb_regular_allocator+0xcca/0xe50 [ 3128.330800] ext4_mb_new_blocks+0x8c6/0xe80 [ 3128.330802] ? ext4_find_extent+0x3c1/0x410 [ 3128.330803] ? release_pages+0x13a/0x400 [ 3128.330805] ext4_ext_map_blocks+0x626/0x15c0 [ 3128.330807] ? filemap_get_folios_tag+0x1c6/0x1f0 [ 3128.330809] ? mpage_prepare_extent_to_map+0x462/0x4a0 [ 3128.330811] ext4_map_blocks+0x18b/0x5f0 [ 3128.330813] ? kmem_cache_alloc+0x278/0x410 [ 3128.330814] ? ext4_alloc_io_end_vec+0x19/0x50 [ 3128.330816] ext4_do_writepages+0x619/0xa90 [ 3128.330818] ext4_writepages+0xa8/0x180 [ 3128.330819] do_writepages+0xc6/0x190 [ 3128.330821] ? __find_get_block+0x1b2/0x2a0 [ 3128.330824] filemap_fdatawrite_wbc+0x5e/0x80 [ 3128.330827] __filemap_fdatawrite_range+0x57/0x80 [ 3128.330829] ext4_rename+0x753/0xbf0 [ 3128.330831] vfs_rename+0xa7e/0xbe0 [ 3128.330834] ? 
do_renameat2+0x4be/0x660 [ 3128.330836] do_renameat2+0x4be/0x660 [ 3128.330838] __x64_sys_renameat+0x44/0x60 [ 3128.330839] do_syscall_64+0x3f/0x90 [ 3128.330841] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 3128.330843] RIP: 0033:0x7f237c66160a [ 3128.330844] Code: 48 8b 15 21 e8 15 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 49 89 ca b8 08 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 e9 e7 15 00 f7 [ 3128.330845] RSP: 002b:00007ffcf85ca528 EFLAGS: 00000202 ORIG_RAX: 0000000000000108 [ 3128.330847] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f237c66160a [ 3128.330848] RDX: 0000000000000004 RSI: 00005576c295a091 RDI: 0000000000000004 [ 3128.330848] RBP: 0000000000000004 R08: 00000000000081a4 R09: 0000000000000000 [ 3128.330849] R10: 00005576c2959ad1 R11: 0000000000000202 R12: 00005576c2959ad1 [ 3128.330850] R13: 00005576c295a091 R14: 00007ffcf85ca630 R15: 0000000000000000 [ 3128.330851] </TASK> [ 3128.330852] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x62/0xb0 [ 4648.228980] sysrq: Show backtrace of all active CPUs [ 4648.228983] NMI backtrace for cpu 1 [ 4648.228985] CPU: 1 PID: 526431 Comm: bash Tainted: P O 6.6.0-rc4-x86_64 #1 [ 4648.228987] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016 [ 4648.228987] Call Trace: [ 4648.228989] <TASK> [ 4648.228991] dump_stack_lvl+0x36/0x50 [ 4648.228995] nmi_cpu_backtrace+0xc0/0xf0 [ 4648.228997] ? __pfx_nmi_raise_cpu_backtrace+0x10/0x10 [ 4648.228999] nmi_trigger_cpumask_backtrace+0xc8/0xf0 [ 4648.229000] __handle_sysrq+0xd2/0x190 [ 4648.229002] write_sysrq_trigger+0x23/0x40 [ 4648.229004] proc_reg_write+0x57/0xa0 [ 4648.229006] vfs_write+0xcb/0x410 [ 4648.229008] ? filp_flush+0x4d/0x80 [ 4648.229010] ksys_write+0x62/0xe0 [ 4648.229011] do_syscall_64+0x3f/0x90 [ 4648.229013] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 4648.229016] RIP: 0033:0x7f3d2d70e640 [ 4648.229017] Code: 87 0c 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 80 3d 29 0d 0d 00 00 74 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 48 89 [ 4648.229019] RSP: 002b:00007ffd39f5e4a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [ 4648.229021] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3d2d70e640 [ 4648.229022] RDX: 0000000000000002 RSI: 000055fbbed84ba0 RDI: 0000000000000001 [ 4648.229023] RBP: 00007f3d2d7d85c0 R08: 0000000000000030 R09: 00007f3d2d7d7b00 [ 4648.229024] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002 [ 4648.229025] R13: 000055fbbed84ba0 R14: 0000000000000002 R15: 00007f3d2d7d5f40 [ 4648.229026] </TASK> [ 4648.229027] Sending NMI from CPU 1 to CPUs 0,2-7: [ 4648.229030] NMI backtrace for cpu 3 skipped: idling at intel_idle+0x62/0xb0 [ 4648.229033] NMI backtrace for cpu 7 skipped: idling at intel_idle+0x62/0xb0 [ 4648.229036] NMI backtrace for cpu 5 skipped: idling at intel_idle+0x62/0xb0 [ 4648.229039] NMI backtrace for cpu 2 [ 4648.229041] CPU: 2 PID: 71621 Comm: kworker/u16:1 Tainted: P O 6.6.0-rc4-x86_64 #1 [ 4648.229042] Hardware name: ASUS All Series/Z97-K, BIOS 2902 03/31/2016 [ 4648.229043] Workqueue: writeback wb_workfn (flush-8:16) [ 4648.229047] RIP: 0010:mb_find_order_for_block+0x66/0xa0 [ 4648.229050] Code: 8b 80 d0 02 00 00 42 0f b7 04 00 48 01 f0 48 89 c6 41 89 f9 c1 e0 03 83 e0 38 41 d3 f9 48 83 e6 f8 44 01 c8 48 98 48 0f a3 06 <73> 27 0f b7 42 30 83 c1 01 83 c0 01 39 c1 7f 17 48 8b 42 18 48 8b [ 4648.229052] RSP: 0018:ffffc90001f076c8 EFLAGS: 00000203 [ 4648.229053] RAX: 
0000000000000138 RBX: ffffc90001f07730 RCX: 0000000000000006 [ 4648.229054] RDX: ffffc90001f077d0 RSI: ffff88810a7e2f80 RDI: 0000000000004e30 [ 4648.229055] RBP: 0000000000007ff0 R08: 000000000000000c R09: 0000000000000138 [ 4648.229056] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8881049f7800 [ 4648.229056] R13: 0000000000000001 R14: 0000000000007ff0 R15: ffffc90001f077d0 [ 4648.229057] FS: 0000000000000000(0000) GS:ffff88841fa80000(0000) knlGS:0000000000000000 [ 4648.229059] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4648.229060] CR2: 00007f23b3ffc3a0 CR3: 000000023ba2a001 CR4: 00000000001706e0 [ 4648.229060] Call Trace: [ 4648.229061] <NMI> [ 4648.229062] ? nmi_cpu_backtrace+0x88/0xf0 [ 4648.229064] ? nmi_cpu_backtrace_handler+0xc/0x20 [ 4648.229065] ? nmi_handle+0x53/0xf0 [ 4648.229068] ? default_do_nmi+0x40/0x220 [ 4648.229070] ? exc_nmi+0xf3/0x120 [ 4648.229072] ? end_repeat_nmi+0x16/0x67 [ 4648.229075] ? mb_find_order_for_block+0x66/0xa0 [ 4648.229077] ? mb_find_order_for_block+0x66/0xa0 [ 4648.229079] ? mb_find_order_for_block+0x66/0xa0 [ 4648.229081] </NMI> [ 4648.229081] <TASK> [ 4648.229081] mb_find_extent+0xd7/0x200 [ 4648.229084] ext4_mb_scan_aligned+0xd3/0x140 [ 4648.229086] ? ext4_get_group_info+0x4b/0x60 [ 4648.229088] ext4_mb_regular_allocator+0xcca/0xe50 [ 4648.229090] ? mempool_alloc+0x58/0x180 [ 4648.229093] ext4_mb_new_blocks+0x8c6/0xe80 [ 4648.229094] ? ext4_find_extent+0x3c1/0x410 [ 4648.229096] ext4_ext_map_blocks+0x626/0x15c0 [ 4648.229099] ? filemap_get_folios_tag+0x1c6/0x1f0 [ 4648.229100] ? mpage_prepare_extent_to_map+0x462/0x4a0 [ 4648.229102] ext4_map_blocks+0x18b/0x5f0 [ 4648.229104] ? kmem_cache_alloc+0x278/0x410 [ 4648.229106] ? ext4_alloc_io_end_vec+0x19/0x50 [ 4648.229109] ext4_do_writepages+0x619/0xa90 [ 4648.229111] ext4_writepages+0xa8/0x180 [ 4648.229113] do_writepages+0xc6/0x190 [ 4648.229115] __writeback_single_inode+0x2c/0x1a0 [ 4648.229117] writeback_sb_inodes+0x1e8/0x450 [ 4648.229119] __writeback_inodes_wb+0x47/0xe0 [ 4648.229121] wb_writeback.isra.0+0x159/0x1b0 [ 4648.229123] wb_workfn+0x269/0x380 [ 4648.229125] ? __schedule+0x2ca/0x1110 [ 4648.229127] ? __mod_timer+0x116/0x370 [ 4648.229130] process_one_work+0x12f/0x240 [ 4648.229132] worker_thread+0x2f0/0x410 [ 4648.229134] ? __pfx_worker_thread+0x10/0x10 [ 4648.229136] kthread+0xe3/0x110 [ 4648.229138] ? __pfx_kthread+0x10/0x10 [ 4648.229139] ret_from_fork+0x2f/0x50 [ 4648.229141] ? __pfx_kthread+0x10/0x10 [ 4648.229142] ret_from_fork_asm+0x1b/0x30 [ 4648.229145] </TASK> [ 4648.229145] NMI backtrace for cpu 0 skipped: idling at intel_idle+0x62/0xb0 [ 4648.229149] NMI backtrace for cpu 4 skipped: idling at intel_idle+0x62/0xb0 [ 4648.229151] NMI backtrace for cpu 6 skipped: idling at intel_idle+0x62/0xb0
What sort of information can you give about the ext4 file system where this is happening? How much free space is there? Can you run "df" and "dumpe2fs -h" on the system so we can see which file system features are enabled? Also, can you specify the kind of block device?

Differential debugging would be useful. For example, can you try copying the file system image to different hardware and see if you can reproduce the behavior using the same kernel build workload? What if you copy the file system image to a USB-attached SSD and see if you can reproduce the behavior? What if you attach the USB SSD and use a freshly formatted ext4 file system; does it reproduce then? If the file system is nearly full, what if you delete a lot of unused data to provide a lot more free space; does it reproduce then?
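A minimal sketch of the differential test being suggested, under the assumption that the filesystem image fits on a spare disk or scratch file (all paths below are placeholders):

$ # capture the filesystem while it is unmounted (or mounted read-only)
$ sudo dd if=/dev/sdb2 of=/spare/sdb2.img bs=4M status=progress
$ sudo mount -o loop /spare/sdb2.img /mnt/test
$ # rerun the same kernel-build workload against /mnt/test and watch whether
$ # the kworker flush thread pegs a core again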
(In reply to Theodore Tso from comment #3) > What sort of information can you give about the ext4 file system where this > is happening? How much free space is there? Can you run "df" and > "dumpe2fs -h" on the system so we can see which file system features were > enabled? Also, can you specify the kind of the block device? > > Differential debugging would be useful. For example, can you try is to copy > the file system image to a hardware, and see if you can reproduce the > behavior using the same kernel build workload? What if you copy to the > file system image to a USB attached SSD, and see if you can reproduce the > behavior? What you attach the USB attached SSD, and use a freshly > formatted ext4 file system? Does it reproduce then? If the file system > is nearly full, what if you delete a lot of unused space, to provide a lot > more free space, and see if it reproduces then? mount |grep sdb2 /dev/sdb2 on /mnt/sdb2 type ext4 (rw,relatime,stripe=32752) df /mnt/sdb2/ Filesystem 1K-blocks Used Available Use% Mounted on /dev/sdb2 2667140496 2125391752 539022500 80% /mnt/sdb2 dumpe2fs -h /dev/sdb2 dumpe2fs 1.47.0 (5-Feb-2023) Filesystem volume name: <none> Last mounted on: /mnt/sdb2 Filesystem UUID: 3bd82ee7-f08a-4a23-8e50-2df3786b0858 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 169369600 Block count: 677467392 Reserved block count: 677465 Overhead clusters: 10682268 Free blocks: 135437186 Free inodes: 168057645 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 862 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8192 Inode blocks per group: 512 RAID stride: 32752 Flex block group size: 16 Filesystem created: Wed Jun 25 13:10:17 2014 Last mount time: Mon Oct 2 18:46:12 2023 Last write time: Mon Oct 2 18:46:12 2023 Mount count: 23 Maximum mount count: -1 Last checked: Thu Sep 28 14:37:21 2023 Check interval: 0 (<none>) Lifetime writes: 84 TB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode: 8 Default directory hash: half_md4 Directory Hash Seed: a90aaa2d-577c-4181-83f9-696dee7ea92d Journal backup: inode blocks Journal features: journal_incompat_revoke Total journal size: 128M Total journal blocks: 32768 Max transaction length: 32768 Fast commit length: 0 Journal sequence: 0x002a8661 Journal start: 9556 I don't have physical access to this machine so can't do any manipulations to the hardware or attaching USB drives etc. Device is Model Family: Toshiba 3.5" DT01ACA... Desktop HDD Device Model: TOSHIBA DT01ACA300 Serial Number: 14KP58HGS LU WWN Device Id: 5 000039 ff4d7b718 Firmware Version: MX6OABB0 User Capacity: 3,000,592,982,016 bytes [3.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 3.5 inches Device is: In smartctl database ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Mon Oct 2 18:53:26 2023 EEST SMART support is: Available - device has SMART capability. 
SMART support is: Enabled AAM feature is: Unavailable APM feature is: Disabled Rd look-ahead is: Enabled Write cache is: Enabled DSN feature is: Unavailable ATA Security is: Disabled, frozen [SEC2] Wt Cache Reorder: Enabled
Hi Ivan,

Thanks for sharing the additional information. I can see that you are mounting with stripe=32752, and some of the functions in the backtrace indicate that a striped allocation was taking place. There were some changes to the allocator in the 6.5 kernel, if I'm not wrong, but I'll need to try to replicate this to confirm whether the issue is related to those, and I'll update here.

Meanwhile, can you provide some more info on the following:

1. I want to check whether this issue is related to the striped allocation code path or whether it is also seen in the usual path. To check that, would it be possible for you to run the same workload without passing the "stripe=32752" option at mount time?
2. How often do you see this issue; is it noticeable 100% of the time?

Regards,
ojaswin
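As a side note, the effective stripe value and the mballoc tunables (including the new ones added around 6.5) can be inspected at runtime via procfs/sysfs; a small sketch using the sdb2 device name from this report:

$ cat /proc/fs/ext4/sdb2/options    # effective mount options; stripe= should show up here
$ grep . /sys/fs/ext4/sdb2/mb_*     # mballoc tunables, e.g. mb_best_avail_max_trim_order on 6.5+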
Hi Ojaswin,

1. Actually, I'm not passing the stripe option in my fstab; it seems it's added automatically. Could you please clarify how I can get rid of it? The corresponding fstab entry is:

UUID=3bd82ee7-f08a-4a23-8e50-2df3786b0858 /mnt/sdb2 ext4 defaults,user_xattr 0 2

2. The issue is reproducible/visible 100% of the time when building OpenWrt on kernels >= 6.5, including 6.6.
Removed stripe with tune2fs -f -E stride=0 /dev/sdb2 and it fixed the issue. Still have no clue why it was enabled. Thanks
Hey Ivan, thanks! It seems the stripe mount option is added automatically if the filesystem was created with the stride or stripe-width options at mkfs time. Can you kindly perform the steps below as well to help narrow this issue down further:

1. Mount the FS with the usual options (including stripe=32752).
2. Once mounted, run:
   $ echo "0" > /sys/fs/ext4/sdb2/mb_best_avail_max_trim_order
   This might need sudo. Basically this disables one of the allocator features that was added in 6.5.
3. Run the workload again and see if the issue is still visible.

Regards,
ojaswin
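To see where the automatic stripe value comes from and to confirm the knob above took effect, something like the following should work (sdb2 as in the previous comments):

$ dumpe2fs -h /dev/sdb2 | grep -i 'stride\|stripe'      # RAID stride / stripe width stored at mkfs time
$ cat /proc/fs/ext4/sdb2/options | grep stripe          # stripe= as applied by the current mount
$ cat /sys/fs/ext4/sdb2/mb_best_avail_max_trim_order    # 0 after the echo above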
With the stripe=32752 mount option and mb_best_avail_max_trim_order disabled, the issue is still visible.

cat /sys/fs/ext4/sdb2/mb_best_avail_max_trim_order
0

mount | grep sdb2
/dev/sdb2 on /mnt/sdb2 type ext4 (rw,relatime,stripe=32752)

4687 root 20 0 0 0 0 R 100,0 0,0 0:25.73 kworker/u16:1+flush-8:16
Hi Ivan,

Unfortunately I'm not able to replicate it at my end yet. While I keep trying, I wanted to check if you can give a few things a try. It seems the CPU is stuck in mb_find_order_for_block() called from mb_find_extent(). I do see a while loop in mb_find_order_for_block(), but it's not obvious whether it's stuck there and, if so, why. If possible, can you:

1. Recompile the kernel with CONFIG_DEBUG_INFO=y, CONFIG_SOFTLOCKUP_DETECTOR=y and CONFIG_HARDLOCKUP_DETECTOR=y, which might provide more information in dmesg when the lockup happens.
2. Replicate it once more and note the RIP value in the trace of the stuck CPU; for example, in the above trace it was mb_find_order_for_block+0x68 for CPU 2.
3. Run the following in the kernel's source dir to get the corresponding line number (CONFIG_DEBUG_INFO needed):
   $ ./scripts/faddr2line vmlinux mb_find_order_for_block+0x68
   Maybe you can share the code you see in and around those lines as well as the exact kernel version. This will help pinpoint the location where the code might be stuck (for example in a loop), which can help root-cause this.

Also, you mentioned that the CPU gets stuck at 100% util for 10-15 minutes; does it ever come back to normal or does it stay stuck?

Regards,
ojaswin
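To avoid hand-copying the offset, the RIP symbol+offset can be pulled out of the log and handed to faddr2line directly; a sketch only, assuming the "RIP: 0010:<symbol>+0x<off>/0x<size>" format seen in the traces above:

$ rip=$(dmesg | grep -oE 'RIP: 0010:[^ ]+' | tail -1 | cut -d: -f3)   # e.g. mb_find_order_for_block+0x68/0xa0
$ ./scripts/faddr2line vmlinux "${rip%/*}"                            # drop the /0x<size> suffix before resolving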
Hi Ojaswin,

I tried it, but unfortunately it gives me nothing useful:

[ 619.894677] RIP: 0010:mb_find_order_for_block+0x50/0xa0

./scripts/faddr2line vmlinux mb_find_order_for_block+0x50
mb_find_order_for_block+0x50/0xa0: fs mm ??:0

All required kernel options are enabled:

zcat /proc/config.gz | grep -i 'CONFIG_DEBUG_INFO\|CONFIG_SOFTLOCKUP_DETECTOR\|CONFIG_HARDLOCKUP_DETECTOR' | grep -v '#'
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
CONFIG_DEBUG_INFO_COMPRESSED_NONE=y
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER=y

> Also, you mentioned that the CPU gets stuck at 100% util for 10-15mins, does
> it ever come back to normal or does it stay stuck?

Yes, it comes back to normal after quite some time.
Hey Ivan, so I used kernel v6.6-rc1 and the same config you provided, as well as an HDD mounted on my VM. Then I followed the steps here to build openwrt [1]. However, I'm still unable to replicate the 100% CPU utilization in a kworker/flush thread (I do get . Since you have the config options enabled and we didn't see them trigger any warning, and given that things get back to normal after a few minutes, this indicates it's not a lockup/deadlock. We also see that the issue does not appear on a faster SSD, so this might even have something to do with a lot of IOs being queued up on the slower disk, causing us to notice the delay. Maybe we are waiting a lot more on some spinlock, which could explain the CPU utilization.

Since I'm unable to replicate it, I'll have to ask you for some more info to get to the bottom of this. More specifically, can you kindly provide the following for the kernel with this issue:

1. Replicate the 100% util in one terminal window.
2. Once the 100% util is hit, in another terminal run the following commands:
   $ iostat -x /dev/<dev> 2   (run this for 20 to 30 seconds)
   $ perf record -ag sleep 20
   $ echo l > /proc/sysrq-trigger
   $ uname -a
3. Repeat the above for a kernel where the issue is not seen.

Kindly share the sysrq backtrace, iostat output, perf.data and the uname output for both runs here so that I can take a closer look at what is causing the unusual utilization.

[1] https://github.com/openwrt/openwrt#quickstart

Regards,
ojaswin
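The collection steps above can be run back to back from a second terminal while the flush worker is spinning; a rough sketch, with DEV and the sampling windows as assumptions to adjust:

$ DEV=sdb                                    # the disk behind flush-8:16 in this report
$ iostat -x /dev/$DEV 2 15 > iostat-$(uname -r).log &
$ sudo perf record -ag -o perf-$(uname -r).data sleep 20
$ echo l | sudo tee /proc/sysrq-trigger      # per-CPU backtraces land in dmesg
$ uname -a > uname-$(uname -r).log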
Affected one:
iostat: https://pastebin.com/mau5VYnV
perf: https://drive.google.com/file/d/15FO5YsvYrFnKtJI9hfl8qPG7oI8SgLeI/view?usp=sharing
backtrace: https://pastebin.com/AgjYXhjR
uname -a: Linux gentoo-desktop 6.5.7-gentoo-x86_64 #1 SMP PREEMPT_DYNAMIC Wed Oct 11 14:40:24 EEST 2023 x86_64 Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz GenuineIntel GNU/Linux

I don't have a 6.4.16 kernel at the moment; I will compile it later and provide the relevant info.
Not affected 6.4.16:
iostat: https://pastebin.com/6XadeAuh
perf: https://drive.google.com/file/d/1neAq73hoztfTU87NyCHWGAQhCKQfILOH/view?usp=sharing
backtrace: https://pastebin.com/zdnWgx4M
uname -a: Linux gentoo-desktop 6.4.16-gentoo-x86_64 #1 SMP PREEMPT_DYNAMIC Thu Oct 12 00:09:02 EEST 2023 x86_64 Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz GenuineIntel GNU/Linux
I encounter the same issue on a stock fedora 37 kernel # uname -a Linux xxx.net 6.4.16-100.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Sep 13 18:22:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux The issue seems to be present also on brtfs partitions, as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2242391 In my case, I have an EXT4 partition over an mdadm raid 1 array of two HDD. # mdadm --misc --detail /dev/md0 /dev/md0: Version : 1.2 Creation Time : Sun Feb 24 21:40:33 2013 Raid Level : raid1 Array Size : 3906118928 (3.64 TiB 4.00 TB) Used Dev Size : 3906118928 (3.64 TiB 4.00 TB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Update Time : Tue Oct 17 12:05:22 2023 State : clean Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 Consistency Policy : resync Name : xxx.net:0 (local to host xxx.net) UUID : 25070e74:b2ac1695:46ee20f9:7e8d1e05 Events : 402440 Number Major Minor RaidDevice State 3 8 17 0 active sync /dev/sdb1 2 8 33 1 active sync /dev/sdc1 dumpe2fs -h /dev/md0p1 dumpe2fs 1.46.5 (30-Dec-2021) Filesystem volume name: <none> Last mounted on: /home Filesystem UUID: 4df09a03-d8ef-4cef-a64d-cefde4823fc2 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 244137984 Block count: 976529232 Reserved block count: 48826457 Overhead clusters: 15373546 Free blocks: 545717412 Free inodes: 243040909 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 791 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8192 Inode blocks per group: 512 RAID stride: 32745 Flex block group size: 16 Filesystem created: Sun Feb 24 22:13:52 2013 Last mount time: Mon Oct 16 21:06:57 2023 Last write time: Mon Oct 16 21:06:57 2023 Mount count: 6 Maximum mount count: -1 Last checked: Sat Oct 7 16:27:02 2023 Check interval: 0 (<none>) Lifetime writes: 238 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode: 8 First orphan inode: 13509593 Default directory hash: half_md4 Directory Hash Seed: af37e79f-0457-4318-9c5d-840dc2f60bce Journal backup: inode blocks Journal features: journal_incompat_revoke Total journal size: 128M Total journal blocks: 32768 Max transaction length: 32768 Fast commit length: 0 Journal sequence: 0x003bc716 Journal start: 1 # df /dev/md0p1 Filesystem 1K-blocks Used Available Use% Mounted on /dev/md0p1 3844622744 1661757340 1987543192 46% /home # mount |grep md0 /dev/md0p1 on /home type ext4 (rw,relatime,seclabel,stripe=32745) In order to trig the 100% CPU on one core, a simple # for i in {0001..0200}; echo "some text" > "file_${i}.txt" is enough. This will bring the kworker/flush at 100 % for around 90 seconds. This time is dependent of the number of created files. Trying to delete these files is more or less impossible. On the same setup I have a small SSD used for system installation. The same test doesn't bring the kworker/flush behaviour. 
# dumpe2fs -h /dev/sda3 dumpe2fs 1.46.5 (30-Dec-2021) Filesystem volume name: <none> Last mounted on: / Filesystem UUID: 589968d6-2283-4903-8699-1b23f5c341a5 Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 848640 Block count: 3390208 Reserved block count: 169510 Free blocks: 1823000 Free inodes: 735742 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 827 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 8160 Inode blocks per group: 510 Flex block group size: 16 Filesystem created: Sun Apr 6 00:00:12 2014 Last mount time: Mon Oct 16 21:06:53 2023 Last write time: Mon Oct 16 21:06:47 2023 Mount count: 437 Maximum mount count: -1 Last checked: Sun Apr 6 00:00:12 2014 Check interval: 0 (<none>) Lifetime writes: 1791 GB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group root) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Journal inode: 8 First orphan inode: 130778 Default directory hash: half_md4 Directory Hash Seed: 7d17214b-9585-4370-8c06-4236e449fa7f Journal backup: inode blocks Journal features: journal_incompat_revoke Total journal size: 128M Total journal blocks: 32768 Max transaction length: 32768 Fast commit length: 0 Journal sequence: 0x0114205c Journal start: 8194 # mount |grep sda3 /dev/sda3 on / type ext4 (rw,noatime,seclabel)
Oops, not the right kernel version. 6.4.16 is the one that works. The one with the issue (and actually all of the 6.5 kernels I tested) is:

Linux version 6.5.4-100.fc37.x86_64 (mockbuild@bkernel01.iad2.fedoraproject.org) (gcc (GCC) 12.3.1 20230508 (Red Hat 12.3.1-1), GNU ld version 2.38-27.fc37) #1 SMP PREEMPT_DYNAMIC Tue Sep 19 13:13:25 UTC 2023
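For what it's worth, the one-line reproducer quoted a couple of comments up looks abbreviated (there is no do/done); a plain-bash equivalent, as an assumption of what was run, would be:

$ cd /home/some/dir/on/the/affected/fs      # placeholder path on the ext4-over-raid filesystem
$ for i in {0001..0200}; do echo "some text" > "file_${i}.txt"; done
$ grep -E 'Dirty|Writeback' /proc/meminfo   # watch how long the dirty pages take to drain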
Hey Ivan, Eduard,

Thanks for all the data. The iostat output suggests that we are breaking up the writes much more, as seen in the wareq-sz field, which is much lower on the regressed kernel. Further, since the issue only appears when striping is used (i.e. in RAID scenarios), it seems to have something to do with the striped/RAID layout. Eduard's reproducer seems much easier to try, so let me see if I'm able to replicate this over a RAID setup of HDDs.

Regards,
ojaswin
Hi Ojaswin, If you want to replicate something similar to my setup, I better give you more info about it, because it's not a desktop computer. It's a very slow system by current standards and I think it may impact why there aren't more people complaining about this issue. In Ivan's case it seems that after a small while, it recovered even with a lot of files. His setup seems more powerful than mine. In my case pulling a 75MB container with podman didn't recovered after 2 hours of 100% CPU. The system has 4G RAM: # free total used free shared buff/cache available Mem: 3977448 615760 410332 14580 2951356 3045304 Swap: 1564668 0 1564668 and an AMD G-T40E processor: # cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 20 model : 2 model name : AMD G-T40E Processor stepping : 0 microcode : 0x5000119 cpu MHz : 801.798 cache size : 512 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt hw_pstate vmmcall arat npt lbrv svm_lock nrip_save pausefilter bugs : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass bogomips : 1999.82 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: ts ttp tm stc 100mhzsteps hwpstate processor : 1 vendor_id : AuthenticAMD cpu family : 20 model : 2 model name : AMD G-T40E Processor stepping : 0 microcode : 0x5000119 cpu MHz : 799.941 cache size : 512 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 6 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni monitor ssse3 cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch ibs skinit wdt hw_pstate vmmcall arat npt lbrv svm_lock nrip_save pausefilter bugs : fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass bogomips : 1999.82 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: ts ttp tm stc 100mhzsteps hwpstate The raid1 underlying hardware changed from the 2013 from 1T -> 2T -> 4T HDD. The file system grew up accordingly. During the step 2T -> 4T disk layout was converted from MBR to GPT: # gdisk -l /dev/sdb GPT fdisk (gdisk) version 1.0.9 Partition table scan: MBR: protective BSD: not present APM: not present GPT: present Found valid GPT with protective MBR; using GPT. 
Disk /dev/sdb: 7814037168 sectors, 3.6 TiB Model: ST4000VN008-2DR1 Sector size (logical/physical): 512/4096 bytes Disk identifier (GUID): CD6E68D9-1A0F-40F7-9755-FDE136159F92 Partition table holds up to 128 entries Main partition table begins at sector 2 and ends at sector 33 First usable sector is 34, last usable sector is 7814037134 Partitions will be aligned on 2048-sector boundaries Total free space is 1537100 sectors (750.5 MiB) Number Start (sector) End (sector) Size Code Name 1 2048 7812502048 3.6 TiB 8300 primary And again the raid1 md0 partition table: # gdisk -l /dev/md0 GPT fdisk (gdisk) version 1.0.9 Partition table scan: MBR: protective BSD: not present APM: not present GPT: present Found valid GPT with protective MBR; using GPT. Disk /dev/md0: 7812237857 sectors, 3.6 TiB Sector size (logical/physical): 512/4096 bytes Disk identifier (GUID): 46C6A662-69D1-492D-A31F-1121048F9FE3 Partition table holds up to 128 entries Main partition table begins at sector 2 and ends at sector 33 First usable sector is 34, last usable sector is 7812237823 Partitions will be aligned on 2048-sector boundaries Total free space is 3933 sectors (1.9 MiB) Number Start (sector) End (sector) Size Code Name 1 2048 7812235904 3.6 TiB 8300 Linux filesystem Last one is the EXT4 filesystem.
I seem to have a similar issue on a PC with raid6, which was discussed here: https://marc.info/?l=linux-raid&m=169922587226984 and which ended up pointing me here. My impression is that it is not a Fedora-specific issue (I am on f38), so this is the correct place to post. I see no progress for 4 weeks; is there any update? It looks like people were getting close to understanding the cause, maybe even to a solution.
Hey Eyal,

I was continuing to look into this but had to set it aside for a bit, mainly because I got caught up with some other work, and I don't seem to have the right hardware to trigger this either. I see that you are facing a similar issue and have some more information in your post. Thanks for that; let me take some more time to go through it and think about the issue.

Btw, would it be possible for you to do a git bisect of the issue to see which is the bad commit? Since I'm not able to replicate it on my setup, that could be really helpful.

Regards,
ojaswin
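For reference, the bisect being asked for would look roughly like this (the good/bad tags are taken from the kernels reported earlier; the build/boot/test step in the middle is the manual part):

$ git bisect start
$ git bisect bad v6.5          # first series showing the 100% flush-worker regression
$ git bisect good v6.4         # 6.4.x reported as behaving normally
$ # build and boot the commit git suggests, rerun the copy workload, then mark it:
$ git bisect good              # or: git bisect bad
$ git bisect reset             # when done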
Hi Ojaswin,

Thanks for the info. This is a 24x7 server, so rebooting often is not something I would consider. However, if a specific kernel is of interest and I can install it from the Fedora repositories, then I could probably give it a test. I will give 6.4.15 a try though.
I now ran a test on 6.4.15. I copied a segment of the tree I have issues with: 7.2GB in 77,972 files. It took just over 9 minutes to drain the dirty cache, and the array ran like this (I am watching the kB_wrtn/s column):

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
18:15:01 md127 423.00 4.00 7001.20 0.00 40 70012 0
18:15:11 md127 112.30 14.80 9689.60 0.00 148 96896 0
18:15:21 md127 28.70 3.60 964.80 0.00 36 9648 0
18:15:31 md127 321.90 4.00 1340.40 0.00 40 13404 0
18:15:41 md127 18.20 0.00 123.20 0.00 0 1232 0
18:15:51 md127 26.80 10.00 2062.00 0.00 100 20620 0
18:16:01 md127 2.60 0.00 56.40 0.00 0 564 0
18:16:11 md127 2.80 0.00 82.40 0.00 0 824 0
18:16:21 md127 6.10 0.00 150.80 0.00 0 1508 0
18:16:31 md127 86.00 16.80 4978.00 0.00 168 49780 0
18:16:41 md127 48.90 10.80 4505.60 0.00 108 45056 0
18:16:51 md127 36.10 6.80 1498.00 0.00 68 14980 0
18:17:01 md127 3.20 0.40 64.40 0.00 4 644 0
18:17:11 md127 3.40 0.80 84.80 0.00 8 848 0
18:17:21 md127 3.70 0.80 283.60 0.00 8 2836 0
18:17:31 md127 32.90 6.00 1950.00 0.00 60 19500 0
18:17:41 md127 907.50 24.00 15638.40 0.00 240 156384 0
18:17:51 md127 698.00 9.60 10476.40 0.00 96 104764 0
18:18:01 md127 146.80 0.40 4796.00 0.00 4 47960 0
18:18:11 md127 163.00 0.80 1328.80 0.00 8 13288 0
18:18:21 md127 149.50 1.20 1140.00 0.00 12 11400 0
18:18:31 md127 46.50 1.60 545.20 0.00 16 5452 0
18:18:41 md127 335.90 14.40 9770.00 0.00 144 97700 0
18:18:51 md127 119.90 4.40 3868.80 0.00 44 38688 0
18:19:01 md127 315.70 25.60 11422.80 0.00 256 114228 0
18:19:11 md127 285.80 1.20 4142.40 0.00 12 41424 0
18:19:21 md127 239.00 4.40 3888.00 0.00 44 38880 0
18:19:31 md127 116.30 1.20 1752.80 0.00 12 17528 0
18:19:41 md127 166.60 8.00 8886.00 0.00 80 88860 0
18:19:51 md127 357.20 5.60 6146.80 0.00 56 61468 0
18:20:01 md127 118.10 1.20 1670.40 0.00 12 16704 0
18:20:11 md127 66.00 6.40 802.80 0.00 64 8028 0
18:20:21 md127 139.40 4.40 1641.20 0.00 44 16412 0
18:20:31 md127 44.40 2.40 632.40 0.00 24 6324 0
18:20:41 md127 183.50 13.60 4048.40 0.00 136 40484 0
18:20:51 md127 114.40 28.40 8346.40 0.00 284 83464 0
18:21:01 md127 152.20 27.20 15576.00 0.00 272 155760 0
18:21:11 md127 75.30 1.60 1146.40 0.00 16 11464 0
18:21:21 md127 236.20 18.40 13491.20 0.00 184 134912 0
18:21:31 md127 226.70 3.60 3237.60 0.00 36 32376 0
18:21:41 md127 152.70 4.00 2048.00 0.00 40 20480 0
18:21:51 md127 92.10 7.60 1907.20 0.00 76 19072 0
18:22:01 md127 142.50 8.00 1921.60 0.00 80 19216 0
18:22:11 md127 137.30 5.60 2307.60 0.00 56 23076 0
18:22:21 md127 124.10 6.00 1511.20 0.00 60 15112 0
18:22:31 md127 105.90 4.00 1888.80 0.00 40 18888 0
18:22:41 md127 166.30 2.00 2073.20 0.00 20 20732 0
18:22:51 md127 649.90 6.80 10615.20 0.00 68 106152 0
18:23:01 md127 1253.70 8.80 175698.40 0.00 88 1756984 0
18:23:11 md127 88.40 2.00 66840.40 0.00 20 668404 0
18:23:21 md127 0.00 0.00 0.00 0.00 0 0 0

Is writing at a few MB/s reasonable for an array with disks that can top 200MB/s? Note the burst at the end when a significant chunk of the data was written out.
Now running on 6.5.10 I see a very different story:

Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
18:45:53 md127 2.60 0.00 10.40 0.00 0 104 0
18:46:03 md127 5.40 0.00 21.60 0.00 0 216 0
18:46:13 md127 5.40 0.00 21.60 0.00 0 216 0
18:46:23 md127 5.30 0.00 21.60 0.00 0 216 0
18:46:33 md127 5.20 0.00 20.80 0.00 0 208 0
18:46:43 md127 5.30 0.00 23.60 0.00 0 236 0
18:46:53 md127 5.10 0.00 44.00 0.00 0 440 0
18:47:03 md127 2.70 0.00 21.60 0.00 0 216 0
18:47:13 md127 5.30 0.00 26.80 0.00 0 268 0
18:47:23 md127 5.40 0.00 25.20 0.00 0 252 0
18:47:33 md127 5.20 0.00 21.60 0.00 0 216 0
18:47:43 md127 5.30 0.00 21.20 0.00 0 212 0
18:47:53 md127 5.30 0.00 22.40 0.00 0 224 0
18:48:03 md127 5.40 0.00 24.00 0.00 0 240 0
18:48:13 md127 2.60 0.00 11.20 0.00 0 112 0
18:48:23 md127 5.30 0.00 22.40 0.00 0 224 0
18:48:33 md127 5.30 0.00 23.20 0.00 0 232 0
18:48:43 md127 5.20 0.00 24.40 0.00 0 244 0
18:48:53 md127 5.30 0.00 22.00 0.00 0 220 0
18:49:03 md127 6.20 3.60 23.20 0.00 36 232 0
18:49:13 md127 3.40 0.00 20.80 0.00 0 208 0
18:49:23 md127 5.20 0.00 20.80 0.00 0 208 0
18:49:33 md127 5.30 0.00 22.00 0.00 0 220 0
18:49:43 md127 5.30 0.00 21.60 0.00 0 216 0
18:49:53 md127 5.20 0.00 33.20 0.00 0 332 0

I expect this to take much longer to complete, possibly hours (20% drained in the last 30m). In both cases the flusher thread was running at 100% CPU. I do have the full logs though. HTH
My opinion on this is as follows. a) This openwrt runs on what? Kernel 6.5 would not run well on an older embedded processor. b) The controller chip of the HDD might be the problem: if it is not damaged, the driver for the controller may not have all the functions the new kernel version has. If the controller is not damaged, try to install the newest BIOS for that controller, and maybe you need an updated driver.
In case it helps: I did a copy of a small part of my file tree, 368 MB total size in 19,232 files, and checked how long it takes to write out the dirty blocks:

kernel-6.4.15  0m50s
kernel-6.5.7   20m
kernel-6.5.10  same

In all cases the flusher thread was running at 100% CPU from shortly after the copy. At the start, meminfo showed about 380MB of Dirty.

About https://bugzilla.kernel.org/show_bug.cgi?id=217965#c23, I do not see how it is relevant to me. I also doubt a "new kernel version" was released with such user-visible changes without due notice.
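A minimal version of that timing measurement, as a sketch (source and destination paths are placeholders; `time sync` is used here as the "dirty cache fully drained" marker):

$ cp -a /some/source/tree /data1/testcopy   # dirties a few hundred MB of pagecache
$ grep -E 'Dirty|Writeback' /proc/meminfo   # roughly how much is queued for writeback
$ time sync                                 # returns once the copy has been written out

Per the numbers above, this is where the difference shows up: well under a minute on 6.4.15 versus tens of minutes on 6.5.x, with the flusher at 100% CPU.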
Hey Eyal,

Would it be possible for you to try this experiment on an ext4 FS mounted with the -o nodelalloc option? This should allocate the blocks immediately instead of doing so during writeback.

Further, would you be able to provide the perf data file along with kallsyms or something similar that can be used to resolve the symbols? If not, the following output will also do:

perf report --no-children --stdio -i perf.data

Meanwhile I'll continue to look. Thanks!
Ojaswin,

For my understanding: the (dirty) buffers hold data blocks that need to be written. Clearing dirty buffers does not involve fs activity, since that was already performed during the copy. Is this correct? If so, why are we talking about ext4? I expect that my understanding is incorrect. The copy itself is very fast, completing in a few seconds.

I assume that the requested mount option is safe. Is it safe to use it this way:
$ sudo mount -o remount,nodelalloc /data1
or should I also add the other current options, 'noatime' (from fstab) and 'stripe=640' (automatic)? Or should I reboot with this option in fstab?

Below is what I did now; is this what you need?

$ sudo perf record -p 1366946 sleep 60
[ perf record: Woken up 37 times to write data ]
[ perf record: Captured and wrote 9.079 MB perf.data (237339 samples) ]
$ sudo perf report --no-children --stdio -i perf.data
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 237K of event 'cycles:P'
# Event count (approx.): 258097341907
#
# Overhead Command Shared Object Symbol
# ........ ............... ................. ..........................................
#
53.90% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_good_group
13.71% kworker/u16:4+f [kernel.kallsyms] [k] ext4_get_group_info
7.39% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_find_good_group_avg_frag_lists
6.96% kworker/u16:4+f [kernel.kallsyms] [k] __rcu_read_unlock
5.01% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_scan_aligned
4.51% kworker/u16:4+f [kernel.kallsyms] [k] __rcu_read_lock
1.70% kworker/u16:4+f [kernel.kallsyms] [k] mb_find_order_for_block
0.89% kworker/u16:4+f [kernel.kallsyms] [k] xas_descend
0.85% kworker/u16:4+f [kernel.kallsyms] [k] filemap_get_entry
0.81% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_regular_allocator
0.77% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_load_buddy_gfp
0.54% kworker/u16:4+f [kernel.kallsyms] [k] xas_load
0.50% kworker/u16:4+f [kernel.kallsyms] [k] ext4_mb_unload_buddy
0.42% kworker/u16:4+f [kernel.kallsyms] [k] _raw_read_unlock
0.40% kworker/u16:4+f [kernel.kallsyms] [k] mb_find_extent
0.40% kworker/u16:4+f [kernel.kallsyms] [k] _raw_spin_trylock
0.25% kworker/u16:4+f [kernel.kallsyms] [k] __filemap_get_folio
0.24% kworker/u16:4+f [kernel.kallsyms] [k] _raw_read_lock
0.13% kworker/u16:4+f [kernel.kallsyms] [k] xas_start
0.12% kworker/u16:4+f [kernel.kallsyms] [k] mb_find_buddy
0.10% kworker/u16:4+f [kernel.kallsyms] [k] pagecache_get_page
0.08% kworker/u16:4+f [kernel.kallsyms] [k] folio_mark_accessed
0.05% kworker/u16:4+f [kernel.kallsyms] [k] folio_test_hugetlb
0.04% kworker/u16:4+f [kernel.kallsyms] [k] _raw_spin_unlock
0.02% kworker/u16:4+f [kernel.kallsyms] [k] __cond_resched
0.02% kworker/u16:4+f [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% kworker/u16:4+f [kernel.kallsyms] [k] gen8_irq_handler
0.01% kworker/u16:4+f [kernel.kallsyms] [k] _raw_spin_lock
0.01% kworker/u16:4+f [kernel.kallsyms] [k] native_irq_return_iret
0.01% kworker/u16:4+f [kernel.kallsyms] [k] native_write_msr
0.01% kworker/u16:4+f [kernel.kallsyms] [k] timekeeping_advance
0.01% kworker/u16:4+f [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.01% kworker/u16:4+f [kernel.kallsyms] [k] native_read_msr
<then many zero% items>
Hey Eyal,

So the way most file systems handle their writes is:

1. Data is written to memory buffers aka the pagecache.
2. When writeback/flush kicks in, the FS tries to group adjacent data together and allocates disk blocks for it.
3. Finally, the data is sent to the lower levels like the block layer -> raid -> scsi etc. for the actual write.

The practice of delaying the actual allocation till writeback/flush is known as delayed allocation or delalloc in ext4 and is on by default (other FSes might have different names). This is why the ext4 allocation related functions (ext4_mb_regular_allocator etc.) show up in your perf report of the flusher thread.

With delalloc, we are sending bigger requests to the ext4 allocator since we try to group buffers together. With nodelalloc we disable this, so fs block allocation happens when we are dirtying the buffers (in step 1 above) and we only allocate as much as that write asked for, thus sending smaller requests at a time.

Since with delalloc we see that your flusher seemed to be taking a lot of time in ext4 allocation routines, I wanted to check if a change in allocation pattern via nodelalloc could help us narrow down the issue. Using:

$ sudo mount -o remount,nodelalloc /data1

should be safe and preserve your other mount options, so you can give it a try.

Lastly, thanks for the perf report; however, I'm sorry, I forgot to mention that I was actually looking for the call graph, which can be collected as follows:

$ sudo perf record -p 1234 -g sleep 60

Can you please share the report of the above command. Thanks!

Ojaswin
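A simple way to watch those stages from userspace while the experiment runs (a sketch; md127 and /data1 are the names used earlier in this thread):

$ watch -n 2 "grep -E 'Dirty|Writeback' /proc/meminfo"   # steps 1/2: pagecache waiting for or under writeback
$ iostat -x md127 2                                      # step 3: what actually reaches the block/raid layer
$ grep /data1 /proc/mounts                               # confirms whether nodelalloc (and stripe=) are in effect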
$ sudo perf record -p 1531839 -g sleep 60 [ perf record: Woken up 191 times to write data ] [ perf record: Captured and wrote 47.856 MB perf.data (238391 samples) ] $ sudo perf report --no-children --stdio -i perf.data # To display the perf.data header info, please use --header/--header-only options. # # # Total Lost Samples: 0 # # Samples: 238K of event 'cycles:P' # Event count (approx.): 262920477580 # # Overhead Command Shared Object Symbol # ........ ............... ................. .......................................... # 58.48% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_good_group | |--56.07%--ext4_mb_good_group | | | --55.85%--ext4_mb_find_good_group_avg_frag_lists | ext4_mb_regular_allocator | ext4_mb_new_blocks | ext4_ext_map_blocks | ext4_map_blocks | ext4_do_writepages | ext4_writepages | do_writepages | __writeback_single_inode | writeback_sb_inodes | __writeback_inodes_wb | wb_writeback | wb_workfn | process_one_work | worker_thread | kthread | ret_from_fork | ret_from_fork_asm | --2.16%--ext4_get_group_info ext4_mb_good_group | --2.16%--ext4_mb_find_good_group_avg_frag_lists ext4_mb_regular_allocator ext4_mb_new_blocks ext4_ext_map_blocks ext4_map_blocks ext4_do_writepages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ret_from_fork_asm 15.69% kworker/u16:5+f [kernel.kallsyms] [k] ext4_get_group_info | --15.10%--ext4_get_group_info | --15.05%--ext4_mb_good_group | --15.01%--ext4_mb_find_good_group_avg_frag_lists ext4_mb_regular_allocator ext4_mb_new_blocks ext4_ext_map_blocks ext4_map_blocks ext4_do_writepages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ret_from_fork_asm 9.51% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_find_good_group_avg_frag_lists | --9.26%--ext4_mb_find_good_group_avg_frag_lists ext4_mb_regular_allocator ext4_mb_new_blocks ext4_ext_map_blocks ext4_map_blocks ext4_do_writepages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ret_from_fork_asm 6.63% kworker/u16:5+f [kernel.kallsyms] [k] __rcu_read_unlock | |--4.56%--__rcu_read_unlock | | | --4.54%--ext4_get_group_info | | | --4.53%--ext4_mb_good_group | | | --4.51%--ext4_mb_find_good_group_avg_frag_lists | ext4_mb_regular_allocator | ext4_mb_new_blocks | ext4_ext_map_blocks | ext4_map_blocks | ext4_do_writepages | ext4_writepages | do_writepages | __writeback_single_inode | writeback_sb_inodes | __writeback_inodes_wb | wb_writeback | wb_workfn | process_one_work | worker_thread | kthread | ret_from_fork | ret_from_fork_asm | --2.08%--ext4_get_group_info ext4_mb_good_group | --2.07%--ext4_mb_find_good_group_avg_frag_lists ext4_mb_regular_allocator ext4_mb_new_blocks ext4_ext_map_blocks ext4_map_blocks ext4_do_writepages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ret_from_fork_asm 4.32% kworker/u16:5+f [kernel.kallsyms] [k] __rcu_read_lock | --4.30%--__rcu_read_lock | --4.29%--ext4_get_group_info | --4.27%--ext4_mb_good_group | --4.26%--ext4_mb_find_good_group_avg_frag_lists ext4_mb_regular_allocator ext4_mb_new_blocks ext4_ext_map_blocks ext4_map_blocks ext4_do_writepages ext4_writepages 
do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ret_from_fork_asm 1.98% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_scan_aligned | --1.98%--ext4_mb_scan_aligned ext4_mb_regular_allocator ext4_mb_new_blocks ext4_ext_map_blocks ext4_map_blocks ext4_do_writepages ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork ret_from_fork_asm 0.50% kworker/u16:5+f [kernel.kallsyms] [k] mb_find_order_for_block 0.44% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_load_buddy_gfp 0.38% kworker/u16:5+f [kernel.kallsyms] [k] xas_descend 0.33% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_regular_allocator 0.30% kworker/u16:5+f [kernel.kallsyms] [k] filemap_get_entry 0.24% kworker/u16:5+f [kernel.kallsyms] [k] xas_load 0.18% kworker/u16:5+f [kernel.kallsyms] [k] ext4_mb_unload_buddy 0.16% kworker/u16:5+f [kernel.kallsyms] [k] mb_find_extent 0.15% kworker/u16:5+f [kernel.kallsyms] [k] _raw_spin_trylock 0.14% kworker/u16:5+f [kernel.kallsyms] [k] _raw_read_unlock 0.10% kworker/u16:5+f [kernel.kallsyms] [k] _raw_read_lock 0.08% kworker/u16:5+f [kernel.kallsyms] [k] __filemap_get_folio 0.07% kworker/u16:5+f [kernel.kallsyms] [k] mb_find_buddy 0.05% kworker/u16:5+f [kernel.kallsyms] [k] xas_start 0.03% kworker/u16:5+f [kernel.kallsyms] [k] pagecache_get_page 0.03% kworker/u16:5+f [kernel.kallsyms] [k] _raw_spin_unlock 0.02% kworker/u16:5+f [kernel.kallsyms] [k] perf_adjust_freq_unthr_context 0.02% kworker/u16:5+f [kernel.kallsyms] [k] folio_mark_accessed 0.02% kworker/u16:5+f [kernel.kallsyms] [k] folio_test_hugetlb 0.01% kworker/u16:5+f [kernel.kallsyms] [k] native_write_msr 0.01% kworker/u16:5+f [kernel.kallsyms] [k] _raw_spin_lock 0.01% kworker/u16:5+f [kernel.kallsyms] [k] __cond_resched 0.01% kworker/u16:5+f [kernel.kallsyms] [k] timekeeping_advance 0.01% kworker/u16:5+f [kernel.kallsyms] [k] native_irq_return_iret 0.00% kworker/u16:5+f [kernel.kallsyms] [k] gen8_irq_handler
I now repeated the test with the new option. The result is that very few dirty buffers are recorded, but the rsync that usually takes a few seconds now stops for minutes at a time. The machine also stutters, for example as I open a small file on /data1. The following perf run is of the rsync.

$ sudo sysctl vm.drop_caches=3
$ sudo mount -o remount,nodelalloc /data1
$ (d="`date`" ; sudo rsync -aHSK --stats --progress --checksum-choice=none --no-compress -W /data2/no-backup/old-backups/tapes/13/file.1.data /data1/no-backup/old-backups/tapes-again/13/ ; echo "`date`: started $d")
$ sudo perf record -p 1531839 -g sleep 60
$ sudo perf report --no-children --stdio -i perf.data

I will add the logs as an attachment (I hope).
Created attachment 305411 [details] perf lest logs perf log for rsync, See comment #29
Hey Eyal,

Thanks for running the tests. It definitely feels like there is something going on on the ext4 allocator side. Ideally we shouldn't be taking this much time finding blocks in the FS, especially since the older kernels complete it quickly, so it shouldn't be a low-free-space issue.

I spent a good amount of time today trying to replicate it, to no avail, hence I'll have to ask you for some more data. Thanks to the perf report you've shared, I have an idea of some kernel areas we can probe to understand what's truly going on. I'll just share the perf probe and perf record commands below. Please let me know if it's not possible to run these on the affected distro, so we can think of some other way :)

** Adding probe points **
sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"
sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr iter->bb_group"
sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len ac->ac_2order"

** Recording using the probe points **
sudo perf record -e probe:ext4_mb_find_good_group_avg_frag_lists -e probe:ext4_mb_find_good_group_avg_frag_lists_L18 -e probe:ext4_mb_good_group_L20 -e ext4:ext4_mballoc_alloc -g -p <pid> sleep 20

** Saving the output **
sudo perf script -i perf.data > perf-probe.log

I tried these commands on the mainline kernel, but I hope they've not changed too much for your kernel. You *might* need to change the line numbers of the probes if that's the case, and the event names in the record command would also change accordingly. Please let me know if you need any more info from my end to get this done, I'd be glad to help.

Thanks again and regards,
ojaswin
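When repeating this, it helps to confirm the probe events exist and to clean them up afterwards; a small sketch using standard perf-probe options:

$ sudo perf probe -l                      # list the currently defined probe events
$ sudo perf script -i perf.data | head    # sanity-check that the probes actually fired
$ sudo perf probe -d '*'                  # delete the temporary probes when finished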
Looks like a basic misunderstanding on my side, or maybe more preparation is needed? What am I missing?

$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"
Failed to find the path for the kernel: Invalid ELF file
  Error: Failed to add events.

$ uname -a
Linux e7.eyal.emu.id.au 6.5.10-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Nov 2 19:59:55 UTC 2023 x86_64 GNU/Linux

$ ls -l /boot/*6.5.10*
-rw------- 1 root root 8727059 Nov 2 11:00 /boot/System.map-6.5.10-200.fc38.x86_64
-rw-r--r-- 1 root root 264245 Nov 2 11:00 /boot/config-6.5.10-200.fc38.x86_64
-rw------- 1 root root 39368262 Nov 9 10:55 /boot/initramfs-6.5.10-200.fc38.x86_64.img
lrwxrwxrwx 1 root root 46 Nov 9 10:55 /boot/symvers-6.5.10-200.fc38.x86_64.xz -> /lib/modules/6.5.10-200.fc38.x86_64/symvers.xz
-rwxr-xr-x 1 root root 14551752 Nov 2 11:00 /boot/vmlinuz-6.5.10-200.fc38.x86_64

I strace'd the command, and toward the end I see this (I have vmlinuz, not vmlinux - related?). These are the first failed 'openat()' calls in the trace:

openat(AT_FDCWD, "vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/boot/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/boot/vmlinux-6.5.10-200.fc38.x86_64", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/boot/vmlinux-6.5.10-200.fc38.x86_64", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/modules/6.5.10-200.fc38.x86_64/build/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/lib/modules/6.5.10-200.fc38.x86_64/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/debug/boot/vmlinux-6.5.10-200.fc38.x86_64.debug", O_RDONLY) = -1 ENOENT (No such file or directory)
newfstatat(AT_FDCWD, "/root/.debug/.build-id/d3/6a8ae40f88e310401726e275aa1940b279babd", {st_mode=S_IFDIR|0755, st_size=4096, ...}, 0) = 0
openat(AT_FDCWD, "/root/.debug/.build-id/d3/6a8ae40f88e310401726e275aa1940b279babd/kallsyms", O_RDONLY) = 3

Earlier I see this open succeed:

newfstatat(AT_FDCWD, "/lib/modules/6.5.10-200.fc38.x86_64/vmlinuz", {st_mode=S_IFREG|0755, st_size=14551752, ...}, 0) = 0

which is identical to the one in /boot. Should I boot a different kernel type? Remember that I am not a dev...
Hey Eyal,

Yes, you are right, it is not able to find a vmlinux with debug symbols. You can either download the debuginfo package of the corresponding kernel, which should place the vmlinux in the required location, or simply having a "vmlinux" with debug symbols in the same directory where you are running perf probe will do.

Also, once this error is resolved you *might* encounter something like "Error: Failed to show lines." To fix it you can install the kernel source code package or just use the -s flag to point to the source code. (Example of how I use it):

sudo perf probe -s ~/workspace/9p -a "ext4_mb_find_good_group_avg_frag_lists order"

Let me know if it still doesn't work. Thanks!

Ojaswin
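On Fedora the debuginfo route would look roughly like this (package names are the usual Fedora ones; the installed version must match the running kernel):

$ sudo dnf debuginfo-install kernel               # installs kernel-debuginfo (+ -common) for the running kernel
$ sudo perf probe -L ext4_mb_good_group | head    # if source lines are shown, perf has found the debug info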
Created attachment 305414 [details] detailed perf log Attachment for comment #34
$ sudo dnf debuginfo-install kernel
Installing: kernel-debuginfo
Installing dependencies: kernel-debuginfo-common-x86_64

$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"
$ sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr iter->bb_group"
$ sudo perf record -e probe:ext4_mb_find_good_group_avg_frag_lists -e probe:ext4_mb_find_good_group_avg_frag_lists_L18 -e probe:ext4_mb_good_group_L20 -e ext4:ext4_mballoc_alloc -g -p 2789843 sleep 20
[ perf record: Woken up 1395 times to write data ]
Warning: Processed 24640612 events and lost 22 chunks! Check IO/CPU overload!
[ perf record: Captured and wrote 4725.791 MB perf.data (19187380 samples) ]

$ sudo perf script -i perf.data > perf-probe.log
Warning: Processed 24640612 events and lost 22 chunks! Check IO/CPU overload!

$ ls -l perf-probe.log
-rw-r--r-- 1 eyal eyal 39024390135 Nov 17 12:31 perf-probe.log

I am limited to uploading much less than this as an attachment (file size limit: 5120 KB). I managed to include only so many lines from the head:

$ wc -l perf-probe-head.log
51594 perf-probe-head.log

See attachment https://bugzilla.kernel.org/attachment.cgi?id=305414
Hey Eyal,

So the trace data has given me an idea of what's going on.

Basically, in ext4 we maintain lists of FS block groups (BGs), where each list holds BGs based on the order of their free blocks (a BG with 64 free blocks goes in the order-6 list, one with 640 free blocks goes in the order-9 list, etc). In our case we are trying to allocate a stripe worth of blocks at a time, i.e. 640 blocks or roughly 2.5 MB, so ext4 looks at the order-9 list to find a BG that can satisfy the request. Unfortunately there seem to be a lot of BGs in the order-9 list (> 1000), but most of them don't have enough free blocks to satisfy the request, so we keep looping and calling ext4_mb_good_group() on each of them to see if any is good enough. Once we do find a good enough BG, due to striping we only look for blocks that are aligned to the stripe size, and once we don't find any we just start looping over the list again from the beginning (!!).

Although I have a good idea now, I'm not able to point my finger at the exact change in 6.5 that might have caused this. We did change the allocator to some extent and it might be related, but we need to dig a bit deeper to confirm.

Would it be possible to share the same perf record again? This time I'm adding a few more probes and removing -g so we can fit more in the 5 MB limit, and also the commands for Linux 6.4 so we can compare what's changed:

Linux 6.5+:

Probe adding commands:
sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists order"
sudo perf probe -a "ext4_mb_find_good_group_avg_frag_lists:18 cr iter->bb_group"
sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len ac->ac_2order"
sudo perf probe -a "ext4_mb_scan_aligned:26 i max"

Record command:
perf record -e probe:ext4_mb_find_good_group_avg_frag_lists_L18 -e probe:ext4_mb_good_group_L20 -e probe:ext4_mb_find_good_group_avg_frag_lists -e probe:ext4_mb_scan_aligned_L26 -e ext4:ext4_mballoc_alloc -p <pid> sleep 20

Linux 6.4.x:

Probe adding commands:
sudo perf probe -a "ext4_mb_choose_next_group_cr1:25 i iter->bb_group"
sudo perf probe -a "ext4_mb_good_group:20 free fragments ac->ac_g_ex.fe_len ac->ac_2order"
sudo perf probe -a "ext4_mb_scan_aligned:26 i max"

Record command:
sudo perf record -e probe:ext4_mb_choose_next_group_cr1_L25 -e probe:ext4_mb_good_group_L20 -e probe:ext4_mb_scan_aligned_L26 -e ext4:ext4_mballoc_alloc -p <pid> sleep 20

Thanks again for all your help on this!

Regards,
ojaswin
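To make the bucketing above concrete, here is a back-of-the-envelope illustration only (not ext4 code): the list index is roughly floor(log2(free blocks)), using the numbers from the description above.

```
# illustrative only: which order list a block group with N free blocks lands in
for free in 64 640 2048; do
    n=$free; order=0
    while [ "$n" -gt 1 ]; do n=$((n / 2)); order=$((order + 1)); done
    echo "free=$free blocks -> order-$order list"
done
# free=64 blocks -> order-6 list
# free=640 blocks -> order-9 list
# free=2048 blocks -> order-11 list
```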
Finally got it. Kernel 6.5 was easy. The kernel 6.4 debuginfo could not be found with 'dnf' and I searched long for it, finally locating it on koji. Then the third probe was wrong, and after some learning I discovered

$ sudo perf probe --line "ext4_mb_scan_aligned"
     21         while (i < EXT4_CLUSTERS_PER_GROUP(sb)) {
     22                 if (!mb_test_bit(i, bitmap)) {
     23                         max = mb_find_extent(e4b, i, sbi->s_stripe, &ex);
     24                         if (max >= sbi->s_stripe) {
     25                                 ac->ac_found++;
     26                                 ex.fe_logical = 0xDEADF00D; /* debug value */
                                        ac->ac_b_ex = ex;
     28                                 ext4_mb_use_best_found(ac, e4b);
     29                                 break;
                                }
                        }
     32                 i += sbi->s_stripe;
                }

and decided to pick line 24 (not the original 26). Is this right? I will now upload the reports.
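For anyone following along, the shifted probe would presumably be added like this (same variables as the original suggestion, just anchored at line 24 of the listing above):

```
sudo perf probe -a "ext4_mb_scan_aligned:24 i max"
sudo perf probe --list    # confirm the new probe point is registered
```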
Created attachment 305417 [details]
more perf reports for comment #37
Created attachment 305418 [details]
more perf reports for comment #37
Hey Eyal,

Thanks for the data, the perf probes you added were correct!

I see that the problem is as I suspected: we keep looping trying to find aligned blocks in ext4 when probably none of them exist. Aligned allocation right now is only done as an optimization when the stripe mount option is passed. Currently we don't seem to fall back to normal allocation if aligned allocation doesn't work, and this causes the very long, seemingly infinite looping.

I can try to work on a patchset that fixes this, however as a temporary fix you can continue with the stripe mount option turned off for ext4. This will instruct ext4 to just use normal allocation rather than aligned allocation. One point to note is that -o stripe=xyz is sometimes automatically added during mount even when we don't pass it; you can look at comments #6, #7 and #8 in this bug for more info. To confirm it's off you can look at the /proc/fs/ext4/<dev>/options file, which lists all the currently active mount options; you shouldn't see stripe there.

Further, this is not something that changed between 6.4 and 6.5, however it seems the allocator changes in 6.5 made it even more difficult to come out of this loop, thus prolonging the time taken to flush.

Also, just wanted to check if you have any non-prod setup where you'd be open to compiling a kernel with patches to see if we are able to fix the issue.

Lastly, thank you so much for all the probe data and logs, it's been a huge help :)

Regards,
ojaswin
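If it helps, this is one way to check whether stripe is currently in effect; md127 is only an example device name, substitute the entry under /proc/fs/ext4/ for your filesystem:

```
# list the active ext4 options for the device; no "stripe=" line means
# stripe-aligned allocation is not being used
grep stripe /proc/fs/ext4/md127/options || echo "stripe not set"
```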
OK, so how do I disable the stripe=640 that was automatically added? Is there a mount option to disable it? I would rather not futz with tune2fs.

RE kernel build: I used to do this often, maybe 30 years ago, but on a workstation, not on the server. I expect the build to be simple, but it is the risk of a bad patch that worries me. I would rather not do that.

BTW, while you are looking at the cache issue, what are your thoughts on the fact that the array shows (iostat -x) a very high w_await when the members are much lower? Is this normal?

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
16:04:08 md127 0.20 0.80 0.00 0.00 10.50 4.00 9.80 739.20 0.00 0.00 1495.44 75.43 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.66 4.51
16:04:08 sdb 2.30 187.20 44.50 95.09 30.74 81.39 6.80 345.20 79.70 92.14 29.72 50.76 0.00 0.00 0.00 0.00 0.00 0.00 0.60 14.33 0.28 1.26
16:04:08 sdc 1.90 138.80 32.80 94.52 16.63 73.05 6.60 295.60 67.50 91.09 13.06 44.79 0.00 0.00 0.00 0.00 0.00 0.00 0.60 11.33 0.12 1.05
16:04:08 sdd 2.30 194.40 46.30 95.27 3.70 84.52 2.90 138.80 32.00 91.69 3.34 47.86 0.00 0.00 0.00 0.00 0.00 0.00 0.60 4.67 0.02 0.54
16:04:08 sde 1.80 204.00 49.20 96.47 14.89 113.33 4.10 302.40 71.70 94.59 13.27 73.76 0.00 0.00 0.00 0.00 0.00 0.00 0.60 6.17 0.08 1.66
16:04:08 sdf 1.90 97.20 22.40 92.18 14.63 51.16 4.70 212.00 48.50 91.17 19.70 45.11 0.00 0.00 0.00 0.00 0.00 0.00 0.60 11.50 0.13 1.98
16:04:08 sdg 2.00 212.00 51.00 96.23 7.80 106.00 4.70 279.20 65.30 93.29 8.32 59.40 0.00 0.00 0.00 0.00 0.00 0.00 0.60 8.00 0.06 1.86
16:04:08 sdh 2.20 213.20 51.10 95.87 21.73 96.91 4.80 315.60 74.30 93.93 27.77 65.75 0.00 0.00 0.00 0.00 0.00 0.00 0.60 14.33 0.19 2.05
Hey Eyal,

The stripe size can be set to 0 by remounting the FS as follows:

mount -o remount,stripe=0 <dev>

This should allow us to bypass aligned allocation.

Regarding building the kernel, sure, I understand your concerns.

Regarding the util of the RAID array, I don't think I have enough experience dealing with RAID devices to comment on whether it is normal behavior or not. Still, I'll try to look a bit more into it to see if I can think of something.

Regards,
ojaswin
Thanks Ojaswin,

The mount option did the trick, things are drastically faster now. I am now finishing the actual copy that I was unable to complete.

The reason I ask about the raid is that it writes so slowly that I wonder if it is just the looping thread or maybe something more. The large difference between what the array reports and what the members show (in iostat -x) is beyond what I expected as the array's write amplification.

I will keep watching this thread.

Regards, Eyal
Hi Ojaswin,

I'm still interested in a patch for this. I may find a way to give it a try. This aligned allocation should work for most people, otherwise more of them would be here. Also, as the FS is automagically mounted with a stripe value (based on the RAID info), it would be nice to avoid future misbehaviour like this.
Hi Eduard,

Sure. I'm on vacation this week; before that I found some time to try and replicate this in a simulated environment, and I will send a patch once I'm back.

Regards,
Ojaswin
Hello,

I've posted a patch that, as per my understanding of the issue, should be able to fix the halt in kernels above 6.5.

As I was not able to completely replicate the issue, I relied on the probes and other logs collected here to root-cause this, so I'd really appreciate it if the folks here could help test it on the setups where the issue was being observed.

Link to patch: https://lore.kernel.org/linux-ext4/cover.1702455010.git.ojaswin@linux.ibm.com/T/#t

Regards,
ojaswin
I've posted an easy reproducer here: https://marc.info/?l=linux-raid&m=170327844709957&w=2

You can adjust the VM settings to make sure the whole tree fits in the RAM cache, depending on the RAM of your machine. By default the FS is mounted with stripe=1280 because it's on a raid6. Remounting with stripe=0 works around the problem. Excellent!

Also, this affects all 6.* kernels, as I said in the above post.
Independent of the fixes to the mballoc code to improve the allocation performance, I'm wondering about the ''RAID stride'' values in use here. The "stride" value is intended to be the size of one complete set of disks (e.g. 128KiB chunk size * 8 data disks = 1MiB). The filesystem doesn't see the parity disks, so the number of those disks does not matter to ext4.

> RAID stride: 32752

> In my case, I have an EXT4 partition over an mdadm raid 1 array of two HDD.
> RAID stride: 32745

It seems in all these cases that the stripe/stride is strange. I can't see any value to setting stride to (almost) 128MB, especially not on a RAID-1 system. Were these values automatically generated by mke2fs, or entered manually? If manually, why was that value chosen? If there is something unclear in the documentation it should be fixed, and the same if there is something wrong in mke2fs detection of the geometry.

> By default the FS is mounted with stripe=1280 because it's on a raid6.
> Remounting with stripe=0 works around the problem. Excellent!

Carlos, how many data disks in this system? Do you have 5x 256KiB or 10x 128KiB *data* disks, plus 2 *parity* disks?
Hi Ojaswin,

I gave your patch a try and indeed it fixes the issue. I started by compiling a 'vanilla' 6.5.13 fedora kernel (latest available on F37) and tested it, just to have a baseline; as expected, the issue kicked in. I then added your patch, recompiled, and the issue was gone.

Andreas, I am positively sure that I never set or changed any stride value on my own, mostly because I have no idea what it is and how it interacts with the FS or RAID. I don't know whether it can be changed or is set once. So it was chosen by the tool I used to create the partition, the RAID or the ext4 FS.
After Andreas' comment, I took a look at the RAID wiki and changed the stripe/stride values:

tune2fs -E stride=32,stripe-width=32 /dev/md0p1

which may or may not be optimal for a RAID1. I am running a 6.5.12 kernel (the Fedora released one, not the previously compiled one), and it also makes the issue go away.
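In case it is useful to others, the recorded values can be double-checked from the superblock afterwards (device path taken from the comment above):

```
# show what the superblock now records for the RAID geometry
sudo dumpe2fs -h /dev/md0p1 | grep -i -E 'stride|stripe'
```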
bugzilla-daemon@kernel.org wrote on Fri, Dec 22, 2023 at 10:48:39PM -03:
> --- Comment #48 from Andreas Dilger (adilger.kernelbugzilla@dilger.ca) ---
> Carlos, how many data disks in this system? Do you have 5x 256KiB or
> 10x 128KiB *data* disks, plus 2 *parity* disks?

There are 10 data disks. The stride, determined automatically by mke2fs, is correct since the chunk is 512KiB.
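A worked example with the geometry described here, assuming 4KiB filesystem blocks (/dev/md0 is just a placeholder, not the actual device):

```
# stride       = chunk / block size  = 512KiB / 4KiB = 128 blocks
# stripe_width = stride * data disks = 128 * 10      = 1280 blocks (matches stripe=1280 at mount)
mkfs.ext4 -E stride=128,stripe_width=1280 /dev/md0
```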
I'm a bit late to reporting this, but it might be related to this bug report. After noticing this 100% CPU flush problem on my computer with a non-raid Samsung SSD 990 PRO NVMe and kernel 6.6.3 (and now 6.6.8), I decided to do a git bisect and seem to have tracked the problem down to commit 7e170922f06bf46effa7c57f6035fc463d6edc7e, if that helps, although I haven't tested whether disabling CR1_5 also resolves the problem.

Running "tar xf linux-6.6.8.tar.xz" is enough to trigger 100% CPU for a few seconds after tar is finished on my system, although it doesn't noticeably trigger the problem on my server with raid 6 and hard drives.

It also looks like my ext4 filesystem is mounted with stripe=32733 according to /proc/mounts, although dumpe2fs -h doesn't show any stripe settings on the filesystem.

I'll try the patch and see if it fixes the problem.
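A minimal version of the trigger described above, for anyone who wants to watch for the symptom locally (the tarball is just the one mentioned; any large untar on the affected filesystem should do):

```
tar xf linux-6.6.8.tar.xz
# after tar exits, look for a kworker/...+flush-... thread pinned near 100% CPU
top -d 1 -o %CPU
```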
Oh, I just rechecked and I do have RAID stride set to 32733 on my filesystem, which I think I did to optimise for SSD blocks. Also, another test which makes the flush take a really long time to complete on my system is "cp -ai linux-6.6.8 linux-6.6.8_orig".
(In reply to Ojaswin Mujoo from comment #46)
> Link to patch:
> https://lore.kernel.org/linux-ext4/cover.1702455010.git.ojaswin@linux.ibm.com/T/#t

It looks like that patch has fixed the problem for my system. Thanks. I still haven't checked whether disabling CR1_5 from 7e170922f06bf46effa7c57f6035fc463d6edc7e also resolves the problem, although I'm still planning to try that out as well.
It looks like reversing the changes made in 7e170922f06bf46effa7c57f6035fc463d6edc7e also fixed the problem (I didn't notice before that the names were changed in the very next commit as well).

In fs/ext4/mballoc.c, commenting out "new_cr = CR_GOAL_LEN_SLOW" in ext4_mb_choose_next_group_goal_fast wasn't enough to fix it, so I then did the following as well, which fixed the problem:
- commented out the call to ext4_mb_choose_next_group_best_avail in ext4_mb_choose_next_group
- commented out the code to "Reset goal length to original goal length before falling into CR_GOAL_LEN_SLOW"
Correction: It was new_cr = CR_BEST_AVAIL_LEN that I commented out
Also, it's a bit hard to tell, but even with the patch from https://lore.kernel.org/linux-ext4/cover.1702455010.git.ojaswin@linux.ibm.com/T/#t, I think flush might still be using more CPU compared to when I disable ext4_mb_choose_next_group_best_avail, even though it isn't stalling anymore.
Maybe it's just because different code is being executed, but running the tests a few more times, top does seem to show less kernel CPU usage while extracting thunderbird-115.5.0.source.tar.xz (https://archive.mozilla.org/pub/thunderbird/releases/115.5.0/source/thunderbird-115.5.0.source.tar.xz) with CR_BEST_AVAIL_LEN disabled, even though the actual extract operation takes the same amount of time with both patch methods and all the CPU usage goes away in both cases once tar is finished.
Hi Matthew,

Thanks for testing out the patch and for sharing the results with and without CR_BEST_AVAIL_LEN commented out.

Would it be possible for you to add the perf probes as mentioned in comment 36 [1] and share the result? That'd be helpful in understanding why CR_BEST_AVAIL_LEN is getting stuck. Basically, I am interested in seeing how the scan is happening in CR_GOAL_LEN_FAST vs CR_BEST_AVAIL_LEN.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217965#c36

Also, another thing I'd request you to try: instead of commenting out the code, can you run

$ echo "0" > /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order

and rerun the test? This disables most of the code but keeps some of the initial logic, which might help zero in on whether the issue is in the initial logic or in the disabled code.

Thanks again,
ojaswin
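For reference, the tunable can be inspected and reset roughly like this; md127 is only an example device name, and the default value of 3 is the one mentioned later in the thread:

```
# check the current value, then disable the CR_BEST_AVAIL_LEN trimming
cat /sys/fs/ext4/md127/mb_best_avail_max_trim_order
echo 0 | sudo tee /sys/fs/ext4/md127/mb_best_avail_max_trim_order
```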
It looks like echo "0" > /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order stops flush from getting stuck, but the flush kworker still uses more CPU during the write operation. Hopefully I can get the perf results over the next few days.
Hi Matthew, thanks for confirming.

So, as pointed out in comment 9 [1], for that user the steps to disable the CR_BEST_AVAIL_LEN code did not fix the issue. My suspicion is that this issue can occur either in the CR_GOAL_LEN_FAST or the CR_BEST_AVAIL_LEN lookup, depending on the block groups being searched. Probably for you it was occurring during the CR_BEST_AVAIL_LEN lookup, and hence disabling that code fixed it.

Further, as Carlos pointed out above, they are able to see this in all 6.* kernels, which means this was happening before CR_BEST_AVAIL_LEN was introduced; however, it seems to be much easier to trigger in 6.5+ kernels.

Now, assuming the above theory is correct, in cases where this is triggered from CR_GOAL_LEN_FAST it should ideally be easy to hit in pre-6.5 kernels as well, but it isn't, which makes me think there might be some other related changes in 6.5 that make it easier to trigger. I'll try to play around a bit more with this.

Also, as for the higher CPU usage, how high are we talking about? CR_BEST_AVAIL_LEN does add some extra cycles in exchange for generally faster allocation in fragmented filesystems, however since you have disabled it we shouldn't ideally be seeing that overhead. Also, does the CPU util consistently drop when you comment out that code?

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217965#c9
With top at 1-second intervals during the thunderbird source extract, I can see the flush process at around 70% to 100% for two 1-second intervals. I don't think this much CPU usage was present with CR_BEST_AVAIL_LEN commented out. Hopefully the perf probes will show where the real usage is. I still haven't tried out the perf probes, but I'm planning on doing that sometime in the next few days.
I can't seem to trigger the problem now, even with all the patches disabled and /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order set to the default of 3, so unfortunately I won't be able to help with this problem for now.

According to perf record -ag sleep 5, at the moment most of the ext4 CPU usage is in ext4_get_group_desc and ext4_get_group_info, both called from ext4_mb_prefetch, with some usage in ext4_mb_good_group called from ext4_mb_find_good_group_avg_frag_lists. Both ext4_mb_prefetch and ext4_mb_find_good_group_avg_frag_lists were called from ext4_mb_regular_allocator. But this is all fairly low CPU usage at the moment, and it all stops shortly after the file operation is completed.
After patching the ext4_mb_choose_next_group set of functions with noinline_for_stack, perf now includes them in the report (perf report --stdio -i perf.data). The usage in ext4_mb_regular_allocator is only about 1.39% (compared to 17.59% for ext4_get_group_desc and ext4_get_group_info), with ext4_mb_choose_next_group_goal_fast using 0.74% and ext4_mb_choose_next_group_best_avail using 0.58%, which indicates ext4_mb_choose_next_group_best_avail isn't stalling for me like it was before.
I'm not sure if this will be worth looking into, but regarding the CPU usage problem, I've done some more testing and "perf report --stdio -i perf.data" is showing around 2 to 3 times as much CPU time since the commit that added the CR_BEST_AVAIL_LEN feature (7e170922f06bf). With /sys/fs/ext4/<dev>/mb_best_avail_max_trim_order set to 0 it goes a bit lower, but is still at least twice the CPU. With perf record -ag sleep 25, which is the time it takes the extract to run, it goes from about 2% to 5%, so fairly small, but when measuring with 5-second intervals it can go from around 5% to 15%. This CPU usage is concentrated in ext4_get_group_desc and ext4_get_group_info, and within these functions perf is picking up more CPU time in __rcu_read_lock and __rcu_read_unlock.
I think I'm experiencing the same problem: a relatively big (1.5T), relatively empty (716 G used) ext4 partition on a slow-ish SATA magnetic HDD (ST3000DM007-1WY10G), stripe=32747, 6.5.0-14-generic kernel from the Ubuntu 22.04 HWE stack. During the installation of COMSOL Multiphysics 6.2 (which essentially untars ~16 GB of stuff) I got a kworker flush process stuck at 100% of CPU for multiple hours, and the IO cache filling my 32 GB of RAM. I tried the various parameter tweaks suggested here with no particular change of pace (although maybe the pending IO still worked according to the old values?). I run the bfq scheduler on this disk, but I tried to switch to pretty much all other schedulers mid-process with no perceptible change. After ~3 hours of agonizingly slow progress (with disk write performance for the installation often in the tens of KB/s) I decided to abort the installation. I rebooted into the Ubuntu GA kernel (5.15.0-92-generic) without touching any specific parameter and restarted the installation; this time it completed in ~10 minutes.
Hi,

I just stumbled upon this issue, which it seems is the same one I am suffering from on a Raspberry Pi 5. The OS is Arch Linux ARM with linux-rpi 6.6.20, running on an NVMe SSD. After reading this issue, I checked the ext4 mount options and found stripe=8191.

As a first test I downgraded linux-rpi to 6.1.69 and my issue of a CPU-hogging kworker process when pulling and creating docker containers did not happen anymore. With kernel 6.6.20 and all earlier 6.6.x versions, docker operations were really slow; in some cases pulling and recreating a container took over 1000 seconds. With kernel 6.1.69 these operations are back to normal speed.

As a next step I would like to try to set stripe=0. What is the right approach here, as I am a bit unsure? Is it ok to remount the running system partition with

mount -o remount,stripe=0 <dev>

without any risks for a first test? And to change permanently to stripe=0, is it ok and sufficient to run

tune2fs -f -E stride=0 <dev>

on the mounted system partition without any risks? I would be thankful for a helping hand here so I don't do anything wrong or dangerous. I hope this works around my issue for now, until a fix has been found.

Thank you very much for looking into this!

Kind regards
denk
Hello Denk,

Both of the commands you mentioned, "mount -o remount,stripe=0 <dev>" and "tune2fs -f -E stride=0 <dev>", should be okay to run and aren't dangerous for your filesystem. You can always take a backup to be extra cautious ;)

That being said, this issue was fixed by the following patch [1], which I believe landed in linux kernel 6.7, so anything above that should help you avoid the issue as well.

Regards,
Ojaswin

[1] https://lore.kernel.org/linux-ext4/cover.1702455010.git.ojaswin@linux.ibm.com
Seeing this last comment (#68), I wondered if the fix is included in my latest kernel. Or do I have a different problem?

After just updating my fedora 38 to kernel 6.7.9 I did NOT do the remount and repeated my original problematic test.

$ uname -s
Linux e7.eyal.emu.id.au 6.7.9-100.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 6 19:31:16 UTC 2024 x86_64 GNU/Linux
$ mount|grep data1
/dev/md127 on /data1 type ext4 (rw,noatime,stripe=640)

I rsync a directory (262GB, 8,911,139 files) from an SSD to a raid6 (7x12TB => 60TB fs). Both are ext4.

$ sudo time rsync -aHSK --stats --progress /data/no-backup/old-backups/tapes /data1/no-backup/really-not/old-backups/

What I see is that once meminfo shows the limit (4GB) was reached, the kworker kicks in at 100% CPU. At that point iostat shows activity on the array dropping, from about 160MB/s to very little (with an occasional burst of a few MB/s).

```
10:12:33 2024-03-18 10:12:33
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
[trimmed]
10:17:13 md127 0.00 0.00 0.00 0.00 0 0 0
10:17:23 md127 34.40 137.60 0.00 0.00 1376 0 0   <<< rsync starts
10:17:33 md127 2346.30 747.20 8638.00 0.00 7472 86380 0
10:17:43 md127 7067.10 431.60 133644.40 0.00 4316 1336444 0
10:17:53 md127 1692.80 578.80 7015.20 0.00 5788 70152 0
10:18:03 md127 2439.20 169.60 32071.20 0.00 1696 320712 0
10:18:13 md127 274.00 4.00 2242.00 0.00 40 22420 0
10:18:23 md127 3172.70 17.60 56828.00 0.00 176 568280 0
10:18:33 md127 416.20 0.80 1664.80 0.00 8 16648 0
10:18:43 md127 18.70 0.40 76.40 0.00 4 764 0
10:18:53 md127 6.50 0.00 30.80 0.00 0 308 0
10:19:03 md127 4.80 0.00 40.00 0.00 0 400 0
10:19:13 md127 5.70 0.00 63.60 0.00 0 636 0
10:19:23 md127 2.60 0.00 54.80 0.00 0 548 0
10:19:33 md127 7.40 0.00 243.20 0.00 0 2432 0
10:19:43 md127 5.20 0.00 75.60 0.00 0 756 0
10:19:53 md127 3.80 0.00 20.40 0.00 0 204 0
10:20:03 md127 2.00 0.00 13.20 0.00 0 132 0
10:20:13 md127 3.90 0.00 29.20 0.00 0 292 0
10:20:23 md127 3.80 0.00 19.60 0.00 0 196 0
```

At the same time meminfo shows:

```
2024-03-18 10:17:04 Dirty: 11220 kB Buffers: 829988 kB MemFree: 670576 kB
2024-03-18 10:17:14 Dirty: 10784 kB Buffers: 830016 kB MemFree: 631500 kB
2024-03-18 10:17:24 Dirty: 750616 kB Buffers: 875592 kB MemFree: 654236 kB
2024-03-18 10:17:34 Dirty: 2757048 kB Buffers: 972948 kB MemFree: 600636 kB
2024-03-18 10:17:44 Dirty: 2855196 kB Buffers: 1046736 kB MemFree: 551940 kB
2024-03-18 10:17:54 Dirty: 4104524 kB Buffers: 1127200 kB MemFree: 538136 kB
2024-03-18 10:18:04 Dirty: 4390504 kB Buffers: 1155588 kB MemFree: 600828 kB
2024-03-18 10:18:14 Dirty: 4518280 kB Buffers: 1161916 kB MemFree: 580176 kB
2024-03-18 10:18:24 Dirty: 4356952 kB Buffers: 1185872 kB MemFree: 543072 kB
2024-03-18 10:18:34 Dirty: 4559504 kB Buffers: 1196396 kB MemFree: 518872 kB
2024-03-18 10:18:44 Dirty: 4567212 kB Buffers: 1197060 kB MemFree: 606572 kB
2024-03-18 10:18:54 Dirty: 4567592 kB Buffers: 1197084 kB MemFree: 611440 kB
... and stays there until ...
```

I then killed the copy (14,296MB copied). The writing to the array remained very low, the kernel thread stayed at 100% and meminfo drained very slowly. Access to the array is now slow with some hiccups.
```
2024-03-18 10:35:24 Dirty: 4484720 kB Buffers: 4984308 kB MemFree: 820532 kB   <<< rsync killed
2024-03-18 10:35:34 Dirty: 4484436 kB Buffers: 4984348 kB MemFree: 851288 kB
2024-03-18 10:35:44 Dirty: 4483992 kB Buffers: 4984368 kB MemFree: 817516 kB
2024-03-18 10:35:54 Dirty: 4483780 kB Buffers: 4984400 kB MemFree: 803156 kB
2024-03-18 10:36:04 Dirty: 4483704 kB Buffers: 4984460 kB MemFree: 809956 kB
2024-03-18 10:36:14 Dirty: 4479416 kB Buffers: 4984496 kB MemFree: 832980 kB
2024-03-18 10:36:24 Dirty: 4474312 kB Buffers: 4984528 kB MemFree: 881464 kB
2024-03-18 10:36:34 Dirty: 4474260 kB Buffers: 4984568 kB MemFree: 840444 kB
2024-03-18 10:36:44 Dirty: 4474132 kB Buffers: 4984600 kB MemFree: 843524 kB
2024-03-18 10:36:54 Dirty: 4474292 kB Buffers: 4984640 kB MemFree: 841004 kB
2024-03-18 10:37:04 Dirty: 4474052 kB Buffers: 4984680 kB MemFree: 834148 kB
2024-03-18 10:37:14 Dirty: 4473688 kB Buffers: 4984712 kB MemFree: 853200 kB
2024-03-18 10:37:24 Dirty: 4473448 kB Buffers: 4984752 kB MemFree: 782540 kB
2024-03-18 10:37:34 Dirty: 4473288 kB Buffers: 4984776 kB MemFree: 786100 kB
2024-03-18 10:37:44 Dirty: 3871768 kB Buffers: 4984972 kB MemFree: 846020 kB
2024-03-18 10:37:54 Dirty: 3871612 kB Buffers: 4985020 kB MemFree: 826664 kB
2024-03-18 10:38:04 Dirty: 3871736 kB Buffers: 4985052 kB MemFree: 826084 kB
2024-03-18 10:38:14 Dirty: 3871184 kB Buffers: 4985100 kB MemFree: 876572 kB
2024-03-18 10:38:24 Dirty: 3870936 kB Buffers: 4985140 kB MemFree: 918944 kB
2024-03-18 10:38:34 Dirty: 3648080 kB Buffers: 4985256 kB MemFree: 901336 kB
2024-03-18 10:38:44 Dirty: 3556612 kB Buffers: 4985316 kB MemFree: 902532 kB
2024-03-18 10:38:54 Dirty: 3551636 kB Buffers: 4985364 kB MemFree: 837816 kB
2024-03-18 10:39:04 Dirty: 3551968 kB Buffers: 4985468 kB MemFree: 823392 kB
2024-03-18 10:39:14 Dirty: 2835648 kB Buffers: 4985656 kB MemFree: 629428 kB
...
2024-03-18 11:05:25 Dirty: 2737096 kB Buffers: 4993860 kB MemFree: 599424 kB   <<< 30m later
2024-03-18 11:35:25 Dirty: 2573748 kB Buffers: 5001184 kB MemFree: 612288 kB   <<< again
2024-03-18 12:05:26 Dirty: 2432572 kB Buffers: 5007704 kB MemFree: 663928 kB   <<< again
2024-03-18 12:35:27 Dirty: 2145348 kB Buffers: 3707492 kB MemFree: 588464 kB   <<< again
2024-03-18 13:05:27 Dirty: 2017848 kB Buffers: 3718936 kB MemFree: 585500 kB   <<< again
2024-03-18 13:35:28 Dirty: 1822436 kB Buffers: 3746824 kB MemFree: 565560 kB   <<< again
2024-03-18 14:05:29 Dirty: 1595088 kB Buffers: 3799124 kB MemFree: 544504 kB   <<< again
2024-03-18 14:35:29 Dirty: 1498416 kB Buffers: 3816868 kB MemFree: 3883524 kB  <<< again
2024-03-18 15:05:30 Dirty: 1387140 kB Buffers: 3835824 kB MemFree: 3266060 kB  <<< again
...
2024-03-18 15:32:51 Dirty: 1284940 kB Buffers: 3850936 kB MemFree: 3088904 kB  <<< finally
2024-03-18 15:33:01 Dirty: 933268 kB Buffers: 3851144 kB MemFree: 3098840 kB
2024-03-18 15:33:11 Dirty: 51956 kB Buffers: 3851248 kB MemFree: 3095456 kB
2024-03-18 15:33:21 Dirty: 51968 kB Buffers: 3851284 kB MemFree: 3059212 kB
2024-03-18 15:33:31 Dirty: 52032 kB Buffers: 3851308 kB MemFree: 3085352 kB
2024-03-18 15:33:41 Dirty: 172 kB Buffers: 3851336 kB MemFree: 3090912 kB
2024-03-18 15:33:51 Dirty: 64 kB Buffers: 3851368 kB MemFree: 3030584 kB
```

So over 5 hours to copy this small part (14GB of 262GB) of the data. Is this expected? Is this a fundamental "feature" of ext4? Or of raid6?

When I did "sudo mount -o remount,stripe=0 /data1", the copy progressed nicely with good writing speed. I have logs of the progress of this test at 10s intervals.

Regards
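For anyone wanting to collect similar logs, a rough sketch of the kind of 10-second sampling shown above (the device name and log file names are assumptions):

```
# per-device iostat samples plus Dirty/Buffers/MemFree from /proc/meminfo, every 10 s
iostat -d md127 10 > iostat.log &
while sleep 10; do
    echo "$(date '+%F %T') $(grep -E '^(Dirty|Buffers|MemFree):' /proc/meminfo | tr -s ' ' | tr '\n' ' ')"
done > meminfo.log
```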
Hi Eyal,

My bad, it seems the patch landed in 6.8-rc1, as seen here [1].

[1] https://lore.kernel.org/linux-ext4/20240110203521.GA1013681@mit.edu/

Regards,
ojaswin
(In reply to Ojaswin Mujoo from comment #68)
> Both of the commands you mentioned "mount -o remount,stripe=0 <dev>" and
> "tune2fs -f -E stride=0 <dev>" should be okay to run and aren't dangerous
> for your filesystem.

Hello Ojaswin,

thank you very much! I changed stripe to 0 last weekend with "tune2fs -f -E stripe_width=0 <dev>", as "tune2fs -f -E stride=0 <dev>" did not change stripe to 0. Since then my system is back to normal performance. So far no further issues.

Thank you very much for all the work on ext4!

Best
denk