Bug 206643
Summary: | fuse freeze and usb devices | ||
---|---|---|---|
Product: | File System | Reporter: | taz.007 |
Component: | Other | Assignee: | io_other |
Status: | NEW --- | ||
Severity: | normal | CC: | kernel |
Priority: | P1 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 5.8.10 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
taz.007
2020-02-23 12:50:14 UTC
Unable to reproduce this with mergerfs 2.29.0 and vanilla kernel 5.5.0 here. Reading your description above, this is how my setup looks like: ``` $ for i in {0..4}; do pv -Ss 512m < /dev/zero > /var/tmp/test_${i}.img; done $ for f in /var/tmp/test_*; do losetup -f $f; done $ for b in /dev/loop[0-4]; do echo $b; mkfs.ext4 -qF $b; done $ mkdir /mnt/test/{0..4} $ for i in {0..4}; do echo $i; mount -t ext4 /dev/loop${i} /mnt/test/${i}; done $ mergerfs -o allow_other,threads=2,moveonenospc=true,ignorepponrename=true,category.create=mfs,func.unlink=ff,minfreespace=0,dropcacheonclose=true /mnt/test/0:/mnt/test/1:/mnt/test/2:/mnt/test/3:/mnt/test/4 /mnt/disk/ $ mount | tail -1 0:1:2:3:4 on /mnt/disk type fuse.mergerfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other) ``` With that in place, rsync is writing, reading and removing data to/from the mergerfs mount, but no lockups so far: ``` $ while true; do date; rsync -ax /usr/ /mnt/disk; tar -cf - /mnt/disk/ | pv > /dev/null; rm -rf /mnt/disk/*; echo; done ``` Can you maybe share your exacty commands, and try with a vanilla kernel perhaps? Enable more DEBUG options in the kernel config to see what's going on? The Arch kernel is very close to vanilla, with some patches related to drm, & wifi , so I don't think it is the culprit (https://git.archlinux.org/linux.git/log/?h=v5.5.3-arch1) the cmdline for rsync is : rsync -av --delete /media/WDALL/$DIRWITHLOTSOFBIGFILES/ /media/WDBACK/$SAMEDIRNAME/ with /media/WDALL the mergerfs mount point and /media/WDBACK a regular ext4 mount point As I can't reproduce unless I tweak the max_sectors, I think you would need to test with a real usb2 drive. Also, at the same time as rsync running, there is some light read/write happening on the same mergerfs mount (the mergerfs mount is shared via smbd). This probably increases the race condition. Do you have some specific CONFIG_*DEBUG* you want me to try ? Hm, yeah maybe it's USB related after all. Maybe set USB_STORAGE_DEBUG, and once CONFIG_DEBUG_KERNEL is set maybe enable some "Lock Debugging" and "Debug Oops, Lockups and Hangs" things, or even CONFIG_EXT4_DEBUG, or lower kernel.hung_task_timeout_secs, anything really that may trigger some messages instead of locking up quietly. I had to disable USB_STORAGE_DEBUG, way too much logs, system not that much usable while rsync was running with this kernel settings. I could reproduce several times, but no lock debug message in the kernel, until some hours ago, where the system seemed "more" stuck than usual. "sync" didn't suffice to unfreeze the system, i had to hdparm -tT each drive, then it unfreezed, and while doing this finally got a message from the kernel: [102601.484027] smbd D 0 1787 1 0x80004006 [102601.484039] Call Trace: [102601.484064] __schedule+0x305/0x8b0 [102601.484077] ? prepare_to_wait_event+0x59/0xe0 [102601.484097] schedule+0x43/0xf0 [102601.484124] fuse_simple_request+0x37d/0x510 [fuse] [102601.484140] ? wait_woken+0x70/0x70 [102601.484156] ? wait_woken+0x70/0x70 [102601.484180] fuse_flush+0x144/0x170 [fuse] [102601.484218] filp_close+0x2a/0x60 [102601.484231] put_files_struct+0x6a/0xb0 [102601.484245] exit_files+0x3c/0x50 [102601.484257] do_exit+0x317/0xa10 [102601.484278] do_group_exit+0x36/0x90 [102601.484291] get_signal+0x14c/0x950 [102601.484307] ? mntput_no_expire+0x4c/0x280 [102601.484321] do_signal+0x2d/0x520 [102601.484330] ? mntput_no_expire+0x66/0x280 [102601.484362] exit_to_usermode_loop+0x85/0x140 [102601.484374] do_fast_syscall_32+0x1c7/0x270 [102601.484388] entry_SYSENTER_32+0xa5/0xf8 [102601.484396] EIP: 0xb7fd48e9 [102601.484405] Code: 5d c3 8d b4 26 00 00 00 00 b8 80 96 98 00 eb b5 8b 04 24 c3 8b 14 24 c3 8b 1c 24 c3 8b 3c 24 c3 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 [102601.484412] EAX: fffffffc EBX: 0044bb60 ECX: bffff38c EDX: b7f4ab8c [102601.484418] ESI: 00451e70 EDI: bffff47c EBP: b7f50194 ESP: bffff30c [102601.484424] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000282 [102601.484443] ? xas_create+0x370/0x3b0 [102601.484457] ? nmi+0xc5/0x2ac [102601.484487] Showing all locks held in the system: [102601.484512] 1 lock held by khungtaskd/43: [102601.484516] #0: c3bd1d9c (rcu_read_lock){....}, at: debug_show_all_locks+0x1f/0xe6 [102601.484638] 1 lock held by x264/930: [102601.484642] #0: e409043c (&f->f_pos_lock){....}, at: __fdget_pos+0x3d/0x50 [102601.484689] ============================================= smbd is sharing the mergerfs mountpoint x264 is one of the background tasks doing light io on the same mergerfs mountpoint. This was with kernel 5.5.6 got a freeze recently. max_sectors was set to 240, rsync was not even running. No dump related messages in the kernel It is still happening with a recent kernel. running sync or sometimes hdparm -t (on all drives) is still needed to unfreeze the system. This bug is infuriating. while running hdparm -t in parallel to workaround the hang; this happened : [1090302.387637] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.407113] __request_module: 361 callbacks suppressed [1090307.407120] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.407687] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.413541] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.413558] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.421367] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.422975] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.430159] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.430243] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.431028] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090307.432136] request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module net-pf-16-proto-9, throttling... [1090312.523610] INFO: task hdparm:30741 blocked for more than 122 seconds. [1090312.523621] Not tainted 5.7.5-arch1-1 #2 [1090312.523624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [1090312.523630] hdparm D 0 30741 30714 0x00000000 [1090312.523640] Call Trace: [1090312.523664] __schedule+0x25e/0x7c0 [1090312.523675] schedule+0x5a/0xd0 [1090312.523688] rwsem_down_write_slowpath+0x226/0x410 [1090312.523695] down_write+0x2f/0x40 [1090312.523701] sync_inodes_sb+0x77/0x210 [1090312.523709] sync_inodes_one_sb+0x18/0x20 [1090312.523715] iterate_supers+0x82/0xd0 [1090312.523720] ? __ia32_sys_splice+0xf0/0xf0 [1090312.523724] ksys_sync+0x2a/0x80 [1090312.523728] __do_sys_sync+0x8/0x10 [1090312.523734] do_fast_syscall_32+0x61/0x220 [1090312.523740] entry_SYSENTER_32+0xa5/0xf8 [1090312.523744] EIP: 0xb7fd2c71 [1090312.523753] Code: 89 da 89 f3 e8 18 00 00 00 89 d3 5b 5e 5d c3 8b 04 24 c3 8b 1c 24 c3 8b 34 24 c3 8b 3c 24 c3 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 [1090312.523757] EAX: ffffffda EBX: 00425e98 ECX: 00200000 EDX: b7be0000 [1090312.523762] ESI: 00000003 EDI: 00000014 EBP: 00425e98 ESP: bffff61c [1090312.523767] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000286 [1090312.523779] ? x86_vector_free_irqs+0xb3/0xe0 FWIW, I cant reproduce the issue anymore. However the setup has changed in several ways : - the highly fragmented ext3 drive has been converted to ext4. - lots of files have been removed from that drive, reducing the overall fragmentation. that step "seemed" to have the biggest improvement. then - switched to a 64bits kernel - switched currently to a 5.11.16 kernel |