If you keep appending data to a file on an XFS filesystem, the entire system will occasionally freeze for up to 4 seconds or so. Then it recovers and continues for a few minutes, until it happens again. Although the freezes happen at seemingly random times, the problem is generally highly reproducible.

This is extremely bothersome for desktop use, as _everything_ stops during the freeze, even typing text into a terminal. Even processes that don't appear to do any disk I/O are frozen. For example, a Python script that does nothing other than print a counter to stdout (running in an X11 terminal emulator) will stop, even though everything it touches is in memory; the system disk is on an SSD anyway. In general, the desktop becomes completely unusable, which in turn makes XFS unsuitable for desktop operation until this is fixed.

Reproduction is simple enough: if I copy multiple large files (dozens of files of about 500-5000 MB each) with rsync from one disk to another (both hard disks using XFS filesystems), the freezes happen at least every few minutes.

Someone else independently observed something similar just now, and suspects this is happening because XFS blocks the entire kernel while freeing certain caches: https://twitter.com/marcan42/status/1086652425061056515 I don't know whether my case is the same, but it sure looks very similar.

This is with kernel 4.19.16, the deadline scheduler for both source and target disks (as recommended by the XFS FAQ), and full kernel preemption enabled. The system has 16GB RAM, of which 10GB are usually available for caches.

I don't know what information is useful, so please request whatever might be useful for analyzing this.
PS: I've observed absolutely no problems when reformatting exactly the same target disk with ext4 and performing the same test.
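For reference, a rough sketch of the kind of copy that triggers it for me (the directory names below are only placeholders for the source and target XFS mounts):

-----
#!/bin/bash
# dozens of large files (roughly 500-5000 MB each), hard disk to hard disk,
# both XFS; the freezes start within the first few minutes of the copy
rsync -a --progress /mnt/src/bigfiles/ /mnt/tmp1/bigfiles/
-----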
http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
Output of 'echo w > /proc/sysrq-trigger' when the system is having freeze problems, please, as well as `iostat -dxm 5` over a period of minutes leading up to the freeze and after the freeze ends. -Dave.
I'll try to provide all this information in a few days. I doubt I'll be able to trigger the sysrq with a shell command while the system is frozen (because, well, it's frozen). If that's the case, would triggering the sysrq via keyboard directly work better?
My suggestion is to enable sysrq, type out the trigger line in a shell in advance without pressing return, and then hit return once the hang happens. It will almost certainly get through to the kernel even if the command doesn't return until after the system recovers. It might be more reliable to do this over ssh than through the desktop environment's terminal. If you're running systemd, you can extract the full dmesg using 'journalctl -k > kernel.log'; without systemd you may need to reboot with the boot parameter log_buf_len=1M and issue 'dmesg -n 8' before reproducing.
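In practice the capture could look something like this (just a sketch; adjust the log file names to taste):

-----
# as root, ideally over ssh
echo 1 > /proc/sys/kernel/sysrq        # make sure sysrq is enabled
iostat -dxm 5 > iostat.log &           # leave this running across the freeze

# type this in advance; press return only once the hang starts
echo w > /proc/sysrq-trigger

# afterwards, collect the blocked-task dump from the kernel log
journalctl -k > kernel.log             # with systemd
dmesg > kernel.log                     # otherwise
-----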
In all the information below, the test disk was /dev/sdd, and was mounted on /mnt/tmp1/.

Script which triggers the problem somewhat reliably:

-----
#!/bin/bash

TESTDIR=/mnt/tmp1/tests/

mkdir $TESTDIR

COUNTER=0
while true ; do
    dd if=/dev/zero of=$TESTDIR/f$COUNTER bs=1024 count=$((1*1024))
    COUNTER=$((COUNTER+1))
done
-----

Whether it "freezes" the system after a few seconds (under 1 minute) of running seems to depend on the size of the files written. The system freeze is of course not permanent; it's just a performance problem. When the system freezes, even the mouse pointer can stop moving. As you can see in the dmesg excerpt below, Xorg got blocked. The script was run from a terminal emulator running on the X session. I'm fairly sure nothing other than the test script accessed the test disk/filesystem. Sometimes the script blocks for a while without freezing the system.

dmesg excerpt from when I triggered the blocked task sysrq via ssh, this includes the messages from when the disk was hotplugged and mounted:

-----
[585767.299464] ata3: softreset failed (device not ready)
[585772.375433] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[585772.378964] ata3.00: ATA-9: HGST HUH728060ALE600, A4GNT7J0, max UDMA/133
[585772.378968] ata3.00: 11721045168 sectors, multi 0: LBA48 NCQ (depth 32), AA
[585772.386994] ata3.00: configured for UDMA/133
[585772.387134] scsi 2:0:0:0: Direct-Access ATA HGST HUH728060AL T7J0 PQ: 0 ANSI: 5
[585772.387453] sd 2:0:0:0: Attached scsi generic sg3 type 0
[585772.387599] sd 2:0:0:0: [sdd] 11721045168 512-byte logical blocks: (6.00 TB/5.46 TiB)
[585772.387604] sd 2:0:0:0: [sdd] 4096-byte physical blocks
[585772.387636] sd 2:0:0:0: [sdd] Write Protect is off
[585772.387640] sd 2:0:0:0: [sdd] Mode Sense: 00 3a 00 00
[585772.387680] sd 2:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[585772.432755] sdd: sdd1
[585772.433234] sd 2:0:0:0: [sdd] Attached SCSI disk
[585790.697722] XFS (sdd1): Mounting V5 Filesystem
[585790.842858] XFS (sdd1): Ending clean mount
[587294.464409] sysrq: SysRq : Show Blocked State
[587294.464419]   task                        PC stack   pid father
[588653.596479] sysrq: SysRq : Show Blocked State
[588653.596488]   task                        PC stack   pid father
[588653.596534] kswapd0         D    0    93      2 0x80000000
[588653.596538] Call Trace:
[588653.596549]  ? __schedule+0x240/0x860
[588653.596553]  ? schedule_timeout+0x274/0x390
[588653.596557]  schedule+0x78/0x110
[588653.596560]  schedule_timeout+0x274/0x390
[588653.596565]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[588653.596568]  wait_for_completion+0xe2/0x140
[588653.596572]  ? wake_up_q+0x70/0x70
[588653.596576]  __flush_work+0xfb/0x1a0
[588653.596580]  ? flush_workqueue_prep_pwqs+0x130/0x130
[588653.596585]  xlog_cil_force_lsn+0x67/0x1d0
[588653.596591]  ? __xfs_iunpin_wait+0x96/0x150
[588653.596594]  xfs_log_force_lsn+0x73/0x130
[588653.596597]  ? xfs_reclaim_inode+0xb5/0x300
[588653.596600]  __xfs_iunpin_wait+0x96/0x150
[588653.596605]  ? init_wait_var_entry+0x40/0x40
[588653.596607]  xfs_reclaim_inode+0xb5/0x300
[588653.596611]  xfs_reclaim_inodes_ag+0x1a6/0x2f0
[588653.596616]  ? preempt_count_sub+0x43/0x50
[588653.596619]  ? _raw_spin_unlock+0x12/0x30
[588653.596621]  ? xfs_inode_set_reclaim_tag+0x9f/0x170
[588653.596624]  ? preempt_count_sub+0x43/0x50
[588653.596627]  xfs_reclaim_inodes_nr+0x31/0x40
[588653.596631]  super_cache_scan+0x14c/0x1a0
[588653.596635]  do_shrink_slab+0x125/0x2a0
[588653.596639]  shrink_slab+0x144/0x260
[588653.596643]  shrink_node+0xd6/0x420
[588653.596647]  kswapd+0x3ce/0x6e0
[588653.596651]  ?
mem_cgroup_shrink_node+0x170/0x170 [588653.596654] kthread+0x110/0x130 [588653.596658] ? kthread_create_worker_on_cpu+0x40/0x40 [588653.596660] ret_from_fork+0x24/0x30 [588653.596730] Xorg D 0 894 880 0x00400004 [588653.596733] Call Trace: [588653.596737] ? __schedule+0x240/0x860 [588653.596741] ? schedule_preempt_disabled+0x23/0xa0 [588653.596744] schedule+0x78/0x110 [588653.596747] ? __mutex_lock.isra.5+0x292/0x4b0 [588653.596750] schedule_preempt_disabled+0x23/0xa0 [588653.596752] __mutex_lock.isra.5+0x292/0x4b0 [588653.596757] ? xfs_perag_get_tag+0x52/0xf0 [588653.596760] xfs_reclaim_inodes_ag+0x287/0x2f0 [588653.596766] ? radix_tree_gang_lookup_tag+0xc2/0x140 [588653.596770] ? iput+0x210/0x210 [588653.596772] ? preempt_count_sub+0x43/0x50 [588653.596775] xfs_reclaim_inodes_nr+0x31/0x40 [588653.596778] super_cache_scan+0x14c/0x1a0 [588653.596781] do_shrink_slab+0x125/0x2a0 [588653.596784] shrink_slab+0x204/0x260 [588653.596787] ? __schedule+0x248/0x860 [588653.596791] shrink_node+0xd6/0x420 [588653.596794] do_try_to_free_pages+0xb6/0x350 [588653.596798] try_to_free_pages+0xce/0x1b0 [588653.596802] __alloc_pages_slowpath+0x33d/0xc80 [588653.596808] __alloc_pages_nodemask+0x23f/0x260 [588653.596820] ttm_pool_populate+0x25e/0x480 [ttm] [588653.596825] ? kmalloc_large_node+0x37/0x60 [588653.596828] ? __kmalloc_node+0x20e/0x2b0 [588653.596836] ttm_populate_and_map_pages+0x24/0x250 [ttm] [588653.596845] ttm_tt_populate.part.9+0x1b/0x60 [ttm] [588653.596853] ttm_tt_bind+0x42/0x60 [ttm] [588653.596861] ttm_bo_handle_move_mem+0x258/0x4e0 [ttm] [588653.596939] ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu] [588653.596947] ttm_bo_validate+0xe7/0x110 [ttm] [588653.596951] ? preempt_count_sub+0x43/0x50 [588653.596954] ? _raw_write_unlock+0x12/0x30 [588653.596974] ? drm_pci_agp_destroy+0x4d/0x50 [drm] [588653.596983] ttm_bo_init_reserved+0x347/0x390 [ttm] [588653.597059] amdgpu_bo_do_create+0x19c/0x420 [amdgpu] [588653.597136] ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu] [588653.597213] amdgpu_bo_create+0x30/0x200 [amdgpu] [588653.597291] amdgpu_gem_object_create+0x8b/0x110 [amdgpu] [588653.597404] amdgpu_gem_create_ioctl+0x1d0/0x290 [amdgpu] [588653.597417] ? preempt_count_sub+0x43/0x50 [588653.597421] ? _raw_spin_unlock+0x12/0x30 [588653.597499] ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu] [588653.597521] drm_ioctl_kernel+0x7f/0xd0 [drm] [588653.597545] drm_ioctl+0x1e4/0x380 [drm] [588653.597625] ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu] [588653.597631] ? tlb_finish_mmu+0x1f/0x30 [588653.597637] ? preempt_count_sub+0x43/0x50 [588653.597712] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [588653.597720] do_vfs_ioctl+0x8d/0x5d0 [588653.597726] ? do_munmap+0x33c/0x430 [588653.597731] ? __fget+0x6e/0xa0 [588653.597736] ksys_ioctl+0x60/0x90 [588653.597742] __x64_sys_ioctl+0x16/0x20 [588653.597748] do_syscall_64+0x4a/0xd0 [588653.597754] entry_SYSCALL_64_after_hwframe+0x49/0xbe [588653.597759] RIP: 0033:0x7f929cf6d747 [588653.597768] Code: Bad RIP value. 
[588653.597771] RSP: 002b:00007ffca6b5df18 EFLAGS: 00003246 ORIG_RAX: 0000000000000010 [588653.597780] RAX: ffffffffffffffda RBX: 00007ffca6b5e000 RCX: 00007f929cf6d747 [588653.597783] RDX: 00007ffca6b5df70 RSI: 00000000c0206440 RDI: 000000000000000e [588653.597787] RBP: 00007ffca6b5df70 R08: 000055f3e86a3290 R09: 000000000000000b [588653.597791] R10: 000055f3e653c010 R11: 0000000000003246 R12: 00000000c0206440 [588653.597795] R13: 000000000000000e R14: 0000000000000000 R15: 000055f3e86a3290 [588653.598255] kworker/u32:0 D 0 6490 2 0x80000000 [588653.598263] Workqueue: writeback wb_workfn (flush-8:48) [588653.598265] Call Trace: [588653.598270] ? __schedule+0x240/0x860 [588653.598274] ? blk_flush_plug_list+0x1d9/0x220 [588653.598277] ? io_schedule+0x12/0x40 [588653.598280] schedule+0x78/0x110 [588653.598283] io_schedule+0x12/0x40 [588653.598286] get_request+0x26b/0x770 [588653.598291] ? finish_wait+0x80/0x80 [588653.598294] blk_queue_bio+0x15f/0x4e0 [588653.598297] generic_make_request+0x1c0/0x460 [588653.598301] submit_bio+0x45/0x140 [588653.598304] xfs_submit_ioend+0x9c/0x1e0 [588653.598308] xfs_vm_writepages+0x68/0x80 [588653.598312] do_writepages+0x2e/0xb0 [588653.598315] ? _raw_spin_unlock+0x12/0x30 [588653.598317] ? list_lru_add+0xf5/0x1a0 [588653.598320] __writeback_single_inode+0x3d/0x3d0 [588653.598323] writeback_sb_inodes+0x1c4/0x430 [588653.598328] __writeback_inodes_wb+0x5d/0xb0 [588653.598331] wb_writeback+0x265/0x310 [588653.598335] wb_workfn+0x314/0x410 [588653.598340] process_one_work+0x199/0x3b0 [588653.598343] worker_thread+0x30/0x380 [588653.598345] ? rescuer_thread+0x310/0x310 [588653.598348] kthread+0x110/0x130 [588653.598351] ? kthread_create_worker_on_cpu+0x40/0x40 [588653.598354] ret_from_fork+0x24/0x30 [588653.598380] xfsaild/sdd1 D 0 19755 2 0x80000000 [588653.598383] Call Trace: [588653.598387] ? __schedule+0x240/0x860 [588653.598389] ? enqueue_entity+0xf6/0x6c0 [588653.598392] ? schedule_timeout+0x274/0x390 [588653.598395] schedule+0x78/0x110 [588653.598398] schedule_timeout+0x274/0x390 [588653.598403] ? tracing_record_taskinfo_skip+0x40/0x50 [588653.598405] wait_for_completion+0xe2/0x140 [588653.598408] ? wake_up_q+0x70/0x70 [588653.598411] __flush_work+0xfb/0x1a0 [588653.598415] ? flush_workqueue_prep_pwqs+0x130/0x130 [588653.598419] xlog_cil_force_lsn+0x67/0x1d0 [588653.598422] ? _raw_spin_unlock_irqrestore+0x22/0x40 [588653.598426] ? try_to_del_timer_sync+0x3d/0x50 [588653.598429] xfs_log_force+0x83/0x2d0 [588653.598432] ? preempt_count_sub+0x43/0x50 [588653.598436] xfsaild+0x19b/0x7f0 [588653.598440] ? _raw_spin_unlock_irqrestore+0x22/0x40 [588653.598443] ? xfs_trans_ail_cursor_first+0x80/0x80 [588653.598446] kthread+0x110/0x130 [588653.598449] ? kthread_create_worker_on_cpu+0x40/0x40 [588653.598452] ret_from_fork+0x24/0x30 [588653.598462] kworker/6:2 D 0 19911 2 0x80000000 [588653.598468] Workqueue: xfs-cil/sdd1 xlog_cil_push_work [588653.598469] Call Trace: [588653.598473] ? __schedule+0x240/0x860 [588653.598476] ? _raw_spin_lock_irqsave+0x1c/0x40 [588653.598479] ? xlog_state_get_iclog_space+0xfc/0x2c0 [588653.598481] ? wake_up_q+0x70/0x70 [588653.598484] schedule+0x78/0x110 [588653.598487] xlog_state_get_iclog_space+0xfc/0x2c0 [588653.598490] ? wake_up_q+0x70/0x70 [588653.598494] xlog_write+0x153/0x680 [588653.598498] xlog_cil_push+0x259/0x3e0 [588653.598503] process_one_work+0x199/0x3b0 [588653.598506] worker_thread+0x1c6/0x380 [588653.598509] ? rescuer_thread+0x310/0x310 [588653.598512] kthread+0x110/0x130 [588653.598515] ? 
kthread_create_worker_on_cpu+0x40/0x40 [588653.598518] ret_from_fork+0x24/0x30 [588653.598542] kworker/8:0 D 0 21308 2 0x80000000 [588653.598547] Workqueue: xfs-sync/sdd1 xfs_log_worker [588653.598549] Call Trace: [588653.598553] ? __schedule+0x240/0x860 [588653.598556] ? schedule_timeout+0x274/0x390 [588653.598559] schedule+0x78/0x110 [588653.598561] schedule_timeout+0x274/0x390 [588653.598565] wait_for_completion+0xe2/0x140 [588653.598567] ? wake_up_q+0x70/0x70 [588653.598570] __flush_work+0xfb/0x1a0 [588653.598574] ? flush_workqueue_prep_pwqs+0x130/0x130 [588653.598577] xlog_cil_force_lsn+0x67/0x1d0 [588653.598581] ? trace_hardirqs_off_thunk+0x1a/0x1c [588653.598584] xfs_log_force+0x83/0x2d0 [588653.598587] ? preempt_count_sub+0x43/0x50 [588653.598590] xfs_log_worker+0x2f/0xf0 [588653.598593] process_one_work+0x199/0x3b0 [588653.598595] worker_thread+0x30/0x380 [588653.598598] ? rescuer_thread+0x310/0x310 [588653.598601] kthread+0x110/0x130 [588653.598604] ? kthread_create_worker_on_cpu+0x40/0x40 [588653.598607] ret_from_fork+0x24/0x30 ----- "iostat -x -d -m 5 " while running the test: ----- Linux 4.19.16 (debian) 24.01.2019 _x86_64_ (12 CPU) Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sdb 49.94 0.13 8.22 0.09 0.01 0.01 0.02 6.16 1.09 195.85 0.08 168.59 689.71 0.73 3.63 sda 2.09 0.35 0.03 0.02 0.01 0.02 0.24 6.36 0.25 6.46 0.00 15.34 51.02 0.23 0.06 sdc 4.14 1.01 0.37 0.09 0.02 3.05 0.58 75.03 0.33 13.48 0.01 90.82 90.05 0.25 0.13 dm-0 0.97 3.56 0.02 0.02 0.00 0.00 0.00 0.00 0.29 7.14 0.03 19.75 4.56 0.08 0.03 loop0 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 9.87 1.08 0.00 11.52 8.80 1.50 0.00 loop1 0.06 0.00 0.01 0.00 0.00 0.00 0.00 0.00 2.66 3.65 0.00 85.79 88.40 0.88 0.01 sdd 0.04 0.48 0.00 0.34 0.00 0.00 2.38 0.48 13.82 625.41 0.30 10.27 712.62 4.63 0.24 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 2.80 114.60 0.01 57.53 0.00 0.00 0.00 0.00 31.57 315.09 44.88 4.00 514.07 2.82 33.12 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 4.00 0.00 0.00 dm-0 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.00 309.20 0.00 158.60 0.00 0.00 0.00 0.00 0.00 379.40 143.99 0.00 525.25 3.24 100.08 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.60 0.00 0.00 0.00 0.20 0.00 25.00 0.00 3.00 0.00 0.00 4.00 2.67 0.16 dm-0 0.00 
0.80 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 4.00 2.00 0.16 loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.80 162.20 0.00 159.67 0.00 0.00 0.00 0.00 1161.50 933.72 130.70 4.00 1008.02 6.13 100.00 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdc 0.00 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 4.00 0.00 0.00 dm-0 0.00 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 0.00 154.00 0.00 153.80 0.00 0.00 0.00 0.00 0.00 885.47 143.06 0.00 1022.67 6.49 100.00 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sda 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.50 0.00 0.00 4.00 0.00 0.00 0.00 sdc 0.00 1.00 0.00 0.00 0.00 0.20 0.00 16.67 0.00 1.60 0.00 0.00 4.00 0.80 0.08 dm-0 0.00 1.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.67 0.00 0.00 4.00 0.67 0.08 loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 10.60 174.40 0.17 138.13 0.00 7.40 0.00 4.07 6.43 861.84 130.00 16.00 811.06 5.39 99.76 Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util sdb 0.00 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.00 0.00 0.00 4.00 6.00 0.24 sda 5.80 0.60 0.05 0.00 0.00 0.00 0.00 0.00 0.55 2.33 0.00 8.55 7.00 0.62 0.40 sdc 0.00 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.50 0.00 0.00 2.00 0.00 0.00 dm-0 0.00 0.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 sdd 15.60 29.20 0.23 7.46 0.00 36.20 0.00 55.35 8.55 47.97 1.53 15.08 261.59 3.91 17.52 ----- "vmstat 5" while running the test: ----- procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 0 4379320 64296 6133624 0 0 738 760 4 0 1 0 97 1 0 0 1 0 3123424 64296 7369344 0 0 11 58163 4661 8568 1 5 92 2 0 1 1 0 2301400 64296 8193912 0 0 1 176743 5350 10644 2 4 86 8 0 1 1 0 934552 64304 9550800 0 0 2 170806 5922 12653 2 5 85 8 0 0 3 0 132944 64304 10366496 0 0 0 163897 5791 9442 1 4 82 13 0 2 1 0 135192 64312 10366564 0 0 2 114094 2093 2144 0 0 91 8 0 0 0 0 185980 64316 10327628 0 0 454 7708 2964 8098 1 1 97 1 0 ----- Information requested by the FAQ: - kernel version (uname -a) Linux debian 4.19.16 #1.0.my.2 SMP PREEMPT Thu Jan 17 15:18:14 CET 2019 x86_64 GNU/Linux (That's just a vanilla kernel. It does _not_ use Debian's default kernel config.) - xfsprogs version (xfs_repair -V) xfs_repair version 4.15.1 - number of CPUs 6 cores, total of 12 CPUs (AMD Ryzen 5 2600). 
- contents of /proc/meminfo MemTotal: 16419320 kB MemFree: 442860 kB MemAvailable: 8501360 kB Buffers: 66816 kB Cached: 9966896 kB SwapCached: 0 kB Active: 10300036 kB Inactive: 4981936 kB Active(anon): 6199144 kB Inactive(anon): 930620 kB Active(file): 4100892 kB Inactive(file): 4051316 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 0 kB SwapFree: 0 kB Dirty: 112 kB Writeback: 0 kB AnonPages: 5248396 kB Mapped: 1110368 kB Shmem: 1881512 kB Slab: 426008 kB SReclaimable: 242452 kB SUnreclaim: 183556 kB KernelStack: 16064 kB PageTables: 44408 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 8209660 kB Committed_AS: 15149512 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB Percpu: 9664 kB HardwareCorrupted: 0 kB AnonHugePages: 2619392 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 0 kB DirectMap4k: 6745216 kB DirectMap2M: 9963520 kB DirectMap1G: 1048576 kB - contents of /proc/mounts Target disk only: /dev/sdd1 /mnt/tmp1 xfs rw,relatime,attr2,inode64,noquota 0 0 - contents of /proc/partitions Target disk only: 8 48 5860522584 sdd 8 49 5860521543 sdd1 - RAID layout (hardware and/or software) None. - LVM configuration None. - type of disks you are using Spinning rust, potentially interesting parts from smartctl: Model Family: HGST Ultrastar He8 Device Model: HGST HUH728060ALE600 Firmware Version: A4GNT7J0 - write cache status of drives hdparm -I /dev/sdd|grep Write.cache * Write cache - size of BBWC and mode it is running in No idea how to get this. - xfs_info output on the filesystem in question xfs_info /mnt/tmp1 meta-data=/dev/sdd1 isize=512 agcount=6, agsize=268435455 blks = sectsz=4096 attr=2, projid32bit=1 = crc=1 finobt=1 spinodes=0 rmapbt=0 = reflink=0 data = bsize=4096 blocks=1465130385, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=521728, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0
On Thu, Jan 24, 2019 at 11:59:44AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202349
>
> --- Comment #6 from nfxjfg@googlemail.com ---
> In all the information below, the test disk was /dev/sdd, and was mounted on
> /mnt/tmp1/.

Ok, so you are generating largely page cache memory pressure and some dirty inodes.

.....

> When the system freezes, even the mouse pointer can stop moving. As you can see
> in the dmesg excerpt below, Xorg got blocked. The script was run from a
> terminal emulator running on the X session. I'm fairly sure nothing other than
> the test script accessed the test disk/filesystem. Sometimes the script blocks
> for a while without freezing the system.

OK, this explains why I'm not seeing any measurable stalls at all when running similar page cache pressure workloads - no GPU creating memory pressure and triggering direct reclaim.

And from that perspective, this looks more like a bug in the ttm memory pool allocator, not an XFS problem. Yes, XFS is doing memory reclaim and is doing IO during reclaim, but that's because it's the only thing that has reclaimable objects in memory (due to your workload). While this may be undesirable, it is necessary to work around other deficiencies in the memory reclaim infrastructure and, as such, it is not a bug. We are working to try to avoid this problem, but we haven't found a solution yet. It won't prevent the "desktop freeze under memory shortage" problem from occurring, though.

i.e. the reason your desktop freezes is that this allocation here:

> [588653.596794] do_try_to_free_pages+0xb6/0x350
> [588653.596798] try_to_free_pages+0xce/0x1b0
> [588653.596802] __alloc_pages_slowpath+0x33d/0xc80
> [588653.596808] __alloc_pages_nodemask+0x23f/0x260
> [588653.596820] ttm_pool_populate+0x25e/0x480 [ttm]
> [588653.596825] ? kmalloc_large_node+0x37/0x60
> [588653.596828] ? __kmalloc_node+0x20e/0x2b0
> [588653.596836] ttm_populate_and_map_pages+0x24/0x250 [ttm]
> [588653.596845] ttm_tt_populate.part.9+0x1b/0x60 [ttm]
> [588653.596853] ttm_tt_bind+0x42/0x60 [ttm]
> [588653.596861] ttm_bo_handle_move_mem+0x258/0x4e0 [ttm]
> [588653.596939] ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu]
> [588653.596947] ttm_bo_validate+0xe7/0x110 [ttm]
> [588653.596951] ? preempt_count_sub+0x43/0x50
> [588653.596954] ? _raw_write_unlock+0x12/0x30
> [588653.596974] ? drm_pci_agp_destroy+0x4d/0x50 [drm]
> [588653.596983] ttm_bo_init_reserved+0x347/0x390 [ttm]
> [588653.597059] amdgpu_bo_do_create+0x19c/0x420 [amdgpu]
> [588653.597136] ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu]
> [588653.597213] amdgpu_bo_create+0x30/0x200 [amdgpu]
> [588653.597291] amdgpu_gem_object_create+0x8b/0x110 [amdgpu]
> [588653.597404] amdgpu_gem_create_ioctl+0x1d0/0x290 [amdgpu]
> [588653.597417] ? preempt_count_sub+0x43/0x50
> [588653.597421] ? _raw_spin_unlock+0x12/0x30
> [588653.597499] ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu]
> [588653.597521] drm_ioctl_kernel+0x7f/0xd0 [drm]
> [588653.597545] drm_ioctl+0x1e4/0x380 [drm]
> [588653.597625] ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu]
> [588653.597631] ? tlb_finish_mmu+0x1f/0x30
> [588653.597637] ? preempt_count_sub+0x43/0x50
> [588653.597712] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
> [588653.597720] do_vfs_ioctl+0x8d/0x5d0

is a GFP_USER allocation. That is:

#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)

__GFP_RECLAIM means direct reclaim is allowed, as is reclaim via kswapd.
__GFP_FS means "reclaim from filesystem caches is allowed".
__GFP_IO means that it's allowed to do IO during reclaim.
__GFP_HARDWALL means the allocation is limited to the current cpuset memory policy.

Basically, the ttm infrastructure has told the allocator that it is ok to block for as long as it takes to reclaim enough memory for the required allocation. Given that your workload is creating only filesystem memory pressure, that means that's where reclaim is directed. And given that the allocation says "blocking is fine" and "reclaim from filesystems", it's no surprise that the GPU operations are getting stuck behind filesystem reclaim.

> Device   r/s    w/s   rMB/s  wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
> sdd      0.80  162.20  0.00  159.67   0.00    0.00   0.00   0.00  1161.50   933.72  130.70      4.00   1008.02   6.13  100.00

Yup, there's a second-long wait for any specific read or write IO to complete here. GPU operations are interactive, so they really need to have bounded response times. Using "block until the required memory is available" operations guarantees that whenever the system gets low on memory, desktop interactivity will go to shit.....

In the meantime, it might be worth checking if you have:

CONFIG_BLK_WBT=y
CONFIG_BLK_WBT_MQ=y

active in your kernel. The block layer writeback throttle should minimise the impact of bulk data writeback IO on metadata writeback and read IO latency and so help minimise long blocking times for metadata IO.

Cheers, Dave.
So with my GPU developer hat on: what should the GPU driver devs do to avoid this?

GPU tasks aren't per se realtime or interactive (think GPU compute jobs with big working sets that can take hours or even days to complete). So from the driver perspective we _want_ FS reclaim to happen instead of failing a task, but for the interactive workload we would rather trash the working set a bit (by purging clean caches) than do a high-latency writeback of dirty caches.

Also, to put the obvious question up again: why doesn't the reporter see any of these catastrophic latency events with ext4? Is it just that XFS can queue up way more IO, so direct reclaim gets stuck behind the already scheduled IO for far longer?

If I understand the problem right, the basic issue is that we still don't have IO-less dirty throttling in Linux, so we make the interactive GPU process pay part of the bill for the background process dirtying lots of pages, instead of throttling that process earlier.
On the kernel on which I ran the tests:

# zcat /proc/config.gz | grep WBT
CONFIG_BLK_WBT=y
# CONFIG_BLK_WBT_SQ is not set
CONFIG_BLK_WBT_MQ=y

I can't reproduce any of this when running the same test script on an ext4-formatted disk under the same circumstances (different disk, but comparable speed).
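If it's useful, I could also check at runtime whether the writeback throttle is actually active on the test disk; as far as I understand (not yet verified on this kernel), the knob would be:

-----
# target latency in usec for the block-layer writeback throttle; 0 means
# disabled. The attribute may be absent or unreadable on queues where WBT
# doesn't apply.
cat /sys/block/sdd/queue/wbt_lat_usec
-----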
Hi nfxjfg,

Can you try a plain 4.19.0 kernel and see if it has the same problem? I suspect not; read https://bugzilla.kernel.org/show_bug.cgi?id=202441 for more details. I'd like to confirm that this is the same issue.

-Dave.
I also have to note that I found it impossible to reproduce the issue with the test script on some other hard disks. It must depend a lot on the workload and the speed/size of the disk, or other unknown circumstances. This also means that my comment about ext4 might be misleading, although I did confirm that my problem went away when I reformatted yet another disk from XFS to ext4.

Dave Chinner: would a 4.18.20 kernel also do? (I still have that on my disk.)
So I tried on the following kernel versions: 4.19.19 4.19.16 4.19.0 4.18.20 4.16.14 It happened on all of them. Reproduction is a bit spotty. The script I first posted doesn't work reliably anymore. I guess it depends on the kind and amount of memory pressure. Despite hard reproduction, it's not an obscure issue. I've also hit it when compiling the kernel on a XFS filesystem on a hard disk. My reproduction steps are now as following (and yes they're absurd): - run memtester 12G (make sure "free memory" as shown in top goes to very low while running the test) - start video playback (I used mpv with some random 720p video) - run test.sh (maybe until 100k files) - run sync - run rm -rf /mnt/tmp1/tests/ - switch to another virtual desktop with lots of firefox windows (yeah...), and switch back - video playback gets noticeably interrupted for a moment This happened even on 4.16.14. dmesg except when I "caught" it again on 4.19.19 (there's nothing new I guess): [ 250.656494] sysrq: SysRq : Show Blocked State [ 250.656505] task PC stack pid father [ 250.656581] kswapd0 D 0 91 2 0x80000000 [ 250.656585] Call Trace: [ 250.656600] ? __schedule+0x23d/0x830 [ 250.656604] schedule+0x28/0x80 [ 250.656608] schedule_timeout+0x23e/0x360 [ 250.656612] wait_for_completion+0xeb/0x150 [ 250.656617] ? wake_up_q+0x70/0x70 [ 250.656623] ? __xfs_buf_submit+0x112/0x230 [ 250.656625] ? xfs_bwrite+0x25/0x60 [ 250.656628] xfs_buf_iowait+0x22/0xc0 [ 250.656631] __xfs_buf_submit+0x112/0x230 [ 250.656633] xfs_bwrite+0x25/0x60 [ 250.656637] xfs_reclaim_inode+0x2e5/0x310 [ 250.656640] xfs_reclaim_inodes_ag+0x19e/0x2d0 [ 250.656645] xfs_reclaim_inodes_nr+0x31/0x40 [ 250.656650] super_cache_scan+0x14c/0x1a0 [ 250.656656] do_shrink_slab+0x129/0x270 [ 250.656660] shrink_slab+0x201/0x280 [ 250.656663] shrink_node+0xd6/0x420 [ 250.656666] kswapd+0x3d3/0x6c0 [ 250.656670] ? mem_cgroup_shrink_node+0x140/0x140 [ 250.656674] kthread+0x110/0x130 [ 250.656677] ? kthread_create_worker_on_cpu+0x40/0x40 [ 250.656680] ret_from_fork+0x24/0x30 [ 250.656785] Xorg D 0 850 836 0x00400004 [ 250.656789] Call Trace: [ 250.656792] ? __schedule+0x23d/0x830 [ 250.656795] schedule+0x28/0x80 [ 250.656798] schedule_preempt_disabled+0xa/0x10 [ 250.656801] __mutex_lock.isra.5+0x28b/0x460 [ 250.656806] ? xfs_perag_get_tag+0x2d/0xc0 [ 250.656808] xfs_reclaim_inodes_ag+0x286/0x2d0 [ 250.656811] ? isolate_lru_pages.isra.55+0x34f/0x400 [ 250.656817] ? list_lru_add+0xb2/0x190 [ 250.656819] ? list_lru_isolate_move+0x40/0x60 [ 250.656824] ? iput+0x1f0/0x1f0 [ 250.656827] xfs_reclaim_inodes_nr+0x31/0x40 [ 250.656829] super_cache_scan+0x14c/0x1a0 [ 250.656832] do_shrink_slab+0x129/0x270 [ 250.656836] shrink_slab+0x144/0x280 [ 250.656838] shrink_node+0xd6/0x420 [ 250.656841] do_try_to_free_pages+0xb6/0x350 [ 250.656844] try_to_free_pages+0xce/0x180 [ 250.656856] __alloc_pages_slowpath+0x347/0xc70 [ 250.656863] __alloc_pages_nodemask+0x25c/0x280 [ 250.656875] ttm_pool_populate+0x25e/0x480 [ttm] [ 250.656880] ? kmalloc_large_node+0x37/0x60 [ 250.656883] ? __kmalloc_node+0x204/0x2a0 [ 250.656891] ttm_populate_and_map_pages+0x24/0x250 [ttm] [ 250.656899] ttm_tt_populate.part.9+0x1b/0x60 [ttm] [ 250.656907] ttm_tt_bind+0x42/0x60 [ttm] [ 250.656915] ttm_bo_handle_move_mem+0x258/0x4e0 [ttm] [ 250.656995] ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu] [ 250.657003] ttm_bo_validate+0xe7/0x110 [ttm] [ 250.657079] ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu] [ 250.657105] ? 
drm_vma_offset_add+0x46/0x50 [drm] [ 250.657113] ttm_bo_init_reserved+0x342/0x380 [ttm] [ 250.657189] amdgpu_bo_do_create+0x19c/0x400 [amdgpu] [ 250.657266] ? amdgpu_bo_subtract_pin_size+0x50/0x50 [amdgpu] [ 250.657269] ? try_to_wake_up+0x44/0x450 [ 250.657343] amdgpu_bo_create+0x30/0x200 [amdgpu] [ 250.657349] ? cpumask_next_wrap+0x2c/0x70 [ 250.657352] ? sched_clock_cpu+0xc/0xb0 [ 250.657355] ? select_idle_sibling+0x293/0x3a0 [ 250.657431] amdgpu_gem_object_create+0x8b/0x110 [amdgpu] [ 250.657509] amdgpu_gem_create_ioctl+0x1d0/0x290 [amdgpu] [ 250.657516] ? tracing_record_taskinfo_skip+0x40/0x50 [ 250.657518] ? tracing_record_taskinfo+0xe/0xa0 [ 250.657594] ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu] [ 250.657614] drm_ioctl_kernel+0x7f/0xd0 [drm] [ 250.657619] ? sock_sendmsg+0x30/0x40 [ 250.657639] drm_ioctl+0x1e4/0x380 [drm] [ 250.657715] ? amdgpu_gem_object_close+0x1c0/0x1c0 [amdgpu] [ 250.657720] ? do_futex+0x2a1/0xa30 [ 250.657802] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [ 250.657828] do_vfs_ioctl+0x8d/0x5d0 [ 250.657832] ? __x64_sys_futex+0x133/0x15b [ 250.657835] ksys_ioctl+0x60/0x90 [ 250.657838] __x64_sys_ioctl+0x16/0x20 [ 250.657842] do_syscall_64+0x4a/0xd0 [ 250.657845] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 250.657849] RIP: 0033:0x7f12b52dc747 [ 250.657855] Code: Bad RIP value. [ 250.657856] RSP: 002b:00007ffceccab168 EFLAGS: 00003246 ORIG_RAX: 0000000000000010 [ 250.657860] RAX: ffffffffffffffda RBX: 00007ffceccab250 RCX: 00007f12b52dc747 [ 250.657861] RDX: 00007ffceccab1c0 RSI: 00000000c0206440 RDI: 000000000000000e [ 250.657863] RBP: 00007ffceccab1c0 R08: 0000559b8f644890 R09: 00007f12b53a7cb0 [ 250.657864] R10: 0000559b8e72a010 R11: 0000000000003246 R12: 00000000c0206440 [ 250.657865] R13: 000000000000000e R14: 0000559b8e7bf500 R15: 0000559b8f644890 Also I noticed some more bad behavior. When I copied hundreds of gigabytes from a SSD block device to a XFS file system on a HDD, I got _severe_ problems with tasks hanging. They got stuck in something like io_scheduler (I don't think I have the log anymore; could probably reproduce if needed). This was also a "desktop randomly freezes on heavy background I/O". Although the freezes were worse (waiting for up to a minute for small I/O to finish!), it's overall not as bad as the one this bug is about, because most hangs seemed to be about accesses to the same filesystem.
I'm the author of the mentioned tweet. What I was seeing was that on a system with loads of free RAM, XFS reclaiming inodes would randomly block on IO. This is completely unexpected. I do expect that a system under memory and IO pressure ("used" memory, not just "available" memory used for clean caches) will block on IO during allocations that trigger writeback. I do *not* expect that on a system with tons of clean data to evict, but that is what I saw with XFS.

I had a process writing real-time data to disk (on a moderately busy system with gigabytes of free RAM), and even though disk bandwidth was plenty to keep up with the data, I was seeing buffer underruns due to big random latency spikes. After I introduced a process in the pipeline doing up to several gigabytes of RAM buffering to even out the spikes, I was *still* getting the buffer input stuttering for several seconds, breaking the real-time capture. That's where I realized that a kernel pipe buffer allocation was somehow blocking on XFS doing IO.

I would echo 3 > /proc/sys/vm/drop_caches, and latencies would become normal. I would then watch RAM used for caches slowly creep up, and when it hit 95% or so, latency would randomly shoot through the roof again.

This is obviously broken behavior. Allocating from RAM used for caches should **never** block on IO. The system should never slow down because extra RAM is being used for caches. That is just insane. The whole point of using RAM for caches is to utilize otherwise wasted RAM. If this is causing allocations to block on IO when freeing said RAM, randomly causing huge latency spikes for everything, then that is broken.

I've since switched to ext4 on the same disks and haven't had a problem ever since.
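Concretely, the stopgap was just the following (the sync beforehand is only the usual precaution, not something I strictly needed):

-----
# as root: flush dirty data, then drop the page cache, dentries and inodes.
# Latencies go back to normal for a while, until cache usage creeps back up
# towards ~95% and the spikes return.
sync
echo 3 > /proc/sys/vm/drop_caches
-----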
I have also had this problem with XFS; running 4.19 with https://github.com/bobrik/linux/pull/3 ("xfs: add a sysctl to disable memory reclaim participation") and fs.xfs.memory_reclaim = 1 ("async inode reclaim of clean inodes only") has improved things for me.
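In case it helps anyone else trying that branch, this is how the knob can be set (the sysctl only exists with that patch applied; the sysctl.d file name below is arbitrary):

-----
# set at runtime
sysctl -w fs.xfs.memory_reclaim=1

# or make it persistent across reboots
echo 'fs.xfs.memory_reclaim = 1' > /etc/sysctl.d/90-xfs-memory-reclaim.conf
-----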
This problem is still observed on kernel 5.4.x. Extremely annoying. I've stopped using XFS for new partitions.