Bug 201331
| Summary: | deadlock (XFS?) | | |
|---|---|---|---|
| Product: | File System | Reporter: | edo (edo.rus) |
| Component: | XFS | Assignee: | io_md |
| Status: | NEW | | |
| Severity: | high | CC: | cmaiolino, david, sandeen |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.20-rc4 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Attachments:
- dmesg output
- /proc/mdstat
- xfs_info output
- one more dmesg output
- mdadm --examine
- 1st dmesg with CONFIG_LOCKDEP=y
- 2nd dmesg with CONFIG_LOCKDEP=y
- 3rd dmesg with CONFIG_LOCKDEP=y
- 4th dmesg with CONFIG_LOCKDEP=y
- dmesg with raid 10 only, CONFIG_LOCKDEP=y
- mdadm --examine
- dmesg output, xfs only (no raid)
Created attachment 278929 [details]
/proc/mdstat
Created attachment 278931 [details]
xfs_info output
I have made some RAID tuning:
echo 32768 > /sys/block/md3/md/stripe_cache_size

I tested with 4.17 and 4.18 prebuilt Debian kernels, the behavior is the same:

Sep 30 16:01:23 storage10x10n1 kernel: [23683.218388] INFO: task kworker/u24:0:21848 blocked for more than 120 seconds.
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218495] Not tainted 4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218593] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218712] kworker/u24:0 D 0 21848 2 0x80000000
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218814] Workqueue: writeback wb_workfn (flush-9:3)
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218910] Call Trace:
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219005] ? __schedule+0x3f5/0x880
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219096] schedule+0x32/0x80
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219192] bitmap_startwrite+0x161/0x1e0 [md_mod]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219291] ? remove_wait_queue+0x60/0x60
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219388] add_stripe_bio+0x441/0x7d0 [raid456]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219484] raid5_make_request+0x1ae/0xb10 [raid456]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219580] ? remove_wait_queue+0x60/0x60
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219675] ? blk_queue_split+0x222/0x5e0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219770] md_handle_request+0x116/0x190 [md_mod]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219867] md_make_request+0x65/0x160 [md_mod]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219962] generic_make_request+0x1e7/0x410
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220058] ? submit_bio+0x6c/0x140
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220148] submit_bio+0x6c/0x140
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220294] xfs_add_to_ioend+0x14c/0x280 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220415] ? xfs_map_buffer.isra.14+0x37/0x70 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220534] xfs_do_writepage+0x2bb/0x680 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220632] ? clear_page_dirty_for_io+0x20c/0x2a0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220727] write_cache_pages+0x1ed/0x430
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220852] ? xfs_add_to_ioend+0x280/0x280 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220971] xfs_vm_writepages+0x64/0xa0 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221068] do_writepages+0x1a/0x60
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221161] __writeback_single_inode+0x3d/0x320
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221255] writeback_sb_inodes+0x221/0x4b0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221349] __writeback_inodes_wb+0x87/0xb0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221442] wb_writeback+0x288/0x320
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221534] ? cpumask_next+0x16/0x20
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221626] ? wb_workfn+0x37c/0x450
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221717] wb_workfn+0x37c/0x450
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221811] process_one_work+0x191/0x370
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221904] worker_thread+0x4f/0x3b0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221995] kthread+0xf8/0x130
Sep 30 16:01:23 storage10x10n1 kernel: [23683.222086] ? rescuer_thread+0x340/0x340
Sep 30 16:01:23 storage10x10n1 kernel: [23683.222179] ? kthread_create_worker_on_cpu+0x70/0x70
Sep 30 16:01:23 storage10x10n1 kernel: [23683.222276] ret_from_fork+0x1f/0x40

On Thu, Oct 04, 2018 at 11:25:49PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201331
>
> --- Comment #4 from edo (edo.rus@gmail.com) ---
> I tested with 4.17 and 4.18 prebuilt Debian kernels, behavior is the same:
> Sep 30 16:01:23 storage10x10n1 kernel: [23683.218388] INFO: task
> kworker/u24:0:21848 blocked for more than 120 seconds.

I think we need to rename XFS to "The Messenger: Please don't shoot me"... :)

From the xfs_info:

sunit=4096 swidth=32768 blks

Ok, that looks wrong - why do you have an MD raid device with a 16MB stripe unit and a 128MB stripe width? Yup:

md3 : active raid6 sda4[0] sdj4[9] sdg4[6] sdd4[3] sdi4[8] sdf4[5] sde4[4] sdh4[7] sdb4[2] sdc4[1]
      77555695616 blocks super 1.2 level 6, 16384k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      bitmap: 9/73 pages [36KB], 65536KB chunk

You've configured your RAID6 device with a 16MB chunk size, which gives the XFS su/sw noted above. Basically, you RMW'd your RAID device to death because every write is a sub-stripe write.

> Workqueue: writeback wb_workfn (flush-9:3)
> Call Trace:
> schedule+0x32/0x80
> bitmap_startwrite+0x161/0x1e0 [md_mod]

MD blocks here when it has too many in-flight bitmap updates and so waits for IO to complete before starting another. This isn't XFS filesystem IO - this is internal MD RAID consistency information that it needs to write for crash recovery purposes. This will be a direct result of the RAID device configuration....

> add_stripe_bio+0x441/0x7d0 [raid456]
> raid5_make_request+0x1ae/0xb10 [raid456]
> md_handle_request+0x116/0x190 [md_mod]
> md_make_request+0x65/0x160 [md_mod]
> generic_make_request+0x1e7/0x410
> submit_bio+0x6c/0x140
> xfs_add_to_ioend+0x14c/0x280 [xfs]
> xfs_do_writepage+0x2bb/0x680 [xfs]
> write_cache_pages+0x1ed/0x430
> xfs_vm_writepages+0x64/0xa0 [xfs]
> do_writepages+0x1a/0x60
> __writeback_single_inode+0x3d/0x320
> writeback_sb_inodes+0x221/0x4b0
> __writeback_inodes_wb+0x87/0xb0
> wb_writeback+0x288/0x320
> wb_workfn+0x37c/0x450

... and this is just the writeback path - your problem has nothing to do with XFS...

Cheers, Dave.

Edo, I'm closing this bug. There is no evidence of a bug in what you described, just a poorly configured storage array. As Dave mentioned, you are RMWing your storage array to death. Please fix your storage array configuration, and feel free to reopen the bug if you still hit it with a proper storage configuration, but for now, what you have is a large number of huge RMW cycles.

D'oh, I have no permission to close MD bugs, but well, as Dave mentioned, your problem is not a bug.

> Basically, you RMW'd your RAID device to death because every write
> is a sub-stripe write.
Why is it bad?
Even with the default 512k stripe, almost every write is a sub-stripe write.
Anyway, a system lockup until reset is an error, isn't it?
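For reference, the MD chunk size and the XFS stripe geometry derived from it (the sunit/swidth Dave quotes above) can be cross-checked with commands along these lines; /dev/md3 is the device from this report, the mount point is assumed:

    mdadm --detail /dev/md3 | grep -i 'chunk size'    # MD chunk size (16384K in this report)
    xfs_info /mnt/md3 | grep sunit                    # XFS sunit/swidth computed from that geometry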
(In reply to edo from comment #8)

> > Basically, you RMW'd your RAID device to death because every write
> > is a sub-stripe write.
>
> Why is it bad?
> Even with the default 512k stripe, almost every write is a sub-stripe write.

512k * 8 = 4MiB. A sub-stripe write will require 4MiB + parity (which IIRC will be another 1MiB), so a total of 5MiB needs to be read, modified and written for each undersized write.

For a 16MiB chunk: 16MiB * 8 data disks = 128MiB, plus 32MiB for the parity chunks, so for every undersized write you need to have 160MiB of data read, modified and written back to the array. Multiply that by several files, and you will RMW your array to death really fast, especially if your workload is mostly undersized IO.

> Anyway, a system lockup until reset is an error, isn't it?

Your system is not locked up, it's just really slow due to the amount of time being spent waiting for IO completion.

> A sub-stripe write will require 4MiB + parity (which IIRC will be another 1MiB), so a total of 5MiB needs to be read, modified and written for each undersized write.

I'm pretty sure md raid can do a partial update of chunks. I just ran a 4KiB random write test (fio) on a freshly created array with a 4KiB chunk size, and after that with a 16MiB chunk size - no difference in terms of iops on the raid device (observed in fio), or in iops and transfer speed on the raid members (observed in iostat).

> Your system is not locked up, it's just really slow due to the amount of time being spent waiting for IO completion.

No. The array was filling at good speed (80-250MiB/s, limited by the network) for several hours after system startup. Unexpectedly writes stopped (zeros in iostat) and the "blocked for more than 120 seconds" messages arrived. I managed to continue filling the disks only with a hardware reset. Such behavior repeated many times; I tried several kernel versions and kernel options.

One thing that's kind of weird is this:

[ 1679.494859] md: md2: resync done.
[ 5679.900329] INFO: task tar:18235 blocked for more than 120 seconds.

almost exactly 4000 seconds? Maybe a coincidence. The message from md's bitmap_startwrite has almost the same timestamp, too:

[ 5679.904044] INFO: task kworker/u24:3:18307 blocked for more than 120 seconds.

md is scheduled out here:

    if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
        DEFINE_WAIT(__wait);
        /* note that it is safe to do the prepare_to_wait
         * after the test as long as we do it before dropping
         * the spinlock.
         */
        prepare_to_wait(&bitmap->overflow_wait, &__wait,
                        TASK_UNINTERRUPTIBLE);
        spin_unlock_irq(&bitmap->counts.lock);
        schedule();
        finish_wait(&bitmap->overflow_wait, &__wait);
        continue;
    }

So md is waiting to be woken up when the bitmap writer finishes. Details aside, I really do think that xfs is the victim/messenger here; we should at least try to get some md eyes on this one as well.

> almost exactly 4000 seconds? Maybe a coincidence.
Definitely a coincidence - here is another log:
Sep 30 09:26:49 storage10x10n1 kernel: [ 10.344316] IPv6: ADDRCONF(NETDEV_CHANGE): enp4s0: link becomes ready
Sep 30 09:26:52 storage10x10n1 kernel: [ 12.380310] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Sep 30 09:26:52 storage10x10n1 kernel: [ 12.380728] NFSD: starting 90-second grace period (net f00000a0)
Sep 30 11:13:45 storage10x10n1 kernel: [ 6425.580208] perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Sep 30 11:56:10 storage10x10n1 kernel: [ 8970.022413] perf: interrupt took too long (3180 > 3157), lowering kernel.perf_event_max_sample_rate to 62750
Sep 30 12:58:42 storage10x10n1 kernel: [12722.572449] perf: interrupt took too long (3984 > 3975), lowering kernel.perf_event_max_sample_rate to 50000
Sep 30 15:36:21 storage10x10n1 kernel: [22180.854770] perf: interrupt took too long (4983 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218388] INFO: task kworker/u24:0:21848 blocked for more than 120 seconds.
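For context, the 4KiB random-write comparison edo describes in comment #10 could be run with an fio job along these lines; all parameters here are illustrative rather than taken from this report, and writing to the raw MD device destroys its contents:

    # illustrative only: 60s of 4KiB random writes straight to the MD device (destructive),
    # run once against an array built with a 4KiB chunk and once with a 16MiB chunk,
    # with iostat watching the member disks in parallel
    fio --name=chunktest --filename=/dev/md3 --rw=randwrite --bs=4k \
        --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based \
        --group_reporting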
(In reply to edo from comment #10)

> > A sub-stripe write will require 4MiB + parity (which IIRC will be another 1MiB), so a total of 5MiB needs to be read, modified and written for each undersized write.
>
> I'm pretty sure md raid can do a partial update of chunks.

Yes, it can, but it still has to do a RMW on a 16MB chunk with your array configuration. ISTR that the bitmap updates come from partial updates being run to indicate the stripe is inconsistent while it rewrites the new data and the new parity information. i.e. a partial chunk update still requires a RMW of the chunk and a RMW of the parity chunk for that stripe.

> No. The array was filling at good speed (80-250MiB/s, limited by the network) for several
> hours after system startup. Unexpectedly writes stopped (zeros in iostat) and the "blocked
> for more than 120 seconds" messages arrived. I managed to continue filling the disks only
> with a hardware reset. Such behavior repeated many times; I tried several kernel versions
> and kernel options.

And it's stuck in MD waiting for a bitmap update to complete. Either your hardware has lost an IO, or there's a wakeup race in MD and so everything is blocked waiting for an IO completion that will never arrive. You say it's a new server you've just set up - that means there is a good chance that this "lost IO" is actually a hardware problem and it's not actually an MD issue....

-Dave.

I've got one more freeze. I did some tests:
- other MD devices work correctly;
- reads from the affected MD device seem to work without problems (but it is too big to do a full check);
- ls on the root directory works, on nested directories it hangs;
- /proc/meminfo shows about 16GB of dirty buffers (128GB of memory total), but no writes to the affected MD device.

(In reply to Eric Sandeen from comment #11)

> I really do think that xfs is the victim/messenger here

I tried on another server with the same settings.
XFS: freeze after several hours.
EXT4: works without freezes for at least 5 days.

What can I do to help fix this bug?

Created attachment 279509 [details]
one more dmesg output
Still having deadlocks under load (torrents); the dmesg output is attached.
Deadlocks happened 8 times in the last 5 days. It is definitely related to load (torrent downloading/uploading at 20-40MB/s causes a deadlock within a couple of hours). Every time the kernel reports "blocked for more than 120 seconds" for several tasks, and 2 minutes later for several more tasks. Should the tasks in the first batch be the suspects? There is always md1_raid10/md2_raid10/md3_raid6 present, so I guess this issue is related to the MD code.

One time I got this sequence during a kernel compilation on md2:

Nov 19 07:33:31 somehost [ 9183.917260] INFO: task md2_raid10:373 blocked for more than 120 seconds.
Nov 19 07:33:31 somehost [ 9183.918898] INFO: task objtool:22492 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.745197] INFO: task md3_raid6:371 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.747433] INFO: task md2_raid10:373 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.749393] INFO: task xfsaild/md2:406 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.752414] INFO: task xfsaild/md3:757 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.755797] INFO: task avfsd:1406 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.767090] INFO: task nginx:1249 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.770479] INFO: task kworker/u24:1:9469 blocked for more than 120 seconds.

No md3_raid6 and no xfsaild in the first batch - does that mean the issue isn't related to XFS and isn't related to the RAID6 stripe size?
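When the next hang occurs, a full stack dump of every blocked task can be captured for the report with something like this (assuming sysrq support is enabled in the kernel):

    echo 1 > /proc/sys/kernel/sysrq    # enable all sysrq functions
    echo w > /proc/sysrq-trigger       # dump tasks in uninterruptible (D) state to dmesg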
mdadm --examine
Created attachment 279701 [details]
1st dmesg with CONFIG_LOCKDEP=y
Created attachment 279703 [details]
2nd dmesg with CONFIG_LOCKDEP=y
Created attachment 279705 [details]
3rd dmesg with CONFIG_LOCKDEP=y
Created attachment 279707 [details]
4th dmesg with CONFIG_LOCKDEP=y
Created attachment 279711 [details]
dmesg with raid 10 only, CONFIG_LOCKDEP=y
Comment on attachment 279711 [details]
dmesg with raid 10 only, CONFIG_LOCKDEP=y
I found a way to quickly reproduce the freezing (on my system).
1. Start a resync of the RAID-10 array:
echo 'repair' >/sys/block/md2/md/sync_action
2. Start a kernel rebuild on the same array:
make -j12 clean && make -j12
RAID6 is not involved in this process, so the issue is definitely not RAID6-related
(I disabled mounting the fs on the RAID6 array.)
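While reproducing this, the state of the repair can be watched from another terminal with something like the following (md2 is the device used above):

    watch -n1 cat /proc/mdstat            # resync/repair progress
    cat /sys/block/md2/md/sync_action     # current sync action (repair/resync/idle)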
Created attachment 279727 [details]
mdadm --examine
Created attachment 279737 [details]
dmesg output, xfs only (no raid)
This simple command freezes the kernel with XFS on an HDD (no RAID at all!):
for ((A=100;A<200;A++)); do [ -d $A ] || mkdir $A; (cd $A; tar -xz < /linux-4.20-rc4.tar.gz )& done
With EXT4 it works as it should.
On an SSD this command doesn't cause a freeze with XFS, but XFS is much slower than EXT4 here (~400 seconds vs ~300 seconds).
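A slightly more explicit form of the same reproducer, assuming the tarball path from the report and a current directory on the XFS filesystem under test:

    # unpack the kernel tarball into 100 directories in parallel
    for A in $(seq 100 199); do
        mkdir -p "$A"
        ( cd "$A" && tar -xz < /linux-4.20-rc4.tar.gz ) &
    done
    wait    # wait for all background extractions to finish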
XFS is stuck waiting for log IO completion and everything is backed up waiting for that to happen. Remember what I said back in comment #13? Quote:

| And it's stuck in MD waiting for a bitmap update to complete. Either your
| hardware has lost an IO, or there's a wakeup race in MD and so everything
| is blocked waiting for an IO completion that will never arrive. You say
| it's a new server you've just set up - that means there is a good chance
| that this "lost IO" is actually a hardware problem and it's not actually
| an MD issue....

Well, you now have XFS stuck waiting for log IO to complete. So we have a situation where either you have hardware that is losing IOs, or both MD and XFS have bugs in them across multiple kernels that look like lost IOs.

This looks and smells like you have a hardware problem, not a kernel problem. If you haven't already done so, please update your drivers and the firmware for your hardware to the latest versions.

-Dave.

> This looks and smells like you have a hardware problem, not a kernel problem.
The last dmesg is from a completely different machine.
To me it looks like a race condition may occur if some IO takes longer than usual.
And it seems that EXT4 is not affected by this issue.
Created attachment 278927 [details]
dmesg output

I've set up a new server with ten 10TB disks. The main volume is XFS over RAID6 (created with mdadm). For now the fs is being filled with data.

After several hours of uptime, disk IO freezes with messages like these in the log:

[ 5679.900329] INFO: task tar:18235 blocked for more than 120 seconds.
[ 5679.900404] Not tainted 4.18.12 #2
...
[ 5679.904044] INFO: task kworker/u24:3:18307 blocked for more than 120 seconds.
[ 5679.904137] Not tainted 4.18.12 #2

and so on. I'm unsure, but it seems to be XFS-related.
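For reference, a storage layout like the one described here could be recreated with something along these lines; the 16384K chunk and the sd[a-j]4 members are taken from the mdstat/mdadm attachments, while the mount point is assumed:

    mdadm --create /dev/md3 --level=6 --raid-devices=10 --chunk=16384 /dev/sd[a-j]4
    mkfs.xfs /dev/md3           # mkfs.xfs derives su/sw from the MD geometry automatically
    mount /dev/md3 /mnt/md3     # mount point assumed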