Bug 201331 - deadlock (XFS?)
Summary: deadlock (XFS?)
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: All
OS: Linux
Importance: P1 high
Assignee: io_md
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-04 23:14 UTC by edo
Modified: 2018-11-30 11:10 UTC
CC: 3 users

See Also:
Kernel Version: 4.20-rc4
Tree: Mainline
Regression: No


Attachments
dmesg output (87.98 KB, text/plain)
2018-10-04 23:14 UTC, edo
/proc/mdstat (862 bytes, text/plain)
2018-10-04 23:16 UTC, edo
xfs_info output (683 bytes, text/plain)
2018-10-04 23:17 UTC, edo
one more dmesg output (21.88 KB, text/plain)
2018-11-19 02:08 UTC, edo
mdadm --examine (9.05 KB, text/plain)
2018-11-28 12:07 UTC, edo
1st dmesg with CONFIG_LOCKDEP=y (19.20 KB, text/plain)
2018-11-28 12:09 UTC, edo
2nd dmesg with CONFIG_LOCKDEP=y (33.01 KB, text/plain)
2018-11-28 12:17 UTC, edo
3rd dmesg with CONFIG_LOCKDEP=y (29.43 KB, text/plain)
2018-11-28 12:18 UTC, edo
4th dmesg with CONFIG_LOCKDEP=y (26.77 KB, text/plain)
2018-11-28 12:18 UTC, edo
dmesg with raid 10 only, CONFIG_LOCKDEP=y (42.21 KB, text/plain)
2018-11-28 14:52 UTC, edo
mdadm --examine (10.16 KB, text/plain)
2018-11-29 00:19 UTC, edo
dmesg output, xfs only (no raid) (47.80 KB, text/plain)
2018-11-29 12:26 UTC, edo

Description edo 2018-10-04 23:14:51 UTC
Created attachment 278927 [details]
dmesg output

I've set up a new server with ten 10 TB disks.
The main volume is XFS over RAID6 (created with mdadm).

The filesystem is currently being filled with data. After several hours of uptime, disk IO freezes with messages like these in the log:
[ 5679.900329] INFO: task tar:18235 blocked for more than 120 seconds.
[ 5679.900404]       Not tainted 4.18.12 #2
...
[ 5679.904044] INFO: task kworker/u24:3:18307 blocked for more than 120 seconds.
[ 5679.904137]       Not tainted 4.18.12 #2
and so on.

I'm unsure, but it seems to be XFS-related.
Comment 1 edo 2018-10-04 23:16:45 UTC
Created attachment 278929 [details]
/proc/mdstat
Comment 2 edo 2018-10-04 23:17:49 UTC
Created attachment 278931 [details]
xfs_info output
Comment 3 edo 2018-10-04 23:18:17 UTC
I did some RAID tuning:
echo 32768 >  /sys/block/md3/md/stripe_cache_size
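For context, the raid5/6 stripe cache pins memory in proportion to this value. A rough back-of-the-envelope sketch of the cost (an estimate only; the kernel's actual per-stripe accounting is more detailed, and the numbers assume 4 KiB pages and the 10-disk array used here):

```shell
# Rough estimate of memory pinned by the raid5/6 stripe cache:
# stripe_cache_size entries x PAGE_SIZE x number of member devices.
stripe_cache_size=32768   # the value written above
page_kib=4                # PAGE_SIZE in KiB on x86-64
disks=10                  # members of md3
echo "$((stripe_cache_size * page_kib * disks / 1024)) MiB"
```

With these numbers the cache can pin on the order of 1.25 GiB, which is tolerable on a 128 GB machine but worth keeping in mind when raising the tunable.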
Comment 4 edo 2018-10-04 23:25:49 UTC
I tested with the 4.17 and 4.18 prebuilt Debian kernels; the behavior is the same:
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218388] INFO: task kworker/u24:0:21848 blocked for more than 120 seconds.
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218495]       Not tainted 4.18.0-0.bpo.1-amd64 #1 Debian 4.18.6-1~bpo9+1
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218593] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218712] kworker/u24:0   D    0 21848      2 0x80000000
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218814] Workqueue: writeback wb_workfn (flush-9:3)
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218910] Call Trace:
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219005]  ? __schedule+0x3f5/0x880
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219096]  schedule+0x32/0x80
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219192]  bitmap_startwrite+0x161/0x1e0 [md_mod]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219291]  ? remove_wait_queue+0x60/0x60
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219388]  add_stripe_bio+0x441/0x7d0 [raid456]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219484]  raid5_make_request+0x1ae/0xb10 [raid456]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219580]  ? remove_wait_queue+0x60/0x60
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219675]  ? blk_queue_split+0x222/0x5e0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219770]  md_handle_request+0x116/0x190 [md_mod]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219867]  md_make_request+0x65/0x160 [md_mod]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.219962]  generic_make_request+0x1e7/0x410
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220058]  ? submit_bio+0x6c/0x140
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220148]  submit_bio+0x6c/0x140
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220294]  xfs_add_to_ioend+0x14c/0x280 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220415]  ? xfs_map_buffer.isra.14+0x37/0x70 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220534]  xfs_do_writepage+0x2bb/0x680 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220632]  ? clear_page_dirty_for_io+0x20c/0x2a0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220727]  write_cache_pages+0x1ed/0x430
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220852]  ? xfs_add_to_ioend+0x280/0x280 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.220971]  xfs_vm_writepages+0x64/0xa0 [xfs]
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221068]  do_writepages+0x1a/0x60
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221161]  __writeback_single_inode+0x3d/0x320
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221255]  writeback_sb_inodes+0x221/0x4b0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221349]  __writeback_inodes_wb+0x87/0xb0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221442]  wb_writeback+0x288/0x320
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221534]  ? cpumask_next+0x16/0x20
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221626]  ? wb_workfn+0x37c/0x450
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221717]  wb_workfn+0x37c/0x450
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221811]  process_one_work+0x191/0x370
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221904]  worker_thread+0x4f/0x3b0
Sep 30 16:01:23 storage10x10n1 kernel: [23683.221995]  kthread+0xf8/0x130
Sep 30 16:01:23 storage10x10n1 kernel: [23683.222086]  ? rescuer_thread+0x340/0x340
Sep 30 16:01:23 storage10x10n1 kernel: [23683.222179]  ? kthread_create_worker_on_cpu+0x70/0x70
Sep 30 16:01:23 storage10x10n1 kernel: [23683.222276]  ret_from_fork+0x1f/0x40
Comment 5 Dave Chinner 2018-10-05 01:06:15 UTC
On Thu, Oct 04, 2018 at 11:25:49PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=201331

> 
> --- Comment #4 from edo (edo.rus@gmail.com) ---
> I tested with 4.17 and 4.18 prebuilt Debian kernels, behavior is the same:
> Sep 30 16:01:23 storage10x10n1 kernel: [23683.218388] INFO: task
> kworker/u24:0:21848 blocked for more than 120 seconds.

I think we need to rename XFS to "The Messenger: Please don't shoot
me"... :)

From the xfs_info:

sunit=4096   swidth=32768 blks

Ok, that looks wrong - why do you have an MD raid device with a
16MB stripe unit and a 128MB stripe width?

Yup:

md3 : active raid6 sda4[0] sdj4[9] sdg4[6] sdd4[3] sdi4[8] sdf4[5] sde4[4] sdh4[7] sdb4[2] sdc4[1]
      77555695616 blocks super 1.2 level 6, 16384k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
            bitmap: 9/73 pages [36KB], 65536KB chunk

You've configured your RAID6 device with a 16MB chunk size, which
gives the XFS su/sw noted above.

Basically, you RMW'd your RAID device to death because every write
is a sub-stripe write.
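The su/sw values XFS reports follow directly from the MD geometry. A quick sketch of the arithmetic (assuming the 4 KiB filesystem block size and 8 data disks of this array):

```shell
# Derive XFS sunit/swidth (in filesystem blocks) from the MD chunk size.
chunk_kib=16384          # MD chunk size: 16 MiB
data_disks=$((10 - 2))   # RAID6: 10 members minus 2 parity
fsblock_kib=4            # XFS filesystem block size: 4 KiB

sunit=$((chunk_kib / fsblock_kib))   # stripe unit in fs blocks
swidth=$((sunit * data_disks))       # stripe width in fs blocks
echo "sunit=$sunit swidth=$swidth blks"
```

This reproduces the sunit=4096 swidth=32768 figures from the xfs_info output quoted above.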

>  Workqueue: writeback wb_workfn (flush-9:3)
>  Call Trace:
>   schedule+0x32/0x80
>  bitmap_startwrite+0x161/0x1e0 [md_mod]

MD blocks here when it has too many inflight bitmap updates and so
waits for IO to complete before starting another. This isn't XFS
filesystem IO - this is internal MD RAID consistency information
that it needs to write for crash recovery purposes.

This will be a direct result of the raid device configuration....

>  add_stripe_bio+0x441/0x7d0 [raid456]
>  raid5_make_request+0x1ae/0xb10 [raid456]
>  md_handle_request+0x116/0x190 [md_mod]
>  md_make_request+0x65/0x160 [md_mod]
>  generic_make_request+0x1e7/0x410
>   submit_bio+0x6c/0x140
>  xfs_add_to_ioend+0x14c/0x280 [xfs]
>  xfs_do_writepage+0x2bb/0x680 [xfs]
>  write_cache_pages+0x1ed/0x430
>  xfs_vm_writepages+0x64/0xa0 [xfs]
>   do_writepages+0x1a/0x60
>  __writeback_single_inode+0x3d/0x320
>  writeback_sb_inodes+0x221/0x4b0
>  __writeback_inodes_wb+0x87/0xb0
>   wb_writeback+0x288/0x320
>   wb_workfn+0x37c/0x450

... and this is just the writeback path - your problem has nothing
do with XFS...

Cheers,

Dave.
Comment 6 Carlos Maiolino 2018-10-05 09:08:00 UTC
Edo, I'm closing this bug.

There is no evidence of a bug in what you described, just a poorly configured storage array.

As Dave mentioned, you are RMWing your storage array to death. Please fix your storage array configuration, and feel free to reopen the bug if you still hit it with a proper configuration; for now, what you have is a large number of huge RMW cycles.
Comment 7 Carlos Maiolino 2018-10-05 09:11:41 UTC
D'oh, I have no permission to close MD bugs, but well, as Dave mentioned, your problem is not a bug.
Comment 8 edo 2018-10-05 09:18:01 UTC
> Basically, your RMW'd your RAID device to death because every write
> is a sub-stripe write.

Why is it bad?
Even with the default 512k stripe, almost every write is a sub-stripe write.
Anyway, a system lockup until reset is an error, isn't it?
Comment 9 Carlos Maiolino 2018-10-05 10:15:47 UTC
(In reply to edo from comment #8)
> > Basically, your RMW'd your RAID device to death because every write
> > is a sub-stripe write.
> 
> Why is it bad?
> Even with the default 512k stripe, almost every write is a sub-stripe write.

512k * 8 = 4MiB

A sub-stripe write will require 4MiB + parity (which IIRC will be another 1MiB), so a total of 5MiB needs to be read, modified, and written for each undersized write.

For a 16MiB stripe:

16MiB * 8 data disks = 128MiB, plus 32MiB for the parity chunks; so for every undersized write, you have 160MiB of data read, modified, and written back to the array.

Multiply that by several files, and you will RMW your array to death really fast, especially if your workload is mostly undersized IO.
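Using the model above (every data chunk plus the parity chunks is read, modified, and written on a sub-stripe write), the amplification can be sketched as follows. This is the worst-case accounting from this comment, not a claim about what MD always does; as noted later in the thread, MD can do partial chunk updates:

```shell
# Worst-case RMW cost per sub-stripe write under the model above:
# (data disks + parity disks) x chunk size, all read, modified, rewritten.
rmw_mib() {  # $1 = chunk size in KiB; assumes 8 data + 2 parity disks
    local chunk_kib=$1
    echo "$((chunk_kib * (8 + 2) / 1024)) MiB"
}
rmw_mib 512     # default 512 KiB chunk: 5 MiB touched
rmw_mib 16384   # 16 MiB chunk: 160 MiB touched
```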



> Anyway system lock till reset is an error, isn't it?

Your system is not locked up, it's just really slow due to the amount of time being spent waiting for IO completion.
Comment 10 edo 2018-10-05 16:39:55 UTC
> A sub-stripe write will require 4MiB + parity (which IIRC will be another 1MiB), so a total of 5MiB needs to be read, modified, and written for each undersized write.

I'm pretty sure md raid can do partial updates of chunks.

I just ran a 4KiB random-write test (fio) on a freshly created array with a 4KiB chunk size, then again with a 16MiB chunk size - no difference in IOPS on the raid device (observed in fio), nor in IOPS and transfer speed on the raid members (observed in iostat).
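For reproducibility, a fio job along these lines could drive such a test (a hypothetical job file: the device path, runtime, and queue depth are assumptions, and running it against a device overwrites its contents):

```ini
; hypothetical job approximating the 4 KiB random-write test described above
; WARNING: this writes directly to the block device and destroys its data
[randwrite-4k]
filename=/dev/md3
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=32
runtime=60
time_based=1
```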


> Your system is not locked up, it's just really slow due the amount of time being spent waiting for IO completion.

No.
The array was filling at a good speed (80-250MiB/s, limited by the network) for several hours after system startup.
Then writes stopped unexpectedly (zeros in iostat) and "blocked for more than 120 seconds" messages arrived. I could only resume filling the disks with a hardware reset.
This behavior repeated many times; I tried several kernel versions and kernel options.
Comment 11 Eric Sandeen 2018-10-05 17:09:39 UTC
One thing that's kind of weird is this:

[ 1679.494859] md: md2: resync done.
[ 5679.900329] INFO: task tar:18235 blocked for more than 120 seconds.

almost exactly 4000 seconds?  Maybe a coincidence.

The message from md's bitmap_startwrite has almost the same timestamp, too:

[ 5679.904044] INFO: task kworker/u24:3:18307 blocked for more than 120 seconds.

md is scheduled out here:

                if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) {
                        DEFINE_WAIT(__wait);
                        /* note that it is safe to do the prepare_to_wait
                         * after the test as long as we do it before dropping
                         * the spinlock.
                         */
                        prepare_to_wait(&bitmap->overflow_wait, &__wait,
                                        TASK_UNINTERRUPTIBLE);
                        spin_unlock_irq(&bitmap->counts.lock);
                        schedule();
                        finish_wait(&bitmap->overflow_wait, &__wait);
                        continue;
                }

So md is waiting to be woken up when the bitmap writer finishes.  Details aside, I really do think that xfs is the victim/messenger here; we should at least try to get some md eyes on this one as well.
Comment 12 edo 2018-10-05 19:54:25 UTC
> almost exactly 4000 seconds?  Maybe a coincidence.

Definitely coincidence, another log:

Sep 30 09:26:49 storage10x10n1 kernel: [   10.344316] IPv6: ADDRCONF(NETDEV_CHANGE): enp4s0: link becomes ready
Sep 30 09:26:52 storage10x10n1 kernel: [   12.380310] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Sep 30 09:26:52 storage10x10n1 kernel: [   12.380728] NFSD: starting 90-second grace period (net f00000a0)
Sep 30 11:13:45 storage10x10n1 kernel: [ 6425.580208] perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Sep 30 11:56:10 storage10x10n1 kernel: [ 8970.022413] perf: interrupt took too long (3180 > 3157), lowering kernel.perf_event_max_sample_rate to 62750
Sep 30 12:58:42 storage10x10n1 kernel: [12722.572449] perf: interrupt took too long (3984 > 3975), lowering kernel.perf_event_max_sample_rate to 50000
Sep 30 15:36:21 storage10x10n1 kernel: [22180.854770] perf: interrupt took too long (4983 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
Sep 30 16:01:23 storage10x10n1 kernel: [23683.218388] INFO: task kworker/u24:0:21848 blocked for more than 120 seconds.
Comment 13 Dave Chinner 2018-10-05 23:55:16 UTC
(In reply to edo from comment #10)
> > A substripe write will require 4MiB + parity (which IIRC will be another
> 1024MiB) so a total of 5MiB needs to be Read, modified and written for each
> undersized write.
> 
> I'm pretty sure, md raid can do partial update for chunks.

Yes, it can, but it's still got to do an RMW on a 16MB chunk with your array configuration. ISTR that the bitmap updates come from partial updates being run to indicate the stripe is inconsistent while it rewrites the new data and the new parity information. i.e. a partial chunk update still requires an RMW of the chunk and an RMW of the parity chunk for that stripe.

> No.
> Array was filling with good speed (80-250Mib/s, limited by network) several
> hours after system startup.
> Unexpectedly write stopped (zeros in iostat) and messages "blocked for more
> than 120 seconds" arrive. I managed to continue disk filling with hardware
> reset only.
> Such behavior repeated many times, I tried several kernel versions and
> kernel options.

And it's stuck in MD waiting for a bitmap update to complete. Either your hardware has lost an IO, or there's a wakeup race in MD and so everything is blocked waiting for an IO completion that will never arrive. You say it's a new server you've just set up - that means there is a good chance that this "lost IO" is actually a hardware problem and not actually an MD issue....

-Dave.
Comment 14 edo 2018-10-07 14:40:38 UTC
I've got one more freeze.

I did some tests:
 - other MD devices work correctly;
 - reads from the affected MD seem to work without problems (but it is too big to do a full check);
 - ls on the root directory works, but hangs on nested directories;
 - /proc/meminfo shows about 16GB of dirty buffers (128GB of memory total), but no writes to the affected MD device.
Comment 15 edo 2018-10-15 16:32:11 UTC
(In reply to Eric Sandeen from comment #11)
> I really do think that xfs is the victim/messenger here

I tried on another server with the same settings.
XFS: freezes after several hours.
EXT4: works without freezes for at least 5 days.

What can I do to help fix this bug?
Comment 16 edo 2018-11-19 02:08:38 UTC
Created attachment 279509 [details]
one more dmesg output

Still having deadlocks under load (torrents); dmesg output is attached.
Comment 17 edo 2018-11-23 17:42:47 UTC
Deadlocks happened 8 times in the last 5 days.
It is definitely load-related (torrents downloading/uploading at 20-40Mb/s cause a deadlock within a couple of hours).

Every time, the kernel reports "blocked for more than 120 seconds" for several tasks, then 2 minutes later for several more tasks.
Should the tasks in the first batch be suspected?
md1_raid10/md2_raid10/md3_raid6 are always present, so I guess this issue is related to the MD code.

One time I got such sequence during kernel compilation on md2:
Nov 19 07:33:31 somehost [ 9183.917260] INFO: task md2_raid10:373 blocked for more than 120 seconds.
Nov 19 07:33:31 somehost [ 9183.918898] INFO: task objtool:22492 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.745197] INFO: task md3_raid6:371 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.747433] INFO: task md2_raid10:373 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.749393] INFO: task xfsaild/md2:406 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.752414] INFO: task xfsaild/md3:757 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.755797] INFO: task avfsd:1406 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.767090] INFO: task nginx:1249 blocked for more than 120 seconds.
Nov 19 07:35:32 somehost [ 9304.770479] INFO: task kworker/u24:1:9469 blocked for more than 120 seconds.

No md3_raid6 and no xfsaild in the first batch - does that mean the issue isn't related to XFS and isn't related to the RAID6 stripe size?
Comment 18 edo 2018-11-28 12:07:02 UTC
Created attachment 279693 [details]
mdadm --examine
Comment 19 edo 2018-11-28 12:09:58 UTC
Created attachment 279701 [details]
1st dmesg with CONFIG_LOCKDEP=y
Comment 20 edo 2018-11-28 12:17:38 UTC
Created attachment 279703 [details]
2nd dmesg with CONFIG_LOCKDEP=y
Comment 21 edo 2018-11-28 12:18:01 UTC
Created attachment 279705 [details]
3rd dmesg with CONFIG_LOCKDEP=y
Comment 22 edo 2018-11-28 12:18:39 UTC
Created attachment 279707 [details]
4th dmesg with CONFIG_LOCKDEP=y
Comment 23 edo 2018-11-28 14:52:00 UTC
Created attachment 279711 [details]
dmesg with raid 10 only, CONFIG_LOCKDEP=y
Comment 24 edo 2018-11-28 15:36:43 UTC
Comment on attachment 279711 [details]
dmesg with raid 10 only, CONFIG_LOCKDEP=y

I found a way to quickly reproduce freezing (on my system).

1. Start resync of raid-10 array
echo 'repair' >/sys/block/md2/md/sync_action
2. Start kernel rebuilding on the same array
make -j12 clean && make -j12

raid6 is not involved in this process, so the issue is definitely not raid6-related
(I disabled mounting the fs on raid6)
Comment 26 edo 2018-11-29 00:19:50 UTC
Created attachment 279727 [details]
mdadm --examine
Comment 27 edo 2018-11-29 12:26:11 UTC
Created attachment 279737 [details]
dmesg output, xfs only (no raid)

This simple command freezes the kernel with XFS on an HDD (no RAID at all!):
for ((A=100;A<200;A++)); do [ -d $A ] || mkdir $A; (cd $A; tar -xz < /linux-4.20-rc4.tar.gz )& done

With EXT4 it works as it should.
On an SSD this command doesn't cause a freeze with XFS, but XFS is much slower than EXT4 there (~400 seconds vs ~300 seconds).
Comment 28 Dave Chinner 2018-11-30 02:41:45 UTC
XFS is stuck waiting for log IO completion and everything is backed up waiting for that to happen.

Remember what I said back in comment #13? Quote:

| And it's stuck in MD waiting for a bitmap update to complete. Either your
| hardware has lost an IO, or there's a wakeup race in MD and so everything
| is blocked waiting for an IO completion that will never arrive. You say
| it's a new server you've just set up - that means there is a good chance
| that this "lost IO" is actually a hardware problem and it's not actually
| a MD issue....

Well, you now have XFS stuck waiting for log IO to complete. So we have the situation where either you have hardware that is losing an IO, or both MD and XFS have bugs in them across multiple kernels that look like lost IOs.

This looks and smells like you have a hardware problem, not a kernel problem. If you haven't already done so, please update your drivers and the firmware for your hardware to the latest versions.

-Dave.
Comment 29 edo 2018-11-30 11:10:38 UTC
> This looks and smells like you have a hardware problem, not a kernel problem.

The last dmesg is from a completely different machine.
To me it looks like a race condition may occur if some IO takes longer than usual.
And it seems that EXT4 is not affected by this issue.
