Bug 85581 - btrfs: kworker/u16:2 creates neverending fast writes on almost-full filesystem
Summary: btrfs: kworker/u16:2 creates neverending fast writes on almost-full filesystem
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Josef Bacik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-10-04 08:58 UTC by Alexander E. Patrakov
Modified: 2014-12-16 00:59 UTC (History)
3 users (show)

See Also:
Kernel Version: 3.16, 3.17-rc5
Tree: Mainline
Regression: No


Attachments

Description Alexander E. Patrakov 2014-10-04 08:58:59 UTC
Possible duplicates: bug 65101 (although I don't have BFQ), bug 74761 (although there is no btrfs-level RAID here).

On my laptop, there are two SSDs (/dev/sda and /dev/sdb) in an ISL RAID (driven by mdraid as RAID0). On one partition of the RAID, there is a LUKS volume. That volume, for historical reasons, is used as an LVM PV, and some of the logical volumes inside are formatted as btrfs. This includes the 12GiB logical volume used for /usr/portage shared between multiple Gentoo installations on this laptop.

That particular filesystem was created with mixed data+metadata, and currently btrfs fi df /usr/portage shows:

System, single: total=4.00MiB, used=4.00KiB
Data+Metadata, single: total=12.00GiB, used=11.71GiB
GlobalReserve, single: total=240,00MiB, used=0.00

The mount options in /etc/fstab are:

compress=zlib,ssd,autodefrag,inode_cache

After some multiprocess activity on /usr/portage (as demonstrated by "emerge -av --newuse --deep --update @world"), some kernel thread named "kworker/u16:2" (the number after the colon is different each time) creates huge (70 MB/s read and 70 MB/s write) SSD activity, and uses 15% of the CPU. It does not stop when I stop the "emerge" command, and stops only when I unmount the filesystem. I am afraid that this can physically destroy the SSDs if left unattended.

After "echo t > /proc/sysrq-trigger", I have:

[  925.042814] kworker/u16:2   D ffffffff816ee2d0     0  2256      2 0x00000000
[  925.042819] Workqueue: events_unbound btrfs_async_reclaim_metadata_space
[  925.042821]  ffff88024f20b890 0000000000000046 ffff880084f9c260 ffff88024f20bfd8
[  925.042824]  0000000000014bc0 0000000000014bc0 ffff8802541c8000 ffff880084f9c260
[  925.042827]  ffff88024f20b878 ffffffff810a90eb 0000000000000046 0000000000000004
[  925.042830] Call Trace:
[  925.042833]  [<ffffffff810a90eb>] ? __lock_acquire+0x37b/0x2020
[  925.042835]  [<ffffffff816ee2d0>] ? bit_wait+0x40/0x40
[  925.042838]  [<ffffffff810a884a>] ? mark_held_locks+0x6a/0x90
[  925.042841]  [<ffffffff810d2945>] ? ktime_get+0x105/0x140
[  925.042844]  [<ffffffff810a896d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
[  925.042847]  [<ffffffff8101db59>] ? read_tsc+0x9/0x10
[  925.042849]  [<ffffffff816ee2d0>] ? bit_wait+0x40/0x40
[  925.042852]  [<ffffffff816eda94>] schedule+0x24/0x70
[  925.042854]  [<ffffffff816edd6a>] io_schedule+0x8a/0xd0
[  925.042857]  [<ffffffff816ee2f6>] bit_wait_io+0x26/0x40
[  925.042859]  [<ffffffff816ee0bd>] __wait_on_bit+0x5d/0x90
[  925.042862]  [<ffffffff81140f7f>] wait_on_page_bit+0xbf/0xd0
[  925.042865]  [<ffffffff8109ed30>] ? autoremove_wake_function+0x40/0x40
[  925.042867]  [<ffffffff811410f2>] filemap_fdatawait_range+0x102/0x1a0
[  925.042871]  [<ffffffff812f6187>] btrfs_wait_ordered_range+0x67/0x130
[  925.042875]  [<ffffffff8131ea1d>] __btrfs_write_out_cache+0x35d/0x420
[  925.042878]  [<ffffffff8131edf4>] btrfs_write_out_cache+0x94/0x100
[  925.042882]  [<ffffffff812cc8d3>] btrfs_write_dirty_block_groups+0x503/0x6e0
[  925.042885]  [<ffffffff812da834>] commit_cowonly_roots+0x174/0x230
[  925.042887]  [<ffffffff812dc874>] btrfs_commit_transaction+0x3f4/0x950
[  925.042890]  [<ffffffff812c467d>] flush_space+0x8d/0x4c0
[  925.042893]  [<ffffffff816f2b1c>] ? _raw_spin_unlock+0x2c/0x50
[  925.042896]  [<ffffffff812c4bdc>] btrfs_async_reclaim_metadata_space+0x12c/0x1b0
[  925.042898]  [<ffffffff8107badf>] process_one_work+0x19f/0x4a0
[  925.042900]  [<ffffffff8107ba74>] ? process_one_work+0x134/0x4a0
[  925.042903]  [<ffffffff8107c293>] worker_thread+0x113/0x480
[  925.042905]  [<ffffffff810a8a3d>] ? trace_hardirqs_on+0xd/0x10
[  925.042908]  [<ffffffff8107c180>] ? init_pwq+0xb0/0xb0
[  925.042910]  [<ffffffff81080d2f>] kthread+0xdf/0x100
[  925.042913]  [<ffffffff81080c50>] ? kthread_create_on_node+0x1f0/0x1f0
[  925.042916]  [<ffffffff816f383c>] ret_from_fork+0x7c/0xb0
[  925.042919]  [<ffffffff81080c50>] ? kthread_create_on_node+0x1f0/0x1f0

I have created a disk image with dd. You can download 12 1G-sized parts of it here, including all the data:

http://82.193.153.141/btrfs/

The write-hog is readily reproducible by loop-mounting that image and copying some data inside there:

mount -t btrfs -o compress=zlib,ssd,autodefrag,inode_cache /dev/loop0 /mnt/btrfs
cp inkscape-0.48.5.tar.bz2 /mnt/btrfs/
Comment 1 Christian Burger 2014-11-24 14:38:46 UTC
TL;DR: Had a very similar if not the same problem with 3.15.8 (GrSecurity patch applied); vanilla 3.18.0-rc5 fixes the problem with intensive writes persisting a long time after accessing files; but accessing the file system with `rsync` does still unproportional massive writes.



 (In reply to Alexander E. Patrakov from comment #0)

> of the logical volumes inside are formatted as btrfs. This includes the
> 12GiB logical volume used for /usr/portage shared between multiple Gentoo
> installations on this laptop.

Same here: Gentoo package tree mounted at /usr/portage with btrfs, but actually only around 600 MiB in size. I thought, it couldn't hurt to test the file system there. Turns out, I was wrong. :)

> That particular filesystem was created with mixed data+metadata, and
> currently btrfs fi df /usr/portage shows:

At my end it shows

# btrfs fi df /usr/portage
System, single: total=4.00MiB, used=4.00KiB
Data+Metadata, single: total=596.00MiB, used=572.62MiB
unknown, single: total=12.00MiB, used=0.00

> The mount options in /etc/fstab are:

mount options are: "defaults"

> After some multiprocess activity on /usr/portage (as demonstrated by "emerge
> -av --newuse --deep --update @world"), some kernel thread named

similar scenario: Doing "cave sync" -- an alternative to "emerge sync"

> "kworker/u16:2" (the number after the colon is different each time) creates
> huge (70 MB/s read and 70 MB/s write) SSD activity, and uses 15% of the CPU.

Can confirm (was between 30 and 70 MB/s).

> It does not stop when I stop the "emerge" command, and stops only when I
> unmount the filesystem. 

Can confirm that, too.

> I am afraid that this can physically destroy the
> SSDs if left unattended.

Actually, I can confirm that it's a pretty sure bet, it will trash your SSD. 
Here my two log entries of my `smartd`

| Oct 30 14:22:15 laptop03 smartd[2192]: Device: /dev/sda [SAT], SMART 
| Prefailure Attribute: 177 Wear_Leveling_Count changed from 97 to 96
| [...]
| Nov 13 11:58:25 laptop03 smartd[2191]: Device: /dev/sda [SAT], SMART 
| Prefailure Attribute: 177 Wear_Leveling_Count changed from 79 to 78

My SSD degraded by more that 15 of 100 points in only two weeks, after being 
used for over a year previously. 15 points could be compared 3 to 6 years of 
usage, assuming the number represents degradation linearly; which it probably 
does. 

I am an idiot for not realizing the problem earlier, but I started to 
investigate the problem two weeks after I saw it happen for the first time. When I saw it during this two weeks, I just rebooted the system after I could not figure out quickly why kworker was accessing my disk so much and deferred investigation to a later time. 

I think it took the biggest hits, when a daily maintenance script was running 
during the night which triggered the bug. I cannot remember that I restarted 
the system that often, so the intensive writing might have stopped after some 
time.

> I have created a disk image with dd. You can download 12 1G-sized parts of
> it here, including all the data:

I have an image which is about 600 MB uncompressed and around 120 MB compressed with `xz`. You can get it compressed here for the next 30 days:

http://www.filedropper.com/usrportagebtrfs
sha512sum: 8e364c68a78dab0fada41f7e90d0acaa74cd660561103d1b39fea3c6040c291ec7fe5e6368de5e891cfc6e1d0bd4b659b45ff23fddefa9f1941a8a96412f9e79

Using the latest release candidate 3.18.0-rc5 the intensive writes after accessing the file system disappear and no unmount is required anymore.

But there are still intensive writes during the rsync process; sample output of `iotop -o` during `rsync`:

Total DISK READ :       5.63 M/s | Total DISK WRITE :     108.85 M/s
Actual DISK READ:       5.61 M/s | Actual DISK WRITE:     108.93 M/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                       
945320 be/4 root        5.41 M/s    3.75 K/s  0.00 %  4.08 % rsync --recursive --links --safe~rg/gentoo-portage/ /usr/portage/
944977 be/4 root      176.28 K/s   76.52 M/s  0.00 %  0.28 % [kworker/u16:4]
945311 be/4 root       26.25 K/s   24.05 M/s  0.00 %  0.05 % [kworker/u16:5]
944982 be/4 root       22.50 K/s    5.97 M/s  0.00 %  0.04 % [kworker/u16:10]
945321 be/4 root        0.00 B/s 1361.46 K/s  0.00 %  0.02 % rsync --recursive --links --safe~rg/gentoo-portage/ /usr/portage/
943285 be/4 root        7.50 K/s  975.15 K/s  0.00 %  0.01 % [kworker/u16:0]
945309 be/4 root        0.00 B/s   26.25 K/s  0.00 %  0.00 % [kworker/u16:1]

The rsync process should not need to write that much data on a pretty recent Gentoo package tree.


The executed rsync command is:

rsync --recursive --links --safe-links --perms --times --compress --force --whole-file --delet--delete-delay --stats --timeout=180 --filter 'merge /usr/share/paludis/traditional.exclude' --exclude=/.cache --progress rsync://rsync.de.gentoo.org/gentoo-portage/ /usr/portage/

You can replace 
rsync://rsync.de.gentoo.org/gentoo-portage/
by
rsync://rsync.us.gentoo.org/gentoo-portage/
if you are not in Europe, but in the US; everywhere else you might wanna check on
http://www.gentoo.org/main/en/mirrors-rsync.xml


I copied the data from the corrupted btrfs file system to a fresh one and the massive writes during access disappear, too. So it seems there is actually some corruption in the metadata of the file system, which triggers the massive writes. The corruption must have been introduced by either 3.16.3, 3.15.10 or 3.15.8 (all three from the Gentoo sys-kernel/hardened-sources package). I don't have the logs to retrace which one was running when the problem occured for the first time -- they were already recycled.
Comment 2 Alexander E. Patrakov 2014-11-25 06:45:37 UTC
I have tried to download the new image in order to rehost it. However, it is truncated to 112 MB and the SHA512SUM doesn't match. Please reupload.
Comment 3 Christian Burger 2014-11-25 11:56:49 UTC
I just downloaded the file and compared it to a fresh `dd | xz` I just made. They have the same SHA512 and size. Obviously, the SHA512 I submitted was wrong. And it could well be about 112 MiB as well as ~ 120 MB -- I rounded a bit liberally. :)

file size: 117804208 Bytes (~ 112.35 MiB)

sha512sum: 152f470b341f989e40060c902fd6c360cac33bc47afbe93f226f0f3ca2fabccd55fc8a81d7646c15f1fabed0ee1529553529f514c6fa5f72be77018b46a55de0

Seems like I screwed up the SHA512, sorry for that. I hope you can confirm this SHA512 on your side. Thanks for the mirror.
Comment 4 Alexander E. Patrakov 2014-11-25 12:15:23 UTC
Mirrored as http://82.193.153.141/btrfs/burger-usr.portage.btrfs.xz

Note: later on this week I will be replacing the router with a different model, but after that, typical availability guarantees will apply.
Comment 5 Marc-Andre Lureau 2014-11-28 15:27:25 UTC
I found myself in the same situation, with 60Mb/sec activity for hours on f21 (kernel 3.17.3-300.fc21.x86_64) with a Samsung SSD 840 PRO.

My solution was to reboot with 3.12. At this point there was no IO activy, but I had issues with IO, related to full disk (I was ~75% in both Data and Metadata). I ran a "btrfs balance start /" (without dusage, which didn't help), and finally rebooted under 3.17, problem is gone for the time being.
Comment 6 Andrej Antonov 2014-12-06 15:18:05 UTC
today, same issue was in my laptop (kernel: 3.17.4-1-ARCH). many hours.

then... I was run "btrfs balance start /" and this problem is gone (without reboot).
Comment 7 Alexander E. Patrakov 2014-12-15 21:02:22 UTC
Fixed in 3.18, but not in any of the stable kernels. The fix is:

commit 25ce459c1af138f95a3fd318461193397ebb825b
Author: Liu Bo <bo.li.liu@oracle.com>
Date:   Wed Sep 10 12:58:50 2014 +0800

    Btrfs: fix loop writing of async reclaim
    
    One of my tests shows that when we really don't have space to reclaim via
    flush_space and also run out of space, this async reclaim work loops on adding
    itself into the workqueue and keeps writing something to disk according to
    iostat's results, and these writes mainly comes from commit_transaction which
    writes super_block.  This's unacceptable as it can be bad to disks, especially
    memeory storages.
    
    This adds a check to avoid the above situation.
    
    Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    Signed-off-by: Chris Mason <clm@fb.com>

I will now poke the stable maintainers about it.
Comment 8 Marc-Andre Lureau 2014-12-16 00:59:29 UTC
memeory/memory

Note You need to log in before you can comment on or make changes to this bug.