Bug 212185

Summary:	btrfs preemptive reclaim causing too much CPU work
Product:	File System	Reporter:	Michal Koutný (mkoutny)
Component:	btrfs	Assignee:	BTRFS virtual assignee (fs_btrfs)
Status:	RESOLVED CODE_FIX
Severity:	normal	CC:	andrea_iob, artafinde, bapr, brauliobo, bruce, ciberguerra, crab2313, dsterba, dvyfkebuac.rhn, emiettin3, evangelos, grey, iam, josef, kmq+kernel.org, leledumbo_cool, lionel-dev, master.mhn1388, mg.andersssonn, putr4.s, roysjosh, sean, sozforex, tomboshoven
Priority:	P1
Hardware:	x86-64
OS:	Linux
Kernel Version:	v5.12-rc2	Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	some trace data more trace data

Description Michal Koutný 2021-03-09 15:09:11 UTC

Hi.
I've observed suspiciously demanding kworker/u$i:$j+events_unbound. Upon further look, it spends most of the time in btrfs_preempt_reclaim_metadata_space:


> [10481.108813] CPU: 2 PID: 22662 Comm: kworker/u8:4 Kdump: loaded Not tainted
> 5.12.0-rc2-1.gd9388fc-default #1 openSUSE Tumbleweed (unreleased)
> [10481.108816] Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
> [btrfs]
> [10481.108818] RIP: 0010:_find_next_bit.constprop.0+0x3/0x80
> [10481.108818] Code: ff ff 48 83 f8 ff 74 12 48 23 14 c7 74 e9 48 c1 e0 06 48
> 0f bd d2 4c 8d 04 10 4c 89 c0 c3 66 0f 1f 84 00 00 00 00 00 49 89 d1 <48> 39
> d1 73 51 48 89 ca 48 c1 ea 06 48 8b 04 d7 48 85 f6 74 04 48
> [10481.108819] RSP: 0018:ffffb92f0380fe08 EFLAGS: 00000046
> [10481.108820] RAX: 0000000000117000 RBX: ffff99d2438f6000 RCX:
> 0000000000000002
> [10481.108820] RDX: 0000000000000200 RSI: 0000000000000000 RDI:
> ffffffffb2283aa0
> [10481.108821] RBP: 0000000000000206 R08: 0000000000000000 R09:
> 0000000000000200
> [10481.108821] R10: 0000000000000004 R11: 0000000000000013 R12:
> fffffffffffe1000
> [10481.108822] R13: ffff99d24dfef630 R14: 0000000000000000 R15:
> ffff99d24dfefc50
> [10481.108822] FS:  0000000000000000(0000) GS:ffff99d356500000(0000)
> knlGS:0000000000000000
> [10481.108823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [10481.108823] CR2: 00007f098c360008 CR3: 0000000215210003 CR4:
> 00000000003706e0
> [10481.108824] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [10481.108825] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [10481.108825] Call Trace:
> [10481.108826]  cpumask_next+0x16/0x20
> [10481.108826]  __percpu_counter_sum+0x3d/0x60
> [10481.108827]  need_preemptive_reclaim.part.0+0x88/0x110 [btrfs]
> [10481.108828]  btrfs_preempt_reclaim_metadata_space+0xbc/0x210 [btrfs]
> [10481.108828]  process_one_work+0x1df/0x370
> [10481.108829]  worker_thread+0x50/0x400
> [10481.108830]  ? process_one_work+0x370/0x370
> [10481.108830]  kthread+0x11b/0x140
> [10481.108831]  ? __kthread_bind_mask+0x60/0x60
> [10481.108831]  ret_from_fork+0x22/0x30

The filesystem is reasonably filled:

> Data, single: total=147.94GiB, used=132.44GiB
> System, single: total=32.00MiB, used=16.00KiB
> Metadata, single: total=2.00GiB, used=1.14GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B


I also checked with `btrfs/btrfs_flush_space` tracepoint, an illustrative output (repeated plethora times) is:
>    kworker/u8:12-28612   [002] .... 22227.461500: btrfs_flush_space:
>    dc1d7f42-5155-4095-a514-751950f8accd: state=3(FLUSH_DELAYED_REFS_NR)
>    flags=4(METADATA) num_bytes=851968 ret=0 for_preempt=1

It's not permanently spinning, but the overall (additional) load of the kworker spending time with btrfs_preempt_reclaim_metadata_space is around 1. I don't believe there's so much CPU work necessary in maintaining some free space, therefore I consider this a bug.

(I didn't observe this with v5.11.)

Comment 1 Josef Bacik 2021-03-24 17:31:06 UTC

Can I see the full btrfs fi usage so I have an idea of how full the fs is?  I'm trying to reproduce the problem, but having trouble generating a workload that trips the preemptive reclaim at all.  There's definitely some low hanging fruit, but I want to make sure I'm reproducing the problem instead of just throwing patches at it and hoping for the best.

Comment 2 Michal Koutný 2021-03-24 17:38:39 UTC

Thanks for looking into this.

> # btrfs fi usage /vol/
> Overall:
>     Device size:               149.97GiB
>     Device allocated:          149.97GiB
>     Device unallocated:                  1.00MiB
>     Device missing:                0.00B
>     Used:                      133.32GiB
>     Free (estimated):           15.81GiB      (min: 15.81GiB)
>     Free (statfs, df):                  15.81GiB
>     Data ratio:                             1.00
>     Metadata ratio:                 1.00
>     Global reserve:            512.00MiB      (used: 0.00B)
>     Multiple profiles:                        no
> 
> Data,single: Size:147.94GiB, Used:132.12GiB (89.31%)
>    /dev/mapper/vol     147.94GiB
> 
> Metadata,single: Size:2.00GiB, Used:1.20GiB (59.86%)
>    /dev/mapper/vol       2.00GiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>    /dev/mapper/vol      32.00MiB
> 
> Unallocated:
>    /dev/mapper/vol       1.00MiB

Comment 3 Josef Bacik 2021-04-27 17:47:19 UTC

I have a branch that has my current work around this problem, please check it and see if it resolves your problem.

https://github.com/josefbacik/linux/tree/preemptsum

Comment 4 Michal Koutný 2021-04-30 09:23:40 UTC

Thanks. I should get down to that in the next week.
BTW should I be cautious when testing it on my production FS?

Comment 5 Josef Bacik 2021-04-30 15:56:07 UTC

That's up to you, I only ever test in production ;).  That being said these just change how we trigger preemptive flushing, at worse it'll make things slower, they shouldn't break anything.

Comment 6 Michal Koutný 2021-05-03 16:55:42 UTC

So I compared a 5.12.0-rc8 that I had at hand against
v5.12 + the following patches:

cherry-picked 2cdb3909c9e9 for applicability
- 0001-btrfs-use-percpu_read_positive-instead-of-sum_positi.patch
from the branch torvalds/master..josefbacik/preemptsum (9ccce092fc64..1a07c6b9cc88)
- 0001-btrfs-correct-try_lock_extent-usage-in-read_extent_b.patch
- 0002-btrfs-check-worker-before-need_preemptive_reclaim.patch
- 0003-btrfs-only-clamp-the-first-time-we-have-to-start-flu.patch
- 0004-btrfs-take-into-account-global-rsv-in-need_preemptiv.patch
- 0005-btrfs-use-the-global-rsv-size-in-the-preemptive-thre.patch
- 0006-btrfs-don-t-include-the-global-rsv-size-in-the-preem.patch
- 0007-btrfs-only-ignore-delalloc-if-delalloc-is-much-small.patch
- 0008-btrfs-handle-preemptive-delalloc-flushing-slightly-d.patch

I measured sys time spent by any events_unbound kworker (during ordinary run of
the system):

kernel				real time (s)	sys time (s)	ratio
5.12.0-rc8-1.ga71cb9a		16646		9797.83		0.588587
5.12.1-202105031446.gdc37b65	12622 		  55.46		0.00439391

So from the perspective of unjustified CPU consumption the patchset resolves
the issue for me.

Comment 7 Michal Koutný 2021-05-04 10:34:23 UTC

(In reply to Michal Koutný from comment #6)
> So from the perspective of unjustified CPU consumption the patchset resolves
> the issue for me.
Hm, I may have not stressed it properly yesterday. It looks like switching git
branches on the FS triggers the preemptive reclaim excessively. With the same
metric I'm seeing today:

5.12.1-202105031446.gdc37b65	1587		638.64		0.402219

(The kworker consumes _minutes_ of CPU time after the switch but seems to
settle eventually (unless I switch again).)

The recent state of the FS is:
> Overall:
>     Device size:               149.97GiB
>     Device allocated:          149.97GiB
>     Device unallocated:                  1.00MiB
>     Device missing:                0.00B
>     Used:                      141.97GiB
>     Free (estimated):            7.22GiB      (min: 7.22GiB)
>     Free (statfs, df):                   7.22GiB
>     Data ratio:                             1.00
>     Metadata ratio:                 1.00
>     Global reserve:            512.00MiB      (used: 0.00B)
>     Multiple profiles:                        no
> 
> Data,single: Size:147.94GiB, Used:140.72GiB (95.12%)
>    /dev/mapper/vol     147.94GiB
> 
> Metadata,single: Size:2.00GiB, Used:1.25GiB (62.63%)
>    /dev/mapper/vol       2.00GiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>    /dev/mapper/vol      32.00MiB
> 
> Unallocated:
>    /dev/mapper/vol       1.00MiB

Comment 8 Greyson Christoforo 2021-05-18 10:30:47 UTC

Looks like I've been seeing this for the past week or so on Arch Linux with linux 5.12.4.

Are there any workarounds besides applying the changes in https://github.com/josefbacik/linux/tree/preemptsum ?

Comment 9 Greyson Christoforo 2021-05-18 11:05:42 UTC

I could mention my usage info:

> $ sudo btrfs filesystem usage /
> Overall:
>     Device size:                45.41GiB
>     Device allocated:           45.41GiB
>     Device unallocated:                  1.00MiB
>     Device missing:                0.00B
>     Used:                       42.24GiB
>     Free (estimated):            2.74GiB      (min: 2.74GiB)
>     Free (statfs, df):                   2.74GiB
>     Data ratio:                             1.00
>     Metadata ratio:                 1.00
>     Global reserve:            197.08MiB      (used: 0.00B)
>     Multiple profiles:                        no
> 
> Data,single: Size:43.37GiB, Used:40.62GiB (93.67%)
>    /dev/nvme0n1p7       43.37GiB
> 
> Metadata,single: Size:2.01GiB, Used:1.62GiB (80.49%)
>    /dev/nvme0n1p7        2.01GiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
>    /dev/nvme0n1p7       32.00MiB
> 
> Unallocated:
>    /dev/nvme0n1p7        1.00MiB

Should I expect this problem to disappear if delete some files to reach some magic free space threshold?

Comment 10 rhn 2021-06-12 18:15:18 UTC

I'm seeing the same with kernel 5.12.5-200.fc33.x86_64.

With RAID1, currently one of my CPUs is working at least half of the time, and the laptop fan knows no peace:

/dev/sda1                    40G   37G  2.1G  95% /home

When I was at something around 5GiB free, the worker thread didn't trigger noticeably.

I haven't observed any particular event that leads to bouts of CPU onslaught, but on the other hand I have several instances of Firefox idling at all times.

Comment 11 rhn 2021-06-12 18:18:24 UTC

Quick update: it seems like a btrfs balance made the CPU symptoms go away for the time being.

Comment 12 emiettin3 2021-06-13 12:09:38 UTC

Also on my system (Arch Linux, kernel 5.12.>=9), on a reasonably full btrfs partition (~90%), a kworker process indefinitely consumes 100% CPU. perf report reveals btrfs_preempt_reclaim_metadata_space as the culprit for me too. 

Downgrading to kernel 5.11.x fixes the issue (5.12.<=8 might've been okay too, unfortunately cannot confirm this). Same goes for a full btrfs rebalance, CPU usage is fixed.

Comment 13 agapito 2021-07-25 15:15:25 UTC

I started to suffer this bug on latest 5.12.x releases using Archlinux, probably it was because my root partition had low free space at that time, around 2 GiB. I can confirm that 5.13.4 kernel is still affected by this bug.

The solution of this problem was freeing some space moving a couple of big files to another partition. This is my root partition now:


sudo btrfs filesystem usage /
Overall:
    Device size:                  60.00GiB
    Device allocated:             57.97GiB
    Device unallocated:            2.03GiB
    Device missing:                  0.00B
    Used:                         53.45GiB
    Free (estimated):              5.76GiB      (min: 5.76GiB)
    Free (statfs, df):             5.75GiB
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

Comment 14 Mohammad 2021-07-25 18:32:38 UTC

I can confirm that kernel 5.13.4 suffers from this bug. I'm on Fedora 34 and the specific kernel version is 5.13.4-200.fc34.x86_64. I had 7GiB free on the root which causes CPU workload, so I had to remove some files to stop the CPU and fan usage. Now, I have 20GiB free and the CPU and fans usage are mitigated.

If you need more information I'll be glad to provide it.

Comment 15 agapito 2021-07-31 12:38:15 UTC

(In reply to agapito from comment #13)
> I started to suffer this bug on latest 5.12.x releases using Archlinux,
> probably it was because my root partition had low free space at that time,
> around 2 GiB. I can confirm that 5.13.4 kernel is still affected by this bug.
> 
> The solution of this problem was freeing some space moving a couple of big
> files to another partition. This is my root partition now:
> 
> 
> sudo btrfs filesystem usage /
> Overall:
>     Device size:                  60.00GiB
>     Device allocated:             57.97GiB
>     Device unallocated:            2.03GiB
>     Device missing:                  0.00B
>     Used:                         53.45GiB
>     Free (estimated):              5.76GiB      (min: 5.76GiB)
>     Free (statfs, df):             5.75GiB
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>     Multiple profiles:                  no

Forget it, still happening.

Comment 16 ValdikSS 2021-07-31 19:54:32 UTC

Happens to me with Fedora 34 5.13.5-200.fc34.x86_64.
I have plenty of free space (54G, 74% partition filled).

Comment 17 Bráulio Bhavamitra 2021-08-04 11:01:12 UTC

same here with archlinux and kernel 5.13.7

Comment 18 Bráulio Bhavamitra 2021-08-04 11:37:47 UTC

is there a way to stop CPU usage without rebooting?

Comment 19 Bráulio Bhavamitra 2021-08-06 11:05:00 UTC

It turned out to be Virtualbox in my case. After a Windows reboot in it, the unbound events processes went away

Comment 20 Bráulio Bhavamitra 2021-08-08 12:47:06 UTC

actually not, hapenned again and vm reboot or shutdown didn't help...

Comment 21 ValdikSS 2021-08-08 12:49:17 UTC

(In reply to Bráulio Bhavamitra from comment #18)
> is there a way to stop CPU usage without rebooting?

In my case, launching btrfs rebalance helped:

    sudo btrfs balance start -musage=35 -dusage=35 /

Comment 22 Bráulio Bhavamitra 2021-08-08 12:57:25 UTC

Just had rebooted and it start with the 100% unbound_events usage...

ValdikSS: thank you, it solved after 1min running this command!

FYI: this might be triggered by Chia farming, as it uses a lot of disk space and files.

Comment 23 Bráulio Bhavamitra 2021-08-09 14:19:16 UTC

Had to use `--full-balance` after another events_unbound event, `-musage=35 -dusage=35` wasn't enough

Comment 24 Josef Bacik 2021-08-09 14:35:52 UTC

Sorry everybody I didn't get notifications on this thing, I'll dig into it now.

Comment 25 Josef Bacik 2021-08-09 17:34:21 UTC

While I'm trying to reproduce, could somebody collect some trace data while the problem is happening, something like

echo "for_preempt == 1" > /sys/kernel/debug/tracing/events/btrfs/btrfs_flush_space/filter
echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_flush_space/enable
echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_done_preemptive_reclaim/enable
cat /sys/kernel/debug/tracing/trace_pipe > out.txt

count to 5 and then ctrl+c and upload out.txt so I can see wtf it's doing.

Comment 26 Greyson Christoforo 2021-08-09 17:40:11 UTC

5s of that gave me a 730MiB file! Here are the first few lines of that: https://bpa.st/raw/RD5Q is that enough?

Comment 27 Josef Bacik 2021-08-09 19:46:38 UTC

Wow ok, can you give me the output between two 'btrfs_done_preemptive_reclaim' events, including the 'btrfs_done_preemptive_reclaim' events themselves?  If that's too big just snip the duplicate lines and tell me how much you snipped please.

Comment 28 Greyson Christoforo 2021-08-10 07:18:59 UTC

Created attachment 298251 [details]
some trace data

out.txt (in tar.xz archive) as requested in https://bugzilla.kernel.org/show_bug.cgi?id=212185#c25

Comment 29 Greyson Christoforo 2021-08-10 07:21:25 UTC

That attachment is the whole thing. Turns out a bunch of duplicated text compresses pretty well!

Comment 30 Bráulio Bhavamitra 2021-08-10 11:32:52 UTC

same for me, just all repeated with:

kworker/u64:4-344762  [004] .... 167816.462311: btrfs_flush_space: 0e658495-8f2b-46f7-a2d7-6144825d1c5b: state=3(FLUSH_DELAYED_REFS_NR) flags=4(METADATA) num_bytes=262144 ret=0 for_preempt=1

full file at https://drive.google.com/file/d/1pSDKziJuPnCMaNipfTBhcyU9PSoHYaxB/view?usp=sharing

Comment 31 Bráulio Bhavamitra 2021-08-10 11:50:50 UTC

the issue is getting worse, now a --full-balance isn't helping anymore to get rid of the CPU usage of events_unbound.

I'm running it on 2 other slow HDDs, but I suspect they are not the cause

Comment 32 Bráulio Bhavamitra 2021-08-10 12:54:37 UTC

found another backup btrfs partition that needed rebalance, and that got rid of the unbound events :P

Comment 33 Josef Bacik 2021-08-10 14:04:43 UTC

Holy hell, alright can you just enable 'btrfs_done_preemptive_reclaim' and capture the trace file until you get a few events from that?

Comment 34 Greyson Christoforo 2021-08-10 15:45:07 UTC

(In reply to Josef Bacik from comment #33)
> Holy hell, alright can you just enable 'btrfs_done_preemptive_reclaim' and
> capture the trace file until you get a few events from that?

Could you be more explicit (like last time) on what commands I should run to get what you're looking for here?

Comment 35 Josef Bacik 2021-08-10 17:28:22 UTC

echo 1 > /sys/kernel/debug/tracing/events/btrfs/btrfs_done_preemptive_reclaim/enable
cat /sys/kernel/debug/tracing/trace_pipe

don't need to put it into a file, it shouldn't happen that often, I just need a couple lines of the output to see what it says.

Comment 36 Sean Greenslade 2021-08-10 18:45:40 UTC

I've been experiencing this issue intermittently as well (Arch Linux, 5.13.7-arch1-1). I triggered the btrfs_done_preemptive_reclaim debug trace and got one output from it, after which the CPU usage went back to normal.

   kworker/u16:1-43744   [001] ...1 101704.405332: btrfs_done_preemptive_reclaim: 445d8f25-d7c1-4d52-83e3-9b1c37823f50: flags=METADATA total_bytes=1073741824 bytes_used=446365696 bytes_pinned=0 bytes_reserved=0 bytes_may_use=536870912 bytes_readonly=0 reclaim_size=0 clamp=6 global_reserved=536870912 trans_reserved=0 delayed_refs_reserved=0 delayed_reserved=0 chunk_free_space=1048576

Comment 37 Sean Greenslade 2021-08-10 19:06:54 UTC

It came back and generated a few more lines:

  kworker/u16:10-47445   [000] ...1 102617.533533: btrfs_done_preemptive_reclaim: 445d8f25-d7c1-4d52-83e3-9b1c37823f50: flags=METADATA total_bytes=1073741824 bytes_used=446185472 bytes_pinned=180224 bytes_reserved=294912 bytes_may_use=546701312 bytes_readonly=0 reclaim_size=0 clamp=6 global_reserved=536870912 trans_reserved=0 delayed_refs_reserved=7077888 delayed_reserved=2097152 chunk_free_space=1048576
   kworker/u16:5-44518   [000] ...1 102696.944009: btrfs_done_preemptive_reclaim: 445d8f25-d7c1-4d52-83e3-9b1c37823f50: flags=METADATA total_bytes=1073741824 bytes_used=446152704 bytes_pinned=1605632 bytes_reserved=393216 bytes_may_use=625606656 bytes_readonly=0 reclaim_size=1310720 clamp=6 global_reserved=536870912 trans_reserved=0 delayed_refs_reserved=88735744 delayed_reserved=0 chunk_free_space=1048576
   kworker/u16:5-44518   [002] ...1 102696.954654: btrfs_done_preemptive_reclaim: 445d8f25-d7c1-4d52-83e3-9b1c37823f50: flags=METADATA total_bytes=1073741824 bytes_used=446152704 bytes_pinned=1605632 bytes_reserved=1048576 bytes_may_use=624558080 bytes_readonly=0 reclaim_size=1310720 clamp=6 global_reserved=536870912 trans_reserved=0 delayed_refs_reserved=87293952 delayed_reserved=393216 chunk_free_space=1048576
   kworker/u16:5-44518   [003] ...1 102720.052036: btrfs_done_preemptive_reclaim: 445d8f25-d7c1-4d52-83e3-9b1c37823f50: flags=METADATA total_bytes=1073741824 bytes_used=446251008 bytes_pinned=0 bytes_reserved=0 bytes_may_use=536870912 bytes_readonly=0 reclaim_size=0 clamp=6 global_reserved=536870912 trans_reserved=0 delayed_refs_reserved=0 delayed_reserved=0 chunk_free_space=1048576

Comment 38 Greyson Christoforo 2021-08-11 07:35:11 UTC

Created attachment 298285 [details]
more trace data

Here's what I got for the less verbose trace request. The arrival of these messages correlates strongly with my CPU usage. I often see a core unpeg from the constant 100% it normally is when one of these trace messages arrives. I'll try to put together something that shows this. It seems we're definitely on the right track here.

Comment 39 Greyson Christoforo 2021-08-11 09:01:01 UTC

I hope this is a good demo of how my CPU usage is related to these trace messages: https://asciinema.org/a/429966

Essentially, this issue makes a machine with an X thread CPU an X-1 thread one!

In recording this example, I've noticed that a single problematic reclaim event can take a loooong time to actually complete, like tens of minutes in some cases. I don't understand enough to be able to trigger this myself. Creating (and then deleting) large files from /dev/random doesn't do it. Maybe I should try that with many small files...

Comment 40 Josef Bacik 2021-08-11 18:13:10 UTC

Alright I think it's doing the "right" thing here, just stupidly.  You've got 89mib of "free" metadata space, which isn't a lot in the grand scheme of things, and we have a threshold of 98% full before we decide its too full to do anything.  Clearly 98% is too high of a threshold, because you're just spinning flushing stuff out constantly for no reason.  I'll reduce this to 90% and that'll make it less silly.

Comment 41 Josef Bacik 2021-08-11 18:37:37 UTC

https://lore.kernel.org/linux-btrfs/cover.1628706812.git.josef@toxicpanda.com/

Comment 42 Evangelos Foutras 2021-08-11 19:49:43 UTC

FWIW the problem isn't with full file systems only; it's happening with the following stats too:

Data, single: total=12.00GiB, used=11.07GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=256.00MiB, used=102.78MiB
GlobalReserve, single: total=23.00MiB, used=0.00B

It's a 20G btrfs filesystem on Linux 5.13.7, so about 7.8G is free space (on fast storage too).

Cloning the systemd repo and copying it to a new directory seems to have triggered this for me now. CPU has been spinning in sys time for about 15 minutes so far, I'm waiting for the

Comment 43 Evangelos Foutras 2021-08-11 19:50:57 UTC

* I'm waiting for the btrfs_done_preemptive_reclaim event once it's done (assuming it prints it at the end of the operation).

Comment 44 Evangelos Foutras 2021-08-11 20:15:15 UTC

Got this after about 35 minutes of 100% CPU usage (the kworker/u2:7+events_unbound process is still going though!):

    kworker/u2:7-79786   [000] ...1 179939.368121: btrfs_done_preemptive_reclaim: 30924304-7ce1-4c07-b328-f81d310846f7: flags=METADATA total_bytes=268435456 bytes_used=108052480 bytes_pinned=835584 bytes_reserved=32768 bytes_may_use=27918336 bytes_readonly=0 reclaim_size=0 clamp=8 global_reserved=24117248 trans_reserved=0 delayed_refs_reserved=1048576 delayed_reserved=917504 chunk_free_space=7282339840

Comment 45 Evangelos Foutras 2021-08-11 20:53:36 UTC

One more after the 1 hour mark:

    kworker/u2:7-79786   [000] ...1 181770.406475: btrfs_done_preemptive_reclaim: 30924304-7ce1-4c07-b328-f81d310846f7: flags=METADATA total_bytes=268435456 bytes_used=108265472 bytes_pinned=770048 bytes_reserved=0 bytes_may_use=27246592 bytes_readonly=0 reclaim_size=0 clamp=8 global_reserved=24117248 trans_reserved=0 delayed_refs_reserved=1048576 delayed_reserved=786432 chunk_free_space=7282339840

kworker/u2:7+events_unbound is showing 1h10m of CPU time now. I think I may need to reboot to stop it. 🐱

Comment 46 Greyson Christoforo 2021-08-12 08:50:47 UTC

I see, in the patch email you sent to linux-btrfs@vger.kernel.org Josef, that you wrote "The problem is for very full file systems on slower disks..." but that does not seem to describe my situation. Yeah, my btrfs partition on this drive is pretty full, but the drive is a "Sandisk Corp WD Black SN750 / PC SN730 NVMe SSD" does that count as slower? I guess I'm just thinking if this is an issue on an M.2 NVMe SSD and it's worsened/caused by drive speed, imagine what it would be like on spinning rust!

If I understand the message you sent a few ones up, the actual root of my problem is that I only have 89mib of free metadata space. Have I done something wrong here? Am I somehow in control of this allocation? It also doesn't match with what `btrfs filesystem usage /` shows me today:

```
$ sudo btrfs filesystem usage /
Overall:
    Device size:		  45.41GiB
    Device allocated:		  45.41GiB
    Device unallocated:		   1.00MiB
    Device missing:		     0.00B
    Used:			  41.96GiB
    Free (estimated):		   2.76GiB	(min: 2.76GiB)
    Free (statfs, df):		   2.76GiB
    Data ratio:			      1.00
    Metadata ratio:		      1.00
    Global reserve:		 214.39MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,single: Size:43.12GiB, Used:40.35GiB (93.59%)
   /dev/nvme0n1p7	  43.12GiB

Metadata,single: Size:2.26GiB, Used:1.61GiB (71.12%)
   /dev/nvme0n1p7	   2.26GiB

System,single: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/nvme0n1p7	  32.00MiB

Unallocated:
   /dev/nvme0n1p7	   1.00MiB
```

That's 650GiB free, or only 71% metadata space used, right? Is that consistent with everything you've seen? Why does it disagree so much with 89mib?

Comment 47 Greyson Christoforo 2021-08-12 08:54:15 UTC

*~670MiB free

Comment 48 agapito 2021-08-12 11:48:32 UTC

I'm using an NVMe device too. Filesystem is ZSTD compressed. I'm not sure but I don't suffer from this bug on a normal hard disk, only when I write some files to the NVMe partition, like a system update.

Comment 49 Greyson Christoforo 2021-08-12 11:52:08 UTC

I have compression enabled too. My mount is like this:

/dev/nvme0n1p6 on / type btrfs (rw,relatime,compress=lzo,ssd,space_cache,subvolid=257,subvol=/root)

Comment 50 ValdikSS 2021-08-12 11:59:10 UTC

I don't have compression.

Comment 51 sozforex 2021-08-12 13:16:23 UTC

I think have the same problem. I have a lot of free space available, but when most of *allocated* space is used (data used > ~95%, not sure) I start seeing high CPU usage from kworker/u16:2+events_unbound.
Problem disappears after creation and removal of a large (8GB) file with random data (to expand allocated space).

/dev/sdb4 on /home type btrfs (rw,noatime,compress-force=zstd:3,ssd,space_cache,subvolid=299,subvol=/@home)

```
❯ btrfs fi usage /home
Overall:
    Device size:                 351.00GiB
    Device allocated:             46.02GiB
    Device unallocated:          304.98GiB
    Device missing:                  0.00B
    Used:                         43.33GiB
    Free (estimated):            305.28GiB      (min: 152.79GiB)
    Free (statfs, df):           305.28GiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              179.39MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:42.01GiB, Used:41.70GiB (99.27%)
   /dev/sdb4      42.01GiB

Metadata,DUP: Size:2.00GiB, Used:833.70MiB (40.71%)
   /dev/sdb4       4.00GiB

System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
   /dev/sdb4      16.00MiB

Unallocated:
   /dev/sdb4     304.98GiB
```

Comment 52 Sean Greenslade 2021-08-12 19:49:24 UTC

I've been trying to test Josef's patches, but my machine that exhibits this issue is crashing on the latest git master kernel (unpatched) and the Arch Linux release kernel (5.13.9) is missing some btrfs patches that Josef's patches rely on. I'll try bisecting master to see if I can back out whatever's crashing my system.

Comment 53 Bráulio Bhavamitra 2021-08-14 13:53:23 UTC

got 2 events_unbound :P one for each unbalanced partition (btw, also having this issues on a NVMe SSD)

Comment 54 Bráulio Bhavamitra 2021-08-20 11:32:33 UTC

was the patch applied to stable?

Comment 55 Bráulio Bhavamitra 2021-08-20 23:13:56 UTC

this problem is happening almost every day for me in one of the 3 SSDs partitions I have, randomly. It never happened though with any of the 2 HDD btrfs partitions

Comment 56 agapito 2021-08-21 10:00:21 UTC

@Bráulio: It happens when you write to disk, and you will probably write more to an ssd disk if you have the operating system there, although it also happens on normal disks. 

Honestly i´m tired of this. I am thinking of migrating to to the old and stable ext4, especially when I read that the hard disk where I have stored my data appears on this list and is prone to possible corruption.

https://btrfs.wiki.kernel.org/index.php/Hardware_bugs

Comment 57 David Sterba 2021-08-23 09:07:57 UTC

The patches are now in the development queue, are on the way to Linus tree and then to stable. The ETA for that could be next week once 5.14 is released and stable picks it.

(In reply to agapito from comment #56)
> Honestly i´m tired of this. I am thinking of migrating to to the old and
> stable ext4, especially when I read that the hard disk where I have stored
> my data appears on this list and is prone to possible corruption.
> 
> https://btrfs.wiki.kernel.org/index.php/Hardware_bugs

I hope you understand that no filesystem will save your data on a buggy hardware, not to say ext4 won't even detect that in most cases.

Comment 58 agapito 2021-08-25 00:25:01 UTC

(In reply to David Sterba from comment #57)
> The patches are now in the development queue, are on the way to Linus tree
> and then to stable. The ETA for that could be next week once 5.14 is
> released and stable picks it.
> 
> (In reply to agapito from comment #56)
> > Honestly i´m tired of this. I am thinking of migrating to to the old and
> > stable ext4, especially when I read that the hard disk where I have stored
> > my data appears on this list and is prone to possible corruption.
> > 
> > https://btrfs.wiki.kernel.org/index.php/Hardware_bugs
> 
> I hope you understand that no filesystem will save your data on a buggy
> hardware, not to say ext4 won't even detect that in most cases.

I have been using ext4 since its release and NEVER had any problems. For over a decade I never lost a single file or had any record of any file corruption. I have never had to format the root partition for any kind of ext4 related problem even though I suffer from constant power outages and very importantly I have never suffered any performance problems using it.

8 months ago I bought a WD BLUE HDD as my main disk for my data and decided to format it as btrfs seduced by the idea of ZSTD compression; a few days later I formatted all my partitions to btrfs. To be fair, during these last months I have also suffered quite a few power outages and I had no problems either, but suffering this bug and waiting for the fix for almost 2 months and the fact that my hard drive is listed on the wiki as buggy, was too much for me. 

In this last weekend I have reorganized my hard drives and formatted all my partitions and none of them is btrfs now. The most important thing about a filesystem for me is forgetting that it exists and that's what happened to me for over a decade, until I tried btrfs. 

Maybe I'll try btrfs in the future, because I admit that on paper it's superior to the rest, but what's the point if bugs like this appear in the stable kernel branches? Debian and Red Hat can't be wrong, so maybe I'll give it another chance when it's the default filesystem in Debian.

I am not trying to be disrespectful, I am just telling my experience.

Comment 59 theodore 2021-08-25 15:55:21 UTC

First, i want to say i appreciate the work that is being done to fix this--it has been a quite annoying bug for me as well.  I wanted to point out that I've seen zstd compression mentioned in several posts above, and that I am also using zstd and suffering from this bug.  My drive overall is 70% used, with 71% metadata usage.

Comment 60 Bráulio Bhavamitra 2021-08-25 22:48:26 UTC

I just got this in a newly created partition on a new drive Seagate Firecuda 510 2tb. After a rsync 30gb of the root fs the events_unbound was triggered!

Comment 61 Bráulio Bhavamitra 2021-08-31 14:31:34 UTC

got this again after a simple system update with `pacman -Syu`.

maybe this is due to a high `commit` option? It is set to 600 here

Comment 62 Martin Andersson 2021-09-02 17:07:59 UTC

I'm seeing this happening and blocking I/O for up to a minute just after timeshift does its hourly snapshots. Admittedly, I have quite a lot of them.

Comment 63 Bráulio Bhavamitra 2021-09-06 18:39:31 UTC

This issue might be gone with kernel 4.14.0, it is been a few days that it doesn't happen to none of my 6 btrfs ssd partitions (btw, it never happened on my hdds)

Comment 64 Sean Greenslade 2021-09-12 21:39:44 UTC

I never managed to get an RC kernel to work on my machine (for reasons unrelated to this issue), but I've been running a 5.14.2 release kernel for about 3 days now and the issue has not returned.

Comment 65 Greyson Christoforo 2021-09-14 14:44:53 UTC

I'm on 5.14.3 and I still have the issue. Did the patch make it into 5.14?

Comment 66 Greyson Christoforo 2021-09-14 14:59:54 UTC

(In reply to Greyson Christoforo from comment #65)
> Did the patch make it into 5.14?

Looks like the answer to that is no, but I can see a derivative of the patch from https://bugzilla.kernel.org/show_bug.cgi?id=212185#c41 in 5.15-rc1.

Comment 67 David Sterba 2021-09-14 15:04:19 UTC

For some reason the patch hasn't been picked from the pull request branch so I have to ask for including it manually.

Comment 68 David Sterba 2021-10-04 17:22:25 UTC

Released in 5.14.7, I've been using it for some time and haven't observed the runaway kthread anymore. I took longer to fix than it should, sorry, something to improve next time. If anybody sees the problem in the supposedly fixed versions, please reopen.

Comment 69 Greyson Christoforo 2021-10-05 09:59:35 UTC

I haven't noticed it yet running on 5.14.9  
Thank you very much for the fix David!