Bug 202559
| Summary: | BTRFS is very slow on large filesystems | | |
|---|---|---|---|
| Product: | File System | Reporter: | NoName_Nr_1 (thorsten.brandau) |
| Component: | btrfs | Assignee: | BTRFS virtual assignee (fs_btrfs) |
| Status: | NEW | | |
| Severity: | high | CC: | stf_xl, thorsten.brandau |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 4.20 or anything before | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
NoName_Nr_1
2019-02-11 12:30:36 UTC
(In reply to NoName_Nr_1 from comment #0)
> - The disk is often stuck at 100% busy (ATOP) but no file transfer is going on

When this happens, could you run "perf top -a --stdio" to print which functions eat CPU power?

> My Raids are:
>
> 68 TB BTRFS
> 19 TB BTRFS
> 30 TB BTRFS
> 13 TB BTRFS

This looks pretty enterprise to me. I would consider using a commercial Linux offering and getting paid support.

Seriously? I file a bug report (which is not made easy anyway) and what I get is "go somewhere else"? It seems Linux is no longer what it was when I was younger and started using it. I am not sure which enterprise you are referring to, but a couple of arrays do not make an enterprise. This is operated at a small company, which is nowhere close to an enterprise. My home RAID is similar and getting pretty crowded.

I have found a page claiming that a "larger" number of snapshots (>10!) affects performance. Judging from the text, the subject was not really related to my problem; however, as I reduced the number of snapshots dramatically, the problem got smaller. That means I now more seldom get loads >15 during no-load operation (i.e. no serious file transfer). So my impression is that snapshots are in fact part of the problem.

I was using snapshots to be able to roll back my data partition: hourly snapshots (8 days back), weekly snapshots (5 weeks back), monthly snapshots (13 months back) and annual snapshots (3 years back) for each of the BTRFS volumes. I would not consider this extensive, but a reasonable structure for a file system offering snapshots. Anyhow, when I reduced the snapshots to a total of 10 per volume, the system lockups were reduced dramatically (in the past 4 weeks). It does not make too much sense backup-wise, though. And no, that is not my only way of backing up the system, but it lets me reconstruct user-deleted files, which unfortunately happens from time to time.
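For scale, the retention policy described above can be turned into a snapshot count with a quick back-of-the-envelope script. This is only a sketch of the arithmetic implied by the comment (the mount point in the trailing hint is illustrative; `btrfs subvolume list -s` is the standard read-only way to list snapshots):

```shell
#!/bin/sh
# Snapshots implied by the reporter's retention policy, per BTRFS volume:
# hourly for 8 days, weekly for 5 weeks, monthly for 13 months,
# annual for 3 years.
hourly=$((24 * 8))   # 192 hourly snapshots
weekly=5             # 5 weekly snapshots
monthly=13           # 13 monthly snapshots
annual=3             # 3 annual snapshots
count=$((hourly + weekly + monthly + annual))
echo "snapshots per volume: $count"
# To count the actual snapshots on a mounted volume (read-only;
# /mnt/data is a placeholder):
#   btrfs subvolume list -s /mnt/data | wc -l
```

That is 213 snapshots per volume, so the ">10 snapshots" threshold mentioned above was exceeded roughly twenty-fold before the pruning.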
"perf" gives the following reading, but I need to provoke a full lockdown to make it more informational. I will add this ASAP. (top/atop/iotop always shows btrfs-transaction eating up all resources, and the disk is at 100% busy with low read/write values.)

```
 2.76%  [kernel]       [k] get_page_from_freelist
 2.71%  [kernel]       [k] svc_tcp_recvfrom
 2.68%  [kernel]       [k] module_get_kallsym
 1.93%  [kernel]       [k] vsnprintf
 1.82%  [kernel]       [k] memcpy_erms
 1.76%  [kernel]       [k] vmx_vmexit
 1.68%  [kernel]       [k] free_pcppages_bulk
 1.66%  [kernel]       [k] format_decode
 1.46%  [kernel]       [k] number
 1.42%  [kernel]       [k] kallsyms_expand_symbol.constprop.1
 1.25%  perf           [.] rb_next
 1.15%  [kernel]       [k] menu_select
 1.12%  [kernel]       [k] cpuidle_enter_state
 1.08%  perf           [.] __symbols__insert
 1.03%  libc-2.29.so   [.] __GI_____strtoull_l_internal
 1.02%  [kernel]       [k] __schedule
 0.95%  [kernel]       [k] string
 0.94%  [kernel]       [k] trace_hardirqs_off
 0.93%  [kernel]       [k] native_queued_spin_lock_slowpath
 0.93%  [kernel]       [k] __alloc_pages_nodemask
 0.87%  [kernel]       [k] trace_hardirqs_on
 0.78%  [kernel]       [k] free_unref_page
 0.74%  [kernel]       [k] ipt_do_table
 0.70%  [kernel]       [k] cache_reap
 0.70%  [kernel]       [k] svc_recv
 0.69%  [kernel]       [k] __x86_indirect_thunk_rax
 0.62%  [kernel]       [k] __page_cache_release
```

Hi, I could not get it to full load, but with defrag it is pretty much under load (however, the disk is only at 33-70% busy instead of the typical 100%).
The perf gives:

```
24.97%  [kernel]  [k] native_queued_spin_lock_slowpath
 2.99%  [kernel]  [k] queued_write_lock_slowpath
 2.82%  [kernel]  [k] _raw_spin_lock_irqsave
 2.70%  [kernel]  [k] __schedule
 2.44%  [kernel]  [k] prepare_to_wait_event
 2.29%  [kernel]  [k] btrfs_tree_read_lock
 2.24%  [kernel]  [k] menu_select
 1.92%  [kernel]  [k] btrfs_get_token_32
 1.86%  [kernel]  [k] map_private_extent_buffer
 1.47%  [kernel]  [k] btrfs_tree_read_unlock
 1.40%  [kernel]  [k] queued_read_lock_slowpath
 1.12%  [kernel]  [k] btrfs_tree_lock
 1.11%  [kernel]  [k] try_to_wake_up
 1.08%  [kernel]  [k] btrfs_set_token_32
 1.05%  [kernel]  [k] _raw_read_lock
 0.93%  [kernel]  [k] btrfs_search_slot
 0.92%  [kernel]  [k] generic_bin_search.constprop.40
 0.88%  [kernel]  [k] do_idle
 0.86%  [kernel]  [k] update_load_avg
 0.86%  [kernel]  [k] native_sched_clock
 0.84%  [kernel]  [k] btrfs_set_lock_blocking_rw
 0.81%  [kernel]  [k] select_task_rq_fair
 0.77%  [kernel]  [k] find_extent_buffer
 0.77%  [kernel]  [k] add_delayed_ref_head
 0.76%  [kernel]  [k] __update_load_avg_cfs_rq
 0.74%  [kernel]  [k] trace_hardirqs_off
 0.73%  [kernel]  [k] _raw_write_lock
 0.72%  [kernel]  [k] _raw_spin_lock
 0.71%  [kernel]  [k] __switch_to_asm
 0.68%  [kernel]  [k] __switch_to
 0.64%  [kernel]  [k] __wake_up_common
 0.64%  [kernel]  [k] update_cfs_rq_h_load
 0.59%  [kernel]  [k] __radix_tree_lookup
 0.59%  [kernel]  [k] btrfs_tree_unlock
 0.57%  [kernel]  [k] module_get_kallsym
 0.56%  [kernel]  [k] free_extent_buffer
 0.54%  [kernel]  [k] pick_next_task_fair
```

I use a rolling release, so the kernel is now at:

```
Linux server06 5.0.3-1-default #1 SMP Fri Mar 22 17:30:35 UTC 2019 (2a31831) x86_64 x86_64 x86_64 GNU/Linux
```

Hope that helps. It seems to get slowly better, but compared to XFS the file system is unfortunately still a lot slower. It is a pity, as I really like the features.

This looks like lock contention. What is the name of the program / process that causes this?

btrfs-transaction is typically the leader in top, especially when something copies file trees via NFS.
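To back up the lock-contention reading with a number, the percentages of lock-related symbols in a saved `perf top` capture can simply be summed. A minimal sketch, assuming the output format shown above (percent, DSO, symbol type, symbol name); the capture file name is illustrative, and matching `lock` in the symbol name is only a heuristic that catches the spinlock, rwlock and btrfs tree-lock entries:

```shell
#!/bin/sh
# Sum the CPU share of lock-related kernel symbols in a saved
# "perf top -a --stdio" capture.
lock_share() {
    # Field 1 is the percentage ("24.97%"), field 4 the symbol name.
    awk '$4 ~ /lock/ { gsub(/%/, "", $1); s += $1 }
         END { printf "%.2f\n", s }' "$1"
}
# Usage sketch (run the capture during a stall, then Ctrl-C):
#   perf top -a --stdio > perf-stall.txt
#   lock_share perf-stall.txt
```

Applied to the listing above, this sums to roughly 41% of samples spent in locking primitives, which is consistent with the contention diagnosis.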
The whole system is practically at a standstill for disk access.