Bug 90401
Summary: | btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file | |
---|---|---|---
Product: | File System | Reporter: | Martin Steigerwald (Martin)
Component: | btrfs | Assignee: | Josef Bacik (josef)
Status: | RESOLVED UNREPRODUCIBLE | |
Severity: | normal | CC: | ibaldo
Priority: | P1 | |
Hardware: | All | |
OS: | Linux | |
Kernel Version: | 3.18 | Subsystem: |
Regression: | No | Bisected commit-id: |
Attachments: |
- kern.log with a lot of sysrq-t, the first few while all was fine, but then during kworker using 100% of one core
- testcase where I tried to bring BTRFS to <300 IOPS but failed
- testcase on the same /home where I got the 254 IOPS, as I enlarged and rebalanced it
- round of testing with really tight free space
Description
Martin Steigerwald
2014-12-27 13:50:48 UTC
Cut and paste the atop output into a wide editor window to see it as intended. :) Next I will try to reproduce this with a freshly created BTRFS filesystem. Robert pointed out that

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

means the filesystem is almost full. Yes, that is right. But for the initial case, where I used a Windows XP VM with NTFS in an about 17-18 GiB VDI image and tried to defragment inside it or just install a 500 MB tax return application, I had:

merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

There it happened as well that a kworker used 100% of one core for minutes, up to the point that the desktop environment (Plasma) locked up waiting for I/O. And it actually got to the fill level shown above due to the balance; I did not suddenly put an additional 8 GiB onto the filesystem. Anyway, I tried to reproduce it with just the "filesystem is full" condition, but I can't. There is more to it. I will attach a test case log for the reproducer attempts.

Created attachment 161971 [details]
testcase where I tried to bring BTRFS to <300 IOPS but failed
Filesystem full alone does not trigger it. At least for a small filesystem.
Maybe it needs a big filesystem that is almost full?
Or is there something else wrong with my /home?
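As a reference for these reproduction attempts, here is a minimal sketch of the small fresh test filesystem described here, assuming two spare ~10 GiB block devices (the device paths are placeholders, not the ones from this report):

# Create a small dual-device BTRFS RAID1 (data and metadata mirrored),
# similar to the 2x10 GiB test filesystem used in the failed repro attempt.
mkfs.btrfs -L btrfstest -d raid1 -m raid1 /dev/vg/test1 /dev/vg/test2
mkdir -p /mnt/btrfsraid1
mount -o compress=lzo /dev/vg/test1 /mnt/btrfsraid1

# Then fill it until no unallocated device space is left and run the fio
# random-write job (ssd-test.fio, shown later in this report) against it.
cd /mnt/btrfsraid1 && fio ~/ssd-test.fio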
Okay, I want to add that I first triggered this "kworker thread uses 100% of one core for minutes" issue by installing an about 400 MiB application into a Windows XP VirtualBox VM, inside a copy-on-write about 17-18 GiB VDI image file with NTFS. Then I triggered it again while compacting the VDI image as recommended by an Oracle blog post: defragmenting the NTFS, then running sdelete -c inside Windows XP, which seems to be a bit like fstrim, and then running a VBoxManage command to compact the VDI file. My observation was that it only happens once BTRFS cannot allocate any more chunks because there is no unallocated space left on the devices. I also want to add that I did the NTFS defragmentation several times today on a BTRFS with plenty of free space to allocate chunks from, and this *didn't* trigger the "kworker hogs a CPU" issue. So for me the issue does not seem to happen as long as BTRFS is still able to reserve new data chunks from unreserved device space. I will try to verify this with the fio testcase as well. Below is a summary of how I initially found this issue. The whole story:

1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again, which gave me:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
        Total devices 2 FS bytes used 144.19GiB
        devid    1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home
Btrfs v3.17

merkaba:~> btrfs fi df /home
Data, RAID1: total=144.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

2) I ran the VirtualBox machine again and defragmented the NTFS filesystem in the VDI image file. And: it worked *just* fine. Fine as in *fine*. No issues whatsoever.
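For context, the VDI compaction workflow described above boils down to roughly the following steps; the image name matches the one mentioned later in this report, and the sdelete flag shown is the newer -z variant that is also discussed further down:

# Inside the Windows XP guest (cmd.exe): defragment NTFS, then zero out the
# free space so the freed blocks can be dropped from the image afterwards.
#   defrag c:
#   sdelete -z c:

# On the Linux host, with the VM shut down: compact the copy-on-write VDI.
VBoxManage modifyhd Winlala.vdi --compact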
I got this during the run: ATOP - merkaba 2014/12/27 12:58:42 ----------- 10s elapsed PRC | sys 10.41s | user 1.08s | #proc 357 | #trun 4 | #tslpi 694 | #tslpu 0 | #zombie 0 | no procacct | CPU | sys 107% | user 11% | irq 0% | idle 259% | wait 23% | guest 0% | curf 3.01GHz | curscal 93% | cpu | sys 29% | user 3% | irq 0% | idle 63% | cpu002 w 5% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 27% | user 3% | irq 0% | idle 65% | cpu000 w 5% | guest 0% | curf 3.03GHz | curscal 94% | cpu | sys 26% | user 3% | irq 0% | idle 63% | cpu003 w 8% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 24% | user 2% | irq 0% | idle 68% | cpu001 w 6% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 1.92 | avg5 1.01 | avg15 0.56 | | csw 501619 | intr 129279 | | numcpu 4 | MEM | tot 15.5G | free 610.1M | cache 9.1G | buff 0.1M | slab 1.0G | shmem 183.5M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.6G | | | | | vmcom 7.1G | vmlim 19.7G | PAG | scan 219141 | steal 215577 | stall 936 | | | | swin 0 | swout 940 | LVM | sata-home | busy 53% | read 181413 | write 0 | KiB/w 0 | MBr/s 70.86 | MBw/s 0.00 | avio 0.03 ms | LVM | sata-swap | busy 2% | read 0 | write 940 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.37 | avio 0.17 ms | LVM | sata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 1.00 ms | LVM | msata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | DSK | sda | busy 53% | read 181413 | write 477 | KiB/w 7 | MBr/s 70.86 | MBw/s 0.37 | avio 0.03 ms | DSK | sdb | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3 9650 - martin martin 22 7.89s 0.65s 0K 128K 705.5M 382.1M -- - S 2 87% VirtualBox 9911 - root root 1 0.69s 0.01s 0K 0K 0K 0K -- - S 3 7% watch 9598 - root root 1 0.38s 0.00s 0K 0K 0K 20K -- - S 0 4% kworker/u8:9 9892 - root root 1 0.36s 0.00s 0K 0K 0K 0K -- - S 1 4% kworker/u8:17 9428 - root root 1 0.30s 0.00s 0K 0K 0K 0K -- - R 0 3% kworker/u8:3 9589 - root root 1 0.23s 0.00s 0K 0K 0K 0K -- - S 1 2% kworker/u8:6 4746 - martin martin 2 0.04s 0.13s 0K -16K 0K 0K -- - R 2 2% konsole Every 1,0s: cat /proc/meminfo Sat Dec 27 12:59:23 2014 MemTotal: 16210512 kB MemFree: 786632 kB MemAvailable: 10271500 kB Buffers: 52 kB Cached: 9564340 kB SwapCached: 70268 kB Active: 6847560 kB Inactive: 5257956 kB Active(anon): 2016412 kB Inactive(anon): 703076 kB Active(file): 4831148 kB Inactive(file): 4554880 kB Unevictable: 9068 kB Mlocked: 9068 kB SwapTotal: 12582908 kB SwapFree: 12186680 kB Dirty: 972324 kB Writeback: 0 kB AnonPages: 2526340 kB Mapped: 2457096 kB Shmem: 173564 kB Slab: 918128 kB SReclaimable: 848816 kB SUnreclaim: 69312 kB KernelStack: 11200 kB PageTables: 64556 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20688164 kB Committed_AS: 7438348 kB I am not seeing more than one GiB of dirty here during regular usage and it is no problem. And kworker thread CPU usage just fine. So no, the dirty_background_ratio isn´t an issue with this 16 GiB ThinkPad T520. [… some rant of mine about not adapting dirty_background_ratio when it is not actually needed …] So defragmentation completed just nice, no issue so far. 
But I am close to full device space reservation already: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home Sa 27. Dez 13:02:40 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 151.58GiB devid 1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home As I was not able to use the NTFS defragmenter as a reproduce then as the NTFS was already defragmented I then introduced the fio job, which just hit it nice. Now with a freshly formatted, but small dual SSD BTRFS RAID 1 with 10 GiB on each of two devices I was not able to trigger this kworker hogs 100% of one core case. So the BTRFS allocates all space from device seems to be necessary to trigger it on my /home, but doesn´t suffice to trigger it on a fresh BTRFS. I thought this much of history may be helpful to understand where this bug report is coming from. And I thought I better add it here than ask you to read the quite long thread I mentioned initially. And I forget the original fio test results and summarized too much: That was the test in the case all device space was reserved for chunks. I.e. with this: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home Sa 27. Dez 13:26:39 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 152.83GiB devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.97GiB, used=149.58GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.26GiB GlobalReserve, single: total=512.00MiB, used=0.00B And I got only 254 IOPS on random write then. Its the random write case it seems. Here are values from fio job: martin@merkaba:~> fio ssd-test.fio seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 2 processes Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s] seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014 write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec clat (usec): min=3, max=38048, avg=10.52, stdev=205.25 lat (usec): min=3, max=38048, avg=10.66, stdev=205.43 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4], | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5], | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9], | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128], | 99.99th=[10304] bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20 lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12% lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Seems fine. 
But: rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014 write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61 lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5], | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6], | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10], | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116], | 99.99th=[16711680] bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02 lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02% lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02% cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec Run status group 1 (all jobs): WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec The full mail I referred to in my last and in this comment is: http://www.spinics.net/lists/linux-btrfs/msg40745.html In a moment I post how /home behaves when there is still unused device space left to reserve chunks from. This is another test. With my /home. This time while it has enough free device space to reserve new chunks from. Remember this for the case where is hasn´t: merkaba:~> btrfs fi sh /home Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 144.19GiB devid 1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~> btrfs fi df /home Data, RAID1: total=144.98GiB, used=140.95GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.24GiB GlobalReserve, single: total=512.00MiB, used=0.00B rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014 write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61 lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5], | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6], | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10], | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116], | 99.99th=[16711680] bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02 lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02% lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02% cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 […] Run status group 1 (all jobs): WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec That is where I saw this kworker thread at 100% of one Sandybridge core for minutes. 
This is where I made kern.log with sysrq-t triggers in https://bugzilla.kernel.org/show_bug.cgi?id=90401#c0 made from. But now as I extended it to 170 GiB and did some basic rebalancing (upto btrfs balance start -dusage=80) I have this: First attempt: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 13:13:47 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 151.09GiB devid 1 size 170.00GiB used 158.03GiB path /dev/mapper/msata-home devid 2 size 170.00GiB used 158.03GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=153.00GiB, used=147.83GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.26GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process rand-write: Laying out IO file(s) (1 file(s) / 4096MB) Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/84528KB/0KB /s] [0/21.2K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=9987: Sun Dec 28 13:14:32 2014 write: io=4096.0MB, bw=155304KB/s, iops=38826, runt= 27007msec clat (usec): min=5, max=28202, avg=22.03, stdev=240.04 lat (usec): min=5, max=28202, avg=22.28, stdev=240.16 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 10], 40.00th=[ 11], 50.00th=[ 12], 60.00th=[ 13], | 70.00th=[ 14], 80.00th=[ 15], 90.00th=[ 17], 95.00th=[ 23], | 99.00th=[ 93], 99.50th=[ 175], 99.90th=[ 2096], 99.95th=[ 6816], | 99.99th=[10176] bw (KB /s): min=76832, max=413616, per=100.00%, avg=156706.75, stdev=101101.26 lat (usec) : 10=29.85%, 20=62.43%, 50=5.74%, 100=1.07%, 250=0.57% lat (usec) : 500=0.16%, 750=0.04%, 1000=0.02% lat (msec) : 2=0.02%, 4=0.01%, 10=0.08%, 20=0.01%, 50=0.01% cpu : usr=12.05%, sys=47.34%, ctx=86985, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=155304KB/s, minb=155304KB/s, maxb=155304KB/s, mint=27007msec, maxt=27007msec Second attempt: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. 
Dez 13:16:19 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 155.11GiB devid 1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home devid 2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=157.00GiB, used=151.83GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.27GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/113.5MB/0KB /s] [0/29.3K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10043: Sun Dec 28 13:17:34 2014 write: io=4096.0MB, bw=145995KB/s, iops=36498, runt= 28729msec clat (usec): min=4, max=143201, avg=23.95, stdev=518.47 lat (usec): min=4, max=143201, avg=24.13, stdev=518.48 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 16], | 99.00th=[ 33], 99.50th=[ 70], 99.90th=[ 5472], 99.95th=[ 8640], | 99.99th=[20864] bw (KB /s): min= 4, max=433760, per=100.00%, avg=149179.63, stdev=136784.14 lat (usec) : 10=38.35%, 20=58.99%, 50=1.96%, 100=0.38%, 250=0.16% lat (usec) : 500=0.03%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.08%, 20=0.02%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01% cpu : usr=10.25%, sys=42.40%, ctx=42642, majf=0, minf=8 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=145995KB/s, minb=145995KB/s, maxb=145995KB/s, mint=28729msec, maxt=28729msec Third attempt: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. 
Dez 13:18:24 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 156.16GiB devid 1 size 170.00GiB used 160.03GiB path /dev/mapper/msata-home devid 2 size 170.00GiB used 160.03GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.00GiB, used=152.83GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.34GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/195.7MB/0KB /s] [0/50.9K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10058: Sun Dec 28 13:18:59 2014 write: io=4096.0MB, bw=202184KB/s, iops=50545, runt= 20745msec clat (usec): min=4, max=28261, avg=15.84, stdev=214.59 lat (usec): min=4, max=28261, avg=16.06, stdev=214.78 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 9], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 17], | 99.00th=[ 52], 99.50th=[ 105], 99.90th=[ 426], 99.95th=[ 980], | 99.99th=[12736] bw (KB /s): min= 4, max=426344, per=100.00%, avg=207355.30, stdev=105104.72 lat (usec) : 10=41.44%, 20=55.33%, 50=2.17%, 100=0.54%, 250=0.34% lat (usec) : 500=0.10%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.01% cpu : usr=14.15%, sys=59.06%, ctx=81711, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=202183KB/s, minb=202183KB/s, maxb=202183KB/s, mint=20745msec, maxt=20745msec merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 13:19:15 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 155.16GiB devid 1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home devid 2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=157.00GiB, used=151.85GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B So here BTRFS was fast. It is getting trouble on my /home when its almost full. 
But more so than with a empty 10 GiB filesystem as I have shown in testcase https://bugzilla.kernel.org/show_bug.cgi?id=90401#c3 There I had: merkaba:/mnt/btrfsraid1> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process rand-write: Laying out IO file(s) (1 file(s) / 4096MB) Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/20924KB/0KB /s] [0/5231/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=6221: Sat Dec 27 15:34:14 2014 write: io=2645.8MB, bw=22546KB/s, iops=5636, runt=120165msec clat (usec): min=4, max=3054.8K, avg=174.87, stdev=11455.26 lat (usec): min=4, max=3054.8K, avg=175.03, stdev=11455.27 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 6], 10.00th=[ 6], 20.00th=[ 7], | 30.00th=[ 7], 40.00th=[ 8], 50.00th=[ 9], 60.00th=[ 10], | 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 14], 95.00th=[ 17], | 99.00th=[ 30], 99.50th=[ 40], 99.90th=[ 1992], 99.95th=[25984], | 99.99th=[411648] bw (KB /s): min= 168, max=70703, per=100.00%, avg=27325.46, stdev=14887.94 lat (usec) : 10=55.81%, 20=41.12%, 50=2.70%, 100=0.14%, 250=0.06% lat (usec) : 500=0.02%, 750=0.01%, 1000=0.02% lat (msec) : 2=0.02%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01%, 500=0.02%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01%, >=2000=0.01% cpu : usr=1.56%, sys=5.57%, ctx=29822, majf=0, minf=7 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=677303/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=2645.8MB, aggrb=22545KB/s, minb=22545KB/s, maxb=22545KB/s, mint=120165msec, maxt=120165msec I will attach this as text file in a moment as well. Created attachment 162011 [details]
testcase on the same /home where I got the 254 IOPS, as I enlarged and rebalanced it
This is how BTRFS performs when it can still reserve new data chunks from unused device space.
This is the same as I pasted in last comment.
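To make the "can it still reserve new chunks?" state easy to check, here is a small sketch using the same tools that appear elsewhere in this report; the key indicator is whether any device still has unallocated space, i.e. whether devid "used" is below "size":

# Per-device totals: when "used" has reached "size" on every devid,
# BTRFS cannot reserve any new data or metadata chunks anymore.
btrfs filesystem show /home

# Newer btrfs-progs report the per-device unallocated space directly.
btrfs device usage /home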
Another thing: I used the wrong btrfs fi sh output in comment #6; comment #5 has the right one. But finally I got to:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

This was where IOPS went down to 254. Another test would be to downsize my /home BTRFS again and balance it so it is similar to the above, but with some unused device space left to allocate new chunks from. But right now I have:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:19:15 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.16GiB
        devid    1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=157.00GiB, used=151.85GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Hmmm, it might work, but right now it doesn't want to resize to 160 GiB, not even on just one device:

merkaba:~> btrfs filesystem resize 1:160G /home
Resize '/home' of '1:160G'
ERROR: unable to resize '/home' - No space left on device

Okay, and this is the last series of tests for today. Conclusion: I cannot bring it to its knees like before, but I come close. Still 8000 IOPS instead of 250 IOPS, in a situation that according to btrfs fi sh is even *worse* than before. That hints at the need to look at free space fragmentation, because in the beginning the problem started appearing with:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17

merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Yes, that is 13 GiB of free space *within* the chunks, and that was where I first saw the "kworker hogs one core" issue while using VirtualBox with that Windows XP VM. So while I can reduce the IOPS by bringing the filesystem into a state where it cannot reserve additional data chunks, I cannot recreate the abysmal 250 IOPS figure this way, not even on my /home filesystem. There is more to it. I think it is important to look into free space fragmentation; it seems it needs an *aged* filesystem to recreate, and it seems the balances really helped, as I am not able to recreate the issue to that extent right now. Test log below, will attach as well.

Okay, retesting with:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28.
Dez 14:01:05 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.15GiB devid 1 size 163.00GiB used 159.92GiB path /dev/mapper/msata-home devid 2 size 163.00GiB used 159.92GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.95GiB, used=151.84GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B Thats is just 3 GiB to reserve new data chunks from from. First run – all good: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/134.2MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10483: Sun Dec 28 14:02:59 2014 write: io=4096.0MB, bw=218101KB/s, iops=54525, runt= 19231msec clat (usec): min=4, max=20056, avg=14.87, stdev=143.15 lat (usec): min=4, max=20056, avg=15.09, stdev=143.26 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 17], | 99.00th=[ 52], 99.50th=[ 99], 99.90th=[ 434], 99.95th=[ 980], | 99.99th=[ 7968] bw (KB /s): min=62600, max=424456, per=100.00%, avg=218821.63, stdev=93695.28 lat (usec) : 10=38.19%, 20=58.83%, 50=1.90%, 100=0.59%, 250=0.33% lat (usec) : 500=0.09%, 750=0.03%, 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01% cpu : usr=15.50%, sys=61.86%, ctx=93432, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=218101KB/s, minb=218101KB/s, maxb=218101KB/s, mint=19231msec, maxt=19231msec Second run: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. 
Dez 14:04:01 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.23GiB devid 1 size 163.00GiB used 160.95GiB path /dev/mapper/msata-home devid 2 size 163.00GiB used 160.95GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.98GiB, used=151.91GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.32GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/171.3MB/0KB /s] [0/43.9K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10501: Sun Dec 28 14:05:03 2014 write: io=4096.0MB, bw=220637KB/s, iops=55159, runt= 19010msec clat (usec): min=4, max=20578, avg=14.45, stdev=160.84 lat (usec): min=4, max=20578, avg=14.65, stdev=160.88 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 7], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 17], | 99.00th=[ 42], 99.50th=[ 79], 99.90th=[ 278], 99.95th=[ 620], | 99.99th=[ 9792] bw (KB /s): min= 5, max=454816, per=100.00%, avg=224700.32, stdev=100763.29 lat (usec) : 10=38.15%, 20=58.73%, 50=2.28%, 100=0.47%, 250=0.26% lat (usec) : 500=0.06%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01% cpu : usr=15.83%, sys=63.17%, ctx=74934, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=220636KB/s, minb=220636KB/s, maxb=220636KB/s, mint=19010msec, maxt=19010msec Okay, now try the same without space for a free chunk to allocate. The testfile is still there, fio doesn´t delete and recreate it on every attempt, but just writes into it: martin@merkaba:~> ls -l ssd.test.file -rw-r--r-- 1 martin martin 4294967296 Dez 28 14:05 ssd.test.file Okay – with still one chunk to allocate: merkaba:~> btrfs filesystem resize 1:161G /home Resize '/home' of '1:161G' merkaba:~> btrfs filesystem resize 2:161G /home Resize '/home' of '2:161G' merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:08:45 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.15GiB devid 1 size 161.00GiB used 159.92GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 159.92GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.95GiB, used=151.84GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B I would like to have it allocate the chunks by other means, but it frees them eventually afterwards again, so I did it this way. Note, we still have the original file there. The space it currently occupies is already taken into account. 
Next test: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/130.5MB/0KB /s] [0/33.5K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10563: Sun Dec 28 14:10:34 2014 write: io=4096.0MB, bw=210526KB/s, iops=52631, runt= 19923msec clat (usec): min=4, max=21820, avg=14.78, stdev=119.40 lat (usec): min=4, max=21821, avg=15.03, stdev=120.26 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 11], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 14], 95.00th=[ 17], | 99.00th=[ 62], 99.50th=[ 131], 99.90th=[ 490], 99.95th=[ 964], | 99.99th=[ 6816] bw (KB /s): min= 3, max=410480, per=100.00%, avg=216892.84, stdev=95620.33 lat (usec) : 10=33.20%, 20=63.71%, 50=1.86%, 100=0.59%, 250=0.42% lat (usec) : 500=0.12%, 750=0.03%, 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% cpu : usr=15.13%, sys=62.74%, ctx=94346, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=210525KB/s, minb=210525KB/s, maxb=210525KB/s, mint=19923msec, maxt=19923msec Okay, this is still good. merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:11:18 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.17GiB devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.94GiB, used=151.86GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.30GiB GlobalReserve, single: total=512.00MiB, used=0.00B Now no space left to reserve additional chunks anymore. 
Another test: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/152.3MB/0KB /s] [0/38.1K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10580: Sun Dec 28 14:13:26 2014 write: io=4096.0MB, bw=225804KB/s, iops=56450, runt= 18575msec clat (usec): min=4, max=16669, avg=13.66, stdev=72.88 lat (usec): min=4, max=16669, avg=13.89, stdev=73.06 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 20], | 99.00th=[ 65], 99.50th=[ 113], 99.90th=[ 314], 99.95th=[ 506], | 99.99th=[ 2768] bw (KB /s): min= 4, max=444568, per=100.00%, avg=231326.97, stdev=93374.31 lat (usec) : 10=36.50%, 20=58.44%, 50=3.73%, 100=0.76%, 250=0.44% lat (usec) : 500=0.09%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01% cpu : usr=16.35%, sys=68.39%, ctx=127221, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=225803KB/s, minb=225803KB/s, maxb=225803KB/s, mint=18575msec, maxt=18575msec Okay, this still does not trigger it. Another test, it even freed some chunk: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:14:21 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.28GiB devid 1 size 161.00GiB used 160.85GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.85GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.89GiB, used=151.97GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B Still good: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/156.5MB/0KB /s] [0/40.6K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10589: Sun Dec 28 14:14:37 2014 write: io=4096.0MB, bw=161121KB/s, iops=40280, runt= 26032msec clat (usec): min=4, max=1228.9K, avg=15.69, stdev=1205.88 lat (usec): min=4, max=1228.9K, avg=15.92, stdev=1205.90 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 19], | 99.00th=[ 53], 99.50th=[ 96], 99.90th=[ 366], 99.95th=[ 764], | 99.99th=[ 7776] bw (KB /s): min= 0, max=431680, per=100.00%, avg=219856.30, stdev=98172.64 lat (usec) : 10=39.24%, 20=55.83%, 50=3.81%, 100=0.63%, 250=0.33% lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 2000=0.01% cpu : usr=11.50%, sys=61.08%, ctx=123428, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: 
io=4096.0MB, aggrb=161121KB/s, minb=161121KB/s, maxb=161121KB/s, mint=26032msec, maxt=26032msec Okay, lets allocate one GiB with fallocate to make free space tighter: martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-1 0.00user 0.09system 0:00.11elapsed 86%CPU (0avgtext+0avgdata 1752maxresident)k 112inputs+64outputs (1major+89minor)pagefaults 0swaps martin@merkaba:~> ls -lh 1g-1 -rw-r--r-- 1 martin martin 1,0G Dez 28 14:16 1g merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:16:24 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 156.15GiB devid 1 size 161.00GiB used 160.94GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.94GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.97GiB, used=152.84GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B Still not: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/132.1MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10632: Sun Dec 28 14:17:12 2014 write: io=4096.0MB, bw=198773KB/s, iops=49693, runt= 21101msec clat (usec): min=4, max=543255, avg=16.27, stdev=563.85 lat (usec): min=4, max=543255, avg=16.48, stdev=563.91 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 14], 95.00th=[ 17], | 99.00th=[ 49], 99.50th=[ 98], 99.90th=[ 386], 99.95th=[ 828], | 99.99th=[10816] bw (KB /s): min= 4, max=444848, per=100.00%, avg=203909.07, stdev=109502.11 lat (usec) : 10=33.97%, 20=62.99%, 50=2.05%, 100=0.51%, 250=0.33% lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 750=0.01% cpu : usr=14.21%, sys=60.44%, ctx=70273, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=198772KB/s, minb=198772KB/s, maxb=198772KB/s, mint=21101msec, maxt=21101msec Another 1G file. 
Got it: ATOP - merkaba 2014/12/28 14:18:14 ----------- 10s elapsed PRC | sys 21.74s | user 2.48s | #proc 382 | #trun 8 | #tslpi 698 | #tslpu 1 | #zombie 0 | no procacct | CPU | sys 218% | user 24% | irq 1% | idle 155% | wait 2% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 75% | user 5% | irq 0% | idle 19% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 59% | user 3% | irq 0% | idle 37% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 48% | user 6% | irq 1% | idle 45% | cpu000 w 1% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 36% | user 9% | irq 0% | idle 54% | cpu002 w 1% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 2.13 | avg5 2.37 | avg15 1.92 | | csw 67473 | intr 59152 | | numcpu 4 | MEM | tot 15.5G | free 1.1G | cache 11.0G | buff 0.1M | slab 740.2M | shmem 190.9M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.4G | | | | | vmcom 5.4G | vmlim 19.7G | PAG | scan 0 | steal 0 | stall 1 | | | | swin 19 | swout 0 | LVM | sata-home | busy 8% | read 4 | write 26062 | KiB/w 3 | MBr/s 0.00 | MBw/s 10.18 | avio 0.03 ms | LVM | msata-home | busy 5% | read 4 | write 26062 | KiB/w 3 | MBr/s 0.00 | MBw/s 10.18 | avio 0.02 ms | LVM | sata-swap | busy 0% | read 19 | write 0 | KiB/w 0 | MBr/s 0.01 | MBw/s 0.00 | avio 0.05 ms | LVM | msata-debian | busy 0% | read 0 | write 4 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | LVM | sata-debian | busy 0% | read 0 | write 4 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | DSK | sda | busy 8% | read 23 | write 13239 | KiB/w 7 | MBr/s 0.01 | MBw/s 10.18 | avio 0.06 ms | DSK | sdb | busy 5% | read 4 | write 14360 | KiB/w 7 | MBr/s 0.00 | MBw/s 10.18 | avio 0.04 ms | NET | transport | tcpi 18 | tcpo 18 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | NET | network | ipi 18 | ipo 18 | ipfrw 0 | deliv 18 | | icmpi 0 | icmpo 0 | NET | eth0 0% | pcki 2 | pcko 2 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/5 10657 - martin martin 1 9.88s 0.00s 0K 0K 0K 48K -- - R 1 99% fallocate 9685 - root root 1 9.84s 0.00s 0K 0K 0K 0K -- - D 0 99% kworker/u8:10 martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-2 ; ls -l 1g* 0.00user 59.28system 1:00.21elapsed 98%CPU (0avgtext+0avgdata 1756maxresident)k 0inputs+416outputs (0major+90minor)pagefaults 0swaps -rw-r--r-- 1 martin martin 1073741824 Dez 28 14:16 1g-1 -rw-r--r-- 1 martin martin 1073741824 Dez 28 14:17 1g-2 One minute in system CPU for this. Okay, so now another test: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:19:30 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 157.18GiB devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.94GiB, used=153.87GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.30GiB GlobalReserve, single: total=512.00MiB, used=0.00B I admit, this now really isn´t nice to it anymore, but I want to see where it starts to become an issue. 
ATOP - merkaba 2014/12/28 14:21:18 ----------- 1s elapsed PRC | sys 1.15s | user 0.16s | #proc 382 | #trun 2 | #tslpi 707 | #tslpu 1 | #zombie 0 | clones 0 | | no procacct | CPU | sys 163% | user 24% | irq 1% | idle 189% | wait 26% | | steal 0% | guest 0% | curf 3.01GHz | curscal 94% | cpu | sys 72% | user 1% | irq 0% | idle 25% | cpu001 w 1% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 41% | user 9% | irq 0% | idle 32% | cpu002 w 19% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 34% | user 10% | irq 0% | idle 53% | cpu003 w 3% | | steal 0% | guest 0% | curf 3.03GHz | curscal 94% | cpu | sys 16% | user 4% | irq 0% | idle 77% | cpu000 w 3% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 2.37 | avg5 2.64 | avg15 2.13 | | | csw 18687 | intr 12435 | | | numcpu 4 | MEM | tot 15.5G | free 2.5G | cache 9.5G | buff 0.1M | slab 742.6M | shmem 242.8M | shrss 115.5M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M | SWP | tot 12.0G | free 11.4G | | | | | | | vmcom 5.5G | vmlim 19.7G | LVM | msata-home | busy 71% | read 28 | write 8134 | KiB/r 4 | KiB/w 3 | MBr/s 0.11 | MBw/s 31.68 | avq 13.21 | avio 0.06 ms | LVM | sata-home | busy 40% | read 72 | write 8135 | KiB/r 4 | KiB/w 3 | MBr/s 0.28 | MBw/s 31.69 | avq 41.67 | avio 0.03 ms | DSK | sdb | busy 71% | read 24 | write 6049 | KiB/r 4 | KiB/w 5 | MBr/s 0.11 | MBw/s 31.68 | avq 5.64 | avio 0.08 ms | DSK | sda | busy 38% | read 60 | write 5987 | KiB/r 4 | KiB/w 5 | MBr/s 0.28 | MBw/s 31.69 | avq 20.40 | avio 0.04 ms | NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | tcpie 0 | udpie 0 | NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | | | icmpi 0 | icmpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 20 Kbps | so 20 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2 10459 - root root 1 0.70s 0.00s 0K 0K 0K 0K -- - D 3 100% kworker/u8:17 10674 - martin martin 1 0.20s 0.01s 0K 0K 0K 27504K -- - R 0 30% fio Okay. It is jumping between 0 IOPS and 40000 IOPS and hogging 100% with one kworker. 
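For reference, the kernel task dumps in the attached kern.log were captured roughly like this while a kworker was spinning (a sketch; sysrq must be enabled, and the log file path depends on the syslog configuration):

# Enable all sysrq functions, then dump the stack of every task ("t").
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger

# The task dumps land in the kernel ring buffer and in syslog/kern.log.
dmesg | tail -n 300
tail -n 300 /var/log/kern.log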
Still quite okay in terms of IOPS tough: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/126.8MB/0KB /s] [0/32.5K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10674: Sun Dec 28 14:22:16 2014 write: io=3801.3MB, bw=32415KB/s, iops=8103, runt=120083msec clat (usec): min=4, max=1809.9K, avg=83.88, stdev=3615.98 lat (usec): min=4, max=1809.9K, avg=84.10, stdev=3616.00 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 11], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 18], | 99.00th=[ 52], 99.50th=[ 124], 99.90th=[24704], 99.95th=[30592], | 99.99th=[57088] bw (KB /s): min= 0, max=417544, per=100.00%, avg=48302.16, stdev=89108.07 lat (usec) : 10=35.61%, 20=60.17%, 50=3.17%, 100=0.47%, 250=0.27% lat (usec) : 500=0.05%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.04%, 50=0.16% lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=2.37%, sys=29.74%, ctx=202984, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=973128/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=3801.3MB, aggrb=32415KB/s, minb=32415KB/s, maxb=32415KB/s, mint=120083msec, maxt=120083msec I stop this here now. Its interesting to see that even with: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:23:11 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 157.89GiB devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.94GiB, used=154.59GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.30GiB GlobalReserve, single: total=512.00MiB, used=0.00B I am not able to fully reproduce it. I can partly reproduce it, but it behaves way better than before. I think to go further one needs to have a look at the free space fragmentation inside the chunks. As in the beginning I had: I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with space_cache, skinny meta data extents – are these a problem? – and compress=lzo: merkaba:~> btrfs fi sh /home Label: 'home' uuid: […] Total devices 2 FS bytes used 144.41GiB devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~> btrfs fi df /home Data, RAID1: total=154.97GiB, used=141.12GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.29GiB GlobalReserve, single: total=512.00MiB, used=0.00B So I had quite some free space *within* the chunks and it still was a problem. Created attachment 162041 [details]
round of testing with really tight free space
Still not able to completely reproduce the worst-case <300 IOPS scenario where the desktop locked up.
One more note about the IOPS. I currently let fio run with:

martin@merkaba:~> cat ssd-test.fio
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

#[seq-write]
#rw=write
#stonewall

[rand-write]
rw=randwrite
stonewall

This uses buffered I/O via read()/write() system calls, so these IOPS figures do not reflect the raw device capabilities. I specifically wanted to test through the page cache to simulate what I see with VirtualBox writing to the VDI file (i.e. dirty pages piling up and dirty_background_ratio taking effect), just like a real application. But that also means the IOPS may come out higher, because fio finishes before all of the writes have been completed to disk. So when I reach <300 IOPS with buffered writes, it means that even through the page cache BTRFS was not able to yield higher IOPS. It also means I measure write requests the way an application would issue them (unless it uses fsync() or direct I/O, which VirtualBox apparently doesn't, at least not for every request). Just wanted to make that explicit. It is basically visible in the job file from what I commented out there, but I thought I would mention it. I just tested the effect by reducing the test file to 500 MiB and the runtime to 10 seconds, and I got 98000 IOPS. So the larger test file size, but especially the runtime, forces the kernel to do actual writes due to:

merkaba:~> grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

(Standard values. I still see no need to tune anything here with these SSDs, not even with the 16 GiB of RAM the laptop has, as the SSDs usually easily keep up, and I'd rather wait for a change in the default value unless I am convinced of a benefit of manually adapting it in *this* case.)

With

merkaba:~> btrfs fi df /home
Data, RAID1: total=163.87GiB, used=146.92GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.94GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 150.18GiB
        devid    1 size 170.00GiB used 169.84GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 169.84GiB path /dev/mapper/sata-home
Btrfs v3.17

I had a noticeable hang during sdelete.exe -z in the Windows XP VM with the 20 GiB VDI file (Patrik on the mailing list told me the argument changed from -c to -z, as I wondered why VBoxManage modifyhd Winlala.vdi --compact did not reduce the size of the file). It was not as bad, but the desktop was easily locked for more than 5 seconds. So this also happens with larger free space *within* the chunks. Before I do the VBoxManage --compact I will now rebalance partly. So this definitely shows it can happen when BTRFS cannot reserve any new chunks anymore, yet still has *plenty* of free space within the existing data chunks. My current idea is this: from what I perceive, the worst-case issue happens when 1) BTRFS cannot reserve any new chunks from unused device space anymore, and 2) the free space in the existing chunks is highly fragmented. One of those conditions alone is not sufficient to trigger it.

This still happens with the 4.3 kernel. Had this on /home again after a long time of it working reliably.
It has not been more than 120 seconds, but easily 10-30 seconds of one kworker thread hogging one core with the Plasma desktop unresponsive, waiting for I/O. I managed to balance data chunks after vacuuming the Akonadi database and moving some data elsewhere. On btrfs balance start -dusage=95 the mouse pointer froze while I was writing a mail in KMail. I let it sit for some minutes, but it is still sitting there, mouse pointer frozen. This is still not production ready. I wouldn't use this on any customer machine without lots and lots of underprovisioning.

Debian Jessie with kernel 4.2.6-3~bpo8+2, ZoneMinder with 17 cameras, BTRFS RAID1 (managed by BTRFS) on 2x1TB SATA 7200RPM HDDs, 88% disk full, Intel Core i5-4570 (Haswell), 2x8GiB DDR3-1600 CAS9. If I use motion detection it works perfectly, but if I put all 17 cameras into continuous recording, it works, but after a couple of minutes kworker eats 100% of one CPU, loadavg spikes to 23, and little actual I/O goes to the HDDs; they go idle, the workload becomes CPU bound, stalling everything. I will try tomorrow with 80% full, but what percentage of full utilization should I use / do you recommend? The higher the better for this case... Thanks.

Tried with 80% full, same issue. iostat -x 1 shows this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   25,19    0,25    0,00   74,56

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0,00   33,00   0,00  60,00    0,00  2540,00    84,67     0,00   0,00    0,00    0,00   0,00   0,00
sdb        0,00   39,00   0,00  54,00    0,00  2540,00    94,07     0,02   0,30    0,00    0,30   0,30   1,60

Remember, 4 cores, so 25% system means kworker is eating a full CPU; there are some writes but not much disk utilization, and perf top shows a lot of bit searching in the kernel. Using skinny metadata extents, no subvolumes and no snapshots, noatime, no compression, COW everywhere except the MySQL files. /proc/mounts:

/dev/sdb3 / btrfs rw,noatime,nodiratime,space_cache,subvolid=5,subvol=/ 0 0

I think I am going to change to another FS. Though my experience with BTRFS has been great so far, this is a showstopper for this use case. I hope it can be improved some time so I could switch to it for all my machines; in the meantime I will continue using it for backups and some other non-I/O-intensive machines. Thanks a lot!!! Happy new Year!

I found that when

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 148.98GiB
        devid    1 size 170.00GiB used 170.00GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 170.00GiB path /dev/mapper/sata-home

shows that "used" matches "size" on the devices themselves, in other words when the total space of the devices is allocated in chunks, this *can* happen, no matter the percentage of allocation *inside* the chunks (which is the value shown on the "Total devices ... FS bytes used" line):

merkaba:~> btrfs fi df /home
Data, RAID1: total=163.94GiB, used=145.32GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=6.03GiB, used=3.67GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Thus in this condition it could already happen for me. So it is not so much about the overall fill rate of the device, but whether BTRFS, if need be, can still allocate another chunk. That said, in my experience so far this is a necessary, but not a sufficient condition for it to happen. So the system can run like this for quite a while, but at some point may run into the "kworker using 100% CPU of one core" issue.
For production I thus recommend keeping the device allocation into chunks *below* the available device space by balancing. But it can happen that it doesn't balance anymore, giving an out-of-space error. So another alternative is: do not use it for interactive machines yet. I mentioned in a recent thread that I think BTRFS is still not production ready and got some responses and some discussion about the issue. Chris Mason said that this issue seems to be one that needs a lot of think time to solve properly [1]. I will follow up there in a moment.

Thanks, Martin

[1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/51176

What usually works okay-ish here is something like:

btrfs balance start -dusage=80 /home

Experiment with the value; for metadata you use -m. Just running a balance without parameters gave me a "no space left on device" error again. With -dusage I again got:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 148.96GiB
        devid    1 size 170.00GiB used 161.18GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 161.18GiB path /dev/mapper/sata-home

merkaba:~> btrfs fi df /home
Data, RAID1: total=155.12GiB, used=145.31GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=6.03GiB, used=3.65GiB
GlobalReserve, single: total=512.00MiB, used=784.00KiB

merkaba:~> btrfs device usage /home
/dev/dm-0, ID: 1
   Device size:           170.00GiB
   Data,RAID1:            155.12GiB
   Metadata,RAID1:          6.03GiB
   System,RAID1:           32.00MiB
   Unallocated:             8.82GiB

/dev/dm-3, ID: 2
   Device size:           170.00GiB
   Data,RAID1:            155.12GiB
   Metadata,RAID1:          6.03GiB
   System,RAID1:           32.00MiB
   Unallocated:             8.82GiB

The important value is Unallocated. I never had a kworker thread using 100% of one core for extended periods of time while this value was above 0. Whether this can be automated in a reliable way, with some monitoring of the unallocated value plus an automatic trigger for the balance, I am not sure; it would depend on whether a fixed "usage=" value works in all cases. It would also be important to let the balance run at times without high load, or at least with ionice -c3 (btrfs balance doesn't seem to have an option for that), and even then this only works with the CFQ I/O scheduler, and even then only for the reads, not for the (buffered) writes. Anyway, if you want to discuss this, please subscribe to the BTRFS mailing list; let's use the bug report for actual findings and facts.

I think I saw this up to 4.3. I am not sure whether I saw it with 4.4, and I definitely didn't see it with 4.5. Since it depends on the state of the filesystem, it may still happen, but for now I am closing this. I will reopen if I see this again. Thanks.

As I cannot trace this to a certain code fix, I will close as no longer reproducible. (I think it would be good if Bugzilla had something more specific for this.)
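Building on the idea above, here is a minimal sketch of such an automated check, assuming a cron-driven shell script; the threshold of 0 and the -dusage=80 value are taken from this report, everything else (script name, schedule, value parsing) is hypothetical:

#!/bin/sh
# Hypothetical watchdog: if any device of the given BTRFS filesystem has no
# unallocated space left, run a filtered data balance to free whole chunks.
# Intended to be run from cron during low-load hours.
MNT=/home

# Smallest "Unallocated" value across the devices (assumes btrfs-progs that
# provide "btrfs device usage" and report the value in GiB).
UNALLOC=$(btrfs device usage "$MNT" \
    | awk '/Unallocated/ { gsub(/GiB/, "", $2); print $2 }' \
    | sort -n | head -n 1)

if awk -v u="$UNALLOC" 'BEGIN { exit !(u <= 0) }'; then
    # ionice only helps the reads under CFQ, as noted above, but cannot hurt.
    ionice -c3 btrfs balance start -dusage=80 "$MNT"
fi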