Bug 90401
Summary: | btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file | |
---|---|---|---
Product: | File System | Reporter: | Martin Steigerwald (Martin)
Component: | btrfs | Assignee: | Josef Bacik (josef)
Status: | RESOLVED UNREPRODUCIBLE | |
Severity: | normal | CC: | ibaldo
Priority: | P1 | |
Hardware: | All | |
OS: | Linux | |
Kernel Version: | 3.18 | Subsystem: |
Regression: | No | Bisected commit-id: |
Attachments: |
- kern.log with a lot of sysrq-t, the first few while all was fine, but then during kworker using 100% of one core
- testcase where I tried to bring BTRFS to <300 IOPS but failed
- testcase on the same /home where I got the 254 IOPS, as I enlarged and rebalanced it
- round of testing with really tight free space
Description
Martin Steigerwald
2014-12-27 13:50:48 UTC
Cut and paste the atop output into a wide editor window to see it as intended. :) Next I will try to reproduce this with a freshly created BTRFS filesystem. Robert pointed out that

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

means the filesystem is almost full. Yes, that is right. But for the initial case, where I used a Windows XP VM with NTFS in an about 17-18 GiB VDI image and tried to defragment inside it or just install a 500 MB tax return application, I had:

merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

There it happened as well that a kworker used 100% of one core for minutes, up to the point that the desktop environment (Plasma) locked up waiting for I/O. And it actually got to the fill level shown above due to the balance; I did not suddenly put an additional 8 GiB onto the filesystem. Anyway, I tried to reproduce it with just the "filesystem is full" condition, but I can't. There is more to it. I will attach a test case log for the reproducer attempts.

Created attachment 161971 [details]
testcase where I tried to bring BTRFS to <300 IOPS but failed
Filesystem full alone does not trigger it. At least for a small filesystem.
Maybe it needs a big filesystem that is almost full?
Or is there something else wrong with my /home?
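As a reference for these reproduction attempts, here is a minimal sketch of the small fresh test filesystem described here, assuming two spare ~10 GiB block devices (the device paths are placeholders, not the ones from this report):

# Create a small dual-device BTRFS RAID1 (data and metadata mirrored),
# similar to the 2x10 GiB test filesystem used in the failed repro attempt.
mkfs.btrfs -L btrfstest -d raid1 -m raid1 /dev/vg/test1 /dev/vg/test2
mkdir -p /mnt/btrfsraid1
mount -o compress=lzo /dev/vg/test1 /mnt/btrfsraid1

# Then fill it until no unallocated device space is left and run the fio
# random-write job (ssd-test.fio, shown later in this report) against it.
cd /mnt/btrfsraid1 && fio ~/ssd-test.fio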
Okay, I want to add that I first triggered this "kworker thread uses 100% of one core for minutes" issue by installing an about 400 MiB application into a Windows XP VirtualBox VM, inside a copy-on-write about 17-18 GiB VDI image file with NTFS. Then I triggered it again while compacting the VDI image as recommended by an Oracle blog post: defragmenting the NTFS, then running sdelete -c inside Windows XP, which seems to be a bit like fstrim, and then running a VBoxManage command to compact the VDI file. My observation was that it only happens once BTRFS cannot allocate any more chunks because there is no unallocated space left on the devices. I also want to add that I did the NTFS defragmentation several times today on a BTRFS with plenty of free space to allocate chunks from, and this *didn't* trigger the "kworker hogs a CPU" issue. So for me the issue does not seem to happen as long as BTRFS is still able to reserve new data chunks from unreserved device space. I will try to verify this with the fio testcase as well. Below is a summary of how I initially found this issue. The whole story:

1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again, which gave me:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
        Total devices 2 FS bytes used 144.19GiB
        devid    1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home
Btrfs v3.17

merkaba:~> btrfs fi df /home
Data, RAID1: total=144.98GiB, used=140.95GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.24GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

2) I ran the VirtualBox machine again and defragmented the NTFS filesystem in the VDI image file. And: it worked *just* fine. Fine as in *fine*. No issues whatsoever.
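For context, the VDI compaction workflow described above boils down to roughly the following steps; the image name matches the one mentioned later in this report, and the sdelete flag shown is the newer -z variant that is also discussed further down:

# Inside the Windows XP guest (cmd.exe): defragment NTFS, then zero out the
# free space so the freed blocks can be dropped from the image afterwards.
#   defrag c:
#   sdelete -z c:

# On the Linux host, with the VM shut down: compact the copy-on-write VDI.
VBoxManage modifyhd Winlala.vdi --compact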
I got this during the run: ATOP - merkaba 2014/12/27 12:58:42 ----------- 10s elapsed PRC | sys 10.41s | user 1.08s | #proc 357 | #trun 4 | #tslpi 694 | #tslpu 0 | #zombie 0 | no procacct | CPU | sys 107% | user 11% | irq 0% | idle 259% | wait 23% | guest 0% | curf 3.01GHz | curscal 93% | cpu | sys 29% | user 3% | irq 0% | idle 63% | cpu002 w 5% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 27% | user 3% | irq 0% | idle 65% | cpu000 w 5% | guest 0% | curf 3.03GHz | curscal 94% | cpu | sys 26% | user 3% | irq 0% | idle 63% | cpu003 w 8% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 24% | user 2% | irq 0% | idle 68% | cpu001 w 6% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 1.92 | avg5 1.01 | avg15 0.56 | | csw 501619 | intr 129279 | | numcpu 4 | MEM | tot 15.5G | free 610.1M | cache 9.1G | buff 0.1M | slab 1.0G | shmem 183.5M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.6G | | | | | vmcom 7.1G | vmlim 19.7G | PAG | scan 219141 | steal 215577 | stall 936 | | | | swin 0 | swout 940 | LVM | sata-home | busy 53% | read 181413 | write 0 | KiB/w 0 | MBr/s 70.86 | MBw/s 0.00 | avio 0.03 ms | LVM | sata-swap | busy 2% | read 0 | write 940 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.37 | avio 0.17 ms | LVM | sata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 1.00 ms | LVM | msata-debian | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | DSK | sda | busy 53% | read 181413 | write 477 | KiB/w 7 | MBr/s 70.86 | MBw/s 0.37 | avio 0.03 ms | DSK | sdb | busy 0% | read 0 | write 1 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | icmpi 0 | icmpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3 9650 - martin martin 22 7.89s 0.65s 0K 128K 705.5M 382.1M -- - S 2 87% VirtualBox 9911 - root root 1 0.69s 0.01s 0K 0K 0K 0K -- - S 3 7% watch 9598 - root root 1 0.38s 0.00s 0K 0K 0K 20K -- - S 0 4% kworker/u8:9 9892 - root root 1 0.36s 0.00s 0K 0K 0K 0K -- - S 1 4% kworker/u8:17 9428 - root root 1 0.30s 0.00s 0K 0K 0K 0K -- - R 0 3% kworker/u8:3 9589 - root root 1 0.23s 0.00s 0K 0K 0K 0K -- - S 1 2% kworker/u8:6 4746 - martin martin 2 0.04s 0.13s 0K -16K 0K 0K -- - R 2 2% konsole Every 1,0s: cat /proc/meminfo Sat Dec 27 12:59:23 2014 MemTotal: 16210512 kB MemFree: 786632 kB MemAvailable: 10271500 kB Buffers: 52 kB Cached: 9564340 kB SwapCached: 70268 kB Active: 6847560 kB Inactive: 5257956 kB Active(anon): 2016412 kB Inactive(anon): 703076 kB Active(file): 4831148 kB Inactive(file): 4554880 kB Unevictable: 9068 kB Mlocked: 9068 kB SwapTotal: 12582908 kB SwapFree: 12186680 kB Dirty: 972324 kB Writeback: 0 kB AnonPages: 2526340 kB Mapped: 2457096 kB Shmem: 173564 kB Slab: 918128 kB SReclaimable: 848816 kB SUnreclaim: 69312 kB KernelStack: 11200 kB PageTables: 64556 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20688164 kB Committed_AS: 7438348 kB I am not seeing more than one GiB of dirty here during regular usage and it is no problem. And kworker thread CPU usage just fine. So no, the dirty_background_ratio isn´t an issue with this 16 GiB ThinkPad T520. [… some rant of mine about not adapting dirty_background_ratio when it is not actually needed …] So defragmentation completed just nice, no issue so far. 
But I am close to full device space reservation already: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home Sa 27. Dez 13:02:40 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 151.58GiB devid 1 size 160.00GiB used 158.01GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 158.01GiB path /dev/mapper/sata-home As I was not able to use the NTFS defragmenter as a reproduce then as the NTFS was already defragmented I then introduced the fio job, which just hit it nice. Now with a freshly formatted, but small dual SSD BTRFS RAID 1 with 10 GiB on each of two devices I was not able to trigger this kworker hogs 100% of one core case. So the BTRFS allocates all space from device seems to be necessary to trigger it on my /home, but doesn´t suffice to trigger it on a fresh BTRFS. I thought this much of history may be helpful to understand where this bug report is coming from. And I thought I better add it here than ask you to read the quite long thread I mentioned initially. And I forget the original fio test results and summarized too much: That was the test in the case all device space was reserved for chunks. I.e. with this: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home Sa 27. Dez 13:26:39 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 152.83GiB devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.97GiB, used=149.58GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.26GiB GlobalReserve, single: total=512.00MiB, used=0.00B And I got only 254 IOPS on random write then. Its the random write case it seems. Here are values from fio job: martin@merkaba:~> fio ssd-test.fio seq-write: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 rand-write: (g=1): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 2 processes Jobs: 1 (f=1): [_(1),w(1)] [3.6% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 01h:06m:26s] seq-write: (groupid=0, jobs=1): err= 0: pid=19212: Sat Dec 27 13:48:33 2014 write: io=4096.0MB, bw=343683KB/s, iops=85920, runt= 12204msec clat (usec): min=3, max=38048, avg=10.52, stdev=205.25 lat (usec): min=3, max=38048, avg=10.66, stdev=205.43 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4], | 30.00th=[ 4], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5], | 70.00th=[ 7], 80.00th=[ 8], 90.00th=[ 8], 95.00th=[ 9], | 99.00th=[ 14], 99.50th=[ 20], 99.90th=[ 211], 99.95th=[ 2128], | 99.99th=[10304] bw (KB /s): min=164328, max=812984, per=100.00%, avg=345585.75, stdev=201695.20 lat (usec) : 4=0.18%, 10=95.31%, 20=4.00%, 50=0.18%, 100=0.12% lat (usec) : 250=0.12%, 500=0.02%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% cpu : usr=13.55%, sys=46.89%, ctx=7810, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Seems fine. 
But: rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014 write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61 lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5], | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6], | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10], | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116], | 99.99th=[16711680] bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02 lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02% lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02% cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=343682KB/s, minb=343682KB/s, maxb=343682KB/s, mint=12204msec, maxt=12204msec Run status group 1 (all jobs): WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec The full mail I referred to in my last and in this comment is: http://www.spinics.net/lists/linux-btrfs/msg40745.html In a moment I post how /home behaves when there is still unused device space left to reserve chunks from. This is another test. With my /home. This time while it has enough free device space to reserve new chunks from. Remember this for the case where is hasn´t: merkaba:~> btrfs fi sh /home Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 144.19GiB devid 1 size 160.00GiB used 150.01GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 150.01GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~> btrfs fi df /home Data, RAID1: total=144.98GiB, used=140.95GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.24GiB GlobalReserve, single: total=512.00MiB, used=0.00B rand-write: (groupid=1, jobs=1): err= 0: pid=19243: Sat Dec 27 13:48:33 2014 write: io=140336KB, bw=1018.4KB/s, iops=254, runt=137803msec clat (usec): min=4, max=21299K, avg=3708.02, stdev=266885.61 lat (usec): min=4, max=21299K, avg=3708.14, stdev=266885.61 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 5], 10.00th=[ 5], 20.00th=[ 5], | 30.00th=[ 6], 40.00th=[ 6], 50.00th=[ 6], 60.00th=[ 6], | 70.00th=[ 7], 80.00th=[ 7], 90.00th=[ 9], 95.00th=[ 10], | 99.00th=[ 18], 99.50th=[ 19], 99.90th=[ 28], 99.95th=[ 116], | 99.99th=[16711680] bw (KB /s): min= 0, max= 3426, per=100.00%, avg=1030.10, stdev=938.02 lat (usec) : 10=92.63%, 20=6.89%, 50=0.43%, 100=0.01%, 250=0.02% lat (msec) : 250=0.01%, 500=0.01%, >=2000=0.02% cpu : usr=0.06%, sys=1.59%, ctx=28720, majf=0, minf=7 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=35084/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 […] Run status group 1 (all jobs): WRITE: io=140336KB, aggrb=1018KB/s, minb=1018KB/s, maxb=1018KB/s, mint=137803msec, maxt=137803msec That is where I saw this kworker thread at 100% of one Sandybridge core for minutes. 
This is where I made kern.log with sysrq-t triggers in https://bugzilla.kernel.org/show_bug.cgi?id=90401#c0 made from. But now as I extended it to 170 GiB and did some basic rebalancing (upto btrfs balance start -dusage=80) I have this: First attempt: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 13:13:47 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 151.09GiB devid 1 size 170.00GiB used 158.03GiB path /dev/mapper/msata-home devid 2 size 170.00GiB used 158.03GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=153.00GiB, used=147.83GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.26GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process rand-write: Laying out IO file(s) (1 file(s) / 4096MB) Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/84528KB/0KB /s] [0/21.2K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=9987: Sun Dec 28 13:14:32 2014 write: io=4096.0MB, bw=155304KB/s, iops=38826, runt= 27007msec clat (usec): min=5, max=28202, avg=22.03, stdev=240.04 lat (usec): min=5, max=28202, avg=22.28, stdev=240.16 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 10], 40.00th=[ 11], 50.00th=[ 12], 60.00th=[ 13], | 70.00th=[ 14], 80.00th=[ 15], 90.00th=[ 17], 95.00th=[ 23], | 99.00th=[ 93], 99.50th=[ 175], 99.90th=[ 2096], 99.95th=[ 6816], | 99.99th=[10176] bw (KB /s): min=76832, max=413616, per=100.00%, avg=156706.75, stdev=101101.26 lat (usec) : 10=29.85%, 20=62.43%, 50=5.74%, 100=1.07%, 250=0.57% lat (usec) : 500=0.16%, 750=0.04%, 1000=0.02% lat (msec) : 2=0.02%, 4=0.01%, 10=0.08%, 20=0.01%, 50=0.01% cpu : usr=12.05%, sys=47.34%, ctx=86985, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=155304KB/s, minb=155304KB/s, maxb=155304KB/s, mint=27007msec, maxt=27007msec Second attempt: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. 
Dez 13:16:19 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 155.11GiB devid 1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home devid 2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=157.00GiB, used=151.83GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.27GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/113.5MB/0KB /s] [0/29.3K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10043: Sun Dec 28 13:17:34 2014 write: io=4096.0MB, bw=145995KB/s, iops=36498, runt= 28729msec clat (usec): min=4, max=143201, avg=23.95, stdev=518.47 lat (usec): min=4, max=143201, avg=24.13, stdev=518.48 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 16], | 99.00th=[ 33], 99.50th=[ 70], 99.90th=[ 5472], 99.95th=[ 8640], | 99.99th=[20864] bw (KB /s): min= 4, max=433760, per=100.00%, avg=149179.63, stdev=136784.14 lat (usec) : 10=38.35%, 20=58.99%, 50=1.96%, 100=0.38%, 250=0.16% lat (usec) : 500=0.03%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.08%, 20=0.02%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01% cpu : usr=10.25%, sys=42.40%, ctx=42642, majf=0, minf=8 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=145995KB/s, minb=145995KB/s, maxb=145995KB/s, mint=28729msec, maxt=28729msec Third attempt: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. 
Dez 13:18:24 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 156.16GiB devid 1 size 170.00GiB used 160.03GiB path /dev/mapper/msata-home devid 2 size 170.00GiB used 160.03GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.00GiB, used=152.83GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.34GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/195.7MB/0KB /s] [0/50.9K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10058: Sun Dec 28 13:18:59 2014 write: io=4096.0MB, bw=202184KB/s, iops=50545, runt= 20745msec clat (usec): min=4, max=28261, avg=15.84, stdev=214.59 lat (usec): min=4, max=28261, avg=16.06, stdev=214.78 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 9], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 17], | 99.00th=[ 52], 99.50th=[ 105], 99.90th=[ 426], 99.95th=[ 980], | 99.99th=[12736] bw (KB /s): min= 4, max=426344, per=100.00%, avg=207355.30, stdev=105104.72 lat (usec) : 10=41.44%, 20=55.33%, 50=2.17%, 100=0.54%, 250=0.34% lat (usec) : 500=0.10%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.01% cpu : usr=14.15%, sys=59.06%, ctx=81711, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=202183KB/s, minb=202183KB/s, maxb=202183KB/s, mint=20745msec, maxt=20745msec merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 13:19:15 CET 2014 Label: 'home' uuid: [some UUID] Total devices 2 FS bytes used 155.16GiB devid 1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home devid 2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=157.00GiB, used=151.85GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B So here BTRFS was fast. It is getting trouble on my /home when its almost full. 
But more so than with a empty 10 GiB filesystem as I have shown in testcase https://bugzilla.kernel.org/show_bug.cgi?id=90401#c3 There I had: merkaba:/mnt/btrfsraid1> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process rand-write: Laying out IO file(s) (1 file(s) / 4096MB) Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/20924KB/0KB /s] [0/5231/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=6221: Sat Dec 27 15:34:14 2014 write: io=2645.8MB, bw=22546KB/s, iops=5636, runt=120165msec clat (usec): min=4, max=3054.8K, avg=174.87, stdev=11455.26 lat (usec): min=4, max=3054.8K, avg=175.03, stdev=11455.27 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 6], 10.00th=[ 6], 20.00th=[ 7], | 30.00th=[ 7], 40.00th=[ 8], 50.00th=[ 9], 60.00th=[ 10], | 70.00th=[ 11], 80.00th=[ 12], 90.00th=[ 14], 95.00th=[ 17], | 99.00th=[ 30], 99.50th=[ 40], 99.90th=[ 1992], 99.95th=[25984], | 99.99th=[411648] bw (KB /s): min= 168, max=70703, per=100.00%, avg=27325.46, stdev=14887.94 lat (usec) : 10=55.81%, 20=41.12%, 50=2.70%, 100=0.14%, 250=0.06% lat (usec) : 500=0.02%, 750=0.01%, 1000=0.02% lat (msec) : 2=0.02%, 4=0.02%, 10=0.02%, 20=0.01%, 50=0.01% lat (msec) : 100=0.01%, 250=0.01%, 500=0.02%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01%, >=2000=0.01% cpu : usr=1.56%, sys=5.57%, ctx=29822, majf=0, minf=7 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=677303/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=2645.8MB, aggrb=22545KB/s, minb=22545KB/s, maxb=22545KB/s, mint=120165msec, maxt=120165msec I will attach this as text file in a moment as well. Created attachment 162011 [details]
testcase on the same /home where I got the 254 IOPS, as I enlarged and rebalanced it
This is how BTRFS performs when it can still reserve new data chunks from unused device space.
This is the same as I pasted in last comment.
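To make the "can it still reserve new chunks?" state easy to check, here is a small sketch using the same tools that appear elsewhere in this report; the key indicator is whether any device still has unallocated space, i.e. whether devid "used" is below "size":

# Per-device totals: when "used" has reached "size" on every devid,
# BTRFS cannot reserve any new data or metadata chunks anymore.
btrfs filesystem show /home

# Newer btrfs-progs report the per-device unallocated space directly.
btrfs device usage /home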
Another thing: I used the wrong btrfs fi sh output in comment #6; comment #5 has the right one. But finally I got to:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

This was where IOPS went down to 254. Another test would be to downsize my /home BTRFS again and balance it so it is similar to the above, but with some unused device space left to allocate new chunks from. But right now I have:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28. Dez 13:19:15 CET 2014
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 155.16GiB
        devid    1 size 170.00GiB used 162.03GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 162.03GiB path /dev/mapper/sata-home
Btrfs v3.17
Data, RAID1: total=157.00GiB, used=151.85GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.31GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Hmmm, it might work, but right now it doesn't want to resize to 160 GiB, not even on just one device:

merkaba:~> btrfs filesystem resize 1:160G /home
Resize '/home' of '1:160G'
ERROR: unable to resize '/home' - No space left on device

Okay, and this is the last series of tests for today. Conclusion: I cannot bring it to its knees like before, but I come close. Still 8000 IOPS instead of 250 IOPS, in a situation that according to btrfs fi sh is even *worse* than before. That hints at the need to look at free space fragmentation, because in the beginning the problem started appearing with:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 144.41GiB
        devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17

merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Yes, that is 13 GiB of free space *within* the chunks, and that was where I first saw the "kworker hogs one core" issue while using VirtualBox with that Windows XP VM. So while I can reduce the IOPS by bringing the filesystem into a state where it cannot reserve additional data chunks, I cannot recreate the abysmal 250 IOPS figure this way, not even on my /home filesystem. There is more to it. I think it is important to look into free space fragmentation; it seems it needs an *aged* filesystem to recreate, and it seems the balances really helped, as I am not able to recreate the issue to that extent right now. Test log below, will attach as well.

Okay, retesting with:

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
So 28.
Dez 14:01:05 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.15GiB devid 1 size 163.00GiB used 159.92GiB path /dev/mapper/msata-home devid 2 size 163.00GiB used 159.92GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.95GiB, used=151.84GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B Thats is just 3 GiB to reserve new data chunks from from. First run – all good: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/134.2MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10483: Sun Dec 28 14:02:59 2014 write: io=4096.0MB, bw=218101KB/s, iops=54525, runt= 19231msec clat (usec): min=4, max=20056, avg=14.87, stdev=143.15 lat (usec): min=4, max=20056, avg=15.09, stdev=143.26 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 17], | 99.00th=[ 52], 99.50th=[ 99], 99.90th=[ 434], 99.95th=[ 980], | 99.99th=[ 7968] bw (KB /s): min=62600, max=424456, per=100.00%, avg=218821.63, stdev=93695.28 lat (usec) : 10=38.19%, 20=58.83%, 50=1.90%, 100=0.59%, 250=0.33% lat (usec) : 500=0.09%, 750=0.03%, 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01% cpu : usr=15.50%, sys=61.86%, ctx=93432, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=218101KB/s, minb=218101KB/s, maxb=218101KB/s, mint=19231msec, maxt=19231msec Second run: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. 
Dez 14:04:01 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.23GiB devid 1 size 163.00GiB used 160.95GiB path /dev/mapper/msata-home devid 2 size 163.00GiB used 160.95GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.98GiB, used=151.91GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.32GiB GlobalReserve, single: total=512.00MiB, used=0.00B martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/171.3MB/0KB /s] [0/43.9K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10501: Sun Dec 28 14:05:03 2014 write: io=4096.0MB, bw=220637KB/s, iops=55159, runt= 19010msec clat (usec): min=4, max=20578, avg=14.45, stdev=160.84 lat (usec): min=4, max=20578, avg=14.65, stdev=160.88 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 7], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 17], | 99.00th=[ 42], 99.50th=[ 79], 99.90th=[ 278], 99.95th=[ 620], | 99.99th=[ 9792] bw (KB /s): min= 5, max=454816, per=100.00%, avg=224700.32, stdev=100763.29 lat (usec) : 10=38.15%, 20=58.73%, 50=2.28%, 100=0.47%, 250=0.26% lat (usec) : 500=0.06%, 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 50=0.01% cpu : usr=15.83%, sys=63.17%, ctx=74934, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=220636KB/s, minb=220636KB/s, maxb=220636KB/s, mint=19010msec, maxt=19010msec Okay, now try the same without space for a free chunk to allocate. The testfile is still there, fio doesn´t delete and recreate it on every attempt, but just writes into it: martin@merkaba:~> ls -l ssd.test.file -rw-r--r-- 1 martin martin 4294967296 Dez 28 14:05 ssd.test.file Okay – with still one chunk to allocate: merkaba:~> btrfs filesystem resize 1:161G /home Resize '/home' of '1:161G' merkaba:~> btrfs filesystem resize 2:161G /home Resize '/home' of '2:161G' merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:08:45 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.15GiB devid 1 size 161.00GiB used 159.92GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 159.92GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=154.95GiB, used=151.84GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B I would like to have it allocate the chunks by other means, but it frees them eventually afterwards again, so I did it this way. Note, we still have the original file there. The space it currently occupies is already taken into account. 
Next test: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/130.5MB/0KB /s] [0/33.5K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10563: Sun Dec 28 14:10:34 2014 write: io=4096.0MB, bw=210526KB/s, iops=52631, runt= 19923msec clat (usec): min=4, max=21820, avg=14.78, stdev=119.40 lat (usec): min=4, max=21821, avg=15.03, stdev=120.26 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 11], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 14], 95.00th=[ 17], | 99.00th=[ 62], 99.50th=[ 131], 99.90th=[ 490], 99.95th=[ 964], | 99.99th=[ 6816] bw (KB /s): min= 3, max=410480, per=100.00%, avg=216892.84, stdev=95620.33 lat (usec) : 10=33.20%, 20=63.71%, 50=1.86%, 100=0.59%, 250=0.42% lat (usec) : 500=0.12%, 750=0.03%, 1000=0.01% lat (msec) : 2=0.02%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% cpu : usr=15.13%, sys=62.74%, ctx=94346, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=210525KB/s, minb=210525KB/s, maxb=210525KB/s, mint=19923msec, maxt=19923msec Okay, this is still good. merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:11:18 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.17GiB devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.94GiB, used=151.86GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.30GiB GlobalReserve, single: total=512.00MiB, used=0.00B Now no space left to reserve additional chunks anymore. 
Another test: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/152.3MB/0KB /s] [0/38.1K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10580: Sun Dec 28 14:13:26 2014 write: io=4096.0MB, bw=225804KB/s, iops=56450, runt= 18575msec clat (usec): min=4, max=16669, avg=13.66, stdev=72.88 lat (usec): min=4, max=16669, avg=13.89, stdev=73.06 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 20], | 99.00th=[ 65], 99.50th=[ 113], 99.90th=[ 314], 99.95th=[ 506], | 99.99th=[ 2768] bw (KB /s): min= 4, max=444568, per=100.00%, avg=231326.97, stdev=93374.31 lat (usec) : 10=36.50%, 20=58.44%, 50=3.73%, 100=0.76%, 250=0.44% lat (usec) : 500=0.09%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01% cpu : usr=16.35%, sys=68.39%, ctx=127221, majf=0, minf=5 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=225803KB/s, minb=225803KB/s, maxb=225803KB/s, mint=18575msec, maxt=18575msec Okay, this still does not trigger it. Another test, it even freed some chunk: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:14:21 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 155.28GiB devid 1 size 161.00GiB used 160.85GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.85GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.89GiB, used=151.97GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B Still good: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/156.5MB/0KB /s] [0/40.6K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10589: Sun Dec 28 14:14:37 2014 write: io=4096.0MB, bw=161121KB/s, iops=40280, runt= 26032msec clat (usec): min=4, max=1228.9K, avg=15.69, stdev=1205.88 lat (usec): min=4, max=1228.9K, avg=15.92, stdev=1205.90 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 8], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 13], 80.00th=[ 14], 90.00th=[ 15], 95.00th=[ 19], | 99.00th=[ 53], 99.50th=[ 96], 99.90th=[ 366], 99.95th=[ 764], | 99.99th=[ 7776] bw (KB /s): min= 0, max=431680, per=100.00%, avg=219856.30, stdev=98172.64 lat (usec) : 10=39.24%, 20=55.83%, 50=3.81%, 100=0.63%, 250=0.33% lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 2000=0.01% cpu : usr=11.50%, sys=61.08%, ctx=123428, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: 
io=4096.0MB, aggrb=161121KB/s, minb=161121KB/s, maxb=161121KB/s, mint=26032msec, maxt=26032msec Okay, lets allocate one GiB with fallocate to make free space tighter: martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-1 0.00user 0.09system 0:00.11elapsed 86%CPU (0avgtext+0avgdata 1752maxresident)k 112inputs+64outputs (1major+89minor)pagefaults 0swaps martin@merkaba:~> ls -lh 1g-1 -rw-r--r-- 1 martin martin 1,0G Dez 28 14:16 1g merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:16:24 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 156.15GiB devid 1 size 161.00GiB used 160.94GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.94GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.97GiB, used=152.84GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.31GiB GlobalReserve, single: total=512.00MiB, used=0.00B Still not: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/132.1MB/0KB /s] [0/34.4K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10632: Sun Dec 28 14:17:12 2014 write: io=4096.0MB, bw=198773KB/s, iops=49693, runt= 21101msec clat (usec): min=4, max=543255, avg=16.27, stdev=563.85 lat (usec): min=4, max=543255, avg=16.48, stdev=563.91 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 12], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 14], 95.00th=[ 17], | 99.00th=[ 49], 99.50th=[ 98], 99.90th=[ 386], 99.95th=[ 828], | 99.99th=[10816] bw (KB /s): min= 4, max=444848, per=100.00%, avg=203909.07, stdev=109502.11 lat (usec) : 10=33.97%, 20=62.99%, 50=2.05%, 100=0.51%, 250=0.33% lat (usec) : 500=0.08%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.01%, 750=0.01% cpu : usr=14.21%, sys=60.44%, ctx=70273, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=1048576/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=4096.0MB, aggrb=198772KB/s, minb=198772KB/s, maxb=198772KB/s, mint=21101msec, maxt=21101msec Another 1G file. 
Got it: ATOP - merkaba 2014/12/28 14:18:14 ----------- 10s elapsed PRC | sys 21.74s | user 2.48s | #proc 382 | #trun 8 | #tslpi 698 | #tslpu 1 | #zombie 0 | no procacct | CPU | sys 218% | user 24% | irq 1% | idle 155% | wait 2% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 75% | user 5% | irq 0% | idle 19% | cpu003 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 59% | user 3% | irq 0% | idle 37% | cpu001 w 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 48% | user 6% | irq 1% | idle 45% | cpu000 w 1% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 36% | user 9% | irq 0% | idle 54% | cpu002 w 1% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 2.13 | avg5 2.37 | avg15 1.92 | | csw 67473 | intr 59152 | | numcpu 4 | MEM | tot 15.5G | free 1.1G | cache 11.0G | buff 0.1M | slab 740.2M | shmem 190.9M | vmbal 0.0M | hptot 0.0M | SWP | tot 12.0G | free 11.4G | | | | | vmcom 5.4G | vmlim 19.7G | PAG | scan 0 | steal 0 | stall 1 | | | | swin 19 | swout 0 | LVM | sata-home | busy 8% | read 4 | write 26062 | KiB/w 3 | MBr/s 0.00 | MBw/s 10.18 | avio 0.03 ms | LVM | msata-home | busy 5% | read 4 | write 26062 | KiB/w 3 | MBr/s 0.00 | MBw/s 10.18 | avio 0.02 ms | LVM | sata-swap | busy 0% | read 19 | write 0 | KiB/w 0 | MBr/s 0.01 | MBw/s 0.00 | avio 0.05 ms | LVM | msata-debian | busy 0% | read 0 | write 4 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | LVM | sata-debian | busy 0% | read 0 | write 4 | KiB/w 4 | MBr/s 0.00 | MBw/s 0.00 | avio 0.00 ms | DSK | sda | busy 8% | read 23 | write 13239 | KiB/w 7 | MBr/s 0.01 | MBw/s 10.18 | avio 0.06 ms | DSK | sdb | busy 5% | read 4 | write 14360 | KiB/w 7 | MBr/s 0.00 | MBw/s 10.18 | avio 0.04 ms | NET | transport | tcpi 18 | tcpo 18 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | NET | network | ipi 18 | ipo 18 | ipfrw 0 | deliv 18 | | icmpi 0 | icmpo 0 | NET | eth0 0% | pcki 2 | pcko 2 | si 0 Kbps | so 0 Kbps | erri 0 | erro 0 | drpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 2 Kbps | so 2 Kbps | erri 0 | erro 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/5 10657 - martin martin 1 9.88s 0.00s 0K 0K 0K 48K -- - R 1 99% fallocate 9685 - root root 1 9.84s 0.00s 0K 0K 0K 0K -- - D 0 99% kworker/u8:10 martin@merkaba:~> /usr/bin/time fallocate -l 1G 1g-2 ; ls -l 1g* 0.00user 59.28system 1:00.21elapsed 98%CPU (0avgtext+0avgdata 1756maxresident)k 0inputs+416outputs (0major+90minor)pagefaults 0swaps -rw-r--r-- 1 martin martin 1073741824 Dez 28 14:16 1g-1 -rw-r--r-- 1 martin martin 1073741824 Dez 28 14:17 1g-2 One minute in system CPU for this. Okay, so now another test: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:19:30 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 157.18GiB devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.94GiB, used=153.87GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.30GiB GlobalReserve, single: total=512.00MiB, used=0.00B I admit, this now really isn´t nice to it anymore, but I want to see where it starts to become an issue. 
ATOP - merkaba 2014/12/28 14:21:18 ----------- 1s elapsed PRC | sys 1.15s | user 0.16s | #proc 382 | #trun 2 | #tslpi 707 | #tslpu 1 | #zombie 0 | clones 0 | | no procacct | CPU | sys 163% | user 24% | irq 1% | idle 189% | wait 26% | | steal 0% | guest 0% | curf 3.01GHz | curscal 94% | cpu | sys 72% | user 1% | irq 0% | idle 25% | cpu001 w 1% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 41% | user 9% | irq 0% | idle 32% | cpu002 w 19% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% | cpu | sys 34% | user 10% | irq 0% | idle 53% | cpu003 w 3% | | steal 0% | guest 0% | curf 3.03GHz | curscal 94% | cpu | sys 16% | user 4% | irq 0% | idle 77% | cpu000 w 3% | | steal 0% | guest 0% | curf 3.00GHz | curscal 93% | CPL | avg1 2.37 | avg5 2.64 | avg15 2.13 | | | csw 18687 | intr 12435 | | | numcpu 4 | MEM | tot 15.5G | free 2.5G | cache 9.5G | buff 0.1M | slab 742.6M | shmem 242.8M | shrss 115.5M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M | SWP | tot 12.0G | free 11.4G | | | | | | | vmcom 5.5G | vmlim 19.7G | LVM | msata-home | busy 71% | read 28 | write 8134 | KiB/r 4 | KiB/w 3 | MBr/s 0.11 | MBw/s 31.68 | avq 13.21 | avio 0.06 ms | LVM | sata-home | busy 40% | read 72 | write 8135 | KiB/r 4 | KiB/w 3 | MBr/s 0.28 | MBw/s 31.69 | avq 41.67 | avio 0.03 ms | DSK | sdb | busy 71% | read 24 | write 6049 | KiB/r 4 | KiB/w 5 | MBr/s 0.11 | MBw/s 31.68 | avq 5.64 | avio 0.08 ms | DSK | sda | busy 38% | read 60 | write 5987 | KiB/r 4 | KiB/w 5 | MBr/s 0.28 | MBw/s 31.69 | avq 20.40 | avio 0.04 ms | NET | transport | tcpi 16 | tcpo 16 | udpi 0 | udpo 0 | tcpao 1 | tcppo 1 | tcprs 0 | tcpie 0 | udpie 0 | NET | network | ipi 16 | ipo 16 | ipfrw 0 | deliv 16 | | | | icmpi 0 | icmpo 0 | NET | lo ---- | pcki 16 | pcko 16 | si 20 Kbps | so 20 Kbps | coll 0 | erri 0 | erro 0 | drpi 0 | drpo 0 | PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/2 10459 - root root 1 0.70s 0.00s 0K 0K 0K 0K -- - D 3 100% kworker/u8:17 10674 - martin martin 1 0.20s 0.01s 0K 0K 0K 27504K -- - R 0 30% fio Okay. It is jumping between 0 IOPS and 40000 IOPS and hogging 100% with one kworker. 
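For reference, the kernel task dumps in the attached kern.log were captured roughly like this while a kworker was spinning (a sketch; sysrq must be enabled, and the log file path depends on the syslog configuration):

# Enable all sysrq functions, then dump the stack of every task ("t").
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger

# The task dumps land in the kernel ring buffer and in syslog/kern.log.
dmesg | tail -n 300
tail -n 300 /var/log/kern.log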
Still quite okay in terms of IOPS tough: martin@merkaba:~> fio ssd-test.fio rand-write: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1 fio-2.1.11 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/126.8MB/0KB /s] [0/32.5K/0 iops] [eta 00m:00s] rand-write: (groupid=0, jobs=1): err= 0: pid=10674: Sun Dec 28 14:22:16 2014 write: io=3801.3MB, bw=32415KB/s, iops=8103, runt=120083msec clat (usec): min=4, max=1809.9K, avg=83.88, stdev=3615.98 lat (usec): min=4, max=1809.9K, avg=84.10, stdev=3616.00 clat percentiles (usec): | 1.00th=[ 6], 5.00th=[ 7], 10.00th=[ 7], 20.00th=[ 8], | 30.00th=[ 9], 40.00th=[ 10], 50.00th=[ 11], 60.00th=[ 11], | 70.00th=[ 12], 80.00th=[ 13], 90.00th=[ 15], 95.00th=[ 18], | 99.00th=[ 52], 99.50th=[ 124], 99.90th=[24704], 99.95th=[30592], | 99.99th=[57088] bw (KB /s): min= 0, max=417544, per=100.00%, avg=48302.16, stdev=89108.07 lat (usec) : 10=35.61%, 20=60.17%, 50=3.17%, 100=0.47%, 250=0.27% lat (usec) : 500=0.05%, 750=0.02%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.04%, 50=0.16% lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01% lat (msec) : 2000=0.01% cpu : usr=2.37%, sys=29.74%, ctx=202984, majf=0, minf=6 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=973128/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=3801.3MB, aggrb=32415KB/s, minb=32415KB/s, maxb=32415KB/s, mint=120083msec, maxt=120083msec I stop this here now. Its interesting to see that even with: merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home So 28. Dez 14:23:11 CET 2014 Label: 'home' uuid: […] Total devices 2 FS bytes used 157.89GiB devid 1 size 161.00GiB used 160.91GiB path /dev/mapper/msata-home devid 2 size 161.00GiB used 160.91GiB path /dev/mapper/sata-home Btrfs v3.17 Data, RAID1: total=155.94GiB, used=154.59GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=4.94GiB, used=3.30GiB GlobalReserve, single: total=512.00MiB, used=0.00B I am not able to fully reproduce it. I can partly reproduce it, but it behaves way better than before. I think to go further one needs to have a look at the free space fragmentation inside the chunks. As in the beginning I had: I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with space_cache, skinny meta data extents – are these a problem? – and compress=lzo: merkaba:~> btrfs fi sh /home Label: 'home' uuid: […] Total devices 2 FS bytes used 144.41GiB devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~> btrfs fi df /home Data, RAID1: total=154.97GiB, used=141.12GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.29GiB GlobalReserve, single: total=512.00MiB, used=0.00B So I had quite some free space *within* the chunks and it still was a problem. Created attachment 162041 [details]
round of testing with really tight free space
Still not able to completely reproduce the worst-case <300 IOPS scenario where the desktop locked up.
One more note about the IOPS. I currently let fio run with:

martin@merkaba:~> cat ssd-test.fio
[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

#[seq-write]
#rw=write
#stonewall

[rand-write]
rw=randwrite
stonewall

This uses buffered I/O via read()/write() system calls, so these IOPS figures do not reflect the raw device capabilities. I specifically wanted to test through the page cache to simulate what I see with VirtualBox writing to the VDI file (i.e. dirty pages piling up and dirty_background_ratio taking effect), just like a real application. But that also means the IOPS may come out higher, because fio finishes before all of the writes have been completed to disk. So when I reach <300 IOPS with buffered writes, it means that even through the page cache BTRFS was not able to yield higher IOPS. It also means I measure write requests the way an application would issue them (unless it uses fsync() or direct I/O, which VirtualBox apparently doesn't, at least not for every request). Just wanted to make that explicit. It is basically visible in the job file from what I commented out there, but I thought I would mention it. I just tested the effect by reducing the test file to 500 MiB and the runtime to 10 seconds, and I got 98000 IOPS. So the larger test file size, but especially the runtime, forces the kernel to do actual writes due to:

merkaba:~> grep . /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_bytes:0
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:20
/proc/sys/vm/dirty_writeback_centisecs:500

(Standard values. I still see no need to tune anything here with these SSDs, not even with the 16 GiB of RAM the laptop has, as the SSDs usually easily keep up, and I'd rather wait for a change in the default value unless I am convinced of a benefit of manually adapting it in *this* case.)

With

merkaba:~> btrfs fi df /home
Data, RAID1: total=163.87GiB, used=146.92GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.94GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 150.18GiB
        devid    1 size 170.00GiB used 169.84GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 169.84GiB path /dev/mapper/sata-home
Btrfs v3.17

I had a noticeable hang during sdelete.exe -z in the Windows XP VM with the 20 GiB VDI file (Patrik on the mailing list told me the argument changed from -c to -z, as I wondered why VBoxManage modifyhd Winlala.vdi --compact did not reduce the size of the file). It was not as bad, but the desktop was easily locked for more than 5 seconds. So this also happens with larger free space *within* the chunks. Before I do the VBoxManage --compact I will now rebalance partly. So this definitely shows it can happen when BTRFS cannot reserve any new chunks anymore, yet still has *plenty* of free space within the existing data chunks. My current idea is this: from what I perceive, the worst-case issue happens when 1) BTRFS cannot reserve any new chunks from unused device space anymore, and 2) the free space in the existing chunks is highly fragmented. One of those conditions alone is not sufficient to trigger it.

This still happens with the 4.3 kernel. Had this on /home again after a long time of it working reliably.
It has not been more than 120 seconds, but easily 10-30 seconds of one kworker thread hogging one core with the Plasma desktop unresponsive, waiting for I/O. I managed to balance data chunks after vacuuming the Akonadi database and moving some data elsewhere. On btrfs balance start -dusage=95 the mouse pointer froze while I was writing a mail in KMail. I let it sit for some minutes, but it is still sitting there, mouse pointer frozen. This is still not production ready. I wouldn't use this on any customer machine without lots and lots of underprovisioning.

Debian Jessie with kernel 4.2.6-3~bpo8+2, ZoneMinder with 17 cameras, BTRFS RAID1 (managed by BTRFS) on 2x1TB SATA 7200RPM HDDs, 88% disk full, Intel Core i5-4570 (Haswell), 2x8GiB DDR3-1600 CAS9. If I use motion detection it works perfectly, but if I put all 17 cameras into continuous recording, it works, but after a couple of minutes kworker eats 100% of one CPU, loadavg spikes to 23, and little actual I/O goes to the HDDs; they go idle, the workload becomes CPU bound, stalling everything. I will try tomorrow with 80% full, but what percentage of full utilization should I use / do you recommend? The higher the better for this case... Thanks.

Tried with 80% full, same issue. iostat -x 1 shows this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   25,19    0,25    0,00   74,56

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0,00   33,00   0,00  60,00    0,00  2540,00    84,67     0,00   0,00    0,00    0,00   0,00   0,00
sdb        0,00   39,00   0,00  54,00    0,00  2540,00    94,07     0,02   0,30    0,00    0,30   0,30   1,60

Remember, 4 cores, so 25% system means kworker is eating a full CPU; there are some writes but not much disk utilization, and perf top shows a lot of bit searching in the kernel. Using skinny metadata extents, no subvolumes and no snapshots, noatime, no compression, COW everywhere except the MySQL files. /proc/mounts:

/dev/sdb3 / btrfs rw,noatime,nodiratime,space_cache,subvolid=5,subvol=/ 0 0

I think I am going to change to another FS. Though my experience with BTRFS has been great so far, this is a showstopper for this use case. I hope it can be improved some time so I could switch to it for all my machines; in the meantime I will continue using it for backups and some other non-I/O-intensive machines. Thanks a lot!!! Happy new Year!

I found that when

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 148.98GiB
        devid    1 size 170.00GiB used 170.00GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 170.00GiB path /dev/mapper/sata-home

shows that "used" matches "size" on the devices themselves, in other words when the total space of the devices is allocated in chunks, this *can* happen, no matter the percentage of allocation *inside* the chunks (which is the value shown on the "Total devices ... FS bytes used" line):

merkaba:~> btrfs fi df /home
Data, RAID1: total=163.94GiB, used=145.32GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=6.03GiB, used=3.67GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Thus in this condition it could already happen for me. So it is not so much about the overall fill rate of the device, but whether BTRFS, if need be, can still allocate another chunk. That said, in my experience so far this is a necessary, but not a sufficient condition for it to happen. So the system can run like this for quite a while, but at some point may run into the "kworker using 100% CPU of one core" issue.
For production I thus recommend keeping the device allocation into chunks *below* the available device space by balancing. But it can happen that it doesn't balance anymore, giving an out-of-space error. So another alternative is: do not use it for interactive machines yet. I mentioned in a recent thread that I think BTRFS is still not production ready and got some responses and some discussion about the issue. Chris Mason said that this issue seems to be one that needs a lot of think time to solve properly [1]. I will follow up there in a moment.

Thanks, Martin

[1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/51176

What usually works okay-ish here is something like:

btrfs balance start -dusage=80 /home

Experiment with the value; for metadata you use -m. Just running a balance without parameters gave me a "no space left on device" error again. With -dusage I again got:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 148.96GiB
        devid    1 size 170.00GiB used 161.18GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 161.18GiB path /dev/mapper/sata-home

merkaba:~> btrfs fi df /home
Data, RAID1: total=155.12GiB, used=145.31GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=6.03GiB, used=3.65GiB
GlobalReserve, single: total=512.00MiB, used=784.00KiB

merkaba:~> btrfs device usage /home
/dev/dm-0, ID: 1
   Device size:           170.00GiB
   Data,RAID1:            155.12GiB
   Metadata,RAID1:          6.03GiB
   System,RAID1:           32.00MiB
   Unallocated:             8.82GiB

/dev/dm-3, ID: 2
   Device size:           170.00GiB
   Data,RAID1:            155.12GiB
   Metadata,RAID1:          6.03GiB
   System,RAID1:           32.00MiB
   Unallocated:             8.82GiB

The important value is Unallocated. I never had a kworker thread using 100% of one core for extended periods of time while this value was above 0. Whether this can be automated in a reliable way, with some monitoring of the unallocated value plus an automatic trigger for the balance, I am not sure; it would depend on whether a fixed "usage=" value works in all cases. It would also be important to let the balance run at times without high load, or at least with ionice -c3 (btrfs balance doesn't seem to have an option for that), and even then this only works with the CFQ I/O scheduler, and even then only for the reads, not for the (buffered) writes. Anyway, if you want to discuss this, please subscribe to the BTRFS mailing list; let's use the bug report for actual findings and facts.

I think I saw this up to 4.3. I am not sure whether I saw it with 4.4, and I definitely didn't see it with 4.5. Since it depends on the state of the filesystem, it may still happen, but for now I am closing this. I will reopen if I see this again. Thanks.

As I cannot trace this to a certain code fix, I will close as no longer reproducible. (I think it would be good if Bugzilla had something more specific for this.)
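Building on the idea above, here is a minimal sketch of such an automated check, assuming a cron-driven shell script; the threshold of 0 and the -dusage=80 value are taken from this report, everything else (script name, schedule, value parsing) is hypothetical:

#!/bin/sh
# Hypothetical watchdog: if any device of the given BTRFS filesystem has no
# unallocated space left, run a filtered data balance to free whole chunks.
# Intended to be run from cron during low-load hours.
MNT=/home

# Smallest "Unallocated" value across the devices (assumes btrfs-progs that
# provide "btrfs device usage" and report the value in GiB).
UNALLOC=$(btrfs device usage "$MNT" \
    | awk '/Unallocated/ { gsub(/GiB/, "", $2); print $2 }' \
    | sort -n | head -n 1)

if awk -v u="$UNALLOC" 'BEGIN { exit !(u <= 0) }'; then
    # ionice only helps the reads under CFQ, as noted above, but cannot hurt.
    ionice -c3 btrfs balance start -dusage=80 "$MNT"
fi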