Bug 206767

Summary: Bcache show discrepancy between writeback_debug_ratio and priority_stats for amount of dirty data
Product: Drivers Reporter: Benjamin Allot (benjamin.allot)
Component: OtherAssignee: drivers_other
Status: NEW ---    
Severity: high CC: colyli, dongdong.tao, trent, zakharov.a.g
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.15.18 Tree: Mainline
Regression: No

Description Benjamin Allot 2020-03-05 23:20:22 UTC
Hello,

The kernel is a ubuntu one but it's a derivative of the 4.15.18 without any bcache patch.

We're using extensively bcache but we have an issue with a server which shows I/O wait >= 1 seconds for both the caching device (a nvme disks) and the backing device most of the time.

We noticed that every 45-55 minutes, the I/O wait was going down briefly and it is corresponding to the time the GC is triggered.

I manually triggered it to confirm.

What is interesting is that there is a discrepancy between the amount of dirty data as reported by writeback_rate_debug doesn't add up with the percent of Dirty bucket, as seen by the priority_stats

To summarize, here is the current status when the bug is triggered:
 
* dirty page is 149.5 Gb  [0]
* writeback_percent is 40
* SSD disks size is 1.6 Tb
* percent of dirty data is 70 % [1]

As seen in the writeback.h file [2], there is a cutoff writeback sync that is triggered at 70%. Therefore, the explanation of the I/O wait is simple, the cache is constantly using writeback on the backing device because the device thinks it's at 70% of dirty page.

However, since the writeback_percent is 40, and the SSD size is 1.6Tb, the amount of dirty data should be around 655 Gb before reaching the 40%, triggering  the writeback.

The priority_stats is counting the number of dirty buckets according to sysfs.c [3].

The number of buckets does match the SSD size [4].

What would explain this strange percentage in the priority_stats ?

Regards,

[0]: https://paste.ubuntu.com/p/Rndk3xqvq8/
[1]: https://paste.ubuntu.com/p/HnwWYCrPf2/
[2]: https://github.com/torvalds/linux/blob/v4.15/drivers/md/bcache/writeback.h
[3]: https://github.com/torvalds/linux/blob/v4.15/drivers/md/bcache/sysfs.c
[4]: https://paste.ubuntu.com/p/SfcPkWnKBt/
Comment 1 Coly Li 2020-03-11 12:38:42 UTC
There are quite a few problems in pure Linux v4.15 (indeed for stable v4.15 too). I suggest to try Linux v5.3+ kernel (preferred to Linux v5.5) and see whether the situation could be better.

Run 'git log --oneline v4.15..v5.5 drivers/md/bcache/' will display 240 patches since v4.15 to v5.5, most of them are bug fixes. There are two options to do,
1) backport most of bcache patches to your kernel.
2) upgrade to latest stable kernel, which is v5.5 at this moment.

Thanks.

Coly Li
Comment 2 Trent Lloyd 2020-03-17 06:35:11 UTC
I looked into this situation as I have seen similar in multiple environments some are just plain XFS on bcache others are ceph on bcache. One of the main things I noticed other than lower than expected performance was the the w_await from iostat for the bcache devices tended to equal that of the backing device, rather than the cache device, implying that most writes were being bypassed.

It's worth noting up front that most of these environments have multiple backing devices in a single cache set (may in some cases act different to 1-1 setups). sequential_cutoff=0, congested_read_threshold_us=0, congested_write_threshold_us=0. So those should not be interfering.

I think I have found something interesting in that there seems to be a mismatch between /sys/fs/bcache/UUID/cache_available_percent (and also the value from priority_stats for Dirty) and the SUM of /sys/fs/bcache/UUID/bdev*/dirty_data. In many cases the Dirty percent times the size of the cache was 2-4x the sum of the backing device's dirty_data value. With a backing device writeback_percent of 30 or 40% (and a workload that has the dirty_data staying relatively constant, it's keeping pace with the writebck_rate scaling) the cache devices have a Dirty value that gets stuck mostly on 70% and potentially triggers the sync writes scenario.

I found that if I run fio against a raw bcache device using a zipf random distribution (which writes a large percentage of the writes to the same blocks, and a long tail of writes to different blocks) then I can cause this same situation. The following command will write to the bcache device in a 16GB window with such a distribution:
fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=4k --numjobs=1 --rwmixread=0 --size=16G --iodepth=64 --random_distribution=zipf --filename=/dev/bcache0 

When I do this, it submits 16GB of I/O but the backing device dirty_data only shows 1.8GB (as that is the total number of blocks modified, but some of them many times). However the cache Dirty percentage shows 12% x 128GB = 15.36GB or roughly the size of the actual writes. Using trigger_gc sometimes causes this percentage to drop a tiny little bit but not much.

If you do this with more data or more backing devices, you can hit the 70% cache threshold without even hitting the writeback_percent writeback and so writeback continues at the minimum rate (as a side note on 5.5.9, the writeback code doesn't seem to be working right, need to look at that further, or I'm misunderstanding something, when the writeback_rate is scaled because we exceed dirty_target or even when idle and the rate changes to 0.9T I still see a constant minimum ~500 w/s / 2MB/s - never anything higher).

To contrast the above doing the same FIO test but using doing a random write through the 16GB to all 16GB you get a backing device dirty_data=15.9G and the same cache device Dirty 12%. Additionally after a while at the minimum rate we were down to 8.9GB dirty_Data but even after a trigger_gc the Dirty% is still 12%.

I also seem to be able to get a 4% dirty cache (5GB) by just writing a 1MB chunk of the bcache with 0s over and over again for a while (literally dd in a loop).. though that immediately disappears as soon as i do a trigger_gc unlike other write patterns. Where even after 30-60 mins of 2MB/s writeback and a manual trigger_gc the Dirty cache percentage stays mostly static. On another test with 2 backing devices one with a 16G zipf random write, and the other with an 8GB totally random write .. had after an hour-ish of writeback 3G dirty_data for each backing device but still 18% Dirty (the peak value), 0% clean and 80% Unused even after a trigger_gc.

I can reproduce all of the above on the latest Ubuntu Bionic 4.15.0-70, Ubuntu Bionic HWE 5.3.0-40 and Ubuntu Mainline PPA 5.5.9.


I do not currently have a working understand of how bcache works internally and I can imagine a potential need with logging etc for bcache to have to cache writes to the same block multiple times at least temporarily but it seems in practice these don't clear out very quickly if at all resulting in a higher than expected dirty percentage. And I can also imagine a scenario that due to sitting on the writeback_percent limit that it literally never or takes a very long time to get around to writing out some areas of the cache and may be paying a penalty for those multiple overwrites for a long time.

Though as a general item there does seem to be a disconnect between the cache device 70% dirty causing synchronous write operations but writeback operations being driven entirely based on the dirty_data of the backing devices. Even if they should be connected values in practice.

Any input on my findings above would be good, including any holes in my testing or understanding -- I'm sure that such a simplistic test can miss various tunables etc that avoid such a test and wouldn't apply to more realistic I/O scenarios. But the symptoms do seem to match the production environments so that is a start at least.


Finally as an example in practice with a cluster of 36 backing devices.. each machine having 4x2T backing devices and 1x1T cache device I had to set writeback_percent=10 (giving dirty_data ~22G x 4) to get ~50% (500G) of the cache consistently dirty. With writeback_percent=30 (giving dirty_data ~45-60G depending on the exact disk as some had data inflow rates lower than the minimum writeback) almost most caches had a Dirty of 70% constantly and cache_available_percent was often high 20s. This is a Ceph Bluestore environment. And for clarity the FIO testing above is to raw bcache devices (no Ceph). This environment has Ubuntu 4.15.0-88 kernel which does have a bunch of bcache patches backported (including multiple that seemed possibly like they would be related) though not all of them. Hence I tested the above test scenario mainly on 5.5.9 but also confirmed that the same happened on the other kernels.

Any input would be appreciated!
Comment 3 Aleksei Zakharov 2020-08-27 11:32:38 UTC
Hi everyone, we see the same issue with a bcache devices.

Symptoms are high latency of bcache device (the same as backend HDD device) and the same number of IO/s completed by HDD and bcache device.

Tracing during normal operations shows only one code path for bcache_write tracepoint with writeback==0 && bypass==0:
root@hostname:zaharov# bpftrace -e 't:bcache:bcache_write / args->writeback == 0 && args->bypass == 0 /{ @[kstack()]=count() } i:s:5 {exit();}'
Attaching 2 probes...

@[
    bch_data_insert+186
    cached_dev_read_done+377
    process_one_work+478
    worker_thread+50
    kthread+289
    ret_from_fork+31
]: 1194

And two code paths when the problem occurs:
root@hostname:bcache19# bpftrace -e 't:bcache:bcache_write / args->writeback == 0 && args->bypass == 0 /{ @[kstack()]=count() } i:s:5 {exit();}'
Attaching 2 probes...

@[
    bch_data_insert+186
    cached_dev_read_done+377
    process_one_work+478
    worker_thread+50
    kthread+289
    ret_from_fork+31
]: 48
@[
    bch_data_insert+186
    cached_dev_make_request+2821
    generic_make_request+292
    submit_bio+115
    blkdev_direct_IO+943
    generic_file_direct_write+166
    __generic_file_write_iter+190
    blkdev_write_iter+168
    aio_write+281
    do_io_submit+877
    sys_io_submit+16
    do_syscall_64+115
    entry_SYSCALL_64_after_hwframe+61
]: 1111


Percent of dirty data differs between writeback_rate_debug and priority_stats:
root@hostname:bcache19# cat ./bcache/writeback_rate_debug
rate:           4.0k/sec
dirty:          19.6G
target:         20.0G
proportional:   -11.5M
integral:       -24.5k
change:         0.0k/sec
next io:        -21ms
root@hostname:bcache19# cat ./bcache/cache/cache_available_percent
29
root@hostname:bcache19# cat ./bcache/cache/cache0/priority_stats
Unused:         0%
Clean:          29%
Dirty:          70%
Metadata:       0%
Average:        1114
Sectors per Q:  13136960
Quantiles:      [27 59 113 201 304 369 424 496 571 626 689 752 811 872 938 1004 1070 1138 1210 1284 1370 1456 1531 1620 1725 1838 1973 2116 2369 2609 2687]
root@hostname:bcache19# cat ./bcache/cache/cache0/nbuckets
411100
root@hostname:bcache19# cat ./bcache/cache/cache0/bucket_size
512.0k

It looks like there's a difference between stripe_sectors_dirty count and keys marked with GC_MARK_DIRTY if i'm not mistaken.

As a workaround it helps to set dirty ratio to a lower value and wait for some data to migrate.
Comment 4 Dongdong Tao 2020-10-16 09:42:45 UTC
I've also met this issue, Here is my root cause analysis:

Bcache is composed of multiple buckets, one bucket by default is 512k, and one bucket is composed by multiple sectors, one sector is 512 bytes,
so we can think of one bucket containing 1024 sectors. So a bucket will be viewed as dirty even if there is only 1 sector that is dirty. If we get a
situation that we have about 70 percent buckets who contains 10 percent dirty data, which means one bucket would have average 15 percent
sectors are dirty, this is a bucket highly fragmented situation, and it’s the reason that we can observe only 10 percent dirty_data while the
cache_available_percent is just 30 percent. If we look at the code, we can find that bcache calculates the ‘cache_available_percent’ based on the
number of dirty buckets, but it calculates the ‘dirty_data’ based on the number of dirty sectors. The problem is that bcache only checks dirty
buckets ratio instead of dirty data ratio to determine if there is enough space in the cache device and the write will have to go directly to the
backing disk if there is not enough space.

The code[1] and [2] show where bcache does this kind check, and the variable ‘dc->disk.c->gc_stats.in_use’ means current dirty buckets.
I don’t think it makes sense that bcache is using dirty_buckets here instead of dirty_data as a ratio to determine if cache device still have enough space.

As to why it would cause bcache buckets into such a highly fragmented situation, bcache writes the bucket one by one, it will only try to allocate a
new bucket until the old buckets are all consumed, so only the writes to the cache device won’t cause fragmentation, but My current theory is this
is caused by writeback process, because writeback process is writing back those dirty data in the order of the key in btree (it will scan the btree to
get the dirty keys), so it’s not flushing in the order of bucket by bucket, hence by using this writeback mechanism, it is inevitable to cause such a
high fragmented situation.

Followed this theory, I’m able to reproduce this issue with below steps:
1. Use below fio command to keep writing small data (4k is better as can make the bucket more easily to fragmented, which can reproduce it
always, I use 32k is because it’s faster) to a bcache device and make the dirty_data to 70, at that time we can see the IOPS dropped from
8000 to about less than 100, and the cache_available_percent is 30 also.

fio --name=random-writers --filename=/dev/bcache0 --ioengine=libaio --iodepth=4 --rw=randwrite --bs=32k --direct=1 --numjobs=4

2. Then just wait for the dirty data to write back until the dirty_data is around 10%, but you will see the cache_available_percent is still only 30.

I’m able to reproduce this issue in 4.15 kernel and also in 5.3 kernel, this is likely an issue still in the mainline

It would be highly appreciated to get some input here
Thanks,
Dongdong
Comment 5 Aleksei Zakharov 2020-10-16 09:55:40 UTC
(In reply to Dongdong Tao from comment #4)
> I've also met this issue, Here is my root cause analysis:
> 
> Bcache is composed of multiple buckets, one bucket by default is 512k, and
> one bucket is composed by multiple sectors, one sector is 512 bytes,
> so we can think of one bucket containing 1024 sectors. So a bucket will be
> viewed as dirty even if there is only 1 sector that is dirty. If we get a
> situation that we have about 70 percent buckets who contains 10 percent
> dirty data, which means one bucket would have average 15 percent
> sectors are dirty, this is a bucket highly fragmented situation, and it’s
> the reason that we can observe only 10 percent dirty_data while the
> cache_available_percent is just 30 percent. If we look at the code, we can
> find that bcache calculates the ‘cache_available_percent’ based on the
> number of dirty buckets, but it calculates the ‘dirty_data’ based on the
> number of dirty sectors. The problem is that bcache only checks dirty
> buckets ratio instead of dirty data ratio to determine if there is enough
> space in the cache device and the write will have to go directly to the
> backing disk if there is not enough space.
Thank you!
> 
> The code[1] and [2] show where bcache does this kind check, and the variable
> ‘dc->disk.c->gc_stats.in_use’ means current dirty buckets.
> I don’t think it makes sense that bcache is using dirty_buckets here instead
> of dirty_data as a ratio to determine if cache device still have enough
> space.
Yep, I looked at this part of code too. And I'm not sure why bcache counts buckets here. It would be great, if someone understands the reason.
> ...
> 
> It would be highly appreciated to get some input here
> Thanks,
> Dongdong
Comment 6 Benjamin Allot 2020-10-19 09:38:13 UTC
Thanks a lot for your replies and analysis.

Especially the fragmentation explanation and the bpftrace lines.

Sadly, the available kernel I had on production were 4.4 only so I couldn't record the issue lately, but I

I had found that just writing a huge file with a 64 BS [0] was enough, since it triggered was constantly writing back once the writeback_percent is reached, causing enough fragmentation

I found particularly interesting what is causing the fragmentation though.

[0]: https://pastebin.ubuntu.com/p/YnnvvSRhXK/
Comment 7 Dongdong Tao 2020-11-04 04:59:14 UTC
I have a patch that is under review:
https://marc.info/?l=linux-bcache&m=160441418209114&w=1