Bug 196675

Summary: BFQ scheduler hangs system
Product: IO/Storage Reporter: Vladimir Lomov (lomov.vl)
Component: Block Layer Assignee: Paolo Valente (paolo.valente)
Status: NEW ---    
Severity: high CC: hvtaifwkbgefbaei
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.12 Subsystem:
Regression: No Bisected commit-id:
Attachments: This is copy-pasted from terminal (output of journalctl -k -f)
This is collected using netconsole

Description Vladimir Lomov 2017-08-16 06:23:59 UTC
Created attachment 257941 [details]
This is copy-pasted from terminal (output of journalctl -k -f)

I'm using Arch Linux x86_64, which ships Linux kernel 4.12.7. I enable the BFQ scheduler with kernel parameters and a udev rule:

[grub, kernel parameter]
systemd.unified_cgroup_hierarchy=1 scsi_mod.use_blk_mq=1

[udev rule]
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"

After boot I checked that the disks use the bfq scheduler:

$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
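
The same check can be scripted for all disks at once. The following is only a minimal sketch: it assumes the sysfs layout shown above, where the active scheduler is the entry in square brackets, and the helper names `active_scheduler` and `report` are made up for illustration.

```python
from pathlib import Path
from typing import Dict, Optional


def active_scheduler(line: str) -> Optional[str]:
    """Return the scheduler name enclosed in [brackets], or None if there is none."""
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def report(sys_block: str = "/sys/block") -> Dict[str, Optional[str]]:
    """Map each sd* disk found under sys_block to its active scheduler."""
    result: Dict[str, Optional[str]] = {}
    for sched_file in sorted(Path(sys_block).glob("sd*/queue/scheduler")):
        # .../sda/queue/scheduler -> the disk name is two levels up
        disk = sched_file.parent.parent.name
        result[disk] = active_scheduler(sched_file.read_text())
    return result


if __name__ == "__main__":
    for disk, sched in report().items():
        print(f"{disk}: {sched}")
```

Run after boot, this prints one line per disk, so a disk that the udev rule did not match is easy to spot.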

I then start a systemd-nspawn instance whose files are located on a disk governed by bfq. This instance runs Yandex.Disk to synchronize files and directories. A few seconds later the system hangs. I was able to capture kernel messages, though this does not succeed every time.

If I use another scheduler, kyber for example, all works fine.

Let me know if more details are needed.

P.S. I have already filed a bug report on the distro bug tracker.

---
WBR, Vladimir Lomov
Comment 1 Vladimir Lomov 2017-08-16 06:24:55 UTC
I forgot to add the URL of the distro bug report (it has some more details):
https://bugs.archlinux.org/task/55149
Comment 2 Vladimir Lomov 2017-08-16 06:28:18 UTC
Created attachment 257943 [details]
This is collected using netconsole

This data is taken from netconsole. I started the systemd-nspawn instance several times to try to get something meaningful, but when the kernel gets stuck nothing is printed. After 5 minutes the watchdog restarts the system. I was finally able to obtain something useful when I started a second systemd-nspawn instance and then the first one.
Comment 3 Jens Axboe 2017-08-16 14:49:17 UTC
I have forwarded this report to Paolo; it doesn't look like he has a Bugzilla account.
Comment 4 Vladimir Lomov 2017-08-22 13:14:41 UTC
A kernel with the patches from the following messages

1. https://www.spinics.net/lists/linux-block/msg14113.html
2. https://www.spinics.net/lists/linux-block/msg14303.html
3. https://www.spinics.net/lists/linux-block/msg15016.html
4. https://www.spinics.net/lists/linux-block/msg15222.html
5. https://www.spinics.net/lists/linux-block/msg15514.html
6. https://www.spinics.net/lists/linux-block/msg15516.html
7. https://www.spinics.net/lists/linux-block/msg15626.html
8. https://www.spinics.net/lists/linux-block/msg15625.html
9. https://www.spinics.net/lists/linux-block/msg16172.html

works fine.

I didn't try to find the exact patch that solves the problem (sorry, I don't have time right now). I just searched the linux-block mailing list and collected all patches related to BFQ (except the trivial ones and the module aliasing one).

As I understand it, all these patches will be in kernel 4.14, so I think this bug can be closed once 4.14 is released.
Comment 5 Sami Farin 2017-09-23 06:47:10 UTC
Vladimir, do you use DM and/or LUKS?

I have XFS, SLUB, LUKS (for root and media partitions). I started testing bfq with the 4.12.5 kernel. The system just hangs after anywhere from 5 minutes to one day, and I have to power cycle. I have tried 4.12.5, 4.12.6, 4.12.7, 4.12.8, 4.12.10, 4.12.11, 4.12.13 and 4.12.14. With 4.12.12 I got 11 days of uptime when I used scsi_mod.use_blk_mq=N.
The 4.9 series was very stable (it didn't have bfq :-P ).

Unfortunately I don't have any logs: I run Xorg, and after a reboot there is nothing in the logs about the crash.

On the next reboot I will check whether kyber is more stable...
Comment 6 Vladimir Lomov 2017-09-24 04:24:39 UTC
Hello Sami Farin,

No, I don't use DM or LUKS.

I partially resolved my BFQ problems using patches published on linux-block (I hope they will be in 4.14), but I still sometimes have problems with recent kernels (4.12.12, 4.12.13, 4.12.14). From time to time some of my systems hang and reboot (thanks to the watchdog), but this happens rarely, so it is difficult to get logs. To get kernel logs I set up netconsole on my hosts, but I caught only two hangs, and only one of them is related to bfq.

If you want to check kernel 4.9, you may try the CK patches (I use linux-ck, but until 4.12 BFQ was not in the mainline kernel).
Comment 7 Paolo Valente 2021-11-09 15:26:49 UTC
These stable kernels do not contain the fixes made in the meantime. Is this report still relevant? If so, please tell me the actual trees that suffer from these hangs, so that I can check which fixes are still missing in those kernels.
Comment 8 Sami Farin 2021-11-12 11:28:56 UTC
I don't know, but the use_blk_mq parameter is not used anymore (at least in 5.10):

$ grep -B1 use_blk_mq drivers/md/dm-rq.c
/* Unused, but preserved for userspace compatibility */
static bool use_blk_mq = true;
module_param(use_blk_mq, bool, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(use_blk_mq, "Use block multiqueue for request-based DM devices");
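
On a running system, readable module parameters are exposed under /sys/module/<module>/parameters/, so the declared value can also be read without the source tree. The sketch below is only illustrative: the helper name `read_module_param` is made up, and the module names passed to it (dm_mod, scsi_mod) are assumptions about where the parameter might live on a given kernel.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sys_module: str = "/sys/module") -> Optional[str]:
    """Return the parameter value as text, or None if it is absent or unreadable."""
    path = Path(sys_module) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded or built in without parameters, parameter not
        # declared on this kernel, or the file is not readable.
        return None


if __name__ == "__main__":
    # Module names are assumptions; adjust them for the kernel being checked.
    for module in ("dm_mod", "scsi_mod"):
        print(module, read_module_param(module, "use_blk_mq"))
```

Note that a parameter being present in sysfs only shows that it is still declared. As the comment in dm-rq.c above indicates, a declared parameter can be kept purely for userspace compatibility and have no effect.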