Bug 214503

Summary: System hangs with 5.14.7: trace shows bfq/blk_mq/btrfs involved
Product: IO/Storage Reporter: Grzegorz Kowal (custos.mentis)
Component: Block Layer Assignee: Paolo Valente (paolo.valente)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: high CC: amfernusus, arzeth0, git, hgkamath, jammehcow, jan.steffens, josef, ne-vlezay80, nrndda, paolo.valente, sanjay.ankur, torvalds
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.14.7 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: kernel bug trace
bug trace #2
journalctl -b output
bug trace for branch dev-bfq-on-5.12
add more checks on queue merging
kernel trace for oops in 5.12.0-bfq + additional debug patch
tentative fix: reset last_bfqq_created on group change

Description Grzegorz Kowal 2021-09-23 14:13:47 UTC
Created attachment 298931 [details]
kernel bug trace

Yesterday I updated my kernel to version 5.14.7. Since then I have been getting random system hangs with the attached kernel bug message. With 5.14.6 and earlier there were no problems.

I use Gentoo Linux on AMD FX-8350. The system is installed on Samsung NVMe 970 Pro. I use BFQ scheduler.

I am testing now the system with two suspected patches reverted:

- blkcg-fix-memory-leak-in-blk_iolatency_init.patch
- block-bfq-honor-already-setup-queue-merges.patch

I will report soon if they are responsible for the hangs.
Comment 1 James Upjohn 2021-09-24 04:19:17 UTC
Can reproduce on Arch Linux using 5.14.7-zen with BTRFS. Downgraded to 5.14.6 and haven't seen it reoccur.
Comment 2 Grzegorz Kowal 2021-09-24 11:51:00 UTC
I have applied block-bfq-honor-already-setup-queue-merges.patch back and have been running since yesterday without issues.

It would suggest that https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.14.7&id=3e8418e361775d16da4fc7d031f2173cee1d25dd introduced the issue.
Comment 3 Grzegorz Kowal 2021-09-24 15:44:55 UTC
Please disregard my comment 2. After 6 hours of uptime the system just froze again. Unfortunately, I could not get any bug trace this time. I will try to investigate the cause further.
Comment 4 Grzegorz Kowal 2021-09-24 22:39:32 UTC
Another freeze, after nearly 7 hours of uptime. This time with the kernel trace, which I am attaching. Just to be clear, this kernel was with block-bfq-honor-already-setup-queue-merges.patch reverted.
Comment 5 Grzegorz Kowal 2021-09-24 22:42:03 UTC
Created attachment 298965 [details]
bug trace #2
Comment 6 Jens Axboe 2021-09-24 22:43:50 UTC
Let's eliminate a component - can you switch to using anything but bfq as your scheduler?
Comment 7 Grzegorz Kowal 2021-09-24 23:02:12 UTC
Sure, I will change the scheduler and see if it helps. Thanks!
Comment 8 Grzegorz Kowal 2021-09-26 23:48:57 UTC
Ok, so here are the results.

Uptime of 12 hours with no hangs with patch "block-bfq-honor-already-setup-queue-merges" *reversed* and the scheduler set to *bfq*.

Uptime of 11 hours with no hangs with patch "block-bfq-honor-already-setup-queue-merges" *applied* and the scheduler set to *mq-deadline*.

After updating to 5.14.7 the hangs happened roughly within 2-3 hours of uptime.

The patch I am referring to is in the following commit:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.14.8&id=88013a0c5d9971a234afa783f2dee11c3e8675b2
Comment 9 Daniel Hyders 2021-09-28 12:04:34 UTC
I was able to reproduce this on kernel 5.14.8 (Arch Linux rolling). System hangs seem to be spontaneous and random, however frequent and disruptive.

The longest that I was able to go was 6 hours without any hangs.
Comment 10 Jens Axboe 2021-09-28 12:36:28 UTC
I've queued up a revert of this patch, as there seems to be little activity from the bfq side to get this fixed.

https://git.kernel.dk/cgit/linux-block/commit/?h=block-5.15&id=ebc69e897e17373fbe1daaff1debaa77583a5284

It'll go into 5.15-rc4 and percolate down to -stable as well.
Comment 11 amfernusus 2021-09-28 13:15:37 UTC
I started to experience the same freeze/hang after upgrading to 5.14.7 on Arch.
The freeze happens randomly, about 20-40 minutes after boot.

I also have a laptop that I updated on the same day to the same version, and it's just fine.

Happy to provide more details, but I didn't notice any crash or kernel errors in the logs.
Comment 12 amfernusus 2021-09-28 13:38:52 UTC
Created attachment 298997 [details]
journalctl -b output
Comment 13 Sait 2021-09-28 15:40:04 UTC
Experiencing similar freezes on Fedora 34 with btrfs... system freezes after a certain number of hours with 5.14 kernels but working fine with 5.13 ones.
Comment 14 Jens Axboe 2021-09-28 15:46:24 UTC
For anyone seeing freezes, please try:

# echo mq-deadline > /sys/block/nvme0n1/queue/scheduler

right after boot and run with that, then you should not see any issues. Replace nvme0n1 with whatever device is hosting your local file systems, and if there are multiple, do it for every device in use.
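
A minimal sketch of applying the same setting to every device in one go, under stated assumptions: `set_scheduler_all` is a hypothetical helper name, not an existing tool, and its optional second argument exists only so the loop can be dry-run against a copy of the directory layout. It writes to the same `queue/scheduler` file as the command above and skips devices that do not offer the requested scheduler.

```shell
#!/bin/sh
# Sketch only: set the I/O scheduler on every block device that offers it.
# set_scheduler_all is a hypothetical helper, not part of any package.
# Usage (as root, right after boot): set_scheduler_all mq-deadline
set_scheduler_all() {
    sched=$1
    root=${2:-/sys/block}   # assumption: second argument is only for dry runs
    for f in "$root"/*/queue/scheduler; do
        # Skip devices without a writable scheduler file.
        [ -w "$f" ] || continue
        # The file lists the available schedulers; skip devices that
        # do not offer the requested one.
        grep -qw -- "$sched" "$f" || continue
        echo "$sched" > "$f"
        # Show what the file reports after the write.
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

The setting does not persist across reboots, so this would have to be run again after each boot.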
Comment 15 Paolo Valente 2021-09-29 16:17:26 UTC
Unfortunately, such a freeze doesn't tell me much about the cause.

So I've pushed the most recent development version of bfq that I have. It's for 5.12.0, you can find it here:
https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12

Could you please retry with this version? It should hopefully end up with some informative OOPS.

Thanks,
Paolo
Comment 16 Sait 2021-09-29 19:51:58 UTC
To Comment 15: So, the reversal of a small patch fixing the problem does not provide any clues? I am just a Linux user, not a developer, but I spent two days struggling with system freezes on a stable kernel. Many others are doing the same. I would have expected a bit more effort and compassion from a developer. These OOPSes lead to file system damage and are not the best way to debug.
Comment 17 Grzegorz Kowal 2021-09-29 22:25:55 UTC
Created attachment 299031 [details]
bug trace for branch dev-bfq-on-5.12

With response to Comment 15:

I have compiled the bfq development version from https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 and I've got a similar freeze after a few hours. The bug trace is attached.
Comment 18 Arzet Ro 2021-09-30 10:05:42 UTC
I use linux-tkg (i.e. not upstream), Arch Linux, scsi_mod.use_blk_mq=1, zswap.enabled=1, Ryzen 2600.

```
$ uname -a
Linux arzeth-pc 5.14.7-202-tkg-pds #1 TKG SMP PREEMPT Wed, 22 Sep 2021 21:37:57 +0000 x86_64 GNU/Linux
$ uptime -p
up 1 day, 13 hours, 23 minutes
$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
$ cat /sys/block/nvme0n1/queue/scheduler 
[none] mq-deadline kyber bfq
```

What I have on /dev/sda (HDD, SATA 3.0, WD5000AAKX-001CA0):
/ (including /usr/, /etc/, /var/);
/n;
part of /home/.

/usr/share is XFS (not LUKS);
/n is Paragon's NTFS3 (27.0.0-5);
Everything else is ext4 (LUKS1).

Zero warnings in dmesg even after unsuspending.
/n is often being read by qBittorrent.

Maybe the system hang happens only if BFQ is used for SSD or just NVMe?
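
To compare setups like the one above across machines, a small sketch that prints, for each block device, whether it is rotational and which scheduler is active. `list_schedulers` is a hypothetical helper name, and its optional argument is an assumption made only so the function can be tried on a copy of the directory layout. The active scheduler is the bracketed entry in `queue/scheduler`, as in the output shown above.

```shell
#!/bin/sh
# Sketch only: report device type and active scheduler per block device.
# list_schedulers is a hypothetical helper, not part of any package.
list_schedulers() {
    root=${1:-/sys/block}   # assumption: argument is only for trying it on a copy
    for q in "$root"/*/queue; do
        [ -r "$q/scheduler" ] || continue
        dev=$(basename "$(dirname "$q")")
        # The active scheduler is the entry in square brackets.
        active=$(sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$q/scheduler")
        # 1 means rotational (HDD), 0 means non-rotational (SSD/NVMe).
        rot=$(cat "$q/rotational" 2>/dev/null)
        printf '%s rotational=%s scheduler=%s\n' \
            "$dev" "${rot:-?}" "${active:-unknown}"
    done
}
```

Calling `list_schedulers` with no argument reads from `/sys/block`.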
Comment 19 Arzet Ro 2021-09-30 10:36:33 UTC
And I have a 99% filled and constantly
used 4GB swap file (+zswap) on HDD's (BFQ) XFS partition.
So BFQ I/O scheduler must be very busy in my case,
yet not even a single warning in dmesg.

I've been successfully using 5.14.7-tkg with BFQ for 7 days,
but since I rebooted once or twice to use Windows for a little time,
my current uptime is just 38 hours.
Comment 20 Sait 2021-09-30 13:17:11 UTC
Isn't this happening only with BTRFS systems? I can confirm this because I have multiple workstations of the same model (not the same year), all with SSDs. The only ones experiencing the freeze are the ones with a BTRFS / file system (Fedora 34).
Comment 21 Sait 2021-09-30 14:38:45 UTC
PS: to be clear all workstations are running Fedora 34 with all updates. Only two have btrfs filesystem and they are the ones freezing.
Comment 22 Josef Bacik 2021-09-30 15:15:02 UTC
To be clear, this isn't a btrfs problem, we are just unlucky enough to consistently create the circumstances to cause BFQ to panic here.
Comment 23 Sait 2021-09-30 15:21:27 UTC
Sorry, I did not mean to imply that it was a btrfs problem; I was just trying to explain why some systems are experiencing this and others seem to be stable (or perhaps the freezes are too infrequent to assume they are stable). There are others who thought this had to do with Nvidia drivers and kernel 5.14.7, but they are aware of this bug now.
Comment 24 Paolo Valente 2021-09-30 15:27:37 UTC
(In reply to Grzegorz Kowal from comment #17)
> Created attachment 299031 [details]
> bug trace for branch dev-bfq-on-5.12
> 
> With response to Comment 15:
> 
> I have compiled the bfq development version from
> https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 and I've got a
> similar freeze after a few hours. The bug trace is attached.

Thank you very much! I'm analysing the code, guided by the OOPS. I'll probably get back with some tentative fix or debug patch.
Comment 25 Jens Axboe 2021-09-30 15:28:23 UTC
It's purely a BFQ problem, and a patch has already been queued up. It'll go upstream tomorrow. If someone wants to work with Paolo on debugging the issue, by all means, go for it. As far as I'm concerned, since a fix exists, we should close this one and any potential dialogue should be driven by Paolo in a new issue or email.
Comment 26 Paolo Valente 2021-09-30 15:47:44 UTC
(In reply to Jens Axboe from comment #25)
> It's purely a BFQ problem, and a patch has already been queued up. It'll go
> upstream tomorrow. If someone wants to work with Paolo on debugging the
> issue, by all means, go for it. As far as I'm concerned, since a fix exists,
> we should close this one and any potential dialogue should be driven by
> Paolo in a new issue or email.

I'm not finding such a fix. Could you please point me to it?
Comment 27 Paolo Valente 2021-09-30 15:51:56 UTC
Ah ok, it's your revert, sorry
Comment 28 Jens Axboe 2021-09-30 15:55:05 UTC
Yes, the revert. The patch has already been pinpointed by several people as being the culprit.
Comment 29 Paolo Valente 2021-09-30 16:26:49 UTC
Created attachment 299037 [details]
add more checks on queue merging

(In reply to Grzegorz Kowal from comment #17)
> Created attachment 299031 [details]
> bug trace for branch dev-bfq-on-5.12
> 
> With response to Comment 15:
> 
> I have compiled the bfq development version from
> https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 and I've got a
> similar freeze after a few hours. The bug trace is attached.

I have just made this debug patch. It should take us closer to the cause of the bug. Could you please apply it on top of my dev version of bfq and retry?
Comment 30 Grzegorz Kowal 2021-10-01 11:28:37 UTC
Well, I am not going to use my principal computer as a test bed and risk losing any data. I do regular backups, but I would have to spend hours restoring them, which I don't have. I've switched to a different scheduler and have not experienced any freezes since.

Nevertheless, I am trying to set up a VM with a configuration similar to my main computer and test your development branch, hoping it will freeze too. But it may take a few days, since I am doing this in my free time.

It seems these three components are important: NVMe, btrfs and bfq. I have set up hourly snapshots of the btrfs filesystems on my computer, but no freeze happened at the moment of taking such a snapshot. A couple of freezes happened when the computer was left alone, doing nothing. Moreover, it is impossible to get any logs after the freeze happens. I had to log in from a different computer and use dmesg -w to see what was happening.
Comment 31 amfernusus 2021-10-03 08:40:22 UTC
Just to give more information, I have a laptop with kernel 5.14.7 that doesn't experience this problem. It has btrfs on one disk with two partitions, in single mode.
Comment 32 Ganapathi Kamath 2021-10-04 05:13:59 UTC
Just reporting/confirming that fix works on fedora downstream kernel.

I have a few btrfs partitions on a laptop.
I don't have NVME, just 1 sata-SSD and 2-hdd.

I experienced freezes on many 5.14.x-x.fc35.x86_64 and on the recent 5.15.0-0.rc3.20211001git4de593fb965f.30.fc36.x86_64. 

The 5.14.9-300.fc35.x86_64, which has the patch, is stable.

I then self-built 5.15-rc3 from the src rpm with the revert patch applied, which resulted in a more stable, usable kernel. I am presently browsing on the machine booted with this kernel. It has been up for 5 hrs and in standby for 5 hrs. https://bugzilla.redhat.com/show_bug.cgi?id=2007406 

I see fix is upstreamed in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/ . So it will make it to 5.15 rc4.
Comment 33 Grzegorz Kowal 2021-10-05 23:36:47 UTC
Created attachment 299117 [details]
kernel trace for oops in 5.12.0-bfq + additional debug patch

(In reply to Paolo Valente from comment #29)
> Created attachment 299037 [details]
> add more checks on queue merging
> 
> (In reply to Grzegorz Kowal from comment #17)
> > Created attachment 299031 [details]
> > bug trace for branch dev-bfq-on-5.12
> > 
> > With response to Comment 15:
> > 
> > I have compiled the bfq development version from
> > https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 and I've got a
> > similar freeze after a few hours. The bug trace is attached.
> 
> I have just made this debug patch. It should take us closer to the cause of
> the bug. Could you please apply it on top of my dev version of bfq and retry?

Paolo, I've just got a similar freeze in QEMU with freshly installed Fedora 34 (btrfs root fs + scheduler set to BFQ) and kernel compiled from your BFQ development branch (5.12.0-bfq) with additional debug patch. The trace is attached. I hope it will help to find out the cause of these freezes.
Comment 34 Paolo Valente 2021-10-06 17:24:47 UTC
Thank you very much! Thanks to this new trace, I think I may have found the cause of the problem. I'm working on a fix. I hope you will be able to test it.
Comment 35 Paolo Valente 2021-10-09 09:39:51 UTC
Created attachment 299147 [details]
tentative fix: reset last_bfqq_created on group change

Hi,
here is a tentative fix. Please apply this patch on top of the other and retry. Thank you very much.
Comment 36 Grzegorz Kowal 2021-10-11 17:34:47 UTC
(In reply to Paolo Valente from comment #35)
> Created attachment 299147 [details]
> tentative fix: reset last_bfqq_created on group change
> 
> Hi,
> here is a tentative fix. Please apply this patch on top of the other and
> retry. Thank you very much.

Thanks for the patch. I have tested it during the weekend and it seems to solve the problem. Without this patch the idle system was crashing within two hours, sometimes even within less than an hour. With the patch applied there was no crash, even with the system running up to 9 hours. I have tested this system with and without the patch a few times, with the idle and stressed system. As I said earlier, the system is Fedora 34 installed in QEMU with NVMe block device emulation, BFQ scheduler, and kernel compiled from the development branch of BFQ.
Comment 37 Paolo Valente 2021-10-15 13:36:48 UTC
Thank you very much! I'm about to post the patch for mainline, with your Tested-by.
Comment 38 Grzegorz Kowal 2021-10-18 10:52:12 UTC
(In reply to Paolo Valente from comment #37)
> Thank you very much! I'm about to post the patch for mainline, with your
> Tested-by.

Thanks!