Bug 214503
Summary: | System hangs with 5.14.7: trace shows bfq/blk_mq/btrfs involved | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Grzegorz Kowal (custos.mentis) |
Component: | Block Layer | Assignee: | Paolo Valente (paolo.valente) |
Status: | RESOLVED PATCH_ALREADY_AVAILABLE | ||
Severity: | high | CC: | amfernusus, arzeth0, git, hgkamath, jammehcow, jan.steffens, josef, ne-vlezay80, nrndda, paolo.valente, sanjay.ankur, torvalds |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.14.7 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
kernel bug trace
bug trace #2
journalctl -b output
bug trace for branch dev-bfq-on-5.12
add more checks on queue merging
kernel trace for oops in 5.12.0-bfq + additional debug patch
tentative fix: reset last_bfqq_created on group change |
Can reproduce on Arch Linux using 5.14.7-zen with BTRFS. Downgraded to 5.14.6 and haven't seen it reoccur.

I have applied block-bfq-honor-already-setup-queue-merges.patch back and have been running since yesterday without issues. That would suggest that https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.14.7&id=3e8418e361775d16da4fc7d031f2173cee1d25dd introduced the issue.

Please disregard my comment 2. After 6 hours of uptime the system just froze again. Unfortunately, I could not get any bug trace this time. I will try to investigate the cause further.

Another freeze, after nearly 7 hours of uptime. This time with the kernel trace, which I am attaching. Just to be clear, this kernel had block-bfq-honor-already-setup-queue-merges.patch reverted.

Created attachment 298965 [details]
bug trace #2
Let's eliminate a component - can you switch to using anything but bfq as your scheduler?

Sure, I will change the scheduler and see if it helps. Thanks!

Ok, so here are the results. Uptime of 12 hours with no hangs with patch "block-bfq-honor-already-setup-queue-merges" *reverted* and the scheduler set to *bfq*. Uptime of 11 hours with no hangs with patch "block-bfq-honor-already-setup-queue-merges" *applied* and the scheduler set to *mq-deadline*. After updating to 5.14.7 the hangs happened roughly within 2-3 hours of uptime. The patch I am referring to is in the following commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.14.8&id=88013a0c5d9971a234afa783f2dee11c3e8675b2

I was able to reproduce this on kernel 5.14.8 (Arch Linux rolling). System hangs seem to be spontaneous and random, but frequent and disruptive. The longest that I was able to go without any hangs was 6 hours.

I've queued up a revert of this patch, as there seems to be little activity from the bfq side to get this fixed. https://git.kernel.dk/cgit/linux-block/commit/?h=block-5.15&id=ebc69e897e17373fbe1daaff1debaa77583a5284 It'll go into 5.15-rc4 and percolate down to -stable as well.

I started to experience the same freeze/hang after upgrading to 5.14.7 on Arch. The freeze happens randomly, about 20-40 minutes after boot. I also have a laptop that I updated on the same day to the same version, and it is just fine. Happy to provide more details, but I didn't notice any crash or kernel errors in the logs.

Created attachment 298997 [details]
journalctl -b output
Experiencing similar freezes on Fedora 34 with btrfs... The system freezes after a certain number of hours with 5.14 kernels but works fine with 5.13 ones.

For anyone seeing freezes, please try:

    # echo mq-deadline > /sys/block/nvme0n1/queue/scheduler

right after boot and run with that, then you should not see any issues. Replace nvme0n1 with whatever device is hosting your local file systems, and if there are several, do it for every device in use.

Unfortunately, such a freeze doesn't tell me much about the cause. So I've pushed the most recent development version of bfq that I have. It's for 5.12.0, you can find it here: https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 Could you please retry with this version? It should hopefully end up with some informative OOPS. Thanks, Paolo

To Comment 15: So, the reversal of a small patch fixing the problem does not provide any clues? I am just a Linux user, not a developer, but I spent two days struggling with system freezes on a stable kernel. Many others are doing the same. I would have expected a bit more effort and compassion from a developer. These OOPSes lead to file system damage and are not the best way to debug.

Created attachment 299031 [details]
bug trace for branch dev-bfq-on-5.12

With response to Comment 15: I have compiled the bfq development version from https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 and I've got a similar freeze after a few hours. The bug trace is attached.

I use linux-tkg (i.e. not upstream), Arch Linux, scsi_mod.use_blk_mq=1, zswap.enabled=1, Ryzen 2600.

```
$ uname -a
Linux arzeth-pc 5.14.7-202-tkg-pds #1 TKG SMP PREEMPT Wed, 22 Sep 2021 21:37:57 +0000 x86_64 GNU/Linux
$ uptime -p
up 1 day, 13 hours, 23 minutes
$ cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none
$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
```

What I have on /dev/sda (HDD, SATA 3.0, WD5000AAKX-001CA0): / (including /usr/, /etc/, /var/); /n; part of /home/.
/usr/share is XFS (not LUKS); /n is Paragon's NTFS3 (27.0.0-5); everything else is ext4 (LUKS1). Zero warnings in dmesg even after unsuspending. /n is often being read by qBittorrent. Maybe the system hang happens only if BFQ is used for SSD or just NVMe? And I have a 99% filled and constantly used 4GB swap file (+zswap) on the HDD's (BFQ) XFS partition. So the BFQ I/O scheduler must be very busy in my case, yet there is not even a single warning in dmesg. I've been successfully using 5.14.7-tkg with BFQ for 7 days, but since I rebooted once or twice to use Windows for a little time, my current uptime is just 38 hours.

Isn't this happening with BTRFS systems? I can confirm this because I have multiple workstations of the same model (not the same year), all with SSDs. The only ones experiencing the freeze are the ones with a BTRFS / file system (Fedora 34).

PS: to be clear, all workstations are running Fedora 34 with all updates. Only two have a btrfs filesystem and they are the ones freezing.

To be clear, this isn't a btrfs problem, we are just unlucky enough to consistently create the circumstances to cause BFQ to panic here.

Sorry, I did not mean to imply that it was a btrfs problem; I was just trying to explain why some systems are experiencing this and others seem to be stable (perhaps too infrequent to assume they are stable). There are others who thought this had to do with Nvidia drivers and the kernel 5.14.7, but they are aware of this bug now.

(In reply to Grzegorz Kowal from comment #17)
> Created attachment 299031 [details]
> bug trace for branch dev-bfq-on-5.12
>
> With response to Comment 15:
>
> I have compiled the bfq development version from
> https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 and I've got a
> similar freeze after a few hours. The bug trace is attached.

Thank you very much! I'm analysing the code, guided by the OOPS. I'll probably get back with some tentative fix or debug patch.

It's purely a BFQ problem, and a patch has already been queued up.
It'll go upstream tomorrow. If someone wants to work with Paolo on debugging the issue, by all means, go for it. As far as I'm concerned, since a fix exists, we should close this one and any potential dialogue should be driven by Paolo in a new issue or email.

(In reply to Jens Axboe from comment #25)
> It's purely a BFQ problem, and a patch has already been queued up. It'll go
> upstream tomorrow. If someone wants to work with Paolo on debugging the
> issue, by all means, go for it. As far as I'm concerned, since a fix exists,
> we should close this one and any potential dialogue should be driven by
> Paolo in a new issue or email.

I'm not finding such a fix. Could you please point me to it?

Ah ok, it's your revert, sorry

Yes, the revert. The patch has already been pinpointed by several people as being the culprit.

Created attachment 299037 [details]
add more checks on queue merging

(In reply to Grzegorz Kowal from comment #17)
> Created attachment 299031 [details]
> bug trace for branch dev-bfq-on-5.12
>
> With response to Comment 15:
>
> I have compiled the bfq development version from
> https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 and I've got a
> similar freeze after a few hours. The bug trace is attached.

I have just made this debug patch. It should take us closer to the cause of the bug. Could you please apply it on top of my dev version of bfq and retry?

Well, I am not going to use my principal computer as a test bed and risk losing any data. I do regular backups, but I would have to spend hours restoring them, hours which I don't have. I've switched to a different scheduler and have not experienced any freeze since. Nevertheless, I am trying to set up a VM with a configuration similar to my main computer's and test your development branch, hoping it will freeze too. But it may take a few days, since I am doing this in my free time.

It seems these three components are important: NVMe, btrfs and bfq.
I have set up hourly snapshots of the btrfs filesystems on my computer, but no freeze happened at the moment of taking such a snapshot. A couple of freezes happened when the computer was left alone, doing nothing. Moreover, it is impossible to get any logs after the freeze happens. I had to log in from a different computer and use dmesg -w to see what was happening.

Just to give more information, I have a laptop with kernel 5.14.7 that doesn't experience this problem; it has btrfs with one disk with two partitions in single.

Just reporting/confirming that the fix works on the Fedora downstream kernel. I have a few btrfs partitions on a laptop. I don't have NVMe, just 1 SATA SSD and 2 HDDs. I experienced freezes on many 5.14.x-x.fc35.x86_64 kernels and on the recent 5.15.0-0.rc3.20211001git4de593fb965f.30.fc36.x86_64. The 5.14.9-300.fc35.x86_64, which has the patch, is stable. I then self-built 5.15rc3 from the src rpm and applied the revert patch, which resulted in a more stable, usable kernel. I am presently browsing on a machine booted using this kernel. It has been up for 5 hrs, in standby for 5 hrs. https://bugzilla.redhat.com/show_bug.cgi?id=2007406

I see the fix is upstreamed in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/ . So it will make it to 5.15 rc4.

Created attachment 299117 [details]
kernel trace for oops in 5.12.0-bfq + additional debug patch

(In reply to Paolo Valente from comment #29)
> Created attachment 299037 [details]
> add more checks on queue merging
>
> (In reply to Grzegorz Kowal from comment #17)
> > Created attachment 299031 [details]
> > bug trace for branch dev-bfq-on-5.12
> >
> > With response to Comment 15:
> >
> > I have compiled the bfq development version from
> > https://github.com/Algodev-github/bfq-mq/tree/dev-bfq-on-5.12 and I've got a
> > similar freeze after a few hours. The bug trace is attached.
>
> I have just made this debug patch. It should take us closer to the cause of
> the bug.
> Could you please apply it on top of my dev version of bfq and retry?

Paolo, I've just got a similar freeze in QEMU with freshly installed Fedora 34 (btrfs root fs + scheduler set to BFQ) and kernel compiled from your BFQ development branch (5.12.0-bfq) with additional debug patch. The trace is attached. I hope it will help to find out the cause of these freezes.

Thank you very much! Thanks to this new trace, I think I could have found the cause of the problem. I'm working on a fix. I hope you will be able to test it.

Created attachment 299147 [details]
tentative fix: reset last_bfqq_created on group change
Hi,
here is a tentative fix. Please apply this patch on top of the other and retry. Thank you very much.
(In reply to Paolo Valente from comment #35)
> Created attachment 299147 [details]
> tentative fix: reset last_bfqq_created on group change
>
> Hi,
> here is a tentative fix. Please apply this patch on top of the other and
> retry. Thank you very much.

Thanks for the patch. I have tested it during the weekend and it seems to solve the problem. Without this patch the idle system was crashing within two hours, sometimes even within less than an hour. With the patch applied there was no crash, even with the system running up to 9 hours. I have tested this system with and without the patch a few times, with the idle and stressed system. As I said earlier, the system is Fedora 34 installed in QEMU with NVMe block device emulation, BFQ scheduler, and kernel compiled from the development branch of BFQ.

Thank you very much! I'm about to post the patch for mainline, with your Tested-by.

(In reply to Paolo Valente from comment #37)
> Thank you very much! I'm about to post the patch for mainline, with your
> Tested-by.

Thanks! |
Created attachment 298931 [details]
kernel bug trace

Yesterday I updated my kernel to version 5.14.7. Since then I have been getting random system hangs with the attached kernel bug message. With 5.14.6 and earlier there were no problems. I use Gentoo Linux on an AMD FX-8350. The system is installed on a Samsung NVMe 970 Pro. I use the BFQ scheduler. I am now testing the system with two suspected patches reverted:
- blkcg-fix-memory-leak-in-blk_iolatency_init.patch
- block-bfq-honor-already-setup-queue-merges.patch
I will report soon if they are responsible for the hangs.
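
Note for readers applying the workaround from this report (writing mq-deadline to /sys/block/<device>/queue/scheduler on affected kernels): the following is only a minimal sketch of how that could be done for every device that currently uses bfq. It relies on the sysfs format shown in the thread, where the active scheduler is the name in square brackets. The function names `active_scheduler` and `switch_bfq_to_mq_deadline`, and the optional directory argument, are hypothetical and introduced here for illustration; they are not part of any tool mentioned above.

```shell
#!/bin/sh
# Sketch only: helper names and the optional directory argument are
# assumptions made for this example, not part of any existing tool.

# Print the active scheduler from a sysfs scheduler line,
# i.e. the name enclosed in square brackets.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Switch every block device whose active scheduler is bfq to mq-deadline.
# $1 is an optional sysfs block directory (defaults to /sys/block), which
# allows a dry run against a copy of the files.
switch_bfq_to_mq_deadline() {
    sysfs_root=${1:-/sys/block}
    for sched_file in "$sysfs_root"/*/queue/scheduler; do
        # Skip the unexpanded pattern when no device matches.
        [ -e "$sched_file" ] || continue
        if [ "$(active_scheduler "$(cat "$sched_file")")" = "bfq" ]; then
            echo mq-deadline > "$sched_file" && echo "switched $sched_file"
        fi
    done
}
```

To use it, source the file as root right after boot and call `switch_bfq_to_mq_deadline` with no argument. The setting does not persist across reboots, so it has to be reapplied at each boot until a kernel with the revert or the fix is installed.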