kernel-6.11.0-0.rc3.30.fc41.x86_64

This hang seems new in 6.11.0.

Downstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=2303267

Trying to change the queue/scheduler for a device containing an ext4 filesystem very rarely causes other processes which are using the filesystem to hang. This is running inside qemu.

Using 'crash' I was able to get a stack trace from a couple of hanging processes (from different VM instances):

crash> set 230
    PID: 230
COMMAND: "modprobe"
   TASK: ffff98ce03ca3040  [THREAD_INFO: ffff98ce03ca3040]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 230     TASK: ffff98ce03ca3040  CPU: 0   COMMAND: "modprobe"
 #0 [ffffaf9940307840] __schedule at ffffffff9618f6d0
 #1 [ffffaf99403078f8] schedule at ffffffff96190a27
 #2 [ffffaf9940307908] __bio_queue_enter at ffffffff957e121c
 #3 [ffffaf9940307968] blk_mq_submit_bio at ffffffff957f358c
 #4 [ffffaf99403079f0] __submit_bio at ffffffff957e1e3c
 #5 [ffffaf9940307a58] submit_bio_noacct_nocheck at ffffffff957e2326
 #6 [ffffaf9940307ac0] ext4_mpage_readpages at ffffffff955ceafc
 #7 [ffffaf9940307be0] read_pages at ffffffff95381d1a
 #8 [ffffaf9940307c40] page_cache_ra_unbounded at ffffffff95381ff5
 #9 [ffffaf9940307ca8] filemap_fault at ffffffff953761b5
#10 [ffffaf9940307d48] __do_fault at ffffffff953d1895
#11 [ffffaf9940307d70] do_fault at ffffffff953d2425
#12 [ffffaf9940307da0] __handle_mm_fault at ffffffff953d8c6b
#13 [ffffaf9940307e88] handle_mm_fault at ffffffff953d95c2
#14 [ffffaf9940307ec8] do_user_addr_fault at ffffffff950b34ea
#15 [ffffaf9940307f28] exc_page_fault at ffffffff96186e4e
#16 [ffffaf9940307f50] asm_exc_page_fault at ffffffff962012a6
    RIP: 0000556b7a7468d8  RSP: 00007ffde2ffb560  RFLAGS: 00000206
    RAX: 00000000000bec82  RBX: 00007f5331a0dc82  RCX: 0000556b7a75b92a
    RDX: 00007ffde2ffd8d0  RSI: 00000000200bec82  RDI: 0000556ba8edf960
    RBP: 00007ffde2ffb7c0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200bec82
    R13: 0000556ba8edf960  R14: 00007ffde2ffd8d0  R15: 0000556b7a760708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

crash> set 234
    PID: 234
COMMAND: "modprobe"
   TASK: ffff9f5ec3a22f40  [THREAD_INFO: ffff9f5ec3a22f40]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 234     TASK: ffff9f5ec3a22f40  CPU: 0   COMMAND: "modprobe"
 #0 [ffffb21e002e7840] __schedule at ffffffffa718f6d0
 #1 [ffffb21e002e78f8] schedule at ffffffffa7190a27
 #2 [ffffb21e002e7908] __bio_queue_enter at ffffffffa67e121c
 #3 [ffffb21e002e7968] blk_mq_submit_bio at ffffffffa67f358c
 #4 [ffffb21e002e79f0] __submit_bio at ffffffffa67e1e3c
 #5 [ffffb21e002e7a58] submit_bio_noacct_nocheck at ffffffffa67e2326
 #6 [ffffb21e002e7ac0] ext4_mpage_readpages at ffffffffa65ceafc
 #7 [ffffb21e002e7be0] read_pages at ffffffffa6381d17
 #8 [ffffb21e002e7c40] page_cache_ra_unbounded at ffffffffa6381ff5
 #9 [ffffb21e002e7ca8] filemap_fault at ffffffffa63761b5
#10 [ffffb21e002e7d48] __do_fault at ffffffffa63d1892
#11 [ffffb21e002e7d70] do_fault at ffffffffa63d2425
#12 [ffffb21e002e7da0] __handle_mm_fault at ffffffffa63d8c6b
#13 [ffffb21e002e7e88] handle_mm_fault at ffffffffa63d95c2
#14 [ffffb21e002e7ec8] do_user_addr_fault at ffffffffa60b34ea
#15 [ffffb21e002e7f28] exc_page_fault at ffffffffa7186e4e
#16 [ffffb21e002e7f50] asm_exc_page_fault at ffffffffa72012a6
    RIP: 000055d16159f8d8  RSP: 00007ffdd4c1f340  RFLAGS: 00010206
    RAX: 00000000000bec82  RBX: 00007ff2fd00dc82  RCX: 000055d1615b492a
    RDX: 00007ffdd4c216b0  RSI: 00000000200bec82  RDI: 000055d185725960
    RBP: 00007ffdd4c1f5a0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200bec82
    R13: 000055d185725960  R14: 00007ffdd4c216b0  R15: 000055d1615b9708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

The hanging process was modprobe both times. I don't know if that is significant.

Note that "noop" is not actually a valid scheduler (and hasn't been for about 10 years)! We still observe this hang even when setting it to this impossible value.

# echo noop > /sys/block/sda/queue/scheduler
/init: line 108: echo: write error: Invalid argument
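For reference, the schedulers the kernel will actually accept (with the active one shown in brackets) can be listed by reading back the same sysfs file; the exact set below is only illustrative and depends on which scheduler modules are built or loaded:

# cat /sys/block/sda/queue/scheduler
none [mq-deadline] bfq kyber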
This is a bug in the block layer, not in ext4. It's also a fairly long-standing issue; I remember seeing something like it in our data center kernels at $WORK at least five years ago. At the time, we dealt with it via the simple expedient of "Doctor, Doctor, it hurts when I do that. Well, don't do that then!" It also happened relatively rarely back then; if you're able to reliably reproduce it, it would be useful to send a script with said reliable reproducer to the linux-block list, and perhaps to do a bisect to see when it became trivially easy to trigger.
We did indeed work around it by not doing it:

https://github.com/libguestfs/libguestfs/commit/b2d682a4730ead8b4ae07e5aaf6fa230c5eec305

so my interest in this bug is now limited. If you want to close it then that's fine.

Unfortunately it's triggered very rarely. The only test I have is:

$ while guestfish -a /dev/null run -vx >& /tmp/log ; do echo -n . ; done

but that hits the bug in fewer than 1 in 200 iterations (although more often when run in a qemu VM using software emulation, which is where we first hit this). I tried to write something that could trigger it more frequently, by having one loop in the guest writing to queue/scheduler while another loop in the guest wrote to the disk, but I couldn't seem to hit the bug at all.
So FWIW, what we saw in our data center kernel was switching from one valid scheduler to a different valid scheduler while I/O was in flight. And this was with a kernel that didn't have any modules (or at least not any modules that would be loaded under normal circumstances, so modprobe wouldn't have been in the picture). It triggered rarely as well, and I don't remember whether it was an oops or a hang --- if I remember correctly, it was an oops. So it might not be the same thing, but our workaround was to quiesce the device before changing the scheduler. Since this was happening in the boot sequence, it was something we could do relatively easily, and like you we then lost interest. :-)

The question isn't whether or not I want to close it; the question is whether we think it's worth asking the block layer developers to take a look at it. Right now it's mostly only ext4 developers who are paying attention to this bug component, so someone would need to take it up with the block developers, and the thing that would be most useful is a reliable reproducer. I don't know what guestfish is doing, but if we were trying to create a reproducer from what I was seeing a few years ago, it would be something like running fsstress or fio to exercise the block layer, and then switching the I/O scheduler and seeing if we can make it go *boom* regularly. Maybe with something like that we could get a reproducer that doesn't require launching a VM multiple times and only seeing the failure less than 0.5% of the time....
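What I have in mind is roughly the sketch below; the device name, mount point, scheduler list and fio parameters are all placeholder assumptions, not a tested recipe:

# Untested sketch: exercise the filesystem with fio while a second loop
# keeps switching the I/O scheduler on the underlying device.
DEV=sda        # assumed device backing the filesystem under test
MNT=/mnt/test  # assumed mount point of that filesystem

fio --name=stress --directory=$MNT --rw=randwrite --bs=4k --size=1G \
    --numjobs=4 --ioengine=libaio --iodepth=64 \
    --time_based --runtime=600 &
FIO_PID=$!

# While fio is running, cycle through whatever schedulers are available.
while kill -0 $FIO_PID 2>/dev/null; do
    for sched in none mq-deadline bfq kyber; do
        echo $sched > /sys/block/$DEV/queue/scheduler 2>/dev/null
    done
done
wait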
I've got a job running now:

# while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

and fio doing some random writes on an XFS filesystem on sda1. Baremetal, kernel 6.11.0-0.rc3.20240814git6b0f8db921ab.32.fc42.x86_64. I'll leave that going overnight to see what happens.
I'm afraid that didn't reproduce the issue. I have an idea to run the same test, but in a software-emulated virtual machine. The idea is that the much slower performance will exaggerate any race conditions in the kernel. However, this will take time.
Yes I can reproduce this inside a software emulated VM with another 6.11.0 Fedora kernel. I will bisect this later, but for now reproduction instructions are given below.

(1) Install a Fedora 40 virtual machine. I used the command below but other ways are available:

virt-builder fedora-40 --size=10G --root-password=password:123456

(2) Run the VM in qemu with software emulation (TCG):

qemu-system-x86_64 -machine accel=tcg -cpu qemu64 -m 4096 -drive file=/var/tmp/fedora-40.qcow2,format=qcow2,if=virtio

(3) Inside the VM, log in as root/123456, install fio, and update the kernel:

dnf install fedora-repos-rawhide
dnf install fio
dnf update kernel
reboot

(should upgrade to 6.11.0 and boot into that kernel).

(4) Inside the VM, in one terminal run:

while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

(5) Inside the VM, in another terminal run fio with the following config or similar:

[global]
name=fio-rand-write
filename=/root/fio-rand-write
rw=randwrite
bs=4K
numjobs=4
time_based
runtime=1h
group_reporting

[file1]
size=1G
ioengine=libaio
iodepth=64

(6) After a while the fio process ETA will start counting up (since one or more threads have got stuck and are not making progress). Also logging in is problematic and many common commands like 'dmesg' or 'ps' hang. I could only recover by rebooting.
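The job file above can be saved under any name (fio-rand-write.ini below is just an arbitrary choice) and run as:

fio fio-rand-write.ini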
> virt-builder fedora-40 --size=10G --root-password=password:123456

Should be:

virt-builder fedora-40 --size=10G --root-password=password:123456 --format=qcow2
> while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

Should be:

while true; do echo noop > /sys/block/vda/queue/scheduler 2>/dev/null ; done

as the guest is using virtio. (Copied and pasted the instructions from my host test, but I did use the correct command in the VM test.)
Also this does *not* reproduce with a 6.8.5 kernel on the same VM, so I really do think this is a regression. This looks like a very tedious bisect but I'll have a go if I have time later.
Just to close this one out ... I couldn't reproduce this issue when I compiled the kernel myself, even though I was using the exact same .config as Fedora uses and the same tag, and was building it on Fedora 40, and Fedora itself does not have any significant downstream patches. There were a few differences; for example, I'm probably using a slightly different version of gcc/binutils than the Fedora kernel builders. Since I'm unable to reproduce it with a self-compiled kernel, I cannot bisect it. We have worked around the problem, so that's basically as far as I want to take this bug. Feel free to close it if you want.
Hi Richard,

Can you collect the blk-mq debugfs log for this virtio-blk device?

(cd /sys/kernel/debug/block/vda && find . -type f -exec grep -aH . {} \;)

There is a known IO hang risk in the MQ case:

https://lore.kernel.org/linux-block/20240903081653.65613-1-songmuchun@bytedance.com/T/#m0212e28431d08e00a3de132ec40b2b907bb07177

If you or anyone can reproduce the issue, please test the above patches.

Thanks,
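In case it helps, one way to fetch and apply that series for testing is with the b4 tool, using the message-id from the lore URL above (this assumes b4 is installed and that your checked-out kernel tree is one the series applies to):

# fetch the whole series as an mbox and apply it to the current branch
b4 am -o - 20240903081653.65613-1-songmuchun@bytedance.com | git am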
Is it possible to get this from the 'crash' utility (i.e. from kernel structs)? The VM hangs fairly hard when this problem is hit.

I can try the kernel patches though; I'll see if I can do that today.
To document for myself and others what I did to reproduce the bug and get the kernel stack trace ...

(1) libguestfs from git with this patch reverted:

https://github.com/libguestfs/libguestfs/commit/b2d682a4730ead8b4ae07e5aaf6fa230c5eec305

(2) Run guestfish in a loop until it hangs:

$ while LIBGUESTFS_BACKEND_SETTINGS=force_tcg ./run guestfish -a /dev/null run -vx >& /tmp/log ; do echo -n . ; done

(3) Looking at /tmp/log we can see it hung just after trying to set the noop scheduler:

$ tail -5 /tmp/log
+ echo 300
+ for f in /sys/block/sd*/device/timeout
+ echo 300
+ for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
+ echo noop

(4) Check the log for the kernel version, install the corresponding kernel debuginfo.

(5) Get virsh to produce a core dump of the VM:

$ virsh list
 Id     Name                       State
--------------------------------------------
 1950   guestfs-lsdbxy71u4jg1w6x   running

$ virsh dump 1950 /var/tmp/core --memory-only
Domain '1950' dumped to /var/tmp/core

(6) Open in 'crash':

$ crash /usr/lib/debug/lib/modules/6.11.0-0.rc5.20240830git20371ba12063.47.fc42.x86_64/vmlinux /var/tmp/core

(7) List processes and find the one which hung:

crash> ps
...
    230     73   0  ffffa01f83c58000  UN   0.3    11608   3340  modprobe

(8) Get stack trace from the hung process:

crash> set 230
    PID: 230
COMMAND: "modprobe"
   TASK: ffffa01f83c58000  [THREAD_INFO: ffffa01f83c58000]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 230     TASK: ffffa01f83c58000  CPU: 0   COMMAND: "modprobe"
 #0 [ffffc1db0030f840] __schedule at ffffffff921906d0
 #1 [ffffc1db0030f8f8] schedule at ffffffff92191a27
 #2 [ffffc1db0030f908] __bio_queue_enter at ffffffff917e17dc
 #3 [ffffc1db0030f968] blk_mq_submit_bio at ffffffff917f3b4c
 #4 [ffffc1db0030f9f0] __submit_bio at ffffffff917e23fc
 #5 [ffffc1db0030fa58] submit_bio_noacct_nocheck at ffffffff917e28e6
 #6 [ffffc1db0030fac0] ext4_mpage_readpages at ffffffff915cef7c
 #7 [ffffc1db0030fbe0] read_pages at ffffffff91381cda
 #8 [ffffc1db0030fc40] page_cache_ra_unbounded at ffffffff91381fb5
 #9 [ffffc1db0030fca8] filemap_fault at ffffffff91376175
#10 [ffffc1db0030fd48] __do_fault at ffffffff913d1755
#11 [ffffc1db0030fd70] do_fault at ffffffff913d22e5
#12 [ffffc1db0030fda0] __handle_mm_fault at ffffffff913d8b2b
#13 [ffffc1db0030fe88] handle_mm_fault at ffffffff913d9472
#14 [ffffc1db0030fec8] do_user_addr_fault at ffffffff910b34ea
#15 [ffffc1db0030ff28] exc_page_fault at ffffffff92187e4e
#16 [ffffc1db0030ff50] asm_exc_page_fault at ffffffff922012a6
    RIP: 000055bb085508d8  RSP: 00007ffc3e731900  RFLAGS: 00010206
    RAX: 00000000000becd6  RBX: 00007f39925d1cd6  RCX: 000055bb0856592a
    RDX: 00007ffc3e733c70  RSI: 00000000200becd6  RDI: 000055bb1a712970
    RBP: 00007ffc3e731b60   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200becd6
    R13: 000055bb1a712970  R14: 00007ffc3e733c70  R15: 000055bb0856a708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
I think I have bisected this to:

commit af2814149883e2c1851866ea2afcd8eadc040f79
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jun 17 08:04:38 2024 +0200

    block: freeze the queue in queue_attr_store

    queue_attr_store updates attributes used to control generating I/O, and
    can cause malformed bios if changed with I/O in flight. Freeze the queue
    in common code instead of adding it to almost every attribute.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20240617060532.127975-12-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
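For anyone wanting to repeat this, the bisect is just the standard git bisect workflow; the exact good/bad points below are approximations based on the earlier comments (6.8.x was good, the 6.11.0-rc Fedora kernels were bad):

git bisect start
git bisect bad v6.11-rc3
git bisect good v6.8
# at each step: build the kernel, boot it in the TCG guest, run the
# scheduler-switching loop plus fio, and then mark the result with
# "git bisect good" or "git bisect bad".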
Created attachment 306825 [details]
'foreach bt' in crash utility
Reverting commit af2814149883e2c1851866ea2afcd8eadc040f79 on top of the current git head (788220eee30d6) either fixes the problem or hides it well enough that my reproducer no longer triggers it. The revert is not clean, but only 2 hunks fail and they were trivial to fix up.
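For anyone who wants to try the same revert, the sequence is roughly as follows (the conflict resolution itself is manual, as noted above):

git checkout 788220eee30d6
git revert af2814149883e2c1851866ea2afcd8eadc040f79
# the revert does not apply cleanly; resolve the conflicts by hand, then:
git add -u
git revert --continue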
http://oirase.annexia.org/tmp/kbug-219166/

vmlinux and memory core dump captured when the bug happened. You can open these using the 'crash' utility.

The corresponding kernel is:

https://koji.fedoraproject.org/koji/buildinfo?buildID=2538598
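Assuming the vmlinux and the core dump from that URL are downloaded into the current directory (the exact filenames there may differ), they can be opened with:

crash ./vmlinux ./core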
Upstream discussion: https://lore.kernel.org/linux-block/20240907073522.GW1450@redhat.com/T/
Closing as fixed in: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=734e1a8603128ac31526c477a39456be5f4092b6