Bug 219166
| Summary: | occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler' | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Richard W.M. Jones (rjones) |
| Component: | Block Layer | Assignee: | fs_ext4 (fs_ext4) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | regressions, tom.leiming, tytso |
| Priority: | P3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| URL: | https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=734e1a8603128ac31526c477a39456be5f4092b6 | | |
| Kernel Version: | 6.11.0 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | 'foreach bt' in crash utility | | |
Description
Richard W.M. Jones
2024-08-16 07:43:09 UTC
This is a bug in the block layer, not in ext4. It's also a fairly long-standing issue; I remember seeing something like it in our Data Center kernels at $WORK at least five-plus years ago. At the time, we dealt with it via the simple expedient of "Doctor, Doctor, it hurts when I do that. Well, don't do that then!" It happened relatively rarely; if you're able to reliably reproduce it, perhaps it would be useful to send a script with said reliable reproducer to the linux-block list, and perhaps try to do a bisect to see when it became trivially easy to trigger.

We did indeed work around it by not doing it: https://github.com/libguestfs/libguestfs/commit/b2d682a4730ead8b4ae07e5aaf6fa230c5eec305 so my interest in this bug is now limited. If you want to close it then that's fine. It's unfortunately triggered very rarely. The only test I have is:

$ while guestfish -a /dev/null run -vx >& /tmp/log ; do echo -n . ; done

but that hits the bug less than 1 in 200 iterations (although more often when run in a qemu VM using software emulation, which is where we first hit this). I tried to write something that could trigger it more frequently by having a loop in a guest writing to queue/scheduler while another loop in the guest wrote to the disk, but I couldn't seem to hit the bug at all.

So FWIW, what we saw in our data center kernel was switching from one valid scheduler to a different valid scheduler while I/O was in flight. And this was with a kernel that didn't have any modules (or at least none that would be loaded under normal circumstances, so modprobe wouldn't have been in the picture). It triggered rarely as well, and I don't remember whether it was an oops or a hang --- if I remember correctly, it was an oops. So it might not be the same thing, but our workaround was to quiesce the device before changing the scheduler. Since this was happening in the boot sequence, it was something we could do relatively easily, and like you we then lost interest. :-)

The question isn't whether or not I want to close it; the question is whether we think it's worth asking the block layer developers to take a look at it. Right now it's mostly only ext4 developers who are paying attention to this bug component, so someone would need to take it up with the block developers, and the thing that would be most useful is a reliable reproducer. I don't know what guestfish is doing, but if we were trying to create a reproducer from what I was seeing a few years ago, it would be something like running fsstress or fio to exercise the block layer, then switching the I/O scheduler and seeing if we can make it go *boom* regularly. Maybe with something like that we could get a reproducer that doesn't require launching a VM multiple times and only seeing the failure less than 0.5% of the time....

I've got a job running now:

# while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

with fio doing some random writes on an XFS filesystem on sda1. Bare metal, kernel 6.11.0-0.rc3.20240814git6b0f8db921ab.32.fc42.x86_64. I'll leave that going overnight to see what happens.

I'm afraid that didn't reproduce the issue. I have an idea to run the same test, but in a software-emulated virtual machine. The idea is that the much slower performance will exaggerate any race conditions in the kernel. However this will take time.

Yes, I can reproduce this inside a software-emulated VM with another 6.11.0 Fedora kernel.
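(As an aside, the scheduler-switching half of such a reproducer, i.e. flipping between whatever valid schedulers the device offers while fio or fsstress keeps I/O in flight, can be scripted. This is only a sketch, assuming it runs as root and that DEV names the disk under test:)

DEV=sda
while true; do
    # /sys/block/$DEV/queue/scheduler lists every valid scheduler, with the
    # active one shown in [brackets]; strip the brackets and try each in turn.
    for sched in $(tr -d '[]' < "/sys/block/$DEV/queue/scheduler"); do
        echo "$sched" > "/sys/block/$DEV/queue/scheduler"
    done
done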
I will bisect this later, but for now reproduction instructions are given below.

(1) Install a Fedora 40 virtual machine. I used the command below but other ways are available:

virt-builder fedora-40 --size=10G --root-password=password:123456

(2) Run the VM in qemu with software emulation (TCG):

qemu-system-x86_64 -machine accel=tcg -cpu qemu64 -m 4096 -drive file=/var/tmp/fedora-40.qcow2,format=qcow2,if=virtio

(3) Inside the VM, log in as root/123456, install fio, and update the kernel:

dnf install fedora-repos-rawhide
dnf install fio
dnf update kernel
reboot

(should upgrade to 6.11.0 and boot into that kernel)

(4) Inside the VM, in one terminal run:

while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

(5) Inside the VM, in another terminal run fio with the following config or similar:

[global]
name=fio-rand-write
filename=/root/fio-rand-write
rw=randwrite
bs=4K
numjobs=4
time_based
runtime=1h
group_reporting

[file1]
size=1G
ioengine=libaio
iodepth=64

(6) After a while the fio process ETA will start counting up (since one or more threads have got stuck and are not making progress). Also logging in is problematic and many common commands like 'dmesg' or 'ps' hang. I could only recover by rebooting.

> virt-builder fedora-40 --size=10G --root-password=password:123456
Should be:
virt-builder fedora-40 --size=10G --root-password=password:123456 --format=qcow2
> while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done
Should be:
while true; do echo noop > /sys/block/vda/queue/scheduler 2>/dev/null ; done
as the guest is using virtio. (Copied and pasted the instructions from my host test, but I did use the correct command in the VM test.)
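Putting steps (4) and (5) and these corrections together, the in-VM part of the reproducer boils down to something like the sketch below. This is only an outline; it assumes the guest's virtio disk is vda and that the fio job from step (5) has been saved as /root/fio-rand-write.fio (a file name chosen here for illustration):

# Terminal 1 inside the guest: keep re-setting the scheduler of the virtio disk.
while true; do
    echo noop > /sys/block/vda/queue/scheduler 2>/dev/null
done

# Terminal 2 inside the guest: generate the random-write load.
fio /root/fio-rand-write.fio

When the bug triggers, fio's ETA starts counting up and commands such as 'dmesg' or 'ps' hang, as described in step (6).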
Also this does *not* reproduce with a 6.8.5 kernel on the same VM, so I really do think this is a regression. This looks like a very tedious bisect but I'll have a go if I have time later.

Just to close this one out ... I couldn't reproduce this issue when I compiled the kernel myself, even though I was using the exact same .config as Fedora uses, the same tag, building it on Fedora 40, and Fedora itself does not have any significant downstream patches. There were a few differences; for example I'm probably using a slightly different version of gcc/binutils than the Fedora kernel builders. Being unable to reproduce it in a self-compiled kernel, I cannot bisect it. We have worked around the problem, so that's basically as far as I want to take this bug. Feel free to close it if you want.

Hi Richard,

Can you collect the blk-mq debugfs log for this virtio-blk device?

(cd /sys/kernel/debug/block/vda && find . -type f -exec grep -aH . {} \;)

There is a known I/O hang risk in the MQ case:

https://lore.kernel.org/linux-block/20240903081653.65613-1-songmuchun@bytedance.com/T/#m0212e28431d08e00a3de132ec40b2b907bb07177

If you or anyone can reproduce the issue, please test the above patches.

Thanks,

Is it possible to get this from the 'crash' utility (i.e. from kernel structs)? The VM hangs fairly hard when this problem is hit. I can try the kernel patches though; will see if I can do that today.

To document for myself and others what I did to reproduce the bug and get the kernel stack trace ...

(1) libguestfs from git with this patch reverted:

https://github.com/libguestfs/libguestfs/commit/b2d682a4730ead8b4ae07e5aaf6fa230c5eec305

(2) Run guestfish in a loop until it hangs:

$ while LIBGUESTFS_BACKEND_SETTINGS=force_tcg ./run guestfish -a /dev/null run -vx >& /tmp/log ; do echo -n . ; done

(3) Looking at /tmp/log we can see it hung just after trying to set the noop scheduler:

$ tail -5 /tmp/log
+ echo 300
+ for f in /sys/block/sd*/device/timeout
+ echo 300
+ for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
+ echo noop

(4) Check the log for the kernel version, and install the corresponding kernel debuginfo.

(5) Get virsh to produce a core dump of the VM:

$ virsh list
 Id    Name                       State
--------------------------------------------
 1950  guestfs-lsdbxy71u4jg1w6x   running

$ virsh dump 1950 /var/tmp/core --memory-only

Domain '1950' dumped to /var/tmp/core

(6) Open it in 'crash':

$ crash /usr/lib/debug/lib/modules/6.11.0-0.rc5.20240830git20371ba12063.47.fc42.x86_64/vmlinux /var/tmp/core

(7) List processes and find the one which hung:

crash> ps
...
 230     73   0  ffffa01f83c58000  UN   0.3   11608   3340  modprobe

(8) Get stack trace from the hung process:

crash> set 230
    PID: 230
COMMAND: "modprobe"
   TASK: ffffa01f83c58000  [THREAD_INFO: ffffa01f83c58000]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE

crash> bt
PID: 230      TASK: ffffa01f83c58000  CPU: 0    COMMAND: "modprobe"
 #0 [ffffc1db0030f840] __schedule at ffffffff921906d0
 #1 [ffffc1db0030f8f8] schedule at ffffffff92191a27
 #2 [ffffc1db0030f908] __bio_queue_enter at ffffffff917e17dc
 #3 [ffffc1db0030f968] blk_mq_submit_bio at ffffffff917f3b4c
 #4 [ffffc1db0030f9f0] __submit_bio at ffffffff917e23fc
 #5 [ffffc1db0030fa58] submit_bio_noacct_nocheck at ffffffff917e28e6
 #6 [ffffc1db0030fac0] ext4_mpage_readpages at ffffffff915cef7c
 #7 [ffffc1db0030fbe0] read_pages at ffffffff91381cda
 #8 [ffffc1db0030fc40] page_cache_ra_unbounded at ffffffff91381fb5
 #9 [ffffc1db0030fca8] filemap_fault at ffffffff91376175
#10 [ffffc1db0030fd48] __do_fault at ffffffff913d1755
#11 [ffffc1db0030fd70] do_fault at ffffffff913d22e5
#12 [ffffc1db0030fda0] __handle_mm_fault at ffffffff913d8b2b
#13 [ffffc1db0030fe88] handle_mm_fault at ffffffff913d9472
#14 [ffffc1db0030fec8] do_user_addr_fault at ffffffff910b34ea
#15 [ffffc1db0030ff28] exc_page_fault at ffffffff92187e4e
#16 [ffffc1db0030ff50] asm_exc_page_fault at ffffffff922012a6
    RIP: 000055bb085508d8  RSP: 00007ffc3e731900  RFLAGS: 00010206
    RAX: 00000000000becd6  RBX: 00007f39925d1cd6  RCX: 000055bb0856592a
    RDX: 00007ffc3e733c70  RSI: 00000000200becd6  RDI: 000055bb1a712970
    RBP: 00007ffc3e731b60   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200becd6
    R13: 000055bb1a712970  R14: 00007ffc3e733c70  R15: 000055bb0856a708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

I think I have bisected this to:

commit af2814149883e2c1851866ea2afcd8eadc040f79
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jun 17 08:04:38 2024 +0200

    block: freeze the queue in queue_attr_store

    queue_attr_store updates attributes used to control generating I/O, and
    can cause malformed bios if changed with I/O in flight. Freeze the queue
    in common code instead of adding it to almost every attribute.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20240617060532.127975-12-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Created attachment 306825 [details]
'foreach bt' in crash utility
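(Output like that attachment can be collected without an interactive crash session. This is only a sketch, reusing the vmlinux and core paths from step (6) above; 'foreach bt' dumps a backtrace for every task, and 'set scroll off' stops crash from paging the output:)

crash /usr/lib/debug/lib/modules/6.11.0-0.rc5.20240830git20371ba12063.47.fc42.x86_64/vmlinux \
      /var/tmp/core <<'EOF' > /tmp/foreach-bt.txt
set scroll off
foreach bt
quit
EOF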
Reverting commit af2814149883e2c1851866ea2afcd8eadc040f79 on top of the current git head (788220eee30d6) either fixes the problem or hides it well enough that my reproducer no longer hangs. The revert is not clean, but only 2 hunks fail and they were trivial to fix up.

http://oirase.annexia.org/tmp/kbug-219166/

vmlinux and memory core dump captured when the bug happened. You can open this using the 'crash' utility. The corresponding kernel is:

https://koji.fedoraproject.org/koji/buildinfo?buildID=2538598

Upstream discussion:

https://lore.kernel.org/linux-block/20240907073522.GW1450@redhat.com/T/
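For anyone who wants to repeat the revert test, a rough sketch (commit IDs as quoted above; as noted, the revert does not apply cleanly, so expect to fix up two hunks by hand):

git checkout 788220eee30d6
git revert af2814149883e2c1851866ea2afcd8eadc040f79
# resolve the two conflicting hunks, then:
git revert --continue
# rebuild the kernel and re-run the reproducer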