Bug 219166 - occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
Summary: occasional block layer hang when setting 'echo noop > /sys/block/sda/queue/scheduler'
Status: RESOLVED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Block Layer
Hardware: All
OS: Linux
Importance: P3 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL: https://git.kernel.org/pub/scm/linux/...
Keywords:
Depends on:
Blocks:
 
Reported: 2024-08-16 07:43 UTC by Richard W.M. Jones
Modified: 2024-09-18 08:36 UTC
CC List: 3 users

See Also:
Kernel Version: 6.11.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
'foreach bt' in crash utility (34.99 KB, text/plain)
2024-09-06 20:28 UTC, Richard W.M. Jones

Description Richard W.M. Jones 2024-08-16 07:43:09 UTC
kernel-6.11.0-0.rc3.30.fc41.x86_64

This hang seems new in 6.11.0.

Downstream bug:
https://bugzilla.redhat.com/show_bug.cgi?id=2303267

Trying to change the queue/scheduler for a device containing an ext4 filesystem very rarely causes other processes which are using the filesystem to hang.

This is running inside qemu.  Using 'crash' I was able to get a stack trace from a couple of hanging processes (from different VM instances):

crash> set 230
    PID: 230
COMMAND: "modprobe"
   TASK: ffff98ce03ca3040  [THREAD_INFO: ffff98ce03ca3040]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 230      TASK: ffff98ce03ca3040  CPU: 0    COMMAND: "modprobe"
 #0 [ffffaf9940307840] __schedule at ffffffff9618f6d0
 #1 [ffffaf99403078f8] schedule at ffffffff96190a27
 #2 [ffffaf9940307908] __bio_queue_enter at ffffffff957e121c
 #3 [ffffaf9940307968] blk_mq_submit_bio at ffffffff957f358c
 #4 [ffffaf99403079f0] __submit_bio at ffffffff957e1e3c
 #5 [ffffaf9940307a58] submit_bio_noacct_nocheck at ffffffff957e2326
 #6 [ffffaf9940307ac0] ext4_mpage_readpages at ffffffff955ceafc
 #7 [ffffaf9940307be0] read_pages at ffffffff95381d1a
 #8 [ffffaf9940307c40] page_cache_ra_unbounded at ffffffff95381ff5
 #9 [ffffaf9940307ca8] filemap_fault at ffffffff953761b5
#10 [ffffaf9940307d48] __do_fault at ffffffff953d1895
#11 [ffffaf9940307d70] do_fault at ffffffff953d2425
#12 [ffffaf9940307da0] __handle_mm_fault at ffffffff953d8c6b
#13 [ffffaf9940307e88] handle_mm_fault at ffffffff953d95c2
#14 [ffffaf9940307ec8] do_user_addr_fault at ffffffff950b34ea
#15 [ffffaf9940307f28] exc_page_fault at ffffffff96186e4e
#16 [ffffaf9940307f50] asm_exc_page_fault at ffffffff962012a6
    RIP: 0000556b7a7468d8  RSP: 00007ffde2ffb560  RFLAGS: 00000206
    RAX: 00000000000bec82  RBX: 00007f5331a0dc82  RCX: 0000556b7a75b92a
    RDX: 00007ffde2ffd8d0  RSI: 00000000200bec82  RDI: 0000556ba8edf960
    RBP: 00007ffde2ffb7c0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200bec82
    R13: 0000556ba8edf960  R14: 00007ffde2ffd8d0  R15: 0000556b7a760708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

crash> set 234
    PID: 234
COMMAND: "modprobe"
   TASK: ffff9f5ec3a22f40  [THREAD_INFO: ffff9f5ec3a22f40]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 234      TASK: ffff9f5ec3a22f40  CPU: 0    COMMAND: "modprobe"
 #0 [ffffb21e002e7840] __schedule at ffffffffa718f6d0
 #1 [ffffb21e002e78f8] schedule at ffffffffa7190a27
 #2 [ffffb21e002e7908] __bio_queue_enter at ffffffffa67e121c
 #3 [ffffb21e002e7968] blk_mq_submit_bio at ffffffffa67f358c
 #4 [ffffb21e002e79f0] __submit_bio at ffffffffa67e1e3c
 #5 [ffffb21e002e7a58] submit_bio_noacct_nocheck at ffffffffa67e2326
 #6 [ffffb21e002e7ac0] ext4_mpage_readpages at ffffffffa65ceafc
 #7 [ffffb21e002e7be0] read_pages at ffffffffa6381d17
 #8 [ffffb21e002e7c40] page_cache_ra_unbounded at ffffffffa6381ff5
 #9 [ffffb21e002e7ca8] filemap_fault at ffffffffa63761b5
#10 [ffffb21e002e7d48] __do_fault at ffffffffa63d1892
#11 [ffffb21e002e7d70] do_fault at ffffffffa63d2425
#12 [ffffb21e002e7da0] __handle_mm_fault at ffffffffa63d8c6b
#13 [ffffb21e002e7e88] handle_mm_fault at ffffffffa63d95c2
#14 [ffffb21e002e7ec8] do_user_addr_fault at ffffffffa60b34ea
#15 [ffffb21e002e7f28] exc_page_fault at ffffffffa7186e4e
#16 [ffffb21e002e7f50] asm_exc_page_fault at ffffffffa72012a6
    RIP: 000055d16159f8d8  RSP: 00007ffdd4c1f340  RFLAGS: 00010206
    RAX: 00000000000bec82  RBX: 00007ff2fd00dc82  RCX: 000055d1615b492a
    RDX: 00007ffdd4c216b0  RSI: 00000000200bec82  RDI: 000055d185725960
    RBP: 00007ffdd4c1f5a0   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200bec82
    R13: 000055d185725960  R14: 00007ffdd4c216b0  R15: 000055d1615b9708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

The hanging process was modprobe both times.  I don't know if that is significant.

Note that "noop" has not actually been a valid scheduler name for about 10 years!  We still observe this hang even when writing this invalid value.

# echo noop > /sys/block/sda/queue/scheduler
/init: line 108: echo: write error: Invalid argument
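
For comparison, the set of scheduler names the kernel will actually accept can be read back from the same file (example output; the exact list depends on which scheduler modules are built/loaded):

# cat /sys/block/sda/queue/scheduler
[none] mq-deadline kyber bfq

Anything outside that list, including "noop", is rejected with EINVAL as above, yet the rejected write can still trigger the hang.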
Comment 1 Theodore Tso 2024-08-16 16:40:19 UTC
This is a bug in the block layer, not in ext4.  It's also a fairly long-standing issue; I remember seeing something like this in our Data Center kernels at $WORK at least five-plus years ago.   At the time, we dealt with it via the simple expedient of "Doctor, Doctor, it hurts when I do that."  "Well, then don't do that!"

At the time it happened relatively rarely; if you're able to reliably reproduce it, perhaps it would be useful to send a script with said reliable reproducer to the linux-block list, and perhaps try to do a bisect to see when it became trivially easy to trigger.
Comment 2 Richard W.M. Jones 2024-08-16 17:06:33 UTC
We did indeed work around it by not doing it:
https://github.com/libguestfs/libguestfs/commit/b2d682a4730ead8b4ae07e5aaf6fa230c5eec305
so my interest in this bug is now limited.  If you want to close it then that's fine.

It's unfortunately triggered very rarely.  The only test I have is:

$ while guestfish -a /dev/null run -vx >& /tmp/log ; do echo -n . ; done

but that hits the bug in fewer than 1 in 200 iterations (although more often when run in a qemu VM using software emulation, which is where we first hit this).

I tried to write something that could trigger it more frequently by having a loop in a guest writing to queue/scheduler, while another loop in the guest would write to the disk, but I couldn't seem to hit the bug at all.
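
The attempt was roughly along these lines (a sketch, not the exact script):

# terminal 1 inside the guest: flip the scheduler in a tight loop
while true; do echo noop > /sys/block/vda/queue/scheduler 2>/dev/null ; done

# terminal 2 inside the guest: keep writes in flight on the same disk
while true; do dd if=/dev/zero of=/var/tmp/testfile bs=1M count=100 conv=fsync 2>/dev/null ; done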
Comment 3 Theodore Tso 2024-08-16 20:36:42 UTC
So FWIW, what we saw in our data center kernel was switching from one valid scheduler to a different valid scheduler while I/O was in flight.   And this was with a kernel that didn't have any modules (or not any modules that would be loaded under normal circumstances, so modprobe wouldn't have been in the picture).   It triggered rarely as well, and I don't remember whether it was an oops or a hang --- if I remember correctly, it was an oops.   So it might not be the same thing, but our workaround was to quiesce the device before changing the scheduler.  Since this was happening in the boot sequence, it was something we could do relatively easily, and like you we then lost interest.  :-)

The question isn't whether or not I want to close it; the question is whether we think it's worth trying to ask the block layer developers to take a look at it.   Right now it's mostly only ext4 developers who are paying attention to this bug component, so someone would need to take it up with the block developers, and the thing that would be most useful is a reliable reproducer.

I don't know what guestfish is doing, but if we were trying to create a reproducer from what I was seeing a few years ago, it would be something like running fsstress or fio to exercise the block layer, and then try switching the I/O scheduler and see if we can make it go *boom* regularly.   Maybe with something like that we could get a reproducer that doesn't require launching a VM multiple times and only seeing the failure less than 0.5% of the time....
Comment 4 Richard W.M. Jones 2024-08-16 20:51:06 UTC
I've got a job running now:

# while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

and fio doing some random writes on an XFS filesystem on sda1.

Bare metal, kernel 6.11.0-0.rc3.20240814git6b0f8db921ab.32.fc42.x86_64.

I'll leave that going overnight to see what happens.
Comment 5 Richard W.M. Jones 2024-08-17 10:22:11 UTC
I'm afraid that didn't reproduce the issue.  I have an idea to run the same test, but in a software emulated virtual machine.  The idea is that the much slower performance will exaggerate any race conditions in the kernel.  However this will take time.
Comment 6 Richard W.M. Jones 2024-08-17 12:36:41 UTC
Yes I can reproduce this inside a software emulated VM with another 6.11.0 Fedora kernel.  I will bisect this later, but for now reproduction instructions are given below.

(1) Install a Fedora 40 virtual machine.  I used the command below but other ways are available:

virt-builder fedora-40 --size=10G --root-password=password:123456

(2) Run the VM in qemu with software emulation (TCG):

qemu-system-x86_64 -machine accel=tcg -cpu qemu64 -m 4096 -drive file=/var/tmp/fedora-40.qcow2,format=qcow2,if=virtio

(3) Inside the VM, log in as root/123456, install fio, and update the kernel:

dnf install fedora-repos-rawhide
dnf install fio
dnf update kernel
reboot

(should upgrade to 6.11.0 and boot into that kernel).

(4) Inside the VM, in one terminal run:

while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

(5) Inside the VM, in another terminal run fio with the following config or similar:

[global]
name=fio-rand-write
filename=/root/fio-rand-write
rw=randwrite
bs=4K
numjobs=4
time_based
runtime=1h
group_reporting

[file1]
size=1G
ioengine=libaio
iodepth=64

(6) After a while the fio process ETA will start counting up (since one or more threads have got stuck and are not making progress).  Also logging in is problematic and many common commands like 'dmesg' or 'ps' hang.  I could only recover by rebooting.
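
To actually start the job in step (5), save the config to a file (the name is arbitrary, e.g. fio-rand-write.fio) and run:

fio fio-rand-write.fio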
Comment 7 Richard W.M. Jones 2024-08-17 12:38:03 UTC
> virt-builder fedora-40 --size=10G --root-password=password:123456

Should be:

virt-builder fedora-40 --size=10G --root-password=password:123456 --format=qcow2
Comment 8 Richard W.M. Jones 2024-08-17 12:41:18 UTC
> while true; do echo noop > /sys/block/sda/queue/scheduler 2>/dev/null ; done

Should be:

while true; do echo noop > /sys/block/vda/queue/scheduler 2>/dev/null ; done

as the guest is using virtio.  (Copied and pasted the instructions from my host test, but I did use the correct command in the VM test.)
Comment 9 Richard W.M. Jones 2024-08-17 13:58:21 UTC
Also this does *not* reproduce with a 6.8.5 kernel on the same VM, so I really do think this is a regression.

This looks like a very tedious bisect but I'll have a go if I have time later.
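
If I do get to it, the plan would be the usual mainline bisect, roughly (a sketch; exact good/bad tags to be confirmed):

git bisect start
git bisect bad v6.11-rc3     # the failing Fedora kernel is based on 6.11-rc3
git bisect good v6.8         # 6.8.5 does not reproduce, so v6.8 should be a reasonable good point
# at each step: build, boot the kernel in the VM, run the reproducer,
# then mark the result with 'git bisect good' or 'git bisect bad'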
Comment 10 Richard W.M. Jones 2024-08-20 15:33:12 UTC
Just to close this one out ...

I couldn't reproduce this issue when I compiled the kernel myself, even though I was using the exact same .config that Fedora uses, the same tag, and building it on Fedora 40 (and Fedora itself does not carry any significant downstream patches).  There were a few differences; for example, I'm probably using a slightly different version of gcc/binutils than the Fedora kernel builders.

So being unable to reproduce it in a self-compiled kernel, I cannot bisect it.

We have worked around the problem, so that's basically as far as I want to take this bug.  Feel free to close it if you want.
Comment 11 Ming Lei 2024-09-05 01:04:58 UTC
Hi Richard,

Can you collect the blk-mq debugfs log for this virtio-blk device?

  (cd /sys/kernel/debug/block/vda && find . -type f -exec grep -aH . {} \;)

There is a known I/O hang risk in the case of MQ:

https://lore.kernel.org/linux-block/20240903081653.65613-1-songmuchun@bytedance.com/T/#m0212e28431d08e00a3de132ec40b2b907bb07177

If you or anyone can reproduce the issue, please test the above patches.
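
One convenient way to fetch and apply the series, assuming the b4 tool is available (message-id taken from the lore link above):

  b4 am 20240903081653.65613-1-songmuchun@bytedance.com
  git am ./*.mbx

then rebuild the kernel and re-run the reproducer.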

Thanks,
Comment 12 Richard W.M. Jones 2024-09-05 07:25:52 UTC
Is it possible to get this from the 'crash' utility (i.e. from kernel structs)?  The VM hangs fairly hard when this problem is hit.

I can try the kernel patches though; I'll see if I can do that today.
Comment 13 Richard W.M. Jones 2024-09-05 09:32:51 UTC
To document for myself and others what I did to reproduce the bug and get the kernel stack trace ...

(1) libguestfs from git with this patch reverted:
https://github.com/libguestfs/libguestfs/commit/b2d682a4730ead8b4ae07e5aaf6fa230c5eec305

(2) Run guestfish in a loop until it hangs:

$ while LIBGUESTFS_BACKEND_SETTINGS=force_tcg ./run guestfish -a /dev/null run -vx >& /tmp/log ; do echo -n . ; done


(3) Looking at /tmp/log, we can see it hung just after trying to set the noop scheduler:

$ tail -5 /tmp/log
+ echo 300
+ for f in /sys/block/sd*/device/timeout
+ echo 300
+ for f in /sys/block/{h,s,ub,v}d*/queue/scheduler
+ echo noop

(4) Check the log for the kernel version, install the corresponding kernel debuginfo.
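
On Fedora that is something like (substitute the exact kernel version seen in the log):

dnf debuginfo-install kernel-6.11.0-0.rc5.20240830git20371ba12063.47.fc42.x86_64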

(5) Get virsh to produce a core dump of the VM:

$ virsh list 
 Id     Name                       State
--------------------------------------------
 1950   guestfs-lsdbxy71u4jg1w6x   running

$ virsh dump 1950 /var/tmp/core --memory-only

Domain '1950' dumped to /var/tmp/core

(6) Open in 'crash':

$ crash /usr/lib/debug/lib/modules/6.11.0-0.rc5.20240830git20371ba12063.47.fc42.x86_64/vmlinux /var/tmp/core

(7) List processes and find the one which hung:

crash> ps 
...
      230      73   0  ffffa01f83c58000  UN   0.3    11608     3340  modprobe

(8) Get stack trace from the hung process:

crash> set 230
    PID: 230
COMMAND: "modprobe"
   TASK: ffffa01f83c58000  [THREAD_INFO: ffffa01f83c58000]
    CPU: 0
  STATE: TASK_UNINTERRUPTIBLE 
crash> bt
PID: 230      TASK: ffffa01f83c58000  CPU: 0    COMMAND: "modprobe"
 #0 [ffffc1db0030f840] __schedule at ffffffff921906d0
 #1 [ffffc1db0030f8f8] schedule at ffffffff92191a27
 #2 [ffffc1db0030f908] __bio_queue_enter at ffffffff917e17dc
 #3 [ffffc1db0030f968] blk_mq_submit_bio at ffffffff917f3b4c
 #4 [ffffc1db0030f9f0] __submit_bio at ffffffff917e23fc
 #5 [ffffc1db0030fa58] submit_bio_noacct_nocheck at ffffffff917e28e6
 #6 [ffffc1db0030fac0] ext4_mpage_readpages at ffffffff915cef7c
 #7 [ffffc1db0030fbe0] read_pages at ffffffff91381cda
 #8 [ffffc1db0030fc40] page_cache_ra_unbounded at ffffffff91381fb5
 #9 [ffffc1db0030fca8] filemap_fault at ffffffff91376175
#10 [ffffc1db0030fd48] __do_fault at ffffffff913d1755
#11 [ffffc1db0030fd70] do_fault at ffffffff913d22e5
#12 [ffffc1db0030fda0] __handle_mm_fault at ffffffff913d8b2b
#13 [ffffc1db0030fe88] handle_mm_fault at ffffffff913d9472
#14 [ffffc1db0030fec8] do_user_addr_fault at ffffffff910b34ea
#15 [ffffc1db0030ff28] exc_page_fault at ffffffff92187e4e
#16 [ffffc1db0030ff50] asm_exc_page_fault at ffffffff922012a6
    RIP: 000055bb085508d8  RSP: 00007ffc3e731900  RFLAGS: 00010206
    RAX: 00000000000becd6  RBX: 00007f39925d1cd6  RCX: 000055bb0856592a
    RDX: 00007ffc3e733c70  RSI: 00000000200becd6  RDI: 000055bb1a712970
    RBP: 00007ffc3e731b60   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000202  R12: 00000000200becd6
    R13: 000055bb1a712970  R14: 00007ffc3e733c70  R15: 000055bb0856a708
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
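
A possible extra step, if anyone wants to inspect the queue state directly from the dump (assuming the request_queue field names are unchanged in 6.11; a non-zero mq_freeze_depth would suggest the queue was left frozen by the scheduler write):

crash> dev -d
(note the REQUEST_QUEUE address of the relevant sd*/vd* device)
crash> struct request_queue.mq_freeze_depth,pm_only <request_queue address>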
Comment 14 Richard W.M. Jones 2024-09-06 19:46:44 UTC
I think I have bisected this to:

commit af2814149883e2c1851866ea2afcd8eadc040f79
Author: Christoph Hellwig <hch@lst.de>
Date:   Mon Jun 17 08:04:38 2024 +0200

    block: freeze the queue in queue_attr_store
    
    queue_attr_store updates attributes used to control generating I/O, and
    can cause malformed bios if changed with I/O in flight.  Freeze the queue
    in common code instead of adding it to almost every attribute.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Hannes Reinecke <hare@suse.de>
    Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
    Link: https://lore.kernel.org/r/20240617060532.127975-12-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
Comment 15 Richard W.M. Jones 2024-09-06 20:28:13 UTC
Created attachment 306825 [details]
'foreach bt' in crash utility
Comment 16 Richard W.M. Jones 2024-09-06 20:32:27 UTC
Reverting commit af2814149883e2c1851866ea2afcd8eadc040f79 on top of current git head (788220eee30d6) either fixes the problem or hides it well enough that my reproducer no longer triggers it.

The revert is not clean, but only 2 hunks fail and they were trivial to fix up.
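
For reference, the revert sequence is the usual one (a sketch):

git revert af2814149883e2c1851866ea2afcd8eadc040f79
# fix up the two failed hunks by hand, 'git add' the affected file(s), then:
git revert --continue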
Comment 17 Richard W.M. Jones 2024-09-07 07:49:17 UTC
http://oirase.annexia.org/tmp/kbug-219166/

vmlinux and memory core dump captured when the bug happened.  You
can open this using the 'crash' utility.  The corresponding kernel is:

https://koji.fedoraproject.org/koji/buildinfo?buildID=2538598
Comment 18 Richard W.M. Jones 2024-09-07 11:09:56 UTC
Upstream discussion: https://lore.kernel.org/linux-block/20240907073522.GW1450@redhat.com/T/
