In our cloud infrastructure we experienced an issue with hot plugging of virtio-blk PCI devices:

- put a heavy I/O load on one device
- during this workload, attach a new device on the same PCI bus

During hot plugging the guest "changes" the bridge window registers, which causes the ioeventfd notifiers to be disabled on the host side. As a result, if some notifications for the first device are missed during the high load, that device can get stuck, seemingly forever.

This can be worked around by updating the guest kernel or by using some kind of polling on the host side, but either way it looks like an issue inside the guest kernel. The problem was resolved by the following commit:

commit 51c48b310183ab6ba5419edfc6a8de889cc04521
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Sat Jan 19 11:35:04 2019 -0600

    PCI: Probe bridge window attributes once at enumeration-time

This patch has been applied to stable since 5.4, and it would be really great to see this patch and behaviour in earlier kernel versions. Could it be applied to the linux-stable repository for some earlier kernels (4.x and 5.x)?
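For reference, it is easy to check from a clone of the stable tree which released kernels already carry the fix; a minimal sketch (the "linux-stable" directory name and the branch picked here are just examples):

$ cd linux-stable
$ # list the release tags that already contain the fix
$ git tag --contains 51c48b310183ab6ba5419edfc6a8de889cc04521
$ # or test one particular stable branch
$ git merge-base --is-ancestor 51c48b310183ab6ba5419edfc6a8de889cc04521 \
      origin/linux-4.19.y && echo "already there" || echo "needs a backport"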
The lkml discussion: https://lkml.org/lkml/2019/1/22/906

There was an idea to file a new issue on bugzilla, but I could not find one, so here it is.

Steps to reproduce with the kvm/qemu stack and an Ubuntu 18.04 guest kernel (4.15, or anything below 5.4):

1. Run qemu with the following PCI topology:

$ cat run_linux_dev.sh
#!/bin/bash

QEMU_BINARY=./x86_64-softmmu/qemu-system-x86_64

PCI_TOPOLOGY="-device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1,numa_node=0 \
    -device pcie-root-port,id=s0,slot=0,bus=pcie.1 \
    -device pcie-root-port,id=s1,slot=1,bus=pcie.1 \
    -device pcie-root-port,id=s2,slot=2,bus=pcie.1 \
    -device pcie-root-port,id=s3,slot=3,bus=pcie.1 \
    -device pcie-root-port,id=s4,slot=4,bus=pcie.1 \
    -device pcie-root-port,id=s5,slot=5,bus=pcie.1 \
    -device pcie-root-port,id=s6,slot=6,bus=pcie.1 \
    -device pcie-root-port,id=s7,slot=7,bus=pcie.1 \
    -device pxb-pcie,bus_nr=137,bus=pcie.0,id=pcie.2,numa_node=0 \
    -device pcie-root-port,id=s8,slot=8,bus=pcie.2 \
    -device pcie-root-port,id=s9,slot=9,bus=pcie.2 \
    -device pcie-root-port,id=s10,slot=10,bus=pcie.2 \
    -device pcie-root-port,id=s11,slot=11,bus=pcie.2 \
    -device pcie-root-port,id=s12,slot=12,bus=pcie.2 \
    -device pcie-root-port,id=s13,slot=13,bus=pcie.2 \
    -device pcie-root-port,id=s14,slot=14,bus=pcie.2 \
    -device pcie-root-port,id=s15,slot=15,bus=pcie.2 \
    -device ioh3420,id=rootport1,bus=pcie.0,multifunction=on,chassis=1 \
    -device ioh3420,id=rootport2,bus=pcie.0,multifunction=on,chassis=2 \
    -device pcie-pci-bridge,id=pci.1,bus=rootport1 \
    -device pcie-pci-bridge,id=pci.2,bus=rootport2"

$QEMU_BINARY -machine q35,accel=kvm,usb=off \
    -name attachtest,debug-threads=on -cpu host \
    -smp 8,sockets=1,cores=8,threads=1 \
    -enable-kvm -nodefaults -no-user-config \
    -m 8192M -object memory-backend-file,id=ram-node0,size=8192M,host-nodes=0,policy=bind,mem-path=/dev/shm/vm90.mem,share=on \
    -numa node,cpus=0-0,memdev=ram-node0 \
    -netdev user,id=netdev0,hostfwd=tcp::20090-:22 \
    -device e1000,netdev=netdev0 \
    -vnc 127.0.0.1:90 -chardev socket,id=charserial0,server,nowait,telnet,host=127.0.0.1,port=30090 \
    -device isa-serial,chardev=charserial0,id=serial0 -device VGA,id=vga0,vgamem_mb=16 \
    -qmp unix:/tmp/qmp90.sock,server,nowait \
    -hda <path_to_guest_image> \
    $PCI_TOPOLOGY

2. Attach one disk using QMP commands:

{"execute": "object-add", "arguments": {"id": "iot1", "qom-type": "iothread"}}

{"execute": "blockdev-add", "arguments": {"node-name": "node-name-disk1", "cache": {"direct": true}, "detect-zeroes": "on", "driver": "raw", "file": {"driver": "file", "filename": "<path_to_first_disk>"}}}

{"execute": "device_add", "arguments": {"addr": "3", "bus": "pci.1", "id": "virtio-disk1", "write-cache": "off", "driver": "virtio-blk-pci", "disable-legacy": "off", "num-queues": "2", "drive": "node-name-disk1", "werror": "report", "config-wce": "off", "rerror": "report", "serial": "cloud-disk1", "iothread": "iot1"}}

3. Run some non-stop workload on the hot-plugged device. The script below can be used; in our case we wrote a random 64 MB file, and it is important to use direct, synchronous writes:

$ cat cp.sh
#!/bin/bash

DISKNAME=$1
echo "Load: ${DISKNAME}"

while true; do
    while [ ! -b ${DISKNAME} ] ; do
        touch ./file1
    done
    sudo dd if=./file1 of=${DISKNAME} bs=16384 oflag=sync
done

$ ./cp.sh /dev/vda

For a reliable reproduction we used the following transfer rate:
16777216 bytes (17 MB, 16 MiB) copied, 0.298746 s, 56.2 MB/s
4. During this non-stop load, attach the second device:

{"execute": "object-add", "arguments": {"id": "iot2", "qom-type": "iothread"}}

{"execute": "blockdev-add", "arguments": {"node-name": "node-name-disk2", "cache": {"direct": true}, "detect-zeroes": "on", "driver": "raw", "file": {"driver": "file", "filename": "<path_to_second_disk>"}}}

{"execute": "device_add", "arguments": {"addr": "4", "bus": "pci.1", "id": "virtio-disk2", "write-cache": "off", "driver": "virtio-blk-pci", "disable-legacy": "off", "num-queues": "2", "drive": "node-name-disk2", "werror": "report", "config-wce": "off", "rerror": "report", "serial": "cloud-disk2", "iothread": "iot2"}}

As a result, the first drive gets stuck, with the dmesg log shown below:

[   45.919147] shpchp 0000:01:00.0: PCI slot #4-1 - powering on due to button press
[   52.192207] pci 0000:02:04.0: [1af4:1001] type 00 class 0x010000
[   52.192401] pci 0000:02:04.0: reg 0x10: [io  0x0000-0x003f]
[   52.192446] pci 0000:02:04.0: reg 0x14: [mem 0x00000000-0x00000fff]
[   52.192607] pci 0000:02:04.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
[   52.196334] pci 0000:02:04.0: BAR 4: assigned [mem 0xfc204000-0xfc207fff 64bit pref]
[   52.196399] pci 0000:02:04.0: BAR 1: assigned [mem 0xfc401000-0xfc401fff]
[   52.196417] pci 0000:02:04.0: BAR 0: assigned [io  0xc040-0xc07f]
[   52.196435] shpchp 0000:01:00.0: PCI bridge to [bus 02]
[   52.196443] shpchp 0000:01:00.0:   bridge window [io  0xc000-0xcfff]
[   52.200299] shpchp 0000:01:00.0:   bridge window [mem 0xfc400000-0xfc5fffff]
[   52.202845] shpchp 0000:01:00.0:   bridge window [mem 0xfc200000-0xfc3fffff 64bit pref]
[   52.208570] virtio-pci 0000:02:04.0: enabling device (0000 -> 0003)
[  242.656133] INFO: task dd:1561 blocked for more than 120 seconds.
[  242.656521]       Not tainted 4.15.18-cloud21681.01 #1
[  242.656827] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.657297] dd              D    0  1561   1559 0x00000000
[  242.657301] Call Trace:
[  242.657343]  __schedule+0x297/0x8b0
[  242.657346]  schedule+0x2c/0x80
[  242.657354]  io_schedule+0x16/0x40
[  242.657366]  wait_on_page_bit_common+0xd8/0x160
[  242.657369]  ? page_cache_tree_insert+0xe0/0xe0
[  242.657370]  __filemap_fdatawait_range+0xfa/0x160
[  242.657373]  file_write_and_wait_range+0x86/0xb0
[  242.657389]  blkdev_fsync+0x1b/0x50
[  242.657394]  vfs_fsync_range+0x51/0xb0
[  242.657396]  blkdev_write_iter+0xff/0x130
[  242.657403]  __vfs_write+0xfe/0x170
[  242.657404]  ? check_disk_change+0x70/0x70
[  242.657405]  ? __vfs_write+0xfe/0x170
[  242.657407]  vfs_write+0xb1/0x1a0
[  242.657408]  SyS_write+0x55/0xc0
[  242.657414]  do_syscall_64+0x73/0x130
[  242.657418]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  242.657421] RIP: 0033:0x7fef71f12154
[  242.657422] RSP: 002b:00007ffee5965ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  242.657424] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fef71f12154
[  242.657425] RDX: 0000000000004000 RSI: 0000559b4c44d000 RDI: 0000000000000001
[  242.657426] RBP: 0000000000004000 R08: 0000000000000077 R09: 0000000000000000
[  242.657426] R10: 0000559b4c44b010 R11: 0000000000000246 R12: 0000559b4c44d000
[  242.657427] R13: 0000000000000000 R14: 0000000000000000 R15: 0000559b4c44d000
[  316.384143] EXT4-fs (sda2): error count since last fsck: 2
[  316.384185] EXT4-fs (sda2): initial error at time 1592839558: ext4_journal_check_start:61
[  316.384192] EXT4-fs (sda2): last error at time 1592839558: ext4_journal_check_start:61

Also, it is not possible to detach the first device afterwards.
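For completeness, this is roughly how the QMP commands in steps 2 and 4 (and the detach attempt) can be pushed into the monitor socket from the host. A minimal sketch, assuming socat is installed; the socket path is the one passed to -qmp in step 1, and the helper script name is made up:

$ cat qmp_send.sh
#!/bin/bash
# Send the QMP commands given on stdin to the qemu monitor socket.
# QMP requires the qmp_capabilities handshake before any other command.
QMP_SOCK=${QMP_SOCK:-/tmp/qmp90.sock}
{
    echo '{"execute": "qmp_capabilities"}'
    cat -                  # the actual commands, one JSON object per line
    sleep 1                # leave qemu a moment to write the replies back
} | socat - UNIX-CONNECT:"${QMP_SOCK}"

The detach attempt that never completes for the stuck device is the usual one:

$ echo '{"execute": "device_del", "arguments": {"id": "virtio-disk1"}}' | ./qmp_send.sh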
If I understand correctly, this is a request to backport 51c48b310183 ("PCI: Probe bridge window attributes once at enumeration-time") to stable kernels. That sounds OK to me. Which specific kernels should it go to? The stack trace above looks like v4.15.18. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git does have a v4.15.18 tag, but it's two years old and I don't think that series is maintained any longer. It looks like v4.14 and v4.19 are possible candidates. 51c48b310183 appeared in v5.1, so it doesn't need to be backported to any v5.x kernels (except maybe v5.0, but there's no stable series for it either).
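Whether the commit still applies cleanly to those series is easy to check; purely a sketch against a local clone of the stable tree, assuming the usual linux-X.Y.y branch names:

$ git fetch origin linux-4.19.y linux-4.14.y
$ git checkout -b test-4.19 origin/linux-4.19.y
$ # -x records the upstream commit id in the backported changelog
$ git cherry-pick -x 51c48b310183ab6ba5419edfc6a8de889cc04521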
Yes, you are correct: it is a request to move the patch to stable. Good point about the v4.15.18 tag being two years old. Having it at least in v4.14 and v4.19 would be nice.
Sorry, I'm not familiar with the process of moving changes to the linux-stable branches. Am I correct that my next step is to write to the stable mailing list with a link to this bugzilla ticket? Or maybe you can push this fix to the v4.14 and v4.19 branches yourself. Please advise.