Bug 208371

Summary: virtio-blk-pci device can stack forever waiting for the block request completion
Product: Drivers Reporter: Dima Stepanov (dimastep)
Component: PCIAssignee: drivers_pci (drivers_pci)
Status: NEW ---    
Severity: normal CC: bjorn
Priority: P1    
Hardware: All   
OS: Linux   
URL: https://lkml.org/lkml/2019/1/22/906
Kernel Version: 4.15, 4.19 (in general any below 5.3) Subsystem:
Regression: No Bisected commit-id:

Description Dima Stepanov 2020-06-29 09:35:52 UTC
In our cloud infrastructure we experienced an issue with the hot plugging of the virtio-blk PCI devices:
 - make some hard load on one device
 - during this work load, attach new device using the same PCI bus
During hot plugging the guest will "change" the window registers. This leads to ioevent notifiers disable on the host side. As a result if some notifiers were missed on the first device during highload it can stack like forever. It can be workarounded by updating the guest kernel or by using some kind of polling on the host side. But anyway it looks like an issue inside the guest kernel.

This problem was resolved with the following commit:
commit 51c48b310183ab6ba5419edfc6a8de889cc04521
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Sat Jan 19 11:35:04 2019 -0600

    PCI: Probe bridge window attributes once at enumeration-time

This patch was applied to stable since 5.4. And it will be really great to see this patch and behaviour in the earlier kernel versions. Could it be applied to the linux-stable repository for some earlier kernels (4.x and 5.x)?
Comment 1 Dima Stepanov 2020-06-29 10:02:21 UTC
The lkml discussion: https://lkml.org/lkml/2019/1/22/906
There was an idea to file a new issue on bugzilla, but i didn't find it.

The steps how it can be reproduced with kvm/qemu stack and Ubuntu 18.04 guest kernel (4.15 or any below 5.4):

1. Run qemu with the following PCI topology:
$ cat run_linux_dev.sh 
#!/bin/bash

QEMU_BINARY=./x86_64-softmmu/qemu-system-x86_64

PCI_TOPOLOGY="-device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1,numa_node=0 \
    -device pcie-root-port,id=s0,slot=0,bus=pcie.1 \
    -device pcie-root-port,id=s1,slot=1,bus=pcie.1 \
    -device pcie-root-port,id=s2,slot=2,bus=pcie.1 \
    -device pcie-root-port,id=s3,slot=3,bus=pcie.1 \
    -device pcie-root-port,id=s4,slot=4,bus=pcie.1 \
    -device pcie-root-port,id=s5,slot=5,bus=pcie.1 \
    -device pcie-root-port,id=s6,slot=6,bus=pcie.1 \
    -device pcie-root-port,id=s7,slot=7,bus=pcie.1 \
    -device pxb-pcie,bus_nr=137,bus=pcie.0,id=pcie.2,numa_node=0 \
    -device pcie-root-port,id=s8,slot=8,bus=pcie.2 \
    -device pcie-root-port,id=s9,slot=9,bus=pcie.2 \
    -device pcie-root-port,id=s10,slot=10,bus=pcie.2 \
    -device pcie-root-port,id=s11,slot=11,bus=pcie.2 \
    -device pcie-root-port,id=s12,slot=12,bus=pcie.2 \
    -device pcie-root-port,id=s13,slot=13,bus=pcie.2 \
    -device pcie-root-port,id=s14,slot=14,bus=pcie.2 \
    -device pcie-root-port,id=s15,slot=15,bus=pcie.2 \
    -device ioh3420,id=rootport1,bus=pcie.0,multifunction=on,chassis=1 \
    -device ioh3420,id=rootport2,bus=pcie.0,multifunction=on,chassis=2 \
    -device pcie-pci-bridge,id=pci.1,bus=rootport1 \
    -device pcie-pci-bridge,id=pci.2,bus=rootport2"

$QEMU_BINARY -machine q35,accel=kvm,usb=off \
    -name attachtest,debug-threads=on -cpu host \
    -smp 8,sockets=1,cores=8,threads=1 \
    -enable-kvm -nodefaults -no-user-config \
    -m 8192M -object memory-backend-file,id=ram-node0,size=8192M,host-nodes=0,policy=bind,mem-path=/dev/shm/vm90.mem,share=on \
    -numa node,cpus=0-0,memdev=ram-node0 \
    -netdev user,id=netdev0,hostfwd=tcp::20090-:22 \
    -device e1000,netdev=netdev0 \
    -vnc 127.0.0.1:90 -chardev socket,id=charserial0,server,nowait,telnet,host=127.0.0.1,port=30090 \
    -device isa-serial,chardev=charserial0,id=serial0 -device VGA,id=vga0,vgamem_mb=16 \
    -qmp unix:/tmp/qmp90.sock,server,nowait \
    -hda <path_to_guest_image> \
    $PCI_TOPOLOGY

2. Attach one disk using QMP commands:
{"execute": "object-add", "arguments": {"id": "iot1", "qom-type": "iothread"}}

{"execute": "blockdev-add", "arguments": { "node-name": "node-name-disk1", "cache": {"direct": true}, "detect-zeroes": "on", "driver": "raw", "file": { "driver": "file", "filename": "<path_to_first_disk>" } } }

{"execute": "device_add", "arguments": {"addr": "3", "bus": "pci.1", "id": "virtio-disk1", "write-cache": "off", "driver": "virtio-blk-pci", "disable-legacy": "off", "num-queues": "2", "drive": "node-name-disk1", "werror": "report", "config-wce": "off", "rerror": "report", "serial": "cloud-disk1", "iothread": "iot1"}}

3. Run some non-stop workload for the hot plugged device. It can be the script below. For our case we used some random 64mb file and it is important to make a direct sync:
$ cat cp.sh 
#!/bin/bash

DISKNAME=$1
echo "Load: ${DISKNAME}"

while true; do
        while [ ! -b ${DISKNAME} ] ; do
                touch ./file1
        done
        sudo dd if=./file1 of=${DISKNAME} bs=16384 oflag=sync
done
$ ./cp.sh /dev/vda
For good reproduce we use the following transfer rate:
 16777216 bytes (17 MB, 16 MiB) copied, 0.298746 s, 56.2 MB/s

4. During this non-stop load, attach second device:
{"execute": "object-add", "arguments": {"id": "iot2", "qom-type": "iothread"}}

{"execute": "blockdev-add", "arguments": { "node-name": "node-name-disk2", "cache": {"direct": true}, "detect-zeroes": "on", "driver": "raw", "file": { "driver": "file", "filename": "<path_to_second_disk>" } } }

{"execute": "device_add", "arguments": {"addr": "4", "bus": "pci.1", "id": "virtio-disk2", "write-cache": "off", "driver": "virtio-blk-pci", "disable-legacy": "off", "num-queues": "2", "drive": "node-name-disk2", "werror": "report", "config-wce": "off", "rerror": "report", "serial": "cloud-disk2", "iothread": "iot2"}}

As a result the first drive will stack with the dmesg log showed below:
[   45.919147] shpchp 0000:01:00.0: PCI slot #4-1 - powering on due to button press                           [25/9971]
[   52.192207] pci 0000:02:04.0: [1af4:1001] type 00 class 0x010000
[   52.192401] pci 0000:02:04.0: reg 0x10: [io  0x0000-0x003f]
[   52.192446] pci 0000:02:04.0: reg 0x14: [mem 0x00000000-0x00000fff]
[   52.192607] pci 0000:02:04.0: reg 0x20: [mem 0x00000000-0x00003fff 64bit pref]
[   52.196334] pci 0000:02:04.0: BAR 4: assigned [mem 0xfc204000-0xfc207fff 64bit pref]
[   52.196399] pci 0000:02:04.0: BAR 1: assigned [mem 0xfc401000-0xfc401fff]
[   52.196417] pci 0000:02:04.0: BAR 0: assigned [io  0xc040-0xc07f]
[   52.196435] shpchp 0000:01:00.0: PCI bridge to [bus 02]
[   52.196443] shpchp 0000:01:00.0:   bridge window [io  0xc000-0xcfff]
[   52.200299] shpchp 0000:01:00.0:   bridge window [mem 0xfc400000-0xfc5fffff]
[   52.202845] shpchp 0000:01:00.0:   bridge window [mem 0xfc200000-0xfc3fffff 64bit pref]
[   52.208570] virtio-pci 0000:02:04.0: enabling device (0000 -> 0003)
[  242.656133] INFO: task dd:1561 blocked for more than 120 seconds.
[  242.656521]       Not tainted 4.15.18-cloud21681.01 #1
[  242.656827] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.657297] dd              D    0  1561   1559 0x00000000
[  242.657301] Call Trace:
[  242.657343]  __schedule+0x297/0x8b0
[  242.657346]  schedule+0x2c/0x80
[  242.657354]  io_schedule+0x16/0x40
[  242.657366]  wait_on_page_bit_common+0xd8/0x160
[  242.657369]  ? page_cache_tree_insert+0xe0/0xe0
[  242.657370]  __filemap_fdatawait_range+0xfa/0x160
[  242.657373]  file_write_and_wait_range+0x86/0xb0
[  242.657389]  blkdev_fsync+0x1b/0x50
[  242.657394]  vfs_fsync_range+0x51/0xb0
[  242.657396]  blkdev_write_iter+0xff/0x130
[  242.657403]  __vfs_write+0xfe/0x170
[  242.657404]  ? check_disk_change+0x70/0x70
[  242.657405]  ? __vfs_write+0xfe/0x170
[  242.657407]  vfs_write+0xb1/0x1a0
[  242.657408]  SyS_write+0x55/0xc0
[  242.657414]  do_syscall_64+0x73/0x130
[  242.657418]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  242.657421] RIP: 0033:0x7fef71f12154
[  242.657422] RSP: 002b:00007ffee5965ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  242.657424] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fef71f12154
[  242.657425] RDX: 0000000000004000 RSI: 0000559b4c44d000 RDI: 0000000000000001
[  242.657426] RBP: 0000000000004000 R08: 0000000000000077 R09: 0000000000000000
[  242.657426] R10: 0000559b4c44b010 R11: 0000000000000246 R12: 0000559b4c44d000
[  242.657427] R13: 0000000000000000 R14: 0000000000000000 R15: 0000559b4c44d000
[  316.384143] EXT4-fs (sda2): error count since last fsck: 2
[  316.384185] EXT4-fs (sda2): initial error at time 1592839558: ext4_journal_check_start:61
[  316.384192] EXT4-fs (sda2): last error at time 1592839558: ext4_journal_check_start:61

Also it will be not possible to detach the first device.
Comment 2 Bjorn Helgaas 2020-06-29 19:10:00 UTC
If I understand correctly, this is a request to backport 51c48b310183 ("PCI: Probe bridge window attributes once at enumeration-time") to stable kernels.

That sounds OK to me.  Which specific kernels should it go to?  The stack trace above looks like v4.15.18.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git does have a v4.15.18 tag, but it's two years old and I don't think that series is maintained any longer.

It looks like v4.14 and v4.19 are possible candidates.

51c48b310183 appeared in v5.1, so it doesn't need to be backported to any v5.x kernels (except maybe v5.0, but there's no stable series for it either).
Comment 3 Dima Stepanov 2020-06-30 06:45:27 UTC
Yes, you are correct: it is a request of moving a patch to stable.

Good point about v4.15.18 tag which is two years old. Well at least v4.14 and v4.19 will be nice to have.
Comment 4 Dima Stepanov 2020-07-13 09:02:20 UTC
Sorry, i'm not familiar with the process of moving the changes to the linux-stable branch. Am i correct that my next step is to write in the stable ml with the link to the bugzilla ticket? Or maybe you can push these fix to v4.14 and v4.19 branches yourself.

Please advise.