Bug 194911

Summary: "virtio_pci: use shared interrupts for virtqueues" (5c34d002dcc7) leads to random crashes on a CentOS7 host
Product: Other
Component: Other
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.11-rc
Subsystem:
Regression: Yes
Bisected commit-id:
Reporter: The Linux kernel's regression tracker (Thorsten Leemhuis) (regressions)
Assignee: other_other
CC: adamw, hch, m.s.tsirkin, rjones
Attachments: bisect.sh

Description The Linux kernel's regression tracker (Thorsten Leemhuis) 2017-03-17 18:59:31 UTC
I saw random crashes during boot, roughly every few boot attempts, when running Linux 4.11-rc/mainline in a Fedora 26 guest under KVM on a CentOS7 host (CPU: Intel(R) Pentium(R) CPU G3220). Sometimes, when the guest did boot, the network did not work. For an impression of the crashes I saw, see this gallery: https://plus.google.com/+ThorstenLeemhuis/posts/FjyyGjNtrrG

Long story short: I did a bisection and found that https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") is the first bad commit. I tried reverting it on mainline as of today, but failed due to a reject I could not resolve quickly. FWIW: yesterday I reverted the whole vhost merge from the merge window on top of yesterday's mainline, and that fixed the issue.

Any idea what might be wrong? Do you need more details from my side to fix this?
Comment 1 Adam Williamson 2017-03-21 23:22:22 UTC
There seems to be a general pattern of people getting a GPF during early boot, and a resulting boot failure, when testing Fedora 26 images at present, though the tracebacks they get vary quite a lot. I'm not sure whether these are all the same issue as this bug, but they may be.

One case is reported by Richard Jones here:

https://bugzilla.redhat.com/show_bug.cgi?id=1430297

That one seems to be at least somewhat reproducible by running:

LIBGUESTFS_BACKEND=direct libguestfs-test-tool

from an F26 host. I've sometimes seen similar things happen in VMs I run here just for testing. And now we have a bunch of people doing some F26 tests 'live' at present, and apparently some of them are running into the same general problem.

I *think* so far all the cases we know of involve VMs - I don't believe we've had any cases on bare metal yet. I'll update if we do see any.
Comment 2 Richard W.M. Jones 2017-03-22 21:17:41 UTC
Created attachment 255439 [details]
bisect.sh

I bisected this to the following commit:

07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 is the first bad commit
commit 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507
Author: Christoph Hellwig <hch@lst.de>
Date:   Sun Feb 5 18:15:19 2017 +0100

    virtio_pci: use shared interrupts for virtqueues
    
    This lets IRQ layer handle dispatching IRQs to separate handlers for the
    case where we don't have per-VQ MSI-X vectors, and allows us to greatly
    simplify the code based on the assumption that we always have interrupt
    vector 0 (legacy INTx or config interrupt for MSI-X) available, and
    any other interrupt is requested/freed through the VQ, even if the
    actual interrupt line might be shared in some cases.
    
    This allows removing a great deal of variables keeping track of the
    interrupt state in struct virtio_pci_device, as we can now simply walk the
    list of VQs and deal with per-VQ interrupt handlers there, and only treat
    vector 0 special.
    
    Additionally clean up the VQ allocation code to properly unwind on error
    instead of having a single global cleanup label, which is error prone,
    and in this case also leads to more code.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

:040000 040000 79a8267ffb73f9d244267c5f68365305bddd4696 8832a160b978710bbd24ba6966f462b3faa27fcc M	drivers

This is the very same commit already pointed to in the description
of this bug.

I'm also attaching the bisection script that I used, partly so
I remember what I did if I come back to this in future, but also
because it might be of interest to others.
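
For anyone skimming: the pattern the quoted commit message describes boils down to registering each virtqueue's interrupt handler with IRQF_SHARED and relying on that handler to return IRQ_NONE when its ring has no pending work, so the IRQ core tries the next handler registered on the shared line. Below is a minimal sketch of that idiom, not the actual patch; request_irq(), IRQF_SHARED, free_irq() and vring_interrupt() are real kernel APIs, while example_request_vq_irq() is a made-up helper name used purely for illustration.

#include <linux/interrupt.h>
#include <linux/virtio.h>
#include <linux/virtio_ring.h>

/* Hypothetical helper, for illustration only. */
static int example_request_vq_irq(unsigned int irq, struct virtqueue *vq,
                                  const char *name)
{
        /*
         * vring_interrupt() returns IRQ_NONE when this particular ring
         * raised no interrupt, so the IRQ core simply tries the next
         * handler registered on the line. That is what makes it safe
         * for several virtqueues (or other devices) to share one
         * interrupt line.
         */
        return request_irq(irq, vring_interrupt, IRQF_SHARED, name, vq);
}

The matching teardown is free_irq(irq, vq); the virtqueue pointer passed as the dev_id cookie tells the IRQ core which of the handlers sharing the line to remove.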
Comment 3 Adam Williamson 2017-03-23 21:13:14 UTC
It sounds like https://lkml.org/lkml/2017/3/23/38 is the fix for this, and is now queued up for pushing. Status should probably be updated?