Saw random crashes during boot, roughly every few boot attempts, when running Linux 4.11-rc/mainline in a Fedora 26 guest under a CentOS 7 host (CPU: Intel(R) Pentium(R) CPU G3220) using KVM. Sometimes, when the guest did boot, the network did not work. For an impression of the crashes, see this gallery: https://plus.google.com/+ThorstenLeemhuis/posts/FjyyGjNtrrG
Long story short: I did a bisection and found that https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") is the first bad commit. I tried reverting it on mainline as of today, but failed due to rejects I could not resolve quickly. FWIW: yesterday I reverted the whole vhost merge from the merge window on top of yesterday's mainline, and that fixed the issue.
Any idea what might be wrong? Do you need more details from my side to fix this?
There seems to be a general pattern of people getting a GPF during early boot, and a boot failure, in testing Fedora 26 images at present, though the tracebacks they get seem to vary quite a lot. I'm not sure if these are all the same case as this bug, but they may be.
One case is reported by Richard Jones here:
that one seems to be at least somewhat reproducible, by running:
from an F26 host. I've sometimes seen similar things happen in VMs I've run here just for testing. We now also have a bunch of people doing some F26 tests 'live', and apparently some of them are running into the same general problem.
I *think* so far all the cases we know of involve VMs - I don't believe we've had any cases on bare metal yet. I'll update if we do see any.
Created attachment 255439
I bisected this to the following commit:
07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 is the first bad commit
Author: Christoph Hellwig <firstname.lastname@example.org>
Date: Sun Feb 5 18:15:19 2017 +0100
virtio_pci: use shared interrupts for virtqueues
This lets IRQ layer handle dispatching IRQs to separate handlers for the
case where we don't have per-VQ MSI-X vectors, and allows us to greatly
simplify the code based on the assumption that we always have interrupt
vector 0 (legacy INTx or config interrupt for MSI-X) available, and
any other interrupt is request/freed throught the VQ, even if the
actual interrupt line might be shared in some cases.
This allows removing a great deal of variables keeping track of the
interrupt state in struct virtio_pci_device, as we can now simply walk the
list of VQs and deal with per-VQ interrupt handlers there, and only treat
vector 0 special.
Additionally clean up the VQ allocation code to properly unwind on error
instead of having a single global cleanup label, which is error prone,
and in this case also leads to more code.
Signed-off-by: Christoph Hellwig <email@example.com>
Signed-off-by: Michael S. Tsirkin <firstname.lastname@example.org>
:040000 040000 79a8267ffb73f9d244267c5f68365305bddd4696 8832a160b978710bbd24ba6966f462b3faa27fcc M drivers
This is the very same commit already pointed to in the description
of this bug.
I'm also attaching the bisection script that I used, partly so
I remember what I did if I come back to this in future, but also
because it might be of interest to others.
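For anyone else bisecting an intermittent boot failure like this one: since the guest only crashes every few boots, a single successful boot does not prove a commit is good, so each bisection step needs several boot attempts. The following is NOT the attached script, only a minimal sketch of that idea for use with `git bisect run`. The boot command, the attempt count, and the timeout are all assumptions, and the kernel build/install step is left out.

```python
import subprocess
import sys

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def boot_once(cmd, timeout_s):
    """Run one boot attempt; True only if cmd exits 0 within timeout_s."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        # A guest that hangs counts as a failed boot.
        return False


def verdict(build_ok, boot_results):
    """Map one commit's build and boot outcomes to a bisect exit code."""
    if not build_ok:
        return SKIP  # untestable commit, let bisect pick another one
    return GOOD if all(boot_results) else BAD


def main(argv=None, attempts=10, timeout_s=300):
    """argv is a hypothetical command that boots the guest exactly once."""
    argv = sys.argv[1:] if argv is None else argv
    results = []
    for _ in range(attempts):
        ok = boot_once(argv, timeout_s)
        results.append(ok)
        if not ok:
            break  # one failed boot is enough to call the commit bad
    return verdict(True, results)
```

Ending the file with `sys.exit(main())` and running something like `git bisect run python3 bisect_boot.py ./boot-guest.sh` would drive the bisection; both file names are made up for illustration. The number of attempts trades bisection time against the risk of marking a bad commit good, so it should be chosen from how often the crash shows up on a known-bad kernel.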
It sounds like https://lkml.org/lkml/2017/3/23/38 is the fix for this, and is now queued up for pushing. Status should probably be updated?