Bug 194911
| Summary: | "virtio_pci: use shared interrupts for virtqueues" (5c34d002dcc7) leads to random crashes on a CentOS7 host | | |
|---|---|---|---|
| Product: | Other | Reporter: | The Linux kernel's regression tracker (Thorsten Leemhuis) (regressions) |
| Component: | Other | Assignee: | other_other |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | adamw, hch, m.s.tsirkin, rjones |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.11-rc | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Attachments: | bisect.sh | | |
Description
The Linux kernel's regression tracker (Thorsten Leemhuis)
2017-03-17 18:59:31 UTC
There seems to be a general pattern of people getting a GPF during early boot, and a boot failure, in testing Fedora 26 images at present, though the tracebacks they get seem to vary quite a lot. I'm not sure if these are all the same case as this bug, but they may be.

One case is reported by Richard Jones here:

https://bugzilla.redhat.com/show_bug.cgi?id=1430297

that one seems to be at least somewhat reproducible, by running:

    LIBGUESTFS_BACKEND=direct libguestfs-test-tool

from an F26 host.

I've seen similar things happen in VMs I've run here just for testing, sometimes. And now we have a bunch of people doing some F26 tests 'live' at present, and apparently some of them are running into the same general problem.

I *think* so far all the cases we know of involve VMs - I don't believe we've had any cases on bare metal yet. I'll update if we do see any.

Created attachment 255439
bisect.sh

I bisected this to the following commit:

07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507 is the first bad commit
commit 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507
Author: Christoph Hellwig <hch@lst.de>
Date: Sun Feb 5 18:15:19 2017 +0100

    virtio_pci: use shared interrupts for virtqueues

    This lets IRQ layer handle dispatching IRQs to separate handlers for the case where we don't have per-VQ MSI-X vectors, and allows us to greatly simplify the code based on the assumption that we always have interrupt vector 0 (legacy INTx or config interrupt for MSI-X) available, and any other interrupt is request/freed throught the VQ, even if the actual interrupt line might be shared in some cases.

    This allows removing a great deal of variables keeping track of the interrupt state in struct virtio_pci_device, as we can now simply walk the list of VQs and deal with per-VQ interrupt handlers there, and only treat vector 0 special.

    Additionally clean up the VQ allocation code to properly unwind on error instead of having a single global cleanup label, which is error prone, and in this case also leads to more code.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

:040000 040000 79a8267ffb73f9d244267c5f68365305bddd4696 8832a160b978710bbd24ba6966f462b3faa27fcc M drivers

This is the very same commit already pointed to in the description of this bug.

I'm also attaching the bisection script that I used, partly so I remember what I did if I come back to this in future, but also because it might be of interest to others.

It sounds like https://lkml.org/lkml/2017/3/23/38 is the fix for this, and is now queued up for pushing. Status should probably be updated?
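
For readers unfamiliar with the idea behind the bisected commit, the following is a toy user-space model of interrupt-line sharing, written in Python purely as an illustration. It is not kernel code and not the virtio_pci implementation: `IrqLine`, `VirtQueue` and `IrqReturn` are hypothetical names. The model assumes that every handler registered on a shared line is called when the line fires, and that each handler checks only its own queue and reports whether it did any work, similar in spirit to the kernel's IRQ_NONE / IRQ_HANDLED return values.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class IrqReturn(Enum):
    NONE = 0     # handler found no work for its own device/queue
    HANDLED = 1  # handler serviced its own device/queue


@dataclass
class VirtQueue:
    """Hypothetical stand-in for a virtqueue with some pending work."""
    name: str
    pending: int = 0
    serviced: int = 0

    def interrupt(self) -> IrqReturn:
        # A per-VQ handler only looks at its own queue.
        if self.pending == 0:
            return IrqReturn.NONE
        self.serviced += self.pending
        self.pending = 0
        return IrqReturn.HANDLED


@dataclass
class IrqLine:
    """Toy model of one interrupt line that several handlers may share."""
    handlers: List[Callable[[], IrqReturn]] = field(default_factory=list)

    def request(self, handler: Callable[[], IrqReturn]) -> None:
        self.handlers.append(handler)

    def free(self, handler: Callable[[], IrqReturn]) -> None:
        self.handlers.remove(handler)

    def fire(self) -> IrqReturn:
        # On a shared line every registered handler is invoked; the line
        # counts as handled if at least one handler claimed the interrupt.
        results = [handler() for handler in self.handlers]
        if IrqReturn.HANDLED in results:
            return IrqReturn.HANDLED
        return IrqReturn.NONE


if __name__ == "__main__":
    line = IrqLine()
    rx, tx = VirtQueue("rx"), VirtQueue("tx")
    line.request(rx.interrupt)
    line.request(tx.interrupt)

    rx.pending = 3  # only the rx queue has work
    print(line.fire(), rx.serviced, tx.serviced)
```

In this model, per-queue state lives with the queue and the dispatcher needs no bookkeeping beyond its handler list, which is the kind of simplification the commit message describes. Whether the crashes reported here come from this part of the change is not something the sketch can show.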