Bug 187351
Summary: | unhandled irq stack trace during shutdown | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Prarit Bhargava (prarit) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | NEW --- | ||
Severity: | normal | CC: | bjorn, myron.stowe, rui.y.wang |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.9.0 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Boot log for unhandled irq |
https://bugzilla.kernel.org/show_bug.cgi?id=113611 Other reports which you may not be able to see. If you cannot please let me know. https://bugzilla.redhat.com/show_bug.cgi?id=1149659 https://bugzilla.redhat.com/show_bug.cgi?id=1371929 https://bugzilla.redhat.com/show_bug.cgi?id=1340613 https://bugzilla.redhat.com/show_bug.cgi?id=1371929 Initial patch discussion: https://patchwork.kernel.org/patch/5990701/ and initial RFE, http://marc.info/?l=linux-pci&m=147705209308588&w=2 and follow-up resubmit of initial patch: http://marc.info/?l=linux-pci&m=147862787814760&w=2 P. Bjorn, I have attached a full boot file from an affected system as part of the bugzilla Description. If this is not sufficient to debug, please let me know. P. Some debug information: The issue is that pci_device_shutdown() is called on each PCI device, and does if (drv && drv->shutdown) drv->shutdown(pci_dev); pci_msi_shutdown(pci_dev); pci_msix_shutdown(pci_dev); The pci_msi_shutdown() and pci_msix_shutdown() functions both call pci_intx_for_msi() which enables the INTx interrupt asynchronously of the driver. The problem is that the driver may not have a shutdown function and the device remains active. The driver continues to operate the PCI device and the device interrupts to generate INTx. The driver, however, has not registered a handler for INTx and the interrupt line remains set which leads to an unhandled IRQ warning. This can be debugged by adding a 3 second delay in pci_device_shutdown() diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 1ccce1cd6aca..74776ca35e0c 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -20,6 +20,7 @@ #include <linux/pm_runtime.h> #include <linux/suspend.h> #include <linux/kexec.h> +#include <linux/delay.h> #include "pci.h" struct pci_dynid { @@ -461,10 +462,19 @@ static void pci_device_shutdown(struct device *dev) pm_runtime_resume(dev); + /* + * Mika, on the 4S the device that had the problem was the + * GT218 [GeForce 210] Video card running the nouveau driver. + * The nouveau driver does not use INTX at all IIRC. + */ + printk(KERN_EMERG "Calling shutdown for device %s\n", dev_name(dev)); if (drv && drv->shutdown) drv->shutdown(pci_dev); + pci_msi_shutdown(pci_dev); pci_msix_shutdown(pci_dev); + mdelay(3000); /* wait for unhandled irq */ + printk(KERN_EMERG "Completed shutdown for device %s\n", dev_name(dev)); /* * If this is a kexec reboot, turn off Bus Master bit on the and is highly reproducible on systems that have NVIDIA cards in them. P. Enabling INTx may be needed for every PCI device because system shutdown is passing control to unknown OS or firmware. Here's a patch trying to fix it from a different perspective. From 8f6d102e0a1ebd906f113532ec805fa395d12a3e Mon Sep 17 00:00:00 2001 From: Rui Wang <rui.y.wang@intel.com> Date: Fri, 25 Nov 2016 03:26:07 -0500 Subject: [PATCH] pci/msi: Setup INTx before pci_intx_for_msi() Migrate MSI actions to INTx if the driver doesn't have shutdown() to quiesce the hardware, in order to prevent IRQ storm during system shutdown. Signed-off-by: Rui Wang <rui.y.wang@intel.com> --- drivers/pci/msi.c | 35 +++++++++++++++++++++++++++++++++++ 1 files changed, 35 insertions(+), 0 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 98f1222..1be5645 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -886,6 +886,39 @@ int pci_msi_vec_count(struct pci_dev *dev) } EXPORT_SYMBOL(pci_msi_vec_count); +void pci_msi_to_intx(struct pci_dev *dev) +{ + struct irq_desc *desc; + struct msi_desc *entry; + struct irqaction *act; + unsigned int intx_irq; + struct pci_driver *drv = dev->driver; + + /* + * If it has shutdown() it's supposed to quiesce the hw. + */ + if (drv && drv->shutdown) + return; + + /* Only the first entry since the device can only assert 1 INTx */ + entry = first_pci_msi_entry(dev); + if (!entry) + return; + intx_irq = entry->msi_attrib.default_irq; + if (entry->irq) { + desc = irq_to_desc(entry->irq); + act = desc->action; + while (act) { + irq_set_affinity_hint(entry->irq, NULL); + /* Remove & reuse the action without freeing it */ + remove_irq(entry->irq, act); + act->flags |= IRQF_SHARED; + setup_irq(intx_irq, act); + act = act->next; + } + } +} + void pci_msi_shutdown(struct pci_dev *dev) { struct msi_desc *desc; @@ -898,6 +931,7 @@ void pci_msi_shutdown(struct pci_dev *dev) desc = first_pci_msi_entry(dev); pci_msi_set_enable(dev, 0); + pci_msi_to_intx(dev); pci_intx_for_msi(dev, 1); dev->msi_enabled = 0; @@ -1006,6 +1040,7 @@ void pci_msix_shutdown(struct pci_dev *dev) } pci_msix_clear_and_set_ctrl(dev, PCI_MSIX_FLAGS_ENABLE, 0); + pci_msi_to_intx(dev); pci_intx_for_msi(dev, 1); dev->msix_enabled = 0; pcibios_alloc_irq(dev); -- 1.7.5.4 (In reply to Rui Wang from comment #4) > Enabling INTx may be needed for every PCI device because system shutdown is > passing control to unknown OS or firmware. Here's a patch trying to fix it > from a different perspective. > Thanks Rui. A few questions. 1. Do you know of any systems that have problems like this? My argument is that prior to d52877c7b1af ("pci/irq: let pci_device_shutdown to call pci_msi_shutdown v2") there were no reports of faulty OS or firmware switching that I can find. 2. Shouldn't the reset (at least on systems within the past 10 years) cover this situation? P. (In reply to Prarit Bhargava from comment #5) > 1. Do you know of any systems that have problems like this? My argument is > that prior to d52877c7b1af ("pci/irq: let pci_device_shutdown to call > pci_msi_shutdown v2") there were no reports of faulty OS or firmware > switching that I can find. > No I don't. I was just theorizing. > 2. Shouldn't the reset (at least on systems within the past 10 years) cover > this situation? > kexec doesn't do reset. As for firmware, I think the firmware can choose not to do a reset immediately, but to continue executing for a while. e.g. sense the temperature to decide if the fan needs to continue turning. And it is not limited to PC but can be anything (automobile, drone...) |
Created attachment 244081 [details] Boot log for unhandled irq The following stack trace for an unhandled irq occurs on system shutdown: irq 16: nobody cared (try booting with the "irqpoll" option) CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-0.rc4.git0.1.el7_UNSUPPORTED.x86_64 #1 Hardware name: Hewlett-Packard HP Z820 Workstation/158B, BIOS J63 v03.90 06/01/2016 ffff88041f803eb0 ffffffff81337765 ffff8804186dbc00 ffff8804186dbc9c ffff88041f803ed8 ffffffff810db935 ffff8804186dbc00 0000000000000000 0000000000000010 ffff88041f803f10 ffffffff810dbc8f ffff8804186dbc00 Call Trace: <IRQ> [<ffffffff81337765>] dump_stack+0x63/0x8e [<ffffffff810db935>] __report_bad_irq+0x35/0xd0 [<ffffffff810dbc8f>] note_interrupt+0x20f/0x260 [<ffffffff810d8ff5>] handle_irq_event_percpu+0x45/0x60 [<ffffffff810d903c>] handle_irq_event+0x2c/0x50 [<ffffffff810dc9ba>] handle_fasteoi_irq+0x8a/0x150 [<ffffffff8102ee0b>] handle_irq+0xab/0x130 [<ffffffff8109de6a>] ? atomic_notifier_call_chain+0x1a/0x20 [<ffffffff8103631d>] ? __exit_idle+0x2d/0x30 [<ffffffff817158ad>] do_IRQ+0x4d/0xd0 [<ffffffff81713902>] common_interrupt+0x82/0x82 <EOI> [<ffffffff815e34d1>] ? cpuidle_enter_state+0xc1/0x280 [<ffffffff815e34c4>] ? cpuidle_enter_state+0xb4/0x280 [<ffffffff815e36c7>] cpuidle_enter+0x17/0x20 [<ffffffff810c1813>] call_cpuidle+0x23/0x40 [<ffffffff810c1a79>] cpu_startup_entry+0x149/0x240 [<ffffffff81705cd7>] rest_init+0x77/0x80 [<ffffffff81d8a147>] start_kernel+0x495/0x4a2 [<ffffffff81d89aa0>] ? set_init_arg+0x55/0x55 [<ffffffff81d89120>] ? early_idt_handler_array+0x120/0x120 [<ffffffff81d895d6>] x86_64_start_reservations+0x2a/0x2c [<ffffffff81d89715>] x86_64_start_kernel+0x13d/0x14c handlers: [<ffffffff8155ace0>] usb_hcd_irq Disabling IRQ #16 P.