Bug 12876
Summary: | irq 18: nobody cared after down-ing an e1000 interface | ||
---|---|---|---|
Product: | Drivers | Reporter: | Lubomir Rintel (lkundrak) |
Component: | Network | Assignee: | drivers_network (drivers_network) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alan, jbrandeb |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.29-0.237.rc7.git4.fc11.i586 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | patch to eliminate races in close |
Description
Lubomir Rintel
2009-03-15 06:40:52 UTC
Reply-To: akpm@linux-foundation.org (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Sun, 15 Mar 2009 06:40:54 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=12876 > > Summary: irq 18: nobody cared after down-ing an e1000 interface > Product: Drivers > Version: 2.5 > KernelVersion: 2.6.29-0.237.rc7.git4.fc11.i586 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Network > AssignedTo: drivers_network@kernel-bugs.osdl.org > ReportedBy: lkundrak@v3.sk > > > Distribution: Fedora > Hardware Environment: > http://smolts.org/show?uuid=pub_63f33868-db84-4452-bb6d-eeb440ee5449 > > Problem Description: > > I get the following OOPS when I down the e1000 network interface: > > irq 18: nobody cared (try booting with the "irqpoll" option) > Pid: 9014, comm: udevd Tainted: G W 2.6.29-0.237.rc7.git4.fc11.i586 > #1 > Call Trace: > [<c0471f2b>] __report_bad_irq+0x33/0x74 > [<c047205e>] note_interrupt+0xf2/0x150 > [<c04711e0>] ? handle_IRQ_event+0x4f/0x58 > [<c04725e8>] handle_fasteoi_irq+0x94/0xb7 > [<c0472554>] ? handle_fasteoi_irq+0x0/0xb7 > <IRQ> [<c04045ac>] ? common_interrupt+0x2c/0x40 > handlers: > [<c05f3cb6>] (ata_sff_interrupt+0x0/0xb1) > [<c06079ec>] (usb_hcd_irq+0x0/0xa3) > Disabling IRQ #18 > > As you can see, the IRQ 18 is shared with ata_sff and usb_hcd. > > I've tried to comment out e1000_free_irq() call from e1000_close() and added > a > printk() to e1000_intr(). After that, the problem didn't occur, but I've not > seen the output of the printk in e1000_intr() (It was visible while the > interface was up), as if it weren't called. Seems like the way shared > interrupts work is beyond my understanding. > I have a suspicion what is going on here. e1000_irq_disable does a mask of the interrupt, then ensures it is flushed then waits for any running IRQ to complete (because a running IRQ my re-enable interrupts) That makes no sense because 1. It is possible for an interrupt to sneak in between the ew32() and the synchronize_irqs in which case we wait for the IRQ to be *re-enabled* in some situations 2. IRQ delivery is asynchronous via the APIC bus, so even if mask the interrupts, then flush them, then wait for any running IRQ on another processor to finish - we can still get a handled interrupt after that until free_irq returns Can you change e1000_free_irq something like this. It's not a fix but it will really narrow down the window any may help verify the theory static void e1000_free_irq(struct e1000_adapter *adapter) { struct net_device *netdev = adapter->netdev; + struct e1000_hw *hw = &adapter->hw; + er32(ICR); free_irq(adapter->pdev->irq, netdev); (In reply to comment #2) > Can you change e1000_free_irq something like this. It's not a fix but it will > really narrow down the window any may help verify the theory I can confirm, that with the patch applied I'm no longer seeing the "nobody cared" messages. Created attachment 20615 [details]
patch to eliminate races in close
please give this patch a try, it may not apply directly because it is against net-next, but the only difference should be the napi_complete vs netif_rx_complete.
It should be easy to hand apply if necessary. The stopping of the rx and tx units is also a related fix to this kind of issue.
thanks for testing Alan's fix it helps us believe the theory is correct about the interrupt re-enable and asserted after e1000_intr or e1000_clean completes. You don't happen to be connected to a very busy network do you? On Sun, 15 Mar 2009, Andrew Morton wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=12876
> > I get the following OOPS when I down the e1000 network interface:
> >
> > irq 18: nobody cared (try booting with the "irqpoll" option)
> > Pid: 9014, comm: udevd Tainted: G W 2.6.29-0.237.rc7.git4.fc11.i586
> #1
> > Call Trace:
> > [<c0471f2b>] __report_bad_irq+0x33/0x74
> > [<c047205e>] note_interrupt+0xf2/0x150
> > [<c04711e0>] ? handle_IRQ_event+0x4f/0x58
> > [<c04725e8>] handle_fasteoi_irq+0x94/0xb7
> > [<c0472554>] ? handle_fasteoi_irq+0x0/0xb7
> > <IRQ> [<c04045ac>] ? common_interrupt+0x2c/0x40
> > handlers:
> > [<c05f3cb6>] (ata_sff_interrupt+0x0/0xb1)
> > [<c06079ec>] (usb_hcd_irq+0x0/0xa3)
> > Disabling IRQ #18
> >
> > As you can see, the IRQ 18 is shared with ata_sff and usb_hcd.
> >
> > I've tried to comment out e1000_free_irq() call from e1000_close() and
> added a
> > printk() to e1000_intr(). After that, the problem didn't occur, but I've
> not
> > seen the output of the printk in e1000_intr() (It was visible while the
> > interface was up), as if it weren't called. Seems like the way shared
> > interrupts work is beyond my understanding.
Please try the patch attached to the bug, Alan Cox pointed out that it was
likely the driver was racy during close and was re-enabling interrupts.
I spotted two possible areas where this was possible and put in checks for
__E1000_DOWN before re-enabling ints in those spots.
I'm offline for a week and I'll reconnect and check back when I return.
> You don't happen to be connected to a very busy network do you? Nope. At the times I was writing this report, I believe there was nothing happening on the network apart from some mDNS traffice once a few seconds. > Please try the patch attached to the bug, Alan Cox pointed out that it was > likely the driver was racy during close and was re-enabling interrupts. > > I spotted two possible areas where this was possible and put in checks for > __E1000_DOWN before re-enabling ints in those spots. This seems to have fixed the problem -- I no longer getting any more interrupts from the adapter and it seems to work well. Thanks! This change was committed, I don't have rights to mark it as fixed. e1000 commit: a6c42322722976ca81e6d02e4a702f33d659d8fc change was also committed for e1000e. |