Bug 12876 - irq 18: nobody cared after down-ing an e1000 interface
Summary: irq 18: nobody cared after down-ing an e1000 interface
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_network@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-15 06:40 UTC by Lubomir Rintel
Modified: 2009-03-31 12:12 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.29-0.237.rc7.git4.fc11.i586
Subsystem:
Regression: No
Bisected commit-id:


Attachments
patch to eliminate races in close (2.58 KB, patch)
2009-03-20 17:38 UTC, Jesse Brandeburg
Details | Diff

Description Lubomir Rintel 2009-03-15 06:40:52 UTC
Distribution: Fedora
Hardware Environment: http://smolts.org/show?uuid=pub_63f33868-db84-4452-bb6d-eeb440ee5449

Problem Description:

I get the following OOPS when I down the e1000 network interface:

irq 18: nobody cared (try booting with the "irqpoll" option)
Pid: 9014, comm: udevd Tainted: G        W  2.6.29-0.237.rc7.git4.fc11.i586 #1
Call Trace:
 [<c0471f2b>] __report_bad_irq+0x33/0x74
 [<c047205e>] note_interrupt+0xf2/0x150
 [<c04711e0>] ? handle_IRQ_event+0x4f/0x58
 [<c04725e8>] handle_fasteoi_irq+0x94/0xb7
 [<c0472554>] ? handle_fasteoi_irq+0x0/0xb7
 <IRQ>  [<c04045ac>] ? common_interrupt+0x2c/0x40
handlers:
[<c05f3cb6>] (ata_sff_interrupt+0x0/0xb1)
[<c06079ec>] (usb_hcd_irq+0x0/0xa3)
Disabling IRQ #18

As you can see, the IRQ 18 is shared with ata_sff and usb_hcd.

I've tried to comment out e1000_free_irq() call from e1000_close() and added a printk() to e1000_intr(). After that, the problem didn't occur, but I've not seen the output of the printk in e1000_intr() (It was visible while the interface was up), as if it weren't called. Seems like the way shared interrupts work is beyond my understanding.
Comment 1 Anonymous Emailer 2009-03-15 14:37:21 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun, 15 Mar 2009 06:40:54 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12876
> 
>            Summary: irq 18: nobody cared after down-ing an e1000 interface
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.29-0.237.rc7.git4.fc11.i586
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Network
>         AssignedTo: drivers_network@kernel-bugs.osdl.org
>         ReportedBy: lkundrak@v3.sk
> 
> 
> Distribution: Fedora
> Hardware Environment:
> http://smolts.org/show?uuid=pub_63f33868-db84-4452-bb6d-eeb440ee5449
> 
> Problem Description:
> 
> I get the following OOPS when I down the e1000 network interface:
> 
> irq 18: nobody cared (try booting with the "irqpoll" option)
> Pid: 9014, comm: udevd Tainted: G        W  2.6.29-0.237.rc7.git4.fc11.i586
> #1
> Call Trace:
>  [<c0471f2b>] __report_bad_irq+0x33/0x74
>  [<c047205e>] note_interrupt+0xf2/0x150
>  [<c04711e0>] ? handle_IRQ_event+0x4f/0x58
>  [<c04725e8>] handle_fasteoi_irq+0x94/0xb7
>  [<c0472554>] ? handle_fasteoi_irq+0x0/0xb7
>  <IRQ>  [<c04045ac>] ? common_interrupt+0x2c/0x40
> handlers:
> [<c05f3cb6>] (ata_sff_interrupt+0x0/0xb1)
> [<c06079ec>] (usb_hcd_irq+0x0/0xa3)
> Disabling IRQ #18
> 
> As you can see, the IRQ 18 is shared with ata_sff and usb_hcd.
> 
> I've tried to comment out e1000_free_irq() call from e1000_close() and added
> a
> printk() to e1000_intr(). After that, the problem didn't occur, but I've not
> seen the output of the printk in e1000_intr() (It was visible while the
> interface was up), as if it weren't called. Seems like the way shared
> interrupts work is beyond my understanding.
> 
Comment 2 Alan 2009-03-20 05:01:10 UTC
I have a suspicion what is going on here.

e1000_irq_disable does a mask of the interrupt, then ensures it is flushed then waits for any running IRQ to complete (because a running IRQ my re-enable interrupts)

That makes no sense because

1. It is possible for an interrupt to sneak in between the ew32() and the synchronize_irqs in which case we wait for the IRQ to be *re-enabled* in some situations

2. IRQ delivery is asynchronous via the APIC bus, so even if mask the interrupts, then flush them, then wait for any running IRQ on another processor to finish - we can still get a handled interrupt after that until free_irq returns

Can you change e1000_free_irq something like this. It's not a fix but it will really narrow down the window any may help verify the theory

 static void e1000_free_irq(struct e1000_adapter *adapter)
 {
 	struct net_device *netdev = adapter->netdev;
+	struct e1000_hw *hw = &adapter->hw;
 
+	er32(ICR);
 	free_irq(adapter->pdev->irq, netdev);
Comment 3 Lubomir Rintel 2009-03-20 15:48:17 UTC
(In reply to comment #2)
> Can you change e1000_free_irq something like this. It's not a fix but it will
> really narrow down the window any may help verify the theory

I can confirm, that with the patch applied I'm no longer seeing the "nobody cared" messages.
Comment 4 Jesse Brandeburg 2009-03-20 17:38:51 UTC
Created attachment 20615 [details]
patch to eliminate races in close

please give this patch a try, it may not apply directly because it is against net-next, but the only difference should be the napi_complete vs netif_rx_complete.

It should be easy to hand apply if necessary.  The stopping of the rx and tx units is also a related fix to this kind of issue.
Comment 5 Jesse Brandeburg 2009-03-20 17:40:26 UTC
thanks for testing Alan's fix it helps us believe the theory is correct about the interrupt re-enable and asserted after e1000_intr or e1000_clean completes.

You don't happen to be connected to a very busy network do you?
Comment 6 Jesse Brandeburg 2009-03-20 22:06:46 UTC
On Sun, 15 Mar 2009, Andrew Morton wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=12876
> > I get the following OOPS when I down the e1000 network interface:
> > 
> > irq 18: nobody cared (try booting with the "irqpoll" option)
> > Pid: 9014, comm: udevd Tainted: G        W  2.6.29-0.237.rc7.git4.fc11.i586
> #1
> > Call Trace:
> >  [<c0471f2b>] __report_bad_irq+0x33/0x74
> >  [<c047205e>] note_interrupt+0xf2/0x150
> >  [<c04711e0>] ? handle_IRQ_event+0x4f/0x58
> >  [<c04725e8>] handle_fasteoi_irq+0x94/0xb7
> >  [<c0472554>] ? handle_fasteoi_irq+0x0/0xb7
> >  <IRQ>  [<c04045ac>] ? common_interrupt+0x2c/0x40
> > handlers:
> > [<c05f3cb6>] (ata_sff_interrupt+0x0/0xb1)
> > [<c06079ec>] (usb_hcd_irq+0x0/0xa3)
> > Disabling IRQ #18
> > 
> > As you can see, the IRQ 18 is shared with ata_sff and usb_hcd.
> > 
> > I've tried to comment out e1000_free_irq() call from e1000_close() and
> added a
> > printk() to e1000_intr(). After that, the problem didn't occur, but I've
> not
> > seen the output of the printk in e1000_intr() (It was visible while the
> > interface was up), as if it weren't called. Seems like the way shared
> > interrupts work is beyond my understanding.

Please try the patch attached to the bug, Alan Cox pointed out that it was 
likely the driver was racy during close and was re-enabling interrupts.

I spotted two possible areas where this was possible and put in checks for 
__E1000_DOWN before re-enabling ints in those spots.

I'm offline for a week and I'll reconnect and check back when I return.
Comment 7 Lubomir Rintel 2009-03-21 00:17:08 UTC
> You don't happen to be connected to a very busy network do you?

Nope. At the times I was writing this report, I believe there was nothing happening on the network apart from some mDNS traffice once a few seconds.

> Please try the patch attached to the bug, Alan Cox pointed out that it was 
> likely the driver was racy during close and was re-enabling interrupts.
> 
> I spotted two possible areas where this was possible and put in checks for 
> __E1000_DOWN before re-enabling ints in those spots.

This seems to have fixed the problem -- I no longer getting any more interrupts from the adapter and it seems to work well. Thanks!
Comment 8 Jesse Brandeburg 2009-03-30 23:51:50 UTC
This change was committed, I don't have rights to mark it as fixed.
e1000 commit: a6c42322722976ca81e6d02e4a702f33d659d8fc

change was also committed for e1000e.

Note You need to log in before you can comment on or make changes to this bug.