Bug 7022 - e1000 - IRQ - Nobody cared
Status: REJECTED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: i386 Linux
Importance: P2 normal
Assignee: Jeff Garzik
 
Reported: 2006-08-18 00:56 UTC by Goudal Francois
Modified: 2006-11-21 13:53 UTC

Kernel Version: 2.6.15.3

Attachments
Kernel config file (33.23 KB, text/plain)
2006-08-23 00:14 UTC, Goudal Francois
proposed workaround (738 bytes, patch)
2006-08-24 11:33 UTC, Marcin Slusarz

Description Goudal Francois 2006-08-18 00:56:47 UTC
Distribution:
Debian 3.0 with a "home made" kernel build.

Hardware Environment:
Pentar avionics - Jetlan AR250 router, based on small x86 hardware:
Motherboard: Ampro LittleBoard 800
CPU: Celeron M 1 GHz

Problem Description:
I have an IRQ problem with this hardware:

The Ethernet card, which uses the e1000 driver, works fine for a few minutes,
and then I get this message in dmesg:

irq 9: nobody cared (try booting with the "irqpoll" option)
 [<c012b666>] __report_bad_irq+0x31/0x77
 [<c012b739>] note_interrupt+0x75/0x98
 [<c012b26c>] __do_IRQ+0x65/0x91
 [<c0103fc6>] do_IRQ+0x19/0x24
 [<c0102c3a>] common_interrupt+0x1a/0x20
 [<c011683c>] __do_softirq+0x2c/0x7d
 [<c01168af>] do_softirq+0x22/0x26
 [<c0103fcb>] do_IRQ+0x1e/0x24
 [<c0102c3a>] common_interrupt+0x1a/0x20
 [<c0101047>] default_idle+0x2b/0x53
 [<c01010c3>] cpu_idle+0x40/0x5c
 [<c045c684>] start_kernel+0x171/0x173
handlers:
[<c0256d48>] (e1000_intr+0x0/0xde)
Disabling IRQ #9

After this, the card no longer works.

According to /proc/interrupts, this card is the only device on IRQ line 9:
           CPU0       
  0:     476368          XT-PIC  timer
  1:       1177          XT-PIC  i8042
  2:          0          XT-PIC  cascade
  3:          6          XT-PIC  HiSax
  4:          9          XT-PIC  serial
  5:        941          XT-PIC  ehci_hcd:usb1, uhci_hcd:usb2, eth1
  9:    2400000          XT-PIC  eth0
 10:        129          XT-PIC  HiSax
 14:          4          XT-PIC  i82365
 15:      56403          XT-PIC  ide1
NMI:          0 
ERR:          0

I tried adding "irqpoll" to the kernel command line, and with it the problem
seems to go away: the card keeps working even after a long time in use.
But this is not an acceptable long-term solution. There is obviously a bug
somewhere, and I should normally be able to use this card without irqpoll.
I don't know if this helps, but my kernel is built completely statically,
with no loadable modules.

Thanks for your help!
Comment 1 Marcin Slusarz 2006-08-18 07:37:57 UTC
I think you should:
- attach .config
- show output of lspci
- try newer kernel
Comment 2 Goudal Francois 2006-08-23 00:14:56 UTC
Created attachment 8851
Kernel config file
Comment 3 Goudal Francois 2006-08-23 00:18:05 UTC
Here is the output of lspci :

00:00.0 Host bridge: Intel Corp.: Unknown device 3580 (rev 02)
00:00.1 System peripheral: Intel Corp.: Unknown device 3584 (rev 02)
00:00.3 System peripheral: Intel Corp.: Unknown device 3585 (rev 02)
00:02.0 VGA compatible controller: Intel Corp.: Unknown device 3582 (rev 02)
00:1d.0 USB Controller: Intel Corp.: Unknown device 24c2 (rev 03)
00:1d.1 USB Controller: Intel Corp.: Unknown device 24c4 (rev 03)
00:1d.7 USB Controller: Intel Corp.: Unknown device 24cd (rev 03)
00:1e.0 PCI bridge: Intel Corp. 82820 820 (Camino 2) Chipset PCI (-M) (rev 83)
00:1f.0 ISA bridge: Intel Corp.: Unknown device 24cc (rev 03)
00:1f.1 IDE interface: Intel Corp.: Unknown device 24ca (rev 03)
00:1f.3 SMBus: Intel Corp.: Unknown device 24c3 (rev 03)
00:1f.5 Multimedia audio controller: Intel Corp.: Unknown device 24c5 (rev 03)
01:04.0 Ethernet controller: Intel Corp. 82559ER (rev 14)
01:05.0 Ethernet controller: Intel Corp.: Unknown device 1076 (rev 05)

I will try to install a newer kernel and run some tests. If you already have an
idea, please let me know.
I will report back once the new kernel is installed.
Comment 4 Goudal Francois 2006-08-23 00:44:48 UTC
In the end, I built a 2.6.17.10 kernel using the same .config, after running
make menuconfig to check for obsolete config flags.
The problem is exactly the same: I get an IRQ, apparently from the Intel gigabit
card, but nobody cares about it.
Before this message appears, the card works fine; afterwards, it no longer works
because the kernel has disabled the IRQ line.
Thanks for your help.
Regards.
Comment 5 Marcin Slusarz 2006-08-23 12:24:57 UTC
What happens if you enable "Use Rx Polling (NAPI)" (CONFIG_E1000_NAPI), below
"Intel(R) PRO/1000 Gigabit Ethernet support"?
Comment 6 Goudal Francois 2006-08-24 00:41:46 UTC
I just tried with NAPI enabled, but the problem is the same: it works fine
with irqpoll, but without it the IRQ gets disabled.
I did find something surprising that may be helpful:
When I run without irqpoll, the "nobody cared" message appears exactly
when the interrupt counter for IRQ line 9 reaches 100000.
I noticed this by watching /proc/interrupts.
Everything works fine while the count is below 100000, but at the exact
moment it reaches 100000 the kernel disables the IRQ line, and the counter
then stays at 100000 indefinitely because the kernel has masked the IRQ.
I saw this with both kernels (with and without NAPI), but without NAPI the
count of 100000 seems to be reached much faster.
By the way, I'm quite surprised by this interrupt count: 100000 looks like a
lot compared with the other peripherals over the same period. Furthermore, the
card isn't used much. I'm just pinging it from another computer, and I get the
problem even when nothing is plugged into the card.
Does this look familiar to you?

Thanks.
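
For anyone who wants to watch the counter described above without eyeballing the file, here is a minimal user-space sketch in C. It assumes the single-CPU /proc/interrupts layout shown in the description (IRQ number, colon, one count column); the function names parse_irq_line and irq_count_from_file are made up for this illustration and are not part of any kernel or library API.

```c
#include <stdio.h>

/*
 * Parse one line of /proc/interrupts on a single-CPU system, e.g.
 *   "  9:    2400000          XT-PIC  eth0"
 * Returns 1 and fills *irq and *count on success, 0 for lines that do
 * not start with a numeric IRQ (the CPU0 header, NMI:, ERR:).
 */
int parse_irq_line(const char *line, int *irq, unsigned long *count)
{
    return sscanf(line, " %d: %lu", irq, count) == 2;
}

/*
 * Return the interrupt count for IRQ `wanted` as listed in `path`
 * (normally "/proc/interrupts"), or -1 if the file cannot be read or
 * the IRQ is not listed.
 */
long irq_count_from_file(const char *path, int wanted)
{
    char line[256];
    long result = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        int irq;
        unsigned long count;

        if (parse_irq_line(line, &irq, &count) && irq == wanted) {
            result = (long)count;
            break;
        }
    }
    fclose(f);
    return result;
}
```

Calling irq_count_from_file("/proc/interrupts", 9) in a loop with a short sleep would show whether the count for IRQ 9 stops moving at 100000.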
Comment 7 Marcin Slusarz 2006-08-24 09:11:33 UTC
It looks like the NIC is generating spurious ("faked") interrupts, and when
unhandled_irqs reaches 100000 the kernel disables this IRQ line.

Why "faked interrupt"? Because the documentation for the "Interrupt Cause Read
Register" says:
"This register contains all interrupt conditions for the Ethernet controller.
Each time an interrupt causing event occurs, the corresponding interrupt bit is
set in the register. (..)"
But this register is sometimes(?) empty at the beginning of e1000_intr.

I have no idea why...

PS: note that I'm not a kernel developer.
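
The accounting that leads to "Disabling IRQ #9" can be sketched as a small user-space model in C. This is a simplified reading of note_interrupt() in 2.6-era kernels, not the kernel code itself: the struct, the function names and the helper interrupts_until_disabled are invented for the illustration, and the thresholds (a window of 100000 interrupts, more than 99900 of them unhandled) are taken from memory of kernel/irq/spurious.c, so treat them as assumptions and check them against the kernel version in use.

```c
#include <stdbool.h>

/* Values mirror the kernel's irqreturn_t; the rest is a simplified model. */
enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 };

struct irq_model {
    unsigned int irq_count;      /* interrupts seen in the current window */
    unsigned int irqs_unhandled; /* of those, how many no handler claimed */
    bool disabled;               /* set once the line has been shut off */
};

#define WINDOW        100000u /* assumed window size */
#define MAX_UNHANDLED  99900u /* assumed limit within one window */

/* Account for one interrupt whose handlers returned `ret`. */
void model_note_interrupt(struct irq_model *d, enum irqreturn ret)
{
    if (d->disabled)
        return;
    if (ret != IRQ_HANDLED)
        d->irqs_unhandled++;

    d->irq_count++;
    if (d->irq_count < WINDOW)
        return;

    /* End of a window: decide whether the line looks stuck. */
    d->irq_count = 0;
    if (d->irqs_unhandled > MAX_UNHANDLED)
        d->disabled = true; /* the "nobody cared" case */
    d->irqs_unhandled = 0;
}

/*
 * Feed up to `limit` interrupts into a fresh model. Every
 * `handled_every`-th interrupt is claimed by a handler (0 = none are).
 * Returns how many interrupts were delivered when the line was
 * disabled, or 0 if it stayed enabled.
 */
unsigned long interrupts_until_disabled(unsigned int handled_every,
                                        unsigned long limit)
{
    struct irq_model d = { 0, 0, false };

    for (unsigned long n = 1; n <= limit; n++) {
        bool handled = handled_every != 0 && n % handled_every == 0;

        model_note_interrupt(&d, handled ? IRQ_HANDLED : IRQ_NONE);
        if (d.disabled)
            return n;
    }
    return 0;
}
```

Under these assumptions the decision is only taken when the per-line count hits 100000, which would match the observation that the message appears exactly at that value, and a line where fewer than 100 interrupts per window are genuinely handled still gets disabled.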
Comment 8 Goudal Francois 2006-08-24 09:58:16 UTC
Well, thanks for this information.
I will have a look at the kernel source code and try to find a way to inhibit
the unhandled-interrupt check, so that the kernel does not disable the IRQ
line, even if it's ugly. Even that would be better than irqpoll.
If you're not a kernel developer, do you know who I could ask for help?

Thanks a lot for your help!
Comment 9 Goudal Francois 2006-08-24 10:21:42 UTC
I just found these two lines in the e1000_intr function in e1000_main.c :

	if(unlikely(!icr))
		return IRQ_NONE;  /* Not our interrupt */

where icr is the register you were talking about.
I assume that each time this condition is true, the interrupt is counted as
unhandled ("nobody cared"), and that once this has happened 100000 times the
kernel disables the IRQ line.
So I'll try commenting out these two lines and building a "hacked" kernel ^^. I
don't have access to the hardware right now, so I will test this tomorrow.
But I assume that this test should never be true. The hardware seems to be a
little aggressive on its IRQ line, and that's bad: the CPU keeps jumping into
the interrupt handler to do ... nothing...
I'll report back when my tests are done.
Thanks!
Comment 10 Marcin Slusarz 2006-08-24 11:33:40 UTC
Created attachment 8865
proposed workaround
Comment 11 Goudal Francois 2006-08-29 00:11:11 UTC
The patch you proposed didn't work well, because the number of IRQs with no bit
set in the register is really high: sometimes 100000 interrupts without a bit
set are raised without a single "good" interrupt in between. The patch just
gives the kernel more chances to keep the IRQ line enabled, but it does not
solve the problem.
I have made another patch that solves the problem in my specific case:
I replaced return IRQ_NONE with return IRQ_HANDLED when the register value is 0.

In my specific case it works fine now, because the Ethernet card is alone on its
IRQ line. But I assume that if another peripheral were on the same IRQ line, it
would not work: the card's driver would probably catch some interrupts meant for
the other peripheral's driver, causing that hardware to malfunction.
So for me the problem is solved, but this solution isn't reliable, so I would
like a kernel developer to tell me what they think about all this before
considering this bug SOLVED.
I'll keep watching this page in case someone wants more details about the
problem I had.
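
As a side note on the shared-line concern, here is a hedged user-space sketch in C of how a handler chain on one IRQ line behaves. It assumes the 2.6-era behaviour that every handler registered on the line is called in turn and their return values are OR-ed together. All names (fake_dev, irqaction_model, fake_intr, dispatch) are invented for this model, and the read-to-clear cause register only stands in for the e1000 ICR.

```c
#include <stddef.h>

/* Values mirror the kernel's irqreturn_t; the rest is a simplified model. */
enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 };

struct fake_dev {
    unsigned int pending;  /* stand-in for an interrupt cause register */
    unsigned int serviced; /* real events this driver has processed */
    int always_claim;      /* 1 = the "return IRQ_HANDLED" workaround */
};

struct irqaction_model {
    enum irqreturn (*handler)(struct fake_dev *dev);
    struct fake_dev *dev;
    struct irqaction_model *next; /* next handler sharing the line */
};

/* Model of a driver's interrupt handler. */
enum irqreturn fake_intr(struct fake_dev *dev)
{
    unsigned int cause = dev->pending;

    dev->pending = 0; /* reading the register clears it (assumed) */
    if (!cause)
        return dev->always_claim ? IRQ_HANDLED : IRQ_NONE;

    dev->serviced++;
    return IRQ_HANDLED;
}

/* Call every handler on the line and OR their results together. */
int dispatch(struct irqaction_model *action)
{
    int retval = 0;

    for (; action != NULL; action = action->next)
        retval |= action->handler(action->dev);
    return retval;
}
```

In this model a second device on the line is still serviced even when the first handler always claims the interrupt. What the workaround does cost is detection: a line that really is stuck will never be reported as unhandled.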
Comment 12 Goudal Francois 2006-11-21 13:53:31 UTC
After extensive investigation, it turns out the problem is not due to the
Ethernet card, sorry.
It definitely seems that this motherboard model does something strange with
IRQ line 9, and the default BIOS configuration puts only the Ethernet card on
this interrupt line...
Apparently it is possible to make the BIOS not assign IRQ 9 to any
peripheral, which solves the problem, but this motherboard is definitely crap...
