Bug 38632

Summary: IRQ Nobody Cared on Sandybridge Additional Ethernet Card
Product: Drivers Reporter: Chris Palmer (chris.palmer)
Component: Network Assignee: drivers_network (drivers_network)
Status: RESOLVED WILL_NOT_FIX    
Severity: high CC: ajschult, aklhfex, alan, andyrtr, bjorn.ottervik, chris.palmer, edo.rus, edward.donovan, ghost_3k, kaillasse91, kaneda, kernel.org, pierre, Simon_Lea
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.3-rc1 Subsystem:
Regression: No Bisected commit-id:
Bug Depends on:    
Bug Blocks: 56331    
Attachments: /proc/acpi/wakeup
Kernel config
/proc/cpuinfo
/proc/interrupts
/proc/iomem
/proc/ioports
/proc/irq/spurious (before bug occurs)
/proc/irq/spurious (after bug occurs)
lspci -vvv
/var/log/messages
/proc/modules
/proc/scsi/scsi
/proc/softirqs
ver_linux
/proc/version

Description Chris Palmer 2011-07-01 14:15:49 UTC
Created attachment 64232 [details]
/proc/acpi/wakeup

Additional PCI ethernet card in Sandybridge M/B runs for a few minutes, then gets IRQ 19 Nobody Cared

System details:
- Asus P8H67-V/R3 (latest Bios 0712, default settings except AHCI and
VT-enabled)
- COREi5/2500K
- 4 x 4GB Corsair CMX8GX3M2A1333C9
- On-board video and ethernet (eth1, at11c driver)
- One additional RTL8139 or Broadcom BCM5702X PCI ethernet for eth0 (fails
identically with either) - no other cards
- FC14 with 2.6.39.2 custom kernel, updates to 30/6/11

Moving to another slot only changes the IRQ number involved.

The following do not prevent the problem:
- pci=noacpi
- acpi=noirq
- noapic
- nolapic
- pci=nocrs

Setting acpi=off does appear to prevent the problem (at least it runs for a couple of hours...) but is undesirable as power consumption is about 25% higher, and the CPU temp is markedly higher even with the CPU fan running faster.
Comment 1 Chris Palmer 2011-07-01 14:16:28 UTC
Created attachment 64242 [details]
Kernel config
Comment 2 Chris Palmer 2011-07-01 14:16:57 UTC
Created attachment 64252 [details]
/proc/cpuinfo
Comment 3 Chris Palmer 2011-07-01 14:17:32 UTC
Created attachment 64262 [details]
/proc/interrupts
Comment 4 Chris Palmer 2011-07-01 14:17:54 UTC
Created attachment 64272 [details]
/proc/iomem
Comment 5 Chris Palmer 2011-07-01 14:18:27 UTC
Created attachment 64282 [details]
/proc/ioports
Comment 6 Chris Palmer 2011-07-01 14:19:11 UTC
Created attachment 64292 [details]
/proc/irq/spurious (before bug occurs)
Comment 7 Chris Palmer 2011-07-01 14:19:37 UTC
Created attachment 64302 [details]
/proc/irq/spurious (after bug occurs)
Comment 8 Chris Palmer 2011-07-01 14:20:17 UTC
Created attachment 64312 [details]
lspci -vvv
Comment 9 Chris Palmer 2011-07-01 14:20:41 UTC
Created attachment 64322 [details]
/var/log/messages
Comment 10 Chris Palmer 2011-07-01 14:24:26 UTC
Created attachment 64332 [details]
/proc/modules
Comment 11 Chris Palmer 2011-07-01 14:26:06 UTC
Created attachment 64342 [details]
/proc/scsi/scsi
Comment 12 Chris Palmer 2011-07-01 14:26:33 UTC
Created attachment 64352 [details]
/proc/softirqs
Comment 13 Chris Palmer 2011-07-01 14:26:57 UTC
Created attachment 64362 [details]
ver_linux
Comment 14 Chris Palmer 2011-07-01 14:27:30 UTC
Created attachment 64372 [details]
/proc/version
Comment 15 Chris Palmer 2011-07-09 19:55:10 UTC
Bug also exists on 2.6.39.3.
Confirmed that the bug does not occur on 2.6.39.2 with acpi=off
Comment 16 Alexandru Coman 2011-07-17 01:04:10 UTC
I can confirm that I have the same problem.
Confirmed on 2.6.32, 2.6.38, 2.6.39, 2.6.39.2.

The additional PCI ethernet card is unusable after this.
What happens after "IRQ Nobody cared" message:
2.6.32 - network is very slow, with pings over 150ms compared to the usual 1ms
2.6.29 - network not working at all, ping timeout

Booting with "acpi=off" does not fix the problem; it only delays it. It takes around 2 days on my server, compared to 30 min without that parameter.
Comment 17 Andreas Radke 2011-07-21 09:24:05 UTC
Same symptoms here too. Looks to be the same issue as https://bugzilla.kernel.org/show_bug.cgi?id=35332
Comment 18 Lan Tianyu 2011-07-25 06:03:56 UTC
hi Chris Palmer
Please attach the output of dmesg after bug occurs. 

Please also test for a longer time with "acpi=off" to identify whether they are the same problem.
Comment 19 Chris Palmer 2011-07-25 08:52:20 UTC
There was nothing in dmesg that wasn't in /var/log/messages (already attached).

I did run with acpi=off for 7 days without problems, but given Alex's experience that may not be conclusive. I've now installed a PCI-x ethernet board (based on  the Intel 82574L chipset, e1000e driver) instead and it has worked flawlessly. So I have a workaround for the moment, but lots of PCI slots that I cannot use.

If necessary I could add the original PCI ethernet back in, and find something to connect to it, to do some testing. The machine is now in use though so I can't mess with it too much...
Comment 20 Zhang Rui 2011-07-26 06:15:01 UTC
this seems like a hardware/ethernet card driver issue.
re-assign to network experts.
Comment 21 Alexandru Coman 2011-07-26 09:42:37 UTC
I should also mention the fact that the bug is not connected to a specific ethernet card driver. I have reproduced the bug using 2 different PCI ethernet cards: RTL8169S (RTL8139 driver) and Intel 82541PI (Intel e1000 driver).
Comment 22 Chris Palmer 2011-07-27 17:38:28 UTC
Upgraded to kernel 3.0.0.
Broadcom PCI ethernet card fails as before.
PCI-x ethernet card continues to work perfectly.
Comment 23 Simon Lea 2011-08-18 02:58:20 UTC
Reported this in the redhat bugzilla back in June but unfortunately needed to use the hardware so could not provide any more info.  Hopefully the info you have been able to log and report may help them as well.

As more issues have been reported I have copied the previously reported bug from the Redhat Bugzilla to here https://bugzilla.kernel.org/show_bug.cgi?id=41322 and have also referenced this and other very similar bugs reported here in my bug report so there is a central point with related reports.  Hopefully the various people looking at the separate reports will be able to talk to each other rather than work in isolation if they do not already.

I always had my et1000 and my bmdma devices affected (plus others sometimes) when my problem occurred.  I notice the bmdma is not reported for you though.  Did you turn off your onboard SATA/PATA controller?  I had always presumed the onboard PATA (VIA chipset) was causing the issue, and I could not turn it off without disabling the on-board SATA, which I needed (as it is a 24 drive NAS box).
Comment 24 Chris Palmer 2011-08-18 08:43:10 UTC
(In reply to comment #23)
> I always had my et1000 and my bmdma devices affected (plus others sometimes)
> when my problem occurred.  I notice the bmdma is not reported for you though. 
> Did you turn off your onbard SATA/PATA controller.  I had always presumed the
> onboard PATA (VIA chipset) was causing the issue and could not turn it off
> without disabling the on-board SATA which I needed as it is a 24 drive NAS
> box).

I'm using the onboard Intel SATA and VIA PATA controllers - you can see them in the lspci attachment. The only external cards are the PCIx ethernet (working) and the PCI ethernet (failing).

A couple more bits of info:
- The PCI ethernet is connected to another host, configured for IP, but otherwise idle (until this bug is resolved!). If I don't actively use it, it can stay "working" for many days. It only takes light activity (e.g. pings at 1-second intervals) to cause it to fail within minutes. I can also flood-ping but it still takes about the same number of minutes to fail.
- I can rmmod/modprobe the PCI ethernet driver (tg3) then ifconfig, and it will start working again. (Of course it fails again a few minutes later). Reloading the driver is sufficient though - a reboot is not required.

Still hoping for a fix - I do need the 3 ethernet interfaces...

Chris
Comment 25 Pierre Schmitz 2011-08-23 21:03:14 UTC
I am not sure if this is the same issue as described by the others here, but if it is it might help as in my case a network card is not affected.

What I get is this:

[32595.466355] irq 18: nobody cared (try booting with the "irqpoll" option)
[32595.466358] Pid: 0, comm: swapper Tainted: P            3.0-ARCH #1
[32595.466359] Call Trace:
[32595.466360]  <IRQ>  [<ffffffff810c121a>] __report_bad_irq+0x3a/0xd0
[32595.466367]  [<ffffffff810c1636>] note_interrupt+0x136/0x1f0
[32595.466369]  [<ffffffff810bf729>] handle_irq_event_percpu+0xc9/0x2a0
[32595.466371]  [<ffffffff810bf945>] handle_irq_event+0x45/0x70
[32595.466373]  [<ffffffff810c1f67>] handle_fasteoi_irq+0x57/0xd0
[32595.466375]  [<ffffffff8100d9f2>] handle_irq+0x22/0x40
[32595.466377]  [<ffffffff813f5e6a>] do_IRQ+0x5a/0xe0
[32595.466379]  [<ffffffff813f3b53>] common_interrupt+0x13/0x13
[32595.466380]  <EOI>  [<ffffffff8127377b>] ? intel_idle+0xcb/0x120
[32595.466384]  [<ffffffff8127375d>] ? intel_idle+0xad/0x120
[32595.466387]  [<ffffffff813138bd>] cpuidle_idle_call+0x9d/0x350
[32595.466390]  [<ffffffff8100a21a>] cpu_idle+0xba/0x100
[32595.466392]  [<ffffffff813d1602>] rest_init+0x96/0xa4
[32595.466394]  [<ffffffff81748c23>] start_kernel+0x3de/0x3eb
[32595.466395]  [<ffffffff81748347>] x86_64_start_reservations+0x132/0x136
[32595.466397]  [<ffffffff81748140>] ? early_idt_handlers+0x140/0x140
[32595.466399]  [<ffffffff8174844d>] x86_64_start_kernel+0x102/0x111
[32595.466400] handlers:
[32595.466404] [<ffffffffa028ad70>] oxygen_interrupt
[32595.466405] Disabling IRQ #18

I cannot reproduce this issue on demand and it is very rare; it happens about every other week. The hardware using IRQ 18 is an ASUS Xonar sound card, but the same issue exists with a Sound Blaster.

The mainboard is an ASUS P8P67 LE. I can add more details if this would be considered useful.
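
For readers unfamiliar with the message in the trace above, here is a minimal sketch of the bookkeeping behind "nobody cared". It is a simplified model, not kernel code: the names IrqLine and note_interrupt are illustrative, and the thresholds (a window of 100,000 interrupts, more than 99,900 of them unhandled) are the values mainline kernel/irq/spurious.c is understood to use in the kernel versions discussed here. The real code also resets the unhandled count when unhandled interrupts are far apart in time, which this model leaves out.

```python
from dataclasses import dataclass

# Assumed thresholds, believed to match kernel/irq/spurious.c of this era.
WINDOW = 100_000          # interrupts per accounting window
UNHANDLED_LIMIT = 99_900  # more unhandled than this disables the line


@dataclass
class IrqLine:
    irq_count: int = 0
    irqs_unhandled: int = 0
    disabled: bool = False


def note_interrupt(line: IrqLine, handled: bool) -> None:
    """Account for one interrupt; disable the line if almost none were handled."""
    if line.disabled:
        return
    if not handled:
        line.irqs_unhandled += 1
    line.irq_count += 1
    if line.irq_count < WINDOW:
        return
    # End of a window: decide, then start counting afresh.
    if line.irqs_unhandled > UNHANDLED_LIMIT:
        line.disabled = True  # corresponds to "Disabling IRQ #N"
    line.irq_count = 0
    line.irqs_unhandled = 0


def run_storm(total: int, handled_every: int = 0) -> IrqLine:
    """Feed `total` interrupts; every `handled_every`-th one is claimed by a driver."""
    line = IrqLine()
    for i in range(total):
        handled = handled_every > 0 and i % handled_every == 0
        note_interrupt(line, handled)
    return line
```

Under this model a line stuck asserted with no driver claiming it is disabled after one window, while a line on which even 1% of interrupts are claimed survives. That would fit the reports here of a card working for minutes to days before the IRQ is switched off.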
Comment 26 Rafael Gandolfi 2011-09-08 06:05:14 UTC
Exactly the same problem as Chris Palmer, with a PCI network card on an Asus P8H67-V:

09:02.0 Ethernet controller [0200]: D-Link System Inc DGE-528T Gigabit Ethernet Adapter [1186:4300] (rev 10)

Appears with both 32-bit and 64-bit kernels. The irq boot options (irqpoll and all) didn't work. Sometimes it runs two days without the problem, sometimes the problem occurs 3 times a day.
Comment 27 Alexandru Coman 2011-09-08 10:02:06 UTC
I can confirm the finding of Chris Palmer, that this issue does not affect PCI-x cards.

Replaced the Intel PCI ethernet card with an Intel PCI-X ethernet card. The server has been running fine for over 20 days without any problems.
Comment 28 Chris Palmer 2012-01-30 14:23:59 UTC
Some more "progress" on this. It appears that the root cause may be the ASM1083 PCIx/PCI bridge widely used, particularly by ASUS, in many Sandybridge and AMD boards. The problem shows up with both processor architectures, and all forms of PCI board (NICs are just the most common and easiest to observe).

There is a lot of detail at
  http://www.gossamer-threads.com/lists/linux/kernel/1466185

And the problem also seems to manifest itself for Windows users who are getting lousy performance with various PCI boards.

Have now tried the December BIOS update, and kernel 3.3-rc1 but no luck.

Chris
Comment 29 Edward Donovan 2012-02-14 04:25:19 UTC
This bug looks like the same problem as numbers 39122 and 42659.  

  https://bugzilla.kernel.org/show_bug.cgi?id=39122
  https://bugzilla.kernel.org/show_bug.cgi?id=42659

If bugzilla would let me, I'd mark the two later ones as dupes of this. 
Or do something to pull them together.

It looks like the ASM1083 chip is bad.  Chris raised the topic again on LKML, as seen
here:

  https://lkml.org/lkml/2012/2/2/370

where Linus and others say we may be able to do limited workarounds.  No code
has come from that, yet.

I'm posting a version of this note on all three bugs.
Comment 30 Alan 2012-08-24 15:21:11 UTC
*** Bug 39122 has been marked as a duplicate of this bug. ***
Comment 31 Alan 2012-08-30 14:08:31 UTC
*** Bug 42659 has been marked as a duplicate of this bug. ***
Comment 32 Rafael Gandolfi 2012-11-21 15:26:28 UTC
So what is the status? Is there a partially working workaround, or should everybody with this chip just forget about using PCI extension cards?
Comment 33 Alan 2012-11-21 15:40:15 UTC
The status at this point is that people believe the ASM1083 chip is the problem and so far nobody has found a fix (if indeed there is one) or manufacturer/board vendor info on how to deal with the problem.
Comment 34 Rafael Gandolfi 2012-11-21 23:17:19 UTC
Thanks for the answer.

I believe there was a spurious.c patch floating around LKML at the beginning of the year which prevented PCI cards from stopping working completely. So was it too ugly, not working, affecting other users, or just in my imagination?

And since the initial bug report a lot of motherboards have been released with this chipset (e.g. the Asus P8Z77 series), often used by new Linux users.

It would be nice to have something in dmesg saying that the ASM1083 is faulty with the current Linux drivers rather than "irq spurious nobody cared". It would save users from trying every PCI slot or seeking answers when the problem is known to have no solution. Just an idea.

PS: if anybody has a patch to test I can do it; my network card still stops working within 2 min on the current kernel.
Comment 35 Aaron Lu 2012-12-10 03:03:21 UTC
*** Bug 35332 has been marked as a duplicate of this bug. ***
Comment 36 Gero 2015-09-19 22:14:43 UTC
Hi there,

I stumbled across this bug after blaming first a Promise Fastrak 4310 and then the replacement Dawicontrol DC-3410 SATA controller, before I eventually found out that my ASRock H67DE3 features the infamous ASM1083.

Is there any fix or workaround? To me, all the reports look as if the IRQ in question has been allocated twice, to two different devices (also the case with me). At BIOS boot time my controller reports Interrupt 11, which would not be occupied by another device. After Ubuntu has booted it reports IRQ 16, which collides with my USB controller and seems to trigger the problem.

I used the machine for 4 years under Windows XP without any performance problems. At the beginning of this year, with Windows XP out of support, I decided to devote myself fully to Ubuntu, and after more or less 6 months of bad system performance and intermittent searches for a solution I finally found this bug report. In between I actually acquired a Win7 license to be able to work more reliably with the machine (no issues there either).

Could it be a workaround to have the PCI cards not use the IRQ of another device, like the IRQ 11 mentioned in my case, which the controller claims at boot time and which seems free?

Dear community - is "WILL_NOT_FIX" really the last answer to this problem? I would assume the chip is not that uncommon...

Best regards,

Gero
Comment 37 Chris Palmer 2015-09-20 09:01:50 UTC
Gero

There was quite a lot of activity looking at what the ASM1083 was actually doing. From memory it loses interrupts, and the only workaround is to keep polling it. It "works" under Windows because those drivers do exactly that - with a big performance hit. Doing something like that under linux was deemed too much effort to get a poor result.

I've had trouble with various different ASM chipsets. I just avoid any motherboard with any of their chipsets on now.

Regards
Chris
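
The trade-off described in this comment (interrupts stop arriving, so the only way to keep the device serviced is to poll it, at a cost) can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not reproduce the ASM1083's actual behaviour or any kernel or Windows driver code, and simulate, irq_dead_from and poll_every are hypothetical names. The device raises one event per tick, interrupts are delivered only before tick irq_dead_from, and an optional poll runs every poll_every ticks.

```python
from typing import List, Tuple


def simulate(ticks: int, irq_dead_from: int, poll_every: int = 0) -> Tuple[int, int]:
    """Return (events serviced, worst service latency in ticks).

    ticks: how long to run; the device raises one event per tick
    irq_dead_from: first tick at which interrupts no longer reach the CPU
    poll_every: if > 0, the driver also checks the device every N ticks
    """
    pending: List[int] = []  # arrival ticks of events not yet serviced
    serviced = 0
    worst_latency = 0
    for tick in range(ticks):
        pending.append(tick)
        irq_delivered = tick < irq_dead_from
        poll_due = poll_every > 0 and tick % poll_every == 0
        if irq_delivered or poll_due:
            # Service everything queued; the oldest entry waited longest.
            worst_latency = max(worst_latency, tick - pending[0])
            serviced += len(pending)
            pending.clear()
    return serviced, worst_latency


print(simulate(100, 10))                # interrupts only
print(simulate(100, 10, poll_every=5))  # interrupts plus a polling fallback
```

In this model the interrupt-only run stops servicing events once the interrupts die, while the polled run keeps going but each event may wait up to one polling interval. A shorter interval lowers that latency at the price of more wake-ups.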
Comment 38 Gero 2015-09-20 14:14:48 UTC
Chris,

many thanks for the answer. I cannot confirm your description of how it works with Windows. I had the Fasttrak 4310 running a RAID 10 array with 4 x 2TB SATA disks for 3 years, nearly 24/7, and I never experienced the loss of performance that I had under Linux. I switched the same hardware configuration from Windows to Ubuntu, where I noticed the problem after less than one week with write rates around 1.5 MB/s. Windows was more in the range of 50 MB/s.

After finally finding out about the problem yesterday in this forum, I am right now experimenting. I actually installed both PCI SATA controllers in my system. This results in one having IRQ 16, shared with the USB controller, and one having IRQ 17 allocated exclusively. I have no SATA disk on the controller with IRQ 16 and the other SATA disks on the one with the exclusive IRQ 17. Right now I am waiting whether the error still materializes.

If this is a solution - I wonder whether it would not be possible to manually allocate IRQs to devices explicitly by some kernel boot parameter?

Best regards,

Gero
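
To check the sharing hypothesis on a given machine, the contents of /proc/interrupts can be scanned for IRQ lines with more than one device attached. The sketch below is a heuristic parser, not an official format description: it assumes multiple devices on one line are separated by ", " and that the first device name is the last whitespace-separated token before the first comma. The sample text is made up for illustration (only the eth2 line is taken from this report), and shared_irqs is a hypothetical helper name.

```python
from typing import Dict, List

# Illustrative sample only; on a real system read /proc/interrupts instead.
SAMPLE = """\
           CPU0       CPU1
 16:      48211          0   IO-APIC  16-fasteoi   ehci_hcd:usb1, sata_promise
 17:    1262843          0   IO-APIC  17-fasteoi   eth2
NMI:          0          0   Non-maskable interrupts
"""


def shared_irqs(interrupts_text: str) -> Dict[int, List[str]]:
    """Map each numbered IRQ that has several devices to their names."""
    shared: Dict[int, List[str]] = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not label.strip().isdigit():
            continue  # header line, or named rows such as NMI and LOC
        chunks = [chunk.strip() for chunk in rest.split(", ")]
        if len(chunks) < 2:
            continue  # a single device: the line is not shared
        first_device = chunks[0].split()[-1]
        shared[int(label)] = [first_device] + chunks[1:]
    return shared


print(shared_irqs(SAMPLE))
```

On a live system the same function could be fed open("/proc/interrupts").read(). An empty result only rules out sharing; it says nothing about whether the bridge itself is misbehaving.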
Comment 39 Chris Palmer 2015-09-20 15:53:05 UTC
Gero

I never said performance would be comparable under linux - I would expect it to be quite fast for a short while (until the chipset loses an interrupt) then very slow indeed. I think the 50 MB/s you were seeing under Windows is the degraded speed; it should be much faster (depending on your disks of course). IIRC interrupt sharing wasn't the issue. I wouldn't pin too many hopes on it working well.

Chris
Comment 40 Gero 2015-09-27 11:02:27 UTC
Dear Chris,

thanks for your reply. I can confirm (unfortunately) that the problem also occurs in the above case, where the interrupt is not shared.

Some further observations on the problem (even if not really relevant):
 - it occurs earlier with high data volumes and parallel access to external USB volumes (the latter especially when the IRQ is shared with USB)
 - it occurs much earlier (by days) on the Promise Fasttrak 4310 controller than on the Dawicontrol DC-3410

From my experience of 4 years I am still pretty sure that the controller did not face any problems under Windows. I had the controller already before in an older mainboard without the ASM chip. When I upgraded to the new mainboard, I did not experience any performance deterioration with regard to my hard drives (where my system was installed), which I assume should have been the case if the chip is really erroneous.

I assume there is a workaround or working driver in Windows, but I have great respect for the work you (the kernel-developers) already have done without getting the proper information from the manufacturer.

I have now bought myself a new Dawicontrol controller on PCIe, which solved the problem.

Best regards and many thanks,

Gero
Comment 41 edo 2016-02-28 01:25:32 UTC
(In reply to Chris Palmer from comment #37)
> There was quite a lot of activity looking at what the ASM1083 was actually
> doing. From memory it loses interrupts, and the only workaround is to keep
> polling it.

I tried running "rmmod e100;modprobe e100;sleep 0.1;ifconfig eth2 up" after the error, and it helped. The IRQ works again:
edo@edo-home:~$ grep eth2 /proc/interrupts ; sleep 3; grep eth2 /proc/interrupts
 17:    1262843          0   IO-APIC  17-fasteoi   eth2
 17:    1262849          0   IO-APIC  17-fasteoi   eth2

So I think it is possible to find a good software workaround, and this bug should be reopened.
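
The check above (grep the interface's line from /proc/interrupts twice and compare) can also be done programmatically. The following is a minimal sketch; irq_total and irq_advanced are illustrative names, and the two sample lines are the ones quoted in this comment.

```python
# The two samples quoted above, taken 3 seconds apart.
BEFORE = " 17:    1262843          0   IO-APIC  17-fasteoi   eth2"
AFTER = " 17:    1262849          0   IO-APIC  17-fasteoi   eth2"


def irq_total(line: str) -> int:
    """Sum the per-CPU counters on one /proc/interrupts line."""
    _, _, rest = line.partition(":")
    total = 0
    for field in rest.split():
        if not field.isdigit():
            break  # the first non-numeric field ends the counter columns
        total += int(field)
    return total


def irq_advanced(before: str, after: str) -> int:
    """Number of interrupts counted between the two samples."""
    return irq_total(after) - irq_total(before)


print(irq_advanced(BEFORE, AFTER))  # prints 6
```

A difference of zero over an interval with traffic on the interface would suggest the IRQ has been disabled again; a positive difference, as here, shows that interrupts are being counted after the driver reload.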