Bug 38632
| Summary: | IRQ Nobody Cared on Sandybridge Additional Ethernet Card | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Chris Palmer (chris.palmer) |
| Component: | Network | Assignee: | drivers_network (drivers_network) |
| Status: | RESOLVED WILL_NOT_FIX | | |
| Severity: | high | CC: | ajschult, aklhfex, alan, andyrtr, bjorn.ottervik, chris.palmer, edo.rus, edward.donovan, ghost_3k, kaillasse91, kaneda, kernel.org, pierre, Simon_Lea |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.3-rc1 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 56331 | | |
Attachments:

- /proc/acpi/wakeup
- Kernel config
- /proc/cpuinfo
- /proc/interrupts
- /proc/iomem
- /proc/ioports
- /proc/irq/spurious (before bug occurs)
- /proc/irq/spurious (after bug occurs)
- lspci -vvv
- /var/log/messages
- /proc/modules
- /proc/scsi/scsi
- /proc/softirqs
- ver_linux
- /proc/version
Created attachment 64242 [details]
Kernel config
Created attachment 64252 [details]
/proc/cpuinfo
Created attachment 64262 [details]
/proc/interrupts
Created attachment 64272 [details]
/proc/iomem
Created attachment 64282 [details]
/proc/ioports
Created attachment 64292 [details]
/proc/irq/spurious (before bug occurs)
Created attachment 64302 [details]
/proc/irq/spurious (after bug occurs)
Created attachment 64312 [details]
lspci -vvv
Created attachment 64322 [details]
/var/log/messages
Created attachment 64332 [details]
/proc/modules
Created attachment 64342 [details]
/proc/scsi/scsi
Created attachment 64352 [details]
/proc/softirqs
Created attachment 64362 [details]
ver_linux
Created attachment 64372 [details]
/proc/version
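
The two /proc/irq/spurious attachments capture the kernel's per-IRQ unhandled-interrupt bookkeeping before and after the failure. A minimal sketch for watching those counters live while reproducing, assuming IRQ 19 from the original report (substitute the IRQ named in your own "nobody cared" message):

```sh
# Poll the per-IRQ spurious statistics once a second.  The file shows
# three counters: "count" (interrupts seen on this line), "unhandled"
# (interrupts that no registered handler claimed) and "last_unhandled"
# (time of the most recent unhandled interrupt, in ms).
watch -n 1 cat /proc/irq/19/spurious
```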
Bug also exists on 2.6.39.3. Confirmed that the bug does not occur on 2.6.39.2 with acpi=off.

I can confirm that I have the same problem. Confirmed on 2.6.32, 2.6.38, 2.6.39, 2.6.39.2. The additional PCI ethernet card is unusable after this. What happens after the "IRQ Nobody cared" message:

- 2.6.32 - network is very slow, with pings over 150ms compared to the usual 1ms
- 2.6.29 - network not working at all, ping timeout

Booting with "acpi=off" does not fix the problem, it is just delayed - it takes around 2 days on my server, compared to 30min without that parameter.

Same symptoms here too. Looks to be the same issue as https://bugzilla.kernel.org/show_bug.cgi?id=35332

Hi Chris Palmer, please attach the output of dmesg after the bug occurs. Test for a longer time with "acpi=off" to identify whether they are the same problem.

There was nothing in dmesg that wasn't in /var/log/messages (already attached). I did run with acpi=off for 7 days without problems, but given Alex's experience that may not be conclusive. I've now installed a PCIe ethernet board (based on the Intel 82574L chipset, e1000e driver) instead and it has worked flawlessly. So I have a workaround for the moment, but lots of PCI slots that I cannot use. If necessary I could add the original PCI ethernet back in, and find something to connect to it, to do some testing. The machine is now in use though, so I can't mess with it too much...

This seems like a hardware/ethernet card driver issue. Re-assigning to network experts.

I should also mention that the bug is not connected to a specific ethernet card driver. I have reproduced the bug using 2 different PCI ethernet cards: RTL8169S (RTL8139 driver) and Intel 82541PI (Intel e1000 driver).

Upgraded to kernel 3.0.0. The Broadcom PCI ethernet card fails as before. The PCIe ethernet card continues to work perfectly.

Reported this in the Red Hat bugzilla back in June but unfortunately needed to use the hardware, so I could not provide any more info. Hopefully the info you have been able to log and report may help them as well. As more issues have been reported, I have copied the previously reported bug from the Red Hat Bugzilla to here, https://bugzilla.kernel.org/show_bug.cgi?id=41322, and have also referenced this and other very similar bugs in my report so there is a central point with related reports. Hopefully the various people looking at the separate reports will be able to talk to each other rather than work in isolation, if they do not already.

I always had my e1000 and my bmdma devices affected (plus others sometimes) when my problem occurred. I notice the bmdma is not reported for you though. Did you turn off your onboard SATA/PATA controller? I had always presumed the onboard PATA (VIA chipset) was causing the issue, and I could not turn it off without disabling the on-board SATA, which I needed as it is a 24-drive NAS box.

(In reply to comment #23)
> I always had my e1000 and my bmdma devices affected (plus others sometimes)
> when my problem occurred. I notice the bmdma is not reported for you though.
> Did you turn off your onboard SATA/PATA controller? I had always presumed the
> onboard PATA (VIA chipset) was causing the issue, and I could not turn it off
> without disabling the on-board SATA, which I needed as it is a 24-drive NAS
> box.

I'm using the onboard Intel SATA and VIA PATA controllers - you can see them in the lspci attachment. The only external cards are the PCIe ethernet (working) and the PCI ethernet (failing).
A couple more bits of info:

- The PCI ethernet is connected to another host, configured for IP, but otherwise idle (until this bug is resolved!). If I don't actively use it, it can stay "working" for many days. It only takes light activity (e.g. pings at 1-second intervals) to cause it to fail within minutes. I can also flood-ping, but it still takes about the same number of minutes to fail.
- I can rmmod/modprobe the PCI ethernet driver (tg3) then ifconfig, and it will start working again. (Of course it fails again a few minutes later.) Reloading the driver is sufficient though - a reboot is not required.

Still hoping for a fix - I do need the 3 ethernet interfaces...

Chris

I am not sure if this is the same issue as described by the others here, but if it is, it might help, since in my case a network card is not affected. What I get is this:

    [32595.466355] irq 18: nobody cared (try booting with the "irqpoll" option)
    [32595.466358] Pid: 0, comm: swapper Tainted: P 3.0-ARCH #1
    [32595.466359] Call Trace:
    [32595.466360] <IRQ> [<ffffffff810c121a>] __report_bad_irq+0x3a/0xd0
    [32595.466367] [<ffffffff810c1636>] note_interrupt+0x136/0x1f0
    [32595.466369] [<ffffffff810bf729>] handle_irq_event_percpu+0xc9/0x2a0
    [32595.466371] [<ffffffff810bf945>] handle_irq_event+0x45/0x70
    [32595.466373] [<ffffffff810c1f67>] handle_fasteoi_irq+0x57/0xd0
    [32595.466375] [<ffffffff8100d9f2>] handle_irq+0x22/0x40
    [32595.466377] [<ffffffff813f5e6a>] do_IRQ+0x5a/0xe0
    [32595.466379] [<ffffffff813f3b53>] common_interrupt+0x13/0x13
    [32595.466380] <EOI> [<ffffffff8127377b>] ? intel_idle+0xcb/0x120
    [32595.466384] [<ffffffff8127375d>] ? intel_idle+0xad/0x120
    [32595.466387] [<ffffffff813138bd>] cpuidle_idle_call+0x9d/0x350
    [32595.466390] [<ffffffff8100a21a>] cpu_idle+0xba/0x100
    [32595.466392] [<ffffffff813d1602>] rest_init+0x96/0xa4
    [32595.466394] [<ffffffff81748c23>] start_kernel+0x3de/0x3eb
    [32595.466395] [<ffffffff81748347>] x86_64_start_reservations+0x132/0x136
    [32595.466397] [<ffffffff81748140>] ? early_idt_handlers+0x140/0x140
    [32595.466399] [<ffffffff8174844d>] x86_64_start_kernel+0x102/0x111
    [32595.466400] handlers:
    [32595.466404] [<ffffffffa028ad70>] oxygen_interrupt
    [32595.466405] Disabling IRQ #18

I cannot reproduce this issue and it is very rare; about every other week. The hardware using IRQ 18 is an ASUS Xonar sound card, but the same issue exists with a Sound Blaster. The mainboard is an ASUS P8P67 LE. I can add more details if this would be considered useful.

Exactly the same problem as Chris Palmer, with a PCI network card on an ASUS P8H67-V:

    09:02.0 Ethernet controller [0200]: D-Link System Inc DGE-528T Gigabit Ethernet Adapter [1186:4300] (rev 10)

It appears with both 32-bit and 64-bit kernels, the IRQ boot options (irqpoll and all the rest) didn't help; sometimes two days go by without the problem, sometimes the problem occurs 3 times a day.

I can confirm the finding of Chris Palmer that this issue does not affect PCIe cards. I replaced the Intel PCI ethernet card with an Intel PCIe ethernet card, and the server has been running fine for over 20 days without any problems.

Some more "progress" on this. It appears that the root cause may be the ASM1083 PCIe-to-PCI bridge widely used, particularly by ASUS, in many Sandybridge and AMD boards. The problem shows up with both processor architectures, and with all kinds of PCI board (NICs are just the most common and easiest to observe). There is a lot of detail at http://www.gossamer-threads.com/lists/linux/kernel/1466185 and the problem also seems to manifest itself for Windows users, who are getting lousy performance with various PCI boards. I have now tried the December BIOS update, and kernel 3.3-rc1, but no luck.

Chris
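
For readers wondering whether their own board carries the suspect bridge, a quick check; note that the 1b21:1080 vendor/device pair for the ASM1083/1085 is taken from the public PCI ID database, not from this report, so verify it against your own lspci -nn output:

```sh
# Look for an ASMedia ASM1083/1085 PCIe-to-PCI bridge (IDs assumed
# from the public PCI ID database, not confirmed in this thread).
if lspci -d 1b21:1080 | grep -q .; then
    echo "ASM1083/1085 bridge present - PCI cards behind it may hit" \
         "'irq ...: nobody cared' (see this bug)" >&2
fi
```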
This bug looks like the same problem as numbers 39122 and 42659:
https://bugzilla.kernel.org/show_bug.cgi?id=39122
https://bugzilla.kernel.org/show_bug.cgi?id=42659
If bugzilla would let me, I'd mark the two later ones as dupes of this, or do something to pull them together. It looks like the ASM1083 chip is bad. Chris raised the topic again on LKML, as seen here: https://lkml.org/lkml/2012/2/2/370 where Linus and others say we may be able to do limited workarounds. No code has come from that, yet. I'm posting a version of this note on all three bugs.

*** Bug 39122 has been marked as a duplicate of this bug. ***

*** Bug 42659 has been marked as a duplicate of this bug. ***

So what is the status? Is there a partially working workaround, or should everybody with this chip just forget about using PCI expansion cards?

The status at this point is that people believe the ASM1083 chip is the problem, and so far nobody has found a fix (if indeed there is one) or manufacturer/board vendor info on how to deal with the problem.

Thanks for the answer. I believe there was a spurious.c patch floating around LKML at the beginning of the year which prevented the PCI card from stopping working entirely. So was it too ugly, not working, affecting other users, or just in my imagination? And since the initial bug report a lot of motherboards have been released with this chipset (e.g. the ASUS P8Z77 series)... often used by new Linux users. It would be nice to have something in dmesg saying that the ASM1083 is faulty with the current Linux drivers, rather than just "irq ...: nobody cared". It would save users from trying every PCI slot, or from searching for answers to a problem that is known to have no solution. Don't know, just an idea.

PS: if anybody has a patch to test I can do it; my network card still stops working within 2 minutes on the current kernel.

*** Bug 35332 has been marked as a duplicate of this bug. ***

Hi there, I stumbled across this bug after blaming first a Promise FastTrak 4310 and then its replacement, a Dawicontrol DC-3410 SATA controller, before I eventually found out that my ASRock H67DE3 features the infamous ASM1083. Is there any fix or workaround?

To me, all the reports look as though the IRQ in question has been allocated twice, to two different devices (also the case with me). My controller reports at BIOS boot time that it has interrupt 11, which would not be occupied by another device. After Ubuntu boots, it reports using IRQ 16, which collides with my USB controller and seems to trigger the problem.

I used the machine for 4 years under Windows XP without any performance problems. Beginning this year, with Windows XP out of support, I decided to devote myself fully to Ubuntu, and after more or less 6 months of bad system performance and intermittent searches for a solution I finally found this bug report. In between I actually acquired a Win7 license to be able to work more reliably with the machine (no issues there either).

Could it be a workaround to have the PCI cards not use the IRQ of another device - like the mentioned IRQ 11 in my case, which the controller claims at boot time and which seems free? Dear community - is "WILL_NOT_FIX" really the last answer to this problem? I would assume the chip is not that uncommon...

Best regards,
Gero
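
Gero's double-allocation theory can be checked from userspace: /proc/interrupts lists every handler registered on a line, so a shared IRQ shows several device names in the final columns. A minimal sketch, assuming IRQ 16 from his report:

```sh
# Print the /proc/interrupts row for IRQ 16.  More than one device
# name at the end of the row means the line is shared (e.g. a SATA
# controller plus a USB controller).
awk '$1 == "16:"' /proc/interrupts
```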
Gero

There was quite a lot of activity looking at what the ASM1083 was actually doing. From memory, it loses interrupts, and the only workaround is to keep polling it. It "works" under Windows because those drivers do exactly that - with a big performance hit. Doing something like that under Linux was deemed too much effort for a poor result.

I've had trouble with various different ASM chipsets. I just avoid any motherboard with any of their chipsets on it now.

Regards
Chris

Chris, many thanks for the answer. I cannot confirm your description of how it works with Windows. I had the FastTrak 4310 running a RAID 10 array with 4 x 2TB SATA disks for 3 years, nearly 24/7, and I never experienced the loss of performance that I had under Linux. I switched the same hardware configuration from Windows to Ubuntu, where I noticed the problem after less than one week, with write rates around 1.5 MB/s. Windows was more in the range of 50 MB/s.

After finally finding out about the problem yesterday in this forum, I am right now experimenting. I have installed both PCI SATA controllers in my system. This results in one having IRQ 16, shared with the USB controller, and one having IRQ 17 allocated exclusively. I have no SATA disk on the controller with IRQ 16, and the SATA disks are on the one with the exclusive IRQ 17. Right now I am waiting to see whether the error still materializes. If this is a solution - I wonder whether it would not be possible to manually allocate IRQs to devices explicitly via some kernel boot parameter?

Best regards,
Gero

Gero

I never said performance would be comparable under Linux - I would expect it to be quite fast for a short while (until the chipset loses an interrupt), then very slow indeed. I think the 50 MB/s you were seeing under Windows is the degraded speed - it should be much faster (depending on your disks, of course). IIRC, interrupt sharing wasn't the issue. I wouldn't pin too many hopes on it working well.

Chris

Dear Chris, thanks for your reply. I can confirm (unfortunately) that the problem also occurs in the above case of non-shared interrupts. Some experiences with the problem (even if not really relevant):

- it occurs earlier in the case of high data volumes and parallel access to external USB volumes (the latter especially in the case of an IRQ shared with USB)
- it occurs much earlier (by days) on the Promise FastTrak 4310 controller than on the Dawicontrol DC-3410

From my experience of 4 years I am still pretty sure that the controller did not face any problems under Windows. I had the controller in an older mainboard without the ASM chip before. When I upgraded to the new mainboard, I did not experience any performance deterioration with regard to my hard drives (where my system was installed), which I assume should have been the case if the chip were really faulty. I assume there is a workaround or working driver in Windows, but I have great respect for the work you (the kernel developers) have already done without getting the proper information from the manufacturer. I have now bought myself a new Dawicontrol PCIe controller, which solved the problem.

Best regards and many thanks,
Gero

(In reply to Chris Palmer from comment #37)
> There was quite a lot of activity looking at what the ASM1083 was actually
> doing. From memory it loses interrupts, and the only workaround is to keep
> polling it.

I tried to run "rmmod e100;modprobe e100;sleep 0.1;ifconfig eth2 up" after the error - it helped.
The IRQ works again:

    edo@edo-home:~$ grep eth2 /proc/interrupts ; sleep 3; grep eth2 /proc/interrupts
     17:    1262843    0   IO-APIC   17-fasteoi   eth2
     17:    1262849    0   IO-APIC   17-fasteoi   eth2

So I think there is a possibility of finding a good software workaround, and this bug should be reopened.
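
Edo's manual recovery can be automated with the same stalled-counter check he uses above. A rough sketch, assuming his setup (e100 driver, eth2, and a link that normally carries steady traffic; on an idle link the counter comparison below would false-positive):

```sh
#!/bin/sh
# Reload the driver whenever the eth2 interrupt counter stops moving,
# mirroring the manual "rmmod; modprobe; ifconfig up" recovery above.
# Crude by design: if eth2 never reappears, the loop simply retries.
while sleep 5; do
    a=$(awk '/eth2/ {print $2}' /proc/interrupts)   # CPU0 count column
    sleep 3
    b=$(awk '/eth2/ {print $2}' /proc/interrupts)
    if [ "$a" = "$b" ]; then
        rmmod e100 && modprobe e100
        sleep 0.1
        ifconfig eth2 up
    fi
done
```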
Created attachment 64232 [details]
/proc/acpi/wakeup

An additional PCI ethernet card in a Sandybridge M/B runs for a few minutes, then gets "IRQ 19 Nobody Cared".

System details:
- Asus P8H67-V/R3 (latest BIOS 0712, default settings except AHCI and VT enabled)
- Core i5 2500K
- 4 x 4GB Corsair CMX8GX3M2A1333C9
- On-board video and ethernet (eth1, atl1c driver)
- One additional RTL8139 or Broadcom BCM5702X PCI ethernet for eth0 (fails identically with either)
- no other cards
- FC14 with 2.6.39.2 custom kernel, updates to 30/6/11

Moving the card to another slot only changes the IRQ number involved. The following do not prevent the problem:
- pci=noacpi
- acpi=noirq
- noapic
- nolapic
- pci=nocrs

Setting acpi=off does appear to prevent the problem (at least it runs for a couple of hours...) but is undesirable, as power consumption is about 25% higher and the CPU temperature is markedly higher even with the CPU fan running faster.