Bug 2574
Description
Noel Maddy
2004-04-22 20:09:17 UTC
Created attachment 2655
acpidmp output
Created attachment 2656
dmidecode output
Created attachment 2657
dmesg with pci=noacpi
Created attachment 2658
dmesg with acpi enabled
Created attachment 2659
interrupt behavior with both ehci_hcd and snd-intel_8x0
Created attachment 2660
interrupt behavior with only ehci_hcd loaded
Created attachment 2661
interrupt behavior with only snd-intel_8x0 loaded
Created attachment 2662
behavior with neither ehci_hcd nor snd-intel_8x0 loaded
Created attachment 2663
lspci -v output
*** Bug 2570 has been marked as a duplicate of this bug. ***

Has IRQ21 ever worked on this board with any version of ACPI+IOAPIC-enabled Linux? Does Windows give the same interrupt mapping, and do the devices work there? Comparison to pci=noacpi doesn't tell us much, because the lack of MPS in the BIOS causes that configuration to run in PIC mode.

IOAPIC programming looks fine. Unfortunately, both ehci_irq() and snd_intel8x0_interrupt() are hard-coded to return IRQ_HANDLED. So if the interrupts are all spurious (as we suspect), the IRQ will not get shut down, because the drivers claim them no matter what. We may need to instrument these drivers to confirm that they're actually not seeing any interrupts from their hardware.

Random stabs:
1. Any difference if you get rid of the I2C stuff in your kernel?
2. Any difference if you boot with "acpi_irq_isa=21"? This may simply move the problem to a different IRQ.
3. If the interrupts are due to another device that has not registered a driver on this IRQ (the current prime suspect), it would be interesting to go through BIOS SETUP and disable as many on-board devices as possible to see if the symptom goes away.

No, IRQ21 has never worked properly with ACPI/IOAPIC on this board under Linux. I haven't run Windows on it at all. I may have an old Win98SE that I could install if that would help.

Random results:
1. Behavior is unchanged when the I2C stuff is removed.
3. Behavior is unchanged when disabling built-ins, EXCEPT that if both AC97 and USB are disabled, nothing is on IRQ21, so the problem doesn't show up.
   - Behavior unchanged when the SB Live! and bttv tuner are removed
   - Behavior unchanged when a Radeon 9200SE replaces the on-board IGP
2. acpi_irq_isa=21 changes things. Late in the boot, I get an "irq 20: nobody cared!", and then the IRQ is disabled. No interrupt flooding in this situation.
ohci_hcd 0000:00:02.1: irq 22, pci mem dea25000
ohci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 2
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
irq 20: nobody cared!
Call Trace:
 [<c010b54b>] __report_bad_irq+0x2b/0x90
 [<c010b644>] note_interrupt+0x64/0xa0
 [<c010b8ff>] do_IRQ+0x12f/0x140
 [<c0109cdc>] common_interrupt+0x18/0x20
handlers:
[<dea6dc10>] (usb_hcd_irq+0x0/0x70 [usbcore])
Disabling IRQ #20
usb 1-2: new full speed USB device using address 2
ehci_hcd 0000:00:02.2: nVidia Corporation nForce2 USB Controller

(full dmesg, /proc/interrupts, etc., as attachments)

Created attachment 2664
dmesg with acpi_irq_isa=21
Created attachment 2665
behavior with acpi_irq_isa=21
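Len's point about ehci_irq() and snd_intel8x0_interrupt() always returning IRQ_HANDLED, and the "nobody cared!" message in the log above, can be illustrated with a small userspace model. This is a sketch, not kernel code. The irqreturn_t values, struct fake_hc, note_interrupt_model() and run_storm() are local stand-ins. The thresholds are assumptions based on our reading of 2.6-era kernels: the check runs every 100,000 interrupts and disables the line if more than 99,900 were unhandled.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's irqreturn_t values (userspace model). */
typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t;

/* Hypothetical controller: write-1-to-clear status bits plus an enable mask. */
struct fake_hc {
    uint32_t status;
    uint32_t enable;
};

/* The pattern criticized in the thread: claim every interrupt. */
static irqreturn_t irq_always_handled(struct fake_hc *hc)
{
    (void)hc;
    return IRQ_HANDLED;
}

/* The pattern asked for: claim the interrupt only if our hardware raised it. */
static irqreturn_t irq_check_status(struct fake_hc *hc)
{
    uint32_t pending = hc->status & hc->enable;
    if (pending == 0)
        return IRQ_NONE;        /* not ours: let the core count it */
    hc->status &= ~pending;     /* model the write-1-to-clear acknowledge */
    return IRQ_HANDLED;
}

/* Simplified spurious-IRQ detector (assumed 2.6-era thresholds). */
struct irq_line {
    unsigned int count;
    unsigned int unhandled;
    int disabled;
};

static void note_interrupt_model(struct irq_line *line, irqreturn_t ret)
{
    if (ret != IRQ_HANDLED)
        line->unhandled++;
    if (++line->count < 100000)
        return;
    line->count = 0;
    if (line->unhandled > 99900)
        line->disabled = 1;     /* "nobody cared!" then "Disabling IRQ #n" */
    line->unhandled = 0;
}

/* Deliver n interrupts that the controller did not raise (a storm from
 * some other source) and report whether the line ended up disabled. */
static int run_storm(irqreturn_t (*handler)(struct fake_hc *), unsigned int n)
{
    struct fake_hc hc = { .status = 0, .enable = 0x3f };
    struct irq_line line = { 0, 0, 0 };
    assert(handler != 0);
    for (unsigned int i = 0; i < n && !line.disabled; i++)
        note_interrupt_model(&line, handler(&hc));
    return line.disabled;
}
```

A storm of 100,000 foreign interrupts gets the line disabled under irq_check_status(). Under irq_always_handled() the same storm is never detected, which is why the flood can continue indefinitely.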
*** Bug 2227 has been marked as a duplicate of this bug. ***

Created attachment 2813
interrupts with IOAPIC/ACPI
Comment on attachment 2813
interrupts with IOAPIC/ACPI
Same problem here with an MSI K7N2Delta-L.
I've tried 2.6.3, 2.6.4, 2.6.5, and also the 2.6.5 kernel from Fedora Core 2.
Created attachment 2814
interrupts on XP
Here is the interrupt situation with Windows XP.
EHCI seems to be on interrupt 22; IRQ21 is used by OHCI...
The Epox 8RDA3+ suffers from the same problem. In Windows, the IRQ assignment is different: IRQ 21 is used by the onboard sound card, but the problem seems to exist there as well, although in a milder form. The interrupt rate is ~2550/second on an idle system (~140/second on a similar system with a SiS chipset).

I'd like to see the complete dmesg from 2.6.6-mm2, or from 2.6.6 plus the patch in bug 2665, since it includes some PCI Link fixes and has extra debug output.

thanks,
-Len

Created attachment 2872
dmesg from 2.6.6-mm2
dmesg from 2.6.6-mm2 on same 7NIF2 system (with bk-input patch backed out)
It's flooding on IRQ20 instead of IRQ21 now.
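Rates such as the ~2550 interrupts/second quoted above for the Epox board come from sampling /proc/interrupts twice and dividing the difference in counts by the interval. The sketch below assumes a single-CPU system, so each row has one count column. parse_irq_line() and irq_rate() are hypothetical helper names, not part of any existing tool.

```c
#include <assert.h>
#include <stdio.h>

/* Parse one numeric /proc/interrupts row on a single-CPU system, e.g.
 * " 21:      66840   IO-APIC-level  ohci_hcd".
 * Returns 1 and fills *irq and *count on success. Returns 0 for rows
 * such as the "CPU0" header or "NMI:", which do not start with a number. */
static int parse_irq_line(const char *line, int *irq, unsigned long *count)
{
    return sscanf(line, " %d: %lu", irq, count) == 2;
}

/* Average rate between two samples of the same IRQ taken `seconds` apart
 * (counter wraparound is not handled in this sketch). */
static double irq_rate(unsigned long before, unsigned long after, double seconds)
{
    assert(seconds > 0.0 && after >= before);
    return (double)(after - before) / seconds;
}
```

A typical use would be to read the IRQ 20 or IRQ 21 row, wait ten seconds, read it again, and pass both counts to irq_rate(). A quiet line such as the "20: 2 ehci_hcd" row from the Shuttle board barely moves; a flooding line climbs by thousands per second.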
Storm moved to IRQ20 from 21 along with EHCI? This is consistent with the acpi_irq_isa=21 experiment, where both EHCI and the flood moved to IRQ20. This is looking more device-specific than ACPI-related at this point.

Are there any USB devices plugged into the system? Any difference if you physically unplug them?

This doesn't directly address the case where EHCI is not loaded but sound is. Perhaps in that case the USB hardware is still generating interrupts and the sound driver is the victim?

I think it is time to enable some debugging code on the USB side. For starters, it shouldn't return IRQ_HANDLED when its hardware didn't actually raise an interrupt.

I did have an Atmel at76c503a-based 802.11b adapter plugged in. Removing it
makes no difference, however.
Seems like the flood will happen whenever any driver is hooked up to the interrupt.
I'm not very clued about PCI interrupt routing/ACPI, but I did notice something.
AP3C is listed as disabled in the first list, but then enabled and assigned to
IRQ 20. In earlier kernels, the same three interrupt links were assigned to IRQ
21. Is it possible that enabling AP3C is related to the interrupt flood?
>grep -i 'irq.*20' 2.6.6-mm2.dmesg
ACPI: PCI Interrupt Link [APCF] (IRQs 20 21 22) *0
ACPI: PCI Interrupt Link [APCG] (IRQs 20 21 22) *0
ACPI: PCI Interrupt Link [APCH] (IRQs 20 21 22) *0
ACPI: PCI Interrupt Link [APCI] (IRQs 20 21 22) *0, disabled.
ACPI: PCI Interrupt Link [APCJ] (IRQs 20 21 22) *0
ACPI: PCI Interrupt Link [APCK] (IRQs 20 21 22) *0, disabled.
ACPI: PCI Interrupt Link [APCL] (IRQs 20 21 22) *0
ACPI: PCI Interrupt Link [APCM] (IRQs 20 21 22) *0, disabled.
ACPI: PCI Interrupt Link [AP3C] (IRQs 20 21 22) *0, disabled.
ACPI: PCI Interrupt Link [APCZ] (IRQs 20 21 22) *0, disabled.
ACPI: PCI Interrupt Link [APCL] enabled at IRQ 20
00:00:02[C] -> 2-20 -> IRQ 20 level high
ACPI: PCI Interrupt Link [APCJ] enabled at IRQ 20
ACPI: PCI Interrupt Link [AP3C] enabled at IRQ 20
IRQ20 -> 0:20
ehci_hcd 0000:00:02.2: irq 20, pci mem de90c000
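The grep above picks out the link state from dmesg by eye. The sketch below does the same for the "enabled at IRQ" lines in code. It assumes the exact message wording shown in this 2.6.6-mm2 output, which other kernel versions may phrase differently. parse_link_enable() and link_enabled_as() are hypothetical helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse "ACPI: PCI Interrupt Link [AP3C] enabled at IRQ 20".
 * ACPI name segments are at most 4 characters, hence name[5].
 * Returns 1 on a match. Returns 0 for any other line, including the
 * "(IRQs 20 21 22) *0, disabled." inventory lines. */
static int parse_link_enable(const char *line, char name[5], int *irq)
{
    assert(line != NULL && name != NULL && irq != NULL);
    return sscanf(line, "ACPI: PCI Interrupt Link [%4[^]]] enabled at IRQ %d",
                  name, irq) == 2;
}

/* True if this dmesg line reports the named link being enabled. */
static int link_enabled_as(const char *line, const char *want)
{
    char name[5];
    int irq;
    return parse_link_enable(line, name, &irq) && strcmp(name, want) == 0;
}
```

Run over the 2.6.6-mm2 dmesg, this would flag the AP3C line along with APCL and APCJ, all at IRQ 20.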
Hmm, I'm not sure why this is assigned to me; it looks like an ACPI issue with IRQ routing on certain NF2 boards. It's clearly not an NF2-always issue, or an EHCI-only issue. For example, here's one Shuttle NF2 motherboard:

           CPU0
  0:  172106932          XT-PIC  timer
  1:     136176    IO-APIC-edge  i8042
  4:     532472    IO-APIC-edge  serial
  9:          0   IO-APIC-level  acpi
 14:     504035    IO-APIC-edge  ide0
 18:     467827   IO-APIC-level  net2280
 20:          2   IO-APIC-level  ehci_hcd
 21:      66840   IO-APIC-level  ohci_hcd
 22:    8836112   IO-APIC-level  ohci_hcd, eth0
NMI:       9765
LOC:  172106898
ERR:          0
MIS:          0

That's using "forcedeth", which seems wasteful of IRQ22, but there's no hint of a flood on IRQ21 (which has a mouse). And "net2280" is used on this system mostly as a network link, in an IRQ-per-packet mode; it's generally faster than 100baseT full-duplex links.

For the record, the reasoning behind returning IRQ_HANDLED is that there are a number of cases where the controller can schedule an IRQ for "later", by which time the driver may already have handled the relevant event. That's not limited to the cases in which the watchdog timer (needed mostly for some flakiness on VIA hardware) fires. However, I may be able to record enough state to identify a few of the cases where no such events are possible, like a completely idle controller, and report those cases with IRQ_NONE.

Noel, what happens when you rmmod both OHCI and EHCI, then reload them (EHCI first, then OHCI)? I've had to do that with a recent non-NF2 motherboard when the BIOS did strange init. Is there any USB "legacy" (keyboard/mouse) support enabled in the BIOS? If so, try disabling it to see if this still happens.

David,
Removing EHCI/OHCI and then reloading them in either order: no change.
Booting without EHCI/OHCI and then loading EHCI first: no change.
I did have legacy USB enabled in the BIOS. Turned it off: no change.

Created attachment 2933
hack disabling AP3C PCI interrupt link
Going on my earlier suspicion, I threw this hack in to make sure that the AP3C
PCI Interrupt Link was not enabled.
It works! Or, at least, the flood is stopped. Sound works properly. I haven't
tested a USB 2.0 device yet, though. Will do that this weekend.
Created attachment 2934
dmesg with AP3C hack
Success! With the (admittedly ugly) hack forcing ACPI to skip AP3C, there is no interrupt flood, and both EHCI and on-chip sound work properly. I used an FA120 USB-Ethernet adapter (usbnet driver). To make sure that it was working with EHCI, I removed ohci_hcd and was still able to connect through the FA120. In /proc/interrupts, IRQ 20 (assigned to EHCI and sound) was incrementing with the network traffic.

It seems obvious that the AP3C interrupt link is the source of the interrupt floods. So far, I haven't found anything that doesn't work when it's forced off.

Remaining questions:
1) What is AP3C connected to? Does disabling it break anything? What else should I test?
2) Why is AP3C enabled on my motherboard? Is it related to BIOS or ACPI problems, or is it a hardware issue?
3) Could AP3C have the opposite polarity to the other (EHCI and sound) interrupts that are linked with it?

Hmm, looks like an ACPI/BIOS interrupt routing bug after all.

1) What is AP3C connected to? Does disabling it break anything? What else should I test?

Please attach the output from acpidmp, available in /usr/sbin/ or in pmtools:
http://ftp.kernel.org/pub/linux/kernel/people/lenb/acpi/utils/
Please also attach the output of lspci -vv on this system before the hack. Together they'll tell us which devices are connected to which interrupt lines.

2) Why is AP3C enabled on my motherboard? Is it related to BIOS or ACPI problems, or is it a hardware issue?

Links are enabled when PCI devices refer to them. Somewhere in your system there is now a device without an interrupt:

pci_link-0620 [73] acpi_pci_link_get_irq : Invalid link context

If you didn't see any change in /proc/interrupts before and after the AP3C hack, then it may be that the device doesn't have a driver loaded. Or maybe we're enabling links when it isn't necessary, and on this system that unmasks a platform bug?
3) Could AP3C have the opposite polarity to the other (EHCI and sound) interrupts that are linked with it?

No. PCI Interrupt Links all have PCI interrupt flags, by definition.

Ah, I see the acpidmp for Noel's system above, and it shows that AP3C has 5 customers:

Device (PCI0)/_PRT/APIC:
    Package (0x04) { 0x000CFFFF, 0x00, \_SB.PCI0.AP3C, 0x00 },
Device (HUB1)/_PRT/APIC:
    Package (0x04) { 0x0001FFFF, 0x00, \_SB.PCI0.AP3C, 0x00 },
    Package (0x04) { 0x0001FFFF, 0x01, \_SB.PCI0.AP3C, 0x00 },
    Package (0x04) { 0x0001FFFF, 0x02, \_SB.PCI0.AP3C, 0x00 },
    Package (0x04) { 0x0001FFFF, 0x03, \_SB.PCI0.AP3C, 0x00 }
})

Device C, Pin A
Device 1, Pins A, B, C, D

lspci -vv will show the pins, but lspci -v shows 2 sub-functions on device 1, and does not show any on device C:

0000:00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a3)
0000:00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)

Created attachment 2955
lspci -vv (2.6.6-mm4 without hack)
Here's the lspci -vv on vanilla 2.6.6-mm4. I see you found the acpidmp.
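Len's decoding of the acpidmp excerpt follows the ACPI _PRT row layout. The excerpt gives Device C, Pin A and Device 1, Pins A-D, or five customers for AP3C. In each row, the Address field holds the PCI device number in its high word and 0xFFFF ("any function") in its low word. Pin values 0-3 mean INTA-INTD. The sketch below decodes that layout, using hypothetical names (struct prt_entry, prt_device(), prt_pin_letter(), link_customers()). It does not track which bridge's _PRT (PCI0 or HUB1) each row came from, although that determines which bus the device number is on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified _PRT row: Package { Address, Pin, Source, SourceIndex }.
 * SourceIndex is omitted; it is 0x00 for every row in the excerpt. */
struct prt_entry {
    uint32_t address;     /* high word: PCI device; low word 0xFFFF = any function */
    uint8_t pin;          /* 0..3 = INTA..INTD */
    const char *source;   /* interrupt link path */
};

/* PCI device number encoded in the Address field. */
static unsigned int prt_device(const struct prt_entry *e)
{
    return (unsigned int)(e->address >> 16);
}

/* Pin letter, as printed in dmesg ("[C]") and by lspci ("pin A"). */
static char prt_pin_letter(const struct prt_entry *e)
{
    assert(e->pin <= 3);
    return (char)('A' + e->pin);
}

/* Number of rows ("customers") routed through the given link. */
static unsigned int link_customers(const struct prt_entry *rows, unsigned int n,
                                   const char *link)
{
    unsigned int count = 0;
    for (unsigned int i = 0; i < n; i++)
        if (strcmp(rows[i].source, link) == 0)
            count++;
    return count;
}
```

Feeding it the five Package rows above yields device 0x0C pin A plus device 1 pins A through D, all on \_SB.PCI0.AP3C.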
Created attachment 2956
lspci -vv (2.6.6-mm4 without hack)
Doh! Paper bag, please! Forgot to run as root.
0000:00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
        Interrupt: pin A routed to IRQ 23

SMBus is the only one of the 5 potential customers that shows up in lspci -vv and has an interrupt.

Created attachment 2957
Bjorn's PRT patch vs 2.6.7
Please test this 2.6.7 patch (should apply to your 2.6.6-mm tree)
from Bjorn Helgaas. Before this patch, mp_parse_prt() enables
all PCI links. After this patch, the IRQs are enabled as demanded
by the PCI devices. Please attach the resulting dmesg and /proc/interrupts.
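The difference between the two behaviors described here can be shown with a toy model. Before the patch, mp_parse_prt() enables every link up front; after it, a link is enabled only when a PCI device actually needs it. This sketch is not Bjorn's patch. struct prt_route, struct pci_fn, eager_enables() and lazy_enables() are hypothetical. The test data is illustrative and loosely based on this board: a device 2, pin C route through APCL, plus the PCI0 _PRT row for device 0x0C, pin A through AP3C, for which lspci showed no device.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified routing row for a single bus. */
struct prt_route {
    unsigned int device;
    char pin;              /* 'A'..'D' */
    const char *link;      /* e.g. "AP3C" */
};

/* A PCI function that is actually present and uses an interrupt pin. */
struct pci_fn {
    unsigned int device;
    char pin;
};

/* Eager policy (pre-patch behavior as described above): a link is
 * enabled if any routing row mentions it. */
static int eager_enables(const struct prt_route *prt, int nprt, const char *link)
{
    assert(nprt >= 0);
    for (int i = 0; i < nprt; i++)
        if (strcmp(prt[i].link, link) == 0)
            return 1;
    return 0;
}

/* Lazy policy (post-patch behavior as described above): a link is enabled
 * only if some present function's (device, pin) pair routes through it. */
static int lazy_enables(const struct prt_route *prt, int nprt,
                        const struct pci_fn *fns, int nfns, const char *link)
{
    assert(nprt >= 0 && nfns >= 0);
    for (int f = 0; f < nfns; f++)
        for (int i = 0; i < nprt; i++)
            if (prt[i].device == fns[f].device && prt[i].pin == fns[f].pin &&
                strcmp(prt[i].link, link) == 0)
                return 1;
    return 0;
}
```

With only device 2, pin C present, the eager policy still turns AP3C on, while the lazy policy leaves it off. That is the same outcome Noel's hack forced by hand.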
Created attachment 2962
/proc/interrupts with bjorn's patch
Looks good. There's no interrupt flood. Everything seems to work properly. I
also noticed that this fixes anomalous ACPI THRM readings in dmesg. Previously,
it was giving weird readings (-121 C, 8 C, 2 C, ...). Now the temperatures are
reasonable.
Created attachment 2963
dmesg on 2.6.6-mm4 with bjorn's patch
Created attachment 2967
Bjorn's PRT patch vs 2.6.7 (updated)
Minor cleanups to the previous version of this patch.
I'm checking this version into the acpi-test tree.
Tested with the updated version of Bjorn's PRT patch. Everything works great. No changes in dmesg from Bjorn's previous patch, except for formatting (global_irq_base -> gsi_base, GSI in decimal instead of hex). Kudos!

Shipped in 2.6.8.1.

Did not back-port to 2.4.

Closing.