Bug 14783
Summary: | Unhandled IRQ on Thinkpad R61i: "irq 16: nobody cared" | ||
---|---|---|---|
Product: | Drivers | Reporter: | Stefan Zegenhagen (sz) |
Component: | Video(DRI - Intel) | Assignee: | drivers_video-dri-intel (drivers_video-dri-intel) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | akpm, rjw, roxxor-linux, sz |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.32 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 14230 | ||
Attachments: |
device list (output of lspci -vv)
output of /proc/interrupts during normal conditions output of lspci -vv from Thinkpad x200s |
Description
Stefan Zegenhagen
2009-12-10 19:14:00 UTC
Created attachment 24141 [details]
device list (output of lspci -vv)
urgh. [<c1272210>] (ahci_interrupt+0x0/0x530) [<f86fc980>] (yenta_interrupt+0x0/0xf0 [yenta_socket]) [<f8456ef0>] (usb_hcd_irq+0x0/0x60 [usbcore]) [<c120fe60>] (i915_driver_irq_handler+0x0/0x530) are you able to disable some/all of these drivers and then, by a process of elimination, work out which one is playing up? (It might not be any of them, but I suspect it is). The first one to suspect is i915 - it's been changing a lot recently and has a relatively high level of bug reporting and fixing still. I don't know if that's representative already, but without i915 I've not yet observed the problem. $ uptime 14:23:41 up 12:33, 6 users, load average: 0.30, 0.13, 0.04 cat /proc/interrupts ... 16: 65065 18860 IO-APIC-fasteoi ahci, yenta, uhci_hcd:usb4 ... IRQ numbers for CPU0/1 don't indicate any occurrence of an interrupt storm. Hi All, I have the same problem on my x200s Thinkpad: irq 16: nobody cared (try booting with the "irqpoll" option) Pid: 0, comm: swapper Not tainted 2.6.32 #1 Call Trace: <IRQ> [<ffffffff8105dd79>] __report_bad_irq+0x38/0x87 [<ffffffff8105dee0>] note_interrupt+0x118/0x184 [<ffffffff8105e5c8>] handle_fasteoi_irq+0xa8/0xd5 [<ffffffff8100d95e>] handle_irq+0x1f/0x27 [<ffffffff8100cfc4>] do_IRQ+0x5e/0xbd [<ffffffff8100b713>] ret_from_intr+0x0/0xa <EOI> [<ffffffff812095db>] ? acpi_idle_enter_simple+0x15b/0x18b [<ffffffff812095d1>] ? acpi_idle_enter_simple+0x151/0x18b [<ffffffff8131ede3>] ? cpuidle_idle_call+0x98/0xd1 [<ffffffff8100a2bf>] ? cpu_idle+0x52/0x9e [<ffffffff81407870>] ? start_secondary+0x19c/0x1a1 handlers: [<ffffffff8124939a>] (i915_driver_irq_handler+0x0/0x4a2) [<ffffffff8129a192>] (ahci_interrupt+0x0/0x466) [<ffffffff812cda80>] (usb_hcd_irq+0x0/0x67) Disabling IRQ #16 I recognized, that this error occurs during high IO on the harddisk. After the error came up, I can't really work anymore (less than 1MB/s IO). The only thing I can do is rebooting. I had this behavior with the 2.6.31 and now with the .32 but this time more often than with version .31. I can provide more log outputs this evening, when I have access to my computer. Best Regards, Mario Hej there, here are some outputs of lspci and /proc/interrupts. These outputs were generated, while the system did work correct. I'll attach updated files is the error occurs again. uname -a Linux [name] 2.6.32 #1 SMP PREEMPT Sun Dec 6 13:12:40 CET 2009 x86_64 GNU/Linux Best regards, Mario Created attachment 24204 [details]
output of /proc/interrupts during normal conditions
Created attachment 24205 [details]
output of lspci -vv from Thinkpad x200s
I've done some more analysis regarding this bug. I've been running my laptop with only AHCI and i915 drivers. Adding printk()s in the IRQ handlers, I was able to get the following IRQ statistics in the *good* case: 12:24:47 kernel: ================================== 12:24:47 kernel: IRQ16: 10000 interrupts in 6736 jiffies. IRQs received: 12:24:47 kernel: i915: I915_USER_INTERRUPT = 148 12:24:47 kernel: i915: I915_DISPLAY_PIPE_B_EVENT_INTERRUPT = 7 12:24:47 kernel: i915: I915_DISPLAY_PIPE_A_EVENT_INTERRUPT = 46 12:24:47 kernel: i915: I915_DISPLAY_PORT_INTERRUPT = 11 12:24:47 kernel: i915: PIPEA_FIFO_UNDERRUN_STATUS = 75 12:24:47 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 2770 12:24:47 kernel: => i915: AND bits: 0x00000000, 212 events 12:24:47 kernel: ahci: IRQ_PORT0 = 7179 12:24:47 kernel: => ahci: AND bits: 0x00000001, 7179 events 12:24:47 kernel: TOTAL: 10236 / 10000 interrupts! The "AND bits" is the value of the IRQ status register ANDed when not zero to check if there is any bit that's permanently set, thus causing the IRQ storm. Looks good. In the *bad* case, it looks like this: 12:24:47 kernel: ================================== 12:24:47 kernel: IRQ16: 10000 interrupts in 25 jiffies. IRQs received: 12:24:47 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 40 12:24:47 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events 12:24:47 kernel: => ahci: AND bits: 0xFFFFFFFF, 0 events 12:24:47 kernel: TOTAL: 40 / 10000 interrupts! 12:24:48 kernel: ================================== 12:24:48 kernel: IRQ16: 10000 interrupts in 25 jiffies. IRQs received: 12:24:48 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 39 12:24:48 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events 12:24:48 kernel: => ahci: AND bits: 0xFFFFFFFF, 0 events 12:24:48 kernel: TOTAL: 39 / 10000 interrupts! 12:24:48 kernel: ================================== 12:24:48 kernel: IRQ16: 10000 interrupts in 25 jiffies. IRQs received: 12:24:48 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 39 12:24:48 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events 12:24:48 kernel: => ahci: AND bits: 0xFFFFFFFF, 0 events 12:24:48 kernel: TOTAL: 39 / 10000 interrupts! 12:24:48 kernel: ================================== 12:24:48 kernel: IRQ16: 10000 interrupts in 26 jiffies. IRQs received: 12:24:48 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 40 12:24:48 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events 12:24:48 kernel: => ahci: AND bits: 0xFFFFFFFF, 0 events 12:24:48 kernel: TOTAL: 40 / 10000 interrupts! 12:24:48 kernel: ================================== 12:24:48 kernel: IRQ16: 10000 interrupts in 21 jiffies. IRQs received: 12:24:48 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 36 12:24:48 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events 12:24:48 kernel: ahci: IRQ_PORT0 = 2 12:24:48 kernel: => ahci: AND bits: 0x00000001, 2 events 12:24:48 kernel: TOTAL: 38 / 10000 interrupts! The statistics accounting and the printk()s introduce a sort of "rate limiting" that avoids disabling the IRQ. However, it is quite obvious that the majority of IRQ16 detections is not handled by either of the drivers. IRQ status registers do not show any bits that are permanently set. There's an oddity in the i915 hardware regarding the IRQ status register IIR: the PIPE{A,B}_FIFO_UNDERRUN events are not visible in this register. The driver reads some other registers in every IRQ to manually check pipe statuses. That's probably why the events are not accounted for in the i915-IIR AND-bit accounting. Maybe there're other bits in the i915 IIR register that behave similar to the pipestat interrupt bit? The problem still exists with an unmodified kernel 2.6.32.2. First I thought the problem was gone, because interrupt 16 didn't get disabled after several hours of operations, but it seems that the IRQ handler isn't called as often, leading to some sort of "rate limiting" and therefore not going over the detection limit. The IRQ occurrence count in /proc/interrupts, however, is increasing very fast and the ondemand CPUfreq gouvernor keeps CPU#0 running at highest speed. Considering this, the problem might have crept in much earlier than 2.6.32. It seems that a suspend/resume cycle successfully clears the source of the interrupt storm, at least temporarily. The problem appears to have gone away with recent 2.6.32-kernels. If nobody complains, the bug can be closed. |