Bug 14783

Summary: Unhandled IRQ on Thinkpad R61i: "irq 16: nobody cared"
Product: Drivers Reporter: Stefan Zegenhagen (sz)
Component: Video(DRI - Intel)Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, rjw, roxxor-linux, sz
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 14230    
Attachments: device list (output of lspci -vv)
output of /proc/interrupts during normal conditions
output of lspci -vv from Thinkpad x200s

Description Stefan Zegenhagen 2009-12-10 19:14:00 UTC
Hi everyone,

while running a vanilla kernel 2.6.32 on my Thinkpad R61i, I get the following after a while. Since AHCI, video and USB are affected, the system becomes almost unusable. Same behaviour does not occur with kernel 2.6.31.6. I've no idea, which part of the hardware causes the IRQ storm [and no time to test].

I've found no way to really reproduce the bug, it just happens sooner or later.

I'm running i386 Linux on a Core2 Duo T2310 @ 1.46GHz. Device inventory (lspci -vv) attached.

--- snip ---
irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.32 #2
Call Trace:
[<c10627e4>] ? __report_bad_irq+0x24/0x90
[<c10629a2>] ? note_interrupt+0x152/0x190
[<c10630eb>] ? handle_fasteoi_irq+0xab/0xd0
[<c10057f7>] ? handle_irq+0x17/0x20
[<c1004e3c>] ? do_IRQ+0x4c/0xc0
[<c105036d>] ? tick_notify+0x1ed/0x3c0
[<c1356d9e>] ? _spin_unlock_irqrestore+0xe/0x30
[<c105036d>] ? tick_notify+0x1ed/0x3c0
[<c11baeea>] ? acpi_hw_read_port+0x23/0x71
[<c1003529>] ? common_interrupt+0x29/0x30
[<c1003529>] ? common_interrupt+0x29/0x30
[<c104007b>] ? worker_thread+0xcb/0x1e0
[<c1356d94>] ? _spin_unlock_irqrestore+0x4/0x30
[<c11c9c0c>] ? acpi_idle_enter_bm+0x29b/0x2b1
[<c1296475>] ? cpuidle_idle_call+0x85/0xd0
[<c1001f45>] ? cpu_idle+0x55/0x90
[<c148779f>] ? start_kernel+0x24c/0x285
[<c1487344>] ? unknown_bootoption+0x0/0x1c1
handlers:
[<c1272210>] (ahci_interrupt+0x0/0x530)
[<f86fc980>] (yenta_interrupt+0x0/0xf0 [yenta_socket])
[<f8456ef0>] (usb_hcd_irq+0x0/0x60 [usbcore])
[<c120fe60>] (i915_driver_irq_handler+0x0/0x530)
Disabling IRQ #16
--- snip ---
Comment 1 Stefan Zegenhagen 2009-12-10 19:15:36 UTC
Created attachment 24141 [details]
device list (output of lspci -vv)
Comment 2 Andrew Morton 2009-12-10 22:49:19 UTC
urgh.

[<c1272210>] (ahci_interrupt+0x0/0x530)
[<f86fc980>] (yenta_interrupt+0x0/0xf0 [yenta_socket])
[<f8456ef0>] (usb_hcd_irq+0x0/0x60 [usbcore])
[<c120fe60>] (i915_driver_irq_handler+0x0/0x530)

are you able to disable some/all of these drivers and then, by a process of elimination, work out which one is playing up?

(It might not be any of them, but I suspect it is).

The first one to suspect is i915 - it's been changing a lot recently and has a relatively high level of bug reporting and fixing still.
Comment 3 Stefan Zegenhagen 2009-12-13 13:28:15 UTC
I don't know if that's representative already, but without i915 I've not yet observed the problem.

$ uptime
 14:23:41 up 12:33,  6 users,  load average: 0.30, 0.13, 0.04

cat /proc/interrupts
...
 16:      65065      18860   IO-APIC-fasteoi   ahci, yenta, uhci_hcd:usb4
...

IRQ numbers for CPU0/1 don't indicate any occurrence of an interrupt storm.
Comment 4 Mario Mlynek 2009-12-15 09:30:18 UTC
Hi All,

I have the same problem on my x200s Thinkpad:

irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Not tainted 2.6.32 #1 Call Trace:
 <IRQ>  [<ffffffff8105dd79>] __report_bad_irq+0x38/0x87  [<ffffffff8105dee0>] note_interrupt+0x118/0x184  [<ffffffff8105e5c8>] handle_fasteoi_irq+0xa8/0xd5  [<ffffffff8100d95e>] handle_irq+0x1f/0x27  [<ffffffff8100cfc4>] do_IRQ+0x5e/0xbd  [<ffffffff8100b713>] ret_from_intr+0x0/0xa  <EOI>  [<ffffffff812095db>] ? acpi_idle_enter_simple+0x15b/0x18b
 [<ffffffff812095d1>] ? acpi_idle_enter_simple+0x151/0x18b
 [<ffffffff8131ede3>] ? cpuidle_idle_call+0x98/0xd1  [<ffffffff8100a2bf>] ? cpu_idle+0x52/0x9e  [<ffffffff81407870>] ? start_secondary+0x19c/0x1a1
handlers:
[<ffffffff8124939a>] (i915_driver_irq_handler+0x0/0x4a2)
[<ffffffff8129a192>] (ahci_interrupt+0x0/0x466) [<ffffffff812cda80>] (usb_hcd_irq+0x0/0x67) Disabling IRQ #16

I recognized, that this error occurs during high IO on the harddisk. After the error came up, I can't really work anymore (less than 1MB/s IO). The only thing I can do is rebooting.

I had this behavior with the 2.6.31 and now with the .32 but this time more often than with version .31.

I can provide more log outputs this evening, when I have access to my computer.

Best Regards,
Mario
Comment 5 Mario Mlynek 2009-12-15 20:57:55 UTC
Hej there,
here are some outputs of lspci and /proc/interrupts.

These outputs were generated, while the system did work correct.
I'll attach updated files is the error occurs again.

uname -a 
Linux [name] 2.6.32 #1 SMP PREEMPT Sun Dec 6 13:12:40 CET 2009 x86_64 GNU/Linux

Best regards,
Mario
Comment 6 Mario Mlynek 2009-12-15 20:59:22 UTC
Created attachment 24204 [details]
output of /proc/interrupts during normal conditions
Comment 7 Mario Mlynek 2009-12-15 21:00:24 UTC
Created attachment 24205 [details]
output of lspci -vv from Thinkpad x200s
Comment 8 Stefan Zegenhagen 2009-12-30 11:28:05 UTC
I've done some more analysis regarding this bug. I've been running my laptop with only AHCI and i915 drivers. Adding printk()s in the IRQ handlers, I was able to get the following IRQ statistics in the *good* case:

12:24:47 kernel: ==================================
12:24:47 kernel: IRQ16: 10000 interrupts in 6736 jiffies. IRQs received:
12:24:47 kernel: i915: I915_USER_INTERRUPT = 148
12:24:47 kernel: i915: I915_DISPLAY_PIPE_B_EVENT_INTERRUPT = 7
12:24:47 kernel: i915: I915_DISPLAY_PIPE_A_EVENT_INTERRUPT = 46
12:24:47 kernel: i915: I915_DISPLAY_PORT_INTERRUPT = 11
12:24:47 kernel: i915: PIPEA_FIFO_UNDERRUN_STATUS = 75
12:24:47 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 2770
12:24:47 kernel: => i915: AND bits: 0x00000000, 212 events
12:24:47 kernel: ahci: IRQ_PORT0 = 7179
12:24:47 kernel: => ahci: AND bits: 0x00000001, 7179 events
12:24:47 kernel: TOTAL: 10236 / 10000 interrupts!

The "AND bits" is the value of the IRQ status register ANDed when not zero to check if there is any bit that's permanently set, thus causing the IRQ storm. Looks good.

In the *bad* case, it looks like this:

12:24:47 kernel: ==================================
12:24:47 kernel: IRQ16: 10000 interrupts in 25 jiffies. IRQs received:
12:24:47 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 40
12:24:47 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events
12:24:47 kernel: => ahci: AND bits: 0xFFFFFFFF, 0 events
12:24:47 kernel: TOTAL: 40 / 10000 interrupts!
12:24:48 kernel: ==================================
12:24:48 kernel: IRQ16: 10000 interrupts in 25 jiffies. IRQs received:
12:24:48 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 39
12:24:48 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events
12:24:48 kernel: => ahci: AND bits: 0xFFFFFFFF, 0 events
12:24:48 kernel: TOTAL: 39 / 10000 interrupts!
12:24:48 kernel: ==================================
12:24:48 kernel: IRQ16: 10000 interrupts in 25 jiffies. IRQs received:
12:24:48 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 39
12:24:48 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events
12:24:48 kernel: => ahci: AND bits: 0xFFFFFFFF, 0 events
12:24:48 kernel: TOTAL: 39 / 10000 interrupts!
12:24:48 kernel: ==================================
12:24:48 kernel: IRQ16: 10000 interrupts in 26 jiffies. IRQs received:
12:24:48 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 40
12:24:48 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events
12:24:48 kernel: => ahci: AND bits: 0xFFFFFFFF, 0 events
12:24:48 kernel: TOTAL: 40 / 10000 interrupts!
12:24:48 kernel: ==================================
12:24:48 kernel: IRQ16: 10000 interrupts in 21 jiffies. IRQs received:
12:24:48 kernel: i915: PIPEB_FIFO_UNDERRUN_STATUS = 36
12:24:48 kernel: => i915: AND bits: 0xFFFFFFFF, 0 events
12:24:48 kernel: ahci: IRQ_PORT0 = 2
12:24:48 kernel: => ahci: AND bits: 0x00000001, 2 events
12:24:48 kernel: TOTAL: 38 / 10000 interrupts!

The statistics accounting and the printk()s introduce a sort of "rate  limiting" that avoids disabling the IRQ. However, it is quite obvious that the majority of IRQ16 detections is not handled by either of the drivers. IRQ status registers do not show any bits that are permanently set.

There's an oddity in the i915 hardware regarding the IRQ status register IIR: the PIPE{A,B}_FIFO_UNDERRUN events are not visible in this register. The driver reads some other registers in every IRQ to manually check pipe statuses. That's probably why the events are not accounted for in the i915-IIR AND-bit accounting. Maybe there're other bits in the i915 IIR register that behave similar to the pipestat interrupt bit?


The problem still exists with an unmodified kernel 2.6.32.2. First I thought the problem was gone, because interrupt 16 didn't get disabled after several hours of operations, but it seems that the IRQ handler isn't called as often, leading to some sort of "rate limiting" and therefore not going over the detection limit. The IRQ occurrence count in /proc/interrupts, however, is increasing very fast and the ondemand CPUfreq gouvernor keeps CPU#0 running at highest speed. Considering this, the problem might have crept in much earlier than 2.6.32.

It seems that a suspend/resume cycle successfully clears the source of the interrupt storm, at least temporarily.
Comment 9 Stefan Zegenhagen 2010-02-01 08:54:55 UTC
The problem appears to have gone away with recent 2.6.32-kernels. If nobody complains, the bug can be closed.