Bug 13733

Summary: 2.6.31-rc2: irq 16: nobody cared
Product: ACPI Reporter: Rafael J. Wysocki (rjw)
Component: Config-InterruptsAssignee: Len Brown (lenb)
Status: CLOSED UNREPRODUCIBLE    
Severity: normal CC: acpi-bugzilla, greg, jirislaby, js+bugs, lenb, niel.lambrechts, rui.zhang, stern, vyacheslavovich, yakui.zhao, zdenek.kabelac
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31-rc2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 13615    
Attachments: acpidump output
lspci -vxxx output
dmesg output (before any issue encountered)
/proc/interrupts output
trace (Jul 10)
/proc/interrupts (2.6.30.2)
dmesg (2.6.30.2)
dmesg of error on 2.6.31.3

Description Rafael J. Wysocki 2009-07-06 23:02:18 UTC
Subject    : 2.6.31-rc2: irq 16: nobody cared
Submitter  : Niel Lambrechts <niel.lambrechts@gmail.com>
Date       : 2009-07-06 18:32
References : http://marc.info/?l=linux-kernel&m=124690524027166&w=4

This entry is being used for tracking a regression from 2.6.30.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 ykzhao 2009-07-07 00:29:23 UTC
Hi
    From the description it seems that this is a regression. Will you please use the git-bisect to identify the commit which causes the regression?
    It will be great if you can attach the output of acpidump, lspci -vxxx, dmesg, /proc/interrupts
    thanks.
Comment 2 Niel Lambrechts 2009-07-08 23:04:25 UTC
Created attachment 22279 [details]
acpidump output
Comment 3 Niel Lambrechts 2009-07-08 23:04:57 UTC
Created attachment 22280 [details]
lspci -vxxx output
Comment 4 Niel Lambrechts 2009-07-08 23:05:54 UTC
Created attachment 22281 [details]
dmesg output (before any issue encountered)
Comment 5 Niel Lambrechts 2009-07-08 23:06:14 UTC
Created attachment 22282 [details]
/proc/interrupts output
Comment 6 Niel Lambrechts 2009-07-08 23:07:59 UTC
Hi, 

I am not certain that I can reproduce this problem yet, due to lack of time to test. If I can reliably do so I will try and bisect and provide an update here.

Thanks
Niel
Comment 7 Niel Lambrechts 2009-07-08 23:20:55 UTC
I just tried reproducing the problem, but could not. I tried once without uhci_hcd loaded and once with it, since I sometimes remove the USB modules to reduce power consumption if I'm on battery (ala powertop).

I will keep running 2.6.31-rc2 for a few days and provide some more feedback.

Thanks
Niel
Comment 8 ykzhao 2009-07-15 03:10:11 UTC
HI, Niel
    Any update about this bug? Can this issue be reproduced?
    Thanks.
Comment 9 Niel Lambrechts 2009-07-15 05:19:08 UTC
Hi there,

I cannot reproduce the problem at will but I have seen it again using 2.6.31-rc2-pae. It happened as I probed for ehci_hcd, and seems to be referring to yenta.

I attach the log.

Regards,
Niel
Comment 10 Niel Lambrechts 2009-07-15 05:20:33 UTC
Created attachment 22346 [details]
trace (Jul 10)
Comment 11 Jiri Slaby 2009-07-15 06:25:25 UTC
Maybe USB fellows have some ideas?

It looks like ehci requested an irq which it doesn't mark as handled.

(From http://lkml.org/lkml/2009/7/15/13 I suppose it doesn't happen when the yenta is loaded alone.)
Comment 12 Len Brown 2009-07-20 14:42:44 UTC
Neil,
Does this happen only after resume from hibernete,
or is it ever seen on a cold boot, or a resume from suspend to RAM?

Can you supply a dmesg and /proc/interrupts from 2.6.30?

UHCI and yenta share IRQ16.
It is possible that when EHCI is initialized, it is somehow
enabling the, as yet no loaded, UHCI to provoke and IRQ16
which yenta can't handle; thus disabling IRQ16

pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
15:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev ba)

uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
uhci_hcd 0000:00:1d.0: setting latency timer to 64
uhci_hcd 0000:00:1d.0: UHCI Host Controller

but to complicate things, ahci is also on IRQ 16,
but it cuts over to MSI:

ahci 0000:00:1f.2: PCI INT B -> GSI 16 (level, low) -> IRQ 16
ahci 0000:00:1f.2: irq 28 for MSI/MSI-X

same for i915 video:

pci 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
pci 0000:00:02.0: setting latency timer to 64
pci 0000:00:02.0: irq 31 for MSI/MSI-X

It would be interesting if "pci=nomsi" fixes this
by putting ahci and/or i915 on irq16.  If yes, that
means that one of those devices is provoking the IOAPIC
interrupt on IRQ16, but is not registered on the irq to handle it.
Comment 13 Niel Lambrechts 2009-07-22 22:24:46 UTC
Created attachment 22456 [details]
/proc/interrupts (2.6.30.2)

as requested.
Comment 14 Niel Lambrechts 2009-07-22 22:25:41 UTC
Created attachment 22457 [details]
dmesg (2.6.30.2)

as requested
Comment 15 Niel Lambrechts 2009-07-22 22:37:22 UTC
Hi Len,

Yes I've only seen it happen after a resume from hibernate, although not directly during the resume, but when I explicitly ran modprobe ehci_hcd.

I've only seen it twice and I unfortunately still cannot reproduce the issue at will even though I still use hibernate on a daily basis with 2.6.31-rcx.

regards,
Niel
Comment 16 Niel Lambrechts 2009-07-27 07:44:12 UTC
Still here on 2.6.31-rc4-pae:

Jul 27 09:37:24 localhost kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jul 27 09:37:24 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.31-rc4-pae #98
Jul 27 09:37:24 localhost kernel: Call Trace:
Jul 27 09:37:24 localhost kernel: [<c015ef37>] __report_bad_irq+0x2e/0x6f
Jul 27 09:37:24 localhost kernel: [<c015f06d>] note_interrupt+0xf5/0x14c
Jul 27 09:37:24 localhost kernel: [<c015f5a6>] handle_fasteoi_irq+0x7d/0x9b
Jul 27 09:37:24 localhost kernel: [<c01046da>] handle_irq+0x1a/0x20
Jul 27 09:37:24 localhost kernel: [<c0103e91>] do_IRQ+0x41/0x95
Jul 27 09:37:24 localhost kernel: [<c0102fe9>] common_interrupt+0x29/0x30
Jul 27 09:37:24 localhost kernel: [<f8d9af96>] ? acpi_idle_enter_c1+0xd8/0xf6 [processor]
Jul 27 09:37:24 localhost kernel: [<c02a0ee6>] cpuidle_idle_call+0x60/0x96
Jul 27 09:37:24 localhost kernel: [<c0101ba7>] cpu_idle+0x49/0x62
Jul 27 09:37:24 localhost kernel: [<c0307197>] rest_init+0x53/0x55
Jul 27 09:37:24 localhost kernel: [<c046a7b2>] start_kernel+0x2ab/0x2b0
Jul 27 09:37:24 localhost kernel: [<c046a071>] __init_begin+0x71/0x79
Jul 27 09:37:24 localhost kernel: handlers:
Jul 27 09:37:24 localhost kernel: [<f8df221a>] (yenta_interrupt+0x0/0xb9 [yenta_socket])
Jul 27 09:37:24 localhost kernel: Disabling IRQ #16

Regards,
Niel
Comment 17 David Brownell 2009-07-27 21:26:17 UTC
Yeah, looks strange.  Maybe the UHCI maintainer could help; EHCI seems to be completely out of the direct loop.  The stack trace shows UHCI is out of the loop too though ... since it's not listed as having registered an IRQ #16 handler when this appears.  Maybe UHCI was unloaded but somehow didn't disable its IRQ handler.

Given how flakey PC hardware and firmware is, I'm a bit surprised that one glitch immediately disables that IRQ.  Surely it would be more robust to require that it happen a few more times before doing that?
Comment 18 Anonymous Emailer 2009-07-27 21:29:08 UTC
Reply-To: david-b@pacbell.net

On Tuesday 14 July 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
> It looks like ehci requested an irq which it doesn't mark as handled.

At most, what happened is that EHCI startup triggered a UHCI
interrupt ... but the UHCI driver wasn't loaded/listening.

If /proc/interrupts doesn't show a UHCI handler on irq 16
then I'd say USB isn't directly invovled.

- Dave
Comment 19 Niel Lambrechts 2009-07-27 22:19:51 UTC
On 07/27/2009 11:29 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> On Tuesday 14 July 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
>    
>> It looks like ehci requested an irq which it doesn't mark as handled.
>>      
>
> At most, what happened is that EHCI startup triggered a UHCI
> interrupt ... but the UHCI driver wasn't loaded/listening.
>
> If /proc/interrupts doesn't show a UHCI handler on irq 16
> then I'd say USB isn't directly invovled.
>    

I can confirm uhci_hcd was not loaded, and that the warning happened 
just as I did a 'modprobe ehci_hcd'.

Regards,
Niel
Comment 20 Rafael J. Wysocki 2009-07-27 22:27:54 UTC
On Monday 27 July 2009, Niel Lambrechts wrote:
> On 07/26/2009 10:28 PM, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.30.  Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=13733
> > Subject             : 2.6.31-rc2: irq 16: nobody cared
> > Submitter   : Niel Lambrechts<niel.lambrechts@gmail.com>
> > Date                : 2009-07-06 18:32 (21 days old)
> > References  : http://marc.info/?l=linux-kernel&m=124690524027166&w=4
> >    
> Hi,
> 
> Bug is still evident on kernels up to 2.6.31-rc4.
Comment 21 Zdenek Kabelac 2009-09-09 06:50:50 UTC
Hmmm

I'm wondering whether this problem could be related to my reported issue with irq18  with kernel 31-rc7

irq 18: nobody cared 
http://lkml.org/lkml/2009/8/28/90

Call Trace:
 <IRQ>  [<ffffffff810a500b>] __report_bad_irq+0x2b/0xa0
 [<ffffffff810a520a>] note_interrupt+0x18a/0x1d0
 [<ffffffff810a588d>] handle_fasteoi_irq+0xcd/0xf0
 [<ffffffff8100ee99>] handle_irq+0x49/0xa0
 [<ffffffff813e1343>] do_IRQ+0x73/0xf0
 [<ffffffff8100cb93>] ret_from_intr+0x0/0xf
 <EOI>  [<ffffffff8129057d>] ? acpi_idle_enter_c1+0xa1/0xc0
 [<ffffffff8107d83d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff81290580>] ? acpi_idle_enter_c1+0xa4/0xc0
 [<ffffffff8129057d>] ? acpi_idle_enter_c1+0xa1/0xc0
 [<ffffffff81334a9b>] ? cpuidle_idle_call+0x9b/0xf0
 [<ffffffff8100aa7b>] ? cpu_idle+0x6b/0xd0
 [<ffffffff813c72ca>] ? rest_init+0x7a/0x80
 [<ffffffff81601b02>] ? start_kernel+0x38f/0x39b
 [<ffffffff81601128>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff8160120d>] ? x86_64_start_kernel+0xe1/0xe8
handlers:
[<ffffffffa0006720>] (usb_hcd_irq+0x0/0xa0 [usbcore])
Disabling IRQ #18


It looked related to mmc....
Comment 22 Rafael J. Wysocki 2009-09-10 20:59:44 UTC
Pasting manually due to the Bugzilla's e-mail interface not working at the moment.

On 09/06/2009 10:30 PM, Rafael J. Wysocki wrote:
> The following bug entry is on the current list of known regressions
> from 2.6.30.  Please verify if it still should be listed and let me know
> (either way).
>
>
> Bug-Entry     : http://bugzilla.kernel.org/show_bug.cgi?id=13733
> Subject               : 2.6.31-rc2: irq 16: nobody cared
> Submitter     : Niel Lambrechts<niel.lambrechts@gmail.com>
> Date          : 2009-07-06 18:32 (63 days old)
> References    : http://marc.info/?l=linux-kernel&m=124690524027166&w=4

Hi Rafael,

This is still present in 2.6.31-rc9-pae.

To test, I removed ehci_hcd, hibernated/resumed and the problem did not appear. But when I first plugged in a USB mouse after the system was up and before ehci_hcd was loaded the error occurred again.
Comment 23 Alan Stern 2009-09-10 21:52:19 UTC
How about if you unload both uhci-hcd and ehci-hcd before hibernating?  If you plug in a USB mouse before reloading either module, does the problem occur?
Comment 24 Niel Lambrechts 2009-09-12 07:58:37 UTC
Hi Alan,

This is presently the case - I'm unloading both ehci_hcd and uhci_hcd before hibernating (I sometimes unload these when running on battery power). If I then plug in the mouse and load ehci_hcd, the warning gets triggered.

Thanks,
Niel
Comment 25 Niel Lambrechts 2009-09-12 13:25:47 UTC
Okay, I can now seemingly consistently reproduce this error:

1. rmmod ehci_hcd && rmmod uhci_hcd
2. hibernate
3. power-up
4. plug in any USB device (I used my Nokia cellphone in the last attempt)
5. modprobe ehci_hcd

Hope this helps,

Niel
Comment 26 Alan Stern 2009-09-12 14:39:44 UTC
Okay, let me ask a few questions to make sure everything is clear.

  1. What happens if you reverse the order of steps 4 and 5?

  2. What happens if in step 5, you modprobe uhci-hcd instead of ehci-hcd?

  3. What happens if you boot with "pci=nomsi", as Len asked in comment #12?
Comment 27 Niel Lambrechts 2009-09-13 18:49:54 UTC
Hi,

Sorry I missed some of those comments due to several blurry 16-hour days. :)

Here are the results:

1. What happens if you reverse the order of steps 4 and 5?

Error did not occur.

2. What happens if in step 5, you modprobe uhci-hcd instead of ehci-hcd?

Error did not occur.

Side note #1: After this test I did not reboot straight away. I removed the modules again, hibernated/resumed, plugged in a USB device and probed ehci_hcd - and the error again occurred.

3. What happens if you boot with "pci=nomsi", as Len asked in comment #12?

For this I therefore did another fresh reboot, manually specifying pci=nomsi on grub menu.

Side note #2: I encountered another problem at first. When hibernating and resuming like this, I had all types of EXT4 errors and resume did not complete, I am assuming this is since grub does not enforce pci=nomsi during the 2nd (resume) boot since it was not an option listed in menu.lst. (My grub menu does not appear/allow interaction when resuming). Can it be that pci=nomsi was not passed to a kernel that previously had it?

Anyhow, when I booted after putting pci=nomsi in menu.lst for grub and plugged a device before probing ehci_hcd, the "Disabling IRQ #16" message occurred again as per usual.

Hope this is meaningful!

Niel
Comment 28 Niel Lambrechts 2009-09-13 19:02:14 UTC
Using powertop I also notice that after this my power consumption shoots sky high after this error. Instead of the normal 11-14W I get:

----------------
Power usage (ACPI estimate): 28.0W (0.8 hours)


Top causes for wakeups:
  99.8% (86410.4)       <interrupt> : ahci, yenta, i915@pci:0000:00:02.0
  0.1% ( 97.3)       <interrupt> : PS/2 keyboard/mouse/touchpad

---------

and this is on the IRQ that was just disabled:
 16:  149416472       8450   IO-APIC-fasteoi   ahci, yenta, i915@pci:0000:00:02.0

I am not familiar with how this all works (i.e. what disabling an interrupt might lead to), so sorry I am mentioning something very obvious.

Niel
Comment 29 Alan Stern 2009-09-15 02:40:57 UTC
Got to admit, I don't see what could be causing this.  Did everything work correctly with an earlier kernel such as 2.6.30?  If it did, then git bisect might be the best approach to finding the cause of the problem.
Comment 30 Zhang Rui 2009-10-09 02:48:13 UTC
ping niel...
Comment 31 Niel Lambrechts 2009-10-13 06:24:50 UTC
I'll test some more with 2.6.31 stable and 2.6.32-latest - I have not recently noticed this error again.

Alan, the problem started for me in the early 2.6.31 cycle and was probably not present in 2.6.30 or 2.6.29 as I tested quite extensively on those levels as well.

cheers
Niel
Comment 32 Niel Lambrechts 2009-10-15 07:37:27 UTC
Created attachment 23419 [details]
dmesg of error on 2.6.31.3
Comment 33 Niel Lambrechts 2009-10-15 07:40:08 UTC
So the problem happened again on stable - 2.6.31.3.

Steps involved were similar to those in comment #25, but I actually cannot seem to duplicate this consistently.

Is there anything else I can do to make the debugging more verbose that might help nail this?

Regards,
Niel
Comment 34 Alan Stern 2009-10-15 14:14:25 UTC
You can enable CONFIG_USB_DEBUG.  I don't think it will help much, though.

Have you tried loading uhci-hcd before hibernating?  That might at least prevent the problem from occurring, although it won't fix anything.  I believe the cause is a bug in the motherboard or the BIOS.
Comment 35 Niel Lambrechts 2009-10-15 19:07:53 UTC
I don't care about the bug other than resolving this in the kernel, it is not impacting me as far as I can tell.

Question, if this is a h/w bug why does it not occur in 2.6.30 / earlier?


Niel
Comment 36 Alan Stern 2009-10-15 19:40:42 UTC
I don't know, and the only way I can think of to figure it out is for you to carry out a bisection search.  I would be very surprised if the cause ended up being something in the USB stack.  It's much more likely to lie in some completely different subsystem.
Comment 37 Jochen Spieker 2009-10-15 20:35:13 UTC
I am not competent enough to give real insights into this issue, but I don't think this is USB-related as well.

I am getting hit by this issue occasionally as well on my Thinkpad X200 (at least I get the "nobody cared" message as well) and I have only seen this happen when I let the machine run over night while nobody is using it and besides internal bluetooth+webcam I seldomly use USB hardware.
Comment 38 Rafael J. Wysocki 2009-11-21 22:48:37 UTC
On Saturday 21 November 2009, Niel Lambrechts wrote:
> Hi Rafael,
> 
> On 21 Nov 2009, at 5:02 PM, Rafael J. Wysocki wrote:
> 
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> > 
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31.  Please verify if it still should
> > be listed and let me know (either way).
> 
> I no longer have readily access to the laptop that showed this problem, and
> will therefore not be able to test if the problem has been resolved in later
> levels.
> 
> So the bug can be closed unless anyone else feels differently.
Comment 39 Rafael J. Wysocki 2009-11-21 22:49:09 UTC
Closing.