Bug 13733
Summary: | 2.6.31-rc2: irq 16: nobody cared | ||
---|---|---|---|
Product: | ACPI | Reporter: | Rafael J. Wysocki (rjw) |
Component: | Config-Interrupts | Assignee: | Len Brown (lenb) |
Status: | CLOSED UNREPRODUCIBLE | ||
Severity: | normal | CC: | acpi-bugzilla, greg, jirislaby, js+bugs, lenb, niel.lambrechts, rui.zhang, stern, vyacheslavovich, yakui.zhao, zdenek.kabelac |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.31-rc2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 13615 | ||
Attachments: |
acpidump output
lspci -vxxx output dmesg output (before any issue encountered) /proc/interrupts output trace (Jul 10) /proc/interrupts (2.6.30.2) dmesg (2.6.30.2) dmesg of error on 2.6.31.3 |
Description
Rafael J. Wysocki
2009-07-06 23:02:18 UTC
Hi From the description it seems that this is a regression. Will you please use the git-bisect to identify the commit which causes the regression? It will be great if you can attach the output of acpidump, lspci -vxxx, dmesg, /proc/interrupts thanks. Created attachment 22279 [details]
acpidump output
Created attachment 22280 [details]
lspci -vxxx output
Created attachment 22281 [details]
dmesg output (before any issue encountered)
Created attachment 22282 [details]
/proc/interrupts output
Hi, I am not certain that I can reproduce this problem yet, due to lack of time to test. If I can reliably do so I will try and bisect and provide an update here. Thanks Niel I just tried reproducing the problem, but could not. I tried once without uhci_hcd loaded and once with it, since I sometimes remove the USB modules to reduce power consumption if I'm on battery (ala powertop). I will keep running 2.6.31-rc2 for a few days and provide some more feedback. Thanks Niel HI, Niel Any update about this bug? Can this issue be reproduced? Thanks. Hi there, I cannot reproduce the problem at will but I have seen it again using 2.6.31-rc2-pae. It happened as I probed for ehci_hcd, and seems to be referring to yenta. I attach the log. Regards, Niel Created attachment 22346 [details]
trace (Jul 10)
Maybe USB fellows have some ideas? It looks like ehci requested an irq which it doesn't mark as handled. (From http://lkml.org/lkml/2009/7/15/13 I suppose it doesn't happen when the yenta is loaded alone.) Neil, Does this happen only after resume from hibernete, or is it ever seen on a cold boot, or a resume from suspend to RAM? Can you supply a dmesg and /proc/interrupts from 2.6.30? UHCI and yenta share IRQ16. It is possible that when EHCI is initialized, it is somehow enabling the, as yet no loaded, UHCI to provoke and IRQ16 which yenta can't handle; thus disabling IRQ16 pci 0000:15:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 15:00.0 CardBus bridge: Ricoh Co Ltd RL5c476 II (rev ba) uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 uhci_hcd 0000:00:1d.0: setting latency timer to 64 uhci_hcd 0000:00:1d.0: UHCI Host Controller but to complicate things, ahci is also on IRQ 16, but it cuts over to MSI: ahci 0000:00:1f.2: PCI INT B -> GSI 16 (level, low) -> IRQ 16 ahci 0000:00:1f.2: irq 28 for MSI/MSI-X same for i915 video: pci 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 pci 0000:00:02.0: setting latency timer to 64 pci 0000:00:02.0: irq 31 for MSI/MSI-X It would be interesting if "pci=nomsi" fixes this by putting ahci and/or i915 on irq16. If yes, that means that one of those devices is provoking the IOAPIC interrupt on IRQ16, but is not registered on the irq to handle it. Created attachment 22456 [details]
/proc/interrupts (2.6.30.2)
as requested.
Created attachment 22457 [details]
dmesg (2.6.30.2)
as requested
Hi Len, Yes I've only seen it happen after a resume from hibernate, although not directly during the resume, but when I explicitly ran modprobe ehci_hcd. I've only seen it twice and I unfortunately still cannot reproduce the issue at will even though I still use hibernate on a daily basis with 2.6.31-rcx. regards, Niel Still here on 2.6.31-rc4-pae: Jul 27 09:37:24 localhost kernel: irq 16: nobody cared (try booting with the "irqpoll" option) Jul 27 09:37:24 localhost kernel: Pid: 0, comm: swapper Not tainted 2.6.31-rc4-pae #98 Jul 27 09:37:24 localhost kernel: Call Trace: Jul 27 09:37:24 localhost kernel: [<c015ef37>] __report_bad_irq+0x2e/0x6f Jul 27 09:37:24 localhost kernel: [<c015f06d>] note_interrupt+0xf5/0x14c Jul 27 09:37:24 localhost kernel: [<c015f5a6>] handle_fasteoi_irq+0x7d/0x9b Jul 27 09:37:24 localhost kernel: [<c01046da>] handle_irq+0x1a/0x20 Jul 27 09:37:24 localhost kernel: [<c0103e91>] do_IRQ+0x41/0x95 Jul 27 09:37:24 localhost kernel: [<c0102fe9>] common_interrupt+0x29/0x30 Jul 27 09:37:24 localhost kernel: [<f8d9af96>] ? acpi_idle_enter_c1+0xd8/0xf6 [processor] Jul 27 09:37:24 localhost kernel: [<c02a0ee6>] cpuidle_idle_call+0x60/0x96 Jul 27 09:37:24 localhost kernel: [<c0101ba7>] cpu_idle+0x49/0x62 Jul 27 09:37:24 localhost kernel: [<c0307197>] rest_init+0x53/0x55 Jul 27 09:37:24 localhost kernel: [<c046a7b2>] start_kernel+0x2ab/0x2b0 Jul 27 09:37:24 localhost kernel: [<c046a071>] __init_begin+0x71/0x79 Jul 27 09:37:24 localhost kernel: handlers: Jul 27 09:37:24 localhost kernel: [<f8df221a>] (yenta_interrupt+0x0/0xb9 [yenta_socket]) Jul 27 09:37:24 localhost kernel: Disabling IRQ #16 Regards, Niel Yeah, looks strange. Maybe the UHCI maintainer could help; EHCI seems to be completely out of the direct loop. The stack trace shows UHCI is out of the loop too though ... since it's not listed as having registered an IRQ #16 handler when this appears. Maybe UHCI was unloaded but somehow didn't disable its IRQ handler. Given how flakey PC hardware and firmware is, I'm a bit surprised that one glitch immediately disables that IRQ. Surely it would be more robust to require that it happen a few more times before doing that? Reply-To: david-b@pacbell.net On Tuesday 14 July 2009, bugzilla-daemon@bugzilla.kernel.org wrote: > It looks like ehci requested an irq which it doesn't mark as handled. At most, what happened is that EHCI startup triggered a UHCI interrupt ... but the UHCI driver wasn't loaded/listening. If /proc/interrupts doesn't show a UHCI handler on irq 16 then I'd say USB isn't directly invovled. - Dave On 07/27/2009 11:29 PM, bugzilla-daemon@bugzilla.kernel.org wrote: > On Tuesday 14 July 2009, bugzilla-daemon@bugzilla.kernel.org wrote: > >> It looks like ehci requested an irq which it doesn't mark as handled. >> > > At most, what happened is that EHCI startup triggered a UHCI > interrupt ... but the UHCI driver wasn't loaded/listening. > > If /proc/interrupts doesn't show a UHCI handler on irq 16 > then I'd say USB isn't directly invovled. > I can confirm uhci_hcd was not loaded, and that the warning happened just as I did a 'modprobe ehci_hcd'. Regards, Niel On Monday 27 July 2009, Niel Lambrechts wrote:
> On 07/26/2009 10:28 PM, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.30. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13733
> > Subject : 2.6.31-rc2: irq 16: nobody cared
> > Submitter : Niel Lambrechts<niel.lambrechts@gmail.com>
> > Date : 2009-07-06 18:32 (21 days old)
> > References : http://marc.info/?l=linux-kernel&m=124690524027166&w=4
> >
> Hi,
>
> Bug is still evident on kernels up to 2.6.31-rc4.
Hmmm I'm wondering whether this problem could be related to my reported issue with irq18 with kernel 31-rc7 irq 18: nobody cared http://lkml.org/lkml/2009/8/28/90 Call Trace: <IRQ> [<ffffffff810a500b>] __report_bad_irq+0x2b/0xa0 [<ffffffff810a520a>] note_interrupt+0x18a/0x1d0 [<ffffffff810a588d>] handle_fasteoi_irq+0xcd/0xf0 [<ffffffff8100ee99>] handle_irq+0x49/0xa0 [<ffffffff813e1343>] do_IRQ+0x73/0xf0 [<ffffffff8100cb93>] ret_from_intr+0x0/0xf <EOI> [<ffffffff8129057d>] ? acpi_idle_enter_c1+0xa1/0xc0 [<ffffffff8107d83d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81290580>] ? acpi_idle_enter_c1+0xa4/0xc0 [<ffffffff8129057d>] ? acpi_idle_enter_c1+0xa1/0xc0 [<ffffffff81334a9b>] ? cpuidle_idle_call+0x9b/0xf0 [<ffffffff8100aa7b>] ? cpu_idle+0x6b/0xd0 [<ffffffff813c72ca>] ? rest_init+0x7a/0x80 [<ffffffff81601b02>] ? start_kernel+0x38f/0x39b [<ffffffff81601128>] ? x86_64_start_reservations+0x125/0x129 [<ffffffff8160120d>] ? x86_64_start_kernel+0xe1/0xe8 handlers: [<ffffffffa0006720>] (usb_hcd_irq+0x0/0xa0 [usbcore]) Disabling IRQ #18 It looked related to mmc.... Pasting manually due to the Bugzilla's e-mail interface not working at the moment. On 09/06/2009 10:30 PM, Rafael J. Wysocki wrote: > The following bug entry is on the current list of known regressions > from 2.6.30. Please verify if it still should be listed and let me know > (either way). > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13733 > Subject : 2.6.31-rc2: irq 16: nobody cared > Submitter : Niel Lambrechts<niel.lambrechts@gmail.com> > Date : 2009-07-06 18:32 (63 days old) > References : http://marc.info/?l=linux-kernel&m=124690524027166&w=4 Hi Rafael, This is still present in 2.6.31-rc9-pae. To test, I removed ehci_hcd, hibernated/resumed and the problem did not appear. But when I first plugged in a USB mouse after the system was up and before ehci_hcd was loaded the error occurred again. How about if you unload both uhci-hcd and ehci-hcd before hibernating? If you plug in a USB mouse before reloading either module, does the problem occur? Hi Alan, This is presently the case - I'm unloading both ehci_hcd and uhci_hcd before hibernating (I sometimes unload these when running on battery power). If I then plug in the mouse and load ehci_hcd, the warning gets triggered. Thanks, Niel Okay, I can now seemingly consistently reproduce this error: 1. rmmod ehci_hcd && rmmod uhci_hcd 2. hibernate 3. power-up 4. plug in any USB device (I used my Nokia cellphone in the last attempt) 5. modprobe ehci_hcd Hope this helps, Niel Okay, let me ask a few questions to make sure everything is clear. 1. What happens if you reverse the order of steps 4 and 5? 2. What happens if in step 5, you modprobe uhci-hcd instead of ehci-hcd? 3. What happens if you boot with "pci=nomsi", as Len asked in comment #12? Hi, Sorry I missed some of those comments due to several blurry 16-hour days. :) Here are the results: 1. What happens if you reverse the order of steps 4 and 5? Error did not occur. 2. What happens if in step 5, you modprobe uhci-hcd instead of ehci-hcd? Error did not occur. Side note #1: After this test I did not reboot straight away. I removed the modules again, hibernated/resumed, plugged in a USB device and probed ehci_hcd - and the error again occurred. 3. What happens if you boot with "pci=nomsi", as Len asked in comment #12? For this I therefore did another fresh reboot, manually specifying pci=nomsi on grub menu. Side note #2: I encountered another problem at first. When hibernating and resuming like this, I had all types of EXT4 errors and resume did not complete, I am assuming this is since grub does not enforce pci=nomsi during the 2nd (resume) boot since it was not an option listed in menu.lst. (My grub menu does not appear/allow interaction when resuming). Can it be that pci=nomsi was not passed to a kernel that previously had it? Anyhow, when I booted after putting pci=nomsi in menu.lst for grub and plugged a device before probing ehci_hcd, the "Disabling IRQ #16" message occurred again as per usual. Hope this is meaningful! Niel Using powertop I also notice that after this my power consumption shoots sky high after this error. Instead of the normal 11-14W I get: ---------------- Power usage (ACPI estimate): 28.0W (0.8 hours) Top causes for wakeups: 99.8% (86410.4) <interrupt> : ahci, yenta, i915@pci:0000:00:02.0 0.1% ( 97.3) <interrupt> : PS/2 keyboard/mouse/touchpad --------- and this is on the IRQ that was just disabled: 16: 149416472 8450 IO-APIC-fasteoi ahci, yenta, i915@pci:0000:00:02.0 I am not familiar with how this all works (i.e. what disabling an interrupt might lead to), so sorry I am mentioning something very obvious. Niel Got to admit, I don't see what could be causing this. Did everything work correctly with an earlier kernel such as 2.6.30? If it did, then git bisect might be the best approach to finding the cause of the problem. ping niel... I'll test some more with 2.6.31 stable and 2.6.32-latest - I have not recently noticed this error again. Alan, the problem started for me in the early 2.6.31 cycle and was probably not present in 2.6.30 or 2.6.29 as I tested quite extensively on those levels as well. cheers Niel Created attachment 23419 [details]
dmesg of error on 2.6.31.3
So the problem happened again on stable - 2.6.31.3. Steps involved were similar to those in comment #25, but I actually cannot seem to duplicate this consistently. Is there anything else I can do to make the debugging more verbose that might help nail this? Regards, Niel You can enable CONFIG_USB_DEBUG. I don't think it will help much, though. Have you tried loading uhci-hcd before hibernating? That might at least prevent the problem from occurring, although it won't fix anything. I believe the cause is a bug in the motherboard or the BIOS. I don't care about the bug other than resolving this in the kernel, it is not impacting me as far as I can tell. Question, if this is a h/w bug why does it not occur in 2.6.30 / earlier? Niel I don't know, and the only way I can think of to figure it out is for you to carry out a bisection search. I would be very surprised if the cause ended up being something in the USB stack. It's much more likely to lie in some completely different subsystem. I am not competent enough to give real insights into this issue, but I don't think this is USB-related as well. I am getting hit by this issue occasionally as well on my Thinkpad X200 (at least I get the "nobody cared" message as well) and I have only seen this happen when I let the machine run over night while nobody is using it and besides internal bluetooth+webcam I seldomly use USB hardware. On Saturday 21 November 2009, Niel Lambrechts wrote:
> Hi Rafael,
>
> On 21 Nov 2009, at 5:02 PM, Rafael J. Wysocki wrote:
>
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
>
> I no longer have readily access to the laptop that showed this problem, and
> will therefore not be able to test if the problem has been resolved in later
> levels.
>
> So the bug can be closed unless anyone else feels differently.
Closing. |