Bug 66841 - x86: Shared irq - very rare stuck IRQ
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-12-10 18:24 UTC by Markus
Modified: 2014-01-10 20:58 UTC
CC List: 2 users

See Also:
Kernel Version: 3.10.22
Subsystem:
Regression: No
Bisected commit-id:



Description Markus 2013-12-10 18:24:47 UTC
At first I added this to bug #62781 because I had only noticed the slowdown of the wlan (rt2800usb). It produced a lot of warnings, and even bringing up the interface (ifconfig net_wlan up) took 60s.

I tracked it down to "usb_start_wait_urb" in "usb_control_msg" in "rt2x00usb_vendor_request".

But I completely overlooked the following message by the kernel:
> irq 17: nobody cared (try booting with the "irqpoll" option)
> Call Trace:
>  <IRQ>  [<ffffffff813a3348>] dump_stack+0x19/0x1b
>  [<ffffffff81081568>] __report_bad_irq+0x38/0xd0
>  [<ffffffff81081961>] note_interrupt+0x131/0x1f0
>  [<ffffffff8107f4ea>] handle_irq_event_percpu+0xba/0x140
>  [<ffffffff8107f5b3>] handle_irq_event+0x43/0x70
>  [<ffffffff810824c5>] handle_fasteoi_irq+0x55/0x100
>  [<ffffffff81003bb9>] handle_irq+0x19/0x30
>  [<ffffffff81003a55>] do_IRQ+0x55/0xd0
>  [<ffffffff813a81e7>] common_interrupt+0x67/0x67
> handlers:
> [<ffffffff812b5f30>] ata_bmdma_interrupt
> [<ffffffffa00baf80>] usb_hcd_irq [usbcore]
> [<ffffffffa00baf80>] usb_hcd_irq [usbcore]
> [<ffffffffa00baf80>] usb_hcd_irq [usbcore]
> Disabling IRQ #17

Never seen this before. (The wlan stick was recently added.)
Is this a bug in the wlan-driver? The usb subsystem? the kernel itself?
What can I do? (Except for the workaround "irqpoll".)
Comment 1 Markus 2013-12-10 18:28:03 UTC
# grep 17: /proc/interrupts 
>  17:         64        534      29654    1369764   IO-APIC-fasteoi   pata_jmicron, ehci_hcd:usb3, ehci_hcd:usb4, ehci_hcd:usb5

The same problem occurred with the Gentoo 3.10 kernel and with vanilla 3.8.13 and 3.10.22.
Comment 2 Alan 2013-12-10 22:16:32 UTC
The error is triggered when your machine starts producing interrupts that we cannot clear and which no driver believes it owns. It usually indicates BIOS problems.

The "irqpoll" option tells the kernel to try every driver to see if the IRQ was reported on the 'wrong' line.
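As a rough illustration of the accounting behind the "nobody cared" message, the sketch below models a shared IRQ line in Python. It assumes the thresholds used by note_interrupt() in kernel/irq/spurious.c (a check every 100,000 interrupts, with the line disabled when more than 99,900 of them went unhandled) and leaves out the kernel's time-based reset of the unhandled counter. The class and handler names are hypothetical and are not kernel APIs.

```python
from typing import Callable, List

# Thresholds assumed from note_interrupt() in kernel/irq/spurious.c.
CHECK_INTERVAL = 100_000
UNHANDLED_LIMIT = 99_900


class SharedIrqLine:
    """Toy model of one shared IRQ line (hypothetical, not a kernel API)."""

    def __init__(self, handlers: List[Callable[[], bool]]) -> None:
        # Each handler returns True if its device raised the interrupt.
        self.handlers = handlers
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.disabled = False

    def interrupt(self) -> None:
        """Deliver one interrupt to every handler sharing the line."""
        if self.disabled:
            return
        # All handlers on a shared line are called, so collect every result
        # before checking whether anyone claimed the interrupt.
        results = [handler() for handler in self.handlers]
        if not any(results):
            self.irqs_unhandled += 1
        self.irq_count += 1
        if self.irq_count < CHECK_INTERVAL:
            return
        # End of a counting window: decide whether the line is "stuck".
        self.irq_count = 0
        if self.irqs_unhandled > UNHANDLED_LIMIT:
            self.disabled = True  # corresponds to "Disabling IRQ #17"
        self.irqs_unhandled = 0
```

In this model a line is only disabled when almost every interrupt in a window is unclaimed, so an occasional stray interrupt does not trigger the message.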

If you reboot and put a different device on the same USB port, do you also get that error logged?
Comment 3 Markus 2013-12-10 22:47:13 UTC
Yes. I tried different ports for the wlan stick. In the beginning it was plugged into a hub. I then freed a port on the motherboard and plugged the wlan stick in there. The problem remained.
(I have only noticed the problem since I started using that wlan stick.)

Would "irqpoll" print a message telling whether a different driver (which one?) handled the interrupt or whether it is still unhandled?

The BIOS is "up to date", although this doesn't mean it's error free. :-/

(I will try "irqpoll" next. But may take some days.)
Comment 4 Alan 2013-12-11 11:50:36 UTC
Do you have other USB devices in use on those ports ?
Comment 5 Markus 2013-12-11 12:51:50 UTC
The ports on the motherboard are all used: keyboard, mouse, webcam, UPS, wlan stick and a SiS-PM. All usb2.

The (usb3) hub is connected to one usb3 port on the motherboard. The wlan stick was connected to that hub before. The same error occurred.
The second usb3 port is free and used for the external usb3 hdd. Its performance was normal, even when the wlan was slow.
Comment 6 Markus 2013-12-13 08:10:13 UTC
As I noted, I tried "irqpoll". On the first try the computer froze completely. Unfortunately no logs were written, so I tried again with netconsole set up.
(By the way, the wlan never slowed down. But in my opinion it's better to have a slow wlan than a completely frozen pc.)

> kernel BUG at mm/slub.c:3352!
> invalid opcode: 0000 [#1] PREEMPT SMP
> Call Trace:
>  [<ffffffff810d962c>] free_pipe_info+0x6c/0x80
>  [<ffffffff810d96c1>] pipe_release+0x81/0x100
>  [<ffffffff810d2041>] __fput+0xe1/0x240
>  [<ffffffff810d2259>] ____fput+0x9/0x10
>  [<ffffffff81046c2f>] task_work_run+0x8f/0xd0
>  [<ffffffff810020c9>] do_notify_resume+0x59/0x80
>  [<ffffffff813a8bda>] int_signal+0x12/0x17
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
> IP: [<ffffffff810cac30>] kmem_cache_alloc+0x70/0x100
> PGD 41cb88067 PUD 41c0d0067 PMD 0
> Oops: 0000 [#2] PREEMPT SMP
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
> IP: [<ffffffff810cac30>] kmem_cache_alloc+0x70/0x100
> PGD 0
> Oops: 0000 [#3] PREEMPT SMP
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
> IP: [<ffffffff810cac30>] kmem_cache_alloc+0x70/0x100
> PGD 412610067 PUD 4125e9067 PMD 0
> Oops: 0000 [#4] PREEMPT SMP
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
> IP: [<ffffffff810cab08>] __kmalloc+0x88/0x140
> PGD 41476e067 PUD 414783067 PMD 0
> Oops: 0000 [#5] PREEMPT SMP
> BUG: unable to handle kernel paging request at ffffffffffffffd8
> IP: [<ffffffff8104a01b>] kthread_data+0xb/0x20
> PGD 160c067 PUD 160e067 PMD 0
> Oops: 0000 [#6] PREEMPT SMP
> Fixing recursive fault but reboot is needed!

I don't know whether this is related or a different bug.
Comment 7 Alan 2013-12-13 10:25:12 UTC
Different, I think: a known pipefs race that Al Viro was looking at.

The fact it happens regardless of the port used for the wireless USB is weird. I'd have expected it to be tied to a given port if IRQ routing was the problem.

I'll assign this to the wireless folks, although it's all rather peculiar!
Comment 8 Markus 2013-12-13 10:49:31 UTC
I couldn't find a bug report for it. Does one exist, or is it only on the mailing list?
Are there any patches I could test?


# grep hci_hcd /proc/interrupts
> 17:         17         60       3366     296573   IO-APIC-fasteoi   pata_jmicron, ehci_hcd:usb7, ehci_hcd:usb8, ehci_hcd:usb9
> 18:         94        507      35187    2139720   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6
> 45:          0          0          4         86   PCI-MSI-edge      xhci_hcd
> 46:          0          0          0          0   PCI-MSI-edge      xhci_hcd
> 47:          0          0          0          0   PCI-MSI-edge      xhci_hcd
> 48:          0          0          0          0   PCI-MSI-edge      xhci_hcd
> 49:          0          0          0          0   PCI-MSI-edge      xhci_hcd

It was always irq 17. Even when the stick was connected to the usb3 hub on the usb3 port on the motherboard. Does the above mean that all usb2 devices are handled by ehci and use irq 17?
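To make the sharing visible at a glance, the following Python sketch lists every IRQ that has more than one handler registered. It assumes the 3.10-era /proc/interrupts layout shown above (per-CPU counts, one chip-name column, then a comma-separated handler list); newer kernels add a column and would need a different pattern. The function name shared_irqs is hypothetical.

```python
import re
from typing import Dict, List

# label, per-CPU counts, chip name, comma-separated handlers
# (layout assumed from the 3.10 output quoted in this report).
_LINE = re.compile(r"^\s*(\w+):\s+((?:\d+\s+)+)(\S+)\s+(.+)$")


def shared_irqs(text: str) -> Dict[str, List[str]]:
    """Return {irq: [handlers]} for lines that list more than one handler."""
    shared: Dict[str, List[str]] = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue  # header row or a line without a handler list
        irq, _counts, _chip, handlers = match.groups()
        names = [name.strip() for name in handlers.split(",") if name.strip()]
        if len(names) > 1:
            shared[irq] = names
    return shared


if __name__ == "__main__":
    with open("/proc/interrupts") as proc_file:
        for irq, names in shared_irqs(proc_file.read()).items():
            print(f"IRQ {irq}: {', '.join(names)}")
```

Applied to the output above, this would report IRQ 17 as shared between pata_jmicron and three ehci_hcd controllers and IRQ 18 as shared between four ohci_hcd controllers, while the xhci_hcd MSI vectors are not shared.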
Comment 9 Stanislaw Gruszka 2013-12-16 13:11:33 UTC
The wireless driver has no influence on this bug. It is caused either by ehci_hcd or by pata_jmicron, which possibly do not handle the IRQ properly, or it is a pure hardware problem, i.e. the hardware generates interrupts without a reason.

Markus, there should be some other USB ports available that are handled by ohci_hcd or xhci_hcd. Try to use them instead of the ehci_hcd ones.
Comment 10 Markus 2013-12-16 15:38:35 UTC
I used a usb2 port on the motherboard and a usb3 hub connected to a usb3 port on the motherboard. It was the same error. irq 17.

Two cases I will test:
- blacklist pata_jmicron, as it is only used for an optical drive that is hardly used
- connect the wlan stick directly to the usb3 port
Comment 11 Markus 2013-12-27 19:35:04 UTC
I updated to 3.10.25 and enabled usb debugging.

Connecting the stick directly to a usb3 port does not change anything. The problem still occurs.

But I figured out how the problem can be suppressed. (It worked for 4 days, where it would normally fail within one day.) I need to do one of the following:
- blacklist pata_jmicron
- remove usb webcam
- remove usb wlan stick

The debugging does not print anything when the "nobody cared" message appears. But afterwards a lot of these lines were printed:
> ehci-pci 0000:00:16.2: IAA watchdog: status e021 cmd 10035

(The cmd is always 10035 but the status was one of e021, e020, c021, e029, c020, e028, c029, c028)
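The status and cmd values in these messages are raw EHCI USBSTS and USBCMD register contents, so they can be decoded bit by bit. The sketch below does that in Python; the bit positions follow the EHCI register layout, while the short flag names and the helper functions are chosen here for illustration. Multi-bit fields other than the interrupt threshold (for example the frame list size in bits 3:2 of USBCMD) are not decoded.

```python
from typing import Dict, List

# USBSTS single-bit flags (bit position -> short name).
STATUS_BITS: Dict[int, str] = {
    0: "INT",     # transfer completion interrupt
    1: "ERR",     # transfer error interrupt
    2: "PCD",     # port change detect
    3: "FLR",     # frame list rollover
    4: "FATAL",   # host system error
    5: "IAA",     # interrupt on async advance
    12: "HALT",   # controller halted
    13: "RECL",   # reclamation
    14: "PSS",    # periodic schedule status
    15: "ASS",    # async schedule status
}

# USBCMD single-bit flags (bit position -> short name).
COMMAND_BITS: Dict[int, str] = {
    0: "RUN",     # run/stop
    1: "RESET",   # host controller reset
    4: "PSE",     # periodic schedule enable
    5: "ASE",     # async schedule enable
    6: "IAAD",    # interrupt on async advance doorbell
    7: "LRESET",  # light host controller reset
    11: "PARK",   # async schedule park mode enable
}


def decode(value: int, names: Dict[int, str]) -> List[str]:
    """List the names of all known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


def interrupt_threshold(cmd: int) -> int:
    """Extract the interrupt threshold control field (USBCMD bits 23:16)."""
    return (cmd >> 16) & 0xFF


if __name__ == "__main__":
    for status in (0xE021, 0xE020, 0xC021, 0xE029, 0xC020, 0xE028, 0xC029, 0xC028):
        print(f"status {status:04x}: {' '.join(decode(status, STATUS_BITS))}")
    print(f"cmd 10035: {' '.join(decode(0x10035, COMMAND_BITS))}")
```

Every status value listed above has the IAA bit set, while cmd 10035 has the controller running with both schedules enabled and no IAAD doorbell pending. That would be consistent with the controller signalling an async advance whose interrupt never reached the driver after IRQ 17 was disabled, though this reading is an assumption rather than something the log states.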


Is there any way to reenable the interrupt?


The kernel disables the interrupt when many non-requested interrupts have occurred, so one bad interrupt here and there does not matter. But are these numbers exported, so that I can watch them change?
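On kernels that provide it, the per-IRQ file /proc/irq/<irq>/spurious exposes these counters as three lines (count, unhandled, last_unhandled in ms). The Python sketch below reads and parses that file; the helper names are hypothetical and the parser simply assumes one "name value" pair per line.

```python
from pathlib import Path
from typing import Dict


def parse_spurious(text: str) -> Dict[str, int]:
    """Parse 'name value [unit]' lines into a {name: value} mapping."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats


def read_spurious(irq: int) -> Dict[str, int]:
    """Read the spurious-interrupt counters for one IRQ from procfs."""
    return parse_spurious(Path(f"/proc/irq/{irq}/spurious").read_text())


if __name__ == "__main__":
    # IRQ 17 is the line from this report; adjust as needed.
    print(read_spurious(17))
```

Calling read_spurious(17) periodically and comparing the "unhandled" value between calls would show whether unclaimed interrupts are accumulating before the kernel disables the line.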
Comment 12 Markus 2014-01-09 11:31:48 UTC
Blacklisting pata_jmicron has now been working for over ten days and counting.

Anything else I can try?
Comment 13 Alan 2014-01-10 20:46:09 UTC
Is this with or without "irqpoll" now set?

It's very odd that any of the three fixes it. If it was a specific device, I'd expect only one of the device removals to make any difference (or both USB perhaps). The fact that any of them cures it suggests something different.

Can you clarify whether irqpoll was set during your testing?
Comment 14 Markus 2014-01-10 20:54:41 UTC
"irqpoll" was not set. It makes the system crash (comment #6) before the "nobody cared" message could appear.

And with pata_jmicron blacklisted, the "nobody cared" message does not appear and the wlan does not slow down. (I don't know the influence of "irqpoll" here, but as it's working...)
Comment 15 Alan 2014-01-10 20:58:39 UTC
OK, moving to platform specific/x86 because it looks more like a shared interrupt problem.
