Bug 66841
Summary: | x86: Shared irq - very rare stuck IRQ | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Markus (m4rkusxxl) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | NEW --- | ||
Severity: | normal | CC: | alan, stf_xl |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 3.10.22 | Subsystem: | |
Regression: | No | Bisected commit-id: |
Description
Markus
2013-12-10 18:24:47 UTC
# grep 17: /proc/interrupts
> 17: 64 534 29654 1369764 IO-APIC-fasteoi
> pata_jmicron, ehci_hcd:usb3, ehci_hcd:usb4, ehci_hcd:usb5
The same problem occurred with the Gentoo 3.10 kernel and with vanilla 3.8.13 and 3.10.22.
The error is triggered when your machine starts producing interrupts that we cannot clear and which no driver believes it owns. It usually indicates BIOS problems. The "irqpoll" option tells the kernel to try every driver to see if the IRQ was reported on the 'wrong' line.
If you reboot and put a different device on the same USB port do you also get that error logged?

Yes. I tried different ports for the wlan stick. In the beginning it was plugged into a hub. But I then freed a port on the motherboard and plugged the wlan stick in there. The problem remained. (I have only noticed the problem since I started using that wlan stick.)
Would "irqpoll" give a message to tell whether a different driver (which?) handled the interrupt or if it's still unhandled?
BIOS is "up to date", although this doesn't mean it's error free. :-/
(I will try "irqpoll" next. But it may take some days.)

Do you have other USB devices in use on those ports?

The ports on the motherboard are all used: keyboard, mouse, webcam, usv, wlan-stick and a SiS-PM. All usb2. The (usb3) hub is connected to one usb3 port on the motherboard. The wlan-stick was connected to that hub before. The same error occurred. The second usb3 port is free and used for the external usb3 hdd. The performance was normal, even when the wlan was slow.

As I noted I tried "irqpoll". At the first try the computer completely froze. Unfortunately no logs were written. So I tried again with netconsole set up. (By the way, the wlan never slowed down. But in my opinion it's better to have a slow wlan than a completely frozen pc.)
> kernel BUG at mm/slub.c:3352!
> invalid opcode: 0000 [#1] PREEMPT SMP
> …
> Call Trace:
> [<ffffffff810d962c>] free_pipe_info+0x6c/0x80
> [<ffffffff810d96c1>] pipe_release+0x81/0x100
> [<ffffffff810d2041>] __fput+0xe1/0x240
> [<ffffffff810d2259>] ____fput+0x9/0x10
> [<ffffffff81046c2f>] task_work_run+0x8f/0xd0
> [<ffffffff810020c9>] do_notify_resume+0x59/0x80
> [<ffffffff813a8bda>] int_signal+0x12/0x17
> …
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
> IP: [<ffffffff810cac30>] kmem_cache_alloc+0x70/0x100
> PGD 41cb88067 PUD 41c0d0067 PMD 0
> Oops: 0000 [#2] PREEMPT SMP
> …
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
> IP: [<ffffffff810cac30>] kmem_cache_alloc+0x70/0x100
> PGD 0
> Oops: 0000 [#3] PREEMPT SMP
> …
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
> IP: [<ffffffff810cac30>] kmem_cache_alloc+0x70/0x100
> PGD 412610067 PUD 4125e9067 PMD 0
> Oops: 0000 [#4] PREEMPT SMP
> …
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
> IP: [<ffffffff810cab08>] __kmalloc+0x88/0x140
> PGD 41476e067 PUD 414783067 PMD 0
> Oops: 0000 [#5] PREEMPT SMP
> …
> BUG: unable to handle kernel paging request at ffffffffffffffd8
> IP: [<ffffffff8104a01b>] kthread_data+0xb/0x20
> PGD 160c067 PUD 160e067 PMD 0
> Oops: 0000 [#6] PREEMPT SMP
> …
> Fixing recursive fault but reboot is needed!
I don't know if this is related or a different bug?

Different, I think: a (known) pipefs race that Al Viro was looking at.
The fact it happens regardless of the port used for the wireless USB is weird. I'd have expected it to be tied to a given port if IRQ routing was the problem. I'll assign this to the wireless folks although it's all rather peculiar!

Couldn't find a bug report. Does it exist or is it only on the ml?
Do any patches exist I could test?
# grep hci_hcd /proc/interrupts
> 17: 17 60 3366 296573 IO-APIC-fasteoi
> pata_jmicron, ehci_hcd:usb7, ehci_hcd:usb8, ehci_hcd:usb9
> 18: 94 507 35187 2139720 IO-APIC-fasteoi
> ohci_hcd:usb3, ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6
> 45: 0 0 4 86 PCI-MSI-edge xhci_hcd
> 46: 0 0 0 0 PCI-MSI-edge xhci_hcd
> 47: 0 0 0 0 PCI-MSI-edge xhci_hcd
> 48: 0 0 0 0 PCI-MSI-edge xhci_hcd
> 49: 0 0 0 0 PCI-MSI-edge xhci_hcd
It was always irq 17. Even when the stick was connected to the usb3 hub on the usb3 port on the motherboard. Does the above mean that all usb2 devices are handled by ehci and use irq 17?
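To make the sharing visible at a glance, the listing above can be grouped by IRQ. The following is only a minimal sketch: the function names `parse_interrupt_line` and `shared_irqs` are made up for illustration, the sample lines are the ones quoted above with the wrapped handler line rejoined and the "> " prefix dropped, and the four count columns are assumed to correspond to four CPUs. It handles only numbered device lines, not summary rows such as NMI or LOC.

```python
def parse_interrupt_line(line, ncpus):
    """Split one numbered /proc/interrupts line into its parts.

    ncpus is the number of per-CPU count columns (assumed, not detected).
    """
    fields = line.split()
    irq = fields[0].rstrip(":")
    counts = [int(x) for x in fields[1:1 + ncpus]]
    chip = fields[1 + ncpus]
    # Everything after the chip name is a comma-separated handler list.
    rest = " ".join(fields[2 + ncpus:])
    handlers = [h.strip() for h in rest.split(",") if h.strip()]
    return {"irq": irq, "counts": counts, "chip": chip, "handlers": handlers}


def shared_irqs(lines, ncpus):
    """Return {irq: handlers} for every line with more than one handler."""
    shared = {}
    for line in lines:
        entry = parse_interrupt_line(line, ncpus)
        if len(entry["handlers"]) > 1:
            shared[entry["irq"]] = entry["handlers"]
    return shared


# Sample taken from the output quoted in this report (lines rejoined).
SAMPLE = [
    "17: 17 60 3366 296573 IO-APIC-fasteoi pata_jmicron, ehci_hcd:usb7, ehci_hcd:usb8, ehci_hcd:usb9",
    "18: 94 507 35187 2139720 IO-APIC-fasteoi ohci_hcd:usb3, ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6",
    "45: 0 0 4 86 PCI-MSI-edge xhci_hcd",
]

if __name__ == "__main__":
    for irq, handlers in shared_irqs(SAMPLE, ncpus=4).items():
        print(irq, handlers)
```

On this sample, IRQ 17 is shared by pata_jmicron and three ehci_hcd controllers, IRQ 18 by four ohci_hcd controllers, and the xhci_hcd MSI vector is not shared.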
The wireless driver has no influence on this bug. It is caused either by ehci_hcd or by pata_jmicron, which possibly do not handle the IRQ properly, or it is a pure hardware problem, i.e. the hardware generates an interrupt without a reason. Markus, there should be some other USB ports available which are handled by ohci_hcd or xhci_hcd; try to use them instead of ehci_hcd.

I used a usb2 port on the motherboard and a usb3 hub connected to a usb3 port on the motherboard. It was the same error. irq 17.
Two cases I will test:
- blacklist pata_jmicron, as it is only for an optical drive that is hardly used
- connect the wlan stick directly to the usb3 port

I updated to 3.10.25 and enabled usb debugging.
Connecting the stick directly to a usb3 port does not change anything. The problem still occurs.
But I figured out how the problem can be suppressed. (It worked for 4 days, where it would normally fail within one day.) I need to do one of the following:
- blacklist pata_jmicron
- remove usb webcam
- remove usb wlan stick
The debugging does not print anything when the "nobody cared" message appears. But afterwards a lot of these lines were printed:
> ehci-pci 0000:00:16.2: IAA watchdog: status e021 cmd 10035
(The cmd is always 10035 but the status was one of e021, e020, c021, e029, c020, e028, c029, c028)
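For reading those status words, here is a small decoding sketch. It assumes the "status" value printed by the IAA watchdog is the EHCI USBSTS register, and it uses the bit positions from the EHCI specification; the short bit names and the helper `decode_usbsts` are illustrative, not taken from the driver output.

```python
# USBSTS bit positions per the EHCI specification (short names are illustrative).
USBSTS_BITS = [
    (0, "INT"),     # USB interrupt (transfer completion)
    (1, "ERR"),     # USB error interrupt
    (2, "PCD"),     # port change detect
    (3, "FLR"),     # frame list rollover
    (4, "FATAL"),   # host system error
    (5, "IAA"),     # interrupt on async advance
    (12, "HALT"),   # host controller halted
    (13, "RECL"),   # reclamation
    (14, "PSS"),    # periodic schedule status
    (15, "ASS"),    # async schedule status
]


def decode_usbsts(value):
    """Return the names of all known status bits set in value."""
    return [name for bit, name in USBSTS_BITS if value & (1 << bit)]


# The eight status values reported in this comment.
OBSERVED = ["e021", "e020", "c021", "e029", "c020", "e028", "c029", "c028"]

if __name__ == "__main__":
    for raw in OBSERVED:
        print(raw, decode_usbsts(int(raw, 16)))
```

Under that assumption, all eight observed values have IAA, PSS and ASS set and HALT clear; they differ only in the INT, FLR and RECL bits.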
Is there any way to reenable the interrupt?
The kernel disables the interrupt when many non-requested interrupts have occurred. So one bad interrupt here and there does not matter. But are these numbers exported? (To see their changes.)
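As a rough illustration of that counting, here is a simplified model, written from my reading of kernel/irq/spurious.c and not a copy of it: every 100000 interrupts on a line, the line is disabled if more than 99900 of them were unhandled, and the unhandled counter restarts when unhandled interrupts are more than about a tenth of a second apart. The class name, the use of seconds instead of jiffies, and the sample text are assumptions. The second helper parses the `count` / `unhandled` / `last_unhandled` format that, as far as I know, /proc/irq/<N>/spurious uses.

```python
class SpuriousModel:
    """Toy model of the kernel's stuck-IRQ detection (simplified)."""

    WINDOW = 100_000      # interrupts per evaluation window
    THRESHOLD = 99_900    # unhandled count above which the IRQ is disabled
    RESET_GAP = 0.1       # seconds; stands in for HZ/10 jiffies

    def __init__(self):
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.last_unhandled = None
        self.disabled = False

    def note_interrupt(self, handled, now):
        if self.disabled:
            return
        if not handled:
            recent = (self.last_unhandled is not None
                      and now - self.last_unhandled <= self.RESET_GAP)
            # Isolated unhandled interrupts restart the counter at 1.
            self.irqs_unhandled = self.irqs_unhandled + 1 if recent else 1
            self.last_unhandled = now
        self.irq_count += 1
        if self.irq_count < self.WINDOW:
            return
        self.irq_count = 0
        if self.irqs_unhandled > self.THRESHOLD:
            self.disabled = True   # the "nobody cared" case
        self.irqs_unhandled = 0


def parse_spurious(text):
    """Parse text in the assumed /proc/irq/<N>/spurious format."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            key = parts[0] + ("_ms" if parts[-1] == "ms" else "")
            result[key] = int(parts[1])
    return result


if __name__ == "__main__":
    storm = SpuriousModel()
    for i in range(100_000):
        storm.note_interrupt(handled=False, now=i * 1e-5)
    print("storm disabled:", storm.disabled)

    # Hypothetical sample text, not taken from the affected machine.
    print(parse_spurious("count 64\nunhandled 3\nlast_unhandled 1200 ms\n"))
```

In this model an occasional unhandled interrupt among many handled ones never reaches the threshold, which matches the "one bad interrupt here and there does not matter" observation.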
Blacklisting pata_jmicron has now been working for over ten days and counting. Anything else I can try?

Is this with or without "irqpoll" now set?
It's very odd that any of the three fixes it. If it was a specific device I'd expect only one of the device removals to make any difference (or both USB perhaps). The fact either cures it suggests something different.
Can you clarify if irqpoll was set on your testing?

"irqpoll" was not set. It makes the system crash (comment #6) before the "nobody cared" message could appear. And with pata_jmicron blacklisted the "nobody cared" message does not appear and the wlan does not slow down. (I don't know the influence of "irqpoll" here, but as it's working...)

Ok, moving to platform specific/x86 because it looks more like a shared interrupt problem.