Bug 6591 - VIA K8T800 SMP: kernel randomly loses all interrupts from my UHCI or SATA controllers
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: ACPI
Classification: Unclassified
Component: Config-Interrupts
Hardware: i386 Linux
Importance: P2 normal
Assignee: acpi_config-interrupts
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-05-20 23:07 UTC by Nicholas Miell
Modified: 2007-06-03 14:37 UTC

See Also:
Kernel Version: 2.6.16
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmesg from boot (15.69 KB, text/plain)
2006-09-17 17:50 UTC, Nicholas Miell
/proc/interrupts (654 bytes, text/plain)
2006-09-17 17:51 UTC, Nicholas Miell
dmesg diff, post patch (6.64 KB, text/plain)
2006-09-20 18:12 UTC, Nicholas Miell
lspci -vvv output (13.31 KB, text/plain)
2006-09-22 22:32 UTC, Nicholas Miell

Description Nicholas Miell 2006-05-20 23:07:09 UTC
This has been going on for a while (since mid-2005, at least), and it hasn't
magically fixed itself, so I'm finally filing a bug.

Distribution: Fedora Core 5

Hardware Environment:
MSI Master2 FAR based dual-Opteron system.

00:00.0 Host bridge: VIA Technologies, Inc. VT8385 [K8T800 AGP] Host Bridge (rev 01)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI bridge [K8T800/K8T890 South]
00:05.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 07)
00:05.1 Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 07)
00:0b.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5705 Gigabit Ethernet (rev 03)
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
00:0f.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.4 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon R200 QH [Radeon 8500] (rev 80)

Problem Description:

Randomly (but fairly often), I stop getting interrupts from my SATA controller
or my UHCI controllers or both -- the counters in /proc/interrupts stop
increasing and I get libata timeout errors or my USB HID mouse movement gets
erratic (because the driver is polling instead of waiting for interrupts).
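
For anyone who wants to check for the same symptom, a stalled counter can be spotted by comparing two snapshots of /proc/interrupts. The following is only a minimal Python sketch, assuming the usual layout (a header row of CPU names, then one row per IRQ). The function names, the IRQ numbers, and the counts in the sample snapshots are made up for illustration and are not taken from the attachments.

```python
def parse_interrupts(text):
    """Return {irq_label: (total_count, description)} from /proc/interrupts text."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        # Whatever follows the per-CPU counts is the controller type and device names.
        result[label.strip()] = (sum(counts), " ".join(fields[len(counts):]))
    return result


def stalled_irqs(before, after, watch):
    """Return the watched IRQ labels whose total count did not increase."""
    return [irq for irq in watch
            if irq in before and irq in after
            and after[irq][0] <= before[irq][0]]


# Hypothetical snapshots taken a few seconds apart (made-up IRQ numbers and counts).
BEFORE = """\
           CPU0       CPU1
169:      51234          0   IO-APIC-level  libata
177:    4143175          0   IO-APIC-level  EMU10K1, eth0
"""
AFTER = """\
           CPU0       CPU1
169:      51234          0   IO-APIC-level  libata
177:    4150002          0   IO-APIC-level  EMU10K1, eth0
"""

print(stalled_irqs(parse_interrupts(BEFORE), parse_interrupts(AFTER), ["169", "177"]))
# prints ['169']
```

On a live system the two snapshots would come from reading /proc/interrupts twice with a short delay in between. An idle device produces no interrupts either, so a flat counter only means something while the disk or mouse is actually in use.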

The devices still work -- IO to the disk will (eventually) complete and my mouse
or gamepad or whatever still work (but not very well).

If I remove and then reload the driver modules (sata_via or uhci_hcd) or
unbind/bind the devices in sysfs, the problem immediately goes away.
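
The sysfs unbind/bind workaround amounts to writing the device's PCI address to the driver's unbind file and then to its bind file. A rough Python sketch of that, assuming the standard /sys/bus/pci/drivers layout: rebind() and its sysfs_root parameter are names invented here, and the address 0000:00:0f.0 assumes PCI domain 0000 for the SATA controller in the listing above.

```python
import os


def rebind(driver, device, sysfs_root="/sys/bus/pci/drivers"):
    """Unbind a PCI device from its driver, then bind it again.

    Needs root on a real system. sysfs_root is overridable only so the
    function can be exercised against a scratch directory.
    """
    driver_dir = os.path.join(sysfs_root, driver)
    for action in ("unbind", "bind"):
        with open(os.path.join(driver_dir, action), "w") as f:
            f.write(device)


# Example call (assumed address, see above); commented out because it
# touches real hardware:
# rebind("sata_via", "0000:00:0f.0")
```

Unbinding the driver of the disk holding the root filesystem is risky, so this is something to try on a data disk or a USB controller first.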

Steps to reproduce:
It's random, so there aren't any.

I think GL usage may be a factor, because if I disable all my GL screensavers
and don't otherwise use GL, the incidence goes down (but doesn't stop altogether).

Likewise, network activity may also be a factor, because this almost always
occurs when I'm downloading something to the SATA disk with BitTorrent (although
that may be just when I'm more likely to notice it).

Of course, it'll also happen when neither GL nor network activity is going on, so
I have no idea what is actually going on.
Comment 1 Luming Yu 2006-05-24 06:50:45 UTC
Please make sure you are using latest BIOS.
Please try acpi=off to make sure this is acpi related issue.
Comment 2 Nicholas Miell 2006-05-24 16:27:05 UTC
I have the latest non-beta BIOS for this system.

I can't reproduce this with acpi=off. (I can't reproduce with noapic, either.)

However, I'm not able to reproduce it reliably even with ACPI on, and when I
boot the system with acpi=off my UHCI controllers and my graphics card don't get
interrupts (the error is something similar to "kernel: PCI: No IRQ known for
interrupt pin A of device 0000:01:00.0. Probably buggy MP table." for each of the
devices), so the difference in configuration may be a contributing factor.
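
Because the results depend on how the kernel was booted, it can help to record which of these parameters were in effect for a given run. /proc/cmdline holds the boot command line. The sketch below is a minimal Python helper for picking the relevant flags out of such a string. boot_flags() and the sample command line are hypothetical, not taken from this system.

```python
def boot_flags(cmdline, names=("acpi", "noapic")):
    """Return the selected kernel parameters found in a boot command line.

    Parameters of the form key=value map to the value string; bare
    parameters such as noapic map to True.
    """
    found = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in names:
            found[key] = value or True
    return found


# Hypothetical command line; on a live system read it from /proc/cmdline.
print(boot_flags("ro root=/dev/sda1 acpi=off noapic quiet"))
# prints {'acpi': 'off', 'noapic': True}
```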
Comment 3 Luming Yu 2006-06-07 08:12:45 UTC
>This has been going on for a while (since mid-2005, at least), and it hasn't
>magically fixed itself, so I'm finally filing a bug

Did it work before? If you remove USB support from the kernel, do you still see
lost interrupts on SATA?
Comment 4 Nicholas Miell 2006-07-30 16:34:05 UTC
> Did it work before?
As far as I know, yes, but I didn't own any SATA disks and the SATA controller
was disabled in the BIOS. I also don't remember if there was a time period when
I was using SATA and this wasn't happening.

> If you remove USB support from the kernel, do you still see lost interrupts on SATA?

I tried temporarily blacklisting the uhci-hcd and ehci-hcd modules (which
prevents them from ever being loaded), and wasn't able to reproduce the error,
but, once again, this problem can't be reproduced on demand.


I also wrote a little SystemTap script which directly calls unmask_IO_APIC_irq
with the SATA controller's IRQ number. If I run this script when I'm not getting
any interrupts from the SATA controller, I start getting interrupts again. This
suggests to me that the interrupt may be disabled at the IOAPIC level without
the kernel's knowledge.

I also wrote another SystemTap script which records stack traces every time
unmask_IO_APIC_irq or mask_IO_APIC_irq is called, in order to determine if the
IOAPIC is actually disabling the IRQ without the kernel's knowledge. However,
whenever I leave this script running, I've never been able to reproduce the problem.
This suggests to me that there may be IOAPIC access timing issues involved.

Comment 5 Sérgio M Basto 2006-09-17 09:58:32 UTC
> VT8237 PCI bridge 
Please try this patch: http://lkml.org/lkml/diff/2006/9/7/235/1
and see this bug:
http://bugme.osdl.org/show_bug.cgi?id=6419
Comment 6 Nicholas Miell 2006-09-17 13:23:22 UTC
This problem predates the following commit:

commit 75cf7456dd87335f574dcd53c4ae616a2ad71a11
Author: Chris Wedgwood <cw@f00f.org>
Date:   Tue Apr 18 23:57:09 2006 -0700
Subject: [PATCH] PCI quirk: VIA IRQ fixup should only run for VIA southbridges

Prior to this commit, quirk_via_irq was running on my machine but not doing
anything (no "PCI: VIA IRQ fixup for %s, from %d to %d\n" message was printed at
boot.)

Were I to apply the suggested patch, it still wouldn't do anything (the
conditions for its activation have gotten more restrictive, not less), so I'm
not going to waste my time configuring and building a kernel to test it.
Comment 7 Sérgio M Basto 2006-09-17 17:26:54 UTC
(In reply to comment #6)
> Prior to this commit, quirk_via_irq was running on my machine but not doing
> anything (no "PCI: VIA IRQ fixup for %s, from %d to %d\n" message was printed
> at boot.)
Before this patch (http://lkml.org/lkml/2006/4/19/16), any VIA PCI device
was quirked, so yours was too.
Yes, the "PCI: VIA IRQ fixup for %s, from %d to %d\n" message was printed at
boot.
You can just apply the patch, run make bzImage, install it (that is, cp
arch/boot/bzImage over /boot/vmlinuz-2.6.17), and reboot. You don't
need to recompile everything again.


Last thing: can you attach your dmesg and the output of cat /proc/interrupts?
Comment 8 Nicholas Miell 2006-09-17 17:50:46 UTC
Created attachment 9039
dmesg from boot

My dmesg, as requested.
Comment 9 Nicholas Miell 2006-09-17 17:51:35 UTC
Created attachment 9040
/proc/interrupts

/proc/interrupts, as requested.
Comment 10 Nicholas Miell 2006-09-17 17:59:55 UTC
Looks like I spoke too soon -- the quirk is getting applied to my system.

I remember doing something to change whether or not that quirk gets run (see
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=190309 ), but I don't
remember if what I did is similar in result to this new patch.

Looks like I'll actually have to configure and build a kernel. I haven't had to
do that for months. *sigh*
Comment 11 Nicholas Miell 2006-09-19 20:49:04 UTC
OK, that patch didn't help any, which is consistent with my prior
experimentations with that VIA PCI quirk.
Comment 12 Sérgio M Basto 2006-09-20 04:25:03 UTC
OK, let me see your dmesg with the patch applied, please.
Comment 13 Nicholas Miell 2006-09-20 18:12:32 UTC
Created attachment 9063
dmesg diff, post patch
Comment 14 Sérgio M Basto 2006-09-22 06:27:12 UTC
OK, I need another report. Edit /etc/X11/xorg.conf (vi /etc/X11/xorg.conf)
and comment out the "dri" module so that the line reads    # Load  "dri"
Then see whether X starts without problems. Let's see if it is a problem with via_agp.
please use the post past
Comment 15 Nicholas Miell 2006-09-22 19:43:58 UTC
This system doesn't use VIA AGP.
Comment 16 Sérgio M Basto 2006-09-22 20:27:14 UTC
I saw this  
177:    4143175          0   IO-APIC-level  EMU10K1, eth0, radeon@pci:0000:01:00.0

So what hardware does this system have?

Could you attach the output of "lspci -vvv" and "lsmod | grep via", please?
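
The /proc/interrupts line quoted in this comment shows three devices sharing IRQ 177. A small Python sketch for splitting such a line into its IRQ number and device list, assuming two per-CPU count columns (this is a dual-CPU box) and a single-word controller type such as IO-APIC-level. devices_on_irq() is a hypothetical helper name.

```python
def devices_on_irq(line, ncpus=2):
    """Split one /proc/interrupts row into (irq, [device names])."""
    irq, _, rest = line.partition(":")
    # ncpus count columns, one controller-type column, then the device list.
    fields = rest.split(None, ncpus + 1)
    names = fields[ncpus + 1] if len(fields) > ncpus + 1 else ""
    return irq.strip(), [n.strip() for n in names.split(",") if n.strip()]


LINE = "177:    4143175          0   IO-APIC-level  EMU10K1, eth0, radeon@pci:0000:01:00.0"
print(devices_on_irq(LINE))
# prints ('177', ['EMU10K1', 'eth0', 'radeon@pci:0000:01:00.0'])
```

Only the first colon is treated as the separator, so the colons inside radeon@pci:0000:01:00.0 are left alone.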
Comment 17 Nicholas Miell 2006-09-22 22:32:24 UTC
Created attachment 9080
lspci -vvv output
Comment 18 Nicholas Miell 2006-09-22 22:33:03 UTC
[nicholas@entropy ~]$ lsmod | grep via
i2c_viapro             43353  0
i2c_core               60993  3 w83627hf,i2c_isa,i2c_viapro
sata_via               42821  1
libata                113113  1 sata_via
Comment 19 Len Brown 2007-03-30 19:10:28 UTC
can you reproduce this issue with 2.6.21-rc5 or later?
can you reproduce this issue with nmi_watchdog=0?
Comment 20 Adrian Bunk 2007-06-03 14:37:00 UTC
Please reopen this bug if:
- it is still present with kernel 2.6.21 and
- you can provide the requested information.
