Most recent kernel where this bug did not occur: none
Hardware Environment: Twinhead EFIO E12KT
Appears in all current vanilla kernels.
The driver loads and enables VIA_IRQ_GLOBAL, VIA_IRQ_DMA0_TD and VIA_IRQ_DMA1_TD. The VIA hardware keeps firing interrupts, but neither of the VIA_IRQ_DMAx_TD_PENDING bits is set in the IRQ status register. Instead, bit 10 (mask 0x0400, i.e. the eleventh bit) is always set. Rewriting the status register doesn't clear the interrupt. With these symptoms I get about 1 million interrupt events per 10 seconds. The kernel soon disables the interrupt and thus stops other hardware from working.
irqpoll option hangs the kernel.
Steps to reproduce:
On a system with a K8M800 northbridge and integrated graphics, load the via-agp and via modules. Observe that the kernel says:
irq 5: nobody cared (try booting with the "irqpoll" option)
Disabling IRQ #5
none found (other than disabling VIA_IRQ_GLOBAL)
Here are some additional reports:
Created attachment 11887
Patch to /usr/src/<linux_ver>/drivers/char/drm/drm_pciids.h
I think I found a solution to this. In via_irq.c, two sets of IRQ masks are present: one for the Unichrome Pro A and a generic one. The K8M800 is a Unichrome Pro B chipset (I think). Currently, in drm_pciids.h, the device ID of the K8M800 is marked as generic. I altered it to mark it as unichrome_pro_a, since we do not have a mask for Unichrome Pro B.
The IRQ error no longer appears. Can someone verify this? Please check out the proposed patch.
Hope this helps.
The patch seems to work for me. I'm not seeing the IRQ disable message anymore. Thank you. Shankar, could you submit the patch and close the bug? Thank you again.
I have assigned the bug to me and resolved it. Is there anything else I should do to close this? Whom should I submit it to for QA and final closure?
Good job guys, I think this is good news!!
You might want to take a look at this:
(In reply to comment #4)
I think there is more to it than this. The interrupts got mapped in the APIC, but there are other teething problems with the implementation.
Firstly, I think the register set is unknown to us. We need access to it to fix this problem reliably. If someone has access to the complete register set and is legally cleared to share it with us (NDA etc.), please do so. Otherwise, we may have to resort to trial and error to fix this.
Secondly, what we are doing in the IRQ handler seems wrong to me. Interrupt handling in a device usually involves three registers.
One is INT_ENABLE, which is usually a write register. Setting a particular bit of this register enables the interrupt associated with that bit. Judging by this, I think VIA_REG_INTERRUPT is actually the INT_ENABLE register.
The next is usually INT_STATUS, a read-only register that holds the current status of the interrupts. If a particular bit is high (1), the corresponding interrupt has just been raised. This usually maps bit for bit to INT_ENABLE: if bit 9 is enabled in INT_ENABLE, bit 9 in INT_STATUS indicates the status of that interrupt.
The last is INT_CLR, a write register that again maps bit for bit to INT_ENABLE and INT_STATUS. Setting a particular bit in this register clears the corresponding interrupt and the corresponding bit in INT_STATUS.
An alternative is for INT_STATUS to act as a R/W register, where we write to the status register directly to clear (ack) the interrupt and do away with INT_CLR altogether. So a minimum of two registers would be essential.
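To make the generic model above concrete, here is a minimal user-space sketch in C. It simulates a made-up device with separate enable and status registers plus a write-to-clear path; the struct, function names and bit positions are all hypothetical and say nothing about the real VIA register layout, which remains unknown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device, NOT the VIA layout: separate enable and status
 * registers, with a write-only "clear" path that acks status bits. */
struct irq_regs {
    uint32_t int_enable;  /* 1 = this source may raise an interrupt */
    uint32_t int_status;  /* 1 = this source is pending (read-only to software) */
};

/* Simulate the hardware raising a source; only enabled sources latch. */
static void hw_raise(struct irq_regs *r, unsigned bit)
{
    assert(bit < 32);
    if (r->int_enable & (UINT32_C(1) << bit))
        r->int_status |= UINT32_C(1) << bit;
}

/* Simulate a write to INT_CLR: every 1 bit acks the matching status bit. */
static void write_int_clr(struct irq_regs *r, uint32_t mask)
{
    r->int_status &= ~mask;
}

/* The IRQ line stays asserted while any enabled source is pending. */
static int irq_line_asserted(const struct irq_regs *r)
{
    return (r->int_status & r->int_enable) != 0;
}

/* Handler: read the status, ack exactly what was seen, return handled bits. */
static uint32_t handle_irq(struct irq_regs *r)
{
    uint32_t pending = r->int_status & r->int_enable;

    write_int_clr(r, pending);
    return pending;
}
```

In this model, a handler that never acks the bit that is actually pending leaves irq_line_asserted() true forever, which is the kind of interrupt storm described in this report.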
If we look at the Radeon implementation, it has two registers, one of which is the status register. The VIA code, on the other hand, seems to try to clear the same register, which I believe is the INT_ENABLE register. I believe this is wrong (there is no way to be sure until we get the register set). Again, I'm shooting in the dark here, but INT_ENABLE and INT_STATUS are usually successive registers, which means an offset of 0x4 (again, check the Radeon defines). So if VIA_REG_INTERRUPT is at 0x200, we may assume (again, a very long shot) that 0x204 is the status register.
This calls for some hit-and-miss coding and a lot of data gathering before we can be sure. If the interrupts are not cleared (as in my hypothesis here), there would be repeated interrupt triggers and the service handler would be kept very busy. According to my investigations, during some operations (like 3D, texture: see bug 5092), the handler is called more than 50,000 times continuously! This nearly causes a lockup (the system becomes unusably sluggish), which seems to substantiate my theory.
I think this defect cannot be closed until this is taken care of. But I will leave this in the resolved state and try to fix this in https://bugs.freedesktop.org/show_bug.cgi?id=5092, as the impact of this problem is more fully felt there.
Please do share observations, comments and corrections. I would also welcome someone hacking the code to test the above theory. That would mean liberally sprinkling the kernel DRM code with printks and reading registers 0x200, 0x204, 0x208 etc., especially in the handler (to see which bits are asserted and which are cleared). This may give us an idea, but too many printks in the code would probably slow things to a halt, not to mention thread sleep problems due to spinlocks. ;-)
Sorry about the link. Bug 5092 is actually in bugs.freedesktop.org. bugs.freedesktop.org/show_bug.cgi?id=5092
I agree that we urgently need to see the register set description to finally be sure we can fix the problem. The code in via_irq.c originally comes from VIA (Tungsten Graphics), and what I inferred from it by reverse engineering is the following:
VIA_REG_INTERRUPT is a R/W register containing both the IRQ_ENABLE and the IRQ_PENDING (status) bits at the same time. A specific interrupt is cleared by writing its IRQ_PENDING bit back. Therefore it would work as follows:
- set the IRQ_ENABLE bit for the IRQ you want to have, and the IRQ_GLOBAL bit as well (otherwise IRQs are disabled altogether)
- when the interrupt happens, read VIA_REG_INTERRUPT and look for the IRQ_PENDING bit of the interrupt you handle
- when you're finished, prepare a bitset containing all the IRQ_ENABLEs you want, the IRQ_GLOBAL *and* the IRQ_PENDING bits of the IRQ you handled, and write that back to VIA_REG_INTERRUPT
- once you have handled all pending IRQs, the hardware should de-assert the IRQ pin and the kernel should stop detecting the VIA IRQ
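The steps above can be sketched as a small user-space simulation in C. This is only an illustration of the inferred scheme (enable and pending bits in one register, pending bits cleared by writing them back as 1); the DEMO_* names, the bit positions and the split into a low "pending" half and a high "enable" half are assumptions made for the example, not the real VIA definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration only. */
#define DEMO_IRQ_GLOBAL     (UINT32_C(1) << 31)
#define DEMO_IRQ_A_ENABLE   (UINT32_C(1) << 20)
#define DEMO_IRQ_A_PENDING  (UINT32_C(1) << 4)
#define DEMO_PENDING_MASK   UINT32_C(0x0000ffff)
#define DEMO_ENABLE_MASK    (~DEMO_PENDING_MASK)

static uint32_t demo_reg;  /* stands in for the combined interrupt register */

static uint32_t demo_read(void)
{
    return demo_reg;
}

/* Simulated register write: enable bits are stored as written, pending
 * bits are write-1-to-clear (a 1 acks the bit, a 0 leaves it alone). */
static void demo_write(uint32_t val)
{
    uint32_t still_pending = demo_reg & DEMO_PENDING_MASK & ~val;

    demo_reg = (val & DEMO_ENABLE_MASK) | still_pending;
}

/* The IRQ pin is asserted while the global bit is set and something is pending. */
static int demo_irq_asserted(void)
{
    return (demo_reg & DEMO_IRQ_GLOBAL) && (demo_reg & DEMO_PENDING_MASK);
}

/* Handler for source A; returns 1 if the interrupt was ours, else 0. */
static int demo_handler(uint32_t wanted_enables)
{
    uint32_t status, handled;

    assert((wanted_enables & DEMO_PENDING_MASK) == 0); /* enable bits only */

    status = demo_read();
    handled = status & DEMO_IRQ_A_PENDING;
    if (!handled)
        return 0;

    /* Write back the enables we want, the global bit, and the pending
     * bit we just handled, which acks it. */
    demo_write(wanted_enables | DEMO_IRQ_GLOBAL | handled);
    return 1;
}
```

Note that in this model a pending bit that no handler recognises is never written back, so the pin stays asserted; on the K8M800 the observed behaviour is worse, since even rewriting the stuck bit does not release it.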
So much for the theory. I assume it works well at least for the VIA Unichrome Pro A group of chips, and it seems a reasonable IRQ handling strategy to me. I observed that earlier variants of the DRM driver didn't enable the DMA[0|1]_TD IRQs for generic chips. This seems to have been added later (maybe to support newer generic chips), and maybe this is what's breaking things for us.
The funny thing is this: the hardware keeps the IRQ pin asserted *EVEN THOUGH* I rewrote the only bit (0x0400) that was always set. Maybe the K8M800 does not support DMAx_TD interrupts at all and something altogether different was enabled by the driver. But again I agree with you, we would need to have the register set description to know for sure.
Thanks for the info. I figured that they would have mixed up the enable and status bits together in the same register, probably to minimize the overhead of multiple PCI posts, but now that the confirmation comes from you, I think we can go ahead and try some combinations. Since all pending interrupts have to be cleared for the IRQ to deassert, reading the register at different points of the handler may give a clue.
Thanks again for the info. ;-)
I've done some more detailed debugging of the spurious interrupt problem. I added code to gather statistics about calls to the VIA interrupt handler, e.g. how often a VBLANK / DMA / spurious IRQ was received and how often one was waited for (or a DMA operation fired). I've found that VBLANK interrupts are properly handled (3D apps / OpenGL usually wait for VBLANK). The same is true for DMAs: Xv uses DMAs to transfer image data into video memory. I came across some interesting things, though.
At each occurrence of a spurious interrupt I read all four DMA CSRs as well as the IRQ register and isolated the bits that were always set during a spurious interrupt.
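Isolating the bits that are set in every sample can be done with a running bitwise AND. The following is a minimal stand-alone sketch of that idea, not the actual debugging code from the driver; the struct and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulates register samples taken at each spurious interrupt. */
struct sticky_bits {
    uint32_t always_set;    /* AND of all samples: bits set every time */
    uint32_t ever_set;      /* OR of all samples: bits set at least once */
    unsigned long samples;  /* number of samples recorded */
};

static void sticky_init(struct sticky_bits *s)
{
    assert(s != NULL);
    s->always_set = UINT32_MAX;  /* identity element for AND */
    s->ever_set = 0;             /* identity element for OR */
    s->samples = 0;
}

/* Record one register value read during a spurious interrupt. */
static void sticky_sample(struct sticky_bits *s, uint32_t value)
{
    assert(s != NULL);
    s->always_set &= value;
    s->ever_set |= value;
    s->samples++;
}

/* Bits that were set in every sample; 0 if nothing was sampled yet. */
static uint32_t sticky_result(const struct sticky_bits *s)
{
    assert(s != NULL);
    return s->samples ? s->always_set : 0;
}
```

One such accumulator per register (each DMA CSR and the IRQ register) is enough; a bit that survives in always_set after many samples is a candidate for the stuck interrupt source.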
The DMA CSRs do not have a single bit that is permanently set (so they are unlikely to be the cause of the spurious interrupt).
The IRQ register has 0x0400 permanently set (the symptom). But interestingly, I happened to notice that on occasion the DMA[0|1]_TD_ENABLE bits get cleared unexpectedly. This is definitely not done by my debugging code. I suspected parts of either the kernel driver or the userspace driver of writing to the IRQ register. I instrumented all the VIA_WRITE / VIASETREG / whatever macros that are used to write to the K8M800 registers so that they are extremely noisy when writing to the IRQ register, but found nothing. Neither the kernel driver via.ko nor the Xorg-X11 via.so / Mesa unichrome-dri.so drivers unexpectedly rewrite the IRQ register.
Additionally, on the 10000th occurrence of a spurious interrupt I clear VIA_IRQ_GLOBAL to disable VIA interrupts completely and reset my statistics. But somehow the interrupts get enabled again, and I don't know how. Without restarting my laptop or sending it into hibernation in between, I see multiple occurrences of my debug output from disabling VIA_IRQ_GLOBAL.
I've attached dmesg output to illustrate the problem. The function print_irq_info prints the IRQ occurrence/wait statistics; the "XXX bits:" values are the bits that were permanently set during occurrences of the spurious interrupt. The line about via_irq.c:201 is where I clear VIA_IRQ_GLOBAL [VIA_WRITE(VIA_REG_INTERRUPT, status & ~VIA_IRQ_GLOBAL)]; the following line indicates my attempt to disable VIA interrupts.
What can be read from those lines is that some time after clearing VIA_IRQ_GLOBAL the interrupt handler gets called again (collecting VBLANK interrupts mostly, no DMA IRQs because I didn't play videos during that period). Irregularly the spurious interrupt sets in until I clear VIA_IRQ_GLOBAL and the game restarts.
Questions we should concentrate on:
- What is bit 10 (mask 0x0400, i.e. the eleventh bit) in the IRQ register of K8M800 chips?
- Why doesn't clearing VIA_IRQ_GLOBAL permanently stop interrupts from occurring?
- Who / what is setting / clearing the IRQ_DMAx_TD_ENABLE bits in the interrupt register?
- Do we deal with a hardware or a software bug?
Created attachment 12093
dmesg output - contains debugging output about the spurious VIA IRQs.
Is there anyone still working on/interested in this problem? VIA released a new driver, `UniChrome XORG 40072d display driver source code' http://www.viaarena.com/Driver/cle266cn400cn-cx700cn800xorg40072-kernel-src_20071213d.rar, on 13 December 2007. In that package the 3D/DRM-AGP subdir contains some via_updated pciids.h files. I've tried to use it on my amd64 machine, but failed... Would someone like to look into those files?
http://www.viaarena.com/default.aspx?PageID=420&OSID=25&CatID=2580&SubCatID=109
(In reply to comment #13)
> Is there anyone still working on/interested in this problem? Via has released
> new driver `UniChrome XORG 40072d display driver source code'
> at 13 December 2007. In that package the 3D/DRM-AGP subdir contains some
> via_updated pciids.h files. I've tried to use it on my amd64 machine, but
> failed... Would someone like to look into those files?
Driver description page: http://www.viaarena.com/default.aspx?PageID=420&OSID=25&CatID=2580&SubCatID=109
ASUS A8V-MX with K8M800 bridge, AMD K8 64bits
Using Debian Etch, kernels 2.6.18-5 and 220.127.116.11
As usual, the VIAarena scripts do not work. The drm_pciids.h file has all the
cards marked as generic now.
Upgraded my pc and no more k8m800! So am quitting this assignment.
See also Bug #6790.
Created attachment 15273
patch for linux 2.6.25-rc5
Although the patch from comment #18 removes the "irq 16: nobody cared" kernel message, PowerTOP ( http://www.lesswatts.org/projects/powertop/ ) reports that there are about 10000 interrupts per second when nothing happens.
PowerTOP version 1.8 (C) 2007 Intel Corporation
Cn Avg residency P-states (frequencies)
C0 (cpu running) (100,0%) 1,60 Ghz 0,0%
C1 0,0ms ( 0,0%) 800 Mhz 100,0%
C2 0,0ms ( 0,0%)
Wakeups-from-idle per second : 94411,8 interval: 10,0s
no ACPI power usage estimate available
Top causes for wakeups:
100,0% (94366,8) <interrupt> : via@pci:0000:01:00.0
0,0% ( 9,0) <interrupt> : ide1
0,0% ( 7,8) seamonkey-bin : futex_wait (hrtimer_wakeup)
0,0% ( 4,1) kicker : schedule_timeout (process_timeout)
0,0% ( 4,0) <kernel module> : usb_hcd_poll_rh_status (rh_timer_func)
0,0% ( 2,9) <interrupt> : ide0
0,0% ( 2,4) squid : schedule_timeout (process_timeout)
0,0% ( 1,6) kwin : schedule_timeout (process_timeout)
0,0% ( 1,5) dirmngr : schedule_timeout (process_timeout)
This also causes the GlidePoint touchpad to lose sync:
[ 141.361052] psmouse.c: GlidePoint at isa0060/serio4/input0 lost synchronization, throwing 3 bytes away.
[ 141.362076] psmouse.c: resync failed, issuing reconnect request
[ 141.375451] psmouse.c: GlidePoint at isa0060/serio4/input0 lost synchronization, throwing 1 bytes away.
[ 141.376473] psmouse.c: resync failed, issuing reconnect request
Re-assigning to ACPI for now, to sort it out, since this appears to be an issue with interrupt assignment on this chipset.
*** This bug has been marked as a duplicate of bug 6790 ***