Bug 9994
| Summary: | atiixp ide timeouts | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Honza Fikar (jan.fikar) |
| Component: | IDE | Assignee: | io_ide (io_ide) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | kernel |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.24 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Description
Honza Fikar
2008-02-14 16:17:11 UTC
Please try using git-bisect to narrow down the problem to the specific commit. Install the git package, get the git tree from kernel.org and do:

    git bisect start
    git bisect good v2.6.23
    git bisect bad v2.6.24

It will select the kernel to test; compile and boot it to see if the problem is still there. If so, do "git bisect bad", which will give you a new kernel to test. If the kernel works fine, do "git bisect good" instead. After a few iterations you should find the exact commit which introduced the bug. Thanks.

I have tried the bisection and I got a result that does not seem right to me, but maybe it is. The problem is that the bug does not show up immediately, but only after a couple of hours. So I waited for the bug at least 6 hours before each "git bisect good", which made the whole bisection long, but maybe even that was not enough... I will try once again and wait longer. Anyway, the result is here:

    2421ba5b57ddbc3a972b9d6fb884817c39d2fff7 is first bad commit
    commit 2421ba5b57ddbc3a972b9d6fb884817c39d2fff7
    Author: Kyle McMartin <kyle@mako.i.cabal.ca>
    Date: Wed Nov 28 02:17:53 2007 -0500

        [PARISC] timer interrupt should not be IRQ_DISABLED

        The timer interrupt had accidentally been marked IRQ_DISABLED since
        IRQ_PER_CPU had been OR-ed in, instead of set. This had been working
        by accident for quite a while. Commit
        c642b8391cf8efc3622cc97329a0f46e7cbb70b8 changed the behaviour of
        IRQ_PER_CPU interrupts, which previously weren't checked for
        IRQ_DISABLED.

        Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>

    :040000 040000 e77168c15d33749d5072ee1f6db59d05c128c5c6 426eb6ed185908219e7469fcfedf53d405b720d6 M arch

Unfortunately this commit is not the one we are looking for, as it only affects the PARISC architecture :/

PS I assume that 2.6.25-rc4 still has this problem?

I am afraid I missed the bug at least once, so the result is wrong :( I'll check 2.6.25-rc4 first and then I'll try to redo the bisection.
original downstream report: http://bugs.gentoo.org/show_bug.cgi?id=209786

The latest results are not very good... Bisecting and waiting for at least 24 hours at each step leads to a different wrong result:

    f435a91e66e7776f0c73fca5af3cb87c61130ed6 is first bad commit
    commit f435a91e66e7776f0c73fca5af3cb87c61130ed6
    Author: Ralf Baechle <ralf@linux-mips.org>
    Date: Thu Dec 6 17:15:57 2007 +0000

        [MIPS] BCM1480: Fix interrupt routing, take 2.

        Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

    :040000 040000 73dbf425d802ab4d4a1c6fd4776df2b9d0086043 0fa229550fa638511ff9f9edde20f0329d650174 M arch

Then I tried kernel 2.6.25-rc6 and unfortunately I still see the error, although it seems to be triggered even less often. The first time I saw a single DMA timeout in 4 days, without DMA being disabled. The second time it came after three days. The last time I saw the real problem: a lot of timeouts, resets and DMA being disabled on hda over two days.

Is it possible to try only the atiixp-related patches instead of bisecting the whole tree? If I need to wait, let's say, 5 days per step, bisecting would take two months in the worst case :(

You can do this by appending " drivers/ide/pci/atiixp.c" at the end of the "git bisect start" command, but I worry that it may be fruitless since there were very few atiixp changes.
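The two-month worry above can be checked with quick arithmetic. What follows is a minimal sketch, assuming bisection needs roughly log2(N) test kernels for N candidate commits. The figure of 10,000 commits between v2.6.23 and v2.6.24 is a round assumed number, not a count taken from this report, and `bisect_steps` and `bisect_start_argv` are hypothetical helper names. The second helper only assembles the argument list for a path-limited `git bisect start` (bad revision first, then good, then `--` and the paths); it does not run git.

```python
def bisect_steps(n_commits: int) -> int:
    """Approximate number of kernels to test: ceil(log2(n_commits))."""
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using only integers.
    return (n_commits - 1).bit_length()


def bisect_days(n_commits: int, days_per_test: float) -> float:
    """Worst-case wall-clock time if every step needs the full soak time."""
    return bisect_steps(n_commits) * days_per_test


def bisect_start_argv(bad: str, good: str, paths=()) -> list:
    """Build (but do not execute) a 'git bisect start' command line."""
    argv = ["git", "bisect", "start", bad, good]
    if paths:
        # Paths after '--' restrict bisection to commits touching them.
        argv += ["--", *paths]
    return argv


if __name__ == "__main__":
    assumed_commits = 10_000  # assumption: rough size of v2.6.23..v2.6.24
    print(bisect_steps(assumed_commits))      # 14
    print(bisect_days(assumed_commits, 5))    # 70
    print(" ".join(bisect_start_argv(
        "v2.6.24", "v2.6.23", ["drivers/ide/pci/atiixp.c"])))
```

Under these assumptions, 14 steps at 5 days each come to about 70 days, in line with the two-month estimate. Limiting the bisection to a single file shrinks N sharply, which is why it is much faster but finds nothing if the culprit lies elsewhere.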
Hmmm, it is also possible that these timeouts are caused by IRQ routing problems and not by IDE changes:
> APIC error on CPU0: 00(40)
Does this error always show up just before IDE timeout?
I'll try...

I'm not sure about the APIC error. With some earlier kernels (up to 2.6.22 or 2.6.23, if I remember well) the log was full of those APIC errors. They were produced at a rate of two or three per hour, and I thought they were harmless. Anyway, at that time there were no problems with atiixp. Then, with the 2.6.22 or 2.6.23 upgrade, they no longer came so often, maybe once per day or so, and I still ignored them. In fact I'm not sure whether the error was corrected or only silenced. There were still no atiixp problems. The atiixp timeouts started in 2.6.24.

I think I can check your hypothesis; maybe you are right. I remember that at least in the 2.6.25-rc6 case with the single DMA timeout, the APIC error was in the log just before it. However, I'm not sure how long before. I think I can switch on the timestamps. And if it really is an IRQ routing problem, shouldn't I try to disable it with the noapic kernel parameter or something like that?

Any updates here? Is this still a problem on the latest development release, currently v2.6.28-rc7?

Sorry for the silence, the bug is meanwhile gone! It disappeared during 2.6.26, I guess. Now I'm at 2.6.27 and I haven't seen the bug for a long time. So it is solved, but I don't know by what exactly.
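The question raised earlier in the thread, whether an APIC error always shows up just before an IDE timeout, can be checked mechanically once timestamps are switched on. This is a minimal sketch, assuming each kernel log line starts with a `[seconds.fraction]` stamp. The "APIC error" prefix comes from the line quoted in this report; the IDE timeout pattern, the 60-second window, the sample lines and the name `apic_before_timeouts` are all assumptions made for illustration.

```python
import re

# Assumed layout: "[  123.456789] message text"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
# Assumed wording: any hdX line mentioning a timeout.
IDE_TIMEOUT = re.compile(r"^hd[a-z]:.*timeout", re.IGNORECASE)


def apic_before_timeouts(lines, window=60.0):
    """For each IDE timeout, return (time, gap), where gap is the number of
    seconds since the most recent APIC error, or None if there was no APIC
    error within `window` seconds before the timeout."""
    last_apic = None
    results = []
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a timestamp
        stamp, message = float(match.group(1)), match.group(2)
        if message.startswith("APIC error"):
            last_apic = stamp
        elif IDE_TIMEOUT.search(message):
            gap = None if last_apic is None else stamp - last_apic
            if gap is not None and gap > window:
                gap = None
            results.append((stamp, gap))
    return results


if __name__ == "__main__":
    # Hypothetical log excerpt, not taken from the reporter's machine.
    sample = [
        "[  100.000000] APIC error on CPU0: 00(40)",
        "[  112.500000] hda: DMA timeout error",
        "[ 5000.000000] hda: DMA timeout error",
    ]
    for stamp, gap in apic_before_timeouts(sample):
        print(stamp, gap)
```

If most timeouts report a small gap, that would support the IRQ routing hypothesis; many timeouts with no nearby APIC error would point back toward the IDE driver.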