Bug 7415

Summary: VIA SATA system cannot be initialized
Product: IO/Storage
Reporter: j.taimr (j.taimr)
Component: Serial ATA
Assignee: Tejun Heo (htejun)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: high
CC: alan, albertcc, htejun, kernel, mehturt
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: >=2.6.18
Subsystem:
Regression: ---
Bisected commit-id:
Attachments:
- dmesg after good boot of 2.6.17-gentoo-r8
- dmesg from 'bad' boot of 2.6.18-gentoo
- dmesg from 'bad' boot of vanilla-2.6.19-rc2
- lspci output
- lspci -n output
- lspci -v output
- patch
- dmesg of 2.6.18-gentoo-r4 after unrolling of the problematic commit
- if vt6420, don't use SCR at all during detection
- Modified Tejun's patch
- Set ATA_NIEN and reset ATA_NIEN patch
- turn on irq on both devices

Description j.taimr 2006-10-25 09:27:53 UTC
Most recent kernel where this bug did not occur:2.6.17-gentoo-r8
Distribution:Gentoo
Hardware Environment:AMD64X2 4800+/Abit AV8/1GB RAM/Ati Radeon 9100
Software Environment:64 bit w. 32 bit compatibility
Problem Description:

I am also using Gentoo; the last working kernel is
2.6.17-gentoo-r8. Kernels 2.6.18-gentoo and vanilla-2.6.19-rc2 do not initialize
the VIA SATA subsystem properly, and the boot ends with a kernel panic. The typical
message is:

sata_via 0000:00:0f.0: routed to hard irq line 2
ata1: SATA max UDMA/133 cmd 0xE000 ctl 0xE102 bmdma 0xE400 irq 18
ata2: SATA max UDMA/133 cmd 0xE200 ctl 0xE302 bmdma 0xE408 irq 18
scsi0 : sata_via
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ATA: abnormal status 0xD8 on port 0xE007
scsi1 : sata_via
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ATA: abnormal status 0xD8 on port 0xE207

The situation is the same with kernel params acpi=off and/or noapic, so it looks
like a bug in the sata_via driver. No boot attempt is successful; it always stops
with a kernel panic because the root filesystem is inaccessible. My HDs and SATA
controller are OK: no suspicious messages in smartmon long tests, and under
2.6.17-gentoo-r8 everything just works.
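
For readers decoding the log above, here is a minimal sketch in C of how the quoted register values break down. It assumes the SStatus field layout from the SATA specification (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) and the status and err_mask bit values as defined in the kernel's ata.h and libata.h headers. The struct and function names are illustrative only; they are not libata APIs.

```c
#include <assert.h>
#include <stdint.h>

/* SStatus fields (layout assumed from the SATA specification). */
struct sstatus_fields {
    unsigned det; /* bits 3:0, device detection state */
    unsigned spd; /* bits 7:4, negotiated link speed */
    unsigned ipm; /* bits 11:8, interface power management state */
};

/* Illustrative helper, not a libata function. */
static struct sstatus_fields decode_sstatus(uint32_t sstatus)
{
    struct sstatus_fields f;
    f.det = sstatus & 0xf;
    f.spd = (sstatus >> 4) & 0xf;
    f.ipm = (sstatus >> 8) & 0xf;
    return f;
}

/* ATA status register bits and the libata timeout error bit
 * (values assumed to match linux/ata.h and linux/libata.h). */
enum {
    ATA_BUSY       = 1 << 7,
    ATA_DRDY       = 1 << 6,
    ATA_DRQ        = 1 << 3,
    ATA_ERR        = 1 << 0,
    AC_ERR_TIMEOUT = 1 << 2,
};

static int status_is_busy(uint8_t status)
{
    return (status & ATA_BUSY) != 0;
}

static int is_timeout(unsigned err_mask)
{
    return (err_mask & AC_ERR_TIMEOUT) != 0;
}

/* Decodes the values that appear in the failing boot log. */
static void check_log_values(void)
{
    struct sstatus_fields f = decode_sstatus(0x113);

    assert(f.det == 3);          /* device present, link established */
    assert(f.spd == 1);          /* first-generation speed, 1.5 Gbps */
    assert(f.ipm == 1);          /* interface active */
    assert(is_timeout(0x4));     /* err_mask=0x4 is the timeout bit */
    assert(status_is_busy(0xD8)); /* "abnormal status 0xD8": BSY is set */
}
```

Under these assumptions the link itself came up normally (SStatus 113), the IDENTIFY DEVICE command (0xec) timed out, and the device was still reporting busy afterwards, which is consistent with a command completion that was never noticed rather than with a dead link.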

I already joined the bug 7235, but I was told that my problem is probably not
#7235 related.

Steps to reproduce:
- compile 2.6.18-gentoo or vanilla-2.6.19-rc2 kernel on my system ;-)
- boot attempt
- it always fails with the mentioned message
Comment 1 j.taimr 2006-10-25 09:31:38 UTC
Created attachment 9349 [details]
dmesg after good boot of 2.6.17-gentoo-r8
Comment 2 j.taimr 2006-10-25 09:32:41 UTC
Created attachment 9350 [details]
dmesg from 'bad' boot of 2.6.18-gentoo
Comment 3 j.taimr 2006-10-25 09:33:26 UTC
Created attachment 9351 [details]
dmesg from 'bad' boot of vanilla-2.6.19-rc2
Comment 4 j.taimr 2006-10-25 09:34:06 UTC
Created attachment 9352 [details]
lspci output
Comment 5 j.taimr 2006-10-25 09:34:39 UTC
Created attachment 9353 [details]
lspci -n output
Comment 6 j.taimr 2006-10-25 09:35:16 UTC
Created attachment 9354 [details]
lspci -v output
Comment 7 mehturt 2006-10-27 13:43:07 UTC
I have the same problem with vanilla 2.6.18.1
Comment 8 mehturt 2006-10-27 14:10:34 UTC
This patch fixed the problem for me
http://marc.theaimsgroup.com/?l=git-commits-head&m=116121959622812&q=raw
Comment 9 j.taimr 2006-10-28 12:59:11 UTC
I tried the mentioned patch; it did NOT help for me, still the same situation.
Comment 10 Daniel Drake 2006-11-14 08:12:41 UTC
Original report is at http://bugs.gentoo.org/150773
Comment 11 j.taimr 2006-12-03 11:21:14 UTC
After a few weeks my situation remains unchanged; I tried new kernels
gentoo-sources-2.6.18-r1, gentoo-sources-2.6.18-r2, gentoo-sources-2.6.18-r3,
gentoo-sources-2.6.19 and gentoo-sources-2.6.19-r1. The symptoms are still the
same as I wrote above, none of my SATA disks is accessible after a boot attempt
with any of the kernels mentioned.
Comment 12 Daniel Drake 2006-12-03 17:04:03 UTC
Thanks for your continued interest here. Here's a process you could try which is
likely to find the exact buggy commit, but it will probably be quite time
consuming for you:

http://www.kernel.org/pub/software/scm/git/docs/howto/isolate-bugs-with-bisect.txt

You need to start with 2.6.17 as the good kernel and 2.6.18 as the bad one.

But, it's not quite that simple: we know that your hardware will *not* work on
upstream 2.6.17 or 2.6.18 because of another issue separate to this one: it will
fail to quirk the IRQ, i.e. it will not show this message:

PCI: VIA IRQ fixup for 0000:00:0f.0, from 11 to 2

So, you need to ensure that one of the quirk fix patches is included in each
iteration of kernel that you test with this process. I'd suggest using this one
as a safe bet:
http://dev.gentoo.org/~dsd/genpatches/trunk/2.6.17/2500_via-irq-quirk-revert.patch

I am not sure if you will have to revert that before marking a specific
iteration as good or bad (then re-applying it before compiling the next
iteration), or if it is safe to leave in place, this is for you to find out ;)
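
The bisection process above is essentially a binary search over the commit history between a known-good and a known-bad kernel. The following is a minimal sketch in C of that idea only; it is not how git implements bisect (real history is a graph, not a line), and the names `bisect_first_bad`, `is_bad_fn` and `bad_from` are hypothetical. The extra step of applying the IRQ quirk patch before each build is outside this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test callback: returns nonzero if the kernel built from
 * commit number i shows the bug. In practice this is a build and a boot. */
typedef int (*is_bad_fn)(size_t i, void *ctx);

/*
 * Assumes a linear history in which commit `good` works, commit `bad`
 * fails, and every commit after the first bad one is also bad.
 * Returns the index of the first bad commit and stores the number of
 * kernels that had to be tested in *steps.
 */
static size_t bisect_first_bad(size_t good, size_t bad,
                               is_bad_fn is_bad, void *ctx,
                               unsigned *steps)
{
    assert(good < bad);
    *steps = 0;

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid, ctx))
            bad = mid;   /* bug already present: look earlier */
        else
            good = mid;  /* still fine: look later */
        ++*steps;
    }
    return bad;
}

/* Example predicate: every commit at or after *ctx is bad. */
static int bad_from(size_t i, void *ctx)
{
    return i >= *(const size_t *)ctx;
}
```

With 8192 commits between the good and bad endpoints, this search tests 13 kernels. That figure assumes every intermediate kernel builds and boots far enough to be judged; each commit that cannot be tested adds extra iterations.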
Comment 13 Daniel Drake 2006-12-03 17:05:25 UTC
Here's another writeup on bisections which includes more of an introduction to
obtaining the kernel tree via git:
http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/
Comment 14 j.taimr 2006-12-16 11:45:27 UTC
Still trying, now with git. It feels like a kind of torture; the information
mentioned above is far too optimistic (isolating and identifying the single bad
patch in 13 steps, etc.). I recompiled git kernels approx. 60 times during the
last week, and unfortunately things went wrong many times (I noted 5 kernel
oopses during boot of a new git kernel after recompilation, once within the SATA
subsystem and 4 times in other places; some kernels did not compile at all
because of undefined symbols). So far I only know that all 2.6.17 and
2.6.17-rc1..rc6 kernels work properly, while 2.6.18-rc1 and 2.6.18 suffer from
this problem. The last version of libata which works for me is 1.20. I think the
problem starts with libata 2.0; so far I have not found a way to revert the
1.20->2.0 patch on the master kernel (git complained that 'cannot revert this
patch!'). I have to become more familiar with git and go manually from the last
good revision patch by patch. And this flip-flopping with the VIA quirks patch
must be done at every step until it is a part of the current kernel (git is not
happy otherwise...).
Comment 15 Daniel Drake 2006-12-16 15:17:52 UTC
Then perhaps you could find the commit before the 1.20 --> 2.0 upgrade, check it
out, confirm whether it works. Then check out the commit after, and confirm that
the breakage appears.

If you are plagued by compilation errors then try building a minimal kernel
only. I have done several bisections myself and have only had minor problems.
Comment 16 j.taimr 2006-12-17 09:22:01 UTC
After simplifying .config and cleaning up some mess, I have a good candidate
for the troublemaker:

Author: Tejun Heo <htejun@gmail.com>  2006-06-16 08:13:53
Committer: Jeff Garzik <jeff@garzik.org>  2006-06-20 11:12:15
Parent: c5fa46e175ccd02803031ea071060cdb01521736 ([libata] sata_nv:
s/spin_lock_irqsave/spin_lock/ in irq handler)
Branches: origin, master, bisect
Follows: v2.6.17
Precedes: v2.6.18-rc1

    [PATCH] sata_via: convert to new EH, take #3
    
    Convert sata_via to new EH.  vt6420 used ATA_FLAG_SRST while vt6421
    used ATA_FLAG_SATA_RESET.  This difference seems to be an accident
    rather than intended.  This patch makes both flavors use
    ata_bmdma_error_handler() which makes use of both SRST and SATA
    hardreset.  This behavior change is intended and if it breaks
    anything, it should be very easy to spot.
    
    Signed-off-by: Tejun Heo <htejun@gmail.com>
    Signed-off-by: Jeff Garzik <jeff@garzik.org>

Commit #c5fa46e175ccd02803031ea071060cdb01521736 is good,
commit #d7a80dad2fe19a2b8c119c8e9cba605474a75a2b is bad, and
commit #40ef1d8d48e364dce689342adfdc475aa53f4808 is bad as well; this is the
result of git bisect.
Unfortunately, it cannot be reverted:

First trying simple merge strategy to revert.
Simple revert fails; trying Automatic revert.
ERROR: drivers/scsi/sata_via.c: Not handling case
322890b400a6000a9d627fa44d69fcabdfe9f131 ->  ->
c6975c5580ef8c9e62d3b6660e6841a5b9575c69
fatal: merge program failed

What should I try now?
Comment 17 j.taimr 2006-12-17 10:25:20 UTC
It is definitely #40ef1d8d48e364dce689342adfdc475aa53f4808.
I was able to revert this commit in #71d530cd1b6d97094481002a04c77fea1c8e1c22
(which has the mentioned troubles and it was 'bad' in git bisect testing). After
reverting kernel detects my SATA subsystem properly.
Comment 18 Daniel Drake 2006-12-17 11:09:16 UTC
Created attachment 9854 [details]
patch

Many thanks for all the testing! Here is a patch which should apply to recent
kernels. It reverts commit 40ef1d8d48e364dce689342adfdc475aa53f4808

Tejun, what are your thoughts? This is a VT6420 device.
Comment 19 j.taimr 2006-12-17 12:03:34 UTC
Only if there are no additional dependencies and consequences - and there are
(remember - I was unable to revert that commit against 2.6.18!)
Comment 20 Daniel Drake 2006-12-17 12:26:56 UTC
I made that patch by hand, it definitely does apply. It was made from a kernel
which is just-about-2.6.20-rc1.
Comment 21 j.taimr 2006-12-17 13:19:24 UTC
Created attachment 9860 [details]
dmesg of 2.6.18-gentoo-r4 after unrolling of the problematic commit
Comment 22 j.taimr 2006-12-17 13:21:34 UTC
So, I can confirm. It works well.
Comment 23 Tejun Heo 2006-12-19 08:46:32 UTC
Created attachment 9884 [details]
if vt6420, don't use SCR at all during detection

Does this patch fix your problem?  This is against v2.6.19.
Comment 24 j.taimr 2006-12-20 01:30:01 UTC
Created attachment 9893 [details]
Modified Tejun's patch

Modified Tejun's patch - this helps, the original one does NOT solve the
problem
Comment 25 j.taimr 2006-12-20 03:11:51 UTC
Tejun, your patch works too, but if and only if the line:

        .freeze                 = ata_bmdma_freeze,

is NOT present. When it is commented out, the system boots. With this line the
boot freezes during VIA SATA initialization, as mentioned above.
Comment 26 j.taimr 2006-12-20 03:25:00 UTC
And, without the '.freeze' line, the original driver works as well, without any
additional patching. The trouble is, I do not know the possible consequences (= is
it possible to just remove the mentioned line without any disaster in future?)
Comment 27 j.taimr 2006-12-20 12:06:27 UTC
Created attachment 9903 [details]
Set ATA_NIEN and reset ATA_NIEN patch

Apparently, my system does not like it when the ATA_NIEN bit is set and kept on
for too long. My system boots with the attached patch to libata-sff.c, in which
I reset the ATA_NIEN bit immediately after the ata_chk_status(ap) operation.
Just another piece of info: libata uses IRQ 18 and this interrupt is not shared
with anything else. The behaviour is the same in non-MSI/MSI-X and in MSI/MSI-X
mode; the mode has no influence.
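
To make the ATA_NIEN observation above concrete, here is a small user-space model in C of the bit being toggled. It is only a sketch: `struct model_port` and the `model_*` functions are hypothetical stand-ins, not the libata code, and the only things taken from the kernel headers are the assumed bit values (ATA_NIEN as bit 1 and ATA_SRST as bit 2 of the Device Control register).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Device Control register bits (values assumed to match linux/ata.h). */
enum {
    ATA_NIEN = 1 << 1, /* when set, the device does not raise its interrupt */
    ATA_SRST = 1 << 2, /* software reset */
};

/* Hypothetical stand-in for the few port fields this model needs. */
struct model_port {
    uint8_t ctl;     /* cached Device Control value */
    uint8_t ctl_reg; /* last value "written" to the control register */
};

/* Models freezing a port: mask device interrupts by setting nIEN. */
static void model_freeze(struct model_port *p)
{
    assert(p != NULL);
    p->ctl |= ATA_NIEN;
    p->ctl_reg = p->ctl;
}

/* Models turning interrupts back on: clear nIEN and write it out. */
static void model_irq_on(struct model_port *p)
{
    assert(p != NULL);
    p->ctl &= (uint8_t)~ATA_NIEN;
    p->ctl_reg = p->ctl;
}

/* True if the device is currently allowed to raise interrupts. */
static int model_irq_enabled(const struct model_port *p)
{
    assert(p != NULL);
    return (p->ctl_reg & ATA_NIEN) == 0;
}
```

In this model, a command issued between `model_freeze()` and `model_irq_on()` can never signal completion by interrupt, so an interrupt-driven IDENTIFY would run into its timeout. Whether that is what actually happens on the VT6420 is not established in this report; the model only shows why clearing the bit early could plausibly change the outcome.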
Comment 28 j.taimr 2006-12-22 00:39:16 UTC
So, I used the following patches; it seems so far, so good:
Patch for 2.6.18-gentoo-r4:
---------- snip ------------------------
diff -u libata-bmdma.c.orig libata-bmdma.c
--- libata-bmdma.c.orig 2006-12-21 08:53:43.000000000 +0100
+++ libata-bmdma.c      2006-12-21 09:02:25.000000000 +0100
@@ -671,6 +671,8 @@
                writeb(ap->ctl, (void __iomem *)ioaddr->ctl_addr);
        else
                outb(ap->ctl, ioaddr->ctl_addr);
+       ata_wait_idle(ap);
+       ata_irq_on(ap);
 }

 /**
--------- snip -------------------------
Patch for linux-git-2.6.20-rc1
--------- snip -------------------------
diff -u libata-sff.c.orig libata-sff.c
--- libata-sff.c.orig   2006-12-21 08:41:52.000000000 +0100
+++ libata-sff.c        2006-12-21 08:42:07.000000000 +0100
@@ -706,7 +706,7 @@
         * previously pending IRQ on ATA_NIEN assertion.  Clear it.
         */
        ata_chk_status(ap);
-
+       ata_irq_on(ap);
        ap->ops->irq_clear(ap);
 }

-------- snip --------------------------
My system was used heavily in the last 24 hours, and I regularly tested the
filesystem consistency (no errors so far).
But I still have the same doubt: is it safe to use this modification from a
long-term point of view?
Comment 29 j.taimr 2006-12-27 00:33:09 UTC
Kernel 2.6.18-gentoo-r6:
identical situation: it does not work as distributed, works well after patching
as in #28.
Comment 30 Tejun Heo 2006-12-27 05:06:30 UTC
OIC.  Thanks for finding this out.  It's very surprising tho.  It may have
something to do with detection failures we're seeing in other controllers too. 
I'll investigate it more.  For the time being, your change is not gonna destroy
your data, so don't worry.
Comment 31 Ben Hodgetts (Enverex) 2006-12-27 11:40:50 UTC
I applied the "ata_irq_on(ap);" patch to the 2.6.20-rc1 GIT but it didn't work
for me, still stuck at the same place, gave the same errors and eventually
booted without detecting the drive :(
Comment 32 Tejun Heo 2006-12-27 19:43:18 UTC
Created attachment 9957 [details]
turn on irq on both devices

Can you please test whether this patch fixes via detection problem?  It's
against 2.6.19.
Comment 33 j.taimr 2006-12-28 03:18:57 UTC
It does not work, Tejun. Sorry for bad news.
Comment 34 acb 2007-01-07 14:37:31 UTC
I applied the patch from comment #32 against 2.6.19.1 vanilla and it works for 
me.

But the patch in comment #28 against 2.6.18 vanilla and Debian sources did not
work. No self-compiled (unpatched) kernel since 2.6.18 worked, although the
Debian-supplied kernel image 2.6.18-1-k7 does work. I have installed the SATA
disk just recently, so I don't know if older kernels would work.

My mainboard is an Asus A7V600-X with the following VIA SATA Controller 
(according to lspci):
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID 
Controller (rev 80)
00:0f.0 0104: 1106:3149 (rev 80)
Comment 35 Tejun Heo 2007-01-25 01:57:30 UTC
acb, please post the 'dmesg' output of a successful and a failed detection.
Thanks.
Comment 36 Tejun Heo 2007-02-27 07:13:30 UTC
Fixed in 2.6.20.  Closing.  Please test 2.6.20.1 and reopen if it's still broken.