Summary: VIA SATA system cannot be initialized
Component: Serial ATA | Assignee: Tejun Heo (htejun)
Severity: high | CC: alan, albertcc, htejun, kernel, mehturt
dmesg after good boot of 2.6.17-gentoo-r8
dmesg from 'bad' boot of 2.6.18-gentoo
dmesg from 'bad' boot of vanilla-2.6.19-rc2
lspci -n output
lspci -v output
dmesg of 2.6.18-gentoo-r4 after unrolling of the problematic commit
if vt6420, don't use SCR at all during detection
Modified Tejun's patch
Set ATA_NIEN and reset ATA_NIEN patch
turn on irq on both devices
Description j.taimr 2006-10-25 09:27:53 UTC
Most recent kernel where this bug did not occur: 2.6.17-gentoo-r8
Distribution: Gentoo
Hardware Environment: AMD64X2 4800+/Abit AV8/1GB RAM/Ati Radeon 9100
Software Environment: 64 bit w. 32 bit compatibility

Problem Description:
I am also using Gentoo; the last working kernel is 2.6.17-gentoo-r8. Kernels 2.6.18-gentoo and vanilla-2.6.19-rc2 do not initialize the VIA SATA subsystem properly, and the boot ends with a kernel panic. The typical message is:

sata_via 0000:00:0f.0: routed to hard irq line 2
ata1: SATA max UDMA/133 cmd 0xE000 ctl 0xE102 bmdma 0xE400 irq 18
ata2: SATA max UDMA/133 cmd 0xE200 ctl 0xE302 bmdma 0xE408 irq 18
scsi0 : sata_via
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ATA: abnormal status 0xD8 on port 0xE007
scsi1 : sata_via
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ATA: abnormal status 0xD8 on port 0xE207

The situation is the same with the kernel parameters acpi=off and/or noapic, so it looks like a bug in the sata_via driver. No boot attempt is successful; it always stops with a kernel panic because the root filesystem is inaccessible. My hard disks and SATA controller are OK: there are no suspicious messages in smartmon long tests, and under 2.6.17-gentoo-r8 everything just works. I already joined bug 7235, but I was told that my problem is probably not related to #7235.

Steps to reproduce:
- compile a 2.6.18-gentoo or vanilla-2.6.19-rc2 kernel on my system ;-)
- attempt to boot: it always fails with the message above
Comment 1 j.taimr 2006-10-25 09:31:38 UTC
Created attachment 9349 [details] dmesg after good boot of 2.6.17-gentoo-r8
Comment 2 j.taimr 2006-10-25 09:32:41 UTC
Created attachment 9350 [details] dmesg from 'bad' boot of 2.6.18-gentoo
Comment 3 j.taimr 2006-10-25 09:33:26 UTC
Created attachment 9351 [details] dmesg from 'bad' boot of vanilla-2.6.19-rc2
Comment 7 mehturt 2006-10-27 13:43:07 UTC
I have the same problem with vanilla 184.108.40.206
Comment 8 mehturt 2006-10-27 14:10:34 UTC
This patch fixed the problem for me http://marc.theaimsgroup.com/?l=git-commits-head&m=116121959622812&q=raw
Comment 9 j.taimr 2006-10-28 12:59:11 UTC
I tried the mentioned patch; it did NOT help for me, still the same situation.
Comment 11 j.taimr 2006-12-03 11:21:14 UTC
After a few weeks my situation remains unchanged; I tried the new kernels gentoo-sources-2.6.18-r1, gentoo-sources-2.6.18-r2, gentoo-sources-2.6.18-r3, gentoo-sources-2.6.19 and gentoo-sources-2.6.19-r1. The symptoms are still the same as I described above: none of my SATA disks is accessible after a boot attempt with any of the kernels mentioned.
Comment 12 Daniel Drake 2006-12-03 17:04:03 UTC
Thanks for your continued interest here. Here's a process you could try which is likely to find the exact buggy commit, but it will probably be quite time consuming for you:
http://www.kernel.org/pub/software/scm/git/docs/howto/isolate-bugs-with-bisect.txt

You need to start with 2.6.17 as the good kernel and 2.6.18 as the bad one. But it's not quite that simple: we know that your hardware will *not* work on upstream 2.6.17 or 2.6.18 because of another issue separate from this one. It will fail to quirk the IRQ, i.e. it will not show this message:

PCI: VIA IRQ fixup for 0000:00:0f.0, from 11 to 2

So you need to ensure that one of the quirk fix patches is included in each iteration of the kernel that you test with this process. I'd suggest using this one as a safe bet:
http://dev.gentoo.org/~dsd/genpatches/trunk/2.6.17/2500_via-irq-quirk-revert.patch

I am not sure whether you will have to revert that patch before marking a specific iteration as good or bad (and then re-apply it before compiling the next iteration), or whether it is safe to leave in place; this is for you to find out ;)
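For illustration only, the idea behind bisection can be sketched as a binary search in C. This is not git's implementation: it assumes a strictly linear history in which every commit builds and boots far enough to be judged, and the names `bisect_first_bad`, `is_bad_fn` and `bad_from` are hypothetical. It only shows why a range of a few thousand commits between a known-good and a known-bad kernel needs roughly a dozen test builds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test callback: returns nonzero if the kernel built from
 * commit number `idx` shows the bug (here: SATA disks not detected). */
typedef int (*is_bad_fn)(size_t idx, void *ctx);

/* Return the index of the first bad commit in the range (good, bad].
 * Precondition: commit `good` works and commit `bad` fails.
 * If `steps` is non-NULL it receives the number of kernels tested. */
size_t bisect_first_bad(size_t good, size_t bad,
                        is_bad_fn is_bad, void *ctx, unsigned *steps)
{
    unsigned n = 0;

    assert(good < bad);
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid, ctx))
            bad = mid;   /* bug already present: look earlier */
        else
            good = mid;  /* still working: look later */
        n++;
    }
    if (steps)
        *steps = n;
    return bad;
}

/* Example predicate for a simulated history: every commit from
 * *ctx onwards is bad. A real test would build and boot a kernel. */
int bad_from(size_t idx, void *ctx)
{
    size_t first_bad = *(const size_t *)ctx;

    return idx >= first_bad;
}
```

Calling `bisect_first_bad(0, 8191, bad_from, &first, &steps)` with `first` set to 5000 finds commit 5000 in at most 13 tests. In practice each test is a full kernel build and boot, and commits that fail to compile have to be skipped by hand, so the real effort can be considerably higher.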
Comment 13 Daniel Drake 2006-12-03 17:05:25 UTC
Here's another writeup on bisections which includes more of an introduction to obtaining the kernel tree via git: http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/
Comment 14 j.taimr 2006-12-16 11:45:27 UTC
Still trying, now with git. It feels like a kind of torture; the information mentioned above is far too optimistic (isolating and identifying the single bad patch in 13 steps, etc.). I recompiled git kernels approx. 60 times during the last week; unfortunately, things went wrong many times (I noted 5 kernel oopses during boot of a freshly compiled git kernel, once within the SATA subsystem and 4 times in other places, and some kernels did not compile at all because of undefined symbols). So far I only know that all 2.6.17 and 2.6.17-rc1..rc6 kernels work properly, while 2.6.18-rc1 and 2.6.18 suffer from this problem. The last version of libata which works for me is 1.20. I think the problem starts with libata 2.0; so far I have not found a way to revert the 1.20->2.0 patch on the master kernel (git complained that 'cannot revert this patch!'). I have to become more familiar with git and go manually from the last good revision, patch by patch. And this flip-flopping with the VIA quirks patch must be done at every step until it is part of the current kernel (git is not happy otherwise...).
Comment 15 Daniel Drake 2006-12-16 15:17:52 UTC
Then perhaps you could find the commit before the 1.2 --> 2.0 upgrade, check it out, confirm whether it works. Then check out the commit after, and confirm that the breakage appears. If you are plagued by compilation errors then try building a minimal kernel only. I have done several bisections myself and have only had minor problems.
Comment 16 j.taimr 2006-12-17 09:22:01 UTC
After simplifying the .config and cleaning up some mess, I have a good candidate for the troublemaker:

Author: Tejun Heo <firstname.lastname@example.org> 2006-06-16 08:13:53
Committer: Jeff Garzik <email@example.com> 2006-06-20 11:12:15
Parent: c5fa46e175ccd02803031ea071060cdb01521736 ([libata] sata_nv: s/spin_lock_irqsave/spin_lock/ in irq handler)
Branches: origin, master, bisect
Follows: v2.6.17
Precedes: v2.6.18-rc1

[PATCH] sata_via: convert to new EH, take #3

Convert sata_via to new EH. vt6420 used ATA_FLAG_SRST while vt6421 used ATA_FLAG_SATA_RESET. This difference seems to be an accident rather than intended. This patch makes both flavors use ata_bmdma_error_handler() which makes use of both SRST and SATA hardreset. This behavior change is intended and if it breaks anything, it should be very easy to spot.

Signed-off-by: Tejun Heo <firstname.lastname@example.org>
Signed-off-by: Jeff Garzik <email@example.com>

Commit #c5fa46e175ccd02803031ea071060cdb01521736 is good, commit #d7a80dad2fe19a2b8c119c8e9cba605474a75a2b is bad, and commit #40ef1d8d48e364dce689342adfdc475aa53f4808 is bad as well; this is the result of git bisect. Unfortunately, it cannot be reverted:

First trying simple merge strategy to revert.
Simple revert fails; trying Automatic revert.
ERROR: drivers/scsi/sata_via.c: Not handling case 322890b400a6000a9d627fa44d69fcabdfe9f131 -> -> c6975c5580ef8c9e62d3b6660e6841a5b9575c69
fatal: merge program failed

What should I try now?
Comment 17 j.taimr 2006-12-17 10:25:20 UTC
It is definitely #40ef1d8d48e364dce689342adfdc475aa53f4808. I was able to revert this commit on top of #71d530cd1b6d97094481002a04c77fea1c8e1c22 (which shows the problem described above and was 'bad' in git bisect testing). After reverting it, the kernel detects my SATA subsystem properly.
Comment 18 Daniel Drake 2006-12-17 11:09:16 UTC
Created attachment 9854 [details] patch Many thanks for all the testing! Here is a patch which should apply to recent kernels. It reverts commit 40ef1d8d48e364dce689342adfdc475aa53f4808. Tejun, what are your thoughts? This is a VT6420 device.
Comment 19 j.taimr 2006-12-17 12:03:34 UTC
If there are no additional dependencies and consequences - and there are some (remember, I was unable to revert that commit against 2.6.18!)
Comment 20 Daniel Drake 2006-12-17 12:26:56 UTC
I made that patch by hand, it definitely does apply. It was made from a kernel which is just-about-2.6.20-rc1.
Comment 21 j.taimr 2006-12-17 13:19:24 UTC
Created attachment 9860 [details] dmesg of 2.6.18-gentoo-r4 after unrolling of the problematic commit
Comment 22 j.taimr 2006-12-17 13:21:34 UTC
So, I can confirm. It works well.
Comment 23 Tejun Heo 2006-12-19 08:46:32 UTC
Created attachment 9884 [details] if vt6420, don't use SCR at all during detection Does this patch fix your problem? This is against v2.6.19.
Comment 24 j.taimr 2006-12-20 01:30:01 UTC
Created attachment 9893 [details] Modified Tejun's patch Modified Tejun's patch - this helps; the original one does NOT solve the problem.
Comment 25 j.taimr 2006-12-20 03:11:51 UTC
Tejun, even your patch works, if and only if the line: .freeze = ata_bmdma_freeze, is NOT present. When it is commented out, the system boots. With this line, the system freezes while initializing VIA SATA, as described above.
Comment 26 j.taimr 2006-12-20 03:25:00 UTC
And, without the '.freeze' line, the original driver works as well, without any additional patching. The trouble is, I do not know the possible consequences (i.e. is it possible to just remove the mentioned line without any disaster in the future?)
Comment 27 j.taimr 2006-12-20 12:06:27 UTC
Created attachment 9903 [details] Set ATA_NIEN and reset ATA_NIEN patch Apparently, my system does not like it when the ATA_NIEN bit is set and kept on for too long. My system boots with the attached patch to libata-sff.c, in which I clear the ATA_NIEN bit immediately after the ata_chk_status(ap) operation. One more piece of information: libata uses IRQ 18, and this interrupt is not shared with anything else. The behavior is the same in non-MSI/MSI-X and in MSI/MSI-X mode; that setting has no influence.
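To make the role of ATA_NIEN concrete, here is a small user-space simulation in C; it is a sketch, not kernel code. ATA_NIEN is bit 1 of the ATA device control register, and setting it tells the device not to raise its interrupt. The struct `fake_port` and the functions `fake_freeze`, `fake_irq_on` and `fake_irq_masked` are hypothetical stand-ins: they imitate only the bit manipulation that a freeze hook and ata_irq_on() perform on the port's control value, and a plain variable replaces the hardware register.

```c
#include <assert.h>
#include <stdint.h>

/* Device control register bit: when set, the device must not raise
 * its interrupt (nIEN). Same value as ATA_NIEN in the kernel. */
#define ATA_NIEN (1 << 1)

/* Hypothetical stand-in for a port; `ctl_reg` models the hardware
 * register that the real driver writes with writeb()/outb(). */
struct fake_port {
    uint8_t ctl;      /* cached control value, like ap->ctl */
    uint8_t ctl_reg;  /* simulated device control register  */
};

/* What freezing a port does to the control register: set nIEN. */
void fake_freeze(struct fake_port *p)
{
    p->ctl |= ATA_NIEN;
    p->ctl_reg = p->ctl;
}

/* What turning the IRQ back on does: clear nIEN again. */
void fake_irq_on(struct fake_port *p)
{
    p->ctl &= (uint8_t)~ATA_NIEN;
    p->ctl_reg = p->ctl;
}

/* Nonzero while the device is told to keep its interrupt quiet. */
int fake_irq_masked(const struct fake_port *p)
{
    return (p->ctl_reg & ATA_NIEN) != 0;
}
```

In this model, a port that is frozen and never passed to `fake_irq_on()` stays masked indefinitely, which corresponds to the "set and kept on" condition described in this comment. Starting from a control value of 0x08 (an assumed initial value), freezing yields 0x0A and re-enabling returns it to 0x08.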
Comment 28 j.taimr 2006-12-22 00:39:16 UTC
So, I used the following patches; so far, so good:

Patch for 2.6.18-gentoo-r4:
---------- snip ------------------------
diff -u libata-bmdma.c.orig libata-bmdma.c
--- libata-bmdma.c.orig 2006-12-21 08:53:43.000000000 +0100
+++ libata-bmdma.c 2006-12-21 09:02:25.000000000 +0100
@@ -671,6 +671,8 @@
 writeb(ap->ctl, (void __iomem *)ioaddr->ctl_addr);
 else
 outb(ap->ctl, ioaddr->ctl_addr);
+ ata_wait_idle(ap);
+ ata_irq_on(ap);
 }

 /**
--------- snip -------------------------

Patch for linux-git-2.6.20-rc1:
--------- snip -------------------------
diff -u libata-sff.c.orig libata-sff.c
--- libata-sff.c.orig 2006-12-21 08:41:52.000000000 +0100
+++ libata-sff.c 2006-12-21 08:42:07.000000000 +0100
@@ -706,7 +706,7 @@
 * previously pending IRQ on ATA_NIEN assertion. Clear it.
 */
 ata_chk_status(ap);
-
+ ata_irq_on(ap);
 ap->ops->irq_clear(ap);
 }
-------- snip --------------------------

My system has been used heavily over the last 24 hours, and I regularly tested the filesystem consistency (no errors so far). But I still have the same doubt: is it safe to use this modification in the long term?
Comment 29 j.taimr 2006-12-27 00:33:09 UTC
Kernel 2.6.18-gentoo-r6: identical situation: it does not work as distributed, works well after patching as in #28.
Comment 30 Tejun Heo 2006-12-27 05:06:30 UTC
OIC. Thanks for finding this out. It's very surprising tho. It may have something to do with detection failures we're seeing in other controllers too. I'll investigate it more. For the time being, your change is not gonna destroy your data, so don't worry.
Comment 31 Ben Hodgetts (Enverex) 2006-12-27 11:40:50 UTC
I applied the "ata_irq_on(ap);" patch to the 0.9.20-rc1 GIT but it didn't work for me, still stuck at the same place, gave the same errors and eventually booted without detecting the drive :(
Comment 32 Tejun Heo 2006-12-27 19:43:18 UTC
Created attachment 9957 [details] turn on irq on both devices Can you please test whether this patch fixes via detection problem? It's against 2.6.19.
Comment 33 j.taimr 2006-12-28 03:18:57 UTC
It does not work, Tejun. Sorry for bad news.
Comment 34 acb 2007-01-07 14:37:31 UTC
I applied the patch from comment #32 against 220.127.116.11 vanilla and it works for me. But the patch in comment #28 against 2.6.18 vanilla and debian sources did not work. No self-compiled (unpatched) kernel since 2.6.18 worked, although the debian-supplied kernel image 2.6.18-1-k7 does work. I installed the SATA disk only recently, so I don't know whether older kernels would work. My mainboard is an Asus A7V600-X with the following VIA SATA controller (according to lspci):

00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
00:0f.0 0104: 1106:3149 (rev 80)
Comment 35 Tejun Heo 2007-01-25 01:57:30 UTC
acb, please report the result of 'dmesg' of successful and failed detection. Thanks.
Comment 36 Tejun Heo 2007-02-27 07:13:30 UTC
Fixed in 2.6.20. Closing. Please test 18.104.22.168 and reopen if it's still broken.