Bug 15173

Summary: sata_via VT6421 softRAID
Product: IO/Storage
Reporter: Pawel Piatek (xj)
Component: Serial ATA
Assignee: Tejun Heo (tj)
Status: RESOLVED CODE_FIX
Severity: normal
CC: fercerpav, kernelbugtracker, matej.zary, napperley, peter, q, sjorrit, tj
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32.7, 2.6.33-rc6
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg output
/var/log/messages
/var/log/syslog
dmesg output after boot
dmesg after running hdparm -tT
lspci output
format of the drive
sata_via-crc-fix.patch
lspci -nnv output
lspci output
smartctl drive output
dmesg from before the lockup
lspci -nnv on my A7V600 (KT600 chipset)

Description Pawel Piatek 2010-01-30 05:33:39 UTC
Hi,

I bought a VIA SATA/PATA controller for a PCI slot and connected two SATA disks and two PATA disks to it.
When I start a software RAID array on the SATA disks, a lot of messages like these go to the logs:
kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
kernel: ata4.00: configured for UDMA/33
kernel: ata4: EH complete
kernel: ata4: hard resetting link

Some information:
Controller:
# lspci -vs 02:
02:0b.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID Controller (rev 50)
	Subsystem: VIA Technologies, Inc. VT6421 IDE RAID Controller
	Flags: bus master, medium devsel, latency 64, IRQ 10
	I/O ports at dcf0 [size=16]
	I/O ports at dcd0 [size=16]
	I/O ports at dcb0 [size=16]
	I/O ports at dc90 [size=16]
	I/O ports at dc60 [size=32]
	I/O ports at d800 [size=256]
	Expansion ROM at fb000000 [disabled] [size=64K]
	Capabilities: [e0] Power Management version 2
	Kernel driver in use: sata_via

Disks:
# lsscsi|grep 'sd[ef]'
[2:0:0:0]    disk    ATA      WDC WD10EADS-00M 01.0  /dev/sde
[3:0:0:0]    disk    ATA      WDC WD10EADS-00M 01.0  /dev/sdf

Please help.
Comment 1 Tejun Heo 2010-02-02 03:28:36 UTC
Can you please attach full dmesg output after such incident?  Thanks.
Comment 2 Pawel Piatek 2010-02-05 22:05:14 UTC
Created attachment 24921
dmesg output

dmesg output
Comment 3 Tejun Heo 2010-02-08 02:52:34 UTC
Both the controller and drive are reporting communication problems.  They can't talk to each other properly at the link level.  How often does this happen?  Can you post log w/ timestamps?  Can you please try to use a shorter cable?
Comment 4 Pawel Piatek 2010-02-09 08:43:09 UTC
Created attachment 24967
/var/log/messages

Log with timestamp /var/log/messages.
Comment 5 Pawel Piatek 2010-02-09 08:44:23 UTC
Created attachment 24968
/var/log/syslog

Log with timestamps
Comment 6 Pawel Piatek 2010-02-09 08:49:03 UTC
Hi, thanks for the reply.
I changed the cables twice, but this did not help :(. Some more hardware information:
# lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:0e.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
00:0f.0 PCI bridge: Digital Equipment Corporation DECchip 21152 (rev 03)
00:11.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 24)
01:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage Pro AGP 1X/2X (rev 5c)
02:0b.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID Controller (rev 50)

# lsscsi 
[0:0:0:0]    disk    ATA      SAMSUNG SV0511D  MJ20  /dev/sda 
[0:0:1:0]    disk    ATA      ST310210A        3.21  /dev/sdb 
[1:0:0:0]    disk    ATA      SAMSUNG SP0842N  BH90  /dev/sdc 
[1:0:1:0]    disk    ATA      ST340016A        3.05  /dev/sdd 
[2:0:0:0]    disk    ATA      WDC WD10EADS-00M 01.0  /dev/sde 
[3:0:0:0]    disk    ATA      WDC WD10EADS-00M 01.0  /dev/sdf 
[4:0:0:0]    disk    ATA      ST340016A        3.05  /dev/sdg 
[4:0:1:0]    disk    ATA      IBM-DTLA-305020  TW2O  /dev/sdh
Comment 7 Tejun Heo 2010-02-10 01:54:04 UTC
This very much looks like a hardware problem.  Can you power up a separate power supply and connect half of SATA harddrives there?  You can power up a PSU by doing the following.

  http://modtown.co.uk/mt/article2.php?id=psumod
Comment 8 Pawel Piatek 2010-02-16 13:28:58 UTC
OK. I connected the SATA drives to a separate power supply, but this didn't change anything. The problem seems to appear when copying data between the SATA drives. So when I create a RAID level 1 array, the "sync" process causes these errors.

Maybe this is a problem with DMA support in the sata_via module?
Comment 9 Tejun Heo 2010-02-17 00:20:07 UTC
Thanks for testing it.

The failures you're seeing are between the host controller (the via chip) and the hard drive. Both the controller and the hard drive are reporting that they can't hear each other very well.  Host side issues (between the controller and the components on the mainboard) usually don't manifest as ATA bus issues.

I'm afraid there isn't much the driver can do with these failures.  There could be some PHY level knobs in the controller but given that this is the first report of this type of issue with the controller, I'm much more inclined toward a faulty add-in board (ie. signal trace lengths not matched properly, faulty connector kind of things).  Can you try it on a different operating system?

Thanks.
Comment 10 Jorrit Tijben 2010-05-20 21:32:08 UTC
Created attachment 26469
dmesg output after boot
Comment 11 Jorrit Tijben 2010-05-20 21:34:32 UTC
Hello,

The problem Pawel Piatek describes also manifests itself for me and another guy who bought a VT6421-based controller at the same time.

The drives in this case are WDC WD15EARS-00Z5B1. It's known to me that these drives need to be aligned, but the fdisk in util-linux-ng 2.17.2 defaults to a start at sector 2048 anyway.
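
The alignment point can be checked with simple arithmetic. This is only a sketch: the 4096-byte physical sector size is an assumption about Advanced Format drives such as the WD15EARS, and `is_aligned` is a hypothetical helper, not something fdisk provides.

```python
LOGICAL_SECTOR = 512      # bytes per logical sector as seen by the OS
PHYSICAL_SECTOR = 4096    # assumed physical sector size (Advanced Format)


def is_aligned(start_sector, logical=LOGICAL_SECTOR, physical=PHYSICAL_SECTOR):
    """Return True if a partition starting at start_sector begins on a
    physical sector boundary."""
    return (start_sector * logical) % physical == 0


print(is_aligned(2048))  # partition starting at sector 2048 (1 MiB offset)
print(is_aligned(63))    # legacy DOS-style start at sector 63
```

Under these assumptions a start at sector 2048 is aligned, so misalignment is unlikely to explain the slow transfers described here.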

On a SB700 motherboard controller, the drive works fine (sequential output of hdparm -tT around 100MB/s)... with the VT6421 it's in the range of 1-30 MB/s. We see a lot of 'hard resetting link' messages too. Attached are my dmesg directly after booting, and a dmesg full of error messages when I run hdparm -tT on the drive, along with some other files.

I cannot verify whether the controller works on another OS, but I've ruled out cable, connection issues etc. as the drives work fine with the SB700 controller and the same problem persists for the other guy.

Note that I experience the problems on a 2.6.32.5 kernel. Because I saw some recent fixes, the other guy tried a 2.6.34 kernel but the problem still remains.

Any clues, hints, ideas?

In any case a big thanks to the (driver/libata) developers anyway, your work is greatly appreciated.

With kind regards,
Jorrit Tijben
Comment 12 Jorrit Tijben 2010-05-20 21:35:50 UTC
Created attachment 26470
dmesg after running hdparm -tT
Comment 13 Jorrit Tijben 2010-05-20 21:39:11 UTC
Created attachment 26471
lspci output
Comment 14 Jorrit Tijben 2010-05-20 21:42:04 UTC
Created attachment 26472
format of the drive
Comment 15 Tejun Heo 2010-05-21 11:51:13 UTC
Hello,

Hmm... I just tested my vt6421 addon card with several different recent wd drives and I'm seeing the same problem w/ all of them while other drives work just fine.  Bus trace doesn't show anything particular except the host claiming bad reception every now and then.  There seems to be a phy compatibility issue here.  I'll test a bit more and contact related parties.

Thanks.
Comment 16 Jorrit Tijben 2010-05-22 10:57:46 UTC
(In reply to comment #15)

> There seems to be a phy compatibility issue
> here.  I'll test a bit more and contact related parties.

Thanks for replying so quickly. I'm curious to see whether it's fixable in the driver.

Jorrit Tijben
Comment 17 Tejun Heo 2010-05-31 14:20:44 UTC
Created attachment 26589
sata_via-crc-fix.patch

Okay, this should fix the problem although I don't know how it does it.  Can you please try it?
Comment 18 Jeff Garzik 2010-05-31 18:11:13 UTC
According to the docs I have for PCI ID 0x3249, register 0x52 is "Transport Miscellaneous Control" bits:

Bit 7: Reserved (always reads 0)
Bit 6: Transport Issue Early Request to Link to improve Performance (default = 0)
Bit 5: Reserved (default = 0)
Bit 4: Single Data FIS Transmission (default = 0)
       Allow over 8k bytes.
Bit 3: BIST FIS (default = 0)
       Controller can accept BIST FIS when behaves as a device (Rx53[1:0] are set).
       This bit is set only for controller to control BIST FIS self-test.
Bit 2: SATA Flow Control Water Flag
       1 = FFF0 threshold (the value is based on RX43)
       0 = 32DW (default)
Bit 1: COMRESET Will reset both master / slave device (test mode only) (default = 0)
Bit 0: Reset Shadow Register (test mode only) (default = 0)
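
To make the bit layout above easier to read against a raw register value, here is a small decoding sketch. The bit names are transcribed from the table in this comment; `decode_rx52` is a hypothetical helper, and the example value 0x04 is the one later comments in this report write with setpci.

```python
# Bit names transcribed from the register 0x52 table quoted in this comment.
RX52_BITS = {
    7: "Reserved",
    6: "Transport Issue Early Request to Link",
    5: "Reserved",
    4: "Single Data FIS Transmission",
    3: "BIST FIS",
    2: "SATA Flow Control Water Flag",
    1: "COMRESET resets both master/slave (test mode only)",
    0: "Reset Shadow Register (test mode only)",
}


def decode_rx52(value):
    """Return the names of the bits set in an 8-bit register 0x52 value,
    highest bit first."""
    if not 0 <= value <= 0xFF:
        raise ValueError("register 0x52 is a single byte")
    return [RX52_BITS[bit] for bit in range(7, -1, -1) if value & (1 << bit)]


print(decode_rx52(0x04))
```

With 0x04 only bit 2 is set, i.e. the flow control water flag is switched from the 32DW default to the FFF0 threshold.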
Comment 19 Tejun Heo 2010-05-31 18:16:18 UTC
So, that's setting SATA Flow Control Water Flag to FFF0 threshold which is somehow based on something called RX43.  Still sounds like a proper mystery to me.  :-(

Thanks.
Comment 20 Jorrit Tijben 2010-06-04 06:49:05 UTC
Hello,

Excuse me for the late response, but I can confirm the fix by Joseph Chan works. hdparm -tT now gives around 50MB/s for buffered disk reads and the error messages are gone.

The performance is still a bit meager compared to the SB700 on-board controller, but I don't know what the possible bottlenecks are.

A very big thanks to all for investigating and fixing this, it's *really* appreciated!

Jorrit Tijben
Comment 21 Tejun Heo 2010-06-04 07:23:28 UTC
Patch already in mainline and will be released for -stable too.  Resolving as FIXED.

Thanks.
Comment 22 Markus Müller 2010-07-09 16:32:41 UTC
Works for me too! I had messages like the following; with 2.6.34.1 they have now disappeared.

ata1.00: exception Emask 0x12 SAct 0x0 SErr 0x1000500 action 0x6
ata1.00: BMDMA stat 0x5
ata1: SError: { UnrecovData Proto TrStaTrns }
ata1.00: cmd c8/00:20:b7:18:38/00:00:00:00:00/e0 tag 0 dma 16384 in
         res 51/40:20:b7:18:38/00:00:00:00:00/e0 Emask 0x12 (ATA bus error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: configured for UDMA/100
ata1: EH complete

akk:~# lspci
00:01.0 Host bridge: Advanced Micro Devices [AMD] CS5536 [Geode companion] Host Bridge (rev 33)
00:01.2 Entertainment encryption device: Advanced Micro Devices [AMD] Geode LX AES Security Block
00:09.0 Ethernet controller: VIA Technologies, Inc. VT6105M [Rhine-III] (rev 96)
00:0b.0 Ethernet controller: VIA Technologies, Inc. VT6105M [Rhine-III] (rev 96)
00:0c.0 RAID bus controller: VIA Technologies, Inc. VT6421 IDE RAID Controller (rev 50)
00:0e.0 Network controller: Atheros Communications Inc. Device 0029 (rev 01)
00:0f.0 ISA bridge: Advanced Micro Devices [AMD] CS5536 [Geode companion] ISA (rev 03)
00:0f.2 IDE interface: Advanced Micro Devices [AMD] CS5536 [Geode companion] IDE (rev 01)
00:0f.4 USB Controller: Advanced Micro Devices [AMD] CS5536 [Geode companion] OHC (rev 02)
00:0f.5 USB Controller: Advanced Micro Devices [AMD] CS5536 [Geode companion] EHC (rev 02)
akk:~#

My hard disc is a WD:

[    1.453025] sata_via 0000:00:0c.0: version 2.6
[    1.453153] sata_via 0000:00:0c.0: routed to hard irq line 9
[    1.464542] sata_via 0000:00:0c.0: setting latency timer to 64
[    1.464717] scsi0 : sata_via
[    1.471000] scsi1 : sata_via
[    1.477183] scsi2 : sata_via
[    1.483323] ata1: SATA max UDMA/133 port i16@0x1800 bmdma 0x1c00 irq 9
[    1.496424] ata2: SATA max UDMA/133 port i16@0x1840 bmdma 0x1c08 irq 9
[    1.509496] ata3: PATA max UDMA/133 port i16@0x1880 bmdma 0x1c10 irq 9
...
[    3.694987] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[    3.715340] ata2.00: ATA-8: WDC WD15EADS-00S2B0, 01.00A01, max UDMA/133
[    3.716602] ata2.00: 2930277168 sectors, multi 0: LBA48 NCQ (depth 0/32)
[    3.720406] ata2.00: configured for UDMA/133
[    3.721316] scsi 1:0:0:0: Direct-Access     ATA      WDC WD15EADS-00S 01.0 PQ: 0 ANSI: 5
[    3.723468] sd 1:0:0:0: [sda] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[    3.724202] sd 1:0:0:0: [sda] Write Protect is off
[    3.725841] sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    3.725980] sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    3.727823]  sda: sda1
[    3.741980] sd 1:0:0:0: [sda] Attached SCSI disk
Comment 23 Martin Qvist 2010-11-11 09:12:25 UTC
I had similar problems with my

00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)

and WDC Caviar Green 2TB WD20EARS disks. I thought I'd report that Joseph Chan's <JosephChan@via.com.tw> magic fix also works for this controller. I patched the kernel with the above sata_via-crc-fix.patch (uncommenting the if statement device check) and haven't had problems since.
Comment 24 Tejun Heo 2010-11-11 09:18:24 UTC
Can you please attach output of lspci -nnv?
Comment 25 Martin Qvist 2010-11-11 09:47:11 UTC
Created attachment 37112
lspci -nnv output

lspci output for the VT6420 case described below
Comment 26 Markus Müller 2010-11-11 10:43:46 UTC
Possibly more about the cause and the magic patch can be found on
http://lxr.free-electrons.com/source/drivers/ata/sata_via.c#L579
Comment 27 napperley 2011-06-02 22:08:13 UTC
I've also got the same problem (WD Caviar HD doesn't properly communicate with VIA disk controller) with a cheap SATA/IDE disk controller that uses the VIA VT6421 chipset. Linux kernel 2.6.38-8 is used. It seems that the problem still hasn't been resolved.

The typical symptom is that when an application is launched (eg Libre Office), or any other action is performed on the desktop, the disk stops working after it has been running for a few seconds. About half a minute later, the action that was started suddenly completes when the disk (WD Caviar) resumes running.

These issues occur frequently. Below is some output relating to the problem via dmesg:

----------------------------------------------
[    1.839815] scsi2 : sata_via
[    1.842185] scsi3 : sata_via
[    1.843810] scsi4 : sata_via
[    1.843964] ata3: SATA max UDMA/133 port i16@0x1460 bmdma 0x1440 irq 16
[    1.843973] ata4: SATA max UDMA/133 port i16@0x1470 bmdma 0x1448 irq 16
[    1.843979] ata5: PATA max UDMA/133 port i16@0x1480 bmdma 0x1450 irq 16
[    1.844854] e100 0000:05:08.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[    1.982048] e100 0000:05:08.0: PME# disabled
[    1.983255] e100 0000:05:08.0: eth0: addr 0xfc500000, irq 20, MAC addr 00:0b:cd:a3:39:2a
[    2.244068] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[    2.253311] ata3.00: ATA-8: WDC WD5000AAKX-001CA0, 15.01H15, max UDMA/133
[    2.253319] ata3.00: 976773168 sectors, multi 16: LBA48 NCQ (depth 0/32)
[    2.269316] ata3.00: configured for UDMA/133
[    2.269567] scsi 2:0:0:0: Direct-Access     ATA      WDC WD5000AAKX-0 15.0 PQ: 0 ANSI: 5
[    2.269981] sd 2:0:0:0: Attached scsi generic sg1 type 0
[    2.270535] sd 2:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/465 GiB)
[    2.270655] sd 2:0:0:0: [sda] Write Protect is off
[    2.270664] sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    2.270715] sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.313705]  sda: sda1 sda2 < sda5 >
[    2.314696] sd 2:0:0:0: [sda] Attached SCSI disk
[    2.599041] ata4: SATA link down (SStatus 0 SControl 310)
[    3.006774] EXT4-fs (sda1): INFO: recovery required on readonly filesystem
[    3.006784] EXT4-fs (sda1): write access will be enabled during recovery
[    3.063666] ata3.00: exception Emask 0x12 SAct 0x0 SErr 0x1380500 action 0x6
[    3.063674] ata3.00: BMDMA stat 0x5
[    3.063681] ata3: SError: { UnrecovData Proto 10B8B Dispar BadCRC TrStaTrns }
[    3.063689] ata3.00: failed command: READ DMA EXT
[    3.063702] ata3.00: cmd 25/00:00:88:bb:05/00:01:1d:00:00/e0 tag 0 dma 131072 in
[    3.063705]          res 51/84:7f:88:bb:05/84:00:1d:00:00/e0 Emask 0x12 (ATA bus error)
[    3.063711] ata3.00: status: { DRDY ERR }
[    3.063716] ata3.00: error: { ICRC ABRT }
[    3.063731] ata3: hard resetting link
[    3.380055] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
----------------------------------------------

As you can see the issue still hasn't been resolved even though the issue is currently marked as RESOLVED CODE_FIX.

Does VIA really have a VT6420 chipset? Sounds too similar/close to VT6421.
Comment 28 Tejun Heo 2011-06-12 12:48:46 UTC
napperley, the problem discussed in this report was a quite specific incompatibility between some WDC drives and vt6420/1 and manifests as an almost constant stream of ATA bus errors with a specific SError value (0x1000500).  The workaround has been applied to all vt6420/1 controllers.

The problem you're seeing seems different.  Can you please try debugging the hardware first?  ie. try different cable, port, hard drive, power supply and observe and report how the pattern of failures changes.

Thanks.
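
To compare SError values like the two mentioned in this report (0x1000500 and 0x1380500), a decoding sketch follows. The names for bits 8, 10, 19, 20, 21 and 24 can be cross-checked against the `SError: { ... }` lines in the logs above; the remaining entries are filled in from memory of the SATA SError register layout and should be treated as an assumption. `decode_serror` is a hypothetical helper, not a kernel or libata interface.

```python
# (bit position, flag name) pairs. Bits 8, 10, 19, 20, 21 and 24 match the
# SError lines quoted in this report; the other entries are an assumption.
SERROR_BITS = [
    (0, "RecovData"), (1, "RecovComm"), (8, "UnrecovData"), (9, "Persist"),
    (10, "Proto"), (11, "HostInt"), (16, "PHYRdyChg"), (17, "PHYInt"),
    (18, "CommWake"), (19, "10B8B"), (20, "Dispar"), (21, "BadCRC"),
    (22, "Handshk"), (23, "LinkSeq"), (24, "TrStaTrns"), (25, "UnrecFIS"),
    (26, "DevExch"),
]


def decode_serror(value):
    """Return the flag names set in an SError value, lowest bit first."""
    return [name for bit, name in SERROR_BITS if value & (1 << bit)]


for serr in (0x1000500, 0x1380500):
    print(hex(serr), decode_serror(serr))
```

The second value carries the three extra link-level flags (10B8B, Dispar, BadCRC) that the signature value for this bug does not.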
Comment 29 BunkoBugsy 2012-11-16 04:59:24 UTC
Ok, this all seems to make sense and hopefully will fix my very similar problem too.
The only problem is that I'm running a VT8237, and according to the last post at http://forum.sources.ru/index.php?showtopic=328955&hl= the config registers seem to be the opposite for VT8237. Is there any way you could release a similar patch for this chip too? I'm also seeing soft bus resets at high transfer rates or during dmraid resync (fixed by limiting the max resync speed to 50000). Or is there any other way to limit the max SATA transfer speed without a sata_via patch? This is what happens when you mix an old chipset with brand new WD drives. Thanks in advance.
Comment 30 BunkoBugsy 2012-11-16 21:38:01 UTC
*** Bug 50661 has been marked as a duplicate of this bug. ***
Comment 31 BunkoBugsy 2012-11-16 21:39:43 UTC
Ok, misread dmesg, sata_via actually running on 00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80) on Asus A7V880

I realize that kernel 2.6.32.7 won't ever make it to Centos 4.9 so I'll try to work this around by adding setpci -s 00:0f.0 52=4 to rc.local.
Comment 32 Matej Zary 2013-12-31 00:09:10 UTC
Hi there, it seems that there are 2 problems with the committed patch for this bug (commit 8b27ff4cf6d15964aa2987aeb58db4dfb1f87a19) on VT6421 IDE RAID Controller 

1. it causes major performance regression on disk transfer speed with SAMSUNG HD502IJ HDD 



without commit 8b27ff4cf6d15964aa2987aeb58db4dfb1f87a19:

hdparm -t --direct /dev/sdb

/dev/sdb:
 Timing O_DIRECT disk reads: 310 MB in  3.00 seconds = 103.28 MB/sec


with commit 8b27ff4cf6d15964aa2987aeb58db4dfb1f87a19:

hdparm -t --direct /dev/sdb

/dev/sdb:
 Timing O_DIRECT disk reads: 184 MB in  3.02 seconds =  60.83 MB/sec



2. suspend/resume cycle clears the PCI register value which had been set up by the patch (so it looks like the affected WD drives will start to behave badly again after resume; in my case the suspend/resume cycle cures the transfer speed regression)

lspci -xxx can be used to verify the register values before and after suspend/resume

suspend/resume cycle can be "emulated" with setpci -s 02:01.0 0x52.B=4 and setpci -s 02:01.0 0x52.B=0 commands (in my case)

lspci and smartctl in attachments
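
For checking the register before and after suspend/resume, the byte at offset 0x52 can be picked out of `lspci -xxx` text with a few lines of parsing. This is a sketch only: `read_config_byte` is a hypothetical helper, and the sample rows are copied from the VT6420 dump in comment 36 of this report rather than from the VT6421 discussed here.

```python
def read_config_byte(lspci_xxx_text, offset):
    """Extract one config-space byte from `lspci -xxx` hex-dump text.

    Dump rows look like "50: 00 00 04 00 ..." with 16 bytes per row.
    Returns the byte as an int, or None if the row is not present.
    """
    row_base = offset & ~0xF   # e.g. 0x52 -> row labelled "50"
    column = offset & 0xF      # e.g. 0x52 -> third byte in that row
    for line in lspci_xxx_text.splitlines():
        label, sep, rest = line.strip().partition(": ")
        if not sep:
            continue
        try:
            if int(label, 16) != row_base:
                continue
        except ValueError:
            continue           # device description line, not a dump row
        fields = rest.split()
        if len(fields) > column:
            return int(fields[column], 16)
    return None


# Two rows copied from the dump in comment 36 (VT6420 at 00:0f.0).
SAMPLE = """\
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
40: 13 03 f1 44 06 af 00 00 10 82 45 03 00 00 00 00
50: 00 00 04 00 00 00 04 04 00 10 10 00 05 00 20 00
"""

print(hex(read_config_byte(SAMPLE, 0x52)))
```

Running the same extraction on dumps taken before and after a suspend/resume cycle would show whether the value written by the patch survived.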
Comment 33 Matej Zary 2013-12-31 00:14:41 UTC
Created attachment 120341
lspci output
Comment 34 Matej Zary 2013-12-31 00:18:13 UTC
Created attachment 120351
smartctl drive output
Comment 35 Paul Fertser 2014-09-10 06:19:51 UTC
I can confirm the speed regression, it's almost 2x slower now. Please reopen the bug, this issue needs to be dealt with in a more elegant way. Testing Seagate ST1000DM003-1ER1 with an add-on VT6421A PCI card (Gembird SIDE-1), motherboard chipset is KT600, Linux version 3.0.0.

# /sbin/setpci -s 00:0c.0 0x52.B=4; for i in `seq 5`; do /sbin/hdparm -t --direct /dev/sda; done; /sbin/setpci -s 00:0c.0 0x52.B=0; for i in `seq 5`; do /sbin/hdparm -t --direct /dev/sda; done

/dev/sda:
 Timing O_DIRECT disk reads: 178 MB in  3.02 seconds =  59.00 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 190 MB in  3.02 seconds =  62.86 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 190 MB in  3.02 seconds =  63.01 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 190 MB in  3.02 seconds =  62.94 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 190 MB in  3.03 seconds =  62.79 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.49 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.61 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.45 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.45 MB/sec

/dev/sda:
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.67 MB/sec
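
To put a number on the slowdown, the hdparm figures above can be averaged per register setting. This is a rough sketch: the timing lines are copied from the output above, while `rates` and the variable names are hypothetical.

```python
import re
from statistics import mean

RATE_RE = re.compile(r"=\s*([\d.]+)\s*MB/sec")


def rates(hdparm_output):
    """Return every MB/sec figure found in hdparm -t output, in order."""
    return [float(m) for m in RATE_RE.findall(hdparm_output)]


# Timing lines copied from the runs above: first with 0x52=4, then with 0x52=0.
WITH_WORKAROUND = """
 Timing O_DIRECT disk reads: 178 MB in  3.02 seconds =  59.00 MB/sec
 Timing O_DIRECT disk reads: 190 MB in  3.02 seconds =  62.86 MB/sec
 Timing O_DIRECT disk reads: 190 MB in  3.02 seconds =  63.01 MB/sec
 Timing O_DIRECT disk reads: 190 MB in  3.02 seconds =  62.94 MB/sec
 Timing O_DIRECT disk reads: 190 MB in  3.03 seconds =  62.79 MB/sec
"""
WITHOUT_WORKAROUND = """
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.49 MB/sec
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.61 MB/sec
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.45 MB/sec
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.45 MB/sec
 Timing O_DIRECT disk reads: 348 MB in  3.01 seconds = 115.67 MB/sec
"""

slow = mean(rates(WITH_WORKAROUND))
fast = mean(rates(WITHOUT_WORKAROUND))
print(f"0x52=4: {slow:.2f} MB/sec, 0x52=0: {fast:.2f} MB/sec, ratio {fast / slow:.2f}x")
```

The ratio comes out a little under 2, consistent with the "almost 2x slower" description.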
Comment 36 Peter Cordes 2015-02-16 21:37:09 UTC
When reading two HDs at once, the lowered high-water-mark PCI register setting (which is applied by default) still isn't enough to prevent some kernel messages.  Is that fixable (maybe with another tunable)?

 I was going to set up an old Athlon XP2500+ on an A7V600 (w/ VT6420 RAID controller onboard) to test some stuff with grub / md before changing anything on the machine I normally use.

 HDs are 
* WDC WD10EADS-65L5B1 (1TB green power, 90MB/s sequential read)
* WD1600JD-00HBB0     (160GB, 57MB/s sequential read)


 I can
sudo dd if=/dev/sda2 of=/dev/null bs=1024k iflag=direct
 or same for the other drive, with no trouble.

 But if I dd from both drives at once (or from /dev/md/g2-root (RAID10, f2 layout, 64k chunk size)), then I get some SATA command error messages on the port of the faster HD (the WD10EADS).


(and btw, the 64k chunk size is to make sure the files GRUB needs aren't contiguous with the f2 layout.  I plan to use 512k for real.)

 Xubuntu's installer crashed most of the way into an install onto a RAID10,f2 partitioned md device, with segfaults in several commands that it ran after chrooting into the xfs mount that it copied files to.  (I tested my RAM and my USB stick, I don't think the corruption came from them.  Everything went fine when installing into a plain partition on the 1TB drive, not touching the md device.)

  I'll dd my partitions some more, and see if I get a crash or a change in the crc of either blockdev.  (Not sure the CPU is fast enough to md5sum both disks at full speed...)


 The system was totally idle when I ran the two dd processes on tty1 and tty2.  (The X server was running, but I was logged out.)  I was ssh'ed in in case the system locked up, like I saw happen once while dding an md device from the live CD.  So there were a few interrupts from the network card.

 I killed one of the dd processes very soon after seeing some errors.  When I tried again later (after it had already limited speed to "UDMA/100", (whatever that means for SATA...)), I still get link resets.

these are the error messages:

```
[ 1299.900044] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[ 1299.900072] ata3.00: BMDMA stat 0x5
[ 1299.900084] ata3.00: failed command: READ DMA EXT
[ 1299.900102] ata3.00: cmd 25/00:00:00:d0:03/00:04:00:00:00/e0 tag 0 dma 524288 in
         res 51/84:af:51:cc:03/84:03:00:00:00/e0 Emask 0x10 (ATA bus error)
[ 1299.900126] ata3.00: status: { DRDY ERR }
[ 1299.900136] ata3.00: error: { ICRC ABRT }
[ 1299.900155] ata3: soft resetting link
[ 1300.080192] ata3.00: configured for UDMA/133
[ 1300.080211] ata3: EH complete
[ 1304.552049] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[ 1304.552087] ata3.00: BMDMA stat 0x5
[ 1304.552100] ata3.00: failed command: READ DMA EXT
[ 1304.552117] ata3.00: cmd 25/00:00:00:48:0a/00:04:00:00:00/e0 tag 0 dma 524288 in
         res 51/84:cf:31:44:0a/84:03:00:00:00/e0 Emask 0x10 (ATA bus error)
[ 1304.552141] ata3.00: status: { DRDY ERR }
[ 1304.552151] ata3.00: error: { ICRC ABRT }
[ 1304.552170] ata3: soft resetting link
[ 1304.732180] ata3.00: configured for UDMA/133
[ 1304.732198] ata3: EH complete
[ 1304.784060] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[ 1304.784083] ata3.00: BMDMA stat 0x5
[ 1304.784094] ata3.00: failed command: READ DMA EXT
[ 1304.784111] ata3.00: cmd 25/00:00:00:58:0a/00:04:00:00:00/e0 tag 0 dma 524288 in
         res 51/84:af:51:54:0a/84:03:00:00:00/e0 Emask 0x10 (ATA bus error)
[ 1304.784134] ata3.00: status: { DRDY ERR }
[ 1304.784144] ata3.00: error: { ICRC ABRT }
[ 1304.784161] ata3: soft resetting link
[ 1304.964179] ata3.00: configured for UDMA/133
[ 1304.964195] ata3: EH complete
[ 1308.064044] ata3.00: limiting speed to UDMA/100:PIO4
[ 1308.064055] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[ 1308.064077] ata3.00: BMDMA stat 0x5
[ 1308.064090] ata3.00: failed command: READ DMA EXT
[ 1308.064107] ata3.00: cmd 25/00:00:00:d8:0e/00:04:00:00:00/e0 tag 0 dma 524288 in
         res 51/84:df:21:d4:0e/84:03:00:00:00/e0 Emask 0x10 (ATA bus error)
[ 1308.064130] ata3.00: status: { DRDY ERR }
[ 1308.064827] ata3.00: error: { ICRC ABRT }
[ 1308.065731] ata3: soft resetting link
[ 1308.244183] ata3.00: configured for UDMA/100
[ 1308.244204] ata3: EH complete
[11580.572067] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[11580.572754] ata3.00: BMDMA stat 0x5
[11580.573874] ata3.00: failed command: READ DMA
[11580.575236] ata3.00: cmd c8/00:00:00:07:1c/00:00:00:00:00/e0 tag 0 dma 131072 in
         res 51/84:cf:31:06:1c/84:03:00:00:00/e0 Emask 0x10 (ATA bus error)
[11580.578081] ata3.00: status: { DRDY ERR }
[11580.579536] ata3.00: error: { ICRC ABRT }
[11580.581006] ata3: soft resetting link
[11580.744228] ata3.00: configured for UDMA/100
[11580.744256] ata3: EH complete
```
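
The `res 51/84:...` lines in the log above start with the ATA status and error register values. A small decoding sketch follows; the bit names for the values seen in this report (status 0x51, error 0x84 and 0x40) can be checked against the `status:` and `error:` lines in the logs, while the other table entries are an assumption based on the standard ATA register layout. `decode_res` is a hypothetical helper.

```python
import re

# Mask -> name, highest bit first. DRDY, ERR, ICRC, ABRT and UNC appear in
# the logs in this report; the remaining names are an assumption.
STATUS_BITS = {0x80: "BUSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

RES_RE = re.compile(r"res ([0-9a-f]{2})/([0-9a-f]{2}):")


def decode_res(line):
    """Decode the status/error bytes from a libata 'res xx/yy:...' log line.

    Returns (status_names, error_names), or None if the line has no res field.
    """
    match = RES_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    status_names = [name for mask, name in STATUS_BITS.items() if status & mask]
    error_names = [name for mask, name in ERROR_BITS.items() if error & mask]
    return status_names, error_names


print(decode_res("res 51/84:af:51:cc:03/84:03:00:00:00/e0 Emask 0x10 (ATA bus error)"))
```

Every failure in this log decodes to ICRC plus ABRT, i.e. an interface CRC error followed by an aborted command, rather than a media error (UNC) on the disk itself.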


(I attached full lspci -nnv and dmesg output)

00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP] Host Bridge (rev 80)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237/VX700 PCI Bridge
00:09.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T [Marvell] (rev 12)
00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
00:0f.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:10.0 USB controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.1 USB controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.2 USB controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.3 USB controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 81)
00:10.4 USB controller: VIA Technologies, Inc. USB 2.0 (rev 86)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8237 ISA bridge [KT600/K8T800/K8T890 South]
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 60)
00:11.6 Communication controller: VIA Technologies, Inc. AC'97 Modem Controller (rev 80)
01:00.0 VGA compatible controller: NVIDIA Corporation NV44A [GeForce 6200] (rev a1)


$ lspci -xxx -s 00:0f.0

00:0f.0 RAID bus controller: VIA Technologies, Inc. VIA VT6420 SATA RAID Controller (rev 80)
00: 06 11 49 31 07 00 90 02 80 00 04 01 00 20 80 00
10: 01 d4 00 00 01 d0 00 00 01 b8 00 00 01 b4 00 00
20: 01 b0 00 00 01 a8 00 00 00 00 00 00 43 10 ed 80
30: 00 00 00 00 c0 00 00 00 00 00 00 00 00 02 00 00
40: 13 03 f1 44 06 af 00 00 10 82 45 03 00 00 00 00
50: 00 00 04 00 00 00 04 04 00 10 10 00 05 00 20 00
60: 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 01 00 01 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 46 36 00 10 46 36
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 80 02 49 31 43 10 ed 80 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


peter@gamma2:~$ cat /proc/interrupts 
           CPU0       
  0:         47   IO-APIC-edge      timer
  1:       7700   IO-APIC-edge      i8042
  8:          0   IO-APIC-edge      rtc0
  9:          0   IO-APIC-fasteoi   acpi
 14:          0   IO-APIC-edge      pata_via
 15:      14537   IO-APIC-edge      pata_via
 16:        608   IO-APIC-fasteoi   nouveau
 18:      45963   IO-APIC-fasteoi   eth0
 20:     393236   IO-APIC-fasteoi   sata_via
 21:      55160   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
 22:         54   IO-APIC-fasteoi   snd_via82xx
NMI:         59   Non-maskable interrupts
LOC:     516256   Local timer interrupts
SPU:          0   Spurious interrupts
PMI:         59   Performance monitoring interrupts
IWI:          0   IRQ work interrupts
RTR:          0   APIC ICR read retries
RES:          0   Rescheduling interrupts
CAL:          0   Function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
MCE:          0   Machine check exceptions
MCP:         50   Machine check polls
THR:          0   Hypervisor callback interrupts
ERR:          0
MIS:          0



update:
letting it run for a while, running crc32 in parallel on each disk
(fast disk going at 42MB/s, slow disk going at 52MB/s), I started to see some errors from the slower HD's port.

 And then my ssh session locked up.  And so did the ps/2 keyboard.  (not even alt+sysrq+b works).  I'm still seeing some messages scroll up the console, including some (typed by hand from the console of the wedged machine):

"usb 1-2: device descriptor read/64, error -110", and
"INFO: xfsaild/sda4:143 blocked for more than 120 seconds".  Oh, that's my root FS, so I guess the whole system goes to crap when / and the swap partitions are blocked.

There are:
"end_request: I/O error, dev sda sector 9578752"
"Buffer I/O error on device sda2 ..."
...

and
"end_request: I/O error, sdb, sector 5671568"


So I guess if I want to use this old machine for anything, it's going to have to be with only one SATA drive. :(

I had been hoping to maybe use it to replace the PIII-450MHz that's been my router / mail server for over 10 years. :P
Comment 37 Peter Cordes 2015-02-16 21:38:19 UTC
Created attachment 167181
dmesg from before the lockup
Comment 38 Peter Cordes 2015-02-16 21:39:06 UTC
Created attachment 167191
lspci -nnv on my A7V600 (KT600 chipset)