Bug 3418

Summary:	(sata promise) kernel freeze on command timeout with promise SATA150 tx4
Product:	IO/Storage	Reporter:	Arno Wagner (wagner)
Component:	Serial ATA	Assignee:	Jeff Garzik (jgarzik)
Status:	CLOSED PATCH_ALREADY_AVAILABLE
Severity:	normal	CC:	bunk
Priority:	P2
Hardware:	i386
OS:	Linux
Kernel Version:	2.6.9-rc2	Subsystem:
Regression:	---	Bisected commit-id:

Description Arno Wagner 2004-09-17 18:47:03 UTC

Distribution: Debian sarge, kernel from www.kernel.org
Hardware Environment: Dual Athlon MP 2800+, Tyan Thunder MP board (AMD-760 MP 
chipset), 2GB ECC memory, 2* SATA150 TX4, 8 * Maxtor  6Y250M0 on the tx4's,
3 more disks on the normal IDE, no PCI cards besides the tx4's 
Software Environment: Raid resync (the 8 sata disks are one raid5) is enough to 
get the crash.
Problem Description: After booting, the kernel becomes completely unresponsive 
within some minutes giving the error message "ata3: command timeout". Only a 
hard reset helps after this. The problem is also present with 2.6.7 (error 
message "ata3: command timeout") and 2.4.27 (error message "ata3: DMA timeout").
One note: the board has 66MHz 32bit and 33MHz 64bit PCI slots. The tx4's are 
currently in two of the 66MHz slots. Promise claims they support these.
Steps to reproduce: Set up the hardware, put a RAID5 on the SATA disks and let 
it resync. With 2.4.27 it crashes very fast, with 2.6.9-rc2 it may take some 
time and with 2.6.7 it even did a complete resync once and I could put 60GB of 
data on the disks before the kernel became unresponsive. This is subjective and 
I did not do exhaustive tests. On all three kernels the machine runs fine if I 
stop the resync and don't access the SATA disks. No problems accessing the other 
disks. No problems resyncing the RAID5 on the other three 200GB Maxtor disks. 
The machine itself has run fine for about a year now, only problem is SATA which 
I added recently (obviously being a bit too optimistic...). I am willing to run 
tests as long as a RAID5 on the on-board IDE is not in danger (I would also run 
tests with these disks removed/backed up first, but that needs preparation). I 
have serial capture of console output and remote reset capability, so crashes in 
tests are not a problem. I also have only a few students working on the machine at 
the moment.

Comment 1 Jeff Garzik 2004-09-17 19:18:17 UTC

Is your BIOS up to date?

Comment 2 Arno Wagner 2004-09-20 13:08:41 UTC

BIOS was not up to date ("never change a working system"), but is now. Since 
this also caused LILO to stop working (somehow it has problems with the 8 or 11 
BIOS disks that are now there or there is a bug in the BIOS), I am currently 
switching to GRUB (which I have never used before). Then I will re-run the 
tests. Might take a few days.

Some additional info: The 32bit PCI slots on this mainboard are 33MHz and not 
66MHz, at least according to Tyan's website and the chipset datasheet. Seems 
there is an error in the manual. The 64bit slots can be set to 33MHz or 66MHz 
and are on a separate PCI bus.

Comment 3 Arno Wagner 2004-09-21 10:58:39 UTC

Here are my new test results. Mainboard BIOS is v1.08 (current), 
old one was v1.04. 

Kernel 2.6.9-rc2:
   Crash while sata-RAID5 resync after less than 5 minutes.
   Second try: Crash in less than a minute.
   Message on console:  "ata3: command timeout"

Kernel 2.6.7:
   Crash while sata-RAID5 resync after less than 1 minute.
   Second try: Crash even before ssh login was possible.   
   Message on console:  "ata3: DMA timeout"
   
Kernel 2.4.27:
   Crash while sata-RAID5 resync in less than 1 minute.             
   Second try: Same as above. 
   Message on console: "ata3: DMA timeout"

Comment 4 Arno Wagner 2004-09-21 15:45:36 UTC

I had a look into the kernel code. The error messages are
from a function called "pdc_eng_timeout" in sata_promise.c.
(Verified by changing the messages). The message is different 
in 2.6.7, because in 2.6.7 there are other cases present.

The kernel then calls 
    ata_qc_complete(qc, ata_wait_idle(ap) | ATA_ERR);
This call never returns and the machine becomes unresponsive.

Comment 5 Arno Wagner 2004-09-28 08:54:12 UTC

I have done a lot more tests with subsets of the disks. 
After some time it became apparent that all crashing RAID5
sets did contain /dev/sdc1. It took some time to see, 
because some configurations including this disk did work.
I also never got any crashes with RAID1 sets involving
/dev/sdc1.
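
The narrowing-down described here amounts to intersecting the member lists of all crashing configurations. A minimal sketch of that bookkeeping follows. Only /dev/sdc1 is taken from the report; the other device names and the pass/fail outcomes are made-up placeholders, not actual test results.

```python
# Hypothetical test log: (RAID members, did the kernel freeze?).
# Only /dev/sdc1 comes from the report; everything else is a placeholder.
runs = [
    ({"/dev/sda1", "/dev/sdb1", "/dev/sdc1"}, True),
    ({"/dev/sdc1", "/dev/sdd1", "/dev/sde1"}, True),
    ({"/dev/sda1", "/dev/sdd1", "/dev/sde1"}, False),
    ({"/dev/sdb1", "/dev/sdc1", "/dev/sdd1"}, False),  # suspect present, no crash
]


def common_to_all_crashes(runs):
    """Return the disks that appear in every crashing configuration."""
    crashing = [members for members, crashed in runs if crashed]
    if not crashing:
        return set()
    return set.intersection(*crashing)


suspects = common_to_all_crashes(runs)
print(sorted(suspects))
```

A suspect is not cleared just because it also shows up in a passing run: an intermittent fault only has to be present in every crash, which is why the pattern took a while to see.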

I have tests running now with a spare disk for /dev/sdc 
and I will test the possibly problematic disk in a 
separate set-up. I will post the results when available.

Still, I believe the kernel should _not_ freeze on a disk 
problem like this, correct? 
  
Note: If this was indeed the problem, I would need to 
know soon if I should run any specific tests.

Comment 6 Arno Wagner 2004-10-02 05:28:11 UTC

After a lot more tests, including use of a milliohm meter, exchanging all
cables, opening up one of the SATA enclosures I use and a lot more, it 
became obvious that neither the disk, nor the mainboard, the kernel or 
the power to the disk were at fault. It seems it is a misalignment issue 
in data lines in the SATA enclosures. These are made by Chiftec 
(here: http://tinyurl.com/5trvx, also available in white. I bought them
unbranded, but the specs and picture match.) 

The third disk from the top has the longest signal lines on the PCB. 
I had all kernel freezes (except one) on a disk in this place. 
Furthermore the lines seem to be just enough out of spec that 
it does work with an SATA cable slightly out of alignment in the 
opposite way (I assume). It does not work (i.e. kernel freeze 
in some minutes or less) with other cables, that are presumably 
slightly out of alignment in the same way as the SATA enclosure.

The SATA standard is pretty strict on this, they allow max. 1 ps skew 
between a signal pair in the SATA data connectors (and the enclosure 
by extension). This is a sub-millimeter misalignment. (Table 6.3.9.2 
and Note 1 in the 1.0a SATA specification.) I can well imagine this 
being not met with longer signal lines and a medium accuracy PCB 
manufacturing process.
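
To put the skew figure into physical terms, the sketch below converts a timing skew into the trace-length mismatch that would produce it. The propagation speed is an assumption: signals on a typical FR-4 PCB travel at very roughly half the speed of light, and the exact factor depends on the board material and trace geometry. The function and constant names are illustrative only.

```python
# Convert a timing skew into the trace-length mismatch that produces it.
C_VACUUM_M_PER_S = 299_792_458  # speed of light in vacuum
VELOCITY_FACTOR = 0.5           # assumption: rough value for FR-4 PCB traces


def skew_to_length_mm(skew_seconds, velocity_factor=VELOCITY_FACTOR):
    """Length difference (mm) between two traces that yields the given skew."""
    speed = C_VACUUM_M_PER_S * velocity_factor  # m/s along the trace
    return speed * skew_seconds * 1000.0        # metres -> millimetres


mismatch_mm = skew_to_length_mm(1e-12)  # the 1 ps figure quoted above
print(f"{mismatch_mm:.2f} mm")
```

Under that assumption, 1 ps corresponds to about 0.15 mm of length difference within the pair, which is consistent with the sub-millimeter estimate.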

The final test that convinced me was when I took a disk that had reliably 
caused a crash within 1 minute and pulled it from the enclosure. With the 
disk attached directly to the same SATA cable on the same controller port, 
the problem just vanished. 

Still, it would be nice if the kernel would not freeze on this issue,
but did some graceful degradation, like marking the disk as unusable.
I guess that something like this will be needed for SATA hotplug 
anyways, to prevent machines freezing when a disk is just pulled out
or not quite properly inserted, so I will stop complaining about this
now.
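
The kind of graceful degradation asked for here can be pictured with a toy model: count consecutive command timeouts per disk and, past a threshold, mark the disk failed so that later requests are rejected immediately instead of blocking. This is only an illustration of the idea. The class, method names and the threshold are hypothetical and do not correspond to libata or md code.

```python
# Toy model of graceful degradation: after too many consecutive command
# timeouts a disk is marked failed and further I/O is rejected right away.
# All names are hypothetical; this is not libata or md code.

class Disk:
    def __init__(self, name, max_timeouts=3):
        self.name = name
        self.max_timeouts = max_timeouts  # assumed threshold
        self.timeouts = 0
        self.failed = False

    def submit(self, command_ok):
        """Model one command; command_ok=False stands for a command timeout."""
        if self.failed:
            return "rejected"      # fail fast, never block the caller
        if command_ok:
            self.timeouts = 0      # a success resets the error streak
            return "ok"
        self.timeouts += 1
        if self.timeouts >= self.max_timeouts:
            self.failed = True     # take the disk out of service
            return "failed"
        return "timeout"


disk = Disk("sdc")
results = [disk.submit(ok) for ok in (True, False, False, False, True)]
print(results)
```

In a RAID5 context the "failed" state would correspond to the array continuing in degraded mode on the remaining disks, rather than the whole machine freezing.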

Comment 7 Alan 2007-06-05 08:36:05 UTC

Closing as libata now has real error handling and will bump disks out of arrays
etc., provided the controller firmware doesn't commit suicide handling the problem.