Bug 3418
Summary: | (sata promise) kernel freeze on command timeout with promise SATA150 tx4 | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Arno Wagner (wagner) |
Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
Status: | CLOSED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | bunk |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.9-rc2 | Subsystem: | |
Regression: | --- | Bisected commit-id: |
Description
Arno Wagner
2004-09-17 18:47:03 UTC
Is your BIOS up to date?

BIOS was not up to date ("never change a working system"), but is now. Since this also caused LILO to stop working (either it has problems with the 8 or 11 BIOS disks that are now there, or there is a bug in the BIOS), I am currently switching to GRUB (which I have never used before). Then I will re-run the tests. Might take a few days.

Some additional info: The 32-bit PCI slots on this mainboard are 33MHz and not 66MHz, at least according to Tyan's website and the chipset datasheet. Seems there is an error in the manual. The 64-bit slots can be set to 33MHz or 66MHz and are on a separate PCI bus.

Here are my new test results. Mainboard BIOS is v1.08 (current), old one was v1.04.

Kernel 2.6.9-rc2: Crash during SATA RAID5 resync after less than 5 minutes. Second try: Crash in less than a minute. Message on console: "ata3: command timeout"

Kernel 2.6.7: Crash during SATA RAID5 resync after less than 1 minute. Second try: Crash even before ssh login was possible. Message on console: "ata3: DMA timeout"

Kernel 2.4.27: Crash during SATA RAID5 resync in less than 1 minute. Second try: Same as above. Message on console: "ata3: DMA timeout"

I had a look into the kernel code. The error messages are from a function called "pdc_eng_timeout" in sata_promise.c. (Verified by changing the messages.) The message is different in 2.6.7, because in 2.6.7 there are other cases present. The kernel then calls ata_qc_complete(qc, ata_wait_idle(ap) | ATA_ERR); This call never returns and the machine becomes unresponsive.

I have done a lot more tests with subsets of the disks. After some time it became apparent that all crashing RAID5 sets contained /dev/sdc1. It took some time to see, because some configurations including this disk did work. I also never got any crashes with RAID1 sets involving /dev/sdc1. I have tests running now with a spare disk for /dev/sdc and I will test the possibly problematic disk in a separate set-up.
I will post the results when available. Still, I believe the kernel should _not_ freeze on a disk problem like this, correct? Note: If this was indeed the problem, I would need to know soon if I should run any specific tests.

After a lot more tests, including use of a milli-ohm meter, exchanging all cables, opening up one of the SATA enclosures I use and a lot more, it became obvious that neither the disk, nor the mainboard, the kernel or the power to the disk were at fault. It seems it is a misalignment issue in the data lines in the SATA enclosures. These are made by Chieftec (here: http://tinyurl.com/5trvx, also available in white. I bought them unbranded, but the specs and picture match.) The third disk from the top has the longest signal lines on the PCB. I had all kernel freezes (except one) on a disk in this place.

Furthermore, the lines seem to be just enough out of spec that it does work with an SATA cable slightly out of alignment in the opposite way (I assume). It does not work (i.e. kernel freeze in some minutes or less) with other cables that are presumably slightly out of alignment in the same way as the SATA enclosure. The SATA standard is pretty strict on this: it allows max. 1 ps skew between a signal pair in the SATA data connectors (and the enclosure by extension). This is a sub-millimeter misalignment. (Table 6.3.9.2 and Note 1 in the 1.0a SATA specification.) I can well imagine this not being met with longer signal lines and a medium-accuracy PCB manufacturing process.

The final test that convinced me was when I pulled from the enclosure a disk that had previously caused a reliable crash within 1 minute. Attached directly to the same SATA cable on the same controller port, the problem just vanished.

Still, it would be nice if the kernel did not freeze on this issue, but did some graceful degradation, like marking the disk as unusable.
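The report does not establish why the call in the timeout path never returns, so the following is only an illustration of the graceful degradation asked for above, not the libata code. It is a minimal userspace sketch in which any wait for the device is bounded and the port is taken offline when the bound is exceeded. All names (`fake_port`, `read_status`, `wait_idle_bounded`, `handle_timeout`, `NEVER_IDLE`) are hypothetical; only the BSY bit value (0x80) is taken from the ATA status register definition.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_BUSY 0x80u        /* BSY bit of the ATA status register */
#define NEVER_IDLE  UINT_MAX     /* hypothetical marker: device stays busy forever */

/* Hypothetical stand-in for a controller port; not libata's struct ata_port. */
struct fake_port {
    uint8_t  status;             /* simulated status register */
    unsigned polls_until_idle;   /* reads after which the device clears BSY */
    bool     disabled;           /* set when the port is taken offline */
};

/* Simulated register read: clears BSY once the countdown reaches zero. */
static uint8_t read_status(struct fake_port *p)
{
    if (p->polls_until_idle == 0)
        p->status = (uint8_t)(p->status & ~STATUS_BUSY);
    else if (p->polls_until_idle != NEVER_IDLE)
        p->polls_until_idle--;
    return p->status;
}

/* Poll at most max_polls times. Returns true if the device went idle. */
static bool wait_idle_bounded(struct fake_port *p, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (!(read_status(p) & STATUS_BUSY))
            return true;
    }
    return false;
}

/*
 * Timeout handler that cannot block indefinitely.
 * Returns 0 if the device recovered, -1 if the port was marked unusable.
 */
static int handle_timeout(struct fake_port *p, unsigned max_polls)
{
    if (wait_idle_bounded(p, max_polls))
        return 0;
    p->disabled = true;          /* degrade: caller can fail the disk out of the array */
    return -1;
}
```

A caller that sees -1 would fail outstanding requests and let the RAID layer drop the disk. A real driver would also need to reset the controller and sleep between polls, which this sketch leaves out.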
I guess that something like this will be needed for SATA hotplug anyways, to prevent machines freezing when a disk is just pulled out or not quite properly inserted, so I will stop complaining about this now.

Closing as libata now has real error handling and will bump disks out of arrays etc providing the controller firmware doesn't commit suicide handling the problem.
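For scale, the 1 ps skew limit quoted in the report can be converted into a trace length mismatch. The sketch below takes that figure at face value and assumes a propagation velocity of about half the speed of light, a common rule of thumb for FR-4 boards. The velocity factor is an assumption and does not come from the report; the function name `skew_to_length_mm` is hypothetical.

```c
#include <assert.h>

#define SPEED_OF_LIGHT_M_PER_S 299792458.0

/*
 * Convert a timing skew in picoseconds into the equivalent length
 * mismatch in millimeters. velocity_factor is the assumed signal
 * speed on the trace as a fraction of the speed of light.
 */
static double skew_to_length_mm(double skew_ps, double velocity_factor)
{
    assert(velocity_factor > 0.0 && velocity_factor <= 1.0);
    double seconds = skew_ps * 1e-12;
    double meters = seconds * SPEED_OF_LIGHT_M_PER_S * velocity_factor;
    return meters * 1000.0;
}
```

With a velocity factor of 0.5, `skew_to_length_mm(1.0, 0.5)` comes out at roughly 0.15 mm, which is consistent with the sub-millimeter estimate in the report.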