Bug 5905 - status=0x50 errors with SATA drives
Status: REJECTED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: i386 Linux
Importance: P2 high
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-01-17 02:13 UTC by Haakon Riiser
Modified: 2007-07-09 01:26 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.15
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Kernel config (28.27 KB, text/plain)
2006-01-17 02:42 UTC, Haakon Riiser

Description Haakon Riiser 2006-01-17 02:13:23 UTC
============================================================
Most recent kernel where this bug did not occur
============================================================
My previous kernel version was a 2.6.12 kernel that came with
Fedora Core 3.  Fedora's 2.6.12 did not have this particular
bug, but it had a more serious problem with my SATA drives
that frequently crashed my system.  This is why I'm now using a
vanilla 2.6.15, despite the new error (which does not appear to
affect system stability).  Unless it's absolutely necessary, I'd
prefer not to experiment too much with different kernel versions
to isolate the bug to a specific version.  The affected system is
my main file server, and making it unstable with an older kernel
that panics on heavy disk I/O is not very desirable.


============================================================
Distribution
============================================================
Fedora Core 3


============================================================
Hardware Environment
============================================================
Motherboard: Asus K8V SE Deluxe (VIA K8T800/VT8237)
CPU: Sempron 3000+
PATA disk: Maxtor 6Y160P0 (boot drive that contains OS)
SATA disks: 4 x Seagate Barracuda 7200.8 SATA NCQ - ST3250823AS (mdadm RAID-5)
SATA controller: Promise SATA-II 150 TX4 PCI
Memory: 2 x 256 MB TwinMOS CL2.5 DDR400
Graphics card: Matrox MGA G400 AGP
Network card: Marvell Gigabit chip (integrated in the motherboard)


============================================================
Software Environment
============================================================
Fully updated Fedora Core 3 system running with an untainted
vanilla kernel.


============================================================
Problem Description
============================================================
No noticeable problem, except for errors in the kernel log:

   ataN: status=0x50 { DriveReady SeekComplete }
   sdX: Current: sense key: No Sense
       Additional sense: No additional sense information
   ataN: status=0x50 { DriveReady SeekComplete }
   sdX: Current: sense key=0x0
       ASC=0x0 ASCQ=0x0

The 'status=0x50 { DriveReady SeekComplete }' line is always
present, but the next two lines are usually

   sdX: Current: sense key: No Sense
       Additional sense: No additional sense information

but sometimes I also see

   sdX: Current: sense key=0x0
       ASC=0x0 ASCQ=0x0
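
For readers unfamiliar with the status byte, the following is a minimal sketch of how the value in these log lines decomposes into named bits. It assumes the standard ATA status register layout; the bit labels are chosen to match the ones printed in the log, and the function names are hypothetical, not kernel code. It shows that 0x50 has only DriveReady and SeekComplete set and that the Error bit (0x01) is clear, which is consistent with the "No Sense" lines that follow.

```python
# Sketch only: bit masks follow the standard ATA status register layout.
# Labels mirror those seen in the kernel log; helper names are hypothetical.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(status):
    """Return the names of the bits set in an 8-bit ATA status value."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


def has_error_bit(status):
    """True if the Error bit (0x01) is set in the status value."""
    return bool(status & 0x01)


if __name__ == "__main__":
    print(decode_ata_status(0x50))  # ['DriveReady', 'SeekComplete']
    print(has_error_bit(0x50))      # False
```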


============================================================
Steps to reproduce
============================================================
Do I/O for a few hours, and it will usually trigger the problem.
Comment 1 Haakon Riiser 2006-01-17 02:42:14 UTC
Created attachment 7045
Kernel config

This is the .config file for my current kernel, a vanilla 2.6.15.1.
There have been no changes in the bug's behavior since 2.6.15.
Comment 2 Haakon Riiser 2006-01-18 02:30:29 UTC
Looks like this problem is more serious than I thought.  I just found out
(using MD5 sums) that I've had several write errors lately.  It appears that
errors only occur during write operations, and the number of failed MD5 sums
seems to closely match the number of the following error messages in my log:
 
   ataN: status=0x50 { DriveReady SeekComplete }
   sdX: Current: sense key=0x0
       ASC=0x0 ASCQ=0x0

While I don't understand what the above error means, it sure looks like every
one of these corresponds with a write error. :-(
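
The comment does not say how the MD5 sums were compared, so the following is only a sketch of one way to do such a check: hash each source file and its copy, then list the pairs that differ. The function names and the (source, copy) pairing are assumptions for illustration. Note that a copy read back shortly after it was written may be served from the page cache rather than from disk.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(pairs):
    """Given (source_path, copy_path) pairs, return those whose MD5 sums differ."""
    return [(src, dst) for src, dst in pairs
            if md5_of_file(src) != md5_of_file(dst)]
```

The number of pairs returned by `find_mismatches` is the figure that would be compared against the count of status=0x50 messages in the log.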
Comment 3 Haakon Riiser 2006-01-19 02:48:49 UTC
To find out whether or not this error message coincided with
the write errors, I created a program that continuously wrote
a 4096-byte block every 0.25 seconds.  Each block contains a
32-bit timestamp, a 32-bit CRC32 checksum, and 4088 bytes
of pseudo-random data.  I let this program run overnight, and
the next morning, I checked for new errors in the kernel log.
I had hoped that this program would show that the CRC errors,
if any, would coincide in time with the error messages in the
kernel log.  Sure enough, I found two new warnings in the log,
but when I CRC-checked each block of pseudo-random data, I
found no errors.  It could be that some other write operation
caused the kernel error, and that it wrote some corrupted data,
but I can't be sure.  I'll run the test some more times, and
then we'll see.
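
The original test program is not attached, so the following is a minimal sketch of the block format and the check described above, not the reporter's code. Several details are assumptions: the header is little-endian, the CRC32 covers only the 4088-byte payload, the payload comes from `os.urandom`, and each block is flushed with `fsync`. All function names are hypothetical.

```python
import os
import struct
import time
import zlib

BLOCK_SIZE = 4096
HEADER = struct.Struct("<II")            # 32-bit timestamp, 32-bit CRC32 (assumed little-endian)
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size  # 4088 bytes


def make_block(timestamp=None):
    """Build one 4096-byte block: timestamp, CRC32 of payload, random payload."""
    if timestamp is None:
        timestamp = int(time.time())
    payload = os.urandom(PAYLOAD_SIZE)
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(timestamp & 0xFFFFFFFF, crc) + payload


def check_block(block):
    """Return (timestamp, ok), where ok says whether the stored CRC matches."""
    timestamp, crc = HEADER.unpack_from(block)
    payload = block[HEADER.size:]
    return timestamp, (zlib.crc32(payload) & 0xFFFFFFFF) == crc


def write_blocks(path, count, interval=0.25):
    """Append `count` blocks to `path`, one every `interval` seconds."""
    with open(path, "ab") as f:
        for _ in range(count):
            f.write(make_block())
            f.flush()
            os.fsync(f.fileno())  # assumption: force each block out to the device
            time.sleep(interval)


def verify_file(path):
    """Return the timestamps of all blocks whose CRC does not match."""
    bad = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if len(block) < BLOCK_SIZE:
                break
            timestamp, ok = check_block(block)
            if not ok:
                bad.append(timestamp)
    return bad
```

The timestamps returned by `verify_file` are what would be lined up against the times of the status=0x50 messages in the kernel log.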

Btw, it might be worth mentioning that the failed write operations
I mentioned earlier were all network transfers, unlike my test
program that both generates and writes data locally.
Comment 4 Haakon Riiser 2006-01-25 04:28:10 UTC
Well, this bug is trickier than I thought.  I discovered today that
the original crash bug I thought was fixed in my vanilla 2.6.15 kernel
still exists.  The only difference from the Fedora kernel is that
now, it doesn't print anything before it dies.  I even tried with
a RS232 serial console, and still nothing.  As before, once the
first crash occurred, the mdadm RAID-5 array started resyncing,
and this causes the crash to happen MUCH more frequently.  It always
crashes a few minutes into the resync operation, causing it to never
finish.  The only way out of this mess is to downgrade to my previous
motherboard/CPU/memory.
Comment 5 Haakon Riiser 2006-01-25 04:33:30 UTC
I also have some more info on the write errors that frequently occur.
I cannot reproduce this with the test program I mentioned earlier,
but it happens all the time during large rsync transfers (RSYNC_RSH=ssh).
Also, I noticed that I have gotten write errors when there was nothing
new in the log, so apparently, the status=0x50 errors are not directly
linked to write errors.  It looks more like they are related to the crash
bug, since these status errors aren't very common, but now that the system
crashes within 10 minutes (because of the never finishing RAID resync), I
see these errors more often.
Comment 6 Haakon Riiser 2006-01-25 09:48:06 UTC
This will probably be my last comment, as I have finally had
enough.  I've gone back to my trusty old P3 650 MHz (good enough
for a simple file server) that works.  Only the CPU/motherboard/RAM
have changed, so it's probably a race condition in the kernel that
causes all of these problems.  I did have time to test the other
hardware for errors using memtest86 v3.2, and I found nothing
wrong.  Everything points to a software bug.  I even measured
the voltages on my PSU to see if it was overloaded, but found no
problems there either.  The same PSU works perfectly on the older
setup, and the load is about the same, so that's no surprise.

I'd like to upgrade my system eventually though, so if anyone
reading this knows of a hardware configuration that is
proven to work with a software RAID-5 array of SATA disks,
I'd love to hear about it.
Comment 7 Diego Calleja 2006-08-05 07:36:52 UTC
2.6.18-rc3 may work better, as it improves libata in many areas like error
handling etc.
Comment 8 Natalie Protasevich 2007-07-08 16:48:57 UTC
Haakon,
Did you get a chance to test with the latest kernels? Is the problem still there?
Thanks.
Comment 9 Haakon Riiser 2007-07-09 01:26:08 UTC
No, I got a new motherboard for the file server, and this was
never a problem with the new hardware, even though I still use
the same SATA controller.  The old motherboard is now used in a
desktop box that only has one hard disk, and the problem only
manifested itself in a RAID configuration, so I don't know what
happened.

Since I won't be able to produce any more information, this bug
might as well be closed until someone else encounters a similar
problem.
