Bug 5905
| Summary: | status=0x50 errors with SATA drives | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Haakon Riiser (bugzilla) |
| Component: | Serial ATA | Assignee: | Jeff Garzik (jgarzik) |
| Status: | REJECTED INVALID | | |
| Severity: | high | CC: | bunk, diegocg, protasnb |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.15 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
| Attachments: | Kernel config | | |
Description
Haakon Riiser
2006-01-17 02:13:23 UTC
Created attachment 7045
Kernel config
This is the .config file for my current kernel, a vanilla 2.6.15.1.
There have been no changes in the bug's behavior since 2.6.15.
Looks like this problem is more serious than I thought. I just found out (using MD5 sums) that I've had several write errors lately. It appears that errors only occur during write operations, and the number of failed MD5 sums seems to closely match the number of the following error messages in my log:

    ataN: status=0x50 { DriveReady SeekComplete }
    sdX: Current: sense key=0x0 ASC=0x0 ASCQ=0x0

While I don't understand what the above error means, it sure looks like every one of these corresponds with a write error. :-(

To find out whether or not this error message coincided with the write errors, I created a program that continuously wrote a 4096-byte block every 0.25 seconds. Each block contains a 32-bit timestamp, a 32-bit CRC32 checksum, and 4088 bytes of pseudo-random data. I let this program run overnight, and the next morning, I checked for new errors in the kernel log. I had hoped that this program would show that the CRC errors, if any, would coincide in time with the error messages in the kernel log. Sure enough, I found two new warnings in the log, but when I CRC-checked each block of pseudo-random data, I found no errors. It could be that some other write operation caused the kernel error, and that it wrote some corrupted data, but I can't be sure. I'll run the test some more times, and then we'll see.

Btw, it might be worth mentioning that the failed write operations I mentioned earlier were all network transfers, unlike my test program, which both generates and writes data locally.

Well, this bug is trickier than I thought. I discovered today that the original crash bug I thought was fixed in my vanilla 2.6.15 kernel still exists. The only difference from the Fedora kernel is that now, it doesn't print anything before it dies. I even tried with an RS232 serial console, and still nothing. As before, once the first crash occurred, the mdadm RAID-5 array started resyncing, and this causes the crash to happen MUCH more frequently.
It always crashes a few minutes into the resync operation, so the resync never finishes. The only way out of this mess is to downgrade to my previous motherboard/CPU/memory.

I also have some more info on the write errors that frequently occur. I cannot reproduce them with the test program I mentioned earlier, but they happen all the time during large rsync transfers (RSYNC_RSH=ssh). Also, I noticed that I have gotten write errors when there was nothing new in the log, so apparently, the status=0x50 errors are not directly linked to write errors. It looks more like they are related to the crash bug, since these status errors aren't very common, but now that the system crashes within 10 minutes (because of the never-finishing RAID resync), I see these errors more often.

This will probably be my last comment, as I have finally had enough. I've gone back to my trusty old P3 650 MHz (good enough for a simple file server) that works. Only the CPU/motherboard/RAM have changed, so it's probably a race condition in the kernel that causes all of these problems. I did have time to test the other hardware for errors using memtest86 v3.2, and I found nothing wrong. Everything points to a software bug. I even measured the voltages on my PSU to see if it was overloaded, but found no problems there either. The same PSU works perfectly on the older setup, and the load is about the same, so that's no surprise.

I'd like to upgrade my system eventually though, so if anyone reading this knows of a hardware configuration that is proven to work with a software RAID-5 array of SATA disks, I'd love to hear about it.

2.6.18-rc3 may work better, as it improves libata in many areas, such as error handling.

Haakon, did you get a chance to test with the latest kernels? Is the problem still there? Thanks.

No, I got a new motherboard for the file server, and this was never a problem with the new hardware, even though I still use the same SATA controller.
The old motherboard is now used in a desktop box that only has one hard disk, and the problem only manifested itself in a RAID configuration, so I don't know what happened. Since I won't be able to produce any more information, this bug might as well be closed until someone else encounters a similar problem.
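For anyone who wants to repeat the block-write test described earlier in this report, here is a minimal Python sketch of it. The original test program was never attached, so this is not the reporter's code. Only the block layout (4096 bytes: 32-bit timestamp, 32-bit CRC32, 4088 bytes of random data) and the 0.25-second interval come from the report. The little-endian field order, the CRC covering only the payload, the use of `os.urandom` as the data source, the `fsync` after each block, and all function names are assumptions made for illustration.

```python
import os
import struct
import time
import zlib

BLOCK_SIZE = 4096
# Assumed layout: little-endian 32-bit timestamp followed by 32-bit CRC32.
HEADER = struct.Struct("<II")
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size  # 4088 bytes, as in the report


def make_block(timestamp, payload=None):
    """Build one 4096-byte block; the CRC32 covers the payload only (assumption)."""
    if payload is None:
        payload = os.urandom(PAYLOAD_SIZE)  # stand-in for the pseudo-random data
    if len(payload) != PAYLOAD_SIZE:
        raise ValueError("payload must be exactly %d bytes" % PAYLOAD_SIZE)
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(timestamp & 0xFFFFFFFF, crc) + payload


def check_block(block):
    """Return (timestamp, ok), where ok is True if the stored CRC matches the payload."""
    timestamp, stored_crc = HEADER.unpack_from(block)
    payload = block[HEADER.size:]
    return timestamp, (zlib.crc32(payload) & 0xFFFFFFFF) == stored_crc


def write_blocks(path, count, interval=0.25):
    """Append `count` blocks to `path`, pausing `interval` seconds between them."""
    with open(path, "ab") as f:
        for _ in range(count):
            f.write(make_block(int(time.time())))
            f.flush()
            os.fsync(f.fileno())  # assumption: push each block out to the disk
            time.sleep(interval)


def find_bad_blocks(path):
    """Return a list of (block_index, timestamp) for blocks that fail the CRC check.

    A truncated final block is reported with a timestamp of None.
    """
    bad = []
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            if len(block) < BLOCK_SIZE:
                bad.append((index, None))
                break
            timestamp, ok = check_block(block)
            if not ok:
                bad.append((index, timestamp))
            index += 1
    return bad
```

A run would call `write_blocks()` with a file on the RAID array (the path is up to the tester), then call `find_bad_blocks()` on the same file and compare the timestamps of any failing blocks with the times of the status=0x50 messages in the kernel log. Because the data is read back on the same machine, the page cache may serve the reads; dropping caches or remounting before verification is one way to make sure the blocks actually come from the disks.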