============================================================
Most recent kernel where this bug did not occur
============================================================
My previous kernel version was a 2.6.12 kernel that came with Fedora Core 3. Fedora's 2.6.12 did not have this particular bug, but it had a more serious problem with my SATA drives that frequently crashed my system. This is why I'm now using a vanilla 2.6.15, despite the new error (which does not appear to affect system stability).

Unless it's absolutely necessary, I'd prefer not to experiment too much with different kernel versions to isolate the bug to a specific version. The affected system is my main file server, and making it unstable with an older kernel that panics on heavy disk I/O is not very desirable.

============================================================
Distribution
============================================================
Fedora Core 3

============================================================
Hardware Environment
============================================================
Motherboard: Asus K8V SE Deluxe (VIA K8T800/VT8237)
CPU: Sempron 3000+
PATA disk: Maxtor 6Y160P0 (boot drive that contains OS)
SATA disks: 4 x Seagate Barracuda 7200.8 SATA NCQ - ST3250823AS (mdadm RAID-5)
SATA controller: Promise SATA-II 150 TX4 PCI
Memory: 2 x 256 MB TwinMOS CL2.5 DDR400
Graphics card: Matrox MGA G400 AGP
Network card: Marvell Gigabit chip (integrated in the motherboard)

============================================================
Software Environment
============================================================
Fully updated Fedora Core 3 system running with an untainted vanilla kernel.
============================================================
Problem Description
============================================================
No noticeable problem, except for errors in the kernel log:

ataN: status=0x50 { DriveReady SeekComplete }
sdX: Current: sense key: No Sense
Additional sense: No additional sense information
ataN: status=0x50 { DriveReady SeekComplete }
sdX: Current: sense key=0x0
ASC=0x0 ASCQ=0x0

The 'status=0x50 { DriveReady SeekComplete }' line is always present, but the next two lines are usually

sdX: Current: sense key: No Sense
Additional sense: No additional sense information

but sometimes I also see

sdX: Current: sense key=0x0
ASC=0x0 ASCQ=0x0

============================================================
Steps to reproduce
============================================================
Do I/O for a few hours, and it will usually trigger the problem.
Created attachment 7045
Kernel config

This is the .config file for my current kernel, a vanilla 2.6.15.1. There have been no changes in the bug's behavior since 2.6.15.
Looks like this problem is more serious than I thought. I just found out (using MD5 sums) that I've had several write errors lately. It appears that errors only occur during write operations, and the number of failed MD5 sums seems to closely match the number of the following error messages in my log:

ataN: status=0x50 { DriveReady SeekComplete }
sdX: Current: sense key=0x0
ASC=0x0 ASCQ=0x0

While I don't understand what the above error means, it sure looks like every one of these corresponds to a write error. :-(
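For anyone who wants to run a similar check, here is a minimal sketch in Python of comparing MD5 sums between a source tree and its copy on the array. It is not the procedure actually used above, which isn't described in detail. The names `md5_of_file` and `find_mismatches` are made up for illustration, and it assumes both trees can be read from the same machine.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_root, dest_root):
    """Return relative paths whose copy under dest_root is missing or differs.

    Assumption: both trees are readable locally (for example, the source
    is mounted over the network).
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(dest_root, rel)
            if not os.path.isfile(dst) or md5_of_file(src) != md5_of_file(dst):
                bad.append(rel)
    return sorted(bad)
```

Note that a file read back shortly after it was written may be served from the page cache rather than from the disks, so a check like this is more meaningful after the cache has been dropped or the machine rebooted.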
To find out whether or not this error message coincided with the write errors, I created a program that continuously wrote a 4096-byte block every 0.25 seconds. Each block contains a 32-bit timestamp, a 32-bit CRC32 checksum, and 4088 bytes of pseudo-random data. I let this program run overnight, and the next morning, I checked for new errors in the kernel log. I had hoped that this program would show that the CRC errors, if any, would coincide in time with the error messages in the kernel log. Sure enough, I found two new warnings in the log, but when I CRC-checked each block of pseudo-random data, I found no errors. It could be that some other write operation caused the kernel error, and that it wrote some corrupted data, but I can't be sure. I'll run the test a few more times, and then we'll see.

Btw, it might be worth mentioning that the failed write operations I mentioned earlier were all network transfers, unlike my test program, which both generates and writes data locally.
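The original test program isn't attached, so the following is only a sketch in Python of the block format described above (32-bit timestamp, 32-bit CRC32, 4088 bytes of pseudo-random data). The field order, the little-endian byte order, the choice to compute the CRC over the payload only, and the function names are all assumptions made for illustration.

```python
import os
import random
import struct
import time
import zlib

BLOCK_SIZE = 4096
# Assumed layout: 32-bit timestamp, then 32-bit CRC32, both little-endian.
HEADER = struct.Struct("<II")
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size  # 4088 bytes of pseudo-random data

_rng = random.Random()


def make_block(timestamp=None):
    """Build one block: timestamp, CRC32 of the payload, then the payload."""
    if timestamp is None:
        timestamp = int(time.time())
    payload = _rng.getrandbits(8 * PAYLOAD_SIZE).to_bytes(PAYLOAD_SIZE, "little")
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(timestamp & 0xFFFFFFFF, crc) + payload


def check_block(block):
    """Return (timestamp, ok); ok is True if the stored CRC matches the payload."""
    timestamp, stored_crc = HEADER.unpack_from(block)
    payload = block[HEADER.size:]
    ok = len(block) == BLOCK_SIZE and (zlib.crc32(payload) & 0xFFFFFFFF) == stored_crc
    return timestamp, ok


def write_blocks(path, count, interval=0.25):
    """Append `count` blocks to `path`, syncing each one, `interval` seconds apart."""
    with open(path, "ab") as f:
        for _ in range(count):
            f.write(make_block())
            f.flush()
            os.fsync(f.fileno())  # push the block out of the page cache
            time.sleep(interval)


def verify_file(path):
    """Return a list of (block_index, timestamp) for blocks that fail the CRC check.

    A short trailing block is reported with a timestamp of None.
    """
    bad = []
    with open(path, "rb") as f:
        index = 0
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            if len(block) < BLOCK_SIZE:
                bad.append((index, None))
                break
            timestamp, ok = check_block(block)
            if not ok:
                bad.append((index, timestamp))
            index += 1
    return bad
```

The timestamps returned by `verify_file` for any failing blocks could then be compared against the times of the `status=0x50` messages in the kernel log.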
Well, this bug is trickier than I thought. I discovered today that the original crash bug I thought was fixed in my vanilla 2.6.15 kernel still exists. The only difference from the Fedora kernel is that now, it doesn't print anything before it dies. I even tried with an RS-232 serial console, and still nothing. As before, once the first crash occurred, the mdadm RAID-5 array started resyncing, which causes the crash to happen MUCH more frequently. It always crashes a few minutes into the resync operation, so the resync never finishes. The only way out of this mess is to downgrade to my previous motherboard/CPU/memory.
I also have some more info on the write errors that frequently occur. I cannot reproduce them with the test program I mentioned earlier, but they happen all the time during large rsync transfers (RSYNC_RSH=ssh). Also, I noticed that I have gotten write errors when there was nothing new in the log, so apparently the status=0x50 errors are not directly linked to write errors. It looks more like they are related to the crash bug: these status errors weren't very common before, but now that the system crashes within 10 minutes (because of the never-finishing RAID resync), I see them more often.
This will probably be my last comment, as I have finally had enough. I've gone back to my trusty old P3 650 MHz (good enough for a simple file server) that works. Only the CPU/motherboard/RAM have changed, so it's probably a race condition in the kernel that causes all of these problems. I did have time to test the other hardware for errors using memtest86 v3.2, and I found nothing wrong. Everything points to a software bug. I even measured the voltages on my PSU to see if it was overloaded, but found no problems there either. The same PSU works perfectly on the older setup, and the load is about the same, so that's no surprise. I'd like to upgrade my system eventually though, so if anyone reading this knows of a hardware configuration that is proven to work with a software RAID-5 array of SATA disks, I'd love to hear about it.
2.6.18-rc3 may work better, as it improves libata in many areas, such as error handling.
Haakon, did you get a chance to test with the latest kernels? Is the problem still there? Thanks.
No, I got a new motherboard for the file server, and this was never a problem with the new hardware, even though I still use the same SATA controller. The old motherboard is now used in a desktop box that only has one hard disk, and the problem only manifested itself in a RAID configuration, so I don't know what happened. Since I won't be able to produce any more information, this bug might as well be closed until someone else encounters a similar problem.