Bug 5995 - RAID with SATA fails on drive un-plug
Summary: RAID with SATA fails on drive un-plug
Status: REJECTED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: i386 Linux
Importance: P2 high
Assignee: Jeff Garzik
URL:
Keywords:
Duplicates: 5996
Depends on:
Blocks:
 
Reported: 2006-02-02 03:58 UTC by Terry Barnaby
Modified: 2006-05-03 07:40 UTC

See Also:
Kernel Version: 2.6.14-1.1656_FC4smp
Subsystem:
Regression: ---
Bisected commit-id:



Description Terry Barnaby 2006-02-02 03:58:33 UTC
Most recent kernel where this bug did not occur:
Distribution:          Fedora Core 4
Hardware Environment:  Intel i915 based board with "Intel Corporation 82801FB/FW
(ICH6/ICH6W) SATA Controller (rev 04)"
Software Environment:  Fedora Core 4, kernel-2.6.14-1.1656_FC4smp
Problem Description:

I have just set up a RAID 5 array using 4 SATA disks on Fedora Core 4.
To test the setup, I unplugged the SATA cable from one of the disk drives.
I expected the system to carry on, with messages from the RAID system
indicating that a disk drive was down and an email to root reporting the
problem.

However, the RAID 5 partition became inaccessible after I unplugged
the drive. The kernel reported disk errors, but there were no error messages
from the RAID system, and "mdadm -Q --detail /dev/md2" reported no
problems with the RAID array.
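
For reference, the degraded state I expected to see can also be read from
/proc/mdstat. The following is only a minimal sketch, assuming the usual
"[expected/active] [UU_U]" status field on the line after each array header;
the function name degraded_arrays is made up for illustration. In the failure
described here it would likely have reported nothing, since mdadm also saw no
problem.

```python
import re

# Array header line, e.g. "md2 : active raid5 sdd1[3] sdc1[2] sda1[0]".
_ARRAY = re.compile(r"^(md\w+)\s*:")
# Status field, e.g. "[4/3] [U_UU]": expected devices / active devices.
_STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active)} for arrays missing a device.

    mdstat_text is the full text of /proc/mdstat. The helper itself is
    hypothetical; it is not part of mdadm or the kernel.
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY.match(line)
        if header:
            current = header.group(1)
            continue
        status = _STATUS.search(line)
        if current and status:
            expected = int(status.group(1))
            active = int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None
    return degraded
```

On a Linux host it could be called as
degraded_arrays(open("/proc/mdstat").read()); an empty result means that md
considers every array complete.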

Even worse, if I access a file that has not previously been cached, there is a
long delay and then the program returns with no error but no data. For example,
"cat /data/test-file" will delay and then exit with a status of "0", but no file
contents are displayed. This is VERY VERY BAD!
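
Because the exit status cannot be trusted in this state, a test needs to compare
the number of bytes actually read against the size the filesystem reports. A
minimal sketch, assuming the file size from stat() is still available (for
example from a cached inode); the helper name read_fully is made up for
illustration:

```python
import os


def read_fully(path, chunk_size=1 << 20):
    """Read path to EOF and raise if fewer bytes arrive than stat() reports.

    Hypothetical helper: it turns a silent short read (EOF with no error)
    into an explicit failure.
    """
    expected = os.stat(path).st_size
    received = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # EOF, or a read that silently returned nothing
                break
            received += len(chunk)
    if received != expected:
        raise IOError(f"{path}: expected {expected} bytes, read {received}")
    return received
```

Calling read_fully("/data/test-file") in place of "cat" should raise an error
if the read ends early, rather than returning as if it had succeeded.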

When I rebooted the system (a reset was needed), the RAID system reported that
one disk was down and the partition became readable again. This was the expected
behaviour.

I have tried the same test with a SCSI-based RAID system, and it works
as expected.

It appears that there is a bug in the SATA driver: it does not react correctly
to the loss of a drive or connection.

The SATA chipset in use is:
"Intel Corporation 82801FB/FW (ICH6/ICH6W) SATA Controller (rev 04)"

The kernel error messages when a disk is removed look like:
ata2: command 0x35 timeout, stat 0x0 host_stat 0x61
ata2: command 0x25 timeout, stat 0x0 host_stat 0x61


Steps to reproduce:
1. Set up a RAID 1 or 5 array using SATA disks
2. Unplug the SATA cable from a disk
3. Try to access a file on the RAID partition
Comment 1 Terry Barnaby 2006-02-02 07:28:16 UTC
*** Bug 5996 has been marked as a duplicate of this bug. ***
Comment 2 Martin J. Bligh 2006-05-03 07:40:42 UTC
Please file Fedora bugs in Fedora's own Bugzilla, or reproduce on mainline.
