Bug 7845 - access to large RAID arrays on Adaptec 2400A RAID controller with dpt_i2o module causes system hang
Status: REJECTED UNREPRODUCIBLE
Alias: None
Product: SCSI Drivers
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 high
Assignee: scsi_drivers-other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-01-17 12:21 UTC by Robert B
Modified: 2007-08-07 08:19 UTC

See Also:
Kernel Version: 2.6.8, 2.6.18
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description Robert B 2007-01-17 12:21:35 UTC
Most recent kernel where this bug did *NOT* occur: 2.4.27 (see detailed
description for more info)
Distribution: Debian 3.1 (Sarge)
Hardware Environment: Adaptec 2400A ATA RAID Controller
Software Environment: Module dpt_i2o
Problem Description: access to large RAID arrays on Adaptec 2400A RAID
controller with dpt_i2o module causes system hang

Steps to reproduce:

Package: kernel-image-2.6.8-3-k7
Version: 2.6.8-16sarge6
Severity: critical
Justification: causes serious data loss


The system has built in an Adaptec 2400A RAID controller with two 80 GB
disks attached which are configured as RAID 1. The controller is handled
by the dpt_i2o module.

In this configuration everything works fine. (The RAID array is
exclusively used for data storage [samba server], the operating system
is installed on a separate single IDE disk which is attached to the
motherboard's IDE controller).

When I attach two 500 GB disks to the controller instead of the 80 GB
disks and configure them as RAID 1 array, the problems begin.
Partitioning the array with cfdisk works, and building the file system
with mkfs.ext3 or mkreiserfs works, too. But copying data to the
partition with cp causes cp to hang after some time, once some files
have been copied. Changing, from another console, to the directory in
which the partition on the array is mounted causes that console to hang,
too. 'shutdown -h now' (again from another console) displays the
shutdown message, but the system does not shut down.

After a hard reset, whether any data is on the array depends on the
file system. With ext3, none of the copied files are on the array; with
reiserfs, the files that were copied before cp hung are on the array.

I tried to google for the problem and found a post on google groups
which seems to be related to the problem. The message-id is
<3XOno-5Pz-19@gated-at.bofh.it>

With the 2.4.27-2-386 kernel, everything seems to work, even with the
500 GB RAID array.

Kernel 2.6.18 (linux-image-2.6.18-3-k7) from backports.org doesn't
solve the problem either.

-- System Information:
Debian Release: 3.1
Architecture: i386 (i686)
Kernel: Linux 2.6.8-3-k7
Locale: LANG=de_DE@euro, LC_CTYPE=de_DE@euro (charmap=ISO-8859-15)

Versions of packages kernel-image-2.6.8-3-k7 depends on:
ii  coreutils [fileutils]         5.2.1-2    The GNU core utilities
ii  initrd-tools                  0.1.81.1   tools to create initrd image for p
ii  module-init-tools             3.2-pre1-2 tools for managing Linux kernel mo

-- no debconf information
Comment 1 Natalie Protasevich 2007-08-01 00:35:53 UTC
Robert,
If you still have this problem, it might be helpful to get a trace with alt-sysrq-t at the time of the hang and attach it to the report.
Have you tried a recent kernel lately?
Thanks.
Comment 2 Robert B 2007-08-01 01:18:12 UTC
Natalie,
I solved the problem by using a 3Ware RAID controller for the 500 GB array. Since the machine is productive now, I cannot do any more tests, sorry.
Comment 3 Natalie Protasevich 2007-08-02 00:15:47 UTC
Thanks for the update. I guess this bug can be closed for now until someone runs into the same problem...
Comment 4 Steffen Bischof 2007-08-06 16:48:14 UTC
I can confirm the problems described by Robert with a system running up-to-date Debian Etch stable on an Adaptec 2400A with 2x 320 GB Seagate HDDs as RAID-1.

It's easy to force the server to hang by copying files of 500 MB and up to the disk. Sometimes it even hangs with much smaller files.
The problem seems to have some strange relationship to certain applications.
Copying files from a USB HDD to the RAID seems to work fine. Using tar will crash the system most of the time. Samba seems to be a constant source of crashes, too.

System backup and restore with Acronis TrueImage 9.1 (which uses Linux under the hood) always worked flawlessly.
Comment 5 Natalie Protasevich 2007-08-06 18:14:35 UTC
Steffen,
Can you provide information on these crashes? Are they oopses? If your system hangs, maybe you can get the alt-sysrq-t trace and attach it here. If this is a hard hang and you can reproduce it at will, then you can start top or "vmstat 3" on some VT and keep it running while you escalate your load until the system hangs.
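
The trace requested above can also be triggered from a script rather than the keyboard. The following is only a minimal sketch, assuming a kernel that provides /proc/sysrq-trigger, a root shell that is still responsive, and a Python 3 interpreter on the machine. The function name request_task_dump is hypothetical and not part of any tool mentioned in this report. If the console itself is frozen, only the keyboard alt-sysrq-t combination remains.

```python
from pathlib import Path

# Standard procfs files for the magic SysRq facility.
SYSRQ_ENABLE = Path("/proc/sys/kernel/sysrq")
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def request_task_dump(enable=SYSRQ_ENABLE, trigger=SYSRQ_TRIGGER):
    """Ask the kernel to dump all task states (the alt-sysrq-t equivalent).

    Hypothetical helper for illustration. It must run as root on a real
    system. The dump is written to the kernel log (dmesg, console or
    serial console), not returned to this process.
    """
    # "1" enables all SysRq functions.
    Path(enable).write_text("1")
    # "t" requests the task-state dump, same as pressing alt-sysrq-t.
    Path(trigger).write_text("t")
```

Calling request_task_dump() as root while cp or tar is stuck, with "vmstat 3" running on another VT, should leave the task states in the kernel log, from where they can be attached to this report.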
Comment 6 Mark Salyzyn 2007-08-07 05:49:59 UTC
Please make sure the drives being used are RAID compatible. Desktop class drives that perform their own error recovery and bad block remapping will clash with any RAID controller's own recovery actions. Western Digital JD drives I believe (I may be mistaken) are such an example.
Comment 7 Steffen Bischof 2007-08-07 06:17:06 UTC
Mark,

thanks for bringing this topic to the table. I am aware of this fact. There is a special RAID Edition of WD Drives which differs from the consumer version in exactly this single point, the error handling. 

I am almost 100% sure that at the time the 2400A was released to the market there were no special Enterprise grade IDE HDDs and no special RAID versions. 

Is there a drive compatibility list available for the controller? I couldn't find one at Adaptec.com.
Comment 8 Mark Salyzyn 2007-08-07 08:19:37 UTC
The 2.4 kernels have a default command timeout of 30 to 90 seconds depending on the release (distributions, which Adaptec tested with, err on the high side); the 2.6 kernels have a default 30 second timeout. This may also be a factor, so you may wish to extend the timeout. The dpt_i2o driver, however, circumvents the timeout and introduces its own extended timeout of 300 seconds (trust the controller). At least that is the case for the 64 bit capable dpt_i2o driver I hold upstream (available upon request), but some variants in kernel.org did not override the timeout. I am not sure that extending the timeout will help given the problems shutting down; that problem points to issues with the hardware (?)

At the time the 2400A was released to the market, all drives were effectively RAID edition; only later did consumer version drives start to introduce error recovery and bad block remapping features. Being consumer version drives does not by itself make them incompatible. The issue arises when the drive's error recovery sets up an interference pattern with the RAID card's recovery, or when the drive's error recovery blocks commands from completing within a reasonable period of time (ten seconds I believe, at which point the adapter starts introducing its own error recovery actions).

I contacted Adaptec Technical Support about any compatibility issues and got the following response:

'We have had reports of drive recognition issues with WD drives generally at boot and recommend the drives are jumpered for the factory default "cable select" setting.

Information concerning Western Digital drive specifications for Enterprise and Desktop class drives can be found on their knowledge base at:

http://wdc.custhelp.com/cgi-bin/wdc.cfg/php/enduser/std_alp.php

article numbers 1277 and 1397.  Western Digital reports there are no firmware updates for any EIDE hard disk manufactured after 3/25/03 in article number 1348.

We do not maintain a list of tested drives, however we will provide any information concerning known compatibility issues/reports of problems on our ASK knowledge base on our website.  We only have the article concerning the cable select jumper setting at this time.'
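
The timeout extension suggested at the start of this comment can be sketched as follows. This is a hedged illustration, not a confirmed fix: it assumes the array shows up as a SCSI disk with the usual /sys/block/<device>/device/timeout attribute, and that the running dpt_i2o variant honors that value. The device name "sda" and the helper names are hypothetical. The 300 second figure is the one quoted above for the upstream driver.

```python
from pathlib import Path

# Standard sysfs location for block devices.
SYSFS_BLOCK = Path("/sys/block")


def timeout_path(device, sysfs_block=SYSFS_BLOCK):
    """Return the per-device SCSI command timeout file for e.g. 'sda'."""
    return Path(sysfs_block) / device / "device" / "timeout"


def get_timeout(device, sysfs_block=SYSFS_BLOCK):
    """Read the current command timeout in seconds."""
    return int(timeout_path(device, sysfs_block).read_text().strip())


def set_timeout(device, seconds, sysfs_block=SYSFS_BLOCK):
    """Write a new command timeout in seconds (needs root on a real system)."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_path(device, sysfs_block).write_text(f"{seconds}\n")
```

On a real system, get_timeout("sda") would show the current value (30 by default on 2.6 kernels, per the comment above) and set_timeout("sda", 300) would raise it. The setting does not survive a reboot, and as noted above it may not help if the shutdown problems point to a hardware issue.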
