Bug 6592 - md + bad disk + "mdadm --add" = crash
Summary: md + bad disk + "mdadm --add" = crash
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: i386 Linux
Importance: P2 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-05-21 02:23 UTC by Petr Balas
Modified: 2007-08-08 20:44 UTC
CC List: 0 users

See Also:
Kernel Version: 2.6.16
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
config of affected kernel (32.75 KB, text/plain)
2006-05-21 02:24 UTC, Petr Balas

Description Petr Balas 2006-05-21 02:23:33 UTC
Distribution: Debian stable + selfcompiled kernel (vanilla 2.6.16)
(config as attachment)


Hardware Environment:
motherboard Intel D945something, 2GB RAM, 3x SATA disk
e1000 not used - not working reliably
computer tested with memtest86, burnP6, all OK
server:~# lspci
0000:00:00.0 Host bridge: Intel Corp.: Unknown device 2770 (rev 02)
0000:00:02.0 VGA compatible controller: Intel Corp.: Unknown device 2772 (rev 02)
0000:00:1c.0 PCI bridge: Intel Corp.: Unknown device 27d0 (rev 01)
0000:00:1c.2 PCI bridge: Intel Corp.: Unknown device 27d4 (rev 01)
0000:00:1c.3 PCI bridge: Intel Corp.: Unknown device 27d6 (rev 01)
0000:00:1c.4 PCI bridge: Intel Corp.: Unknown device 27e0 (rev 01)
0000:00:1c.5 PCI bridge: Intel Corp.: Unknown device 27e2 (rev 01)
0000:00:1d.0 USB Controller: Intel Corp.: Unknown device 27c8 (rev 01)
0000:00:1d.1 USB Controller: Intel Corp.: Unknown device 27c9 (rev 01)
0000:00:1d.2 USB Controller: Intel Corp.: Unknown device 27ca (rev 01)
0000:00:1d.3 USB Controller: Intel Corp.: Unknown device 27cb (rev 01)
0000:00:1d.7 USB Controller: Intel Corp.: Unknown device 27cc (rev 01)
0000:00:1e.0 PCI bridge: Intel Corp. 82801 PCI Bridge (rev e1)
0000:00:1f.0 ISA bridge: Intel Corp.: Unknown device 27b8 (rev 01)
0000:00:1f.1 IDE interface: Intel Corp.: Unknown device 27df (rev 01)
0000:00:1f.2 0106: Intel Corp.: Unknown device 27c1 (rev 01)
0000:00:1f.3 SMBus: Intel Corp.: Unknown device 27da (rev 01)
0000:01:00.0 Ethernet controller: Intel Corp.: Unknown device 108c (rev 03)
0000:01:00.2 IDE interface: Intel Corp.: Unknown device 108d (rev 03)
0000:01:00.3 Serial controller: Intel Corp.: Unknown device 108f (rev 03)
0000:01:00.4 0c07: Intel Corp.: Unknown device 108e (rev 03)
0000:06:01.0 Ethernet controller: 3Com Corporation 3c905 100BaseTX [Boomerang]

Software Environment:
Debian stable, samba, cups, dhcp server


Problem Description:
1) strange crash with filesystem corruption
The server (samba) became inaccessible, so the on-site person rebooted it.
The boot failed with filesystem errors on an ext3 partition; a manual fsck was
needed and there was major data loss (sorry, I don't have more info on this one).


2) mdadm --manage .. --add locks the computer
After recovering the data from backup I noticed that the software RAID was
broken. I ran
mdadm --manage /dev/md1 --add /dev/sdb2
Nothing happened (as observed with cat /proc/mdstat), so I ran it again.
At this point any program trying to access the disk locked up.
I found some error messages on the console.

Software RAID configuration:
(shown in its correct state; at the time of the problem md2 and md3 were broken because of the bad sdb)
server:~# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[0] sda5[1]
      1999936 blocks [2/2] [UU]
md2 : active raid1 sdb3[0] sda6[1]
      3903680 blocks [2/2] [UU]
md3 : active raid1 sdb4[0] sda7[1]
      283241856 blocks [2/2] [UU]
md0 : active raid1 sdb1[0] sda1[1]
      3903680 blocks [2/2] [UU]
unused devices: <none>
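
For readers who want to check the array state programmatically rather than by eye, here is a minimal sketch in Python. It is not part of the original report. It assumes the usual /proc/mdstat layout, in which the status line of each array carries a "[configured/active]" pair such as [2/2]. The function name degraded_arrays and the [2/1] sample are hypothetical illustrations; the report does not include the mdstat output from the time of the failure.

```python
import re

# Healthy configuration, copied from the mdstat listing in the report (shortened).
SAMPLE = """\
Personalities : [raid1]
md1 : active raid1 sdb2[0] sda5[1]
      1999936 blocks [2/2] [UU]
md2 : active raid1 sdb3[0] sda6[1]
      3903680 blocks [2/2] [UU]
unused devices: <none>
"""

# HYPOTHETICAL degraded state, for illustration only (not taken from the report).
HYPOTHETICAL_DEGRADED = """\
Personalities : [raid1]
md2 : active raid1 sda6[1]
      3903680 blocks [2/1] [_U]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: (active, configured)} for arrays missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line such as "md1 : active raid1 ...".
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line that follows holds "[configured/active]", e.g. "[2/2]".
        status = re.search(r"\[(\d+)/(\d+)\]", line)
        if status and current is not None:
            configured = int(status.group(1))
            active = int(status.group(2))
            if active < configured:
                result[current] = (active, configured)
            current = None
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
    print(degraded_arrays(HYPOTHETICAL_DEGRADED))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample strings.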

Console messages (written to paper):
ata2: handling error/timeout
ata2: port reset, p_is 0 is 0 pis 0 cmd 44 017 f7 d0 ss 113 se 0
ata2_status = 0x50
Assertion failed! qc->err_mask == 0
 drivers/scsi/ahci.c, ahci_host_intr line 681
Assertion failed! qc->flags & ATA_QCFLAG_ACTIVE
 drivers/scsi/libata-core.c, ata_qc_complete, line 3631
ata2: status 0x50
sdb: Current sense key = 0x0 ASC = 0x0 ASCQ = 0x0
Badness in blk_do_ordered at block/ll_rw_blk.c: 550
blk_do_ordered+0x282
elv_next_request+0xd6
scsi_request_fn+0x60
blk_run_queue+0x1f
scsi_run_queue+0xfa
scsi_next_command+0x26
scsi_end_request+0x94
scsi_io_completion+0x193
sd_rw_intr+0x1b1
scsi_finish_command+0x13
ata_scsi_qc_complete+0x171
ahci_interrupt+0xda
handle_IRQ_event+0x20
__do_IRQ+0x53
do_IRQ+0x19
common_interrupt+0x1a
cast6_decrypt+16e
scsi_request_fn+0x232
blk_run_queue+0x1f
scsi_run_queue+0xfa
scsi_next_command+0x26
scsi_end_request+0x94
scsi_io_completion+0x193
scsi_blk_pc_done+0x26
ata_scsi_qc_complete+0x6a
ata_qc_complete+0x171
ahci_eng_timeout+0x6a
scsi_error_handle+0x0
ata_scsi_error+0x12
scsi_error_handle+0x69
kthread-0x80
kthread+0x9a
kthread+00
kernel_thread_helper
Badness in blk_do_ordered at block/ll_rw_blk.c: 550
blk_do_ordered+0x282
elv_next_request+0xd6
scsi_request_fn+0x60
scsi_error_handler+0x0
blk_run_queue+0x1f
scsi_run_queue+0xfa
scsi_error_handler+0x0
scsi_run_???_queues+0x12
scsi_error_handler+0x6c2
scsi_error_handler+0
kthread+0x80
scsi_error_handler+0x0
kthread+0x94
kthread
kernel_thread_helper


Steps to reproduce:
1) can't reproduce
2) mdadm --manage /dev/md1 --add /dev/sdb2, now solved by new disk
Comment 1 Petr Balas 2006-05-21 02:24:59 UTC
Created attachment 8159
config of affected kernel
Comment 2 Neil Brown 2006-05-21 02:33:00 UTC
Looks more like a SATA problem to me -- reassigning.

It might be helpful if you have any more details on the way in which
sdb was 'bad'.
Comment 3 Petr Balas 2006-05-21 02:42:01 UTC
The disk is on my desk. What should I test?
Comment 4 Natalie Protasevich 2007-08-08 13:09:59 UTC
Petr, have you tested with the latest kernels? Do you still see the problem?
Thanks.
Comment 5 Petr Balas 2007-08-08 14:07:40 UTC
Sorry, I no longer have this bad disk, so I can't test :-(.
Comment 6 Tejun Heo 2007-08-08 20:44:24 UTC
This is definitely fixed now.  Closing.
