Bug 6665

Summary: Kernel is frozen during removal of SATA disk from RAID1 (sata_nv/libata/raid1 issue?)
Product: IO/Storage Reporter: Jiri Dlouhy (jiri.dlouhy)
Component: Serial ATAAssignee: Tejun Heo (htejun)
Status: RESOLVED DUPLICATE    
Severity: normal CC: protasnb
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.16.20-smp Subsystem:
Regression: --- Bisected commit-id:

Description Jiri Dlouhy 2006-06-08 04:29:13 UTC
Distribution: 
OpenSUSE 10.1

Hardware Environment:
Supermicro H8DCE motherboard (based on nVidia nForce Pro chipset) with two AMD
Opterons 246
2x SATA harddrives WD1600JS inside hotplug SATA bays

Problem Description:

Hello, 
I observe a problem described bellow. I'm not sure if it's not related more to
MD (RAID) then SATA. I'm running two SATA disks in a software (md) RAID1. If I
extract a disk from RAID array, I would expect it should by instantly signed by
RAID as faulty but a test shows something else.

If I extract one HDD from RAID1 the kernel freezes after first request towards
filesystem. If I insert the HDD back, kernel runs again and the MD RAID state
remains unaffected (output of /proc/mdstat without any failure). 
After kernel revovery from the frozen state the syslog shows:

Jun  8 13:03:48 helpdesk kernel: nv_sata: Secondary device removed
Jun  8 13:04:18 helpdesk kernel: ata2: command 0xea timeout, stat 0xd0 host_stat 0x0
Jun  8 13:04:18 helpdesk kernel: ata2: translated ATA stat/err 0xd0/00 to SCSI
SK/ASC/ASCQ 0xb/47/00
Jun  8 13:04:18 helpdesk kernel: ata2: status=0xd0 { Busy }
Jun  8 13:04:18 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:04:18 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:04:18 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:04:37 helpdesk kernel: nv_sata: Secondary device added
Jun  8 13:05:18 helpdesk kernel: ata2: command 0x35 timeout, stat 0x50 host_stat
0x21


If I extract one HDD as above but wait for long enough time while the kernel is
frozen then after two minutes  some timeout occures. Then the extracted HDD is
kicked out from the RAID and kernel runs again. 
The syslog shows:

Jun  8 13:11:06 helpdesk kernel: nv_sata: Secondary device removed
Jun  8 13:11:36 helpdesk kernel: ata2: command 0xea timeout, stat 0xd0 host_stat 0x0
Jun  8 13:11:36 helpdesk kernel: ata2: translated ATA stat/err 0xd0/00 to SCSI
SK/ASC/ASCQ 0xb/47/00
Jun  8 13:11:36 helpdesk kernel: ata2: status=0xd0 { Busy }
Jun  8 13:11:36 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:11:36 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:11:36 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:12:36 helpdesk kernel: ata2: command 0x35 timeout, stat 0xd0 host_stat
0x21
Jun  8 13:12:36 helpdesk kernel: ata2: translated ATA stat/err 0xd0/00 to SCSI
SK/ASC/ASCQ 0xb/47/00
Jun  8 13:12:36 helpdesk kernel: ata2: status=0xd0 { Busy }
Jun  8 13:12:36 helpdesk kernel: sd 1:0:0:0: SCSI error: return code = 0x8000002
Jun  8 13:12:36 helpdesk kernel: sdb: Current: sense key: Aborted Command
Jun  8 13:12:36 helpdesk kernel:     Additional sense: Scsi parity error
Jun  8 13:13:06 helpdesk kernel: end_request: I/O error, dev sdb, sector 46154502
Jun  8 13:13:06 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:13:06 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:13:06 helpdesk kernel: ATA: abnormal status 0xD0 on port 0xC007
Jun  8 13:13:06 helpdesk kernel: ata2: command 0xea timeout, stat 0xd0 host_stat 0x0
Jun  8 13:13:06 helpdesk kernel: ata2: translated ATA stat/err 0xd0/00 to SCSI
SK/ASC/ASCQ 0xb/47/00
Jun  8 13:13:06 helpdesk kernel: ata2: status=0xd0 { Busy }
Jun  8 13:13:06 helpdesk kernel: raid1: Disk failure on sdb2, disabling device. 
Jun  8 13:13:06 helpdesk kernel:        Operation continuing on 1 devices
Jun  8 13:13:06 helpdesk kernel: RAID1 conf printout:
Jun  8 13:13:06 helpdesk kernel:  --- wd:1 rd:2
Jun  8 13:13:06 helpdesk kernel:  disk 0, wo:0, o:1, dev:sda2
Jun  8 13:13:06 helpdesk kernel:  disk 1, wo:1, o:0, dev:sdb2
Jun  8 13:13:06 helpdesk kernel: RAID1 conf printout:
Jun  8 13:13:06 helpdesk kernel:  --- wd:1 rd:2
Jun  8 13:13:06 helpdesk kernel:  disk 0, wo:0, o:1, dev:sda2

If I remove the HDD from RAID (by mdadm) before I extract it, nothing bad happens.
The syslog shows only:

Jun  8 13:24:58 helpdesk kernel: nv_sata: Secondary device removed
Jun  8 13:25:07 helpdesk kernel: nv_sata: Secondary device added


Thanks
Jiri


Here are more informations about the system:

# uname -a
Linux helpdesk 2.6.16.20-smp #1 SMP Wed Jun 7 21:56:29 CEST 2006 x86_64 x86_64
x86_64 GNU/Linux

# cat /proc/mdstat 
Personalities : [raid1] 
md2 : active raid1 sda3[0] sdb3[1]
      133210880 blocks [2/2] [UU]
      
md1 : active raid1 sda2[0] sdb2[1]
      20972736 blocks [2/2] [UU]
      
md0 : active raid1 sda1[0] sdb1[1]
      2104384 blocks [2/2] [UU]
      
unused devices: <none>

# lspci 
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio
Controller (rev a2)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:00.0 VGA compatible controller: ATI Technologies Inc RV370 5B60 [Radeon X300
(PCIE)]
01:00.1 Display controller: ATI Technologies Inc RV370 [Radeon X300SE]
80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
80:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
80:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
80:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
80:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
80:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)


# dmesg (extract from the boot-up)

md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
libata version 1.20 loaded.
sata_nv 0000:00:07.0: version 0.8
ACPI: PCI Interrupt Link [LTID] enabled at IRQ 22
GSI 16 sharing vector 0xD9 and IRQ 16
ACPI: PCI Interrupt 0000:00:07.0[A] -> Link [LTID] -> GSI 22 (level, low) -> IRQ 217
PCI: Setting latency timer of device 0000:00:07.0 to 64
ata1: SATA max UDMA/133 cmd 0xC800 ctl 0xC402 bmdma 0xB800 irq 217
ata2: SATA max UDMA/133 cmd 0xC000 ctl 0xBC02 bmdma 0xB808 irq 217
ata1: SATA link up 3.0 Gbps (SStatus 123)
ata1: dev 0 cfg 49:2f00 82:706b 83:7e61 84:4023 85:7069 86:3e41 87:4023 88:407f
ata1: dev 0 ATA-7, max UDMA/133, 312581808 sectors: LBA48
nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
ata1: dev 0 configured for UDMA/133
scsi0 : sata_nv
ata2: SATA link up 3.0 Gbps (SStatus 123)
ata2: dev 0 cfg 49:2f00 82:706b 83:7e61 84:4023 85:7069 86:3e41 87:4023 88:407f
ata2: dev 0 ATA-7, max UDMA/133, 312581808 sectors: LBA48
nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
ata2: dev 0 configured for UDMA/133
scsi1 : sata_nv
  Vendor: ATA       Model: WDC WD1600JS-98N  Rev: 10.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 312581808 512-byte hdwr sectors (160042 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
 sda:<4>nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
 sda1 sda2 sda3
sd 0:0:0:0: Attached scsi disk sda
  Vendor: ATA       Model: WDC WD1600JS-98N  Rev: 10.0
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
SCSI device sdb: 312581808 512-byte hdwr sectors (160042 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
 sdb:<4>nv_sata: Primary device added
nv_sata: Primary device removed
nv_sata: Secondary device added
nv_sata: Secondary device removed
 sdb1 sdb2 sdb3
sd 1:0:0:0: Attached scsi disk sdb
ACPI: PCI Interrupt Link [LTIE] enabled at IRQ 21
GSI 17 sharing vector 0xE1 and IRQ 17
ACPI: PCI Interrupt 0000:00:08.0[A] -> Link [LTIE] -> GSI 21 (level, low) -> IRQ 225
PCI: Setting latency timer of device 0000:00:08.0 to 64
ata3: SATA max UDMA/133 cmd 0xB400 ctl 0xB002 bmdma 0xA400 irq 225
ata4: SATA max UDMA/133 cmd 0xAC00 ctl 0xA802 bmdma 0xA408 irq 225
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 1:0:0:0: Attached scsi generic sg1 type 0
ata3: SATA link down (SStatus 0)
scsi2 : sata_nv
ata4: SATA link down (SStatus 0)
scsi3 : sata_nv
ACPI: PCI Interrupt Link [LT3D] enabled at IRQ 47
GSI 18 sharing vector 0xE9 and IRQ 18
ACPI: PCI Interrupt 0000:80:07.0[A] -> Link [LT3D] -> GSI 47 (level, low) -> IRQ 233
PCI: Setting latency timer of device 0000:80:07.0 to 64
ata5: SATA max UDMA/133 cmd 0xF800 ctl 0xF402 bmdma 0xE800 irq 233
ata6: SATA max UDMA/133 cmd 0xF000 ctl 0xEC02 bmdma 0xE808 irq 233
ata5: SATA link down (SStatus 0)
scsi4 : sata_nv
ata6: SATA link down (SStatus 0)
scsi5 : sata_nv
ACPI: PCI Interrupt Link [LT2E] enabled at IRQ 46
GSI 19 sharing vector 0x32 and IRQ 19
ACPI: PCI Interrupt 0000:80:08.0[A] -> Link [LT2E] -> GSI 46 (level, low) -> IRQ 50
PCI: Setting latency timer of device 0000:80:08.0 to 64
ata7: SATA max UDMA/133 cmd 0xE400 ctl 0xE002 bmdma 0xD400 irq 50
ata8: SATA max UDMA/133 cmd 0xDC00 ctl 0xD802 bmdma 0xD408 irq 50
ata7: SATA link down (SStatus 0)
scsi6 : sata_nv
ata8: SATA link down (SStatus 0)
scsi7 : sata_nv
md: raid1 personality registered for level 1
md: md0 stopped.
md: bind<sdb1>
md: bind<sda1>
raid1: raid set md0 active with 2 out of 2 mirrors
md: md1 stopped.
md: bind<sdb2>
md: bind<sda2>
raid1: raid set md1 active with 2 out of 2 mirrors
Attempting manual resume
md: md1 stopped.
md: unbind<sda2>
md: export_rdev(sda2)
md: unbind<sdb2>
md: export_rdev(sdb2)
md: md0 still in use.
md: md0 still in use.
md: md0 still in use.
md: md1 stopped.
md: bind<sdb2>
md: bind<sda2>
raid1: raid set md1 active with 2 out of 2 mirrors
md: md2 stopped.
md: bind<sdb3>
md: bind<sda3>
raid1: raid set md2 active with 2 out of 2 mirrors
Comment 1 Natalie Protasevich 2007-08-08 13:07:54 UTC
Jiri, any updates on this? Have you tested with latest kernels?
Thanks.
Comment 2 Tejun Heo 2007-08-08 20:41:24 UTC

*** This bug has been marked as a duplicate of bug 8421 ***
Comment 3 Jiri Dlouhy 2007-08-09 13:21:54 UTC
Hello Natalie,

unfortunately I didn't test it with the latest kernels. My final step
was to use a rc version (I'll post the exact number as soon as I will
reach the server). With this kernel this problem was not present any
more. But remaines the problem that after the disc is inserted back into
the server it gets another /dev/sdX number and thus can't be smoothly
added to the raid.

Jirka


bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=6665
>
>
> protasnb@gmail.com changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |protasnb@gmail.com
>
>
>
>
> ------- Comment #1 from protasnb@gmail.com  2007-08-08 13:07 -------
> Jiri, any updates on this? Have you tested with latest kernels?
> Thanks.
>
>
>