Bug 5733 - Oops writing to SATA RAID disks
Summary: Oops writing to SATA RAID disks
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM (show other bugs)
Hardware: i386 Linux
: P2 high
Assignee: Alasdair G Kergon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-12-12 06:26 UTC by Hans-Joachim Baader
Modified: 2006-02-17 03:16 UTC (History)
0 users

See Also:
Kernel Version: 2,6,15-rc5, 2.6.13.4
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Output of lspci -vv (11.41 KB, text/plain)
2005-12-13 02:57 UTC, Hans-Joachim Baader
Details
Stacktrace of partimage processes (1.80 KB, text/plain)
2005-12-20 03:19 UTC, Hans-Joachim Baader
Details

Description Hans-Joachim Baader 2005-12-12 06:26:54 UTC
This bug occurs when writing images with partimage to device /dev/sdp*
Hardware is Intel P4 based, two SATA disks in mirroring configuration.
System is a minimal system that runs in an initrd environment.

After writing exactly 360 MB (+/- 0.25 MB) the partimage process goes in D state
and we find an oops in the log. It looks very similar to a problem reported 10
months ago: https://www.redhat.com/archives/dm-devel/2005-February/msg00039.html


ksymoops 2.4.9 on i686 2.6.14.2-bapdev.  Options used
     -V (default)
     -K (specified)
     -l lsmod (specified)
     -o
/usr/src/lxboot-swfill/lxboot-swfill-V2.0.10-current/root/lib/modules/2.6.15-rc5-swfill
(specified)
     -m /usr/src/linux-2.6.15-rc5-swfill/System.map (specified)

No modules in ksyms, skipping objects
No ksyms, skipping lsmod
Unable to handle kernel paging request at virtual address f88c6000
c01793e5
*pde = 1fc61067
Oops: 0000 [#1]
CPU:    0
EIP:    0060:[<c01793e5>]    Not tainted VLI
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246   (2.6.15-rc5-swfill) 
eax: ffffffff   ebx: f88c6000   ecx: 07fffdfa   edx: 00000000
esi: 00020000   edi: f88c6000   ebp: ffffbf27   esp: f7febec0
ds: 007b   es: 007b   ss: 0068
Stack: 00000000 00000000 f6d83180 f7febefc f6d4158c f7fcfac0 c01cf8c6 f88c2000 
       0001bf27 00020000 f6d4158c f6d41580 c01d00ad f687a4e0 f7febefc 0001ffff 
       f6d4158c f6d41580 f687a4e0 c01d0137 f6d4158c f6d41580 c01d0363 f6d4158c 
Call Trace:
 [<c01cf8c6>]
 [<c01d00ad>]
 [<c01d0137>]
 [<c01d0363>]
 [<c01d06ff>]
 [<c01d0749>]
 [<c011dd81>]
 [<c01d071a>]
 [<c010fce9>]
 [<c0215d5c>]
 [<c010fce9>]
 [<c011dc2b>]
 [<c0120c5e>]
 [<c0120be5>]
 [<c01012b1>]
Code: f0 eb 49 89 44 24 04 83 c7 04 c7 04 24 00 00 00 00 89 f8 29 d8 c1 e0 03 29
c5 74 26 8d 4d 1f 89 fb c1 e9 05 b8 ff ff ff ff 31 d2 <f3> af 74 09 33 47 fc 83
ef 04 0f bc d0 29 df c1 e7 03 01 fa 89 


>>EIP; c01793e5 <find_next_zero_bit+76/a1>   <=====

>>eax; ffffffff <__kernel_rt_sigreturn+1bbf/????>
>>ebx; f88c6000 <pg0+38611000/3fd49400>
>>ecx; 07fffdfa <phys_startup_32+7effdfa/c0000000>
>>edi; f88c6000 <pg0+38611000/3fd49400>
>>ebp; ffffbf27 <pg0+3fd46f27/3fd49400>
>>esp; f7febec0 <pg0+37d36ec0/3fd49400>

Trace; c01cf8c6 <core_get_resync_work+27/55>
Trace; c01d00ad <__rh_recovery_prepare+14/84>
Trace; c01d0137 <rh_recovery_prepare+1a/3b>
Trace; c01d0363 <do_recovery+13/64>
Trace; c01d06ff <do_mirror+4e/69>
Trace; c01d0749 <do_work+2f/58>
Trace; c011dd81 <worker_thread+156/1ba>
Trace; c01d071a <do_work+0/58>
Trace; c010fce9 <default_wake_function+0/12>
Trace; c0215d5c <schedule+456/4bb>
Trace; c010fce9 <default_wake_function+0/12>
Trace; c011dc2b <worker_thread+0/1ba>
Trace; c0120c5e <kthread+79/a3>
Trace; c0120be5 <kthread+0/a3>
Trace; c01012b1 <kernel_thread_helper+5/b>

This architecture has variable length instructions, decoding before eip
is unreliable, take these instructions with a pinch of salt.

Code;  c01793ba <find_next_zero_bit+4b/a1>
00000000 <_EIP>:
Code;  c01793ba <find_next_zero_bit+4b/a1>
   0:   f0 eb 49                  lock jmp 4c <_EIP+0x4c>
Code;  c01793bd <find_next_zero_bit+4e/a1>
   3:   89 44 24 04               mov    %eax,0x4(%esp)
Code;  c01793c1 <find_next_zero_bit+52/a1>
   7:   83 c7 04                  add    $0x4,%edi
Code;  c01793c4 <find_next_zero_bit+55/a1>
   a:   c7 04 24 00 00 00 00      movl   $0x0,(%esp)
Code;  c01793cb <find_next_zero_bit+5c/a1>
  11:   89 f8                     mov    %edi,%eax
Code;  c01793cd <find_next_zero_bit+5e/a1>
  13:   29 d8                     sub    %ebx,%eax
Code;  c01793cf <find_next_zero_bit+60/a1>
  15:   c1 e0 03                  shl    $0x3,%eax
Code;  c01793d2 <find_next_zero_bit+63/a1>
  18:   29 c5                     sub    %eax,%ebp
Code;  c01793d4 <find_next_zero_bit+65/a1>
  1a:   74 26                     je     42 <_EIP+0x42>
Code;  c01793d6 <find_next_zero_bit+67/a1>
  1c:   8d 4d 1f                  lea    0x1f(%ebp),%ecx
Code;  c01793d9 <find_next_zero_bit+6a/a1>
  1f:   89 fb                     mov    %edi,%ebx
Code;  c01793db <find_next_zero_bit+6c/a1>
  21:   c1 e9 05                  shr    $0x5,%ecx
Code;  c01793de <find_next_zero_bit+6f/a1>
  24:   b8 ff ff ff ff            mov    $0xffffffff,%eax
Code;  c01793e3 <find_next_zero_bit+74/a1>
  29:   31 d2                     xor    %edx,%edx

This decode from eip onwards should be reliable

Code;  c01793e5 <find_next_zero_bit+76/a1>
00000000 <_EIP>:
Code;  c01793e5 <find_next_zero_bit+76/a1>   <=====
   0:   f3 af                     repz scas %es:(%edi),%eax   <=====
Code;  c01793e7 <find_next_zero_bit+78/a1>
   2:   74 09                     je     d <_EIP+0xd>
Code;  c01793e9 <find_next_zero_bit+7a/a1>
   4:   33 47 fc                  xor    0xfffffffc(%edi),%eax
Code;  c01793ec <find_next_zero_bit+7d/a1>
   7:   83 ef 04                  sub    $0x4,%edi
Code;  c01793ef <find_next_zero_bit+80/a1>
   a:   0f bc d0                  bsf    %eax,%edx
Code;  c01793f2 <find_next_zero_bit+83/a1>
   d:   29 df                     sub    %ebx,%edi
Code;  c01793f4 <find_next_zero_bit+85/a1>
   f:   c1 e7 03                  shl    $0x3,%edi
Code;  c01793f7 <find_next_zero_bit+88/a1>
  12:   01 fa                     add    %edi,%edx
Code;  c01793f9 <find_next_zero_bit+8a/a1>
  14:   89                        .byte 0x89


Interrupts:
           CPU0       
  0:      78885    IO-APIC-edge  timer
  1:        554    IO-APIC-edge  i8042
  9:          0   IO-APIC-level  acpi
 14:         21    IO-APIC-edge  ide0
 16:     375024   IO-APIC-level  eth0
 19:         35   IO-APIC-level  libata
 20:          0   IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2
 21:          0   IO-APIC-level  uhci_hcd:usb3
 22:          0   IO-APIC-level  uhci_hcd:usb4
 23:          0   IO-APIC-level  uhci_hcd:usb5
NMI:          0 
LOC:      78843 
ERR:          0
MIS:          0

/proc/iomem:
00000000-000987ff : System RAM
00098800-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000cb000-000cbfff : Adapter ROM
000cc000-000d11ff : Adapter ROM
000f0000-000fffff : System ROM
00100000-3f6dffff : System RAM
  00100000-00217be0 : Kernel code
  00217be1-0026fdd3 : Kernel data
3f6e0000-3f6ecfff : ACPI Tables
3f6ed000-3f6fffff : ACPI Non-volatile Storage
3f700000-3fffffff : reserved
e0000000-efffffff : 0000:00:02.0
f0000000-f0003fff : 0000:00:1b.0
f0004000-f00043ff : 0000:00:1d.7
  f0004000-f00043ff : ehci_hcd
f0004400-f00047ff : 0000:00:1f.2
  f0004400-f00047ff : ahci
f0040000-f007ffff : 0000:00:02.0
f0080000-f00fffff : 0000:00:02.0
f0100000-f01fffff : PCI Bus #05
  f0100000-f010ffff : 0000:05:00.0
    f0100000-f010ffff : tg3
fec00000-fecfffff : reserved
fee00000-feefffff : reserved
ffb80000-ffffffff : reserved

Module                  Size  Used by    Not tainted
ahci 5988 2 - Live 0xf885a000
ata_piix 3716 0 - Live 0xf8802000
libata 30796 2 ahci,ata_piix, Live 0xf88a5000
sr_mod 9412 0 - Live 0xf887b000
sd_mod 8240 2 - Live 0xf88a1000
usb_storage 18980 0 - Live 0xf88ae000
scsi_mod 47720 5 ahci,libata,sr_mod,sd_mod,usb_storage, Live 0xf884d000
ide_floppy 10400 0 - Live 0xf8844000
ide_cd 27908 0 - Live 0xf8899000
cdrom 27264 2 sr_mod,ide_cd, Live 0xf8891000
ide_disk 8800 0 - Live 0xf8849000
floppy 40676 0 - Live 0xf8870000
tg3 72772 0 - Live 0xf885d000
crc32 2912 0 - Live 0xf8842000
mii 2528 0 - Live 0xf8840000
uhci_hcd 20976 0 - Live 0xf8812000
ohci_hcd 12036 0 - Live 0xf8828000
ehci_hcd 20520 0 - Live 0xf8821000
usbmouse 2272 0 - Live 0xf880e000
usbkbd 3552 0 - Live 0xf8805000
usbhid 26272 0 - Live 0xf8819000
usbcore 70692 8 usb_storage,uhci_hcd,ohci_hcd,ehci_hcd,usbmouse,usbkbd,usbhid,
Live 0xf882d000

lspci -vv:
0000:00:00.0 0600: 8086:2770 (rev 02)
	Subsystem: 1734:1084
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort+ >SERR- <PERR-
	Latency: 0
	Capabilities: [e0] #09 [5109]

0000:00:02.0 0300: 8086:2772 (rev 02)
	Subsystem: 1734:1084
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at f0080000 (32-bit, non-prefetchable) [size=512K]
	Region 1: I/O ports at 2000 [size=8]
	Region 2: Memory at e0000000 (32-bit, prefetchable) [size=256M]
	Region 3: Memory at f0040000 (32-bit, non-prefetchable) [size=256K]
	Capabilities: [90] Message Signalled Interrupts: 64bit- Queue=0/0 Enable-
		Address: 00000000  Data: 0000
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0000:00:1b.0 0403: 8086:27d8 (rev 01)
	Subsystem: 1734:1083
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 0x08 (32 bytes)
	Interrupt: pin A routed to IRQ 9
	Region 0: Memory at f0000000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
		Address: 0000000000000000  Data: 0000
	Capabilities: [70] #10 [0091]

0000:00:1c.0 0604: 8086:27d0 (rev 01)
	Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 0x08 (32 bytes)
	Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
	BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
	Capabilities: [40] #10 [0141]
	Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable-
		Address: 00000000  Data: 0000
	Capabilities: [90] #0d [0000]
	Capabilities: [a0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0000:00:1c.1 0604: 8086:27d2 (rev 01)
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 0x08 (32 bytes)
	Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
	Memory behind bridge: f0100000-f01fffff
	BridgeCtl: Parity- SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
	Capabilities: [40] #10 [0141]
	Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable-
		Address: 00000000  Data: 0000
	Capabilities: [90] #0d [0000]
	Capabilities: [a0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0000:00:1c.2 0604: 8086:27d4 (rev 01)
	Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 0x08 (32 bytes)
	Bus: primary=00, secondary=07, subordinate=07, sec-latency=0
	BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
	Capabilities: [40] #10 [0141]
	Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable-
		Address: 00000000  Data: 0000
	Capabilities: [90] #0d [0000]
	Capabilities: [a0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0000:00:1c.3 0604: 8086:27d6 (rev 01)
	Control: I/O- Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 0x08 (32 bytes)
	Bus: primary=00, secondary=09, subordinate=09, sec-latency=0
	BridgeCtl: Parity- SERR+ NoISA- VGA- MAbort- >Reset- FastB2B-
	Capabilities: [40] #10 [0141]
	Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable-
		Address: 00000000  Data: 0000
	Capabilities: [90] #0d [0000]
	Capabilities: [a0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0000:00:1d.0 0c03: 8086:27c8 (rev 01)
	Subsystem: 1734:1085
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin A routed to IRQ 20
	Region 4: I/O ports at 2400 [size=32]

0000:00:1d.1 0c03: 8086:27c9 (rev 01)
	Subsystem: 1734:1085
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin B routed to IRQ 21
	Region 4: I/O ports at 2800 [size=32]

0000:00:1d.2 0c03: 8086:27ca (rev 01)
	Subsystem: 1734:1085
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin C routed to IRQ 22
	Region 4: I/O ports at 2c00 [size=32]

0000:00:1d.3 0c03: 8086:27cb (rev 01)
	Subsystem: 1734:1085
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin D routed to IRQ 23
	Region 4: I/O ports at 3000 [size=32]

0000:00:1d.7 0c03: 8086:27cc (rev 01) (prog-if 20)
	Subsystem: 1734:1085
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin A routed to IRQ 20
	Region 0: Memory at f0004000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-

0000:00:1e.0 0604: 8086:244e (rev e1) (prog-if 01)
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Bus: primary=00, secondary=0b, subordinate=0b, sec-latency=0
	BridgeCtl: Parity- SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B-
	Capabilities: [50] #0d [0000]

0000:00:1f.0 0601: 8086:27b8 (rev 01)
	Subsystem: 1734:1085
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Capabilities: [e0] #09 [100c]

0000:00:1f.1 0101: 8086:27df (rev 01) (prog-if 8a [Master SecP PriP])
	Subsystem: 1734:1085
	Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin A routed to IRQ 18
	Region 0: I/O ports at <unassigned>
	Region 1: I/O ports at <unassigned>
	Region 2: I/O ports at <unassigned>
	Region 3: I/O ports at <unassigned>
	Region 4: I/O ports at 3820 [size=16]

0000:00:1f.2 0104: 8086:27c3 (rev 01)
	Subsystem: 1734:1085
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR- FastB2B-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0
	Interrupt: pin B routed to IRQ 19
	Region 0: I/O ports at 3840 [size=8]
	Region 1: I/O ports at 3834 [size=4]
	Region 2: I/O ports at 3838 [size=8]
	Region 3: I/O ports at 3830 [size=4]
	Region 4: I/O ports at 3800 [size=32]
	Region 5: Memory at f0004400 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable-
		Address: 00000000  Data: 0000
	Capabilities: [70] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a8] #12 [0010]

0000:00:1f.3 0c05: 8086:27da (rev 01)
	Subsystem: 1734:1085
	Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Interrupt: pin B routed to IRQ 9
	Region 4: I/O ports at 3400 [size=32]

0000:05:00.0 0200: 14e4:1677 (rev 01)
	Subsystem: 1734:1096
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping-
SERR+ FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort-
<MAbort- >SERR- <PERR-
	Latency: 0, Cache Line Size: 0x08 (32 bytes)
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at f0100000 (64-bit, non-prefetchable) [size=64K]
	Capabilities: [48] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
	Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-
		Address: 6600080004090000  Data: 0013
	Capabilities: [d0] #10 [0001]


/proc/partitions:
major minor  #blocks  name

   8     0  117220824 sda
   8     1   15719571 sda1
   8     2          1 sda2
   8     5  101496087 sda5
   8    16  117220824 sdb
   8    17   15719571 sdb1
   8    18          1 sdb2
   8    21  101496087 sdb5
 254     0  117218304 sdp
 254     1   15719571 sdp1
 254     5  101496087 sdp5

ps:
  PID  Uid     VmSize Stat Command
    1 root        544 S   init       
    2 root            SWN [ksoftirqd/0]
    3 root            SW< [events/0]
    4 root            SW< [khelper]
    5 root            SW< [kthread]
    7 root            SW< [kblockd/0]
    8 root            SW< [kacpid]
  126 root            SW  [pdflush]
  127 root            SW  [pdflush]
  129 root            SW< [aio/0]
  128 root            SW  [kswapd0]
  200 root            SW< [kseriod]
  251 root        432 S   syslogd 
  253 root        428 S   klogd -c 1 
  261 root       1180 S   /bin/bash /opt/sbin/run 
  262 root       1440 R   /bin/bash -login 
  265 root       1400 S   /bin/bash -login 
  266 root       1404 S   /bin/bash -login 
  267 root       1404 S   /bin/bash -login 
  268 root       1396 S   /bin/bash -login 
  310 root            SW< [khubd]
  437 root        156 S   udhcpc -s /etc/udhcpc/default.script -i eth0 
  441 root        880 S   linuxvnc 1 
  520 root            SW< [ata/0]
  535 root            SW< [scsi_eh_0]
  536 root            SW< [scsi_eh_1]
  537 root            SW< [scsi_eh_2]
  538 root            SW< [scsi_eh_3]
  541 root        424 S   dropbear 
  715 root       1328 S   /bin/sh /opt/sbin/swfill 
  737 root       1008 S   smbmount //172.16.1.115/runin /mnt -o username Admini
  739 root            SW  [smbiod]
  794 root            SW< [kcopyd]
  799 root        752 S   /bin/sh /opt/sbin/swfill 
  802 root       1620 D   partimage restore -f3 -b -g1 /dev/sdp1 /mnt/images/a5
  803 root       1620 S   partimage restore -f3 -b -g1 /dev/sdp1 /mnt/images/a5
  804 root       1620 S   partimage restore -f3 -b -g1 /dev/sdp1 /mnt/images/a5
  980 root        348 S   sleep 1 
  981 root        612 R   ps 

Linux Client 2.6.15-rc5 #1 Mon Dec 12 10:02:35 CET 2005 i686 unknown
Comment 1 Neil Brown 2005-12-12 14:39:42 UTC
Based on the stack trace, this is DM, not MD.
Reassigning....
Comment 2 Alasdair G Kergon 2005-12-12 14:45:55 UTC
What's your dm configuration?

Run

# dmsetup info -c
# dmsetup table
# dmsetup status

(might be dmsetup.static in your environment)
Comment 3 Hans-Joachim Baader 2005-12-13 02:56:00 UTC
Here's the requested dmsetup info. I've also attached a better lspci output.

dmsetup info -c
Name             Maj Min Stat Open Targ Event  UUID
isw_cfdbggdhji_Volume05 254   5 L--w    0    1      0 
isw_cfdbggdhji_Volume0 254   0 L--w    2    1      0 
isw_cfdbggdhji_Volume01 254   1 L--w    1    1      0 

dmsetup table
isw_cfdbggdhji_Volume05: 0 202992174 linear 254:0 31439268
isw_cfdbggdhji_Volume0: 0 234436608 mirror core 2 2048 nosync 2 8:0 0 8:16 0 
isw_cfdbggdhji_Volume01: 0 31439142 linear 254:0 63

dmsetup status
isw_cfdbggdhji_Volume05: 0 202992174 linear 
isw_cfdbggdhji_Volume0: 0 234436608 mirror 2 8:0 8:16 114471/114471
isw_cfdbggdhji_Volume01: 0 31439142 linear 
Comment 4 Hans-Joachim Baader 2005-12-13 02:57:03 UTC
Created attachment 6809 [details]
Output of lspci -vv
Comment 5 Hans-Joachim Baader 2005-12-20 03:19:45 UTC
Created attachment 6863 [details]
Stacktrace of partimage processes

This is a stack trace of the partimage processes, the first of which is stuck
in 'D' state. Obtained with sysrq.
Comment 6 Hans-Joachim Baader 2005-12-20 03:24:22 UTC
I have some additional info. The oops occurs earlier than I thought; perhaps
during definition of the DM RAID or shortly after. Also, I see "Access beyond
end of device" of both /dev/sda and /dev/sdb in the log (couldn't save that log
yet).

I added kdb to the kernel, but the kernel didn't break into kdb on the Oops. I
thought it should. What could have gone wrong?
Comment 7 Hans-Joachim Baader 2005-12-23 08:45:12 UTC
I did some debugging today.
The loop in core_get_resync_work is started with
  lc->sync_bits = 0xf88db000 (the value contained in %edi)
  lc->region_count = 114471
  lc->sync_search = 0
After the first call of find_next_zero_bit, *region becomes 114496
and hence, lc->sync_search becomes 114497. Since this is not equal
to lc->region_count the loop continues. But it is already larger than
lc->region_count and out of range.

The following patch fixes the problem for me:

--- linux-2.6.14.2/drivers/md/dm-log.c  2005-10-28 02:02:08.000000000 +0200
+++ linux-2.6.15-rc5/drivers/md/dm-log.c        2005-12-23 15:36:39.000000000 +0100
@@ -573,7 +573,7 @@
                                             lc->sync_search);
                lc->sync_search = *region + 1;

-               if (*region == lc->region_count)
+               if (*region >= lc->region_count)
                        return 0;

        } while (log_test_bit(lc->recovering_bits, *region));

I wonder why one would only test for equality. Is find_next_zero bit supposed to
only return a result within the range? Or was it once, but has been changed
recently? Or is the real problem elsewhere, perhaps in incorrect drive geometry?
Comment 8 Aleksey Gorelov 2006-02-06 16:35:10 UTC
Hans-Joachim's patch fixes the problem for me as well.
Comment 9 Suzuki 2006-02-17 02:37:47 UTC
The patch attached here is in (atleast) 2.6.16-rc3. Please close this report.

Thanks.

Note You need to log in before you can comment on or make changes to this bug.