Bug 12905

Summary: heavy I/O and CRC errors causes dm-raid1 to freeze/lock up /dev/md1
Product: Memory Management Reporter: Mark (bugzilla-kernel-org-sux2k0)
Component: Page AllocatorAssignee: Andrew Morton (akpm)
Status: RESOLVED OBSOLETE    
Severity: blocking CC: alan, bugzilla-kernel-org-sux2k0, dustins.kernel.spam
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.25/7+Xen Subsystem:
Regression: No Bisected commit-id:

Description Mark 2009-03-20 18:49:36 UTC
Latest working kernel version: unknown
Earliest failing kernel version: 2.6.25.18
Distribution: OpenSUSE 11.0/11.1, plain kernel.org kernel (2.6.28.8)
Hardware Environment: AMD Opteron HExxxx, 1G-16G RAM, unicore, dualcore, quadcore, different motherboards
Software Environment: OpenSUSE 11.0/11.1, xen and non-xen
Problem Description: heavy I/O *always* leads to lock up of the stressed md device. Underlying hardware devices are still accessible. Other md devices still work.

Steps to reproduce:

1.) force a resync of /dev/md1 (dm-raid1 of two SATA disks)
2.) put some (say 4) LVM2 LVs on top of /dev/md1, start backup of at least one of them by dumping a snapshot of it.
3.) mount the 4 LVs and run "stress -d 1 --hdd-bytes 10G" on them.

Wait for the resync process to converge to zero after some time. Commands like lvs, lvscan do not work any more. The OS on /dev/md0 (also dm-raid1) still works. BTW: the snapshot is on dm-raid0 on the same physical two SATA disks. There are also two SWAP partitions on these disks (/dev/sdaX and /dev/sdbX, not on dm/md).

The stress processes are in disk sleep state. Load is very high, although no active processes are shown in top. the md1_resync process is also in disk sleep state. For me, it looks like raise_barrier of md1_resync and wait_barrier of the stress processes are blocking each other (lost interrupt??)

# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:         80          0          0          0   IO-APIC-edge      timer
  1:          1          0          0          2   IO-APIC-edge      i8042
  2:          0          0          0          0    XT-PIC-XT        cascade
  4:          0          0        222        496   IO-APIC-edge      serial
  7:          1          0          0          0   IO-APIC-edge
  8:          0          0          0          0   IO-APIC-edge      rtc0
 10:    1499953          0          9        920   IO-APIC-fasteoi   eth0
 11:          0    4017105         48      16716   IO-APIC-fasteoi   sata_nv
 12:          1          0          0          0   IO-APIC-edge
 14:          0          0          0          0   IO-APIC-edge      pata_amd
 15:          0          0          0          0   IO-APIC-edge      pata_amd
NMI:          0          0          0          0   Non-maskable interrupts
LOC:     979008     979128     979088     978976   Local timer interrupts
RES:   19914900     363106   21062568   20373212   Rescheduling interrupts
CAL:        109         83        115         62   Function call interrupts
TLB:       1048       1179       1229       1061   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          1
MIS:          0

# cat /proc/`pidof md1_resync`/status
Name:   md1_resync
State:  D (disk sleep)
Tgid:   3646
Pid:    3646
PPid:   2
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:
Threads:        1
SigQ:   0/72704
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: fffffffffffffeff
SigCgt: 0000000000000100
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: fffffffffffffeff
CapBnd: ffffffffffffffff
voluntary_ctxt_switches:        14066877
nonvoluntary_ctxt_switches:     165518

# echo t > /proc/sysrq-trigger
# dmesg
md1_resync    D ffff8802349057b0     0  3646      2
 ffff880231117c00 0000000000000046 ffff880231117b60 ffffffff802674a9
 ffffffff806df000 0000000000000001 ffff8802320321c0 ffff8802320324f8
 0000000131117b90 ffffffff804d3eb2 ffff880235550000 0000000000000000
Call Trace:
 [<ffffffff802674a9>] ? trace_hardirqs_on_caller+0x129/0x190
 [<ffffffff804d3eb2>] ? _spin_unlock_irqrestore+0x42/0x80
 [<ffffffff80231bce>] ? __wake_up+0x4e/0x70
 [<ffffffff8041a67c>] raise_barrier+0xac/0x1c0
 [<ffffffff80239b00>] ? default_wake_function+0x0/0x10
 [<ffffffff8041b669>] sync_request+0x139/0x720
 [<ffffffff8041a5c4>] ? raid1_unplug+0x24/0x30
 [<ffffffff8034620c>] ? blk_unplug+0x5c/0x70
 [<ffffffff80424adb>] md_do_sync+0x5ab/0xbb0
 [<ffffffff8026751d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff804256b4>] md_thread+0x54/0x140
 [<ffffffff8026751d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff804d3eb2>] ? _spin_unlock_irqrestore+0x42/0x80
 [<ffffffff80425660>] ? md_thread+0x0/0x140
 [<ffffffff802568b9>] kthread+0x49/0x80
 [<ffffffff8020d529>] child_rip+0xa/0x11
 [<ffffffff8020ca18>] ? restore_args+0x0/0x30
 [<ffffffff80256870>] ? kthread+0x0/0x80
 [<ffffffff8020d51f>] ? child_rip+0x0/0x11

stress        D ffff8801c607d498     0  4244   4230
 ffff8801c607d488 0000000000000046 ffff8801c607d3e8 ffffffff802674a9
 ffffffff806df000 0000000000000001 ffff8801c6680000 ffff8801c6680338
 00000001c607d418 000000010008345d ffff88023722c380 0000000000000000
Call Trace:
 [<ffffffff802674a9>] ? trace_hardirqs_on_caller+0x129/0x190
 [<ffffffff80231bce>] ? __wake_up+0x4e/0x70
 [<ffffffff8041a87c>] wait_barrier+0xec/0x130
 [<ffffffff80239b00>] ? default_wake_function+0x0/0x10
 [<ffffffff804d3ed5>] ? _spin_unlock_irqrestore+0x65/0x80
 [<ffffffff8041e16c>] make_request+0x5c/0x9a0
 [<ffffffff804d3eb2>] ? _spin_unlock_irqrestore+0x42/0x80
 [<ffffffff8025b406>] ? up_read+0x26/0x30
 [<ffffffff80346ca6>] generic_make_request+0x346/0x450
 [<ffffffff80291786>] ? mempool_alloc+0x56/0x140
 [<ffffffff80291786>] ? mempool_alloc+0x56/0x140
 [<ffffffff80346e62>] submit_bio+0xb2/0x160
 [<ffffffff802ed2ad>] ? bio_alloc_bioset+0x5d/0x100
 [<ffffffff802e7c9e>] submit_bh+0xee/0x130
 [<ffffffff802e936d>] bh_submit_read+0x2d/0x90
 [<ffffffffa0037c7b>] read_block_bitmap+0x9b/0x170 [ext3]
 [<ffffffffa0038fc2>] ext3_new_blocks+0x2c2/0x850 [ext3]
 [<ffffffffa003de65>] ext3_get_blocks_handle+0x345/0xc70 [ext3]
 [<ffffffff80212df9>] ? native_sched_clock+0x39/0x80
 [<ffffffff80264d24>] ? get_lock_stats+0x34/0x80
 [<ffffffffa003e809>] ext3_get_block+0x79/0x130 [ext3]
 [<ffffffff804d3f20>] ? _spin_unlock+0x30/0x60
 [<ffffffff802ea347>] __block_prepare_write+0x227/0x660
 [<ffffffffa003e790>] ? ext3_get_block+0x0/0x130 [ext3]
 [<ffffffff802ea909>] block_write_begin+0x59/0xf0
 [<ffffffffa003d973>] ext3_write_begin+0x123/0x200 [ext3]
 [<ffffffffa003e790>] ? ext3_get_block+0x0/0x130 [ext3]
 [<ffffffff802901fb>] generic_file_buffered_write+0x19b/0x330
 [<ffffffff80290821>] __generic_file_aio_write_nolock+0x261/0x450
 [<ffffffff804d238d>] ? __mutex_unlock_slowpath+0x10d/0x180
 [<ffffffff802913d9>] generic_file_aio_write+0x69/0xd0
 [<ffffffffa003a0f6>] ext3_file_write+0x26/0xc0 [ext3]
 [<ffffffff802c3be1>] do_sync_write+0xf1/0x140
 [<ffffffff80256d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff802c467b>] vfs_write+0xcb/0x190
 [<ffffffff802c4830>] sys_write+0x50/0x90
 [<ffffffff8020c3db>] system_call_fastpath+0x16/0x1b
Comment 1 Alan 2009-03-21 07:22:56 UTC
What I/O controllers are in use and are there any logged entries about I/O errors ?
Comment 2 Mark 2009-03-21 07:57:10 UTC
It is really a bunch of different hardware. The only common thing they share is the AMD Opteron architecture.

lspci for different machines:

# lspci
00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:04.0 VGA compatible controller: XGI Technology Inc. (eXtreme Graphics Innovation) Volari Z7/Z9/Z9s

# lspci
00:00.0 Host bridge: ATI Technologies Inc RS690 Host Bridge
00:01.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (Internal gfx)
00:04.0 PCI bridge: ATI Technologies Inc Device 7914
00:05.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (PCI Express Port 1)
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
00:13.0 USB Controller: ATI Technologies Inc SB600 USB (OHCI0)
00:13.1 USB Controller: ATI Technologies Inc SB600 USB (OHCI1)
00:13.2 USB Controller: ATI Technologies Inc SB600 USB (OHCI2)
00:13.3 USB Controller: ATI Technologies Inc SB600 USB (OHCI3)
00:13.4 USB Controller: ATI Technologies Inc SB600 USB (OHCI4)
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 14)
00:14.1 IDE interface: ATI Technologies Inc SB600 IDE
00:14.3 ISA bridge: ATI Technologies Inc SB600 PCI to LPC Bridge
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:05.0 VGA compatible controller: ATI Technologies Inc RS690M [Radeon X1200 Series]
02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
03:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

# lspci
00:01.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge
00:02.0 Host bridge: Broadcom BCM5785 [HT1000] Legacy South Bridge
00:02.2 ISA bridge: Broadcom BCM5785 [HT1000] LPC
00:03.0 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.1 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.2 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:05.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:0d.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge (rev b2)
01:0e.0 RAID bus controller: Broadcom BCM5785 [HT1000] SATA (Native SATA Mode)
02:03.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:03.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)

# lspci
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:02.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:03.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:05.0 VGA compatible controller: nVidia Corporation C51 [GeForce 6150 LE] (rev a2)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev f1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev f1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev f1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
04:07.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)

# lspci
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:0b.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
01:0d.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5705_2 Gigabit Ethernet (rev 03)
01:0e.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5705_2 Gigabit Ethernet (rev 03)
Comment 3 Mark 2009-03-21 07:59:19 UTC
No, there are no I/O errors. Just some expected buffer I/O problems because of invalidated (full) LVM2 snapshots.
Comment 4 Alan 2009-03-21 08:12:48 UTC
This appears to be a VM deadlock so for first guess assigning to the mm/vm folks
Comment 5 Mark 2009-03-22 15:53:31 UTC
Enabled some kernel debug options in 2.6.28.8:

/var/log/messages:

Mar 22 21:24:02 h1450662 lvm[1301]: Snapshot vg--xen--vm--backup--snap is now 95% full.
Mar 22 21:28:44 h1450662 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Mar 22 21:28:44 h1450662 kernel:                                                                               
Mar 22 21:28:44 h1450662 kernel: =======================================================                       
Mar 22 21:28:44 h1450662 kernel: [ INFO: possible circular locking dependency detected ]                       
Mar 22 21:28:44 h1450662 kernel: 2.6.28.8-3.2-default #3                                                       
Mar 22 21:28:44 h1450662 kernel: -------------------------------------------------------                       
Mar 22 21:28:44 h1450662 kernel: pdflush/7307 is trying to acquire lock:                                       
Mar 22 21:28:44 h1450662 kernel:  (ksnaphd){--..}, at: [<ffffffff80256670>] flush_workqueue+0x0/0xb0           
Mar 22 21:28:44 h1450662 kernel:                                                                               
Mar 22 21:28:44 h1450662 kernel: but task is already holding lock:                                             
Mar 22 21:28:44 h1450662 kernel:  (&s->lock){--..}, at: [<ffffffff80466fa7>] __find_pending_exception+0xa7/0x1e0
Mar 22 21:28:44 h1450662 kernel:                                                                                
Mar 22 21:28:44 h1450662 kernel: which lock already depends on the new lock.                                    
Mar 22 21:28:44 h1450662 kernel:                                                                                
Mar 22 21:28:44 h1450662 kernel:                                                                                
Mar 22 21:28:44 h1450662 kernel: the existing dependency chain (in reverse order) is:                           
Mar 22 21:28:47 h1450662 kernel:                                                                                
Mar 22 21:28:47 h1450662 kernel: -> #3 (&s->lock){--..}:                                                        
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff8026c00a>] __lock_acquire+0x11ea/0x1bc0                       
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff8026ca71>] lock_acquire+0x91/0xc0                             
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff80504a63>] down_write+0x43/0x80                               
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff80467b49>] snapshot_map+0x79/0x320                            
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff8045f422>] __map_bio+0x42/0x160                               
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff8046062c>] __split_bio+0x4bc/0x590                            
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff804609ac>] dm_request+0x1fc/0x2a0                             
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff80348586>] generic_make_request+0x346/0x450                   
Mar 22 21:28:45 h1450662 lvm[1301]: Snapshot vg--xen--hwn-lvol--xen--vm--1466362--backup--snap changed state to: Invalid
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff8034872b>] submit_bio+0x9b/0x150                                      
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff802e978e>] submit_bh+0xee/0x130                                       
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff802ed8c1>] block_read_full_page+0x2e1/0x3d0                           
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff802f1083>] blkdev_readpage+0x13/0x20                                  
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff80299ce2>] __do_page_cache_readahead+0x212/0x290                      
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff8029a11b>] ondemand_readahead+0x1eb/0x270                             
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff8029a275>] page_cache_sync_readahead+0x25/0x30                        
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff8029278c>] generic_file_aio_read+0x43c/0x6b0                          
Mar 22 21:28:51 h1450662 lvm[1301]: No longer monitoring snapshot vg--xen--hwn-lvol--xen--vm--1466362--backup--snap     
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff802c58e1>] do_sync_read+0xf1/0x140                                    
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff802c6448>] vfs_read+0xc8/0x170                                        
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff802c65e0>] sys_read+0x50/0x90                                         
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff8020c3cb>] system_call_fastpath+0x16/0x1b                             
Mar 22 21:28:53 h1450662 kernel:        [<ffffffffffffffff>] 0xffffffffffffffff                                         
Mar 22 21:28:53 h1450662 kernel:                                                                                        
Mar 22 21:28:53 h1450662 kernel: -> #2 (&md->io_lock){----}:                                                            
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff8026c00a>] __lock_acquire+0x11ea/0x1bc0                               
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff8026ca71>] lock_acquire+0x91/0xc0                                     
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff80504ae6>] down_read+0x46/0x80                                        
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff804607dc>] dm_request+0x2c/0x2a0                                      
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80348586>] generic_make_request+0x346/0x450                           
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff8034872b>] submit_bio+0x9b/0x150                                      
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80465bff>] dispatch_io+0x1bf/0x1f0                                    
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80465e35>] dm_io+0x185/0x2c0                                          
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80468b54>] do_metadata+0x24/0x30                                      
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80255b7c>] run_workqueue+0xfc/0x240                                   
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80256080>] worker_thread+0x60/0xb0                                    
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff8025a1e9>] kthread+0x49/0x80                                          
Mar 22 21:28:53 h1450662 lvm[1301]: Unlocking memory                                                                    
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff8020d519>] child_rip+0xa/0x11                                         
Mar 22 21:29:00 h1450662 kernel:        [<ffffffffffffffff>] 0xffffffffffffffff                                         
Mar 22 21:29:00 h1450662 kernel:                                                                                        
Mar 22 21:29:00 h1450662 kernel: -> #1 (&req.work){--..}:                                                               
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff8026c00a>] __lock_acquire+0x11ea/0x1bc0                               
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff8026ca71>] lock_acquire+0x91/0xc0                                     
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff80255b76>] run_workqueue+0xf6/0x240                                   
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff80256080>] worker_thread+0x60/0xb0                                    
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff8025a1e9>] kthread+0x49/0x80                                          
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff8020d519>] child_rip+0xa/0x11                                         
Mar 22 21:29:05 h1450662 kernel:        [<ffffffffffffffff>] 0xffffffffffffffff                                         
Mar 22 21:29:05 h1450662 kernel:                                                                                        
Mar 22 21:29:05 h1450662 kernel: -> #0 (ksnaphd){--..}:                                                                 
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff8026c188>] __lock_acquire+0x1368/0x1bc0                               
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff8026ca71>] lock_acquire+0x91/0xc0                                     
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff802566cd>] flush_workqueue+0x5d/0xb0                                  
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff80468cc3>] chunk_io+0xe3/0x120                                        
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff80468d5c>] write_header+0x5c/0x70                                     
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff80468d84>] persistent_drop+0x14/0x30                                  
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff80466eb5>] __invalidate_snapshot+0x35/0x80                            
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff804674b5>] origin_map+0x3d5/0x400                                     
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff8045f422>] __map_bio+0x42/0x160                                       
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff8046058e>] __split_bio+0x41e/0x590                                    
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff804609ac>] dm_request+0x1fc/0x2a0                                     
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff80348586>] generic_make_request+0x346/0x450                           
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff8034872b>] submit_bio+0x9b/0x150                                      
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff802e978e>] submit_bh+0xee/0x130                                       
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff802ecd01>] __block_write_full_page+0x1e1/0x3b0                        
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff802ecff9>] block_write_full_page+0x129/0x140                          
Mar 22 21:29:08 h1450662 kernel:        [<ffffffffa0034086>] ext3_ordered_writepage+0xc6/0x190 [ext3]                   
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff80297be5>] __writepage+0x15/0x50                                      
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff80298827>] write_cache_pages+0x237/0x480                              
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff80298a92>] generic_writepages+0x22/0x30                               
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff80298add>] do_writepages+0x3d/0x50                                    
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802e488b>] __writeback_single_inode+0x9b/0x410                        
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802e51e4>] generic_sync_sb_inodes+0x364/0x440                         
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802e54a5>] writeback_inodes+0x65/0x100                                
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802991a0>] background_writeout+0xb0/0xe0                              
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802998d0>] pdflush+0x110/0x1f0                                        
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff8025a1e9>] kthread+0x49/0x80                                          
Mar 22 21:29:11 h1450662 kernel:        [<ffffffff8020d519>] child_rip+0xa/0x11                                         
Mar 22 21:29:11 h1450662 kernel:        [<ffffffffffffffff>] 0xffffffffffffffff                                         
Mar 22 21:29:11 h1450662 kernel:                                                                                        
Mar 22 21:29:11 h1450662 kernel: other info that might help us debug this:                                              
Mar 22 21:29:11 h1450662 kernel:                                                                                        
Mar 22 21:29:11 h1450662 kernel: 5 locks held by pdflush/7307:                                                          
Mar 22 21:29:11 h1450662 kernel:  #0:  (&type->s_umount_key#13){----}, at: [<ffffffff802e548e>] writeback_inodes+0x4e/0x100
Mar 22 21:29:11 h1450662 kernel:  #1:  (jbd_handle){--..}, at: [<ffffffffa0019e9e>] journal_start+0xce/0x140 [jbd]         
Mar 22 21:29:11 h1450662 kernel:  #2:  (&md->io_lock){----}, at: [<ffffffff804607dc>] dm_request+0x2c/0x2a0                
Mar 22 21:29:11 h1450662 kernel:  #3:  (&_origins_lock){----}, at: [<ffffffff8046713c>] origin_map+0x5c/0x400              
Mar 22 21:29:14 h1450662 kernel:  #4:  (&s->lock){--..}, at: [<ffffffff80466fa7>] __find_pending_exception+0xa7/0x1e0      
Mar 22 21:29:14 h1450662 kernel:                                                                                           
Mar 22 21:29:14 h1450662 kernel: stack backtrace:                                                                          
Mar 22 21:29:14 h1450662 kernel: Pid: 7307, comm: pdflush Not tainted 2.6.28.8-3.2-default #3                              
Mar 22 21:29:14 h1450662 kernel: Call Trace:                                                                               
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff8026aa08>] print_circular_bug_tail+0xd8/0xe0                                   
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff8026c188>] __lock_acquire+0x1368/0x1bc0                                        
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff8023acb8>] ? finish_task_switch+0x68/0xf0                                      
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff8023ac50>] ? finish_task_switch+0x0/0xf0                                       
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff80502da0>] ? thread_return+0xa3/0x723                                          
Mar 22 21:29:00 h1450662 lvm[1301]: memlock_count dec to 0                                                                 
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff8020ca08>] ? restore_args+0x0/0x30                                             
Mar 22 21:29:16 h1450662 lvm[1301]: Dumping persistent device cache to /etc/lvm/.cache                                     
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff8026ca71>] lock_acquire+0x91/0xc0                                              
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff80256670>] ? flush_workqueue+0x0/0xb0                                          
Mar 22 21:29:16 h1450662 lvm[1301]: Locking /etc/lvm/.cache (F_WRLCK, 1)                                                   
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff802566cd>] flush_workqueue+0x5d/0xb0                                           
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff80256670>] ? flush_workqueue+0x0/0xb0                                          
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff80468cc3>] chunk_io+0xe3/0x120                                                 
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff80468b30>] ? do_metadata+0x0/0x30                                              
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff8026a3a9>] ? trace_hardirqs_on_caller+0x129/0x190                              
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff80468d5c>] write_header+0x5c/0x70                                              
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff80468d84>] persistent_drop+0x14/0x30                                           
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff80466eb5>] __invalidate_snapshot+0x35/0x80                                     
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff804674b5>] origin_map+0x3d5/0x400                                              
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff802930c6>] ? mempool_alloc+0x56/0x140                                          
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff8045f422>] __map_bio+0x42/0x160                                                
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff802eed7d>] ? bio_alloc_bioset+0x5d/0x100                                       
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff8046058e>] __split_bio+0x41e/0x590                                             
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff804601ab>] ? __split_bio+0x3b/0x590                                            
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff80267c24>] ? get_lock_stats+0x34/0x80                                          
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff804609ac>] dm_request+0x1fc/0x2a0                                              
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff804607d4>] ? dm_request+0x24/0x2a0                                             
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff80348586>] generic_make_request+0x346/0x450                                    
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802930c6>] ? mempool_alloc+0x56/0x140                                          
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802930c6>] ? mempool_alloc+0x56/0x140                                          
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff8034872b>] submit_bio+0x9b/0x150                                               
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802eed7d>] ? bio_alloc_bioset+0x5d/0x100                                       
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802e978e>] submit_bh+0xee/0x130                                                
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802ecd01>] __block_write_full_page+0x1e1/0x3b0                                 
Mar 22 21:29:22 h1450662 kernel:  [<ffffffffa0035790>] ? ext3_get_block+0x0/0x130 [ext3]                                   
Mar 22 21:29:22 h1450662 kernel:  [<ffffffffa0035790>] ? ext3_get_block+0x0/0x130 [ext3]                                   
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802ecff9>] block_write_full_page+0x129/0x140                                   
Mar 22 21:29:22 h1450662 kernel:  [<ffffffffa0034086>] ext3_ordered_writepage+0xc6/0x190 [ext3]                            
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80297be5>] __writepage+0x15/0x50                                               
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80298827>] write_cache_pages+0x237/0x480                                       
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80297bd0>] ? __writepage+0x0/0x50                                              
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802e4880>] ? __writeback_single_inode+0x90/0x410                               
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80298a92>] generic_writepages+0x22/0x30                                        
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80298add>] do_writepages+0x3d/0x50                                             
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802e488b>] __writeback_single_inode+0x9b/0x410                                 
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80506100>] ? _read_unlock+0x30/0x60                                            
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802e51e4>] generic_sync_sb_inodes+0x364/0x440                                  
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802e54a5>] writeback_inodes+0x65/0x100                                         
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802991a0>] background_writeout+0xb0/0xe0                                       
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802998d0>] pdflush+0x110/0x1f0                                                 
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802990f0>] ? background_writeout+0x0/0xe0                                      
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802997c0>] ? pdflush+0x0/0x1f0                                                 
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8025a1e9>] kthread+0x49/0x80                                                   
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8020d519>] child_rip+0xa/0x11                                                  
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8020ca08>] ? restore_args+0x0/0x30                                             
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8025a1a0>] ? kthread+0x0/0x80                                                  
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8020d50f>] ? child_rip+0x0/0x11                                                
Mar 22 21:29:22 h1450662 kernel: __ratelimit: 7 callbacks suppressed                                                       
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683024                                    
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683025                                    
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683026                                    
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683027                                    
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683028                                    
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683029                                    
Mar 22 21:29:23 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683030                                    
Mar 22 21:29:23 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683031                                    
Mar 22 21:29:23 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683032                                    
Mar 22 21:29:23 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683033                                    
Mar 22 21:29:16 h1450662 lvm[1301]: Opened /etc/lvm/.cache RO                                                              
Mar 22 21:29:23 h1450662 lvm[1301]: Loaded persistent filter cache from /etc/lvm/.cache                                    
Mar 22 21:29:23 h1450662 lvm[1301]: Unlocking fd 8                                                                         
Mar 22 21:29:23 h1450662 lvm[1301]: Closed /etc/lvm/.cache                                                                 
Mar 22 21:29:23 h1450662 lvm[1301]: Wiping internal VG cache
Comment 6 Mark 2009-03-22 15:54:28 UTC
Looks like the snapshot invalidation is not handled properly.
Comment 7 Mark 2009-12-22 18:51:16 UTC
hmmm no, also happens with nearly empty snapshot devices...
Comment 9 Mark 2011-01-19 13:00:40 UTC
Also happens with "echo check > .../mdX/md/sync_action"
Comment 10 Mark 2011-01-19 15:19:36 UTC
Update: I have two machines which seem to have totally identical hardware. One triggers this bug, the other not...

The one triggering the bug show UDMA CRC errors for one disk (via smartctl -a).

Are there any mdraid issues related to hardware defects? Improperly handled, specific I/O errors? Can't see anything in logs, though.
Comment 11 Alan 2012-11-20 17:13:03 UTC
Closing as obsolete, if this is still seen with modern kernels please re-open and update