Latest working kernel version: unknown
Earliest failing kernel version: 2.6.25.18
Distribution: openSUSE 11.0/11.1, plain kernel.org kernel (2.6.28.8)
Hardware Environment: AMD Opteron HExxxx, 1G-16G RAM, single-, dual- and quad-core, different motherboards
Software Environment: openSUSE 11.0/11.1, Xen and non-Xen

Problem Description:
Heavy I/O *always* leads to a lockup of the stressed md device. The underlying hardware devices are still accessible, and other md devices still work.

Steps to reproduce:
1.) Force a resync of /dev/md1 (MD RAID1 of two SATA disks).
2.) Put some (say 4) LVM2 LVs on top of /dev/md1 and start a backup of at least one of them by dumping a snapshot of it.
3.) Mount the 4 LVs and run "stress -d 1 --hdd-bytes 10G" on them. Wait for the resync speed to converge to zero after some time.

Commands like lvs and lvscan no longer work. The OS on /dev/md0 (also MD RAID1) still works. BTW: the snapshot is on MD RAID0 on the same two physical SATA disks. There are also two swap partitions on these disks (/dev/sdaX and /dev/sdbX, not on dm/md).

The stress processes are in disk-sleep state. Load is very high, although no active processes are shown in top. The md1_resync process is also in disk-sleep state. To me it looks like raise_barrier() in md1_resync and wait_barrier() in the stress processes are blocking each other (lost interrupt??); a simplified model of that handshake follows after the traces below.

# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:         80          0          0          0   IO-APIC-edge      timer
  1:          1          0          0          2   IO-APIC-edge      i8042
  2:          0          0          0          0   XT-PIC-XT         cascade
  4:          0          0        222        496   IO-APIC-edge      serial
  7:          1          0          0          0   IO-APIC-edge
  8:          0          0          0          0   IO-APIC-edge      rtc0
 10:    1499953          0          9        920   IO-APIC-fasteoi   eth0
 11:          0    4017105         48      16716   IO-APIC-fasteoi   sata_nv
 12:          1          0          0          0   IO-APIC-edge
 14:          0          0          0          0   IO-APIC-edge      pata_amd
 15:          0          0          0          0   IO-APIC-edge      pata_amd
NMI:          0          0          0          0   Non-maskable interrupts
LOC:     979008     979128     979088     978976   Local timer interrupts
RES:   19914900     363106   21062568   20373212   Rescheduling interrupts
CAL:        109         83        115         62   Function call interrupts
TLB:       1048       1179       1229       1061   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          1
MIS:          0

# cat /proc/`pidof md1_resync`/status
Name:   md1_resync
State:  D (disk sleep)
Tgid:   3646
Pid:    3646
PPid:   2
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:
Threads:        1
SigQ:   0/72704
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: fffffffffffffeff
SigCgt: 0000000000000100
CapInh: 0000000000000000
CapPrm: ffffffffffffffff
CapEff: fffffffffffffeff
CapBnd: ffffffffffffffff
voluntary_ctxt_switches:        14066877
nonvoluntary_ctxt_switches:     165518

# echo t > /proc/sysrq-trigger
# dmesg
md1_resync    D ffff8802349057b0     0  3646      2
 ffff880231117c00 0000000000000046 ffff880231117b60 ffffffff802674a9
 ffffffff806df000 0000000000000001 ffff8802320321c0 ffff8802320324f8
 0000000131117b90 ffffffff804d3eb2 ffff880235550000 0000000000000000
Call Trace:
 [<ffffffff802674a9>] ? trace_hardirqs_on_caller+0x129/0x190
 [<ffffffff804d3eb2>] ? _spin_unlock_irqrestore+0x42/0x80
 [<ffffffff80231bce>] ? __wake_up+0x4e/0x70
 [<ffffffff8041a67c>] raise_barrier+0xac/0x1c0
 [<ffffffff80239b00>] ? default_wake_function+0x0/0x10
 [<ffffffff8041b669>] sync_request+0x139/0x720
 [<ffffffff8041a5c4>] ? raid1_unplug+0x24/0x30
 [<ffffffff8034620c>] ? blk_unplug+0x5c/0x70
 [<ffffffff80424adb>] md_do_sync+0x5ab/0xbb0
 [<ffffffff8026751d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff804256b4>] md_thread+0x54/0x140
 [<ffffffff8026751d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff804d3eb2>] ? _spin_unlock_irqrestore+0x42/0x80
 [<ffffffff80425660>] ? md_thread+0x0/0x140
 [<ffffffff802568b9>] kthread+0x49/0x80
 [<ffffffff8020d529>] child_rip+0xa/0x11
 [<ffffffff8020ca18>] ? restore_args+0x0/0x30
 [<ffffffff80256870>] ? kthread+0x0/0x80
 [<ffffffff8020d51f>] ? child_rip+0x0/0x11

stress        D ffff8801c607d498     0  4244   4230
 ffff8801c607d488 0000000000000046 ffff8801c607d3e8 ffffffff802674a9
 ffffffff806df000 0000000000000001 ffff8801c6680000 ffff8801c6680338
 00000001c607d418 000000010008345d ffff88023722c380 0000000000000000
Call Trace:
 [<ffffffff802674a9>] ? trace_hardirqs_on_caller+0x129/0x190
 [<ffffffff80231bce>] ? __wake_up+0x4e/0x70
 [<ffffffff8041a87c>] wait_barrier+0xec/0x130
 [<ffffffff80239b00>] ? default_wake_function+0x0/0x10
 [<ffffffff804d3ed5>] ? _spin_unlock_irqrestore+0x65/0x80
 [<ffffffff8041e16c>] make_request+0x5c/0x9a0
 [<ffffffff804d3eb2>] ? _spin_unlock_irqrestore+0x42/0x80
 [<ffffffff8025b406>] ? up_read+0x26/0x30
 [<ffffffff80346ca6>] generic_make_request+0x346/0x450
 [<ffffffff80291786>] ? mempool_alloc+0x56/0x140
 [<ffffffff80291786>] ? mempool_alloc+0x56/0x140
 [<ffffffff80346e62>] submit_bio+0xb2/0x160
 [<ffffffff802ed2ad>] ? bio_alloc_bioset+0x5d/0x100
 [<ffffffff802e7c9e>] submit_bh+0xee/0x130
 [<ffffffff802e936d>] bh_submit_read+0x2d/0x90
 [<ffffffffa0037c7b>] read_block_bitmap+0x9b/0x170 [ext3]
 [<ffffffffa0038fc2>] ext3_new_blocks+0x2c2/0x850 [ext3]
 [<ffffffffa003de65>] ext3_get_blocks_handle+0x345/0xc70 [ext3]
 [<ffffffff80212df9>] ? native_sched_clock+0x39/0x80
 [<ffffffff80264d24>] ? get_lock_stats+0x34/0x80
 [<ffffffffa003e809>] ext3_get_block+0x79/0x130 [ext3]
 [<ffffffff804d3f20>] ? _spin_unlock+0x30/0x60
 [<ffffffff802ea347>] __block_prepare_write+0x227/0x660
 [<ffffffffa003e790>] ? ext3_get_block+0x0/0x130 [ext3]
 [<ffffffff802ea909>] block_write_begin+0x59/0xf0
 [<ffffffffa003d973>] ext3_write_begin+0x123/0x200 [ext3]
 [<ffffffffa003e790>] ? ext3_get_block+0x0/0x130 [ext3]
 [<ffffffff802901fb>] generic_file_buffered_write+0x19b/0x330
 [<ffffffff80290821>] __generic_file_aio_write_nolock+0x261/0x450
 [<ffffffff804d238d>] ? __mutex_unlock_slowpath+0x10d/0x180
 [<ffffffff802913d9>] generic_file_aio_write+0x69/0xd0
 [<ffffffffa003a0f6>] ext3_file_write+0x26/0xc0 [ext3]
 [<ffffffff802c3be1>] do_sync_write+0xf1/0x140
 [<ffffffff80256d40>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff802c467b>] vfs_write+0xcb/0x190
 [<ffffffff802c4830>] sys_write+0x50/0x90
 [<ffffffff8020c3db>] system_call_fastpath+0x16/0x1b
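To make the suspected interaction easier to follow, here is a minimal user-space model of the barrier handshake that raid1's raise_barrier()/wait_barrier() implement. It is a paraphrase of my reading of drivers/md/raid1.c, not the kernel code itself: the counter names (barrier, nr_pending, nr_waiting) follow that file, while the pthread plumbing and the writer/main functions are invented for illustration. The point is that the resync thread and normal writers sleep on shared counters and depend on the other side's wakeups, so a single missed wakeup on either condition would leave both sides in D state, which is what the two backtraces above show.

/*
 * Simplified user-space model of the raid1 resync barrier handshake
 * (paraphrased from my reading of drivers/md/raid1.c; this is NOT
 * the kernel code).  The resync thread raises a barrier and waits
 * for in-flight normal I/O to drain; writers wait for the barrier
 * to drop before counting themselves as pending.
 */
#include <pthread.h>

static pthread_mutex_t resync_lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wait_barrier_cv = PTHREAD_COND_INITIALIZER;
static int barrier;     /* a resync window is active               */
static int nr_pending;  /* normal I/O currently in flight          */
static int nr_waiting;  /* normal I/O queued up behind the barrier */

/* resync side: block new I/O, then wait for pending I/O to finish */
static void raise_barrier(void)
{
    pthread_mutex_lock(&resync_lock);
    while (nr_waiting)                  /* let already-queued writers go first */
        pthread_cond_wait(&wait_barrier_cv, &resync_lock);
    barrier++;
    while (nr_pending)                  /* drain in-flight normal I/O */
        pthread_cond_wait(&wait_barrier_cv, &resync_lock);
    pthread_mutex_unlock(&resync_lock);
}

static void lower_barrier(void)
{
    pthread_mutex_lock(&resync_lock);
    barrier--;
    pthread_cond_broadcast(&wait_barrier_cv);
    pthread_mutex_unlock(&resync_lock);
}

/* writer side: what make_request() does before submitting a bio */
static void wait_barrier(void)
{
    pthread_mutex_lock(&resync_lock);
    if (barrier) {
        nr_waiting++;
        while (barrier)
            pthread_cond_wait(&wait_barrier_cv, &resync_lock);
        nr_waiting--;
        pthread_cond_broadcast(&wait_barrier_cv);   /* resync may wait on nr_waiting */
    }
    nr_pending++;
    pthread_mutex_unlock(&resync_lock);
}

/* writer side: I/O completion */
static void allow_barrier(void)
{
    pthread_mutex_lock(&resync_lock);
    nr_pending--;
    pthread_cond_broadcast(&wait_barrier_cv);
    pthread_mutex_unlock(&resync_lock);
}

static void *writer(void *arg)
{
    (void)arg;
    wait_barrier();                     /* ...normal write I/O would happen here... */
    allow_barrier();
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);
    raise_barrier();                    /* resync window: new writers must wait */
    lower_barrier();
    pthread_join(t, NULL);
    return 0;
}

Compile with gcc -pthread; it runs through the handshake once and exits. It is only meant as a reading aid for the two backtraces above, not as a reproduction of the bug.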
What I/O controllers are in use, and are there any logged entries about I/O errors?
It really is a bunch of different hardware; the only thing the machines have in common is the AMD Opteron architecture. lspci output for the different machines:

# lspci
00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a3)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a3)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:09.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a3)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] HyperTransport Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Miscellaneous Control
00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 10h [Opteron, Athlon64, Sempron] Link Control
01:04.0 VGA compatible controller: XGI Technology Inc. (eXtreme Graphics Innovation) Volari Z7/Z9/Z9s

# lspci
00:00.0 Host bridge: ATI Technologies Inc RS690 Host Bridge
00:01.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (Internal gfx)
00:04.0 PCI bridge: ATI Technologies Inc Device 7914
00:05.0 PCI bridge: ATI Technologies Inc RS690 PCI to PCI Bridge (PCI Express Port 1)
00:12.0 SATA controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
00:13.0 USB Controller: ATI Technologies Inc SB600 USB (OHCI0)
00:13.1 USB Controller: ATI Technologies Inc SB600 USB (OHCI1)
00:13.2 USB Controller: ATI Technologies Inc SB600 USB (OHCI2)
00:13.3 USB Controller: ATI Technologies Inc SB600 USB (OHCI3)
00:13.4 USB Controller: ATI Technologies Inc SB600 USB (OHCI4)
00:13.5 USB Controller: ATI Technologies Inc SB600 USB Controller (EHCI)
00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 14)
00:14.1 IDE interface: ATI Technologies Inc SB600 IDE
00:14.3 ISA bridge: ATI Technologies Inc SB600 PCI to LPC Bridge
00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:05.0 VGA compatible controller: ATI Technologies Inc RS690M [Radeon X1200 Series]
02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
03:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

# lspci
00:01.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge
00:02.0 Host bridge: Broadcom BCM5785 [HT1000] Legacy South Bridge
00:02.2 ISA bridge: Broadcom BCM5785 [HT1000] LPC
00:03.0 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.1 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.2 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:05.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:0d.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge (rev b2)
01:0e.0 RAID bus controller: Broadcom BCM5785 [HT1000] SATA (Native SATA Mode)
02:03.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:03.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)

# lspci
00:00.0 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.1 RAM memory: nVidia Corporation C51 Memory Controller 0 (rev a2)
00:00.2 RAM memory: nVidia Corporation C51 Memory Controller 1 (rev a2)
00:00.3 RAM memory: nVidia Corporation C51 Memory Controller 5 (rev a2)
00:00.4 RAM memory: nVidia Corporation C51 Memory Controller 4 (rev a2)
00:00.5 RAM memory: nVidia Corporation C51 Host Bridge (rev a2)
00:00.6 RAM memory: nVidia Corporation C51 Memory Controller 3 (rev a2)
00:00.7 RAM memory: nVidia Corporation C51 Memory Controller 2 (rev a2)
00:02.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:03.0 PCI bridge: nVidia Corporation C51 PCI Express Bridge (rev a1)
00:05.0 VGA compatible controller: nVidia Corporation C51 [GeForce 6150 LE] (rev a2)
00:09.0 RAM memory: nVidia Corporation MCP51 Host Bridge (rev a2)
00:0a.0 ISA bridge: nVidia Corporation MCP51 LPC Bridge (rev a3)
00:0a.1 SMBus: nVidia Corporation MCP51 SMBus (rev a3)
00:0b.0 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3)
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev f1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev f1)
00:0f.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev f1)
00:10.0 PCI bridge: nVidia Corporation MCP51 PCI Bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
04:07.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05)

# lspci
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:0b.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
01:0d.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5705_2 Gigabit Ethernet (rev 03)
01:0e.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5705_2 Gigabit Ethernet (rev 03)
No, there are no I/O errors, just the expected buffer I/O errors caused by invalidated (full) LVM2 snapshots.
This appears to be a VM deadlock, so as a first guess I'm assigning it to the mm/vm folks.
Enabled some kernel debug options in 2.6.28.8. From /var/log/messages (a reduced user-space illustration of the reported lock cycle follows after the log):

Mar 22 21:24:02 h1450662 lvm[1301]: Snapshot vg--xen--vm--backup--snap is now 95% full.
Mar 22 21:28:44 h1450662 kernel: device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
Mar 22 21:28:44 h1450662 kernel:
Mar 22 21:28:44 h1450662 kernel: =======================================================
Mar 22 21:28:44 h1450662 kernel: [ INFO: possible circular locking dependency detected ]
Mar 22 21:28:44 h1450662 kernel: 2.6.28.8-3.2-default #3
Mar 22 21:28:44 h1450662 kernel: -------------------------------------------------------
Mar 22 21:28:44 h1450662 kernel: pdflush/7307 is trying to acquire lock:
Mar 22 21:28:44 h1450662 kernel:  (ksnaphd){--..}, at: [<ffffffff80256670>] flush_workqueue+0x0/0xb0
Mar 22 21:28:44 h1450662 kernel:
Mar 22 21:28:44 h1450662 kernel: but task is already holding lock:
Mar 22 21:28:44 h1450662 kernel:  (&s->lock){--..}, at: [<ffffffff80466fa7>] __find_pending_exception+0xa7/0x1e0
Mar 22 21:28:44 h1450662 kernel:
Mar 22 21:28:44 h1450662 kernel: which lock already depends on the new lock.
Mar 22 21:28:44 h1450662 kernel:
Mar 22 21:28:44 h1450662 kernel:
Mar 22 21:28:44 h1450662 kernel: the existing dependency chain (in reverse order) is:
Mar 22 21:28:47 h1450662 kernel:
Mar 22 21:28:47 h1450662 kernel: -> #3 (&s->lock){--..}:
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff8026c00a>] __lock_acquire+0x11ea/0x1bc0
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff8026ca71>] lock_acquire+0x91/0xc0
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff80504a63>] down_write+0x43/0x80
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff80467b49>] snapshot_map+0x79/0x320
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff8045f422>] __map_bio+0x42/0x160
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff8046062c>] __split_bio+0x4bc/0x590
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff804609ac>] dm_request+0x1fc/0x2a0
Mar 22 21:28:47 h1450662 kernel:        [<ffffffff80348586>] generic_make_request+0x346/0x450
Mar 22 21:28:45 h1450662 lvm[1301]: Snapshot vg--xen--hwn-lvol--xen--vm--1466362--backup--snap changed state to: Invalid
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff8034872b>] submit_bio+0x9b/0x150
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff802e978e>] submit_bh+0xee/0x130
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff802ed8c1>] block_read_full_page+0x2e1/0x3d0
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff802f1083>] blkdev_readpage+0x13/0x20
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff80299ce2>] __do_page_cache_readahead+0x212/0x290
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff8029a11b>] ondemand_readahead+0x1eb/0x270
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff8029a275>] page_cache_sync_readahead+0x25/0x30
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff8029278c>] generic_file_aio_read+0x43c/0x6b0
Mar 22 21:28:51 h1450662 lvm[1301]: No longer monitoring snapshot vg--xen--hwn-lvol--xen--vm--1466362--backup--snap
Mar 22 21:28:49 h1450662 kernel:        [<ffffffff802c58e1>] do_sync_read+0xf1/0x140
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff802c6448>] vfs_read+0xc8/0x170
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff802c65e0>] sys_read+0x50/0x90
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff8020c3cb>] system_call_fastpath+0x16/0x1b
Mar 22 21:28:53 h1450662 kernel:        [<ffffffffffffffff>] 0xffffffffffffffff
Mar 22 21:28:53 h1450662 kernel:
Mar 22 21:28:53 h1450662 kernel: -> #2 (&md->io_lock){----}:
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff8026c00a>] __lock_acquire+0x11ea/0x1bc0
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff8026ca71>] lock_acquire+0x91/0xc0
Mar 22 21:28:53 h1450662 kernel:        [<ffffffff80504ae6>] down_read+0x46/0x80
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff804607dc>] dm_request+0x2c/0x2a0
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80348586>] generic_make_request+0x346/0x450
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff8034872b>] submit_bio+0x9b/0x150
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80465bff>] dispatch_io+0x1bf/0x1f0
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80465e35>] dm_io+0x185/0x2c0
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80468b54>] do_metadata+0x24/0x30
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80255b7c>] run_workqueue+0xfc/0x240
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff80256080>] worker_thread+0x60/0xb0
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff8025a1e9>] kthread+0x49/0x80
Mar 22 21:28:53 h1450662 lvm[1301]: Unlocking memory
Mar 22 21:28:56 h1450662 kernel:        [<ffffffff8020d519>] child_rip+0xa/0x11
Mar 22 21:29:00 h1450662 kernel:        [<ffffffffffffffff>] 0xffffffffffffffff
Mar 22 21:29:00 h1450662 kernel:
Mar 22 21:29:00 h1450662 kernel: -> #1 (&req.work){--..}:
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff8026c00a>] __lock_acquire+0x11ea/0x1bc0
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff8026ca71>] lock_acquire+0x91/0xc0
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff80255b76>] run_workqueue+0xf6/0x240
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff80256080>] worker_thread+0x60/0xb0
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff8025a1e9>] kthread+0x49/0x80
Mar 22 21:29:00 h1450662 kernel:        [<ffffffff8020d519>] child_rip+0xa/0x11
Mar 22 21:29:05 h1450662 kernel:        [<ffffffffffffffff>] 0xffffffffffffffff
Mar 22 21:29:05 h1450662 kernel:
Mar 22 21:29:05 h1450662 kernel: -> #0 (ksnaphd){--..}:
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff8026c188>] __lock_acquire+0x1368/0x1bc0
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff8026ca71>] lock_acquire+0x91/0xc0
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff802566cd>] flush_workqueue+0x5d/0xb0
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff80468cc3>] chunk_io+0xe3/0x120
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff80468d5c>] write_header+0x5c/0x70
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff80468d84>] persistent_drop+0x14/0x30
Mar 22 21:29:05 h1450662 kernel:        [<ffffffff80466eb5>] __invalidate_snapshot+0x35/0x80
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff804674b5>] origin_map+0x3d5/0x400
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff8045f422>] __map_bio+0x42/0x160
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff8046058e>] __split_bio+0x41e/0x590
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff804609ac>] dm_request+0x1fc/0x2a0
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff80348586>] generic_make_request+0x346/0x450
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff8034872b>] submit_bio+0x9b/0x150
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff802e978e>] submit_bh+0xee/0x130
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff802ecd01>] __block_write_full_page+0x1e1/0x3b0
Mar 22 21:29:08 h1450662 kernel:        [<ffffffff802ecff9>] block_write_full_page+0x129/0x140
Mar 22 21:29:08 h1450662 kernel:        [<ffffffffa0034086>] ext3_ordered_writepage+0xc6/0x190 [ext3]
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff80297be5>] __writepage+0x15/0x50
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff80298827>] write_cache_pages+0x237/0x480
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff80298a92>] generic_writepages+0x22/0x30
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff80298add>] do_writepages+0x3d/0x50
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802e488b>] __writeback_single_inode+0x9b/0x410
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802e51e4>] generic_sync_sb_inodes+0x364/0x440
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802e54a5>] writeback_inodes+0x65/0x100
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802991a0>] background_writeout+0xb0/0xe0
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff802998d0>] pdflush+0x110/0x1f0
Mar 22 21:29:10 h1450662 kernel:        [<ffffffff8025a1e9>] kthread+0x49/0x80
Mar 22 21:29:11 h1450662 kernel:        [<ffffffff8020d519>] child_rip+0xa/0x11
Mar 22 21:29:11 h1450662 kernel:        [<ffffffffffffffff>] 0xffffffffffffffff
Mar 22 21:29:11 h1450662 kernel:
Mar 22 21:29:11 h1450662 kernel: other info that might help us debug this:
Mar 22 21:29:11 h1450662 kernel:
Mar 22 21:29:11 h1450662 kernel: 5 locks held by pdflush/7307:
Mar 22 21:29:11 h1450662 kernel:  #0:  (&type->s_umount_key#13){----}, at: [<ffffffff802e548e>] writeback_inodes+0x4e/0x100
Mar 22 21:29:11 h1450662 kernel:  #1:  (jbd_handle){--..}, at: [<ffffffffa0019e9e>] journal_start+0xce/0x140 [jbd]
Mar 22 21:29:11 h1450662 kernel:  #2:  (&md->io_lock){----}, at: [<ffffffff804607dc>] dm_request+0x2c/0x2a0
Mar 22 21:29:11 h1450662 kernel:  #3:  (&_origins_lock){----}, at: [<ffffffff8046713c>] origin_map+0x5c/0x400
Mar 22 21:29:14 h1450662 kernel:  #4:  (&s->lock){--..}, at: [<ffffffff80466fa7>] __find_pending_exception+0xa7/0x1e0
Mar 22 21:29:14 h1450662 kernel:
Mar 22 21:29:14 h1450662 kernel: stack backtrace:
Mar 22 21:29:14 h1450662 kernel: Pid: 7307, comm: pdflush Not tainted 2.6.28.8-3.2-default #3
Mar 22 21:29:14 h1450662 kernel: Call Trace:
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff8026aa08>] print_circular_bug_tail+0xd8/0xe0
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff8026c188>] __lock_acquire+0x1368/0x1bc0
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff8023acb8>] ? finish_task_switch+0x68/0xf0
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff8023ac50>] ? finish_task_switch+0x0/0xf0
Mar 22 21:29:14 h1450662 kernel:  [<ffffffff80502da0>] ? thread_return+0xa3/0x723
Mar 22 21:29:00 h1450662 lvm[1301]: memlock_count dec to 0
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff8020ca08>] ? restore_args+0x0/0x30
Mar 22 21:29:16 h1450662 lvm[1301]: Dumping persistent device cache to /etc/lvm/.cache
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff8026ca71>] lock_acquire+0x91/0xc0
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff80256670>] ? flush_workqueue+0x0/0xb0
Mar 22 21:29:16 h1450662 lvm[1301]: Locking /etc/lvm/.cache (F_WRLCK, 1)
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff802566cd>] flush_workqueue+0x5d/0xb0
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff80256670>] ? flush_workqueue+0x0/0xb0
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff80468cc3>] chunk_io+0xe3/0x120
Mar 22 21:29:16 h1450662 kernel:  [<ffffffff80468b30>] ? do_metadata+0x0/0x30
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff8026a3a9>] ? trace_hardirqs_on_caller+0x129/0x190
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff80468d5c>] write_header+0x5c/0x70
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff80468d84>] persistent_drop+0x14/0x30
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff80466eb5>] __invalidate_snapshot+0x35/0x80
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff804674b5>] origin_map+0x3d5/0x400
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff802930c6>] ? mempool_alloc+0x56/0x140
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff8045f422>] __map_bio+0x42/0x160
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff802eed7d>] ? bio_alloc_bioset+0x5d/0x100
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff8046058e>] __split_bio+0x41e/0x590
Mar 22 21:29:18 h1450662 kernel:  [<ffffffff804601ab>] ? __split_bio+0x3b/0x590
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff80267c24>] ? get_lock_stats+0x34/0x80
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff804609ac>] dm_request+0x1fc/0x2a0
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff804607d4>] ? dm_request+0x24/0x2a0
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff80348586>] generic_make_request+0x346/0x450
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802930c6>] ? mempool_alloc+0x56/0x140
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802930c6>] ? mempool_alloc+0x56/0x140
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff8034872b>] submit_bio+0x9b/0x150
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802eed7d>] ? bio_alloc_bioset+0x5d/0x100
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802e978e>] submit_bh+0xee/0x130
Mar 22 21:29:21 h1450662 kernel:  [<ffffffff802ecd01>] __block_write_full_page+0x1e1/0x3b0
Mar 22 21:29:22 h1450662 kernel:  [<ffffffffa0035790>] ? ext3_get_block+0x0/0x130 [ext3]
Mar 22 21:29:22 h1450662 kernel:  [<ffffffffa0035790>] ? ext3_get_block+0x0/0x130 [ext3]
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802ecff9>] block_write_full_page+0x129/0x140
Mar 22 21:29:22 h1450662 kernel:  [<ffffffffa0034086>] ext3_ordered_writepage+0xc6/0x190 [ext3]
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80297be5>] __writepage+0x15/0x50
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80298827>] write_cache_pages+0x237/0x480
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80297bd0>] ? __writepage+0x0/0x50
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802e4880>] ? __writeback_single_inode+0x90/0x410
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80298a92>] generic_writepages+0x22/0x30
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80298add>] do_writepages+0x3d/0x50
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802e488b>] __writeback_single_inode+0x9b/0x410
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff80506100>] ? _read_unlock+0x30/0x60
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802e51e4>] generic_sync_sb_inodes+0x364/0x440
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802e54a5>] writeback_inodes+0x65/0x100
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802991a0>] background_writeout+0xb0/0xe0
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802998d0>] pdflush+0x110/0x1f0
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802990f0>] ? background_writeout+0x0/0xe0
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff802997c0>] ? pdflush+0x0/0x1f0
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8025a1e9>] kthread+0x49/0x80
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8020d519>] child_rip+0xa/0x11
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8020ca08>] ? restore_args+0x0/0x30
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8025a1a0>] ? kthread+0x0/0x80
Mar 22 21:29:22 h1450662 kernel:  [<ffffffff8020d50f>] ? child_rip+0x0/0x11
Mar 22 21:29:22 h1450662 kernel: __ratelimit: 7 callbacks suppressed
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683024
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683025
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683026
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683027
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683028
Mar 22 21:29:22 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683029
Mar 22 21:29:23 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683030
Mar 22 21:29:23 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683031
Mar 22 21:29:23 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683032
Mar 22 21:29:23 h1450662 kernel: Buffer I/O error on device dm-3, logical block 1683033
Mar 22 21:29:16 h1450662 lvm[1301]: Opened /etc/lvm/.cache RO
Mar 22 21:29:23 h1450662 lvm[1301]: Loaded persistent filter cache from /etc/lvm/.cache
Mar 22 21:29:23 h1450662 lvm[1301]: Unlocking fd 8
Mar 22 21:29:23 h1450662 lvm[1301]: Closed /etc/lvm/.cache
Mar 22 21:29:23 h1450662 lvm[1301]: Wiping internal VG cache
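If I read the lockdep chain correctly, it boils down to this: pdflush takes &s->lock (via __find_pending_exception), the snapshot-invalidation path then calls chunk_io() -> flush_workqueue(ksnaphd), and the metadata work item it waits for re-enters dm_request(), whose snapshot path can need &s->lock again. Below is a minimal, hypothetical user-space illustration of that general pattern ("flush a worker while holding a lock the worker's job needs"), written with plain pthreads. All names are invented and none of this is actual dm or kernel code; running it simply hangs, which is the point.

/*
 * Hypothetical illustration of the cycle lockdep reports above:
 * one thread holds a lock and then waits for a worker to finish
 * ("flush the workqueue"), while the worker's job needs that same
 * lock.  All names are invented; this is not dm or kernel code.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t s_lock  = PTHREAD_MUTEX_INITIALIZER;  /* plays &s->lock */
static pthread_mutex_t wq_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wq_cv   = PTHREAD_COND_INITIALIZER;
static int work_queued, work_done;

/* plays the ksnaphd worker running do_metadata(): in the real chain its
 * I/O goes back through dm_request()/snapshot_map() and needs &s->lock */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&wq_lock);
    while (!work_queued)
        pthread_cond_wait(&wq_cv, &wq_lock);
    pthread_mutex_unlock(&wq_lock);

    pthread_mutex_lock(&s_lock);        /* blocks forever: the flusher holds it */
    /* ... metadata I/O would happen here ... */
    pthread_mutex_unlock(&s_lock);

    pthread_mutex_lock(&wq_lock);
    work_done = 1;
    pthread_cond_broadcast(&wq_cv);
    pthread_mutex_unlock(&wq_lock);
    return NULL;
}

/* plays flush_workqueue(ksnaphd): queue the work and wait for completion */
static void flush_worker(void)
{
    pthread_mutex_lock(&wq_lock);
    work_queued = 1;
    pthread_cond_broadcast(&wq_cv);
    while (!work_done)
        pthread_cond_wait(&wq_cv, &wq_lock);
    pthread_mutex_unlock(&wq_lock);
}

int main(void)
{
    pthread_t w;
    pthread_create(&w, NULL, worker, NULL);

    /* plays pdflush: __find_pending_exception() took s->lock, then the
     * invalidation path ends up in chunk_io() -> flush_workqueue()    */
    pthread_mutex_lock(&s_lock);
    flush_worker();                     /* never returns: the cycle closes here */
    pthread_mutex_unlock(&s_lock);

    pthread_join(w, NULL);
    puts("not reached");
    return 0;
}

In the real report the cycle is longer (&s->lock -> ksnaphd -> &req.work -> &md->io_lock -> &s->lock), but it collapses to the same two-party wait.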
It looks like the snapshot invalidation is not being handled properly.
Hmm, no: it also happens with nearly empty snapshot devices...
Related: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=584881
Also happens with "echo check > .../mdX/md/sync_action"
Update: I have two machines that seem to have completely identical hardware. One triggers this bug, the other does not... The one triggering the bug shows UDMA CRC errors for one disk (via smartctl -a). Are there any mdraid issues related to hardware defects, e.g. specific I/O errors that are handled improperly? I can't see anything in the logs, though.
Closing as obsolete. If this is still seen with modern kernels, please re-open and update.