Bug 5297

Summary: Memory problem with kcopyd snapshots and xfs_freeze when LVM2 lvremove runs while data is being copied.
Product: IO/Storage
Reporter: Hubert (hubgor)
Component: LVM2/DM
Assignee: Alasdair G Kergon (agk)
Status: CLOSED INSUFFICIENT_DATA
Severity: high
CC: agk, akpm, bunk
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: kernel 2.6.14-rc2
Subsystem:
Regression: ---
Bisected commit-id:

Description Hubert 2005-09-23 04:04:05 UTC
Most recent kernel where this bug did not occur: 2.6.14-rc1
Distribution: Linux/GNU Debian 3.0 Sarge
Hardware Environment: 2x Pentium Xeon 2.8, 512MB DDR, 3ware 9xxx SATA RAID
controller, 6x 250GB SATA HDDs, RAID0 = 1.36T
Software Environment: libdevmapper 1.01.04 stable or 1.01.05-cvs, lvm 2.01.09
stable or 2.01.15-cvs, samba 3.0.14a

Problem Description:

From dmesg:

device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm-devel@redhat.com
device-mapper: dm-multipath version 1.0.4 loaded
device-mapper: dm-round-robin version 1.0.0 loaded
scsi0 : 3ware 9000 Storage Controller
3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xfe8ffc00, IRQ: 10.
SCSI device sda: 2929557504 512-byte hdwr sectors (1499933 MB)
EXT3 FS on dm-0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
XFS mounting filesystem dm-1
Ending clean XFS mount for filesystem: dm-1
XFS quotacheck dm-1: Please wait.
XFS quotacheck dm-1: Done.
XFS mounting filesystem dm-4
XFS resetting qflags for filesystem dm-4
Ending clean XFS mount for filesystem: dm-4
XFS mounting filesystem dm-6
XFS resetting qflags for filesystem dm-6
Ending clean XFS mount for filesystem: dm-6
------------[ cut here ]------------
kernel BUG at drivers/md/kcopyd.c:145!
invalid operand: 0000 [#1]
Modules linked in: eepro100 e1000 3w_9xxx uhci_hcd ohci_hcd usbhid sd_mod
ftdi_sio usbserial ehci_hcd usbcore
CPU:    0
EIP:    0060:[<c0368f2b>]    Not tainted VLI
EFLAGS: 00010283   (2.6.14-rc1) 
EIP is at client_free_pages+0x3b/0x50
eax: 00000100   ebx: ccdf2e80   ecx: cb8ab8a0   edx: 00000000
esi: e08ad080   edi: 00000000   ebp: 00000000   esp: d2763ec4
ds: 007b   es: 007b   ss: 0068
Process lvremove (pid: 14694, threadinfo=d2762000 task=dc4485a0)
Stack: ccdf2e80 ccdf2e80 c036997e ccdf2e80 c7a0fd80 c036cdb4 ccdf2e80 dd696a40 
       e08ad080 deddb580 c0363dda e08ad080 deddb580 cb8ab6a0 d2762000 080e0c28 
       c03660d9 deddb580 00000000 c048c7e0 e088b000 c0366940 cb8ab6a0 00000000 
Call Trace:
 [<c036997e>] kcopyd_client_destroy+0x1e/0x3a
 [<c036cdb4>] snapshot_dtr+0x74/0x90
 [<c0363dda>] table_destroy+0x8a/0xa0
 [<c03660d9>] __hash_remove+0x79/0xa0
 [<c0366940>] dev_remove+0x50/0xe0
 [<c0368019>] ctl_ioctl+0xf9/0x160
 [<c03668f0>] dev_remove+0x0/0xe0
 [<c0169778>] do_ioctl+0x58/0x80
 [<c0169905>] vfs_ioctl+0x65/0x1e0
 [<c038cc44>] net_rx_action+0x74/0x100
 [<c0169ae7>] sys_ioctl+0x67/0x90
 [<c0102bb9>] syscall_call+0x7/0xb
Code: 10 75 28 8b 43 08 89 04 24 e8 52 ff ff ff c7 43 08 00 00 00 00 c7 43 0c 00
00 00 00 c7 43 10 00 00 00 00 8b 5c 24 04 83 c4 08 c3 <0f> 0b 91 00 b4 93 43 c0
eb ce 8d 74 26 00 8d bc 27 00 00 00 00 
 <1>Unable to handle kernel NULL pointer dereference at virtual address 00000034
 printing eip:
c015c13d
*pde = 15124001
Oops: 0000 [#2]
Modules linked in: eepro100 e1000 3w_9xxx uhci_hcd ohci_hcd usbhid sd_mod
ftdi_sio usbserial ehci_hcd usbcore
CPU:    0
EIP:    0060:[<c015c13d>]    Not tainted VLI
EFLAGS: 00010292   (2.6.14) 
EIP is at bio_add_page+0xd/0x40
eax: 00000000   ebx: 00000010   ecx: 00000000   edx: d5d56500
esi: 00000000   edi: d5d56500   ebp: d4f05ef0   esp: d4f05e2c
ds: 007b   es: 007b   ss: 0068
Process kcopyd (pid: 14496, threadinfo=d4f04000 task=dff7e030)
Stack: 00000001 00000010 00000000 00000000 d4f05ef0 c036881c d5d56500 c1280fc0 
       00001000 00000000 c1280fc0 00001000 00000000 d4f05ef0 00000000 00000000 
       00000001 c0368939 00000001 00000000 d9b118cc d4f05ef0 c1b16c40 c03685b0 
Call Trace:
 [<c036881c>] do_region+0xec/0x140
 [<c0368939>] dispatch_io+0xc9/0xd0
 [<c03685b0>] list_get_page+0x0/0x30
 [<c03685e0>] list_next_page+0x0/0x20
 [<c03690c0>] complete_io+0x0/0x90
 [<c0368ac3>] async_io+0x83/0xe0
 [<c0368c73>] dm_io_async+0x53/0x60
 [<c03690c0>] complete_io+0x0/0x90
 [<c03685b0>] list_get_page+0x0/0x30
 [<c03685e0>] list_next_page+0x0/0x20
 [<c0369198>] run_io_job+0x48/0x90
 [<c03690c0>] complete_io+0x0/0x90
 [<c03692b2>] process_jobs+0x42/0xa0
 [<c0369310>] do_work+0x0/0x50
 [<c0369352>] do_work+0x42/0x50
 [<c0369150>] run_io_job+0x0/0x90
 [<c0125e3d>] worker_thread+0x1ad/0x250
 [<c0113a70>] default_wake_function+0x0/0x20
 [<c0113a70>] default_wake_function+0x0/0x20
 [<c0125c90>] worker_thread+0x0/0x250
 [<c0129d2a>] kthread+0xaa/0xb0
 [<c0129c80>] kthread+0x0/0xb0
 [<c0100fd9>] kernel_thread_helper+0x5/0xc
Code: eb ac 0f b7 86 42 01 00 00 66 39 43 1e 0f 83 6e fe ff ff e9 9b fe ff ff 8d
b6 00 00 00 00 83 ec 14 8b 54 24 18 8b 42 0c 8b 40 50 <8b> 48 34 8b 44 24 24 89
54 24 04 89 0c 24 89 44 24 10 8b 44 24 

Steps to reproduce:

pvcreate /dev/sda
vgcreate vg /dev/sda
lvcreate -L 698G -n lv vg /dev/sda
mkfs.xfs /dev/vg/lv
mount -t xfs -o usrquota,grpquota,noatime,nodiratime /dev/vg/lv /mnt/lv

Set up an SMB share (/mnt/smb).
Start copying data from a remote host to the SMB share and, while the copy is
running, lvcreate / lvremove 2 big snapshots (347G both) with xfs_freeze -f/-u around each command:

xfs_freeze -f /mnt/lv
lvcreate -s -l 11168 -n S1 -p rw /dev/vg/lv &
sleep 7
xfs_freeze -u /mnt/lv
mount -t xfs -o noatime,nodiratime,nouuid,ro /dev/vg/S1 /snapshots/S1

sleep 30

Create a second snapshot (named S2) the same way as S1 above and mount it too.

sleep 3m

xfs_freeze -f /mnt/lv
lvremove /dev/vg/S1 &
sleep 1
xfs_freeze -u /mnt/lv

sleep 30

Remove the second snapshot the same way as S1.

Run 5 iterations of the lvcreate/lvremove cycle for the 2 snapshots, as shown above.
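
For convenience, the whole cycle can be sketched as one script. This is only a sketch of the steps as written above, not a tested reproducer: the helpers run and run_bg, the functions snapshot_create and snapshot_remove, and the DRY_RUN variable are hypothetical names added for illustration, and the snapshot device path /dev/vg/<name> is assumed from the mount command. With the default DRY_RUN=1 it only prints the commands.

```shell
#!/bin/sh
# Sketch of the snapshot create/remove cycle from the steps above.
# Hypothetical names (not from the report): run, run_bg, snapshot_create,
# snapshot_remove, DRY_RUN.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

# Run a command in the foreground, or print it in dry-run mode.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run a command in the background, or print it with a trailing "&".
run_bg() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$* &"
    else
        "$@" &
    fi
}

# Freeze the origin, start lvcreate in the background, thaw, mount read-only.
snapshot_create() {
    run xfs_freeze -f /mnt/lv
    run_bg lvcreate -s -l 11168 -n "$1" -p rw /dev/vg/lv
    run sleep 7
    run xfs_freeze -u /mnt/lv
    run mount -t xfs -o noatime,nodiratime,nouuid,ro "/dev/vg/$1" "/snapshots/$1"
}

# Freeze the origin, start lvremove in the background, thaw.
# Assumption: the snapshot device is /dev/vg/<name>, as in the mount step.
# The steps above show no unmount before lvremove, so none is added here.
snapshot_remove() {
    run xfs_freeze -f /mnt/lv
    run_bg lvremove "/dev/vg/$1"
    run sleep 1
    run xfs_freeze -u /mnt/lv
}

i=1
while [ "$i" -le 5 ]; do
    snapshot_create S1
    run sleep 30
    snapshot_create S2
    run sleep 3m
    snapshot_remove S1
    run sleep 30
    snapshot_remove S2
    i=$((i + 1))
done
```

The copy from the remote host to the SMB share has to be started separately before running this. Setting DRY_RUN=0 executes the commands for real and should only be done as root on a disposable test volume group.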

Thanks for any help

Best Regards,

Hubert
Comment 1 Alasdair G Kergon 2005-09-26 14:21:31 UTC
Do you still see problems if you remove all the xfs_freeze commands?
They shouldn't be necessary (dm handles the freezing itself) and they have
caused trouble in the past.
Comment 2 Hubert 2005-09-27 03:12:54 UTC
Without xfs_freeze -f/-u (or dmsetup suspend/resume), any lvcreate or lvremove
of a snapshot of the XFS LV while data is being copied to the SMB share hangs,
and the lvcreate or lvremove process stays in memory as a dead process. I noticed
that lvcreate and lvremove behave better with xfs_freeze -f/-u than with
dmsetup suspend/resume.

I know that dm handles the freezing itself, but it is still not enough or, sorry
to say, it does it the wrong way.

What is the proper way to use the dmsetup tool with LVM2 snapshot create/remove?

Regards Hubert
Comment 3 Andrew Morton 2007-01-31 01:53:09 UTC
Is this bug still present in current kernels?
Comment 4 Adrian Bunk 2007-03-08 21:25:34 UTC
Please reopen this bug if it's still present with kernel 2.6.20.