Bug 3968
Summary: | Kernel crashing repeatedly and hard when trying to combine LVM2 (RAID5) and DM-CRYPT | ||
---|---|---|---|
Product: | IO/Storage | Reporter: | Georg Greve (greve) |
Component: | LVM2/DM | Assignee: | Alasdair G Kergon (agk) |
Status: | CLOSED INSUFFICIENT_DATA | ||
Severity: | blocking | CC: | bunk, giuseppe, th |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.10 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
OOPS with DEBUG_PAGEALLOC
new OOPS with LVM over raid5 (no dm-crypt)
oops for xfs on /dev/md1 (raid5)
OOPS with DM recompiled with debug active |
Description
Georg Greve
2004-12-30 00:59:40 UTC
Tried to narrow the problem down. Same effect when replacing dm-crypt with cryptoloop. The kernel keeps crashing even when using plain ext3 on raid5. Last message on the remote console:
Assertion failure in journal_start() at fs/jbd/transaction.c:271: "handle->h_transaction->t_journal == journal"

See the comments in http://thread.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/667: the crash seems to be caused by a slab corruption that happens at an earlier time and is very hard to track down. Maybe this bug should be renamed, or closed with a new bug opened that points to this one as a typical manifestation of that bug?

I probably have the same problem, with both XFS and ext3. I have only tried dm-crypt and cannot try cryptoloop. I can reproduce the problem easily with 2.6.10, 2.6.11-rc1 and 2.6.11-rc2. The crash takes various forms:
1. complete crash: no response to ping, no video signal, no response to keyboard
2. partial crash: responds to ping, but cannot log in via SSH
3. kernel oops
In case 2, this is what happens from a remote machine:
$ ssh -v quad
OpenSSH_3.8.1p1 Debian-8.sarge.4, OpenSSL 0.9.7e 25 Oct 2004
debug1: Reading configuration data /home/giuseppe/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to eppesuig3 [192.168.2.50] port 22.
debug1: Connection established.
debug1: identity file /home/giuseppe/.ssh/identity type 0
debug1: identity file /home/giuseppe/.ssh/id_rsa type 1
debug1: identity file /home/giuseppe/.ssh/id_dsa type 2
In case 3:
unable to handle kernel paging request at virtual address 744a46e0
printing eip c0116657
pde =0000...
oops 0000 #1 smp
modules linked in aes_i586 dmcrypt loop xfs i2c_801 uhcihcd ehcihcd e1000 ide idecd cdrom
cpu 2036542274
eip 0060:c0116657 not tainted vli
eflags 00010002
eip is at do_page_fault 0x97/0x5e6
eax dd823000 ebx 0e ecx dd823134 edx 300001
esi c030cdea edi c01164c0 ebp 744a4674 esp dd823114
ds 007b es 7b ss 68
unable to..
virtual 632061e2 eip c0116557 pde 03dc3001

Created attachment 4483 [details]
OOPS with DEBUG_PAGEALLOC
This attachment shows an oops from 2.6.11-rc2 recompiled with
DEBUG_PAGEALLOC.
More tests showed that the problem is present even before mounting the dm-crypted device, i.e., mounting raid1 and raid5 (via LVM) is enough to crash the system.

Created attachment 4492 [details]
new OOPS with LVM over raid5 (no dm-crypt)
This OOPS was created in this way:
1. created 3 raid5 arrays;
2. created one VG using these raids;
3. created one LV from the VG;
4. created an XFS fs on the LV;
5. mounted the LV;
6. ran a stress test that just creates randomly named files of random size
(I can provide the simple shell script later).
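The stress script itself is not attached to this bug. A minimal sketch of such a test, assuming it does nothing more than write randomly named files of random size into the mounted file system, might look like the following. The function name `stress_fs`, its arguments and its defaults are hypothetical and do not come from the reporter.

```shell
#!/bin/sh
# Hypothetical stand-in for the reporter's stress test (the original script
# is not part of this bug). Creates COUNT randomly named files in DIR, each
# between 1 and MAX_KB KiB of random data.
# Usage: stress_fs DIR [COUNT] [MAX_KB]
stress_fs() {
    dir=$1
    count=${2:-100}
    max_kb=${3:-1024}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Random file name: 8 bytes from /dev/urandom, hex-encoded.
        name=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n')
        # Random size: a 16-bit unsigned value reduced to 1..max_kb.
        rnd=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
        size_kb=$(( rnd % max_kb + 1 ))
        dd if=/dev/urandom of="$dir/$name" bs=1024 count="$size_kb" 2>/dev/null || return 1
        i=$(( i + 1 ))
    done
    # Flush dirty pages so the writes actually reach the block layer.
    sync
}
```

Something like `stress_fs /mnt/test 10000 4096` (the mount point is an assumption) run in a loop would keep the array under sustained write load.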
Created attachment 4493 [details]
oops for xfs on /dev/md1 (raid5)
1. create /dev/md1 as one raid5 array using 4 disks
2. create XFS filesystem on this /dev/md1
3. run my random tests
4. crash.
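For anyone trying to repeat this, steps 1 and 2 can be sketched roughly as below. This is only an illustration: the helper names `run` and `setup_raid5_xfs`, the `DRY_RUN` switch and the explicit mount step are assumptions, and the member devices must be supplied by the caller. By default the script only prints the commands, since running them for real destroys the data on the given partitions.

```shell
#!/bin/sh
# Sketch of the /dev/md1 (raid5) + XFS setup. Destructive when executed.
# Commands are only printed unless DRY_RUN is set to 0.
DRY_RUN=${DRY_RUN:-1}

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Usage: setup_raid5_xfs MOUNTPOINT DISK1 DISK2 DISK3 DISK4
setup_raid5_xfs() {
    mnt=$1
    shift
    # 1. One raid5 array built from the four member devices.
    run mdadm --create /dev/md1 --level=5 --raid-devices=4 "$@" || return 1
    # 2. XFS directly on the array (no LVM, no dm-crypt in between).
    run mkfs.xfs /dev/md1 || return 1
    # Mount it so a stress test can write to it.
    run mount /dev/md1 "$mnt" || return 1
}
```

A dry run such as `setup_raid5_xfs /mnt/test /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5` prints the three commands; only /dev/sda5 is named elsewhere in this report as a member of md1, the other device names and the mount point are placeholders.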
It seems that I can reproduce the problem only with XFS. I tried ext2, ext3, jfs and reiserfs; everything seems to work as long as XFS isn't used. So I recompiled the kernel with all XFS debug and trace options enabled. The new trace is:
------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:93!
invalid operand: 0000 [#1]
SMP DEBUG_PAGEALLOC
Modules linked in: xfs uhci_hcd e1000 raid5 xor w83627hf adm1021 i2c_sensor i2c_isa i2c_i801
CPU: 0
EIP: 0060:[<c02ce713>] Not tainted VLI
EFLAGS: 00010202 (2.6.11-rc2-xfs)
EIP is at __release_kernel_lock+0x23/0x40
eax: 00000001 ebx: ea2cc000 ecx: ea2cc000 edx: ea2cc030
esi: f7472f50 edi: 00000000 ebp: ea2cc12c esp: ea2cc12c
ds: 007b es: 007b ss: 0068
Unable to handle kernel paging request at virtual address bebaee6a
printing eip:
c011351b
*pde = 00000000

Some more tests showed that using XFS isn't the only way to crash the machine. I just had a crash this way:
1. reboot
2. format and mount /dev/md1 (raid5) using reiserfs
3. run my stress test
4. umount the file system
5. mdadm --manage /dev/md1 -f /dev/sda5 -r /dev/sda5
The last command gave a segmentation fault.
The OOPS is:
Unable to handle kernel NULL pointer dereference at virtual address 00000018
printing eip:
c0260dc1
*pde = 00000000
Oops: 0000 [#1]
SMP DEBUG_PAGEALLOC
Modules linked in: uhci_hcd e1000 raid5 xor w83627hf adm1021 i2c_sensor i2c_isa i2c_i801
CPU: 0
EIP: 0060:[<c0260dc1>] Not tainted VLI
EFLAGS: 00010246 (2.6.11-rc2-xfs)
EIP is at md_error+0x21/0x90
eax: f7b03888 ebx: c1a4edf8 ecx: f7b03888 edx: 00000000
esi: c1a4edf8 edi: 00000805 ebp: c2a82ea8 esp: c2a82e98
ds: 007b es: 007b ss: 0068
Process mdadm (pid: 7601, threadinfo=c2a82000 task=c4056ad0)
Stack: c1585a60 00000001 c1a4edf8 c1a4edf8 c2a82ebc c0260370 c1a4edf8 f7b03888
       00000000 c2a82f44 c02605fd c1a4edf8 00800005 00000060 c2a82f68 00000000
       00000000 c0162303 bfffde1c c2a82eec 00000060 00000900 00000000 00000000
Call Trace:
 [<c010354f>] show_stack+0x7f/0xa0
 [<c01036ff>] show_registers+0x15f/0x1d0
 [<c0103930>] die+0x100/0x190
 [<c0113785>] do_page_fault+0x335/0x673
 [<c01031bb>] error_code+0x2b/0x30
 [<c0260370>] set_disk_faulty+0x30/0x40
 [<c02605fd>] md_ioctl+0x27d/0x5f0
 [<c0200bcf>] blkdev_ioctl+0x8f/0x410
 [<c016adf3>] do_ioctl+0x63/0xa0
 [<c016b049>] sys_ioctl+0x79/0x200
 [<c0102717>] syscall_call+0x7/0xb
Code: 00 00 00 8d bc 27 00 00 00 00 55 89 e5 83 ec 10 89 5d fc 8b 45 0c 8b 5d 08 85 db 74 51 85 c0 74 19 8b 50 38 85 d2 75 12 8b 53 04 <8b> 4a 18 85 c9 75 0f 90 8d b4 26 00 00 00 00 8b 5d fc 89 ec 5d

Further tests eliminating DM showed that XFS itself is working OK. I just created an XFS file system on /dev/sda5 and it worked perfectly.

I made a new test using DM with a mirror (raid1) /dev/md1 on all four disks and an XFS file system on top. The test ran for four hours without a crash, so I believe that the problem is restricted to raid5 dm.

Created attachment 4495 [details]
OOPS with DM recompiled with debug active
This OOPS was collected during a resync (automatically started at boot) of
the raid5 array.
The raid5 module was compiled with DEBUG active.
I am now trying different kernels. I just found that the standard Debian kernel (kernel-image-2.6.8-2-686-smp) works fine. I cannot check whether vanilla 2.6.8.1 works fine as well.

I made a new test using vanilla 2.6.8.1, and it crashed.

Similar setup here: xfs on lvm2 on raid5 on sata_sil, kernel version 2.6.11.7. It crashes only under high I/O activity.

I think you've been hit by different bugs at different times during this testing. Please retest with a new kernel that incorporates the following patches:
device-mapper: fix deadlocks in core
device-mapper: fix md->lock deadlocks in core
device-mapper: fix deadlocks in core (prep)
bio_clone fix
(e.g. 2.6.13-rc4 plus the first 3 patches taken from -mm)

I'm assuming this issue is already fixed. Please reopen this bug if it's still present in recent 2.6 kernels. |