Bug 3968 - Kernel crashing repeatedly and hard when trying to combine LVM2 (RAID5) and DM-CRYPT
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: i386 Linux
Importance: P2 blocking
Assignee: Alasdair G Kergon
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-12-30 00:59 UTC by Georg Greve
Modified: 2007-11-13 03:53 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.10
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
OOPS with DEBUG_PAGEALLOC (2.10 KB, text/plain)
2005-01-28 10:08 UTC, Giuseppe Sacco
new OOPS with LVM over raid5 (no dm-crypt) (4.44 KB, text/plain)
2005-01-31 00:20 UTC, Giuseppe Sacco
oops for xfs on /dev/md1 (raid5) (1.20 KB, text/plain)
2005-01-31 03:14 UTC, Giuseppe Sacco
OOPS with DM recompiled with debug active (2.09 KB, text/plain)
2005-01-31 21:03 UTC, Giuseppe Sacco

Description Georg Greve 2004-12-30 00:59:40 UTC
Distribution: Debian GNU/Linux (sarge)
Hardware Environment: x86
Software Environment: 
Problem Description:

Have been trying to migrate things to RAID5 and wanted to run ext3 on dm-crypt on
RAID5. When filesystem is mounted, machine is crashing repeatedly and hard.
Not sure where the bug actually lies -- maybe in the interoperation of the
components.

More (crash & system) info available at

 http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/667
 http://marc.theaimsgroup.com/?l=linux-kernel&m=110436747603527&w=2
 (same email)


Steps to reproduce:

Start machine, mount file system, wait.

The machine appears stable as long as the file system is not mounted.
Comment 1 Georg Greve 2004-12-30 08:25:25 UTC
Tried to narrow problems down.

Same effect when replacing dm-crypt by cryptoloop.

Kernel keeps crashing even when using plain ext3 on raid5. Last message on
remote console:

 Assertion failure in journal_start() at fs/jbd/transaction.c:271:
"handle->h_transaction->t_journal == journal"
 
Comment 2 Georg Greve 2004-12-31 01:50:17 UTC
See comments in
http://thread.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/667:

It seems that a very hard-to-track slab corruption issue, occurring at an
earlier time, is responsible for the crash.

Maybe this should be renamed or closed and another bug should be opened pointing
to this one as typical phenomenology of that bug?
Comment 3 Giuseppe Sacco 2005-01-28 07:47:43 UTC
I probably have the same problem, with both XFS and ext3. I have only tried
dm-crypt and cannot try cryptoloop.
I can reproduce the problem easily with 2.6.10, 2.6.11-rc1 and 2.6.11-rc2.

The crash effects vary:
1. complete crash: no response to ping, no video signal, no response to keyboard
2. partial crash: response to ping, cannot login via SSH.
3. kernel oops

In case 2, this is what happens from a remote machine:
$ ssh -v quad
OpenSSH_3.8.1p1 Debian-8.sarge.4, OpenSSL 0.9.7e 25 Oct 2004
debug1: Reading configuration data /home/giuseppe/.ssh/config
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Connecting to eppesuig3 [192.168.2.50] port 22.
debug1: Connection established.
debug1: identity file /home/giuseppe/.ssh/identity type 0
debug1: identity file /home/giuseppe/.ssh/id_rsa type 1
debug1: identity file /home/giuseppe/.ssh/id_dsa type 2

In case 3:
unable to handle kernel paging request at virtual address 744a46e0
printing eip
c0116657
pde =0000...
oops 0000 #1
smp
modules linked in aes_i586 dmcrypt loop xfs i2c_801 uhcihcd ehcihcd e1000 ide
idecd cdrom
cpu 2036542274
eip 0060:c0116657 not tainted vli
eflags 00010002
eip is at do_page_fault  0x97/0x5e6
eax dd823000 ebx 0e ecx dd823134 edx 300001
esi c030cdea edi c01164c0 ebp 744a4674 esp dd823114
ds 007b es 7b ss 68
unable to.. virtual 632061e2
eip
c0116557
pde 03dc3001
Comment 4 Giuseppe Sacco 2005-01-28 10:08:45 UTC
Created attachment 4483
OOPS with DEBUG_PAGEALLOC

This attachment shows an oops from 2.6.11-rc2 recompiled with
DEBUG_PAGEALLOC.
Comment 5 Giuseppe Sacco 2005-01-28 15:39:36 UTC
More tests showed that the problem is present even before mounting the
dm-crypted device, i.e., mounting raid1 and raid5 (via LVM) is enough to crash
the system.
Comment 6 Giuseppe Sacco 2005-01-31 00:20:07 UTC
Created attachment 4492
new OOPS with LVM over raid5 (no dm-crypt)

This OOPS was produced as follows:
1. created 3 raid5 arrays;
2. created one VG using these raids;
3. created one LV from the VG;
4. created an XFS fs on the LV;
5. mounted the LV;
6. ran a stress test that just creates randomly named files of random size
(I can provide the simple shell script later).
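
The stress test in step 6 could look roughly like the following. This is a minimal sketch in Python, not the reporter's actual shell script, which is not included in this report; the function names, file counts and size limits are assumptions chosen for illustration.

```python
import os
import random
import string
import tempfile


def random_name(rng, length=12):
    """Return a random lowercase alphanumeric string (hypothetical helper)."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def stress(target_dir, n_files=100, max_size=1 << 20, seed=None):
    """Create n_files randomly named files of random size in target_dir.

    Returns the total number of bytes written. A counter prefix keeps
    the file names unique even if two random suffixes collide.
    """
    rng = random.Random(seed)
    total = 0
    for i in range(n_files):
        size = rng.randint(0, max_size)
        path = os.path.join(target_dir, "%06d-%s" % (i, random_name(rng)))
        with open(path, "wb") as f:
            f.write(os.urandom(size))
            # Force the data out to the block device so that the
            # file system and the layers below it actually see the I/O.
            f.flush()
            os.fsync(f.fileno())
        total += size
    return total
```

Calling `stress("/mnt/test", n_files=10000)` against the mounted LV would approximate the described workload; the mount point and file count here are hypothetical.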
Comment 7 Giuseppe Sacco 2005-01-31 03:14:27 UTC
Created attachment 4493
oops for xfs on /dev/md1 (raid5)

1. create /dev/md1 as one raid5 array using 4 disks
2. create XFS filesystem on this /dev/md1
3. run my random tests
4. crash.
Comment 8 Giuseppe Sacco 2005-01-31 08:01:19 UTC
It seems that I can reproduce the problem only using XFS. I tried ext2, ext3,
jfs and reiserfs. Everything seems to work if xfs isn't used.

So, I recompiled kernel, activating all XFS debug and trace. The new trace is:

------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:93!
invalid operand: 0000 [#1]
SMP DEBUG_PAGEALLOC
Modules linked in: xfs uhci_hcd e1000 raid5 xor w83627hf adm1021 i2c_sensor
i2c_isa i2c_i801
CPU:    0
EIP:    0060:[<c02ce713>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-rc2-xfs) 
EIP is at __release_kernel_lock+0x23/0x40
eax: 00000001   ebx: ea2cc000   ecx: ea2cc000   edx: ea2cc030
esi: f7472f50   edi: 00000000   ebp: ea2cc12c   esp: ea2cc12c
ds: 007b   es: 007b   ss: 0068
Unable to handle kernel paging request at virtual address bebaee6a
 printing eip:
c011351b
*pde = 00000000
Comment 9 Giuseppe Sacco 2005-01-31 09:07:50 UTC
Some more tests showed that using XFS isn't the only way to crash the machine.
I just had a crash this way:
1. reboot
2. format and mount /dev/md1 (raid5) using reiserfs
3. run my stress test
4. umount the file system
5. mdadm --manage /dev/md1 -f /dev/sda5 -r /dev/sda5

The last command gave a segmentation fault. The OOPS is:
Unable to handle kernel NULL pointer dereference at virtual address 00000018
 printing eip:
c0260dc1
*pde = 00000000
Oops: 0000 [#1]
SMP DEBUG_PAGEALLOC
Modules linked in: uhci_hcd e1000 raid5 xor w83627hf adm1021 i2c_sensor i2c_isa
i2c_i801
CPU:    0
EIP:    0060:[<c0260dc1>]    Not tainted VLI
EFLAGS: 00010246   (2.6.11-rc2-xfs)
EIP is at md_error+0x21/0x90
eax: f7b03888   ebx: c1a4edf8   ecx: f7b03888   edx: 00000000
esi: c1a4edf8   edi: 00000805   ebp: c2a82ea8   esp: c2a82e98
ds: 007b   es: 007b   ss: 0068
Process mdadm (pid: 7601, threadinfo=c2a82000 task=c4056ad0)
Stack: c1585a60 00000001 c1a4edf8 c1a4edf8 c2a82ebc c0260370 c1a4edf8 f7b03888
       00000000 c2a82f44 c02605fd c1a4edf8 00800005 00000060 c2a82f68 00000000
       00000000 c0162303 bfffde1c c2a82eec 00000060 00000900 00000000 00000000
Call Trace:
 [<c010354f>] show_stack+0x7f/0xa0
 [<c01036ff>] show_registers+0x15f/0x1d0
 [<c0103930>] die+0x100/0x190
 [<c0113785>] do_page_fault+0x335/0x673
 [<c01031bb>] error_code+0x2b/0x30
 [<c0260370>] set_disk_faulty+0x30/0x40
 [<c02605fd>] md_ioctl+0x27d/0x5f0
 [<c0200bcf>] blkdev_ioctl+0x8f/0x410
 [<c016adf3>] do_ioctl+0x63/0xa0
 [<c016b049>] sys_ioctl+0x79/0x200
 [<c0102717>] syscall_call+0x7/0xb
Code: 00 00 00 8d bc 27 00 00 00 00 55 89 e5 83 ec 10 89 5d fc 8b 45 0c 8b 5d 08
85 db 74 51 85 c0 74 19 8b 50 38 85 d2 75 12 8b 53 04 <8b> 4a 18 85 c9 75 0f 90
8d b4 26 00 00 00 00 8b 5d fc 89 ec 5d
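
The Code: line above marks the first byte of the faulting instruction with <..>. As a rough illustration of how that relates to the reported fault address, the sketch below decodes only that one instruction form (opcode 8b with a register base and an 8-bit displacement) and combines it with a register value taken from the dump. The helper names are hypothetical, and this is deliberately not a general x86 disassembler.

```python
# 32-bit x86 register numbering as used in the ModRM byte.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def faulting_bytes(code_line):
    """Return the code bytes starting at the <xx>-marked faulting byte."""
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            hex_tokens = [tok[1:-1]] + tokens[i + 1:]
            return bytes(int(t, 16) for t in hex_tokens)
    raise ValueError("no <xx> marker found in code line")


def decode_mov_load_disp8(insn, regs):
    """Decode `8b /r` with mod=01, i.e. mov r32, [base + disp8].

    regs maps register names to their values from the oops dump.
    Returns (dest_reg, base_reg, displacement, effective_address).
    """
    if insn[0] != 0x8B:
        raise ValueError("not an 8b (mov r32, r/m32) instruction")
    modrm = insn[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only [reg + disp8] without a SIB byte is handled")
    # disp8 is a signed 8-bit value.
    disp = insn[2] - 256 if insn[2] >= 128 else insn[2]
    base = REGS[rm]
    address = (regs[base] + disp) & 0xFFFFFFFF
    return REGS[reg], base, disp, address
```

Fed the bytes `<8b> 4a 18` and `edx: 00000000` from the dump, this decodes to a load into ecx from [edx+0x18], giving address 0x18, which is consistent with the reported NULL pointer dereference at virtual address 00000018.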
Comment 10 Giuseppe Sacco 2005-01-31 13:50:25 UTC
Further tests eliminating DM showed that XFS is working ok.
I just created an XFS file system on /dev/sda5 and it worked perfectly.
Comment 11 Giuseppe Sacco 2005-01-31 20:28:14 UTC
I ran a new test using DM with a mirror (raid1) /dev/md1 on all four disks and
an XFS file system on top. The test ran for four hours, so I believe that the
problem is restricted to raid5 dm.
Comment 12 Giuseppe Sacco 2005-01-31 21:03:33 UTC
Created attachment 4495
OOPS with DM recompiled with debug active

This OOPS was collected during a resync (automatically started at boot) of the
raid5 array.
The raid5 module was compiled with DEBUG active.
Comment 13 Giuseppe Sacco 2005-01-31 23:33:59 UTC
I am now trying different kernels. I just found that the standard Debian kernel
(kernel-image-2.6.8-2-686-smp) works fine.
I cannot check whether vanilla 2.6.8.1 works fine as well.
Comment 14 Giuseppe Sacco 2005-02-03 13:06:02 UTC
I ran a new test using vanilla 2.6.8.1, and it crashed.
Comment 15 Tobias Hintze 2005-04-19 00:30:11 UTC
similar setup here:
xfs on lvm2 on raid5 on sata_sil

kernel version 2.6.11.7

crashes only under high I/O activity
Comment 16 Alasdair G Kergon 2005-07-29 06:31:05 UTC
I think you've been hit by different bugs at different times during this testing.

Please retest with a new kernel that incorporates the following patches:

  device-mapper: fix deadlocks in core
  device-mapper: fix md->lock deadlocks in core
  device-mapper: fix deadlocks in core (prep)
  bio_clone fix

(e.g. 2.6.13-rc4 plus the first 3 patches taken from -mm)
Comment 17 Adrian Bunk 2006-02-04 14:11:56 UTC
I'm assuming this issue is already fixed.

Please reopen this bug if it's still present in recent 2.6 kernels.
