Bug 15902 - kcryptd crashes under heavy I/O
Status: CLOSED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: All Linux
Importance: P1 high
Assignee: Alasdair G Kergon
Reported: 2010-05-03 13:28 UTC by Juha Koho
Modified: 2011-02-24 15:24 UTC
CC List: 3 users

Kernel Version: 2.6.32-3-amd64
Regression: No



Description Juha Koho 2010-05-03 13:28:22 UTC
Hello,

I have the following setup: Debian stock kernel version 2.6.32-3-amd64, 6 x 500GB drives with software RAID6 + encryption (LUKS) + LVM. This is a new installation and I'm having trouble with the data encryption. Every once in a while kcryptd crashes and the system becomes unresponsive. Well, it still responds to ping, and currently running applications keep running (as long as they don't need disk access, I suppose), but I'm unable to ssh into the box anymore or start any new applications.

These crashes (always?) happen when there is a lot of I/O going on, i.e. I can reproduce them easily.

I have tested this with the latest stable kernel, version 2.6.33.3, but the problem persists.

Nothing appears in the system logs after the crash, but I was able to capture the following using netconsole:

[ 3295.969539] ------------[ cut here ]------------
[ 3295.969580] kernel BUG at /build/mattems-linux-2.6_2.6.32-9-amd64-NYTFdD/linux-2.6-2.6.32-9/debian/build/source_amd64_none/include/linux/scatterlist.h:63!
[ 3295.969627] invalid opcode: 0000 [#1] SMP 
[ 3295.969654] last sysfs file: /sys/module/nfsd/initstate
[ 3295.969678] CPU 1 
[ 3295.969699] Modules linked in: autofs4 nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 netconsole configfs loop snd_hda_codec_realtek snd_hda_intel psmouse snd_hda_codec snd_hwdep parport_pc snd_pcm edac_core parport asus_atk0110 snd_timer serio_raw edac_mce_amd snd pcspkr evdev i2c_nforce2 soundcore wmi snd_page_alloc i2c_core button processor ext4 mbcache jbd2 crc16 sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 md_mod sd_mod crc_t10dif ata_generic ide_pci_generic ahci ohci_hcd amd74xx floppy ehci_hcd libata forcedeth scsi_mod ide_core usbcore nls_base thermal thermal_sys [last unloaded: scsi_wait_scan]
[ 3295.970250] Pid: 450, comm: kcryptd Not tainted 2.6.32-3-amd64 #1 System Product Name
[ 3295.970287] RIP: 0010:[<ffffffffa0187a5d>]  [<ffffffffa0187a5d>] crypt_convert+0xe9/0x269 [dm_crypt]
[ 3295.970336] RSP: 0018:ffff88012c39fd80  EFLAGS: 00010206
[ 3295.970360] RAX: 0000000000000002 RBX: ffff8800358533c0 RCX: 0000000000000002
[ 3295.970386] RDX: b2901491843e73a7 RSI: 0000000000000000 RDI: 0000000026726b70
[ 3295.970412] RBP: ffff88012b86f330 R08: 0000000000000000 R09: 0000000000000001
[ 3295.970438] R10: ffff88012ba548d0 R11: 0000000000000000 R12: ffff88003585fff0
[ 3295.970464] R13: ffff88012b86f200 R14: ffff88012c351000 R15: 0000000000000000
[ 3295.970492] FS:  00007fb5063f86f0(0000) GS:ffff880005480000(0000) knlGS:0000000000000000
[ 3295.970530] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 3295.970554] CR2: 00007f5f81faf000 CR3: 000000012054c000 CR4: 00000000000006e0
[ 3295.970580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3295.970606] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3295.970633] Process kcryptd (pid: 450, threadinfo ffff88012c39e000, task ffff88012cb0cdb0)
[ 3295.970669] Stack:
[ 3295.970687]  ffff880035853408 0000000300000000 ffff88010000d000 ffff88012b86f338
[ 3295.970776] <0> ffff88012ba54800 ffff8800ceaf83c0 0000000000000000 0000000000000202
[ 3295.970827] <0> ffff880035853390 000000000000df00 0000000000000000 ffffffffa0187fec
[ 3295.970892] Call Trace:
[ 3295.970919]  [<ffffffffa0187fec>] ? kcryptd_crypt+0x40f/0x432 [dm_crypt]
[ 3295.970951]  [<ffffffff8106146b>] ? worker_thread+0x188/0x21d
[ 3295.970979]  [<ffffffffa0187bdd>] ? kcryptd_crypt+0x0/0x432 [dm_crypt]
[ 3295.971008]  [<ffffffff81064a56>] ? autoremove_wake_function+0x0/0x2e
[ 3295.971035]  [<ffffffff810612e3>] ? worker_thread+0x0/0x21d
[ 3295.971060]  [<ffffffff81064789>] ? kthread+0x79/0x81
[ 3295.971085]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[ 3295.971110]  [<ffffffff81064710>] ? kthread+0x0/0x81
[ 3295.971134]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[ 3295.971156] Code: 0c 48 8d 45 08 48 89 5d 00 48 89 c7 48 89 44 24 18 e8 80 8b 00 e1 49 8b 14 24 41 8b 7c 24 0c 8b 73 30 48 8b 4d 08 f6 c2 03 74 04 <0f> 0b eb fe 44 89 f8 4c 8b 7c 24 20 83 e1 03 48 09 ca 48 c1 e0 
[ 3295.971523] RIP  [<ffffffffa0187a5d>] crypt_convert+0xe9/0x269 [dm_crypt]
[ 3295.971556]  RSP <ffff88012c39fd80>
[ 3295.971808] ---[ end trace 568ed39004d975a7 ]---


Regards,
Juha
Comment 1 Milan Broz 2010-05-03 13:35:45 UTC
Can you please paste the output of "dmsetup table" and "cat /proc/mdstat" here?

How can you reproduce it? Is there a particular stress test that fails?

This seems like the same problem as in https://bugzilla.kernel.org/show_bug.cgi?id=15806
Comment 2 Juha Koho 2010-05-03 13:45:57 UTC
dmsetup table:
md1_crypt: 0 3906090488 crypt aes-cbc-essiv:sha256 0000000000000000000000000000000000000000000000000000000000000000 0 9:1 2056
vg0-home: 0 19529728 linear 253:0 3899776
vg0-tmp: 0 9764864 linear 253:0 33194368
vg0-swap: 0 7806976 linear 253:0 52724096
vg0-root: 0 3899392 linear 253:0 384
vg0-usr: 0 9764864 linear 253:0 23429504
vg0-var: 0 9764864 linear 253:0 42959232
vg0-data: 0 3845554176 linear 253:0 60531072
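
For readers unfamiliar with the output, the md1_crypt line above follows the dm-crypt table format. A sketch of the field breakdown (annotation mine, not part of the dmsetup output):

```
# <start> <length> crypt <cipher> <key> <iv_offset> <device> <device offset>
#
#   start         0                     - first sector of the mapped region
#   length        3906090488            - size in 512-byte sectors (~1.8 TiB)
#   cipher        aes-cbc-essiv:sha256  - AES in CBC mode, ESSIV IVs via SHA-256
#   key           00...00               - 64 hex digits (all zeros in the paste,
#                                         presumably sanitized)
#   iv_offset     0
#   device        9:1                   - the backing block device, i.e. md1
#   device offset 2056                  - ciphertext starts at sector 2056 of md1
#                                         (after the LUKS header)
```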

/proc/mdstat:
Personalities : [raid1] [raid6] [raid5] [raid4] 
md1 : active raid6 sda2[0] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1]
      1953046272 blocks level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [=================>...]  resync = 87.1% (425676268/488261568) finish=55.0min speed=18954K/sec
      
md0 : active raid1 sda1[0] sdc1[2](S) sdd1[3](S) sde1[4](S) sdf1[5](S) sdb1[1]
      123840 blocks [2/2] [UU]
      
unused devices: <none>


The array is currently resyncing because of a previous crash (I had to reset the box).

I can reproduce it by doing anything that requires a lot of I/O activity, e.g. copying a large file or set of files is enough. I first noticed the problem when I tried to create an ext4 filesystem on vg0-data, which is nearly 2 TB. This failed every time with the error above. Eventually I managed to create the filesystem by using the -T largefile option (which was better anyway).
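
For reference, the commands involved here (reconstructed from the description, not copied from a shell session; destructive, so do not run them against a device holding data):

```
# Reliably triggered the oops on this setup:
mkfs.ext4 /dev/mapper/vg0-data

# Completed successfully: -T largefile allocates far fewer inodes
# (one per megabyte instead of one per ~16 KB), so mkfs issues
# much less metadata I/O:
mkfs.ext4 -T largefile /dev/mapper/vg0-data
```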

Regards,
Juha
Comment 3 Juha Koho 2010-05-04 11:04:25 UTC
Hello,

I don't know if this helps, but here's another trace:

[11844.038809] ------------[ cut here ]------------
[11844.038847] kernel BUG at /build/mattems-linux-2.6_2.6.32-9-amd64-NYTFdD/linux-2.6-2.6.32-9/debian/build/source_amd64_none/kernel/workqueue.c:287!
[11844.038894] invalid opcode: 0000 [#1] SMP 
[11844.038923] last sysfs file: /sys/module/mbcache/initstate
[11844.038947] CPU 2 
[11844.038969] Modules linked in: ext3 jbd usb_storage autofs4 nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 netconsole configfs loop snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep parport_pc snd_pcm pcspkr parport psmouse snd_timer edac_core edac_mce_amd serio_raw evdev snd asus_atk0110 wmi soundcore i2c_nforce2 snd_page_alloc i2c_core button processor ext4 mbcache jbd2 crc16 sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 md_mod sd_mod crc_t10dif ata_generic ohci_hcd ide_pci_generic ahci libata floppy ehci_hcd usbcore scsi_mod forcedeth nls_base thermal amd74xx ide_core thermal_sys [last unloaded: scsi_wait_scan]
[11844.039468] Pid: 472, comm: kcryptd Not tainted 2.6.32-3-amd64 #1 System Product Name
[11844.039507] RIP: 0010:[<ffffffff8106145a>]  [<ffffffff8106145a>] worker_thread+0x177/0x21d
[11844.039553] RSP: 0018:ffff88012b8c5e40  EFLAGS: 00010206
[11844.039579] RAX: ffffe8ffffc02480 RBX: ffff88012b8c5ef8 RCX: ffff8800358d01b0
[11844.039604] RDX: ffffe8ffffc06a88 RSI: 0000000000000286 RDI: ffffe8ffffc06a80
[11844.039632] RBP: ffffe8ffffc06a80 R08: ffff88012b6bf300 R09: ffff88012a46c2c0
[11844.039657] R10: 0000000000000082 R11: ffffffff81028a54 R12: ffff8800358d6010
[11844.039683] R13: ffff8800358d6018 R14: ffff88012b981c40 R15: ffff88012b981c40
[11844.039710] FS:  00007f5c5a9307b0(0000) GS:ffff880005500000(0000) knlGS:0000000000000000
[11844.039747] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[11844.039771] CR2: 000000000274bfd0 CR3: 000000011047d000 CR4: 00000000000006e0
[11844.039797] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11844.039824] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[11844.039850] Process kcryptd (pid: 472, threadinfo ffff88012b8c4000, task ffff88012b981c40)
[11844.039886] Stack:
[11844.039904]  000000000000f8a0 ffff88012b981ff8 ffff88012b981c40 ffff88012b8c5fd8
[11844.039937] <0> ffff88012b981c40 ffffe8ffffc06a98 ffffe8ffffc06a88 ffffffffa01800dd
[11844.039996] <0> 0000000000000000 ffff88012b981c40 ffffffff81064a56 ffff88012b8c5e98
[11844.040060] Call Trace:
[11844.040092]  [<ffffffffa01800dd>] ? __fscache_register_netfs+0xe7/0x146 [fscache]
[11844.040132]  [<ffffffff81064a56>] ? autoremove_wake_function+0x0/0x2e
[11844.040159]  [<ffffffff810612e3>] ? worker_thread+0x0/0x21d
[11844.040184]  [<ffffffff81064789>] ? kthread+0x79/0x81
[11844.040211]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[11844.040236]  [<ffffffff81064710>] ? kthread+0x0/0x81
[11844.040259]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[11844.040281] Code: 08 48 8b 50 08 48 89 51 08 48 89 0a 48 89 00 48 89 40 08 66 ff 45 00 fb 66 66 90 66 66 90 49 8b 45 f8 48 83 e0 fc 48 39 c5 74 04 <0f> 0b eb fe f0 41 80 65 f8 fe 4c 89 e7 ff 54 24 38 48 8b 44 24 
[11844.040704] RIP  [<ffffffff8106145a>] worker_thread+0x177/0x21d
[11844.040733]  RSP <ffff88012b8c5e40>
[11844.040993] ---[ end trace 4a9f9b2da7ecc61c ]---

Same thing as before. Lots of ongoing I/O.

Regards,
Juha
Comment 4 Juha Koho 2010-05-04 12:13:45 UTC
Hmm... once again. This time a different kind of trace, but related to this problem, I suppose:

[ 3406.061181] BUG: unable to handle kernel paging request at 000000000000ef40
[ 3406.061218] IP: [<ffffffff810fe4be>] find_inode_fast+0x34/0x4c
[ 3406.061253] PGD 36011067 PUD 360a6067 PMD 0 
[ 3406.061286] Oops: 0000 [#1] SMP 
[ 3406.061316] last sysfs file: /sys/module/mbcache/initstate
[ 3406.061343] CPU 3 
[ 3406.061368] Modules linked in: ext3 jbd autofs4 nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 netconsole configfs loop snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore psmouse parport_pc serio_raw pcspkr i2c_nforce2 snd_page_alloc parport evdev edac_core wmi edac_mce_amd asus_atk0110 i2c_core button processor ext4 mbcache jbd2 crc16 sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 md_mod sd_mod crc_t10dif ata_generic usb_storage ohci_hcd ide_pci_generic ahci amd74xx libata ehci_hcd scsi_mod ide_core floppy forcedeth usbcore thermal nls_base thermal_sys [last unloaded: scsi_wait_scan]
[ 3406.061834] Pid: 2808, comm: tar Not tainted 2.6.32-3-amd64 #1 System Product Name
[ 3406.061872] RIP: 0010:[<ffffffff810fe4be>]  [<ffffffff810fe4be>] find_inode_fast+0x34/0x4c
[ 3406.061911] RSP: 0018:ffff880053345be8  EFLAGS: 00010206
[ 3406.061937] RAX: ffff880035855a00 RBX: 0000000000107066 RCX: 0000000000000012
[ 3406.061964] RDX: 000000000000ef00 RSI: ffffc900005d4a50 RDI: 000000000000ef00
[ 3406.061993] RBP: ffffc900005d4a50 R08: 000000000000005c R09: 0000000000000000
[ 3406.062020] R10: 000000d0063d9660 R11: ffffffff81150cae R12: ffff88012dd80400
[ 3406.062048] R13: ffff88012dd80400 R14: 0000000000107066 R15: ffff88012dd80400
[ 3406.062076] FS:  00007fd02bf996f0(0000) GS:ffff880005580000(0000) knlGS:0000000000000000
[ 3406.062114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3406.062140] CR2: 000000000000ef40 CR3: 000000003580a000 CR4: 00000000000006e0
[ 3406.062168] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3406.062196] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3406.062224] Process tar (pid: 2808, threadinfo ffff880053344000, task ffff88012cb09c40)
[ 3406.062262] Stack:
[ 3406.062283]  0000000000107066 ffff88009831a840 ffffc900005d4a50 ffffffff810fe4fe
[ 3406.062319] <0> 0000000000107066 ffffc900005d4a50 ffff88012dd80400 ffffffff810ff1f6
[ 3406.062377] <0> 0000000000107066 ffff88009831a840 ffff880006386880 ffff880053345df8
[ 3406.062455] Call Trace:
[ 3406.062479]  [<ffffffff810fe4fe>] ? ifind_fast+0x28/0x59
[ 3406.062506]  [<ffffffff810ff1f6>] ? iget_locked+0x30/0x126
[ 3406.062542]  [<ffffffffa02058d2>] ? ext4_iget+0x24/0x6da [ext4]
[ 3406.062570]  [<ffffffff81115960>] ? inotify_d_instantiate+0x12/0x39
[ 3406.062605]  [<ffffffffa020e49c>] ? ext4_lookup+0x83/0xe1 [ext4]
[ 3406.062632]  [<ffffffff810f4c33>] ? do_lookup+0xd3/0x15d
[ 3406.062659]  [<ffffffff810f5663>] ? __link_path_walk+0x5a8/0x6d4
[ 3406.062686]  [<ffffffff810f59bd>] ? path_walk+0x66/0xc9
[ 3406.062712]  [<ffffffff810f6dff>] ? do_path_lookup+0x20/0x77
[ 3406.062739]  [<ffffffff810f8229>] ? user_path_at+0x48/0x79
[ 3406.062767]  [<ffffffff811f3ce1>] ? n_tty_write+0x2fa/0x346
[ 3406.062796]  [<ffffffff810f06df>] ? cp_new_stat+0xe9/0xfc
[ 3406.062822]  [<ffffffff810f08a6>] ? vfs_fstatat+0x2c/0x57
[ 3406.062849]  [<ffffffff810f0927>] ? sys_newlstat+0x11/0x30
[ 3406.062877]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
[ 3406.062904] Code: 53 48 89 d3 48 8b 7d 00 eb 1c 4c 39 a7 00 01 00 00 75 10 f6 87 10 02 00 00 70 74 22 e8 3a ff ff ff eb e1 48 89 d7 48 85 ff 74 11 <48> 39 5f 40 48 8b 17 48 89 f8 0f 18 0a 75 e9 eb ce 31 c0 5b 5d 
[ 3406.063143] RIP  [<ffffffff810fe4be>] find_inode_fast+0x34/0x4c
[ 3406.063172]  RSP <ffff880053345be8>
[ 3406.063195] CR2: 000000000000ef40
[ 3406.063513] ---[ end trace 952031b6fa014d94 ]---

This time I was running a system backup script.

Regards,
Juha
Comment 5 Milan Broz 2010-05-04 16:08:30 UTC
These last oopses are ext4 related; there is no dm-crypt in them.

I am not able to reproduce this at all. What is the workload leading to this crash?
Comment 6 Juha Koho 2010-05-04 16:41:34 UTC
The workload was quite normal, or even low. This system is going to be my backup server, and I was copying my data to it over the network (1 Gb connection). The second oops happened while running a system backup script. There was no other activity going on.

I can reproduce these oopses very easily; both of them happened within less than an hour of use. This basically renders the system unusable. What is also interesting is that I had been running this system with an almost identical configuration for over a year with no problems. What I changed now was installing two additional drives, reinstalling the system, and configuring encryption.

Is there anything I could try to get more information about this?
Comment 7 Milan Broz 2010-05-04 16:56:35 UTC
Is the dm-crypt crash reproducible when using an ext3 filesystem? (The crash in comment #4 is apparently an ext4 problem, so I am trying to separate it from the original report.)
Comment 8 Eric Sandeen 2010-05-04 17:12:23 UTC
I'd suggest opening a separate bug for the ext4 problem.
Comment 9 Juha Koho 2010-05-04 17:30:25 UTC
I was a bit afraid you would ask me to try this with an ext3 filesystem. :) But anyway, I tried to make an ext3 fs (mkfs.ext3 /dev/mapper/vg0-data) and this is what happened:

[  450.773968] ------------[ cut here ]------------
[  450.774030] kernel BUG at /build/mattems-linux-2.6_2.6.32-9-amd64-NYTFdD/linux-2.6-2.6.32-9/debian/build/source_amd64_none/drivers/md/dm-crypt.c:574!
[  450.774099] invalid opcode: 0000 [#1] SMP 
[  450.774203] last sysfs file: /sys/module/nfsd/initstate
[  450.774250] CPU 2 
[  450.774322] Modules linked in: autofs4 nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 netconsole configfs loop snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep edac_core parport_pc snd_pcm evdev edac_mce_amd parport pcspkr snd_timer snd soundcore i2c_nforce2 asus_atk0110 wmi i2c_core snd_page_alloc processor button ext4 mbcache jbd2 crc16 sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 md_mod sd_mod crc_t10dif ata_generic ohci_hcd ide_pci_generic ahci ehci_hcd amd74xx forcedeth libata scsi_mod ide_core floppy usbcore nls_base thermal thermal_sys [last unloaded: scsi_wait_scan]
[  450.776677] Pid: 385, comm: md1_raid6 Not tainted 2.6.32-3-amd64 #1 System Product Name
[  450.776738] RIP: 0010:[<ffffffffa018977d>]  [<ffffffffa018977d>] crypt_endio+0x5f/0xf3 [dm_crypt]
[  450.776834] RSP: 0018:ffff88012a8ebca0  EFLAGS: 00010246
[  450.776881] RAX: 0000000000000101 RBX: ffff8800358f3200 RCX: ffffea0000bb75f8
[  450.776930] RDX: ffff88012fa61090 RSI: 0000000000000003 RDI: 0000000000000000
[  450.776979] RBP: ffff880043a0e980 R08: 0000000000000008 R09: ffff880000045180
[  450.777028] R10: 0000000000000002 R11: ffffffff810b4ab0 R12: 0000000000000001
[  450.777077] R13: ffff8800358f3180 R14: 0000000000000000 R15: 0000000000000001
[  450.777127] FS:  00007f35248a7750(0000) GS:ffff880005500000(0000) knlGS:0000000000000000
[  450.777187] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  450.777234] CR2: 000000000145ec08 CR3: 0000000129bd5000 CR4: 00000000000006e0
[  450.777283] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  450.777332] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  450.777356] Process md1_raid6 (pid: 385, threadinfo ffff88012a8ea000, task ffff88012b805bd0)
[  450.777356] Stack:
[  450.777356]  ffff88012c06de00 ffff880035f9dcc0 ffff88012ad18500 ffff88012c06c200
[  450.777356] <0> 0000000000000006 0000000000000000 0000000000030002 ffffffffa0162bf3
[  450.777356] <0> 00000003000155c0 ffff88012c06c350 ffff88012af0df00 0003000000000000
[  450.777356] Call Trace:
[  450.778120]  [<ffffffffa0162bf3>] ? handle_stripe+0xc83/0x1785 [raid456]
[  450.778120]  [<ffffffffa015ef13>] ? __release_stripe+0x165/0x199 [raid456]
[  450.778120]  [<ffffffffa0163a9a>] ? raid5d+0x3a5/0x3ee [raid456]
[  450.778120]  [<ffffffff812ee20c>] ? schedule_timeout+0x2e/0xdd
[  450.778120]  [<ffffffffa0114828>] ? md_thread+0xf1/0x10f [md_mod]
[  450.778120]  [<ffffffff81064a56>] ? autoremove_wake_function+0x0/0x2e
[  450.778120]  [<ffffffffa0114737>] ? md_thread+0x0/0x10f [md_mod]
[  450.778120]  [<ffffffff81064789>] ? kthread+0x79/0x81
[  450.778120]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[  450.778120]  [<ffffffff81064710>] ? kthread+0x0/0x81
[  450.778120]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[  450.778120] Code: e0 01 09 f0 b8 fb ff ff ff 44 0f 44 f0 45 31 e4 41 83 ff 01 74 30 eb 38 44 89 e3 48 c1 e3 04 49 03 5d 48 48 8b 3b 48 85 ff 75 04 <0f> 0b eb fe 48 8b 04 24 41 ff c4 48 8b 70 20 e8 a0 b3 f2 e0 48 
[  450.780002] RIP  [<ffffffffa018977d>] crypt_endio+0x5f/0xf3 [dm_crypt]
[  450.780002]  RSP <ffff88012a8ebca0>
[  450.780770] ---[ end trace 48a2f88ca1403854 ]---

Eventually I was able to create the fs using the -T largefile option (same as previously with ext4). Then I tried to copy about 30 GB of data to the freshly created filesystem, and here is the result:

[  764.358454] invalid opcode: 0000 [#1] SMP 
[  764.358575] last sysfs file: /sys/module/mbcache/initstate
[  764.358622] CPU 2 
[  764.358695] Modules linked in: ext3 jbd usb_storage autofs4 nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc ext2 netconsole configfs loop snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd soundcore parport_pc i2c_nforce2 edac_core edac_mce_amd parport processor asus_atk0110 pcspkr evdev snd_page_alloc wmi i2c_core button ext4 mbcache jbd2 crc16 sha256_generic cryptd aes_x86_64 aes_generic cbc dm_crypt dm_mod raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 md_mod sd_mod crc_t10dif ata_generic ohci_hcd ide_pci_generic ahci libata ehci_hcd amd74xx floppy ide_core scsi_mod forcedeth usbcore nls_base thermal thermal_sys [last unloaded: scsi_wait_scan]
[  764.361001] Pid: 470, comm: kcryptd Not tainted 2.6.32-3-amd64 #1 System Product Name
[  764.361001] RIP: 0010:[<ffffffffa016220a>]  [<ffffffffa016220a>] make_request+0x24b0/0x696 [raid456]
[  764.361001] RSP: 0018:ffff88012c3cfe38  EFLAGS: 00010286
[  764.361001] RAX: ffffffffa016220a RBX: ffff88012c3cfef8 RCX: ffff88012c158430
[  764.361001] RDX: ffffea0003fc4370 RSI: 0000000000000000 RDI: ffff8800358d6000
[  764.361001] RBP: ffffe8ffffc06a80 R08: 0000000000000000 R09: ffff88012afc42c0
[  764.361001] R10: ffff88012380fda8 R11: ffffffffa016220a R12: ffff88012380fe50
[  764.361001] R13: ffff88012380fe58 R14: ffff88012b282350 R15: ffff88012b282350
[  764.361001] FS:  00007f3411af4790(0000) GS:ffff880005500000(0000) knlGS:0000000000000000
[  764.361001] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[  764.361001] CR2: 000000000069b328 CR3: 000000012c259000 CR4: 00000000000006e0
[  764.361001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  764.361001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  764.361001] Process kcryptd (pid: 470, threadinfo ffff88012c3ce000, task ffff88012b282350)
[  764.361001] Stack:
[  764.361001]  ffffffff8106146b 000000000000f8a0 ffff88012b282708 ffff88012b282350
[  764.361001] <0> ffff88012c3cffd8 ffff88012b282350 ffffe8ffffc06a98 ffffe8ffffc06a88
[  764.361001] <0> ffffffffa0183bdd 0000000000000000 ffff88012b282350 ffffffff81064a56
[  764.361001] Call Trace:
[  764.361001]  [<ffffffff8106146b>] ? worker_thread+0x188/0x21d
[  764.361001]  [<ffffffffa0183bdd>] ? kcryptd_crypt+0x0/0x432 [dm_crypt]
[  764.361001]  [<ffffffff81064a56>] ? autoremove_wake_function+0x0/0x2e
[  764.361001]  [<ffffffff810612e3>] ? worker_thread+0x0/0x21d
[  764.361001]  [<ffffffff81064789>] ? kthread+0x79/0x81
[  764.361001]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[  764.361001]  [<ffffffff81064710>] ? kthread+0x0/0x81
[  764.361001]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[  764.361001] Code: 00 00 00 00 00 3d 00 00 00 18 22 15 a0 ff ff ff ff 88 48 17 a0 ff ff ff ff 67 10 16 a0 ff ff ff ff b0 22 16 a0 ff ff ff ff b0 22 <16> a0 ff ff ff ff 43 00 00 00 43 00 00 00 f8 28 16 a0 ff ff ff 
[  764.361001] RIP  [<ffffffffa016220a>] make_request+0x24b0/0x696 [raid456]
[  764.361001]  RSP <ffff88012c3cfe38>
[  764.365099] ---[ end trace 92731f78e1feeadd ]---

So I suppose this problem has nothing to do with the underlying filesystem...
Comment 10 Milan Broz 2010-05-04 19:34:16 UTC
Another different oops, this time in end_io from the MD layer.

Yes, it seems that something is passing around a broken bio structure.

Are you sure the hardware is OK? No memory problems, and memtest passes?

mkfs is very I/O intensive, but it works without problems for me.

I am afraid I cannot move forward here without a reproducible backtrace, ideally from a 2.6.34-rc kernel...

What was the last kernel that worked without crashing?
Comment 11 Juha Koho 2010-05-04 20:49:13 UTC
The hardware should be OK, as it has been working for over a year without problems. Memtest is also OK. Since this is a RAID6 setup, I even tried removing the two drives that had not been used in this system before, but the problems were the same: mkfs crashes, copying crashes...

Now, with those two drives removed and four still connected, the setup is identical to what it was before. The only difference is that the system is a fresh install with encryption enabled.

So the current setup has never worked, but the previous one (without encryption and with only four drives) worked fine with the same kernel version.

Would you like me to test the system with a fresh install without encryption, so we could see whether the encryption really is the problem?
Comment 12 Juha Koho 2010-05-09 12:39:55 UTC
Hello,

this turned out to be a hardware problem after all. With different hardware, everything works just fine.

Regards,
Juha
Comment 13 Milan Broz 2010-05-10 15:25:05 UTC
Thanks. Closing this as a hardware problem.
